Commit Graph

1293 Commits

Author SHA1 Message Date
Avi Kivity
eefb6a0642 Merge 'storage_proxy: node_local_only: always use my_host_id' from Petr Gusev
The previous implementation did not handle topology changes well:
* In `node_local_only` mode with CL=1, if the current node is pending, the CL is increased to 2, causing
`unavailable_exception`.
* If the current tablet is in `write_both_read_old` and we try to read with `node_local_only` on the new node, the replica list will be empty.

This patch changes `node_local_only` mode to always use `my_host_id` as the replica list. An explicit check ensures the current node is a replica for the operation; otherwise `on_internal_error` is called.

backport: not needed, since `node_local_only` is only used in LWT for tablets and it hasn't been released yet.

Closes scylladb/scylladb#25508

* github.com:scylladb/scylladb:
  test_tablets_lwt: add test_lwt_during_migration
  storage_proxy: node_local_only: always use my_host_id
2025-08-20 12:11:44 +03:00
Petr Gusev
ed6bec2cac storage_proxy: node_local_only: always use my_host_id
The previous implementation did not handle topology changes well:
* In node_local_only mode with CL=1, if the current node is pending,
  the CL is raised to 2, causing unavailable_exception.
* If the current tablet is in write_both_read_old and we read with
  node_local_only on the new node, the replica list is empty.

This patch changes node_local_only mode to always use my_host_id as
the replica list. An explicit check ensures the current node is a
replica for the operation; otherwise on_internal_error is called.
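
A minimal sketch of the rule described above, not the actual storage_proxy code. It assumes a plain integer `host_id`, a hypothetical function name `node_local_replicas`, and `std::logic_error` standing in for `on_internal_error`:

```cpp
#include <algorithm>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in; the real host ID is not a plain integer.
using host_id = int;

// Returns the replica list to use in node-local mode: always just the local
// host. Throws if the local host is not among the replicas for the operation
// (the commit calls on_internal_error at this point; std::logic_error is an
// assumption made for this sketch).
std::vector<host_id> node_local_replicas(host_id my_host_id,
                                         const std::vector<host_id>& replicas) {
    const bool is_replica =
        std::find(replicas.begin(), replicas.end(), my_host_id) != replicas.end();
    if (!is_replica) {
        throw std::logic_error("node_local_only: local node is not a replica");
    }
    return {my_host_id};
}
```

Because the result no longer depends on pending or migrating replicas, the list always has exactly one entry.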
2025-08-19 16:11:49 +02:00
Avi Kivity
41475858aa storage_proxy: endpoint_filter(): fix rack count confusion
endpoint_filter() is used by batchlog to select nodes to replicate
to.

It contains an unordered_multimap data structure that maps rack names
to nodes.

It misuses std::unordered_map::bucket_count() to count the number of
racks. While values that share a key in a multimap will definitely
be in the same bucket, it's possible for values that don't share a
key to share a bucket. Therefore bucket_count() undercounts the
number of racks.

Fix this by using a more accurate data structure: a map of a set.

The patch changes validated.bucket_count() to validated.size()
and validated.size() to a new variable nr_validated.

The patch does cause an extra two allocations per rack (one for the
unordered_map node, one for the unordered_set bucket vector), but
this is only used for logged batches, so it is amortized over all
the mutations in the logged batch.
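
The difference between the two counts can be sketched as follows. This is an illustration only: the type alias and function names are hypothetical, and racks and nodes are modelled as plain strings.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical input shape: (rack name, node name) pairs.
using rack_nodes = std::vector<std::pair<std::string, std::string>>;

// Number of hash buckets in a multimap keyed by rack. This reflects the hash
// table's sizing policy, not the number of distinct racks inserted.
std::size_t multimap_bucket_count(const rack_nodes& input) {
    std::unordered_multimap<std::string, std::string> by_rack;
    for (const auto& [rack, node] : input) {
        by_rack.emplace(rack, node);
    }
    return by_rack.bucket_count();
}

// Number of distinct racks, using a map from rack to the set of its nodes.
std::size_t rack_count(const rack_nodes& input) {
    std::unordered_map<std::string, std::unordered_set<std::string>> by_rack;
    for (const auto& [rack, node] : input) {
        by_rack[rack].insert(node);
    }
    return by_rack.size();
}
```

With the map of sets, `size()` is the rack count by construction, while the value returned by `bucket_count()` is implementation-defined.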

Closes scylladb/scylladb#25493
2025-08-19 11:58:39 +03:00
Petr Gusev
8bd936b72c storage_proxy: preserve accept error messages 2025-08-13 13:43:12 +02:00
Petr Gusev
00c25d396f storage_proxy: preserve prepare error message 2025-08-13 13:43:12 +02:00
Petr Gusev
0724fafe47 storage_proxy: fix log message 2025-08-13 13:40:09 +02:00
Petr Gusev
ff89c03c7f exceptions: add constructors that accept explicit error messages
To improve debuggability, we need to propagate original error messages
from Paxos verbs to the user. This change adds constructors that take
an error message directly, enabling better error reporting.

Additionally, functions such as write_timeout_to_read,
write_failure_to_read etc are updated to use these message-based
constructors. These functions are used in storage_proxy::cas to
convert between different error types, and without this change,
they could lose the original error message during conversion.
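
A reduced sketch of the idea, assuming two simplified exception types that carry only a message (the real exception classes have more fields, and the signature shown here is an assumption):

```cpp
#include <stdexcept>
#include <string>

// Hypothetical, simplified exception types.
struct write_timeout_error : std::runtime_error {
    explicit write_timeout_error(const std::string& msg) : std::runtime_error(msg) {}
};

struct read_timeout_error : std::runtime_error {
    // Message-taking constructor: lets a conversion keep the original text.
    explicit read_timeout_error(const std::string& msg) : std::runtime_error(msg) {}
};

// Converts a write timeout into a read timeout without losing the message.
read_timeout_error write_timeout_to_read(const write_timeout_error& e) {
    return read_timeout_error(e.what());
}
```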
2025-08-12 16:31:05 +02:00
Avi Kivity
8164f72f6e Merge 'Separate local_effective_replication_map from vnode_effective_replication_map' from Benny Halevy
Derive both vnode_effective_replication_map
and local_effective_replication_map from
static_effective_replication_map as both are static and per-keyspace.

However, local_effective_replication_map does not need vnodes
for the mapping of all tokens to the local node.

Refs #22733

* No backport required

Closes scylladb/scylladb#25222

* github.com:scylladb/scylladb:
  locator: abstract_replication_strategy: implement local_replication_strategy
  locator: vnode_effective_replication_map: convert clone_data_gently to clone_gently
  locator: abstract_replication_map: rename make_effective_replication_map
  locator: abstract_replication_map: rename calculate_effective_replication_map
  replica: database: keyspace: rename {create,update}_effective_replication_map
  locator: effective_replication_map_factory: rename create_effective_replication_map
  locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et. al
  locator: abstract_replication_strategy: rename global_vnode_effective_replication_map
  keyspace: rename get_vnode_effective_replication_map
  dht: range_streamer: use naked e_r_m pointers
  storage_service: use naked e_r_m pointers
  alternator: ttl: use naked e_r_m pointers
  locator: abstract_replication_strategy: define is_local
2025-08-07 12:51:43 +03:00
Benny Halevy
ec85678de1 locator: abstract_replication_strategy: define is_local
Prepare for specializing the local replication strategy,
local effective replication map, et al. by defining
an is_local() predicate, similar to uses_tablets().

Note that is_vnode_based() still applies to local replication
strategy.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-06 13:34:23 +03:00
Avi Kivity
630b3d31bb storage_proxy: reduce allocations in send_to_live_endpoints()
send_to_live_endpoints() computes sets of endpoints to
which we send mutations - remote endpoints (where we send
to each set as a whole, using forwarding), and local endpoints,
where we send directly. To make handling regular, each local
endpoint is treated as its own set. Thus, each local endpoint
and each datacenter receive one RPC call (or local call if the
coordinator is also a replica).

These sets are maintained in a std::unordered_map (for remote endpoints)
and a vector with the same value_type as the map (for local endpoints).
The key part of the vector payload is initialized to the empty string.

We simplify this by noting that the datacenter name is never used
after this computation, so the vector can hold just the replica sets,
without the fake datacenter name. The downstream variable `all` is
adjusted to point just to the replica set as well.

As a reward for our efforts, the vector's contents become nothrow
move constructible (no string), and we can convert it to a small_vector,
which reduces allocations in the common case of RF<=3.

The reduction in allocations is visible in perf-simple-query --write
results:

```
before 165080.62 tps ( 60.3 allocs/op,  16.0 logallocs/op,  14.2 tasks/op,   53438 insns/op,   26705 cycles/op,        0 errors)

after  164513.83 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.2 tasks/op,   53347 insns/op,   26761 cycles/op,        0 errors)
```

The instruction count reduction is a not very impressive 70/op:

before
```
instructions_per_op:
	mean=   53412.22 standard-deviation=32.12
	median= 53420.53 median-absolute-deviation=20.32
	maximum=53462.23 minimum=53290.06
```

after
```
instructions_per_op:
	mean=   53350.32 standard-deviation=32.38
	median= 53353.71 median-absolute-deviation=13.60
	maximum=53415.20 minimum=53222.24
```

Perhaps the extra code from small_vector defeated some inlining,
which negated some of the gain from the reduced allocations. Perhaps
a build with full profiling will gain it back (my builds were without
pgo).

Closes scylladb/scylladb#25270
2025-08-06 11:28:20 +03:00
Petr Gusev
e120ee6d32 storage_proxy.cc: get_cas_shard: fallback to the primary replica shard
Currently, get_cas_shard uses shard_for_reads to decide which
shard to use for LWT execution—both on replicas and the coordinator.

If the coordinator is not a replica, shard_for_reads returns a default
shard (shard 0). There are at least two problems with this:
* shard 0 can become overloaded, because all LWT
coordinators-but-not-replicas are served on it.
* mismatch with replicas: the default shard doesn't match what
shard_for_reads returns on replicas. This hinders the "same shard for
client and server" RPC level optimization.

In this commit we change get_cas_shard to use a primary replica
shard if the current node is not a replica. This guarantees that all
LWT coordinators for the same tablet will be served on the same shard.
This is important for LWT coordinator locks
(paxos::paxos_state::get_cas_lock). Also, if all tablet replicas on
different nodes live on the same shard, RPC
optimization will make sure that no additional smp::submit_to will
be needed on the server side.
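
The selection rule can be sketched as below. The `replica` struct, the function name `pick_cas_shard`, and the assumption that the primary replica is the first entry of the list are all illustrative, not taken from the actual code:

```cpp
#include <optional>
#include <vector>

// Hypothetical simplified model of a tablet replica: a host and its shard.
struct replica {
    int host;
    unsigned shard;
};

// Picks the shard for LWT execution. If the local host is a replica, use its
// shard; otherwise fall back to the shard of the primary replica (assumed
// here to be the first entry). Returns nullopt for an empty replica list.
std::optional<unsigned> pick_cas_shard(int my_host, const std::vector<replica>& replicas) {
    if (replicas.empty()) {
        return std::nullopt;
    }
    for (const auto& r : replicas) {
        if (r.host == my_host) {
            return r.shard;
        }
    }
    return replicas.front().shard;
}
```

Every non-replica coordinator that sees the same replica list gets the same answer, which is the property the coordinator lock relies on.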

Fixes scylladb/scylladb#20497
2025-07-29 17:07:04 +02:00
Petr Gusev
65c7e36b7c storage_proxy: handle node_local_only in query
In this commit we support the node_local_only flag in the read code
path in storage_proxy.
2025-07-24 19:48:08 +02:00
Petr Gusev
2d747d97b8 storage_proxy: handle node_local_only in mutate
We add the remove_non_local_host_ids() helper, which
will be used in the next commit to support the read
path. HostIdVector concept is introduced to be able
to handle both host_id_vector_replica_set and
host_id_vector_topology_change uniformly.

The storage_proxy_coordinator_mutate_options class
is declared outside of storage_proxy to avoid C++
compiler complaints about default field initializers.
In particular, some storage_proxy methods use this
class for optional parameters with default values,
which is not allowed when the class is defined inside
storage_proxy.
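
A reduced sketch of the arrangement, with hypothetical `_sketch` names and a single field. Declaring the options struct at namespace scope makes it a complete type before the enclosing class uses it in a default argument:

```cpp
// Hypothetical, reduced version of the options container. Declared outside
// the proxy class so that its default member initializers are already usable
// when the proxy's member functions are declared.
struct coordinator_mutate_options_sketch {
    bool node_local_only = false;
};

class proxy_sketch {
public:
    // Default argument built from the options' default member initializers.
    // If the options struct were nested inside proxy_sketch, this default
    // argument is the kind of use that compilers commonly reject.
    bool is_node_local(coordinator_mutate_options_sketch options = {}) const {
        return options.node_local_only;
    }
};
```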
2025-07-24 19:48:08 +02:00
Petr Gusev
4c1aca3927 storage_proxy: add coordinator_mutate_options
In upcoming commits, we want to add a node_local_only flag to both read
and write paths in storage_proxy. This requires passing the flag from
query_processor to the part of storage_proxy where replica selection
decisions are made.

For reads, it's sufficient to add the flag to the existing
coordinator_query_options class. For writes, there is no such options
container, so we introduce coordinator_mutate_options in this commit.

In the future, we may move some of the many mutate() method arguments
into this container to simplify the code.
2025-07-24 19:48:08 +02:00
Petr Gusev
b6ccaffd45 storage_proxy: rename create_write_response_handler -> make_write_response_handler
Most of the create_write_response_handler overloads follow the same
signature pattern to satisfy the sp::mutate_prepare call. The one which
doesn't follow it is invoked by others and is responsible for creating
a concrete handler instance. In this refactoring commit we rename
it to make_write_response_handler to reduce confusion.
2025-07-24 19:48:08 +02:00
Petr Gusev
db946edd1d storage_proxy: simplify mutate_prepare
This is a refactoring commit. We remove extra lambda parameters from
mutate_prepare since the CreateWriteHandler lambda can simply
capture them.

We can't std::move(permit) in another mutate_prepare overload,
because each handler wants its own copy of this permit.
2025-07-24 19:48:08 +02:00
Petr Gusev
ac4bc3f816 paxos_state: lazily create paxos state table
We call paxos_store::ensure_initialized in the beginning of
storage_proxy::cas to create a paxos state table for a user table if
it doesn't exist. When the LWT coordinator sends RPCs to replicas,
some of them may not yet have the paxos schema. In
paxos_store::get_paxos_state_schema we just wait for them to appear,
or throw 'no_such_column_family' if the base table was dropped.
2025-07-24 19:48:08 +02:00
Petr Gusev
6e87a6cdb0 paxos_state: extract state access functions into paxos_store
Introduce paxos_store abstraction to isolate Paxos state access.
Prepares for supporting either system.paxos or a co-located
table as the storage backend.
2025-07-24 16:39:50 +02:00
Gleb Natapov
ab6e328226 storage_proxy: preallocate write response handler hash table
Currently it grows dynamically and triggers oversized allocation
warning. Also it may be hard to find sufficient contiguous memory chunk
after the system runs for a while. This patch pre-allocates enough
memory for ~1M outstanding writes per shard.
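
As an illustration of preallocation in general, assuming the table is a `std::unordered_map` (the key and value types and the function name below are hypothetical):

```cpp
#include <cstddef>
#include <unordered_map>

// Hypothetical handler table keyed by a response ID.
using handler_table = std::unordered_map<unsigned long, int>;

// Builds a table whose buckets are allocated once, up front, for `expected`
// entries, so inserts up to that count do not trigger rehashing.
handler_table make_preallocated_table(std::size_t expected) {
    handler_table table;
    table.reserve(expected);
    return table;
}
```

The single large allocation then happens at startup, when contiguous memory is easiest to find.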

Fixes #24660
Fixes #24217

Closes scylladb/scylladb#25098
2025-07-24 09:46:42 +03:00
Patryk Jędrzejczak
f89ffe491a Merge 'storage_service: cancel all write requests after stopping transports' from Sergey Zolotukhin
When a node shuts down, in storage service, after storage_proxy RPCs are stopped, some write handlers within storage_proxy may still be waiting for background writes to complete. These handlers hold appropriate ERMs to block schema changes before the write finishes. After the RPCs are stopped, these writes cannot receive the replies anymore.

If, at the same time, there are RPC commands executing `barrier_and_drain`, they may get stuck waiting for these ERM holders to finish, potentially blocking node shutdown until the writes time out.

This change introduces cancellation of all outstanding write handlers from storage_service after the storage proxy RPCs were stopped.

Fixes scylladb/scylladb#23665

Backport: since this fixes an issue that frequently causes issues in CI, backport to 2025.1, 2025.2, and 2025.3.

Closes scylladb/scylladb#24714

* https://github.com/scylladb/scylladb:
  storage_service: Cancel all write requests on storage_proxy shutdown
  test: Add test for unfinished writes during shutdown and topology change
2025-07-24 09:46:42 +03:00
Sergey Zolotukhin
e0dc73f52a storage_service: Cancel all write requests on storage_proxy shutdown
During a graceful node shutdown, RPC listeners are stopped in `storage_service::drain_on_shutdown`
as one of the first steps. However, even after RPCs are shut down, some write handlers in
`storage_proxy` may still be waiting for background writes to complete. These handlers retain the ERM.
Since the RPC subsystem is no longer active, replies cannot be received, and if any RPC commands are
concurrently executing `barrier_and_drain`, they may get stuck waiting for those writes. This can block
the messaging server shutdown and delay the entire shutdown process until the write timeout occurs.

This change introduces the cancellation of all outstanding write handlers in `storage_proxy`
during shutdown to prevent unnecessary delays.
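
A toy model of the cancellation step, assuming a registry of handlers keyed by ID where each handler exposes a cancel callback. The class and function names are hypothetical and the real handlers are considerably richer:

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical registry of in-flight write handlers. Each entry stores a
// callback that completes the write with an error.
class write_handler_registry {
    std::unordered_map<unsigned, std::function<void()>> _handlers;
public:
    void add(unsigned id, std::function<void()> on_cancel) {
        _handlers.emplace(id, std::move(on_cancel));
    }
    // Cancels every outstanding handler and drops it, releasing whatever the
    // handler was holding. Returns the number of handlers cancelled.
    std::size_t cancel_all() {
        auto pending = std::exchange(_handlers, {});
        for (auto& entry : pending) {
            entry.second();
        }
        return pending.size();
    }
    std::size_t size() const { return _handlers.size(); }
};

// Usage helper: registers n handlers, cancels them all, and returns how many
// cancel callbacks actually ran.
std::size_t cancel_demo(unsigned n) {
    write_handler_registry registry;
    std::size_t cancelled = 0;
    for (unsigned id = 0; id < n; ++id) {
        registry.add(id, [&cancelled] { ++cancelled; });
    }
    registry.cancel_all();
    return cancelled;
}
```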

Fixes scylladb/scylladb#23665
2025-07-22 15:03:30 +02:00
Sergey Zolotukhin
bc934827bc test: Add test for unfinished writes during shutdown and topology change
This test reproduces an issue where a topology change and an ongoing write query
during query coordinator shutdown can cause the node to get stuck.

When a node receives a write request, it creates a write handler that holds
a copy of the current table's ERM (Effective Replication Map). The ERM ensures
that no topology or schema changes occur while the request is being processed.

After the query coordinator receives the required number of replica write ACKs
to satisfy the consistency level (CL), it sends a reply to the client. However,
the write response handler remains alive until all replicas respond — the remaining
writes are handled in the background.

During shutdown, when all network connections are closed, these responses can no longer
be received. As a result, the write response handler is only destroyed once the write
timeout is reached.

This becomes problematic because the ERM held by the handler blocks topology or schema
change commands from executing. Since shutdown waits for these commands to complete,
this can lead to unnecessary delays in node shutdown and restarts, and occasional
test case failures.

Test for: scylladb/scylladb#23665
2025-07-22 15:03:13 +02:00
Benny Halevy
3feb759943 everywhere: use utils::chunked_vector for list of mutations
Currently, we use std::vector<*mutation> to keep
a list of mutations for processing.
This can lead to large allocation, e.g. when the vector
size is a function of the number of tables.

Use a chunked vector instead to prevent oversized allocations.
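
The idea behind a chunked vector can be sketched as below. This is a minimal illustration with hypothetical names and a fixed chunk size, not the utils::chunked_vector implementation:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Minimal illustration only: fixed-size chunks, append and index access.
template <typename T, std::size_t ChunkSize = 128>
class chunked_vector_sketch {
    std::vector<std::vector<T>> _chunks;
    std::size_t _size = 0;
public:
    void push_back(T value) {
        if (_size % ChunkSize == 0) {
            // Start a new chunk; the element buffers of existing chunks are
            // left where they are, so no single allocation grows unbounded.
            _chunks.emplace_back();
            _chunks.back().reserve(ChunkSize);
        }
        _chunks.back().push_back(std::move(value));
        ++_size;
    }
    const T& operator[](std::size_t i) const {
        return _chunks[i / ChunkSize][i % ChunkSize];
    }
    std::size_t size() const { return _size; }
    std::size_t chunk_count() const { return _chunks.size(); }
};

// Usage helper: appends 0..n-1 and returns the resulting container.
chunked_vector_sketch<int> make_sequence(int n) {
    chunked_vector_sketch<int> v;
    for (int i = 0; i < n; ++i) {
        v.push_back(i);
    }
    return v;
}
```

Each element buffer is capped at `ChunkSize` elements, at the cost of one extra indirection per access.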

`perf-simple-query --smp 1` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (read path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...

89055.97 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   18003 cycles/op,        0 errors)
103372.72 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39380 insns/op,   17300 cycles/op,        0 errors)
98942.27 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39413 insns/op,   17336 cycles/op,        0 errors)
103752.93 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39407 insns/op,   17252 cycles/op,        0 errors)
102516.77 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39403 insns/op,   17288 cycles/op,        0 errors)
throughput:
	mean=   99528.13 standard-deviation=6155.71
	median= 102516.77 median-absolute-deviation=3844.59
	maximum=103752.93 minimum=89055.97
instructions_per_op:
	mean=   39403.99 standard-deviation=14.25
	median= 39406.75 median-absolute-deviation=9.30
	maximum=39416.63 minimum=39380.39
cpu_cycles_per_op:
	mean=   17435.81 standard-deviation=318.24
	median= 17300.40 median-absolute-deviation=147.59
	maximum=18002.53 minimum=17251.75
```

After (read path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
59755.04 tps ( 66.2 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39466 insns/op,   22834 cycles/op,        0 errors)
71854.16 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   17883 cycles/op,        0 errors)
82149.45 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39411 insns/op,   17409 cycles/op,        0 errors)
49640.04 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   19975 cycles/op,        0 errors)
54963.22 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   18235 cycles/op,        0 errors)
throughput:
	mean=   63672.38 standard-deviation=13195.12
	median= 59755.04 median-absolute-deviation=8709.16
	maximum=82149.45 minimum=49640.04
instructions_per_op:
	mean=   39448.38 standard-deviation=31.60
	median= 39466.17 median-absolute-deviation=25.75
	maximum=39474.12 minimum=39411.42
cpu_cycles_per_op:
	mean=   19267.01 standard-deviation=2217.03
	median= 18234.80 median-absolute-deviation=1384.25
	maximum=22834.26 minimum=17408.67
```

`perf-simple-query --smp 1 --write` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (write path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
63736.96 tps ( 59.4 allocs/op,  16.4 logallocs/op,  14.3 tasks/op,   49667 insns/op,   19924 cycles/op,        0 errors)
64109.41 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   49992 insns/op,   20084 cycles/op,        0 errors)
56950.47 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50005 insns/op,   20501 cycles/op,        0 errors)
44858.42 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50014 insns/op,   21947 cycles/op,        0 errors)
28592.87 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50027 insns/op,   27659 cycles/op,        0 errors)
throughput:
	mean=   51649.63 standard-deviation=15059.74
	median= 56950.47 median-absolute-deviation=12087.33
	maximum=64109.41 minimum=28592.87
instructions_per_op:
	mean=   49941.18 standard-deviation=153.76
	median= 50005.24 median-absolute-deviation=73.01
	maximum=50027.07 minimum=49667.05
cpu_cycles_per_op:
	mean=   22023.01 standard-deviation=3249.92
	median= 20500.74 median-absolute-deviation=1938.76
	maximum=27658.75 minimum=19924.32
```

After (write path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
53395.93 tps ( 59.4 allocs/op,  16.5 logallocs/op,  14.3 tasks/op,   50326 insns/op,   21252 cycles/op,        0 errors)
46527.83 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50704 insns/op,   21555 cycles/op,        0 errors)
55846.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50731 insns/op,   21060 cycles/op,        0 errors)
55669.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50735 insns/op,   21521 cycles/op,        0 errors)
52130.17 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50757 insns/op,   21334 cycles/op,        0 errors)
throughput:
	mean=   52713.91 standard-deviation=3795.38
	median= 53395.93 median-absolute-deviation=2955.40
	maximum=55846.30 minimum=46527.83
instructions_per_op:
	mean=   50650.57 standard-deviation=182.46
	median= 50731.38 median-absolute-deviation=84.09
	maximum=50756.62 minimum=50325.87
cpu_cycles_per_op:
	mean=   21344.42 standard-deviation=202.86
	median= 21334.00 median-absolute-deviation=176.37
	maximum=21554.61 minimum=21060.24
```

Fixes #24815

Improvement for rare corner cases. No backport required

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#24919
2025-07-13 19:13:11 +03:00
Michael Litvak
a9b476e057 test: test_batchlog_manager: test batch replay when a node is down
Add a test of the batchlog manager replay loop applying failed batches
while some replica is down.

The test reproduces an issue where the batchlog manager tries to replay
a failed batch, doesn't get a response from some replica, and becomes
stuck.

It verifies that the batchlog manager can eventually recover from this
situation and continue applying failed batches.
2025-07-07 12:23:06 +03:00
Michael Litvak
7150632cf2 batchlog_manager: abort writes on shutdown
On shutdown of batchlog manager, abort all writes of replayed batches
by the batchlog manager.

To achieve this we set the appropriate write_type to BATCH, and on
shutdown cancel all write handlers with this type.
2025-07-07 12:23:06 +03:00
Michael Litvak
fc5ba4a1ea batchlog_manager: create cancellable write response handler
When replaying a batch mutation from the batchlog manager and sending it
to all replicas, create the write response handler as cancellable.

To achieve this we define a new wrapper type for batchlog mutations -
batchlog_replay_mutation, and this allows us to overload
create_write_response_handler for this type. This is similar to how it's
done with hint_wrapper and read_repair_mutation.
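
The wrapper-type technique can be sketched as follows. All `_sketch` types and the `create_handler` name are hypothetical, and treating plain mutations as non-cancellable is an assumption made only to show the contrast:

```cpp
#include <string>

// Hypothetical stand-ins for a mutation and the wrapper described above.
struct mutation_sketch {
    std::string data;
};
struct batchlog_replay_mutation_sketch {
    mutation_sketch m;
};

struct handler_sketch {
    std::string data;
    bool cancellable;
};

// Overload for plain mutations (assumed non-cancellable in this sketch).
handler_sketch create_handler(const mutation_sketch& m) {
    return {m.data, false};
}

// Overload selected by the wrapper type: same payload, cancellable handler.
handler_sketch create_handler(const batchlog_replay_mutation_sketch& w) {
    return {w.m.data, true};
}
```

The caller chooses the behaviour by wrapping the mutation, so no extra flag has to be threaded through the call chain.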
2025-07-07 12:23:06 +03:00
Michael Litvak
8d48b27062 storage_proxy: add write type parameter to mutate_internal
Currently mutate_internal has a boolean parameter `counter_write` that
indicates whether the write is of counter type or not.

We replace it with a more general parameter that indicates the
write type.

It is compatible with the previous behavior - for a counter write, the
type COUNTER is passed, and otherwise a default value will be used
as before.
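
A small sketch of replacing the boolean with a type parameter. The enum below is a hypothetical subset, and using `SIMPLE` as the default is an assumption, since the message only says that a default value is used:

```cpp
// Hypothetical subset of write types; COUNTER and BATCH are named in the
// commit messages, SIMPLE is assumed here as the default.
enum class write_type_sketch { SIMPLE, COUNTER, BATCH };

// The write type is passed directly, with a default for callers that
// previously passed counter_write = false.
write_type_sketch mutate_internal_sketch(write_type_sketch type = write_type_sketch::SIMPLE) {
    return type;
}

// Maps the old boolean onto the new parameter.
write_type_sketch from_counter_flag(bool counter_write) {
    return counter_write ? write_type_sketch::COUNTER : write_type_sketch::SIMPLE;
}
```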
2025-07-07 12:23:06 +03:00
Avi Kivity
60f407bff4 storage_proxy: avoid large allocation when storing batch in system.batchlog
Currently, when computing the mutation to be stored in system.batchlog,
we go through data_value. In turn this goes through `bytes` type
(#24810), so it causes a large contiguous allocation if the batch is
large.

Fix by going through the more primitive, but less contiguous,
atomic_cell API.

Fixes #24809.

Closes scylladb/scylladb#24811
2025-07-04 10:43:05 +03:00
Gleb Natapov
ca7837550d topology coordinator: do not set request_type field for truncation command if topology_global_request_queue feature is not enabled yet
Old nodes do not expect global topology request names to be in
request_type field, so set it only if a cluster is fully upgraded
already.

Closes scylladb/scylladb#24731
2025-07-02 17:09:29 +02:00
Nadav Har'El
e12ff4d3ab Merge 'LWT: use tablet_metadata_guard' from Petr Gusev
This PR is a step towards enabling LWT for tablet-based tables.

It pursues several goals:
* Make it explicit that the tablet can't migrate after the `cas_shard` check in `select_statement/modification_statement`. Currently, `storage_proxy::cas` expects that the client calls it on the correct shard: the one that owns the partition key the LWT is running on. The reasons for that are explained in [this commit](f16e3b0491 (diff-1073ea9ce4c5e00bb6eb614154f523ba7962403a4fe6c8cd877d1c8b73b3f649)) message. The statements check the current shard and invoke `bounce_to_shard` if it's not the right one. However, the erm strong pointer is only captured in `storage_proxy::cas`, and until that moment there is no explicit structure in the code that would prevent ongoing migrations. In this PR we introduce such a structure, `erm_handle`. We create it before the `cas_check` and pass it down to `storage_proxy::cas` and `paxos_response_handler`.
* Another goal of this PR is an optimization: we don't want to hold the erm for the duration of the entire LWT unless it directly affects the current tablet. There is a `tablet_metadata_guard` class which is used for long-running tablet operations. It automatically switches to a new erm if the topology change represented by the new erm doesn't affect the current tablet. We use this class in `erm_handle` if the table uses tablets. Otherwise, `erm_handle` just stores the erm directly.
* Fixes [shard bouncing issue in alternator](https://github.com/scylladb/scylladb/issues/17399)

Backport: not needed (new feature).

Closes scylladb/scylladb#24495

* github.com:scylladb/scylladb:
  LWT: make cas_shard non-optional in sp::cas
  LWT: create cas_shard in select_statement
  LWT: create cas_shard in modification and batch statements
  LWT: create cas_shard in alternator
  LWT: use cas_shard in storage_proxy::cas
  do_query_with_paxos: remove redundant cas_shard check
  storage_proxy: add cas_shard class
  sp::cas_shard: rename to get_cas_shard
  token_metadata_guard: a topology guard for a token
  tablet_metadata_guard: mark as noncopyable and nonmoveable
2025-07-01 11:33:20 +03:00
Petr Gusev
35aba76401 LWT: make cas_shard non-optional in sp::cas
We also make sp::cas_shard function local since it's now
not used directly by sp clients.
2025-06-30 10:37:33 +02:00
Petr Gusev
deb7afbc87 LWT: use cas_shard in storage_proxy::cas
Take cas_shard parameter in sp::cas and pass token_metadata_guard down to paxos_response_handler.

We make cas_shard parameter optional in storage_proxy methods
to make the refactoring easier. The sp::cas method constructs a new
token_metadata_guard if it's not set. All call sites pass null
in this commit, we will add the proper implementation in the next
commits.
2025-06-30 10:33:17 +02:00
Petr Gusev
94f0717a1e do_query_with_paxos: remove redundant cas_shard check
The same check is done in the sp::cas method.
2025-06-30 10:33:17 +02:00
Petr Gusev
43c4de8ad1 storage_proxy: add cas_shard class
The sp::cas method must be called on the correct shard,
as determined by sp::cas_shard. Additionally, there must
be no asynchronous yields between the shard check and
capturing the erm strong pointer in sp::cas. While
this condition currently holds, it's fragile and
easy to break.

To address this, future commits will move the capture of
token_metadata_guard to the call sites of sp::cas, before
performing the shard check.

As a first step, this commit introduces a cas_shard class
that wraps both the target shard and a token_metadata_guard
instance. This ensures the returned shard remains valid for
the given tablet as long as the guard is held.
In the next commits, we’ll pass a cas_shard instance
to sp::cas as a separate parameter.
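
The bundling of shard and guard can be sketched as below. Every `_sketch` type is hypothetical, and the guard is modelled as a shared pointer that keeps a topology snapshot alive, which is a simplification of what the real guard does:

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-in for the topology state that the guard keeps alive.
struct topology_snapshot_sketch {
    unsigned version;
};

// Hypothetical guard: holding it keeps the snapshot alive.
class token_guard_sketch {
    std::shared_ptr<const topology_snapshot_sketch> _snapshot;
public:
    explicit token_guard_sketch(std::shared_ptr<const topology_snapshot_sketch> s)
        : _snapshot(std::move(s)) {}
    unsigned version() const { return _snapshot->version; }
};

// Bundles the chosen shard with the guard, so the shard value travels
// together with the object that keeps it valid.
class cas_shard_sketch {
    token_guard_sketch _guard;
    unsigned _shard;
public:
    cas_shard_sketch(token_guard_sketch guard, unsigned shard)
        : _guard(std::move(guard)), _shard(shard) {}
    unsigned shard() const { return _shard; }
    const token_guard_sketch& guard() const { return _guard; }
};

// Usage helper: number of owners of the snapshot while a cas_shard holds it.
long owners_while_held() {
    auto snapshot = std::make_shared<const topology_snapshot_sketch>(
        topology_snapshot_sketch{7});
    cas_shard_sketch cs{token_guard_sketch{snapshot}, 3};
    return snapshot.use_count();
}
```

Passing one object instead of a bare shard number means the validity of the shard no longer depends on the absence of yields between the check and the capture.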
2025-06-30 10:33:17 +02:00
Gleb Natapov
5f953eb092 storage_proxy: retry paxos repair even if repair write succeeded
After paxos state is repaired in begin_and_repair_paxos we need to
re-check the state regardless of whether the write-back succeeded. This
is how the code worked originally, but it was unintentionally changed
when co-routinized in 61b2e41a23.

Fixes #24630

Closes scylladb/scylladb#24651
2025-06-26 17:06:02 +02:00
Patryk Jędrzejczak
6489308ebc Merge 'Introduce a queue of global topology requests.' from Gleb Natapov
Currently only one global topology request (such as truncate, cdc repair, cleanup and alter table) can be pending. If one is already pending others will be rejected with an error. This is not very user friendly, so this series introduces a queue of global requests which allows queuing many global topology requests simultaneously.
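
A toy model of the change from a single pending slot to a queue. The class name, the string representation of requests, and the FIFO ordering are assumptions made for illustration:

```cpp
#include <cstddef>
#include <deque>
#include <optional>
#include <string>
#include <utility>

// Hypothetical model: requests wait in a queue instead of being rejected
// when another one is already pending.
class global_request_queue_sketch {
    std::deque<std::string> _pending;
public:
    void enqueue(std::string request) {
        _pending.push_back(std::move(request));
    }
    // Takes the oldest pending request, if any.
    std::optional<std::string> next() {
        if (_pending.empty()) {
            return std::nullopt;
        }
        std::string request = std::move(_pending.front());
        _pending.pop_front();
        return request;
    }
    std::size_t size() const { return _pending.size(); }
};

// Usage helper: enqueues three requests and returns them in processing order.
std::string drain_order() {
    global_request_queue_sketch queue;
    queue.enqueue("truncate");
    queue.enqueue("cleanup");
    queue.enqueue("alter table");
    std::string order;
    while (auto request = queue.next()) {
        if (!order.empty()) {
            order += ",";
        }
        order += *request;
    }
    return order;
}
```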

Fixes: #16822

No need to backport since this is a new feature.

Closes scylladb/scylladb#24293

* https://github.com/scylladb/scylladb:
  topology coordinator: simplify truncate handling in case request queue feature is disable
  topology coordinator: fix indentation after the previous patch
  topology coordinator: allow running multiple global commands in parallel
  topology coordinator: Implement global topology request queue
  topology coordinator: Do not cancel global requests in cancel_all_requests
  topology coordinator: store request type for each global command
  topology request: make it possible to hold global request types in request_type field
  topology coordinator: move alter table global request parameters into topology_request table
  topology coordinator: move cleanup global command to report completion through topology_request table
  topology coordinator: no need to create updates vector explicitly
  topology coordinator: use topology_request_tracking_mutation_builder::done() instead of open code it
  topology coordinator: handle error during new_cdc_generation command processing
  topology coordinator: remove unneeded semicolon
  topology coordinator: fix indentation after the last commit
  topology coordinator: move new_cdc_generation topology request to use topology_request table for completion
  gms/feature_service: add TOPOLOGY_GLOBAL_REQUEST_QUEUE feature flag
2025-06-23 16:08:09 +03:00
Asias He
c5a136c3b5 storage_service: Use utils::chunked_vector to avoid big allocation
The following was seen:

```
!WARNING | scylla[6057]:  [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
[Backtrace #0]
void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99
seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136
seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169
seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848
seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911
operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706
std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196
 (inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515
 (inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380
 (inlined by) void std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596
locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
 (inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80
seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635
std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684
```

Fix by using chunked_vector.
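
As a rough illustration of why a chunked container avoids the oversized
allocation in the backtrace above, here is a minimal sketch. It is not
ScyllaDB's utils::chunked_vector; toy_chunked_vector, filled_with_ints and
the ChunkBytes parameter are hypothetical names made up for this example.
The idea is that elements live in fixed-size chunks, so no single
allocation for element storage exceeds the chunk size, whereas std::vector
reallocates one contiguous buffer that grows with the element count.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Toy illustration only; NOT ScyllaDB's utils::chunked_vector.
template <typename T, std::size_t ChunkBytes = 128 * 1024>
class toy_chunked_vector {
    // Number of elements that fit in one chunk (at least one).
    static constexpr std::size_t items_per_chunk =
        ChunkBytes / sizeof(T) > 0 ? ChunkBytes / sizeof(T) : 1;
    // Only this small array of pointers is contiguous.
    std::vector<std::unique_ptr<std::vector<T>>> _chunks;
    std::size_t _size = 0;
public:
    void push_back(const T& v) {
        if (_size % items_per_chunk == 0) {
            // Current chunk is full (or none exists): start a new one.
            auto chunk = std::make_unique<std::vector<T>>();
            chunk->reserve(items_per_chunk);  // one bounded allocation per chunk
            _chunks.push_back(std::move(chunk));
        }
        _chunks.back()->push_back(v);
        ++_size;
    }
    const T& operator[](std::size_t i) const {
        return (*_chunks[i / items_per_chunk])[i % items_per_chunk];
    }
    std::size_t size() const { return _size; }
    std::size_t chunk_count() const { return _chunks.size(); }
    // Upper bound on the element storage requested by a single chunk.
    static constexpr std::size_t max_chunk_bytes() {
        return items_per_chunk * sizeof(T);
    }
};

// Hypothetical helper: fill a vector that uses tiny 64-byte chunks.
toy_chunked_vector<int, 64> filled_with_ints(int n) {
    toy_chunked_vector<int, 64> v;
    for (int i = 0; i < n; ++i) {
        v.push_back(i);
    }
    return v;
}
```

With 4-byte ints and 64-byte chunks, filled_with_ints(100) spreads its
elements over several small chunks instead of one growing buffer. The
trade-off is an extra indirection on every element access.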

Fixes #24158

Closes scylladb/scylladb#24561
2025-06-19 16:51:01 +03:00
Petr Gusev
aa970bf2e4 sp::cas_shard: rename to get_cas_shard
We intend to introduce a separate cas_shard
class in the next commits. We rename the existing
function here to avoid conflicts.
2025-06-18 11:51:48 +02:00
Tomasz Grabiec
cdb1499898 Merge 'interval: reduce memory footprint' from Avi Kivity
The interval class's memory footprint isn't important for single objects,
but intervals are frequently held in moderately sized collections. In #3335 this
caused a stall. Therefore, reduce interval's memory footprint to reduce
allocation pressure.

This series does this by consolidating badly-padded booleans in the object tree
spanned by interval into 5 booleans that are consecutive in memory. This
reduces the space required by these booleans from 40 bytes to 8 bytes.
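
The effect can be sketched with simplified stand-in types. These are
hypothetical and do not reproduce the real interval layout: each nested
level carries one bool next to an 8-byte-aligned member, so each bool is
followed by padding, while the flattened variant keeps all five bools
adjacent.

```cpp
#include <cstdint>

// Hypothetical stand-ins; NOT the actual ScyllaDB interval layout.
struct padded_bound {
    std::int64_t value;
    bool inclusive;       // typically followed by 7 bytes of padding
};

struct padded_optional_bound {
    padded_bound bound;
    bool engaged;         // typically followed by 7 bytes of padding
};

struct padded_interval {
    padded_optional_bound start;  // contains 2 padded bools
    padded_optional_bound end;    // contains 2 padded bools
    bool singular;                // the fifth padded bool
};

// Same information, with the five bools consecutive in memory.
struct compact_interval {
    std::int64_t start_value;
    std::int64_t end_value;
    bool start_engaged;
    bool start_inclusive;
    bool end_engaged;
    bool end_inclusive;
    bool singular;        // 5 bools + 3 bytes of tail padding on x86-64
};

// Exact sizes are ABI-dependent, so only the relation is checked here.
static_assert(sizeof(compact_interval) < sizeof(padded_interval),
              "consolidating the bools should shrink the object");
```

On a typical x86-64 ABI the padded variant spends 8 bytes per bool
(40 bytes in total) and the compact one spends 8 bytes for all five,
matching the figures quoted above. Other ABIs may give different numbers.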

perf-simple-query report (with refresh-pgo-profiles.sh for each measurement):

before:

252127.60 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37128 insns/op,   18147 cycles/op,        0 errors)
INFO  2025-06-07 21:00:34,010 [shard 0:main] group0_tombstone_gc_handler - Setting reconcile time to   1749319231 (min id=4dbed2f4-43c9-11f0-cbc6-87d1a08b4ca4)
246492.37 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37153 insns/op,   18411 cycles/op,        0 errors)
253633.11 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37127 insns/op,   17941 cycles/op,        0 errors)
254029.93 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37155 insns/op,   17951 cycles/op,        0 errors)
254465.76 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37123 insns/op,   17906 cycles/op,        0 errors)
throughput:
	mean=   252149.75 standard-deviation=3282.75
	median= 253633.11 median-absolute-deviation=1880.17
	maximum=254465.76 minimum=246492.37
instructions_per_op:
	mean=   37137.24 standard-deviation=15.71
	median= 37127.54 median-absolute-deviation=14.45
	maximum=37155.24 minimum=37122.79
cpu_cycles_per_op:
	mean=   18071.19 standard-deviation=212.25
	median= 17950.62 median-absolute-deviation=130.10
	maximum=18411.50 minimum=17906.13

after:

252561.26 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37039 insns/op,   18075 cycles/op,        0 errors)
256876.44 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37022 insns/op,   17785 cycles/op,        0 errors)
257084.38 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37030 insns/op,   17840 cycles/op,        0 errors)
257305.35 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37042 insns/op,   17804 cycles/op,        0 errors)
258088.53 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37028 insns/op,   17778 cycles/op,        0 errors)
throughput:
	mean=   256383.19 standard-deviation=2185.22
	median= 257084.38 median-absolute-deviation=922.16
	maximum=258088.53 minimum=252561.26
instructions_per_op:
	mean=   37032.17 standard-deviation=8.06
	median= 37030.46 median-absolute-deviation=6.44
	maximum=37041.83 minimum=37021.93
cpu_cycles_per_op:
	mean=   17856.60 standard-deviation=124.70
	median= 17804.16 median-absolute-deviation=71.24
	maximum=18075.50 minimum=17777.95

A small improvement is observed in instructions_per_op. It could be random fluctuations in the compiler performance, or maybe the default constructor/destructor of interval are meaningful even in this simple test.

Small performance improvement, so not a backport candidate.

Closes scylladb/scylladb#24232

* github.com:scylladb/scylladb:
  interval: reduce sizeof
  interval: change start()/end() not to return references to data members
  interval: rename start_ref() back to start() (and end_ref() etc).
  interval: rename start() to start_ref() (and end() etc).
  test: wrapping_interval_test: add more tests for intervals
2025-06-16 09:23:56 +02:00
Avi Kivity
16fb68bb5e interval: rename start_ref() back to start() (and end_ref() etc).
To reduce noise, rename start_ref() back to its original name start(),
after it was changed in the previous patch to force an audit of all calls.
2025-06-14 21:26:16 +03:00
Avi Kivity
3363bc41e2 interval: rename start() to start_ref() (and end() etc).
We are about to change start() to return a proxy object rather
than a `const interval_bound<T>&`. This is generally transparent,
except in one case: `auto x = i.start()`. With the current implementation,
we'll copy the object referred to and assign it to x. With the planned
implementation, the proxy object will be assigned to `x`, but it
will keep referring to `i`.

To prevent such problems, rename start() to start_ref() and end()
to end_ref(). This forces us to audit all calls, and redirect calls
that will break to new start_copy() and end_copy() methods.
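
The hazard can be sketched with simplified types. interval_v1,
interval_v2, bound_proxy and set_start() are hypothetical names for this
illustration and are not the real interval API; only start() and
start_copy() are named in the message above.

```cpp
struct bound {
    int value;
};

// Current style: start() returns a const reference to a data member.
class interval_v1 {
    bound _start;
public:
    explicit interval_v1(int s) : _start{s} {}
    const bound& start() const { return _start; }
    void set_start(int s) { _start.value = s; }
};

// Planned style: start() returns a proxy that refers back to the interval.
class interval_v2 {
    int _start_value;
public:
    class bound_proxy {
        const interval_v2* _owner;
    public:
        explicit bound_proxy(const interval_v2& owner) : _owner(&owner) {}
        int value() const { return _owner->_start_value; }
    };
    explicit interval_v2(int s) : _start_value(s) {}
    bound_proxy start() const { return bound_proxy(*this); }
    bound start_copy() const { return bound{_start_value}; }
    void set_start(int s) { _start_value = s; }
};

int snapshot_with_reference_api() {
    interval_v1 i(1);
    auto x = i.start();       // auto drops the reference: x is a copy
    i.set_start(2);
    return x.value;           // still sees the old value
}

int snapshot_with_proxy_api() {
    interval_v2 i(1);
    auto x = i.start();       // x is a proxy that keeps referring to i
    i.set_start(2);
    return x.value();         // observes the later modification
}

int snapshot_with_copy_api() {
    interval_v2 i(1);
    auto x = i.start_copy();  // explicit copy restores the old behavior
    i.set_start(2);
    return x.value;
}
```

The same `auto x = i.start()` line silently changes meaning between the
two styles, which is why the rename forces every call site to be
reviewed.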
2025-06-14 21:26:16 +03:00
Gleb Natapov
c00a0554e0 topology coordinator: simplify truncate handling in case request queue feature is disable
After allowing multiple commands to run in parallel, the code that
handles multiple truncates to the same table can be simplified, since
it is now executed only if the request queue feature is disabled, so it
does not need to handle the case where a request may be in the queue.
2025-06-11 11:29:33 +03:00
Gleb Natapov
01dd4b7f30 topology coordinator: fix indentation after the previous patch 2025-06-11 11:29:33 +03:00
Gleb Natapov
a9e99d1d3c topology coordinator: allow running multiple global commands in parallel
Now that we have a global request queue, do not check whether there is
a global request before adding another one. Amend the truncation test that
expects this explicitly and add another one that checks that two truncates
can be submitted in parallel.
2025-06-11 11:29:33 +03:00
Gleb Natapov
a0a3a034e0 topology coordinator: Implement global topology request queue
Requests, together with their parameters, are added to the
topology_request tables and the queue of active global requests is
kept in topology state. They are processed one by one by the topology
state machine.

Fixes: #16822
2025-06-11 11:29:33 +03:00
Petr Gusev
e456d2d507 storage_proxy: log gate_closed_exception
gate_closed_exception likely signals that we have shutdown order
issues. If we just swallow it, we lose information about which
component was shut down prematurely.

For example, we stopped local storage before group0 during shutdown
in main.cc. If a group0 command arrives, topology_state_load might
try to write something and get mutation_write_failure_exception,
which results in 'applier fiber stopped because of the error'.
There is no other information in the logs in this case, other
than 'mutation_write_failure_exception'. It's not clear what the
original problem is and what component is triggering it.

In this commit we add a warning to the logs when gate_closed_exception
is thrown from lmutate or rmutate.

Another option is to just remove the try_catch_nested line and allow
gate_closed_exception to be logged as an error below. However,
this might break some tests which check ERROR lines in the logs.
2025-06-10 10:04:04 +02:00
Gleb Natapov
be0b328b19 topology coordinator: store request type for each global command 2025-06-09 13:38:49 +03:00
Tomasz Grabiec
fadfbe8459 Merge 'transport: storage_proxy: release ERM when waiting for query timeout' from Andrzej Jackowski
Before this change, if a read executor had just enough targets to
achieve the query's CL, and there was a connection drop (e.g. node failure),
the read executor waited for the entire request timeout to give drivers
time to execute a speculative read in the meantime. Such behavior doesn't
work well when a very long query timeout (e.g. 1800s) is set, because
the unfinished request blocks topology changes.

This change implements a mechanism to throw a new
read_failure_exception_with_timeout in the aforementioned scenario.
The exception is caught by the CQL server, which conducts the waiting after
the ERM is released. The new exception inherits from read_failure_exception,
because layers that don't catch the exception (such as the mapreduce
service) should handle the exception just as a regular read_failure.
However, when the CQL server catches the exception, it returns
read_timeout_exception to the client because after additional waiting
such an error message is more appropriate (read_timeout_exception was
also returned before this change was introduced).
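
The inheritance choice can be sketched as follows. The types are
simplified stand-ins: the `remaining` field, failing_read(),
generic_layer() and cql_layer() are hypothetical and only illustrate how
a layer that knows the derived type can treat it specially, while other
layers keep handling it as a plain read failure.

```cpp
#include <chrono>
#include <stdexcept>
#include <string>

// Simplified stand-ins for the exception types named above.
struct read_failure_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};

struct read_failure_exception_with_timeout : read_failure_exception {
    // Assumption for this sketch: the exception carries the time left.
    std::chrono::milliseconds remaining;
    explicit read_failure_exception_with_timeout(std::chrono::milliseconds r)
        : read_failure_exception("read failure"), remaining(r) {}
};

// Hypothetical read that fails the way the message describes.
void failing_read() {
    throw read_failure_exception_with_timeout(std::chrono::milliseconds(5));
}

// A layer that only knows the base class sees a regular read failure.
std::string generic_layer() {
    try {
        failing_read();
    } catch (const read_failure_exception&) {
        return "read_failure";
    }
    return "ok";
}

// A CQL-server-like layer catches the derived type first.
std::string cql_layer() {
    try {
        failing_read();
    } catch (const read_failure_exception_with_timeout& e) {
        // A real server would wait here, after the ERM is released;
        // this sketch skips the wait and only reads the field.
        (void)e.remaining;
        return "read_timeout";
    } catch (const read_failure_exception&) {
        return "read_failure";
    }
    return "ok";
}
```

Here generic_layer() reports a read failure and cql_layer() reports a
read timeout for the same thrown exception. The order of the catch
clauses matters: the derived handler must come before the base one.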

This change:
- Rewrite cql_server::connection::process_request_one to use
  seastar::futurize_invoke and try_catch<> instead of utils::result_try
- Add new read_failure_exception_with_timeout and throw it in storage_proxy
- Add sleep in CQL server when the new exception is caught
- Catch local exceptions in Mapreduce Service and convert them
  to std::runtime_error.
- Add get_cql_exclusive to manager_client.py
- Add test_long_query_timeout_erm

No backport needed - minor issue fix.

Closes scylladb/scylladb#23156

* github.com:scylladb/scylladb:
  test: add test_long_query_timeout_erm
  test: add get_cql_exclusive to manager_client.py
  mapreduce: catch local read_failure_exception_with_timeout
  transport: storage_proxy: release ERM when waiting for query timeout
  transport: remove redundant references in process_request_one
  transport: fix the indentation in process_request_one
  transport: add futures in CQL server exception handling
2025-05-08 12:45:49 +02:00
Andrzej Jackowski
1fca994c7b transport: storage_proxy: release ERM when waiting for query timeout
Before this change, if a read executor had just enough targets to
achieve the query's CL, and there was a connection drop (e.g. node failure),
the read executor waited for the entire request timeout to give drivers
time to execute a speculative read in the meantime. Such behavior doesn't
work well when a very long query timeout (e.g. 1800s) is set, because
the unfinished request blocks topology changes.

This change implements a mechanism to throw a new
read_failure_exception_with_timeout in the aforementioned scenario.
The exception is caught by the CQL server, which conducts the waiting after
the ERM is released. The new exception inherits from read_failure_exception,
because layers that don't catch the exception (such as the mapreduce
service) should handle the exception just as a regular read_failure.
However, when the CQL server catches the exception, it returns
read_timeout_exception to the client because after additional waiting
such an error message is more appropriate (read_timeout_exception was
also returned before this change was introduced).

This change:
 - Add new read_failure_exception_with_timeout exception
 - Add throw of read_failure_exception_with_timeout in storage_proxy
 - Add abort_source to CQL server, as well as to_stop() method for
   the correct abort handling
 - Add sleep in CQL server when the new exception is caught

Refs #21831
2025-04-23 09:29:47 +02:00
Benny Halevy
e1fe82ed33 utils: phased_barrier, pluggable: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:47:00 +03:00