Commit Graph

428 Commits

Author SHA1 Message Date
Pavel Emelyanov
6c1e5c248f main,proxy: Drain proxy in its stop_remote
Currently proxy initialization is pretty disperse, in particular it's
stopped in several steps -- first drain_on_shutdown() then
stop_remote(). In between there's nothing that needs proxy in any
particular sate, so those two steps can be merged into one.

refs: scylladb/scylladb#2737

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#19344
2024-06-27 12:26:51 +02:00
Wojciech Mitros
f70f774e40 mv: gossip the same backlog if a different backlog was sent in a response
Currently, there are 2 ways of sharing a backlog with other nodes: through
a gossip mechanism, and with responses to replica writes. In gossip, we
check each second if the backlog changed, and if it did we update other
nodes with it. However if the backlog for this node changed on another
node with a write response, the gossiped backlog is currently not updated,
so if after the response the backlog goes back to the value from the previous
gossip round, it will not get sent and the other node will stay with an
outdated backlog.
This patch changes this by notifying the gossip that a the backlog changed
since the last gossip round so a different backlog could have been send
through the response piggyback mechanism. With that information, gossip
will send an unchanged backlog to other nodes in the following gossip round.

Fixes: https://github.com/scylladb/scylladb/issues/18461
2024-06-06 10:45:15 +02:00
Wojciech Mitros
272e80fe0a node_update_backlog: divide adding and fetching backlogs
Currently, we only update the backlogs in node_update_backlog at the
same time when we're fetching them. This is done using storage_proxy's
method get_view_update_backlog, which is confusing because it's a getter
with side-effects. Additionally, we don't always want to update the
backlog when we're reading it (as in gossip which is only on shard 0)
and we don't always want to read it when we're updating it (when we're
not handling any writes but the backlog drops due to background work
finish).

This patch divides the node_view_backlog::add_fetch as well the
storage_proxy::get_view_update_backlog both into two methods; one
for updating and one for reading the backlog. This patch only replaces
the places where we're currently using the view backlog getter, more
situations where we should get/update the backlog should be considered
in a following patch.
2024-06-06 10:45:13 +02:00
Piotr Dulikowski
68eca3778c Merge 'mv: throttle view update generation for large queries' from Wojciech Mitros
This series is a reupload of #13792 with a few modifications, namely a test is added and the conflicts with recent tablet related changes are fixed.

See https://github.com/scylladb/scylladb/issues/12379 and https://github.com/scylladb/scylladb/pull/13583 for a detailed description of the problem and discussions.

This PR aims to extend the existing throttling mechanism to work with requests that internally generate a large amount of view updates, as suggested by @nyh.

The existing mechanism works in the following way:

* Client sends a request, we generate the view updates corresponding to the request and spawn background tasks which will send these updates to remote nodes
* Each background task consumes some units from the `view_update_concurrency_semaphore`, but doesn't wait for these units, it's just for tracking
* We keep track of the percent of consumed units on each node, this is called `view update backlog`.
* Before sending a response to the client we sleep for a short amount of time. The amount of time to sleep for is based on the fullness of this `view update backlog`. For a well behaved client with limited concurrency this will limit the amount of incoming requests to a manageable level.

This mechanism doesn't handle large DELETE queries. Deleting a partition is fast for the base table, but it requires us to generate a view update for every single deleted row. The number of deleted rows per single client request can be in the millions. Delaying response to the request doesn't help when a single request can generate millions of updates.

To deal with this we could treat the view update generator just like any other client and force it to wait a bit of time before sending the next batch of updates. The amount of time to wait for is calculated just like in the existing throttling code, it's based on the fullness of `view update backlogs`.

The new algorithm of view update generation looks something like this:
```c++
for(;;) {
    auto updates = generate_updates_batch_with_max_100_rows();
    co_await seastar::sleep(calculate_sleep_time_from_backlogs());
    spawn_background_tasks_for_updates(updates);
}
```
Fixes: https://github.com/scylladb/scylladb/issues/12379

Closes scylladb/scylladb#16819

* github.com:scylladb/scylladb:
  test: add test for bad_allocs during large mv queries
  mv: throttle view update generation for large queries
  exceptions: add read_write_timeout_exception, a subclass of request_timeout_exception
  db/view: extract view throttling delay calculation to a global function
  view_update_generator: add get_storage_proxy()
  storage_proxy: make view backlog getters public
2024-05-16 08:22:54 +02:00
Botond Dénes
155332ebf8 Merge 'Drain view_builder in generic drain (again)' from Pavel Emelyanov
Some time ago #16558 was merged that moved view builder drain into generic drain. After this merge dtests started to fail from time to time, so the PR was reverted (see #18278). In #18295 the hang was found. View builder drain was moved from "before stopping messaging service to "after" it, and view update write handlers in proxy hanged for hard-coded timeout of 5 minutes without being aborted. Tests don't wait for 5 minutes and kill scylla, then complain about it and fail.

This PR brings back the original PR as well as the necessary fix that cancels view update write handlers on stop.

Closes scylladb/scylladb#18408

* github.com:scylladb/scylladb:
  Reapply "Merge 'Drain view_builder in generic drain' from ScyllaDB"
  view: Abort pending view updates when draining
2024-05-09 08:26:44 +03:00
Piotr Dulikowski
64ba620dc2 Merge 'hinted handoff: Use host IDs instead of IPs in the module' from Dawid Mędrek
This pull request introduces host ID in the Hinted Handoff module. Nodes are now identified by their host IDs instead of their IPs. The conversion occurs on the boundary between the module and `storage_proxy.hh`, but aside from that, IPs have been erased.

The changes take into considerations that there might still be old hints, still identified by IPs, on disk – at start-up, we map them to host IDs if it's possible so that they're not lost.

Refs scylladb/scylladb#6403
Fixes scylladb/scylladb#12278

Closes scylladb/scylladb#15567

* github.com:scylladb/scylladb:
  docs: Update Hinted Handoff documentation
  db/hints: Add endpoint_downtime_not_bigger_than()
  db/hints: Migrate hinted handoff when cluster feature is enabled
  db/hints: Handle arbitrary directories in resource manager
  db/hints: Start using hint_directory_manager
  db/hints: Enforce providing IP in get_ep_manager()
  db/hints: Introduce hint_directory_manager
  db/hints/resource_manager: Update function description
  db/hints: Coroutinize space_watchdog::scan_one_ep_dir()
  db/hints: Expose update lock of space watchdog
  db/hints: Add function for migrating hint directories to host ID
  db/hints: Take both IP and host ID when storing hints
  db/hints: Prepare initializing endpoint managers for migrating from IP to host ID
  db/hints: Migrate to locator::host_id
  db/hints: Remove noexcept in do_send_one_mutation()
  service: Add locator::host_id to on_leave_cluster
  service: Fix indentation
  db/hints: Fix indentation
2024-05-06 09:58:18 +02:00
Benny Halevy
890b890e36 storage_proxy: add mutate_locally(vector<frozen_mutation_and_schema>) method
Generalizing the ad-hoc implementation out of
group0_state_machine.write_mutations_to_database.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 19:42:58 +03:00
Jan Ciolek
4c5cfc7683 storage_proxy: make view backlog getters public
Storage proxy maintains information about both local
and remote view update backlogs.

This information might also be useful outside of storage_proxy,
so let's expose the functions that allow to acces backlog information.

There aren't any implementation quirks that would make
it unsafe to make the functions public, the worst that
can happen is that someone causes a lot of atomic operations
by repeatedly calling get_view_update_backlog().

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2024-05-02 10:59:55 +02:00
Pavel Emelyanov
d47053266b view: Abort pending view updates when draining
When view builder is drained (it now happens very early, but next patch
moves this into regular drain) it waits for all on-going view build
steps to complete. This includes waiting for any outstanding proxy view
writes to complete as well.

View writes in proxy have very high timeout of 5 minutes but they are
cancellable. However, canecelling of such writes happens in proxy's
drain_on_shutdown() call which, in turn, happens pretty late on
shutdown. Effectively, by the time it happens all view writes mush have
completed already, so stop-time cancelling doesn't really work nowadays.

Next patch makes view builder drain happen a bit later during shutdown,
namely -- _after_ shutting down messaging service. When it happen that
late, non-working view writes cancellation becomes critical, as view
builder drain hangs for aforementioned 5 minutes. This patch explicitly
cancels all view writes when view builder stops.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-02 08:16:12 +03:00
Pavel Emelyanov
5d992a4f01 proxy: Remove declaration of nonexisting view_update_write_response_handler class
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18417
2024-05-01 10:15:41 +03:00
Dawid Medrek
cfd03fe273 db/hints: Migrate to locator::host_id
We change the type of node identifiers
used within the module and fix compilation.
Directories storing hints to specific nodes
are now represented by host IDs instead of
IPs.
2024-04-26 22:44:04 +02:00
Dawid Medrek
54ae9797b9 service: Add locator::host_id to on_leave_cluster
We extend the function
endpoint_lifecycle_subscriber::on_leave_cluster
by another argument -- locator::host_id.
It's more convenient to have a consistent
pair of IP and host ID.
2024-04-26 22:44:03 +02:00
Dawid Medrek
a36387d942 service: Fix indentation 2024-04-26 22:44:03 +02:00
Kefu Chai
c323c93fa4 treewide: remove {dclocal_,}read_repair_chance options
dclocal_read_repair_chance and read_repair_chance have been removed
in Cassandra 3.11 and 4.x, see
https://issues.apache.org/jira/browse/CASSANDRA-13910.
if we expose the properties via DDL, Cassandra would fails to consume
the CQL statement to creating the table when performing migration
from Scylla to Cassandra 4.x, as the latter does not understand
these properties anymore.

currently the default values of `dc_local_read_repair_chance` and
`read_repair_chance` are both "0". so this is practically disabled,
unless user deliberately set them to a value greater than 0.

also, as a side effect, Cassandra 4.x has better support of
Python3. the cqlsh shipped along with Cassandra 3.11.16 only
supports python2.7, see
https://github.com/apache/cassandra/blob/cassandra-3.11.16/bin/cqlsh.py
it errors out if the system only provides python3 with the error
of

```
No appropriate python interpreter found.
```
but modern linux systems do not provide python2 anymore.

so, in this change, we deprecate these two options.

Fixes #3502
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-25 17:15:27 +08:00
Avi Kivity
4ddf82e58b treewide: don't #include "gms/feature_service.hh" from other headers
feature_service.hh is a high-level header that integrates much
of the system functionality, so including it in lower-level headers
causes unnecessary rebuilds. Specifically, when retiring features.

Fix by removing feature_service.hh from headers, and supply forward
declarations and includes in .cc where needed.

Closes scylladb/scylladb#18005
2024-03-26 15:31:18 +02:00
Pavel Emelyanov
7c5c89ba8d Revert "Merge 'Use utils::directories instead of db::config to get dirs' from Patryk Wróbel"
This reverts commit 370fbd346c, reversing
changes made to 0912d2a2c6.

This makes scylla-manager mis-interpret the data_file_directories
somehow, issue #17078
2024-01-31 15:08:14 +03:00
Kefu Chai
b931d93668 treewide: fix misspellings in code comments
these misspellings are identified by codespell.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17004
2024-01-31 09:16:10 +02:00
Patryk Wrobel
f08768e767 service/storage_proxy: use utils::directories to get paths of dirs
This change replaces usage of db::config with
usage of utils::directories to get paths of
directories in service/storage_proxy.

Refs: scylladb#5626
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
2024-01-29 13:11:33 +01:00
Gleb Natapov
5b246920ae storage_proxy: allow to wait for all ongoing writes
We want to be able to wait for all writes started through the storage
proxy before a fence is advanced. Add phased_barrier that is entered
on each local write operation before checking the fence to do so. A
write will be either tracked by the phased_barrier or fenced. This will
be needed to wait for all non fenced local writes to complete before
starting a cleanup.
2024-01-14 14:44:07 +02:00
Patryk Jędrzejczak
f1dea4bc8a storage_proxy: do not fence reads and writes to local tables
Fencing is necessary only for reads and writes to non-local tables.
Moreover, fencing a read or write to a local table can cause an
error on the bootstrapping node. It is explained in the comment
in storage_proxy::get_fence.

A scenario described in the comment has been reported in
scylladb/scylladb#16423. A write to the local RAFT table failed
because of fencing, and it killed server_impl::io_fiber.

Fixes scylladb/scylladb#16423

Closes scylladb/scylladb#16525
2023-12-28 19:34:27 +02:00
Benny Halevy
a529097d96 storage_proxy: use locator::topology rather than fb_utilities
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-05 10:44:13 +02:00
Yaniv Kaul
c658bdb150 Typos: fix typos in comments
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.

Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2023-12-02 22:37:22 +02:00
Marcin Maliszkiewicz
020a9c931b db: view: run local materialized view mutations on a separate smp service group
When base write triggers mv write and it needs to be send to another
shard it used the same service group and we could end up with a
deadlock.

This fix affects also alternator's secondary indexes.

Testing was done using (yet) not committed framework for easy alternator
performance testing: https://github.com/scylladb/scylladb/pull/13121.
I've changed hardcoded max_nonlocal_requests config in scylla from 5000 to 500 and
then ran:

./build/release/scylla perf-alternator-workloads --workdir /tmp/scylla-workdir/ --smp 2 \
--developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write_gsi \
--duration 60 --ring-delay-ms 0 --skip-wait-for-gossip-to-settle 0 --continue-after-error true --concurrency 2000

Without the patch when scylla is overloaded (i.e. number of scheduled futures being close to max_nonlocal_requests) after couple seconds
scylla hangs, cpu usage drops to zero, no progress is made. We can confirm we're hitting this issue by seeing under gdb:

p seastar::get_smp_service_groups_semaphore(2,0)._count
$1 = 0

With the patch I wasn't able to observe the problem, even with 2x
concurrency. I was able to make the process hang with 10x concurrency
but I think it's hitting different limit as there wasn't any depleted
smp service group semaphore and it was happening also on non mv loads.

Fixes https://github.com/scylladb/scylladb/issues/15844

Closes scylladb/scylladb#15845
2023-10-29 18:30:32 +02:00
Pavel Emelyanov
53891dd9cc api,hints: Move gossiper access to proxy
API handlers should try to avoid using any service other than the "main"
one. For hints API this service is going to be proxy, so no gossiper
access in the handler itself.

(indentation is left broken)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-05 16:14:26 +03:00
Benny Halevy
2c54d7a35a view, storage_proxy: carry effective_replication_map along with endpoints
When sending mutation to remote endpoint,
the selected endpoints must be in sync with
the current effective_replication_map.

Currently, the endpoints are sent down the storage_proxy
stack, and later on an effective_replication_map is retrieved
again, and it might not match the target or pending endpoints,
similar to the case seen in https://github.com/scylladb/scylladb/issues/15138

The correct way is to carry the same effective replication map
used to select said endpoints and pass it down the stack.
See also https://github.com/scylladb/scylladb/pull/15141

Fixes scylladb/scylladb#15144
Fixes scylladb/scylladb#14730

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #15142
2023-08-29 09:08:42 +03:00
Botond Dénes
47ce69e9bf Merge 'paxos_response_handler: carry effective replication map' from Benny Halevy
As `create_write_response_handler` on this path accepts
an `inet_address_vector_replica_set` that corresponds to the
effective_replication_map_ptr in the paxos_response_handler,
but currently, the function retrieves a new
effective_replication_map_ptr
that may not hold all the said endpoints.

Fixes scylladb/scylladb#15138

Closes #15141

* github.com:scylladb/scylladb:
  storage_proxy: create_write_response_handler: carry effective_replication_map_ptr from paxos_response_handler
  storage_proxy: send_to_live_endpoints: throw on_internal_error if node not found
2023-08-28 11:42:38 +03:00
Benny Halevy
4a2e367e92 storage_proxy: create_write_response_handler: carry effective_replication_map_ptr from paxos_response_handler
As `create_write_response_handler` on this path accepts
an `inet_address_vector_replica_set` that corresponds to the
effective_replication_map_ptr in the paxos_response_handler,
but currently, the function retrieves a new
effective_replication_map_ptr
that may not hold all the said endpoints.

Fixes scylladb/scylladb#15138

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-24 11:45:13 +03:00
Benny Halevy
098dd5021a storage_proxy: mutate_atomically_result: keep schema of batchlog mutation in context
The batchlog mutation is for system.batchlog.
Rather than looking the schema up in multiple places
do that once and keep it in the context object.

It will be used in the next patch to get a respective
effective_replication_map_ptr.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-24 10:43:23 +03:00
Benny Halevy
3c122a87b5 storage_proxy: query_partition_key_range_concurrent: turn tail recursion to iteration
Update the function state and loop for the next ranges
instead of nesting it oneself.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 09:43:33 +03:00
Botond Dénes
53da97416a Merge 'Remove qctx from system.paxos table access methods' from Pavel Emelyanov
The "fix" is straightforward -- callers of system_keyspace::*paxos* methods need to get system keyspace from somewhere. This time the only caller is storage_proxy::remote that can have system keyspace via direct dependency reference.

Closes #14758

* github.com:scylladb/scylladb:
  db/system_keyspace: Move and use qctx::execute_cql_with_timeout()
  db/system_keyspace: Make paxos methods non-static
  service/paxos: Add db::system_keyspace& argument to some methods
  test: Optionally initialize proxy remote for cql_test_env
  proxy/remote: Keep sharded<db::system_keyspace>& dependency
2023-07-20 16:53:25 +03:00
Pavel Emelyanov
b0b91bf5ec proxy/remote: Keep sharded<db::system_keyspace>& dependency
This dependency will be needed to call service::paxos_state:: calls and
all of them are done in storage_proxy::remote() methods only

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-19 17:36:42 +03:00
Avi Kivity
460b28d067 Merge 'Introduce SELECT MUTATION FRAGMENTS statement' from Botond Dénes
SELECT MUTATION FRAGMENTS is a new select statement sub-type, which allows dumping the underling mutations making up the data of a given table. The output of this statement is mutation-fragments presented as CQL rows. Each row corresponds to a mutation-fragment. Subsequently, the output of this statement has a schema that is different than that of the underlying table.  The output schema is derived from the table's schema, as following:
* The table's partition key is copied over as-is
* The clustering key is formed from the following columns:
    - mutation_source (text): the kind of the mutation source, one of: memtable, row-cache or sstable; and the identifier of the individual mutation source.
    - partition_region (int): represents the enum with the same name.
    - the copy of the table's clustering columns
    - position_weight (int): -1, 0 or 1, has the same meaning as that in position_in_partition, used to disambiguate range tombstone changes with the same clustering key, from rows and from each other.
* The following regular columns:
    - metadata (text): the JSON representation of the mutation-fragment's metadata.
    - value (text): the JSON representation of the mutation-fragment's value.

Data is always read from the local replica, on which the query is executed. Migrating queries between coordinators is frobidden.

More details in the documentation commit (last commit).

Example:
```cql
cqlsh> CREATE TABLE ks.tbl (pk int, ck int, v int, PRIMARY KEY (pk, ck));

cqlsh> DELETE FROM ks.tbl WHERE pk = 0;
cqlsh> DELETE FROM ks.tbl WHERE pk = 0 AND ck > 0 AND ck < 2;
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (0, 0, 0);
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (0, 1, 0);
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (0, 2, 0);
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (1, 0, 0);
cqlsh> SELECT * FROM ks.tbl;

 pk | ck | v
----+----+---
  1 |  0 | 0
  0 |  0 | 0
  0 |  1 | 0
  0 |  2 | 0

(4 rows)
cqlsh> SELECT * FROM MUTATION_FRAGMENTS(ks.tbl);

 pk | mutation_source | partition_region | ck | position_weight | metadata                                                                                                                 | mutation_fragment_kind | value
----+-----------------+------------------+----+-----------------+--------------------------------------------------------------------------------------------------------------------------+------------------------+-----------
  1 |      memtable:0 |                0 |    |                 |                                                                                                         {"tombstone":{}} |        partition start |      null
  1 |      memtable:0 |                2 |  0 |               0 | {"marker":{"timestamp":1688122873341627},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122873341627}}} |         clustering row | {"v":"0"}
  1 |      memtable:0 |                3 |    |                 |                                                                                                                     null |          partition end |      null
  0 |      memtable:0 |                0 |    |                 |                                      {"tombstone":{"timestamp":1688122848686316,"deletion_time":"2023-06-30 11:00:48z"}} |        partition start |      null
  0 |      memtable:0 |                2 |  0 |               0 | {"marker":{"timestamp":1688122860037077},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122860037077}}} |         clustering row | {"v":"0"}
  0 |      memtable:0 |                2 |  0 |               1 |                                      {"tombstone":{"timestamp":1688122853571709,"deletion_time":"2023-06-30 11:00:53z"}} | range tombstone change |      null
  0 |      memtable:0 |                2 |  1 |               0 | {"marker":{"timestamp":1688122864641920},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122864641920}}} |         clustering row | {"v":"0"}
  0 |      memtable:0 |                2 |  2 |              -1 |                                                                                                         {"tombstone":{}} | range tombstone change |      null
  0 |      memtable:0 |                2 |  2 |               0 | {"marker":{"timestamp":1688122868706989},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122868706989}}} |         clustering row | {"v":"0"}
  0 |      memtable:0 |                3 |    |                 |                                                                                                                     null |          partition end |      null

(10 rows)
```

Perf simple query:
```
/build/release/scylla perf-simple-query -c1 -m2G --duration=60
```

Before:
```
median 141596.39 tps ( 62.1 allocs/op,  13.1 tasks/op,   43688 insns/op,        0 errors)
median absolute deviation: 137.15
maximum: 142173.32
minimum: 140492.37
```
After:
```
median 141889.95 tps ( 62.1 allocs/op,  13.1 tasks/op,   43692 insns/op,        0 errors)
median absolute deviation: 167.04
maximum: 142380.26
minimum: 141025.51
```

Fixes: https://github.com/scylladb/scylladb/issues/11130

Closes #14347

* github.com:scylladb/scylladb:
  docs/operating-scylla/admin-tools: add documentation for the SELECT * FROM MUTATION_FRAGMENTS() statement
  test/topology_custom: add test_select_from_mutation_fragments.py
  test/boost/database_test: add test for mutation_dump/generate_output_schema_from_underlying_schema
  test/cql-pytest: add test_select_mutation_fragments.py
  test/cql-pytest: move scylla_data_dir fixture to conftest.py
  cql3/statements: wire-in mutation_fragments_select_statement
  cql3/restrictions/statement_restrictions: fix indentation
  cql3/restrictions/statement_restrictions: add check_indexes flag
  cql3/statments/select_statement: add mutation_fragments_select_statement
  cql3: add SELECT MUTATION FRAGMENTS select statement sub-type
  service/pager: allow passing a query functor override
  service/storage_proxy: un-embed coordinator_query_options
  replica: add mutation_dump
  replica: extract query_state into own header
  replica/table: add make_nonpopulating_cache_reader()
  replica/table: add select_memtables_as_mutation_sources()
  tools,mutation: extract the low-level json utilities into mutation/json.hh
  tools/json_writer: fold SstableKey() overloads into callers
  tools/json_writer: allow writing metadata and value separately
  tools/json_writer: split mutation_fragment_json_writer in two classes
  tools/json_writer: allow passing custom std::ostream to json_writer
2023-07-19 11:54:11 +03:00
Botond Dénes
2174276bb7 service/storage_proxy: un-embed coordinator_query_options
So it can be forward declared. Add an embedded alias to reduce churn.
Requires similarly un-embedding clock_type.
2023-07-19 01:28:28 -04:00
Kefu Chai
bab16eb30e treewide: remove #includes not use directly
for faster build times and clear inter-module dependencies, we
should not #includes headers not directly used. instead, we should
only #include the headers directly used by a certain compilation
unit.

in this change, the source files under "/compaction" directories
are checked using clangd, which identifies the cases where we have
an #include which is not directly used. all the #includes identified
by clangd are removed. because some source files rely on the incorrectly
included header file, those ones are updated to #include the header
file they directly use.

if a forward declaration suffice, the declaration is added instead.

see also https://clangd.llvm.org/guides/include-cleaner#unused-include-warning

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-18 17:36:31 +08:00
Gleb Natapov
94fcba5662 storage_proxy: remove unused variable 2023-06-22 15:26:20 +03:00
Tomasz Grabiec
10e05eec66 storage_proxy: Obtain shard from erm in the read path
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
2023-06-21 00:58:24 +02:00
Petr Gusev
d34da12240 storage_proxy: add fencing_token and related infrastructure
A new stale_topology_exception was introduced,
it's raised in apply_fence when an RPC comes
with a stale fencing_token.

An overload of apply_fence with future will be
used to wrap the storage_proxy methods which
need to be fenced.
2023-06-15 15:48:00 +04:00
Kamil Braun
a740fbf58a storage_proxy: rename init_messaging_service to start_remote
The function now has more responsibilities than before, rename it
and add a comment to better illustrate this.
2023-06-14 11:41:36 +02:00
Kamil Braun
f26e98c3be storage_proxy: don't pass gossiper& and messaging_service& during initialization
These services are now passed during `init_messaging_service`, and
that's when the `remote` object is constructed.

The `remote` object is then destroyed in `uninit_messaging_service`.

Also, `migration_manager*` became `migration_manager&` in
`init_messaging_service`.
2023-06-14 11:41:36 +02:00
Kamil Braun
10f11b89ea storage_proxy: prepare for missing remote
Prepare the users of `remote` for the possibility that it's gone.

The `remote()` accessor throws an error if it's gone. Observe that
`remote()` is only used in places where it's verified that we really
want to send a message to a remote node, with a small exception:
`truncate_blocking`, which truncates locally by sending an RPC to
ourselves (and truncate always sends RPC to the whole cluster; we might
want to change this behavior in the future, see #11087). Other places
are easy to check (it's either implementations of `apply_remotely`
which is only called for remote nodes, or there's an `if` that checks
we don't apply the operation to ourselves).

There is one direct access to `_remote` which checks first if `_remote`
is available: `storage_proxy::is_alive`. If `_remote` is unavailable, we
consider nodes other than us dead. Indeed, if `gossiper` is unavailable,
we didn't have a chance to gossip with other nodes and mark them alive.
2023-06-14 11:41:36 +02:00
Kamil Braun
ddcbade919 storage_proxy: don't access remote when calculating target replicas for local queries
We only want to access `remote` when it's necessary - when we're
performing a query that involves remote nodes. We want to support local
queries when `remote` (in particular, `gossiper&`) is unavailable.

Add a helper, `storage_proxy::filter_replicas_for_read`, which will
check if it's a local query and return early in that case without
accessing `remote`.
2023-06-14 11:41:34 +02:00
Kamil Braun
ff8d88a228 storage_proxy: introduce const version of remote()
One version is implemented using the other (with `const_cast`) because
some additional safety checks will be added in later commit.
2023-06-13 12:44:03 +02:00
Kamil Braun
0ef35ceed4 service: storage_proxy: make hint write handlers cancellable
Whether a write handler should be cancellable is now controlled by a
parameter passed to `create_write_response_handler`. We plumb it down
from `send_to_endpoint` which is called by hints manager.

This will cause hint write handlers to immediately timeout when we
shutdown or when a destination node is marked as dead.

Fixes #8079
2023-05-29 11:03:18 +02:00
Kamil Braun
eddb7406b4 service: storage_proxy: rename view_update_handlers_list
The list will be used for non-view-update write handlers as well, so
generalize the name. Also generalize some variable names used in the
implementation.

This commit only renames things + some comments were added,
there are no logical changes.
2023-05-29 10:59:50 +02:00
Kamil Braun
c7ef9a12ee service: storage_proxy: make it possible to cancel all write handler types
The `view_update_write_response_handler` class, which is a subclass of
`abstract_write_response_handler`, was created for a single purpose: to
make it possible to cancel a handler for a view update write, which
means we stop waiting for a response to the write, timing out the
handler immediately. This was done to solve issue with node shutdown
hanging because it was waiting for a view update to finish; view updates
were configured with 5 minute timeout. See #3966, #4028.

Now we're having a similar problem with hint updates causing shutdown to
hang in tests (#8079).

`view_update_write_response_handler` implements cancelling by adding
itself to an intrusive list which we then iterate over to timeout each
handler when we shutdown or when gossiper notifies `storage_proxy` that
a node is down.

To make it possible to reuse this algorithm for other handlers, move the
functionality into `abstract_write_response_handler`. We inherit from
`bi::list_base_hook` so it introduces small memory overhead to each
write handler (2 pointers) which was only present for view update
handlers before. But those handlers are already quite large, the
overhead is small compared to their size.

Not all handlers are added to the cancelling list, this is controlled by
the `cancellable` parameter passed to the constructor. For now we're
only cancelling view handlers as before. In following commits we'll also
cancel hint handlers.
2023-05-29 10:42:57 +02:00
Petr Gusev
052b91fb1f storage_proxy: rename get_live_sorted_endpoints->get_endpoints_for_reading
We are going to use remapped_endpoints_for_reading, we need
to make sure we use it in the right place. The
get_live_sorted_endpoints function looks like what we
need - it's used in all read code paths.
From its name, however, this was not obvious.

Also, we add the parameter ks_name as we'll need it
to pass to remapped_endpoints_for_reading.
2023-05-09 18:42:03 +04:00
Pavel Emelyanov
739455c3aa code: Remove global proxy
No code needs global proxy anymore. Keep on-stack values in main and
cql_test_env and keep the pointer on debug:: namespace.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-21 14:18:59 +03:00
Pavel Emelyanov
ab8fc0e166 proxy: Carry replication map with repair mutation(s)
The create_write_response_handler() for read repair needs the e.r.m.
from the caller, because it effectively accepts list of endpoints from
it.

So this patch equips all read_repair_mutation-s with the e.r.m. pointer
so that the handler creation can use it. It's the same for all
mutations, so it's a waste of space, but it's not bad -- there's
typically few mutations in this range and the entry passed there is
temporary, so even lots of them won't occupy lots of memory for long.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-14 14:03:39 +03:00
Pavel Emelyanov
140f373e15 proxy: Wrap read repair entries into read_repair_mutation
The schedule_repair() operates on a map of endpoint:mutations pairs.
Next patch will need to extend this entry and it's going to be easier if
the entry is wrapped in a helper structure in advance.

This is where the forwardable reference cursor from the previous patch
gets its user. The schedule_repair() produces a range of rvalue
wrappers, but the create_write_response_handler accepting it is OK, it
copies mutations anyway.

The printing operator is added to facilitate mutations logging from
mutate_internal() method.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-14 14:01:12 +03:00
Michał Jadwiszczak
360dbf98f1 storage_proxy: add describe_ring() method
In order to execute `DESC CLUSTER`, there has to be a way to describe
ring. `storage_service` is not available at query execution. This patch
adds `describe_ring()` as a method of `storage_proxy()` (using helper
function from `locator/util.hh`).
2022-12-10 12:51:05 +01:00