Commit Graph

213 Commits

Author SHA1 Message Date
Avi Kivity
6822e3b88a query_id: extract into new header
query_id currently lives in query-request.hh, a busy place
with lots of dependencies. In turn it gets pulled in by
uuid.idl.hh, which is also very central. This makes
test/raft/randomized_nemesis_test.cc, which is nominally
dependent only on Raft, rebuild on random header file changes.

Fix by extracting into a new header.

Closes #13042
2023-03-01 10:25:25 +02:00
Kefu Chai
c0824c6c25 build: cmake: put generated sources into ${scylla_gen_build_dir}
to be aligned with the convention of configure.py

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-17 18:38:44 +08:00
Kefu Chai
7b431748a8 build: cmake: extract idl library out
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-17 18:38:44 +08:00
Botond Dénes
ef50170120 Merge 'build: cmake: sync with configure (2/n)' from Kefu Chai
* build: cmake: extract idl out
* build: cmake: link cql3 against xxHash
* build: cmake: correct the check in Findlibdeflate.cmake
* build: cmake find_package(libdeflate) earlier
* build: cmake: set more properties to alternator library
* build: cmake: include generate_cql_grammar
* build: cmake: find xxHash package
* build: cmake: add build mode support

Closes #12866

* github.com:scylladb/scylladb:
  build: cmake: correct generate_cql_grammar
  build: cmake: extract idl out
  build: cmake: link cql3 against xxHash
  build: cmake: correct the check in Findlibdeflate.cmake
  build: cmake: find_package(libdeflate) earlier
  build: cmake: set more properties to alternator library
  build: cmake: include generate_cql_grammar
  build: cmake: find xxHash package
  build: cmake: add build mode support
2023-02-16 07:11:26 +02:00
Kefu Chai
2718963a2a build: cmake: extract idl out
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-16 00:07:40 +08:00
Avi Kivity
69a385fd9d Introduce schema/ module
Schema related files are moved there. This excludes schema files that
also interact with mutations, because the mutation module depends on
the schema. Those files will have to go into a separate module.

Closes #12858
2023-02-15 11:01:50 +02:00
Avi Kivity
c5e4bf51bd Introduce mutation/ module
Move mutation-related files to a new mutation/ directory. The names
are kept in the global namespace to reduce churn; the names are
unambiguous in any case.

mutation_reader remains in the readers/ module.

mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this
patch.

This is a step forward towards librarization or modularization of the
source base.

Closes #12788
2023-02-14 11:19:03 +02:00
Michał Sala
bbbe12af43 forward_service: fix timeout support in parallel aggregates
`forward_request` verb carried information about timeouts using
`lowres_clock::time_point` (that came from local steady clock
`seastar::lowres_clock`). The time point was produced on one node and
later compared against another node's `lowres_clock`. That behavior
was wrong (`lowres_clock::time_point`s produced with different
`lowres_clock`s cannot be compared) and could lead to a delayed or
premature timeout.

To fix this issue, `lowres_clock::time_point` was replaced with
`lowres_system_clock::time_point` in `forward_request` verb.
Representation to which both time point types serialize is the same
(64-bit integer denoting the count of elapsed nanoseconds), so it was
possible to do an in-place switch of those types using logic suggested
by @avikivity:
    - using steady_clock is just broken, so we aren't taking anything
        from users by breaking it further
    - once all nodes are upgraded, it magically starts to work
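
The serialization argument above can be illustrated with a minimal sketch. It assumes `std::chrono::system_clock` as a stand-in for `seastar::lowres_system_clock`; the helper names `encode_deadline`, `decode_deadline` and `timed_out` are hypothetical and not taken from the patch.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Stand-in for lowres_system_clock (assumption): its epoch is wall-clock
// time shared by all nodes, unlike a steady clock whose epoch is local.
using wall_clock = std::chrono::system_clock;

// Hypothetical helper: encode a deadline as a 64-bit count of
// nanoseconds since the clock's epoch, the wire format described above.
inline std::int64_t encode_deadline(wall_clock::time_point deadline) {
    auto since_epoch = std::chrono::duration_cast<std::chrono::nanoseconds>(
        deadline.time_since_epoch());
    return since_epoch.count();
}

// Hypothetical helper: rebuild the deadline on the receiving node.
inline wall_clock::time_point decode_deadline(std::int64_t wire) {
    return wall_clock::time_point(
        std::chrono::duration_cast<wall_clock::duration>(
            std::chrono::nanoseconds(wire)));
}

// The receiver compares the decoded deadline with its own wall clock.
inline bool timed_out(std::int64_t wire, wall_clock::time_point now) {
    return now >= decode_deadline(wire);
}
```

The comparison is only meaningful because both nodes measure from the same epoch; with a steady clock the decoded value would be relative to the sender's arbitrary starting point.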

Closes #12529
2023-01-16 12:08:13 +02:00
Aleksandra Martyniuk
60e298fda1 repair: change utils::UUID to node_ops_id
Type of the id of node operations is changed from utils::UUID
to node_ops_id. This way the id of node operations would be easily
distinguished from the ids of other entities.

Closes #11673
2022-12-20 17:04:47 +02:00
Gleb Natapov
022a825b33 raft: introduce not_a_member error and return it when non member tries to do add/modify_config
Currently, if a node that is outside of the config tries to add an entry
or modify the config, a transient error is returned and this causes the node
to retry. But the error is not transient. If a node tries to do one of
the operations above, it means it was part of the cluster at some point,
but since a node with the same id should not be added back to a cluster,
if it is not in the cluster now it never will be.

Return a new error not_a_member to a caller instead.

Message-Id: <Y42mTOx8bNNrHqpd@scylladb.com>
2022-12-05 17:11:04 +01:00
Kamil Braun
cbdcc944b5 service/raft: specialized verb for failure detector pinger
We used GOSSIP_ECHO verb to perform failure detection. Now we use
a special verb DIRECT_FD_PING introduced for this purpose.

There are multiple reasons to do so.

One minor reason: we want to use the same connection as other Raft
verbs: if we can't deliver Raft append_entries or vote messages
somewhere, that endpoint should be marked dead; if we can, the
endpoint should be marked alive. So putting pings on the same
connection as the other Raft verbs is important when dealing with
weird situations where some connections are available but others are
not. Observe that in `do_get_rpc_client_idx`, we put the new verb in
the right place.

Another minor reason: we remove the awkward gossiper `echo_pinger`
abstraction which required storing and updating gossiper generation
numbers. This also removes one dependency from Raft service code to
gossiper.

Major reason 1: the gossip echo handler has a weird mechanism where a
replacing node returns errors during the replace operation to some of
the nodes. In Raft however, we want to mark servers as alive when they
are alive, including a server running on a node that's replacing
another node.

Major reason 2, related to the previous one: when server B is
replacing server A with the same IP, the failure detector will try to
ping both servers. Both servers are mapped to the same IP by the
address map, so pings to both servers will reach server B. We want
server B to respond to the pings destined for server B, but not to
pings destined for server A, so the sender can mark B alive but keep A
marked dead.

To do this, we include the destination's Raft ID in our RPCs. The
destination compares the received ID with its own. If it's different,
it returns a `wrong_destination` response, and the failure detector
knows that the ping did not reach the destination (it reached someone
else).

Yet another reason: removes "Not ready to respond gossip echo
message" log spam during replace.
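
The destination check described above could look roughly like the following sketch. The types `server_id` and `ping_result` and the functions `handle_ping` and `mark_alive` are hypothetical stand-ins, not the actual Raft service code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a Raft server ID.
struct server_id {
    std::uint64_t value;
    friend bool operator==(server_id a, server_id b) {
        return a.value == b.value;
    }
};

// Hypothetical reply type for the ping verb.
enum class ping_result { ok, wrong_destination };

// Handler sketch: answer only pings addressed to this server.
inline ping_result handle_ping(server_id self, server_id destination) {
    return destination == self ? ping_result::ok
                               : ping_result::wrong_destination;
}

// Sender sketch: only an `ok` reply counts as evidence of liveness.
inline bool mark_alive(ping_result reply) {
    return reply == ping_result::ok;
}
```

With this shape, a ping meant for server A that lands on server B (same IP during a replace) yields `wrong_destination`, so A stays marked dead while B is marked alive.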
2022-12-01 20:54:18 +01:00
Kamil Braun
0c9cb5c5bf Merge 'raft: wait for the next tick before retrying' from Gusev Petr
When `modify_config` or `add_entry` is forwarded to the leader, it may
reach the node at "inappropriate" time and result in an exception. There
are two reasons for it - the leader is changing and, in case of
`modify_config`, another `modify_config` is currently in progress. In both
cases the command is retried, but before this patch there was no delay
before retrying, which could lead to a tight loop.

The patch adds a new exception type `transient_error`. When the client
receives it, it is obliged to retry the request after some delay.
Previously leader-side exceptions were converted to `not_a_leader`,
which is strange, especially for `conf_change_in_progress`.

Fixes: #11564

Closes #11769

* github.com:scylladb/scylladb:
  raft: rafactor: remove duplicate code on retries delays
  raft: use wait_for_next_tick in read_barrier
  raft: wait for the next tick before retrying
2022-11-16 18:20:54 +01:00
Petr Gusev
5e15c3c9bd raft: wait for the next tick before retrying
When modify_config or add_entry is forwarded
to the leader, it may reach the node at
"inappropriate" time and result in an exception.
There are two reasons for it - the leader is
changing and, in case of modify_config, another
modify_config is currently in progress. In
both cases the command is retried, but before
this patch there was no delay before retrying,
which could lead to a tight loop.

The patch adds a new exception type transient_error.
When the client node receives it, it is obliged to retry
the request, possibly after some delay. Previously, leader-side
exceptions were converted to not_a_leader exception,
which is strange, especially for conf_change_in_progress.

We add a delay before retrying in modify_config
and add_entry if the client hasn't received any new
information about the leader since the last attempt.
This can happen if the server
responds with a transient_error with an empty leader
and the current node has not yet learned the new leader.
We accept an unnecessary delay if the newly elected leader
is the same as the previous one; this is supposed to be rare.
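
The retry rule above can be sketched synchronously as follows. This is an illustration under assumptions: the real code is asynchronous, and `retry_with_delay`, the `server_id` alias and the callback signatures are hypothetical. Only the names `transient_error` and `wait_for_next_tick` come from the commit messages.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <stdexcept>

// Hypothetical stand-in for a Raft server ID.
using server_id = int;

// Sketch of the new exception: it may carry the leader known to the server.
struct transient_error : std::runtime_error {
    std::optional<server_id> leader;  // empty if the server knows no leader
    explicit transient_error(std::optional<server_id> l)
        : std::runtime_error("transient error"), leader(l) {}
};

// Retry `op` until it succeeds. `wait_for_next_tick` is called only when
// the failed attempt brought no new information about the leader.
// Returns the number of delays that were inserted.
inline int retry_with_delay(
        std::optional<server_id> known_leader,
        const std::function<void(std::optional<server_id>)>& op,
        const std::function<void()>& wait_for_next_tick) {
    int delays = 0;
    for (;;) {
        try {
            op(known_leader);
            return delays;
        } catch (const transient_error& e) {
            if (!e.leader || e.leader == known_leader) {
                // Nothing new learned: wait to avoid a tight loop.
                wait_for_next_tick();
                ++delays;
            } else {
                // A different leader was reported: retry immediately.
                known_leader = e.leader;
            }
        }
    }
}
```

Note that the sketch also delays when the reported leader equals the previously known one, which matches the trade-off the message describes.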

Fixes: #11564
2022-11-15 11:49:26 +04:00
Aleksandra Martyniuk
e2c7c1495d repair: change UUID to task_id
Change type of repair id from utils::UUID to task_id to distinguish
them from ids of other entities.
2022-10-31 10:07:08 +01:00
Konstantin Osipov
3e46c32d7b raft: (discovery) do not use raft::server_address to carry IP data
We plan to remove IP information from Raft addresses.
raft::server_address is used in Raft configuration and
also in discovery, which is a separate algorithm, as a handy data
structure, to avoid having new entities in RPC.

Since we plan to remove IP addresses from Raft configuration,
using raft::server_address in discovery and still storing
IPs in it would create ambiguity: in some uses raft::server_address
would store an IP, and in others - would not.

So switch to a dedicated data structure for the purposes of discovery,
discovery_peer, which contains an (IP, Raft server id) pair.

Note to reviewers: ideally we should switch to URIs
in discovery_peer right away. Otherwise we may have to
deal with incompatible changes in discovery when adding URI
support to Scylla.
2022-10-10 16:24:33 +03:00
Benny Halevy
480b4759a9 idl: streaming: include stream_fwd.hh
To keep the idl definition of plan_id from
getting out of sync with the one in stream_fwd.hh.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11720
2022-10-06 13:49:26 +02:00
Mikołaj Grzebieluch
db88525774 raft: broadcast_tables: add execution of intermediate language
Extended `group0_command` to enable transmission of `raft::broadcast_tables::query`.
Added `add_entry_unguarded` method in `raft_group0_client` for dispatching raft
commands without `group0_guard`.
Queries on group0_kv_store are executed in `group_0_state_machine::apply`,
but for now don't return results. They don't use previous state id, so they will
block concurrent schema changes, but these changes won't block queries.

In this version snapshots are ignored.
2022-09-08 15:25:36 +02:00
Mikołaj Grzebieluch
c541d5c363 raft: broadcast_tables: add definition of intermediate language
In broadcast tables, a raft command contains a whole program to be executed.
Sending an entire CQL statement and parsing it on each node is inefficient,
so we decided to compile it to an intermediate language which can be
easily serialized.

This patch adds a definition of such a language. For now, only the following
types of statements can be compiled:
* select value where key = CONST from system.broadcast_kv_store;
* update system.broadcast_kv_store set value = CONST where key = CONST;
* update system.broadcast_kv_store set value = CONST where key = CONST if value = CONST;
where CONST is a string literal.
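
One way to picture such an intermediate language is a small variant over the three statement shapes listed above. This is only a sketch: the struct names, fields and the `is_conditional` helper are hypothetical, and the definitions in the patch may look different.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <variant>

// Hypothetical shape for `select value where key = CONST`.
struct select_query {
    std::string key;
};

// Hypothetical shape for both update forms.
struct update_query {
    std::string key;
    std::string new_value;
    // Set only for the `if value = CONST` form.
    std::optional<std::string> value_condition;
};

// A compiled statement is one of the supported shapes.
using query = std::variant<select_query, update_query>;

// True only for the conditional update form.
inline bool is_conditional(const query& q) {
    const auto* u = std::get_if<update_query>(&q);
    return u != nullptr && u->value_condition.has_value();
}
```

Plain structs of strings like these are straightforward to serialize, which is the property the message asks of the intermediate language.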
2022-09-08 14:03:51 +02:00
Tomasz Grabiec
1d0264e1a9 Merge 'Implement Raft upgrade procedure' from Kamil Braun
Start with a cluster with Raft disabled, end up with a cluster that performs
schema operations using group 0.

Design doc:
https://docs.google.com/document/d/1PvZ4NzK3S0ohMhyVNZZ-kCxjkK5URmz1VP65rrkTOCQ/
(TODO: replace this with .md file - we can do it as a follow-up)

The procedure, on a high level, works as follows:
- join group 0
- wait until every peer joined group 0 (peers are taken from `system.peers`
  table)
- enter `synchronize` upgrade state, in which group 0 operations are disabled
- wait until all members of group 0 entered `synchronize` state or some member
  entered the final state
- synchronize schema by comparing versions and pulling if necessary
- enter the final state (`use_new_procedures`), in which group 0 is used for
  schema operations.

With the procedure comes a recovery mode in case the upgrade procedure gets
stuck (and it may if we lose a node during recovery - the procedure, to
correctly establish a single group 0 cluster, requires contacting every node).

This recovery mode can also be used to recover clusters with group 0 already
established if they permanently lose a majority of nodes - killing two birds with
one stone. Details in the last commit message.

Read the design doc, then read the commits in topological order
for best reviewing experience.

---

I did some manual tests: upgrading a cluster, using the cluster to add nodes,
remove nodes (both with `decommission` and `removenode`), replacing nodes.
Performing recovery.

As a follow-up, we'll need to implement tests using the new framework (after
it's ready). It will be easy to test upgrades and recovery even with a single
Scylla version - we start with a cluster with the RAFT flag disabled, then
rolling-restart while enabling the flag (and recovery is done through simple
CQL statements).

Closes #10835

* github.com:scylladb/scylladb:
  service/raft: raft_group0: implement upgrade procedure
  service/raft: raft_group0: extract `tracker` from `persistent_discovery::run`
  service/raft: raft_group0: introduce local loggers for group 0 and upgrade
  service/raft: raft_group0: introduce GET_GROUP0_UPGRADE_STATE verb
  service/raft: raft_group0_client: prepare for upgrade procedure
  service/raft: introduce `group0_upgrade_state`
  db: system_keyspace: introduce `load_peers`
  idl-compiler: introduce cancellable verbs
  message: messaging_service: cancellable version of `send_schema_check`
2022-08-25 11:32:06 +03:00
Benny Halevy
314e45d957 streaming: define plan_id as a strong tagged_uuid type
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-22 19:45:30 +03:00
Kamil Braun
ac5f4248a9 service/raft: raft_group0: introduce GET_GROUP0_UPGRADE_STATE verb
During the upgrade procedure nodes will want to obtain the upgrade state
of other nodes to proceed. This is what the new verb is for.
2022-08-19 19:15:19 +02:00
Kamil Braun
7e56251aea service/raft: introduce group0_upgrade_state
Define an enum class, `group0_upgrade_state`, describing the state of
the upgrade procedure (implemented in later commits).

Provide IDL definitions for (de)serialization.

The node will have its current upgrade state stored on disk in
`system.scylla_local` under the `group0_upgrade_state` key. If the key
is not present we assume `use_pre_raft_procedures` (meaning we haven't
started upgrading yet or we're at the beginning of upgrade).

Introduce `system_keyspace` accessor methods for storing and retrieving
the on-disk state.
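
The default described above (a missing key means `use_pre_raft_procedures`) can be sketched as follows. The enum lists only the states named in this log, so the real enum may have more. `parse_upgrade_state` and the assumption that the stored string equals the enumerator name are hypothetical.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

// Only the states named in this log; the real enum may contain others.
enum class group0_upgrade_state {
    use_pre_raft_procedures,
    synchronize,
    use_new_procedures,
};

// Hypothetical parser for the value stored under the
// `group0_upgrade_state` key. An absent key means the upgrade
// has not started yet.
inline group0_upgrade_state parse_upgrade_state(
        const std::optional<std::string>& stored) {
    if (!stored) {
        return group0_upgrade_state::use_pre_raft_procedures;
    }
    if (*stored == "use_pre_raft_procedures") {
        return group0_upgrade_state::use_pre_raft_procedures;
    }
    if (*stored == "synchronize") {
        return group0_upgrade_state::synchronize;
    }
    if (*stored == "use_new_procedures") {
        return group0_upgrade_state::use_new_procedures;
    }
    throw std::invalid_argument("unknown group0_upgrade_state: " + *stored);
}
```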
2022-08-19 19:15:19 +02:00
Benny Halevy
d295d8e280 everywhere: define locator::host_id as a strong tagged_uuid type
So it can be distinguished from other uuid-based
identifiers in the system.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11276
2022-08-12 06:01:44 +03:00
Botond Dénes
2656968db2 service/storage_proxy: propagate last position on digest reads
We want to transmit the last position as determined by the replica on
both result and digest reads. Result reads already do that via the
query::result, but digest reads don't yet as they don't return the full
query::result structure, just the digest field from it. Add the last
position to the digest read's return value and collect these in the
digest resolver, along with the returned digests.
2022-08-10 06:03:37 +03:00
Botond Dénes
d1d53f1b84 query: add tombstone-limit to read-command
Propagate the tombstone-limit from the coordinator to the replicas, to make
sure they all use the same limit.
2022-08-10 06:01:47 +03:00
Benny Halevy
c71ef330b2 query-request, everywhere: define and use query_id as a strong type
Define query_id as a tagged_uuid
So it can be differentiated from other uuid-class types.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:13:28 +03:00
Benny Halevy
2b017ce285 schema, everywhere: define and use table_schema_version as a strong type
Define table_schema_version as a distinct tagged_uuid class,
So it can be differentiated from other uuid-class types,
in particular table_id.

Added reversed(table_schema_version) for convenience
and uniformity since the same logic is currently open coded
in several places.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:09:45 +03:00
Benny Halevy
257d74bb34 schema, everywhere: define and use table_id as a strong type
Define table_id as a distinct utils::tagged_uuid modeled after raft
tagged_id, so it can be differentiated from other uuid-class types,
in particular from table_schema_version.
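
The strong-type idea can be shown with a minimal sketch. `raw_uuid` is a stand-in for `utils::UUID`, and the tag struct names are hypothetical; the real `utils::tagged_uuid` has a richer interface.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <type_traits>

// Stand-in for utils::UUID (assumption: 16 raw bytes are enough here).
struct raw_uuid {
    std::array<std::uint8_t, 16> bytes{};
};
inline bool operator==(const raw_uuid& a, const raw_uuid& b) {
    return a.bytes == b.bytes;
}

// A UUID wrapper whose type depends on an otherwise unused Tag.
template <typename Tag>
class tagged_uuid {
    raw_uuid _id;
public:
    tagged_uuid() = default;
    explicit tagged_uuid(raw_uuid id) : _id(id) {}
    const raw_uuid& uuid() const { return _id; }
    friend bool operator==(const tagged_uuid& a, const tagged_uuid& b) {
        return a._id == b._id;
    }
};

// Hypothetical tag names; each tag yields a distinct, incompatible type.
struct table_id_tag {};
struct table_schema_version_tag {};
using table_id = tagged_uuid<table_id_tag>;
using table_schema_version = tagged_uuid<table_schema_version_tag>;

// Mixing the two IDs, or passing a bare UUID, is a compile-time error.
static_assert(!std::is_convertible_v<table_id, table_schema_version>);
static_assert(!std::is_convertible_v<raw_uuid, table_id>);
```

Because the tag is never stored, the wrapper has the same size as the underlying UUID; the distinction exists only in the type system.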

Fixes #11207

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:09:41 +03:00
Benny Halevy
8235cfdf7a utils: tagged_uuid: rename to_uuid() to uuid()
To make it more generic, similar to other uuid() get
methods we have.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
1fda686f96 idl: make idl headers self-sufficient
Add include statements to satisfy dependencies.

Delete, now unneeded, include directives from the upper level
source files.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
da4f0aae37 idl-compiler: add include statements
For generating #include directives in the generated files,
so we don't have to hand-craft the includes of the dependencies
in the right order.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
4f275a17b4 idl_test: add a struct depending on UUID
For testing the next change which adds
import and include statements to the idl language.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Botond Dénes
014c5b56a3 query-result: move last_pos up to query::result
query_result was the wrong place to put last position into. It is only
included in data-responses, but not on digest-responses. If we want to
support empty pages from replicas, both data and digest responses have
to include the last position. So hoist up the last position to the
parent structure: query::result. This is a breaking change inter-node
ABI wise, but it is fine: the current code wasn't released yet.

Closes #11072
2022-07-20 13:28:09 +03:00
Tomasz Grabiec
04f9a150be Merge 'raft: split can_vote field from server_address to separate struct' from Kamil Braun
Whether a server can vote in a Raft configuration is not part of the
address. `server_address` was used in many contexts where `can_vote` is
irrelevant.

Split the struct: `server_address` now contains only `id` and
`server_info` as it did before `can_vote` was introduced. Instead we
have a `config_member` struct that contains a `server_address` and the
`can_vote` field.

Also remove an "unsafe" constructor from `server_address` where `id` was
provided but `server_info` was not. The constructor was used for tests
where `server_info` is irrelevant, but it's important not to forget
about the info in production code. Replace the constructor with helper
functions which specify in comments that they are supposed to be used in
tests or in contexts where `info` doesn't matter (e.g. when checking
presence in an `unordered_set`, where the equality operator and hash
operate only on the `id`).

Closes #11047

* github.com:scylladb/scylla:
  raft: fsm: fix `entry_size` calculation for config entries
  raft: split `can_vote` field from `server_address` to separate struct
  serializer_impl: generalize (de)serialization of `unordered_set`
  to_string: generalize `operator<<` for `unordered_set`
2022-07-20 12:20:52 +02:00
Kamil Braun
daf9c53bb8 raft: split can_vote field from server_address to separate struct
Whether a server can vote in a Raft configuration is not part of the
address. `server_address` was used in many contexts where `can_vote` is
irrelevant.

Split the struct: `server_address` now contains only `id` and
`server_info` as it did before `can_vote` was introduced. Instead we
have a `config_member` struct that contains a `server_address` and the
`can_vote` field.

Also remove an "unsafe" constructor from `server_address` where `id` was
provided but `server_info` was not. The constructor was used for tests
where `server_info` is irrelevant, but it's important not to forget
about the info in production code. The constructor was used for two
purposes:
- Invoking set operations such as `contains`. To solve this we use C++20
  transparent hash and comparator functions, which allow invoking
  `contains` and similar functions by providing a different key type (in
  this case `raft::server_id` in set of addresses, for example).
- constructing addresses without `info`s in tests. For this we provide
  helper functions in the test helpers module and use them.
2022-07-18 18:22:10 +02:00
Jadw1
29a0be75da forward_service: support UDA and native aggregate parallelization
Enables parallelization of UDA and native aggregates. The way the
query is parallelized is the same as in #9209. Separate reduction
type for `COUNT(*)` is left for compatibility reasons.
2022-07-18 15:25:41 +02:00
Avi Kivity
973d2a58d0 Merge 'docs: move docs to docs/dev folder' from David Garcia
In order to allow our Scylla OSS customers the ability to select a version for their documentation, we are migrating the Scylla docs content to the Scylla OSS repository. This PR covers the following points of the [Migration Plan](https://docs.google.com/document/d/15yBf39j15hgUVvjeuGR4MCbYeArqZrO1ir-z_1Urc6A/edit#):

1. Creates a subdirectory for dev docs: /docs/dev
2. Moves the existing dev doc content in the scylla repo to /docs/dev, but keep Alternator docs in /docs.
3. Flattens the structure in /docs/dev (remove the subfolders).
4. Adds redirects from `scylla.docs.scylladb.com/<version>/<document>` to `https://github.com/scylladb/scylla/blob/master/docs/dev/<document>.md`
5. Excludes publishing docs for /docs/devs.

1. Enter the docs folder with `cd docs`.
2. Run `make redirects`.
3. Enter the docs folder and run `make preview`. The docs should build without warnings.
4. Open http://127.0.0.1:5500 in your browser. You should only see the alternator docs.
5. Open http://127.0.0.1:5500/stable/design-notes/IDL.html in your browser. It should redirect you to https://github.com/scylladb/scylla/blob/master/docs/dev/IDL.md and raise a 404 error since this PR is not merged yet.
6. Surf the `docs/dev` folder. It should have all the scylla project internal docs without subdirectories.

Closes #10873

* github.com:scylladb/scylla:
  Update docs/conf.py
  Update docs/dev/protocols.md
  Update docs/dev/README.md
  Update docs/dev/README.md
  Update docs/conf.py
  Fix broken links
  Remove source folder
  Add redirections
  Move dev docs to docs/dev
2022-07-03 20:37:11 +03:00
Pavel Emelyanov
85033ea6ae Merge 'A bunch of refactors related to Raft group 0' from Kamil Braun
The commits here were extracted from PR https://github.com/scylladb/scylla/pull/10835 which implements upgrade procedure for Raft group 0.

They are mostly refactors which don't affect the behavior of the system, except one: the commit 4d439a16b3 causes all schema changes to be bounced to shard 0. Previously, they would only be bounced when the local Raft feature was enabled. I do that because:
1. eventually, we want this to be the default behavior
2. in the upgrade PR I remove the `is_raft_enabled()` function - the function was basically created with the mindset "Raft is either enabled or not" - which was right when we didn't support upgrade, but will be incorrect when we introduce intermediate states (when we upgrade from non-raft-based to raft-based operations); the upgrade PR introduces another mechanism to dispatch based on the upgrade state, but for the case of bouncing to shard 0, dispatching is simply not necessary.

Closes #10864

* github.com:scylladb/scylla:
  service/raft: raft_group_registry: add assertions when fetching servers for groups
  service/raft: raft_group_registry: remove `_raft_support_listener`
  service/raft: raft_group0: log adding/removing servers to/from group 0 RPC map
  service/raft: raft_group0: move group 0 RPC handlers from `storage_service`
  service/raft: messaging: extract raft_addr/inet_addr conversion functions
  service: storage_service: initialize `raft_group0` in `main` and pass a reference to `join_cluster`
  treewide: remove unnecessary `migration_manager::is_raft_enabled()` calls
  test/boost: memtable_test: perform schema operations on shard 0
  test/boost: cdc_test: remove test_cdc_across_shards
  message: rename `send_message_abortable` to `send_message_cancellable`
  message: change parameter order in `send_message_oneway_timeout`
2022-06-29 16:51:54 +03:00
David Garcia
8e7ebea335 Merge remote-tracking branch 'upstream/master' into move-dev-docs
2022-06-28 11:02:38 +01:00
Avi Kivity
3131cbea62 Merge 'query: allow replica to provide arbitrary continue position' from Botond Dénes
Currently, we use the last row in the query result set as the position where the query is continued from on the next page. Since only live rows make it into query result set, this mandates the query to be stopped on a live row on the replica, lest any dead rows or tombstones processed after the live rows, would have to be re-processed on the next page (and the saved reader would have to be thrown away due to position mismatch). This requirement of having to stop on a live row is problematic with datasets which have lots of dead rows or tombstones, especially if these form a prefix. In the extreme case, a query can time out before it can process a single live row and the data-set becomes effectively unreadable until compaction gets rid of the tombstones.
This series prepares the way for the solution: it allows the replica to determine what position the query should continue from on the next page. This position can be that of a dead row, if the query stopped on a dead row. For now, the replica supplies the same position that would have been obtained by looking at the last row in the result set; this series merely introduces the infrastructure for transferring a position together with the query result, and it prepares the paging logic to make use of this position. If the coordinator is not prepared for the new field, it will simply fall back to the old way of looking at the last row in the result set. As I said, for now this is still the same as the content of the new field, so there is no problem in mixed clusters.

Refs: https://github.com/scylladb/scylla/issues/3672
Refs: https://github.com/scylladb/scylla/issues/7689
Refs: https://github.com/scylladb/scylla/issues/7933

Tests: manual upgrade test.
I wrote a data set with:
```
./scylla-bench -mode=write -workload=sequential -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -clustering-row-size=8096 -partition-count=1000
```
This creates large, 80MB partitions, which should fill many pages if read in full. Then I started a read workload:
```
./scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100
```
I confirmed that paging is happening as expected, then upgraded the nodes one-by-one to this PR (while the read-load was ongoing). I observed no read errors or any other errors in the logs.

Closes #10829

* github.com:scylladb/scylla:
  query: have replica provide the last position
  idl/query: add last_position to query_result
  mutlishard_mutation_query: propagate compaction state to result builder
  multishard_mutation_query: defer creating result builder until needed
  querier: use full_position instead of ad-hoc struct
  querier: rely on compactor for position tracking
  mutation_compactor: add current_full_position() convenience accessor
  mutation_compactor: s/_last_clustering_pos/_last_pos/
  mutation_compactor: add state accessor to compact_mutation
  introduce full_position
  idl: move position_in_partition into own header
  service/paging: use position_in_partition instead of clustering_key for last row
  alternator/serialization: extract value object parsing logic
  service/pagers/query_pagers.cc: fix indentation
  position_in_partition: add to_string(partition_region) and parse_partition_region()
  mutation_fragment.hh: move operator<<(partition_region) to position_in_partition.hh
2022-06-27 12:23:21 +03:00
David Garcia
bb21c3c869 Move dev docs to docs/dev
2022-06-24 18:07:08 +01:00
Kamil Braun
8e907cbf57 service/raft: raft_group0: move group 0 RPC handlers from storage_service
And generate the boilerplate from IDL declarations.
Simplifies the code, and the code now resides where it belongs.
2022-06-23 16:14:41 +02:00
Botond Dénes
009d2fe2f7 idl/query: add last_position to query_result
To be used to allow the replica to specify the last position in the
stream, where the query was left off. Currently this is always
the same as the implicit position -- the last row in the result-set --
but this requires only stopping the read on a live row, which is a
requirement we want to lift: we want to be able to stop on a tombstone.
As tombstones are not included in the query result, we have to allow the
replica to overwrite the last seen position explicitly.
This patch introduces the new field in the query-result IDL but it is
not written to yet, nor is it read, that is left for the next patches.
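
How a reader of the result might choose the continue position is sketched below. It assumes the new field is optional and that an absent value means "use the last live row". The integer `position`, the simplified `query_result` and `continue_from` are hypothetical stand-ins, not the IDL definitions.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-in: a position is just an integer here.
using position = int;

// Simplified result: live rows plus the optional explicit position.
struct query_result {
    std::vector<position> live_rows;        // rows returned to the client
    std::optional<position> last_position;  // the new field (assumed optional)
};

// Where should the next page start? Prefer the position supplied by the
// replica; otherwise fall back to the last live row, if there is one.
inline std::optional<position> continue_from(const query_result& r) {
    if (r.last_position) {
        return r.last_position;
    }
    if (!r.live_rows.empty()) {
        return r.live_rows.back();
    }
    return std::nullopt;
}
```

An explicit position lets a page with no live rows still make progress, which is what stopping on a tombstone requires.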
2022-06-23 13:36:24 +03:00
Botond Dénes
119be5d5db idl: move position_in_partition into own header
So it can be used without pulling in all of partition_checksum.idl.hh.
2022-06-23 13:36:24 +03:00
Botond Dénes
2b0bc11f2e service/paging: use position_in_partition instead of clustering_key for last row
The former allows for expressing more positions, like a position
before/after a clustering key. This practically enables the coordinator
side paging logic, for a query to be stopped at a tombstone (which can
have said positions).
2022-06-23 13:36:20 +03:00
Piotr Dulikowski
d3d9add219 storage_proxy: add per partition rate limit info to read RPC
Now, the read RPC accepts the per partition rate limit info parameter. It
is passed on to query_result_local(_digest) methods.
2022-06-22 20:16:49 +02:00
Piotr Dulikowski
02469e0b15 storage_proxy: add per partition rate limit info to write RPC
Adds db::per_partition_rate_limit::info parameter to the write RPC. The
rate limit info controls the behavior of the rate limiter on the
replica.
2022-06-22 20:16:48 +02:00
Piotr Dulikowski
0fe8b55427 db: add rate_limiter
Introduces the rate_limiter, a replica-side data structure meant for
tracking the frequency with which each partition is being accessed
(separately for reads and writes) and deciding whether the request
should be accepted and processed further or rejected.

The limiter is implemented as a statically allocated hashmap which keeps
track of the frequency with which partitions are accessed. Its entries
are incremented when an operation is admitted and are decayed
exponentially over time.

If a partition is detected to be accessed more than its limit allows,
requests are rejected with a probability calculated in such a way that,
on average, the number of accepted requests is kept at the limit.

The structure currently weighs a bit above 1MB and each shard is meant
to keep a separate instance. All operations are O(1), including the
periodic timer.
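
The probabilistic rejection can be illustrated with a small sketch. The formulas are an assumption chosen to satisfy the stated property (expected accepted rate equals the limit); the actual rate_limiter may compute the probability and the decay differently, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Probability of accepting a request so that, on average, accepted
// requests stay at `limit` when the partition is hit at rate `observed`.
// Assumed formula: observed * p == limit, hence p = limit / observed.
inline double accept_probability(double observed, double limit) {
    if (observed <= limit) {
        return 1.0;  // under the limit: accept everything
    }
    return limit / observed;
}

// Exponential decay of a counter: it halves every `half_life` time units.
inline double decayed(double counter, double elapsed, double half_life) {
    return counter * std::exp2(-elapsed / half_life);
}
```

For example, a partition hit at twice its limit would have about half of its requests accepted, so the accepted rate settles at the limit.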
2022-06-22 20:16:48 +02:00
Piotr Dulikowski
2162bb9f3b storage_proxy: propagate rate_limit_exception through read RPC
This commit modifies the read RPC and the storage_proxy logic so that
the coordinator knows whether a read operation failed due to rate limit
being exceeded, and returns `exceptions::rate_limit_exception` if that
happens.
2022-06-22 20:16:48 +02:00
Piotr Dulikowski
51546b0609 storage_proxy: pass rate_limit_exception through write RPC
This commit modifies the storage_proxy logic so that the coordinator
knows whether a write operation failed due to rate limit being exceeded,
and returns `exceptions::rate_limit_exception` when that happens.
2022-06-22 20:16:48 +02:00