scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-25 19:10:42 +00:00

Author	SHA1	Message	Date
Botond Dénes	014c5b56a3	query-result: move last_pos up to query::result query_result was the wrong place to put last position into. It is only included in data-responses, but not on digest-responses. If we want to support empty pages from replicas, both data and digest responses have to include the last position. So hoist up the last position to the parent structure: query::result. This is a breaking change inter-node ABI wise, but it is fine: the current code wasn't released yet. Closes #11072	2022-07-20 13:28:09 +03:00
Tomasz Grabiec	04f9a150be	Merge 'raft: split `can_vote` field from `server_address` to separate struct' from Kamil Braun Whether a server can vote in a Raft configuration is not part of the address. `server_address` was used in many contexts where `can_vote` is irrelevant. Split the struct: `server_address` now contains only `id` and `server_info` as it did before `can_vote` was introduced. Instead we have a `config_member` struct that contains a `server_address` and the `can_vote` field. Also remove an "unsafe" constructor from `server_address` where `id` was provided but `server_info` was not. The constructor was used for tests where `server_info` is irrelevant, but it's important not to forget about the info in production code. Replace the constructor with helper functions which specify in comments that they are supposed to be used in tests or in contexts where `info` doesn't matter (e.g. when checking presence in an `unordered_set`, where the equality operator and hash operate only on the `id`). Closes #11047 * github.com:scylladb/scylla: raft: fsm: fix `entry_size` calculation for config entries raft: split `can_vote` field from `server_address` to separate struct serializer_impl: generalize (de)serialization of `unordered_set` to_string: generalize `operator<<` for `unordered_set`	2022-07-20 12:20:52 +02:00
Kamil Braun	daf9c53bb8	raft: split `can_vote` field from `server_address` to separate struct Whether a server can vote in a Raft configuration is not part of the address. `server_address` was used in many contexts where `can_vote` is irrelevant. Split the struct: `server_address` now contains only `id` and `server_info` as it did before `can_vote` was introduced. Instead we have a `config_member` struct that contains a `server_address` and the `can_vote` field. Also remove an "unsafe" constructor from `server_address` where `id` was provided but `server_info` was not. The constructor was used for tests where `server_info` is irrelevant, but it's important not to forget about the info in production code. The constructor was used for two purposes: - Invoking set operations such as `contains`. To solve this we use C++20 transparent hash and comparator functions, which allow invoking `contains` and similar functions by providing a different key type (in this case `raft::server_id` in set of addresses, for example). - constructing addresses without `info`s in tests. For this we provide helper functions in the test helpers module and use them.	2022-07-18 18:22:10 +02:00
Jadw1	29a0be75da	forward_service: support UDA and native aggregate parallelization Enables parallelization of UDA and native aggregates. The way the query is parallelized is the same as in #9209. Separate reduction type for `COUNT(*)` is left for compatibility reason.	2022-07-18 15:25:41 +02:00
Avi Kivity	973d2a58d0	Merge 'docs: move docs to docs/dev folder' from David Garcia In order to allow our Scylla OSS customers the ability to select a version for their documentation, we are migrating the Scylla docs content to the Scylla OSS repository. This PR covers the following points of the [Migration Plan](https://docs.google.com/document/d/15yBf39j15hgUVvjeuGR4MCbYeArqZrO1ir-z_1Urc6A/edit#): 1. Creates a subdirectory for dev docs: /docs/dev 2. Moves the existing dev doc content in the scylla repo to /docs/dev, but keeps Alternator docs in /docs. 3. Flattens the structure in /docs/dev (remove the subfolders). 4. Adds redirects from `scylla.docs.scylladb.com/<version>/<document>` to `https://github.com/scylladb/scylla/blob/master/docs/dev/<document>.md` 5. Excludes publishing docs for /docs/dev. 1. Enter the docs folder with `cd docs`. 2. Run `make redirects`. 3. Enter the docs folder and run `make preview`. The docs should build without warnings. 4. Open http://127.0.0.1:5500 in your browser. You should only see the alternator docs. 5. Open http://127.0.0.1:5500/stable/design-notes/IDL.html in your browser. It should redirect you to https://github.com/scylladb/scylla/blob/master/docs/dev/IDL.md and raise a 404 error since this PR is not merged yet. 6. Surf the `docs/dev` folder. It should have all the scylla project internal docs without subdirectories. Closes #10873 * github.com:scylladb/scylla: Update docs/conf.py Update docs/dev/protocols.md Update docs/dev/README.md Update docs/dev/README.md Update docs/conf.py Fix broken links Remove source folder Add redirections Move dev docs to docs/dev	2022-07-03 20:37:11 +03:00
Pavel Emelyanov	85033ea6ae	Merge 'A bunch of refactors related to Raft group 0' from Kamil Braun The commits here were extracted from PR https://github.com/scylladb/scylla/pull/10835 which implements upgrade procedure for Raft group 0. They are mostly refactors which don't affect the behavior of the system, except one: the commit `4d439a16b3` causes all schema changes to be bounced to shard 0. Previously, they would only be bounced when the local Raft feature was enabled. I do that because: 1. eventually, we want this to be the default behavior 2. in the upgrade PR I remove the `is_raft_enabled()` function - the function was basically created with the mindset "Raft is either enabled or not" - which was right when we didn't support upgrade, but will be incorrect when we introduce intermediate states (when we upgrade from non-raft-based to raft-based operations); the upgrade PR introduces another mechanism to dispatch based on the upgrade state, but for the case of bouncing to shard 0, dispatching is simply not necessary. Closes #10864 * github.com:scylladb/scylla: service/raft: raft_group_registry: add assertions when fetching servers for groups service/raft: raft_group_registry: remove `_raft_support_listener` service/raft: raft_group0: log adding/removing servers to/from group 0 RPC map service/raft: raft_group0: move group 0 RPC handlers from `storage_service` service/raft: messaging: extract raft_addr/inet_addr conversion functions service: storage_service: initialize `raft_group0` in `main` and pass a reference to `join_cluster` treewide: remove unnecessary `migration_manager::is_raft_enabled()` calls test/boost: memtable_test: perform schema operations on shard 0 test/boost: cdc_test: remove test_cdc_across_shards message: rename `send_message_abortable` to `send_message_cancellable` message: change parameter order in `send_message_oneway_timeout`	2022-06-29 16:51:54 +03:00
David Garcia	8e7ebea335	Merge remote-tracking branch 'upstream/master' into move-dev-docs	2022-06-28 11:02:38 +01:00
Avi Kivity	3131cbea62	Merge 'query: allow replica to provide arbitrary continue position' from Botond Dénes Currently, we use the last row in the query result set as the position where the query is continued from on the next page. Since only live rows make it into query result set, this mandates the query to be stopped on a live row on the replica, lest any dead rows or tombstones processed after the live rows, would have to be re-processed on the next page (and the saved reader would have to be thrown away due to position mismatch). This requirement of having to stop on a live row is problematic with datasets which have lots of dead rows or tombstones, especially if these form a prefix. In the extreme case, a query can time out before it can process a single live row and the data-set becomes effectively unreadable until compaction gets rid of the tombstones. This series prepares the way for the solution: it allows the replica to determine what position the query should continue from on the next page. This position can be that of a dead row, if the query stopped on a dead row. For now, the replica supplies the same position that would have been obtained with looking at the last row in the result set, this series merely introduces the infrastructure for transferring a position together with the query result, and it prepares the paging logic to make use of this position. If the coordinator is not prepared for the new field, it will simply fall-back to the old way of looking at the last row in the result set. As I said for now this is still the same as the content of the new field so there is no problem in mixed clusters. Refs: https://github.com/scylladb/scylla/issues/3672 Refs: https://github.com/scylladb/scylla/issues/7689 Refs: https://github.com/scylladb/scylla/issues/7933 Tests: manual upgrade test. 
I wrote a data set with: ``` ./scylla-bench -mode=write -workload=sequential -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -clustering-row-size=8096 -partition-count=1000 ``` This creates large, 80MB partitions, which should fill many pages if read in full. Then I started a read workload: ``` ./scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100 ``` I confirmed that paging is happening as expected, then upgraded the nodes one-by-one to this PR (while the read-load was ongoing). I observed no read errors or any other errors in the logs. Closes #10829 * github.com:scylladb/scylla: query: have replica provide the last position idl/query: add last_position to query_result mutlishard_mutation_query: propagate compaction state to result builder multishard_mutation_query: defer creating result builder until needed querier: use full_position instead of ad-hoc struct querier: rely on compactor for position tracking mutation_compactor: add current_full_position() convenience accessor mutation_compactor: s/_last_clustering_pos/_last_pos/ mutation_compactor: add state accessor to compact_mutation introduce full_position idl: move position_in_partition into own header service/paging: use position_in_partition instead of clustering_key for last row alternator/serialization: extract value object parsing logic service/pagers/query_pagers.cc: fix indentation position_in_partition: add to_string(partition_region) and parse_partition_region() mutation_fragment.hh: move operator<<(partition_region) to position_in_partition.hh	2022-06-27 12:23:21 +03:00
David Garcia	bb21c3c869	Move dev docs to docs/dev	2022-06-24 18:07:08 +01:00
Kamil Braun	8e907cbf57	service/raft: raft_group0: move group 0 RPC handlers from `storage_service` And generate the boilerplate from IDL declarations. Simplifies the code, and the code now resides where it belongs.	2022-06-23 16:14:41 +02:00
Botond Dénes	009d2fe2f7	idl/query: add last_position to query_result To be used to allow the replica to specify the last position in the stream, where the query was left off. Currently this is always the same as the implicit position -- the last row in the result-set -- but this requires only stopping the read on a live row, which is a requirement we want to lift: we want to be able to stop on a tombstone. As tombstones are not included in the query result, we have to allow the replica to overwrite the last seen position explicitly. This patch introduces the new field in the query-result IDL but it is not written to yet, nor is it read, that is left for the next patches.	2022-06-23 13:36:24 +03:00
Botond Dénes	119be5d5db	idl: move position_in_partition into own header So it can be used without pulling in all of partition_checksum.idl.hh.	2022-06-23 13:36:24 +03:00
Botond Dénes	2b0bc11f2e	service/paging: use position_in_partition instead of clustering_key for last row The former allows for expressing more positions, like a position before/after a clustering key. This practically enables the coordinator side paging logic, for a query to be stopped at a tombstone (which can have said positions).	2022-06-23 13:36:20 +03:00
Piotr Dulikowski	d3d9add219	storage_proxy: add per partition rate limit info to read RPC Now, the read RPC accepts the per partition rate limit info parameter. It is passed on to query_result_local(_digest) methods.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	02469e0b15	storage_proxy: add per partition rate limit info to write RPC Adds db::per_partition_rate_limit::info parameter to the write RPC. The rate limit info controls the behavior of the rate limiter on the replica.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	0fe8b55427	db: add rate_limiter Introduces the rate_limiter, a replica-side data structure meant for tracking the frequency with which each partition is being accessed (separately for reads and writes) and deciding whether the request should be accepted and processed further or rejected. The limiter is implemented as a statically allocated hashmap which keeps track of the frequency with which partitions are accessed. Its entries are incremented when an operation is admitted and are decayed exponentially over time. If a partition is detected to be accessed more than its limit allows, requests are rejected with a probability calculated in such a way that, on average, the number of accepted requests is kept at the limit. The structure currently weighs a bit above 1MB and each shard is meant to keep a separate instance. All operations are O(1), including the periodic timer.	2022-06-22 20:16:48 +02:00
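
The admission rule described in `0fe8b55427` can be illustrated with a minimal Python sketch. It is not the ScyllaDB implementation: the class name, the `decay` factor, the injectable `rng`, and the acceptance formula `limit / count` are all assumptions made for illustration, and the sketch uses a plain dict with an O(n) decay pass rather than the statically allocated, O(1) structure the commit describes.

```python
import random


class RateLimiterSketch:
    """Toy per-partition limiter: counters grow on admission and decay over time."""

    def __init__(self, decay=0.5, rng=random.random):
        self.decay = decay    # hypothetical factor applied on every timer tick
        self.rng = rng        # injectable random source, so behaviour is testable
        self.counters = {}    # partition key -> decayed count of admitted operations

    def tick(self):
        # Periodic timer: decay every counter exponentially.
        # (O(n) here; the commit states the real timer is O(1).)
        for key in self.counters:
            self.counters[key] *= self.decay

    def admit(self, key, limit):
        count = self.counters.get(key, 0.0)
        # Below the limit every request passes. Above it, a request passes
        # with probability limit / count (an assumed formula), so the expected
        # number of accepted requests stays near the limit.
        accept_probability = 1.0 if count < limit else limit / count
        if self.rng() < accept_probability:
            # Only admitted operations increment the counter.
            self.counters[key] = count + 1.0
            return True
        return False
```

A replica would keep one such object per shard and call `admit()` before processing each read or write, returning a rate-limit error to the coordinator when it yields `False`.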
Piotr Dulikowski	2162bb9f3b	storage_proxy: propagate rate_limit_exception through read RPC This commit modifies the read RPC and the storage_proxy logic so that the coordinator knows whether a read operation failed due to rate limit being exceeded, and returns `exceptions::rate_limit_exception` if that happens.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	51546b0609	storage_proxy: pass rate_limit_exception through write RPC This commit modifies the storage_proxy logic so that the coordinator knows whether a write operation failed due to rate limit being exceeded, and returns `exceptions::rate_limit_exception` when that happens.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	621b7f35e2	replica: add rate_limit_exception and a simple serialization framework Introduces `replica::rate_limit_exception` - an exception that is supposed to be thrown/returned on the replica side when the request is rejected due to exceeding the per-partition rate limit. Additionally, introduces the `exception_variant` type which allows transporting the new exception over RPC while preserving the type information. This will be useful in later commits, as the coordinator will have to know whether a replica has failed due to rate limit being exceeded or another kind of error. The `exception_variant` currently can only either hold "other exception" (std::monostate) or the aforementioned `rate_limit_exception`, but can be extended in a backwards-compatible way in the future to be able to hold more exceptions that need to be handled in a different way.	2022-06-22 20:07:58 +02:00
Juliusz Stasiewicz	00a6fda7b9	tracing: Trace slow queries on replicas wrt. parent's clock Secondary tracing sessions used to compute the execution time from the point of their `begin()`-ning, not the parent session's `begin()`. As a result, replica reported a slow query if it exceeded the entire threshold on that replica too. This change augments `trace_info` with the TS of parent's session starting point, to be used as a reference on replicas. Fixes #9403 Closes #10005	2022-02-10 12:03:53 +01:00
Michał Sala	aec96be553	forward_service: add tracing	2022-02-01 21:14:41 +01:00
Michał Sala	fff454761a	messaging_service: add verb for count() request forwarding Except for the verb addition, this commit also defines forward_request and forward_result structures, used as an argument and result of the new rpc. forward_request is used to forward information about select statement that does count() (or other aggregating functions such as max, min, avg in the future). Due to the inability to serialize cql3::statements::select_statement, I chose to include query::read_command, dht::partition_range_vector and some configuration options in forward_request. They can be serialized and are sufficient enough to allow creation of service::pager::query_pagers::pager.	2022-02-01 21:14:41 +01:00
Kamil Braun	f3c0c73d36	idl: group0_state_machine: fix license blurb	2022-01-25 17:48:46 +01:00
Kamil Braun	509ac2130f	service: raft: group0_state_machine: introduce `group0_command` Objects of this type will be serialized and sent as commands to the group 0 state machine. They contain a set of mutations which modify group 0 tables (at this point: schema tables and group 0 history table), the 'previous state ID' which is the last state ID present in the history table when the operation described by this command has started, and the 'new state ID' which will be appended to the history table if this change is successful (successful = the previous state ID is still equal to the last state ID in the history table at the moment of application). It also contains the address of the node which constructed this command. The state ID mechanism will be described in more detail in a later commit.	2022-01-24 15:20:37 +01:00
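
The state-ID check described in `509ac2130f` amounts to an optimistic compare-and-append. The Python sketch below illustrates only that idea; the class and field names are hypothetical, mutations are represented as plain strings, and none of it reflects the actual IDL layout of `group0_command`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Group0CommandSketch:
    mutations: List[str]           # stand-in for mutations of group 0 tables
    prev_state_id: Optional[str]   # last state ID seen when the operation started
    new_state_id: str              # appended to the history on success
    creator_addr: str              # address of the node that built the command


@dataclass
class Group0StateSketch:
    history: List[str] = field(default_factory=list)   # group 0 history table
    applied: List[str] = field(default_factory=list)   # mutations applied so far

    def apply(self, cmd: Group0CommandSketch) -> bool:
        last = self.history[-1] if self.history else None
        # The command succeeds only if nobody changed the state in between,
        # i.e. the previous state ID still equals the last ID in the history.
        if cmd.prev_state_id != last:
            return False
        self.applied.extend(cmd.mutations)
        self.history.append(cmd.new_state_id)
        return True
```

Two commands built concurrently against the same previous state ID therefore cannot both apply: whichever is applied first advances the history, and the other is rejected.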
Kamil Braun	cc0c54ea15	service: migration_manager: allow using MIGRATION_REQUEST verb to fetch group 0 history table The MIGRATION_REQUEST verb is currently used to pull the contents of schema tables (in the form of mutations) when nodes synchronize schemas. We will (ab)use the verb to fetch additional data, such as the contents of the group 0 history table, for purposes of group 0 snapshot transfer. We extend `schema_pull_options` with a flag specifying that the puller requests the additional data associated with group 0 snapshots. This flag is `false` by default, so existing schema pulls will do what they did before. If the flag is `true`, the migration request handler will include the contents of group 0 history table. Note that if a request is sent with the flag set to `true`, that means the entire cluster must have enabled the Raft feature, which also means that the handler knows of the flag.	2022-01-24 15:20:37 +01:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Gleb Natapov	c500a90902	raft service: make one way raft messages truly one way Raft core does not expect replies for most messages it sends, but they are defined as two way by the IDL currently. Fix them to be one way.	2022-01-13 13:14:46 +02:00
Gleb Natapov	b1fea20d36	raft: move raft verbs to the IDL	2022-01-13 13:14:46 +02:00
Gleb Natapov	8a25b740df	raft: split idl to rpc and storage Storage uses only small part of the IDL, so it can include only the part that is relevant to it.	2022-01-13 13:14:46 +02:00
Gleb Natapov	c5474f9ac2	raft: simplify raft idl definitions We may use high level types in the IDL.	2022-01-13 13:14:46 +02:00
Nadav Har'El	23e93a26b3	Merge 'Alternator: stream results + chunk results to remove large allocations' from Calle Wilund Refs: #9555 When running the "Kraken" dynamodb streams test to provoke the issue observed by QA, I noticed on my setup mainly two things: Large allocation stalls (+ warnings) and timeouts on read semaphores in DB. This tries to address the first issue, partly by making query_result_view serialization using chunked vector instead of linear one, and by introducing a streaming option for json return objects, avoiding linearizing to string before wire. Note that the latter has some overhead issues of its own, mainly data copying, since we essentially will be triple buffering (local, wrapped http stream, and final output stream). Still, normal string output will typically do a lot of realloc which is potential extra copies as well, so... This is not really performance tested, but with these tweaks I no longer get large alloc stalls at least, so that is a plus. :-) Closes #9713 * github.com:scylladb/scylla: alternator::executor: Use streamed result for scan etc if large result alternator::streams: Use streamed result in get_records if large result executor/server: Add routine to make stream object return rjson: Add print to stream of rjson::value query_idl: Make qr_partition::rows/query_result::partitions chunked	2022-01-12 15:53:31 +02:00
Calle Wilund	706f20442b	query_idl: Make qr_partition::rows/query_result::partitions chunked When doing potentially large (internal) queries, i.e. alternator streams, we can cause large allocations here.	2022-01-11 13:52:40 +00:00
Gleb Natapov	1db151bd75	storage_proxy: move all verbs to the IDL Define all verbs in the IDL instead of manually coding them.	2022-01-10 14:58:28 +02:00
Asias He	a8ad385ecd	repair: Get rid of the gc_grace_seconds The gc_grace_seconds is a very fragile and broken design inherited from Cassandra. Deleted data can be resurrected if cluster wide repair is not performed within gc_grace_seconds. This design pushes the job of keeping the database consistent onto the user. In practice, it is very hard to guarantee repair is performed within gc_grace_seconds all the time. For example, repair workload has the lowest priority in the system which can be slowed down by the higher priority workload, so that there is no guarantee when a repair can finish. A gc_grace_seconds value that used to work might not work after data volume grows in a cluster. Users might want to avoid running repair during a specific period where latency is the top priority for their business. To solve this problem, an automatic mechanism to protect against data resurrection is proposed and implemented. The main idea is to remove the tombstone only after the range that covers the tombstone is repaired. In this patch, a new table option tombstone_gc is added. The option is used to configure tombstone gc mode. For example: 1) GC a tombstone after gc_grace_seconds cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ; This is the default mode. If no tombstone_gc option is specified by the user, the old gc_grace_seconds based gc will be used. 2) Never GC a tombstone cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'}; 3) GC a tombstone immediately cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'}; 4) GC a tombstone after repair cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'}; In addition to the 'mode' option, another option 'propagation_delay_in_seconds' is added. It defines the max time a write could possibly delay before it eventually arrives at a node. A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc option can only be used after the whole cluster supports the new feature. 
A mixed cluster works with no problem. Tests: compaction_test.py, ninja test Fixes #3560 [avi: resolve conflicts vs data_dictionary]	2022-01-04 19:48:14 +02:00
Piotr Jastrzebski	ae2c199bcd	max_result_size: Add page_size field With this new field comes a new member function called get_page_size. This new function will be used by the result_memory_accounter to decide when to cut a page. The behaviour of get_page_size depends on whether page_size field is set. This is distinguished by page size being equal to 0 or not. When page_size is equal to 0 then it's not set and hard_limit will be returned from get_page_size. Otherwise, get_page_size will return page_size field. When read_command is received from an old node, page_size will be equal to 0 and hard_limit will be used to determine the page size. This is consistent with the behaviour on the old nodes. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-12-28 16:37:49 +01:00
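
The fallback rule described in `ae2c199bcd` is small enough to restate as a Python sketch. The class below is a hypothetical stand-in that models only the two fields named in the commit message; it is not the real `max_result_size` type.

```python
from dataclasses import dataclass


@dataclass
class MaxResultSizeSketch:
    hard_limit: int
    page_size: int = 0   # 0 means "not set", e.g. a read_command from an old node

    def get_page_size(self) -> int:
        # An unset page_size falls back to hard_limit, which matches the
        # behaviour of nodes that predate the field.
        return self.hard_limit if self.page_size == 0 else self.page_size
```

Using 0 as the "unset" marker means a command deserialized without the field behaves exactly as it did before the field existed.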
Konstantin Osipov	c22f945f11	raft: (service) manage Raft configuration during topology changes Operations of adding or removing a node to Raft configuration are made idempotent: they do nothing if already done, and they are safe to resume after a failure. However, since topology changes are not transactional, if a bootstrap or removal procedure fails midway, Raft group 0 configuration may go out of sync with topology state as seen by gossip. In future we must change gossip to avoid making any persistent changes to the cluster: all changes to persistent topology state will be done exclusively through Raft Group 0. Specifically, instead of persisting the tokens by advertising them through gossip, the bootstrap will commit a change to a system table using Raft group 0. nodetool will switch from looking at gossip-managed tables to consulting with Raft Group 0 configuration or Raft-managed tables. Once this transformation is done, naturally, adding a node to Raft configuration (perhaps as a non-voting member at first) will become the first persistent change to ring state applied when a node joins; removing a node from the Raft Group 0 configuration will become the last action when removing a node. Until this is done, do our best to avoid a cluster state when a removed node or a node whose addition failed is stuck in Raft configuration, but the node is no longer present in gossip-managed system tables. In other words, keep the gossip the primary source of truth. For this purpose, carefully choose the timing when we join and leave Raft group 0: Join the Raft group 0 only after we've advertised our tokens, so the cluster is aware of this node, it's visible in nodetool status, but before node state jumps to "normal", i.e. before it accepts queries. Since the operation is idempotent, invoke it on each restart. Remove the node from Group 0 before its tokens are removed from gossip-managed system tables. 
This guarantees that if removal from Raft group 0 fails for whatever reason, the node stays in the ring, so nodetool removenode and friends are re-tried. Add tracing.	2021-11-25 12:35:42 +03:00
Konstantin Osipov	e3751068fe	raft: (server) allow adding entries/modify config on a follower Implement an RPC to forward add_entry calls from the follower to leader. Bounce & retry in case of not_a_leader. Do not retry in case of uncertainty - this can lead to adding duplicate entries. The feature is added to core Raft since it's needed by all current clients - both topology and schema changes. When forwarding an entry to a remote leader we may get back a term/index pair that conflicts (has the same index, but is with a higher term) with a local entry we're still waiting on. This can happen, e.g. because there was a leader change and the log was truncated, but we still haven't got the append_entries RPC from the new leader, still haven't truncated the log locally, still haven't aborted all the local waits for truncated entries. Only remove the offending entry from the wait list and abort it. There may be entries labeled with an older term to the right (with higher commit index) of the conflicting entry. However, finding them, would require a linear scan. If we allow it, we may end up doing this linear scan for every conflicting entry during the transition period, which brings us to N^2 complexity of this step. At the same time, as soon as append_entries that commits a higher-term entry with the same index reaches the follower, the waits for the respective truncated entry will be aborted anyway (see notify_waiters() which sets dropped_entry exception), so the scan is unnecessary. Similarly to being able to add entries, allow to modify Raft group configuration on a follower. The implementation works the same way as adding entries - forwards the command to the leader. Now that add_entry() or modify_config never throws not_a_leader, it's more likely to throw timed_out_error, e.g. in case the network is partitioned. Previously it was only possible due to a semaphore wait timeout, and this scenario was not tested. 
Handle timed_out_error on RPC level to let the existing tests (specifically the randomized nemesis test) pass.	2021-11-25 11:50:38 +03:00
Avi Kivity	115d6d8d4c	system_keyspace: prepare forward-declared members In anticipation of making system_keyspace a class instead of a namespace, rename any member that is currently forward-declared, since one can't forward-declare a class member. Each member is taken out of the system_keyspace namespace and gains a system_keyspace prefix. Aliases are added to reduce code churn. The result isn't lovely, but can be adjusted later.	2021-09-13 15:11:26 +03:00
Botond Dénes	502a45ad58	treewide: switch to native reversed format for reverse reads We define the native reverse format as a reversed mutation fragment stream that is identical to one that would be emitted by a table with the same schema but with reversed clustering order. The main difference to the current format is how range tombstones are handled: instead of looking at their start or end bound depending on the order, we always use them as-usual and the reversing reader swaps their bounds to facilitate this. This allows us to treat reversed streams completely transparently: just pass along them a reversed schema and all the reader, compacting and result building code is happily ignorant about the fact that it is a reversed stream.	2021-09-09 15:42:15 +03:00
Gleb Natapov	ce40b01b07	raft: rename snapshot into snapshot_descriptor The snapshot structure does not contain the snapshot itself but only refers to it through its id. Rename it to snapshot_descriptor for clarity.	2021-08-29 12:53:03 +03:00
Gleb Natapov	03a266d73b	raft: make read_barrier work on a follower as well as on a leader This patch implements a Raft extension that allows performing linearisable reads by accessing the local state machine. The extension is described in section 6.4 of the PhD. To sum it up: to perform a read barrier, a follower needs to ask the leader for the last committed index that it knows about. The leader must make sure that it is still a leader before answering by communicating with a quorum. When the follower gets the index back it waits for it to be applied and by that completes the read_barrier invocation. The patch adds three new RPCs: read_barrier, read_barrier_reply and execute_read_barrier_on_leader. The last one is the one a follower uses to ask a leader about the safe index it can read. The first two are used by a leader to communicate with a quorum.	2021-08-25 08:57:13 +03:00
Piotr Dulikowski	e18b29765a	hints: add hint sync point structure Adds a sync_point structure. A sync point is a (possibly incomplete) mapping from hint queues to a replay position in it. Users will be able to create sync points consisting of the last written positions of some hint queues, so then they can wait until hint replay in all of the queues reach that point. The sync point supports serialization - first it is serialized with the help of IDL to a binary form, and then converted to a hexadecimal string. Deserialization is also possible.	2021-08-09 09:24:36 +02:00
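
The two-step encoding mentioned in `e18b29765a` (binary form first, then a hexadecimal string) can be sketched in Python as a round trip. The byte layout below is invented for illustration and is not the IDL wire format; the function names and the use of string queue names and integer replay positions are likewise assumptions.

```python
import struct
from typing import Dict


def encode_sync_point(sync_point: Dict[str, int]) -> str:
    """Pack a {hint queue name: replay position} mapping, then hex-encode it."""
    out = struct.pack("<I", len(sync_point))        # number of entries
    for queue, position in sorted(sync_point.items()):
        name = queue.encode("utf-8")
        out += struct.pack("<H", len(name)) + name  # length-prefixed queue name
        out += struct.pack("<Q", position)          # 64-bit replay position
    return out.hex()


def decode_sync_point(text: str) -> Dict[str, int]:
    """Reverse of encode_sync_point: hex string -> bytes -> mapping."""
    data = bytes.fromhex(text)
    (count,) = struct.unpack_from("<I", data, 0)
    offset = 4
    result = {}
    for _ in range(count):
        (name_len,) = struct.unpack_from("<H", data, offset)
        offset += 2
        name = data[offset:offset + name_len].decode("utf-8")
        offset += name_len
        (position,) = struct.unpack_from("<Q", data, offset)
        offset += 8
        result[name] = position
    return result
```

A hexadecimal string is convenient here because it can be handed to a user and passed back later without escaping concerns. The mapping may be incomplete, since a sync point need not cover every hint queue.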
Piotr Dulikowski	0d74dee683	Revert "messaging_service: add verbs for hint sync points" This reverts commit `82c419870a`. This commit removes the HINT_SYNC_POINT_CREATE and HINT_SYNC_POINT_CHECK rpc verbs. The upcoming HTTP API for waiting for hint replay will be restricted to waiting for hints on the node handling the request, so there is no need for new verbs.	2021-08-09 09:24:36 +02:00
Gleb Natapov	4764028cb3	raft: Remove leader_id from append_request The field is not used anywhere. Message-Id: <YP0khmjK2JSp77AG@scylladb.com>	2021-07-28 20:30:07 +02:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Asias He	72cc596842	repair: Wire off-strategy compaction for regular repair We have enabled off-strategy compaction for bootstrap, replace, decommission and removenode operations when repair based node operation is enabled. Unlike node operations like replace or decommission, it is harder to know when the repair of a table is finished because users can send multiple repair requests one after another, each request repairing a few token ranges. This patch wires off-strategy compaction for regular repair by adding a timeout-based automatic off-strategy compaction trigger mechanism. If there is no repair activity for some time, off-strategy compaction will be triggered for that table automatically. Fixes #8677 Closes #8678	2021-05-26 11:41:27 +03:00
Avi Kivity	d6d6758857	Merge 'Switch to use NODE_OPS_CMD for decommission and bootstrap operation' from Asias He In commit `323f72e48a` (repair: Switch to use NODE_OPS_CMD for replace operation), we switched replace operation to use the new NODE_OPS_CMD infrastructure. In this patch set, we continue the work to switch decommission and bootstrap operation to use NODE_OPS_CMD. Fixes #8472 Fixes #8471 Closes #8481 * github.com:scylladb/scylla: repair: Switch to use NODE_OPS_CMD for bootstrap operation repair: Switch to use NODE_OPS_CMD for decommission operation	2021-05-06 17:28:19 +03:00
Avi Kivity	6ffd813b7b	Merge 'hints: delay repair until hints are replayed' from Piotr Dulikowski Both hinted handoff and repair are meant to improve the consistency of the cluster's data. HH does this by storing records of failed replica writes and replaying them later, while repair goes through all data on all participating replicas and makes sure the same data is stored on all nodes. The former is generally cheaper and sometimes (but not always) can bring back full consistency on its own; repair, while being more costly, is a sure way to bring back current data to full consistency. When hinted handoff and repair are running at the same time, some of the work can be unnecessarily duplicated. For example, if a row is repaired first, then hints towards it become unnecessary. However, repair needs to do less work if data already has good consistency, so if hints finish first, then the repair will be shorter. This PR introduces a possibility to wait for hints to be replayed before continuing with user-issued repair. The coordinator of the repair operation asks all nodes participating in the repair operation (including itself) to mark a point at the end of all hint queues pointing towards other nodes participating in repair. Then, it waits until hint replay in all those queues reaches the marked point, or the configured timeout is reached. This operation is currently opt-in and can be turned on by setting the `wait_for_hint_replay_before_repair_in_ms` config option to a positive value. Fixes #8102 Tests: - unit(dev) - some manual tests: - shutting down repair coordinator during hints replay, - shutting down node participating in repair during hints replay, Closes #8452 * github.com:scylladb/scylla: repair: introduce abort_source for repair abort repair: introduce abort_source for shutdown storage_proxy: add abort_source to wait_for_hints_to_be_replayed storage_proxy: stop waiting for hints replay when node goes down hints: dismiss segment waiters when hint queue can't send repair: plug in waiting for hints to be sent before repair repair: add get_hosts_participating_in_repair storage_proxy: coordinate waiting for hints to be sent config: add wait_for_hint_replay_before_repair option storage_proxy: implement verbs for hint sync points messaging_service: add verbs for hint sync points storage_proxy: add functions for syncing with hints queue db/hints: make it possible to wait until current hints are sent db/hints: add a metric for counting processed files db/hints: allow to forcefully update segment list on flush	2021-05-03 18:47:27 +03:00
Asias He	84a78f4558	repair: Switch to use NODE_OPS_CMD for bootstrap operation In commit `323f72e48a` (repair: Switch to use NODE_OPS_CMD for replace operation), we switched replace operation to use the new NODE_OPS_CMD infrastructure. In this patch, we continue the work to switch bootstrap operation to use NODE_OPS_CMD. The benefits: - It is more reliable to detect pending node operations, to avoid multiple topology changes at the same time, than using gossip status. - The cluster reverts to a state before the bootstrap operation automatically in case of error much faster than gossip. - Allows users to pass a list of dead nodes to ignore during bootstrap explicitly. - The BOOTSTRAP gossip status is not needed any more. This is one step closer to achieve gossip-less topology change. Fixes #8472	2021-04-28 09:53:04 +08:00
Piotr Dulikowski	82c419870a	messaging_service: add verbs for hint sync points Adds two verbs: HINT_SYNC_POINT_CREATE and HINT_SYNC_POINT_CHECK. Those will make it possible to create a sync point and regularly poll to check its existence.	2021-04-27 15:06:39 +02:00

181 Commits