Commit Graph

122 Commits

Author SHA1 Message Date
Yaniv Kaul
d2ef100b60 Typos: more/less then -> more/less than
Fix repeated typos in comments: more then -> more than, less then -> less than

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#17303
2024-02-13 17:16:15 +02:00
Kamil Braun
e9e24f47ec Merge 'raft topology: implement upgrade and recovery procedure' from Piotr Dulikowski
This PR implements a procedure that upgrades existing clusters to use
raft-based topology operations. The procedure does not start
automatically, it must be triggered manually by the administrator after
making sure that no topology operations are currently running.

Upgrade is triggered by sending a `POST
/storage_service/raft_topology/upgrade` request. This causes the
topology coordinator to start, and the coordinator drives the rest of
the process: it builds the `system.topology` state based on information
observed in gossip and tells all nodes to switch to raft mode. From then
on, the topology coordinator runs normally.

Upgrade progress is tracked in a new static column `upgrade_state` in
`system.topology`.

The procedure also serves as an extension to the current recovery
procedure on raft. The current recovery procedure requires restarting
nodes in a special mode which disables raft, performing `nodetool
removenode` on the dead nodes, cleaning up some state on the nodes, and
restarting them so that they automatically rebuild group 0. Raft
topology fits into the existing procedure by falling back to legacy
topology operations after raft is disabled. After group 0 is rebuilt,
upgrade needs to be triggered again.

Because upgrade is manual and it might not be convenient for
administrators to run it right after upgrading the cluster, we allow the
cluster to operate in legacy topology operations mode until the upgrade,
which includes allowing new nodes to join. To make this possible, nodes
now ask the cluster which mode they should use to join before
proceeding, using a new `JOIN_NODE_QUERY` RPC.

The procedure is explained in more detail in `topology-over-raft.md`.
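To make the progression concrete, here is a minimal sketch of an upgrade state machine; the state names and transitions are illustrative assumptions, not the actual values stored in `system.topology` (see `topology-over-raft.md` for those):

```c++
#include <cstdio>
#include <stdexcept>

// Hypothetical states for the static `upgrade_state` column; the real
// names are defined by the implementation, this only shows the shape.
enum class upgrade_state {
    not_upgraded,             // legacy, gossip-based topology operations
    build_coordinator_state,  // coordinator rebuilds system.topology from gossip
    done,                     // raft-based topology operations active
};

// The coordinator only ever moves the state forward.
upgrade_state next(upgrade_state s) {
    switch (s) {
    case upgrade_state::not_upgraded:            return upgrade_state::build_coordinator_state;
    case upgrade_state::build_coordinator_state: return upgrade_state::done;
    case upgrade_state::done:                    return upgrade_state::done;
    }
    throw std::logic_error("unknown upgrade_state");
}

int main() {
    auto s = upgrade_state::not_upgraded; // until POST /storage_service/raft_topology/upgrade
    s = next(s); // coordinator builds system.topology from gossiped state
    s = next(s); // all nodes have switched to raft mode
    std::printf("upgrade finished: %d\n", s == upgrade_state::done);
}
```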

Fixes: https://github.com/scylladb/scylladb/issues/15008

Closes scylladb/scylladb#17077

* github.com:scylladb/scylladb:
  test/topology_custom: upgrade/recovery tests for topology on raft
  cdc/generation_service: in legacy mode, fall back to raft tables
  system_keyspace: add read_cdc_generation_opt
  cdc/generation_service: turn off gossip notifications in raft topo mode
  cql_test_env: move raft_topology_change_enabled var earlier
  group0_state_machine: pull snapshot after raft topology feature enabled
  storage_service: disable persistent feature enabler on upgrade
  storage_service: replicate raft features to system.peers
  storage_service: gossip tokens and cdc generation in raft topology mode
  API: add api for triggering and monitoring topology-on-raft upgrade
  storage_service: infer which topology operations to use on startup
  storage_service: set the topology kind value based on group 0 state
  raft_group0: expose link to the upgrade doc in the header
  feature_service: fall back to checking legacy features on startup
  storage_service: add fiber for tracking the topology upgrade progress
  gms: feature_service: add SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES
  topology_coordinator: implement core upgrade logic
  topology_coordinator: extract top-level error handling logic
  storage_service: initialize discovery leader's state earlier
  topology_coordinator: allow for custom sharding info in prepare_and_broadcast_cdc_generation_data
  topology_coordinator: allow for custom sharding info in prepare_new_cdc_generation_data
  topology_coordinator: remove outdated fixme in prepare_new_cdc_generation_data
  topology_state_machine: introduce upgrade_state
  storage_service: disallow topology ops when upgrade is in progress
  raft_group0_client: add in_recovery method
  storage_service: introduce join_node_query verb
  raft_group0: make discover_group0 public
  raft_group0: filter current node's IP in discover_group0
  raft_group0: remove my_id arg from discover_group0
  storage_service: make _raft_topology_change_enabled more advanced
  docs: document raft topology upgrade and recovery
2024-02-09 11:54:53 +01:00
Kefu Chai
64c829da70 docs: reformat the state machine diagram using mermaid
for better readability

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16620
2024-02-08 19:43:53 +02:00
Piotr Dulikowski
1104f8b00f docs: document raft topology upgrade and recovery 2024-02-07 09:54:54 +01:00
Patryk Jędrzejczak
2687204c7f docs: dev: topology-over-raft: align indentation 2024-02-02 16:55:28 +01:00
Patryk Jędrzejczak
fdd3c3a280 docs: dev: topology-over-raft: document the rollback_to_normal state
In one of the previous patches, we changed the `rollback_to_normal`
state from a node state to a transition state. We document it
in this patch. The node state wasn't documented, so there is
nothing to remove.
2024-02-02 16:55:28 +01:00
Avi Kivity
c8397f0287 Merge 'Implement tablet splitting' from Raphael "Raph" Carvalho
The motivation for tablet resizing is that we want to keep the average tablet size reasonable, so that load rebalancing remains efficient. A tablet that is too large makes migration inefficient, slowing down the balancer.

If the avg size grows beyond the upper bound (split threshold), the balancer decides to split. A split spans all tablets of a table, due to the power-of-two constraint.

Likewise, if the avg size decreases below the lower bound (merge threshold), a merge takes place in order to grow the avg size. Merge is not implemented yet, although this series lays the foundation for it to be implemented later on.

A resize decision can be revoked if the avg size changes and the decision is no longer needed. For example, say a table is being split and the avg size drops below the target size (which is 50% of the split threshold and 100% of the merge threshold). That means that after the split, the avg size would drop below the merge threshold, causing a merge right after the split, which is wasteful, so it's better to just cancel the split.

Tablet metadata gains 2 new fields for managing this:
- resize_type: the resize decision type, either "merge", "split", or "none".
- resize_seq_number: a sequence number that works as the global identifier of the decision (monotonically increasing, incremented by 1 on every new decision emitted by the coordinator).

A new RPC was implemented to pull stats from each table replica, so that the load balancer can calculate the avg tablet size and know the "split status" of a given table. Avg size is aggregated carefully, taking the RF of each DC into account (which might differ).
When a table is done splitting its storage, it loads (mirrors) the resize_seq_number from tablet metadata into its local state (in other words: my split status is ready). If a table is split-ready, the coordinator will see that the table's seq number is the same as the one in tablet metadata. This helps distinguish stale decisions from the latest one (in case decisions are revoked and re-emitted later on). The status is also aggregated carefully, by taking the minimum among all replicas, so the coordinator will only update topology when all replicas are ready.

When the load balancer emits a split decision, replicas detect the need to split with a "split monitor" that is awakened once a table's replication metadata is updated and the need for a split is detected (i.e. the resize_type field is "split").
The split monitor will start splitting the table's compaction groups (using the mechanism introduced in 081f30d149). Once the splitting work is completed, the table updates its local state as having completed the split.

When the coordinator pulls the split status of all replicas of a table via RPC, the balancer can see whether that table is ready for "finalizing" the decision, which means updating tablet metadata to split each tablet in two. Once table replicas have their replication metadata updated with the new tablet count, they can update their set of compaction groups accordingly (they were previously split in the preparation step).
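The threshold arithmetic above can be condensed into a small decision function. This is an illustrative sketch, not the balancer's actual code; the 10 GiB split threshold is a made-up number, and the target size follows the "50% of split threshold, 100% of merge threshold" relationship from the description:

```c++
#include <cstdint>
#include <cstdio>

enum class resize_type { none, split, merge };

struct resize_decision {
    resize_type type;
    uint64_t seq_number; // global, monotonically increasing decision id
};

// Illustrative thresholds: target = split_threshold / 2 = merge_threshold.
constexpr uint64_t split_threshold = 10ull << 30; // 10 GiB, hypothetical
constexpr uint64_t merge_threshold = split_threshold / 2;

resize_decision decide(uint64_t avg_tablet_size, const resize_decision& current, uint64_t& seq) {
    // Revoke an ongoing split if the average already fell below the target:
    // after splitting it would drop below the merge threshold, triggering a
    // wasteful merge right after the split.
    if (current.type == resize_type::split && avg_tablet_size < merge_threshold) {
        return {resize_type::none, ++seq};
    }
    if (current.type == resize_type::none) {
        if (avg_tablet_size > split_threshold) return {resize_type::split, ++seq};
        if (avg_tablet_size < merge_threshold) return {resize_type::merge, ++seq};
    }
    return current; // keep the ongoing decision
}

int main() {
    uint64_t seq = 0;
    resize_decision d{resize_type::none, 0};
    d = decide(12ull << 30, d, seq); // above split threshold -> split
    std::printf("split? %d (seq %llu)\n", d.type == resize_type::split,
                (unsigned long long)d.seq_number);
    d = decide(4ull << 30, d, seq);  // dropped below target -> revoke
    std::printf("revoked? %d\n", d.type == resize_type::none);
}
```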

Fixes #16536.

Closes scylladb/scylladb#16580

* github.com:scylladb/scylladb:
  test/topology_experimental_raft: Add tablet split test
  replica: Bypass reshape on boot with tablets temporarily
  replica: Fix table::compaction_group_for_sstable() for tablet streaming
  test/topology_experimental_raft: Disable load balancer in test fencing
  replica: Remap compaction groups when tablet split is finalized
  service: Split tablet map when split request is finalized
  replica: Update table split status if completed split compaction work
  storage_service: Implement split monitor
  topology_coordinator: Generate updates for resize decisions made by balancer
  load_balancer: Introduce metrics for resize decisions
  db: Make target tablet size a live-updateable config option
  load_balancer: Implement resize decisions
  service: Wire table_resize_plan into migration_plan
  service: Introduce table_resize_plan
  tablet_mutation_builder: Add set_resize_decision()
  topology_coordinator: Wire load stats into load balancer
  storage_service: Allow tablet split and migration to happen concurrently
  topology_coordinator: Periodically retrieve table_load_stats
  locator: Introduce topology::get_datacenter_nodes()
  storage_service: Implement table_load_stats RPC
  replica: Expose table_load_stats in table
  replica: Introduce storage_group::live_disk_space_used()
  locator: Introduce table_load_stats
  tablets: Add resize decision metadata to tablet metadata
  locator: Introduce resize_decision
2024-01-31 13:59:56 +02:00
Patryk Jędrzejczak
7c10cae6c4 docs: dev: topology-over-raft: document the left_token_ring state
In one of the previous patches, we changed the `left_token_ring`
state from a node state to a transition state. We document it
in this patch. The node state wasn't documented, so there is
nothing to remove.
2024-01-29 10:39:07 +01:00
Kefu Chai
9ee6c00c84 docs: fix misspellings
these misspellings are identified by codespell.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17005
2024-01-26 13:14:21 +02:00
Raphael S. Carvalho
7ed5b44d52 load_balancer: Implement resize decisions
This implements the ability in load balancer to emit split or merge
requests, cancel ongoing ones if they're no longer needed, and
also finalize those that are ready for the topology changes.

That's all based on the average tablet size, collected by the coordinator
from all nodes, and on the split and merge thresholds.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-01-25 18:36:08 -03:00
Raphael S. Carvalho
0d5ba1ee4b tablets: Add resize decision metadata to tablet metadata
The new metadata describes the ongoing resize operation (either
merge, split, or none) that spans the tablets of a given table.
It is managed by group0, so nodes that are down will be able to see
the decision when they come back up and observe the changes to the
metadata.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-01-25 18:36:06 -03:00
Avi Kivity
69d597075a Merge 'tablets: Add support for removenode and replace handling' from Tomasz Grabiec
New tablet replicas are allocated and rebuilt synchronously with node
operations. They are safely rebuilt from all existing replicas.
The list of ignored nodes passed to node operations is respected.

The tablet scheduler is responsible for scheduling the tablet rebuild
transition, which changes the replica set. The infrastructure for
handling decommission in the tablet scheduler is reused for this.

Scheduling is done incrementally, respecting per-shard load
limits. Rebuild transitions are recognized by the load calculation to
affect all tablet replicas.

A new kind of tablet transition is introduced, called "rebuild", which
adds a new tablet replica and rebuilds it from the existing replicas.
Other than that, the transition goes through the same stages as a
regular migration, to ensure safe synchronization with request
coordinators.

In this PR we simply stream from all tablet replicas. Later we should
switch to calling repair to avoid sending excessive amounts of data.
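A rough sketch of how the two transition kinds differ in where they stream from, as described above; all types and names here are illustrative, not the real scheduler's API:

```c++
#include <string>
#include <vector>

enum class transition_kind { migration, rebuild }; // stored per tablet

using replica = std::string; // stand-in for a (host, shard) pair

// Illustrative: which replicas the pending replica streams from.
std::vector<replica> streaming_sources(transition_kind k,
                                       const std::vector<replica>& existing,
                                       const replica& leaving) {
    if (k == transition_kind::rebuild) {
        // Rebuild safely reconstructs the new replica from all existing
        // replicas (later to be replaced by repair, to avoid sending
        // excessive amounts of data).
        return existing;
    }
    // A regular migration moves data from the single leaving replica.
    return {leaving};
}
```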

Fixes https://github.com/scylladb/scylladb/issues/16690.

Closes scylladb/scylladb#16894

* github.com:scylladb/scylladb:
  tests: tablets: Add tests for removenode and replace
  tablets: Add support for removenode and replace handling
  topology_coordinator: tablets: Do not fail in a tight loop
  topology_coordinator: tablets: Avoid warnings about ignored failed future
  storage_service, topology: Track excluded state in locator::topology
  raft topology: Introduce param-less topology::get_excluded_nodes()
  raft topology: Move get_excluded_nodes() to topology
  tablets: load_balancer: Generalize load tracking
  tablets: Introduce get_migration_streaming_info() which works on migration request
  tablets: Move migration_to_transition_info() to tablets.hh
  tablets: Extract get_new_replicas() which works on migration request
  tablets: Move tablet_migration_info to tablets.hh
  tablets: Store transition kind per tablet
2024-01-25 14:49:43 +02:00
Avi Kivity
4a57b67634 docs: add a rough diagram of module interaction
It is incomplete and maybe inaccurate, but it is a start.

Closes scylladb/scylladb#16903
2024-01-23 18:08:48 +02:00
Tomasz Grabiec
4a06ffb43c tablets: Store transition kind per tablet
Will be used to distinguish regular migration from rebuild, repair and
RF change.
2024-01-23 01:12:57 +01:00
Eliran Sinvani
32d8dadf1a Add code coverage documentation
Add `docs/dev/code-coverage.md` with explanations of how to work with
the different tools added for coverage reporting and the CLI options
added to `configure.py` and `test.py`.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2024-01-18 11:11:34 +02:00
Sylwia Szunejko
91a5a41313 add a way to negotiate generation of the tablet info for drivers
Tablet metadata is quite expensive to generate (each data_value is
an allocation), so an old driver (without support for tablets) would
cause huge amounts of such notifications to be generated. This commit
adds a way to negotiate generation of the notification: a new driver
will ask for them, and an old driver won't get them. It uses the
OPTIONS/SUPPORTED/STARTUP protocol described in native_protocol_v4.spec.
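Schematically, the negotiation looks like this; the option key below is a hypothetical placeholder, the real one is defined by the protocol extension spec:

```c++
#include <map>
#include <string>

// Hypothetical extension key; the actual name is defined in the
// protocol extensions document.
const std::string tablets_option = "SCYLLA_TABLETS_ROUTING";

// SUPPORTED: the server advertises the extension to every client.
std::map<std::string, std::string> supported_options() {
    return {{tablets_option, "true"}};
}

// STARTUP: only a tablet-aware driver echoes the option back. The server
// enables the expensive tablet notifications only for such connections.
bool wants_tablet_info(const std::map<std::string, std::string>& startup_options) {
    auto it = startup_options.find(tablets_option);
    return it != startup_options.end() && it->second == "true";
}
```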

Closes scylladb/scylladb#16611
2024-01-02 20:00:50 +02:00
Sylwia Szunejko
467d466f7e put all tablet info into one field of custom_payload and update docs
Previously, the tablet information was sent to the drivers
in two pieces within the custom_payload. We had information
about the replicas under the `tablet_replicas` key and token range
information under `token_range`. These names were quite generic
and might have caused problems for other custom_payload users.
Additionally, dividing the information into two pieces raised
the question of what to do if one key is present while the other
is missing.

This commit changes the serialization mechanism to pack all information
under one specific name, `tablets-routing-v1`.
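A sketch of the shape of the change: the routing info is serialized into a single entry under the specific `tablets-routing-v1` key. The string serialization below is a stand-in for illustration; the real payload uses CQL serialization:

```c++
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using custom_payload = std::map<std::string, std::string>;

struct tablet_routing_info {
    int64_t first_token;
    int64_t last_token;
    std::vector<std::pair<std::string, int>> replicas; // (host id, shard)
};

// One specific key instead of the generic `tablet_replicas`/`token_range`
// pair: other custom_payload users cannot collide with it, and the two
// halves can never be present or missing independently.
custom_payload make_payload(const tablet_routing_info& info) {
    std::string blob = std::to_string(info.first_token) + "," +
                       std::to_string(info.last_token);
    for (const auto& [host, shard] : info.replicas) {
        blob += ";" + host + "/" + std::to_string(shard);
    }
    return {{"tablets-routing-v1", blob}};
}
```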

From: Sylwia Szunejko <sylwia.szunejko@scylladb.com>

Closes scylladb/scylladb#16148
2024-01-02 14:35:37 +02:00
Tomasz Grabiec
effb9fb3cb Merge 'Don't calculate hashes for schema versions in Raft mode' from Kamil Braun
When performing a schema change through group 0, extend the schema mutations with a version that's persisted and then used by the nodes in the cluster in place of the old schema digest, whose calculation becomes horribly slow as we perform more and more schema changes (#7620).

If the change is a table create or alter, also extend the mutations with a version for this table to be used for `schema::version()`s instead of having each node calculate a hash which is susceptible to bugs (#13957).

When performing a schema change in Raft RECOVERY mode, we also extend schema mutations in a way that forces nodes to revert to the old way of calculating schema versions when necessary.

We can only introduce these extensions if all of the cluster understands them, so this code is protected by a new cluster/schema feature, `GROUP0_SCHEMA_VERSIONING`.
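The version-selection logic implied by the description, as a hedged sketch (types and names illustrative): prefer a version persisted through group 0, and fall back to hashing only when one is absent, e.g. for changes made in RECOVERY mode:

```c++
#include <cstdio>
#include <functional>
#include <optional>
#include <string>

using table_version = std::string;

// Illustrative stand-in for digesting schema mutations; the real code
// computes a hash over the schema tables.
table_version compute_digest(const std::string& schema_mutations) {
    return std::to_string(std::hash<std::string>{}(schema_mutations));
}

// If the group 0 command carried a precomputed version (GROUP0_SCHEMA_VERSIONING
// enabled, not in RECOVERY), use it; otherwise fall back to hashing, which
// gets slower as schema history grows.
table_version effective_version(std::optional<table_version> group0_version,
                                const std::string& schema_mutations) {
    return group0_version ? *group0_version : compute_digest(schema_mutations);
}

int main() {
    std::printf("%s\n", effective_version(std::nullopt, "CREATE TABLE t ...").c_str());
    std::printf("%s\n", effective_version(table_version{"v-from-group0"}, "").c_str());
}
```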

Fixes: #7620
Fixes: #13957

---

This is a reincarnation of PR scylladb/scylladb#15331. The previous PR was reverted due to a bug it unmasked; the bug has now been fixed (scylladb/scylladb#16139). Some refactors from the previous PR were already merged separately, so this one is a bit smaller.

I have checked with @Lorak-mmk's reproducer (https://github.com/Lorak-mmk/udt_schema_change_reproducer -- many thanks for it!) that the originally exposed bug is no longer reproducing on this PR, and that it can still be reproduced if I revert the aforementioned fix on top of this PR.

Closes scylladb/scylladb#16242

* github.com:scylladb/scylladb:
  docs: describe group 0 schema versioning in raft docs
  test: add test for group 0 schema versioning
  feature_service: enable `GROUP0_SCHEMA_VERSIONING` in Raft mode
  schema_tables: don't delete `version` cell from `scylla_tables` mutations from group 0
  migration_manager: add `committed_by_group0` flag to `system.scylla_tables` mutations
  schema_tables: use schema version from group 0 if present
  migration_manager: store `group0_schema_version` in `scylla_local` during schema changes
  system_keyspace: make `get/set_scylla_local_param` public
  feature_service: add `GROUP0_SCHEMA_VERSIONING` feature
2023-12-11 12:17:57 +01:00
Kamil Braun
3352d9bccc docs: describe group 0 schema versioning in raft docs 2023-12-08 17:46:31 +01:00
Avi Kivity
9c0f05efa1 Merge 'Track tablet streaming under global sessions to prevent side-effects of failed streaming' from Tomasz Grabiec
Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When the streaming master fails (e.g. due to an RPC failing), some streaming work may still be alive somewhere (e.g. an RPC on the wire) and have side-effects at some point later.

This PR implements tracking of all operations involved in streaming which may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted.

The tracking and fencing are implemented using global "sessions", created for the streaming of a single tablet. A session is globally identified by a UUID. The identifier is assigned by the topology change coordinator and stored in system.tablets. Sessions are created and closed, based on group0 state (tablet metadata), by the barrier command sent to each replica, which we already send on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained.

The barrier is blocked only if some session has work which was left behind by unsuccessful streaming. In that case it should not be blocked for long, because the streaming process frequently checks whether the guard was left behind and stops if it was.

This tracking mechanism is fault-tolerant: the session id is stored in group0, so the coordinator can make progress on failover. The barriers guarantee that a session exists on all replicas, and that it will be closed on all replicas.
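A toy model of the session concept, with plain standard-library types standing in for the real topology_guard machinery: sessions are keyed by an id stored in group0 state, streaming work can only enter an open session, and closing is complete once leftover work has drained:

```c++
#include <cstdint>
#include <map>
#include <memory>
#include <stdexcept>

using session_id = uint64_t; // stands in for the UUID stored in system.tablets

class session_registry {
    // use_count() - 1 == number of streaming operations still holding the guard
    std::map<session_id, std::shared_ptr<int>> _sessions;
public:
    // Created/closed on each replica by the barrier, from group0 state.
    void create(session_id id) { _sessions.emplace(id, std::make_shared<int>(0)); }

    // A streaming operation may only run while its session is open; the
    // returned guard fences it to the stage that started it.
    std::shared_ptr<int> enter(session_id id) {
        auto it = _sessions.find(id);
        if (it == _sessions.end()) throw std::runtime_error("session closed: fenced");
        return it->second;
    }

    // A closed session is fully drained once no guard is held anymore;
    // the barrier waits for this before letting the stage advance.
    bool close_ready(session_id id) {
        auto it = _sessions.find(id);
        return it == _sessions.end() || it->second.use_count() == 1;
    }
};
```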

Closes scylladb/scylladb#15847

* github.com:scylladb/scylladb:
  test: tablets: Add test for failed streaming being fenced away
  error_injection: Introduce poll_for_message()
  error_injection: Make is_enabled() public
  api: Add API to kill connection to a particular host
  range_streamer: Do not block topology change barriers around streaming
  range_streamer, tablets: Do not keep token metadata around streaming
  tablets: Fail gracefully when migrating tablet has no pending replica
  storage_service, api: Add API to disable tablet balancing
  storage_service, api: Add API to migrate a tablet
  storage_service, raft topology: Run streaming under session topology guard
  storage_service, tablets: Use session to guard tablet streaming
  tablets: Add per-tablet session id field to tablet metadata
  service: range_streamer: Propagate topology_guard to receivers
  streaming: Always close the rpc::sink
  storage_service: Introduce concept of a topology_guard
  storage_service: Introduce session concept
  tablets: Fix topology_metadata_guard holding on to the old erm
  docs: Document the topology_guard mechanism
2023-12-07 16:29:02 +02:00
Yaniv Kaul
862909ee4f Typos: fix typos in documentation
Using codespell, went over the docs and fixed some typos.

Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#16275
2023-12-07 11:10:17 +02:00
Tomasz Grabiec
6cd310fc1a docs: Document the topology_guard mechanism 2023-12-05 14:09:34 +01:00
Avi Kivity
60af2f3cb2 Merge 'New commitlog file format using tagged pages' from Calle Wilund
Prototype implementation of format suggested/requested by @avikivity:

Divides segments into disk-write-alignment sized pages, each tagged with the segment ID + a CRC of the data content.
When reading, we verify both the sector integrity (CRC), to detect corruption, and that the ID read matches the expected one.

If the latter mismatches, we have a prematurely terminated segment (read truncation), which, depending on whether the CL is
written in batch or periodic mode, as well as on explicit sync, can mean data loss.

Note: all-zero pages are treated as kosher, both to align with newly allocated segments and with fully terminated (zero-page) ones.

Note: This is a preview/RFC - the rest of the file format is not modified. At least parts of entry CRC could probably be removed, but I have not done so yet (needs some thinking).

Note: some slight abstraction breaks in the implementation, and probably less-than-maximal efficiency.

v2:
* Removed entry CRC:s in file format.
* Added docs on format v3
* Added one more test for recycling-truncation

v3:
* Fixed typos in size calc and docs
* Changed sect metadata order
* Explicit iter type
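To illustrate the page tagging, here is a self-contained sketch; the field order, sizes, and checksum are assumptions for illustration, not the actual v3 on-disk layout:

```c++
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <numeric>

constexpr size_t sector_size = 4096; // disk-write-alignment sized page

struct sector {
    uint64_t segment_id;                      // must match the expected segment
    uint32_t crc;                             // CRC of the data payload
    std::array<char, sector_size - 12> data;  // payload
};

// Illustrative checksum; the real format uses a proper CRC.
uint32_t checksum(const char* p, size_t n) {
    return std::accumulate(p, p + n, 0u,
        [](uint32_t a, char c) { return a * 31 + (unsigned char)c; });
}

enum class read_result { ok, corrupt, truncated, end_of_data };

read_result validate(const sector& s, uint64_t expected_segment_id) {
    static const sector zero{};
    if (std::memcmp(&s, &zero, sizeof(sector)) == 0) {
        return read_result::end_of_data;      // all-zero page is kosher
    }
    if (checksum(s.data.data(), s.data.size()) != s.crc) {
        return read_result::corrupt;          // sector damaged on disk
    }
    if (s.segment_id != expected_segment_id) {
        return read_result::truncated;        // stale/recycled page: early EOF
    }
    return read_result::ok;
}
```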

Closes scylladb/scylladb#15494

* github.com:scylladb/scylladb:
  commitlog_test: Add test for replaying large-ish mutation
  commitlog_test: Add additional test for segment truncation
  docs: Add docs on commitlog format 3
  commitlog: Remove entry CRC from file format
  commitlog: Implement new format using CRC:ed sectors
  commitlog: Add iterator adaptor for doing buffer splitting into sub-page ranges
  fragmented_temporary_buffer: Add const iterator access to underlying buffers
  commitlog_replayer: differentiate between truncated file and corrupt entries
2023-12-04 13:31:13 +01:00
Michał Jadwiszczak
e1d86f9afb docs: add LIST EFFECTIVE SERVICE LEVEL statement docs 2023-11-30 13:07:20 +01:00
sylwiaszunejko
ac51c417ea docs: add documentation about sending tablet info to protocol extensions 2023-11-22 09:23:43 +01:00
Calle Wilund
57a4645c81 docs: Add docs on commitlog format 3 2023-11-21 08:50:57 +00:00
Kamil Braun
0846d324d7 Merge 'rollback topology operation on streaming failure' from Gleb
This patch series adds error handling for streaming failures during
topology operations, instead of an infinite retry. If streaming fails,
the operation is rolled back: bootstrapping/replacing nodes move to the
left state and decommissioned/removed nodes move back to the normal state.
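The rollback targets can be summarized in a tiny mapping (state and operation names are illustrative):

```c++
enum class node_op { bootstrap, replace, decommission, removenode };
enum class node_state { normal, left };

// On streaming failure the operation is rolled back instead of retried
// forever: joining nodes end up outside the cluster, leaving nodes stay in.
node_state rollback_target(node_op op) {
    switch (op) {
    case node_op::bootstrap:
    case node_op::replace:
        return node_state::left;    // the new node never became a member
    case node_op::decommission:
    case node_op::removenode:
        return node_state::normal;  // the node remains a full member
    }
    return node_state::normal; // unreachable
}
```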

* 'gleb/streaming-failure-rollback-v4' of github.com:scylladb/scylla-dev:
  raft: make sure that all operations forwarded to a leader are completed before destroying raft server
  storage_service: raft topology: remove code duplication from global_tablet_token_metadata_barrier
  tests: add tests for streaming failure in bootstrap/replace/remove/decommission
  test/pylib: do not stop node if decommission failed with an expected error
  storage_service: raft topology: fix typo in "decommission" everywhere
  storage_service: raft topology: add streaming error injection
  storage_service: raft topology: do not increase topology version during CDC repair
  storage_service: raft topology: rollback topology operation on streaming failure.
  storage_service: raft topology: load request parameters in left_token_ring state as well
  storage_service: raft topology: do not report term_changed_error during global_token_metadata_barrier as an error
  storage_service: raft topology: change global_token_metadata_barrier error handling to try/catch
  storage_service: raft topology: make global_token_metadata_barrier node independent
  storage_service: raft topology: split get_excluded_nodes from exec_global_command
  storage_service: raft topology: drop unused include_local and do_retake parameters from exec_global_command which are always true
  storage_service: raft topology: simplify streaming RPC failure handling
2023-11-02 10:15:45 +01:00
Aleksandr Bykov
6b991b4791 doc: add note about run test.py with toolchain/dbuild
test.py tests can be run with toolchain/dbuild, in which case
there is no need to execute ./install-dependencies.sh.

Closes scylladb/scylladb#15837
2023-10-29 18:30:32 +02:00
Gleb Natapov
cee7aab32c storage_service: raft topology: fix typo in "decommission" everywhere 2023-10-25 13:03:57 +03:00
Piotr Dulikowski
dd4579637b docs: document the new join procedure 2023-09-26 15:56:52 +02:00
Kefu Chai
f3f31f0c65 main.cc: rename aws option
- s/aws_key/aws_access_key_id/
- s/aws_secret/aws_secret_access_key/
- s/aws_token/aws_session_token/

Rename them to more popular names; these names are also used by
boto's API. This should improve readability and consistency.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-23 14:31:32 +08:00
Michael Huang
62a8a31be7 cdc: use chunked_vector for topology_description entries
Lists can grow very big. Let's use a chunked vector to prevent large contiguous
allocations.
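For illustration, a toy version of the idea behind a chunked vector (Scylla has its own `utils::chunked_vector`): storage grows as many small chunks rather than one large contiguous buffer, so appending never requires a huge allocation:

```c++
#include <cstddef>
#include <memory>
#include <vector>

template <typename T, size_t ChunkElems = 1024>
class toy_chunked_vector {
    std::vector<std::unique_ptr<std::vector<T>>> _chunks; // each chunk stays small
    size_t _size = 0;
public:
    void push_back(T v) {
        if (_size % ChunkElems == 0) {
            auto c = std::make_unique<std::vector<T>>();
            c->reserve(ChunkElems); // bounded contiguous allocation
            _chunks.push_back(std::move(c));
        }
        _chunks.back()->push_back(std::move(v));
        ++_size;
    }
    T& operator[](size_t i) { return (*_chunks[i / ChunkElems])[i % ChunkElems]; }
    size_t size() const { return _size; }
};
```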
Fixes: #15302.

Closes scylladb/scylladb#15428
2023-09-18 23:17:01 +03:00
Kefu Chai
30ef69fcb2 docs/dev/object_store: add more samples
in the hope of lowering the bar for testing the object store.

* add a language specifier for better readability of the document,
  to highlight the config with YAML syntax
* add more specific comments on the AWS-related settings
* explain that the endpoint in the CREATE KEYSPACE statement should
  match the one defined by the YAML configuration

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15433
2023-09-15 17:35:17 +03:00
Avi Kivity
a3d73bfba7 Merge 'Add support for decommission with tablets' from Tomasz Grabiec
The load balancer will recognize decommissioning nodes and will
move tablet replicas away from such nodes with the highest priority.

Topology changes now have an extra step called "tablet draining", which
calls the load balancer. The step will execute the tablet migration track
as long as there are nodes which require draining. It will not do regular
load balancing.

If the load balancer is unable to find new tablet replicas, because the
RF cannot be met or availability is at risk due to insufficient node
distribution across racks, it will throw an exception. Currently, the
topology change will retry in a loop. We should make this error cause
the topology change to be aborted. There is no infrastructure for
aborts yet, so this is not implemented.
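The planning priority can be sketched as follows (illustrative types, not the real balancer API): drain migrations are produced as long as any node needs draining, and regular balancing only happens otherwise:

```c++
#include <string>
#include <vector>

struct node { std::string id; bool decommissioning; };
struct migration { std::string tablet, from, to; };

// Stub plans; the real balancer computes these from tablet metadata.
std::vector<migration> drain_plan(const std::vector<node>&) { return {{"t1", "n2", "n1"}}; }
std::vector<migration> balance_plan(const std::vector<node>&) { return {}; }

// "Tablet draining": while any node requires draining, the plan contains
// only migrations away from it, never regular load balancing.
std::vector<migration> make_plan(const std::vector<node>& nodes) {
    for (const auto& n : nodes) {
        if (n.decommissioning) return drain_plan(nodes);
    }
    return balance_plan(nodes);
}
```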

Closes #15197

* github.com:scylladb/scylladb:
  tablets, raft topology: Add support for decommission with tablets
  tablet_allocator: Compute load sketch lazily
  tablet_allocator: Set node id correctly
  tablet_allocator: Make migration_plan a class
  tablets: Implement cleanup step
  storage_service, tablets: Prevent stale RPCs from running beyond their stage
  locator: Introduce tablet_metadata_guard
  locator, replica: Add a way to wait for table's effective_replication_map change
  storage_service, tablets: Extract do_tablet_operation() from stream_tablet()
  raft topology: Add break in the final case clause
  raft topology: Fix SIGSEGV when trace-level logging is enabled
  raft topology: Set node state in topology
  raft topology: Always set host id in topology
2023-09-14 17:16:23 +03:00
Tomasz Grabiec
551cc0233d tablets, raft topology: Add support for decommission with tablets
The load balancer will recognize decommissioning nodes and will
move tablet replicas away from such nodes with the highest priority.

Topology changes now have an extra step called "tablet draining", which
calls the load balancer. The step will execute the tablet migration track
as long as there are nodes which require draining. It will not do regular
load balancing.

If the load balancer is unable to find new tablet replicas, because the
RF cannot be met or availability is at risk due to insufficient node
distribution across racks, it will throw an exception. Currently, the
topology change will retry in a loop. We should make this error cause
the topology change to be paused, so that the admin becomes aware of the
problem and issues an abort on the topology change. There is no
infrastructure for aborts yet, so this is not implemented.
2023-09-14 13:05:49 +02:00
Kefu Chai
60c293ed7d doc/dev: correct the path to object_storage.yaml
we get the path of the object storage config like:

```c++
db::config::get_conf_sub("object_storage.yaml").native()
```
so, the default path should be $SCYLLA_CONF/object_storage.yaml.

In this change, it is corrected.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15406
2023-09-14 10:40:55 +03:00
Patryk Jędrzejczak
1c58c6336a system_keyspace: change id to timeuuid in CDC_GENERATIONS_V3
We change the type of IDs in CDC_GENERATIONS_V3 to timeuuid to
give them a time-based order. We also change how we initialize
them so that the new CDC generation always has the highest ID.
This is the last step to enabling the efficient clearing of
obsolete CDC generation data.

Additionally, we change the types of current_cdc_generation_uuid,
new_cdc_generation_data_uuid and the second values of the elements
in unpublished_cdc_generations to timeuuid, so that they match id
in CDC_GENERATIONS_V3.
2023-09-12 11:43:34 +02:00
Patryk Jędrzejczak
fab066cffe cdc: generation: remove topology_description_generator
After moving the creation of uuid out of
make_new_generation_description, this function only calls the
topology_description_generator's constructor and its generate
method. We could remove this function, but we instead simplify
the code by removing the topology_description_generator class.
We can do this refactor because make_new_generation_description
is the only place using it. We inline its generate method into
make_new_generation_description and turn its private methods into
static functions.
2023-09-12 11:18:54 +02:00
Patryk Jędrzejczak
2cd430ac80 system_keyspace: make CDC_GENERATIONS_V3 single-partition
We make CDC_GENERATIONS_V3 single-partition by adding the key
column and changing the clustering key from range_end to
(id, range_end). This is the first step to enabling the efficient
clearing of obsolete CDC generation data, which we need to prevent
Raft-topology snapshots from endlessly growing as we introduce new
generations over time. The next step is to change the type of the id
column to timeuuid. We do it in the following commits.

After making CDC_GENERATIONS_V3 single-partition, there is no easy
way of preserving the num_ranges column. As it is used only for
sanity checking, we remove it to simplify the implementation.
2023-09-12 09:51:45 +02:00
Patryk Jędrzejczak
2643ccc70e docs: remove information about publish_cdc_generation
We update documentation after replacing the
topology::transition_state::publish_cdc_generation state with
the CDC generation publisher fiber.
2023-09-08 09:05:01 +02:00
Patryk Jędrzejczak
5ed9d4db6d raft topology: add unpublished_cdc_generations to system.topology
In the following commits, we replace the
topology::transition_state::publish_cdc_generation state with
a background fiber that continually publishes committed CDC
generations. To make these generations accessible to the
topology coordinator, we store them in the new column of
system.topology -- unpublished_cdc_generations.
2023-09-08 09:05:01 +02:00
Nadav Har'El
5625624533 doc/dev: add document about analyzing build time
Add a document describing in detail how to use clang's "-ftime-trace"
option, and the ClangBuildAnalyzer tool, to find the source files,
header files and templates which slow down Scylla's build the most.

I've used this tool in the past to reduce Scylla build time - see
commits:

   fa7a302130 (reduced 6.5%)
   f84094320d (reduced 0.1%)
   6ebf32f4d7 (reduced 1%)
   d01e1a774b (reduced 4%)

I'm hoping that documenting how to use this tool will allow other
developers to suggest similar commits.

Refs #1.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #15209
2023-09-01 11:33:36 +03:00
Kamil Braun
169d19e5b0 Merge 'raft topology: support --ignore-dead-nodes in removenode and replace' from Patryk Jędrzejczak
We add support for `--ignore-dead-nodes` in `raft_removenode` and
`--ignore-dead-nodes-for-replace` in `raft_replace`. For now, we allow
passing only host ids of the ignored nodes. Supporting IPs is currently
impossible because `raft_address_map` doesn't provide a mapping from IP
to a host id.

The main steps of the implementation are as follows:
- add the `ignore_nodes` column to `system.topology`,
- set the `ignore_nodes` value of the topology mutation in `raft_removenode` and `raft_replace`,
- extend `service::request_param` with alternative types that allow storing a set of ids of the ignored nodes,
- load `ignore_nodes` from `system.topology` into `request_param` in `system_keyspace::load_topology_state`,
- add `ignore_nodes` to `exclude_nodes` in `topology_coordinator::exec_global_command`,
- pass `ignore_nodes` to `replace_with_repair` and `remove_with_repair` in `storage_service::raft_topology_cmd_handler`.

Additionally, we add `test_raft_ignore_nodes.py` with two tests that verify the added changes.
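The `split_comma_separated_list` utility mentioned in the commit list plausibly looks something like this hedged sketch (not the actual implementation):

```c++
#include <string>
#include <vector>

// Splits e.g. "hostid1, hostid2" into trimmed tokens, as needed for
// parsing the --ignore-dead-nodes argument.
std::vector<std::string> split_comma_separated_list(const std::string& s) {
    std::vector<std::string> out;
    size_t start = 0;
    while (start <= s.size()) {
        size_t end = s.find(',', start);
        if (end == std::string::npos) end = s.size();
        size_t b = start, e = end;
        while (b < e && s[b] == ' ') ++b;     // trim leading spaces
        while (e > b && s[e - 1] == ' ') --e; // trim trailing spaces
        if (e > b) out.emplace_back(s.substr(b, e - b));
        start = end + 1;
    }
    return out;
}
```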

Fixes #15025

Closes #15113

* github.com:scylladb/scylladb:
  test: add test_raft_ignore_nodes
  test: ManagerClient.remove_node: allow List[HostId] for ignore_dead
  raft topology: pass ignore_nodes to {replace, remove}_with_repair
  raft topology: exec_global_command: add ignore_nodes to exclude_nodes
  raft topology: exec_global_command: change type of exclude_nodes
  topology_state_machine: extend request_param with a set of raft ids
  raft topology: set ignore_nodes in raft_removenode and raft_replace
  utils: introduce split_comma_separated_list
  raft topology: add the ignore_nodes column to system.topology
2023-08-22 18:04:59 +02:00
Patryk Jędrzejczak
16f5db8af2 raft topology: add the ignore_nodes column to system.topology
In the following commits, we add support for --ignore-dead-nodes
in raft_removenode and --ignore-dead-nodes-for-replace in
raft_replace. To make these request parameters accessible for the
topology coordinator, we store them in the new ignore_nodes
column of system.topology.
2023-08-22 10:30:12 +02:00
Avi Kivity
23be6f0336 tablets: change persistent type of replica set from set to list
The system.tablets table stores replica sets as a CQL set type,
which is sorted. This means that if, in a tablet replica set
[n1, n2, n3], n2 is replaced with n4, then on reload we'll see
[n1, n3, n4], changing the relative position of n3 from the third
replica to the second.

The relative position of replicas in a replica set is important
for materialized views, as they use it to pair base replicas with
view replicas. To prepare for materialized views using tablets,
change the persistent data type to list, which preserves order.

The code that generates new replica sets already preserves order:
see locator::replace_replica().

While this changes the system schema, tablets are an experimental
feature so we don't need to worry about upgrades.
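The ordering problem is easy to demonstrate with standard containers; this is a self-contained illustration, not Scylla code:

```c++
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <vector>

int main() {
    // CQL set semantics: sorted, so replacing n2 with n4 shifts n3's position.
    std::set<std::string> as_set{"n1", "n2", "n3"};
    as_set.erase("n2");
    as_set.insert("n4");
    assert(*std::next(as_set.begin(), 1) == "n3"); // n3 became the 2nd replica!

    // CQL list semantics: order is preserved, n3 stays the third replica,
    // keeping base/view replica pairing stable.
    std::vector<std::string> as_list{"n1", "n2", "n3"};
    as_list[1] = "n4"; // what locator::replace_replica() effectively does
    assert(as_list[2] == "n3");
}
```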

Closes #15111
2023-08-21 22:55:14 +02:00
Tomasz Grabiec
fe181b3bac tablets: Balance tablets concurrently with active migrations
After this change, the load balancer can make progress with active
migrations. If the algorithm is called with active tablet migrations
in tablet metadata, those are treated by the load balancer as if they
were already completed. This allows the algorithm to incrementally make
decisions which, when executed alongside the active migrations, will
produce the desired result.

Overload of shards is limited by the fact that the algorithm tracks
streaming concurrency on both the source and target shards of active
migrations, and takes the concurrency limit into account when producing
new migrations.

The coordinator executes the load balancer on the edges of tablet state
machine transitions. This allows new migrations to be started as soon
as tablets finish streaming.

The load balancer is also continuously invoked as long as it produces
a non-empty plan. This is in order to saturate the cluster with
streaming. A single make_plan() call is still not saturating, due
to the way the algorithm is implemented.
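A sketch of planning that treats in-flight migrations as completed while respecting a per-shard streaming concurrency limit; all names and the limit of 2 are illustrative assumptions:

```c++
#include <map>
#include <string>
#include <utility>
#include <vector>

struct shard_load {
    int tablets = 0;    // tablets hosted, with active migrations applied
    int streaming = 0;  // active migrations touching this shard
};

constexpr int max_concurrent_streams = 2; // per-shard limit, hypothetical

// A migration is allowed only if neither endpoint is saturated with
// streaming and it actually reduces the imbalance.
bool can_migrate(const shard_load& from, const shard_load& to) {
    return from.streaming < max_concurrent_streams &&
           to.streaming < max_concurrent_streams &&
           from.tablets > to.tablets + 1;
}

void plan_one(std::map<std::string, shard_load>& shards,
              const std::string& from, const std::string& to,
              std::vector<std::pair<std::string, std::string>>& plan) {
    if (can_migrate(shards[from], shards[to])) {
        // Apply the decision to the planning state immediately, so the next
        // invocation builds incrementally on top of it, as if it completed.
        --shards[from].tablets; ++shards[from].streaming;
        ++shards[to].tablets;   ++shards[to].streaming;
        plan.emplace_back(from, to);
    }
}
```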
2023-07-31 01:45:23 +02:00
Tomasz Grabiec
05519bd5e5 doc: Document tablet migration state machine and load balancer 2023-07-25 21:08:02 +02:00
Benny Halevy
26ff8f7bf7 docs: dml: add update ordering section
and add docs/dev/timestamp-conflict-resolution.md
to document the details of the conflict resolution algorithm.
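For context, a hedged sketch of last-write-wins cell reconciliation as commonly implemented in Cassandra-compatible stores; see the referenced doc for Scylla's exact tie-breaking rules:

```c++
#include <cstdint>
#include <string>

struct cell {
    int64_t timestamp;  // microseconds, supplied by the writer
    bool live;          // false = tombstone
    std::string value;
};

// Higher timestamp wins. On a timestamp tie, deletes beat writes, and
// between two live cells the larger value wins, so every replica converges
// on the same answer regardless of arrival order.
const cell& reconcile(const cell& a, const cell& b) {
    if (a.timestamp != b.timestamp) return a.timestamp > b.timestamp ? a : b;
    if (a.live != b.live) return a.live ? b : a; // tombstone wins the tie
    return a.value >= b.value ? a : b;
}
```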

Refs scylladb/scylladb#14063

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-20 11:55:54 +03:00
Avi Kivity
5acb137c2e Merge 'docs/dev/reader-concurrency-semaphore.md: add section about operations' from Botond Dénes
Containing two tables, describing all the possible operations seen in user, system and streaming semaphore diagnostics dumps.

Closes #14171

* github.com:scylladb/scylladb:
  docs/dev/reader-concurrency-semaphore.md: add section about operations
  docs/dev/reader-concurrency-semaphore.md: switch to # headers markings
  reader_concurrency_semaphore: s/description/operation/ in diagnostics dumps
2023-06-07 22:53:18 +03:00
Botond Dénes
0c632b6e3d docs/dev/reader-concurrency-semaphore.md: add section about operations
Containing two tables, describing all the possible operations seen in
user, system and streaming semaphore diagnostics dumps.
2023-06-07 14:22:52 +03:00