Commit Graph

341 Commits

Author SHA1 Message Date
Michael Huang
62a8a31be7 cdc: use chunked_vector for topology_description entries
Lists can grow very big. Let's use a chunked vector to prevent large contiguous
allocations.
Fixes: #15302.

Closes scylladb/scylladb#15428
2023-09-18 23:17:01 +03:00
Kefu Chai
30ef69fcb2 docs/dev/object_store: add more samples
in hope to lower the bar to testing object store.

* add language specifier for better readability of the document.
  to highlight the config with YAML syntax
* add more specific comment on the AWS related settings
* explain that endpoint should match in the CREATE KEYSPACE
  statement and the one defined by the YAML configuration.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15433
2023-09-15 17:35:17 +03:00
Avi Kivity
a3d73bfba7 Merge 'Add support for decommission with tablets' from Tomasz Grabiec
Load balancer will recognize decommissioning nodes and will
move tablet replicas away from such nodes with highest priority.

Topology changes have now an extra step called "tablet draining" which
calls the load balancer. The step will execute tablet migration track
as long as there are nodes which require draining. It will not do regular
load balancing.

If load balancer is unable to find new tablet replicas, because RF
cannot be met or availability is at risk due to insufficient node
distribution in racks, it will throw an exception. Currently, topology
change will retry in a loop. We should make this error cause topology
change to be aborted. There is no infrastructure for
aborts yet, so this is not implemented.

Closes #15197

* github.com:scylladb/scylladb:
  tablets, raft topology: Add support for decommission with tablets
  tablet_allocator: Compute load sketch lazily
  tablet_allocator: Set node id correctly
  tablet_allocator: Make migration_plan a class
  tablets: Implement cleanup step
  storage_service, tablets: Prevent stale RPCs from running beyond their stage
  locator: Introduce tablet_metadata_guard
  locator, replica: Add a way to wait for table's effective_replication_map change
  storage_service, tablets: Extract do_tablet_operation() from stream_tablet()
  raft topology: Add break in the final case clause
  raft topology: Fix SIGSEGV when trace-level logging is enabled
  raft topology: Set node state in topology
  raft topology: Always set host id in topology
2023-09-14 17:16:23 +03:00
Tomasz Grabiec
551cc0233d tablets, raft topology: Add support for decommission with tablets
Load balancer will recognize decommissioning nodes and will
move tablet replicas away from such nodes with highest priority.

Topology changes have now an extra step called "tablet draining" which
calls the load balancer. The step will execute tablet migration track
as long as there are nodes which require draining. It will not do regular
load balancing.

If load balancer is unable to find new tablet replicas, because RF
cannot be met or availability is at risk due to insufficient node
distribution in racks, it will throw an exception. Currently, topology
change will retry in a loop. We should make this error cause topology
change to be paused so that admin becomes aware of the problem and
issues an abort on the topology change. There is no infrastructure for
aborts yet, so this is not implemented.
2023-09-14 13:05:49 +02:00
Kefu Chai
60c293ed7d doc/dev: correct the path to object_storage.yaml
we get the path object storage config like:

```c++
db::config::get_conf_sub("object_storage.yaml").native()
```
so, the default path should be $SCYLLA_CONF/object_storage.yaml.

in this change, it is corrected.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15406
2023-09-14 10:40:55 +03:00
Patryk Jędrzejczak
1c58c6336a system_keyspace: change id to timeuuid in CDC_GENERATIONS_V3
We change the type of IDs in CDC_GENERATIONS_V3 to timeuuid to
give them a time-based order. We also change how we initialize
them so that the new CDC generation always has the highest ID.
This is the last step to enabling the efficient clearing of
obsolete CDC generation data.

Additionally, we change the types of current_cdc_generation_uuid,
new_cdc_generation_data_uuid and the second values of the elements
in unpublished_cdc_generations to timeuuid, so that they match id
in CDC_GENERATIONS_V3.
2023-09-12 11:43:34 +02:00
Patryk Jędrzejczak
fab066cffe cdc: generation: remove topology_description_generator
After moving the creation of uuid out of
make_new_generation_description, this function only calls the
topology_description_generator's constructor and its generate
method. We could remove this function, but we instead simplify
the code by removing the topology_description_generator class.
We can do this refactor because make_new_generation_description
is the only place using it. We inline its generate method into
make_new_generation_description and turn its private methods into
static functions.
2023-09-12 11:18:54 +02:00
Patryk Jędrzejczak
2cd430ac80 system_kayspace: make CDC_GENERATIONS_V3 single-partition
We make CDC_GENERATIONS_V3 single-partition by adding the key
column and changing the clustering key from range_end to
(id, range_end). This is the first step to enabling the efficient
clearing of obsolete CDC generation data, which we need to prevent
Raft-topology snapshots from endlessly growing as we introduce new
generations over time. The next step is to change the type of the id
column to timeuuid. We do it in the following commits.

After making CDC_GENERATIONS_V3 single-partition, there is no easy
way of preserving the num_ranges column. As it is used only for
sanity checking, we remove it to simplify the implementation.
2023-09-12 09:51:45 +02:00
Patryk Jędrzejczak
2643ccc70e docs: remove information about publish_cdc_generation
We update documentation after replacing the
topology::transition_state::publish_cdc_generation state with
the CDC generation publisher fiber.
2023-09-08 09:05:01 +02:00
Patryk Jędrzejczak
5ed9d4db6d raft topology: add unpublished_cdc_generations to system.topology
In the following commits, we replace the
topology::transition_state::publish_cdc_generation state with
a background fiber that continually publishes committed CDC
generations. To make these generations accessible to the
topology coordinator, we store them in the new column of
system.topology -- unpublished_cdc_generations.
2023-09-08 09:05:01 +02:00
Nadav Har'El
5625624533 doc/dev: add document about analyzing build time
Add a document describing in detail how to use clang's "-ftime-trace"
option, and the ClangBuildAnalyzer tool, to find the source files,
header files and templates which slow down Scylla's build the most.

I've used this tool in the past to reduce Scylla build time - see
commits:

   fa7a302130 (reduced 6.5%)
   f84094320d (reduced 0.1%)
   6ebf32f4d7 (reduced 1%)
   d01e1a774b (reduced 4%)

I'm hoping that documenting how to use this tool will allow other
developers to suggest similar commits.

Refs #1.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #15209
2023-09-01 11:33:36 +03:00
Kamil Braun
169d19e5b0 Merge 'raft topology: support --ignore-dead-nodes in removenode and replace' from Patryk Jędrzejczak
We add support for `--ignore-dead-nodes` in `raft_removenode` and
`--ignore-dead-nodes-for-replace` in `raft_replace`. For now, we allow
passing only host ids of the ignored nodes. Supporting IPs is currently
impossible because `raft_address_map` doesn't provide a mapping from IP
to a host id.

The main steps of the implementation are as follows:
- add the `ignore_nodes` column to `system.topology`,
- set the `ignore_nodes` value of the topology mutation in `raft_removenode` and `raft_replace`,
- extend `service::request_param` with alternative types that allow storing a set of ids of the ignored nodes,
- load `ignore_nodes` from `system.topology` into `request_param` in `system_keyspace::load_topology_state`,
- add `ignore_nodes` to `exclude_nodes` in `topology_coordinator::exec_global_command`,
- pass `ignore_nodes` to `replace_with_repair` and `remove_with_repair` in `storage_service::raft_topology_cmd_handler`.

Additionally, we add `test_raft_ignore_nodes.py` with two tests that verify the added changes.

Fixes #15025

Closes #15113

* github.com:scylladb/scylladb:
  test: add test_raft_ignore_nodes
  test: ManagerClient.remove_node: allow List[HostId] for ignore_dead
  raft topology: pass ignore_nodes to {replace, remove}_with_repair
  raft topology: exec_global_command: add ignore_nodes to exclude_nodes
  raft topology: exec_global_command: change type of exclude_nodes
  topology_state_machine: extend request_param with a set of raft ids
  raft topology: set ignore_nodes in raft_removenode and raft_replace
  utils: introduce split_comma_separated_list
  raft topology: add the ignore_nodes column to system.topology
2023-08-22 18:04:59 +02:00
Patryk Jędrzejczak
16f5db8af2 raft topology: add the ignore_nodes column to system.topology
In the following commits, we add support for --ignore-dead-nodes
in raft_removenode and --ignore-dead-nodes-for-replace in
raft_replace. To make these request parameters accessible for the
topology coordinator, we store them in the new ignore_nodes
column of system.topology.
2023-08-22 10:30:12 +02:00
Avi Kivity
23be6f0336 tablets: change persistent type of replica set from set to list
The system.tablets table stores replica sets as a CQL set type,
which is sorted. This means that if, in a tablet replica set
[n1, n2, n3] n2 is replaced with n4, then on reload we'll see
[n1, n3, n4], changing the relative position of n3 from the third
replica to the second.

The relative position of replicas in a replica set is important
for materialized views, as they use it to pair base replicas with
view replicas. To prepare for materialized views using tablets,
change the persistent data type to list, which preserves order.

The code that generates new replica sets already preserves order:
see locator::replace_replica().

While this changes the system schema, tablets are an experimental
feature so we don't need to worry about upgrades.

Closes #15111
2023-08-21 22:55:14 +02:00
Tomasz Grabiec
fe181b3bac tablets: Balance tablets concurrently with active migrations
After this change, the load balancer can make progress with active
migrations. If the algorithm is called with active tablet migrations
in tablet metadata, those are treated by load balancer as if they were
already completed. This allows the algorithm to incrementally make
decision which when executed with active migrations will produce the
desired result.

Overload of shards is limited by the fact that the algorithm tracks
streaming concurrency on both source and target shards of active
migrations and takes concurrency limit into account when producing new
migrations.

The coordinator executes the load balancer on edges of tablet state
machine stransitions. This allows new migrations to be started as soon
as tablets finish streaming.

The load balancer is also continuously invoked as long as it produces
a non-empty plan. This is in order to saturate the cluster with
streaming. A single make_plan() call is still not saturating, due
to the way algorithm is implemented.
2023-07-31 01:45:23 +02:00
Tomasz Grabiec
05519bd5e5 doc: Document tablet migration state machine and load balancer 2023-07-25 21:08:02 +02:00
Benny Halevy
26ff8f7bf7 docs: dml: add update ordering section
and add docs/dev/timestamp-conflict-resolution.md
to document the details of the conflict resolution algorithm.

Refs scylladb/scylladb#14063

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-20 11:55:54 +03:00
Avi Kivity
5acb137c2e Merge 'docs/dev/reader-concurrency-semaphore.md: add section about operations' from Botond Dénes
Containing two tables, describing all the possible operations seen in user, system and streaming semaphore diagnostics dumps.

Closes #14171

* github.com:scylladb/scylladb:
  docs/dev/reader-concurrency-semaphore.md: add section about operations
  docs/dev/reader-concurrency-semaphore.md: switch to # headers markings
  reader_concurrency_semaphore: s/description/operation/ in diagnostics dumps
2023-06-07 22:53:18 +03:00
Botond Dénes
0c632b6e3d docs/dev/reader-concurrency-semaphore.md: add section about operations
Containing two tables, describing all the possible operations seen in
user, system and streaming semaphore diagnostics dumps.
2023-06-07 14:22:52 +03:00
Botond Dénes
0067fa0a09 docs/dev/reader-concurrency-semaphore.md: switch to # headers markings
As they allow for more levels, than the current `---` and `===` ones.
2023-06-07 14:22:10 +03:00
Botond Dénes
c4faa05888 reader_concurrency_semaphore: s/description/operation/ in diagnostics dumps
"description" is not the respective column contains, so fix the header.
2023-06-07 14:21:48 +03:00
Marcin Maliszkiewicz
8b06684a8c docs: dev: document pytest run convenience script
Closes #13995
2023-06-07 12:37:52 +03:00
Kefu Chai
8e7c7e1079 docs/dev/repair_based_node_ops: better formatting
* indent the nested paragraphs of list items
* use table to format the time sequence for better
  readability

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14016
2023-05-25 08:31:43 +03:00
Botond Dénes
eb457b6104 Merge 'fixed broken links, added community forum link, university link, spelling and other mistakes' from Guy Shtub
Closes #13979

* github.com:scylladb/scylladb:
  Update docker-hub.md
  Update docs/dev/docker-hub.md
  Update docs/dev/docker-hub.md
  Update docs/dev/docker-hub.md
  Update docs/dev/docker-hub.md
  Update docs/dev/docker-hub.md
  fixed broken links, added community forum link, university link,  other mistakes
2023-05-24 09:58:58 +03:00
Guy Shtub
65c0afc899 Update docker-hub.md 2023-05-24 07:34:58 +03:00
Guy Shtub
7e3d768369 Update docs/dev/docker-hub.md
Co-authored-by: Anna Stuchlik <37244380+annastuchlik@users.noreply.github.com>
2023-05-24 07:27:07 +03:00
Guy Shtub
6329036656 Update docs/dev/docker-hub.md
Co-authored-by: Anna Stuchlik <37244380+annastuchlik@users.noreply.github.com>
2023-05-24 07:26:42 +03:00
Guy Shtub
3538a2e1c2 Update docs/dev/docker-hub.md
Co-authored-by: Anna Stuchlik <37244380+annastuchlik@users.noreply.github.com>
2023-05-24 07:23:51 +03:00
Guy Shtub
53183d6302 Update docs/dev/docker-hub.md
Co-authored-by: Anna Stuchlik <37244380+annastuchlik@users.noreply.github.com>
2023-05-24 07:23:37 +03:00
Guy Shtub
2677d47bbc Update docs/dev/docker-hub.md
Co-authored-by: Anna Stuchlik <37244380+annastuchlik@users.noreply.github.com>
2023-05-24 07:23:28 +03:00
Kefu Chai
b8c565875b docs/dev/system_keyspace: add raft table
it is one of the non-volatile tables. we need add more of them.
but let's do this piecemeal.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-24 10:08:04 +08:00
Kefu Chai
eee0003312 docs/dev/system_keyspace: move sstables and tablets into another section
not all tables in system keyspace are volatile. among other things,
system.sstables and system.tablets are persisted using sstables like
regular user tables. so move them into the section where we have
other regular tables there.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-24 10:08:03 +08:00
Kefu Chai
1246568e3b docs/dev/system_keyspace: use timeuuid for sstables.generation
we changed the type of generation column in system.sstables
from bigint to timeuuid in 74e9e6dd1a
but that change failed to update the document accordingly. so let's
update the document to reflect the change.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13994
2023-05-23 14:37:28 +03:00
Guy Shtub
eefaad189a fixed broken links, added community forum link, university link, other mistakes 2023-05-22 13:12:16 +03:00
Tomasz Grabiec
9d4bca26cc Merge 'raft topology: implement check_and_repair_cdc_streams API' from Kamil Braun
`check_and_repair_cdc_streams` is an existing API which you can use when the
current CDC generation is suboptimal, e.g. after you decommissioned a node the
current generation has more stream IDs than you need. In that case you can do
`nodetool checkAndRepairCdcStreams` to create a new generation with fewer
streams.

It also works when you change number of shards on some node. We don't
automatically introduce a new generation in that case but you can use
`checkAndRepairCdcStreams` to create a new generation with restored
shard-colocation.

This PR implements the API on top of raft topology, it was originally
implemented using gossiper.  It uses the `commit_cdc_generation` topology
transition state and a new `publish_cdc_generation` state to create new CDC
generations in a cluster without any nodes changing their `node_state`s in the
process.

Closes #13683

* github.com:scylladb/scylladb:
  docs: update topology-over-raft.md
  test: topology_experimental_raft: test `check_and_repair_cdc` API
  raft topology: implement `check_and_repair_cdc_streams` API
  raft topology: implement global request handling
  raft topology: introduce `prepare_new_cdc_generation_data`
  raft_topology: `get_node_to_work_on_opt`: return guard if no node found
  raft topology: remove `node_to_work_on` from `commit_cdc_generation` transition
  raft topology: separate `publish_cdc_generation` state
  raft topology: non-node-specific `exec_global_command`
  raft topology: introduce `start_operation()`
  raft topology: non-node-specific `topology_mutation_builder`
  topology_state_machine: introduce `global_topology_request`
  topology_state_machine: use `uint16_t` for `enum_class`es
  raft topology: make `new_cdc_generation_data_uuid` topology-global
2023-05-22 11:33:58 +02:00
Calle Wilund
469e710caa docs: Add initial doc on commitlog segment file format
Refs #12849

Just a few lines on the file format of segments.

Closes #13848
2023-05-15 16:22:44 +03:00
Kamil Braun
ddb5b45aef docs: update topology-over-raft.md
It was already outdated before this PR.
Describe the version of topology state machine implemented in this PR.
Fix some typos and make it proper markdown so it renders nicely on
GitHub etc.
2023-05-08 16:49:01 +02:00
Pavel Emelyanov
0b18e3bff9 doc: Add a document describing how to configure S3 backend
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-03 20:23:38 +03:00
Tomasz Grabiec
9d786c1ebc db: tablets: Add persistence layer 2023-04-24 10:49:37 +02:00
Kamil Braun
88aff50e8b docs: cdc: describe generation changes using group 0 topology coordinator
Update the `Generation switching` section: most of the existing
description landed in `Gossiper-based topology changes` subsection, and
a new subsection was added to describe Raft group 0 based topology
changes. Marked as WIP - we expect further development in this area
soon.

The existing gossiper-based description was also updated a bit.
2023-04-20 16:36:41 +02:00
Botond Dénes
edc75f51ff docs/dev/reader-concurrency-semaphore.md: expand on how the semaphore works
Greatly expand on the details of how the semaphore works.
Organize the content into thematic chapters to improve navigation.
Improve formatting while at it.
2023-04-14 08:51:24 -04:00
Botond Dénes
943ae7fc69 reader_permit: give better names to active* states
The names of these states have been the source of confusion ever since
they were introduced. Give them names which better reflects their true
meaning and gives less room for misinterpretation. The changes are:
* active/unused  -> active
* active/used    -> active/need_cpu
* active/blocked -> active/await

Hopefully the new names do a better job at conveying what these states
really mean:
* active - a regular admitted permit, which is active (as opposed to
  an inactive permit).
* active/need_cpu - an active permit which was marked as needing CPU for
  the read to make progress. This permit prevents admission of new
  permits while it is in this state.
* active/await - a former active/need_cpu permit, which has to wait on
  I/O or a remote shard. While in this state, it doesn't block the
  admission of new permits (pending other criteria such as resource
  availability).
2023-04-14 08:40:46 -04:00
Pavel Emelyanov
08e9046d07 system_keyspace: Add ownership table
The schema is

CREATE TABLE system.sstables (
    location text,
    generation bigint,
    format text,
    status text,
    uuid uuid,
    version text,
    PRIMARY KEY (location, generation)
)

A sample entry looks like:

 location                                                            | generation | format | status | uuid                                 | version
---------------------------------------------------------------------+------------+--------+--------+--------------------------------------+---------
 /data/object_storage_ks/test_table-d096a1e0ad3811ed85b539b6b0998182 |          2 |    big | sealed | d0a743b0-ad38-11ed-85b5-39b6b0998182 |      me

The uuid field points to the "folder" on the storage where the sstable
components are. Like this:

s3
`- test_bucket
   `- f7548f00-a64d-11ed-865a-0c1fbc116bb3
      `- Data.db
       - Index.db
       - Filter.db
       - ...

It's not very nice that the whole /var/lib/... path is in fact used as
location, it needs the PR #12707 to fix this place.

Also, the "status" part is not yet fully functional, it only supports
three options:

- creating -- the same as TemporaryTOC file exists on disk
- sealed -- default state
- deleting -- the analogy for the deletion log on disk

The latter needs support from the distributed_loader, which's not yet
there. In fact, distributes_loader also needs to be patched to actualy
select entries from this table on load. Also it needs the mentioned
PR #12707 to support staging and quarantine sstables.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-10 16:44:28 +03:00
Kefu Chai
c24a9600af docs: dev: correct a typo
s/By expending/By expanding/

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13392
2023-03-31 17:19:08 +03:00
Kefu Chai
11cea36c12 docs: dev: write mathematical expressions in LaTeX
for better readability

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13341
2023-03-29 15:07:14 +03:00
Gleb Natapov
5e232ebee5 system_keyspace: add a table to persist topology change state machine's state
Add local table to store topology change state machine's state there.
Also add a function that loads the state to memory.
2023-03-21 16:06:43 +02:00
Gleb Natapov
a2b7d2c1a1 service: Introduce topology state machine data structures
The topology state machine will track all the nodes in a cluster,
their state, properties (topology, tokens, etc) and requested actions.

Node state can be one of those:
 none             - the node is not yet in the cluster
 bootstrapping    - the node is currently bootstrapping
 decommissioning  - the node is being decommissioned
 removing         - the node is being removed
 replacing        - the node is replacing another node
 normal           - the node is working normally
 rebuild          - the node is being rebuilt
 left             - the node is left the cluster

Nodes in state left are never removed from the state.

Tokens also can be in one of the states:

write_both_read_old - writes are going to new and old replica, but reads are from
                      old replicas still
write_both_read_new - writes still going to old and new replicas but reads are
                      from new replica
owner               - tokens are owned by the node and reads and write go to new
                      replica set only

Tokens that needs to be move start in 'write_both_read_old' state. After entire
cluster learns about it streaming start. After the streaming tokens move
to 'write_both_read_new' state and again the whole cluster needs to learn about it
and make sure no reads started before that point exist in the system.
After that tokens may move to 'owner' state.

topology_request is the field through which a topology operation request
can be issued to a node. A request is one of the topology operation
currently supported: join, leave, replace or remove.
2023-03-21 16:06:43 +02:00
Wojciech Mitros
52eb70aef0 docs: make wasm documentation visible for users
Until now, the instructions on generating wasm files and using them
for Scylla UDFs were stored in docs/dev, so they were not visible
on the docs website. Now that the Rust helper library for UDFs
is ready, and we're inviting users to try it out, we should also
make the rest of the Wasm UDF documentation readily available
for the users.

Closes #13139
2023-03-14 16:21:23 +02:00
Wojciech Mitros
d4851ccae7 treewide: rename the "xwasm" UDF language to "wasm"
When the WASM UDFs were first introduced, the LANGUAGE required in
the CQL statements to use them was "xwasm", because the ABI for the
UDFs was still not specified and changes to it could be backwards
incompatible.
Now, the ABI is stabilized, but if backwards incompatible changes
are made in the future, we will add a new ABI version for them, so
the name "xwasm" is no longer needed and we can finally
change it to "wasm".

Closes #13089
2023-03-07 10:21:11 +02:00
Wojciech Mitros
6d2e785b5c docs: update wasm.md
The WASM UDF implementation has changed since the last time the docs
were written. In particular, the Rust helper library has been
released, and using it should be the recommended method.
Some decisions that were only experimental at the start, were also
"set in stone", so we should refer to them as such.

The docs also contain some code examples. This patch adds tests for
these examples to make sure that they are not wrong and misleading.

Closes #12941
2023-02-28 20:59:25 +02:00