Commit Graph

249 Commits

Author SHA1 Message Date
Andrei Chekun
747f2b1301 docs: add more steps in installation of test.py
Documentation for --gather-metric parameter was missing. This functionality can
break regular flow of using test.py, because of possible misconfiguration of
the cgroup on the local machine. Added explanation how to deal with potential
issue of gathering metrics functionality and how to switch it off.

Fixes: https://github.com/scylladb/scylladb/issues/20763

Closes scylladb/scylladb#24095
2025-05-13 13:08:18 +03:00
Kefu Chai
3e3f583b84 docs/dev/tombstone.md: fix a typo
s/alwas/always/

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23734
2025-04-15 10:54:42 +03:00
Avi Kivity
5e1cf90a51 build: replace tools/java submodule with packaged cassandra-stress
We no longer use tools/java (scylladb/scylla-tools-java.git) for
nodetool or cqlsh; only cassandra-stress. Since that is available
in package form install that and excise the tools/java submodule
from the source tree.

pgo/ is adjusted to use the packaged cassandra-stress (and the cqlsh
submodule).

A few jmx references are dropped as well.

Frozen toolchain regenerated.

Optimized clang from

  https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz
  https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz

Closes scylladb/scylladb#23698
2025-04-15 10:11:28 +03:00
Avi Kivity
9559e53f55 Merge 'Adjust tablet-mon.py for capacity-aware load balancing' from Tomasz Grabiec
After load-balancer was made capacity-aware it no longer equalizes tablet count per shard, but rather utilization of shard's storage. This makes the old presentation mode not useful in assessing whether balance was reached, since nodes with less capacity will get fewer tablets when in balanced state. This PR adds a new default presentation mode which scales tablet size by its storage utilization so that tablets which have equal shard utilization take equal space on the graph.

To facilitate that, a new virtual table was added: system.load_per_node, which allows the tool to learn about load balancer's view on per-node capacity. It can also serve as a debugging interface to get a view of current balance according to the load-balancer.

Closes scylladb/scylladb#23584

* github.com:scylladb/scylladb:
  tablet-mon.py: Add presentation mode which scales tablet size by its storage utilization
  tablet-mon.py: Center tablet id text properly in the vertical axis
  tablet-mon.py: Show migration stage tag in table mode only when migrating
  virtual-tables: Introduce system.load_per_node
  virtual_tables: memtable_filling_virtual_table: Propagate permit to execute()
  docs: virtual-tables: Fix instructions
  service: tablets: Keep load_stats inside tablet_allocator
2025-04-10 14:59:08 +03:00
Tomasz Grabiec
b5211cca85 Merge 'tablets: rebuild: use repair for tablet rebuild' from Aleksandra Martyniuk
Currently, when we rebuild a tablet, we stream data from all
replicas. This creates a lot of redundancy, wastes bandwidth
and CPU resources.

In this series, we split the streaming stage of tablet rebuild into
two phases: first we stream tablet's data from only one replica
and then repair the tablet.

Fixes: https://github.com/scylladb/scylladb/issues/17174.

Needs backport to 2025.1 to prevent out of space during streaming

Closes scylladb/scylladb#23187

* github.com:scylladb/scylladb:
  test: add test for rebuild with repair
  locator: service: move to rebuild_v2 transition if cluster is upgraded
  locator: service: add transition to rebuild_repair stage for rebuild_v2
  locator: service: add rebuild_repair tablet transition stage
  locator: add maybe_get_primary_replica
  locator: service: add rebuild_v2 tablet transition kind
  gms: add REPAIR_BASED_TABLET_REBUILD cluster feature
2025-04-09 21:35:37 +02:00
Tomasz Grabiec
0b9a75d7b6 virtual-tables: Introduce system.load_per_node
Can be used to query per-node stats about load as seen by the load
balancer.

In particular, node's capacity will be used by tablet-mon.py to
scale tablet columns so that equal height is equal node utilization.
2025-04-09 20:21:51 +02:00
Tomasz Grabiec
34beaa30b5 docs: virtual-tables: Fix instructions 2025-04-09 20:21:51 +02:00
Botond Dénes
583a813d17 docs/dev/tombstone.md: fix link to ddl.html
Closes scylladb/scylladb#23622
2025-04-08 16:18:50 +03:00
Aleksandra Martyniuk
eb17af6143 locator: service: add transition to rebuild_repair stage for rebuild_v2
Modify write_both_read_old and streaming stages in rebuild_v2 transition
kind: write_both_read_old moves to rebuild_repair stage and streaming stage
streams data only from one replica.
2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk
ed7b8bb787 locator: service: add rebuild_v2 tablet transition kind
Currently, in the streaming stage of rebuild tablet transition,
we stream tablet data from all replicas.
This patch series splits the streaming stage into two phases:
- repair phase, where we repair the tablet;
- streaming phase, where we stream tablet data from one replica.

To differentiate the two streaming methods, a new tablet transition
kind - rebuild_v2 - is added.

The transtions and stages for rebuild_v2 transition kind will be
added in the following patches.
2025-04-08 10:42:01 +02:00
Botond Dénes
3bad46a6e2 docs/dev: add tombstone.md
An exhaustive document on the tombstone related internal logic as well
as the user-facing aspects.

Closes scylladb/scylladb#23454
2025-04-01 20:17:57 +03:00
Pavel Emelyanov
2ee9cec1d3 Merge 'Remove object_storage.yaml and move the endpoints to scylla.yaml' from Robert Bindar
Move `object_storage.yaml` endpoints to `scylla.yaml`

This change also removes the `object_storage.yaml` file
altogether and adds tests for fetching the endpoints
via the `v2/config/object_storage_endpoints` REST api.

Also, `object_storage_config_file` options is moved to a deprecated state as it's no longer needed.

This PR depends on #22951, the reviewers should review patch 393e1ac0ec066475ca94094265a5f88dbbdb1a1f

Refs https://github.com/scylladb/scylladb/issues/22428

Closes scylladb/scylladb#22952

* github.com:scylladb/scylladb:
  Remove db::config::object_storage_config
  Move `object_storage.yaml` endpoints to `scylla.yaml`
2025-04-01 16:01:44 +03:00
Michał Chojnowski
d33ffb221b docs/dev: add sstable-compression-dicts.md 2025-04-01 00:07:31 +02:00
Robert Bindar
e3a3508960 Move object_storage.yaml endpoints to scylla.yaml
This change also removes the `object_storage.yaml` file
altogether and adds tests for fetching the endpoints
via the `v2/config/object_storage_endpoints` REST api.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-03-31 13:39:39 +03:00
Avi Kivity
2b9e1e61d0 docs: reader_concurrency_semaphore: document CPU concurrency limit
Document the CPU concurrency implemented in 3d816b7c16
and adjusted in 3d12451d1f.

Closes scylladb/scylladb#23404
2025-03-31 09:39:55 +03:00
Aleksandra Martyniuk
4c75701756 docs: locator: update the docs and formatter of tablet_task_info 2025-02-14 09:13:11 +01:00
TripleChecker
8d64be94e2 Fix typos 2025-02-13 01:54:08 +02:00
TripleChecker
e72e6fadeb Fix typos 2025-02-11 00:17:43 +02:00
Pavel Emelyanov
81f7a6d97d doc: Update system.sstables table schema description
The partition key had been renamed and its type changed some time ago,
but the doc wasn't updated. Fix it.

refs: #20998

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#22683
2025-02-10 16:09:49 +02:00
Botond Dénes
51a273401c Merge 'test: tablets_test: Create proper schema in load balancer tests' from Tomasz Grabiec
This PR converts boost load balancer tests in preparation for load balancer changes
which add per-table tablet hints. After those changes, load balancer consults with the replication
strategy in the database, so we need to create proper schema in the
database. To do that, we need proper topology for replication
strategies which use RF > 1, otherwise keyspace creation will fail.

Topology is created in tests via group0 commands, which is abstracted by
the new `topology_builder` class.

Tests cannot modify token_metadata only in memory now as it needs to be
consistent with the schema and on-disk metadata. That's why modifications to
tablet metadata are now made under group0 guard and save back metadata to disk.

Closes scylladb/scylladb#22648

* github.com:scylladb/scylladb:
  test: tablets: Drop keyspace after do_test_load_balancing_merge_colocation() scenario
  tests: tablets: Set initial tablets to 1 to exit growing mode
  test: tablets_test: Create proper schema in load balancer tests
  test: lib: Introduce topology_builder
  test: cql_test_env: Expose topology_state_machine
  topology_state_machine: Introduce lock transition
2025-02-10 16:08:41 +02:00
Avi Kivity
9712390336 Merge 'Add per-table tablet options in schema' from Benny Halevy
This series extends the table schema with per-table tablet options.
The options are used as hints for initial tablet allocation on table creation and later for resize (split or merge) decisions,
when the table size changes.

* New feature, no backport required

Closes scylladb/scylladb#22090

* github.com:scylladb/scylladb:
  tablets: resize_decision: get rid of initial_decision
  tablet_allocator: consider tablet options for resize decision
  tablet_allocator: load_balancer: table_size_desc: keep target_tablet_size as member
  network_topology_strategy: allocate_tablets_for_new_table: consider tablet options
  network_topology_strategy: calculate_initial_tablets_from_topology: precalculate shards per dc using for_each_token_owner
  network_topology_strategy: calculate_initial_tablets_from_topology: set default rf to 0
  cql3: data_dictionary: format keyspace_metadata: print "enabled":true when initial_tablets=0
  cql3/create_keyspace_statement: add deprecation warning for initial tablets
  test: cqlpy: test_tablets: add tests for per-table tablet options
  schema: add per-table tablet options
  feature_service: add TABLET_OPTIONS cluster schema feature
2025-02-08 20:32:19 +02:00
Tomasz Grabiec
61532eb53b topology_state_machine: Introduce lock transition
Will be used in load balancer tests to prevent concurrent topology
operations, in particular background load balancing.

load balancer will be invoked explicitly by the test. Disabling load
balancer in topology is not a solution, because we want the explicit
call to perform the load balancing.
2025-02-07 16:09:21 +01:00
Avi Kivity
861fb58e14 Merge 'vector: add support for vector type' from Dawid Pawlik
This pull request is an implementation of vector data type similar to one used by Apache Cassandra.

The patch contains:
- implementation of vector_type_impl class
- necessary functionalities similar to other data types
- support for serialization and deserialization of vectors
- support for Lua and JSON format
- valid CQL syntax for `vector<>` type
- `type_parser` support for vectors
- expression adjustments such as:
    - add `collection_constructor::style_type::vector`
    - rename `collection_constructor::style_type::list` to `collection_constructor::style_type::list_or_vector`
- vector type encoding (for drivers)
- unit tests
- cassandra compatibility tests
- necessary documentation

Co-authored-by: @janpiotrlakomy

Fixes https://github.com/scylladb/scylladb/issues/19455

Closes scylladb/scylladb#22488

* github.com:scylladb/scylladb:
  docs: add vector type documentation
  cassandra_tests: translate tests covering the vector type
  type_codec: add vector type encoding
  boost/expr_test: add vector expression tests
  expression: adjust collection constructor list style
  expression: add vector style type
  test/boost: add vector type cql_env boost tests
  test/boost: add vector type_parser tests
  type_parser: support vector type
  cql3: add vector type syntax
  types: implement vector_type_impl
2025-02-06 20:36:50 +02:00
Benny Halevy
c5668d99c9 schema: add per-table tablet options
Unlike with vnodes, each tablet is served only by a single
shard, and it is associated with a memtable that, when
flushed, it creates sstables which token-range is confined
to the tablet owning them.

On one hand, this allows for far better agility and elasticity
since migration of tablets between nodes or shards does not
require rewriting most if not all of the sstables, as required
with vnodes (at the cleanup phase).

Having too few tablets might limit performance due not
being served by all shards or by imbalance between shards
caused by quantization.  The number of tabelts per table has to be
a power of 2 with the current design, and when divided by the
number of shards, some shards will serve N tablets, while others
may serve N+1, and when N is small N+1/N may be significantly
larger than 1. For example, with N=1, some shards will serve
2 tablet replicas and some will serve only 1, causing an imbalance
of 100%.

Now, simply allocating a lot more tablets for each table may
theoretically address this problem, but practically:
a. Each tablet has memory overhead and having too many tablets
in the system with many tables and many tablets for each of them
may overwhelm the system's and cause out-of-memory errors.
b. Too-small tablets cause a proliferation of small sstables
that are less efficient to acces, have higher metadata overhead
(due to per-sstable overhead), and might exhaust the system's
open file-descriptors limitations.

The options introduced in this change can help the user tune
the system in two ways:
1. Sizing the table to prevent unnecessary tablet splits
and migrations.  This can be done when the table is created,
or later on, using ALTER TABLE.
2. Controlling min_per_shard_tablet_count to improve
tablet balancing, for hot tables.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-06 08:55:51 +02:00
Ernest Zaslavsky
29e60288de docs: update the object_storage.md and admin.rst
Added additional options and best practices for AWS authentication.
2025-02-05 14:57:19 +02:00
Dawid Pawlik
a68bf6dcc1 docs: add vector type documentation
Add missing vector type documentation including: definition of vector,
adjustment of term definition, JSON encoding, Lua and cql3 type
mapping, vector dimension limit, and keyword specification.
2025-01-28 21:14:49 +01:00
Aleksandra Martyniuk
18cc79176a api: task_manager: do not unregister tasks on get_status
Currently, /task_manager/task_status_recursive/{task_id} and
/task_manager/task_status/{task_id} unregister queries task if it
has already finished.

The status should not disappear after being queried. Do not unregister
finished task when its status or recursive status is queried.
2025-01-27 11:23:45 +01:00
Aleksandra Martyniuk
e37d1bcb98 api: task_manager: add /task_manager/drain
In the following patches, get_status won't be unregistering finished
tasks. However, tests need a functionality to drop a task, so that
they could manipulate only with the tasks for operations that were
invoked by these tests.

Add /task_manager/drain/{module} to unregister all finished tasks
from the module. Add respective nodetool command.
2025-01-27 11:23:45 +01:00
Paweł Zakrzewski
702e727e33 audit: Add documentation for the audit subsystem
Adds detailed documentation covering the new audit subsystem:

- Add new audit.md design document explaining:
  - Core concepts and design decisions
  - CQL extensions for audit management
  - Implementation details and trigger evaluation
  - Prior art references from other databases

- Add user-facing documentation:
  - New auditing.rst guide with configuration and usage details
  - Integration with security documentation index
  - Updates to cluster management procedures
  - Updates to security checklist

The documentation covers all aspects of the audit system including:
- Configuration options and storage backends (syslog/table)
- Audit categories (DCL/DDL/AUTH/DML/QUERY/ADMIN)
- Permission model and security considerations
- Failure handling and logging
- Example configurations and output formats

This ensures users have complete guidance for setting up and using
the new audit capabilities.
2025-01-15 11:10:35 +01:00
Kefu Chai
f8885a4afd dist/docker,docs: replace "--experimental" with "--experimental-features"
The "--experimental" option was removed in commit f6cca741ea. Using this
deprecated option now causes Scylla to fail with the error:

```
error: the argument ('on') for option '--experimental-features' is invalid
```

So, in this change, let's update the docker entry point script to use
`--experimental-features` command line option instead. The related
document is updated accordingly.

Fixes scylladb/scylladb#22207
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22283
2025-01-14 07:56:38 -05:00
Avi Kivity
814942505f Merge 'Introduce Encryption-at-Rest (EAR) for sstables and commitlog' from Calle Wilund
Fixes https://github.com/scylladb/scylla-enterprise/issues/5016#issuecomment-2558464631

EAR - encryption at rest. Allows on-disk file encryption of sstables and commitlog data.
Introduces OpenSSL based file level encrypted storage, managed via a set of providers
ranging from local files to cloud KMS providers.

For a more comprehensive explanation, see the included docs (or if possible, original
source tree).

Manual bulk merge of EAR feature from enterprise repo to main scylla repo.

Breaks some features apart, but main EAR is still a humongous commit, because to separate this
I would have to mess with code incrementally, adding time and risk.

This PR includes the local file gen tool, tests and also p11 validation.

Note: CI will not execute the full tests unless master CI is set to provide the same environment
as the enterprise one. Not sure about the status of this ATM.

Note: Includes code to compile against cryptsoft kmipc SDK, but not the SDK. If you happen to
check out this tree in the scylla folder and configure, it will be linked against and KMIP functionality
will be enabled, otherwise not.

Closes scylladb/scylladb#22233

* github.com:scylladb/scylladb:
  docs: Add EAR docs
  main/build: Add p11-kit and initialize
  tools: Add local-file-key-generator tool
  tests: Add EAR tests
  tmpdir: shorten test tempdir path
  EAR: port the ear feature from enterprise
  cql_test_env: Add optional query timeout
  schema/migration_manager: Add schema validate
  sstables: add get_shared_components accessor
  config/config_file: Add exports and definitions of config_type_for<>
2025-01-12 16:10:46 +02:00
Calle Wilund
8e828f608d docs: Add EAR docs
Merge docs relating to EAR.
2025-01-09 10:40:47 +00:00
Botond Dénes
a2436f139f docs/dev: review-checklist.md: expand the guide for good commit log
Closes scylladb/scylladb#22214
2025-01-08 13:01:35 +02:00
Avi Kivity
202f16e799 Merge 'Introduce workload prioritization for service levels' from Piotr Dulikowski
This series introduces workload prioritization: an extension of the service levels feature which allows specifying "shares" per service level. The number of shares determines the priority of the user which has this service level attached (if multiple are attached then the one with the lowest shares wins).

Different service levels will be isolated in the following way:

- Each service level gets its own scheduling group with the number of shares (corresponding to the service level's number of shares), which controls the priority of the CPU and I/O used for user operations running on that service level.
- Each service level gets two reader concurrency semaphores, one for user reads and the other for read-before-write done for view updates.
- Each service level gets its own TCP connections for RPC to prevent priority inversion issues.

Because of the mandatory use of scheduling groups, which are a globally limited resource, the number of service levels is now limited to 7 user created service levels + 1 created by default that cannot be removed.

This feature has been previously only available in ScyllaDB Enterprise but has been made available for the source available ScyllaDB. The series was created by comparing the master branch with source-available-workbranch / enterprise branch and taking the workload prioritization related parts from the diff, then molding the resulting diff into a proper series. Some very minor changes were made such as fixing whitespace, removing unused or unnecessary code, adding some boilerplate (in api/) which was missing, but otherwise no major changes have been made.

No backport is required.

Closes scylladb/scylladb#22031

* github.com:scylladb/scylladb:
  tracing: record scheduling group in trace event record
  qos: un-shared-from-this standard_service_level_distributed_data_accessor
  alternator: execute under scheduling group for service level
  test.py: support multiple commands in prepare_cql in suite.yml
  docs: add documentation for workload prioritization
  docs/dev: describe workload prioritization features in service_levels
  test/auth_cluster: test workload prioritization in service level tests
  cqlpy/test_service_levels: add workload prioritization tests
  api: introduce service levels specific API
  api/cql_server_test: add information about scheduling group
  db/virtual_tables: add scheduling group column to system.clients
  test/boost: update service_level_controller_test for workload prio
  qos: include number of shares in DESCRIBE
  cql3/statements: update SL statements for workload prioritization
  transport/server: use scheduling group assigned to current user
  messaging_service: use separate set of connections per service levels
  replica/database: add reader concurrency semaphore groups
  qos: manage and assign scheduling groups to service levels
  qos: use the shares field in service level reads/writes
  qos: add shares to service_level_options
  qos: explicitly specify columns when querying service level tables
  db/system_distributed_keyspace: add shares column and upgrade code
  db/system_keyspace: adjust SL schema for workload prioritization
  gms: introduce WORKLOAD_PRIORITIZATION cluster feature
  build: increase the max number of scheduling groups
  qos: return correct error code when SL does not exist
2025-01-02 20:05:36 +02:00
Kefu Chai
233e3969c4 utils: correct misspellings
these misspellings were identified by codespell. let's fix them.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22143
2025-01-02 16:47:57 +02:00
Piotr Dulikowski
241e710c19 docs/dev: describe workload prioritization features in service_levels
The concept of shares, and some helper HTTP APIs, are now described in
the developer documentation for service levels.
2025-01-02 07:13:34 +01:00
Michał Chojnowski
fdb2d2209c messaging_service: use advanced_rpc_compression::tracker for compression
This patch sets up an `alien_worker`, `advanced_rpc_compression::tracker`,
`dict_sampler` and `dictionary_service` in `main()`, and wires them to each other
and to `messaging_service`.

`messaging_service` compresses its network traffic with compressors managed by
the `advanced_rpc_compression::tracker`. All this traffic is passed as a single
merged "stream" through `dict_sampler`.

`dictionary_service` has access to `dict_sampler`.
On chosen nodes (by default: the Raft leader), it uses the sampler to maintain
a random multi-megabyte sample of the sampler's stream. Every several minutes,
it copies the sample, trains a compression dictionary on it (by calling zstd's
training library via the `alien_worker` thread) and publishes the new dictionary
to `system.dicts` via Raft.

This update triggers a callback into `advanced_rpc_compression::tracker` on all nodes,
which updates the dictionary used by the compressors it manages.
2024-12-27 10:17:58 +01:00
Michał Chojnowski
0fd1050784 utils: add advanced_rpc_compressor
Adds glue needed to pass lz4 and zstd with streaming and/or dictionaries
as the network traffic compressors for Seastar's RPC servers.

The main jobs of this glue are:
1. Implementing the API expected by Seastar from RPC compressors.
2. Expose metrics about the effectiveness of the compression.
3. Allow dynamically switching algorithms and dictionaries on a running
   connection, without any extra waits.

The biggest design decision here is that the choice of algorithm and dictionary
is negotiated by both sides of the connection, not dictated unilaterally by the
sender.

The negotiation algorithm is fairly complicated (a TLA+ model validating
it is included in the commit). Unilateral compression choice would be much simpler.
However, negotiation avoids re-sending the same dictionary over every
connection in the cluster after dictionary updates (with one-way communication,
it's the only reliable way to ensure that our receiver possesses the dictionary
we are about to start using), lets receivers ask for a cheaper compression mode
if they want, and lets them refuse to update a dictionary if they don't think
they have enough free memory for that.

In hindsight, those properties probably weren't worth the extra complexity and
extra development effort.

Zstd can be quite expensive, so this patch also includes a mechanism which
temporarily downgrades the compressor from zstd to lz4 if zstd has been
using too much CPU in a given slice of time. But it should be noted that
this can't be treated as a reliable "protection" from negative performance
effects of zstd, since a downgrade can happen on the sender side,
and receivers are at the mercy of senders.
2024-12-23 23:37:02 +01:00
Tomasz Grabiec
8e60a0b831 Merge 'truncate: make TRUNCATE TABLE safe with tablets' from Ferenc Szili
Currently truncating a table works by issuing an RPC to all the nodes which call `database::truncate_table_on_all_shards()`, which makes sure that older writes are dropped.

It works with tablets, but is not safe. A concurrent replication process may bring back old data.

This change makes makes TRUNCATE TABLE a topology operation, so that it excludes with other processes in the system which could interfere with it. More specifically, it makes TRUNCATE a global topology request.

Backporting is not needed.

Fixes #16411

Closes scylladb/scylladb#19789

* github.com:scylladb/scylladb:
  docs: docs: topology-over-raft: Document truncate_table request
  storage_proxy: fix indentation and remove empty catch/rethrow
  test: add tests for truncate with tablets
  storage_proxy: use new TRUNCATE for tablets
  truncate: make TRUNCATE a global topology operation
  storage_service: move logic of wait_for_topology_request_completion()
  RPC: add truncate_with_tablets RPC with frozen_topology_guard
  feature_service: added cluster feature for system.topology schema change
  system.topology_requests: change schema
  storage_proxy: propagate group0 client and TSM dependency
2024-12-10 17:50:50 +01:00
Ferenc Szili
49cc771bda docs: docs: topology-over-raft: Document truncate_table request 2024-12-09 16:38:50 +01:00
Tomasz Grabiec
7e2875d648 Merge 'Add tablet merge support' from Raphael Raph Carvalho
The goal of merge is to reduce the tablet count for a shrinking table. Similar to how split increases the count while the table is growing. The load balancer decision to merge is implemented today (came with infrastructure introduced for split), but it wasn't handled until now.

Initial tablet count is respected while the table is in "growing mode". For example, the table leaves it if there was a need to split above the initial tablet count. After the table leaves the mode, the average size can be trusted to determine that the table is shrinking. Merge decision is emitted if the average tablet size is 50% of the target. Hysteresis is applied to avoid oscillations between split and merges.

Similar to split, the decision to merge is recorded in tablet map's resize_type field with the string "merge". This is important in case of coordinator failover, so new coordinator continues from where the old left off.

Unlike split, the preparation phase during merge is not done by the replica (with split compactions), but rather by the coordinator by co-locating sibling tablets in the same node's shard. We can define sibling tablets as tablets that have contiguous range and will become one after merge. The concept is based on the power-of-two constraint and token contiguity. For example, in a table with 4 tablets, tablets of ids 0 and 1 are siblings, 2 and 3 are also siblings.

The algorithm for co-locating sibling tablets is very simple. The balancer is responsible for it, and it will emit migrations so that "odd" tablet will follow the "even" one. For example, tablet 1 will be migrated to where tablet 0 lives. Co-location is low in priority, it's not the end of the world to delay merge, but it's not ideal to delay e.g. decommission or even regular load balancing as that can translate into temporary unbalancing, impacting the user activities. So co-location migrations will happen when there is no more important work to do.
While regular balancing is higher in priority, it will not undo the co-location work done so far. It does that by treating co-located tablets as if they were already merged. The load inversion convergence check was adjusted so balancer understand when two tablets are being migrated instead of one, to avoid oscillations.

When balancer completes co-location work for a table undergoing merge, it will put the id of the table into the resize_plan, which is about communicating with the topology coordinator that a table is ready for it. With all sibling tablets co-located, the coordinator can resize the tablet map (reduce it by a factor of 2) and record the new map into group0. All the replicas will react to it (on token metadata update) by merging the storage (memtable(s) + sstables) of sibling tablets into one.

Fixes #18181.

system test details:

test: https://github.com/pehala/scylla-cluster-tests/blob/tablets_split_merge/tablets_split_merge_test.py
yaml file: https://github.com/pehala/scylla-cluster-tests/blob/tablets_split_merge/test-cases/features/tablets/tablets-split-merge-test.yaml

instance type: i3.8xlarge
nodes: 3
target tablet size: 0.5G (scaled down by 10, to make it easier to trigger splits and merges)
description: multiple cycles of growing and shrinking the data set in order to trigger splits and merges.
data_set_size: ~100G
initial_tablets: 64, so it grew to 128 tablets on split, and back to 64 on merge.

latency of reads and writes that happened in parallel to split and merge:
```
$ for i in scylla-bench*; do cat $i | grep "Mode\|99th:\|99\.9th:"; done
Mode:			 write
  99.9th:	 3.145727ms
  99th:		 1.998847ms
  99.9th:	 3.145727ms
  99th:		 2.031615ms
Mode:			 read
  99.9th:	 3.145727ms
  99th:		 2.031615ms
  99.9th:	 3.145727ms
  99th:		 2.031615ms
Mode:			 write
  99.9th:	 3.047423ms
  99th:		 1.933311ms
  99.9th:	 3.047423ms
  99th:		 1.933311ms
Mode:			 read
  99.9th:	 3.145727ms
  99th:		 1.900543ms
  99.9th:	 3.145727ms
  99th:		 1.900543ms
Mode:			 write
  99.9th:	 5.079039ms
  99th:		 3.604479ms
  99.9th:	 35.389439ms
  99th:		 25.624575ms
Mode:			 write
  99.9th:	 3.047423ms
  99th:		 1.998847ms
  99.9th:	 3.047423ms
  99th:		 1.998847ms
Mode:			 read
  99.9th:	 3.080191ms
  99th:		 2.031615ms
  99.9th:	 3.112959ms
  99th:		 2.031615ms
```

Closes scylladb/scylladb#20572

* github.com:scylladb/scylladb:
  docs: Document tablet merging
  tests/boost: Add test to verify correctness of balancer decisions during merge
  tests/topology_experimental_raft: Add tablet merge test
  service: Handle exception when retrying split
  service: Co-locate sibling tablets for a table undergoing merge
  gms: Add cluster feature for tablet merge
  service: Make merge of resize plan commutative
  replica: Implement merging of compaction groups on merge completion
  replica: Handle tablet merge completion
  service: Implement tablet map resize for merge
  locator: Introduce merge_tablet_info()
  service: Rename topology::transition_state::tablet_split_finalization
  service: Respect initial_tablet_count if table is in growing mode
  service: Wire migration_tablet_set into the load balancer
  locator: Add tablet_map::sibling_tablets()
  service: Introduce sorted_replicas_for_tablet_load()
  locator/tablets: Extend tablet_replica equality comparator to three-way
  service: Introduce alias to per-table candidate map type
  service: Add replication constraint check variant for migration_tablet_set
  service: Add convergence check variant for migration_tablet_set
  service: Add migration helpers for migration_tablet_set
  service/tablet_allocator: Introduce migration_tablet_set
  service: Introduce migration_plan::add(migrations_vector)
  locator/tablets: Introduce tablet_map::for_each_sibling_tablets()
  locator/tablets: Introduce tablet_map::needs_merge()
  locator/tablets: Introduce resize_decision::initial_decision()
  locator/tablets: Fix return type of three-way comparison operators
  service: Extract update of node load on migrations
  service: Extract converge check for intra-node migration
  service: Extract erase of tablet replicas from candidate list
  scripts/tablet-mon: Allow visualization of tablet id
2024-12-06 18:06:20 +01:00
Emil Maskovsky
2b07d93bea raft: clean up the documentation
Small adjustments and improvements to the documentation in the raft
section.

Fixing Markdown lint warnings:
- MD004/ul-style: Unordered list style [Expected: dash; Actual: asterisk]
- MD007/ul-indent: Unordered list indentation [Expected: 0; Actual: 2]
- MD032/blanks-around-lists: Lists should be surrounded by blank lines
- MD036/no-emphasis-as-heading: Emphasis used instead of a heading
- MD046/code-block-style: Code block style [Expected: fenced; Actual: indented]

Closes scylladb/scylladb#21780
2024-12-05 13:44:11 +01:00
Raphael S. Carvalho
d93a0040e5 docs: Document tablet merging
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-04 13:11:11 -03:00
Raphael S. Carvalho
e00798f1b1 service: Rename topology::transition_state::tablet_split_finalization
This transition state will be reused by merge completion, so let's
rename it to tablet_resize_finalization.
The completion handling path will also be reused, so let's rename
functions involved similarly.

The old name "tablet split finalization" is deprecated but still
recognized and points to the correct transition. Otherwise, the
reverse lookup would fail when populating topology system table
which last state was split finalization.

NOTE:
I thought of adding a new tablet_merge_finalization, but it would
complicate things since more than one table could be ready for
either split or merge, so you need a generic transition state
for handling resize completion.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Botond Dénes
87bdfb80aa docs/dev/reader-concurrency-semaphore.md: fix formatting of diagnostics dump
Indent the whole thing so it is formatted as code, not as text.

Closes scylladb/scylladb#21693
2024-11-27 12:13:16 +03:00
Aleksandra Martyniuk
3b86150e88 docs: update task manager docs 2024-11-26 09:57:41 +01:00
Avi Kivity
29497f8c5d Merge 'Automatically compute schema version of system tables' from Tomasz Grabiec
Schema of system tables is defined statically and table_schema_version needs to be explicitly set in code like this:

```
builder.with_version(system_keyspace::generate_schema_version(table_id, version_offset));
```

Whenever schema is changed, the schema version needs to change, otherwise we hit undefined behavior when trying to interpret mutation data created with the old schema using the new schema.

It's not obvious that one needs to do that and developers often forget to do that. There were several instances of mistakes of omission, some caught during review, some not, e.g.: 31ea74b96e.

This patch changes definitions to call the new `schema_builder::with_hash_version()`, which will make the schema builder compute version from schema definition so that changes of the schema will automatically change the version. This way we no longer rely on the developer to remember to bump the version offset.

All nodes should arrive at the same version, which is verified by existing `test_group0_schema_versioning` and a new unit test: `test_system_schema_version_is_stable`.

Closes scylladb/scylladb#21602

* github.com:scylladb/scylladb:
  system_tables: Compute schema version automatically
  schema_builder: Introduce with_hash_version()
  schema: Store raw_view_info in schema::raw_schema
  schema: Remove dead comment
  hashing: Add hasher for unordered_map
  hashing: Add hasher for unique_ptr
  hashing: Add hasher for double

[avi: add missing include <memory> to hashing.hh]
2024-11-24 18:44:32 +02:00
Asias He
9d58a911f1 docs: Update system_keyspace.md for tablet repair related info 2024-11-20 09:42:41 +08:00
Asias He
afd356ea9a docs: Add docs for tablet repair migration 2024-11-20 09:42:41 +08:00
Tomasz Grabiec
8738d9bfa0 system_tables: Compute schema version automatically
This depends on the previous change to the schema_builder
which makes version computation depend on definition only
instead of being new time uuid.

This way we avoid the possibility for a common mistake
when schema of a system table is extended but we forget
to bump up its version passed to .with_version().
2024-11-15 19:16:41 +01:00