Commit Graph

268 Commits

Author SHA1 Message Date
Avi Kivity
611918056a Merge 'repair: Add tablet incremental repair support' from Asias He
The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but it is less efficient because it requires reading all of
the dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows the entire SSTable to be omitted. This document patch
implements the file-based selection method.

Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode is less
important to support. On the other hand, the incremental repair for
vnode is much harder to implement. With vnodes, a SSTalbe could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired so that a
sstable is either fully repaired or not repaired which is a huge
simplification.

This patch uses the repaired_at from sstables::statistics component to
mark a sstable as repaired. It uses a virtual clock as the repair
timestamp, i.e., using a monotonically increasing number for the
repaired_at field of a SSTable and sstables_repaired_at column in
system.tablets table. Notice that when a sstable is not repaired, the
repaired_at field will be set to the default value 0 by default. The
being_repaired in memory field of a SSTable is used to explicitly mark
that a SSTable is being selected. The following variables are used for
incremental repair:

The repaired_at on disk field of a SSTable is used.
   - A 64-bit number increases sequentially

The sstables_repaired_at is added to the system.tablets table.
   - repaired_at <= sstables_repaired_at means the sstable is repaired

The being_repaired in memory field of a SSTable is added.
   - A repair UUID tells which sstable has participated in the repair

Initial test results:

    1) Medium dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~500GB
    Cluster pre-populated with ~500GB of data before starting repairs job.
    Results for Repair Timings:
    The regular repair run took 210 mins.
    Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
    The speedup is: 183 mins  / 48s = 228X

    2) Small dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~167GB
    Cluster pre-populated with ~167GB of data before starting the repairs job.
    Regular repair 1st run took 110s,  2nd and 3rd runs took 110s.
    Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds.
    The speedup is: 110s / 1.5s = 73X

    3) Large dataset results
    Node amount: 6
    Instance type: i4i.2xlarge, 3 racks
    50% of base load, 50% read/write
    Dataset == Sum of data on each node

    Dataset     Non-incremental repair (minutes)
    1.3 TiB     31:07
    3.5 TiB     25:10
    5.0 TiB     19:03
    6.3 TiB     31:42

    Dataset     Incremental repair (minutes)
    1.3 TiB     24:32
    3.0 TiB     13:06
    4.0 TiB     5:23
    4.8 TiB     7:14
    5.6 TiB     3:58
    6.3 TiB     7:33
    7.0 TiB     6:55

Fixes #22472

Closes scylladb/scylladb#24291

* github.com:scylladb/scylladb:
  replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
  compaction: Move compaction_reenabler to compaction_reenabler.hh
  topology_coordinator: Make rpc::remote_verb_error to warning level
  repair: Add metrics for sstable bytes read and skipped from sstables
  test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair
  test.py: Add tests for tablet incremental repair
  repair: Add tablet incremental repair support
  compaction: Add tablet incremental repair support
  feature_service: Add TABLET_INCREMENTAL_REPAIR feature
  tablet_allocator: Add tablet_force_tablet_count_increase and decrease
  repair: Add incremental helpers
  sstable: Add being_repaired to sstable
  sstables: Add set_repaired_at to metadata_collector
  mutation_compactor: Introduce add operator to compaction_stats
  tablet: Add sstables_repaired_at to system.tablets table
  test: Fix drain api in task_manager_client.py
2025-08-19 13:13:22 +03:00
Botond Dénes
660ea9202a docs/dev/tombstone.md: document the memtable overlap check elision optimization 2025-08-11 17:20:12 +03:00
Asias He
5377f87e5a tablet: Add sstables_repaired_at to system.tablets table
It is used to store the repaired_at for each tablet.
2025-08-11 10:10:07 +08:00
Pavel Emelyanov
5fcdf948d9 doc: Update system.clients schema with scheduling_group cell
It was added by 9319d65971 (db/virtual_tables: add scheduling group
column to system.clients) recently.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#25294
2025-08-05 10:16:20 +03:00
Andrei Chekun
a6a3d119e8 docs: update documentation with new way of running C++ tests
Documentation had outdated information how to run C++ test.
Additionally, some information added about gathered test metrics.

Closes scylladb/scylladb#25180
2025-07-29 14:36:19 +03:00
Pawel Pery
eadbf69d6f vector_store_client: implement ANN API
This patch is a part of vector_store_client sharded service
implementation for a communication with vector-store service.

It implements a functionality for ANN search request to a vector-store
service. It sends request, receive response and after parsing it returns
the list of primary keys.

It adds json parsing functionality specific for the HTTP ANN API.

It adds a hardcoded http request timeout for retrieving response from
the Vector Store service.

It also adds an automatic boost test of the ANN search interface, which
uses a mockup http server in a background to simulate vector-store
service.

It adds a documentation for HTTP API protocol used used for ANN
functionality.

Fixes: VS-47
2025-07-09 11:54:51 +02:00
Avi Kivity
dfaed80f55 Merge 'types: add byte-comparable format support for native cql3 types' from Lakshmi Narayanan Sreethar
This PR introduces a new `comparable_bytes` class to add byte-comparable format support for all the [native cql3 data types](https://opensource.docs.scylladb.com/stable/cql/types.html#native-types) except `counter` type as that is not comparable. The byte-comparable format is a pre-requisite for implementing the trie based index format for our sstables(https://github.com/scylladb/scylladb/issues/19191). This implementation adheres to the byte-comparable format specification in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md

Note that support for composite data types like lists, maps, and sets has not been implemented yet and will be made available in a separate PR.

Refs https://github.com/scylladb/scylladb/issues/19407

New feature - backport not required.

Closes scylladb/scylladb#23541

* github.com:scylladb/scylladb:
  types/comparable_bytes: add testcase to verify compatibility with cassandra
  types/comparable_bytes: support variable-length natively byte-ordered data types
  types/comparable_bytes: support decimal cql3 types
  types/comparable_bytes: introduce count_digits() method
  types/comparable_bytes: support uuid and timeuuid cql3 types
  types/comparable_bytes: support varint cql3 type
  types/comparable_bytes: support skipping sign byte write in decode_signed_long_type
  types/comparable_bytes: introduce encode/decode_varint_length
  types/comparable_bytes: support float and double cql3 types
  types/comparable_bytes: support date, time and timestamp cql3 types
  types/comparable_bytes: support bigint cql3 type
  types/comparable_bytes: support fixed length signed integers
  types/comparable_bytes: support boolean cql3 type
  types: introduce comparable_bytes class
  bytes_ostream: overload write() to support writing from FragmentedView
  docs: fix minor typo in docs/dev/cql3-type-mapping.md
2025-07-02 11:58:32 +03:00
Lakshmi Narayanan Sreethar
068e74b457 docs: fix minor typo in docs/dev/cql3-type-mapping.md
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:07 +05:30
Michael Litvak
6fa5d2f7c8 docs: topology-over-raft: document co-located tables 2025-07-01 13:20:19 +03:00
Michael Litvak
4777444024 tablets: add base_table column to system.tablets
Add a new column base_table to the system.tablets table.

It can be set to point to another table to indicate that the tablets of
this table are co-located with the tablets of the base table.

When it's set, we don't store other tablet information in system.tablets
and in the in-memory tablet map object for this table, and we need to
refer instead to the base table tablet information. The method
get_tablet_map always returns the base tablet map.
2025-07-01 10:29:59 +03:00
Michael Litvak
4e2742a30b docs: update system.tablets schema
The schema of system.tablets in the docs is outdated. replace it with
the current schema.
2025-07-01 10:29:59 +03:00
Avi Kivity
b33dd2bd7d Merge 'sstables/mx/writer: handle non-full prefix row keys' from Botond Dénes
Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full prefix.
This mis-match results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions.

Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.

Add a full-stack test which checks that rows with bad keys are correctly handled.

Fixes: https://github.com/scylladb/scylladb/issues/24489

The bug is present in all versions, has to be backported to all supported versions.

Closes scylladb/scylladb#24492

* github.com:scylladb/scylladb:
  test/boost/sstable_datafile_test: add test for corrupt data
  sstables/mx/writer: handler rows with empty keys
  test/lib/cql_assertions: introduce columns_assertions
  sstables: add corrupt_data_handler to sstables::sstables
  tools/scylla-sstable: make large_data_handler a local
  db: introduce corrupt_data_handler
  mutation: introduce frozen_mutation_fragment_v2
  mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
  mutation/mutation_partition_view: extract de-ser of {clustering,static} row
  idl-compiler.py: generate skip() definition for enums serializers
  idl: extract full_position.idl from position_in_partition.idl
  db/system_keyspace: add apply_mutation()
  db/system_keyspace: introduce the corrupt_data table
2025-06-29 18:18:36 +03:00
Robert Bindar
6e7cab5b45 Add repository layout dev documentation
This change adds an md file which gives a high
level overview of the scylladb repository, the
components each path contains and a basic description
for each one of them. This is mainly intended for
onboarding engineers to help get a mental picture when
starting ramping up on Scylla concepts.

Refs #22908

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#23010
2025-06-25 13:58:05 +03:00
Botond Dénes
92b5fe8983 db/system_keyspace: introduce the corrupt_data table
To serve as a place to store corrupt mutation fragments. These fragments
cannot be written to sstables, as they would be spread around by
compaction and/or repair. They even might make parsing the sstable
impossible. So they are stored in this special table instead, kept
around to be inspected later and possibly restored if possible.
2025-06-24 11:05:30 +03:00
Patryk Jędrzejczak
6489308ebc Merge 'Introduce a queue of global topology requests.' from Gleb Natapov
Currently only one global topology request (such as truncate, cdc repair, cleanup and alter table) can be pending. If one is already pending others will be rejected with an error. This is not very user friendly, so this series introduces a queue of global requests which allows queuing many global topology requests simultaneously.

Fixes: #16822

No need to backport since this is a new feature.

Closes scylladb/scylladb#24293

* https://github.com/scylladb/scylladb:
  topology coordinator: simplify truncate handling in case request queue feature is disable
  topology coordinator: fix indentation after the previous patch
  topology coordinator: allow running multiple global commands in parallel
  topology coordinator: Implement global topology request queue
  topology coordinator: Do not cancel global requests in cancel_all_requests
  topology coordinator: store request type for each global command
  topology request: make it possible to hold global request types in request_type field
  topology coordinator: move alter table global request parameters into topology_request table
  topology coordinator: move cleanup global command to report completion through topology_request table
  topology coordinator: no need to create updates vector explicitly
  topology coordinator: use topology_request_tracking_mutation_builder::done() instead of open code it
  topology coordinator: handle error during new_cdc_generation command processing
  topology coordinator: remove unneeded semicolon
  topology coordinator: fix indentation after the last commit
  topology coordinator: move new_cdc_generation topology request to use topology_request table for completion
  gms/feature_service: add TOPOLOGY_GLOBAL_REQUEST_QUEUE feature flag
2025-06-23 16:08:09 +03:00
Robert Bindar
1dd37ba47a Add dev documentation for manipulating s3 data manually
This patch intends to give an overview of where, when and how we store
data in S3 and provide a quick set of commands
which help gain local access to the data in case there is a need for
manual intervention.

The patch also collects in the same place links/descriptions for all
formats we use in S3.

Fixes #22438

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#24323
2025-06-17 13:21:30 +03:00
Gleb Natapov
a0a3a034e0 topology coordinator: Implement global topology request queue
Requests, together with their parameters, are added to the
topology_request tables and the queue of active global requests is
kept in topology state. Thy are processed one by one by the topology
state machine.

Fixes: #16822
2025-06-11 11:29:33 +03:00
Evgeniy Naydanov
cdc4b520da test.py: cql: run tests using bare pytest command
Create a custom pytest test collector for .cql files and
move CQL test execution logic from `CQLApprovalTest` class
and `pylib/cql_repl/cql_repl.py` file to `CqlTest.runtest()`
method.

In result, the only difference between CQLApproval and Python
suite types is suffixes of test files.
2025-06-03 07:54:51 +00:00
Andrzej Jackowski
086df24555 transport: implement SCYLLA_USE_METADATA_ID support
Metadata id was introduced in CQLv5 to make metadata of prepared
statement consistent between driver and database. This commit introduces
a protocol extension that allows to use the same mechanism in CQLv4.

This change:
 - Introduce SCYLLA_USE_METADATA_ID protocol extension for CQLv4
 - Introduce METADATA_CHANGED flag in RESULT. The flag cames directly
   from CQLv5 binary protocol. In CQLv4, the bit was never used, so we
   assume it is safe to reuse it.
 - Implement handling of metadata_id and METADATA_CHANGED in RESULT rows
 - Implement returning metadata_id in RESULT prepared
 - Implement reading metadata_id from EXECUTE
 - Added description of SCYLLA_USE_METADATA_ID in documentation

Metadata_id is wrapped in cql_metadata_id_wrapper because we need to
distinguish the following situations:
 - Metadata_id is not supported by the protocol (e.g. CQLv4 without the
   extension is used)
 - Metadata_id is supported by the protocol but not set - e.g. PREPARE
   query is being handled: it doesn't contain metadata_id in the
   request but the reply (RESULT prepared) must contain metadata_id
 - Metadata_id is supported by the protocol and set, any number of
   bytes >= 0 is allowed, according to the CQLv5 protocol specification

Fixes scylladb/scylladb#20860
2025-05-14 09:59:16 +02:00
Andrei Chekun
747f2b1301 docs: add more steps in installation of test.py
Documentation for --gather-metric parameter was missing. This functionality can
break regular flow of using test.py, because of possible misconfiguration of
the cgroup on the local machine. Added explanation how to deal with potential
issue of gathering metrics functionality and how to switch it off.

Fixes: https://github.com/scylladb/scylladb/issues/20763

Closes scylladb/scylladb#24095
2025-05-13 13:08:18 +03:00
Kefu Chai
3e3f583b84 docs/dev/tombstone.md: fix a typo
s/alwas/always/

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23734
2025-04-15 10:54:42 +03:00
Avi Kivity
5e1cf90a51 build: replace tools/java submodule with packaged cassandra-stress
We no longer use tools/java (scylladb/scylla-tools-java.git) for
nodetool or cqlsh; only cassandra-stress. Since that is available
in package form install that and excise the tools/java submodule
from the source tree.

pgo/ is adjusted to use the packaged cassandra-stress (and the cqlsh
submodule).

A few jmx references are dropped as well.

Frozen toolchain regenerated.

Optimized clang from

  https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz
  https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz

Closes scylladb/scylladb#23698
2025-04-15 10:11:28 +03:00
Avi Kivity
9559e53f55 Merge 'Adjust tablet-mon.py for capacity-aware load balancing' from Tomasz Grabiec
After load-balancer was made capacity-aware it no longer equalizes tablet count per shard, but rather utilization of shard's storage. This makes the old presentation mode not useful in assessing whether balance was reached, since nodes with less capacity will get fewer tablets when in balanced state. This PR adds a new default presentation mode which scales tablet size by its storage utilization so that tablets which have equal shard utilization take equal space on the graph.

To facilitate that, a new virtual table was added: system.load_per_node, which allows the tool to learn about load balancer's view on per-node capacity. It can also serve as a debugging interface to get a view of current balance according to the load-balancer.

Closes scylladb/scylladb#23584

* github.com:scylladb/scylladb:
  tablet-mon.py: Add presentation mode which scales tablet size by its storage utilization
  tablet-mon.py: Center tablet id text properly in the vertical axis
  tablet-mon.py: Show migration stage tag in table mode only when migrating
  virtual-tables: Introduce system.load_per_node
  virtual_tables: memtable_filling_virtual_table: Propagate permit to execute()
  docs: virtual-tables: Fix instructions
  service: tablets: Keep load_stats inside tablet_allocator
2025-04-10 14:59:08 +03:00
Tomasz Grabiec
b5211cca85 Merge 'tablets: rebuild: use repair for tablet rebuild' from Aleksandra Martyniuk
Currently, when we rebuild a tablet, we stream data from all
replicas. This creates a lot of redundancy, wastes bandwidth
and CPU resources.

In this series, we split the streaming stage of tablet rebuild into
two phases: first we stream tablet's data from only one replica
and then repair the tablet.

Fixes: https://github.com/scylladb/scylladb/issues/17174.

Needs backport to 2025.1 to prevent out of space during streaming

Closes scylladb/scylladb#23187

* github.com:scylladb/scylladb:
  test: add test for rebuild with repair
  locator: service: move to rebuild_v2 transition if cluster is upgraded
  locator: service: add transition to rebuild_repair stage for rebuild_v2
  locator: service: add rebuild_repair tablet transition stage
  locator: add maybe_get_primary_replica
  locator: service: add rebuild_v2 tablet transition kind
  gms: add REPAIR_BASED_TABLET_REBUILD cluster feature
2025-04-09 21:35:37 +02:00
Tomasz Grabiec
0b9a75d7b6 virtual-tables: Introduce system.load_per_node
Can be used to query per-node stats about load as seen by the load
balancer.

In particular, node's capacity will be used by tablet-mon.py to
scale tablet columns so that equal height is equal node utilization.
2025-04-09 20:21:51 +02:00
Tomasz Grabiec
34beaa30b5 docs: virtual-tables: Fix instructions 2025-04-09 20:21:51 +02:00
Botond Dénes
583a813d17 docs/dev/tombstone.md: fix link to ddl.html
Closes scylladb/scylladb#23622
2025-04-08 16:18:50 +03:00
Aleksandra Martyniuk
eb17af6143 locator: service: add transition to rebuild_repair stage for rebuild_v2
Modify write_both_read_old and streaming stages in rebuild_v2 transition
kind: write_both_read_old moves to rebuild_repair stage and streaming stage
streams data only from one replica.
2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk
ed7b8bb787 locator: service: add rebuild_v2 tablet transition kind
Currently, in the streaming stage of rebuild tablet transition,
we stream tablet data from all replicas.
This patch series splits the streaming stage into two phases:
- repair phase, where we repair the tablet;
- streaming phase, where we stream tablet data from one replica.

To differentiate the two streaming methods, a new tablet transition
kind - rebuild_v2 - is added.

The transtions and stages for rebuild_v2 transition kind will be
added in the following patches.
2025-04-08 10:42:01 +02:00
Botond Dénes
3bad46a6e2 docs/dev: add tombstone.md
An exhaustive document on the tombstone related internal logic as well
as the user-facing aspects.

Closes scylladb/scylladb#23454
2025-04-01 20:17:57 +03:00
Pavel Emelyanov
2ee9cec1d3 Merge 'Remove object_storage.yaml and move the endpoints to scylla.yaml' from Robert Bindar
Move `object_storage.yaml` endpoints to `scylla.yaml`

This change also removes the `object_storage.yaml` file
altogether and adds tests for fetching the endpoints
via the `v2/config/object_storage_endpoints` REST api.

Also, `object_storage_config_file` options is moved to a deprecated state as it's no longer needed.

This PR depends on #22951, the reviewers should review patch 393e1ac0ec066475ca94094265a5f88dbbdb1a1f

Refs https://github.com/scylladb/scylladb/issues/22428

Closes scylladb/scylladb#22952

* github.com:scylladb/scylladb:
  Remove db::config::object_storage_config
  Move `object_storage.yaml` endpoints to `scylla.yaml`
2025-04-01 16:01:44 +03:00
Michał Chojnowski
d33ffb221b docs/dev: add sstable-compression-dicts.md 2025-04-01 00:07:31 +02:00
Robert Bindar
e3a3508960 Move object_storage.yaml endpoints to scylla.yaml
This change also removes the `object_storage.yaml` file
altogether and adds tests for fetching the endpoints
via the `v2/config/object_storage_endpoints` REST api.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-03-31 13:39:39 +03:00
Avi Kivity
2b9e1e61d0 docs: reader_concurrency_semaphore: document CPU concurrency limit
Document the CPU concurrency implemented in 3d816b7c16
and adjusted in 3d12451d1f.

Closes scylladb/scylladb#23404
2025-03-31 09:39:55 +03:00
Aleksandra Martyniuk
4c75701756 docs: locator: update the docs and formatter of tablet_task_info 2025-02-14 09:13:11 +01:00
TripleChecker
8d64be94e2 Fix typos 2025-02-13 01:54:08 +02:00
TripleChecker
e72e6fadeb Fix typos 2025-02-11 00:17:43 +02:00
Pavel Emelyanov
81f7a6d97d doc: Update system.sstables table schema description
The partition key had been renamed and its type changed some time ago,
but the doc wasn't updated. Fix it.

refs: #20998

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#22683
2025-02-10 16:09:49 +02:00
Botond Dénes
51a273401c Merge 'test: tablets_test: Create proper schema in load balancer tests' from Tomasz Grabiec
This PR converts boost load balancer tests in preparation for load balancer changes
which add per-table tablet hints. After those changes, load balancer consults with the replication
strategy in the database, so we need to create proper schema in the
database. To do that, we need proper topology for replication
strategies which use RF > 1, otherwise keyspace creation will fail.

Topology is created in tests via group0 commands, which is abstracted by
the new `topology_builder` class.

Tests cannot modify token_metadata only in memory now as it needs to be
consistent with the schema and on-disk metadata. That's why modifications to
tablet metadata are now made under group0 guard and save back metadata to disk.

Closes scylladb/scylladb#22648

* github.com:scylladb/scylladb:
  test: tablets: Drop keyspace after do_test_load_balancing_merge_colocation() scenario
  tests: tablets: Set initial tablets to 1 to exit growing mode
  test: tablets_test: Create proper schema in load balancer tests
  test: lib: Introduce topology_builder
  test: cql_test_env: Expose topology_state_machine
  topology_state_machine: Introduce lock transition
2025-02-10 16:08:41 +02:00
Avi Kivity
9712390336 Merge 'Add per-table tablet options in schema' from Benny Halevy
This series extends the table schema with per-table tablet options.
The options are used as hints for initial tablet allocation on table creation and later for resize (split or merge) decisions,
when the table size changes.

* New feature, no backport required

Closes scylladb/scylladb#22090

* github.com:scylladb/scylladb:
  tablets: resize_decision: get rid of initial_decision
  tablet_allocator: consider tablet options for resize decision
  tablet_allocator: load_balancer: table_size_desc: keep target_tablet_size as member
  network_topology_strategy: allocate_tablets_for_new_table: consider tablet options
  network_topology_strategy: calculate_initial_tablets_from_topology: precalculate shards per dc using for_each_token_owner
  network_topology_strategy: calculate_initial_tablets_from_topology: set default rf to 0
  cql3: data_dictionary: format keyspace_metadata: print "enabled":true when initial_tablets=0
  cql3/create_keyspace_statement: add deprecation warning for initial tablets
  test: cqlpy: test_tablets: add tests for per-table tablet options
  schema: add per-table tablet options
  feature_service: add TABLET_OPTIONS cluster schema feature
2025-02-08 20:32:19 +02:00
Tomasz Grabiec
61532eb53b topology_state_machine: Introduce lock transition
Will be used in load balancer tests to prevent concurrent topology
operations, in particular background load balancing.

load balancer will be invoked explicitly by the test. Disabling load
balancer in topology is not a solution, because we want the explicit
call to perform the load balancing.
2025-02-07 16:09:21 +01:00
Avi Kivity
861fb58e14 Merge 'vector: add support for vector type' from Dawid Pawlik
This pull request is an implementation of vector data type similar to one used by Apache Cassandra.

The patch contains:
- implementation of vector_type_impl class
- necessary functionalities similar to other data types
- support for serialization and deserialization of vectors
- support for Lua and JSON format
- valid CQL syntax for `vector<>` type
- `type_parser` support for vectors
- expression adjustments such as:
    - add `collection_constructor::style_type::vector`
    - rename `collection_constructor::style_type::list` to `collection_constructor::style_type::list_or_vector`
- vector type encoding (for drivers)
- unit tests
- cassandra compatibility tests
- necessary documentation

Co-authored-by: @janpiotrlakomy

Fixes https://github.com/scylladb/scylladb/issues/19455

Closes scylladb/scylladb#22488

* github.com:scylladb/scylladb:
  docs: add vector type documentation
  cassandra_tests: translate tests covering the vector type
  type_codec: add vector type encoding
  boost/expr_test: add vector expression tests
  expression: adjust collection constructor list style
  expression: add vector style type
  test/boost: add vector type cql_env boost tests
  test/boost: add vector type_parser tests
  type_parser: support vector type
  cql3: add vector type syntax
  types: implement vector_type_impl
2025-02-06 20:36:50 +02:00
Benny Halevy
c5668d99c9 schema: add per-table tablet options
Unlike with vnodes, each tablet is served only by a single
shard, and it is associated with a memtable that, when
flushed, it creates sstables which token-range is confined
to the tablet owning them.

On one hand, this allows for far better agility and elasticity
since migration of tablets between nodes or shards does not
require rewriting most if not all of the sstables, as required
with vnodes (at the cleanup phase).

Having too few tablets might limit performance due not
being served by all shards or by imbalance between shards
caused by quantization.  The number of tabelts per table has to be
a power of 2 with the current design, and when divided by the
number of shards, some shards will serve N tablets, while others
may serve N+1, and when N is small N+1/N may be significantly
larger than 1. For example, with N=1, some shards will serve
2 tablet replicas and some will serve only 1, causing an imbalance
of 100%.

Now, simply allocating a lot more tablets for each table may
theoretically address this problem, but practically:
a. Each tablet has memory overhead and having too many tablets
in the system with many tables and many tablets for each of them
may overwhelm the system's and cause out-of-memory errors.
b. Too-small tablets cause a proliferation of small sstables
that are less efficient to acces, have higher metadata overhead
(due to per-sstable overhead), and might exhaust the system's
open file-descriptors limitations.

The options introduced in this change can help the user tune
the system in two ways:
1. Sizing the table to prevent unnecessary tablet splits
and migrations.  This can be done when the table is created,
or later on, using ALTER TABLE.
2. Controlling min_per_shard_tablet_count to improve
tablet balancing, for hot tables.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-06 08:55:51 +02:00
Ernest Zaslavsky
29e60288de docs: update the object_storage.md and admin.rst
Added additional options and best practices for AWS authentication.
2025-02-05 14:57:19 +02:00
Dawid Pawlik
a68bf6dcc1 docs: add vector type documentation
Add missing vector type documentation including: definition of vector,
adjustment of term definition, JSON encoding, Lua and cql3 type
mapping, vector dimension limit, and keyword specification.
2025-01-28 21:14:49 +01:00
Aleksandra Martyniuk
18cc79176a api: task_manager: do not unregister tasks on get_status
Currently, /task_manager/task_status_recursive/{task_id} and
/task_manager/task_status/{task_id} unregister queries task if it
has already finished.

The status should not disappear after being queried. Do not unregister
finished task when its status or recursive status is queried.
2025-01-27 11:23:45 +01:00
Aleksandra Martyniuk
e37d1bcb98 api: task_manager: add /task_manager/drain
In the following patches, get_status won't be unregistering finished
tasks. However, tests need a functionality to drop a task, so that
they could manipulate only with the tasks for operations that were
invoked by these tests.

Add /task_manager/drain/{module} to unregister all finished tasks
from the module. Add respective nodetool command.
2025-01-27 11:23:45 +01:00
Paweł Zakrzewski
702e727e33 audit: Add documentation for the audit subsystem
Adds detailed documentation covering the new audit subsystem:

- Add new audit.md design document explaining:
  - Core concepts and design decisions
  - CQL extensions for audit management
  - Implementation details and trigger evaluation
  - Prior art references from other databases

- Add user-facing documentation:
  - New auditing.rst guide with configuration and usage details
  - Integration with security documentation index
  - Updates to cluster management procedures
  - Updates to security checklist

The documentation covers all aspects of the audit system including:
- Configuration options and storage backends (syslog/table)
- Audit categories (DCL/DDL/AUTH/DML/QUERY/ADMIN)
- Permission model and security considerations
- Failure handling and logging
- Example configurations and output formats

This ensures users have complete guidance for setting up and using
the new audit capabilities.
2025-01-15 11:10:35 +01:00
Kefu Chai
f8885a4afd dist/docker,docs: replace "--experimental" with "--experimental-features"
The "--experimental" option was removed in commit f6cca741ea. Using this
deprecated option now causes Scylla to fail with the error:

```
error: the argument ('on') for option '--experimental-features' is invalid
```

So, in this change, let's update the docker entry point script to use
`--experimental-features` command line option instead. The related
document is updated accordingly.

Fixes scylladb/scylladb#22207
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22283
2025-01-14 07:56:38 -05:00
Avi Kivity
814942505f Merge 'Introduce Encryption-at-Rest (EAR) for sstables and commitlog' from Calle Wilund
Fixes https://github.com/scylladb/scylla-enterprise/issues/5016#issuecomment-2558464631

EAR - encryption at rest. Allows on-disk file encryption of sstables and commitlog data.
Introduces OpenSSL based file level encrypted storage, managed via a set of providers
ranging from local files to cloud KMS providers.

For a more comprehensive explanation, see the included docs (or if possible, original
source tree).

Manual bulk merge of EAR feature from enterprise repo to main scylla repo.

Breaks some features apart, but main EAR is still a humongous commit, because to separate this
I would have to mess with code incrementally, adding time and risk.

This PR includes the local file gen tool, tests and also p11 validation.

Note: CI will not execute the full tests unless master CI is set to provide the same environment
as the enterprise one. Not sure about the status of this ATM.

Note: Includes code to compile against cryptsoft kmipc SDK, but not the SDK. If you happen to
check out this tree in the scylla folder and configure, it will be linked against and KMIP functionality
will be enabled, otherwise not.

Closes scylladb/scylladb#22233

* github.com:scylladb/scylladb:
  docs: Add EAR docs
  main/build: Add p11-kit and initialize
  tools: Add local-file-key-generator tool
  tests: Add EAR tests
  tmpdir: shorten test tempdir path
  EAR: port the ear feature from enterprise
  cql_test_env: Add optional query timeout
  schema/migration_manager: Add schema validate
  sstables: add get_shared_components accessor
  config/config_file: Add exports and definitions of config_type_for<>
2025-01-12 16:10:46 +02:00