Before, the `nodetool getendpoints` expected the key as one string separated by : (for example 1:val:ue). This caused errors if any part of the key had a colon because it was unclear whether a colon was a separator or part of the key.
This change adds a new API endpoint, `/storage_service/natural_endpoints/v2/{keyspace}`, which accepts composite partition keys as multiple key_component query parameters (e.g., ?key_component=1&key_component=val:ue). The `nodetool getendpoints` command was updated to support a new `--key-components` option, allowing users to pass key components as an array. The client and test infrastructure were extended to support multiple values for a query parameter, and tests were added to verify correct behavior with composite keys.
The previous method of passing partition keys as colon-separated strings is preserved for backward compatibility.
Backport is not required, since this change relies on recent Seastar updates
Fixes#16596Closesscylladb/scylladb#26169
* github.com:scylladb/scylladb:
docs: document --key-components option for getendpoints
test/nodetool/test_getendpoints: add coverage for --key-components param in getendpoints
nodetool: Introduce new option --key-components to specify compound partition keys as array
rest_api/test_storage_service: add v2 natural_endpoints test for composite key with multiple components
api/storage_service: add GET 'natural_endpoints' v2 to support composite keys with ':'
rest_api_mock: support duplicate query parameters
test/rest_api: support multiple query values per key in RestApiSession.send()
nodetool: add support of new seastar query_parameters_type to scylla_rest_client
We would like to have an additional service level
available for users of the Vector Store service,
which would allow us to de/prioritize vector
operations as needed. To allow that, we increase
the number of scheduling groups from 19 to 20
and adjust the related test accordingly.
Closesscylladb/scylladb#26316
This is yet another part in the BTI index project.
Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25626
Next parts: make `ms` the default. Then, general tweaks and improvements. Later, potentially a full `da` format implementation.
This patch series introduces a new, Scylla-only sstable format version `ms`, which is like `me`, but with the index components (Summary.db and Index.db) replaced with BTI index components (Partitions.db and Rows.db), as they are in Cassandra 5.0's `da` format version.
(Eventually we want to just implement `da`, but there are several other changes (unrelated to the index files) between `me` and `da`. By adding this `ms` as an intermediate step we can adapt the new index formats without dragging all the other changes into the mix (and raising the risk of regressions, which is already high)).
The high-level structure of the PR is:
1. Introduce new component types — `Partitions` and `Rows`.
2. Teach `class sstable` to open them when they exist.
3. Teach the sstable writer how to write index data to them.
4. Teach `class sstable` and unit tests how to deal with sstables that have no `Index` or `Summary` (but have `Partitions` and `Rows` instead).
5. Introduce the new sstable version `ms`, specify that it has `Partitions` and `Rows` instead of `Index` and `Summary`.
6. Prepare unit tests for the appearance of `ms`.
7. Enable `ms` in unit tests.
8. Make `ms` enablable via db::config (with a silent fall back to `me` until the new `MS_SSTABLE_FORMAT` cluster feature is enabled).
9. Prepare integration tests for the appearance of `ms`.
10. Enable both `ms` and `me` in tests where we want both versions to be tested.
This series doesn't make `ms` the default yet, because that requires teaching Scylla Manager and a few dtests about the new format first. It can be enabled by setting `sstable_format: ms` in the config.
Per a review request, here is an example from `perf_fast_forward`, demonstrating some motivation for a new format. (Although not the main one. The main motivations are getting rid of restrictions on the RAM:disk ratio, and index read throughput for datasets with tiny partitions). The dataset was populated with `build/release/scylla perf-fast-forward --smp=1 --sstable-format=$VERSION --data-directory=data.$VERSION --column-index-size-in-kb=1 --populate --random-seed=0`.
This test involves a partition with 1000000 clustering rows (with 32-bit keys and 100-byte values) and ~500 index blocks, and queries a few particular rows from the partition. Since the branching factor for the BIG promoted index is 2 (it's a binary search), the lookup involves ~11.2 sequential page reads per row. The BTI format has a more reasonable branching factor, so it involves ~2.3 page reads per row.
`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/me --run-tests=large-partition-select-few-rows`:
```
offset stride rows iterations avg aio aio (KiB)
500000 1 1 70 18.0 18 128
500001 1 1 647 19.0 19 132
0 1000000 1 748 15.0 15 116
0 500000 2 372 29.0 29 284
0 250000 4 227 56.0 56 504
0 125000 8 116 106.0 106 928
0 62500 16 67 195.0 195 1732
```
`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/ms --run-tests=large-partition-select-few-rows`:
```
offset stride rows iterations avg aio aio (KiB)
500000 1 1 51 5.1 5 20
500001 1 1 64 5.3 5 20
0 1000000 1 679 4.0 4 16
0 500000 2 492 8.0 8 88
0 250000 4 804 16.0 16 232
0 125000 8 409 31.0 31 516
0 62500 16 97 54.0 54 1056
```
Index file size comparison for the default `perf_fast_forward` tables with `--random-seed=0`:
Large partition table (dominated by intra-partition index): 2.4 MB with `me`, 732 kB with `ms`.
For the small partitions table (dominated by inter-partition index): 11 MB with `me`, 8.4 MB with `ms`.
External tests:
I ran SCT test `longevity-mv-si-4days-streaming-test` test on 6 nodes with 30 shards each for 8 hours. No anomalies were observed.
New functionality, no backport needed.
Closesscylladb/scylladb#26215
* github.com:scylladb/scylladb:
test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes
test/cluster: add test_bti_index.py
test: prepare bypass_cache_test.py for `ms` sstables
sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present
test/cqlpy/test_sstable_validation.py: prepare the test for `ms` sstables
tools/scylla-sstable: add `--sstable-version=?` to `scylla sstable write`
db/config: expose "ms" format to the users via database config
test: in Python tests, prepare some sstable filename regexes for `ms`
sstables: add `ms` to `all_sstable_versions`
test/boost/sstable_3_x_test: add `ms` sstables to multi-version tests
test/lib/index_reader_assertions: skip some row index checks for BTI indexes
test/boost/sstable_inexact_index_test: explicitly use a `me` sstable
test/boost/sstable_datafile_test: skip test_broken_promoted_index_is_skipped for `ms` sstables
test/resource: add `ms` sample sstable files for relevant tests
test/boost/sstable_compaction_test: prepare for `ms` sstables.
test/boost/index_reader_test: prepare for `ms` sstables
test/boost/bloom_filter_tests: prepare for `ms` sstables
test/boost/sstable_datafile_test: prepare for `ms` sstables
test/boost/sstable_test: prepare for `ms` sstables.
sstables: introduce `ms` sstable format version
tools/scylla-sstable: default to "preferred" sstable version, not "highest"
sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader
sstables/trie/bti_index_reader: allow the caller to passing a precalculated murmur hash
sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller
sstables/mx: make Index and Summary components optional
sstables: open Partitions.db early when it's needed to populate key range for sharding metadata
sstables: adapt sstable::set_first_and_last_keys to sstables without Summary
sstables: implement an alternative way to rebuild bloom filters for sstables without Index
utils/bloom_filter: add `add(const hashed_key&)`
sstables: adapt estimated_keys_for_range to sstables without Summary
sstables: make `sstable::estimated_keys_for_range` asynchronous
sstables/sstable: compute get_estimated_key_count() from Statistics instead of Summary
replica/database: add table::estimated_partitions_in_range()
sstables/mx: implement sstable::has_partition_key using a regular read
sstables: use BTI index for queries, when present and enabled
sstables/mx/writer: populate BTI index files
sstables: create and open BTI index files, when enabled
sstables: introduce Partition and Rows component types
sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time`
Introduce `ms` -- a new sstable format version which
is a hybrid of Cassandra's `me` and `da`.
It is based on `me`, but with the index components
(Summary.db and Index.db) replaced with the index
components of `da` (Partitions.db and Rows.db).
As of this patch, the version is never chosen
anywhere for writing sstables yet. It is only introduced.
We will add it to unit tests in a later commit,
and expose it to users in yet later commit.
This PR adds the missing documentation for the SELECT ... ANN statement that allows performing vector queries. This is just the basic explanation of the grammar and how to use it. More comprehensive documentation about vector search will be added separately in Scylla Cloud documentation and features description. Links to this additional documentation will be added as part of VECTOR-244.
Fixes: VECTOR-247.
No backport is needed as this is the new feature.
Closesscylladb/scylladb#26282
* github.com:scylladb/scylladb:
cql3: Update error messages to be in line with documentation.
docs: Add CQL documentation for vector queries using SELECT ANN
The code in `multishard_mutation_query.cc` implements the replica-side of range scans and as such it belongs in the replica module. Take the opportunity to also rename it to `multishard_query`, the code implements both data and mutation queries for a long time now.
Code cleanup, no backport required.
Closesscylladb/scylladb#26279
* github.com:scylladb/scylladb:
test/boost: rename multishard_mutation_query_test to multishard_query_test
replica/multishard_query: move code into namespace replica
replica/multishard_query.cc: update logger name
docs/paged-queries.md: update references to readers
root,replica: move multishard_mutation_query to replica/
An offline, scylla-sstable variant of nodetool upgradesstables command.
Applies latest (or selected) sstable version and latest schema.
Closesscylladb/scylladb#26109
This patch adds the missing documentation for the SELECT ... ANN
statement that allows performing vector queries. This is just the
basic explanation of the grammar and how to use it. More
comprehensive documentation about vector search will be added
separately in Scylla Cloud documentation and features description.
Links to this additional documentation will be added as part of
VECTOR-244.
Fixes: VECTOR-247.
This patch adds CQL documentation about creating vector search
indexes. It includes the syntax and description of parameters.
It does not cover VECTOR type that is already supported and
documented and it does not cover querying vectors which will be
covered by a separate PR.
Fixes: VECTOR-217
Closesscylladb/scylladb#26233
Fixed some minor grammatical changes in the documentation.
Closesscylladb/scylladb#24675
* github.com:scylladb/scylladb:
Update docs/features/cdc/cdc-streams.rst
Small grammatical changes
This change adds the documentation section which explains the algorithm
to compute the absolute number of tablets a table has.
Fixes: #25740Closesscylladb/scylladb#25741
As the proper documentation of CPP-RS Driver is already there, let's update the links to point to it instead of the GitHub repo.
Closesscylladb/scylladb#26089
update all the references about the issue of tablets support for
alternator streams to issue #23838 instead of #16317.
The issue #16317 is about support of CDC with tablets, but it is now
closed and it didn't address alternator streams. the remaining issues
about alternator streams should be addressed as part of #23838, so fix
the references in order for them not to be missed.
initial implementation to support CDC in tablets-enabled keyspaces.
The design is described in https://docs.google.com/document/d/1qO5f2q5QoN5z1-rYOQFu6tqVLD3Ha6pphXKEqbtSNiU/edit?usp=sharing
It is followed closely for the most part except "Deciding when to change streams" - instead, streams are changed synchronously with tablet split / merge.
Instead of the stream switching algorithm with the double writes, we use a scheme similar to the previous method for vnodes - we add the new streams with timestamp that is sufficiently far into the future.
In this PR we:
* add new group0-based internal system tables for tablet stream metadata and loading it into in-memory CDC metadata
* add virtual tables for CDC consumers
* the write coordinator chooses a stream by looking up the appropriate stream in the CDC metadata
* enable creating tables with CDC enabled in tablets-enabled keyspaces. tablets are allocated for the CDC table, and a stream is created per each tablet.
* on tablet resize (split / merge), the topology coordinator creates a new stream set with a new stream for each new tablet.
* the cdc tablets are co-located with the base tablets
Fixes https://github.com/scylladb/scylladb/issues/22576
backport not needed - new feature
update dtests: https://github.com/scylladb/scylla-dtest/pull/5897
update java cdc library: https://github.com/scylladb/scylla-cdc-java/pull/102
update rust cdc library: https://github.com/scylladb/scylla-cdc-rust/pull/136Closesscylladb/scylladb#23795
* github.com:scylladb/scylladb:
docs/dev: update CDC dev docs for tablets
doc: update CDC docs for tablets
test: cluster_events: enable add_cdc and drop_cdc
test/cql: enable cql cdc tests to run with tablets
test: test_cdc_with_alter: adjust for cdc with tablets
test/cqlpy: adjust cdc tests for tablets
test/cluster/test_cdc_with_tablets: introduce cdc with tablets tests
cdc: enable cdc with tablets
topology coordinator: change streams on tablet split/merge
cdc: virtual tables for cdc with tablets
cdc: generate_stream_diff helper function
cdc: choose stream in tablets enabled keyspaces
cdc: rename get_stream to get_vnode_stream
cdc: load tablet streams metadata from tables
cdc: helper functions for reading metadata from tables
cdc: colocate cdc table with base
cdc: remove streams when dropping CDC table
cdc: create streams when allocating tablets
migration_listener: add on_before_allocate_tablet_map notification
cdc: notify when creating or dropping cdc table
cdc: move cdc table creation to pre_create
cdc: add internal tables for cdc with tablets
cdc: add cdc_with_tablets feature flag
cdc: add is_log_schema helper
This command was written for an investigation and was used exactly once.
This would have been a perfect candidate for the (also rarely used)
scylla-sstable script command, but it didn't exist yet.
Drop this command from the tool, such super-specific commands should be
written as sstable-scripts nowadays, which is what we will do if we ever
need this again.
Closesscylladb/scylladb#26062
According to the changes in Vector Store API (VECTOR-148) the `embedding` term
should be changed to `vector`. As `vector` term is used for STL class the
internal type or variable names would be changed to `vs_vector` (aka vector
store vector). This patch changes also the HTTP ann json request payload
according to the Vector Store API changes.
Fixes: VECTOR-229
Closesscylladb/scylladb#26050
Capacity based balancing was introduced in 2025.1. It computes balance
based on a node's capacity: the number of tablets located on a node
should be directly proportional to that node's storage capacity.
This change adds this explanation to the docs.
Fixes: #25686Closesscylladb/scylladb#25687
We were recently surprised (in pull request #25797) to "discover" that
Scylla does not allow granting SELECT permissions on individual
materialized views. Instead, all materialized views of a base table
are readable if the base table is readable.
In this patch we document this fact, and also add a test to verify
that it is indeed true. As usual for cqlpy tests, this test can also
be run on Cassandra - and it passes showing that Cassandra also
implemented it the same way (which isn't surprising, given that we
probably copied our initial implementation from them).
The test demonstrates that neither Scylla nor Cassandra prints an error
when attempting to GRANT permissions on a specific materialized view -
but this GRANT is simply ignored. This is not ideal, but it is the
existing behavior in both and it's not important now to change it.
Additionally, because pull request #25797 made CDC-log permissions behave
the same as materialized views - i.e., you need to make the base table
readable to allow reading from the CDC log, this patch also documents
this fact and adds a test for it also.
Fixes#25800Closesscylladb/scylladb#25827
scylla-sstable write (and scrub) moved to UUID generations in
514f59d157, but said patch forgot to
update the docs. This is fixed here.
Closesscylladb/scylladb#25965
This PR introduces a major rewrite of the EaR document. The initial motivation for this PR was to fully cover all our supported key providers with working examples, and to add instructions for key rotation. However, many other improvements were made along the way.
Main changes in this PR:
* Add a high-level description for every key provider. Mention limitations.
* Better organize existing provider-specific instructions by placing them into clearly separated, tabbed sections.
* Add instructions for the replicated key provider. Mention explicitly that it cannot be used as default option for user or system encryption, and that it does not support key rotation.
* Add more examples for KMS and GCP to cover all credential types.
* Document missing configuration options.
* Add a new section for key rotation.
Notes:
* Some of the patches in this series have been cherry-picked from Laszlo's wip branch.
* This PR is expected to conflict with the Azure Key Vault PR, which should be merged first. (https://github.com/scylladb/scylladb/pull/23920/)
* Support for KMIP system keys in the Replicated Key Provider is currently broken. (https://github.com/scylladb/scylladb/issues/24443)
Fixesscylladb/scylla-enterprise#3535.
Refs scylladb/scylla-enterprise#3183.
Only doc changes. No backport is needed.
Closesscylladb/scylladb#24558
* github.com:scylladb/scylladb:
encryption-at-rest.rst: add "Rotate Encryption Keys" section
encryption-at-rest.rst: rewrite "Encrypt System Resources" section
encryption-at-rest.rst: rewrite "Update Encryption Properties of Existing Tables" section
encryption-at-rest.rst: rewrite "Encrypt a Single Table" section
encryption-at-rest.rst: rewrite "Encrypt Tables" section
encryption-at-rest.rst: update "Set the Azure Host" section
encryption-at-rest.rst: update "Set the GCP Host" section
encryption-at-rest.rst: update "Set the KMS Host" section
encryption-at-rest.rst: update "Set the KMIP Host" section
encryption-at-rest.rst: rewrite "Create Encryption Keys" section
encryption-at-rest.rst: rewrite "Key Providers" section
encryption-at-rest.rst: hoist and update "Cipher Algorithm Descriptors"
encryption-at-rest.rst: rewrite/replace section "Encryption Key Types"
encryption-at-rest.rst: About: describe high-level operation more precisely
encryption-at-rest.rst: improve wording / formatting in About intro
encryption-at-rest.rst: users (plural) typo fix
encryption-at-rest.rst: rewrap
encryption-at-rest.rst: strip trailing whitespace
Allow for the following CQL syntax:
```
CREATE KEYSPACE [IF NOT EXISTS] <name>;
```
for example:
```
CREATE KEYSPACE test_keyspace;
```
With this syntax all the keyspace's parameters would be defaulted to:
replication strategy = `NetworkTopologyStrategy`,
replication factor = number of racks , but excluding racks that only have arbiter nodes
storage options, durable writes = defaults we normally would use,
tablets enabled if they are enabled in the db configuration, e.g. scylla.yaml or db/config.cc by default.
Options besides `replication` already have defaults. `replication` had to be specified, but it could be an empty set, where defaults for sub-options (replication strategy and replication factor) would be used - `replication = {}`. Now there is no need for specifying an empty set - omitting `replication = {}` has the same effect as `replication = {}`.
Since all the options now have defaults, `WITH` is optional for `CREATE KEYSPACE` statement.
Fixes#25145
This is an improvement, no backport needed.
Closesscylladb/scylladb#25872
* github.com:scylladb/scylladb:
docs: cql: default create keyspace syntax
test: cqlpy: add test for create keyspace with no options specified
cql: default `CREATE KEYSPACE` syntax
The `--incremental-mode` option specifies the incremental repair mode.
Can be 'disabled', 'regular', or 'full'.
'regular': The incremental repair logic is enabled. Unrepaired sstables
will be included for repair. Repaired sstables will be skipped. The
incremental repair states will be updated after repair.
'full': The incremental repair logic is enabled. Both repaired and
unrepaired sstables will be included for repair. The incremental repair
states will be updated after repair.
'disabled': The incremental repair logic is disabled completely. The
incremental repair states, e.g., repaired_at in sstables and
sstables_repaired_at in the system.tablets table, will not be updated
after repair.
When the option is not provided, it defaults to regular.
Fixes#25931Closesscylladb/scylladb#25969
The existing article is already extensive and covers pretty much
all of the details useful to the user. However, the document
lacked minute information like the default names of the DC and rack
in case of SimpleSnitch or it didn't explicitly specify the behavior
of RackInferringSnitch (though arguably the existing example was more
than sufficient).
Fixesscylladb/scylladb#23528Closesscylladb/scylladb#25700
This patch introduces a new `incremental_mode` parameter to the tablet
repair REST API, providing more fine-grained control over the
incremental repair process.
Previously, incremental repair was on and could not be turned off. This
change allows users to select from three distinct modes:
- `regular`: This is the default mode. It performs a standard
incremental repair, processing only unrepaired sstables and skipping
those that are already repaired. The repair state (`repaired_at`,
`sstables_repaired_at`) is updated.
- `full`: This mode forces the repair to process all sstables, including
those that have been previously repaired. This is useful when a full
data validation is needed without disabling the incremental repair
feature. The repair state is updated.
- `disabled`: This mode completely disables the incremental repair logic
for the current repair operation. It behaves like a classic
(pre-incremental) repair, and it does not update any incremental
repair state (`repaired_at` in sstables or `sstables_repaired_at` in
the system.tablets table).
The implementation includes:
- Adding the `incremental_mode` parameter to the
`/storage_service/repair/tablet` API endpoint.
- Updating the internal repair logic to handle the different modes.
- Adding a new test case to verify the behavior of each mode.
- Updating the API documentation and developer documentation.
Fixes#25605Closesscylladb/scylladb#25693
This patch updates the create keyspace statement docs. It explains how
the `replication` option in the create keyspace statement is now optional,
and behaves the same as if we specified an empty set as following:
`WITH replication = {}`.
An example with no `replication` option specified has also been added.
Refs #25145
When creating a new keyspace, replication factor must be stated.
For example:
`CREATE KEYSPACE ks WITH REPLICATION { 'class': 'NetworkTopologyStrategy', 'replication_factor': 3 };`
This patch changes it in the following way - if there is no
replication factor specified, use default replication factor.
Default replication factor is equal to the number of racks that
are not arbiter-only, i.e. racks that have at least one non-arbiter node.
The following syntax is now valid:
`CREATE KEYSPACE ks WITH REPLICATION { 'class': 'NetworkTopologyStrategy' };`
`CREATE KEYSPACE ks WITH REPLICATION { };`
Fixes#16028
Backport is not needed. This is an enhancement for future releases.
Closesscylladb/scylladb#25570
* github.com:scylladb/scylladb:
docs/cql: update documentation for default replication factor
test/cqlpy: add keyspace creation default replication factor tests
cql3: add default replication factor to `create_keyspace_statement`
This commit adds missing fields to GetRecords responses: `awsRegion` and
`eventVersion`. We also considered changing `eventSource` from
`scylladb:alternator` to `aws:dynamodb` and setting `SizeBytes` subfield
inside the `dynamodb` field.
We set `awsRegion` to the datacenter's name of the node that received
the request. This is in line with the AWS documentation, except that
Scylla has no direct equivalent of a region, so we use the datacenter's
name, which is analogous to DynamoDB's concept of region.
The field `eventVersion` determines the structure of a Record. It is
updated whenever the structure changes. We think that adding a field
`userIdentity` bumped the version from `1.0` to `1.1`. Currently, Scylla
doesn't support this field (#11523), hence we use the older 1.0 version.
We have decided to leave `eventSource` as is, since it's easy to modify
it in case of problems to `aws:dynamodb` used by DynamoDB.
Not setting `SizeBytes` subfield inside the `dynamodb` field was
dictated by the lack of apparent use cases. The documentation is unclear
about how `SizeBytes` is calculated and after experimenting a little
bit, I haven't found an obvious pattern.
Fixes: #6931Closesscylladb/scylladb#24903
When a scaling out is delayed or fails, it is crucial to ensure that clusters remain operational
and recoverable even under extreme conditions. To achieve this, the following proactive measures
are implemented:
- reject writes
- includes: inserts, updates, deletes, counter updates, hints, read+repair and lwt writes
- applicable to: user tables, views, CDC log, audit, cql tracing
- stop running compactions/repairs and prevent from starting new ones
- reject incoming tablet migrations
The aforementioned mechanisms are automatically enabled when node's disk utilization reaches
the critical level (default: 98%) and disabled when the utilization drop below the threshold.
Apart from that, the series add tests that require mounted volumes to simulate out of space.
The paths to the volumes can be provided using the a pytest argument, i.e. `--space-limited-dirs`.
When not provided, tests are skipped.
Test scenarios:
1. Start a cluster and write data until one of the nodes reaches 90% of the disk utilization
2. Perform an **operation** that would take the nodes over 100%
3. The nodes should not exceed the critical disk utilization (98% by default)
4. Scale out the cluster by adding one node per rack
5. Retry or wait for the **operation** from step 2
The **operation** is: writing data, running compactions, building materialized views, running repair,
migrating tablets (caused by RF change, decommission).
The test is successful, if no nodes run out of space, the **operation** from step 2 is
aborted/paused/timed out and the **operation** from step 5 is successful.
`perf-simple-query --smp 1 -m 1G` results obtained for fixed 400MHz frequency:
Read path (before)
```
instructions_per_op:
mean= 39661.51 standard-deviation=34.53
median= 39655.39 median-absolute-deviation=23.33
maximum=39708.71 minimum=39622.61
```
Read path (after)
```
instructions_per_op:
mean= 39691.68 standard-deviation=34.54
median= 39683.14 median-absolute-deviation=11.94
maximum=39749.32 minimum=39656.63
```
Write path (before):
```
instructions_per_op:
mean= 50942.86 standard-deviation=97.69
median= 50974.11 median-absolute-deviation=34.25
maximum=51019.23 minimum=50771.60
```
Write path (after):
```
instructions_per_op:
mean= 51000.15 standard-deviation=115.04
median= 51043.93 median-absolute-deviation=52.19
maximum=51065.81 minimum=50795.00
```
Fixes: https://github.com/scylladb/scylladb/issues/14067
Refs: https://github.com/scylladb/scylladb/issues/2871
No backport, as it is a new feature.
Closesscylladb/scylladb#23917
* github.com:scylladb/scylladb:
tests/cluster: Add new storage tests
test/scylla_cluster: Override workdir when passed via cmdline
streaming: Reject incoming migrations
storage_service: extend locator::load_stats to collect per-node critical disk utilization flag
repair_service: Add a facility to disable the service
compaction_manager: Subscribe to out of space controller
compaction_manager: Replace enabled/disabled states with running state
database: Add critical_disk_utilization mode database can be moved to
disk_space_monitor: add subscription API for threshold-based disk space monitoring
docs: Add feature documentation
config: Add critical_disk_utilization_level option
replica/exceptions: Add a new custom replica exception
This patch introduces `view_building_coordinator`, a single entity within whole cluster responsible for building tablet-based views.
The view building coordinator takes slightly different approach than the existing node-local view builder. The whole process is split into smaller view building tasks, one per each tablet replica of the base table.
The coordinator builds one base table at a time and it can choose another when all views of currently processing base table are built.
The tasks are started by setting `STARTED` state and they are executed by node-local view building worker. The tasks are scheduled in a way, that each shard processes only one tablet at a time (multiple tasks can be started for a shard on a node because a table can have multiple views but then all tasks have the same base table and tablet (last_token)). Once the coordinator starts the tasks, it sends `work_on_view_building_tasks` RPC to start the tasks and receive their results.
This RPC is resilient to RPC failure or raft leader change, meaning if one RPC call started a batch of tasks but then failed (for instance the raft leader was changed and caller aborted waiting for the response), next RPC call will attach itself to the already started batch.
The coordinator plugs into handling tablet operations (migration/resize/RF change) and adjusts its tasks accordingly. At the start of each tablet operation, the coordinator aborts necessary view building tasks to prevent https://github.com/scylladb/scylladb/issues/21564. Then, new adjusted tasks are created at the end of the operation.
If the operation fails at any moment, aborted tasks are rollback.
The view building coordinator can also handle staging sstables using process_staging view building tasks. We do this because we don't want to start generating view updates from a staging sstable prematurely, before the writes are directed to the new replica (https://github.com/scylladb/scylladb/issues/19149).
For detailed description check: `docs/dev/view-building-coordinator.md`
Fixes https://github.com/scylladb/scylladb/issues/22288
Fixes https://github.com/scylladb/scylladb/issues/19149
Fixes https://github.com/scylladb/scylladb/issues/21564
Fixes https://github.com/scylladb/scylladb/issues/17603
Fixes https://github.com/scylladb/scylladb/issues/22586
Fixes https://github.com/scylladb/scylladb/issues/18826
Fixes https://github.com/scylladb/scylladb/issues/23930
---
This PR is reimplementation of https://github.com/scylladb/scylladb/pull/21942Closesscylladb/scylladb#23760
* github.com:scylladb/scylladb:
test/cluster: add view build status tests
test/cluster: add view building coordinator tests
utils/error_injection: allow to abort `injection_handler::wait_for_message()`
test: adjust existing tests
utils/error_injection: add injection with `sleep_abortable()`
db/view/view_builder: ignore `no_such_keyspace` exception
docs/dev: add view building coordinator documentation
db/view/view_building_worker: work on `process_staging` tasks
db/view/view_building_worker: register staging sstable to view building coordinator when needed
db/view/view_building_worker: discover staging sstables
db/view/view_building_worker: add method to register staging sstable
db/view/view_update_generator: add method to process staging sstables instantly
db/view/view_update_generator: extract generating updates from staging sstables to a method
db/view/view_update_generator: ignore tablet-based sstables
db/view/view_building_coordinator: update view build status on node join/left
db/view/view_building_coordinator: handle tablet operations
db/view: add view building task mutation builder
service/topology_coordinator: run view building coordinator
db/view: introduce `view_building_coordinator`
db/view/view_building_worker: update built views locally
db/view: introduce `view_building_worker`
db/view: extract common view building functionalities
db/view: prepare to create abstract `view_consumer`
message/messaging_service: add `work_on_view_building_tasks` RPC
service/topology_coordinator: make `term_changed_error` public
db/schema_tables: create/cleanup tasks when an index is created/dropped
service/migration_manager: cleanup view building state on drop keyspace
service/migration_manager: cleanup view building state on drop view
service/migration_manager: create view building tasks on create view
test/boost: enable proxy remote in some tests
service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()`
service/migration_manager: coroutinize `prepare_new_view_announcement()`
service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine`
service: reload `view_building_state_machine` on group0 apply()
service/vb_coordinator: add currently processing base
db/system_keyspace: move `get_scylla_local_mutation()` up
db/system_keyspace: add `view_building_tasks` table
db/view: add view_building_state and views_state
db/system_keyspace: add method to get view build status map
db/view: extract `system.view_build_status_v2` cql statements to system_keyspace
db/system_keyspace: move `internal_system_query_state()` function earlier
db/view: ignore tablet-based views in `view_builder`
gms/feature_service: add VIEW_BUILDING_COORDINATOR feature
Update create-keyspace-statement section of ddl.rst since replication factor is no longer mandatory.
Add an example for keyspace creation without specifying replication factor.
Add an example for keyspace creation without specifying both `class` and replication factor.
Refs: #16028
add --dc and --rack commandline arguments to the scylla docker image, to
allow starting a node with a specified dc and rack names in a simple
way.
This is useful mostly for small examples and demonstrations of starting
multiple nodes with different racks, when we prefer not to bother with
editing configuration files. The ability to assign nodes to different
racks is especially important with RF=Rack enforcing.
The previous method to achieve this is to set the snitch to
GossipingPropertyFileSnitch and provide a configuration file in
/etc/scylla/cassandra-rackdc.properties with the name of the dc and
rack.
The new dc and rack parameters are implemented similarly by using the
snitch GossipingPropertyFileSnitch and writing the dc and rack values to
the rackdc properties file. We don't support passing the parameters
together with a different snitch, or when mounting a properties file
from the host, because we don't want to overwrite it.
Example:
docker run -d --name scylla1 scylladb/scylla --dc my_dc1 --rack my_rack1
Fixesscylladb/scylladb#23423Closesscylladb/scylladb#25607