Compare commits

..

199 Commits

Author SHA1 Message Date
Piotr Smaron
d4c28690e1 db: fail reads and writes with local consistencty level to a DC with RF=0
When read or write operations are performed on a DC with RF=0 with LOCAL_QUORUM
or LOCAL_ONE consistency level, Cassandra throws `Unavailable` exception.
Scylla allowed such read operations and failed write operations with a cryptic:
"broken promise" error. This occured because the initial availability
check passed (quorum of 0 requires 0 replicas), but execution failed
later when no replicas existed to process the mutation.

This patch adds an explicit RF=0 validation for LOCAL_ONE and LOCAL_QUORUM that
throws before attempting operation execution.

The change also requires `test_query_dc_with_rf_0_does_not_crash_db` to be
upgraded. This testcase was asserting somewhat similar scenario, but wasn't
taking into account the whole matrix of combinations:
- scenarios: successful vs unsuccesful operation outcome
- local consistency levels: LOCAL_QUORUM & LOCAL_ONE
- operations: SELECT (read) & INSERT (write)

and so it's been extended to cover both the pre-existing and the current issues
and the whole matrix of combinations.

Fixes: scylladb/scylladb#27893
2026-01-22 12:49:45 +01:00
Piotr Smaron
9475659ae8 db: consistency_level: split local_quorum_for()
The core of `local_quorum_for()` has been extracted to
`get_replication_factor_for_dc()`, which is going to be used later,
while `local_quorum_for()` itself has been recreated using the exracted
part.
2026-01-22 12:49:23 +01:00
Piotr Smaron
0b3ee197b6 db: consistency_level: fix nrs -> nts abbreviation
`network_topology_strategy` was abbreviated with `nrs`, and not `nts`. I
think someone incorrectly assumed it's 'network Replication strategy', hence
nrs.
2026-01-22 12:48:37 +01:00
Sergey Zolotukhin
799d837295 test: disable test_start_bootstrapped_with_invalid_seed
The test intermittently fails when an invalid DNS name is resolved,
likely due to ISP DNS error hijacking (see scylladb/scylladb#28153).

Disable this test to unblock CI.

Fixes scylladb/scylladb#28153

Closes scylladb/scylladb#28162
2026-01-15 10:25:45 +01:00
Jenkins Promoter
51d61f809e Update pgo profiles - aarch64 2026-01-15 05:13:03 +02:00
Jenkins Promoter
eed1e7fa23 Update pgo profiles - x86_64 2026-01-15 04:33:43 +02:00
Tomasz Grabiec
eef798d84f Merge 'Distribute data evenly among primary replicas during restore' from Robert Bindar
Most likely 817fdad uncovered the fact that our choice of primary replica was resonating with tablet allocation and we were ending up picking the same replica as primary within a scope instead of rotating primaryship among all replicas in the scope.
This created situations where for instance, restoring into a 9 nodes with primary_replica_only=true would put all data into 3 nodes, leaving the other 6 unused. The balancing of the dataset was performed by the subsequent repair step.

This PR fixes this by changing the formula for picking up the primary replica out of a set of eligible replicas from within the passed scope.
The PR also extends the testing scenarios in `test_backup.py` so we get to run restore for a set of topologies, for all combinations of scope, primary_replica_only and min_tablet_counts.
Most of the work was done by @bhalevy [here](https://github.com/scylladb/scylladb/compare/master...bhalevy:scylla:load-balance-primary-replica), this PR just splitted it and did touchups here and there.

Fixes #27281

Closes scylladb/scylladb#27397

* github.com:scylladb/scylladb:
  test: reduce dataset and number of test cases or debug builds
  test: bump repair timeout up, it's sometimes not enough in CI
  test: refactor test_refresh.py to match test_restore_with_streaming_scopes.
  test: extend test_restore_with_streaming_scopes
  test: Adjust test_restore_primary_replica_different_dc_scope_all
  test: Refactor restoring code in test_backup to match SM pattern
  test: add check_mutation_replicas calls after fresh creation of dataset
  test: extend create_dataset to accept consistency_level
  test: refactor check_mutation_replicas so it's more readable
  test: make create_dataset async and refactor so it's configurable
  test: use defaultdict in collect_mutations
  test: add log marks to facilitate reusing server for restore
  locator: tablets: Distribute data evenly among primary replicas during restore
2026-01-14 18:57:55 +01:00
Avi Kivity
bd08b6e5b2 Merge 'Unify configuration of object storage endpoints (take 2)' from Pavel Emelyanov
To configure S3 storage, one needs to do

```
object_storage_endpoints:
  - name: s3.us-east-1.amazonaws.com
    port: 443
    https: true
    aws_region: us-east-1
```

and for GCS it's

```
object_storage_endpoints:
  - name: https://storage.googleapis.com:433
    type: gs
    credentials_file: <gcp account credentials json file>
```

This PR updates the S3 part to look like

```
object_storage_endpoints:
  - name: https://s3.us-east-1.amazonaws.com:443
    aws_region: us-east-1
```

fixes: #26570

This is 2nd attempt, previous one (#27360) was reverted because it reported endpoint configs in new format via API and CQL always, even if the endpoint was configured in the old way. This "broke" scylla manager and some dtests. This version has this bug fixed, and endpoints are reported in the same format as they were configured with.

About correctness of the changes.

No modifications to existing tests are made here, so old format is respected correctly (as far as it's covered by tests). To prove the new format works the the test_get_object_store_endpoints is extended to validate both options. Some preparations to this test to make this happen come on their own with the PR #28111  to show that they are valid and pass before changing the core code.

Enhancing the way configuration is made, likely no need to backport.

Closes scylladb/scylladb#28112

* github.com:scylladb/scylladb:
  test: Validate S3 endpoints new format works
  docs: Update docs according to new endpoints config option format
  object_storage: Create s3 client with "extended" endpoint name
  s3/storage: Tune config updating
  sstable: Shuffle args for s3_client_wrapper
  test: Rename badconf variable into objconf
  test: Split the object_store/test_get_object_store_endpoints test
2026-01-14 18:29:03 +02:00
Yaniv Michael Kaul
d919aacc69 storage_proxy: mark write_timeouts metric for counter write timeouts
When a counter write times out (due to rpc::timeout_error or timed_out_error),
the code was throwing mutation_write_timeout_exception but not marking the
write_timeouts metric. This resulted in counter write timeouts not being
counted in the scylla_storage_proxy_coordinator_write_timeouts metric.

Regular writes go through mutate_internal -> mutate_end, which catches
mutation_write_timeout_exception and marks the metric. However, counter
writes use a separate code path (mutate_counters) that has its own
exception handling but was missing the metric update.

This fix adds get_stats().write_timeouts.mark() before throwing the
timeout exception in the counter write path, consistent with how the
CAS path handles cas_write_timeouts.

Refs: https://scylladb.atlassian.net/browse/SCYLLADB-245

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#28019
2026-01-14 17:50:46 +02:00
Gleb Natapov
bee5f63cb6 topology coordinator: complete pending operation for a replaced node
A replaced node may have pending operation on it. The replace operation
will move the node into the 'left' state and the request will never be
completed. More over the code does not expect left node to have a
request. It will try to process the request and will crash because the
node for the request will not be found.

The patch checks is the replaced node has peening request and completes
it with failure. It also changes topology loading code to skip requests
for nodes that are in a left state. This is not strictly needed, but
makes the code more robust.

Fixes #27990

Closes scylladb/scylladb#28009
2026-01-14 13:11:27 +01:00
Botond Dénes
551eecab63 Merge 'EAR: deprecate the replicated key provider' from Calle Wilund
Refs #22733.

Adds runtime warning and docs info that replicated provider is deprecated and will be removed.

Fixes #27292

Closes scylladb/scylladb#27270

* github.com:scylladb/scylladb:
  docs::encryption: Add warning that replicated provider is deprecated
  ent::encryption: Switch default key provider from replicated to local
  replicated_key_provider: Add deprecation warning on usage
2026-01-14 13:47:23 +02:00
Patryk Jędrzejczak
6b5923c64e test: test_group0_schema_versioning: wait for schema sync in system.local
`test_schema_versioning_with_recovery` is currently flaky. It performs
a write with CL=ALL and then checks if the schema version is the same on
all nodes by calling `verify_table_versions_synced`. All nodes are expected
to sync their schema before handling the replica write. The node in
RECOVERY mode should do it through a schema pull, and other nodes should do
it through a group 0 read barrier.

The problem is in `verify_local_schema_versions_synced` that compares the
schema versions in `system.local`. The node in RECOVERY mode updates the
schema version in `system.local` after it acknowledges the replica write
as completed. Hence, the check can fail.

We fix the problem by making the function wait until the schema versions
match.

Note that RECOVERY mode is about to be retired together with the whole
gossip-based topology in 2026.2. So, this test is about to be deleted.
However, we still want to fix it, so that it doesn't bother us in older
branches.

Fixes #23803

Closes scylladb/scylladb#28114
2026-01-14 09:55:45 +01:00
Jakub Smolar
aefd815194 test.py: add pexpect to the dependencies
Use pexpect to control a presistent GDB process with pattern reads and timeouts. This makes 'scylla_gdb' tests faster and less flaky.
Added python3-pexpect in 'install-dependencies.sh'.

Closes scylladb/scylladb#26419

[avi:
  build optimized clang 21.1.8
  regenerated frozen toolchain with optimized clang from

    https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-aarch64.tar.gz
    https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-x86_64.tar.gz
]

Closes scylladb/scylladb#28134
2026-01-14 10:17:37 +02:00
Botond Dénes
122b7847e5 Merge 'index: Accept view properties in CREATE INDEX' from Dawid Mędrek
Problem
-------
Secondary indexes are implemented via materialized views under the
hood. The way an index behaves is determined by the configuration
of the view. Currently, it can be modified by performing the CQL
statement `ALTER MATERIALIZED VIEW` on it. However, that raises some
concerns.

Consider, for instance, the following scenario:

1. The user creates a secondary index on a table.
2. In parallel, the user performs writes to the base table.
3. The user modifies the underlying materialized view, e.g. by setting
   the `synchronous_updates` to `true` [1].

Some of the writes that happened before step 3 used the default value
of the property (which is `false`). That had an actual consequence
on what happened later on: the view updates were performed
asynchronously. Only after step 3 had finished did it change.

Unfortunately, as of now, there is no way to avoid a situation like
that. Whenever the user wants to configure a secondary index they're
creating, they need to do it in another schema change. Since it's
not always possible to control how the database is manipulated in
the meantime, it leads to problems like the one described.

That's not all, though. The fact that it's not possible to configure
secondary indexes is inconsistent with other schema entities. When
it comes to tables or materialized views, the user always have a means
to set some or even all of the properties during their creation.

Solution
--------
The solution to this problem is extending the `CREATE INDEX` CQL
statement by view properties. The syntax is of form:

```
> CREATE INDEX <index name>
> .. ON <keyspace>.<table> (<columns>)
> .. WITH <properties>
```

where `<properties>` corresponds to both index-specific and view
properties [2, 3]. View properties can only be used with indexes
implemented with materialized views; for example, it will be impossible
to create a vector index when specifying any view property (see
examples below).

When a view property is provided, it will be applied when creating the
underlying materialized view. The behavior should be similar to how
other CQL statements responsible for creating schema entities work.

High-level implementation strategy
----------------------------------
1. Make auxiliary changes.
2. Introduce data structures representing the new set of index
   properties: both index-specific and those corresponding to the
   underlying view.
3. Extend `CREATE INDEX` to accept view properties.
4. Extend `DESCRIBE INDEX` and other `DESCRIBE` statements to include
   view properties in their output.

User documentation is also updated at the steps to reflect the
corresponding changes.

Implementation considerations
-----------------------------
There are a number of schema properties that are now obsolete. They're
accepted by other CQL statements, but they have no effect. They
include:

* `index_interval`
* `replicate_on_write`
* `populate_io_cache_on_flush`
* `read_repair_chance`
* `dclocal_read_repair_chance`

If the user tries to create a secondary index specifying any of those
keywords, the statement will fail with an appropriate error (see
examples below).

Unlike materialized views, we forbid specifying the clustering order
when creating a secondary index [4]. This limitation may be lifted
later on, but it's a detail that may or may not prove troublesome. It's
better to postpone covering it to when we have a better perspective on
the consequences it would bring.

Examples
--------
Good examples
```
> CREATE INDEX idx ON ks.t (v);
> CREATE INDEX idx ON ks.t (v) WITH comment = 'ok view property';
> CREATE INDEX idx ON ks.t (v)
  .. WITH comment = 'multiple view properties are ok'
  .. AND synchronous_updates = true;
> CREATE INDEX idx ON ks.t (v)
  .. WITH comment = 'default value ok'
  .. AND synchronous_updates = false;
```

Bad examples
```
> CREATE INDEX idx ON ks.t (v) WITH replicate_on_write = true;

SyntaxException: Unknown property 'replicate_on_write'

> CREATE INDEX idx ON ks.t (v)
  .. WITH OPTIONS = {'option1': 'value1'}
  .. AND comment = 'some text';

InvalidRequest: Error from server: code=2200 [Invalid query]
  message="Cannot specify options for a non-CUSTOM index"

> CREATE CUSTOM INDEX idx ON ks.t (v)
  .. WITH OPTIONS = {'option1': 'value1'}
  .. AND comment = 'some text';

InvalidRequest: Error from server: code=2200 [Invalid query]
  message="CUSTOM index requires specifying the index class"

> CREATE CUSTOM INDEX idx ON ks.t (v)
  .. USING 'vector_index'
  .. WITH OPTIONS = {'option1': 'value1'}
  .. AND comment = 'some text';

InvalidRequest: Error from server: code=2200 [Invalid query]
  message="You cannot use view properties with a vector index"

> CREATE INDEX idx ON ks.t (v) WITH CLUSTERING ORDER BY (v ASC);

InvalidRequest: Error from server: code=2200 [Invalid query]
  message="Indexes do not allow for specifying the clustering order"
```

and so on. For more examples, see the relevant tests.

References:
[1] https://docs.scylladb.com/manual/branch-2025.4/cql/cql-extensions.html#synchronous-materialized-views
[2] https://docs.scylladb.com/manual/branch-2025.4/cql/secondary-indexes.html#create-index
[3] https://docs.scylladb.com/manual/branch-2025.4/cql/mv.html#mv-options
[4] https://docs.scylladb.com/manual/branch-2025.4/cql/dml/select.html#ordering-clause

Fixes scylladb/scylladb#16454

Backport: not needed. This is an enhancement.

Closes scylladb/scylladb#24977

* github.com:scylladb/scylladb:
  cql3: Extend DESC INDEX by view properties
  cql3: Forbid using CLUSTERING ORDER BY when creating index
  cql3: Extend CREATE INDEX by MV properties
  cql3/statements/create_index_statement: Allow for view options
  cql3/statements/create_index_statement: Rename member
  cql3/statements/index_prop_defs: Re-introduce index_prop_defs
  cql3/statements/property_definitions: Add extract_property()
  cql3/statements/index_prop_defs.cc: Add namespace
  cql3/statements/index_prop_defs.hh: Rename type
  cql3/statements/view_prop_defs.cc: Move validation logic into file
  cql3/statements: Introduce view_prop_defs.{hh,cc}
  cql3/statements/create_view_statement.cc: Move validation of ID
  schema/schema.hh: Do not include index_prop_defs.hh
2026-01-14 09:54:27 +02:00
Pavel Emelyanov
e57ee84662 util: Re-use seastar::util::memory_data_sink
A data_sink that stores buffers into an in-memory collection had
appeared in seastar recently. In Scylla there's similar thing that uses
memory_data_sink_buffer as a container, so it's possible to drop the
data_sink_impl iself in favor of seastar implementation.

For that to work there should be append_buffers() overload for the
aforementioned container. For its nice implementation the container, in
turn, needs to get push_back() method and value_type trait. The method
already exists, but is called put(), so just rename it. There's one more
user of it this method in S3 client, and it can enjoy the added
append_buffers() helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28124
2026-01-14 08:54:00 +02:00
Avi Kivity
3fc5a32136 tools: toolchain: update instructions for building optimized clang with version information
The instructions for building optimized clang neglected to mention
that the clang version to be built must be specified. Correct that.

Closes scylladb/scylladb#28135
2026-01-14 06:46:20 +02:00
Botond Dénes
6e17bf5c1a tools/scylla-nodetool: migrate to std::localtime
fmt::localtime() is now deprecated, users should migrate to equivalents
from the standard libraries.
std::localtime is not thread safe, so a local wrapper is introduced,
based on the thread-safe localtime_r() (from libc).

Closes scylladb/scylladb#27821
2026-01-13 20:46:31 +02:00
Avi Kivity
489d1a0fbc Merge 'replica: don't throw exceptions for read timeout' from Botond Dénes
Read timeouts are a common occurence and they typically occur when the replica is overloaded. So throwing exceptions for read timeouts is very harmful. Be careful not to thow exceptions while propagating them up the future chain. Add a test to enfore and detect regressions.

Fixes: scylladb/scylladb#25062

Improvement, normally not a backport candidate, but we may decide to backport if customer(s) are found to suffer from this.

Closes scylladb/scylladb#25068

* github.com:scylladb/scylladb:
  reader_permit: remove check_abort()
  test/boost/database_test: add test for read timeout exceptions
  sstables/mx/reader: don't throw exceptions on the read-path
  readers/multishard: don't throw exceptions on the read-path
  replica/table: don't throw exceptions on the read-path
  multishard_mutation_query: fix indentation
  multishard_mutation_query: don't throw exceptions on the read-path
  service/storage_proxy: don't throw exceptions on the full-scan path
  cql3/query_processor: don't throw exceptions on the read-path
  reader_permit: add get_abort_exception()
2026-01-13 16:17:41 +02:00
Avi Kivity
c6dfae5661 treewide: #include Seastar headers with angle brackets
Seastar is an external library from the point of view of
ScyllaDB, so should be included with angle brackets.

Closes scylladb/scylladb#27947
2026-01-13 14:56:15 +02:00
Tomasz Grabiec
63b9a7e2b5 test: pylib: log_browsing: Grep logs without considering newly appended lines
At the end of the test case, the framework greps logs for errors and
backtraces. The servers are still running at this point. Some test
cases enable debug-level logging. If servers manage to produce new
lines between the python script processes them, the grep will never
return.

Protect against this by grepping over a file snapshot.

Fixes #28086

Closes scylladb/scylladb#28088
2026-01-13 14:41:02 +02:00
Nadav Har'El
fc6fff61d1 docs/alternator: add document on reducing Alternator network costs
This patch adds a new document, docs/alternator/network.md,
explaining the various mechanisms that can be used to reduce
network usage in Alternator. It explains compression of requests
and responses, header reduction, rack-aware routing, and RPC compression.

Many of these topics - especially support in the client libraries -
are work in progress, so some details are still missing in the new
document. Still, I think it is a good start that can be improved
later.

Fixes #27915.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27927
2026-01-13 14:29:01 +02:00
Pavel Emelyanov
9ffd22491f test: Validate S3 endpoints new format works
Extend the test_get_object_store_endpoints() test to configure S3
endpoints in full-url format and check that they are rendered properly
via API/CQL.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:24:18 +03:00
Pavel Emelyanov
bd225784bd docs: Update docs according to new endpoints config option format
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:24:06 +03:00
Pavel Emelyanov
f227de24b2 object_storage: Create s3 client with "extended" endpoint name
For this, add the s3::client::make(endpoint, ...) overload that accepts
endpoint in proto://host:port format. Then it parses the provided url
and calls the legacy one, that accepts raw host string and config with
port, https bit, etc.

The generic object_storage_endpoint_param no longer needs to carry the
internal s3::endpoint_config, the config option parsing changes
respectively.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:24:06 +03:00
Pavel Emelyanov
8f97e6b3de s3/storage: Tune config updating
Don't prepare s3::endpoint_config from generic code, jut pass the region
and iam_role_arn (those that can potentially change) to the callback.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:24:06 +03:00
Pavel Emelyanov
bee3564564 sstable: Shuffle args for s3_client_wrapper
Make it construct like gs_client_wrapper -- with generic endpoint param
reference and make the storage-specific casts/gets/whatever internally.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:24:06 +03:00
Pavel Emelyanov
83e88d206c test: Rename badconf variable into objconf
It's not actually a "bad" config, it's just some config the test works
with.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:23:20 +03:00
Pavel Emelyanov
9c627bc44a test: Split the object_store/test_get_object_store_endpoints test
It tests two things -- the way object storage config is represented via
API and CQL (from sytem.config) and that updating config affects CREATE
KEYSPACE CQL (with keyspace storage options)

It's better to split the test, as its former part is going to be
extented to validate old/new config formats (see #26570)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:23:03 +03:00
Robert Bindar
dfcabb5fa4 test: reduce dataset and number of test cases or debug builds
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:51 +02:00
Robert Bindar
ca3c57e821 test: bump repair timeout up, it's sometimes not enough in CI
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:49 +02:00
Robert Bindar
6f5e58e718 test: refactor test_refresh.py to match test_restore_with_streaming_scopes.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:48 +02:00
Robert Bindar
6e636a4231 test: extend test_restore_with_streaming_scopes
to test restoring with a different min_tablet_count
than the schema was originally created with.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:46 +02:00
Robert Bindar
92cd1ddec3 test: Adjust test_restore_primary_replica_different_dc_scope_all
to match the new topology arhitecture

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:44 +02:00
Robert Bindar
db13ece9a0 test: Refactor restoring code in test_backup to match SM pattern
This patch refactors the restoring code in cluster/test_backup.py
so it matches better the way SM works.
The patch also refactors test_restore_with_streaming_scopes so to
facilitate running restore scenarios under all supported scopes
with or w/o primary_replica_only enabled by reusing the servers
and backups for a topology. This allows us to test a lot more scenarios
without making the test impossibly slow.

split from bhalevy/load-balance-primary-replica

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:43 +02:00
Robert Bindar
ba01589f53 test: add check_mutation_replicas calls after fresh creation of dataset
to validate that mutation assertions are sane

split from bhalevy/load-balance-primary-replica

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:41 +02:00
Robert Bindar
b835d32cb0 test: extend create_dataset to accept consistency_level
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:35 +02:00
Robert Bindar
e7d44356d9 test: refactor check_mutation_replicas so it's more readable
split from bhalevy/load-balance-primary-replica

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:31 +02:00
Robert Bindar
733b4dbbb7 test: make create_dataset async and refactor so it's configurable
with num_keys and min_tablet_count

split from bhalevy/load-balance-primary-replica

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:20 +02:00
Robert Bindar
f2c8949e4a test: use defaultdict in collect_mutations
split from bhalevy/load-balance-primary-replica

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:45:03 +02:00
Robert Bindar
45faeba97d test: add log marks to facilitate reusing server for restore
split from bhalevy/load-balance-primary-replica

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:44:48 +02:00
Robert Bindar
d88036db48 locator: tablets: Distribute data evenly among primary replicas during restore
Most likely 817fdad uncovered the fact that our choice of
primary replica was resonating with tablet allocation and we were ending up
picking the same replica as primary within a scope instead of rotating
primaryship among all replicas in the scope.
This created situations where for instance, restoring into a 9 nodes cluster
with primary_replica_only=true would put all data into 3 nodes, leaving
the other 6 unused. The balancing of the dataset was performed by the
subsequent repair step.

split from bhalevy/load-balance-primary-replica

Fixes #27281

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:44:20 +02:00
Nadav Har'El
2a831ad373 Merge 'Address CodeQL Errors' from Botond Dénes
Address all errors reported by CodeQL as reported on https://github.com/scylladb/scylladb/security/quality.
This is a mixed bag, with some harmless issues, while others are severe problems which will result in the code breaking (if it is even run). I suspect some of the more severe problems were found in dead code that is not used at all -- hence nobody noticed.
Still, these issues are good to fix, so we can reduce noise in the reports and improve the maintainability of the code.

Code cleanup, no backport

Closes scylladb/scylladb#27838

* github.com:scylladb/scylladb:
  pgo/pgo.py: don't mutate input params
  test/pylib/coverage_utils.py: profdata_to_lcov: don't mutate defaulted param
  test/cluster/dtest/tools/misc.py: add type annotations to list_to_hashed_dict()
  idl-compiler.py: raise TypeError instead of raw str
  test/pylib/lcov_utils.py: don't call set when iterating over it
  configure.py: move away from .format(**locals())
  test/cluster/object_store/conftest.py: add missing call to parent constructor
  idl-compiler.py: add missing call to parent class constructor
  tools/scyllatop/fake.py: pass correct number of args to _add_metric
2026-01-13 11:43:57 +02:00
Nadav Har'El
609b283d98 test/cqlpy: add another reproducer for known issue
This patch adds a second reproducer for issue #25839, which is about
scanning a secondary index which returns partial results. The new test
uses count(*) without requesting the row themselves, but still has the
same problem of counting only part of the rows. This is the problem that
a user reported in issue #28026.

Unlike the previous test, this test works correctly on older versions
of Scylla - by using larger data, like on Cassandra - without changing
a configuration variable that did not yet exist. So with this test we
can confirm that this bug is a Scylla 5.2 regression:

  test/cqlpy/run --release 5.1 test_secondary_index.py::test_short_count

passes, while

  test/cqlpy/run --release 5.2 test_secondary_index.py::test_short_count

fails. It also fails on master, so the new test is marked "xfail".

Refs #25839
Refs #28026

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28108
2026-01-13 11:15:27 +02:00
Botond Dénes
7b562bb185 Merge 'system.clients: Address SSL refactor review comments' from Piotr Smaron and Copilot
Addresses outstanding review comments from PR #22961 where SSL field
collection was refactored into generic_server::connection base class.
This patch consists of minor cosmetic enhancements for increased
readability, mainly, with some minor fixups explained in specific
commits.

Cosmetic changes, no need to backport.

Closes scylladb/scylladb#27575

* github.com:scylladb/scylladb:
  test_ssl: fix indentation
  generic_server: improve logging broken TLS connection
  test_ssl: improve timeout and readability
  alternator/server: update SSL comment
2026-01-13 11:00:26 +02:00
Botond Dénes
354c805e6a reader_permit: remove check_abort()
This method can cause performance regressions if used in the wrong place
-- namely if it is used to abort reads by throwing the abort exception.
Exceptions should be propagated during reads without throwing them,
otherwise they cause extra CPU load, making a bad situation worse.
Remove this method, so it doesn't accidentally get more users, migrate
remaining users to get_abort_exception().
2026-01-13 10:47:57 +02:00
Botond Dénes
a0ddac655d test/boost/database_test: add test for read timeout exceptions
Read timeouts shouldn't trigger exceptions thrown, exceptions should be
solely propagated via futures, otherwise they put extra strain on the
system at the worst possible time: when it is overload already enough
that reads started to time out.
The test covers both single partition reads and full scans, with two
scenarios:
* timeout while the read is queued
* timeout when the read is already ongoing
2026-01-13 10:47:57 +02:00
Botond Dénes
6801611b94 sstables/mx/reader: don't throw exceptions on the read-path
If the read is aborted via the permit (due to timeout) don't throw the
abort exception, instead propagate it via the future chain.
Also, use try_catch<> instead of try ... catch to decorate
malformed_sstable_exception with the file name.
2026-01-13 10:47:57 +02:00
Botond Dénes
fa6ffe9d20 readers/multishard: don't throw exceptions on the read-path
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
2026-01-13 10:47:57 +02:00
Botond Dénes
1f6f7ceb68 replica/table: don't throw exceptions on the read-path
Use coroutine::as_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.

Not using coroutine::try_future(), because on the error path, the
querier has to be closed.
2026-01-13 10:47:57 +02:00
Botond Dénes
131489fe48 multishard_mutation_query: fix indentation
Left broken by previous patch.
2026-01-13 10:47:57 +02:00
Botond Dénes
7eeb7fcfba multishard_mutation_query: don't throw exceptions on the read-path
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
2026-01-13 10:47:57 +02:00
Botond Dénes
9bea842c01 service/storage_proxy: don't throw exceptions on the full-scan path
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
2026-01-13 10:47:57 +02:00
Botond Dénes
404f5b0808 cql3/query_processor: don't throw exceptions on the read-path
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
2026-01-13 10:47:57 +02:00
Botond Dénes
5cd237fec5 reader_permit: add get_abort_exception()
Will replace check_abort(). The latter throws an exception which is
something we want to avoid when a read is aborted, in particular when it
times out.
Also add a convenience get_abort_exception() method to mutation_reader.
2026-01-13 10:47:57 +02:00
Michał Hudobski
c8aa49b196 vector search, paging: add test for paging warnings
We add a test that validates that indexed queries
do not throw a warning related to vector search paging

Fixes: SCYLLADB-248

Closes scylladb/scylladb#28077
2026-01-13 10:33:36 +02:00
Nadav Har'El
34191d8fd4 alternator: fix signature checking of headers with multiple spaces
We have a test in test_compressed_response.py that reproduces a bug
where in Alternator's signature checking code, if a header had multiple
consecutive spaces its signature isn't checked correctly.

This patch fixes this and that xfailing test begins to pass.

But it turns out that the handling of multiple consecutive spaces in
headers when calculating the authentication signature is just one example
of "header canonization" that the AWS Signature V4 specification requires
us to do. There are additional types of header canonization that Alternator
must do, and this patch also adds new tests in test_authorization.py for
checking *all* the types of canonization.

Fortunately, for all other types of canonizations, we already handled
them correctly - Alternator already lowercases header names, sorts them
alphabetically and removes leading and trailing spaces before calculating
the signature. So most of the new tests added pass also without this patch,
and only one of them, test_canonization_middle_whitespace, needs this
patch to pass. As usual, all the new tests also pass on DynamoDB.

Fixes #27775

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28102
2026-01-13 10:29:13 +02:00
Andrei Chekun
3f14d45d6e test.py: fix the link for the failed_test directory
With new UI Jenkins escaping the HTML tags during rendering to prevent
XSS. This will show just link without custom name as a string that can
be copied and then pasted to navigate to the failed directory.

Closes scylladb/scylladb#28062
2026-01-13 10:22:38 +02:00
Yaniv Kaul
4e3aa53f8b test/rest_api/test_gossiper.py: fix for Variable defined multiple times
To fix the problem, we need to remove the first, redundant definition of
test_gossiper_unreachable_endpoints (lines 19-24). The second definition
(lines 25-40) should be retained as it has more substantial test logic.
No other code changes or imports are needed, as the test logic is
preserved fully in the retained definition.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27632
2026-01-13 10:17:29 +02:00
Yaniv Kaul
1722e35a7e .github/workflows/docs-validate-metrics.yml: add workflow permissions
Potential fix for code scanning alert no. 167: Workflow does not contain permissions

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27819
2026-01-13 10:16:35 +02:00
Yaniv Kaul
39c26794ec .github/workflows/docs-pr.yaml: add workflow permissions
Potential fix for code scanning alert no. 145: Workflow does not contain permissions

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27808
2026-01-13 10:15:45 +02:00
Yaniv Michael Kaul
d9cce7ccbf .github/copilot-instructions.md: add test philosophy
Try to teach CoPi a bit about how we'd like to see it implement tests, according to this repo best practices.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#28032
2026-01-13 10:05:13 +02:00
Anna Stuchlik
8141283262 doc: add the version name to the Install ScyllaDB page
Fixes https://github.com/scylladb/scylladb/issues/28021

Closes scylladb/scylladb#28022
2026-01-13 10:01:48 +02:00
Avi Kivity
66aee0fb5e alternator: add optional listeners for proxy protocol v2
Following 954f2cbd2f, which added proxy protocol v2 listeners
for CQL, we do the same for alternator. We add two optional ports
for plain and TLS-wrapped HTTP.

We test each new port, that the old ports still work, and that
mixing up a port with no proxy protocol and a connection with proxy
protocol (or the opposite) fails. The latter serves to show
that the testing strategy is valid and doesn't just pass whatever
happens. We also verify that the correct addresses (and TLS mode)
show up in system.clients.

Closes scylladb/scylladb#27889
2026-01-13 09:59:24 +02:00
Nadav Har'El
e7df03127b alternator: support "deflate" encoding in request compression
Currently Alternator supports compressed requests in the gzip format
with "Content-Encoding: gzip". We did not support any other compression
formats.

It turns out that DynamoDB also supports the "deflate" encoding.
The "deflate" format is just a small variant of gzip and also supported
by the same zlib library that we already use, so it is very easy
to add support for it as well. So this patch adds it.

Beyond compatibility with DynamoDB, another benefit of this patch is
symmetry with our response compression support (PR #27454), where
we supported both gzip and deflate compression of responses - so
we should support the same for requests.

This patch also adds tests for Content-Encoding: deflate, which pass
on DynamoDB (proving that "deflate" is indeed supported there).
On Alternator the new tests failed before this patch and pass with
this patch.

Refs #27243 (which asks to support more compression formats).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27917
2026-01-13 09:58:12 +02:00
Nadav Har'El
a1f198d453 test/cqlpy: translate Cassandra's test InsertInvalidateSizedRecordsTest
This is a translation of Cassandra's CQL unit test source file
validation/operations/InsertInvalidateSizedRecordsTest.java into our
cqlpy framework.

This is one of the tests added to Cassandra as part of the vector
search work, but actually has nothing to do with vector search -
it checks what happens when key columns of different types exceeed
their maximum size (64KB).

Unfortunately, each one of the tests added here *fail* on ScyllaDB,
providing more reproducers for two already known issues (which
already had plenty of reproducers...):

Refs  #8627 Cleanly reject updates with indexed values where value > 64k
Refs #12247 Better error reporting for oversized keys during INSERT

One of the tests also fails on Cassandra, due to CASSANDRA-19270.
It is not clear to me how this unit test actually passed on Cassandra,
I can only guess that the Python driver somehow makes the request
differently than what the Java unit tests use to make requests to
Cassandra.

One of the tests in the original Cassandra source file I did not
translate, readingEmptyStringsForDifferentTypes, because it tests
cqlsh, not pure CQL.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27944
2026-01-13 08:59:36 +02:00
Yaniv Kaul
2c26ca3651 .github/workflows/read-toolchain.yaml: hardening for code scanning alert no. 146: Workflow does not contain permissions #27807
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27807
2026-01-13 08:57:53 +02:00
tomek7667
19313d67e3 docs/cql/ddl.rst: fix formatting of deprecated initial sub-option
Closes scylladb/scylladb#26852
2026-01-13 08:55:24 +02:00
Anna Stuchlik
14cadcbc18 doc: remove references to Open Source
Fixes https://github.com/scylladb/scylladb/issues/28118

Closes scylladb/scylladb#28119
2026-01-13 08:43:26 +02:00
Botond Dénes
25db8f6a70 pgo/pgo.py: don't mutate input params
It is considered a dangerous practice with possible unintended
side-effects, affecting later calls to the same function.
Found by CodeQL "Modification of parameter with default".
2026-01-13 08:33:17 +02:00
Botond Dénes
4ec3d76b87 test/pylib/coverage_utils.py: profdata_to_lcov: don't mutate defaulted param
It is considered a dangerous practice as it creates a side-effect for
later calls to the same function.
Create a new variable instead and mutate that. Also remove the unused
update_known_ids parameter, which defaults to True and no caller changes
it. Passing False to this param also seem to have no effect. Instead of
trying to guess what the desired effect of passing False is and fixing
it, just remove this unused param.

Found by CodeQL "Modification of parameter with default".
2026-01-13 08:33:17 +02:00
Botond Dénes
157fe2b80e test/cluster/dtest/tools/misc.py: add type annotations to list_to_hashed_dict()
To hopefully shut up CodeQL "Iterable can be either a string or a sequence".
This change makes the code more readable anyway, so it is more than just
a gratuitous change to make some code-scanner happy.
2026-01-13 08:33:17 +02:00
Botond Dénes
849332c5c2 idl-compiler.py: raise TypeError instead of raw str
Unlike in C++, in Python one can only throw objects which inherit from
Exception. The message complains about wrong type so wrap it in
TypeError before passing to raise.
Found by CodeQL "Illegal raise".
2026-01-13 08:33:17 +02:00
Botond Dénes
3c6e9637a0 test/pylib/lcov_utils.py: don't call set when iterating over it
Probably a typo. Found by CodeQL "Non-callable called".
2026-01-13 08:33:17 +02:00
Botond Dénes
eb4ee5a126 configure.py: move away from .format(**locals())
Use f strings instead, they are just as convenient with the added bonus
of editors providing syntax highighting for it.

Additionally, this shuts up CodeQL complaint about "Suspicious unused
loop iteration variable" in loops where the loop variable was passed to
format indirectly via **locals().
2026-01-13 08:33:17 +02:00
Botond Dénes
d2db84714e test/cluster/object_store/conftest.py: add missing call to parent constructor
Replace manual init of parent fields.
Found by CodeQL: "Missing call to superclass `__init__` during object
initialization".

The secret_key is not initialized to server.secret_key, instead of
server.access_key. This probably fixes a (benign) bug.
2026-01-13 08:33:17 +02:00
Botond Dénes
26e978cf08 idl-compiler.py: add missing call to parent class constructor
Replace manual init of parent fields.
Found by CodeQL "Missing call to superclass `__init__` during object
initialization".
2026-01-13 08:33:17 +02:00
Botond Dénes
20d0782dc4 tools/scyllatop/fake.py: pass correct number of args to _add_metric
Discovered by CodeQL "Wrong number of arguments in call".
2026-01-13 08:33:17 +02:00
Michał Jadwiszczak
649efd198f docs/dev/service_levels: update docs to service levels on raft
Since Scylla 6.0, service levels are manged by Raft group0.
This patch updates table name used by service levels and adds a
paragraph describing service levels on raft.

Fixes scylladb/scylladb#18177

Closes scylladb/scylladb#26556
2026-01-13 06:49:18 +02:00
Botond Dénes
725e99e263 Merge 'test.py: Fix boost logs' from Andrei Chekun
Write the boost logs into stdout in HRF format and in XML to the file. The XML file will be used for parsing and providing the error information in the summary section of the fail.

Fixes: https://github.com/scylladb/scylladb/issues/28045

Framework enhancements, no need to backport.

Closes scylladb/scylladb#28107

* github.com:scylladb/scylladb:
  test.py: remove XML log from fail summary
  test.py: fix truncated boost output to stdout file
2026-01-13 06:19:05 +02:00
Anna Stuchlik
791ab4ed02 doc: clarify the information about SSTable version support
Fixes https://github.com/scylladb/scylladb/issues/27765

Closes scylladb/scylladb#27835
2026-01-13 06:17:37 +02:00
Andrei Chekun
dfa6a61721 test.py: remove XML log from fail summary
Remove XML log from fail summary.
Add text from the first error in the XML file to the fail summary
2026-01-12 14:26:58 +01:00
Andrei Chekun
d96a50481a test.py: fix truncated boost output to stdout file
Change the behavior of the catching the boost log output. With this
change boost will output it's logging to stdour with HRF format and to
the tempfile in XML format. This will help for easier debuggint when all
messages will be in the output file and still in the fail summary.
2026-01-12 14:15:23 +01:00
Botond Dénes
6bcc18e5c6 erge 'test.py: integrate python tests to be executed with pytest runner' from Andrei Chekun
This will move responsibility for running tests with pytest in the same manner as it was done with boost tests. From this commit, test.py is not responsible anymore for running python tests and relies completely on pytest.
This is another step for unification of test execution.
Convert skip_mode function to `pytest.mark` to be able to use to annotate the whole module instead of each test explicitly.

NOTE: this is a breaking change. From this commit, several directories with tests will require a path to the file to launch the test. Affected directories
test/alternator
test/broadcast_tables
test/cql
test/cqlpy
test/rest_api

Changes only in framework, so no backport.

This PR will increase the amount of the tests by 30 test, due to the fact that how test.py and pytest discover tests. test.py count a file as a test, and when skip used in suite.yaml it will exclude the tests from discovery completely.
While the pytest count test funstion as a test and uses skip_mode mark and will discover the tests, but it will skip them during execution, hence the difference
test.py output before PR:
```bash
> ./test.py --mode=release  rest_api/test_compaction_task rest_api/test_task_manager --list --no-gather-metrics

```
test.py output in this PR:
```bash
> ./test.py --mode=release  test/rest_api/test_compaction_task.py test/rest_api/test_task_manager.py --list

rest_api/test_compaction_task.py::test_global_major_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_major_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_cleanup_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_offstrategy_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_rewrite_sstables_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_reshaping_compaction_task.release.1
rest_api/test_compaction_task.py::test_resharding_compaction_task.release.1
rest_api/test_compaction_task.py::test_regular_compaction_task.release.1
rest_api/test_compaction_task.py::test_compaction_task_abort.release.1
rest_api/test_compaction_task.py::test_major_keyspace_compaction_task_async.release.1
rest_api/test_compaction_task.py::test_cleanup_keyspace_compaction_task_async.release.1
rest_api/test_compaction_task.py::test_offstrategy_keyspace_compaction_task_async.release.1
rest_api/test_compaction_task.py::test_rewrite_sstables_keyspace_compaction_task_async.release.1
rest_api/test_compaction_task.py::test_compaction_progress[major_keyspace_compaction_task_impl_run_fail].release.1
rest_api/test_compaction_task.py::test_compaction_progress[shard_major_keyspace_compaction_task_impl_run_fail].release.1
rest_api/test_compaction_task.py::test_compaction_progress[table_major_keyspace_compaction_task_impl_run_fail].release.1
rest_api/test_task_manager.py::test_task_manager_modules.release.1
rest_api/test_task_manager.py::test_task_manager_tasks.release.1
rest_api/test_task_manager.py::test_task_manager_status_running.release.1
rest_api/test_task_manager.py::test_task_manager_status_done.release.1
rest_api/test_task_manager.py::test_task_manager_status_failed.release.1
rest_api/test_task_manager.py::test_task_manager_not_abortable.release.1
rest_api/test_task_manager.py::test_task_manager_wait.release.1
rest_api/test_task_manager.py::test_task_manager_ttl.release.1
rest_api/test_task_manager.py::test_task_manager_user_ttl.release.1
rest_api/test_task_manager.py::test_task_manager_sequence_number.release.1
rest_api/test_task_manager.py::test_task_manager_recursive_status.release.1
rest_api/test_task_manager.py::test_module_not_exists.release.1
rest_api/test_task_manager.py::test_task_folding.release.1
rest_api/test_task_manager.py::test_abort_on_unregistered_task.release.1
```

Fixes: https://github.com/scylladb/scylladb/issues/27716

Closes scylladb/scylladb#26395

* github.com:scylladb/scylladb:
  test.py: fix test_vector_similarity.py
  docs: add directories excluded from test.py
  test.py: prevent file descriptors leaking
  test.py: capture print inside the test
  test.py: do not print header for collection with test.py
  test.py: remove not supported functionality
  test.py: switch of execution of several test directories by test.py runner
  test.py: integrate python tests to be executed with pytest runner
  test.py: fix test/vector_search_validator to be able to run with pytest
  test.py: prepare base class for migration
  test.py: move environment preparation to one method
  test.py: introduce new environment variable TESTPY_PREPARED_ENVIRONMENT
2026-01-12 14:17:19 +02:00
Botond Dénes
04b8f72946 Merge 'repair: Implement auto repair for tablet repair' from Asias He
repair: Implement auto repair for tablet repair

This patch implements the basic auto repair support for tablet repair.

It was decided to add no per table configuration for the initial
implementation, so two scylla yaml config options are introduced to set
the default auto repair configs for all the tablet tables.

- auto_repair_enabled_default

Set true to enable auto repair for tablet tables by default. The value
will be overridden by the per keyspace or per table configuration which
is not implemented yet.

- auto_repair_threshold_default_in_seconds

Set the default time in seconds for the auto repair threshold for tablet
tables. If the time since last repair is bigger than the configured
time, the tablet is eligible for auto repair. The value will be
overridden by the per keyspace or per table configuration which is not
implemented yet.

The following metrcis are added:

- auto_repair_needs_repair_nr

The number of tablets with auto repair enabled that needs repair

- auto_repair_enabled_nr

The number of tablets with auto repair enabled

The metrics are useful to tell if auto repair is falling behind.

In the future, more auto repair scheduling will be added, e.g.,
scheduling based on the repaired and unrepaired sstable set size,
tombstone ratio and so on, in addition to the time based scheduling.

Fixes SCYLLADB-99

New feature. No backport.

Closes scylladb/scylladb#27534

* github.com:scylladb/scylladb:
  topology_coordinator: Add metrics for tablet repair
  repair: Implement auto repair for tablet repair
2026-01-12 14:16:01 +02:00
Yaniv Kaul
f1c9eda49e Potential fix for code scanning alert no. 144: Workflow does not contain permissions
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27809
2026-01-12 12:21:35 +02:00
Marcin Maliszkiewicz
3c9f52e709 Merge 'doc: update the Web Installer instructions' from Anna Stuchlik
This PR:
- Replaces a fixed version name with the variable for the current version in the instructions for installing a non-default version with Web Installer. This will make using the installer more user-friendly.
- Removes the instruction for Open Source from the Web Installer docs.

Fixes https://github.com/scylladb/scylladb/issues/28005
Fixes https://github.com/scylladb/scylladb/issues/28079

Closes scylladb/scylladb#28046

* github.com:scylladb/scylladb:
  doc: remove the instruction for Open Source from the Web Installer docs
  doc: add the version variable to the Web Installer instructions
2026-01-12 11:10:04 +01:00
Petr Gusev
889d7782ed treewide: use coroutine::maybe_yield in coroutines
It's more efficient since coroutine::maybe_yield returns
a lightweight struct (awaitable), not the future.

Closes scylladb/scylladb#28101
2026-01-12 10:38:47 +01:00
Marcin Maliszkiewicz
09af3828ab auth: remove confusing deprecation msg from hash_with_salt
Closes scylladb/scylladb#27705
2026-01-12 10:12:54 +01:00
Asias He
7980890029 topology_coordinator: Add metrics for tablet repair
- scylla_tablet_ops_failed

Number of failed tablet {auto, user} repair

- scylla_tablet_ops_succeeded

Number of succeeded tablet {auto, user} repair

Currently auto_repair and user_repair tablet task are added. We can add
more tablet tasks later, e.g., rebuild, migration.
2026-01-12 15:26:05 +08:00
Alex
e430065c92 db: views: serialize create/drop view operations via shard 0
Create and drop view operations are currently performed on all shards, and their execution is not fully serialized. On slower processors this can lead to interleavings that leave stale entries in `system.scylla_views_build`

A problematic sequence looks like this:

* `on_create_view()` runs on shard 0 → entries for shard 0 and shard 1 are created
* `on_drop_view()` runs on shard 0 → entry for shard 0 is removed
* `on_create_view()` runs on shard 1 → entries for shard 0 and shard 1 are created again
* `on_drop_view()` runs on shard 1 → entry for shard 1 is removed, while the shard 0 entry remains

This results in a leftover row in `system.scylla_views_builds_in_progress`, causing `view_build_test.cc` to get stuck indefinitely in an eventual state and eventually be terminated by CI.

This patch fixes the issue by fully serializing all view create and drop operations through shard 0. Shard 0 becomes the single execution point and notifies other shards to perform their work in order. Requests originating.

new process:
  - view_builder::on_create_view(...) runs only on shard 0 and kicks off dispatch_create_view(...) in the background.
  - dispatch_create_view(...) (shard 0) first checks should_ignore_tablet_keyspace(...) and returns early if needed.
  - dispatch_create_view(...) calls handle_seed_view_build_progress(...) on shard 0. That:
      - writes the global “build progress” row across all shards via _sys_ks.register_view_for_building_for_all_shards(...).
  - After seeding, dispatch_create_view(...) broadcasts to all shards with container().invoke_on_all(...).
  - Each shard runs handle_create_view_local(...), which:
      - waits for pending base writes/streams, flushes the base,
      - resets the reader to the current token and adds the new view,
      - handles errors and triggers _build_step to continue processing.

  Drop view

  - view_builder::on_drop_view(...) runs only on shard 0 and kicks off dispatch_drop_view(...) in the background.
  - dispatch_drop_view(...) (shard 0) first checks should_ignore_tablet_keyspace(...) and returns early if needed.
  - It broadcasts handle_drop_view_local(...) to all shards with invoke_on_all(...).
  - Each shard runs handle_drop_view_local(...), which:
      - removes the view from local build state (_base_to_build_step and _built_views) by scanning existing steps,
      - ignores missing keyspace cases.
  - After all shards finish local cleanup, shard 0 runs handle_drop_view_global_cleanup(...), which:
      - removes global build progress, built‑view state, and view build status in system tables,

  Shutdown

  - drain() waits on _view_notification_sem before _sem so in‑flight dispatches finish before bookkeeping is halted.

In addition, the test is adjusted to remove the long eventual wait (596.52s / 30 iterations) and instead rely on the default wait of 17 iterations (~4.37 minutes), eliminating unnecessary delays while preserving correctness.

Fixes: https://github.com/scylladb/scylladb/issues/27898
Backport: not required as the problem happens on master

Closes scylladb/scylladb#27929
2026-01-12 09:23:22 +02:00
Michał Hudobski
92c988514c vector_search: allow all where clauses in vector search queries
To prepare for implementation of filtering we skip validation
of where clauses in vector search queries. All queries that would
be blocked by the lack of ALLOW FILTERING now will pass through.

Fixes: VECTOR-410

Closes scylladb/scylladb#27758
2026-01-11 12:56:44 +02:00
Marcin Maliszkiewicz
03e0dd0841 Merge 'test/alternator: fix most tests to run on DynamoDB' from Nadav Har'El
We can run Alternator's tests against DynamoDB with `test/alternator/run --aws`, and our intention is that all except a few specially marked should pass on DynamoDB - indicating that the test itself is correct and checks compatibility with DynamoDB and not with some misunderstood spec.

Before this patch series, almost two dozen Alternator's tests failed on DynamoDB. This series fixes most of them.

Refs #26079 (it fixes almost all the problems but probably not all of them so let's keep the issue open for a while longer)

Closes scylladb/scylladb#27995

* github.com:scylladb/scylladb:
  test/alternator: fix some expected error messages to fit DynamoDB
  test/alternator: fix compressed request test on non-us-east1
  test/alternator: fix test's expected error message on DynamoDB
  test/alternator: mark Alternator-only test scylla_only
  test/alternator: fix test on DynamoDB
  test/alternator: increase wait_for_gsi() timeout
  test/alternator: fix test passing a spurious parameter
2026-01-09 18:05:20 +01:00
Botond Dénes
7e1c8776b7 docs: remove sstabledump and sstablemetadata
These tools are deprecated and no longer shipped by ScyllaDB packages.
They no longer support the latest SSTable versions and ScyllaDB-only
features, like encryption and dictionary based compression.

Remove them from the documentation.

Closes scylladb/scylladb#27608
2026-01-09 17:31:54 +01:00
Dawid Mędrek
2385afa1c7 scripts/pull_github_pr.sh: Update instructions for creating token
The interface of Jenkins has changed, and the instructions for creating
a token are out-of-date. This commit updates them.

Closes scylladb/scylladb#28054
2026-01-09 17:45:00 +02:00
Ferenc Szili
0ede8d154b docs: add docs for size based load balancing
This patch updates the documentation for size based load balancing.

Closes scylladb/scylladb#27616
2026-01-09 16:25:25 +02:00
Andrei Chekun
1f60208aa0 test.py: fix test_vector_similarity.py
There is a known limitation of the xdist.
Since it makes discovery in each thread, then compare it with master thread. The discovered lists of test should be the same. Sets are not order guaranteed, so they should not be used for parametrized testing, because discovery of the tests with using xdist will fail.
This PR just converts set to dist, to eliminate issue mentioned above.
2026-01-09 15:08:40 +01:00
Yaniv Michael Kaul
af8eaa9ea5 scripts: fixes flagged by CodeQL/PyLens
Unused imports, unused variables and such.
Initially, there were no functional changes, just to get rid of some standard CodeQL warnings.

I've then broken the CI, as apparently there's a install time(!?) Python script creation for the sole purpose of product
naming. I changed it - we have it in etcdir, as SCYLLA-PRODUCT-FILE.
So added (copied from a different script) a get_product() helper function in scylla_util.py and used it instead.

While at it, also fixed the too broad import from scylla_util, which 'forced' me to also fix other specific imports (such as shutil).

Improvement - no need to backport.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#27883
2026-01-09 15:13:12 +02:00
Anna Stuchlik
396093ff60 doc: remove the instruction for Open Source from the Web Installer docs
Fixes https://github.com/scylladb/scylladb/issues/28079
2026-01-09 14:07:32 +01:00
Botond Dénes
af6cb0d0a4 Merge 'raft topology: preserve IP -> ID mapping of a replacing node on restart' from Patryk Jędrzejczak
We currently do it only for a bootstrapping node, which is a bug. The
missing IP can cause an internal error, for example, in the following
scenario:
- replace fails during streaming,
- all live nodes are shut down before the rollback of replace completes,
- all live nodes are restarted,
- live nodes start hitting internal error in all operations that
  require IP of the replacing node (like client requests or REST API
  requests coming from nodetool).

We fix the bug here, but we do it separately for replace with different
IP and replace with the same IP.

For replace with different IP, we persist the IP -> host ID mapping
in `system.peers` just like for bootstrap. That's necessary, since there
is no other way to determine IP of the replacing node on restart.

For replace with the same IP, we can't do the same. This would require
deleting the row corresponding to the node being replaced from
`system.peers`. That's fine in theory, as that node is permanently
banned, so its IP shouldn't be needed. Unfortunately, we have many
places in the code where we assume that IP of a topology member is always
present in the address map or that a topology member is always present in
the gossiper endpoint set. Examples of such places:
- nodetool operations,
- REST API endpoints,
- `db::hints::manager::store_hint`,
- `group0_voter_handler::update_nodes`.

We could fix all those places and verify that drivers work properly when
they see a node in the token metadata, but not in `system.peers`.
However, that would be too risky to backport.

We take a different approach. We recover IP of the replacing node on
restart based on the state of the topology state machine and
`system.peers` just after loading `system.peers`.

We rely on the fact that group 0 is set up at this point. The only case
where this assumption is incorrect is a restart in the Raft-based
recovery procedure. However, hitting this problem then seems improbable,
and even if it happens, we can restart the node again after ensuring
that no client and REST API requests come before replace is rolled back
on the new topology coordinator. Hence, it's not worth to complicate the
fix (by e.g. looking at the persistent topology state instead of the
in-memory state machine).

Fixes #28057

Backport this PR to all branches as it fixes a problematic bug.

Closes scylladb/scylladb#27435

* github.com:scylladb/scylladb:
  gossiper: add_saved_endpoint: make generations of excluded nodes negative
  test: introduce test_full_shutdown_during_replace
  utils: error_injection: allow aborting wait_for_message
  raft topology: preserve IP -> ID mapping of a replacing node on restart
2026-01-09 14:56:16 +02:00
Calle Wilund
a7cdb602e1 db::commitlog: Fix sanity check error on race between segment flushing and oversized alloc
Fixes #27992

When doing a commit log oversized allocation, we lock out all other writers by grabbing
the _request_controller semaphore fully (max capacity).
We thereafter assert that the semaphore is in fact zero. However, due to how things work
with the bookkeep here, the semaphore can in fact become negative (some paths will not
actually wait for the semaphore, because this could deadlock).

Thus, if, after we grab the semaphore and execution actually returns to us (task schedule),
new_buffer via segment::allocate is called (due to a non-fully-full segment), we might
in fact grab the segment overhead from zero, resulting in a negative semaphore.

The same problem applies later when we try to sanity check the return of our permits.

Fix is trivial, just accept less-than-zero values, and take same possible ltz-value
into account in exit check (returning units)

Added whitebox (special callback interface for sync) unit test that provokes/creates
the race condition explicitly (and reliably).

Closes scylladb/scylladb#27998
2026-01-09 14:06:58 +02:00
Łukasz Paszkowski
7bf26ece4d test_user_writes_rejection: Fix test flakiness caused by typo and non-local CL=ONE reads
The current code:
```
try:
   cql.execute(f"INSERT INTO {cf} (pk, t) VALUES (-1, 'x')", host=host[0], execution_profile=cl_one_profile).result()
except Exception:
   pass
```

contains a typo: `host=host[0]` which throws an exception becase Host
object is not subscriptable. The test does not fail because the except
block is too broad and suppresses all exceptions.

Fixing the typo alone is insufficient. The write still succeeds because
the remaining nodes are UP and the query uses CL=ONE, so no failure
should be expected.

Another source of flakiness is data verification:
```
SELECT * FROM {cf} WHERE pk = 0;
```

Even when a coordinator is explicitly provided, using CL=ONE does not
guarantee a local read. The coordinator may forward the read request to
another replica, causing the verification to fail nondeterministically.

This patch rewrites the tests to address these issues:
- Fix the typo: `host[0]` to `hosts[0]`
- Verify data using `MUTATION_FRAGMENTS({cf})` which guarantees a local
  read on the coordinator node
- Reconnect the driver after node restart

Fixes https://github.com/scylladb/scylladb/issues/27933

Closes scylladb/scylladb#27934
2026-01-09 13:42:05 +02:00
Andrei Chekun
82e81a8664 docs: add directories excluded from test.py
Add new directories that are excluded from the test.py executor and will
be fully managed by pytest
2026-01-09 11:59:25 +01:00
Andrei Chekun
353bae7d66 test.py: prevent file descriptors leaking
With migration to the pytest, file descriptors will be hanged during the whole life of the process. Previously it was not an issue, because test.py was executing only one file with Popen, so descriptors will be freed with process done. With new approach they are blocked. This will allow to eliminate this.
Fix issue when we had issue with getting cluster and then trying to set it dirty while it None.
Put cluster to the pool only if it was created
2026-01-09 11:59:25 +01:00
Andrei Chekun
67c5267053 test.py: capture print inside the test
Capture the printing inside the test case to output it after the test and not directly during the testing process.
2026-01-09 11:59:25 +01:00
Andrei Chekun
594aedd6a5 test.py: do not print header for collection with test.py
Skip printing the default pytest headers when printing list of the tests.
Before:
```
$ ./test.py --mode=dev test/boost/sstable_conforms_to_mutation_source_test.cc --list

Test session starts (platform: linux, Python 3.13.9, pytest 8.3.4, pytest-sugar 1.1.1)
rootdir: /home/xtrey/projects/scylladb/test
configfile: pytest.ini
plugins: xdist-3.8.0, allure-pytest-2.15.0, sugar-1.1.1, anyio-4.8.0, asyncio-0.24.0, timeout-2.3.1
asyncio: mode=Mode.AUTO, default_loop_scope=session
timeout: 24000.0s
timeout method: signal
timeout func_only: False
session timeout: 24000.0s
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_tiny.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_medium.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_large.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_tiny.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_medium.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_large.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_tiny.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_medium.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_large.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_reversing_reader_random_schema.dev.1

Results (0.06s):
```
After:
```
$ ./test.py --mode=dev test/boost/sstable_conforms_to_mutation_source_test.cc --list

boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_tiny.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_medium.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_large.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_tiny.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_medium.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_large.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_tiny.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_medium.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_large.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_reversing_reader_random_schema.dev.1

Results (0.06s):
```
2026-01-09 11:59:25 +01:00
Andrei Chekun
21a1ff3d5c test.py: remove not supported functionality
In the current state pytest do not support the order of execution, so this parameter is removed. There is no big need in this due to the differences what pytest and test.py counted test. pytest run test functions in the threads, while test.py executed test files in the threads. That's why pytest's way is more granular and allows to fill threads better.
Remove skip node, since it already added as a pytest mark for each test in the file.
Remove pool_size, since this is not used by pytest at all. Pytest uses
xdist to set the amount of threads instead of pool_size used by test.py
2026-01-09 11:59:25 +01:00
Andrei Chekun
e8c50a5ad4 test.py: switch of execution of several test directories by test.py runner
With this commit test.py will lose ability to run tests by itself always bypassing execution to the pytest.

NOTE: this is a breaking change. From this commit, several directories
with tests will require a path to the file to launch the test.
Affected directories
test/alternator
test/broadcast_tables
test/cql
test/cqlpy
test/rest_api
2026-01-09 11:59:25 +01:00
Andrei Chekun
61d49525ad test.py: integrate python tests to be executed with pytest runner
With this commit test.py will be bypassing the tests execution to the pytest. However, it will still be able to run test by itself.
With providing test name like `broadcast_tables/test_broadcast_tables` it will execute test with test.py runner, but if the path to the file will be provided like `test/broadcast_tables/test_broadcast_tables.py` it will bypass execution to the pytest.
`--test-py-init` tells to run pytest session in test.py-compatible mode
Update the help text for the name parameter for test.py about changes
how it works and which directory is served by pytest
2026-01-09 11:59:25 +01:00
Andrei Chekun
808b29885f test.py: fix test/vector_search_validator to be able to run with pytest
build_mode fixture have dynamic scope. It depends how the pytest is
executed. When it executed through test.py scope will be session and
since it's broader that package everything work fine. While with pure
pytest it will fail because build_mode will have module scope.
This fix allows to run tests with pure pytest, this needed for migration
test to be executed by pytest runner instead test.py.
2026-01-09 11:59:25 +01:00
Andrei Chekun
8252de7b55 test.py: prepare base class for migration
Since all tests share the same base class and some of the tests executed by test.py and some with pytest, we need to handle two cases where configuration is located: suite.yaml and test_config.yaml
After full migration suite.yaml case will be removed
2026-01-09 11:59:25 +01:00
Andrei Chekun
48ff74b6b2 test.py: move environment preparation to one method
Since anyway these two methods should be called one by one in two different cases: when test.py executes test and pytest executes test, merging them into one. Additionally, set environment variable to show the underneath pytest process that environment was already prepared and there is no need to clean directories or start additional services.
2026-01-09 11:59:25 +01:00
Andrei Chekun
e074e21490 test.py: introduce new environment variable TESTPY_PREPARED_ENVIRONMENT
Introduce the new environment variable that will be used to signalize to the pytest runner that environment war already prepared by test.py. This needed to be able to run the test with pytest and test.py(that actually will run pytest underneath).
2026-01-09 11:59:25 +01:00
copilot-swe-agent[bot]
8c48b82b84 test_ssl: fix indentation 2026-01-09 10:27:17 +01:00
Piotr Smaron
2bcbebe92d generic_server: improve logging broken TLS connection
Preiously we were logging a broken TLS connection and then this has been
logged later again, so now instead of logging we're constructing an
exception with a message extened with TLS info, which later will be
catched with its full message still logged.
2026-01-09 10:24:55 +01:00
Piotr Smaron
7016fc4835 test_ssl: improve timeout and readability
1. With this change the test really waits 10s, previously (in case
   something went wrong), the timeout could take way more than that.
2. Added `else` to above `if` to increase clarity of execution flow -
   it doesn't change logic, but makes it more clear.
2026-01-09 10:22:19 +01:00
Asias He
7ba7b25bdd repair: Implement auto repair for tablet repair
This patch implements the basic auto repair support for tablet repair.

It was decided to add no per table configuration for the initial
implementation, so two scylla yaml config options are introduced to set
the default auto repair configs for all the tablet tables.

- auto_repair_enabled_default

Set true to enable auto repair for tablet tables by default. The value
will be overridden by the per keyspace or per table configuration which
is not implemented yet.

- auto_repair_threshold_default_in_seconds

Set the default time in seconds for the auto repair threshold for tablet
tables. If the time since last repair is bigger than the configured
time, the tablet is eligible for auto repair. The value will be
overridden by the per keyspace or per table configuration which is not
implemented yet.

The following metrcis are added:

- auto_repair_needs_repair_nr

The number of tablets with auto repair enabled that needs repair

- auto_repair_enabled_nr

The number of tablets with auto repair enabled

The metrics are useful to tell if auto repair is falling behind.

In the future, more auto repair scheduling will be added, e.g.,
scheduling based on the repaired and unrepaired sstable set size,
tombstone ratio and so on, in addition to the time based scheduling.

Fixes SCYLLADB-99
2026-01-09 16:11:39 +08:00
Botond Dénes
60570d7114 Merge 'topology coordinator: restrict node join/remove to preserve RF-rack validity' from Michael Litvak
Allow creating materialized views and secondary indexes in a tablets keyspace only if it's RF-rack-valid, and enforce RF-rack-validity while the keyspace has views by restricting some operations:
* Altering a keyspace's RF if it would make the keyspace RF-rack-invalid
* Adding a node in a new rack
* Removing / Decommissioning the last node in a rack

Previously the config option `rf_rack_valid_keyspaces` was required for creating views. We now remove this restriction - it's not needed because we always maintain RF-rack-validity for keyspaces with views.

The restrictions are relevant only for keyspaces with numerical RF. Keyspace with rack-list-based RF are always RF-rack-valid.

Fixes scylladb/scylladb#23345
Fixes https://github.com/scylladb/scylladb/issues/26820

backport to relevant versions for materialized views with tablets since it depends on rf-rack validity

Closes scylladb/scylladb#26354

* github.com:scylladb/scylladb:
  docs: update RF-rack restrictions
  cql3: don't apply RF-rack restrictions on vector indexes
  cql3: add warning when creating mv/index with tablets about rf-rack
  service/tablet_allocator: always allow tablet merge of tables with views
  locator: extend rf-rack validation for rack lists
  test: test rf-rack validity when creating keyspace during node ops
  locator: fix rf-rack validation during node join/remove
  test: test topology restrictions for views with tablets
  test: add test_topology_ops_with_rf_rack_valid
  topology coordinator: restrict node join/remove to preserve RF-rack validity
  topology coordinator: add validation to node remove
  locator: extend rf-rack validation functions
  view: change validate_view_keyspace to allow MVs if RF=Racks
  db: enforce rf-rack-validity for keyspaces with views
  replica/db: add enforce_rf_rack_validity_for_keyspace helper
  db: remove enforce parameter from check_rf_rack_validity
  test: adjust test to not break rf-rack validity
2026-01-09 10:01:23 +02:00
Patryk Jędrzejczak
eee2b6c7af Merge 'tablets: Make balancing disabling RPC preempt tablet transitions' from Tomasz Grabiec
Disabling of balancing waits for topology state machine to become idle, to guarantee that no migrations are happening or will happen after the call returns. But it doesn't interrupt the scheduler, which means the call can take arbitrary amount of time. It may wait for tablet repair to be finished, which can take many hours.

We should do it via topology request, which will interrupt the tablet scheduler.

Enabling of balancing can be immediate.

Fixes https://github.com/scylladb/scylladb/issues/27647
Fixes #27210

Closes scylladb/scylladb#27736

* https://github.com/scylladb/scylladb:
  test: Verify that repair doesn't block disabling of tablet load balancing
  tablets: Make balancing disabling call preempt tablet transitions
2026-01-08 21:55:19 +02:00
Piotr Dulikowski
8e3e39a64a Merge 'service/storage_service: update service levels cache after upgrade to v2' from Michał Jadwiszczak
Service levels cache is empty after upgrade to consistent topology
if no mutations are commited to `system.service_levels_v2` or rolling
restart is not done.

To fix the bug, this patch adds service levels cache reloading after
upgrading the SL data accessor to v2 in `storage_service::topology_state_load()`.

Fixes [SCYLLADB-90](https://scylladb.atlassian.net/browse/SCYLLADB-90)

This fix should be backported to all versions containing service levels on Raft.

[SCYLLADB-90]: https://scylladb.atlassian.net/browse/SCYLLADB-90?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#27585

* github.com:scylladb/scylladb:
  service/storage_service: update service levels cache after upgrade to v2
  service/storage_service: check if service levels were already upgraded before doing migration to raft
2026-01-08 21:55:19 +02:00
Michał Hudobski
e2e479f20d auth: fix cdc vector search indexing permission bug
VECTOR_SEARCH_INDEXING permission didn't work on cdc tables as we mistakenly checked for vector indexes on the cdc table insted of the base.
This patch fixes that and adds a test that validates this behavior.

Fixes: VECTOR-476

Closes scylladb/scylladb#28050
2026-01-08 21:55:19 +02:00
Ernest Zaslavsky
19fe630c0e Update seastar submodule
seastar 4dcd4df..dd46b6fe
```
dd46b6fe net: expose DNS TTL via net::hostent
b94f81b0 test: Extend statat() test to check ENOENT exception reporting
```

Closes scylladb/scylladb#28006
2026-01-08 21:55:19 +02:00
Michael Litvak
8f15c7a874 db/view/view_update_generator: move discover_staging_sstables to start
Call discover_staging_sstables in view_update_generator::start() instead
of in the constructor, because the constructor is called during
initialization before sstables are loaded.

The initialization order was changed in 5d1f74b86a and caused this
regression. It means the view update generator won't discover staging
sstables on startup and view updates won't be generated for them. It
also causes issues in sstable cleanup.

view_update_generator::start() is called in a later stage of the
initialization, after sstable loading, so do the discovery of staging
sstables there.

Fixes scylladb/scylladb#27956

Closes scylladb/scylladb#27970
2026-01-08 21:55:19 +02:00
Botond Dénes
8c72dcc1ec Merge 'database: truncate_table_on_all_shards: consider can_flush on all shards' from Benny Halevy
Currently, database::truncate_table_on_all_shards calls the table::can_flush only on the coordinator shard
and therefore it may miss shards with dirty data if the coordinator shard happens to have empty memtables, leading to clearing the memtables with dirty data rather than flushing them.

This change fixes that by making flush safe to be called, even if the memtable list is empty, and calling it on every shard that can flush (i.e. seal_immediate_fn is engaged).

Also, change database_test::do_with_some_data is use random keys instead of hard-coded key names, to reproduce this issue with `snapshot_list_contains_dropped_tables`.

Fixes #27639

* The issue exists since forever and might cause data loss due to wrongly clearing the memtable, so it needs backport to all live versions

Closes scylladb/scylladb#27643

* github.com:scylladb/scylladb:
  test: database_test: do_with_some_data: randomize keys
  database: truncate_table_on_all_shards: drop outdated TODO comment
  database: truncate_table_on_all_shards: consider can_flush on all shards
  memtable_list: unify can_flush and may_flush
  test: database_test: add test_flush_empty_table_waits_on_outstanding_flush
  replica: table, storage_group, compaction_group: add needs_flush
  test: database_test: do_with_some_data_in_thread: accept void callback function
2026-01-08 21:55:19 +02:00
Avi Kivity
633e6e0037 build: update toolchain generation procedure for optimized clang
Explain where to pick up existing clang archives, and how to upload new ones.

Closes scylladb/scylladb#27690
2026-01-08 21:55:18 +02:00
Evgeniy Naydanov
a9da14be19 test: dtest: reproducer for parallel rebuild failure
2-DC cluster parallel non-RBNO rebuild failure when expanding RF in DC2.

Steps to reproduce:
    1. Provision a cluster with 2 datacenters and at least 2 nodes in the second datacenter.
    2. Let’s assume datacenter names are "dc1" and "dc2".
    3. Create a keyspace ("keyspace1") with RF=0 in dc2.
    4. Populate some data into dc1.
    5. Change keyspace1 replication in dc2 to 2.
    6. On 2 nodes in dc2 run the following command in parallel:
         nodetool rebuild --source-dc dc1

Parallel execution of rebuilds is not possible with RBNO enabled.

This test is the repro for #27804

Closes scylladb/scylladb#27747
2026-01-08 21:55:18 +02:00
Botond Dénes
9b4a7f1d14 Merge 'test: cluster: object_store: test_backup: modernize do_abort_restore' from Benny Halevy
Currently the function uses a regular expression
to check the system log for a specific message.
This is tangential to the ability to cleanly abort the restore task, plus the regular expression has a syntax error:
```
test/cluster/object_store/test_backup.py:534
  /home/bhalevy/dev/scylla/test/cluster/object_store/test_backup.py:534: SyntaxWarning: "\(" is an invalid escape sequence. Such sequences will not work in the future. Did you mean "\\("? A raw string is also an option.
    await wait_for_first_completed([l.wait_for("Failed to handle STREAM_MUTATION_FRAGMENTS \(receive and distribute phase\) for .+: Streaming aborted", timeout=10) for l in logs])
```

Thsi change modernizes the implementation by:
- using auto_dc_rack for manager.servers_add
- using new_test_keyspace to generate and auto delete the keyspace
- using async gatherio and a prepared statement to insert the data
- simplifing the keys and values by NOT using os.urandom (that is notoriously slow)
- inserting fewer keys in debug mode
- removing the log check

With that, the test can be reenabled in all modes.

* No backport needed since the test was disabled

Closes scylladb/scylladb#27892

* github.com:scylladb/scylladb:
  test_backup: do_abort_restore: reduce data footprint
  test_backup: do_abort_restore: use error injection
  test_backup: do_abort_restore: use asyncio for cql
  test_backup: do_abort_restore: use new_test_keyspace
  test_backup: do_abort_restore: use logger rather than print
  test_backup: do_abort_restore: pass auto_rack_dc to servers_add
2026-01-08 21:55:18 +02:00
Anna Stuchlik
f614482e66 doc: add the patch release upgrade procedure for version 2025.4
Adds the patch upgrade guide based on previous upgrade guides.

Fixes https://github.com/scylladb/scylladb/issues/27982

Closes scylladb/scylladb#27985
2026-01-08 21:55:18 +02:00
Asias He
4f77dd058d repair: Add tablet repair progress report support
This patch adds tablet repair progress report support so that the user
could use the /task_manager/task_status API to query the progress.

In order to support this, a new system table is introduced to record the
user request related info, i.e, start of the request and end of the
request.

The progress is accurate when tablet split or merge happens in the
middle of the request, since the tokens of the tablet are recorded when
the request is started and when repair of each tablet is finished. The
original tablet repair is considered as finished when the finished
ranges cover the original tablet token ranges.

After this patch, the /task_manager/task_status API will report correct
progress_total and progress_completed.

Fixes #22564
Fixes #26896

Closes scylladb/scylladb#27679
2026-01-08 21:55:18 +02:00
Anna Stuchlik
3f1c7c70f5 doc: remove the link to the Download Center
... from the OS support page.

Fixes https://github.com/scylladb/scylladb/issues/28047

Closes scylladb/scylladb#28048
2026-01-08 21:55:18 +02:00
Asias He
0aabf51380 repair: Fix sstable_list_to_mark_as_repaired with multishard writer
It was obseved:

```
test_repair_disjoint_row_2nodes_diff_shard_count was spuriously failing due to
segfault.

backtrace pointed to a failure when allocating an object from the chain of
freed objects, which indicates memory corruption.

(gdb) bt
    at ./seastar/include/seastar/core/shared_ptr.hh:275
    at ./seastar/include/seastar/core/shared_ptr.hh:430
Usual suspect is use-after-free, so ran the reproducer in the sanitize mode,
which indicated shared ptr was being copied into another cpu through the
multi shard writer:

seastar - shared_ptr accessed on non-owner cpu, at: ...
--------
seastar::smp_message_queue::async_work_item<mutation_writer::multishard_writer::make_shard_writer...

```

The multishard writer itself was fine, the problem was in the streaming consumer
for repair copying a shared ptr. It could work fine with same smp setting, since
there will be only 1 shard in the consumer path, from rpc handler all the way
to the consumer. But with mixed smp setting, the ptr would be copied into the
cpus involved, and since the shared ptr is not cpu safe, the refcount change
can go wrong, causing double free, use-after-free.

To fix, we pass a generic incremental repair handler to the streaming
consumer. The handler is safe to be copied to different shards. It will
be a no op if incremental repair is not enabled or on a different shard.

A reproducer test is added. The test could reproduce the crash
consistently before the fix and work well after the fix.

Fixes #27666

Closes scylladb/scylladb#27870
2026-01-08 21:55:18 +02:00
Radosław Cybulski
5f48ab3875 storage_proxy: fix invalid assert
Change invalid `assert(true)` into `SCYLLA_ASSERT(false)`, as
the latter was clearly meant.

Closes scylladb/scylladb#27900
2026-01-08 21:55:18 +02:00
Andrei Chekun
c950c2e582 test.py: convert skip_mode function to pytest.mark
Function skip_mode works only on function and only in cluster test. This if OK
when we need to skip one test, but it's not possible to use it with pytestmark
to automatically mark all tests in the file. The goal of this PR is to migrate

skip_mode to be dynamic pytest.mark that can be used as ordinary mark.

Closes scylladb/scylladb#27853

[avi: apply to test/cluster/test_tablets.py::test_table_creation_wakes_up_balancer]
2026-01-08 21:55:16 +02:00
Tomasz Grabiec
a52de4ecdc test: cluster: test_topology_ops[_encrypted]: Fix failures due to background migrations fencing out writes
The test if flaky, with failures in:

        for server in servers:
>           await check_node_log_for_failed_mutations(manager, server)

test/cluster/test_topology_ops_encrypted.py:84:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

manager = <test.pylib.manager_client.ManagerClient object at 0xffff602e8590>
server = ServerInfo(server_id=1769, ip_addr='127.82.127.43', rpc_address='127.82.127.43', datacenter='DEFAULT_DC', rack='DEFAULT_RACK', pid=186578)

    async def check_node_log_for_failed_mutations(manager: ManagerClient, server: ServerInfo):
        logging.info(f"Checking that node {server} had no failed mutations")
        log = await manager.server_open_log(server.server_id)
        occurrences = await log.grep(expr="Failed to apply mutation from", filter_expr="(TRACE|DEBUG|INFO)")
>       assert len(occurrences) == 0
E       AssertionError

test/cluster/util.py:319: AssertionError

As diagnosed by Gleb in https://github.com/scylladb/scylladb/issues/27942#issuecomment-3710013625:

"The fencing errors here look legit given that we do not wait for all
requests to complete while shutting down the storage proxy. The
scenario is this:

Test does writes to rf=3 keyspace with cl=one. One node is shutting
down while there is a tablet migration. Tablet migration executes
barrier and drain which fails on a node that is been shutdown. The
topology coordinator proceeds fencing the old topology, but there
still can be un-handled mutation requests from the shutting down node
on other nodes and they will generate fencing errors like they should.

They way to avoid it (though it is benign) is to wait for all outgoing
storage proxy requests to complete during shutdown, but even then the
error may still happen since a request may timeout before it is
processed by the other side, so it may be completed by a storage proxy
coordinator side, but still not handled by replica side. This what we
have fencing for in the first place."

Fix by diabling background tablet migrations, so that we have no
topology barriers concurrent with node shutdown.

Fixes #27942

Closes scylladb/scylladb#28034
2026-01-08 21:53:47 +02:00
Tomasz Grabiec
34df158605 test: cluster: Fix NoHostAvailable error in test_not_enough_token_owners
The driver must see server_c before we stop server_a, otherwise
there will be no live host in the pool when we attempt to drop
the keyspace:

```
   @pytest.mark.asyncio
    async def test_not_enough_token_owners(manager: ManagerClient):
        """
        Test that:
        - the first node in the cluster cannot be a zero-token node
        - removenode and decommission of the only token owner fail in the presence of zero-token nodes
        - removenode and decommission of a token owner fail in the presence of zero-token nodes if the number of token
          owners would fall below the RF of some keyspace using tablets
        """
        logging.info('Trying to add a zero-token server as the first server in the cluster')
        await manager.server_add(config={'join_ring': False},
                                 property_file={"dc": "dc1", "rack": "rz"},
                                 expected_error='Cannot start the first node in the cluster as zero-token')

        logging.info('Adding the first server')
        server_a = await manager.server_add(property_file={"dc": "dc1", "rack": "r1"})

        logging.info('Adding two zero-token servers')
        # The second server is needed only to preserve the Raft majority.
        server_b = (await manager.servers_add(2, config={'join_ring': False}, property_file={"dc": "dc1", "rack": "rz"}))[0]

        logging.info(f'Trying to decommission the only token owner {server_a}')
        await manager.decommission_node(server_a.server_id,
                                        expected_error='Cannot decommission the last token-owning node in the cluster')

        logging.info(f'Stopping {server_a}')
        await manager.server_stop_gracefully(server_a.server_id)

        logging.info(f'Trying to remove the only token owner {server_a} by {server_b}')
        await manager.remove_node(server_b.server_id, server_a.server_id,
                                  expected_error='cannot be removed because it is the last token-owning node in the cluster')

        logging.info(f'Starting {server_a}')
        await manager.server_start(server_a.server_id)

        logging.info('Adding a normal server')
        await manager.server_add(property_file={"dc": "dc1", "rack": "r2"})

        cql = manager.get_cql()

        await wait_for_cql_and_get_hosts(cql, [server_a], time.time() + 60)

>       async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }") as ks_name:

test/cluster/test_not_enough_token_owners.py:57:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/lib64/python3.14/contextlib.py:221: in __aexit__
    await anext(self.gen)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

manager = <test.pylib.manager_client.ManagerClient object at 0x7f37efe00830>
opts = "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }"
host = None

    @asynccontextmanager
    async def new_test_keyspace(manager: ManagerClient, opts, host=None):
        """
        A utility function for creating a new temporary keyspace with given
        options. It can be used in a "async with", as:
            async with new_test_keyspace(ManagerClient, '...') as keyspace:
        """
        keyspace = await create_new_test_keyspace(manager.get_cql(), opts, host)
        try:
            yield keyspace
        except:
            logger.info(f"Error happened while using keyspace '{keyspace}', the keyspace is left in place for investigation")
            raise
        else:
>           await manager.get_cql().run_async("DROP KEYSPACE " + keyspace, host=host)
E           cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.69.108.39:9042 dc1>: ConnectionException('Pool for 127.69.108.39:9042 is shutdown')})

test/cluster/util.py:544: NoHostAvailable
```

Fixes #28011

Closes scylladb/scylladb#28040
2026-01-08 21:53:47 +02:00
Andrei Chekun
ee0bf35615 test.py: add custome exit code for pytest in case maxfail reached
This PR adds custom exit code in case when maxfail reached. This is
needed for easier detection why pytest failed in CI.

Closes scylladb/scylladb#28018
2026-01-08 21:53:47 +02:00
Anna Stuchlik
1b653166f1 doc: add the version variable to the Web Installer instructions
This commit replaces a fixed version name with the variable for the current version
in the instructions for installing a non-default version with Web Installer.
This will make using the installer more user-friendly.

Fixes https://github.com/scylladb/scylladb/issues/28005
2026-01-08 10:12:21 +01:00
Benny Halevy
ebd667a8e0 test: database_test: do_with_some_data: randomize keys
With randomized keys, and since we're inserting only 2 keys,
it is possible that they would end up owned only by a single shard,
reproducing #27639 in snapshot_list_contains_dropped_tables.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-08 09:49:46 +02:00
Benny Halevy
93b827c185 database: truncate_table_on_all_shards: drop outdated TODO comment
The comment was added in 83323e155e
Since then, table::seal_active_memtable was improved to guarantee
waiting on oustanding flushes on success (See d55a2ac762), so
we can remove this TODO comment (it also not covered by any issue
so nobody is planned to ever work on it).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-08 09:49:46 +02:00
Benny Halevy
2a803d2261 database: truncate_table_on_all_shards: consider can_flush on all shards
can_flush might return a different value for each shard
so check it right before deciding whether to flush or clear a memtable
shard.

Note that under normal condition can_flush would always return true
now that it checks only the presence of the seal memtable function
rather than check memtable_list::empty().

Fixes #27639

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-08 09:49:46 +02:00
Benny Halevy
02ee341a03 memtable_list: unify can_flush and may_flush
Now that we have a unit test proving that it's safe to flush an
empty memtable list there is no need to distinguish between
may_flush and can_flush.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-08 09:49:46 +02:00
Benny Halevy
0342a24ee0 test: database_test: add test_flush_empty_table_waits_on_outstanding_flush
Test that table::flush waits on outstanding flushes, even if the active memtable is empty

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-08 09:49:45 +02:00
Benny Halevy
5be6b80936 replica: table, storage_group, compaction_group: add needs_flush
Table needs flush if not all its memtable lists are empty.
To be used in the next patch for a unit test.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-08 09:41:22 +02:00
Benny Halevy
ec4069246d test: database_test: do_with_some_data_in_thread: accept void callback function
Many test cases already assume `func` is being called a seastar
thread and although the function they pass returns a (ready) future,
it serves no purpose other than to conform to the interface.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-08 09:41:22 +02:00
Patryk Jędrzejczak
946a2bb988 storage_service: do not call raft_topology_update_ip for left nodes
This `raft_topology_update_ip` call always returns after `t.find(raft_id)`
returns `nullptr`, so it effectively does nothing. It's not a bug, since
there is no reason to update `system.peers` for left nodes anyway. We
delete the rows corresponding to left nodes in `process_left_node` (called
just above).

Closes scylladb/scylladb#27899
2026-01-07 16:52:13 +01:00
Michał Jadwiszczak
be16e42cb0 service/storage_service: update service levels cache after upgrade to v2
Service levels cache is empty after upgrade to consistent topology
if no mutations are commited to `system.service_levels_v2` or rolling
restart is not done.

To fix the bug, this commit adds service levels cache reloading after
upgrading the SL data accessor to v2 in `storage_service::topology_state_load()`.

Fixes SCYLLADB-90
2026-01-07 14:06:13 +01:00
Michał Jadwiszczak
53d0a2b5dc service/storage_service: check if service levels were already upgraded
before doing migration to raft

There is no need to call `service_level_controller::upgrade_to_v2()`
on every topology state load, we only need to do it once.
2026-01-07 14:06:13 +01:00
Nadav Har'El
f7eae50d98 test/alternator: fix some expected error messages to fit DynamoDB
All tests I am fixing in this patch do pass for me on DynamoDB, but
other developers report that they fail because some DynamoDB servers
apparently use slightly different error messages, with less detail about
the cause of an error. For example, some of our tests currently expect
an error message that looks like:

    An error occurred (ValidationException) when calling the Query
    operation: Invalid operator used in KeyConditionExpression:
    attribute_exists

But some servers don't report the ": attribute_exists" at the end, so
we can't use the word "attribute_exists" it in the test to recognize
the correct error, and needs to use a different word (which both
versions of DynamoDB and Alternator all print).

As another example, the good old DynamoDB error:

    An error occurred (ValidationException) when calling the Query
    operation: 1 validation error detected: Value 'DOG' at
   'conditionalOperator' failed to satisfy constraint: Member must
   satisfy enum value set: [OR, AND]

Got replaced by the following less informative message:

    An error occurred (ValidationException) when calling the Query
    operation: Failed to satisfy constraint: Member must satisfy enum
    value set: [ALL, OR]'

So we need to fix the test to allow it too.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-01-07 14:06:33 +02:00
Nadav Har'El
e97fbc2d65 test/alternator: fix compressed request test on non-us-east1
The test test_compressed_request.py::test_compressed_request coerces
boto3 to send a compressed request, and wrongly used region_name=us-east-1
to set up the connection. Theoretically, this doesn't matter because
we also set the correct URL (for either Alternator or the desired region
in AWS). But in fact it does matter, because region name is part of the
request's signature, and DynamoDB refuses the request if it comes to
a different region than it is signed for. So this test fails when run
on DynamoDB on any other region except us-east-1.

The fix is simple - don't use the constant "us-east-1", but pick up the
correct region name from the original connection.

The functions new_dynamodb_session(), new_dynamodb() and
new_dynamodb_stremas() had the same bug and we fix it too, but it didn't
break any test because the only tests using these functions were
Scylla-only so the AWS region problem didn't apply to them.
2026-01-07 13:33:46 +02:00
Nadav Har'El
2c02e463ff test/alternator: fix test's expected error message on DynamoDB
The Alternator test test_tag.py::test_tag_lsi_gsi expects to see an
error - it's not allowed to set a tag on a GSI or LSI - but the error
message that DynamoDB prints recently changed - instead of saying
"ResourceArn" the new error message says "resource arn".

Change the test to allow both forms, so it will pass on both Alternator
(which still uses the word ResourceArn - which is the name of the
parameter) and on DynamoDB (which uses "resource arn").

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-01-07 12:51:10 +02:00
Nadav Har'El
4f3150c282 test/alternator: mark Alternator-only test scylla_only
The test test_batch.py::test_batch_write_item_large_broken_connection
failed on DynamoDB (Refs #26079). It turns out this test has many
problems:

1. This test wrongly assumes a batch write needs to complete in one
   attempt - and this fails on DynamoDB with low WCU capacity where
   the batch needs to be resumed in multiple requests. Using boto3's
   batch_writer() fixes this problem.
2. This test has NOTHING to do with batches - so is mis-named and
   mis-placed. The batch write is just a way to prepare some data
   in the table, and the real test is about Query'ing the data back
   and observing the long response and reproducing issue #14454.
   I did not rename or move the test, but left a comment explaining
   the situation.
3. This test is written to assume the Query's response uses HTTP
   chunked encoding. Which isn't actually true for DynamoDB, at least
   not at the time of this writing. So the test fails on DynamoDB.

For the last reason, I made this test scylla_only. This test can't
really be run on DynamoDB without rewriting it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-01-07 12:51:10 +02:00
Nadav Har'El
df6b347911 test/alternator: fix test on DynamoDB
The test test_batch.py::test_batch_write_item_large often fails when
running on DynamoDB, and this patch fixes it. The test checks that a
large but not over-the-limits large batch works. However, "works" only
means that the batch is not an error - it doesn't guarantee that all the
items in the batch are performed. If the WCU limits of the table are
exceeded DynamoDB may perform only part of the the batch and return the
remaining items as UnprocessedItems. This not only can happen, it
usually does happen on DynamoDB - because a new on-demand-billing table
always start with a very low WCU capacity.

So in this patch we update the test to recognize and perform the
UnprocessedItems, instead of assuming it needs to be empty.

The test continues to pass on Alternator, and finally passes on
DynamoDB.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-01-07 12:51:10 +02:00
Nadav Har'El
9d6a463324 test/alternator: increase wait_for_gsi() timeout
In Alternator tests, the wait_for_gsi() utility function is used in
tests that add a GSI to an existing table, to wait for this new GSI
to become ready. Although this takes a fraction of a second on
Alternator, we noticed that this takes many minutes (!) on DynamoDB
so we used an absurdly high 10 minute timeout to allow tests to also
pass on DynamoDB.

But it turns out that 10 minutes wasn't absurdly high enough, and
tests using it in test_gsi_updatetable.py started to fail on DynamoDB.
Empirically, 10 minutes was enough in the past but it seems that today
adding a GSI to an empty table routinely takes as much as 20 minutes.

So this patch increases the wait_for_gsi() timeout to a whopping 30
minutes. After this patch, the tests in test_gsi_updatetable.py which
used to fail - test_gsi_backfill_with_lsi,
test_gsi_backfill_with_real_column, test_gsi_creates_and_deletes and
test_gsi_backfill_oversized_key now all pass on DynamoDB - but each
takes more than 20 minutes to pass.

To allow the test to fail much more quickly on Alternator (where
creating a GSI takes a fraction of a second), we set a much lower
but still very high timeout when running on Alternator - 60 seconds.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-01-07 12:50:54 +02:00
Tomasz Grabiec
ffa11d6a2d test: Verify that repair doesn't block disabling of tablet load balancing
Refs #27647
2026-01-05 13:22:15 +01:00
Tomasz Grabiec
ccdb301731 tablets: Make balancing disabling call preempt tablet transitions
This patch modifies RESTful API handler which disables tablet
balancing to use topology request to wait for already running tablet
transitions. Before, it was just waiting for topology to be idle, so
it could wait much longer than necessary, also for operations which
are not affected by the flag, like repair. And repair can take hours.

New request type is introduced for this synchronization: noop_request.
It will preempt the tablet scheduler, and when the request executes,
we know all later tablet transitions will respect the "balancing
disabled" flag, and only things which are unuaffected by the flag,
like repair, will be scheduled.

Fixes #27647
2026-01-05 13:22:08 +01:00
Nadav Har'El
5c2ca56adf test/alternator: fix test passing a spurious parameter
The test test_streams.py::test_streams_putitem_new_item_overrides_old_lsi
failed on DynamoDB (Refs #26079) because we passed an unused parameter
NonKeyAttributes to the Projection setting an LSI. NonKeyAttributes is
only allowed when ProjectionType=INCLUDE, but we used ProjectionType=ALL.
DynamoDB refuses to create an LSI with such inconsistent parameters,
and we just need to remove this unnecessary parameter from this test.

The reason why this test didn't fail on Alternator is that Alternator
doesn't yet support or even parse the Projection parameter (Refs #5036).

We also add an xfailing test (passes on DynamoDB, fails on Alternator)
checking that a spurious NonKeyAttributes parameter is rejected. When
we get around to implement the projection feature (#5036), this will
be yet another acceptance test for this feature.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-01-05 13:51:01 +02:00
Benny Halevy
a8114f9bcc test_backup: do_abort_restore: reduce data footprint
To make the test fast, in particular in debug mode
insert fewer keys and do not rely on os.urandom
which is notoriously slow

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-05 08:04:08 +02:00
Benny Halevy
c0dd662144 test_backup: do_abort_restore: use error injection
Currently the test depends on timing and enough inserted
data to abort the restore tasks at exactly the right time.
This is flaky in nature, so instead, use error injection
to synchronize the abort with mutation streaming.

Note that with that we no longer get the STREAM_MUTATION_FRAGMENTS
log message, so waiting for it is dropped from the test.
The most imporant thing is that some restore tasks must fail.
(We cannot guarantee all would fail unfortunately)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-05 08:03:53 +02:00
Benny Halevy
16dd07c7d4 test_backup: do_abort_restore: use asyncio for cql
Use the more modern asyncio facility to run cql queries
and a prepared statement to insert data into the table.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-05 08:03:53 +02:00
Benny Halevy
f1a583c39c test_backup: do_abort_restore: use new_test_keyspace
For creating a keyspace with a unique name and auto-deleting
it on exit.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-05 08:03:53 +02:00
Benny Halevy
3e8431a3d9 test_backup: do_abort_restore: use logger rather than print
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-05 08:03:53 +02:00
Benny Halevy
acb2c9b045 test_backup: do_abort_restore: pass auto_rack_dc to servers_add
To generate multi-rack cluster, otherwise we get the following error:
```
E   cassandra.protocol.ConfigurationException: <Error from server: code=2300 [Query invalid because of configuration issue] message="Replication factor 3 exceeds the number of racks (1) in dc datacenter1">
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-05 08:03:53 +02:00
Patryk Jędrzejczak
0fed9f94f8 gossiper: add_saved_endpoint: make generations of excluded nodes negative
The explanation is in the new comment in `gossiper::add_saved_endpoint`.

We add a test for this change. It's "extremely white-box", but it's better
than nothing.
2025-12-29 19:13:55 +01:00
Patryk Jędrzejczak
749b0278e5 test: introduce test_full_shutdown_during_replace 2025-12-29 19:13:55 +01:00
Patryk Jędrzejczak
4526dd93b1 utils: error_injection: allow aborting wait_for_message
The test added in the following commit utilizes it.
2025-12-29 19:13:55 +01:00
Patryk Jędrzejczak
fc4c2df2ce raft topology: preserve IP -> ID mapping of a replacing node on restart
We currently do it only for a bootstrapping node, which is a bug. The
missing IP can cause an internal error, for example, in the following
scenario:
- replace fails during streaming,
- all live nodes are shut down before the rollback of replace completes,
- all live nodes are restarted,
- live nodes start hitting internal error in all operations that
  require IP of the replacing node (like client requests or REST API
  requests coming from nodetool).

We fix the bug here, but we do it separately for replace with different
IP and replace with the same IP.

For replace with different IP, we persist the IP -> host ID mapping
in `system.peers` just like for bootstrap. That's necessary, since there
is no other way to determine IP of the replacing node on restart.

For replace with the same IP, we can't do the same. This would require
deleting the row corresponding to the node being replaced from
`system.peers`. That's fine in theory, as that node is permanently
banned, so its IP shouldn't be needed. Unfortunately, we have many
places in the code where we assume that IP of a topology member is always
present in the address map or that a topology member is always present in
the gossiper endpoint set. Examples of such places:
- nodetool operations,
- REST API endpoints,
- `db::hints::manager::store_hint`,
- `group0_voter_handler::update_nodes`.

We could fix all those places and verify that drivers work properly when
they see a node in the token metadata, but not in `system.peers`.
However, that would be too risky to backport.

We take a different approach. We recover IP of the replacing node on
restart based on the state of the topology state machine and
`system.peers` just after loading `system.peers`.

We rely on the fact that group 0 is set up at this point. The only case
where this assumption is incorrect is a restart in the Raft-based
recovery procedure. However, hitting this problem then seems improbable,
and even if it happens, we can restart the node again after ensuring
that no client and REST API requests come before replace is rolled back
on the new topology coordinator. Hence, it's not worth to complicate the
fix (by e.g. looking at the persistent topology state instead of the
in-memory state machine).
2025-12-29 19:13:53 +01:00
copilot-swe-agent[bot]
d25d295e84 alternator/server: update SSL comment 2025-12-29 09:34:08 +01:00
Michael Litvak
9f8aea21e3 docs: update RF-rack restrictions
Update the documentation about restrictions to tablets keyspaces related
to RF-rack.

* MV/SI require the keyspace to be RF-rack-valid
* topology operations are restricted if a keyspace has views to preserve
  RF-rack-validity
2025-12-22 09:21:07 +01:00
Michael Litvak
75b5285cdf cql3: don't apply RF-rack restrictions on vector indexes
When creating an index we validate that the keyspace is RF-rack-valid
and print a warning that the keyspace must remain RF-rack-valid.

This should apply only to indexes that are based on materialized views
for which there are consistency concerns when the keyspace is not
RF-rack-valid.

vector indexes are not based on materialized views, hence these
restrictions should not apply to them.
2025-12-22 09:21:07 +01:00
Michael Litvak
06343b58a2 cql3: add warning when creating mv/index with tablets about rf-rack
Creating a MV or index in a tablets-based keyspace now forces additional
restrictions on the keyspace. The keyspace must be RF-rack-valid and it
must remain RF-rack-valid while the view exists.

Add a CQL warning about these restrictions.
2025-12-22 09:21:06 +01:00
Michael Litvak
df801d16da service/tablet_allocator: always allow tablet merge of tables with views
allow tablet merge of tables with views even if the
rf_rack_valid_keyspaces option is not set, because now keyspaces that
have views are enforced to always be rf-rack-valid, regardless of the
option value.
2025-12-22 09:14:30 +01:00
Michael Litvak
07d85af433 locator: extend rf-rack validation for rack lists
Extend the RF-rack validation in `assert_rf_rack_valid_keyspace` to
validate rack-list-based replication as well. Previously, validation was
done only for numeric replication.

If the replication is based on a rack list, we validate that all racks
that are required for replication are present in the topology rack map.
If some rack is needed for replication but is missing, or it doesn't
have normal token owner nodes, the validation fails with an error.
2025-12-22 09:14:30 +01:00
Michael Litvak
d40d06c7ad test: test rf-rack validity when creating keyspace during node ops
add tests that attempt to create a keyspace during different stages of
node join or remove, and verify that the rf-rack condition can't be
broken - either creating the keyspace should fail or the node operation
should fail, depending on the stage.
2025-12-22 09:14:30 +01:00
Michael Litvak
a738905a4b locator: fix rf-rack validation during node join/remove
If a keyspace is created while a node is joining or being removed, it could
break the rf-rack invariant. For example:
1. We have 3 nodes in 3 racks, no keyspaces
2. A new node starts to join in a new rack - passes validation because
   there are no keyspaces
3. Create a keyspace with rf=3 - passes validation because the joining
   node is not a normal token owner yet
4. The new node becomes a normal token owner
5. The rf-rack invariant is broken. We have rf=3 and 4 racks

To fix this, we change the rf-rack check to consider a node as a token
owner if it's either a normal token owner or it has bootstrap tokens and
is about to become a normal token owner.

Now the condition can't be broken. Consider keyspace creation at
different stages of adding a node in our example:
* Before the node is assigned bootstrap tokens: the node is not
  considered. We can create a keyspace with rf=3 as if the node doesn't
  exist, and then node join will fail in the group0 operation that
  assigns bootstrap tokens, because during this operation we check
  rf-rack validity.
* Assigning bootstrap tokens is a single group0 operation that is
  serialized with keyspace creation. During this operation we check that
  adding the node as a token owner will maintain rf-rack validity for all
  keyspaces.
* After the node is assigned bootstrap tokens and until it becomes a
  normal token owner: it is considered as a transitioning token owner by
  the rf-rack check and the rack is considered a transitioning rack. We
  can't count the rack as a normal rack because the node join may still
  fail and rollback. Trying to create a keyspace with either rf=3 or
  rf=4 will fail because we can end up with either 3 or 4 racks.

Similarly, when removing a node, we validate that removing the node will
maintain rf-rack validity in the same group0 operation that changes the
node state to removing/decommissioning, after which the node becomes a
leaving endpoint, and it's not considered a normal token owner anymore
for the rf-rack check.
2025-12-22 09:14:30 +01:00
Michael Litvak
9940dcefa7 test: test topology restrictions for views with tablets
Add tests that verify the restrictions on topology operations when there
are keyspaces with tablets and materialized views.

For such keyspaces, RF=Racks must be enforced while they have
materialized views, therefore adding a node in a new rack or removing a
node that would eliminate a rack should be rejected.
2025-12-22 09:14:30 +01:00
Michael Litvak
870aad7f71 test: add test_topology_ops_with_rf_rack_valid
add new tests for testing that RF-rack validity is maintained when doing
topology operations that may break them, such as adding nodes in new
racks or removing nodes.
2025-12-22 09:14:30 +01:00
Michael Litvak
8f1a566be8 topology coordinator: restrict node join/remove to preserve RF-rack validity
when a new node joins or an existing node is removed / decommissioned,
check if the operation would violate the RF-rack-validity of some
keyspace. if so - reject the operation in order to preserve
RF-rack-validity.

Fixes scylladb/scylladb#23345
Fixes scylladb/scylladb#26820
2025-12-22 09:14:30 +01:00
Michael Litvak
aa8db3b8da topology coordinator: add validation to node remove
add validation to node remove / decommission, similar to node validation
when a node joins.

when starting node remove or decommission, the validation function
checks if the operation is valid and can proceed. if not, it's aborted
with an error message.

we change the return type of validate_joining_node so that it will be
similar and consistent with the new validate_removing_node.
2025-12-22 09:14:29 +01:00
Michael Litvak
9e1f78d162 locator: extend rf-rack validation functions
Extend the locator function assert_rf_rack_valid_keyspace to accept
arbitrary topology dc-rack maps and nodes instead of using the current
token metadata.

This allows us to add a new variant of the function that checks rf-rack
validity given a topology change that we want to apply. we will use it
to check that rf-rack validity will be maintained before applying the
topology change.

The possible topology changes for the check are node add and node remove
/ decommission. These operations can change the number of normal racks -
if a new node is added to a new rack, or the last node is removed from a
rack.
2025-12-22 09:14:29 +01:00
Michael Litvak
8df61f6d99 view: change validate_view_keyspace to allow MVs if RF=Racks
The function validate_view_keyspace checks if a keyspace is eligible for
having materialized views, and it is used for validation when creating a
MV or a MV-based index.

Previously, it was required that the rf_rack_valid_keyspaces option is
set in order for tablets-based keyspaces to be considered eligible, and
the RF-rack condition was enforced when the option is set.

Instead of this, we change the validation to allow MVs in a keyspace if
the RF-rack condition is satisfied for the keyspace - regardless of the
config option.

We remove the config validation for views on startup that validates the
option `rf_rack_valid_keyspaces` is set if there are any views with
tablets, since this is not required anymore.

We can do this without worrying about upgrades because this change will
be effective from 2025.4 where MVs with tablets are first out of
experimental phase.

We update the test for MV and index restrictions in tablets keyspaces
according to the new requirements.

* Create MV/index: previously the test checked that it's allowed only if
  the config option `rf_rack_valid_keyspaces` is set. This is changed
  now so it's always allowed to create MV/index if the keyspace is
  RF-rack-valid. Update the test to verify that we can create MV/index
  when the keyspace is RF-rack-valid, even if the rf_rack option is not
  set, and verify that it fails when the keyspace is RF-rack-invalid.
* Alter: Add a new test to verify that while a keyspace has views, it
  can't be altered to become RF-rack-invalid.
2025-12-22 09:14:29 +01:00
Michael Litvak
de1bb84fca db: enforce rf-rack-validity for keyspaces with views
Extend the RF-rack-validity enforcement to keyspaces that have views,
regardless of the option `rf_rack_valid_keyspaces`.

Previously, RF-rack-validity was enforced when `rf_rack_valid_keyspaces`
was set for all keyspaces. Now we want to allow creating MVs in tablet
keyspaces that are RF-rack-valid and enforce the RF-rack-validity even
if the config option is not set.
2025-12-22 09:13:49 +01:00
Michael Litvak
75b269229d replica/db: add enforce_rf_rack_validity_for_keyspace helper
Add the helper function enforce_rf_rack_validity_for_keyspace that
returns true if RF-rack-validity should be enforced for a keyspace, and
use it wherever we need to check this instead of checking the config
option directly.

This is useful because this condition is used in multiple places, and
having it defined in a single helper function will make it easier to
see and change the enforcement conditions.
2025-12-22 09:13:49 +01:00
Michael Litvak
8b0b0c4d80 db: remove enforce parameter from check_rf_rack_validity
simple refactoring: the enforce parameter is always given the value of
the `rf_rack_valid_keyspaces` option. remove the parameter and use the
option value directly from the db config.

this will be useful for a later change to the enforcement conditions.
2025-12-22 09:13:49 +01:00
Michael Litvak
61ae653693 test: adjust test to not break rf-rack validity
the test test_unfinished_writes_during_shutdown starts 3 nodes in 3
racks and creates a keyspace with RF=3, then adds a new node in a 4th
rack. this breaks rf-rack validity for the keyspace.

we change it instead to add the new node in an existing rack. it doesn't
matter for the test - the test only wants to add a new node to trigger
some topology change.
2025-12-22 09:13:48 +01:00
Dawid Mędrek
df0830044d cql3: Extend DESC INDEX by view properties
We're extending the logic of DESCRIBE INDEX to include properties of the
underlying materialized view. Tests are provided to ensure the
implementation works as intended.
2025-12-16 11:43:38 +01:00
Dawid Mędrek
d0a42852c0 cql3: Forbid using CLUSTERING ORDER BY when creating index
This is a temporary solution as handling this property may require
a bit more attention or at least a bit more focus. For now, let's
forbid using it so it's clear it won't get applied. A simple test
is provided to cover it.

We document the restriction.
2025-12-16 11:43:38 +01:00
Dawid Mędrek
6541362a0b cql3: Extend CREATE INDEX by MV properties
After the previous patch that extended the grammar and provided
basic functionalities to accommodate properties of materialized views
in indexes, this commit takes another step and actually applies them
to the underlying view when it's being created.

We're providing validation tests for each property, with the single
exception of CLUSTERING ORDER BY. That one will be handled separately
in an upcoming commit.

We also update the user documentation.
2025-12-16 11:43:38 +01:00
Dawid Mędrek
e1f1071bc5 cql3/statements/create_index_statement: Allow for view options
We're allowing CREATE INDEX to accept the same set of properties as
materialized views do. Our goal is to give the user an ability to
configure the underlying materialized view of an index directly,
when creating it.

This commit doesn't do anything except for extending the grammar
and passing the right pieces of information to the right destinations.
There's no validation and the options have no effect yet. That will
be done in the following patch.
2025-12-16 11:43:37 +01:00
Dawid Mędrek
2cc01eeffb cql3/statements/create_index_statement: Rename member 2025-12-16 11:43:37 +01:00
Dawid Mędrek
b3099e2f2c cql3/statements/index_prop_defs: Re-introduce index_prop_defs
The type represents a mix of both index-specific and view properties.
Since we cannot easily distinguish which properties belong to which
entity, let's use this abstraction and filter them from the C++ level.

This is a prerequisite for extending the capabilities of CREATE INDEX
by allowing it to configuring the underlying materialized view.
2025-12-16 11:43:37 +01:00
Dawid Mędrek
6dfebc0b6e cql3/statements/property_definitions: Add extract_property()
The method will be useful in an upcoming commit where we'll want to
split a type inherited from `property_definitions` into two separate
entities.
2025-12-16 11:43:37 +01:00
Dawid Mędrek
62a8dd1b7f cql3/statements/index_prop_defs.cc: Add namespace 2025-12-16 11:43:37 +01:00
Dawid Mędrek
dcf2c71204 cql3/statements/index_prop_defs.hh: Rename type
We rename the type `index_prop_defs` to `index_specific_prop_defs`.
The rationale for the change is to distinguish between properties
related directly to a index and properties related to the underlying
view (if applicable).

The type `index_prop_defs` will be re-introduced in an upcomming commit
where it'll encompass both index-related and view-related properties.
This is a prerequisite for it.
2025-12-16 11:43:37 +01:00
Dawid Mędrek
e9108677f7 cql3/statements/view_prop_defs.cc: Move validation logic into file
We're moving the rest of the validation logic that can be moved from
`cql3/statements/{create,alter}_view_statement.cc` to the new file.
2025-12-16 11:43:35 +01:00
Dawid Mędrek
f5f7aeaa0a cql3/statements: Introduce view_prop_defs.{hh,cc}
We're introducing a new type wrapping properties that can be used with
materialized views. Doing that, we achieve the following things:

(1) We can keep validation logic in one place.
(2) We differentiate between properties of a regular table and
    properties of a materialized view.
(3) It provides better modularization and allows for reusing the code.
(4) It gets rid of inconsistencies in the existing code, e.g.
    CREATE MV using one type for properties, while ALTER MV another.

The actual end goal of this commit is to be able to reuse at least part
of the validation logic of MVs in CREATE INDEX and, when it gets added,
ALTER INDEX: we want to endow those statements with an ability to modify
the underlying materialized view without having to modify it directly.

This patch does NOT implement the whole validation logic yet. It will be
done in a following commit.

Refs scylladb/scylladb#16454
2025-12-16 11:43:03 +01:00
Dawid Mędrek
8691d287be cql3/statements/create_view_statement.cc: Move validation of ID
The end goal we have in mind in this commit is to extract the validation
logic of the options used for creating and altering an MV to a separate
place and be able to call from different places in the code.

It will be useful when extending the capabilities of the CREATE INDEX
statement.

In this patch, we move the part of validation responsible for checking
the ID option to keep it close to the other parts of validation of the
options in their "raw" form.
2025-12-15 13:18:48 +01:00
Dawid Mędrek
11c109c623 schema/schema.hh: Do not include index_prop_defs.hh
One of the upcoming commits will lead to a cyclic dependency
of headers because `schema.hh` includes `index_prop_defs.hh`.
To prevent that, we remove the include and replace it with
a manually added alias.

This is not a perfect solution, but doing it properly would
require comprehensive changes. We can do that in a separate
task.
2025-12-15 13:18:48 +01:00
Calle Wilund
5f53d7852e docs::encryption: Add warning that replicated provider is deprecated
And will be removed.
2025-12-01 09:44:44 +00:00
Calle Wilund
52cc30e00c ent::encryption: Switch default key provider from replicated to local
Since we are deprecating the replicated provider, it makes little sense to
have it be default.
2025-12-01 09:44:44 +00:00
Calle Wilund
ba38d58539 replicated_key_provider: Add deprecation warning on usage
Warns user utilizing the provider that the provider is deprecated
and will be removed.
2025-12-01 09:44:44 +00:00
346 changed files with 7130 additions and 1870 deletions

View File

@@ -84,3 +84,14 @@ ninja build/<mode>/scylla
- Strive for simplicity and clarity, add complexity only when clearly justified
- Question requests: don't blindly implement requests - evaluate trade-offs, identify issues, and suggest better alternatives when appropriate
- Consider different approaches, weigh pros and cons, and recommend the best fit for the specific context
## Test Philosophy
- Performance matters. Tests should run as quickly as possible. Sleeps in the code are highly discouraged and should be avoided, to reduce run time and flakiness.
- Stability matters. Tests should be stable. New tests should be executed 100 times at least to ensure they pass 100 out of 100 times. (use --repeat 100 --max-failures 1 when running it)
- Unit tests should ideally test one thing and one thing only.
- Tests for bug fixes should run before the fix - and show the failure and after the fix - and show they now pass.
- Tests for bug fixes should have in their comments which bug fixes (GitHub or JIRA issue) they test.
- Tests in debug are always slower, so if needed, reduce number of iterations, rows, data used, cycles, etc. in debug mode.
- Tests should strive to be repeatable, and not use random input that will make their results unpredictable.
- Tests should consume as little resources as possible. Prefer running tests on a single node if it is sufficient, for example.

View File

@@ -18,6 +18,8 @@ on:
jobs:
release:
permissions:
contents: write
runs-on: ubuntu-latest
steps:
- name: Checkout

View File

@@ -2,6 +2,9 @@ name: "Docs / Build PR"
# For more information,
# see https://sphinx-theme.scylladb.com/stable/deployment/production.html#available-workflows
permissions:
contents: read
env:
FLAG: ${{ github.repository == 'scylladb/scylla-enterprise' && 'enterprise' || 'opensource' }}

View File

@@ -1,5 +1,8 @@
name: Docs / Validate metrics
permissions:
contents: read
on:
pull_request:
branches:

View File

@@ -10,6 +10,8 @@ on:
jobs:
read-toolchain:
runs-on: ubuntu-latest
permissions:
contents: read
outputs:
image: ${{ steps.read.outputs.image }}
steps:

View File

@@ -105,11 +105,23 @@ future<> controller::start_server() {
alternator_port = _config.alternator_port();
_listen_addresses.push_back({addr, *alternator_port});
}
std::optional<uint16_t> alternator_port_proxy_protocol;
if (_config.alternator_port_proxy_protocol()) {
alternator_port_proxy_protocol = _config.alternator_port_proxy_protocol();
_listen_addresses.push_back({addr, *alternator_port_proxy_protocol});
}
std::optional<uint16_t> alternator_https_port;
std::optional<uint16_t> alternator_https_port_proxy_protocol;
std::optional<tls::credentials_builder> creds;
if (_config.alternator_https_port()) {
alternator_https_port = _config.alternator_https_port();
_listen_addresses.push_back({addr, *alternator_https_port});
if (_config.alternator_https_port() || _config.alternator_https_port_proxy_protocol()) {
if (_config.alternator_https_port()) {
alternator_https_port = _config.alternator_https_port();
_listen_addresses.push_back({addr, *alternator_https_port});
}
if (_config.alternator_https_port_proxy_protocol()) {
alternator_https_port_proxy_protocol = _config.alternator_https_port_proxy_protocol();
_listen_addresses.push_back({addr, *alternator_https_port_proxy_protocol});
}
creds.emplace();
auto opts = _config.alternator_encryption_options();
if (opts.empty()) {
@@ -135,20 +147,29 @@ future<> controller::start_server() {
}
}
_server.invoke_on_all(
[this, addr, alternator_port, alternator_https_port, creds = std::move(creds)] (server& server) mutable {
return server.init(addr, alternator_port, alternator_https_port, creds,
[this, addr, alternator_port, alternator_https_port, alternator_port_proxy_protocol, alternator_https_port_proxy_protocol, creds = std::move(creds)] (server& server) mutable {
return server.init(addr, alternator_port, alternator_https_port, alternator_port_proxy_protocol, alternator_https_port_proxy_protocol, creds,
_config.alternator_enforce_authorization,
_config.alternator_warn_authorization,
_config.alternator_max_users_query_size_in_trace_output,
&_memory_limiter.local().get_semaphore(),
_config.max_concurrent_requests_per_shard);
}).handle_exception([this, addr, alternator_port, alternator_https_port] (std::exception_ptr ep) {
logger.error("Failed to set up Alternator HTTP server on {} port {}, TLS port {}: {}",
addr, alternator_port ? std::to_string(*alternator_port) : "OFF", alternator_https_port ? std::to_string(*alternator_https_port) : "OFF", ep);
}).handle_exception([this, addr, alternator_port, alternator_https_port, alternator_port_proxy_protocol, alternator_https_port_proxy_protocol] (std::exception_ptr ep) {
logger.error("Failed to set up Alternator HTTP server on {} port {}, TLS port {}, proxy-protocol port {}, TLS proxy-protocol port {}: {}",
addr,
alternator_port ? std::to_string(*alternator_port) : "OFF",
alternator_https_port ? std::to_string(*alternator_https_port) : "OFF",
alternator_port_proxy_protocol ? std::to_string(*alternator_port_proxy_protocol) : "OFF",
alternator_https_port_proxy_protocol ? std::to_string(*alternator_https_port_proxy_protocol) : "OFF",
ep);
return stop_server().then([ep = std::move(ep)] { return make_exception_future<>(ep); });
}).then([addr, alternator_port, alternator_https_port] {
logger.info("Alternator server listening on {}, HTTP port {}, HTTPS port {}",
addr, alternator_port ? std::to_string(*alternator_port) : "OFF", alternator_https_port ? std::to_string(*alternator_https_port) : "OFF");
}).then([addr, alternator_port, alternator_https_port, alternator_port_proxy_protocol, alternator_https_port_proxy_protocol] {
logger.info("Alternator server listening on {}, HTTP port {}, HTTPS port {}, proxy-protocol port {}, TLS proxy-protocol port {}",
addr,
alternator_port ? std::to_string(*alternator_port) : "OFF",
alternator_https_port ? std::to_string(*alternator_https_port) : "OFF",
alternator_port_proxy_protocol ? std::to_string(*alternator_port_proxy_protocol) : "OFF",
alternator_https_port_proxy_protocol ? std::to_string(*alternator_https_port_proxy_protocol) : "OFF");
}).get();
});
}

View File

@@ -5986,6 +5986,11 @@ future<executor::request_return_type> executor::list_tables(client_state& client
_stats.api_operations.list_tables++;
elogger.trace("Listing tables {}", request);
co_await utils::get_local_injector().inject("alternator_list_tables", [] (auto& handler) -> future<> {
handler.set("waiting", true);
co_await handler.wait_for_message(std::chrono::steady_clock::now() + std::chrono::minutes{5});
});
rjson::value* exclusive_start_json = rjson::find(request, "ExclusiveStartTableName");
rjson::value* limit_json = rjson::find(request, "Limit");
std::string exclusive_start = exclusive_start_json ? rjson::to_string(*exclusive_start_json) : "";

View File

@@ -374,13 +374,40 @@ future<std::string> server::verify_signature(const request& req, const chunked_c
for (const auto& header : signed_headers) {
signed_headers_map.emplace(header, std::string_view());
}
std::vector<std::string> modified_values;
for (auto& header : req._headers) {
std::string header_str;
header_str.resize(header.first.size());
std::transform(header.first.begin(), header.first.end(), header_str.begin(), ::tolower);
auto it = signed_headers_map.find(header_str);
if (it != signed_headers_map.end()) {
it->second = std::string_view(header.second);
// replace multiple spaces in the header value header.second with
// a single space, as required by AWS SigV4 header canonization.
// If we modify the value, we need to save it in modified_values
// to keep it alive.
std::string value;
value.reserve(header.second.size());
bool prev_space = false;
bool modified = false;
for (char ch : header.second) {
if (ch == ' ') {
if (!prev_space) {
value += ch;
prev_space = true;
} else {
modified = true; // skip a space
}
} else {
value += ch;
prev_space = false;
}
}
if (modified) {
modified_values.emplace_back(std::move(value));
it->second = std::string_view(modified_values.back());
} else {
it->second = std::string_view(header.second);
}
}
}
@@ -393,6 +420,7 @@ future<std::string> server::verify_signature(const request& req, const chunked_c
datestamp = std::move(datestamp),
signed_headers_str = std::move(signed_headers_str),
signed_headers_map = std::move(signed_headers_map),
modified_values = std::move(modified_values),
region = std::move(region),
service = std::move(service),
user_signature = std::move(user_signature)] (future<key_cache::value_ptr> key_ptr_fut) {
@@ -563,11 +591,11 @@ read_entire_stream(input_stream<char>& inp, size_t length_limit) {
class safe_gzip_zstream {
z_stream _zs;
public:
safe_gzip_zstream() {
// If gzip is true, decode a gzip header (for "Content-Encoding: gzip").
// Otherwise, a zlib header (for "Content-Encoding: deflate").
safe_gzip_zstream(bool gzip = true) {
memset(&_zs, 0, sizeof(_zs));
// The strange 16 + WMAX_BITS tells zlib to expect and decode
// a gzip header, not a zlib header.
if (inflateInit2(&_zs, 16 + MAX_WBITS) != Z_OK) {
if (inflateInit2(&_zs, gzip ? 16 + MAX_WBITS : MAX_WBITS) != Z_OK) {
// Should only happen if memory allocation fails
throw std::bad_alloc();
}
@@ -586,19 +614,21 @@ public:
}
};
// ungzip() takes a chunked_content with a gzip-compressed request body,
// uncompresses it, and returns the uncompressed content as a chunked_content.
// ungzip() takes a chunked_content of a compressed request body, and returns
// the uncompressed content as a chunked_content. If gzip is true, we expect
// gzip header (for "Content-Encoding: gzip"), if gzip is false, we expect a
// zlib header (for "Content-Encoding: deflate").
// If the uncompressed content exceeds length_limit, an error is thrown.
static future<chunked_content>
ungzip(chunked_content&& compressed_body, size_t length_limit) {
ungzip(chunked_content&& compressed_body, size_t length_limit, bool gzip = true) {
chunked_content ret;
// output_buf can be any size - when uncompressing input_buf, it doesn't
// need to fit in a single output_buf, we'll use multiple output_buf for
// a single input_buf if needed.
constexpr size_t OUTPUT_BUF_SIZE = 4096;
temporary_buffer<char> output_buf;
safe_gzip_zstream strm;
bool complete_stream = false; // empty input is not a valid gzip
safe_gzip_zstream strm(gzip);
bool complete_stream = false; // empty input is not a valid gzip/deflate
size_t total_out_bytes = 0;
for (const temporary_buffer<char>& input_buf : compressed_body) {
if (input_buf.empty()) {
@@ -701,6 +731,8 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
sstring content_encoding = req->get_header("Content-Encoding");
if (content_encoding == "gzip") {
content = co_await ungzip(std::move(content), request_content_length_limit);
} else if (content_encoding == "deflate") {
content = co_await ungzip(std::move(content), request_content_length_limit, false);
} else if (!content_encoding.empty()) {
// DynamoDB returns a 500 error for unsupported Content-Encoding.
// I'm not sure if this is the best error code, but let's do it too.
@@ -872,7 +904,9 @@ server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gos
} {
}
future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port,
std::optional<uint16_t> port_proxy_protocol, std::optional<uint16_t> https_port_proxy_protocol,
std::optional<tls::credentials_builder> creds,
utils::updateable_value<bool> enforce_authorization, utils::updateable_value<bool> warn_authorization, utils::updateable_value<uint64_t> max_users_query_size_in_trace_output,
semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests) {
_memory_limiter = memory_limiter;
@@ -880,20 +914,28 @@ future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std:
_warn_authorization = std::move(warn_authorization);
_max_concurrent_requests = std::move(max_concurrent_requests);
_max_users_query_size_in_trace_output = std::move(max_users_query_size_in_trace_output);
if (!port && !https_port) {
if (!port && !https_port && !port_proxy_protocol && !https_port_proxy_protocol) {
return make_exception_future<>(std::runtime_error("Either regular port or TLS port"
" must be specified in order to init an alternator HTTP server instance"));
}
return seastar::async([this, addr, port, https_port, creds] {
return seastar::async([this, addr, port, https_port, port_proxy_protocol, https_port_proxy_protocol, creds] {
_executor.start().get();
if (port) {
if (port || port_proxy_protocol) {
set_routes(_http_server._routes);
_http_server.set_content_streaming(true);
_http_server.listen(socket_address{addr, *port}).get();
if (port) {
_http_server.listen(socket_address{addr, *port}).get();
}
if (port_proxy_protocol) {
listen_options lo;
lo.reuse_address = true;
lo.proxy_protocol = true;
_http_server.listen(socket_address{addr, *port_proxy_protocol}, lo).get();
}
_enabled_servers.push_back(std::ref(_http_server));
}
if (https_port) {
if (https_port || https_port_proxy_protocol) {
set_routes(_https_server._routes);
_https_server.set_content_streaming(true);
@@ -913,7 +955,15 @@ future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std:
} else {
_credentials = creds->build_server_credentials();
}
_https_server.listen(socket_address{addr, *https_port}, _credentials).get();
if (https_port) {
_https_server.listen(socket_address{addr, *https_port}, _credentials).get();
}
if (https_port_proxy_protocol) {
listen_options lo;
lo.reuse_address = true;
lo.proxy_protocol = true;
_https_server.listen(socket_address{addr, *https_port_proxy_protocol}, lo, _credentials).get();
}
_enabled_servers.push_back(std::ref(_https_server));
}
});
@@ -986,9 +1036,8 @@ client_data server::ongoing_request::make_client_data() const {
// and keep "driver_version" unset.
cd.driver_name = _user_agent;
// Leave "protocol_version" unset, it has no meaning in Alternator.
// Leave "hostname", "ssl_protocol" and "ssl_cipher_suite" unset.
// As reported in issue #9216, we never set these fields in CQL
// either (see cql_server::connection::make_client_data()).
// Leave "hostname", "ssl_protocol" and "ssl_cipher_suite" unset for Alternator.
// Note: CQL sets ssl_protocol and ssl_cipher_suite via generic_server::connection base class.
return cd;
}

View File

@@ -100,7 +100,9 @@ class server : public peering_sharded_service<server> {
public:
server(executor& executor, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& service, qos::service_level_controller& sl_controller);
future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port,
std::optional<uint16_t> port_proxy_protocol, std::optional<uint16_t> https_port_proxy_protocol,
std::optional<tls::credentials_builder> creds,
utils::updateable_value<bool> enforce_authorization, utils::updateable_value<bool> warn_authorization, utils::updateable_value<uint64_t> max_users_query_size_in_trace_output,
semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);
future<> stop();

View File

@@ -9,6 +9,7 @@
#include <seastar/core/chunked_fifo.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/exception.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <seastar/http/exception.hh>
#include "task_manager.hh"
@@ -264,7 +265,7 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>
if (id) {
module->unregister_task(id);
}
co_await maybe_yield();
co_await coroutine::maybe_yield();
}
});
co_return json_void();

View File

@@ -76,11 +76,14 @@ sstring generate_salt(RandomNumberEngine& g, scheme scheme) {
///
/// Hash a password combined with an implementation-specific salt string.
/// Deprecated in favor of `hash_with_salt_async`.
/// Deprecated in favor of `hash_with_salt_async`. This function is still used
/// when generating password hashes for storage to ensure that
/// `hash_with_salt` and `hash_with_salt_async` produce identical results,
/// preserving backward compatibility.
///
/// \throws \ref std::system_error when an unexpected implementation-specific error occurs.
///
[[deprecated("Use hash_with_salt_async instead")]] sstring hash_with_salt(const sstring& pass, const sstring& salt);
sstring hash_with_salt(const sstring& pass, const sstring& salt);
///
/// Async version of `hash_with_salt` that returns a future.

View File

@@ -204,7 +204,7 @@ future<topology_description> topology_description::clone_async() const {
for (const auto& entry : _entries) {
vec.push_back(entry);
co_await seastar::maybe_yield();
co_await coroutine::maybe_yield();
}
co_return topology_description{std::move(vec)};

View File

@@ -15,7 +15,7 @@
#include "mutation/tombstone.hh"
#include "schema/schema.hh"
#include "seastar/core/sstring.hh"
#include <seastar/core/sstring.hh>
#include "types/concrete_types.hh"
#include "types/types.hh"
#include "types/user.hh"

View File

@@ -1092,6 +1092,7 @@ scylla_core = (['message/messaging_service.cc',
'cql3/statements/list_service_level_attachments_statement.cc',
'cql3/statements/list_effective_service_level_statement.cc',
'cql3/statements/describe_statement.cc',
'cql3/statements/view_prop_defs.cc',
'cql3/update_parameters.cc',
'cql3/util.cc',
'cql3/ut_name.cc',
@@ -2805,38 +2806,35 @@ def write_build_file(f,
seastar_dep = f'$builddir/{mode}/seastar/libseastar.{seastar_lib_ext}'
seastar_testing_dep = f'$builddir/{mode}/seastar/libseastar_testing.{seastar_lib_ext}'
f.write('build {seastar_dep}: ninja $builddir/{mode}/seastar/build.ninja | always {profile_dep}\n'
.format(**locals()))
f.write(f'build {seastar_dep}: ninja $builddir/{mode}/seastar/build.ninja | always {profile_dep}\n')
f.write(' pool = submodule_pool\n')
f.write(' subdir = $builddir/{mode}/seastar\n'.format(**locals()))
f.write(' target = seastar\n'.format(**locals()))
f.write('build {seastar_testing_dep}: ninja $builddir/{mode}/seastar/build.ninja | always {profile_dep}\n'
.format(**locals()))
f.write(f' subdir = $builddir/{mode}/seastar\n')
f.write(' target = seastar\n')
f.write(f'build {seastar_testing_dep}: ninja $builddir/{mode}/seastar/build.ninja | always {profile_dep}\n')
f.write(' pool = submodule_pool\n')
f.write(' subdir = $builddir/{mode}/seastar\n'.format(**locals()))
f.write(' target = seastar_testing\n'.format(**locals()))
f.write(' profile_dep = {profile_dep}\n'.format(**locals()))
f.write(f' subdir = $builddir/{mode}/seastar\n')
f.write(' target = seastar_testing\n')
f.write(f' profile_dep = {profile_dep}\n')
for lib in abseil_libs:
f.write('build $builddir/{mode}/abseil/{lib}: ninja $builddir/{mode}/abseil/build.ninja | always {profile_dep}\n'.format(**locals()))
f.write(' pool = submodule_pool\n')
f.write(' subdir = $builddir/{mode}/abseil\n'.format(**locals()))
f.write(' target = {lib}\n'.format(**locals()))
f.write(' profile_dep = {profile_dep}\n'.format(**locals()))
f.write(f'build $builddir/{mode}/abseil/{lib}: ninja $builddir/{mode}/abseil/build.ninja | always {profile_dep}\n')
f.write(f' pool = submodule_pool\n')
f.write(f' subdir = $builddir/{mode}/abseil\n')
f.write(f' target = {lib}\n')
f.write(f' profile_dep = {profile_dep}\n')
f.write(f'build $builddir/{mode}/stdafx.hh.pch: cxx_build_precompiled_header.{mode} stdafx.hh | {profile_dep} {seastar_dep} {abseil_dep} {gen_headers_dep} {pch_dep}\n')
f.write('build $builddir/{mode}/seastar/apps/iotune/iotune: ninja $builddir/{mode}/seastar/build.ninja | $builddir/{mode}/seastar/libseastar.{seastar_lib_ext}\n'
.format(**locals()))
f.write(f'build $builddir/{mode}/seastar/apps/iotune/iotune: ninja $builddir/{mode}/seastar/build.ninja | $builddir/{mode}/seastar/libseastar.{seastar_lib_ext}\n')
f.write(' pool = submodule_pool\n')
f.write(' subdir = $builddir/{mode}/seastar\n'.format(**locals()))
f.write(' target = iotune\n'.format(**locals()))
f.write(' profile_dep = {profile_dep}\n'.format(**locals()))
f.write(textwrap.dedent('''\
f.write(f' subdir = $builddir/{mode}/seastar\n')
f.write(' target = iotune\n')
f.write(f' profile_dep = {profile_dep}\n')
f.write(textwrap.dedent(f'''\
build $builddir/{mode}/iotune: copy $builddir/{mode}/seastar/apps/iotune/iotune
build $builddir/{mode}/iotune.stripped: strip $builddir/{mode}/iotune
build $builddir/{mode}/iotune.debug: phony $builddir/{mode}/iotune.stripped
''').format(**locals()))
'''))
if args.dist_only:
include_scylla_and_iotune = ''
include_scylla_and_iotune_stripped = ''
@@ -2845,16 +2843,16 @@ def write_build_file(f,
include_scylla_and_iotune = f'$builddir/{mode}/scylla $builddir/{mode}/iotune $builddir/{mode}/patchelf'
include_scylla_and_iotune_stripped = f'$builddir/{mode}/scylla.stripped $builddir/{mode}/iotune.stripped $builddir/{mode}/patchelf.stripped'
include_scylla_and_iotune_debug = f'$builddir/{mode}/scylla.debug $builddir/{mode}/iotune.debug'
f.write('build $builddir/{mode}/dist/tar/{scylla_product}-unstripped-{scylla_version}-{scylla_release}.{arch}.tar.gz: package {include_scylla_and_iotune} $builddir/SCYLLA-RELEASE-FILE $builddir/SCYLLA-VERSION-FILE $builddir/debian/debian $builddir/node_exporter/node_exporter | always\n'.format(**locals()))
f.write(' mode = {mode}\n'.format(**locals()))
f.write('build $builddir/{mode}/dist/tar/{scylla_product}-{scylla_version}-{scylla_release}.{arch}.tar.gz: stripped_package {include_scylla_and_iotune_stripped} $builddir/SCYLLA-RELEASE-FILE $builddir/SCYLLA-VERSION-FILE $builddir/debian/debian $builddir/node_exporter/node_exporter.stripped | always\n'.format(**locals()))
f.write(' mode = {mode}\n'.format(**locals()))
f.write('build $builddir/{mode}/dist/tar/{scylla_product}-debuginfo-{scylla_version}-{scylla_release}.{arch}.tar.gz: debuginfo_package {include_scylla_and_iotune_debug} $builddir/SCYLLA-RELEASE-FILE $builddir/SCYLLA-VERSION-FILE $builddir/debian/debian $builddir/node_exporter/node_exporter.debug | always\n'.format(**locals()))
f.write(' mode = {mode}\n'.format(**locals()))
f.write('build $builddir/{mode}/dist/tar/{scylla_product}-package.tar.gz: copy $builddir/{mode}/dist/tar/{scylla_product}-{scylla_version}-{scylla_release}.{arch}.tar.gz\n'.format(**locals()))
f.write(' mode = {mode}\n'.format(**locals()))
f.write('build $builddir/{mode}/dist/tar/{scylla_product}-{arch}-package.tar.gz: copy $builddir/{mode}/dist/tar/{scylla_product}-{scylla_version}-{scylla_release}.{arch}.tar.gz\n'.format(**locals()))
f.write(' mode = {mode}\n'.format(**locals()))
f.write(f'build $builddir/{mode}/dist/tar/{scylla_product}-unstripped-{scylla_version}-{scylla_release}.{arch}.tar.gz: package {include_scylla_and_iotune} $builddir/SCYLLA-RELEASE-FILE $builddir/SCYLLA-VERSION-FILE $builddir/debian/debian $builddir/node_exporter/node_exporter | always\n')
f.write(f' mode = {mode}\n')
f.write(f'build $builddir/{mode}/dist/tar/{scylla_product}-{scylla_version}-{scylla_release}.{arch}.tar.gz: stripped_package {include_scylla_and_iotune_stripped} $builddir/SCYLLA-RELEASE-FILE $builddir/SCYLLA-VERSION-FILE $builddir/debian/debian $builddir/node_exporter/node_exporter.stripped | always\n')
f.write(f' mode = {mode}\n')
f.write(f'build $builddir/{mode}/dist/tar/{scylla_product}-debuginfo-{scylla_version}-{scylla_release}.{arch}.tar.gz: debuginfo_package {include_scylla_and_iotune_debug} $builddir/SCYLLA-RELEASE-FILE $builddir/SCYLLA-VERSION-FILE $builddir/debian/debian $builddir/node_exporter/node_exporter.debug | always\n')
f.write(f' mode = {mode}\n')
f.write(f'build $builddir/{mode}/dist/tar/{scylla_product}-package.tar.gz: copy $builddir/{mode}/dist/tar/{scylla_product}-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
f.write(f' mode = {mode}\n')
f.write(f'build $builddir/{mode}/dist/tar/{scylla_product}-{arch}-package.tar.gz: copy $builddir/{mode}/dist/tar/{scylla_product}-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
f.write(f' mode = {mode}\n')
f.write(f'build $builddir/dist/{mode}/redhat: rpmbuild $builddir/{mode}/dist/tar/{scylla_product}-unstripped-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
f.write(f' mode = {mode}\n')

View File

@@ -105,6 +105,7 @@ target_sources(cql3
statements/list_service_level_attachments_statement.cc
statements/list_effective_service_level_statement.cc
statements/describe_statement.cc
statements/view_prop_defs.cc
update_parameters.cc
util.cc
ut_name.cc

View File

@@ -898,6 +898,10 @@ pkDef[cql3::statements::create_table_statement::raw_statement& expr]
| '(' k1=ident { l.push_back(k1); } ( ',' kn=ident { l.push_back(kn); } )* ')' { $expr.add_key_aliases(l); }
;
cfamProperties[cql3::statements::cf_properties& expr]
: cfamProperty[expr] (K_AND cfamProperty[expr])*
;
cfamProperty[cql3::statements::cf_properties& expr]
: property[*$expr.properties()]
| K_COMPACT K_STORAGE { $expr.set_compact_storage(); }
@@ -935,16 +939,22 @@ typeColumns[create_type_statement& expr]
*/
createIndexStatement returns [std::unique_ptr<create_index_statement> expr]
@init {
auto props = make_shared<index_prop_defs>();
auto idx_props = make_shared<index_specific_prop_defs>();
auto props = index_prop_defs();
bool if_not_exists = false;
auto name = ::make_shared<cql3::index_name>();
std::vector<::shared_ptr<index_target::raw>> targets;
}
: K_CREATE (K_CUSTOM { props->is_custom = true; })? K_INDEX (K_IF K_NOT K_EXISTS { if_not_exists = true; } )?
: K_CREATE (K_CUSTOM { idx_props->is_custom = true; })? K_INDEX (K_IF K_NOT K_EXISTS { if_not_exists = true; } )?
(idxName[*name])? K_ON cf=columnFamilyName '(' (target1=indexIdent { targets.emplace_back(target1); } (',' target2=indexIdent { targets.emplace_back(target2); } )*)? ')'
(K_USING cls=STRING_LITERAL { props->custom_class = sstring{$cls.text}; })?
(K_WITH properties[*props])?
{ $expr = std::make_unique<create_index_statement>(cf, name, targets, props, if_not_exists); }
(K_USING cls=STRING_LITERAL { idx_props->custom_class = sstring{$cls.text}; })?
(K_WITH cfamProperties[props])?
{
props.extract_index_specific_properties_to(*idx_props);
view_prop_defs view_props = std::move(props).into_view_prop_defs();
$expr = std::make_unique<create_index_statement>(cf, name, targets, std::move(idx_props), std::move(view_props), if_not_exists);
}
;
indexIdent returns [::shared_ptr<index_target::raw> id]
@@ -1092,9 +1102,9 @@ alterTypeStatement returns [std::unique_ptr<alter_type_statement> expr]
*/
alterViewStatement returns [std::unique_ptr<alter_view_statement> expr]
@init {
auto props = cql3::statements::cf_prop_defs();
auto props = cql3::statements::view_prop_defs();
}
: K_ALTER K_MATERIALIZED K_VIEW cf=columnFamilyName K_WITH properties[props]
: K_ALTER K_MATERIALIZED K_VIEW cf=columnFamilyName K_WITH properties[*props.properties()]
{
$expr = std::make_unique<alter_view_statement>(std::move(cf), std::move(props));
}

View File

@@ -14,6 +14,7 @@
#include <seastar/core/shared_ptr.hh>
#include <seastar/coroutine/parallel_for_each.hh>
#include <seastar/coroutine/as_future.hh>
#include <seastar/coroutine/try_future.hh>
#include "service/storage_proxy.hh"
#include "service/migration_manager.hh"
@@ -993,7 +994,7 @@ query_processor::execute_with_params(
auto opts = make_internal_options(p, values, cl);
auto statement = p->statement;
auto msg = co_await execute_maybe_with_guard(query_state, std::move(statement), opts, &query_processor::do_execute_with_params);
auto msg = co_await coroutine::try_future(execute_maybe_with_guard(query_state, std::move(statement), opts, &query_processor::do_execute_with_params));
co_return ::make_shared<untyped_result_set>(msg);
}
@@ -1003,7 +1004,7 @@ query_processor::do_execute_with_params(
shared_ptr<cql_statement> statement,
const query_options& options, std::optional<service::group0_guard> guard) {
statement->validate(*this, service::client_state::for_internal_calls());
co_return co_await statement->execute(*this, query_state, options, std::move(guard));
co_return co_await coroutine::try_future(statement->execute(*this, query_state, options, std::move(guard)));
}

View File

@@ -19,7 +19,7 @@
#include "locator/abstract_replication_strategy.hh"
#include "mutation/canonical_mutation.hh"
#include "prepared_statement.hh"
#include "seastar/coroutine/exception.hh"
#include <seastar/coroutine/exception.hh>
#include "service/migration_manager.hh"
#include "service/storage_proxy.hh"
#include "service/topology_mutation.hh"
@@ -206,8 +206,9 @@ cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_proce
locator::replication_strategy_params(ks_md_update->strategy_options(), ks_md_update->initial_tablets(), ks_md_update->consistency_option()),
topo);
// If `rf_rack_valid_keyspaces` is enabled, it's forbidden to perform a schema change that
// would lead to an RF-rack-valid keyspace. Verify that this change does not.
// If RF-rack-validity must be enforced for the keyspace according to `enforce_rf_rack_validity_for_keyspace`,
// it's forbidden to perform a schema change that would lead to an RF-rack-invalid keyspace.
// Verify that this change does not.
// For more context, see: scylladb/scylladb#23071.
try {
// There are two things to note here:
@@ -225,13 +226,13 @@ cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_proce
// disturb it (see scylladb/scylladb#23345), but we ignore that.
locator::assert_rf_rack_valid_keyspace(_name, tmptr, *rs);
} catch (const std::exception& e) {
if (qp.db().get_config().rf_rack_valid_keyspaces()) {
if (replica::database::enforce_rf_rack_validity_for_keyspace(qp.db().get_config(), *ks_md)) {
// There's no guarantee what the type of the exception will be, so we need to
// wrap it manually here in a type that can be passed to the user.
throw exceptions::invalid_request_exception(e.what());
} else {
// Even when the configuration option `rf_rack_valid_keyspaces` is set to false,
// we'd like to inform the user that the keyspace they're altering will not
// Even when RF-rack-validity is not enforced for the keyspace, we'd
// like to inform the user that the keyspace they're altering will not
// satisfy the restriction after the change--but just as a warning.
// For more context, see issue: scylladb/scylladb#23330.
warnings.push_back(seastar::format(

View File

@@ -11,6 +11,7 @@
#include <seastar/core/coroutine.hh>
#include "cql3/statements/alter_view_statement.hh"
#include "cql3/statements/prepared_statement.hh"
#include "cql3/statements/view_prop_defs.hh"
#include "service/migration_manager.hh"
#include "service/storage_proxy.hh"
#include "validation.hh"
@@ -22,7 +23,7 @@ namespace cql3 {
namespace statements {
alter_view_statement::alter_view_statement(cf_name view_name, std::optional<cf_prop_defs> properties)
alter_view_statement::alter_view_statement(cf_name view_name, std::optional<view_prop_defs> properties)
: schema_altering_statement{std::move(view_name)}
, _properties{std::move(properties)}
{
@@ -52,8 +53,8 @@ view_ptr alter_view_statement::prepare_view(data_dictionary::database db) const
throw exceptions::invalid_request_exception("ALTER MATERIALIZED VIEW WITH invoked, but no parameters found");
}
auto schema_extensions = _properties->make_schema_extensions(db.extensions());
_properties->validate(db, keyspace(), schema_extensions);
auto schema_extensions = _properties->properties()->make_schema_extensions(db.extensions());
_properties->validate_raw(view_prop_defs::op_type::alter, db, keyspace(), schema_extensions);
bool is_colocated = [&] {
if (!db.find_keyspace(keyspace()).get_replication_strategy().uses_tablets()) {
@@ -70,28 +71,15 @@ view_ptr alter_view_statement::prepare_view(data_dictionary::database db) const
}();
if (is_colocated) {
auto gc_opts = _properties->get_tombstone_gc_options(schema_extensions);
auto gc_opts = _properties->properties()->get_tombstone_gc_options(schema_extensions);
if (gc_opts && gc_opts->mode() == tombstone_gc_mode::repair) {
throw exceptions::invalid_request_exception("The 'repair' mode for tombstone_gc is not allowed on co-located materialized view tables.");
}
}
auto builder = schema_builder(schema);
_properties->apply_to_builder(builder, std::move(schema_extensions), db, keyspace(), !is_colocated);
if (builder.get_gc_grace_seconds() == 0) {
throw exceptions::invalid_request_exception(
"Cannot alter gc_grace_seconds of a materialized view to 0, since this "
"value is used to TTL undelivered updates. Setting gc_grace_seconds too "
"low might cause undelivered updates to expire before being replayed.");
}
if (builder.default_time_to_live().count() > 0) {
throw exceptions::invalid_request_exception(
"Cannot set or alter default_time_to_live for a materialized view. "
"Data in a materialized view always expire at the same time than "
"the corresponding data in the parent table.");
}
_properties->apply_to_builder(view_prop_defs::op_type::alter, builder, std::move(schema_extensions),
db, keyspace(), is_colocated);
return view_ptr(builder.build());
}

View File

@@ -12,8 +12,8 @@
#include <seastar/core/shared_ptr.hh>
#include "cql3/statements/view_prop_defs.hh"
#include "data_dictionary/data_dictionary.hh"
#include "cql3/statements/cf_prop_defs.hh"
#include "cql3/statements/schema_altering_statement.hh"
namespace cql3 {
@@ -26,10 +26,10 @@ namespace statements {
/** An <code>ALTER MATERIALIZED VIEW</code> parsed from a CQL query statement. */
class alter_view_statement : public schema_altering_statement {
private:
std::optional<cf_prop_defs> _properties;
std::optional<view_prop_defs> _properties;
view_ptr prepare_view(data_dictionary::database db) const;
public:
alter_view_statement(cf_name view_name, std::optional<cf_prop_defs> properties);
alter_view_statement(cf_name view_name, std::optional<view_prop_defs> properties);
virtual future<> check_access(query_processor& qp, const service::client_state& state) const override;

View File

@@ -19,7 +19,8 @@ namespace statements {
/**
* Class for common statement properties.
*/
class cf_properties final {
class cf_properties {
protected:
const ::shared_ptr<cf_prop_defs> _properties = ::make_shared<cf_prop_defs>();
bool _use_compact_storage = false;
std::vector<std::pair<::shared_ptr<column_identifier>, bool>> _defined_ordering; // Insertion ordering is important

View File

@@ -14,6 +14,7 @@
#include "db/view/view.hh"
#include "exceptions/exceptions.hh"
#include "index/vector_index.hh"
#include "locator/token_metadata_fwd.hh"
#include "prepared_statement.hh"
#include "replica/database.hh"
#include "types/types.hh"
@@ -218,18 +219,24 @@ view_ptr create_index_statement::create_view_for_index(const schema_ptr schema,
std::map<sstring, sstring> tags_map = {{db::SYNCHRONOUS_VIEW_UPDATES_TAG_KEY, "true"}};
builder.add_extension(db::tags_extension::NAME, ::make_shared<db::tags_extension>(tags_map));
}
const schema::extensions_map exts = _view_properties.properties()->make_schema_extensions(db.extensions());
_view_properties.apply_to_builder(view_prop_defs::op_type::create, builder, exts, db, keyspace(), is_colocated);
return view_ptr{builder.build()};
}
create_index_statement::create_index_statement(cf_name name,
::shared_ptr<index_name> index_name,
std::vector<::shared_ptr<index_target::raw>> raw_targets,
::shared_ptr<index_prop_defs> properties,
::shared_ptr<index_specific_prop_defs> idx_properties,
view_prop_defs view_properties,
bool if_not_exists)
: schema_altering_statement(name)
, _index_name(index_name->get_idx())
, _raw_targets(raw_targets)
, _properties(properties)
, _idx_properties(std::move(idx_properties))
, _view_properties(std::move(view_properties))
, _if_not_exists(if_not_exists)
{
}
@@ -252,14 +259,53 @@ static sstring target_type_name(index_target::target_type type) {
void
create_index_statement::validate(query_processor& qp, const service::client_state& state) const
{
if (_raw_targets.empty() && !_properties->is_custom) {
if (_raw_targets.empty() && !_idx_properties->is_custom) {
throw exceptions::invalid_request_exception("Only CUSTOM indexes can be created without specifying a target column");
}
_properties->validate();
_idx_properties->validate();
// FIXME: This is ugly and can be improved.
const bool is_vector_index = _idx_properties->custom_class && *_idx_properties->custom_class == "vector_index";
const bool uses_view_properties = _view_properties.properties()->count() > 0
|| _view_properties.use_compact_storage()
|| _view_properties.defined_ordering().size() > 0;
if (is_vector_index && uses_view_properties) {
throw exceptions::invalid_request_exception("You cannot use view properties with a vector index");
}
const schema::extensions_map exts = _view_properties.properties()->make_schema_extensions(qp.db().extensions());
_view_properties.validate_raw(view_prop_defs::op_type::create, qp.db(), keyspace(), exts);
// These keywords are still accepted by other schema entities, but they don't have effect on them.
// Since indexes are not bound by any backward compatibility contract in this regard, let's forbid these.
static sstring obsolete_keywords[] = {
"index_interval",
"replicate_on_write",
"populate_io_cache_on_flush",
"read_repair_chance",
"dclocal_read_repair_chance",
};
for (const sstring& keyword : obsolete_keywords) {
if (_view_properties.properties()->has_property(keyword)) {
// We use the same type of exception and the same error message as would be thrown for
// an invalid property via `_view_properties.validate_raw`.
throw exceptions::syntax_exception(seastar::format("Unknown property '{}'", keyword));
}
}
// FIXME: This is a temporary limitation as it might deserve more attention.
if (!_view_properties.defined_ordering().empty()) {
throw exceptions::invalid_request_exception("Indexes do not allow for specifying the clustering order");
}
}
std::vector<::shared_ptr<index_target>> create_index_statement::validate_while_executing(data_dictionary::database db) const {
std::pair<std::vector<::shared_ptr<index_target>>, cql3::cql_warnings_vec>
create_index_statement::validate_while_executing(data_dictionary::database db, locator::token_metadata_ptr tmptr) const {
cql3::cql_warnings_vec warnings;
auto schema = validation::validate_column_family(db, keyspace(), column_family());
if (schema->is_counter()) {
@@ -281,13 +327,22 @@ std::vector<::shared_ptr<index_target>> create_index_statement::validate_while_e
// Regular secondary indexes require rf-rack-validity.
// Custom indexes need to validate this property themselves, if they need it.
if (!_properties || !_properties->custom_class) {
if (!_idx_properties || !_idx_properties->custom_class) {
try {
db::view::validate_view_keyspace(db, keyspace());
db::view::validate_view_keyspace(db, keyspace(), tmptr);
} catch (const std::exception& e) {
// The type of the thrown exception is not specified, so we need to wrap it here.
throw exceptions::invalid_request_exception(e.what());
}
if (db.find_keyspace(keyspace()).uses_tablets()) {
warnings.emplace_back(
"Creating an index in a keyspace that uses tablets requires "
"the keyspace to remain RF-rack-valid while the index exists. "
"Some operations will be restricted to enforce this: altering the keyspace's replication "
"factor, adding a node in a new rack, and removing or decommissioning a node that would "
"eliminate a rack.");
}
}
validate_for_local_index(*schema);
@@ -297,14 +352,14 @@ std::vector<::shared_ptr<index_target>> create_index_statement::validate_while_e
targets.emplace_back(raw_target->prepare(*schema));
}
if (_properties && _properties->custom_class) {
auto custom_index_factory = secondary_index::secondary_index_manager::get_custom_class_factory(*_properties->custom_class);
if (_idx_properties && _idx_properties->custom_class) {
auto custom_index_factory = secondary_index::secondary_index_manager::get_custom_class_factory(*_idx_properties->custom_class);
if (!custom_index_factory) {
throw exceptions::invalid_request_exception(format("Non-supported custom class \'{}\' provided", *(_properties->custom_class)));
throw exceptions::invalid_request_exception(format("Non-supported custom class \'{}\' provided", *_idx_properties->custom_class));
}
auto custom_index = (*custom_index_factory)();
custom_index->validate(*schema, *_properties, targets, db.features(), db);
_properties->index_version = custom_index->index_version(*schema);
custom_index->validate(*schema, *_idx_properties, targets, db.features(), db);
_idx_properties->index_version = custom_index->index_version(*schema);
}
if (targets.size() > 1) {
@@ -384,7 +439,7 @@ std::vector<::shared_ptr<index_target>> create_index_statement::validate_while_e
}
}
return targets;
return std::make_pair(std::move(targets), std::move(warnings));
}
void create_index_statement::validate_for_local_index(const schema& schema) const {
@@ -523,7 +578,7 @@ void create_index_statement::validate_target_column_is_map_if_index_involves_key
void create_index_statement::validate_targets_for_multi_column_index(std::vector<::shared_ptr<index_target>> targets) const
{
if (!_properties->is_custom) {
if (!_idx_properties->is_custom) {
if (targets.size() > 2 || (targets.size() == 2 && std::holds_alternative<index_target::single_column>(targets.front()->value))) {
throw exceptions::invalid_request_exception("Only CUSTOM indexes support multiple columns");
}
@@ -537,8 +592,9 @@ void create_index_statement::validate_targets_for_multi_column_index(std::vector
}
}
std::optional<create_index_statement::base_schema_with_new_index> create_index_statement::build_index_schema(data_dictionary::database db) const {
auto targets = validate_while_executing(db);
std::pair<std::optional<create_index_statement::base_schema_with_new_index>, cql3::cql_warnings_vec>
create_index_statement::build_index_schema(data_dictionary::database db, locator::token_metadata_ptr tmptr) const {
auto [targets, warnings] = validate_while_executing(db, tmptr);
auto schema = db.find_schema(keyspace(), column_family());
@@ -554,8 +610,8 @@ std::optional<create_index_statement::base_schema_with_new_index> create_index_s
}
index_metadata_kind kind;
index_options_map index_options;
if (_properties->custom_class) {
index_options = _properties->get_options();
if (_idx_properties->custom_class) {
index_options = _idx_properties->get_options();
kind = index_metadata_kind::custom;
} else {
kind = schema->is_compound() ? index_metadata_kind::composites : index_metadata_kind::keys;
@@ -564,17 +620,17 @@ std::optional<create_index_statement::base_schema_with_new_index> create_index_s
auto existing_index = schema->find_index_noname(index);
if (existing_index) {
if (_if_not_exists) {
return {};
return std::make_pair(std::nullopt, std::move(warnings));
} else {
throw exceptions::invalid_request_exception(
format("Index {} is a duplicate of existing index {}", index.name(), existing_index.value().name()));
}
}
bool existing_vector_index = _properties->custom_class && _properties->custom_class == "vector_index" && secondary_index::vector_index::has_vector_index_on_column(*schema, targets[0]->column_name());
bool custom_index_with_same_name = _properties->custom_class && db.existing_index_names(keyspace()).contains(_index_name);
bool existing_vector_index = _idx_properties->custom_class && _idx_properties->custom_class == "vector_index" && secondary_index::vector_index::has_vector_index_on_column(*schema, targets[0]->column_name());
bool custom_index_with_same_name = _idx_properties->custom_class && db.existing_index_names(keyspace()).contains(_index_name);
if (existing_vector_index || custom_index_with_same_name) {
if (_if_not_exists) {
return {};
return std::make_pair(std::nullopt, std::move(warnings));
} else {
throw exceptions::invalid_request_exception("There exists a duplicate custom index");
}
@@ -590,13 +646,13 @@ std::optional<create_index_statement::base_schema_with_new_index> create_index_s
schema_builder builder{schema};
builder.with_index(index);
return base_schema_with_new_index{builder.build(), index};
return std::make_pair(base_schema_with_new_index{builder.build(), index}, std::move(warnings));
}
future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, utils::chunked_vector<mutation>, cql3::cql_warnings_vec>>
create_index_statement::prepare_schema_mutations(query_processor& qp, const query_options&, api::timestamp_type ts) const {
using namespace cql_transport;
auto res = build_index_schema(qp.db());
auto [res, warnings] = build_index_schema(qp.db(), qp.proxy().get_token_metadata_ptr());
::shared_ptr<event::schema_change> ret;
utils::chunked_vector<mutation> muts;
@@ -626,7 +682,7 @@ create_index_statement::prepare_schema_mutations(query_processor& qp, const quer
column_family());
}
co_return std::make_tuple(std::move(ret), std::move(muts), std::vector<sstring>());
co_return std::make_tuple(std::move(ret), std::move(muts), std::move(warnings));
}
std::unique_ptr<cql3::statements::prepared_statement>

View File

@@ -10,6 +10,8 @@
#pragma once
#include "cql3/statements/index_prop_defs.hh"
#include "cql3/statements/view_prop_defs.hh"
#include "schema_altering_statement.hh"
#include "index_target.hh"
@@ -27,20 +29,25 @@ class index_name;
namespace statements {
class index_prop_defs;
class index_specific_prop_defs;
/** A <code>CREATE INDEX</code> statement parsed from a CQL query. */
class create_index_statement : public schema_altering_statement {
const sstring _index_name;
const std::vector<::shared_ptr<index_target::raw>> _raw_targets;
const ::shared_ptr<index_prop_defs> _properties;
// Options specific to this index.
const ::shared_ptr<index_specific_prop_defs> _idx_properties;
// Options corresponding to the underlying materialized view.
const view_prop_defs _view_properties;
const bool _if_not_exists;
cql_stats* _cql_stats = nullptr;
public:
create_index_statement(cf_name name, ::shared_ptr<index_name> index_name,
std::vector<::shared_ptr<index_target::raw>> raw_targets,
::shared_ptr<index_prop_defs> properties, bool if_not_exists);
::shared_ptr<index_specific_prop_defs> idx_properties, view_prop_defs view_properties, bool if_not_exists);
future<> check_access(query_processor& qp, const service::client_state& state) const override;
void validate(query_processor&, const service::client_state& state) const override;
@@ -53,7 +60,7 @@ public:
schema_ptr schema;
index_metadata index;
};
std::optional<base_schema_with_new_index> build_index_schema(data_dictionary::database db) const;
std::pair<std::optional<base_schema_with_new_index>, cql3::cql_warnings_vec> build_index_schema(data_dictionary::database db, locator::token_metadata_ptr tmptr) const;
view_ptr create_view_for_index(const schema_ptr, const index_metadata& im, const data_dictionary::database&) const;
private:
void validate_for_local_index(const schema& schema) const;
@@ -69,7 +76,7 @@ private:
const sstring& name,
index_metadata_kind kind,
const index_options_map& options);
std::vector<::shared_ptr<index_target>> validate_while_executing(data_dictionary::database db) const;
std::pair<std::vector<::shared_ptr<index_target>>, cql3::cql_warnings_vec> validate_while_executing(data_dictionary::database db, locator::token_metadata_ptr tmptr) const;
};
}

View File

@@ -116,21 +116,21 @@ future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, utils::chun
warnings.push_back("Keyspace `initial` tablets option is deprecated. Use per-table tablet options instead.");
}
// If `rf_rack_valid_keyspaces` is enabled, it's forbidden to create an RF-rack-invalid keyspace.
// Verify that it's RF-rack-valid.
// If RF-rack-validity must be enforced for the keyspace according to `enforce_rf_rack_validity_for_keyspace`,
// it's forbidden to create an RF-rack-invalid keyspace. Verify that it's RF-rack-valid.
// For more context, see: scylladb/scylladb#23071.
try {
// We hold a group0_guard, so it's correct to check this here.
// The topology or schema cannot change while we're performing this query.
locator::assert_rf_rack_valid_keyspace(_name, tmptr, *rs);
} catch (const std::exception& e) {
if (cfg.rf_rack_valid_keyspaces()) {
if (replica::database::enforce_rf_rack_validity_for_keyspace(cfg, *ksm)) {
// There's no guarantee what the type of the exception will be, so we need to
// wrap it manually here in a type that can be passed to the user.
throw exceptions::invalid_request_exception(e.what());
} else {
// Even when the configuration option `rf_rack_valid_keyspaces` is set to false,
// we'd like to inform the user that the keyspace they're creating does not
// Even when RF-rack-validity is not enforced for the keyspace, we'd
// like to inform the user that the keyspace they're creating does not
// satisfy the restriction--but just as a warning.
// For more context, see issue: scylladb/scylladb#23330.
warnings.push_back(seastar::format(

View File

@@ -8,6 +8,7 @@
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#include "cql3/statements/view_prop_defs.hh"
#include "exceptions/exceptions.hh"
#include "utils/assert.hh"
#include <unordered_set>
@@ -105,7 +106,7 @@ static bool validate_primary_key(
return new_non_pk_column;
}
std::pair<view_ptr, cql3::cql_warnings_vec> create_view_statement::prepare_view(data_dictionary::database db) const {
std::pair<view_ptr, cql3::cql_warnings_vec> create_view_statement::prepare_view(data_dictionary::database db, locator::token_metadata_ptr tmptr) const {
// We need to make sure that:
// - materialized view name is valid
// - primary key includes all columns in base table's primary key
@@ -119,15 +120,7 @@ std::pair<view_ptr, cql3::cql_warnings_vec> create_view_statement::prepare_view(
cql3::cql_warnings_vec warnings;
auto schema_extensions = _properties.properties()->make_schema_extensions(db.extensions());
_properties.validate(db, keyspace(), schema_extensions);
if (_properties.use_compact_storage()) {
throw exceptions::invalid_request_exception(format("Cannot use 'COMPACT STORAGE' when defining a materialized view"));
}
if (_properties.properties()->get_cdc_options(schema_extensions)) {
throw exceptions::invalid_request_exception("Cannot enable CDC for a materialized view");
}
_properties.validate_raw(view_prop_defs::op_type::create, db, keyspace(), schema_extensions);
// View and base tables must be in the same keyspace, to ensure that RF
// is the same (because we assign a view replica to each base replica).
@@ -153,12 +146,21 @@ std::pair<view_ptr, cql3::cql_warnings_vec> create_view_statement::prepare_view(
schema_ptr schema = validation::validate_column_family(db, _base_name.get_keyspace(), _base_name.get_column_family());
try {
db::view::validate_view_keyspace(db, keyspace());
db::view::validate_view_keyspace(db, keyspace(), tmptr);
} catch (const std::exception& e) {
// The type of the thrown exception is not specified, so we need to wrap it here.
throw exceptions::invalid_request_exception(e.what());
}
if (db.find_keyspace(keyspace()).uses_tablets()) {
warnings.emplace_back(
"Creating a materialized view in a keyspaces that uses tablets requires "
"the keyspace to remain RF-rack-valid while the materialized view exists. "
"Some operations will be restricted to enforce this: altering the keyspace's replication "
"factor, adding a node in a new rack, and removing or decommissioning a node that would "
"eliminate a rack.");
}
if (schema->is_counter()) {
throw exceptions::invalid_request_exception(format("Materialized views are not supported on counter tables"));
}
@@ -341,16 +343,7 @@ std::pair<view_ptr, cql3::cql_warnings_vec> create_view_statement::prepare_view(
warnings.emplace_back(std::move(warning_text));
}
const auto maybe_id = _properties.properties()->get_id();
if (maybe_id && db.try_find_table(*maybe_id)) {
const auto schema_ptr = db.find_schema(*maybe_id);
const auto& ks_name = schema_ptr->ks_name();
const auto& cf_name = schema_ptr->cf_name();
throw exceptions::invalid_request_exception(seastar::format("Table with ID {} already exists: {}.{}", *maybe_id, ks_name, cf_name));
}
schema_builder builder{keyspace(), column_family(), maybe_id};
schema_builder builder{keyspace(), column_family()};
auto add_columns = [this, &builder] (std::vector<const column_definition*>& defs, column_kind kind) mutable {
for (auto* def : defs) {
auto&& type = _properties.get_reversable_type(*def->column_specification->name, def->type);
@@ -396,14 +389,8 @@ std::pair<view_ptr, cql3::cql_warnings_vec> create_view_statement::prepare_view(
}
}
_properties.properties()->apply_to_builder(builder, std::move(schema_extensions), db, keyspace(), !is_colocated);
if (builder.default_time_to_live().count() > 0) {
throw exceptions::invalid_request_exception(
"Cannot set or alter default_time_to_live for a materialized view. "
"Data in a materialized view always expire at the same time than "
"the corresponding data in the parent table.");
}
_properties.apply_to_builder(view_prop_defs::op_type::create, builder, std::move(schema_extensions),
db, keyspace(), is_colocated);
auto where_clause_text = util::relations_to_where_clause(_where_clause);
builder.with_view_info(schema, included.empty(), std::move(where_clause_text));
@@ -414,7 +401,7 @@ std::pair<view_ptr, cql3::cql_warnings_vec> create_view_statement::prepare_view(
future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, utils::chunked_vector<mutation>, cql3::cql_warnings_vec>>
create_view_statement::prepare_schema_mutations(query_processor& qp, const query_options&, api::timestamp_type ts) const {
utils::chunked_vector<mutation> m;
auto [definition, warnings] = prepare_view(qp.db());
auto [definition, warnings] = prepare_view(qp.db(), qp.proxy().get_token_metadata_ptr());
try {
m = co_await service::prepare_new_view_announcement(qp.proxy(), std::move(definition), ts);
} catch (const exceptions::already_exists_exception& e) {

View File

@@ -7,9 +7,9 @@
#pragma once
#include "cql3/statements/schema_altering_statement.hh"
#include "cql3/statements/cf_properties.hh"
#include "cql3/cf_name.hh"
#include "cql3/expr/expression.hh"
#include "cql3/statements/view_prop_defs.hh"
#include <seastar/core/shared_ptr.hh>
@@ -35,7 +35,7 @@ private:
expr::expression _where_clause;
std::vector<::shared_ptr<cql3::column_identifier::raw>> _partition_keys;
std::vector<::shared_ptr<cql3::column_identifier::raw>> _clustering_keys;
cf_properties _properties;
view_prop_defs _properties;
bool _if_not_exists;
public:
@@ -48,7 +48,7 @@ public:
std::vector<::shared_ptr<cql3::column_identifier::raw>> clustering_keys,
bool if_not_exists);
std::pair<view_ptr, cql3::cql_warnings_vec> prepare_view(data_dictionary::database db) const;
std::pair<view_ptr, cql3::cql_warnings_vec> prepare_view(data_dictionary::database db, locator::token_metadata_ptr tmptr) const;
auto& properties() {
return _properties;

View File

@@ -11,6 +11,7 @@
#include <set>
#include <seastar/core/format.hh>
#include "index_prop_defs.hh"
#include "cql3/statements/view_prop_defs.hh"
#include "index/secondary_index.hh"
#include "exceptions/exceptions.hh"
@@ -21,7 +22,9 @@ static void check_system_option_specified(const index_options_map& options, cons
}
}
void cql3::statements::index_prop_defs::validate() const {
namespace cql3::statements {
void index_specific_prop_defs::validate() const {
static std::set<sstring> keywords({ sstring(KW_OPTIONS) });
property_definitions::validate(keywords);
@@ -40,13 +43,13 @@ void cql3::statements::index_prop_defs::validate() const {
}
index_options_map
cql3::statements::index_prop_defs::get_raw_options() const {
index_specific_prop_defs::get_raw_options() const {
auto options = get_map(KW_OPTIONS);
return !options ? std::unordered_map<sstring, sstring>() : std::unordered_map<sstring, sstring>(options->begin(), options->end());
}
index_options_map
cql3::statements::index_prop_defs::get_options() const {
index_specific_prop_defs::get_options() const {
auto options = get_raw_options();
options.emplace(db::index::secondary_index::custom_class_option_name, *custom_class);
if (index_version.has_value()) {
@@ -54,3 +57,25 @@ cql3::statements::index_prop_defs::get_options() const {
}
return options;
}
void index_prop_defs::extract_index_specific_properties_to(index_specific_prop_defs& target) {
if (properties()->has_property(index_specific_prop_defs::KW_OPTIONS)) {
auto value = properties()->extract_property(index_specific_prop_defs::KW_OPTIONS);
std::visit([&target] <typename T> (T&& val) {
target.add_property(index_specific_prop_defs::KW_OPTIONS, std::forward<T>(val));
}, std::move(value));
}
}
view_prop_defs index_prop_defs::into_view_prop_defs() && {
if (properties()->has_property(index_specific_prop_defs::KW_OPTIONS)) {
utils::on_internal_error(seastar::format(
"Precondition has been violated. The property '{}' is still present", index_specific_prop_defs::KW_OPTIONS));
}
view_prop_defs result = std::move(static_cast<view_prop_defs&>(*this));
return result;
}
} // namespace cql3::statements

View File

@@ -10,6 +10,7 @@
#pragma once
#include "cql3/statements/view_prop_defs.hh"
#include "property_definitions.hh"
#include <seastar/core/sstring.hh>
#include "schema/schema_fwd.hh"
@@ -23,7 +24,7 @@ namespace cql3 {
namespace statements {
class index_prop_defs : public property_definitions {
class index_specific_prop_defs : public property_definitions {
public:
static constexpr auto KW_OPTIONS = "options";
@@ -37,6 +38,19 @@ public:
index_options_map get_options() const;
};
struct index_prop_defs : public view_prop_defs {
/// Extract all of the index-specific properties to `target`.
///
/// If there's a property at an index-specific key, and if `target` already has
/// a value at that key, that value will be replaced.
void extract_index_specific_properties_to(index_specific_prop_defs& target);
/// Turns this object into an object of type `view_prop_defs`, as if moved.
///
/// Precondition: the object MUST NOT contain any index-specific property.
view_prop_defs into_view_prop_defs() &&;
};
}
}

View File

@@ -8,8 +8,8 @@
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#include "seastar/core/format.hh"
#include "seastar/core/sstring.hh"
#include <seastar/core/format.hh>
#include <seastar/core/sstring.hh>
#include "utils/assert.hh"
#include "cql3/statements/ks_prop_defs.hh"
#include "cql3/statements/request_validations.hh"

View File

@@ -11,6 +11,7 @@
#include <ranges>
#include <seastar/core/format.hh>
#include <stdexcept>
#include "cql3/statements/property_definitions.hh"
#include "exceptions/exceptions.hh"
#include "utils/overloaded_functor.hh"
@@ -102,6 +103,18 @@ bool property_definitions::has_property(const sstring& name) const {
return _properties.contains(name);
}
property_definitions::value_type property_definitions::extract_property(const sstring& name) {
auto it = _properties.find(name);
if (it == _properties.end()) {
throw std::out_of_range{std::format("No property of name '{}'", std::string_view(name))};
}
value_type result = std::move(it->second);
_properties.erase(it);
return result;
}
std::optional<property_definitions::value_type> property_definitions::get(const sstring& name) const {
if (auto it = _properties.find(name); it != _properties.end()) {
return it->second;

View File

@@ -59,6 +59,8 @@ protected:
public:
bool has_property(const sstring& name) const;
value_type extract_property(const sstring& name);
std::optional<value_type> get(const sstring& name) const;
std::optional<extended_map_type> get_extended_map(const sstring& name) const;

View File

@@ -261,7 +261,8 @@ future<> select_statement::check_access(query_processor& qp, const service::clie
auto& cf_name = s->is_view()
? s->view_info()->base_name()
: (cdc ? cdc->cf_name() : column_family());
bool is_vector_indexed = secondary_index::vector_index::has_vector_index(*_schema);
const schema_ptr& base_schema = cdc ? cdc : _schema;
bool is_vector_indexed = secondary_index::vector_index::has_vector_index(*base_schema);
co_await state.has_column_family_access(keyspace(), cf_name, auth::permission::SELECT, auth::command_desc::type::OTHER, is_vector_indexed);
} catch (const data_dictionary::no_such_column_family& e) {
// Will be validated afterwards.
@@ -1976,9 +1977,7 @@ mutation_fragments_select_statement::do_execute(query_processor& qp, service::qu
if (it == indexes.end()) {
throw exceptions::invalid_request_exception("ANN ordering by vector requires the column to be indexed using 'vector_index'");
}
if (index_opt || parameters->allow_filtering() || !(restrictions->is_empty()) || check_needs_allow_filtering_anyway(*restrictions)) {
throw exceptions::invalid_request_exception("ANN ordering by vector does not support filtering");
}
index_opt = *it;
if (!index_opt) {
@@ -2274,7 +2273,9 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
throw exceptions::invalid_request_exception("PER PARTITION LIMIT is not allowed with aggregate queries.");
}
auto restrictions = prepare_restrictions(db, schema, ctx, selection, for_view, _parameters->allow_filtering(),
bool is_ann_query = !_parameters->orderings().empty() && std::holds_alternative<select_statement::ann_vector>(_parameters->orderings().front().second);
auto restrictions = prepare_restrictions(db, schema, ctx, selection, for_view, _parameters->allow_filtering() || is_ann_query,
restrictions::check_indexes(!_parameters->is_mutation_fragments()));
if (_parameters->is_distinct()) {

View File

@@ -0,0 +1,67 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "cql3/statements/view_prop_defs.hh"
namespace cql3::statements {
void view_prop_defs::validate_raw(op_type op, const data_dictionary::database db, sstring ks_name,
const schema::extensions_map& exts) const
{
cf_properties::validate(db, std::move(ks_name), exts);
if (use_compact_storage()) {
throw exceptions::invalid_request_exception(format("Cannot use 'COMPACT STORAGE' when defining a materialized view"));
}
if (properties()->get_cdc_options(exts)) {
throw exceptions::invalid_request_exception("Cannot enable CDC for a materialized view");
}
if (op == op_type::create) {
const auto maybe_id = properties()->get_id();
if (maybe_id && db.try_find_table(*maybe_id)) {
const auto schema_ptr = db.find_schema(*maybe_id);
const auto& ks_name = schema_ptr->ks_name();
const auto& cf_name = schema_ptr->cf_name();
throw exceptions::invalid_request_exception(seastar::format("Table with ID {} already exists: {}.{}", *maybe_id, ks_name, cf_name));
}
}
}
void view_prop_defs::apply_to_builder(op_type op, schema_builder& builder, schema::extensions_map exts,
const data_dictionary::database db, sstring ks_name, bool is_colocated) const
{
_properties->apply_to_builder(builder, exts, db, std::move(ks_name), !is_colocated);
if (op == op_type::create) {
const auto maybe_id = properties()->get_id();
if (maybe_id) {
builder.set_uuid(*maybe_id);
}
}
if (op == op_type::alter) {
if (builder.get_gc_grace_seconds() == 0) {
throw exceptions::invalid_request_exception(
"Cannot alter gc_grace_seconds of a materialized view to 0, since this "
"value is used to TTL undelivered updates. Setting gc_grace_seconds too "
"low might cause undelivered updates to expire before being replayed.");
}
}
if (builder.default_time_to_live().count() > 0) {
throw exceptions::invalid_request_exception(
"Cannot set or alter default_time_to_live for a materialized view. "
"Data in a materialized view always expire at the same time than "
"the corresponding data in the parent table.");
}
}
} // namespace cql3::statements

View File

@@ -0,0 +1,62 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include "cql3/statements/cf_properties.hh"
namespace cql3::statements {
/// This type represents the possible properties of the following CQL statements:
///
/// * CREATE MATERIALIZED VIEW,
/// * ALTER MATERIALIZED VIEW.
///
/// Since the sets of the valid properties may differ between those statements, this type
/// is supposed to represent a superset of them.
///
/// This type does NOT guarantee that all of the necessary validation logic will be performed
/// by it. It strives to do that, but you should keep this in mind. What does that mean?
/// Some parts of validation may require more context that's not accessible from here.
///
/// As of yet, this type does not cover all of the validation logic that could be here either.
class view_prop_defs : public cf_properties {
public:
/// The type of a schema operation on a materialized view.
/// These values will be used to guide the validation logic.
enum class op_type {
create,
alter
};
public:
template <typename... Args>
view_prop_defs(Args&&... args) : cf_properties(std::forward<Args>(args)...) {}
// Explicitly delete this method. It's declared in the inherited types.
// The user of this interface should use `validate_raw` instead.
void validate(const data_dictionary::database, sstring ks_name, const schema::extensions_map&) const = delete;
/// Validate the properties for the specified schema operation.
///
/// The validation is *raw* because we mostly validate the properties in their string form (checking if
/// a property exists or not for instance) and only focus on the properties on their own, without
/// having access to any other information.
void validate_raw(op_type, const data_dictionary::database, sstring ks_name, const schema::extensions_map&) const;
/// Apply the properties to the provided schema_builder and validate them.
///
/// NOTE: If the validation fails, this function will throw an exception. What's more important,
/// however, is that the provided schema_builder might have already been modified by that
/// point. Because of that, in presence of an exception, the schema builder should NOT be
/// used anymore.
void apply_to_builder(op_type, schema_builder&, schema::extensions_map, const data_dictionary::database,
sstring ks_name, bool is_colocated) const;
};
} // namespace cql3::statements

View File

@@ -16,6 +16,7 @@
#include <seastar/core/semaphore.hh>
#include <seastar/core/metrics.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <seastar/core/sleep.hh>
#include <seastar/coroutine/parallel_for_each.hh>
@@ -377,7 +378,7 @@ future<db::all_batches_replayed> db::batchlog_manager::replay_all_failed_batches
for (const auto& [fm, s] : fms) {
mutations.emplace_back(fm.to_mutation(s));
co_await maybe_yield();
co_await coroutine::maybe_yield();
}
if (!mutations.empty()) {

View File

@@ -502,6 +502,9 @@ public:
void flush_segments(uint64_t size_to_remove);
void check_no_data_older_than_allowed();
// whitebox testing
std::function<future<>()> _oversized_pre_wait_memory_func;
private:
class shutdown_marker{};
@@ -1597,8 +1600,15 @@ future<> db::commitlog::segment_manager::oversized_allocation(entry_writer& writ
scope_increment_counter allocating(totals.active_allocations);
// #27992 - whitebox testing. signal we are trying to lock out
// all allocators
if (_oversized_pre_wait_memory_func) {
co_await _oversized_pre_wait_memory_func();
}
auto permit = co_await std::move(fut);
SCYLLA_ASSERT(_request_controller.available_units() == 0);
// #27992 - task reordering _can_ force the available units to negative. this is ok.
SCYLLA_ASSERT(_request_controller.available_units() <= 0);
decltype(permit) fake_permit; // can't have allocate+sync release semaphore.
bool failed = false;
@@ -1859,13 +1869,15 @@ future<> db::commitlog::segment_manager::oversized_allocation(entry_writer& writ
}
}
}
SCYLLA_ASSERT(_request_controller.available_units() == 0);
auto avail = _request_controller.available_units();
SCYLLA_ASSERT(avail <= 0);
SCYLLA_ASSERT(permit.count() == max_request_controller_units());
auto nw = _request_controller.waiters();
permit.return_all();
// #20633 cannot guarantee controller avail is now full, since we could have had waiters when doing
// return all -> now will be less avail
SCYLLA_ASSERT(nw > 0 || _request_controller.available_units() == ssize_t(max_request_controller_units()));
SCYLLA_ASSERT(nw > 0 || _request_controller.available_units() == (avail + ssize_t(max_request_controller_units())));
if (!failed) {
clogger.trace("Oversized allocation succeeded.");
@@ -3949,6 +3961,9 @@ void db::commitlog::update_max_data_lifetime(std::optional<uint64_t> commitlog_d
_segment_manager->cfg.commitlog_data_max_lifetime_in_seconds = commitlog_data_max_lifetime_in_seconds;
}
void db::commitlog::set_oversized_pre_wait_memory_func(std::function<future<>()> f) {
_segment_manager->_oversized_pre_wait_memory_func = std::move(f);
}
future<std::vector<sstring>> db::commitlog::get_segments_to_replay() const {
return _segment_manager->get_segments_to_replay();

View File

@@ -385,6 +385,9 @@ public:
// (Re-)set data mix lifetime.
void update_max_data_lifetime(std::optional<uint64_t> commitlog_data_max_lifetime_in_seconds);
// Whitebox testing. Do not use for production
void set_oversized_pre_wait_memory_func(std::function<future<>()>);
using commit_load_reader_func = std::function<future<>(buffer_and_replay_position)>;
class segment_error : public std::exception {};

View File

@@ -1447,6 +1447,10 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"SELECT statements with aggregation or GROUP BYs or a secondary index may use this page size for their internal reading data, not the page size specified in the query options.")
, alternator_port(this, "alternator_port", value_status::Used, 0, "Alternator API port.")
, alternator_https_port(this, "alternator_https_port", value_status::Used, 0, "Alternator API HTTPS port.")
, alternator_port_proxy_protocol(this, "alternator_port_proxy_protocol", value_status::Used, 0,
"Port on which the Alternator API listens for clients using proxy protocol v2. Disabled (0) by default.")
, alternator_https_port_proxy_protocol(this, "alternator_https_port_proxy_protocol", value_status::Used, 0,
"Port on which the Alternator HTTPS API listens for clients using proxy protocol v2. Disabled (0) by default.")
, alternator_address(this, "alternator_address", value_status::Used, "0.0.0.0", "Alternator API listening address.")
, alternator_enforce_authorization(this, "alternator_enforce_authorization", liveness::LiveUpdate, value_status::Used, false, "Enforce checking the authorization header for every request in Alternator.")
, alternator_warn_authorization(this, "alternator_warn_authorization", liveness::LiveUpdate, value_status::Used, false, "Count and log warnings about failed authentication or authorization")
@@ -1566,6 +1570,8 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"\tdisabled: New keyspaces use vnodes by default, unless enabled by the tablets={'enabled':true} option\n"
"\tenabled: New keyspaces use tablets by default, unless disabled by the tablets={'enabled':false} option\n"
"\tenforced: New keyspaces must use tablets. Tablets cannot be disabled using the CREATE KEYSPACE option")
, auto_repair_enabled_default(this, "auto_repair_enabled_default", liveness::LiveUpdate, value_status::Used, false, "Set true to enable auto repair for tablet tables by default. The value will be overridden by the per keyspace or per table configuration which is not implemented yet.")
, auto_repair_threshold_default_in_seconds(this, "auto_repair_threshold_default_in_seconds", liveness::LiveUpdate, value_status::Used, 24 * 3600 , "Set the default time in seconds for the auto repair threshold for tablet tables. If the time since last repair is bigger than the configured time, the tablet is eligible for auto repair. The value will be overridden by the per keyspace or per table configuration which is not implemented yet.")
, view_flow_control_delay_limit_in_ms(this, "view_flow_control_delay_limit_in_ms", liveness::LiveUpdate, value_status::Used, 1000,
"The maximal amount of time that materialized-view update flow control may delay responses "
"to try to slow down the client and prevent buildup of unfinished view updates. "

View File

@@ -464,6 +464,8 @@ public:
named_value<uint16_t> alternator_port;
named_value<uint16_t> alternator_https_port;
named_value<uint16_t> alternator_port_proxy_protocol;
named_value<uint16_t> alternator_https_port_proxy_protocol;
named_value<sstring> alternator_address;
named_value<bool> alternator_enforce_authorization;
named_value<bool> alternator_warn_authorization;
@@ -569,6 +571,8 @@ public:
named_value<double> topology_barrier_stall_detector_threshold_seconds;
named_value<bool> enable_tablets;
named_value<enum_option<tablets_mode_t>> tablets_mode_for_new_keyspaces;
named_value<bool> auto_repair_enabled_default;
named_value<int32_t> auto_repair_threshold_default_in_seconds;
bool enable_tablets_by_default() const noexcept {
switch (tablets_mode_for_new_keyspaces()) {

View File

@@ -31,19 +31,23 @@ size_t quorum_for(const locator::effective_replication_map& erm) {
return replication_factor ? (replication_factor / 2) + 1 : 0;
}
size_t local_quorum_for(const locator::effective_replication_map& erm, const sstring& dc) {
static size_t get_replication_factor_for_dc(const locator::effective_replication_map& erm, const sstring& dc) {
using namespace locator;
const auto& rs = erm.get_replication_strategy();
if (rs.get_type() == replication_strategy_type::network_topology) {
const network_topology_strategy* nrs =
const network_topology_strategy* nts =
static_cast<const network_topology_strategy*>(&rs);
size_t replication_factor = nrs->get_replication_factor(dc);
return replication_factor ? (replication_factor / 2) + 1 : 0;
return nts->get_replication_factor(dc);
}
return quorum_for(erm);
return erm.get_replication_factor();
}
size_t local_quorum_for(const locator::effective_replication_map& erm, const sstring& dc) {
auto rf = get_replication_factor_for_dc(erm, dc);
return rf ? (rf / 2) + 1 : 0;
}
size_t block_for_local_serial(const locator::effective_replication_map& erm) {
@@ -188,18 +192,30 @@ void assure_sufficient_live_nodes(
return pending <= live ? live - pending : 0;
};
auto make_rf_zero_error_msg = [cl] (const sstring& local_dc) {
return format("Cannot achieve consistency level {} in datacenter '{}' with replication factor 0. "
"Ensure the keyspace is replicated to this datacenter or use a non-local consistency level.", cl, local_dc);
};
const auto& topo = erm.get_topology();
const sstring& local_dc = topo.get_datacenter();
switch (cl) {
case consistency_level::ANY:
// local hint is acceptable, and local node is always live
break;
case consistency_level::LOCAL_ONE:
if (size_t local_rf = get_replication_factor_for_dc(erm, local_dc); local_rf == 0) {
throw exceptions::unavailable_exception(make_rf_zero_error_msg(local_dc), cl, 1, 0);
}
if (topo.count_local_endpoints(live_endpoints) < topo.count_local_endpoints(pending_endpoints) + 1) {
throw exceptions::unavailable_exception(cl, 1, 0);
}
break;
case consistency_level::LOCAL_QUORUM: {
if (size_t local_rf = get_replication_factor_for_dc(erm, local_dc); local_rf == 0) {
throw exceptions::unavailable_exception(make_rf_zero_error_msg(local_dc), cl, need, 0);
}
size_t local_live = topo.count_local_endpoints(live_endpoints);
size_t pending = topo.count_local_endpoints(pending_endpoints);
if (local_live < need + pending) {

View File

@@ -14,15 +14,20 @@
#include <boost/lexical_cast.hpp>
#include "utils/s3/creds.hh"
#include "utils/http.hh"
#include "object_storage_endpoint_param.hh"
using namespace std::string_literals;
static auto format_url(std::string_view host, unsigned port, bool use_https) {
return fmt::format("{}://{}:{}", use_https ? "https" : "http", host, port);
}
db::object_storage_endpoint_param::object_storage_endpoint_param(s3_storage s)
: _data(std::move(s))
{}
db::object_storage_endpoint_param::object_storage_endpoint_param(std::string endpoint, s3::endpoint_config config)
: object_storage_endpoint_param(s3_storage{std::move(endpoint), std::move(config)})
: object_storage_endpoint_param(s3_storage{format_url(endpoint, config.port, config.use_https), std::move(config.region), std::move(config.role_arn), true /* legacy_format */})
{}
db::object_storage_endpoint_param::object_storage_endpoint_param(gs_storage s)
: _data(std::move(s))
@@ -32,13 +37,29 @@ db::object_storage_endpoint_param::object_storage_endpoint_param() = default;
db::object_storage_endpoint_param::object_storage_endpoint_param(const object_storage_endpoint_param&) = default;
std::string db::object_storage_endpoint_param::s3_storage::to_json_string() const {
if (!legacy_format) {
return fmt::format("{{ \"type\": \"s3\", \"aws_region\": \"{}\", \"iam_role_arn\": \"{}\" }}",
region, iam_role_arn
);
}
auto url = utils::http::parse_simple_url(endpoint);
return fmt::format("{{ \"port\": {}, \"use_https\": {}, \"aws_region\": \"{}\", \"iam_role_arn\": \"{}\" }}",
config.port, config.use_https, config.region, config.role_arn
url.port, url.is_https(), region, iam_role_arn
);
}
std::string db::object_storage_endpoint_param::s3_storage::key() const {
return endpoint;
// The `endpoint` is full URL all the time, so only return it as a key
// if it wasn't configured "the old way". In the latter case, split the
// URL and return its host part to mimic the old behavior.
if (!legacy_format) {
return endpoint;
}
auto url = utils::http::parse_simple_url(endpoint);
return url.host;
}
std::string db::object_storage_endpoint_param::gs_storage::to_json_string() const {
@@ -99,8 +120,6 @@ const std::string& db::object_storage_endpoint_param::type() const {
db::object_storage_endpoint_param db::object_storage_endpoint_param::decode(const YAML::Node& node) {
auto name = node["name"];
auto aws_region = node["aws_region"];
auto iam_role_arn = node["iam_role_arn"];
auto type = node["type"];
auto get_opt = [](auto& node, const std::string& key, auto def) {
@@ -108,13 +127,20 @@ db::object_storage_endpoint_param db::object_storage_endpoint_param::decode(cons
return tmp ? tmp.template as<std::decay_t<decltype(def)>>() : def;
};
// aws s3 endpoint.
if (!type || type.as<std::string>() == s3_type || aws_region || iam_role_arn) {
if (!type || type.as<std::string>() == s3_type) {
s3_storage ep;
ep.endpoint = name.as<std::string>();
ep.config.port = node["port"].as<unsigned>();
ep.config.use_https = node["https"].as<bool>(false);
ep.config.region = aws_region ? aws_region.as<std::string>() : std::getenv("AWS_DEFAULT_REGION");
ep.config.role_arn = iam_role_arn ? iam_role_arn.as<std::string>() : "";
auto endpoint = name.as<std::string>();
ep.legacy_format = (!endpoint.starts_with("http://") && !endpoint.starts_with("https://"));
if (!ep.legacy_format) {
ep.endpoint = std::move(endpoint);
} else {
ep.endpoint = format_url(endpoint, node["port"].as<unsigned>(), node["https"].as<bool>(false));
}
auto aws_region = node["aws_region"];
ep.region = aws_region ? aws_region.as<std::string>() : std::getenv("AWS_DEFAULT_REGION");
ep.iam_role_arn = get_opt(node, "iam_role_arn", ""s);
return object_storage_endpoint_param{std::move(ep)};
}

View File

@@ -25,7 +25,9 @@ class object_storage_endpoint_param {
public:
struct s3_storage {
std::string endpoint;
s3::endpoint_config config;
std::string region;
std::string iam_role_arn;
bool legacy_format; // FIXME convert it to bool_class after seastar#3198
std::strong_ordering operator<=>(const s3_storage&) const = default;
std::string to_json_string() const;

View File

@@ -961,15 +961,15 @@ public:
auto include_pending_changes = [&table_schemas](schema_diff_per_shard d) -> future<> {
for (auto& schema : d.dropped) {
co_await maybe_yield();
co_await coroutine::maybe_yield();
table_schemas.erase(schema->id());
}
for (auto& change : d.altered) {
co_await maybe_yield();
co_await coroutine::maybe_yield();
table_schemas.insert_or_assign(change.new_schema->id(), change.new_schema);
}
for (auto& schema : d.created) {
co_await maybe_yield();
co_await coroutine::maybe_yield();
table_schemas.insert_or_assign(schema->id(), schema);
}
};

View File

@@ -140,6 +140,7 @@ namespace {
system_keyspace::DICTS,
system_keyspace::VIEW_BUILDING_TASKS,
system_keyspace::CLIENT_ROUTES,
system_keyspace::REPAIR_TASKS,
};
if (ks_name == system_keyspace::NAME && tables.contains(cf_name)) {
props.is_group0_table = true;
@@ -490,6 +491,24 @@ schema_ptr system_keyspace::repair_history() {
return schema;
}
schema_ptr system_keyspace::repair_tasks() {
static thread_local auto schema = [] {
auto id = generate_legacy_id(NAME, REPAIR_TASKS);
return schema_builder(NAME, REPAIR_TASKS, std::optional(id))
.with_column("task_uuid", uuid_type, column_kind::partition_key)
.with_column("operation", utf8_type, column_kind::clustering_key)
// First and last token for of the tablet
.with_column("first_token", long_type, column_kind::clustering_key)
.with_column("last_token", long_type, column_kind::clustering_key)
.with_column("timestamp", timestamp_type)
.with_column("table_uuid", uuid_type, column_kind::static_column)
.set_comment("Record tablet repair tasks")
.with_hash_version()
.build();
}();
return schema;
}
schema_ptr system_keyspace::built_indexes() {
static thread_local auto built_indexes = [] {
schema_builder builder(generate_legacy_id(NAME, BUILT_INDEXES), NAME, BUILT_INDEXES,
@@ -2356,6 +2375,7 @@ std::vector<schema_ptr> system_keyspace::all_tables(const db::config& cfg) {
corrupt_data(),
scylla_local(), db::schema_tables::scylla_table_schema_history(),
repair_history(),
repair_tasks(),
v3::views_builds_in_progress(), v3::built_views(),
v3::scylla_views_builds_in_progress(),
v3::truncated(),
@@ -2599,6 +2619,32 @@ future<> system_keyspace::get_repair_history(::table_id table_id, repair_history
});
}
future<utils::chunked_vector<canonical_mutation>> system_keyspace::get_update_repair_task_mutations(const repair_task_entry& entry, api::timestamp_type ts) {
// Default to timeout the repair task entries in 10 days, this should be enough time for the management tools to query
constexpr int ttl = 10 * 24 * 3600;
sstring req = format("INSERT INTO system.{} (task_uuid, operation, first_token, last_token, timestamp, table_uuid) VALUES (?, ?, ?, ?, ?, ?) USING TTL {}", REPAIR_TASKS, ttl);
auto muts = co_await _qp.get_mutations_internal(req, internal_system_query_state(), ts,
{entry.task_uuid.uuid(), repair_task_operation_to_string(entry.operation),
entry.first_token, entry.last_token, entry.timestamp, entry.table_uuid.uuid()});
utils::chunked_vector<canonical_mutation> cmuts(muts.begin(), muts.end());
co_return cmuts;
}
future<> system_keyspace::get_repair_task(tasks::task_id task_uuid, repair_task_consumer f) {
sstring req = format("SELECT * from system.{} WHERE task_uuid = {}", REPAIR_TASKS, task_uuid);
co_await _qp.query_internal(req, [&f] (const cql3::untyped_result_set::row& row) mutable -> future<stop_iteration> {
repair_task_entry ent;
ent.task_uuid = tasks::task_id(row.get_as<utils::UUID>("task_uuid"));
ent.operation = repair_task_operation_from_string(row.get_as<sstring>("operation"));
ent.first_token = row.get_as<int64_t>("first_token");
ent.last_token = row.get_as<int64_t>("last_token");
ent.timestamp = row.get_as<db_clock::time_point>("timestamp");
ent.table_uuid = ::table_id(row.get_as<utils::UUID>("table_uuid"));
co_await f(std::move(ent));
co_return stop_iteration::no;
});
}
future<gms::generation_type> system_keyspace::increment_and_get_generation() {
auto req = format("SELECT gossip_generation FROM system.{} WHERE key='{}'", LOCAL, LOCAL);
auto rs = co_await _qp.execute_internal(req, cql3::query_processor::cache_internal::yes);
@@ -3249,7 +3295,7 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
supported_features = decode_features(deserialize_set_column(*topology(), row, "supported_features"));
}
if (row.has("topology_request")) {
if (row.has("topology_request") && nstate != service::node_state::left) {
auto req = service::topology_request_from_string(row.get_as<sstring>("topology_request"));
ret.requests.emplace(host_id, req);
switch(req) {
@@ -3799,4 +3845,35 @@ future<> system_keyspace::apply_mutation(mutation m) {
return _qp.proxy().mutate_locally(m, {}, db::commitlog::force_sync(m.schema()->static_props().wait_for_sync_to_commitlog), db::no_timeout);
}
// The names are persisted in system tables so should not be changed.
static const std::unordered_map<system_keyspace::repair_task_operation, sstring> repair_task_operation_to_name = {
{system_keyspace::repair_task_operation::requested, "requested"},
{system_keyspace::repair_task_operation::finished, "finished"},
};
static const std::unordered_map<sstring, system_keyspace::repair_task_operation> repair_task_operation_from_name = std::invoke([] {
std::unordered_map<sstring, system_keyspace::repair_task_operation> result;
for (auto&& [v, s] : repair_task_operation_to_name) {
result.emplace(s, v);
}
return result;
});
sstring system_keyspace::repair_task_operation_to_string(system_keyspace::repair_task_operation op) {
auto i = repair_task_operation_to_name.find(op);
if (i == repair_task_operation_to_name.end()) {
on_internal_error(slogger, format("Invalid repair task operation: {}", static_cast<int>(op)));
}
return i->second;
}
system_keyspace::repair_task_operation system_keyspace::repair_task_operation_from_string(const sstring& name) {
return repair_task_operation_from_name.at(name);
}
} // namespace db
auto fmt::formatter<db::system_keyspace::repair_task_operation>::format(const db::system_keyspace::repair_task_operation& op, fmt::format_context& ctx) const
-> decltype(ctx.out()) {
return fmt::format_to(ctx.out(), "{}", db::system_keyspace::repair_task_operation_to_string(op));
}

View File

@@ -57,6 +57,8 @@ namespace paxos {
struct topology_request_state;
class group0_guard;
class raft_group0_client;
}
namespace netw {
@@ -185,6 +187,7 @@ public:
static constexpr auto RAFT_SNAPSHOTS = "raft_snapshots";
static constexpr auto RAFT_SNAPSHOT_CONFIG = "raft_snapshot_config";
static constexpr auto REPAIR_HISTORY = "repair_history";
static constexpr auto REPAIR_TASKS = "repair_tasks";
static constexpr auto GROUP0_HISTORY = "group0_history";
static constexpr auto DISCOVERY = "discovery";
static constexpr auto BROADCAST_KV_STORE = "broadcast_kv_store";
@@ -264,6 +267,7 @@ public:
static schema_ptr raft();
static schema_ptr raft_snapshots();
static schema_ptr repair_history();
static schema_ptr repair_tasks();
static schema_ptr group0_history();
static schema_ptr discovery();
static schema_ptr broadcast_kv_store();
@@ -403,6 +407,22 @@ public:
int64_t range_end;
};
enum class repair_task_operation {
requested,
finished,
};
static sstring repair_task_operation_to_string(repair_task_operation op);
static repair_task_operation repair_task_operation_from_string(const sstring& name);
struct repair_task_entry {
tasks::task_id task_uuid;
repair_task_operation operation;
int64_t first_token;
int64_t last_token;
db_clock::time_point timestamp;
table_id table_uuid;
};
struct topology_requests_entry {
utils::UUID id;
utils::UUID initiating_host;
@@ -424,6 +444,10 @@ public:
using repair_history_consumer = noncopyable_function<future<>(const repair_history_entry&)>;
future<> get_repair_history(table_id, repair_history_consumer f);
future<utils::chunked_vector<canonical_mutation>> get_update_repair_task_mutations(const repair_task_entry& entry, api::timestamp_type ts);
using repair_task_consumer = noncopyable_function<future<>(const repair_task_entry&)>;
future<> get_repair_task(tasks::task_id task_uuid, repair_task_consumer f);
future<> save_truncation_record(const replica::column_family&, db_clock::time_point truncated_at, db::replay_position);
future<replay_positions> get_truncated_positions(table_id);
future<> drop_truncation_rp_records();
@@ -733,3 +757,8 @@ public:
}; // class system_keyspace
} // namespace db
template <>
struct fmt::formatter<db::system_keyspace::repair_task_operation> : fmt::formatter<string_view> {
auto format(const db::system_keyspace::repair_task_operation&, fmt::format_context& ctx) const -> decltype(ctx.out());
};

View File

@@ -57,7 +57,7 @@
#include "locator/network_topology_strategy.hh"
#include "mutation/mutation.hh"
#include "mutation/mutation_partition.hh"
#include "seastar/core/on_internal_error.hh"
#include <seastar/core/on_internal_error.hh>
#include "service/migration_manager.hh"
#include "service/raft/raft_group0_client.hh"
#include "service/storage_proxy.hh"
@@ -2244,7 +2244,7 @@ future<> view_builder::start_in_background(service::migration_manager& mm, utils
// Guard the whole startup routine with a semaphore,
// so that it's not intercepted by `on_drop_view`, `on_create_view`
// or `on_update_view` events.
auto units = co_await get_units(_sem, 1);
auto units = co_await get_units(_sem, view_builder_semaphore_units);
// Wait for schema agreement even if we're a seed node.
co_await mm.wait_for_schema_agreement(_db, db::timeout_clock::time_point::max(), &_as);
@@ -2659,7 +2659,7 @@ future<> view_builder::add_new_view(view_ptr view, build_step& step) {
co_await utils::get_local_injector().inject("add_new_view_pause_last_shard", utils::wait_for_message(5min));
}
co_await _sys_ks.register_view_for_building_for_all_shards(view->ks_name(), view->cf_name(), step.current_token());
co_await _sys_ks.register_view_for_building(view->ks_name(), view->cf_name(), step.current_token());
step.build_status.emplace(step.build_status.begin(), view_build_status{view, step.current_token(), std::nullopt});
}
@@ -2667,40 +2667,74 @@ static bool should_ignore_tablet_keyspace(const replica::database& db, const sst
return db.features().view_building_coordinator && db.has_keyspace(ks_name) && db.find_keyspace(ks_name).uses_tablets();
}
void view_builder::on_create_view(const sstring& ks_name, const sstring& view_name) {
future<> view_builder::dispatch_create_view(sstring ks_name, sstring view_name) {
if (should_ignore_tablet_keyspace(_db, ks_name)) {
return make_ready_future<>();
}
return with_semaphore(_sem, view_builder_semaphore_units, [this, ks_name = std::move(ks_name), view_name = std::move(view_name)] () mutable {
// This runs on shard 0 only; seed the global rows before broadcasting.
return handle_seed_view_build_progress(ks_name, view_name).then([this, ks_name = std::move(ks_name), view_name = std::move(view_name)] () mutable {
return container().invoke_on_all([ks_name = std::move(ks_name), view_name = std::move(view_name)] (view_builder& vb) mutable {
return vb.handle_create_view_local(std::move(ks_name), std::move(view_name));
});
});
});
}
future<> view_builder::handle_seed_view_build_progress(sstring ks_name, sstring view_name) {
auto view = view_ptr(_db.find_schema(ks_name, view_name));
auto& step = get_or_create_build_step(view->view_info()->base_id());
return _sys_ks.register_view_for_building_for_all_shards(view->ks_name(), view->cf_name(), step.current_token());
}
future<> view_builder::handle_create_view_local(sstring ks_name, sstring view_name){
if (this_shard_id() == 0) {
return handle_create_view_local_impl(std::move(ks_name), std::move(view_name));
} else {
return with_semaphore(_sem, view_builder_semaphore_units, [this, ks_name = std::move(ks_name), view_name = std::move(view_name)] () mutable {
return handle_create_view_local_impl(std::move(ks_name), std::move(view_name));
});
}
}
future<> view_builder::handle_create_view_local_impl(sstring ks_name, sstring view_name) {
auto view = view_ptr(_db.find_schema(ks_name, view_name));
auto& step = get_or_create_build_step(view->view_info()->base_id());
return when_all(step.base->await_pending_writes(), step.base->await_pending_streams()).discard_result().then([this, &step] {
return flush_base(step.base, _as);
}).then([this, view, &step] () {
// This resets the build step to the current token. It may result in views currently
// being built to receive duplicate updates, but it simplifies things as we don't have
// to keep around a list of new views to build the next time the reader crosses a token
// threshold.
return initialize_reader_at_current_token(step).then([this, view, &step] () mutable {
return add_new_view(view, step);
}).then_wrapped([this, view] (future<>&& f) {
try {
f.get();
} catch (abort_requested_exception&) {
vlogger.debug("Aborted while setting up view for building {}.{}", view->ks_name(), view->cf_name());
} catch (raft::request_aborted&) {
vlogger.debug("Aborted while setting up view for building {}.{}", view->ks_name(), view->cf_name());
} catch (...) {
vlogger.error("Error setting up view for building {}.{}: {}", view->ks_name(), view->cf_name(), std::current_exception());
}
// Waited on indirectly in stop().
static_cast<void>(_build_step.trigger());
});
});
}
void view_builder::on_create_view(const sstring& ks_name, const sstring& view_name) {
if (this_shard_id() != 0) {
return;
}
// Do it in the background, serialized.
(void)with_semaphore(_sem, 1, [ks_name, view_name, this] {
auto view = view_ptr(_db.find_schema(ks_name, view_name));
auto& step = get_or_create_build_step(view->view_info()->base_id());
return when_all(step.base->await_pending_writes(), step.base->await_pending_streams()).discard_result().then([this, &step] {
return flush_base(step.base, _as);
}).then([this, view, &step] () mutable {
// This resets the build step to the current token. It may result in views currently
// being built to receive duplicate updates, but it simplifies things as we don't have
// to keep around a list of new views to build the next time the reader crosses a token
// threshold.
return initialize_reader_at_current_token(step).then([this, view, &step] () mutable {
return add_new_view(view, step).then_wrapped([this, view] (future<>&& f) {
try {
f.get();
} catch (abort_requested_exception&) {
vlogger.debug("Aborted while setting up view for building {}.{}", view->ks_name(), view->cf_name());
} catch (raft::request_aborted&) {
vlogger.debug("Aborted while setting up view for building {}.{}", view->ks_name(), view->cf_name());
} catch (...) {
vlogger.error("Error setting up view for building {}.{}: {}", view->ks_name(), view->cf_name(), std::current_exception());
}
// Waited on indirectly in stop().
(void)_build_step.trigger();
});
});
});
}).handle_exception_type([] (replica::no_such_column_family&) { });
// Do it in the background, serialized and broadcast from shard 0.
static_cast<void>(dispatch_create_view(ks_name, view_name).handle_exception([ks_name, view_name] (std::exception_ptr ep) {
vlogger.warn("Failed to dispatch view creation {}.{}: {}", ks_name, view_name, ep);
}));
}
void view_builder::on_update_view(const sstring& ks_name, const sstring& view_name, bool) {
@@ -2709,7 +2743,7 @@ void view_builder::on_update_view(const sstring& ks_name, const sstring& view_na
}
// Do it in the background, serialized.
(void)with_semaphore(_sem, 1, [ks_name, view_name, this] {
(void)with_semaphore(_sem, view_builder_semaphore_units, [ks_name, view_name, this] {
auto view = view_ptr(_db.find_schema(ks_name, view_name));
auto step_it = _base_to_build_step.find(view->view_info()->base_id());
if (step_it == _base_to_build_step.end()) {
@@ -2724,45 +2758,75 @@ void view_builder::on_update_view(const sstring& ks_name, const sstring& view_na
}).handle_exception_type([] (replica::no_such_column_family&) { });
}
void view_builder::on_drop_view(const sstring& ks_name, const sstring& view_name) {
future<> view_builder::dispatch_drop_view(sstring ks_name, sstring view_name) {
if (should_ignore_tablet_keyspace(_db, ks_name)) {
return make_ready_future<>();
}
return with_semaphore(_sem, view_builder_semaphore_units, [this, ks_name = std::move(ks_name), view_name = std::move(view_name)] () mutable {
// This runs on shard 0 only; broadcast local cleanup before global cleanup.
return container().invoke_on_all([ks_name, view_name] (view_builder& vb) mutable {
return vb.handle_drop_view_local(std::move(ks_name), std::move(view_name));
}).then([this, ks_name = std::move(ks_name), view_name = std::move(view_name)] () mutable {
return handle_drop_view_global_cleanup(std::move(ks_name), std::move(view_name));
});
});
}
future<> view_builder::handle_drop_view_local(sstring ks_name, sstring view_name) {
if (this_shard_id() == 0) {
return handle_drop_view_local_impl(std::move(ks_name), std::move(view_name));
} else {
return with_semaphore(_sem, view_builder_semaphore_units, [this, ks_name = std::move(ks_name), view_name = std::move(view_name)] () mutable {
return handle_drop_view_local_impl(std::move(ks_name), std::move(view_name));
});
}
}
future<> view_builder::handle_drop_view_local_impl(sstring ks_name, sstring view_name) {
vlogger.info0("Stopping to build view {}.{}", ks_name, view_name);
// The view is absent from the database at this point, so find it by brute force.
([&, this] {
for (auto& [_, step] : _base_to_build_step) {
if (step.build_status.empty() || step.build_status.front().view->ks_name() != ks_name) {
continue;
}
for (auto it = step.build_status.begin(); it != step.build_status.end(); ++it) {
if (it->view->cf_name() == view_name) {
_built_views.erase(it->view->id());
step.build_status.erase(it);
return;
}
}
}
})();
return make_ready_future<>();
}
future<> view_builder::handle_drop_view_global_cleanup(sstring ks_name, sstring view_name) {
if (this_shard_id() != 0) {
return make_ready_future<>();
}
vlogger.info0("Starting view global cleanup {}.{}", ks_name, view_name);
return when_all_succeed(
_sys_ks.remove_view_build_progress_across_all_shards(ks_name, view_name),
_sys_ks.remove_built_view(ks_name, view_name),
remove_view_build_status(ks_name, view_name))
.discard_result()
.handle_exception([ks_name, view_name] (std::exception_ptr ep) {
vlogger.warn("Failed to cleanup view {}.{}: {}", ks_name, view_name, ep);
});
}
void view_builder::on_drop_view(const sstring& ks_name, const sstring& view_name) {
if (this_shard_id() != 0) {
return;
}
vlogger.info0("Stopping to build view {}.{}", ks_name, view_name);
// Do it in the background, serialized.
(void)with_semaphore(_sem, 1, [ks_name, view_name, this] {
// The view is absent from the database at this point, so find it by brute force.
([&, this] {
for (auto& [_, step] : _base_to_build_step) {
if (step.build_status.empty() || step.build_status.front().view->ks_name() != ks_name) {
continue;
}
for (auto it = step.build_status.begin(); it != step.build_status.end(); ++it) {
if (it->view->cf_name() == view_name) {
_built_views.erase(it->view->id());
step.build_status.erase(it);
return;
}
}
}
})();
if (this_shard_id() != 0) {
// Shard 0 can't remove the entry in the build progress system table on behalf of the
// current shard, since shard 0 may have already processed the notification, and this
// shard may since have updated the system table if the drop happened concurrently
// with the build.
return _sys_ks.remove_view_build_progress(ks_name, view_name);
}
return when_all_succeed(
_sys_ks.remove_view_build_progress(ks_name, view_name),
_sys_ks.remove_built_view(ks_name, view_name),
remove_view_build_status(ks_name, view_name))
.discard_result()
.handle_exception([ks_name, view_name] (std::exception_ptr ep) {
vlogger.warn("Failed to cleanup view {}.{}: {}", ks_name, view_name, ep);
});
}).handle_exception_type([] (replica::no_such_keyspace&) {});
// Do it in the background, serialized and broadcast from shard 0.
static_cast<void>(dispatch_drop_view(ks_name, view_name).handle_exception([ks_name, view_name] (std::exception_ptr ep) {
vlogger.warn("Failed to dispatch view drop {}.{}: {}", ks_name, view_name, ep);
}));
}
future<> view_builder::do_build_step() {
@@ -2773,7 +2837,7 @@ future<> view_builder::do_build_step() {
return seastar::async(std::move(attr), [this] {
exponential_backoff_retry r(1s, 1min);
while (!_base_to_build_step.empty() && !_as.abort_requested()) {
auto units = get_units(_sem, 1).get();
auto units = get_units(_sem, view_builder_semaphore_units).get();
++_stats.steps_performed;
try {
execute(_current_step->second, exponential_backoff_retry(1s, 1min));
@@ -3633,20 +3697,20 @@ sstring build_status_to_sstring(build_status status) {
on_internal_error(vlogger, fmt::format("Unknown view build status: {}", (int)status));
}
void validate_view_keyspace(const data_dictionary::database& db, std::string_view keyspace_name) {
const bool tablet_views_enabled = db.features().views_with_tablets;
// Note: if the configuration option `rf_rack_valid_keyspaces` is enabled, we can be
// sure that all tablet-based keyspaces are RF-rack-valid. We check that
// at start-up and then we don't allow for creating RF-rack-invalid keyspaces.
const bool rf_rack_valid_keyspaces = db.get_config().rf_rack_valid_keyspaces();
const bool required_config = tablet_views_enabled && rf_rack_valid_keyspaces;
void validate_view_keyspace(const data_dictionary::database& db, std::string_view keyspace_name, locator::token_metadata_ptr tmptr) {
const auto& rs = db.find_keyspace(keyspace_name).get_replication_strategy();
const bool uses_tablets = db.find_keyspace(keyspace_name).get_replication_strategy().uses_tablets();
if (!required_config && uses_tablets) {
if (rs.uses_tablets() && !db.features().views_with_tablets) {
throw std::logic_error("Materialized views and secondary indexes are not supported on base tables with tablets. "
"To be able to use them, enable the configuration option `rf_rack_valid_keyspaces` and make sure "
"that the cluster feature `VIEWS_WITH_TABLETS` is enabled.");
"To be able to use them, make sure all nodes in the cluster are upgraded.");
}
try {
locator::assert_rf_rack_valid_keyspace(keyspace_name, tmptr, rs);
} catch (const std::exception& e) {
throw std::logic_error(fmt::format(
"Materialized views and secondary indexes are not supported on the keyspace '{}': {}",
keyspace_name, e.what()));
}
}

View File

@@ -9,6 +9,7 @@
#pragma once
#include "gc_clock.hh"
#include "locator/token_metadata_fwd.hh"
#include "query/query-request.hh"
#include "schema/schema_fwd.hh"
#include "readers/mutation_reader.hh"
@@ -318,7 +319,7 @@ endpoints_to_update get_view_natural_endpoint(
///
/// Preconditions:
/// * The provided `keyspace_name` must correspond to an existing keyspace.
void validate_view_keyspace(const data_dictionary::database&, std::string_view keyspace_name);
void validate_view_keyspace(const data_dictionary::database&, std::string_view keyspace_name, locator::token_metadata_ptr tmptr);
}

View File

@@ -169,10 +169,11 @@ class view_builder final : public service::migration_listener::only_view_notific
base_to_build_step_type _base_to_build_step;
base_to_build_step_type::iterator _current_step = _base_to_build_step.end();
serialized_action _build_step{std::bind(&view_builder::do_build_step, this)};
static constexpr size_t view_builder_semaphore_units = 1;
// Ensures bookkeeping operations are serialized, meaning that while we execute
// a build step we don't consider newly added or removed views. This simplifies
// the algorithms. Also synchronizes an operation wrt. a call to stop().
seastar::named_semaphore _sem{1, named_semaphore_exception_factory{"view builder"}};
seastar::named_semaphore _sem{view_builder_semaphore_units, named_semaphore_exception_factory{"view builder"}};
seastar::abort_source _as;
future<> _started = make_ready_future<>();
// Used to coordinate between shards the conclusion of the build process for a particular view.
@@ -266,6 +267,14 @@ private:
future<> maybe_mark_view_as_built(view_ptr, dht::token);
future<> mark_as_built(view_ptr);
void setup_metrics();
future<> dispatch_create_view(sstring ks_name, sstring view_name);
future<> dispatch_drop_view(sstring ks_name, sstring view_name);
future<> handle_seed_view_build_progress(sstring ks_name, sstring view_name);
future<> handle_create_view_local(sstring ks_name, sstring view_name);
future<> handle_drop_view_local(sstring ks_name, sstring view_name);
future<> handle_create_view_local_impl(sstring ks_name, sstring view_name);
future<> handle_drop_view_local_impl(sstring ks_name, sstring view_name);
future<> handle_drop_view_global_cleanup(sstring ks_name, sstring view_name);
template <typename Func1, typename Func2>
future<> write_view_build_status(Func1&& fn_group0, Func2&& fn_sys_dist) {

View File

@@ -17,7 +17,7 @@
#include "locator/abstract_replication_strategy.hh"
#include "locator/tablets.hh"
#include "raft/raft.hh"
#include "seastar/core/gate.hh"
#include <seastar/core/gate.hh>
#include "db/view/view_building_state.hh"
#include "sstables/shared_sstable.hh"
#include "utils/UUID.hh"

View File

@@ -102,13 +102,13 @@ view_update_generator::view_update_generator(replica::database& db, sharded<serv
, _early_abort_subscription(as.subscribe([this] () noexcept { do_abort(); }))
{
setup_metrics();
discover_staging_sstables();
_db.plug_view_update_generator(*this);
}
view_update_generator::~view_update_generator() {}
future<> view_update_generator::start() {
discover_staging_sstables();
_started = seastar::async([this]() mutable {
auto drop_sstable_references = defer([&] () noexcept {
// Clear sstable references so sstables_manager::stop() doesn't hang.

View File

@@ -7,7 +7,6 @@
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
import os
import sys
import errno
import logging

View File

@@ -8,7 +8,7 @@
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
import argparse
from scylla_blocktune import *
from scylla_blocktune import tune_yaml, tune_fs, tune_dev
if __name__ == "__main__":
ap = argparse.ArgumentParser('Tune filesystems for ScyllaDB')

View File

@@ -14,7 +14,9 @@ import subprocess
import time
import tempfile
import shutil
from scylla_util import *
import re
import distro
from scylla_util import out, is_debian_variant, is_suse_variant, pkg_install, is_redhat_variant, systemd_unit, is_gentoo
from subprocess import run

View File

@@ -10,9 +10,8 @@
import os
import sys
import argparse
import shlex
import distro
from scylla_util import *
import shutil
from scylla_util import is_debian_variant, pkg_install, systemd_unit, sysconfig_parser, is_gentoo, is_arch, is_amzn2, is_suse_variant, is_redhat_variant
UNIT_DATA= '''
[Unit]

View File

@@ -10,7 +10,7 @@
import os
import sys
import argparse
from scylla_util import *
from scylla_util import is_container, sysconfig_parser
if __name__ == '__main__':
if not is_container() and os.getuid() > 0:

View File

@@ -10,7 +10,7 @@
import os
import sys
import argparse
from scylla_util import *
from scylla_util import is_nonroot, is_container, etcdir, sysconfig_parser
if __name__ == '__main__':
if not is_nonroot() and not is_container() and os.getuid() > 0:

View File

@@ -8,8 +8,6 @@
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
import os
import sys
import yaml
import argparse
import subprocess

View File

@@ -8,9 +8,8 @@
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
import os
import subprocess
import shutil
from scylla_util import *
import sys
from scylla_util import systemd_unit
if __name__ == '__main__':
if os.getuid() > 0:

View File

@@ -9,7 +9,7 @@
import os
import re
from scylla_util import *
from scylla_util import etcdir, datadir, bindir, scriptsdir, is_nonroot, is_container, is_developer_mode
import resource
import subprocess
import argparse

View File

@@ -10,7 +10,7 @@
import os
import sys
import shutil
from scylla_util import *
from scylla_util import pkg_install
from subprocess import run, DEVNULL
if __name__ == '__main__':

View File

@@ -7,9 +7,8 @@
#
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
from pathlib import Path
from datetime import datetime
from scylla_util import *
from scylla_util import scylladir_p
if __name__ == '__main__':
log = scylladir_p() / 'scylla-server.log'

View File

@@ -10,7 +10,7 @@
import os
import sys
import argparse
from scylla_util import *
from scylla_util import etcdir_p
if __name__ == '__main__':
if os.getuid() > 0:

View File

@@ -11,10 +11,9 @@ import os
import sys
import argparse
import re
import distro
import shutil
from scylla_util import *
from scylla_util import systemd_unit, is_redhat_variant, pkg_install
from subprocess import run
from pathlib import Path

View File

@@ -12,8 +12,10 @@ import sys
import glob
import platform
import distro
import re
import traceback
from scylla_util import *
from scylla_util import sysconfig_parser, out, get_set_nic_and_disks_config_value, check_sysfs_numa_topology_is_valid, sysconfdir_p, scylla_excepthook, perftune_base_command
from subprocess import run
def get_cur_cpuset():

View File

@@ -9,7 +9,6 @@
import os
import argparse
import distutils.util
import pwd
import grp
import sys
@@ -18,12 +17,22 @@ import logging
import pyudev
import psutil
import platform
import shutil
from pathlib import Path
from scylla_util import *
from subprocess import run, SubprocessError
from scylla_util import is_unused_disk, out, pkg_install, systemd_unit, SystemdException, is_debian_variant, is_redhat_variant, is_offline
from subprocess import run, SubprocessError, Popen
LOGGER = logging.getLogger(__name__)
def strtobool(val):
val = val.lower()
if val in ('y', 'yes', 't', 'true', 'on', '1'):
return 1
elif val in ('n', 'no', 'f', 'false', 'off', '0'):
return 0
else:
raise ValueError(f"invalid truth value {val!r}")
class UdevInfo:
def __init__(self, device_file):
self.context = pyudev.Context()
@@ -145,7 +154,7 @@ if __name__ == '__main__':
args = parser.parse_args()
# Allow args.online_discard to be used as a boolean value
args.online_discard = distutils.util.strtobool(args.online_discard)
args.online_discard = strtobool(args.online_discard)
root = args.root.rstrip('/')
if args.volume_role == 'all':
@@ -227,7 +236,7 @@ if __name__ == '__main__':
with open(discard_path) as f:
discard = f.read().strip()
if discard != '0':
proc = subprocess.Popen(['blkdiscard', disk])
proc = Popen(['blkdiscard', disk])
procs.append(proc)
for proc in procs:
proc.wait()

View File

@@ -8,8 +8,9 @@
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
import os
import sys
import argparse
from scylla_util import *
from scylla_util import systemd_unit
def update_rsysconf(rsyslog_server):
if ':' not in rsyslog_server:

View File

@@ -9,8 +9,7 @@
import os
import sys
import subprocess
from scylla_util import *
from scylla_util import is_redhat_variant, out, sysconfig_parser
from subprocess import run
if __name__ == '__main__':

View File

@@ -12,12 +12,14 @@ import sys
import argparse
import glob
import shutil
import io
import stat
import distro
from scylla_util import *
import re
from scylla_util import colorprint, is_valid_nic, scylladir_p, is_offline, is_debian_variant, is_redhat_variant, is_suse_variant, is_gentoo, out, is_unused_disk, scriptsdir, is_nonroot, sysconfdir_p, sysconfig_parser, systemd_unit, swap_exists, get_product, etcdir
from subprocess import run, DEVNULL
PRODUCT = get_product(etcdir())
interactive = False
HOUSEKEEPING_TIMEOUT = 60
def when_interactive_ask_service(interactive, msg1, msg2, default = None):
@@ -294,11 +296,9 @@ if __name__ == '__main__':
sys.exit(1)
disks = args.disks
nic = args.nic
set_nic_and_disks = args.setup_nic_and_disks
swap_directory = args.swap_directory
swap_size = args.swap_size
ec2_check = not args.no_ec2_check
kernel_check = not args.no_kernel_check
verify_package = not args.no_verify_package
enable_service = not args.no_enable_service

View File

@@ -9,7 +9,7 @@
import os
import sys
from scylla_util import *
from scylla_util import sysconfig_parser, sysconfdir_p
from subprocess import run
if __name__ == '__main__':

View File

@@ -12,7 +12,7 @@ import sys
import argparse
import psutil
from pathlib import Path
from scylla_util import *
from scylla_util import swap_exists, out, systemd_unit
from subprocess import run
def GB(n):

View File

@@ -10,9 +10,8 @@
import os
import sys
import argparse
import subprocess
import re
from scylla_util import *
from scylla_util import sysconfig_parser, sysconfdir_p, get_set_nic_and_disks_config_value, is_valid_nic, check_sysfs_numa_topology_is_valid, out, perftune_base_command, hex2list
from subprocess import run
def bool2str(val):

View File

@@ -3,12 +3,9 @@
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
import configparser
import glob
import io
import os
import re
import shlex
import shutil
import subprocess
import yaml
import sys
@@ -19,7 +16,6 @@ from datetime import datetime, timedelta
import distro
from scylla_sysconfdir import SYSCONFDIR
from scylla_product import PRODUCT
from multiprocessing import cpu_count
@@ -338,7 +334,7 @@ def apt_install(pkg, offline_exit=True):
if apt_is_updated():
break
try:
res = run('apt-get update', shell=True, check=True, stderr=PIPE, encoding='utf-8')
run('apt-get update', shell=True, check=True, stderr=PIPE, encoding='utf-8')
break
except CalledProcessError as e:
print(e.stderr, end='')
@@ -519,3 +515,12 @@ class sysconfig_parser:
def commit(self):
with open(self._filename, 'w') as f:
f.write(self._data)
def get_product(dir):
if dir is None:
dir = etcdir()
try:
with open(os.path.join(dir, 'SCYLLA-PRODUCT-FILE')) as f:
return f.read().strip()
except FileNotFoundError:
return 'scylla'

View File

@@ -1,7 +1,6 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
import sys
import signal
import subprocess

View File

@@ -54,6 +54,14 @@ these default locations can overridden by specifying
`--alternator-encryption-options keyfile="..."` and
`--alternator-encryption-options certificate="..."`.
In addition to **alternator_port** and **alternator_https_port**, the two
options **alternator_port_proxy_protocol** (for HTTP) and
**alternator_https_port_proxy_protocol** (for HTTPS) allow running Alternator
behind a reverse proxy, such as HAProxy or AWS PrivateLink, and still report
the correct client address. The reverse proxy must be configured to use
[Proxy Protocol v2](https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt)
(the binary header format).
By default, ScyllaDB saves a snapshot of deleted tables. But Alternator does
not offer an API to restore these snapshots, so these snapshots are not useful
and waste disk space - deleting a table does not recover any disk space.
@@ -146,4 +154,5 @@ with Tablets enabled.
getting-started
compatibility
new-apis
network
```

159
docs/alternator/network.md Normal file
View File

@@ -0,0 +1,159 @@
# Reducing network costs in Alternator
In some deployments, the network between the application and the ScyllaDB
cluster is limited in bandwidth, or network traffic is expensive.
This document surveys some mechanisms that can be used to reduce this network
traffic.
## Compression
An easy way to reduce network traffic between the application and the
ScyllaDB cluster is to compress the requests and responses. The resulting
saving can be substantial if items are long and compress well - e.g., long
strings. But additionally, Alternator's protocol (the DynamoDB API) is
JSON-based - textual and verbose - and compresses well.
Note that enabling compression also has a downside - it requires more
computation by both client and server. So the choice of whether to enable
compression depends on the relative cost of network traffic and CPU.
### Compression of requests
When the application sends a large request - notably a `PutItem` or
`BatchWriteItem` - we can reduce network usage by compressing the content
of this request.
Alternator's request protocol - the DynamoDB API - is based on HTTP or HTTPS.
The standard HTTP 1.1 mechanism for compressing a request is to compress
the request's body using some compression algorithm (e.g., gzip) and
send a header `Content-Encoding: gzip` stating which compression algorithm
was used. Alternator currently supports two compression algorithms, `gzip`
and `deflate`, both standardized in ([RFC 9110](https://www.rfc-editor.org/rfc/rfc9110.html)).
Other standard compression types which are listed in
[IANA's HTTP Content Coding Registry](https://www.iana.org/assignments/http-parameters/http-parameters.xhtml#content-coding),
including `zstd` ([RFC 8878][https://www.rfc-editor.org/rfc/rfc8878.html]),
are not yet supported by Alternator.
Note that HTTP's compression only compresses the request's _body_ - not the
request's _headers_ - so it is beneficial to avoid sending unnecessary headers
in the request, as they will not be compressed. See more on this below.
To use compressed requests, the client library (SDK) used by the application
should be configured to actually compress requests. Amazon's AWS SDKs support
this feature in some languages, but not in others, and may automatically
compress only requests that are longer than a certain size. Check their website
<https://docs.aws.amazon.com/sdkref/latest/guide/feature-compression.html>
for more details for the specific SDK you are using. ScyllaDB also publishes
extensions for these SDKs which may have better support for compressed
requests (and other features mentioned in this document), so again please
consult the documentation of the specific SDK that you are using.
### Compression of responses
Some types of requests, notably `GetItem` and `BatchGetItem`, can have small
requests but large responses, so it can be beneficial to compress these
responses. The standard HTTP mechanism for doing this is that the client
provides a header like `Accept-Encoding: gzip` in the request, signalling
that it is ready to accept a gzip'ed response. Alternator may then compress
the response, or not, depending on its configuration and the response's
size. If Alternator does compress the response body, it sets a header
`Content-Encoding: gzip` in the response.
Currently, Alternator supports response compression with either `gzip`
or `deflate` encodings. If the client requests response compression
(via the `Accept-Encoding` header), then by default Alternator compresses
responses over 4 KB in length (leaving smaller responses uncompressed),
and uses compression level 6 (where 1 is the fastest, 9 is best compression).
These defaults can be changed with the configuration options
`alternator_response_compression_threshold_in_bytes` and
`alternator_response_gzip_compression_level`, respectively.
To use compressed responses, the client library (SDK) used by application
should be configured to send an `Accept-Encoding: gzip` header and to
understand the potentially-compressed response. Although DynamoDB does
support compressing responses, it is not clear if any of Amazon's AWS SDKs can
use this feature. ScyllaDB publishes extensions for these SDKs which may
have better support for compressed responses (and other features mentioned
in this document), so please consult the documentation of the specific SDK that
you are using to check if it can make use of the response compression
feature that Alternator supports.
## Fewer and shorter headers
As we saw above, the HTTP 1.1 protocol allows compressing the _bodies_ of
requests and responses. But HTTP 1.1 does not have a mechanism to compress
_headers_. So both client (SDK) and server (Alternator) should avoid sending
headers that are unnecessary or unnecessarily long.
The Alternator server sends headers like the following in its responses:
```
Content-Length: 2
Content-Type: application/x-amz-json-1.0
Date: Tue, 30 Dec 2025 20:00:01 GMT
Server: Seastar httpd
```
This is a bit over 100 bytes. Most of it is necessary, but the `Date`
and `Server` headers are not strictly necessary and a future version of
Alternator will most likely make them optional (or remove them altogether).
The request headers add significantly larger overhead, and AWS SDKs add
even more than necessary. Here is an example:
```
Content-Length: 300
amz-sdk-request: attempt=1
amz-sdk-invocation-id: caf3eb0e-8138-4cbd-adff-44ac357de04e
X-Amz-Date: 20251230T200829Z
host: 127.15.196.80:8000
User-Agent: Boto3/1.42.12 md/Botocore#1.42.12 md/awscrt#0.27.2 ua/2.1 os/linux#6.17.12-300.fc43.x86_64 md/arch#x86_64 lang/python#3.14.2 md/pyimpl#CPython m/Z,N,e,D,P,b cfg/retry-mode#legacy Botocore/1.42.12 Resource
Content-Type: application/x-amz-json-1.0
Authorization: AWS4-HMAC-SHA256 Credential=cassandra/20251230/us-east-1/dynamodb/aws4_request, SignedHeaders=content-type;host;x-amz-date;x-amz-target, Signature=34100a8cd044ef131c1ae025c91c6fc3468507c28449615cdb2edb4d82298be0"
X-Amz-Target: DynamoDB_20120810.CreateTable
Accept-Encoding: identity
```
There is a lot of waste here: Some headers like `amz-sdk-invocation-id` are
not needed at all. Others like `User-Agent` are useful for debugging but the
200-byte string sent by boto3 (AWS's Python SDK) on every request is clearly
excessive. In AWS's SDKs the user cannot make the request headers shorter,
but we plan to provide such a feature in ScyllaDB's extensions to AWS's SDKs.
Note that the request signing protocol used by Alternator and DynamoDB,
known as AWS Signature Version 4, needs the headers `x-amz-date` and
`Authorization` - which together use (as can be seen in the above example)
more than 230 bytes in each and every request. We could save these 230 bytes
by using SSL - instead of AWS Signature Version 4 - for authentication.
This idea is not yet implemented however - it is not supported by Alternator,
DynamoDB, or any of the client SDKs.
Some middleware or gateways modify HTTP traffic going through them and add
even more headers for their own use. If you suspect this might be happening,
inspect the HTTP traffic of your workload with a packet analyzer to see if
any unnecessary HTTP headers appear in requests arriving to Alternator, or
in responses arriving back at the application.
## Rack-aware request routing
In some deployments, the network between the client and the ScyllaDB server
is only expensive if the communication crosses **rack** boundaries.
A "rack" is a ScyllaDB term roughly equivalent to AWS's "availability zone".
Typically, a ScyllaDB cluster is divided to three racks and each item of
data is replicated once on each rack. Each instance of the client application
also lives in one of these racks. In many deployments, communication inside
the rack is free but communicating to a different rack costs extra. It is
therefore important that the client be aware of which rack it is running
on, and send the request to a ScyllaDB node on the same rack. It should not
choose one of the three replicas as random. This is known as "rack-aware
request routing", and should be enabled in the ScyllaDB extension of the
AWS SDK.
## Networking inside the ScyllaDB cluster
The previous sections were all about reducing the amount of network traffic
between the client application and the ScyllaDB cluster. If the network
between different ScyllaDB nodes is also metered - especially between nodes
in different racks - then this intra-cluster networking should also be reduced.
The best way to do that is to enable compression of internode communication.
Refer to [Advanced Internode (RPC) Compression](../operating-scylla/procedures/config-change/advanced-internode-compression.html)
for instructions on how to enable it.

View File

@@ -8,28 +8,22 @@ SSTable Version Support
------------------------
.. list-table::
:widths: 33 33 33
:widths: 50 50
:header-rows: 1
* - SSTable Version
- ScyllaDB Enterprise Version
- ScyllaDB Open Source Version
- ScyllaDB Version
* - 3.x ('ms')
- 2025.4 and above
- None
* - 3.x ('me')
- 2022.2 and above
- 5.1 and above
* - 3.x ('md')
- 2021.1
- 4.3, 4.4, 4.5, 4.6, 5.0
* - 3.0 ('mc')
- 2019.1, 2020.1
- 3.x, 4.1, 4.2
* - 2.2 ('la')
- N/A
- 2.3
* - 2.1.8 ('ka')
- 2018.1
- 2.2
* The supported formats are ``me`` and ``ms``.
* The ``md`` format is used only when upgrading from an existing cluster using
``md``. The ``sstable_format`` parameter is ignored if it is set to ``md``.
* Note: The ``sstable_format`` parameter specifies the SSTable format used for
**writes**. The legacy SSTable formats (``ka``, ``la``, ``mc``) remain
supported for reads, which is essential for restoring clusters from existing
backups.

View File

@@ -9,8 +9,6 @@ ScyllaDB SSTable Format
.. include:: _common/sstable_what_is.rst
In ScyllaDB 6.0 and above, *me* format is enabled by default.
For more information on each of the SSTable formats, see below:
* :doc:`SSTable 2.x <sstable2/index>`

View File

@@ -12,8 +12,6 @@ ScyllaDB SSTable - 3.x
.. include:: ../_common/sstable_what_is.rst
In ScyllaDB 6.0 and above, the ``me`` format is mandatory, and ``md`` format is used only when upgrading from an existing cluster using ``md``. The ``sstable_format`` parameter is ignored if it is set to ``md``.
Additional Information
-------------------------

View File

@@ -42,9 +42,10 @@ the administrator. The tablet load balancer decides where to migrate
the tablets, either within the same node to balance the shards or across
the nodes to balance the global load in the cluster.
The number of tablets the load balancer maintains on a node is directly
proportional to the node's storage capacity. A node with twice
the storage will have twice the number of tablets located on it.
During this process, the balancer will compute disk usage based on the
actual disk sizes of the tablets located on the nodes. It will then issue
tablet migrations which migrate tablets from nodes with higher to nodes
with lower disk utilization, equalizing the load.
As a table grows, each tablet can split into two, creating a new tablet.
The load balancer can migrate the split halves independently to different nodes
@@ -193,12 +194,18 @@ Limitations and Unsupported Features
.. warning::
If a keyspace has tablets enabled, it must remain :term:`RF-rack-valid <RF-rack-valid keyspace>`
throughout its lifetime. Failing to keep that invariant satisfied may result in data inconsistencies,
performance problems, or other issues.
If a keyspace has tablets enabled and contains a materialized view or a secondary index, it must
remain :term:`RF-rack-valid <RF-rack-valid keyspace>` throughout its lifetime. Failing to keep that
invariant satisfied may result in data inconsistencies, performance problems, or other issues.
To enable materialized views and secondary indexes for tablet keyspaces, use
the `--rf-rack-valid-keyspaces` See :ref:`Views with tablets <admin-views-with-tablets>` for details.
The invariant is enforced while the keyspace contains a materialized view or a secondary index, or
if the `rf_rack_valid_keyspaces` option is set, by rejecting operations that would violate the RF-rack-valid property:
Altering a keyspace's replication factor, adding a node in a new rack, or removing the last node
in a rack, will be rejected if they would result in an RF-rack-invalid keyspace.
To enable materialized views and secondary indexes for tablet keyspaces, the keyspace
must be :term:`RF-rack-valid <RF-rack-valid keyspace>`.
See :ref:`Views with tablets <admin-views-with-tablets>` for details.
Resharding in keyspaces with tablets enabled has the following limitations:

View File

@@ -289,7 +289,7 @@ with tablets enabled. Keyspaces using tablets must also remain :term:`RF-rack-va
throughout their lifetime. See :ref:`Limitations and Unsupported Features <tablets-limitations>`
for details.
**The ``initial`` sub-option (deprecated)**
**The** ``initial`` **sub-option (deprecated)**
A good rule of thumb to calculate initial tablets is to divide the expected total storage used
by tables in this keyspace by (``replication_factor`` * 5GB). For example, if you expect a 30TB
@@ -422,7 +422,7 @@ An empty list is allowed, and it's equivalent to numeric replication factor of 0
Altering from a rack list to a numeric replication factor is not supported.
Keyspaces which use rack lists are :term:`RF-rack-valid <RF-rack-valid keyspace>`.
Keyspaces which use rack lists are :term:`RF-rack-valid <RF-rack-valid keyspace>` if each rack in the rack list contains at least one node (excluding :doc:`zero-token nodes </architecture/zero-token-nodes>`).
.. _drop-keyspace-statement:

View File

@@ -42,14 +42,55 @@ Creating a secondary index on a table uses the ``CREATE INDEX`` statement:
create_index_statement: CREATE [ CUSTOM ] INDEX [ IF NOT EXISTS ] [ `index_name` ]
: ON `table_name` '(' `index_identifier` ')'
: [ USING `string` [ WITH OPTIONS = `map_literal` ] ]
: [ USING `string` [ WITH `index_properties` ] ]
index_identifier: `column_name`
:| ( FULL ) '(' `column_name` ')'
index_properties: index_property (AND index_property)*
index_property: OPTIONS = `map_literal`
:| view_property
where `view_property` is any :ref:`property <mv-options>` that can be used when creating
a :doc:`materialized view </features/materialized-views>`. The only exception is `CLUSTERING ORDER BY`,
which is not supported by secondary indexes.
If the statement is provided with a materialized view property, it will not be applied to the index itself.
Instead, it will be applied to the underlying materialized view of it.
For instance::
CREATE INDEX userIndex ON NerdMovies (user);
CREATE INDEX ON Mutants (abilityId);
CREATE INDEX userIndex ON NerdMovies (user);
CREATE INDEX ON Mutants (abilityId);
-- Create a secondary index called `catsIndex` on the table `Animals`.
-- The indexed column is `cats`. Both properties, `comment` and
-- `synchronous_updates`, are view properties, so the underlying materialized
-- view will be configured with: `comment = 'everyone likes cats'` and
-- `synchronous_updates = true`.
CREATE INDEX catsIndex ON Animals (cats) WITH comment = 'everyone likes cats' AND synchronous_updates = true;
-- Create a secondary index called `dogsIndex` on the same table, `Animals`.
-- This time, the indexed column is `dogs`. The property `gc_grace_seconds` is
-- a view property, so the underlying materialized view will be configured with
-- `gc_grace_seconds = 13`.
CREATE INDEX dogsIndex ON Animals (dogs) WITH gc_grace_seconds = 13;
-- The view property `CLUSTERING ORDER BY` is not supported by secondary indexes,
-- so this statement will be rejected by Scylla.
CREATE INDEX bearsIndex ON Animals (bears) WITH CLUSTERING ORDER BY (bears ASC);
View properties of a secondary index have the same limitations as those imposed by materialized views.
For instance, a materialized view cannot be created specifying ``gc_grace_seconds = 0``, so creating
a secondary index with the same property will not be possible either.
Example::
-- This statement will be rejected by Scylla because creating
-- a materialized view with `gc_grace_seconds = 0` is not possible.
CREATE INDEX names ON clients (name) WITH gc_grace_seconds = 0;
-- This statement will also be rejected by Scylla.
-- It's not possible to use `COMPACT STORAGE` with a materialized view.
CREATE INDEX names ON clients (name) WITH COMPACT STORAGE;
The ``CREATE INDEX`` statement is used to create a new (automatic) secondary index for a given (existing) column in a
given table. A name for the index itself can be specified before the ``ON`` keyword, if desired. If data already exists

View File

@@ -20,9 +20,7 @@ command line option when launchgin scylla.
You can define endpoint details in the `scylla.yaml` file. For example:
```yaml
object_storage_endpoints:
- name: s3.us-east-1.amazonaws.com
port: 443
https: true
- name: https://s3.us-east-1.amazonaws.com:443
aws_region: us-east-1
```
@@ -78,9 +76,7 @@ The examples above are intended for development or local environments. You shoul
For the EC2 Instance Metadata Service to function correctly, no additional configuration is required. However, STS requires the IAM Role ARN to be defined in the `scylla.yaml` file, as shown below:
```yaml
object_storage_endpoints:
- name: s3.us-east-1.amazonaws.com
port: 443
https: true
- name: https://s3.us-east-1.amazonaws.com:443
aws_region: us-east-1
iam_role_arn: arn:aws:iam::123456789012:instance-profile/my-instance-instance-profile
```
@@ -100,9 +96,7 @@ in `scylla.yaml`:
```yaml
object_storage_endpoints:
- name: s3.us-east-2.amazonaws.com
port: 443
https: true
- name: https://s3.us-east-2.amazonaws.com:443
aws_region: us-east-2
```

View File

@@ -30,7 +30,7 @@ SELECT * FROM system.role_attributes WHERE role='r' and attribute_name='service
### Service Level Configuration Table
```
CREATE TABLE system_distributed.service_levels (
CREATE TABLE system.service_levels_v2 (
service_level text PRIMARY KEY,
timeout duration,
workload_type text,
@@ -45,14 +45,19 @@ The table column names meanings are:
*shares* - a number that represents this service level priority in relation to other service levels.
```
select * from system_distributed.service_levels ;
select * from system.service_levels_v2 ;
service_level | timeout | workload_type
---------------+---------+---------------
sl | 500ms | interactive
service_level | shares | timeout | workload_type
---------------+--------+---------+---------------
sl | 100 | 500ms | interactive
```
* Please note that `system.service_levels_v2` is used as service levels configuration table only with consistent topology and service levels on Raft,
which is enabled by default but not mandatory yet.
Otherwise, the old `system_distributed.service_levels` is used as configuration table.
For more information see [service levels on Raft](#service-levels-on-raft) section. *
### Service Level Timeout
Service level timeout can be used to assign a default timeout value for all operations for a particular service level.
@@ -190,6 +195,20 @@ The command displays a table with: option name, effective service level the valu
timeout | sl1 | 2s
```
## Service levels on Raft
Since ScyllaDB 6.0 and ScyllaDB Enterprise 2024.2, service levels metadata is managed by Raft group0 and stored in `system.service_levels_v2`.
In existing clusters, service levels metadata is automatically copied from `system_distributed.service_levels` to `system.service_levels_v2`
during the topology upgrade (see [upgrade to raft topology section](topology-over-raft.md#upgrade-from-legacy-topology-to-raft-based-topology)). Because of this, no service levels operations should be done during the topology upgrade.
Before the service levels on Raft, service levels cache was updated in 10s intervals by polling `system_distributed.service_levels` table.
Once the service levels are migrated to raft, the cache is updated on every group0 state apply fiber, if there are any mutations to
`system.service_levels_v2`, `system.role_attributes` or ` system.role_members`.
With service levels on Raft, the cache also has an additional layer, called `service level effective cache`. This layer combines service levels
and auth information and stores effective service level for each role (see [effective service level section](#effective-service-level)).
## Implementation
### Integration with auth

View File

@@ -0,0 +1,111 @@
# Size based load balancing
Until size based load balancing was introduced, ScyllaDB performed disk based balancing
based on disk capacity and tablet count with the assumption that every tablet uses the
same amount of disk space. This means that the number of tablets located on a node was
proportional to the gross disk capacity of that node. Because the used disk space of
different tablets can vary greatly, this could create imbalance in disk utilization.
Size based load balancing aims to achieve better disk utilization accross nodes in a
cluster. The load balancer will continuously gather information about available disk
space and tablet sizes from all the nodes. It then incrementally computes tablet
migration plans which equalize disk utilization accross the cluster.
# Basic operation
The load balancer runs as a background task on the same shard that the topology
coordinator is running on. The topology coordinator collects information needed
for the load balancer to make decisions about which tablets to migrate. This
information is collected periodically by the coordinator from all the nodes in
the cluster in the form of the data structure ``load_stats``. The information in
this struct relevant for the load balancer is stored in its member tablet_stats,
which contains:
- ``tablet_sizes``: the disk size in bytes of all the tablets on the given node.
- ``effective_capacity``: contains the sum of available disk space and all the tablet sizes on the given node.
The balancer will use this information to compute the disk load (disk utilization)
on every node and shard. It will then migrate tablets from the most to the least
loaded nodes and shards.
# Table balance
The secondary goal of the load balancer is to achieve table balance. This means that
the balancer needs to equalize the following ratios:
- disk used by the table on a shard / total disk used by the table in the rack
- effective capacity of a shard / effective capacity of the rack
Otherwise, we could have imbalance where a shard contains more data from a table,
which can potentially overload the CPU of that shard in case the given table is hot.
# ``load_stats`` reconciliation
Because the ``load_stats`` collection interval is 1 minute, by the time the balancer
starts using the information in ``load_stats``, that information can be stale. This can
be caused by tablet migrations or table resize (split or merge). To get around this,
we need to update the information about tablet sizes in ``load_stats`` after a migration
or resize. Issued tablet migrations update this information in ``load_stats`` by also
migrating the tablet size from one host to another. This tablet size migration
reconciliation is performed during the end_migration stage of the tablet migration.
Tablet size reconciliation for split or merge is performed during tablet resize
finalization. For a split, the reconciliation will divide the tablet size, and create
two tablets in place of the original tablet pre-split. For merge, it will accumulate
the tablet sizes of the tablets pre-merge, create a merged tablet size and remove the
tablet sizes pre-merge from ``load_stats``.
# Force capacity based balancing
The load balancer has the ability to fall back on capacity based balancing. This can be
enforced by a config parameter force_capacity_based_balancing. During capacity based
balancing, the balancer will not look up the actual tablet sizes from ``load_stats``,
and will instead assume each tablet size is equal to default_target_tablet_sizes. It will
also use the gross disk capacity (sent in ``load_stats`` struct in the capacity member),
instead of effective capacity.
# Excluding nodes with incomplete tablet sizes
Even with tablet size reconciliation (during migration and table resize), it is still
possible for the tablet size information in ``load_stats`` to not match the current tablet
information found in the tablet map. In order to avoid problems with the balancer
making decisions based on incomplete or incorrect data, size based load balancing
will not balance nodes which have incomplete tablet sizes. Instead, it will ignore
these nodes (these nodes will not be selected as sources or destinations for tablet
migrations), and will wait for correct tablet sizes to arrive after the next ``load_stats``
refresh by the topology coordinator.
One exception to this are nodes which have been excluded from the cluster. These nodes
are down and therefor are not able to send fresh ``load_stats``. But they have to be drained
of their tablets (via tablet rebuild), and the balancer must do this even with incomplete
tablet data. So, only excluded nodes are allowed to have missing tablet sizes.
# Size based balancing cluster feature
During rolling upgrades, it can take hours or even days until all the nodes in a cluster
have been upgraded. This means that the non-upgraded nodes will send ``load_stats`` without
tablet sizes. Considering the balancer ignores these nodes during balancing, we would
have a problem with some of the nodes not being load balanced for extended periods of
time. To avoid this, size based load balancing is only enabled after the cluster feature
is enabled. Until that time, the balancer will fall back on capacity based balancing.
# Minimal tablet size
After a table is created and it begins to receive data, there can be a period of time
where the data in some of the tablets has been flushed, while others have not. During
this time, the load balancer will migrate seemingly empty (but actually not yet flushed)
tablet away from the nodes where they have been created. Later, when all the tablets
have been flushed, the balancer will migrate the tablets again. In order to avoid these
unnecessary early migrations, we introduce the idea of minimal tablet size. The balancer
will treat any tablet smaller than minimal tablet size as having a size of minimal
tablet size. This reduces early migrations.
# Balance threshold percentage
The balancer considers if a set nodes are balanced by computing the delta of disk load
between the most loaded and least loaded node, and dividing it with the load of the most
loaded node:
delta = (most_loaded - least_loaded) / most_loaded
If this computed value is below a threshold, the nodes are considered balanced. This threshold
can be configured with the ``size_based_balance_threshold_percentage`` config option.

View File

@@ -114,9 +114,16 @@ in `./testlog`. Scylla data files are stored in `/tmp`.
There are several test directories that are excluded from orchestration by `test.py`:
- test/boost
- test/raft
- test/ldap
- test/raft
- test/unit
- test/vector_search
- test/vector_search_validator
- test/alternator
- test/broadcast_tables
- test/cql
- test/cqlpy
- test/rest_api
This means that `test.py` will not run tests directly, but will delegate all work to `pytest`.
That's why all these directories do not have `suite.yaml` files.

View File

@@ -862,6 +862,9 @@ From the admin's point of view, the steps are as follows:
or via observing the logs
- After all nodes report `done` via the GET endpoint, the upgrade has fully finished
Note that during the upgrade no service levels or auth operations should be done,
as those services are performing migrations to raft metadata.
The `upgrade_state` static column in `system.topology` serves the key role
in coordinating the upgrade. It goes through the following states in the following
order:

View File

@@ -1,5 +1,5 @@
Install ScyllaDB
=================
Install ScyllaDB |CURRENT_VERSION|
=====================================
.. toctree::
:maxdepth: 2

View File

@@ -44,9 +44,10 @@ you want to install.
**Example**
.. code:: console
.. code-block:: console
:substitutions:
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-version 2025.1.1
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-version |CURRENT_VERSION|
Versions Earlier than 2025.1
@@ -64,13 +65,5 @@ For example:
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-product scylla-enterprise --scylla-version 2024.1
To install a supported version of *ScyllaDB Open Source*, run the command with
the ``--scylla-version`` option to specify the version you want to install.
For example:
.. code:: console
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-version 6.2.1
.. include:: /getting-started/_common/setup-after-install.rst

View File

@@ -15,7 +15,7 @@ All releases are available as a Docker container, EC2 AMI, GCP, and Azure images
By *supported*, it is meant that:
- A binary installation package is available to `download <https://www.scylladb.com/download/>`_.
- A binary installation package is available.
- The download and install procedures are tested as part of the ScyllaDB release process for each version.
- An automated install is included from :doc:`ScyllaDB Web Installer for Linux tool </getting-started/installation-common/scylla-web-installer>` (for the latest versions).

View File

@@ -6,8 +6,6 @@
* :doc:`ScyllaDB SStable </operating-scylla/admin-tools/scylla-sstable>` - Validates and dumps the content of SStables, generates a histogram, dumps the content of the SStable index.
* :doc:`ScyllaDB Types </operating-scylla/admin-tools/scylla-types/>` - Examines raw values obtained from SStables, logs, coredumps, etc.
* :doc:`cassandra-stress </operating-scylla/admin-tools/cassandra-stress/>` A tool for benchmarking and load testing a ScyllaDB and Cassandra clusters.
* :doc:`SSTabledump </operating-scylla/admin-tools/sstabledump>`
* :doc:`SSTableMetadata </operating-scylla/admin-tools/sstablemetadata>`
* scylla local-file-key-generator - Generate a local file (system) key for :doc:`encryption at rest </operating-scylla/security/encryption-at-rest>`, with the provided length, Key algorithm, Algorithm block mode and Algorithm padding method.
* `scyllatop <https://www.scylladb.com/2016/03/22/scyllatop/>`_ - A terminal base top-like tool for scylladb collectd/prometheus metrics.
* :doc:`scylla_dev_mode_setup</getting-started/installation-common/dev-mod>` - run ScyllaDB in Developer Mode.

View File

@@ -14,8 +14,6 @@ Admin Tools
ScyllaDB Types </operating-scylla/admin-tools/scylla-types/>
sstableloader
cassandra-stress </operating-scylla/admin-tools/cassandra-stress/>
sstabledump
sstablemetadata
ScyllaDB Logs </getting-started/logging/>
perftune
Virtual Tables </operating-scylla/admin-tools/virtual-tables/>

View File

@@ -4,21 +4,14 @@ ScyllaDB SStable
Introduction
-------------
ScyllaDB SStable is a ScyllaDB-native tool for examining and manipulating SStables, written in C++ and built on top of the ScyllaDB codebase.
The tool is built-into the ScyllaDB executable and can be invoked via ``scylla sstable``.
This tool allows you to examine the content of SStables by performing operations such as dumping the content of SStables,
validating the content of SStables, and more. See `Supported Operations`_ for the list of available operations.
Run ``scylla sstable --help`` for additional information about the tool and the operations.
This tool is similar to SStableDump_, with notable differences:
* Built on the ScyllaDB C++ codebase, it supports all SStable formats and components that ScyllaDB supports.
* Expanded scope: this tool supports much more than dumping SStable data components (see `Supported Operations`_).
* More flexible on how schema is obtained and where SStables are located: SStableDump_ only supports dumping SStables located in their native data directory. To dump an SStable, one has to clone the entire ScyllaDB data directory tree, including system table directories and even config files. ``scylla sstable`` can dump sstables from any path with multiple choices on how to obtain the schema, see Schema_.
``scylla sstable`` was developed to supplant SStableDump_ as ScyllaDB-native tool, better tailored for the needs of ScyllaDB.
.. _SStableDump: /operating-scylla/admin-tools/sstabledump
Usage
------

View File

@@ -1,53 +0,0 @@
SSTabledump
============
.. warning:: SSTabledump is deprecated since ScyllaDB 5.4, and will be removed in the next release.
Please consider switching to :doc:`ScyllaDB SSTable </operating-scylla/admin-tools/scylla-sstable>`.
This tool allows you to converts SSTable into a JSON format file.
If you need more flexibility or want to dump more than just the data-component, see :doc:`scylla-sstable </operating-scylla/admin-tools/scylla-sstable>`.
Use the full path to the data file when executing the command.
For example:
.. code-block:: shell
sstabledump /var/lib/scylla/data/keyspace1/standard1-7119946056b211e98e85000000000001/mc-12-big-Data.db
Example output:
.. code-block:: shell
[
{
"partition" : {
"key" : [ "3137343334334f4f4d30" ],
"position" : 0
},
"rows" : [
{
"type" : "static_block",
"position" : 281,
"cells" : [
{ "name" : "\"C0\"", "value" : "0xb7789e7cdf2af541061f207714a3b8e14c72f74e663bd5c2577ac329bcb3161cf10c", "tstamp" : "2019-04-04T08:22:24.336001Z" },
{ "name" : "\"C1\"", "value" : "0xe8ed77f078a23e37f8a7246ccd8cd4099585c7031e242529e5070246860d7a1b1e85", "tstamp" : "2019-04-04T08:22:24.336001Z" },
{ "name" : "\"C2\"", "value" : "0x3b836d4333d2d5a02a63ced47596bfb5f80ecb8e80686061c3daaba87380994b7b61", "tstamp" : "2019-04-04T08:22:24.336001Z" },
{ "name" : "\"C3\"", "value" : "0x9220219581df87ff131306b8bf793c14ae8ebf8c8af1b638827ebfcab85660a378b8", "tstamp" : "2019-04-04T08:22:24.336001Z" },
{ "name" : "\"C4\"", "value" : "0x5b4c972cdeb330035b82dc0b1daa9051fff7956d45e3c6c2b21dfb1fd2bb43fb1146", "tstamp" : "2019-04-04T08:22:24.336001Z" }
]
}
]
**NOTE:**
When running as a user that is not ``root`` or ``scylla`` an error (java traceback) might be observed.
To work around the error, please use the following syntax:
.. code-block:: sh
cassandra_storagedir="/tmp" sstabledump [filename]

View File

@@ -1,55 +0,0 @@
SSTableMetadata
===============
.. warning:: SSTableMetadata is deprecated since ScyllaDB 5.4, and will be removed in the next release.
Please consider switching to :ref:`scylla sstable dump-statistics` and :ref:`scylla sstable dump-summary`.
SSTableMetadata prints metadata in ``Statistics.db`` and ``Summary.db`` about the specified SSTables to the console.
Use the full path to the data file when executing the command.
For example:
.. code-block:: console
$ sstablemetadata /var/lib/scylla/data/keyspace1/standard1-e6a565803a5e11ee83de7d84e184e393/me-9-big-Data.db
SSTable: /var/lib/scylla/data/keyspace1/standard1-e6a565803a5e11ee83de7d84e184e393/me-9-big-Data.db
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Bloom Filter FP chance: 0.010000
Minimum timestamp: 1691989005063001
Maximum timestamp: 1691989010554004
SSTable min local deletion time: 2147483647
SSTable max local deletion time: 2147483647
Compressor: -
TTL min: 0
TTL max: 0
First token: -9220288174854979040 (key=4e374b384d4b39313930)
Last token: -5881915680038312597 (key=324b4e31343339373231)
minClustringValues: []
maxClustringValues: []
Estimated droppable tombstones: 0.0
SSTable Level: 0
Repaired at: 0
Replay positions covered: {}
totalColumnsSet: 22895
totalRows: 4579
originatingHostId: 6a684ae1-7b76-44cc-9e07-32975af0e60e
Estimated tombstone drop times:Count Row Size Cell Count
1 0 0
2 0 0
3 0 0
4 0 0
5 0 4579
6 0 0
7 0 0
8 0 0
...
1414838745986 0
Estimated cardinality: 42
EncodingStats minTTL: 0
EncodingStats minLocalDeletionTime: 1442880000
EncodingStats minTimestamp: 1691988990337000
KeyType: org.apache.cassandra.db.marshal.BytesType
ClusteringTypes: []
StaticColumns: {}
RegularColumns: {C3:org.apache.cassandra.db.marshal.BytesType, C4:org.apache.cassandra.db.marshal.BytesType, C0:org.apache.cassandra.db.marshal.BytesType, C1:org.apache.cassandra.db.marshal.BytesType, C2:org.apache.cassandra.db.marshal.BytesType}

View File

@@ -111,9 +111,7 @@ should follow this format:
.. code-block:: yaml
object_storage_endpoints:
- name: <endpoint_address_or_domain_name>
port: <port_number>
https: <true_or_false> # optional
- name: http[s]://<endpoint_address_or_domain_name>[:<port_number>]
aws_region: <region_name> # optional, e.g. us-east-1
iam_role_arn: <iam_role> # optional
@@ -123,9 +121,7 @@ Example:
.. code:: yaml
object_storage_endpoints:
- name: s3.us-east-1.amazonaws.com
port: 443
https: true
- name: https://s3.us-east-1.amazonaws.com
aws_region: us-east-1
iam_role_arn: arn:aws:iam::123456789012:instance-profile/my-instance-instance-profile
@@ -267,13 +263,15 @@ Use ``scylla --help`` to get the list of experimental features.
Views with Tablets
------------------
Materialized Views (MV) and Secondary Indexes (SI) are enabled in keyspaces that use tablets
only when :term:`RF-rack-valid keyspaces <RF-rack-valid keyspace>` are enforced. That can be
done in the ``scylla.yaml`` configuration file by specifying
Materialized Views (MV) and Secondary Indexes (SI) are supported in keyspaces that use tablets
only when the keyspaces are :term:`RF-rack-valid <RF-rack-valid keyspace>`.
.. code-block:: yaml
When a keyspace contains a Materialized View or Secondary Index, some operations are restricted to maintain
the RF-rack condition. The following actions are not allowed while the view or index is present:
rf_rack_valid_keyspaces: true
* Altering the keyspace's replication factor to a value that would violate the RF-rack-valid property
* Adding a node in a new rack in an existing datacenter
* Removing the last node in a rack
Monitoring

View File

@@ -102,7 +102,6 @@ Other Tools
ScyllaDB has various other tools, mainly to work with sstables.
If you are diagnosing a problem that is related to sstables misbehaving or being corrupt, you may find these useful:
* :doc:`sstabledump </operating-scylla/admin-tools/sstabledump/>`
* :doc:`ScyllaDB SStable </operating-scylla/admin-tools/scylla-sstable/>`
* :doc:`ScyllaDB Types </operating-scylla/admin-tools/scylla-types/>`

View File

@@ -49,6 +49,13 @@ Procedure
* **seeds** - Specifies the IP address of an existing node in the cluster. The new node will use this IP to connect to the cluster and learn the cluster topology and state.
.. warning::
Adding a node in a new rack in an existing datacenter may violate the :term:`RF-rack-valid <RF-rack-valid keyspace>` constraints of some keyspace.
If a keyspace uses tablets and contains a Materialized View or Secondary Index, or if the `rf_rack_valid_keyspaces` option is set,
the invariant is enforced for the keyspace, and the new node will be rejected if adding it would violate the invariant.
#. Start the nodes with the following command:
.. include:: /rst_include/scylla-commands-start-index.rst

View File

@@ -30,6 +30,13 @@ Removing a Running Node
.. include:: /operating-scylla/_common/decommission_warning.rst
.. warning::
Removing a node in a rack may violate the :term:`RF-rack-valid <RF-rack-valid keyspace>` constraints of some keyspace.
If a keyspace uses tablets and contains a Materialized View or Secondary Index, or if the `rf_rack_valid_keyspaces` option is set,
the invariant is enforced for the keyspace, and the node removal will be rejected if it would violate the invariant.
#. Run the ``nodetool netstats`` command to monitor the progress of the token reallocation.
#. Run the ``nodetool status`` command to verify that the node has been removed.

View File

@@ -89,7 +89,7 @@ Enabling Experimental Features
There are two ways to enable experimental features:
* Enable all experimental features (which may be risky)
* Enable only the features you want to try (available in Scylla Open Source from version 3.2)
* Enable only the features you want to try
Enable All Experimental Features
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

View File

@@ -154,7 +154,7 @@ is holding your keys. You can use the following options:
* - Local Key Provider
- ``LocalFileSystemKeyProviderFactory`` (**default**)
- Stores the key on the same machine as the data.
* - Replicated Key Provider
* - Replicated Key Provider (**deprecated**)
- ``ReplicatedKeyProviderFactory``
- Stores table keys in a ScyllaDB table where the table itself is encrypted
using the system key.
@@ -183,13 +183,14 @@ Local Key Provider
The Local Key Provider is less safe than other options and as such it is not
recommended for production use. It is the default key provider for the
node-local encryption configuration in ``scylla.yaml`` because it does not
require any external resources. In production environments, it is recommended
to use an external KMS instead.
node-local encryption configuration in ``scylla.yaml`` and table encryption
because it does not require any external resources.
In production environments, it is recommended to use an external KMS instead.
The Local Key Provider is the default key provider for the node-local encryption
configuration in ScyllaDB (``user_info_encryption`` and ``system_info_encryption``
in ``scylla.yaml``). It stores the encryption keys locally on disk in a text file.
in ``scylla.yaml``) as well as table encryption.
It stores the encryption keys locally on disk in a text file.
The location of this file is specified in ``scylla.yaml``, or in the table schema.
The user has the option to generate the key(s) themselves, or let ScyllaDB
generate the key(s) for them.
@@ -210,16 +211,18 @@ Replicated Key Provider
.. note::
**Warning**: The replicated key provider is deprecated and will be removed
in a future ScyllaDB release.
The Replicated Key Provider is not recommended for production use because it
does not support key rotation. For compatibility with DataStax Cassandra, it
is the default key provider for per-table encryption setup. In production
environments, an external KMS should be used instead.
The Replicated Key Provider is the default key provider for per-table encryption
setup in ScyllaDB (``scylla_encryption_options`` in table schema). It stores and
distributes the encryption keys across every node in the cluster through a
special ScyllaDB system table (``system_replicated_keys.encrypted_keys``). The
Replicated Key Provider requires two additional keys to operate:
The Replicated Key Provider stores and distributes the encryption keys across
every node in the cluster through a special ScyllaDB system table
(``system_replicated_keys.encrypted_keys``). The Replicated Key Provider
requires two additional keys to operate:
* A system key - used to encrypt the data in the system table. The system key
can be either a local key, or a KMIP key.
@@ -302,7 +305,7 @@ Depending on your key provider, you will either have the option to allow
ScyllaDB to generate an encryption key, or you will have to provide one:
* Local Key Provider - you can provide your own keys, otherwise ScyllaDB will generate them for you
* Replicated Key Provider - you must generate a system key yourself
* Replicated Key Provider - you must generate a system key yourself (**deprecated**)
* KMIP Key Provider - you can provide your own keys, otherwise ScyllaDB will generate them for you
* KMS Key Provider - you must generate a key yourself in AWS
* GCP Key Provider - you must generate a key yourself in GCP
@@ -432,6 +435,8 @@ desired key provider:
You cannot use the same key for both the system key and the local
secret key. They must be different keys.
**Warning**: The replicated key provider is deprecated and will be removed in a future ScyllaDB release.
.. group-tab:: KMIP Key Provider
The KMIP Key Provider will first try to discover existing keys in the KMIP
@@ -994,6 +999,8 @@ in the ``scylla.yaml`` file.
The Replicated Key Provider cannot be used in ``user_info_encryption``.
You can only use it to :ref:`Encrypt a Single Table <ear-create-table>`.
**Warning**: The replicated key provider is deprecated and will be removed in a future ScyllaDB release.
.. group-tab:: KMIP Key Provider
* Make sure to :ref:`set up a KMIP Host <encryption-at-rest-set-kmip>`.
@@ -1075,6 +1082,8 @@ in the ``scylla.yaml`` file.
The Replicated Key Provider cannot be used in ``user_info_encryption``.
You can only use it to :ref:`Encrypt a Single Table <ear-create-table>`.
**Warning**: The replicated key provider is deprecated and will be removed in a future ScyllaDB release.
.. group-tab:: KMIP Key Provider
.. code-block:: yaml
@@ -1296,6 +1305,8 @@ This procedure demonstrates how to encrypt a new table.
.. group-tab:: Replicated Key Provider
**Warning**: The replicated key provider is deprecated and will be removed in a future ScyllaDB release.
* Ensure you have a system key. The system key can be either a local key,
or a KMIP key. If you don't have a system key, create one by following
the procedure in :ref:`Create Encryption Keys <ear-create-encryption-key>`.
@@ -1397,6 +1408,7 @@ This procedure demonstrates how to encrypt a new table.
;
.. group-tab:: Replicated Key Provider
**Warning**: The replicated key provider is deprecated and will be removed in a future ScyllaDB release.
.. code-block:: cql
@@ -1821,6 +1833,8 @@ Once this encryption is enabled, it is used for all system data.
The Replicated Key Provider cannot be used for system encryption. You can
only use it to :ref:`Encrypt a Single Table <ear-create-table>`.
**Warning**: The replicated key provider is deprecated and will be removed in a future ScyllaDB release.
.. group-tab:: KMIP Key Provider
* Make sure to :ref:`set up a KMIP Host <encryption-at-rest-set-kmip>`.
@@ -1902,6 +1916,8 @@ Once this encryption is enabled, it is used for all system data.
The Replicated Key Provider cannot be used for system encryption. You
can only use it to :ref:`Encrypt a Single Table <ear-create-table>`.
**Warning**: The replicated key provider is deprecated and will be removed in a future ScyllaDB release.
.. group-tab:: KMIP Key Provider
.. code-block:: yaml
@@ -2127,6 +2143,8 @@ varies depending on the key provider you are using.
The Replicated Key Provider does not support key rotation. If you need to
rotate keys, you must migrate to a different key provider.
**Warning**: The replicated key provider is deprecated and will be removed in a future ScyllaDB release.
.. group-tab:: KMIP Key Provider
.. DSE docs: https://docs.datastax.com/en/dse/6.9/securing/configure-kmip-encryption.html?#secRekeyKMIP

Some files were not shown because too many files have changed in this diff Show More