Change the prometheus_allow_protobuf configuration to true by default.
This allows ScyllaDB server to serve Prometheus protobuf format (enables
native histogram support) if asked so by the monitoring server.
Update config help text/docs to reflect protobuf support (drop
“experimental” wording).
Add cluster tests to validate the default is enabled, can be overridden,
and /metrics returns protobuf when requested via Accept header (and
falls back to text when disabled).
Fixes#27817
co-Author: mykaul <mykaul@scylladb.com>
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
We switched to the size-based load balancing, which now has more
strict requirements for load stats. We no longer need only per-node
stats, but also per-tablet stats.
Bootstrapping a node triggers stats refresh, but allocating tablets on
table creation didn't. So after creating a table, load balancer
couldn't make progress for up to 60s (stats refresh period).
This makes tests take longer, and can even cause failures if tests are
using a low-enough timeout.
Fixes https://github.com/scylladb/scylladb/issues/27921
No backport becuse only master is vulnerable (size-based load balancing).
Closesscylladb/scylladb#27926
* https://github.com/scylladb/scylladb:
test: cluster: Add reproducer for missed notification in topology coordinator
topology_coordinator: Wake up the state machine after stats refresh
topology_coordinator: Move tablet_load_stats_refresh_before_rebalancing injection earlier
topology_coordinator: Fix potential missed notification
topology_coordinator: Refresh load stats after table is created or altered
tablets: Do a group0 read barrier on tablet load stats refresh
topology_coordinator: Ensure stats are refreshed in the gossip scheduling group
test: Use ManagerClient.{disable,enable}_tablet_balancing()
test: Add missing calls to disable_tablet_balancing() in tests which use move_tablet() API
test: pylib: Introduce ManagerClient.{disable,enable}_tablet_balancing()
Such rebuild has no read_from replica, but we know the tablet size will be 0.
If we don't, stats will be incomplete until the next refresh.
This is important for test cases which do removenode or replace while
all replicas are down. So for example test_replace from
test_tablets_removenode.py, which uses RF=1 and replaces a node.
Without this, the test waits for 60s needlessly after the first round
of rebuilding migrations before scheduling more migrations. This can
cause the test to time out.
Fixes#28115Closesscylladb/scylladb#28121
Most likely 817fdad uncovered the fact that our choice of primary replica was resonating with tablet allocation and we were ending up picking the same replica as primary within a scope instead of rotating primaryship among all replicas in the scope.
This created situations where for instance, restoring into a 9 nodes with primary_replica_only=true would put all data into 3 nodes, leaving the other 6 unused. The balancing of the dataset was performed by the subsequent repair step.
This PR fixes this by changing the formula for picking up the primary replica out of a set of eligible replicas from within the passed scope.
The PR also extends the testing scenarios in `test_backup.py` so we get to run restore for a set of topologies, for all combinations of scope, primary_replica_only and min_tablet_counts.
Most of the work was done by @bhalevy [here](https://github.com/scylladb/scylladb/compare/master...bhalevy:scylla:load-balance-primary-replica), this PR just splitted it and did touchups here and there.
Fixes#27281Closesscylladb/scylladb#27397
* github.com:scylladb/scylladb:
test: reduce dataset and number of test cases or debug builds
test: bump repair timeout up, it's sometimes not enough in CI
test: refactor test_refresh.py to match test_restore_with_streaming_scopes.
test: extend test_restore_with_streaming_scopes
test: Adjust test_restore_primary_replica_different_dc_scope_all
test: Refactor restoring code in test_backup to match SM pattern
test: add check_mutation_replicas calls after fresh creation of dataset
test: extend create_dataset to accept consistency_level
test: refactor check_mutation_replicas so it's more readable
test: make create_dataset async and refactor so it's configurable
test: use defaultdict in collect_mutations
test: add log marks to facilitate reusing server for restore
locator: tablets: Distribute data evenly among primary replicas during restore
To configure S3 storage, one needs to do
```
object_storage_endpoints:
- name: s3.us-east-1.amazonaws.com
port: 443
https: true
aws_region: us-east-1
```
and for GCS it's
```
object_storage_endpoints:
- name: https://storage.googleapis.com:433
type: gs
credentials_file: <gcp account credentials json file>
```
This PR updates the S3 part to look like
```
object_storage_endpoints:
- name: https://s3.us-east-1.amazonaws.com:443
aws_region: us-east-1
```
fixes: #26570
This is 2nd attempt, previous one (#27360) was reverted because it reported endpoint configs in new format via API and CQL always, even if the endpoint was configured in the old way. This "broke" scylla manager and some dtests. This version has this bug fixed, and endpoints are reported in the same format as they were configured with.
About correctness of the changes.
No modifications to existing tests are made here, so old format is respected correctly (as far as it's covered by tests). To prove the new format works the the test_get_object_store_endpoints is extended to validate both options. Some preparations to this test to make this happen come on their own with the PR #28111 to show that they are valid and pass before changing the core code.
Enhancing the way configuration is made, likely no need to backport.
Closesscylladb/scylladb#28112
* github.com:scylladb/scylladb:
test: Validate S3 endpoints new format works
docs: Update docs according to new endpoints config option format
object_storage: Create s3 client with "extended" endpoint name
s3/storage: Tune config updating
sstable: Shuffle args for s3_client_wrapper
test: Rename badconf variable into objconf
test: Split the object_store/test_get_object_store_endpoints test
When a counter write times out (due to rpc::timeout_error or timed_out_error),
the code was throwing mutation_write_timeout_exception but not marking the
write_timeouts metric. This resulted in counter write timeouts not being
counted in the scylla_storage_proxy_coordinator_write_timeouts metric.
Regular writes go through mutate_internal -> mutate_end, which catches
mutation_write_timeout_exception and marks the metric. However, counter
writes use a separate code path (mutate_counters) that has its own
exception handling but was missing the metric update.
This fix adds get_stats().write_timeouts.mark() before throwing the
timeout exception in the counter write path, consistent with how the
CAS path handles cas_write_timeouts.
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-245
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#28019
A replaced node may have pending operation on it. The replace operation
will move the node into the 'left' state and the request will never be
completed. More over the code does not expect left node to have a
request. It will try to process the request and will crash because the
node for the request will not be found.
The patch checks is the replaced node has peening request and completes
it with failure. It also changes topology loading code to skip requests
for nodes that are in a left state. This is not strictly needed, but
makes the code more robust.
Fixes#27990Closesscylladb/scylladb#28009
Refs #22733.
Adds runtime warning and docs info that replicated provider is deprecated and will be removed.
Fixes#27292Closesscylladb/scylladb#27270
* github.com:scylladb/scylladb:
docs::encryption: Add warning that replicated provider is deprecated
ent::encryption: Switch default key provider from replicated to local
replicated_key_provider: Add deprecation warning on usage
`test_schema_versioning_with_recovery` is currently flaky. It performs
a write with CL=ALL and then checks if the schema version is the same on
all nodes by calling `verify_table_versions_synced`. All nodes are expected
to sync their schema before handling the replica write. The node in
RECOVERY mode should do it through a schema pull, and other nodes should do
it through a group 0 read barrier.
The problem is in `verify_local_schema_versions_synced` that compares the
schema versions in `system.local`. The node in RECOVERY mode updates the
schema version in `system.local` after it acknowledges the replica write
as completed. Hence, the check can fail.
We fix the problem by making the function wait until the schema versions
match.
Note that RECOVERY mode is about to be retired together with the whole
gossip-based topology in 2026.2. So, this test is about to be deleted.
However, we still want to fix it, so that it doesn't bother us in older
branches.
Fixes#23803Closesscylladb/scylladb#28114
Problem
-------
Secondary indexes are implemented via materialized views under the
hood. The way an index behaves is determined by the configuration
of the view. Currently, it can be modified by performing the CQL
statement `ALTER MATERIALIZED VIEW` on it. However, that raises some
concerns.
Consider, for instance, the following scenario:
1. The user creates a secondary index on a table.
2. In parallel, the user performs writes to the base table.
3. The user modifies the underlying materialized view, e.g. by setting
the `synchronous_updates` to `true` [1].
Some of the writes that happened before step 3 used the default value
of the property (which is `false`). That had an actual consequence
on what happened later on: the view updates were performed
asynchronously. Only after step 3 had finished did it change.
Unfortunately, as of now, there is no way to avoid a situation like
that. Whenever the user wants to configure a secondary index they're
creating, they need to do it in another schema change. Since it's
not always possible to control how the database is manipulated in
the meantime, it leads to problems like the one described.
That's not all, though. The fact that it's not possible to configure
secondary indexes is inconsistent with other schema entities. When
it comes to tables or materialized views, the user always have a means
to set some or even all of the properties during their creation.
Solution
--------
The solution to this problem is extending the `CREATE INDEX` CQL
statement by view properties. The syntax is of form:
```
> CREATE INDEX <index name>
> .. ON <keyspace>.<table> (<columns>)
> .. WITH <properties>
```
where `<properties>` corresponds to both index-specific and view
properties [2, 3]. View properties can only be used with indexes
implemented with materialized views; for example, it will be impossible
to create a vector index when specifying any view property (see
examples below).
When a view property is provided, it will be applied when creating the
underlying materialized view. The behavior should be similar to how
other CQL statements responsible for creating schema entities work.
High-level implementation strategy
----------------------------------
1. Make auxiliary changes.
2. Introduce data structures representing the new set of index
properties: both index-specific and those corresponding to the
underlying view.
3. Extend `CREATE INDEX` to accept view properties.
4. Extend `DESCRIBE INDEX` and other `DESCRIBE` statements to include
view properties in their output.
User documentation is also updated at the steps to reflect the
corresponding changes.
Implementation considerations
-----------------------------
There are a number of schema properties that are now obsolete. They're
accepted by other CQL statements, but they have no effect. They
include:
* `index_interval`
* `replicate_on_write`
* `populate_io_cache_on_flush`
* `read_repair_chance`
* `dclocal_read_repair_chance`
If the user tries to create a secondary index specifying any of those
keywords, the statement will fail with an appropriate error (see
examples below).
Unlike materialized views, we forbid specifying the clustering order
when creating a secondary index [4]. This limitation may be lifted
later on, but it's a detail that may or may not prove troublesome. It's
better to postpone covering it to when we have a better perspective on
the consequences it would bring.
Examples
--------
Good examples
```
> CREATE INDEX idx ON ks.t (v);
> CREATE INDEX idx ON ks.t (v) WITH comment = 'ok view property';
> CREATE INDEX idx ON ks.t (v)
.. WITH comment = 'multiple view properties are ok'
.. AND synchronous_updates = true;
> CREATE INDEX idx ON ks.t (v)
.. WITH comment = 'default value ok'
.. AND synchronous_updates = false;
```
Bad examples
```
> CREATE INDEX idx ON ks.t (v) WITH replicate_on_write = true;
SyntaxException: Unknown property 'replicate_on_write'
> CREATE INDEX idx ON ks.t (v)
.. WITH OPTIONS = {'option1': 'value1'}
.. AND comment = 'some text';
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Cannot specify options for a non-CUSTOM index"
> CREATE CUSTOM INDEX idx ON ks.t (v)
.. WITH OPTIONS = {'option1': 'value1'}
.. AND comment = 'some text';
InvalidRequest: Error from server: code=2200 [Invalid query]
message="CUSTOM index requires specifying the index class"
> CREATE CUSTOM INDEX idx ON ks.t (v)
.. USING 'vector_index'
.. WITH OPTIONS = {'option1': 'value1'}
.. AND comment = 'some text';
InvalidRequest: Error from server: code=2200 [Invalid query]
message="You cannot use view properties with a vector index"
> CREATE INDEX idx ON ks.t (v) WITH CLUSTERING ORDER BY (v ASC);
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Indexes do not allow for specifying the clustering order"
```
and so on. For more examples, see the relevant tests.
References:
[1] https://docs.scylladb.com/manual/branch-2025.4/cql/cql-extensions.html#synchronous-materialized-views
[2] https://docs.scylladb.com/manual/branch-2025.4/cql/secondary-indexes.html#create-index
[3] https://docs.scylladb.com/manual/branch-2025.4/cql/mv.html#mv-options
[4] https://docs.scylladb.com/manual/branch-2025.4/cql/dml/select.html#ordering-clauseFixesscylladb/scylladb#16454
Backport: not needed. This is an enhancement.
Closesscylladb/scylladb#24977
* github.com:scylladb/scylladb:
cql3: Extend DESC INDEX by view properties
cql3: Forbid using CLUSTERING ORDER BY when creating index
cql3: Extend CREATE INDEX by MV properties
cql3/statements/create_index_statement: Allow for view options
cql3/statements/create_index_statement: Rename member
cql3/statements/index_prop_defs: Re-introduce index_prop_defs
cql3/statements/property_definitions: Add extract_property()
cql3/statements/index_prop_defs.cc: Add namespace
cql3/statements/index_prop_defs.hh: Rename type
cql3/statements/view_prop_defs.cc: Move validation logic into file
cql3/statements: Introduce view_prop_defs.{hh,cc}
cql3/statements/create_view_statement.cc: Move validation of ID
schema/schema.hh: Do not include index_prop_defs.hh
A data_sink that stores buffers into an in-memory collection had
appeared in seastar recently. In Scylla there's similar thing that uses
memory_data_sink_buffer as a container, so it's possible to drop the
data_sink_impl iself in favor of seastar implementation.
For that to work there should be append_buffers() overload for the
aforementioned container. For its nice implementation the container, in
turn, needs to get push_back() method and value_type trait. The method
already exists, but is called put(), so just rename it. There's one more
user of it this method in S3 client, and it can enjoy the added
append_buffers() helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28124
The instructions for building optimized clang neglected to mention
that the clang version to be built must be specified. Correct that.
Closesscylladb/scylladb#28135
fmt::localtime() is now deprecated, users should migrate to equivalents
from the standard libraries.
std::localtime is not thread safe, so a local wrapper is introduced,
based on the thread-safe localtime_r() (from libc).
Closesscylladb/scylladb#27821
Read timeouts are a common occurence and they typically occur when the replica is overloaded. So throwing exceptions for read timeouts is very harmful. Be careful not to thow exceptions while propagating them up the future chain. Add a test to enfore and detect regressions.
Fixes: scylladb/scylladb#25062
Improvement, normally not a backport candidate, but we may decide to backport if customer(s) are found to suffer from this.
Closesscylladb/scylladb#25068
* github.com:scylladb/scylladb:
reader_permit: remove check_abort()
test/boost/database_test: add test for read timeout exceptions
sstables/mx/reader: don't throw exceptions on the read-path
readers/multishard: don't throw exceptions on the read-path
replica/table: don't throw exceptions on the read-path
multishard_mutation_query: fix indentation
multishard_mutation_query: don't throw exceptions on the read-path
service/storage_proxy: don't throw exceptions on the full-scan path
cql3/query_processor: don't throw exceptions on the read-path
reader_permit: add get_abort_exception()
At the end of the test case, the framework greps logs for errors and
backtraces. The servers are still running at this point. Some test
cases enable debug-level logging. If servers manage to produce new
lines between the python script processes them, the grep will never
return.
Protect against this by grepping over a file snapshot.
Fixes#28086Closesscylladb/scylladb#28088
This patch adds a new document, docs/alternator/network.md,
explaining the various mechanisms that can be used to reduce
network usage in Alternator. It explains compression of requests
and responses, header reduction, rack-aware routing, and RPC compression.
Many of these topics - especially support in the client libraries -
are work in progress, so some details are still missing in the new
document. Still, I think it is a good start that can be improved
later.
Fixes#27915.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27927
Extend the test_get_object_store_endpoints() test to configure S3
endpoints in full-url format and check that they are rendered properly
via API/CQL.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
For this, add the s3::client::make(endpoint, ...) overload that accepts
endpoint in proto://host:port format. Then it parses the provided url
and calls the legacy one, that accepts raw host string and config with
port, https bit, etc.
The generic object_storage_endpoint_param no longer needs to carry the
internal s3::endpoint_config, the config option parsing changes
respectively.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Don't prepare s3::endpoint_config from generic code, jut pass the region
and iam_role_arn (those that can potentially change) to the callback.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it construct like gs_client_wrapper -- with generic endpoint param
reference and make the storage-specific casts/gets/whatever internally.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It tests two things -- the way object storage config is represented via
API and CQL (from sytem.config) and that updating config affects CREATE
KEYSPACE CQL (with keyspace storage options)
It's better to split the test, as its former part is going to be
extented to validate old/new config formats (see #26570)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
to test restoring with a different min_tablet_count
than the schema was originally created with.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
This patch refactors the restoring code in cluster/test_backup.py
so it matches better the way SM works.
The patch also refactors test_restore_with_streaming_scopes so to
facilitate running restore scenarios under all supported scopes
with or w/o primary_replica_only enabled by reusing the servers
and backups for a topology. This allows us to test a lot more scenarios
without making the test impossibly slow.
split from bhalevy/load-balance-primary-replica
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
to validate that mutation assertions are sane
split from bhalevy/load-balance-primary-replica
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Most likely 817fdad uncovered the fact that our choice of
primary replica was resonating with tablet allocation and we were ending up
picking the same replica as primary within a scope instead of rotating
primaryship among all replicas in the scope.
This created situations where for instance, restoring into a 9 nodes cluster
with primary_replica_only=true would put all data into 3 nodes, leaving
the other 6 unused. The balancing of the dataset was performed by the
subsequent repair step.
split from bhalevy/load-balance-primary-replica
Fixes#27281
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Address all errors reported by CodeQL as reported on https://github.com/scylladb/scylladb/security/quality.
This is a mixed bag, with some harmless issues, while others are severe problems which will result in the code breaking (if it is even run). I suspect some of the more severe problems were found in dead code that is not used at all -- hence nobody noticed.
Still, these issues are good to fix, so we can reduce noise in the reports and improve the maintainability of the code.
Code cleanup, no backport
Closesscylladb/scylladb#27838
* github.com:scylladb/scylladb:
pgo/pgo.py: don't mutate input params
test/pylib/coverage_utils.py: profdata_to_lcov: don't mutate defaulted param
test/cluster/dtest/tools/misc.py: add type annotations to list_to_hashed_dict()
idl-compiler.py: raise TypeError instead of raw str
test/pylib/lcov_utils.py: don't call set when iterating over it
configure.py: move away from .format(**locals())
test/cluster/object_store/conftest.py: add missing call to parent constructor
idl-compiler.py: add missing call to parent class constructor
tools/scyllatop/fake.py: pass correct number of args to _add_metric
This patch adds a second reproducer for issue #25839, which is about
scanning a secondary index which returns partial results. The new test
uses count(*) without requesting the row themselves, but still has the
same problem of counting only part of the rows. This is the problem that
a user reported in issue #28026.
Unlike the previous test, this test works correctly on older versions
of Scylla - by using larger data, like on Cassandra - without changing
a configuration variable that did not yet exist. So with this test we
can confirm that this bug is a Scylla 5.2 regression:
test/cqlpy/run --release 5.1 test_secondary_index.py::test_short_count
passes, while
test/cqlpy/run --release 5.2 test_secondary_index.py::test_short_count
fails. It also fails on master, so the new test is marked "xfail".
Refs #25839
Refs #28026
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28108
Addresses outstanding review comments from PR #22961 where SSL field
collection was refactored into generic_server::connection base class.
This patch consists of minor cosmetic enhancements for increased
readability, mainly, with some minor fixups explained in specific
commits.
Cosmetic changes, no need to backport.
Closesscylladb/scylladb#27575
* github.com:scylladb/scylladb:
test_ssl: fix indentation
generic_server: improve logging broken TLS connection
test_ssl: improve timeout and readability
alternator/server: update SSL comment
This method can cause performance regressions if used in the wrong place
-- namely if it is used to abort reads by throwing the abort exception.
Exceptions should be propagated during reads without throwing them,
otherwise they cause extra CPU load, making a bad situation worse.
Remove this method, so it doesn't accidentally get more users, migrate
remaining users to get_abort_exception().
Read timeouts shouldn't trigger exceptions thrown, exceptions should be
solely propagated via futures, otherwise they put extra strain on the
system at the worst possible time: when it is overload already enough
that reads started to time out.
The test covers both single partition reads and full scans, with two
scenarios:
* timeout while the read is queued
* timeout when the read is already ongoing
If the read is aborted via the permit (due to timeout) don't throw the
abort exception, instead propagate it via the future chain.
Also, use try_catch<> instead of try ... catch to decorate
malformed_sstable_exception with the file name.
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
Use coroutine::as_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
Not using coroutine::try_future(), because on the error path, the
querier has to be closed.
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
Will replace check_abort(). The latter throws an exception which is
something we want to avoid when a read is aborted, in particular when it
times out.
Also add a convenience get_abort_exception() method to mutation_reader.
We add a test that validates that indexed queries
do not throw a warning related to vector search paging
Fixes: SCYLLADB-248
Closesscylladb/scylladb#28077
We have a test in test_compressed_response.py that reproduces a bug
where in Alternator's signature checking code, if a header had multiple
consecutive spaces its signature isn't checked correctly.
This patch fixes this and that xfailing test begins to pass.
But it turns out that the handling of multiple consecutive spaces in
headers when calculating the authentication signature is just one example
of "header canonization" that the AWS Signature V4 specification requires
us to do. There are additional types of header canonization that Alternator
must do, and this patch also adds new tests in test_authorization.py for
checking *all* the types of canonization.
Fortunately, for all other types of canonizations, we already handled
them correctly - Alternator already lowercases header names, sorts them
alphabetically and removes leading and trailing spaces before calculating
the signature. So most of the new tests added pass also without this patch,
and only one of them, test_canonization_middle_whitespace, needs this
patch to pass. As usual, all the new tests also pass on DynamoDB.
Fixes#27775
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28102
With new UI Jenkins escaping the HTML tags during rendering to prevent
XSS. This will show just link without custom name as a string that can
be copied and then pasted to navigate to the failed directory.
Closesscylladb/scylladb#28062
To fix the problem, we need to remove the first, redundant definition of
test_gossiper_unreachable_endpoints (lines 19-24). The second definition
(lines 25-40) should be retained as it has more substantial test logic.
No other code changes or imports are needed, as the test logic is
preserved fully in the retained definition.
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27632
Potential fix for code scanning alert no. 167: Workflow does not contain permissions
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27819
Potential fix for code scanning alert no. 145: Workflow does not contain permissions
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27808
Try to teach CoPi a bit about how we'd like to see it implement tests, according to this repo best practices.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#28032
Following 954f2cbd2f, which added proxy protocol v2 listeners
for CQL, we do the same for alternator. We add two optional ports
for plain and TLS-wrapped HTTP.
We test each new port, that the old ports still work, and that
mixing up a port with no proxy protocol and a connection with proxy
protocol (or the opposite) fails. The latter serves to show
that the testing strategy is valid and doesn't just pass whatever
happens. We also verify that the correct addresses (and TLS mode)
show up in system.clients.
Closesscylladb/scylladb#27889
Currently Alternator supports compressed requests in the gzip format
with "Content-Encoding: gzip". We did not support any other compression
formats.
It turns out that DynamoDB also supports the "deflate" encoding.
The "deflate" format is just a small variant of gzip and also supported
by the same zlib library that we already use, so it is very easy
to add support for it as well. So this patch adds it.
Beyond compatibility with DynamoDB, another benefit of this patch is
symmetry with our response compression support (PR #27454), where
we supported both gzip and deflate compression of responses - so
we should support the same for requests.
This patch also adds tests for Content-Encoding: deflate, which pass
on DynamoDB (proving that "deflate" is indeed supported there).
On Alternator the new tests failed before this patch and pass with
this patch.
Refs #27243 (which asks to support more compression formats).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27917
This is a translation of Cassandra's CQL unit test source file
validation/operations/InsertInvalidateSizedRecordsTest.java into our
cqlpy framework.
This is one of the tests added to Cassandra as part of the vector
search work, but actually has nothing to do with vector search -
it checks what happens when key columns of different types exceeed
their maximum size (64KB).
Unfortunately, each one of the tests added here *fail* on ScyllaDB,
providing more reproducers for two already known issues (which
already had plenty of reproducers...):
Refs #8627 Cleanly reject updates with indexed values where value > 64k
Refs #12247 Better error reporting for oversized keys during INSERT
One of the tests also fails on Cassandra, due to CASSANDRA-19270.
It is not clear to me how this unit test actually passed on Cassandra,
I can only guess that the Python driver somehow makes the request
differently than what the Java unit tests use to make requests to
Cassandra.
One of the tests in the original Cassandra source file I did not
translate, readingEmptyStringsForDifferentTypes, because it tests
cqlsh, not pure CQL.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27944
It is considered a dangerous practice with possible unintended
side-effects, affecting later calls to the same function.
Found by CodeQL "Modification of parameter with default".
It is considered a dangerous practice as it creates a side-effect for
later calls to the same function.
Create a new variable instead and mutate that. Also remove the unused
update_known_ids parameter, which defaults to True and no caller changes
it. Passing False to this param also seem to have no effect. Instead of
trying to guess what the desired effect of passing False is and fixing
it, just remove this unused param.
Found by CodeQL "Modification of parameter with default".
To hopefully shut up CodeQL "Iterable can be either a string or a sequence".
This change makes the code more readable anyway, so it is more than just
a gratuitous change to make some code-scanner happy.
Unlike in C++, in Python one can only throw objects which inherit from
Exception. The message complains about wrong type so wrap it in
TypeError before passing to raise.
Found by CodeQL "Illegal raise".
Use f strings instead, they are just as convenient with the added bonus
of editors providing syntax highighting for it.
Additionally, this shuts up CodeQL complaint about "Suspicious unused
loop iteration variable" in loops where the loop variable was passed to
format indirectly via **locals().
Replace manual init of parent fields.
Found by CodeQL: "Missing call to superclass `__init__` during object
initialization".
The secret_key is not initialized to server.secret_key, instead of
server.access_key. This probably fixes a (benign) bug.
Since Scylla 6.0, service levels are manged by Raft group0.
This patch updates table name used by service levels and adds a
paragraph describing service levels on raft.
Fixesscylladb/scylladb#18177Closesscylladb/scylladb#26556
Write the boost logs into stdout in HRF format and in XML to the file. The XML file will be used for parsing and providing the error information in the summary section of the fail.
Fixes: https://github.com/scylladb/scylladb/issues/28045
Framework enhancements, no need to backport.
Closesscylladb/scylladb#28107
* github.com:scylladb/scylladb:
test.py: remove XML log from fail summary
test.py: fix truncated boost output to stdout file
Otherwise, coordinator may not react to changing stats after explicit
calls to trigger_load_stats_refresh() done on node replace or table
creation, if stats take longer to refresh than it takes the
coordinator to go idle.
The periodic refresh does wake up the topology coordinator, so the
issue is not dramatic in production, but it's annoying in tests, which
take longer because of that.
Fixes#25163
Refreshing stats will signal _topo_sm.event, so do it before waiting
for the event, to avoiding busy looping in the coordinator.
This will produce lots of logs in test cases which enable debug-level
logging in the raft logger.
Refs #28086
Checking for work is not atomic, so there is room for missed
notification. Especially that notifications are not always triggered
from fibers which take the group0 guard.
Fix by subscribing for the event before checking for work.
Fixes#27958
We switched to the size-based load balancing, which now has more
strict requirements for load stats. We no longer need only per-node
stats (capacity), but also per-tablet stats.
Bootstrapping a node triggers stats refresh, but allocating tablets on
table creation didn't. So after creating a table, load balancer
couldn't make progress for up to 60s (stats refresh period).
This makes tests take longer, and can even cause failures if tests are
using a low-enough timeout.
Fixes#27921
Stats refresh will be triggered on topology coordinator by events like
allocating new tablets on table creation. For refresh to be effective,
all replicas must see the new tablets, otherwise stats will be
incomplete.
If a test tries to move a tablet, it assumes the tablets are stable.
This fixes flakiness exposed by size-based load-balancing and a later
change to refresh stats sooner.
It's a global operation, so we can use any server.
It's not only convenient. The call via api.disable_tablet_balancing()
confuse people to think that it's a per-server operation. This leads
to proliferation of code which does it needlessly on all servers.
Change the behavior of the catching the boost log output. With this
change boost will output it's logging to stdour with HRF format and to
the tempfile in XML format. This will help for easier debuggint when all
messages will be in the output file and still in the fail summary.
This will move responsibility for running tests with pytest in the same manner as it was done with boost tests. From this commit, test.py is not responsible anymore for running python tests and relies completely on pytest.
This is another step for unification of test execution.
Convert skip_mode function to `pytest.mark` to be able to use to annotate the whole module instead of each test explicitly.
NOTE: this is a breaking change. From this commit, several directories with tests will require a path to the file to launch the test. Affected directories
test/alternator
test/broadcast_tables
test/cql
test/cqlpy
test/rest_api
Changes only in framework, so no backport.
This PR will increase the amount of the tests by 30 test, due to the fact that how test.py and pytest discover tests. test.py count a file as a test, and when skip used in suite.yaml it will exclude the tests from discovery completely.
While the pytest count test funstion as a test and uses skip_mode mark and will discover the tests, but it will skip them during execution, hence the difference
test.py output before PR:
```bash
> ./test.py --mode=release rest_api/test_compaction_task rest_api/test_task_manager --list --no-gather-metrics
```
test.py output in this PR:
```bash
> ./test.py --mode=release test/rest_api/test_compaction_task.py test/rest_api/test_task_manager.py --list
rest_api/test_compaction_task.py::test_global_major_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_major_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_cleanup_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_offstrategy_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_rewrite_sstables_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_reshaping_compaction_task.release.1
rest_api/test_compaction_task.py::test_resharding_compaction_task.release.1
rest_api/test_compaction_task.py::test_regular_compaction_task.release.1
rest_api/test_compaction_task.py::test_compaction_task_abort.release.1
rest_api/test_compaction_task.py::test_major_keyspace_compaction_task_async.release.1
rest_api/test_compaction_task.py::test_cleanup_keyspace_compaction_task_async.release.1
rest_api/test_compaction_task.py::test_offstrategy_keyspace_compaction_task_async.release.1
rest_api/test_compaction_task.py::test_rewrite_sstables_keyspace_compaction_task_async.release.1
rest_api/test_compaction_task.py::test_compaction_progress[major_keyspace_compaction_task_impl_run_fail].release.1
rest_api/test_compaction_task.py::test_compaction_progress[shard_major_keyspace_compaction_task_impl_run_fail].release.1
rest_api/test_compaction_task.py::test_compaction_progress[table_major_keyspace_compaction_task_impl_run_fail].release.1
rest_api/test_task_manager.py::test_task_manager_modules.release.1
rest_api/test_task_manager.py::test_task_manager_tasks.release.1
rest_api/test_task_manager.py::test_task_manager_status_running.release.1
rest_api/test_task_manager.py::test_task_manager_status_done.release.1
rest_api/test_task_manager.py::test_task_manager_status_failed.release.1
rest_api/test_task_manager.py::test_task_manager_not_abortable.release.1
rest_api/test_task_manager.py::test_task_manager_wait.release.1
rest_api/test_task_manager.py::test_task_manager_ttl.release.1
rest_api/test_task_manager.py::test_task_manager_user_ttl.release.1
rest_api/test_task_manager.py::test_task_manager_sequence_number.release.1
rest_api/test_task_manager.py::test_task_manager_recursive_status.release.1
rest_api/test_task_manager.py::test_module_not_exists.release.1
rest_api/test_task_manager.py::test_task_folding.release.1
rest_api/test_task_manager.py::test_abort_on_unregistered_task.release.1
```
Fixes: https://github.com/scylladb/scylladb/issues/27716Closesscylladb/scylladb#26395
* github.com:scylladb/scylladb:
test.py: fix test_vector_similarity.py
docs: add directories excluded from test.py
test.py: prevent file descriptors leaking
test.py: capture print inside the test
test.py: do not print header for collection with test.py
test.py: remove not supported functionality
test.py: switch of execution of several test directories by test.py runner
test.py: integrate python tests to be executed with pytest runner
test.py: fix test/vector_search_validator to be able to run with pytest
test.py: prepare base class for migration
test.py: move environment preparation to one method
test.py: introduce new environment variable TESTPY_PREPARED_ENVIRONMENT
repair: Implement auto repair for tablet repair
This patch implements the basic auto repair support for tablet repair.
It was decided to add no per table configuration for the initial
implementation, so two scylla yaml config options are introduced to set
the default auto repair configs for all the tablet tables.
- auto_repair_enabled_default
Set true to enable auto repair for tablet tables by default. The value
will be overridden by the per keyspace or per table configuration which
is not implemented yet.
- auto_repair_threshold_default_in_seconds
Set the default time in seconds for the auto repair threshold for tablet
tables. If the time since last repair is bigger than the configured
time, the tablet is eligible for auto repair. The value will be
overridden by the per keyspace or per table configuration which is not
implemented yet.
The following metrcis are added:
- auto_repair_needs_repair_nr
The number of tablets with auto repair enabled that needs repair
- auto_repair_enabled_nr
The number of tablets with auto repair enabled
The metrics are useful to tell if auto repair is falling behind.
In the future, more auto repair scheduling will be added, e.g.,
scheduling based on the repaired and unrepaired sstable set size,
tombstone ratio and so on, in addition to the time based scheduling.
Fixes SCYLLADB-99
New feature. No backport.
Closesscylladb/scylladb#27534
* github.com:scylladb/scylladb:
topology_coordinator: Add metrics for tablet repair
repair: Implement auto repair for tablet repair
This PR:
- Replaces a fixed version name with the variable for the current version in the instructions for installing a non-default version with Web Installer. This will make using the installer more user-friendly.
- Removes the instruction for Open Source from the Web Installer docs.
Fixes https://github.com/scylladb/scylladb/issues/28005
Fixes https://github.com/scylladb/scylladb/issues/28079Closesscylladb/scylladb#28046
* github.com:scylladb/scylladb:
doc: remove the instruction for Open Source from the Web Installer docs
doc: add the version variable to the Web Installer instructions
- scylla_tablet_ops_failed
Number of failed tablet {auto, user} repair
- scylla_tablet_ops_succeeded
Number of succeeded tablet {auto, user} repair
Currently auto_repair and user_repair tablet task are added. We can add
more tablet tasks later, e.g., rebuild, migration.
Create and drop view operations are currently performed on all shards, and their execution is not fully serialized. On slower processors this can lead to interleavings that leave stale entries in `system.scylla_views_build`
A problematic sequence looks like this:
* `on_create_view()` runs on shard 0 → entries for shard 0 and shard 1 are created
* `on_drop_view()` runs on shard 0 → entry for shard 0 is removed
* `on_create_view()` runs on shard 1 → entries for shard 0 and shard 1 are created again
* `on_drop_view()` runs on shard 1 → entry for shard 1 is removed, while the shard 0 entry remains
This results in a leftover row in `system.scylla_views_builds_in_progress`, causing `view_build_test.cc` to get stuck indefinitely in an eventual state and eventually be terminated by CI.
This patch fixes the issue by fully serializing all view create and drop operations through shard 0. Shard 0 becomes the single execution point and notifies other shards to perform their work in order. Requests originating.
new process:
- view_builder::on_create_view(...) runs only on shard 0 and kicks off dispatch_create_view(...) in the background.
- dispatch_create_view(...) (shard 0) first checks should_ignore_tablet_keyspace(...) and returns early if needed.
- dispatch_create_view(...) calls handle_seed_view_build_progress(...) on shard 0. That:
- writes the global “build progress” row across all shards via _sys_ks.register_view_for_building_for_all_shards(...).
- After seeding, dispatch_create_view(...) broadcasts to all shards with container().invoke_on_all(...).
- Each shard runs handle_create_view_local(...), which:
- waits for pending base writes/streams, flushes the base,
- resets the reader to the current token and adds the new view,
- handles errors and triggers _build_step to continue processing.
Drop view
- view_builder::on_drop_view(...) runs only on shard 0 and kicks off dispatch_drop_view(...) in the background.
- dispatch_drop_view(...) (shard 0) first checks should_ignore_tablet_keyspace(...) and returns early if needed.
- It broadcasts handle_drop_view_local(...) to all shards with invoke_on_all(...).
- Each shard runs handle_drop_view_local(...), which:
- removes the view from local build state (_base_to_build_step and _built_views) by scanning existing steps,
- ignores missing keyspace cases.
- After all shards finish local cleanup, shard 0 runs handle_drop_view_global_cleanup(...), which:
- removes global build progress, built‑view state, and view build status in system tables,
Shutdown
- drain() waits on _view_notification_sem before _sem so in‑flight dispatches finish before bookkeeping is halted.
In addition, the test is adjusted to remove the long eventual wait (596.52s / 30 iterations) and instead rely on the default wait of 17 iterations (~4.37 minutes), eliminating unnecessary delays while preserving correctness.
Fixes: https://github.com/scylladb/scylladb/issues/27898
Backport: not required as the problem happens on master
Closesscylladb/scylladb#27929
To prepare for implementation of filtering we skip validation
of where clauses in vector search queries. All queries that would
be blocked by the lack of ALLOW FILTERING now will pass through.
Fixes: VECTOR-410
Closesscylladb/scylladb#27758
We can run Alternator's tests against DynamoDB with `test/alternator/run --aws`, and our intention is that all except a few specially marked should pass on DynamoDB - indicating that the test itself is correct and checks compatibility with DynamoDB and not with some misunderstood spec.
Before this patch series, almost two dozen Alternator's tests failed on DynamoDB. This series fixes most of them.
Refs #26079 (it fixes almost all the problems but probably not all of them so let's keep the issue open for a while longer)
Closesscylladb/scylladb#27995
* github.com:scylladb/scylladb:
test/alternator: fix some expected error messages to fit DynamoDB
test/alternator: fix compressed request test on non-us-east1
test/alternator: fix test's expected error message on DynamoDB
test/alternator: mark Alternator-only test scylla_only
test/alternator: fix test on DynamoDB
test/alternator: increase wait_for_gsi() timeout
test/alternator: fix test passing a spurious parameter
These tools are deprecated and no longer shipped by ScyllaDB packages.
They no longer support the latest SSTable versions and ScyllaDB-only
features, like encryption and dictionary based compression.
Remove them from the documentation.
Closesscylladb/scylladb#27608
The interface of Jenkins has changed, and the instructions for creating
a token are out-of-date. This commit updates them.
Closesscylladb/scylladb#28054
There is a known limitation of the xdist.
Since it makes discovery in each thread, then compare it with master thread. The discovered lists of test should be the same. Sets are not order guaranteed, so they should not be used for parametrized testing, because discovery of the tests with using xdist will fail.
This PR just converts set to dist, to eliminate issue mentioned above.
Unused imports, unused variables and such.
Initially, there were no functional changes, just to get rid of some standard CodeQL warnings.
I've then broken the CI, as apparently there's a install time(!?) Python script creation for the sole purpose of product
naming. I changed it - we have it in etcdir, as SCYLLA-PRODUCT-FILE.
So added (copied from a different script) a get_product() helper function in scylla_util.py and used it instead.
While at it, also fixed the too broad import from scylla_util, which 'forced' me to also fix other specific imports (such as shutil).
Improvement - no need to backport.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#27883
We currently do it only for a bootstrapping node, which is a bug. The
missing IP can cause an internal error, for example, in the following
scenario:
- replace fails during streaming,
- all live nodes are shut down before the rollback of replace completes,
- all live nodes are restarted,
- live nodes start hitting internal error in all operations that
require IP of the replacing node (like client requests or REST API
requests coming from nodetool).
We fix the bug here, but we do it separately for replace with different
IP and replace with the same IP.
For replace with different IP, we persist the IP -> host ID mapping
in `system.peers` just like for bootstrap. That's necessary, since there
is no other way to determine IP of the replacing node on restart.
For replace with the same IP, we can't do the same. This would require
deleting the row corresponding to the node being replaced from
`system.peers`. That's fine in theory, as that node is permanently
banned, so its IP shouldn't be needed. Unfortunately, we have many
places in the code where we assume that IP of a topology member is always
present in the address map or that a topology member is always present in
the gossiper endpoint set. Examples of such places:
- nodetool operations,
- REST API endpoints,
- `db::hints::manager::store_hint`,
- `group0_voter_handler::update_nodes`.
We could fix all those places and verify that drivers work properly when
they see a node in the token metadata, but not in `system.peers`.
However, that would be too risky to backport.
We take a different approach. We recover IP of the replacing node on
restart based on the state of the topology state machine and
`system.peers` just after loading `system.peers`.
We rely on the fact that group 0 is set up at this point. The only case
where this assumption is incorrect is a restart in the Raft-based
recovery procedure. However, hitting this problem then seems improbable,
and even if it happens, we can restart the node again after ensuring
that no client and REST API requests come before replace is rolled back
on the new topology coordinator. Hence, it's not worth to complicate the
fix (by e.g. looking at the persistent topology state instead of the
in-memory state machine).
Fixes#28057
Backport this PR to all branches as it fixes a problematic bug.
Closesscylladb/scylladb#27435
* github.com:scylladb/scylladb:
gossiper: add_saved_endpoint: make generations of excluded nodes negative
test: introduce test_full_shutdown_during_replace
utils: error_injection: allow aborting wait_for_message
raft topology: preserve IP -> ID mapping of a replacing node on restart
Fixes#27992
When doing a commit log oversized allocation, we lock out all other writers by grabbing
the _request_controller semaphore fully (max capacity).
We thereafter assert that the semaphore is in fact zero. However, due to how things work
with the bookkeep here, the semaphore can in fact become negative (some paths will not
actually wait for the semaphore, because this could deadlock).
Thus, if, after we grab the semaphore and execution actually returns to us (task schedule),
new_buffer via segment::allocate is called (due to a non-fully-full segment), we might
in fact grab the segment overhead from zero, resulting in a negative semaphore.
The same problem applies later when we try to sanity check the return of our permits.
Fix is trivial, just accept less-than-zero values, and take same possible ltz-value
into account in exit check (returning units)
Added whitebox (special callback interface for sync) unit test that provokes/creates
the race condition explicitly (and reliably).
Closesscylladb/scylladb#27998
The current code:
```
try:
cql.execute(f"INSERT INTO {cf} (pk, t) VALUES (-1, 'x')", host=host[0], execution_profile=cl_one_profile).result()
except Exception:
pass
```
contains a typo: `host=host[0]` which throws an exception becase Host
object is not subscriptable. The test does not fail because the except
block is too broad and suppresses all exceptions.
Fixing the typo alone is insufficient. The write still succeeds because
the remaining nodes are UP and the query uses CL=ONE, so no failure
should be expected.
Another source of flakiness is data verification:
```
SELECT * FROM {cf} WHERE pk = 0;
```
Even when a coordinator is explicitly provided, using CL=ONE does not
guarantee a local read. The coordinator may forward the read request to
another replica, causing the verification to fail nondeterministically.
This patch rewrites the tests to address these issues:
- Fix the typo: `host[0]` to `hosts[0]`
- Verify data using `MUTATION_FRAGMENTS({cf})` which guarantees a local
read on the coordinator node
- Reconnect the driver after node restart
Fixes https://github.com/scylladb/scylladb/issues/27933Closesscylladb/scylladb#27934
With migration to the pytest, file descriptors will be hanged during the whole life of the process. Previously it was not an issue, because test.py was executing only one file with Popen, so descriptors will be freed with process done. With new approach they are blocked. This will allow to eliminate this.
Fix issue when we had issue with getting cluster and then trying to set it dirty while it None.
Put cluster to the pool only if it was created
In the current state pytest do not support the order of execution, so this parameter is removed. There is no big need in this due to the differences what pytest and test.py counted test. pytest run test functions in the threads, while test.py executed test files in the threads. That's why pytest's way is more granular and allows to fill threads better.
Remove skip node, since it already added as a pytest mark for each test in the file.
Remove pool_size, since this is not used by pytest at all. Pytest uses
xdist to set the amount of threads instead of pool_size used by test.py
With this commit test.py will lose ability to run tests by itself always bypassing execution to the pytest.
NOTE: this is a breaking change. From this commit, several directories
with tests will require a path to the file to launch the test.
Affected directories
test/alternator
test/broadcast_tables
test/cql
test/cqlpy
test/rest_api
With this commit test.py will be bypassing the tests execution to the pytest. However, it will still be able to run test by itself.
With providing test name like `broadcast_tables/test_broadcast_tables` it will execute test with test.py runner, but if the path to the file will be provided like `test/broadcast_tables/test_broadcast_tables.py` it will bypass execution to the pytest.
`--test-py-init` tells to run pytest session in test.py-compatible mode
Update the help text for the name parameter for test.py about changes
how it works and which directory is served by pytest
build_mode fixture have dynamic scope. It depends how the pytest is
executed. When it executed through test.py scope will be session and
since it's broader that package everything work fine. While with pure
pytest it will fail because build_mode will have module scope.
This fix allows to run tests with pure pytest, this needed for migration
test to be executed by pytest runner instead test.py.
Since all tests share the same base class and some of the tests executed by test.py and some with pytest, we need to handle two cases where configuration is located: suite.yaml and test_config.yaml
After full migration suite.yaml case will be removed
Since anyway these two methods should be called one by one in two different cases: when test.py executes test and pytest executes test, merging them into one. Additionally, set environment variable to show the underneath pytest process that environment was already prepared and there is no need to clean directories or start additional services.
Introduce the new environment variable that will be used to signalize to the pytest runner that environment war already prepared by test.py. This needed to be able to run the test with pytest and test.py(that actually will run pytest underneath).
Preiously we were logging a broken TLS connection and then this has been
logged later again, so now instead of logging we're constructing an
exception with a message extened with TLS info, which later will be
catched with its full message still logged.
1. With this change the test really waits 10s, previously (in case
something went wrong), the timeout could take way more than that.
2. Added `else` to above `if` to increase clarity of execution flow -
it doesn't change logic, but makes it more clear.
This patch implements the basic auto repair support for tablet repair.
It was decided to add no per table configuration for the initial
implementation, so two scylla yaml config options are introduced to set
the default auto repair configs for all the tablet tables.
- auto_repair_enabled_default
Set true to enable auto repair for tablet tables by default. The value
will be overridden by the per keyspace or per table configuration which
is not implemented yet.
- auto_repair_threshold_default_in_seconds
Set the default time in seconds for the auto repair threshold for tablet
tables. If the time since last repair is bigger than the configured
time, the tablet is eligible for auto repair. The value will be
overridden by the per keyspace or per table configuration which is not
implemented yet.
The following metrcis are added:
- auto_repair_needs_repair_nr
The number of tablets with auto repair enabled that needs repair
- auto_repair_enabled_nr
The number of tablets with auto repair enabled
The metrics are useful to tell if auto repair is falling behind.
In the future, more auto repair scheduling will be added, e.g.,
scheduling based on the repaired and unrepaired sstable set size,
tombstone ratio and so on, in addition to the time based scheduling.
Fixes SCYLLADB-99
Allow creating materialized views and secondary indexes in a tablets keyspace only if it's RF-rack-valid, and enforce RF-rack-validity while the keyspace has views by restricting some operations:
* Altering a keyspace's RF if it would make the keyspace RF-rack-invalid
* Adding a node in a new rack
* Removing / Decommissioning the last node in a rack
Previously the config option `rf_rack_valid_keyspaces` was required for creating views. We now remove this restriction - it's not needed because we always maintain RF-rack-validity for keyspaces with views.
The restrictions are relevant only for keyspaces with numerical RF. Keyspace with rack-list-based RF are always RF-rack-valid.
Fixesscylladb/scylladb#23345
Fixes https://github.com/scylladb/scylladb/issues/26820
backport to relevant versions for materialized views with tablets since it depends on rf-rack validity
Closesscylladb/scylladb#26354
* github.com:scylladb/scylladb:
docs: update RF-rack restrictions
cql3: don't apply RF-rack restrictions on vector indexes
cql3: add warning when creating mv/index with tablets about rf-rack
service/tablet_allocator: always allow tablet merge of tables with views
locator: extend rf-rack validation for rack lists
test: test rf-rack validity when creating keyspace during node ops
locator: fix rf-rack validation during node join/remove
test: test topology restrictions for views with tablets
test: add test_topology_ops_with_rf_rack_valid
topology coordinator: restrict node join/remove to preserve RF-rack validity
topology coordinator: add validation to node remove
locator: extend rf-rack validation functions
view: change validate_view_keyspace to allow MVs if RF=Racks
db: enforce rf-rack-validity for keyspaces with views
replica/db: add enforce_rf_rack_validity_for_keyspace helper
db: remove enforce parameter from check_rf_rack_validity
test: adjust test to not break rf-rack validity
Disabling of balancing waits for topology state machine to become idle, to guarantee that no migrations are happening or will happen after the call returns. But it doesn't interrupt the scheduler, which means the call can take arbitrary amount of time. It may wait for tablet repair to be finished, which can take many hours.
We should do it via topology request, which will interrupt the tablet scheduler.
Enabling of balancing can be immediate.
Fixes https://github.com/scylladb/scylladb/issues/27647Fixes#27210Closesscylladb/scylladb#27736
* https://github.com/scylladb/scylladb:
test: Verify that repair doesn't block disabling of tablet load balancing
tablets: Make balancing disabling call preempt tablet transitions
VECTOR_SEARCH_INDEXING permission didn't work on cdc tables as we mistakenly checked for vector indexes on the cdc table insted of the base.
This patch fixes that and adds a test that validates this behavior.
Fixes: VECTOR-476
Closesscylladb/scylladb#28050
Call discover_staging_sstables in view_update_generator::start() instead
of in the constructor, because the constructor is called during
initialization before sstables are loaded.
The initialization order was changed in 5d1f74b86a and caused this
regression. It means the view update generator won't discover staging
sstables on startup and view updates won't be generated for them. It
also causes issues in sstable cleanup.
view_update_generator::start() is called in a later stage of the
initialization, after sstable loading, so do the discovery of staging
sstables there.
Fixesscylladb/scylladb#27956Closesscylladb/scylladb#27970
Currently, database::truncate_table_on_all_shards calls the table::can_flush only on the coordinator shard
and therefore it may miss shards with dirty data if the coordinator shard happens to have empty memtables, leading to clearing the memtables with dirty data rather than flushing them.
This change fixes that by making flush safe to be called, even if the memtable list is empty, and calling it on every shard that can flush (i.e. seal_immediate_fn is engaged).
Also, change database_test::do_with_some_data is use random keys instead of hard-coded key names, to reproduce this issue with `snapshot_list_contains_dropped_tables`.
Fixes#27639
* The issue exists since forever and might cause data loss due to wrongly clearing the memtable, so it needs backport to all live versions
Closesscylladb/scylladb#27643
* github.com:scylladb/scylladb:
test: database_test: do_with_some_data: randomize keys
database: truncate_table_on_all_shards: drop outdated TODO comment
database: truncate_table_on_all_shards: consider can_flush on all shards
memtable_list: unify can_flush and may_flush
test: database_test: add test_flush_empty_table_waits_on_outstanding_flush
replica: table, storage_group, compaction_group: add needs_flush
test: database_test: do_with_some_data_in_thread: accept void callback function
2-DC cluster parallel non-RBNO rebuild failure when expanding RF in DC2.
Steps to reproduce:
1. Provision a cluster with 2 datacenters and at least 2 nodes in the second datacenter.
2. Let’s assume datacenter names are "dc1" and "dc2".
3. Create a keyspace ("keyspace1") with RF=0 in dc2.
4. Populate some data into dc1.
5. Change keyspace1 replication in dc2 to 2.
6. On 2 nodes in dc2 run the following command in parallel:
nodetool rebuild --source-dc dc1
Parallel execution of rebuilds is not possible with RBNO enabled.
This test is the repro for #27804Closesscylladb/scylladb#27747
Currently the function uses a regular expression
to check the system log for a specific message.
This is tangential to the ability to cleanly abort the restore task, plus the regular expression has a syntax error:
```
test/cluster/object_store/test_backup.py:534
/home/bhalevy/dev/scylla/test/cluster/object_store/test_backup.py:534: SyntaxWarning: "\(" is an invalid escape sequence. Such sequences will not work in the future. Did you mean "\\("? A raw string is also an option.
await wait_for_first_completed([l.wait_for("Failed to handle STREAM_MUTATION_FRAGMENTS \(receive and distribute phase\) for .+: Streaming aborted", timeout=10) for l in logs])
```
Thsi change modernizes the implementation by:
- using auto_dc_rack for manager.servers_add
- using new_test_keyspace to generate and auto delete the keyspace
- using async gatherio and a prepared statement to insert the data
- simplifing the keys and values by NOT using os.urandom (that is notoriously slow)
- inserting fewer keys in debug mode
- removing the log check
With that, the test can be reenabled in all modes.
* No backport needed since the test was disabled
Closesscylladb/scylladb#27892
* github.com:scylladb/scylladb:
test_backup: do_abort_restore: reduce data footprint
test_backup: do_abort_restore: use error injection
test_backup: do_abort_restore: use asyncio for cql
test_backup: do_abort_restore: use new_test_keyspace
test_backup: do_abort_restore: use logger rather than print
test_backup: do_abort_restore: pass auto_rack_dc to servers_add
This patch adds tablet repair progress report support so that the user
could use the /task_manager/task_status API to query the progress.
In order to support this, a new system table is introduced to record the
user request related info, i.e, start of the request and end of the
request.
The progress is accurate when tablet split or merge happens in the
middle of the request, since the tokens of the tablet are recorded when
the request is started and when repair of each tablet is finished. The
original tablet repair is considered as finished when the finished
ranges cover the original tablet token ranges.
After this patch, the /task_manager/task_status API will report correct
progress_total and progress_completed.
Fixes#22564Fixes#26896Closesscylladb/scylladb#27679
It was obseved:
```
test_repair_disjoint_row_2nodes_diff_shard_count was spuriously failing due to
segfault.
backtrace pointed to a failure when allocating an object from the chain of
freed objects, which indicates memory corruption.
(gdb) bt
at ./seastar/include/seastar/core/shared_ptr.hh:275
at ./seastar/include/seastar/core/shared_ptr.hh:430
Usual suspect is use-after-free, so ran the reproducer in the sanitize mode,
which indicated shared ptr was being copied into another cpu through the
multi shard writer:
seastar - shared_ptr accessed on non-owner cpu, at: ...
--------
seastar::smp_message_queue::async_work_item<mutation_writer::multishard_writer::make_shard_writer...
```
The multishard writer itself was fine, the problem was in the streaming consumer
for repair copying a shared ptr. It could work fine with same smp setting, since
there will be only 1 shard in the consumer path, from rpc handler all the way
to the consumer. But with mixed smp setting, the ptr would be copied into the
cpus involved, and since the shared ptr is not cpu safe, the refcount change
can go wrong, causing double free, use-after-free.
To fix, we pass a generic incremental repair handler to the streaming
consumer. The handler is safe to be copied to different shards. It will
be a no op if incremental repair is not enabled or on a different shard.
A reproducer test is added. The test could reproduce the crash
consistently before the fix and work well after the fix.
Fixes#27666Closesscylladb/scylladb#27870
Function skip_mode works only on function and only in cluster test. This if OK
when we need to skip one test, but it's not possible to use it with pytestmark
to automatically mark all tests in the file. The goal of this PR is to migrate
skip_mode to be dynamic pytest.mark that can be used as ordinary mark.
Closesscylladb/scylladb#27853
[avi: apply to test/cluster/test_tablets.py::test_table_creation_wakes_up_balancer]
The test if flaky, with failures in:
for server in servers:
> await check_node_log_for_failed_mutations(manager, server)
test/cluster/test_topology_ops_encrypted.py:84:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
manager = <test.pylib.manager_client.ManagerClient object at 0xffff602e8590>
server = ServerInfo(server_id=1769, ip_addr='127.82.127.43', rpc_address='127.82.127.43', datacenter='DEFAULT_DC', rack='DEFAULT_RACK', pid=186578)
async def check_node_log_for_failed_mutations(manager: ManagerClient, server: ServerInfo):
logging.info(f"Checking that node {server} had no failed mutations")
log = await manager.server_open_log(server.server_id)
occurrences = await log.grep(expr="Failed to apply mutation from", filter_expr="(TRACE|DEBUG|INFO)")
> assert len(occurrences) == 0
E AssertionError
test/cluster/util.py:319: AssertionError
As diagnosed by Gleb in https://github.com/scylladb/scylladb/issues/27942#issuecomment-3710013625:
"The fencing errors here look legit given that we do not wait for all
requests to complete while shutting down the storage proxy. The
scenario is this:
Test does writes to rf=3 keyspace with cl=one. One node is shutting
down while there is a tablet migration. Tablet migration executes
barrier and drain which fails on a node that is been shutdown. The
topology coordinator proceeds fencing the old topology, but there
still can be un-handled mutation requests from the shutting down node
on other nodes and they will generate fencing errors like they should.
They way to avoid it (though it is benign) is to wait for all outgoing
storage proxy requests to complete during shutdown, but even then the
error may still happen since a request may timeout before it is
processed by the other side, so it may be completed by a storage proxy
coordinator side, but still not handled by replica side. This what we
have fencing for in the first place."
Fix by diabling background tablet migrations, so that we have no
topology barriers concurrent with node shutdown.
Fixes#27942Closesscylladb/scylladb#28034
The driver must see server_c before we stop server_a, otherwise
there will be no live host in the pool when we attempt to drop
the keyspace:
```
@pytest.mark.asyncio
async def test_not_enough_token_owners(manager: ManagerClient):
"""
Test that:
- the first node in the cluster cannot be a zero-token node
- removenode and decommission of the only token owner fail in the presence of zero-token nodes
- removenode and decommission of a token owner fail in the presence of zero-token nodes if the number of token
owners would fall below the RF of some keyspace using tablets
"""
logging.info('Trying to add a zero-token server as the first server in the cluster')
await manager.server_add(config={'join_ring': False},
property_file={"dc": "dc1", "rack": "rz"},
expected_error='Cannot start the first node in the cluster as zero-token')
logging.info('Adding the first server')
server_a = await manager.server_add(property_file={"dc": "dc1", "rack": "r1"})
logging.info('Adding two zero-token servers')
# The second server is needed only to preserve the Raft majority.
server_b = (await manager.servers_add(2, config={'join_ring': False}, property_file={"dc": "dc1", "rack": "rz"}))[0]
logging.info(f'Trying to decommission the only token owner {server_a}')
await manager.decommission_node(server_a.server_id,
expected_error='Cannot decommission the last token-owning node in the cluster')
logging.info(f'Stopping {server_a}')
await manager.server_stop_gracefully(server_a.server_id)
logging.info(f'Trying to remove the only token owner {server_a} by {server_b}')
await manager.remove_node(server_b.server_id, server_a.server_id,
expected_error='cannot be removed because it is the last token-owning node in the cluster')
logging.info(f'Starting {server_a}')
await manager.server_start(server_a.server_id)
logging.info('Adding a normal server')
await manager.server_add(property_file={"dc": "dc1", "rack": "r2"})
cql = manager.get_cql()
await wait_for_cql_and_get_hosts(cql, [server_a], time.time() + 60)
> async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }") as ks_name:
test/cluster/test_not_enough_token_owners.py:57:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/lib64/python3.14/contextlib.py:221: in __aexit__
await anext(self.gen)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
manager = <test.pylib.manager_client.ManagerClient object at 0x7f37efe00830>
opts = "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }"
host = None
@asynccontextmanager
async def new_test_keyspace(manager: ManagerClient, opts, host=None):
"""
A utility function for creating a new temporary keyspace with given
options. It can be used in a "async with", as:
async with new_test_keyspace(ManagerClient, '...') as keyspace:
"""
keyspace = await create_new_test_keyspace(manager.get_cql(), opts, host)
try:
yield keyspace
except:
logger.info(f"Error happened while using keyspace '{keyspace}', the keyspace is left in place for investigation")
raise
else:
> await manager.get_cql().run_async("DROP KEYSPACE " + keyspace, host=host)
E cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.69.108.39:9042 dc1>: ConnectionException('Pool for 127.69.108.39:9042 is shutdown')})
test/cluster/util.py:544: NoHostAvailable
```
Fixes#28011Closesscylladb/scylladb#28040
This commit replaces a fixed version name with the variable for the current version
in the instructions for installing a non-default version with Web Installer.
This will make using the installer more user-friendly.
Fixes https://github.com/scylladb/scylladb/issues/28005
With randomized keys, and since we're inserting only 2 keys,
it is possible that they would end up owned only by a single shard,
reproducing #27639 in snapshot_list_contains_dropped_tables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The comment was added in 83323e155e
Since then, table::seal_active_memtable was improved to guarantee
waiting on oustanding flushes on success (See d55a2ac762), so
we can remove this TODO comment (it also not covered by any issue
so nobody is planned to ever work on it).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
can_flush might return a different value for each shard
so check it right before deciding whether to flush or clear a memtable
shard.
Note that under normal condition can_flush would always return true
now that it checks only the presence of the seal memtable function
rather than check memtable_list::empty().
Fixes#27639
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that we have a unit test proving that it's safe to flush an
empty memtable list there is no need to distinguish between
may_flush and can_flush.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Table needs flush if not all its memtable lists are empty.
To be used in the next patch for a unit test.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Many test cases already assume `func` is being called a seastar
thread and although the function they pass returns a (ready) future,
it serves no purpose other than to conform to the interface.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This `raft_topology_update_ip` call always returns after `t.find(raft_id)`
returns `nullptr`, so it effectively does nothing. It's not a bug, since
there is no reason to update `system.peers` for left nodes anyway. We
delete the rows corresponding to left nodes in `process_left_node` (called
just above).
Closesscylladb/scylladb#27899
Service levels cache is empty after upgrade to consistent topology
if no mutations are commited to `system.service_levels_v2` or rolling
restart is not done.
To fix the bug, this commit adds service levels cache reloading after
upgrading the SL data accessor to v2 in `storage_service::topology_state_load()`.
Fixes SCYLLADB-90
before doing migration to raft
There is no need to call `service_level_controller::upgrade_to_v2()`
on every topology state load, we only need to do it once.
All tests I am fixing in this patch do pass for me on DynamoDB, but
other developers report that they fail because some DynamoDB servers
apparently use slightly different error messages, with less detail about
the cause of an error. For example, some of our tests currently expect
an error message that looks like:
An error occurred (ValidationException) when calling the Query
operation: Invalid operator used in KeyConditionExpression:
attribute_exists
But some servers don't report the ": attribute_exists" at the end, so
we can't use the word "attribute_exists" it in the test to recognize
the correct error, and needs to use a different word (which both
versions of DynamoDB and Alternator all print).
As another example, the good old DynamoDB error:
An error occurred (ValidationException) when calling the Query
operation: 1 validation error detected: Value 'DOG' at
'conditionalOperator' failed to satisfy constraint: Member must
satisfy enum value set: [OR, AND]
Got replaced by the following less informative message:
An error occurred (ValidationException) when calling the Query
operation: Failed to satisfy constraint: Member must satisfy enum
value set: [ALL, OR]'
So we need to fix the test to allow it too.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test test_compressed_request.py::test_compressed_request coerces
boto3 to send a compressed request, and wrongly used region_name=us-east-1
to set up the connection. Theoretically, this doesn't matter because
we also set the correct URL (for either Alternator or the desired region
in AWS). But in fact it does matter, because region name is part of the
request's signature, and DynamoDB refuses the request if it comes to
a different region than it is signed for. So this test fails when run
on DynamoDB on any other region except us-east-1.
The fix is simple - don't use the constant "us-east-1", but pick up the
correct region name from the original connection.
The functions new_dynamodb_session(), new_dynamodb() and
new_dynamodb_stremas() had the same bug and we fix it too, but it didn't
break any test because the only tests using these functions were
Scylla-only so the AWS region problem didn't apply to them.
Replace -1 with 0 for the liveness check operation to avoid triggering digest validation failures. This prevents rare fatal errors when the cluster is recovering and ensures the test does not violate append_seq invariants.
The value -1 was causing invalid digest results in the append_seq structure, leading to assertion failures. This could happen when the sentinel value was the first (or only) element being appended, resulting in a digest that did not match the expected value.
By using 0 instead, we ensure that the digest calculations remain valid and consistent with the expected behavior of the test.
The specific value of the sentinel is not important, as long as it is a valid elem_t that does not violate the invariants of the append_seq structure. In particular, the sentinel value is typically used only when no valid result is received from any server in the current loop iteration, in which case the loop will retry.
Fixes: scylladb/scylladb#27307
Backporting to active branches - this is a test-only fix (low risk) for a flaky test that exists in older branches (thus affects the CI of active branches).
Closesscylladb/scylladb#28010
* https://github.com/scylladb/scylladb:
test/raft: use valid sentinel in liveness check to prevent digest errors
test/raft: improve debugging in randomized_nemesis_test
The Alternator test test_tag.py::test_tag_lsi_gsi expects to see an
error - it's not allowed to set a tag on a GSI or LSI - but the error
message that DynamoDB prints recently changed - instead of saying
"ResourceArn" the new error message says "resource arn".
Change the test to allow both forms, so it will pass on both Alternator
(which still uses the word ResourceArn - which is the name of the
parameter) and on DynamoDB (which uses "resource arn").
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test test_batch.py::test_batch_write_item_large_broken_connection
failed on DynamoDB (Refs #26079). It turns out this test has many
problems:
1. This test wrongly assumes a batch write needs to complete in one
attempt - and this fails on DynamoDB with low WCU capacity where
the batch needs to be resumed in multiple requests. Using boto3's
batch_writer() fixes this problem.
2. This test has NOTHING to do with batches - so is mis-named and
mis-placed. The batch write is just a way to prepare some data
in the table, and the real test is about Query'ing the data back
and observing the long response and reproducing issue #14454.
I did not rename or move the test, but left a comment explaining
the situation.
3. This test is written to assume the Query's response uses HTTP
chunked encoding. Which isn't actually true for DynamoDB, at least
not at the time of this writing. So the test fails on DynamoDB.
For the last reason, I made this test scylla_only. This test can't
really be run on DynamoDB without rewriting it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test test_batch.py::test_batch_write_item_large often fails when
running on DynamoDB, and this patch fixes it. The test checks that a
large but not over-the-limits large batch works. However, "works" only
means that the batch is not an error - it doesn't guarantee that all the
items in the batch are performed. If the WCU limits of the table are
exceeded DynamoDB may perform only part of the the batch and return the
remaining items as UnprocessedItems. This not only can happen, it
usually does happen on DynamoDB - because a new on-demand-billing table
always start with a very low WCU capacity.
So in this patch we update the test to recognize and perform the
UnprocessedItems, instead of assuming it needs to be empty.
The test continues to pass on Alternator, and finally passes on
DynamoDB.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In Alternator tests, the wait_for_gsi() utility function is used in
tests that add a GSI to an existing table, to wait for this new GSI
to become ready. Although this takes a fraction of a second on
Alternator, we noticed that this takes many minutes (!) on DynamoDB
so we used an absurdly high 10 minute timeout to allow tests to also
pass on DynamoDB.
But it turns out that 10 minutes wasn't absurdly high enough, and
tests using it in test_gsi_updatetable.py started to fail on DynamoDB.
Empirically, 10 minutes was enough in the past but it seems that today
adding a GSI to an empty table routinely takes as much as 20 minutes.
So this patch increases the wait_for_gsi() timeout to a whopping 30
minutes. After this patch, the tests in test_gsi_updatetable.py which
used to fail - test_gsi_backfill_with_lsi,
test_gsi_backfill_with_real_column, test_gsi_creates_and_deletes and
test_gsi_backfill_oversized_key now all pass on DynamoDB - but each
takes more than 20 minutes to pass.
To allow the test to fail much more quickly on Alternator (where
creating a GSI takes a fraction of a second), we set a much lower
but still very high timeout when running on Alternator - 60 seconds.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Currently, tablet allocation intentionally ignores current load (
introduced by the commit #1e407ab) which could cause identical shard
selection when allocating a small number of tablets in the same topology.
When a tablet allocator is asked to allocate N tablets (where N is smaller
than the number of shards on a node), it selects the first N lowest shards.
If multiple such tables are created, each allocator run picks the same
shards, leading to tablet imbalance across shards.
This change initializes the load sketch with the current shard load,
scaled into the [0,1] range, ensuring allocation still remains even
while starting from globally least-loaded shards.
Fixes https://github.com/scylladb/scylladb/issues/27620Closesscylladb/scylladb#27802
In 12dcf79c60, we avoid the ccache masquarate directory
when choosing sccache, as that would give us a double-caching
effect: first sccache is called, then clang++ is looked up
finding ccache masquarading as clang++. We solved that by
converting the name clang++ to the absolute path /usr/bin/clang++
(or whatever), skipping over the masquarade directory in $PATH.
It turns out that we need to do the same for ccache. That commit
changed the compile command to 'ccache clang++', and ccache will
look up clang++ in $PATH, finding itself in the masquarade directory.
Fix that by avoiding the masquarade directory if a compiler cache is
specified explicitly or is found with --compiler-cache=auto.
Closesscylladb/scylladb#27996
This pull request introduces HTTP response compression to Alternator, allowing responses (both string and chunked) to be compressed using `gzip` or `deflate` when requested by clients and when the response size exceeds configurable thresholds.
* Added new source files `http_compression.cc` and `http_compression.hh` implementing compression logic, including parsing client `Accept-Encoding` headers, selecting compression algorithms, and compressing response bodies using zlib.
* Added two new configuration options to `db::config` (`alternator_response_gzip_compression_level` and `alternator_response_gzip_compression_threshold_in_bytes`) to control compression level (and optionally disable compression with level 0 - no compression) and minimum response size for compression.
* Added tests showing compliance with DynamoDB behavior.
Fixes#27246
New feature - no backporting
Closesscylladb/scylladb#27454
* github.com:scylladb/scylladb:
alternator/http_compression: Add compression of streamed response
alternator/http_compression: Add implementation od gzip/deflate of string response
alternator/http_compression: Add handling of Accept-Encoding header
test/alternator: add tests for compressed responses
Replace -1 with 0 for the liveness check operation to avoid triggering
digest validation failures. This prevents rare fatal errors when the
cluster is recovering and ensures the test does not violate append_seq
invariants.
The value -1 was causing invalid digest results in the append_seq
structure, leading to assertion failures. This could happen when the
sentinel value was the first (or only) element being appended, resulting
in a digest that did not match the expected value.
By using 0 instead, we ensure that the digest calculations remain valid
and consistent with the expected behavior of the test.
The specific value of the sentinel is not important, as long as it is
a valid elem_t that does not violate the invariants of the append_seq
structure. In particular, the sentinel value is typically used only
when no valid result is received from any server in the current loop
iteration, in which case the loop will retry.
Fixes: scylladb/scylladb#27307
Move the post-condition check before the assertion to ensure it is
always executed first. Before, the wrong value could be passed to the
digest_remove assertion, making the pre-check trigger there instead of
the post-check as expected.
Also, add a check in the append_seq constructor to ensure that the
digest value is valid when creating an append_seq object.
Disable load balancing to avoid the balancer moving the tablet from a
node with less to a node with more available disk space. Otherwise, the
move_tablet API can fail (if the tablet is already in transisiton) or
be a no-op (in case the tablet has already been migrated)
Fixes: #27980Closesscylladb/scylladb#27993
Fix the race condition when the process finished, while test is trying
to checks its descriptors. Now instead of failing the whole loop, it
will continue to iterate the rest of the process to find the needed
process.
Closesscylladb/scylladb#27994
To avoid surprises when libstdc++, clang, or other components
in the toolchain introduce regressions, we introduce a "future
toolchain". This builds on the Fedora version under active
development, and the development branches of gcc and llvm.
The future toolchain is not intended to be frozen. Rather,
periodically we will build the future toolchain, then build
ScyllaDB and run its unit tests under that toolchain, then
discard it. Any problems will then have be be tracked down
by a developer and either reported to the source repository,
or fixed in ScyllaDB.
Closesscylladb/scylladb#27964
It should be possible to return the similarity of vectors in CQL statements following the [Cassandra compatible syntax](https://cassandra.apache.org/doc/latest/cassandra/getting-started/vector-search-quickstart.html#query-vector-data-with-cql):
```
SELECT comment, similarity_cosine(comment_vector, [0.1, 0.15, 0.3, 0.12, 0.05])
FROM cycling.comments_vs;
```
Although the calculations are slow, and we already have calculated results returned via Vector Store API,
we need the functionality as it allows us to calculate similarity of vectors not stored in vector indexes.
It will be needed for [quantization and rescoring](https://scylladb.atlassian.net/wiki/spaces/RND/pages/195985800/Quantization+and+Rescoring).
The feature is also a nice-to-have in testing as requested many times by testing and CX teams.
The optimized version utilizing already calculated distances from Vector Store without a need of rescoring will be coming soon after via https://github.com/scylladb/scylladb/pull/27991.
---
The patch adds functions:
- `similarity_cosine(<vector>, <vector>)`,
- `similarity_euclidean(<vector>, <vector>)`,
- `similarity_dot_product(<vector>, <vector>)`
Where `<vector>` is either a column of type `VECTOR<FLOAT, N>` or a vector of floats literal.
These functions can be called with every `SELECT` query, not only ANN vector queries as opposed to https://github.com/scylladb/scylladb/pull/25993.
The similarity calculations are implemented inspired by [USearch's implementation](
a2f1759910/include/usearch/index_plugins.hpp (L1304-L1385)) and made compatible with [Cassandra's documentation](https://cassandra.apache.org/doc/5.0/cassandra/developing/cql/functions.html#vector-similarity-functions).
That would guarantee the results in ScyllaDB are calculated using the exact same algorithms as used in Vector Store indexes.
---
Fixes: SCYLLADB-88
Fixes: SCYLLADB-89
New feature, should land into 2026.1
Closesscylladb/scylladb#27524
* github.com:scylladb/scylladb:
docs: add vector similarity functions documentation
test/cqlpy: add similarity functions correctness tests
test/cqlpy: add similarity functions invalid call tests
cql3: introduce similarity functions syntax
vector_similarity_fcts: introduce similarity functions
vector_similarity_fcts: retrieve similarity function argument types
vector_similarity_fcts: add calculating similarity between vectors
This patch modifies RESTful API handler which disables tablet
balancing to use topology request to wait for already running tablet
transitions. Before, it was just waiting for topology to be idle, so
it could wait much longer than necessary, also for operations which
are not affected by the flag, like repair. And repair can take hours.
New request type is introduced for this synchronization: noop_request.
It will preempt the tablet scheduler, and when the request executes,
we know all later tablet transitions will respect the "balancing
disabled" flag, and only things which are unuaffected by the flag,
like repair, will be scheduled.
Fixes#27647
The test test_streams.py::test_streams_putitem_new_item_overrides_old_lsi
failed on DynamoDB (Refs #26079) because we passed an unused parameter
NonKeyAttributes to the Projection setting an LSI. NonKeyAttributes is
only allowed when ProjectionType=INCLUDE, but we used ProjectionType=ALL.
DynamoDB refuses to create an LSI with such inconsistent parameters,
and we just need to remove this unnecessary parameter from this test.
The reason why this test didn't fail on Alternator is that Alternator
doesn't yet support or even parse the Projection parameter (Refs #5036).
We also add an xfailing test (passes on DynamoDB, fails on Alternator)
checking that a spurious NonKeyAttributes parameter is rejected. When
we get around to implement the projection feature (#5036), this will
be yet another acceptance test for this feature.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The semaphore has detection and protection against regular resource
leaks, where some resources go unaccounted for and are not released by
the time the semaphore is destroyed. There is no detection or protection
against negative leaks: where resources are "made up" of thin air. This
kind of leaks looks benign at first sight, a few extra resources won't
hurt anyone so long as this is a small amount. But turns out that even a
single extra count resource can defeat a very important anti-deadlock
protection in can_admit_read(): the special case which admits a new
permit regardless of memory resources, when all original count resources
all available. This check uses ==, so if resource > original, the
protection is defeated indefinitely. Instead of just changing == to >=,
we add detection of such negative leaks to signal(), via
on_internal_error_noexcept().
At this time I still don't now how this negative leak happens (the code
doesn't confess), with this detection, hopefully we'll get a clue from
tests or the field. Note that on_internal_error_noexcept() will not
generate a coredump, unless ScyllaDB is explicitely configured to do so.
In production, it will just generate an error log with a backtrace.
The detection also clams the _resources to _initial_resources, to
prevent any damage from the negativae leak.
I just noticed that there is no unit test for the deadlock protection
described above, so one is added in this PR, even if only loosely
related to the rest of the patch.
Fixes: SCYLLADB-163
Closesscylladb/scylladb#27764
Previous commit added means to decide whether client asks for compression and with which algorithm.
This patch adds actual compression of responses based on zlib library.
For now only string (not chunked) responses are compressed.
Several previously defined tests start to pass.
This is an initial patch to add support of Alternator's compressed responses.
The actual compression (gzip,deflate) will be added in the following commits.
The main functionality added in this commmit is parsing of Accept-Encoding header,
that indicates compression algorithms supported by the client.
In this commit we add also configuration parameters of response gzip/deflate compression.
They allow to enable/disable compression, set level and a size threshold below which a response is not compressed.
With current implementation it is possible to decide a compression for each response, but it is not used yet.
Adds set of tests that:
1. Show how DynamoDB handles response compression.
It supports 'gzip' and 'deflate' compression, which can be selected by providing 'Accept-Encoding` header. It only encodes response above 4096B.
- `test_compressed_response`, `test_compressed_response_large` show compression for various response sizes.
- `test_accept_encoding_header` focuses on testing various values of Accept-Encoding header.
- `test_multiple_accept_encoding_headers` verifies behaviour with repeted Accept-Encoding headers.
2. Will confirm implementation of response compression in Alternator (#27246)
Additonally to above test, we check Altenator specific expectations:
- `test_chunked_response_compression` makes sure that compression will work also for chunked responses.
- `test_set_compression_options` checks config options to set response size threshold for compression and compression level
3. `test_signature_trims_accept_encoding_spaces` reveals Alternator's bug in signature verification (#27775)
This reverts commit 1bb897c7ca, reversing
changes made to 954f2cbd2f. It makes
incompatible changes to the object storage configuration format, breaking
tests [1]. It's likely that it doesn't break any production configuration,
but we can't be sure.
Fixes#27966Closesscylladb/scylladb#27969
To make the test fast, in particular in debug mode
insert fewer keys and do not rely on os.urandom
which is notoriously slow
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the test depends on timing and enough inserted
data to abort the restore tasks at exactly the right time.
This is flaky in nature, so instead, use error injection
to synchronize the abort with mutation streaming.
Note that with that we no longer get the STREAM_MUTATION_FRAGMENTS
log message, so waiting for it is dropped from the test.
The most imporant thing is that some restore tasks must fail.
(We cannot guarantee all would fail unfortunately)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use the more modern asyncio facility to run cql queries
and a prepared statement to insert data into the table.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To generate multi-rack cluster, otherwise we get the following error:
```
E cassandra.protocol.ConfigurationException: <Error from server: code=2300 [Query invalid because of configuration issue] message="Replication factor 3 exceeds the number of racks (1) in dc datacenter1">
```
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
With the additional file_stat overload introduced in
[Update seastar submodule](3e9b071838),
use the opened directory for more efficient, relative-path based stat.
* Enhancement, no backport needed
Closesscylladb/scylladb#27967
* github.com:scylladb/scylladb:
table: get_snapshot_details: use relative-path based file_stat
table: get_snapshot_details: fix warning in exists_in_dir
table: get_snapshot_details: fix staging dir calculation
backup: process_snapshot_dir: use relative-path based file_stat
directory_lister: add ctor with opened directory
If a CQL session USEs a keyspace and then calls DESC TABLES, the user
expects to see only the tables in the chosen keyspace. However, calling
DESC KEYSPACES should still return list all the keyspaces - returning
just the USEd one is not useful - and also not what Cassandra does.
We had an xfailing test test_describe.py::test_keyspaces_with_use which
reproduces this bug (and passes on Cassandra).
In this patch we fix this bug. The fix is simple - USE should affect
DESC statements, but be ignored for DESC KEYSPACES. We can then remove
the xfail marker from the test.
The patch also includes a new test for the DESC TABLES case, where the
USE *does* have an affect. And I wanted to make sure the patch doesn't
break this case. As usual, the new test passes on both Cassandra and
ScyllaDB.
Fixes#26334
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27971
Context
-------
The procedure of hint draining boils down to the following steps:
1. Drain a hint sender. That should get rid of all hints stored
for the corresponding endpoint.
2. Remove the hint directory corresponding to that endpoint.
Obviously, it gets more complex than this high-level perspective.
Without blurring the view, the relevant information is that step 1
in the algorithm above may not be executed.
Breaking it down, it comprises of two calls to
`hint_sender::send_hints_maybe()`. The function is responsible for
sending out hints, but it's not unconditional and will not be performed
if any of the following bullets is not satisfied:
* `hint_sender::replay_allowed()` is not `true`. This can happen when
hint replay hasn't been turned on yet.
* `hint_sender::can_send()` is not `true`. This can happen if the
corresponding endpoint is not alive AND it hasn't left the cluster
AND it's still a normal token owner.
There is one more relevant point: sending hints can be stopped if
replaying hints fails and `hint_sender::send_hints_maybe()` returns
`false`. However, that's not not possible in the case of draining.
In that case, if Scylla comes across any failure, it'll simply delete
the corresponding hint segment. Because of that, we ignore it and
only focus on the two bullets.
---
Why is it a problem?
--------------------
If a hint directory is not purged of all hint segments in it,
any attempt to remove it will fail and we'll observe an error like this:
```
Exception when draining <host ID>: std::filesystem::__cxx11::filesystem_error
(error system:39, filesystem error: remove failed: Directory not empty [<path>])
```
The folder with the remaining hints will also stay on disk, which is, of
course, undesired.
---
When can it happen?
-------------------
As highlighted in the Context section of this commit message, the
key part of the code that can lead to a dangerous situation like that
is `hint_sender::send_hints_maybe()`. The function is called twice when
draining a hint endpoint manager: once to purge all of the existing
hints, and another time after flushing all hints stored in a commitlog
instances, but not listed by `hint_sender` yet. If any of those calls
misbehaves, we may end up with a problem. That's why it's crucial to
ensure that the function always goes through ALL of the hints.
Dangerous situations:
1. We try to drain hints before hint replay is allowed. That will
violate the first bullet above.
2. The node we're draining is dead, but it hasn't left the cluster,
and it still possesses some tokens.
---
How do we solve that?
---------------------
Hint replay is turned on in `main.cc`. Once enabled, it cannot be
disabled. So to address the first bullet above, it suffices to ensure
that no draining occurs beforehand. It's perfectly fine to prevent it.
Soon after hint replay is allowed, `main.cc` also asks the hint manager
to drain all of the endpoint managers whose endpoints are no longer
normal token owners (cf. `db::hints::manager::drain_left_nodes()`).
The other bullet is more tricky. It's important here to know that
draining only initiated in three situations:
1. As part of the call to `storage_service::notify_left()`.
2. As part of the call to `storage_service::notify_released()`.
3. As part of the call to `db::hints::manager::drain_left_nodes()`.
The last one is trivially non-problematic. The nodes that it'll try to
drain are no longer normal token owners, so `can_send()` must always
return `true`.
The second situation is similar. As we read in the commit message of
scylladb/scylladb@eb92f50413, which
introduced the notion of released nodes, the nodes are no longer
normal token owners:
> In this patch we postpone the hint draining for the "left" nodes to
> the time when we know that the target nodes no longer hold ownership
> of any tokens - so they're no longer referenced in topology. I'm
> calling such nodes "released".
I suggest reading the full commit message there because the problems
there are somewhat similar these changes try to solve.
Finally, the first situation: unfortunately, it's more tricky. The same
commit message says:
> When a node is being replaced, it enters a "left" state while still
> owning tokens. Before this patch, this is also the time when we start
> draining hints targeted to this node, so the hints may get sent before
> the token ownership gets migrated to another replica, and these hints
> may get lost.
This suggests that `storage_service::notify_left()` may be called when
the corresponding node still has some tokens! That's something that may
prevent properly draining hints.
Fortunately, no hope is lost. We only drain hints via `notify_left()`
when hinted handoff hasn't been upgraded to being host-ID-based yet.
If it has, draining always happens via `notify_released()`.
When I write this commit message, all of the supported versions of
Scylla 2025.1+ use host-ID-based hinted handoff. That means that
problems can only arise when upgrading from an older version of Scylla
(2024.1 downwards). Because of that, we don't cover it. It would most
likely require more extensive changes.
---
Non-issues
----------
There are notions that are closely related to sending hints. One of them
is the host filter that hinted handoff uses. It decides which endpoints
are eligible for receiving hints, and which are not. Fortunately, all
endpoints rejected by the host filter lose their hint endpoint managers
-- they're stopped as part of that procedure. What's more, draining
hints and changing the host filter cannot be happening at the same time,
so it cannot lead to any problems.
The solution
------------
To solve the described issue, we simply prevent draining hints before
hint replay is allowed. No reproducer test is attached because it's not
feasible to write one.
Fixesscylladb/scylladb#27693Closesscylladb/scylladb#27713
With the additional file_stat overload introduced in
3e9b071838, use the opened
directory for more efficient, relative-path based stat.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The functor is called both on the data directory as well
as on the staging directory, so the warning printed if the
found file is not the same inode should print the given path,
not datadir / name (as was copy and pasted).
Refs #27635
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
staging is based off of datadir, not snapshot_dir.
the issue was introduced in f5ca3657e2.
Refs #27635
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
With the additional file_stat overload introduced in
3e9b071838, use the opened
directory for more efficient, relative-path based stat.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This ctor allows the caller to open the directory first,
on its own, and pass it down to the directory_lister.
Once all callers use this ctor we can get rid of
the delayed open in the get() method.
Also, in can be used to replace full-path based file_stat calls
on listed entries with file_stat(directory, name) calls
that are based on statat() and a relative path name that is present
in the listed directory entry.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
sq
Add documentation in `functions.rst` as the CQL reference
for a vector similarity functions.
This includes the syntax, example usage, and prerequisites
for the parameters.
Add `calculate_similarity` function for testing purposes.
Add tests checking if CQL returned values match the calculated
ones with the precision up to 5th decimal place.
The tests should also be run on Cassandra to check compatibility
with their responses.
The similarity function syntax is:
`similarity_<metric_name>(<vector>, <vector>)`
Where `<metric_name>` is one of `cosine`, `euclidean` and `dot_product`
matching the intended similarity metric to be used within calculations.
Where `<vector>` is either a vector column name or vector literal.
Add `vectorSimilarityArgs` symbol that is an extension of `selectionFunctionArgs`,
but allowing to use the `value` as an argument as well as the `unaliasedSelector`.
This is needed as the similarity function syntax allows both the arguments to be
a vector value, so the grammar needs to recognize the vector literal there as well.
Since we actually support `SELECT`s with constants since this patch,
return true instead of throwing an error while trying to convert the function call
to constant.
This patch introduces scalar functions `similarity_cosine()`,
`similarity_euclidean()`, and `similarity_dot_product()`
which should return a float - similarity of the given vectors
calculated according to the function's similarity metric.
The argument types of this function are retrieved with
the `retrieve_vector_arg_types`, but shall be assignable to
`vector<float, N>` where `N` is the same for both arguments.
This patch introduces a dimensionality check during the execusion
of those functions.
This patch retrieves the argument types for similarity functions.
Newly introduced `retrieve_vector_arg_types` function checks if
the provided arguments are vectors of floats and if
both the vector values match the same type (dimension).
If so, we know the exact type and set it as the function arguments type.
Otherwise, if the exact type is unkown, but we can assign to vector<float, N>
then the dimensionality check will be done during execution of
the similarity function.
This also takes care of null values and bind variables the same way
as implemented in Cassandra to stay compatible.
Meaning that if we can infer the type from one argument, then the latter
may be unknown (null or ?).
Additionally this patch adds `test_assignment_any_vector` function
which tests the weak assignment to vector<float, N> as mentioned
above.
This commit introduces `compute_cosine_similarity`, `compute_euclidean_similarity`,
`compute_dot_product_similarity` functions to calculate the vectors similarity
in respective metric.
The similarity is a float value meaning how similar the vectors are in a range of [0, 1].
Values closer to 1 indicate greater similarity.
The `dot_product` similarity requires L2 normalized vectors as arguments.
The similarity is calculated based on the jVector's implementation used by Cassandra.
f967f1c924/jvector-base/src/main/java/io/github/jbellis/jvector/vector/VectorSimilarityFunction.java (L36-L69)
Different DynamoDB operations have different settings allowed for
their "ReturnValues" argument. In particular, some operations allow
ReturnValues=UPDATED_OLD but the DeleteItem operation *does not*.
We have a test, test_delete_item_returnvalues, aimed to verify this
but it had a typo and didn't actually check "UPDATED_OLD". This patch
fixes this typo.
The test still passes because the code itself (executor.cc,
delete_item_operation's constructor) has the correct check - it was
just the test that was wrong.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27918
* tools/cqlsh scylladb/scylla-cqlsh@9e5a91d7...scylladb/scylla-cqlsh@5a1d7842 (9):
> fix wrong reference in copyutil.py
> Add GitHub Action workflow to create releases on new tags
> test_copyutil.py: introdcue test for ImportTask
> fix(copyutil.py): avoid situatuions file might be move withing multiple processes
> Fix Unix socket port display in show_host() method
> Merge pull request #157 from scylladb/alert-autofix-1
.github/workflows/build-push.yml: Potential fix for code scanning alert no. 1: Workflow does not contain permissions
> .github/workflows/dockerhub-description.yml: Potential fix for code scanning alert no. 9: Workflow does not contain permissions
> test_cqlsh_output: skip some cassandra 5.0 table options
> tests: template compression cql to use `class` insted of `sstable_comprission`
> Pin Cassandra version to 5.0 for reproducible builds
> Remove scylla-enterprise integration test and update Cassandra to latest
Closesscylladb/scylladb#27924
Like C, Python supports some escape sequences in strings such as the
familiar "\n" that converts to a newline character.
Originally, when backslash was used before a random character, for
example, "\.", Python used to just use these literal characters
backslash and dot, in the string - and not make a fuss about it.
This made it ok to use a string like "hi\.there" as a regular expression.
We have a few instances of this in our Python tests.
But recent releases of Python started to produce ugly warnings about
these cases. The error message looks like:
SyntaxWarning: "\." is an invalid escape sequence. Such sequences
will not work in the future. Did you mean "\\."? A raw string is
also an option.
Indeed in most cases the easiest solution is to use a "raw string",
a string literal preceded with r. For example, r"hi\.there". In such
strings Python doesn't replace escape sequences like \n in the string,
and also leaves the \. unchanged for the regular expression to see.
So in this patch we use raw strings in all places in test/ where Python
warns have this problem.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27856
In afb96b6387, we added support for sccache. As a side effect
it changed the method of invoking ccache from transparent via PATH
(if it contains /usr/lib64/ccache) to explicit, by changing the compiler
command line from 'clang++' (which may or may not resolve the the ccache
binary) to 'ccache /usr/local/bin/clang++', which always invokes ccache.
In the default dbuild configuration, PATH does not contain /usr/lib64/ccache,
so ccache isn't invoked by default. Users can change this via the
SCYLLADB_DBUILD environment variable.
As a result of ccache being suddenly enabled for dbuild builds, ccache
will now attempt to create ~/.cache/ccache. Under docker, this does
not work, because we bind-mount ~/.cache/dbuild. Docker will create the
intermediate ~/.cache, but under the root user, not $USER. The intermediate
directory being root-owned prevents ~/.cache/ccache from being created.
Under podman, this does work, because everything runs under the container's
root user.
The fix is to bind-mount the entire ~/.ccache into the container. This
not only lets ccache create the directory, it will also find an existing
~/.cache/ccache directory and use it, enabling reuse across invocations.
Since ccache will now respect configuration changes without access to
its configuration file (notably, the maximum cache size), we also
bind-mount ~/.config.
Since ~/.ccache and ~/.config are not automatically created, we create
them explicitly so the bind mounts can work. This is for new nodes enlisted
from the cloud; developer machines will have those directories preexisting.
Note that the ccache directory used to be ~/.ccache, but was later changed.
Had the author known, we would have bind-mounted ~/.cache much earlier.
Fixes#27919.
Closesscylladb/scylladb#27920
Consider this:
- n1 is a coordinator and schedules tablet repair
- n1 detects tablet repair failed, so it schedules tablet transition to end_repair state
- n1 loses leadership and n2 becomes the new topology coordinator
- n2 runs end_repair on the tablet with session_id=00000000-0000-0000-0000-000000000000
- when a new tablet repair is scheduled, it hangs since the lock is already taken because it was not removed in previous step
To fix, we use the global_tablet_id to index the lock instead of the
session id.
In addition, we retry the repair_update_compaction_ctrl verb in case of
error to ensure the verb is eventually executed. The verb handler is
also updated to check if it is still in end_repair stage.
Fixes#26346Closesscylladb/scylladb#27740
* seastar f0298e40...4dcd4df5 (29):
> file: provide a default implementation for file_impl::statat
> util: Genralize memory_data_sink
> defer: Replace static_assert() with concept
> treewide: drop the support of fmtlib < 9.0.0
> test: Improve resilience of netsed scheduling fairness test
> Merge 'file: Use query_device_alignment_info in blkdev_alignments ' from Kefu Chai
file: Put alignment helpers in anonymous namespace
file: Use query_device_alignment_info in blkdev_alignments
> Merge 'file: Query physical block size and minimum I/O size' from Kefu Chai
file: Apply physical_block_size override to filesystem files
file: Use designated initializers in xfs_alignments
iotune: Add physical block size detection
disk_params: Add support for physical_block_size overrides from io_properties.yaml
block_device: Query alignment requirements separately for memory and I/O
> Merge 'json: formatter: fix formatting of std:string_view' from Benny Halevy
json: formatter: fix formatting of std:string_view
json: formatter: make sure std::string_view conforms to is_string_like
Fixes#27887
> demos:improve the output of demo_with_io_intent() in file_demo
> test: Add accept() vs accept_abort() socket test
> file: Refine posix_file_impl alignments initialization
> Add file::statat and a corresponding file_stat overload
> cmake: don't compile memcached app for API < 9
> Merge 'Revert to ~old lifetime semantics for lvalues passed to then()-alikes' from Travis Downs
future: adjust lifetime for lvalue continuations
future: fix value class operator()
> pollable_fd: Unfriend everything
> Merge 'file: experimental_list_directory: use buffered generator' from Benny Halevy
file: experimental_list_directory: use buffered generator
file: define list_directory_generator_type
> Merge 'Make datagram API use temporary_buffer<>-s' from Pavel Emelyanov
net: Deprecate datagram::get_data() returning packet
memcache: Fix indentation after previous patch
memcache: Use new datagram::get_buffers() API
dns: Use new datagram::get_buffers() API
tests: Use new datagram::get_buffers() API
demo: Use new datagram::get_buffers() API
udp: Make datagram implementations return span of temporary_buffer-s
> Merge 'Remove callback from timer_set::complete()' from Pavel Emelyanov
reactor: Fix indentation after previous patch
timers: Remove enabling callback from timer_set::complete()
> treewide: avoid 'static sstring' in favor of 'constexpr string_view'
> resource: Hide hwloc from public interface
> Merge 'Fix handle_exception_type for lvalues' from Travis Downs
futures_test: compile-time tests
function_traits: handle reference_wrapper
> posix_data_sink_impl: Assert to guard put UB
> treewide: fix build with `SEASTAR_SSTRING` undefined
> avoid deprecation warnings for json_exception
> `util/variant_utils`: correct type deduction for `seastar::visit`
> net/dns: fixed socket concurrent access
> treewide: add missing headers
> Merge 'Remove posix file helper file_read_state class' from Pavel Emelyanov
file: Remove file_read_state
test: Add a test for posix_file_impl::do_dma_read_bulk()
> membarrier: simplify locking
Adjust scylla to the following changes in scylla:
- file_stat became polymorphic
- needs explicit inference in table::snapshot_exists, table::get_snapshot_details
- file::experimental_list_directory now returns list_directory_generator_type
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#27916
We should check that the test feature is disabled on all nodes after a partial
upgrade. This hardens the test a bit, although the old code wasn't that bad,
since enabled features are a part of the group 0 state shared by all nodes.
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27654
The boost test view_schema_test.cc::node_view_update_backlog can be
flaky if the test machine has a hiccup of 100ms, and this patch fixes
it:
The test is a unit test for db::view::node_update_backlog, which is
supposed to cache the backlog calculation for a given interval. The
test asks to cache the backlog for 100ms, and then without sleeping
at all tries to fetch a value again and expect the unchanged cached
value to be returned. However, if the test run experiences a context
switch of 100ms, it can fail, and it did once as reported in #27876.
The fix is to change the interval in this test from 100ms to something
much larger, like 10 seconds. We don't sleep this amount - we just need
the second fetch to happen *before* 10 seconds has passed, so there's
no harm in using a very large interval.
However, the second half of this test wants to check that after the
interval is over, we do get a new backlog calculation. So for the
second half of this test we can and should use a shorter backlog -
e.g., 10ms. We don't care if the test machine is slow or context switched,
for this half of the test we want to to sleep *more* than 10ms, and
that's easy.
The fixed test is faster than the old one (10ms instead of 100ms) and
more reliable on a shared test machine.
Fixes#27876.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27878
The refresh api is expected to automatically delete
the sstable files from the uploads/ dir. Verify that.
The code that does that is currently called by
sstables_loader::load_new_sstables:
```c++
if (load_and_stream) {
...
co_await loader.load_and_stream(ks_name, cf_name, table_id, std::move(sstables_on_shards[this_shard_id()]), primary_replica_only(primary), true /* unlink */, scope, {});
```
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#27586
The current state (after PR #26836) is that Alternator tables are
created by default using tablets. But due to issue #23838, Alternator
Streams cannot be enabled on a table that uses tablets... An attempt to
enable Streams on such a table results in a clear error:
"Streams not yet supported on a table using tablets (issue #23838).
If you want to use streams, create a table with vnodes by setting
the tag 'system:initial_tablets' set to 'none'."
But users should be able to learn this fact from the documentation -
not just retroactively from an error message. This is especially important
because a user might create and fill a table using tablets, and only get
this error when attempting to enable Streams on the existing table -
when it is too late to change anything.
So this patch adds a paragraph on this to compatibility.md, where
several other requirements of Alternator Streams are already mentioned.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27000
Irritated by prevailing spellchecker comments attached to every PR, I aim to fix them all.
No need to backport, just cosmetic changes.
Closesscylladb/scylladb#27897
* github.com:scylladb/scylladb:
treewide: fix some spelling errors
codespell: ignore `iif` and `tread`
The explanation is in the new comment in `gossiper::add_saved_endpoint`.
We add a test for this change. It's "extremely white-box", but it's better
than nothing.
We currently do it only for a bootstrapping node, which is a bug. The
missing IP can cause an internal error, for example, in the following
scenario:
- replace fails during streaming,
- all live nodes are shut down before the rollback of replace completes,
- all live nodes are restarted,
- live nodes start hitting internal error in all operations that
require IP of the replacing node (like client requests or REST API
requests coming from nodetool).
We fix the bug here, but we do it separately for replace with different
IP and replace with the same IP.
For replace with different IP, we persist the IP -> host ID mapping
in `system.peers` just like for bootstrap. That's necessary, since there
is no other way to determine IP of the replacing node on restart.
For replace with the same IP, we can't do the same. This would require
deleting the row corresponding to the node being replaced from
`system.peers`. That's fine in theory, as that node is permanently
banned, so its IP shouldn't be needed. Unfortunately, we have many
places in the code where we assume that IP of a topology member is always
present in the address map or that a topology member is always present in
the gossiper endpoint set. Examples of such places:
- nodetool operations,
- REST API endpoints,
- `db::hints::manager::store_hint`,
- `group0_voter_handler::update_nodes`.
We could fix all those places and verify that drivers work properly when
they see a node in the token metadata, but not in `system.peers`.
However, that would be too risky to backport.
We take a different approach. We recover IP of the replacing node on
restart based on the state of the topology state machine and
`system.peers` just after loading `system.peers`.
We rely on the fact that group 0 is set up at this point. The only case
where this assumption is incorrect is a restart in the Raft-based
recovery procedure. However, hitting this problem then seems improbable,
and even if it happens, we can restart the node again after ensuring
that no client and REST API requests come before replace is rolled back
on the new topology coordinator. Hence, it's not worth to complicate the
fix (by e.g. looking at the persistent topology state instead of the
in-memory state machine).
The default error message of `closed_error` is "connection is closed".
It lacks the host ID and the IP address of the connected node, which
makes debugging harder. Also, it can be more specific when
`closed_error` is thrown due to the local node shutting down.
Fixes#16923Closesscylladb/scylladb#27699
The storage::snapshot() is used in two different modes -- one to save sstable as snapshot somewhere, and another one to create a copy of sstable. The latter use-case is "optimized" by snapshotting an sstable under new generation, but it's only true for local storage. Despite for S3 storage snapshot is not implemented, _cloning_ sstable stored on S3 is not necessarily going to be the same as doing a snapshot.
Another sign of snapshot and clone being different is that calling snapshot() for snapshot itself and for clone use two very different sets of arguments -- snapshotting specifies relative name and omits new generation, while cloning doesn't need "name" and instead provides generation. Recently (#26528) cloning got extra "leave_unsealed" tag, that makes no sense for snapshotting.
Having said that, this PR introduces sstables::storage::clone() method and modifies both, callers and implementations, according to the above features of each. As a result, code logic in both methods become much simpler and a bunch of bool classes and "_tag" helper structures goes away.
Improving internal APIs, no need to backport
Closesscylladb/scylladb#27871
* github.com:scylladb/scylladb:
sstables, storage: Drop unused bool classes and tags
sstables/storage: Drop create_links_common() overloads
sstable: Simplify storage::snapshot()
sstables: Introduce storage::clone()
This reverts commit a5edbc7d612df237a1dd9d46fd5cecf251ccfd13.
<h3>Why re-enabling table audit</h3>
Audit has been disabled (scylladb/scylla-enterprise/pull/3094) over many concerns raised against the table implementation, e.g. scylladb/scylla-enterprise/issues/2939 / scylladb/scylla-enterprise/issues/2759 + there's whole outstanding backlog of issues . One of the concerns was also a possible loss of availability, and since then we migrated audit keyspace from SimpleStrategy RF=1 to NetworkTopologyStrategy RF=3 (scylladb/scylla-enterprise/pull/3399) and stopped failing queries when auditing fails (scylladb/scylla-enterprise/pull/3118 & scylladb/scylla-enterprise/pull/3117), which improves the situation but doesn't address all the concerns. Eventually we want to use syslog as audit's sink, but it's not fully ready just yet, and so we'll restore table audit for now to increase the security, but later switch to syslog. BTW. cloud will enable table audit for AUTH category scylladb/sre-ops-automation/issues/2970 separately from this effort.
<h3>Performance considerations</h3>
We are assuming that the events for the enabled categories, i.e. DCL, DDL, AUTH & ADMIN, should appear at about the same, low cadence, with AUTH perhaps having the biggest impact of them all under some workloads. The performance penalty of enabling just the AUTH category [has been measured](https://scylladb.atlassian.net/wiki/spaces/RND/pages/148308005/Audit+performance+impact+test) and while authentication throughput and read/write throughput remain stable, the queries' P99 latency may decrease by a couple of % in the most hardcore scenarios.
Fixes: https://github.com/scylladb/scylladb/issues/26020
Gradually re-enabling audit feature, no need to backport.
Closesscylladb/scylladb#27262
* github.com:scylladb/scylladb:
doc: audit: set audit as enabled by default
Reapply "audit: enable some subset of auditing by default"
Currently, the tablet load balancer performs capacity based balancing by collecting the gross disk capacity of the nodes, and computes balance assuming that all tablet sizes are the same.
This change introduces size-based load balancing. The load balancer does not assume identical tablet sizes any more, and computes load based on actual tablet sizes.
The size-based load balancer computes the difference between the most and least loaded nodes in the balancing set (nodes in DC, or nodes in a rack in case of `rf-rack-valid-keyspaces`) and stops further balancing if this difference is bellow the config option `size_based_balance_threshold_percentage`.
This config option does not apply to the absolute load, but instead to the percentage of how much the most loaded node is more loaded than the least loaded node:
`delta = (most_loaded - least_loaded) / most_loaded`
If this delta is smaller then the config threshold, the balancer will consider the nodes balanced.
This PR is a part of a series of PRs which are based on top of each other.
- First part for tablet size collection via load_stats: #26035
- Second part reconcile load_stats: #26152
- The third part for load_sketch changes: #26153
- The fourth part which performs tablet load balancing based on tablet size: #26254
- The fifth part changes the load balancing simulator: #26438
This is a new feature, backport is not needed.
Fixes#26254Closesscylladb/scylladb#26254
* github.com:scylladb/scylladb:
test, load balancing: add test for table balance
load_balancer: add cluster feature for size based balancing
load_balancer: implement size-based load balancing
config: add size based load balancing config params
load_stats: use trinfo to decide how to reconcile tablet size
load_sketch: use tablet sizes in load computation
load_stats: add get_tablet_size_in_transition()
- table, storage_group: add compaction_group_count
- And use to reserve vector capacity before adding an item per compaction_group
- table: reduce allocations by using for_each_compaction_group rather than compaction_groups()
- compaction_groups() may allocate memory, but when called from a synchronous call site, the caller can use for_each_compaction_group instead.
* Improvement, no backport needed
Closesscylladb/scylladb#27479
* github.com:scylladb/scylladb:
table: reduce allocations by using for_each_compaction_group rather than compaction_groups()
replica: storage_group: rename compaction_groups to compaction_groups_immediate
Raft topology goes over all nodes in a 'left' state and triggers 'remove
node' notification in case id/ip mapping is available (meaning the node
left recently), but the problem is that, since the mapping is not removed
immediately, when multiple nodes are removed in succession a notification
for the same node can be sent several times. Fix that by sending
notification only if the node still exists in the peers table. It will
be removed by the first notification and following notification will not
be sent.
Closesscylladb/scylladb#27743
Add table size to DescribeTable's reply in Alternator
Fills DescribeTable's reply with missing field TableSizeBytes.
- add helper class simple_value_with_expiry, which is like std::optional
but the value put has a timeout.
- add ignore_errors to estimate_total_sstable_volume function - if set
to true the function will catch errors during RPC and ignore them,
substituting 0 for missing value.
- add a reference to storage_service to executor class (needed to call
estimate_total_sstable_volume function).
- add fill_table_description and create_table_on_shard0 as non static
methods to executor class
- calculate TableSizeBytes value for a given table and return it as
part of DescribeTable's return value. The value calculated is cached for
approximately 6 hours (as per DescribeTable's specification).
The algorithm is as follows:
- if the requested value is in cache and is still valid it's returned,
nothing else happens.
- otherwise:
- every shard of every node is requested to calculate size of its data
- if the error happens, the error is ignored and we assume the given
shard has a size of 0
- all such values are summed producing total size
- produced value is returned to a caller
- on the node the call for a size happened every shard is requested to
cache produced value with a 6 hour timeout.
- if the next call comes for a differet shard on the same node that
doesn't yet have cached value, the shard will request the value to
be calculated again. The new value will overwrite the old one on
every shard on this node.
- if the next call comes to a different node, the process of
calculation will happen from start, possibly producing different
value. The value will have it's own timeout, there's no attempt made
to synchronize value between nodes.
- add a alternator_describe_table_info_timeout_in_seconds parameter, which
will control, how long DescribeTable's table information are being held
in cache. Default is 6 hours.
- update test to use parameter
`alternator_describe_table_info_timeout_in_seconds` - setting it to 0
and forcing flushing memtables to disk allows checking, that table size
has grown.
Fixes#7551Closesscylladb/scylladb#24634
* github.com:scylladb/scylladb:
alternator: fix invalid rebase
Update tests
Update documentation
Add table size to DescribeTable's output
Promote fill_table_description and create_table_on_shard0 to methods
Modify estimate_total_sstable_volume to opt ignore errors
Add alternator_describe_table_info_cache_validity_in_seconds config option
Add ref to service::storage_service to executor
Add simple_value_with_expiry util class
Prevent stall when the group0 history is too long using unfreeze_gently
rather than the synchronous unfreeze() function
Fixes#27872
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#27873
Modify `storage_service::estimate_total_sstable_volume` function to
optionally ignore errors (instead substitute 0), when `ignore_errors`
parameter is set to `yes`.
Add a `simple_value_with_expiry` utility class, which functions like
a `std::optional` with added timeout. When emplacing a value, user
needs to provide timeout, after which value expires (in which case
the `simple_value_with_expiry` object behaves as if was never set
at all).
Add boost tests for the new class.
There's seastar helper that does the same, no need to carry yet another
implementation
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#27851
Group0 commands consist of one or more mutations and are supposed to be
atomic - i.e. the data structures that reflect the group0 tables state
are not supposed to be updated while only some mutations of a command
are applied, the logic responsible for that is not supposed to observe
an inconsistent state of group0 tables.
It turns out that this assumption can be broken if a node crashes in the
middle of applying a multi-mutation group0 command. Because these
mutations are, in general, applied separately, only some mutations might
survive a crash and a restart, so the group0 tables might be in an
inconsistent state. The current logic of group0_state_machine will
attempt to read the group0 tables' state as it was left after restart,
so it may observe inconsistent state.
This can confuse the node as it may observe a state that it was not
supposed to observe, or the state will just outright break some
invariants and trigger some sanity checks. One of those was observed in
https://github.com/scylladb/scylladb/issues/26945, where a command from the CDC generation
publisher fiber was partially applied. The fiber, in addition to
publishing generations, it removes old, expired generations as well.
Removal is done by removing data that describes the generation from
cdc_generations_v3 and by removing the generation's ID from the
committed generation list in the topology table. If only the first
mutation gets through but not the other one, on reload the node will see
a committed CDC generation without data, which will trigger an
on_internal_error check.
Fix this by delaying the moment when the in memory data structures are
first loaded. In 579dcf187a, a mechanism was introduced which persists the
commit index before applying commands that are considered committed.
Starting a raft server waits until commands are replayed up to that
point. The fix is to start the group0_state_machine in a mode which only
applies mutations - the aforementioned mechanism will re-apply the
commands which will, thanks to the mutation idempotency, bring the
group0 to a consistent state. After the group0 is known to be in
consistent state (so, after raft::server_impl::start) the in-memory data
structures of group0 are loaded for the first time.
There is an exception, however: schema tables. Information about schema
is actually loaded into memory earlier than the moment when group0 is
started. Applying changes to schema is done through the migration
manager module which compares the persisted state before and after the
schema mutations are applied and acts on that. Refactoring migration
manager is out of scope of this PR. However, this is not a problem
because the migration manager takes care to apply all of the mutations
given in a command in a single commitlog segment, so the initial schema
loading code should not see an inconsistent state due to the state being
partially applied.
The fix is accompanied by a reproducer of scylladb/scylladb#26945.
Fixes: scylladb/scylladb#26945
This is not a regression, so no need to backport.
Closesscylladb/scylladb#27528
* github.com:scylladb/scylladb:
test: cluster: test for recovery after partial group0 command
group0_state_machine: remove obsolete comment about group0 consistency
group0_state_machine: don't update in-memory state machine until start
group0_state_machine: move reloading out of std::visit
service: raft: add state machine ref to raft_server_for_group
The method in question returns coroutine generator that co_yields
directory_entry-s. In case the method is not implemented, seastar
creates a fallback generator, that calls existing subscription-based
list_directory() and co_yields them. And since checked file doesn't yet
have it, fallback generator is used, thus skipping the lower file
yielding lister. Not nice.
This patch implements the generator lister for checked file, thus making
full use of lower file generator lister too.
A side note. It's not enough to implement it like
return do_io_check([] {
return lower_file->experimental_list_directory();
});
like list_directory() does, since io-checking will _not_ happen on
directory reading itself, as it's supposed to.
This is the problem of the check_file::list_directory() implementation
-- it only checks for exception when creating the subscription (and it
really never happens), but reading the directory itself happens without
io checks.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#27850
This change adds a boost test which validates the resulting table
balance of size based load balancing. The threshold was set to a
conservative 1.5 overcommit to avoid flakyness.
This patch adds a cluster feature size_based_load_balancing which, until
enabled, will force capacity based balancing. This is needed because
during rolling upgrades some of the nodes will have incomplete data in
load_stats (missing tablet sizes and effective_capacity) which are
needed for size based balancing to make good decisions and issue correct
migrations.
This changes introduces tablet size based load balancing. It is an
extension of capacity based balancing with the addition of actual tablet
sizes.
It computes the difference between the most and least loaded nodes in
the DC and stops further balancing if this difference is bellow the
config option size_based_balance_threshold_percentage.
This config option does not apply to the absolute load, but instead to
the percentage of how much the most loaded node is more loaded than the
least loaded node:
delta = (most_loaded - least_loaded) / most_loaded
If this delta is smaller then the config threshold, the balancer will
consider the nodes balanced.
This change adds:
- The config paremeter force_capacity_based_balancing which, when
enabled performs capacity based balancing instead of size based.
- The config parameter size_based_balance_threshold_percentage which
sets the balance threshold for the size based load balancer.
- The config parameter minimal_tablet_size_for_balancing which sets the
minimal tablet size for the load balancer.
This patch corrects the way update_load_stats_on_end_migration() decides
which tablet transition occured, in order to reconcile tablet sizes in
load_stats. Before, the transition kind was inferred from the value of
leaving and pending replicas. This patch changes this to use the value
of trinfo.transition.
In case of a rebuild, and in case there is only one replica, the new
tablet size will be set to 0.
This patch adds a method to load_stats which searches for the tablet
size during tablet transition. In case of tablet migration, the tablet
will be searched on the leaving replica, and during rebuild we will
return the average tablet size of the pending replicas.
Recently, test/cluster/test_tablet.py::test_orphaned_sstables_on_startup started
spinning in the log browsing code, part of a the test library that looks into log files
for expected or unexpected patterns. This reproduced somewhat in continuous
integration, and very reliably for me locally.
The test was introduced in fa10b0b390, a year ago.
There are two bugs involved: first, that we're looking for crashes in this test,
since in fact it is expected to crash. The node expectedly fails with an
on_internal_error. Second, the log browsing code contains an infinite loop
if the crash backtrace happens to be the last thing in the log. The series
fixes both bugs.
Fixes#27860.
While the bad code exists in release branches, it doesn't trigger there so far, so best
to only backport it if it starts manifesting there.
Closesscylladb/scylladb#27879
* github.com:scylladb/scylladb:
test: pylib: log_browsing: fix infinite loop in find_backtraces()
test: pylib/log_browsing, cluster/test_tablets: don't look for expected crashes
There's a bunch of tagged create_links_common() overloads that call the
most generic one with properly crafted arguments and the link_mode.
Callers of those one-liners can craft the args themselves.
As a result, there's only one create_links_common() overload and callers
explicitly specifying what they want from it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now there are only two callers left -- sstable::snapshot() and
sstable::seal() that wants to auto-backup the sealed sstable.
The snapshot arguments are:
- relative path, use _base_dir
- no new generation provided
- no leave-unsealed tag
With that, the implementation of filesystem_storage::snapshot() is as
simple as
- prepare full path relative to _base_dir
- touch new directory
- call create_links_common()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And call it from sstable::clone() instead of storage::snapshot().
The snapshot arguements are:
- target directory is storage::prefix(), that's _dir itself
- new generation is always provided, no need for optional
- leave_unsealed bool flag
With that, the implementation of filesystem_storage::clone() is as
simple as call create_links_common() forwarding args and _dir to it. The
unification of leave_unsealed branches will come a bit later making this
code even shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test boost/error_injection_test.cc::test_inject_future_disabled
checks what happens when a sleep injection is *disabled*: The test
has a 10-millisecond-sleep injection and measures how much it takes.
The test expects it to take less than 10 milliseconds - in fact it
should take almost zero. But this is not guaranteed - on a slow debug
build and an overcommitted server this do-nothing injection can take
some time, and in one run (#27798) it took 14 milliseconds - and the
test failed.
The solution is easy - make the sleep-that-doesn't-happen much longer -
e.g., 10 whole seconds. Since this sleep still doesn't happen, we
expect the injection to return in less - much less - than 10 seconds.
This 10 seconds is so ridiculously high we don't expect the do-nothing
injection to take 10 seconds, not even a ridiculously busy test machine.
Fixes#27798
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27874
The find_backtraces() function uses a very convoluted loop to
read the log file. The loop fails to terminate if the last thing
in the log file is the backtrace, since the loop termination condition
(`not line`) continues to be true.
It's not clear why this did not reliably hit before, but it now
reliably reproduces for me on both x86 and aarch64. Perhaps timing
changed, or perhaps previously we had more text on the log.
test_tablets.test_orphaned_sstables_on_startup verifies that an
on_internal_error("Unable to load SSTable...") is generated when
an sstable outside a tablet boundary is found on startup.
The test indeed finds the error, but then proceeds to hang in
find_backtraces(), or fail if find_backtraces() is fixed, since
it finds an unexpected (for it) crash.
Fix this by not looking for crashes if a new option expected_crash
is set. Set it for this test.
This reverts commit caa0cbe328. It is
either extremely slow or broken. I was never able to get it to
run on an r8gd.8xlarge (on the NVMe disk). Even when it passes,
it is very slow.
Test script:
```
git submodule update --recursive || exit 125
rm -rf build
d() { ./tools/toolchain/dbuild -it -- "$@"; }
d ./configure.py --mode release || exit 125
d ninja release-build || exit 125
d ./test.py --mode release
```
Ref #27858
Ref #27859
Ref #27860
Traversing the span's freelist is known to generate "Cannot access
memory at address ..." errors, which is especially annoying when it
results in failed CI. Make this loop more robust: catch gdb.error coming
from it and just log a warning that some listed objects in the span may
be free ones.
Fixes: #27681Closesscylladb/scylladb#27805
This caused concurrent writers to operate on the same file, leading to file corruption. In some cases, this manifested as test failures and intermittent std::bad_alloc exceptions.
Change Description
This change ensures that each test instance uses a unique filename for downloaded bucket files.
By isolating file writes per test execution, concurrent runs no longer interfere with each other.
Fixes: #27824
backport not required
Closesscylladb/scylladb#27843
compaction_groups_immediate() may allocate memory, but when called from a
synchronous call site, the caller can use for_each_compaction_group
instead to iterate over the compaction groups with no extra allocations.
Calling compaction_groups_immediate() is still required from an async
context when we want to "sample" the compaction groups
so we can safely iterate over them and yield in the inner loop.
Also, some performance insensitive call sites using
compaction_groups_immediate had been left as they are
to keep them simple.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Improve scylla fiber's ability to traverse through coroutines.
Add --direction command-line parameter to scylla-fiber.
Fix out-of-date premit collection in scylla read-stat and improve the printout.
scylla-gdb.py improvements, no backport needed
Closesscylladb/scylladb#27766
* github.com:scylladb/scylladb:
scylla-gdb.py: scylla read-stats: include all permit lists
scylla-gdb.py: scylla fiber: add --direction command-line param
scylla-gdb.py: scylla fiber: add support for traversing through coroutines backward
Mention the type of batch: Logged or Unlogged. The size (warn/fail on
too large size) error has different significance depending on the type.
Refs: #27605Closesscylladb/scylladb#27664
Coerce the return value of config.getoption("--repeat") to int to avoid:
Traceback (most recent call last):
File "/usr/bin/pytest", line 8, in <module>
sys.exit(console_main())
~~~~~~~~~~~~^^
File "/usr/lib/python3.14/site-packages/_pytest/config/__init__.py", line 201, in console_main
code = main()
File "/usr/lib/python3.14/site-packages/_pytest/config/__init__.py", line 175, in main
ret: ExitCode | int = config.hook.pytest_cmdline_main(config=config)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
File "/usr/lib/python3.14/site-packages/pluggy/_hooks.py", line 512, in __call__
return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.14/site-packages/pluggy/_manager.py", line 120, in _hookexec
return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 167, in _multicall
raise exception
File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 121, in _multicall
res = hook_impl.function(*args)
File "/usr/lib/python3.14/site-packages/_pytest/helpconfig.py", line 154, in pytest_cmdline_main
config._do_configure()
~~~~~~~~~~~~~~~~~~~~^^
File "/usr/lib/python3.14/site-packages/_pytest/config/__init__.py", line 1118, in _do_configure
self.hook.pytest_configure.call_historic(kwargs=dict(config=self))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.14/site-packages/pluggy/_hooks.py", line 534, in call_historic
res = self._hookexec(self.name, self._hookimpls.copy(), kwargs, False)
File "/usr/lib/python3.14/site-packages/pluggy/_manager.py", line 120, in _hookexec
return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 167, in _multicall
raise exception
File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 121, in _multicall
res = hook_impl.function(*args)
File "/home/bdenes/ScyllaDB/scylladb/scylladb/test/pylib/runner.py", line 206, in pytest_configure
config.run_ids = tuple(range(1, config.getoption("--repeat") + 1))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
TypeError: can only concatenate str (not "int") to str
Closesscylladb/scylladb#27649
The script API is 500+ lines long in an already too long and hard to navigate document. Extract it to a separate document, making both documents shorter and easier to navigate.
Documentation refactoring, no backport needed.
Closesscylladb/scylladb#27609
* github.com:scylladb/scylladb:
docs: scylla-sstable-script-api.rst: add introduction and title
docs: scylla-sstable.rst: extract script API to separate document
docs: scylla-sstable: prepare for script API extract
If the table uses UDTs, include the description of these (CREATE TYPE
statement) in the schema dump. Without these the schema is not useful.
Closesscylladb/scylladb#27559
The method in question knows that it writes snapshot to local filesystem and uses this actively. This PR relaxes this knowledge and splits the logic into two parts -- one that orchestrates sstables snapshot and collects the necessary metadata, and the code that writes the metadata itself.
Closesscylladb/scylladb#27762
* github.com:scylladb/scylladb:
table: Move snapshot_file_set to table.cc
table: Rename and move snapshot_on_all_shards() method
table: Ditch jsondir variable
table, sstables: Pass snapshot name to sstable::snapshot()
table: Use snapshot_writer in write_manifest()
table: Use snapshot_writer in write_schema_as_cql()
table: Add snapshot_writer::sync()
table: Add snapshot_writer::init()
table: Introduce snapshot_writer
table: Move final sync and rename seal_snapshot()
table: Hide write_schema_as_cql()
table: Hide table::seal_snapshot()
table: Open-code finalize_snapshot()
table: Fix indentation after previuous patch
table: Use smp::invoke_on_all() to populate the vector with filenames
table: Don't touch dir once more on seal_snapshot()
table: Open-code table::take_snapshot() into caller lambda
table: Move parts of table::take_snapshot to sstables_manager
table: Introduce table::take_snapshot()
table: Store the result of smp::submit_to in local variable
Remove many unused "import" statements or parts of import statement.
All of them were detected by Copilot, but I verified each one manually
and prepared this patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27676
The file test/alternator/test_transact.py accidentally had two tests
with the same name, test_transact_get_items_projection_expression.
This means the first of the two tests was ignored and never run.
This patch renames the second of the two to a more appropriate
(and unique...) name.
I verified that after this change the number of tests in this file
grows by one, and that still all tests pass on DynamoDB and fail
(as expected by xfail) on Alternator.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27702
The db::config::object_storage_endpoints parameter is live-updateable, but when the update really happens, the new endpoints may fail to propagate to non-zero shards because of the way db::config sharding is implemented.
Refs: #7316Fixes: #26509
Backport to 2025.3 and 2025.4, AFAIK there are set ups with object storage configs for native backup
Closesscylladb/scylladb#27689
* github.com:scylladb/scylladb:
sstables/storage_manager: Fix configured endpoints observer
test/object_store: Add test to validate how endpoint config update works
Currently, we support ccache as the compiler cache. Since it is transparent, nothing
much is needed to support it.
This series adds support to sccache[1] and prefers it over ccache when it is installed.
sccache brings the following benefits over ccache:
1. Integrated distributed build support similar to distcc, but with automatic toolchain packaging and a scheduler
2. Rust support
3. C++20 modules (upcoming[2])
It is the C++20 modules support that motivates the series. C++20 modules have the potential to reduce
build times, but without a compiler cache and distributed build support, they come with too large
a penalty. This removes the penalty.
The series detects that sccache is installed, selects it if so (and if not overridden
by a new option), enables it for C++ and Rust, and disables ccache transparent
caching if sccache is selected.
Note: this series doesn't add sccache to the frozen toolchain or add dbuild support. That
is left for later.
[1] https://github.com/mozilla/sccache
[2] https://github.com/mozilla/sccache/pull/2516
Toolchain improvement, won't be backported.
Closesscylladb/scylladb#27834
* github.com:scylladb/scylladb:
build: apply sccache to rust builds too
build: prevent double caching by compiler cache
build: allow selecting compiler cache, including sccache
Remove many unused "import" statements or parts of import statement.
All of them were detected by Copilot, but I verified each one manually
and prepared this patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27675
Commit d3efb3ab6f added streaming session for rebuild, but it set
the session and request submission time. The session should be set when
request starts the execution, so this patch moved it to the correct
place.
Closesscylladb/scylladb#27757
Unused imports, unused variables and such.
No functional changes, just to get rid of some standard CodeQL warnings.
Benign - no need to backport.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#27801
Under podman, we already own /sys/fs/cgroup. Run the chown command only
under docker where the container does not map the host user to the
container root user.
The chown process is sometimes observed to fail with EPERM (see issue).
But it's not needed, so avoid it.
Fixes#27837.
Closesscylladb/scylladb#27842
Auth cache loading at startup is racing between
auth service and raft code and it doesn't support
concurrency causing it to crash.
We can't easily remove any of the places as during
raft recovery snapshot is not loaded and it relies
on loading cache via auth service. Therefore we add
semaphore.
Fixes https://github.com/scylladb/scylladb/issues/27540Closesscylladb/scylladb#27573
This patch was suggested and prepared by copilot, I am writing the commit
message because the original one was worthless.
In commit cf138da, for an an unexplained reason, a loop waiting until the
expected value appears in a materialized view was replaced by a call for
wait_for_view_built(). The old loop code was left behind in a comment,
and this commented-out code is now bothering our AI. So let's delete the
commented-out code.
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27646
To configure S3 storage, one needs to do
```
object_storage_endpoints:
- name: s3.us-east-1.amazonaws.com
port: 443
https: true
aws_region: us-east-1
```
and for GCS it's
```
object_storage_endpoints:
- name: https://storage.googleapis.com:433
type: gs
credentials_file: <gcp account credentials json file>
```
This PR updates the S3 part to look like
```
object_storage_endpoints:
- name: https://s3.us-east-1.amazonaws.com:443
aws_region: us-east-1
```
fixes: #26570
Not-yet released feature, no need to backport. Old configs are not accepted any longer. If it's needed, then this decision needs to be revised.
Closesscylladb/scylladb#27360
* github.com:scylladb/scylladb:
object_storage: Temporarily handle pure endpoint addresses as endpoints
code: Remove dangling mentions of s3::endpoint_config
docs: Update docs according to new endpoints config option format
object_storage: Create s3 client with "extended" endpoint name
test: Add named constants for test_get_object_store_endpoints endpoint names
s3/storage: Tune config updating
sstable: Shuffle args for s3_client_wrapper
For deployments fronted by a reverse proxy (haproxy or privatelink), we want to
use proxy protocol v2 so that client information in system.clients is correct and so
that the shard-aware selection protocol, which depends on the source port, works
correctly. Add proxy-protocol enabled variants of each of the existing native transport
listeners.
Tests are added to verify this works. I also manually tested with haproxy.
New feature, no backport.
Closesscylladb/scylladb#27522
* github.com:scylladb/scylladb:
test: add proxy protocol tests
config, transport: support proxy protocol v2 enhanced connections
As noticed by copilot, two tests in test_guardrail_compact_storage.py
could never fail, because they used `pytest.fail` instead of the
correct `pytest.fail()` to fail. Unfortunately, Python has a footgun
where if it sees a bare function name without parenthesis, instead of
complaining it evaluates the function object and then ignores it,
and absolutely nothing happens.
So let's add the missing `()`. The test still passes, but now it at
least has a chance of failing if we have a regression.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27658
during any jenkins job that trigger `test.py` we get:
```
/jenkins/workspace/releng-testing/byo/byo_build_tests_dtest/scylla/test/pylib/s3_proxy.py:152: SyntaxWarning: 'return' in a 'finally' block
```
The 'return' statement in the finally block was causing a SyntaxWarning.
Moving the return outside the finally block ensures proper exception
handling while maintaining the intended behavior.
Closesscylladb/scylladb#27823
sstable_validation_test tests the `scylla sstable validate` command
by passing it intentionally corrupted sstables. It uses an sstable
cache to avoid re-creating the same sstables. However, the cache
does not consider the sstable version, so if called twice with the
same inputs for different versions, it will return an sstable with
the original version for both calls. As a results, `ms` sstables
were not tested. Fix this bug by adding the sstable version (and
the schema for good measure) to the cache key.
An additional bug, hidden by the first, was that we corrupted the
sstable by overwriting its Index.db component. But `ms` sstables
don't have an Index.db component, they have a Partitions.db component.
Adjust the corrupting code to take that into account.
With these two fixes, test_scylla_sstable_validate_mismatching_partition_large
fails on `ms` sstables. Disable it for that version. Since it was
previously practically untested, we're not losing any coverage.
Fixing this test unblocks further work on making pytest take charge
of running the tests. pytest exposed this problem, likely by running
it on different runners (and thus reducing the effectiveness of the
cache).
Fixes#27822.
Closesscylladb/scylladb#27825
* seastar 7ec14e83...f0298e40 (8):
> Merge 'coroutine/try_future: call set_current_task() when resuming the coroutine' from Botond Dénes
coroutine/try_future: call set_current_task() when resuming the coroutine
core: move set_current_task() out-of-line
> stop_signal: stop including reactor.hh
> cmake: Mark hwloc headers as system includes to suppress warnings
> build: explicitly enable vptr sanitizer
> httpd: Add API to set tcp keepalive params
> Merge 'Make datagram_channel::send() use temporary_buffer-s' from Pavel Emelyanov
net: Remove no longer used to_iovec() helpers
net,code: Update callers to use new datagram_channel::send()
net: Introduce datagram_channel::send(span<temporary_buffer>) method
posix-stack: Make UDP socket implementation use wrapped_iovec
posix-stack: Introduce wrapped_iovec
> code: Move pollable_fd_state::write_all(const char*) from API level 9
> thread: Remove unused sched_group() helper
configure.py: added -lubsan to DEBUG sanitizer flags
Closesscylladb/scylladb#27511
This problem and its fix was suggested by copilot, I'm just writing the
cover letter.
test/nodetool/test_status.py has the silly statement tokens == "?" which
has no effect. Looking around the code suggested to me (and also to
Copilot, nice) that the correct intent was assert tokens == "?" and not,
say, tokens = "?".
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27659
Add a reproducer for scylladb/scylladb#26945. By using error injections,
the test triggers a situation where a command that removes an obsolete
CDC generation is partially applied, then the node is killed an brought
back. Thanks to the fix, restarting the node succeeds and does not
trigger any consistency checks in the group0 reload logic.
The comment is outdated. It is concerned about group0 consistency after
crash, and that re-applying committed commands may require a raft
quorum. First, 579dcf1 was introduced (long ago) which gets rid of the
need for quorum as the node persists the commit index before applying
the commands - so it knows up to which command it should re-apply on
restart. Second, the preceding commits in this PR makes use of this
mechanism for group0.
Remove the comment as the concern was fully addressed. Additionally,
remove a mention of the comment in raft_group0_client.cc - although it
claims that the comment is placed in `group0_state_machine::apply`, it
has been moved to `merge_and_apply` in 96c6e0d (both comments were
originally introduced in 6a00e79).
Group0 commands consist of one or more mutations and are supposed to be
atomic - i.e. the data structures that reflect the group0 tables state
are not supposed to be updated while only some mutations of a command
are applied, the logic responsible for that is not supposed to observe
an inconsistent state of group0 tables.
It turns out that this assumption can be broken if a node crashes in the
middle of applying a multi-mutation group0 command. Because these
mutations are, in general, applied separately, only some mutations might
survive a crash and a restart, so the group0 tables might be in an
inconsistent state. The current logic of group0_state_machine will
attempt to read the group0 tables' state as it was left after restart,
so it may observe inconsistent state.
This can confuse the node as it may observe a state that it was not
supposed to observe, or the state will just outright break some
invariants and trigger some sanity checks. One of those was observed in
scylladb/scylladb#26945, where a command from the CDC generation
publisher fiber was partially applied. The fiber, in addition to
publishing generations, it removes old, expired generations as well.
Removal is done by removing data that describes the generation from
cdc_generations_v3 and by removing the generation's ID from the
committed generation list in the topology table. If only the first
mutation gets through but not the other one, on reload the node will see
a committed CDC generation without data, which will trigger an
on_internal_error check.
Fix this by delaying the moment when the in memory data structures are
first loaded. In 579dcf1, a mechanism was introduced which persists the
commit index before applying commands that are considered committed.
Starting a raft server waits until commands are replayed up to that
point. The fix is to start the group0_state_machine in a mode which only
applies mutations - the aforementioned mechanism will re-apply the
commands which will, thanks to the mutation idempotency, bring the
group0 to a consistent state. After the group0 is known to be in
consistent state (so, after raft::server_impl::start) the in-memory data
structures of group0 are loaded for the first time.
There is an exception, however: schema tables. Information about schema
is actually loaded into memory earlier than the moment when group0 is
started. Applying changes to schema is done through the migration
manager module which compares the persisted state before and after the
schema mutations are applied and acts on that. Refactoring migration
manager is out of scope of this PR. However, this is not a problem
because the migration manager takes care to apply all of the mutations
given in a command in a single commitlog segment, so the initial schema
loading code should not see an inconsistent state due to the state being
partially applied.
Fixes: scylladb/scylladb#26945
In the next commit, we will adjust the logic so that it only reloads in
memory state only when a flag is set. By moving the reload logic to one
place in `merge_and_apply`, the next commit will be able to reach its
goal by only adding a single `if`.
This reference will be used by the code that starts group0. It will
manually enable the in-memory state machine only after the group0 server
is fully started, which entails replaying the group0 commands that are,
locally, seen as committed - in order to repair any inconsistencies that
might have arisen due to some commands being applied only partially
(e.g. due to a crash).
Due to the recent changes in the vector store service,
the service needs to read two of the system tables
to function correctly. This was not accounted for
when the new permission was added. This patch fixes that
by allowing these tables (group0_history and versions)
to be read with the VECTOR_SEARCH_INDEXING permission.
We also add a test that validates this behavior.
Fixes: SCYLLADB-73
Closesscylladb/scylladb#27546
Initially, tests for high availability were implemented in vector-store.git
repository. High availability is currently implemented in scylladb.git
repository so this repository should be the better place to store them. This
commit copies these tests into the scylladb.git.
The commit copies validator-vector-store/src/high_availability.rs (tests logic)
and validator-tests/src/common.rs (utils for tests) into the local crate
validator-scylla. The common.rs should be copied to be able for reviewer to see
common test code and this code most likely be frequent to change - it will be
hard to maintain one common version between two repositories.
The commit updates also README for vector_search_validator; it shortly describe
the validator modules.
The commit updates reference to the latest vector-store.git master.
As a next step on the vector-store.git high_availability.rs would be removed
and common.rs moved from validator-tests into validator-vector-store.
References: VECTOR-394
Closesscylladb/scylladb#27499
This workflow validates that all commits in a pull request use email
addresses ending in @scylladb.com. For each commit with an author or
committer email that doesn't match this pattern, the workflow automatically
adds a comment to the pull request with a warning.
This serves two purposes:
1. Alert maintainers when external contributors submit code (which is
acceptable, but good to be aware of)
2. Help ScyllaDB developers catch cases where they haven't configured
their git email correctly
When a non-@scylladb.com email is detected, the workflow posts this
comment on the pull request:
```
⚠️Non-@scylladb.com Email Addresses Detected
Found commit(s) with author or committer emails that don't end with
@scylladb.com.
This indicates either:
- An external contributor (acceptable, but maintainer should be aware)
- A developer who hasn't configured their git email correctly
For ScyllaDB developers:
If you're a ScyllaDB employee, please configure your git email globally:
git config --global user.email "your.name@scylladb.com"
If only your most recent commit is invalid, you can amend it:
git commit --amend --reset-author --no-edit
git push --force
If you have multiple invalid commits, you need to rewrite them all:
git rebase -i <base-commit>
# Mark each invalid commit as 'edit', then for each:
git commit --amend --reset-author --no-edit
git rebase --continue
# Repeat for each invalid commit
git push --force
```
Fixes: https://scylladb.atlassian.net/browse/RELENG-35Closesscylladb/scylladb#27796
The table::seal_snapshot() accepts a vector of sstables filenames and
writes them into manifest file. For that, it iterates over the vector
and moves all filenames from it into the streamer object.
The problem is that the vector contains foreign pointers on sets with
sstrings. Not only sets are foreign, sstrings in it are foreign too.
It's not correct to std::move() them to local CPU.
The fix is to make streamer object work on string_view-s and populate it
with non-owning references to the sstrings from aforementioned sets.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#27755
There are two test with name test_repair_options_hosts_tablets in
test/nodetool/test_cluster_repair.py and and two test_repair_keyspace
in test/nodetool/test_repair.py. Due to that one of each pair is ignored.
Rename the tests so that they are unique.
Fixes: https://github.com/scylladb/scylladb/issues/27701.
Closesscylladb/scylladb#27720
Fixes#27694
Unless set by config, the location will default to /etc/scylla, which is not a good
place to write things for tests. Push the config properly and the directory (but
_not_ creation) to all provider basetype.
Closesscylladb/scylladb#27696
The Boost.Test framework offers a way to describe tests written in it
by running them with the option `--list_content`. It can be
parametrized by either HRF (Human Readable Format) or DOT (the Graphviz
graph format) [1]. Thanks to that, we can learn the test tree structure
and collect additional information about the tests (e.g. labels [2]).
We currently emply that feature of the framework to collect and run
Boost tests in Scylla. Unfortunately, both formats have their
shortcomings:
* HRF: the format is simple to parse, but it doesn't contain all
relevant information, e.g. labels.
* DOT: the format is designed for creating graphical visualizations,
and it's relatively difficult to parse.
To amend those problems, we implement a custom extension of the feature.
It produces output in the JSON format and contains more than the most
basic information about the tests; at the same time, it's easy to browse
and parse.
To obtain that output, the user needs to call a Boost.Test executable
with the option `--list_json_content`. For example:
```
$ ./path/to/test/exec -- --list_json_content
```
Note that the argument should be prepended with a `--` to indicate that
it targets user code, not Boost.Test itself.
---
The structure of the new format looks like this (top-level downwards):
- File name
- Test suite(s) & free test cases
- Test cases wrapped in test suites
Note that it's different from the output the default Boost.Test formats
produce: they organize information within test suites, which can
potentially span multiple files [3]. The JSON format makes test files
the primary object of interest and test suites from different files
are always considered distinct.
Example of the output (after applying some formatting):
```
$ ./build/dev/test/boost/canonical_mutation_test -- --list_json_content
[{"file":"test/boost/canonical_mutation_test.cc", "content": {
"suites": [],
"tests": [
{"name": "test_conversion_back_and_forth", "labels": ""},
{"name": "test_reading_with_different_schemas", "labels": ""}
]
}}]
```
---
The implementation may be seen as a bit ugly, and it's effectively
a hack. It's based on registering a global fixture [4] and linking
that code to every Boost.Test executable.
Unfortunately, there doesn't seem to be any better way. That would
require more extensive changes in the test files (e.g. enforcing
going through the same entry point in all of them).
This implementation is a compromise between simplicity and
effectiveness. The changes are kept minimal, while the developers
writing new tests shouldn't need to remember to do anything special.
Everything should work out of the box (at least as long as there's
no non-trivial linking involved).
Fixesscylladb/scylladb#25415
---
References:
[1] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/utf_reference/rt_param_reference/list_content.html
[2] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/tests_organization/tests_grouping.html
[3] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/tests_organization/test_tree/test_suite.html
[4] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/tests_organization/fixtures/global.htmlClosesscylladb/scylladb#27527
On start the manager creates observer for object_storage_endpoints
config parameter. The goal is to refresh the maintained set of endpoint
parameters and client upon config change. The observer is created on
shard 0 only, and when kicked it calls manager.invoke-on-all to update
manager on all shards.
However, there's a race here. The thing is that db::config values are
implicitly "sharded" under the hood with the help of plain array. When
any code tries to read a value from db::config::something, the reading
code secretly gets the value from this inner array indexed by the
current shard id.
Next, when the config is updated, it first assigns new values to [0]
element of the hidden array, then calls broadcast_to_all_shards() helper
that copies the valaues from zeroth slot to all the others. But the
manager's observer is triggered when the new value is assigned on zero
index, and if the invoke-on-all lambda (mentioned above) happens to be
faster than broadcast_to_all_shards, the non-zero shards will read old
values from db::config's inner array.
The fix is to instantiate observer on all shards and update only local
shard, whenever this update is triggered.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a test for backup with non-existing endpoint/bucket/snapshot. It
checks that API call to backup sstables properly fails in that case.
This patch adds similar test for "unconfigured endpoint", but it adds
the endpoint configuration on-the-fly and expects that backup will
proceed after config update.
Currently the test fails, as config update only affect the config
itself, the storage_manager, that's in charge of maintaining endpoint
clients, is not really updated. Next patch will fix it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now it's database::snapshot_table_on_all_shards(). This is symmetric to
database::truncate_table_on_all_shards().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the table::snapshot_on_all_shards() is storage-independent and can
stop maintaining the local path variable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently sstable::snapshot() is called with directory name where to put
snapshots into. This patch changes it to accept snapshot name instead.
This makes the table-sstable API be unware of snapshot destination
storage type.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The manifest writing code is self-contained in a sense that it needs
list of sstable files and output_stream to write it too. The
snapshot_writer can provide output_stream for specific component, it can
be re-used for manifest writing code, thus making it independent from
local filesystem.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The schema writing code is self-contained in a sense that it needs
schema description and output_stream to write it too. Teach the
snapshot_writer to provide output_stream and make write_schema_as_cql()
be independent from local filesystem.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's an abstract class that defines how to write data and metadata with
table snapshot. Currently it just replaces the storage_options checks
done by table::snapshot_on_all_shards(), but it will soon evolve.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The seal_snapshot() syncs directory at the end. Now when the method is
table.cc-local, it doesn't need to be that careful. It looks nicer if
being renamed to write_manifest() and it's caller that syncs directory
after calling it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method only needs schema description from table. The caller can
pre-get it and pass it as argument. This makes it symmetric with
seal_snapshot() (that will be renamed soon) and reduces the class table
API size.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method is static and has nothing to do with table. The
snapshot_file_set needs to become public, but it will be moved to
table.cc soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a vector of foreign pointers to sets with sstable filenames
that's populated on all shards. The code does the invoke-on-all by hand
to grow the vector with push-back-s. However, if resizing the vector in
advance, shards will be able to just populate their slots.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now when the logic of take_snapshot() is split between two components
(table and sstables_manager) it's no longer useful
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Move the loop over vector of sstables that calls sstable->snapshot()
into sstables manager.
This makes it symmetric with sstables_manager::delete_atomically() and
allows for future changes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method returns all sstables vector with a guard that prevents this
list from being modified. Currently this is the part of another existing
table::take_snapshot() method, but the newer, smaller one, is more
atomic and self-contained, next patches will benefit from it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This PR extends BaseLWTTester with optional counter-table configuration and
verification, enabling randomized LWT tests over tablets with counters.
And introduces new LWT with counters test durng tablets resize and migration
- Workload: N workers perform CAS updates
- Update counter table each time CAS was successful
- Enable balancing and increase min_tablet_count to force split,
and lower min_tablet_count to merge.
- Run tablets migrations loop
- Stop workload and verify data consistency
Refs: https://github.com/scylladb/qa-tasks/issues/1918
Refs: https://github.com/scylladb/qa-tasks/issues/1988
Refs https://github.com/scylladb/scylladb/issues/18068Closesscylladb/scylladb#27170
* github.com:scylladb/scylladb:
test: new LWT with counters test during tablets migration/resize - Workload: N workers perform CAS updates - Update counter table each time CAS was successful - Enable balancing and increase min_tablet_count to force split, and lower min_tablet_count to merge. - Run tablets migrations loop - Stop workload and verify data consistency
test/lwt: add counter-table support to BaseLWTTester
Split prepare can run concurrently with repair.
Consider this:
1) split prepare starts
2) incremental repair starts
3) split prepare finishes
4) incremental repair produces unsplit sstable
5) split is not happening on sstable produced by repair
5.1) that sstable is not marked as repaired yet
5.2) might belong to repairing set (has compaction disabled)
6) split executes
7) repairing or repaired set has unsplit sstable
If split was acked to coordinator (meaning prepare phase finished),
repair must make sure that all sstables produced by it are split.
It's not happening today with incremental repair because it disables
split on sstables belonging to repairing group. And there's a window
where sstables produced by repair belong to that group.
To solve the problem, we want the invariant where all sealed sstables
will be split.
To achieve this, streaming consumers are patched to produce unsealed
sstable, and the new variant add_new_sstable_and_update_cache() will
take care of splitting the sstable while it's unsealed.
If no split is needed, the new sstable will be sealed and attached.
This solution was also needed to interact nicely with out of space
prevention too. If disk usage is critical, split must not happen on
restart, and the invariant aforementioned allows for it, since any
unsplit sstable left unsealed will be discarded on restart.
The streaming consumer will fail if disk usage is critical too.
The reason interposer consumer doesn't fully solve the problem is
because incremental repair can start before split, and the sstable
being produced when split decision was emitted must be split before
attached. So we need a solution which covers both scenarios.
Fixes#26041.
Fixes#27414.
Should be backported to 2025.4 that contains incremental repair
Closesscylladb/scylladb#26528
* github.com:scylladb/scylladb:
test: Add reproducer for split vs intra-node migration race
test: Verify split failure on behalf of repair during critical disk utilization
test: boost: Add failure_when_adding_new_sstable_test
test: Add reproducer for split vs incremental repair race condition
compaction: Fail split of new sstable if manager is disabled
replica: Don't split in do_add_sstable_and_update_cache()
streaming: Leave sstables unsealed until attached to the table
replica: Wire add_new_sstables_and_update_cache() into intra-node streaming
replica: Wire add_new_sstable_and_update_cache() into file streaming consumer
replica: Wire add_new_sstable_and_update_cache() into streaming consumer
replica: Document old add_sstable_and_update_cache() variants
replica: Introduce add_new_sstables_and_update_cache()
replica: Introduce add_new_sstable_and_update_cache()
replica: Account for sstables being added before ACKing split
replica: Remove repair read lock from maybe_split_new_sstable()
compaction: Preserve state of input sstable in maybe_split_new_sstable()
Rename maybe_split_sstable() to maybe_split_new_sstable()
sstables: Allow storage::snapshot() to leave destination sstable unsealed
sstables: Add option to leave sstable unsealed in the stream sink
test: Verify unsealed sstable can be compacted
sstables: Allow unsealed sstable to be loaded
sstables: Restore sstable_writer_config::leave_unsealed
After tests end, an extra check is performed, looking into node logs for crashes, aborts and similar issues.
The test directory is also scanned for coredumps.
If any of the above are found, the test will fail with an error.
The following checks are made:
- Any log line matching `Assertion.*failed` or containing `AddressSanitizer` is marked as a critical error
- Lines matching `Aborting on shard` will only be marked as a critical error if the paterns in `manager.ignore_cores_log_patterns` are not found in that log
- If any critical error is found, the log is also scanned for backtraces
- Any backtraces found are decoded and saved
- If the test is marked with `@pytest.mark.check_nodes_for_errors`, the logs are checked for any `ERROR` lines
- Any pattern in `manager.ignore_log_patterns` and `manager.ignore_cores_log_patterns` will cause above check to ignore that line
- The `expected_error` value that many methods, like `manager.decommission_node`, have will be automatically appended to `manager.ignore_log_patterns`
refs: https://github.com/scylladb/qa-tasks/issues/1804
---
[Examples](https://jenkins.scylladb.com/job/scylla-staging/job/cezar/job/byo_build_tests_dtest/46/testReport/):
Following examples are run on a separate branch where changes have been made to enable these failures.
`test_unfinished_writes_during_shutdown`
- Errors are found in logs and are not ignored
```
failed on teardown with "Failed:
Server 2096: found 1 error(s) (log: scylla-2096.log)
ERROR 2025-12-15 14:20:06,563 [shard 0: gms] raft_topology - raft_topology_cmd barrier_and_drain failed with: std::runtime_error (raft topology: command::barrier_and_drain, the version has changed, version 11, current_version 12, the topology change coordinator had probably migrated to another node)
Server 2101: found 4 error(s) (log: scylla-2101.log)
ERROR 2025-12-15 14:20:04,674 [shard 0:strm] repair - repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: Repair 1 out of 4 ranges, keyspace=system_distributed, table=view_build_status, range=(minimum token,maximum token), peers=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], live_peers=[b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], status=failed: mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive
ERROR 2025-12-15 14:20:04,674 [shard 1:strm] repair - repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: Repair 1 out of 4 ranges, keyspace=system_distributed, table=view_build_status, range=(minimum token,maximum token), peers=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], live_peers=[b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], status=failed: mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive
ERROR 2025-12-15 14:20:04,675 [shard 0: gms] raft_topology - raft_topology_cmd stream_ranges failed with: std::runtime_error (["shard 0: std::runtime_error (repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: 1 out of 4 ranges failed, keyspace=system_distributed, tables=[\"view_build_status\", \"cdc_generation_timestamps\", \"service_levels\", \"cdc_streams_descriptions_v2\"], repair_reason=bootstrap, nodes_down_during_repair={27c027a6-603d-49d0-8766-1b085d8c7d29}, aborted_by_user=false, failed_because=std::runtime_error (Repair mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive, keyspace=system_distributed, mandatory_neighbors=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e]))", "shard 1: std::runtime_error (repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: 1 out of 4 ranges failed, keyspace=system_distributed, tables=[\"view_build_status\", \"cdc_generation_timestamps\", \"service_levels\", \"cdc_streams_descriptions_v2\"], repair_reason=bootstrap, nodes_down_during_repair={27c027a6-603d-49d0-8766-1b085d8c7d29}, aborted_by_user=false, failed_because=std::runtime_error (Repair mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive, keyspace=system_distributed, mandatory_neighbors=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e]))"])
ERROR 2025-12-15 14:20:06,812 [shard 0:main] init - Startup failed: std::runtime_error (Bootstrap failed. See earlier errors (Rolled back: Failed stream ranges: std::runtime_error (failed status returned from 9dd942aa-acec-4105-9719-9bda403e8e94)))
Server 2094: found 1 error(s) (log: scylla-2094.log)
ERROR 2025-12-15 14:20:04,675 [shard 0: gms] raft_topology - send_raft_topology_cmd(stream_ranges) failed with exception (node state is bootstrapping): std::runtime_error (failed status returned from 9dd942aa-acec-4105-9719-9bda403e8e94)"
```
`test_kill_coordinator_during_op`
- aborts caused by injection
- `ignore_cores_log_patterns` is not set
- while there are errors in logs and `ignore_log_patterns` is not set, they are ignored automatically due to the `expected_error` parameter, such as in `await manager.decommission_node(server_id=other_nodes[-1].server_id, expected_error="Decommission failed. See earlier errors")`
```
failed on teardown with "Failed:
Server 1105: found 1 critical error(s), 1 backtrace(s) (log: scylla-1105.log)
Aborting on shard 0, in scheduling group gossip.
1 backtrace(s) saved in scylla-1105-backtraces.txt
Server 1106: found 1 critical error(s), 1 backtrace(s) (log: scylla-1106.log)
Aborting on shard 0, in scheduling group gossip.
1 backtrace(s) saved in scylla-1106-backtraces.txt
Server 1113: found 1 critical error(s), 1 backtrace(s) (log: scylla-1113.log)
Aborting on shard 0, in scheduling group gossip.
1 backtrace(s) saved in scylla-1113-backtraces.txt
Server 1148: found 1 critical error(s), 1 backtrace(s) (log: scylla-1148.log)
Aborting on shard 0, in scheduling group gossip.
1 backtrace(s) saved in scylla-1148-backtraces.txt"
```
Decoded backtrace can be found in [failed_test_logs](https://jenkins.scylladb.com/job/scylla-staging/job/cezar/job/byo_build_tests_dtest/46/artifact/testlog/x86_64/dev/failed_test/test_kill_coordinator_during_op.dev.1)
Closesscylladb/scylladb#26177
* github.com:scylladb/scylladb:
test: add logging to crash_coordinator_before_stream injection
test: add crash detection during tests
test.py: add pid to ServerInfo
There are three tests in cluster/object_store suite that check how
backup fails in case either of its parameters doesn't really exists. All
three greatly duplicate each other, it makes sense to merge them into
one larger parametrized test.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#27695
This pull request introduces a new caching mechanism for client options in the Alternator and transport layers, refactors how client metadata is stored and accessed, and extends the `system.clients` virtual table to surface richer client information. The changes improve efficiency by deduplicating commonly used strings (like driver names/versions and client options), and ensure that client data is handled in a way that's safe for cross-shard access. Additionally, the test suite and virtual table schema are updated to reflect the new client options data.
**Caching and client metadata refactoring:**
* The largest and most repeatable items in the connection state before this PR were a `driver_name` and a `driver_version` which were stored as an `sstring` object which means that the corresponding memory consumption was 16 bytes per each such value at least (the smallest size of the `seastar`'s `sstring` object) **per-connection**. In reality the driver name is usually longer than 15 characters, e.g. "ScyllaDB Python Driver" is 23 characters and this is not the longest driver name there is. In such cases the actual memory usage of a corresponding `sstring` object jumps to 8 + 4 + 1 + (string length, 23 in our example) + 1.
So, for "ScyllaDB Python Driver" it would be 37 bytes (in reality it would be a bit more due to natural alignment of other allocations since the `contents` size is not well aligned (13 bytes), but let's ignore this for now).
* These bytes add up quickly as there are more connections and, sometimes we are talking about millions of connections per-shard.
* Using a smart pointer (`lw_shared_ptr`) referencing a corresponding cached value will effectively reduce the per-connection memory usage to be 8 bytes (a size of a pointer on 64-bit CPU platform) for each such value. While storing a corresponding `sstring` value only once.
* This will would reduce the "variable" (per-connection) memory usage by **at least 50%**. And in case of "ScyllaDB Python Driver" driver version - by 78%!
* And all this for a price of a single `loading_shared_values` object **per-shard** (implements a hash table) and a minor overhead for each value **stored** in it.
* Introduced a new cache type (`client_options_cache_type`) for deduplicating and sharing client option strings, and refactored `client_data`, `client_state`, and related classes to use `foreign_ptr<std::unique_ptr<client_data>>` and cached entry types for fields like driver name, driver version, and client options. (`client_data.hh`, `service/client_state.hh`, `alternator/server.hh`, `alternator/controller.hh`, `transport/controller.hh`, `transport/protocol_server.hh`) [[1]](diffhunk://#diff-664a3b19e905481bdf8eb3843fc4d34691067bb97ab11cfd6e652e74aac51d9fR33-R36) [[2]](diffhunk://#diff-664a3b19e905481bdf8eb3843fc4d34691067bb97ab11cfd6e652e74aac51d9fL40-R56) [[3]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL105-R107) [[4]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL154-R182) [[5]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL91-R92) [[6]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL110-R111) [[7]](diffhunk://#diff-31730ba8e7374f784a88dc27c1512291cf73b7f24e08768f7466a3c8cfcc7a1aL96-R96) [[8]](diffhunk://#diff-19a97c0247cc08155ee49b277e43859ca32d6ef8cbff0ed7368ec5fa19e0a11eL172-R172) [[9]](diffhunk://#diff-eea7e2db5d799a25e717a72ac8ce5842bd4adb72b694d38d8f47166d9cd926faL356-R356) [[10]](diffhunk://#diff-d0b4ec3a144bbc5dc993866cf0b940850a457ff6156064f7e2b4b10ad0a95fefL80-R80) [[11]](diffhunk://#diff-4293b94c444d9bd5ecd17ce7eda8c00685d35ecf6e07f844efc91a91bbe85be1L46-R48)
* Updated the methods for setting and getting driver name, driver version, and client options in `client_state` to be asynchronous and use the new cache. (`service/client_state.hh`, `service/client_state.cc`) [[1]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL154-R182) [[2]](diffhunk://#diff-99634aae22e2573f38b4e2f050ed2ac4f8173ff27f0ae8b3609d1f0cc1aeb775R347-R362)
**Virtual table and API enhancements:**
* Extended the `system.clients` virtual table schema and implementation to include a new `client_options` column (a map of option key/value pairs), and updated the table population logic to use the new cached types and foreign pointers. (`db/virtual_tables.cc`) [[1]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1R752) [[2]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L769-R770) [[3]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L809-R816) [[4]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L828-R879)
**API and interface changes:**
* Changed the signatures of `get_client_data` methods throughout the codebase to return vectors of `foreign_ptr<std::unique_ptr<client_data>>` instead of plain `client_data` objects, to ensure safe cross-shard access. (`alternator/controller.hh`, `alternator/controller.cc`, `alternator/server.hh`, `alternator/server.cc`, `transport/controller.hh`, `transport/protocol_server.hh`) [[1]](diffhunk://#diff-31730ba8e7374f784a88dc27c1512291cf73b7f24e08768f7466a3c8cfcc7a1aL96-R96) [[2]](diffhunk://#diff-19a97c0247cc08155ee49b277e43859ca32d6ef8cbff0ed7368ec5fa19e0a11eL172-R172) [[3]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL110-R111) [[4]](diffhunk://#diff-a7e2cda866c03a75afcf3b087de1c1dcd2e7aa996214db67f9a11ed6451e596dL988-R995) [[5]](diffhunk://#diff-eea7e2db5d799a25e717a72ac8ce5842bd4adb72b694d38d8f47166d9cd926faL356-R356) [[6]](diffhunk://#diff-d0b4ec3a144bbc5dc993866cf0b940850a457ff6156064f7e2b4b10ad0a95fefL80-R80) [[7]](diffhunk://#diff-4293b94c444d9bd5ecd17ce7eda8c00685d35ecf6e07f844efc91a91bbe85be1L46-R48)
**Testing and validation:**
* Updated the Python test for the `system.clients` table to verify the new `client_options` column and its contents, ensuring that driver name and version are present in the options map. (`test/cqlpy/test_virtual_tables.py`) [[1]](diffhunk://#diff-6dd8bd4a6a82cd642252a29dc70726f89a46ceefb991c3e63fc67e283f323f03R79) [[2]](diffhunk://#diff-6dd8bd4a6a82cd642252a29dc70726f89a46ceefb991c3e63fc67e283f323f03R88-R90)
Closesscylladb/scylladb#25746
* github.com:scylladb/scylladb:
transport/server: declare a new "CLIENT_OPTIONS" option as supported
service/client_state and alternator/server: use cached values for driver_name and driver_version fields
system.clients: add a client_options column
controller: update get_client_data to use foreign_ptr for client_data
The Boost ASSERTs in the digest functions of the randomized_nemesis_test
were not working well inside the state machine digest functions, leading
to unhelpful boost::execution_exception errors that terminated the apply
fiber, and didn't provide any helpful information.
Replaced by explicit checks with on_fatal_internal_error calls that
provide more context about the failure. Also added validation of the
digest value after appending or removing an element, which allows to
determine which operation resulted in causing the wrong value.
This effectively reverts the changes done in https://github.com/scylladb/scylladb/pull/19282,
but adds improved error reporting.
Refs: scylladb/scylladb#27307
Refs: scylladb/scylladb#17030Closesscylladb/scylladb#27791
The doc about DDL statements claims that an `ALTER KEYSPACE` will fail
in the presence of an ongoing global topology operation.
This limitation was specifically referring to RF changes, which Scylla
implements as global topology requests (`keyspace_rf_change`), and it
was true when it was first introduced (1b913dd880) because there was
no global topology request queue at that time, so only one ongoing
global request was allowed in the cluster.
This limitation was lifted with the introduction of the global topology
request queue (6489308ebc), and it was re-introduced again very
recently (2e7ba1f8ce) in a slightly different form; it now applies only
to RF changes (not to any request type) and only those that affect the
same keyspace. None of these two changes were ever reflected in the doc.
Synchronize the doc with the current state.
Fixes#27776.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#27786
This will allow to add custom XML attribute to the JUnit report. In this
case there will be path to the function that can be used to run with
pytest command. Parametrized tests will have path to the function
excluding parameter.
Closesscylladb/scylladb#27707
Preemptible reclaim is only done from the background reclaimer,
so backtrace is not useful. It's also normal that it takes a long time.
Skip the backtrace when reclaim is preemptible to reduce log noise.
Fixes the issue where background reclaim was printing unnecessary
backtraces in lsa-timing logs when operations took longer than the
stall detection threshold.
Closes: #27692
Co-authored-by: tgrabiec <283695+tgrabiec@users.noreply.github.com>
This patch consists of a few smaller follow-ups to the view building worker:
- catch general execption in staging task registrator
- remove unnecessary CV broadcast
- don't pollute function context with conditionally compiled variable
- avoid creating a copy of tasks map
- fix some typos
Refs https://github.com/scylladb/scylladb/issues/25929
Refs https://github.com/scylladb/scylladb/pull/26897
This PR doesn't fix any bugs but recently we're backporting some PRs to 2025.4, so let's also backport this one to avoid painful conflicts.
Closesscylladb/scylladb#26558
* github.com:scylladb/scylladb:
docs/dev/view-building-coordinator: fix typos
db/view/view_building_worker: remove unnnecessary empty lines
db/view/view_building_worker: fix typo
db/view/view_building_worker: avoid creating a copy of tasks map
db/view/view_building_worker: wrap conditionally compiled code in a scope
db/view/view_building_worker: remove unnecessary CV broadcast
db/view/view_building_worker: catch general execption in staging task registrator
Currently, we determine the live vs. total snapshot size by listing all files in the snapshot directory,
and for each name, look it up in the base table directory and see if it exists there, and if so, if it's the same file
as in the snapshot by looking to the fstat data for the dev id and inode number.
However, we do not look the names in the staging directory so staging sstable
would skew the results as the will falsely contribute to the live size, since they
wouldn't be found in the base directory.
This change processes both the staging directory and base table directory
and keeps the file capacity in a map, indexed by the files inode number, allowing us to easily
detect hard links and be resilient against concurrent move of files from the staging sub-directory
back into the base table directory.
Fixes#27635
* Minor issue, no backport required
Closesscylladb/scylladb#27636
* github.com:scylladb/scylladb:
table: get_snapshot_details: add FIXME comments
table: get_snapshot_details: lookup entries also in the staging directory
table: get_snapshot_details: optimize using the entry number_of_links
table: get_snapshot_details: continue loop for manifest and schema entries
table: get_snapshot_details: use directory_lister
These patches fix a bunch of variables defined in test/cqlpy tests, but not used. Besides wasting a few bytes on disk, these unused variables can add confusion for readers who see them and might think they have some use which they are missing.
All these unused variables were found by Copilot's "code quality" scanner, but I considered each of them, and fixed them manually.
Closesscylladb/scylladb#27667
* github.com:scylladb/scylladb:
test/cqlpy: remove unused variables
test/cqlpy: use unique partition in test
This commit adds a page with an overview of Vector Search under the Features section.
It includes a link to the VS documentation in ScyllaDB Cloud,
as the feature is only available in ScyllaDB Cloud.
The purpose of the page is to raise awareness of the feature.
Fixes https://scylladb.atlassian.net/browse/VECTOR-215Closesscylladb/scylladb#27787
Add a new configuration option for selecting the compiler
cache. Prefer sccache if found, since it supports rust as
well as C++, has better support for distributed compilation,
and is slated to receive module support soon.
cmake is also supported.
For some reason, we might fail. Retry 10 times, and fail with an error code instead of 404 or whatnot.
Benign, I hope - no need to backport.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#27746
`-fexperimental-assignment-tracking` was added in fdd8b03d4b to
make coroutine debugging work.
However, since then, it became unnecessary, perhaps due to 87c0adb2fe,
or perhaps to a toolchain fix.
Drop it, so we can benefit from assignment tracking (whatever it is),
and to improve compatibility with sccache, which rejects this option.
I verified that the test added in fdd8b03d4b fails without the option
and passes with this patch; in other words we're not introducing a
regression here.
Closesscylladb/scylladb#27763
Update the documentation about restrictions to tablets keyspaces related
to RF-rack.
* MV/SI require the keyspace to be RF-rack-valid
* topology operations are restricted if a keyspace has views to preserve
RF-rack-validity
When creating an index we validate that the keyspace is RF-rack-valid
and print a warning that the keyspace must remain RF-rack-valid.
This should apply only to indexes that are based on materialized views
for which there are consistency concerns when the keyspace is not
RF-rack-valid.
vector indexes are not based on materialized views, hence these
restrictions should not apply to them.
Creating a MV or index in a tablets-based keyspace now forces additional
restrictions on the keyspace. The keyspace must be RF-rack-valid and it
must remain RF-rack-valid while the view exists.
Add a CQL warning about these restrictions.
allow tablet merge of tables with views even if the
rf_rack_valid_keyspaces option is not set, because now keyspaces that
have views are enforced to always be rf-rack-valid, regardless of the
option value.
Extend the RF-rack validation in `assert_rf_rack_valid_keyspace` to
validate rack-list-based replication as well. Previously, validation was
done only for numeric replication.
If the replication is based on a rack list, we validate that all racks
that are required for replication are present in the topology rack map.
If some rack is needed for replication but is missing, or it doesn't
have normal token owner nodes, the validation fails with an error.
add tests that attempt to create a keyspace during different stages of
node join or remove, and verify that the rf-rack condition can't be
broken - either creating the keyspace should fail or the node operation
should fail, depending on the stage.
If a keyspace is created while a node is joining or being removed, it could
break the rf-rack invariant. For example:
1. We have 3 nodes in 3 racks, no keyspaces
2. A new node starts to join in a new rack - passes validation because
there are no keyspaces
3. Create a keyspace with rf=3 - passes validation because the joining
node is not a normal token owner yet
4. The new node becomes a normal token owner
5. The rf-rack invariant is broken. We have rf=3 and 4 racks
To fix this, we change the rf-rack check to consider a node as a token
owner if it's either a normal token owner or it has bootstrap tokens and
is about to become a normal token owner.
Now the condition can't be broken. Consider keyspace creation at
different stages of adding a node in our example:
* Before the node is assigned bootstrap tokens: the node is not
considered. We can create a keyspace with rf=3 as if the node doesn't
exist, and then node join will fail in the group0 operation that
assigns bootstrap tokens, because during this operation we check
rf-rack validity.
* Assigning bootstrap tokens is a single group0 operation that is
serialized with keyspace creation. During this operation we check that
adding the node as a token owner will maintain rf-rack validity for all
keyspaces.
* After the node is assigned bootstrap tokens and until it becomes a
normal token owner: it is considered as a transitioning token owner by
the rf-rack check and the rack is considered a transitioning rack. We
can't count the rack as a normal rack because the node join may still
fail and rollback. Trying to create a keyspace with either rf=3 or
rf=4 will fail because we can end up with either 3 or 4 racks.
Similarly, when removing a node, we validate that removing the node will
maintain rf-rack validity in the same group0 operation that changes the
node state to removing/decommissioning, after which the node becomes a
leaving endpoint, and it's not considered a normal token owner anymore
for the rf-rack check.
Add tests that verify the restrictions on topology operations when there
are keyspaces with tablets and materialized views.
For such keyspaces, RF=Racks must be enforced while they have
materialized views, therefore adding a node in a new rack or removing a
node that would eliminate a rack should be rejected.
add new tests for testing that RF-rack validity is maintained when doing
topology operations that may break them, such as adding nodes in new
racks or removing nodes.
when a new node joins or an existing node is removed / decommissioned,
check if the operation would violate the RF-rack-validity of some
keyspace. if so - reject the operation in order to preserve
RF-rack-validity.
Fixesscylladb/scylladb#23345Fixesscylladb/scylladb#26820
add validation to node remove / decommission, similar to node validation
when a node joins.
when starting node remove or decommission, the validation function
checks if the operation is valid and can proceed. if not, it's aborted
with an error message.
we change the return type of validate_joining_node so that it will be
similar and consistent with the new validate_removing_node.
Extend the locator function assert_rf_rack_valid_keyspace to accept
arbitrary topology dc-rack maps and nodes instead of using the current
token metadata.
This allows us to add a new variant of the function that checks rf-rack
validity given a topology change that we want to apply. we will use it
to check that rf-rack validity will be maintained before applying the
topology change.
The possible topology changes for the check are node add and node remove
/ decommission. These operations can change the number of normal racks -
if a new node is added to a new rack, or the last node is removed from a
rack.
The function validate_view_keyspace checks if a keyspace is eligible for
having materialized views, and it is used for validation when creating a
MV or a MV-based index.
Previously, it was required that the rf_rack_valid_keyspaces option is
set in order for tablets-based keyspaces to be considered eligible, and
the RF-rack condition was enforced when the option is set.
Instead of this, we change the validation to allow MVs in a keyspace if
the RF-rack condition is satisfied for the keyspace - regardless of the
config option.
We remove the config validation for views on startup that validates the
option `rf_rack_valid_keyspaces` is set if there are any views with
tablets, since this is not required anymore.
We can do this without worrying about upgrades because this change will
be effective from 2025.4 where MVs with tablets are first out of
experimental phase.
We update the test for MV and index restrictions in tablets keyspaces
according to the new requirements.
* Create MV/index: previously the test checked that it's allowed only if
the config option `rf_rack_valid_keyspaces` is set. This is changed
now so it's always allowed to create MV/index if the keyspace is
RF-rack-valid. Update the test to verify that we can create MV/index
when the keyspace is RF-rack-valid, even if the rf_rack option is not
set, and verify that it fails when the keyspace is RF-rack-invalid.
* Alter: Add a new test to verify that while a keyspace has views, it
can't be altered to become RF-rack-invalid.
Extend the RF-rack-validity enforcement to keyspaces that have views,
regardless of the option `rf_rack_valid_keyspaces`.
Previously, RF-rack-validity was enforced when `rf_rack_valid_keyspaces`
was set for all keyspaces. Now we want to allow creating MVs in tablet
keyspaces that are RF-rack-valid and enforce the RF-rack-validity even
if the config option is not set.
Add the helper function enforce_rf_rack_validity_for_keyspace that
returns true if RF-rack-validity should be enforced for a keyspace, and
use it wherever we need to check this instead of checking the config
option directly.
This is useful because this condition is used in multiple places, and
having it defined in a single helper function will make it easier to
see and change the enforcement conditions.
simple refactoring: the enforce parameter is always given the value of
the `rf_rack_valid_keyspaces` option. remove the parameter and use the
option value directly from the db config.
this will be useful for a later change to the enforcement conditions.
the test test_unfinished_writes_during_shutdown starts 3 nodes in 3
racks and creates a keyspace with RF=3, then adds a new node in a 4th
rack. this breaks rf-rack validity for the keyspace.
we change it instead to add the new node in an existing rack. it doesn't
matter for the test - the test only wants to add a new node to trigger
some topology change.
The `vector_store_client_test_dns_resolving_repeated` test had race
conditions causing it to be flaky. Two main issues were identified:
1. Race between initial refresh and manual trigger: The test assumes
a specific resolution sequence, but timing variations between the
initial DNS refresh (on client creation) and the first manual
trigger (in the test loop) can cause unexpected delayed scheduling.
2. Extra triggers from resolve_hostname fiber: During the client
refresh phase, the background DNS fiber clears the client list.
If resolve_hostname executes in the window after clearing but
before the update completes, pending triggers are processed,
incrementing the resolution count unexpectedly. At count 6, the
mock resolver returns a valid address (count % 3 == 0), causing
the test to fail.
The fix relaxes test assertions to verify retry behavior and client
clearing on DNS address loss, rather than enforcing exact resolution
counts.
Fixes: #27074Closesscylladb/scylladb#27685
Declare support for a 'CLIENT_OPTIONS' startup key.
This key is meant to be used by drivers for sending client-specific
configurations like request timeouts values, retry policy configuration, etc.
The value of this key can be any string in general (according to the CQL binary protocol),
however, it's expected to be some structured format, e.g. JSON.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Optimize memory usage changing types of driver_name and driver_version be
a reference to a cached value instead of an sstring.
These fields very often have the same value among different connections hence
it makes sense to cache these values and use references to them instead of duplicating
such strings in each connection state.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This new column is going to contain all OPTIONS sent in the
STARTUP frame of the corresponding CQL session.
The new column has a `frozen<map<text, text>>` type, and
we are also optimizing the amount of required memory for storing
corresponding keys and values by caching them on each shard level.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
get_client_data() is used to assemble `client_data` objects from each connection
on each CPU in the context of generation of the `system.clients` virtual table data.
After collected, `client_data` objects were std::moved and arranged into a
different structure to match the table's sorting requirements.
This didn't allow having not-cross-shard-movable objects as fields in the `client_data`,
e.g. lw_shared_ptr objects.
Since we are planning to add such fields to `client_data` in following patches this patch
is solving the limitation above by making get_client_data() return `foreign_ptr<std::unique_ptr<client_data>>`
objects instead of naked `client_data` ones.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This PR migrates schema management tests from dtest to this repository.
One reason is that there is an ongoing effort to migrate tests from dtest to here.
Test `TestLargePartitionAlterSchema.test_large_partition_with_drop_column` failed with timeout error once. The main suspect so far are infra related problems, like infra congestion. The [logs from the test execution](https://jenkins.scylladb.com/job/scylla-master/job/dtest-release/1062/testReport/junit/schema_management_test/TestLargePartitionAlterSchema/Run_Dtest_Parallel_Cloud_Machines___Dtest___full_split001___test_large_partition_with_drop_column/), linked in the issue [test_large_partition_with_drop_column failed on TimeoutError #26932](https://github.com/scylladb/scylladb/issues/26932) show the following:
- `populate` works as intended - it starts, then during populate/insert drop column happened, then an exception is raised and intentionally ignored in the test, so no `Finish populate DB` for 50 x 1490 records - expected
- drop column works as intended - interrupts `populate` and proceeds to flush
- flush **probably** works as intended - logs are consistent with what we expect and what I got in local test runs
- `read` is the only thing that visibly got stuck, all the way until timeout happened, 5 minutes after the start
Migrating the test to this repo will also give us test start and end times on CI machines, in the sql report database. It has start and end timestamp for each test executed. We will be able to see how long does it usually take when the test is successful. It can not be seen from the logs, because logs are not kept for successful tests.
Another thing this PR does is adding a log message at the end of `database::flush_all_tables`. This will let us know if a thread got stuck inside or finished successfully. This addresses the **probably** part of the flush analysis step described above. If the issue reoccurs, we will have more information.
The test `test_large_partition_with_add_column` has not been executing for ~5 years. It was never migrated to pytest. The name was left as `large_partition_with_add_column_test`, and was skipped. Now it is enabled and updated.
Both `test_large_partition_with_add_column` and `test_large_partition_with_drop_column` are improved.
Small performance improvements:
- Regex compilation extracted from the stress function to the module level, to avoid recompilation.
- Do not materialize list in `stress_object` for loop. Use a generator expression.
The tests in `TestLargePartitionAlterSchema` are `test_large_partition_with_add_column`
and `test_large_partition_with_drop_column`.
These tests need to replicate the following conditions that led to a bug before a fix from around 5 years ago.
The scenario in which the problem could have happened has to involve:
- a large partition with many rows, large enough for preemption (every 0.5ms) to happen during the scan of the partition.
- appending writes to the partition (not overwrites)
- scans of the partition
- schema alter of that table. The issue is exposed only by adding or dropping a column, such that the added/dropped
column lands in the middle (in alphabetical order) of the old column set.
The way the test is set up is:
- fixed number of writes per populate call
- fixed number of reads
This has the following implications:
- if the machine executing the test is fast, all the writes are done before the 10 seconds sleep
- there are too many reads - most of them get executed after the test logic is done
This patch solves these issues in the following way:
- populate lazily generates write data, and stops when instructed by `stop_populating` event
- read, which is done sequentially, stops when instructed by `stop_reading` event
- number of max operations is increased significantly, but the operations are stopped 1 second
after node flush; this makes sure there are enough operations during the test, but also that
the test does not take unnecessary time
Test execution time has been reduced severalfold. On dev machine the time the tests take is
reduced from 110 seconds to 34 seconds.
scylla-dtest PR that removes migrated tests:
[schema_management_test.py: remove tests already ported to scylladb repo #6427](https://github.com/scylladb/scylla-dtest/pull/6427)
Fixes#26932
This is a migration of existing tests to this repository. No need for backport.
Closesscylladb/scylladb#27106
* github.com:scylladb/scylladb:
test: dtest: schema_management_test.py: speed up `TestLargePartitionAlterSchema` tests
test: dtest: schema_management_test.py: fix large partition add column test
test: dtest: schema_management_test.py: add `TestSchemaManagement.prepare`
test: dtest: schema_management_test.py: test enhancements
test: dtest: schema_management_test.py: make the tests work
test: dtest: migrate setup and tools from dtest
test: dtest: copy unmodified schema_management_test.py
replica: database: flush_all_tables log on completion
`_verify_tasks_processed_metrics()` is used to check that the correct
service level is used to process requests. It takes two service levels
as arguments and executes numerous requests. After that, the number
of tasks processed by one of the service levels is expected to rise
by at least the number of executed requests. In contrast,
the second service level is expected to process fewer tasks than
the number of requests.
Unfortunately, background noise may cause some tasks to be executed
on the service level that is not supposed to process requests.
This patch increases the number of executed requests to eliminate
the chance of noise causing test failures.
Additionally, this commit extends logging to make future investigation
easier.
Fixes: https://github.com/scylladb/scylladb/issues/27715
No backport, fix for test on master.
Closesscylladb/scylladb#27735
* github.com:scylladb/scylladb:
test: remove unused `get_processed_tasks_for_group`
test: increase num of requests in driver_service_level tests
After 39cec4a node join may fail with either "init - Startup failed"
notification or occasionally because it was banned, depending on timing.
The change updates the test to handle both cases.
Fixes: scylladb/scylladb#27697
No backport: This failure is only present in master.
Closesscylladb/scylladb#27768
When a node without the required feature attempts to join a Raft-based
cluster with the feature enabled, there is a race between the join
rejection response ("Feature check failed") and the ban notification
("received notification of being banned"). Depending on timing, either
message may appear in the joining node's log.
This starts to happen after 39cec4a (which introduced informing the
nodes about being banned).
Updated the test to accept both error messages as valid, making the test
robust against this race condition, which is more likely in debug mode
or under slow execution.
Fixes: scylladb/scylladb#27603
No backport: This failure is only present in master.
Closesscylladb/scylladb#27760
The test had a sporadic failure due to a broken promise exception.
The issue was in `test_pinger::ping()` which captured the promise by
move into the subscription lambda, causing the promise to be destroyed
when the lambda was destroyed during coroutine unwinding.
Simplify `test_pinger::ping()` by replacing manual abort_source/promise
logic with `seastar::sleep_abortable()`.
This removes the risk of promise lifetime/race issues and makes the code
simpler and more robust.
Fixes: scylladb/scylladb#27136
Backport to active branches: This fixes a CI test issue, so it is
beneficial to backport the fix. As this is a test-only fix, it is a low
risk change.
Closesscylladb/scylladb#27737
This test starts a 3-node cluster and creates a large blob file so that one
node reaches critical disk utilization, triggering write rejections on that
node. The test then writes data with CL=QUORUM and validates that the data:
- did not reach the critically utilized node
- did reach the remaining two nodes
By default, tables use speculative retries to determine when coordinators may
query additional replicas.
Since the validation uses CL=ONE, it is possible that an additional request
is sent to satisfy the consistency level. As a result:
- the first check may fail if the additional request is sent to a node that
already contains data, making it appear as if data reached the critically
utilized node
- the second check may fail if the additional request is sent to the critically
utilized node, making it appear as if data did not reach the healthy node
The patch fixes the flakiness by disabling the speculative retries.
Fixes https://github.com/scylladb/scylladb/issues/27212Closesscylladb/scylladb#27488
The current code which collects permit stats is out-of-date (by a few
years), as it only iterates through _permit_list. There are 4 additional
lists that permits can be part of now (all intrusive). Include all of
these in the stat collection.
As a bonus, also print the semaphore pointer in the printout, so the
user can hand-examine it, should they wish to.
Can be "forward", "backward" or "both" (default).
Allows traversing the fiber in just one direction. Useful when scylla
fiber fails to traverse through a task and the user has to locate the
next one in the chain manually. When resuming from this next item, the
user might want to skip the already seen part of the fiber, to save time
on the invokation.
Traversing through coroutines forward (finding task waiting on this
coroutine) is already supported. This patch adds support for traversing
through coroutines backwards (finding task waited on by coroutine).
Coroutines need special handling: the future<> object is most likely
allocated on the coroutine frame, so we have to search throgh that to
find it. When doing so the first two pointers on the frame have to be
skipped: these are pointers to .resume and .destroy respectively and
will halt the search algorithm if seen.
The tests in `TestLargePartitionAlterSchema` are `test_large_partition_with_add_column`
and `test_large_partition_with_drop_column`.
These tests need to replicate the following conditions that led to a bug before a fix from around 5 years ago.
The scenario in which the problem could have happened has to involve:
- a large partition with many rows, large enough for preemption (every 0.5ms) to happen during the scan of the partition.
- appending writes to the partition (not overwrites)
- scans of the partition
- schema alter of that table. The issue is exposed only by adding or dropping a column, such that the added/dropped
column lands in the middle (in alphabetical order) of the old column set.
The way the test is set up is:
- fixed number of writes per populate call
- fixed number of reads
This has the following implications:
- if the machine executing the test is fast, all the writes are done before the 10 seconds sleep
- there are too many reads - most of them get executed after the test logic is done
This patch solves these issues in the following way:
- populate lazily generates write data, and stops when instructed by `stop_populating` event
- read, which is done sequentially, stops when instructed by `stop_reading` event
- number of max operations is increased significantly, but the operations are stopped 1 second
after node flush; this makes sure there are enough operations during the test, but also that
the test does not take unnecessary time
Test execution time has been reduced severalfold. On dev machine the time the tests take is
reduced from 110 seconds to 34 seconds.
The patch also introduces a few small improvements:
- `cs_run` renamed to `run_stress` for clarity
- Stopped checking if cluster is `ScyllaCluster`, since it is the only one we use
- `case_map` removed from `test_alter_table_in_parallel_to_read_and_write`, used `mixed` param directly
- Added explanation comment on why we do `data[i].append(None)`
- Replaced `alter_table` inner function with its body, for simplicity
- Removed unnecessary `ck_rows` variable in `populate`
- Removed unnecessary `isinstance(self.cluster. ScyllaCluster)`
- Adjusted `ThreadPoolExecutor` size in several places where 5 workers are not needed
- Replaced functional programming style expressions for `new_versions` and `columns_list` with
comprehension/generator statement python style code, improving readability
Refs #26932
fix
Currently some things are not supported for colocated tables: it's not
possible to repair a colocated table, and due to this it's also not
possible to use the tombstone_gc=repair mode on a colocated table.
Extend the documentation to explain what colocated tables are and
document these restrictions.
Fixesscylladb/scylladb#27261Closesscylladb/scylladb#27516
After tests end, an extra check if performed, looking into node logs.
By default, it only searches for critical errors and scans for coredumps.
If the test has the fixture `check_nodes_for_errors`, it will search for all errors.
Both checks can be ignored by setting `ignore_cores_log_patterns` and `ignore_log_patterns`.
If any of the above are found, the test will fail with an error.
`large_partition_with_add_column_test` and `large_partition_with_drop_column_test`
were added on August 17th, 2020 in scylladb/scylla-dtest#1569.
Only `large_partition_with_drop_column_test` was migrated to pytest, and renamed
to `test_large_partition_with_drop_column` on March 31st, 2021 in scylladb/scylla-dtest#2051.
Since then this test has not been running.
This patch fixes it - the test is updated and renamed and the testing environment
now properly picks it up.
Refs #26932
Extract repeated cluster initialization code in `TestSchemaManagement`
into a separate `prepare` method. It holds all the common code for
cluster preparation, with just the necessary parameters.
Refs #26932
Extract regex compilation from the stress functions to the module level,
to avoid unnecessary regex compilation repetition.
Add descriptions to the stress functions.
Do not materialize list in `stress_object` for loop. Use a generator expression.
Make `_set_stress_val` an object method.
Refs #26932
Remove unused function markers.
Add wait_other_notice=True to cluster start method in
TestSchemaHistory.prepare function to make the test stable.
Enable the test in suite.yaml for dev and debug modes.
Fixes#26932
Copy schema_management_test.py from scylla-dtest to
test/cluster/dtest/schema_management_test.py.
Add license header.
Disable it for debug, dev, and release mode.
Refs #26932
Make the removenode operation go through the `left_token_ring` state, similar to decommission. This ensures that when removenode completes, all nodes in the cluster are aware of the topology change through a global token metadata barrier.
Previously, removenode would skip the `left_token_ring` state and go directly from `write_both_read_new` to `left` state. This meant that when the operation completed, some nodes might not yet know about the topology change, potentially causing issues with subsequent data plane requests.
Key changes:
- Both decommission and removenode now transition to `left_token_ring` state in the `write_both_read_new` handler
- In `left_token_ring` state, only decommissioning nodes receive the shutdown RPC (removed nodes are already dead)
- Updated documentation to reflect that both operations use this state
This change improves consistency guarantees for removenode operations by ensuring cluster-wide awareness before completion.
The change is protected by "REMOVENODE_WITH_LEFT_TOKEN_RING" feature flag to also support mixed clusters during e.g. upgrade.
Fixes: scylladb/scylladb#25530
No backport: This fixes and issue found in tests. It can theoretically happen in production too, but wasn't reported in any customer issue, so a backport is not needed.
Closesscylladb/scylladb#26931
* https://github.com/scylladb/scylladb:
topology: make removenode use left_token_ring state for global barrier
topology: allow removing nodes not having tokens
features: add feature flag for removenode via left token ring
The function `get_processed_tasks_for_group` was defined twice in
`test_raft_service_levels.py`. This change removes the unused
definition to avoid confusion and clean up the code.
`_verify_tasks_processed_metrics()` is used to check that the correct
service level is used to process requests. It takes two service levels
as arguments and executes numerous requests. After that, the number
of tasks processed by one of the service levels is expected to rise
by at least the number of executed requests. In contrast,
the second service level is expected to process fewer tasks than
the number of requests.
Unfortunately, background noise may cause some tasks to be executed
on the service level that is not supposed to process requests.
This patch increases the number of executed requests to eliminate
the chance of noise causing test failures.
Additionally, this commit extends logging to make future investigation
easier.
Fixes: scylladb/scylladb#27715
When learning a schema that has a linked cdc schema, we need to learn
also the cdc schema, and at the end the schema should point to the
learned cdc schema.
This is needed because the linked cdc schema is used for generating cdc
mutations, and when we process the mutations later it is assumed in some
places that the mutation's schema has a schema registry entry.
We fix a scenario where we could end up with a schema that points to a
cdc schema that doesn't have a schema registry entry. This could happen
for example if the schema is loaded before it is learned, so when we
learn it we see that it already has an entry. In that case, we need to
set the cdc schema to the learned cdc schema as well, because it could
have been loaded previously with a cdc schema that was not learned.
Fixesscylladb/scylladb#27610Closesscylladb/scylladb#27704
The test generates a staging sstable on a node and verifies whether
the view is correctly populated.
However view updates generated by a staging sstable
(`view_update_generator::generate_and_propagate_view_updates()`) aren't
awaited by sstable consumer.
It's possible that the view building coordinator may see the task as finished
(so the staging sstable was processed) but not all view updates were
writted yet.
This patch fixes the flakiness by waiting until
`scylla_database_view_update_backlog` drops down to 0 on all shards.
Fixesscylladb/scylladb#26683Closesscylladb/scylladb#27389
When calling a migration notification from the context of a notification
callback, this could lead to a deadlock with unregistering a listener:
A: the parent notification is called. it calls thread_for_each, where it
acquires a read lock on the vector of listeners, and calls the
callback function for each listener while holding the lock.
B: a listener is unregistered. it calls `remove` and tries to acquire a
write lock on the vector of listeners. it waits because the lock is
held.
A: the callback function calls another notification and calls
thread_for_each which tries to acquire the read lock again. but it
waits since there is a waiter.
Currently we have such concrete scenario when creating a table, where
the callback of `before_create_column_family` in the tablet allocator
calls `before_allocate_tablet_map`, and this could deadlock with node
shutdown where we unregister listeners.
Fix this by not acquiring the read lock again in the nested
notification. There is no need because the read lock is already held by
the parent notification while the child notification is running. We add
a function `thread_for_each_nested` that is similar to `thread_for_each`
except it assumes the read lock is already held and doesn't acquire it,
and it should be used for nested notifications instead of
`thread_for_each`.
Fixesscylladb/scylladb#27364Closesscylladb/scylladb#27637
Make the removenode operation go through the `left_token_ring` state,
similar to decommission. This ensures that when removenode completes,
all nodes in the cluster are aware of the topology change through a
global token metadata barrier.
Previously, removenode would skip the `left_token_ring` state and go
directly from `write_both_read_new` to `left` state. This meant that
when the operation completed, some nodes might not yet know about the
topology change, potentially causing issues with subsequent data plane
requests.
Key changes:
- Both decommission and removenode now transition to `left_token_ring`
state in the `write_both_read_new` handler
- In `left_token_ring` state, only decommissioning nodes receive the
shutdown RPC (removed nodes may already be dead)
- Updated documentation to reflect that both operations use this state
This change improves consistency guarantees for removenode operations
by ensuring cluster-wide awareness before completion.
Fixes: scylladb/scylladb#25530
For the changes to go through the left_token_ring state when
REMOVENODE_WITH_LEFT_TOKEN_RING feature is enabled, we need to allow
removing nodes to not have any tokens (similarly to decommissioning
nodes, which use the same sequence of states).
This means the tests also need to change to allow for this new behavior
- it can temporarily happen that a removing node has no tokens but is
still part of Raft group 0 (so there may be a temporary mismatch between
the token ring and group 0 membership).
Therefore, the `check_token_ring_and_group0_consistency` function is
replaced by `wait_for_token_ring_and_group0_consistency`, which waits
up to 30 seconds for consistency to be reached.
To improve the behavior of the removenode operation, we want to issue
a global topology barrier after the removenode has been applied.
However, this requires changing the topology state machine to add a new
state (left_token_ring) to the removenode flow, which is not supported
by older nodes.
To allow rolling upgrades, we add a feature flag
REMOVENODE_WITH_LEFT_TOKEN_RING that controls whether the new removenode
flow is used.
Test that the new configuration options work and that we can
connect to them. Use direct connections with an inline implementation
of the proxy protocol and the CQL native protocol, since we want
to maintain direct control over the source port number (for shard-aware
ports). Also test we land on the expected shard.
We have four native transport ports: two for plain/TLS, and two
more for shard-aware (plain/TLS as well). Add four more that expect
the proxy protocol v2 header. This allows nodes behind a reverse
proxy to record the correct source address and port in system.clients,
and the shard-aware port to see the correct source port selection
made my the client.
Currently the formatter converts it to json and then tries to emit into
the output context with the "...{{}}" format string. The intent was to
have the "...{<json text>}" output. However, the double curly brace in
format string means "print a curly brace", so the output of the above
formatting is "...{}", literally.
Fix by keeping a single curly brace. The "<json text>" thing will have
its own surrounding curly braces.
Fixes#27718
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#27687
This action is triggered when a new milestone is created in scylladb.git
It will call the main logic which will create the same milestone as Jira
releases in the SCYLLADB and CUSTOMER Jira projects.
Fixes: PM-100
Closesscylladb/scylladb#27717
If a keyspace has a numeric replication factor in a DC and rf < #racks,
then the replicas of tablets in this keyspace can be distributed among
all racks in the DC (different for each tablet). With rack list, we need all
tablet replicas to be placed on the same racks. Hence, the conversion
requires tablet co-location.
After this series, the conversion can be done using ALTER KEYSPACE
statement. The statement that does this conversion in any DC is not
allowed to change a rf in any DC. So, if we have dc1 and dc2 with 3 racks
each and a keyspace ks then with a single ALTER KEYSPACE we can do:
- {dc1 : 2} -> {dc1 : [r1, r2]};
- {dc1 : 2, dc2: 2} -> {dc1 : [r1, r2], dc2: [r2,r3]};
- {dc1 : 2, dc2: 2} -> {dc1 : [r1, r2], dc2: 2}
- {dc1 : 2} -> {dc1 : 2, dc2 : [r1]}
But we cannot do:
- {dc1 : 2} -> {dc1 : [r1, r2, r3]};
- {dc1 : 1, dc2 : [r1, r2] → dc1: [r1], dc2: [r1].
In order to do the co-locations rf change request is paused. Tablet
load balancer examines the paused rf change requests and schedules
necessary tablet migrations. During the process of co-location, no other
cross-rack migration is allowed.
Load balancer checks whether any paused rf change request is
ready to be resumed. If so, it puts the request back to global topology
request queue.
While an rf change request for a keyspace is running, any other rf change
of this keyspace will fail.
Fixes: #26398.
New feature, no backport
Closesscylladb/scylladb#27279
* github.com:scylladb/scylladb:
test: add est_rack_list_conversion_with_two_replicas_in_rack
test: test creating tablet_rack_list_colocation_plan
test: add test_numeric_rf_to_rack_list_conversion test
tasks: service: add global_topology_request_virtual_task
cql3: statements: allow altering from numeric rf to rack list
service: topology_coordinator: pause keyspace_rf_change request
service: implement make_rack_list_colocation_plan
service: add tablet_rack_list_colocation_plan
cql3: reject concurrent alter of the same keyspace
test: check paused rf change requests persistence
db: service: add paused_rf_change_requests to system.topology
service: pass topology and system_keyspace to load_balancer ctor
service: tablet_allocator: extract load updates
service: tablet_allocator: extract ensure_node
tasks, system_keyspace: Introduce get_topology_request_entry_opt()
node_ops: Drop get_pending_ids()
node_ops: Drop redundant get_status_helper()
runner.py defines a command-line option `--extra-scylla-cmdline-options`
with the default type=str. However, the function `merge_cmdline_options`,
which consumes this value to merge command-line options from multiple
sources, expects a list of strings.
This mismatch results in the following exception:
```
raise ValueError(f'invalid argument name {name}, all args {args}')
ValueError: invalid argument name o, all args --logger-log-level repair=debug --default-log-level=error
```
when a test is run with pytest using:
`--extra-scylla-cmdline-options='--logger-log-level repair=debug --default-log-level=error'`
Fix this by handling the option consistently and calling `.split()`.
Also change the default value from an empty list to an empty string
to avoid confusion both in runner.py and test.py.
Closesscylladb/scylladb#27523
Ref https://github.com/scylladb/seastar/pull/3163
We can optimize the stat calls we use here by
using open_directory to open the snapshot,
base, and staging directory once, and using statat
calls for the relative name instead of the full
blown file_stat that needs to traverse the whole
path prefix for every call (the dirents are likely
to be cached, but still why waste cpu cycles on that
over and over again).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If the number_of_linkes equals 1, we can be sure that
the file exists only in the snapshot directory so there is no need
to look it up in the data directory.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that we're using a simple loop in the coroutine just continue
the loop for files we want to ignore.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is more efficient to use the coroutine generator
to list the directory.
Brewing changes in seastar would make the generator buffered
as well as adding an extended generation that would
return the file stat data for each entry, that would become
useful in the next patch that optimizes the algorithm by
considering the entry's link count.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When translating Cassandra's unit tests, in a couple of places I accidentally used the same name for two tests, resulting in the first of each pair to never running.
Let's fix the name of the second of the each pair to be the real name it had in the original Cassandra test.
Closesscylladb/scylladb#27644
* github.com:scylladb/scylladb:
test/cqlpy: rename test with duplicate name
test/cqlpy: rename test with duplicate name
Fixing something that never bothered anyone but our automated "code quality" tool: there's an unnecessary call to "pass" in one of our tests. Just remove it.
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27645
This test was observed to fail multiple times recently in promotion,
because there were successful reads. The failure only reproduces on
arm64, it doesn't reproduce on x86.
The suspected reason is that the data set is too close to the edge,
where all reads fail due to too high memory consumption. Reduce the
number of sstables used by this test to 54 (from 64).
Fixes: #27248Closesscylladb/scylladb#27650
Copilot found in test/alternator a bunch of places where we unnecessarily assign a variable that we don't use, or had a duplicated statement which doesn't do anything. This patch fixes all of them. AI still doesn't know how to prepare a patch that looks anything close to reasonable, so I did this part manually, and also carefully investigated each and every change (this took **a lot** of human time).
These patches don't change anything in the functionality of any of the tests. It's all cosmetic.
Closesscylladb/scylladb#27655
* github.com:scylladb/scylladb:
test/alternator: remove unnecessary duplicate statement
test/alternator: remove unused variable assignments
We currently have races, like between moving an sstable from staging
using change_state, or when taking a snapshot, to e.g.
rewrite_statistics that replaces one of the sstable component files
when called, for example, from update_repaired_at by incremental repair.
Use a semaphore as a mutex to serialize those functions.
Note that there is no need for rwlock since the operations
are rare and read-only operations like snapshot don't
need to run in parallel.
Fixes#25919
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Previously, the scheduling_group column was updated during the switch_tenant function, which meant the update occurred only after the tenant change operation completed—updating rows one by one. With this change, the scheduling_group column is now updated before the switch_tenant logic runs, ensuring that the table reflects the correct scheduling groups for all rows as early as possible.
fixes: #26060fixes: #27295
backport: not required
this is a minor bug fix. Internal logic worked but the user couldnt see the change in the table if they would read the system.clients table
Closesscylladb/scylladb#26404
* github.com:scylladb/scylladb:
test: cqlpy: Remove test_switch_tenants and add test in cluster testing. The test needs to run twice, in two separate Scylla runs, using two different modes: gossip and raft. The cluster framework supports this setup, while cqlpy only runs against Scylla instances in raft mode. Therefore, the test was moved from cqlpy to the cluster-based framework. This commit both adds the test in cluster/ and removes the old version in cqlpy/.
server: Refactor update_control_connection_scheduling_group functionality This refactoring moves the logic that retrieves the scheduling group for driver_service_level_name out of switch_tenant. This change is possible because the scheduling group for the driver is retrieved from a map (LOOKUP). The lookup function is fully synchronized, non-coroutine, and returns immediately. For that reason, it’s better to perform this lookup outside of the switch_tenant function.
server: Refactor scheduling group update functionality. This change generalizes the scheduling-group update functionality and removes some copy-paste code, improving overall readability and maintainability. To achieve this, capturing lambdas were introduced. As a result, self-deducing this was added to those lambdas to avoid coroutine-related issues (“coroutine fiasco”).
server: Fix switch_tenant problem, When running on a V2 server, service-level data comes from service level cache. Because of this, we can use synchronized function to get the schedualing group. Since we are transitioning to a Raft-based architecture where all servers will be V2, we can safely implement this fix specifically for that case. This change adds get_cached_user_scheduling_group functionality and moves its usage out of switch_tenant function in update_scheduling_group_v2 usage.
server: Add update_service_level_scheduling_group_v1 functions to create placehholder for functionality that will introduce v2 implementation. The new functionality will allow usage of service level cache
Add a service::topo::global_topology_request_virtual_task, which
covers the replication factor changes.
Currently, the global_topology_request_virtual_task can be aborted
only if it is paused.
The progress of the rf change isn't counted.
Allow altering from numeric replication factor to rack list. Ensure
that a single ALTER KEYSPACE statement doesn't try to both convert
to rack list and change rf.
To do the conversion from numeric rf to rack list, we need to co-locate
tablets of the keyspace, so that all of them have replicas on the same racks.
Pause the keyspace_rf_change global topology request, so that the co-location
could be done before the ALTER KEYSPACE changes are applied.
The pause is needed if in any dc rf changes from numeric to rack list
and the co-location is necessary. In this case we don't finish the request.
Instead, we add the request to the paused request vector. No migrations are
started.
The make_rack_list_colocation_plan consists of two phases.
In the first phase (realized with find_required_rack_list_colocations),
we find the pairs of (replica to be co-located, destination dc and rack).
We skip the pairs related to the tablets that are in transition or for
which the load balancer migration is already planned. We group the pairs
by destination dc and rack. Thanks to that in the second phase we can
calculate the least loaded nodes and shards only once for each rack.
In the second phase, we calculate the load of the nodes in a cluster
based on current transition and previously scheduled migrations.
We utilize the map created in the first phase and choose the least
loaded targets in each rack. We skip the tablets for which the
co-location was already scheduled.
find_required_rack_list_colocations isn't a method of load_balancer,
because in the following changes it is going to be reused by topology
coordinator to determine whether the rf change should be paused.
Add tablet_rack_list_colocation_plan. Keep it in migration_plan.
The plan includes a request that is ready to resume. There can be
more than one such request at the time, but we consider them one
by one for clarity of code. Rack list co-locations will be kept together
with normal load balancer migrations.
Consider normal load balancer migrations before rack list co-locations.
During rack list co-location, allow load balancer migrations to happen
only within a single rack. Do not create the merge co-location plan if
there is ongoing rack list co-location (if there are any rf changes paused).
Generate rf change resume based on the plan. Add _request_to_resume back
to global requests queue.
make_rack_list_colocation_plan will be implemented in the following change.
Reject ALTER KEYSPACE request if there is unfinished (queued, pending,
or paused) alter request of the same keyspace.
This is required as in the following changes, global request queue
will contain rf change requests meant to be resumed.
In the following changes, we allow to alter from numeric rf to rack list.
Before the alter, two tablets of the same keyspace can have replicas
on different racks. To switch to rack list, we need to co-locate
the replicas. It will be achieved by pausing the keyspace_rf_change
and scheduling migrations.
We need to persist the ids of requests that are paused. A new column -
paused_rf_change_requests is added to system.topology table.
In this commit no data is kept in the new column.
Pass a pointer to service::topology and db::system_keyspace to load
balancer. It will be used in the following patches to create
rack_list_colocation plan.
Every pending request should also have an entry in
system.topology_requests so it's redundant.
And problematic, because we cannot build a full request entry from
just an id alone, so if we would return those requests, they would
have blank information, and logic which needs more information would
not work.
We're extending the logic of DESCRIBE INDEX to include properties of the
underlying materialized view. Tests are provided to ensure the
implementation works as intended.
This is a temporary solution as handling this property may require
a bit more attention or at least a bit more focus. For now, let's
forbid using it so it's clear it won't get applied. A simple test
is provided to cover it.
We document the restriction.
After the previous patch that extended the grammar and provided
basic functionalities to accommodate properties of materialized views
in indexes, this commit takes another step and actually applies them
to the underlying view when it's being created.
We're providing validation tests for each property, with the single
exception of CLUSTERING ORDER BY. That one will be handled separately
in an upcoming commit.
We also update the user documentation.
We're allowing CREATE INDEX to accept the same set of properties as
materialized views do. Our goal is to give the user an ability to
configure the underlying materialized view of an index directly,
when creating it.
This commit doesn't do anything except for extending the grammar
and passing the right pieces of information to the right destinations.
There's no validation and the options have no effect yet. That will
be done in the following patch.
The type represents a mix of both index-specific and view properties.
Since we cannot easily distinguish which properties belong to which
entity, let's use this abstraction and filter them from the C++ level.
This is a prerequisite for extending the capabilities of CREATE INDEX
by allowing it to configuring the underlying materialized view.
We rename the type `index_prop_defs` to `index_specific_prop_defs`.
The rationale for the change is to distinguish between properties
related directly to a index and properties related to the underlying
view (if applicable).
The type `index_prop_defs` will be re-introduced in an upcomming commit
where it'll encompass both index-related and view-related properties.
This is a prerequisite for it.
We're introducing a new type wrapping properties that can be used with
materialized views. Doing that, we achieve the following things:
(1) We can keep validation logic in one place.
(2) We differentiate between properties of a regular table and
properties of a materialized view.
(3) It provides better modularization and allows for reusing the code.
(4) It gets rid of inconsistencies in the existing code, e.g.
CREATE MV using one type for properties, while ALTER MV another.
The actual end goal of this commit is to be able to reuse at least part
of the validation logic of MVs in CREATE INDEX and, when it gets added,
ALTER INDEX: we want to endow those statements with an ability to modify
the underlying materialized view without having to modify it directly.
This patch does NOT implement the whole validation logic yet. It will be
done in a following commit.
Refs scylladb/scylladb#16454
`system.client_routes` is a system table that sets the target address and ports for each `host_id`, for one or more connection (e.g., Private Link) represented by `connection_id`. Cloud will write the table via REST, and drivers will read it via CQL to override values obtained from `system.local` and `system.peers`.
This patch series contains:
- Introduction of `CLIENT_ROUTES` feature flag.
- Implementation of raft-based `system.client_routes` table
- Implementation of `v2/client-routes` POST/DELETE/GET endpoints
- Implementation of new `CLIENT_ROUTES_CHANGE` event that is sent to drivers when `system.client_routes` is changed
- New tests that verifies the aforementioned features
Ref: scylladb/scylla-enterprise#5699
For now, no automatic backport. However, the changes are planned to be release on `2025.4` either as a backport or a private build.
Closesscylladb/scylladb#27323
* https://github.com/scylladb/scylladb:
docs: describe CLIENT_ROUTES_CHANGE extension
test: add test for CLIENT_ROUTES event
service: transport: add CLIENT_ROUTES_CHANGE event
test: add cluster tests for client routes
test: add API tests for client_routes endpoints
test: add `timeout` parameter to `delete` in RESTClient
test: allow json_body in send
api: implement client_routes endpoints
api: add client_routes.json
service: main: add client_routes_service
db: add system.client_routes table
gms: add CLIENT_ROUTES feature
This reverts commit 866c96f536, reversing
changes made to 367633270a.
This change caused all longevities to fail, with a crash in parsing
scylla-metadata. The investigation is still ongoing, with no quick fix
in sight yet.
Fixes: #27496Closesscylladb/scylladb#27518
To fix this issue, the std_list_iterator class defined within
std_list.__iter__ should implement the full iterator protocol by defining
an __iter__() method that returns self. This change ensures any instance
of std_list_iterator can be used as an iterator in Python for loops and
other iteration contexts, as required. The fix is to add a small method
definition inside the std_list_iterator class, ideally after the __init__
or in a logical place with the other dunder methods.
Only the code inside the std_list class's __iter__ function (lines around
the definition of the inner class and its methods) needs to be edited.
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27642
Fixes#26744
If a segment to replay is broken such that the main header is not zero, but still broken, we throw header_checksum_error. This was not handled in replayer, which grouped this into the "user error/fundamental problem" category.
However, assuming we allow for "real" disk corruption, this should really be treated same as data corruption, i.e. reported data loss, not failure to start up.
The `test_one_big_mutation_corrupted_on_startup` test accidentally sometimes provoked this issue, by doing random file wrecking, which on rare occasions provoked this, and thus failed test due to scylla not starting up, instead of losing data as expected.
Closesscylladb/scylladb#27556
* github.com:scylladb/scylladb:
test::cluster::dtest::tools::files: Remove file
commitlog_replay: Handle fully corrupt files same as partial corruption.
test::pylib::suite::base: Split options.name test specifier only once
Fixes#17384
Bypasses enabling off-strategy storage/placement for repair streams
when table repaired is using tablets. Instead, the resulting sstable(s)
will be placed in the "normal" set of sstables, and bypass a post-repair
off-strategy compaction.
v2:
Bypass off-strat for whatever reason iff dest is tablets.
Closesscylladb/scylladb#27500
The test fails in CI sometimes, and we want a coredump from a failure
to debug that. We made the test send a `signal SIGSEGV` to Scylla
on failure, but apparently that doesn't work as intended on our CI
hosts. (The CI runner seemingly can't find any coredump afterwards).
We can use gdb's `gcore` command to produce a coredump in a more
predictable way.
Refs scylladb/scylladb#22501Closesscylladb/scylladb#27498
This series adds an xfailing reproducers for two issue: #8070 and #27037:
27037 is about where even with alternator_streams_increased_compatibility set to true, if an attribute
is set to the same value it had but using a different JSON representation - a Alternator Streams
event is unduly produced.
8070 is about the ability to write malformed values into the database and then fail during read - instead of failing, as expected, during the write. This issue was known for years, but we never really had a reproducer for it - it's not possible to reproduce it using clean boto3 code and we need to build a request manually.
The first two patches are two small cleanups (including fixes#27372) that I did while preparing the real tests - which are in the final two patches.
Closesscylladb/scylladb#27376
* github.com:scylladb/scylladb:
test/alternator: add reproducer for bug with storing invalid values
test/alternator: reproducer for issue 27375
utils/rjson: fix error messages from rjson::parse()
test/alternator: extract get_signed_request() to util.py
Introduce the CLIENT_ROUTES_CHANGE event to let drivers refresh
connections when `system.client_routes` is modified. Some deployments
(e.g., Private Link) require specific address/port mappings that can
change without topology changes and drivers need to adapt promptly
to avoid connectivity issues.
This new EVENT type carries a change indicator plus the affected
`connection_ids` and `host_ids`. The only change value is
`UPDATE_NODES`, meaning one or more client routes were inserted,
updated, or deleted.
Drivers subscribe using the existing events mechanism, so no additional
`cql_protocol_extension` key is required.
Ref: scylladb/scylla-enterprise#5699
Copilot detected a few cases of cqlpy tests setting a variable which
they don't use. In all the cases in this patch, we can just remove
the variable. Although the AI found all these unused variables, I
verified each case carefully before changing it in this patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
copilot noticed that test/alternator/test_scan.py had a duplicate
statement (call to full_scan()). It doesn't break the test, but also
adds nothing but confusion - so let's just remove it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
copilot noticed in that in in many of Alternator tests, we have some
unnecessary assignments. For example, in a few places, we use the idiom:
with pytest.raises(...):
ret = ...
The "ret=" part is unnecessary, as this test expects the statement to
fail (hence the raises()), and ret is never assigned. The assignment
was only there because we copied this statement from another place in
the test, which does expect the statement to pass and wants to validate
the returned value.
So we should just drop the "ret=" from these tests.
Another common occurance is that we used the idiom
response = table.do_something()
Without checking the response and no intention to check it (either we
know it will work, or we just want to check it doesn't throw). So we
can drop the "response=" here too.
All of the unused variables in this patch were discovered by Copilot,
but I reviewed each of them carefully myself and prepared this patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
It is traditional to use a unique (or random) partition key in cqlpy
tests, to allow multiple tests to share the same table and make the test
suite a bit faster. One of the tests, test_multi_column_relation_desc,
set up a unique key "k", but then forgot to use it and used partition
key 0 instead. Fix the test to use this k.
This problem was spotted by Copilot, who saw the unused variable k.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
- Workload: N workers perform CAS updates
- Update counter table each time CAS was successful
- Enable balancing and increase min_tablet_count to force split,
and lower min_tablet_count to merge.
- Run tablets migrations loop
- Stop workload and verify data consistency
Fix multiple cases where the captured `std::exception_ptr` has been re-thrown via simple `throw eptr;`, which results in losing the original exception type and details.
Resolved at various places found by clang-tidy:
1. db::schema_applier
When applying schema changes, the previous implementation attempted to handle exceptions by catching and rethrowing them, but did so incorrectly: using `throw ex` with a `std::exception_ptr` loses the original exception type and details.
However, in this case, explicit exception handling is unnecessary. The only reason for catching was to ensure `ap.destroy()` is called before propagating the exception. This can be more cleanly and safely achieved using Seastar's `.finally()` continuation, which guarantees cleanup regardless of success or failure.
2. directories
The `std::exception_ptr()` has been captured for logging and then again re-thrown incorrectly via `throw ex;`. We could use `std::rethrow_exception()` here instead, but it seems to be simpler to just use regular `throw;` to rethrow the original exception, and only use the `std::current_exception()` for logging (which is a pattern used in other places as well).
3. storage_service
Here the exception has been re-thrown incorrectly in a coroutine. There it is best to use the `co_await coroutine::return_exception_ptr` to propagate exception more efficiently in a coroutine-friendly manner.
Fixes: SCYLLADB-94
Refs: scylladb/scylladb#27501
No backport: This fixes an error logging issue, that isn't a production problem by itself (only found in test), therefore not backporting to older branches.
Closesscylladb/scylladb#27613
* https://github.com/scylladb/scylladb:
db: schema_applier: improve exception-safe cleanup
directories: fix exception rethrowing
storage_service: use coroutine-friendly exception propagation in join_node_response_handler
The cqlpy test test_materialized_view.py::test_view_in_system_tables
checks that the system table "system.built_views" can inform us that
a view has been built. This test was flaky, starting to fail quite
often recently, and this patch fixes the problem in the test.
For historic reasons this test began by calling a utility function
wait_for_view_built() - which uses a different system table,
system_distributed.view_build_status, to wait until the view was built.
The test then immediately tries to verify that also system.built_views
lists this view.
But there is no real reason why we could assume - or want to assume -
that these two tables are updated in this order, or how much time
passed between the two tables being changed. The authors of this
test already acknowledged there is a problem - they included a hack
purporting to be a "read barrier" that claimed to solve this exact
problem - but it seems it doesn't, or at least no longer does after
recent changes to the view builder's implementation.
The solution is simple - just remove the call to wait_for_view_built()
and the "hack" after it. We should just wait in a loop (until a timeout)
for the system table that we really wanted to check - system.built_views.
It's as simple as that. No need for any other assumptions or hacks.
Fixes#27296
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27626
When translating Cassandra's test validation/operations/CreateTest.java
I accidentally used the same name for two tests, resulting in the first
of them never being run.
Let's fix the name of the second of the two to be the real name it had
in the original Cassandra test.
After this patch pytest reports 16 tests in this file, instead of 15
before this patch. The previously-ignored test was correct, and it
now passes in both Scylla and Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Currently, we first print the json contents into a stringstream buffer
and then we write it as a whole to the manifest.json file output stream.
This is not scalable and may cause large allocation for large enough number
of files.
Fixes#24216
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#27542
The end goal we have in mind in this commit is to extract the validation
logic of the options used for creating and altering an MV to a separate
place and be able to call from different places in the code.
It will be useful when extending the capabilities of the CREATE INDEX
statement.
In this patch, we move the part of validation responsible for checking
the ID option to keep it close to the other parts of validation of the
options in their "raw" form.
One of the upcoming commits will lead to a cyclic dependency
of headers because `schema.hh` includes `index_prop_defs.hh`.
To prevent that, we remove the include and replace it with
a manually added alias.
This is not a perfect solution, but doing it properly would
require comprehensive changes. We can do that in a separate
task.
Add the JSON definitions for the POST, GET, and DELETE endpoints used
to modify client routes. These endpoints are intended for Cloud to
update the `system.client_routes` table.
The API is implemented in `/v2/` because the endpoints process arrays
of objects. Handling of such structures was improved between
Swagger 1.2 and 2.0 versions. There are already similar
`get_metrics_config` and `set_metrics_config` endpoints that operate
on similar structures and they are also in /v2/.
The introduced JSON files start with `, ` but it's intended because
the files are concatenated to the existing (metrics) JSON files,
and they need to represent valid JSON after the concatenation.
Ref: scylladb/scylla-enterprise#5699
Introduce `system.client_routes`, a system table that sets the target
address and ports for each `host_id`, for one or more connections
(e.g., Private Link) represented by `connection_id`. Cloud will write
the table via REST, and drivers will read it via CQL to override
values obtained from `system.local` and `system.peers`.
The table is Raft-managed to provide consistent replication across
nodes.
Schema overview: each row is identified by `(connection_id, host_id)`
and describes where clients should connect: `address` and one or more of
`port`, `tls_port`, `alternator_port`, `alternator_https_port`.
`host_id` is a UUID (just as in ScyllaDB) but `connection_id` can be
any string to accept formats of all cloud providers. `address` is
also a regular string because it can represent either an IP address or
a domain. Ports are optional in the sense that at least one of
the four must be provided.
Ref: scylladb/scylla-enterprise#5699
The feature will be used later in this patch series:
- To avoid unnecessary operations when the feature is not enabled
- To guard new API endpoints from being used before the cluster is
ready to use them.
- To implement update tests (by disabling/enabling the feature)
Ref: scylladb/scylla-enterprise#5699
The batchlog table contains an entry for each logged batch that is processed by the local node as coordinator. These entries are typically very short lived, they are inserted when the batch is processed and deleted immediately after the batch is successfully applied.
When a table has `tombstone_gc = {'mode': 'repair'}` enabled, every repair has to flush all hints and batchlogs, so that we can be certain that there is no live data in any of these, older than the last repair. Since batches can contain member queries from any number of tables, the whole batchlog has to be flushed, even if repair-mode tombstone-gc is enabled for a single table.
Flushing the batchlog table happens by doing a batchlog replay. This involves reading the entire content of this table, and attempting to replay+delete any live entries (that are old enough to be replayed). Under normal operating circumstances, 99%+ of the content of the batchlog table is partition tombstones. Because of this, scanning the content of this table has to process thousands to millions of tombstones. This was observed to require up to 20 minutes to finish, causing repairs to slow down to a crawl, as the batchlog-flush has to be repeated at the end of the repair of each token-range.
When trying to address this problem, the first idea was that we should expedite the garbage-collection of these accumulated tombstones. This experiment failed, see https://github.com/scylladb/scylladb/pull/23752. The commitlog proved to be an impossible to bypass barrier, preventing quick garbage-collection of tombstones. So long as a single commit-log segment is alive, holding content from the batchlog table, all tombstones written after are blocked from GC.
The second approach, represented by this PR, is to not rely in tombstone GC to reduce the tombstone amount. Instead restructure the table such that a single higher-order tombstone can be used to shadow and allow for the eviction of the myriads of individual batchlog entry tombstones. This is realized by reorganizing the batchlog table such that individual batches are rows, not partitions.
This new schema is introduced by the new `system.batchlog_v2` table, introduced by this PR:
CREATE TABLE system.batchlog_v2 (
version int,
stage int,
shard int,
written_at timestamp,
id uuid,
data blob,
PRIMARY KEY ((version, stage, shard), written_at, id));
The new schema organization has the following goals:
1) Make post-replay batchlog cleanup possible with a simple range-tombstone. This allows dropping the individual dead batchlog entries, as they are shadowed by a higher level tombstone. This enables dropping tombstones without tombstone GC.
2) To make the above possible, introduce the stage key component: batchlog entries that fail the first replay attempt, are moved to the failed_replay stage, so the initial stage can be cleaned up safely.
3) Spread out the data among Scylla shards, via the batchlog shard column.
4) Make batchlog entries ordered by the batchlog create time (id). This allows for selecting batchlogs to replay, without post-filtering of batchlogs that are too young to be replayed.
Fixes: https://github.com/scylladb/scylladb/issues/23358
This is an improvement, normally not a backport-candidate. We might override this and backport to allow wider use of `tombstone_gc: {'mode': 'repair'}`.
Closesscylladb/scylladb#26671
* github.com:scylladb/scylladb:
db/config: change batchlog_replay_cleanup_after_replays default to 1
test/boost/batchlog_manager_test: add test for batchlog cleanup
replica/mutation_dump: always set position weight for clustering positions
service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/
test/lib: introduce error_injection.hh
utils/error_injection: add debug log to disable() and disable_all()
test/lib/cql_test_env: forward config to batchlog
test/lib/cql_test_env: add batch type to execute_batch()
test/lib/cql_assertions: add with_size(predicate) overload
test/lib/cql_assertions: add source location to fail messages
test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row()
test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check
test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload
db/batchlog_manager: config: s/write_timeout/reply_timeot/
db,service: switch to system.batchlog_v2
db/system_keyspace: introduce system.batchlog_v2
service,db: extract generation of batchlog delete mutation
service,db: extract get_batchlog_mutation_for() from storage-proxy
db/batchlog_manager: only consider propagation delay with tombstone-gc=repair
db/batchlog_manager: don't drop entire batch if one mutations' table was dropped
data_dictionary: table: add get_truncation_time()
db/batchlog_manager: batch(): replace map_reduce() with simple loop
db/batchlog_manager: finish coroutinizing replay_all_failed_batches
db/batchlog_manager: improve replayAllFailedBatches logs
When translating Cassandra's test validation/operations/DeleteTest.java
I accidentally used the same name for two tests, resulting in the first
of them never being run.
Let's fix the name of the second of the two to be the real name it had
in the original Cassandra test.
After this patch pytest reports 52 tests in this file, instead of 51
before this patch. The previously-ignored test was correct, and it
now passes in both Scylla and Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When creating an alternator table with tablets, if it has an index, LSI
or GSI, require the config option rf_rack_valid_keyspaces to be enabled.
The option is required for materialized views in tablets keyspaces to
function properly and avoid consistency issues that could happen due to
cross-rack migrations and pairing switches when RF-rack validity is not
enforced.
Currently the option is validated when creating a materialized view via
the CQL interface, but it's missing from the alternator interface. Since
alternator indexes are based on materialized views, the same check
should be added there as well.
Fixesscylladb/scylladb#27612Closesscylladb/scylladb#27622
We currently allow restrictions on single column primary key,
but we ignore the restriction and return all results.
This can confuse the users. We change it so such a restriction
will throw an error and add a test to validate it.
Fixes: VECTOR-331
Closesscylladb/scylladb#27143
The cluster framework supports this setup, while cqlpy only runs against Scylla instances in raft mode.
Therefore, the test was moved from cqlpy to the cluster-based framework.
This commit both adds the test in cluster/ and removes the old version in cqlpy/.
This change is possible because the scheduling group for the driver is retrieved from a map (LOOKUP).
The lookup function is fully synchronized, non-coroutine, and returns immediately.
For that reason, it’s better to perform this lookup outside of the switch_tenant function.
To achieve this, capturing lambdas were introduced. As a result, self-deducing this was added to those lambdas to avoid coroutine-related issues (“coroutine fiasco”).
Since we are transitioning to a Raft-based architecture where all servers will be V2, we can safely implement this fix specifically for that case.
This change adds get_cached_user_scheduling_group functionality and moves its usage out of switch_tenant function in update_scheduling_group_v2 usage.
This patch series contains the following changes:
- Incorporation of `crypt_sha512.c` from musl to out codebase
- Conversion of `crypt_sha512.c` to C++ and coroutinization
- Coroutinization of `auth::passwords::check`
- Enabling use of `__crypt_sha512` orignated from `crypt_sha512.c` for
computing SHA 512 passwords of length <=255
- Addition of yielding in the aforementioned hashing implementation.
The alien thread was a solution for reactor stalls caused by indivisible
password‑hashing tasks (https://github.com/scylladb/scylladb/issues/24524).
However, because there is only one alien thread, overall hashing throughput was reduced
(see, e.g., https://github.com/scylladb/scylla-enterprise/issues/5711). To address this,
the alien‑thread solution is reverted, and a hashing implementation
with yielding is introduced in this patch series.
Before this patch series, ScyllaDB used SHA-512 hashing provided
by the `crypt_r` function, which in our case meant using the implementation
from the `libxcrypt` library. Adding yielding to this `libxcrypt`
implementation is problematic, both due to licensing (LGPL) and because the
implementation is split into many functions across multiple files. In
contrast, the SHA-512 implementation from `musl libc` has a more
permissive license and is concise, which makes it easier to incorporate
into the ScyllaDB codebase.
The performance of this solution was compared with the previous
implementation that used one alien thread and the implementation
after the alien thread was reverted. The results (median) of
`perf-cql-raw` with `--connection-per-request 1 --smp 10` parameters
are as follows:
- Alien thread: 41.5 new connections/s per shard
- Reverted alien thread: 244.1 new connections/s per shard
- This commit (yielding in hashing): 198.4 new connections/s per shard
The roughly 20% performance deterioration compared to
the old implementation without the alien thread comes from the fact
that the new hashing algorithm implemented in `utils/crypt_sha512.cc`
performs an expensive self-verification and stack cleanup.
On the other hand, with smp=10 the current implementation achieves
roughly 5x higher throughput than the alien thread. In addition,
due to yielding added in this commit, the algorithm is expected
to provide similar protection from stalls as the alien thread did.
In a test that in parallel started a cassandra-stress workload and
created thousands of new connections using python-driver, the values of
`scylla_reactor_stalls_count` metric were as follows:
- Alien thread: 109 stalls/shard total
- Reverted alien thread: 13186 stalls/shard total
- This commit (yielding in hashing): 149 stalls/shard total
Similarly, the `scylla_scheduler_time_spent_on_task_quota_violations_ms`
values were:
- Alien thread: 1087 ms/shard total
- Reverted alien thread: 72839 ms/shard total
- This commit (yielding in hashing): 1623 ms/shard total
To summarize, yielding during hashing computations achieves similar
throughput to the old solution without the alien thread but also
prevents stalls similarly to the alien thread.
Fixes: scylladb/scylladb#26859
Refs: scylladb/scylla-enterprise#5711
No automatic backport. After this PR is completed, the alien thread should be rather reverted from older branches (2025.2-2025.4 because on 2025.1 it's already removed). Backporting of the other commits needs further discussion.
Closesscylladb/scylladb#26860
* github.com:scylladb/scylladb:
test/boost: add too_long_password to auth_passwords_test
test/boost: add same_hashes_as_crypt_r to auth_passwords_test
auth: utils: add yielding to crypt_sha512
auth: change return type of passwords::check to future
auth: remove code duplication in verify_scheme
test/boost: coroutinize auth_passwords_test
utils: coroutinize crypt_sha512
utils: make crypt_sha512.cc to compile
utils: license: import crypt_sha512.c from musl to the project
Revert "auth: move passwords::check call to alien thread"
This is a problem caught after removing split from
add_sstable_and_update_cache(), which was used by
intra node migration when loading new sstables
into the destination shard.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If manager has been disabled due to out of space prevention, it's
important to throw an exception rather than silently not
splitting the new sstable.
Not splitting a sstable when needed can cause correctness issue
when finalizing split later.
It's better to fail the writer (e.g. repair one) which will be
retried than making caller think everything succeeded.
The new replica::table::add_new_sstable_and_update_cache() will
now unlink the new sstable on failure, so the table dir will
not be left with sstables not loaded.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Now, only sstable loader on boot and refresh from upload uses this
procedure. The idea is that maybe_split_new_sstable() will throw
when compaction cannot run due to e.g. out of space prevention.
It could fail repair writer, but we don't want it to fail boot.
As for refresh from upload, it's not supposed to work when tablet
map at the time of backup is not the same when restoring.
Even before this, refresh would fail if split already executed,
split would only happen if split was still ongoing. We need
token range stability for local restore. The safe variant will
always be load and stream.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We want the invariant that after ACK, all sealed sstables will be split.
This guarantee that on restart, no unsplit sstables will be found
sealed.
The paths that generate unsplit sstables are streaming and file
streaming consumers. It includes intra-node streaming, which
is local but can clone an unsplit sstable into destination.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
After the wiring, failure to attach the new sstable in the streaming
consumer will unlink the sstable automatically.
Fixes#27414.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Piggyback on new add_new_sstable_and_update_cache(), replacing
the previous add_sstables_and_update_cache().
Will be used by intra-node migration since we want it to be
safe when loading the cloned sstables. An unsplit sstable
can be cloned into destination which already ACKed split,
so we need this variant which splits sstable if needed,
while it's unsealed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Failure to load sstable in streaming can leave sealed sstables
on disk since they're not unlinked on failure.
This can result in several problems:
1) Data resurrection: since the sstable may contain deleted data
2) Split issue: since the finalization requires all sstables to be split
3) Disk usage issue: since the sstables hold space and streaming retries
can keep accumulating these files.
This new procedure will be later wired into streaming consumers, in
order to fix those problems.
Another benefit of the interface is that if there's split when adding
the new sstable, the output sstables will be returned to the caller,
allowing them to register the actual loaded sstables into e.g.
the view builder.
Refs #27414.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We want the invariant that after ACK, all sealed sstables will be split.
If check-and-attach is not atomic, this sequence is possible:
1) no split decision set.
2) Unsplit sstable is checked, no need to split, sealed.
3) split decision is set and ACKed
4) unsplit sstable is attached
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The lock is intended to serialize some maintenance compactions,
such as major, with repair. But maybe_split_new_sstable() is
restricted solely to new sstables that aren't part of the
sstable set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This is crucial with MVs, since the splitting must preserve the state of
the original sstable. We want the sstable to be in staging dir, so it's
excluded when calculating the diff for performing pushes to view
replicas.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Since the function must only be used on new sstables, it should
be renamed to something describing its usage should be restricted.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That will be needed for file streaming to leave output sstable unsealed.
we want the invariant where all sealed sstables are split after split
was ACKed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This is crucial for splitting before sealing the sstable produced by
repair. This way, unsplit sstables won't be left on disk sealed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
File streaming will have to load an unsealed sstable, so we need
to be able to parse components from temporary TOC instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This option was retired in commit 0959739216, but
it will be again needed in order to implement split before sealing.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When applying schema changes, the previous implementation attempted to
handle exceptions by catching and rethrowing them, but did so
incorrectly: using `throw ex` with a `std::exception_ptr` loses the
original exception type and details. The correct approach is to use
`std::rethrow_exception()`.
However, in this case, explicit exception handling is unnecessary. The
only reason for catching was to ensure `ap.destroy()` is called before
propagating the exception. This can be more cleanly and safely achieved
using Seastar's `.finally()` continuation, which guarantees cleanup
regardless of success or failure.
This change removes the manual try/catch/rethrow and uses `.finally()`
to ensure proper cleanup, letting exceptions propagate naturally and
preserving their type and information.
Fixes: SCYLLADB-94
Refs: scylladb/scylladb#27501
Fix location identified by clang-tidy where `std::exception_ptr` was
incorrectly rethrown using `throw ep;`. The correct approach is to use
`std::rethrow_exception(ep)`, which preserves the original exception
type and stack trace.
But this can be even further simplified by logging the current exception
with `std::current_exception()` and rethrowing using `throw;` instead of
capturing and rethrowing a `std::exception_ptr`. This matches the
idiomatic pattern used elsewhere in the codebase and improves clarity.
This change ensures proper exception propagation and avoids type slicing
or loss of diagnostic information.
Improve exception handling in join_node_response_handler by using
`co_await coroutine::return_exception_ptr` to propagate exceptions.
This replaces the incorrect direct throw of `std::exception_ptr` and
ensures proper coroutine-friendly exception propagation.
Adding pid info to servers allows matching coredumps with servers
Other improvements:
- When replacing just some fields of ServerInfo, use `_replace` instead of
building a new object. This way it is agnostic to changes to the Object
- When building ServerInfo from a list, the types defined for its fields are
not enforced, so ServerInfo(*list) works fine and does not need to be changed if
fields are added or removed.
The script API is 500+ lines long in an already too long and hard to
navigate document. Extract it to a separate document, making both
documents shorter and easier to navigate.
Reverts commit 8192f45e84.
The merge exposed a critical bug where truncate operations during table drop with auto-snapshot fail, causing Raft applier fiber to stop with unhandled exceptions. This leads to schema inconsistencies across nodes and test failures with "Keyspace does not exist" errors.
**Root Cause**
Commit 19b6207f modified `truncate_table_on_all_shards` to set `use_sstable_identifier = true`:
```cpp
// Before (working)
co_await table::snapshot_on_all_shards(sharded_db, table_shards, name);
// After (broken)
auto opts = db::snapshot_options{.use_sstable_identifier = true};
co_await table::snapshot_on_all_shards(sharded_db, table_shards, name, opts);
```
This triggers exceptions during snapshot that propagate through Raft state machine, causing:
- Raft applier stops: `raft::state_machine_error` at `raft/server.cc:1369`
- Schema changes fail to propagate
- Nodes report non-existent keyspaces for valid schemas
**Changes**
Reverts 15 files (200 deletions, 74 insertions):
- Removes `use_sstable_identifier` from truncate/snapshot code paths
- Reverts `snapshot_options` struct back to simple `skip_flush` boolean
- Removes REST API and nodetool `--use-sstable-identifier` parameter
- Removes feature tests from `test/boost/database_test.cc`
No backport required - the original feature was merged to master only and never released.
<!-- START COPILOT ORIGINAL PROMPT -->
<details>
<summary>Original prompt</summary>
----
*This section details on the original issue you should resolve*
<issue_title>test_table_drop_with_auto_snapshot failed with InvalidRequest</issue_title>
<issue_description>Seen in: https://jenkins.scylladb.com/job/scylla-master/job/next/9968//testReport
Logs: [download](https://downloads.scylladb.com/unstable/scylla/master/testLogs/2025-12-08T15:05:42Z/)
Error message:
```
cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Keyspace test does not exist"
```
Stacktrace:
```
manager = <test.pylib.manager_client.ManagerClient object at 0xffff734e6c10>
@pytest.mark.asyncio
async def test_table_drop_with_auto_snapshot(manager: ManagerClient):
logger.info("Bootstrapping cluster")
cfg = { 'auto_snapshot': True }
servers = await manager.servers_add(3, config = cfg)
cql = manager.get_cql()
# Increases the chance of tablet migration concurrent with schema change
await inject_error_on(manager, "tablet_allocator_shuffle", servers)
for i in range(3):
await cql.run_async("DROP KEYSPACE IF EXISTS test;")
await cql.run_async("CREATE KEYSPACE IF NOT EXISTS test WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 8 };")
await cql.run_async("CREATE TABLE IF NOT EXISTS test.tbl_sample_kv (id int, value text, PRIMARY KEY (id));")
> await cql.run_async("INSERT INTO test.tbl_sample_kv (id, value) VALUES (1, 'ala');")
E cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Keyspace test does not exist"
test/cluster/test_tablets2.py:173: InvalidRequest
```
</issue_description>
<agent_instructions>this issue was exposed by commit 8192f45e84, please send a pull request reverting that merge commit and mark it as fixing this github issue.</agent_instructions>
<comments>
<comment_new><author>@yaronkaikov</author><body>
@denesb is this something in your team area? if not , please feel free to delegate it or un-assign yourself :-)</body></comment_new>
<comment_new><author>@nyh</author><body>
This is very strange. Clearly the keyspace `test` does exist at this point, because we created it two lines above and also we ran `CREATE TABLE .. test.tbl_sample_kv` which would have failed if the keyspace `test` didn't exist - so it must exit, no?
In the past, we had a bug where the running `CREATE KEYSPACE IF NOT EXISTS` forgot to set the "schema modified" event in the response so it failed to wait for schema agreement, but 1. we fixed this bug (https://github.com/scylladb/scylladb/pull/18819 by @nuivall ) and 2. this bug didn't happen in this case, where CREATE TABLE deed had work to do.
But I just realized something... Our fix in https://github.com/scylladb/scylladb/pull/18819 only applies to CREATE KEYSPACE / TABLE / VIEW / TYPE statements. It wasn't applied to `DROP KEYSPACE` - and it should have been....
But I don't have a good theory how a bug like https://github.com/scylladb/scylladb/pull/18819 can explain this specific test failure. Different schema operations are already linearized, so if a `CREATE TABLE test.tbl_sample_kv` succeeded, I don't see how there could possibly be any earlier `DROP KEYSPACE test` that suddenly springs to life. Unless we have a serious bug in our raft-based schema operations.</body></comment_new>
<comment_new><author>@nyh</author><body>
Another bug we could have in theory is that the Python driver's async `cql.run_async` might have a bug where it is not waiting for the schema agreement despite being told to wait. If it doesn't wait for schema agreement, this can easily explain this bug:
1. the CREATE KEYSPACE, CREATE TABLE both are sent to node A, but
2. the last INSERT INTO is sent to node B which is not yet aware of this new keyspace and table, and fails.
Copilot claims that **execute_async() does have this bug!**
> For schema-altering statements, schema agreement (meaning all nodes agree on the new schema) is important before running follow-up operations, but this is enforced only by synchronous helpers like Session.execute(), not the asynchronous version.
> If you use execute_async() for schema operations, you are responsible for checking schema agreement yourself, using [Session.check_schema_agreement()](https://docs.datastax.com/en/developer/python-driver/latest/api/cassandra/cluster/#cassandra.cluster.Session.check_schema_agreement) or (in newer code) ResponseFuture.check_schema_agreement.
> According to [a discussion on the DataStax support forum](https://support.datastax.com/s/article/Does-the-Python-Driver-for-Cassandra-Wait-for-Schema-Agreement-after-a-Schema-Change?language=en_US) and the [driver’s source code](7f12a5e1c6/cassandra/cluster.py (L487)), schema agreement is not ch...
</details>
<!-- START COPILOT CODING AGENT SUFFIX -->
- Fixesscylladb/scylladb#27501
<!-- START COPILOT CODING AGENT TIPS -->
---
Closesscylladb/scylladb#27604
* github.com:scylladb/scylladb:
Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy"
Initial plan
We are about to extract the script API to a separate document. In
preparation convert soon-to-be cross-document references, so they keep
working after the extraction.
This reverts commit 8192f45e84.
The merge exposed a bug where truncate (via drop) fails and causes Raft
errors, leading to schema inconsistencies across nodes. This results in
test_table_drop_with_auto_snapshot failures with 'Keyspace test does not exist'
errors.
The specific problematic change was in commit 19b6207f which modified
truncate_table_on_all_shards to set use_sstable_identifier = true. This
causes exceptions during truncate that are not properly handled, leading
to Raft applier fiber stopping and nodes losing schema synchronization.
Add pull_request_target event with unlabeled type to trigger-scylla-ci
workflow. This allows automatic CI triggering when the 'conflicts' label
is removed from a PR, in addition to the existing manual trigger via
comment.
The workflow now runs when:
- A user posts a comment with '@scylladbbot trigger-ci' (existing)
- The 'conflicts' label is removed from a PR (new)
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-84Closesscylladb/scylladb#27521
Fixed a critical bug where `storage_group::for_each_compaction_group()` was incorrectly marked `noexcept`, causing `std::terminate` when actions threw exceptions (e.g., `utils::memory_limit_reached` during memory-constrained reader creation).
**Changes made:**
1. Removed `noexcept` from `storage_group::for_each_compaction_group()` declaration and implementation
2. Removed `noexcept` from `storage_group::compaction_groups()` overloads (they call for_each_compaction_group)
3. Removed `noexcept` from `storage_group::live_disk_space_used()` and `memtable_count()` (they call compaction_groups())
4. Kept `noexcept` on `storage_group::flush()` - it's a coroutine that automatically captures exceptions and returns them as exceptional futures
5. Removed `noexcept` from `table_load_stats()` functions in base class, table, and storage group managers
**Rationale:**
As noted by reviewers, there's no reason to kill the server if these functions throw. For coroutines returning futures, `noexcept` is appropriate because Seastar automatically captures exceptions and returns them as exceptional futures. For other functions, proper exception handling allows the system to recover gracefully instead of terminating.
Fixes#27475Closesscylladb/scylladb#27476
* github.com:scylladb/scylladb:
replica: Remove unnecessary noexcept
replica: Remove noexcept from compaction_groups() functions
replica: Remove noexcept from storage_group::for_each_compaction_group
There is no 'regular' incremental mode anymore.
The example seems have meant 'disabled'.
Fixes#27587
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Extend the Fixes validation pattern to also accept JIRA issue references
(format: [A-Z]+-\d+) in addition to GitHub issue references. This allows
backport PRs to reference JIRA issues in the format 'Fixes: PROJECT-123'.
Fixes: https://github.com/scylladb/scylladb/issues/27571Closesscylladb/scylladb#27572
This reverts commit faad0167d7. It causes
a regression in
test_two_tablets_concurrent_repair_and_migration_repair_writer_level
in debug mode (with ~5%-10% probability).
Fixes#27510.
Closesscylladb/scylladb#27560
This patch-set consolidates and corrects rjson string conversion handling.
It removes unnecessary string copies, ensures proper length usage and
replaces ad-hoc conversions with consistent helper functions.
Overall, the changes make rjson string handling safer, faster, and more uniform across the codebase.
Backport: no, it's a refactor
Closesscylladb/scylladb#27394
* github.com:scylladb/scylladb:
fix rjson::value to bytes conversion with missing GetStringLength call
alternator: change type from string to string_view in should_add_capacity
fix rjson::value to string_view conversion with missing GetStringLength call
use rjson::to_string_view when rjson::value gets converted using GetStringLength
use rjson::to_sstring and rjson::to_string for various string conversions
utils: use rjson document wrapper in instance_profile_credentials_provider::parse_creds
utils: move rjson::to_string_view func to string related place
utils: add to_sstring and to_string rjson helper
This patch adds a reproducer for a long-known bug, #8070, where
Alternator can store invalid values which are just blindly stored as
JSON, and we will only see the failure when reading the item back -
and either the client will fail to parse it, or sometimes even Alternator's
own code (e.g., FilterExpression) will fail to parse it. The right
behavior is to fail the write - not the read.
The included test checks writing different kinds of invalid values using
PutItem, UpdateItem, and BatchWriteItem. The new tests pass on DynamoDB,
but fail on Alternator so marked as "xfail".
Refs #8070.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a reproducer for issue #27375, where even with
alternator_streams_increased_compatibility set to true, if an attribute
is set to the same value it had but using a different JSON
representation - a Alternator Streams event is unduly produced.
For example, if a map {'dog': 1, 'cat': 2} is changed to
{'cat': 2, 'dog': 1}, this non-change should not be reported.
The new test added in this patch passes on DynamoDB (an event
is not generated) but fails on Alternator (an event is generated),
so the new test is marked with xfail.
Refs #27375.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
rjson::parse() when parsing JSON stored in a chunked_content (a vector
of temporary buffers) failed to initialize its byte counter to 0,
resulting in garbage positions in error messages like:
Parsing JSON failed: Missing a name for object member. at 1452254
These error messages were most noticable in Alternator, which parses
JSON requests using a chunked_content, and reports these errors back
to the user.
The fix is trivial: add the missing initialization of the counter.
The patch also adds a regression test for this bug - it sends a JSON
corrupt at position 1, and expect to see "at 1" and not some large
random number.
Fixes#27372
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
get_signed_request() started in test_manual_requests.py as a way to sign
a manually-created DynamoDB-API request - for sending requests that boto3
can't.
Over time, we started to use this function in additional test files, and
it's about time to move it to util.py - which is more natural to import
from multiple files.
This patch also adds a new function, manual_request(), which combines
get_signed_request() and actually sending the request via
requests.post(). New tests should prefer it, because it's easier to use.
We'll use the new function in tests that we add in the next patches.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
`test_insert_failure_doesnt_report_success` test in `test/cluster/dtest/audit_test.py`
has an insert statement that is expected to fail. Dtest environment uses
`FlakyRetryPolicy`, which has `max_retries = 5`. 1 initial fail and 5 retry fails
means we expect 6 error audit logs.
The test failed because `create keyspace ks` failed once, then succeeded on retry.
It allowed the test to proceed properly, but the last part of the test that expects
exactly 6 failed queries actually had 7.
The goal of this patch is to make sure there are exactly 6 = 1 + `max_retries` failed
queries, counting only the query expected to fail. If other queries fail with
successful retry, it's fine. If other queries fail without successful retry, the test
will fail, as it should in such situations. They are not related to this expected
failed insert statement.
Fixes#27322Closesscylladb/scylladb#27378
When waiting for the condition variable times out
we call on_internal_error, but unfortunately, the backtrace
it generates is obfuscated by
`coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume`.
To make the log more useful, print the error injection name
and the caller's source_location in the timeout error message.
Fixes#27531
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#27532
This contained only one routine; `corrupt_file`, which is
highly problematic, and not used. If you want to "corrupt" a
file, it should be done controlled, not at random.
Fixes#26744
If a segment to replay is broken such that the main header is not zero,
but still broken, we throw header_checksum_error. This was not handled in
replayer, which grouped this into the "user error/fundamental problem"
category. However, assuming we allow for "real" disk corruption, this should
really be treated same as data corruption, i.e. reported data loss, not
failure to start up.
The `test_one_big_mutation_corrupted_on_startup` test accidentally sometimes
provoked this issue, by doing random file wrecking, which on rare occasions
provoked this, and thus failed test due to scylla not starting up, instead
of loosing data as expected.
Changed test to consistently cause this exact error instead.
The test documents the current behavior of hashing algorithms that
fail if the passphrase has 512 bytes or more.
Moreover, it documents the behavior of the current bcrypt
implementation that compares only the first 72 bytes of the password.
Although we don't typically use bcrypt for password hashing, it is
possible to insert such a hash using
`CREATE ROLE ... WITH HASHED PASSWORD ...`.
Refs: scylladb/scylladb#26842
This change allows yielding during hashing computations to prevent
stalls.
The performance of this solution was compared with the previous
implementation that used one alien thread and the implementation
after the alien thread was reverted. The results (median) of
`perf-cql-raw` with `--connection-per-request 1 --smp 10` parameters
are as follows:
- Alien thread: 41.5 new connections/s per shard
- Reverted alien thread: 244.1 new connections/s per shard
- This commit (yielding in hashing): 198.4 new connections/s per shard
The alien thread is limited by a single-core hashing throughput,
which is roughly 400-500 hashes/s in the test environment. Therefore,
with smp=10, the throughput is below 50 hashes/s, and the difference
between the alien thread and other solutions further increases with
higer smp.
The roughly 20% performance deterioration compared to
the old implementation without the alien thread comes from the fact
that the new hashing algorithm implemented in `utils/crypt_sha512.cc`
performs an expensive self-verification and stack cleanup.
On the other hand, with smp=10 the current implementation achieves
roughly 5x higher throughput than the alien thread. In addition,
due to yielding added in this commit, the algorithm is expected
to provide similar protection from stalls as the alien thread did.
In a test that in parallel started a cassandra-stress workload and
created thousands of new connections using python-driver, the values of
`scylla_reactor_stalls_count` metric were as follows:
- Alien thread: 109 stalls/shard total
- Reverted alien thread: 13186 stalls/shard total
- This commit (yielding in hashing): 149 stalls/shard total
Similarly, the `scylla_scheduler_time_spent_on_task_quota_violations_ms`
values were:
- Alien thread: 1087 ms/shard total
- Reverted alien thread: 72839 ms/shard total
- This commit (yielding in hashing): 1623 ms/shard total
To summarize, yielding during hashing computations achieves similar
throughput to the old solution without the alien thread but also
prevents stalls similarly to the alien thread.
Fixes: scylladb/scylladb#26859
Refs: scylladb/scylla-enterprise#5711
Introduce a new `passwords::hash_with_salt_async` and change the return
type of `passwords::check` to `future<bool>`. This enables yielding
during password computations later in this patch series.
The old method, `hash_with_salt`, is marked as deprecated because
new code should use the new `hash_with_salt_async` function.
We are not removing `hash_with_salt` now to reduce the regression risk
of changing the hashing implementation—at least the methods that change
persistent hashes (CREATE, ALTER) will continue to use the old hashing
method. However, in the future, `hash_with_salt` should be entirely
removed.
Refs: scylladb/scylladb#26859
Refactoring: create a new function `verify_hashing_output` to reuse
code in `hash_with_salt` and `verify_scheme`. The change is introduced
to facilitate verification of hashing output when the implementation
is extended later in this patch series.
Refs: scylladb/scylladb#26859
This commit prepares `auth_passwords_test` for using coroutines,
because later in this patch series `auth::passwords::check` and other
similar functions will return Seastar futures.
Refs: scylladb/scylladb#26859
Change `sha512crypt` and `__crypt_sha512` to coroutines to allow
yielding during hash computations later in this patch series.
Refs: scylladb/scylladb#26859
This patch imports the `crypt_sha512.c` file from the musl library.
We need it to incorporate yielding in the `crypt_r` function to avoid
reactor stalls during long hashing computations.
Before this patch series, ScyllaDB used SHA-512 hashing provided
by the `crypt_r` function, which in our case meant using the implementation
from the `libxcrypt` library. Adding yielding to this `libxcrypt`
implementation is problematic, both due to licensing (LGPL) and because the
implementation is split into many functions across multiple files. In
contrast, the SHA-512 implementation from `musl libc` has a more
permissive license and is concise, which makes it easier to incorporate
into the ScyllaDB codebase.
Both `crypt_sha512.c` and musl license are obtained from
git.musl-libc.org:
- https://git.musl-libc.org/cgit/musl/tree/src/crypt/crypt_sha512.c
- https://git.musl-libc.org/cgit/musl/tree/COPYRIGHT
Import commit:
commit 1b76ff0767d01df72f692806ee5adee13c67ef88
Author: Alex Rønne Petersen <alex@alexrp.com>
Date: Sun Oct 12 05:35:19 2025 +0200
s390x: shuffle register usage in __tls_get_offset to avoid r0 as address
Refs: scylladb/scylladb#26859
The alien thread was a solution for reactor stalls caused by indivisible
password‑hashing tasks (scylladb/scylladb#24524). However, because
there is only one alien thread, overall hashing throughput was reduced
(see, e.g., scylladb/scylla-enterprise#5711). To address this,
the alien‑thread solution is reverted, and a hashing implementation
with yielding will be introduced later in this patch series.
This reverts commit 9574513ec1.
For some arcane reason, we split optional the test pattern given to
test.py twice across '::' to get the file + case specifiers later given
to pytest etc. This means that for a test with a class group (such as some
migrated dtests), we cannot really specify the exact test to run
(pattern <file>::<class>::test).
Simply splitting only on first '::' fixes this. Should not affect any
other tests.
Can potentially lead to unnecessary abort.
compaction_groups() and for_each_compaction_group() can throw.
Co-authored-by: bhalevy <20910904+bhalevy@users.noreply.github.com>
To keep backward compatibility, support
- old configs -- where endpoint is just an address and port is separate.
When it happens, format the "new" endpoint name
- lookup by address-only. If it happens, scan all endpoints and see if
any one matches the provided address
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
For this, add the s3::client::make(endpoint, ...) overload that accepts
endpoint in proto://host:port format. Then it parses the provided url
and calls the legacy one, that accepts raw host string and config with
port, https bit, etc.
The generic object_storage_endpoint_param no longer needs to carry the
internal s3::endpoint_config, the config option parsing changes
respectively.
Tests, that generate the config files, and docs are updated.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Don't prepare s3::endpoint_config from generic code, jut pass the region
and iam_role_arn (those that can potentially change) to the callback.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it construct like gs_client_wrapper -- with generic endpoint param
reference and make the storage-specific casts/gets/whatever internally.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
…eshold
The initial problem:
Some of the tests in test_protocol_exceptions.py started failing. The failure is on the condition that no more than `cpp_exception_threshold` happened.
Test logic:
These tests assert that specific code paths do not throw an exception anymore. Initial implementation ran a code path once, and asserted there were 0 exceptions. Sometimes an exception or several can occur, not directly related to the code paths the tests check, but those would fail the tests.
The solution was to run the tests multiple times. If there is a regression, there would be at least as many exceptions thrown as there are test runs. If there is no regression, a few exceptions might happen, up to 10 per 100 test runs. I have arbitrarily chosen `run_count = 100` and `cpp_exception_threshold = 10` values.
Note that the exceptions are counted per shard, not per code path.
The new problem:
The occassional exceptions thrown by some parts of the server now throw a bit more than before. Based on the logs linked on the issues, it is usually 12.
There are possibly multiple ways to resolve the issue. I have considered logging exceptions and parsing them. I would have to filter exception logs only for wanted exceptions. However, if a new, different exception is introduced, it might not be counted.
Another approach is to just increase the threshold a bit. The issue of throwing more exceptions than before in some other server modules should be addressed by a set of tests for that module, just like these tests check protocol exceptions, not caring who used protocol check code paths.
For those reasons, the solution implemented here is to increase `cpp_exception_threshold` to `20`. It will not make the tests unreliable, because, as mentioned, if there is a regression, there would be at least `run_count` exceptions per `run_count` test runs (1 exception per single test run).
Still, to make "background exceptions" occurence a bit more normalized, `run_count` too is doubled, from `100` to `200`. At the first glance this looks like nothing is changed, but actually doubling both run count and exception threshold here implies that the burst does not scale as much as run count, it is just that the "jitter" is bigger than the old threshold.
Also, this patch series enables debug logging for `exception` logger. This will allow us to inspect which exceptions happened if a protocol exceptions test fails again.
Fixes#27247Fixes#27325
Issue observed on master and branch-2025.4. The tests, in the same form, exist on master, branch-2025.4, branch-2025.3, branch-2025.2, and branch-2025.1. Code change is simple, and no issue is expected with backport automation. Thus, backports for all the aforementioned versions is requested.
Closesscylladb/scylladb#27412
* github.com:scylladb/scylladb:
test: cqlpy: test_protocol_exceptions.py: enable debug exception logging
test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions threshold
In some cases we unnecessarily convert to string which
causes a copy. In other we convert without calling
GetStringLength which causes iteration to dermine length
which is already known. In some cases we do even both.
This commit fixes that.
So that conversion code is common and it's easier
to avoid accidental type conversions. Additionally
according to rapid json library size must be checked
explicitly, this also avoids extra iteration in char*
to (s)string conversion.
Rebase to Fedora 43 with clang 21.1 and libstdc++ 15.
Fedora container image registry moved to registry.fedoraproject.org as
it seems to be updated more regularly.
Added python3-devel to the dependencies as some packages scylla-cqlsh
depends on aren't yet available in the form of wheels for Python 3.14,
and so have to be built locally. In any case it's better to reduce
dependency on those wheels even if the ones currently missing appear
eventually.
Added libev-devel to the dependencies so that the python driver
builds correctly even if "wheels" are not published. This reduces
our dependency on the python driver's binary release schedule.
Without libev-devel, TLS does not work correctly.
We no long remove the clang and clang-libs packages. Doxygen
started depending on clang-libs, and removing them removes
doxygen, breaking the build when it looks for that. The build
will still pick up the optimized clang, since /usr/local/bin
is earlier in the path. We keep the clang package, since it allows
us to mess a little less with the directory structure.
Optimized clang binaries generates and stored in
https://devpkg.scylladb.com/clang/clang-21.1.6-Fedora-43-aarch64.tar.gzhttps://devpkg.scylladb.com/clang/clang-21.1.6-Fedora-43-x86_64.tar.gz
With ./scripts/refresh-pgo-profiles.sh, the new compiler shows a small
performance improvement (instructions_per_op) in perf-simple-query:
clang 21:
259353.60 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35720 insns/op, 17427 cycles/op, 0 errors)
265940.08 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35725 insns/op, 17042 cycles/op, 0 errors)
262650.01 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35720 insns/op, 17240 cycles/op, 0 errors)
262881.22 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35675 insns/op, 17222 cycles/op, 0 errors)
264898.68 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35732 insns/op, 17070 cycles/op, 0 errors)
throughput:
mean= 263144.72 standard-deviation=2528.69
median= 262881.22 median-absolute-deviation=1753.96
maximum=265940.08 minimum=259353.60
instructions_per_op:
mean= 35714.47 standard-deviation=22.34
median= 35720.38 median-absolute-deviation=10.20
maximum=35732.14 minimum=35675.50
cpu_cycles_per_op:
mean= 17200.12 standard-deviation=154.62
median= 17221.70 median-absolute-deviation=129.77
maximum=17427.33 minimum=17041.57
clang 20:
254431.39 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35883 insns/op, 17708 cycles/op, 0 errors)
259701.02 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35883 insns/op, 17351 cycles/op, 0 errors)
261166.92 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35912 insns/op, 17270 cycles/op, 0 errors)
260656.31 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35869 insns/op, 17289 cycles/op, 0 errors)
259628.13 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35946 insns/op, 17370 cycles/op, 0 errors)
throughput:
mean= 259116.75 standard-deviation=2698.56
median= 259701.02 median-absolute-deviation=1539.55
maximum=261166.92 minimum=254431.39
instructions_per_op:
mean= 35898.42 standard-deviation=30.69
median= 35882.97 median-absolute-deviation=15.90
maximum=35945.63 minimum=35869.02
cpu_cycles_per_op:
mean= 17397.49 standard-deviation=178.35
median= 17351.35 median-absolute-deviation=108.79
maximum=17707.63 minimum=17269.68
Closesscylladb/scylladb#26773
Currently we have 3 explicit checks, and some of them are configurable:
- Jenkins job being stable. Can be disabled with --force
- Whether submodule update is happenning. It's not allowed by default, and
should be enabled with --allow-submodule option
- Target branch checking (recently merged #27249). Happens unconditionally
This PR unifies all checks in two ways.
First, each restriction can be lifted with --allow-foo options. The existing
--allow-submodule stays and two options are added:
- --allow-unstable to skip jenkins job check (like --force works now)
- --allow-any-branch to skip target branch check
Second, the --force option lifts all the known restrictions.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#27294
With the introduction of rack-lists and the reliance of materialized views on them, the `get_view_natural_endpoint` function can be greatly simplified. When using tablets, instead of doing any index-matching, we can now pair base tables with views only in the same rack.
In this series we remove no longer needed code and reorganize the needed code for better clarity.
After the changes, the `get_view_natural_endpoint` function goes down from 245 lines to 85 lines, while the whole pairing-related text goes down from 346 lines to 239 lines.
Fixes https://github.com/scylladb/scylladb/issues/26313Closesscylladb/scylladb#27383
* github.com:scylladb/scylladb:
mv: replace the simple/complex rack-aware pairing with exact rack matching
mv: split out vnode pairing code from get_view_natural_endpoint
mv: unify self-pairing and rack-aware pairing into one bool
mv: remove the workaround for left nodes when sending view updates
Scylla implements `LWT` in the` storage_proxy::cas` method. This method expects to be called on a specific shard, represented by the `cas_shard` parameter. Clients must create this object before calling `storage_proxy::cas`, check its `this_shard()` method, and jump to `cas_shard.shard()` if it returns false.
The nuance is that by the time the request reaches the destination shard, the tablet may have already advanced in its migration state machine. For example, a client may acquire a `cas_shard` at the `streaming` tablet state, then submit a request to another shard via `smp::submit_to(cas_shard.shard())`. However, the new `cas_shard` created on that other shard might already be in the `write_both_read_new` state, and its `cas_shard.shard()` would not be equal to `this_shard_id()`. Such broken invariant results in an `on_internal_error` in `storage_proxy::cas`.
Clients of `storage_proxy::cas` are expected to check` cas_shard.this_shard()` and recursively jump to another shard if it returns false. Most calls to `storage_proxy::cas` already implement this logic. The only exception is `executor::do_batch_write`, which currently checks `cas_shard.this_shard()` only once. This can break the invariant if the tablet state changes more than once during the operation.
This PR fixes the issue by implementing recursive `cas_shard.this_shard()` checks in `executor::do_batch_write`. It also adds a test that reproduces the problem.
Fixes: scylladb/scylladb#27353
backport: need to be backported to 2025.4
Closesscylladb/scylladb#27396
* github.com:scylladb/scylladb:
alternator/executor.cc: eliminate redundant dk copy
alternator/executor.cc: release cas_shard on the original shard
alternator/executor.cc: move shard check into cas_write
alternator/executor.cc: make cas_write a private method
alternator/executor.cc: make do_batch_write a private method
alternator/executor.cc: fix indent
test_alternator: add test_alternator_invalid_shard_for_lwt
Before this series, we kept the cas_shard on the original shard to
guard against tablet movements running in parallel with
storage_proxy::cas.
The bug addressed by this PR shows that this approach is flawed:
keeping the cas_shard on the original shard does not guarantee that
a new cas_shard acquired on the target shard won’t require another
jump.
We fixed this in the previous commit by checking cas_shard.this_shard()
on the target shard and continuing to jump to another shard if
necessary. Once cas_shard.this_shard() on the target shard returns
true, the storage_proxy::cas invariants are satisfied, and no other
cas_shard instances need to remain alive except the one passed
into storage_proxy::cas.
This change ensures that if cas_shard points to a different shard,
the executor will continue issuing shard jumps until
cas_shard.this_shard() returns true. The commit simply moves the
this_shard() check from the parallel_for_each lambda into cas_write,
with minimal functional changes.
We enable test_alternator_invalid_shard_for_lwt since now it should
pass.
Fixesscylladb/scylladb#27353
Refactor the way we decide the sstable belong to a tablet, fully or partially to simplify the flow and make it more readable. Also extract the logic and make it testable, add tests to cover changes
The change is purely aesthetic, no need to backport
Closesscylladb/scylladb#27101
* github.com:scylladb/scylladb:
streaming: remove unnecessary lambda creating sstable token range
streaming: simplify get_sstables_for_tablets logic
streaming: switch to range-based for loop
streaming: drop sstable skip microoptimization in tablet loop
streaming: replace reverse iterators with reverse view in sstables scan
streaming: return from get_sstables_for_tablets earlier
streaming: add get_sstables_by_tablet_range tests
test,sstables: add helper to set sstable first and last keys
streaming: refactor get_sstables_for_tablets to make it accessible
streaming: refactor get_sstables_for_tablets to make it testable
streaming: refactor tablet_sstable_streamer::stream by extracting SST filtering logic
Enable debug logging for "exception" logger inside protocol exception tests.
The exceptions will be logged, and it will be possible to see which ones
occured if a protocol exceptions test fails.
Refs #27272
Refs #27325
The initial problem:
Some of the tests in test_protocol_exceptions.py started failing. The failure is
on the condition that no more than `cpp_exception_threshold` happened.
Test logic:
These tests assert that specific code paths do not throw an exception anymore.
Initial implementation ran a code path once, and asserted there were 0 exceptions.
Sometimes an exception or several can occur, not directly related to the code paths
the tests check, but those would fail the tests.
The solution was to run the tests multiple times. If there is a regression, there
would be at least as many exceptions thrown as there are test runs. If there is no
regression, a few exceptions might happen, up to 10 per 100 test runs.
I have arbitrarily chosen `run_count = 100` and `cpp_exception_threshold = 10` values.
Note that the exceptions are counted per shard, not per code path.
The new problem:
The occassional exceptions thrown by some parts of the server now throw a bit more
than before. Based on the logs linked on the issues, it is usually 12.
There are possibly multiple ways to resolve the issue. I have considered logging
exceptions and parsing them. I would have to filter exception logs only for wanted
exceptions. However, if a new, different exception is introduced, it might not be
counted.
Another approach is to just increase the threshold a bit. The issue of throwing
more exceptions than before in some other server modules should be addressed by
a set of tests for that module, just like these tests check protocol exceptions,
not caring who used protocol check code paths.
For those reasons, the solution implemented here is to increase `cpp_exception_threshold`
to `20`. It will not make the tests unreliable, because, as mentioned, if there is a
regression, there would be at least `run_count` exceptions per `run_count` test runs
(1 exception per single test run).
Still, to make "background exceptions" occurence a bit more normalized, `run_count` too
is doubled, from `100` to `200`. At the first glance this looks like nothing is changed,
but actually doubling both run count and exception threshold here implies that the
exception burst does not scale as much as run count, it is just that the "jitter" is
bigger than the old threshold.
Fixes#27247Fixes#27325
Default number of retires in `eventually()` in `test_builder_with_concurrent_drop`
sometimes is not enough to observe changes in system tables on aarch64
builds.
This patch increases the number of retries to 30.
Fixesscylladb/scylladb#27370Closesscylladb/scylladb#27493
If the process is running returncode will be Node, otherwise it will
have some value (which can be 0 s well) and the current code treats 0
as if the process is still running.
Closesscylladb/scylladb#27490
Tablet migration transfers sstable files without changing origin
host-id. As it should, becuase those sstables were not written on the
destination host, and should be ignored by commit log replay.
So it's a normal situation, and it's confusing to see this warning in
logs.
Fixes#26957Closesscylladb/scylladb#27433
The distributed_loader::get_sstables_from_object_store() method accepts an endpoint parameter and internally wants to get storage type for that endpoint (s3 or gcs). This is needed to construct storage_options object to create an sstable object.
To get the type, the method scans db::config option, but there's much simpler way to get one.
Code cleanup, no need to backport
Closesscylladb/scylladb#27381
* github.com:scylladb/scylladb:
sstables_loader: Provide endpoint type for get_sstables_from_object_store()
storage_manager: Introduce get_endpoint_type() method
storage_manager: Split get_endpoint_client()
We rewrite the test to avoid flakiness. Instead of looking at the
metrics, we make a trade-off and start depending on a less reliable
mechanism -- logs. We grep all relevant messages printed by Scylla
in TRACE mode and make sure that they were all printed from a context
using the streaming scheduling group.
Although it's a "less proper" way of testing, it should be much more
dependable and avoid flakiness.
Fixesscylladb/scylladb#25957Closesscylladb/scylladb#26656
The test test_truncate_during_topology_change tests TRUNCATE TABLE while
bootstrapping a new node. With tablets enabled TRUNCATE is a global
topology operation which needs to serialize with boostrap.
When TRUNCATE TABLE is issued, it first checks if there is an already
queued truncate for the same table. This can happen if a previous
TRUNCATE operation has timed out, and the client retried. The newly
issued truncate will only join the queued one if it is waiting to be
processed, and will fail immediatelly if the TRUNCATE is already being
processed.
In this test, TRUNCATE will be retried after a timeout (1 minute) due to
the default retry policy, and will be retried up to 3 times, while the
bootstrap is delayed by 2 minutes. This means that the test can validate
the result of a truncate which was started after bootstrap was
completed.
Because of the way truncate joins existing truncate operations, we can
also have the following scenario:
- TRUNCATE times out after one minute because the new node is being
bootstrapped
- the client retries the TRUNCATE command which also times out after 1m
- the third attempt is received during TRUNCATE being processed which
fails the test
This patch changes the retry policy of the TRUNCATE operation to
FallthroughRetryPolicy which guarantees that TRUNCATE will not be
retried on timeout. It also increases the timeout of the TRUNCATE from 1
to 4 minutes. This way the test will actually validate the performance
of the TRUNCATE operation which was issued during bootstrap, instead of
the subsequent, retried TRUNCATEs which could have been issued after the
bootstrap was complete.
Fixes: #26347Closesscylladb/scylladb#27245
This patch adds tablet repair progress report support so that the user
could use the /task_manager/task_status API to query the progress.
In order to support this, a new system table is introduced to record the
user request related info, i.e, start of the request and end of the
request.
The progress is accurate when tablet split or merge happens in the
middle of the request, since the tokens of the tablet are recorded when
the request is started and when repair of each tablet is finished. The
original tablet repair is considered as finished when the finished
ranges cover the original tablet token ranges.
After this patch, the /task_manager/task_status API will report correct
progress_total and progress_completed.
Fixes#22564Fixes#26896Closesscylladb/scylladb#26924
There is a bug in current pytest's boost implementation. When timeout
reached process will be killed, but it was not correctly propagated,
that lead to a false positive result. This will fail test case when
timeout for the process is reached.
This is to prevent issues like this https://github.com/scylladb/scylladb/issues/27237Closesscylladb/scylladb#27463
The `sstable_token_range` lambda was only used once to create a token
range for an SSTable. Inline the construction directly where needed,
removing the extra lambda. This simplifies the code without changing
behavior.
Remove the use of the `overlaps` helper and unnest nested conditionals
in get_sstables_for_tablets. Straightforward `before` and `after` checks
are sufficient to decide how each SSTable should be handled.
Replace the explicit iterator loop with a range-based for loop. This
simplifies the code, enforces constness, and avoids the unnecessary
use of postfix increment. The behavior remains unchanged,but
readability and maintainability are improved.
Remove the microoptimization that advanced over SSTables ending before a
tablet range. This approach is misleading since SSTables are not sorted
by their end token, and the extra logic adds complexity with little to
no benefit. The streaming path here is not performance‑critical, so the
simpler loop is preferable.
Use a reverse view over the SSTables vector instead of reverse iterators.
This avoids awkward rbegin/rend usage and the mental overhead of tracking
inverted sort order. With a view, we can use standard begin/end iteration
while preserving the intended scan direction.
Add a comprehensive test suite that exercises various combinations of
SSTable containment within tablet ranges. These cases cover boundary
conditions, partial overlaps, and full containment to validate all
recent changes made to `get_sstables_by_tablet_range`.
Introduce a utility helper to set the first and last decorated keys on
an SSTable. This is intended for testing purposes, making it easier to
construct SSTables with defined boundaries in unit tests.
Create `get_sstables_for_tablets_for_tests` friend free function
for testing purposes. Adding this free function allows
direct testing without requiring the full streamer context.
Make the `get_sstables_for_tablets` member function `static`. This
is a step toward improved testability, allowing the function to be
invoked directly without requiring a full instance of the streamer.
This change adds a new option to the REST api and correspondingly, to scylla nodetool: use_sstable_identifier.
When set, we use the sstable identifier, if available, to name each sstable in the snapshots directory
and the manifest.json file, rather than using the sstable generation.
This can be used by the user (e.g. Scylla Manager) for global deduplication with tablets, where an sstable
may be migrated across shards or across nodes, and in this case, its generation may change, but its
sstable identifier remains sstable.
Currently, Scylla manager uses the sstable generation to detect sstables that are already backed up to
object storage and exist in previous backed up snapshots.
Historically, the sstable generation was guaranteed to be unique only per table per node,
so the dedup code currently checks for deduplication in the node scope.
However, with tablet migration, sstables are renamed when migrated to a different shard,
i.e. their generation changes, and they may be renamed when migrated to another node,
but even if they are not, the dedup logic still assumes uniqueness only within a node.
To address both cases, we keep the sstable_id stable throughout the sstable life cycle (since 3a12ad96c7).
Given the globally unique sstable identifier, scylla manager can now detect duplicate sstables
in a wider scope. This can be cluster-wide, but we practically need only rack-wide deduplication
or dc-wide, as tablets are migrated across racks only in rare occasions (like when converting from a
numerical replication factor to a rack list containing a subset of the available racks in a datacenter).
Fixes#27181
* New feature, no backport required
Closesscylladb/scylladb#27184
* github.com:scylladb/scylladb:
database: truncate_table_on_all_shards: set use_sstable_identifier to true
nodetool: snapshot: add --use-sstable-identifier option
api: storage_service: take_snapshot: add use_sstable_identifier option
test: database_test: add snapshot_use_sstable_identifier_works
test: database_test: snapshot_works: add validate_manifest
sstable: write_scylla_metadata: add random_sstable_identifier error injection
table: snapshot_on_all_shards: take snapshot_options
sstable: add get_format getter
sstable: snapshot: add use_sstable_identifier option
db: snapshot_ctl: snapshot_options: add use_sstable_identifier options
db: snapshot_ctl: move skip_flush to struct snapshot_options
We will need to access executor::_stats field from cas_write. We could
pass it as a paramter, but it seems simpler to just make cas_write
and instance method too.
This test reproduces scylladb/scylladb#27353 using two injection
points. First, the test triggers an intra-node tablet migration and
suspends it at the streaming stage using the
intranode_migration_streaming_wait injection. Next, it enables the
alternator_executor_batch_write_wait injection, which suspends a
batch write after its cas_shard has already been created.
The test then issues several batch writes and waits until one of them
hits this injection on the destination shard. At this point, the
cas_shard.erm for that write is still in the streaming state,
meaning the executor would need to jump back to the source shard.
The test then resumes the suspended tablet migration, allowing it to
update the ERM on the source shard to write_both_read_new. After that,
the test releases the suspended batch write and expects it to perform
two shard jumps: first from the destination to the source shard, and
then again back to the source shard.
This commit adds the alternator_executor_batch_write_wait injection to
alternator/executor.cc. Coroutines are intentionally avoided in the
parallel_for_each lambda to prevent unnecessary coroutine-frame
allocations.
This reverts commit ff1b212319. In this
commit, the python driver was updated to 3.29.6. That version has a
serious flaw - it rejects compression=None settings [1] which
cqlsh (legitimately) uses in copyutil.py.
The reason this hasn't caused numerous continuous integration failures
is that the submodule update commit did not update the frozen toolchain,
so the build was effectively running with an older version of the driver.
Fix by reverting the change. This allows us to regenerate the frozen
toolchain when we need to.
Reverted changes:
* tools/cqlsh 2240122...6badc99 (2):
> Update scylla-driver version to 3.29.6
> Revert "Migrate workflows to Blacksmith"
[1] 78f554236fClosesscylladb/scylladb#27473
This is an optimization follow-up [for this PR](https://github.com/scylladb/scylladb/pull/27396#issuecomment-3611410774): avoiding destruction of foreign objects on the wrong shard. Releasing objects allocated on a different shard causes their ::free calls to be executed remotely, which adds unnecessary load to the SMP subsystem.
Before this PR, a `std::vector<put_or_delete_item>` could be moved to another shard. When the vector was eventually destroyed, its ::free had to be marshalled back to the shard where the memory had originally been allocated. This change avoids that overhead by passing the vector by const reference instead.
backport: not needed, this is an optimization
Closesscylladb/scylladb#27432
* github.com:scylladb/scylladb:
alternator/executor.cc: avoid cross-shard free
storage_proxy: cas: take cas_request by raw reference
With python 3.14, the Process fails due to pickling issue with nodes objects.
This will eliminate this issue, so we can bump up the python version.
Closesscylladb/scylladb#27456
This commit is an optimization: avoiding destruction of
foreign objects on the wrong shard. Releasing objects allocated on a
different shard causes their ::free calls to be executed remotely,
which adds unnecessary load to the SMP subsystem.
Before this patch, a std::vector could be moved
to another shard. When the vector was eventually destroyed,
its ::free had to be marshalled back to the shard where the memory had
originally been allocated. This change avoids that overhead by passing
the vector by const reference instead.
The referenced objects lifetime correctness reasoning:
* the put_or_delete_item refs usages in put_or_delete_item_cas_request
are bound to its lifetime
* cas_request lifetime is bound to storage_proxy::cas future
* we don't release put_or_delete_item-s untill all storage_proxy::cas
calls are done.
In the next commit we want to add an optimization that relies on
precise control over the lifetime of cas_request. In particular, we
want the implementation of this interface in Alternator to operate on
raw references that are guaranteed to remain valid only until the
cas() future is resolved. We already depend on the same lifetime
assumptions in cas_request when used by modification_statement.
However, these assumptions are not clearly expressed in the current
interface: cas_request is taken by shared_ptr, and nothing prevents
cas() from storing that pointer inside paxos_response_handler, which
may outlive the cas() future.
This commit fixes that by taking cas_request by raw reference. This
makes it explicit that cas() does not assume ownership of the object.
Callers must ensure that the referenced object remains valid until
the returned future is resolved.
Large reserves in allocating_section can cause stalls. We already log
reserve increase, but we don't know which table it belongs to:
lsa - LSA allocation failure, increasing reserve in section 0x600009f94590 to 128 segments;
Allocating sections used for updating row cache on memtable flush are
notoriously problematic. Each table has its own row_cache, so its own
allocating_section(s). If we attached table name to those sections, we
could identify which table is causing problems. In some issues we
suspected system.raft, but we can't be sure.
This patch allows naming allocating_sections for the purpose of
identifying them in such log messages. I use abstract_formatter for
this purpose to avoid the cost of formatting strings on the hot path
(e.g. index_reader). And also to avoid duplicating strings which are
already stored elsewhere.
Fixes#25799Closesscylladb/scylladb#27470
Range tombstones are represented as entry attributes, which applies to
the interval between entries. So if a range tombstone covers many
rows, to apply it we have to update all covered entries. In some
workloads that could be many entries, even the whole cache. Before
the patch, we did this update without preemption, which can cause
reactor stalls in such workloads.
This scenario is already covered by mvcc_tests,
e.g. test_apply_to_incomplete_respects_continuity. And I verified that
the new preemption point is hit in the test.
perf-row-cache-update results show no significant stalls anymore (max
2ms scheduling delay, instead of previous 1.5 s):
Generated 1124195 rows
Memtable fill took 4179.457520 [ms], {count: 8295, 99%: 0.654949 [ms], max: 32.817176 [ms]}
Draining...
took 0.000616 [ms]
cache: 2506/2948 [MB], memtable: 781/1024 [MB], alloc/comp: 1051/662 [MB] (amp: 0.630)
update: 2874.157471 [ms], preemption: {count: 26650, 99%: 1.131752 [ms], max: 2.068762 [ms]}, cache: 3027/3973 [MB], alloc/comp: 3951/2424 [MB] (amp: 0.614), pr/me/dr 1124195/0/0
Fixes#23479Fixes#2578Closesscylladb/scylladb#27469
* github.com:scylladb/scylladb:
cache, mvcc: Preempt cache update when applying range tombstone from memtable
partition_snapshot_row_cursor: Clarify non-obvious semantic difference of range_tombstone()
perf-row-cache-update: Add scenario with large tombstone covering many rows
We saw that in large clusters direct failure detector may cause large task queues to be accumulated. The series address this issue and also moves the code into the correct scheduling group.
Fixes https://github.com/scylladb/scylladb/issues/27142
Backport to all version where 60f1053087 was backported to since it should improve performance in large clusters.
Closesscylladb/scylladb#27387
* github.com:scylladb/scylladb:
direct_failure_detector: run direct failure detector in the gossiper scheduling group
raft: drop invoke_on from the pinger verb handler
direct_failure_detector: pass timeout to direct_fd_ping verb
We switched to using v3 schema tables (in system_schema keyspace) in
2017, in 9eb91bc30b.
So no system should have the old schema any more.
No need to run legacy_schema_migrator on boot.
Closesscylladb/scylladb#27420
Range tombstones are represented as entry attributes, which applies to
the interval between entries. So if a range tombstone covers many
rows, to apply it we have to update all covered entries. In some
workloads that could be many entries, even the whole cache. Before
the patch, we did this update without preemption, which can cause
reactor stalls in such workloads.
This scenario is already covered by mvcc_tests,
e.g. test_apply_to_incomplete_respects_continuity. And I verified that
the new preemption point is hit in the test.
perf-row-cache-update results show no significant stalls anymore (max
2ms scheduling delay, instead of previous 1.5 s):
Generated 1124195 rows
Memtable fill took 4179.457520 [ms], {count: 8295, 99%: 0.654949 [ms], max: 32.817176 [ms]}
Draining...
took 0.000616 [ms]
cache: 2506/2948 [MB], memtable: 781/1024 [MB], alloc/comp: 1051/662 [MB] (amp: 0.630)
update: 2874.157471 [ms], preemption: {count: 26650, 99%: 1.131752 [ms], max: 2.068762 [ms]}, cache: 3027/3973 [MB], alloc/comp: 3951/2424 [MB] (amp: 0.614), pr/me/dr 1124195/0/0
Fixes#23479Fixes#2578
Fills memtable with rows and a tombstone which deletes all rows which
are already in cache.
Similar to raft log workload, but more extreme.
With -c1 -m4G, observed really bad performance:
update: 1711.976196 [ms], preemption: {count: 22603, 99%: 0.943127 [ms], max: 1494.571776 [ms]}, cache: 2148/2906 [MB], alloc/comp: 1334/869 [MB] (amp: 0.651), pr/me/dr 1062186/0/1062187
cache: 2148/2906 [MB], memtable: 738/1024 [MB], alloc/comp: 993/0 [MB] (amp: 0.000)
Which means that max reactor stall during cache update was 1.5 [s]
0.7 GB memtables. 2.1 GB in cache.
The DynamoDB API's "BatchWriteItem" operation is spelled like this, in
singular. Some comments incorrectly referred to as BatchWriteItems - in
plural. This patch fixes those mistakes.
There are no functional changes here or changes to user-facing documents -
these mistakes were only in code comments.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27446
The code creates a local variable, so it's better to wrap it in a local
scope, to the conditionally compiled variable doesn't pollute the
external scope.
After scylladb/scylladb#26897 was merged, the worker doesn't use the
view building state machine CV to manage lifetime of batches, so the
broadcast is not needed.
In case of general exception in `view_building_worker::create_staging_sstable_tasks()`,
catch it, print it with error level and sleep 1s before retrying.
This will allow for the registrator to retry its work in case of failure
and it should be easier to detect any bugs in the method.
Test that taking a snapshot with the use_sstable_identifier
option (and injecting `random_sstable_identifier`) produces
different file names in the snapshot than the original
sstable names and validate te manifest.json file respectively.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Validate the manifest.json format by loading it using rjson::parse
and then validate its contents to ensure it lists exactly the
SSTables present in the snapshot directory.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be used by a unit test in the following patch for testing
the snapshot use_sstable_identifier option.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be used by the snapshot code in te following patch
for manufacturing a basename using the sstable_id rather
than its generation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When set to true, use the sstable_identifier as the sstable name
in the snapshot rather than its generation.
sstable::snapshot now returns the generation it used
for the sstable in the snapshot, based on the `use_sstable_identifier`
option, to be used by the upper layer generating the manifest.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When direct failure detector was introduces the idea was that it will
run on the same connection raft group0 verbs are running, but in
60f1053087 raft verbs were moved to run on the gossiper connection
while DIRECT_FD_PING was left where it was. This patch move it to
gossiper connection as well and fix the pinger code to run in gossiper
scheduling group.
Currently raft direct pinger verb jumps to shard 0 to check if group0 is
alive before replying. The verb runs relatively often, so it is not very
efficient. The patch distributes group0 liveness information (as it
changes) to all shard instead, so that the handler itself does not need
to jump to shard 0.
To be used for naming sstables in the snapshot by their
sstable identifiers rather than their generation, to
facilitate global deduplication of sstables in backup.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently direct_fd_ping runs without timeout, but the verb is not
waited forever, the wait is canceled after a timeout, this timeout
simply is not passed to the rpc. It may create a situation where the
rpc callback can runs on a destination but it is no longer waited on.
Change the code to pass timeout to rpc as well and return earlier from
the rpc handler if the timeout is reached by the time the callback is
called. This is backwards compatible since timeout is passed as
optional.
Now that batchlog cleanup is cheap, on account of memtable flush on the
system.batchlog table garbage-collecting tombstones (previous patch), we
can afford to do cleanup on each replay, keeping the memtable size small
and more importantly -- the amount of tombstones in the memtable small.
SELECT * FROM MUTATION_FRAGMENTS() queries have a transformed schema
(mutation-fragment schema), which is a superset of that of the queried
table's. The mutation fragment schema represents position_in_partition
of mutation fragments expressed as clustering columns. This presents
some challenges, as some position_in_partition fields are null for some
positions. This was solved by setting these clustering keys components
to bytes{}. In the process, a mistake was made: when the clustering key
is missing in the position_in_partition, the position_weight is also set
to bytes{}. This is not correct, it is possible for some positions to
have no key but to still have a position_weight. An example is
position_in_partition::before_all_clustered_rows().
Fix this by always filling in the position_weight for positions which
have region() == clustered, instead of the earlier condition on the key
presence.
This is a minor bug affecting range tombstone changes at the two
extremes: position_in_partition::{before,after}_all_clustered_rows(). In
both cases, the position_weight can be deduced by a human looking at the
results, based on the position of the range tombstone change, relative
to other fragments.
Rename to make it more explicit where the error injection happens.
Also change how the error is injected, use the lambda overload instead
of is_enabled(), the former leaves better trace in logs, which helps
when debugging tests.
For tests that contain multiple assert_that() invokations, identifying
the one that failed is very challenging. Add source location to fail
messages to allow convenient identification of the call-site.
To enable assertions on columns which are sometimes null.
One existing user of with_typed_column() needs adjustment, because the
previous version of with_typed_column() covered up silently for null
value, but after this patch this caused a failure.
Although the value of this item is indeed derived from the write timeout
config, the name doesn't reflect what it is used for. Change it to
reflect it better.
New batchlogs are written to the batchlog_v2 table and replay also uses
the v2 table.
The content of system.batchlog is attempted to be migrated to
system.batchlog_v2 after each start of the batchlog_manager service.
The migration is retried on each replay if it fails. This is reduntant
but simple.
Batchlog cleanup now doesn't involve flushing memtables, the only
remaining user of replica/database.hh is gone, so the include is
dropped.
Rearranges the system.batchlog schema as follows:
CREATE TABLE system.batchlog_v2 (
version int,
stage int,
shard int,
written_at timestamp,
id uuid,
data blob,
PRIMARY KEY ((version, stage, shard), written_at, id));
With the following goals:
1) Make post-replay batchlog cleanup possible with a simple
range-tombstone. This allows dropping the individual dead batchlog
entries, as they are shadowed by a higher level tombstone. This
enables dropping tombstones without tombstone GC.
2) To make the above possible, introduce the stage key component:
batchlog entries that fail the first replay attempt, are moved to the
failed_replay stage, so the initial stage can be cleaned up safely.
3) Spread out the data among Scylla shards, via the batchlog shard
column.
4) Make batchlog entries ordered by the batchlog create time (id). This
allows for selecting batchlogs to replay, without post-filtering of
batchlogs that are too young to be replayed.
Don't build batchlog delete mutations in storage-proxy code. Move this
code into db/batchlog_manager.cc, exposed via db/batchlog.hh.
This serves multiple goals:
1) Concentrates low-level batchlog related logic in
db/batchlog_manager.cc
2) Reduce current and future code duplication.
3) Make future changes to this logic easier.
Don't build batchlog mutations in storage-proxy code. Move this code
into db/batchlog_manager.cc, exposed via db/batchlog.hh.
This serves multiple goals:
1) Concentrates low-level batchlog related logic in
db/batchlog_manager.cc
2) Reduce current and future code duplication.
2) Make future changes to this logic easier.
Just skip the mutation(s) whose tables were dropped instead.
Use the newly introduced data_dictionary::table::get_truncation_time()
to avoid looking up real table object.
The map_reduce achieves no concurrency, both map and reduce are
synchronous. It only achieves two redundant lookups for the table and
hard-to-read code. Convert it into a simple loop. Preserve the
stall-protection by adding a maybe_yield() to the loop.
When the initial version of rack-aware pairing was introduced, materialized
views with tablets were still experimental. Since then, we decided that
we'll only allow materialized views in clusters where the base table and
the view are replicated on the same racks, with one replica of each tablet
on each rack.
This allows us to remove almost all logic from our base-view pairing. The
only check for the paired view replica is now whether it's in the same
rack as the base replica sending the update.
In this patch we replace the simple and complex rack-aware pairing with
the simple check above.
Because of this, we have to remove a test case from network_topology_strategy_test
which was testing complex pairing. The tested topology is not supported
for views with tablets (or is unlikely to be supported, as it's a random test),
so there's no use keeping the test.
The test case for simple rack aware pairing was kept, but now we only test
the case where each rack has one replica, not multiple.
Additionally, we split finding of an unpaired replica to a separate function
and partially rewrite it without reusing the helper stuctures that were
present when calculating the simple and complex rack-aware pairing.
We only look for an unpaired replica if we couldn't find a paired replica
ourselves or if the number of view replicas didn't match the base replicas.
If an unpaired replica appears while these conditions pass, we won't send
an extra update, but that would be a new bug altogether, because we only
expect the unpaired replica to appear during RF changes, so when these
conditions aren't fulfilled.
Fixes https://github.com/scylladb/scylladb/issues/26313
Add cleanup flag value to start message and drop cpu, it is redundant as
Scylla already adds the shard number to the logs.
Add all_replayed to finish message.
Currently the method scans db::config to find one. It has some
drawbacks. First, it's not very nice. Second, it needs to handle the
case when the endpoint is missing, while it relally never is. Third, the
type in config entry is not necessarily set.
It's nicer to get the type from storage manager.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
To avoid repeatedly checking whether we're using tablets and having
to use unnecesarily flexible code fitting both cases, we split out
the base-view pairing code for the case of vnodes to another function.
The get_view_natural_endpoint will now have only common steps,
a call to that function, and steps specific to tablets.
We always use "legacy self pairing" when not using tablets, and
the "rack aware pairing" has been enabled in every version where
views with tablets isn't experimental. So in practice, instead
of checking these variables we can just look at whether the
table uses tablets.
At one point, the get_view_natural_endpoint was using IP for the
view update (and hint) destinations, but the hint code was using
host_id for the destinations. When a node left, we could no longer
have a mapping for a IP to host_id and when trying to store a hint
for this IP, we'd crash.
We worked around this issue by dropping the view update completely
if the target is in the "left" state.
Since then, we also moved to host_id's in the view update code, so
there's no longer any translation needed when storing the hints.
Additionally, we now drain hints not when entering the "left" state,
but when the node actually stops owning tokens.
Because of that, the workaround is not needed anymore, so we remove
it in this commit.
The existing test_mv_tablets_empty_ip case verifies that indeed, we
do not crash in the original problematic scenario.
Extract the SST filtering logic into a dedicated member function. This
prepares the code for independent testing without requiring the entire
streamer to be initialized.
2025-11-30 18:27:15 +02:00
684 changed files with 23536 additions and 6979 deletions
- Strive for simplicity and clarity, add complexity only when clearly justified
- Question requests: don't blindly implement requests - evaluate trade-offs, identify issues, and suggest better alternatives when appropriate
- Consider different approaches, weigh pros and cons, and recommend the best fit for the specific context
## Test Philosophy
- Performance matters. Tests should run as quickly as possible. Sleeps in the code are highly discouraged and should be avoided, to reduce run time and flakiness.
- Stability matters. Tests should be stable. New tests should be executed 100 times at least to ensure they pass 100 out of 100 times. (use --repeat 100 --max-failures 1 when running it)
- Unit tests should ideally test one thing and one thing only.
- Tests for bug fixes should run before the fix - and show the failure and after the fix - and show they now pass.
- Tests for bug fixes should have in their comments which bug fixes (GitHub or JIRA issue) they test.
- Tests in debug are always slower, so if needed, reduce number of iterations, rows, data used, cycles, etc. in debug mode.
- Tests should strive to be repeatable, and not use random input that will make their results unpredictable.
- Tests should consume as little resources as possible. Prefer running tests on a single node if it is sufficient, for example.
// Note: we don't care when the notification of other shards will finish, as long as it will be done
// it's possible to get into race condition (next DescribeTable comes to other shard, that new shard doesn't have
// the size yet, so it will calculate it again) - this is not a problem, because it will call cache_newly_calculated_size_on_all_shards
// with expiry, which is extremely unlikely to be exactly the same as the previous one, all shards will keep the size coming with expiry that is further into the future.
// In case of the same expiry, some shards will have different size, which means DescribeTable will return different values depending on the shard
// which is also fine, as the specification doesn't give precision guarantees of any kind.
co_returnapi_error::validation("GlobalSecondaryIndexes and LocalSecondaryIndexes with tablets require the rf_rack_valid_keyspaces option to be enabled.");
throwapi_error::validation(fmt::format("AttributeDefinitions redefined {} to {} already a key attribute of type {} in this table",def.name_as_text(),type,def_type));
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to incremental mode.",
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to 'disabled' mode.",
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.