This series optimizes role lookup by moving find_record into standard_role_manager and switching it to use the auth cache. This allows reverting can_login to its original simpler form, ensuring hot paths are properly cached while maintaining consistency via group0_guard.
Backport: no, it's not a bug fix.
Closesscylladb/scylladb#28329
* github.com:scylladb/scylladb:
auth: bring back previous version of standard_role_manager::can_login
auth: switch find_record to use cache
auth: make find_record and callers standard_role_manager members
The method was coroutinized by 6df07f7ff7. Back then thecoroutine::switch_to() wasn't available, and the code used with_scheduling_group() to call coroutinized lambdas. Those lambdas were implemented as on-stack variables to solve the capture list lifetime problems. As a result, the code looks like
```
auto flush = [] {
... // do the flushing
auto post_flush = [] {
... // do the post-flushing
}
co_return co_await with_scheduling_group(group_b, post_flush);
};
co_return co_await with_scheduling_group(group_a, flush);
```
which is a bit clumsy. Now we have switch_to() and can make the code flow of this method more readable, like this
```
co_await switch_to(group_a);
... // do the flushing
co_await switch_to(group_b);
... // do the post-flushing
```
Code cleanup, not backporting
Closesscylladb/scylladb#28430
* github.com:scylladb/scylladb:
table: Fix indentation after previous patch
table: Use coroutine::switch_to() in try_flush_memtable_to_sstable()
test_alternator_proxy_protocol starts a node and connects via the alternator ports.
Starting a node, by default, waits until the CQL ports are up. This does not guarantee
that the alternator ports are up (they will be up very soon after this), so there is a short
window where a connection to the alternator ports will fail.
Fix by adding a ServerUpState=SERVING mode, which waits for the node to report
to its supervisor (systemd, which we are pretending to be) that its ports are open.
The test is then adjusted to request this new ServerUpState.
Fixes#28210Fixes#28211
Flaky tests are only in master and branch-2026.1, so backporting there.
Closesscylladb/scylladb#28291
* github.com:scylladb/scylladb:
test: test_alternator_proxy_protocol: wait for the node to report itself as serving
test: cluster_manager: add ability to wait for supervisor STATUS=serving
It allows dropping the local lambdas passed into with_scheduling_group()
calls. Overall the code flow becomes more readable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This commit moves the "Ungrouped properties" category to the end of the
properties list. The properties are now published in the documentation,
and it doesn't look good if the list starts with ungrouped properties.
This patch was taken over from Anna Stuchlik <anna.stuchlik@scylladb.com>.
Closesscylladb/scylladb#28343
Move to `replica/`, drop `flat` from name and drop unused usages as well as unused includes.
Code cleanup, no backport
Closesscylladb/scylladb#28353
* github.com:scylladb/scylladb:
replica/partition_snapshot_reader: remove unused includes
partition_snapshot_reader: remove "flat" from name
mv partition_snapshot_reader.hh -> replica/
This test case was observed to take over 2 minutes to run on CI
machines, contributing to already bloated CI run times.
Disable this test in debug mode. This test checks for memtable flush
being slowed down when compaction can't keep up. So this test needs to
overwhelm the CPU by definition. On the other hand, this is not a
correctness test, there are such tests for the memtable and compaction
already, so it is not critical to run this in debug mode, it is not
expected to catch any use-after-free and such.
Closesscylladb/scylladb#28407
This commit replaces the previous approach of running pytest inside
GDB’s Python interpreter. Instead, tests are executed by driving a
persistent GDB process externally using pexpect.
- pexpect: Python library for controlling interactive programs
(used here to send commands to GDB and capture its output)
- persistent GDB: keep one GDB session alive across multiple tests
instead of starting a new process for each test
Tests can now be executed via `./test.py gdb` or with
`pytest test/scylla_gdb`. This improves performance and
makes failures easier to debug since pytest no longer runs
hidden inside GDB subprocesses.
Closesscylladb/scylladb#24804
When reads arrive, they have to wait for admission on the reader
concurrency semaphore. If the node is overloaded, the reads will
be queued. They can time out while in the queue, but will not time
out once admitted.
Once the shard is sufficiently loaded, it is possible that most
queued reads will time out, because the average time it takes to
for a queued read to be admitted is around that of the timeout.
If a read times out, any work we already did, or are about to do
on it is wasted effort. Therefore, the patch tries to prevent it
by checking if an admitted read has a chance to complete in time
and abort it if not. It uses the following criteria:
if read's remaining time <= read's timeout when arrived to the semaphore * live updateable preemptive_abort_factor;
the read is rejected and the next one from the wait list is considered.
Fixes https://github.com/scylladb/scylladb/issues/14909
Fixes: SCYLLADB-353
Backport is not needed. Better to first observe its impact.
Closesscylladb/scylladb#21649
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: Check during admission if read may timeout
permit_reader::impl: Replace break with return after evicting inactive permit on timeout
reader_concurrency_semaphore: Add preemptive_abort_factor to constructors
config: Add parameters to control reads' preemptive_abort_factor
permit_reader: Add a new state: preemptive_aborted
reader_concurrency_semaphore: validate waiters counter when dequeueing a waiting permit
reader_concurrency_semaphore: Remove cpu_concurrency's default value
Contains various improvements to tablet load balancer. Batched together to save on the bill for CI.
Most notably:
- Make plan summary more concise, and print info only about present elements.
- Print rack name in addition to DC name when making a per-rack plan
- Print "Not possible to achieve balance" only when this is the final plan with no active migrations
- Print per-node stats when "Not possible to achieve balance" is printed
- amortize metrics lookup cost
- avoid spamming logs with per-node "Node {} does not have complete tablet stats, ignoring"
Backport to 2026.1: since the changes enhance debuggability and are relatively low risk
Fixes#28423Fixes#28422Closesscylladb/scylladb#28337
* github.com:scylladb/scylladb:
tablets: tablet_allocator.cc: Convert tabs to spaces
tablets: load_balancer: Warn about incomplete stats once for all offending nodes
tablets: load_balancer: Improve node stats printout
tablets: load_balancer: Warn about imbalance only when there are no more active migrations
tablets: load_balancer: Extract print_node_stats()
tablet: load_balancer: Use empty() instead of size() where applicable
tablets: Fix redundancy in migration_plan::empty()
tablets: Cache pointer to stats during plan-making
tablets: load_balancer: Print rack in addition to DC when giving context
tablets: load_balancer: Make plan summary concise
tablets: load_balancer: Move "tablet_migration_bypass" injection point to make_plan()
In fact, it's partially there already. When view_builder::start() is called is first calls initialization code (the start_in_background() method), then kicks do_build_step() that runs a background fiber to perform build steps. The starting code inherits scheduling group from main(). And the step fiber code needs to run itself in a maintenance scheduling group, so it explicitly grabs one via database->db_config.
This PR mainly gets rid of the call to database::get_streaming_scheduling_group() from do_build_step() as preparation to splitting the streaming scheduling group into parts (see SCYLLADB-351). To make it happen the do_build_step() is patched to inherit its scheduling group from view_builder::start() and the start() itself is called by main from maintenance scheduling group (like for other view building services).
New feature (nested scheduling group), not backporting
Closesscylladb/scylladb#28386
* github.com:scylladb/scylladb:
view_builder: Start background in maintenance group
view_builder: Wake-up step fiber with condition variable
In a lambda returned from make_streaming_consumer() there's a check for
current scheudling group being streaming one. It came from #17090 where
streaming code was launched in wrong sched group thus affecting user
groups in a bad way.
The check is nice and useful, but it abuses replica::database by getting
unrelated information from it.
To preserve the check and to stop using database as provider of configs,
keep the streaming scheduling group handle in the debug namespace. This
emphasises that this global variable is purely for debugging purposes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28410
Explain what automatic repair is and how to configure it. While at it, improve the existing repair documentation a bit.
Fixes: SCYLLADB-130
This PR missed the 2026.1 branch date, so it needs backport to 2026.1, where the auto repair feature debuts.
Closesscylladb/scylladb#28199
* github.com:scylladb/scylladb:
docs: add feature page for automatic repair
docs: inter-link incremental-repair and repair documents
docs: incremental-repair: fix curl example
One of the best features of the pytest framework is "assertion
rewriting": If your test does for example "assert a + 1 == b", the
assertion is "rewritten" so that if it fails it tells you not only
that "a+1" and "b" are not equal, what the non-equal values are,
how they are not equal (e.g., find different elements of arrays) and
how each side of the equality was calculated.
But pytest can only "rewrite" assertion that it sees. If you call a
utility function checksomething() from another module and that utility
function calls assert, it will not be able to rewrite it, and you'll
get ugly, hard-to-debug, assertion failures.
This problem is especially noticable in tests we translated from
Cassandra, in test/cqlpy/cassandra_tests. Those tests use a bunch of
assertion-performing utility functions like assertRows() et al.
Those utility functions are defined in a separate source file,
porting.py, so by default do not get their assertions rewritten.
We had a solution for this: test/cqlpy/cassandra_test/__init__.py had:
pytest.register_assert_rewrite("cassandra_tests.porting")
This tells pytest to rewrite assertions in porting.py the first time
that it is imported.
It used to work well, but recently it stopped working. This is because
we change the module paths recently, and it should be written as
test.cqlpy.cassandra_tests.porting.
I verified by editing one of the cassandra_tests to make a bad check
that indeed this statement stopped working, and fixing the module
path in this way solves it, and makes assertion rewriting work
again.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28411
Currently view_builder::start() is called in default scheduling group.
Once it initializes itself, it wakes up the step fiber that explicitly
switches to maintenance scheduling group.
This explicit switch made sence before previous patch, when the fiber
was implemented as a serialized action. Now the fiber starts directly
from .start() method and can inherit scheduling group from it.
Said that, main code calls view_builder::start() in maintenance
scheduling group killing two birds with one stone. First, the step fiber
no longer needs borrow its scheduling group indirectly via database.
Second, the start_in_background() code itself runs in a more suitable
scheduling group.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
View builder runs a background fiber that perform build steps. To kick
the fiber it uses serizlized action, but it's an overkill -- nobody
waits for the action to finish, but on stop, when it's joined.
This patch uses condition variable to kick the fiber, and starts it
instantly, in the place where serialized action was first kicked.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Explain what the feature is and how to confiture it.
Inter-link all the repair related pages, so one can discover all about
repair, regardless of which page they land on.
When a shard on a replica is overloaded, it breaks down completely,
throughput collapses, latencies go through the roof and the
node/shard can even become completely unresponsive to new connection
attempts.
When reads arrive, they have to wait for admission on the reader
concurrency semaphore. If the node is overloaded, the reads will
be queued and thus they can time out while being in the queue or during
the execution. In the latter case, the timeout does not always
result in the read being aborted.
Once the shard is sufficiently loaded, it is possible that most
queued reads will time out, because the average time it takes
for a queued read to be admitted is around that of the timeout.
If a read times out, any work we already did, or are about to do
on it is wasted effort. Therefore, the patch tries to prevent it
by checking if an admitted read has a chance to complete in time
and abort it if not. It uses the following cryteria:
if read's remaining time <= read's timeout when arrived to the semaphore * preemptive factor;
the read is rejected and the next one from the wait list is
considered.
Evicting an inactive permit destroyes the permit object when the
reader is closed, making any further member access invalid. Switch
from break to an early return to prevent any possible use-after-free
after evict() in the state::inactive timeout path.
The new parameter parametrizes the factor used to reject a read
during admission. Its value shall be between 0.0 and 1.0 where
+ 0.0 means a read will never get rejected during admission
+ 1.0 means a read will immediatelly get rejected during admission
Although passing values outside the interaval is possible, they
will have the exact same effects as they were clamped to [0.0, 1.0].
A permit gets into the preemptive_aborted state when:
- times out;
- gets rejected from execution due to high chance its execution would
not finalize on time;
Being in this state means a permit was removed from the wait list,
its internal timer was canceled and semaphore's statistic
`total_reads_shed_due_to_overload` increased.
Fix a subtle but damaging failure mode in the tablet migration state machine: when a barrier fails, the follow-up barrier is triggered asynchronously, and cleanup can get skipped for that iteration. On the next loop, the original failure may no longer be visible (because the failing node got excluded), so the tablet can incorrectly move forward instead of entering `cleanup_target`.
To make cleanup reliable this PR:
Adds an additional “fallback cleanup” stage
- `write_both_read_old_fallback_cleanup`
that does not modify read/write selectors. This stage is safe to enter immediately after a barrier failure, and it funnels the tablet into cleanup with the required barriers.
Avoids changing both read and write selectors in a single step transitioning from `write_both_read_new` to `cleanup_target`. The fallback path updates selectors in a safe order: read first, then write.
Allows a direct no-barrier transition from `allow_write_both_read_old` to `cleanup_target` after failure, because in that specific case `cleanup_target` doesn’t change selectors and the hop is safe.
No need for backport. It's an improvement. Currently, tablets transition to `cleanup_target` eventually via failed streaming.
Closesscylladb/scylladb#28169
* github.com:scylladb/scylladb:
topology_coordinator: add write_both_read_old_fallback_cleanup state
topology_coordinator: allow cleanup_target transition from streaming/rebuild_repair without barrier
topology_coordinator: allow cleanup_target transition without barrier after failure in write_both_read_old
topology_coordinator: allow cleanup_target transition without barrier after failure in allow_write_both_read_old
There are few places that use raft_group0_client as a way to get to system_keyspace. Mostly they can live without it -- either the needed reference is already at hand, or it's (ab)used to get to the database reference. The only place that really needs the system keyspace is the state merger code that needs last state ID. For that, the explicit helper method is added to group0_client.
Refining API between components, not backporting
Closesscylladb/scylladb#28387
* github.com:scylladb/scylladb:
raft_group0_client: Dont export system keyspace
raft_group0_client: Add and use get_last_group0_state_id()
group0_state_machine: Call ensure_group0_sched() with data_dictionary
view_building_worker: Use its own system_keyspace& reference
During test.py run, noticed this warning:
```
10:38:22 test/cqlpy/cassandra_tests/validation/operations/insert_update_if_condition_test.py:14: 32 warnings
10:38:22 /jenkins/workspace/releng-testing/scylla-ci/scylla/test/cqlpy/cassandra_tests/validation/operations/insert_update_if_condition_test.py:14: PytestAssertRewriteWarning: Module already imported so cannot be rewritten: test.cqlpy.cassandra_tests.porting
10:38:22 pytest.register_assert_rewrite('test.cqlpy.cassandra_tests.porting')
```
The insert_update_if_condition_test.py was calling
pytest.register_assert_rewrite() for the porting module, but this
registration is already handled by cassandra_tests/__init__.py which
is automatically loaded before any test runs.
Closesscylladb/scylladb#28409
The include-what-you-use workflow fails with
```
Invalid workflow file: .github/workflows/iwyu.yaml#L25
The workflow is not valid. .github/workflows/iwyu.yaml (Line: 25, Col: 3): Error calling workflow 'scylladb/scylladb/.github/workflows/read-toolchain.yaml@257054deffbef0bde95f0428dc01ad10d7b30093'. The nested job 'read-toolchain' is requesting 'contents: read', but is only allowed 'contents: none'.
```
Fix by adding the correct permissions.
Closesscylladb/scylladb#28390
Enhance the skip_mode marker to accept either a single mode string
or a list of modes, allowing tests to be skipped across multiple
build configurations with a single marker.
Before:
@pytest.mark.skip_mode("dev", reason="...")
@pytest.mark.skip_mode("debug", reason="...")
After:
@pytest.mark.skip_mode(["dev", "debug"], reason="...")
This reduces duplication when the same skip condition applies
to multiple build modes.
Closesscylladb/scylladb#28406
The test is currently flaky. It tries to get the host ID of the bootstrapping
node via the REST API after the node crashes. This can obviously fail. The
test usually doesn't fail, though, as it relies on the host ID being saved
in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()`
repeatedly called in `ScyllaServer.start()`. However, with a very fast crash
and unlucky timings, no such call may succeed.
We deflake the test by getting the host ID before the crash. Note that at this
point, the bootstrapping node must be serving the REST API requests because
`await log.wait_for("finished do_send_ack2_msg")` above guarantees that the
node has started the gossip shadow round, which happens after starting the REST
API.
Fixes#28385Closesscylladb/scylladb#28388
Currently, tablet_allocator switches to streaming scheduling group that
it gets from database. It's not nice to use database as provider of
configs/scheduling_groups.
This patch adds a background scheduling group for tablet allocator
configured via its config and sets it to streaming group in main.cc
code.
This will help splitting the streaming scheduling group into more
elaborated groups under the maintenance supergroup: SCYLLADB-351
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28356
This PR reduces the runtime of `test_out_of_space_prevention.py` by addressing two main sources of overhead: slow “critical utilization” setup and delayed tablet load stats propagation. Combined, these changes cut the module’s total execution time from 324s to 185s.
Improvements. No backup is required.
Closesscylladb/scylladb#28396
* github.com:scylladb/scylladb:
test/storage: speed up out-of-space prevention tests by using smaller volumes
test/storage: reduce tablet load stats refresh interval to speed up OOS prevention tests
As a next step of migration to the pytest runner, this PR moves
responsibility for nodetool tests execution solely to the pytest.
Closesscylladb/scylladb#28348
Otherwise, it may be only a temporary situation due to lack of
candidates, and may be unnecessarily alerting.
Also, print node stats to allow assessing how bad the situation is on
the spot. Those stats can hint to a cause of imbalance, if balancing
is per-DC and racks have different capacity.
Load-balancing can be now per-rack instead of per-DC. So just printing
"in DC" is confusing. If we're balancing a rack, we should print which
rack is that.
Before:
load_balancer - Prepared 1 migration plans, out of which there were 1 tablet migration(s) and 0 resize decision(s) and 0 tablet repair(s) and 0 rack-list colocation(s)
After:
load_balancer - Prepared plan: migrations: 1
We print only stats about elements which are present.
When building with `--disable-precompiled-header`, view.cc failed to
compile due to missing <seastar/coroutine/all.hh> include, which provides
`coroutine::all`.
The problem doesn't manifest when precompiled headers are used, which is
the default. So that's likely why it was missed by the CI.
Adding the explicit include fixes the build.
Fixes: scylladb/scylladb#28378
Ref: scylladb/scylladb#28093
No backport: This problem is only present in master.
Closesscylladb/scylladb#28379
In production environments, we observed cases where the S3 client would repeatedly fail to connect due to DNS entries becoming stale. Because the existing logic only attempted the first resolved address and lacked a way to refresh DNS state, the client could get stuck in a failure loop.
Introduce RR TTL and connection failure retry to
- re-resolve the RR in a timely manner
- forcefully reset and re-resolve addresses
- add a special case when the TTL is 0 and the record must be resolved for every request
Fixes: CUSTOMER-96
Fixes: CUSTOMER-139
Should be backported to 2025.3/4 and 2026.1 since we already encountered it in the production clusters for 2025.3
Closesscylladb/scylladb#27891
* github.com:scylladb/scylladb:
connection_factory: includes cleanup
dns_connection_factory: refine the move constructor
connection_factory: retry on failure
connection_factory: introduce TTL timer
connection_factory: get rid of shared_future in dns_connection_factory
connection_factory: extract connection logic into a member
connection_factory: remove unnecessary `else`
connection_factory: use all resolved DNS addresses
s3_test: remove client double-close
Use the new ServerUpState=SERVING mechanism to wait to the alternator
ports to be up, rather than relying on the default waiting for CQL,
which happens earlier and therefore opens a window where a connection to
the alternator ports will fail.
When running under systemd, ScyllaDB sends a STATUS=serving message
to systemd. Co-opt this mechanism by setting up NOTIFY_SOCKET, thus
making the cluster manager pretend it is systemd. Users of the cluster
manager can now wait for the node to report itself up, rather than
having to parse log files or retry connections.
This parameter was not mentioned at all anywhere in the documentation.
Add an explanation of this parameter: why we need it, what is the
default and how it can be changed.
Closesscylladb/scylladb#28132
The commit 59faa6d, introduces a new parameter called cpu_concurrency
and sets its default value to 1 which violates the commit fbb83dd that
removes all default values from constructors but one used by the unit
tests.
The patch removes the default value of the cpu_concurrency parameter
and alters tests to use the test dedicated reader_concurrency_semaphore
constructor wherever possible.
Tests in test_out_of_space_prevention.py spend a large fraction of
time creating a random “blob” file to cross the 0.8 critical disk
utilization threshold. With 100MB volumes this requires writing
~70–80MB of data, which is slow inside Docker/Podman-backed volumes.
Most tests only use ~11MB of data, so large volumes are unnecessary.
Reduce the test volume size to 20MB so the critical threshold is
reached at ~16MB and the blob file is much smaller.
This cuts ~5–6s per test.
Set `--tablet-load-stats-refresh-interval-in-seconds=1` for this module’s
clusters applicable to all tests. This significantly reduces runtime
for the slowest cases:
- test_reject_split_compaction: 75.62s -> 23.04s
- test_split_compaction_not_triggered: 69.36s -> 22.98s
Now system_keyspace reference is used internally by the client code
itself, no need to encourage other services abuse it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several places that want to get last state id and for that
they make raft_group0_client() export system_keyspace reference.
This patch adds a helper method to provide the needed ID.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a validation for tables being used by group0 commands are marked
with the respective prop. For it the caller code needs to provide
database reference and it gets one from client -> system_keyspace chain.
There's more explicit way -- get the data_dictionary via proxy.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some code in the worker need to mess with system_keyspace&. While
there's a reference on it from the worker object, it gets one via
group0 -> group0_client, which is a bit an overkill.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch set eliminates special audit info guard used before for batch statements
and simplifies audit::inspect function by returning quickly if audit is not needed.
It saves around 300 instructions on a request's hot path.
Related: https://github.com/scylladb/scylladb/issues/27941
Backport: no, not a bug
Closesscylladb/scylladb#28326
* github.com:scylladb/scylladb:
audit: replace batch dynamic_cast with static_cast
audit: eliminate dynamic_cast to batch_statement in inspect
audit: cql: remove create_no_audit_info
audit: add batch bool to audit_info class
Currently it grabs one from database, but it's not nice to use database
as config/sched-groups provider.
This PR passes the scheduling group to use for sending hints via manager
which, in turn, gets one from proxy via its config (proxy config already
carries configuration for hints manager). The group is initialized in
main.cc code and is set to the maintenance one (nowadays it's the same
as streaming group).
This will help splitting the streaming scheduling group into more
elaborated groups under the maintenance supergroup: SCYLLADB-351
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28358
consistent_cluster_management is deprecated since scylla-5.2 and no
longer used by Scylladb, so it should not be used by test either.
Closesscylladb/scylladb#28340
These two streams mostly play together. The former provides an input_stream from read from in-memory temporary buffers, the latter wraps it to limit the size of provided temporary buffers. Both are used to test contiguous data consumer, also the buffer_input_stream has a caller in sstables reversing reader.
This PR removes the buffer_input_stream in favor of seastar memory_data_source, and moves the limiting_input_stream into test/lib.
Enanching testing code, not backporting
Closesscylladb/scylladb#28352
* github.com:scylladb/scylladb:
code: Move limiting data source to test/lib
util: Simplify limiting_data_source API
util: Remove buffer_input_stream
test: Use seastar::util::temporary_buffer_data_source in data consumer test
sstables: Use seastar::util::as_input_stream() in mx reader
This compaction group testing is useless because the machinery for it
to work was removed. This was useful in the early tablet days, where
we wanted to test compaction groups directly. Today groups are stressed
and tested on every tablet test.
I see a ~40% reduction time after this patch, since database_test is
one of the most (if not the most) time consuming in boost suite.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#28324
db: view: refactor semaphore usage in create/drop view paths
Refactor the construction and usage of semaphore units in the create and drop view flows.
The previous semaphore handling was hard to follow (as noted while working on https://github.com/scylladb/scylladb/pull/27929), so this change restructures unit creation and movement to follow a clearer and symmetric pattern across shards.
The semaphore usage model is now documented with a detailed in-code comment to make the intended behavior and invariants explicit.
As part of the refactor, the control flow is modernized by replacing continuation-based logic with coroutine-style code, improving readability and maintainability.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-250
backport: not required, this is a refactor
Closesscylladb/scylladb#28093
* github.com:scylladb/scylladb:
db: view: extend try/catch scope in handle_create_view_local The try/catch region is extended to cover step functions and inner helpers, which may throw or abort during view creation. This change is safe because we are just swolowing more parts that may throw due to semaphore abortion or any other abortion request, and doesnt change the logic
db: view: refine create/drop coroutine signatures Refactor the create/drop coroutine interfaces to accept parameters as const references, enabling a clearer workflow and safer data flow.
db: view: switch from continuations to coroutines Refactor the flow and style of create and drop view to use coroutines instead of continuations. This simplifies the logic, improves readability, and makes the code easier to maintain and extend. This commit also utilizes the get_view_builder_units function that was added in the previous commit. this commit also introduces a new alisasing for optional unit type for simpler and more readable functions that use this type
db: view: introduce helper to acquire or reuse semaphore units Introduce a small helper that acquires semaphore units when needed or reuses units provided by the caller. This centralizes semaphore handling, simplifies the current logic, and enables refactoring the view create/drop path to a coroutine-based implementation instead of continuation-style code.
db: view: add detailed comments on semaphore bookkeeping and serialized create/drop on shard 0
Previously, we wanted to make minimal changes with regards to the new
unified auth cache. However, as a result, some calls on the hot path
were missed. Now we have switched the underlying find_record call
to use the cache. Since caching is now at a lower level, we bring
back the original code.
Since every write-type auth statement takes group0_guard at the beginning,
we hold read_apply_mutex and cannot have a running raft apply during our
operation. Therefore, the auth cache and internal CQL reads return the same,
consistent results. This makes it safe to read via cache instead of internal
CQL.
LDAP is an exception, but it is eventually consistent anyway.
The partition snapshot lives in mutation/, however mutation/ is a lower
level concept than a mutation reader. The next best place for this
reader is the replica/ directory, where the memtable, its main user,
also lives.
Also move the code to the replica namespace.
test/boost/mvcc_test.cc includes this header but doesn't use anything
from it. Instead of updating the include path, just drop the unused
include.
Filter the content of sstable(s), including or excluding the specified partitions. Partitions can be provided on the command line via `--partition`, or in a file via `--partitions-file`. Produces one output sstable per input sstable -- if the filter selects at least one partition in the respective input sstable. Output sstables are placed in the path provided via `--oputput-dir`. Use `--merge` to filter all input sstables combined, producing one output sstable.
Fixes: #13076
New functionality, no backport.
Closesscylladb/scylladb#27836
* github.com:scylladb/scylladb:
tools/scylla-sstable: introduce filter command
tools/scylla-sstable: remove --unsafe-accept-nonempty-output-dir
tools/scylla-sstable: make partition_set ordered
tools/scylla-stable: remove unused boost/algorithm/string.hpp include
Generally square brackets are non allowed in URI, while pytest uses it
the test name to show that there were additional parameters for the same
test. When such a test fail it shows the directory correctly in Jenkins,
however attempt to download only this will fail, because of the square
brackets in URI. This change substitute the square brackets with round
brackets.
Closesscylladb/scylladb#28226
Yet another barrier-failure scenario exists in the `write_both_read_new`
state. When the barrier fails, the tablet is expected to transition
to `cleanup_target`, but because barrier execution is asynchronous,
the cleanup transition can be skipped entirely and the tablet may
continue forward instead.
Both `write_both_read_new` and `cleanup_target` modify read and write
selectors. In this situation, a barrier is required, and transitioning
directly between these states without one is unsafe.
Introduce an intermediate `write_both_read_old_fallback_cleanup`
state that modifies only a read selector and can be entered without
a barrier (there is no need to wait for all nodes to start using the
"new" read selector). From there, the tablet can proceed to `cleanup_target`,
where the required barriers are enforced.
This also avoids changing both selectors in a single step. A direct
transition from `write_both_read_new` to `cleanup_target` updates
both selectors at once, which can leave coordinators using the old
selector for writes and the new selector for reads, causing reads to
miss preceding writes.
By routing through the fallback state, selectors are updated in
order—read first, then write—preserving read-after-write correctness.
In both `streaming` and `rebuild_repair` stages, the read/write
selectors are unchanged compared to the preceding stage. Because
entry into these stages is already fenced by a barrier from
`write_both_read_old`, and the `cleanup_target` itself requires
barrier, rolling back directly to `cleanup_target` is safe without
an additional barrier.
A similar barrier-failure scenario exists in the `write_both_read_old`
state. If the barrier fails, the tablet is expected to transition to
`cleanup_target`, but due to the barrier being evaluated asynchronously
the cleanup path can be skipped and the tablet may continue forward
instead.
In `write_both_read_old`, we already switched group0 writes from old
to both, while the barrier may not have executed yet. As a result,
nodes can be at most one step apart (some still use old, others use
both).
Transitioning to `cleanup_target` reverts the write selector back to
old. Nodes still differ by at most one step (old vs both), so the
transition is safe without an additional barrier.
This prevents cleanup from being skipped while keeping selector semantics
and barrier guarantees intact.
When a tablet is in `allow_write_both_read_old`, progressing normally
requires a barrier. If this first barrier fails, the tablet is supposed
to transition to `cleanup_target` on the next iteration:
```
case locator::tablet_transition_stage::allow_write_both_read_old:
if (action_failed(tablet_state.barriers[trinfo.stage])) {
if (check_excluded_replicas()) {
transition_to_with_barrier(locator::tablet_transition_stage::cleanup_target);
break;
}
}
if (do_barrier()) {
...
}
break;
```
That transition itself requires a barrier, which is executed asynchronously.
Because the barrier runs in the background, the cleanup logic is skipped in
that iteration.
On the following iteration, `action_failed(barriers[stage])` no longer
returns true, since the node that caused the original barrier failure
has been excluded. The barrier is therefore observed as successful,
and the tablet incorrectly proceeds to the next stage instead of entering
`cleanup_target`.
Since `cleanup_target` does not modify read/write selectors, the transition
can be done safely without a barrier, simplifying the state machine and
ensuring cleanup is not skipped.
Without it, the tablet would still eventually reach `cleanup_target` via
`write_both_read_old` and `streaming`, but that path is unnecessary.
The try/catch region is extended to cover step functions and inner helpers,
which may throw or abort during view creation.
This change is safe because we are just swolowing more parts that may throw due to semaphore abortion
or any other abortion request, and doesnt change the logic
Refactor the flow and style of create and drop view to use coroutines instead of continuations.
This simplifies the logic, improves readability, and makes the code
easier to maintain and extend. This commit also utilizes the get_view_builder_units function that was added in the previous commit.
this commit also introduces a new alisasing for optional unit type for simpler and more readable functions that use this type
Introduce a small helper that acquires semaphore units when needed or
reuses units provided by the caller.
This centralizes semaphore handling, simplifies the current logic, and
enables refactoring the view create/drop path to a coroutine-based
implementation instead of continuation-style code.
Commit 0156e97560 ("storage_proxy: cas: reject for
tablets-enabled tables") marked a bunch of LWT tests as
XFAIL with tablets enabled, pending resolution of #18066.
But since that event is now in the past, we undo the XFAIL
markings (or in some cases, use an any-keyspace fixture
instead of a vnodes-only fixture).
Ref #18066.
Closesscylladb/scylladb#28336
Only two tests use it now -- the limit-data-source-test iself and a test
that validates continuous_data_consumer template.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The source maintains "limit generator" -- a function that returns the
maximum size of bytes to return from the next buffer.
Currently all callers just return constant numbers from it. Passing a
function that returns non-constant one can, probably, be used for a
fuzzy test, but even the limiting-data-source-test itself doesn't do it,
so what's the point...
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test creates buffer_data_source_impl and wraps it with limiting data
source. The former data_source duplicates the functionality of the
existing seastar temporary_buffer_data_source.
This patch makes the test code use seastar facility. The
buffer_data_source_impl will be removed soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now the code uses make_buffer_input_stream() helper that creates
an input stream with buffer_data_source_impl inside which, in turn,
provides the data_source_impl API over a single temporary_buffer.
Seastar has the very same facility, it's better to use it. Eventually
the buffer_data_source_impl will be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We don't need a special guard value, it's
only being filled for batch statements for
which we can simply ignore the value.
Not having special value allows us to return
fast when audit is not enabled.
The user can now discover the general explanatio of repair when reading
about incremental repair, useful if they don't know what repair is.
The user can now discover incremental repair while reading the generic
repair procedure document.
This PR refactors the streaming subsystem to support direct download of fully contained sstables. Instead of streaming these files, they are downloaded and attached directly to their corresponding tables. This approach reduces overhead, simplifies logic, and improves efficiency. Expected node scope restore performance improvement: ~4 times faster in best case scenario when all sstables are fully contained.
1. Add storage options field to sstable Introduce a data member to store storage options, enabling distinction between local and object storage types.
2. Add method to create component source Extend the storage interface with a public method to create a data_source for any sstable component.
3. Inline streamer instance creation Remove make_sstable_streamer and inline its usage to allow different sets of arguments at call sites.
4. Skip streaming empty sstable sets Avoid unnecessary streaming calls when the sstable set is empty.
5. Enable direct download of contained sstables Replace streaming of fully contained sstables with direct download, attaching them to their corresponding table.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-200
Refs: https://github.com/scylladb/scylladb/issues/23908
No need to backport as this code targets 2026.2 release (for tablet-aware restore)
Closesscylladb/scylladb#26834
* github.com:scylladb/scylladb:
tests: reuse test_backup_broken_streaming
streaming: enable direct download of contained sstables
storage: add method to create component source
streaming: keep sharded database reference on tablet_sstable_streamer
streaming: skip streaming empty sstable sets
streaming: inline streamer instance creation
tests: fix incorrect backup/restore test flow
Previously we only inspected std::system_error inside
std::nested_exception to support a specific TLS-related failure
mode. However, nested exceptions may contain any type, including
other restartable (retryable) errors. This change unwraps one
nested exception per iteration and re-applies all known handlers
until a match is found or the chain is exhausted.
Closesscylladb/scylladb#28240
In this PR, we fix two bugs present in `boost_test_tree_lister` that
affected the output of `--list_json_content` added in
scylladb/scylladb@afde5f668a:
* The labels test units use were duplicated in the output.
* If a test suite or a test file didn't contain any tests, it wasn't
listed in the output.
Refs scylladb/scylladb#25415
Backport: not needed. The code hasn't been used anywhere yet.
Closesscylladb/scylladb#28255
* github.com:scylladb/scylladb:
test/lib/boost_test_tree_lister.cc: Record empty test suites
test/lib/boost_test_tree_lister.cc: Deduplicate labels
Finishing the deprecation of the skip_mode function in favor of
pytest.mark.skip_mode. This PR is only cleaning and migrating leftover tests
that are still used and old way of skip_mode.
Closesscylladb/scylladb#28299
Move state management from dns_connection_factory into state class
itself to encapsulate its internal state and stop managing it from the
`dns_connection_factory`
Instead of streaming fully contained sstables, download them directly
and attach them to their corresponding table. This simplifies the
process and avoids unnecessary streaming overhead.
The class in question has internal implementations of in-memory data_sink_impl and data_source_impl. In seastar there's generic implementation of the same facilities. From the "code re-use" perspective it makes sense to use both. TODO-s in Scylla code supports that.
Using newer seastar facilities, not backporting.
Closesscylladb/scylladb#28321
* github.com:scylladb/scylladb:
sstable: Replace buffer_data_sink_impl with seastar::util::basic_memory_data_sink
sstables: Use seastar::util::as_input_stream() and remove buffer_data_source_impl
There are two checks for live endpoints performed in test_gossiper.py,
but one of those sits in test_gossiper_unreachable_endpoints somehow.
This patch moves live endpoints check into live endpoints test.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28224
This series implements rescoring algorithm.
Index options allowing to enable this functionality were introduced in earlier PR https://github.com/scylladb/scylladb/pull/28165.
When Vector Index has enabled quantization, Vector Store uses reduced vector representation to save memory, but it may degrade correctness of ANN queries. For quantized index we can enable rescoring algorithm, which recalculates similarity score from full vector representation stored in Scylla and reorder returned result set.
It works also with oversampling - we fetch more candidates from Vector Store, rescore them at Scylla and return only requested number of results.
Example:
Creating a Vector Index with Rescoring
```sql
-- Create a table with a vector column
CREATE TABLE ks.products (
id int PRIMARY KEY,
embedding vector<float, 128>
);
-- Create a vector index with rescoring enabled
CREATE INDEX products_embedding_idx ON ks.products (embedding)
USING 'vector_index'
WITH OPTIONS = {
'similarity_function': 'cosine',
'quantization': 'i8',
'oversampling': '2.0',
'rescoring': 'true'
};
```
1. **Quantization** (`i8`) compresses vectors in the index, reducing memory usage but introducing precision loss in distance calculations
2. **Oversampling** (`2.0`) retrieves 2× more candidates than requested from the vector store (e.g., `LIMIT 10` fetches 20 candidates)
3. **Rescoring** (`true`) recalculates similarity scores using full-precision (`f32`) vectors from the base table and re-ranks results
Query example:
```sql
-- Find 10 most similar products
SELECT id, similarity_cosine(embedding, [0.1, 0.2, ...]) AS score
FROM ks.products
ORDER BY embedding ANN OF [0.1, 0.2, ...]
LIMIT 10;
```
With rescoring enabled, the query:
1. Fetches 20 candidates from the quantized index (due to oversampling=2.0)
2. Reads full-precision embeddings from the base table
3. Recalculates similarity scores with full precision
4. Re-ranks and returns the top 10 results
In this implementation we use CQL similarity function implementation to calculate new score values and use them in post query ordering. We add that column manually to selection, but it has to be removed from the final response.
Follow-up https://github.com/scylladb/scylladb/pull/28165
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-83
New feature - doesn't need backport.
Closesscylladb/scylladb#27769
* github.com:scylladb/scylladb:
vector_index: rescoring: Fetch oversampled rows
vector_index: rescoring: Sort by similarity column
select_statement: Modify `needs_post_query_ordering` condition
vector_index: rescoring: Add hidden similarity score column
vector_index: Refactor extracting ANN query information
The storage_proxy::stop() is not called by main (it is commented out due to #293), so the corresponding message injection is never hit. When the test releases paxos_state_learn_after_mutate, shutdown may already be in progress or even completed by the time we try to trigger the storage_proxy::stop injection, which makes the test flaky.
Fix this by completely removing the storage_proxy::stop injection. The injection is not required for test correctness. Shutdown must wait for the background LWT learn to finish, which is released via the paxos_state_learn_after_mutate injection. The shutdown process blocks on in-flight HTTP requests through seastar::httpd::http_server::stop and its _task_gate, so the HTTP request that releases paxos_state_learn_after_mutate is guaranteed to complete before the node is shut down.
Fixesscylladb/scylladb#28260
backport: 2025.4, the `test_lwt_shutdown` test was introduced in this version
Closesscylladb/scylladb#28315
* https://github.com/scylladb/scylladb:
storage_proxy: drop stop() method
test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection
It was observed twice that the test times out in debug mode.
Fix by increasing the timeout.
The test never expects a timeout, so increasing it won't increase
the test duration.
Fixes#28028Closesscylladb/scylladb#28272
- Pass pytest request fixture into coro_task (used for scylla_tmp_dir
and core dump path)
- Rename duplicate `test_sstable_summary` that runs sstable-index-cache
to `test_sstable_index_cache` so both tests are collected
Refs https://github.com/scylladb/scylladb/issues/22501Closesscylladb/scylladb#28286
The former accumulates sstable writer writes into a vector of temporary
buffers. In seastar there's a generic memory data sink that provides a
sink to accumulate stream of bytes into any container.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The latter is used to wrap vector of buffers into an input_stream.
Seastar already provides the very same functionality with the
convenience as_input_stream() helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In 8df61f6d99 we changed the requirements for creating materialized
views and MV-based indexes - instead of requiring the
rf_rack_valid_keyspaces flag to be set, we now require the keyspace to
be RF-rack-valid at the time of creation, and it is enforced to remain
RF-rack-valid while the MV exists. This validation is done in the cql
create view/index statements.
The same should be done also for alternator - when creating a table with
GSI or LSI, or when adding a GSI to an existing table, previously we
required the flag rf_rack_valid_keyspaces to be set. Now we change it to
instead check if the keyspace is RF-rack-valid, and if not the operation
fails with an appropriate error.
Fixes https://github.com/scylladb/scylladb/issues/28214
backport to 2025.4 to add RF-rack-valid enforcements in alternator
Closesscylladb/scylladb#28154
* github.com:scylladb/scylladb:
locator: document the exception type of assert_rf_rack_valid_keyspace
alternator: don't require rf_rack flag for indexes, validate instead
When read or write operations are performed on a DC with RF=0 with LOCAL_QUORUM
or LOCAL_ONE consistency level, Cassandra throws `Unavailable` exception.
Scylla allowed such read operations and failed write operations with a cryptic:
"broken promise" error. This occured because the initial availability
check passed (quorum of 0 requires 0 replicas), but execution failed
later when no replicas existed to process the mutation.
This patch adds an explicit RF=0 validation for LOCAL_ONE and LOCAL_QUORUM that
throws before attempting operation execution.
The change also requires `test_query_dc_with_rf_0_does_not_crash_db` to be
upgraded. This testcase was asserting somewhat similar scenario, but wasn't
taking into account the whole matrix of combinations:
- scenarios: successful vs unsuccesful operation outcome
- local consistency levels: LOCAL_QUORUM & LOCAL_ONE
- operations: SELECT (read) & INSERT (write)
and so it's been extended to cover both the pre-existing and the current issues
and the whole matrix of combinations.
Fixes: scylladb/scylladb#27893
A minor change, no need to backport.
Closesscylladb/scylladb#27894
* github.com:scylladb/scylladb:
db: fail reads and writes with local consistencty level to a DC with RF=0
db: consistency_level: split `local_quorum_for()`
db: consistency_level: fix nrs -> nts abbreviation
storage_proxy::stop() is not called by main (it is commented out due to #293),
so the corresponding message injection is never hit. When the test releases
paxos_state_learn_after_mutate, shutdown may already be in progress or even
completed by the time we try to trigger the storage_proxy::stop injection,
which makes the test flaky.
Fix this by completely removing the storage_proxy::stop injection.
The injection is not required for test correctness. Shutdown must wait for the
background LWT learn to finish, which is released via the
paxos_state_learn_after_mutate injection.
The shutdown process blocks on in-flight api HTTP requests through
seastar::httpd::http_server::stop and its _task_gate, so the
shutdown will not prevent the HTTP request that released the
paxos_state_learn_after_mutate from completing successfully.
Fixesscylladb/scylladb#28260
During rewrite --extra-scylla-cmdline-options was missed and it was not passed to the tests that are using pytest. The issue that there were no possibility to pass these parameters via cmd to the Scylla, while tests were not affected because they were using the parameters from the yaml file.
This PR fixes this issue so it will be easier to modify the Scylla start parameters without modifying code.
No backport needed, only framework enhancement.
Closesscylladb/scylladb#28156
* github.com:scylladb/scylladb:
test.py: do not crash when there is no boost log
test.py: pass correctly extra cmd line arguments
Before this change, the test function `_verify_tasks_processed_metrics`
verified that after service level reconfiguration, a given number of
`scylla_scheduler_tasks_processed` were processed by a given scheduling
group. Moreover, the check verified that another scheduling group
didn't process a high number of requests. The second check was vulnerable
to flakiness, because sometimes additional load caused extensive work
in the second scheduling group (e.g. password hashing in `sl:driver`
due to new connections being created).
To avoid test failures, this commit changes which metric is verified:
instead of `scylla_scheduler_tasks_processed`, the metric
`scylla_transport_cql_requests_count` is checked. This prevents similar
problems, because there is no reason for a high number of
requests to be processed by the second scheduling group. Moreover,
it allows decreasing the number of requests that are sent for
verification, and thus speeds up the test.
Fixes: scylladb/scylladb#27715Closesscylladb/scylladb#28318
In this PR we add a basic implementation of the strongly-consistent tables:
* generate raft group id when a strongly-consistent table is created
* persist it into system.tables table
* start raft groups on replicas when a strongly-consistent tablet_map reaches them
* add strongly-consistent version of the storage_proxy, with the `query` and `mutate` methods
* the `mutate` method submits a command to the tablets raft group, the query method reads the data with `raft.read_barrier()`
* strongly-consistent versions of the `select_statement` and `modification_statement` are added
* a basic `test_strong_consistency.py/test_basic_write_read` is added which to check that we can write and read data in a strongly consistent fashion.
Limitations:
* for now the strongly consistent tables can have tablets only on shard zero. This is because we (ab/re) use the existing raft system tables which live only on shard0. In the next PRs we'll create separate tables for the new tablets raft groups.
* No Scylla-side proxying - the test has to figure out who is the leader and submit the command to the right node. This will be fixed separately.
* No tablet balancing -- migration/split/merges require separate complicated code.
The new behavior is hidden behind `STRONGLY_CONSISTENT_TABLES` feature, which is enabled when the `STRONGLY_CONSISTENT_TABLES` experimental feature flag is set.
Requirements, specs and general overview of the feature can be found [here](https://scylladb.atlassian.net/wiki/spaces/RND/pages/91422722/Strong+Consistency). Short term implementation plan is [here](https://docs.google.com/document/d/1afKeeHaCkKxER7IThHkaAQlh2JWpbqhFLIQ3CzmiXhI/edit?tab=t.0#heading=h.thkorgfek290)
One can check the strongly consistent writes and reads locally via cqlsh:
scylla.yaml:
```
experimental_features:
- strongly-consistent-tables
```
cqlsh:
```
CREATE KEYSPACE IF NOT EXISTS my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 1} AND consistency = 'local';
CREATE TABLE my_ks.test (pk int PRIMARY KEY, c int);
INSERT INTO my_ks.test (pk, c) VALUES (10, 20);
SELECT * FROM my_ks.test WHERE pk = 10;
```
Fixes SCYLLADB-34
Fixes SCYLLADB-32
Fixes SCYLLADB-31
Fixes SCYLLADB-33
Fixes SCYLLADB-56
backport: no need
Closesscylladb/scylladb#27614
* https://github.com/scylladb/scylladb:
test_encryption: capture stderr
test/cluster: add test_strong_consistency.py
raft_group_registry: disable metrics for non-0 groups
strong consistency: implement select_statement::do_execute()
cql: add select_statement.cc
strong consistency: implement coordinator::query()
cql: add modification_statement
cql: add statement_helpers
strong consistency: implement coordinator::mutate()
raft.hh: make server::wait_for_leader() public
strong_consistency: add coordinator
modification_statement: make get_timeout public
strong_consistency: add groups_manager
strong_consistency: add state_machine and raft_command
table: add get_max_timestamp_for_tablet
tablets: generate raft group_id-s for new table
tablet_replication_strategy: add consistency field
tablets: add raft_group_id
modification_statement: remove virtual where it's not needed
modification_statement: inline prepare_statement()
system_keyspace: disable tablet_balancing for strongly_consistent_tables
cql: rename strongly_consistent statements to broadcast statements
This patch adds links to the Vector Search documentation that is hosted
together with Scylla Cloud docs to the CQL documentation.
It also make the note about supported capabilities consistent and
removes the experimental label as the feature is GAed.
Fixes: SCYLLADB-371
Closesscylladb/scylladb#28312
Filter the content of sstable(s), including or excluding the specified
partitions. Partitions can be provided on the command line via
`--partition`, or in a file via `--partitions-file`.
Produces one output sstable per input sstable -- if the filter selects
at least one partition in the respective input sstable.
Output sstables are placed in the path provided via `--oputput-dir`.
Use `--merge` to filter all input sstables combined, producing one
output sstable.
The function assert_rf_rack_valid_keyspace uses the exception type
std::invalid_argument when the RF-rack validation fails. Document it and
change all callers to catch this specific exception type when checking
for RF-rack validation failures, so that other exception types can be
propagated properly.
In 8df61f6d99 we changed the requirements for creating materialized
views and MV-based indexes - instead of requiring the
rf_rack_valid_keyspaces flag to be set, we now require the keyspace to
be RF-rack-valid at the time of creation, and it is enforced to remain
RF-rack-valid while the MV exists. This validation is done in the cql
create view/index statements.
The same should be done also for alternator - when creating a table with
GSI or LSI, or when adding a GSI to an existing table, previously we
required the flag rf_rack_valid_keyspaces to be set. Now we change it to
instead check if the keyspace is RF-rack-valid, and if not the operation
fails with an appropriate error.
So far with oversampling the extended set of keys was returned from VS,
but query to the base table was still limited by the query `limit`.
Now for rescoring we want to fetch rows for all the keys returned from VS.
However later we need to restore the command limit, to trim result_set accordingly.
For non-rescoring scenarios we trim directly keys set returned from VS if it happens to exceed query limit.
With this change rescoring validation tests (except `no_nulls_in_rescored_results`) pass fully.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-83
This patch implements second part of rescoring - ordering results by similarity column added in earlier patch.
For this purpose in this patch we define `_ordering_comparator`, which enables pre-existing post-query ordering functionality.
However, none additional test passes yet, as they include ovesampling, which will be the subject of following patches.
Our plan for rescoring is to use the existing post-query ordering mechanism to sort (and trim) result_set by similarity column.
For general SELECT case this ordering is permitted only for queries with IN on the partition key and an ORDER BY, which is checked in `needs_post_query_ordering`.
Recently this check was overriden for ANN queries in https://github.com/scylladb/scylladb/pull/28109 to enable IN queries handled by VS without excessive post-processing.
In this patch we revert that change - ANN case will be handled by general check.
However we change the condition - we will enable post processing anytime `_ordering_comparator` is set.
In current implementation `_ordering_comparator` is created only in `select_statement::prepare` with `get_ordering_comparator`,
only for the same conditions as were checked in `needs_post_query_ordering`, so this change should be transparent for general SELECT.
For ANN query it is also not set (yet), so it will not influence ANN filtering, but we confirm that this functionality still works
by adding filtering test: `test/vector_search/filter_test.cc::vector_store_client_test_filtering_ann_cql`.
Rescoring ordering for ANN queries will be enabled when we add `_ordering_comparator` in following patch.
Rescoring consist of recalculating similarity score and reordering results based on it.
In this patch we add calculation of similarity score as a hidden (non-serialized) column and following patch will add reordering.
Normal ordering uses `add_column_for_post_processing`, however this works only for regular columns, not function.
So we create it together with user requested columns (this also forces the use of `selection_with_processing`) and hide the column later.
This also requires special handling for 'SELECT *' case - we need to manually add all columns before adding similarity column.
In case user already asks for similarity score in the SELECT clause, this value will be calculated twice - is should be optimized in future patches.
When setting up coredump handling, if there are old mounts in a deleted state (e.g. from an older installation),
systemd might fail to activate the new `.mount` unit properly because it assumes the path is already mounted.
Explicitly unmount `/var/lib/systemd/coredump` before proceeding with the setup to ensure a clean state.
Fix: scylladb/scylla-enterprise#5692Closesscylladb/scylladb#28300
Allows other topology operations to execute while tablets are being
drained on decommission. In particular, bootstrap on scale-out. This
is important for elasticity.
Allows multiple decommission/removenode to happen in parallel, which
is important for efficiency.
Flow of decommission/removenode request:
1) pending and paused, has tablet replicas on target node.
Tablet scheduler will start draining tablets.
2) No tablets on target node, request is pending but not paused
3) Request is scheduled, node is in transition
4) Request is done
Nodes are considered draining as soon as there is a leave or remove
request on them. If there are tablet replicas present on the target
node, the request is in a paused state and will not be picked by
topology coordinator. The paused state is computed from topology state
automatically on reload.
When request is not paused, its execution starts in
write_both_read_old state. The old tablet_draining state is not
entered (it's deprecated now).
Tablet load balancing will yield the state machine as soon as some
request is no longer paused and ready to be scheduled, based on
standard preemption mechanics.
Fixes#21452Closesscylladb/scylladb#24129
* https://github.com/scylladb/scylladb:
docs: Document parallel decommission and removenode and relevant task API
test: Add tests for parallel decommission/removenode
test: util: Introduce ensure_group0_leader_on()
test: tablets: Check that there are no migrations scheduled on draining nodes
test: lib: topology_builder: Introduce add_draining_request()
topology_coordinator, tablets: Fail draining operations when tablet migration fails due to critical disk utilization
tablets: topology_coordinator: Refactor to propagate reason for migration rollback
tablet_allocator: Skip co-location on draining nodes
node_ops: task_manager_module: Populate entity field also for active requests
tasks: node_ops: Put node id in the entity field
tasks, node_ops: Unify setting of task_stats in get_status() and get_stats()
topology: Protect against empty cancelation reason
tasks, topology: Make pending node operations abortable
doc: topology-over-raft.md: Fix diagram for replacing, tablet_draining is not engaged
raft_topology, tablets: Drain tablets in parallel with other topology operations
virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node
locator: topology: Add "draining" flag to a node
topology_coordinator: Extract generate_cancel_request_update()
storage_service: Drop dependency in topology_state_machine.hh in the header
locator: Extract common code in assert_rf_rack_valid_keyspace()
topology_coordinator, storage_service: Validate node removal/decommission at request submission time
This flag was added to operations which have an --output-dir
command-line arguments. These operations write sstables and need a
directory where to write them. Back in the numeric-generation world this
posed a problem: if the directory contained any sstable, generation
clash was almost guaranteed, because each scylla-sstable command
invokation would start output generations from 1. To avoid this, empty
output directory was a requirement, with the
--unsafe-accept-nonempty-output-dir allowing for a force-override.
Now in the timeuuid generation days, all this is not necessary anymore:
generations are unique, so it is not a problem if the output directory
already contains sstables: the probability of generation clash is almost
0. Even if it happens, the tool will just simply fail to write the new
sstable with the clashing generation.
Remove this historic relic of a flag and the related logic, it is just a
pointless nuissance nowadays.
When read or write operations are performed on a DC with RF=0 with LOCAL_QUORUM
or LOCAL_ONE consistency level, Cassandra throws `Unavailable` exception.
Scylla allowed such read operations and failed write operations with a cryptic:
"broken promise" error. This occured because the initial availability
check passed (quorum of 0 requires 0 replicas), but execution failed
later when no replicas existed to process the mutation.
This patch adds an explicit RF=0 validation for LOCAL_ONE and LOCAL_QUORUM that
throws before attempting operation execution.
The change also requires `test_query_dc_with_rf_0_does_not_crash_db` to be
upgraded. This testcase was asserting somewhat similar scenario, but wasn't
taking into account the whole matrix of combinations:
- scenarios: successful vs unsuccesful operation outcome
- local consistency levels: LOCAL_QUORUM & LOCAL_ONE
- operations: SELECT (read) & INSERT (write)
and so it's been extended to cover both the pre-existing and the current issues
and the whole matrix of combinations.
Fixes: scylladb/scylladb#27893
The core of `local_quorum_for()` has been extracted to
`get_replication_factor_for_dc()`, which is going to be used later,
while `local_quorum_for()` itself has been recreated using the exracted
part.
`network_topology_strategy` was abbreviated with `nrs`, and not `nts`. I
think someone incorrectly assumed it's 'network Replication strategy', hence
nrs.
When one request is super slow and req/s high
in theory we have a collision on id, this patch
avoids that by reusing id and aborting when there
is no free one (unlikely).
This commit avoids leaking seastar::async future from two benchmark
tools: perf-alternator and perf-cql-raw. Additionally it adds
abort_source for fast and clean shutdown.
The test is currently flaky. With `remove_dead_nodes_with == "remove"`,
it sends several ALTER KEYSPACE requests. The request performed just
after adding 3 new nodes can unexpectedly be sent twice to two
different nodes by the driver. The second receiver rejects the request
through the new guardrail added in 2e7ba1f8ce,
and the test fails.
This has been acknowledged as a bug in the Python driver. It shouldn't
retry non-idempotent requests with the default retry policy. There could
be one more bug in the driver, as it looks like the driver decides to
resend the request after it disconnects from the first receiver. The
first receiver has just bootstrapped, so the driver shouldn't disconnect.
We deflake the test by reconnecting the driver before performing the
problematic ALTER KEYSPACE request.
The change has been tested in byo, as the failure reproduces only in CI.
Without the change, the test fails once in ~250 runs in dev mode. With
the change, more than 1000 runs passed.
Fixes#27862
No backport needed as 2e7ba1f8ce is only
in master.
Closesscylladb/scylladb#28290
For the purpose of rescoring we will need information if the query is an ANN query
and the access to index option earlier in the `select_statement::prepare` than it happened before.
This patch refactors extracting this information to new helper structure `ann_ordering_info`
and is consistently using it.
Add enforce_rack_list option. When the option is set to true,
all tablet keyspaces have rack list replication factor.
When the option is on:
- CREATE STATEMENT always auto-extends rf to rack lists;
- ALTER STATEMENT fails when there is numeric rf in any DC.
The flag is set to false by default and a node needs to be restarted
in order to change its value. Starting a node with enforce_rack_list
option will fail, if there are any tablet keyspaces with numeric rf
in any DC.
enforce_rack_list is a per-node option and a user needs to ensure
that no tablet keyspace is altered or created while nodes in
the cluster don't have the consistent value.
Mark rf_rack_valid_keyspaces as deprecated.
Fixes: https://github.com/scylladb/scylladb/issues/26399.
New feature; no backport needed
Closesscylladb/scylladb#28084
* github.com:scylladb/scylladb:
test: add test for enforce_rack_list option
db: mark rf_rack_valid_keyspaces as deprecated
config: add enforce_rack_list option
Revert "alternator: require rf_rack_valid_keyspaces when creating index"
Adds a "sstables" array member to manifest.json.
For each sstables, keep the following metadata:
id - a uuid for the sstable (the sstable identifier
if the use-sstable-identifier option was used, otherwise
the sstable uuid generation)
toc_name - the name of the TOC.txt file
data_size and index_size - in bytes
first_token and last_token - of the sstable first and last keys.
Fixes: SCYLLADB-196
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a table member to manifest.json with the keyspace_name,
table_name, table_id, tablets_type, and, for tablets-enabled tables, get
tablet_count on each shard and write the minimum to manifest.json.
For vnodes-based tables, tablet_count=0.
For now, `tablets_type` may be either `none` for vnodes tables, or
`powof2` for tablets tables. In the future, when we support arbitrary
tablt boundaries, this will be reflected here, and it is likely we
would backup the whole tablets map sperately to get all tablet boundaries.
Fixes SCYLLADB-195
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And keep the options for now in the local_snapshot_writer.
The options will be used by following patches to pass
extra metadata like the snapshot creation time, expiration time, etc.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If tablets are enabled via db::config add the `tablet = {'enabled': true}'
option when creating a keyspace, even if `cql_test_config.initial_tablets`
is disengaged.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add metadata about the node: host_id, datacenter, and rack.
This enables dc- or rack- aware restore.
Today this information is "encoded" into the snapshot hierarchy
prefixes, but if all manifest files would be stored in a flat
directory, we'd need to encode that metadata in the object name,
but it'd be better for the manifest contents to be self descriptive.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add metadata about the manifest itself:
A version and the manifest scope (currently "node",
but in the future, may also be "shard", or "tablet")
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Validate the manifest.json format by loading it using rjson::parse
and then validate its contents to ensure it lists exactly the
SSTables present in the snapshot directory.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The test is currently flaky. It tries to get the host ID of the bootstrapping
node via the REST API after the node crashes. This can obviously fail. The
test usually doesn't fail, though, as it relies on the host ID being saved
in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()`
repeatedly called in `ScyllaServer.start()`. However, with a very fast crash
and unlucky timings, no such call may succeed.
We deflake the test by getting the host ID before the crash. Note that at this
point, the bootstrapping node must be serving the REST API requests because
`await coordinator_log.wait_for("delay_node_bootstrap: waiting for message")`
above guarantees that the node has submitted the join topology request, which
happens after starting the REST API.
Fixes#28227Closesscylladb/scylladb#28233
In PR 5b6570be52 we introduced the config option `sstable_compression_user_table_options` to allow adjusting the default compression settings for user tables. However, the new option was hooked into the CQL layer and applied only to CQL base tables, not to the whole spectrum of user tables: CQL auxiliary tables (materialized views, secondary indexes, CDC log tables), Alternator base tables, Alternator auxiliary tables (GSIs, LSIs, Streams).
This gap also led to inconsistent default compression algorithms after we changed the option’s default algorithm from LZ4 to LZ4WithDicts (adf9c426c2).
This series introduces a general “schema initializer” mechanism in `schema_builder` and uses it to apply the default compression settings uniformly across all user tables. This ensures that all base and aux tables take their default compression settings from config.
Fixes#26914.
Backport justification: LZ4WithDicts is the new default since 2025.4, but the config option exists since 2025.2. Based on severity, I suggest we backport only to 2025.4 to maintain consistency of the defaults.
Closesscylladb/scylladb#27204
* github.com:scylladb/scylladb:
db/config: Update sstable_compression_user_table_options description
schema: Add initializer for compression defaults
schema: Generalize static configurators into schema initializers
schema: Initialize static properties eagerly
db: config: Add accessor for sstable_compression_user_table_options
test: Check that CQL and Alternator tables respect compression config
Remove the `make_sstable_streamer` function and inline its usage where
needed. This change allows passing different sets of arguments
directly at the call sites.
When working directly with sstable components, the provided name should
be only the file name without path prefixes. Any prefixing tokens
belong in the 'prefix' argument, as the name suggests.
The test is currently flaky. It incorrectly assumes that a read with
CL=LOCAL_ONE will see the data inserted by a preceding write with
CL=LOCAL_ONE in the same datacenter with RF=2.
The same issue has already been fixed for CL=ONE in
21edec1ace. The difference is that
for CL=LOCAL_ONE, only dc1 is problematic, as dc2 has RF=1.
We fix the issue for CL=LOCAL_ONE by skipping the check for dc1.
Fixes#28253
The fix addresses CI flakiness and only changes the test, so it
should be backported.
Closesscylladb/scylladb#28274
Add a basic test that creates a strongly consistent keyspace and table,
writes some data, and verifies that the same data can be read back.
Since Scylla-side request proxying is not yet implemented, writes are
handled only on the leader node. The test uses the existing
`/raft/leader_host` REST endpoint to determine the leader of the tablets
Raft group.
The `raft::server` registers metrics using the `server_id` label. When
both a group0 Raft server and the tablets Raft server are created on
the same node/shard, duplicate metrics cause conflicts.
This commit temporarily disables metrics for non-0 groups. A proper fix
will likely require adding a `group_id` label in the future.
We use decoration instead of inheritance, since inheritance already
serves to differentiate statement types (modification_statement has
update_statement and delete_statement as descendants). A better
solution would likely involve refactoring modification_statement and
extracting the mutation-generation logic into a reusable component
shared by both eventual and strongly consistent statements.
Introduce two helper methods that will be used for strongly consistent
select_statement and modification_statement.
redirect_statement() forwards the request to another shard or node.
Currently, only shard forwarding is implemented; node-level proxying
will be added in follow-up PRs.
is_strongly_consistent() will be used in the prepare() method of raw
statements to determine whether a strongly consistent statement should
be created for the given CQL statement.
To guarantee monotonic mutation timestamps, we compute the maximum
timestamp used so far for the current tablet. This is done by calling
read_barrier() on the tablet’s Raft group server and extracting the
maximum timestamp from the local database via
table::get_max_timestamp_for_tablet().
Because read_barrier() may take a while, we perform it proactively in a
dedicated fiber, leader_info_updater, rather than during the mutation
request. This fiber is started when the Raft group server starts for a
tablet. It reacts to wait_for_state_change(), computes the maximum
timestamp, and stores it per term.
The new groups_manager::begin_mutate() function checks whether the
maximum timestamp has already been computed for the current term. If
not, it asks the client to wait. This two-step interface (synchronous
begin_mutate() + asynchronous wait on the need_wait_for_leader future)
is needed because the term can change at any asynchronous point.
If begin_mutate() were asynchronous, the client would need to recheck
the term after `co_await begin_mutate()`.
We currently do not handle raft::commit_status_unknown. We rethrow it to
the CQL client, which must check whether the command was applied and
retry if necessary. Handling this inside Scylla would require persisting
a deduplication key after applying the mutation, which introduces write
amplification. Additionally, connection breaks between Scylla and the
driver can always occur, so the client must be prepared to verify the
command status regardless.
When a strongly consistent request arrives at a node, we
need to know which replica is the leader, since such requests
are generally executed only on the leader. If a leader has
not yet been elected, we must wait. This commit exposes
wait_for_leader() so it can be used for that purpose.
We cannot rely solely on wait_for_state_change(), because it does not
trigger when some other node becomes a leader.
Add the `coordinator` class, which will be responsible for coordinating
reads and writes to strongly consistent tables. This commit includes
only the boilerplate; the methods will be implemented in separate
commits.
These commands will be used by strongly consistent tablets to submit
mutations to Raft. A simple state_machine implementation is introduced
to apply these commands.
We apply commands in batches to reduce commitlog I/O overhead. The
batched variant of database::apply has known atomicity issues. For
example, it does not guarantee atomicity under memory pressure: some
mutations may be published to the memtable while others are blocked in
run_when_memory_available. We will address these issues later.
Strongly consistent writes require knowing the maximum timestamp of
locally applied mutations to guarantee monotonically increasing
timestamps for subsequent writes.
This commit adds a function that returns the maximum timestamp for a
given tablet.
Why it is safe to use this function with deleted cells:
* Tombstones are included in memtable.get_max_timestamp() calculations.
* The maximum timestamp of a memtable is used to initialize the maximum
timestamp of the resulting sstable.
* During compaction, a new sstable’s maximum timestamp is initialized as
the maximum of the contributing sstables.
This commit adds a `consistency` field to `tablet_replication_strategy`.
In upcoming commits we'll use this field to determine if a
`raft_group_id` should be generated for a new table.
Add a `raft_group_id` column to `system.tablets` and to the `tablet_map`
class. The column is populated only when the
`strongly_consistent_tables` feature is enabled.
This feature is currently disabled by default and is enabled only when
the user sets the `STRONGLY_CONSISTENT_TABLES` experimental flag.
The `raft_group_id` column is added to `system.tablets` only when this
flag is set. This allows the schema to evolve freely while the feature
is experimental, without requiring complex migrations.
This is a refactoring/simplification commit.
There are many 'prepare' functions in this class that don't
meaningfully differ from each other. The prepare_statement() adds
accidental complexity by adding a level of indirection -- the reader
has to jump between the call site and the function body to reconstruct
the full picture.
In preparation for upcoming work on strongly consistent queries in
Scylla, this commit renames the existing `strongly_consistent`
statements to `broadcast_statements` to avoid confusion.
The old code paths are kept temporarily, as they may be useful for
reference or for copying parts during the implementation of the new
strongly consistent statements.
This patch changes the layout of user-facing scheduling groups from
/
`- statement
`- sl:default
`- sl:*
`- other groups (compaction, streaming, etc.)
into
/
`- user (supergroup)
`- statement
`- sl:default
`- sl:*
`- other groups (compaction, streaming, etc.)
The new supergroup has 1000 static shares and is name-less, in a sense
that it only have a variable in the code to refer to and is not exported
via metrics (should be fixed in seastar if we want to).
The moved groups don't change their names or shares, only move inside
the scheduling hierarchy.
The goal of the change is to improve resource consumption of sl:*
groups. Right now activities in low-shares service levels are scheduled
on-par with e.g. streaming activity, which is considered to be low-prio
one. By moving all sl:* groups into their own supergroup with 1000
shares changes the meaning of sl:* shares. From now on these shares
values describe preirities of service level between each-other, and the
user activities compete with the rest of the system with 1000 shares,
regardless of how many service levels are there.
Unit tests keep their user groups under root supergroup (for simplicity)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28235
Currently, raft-based node operations with streaming use topology guards, but repair-based don't.
Topology guards ensure that if a respective session is closed (the operation has finished), each leftover operation being a part of this session fails. Thanks to that we won't incorrectly assume that e.g. the old rpc received late belongs to the newly started operation. This is especially important if the operation involves writes.
Pass a topology_guard down from raft_topology_cmd_handler to repair tasks. Repair tasks already support topology guards.
Fixes: https://github.com/scylladb/scylladb/issues/27759
No topology_guard in any version; needs backport to all versions
Closesscylladb/scylladb#27839
* github.com:scylladb/scylladb:
service: use session variable for streaming
service: pass topology guard to RBNO
backup and restore tests. This made the testing times explode
with both cluster/object_store/test_backup.py and
cluster/test_refresh.py taking more than an hour each to complete
under test.py and around 14min under pytest directly.
This was painful especially in CI because it runs tests under test.py which
suffers from the issue of not being able to run test cases from within
the same file in parallel (a fix is attempted in 27618).
This patch reduces the dataset of these tests to the minimum and
gets rid of one of the tested topology as it was redundant.
The test times are reduced to 2min under pytest and 14 mins under
test.py.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Closesscylladb/scylladb#28280
This series introduces `rescoring` index option.
There is no rescoring algorithm implementation yet.
This series prepares it by:
- adding new index option
- adding documentation
- adding tests for option handling
- adding tests for rescoring implementation - at this point they report errors and are marked that this is expected, because rescoring is not implemented.
Follow-up https://github.com/scylladb/scylladb/pull/27677
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-293
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-294
No backporting - it is a new feature.
Closesscylladb/scylladb#28165
* github.com:scylladb/scylladb:
vector_search: Add more rescoring validation tests
vector_search: Add rescoring validation test
vector_search: doc: Document new index option
vector_search: test: Add `rescoring` index option test
vector_index: introduce rescoring option
vector_index: improve options validation
The streamed_mutation_freezer class uses a deque to avoid large
allocations, but fails as seen in the referenced issue when the
vector backing the deque grows too large. This may be a problem
in itself, but the issue doesn't provide enough information to tell.
Fix the immediate problem by switching to chunked_vector, which
is better in avoiding large allocations. We do lose some early-free
in serialize_mutation_fragments(), but since most of the memory should
be in the clustering row itself, not in the deque/chunked_vector holding
it, it should not be a problem.
Fixes#28275Closesscylladb/scylladb#28281
Auth v2 migration uses non-paged queries via `execute_internal` API.
This commit changes it to use `query_internal` instead, which uses
paging under the hood.
Fixes: https://github.com/scylladb/scylladb/issues/27577
A minor enhancement, no need to backport.
Closesscylladb/scylladb#25395
* github.com:scylladb/scylladb:
auth: use paged internal queries during migration
auth: move some code in migrate_to_auth_v2 up
auth: re-align pieces of migrate_to_auth_v2
cql: extend `query_internal` with `query_state` param
This reverts commit c8cff94a5a.
Re-enabling incremental repair on master with "Aborting on shard 0 during
scaleout + repair #26041" and "Failure to attach sstables in streaming consumer
leaves sealed sstables on disk #27414" fixed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#28120
Since #28109 was merged, those tests started to pass as we allow
the filtering on primary key columns within ANN vector queries.
Closesscylladb/scylladb#28231
Adding tests for specific cases of rescoring processing:
- wildcard selection - "SELECT * ..." is a case with slightly different path of rescoring processing. We want to confirm that it is handled correctly.
- calculating similarity with other vectors in SELECT clause should not influence ANN ordering.
- NULL handling - results that for any reason have NULL in a score should be filtered out.
As rescoring is not implemented yet, the tests use boost::unit_test::expected_failures
to indicate that the test reports errors.
Verify that vector store results will be correctly rescored and reordered
according to the rescoring algorithm.
As rescoring is not implemented yet, the tests use `boost::unit_test::expected_failures`
to indicate that they report errors.
First test checks rescoring with a simple selection list.
Second makes sure that rescoring is not triggered for quantization=f32 - full representation of vectors.
Third repeats the first one, but adds to it returning of similarity score value.
This patch adds vector index option allowing to enable rescoring - recalculation of similarity metric and re-ranking of quantized VS candidates.
Quantization is a necessary condition to run rescoring - checked in convenience function `is_rescoring_enabled`.
Rescoring itself is not implemented - it will come in following patches.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-294
Add `prepared_filter` class which handles the preparation, construction
and caching of Vector Search filtering compatible JSON object.
If no bind markers found in SELECT statement, the JSON object will be built
once at prepare time and cached for use during execution calls.
Adjust tests accordingly to use prepared filters.
Follow-up: #28109
Fixes: SCYLLADB-299
During rewrite --extra-scylla-cmdline-options was missed and it was not
passed to the tests that are using pytest. The issue that there were no
possibility to pass these parameters via cmd to the Scylla, while tests
were not affected because they were using the parameters from the yaml
file. This PR fixes this issue so it will be easier to modify the Scylla
start parameters without modifying code.
Move Vector Search filter functions from `cql3::restrictions` to
`vector_search` namespace as it's a better place according to
it's purpose.
The effective name has now changed from `cql3::restrictions::to_json`
to `vector_search::to_json` which clearly mentions that the JSON
object will be used for Vector Search.
Rename the auxilary functions to use `to_json` suffix instead of
variety of verbs as those functions logic focuses on building JSON
object from different structures. The late naming emphasized too
much distinction between those functions, while they do pretty much
the same thing.
Follow-up: #28109
The API contract in partition_version.hh states that when dealing with
evictable entries, a real cache tracker pointer has to be passed to all
methods that ask for it. The nonpopulating reader violates this, passing
a nullptr to the snapshot. This was observed to cause a crash when a
concurrent cache read accessed the snapshot with the null tracker.
A reproducer is included which fails before and passes after the fix.
Fixes: #26847Closesscylladb/scylladb#28163
Bound_weight and partition_region are defined in both paging_state.idl.hh and
position_in_partition.idl.hh. This isn't currently causing any issues, but if
a future RPC uses both the paging_state and position_in_partition, after
including both files we'll get a duplicate error.
In this patch we prevent this by removing the definitions from paging_state.idl.hh
and including position_in_partition.idl.hh in their place.
Closesscylladb/scylladb#28228
Use session that was retrieved at the beginning of the handler for
node operations with streaming to ensure that the session id won't
change in between.
Currently, raft-based node operations with streaming use topology
guards, but repair-based don't.
Topology guards ensure that if a respective session is closed
(the operation has finished), each leftover operation being a part
of this session fails. Thanks to that we won't incorrectly assume
that e.g. the old rpc received late belongs to the newly started
operation. This is especially important if the operation involves
writes.
Pass a topology_guard down from raft_topology_cmd_handler to repair
tasks. Repair tasks already support topology guards.
Fixes: https://github.com/scylladb/scylladb/issues/27759
Mark rf_rack_valid_keyspaces option as deprecated. User should
use enforce_rack_list option instead.
The option can still be used and it does not change it's behavior.
Docs is updated accordingly.
Add enforce_rack_list option. When the option is set to true,
all tablet keyspaces have rack list replication factor.
When the option is on:
- CREATE STATEMENT always auto-extends rf to rack lists;
- ALTER STATEMENT fails when there is numeric rf in any DC.
The flag is set to false by default and a node needs to be restarted
in order to change its value. Starting a node with enforce_rack_list
option will fail, if there are any tablet keyspaces with numeric rf
in any DC.
enforce_rack_list is a per-node option and a user needs to ensure
that no tablet keyspace is altered or created while nodes in
the cluster don't have the consistent value.
Currently the suite generates config in old format, and only a single
test validates that using new format "works".
This change updates the suite (mainly the MinioServer::create_conf()
method) to generate endpoint confit in new format.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28113
The datagram_channel::send() method that sends net::packet-s is
deprecated in favor of using span<temporary_buffer> one. Auditing code
still uses the former one -- it constructs a packet by using formatted
string by copying the string into the packet's fragment, then sends it.
This patch releases string into temporary_buffer and then passes
one-element span to send().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28198
do_query() is a coroutine but uses some continuations to take
advantage of exceptions being propagated via future::then() without
being thrown. We can accomplish the same thing with a nested coroutine
and coroutine::try_future(), simplifying the code.
While this area isn't performance intensive, we're not adding allocations.
The coroutine frame may add an allocation, but since read_page()
certainly does not return immediately, the following then() will allocate
as well. Since we eliminated that then(), the change is at least neutral
allocation-wise.
Closesscylladb/scylladb#28258
Consider the following scenario:
1. Let nodes A,B,C form a cluster with RF=3
2. Write query with CL=QUORUM is submitted and is acknowledged by
nodes B,C
3. Follow-up read query with CL=QUORUM is sent to verify the write
from the previous step
4. Coordinator sends data/digest requests to the nodes A,B. Since the
node A is missing data, digest mismatches and data reconciliation
is triggered
5. The node A or B fails, becomes unavailable, etc
6. During reconciliation, data requests are sent to node A,B and fail
failing the entire read query
When the above scenario happens, the tests using `start_writes()` fail
with the following stacktrace:
```
...
> await finish_writes()
test/cluster/test_tablets_migration.py:259:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/pylib/util.py:241: in finish
await asyncio.gather(*tasks)
test/pylib/util.py:227: in do_writes
raise e
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
worker_id = 1
...
> rows = await cql.run_async(rd_stmt, [pk])
E cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test_1767777001181_bmsvk.test - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}
```
Note that when a node failure happens before/during a read query,
there is no test failure as the speculative retries are enabled
by default. Hence an additional data/digest read is sent to the third
remaining node.
However, the same speculative read is cancelled the moment, the read
query reaches CL which may trigger a read-repair.
This change:
- Retries the verification read in start_writes() on failure to mitigate
races between reads and node failures
- Adds additional logging to correlate Python exceptions with Scylla logs
Fixes https://github.com/scylladb/scylladb/issues/27478
Fixes https://github.com/scylladb/scylladb/issues/27974
Fixes https://github.com/scylladb/scylladb/issues/27494
Fixes https://github.com/scylladb/scylladb/issues/23529
Note that this change test flakiness observed during tablet transitions.
However, it serves as a workaround for a higher-level issue
https://github.com/scylladb/scylladb/issues/28125Closesscylladb/scylladb#28140
Auth v2 migration uses non-paged queries via `execute_internal` API.
This commit changes it to use `query_internal` instead, which uses
paging under the hood.
Fixes: scylladb/scylladb#27577
Just move the touched code above so the next commit is more readable.
But this has a drawback: previously, if the returned rows were empty,
this code was not executed, but now is independently of the query
results. This shouldn't be a big deal, though, as auth shouldn't be
empty.
This patch adds vector index options allowing to enable quantization and oversampling.
Specific quantization value will be used internally by vector store.
In the current implementation, get_oversampling allows us to decide how many times more candidates
to retrieve from vector store - final response is still trimmed to the given limit.
It is a first step to allow rescoring - recalculation of similarity metric and re-ranking.
Without rescoring oversampling will be also further optimized to happen internally in vector store.
`test/vector_search/rescoring_test.cc` implements basic tests of added functionality.
New options are documented in `docs/cql/secondary-indexes.rst`.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-82
Ref https://scylladb.atlassian.net/browse/SCYLLADB-83
New feature - no backporting
Closesscylladb/scylladb#27677
* github.com:scylladb/scylladb:
vector_search: doc: Document new index options
vector_search: test: Test oversampling
vector_search: test: Add rescoring index options test
vector_search: test: Extract Configure utility to shared header
vector_index: introduce `quantization` and `oversampling` options
Cassandra changed their system tables in 3.0. We migrated to the new system table layout in 2017, in ScyllaDB 2.0.
System tables introduced in Cassandra 3.0, as well as the 3.0 variant of pre-existing system tables were added to the db::system_table::v3 namespace.
We ended up adding some new ScyllaDB-only system tables to this namespace as well.
As the dust settled, most of the v3 system tables ended up being either simple aliases to non-v3 tables, or new tables.
Either way, the codebase uses just one variant of each table for a long time now the v3:: distinction is pointless.
Remove the v3 namespace and unify the table listing under the top-level db::system_keyspace scope.
Code cleanup, no backport
Closesscylladb/scylladb#28146
* github.com:scylladb/scylladb:
db/system_keyspace: move remining tables out of v3 keyspace
db/system_keyspace: relocate truncated() and commitlog_cleanups()
db/system_keyspace: drop v3::local()
db/system_keyspace: remove duplicate table names from v3
Before this commit, if a test file or a test suite didn't include
any actual test cases, it was ignored by `boost_test_tree_lister`.
However, this information is useful; for example, it allows us to tell
if the test file the user wants to run doesn't exist or simply doesn't
contain any tests. The kind of error we would return to them should be
different depending on which situation we're dealing with.
We start including those empty suites and files in the output of
`--list_json_content`.
---
Examples (with additional formatting):
* Consider the following test file, `test/boost/dummy_test.cc` [1]:
```
BOOST_AUTO_TEST_SUITE(dummy_suite1)
BOOST_AUTO_TEST_SUITE(dummy_suite2)
BOOST_AUTO_TEST_SUITE_END()
BOOST_AUTO_TEST_SUITE_END()
BOOST_AUTO_TEST_SUITE(dummy_suite3)
BOOST_AUTO_TEST_SUITE_END()
```
Before this commit:
```
$ ./build/debug/test/boost/dummy_test -- --list_json_content
[{"file": "test/boost/dummy_test.cc", "content": {"suites": [], "tests": []}}]
```
After this commit:
```
$ ./build/debug/test/boost/dummy_test -- --list_json_content
[{"file":"test/boost/dummy_test.cc", "content": {"suites": [
{"name": "dummy_suite1", "suites": [
{"name": "dummy_suite2", "suites": [], "tests": []}
], "tests": []},
{"name": "dummy_suite3", "suites": [], "tests": []}
], "tests": []}}]
```
* Consider the same test file as in Example 1, but also assume it's compiled
into `test/boost/combined_tests`.
Before this commit:
```
$ ./build/debug/test/boost/combined_tests -- --list_json_content | grep dummy
$
```
After this commit:
```
$ ./build/debug/test/boost/combined_tests -- --list_json_content
[..., {"file": "test/boost/dummy_test.cc", "content": {"suites": [
{"name": "dummy_suite1", "suites":
[{"name": "dummy_suite2", "suites": [], "tests": []}],
"tests": []},
{"name": "dummy_suite3", "suites": [], "tests": []}],
"tests":[]}}, ...]
```
[1] Note that the example is simplified. As of now, it's not possible to use
`--list_json_content` with a file without any Boost tests. That will
result in the following error: `Test setup error: test tree is empty`.
Refs scylladb/scylladb#25415
In scylladb/scylladb@afde5f668a, we
implemented custom collection of information about Boost tests
in the repository. The solution boiled down to traversing through
the test tree via callbacks provided by Boost.Test and calling that
code from a global fixture. This way, the code is called automatically
by the framework.
Unfortunately, for an unknown reason, this leads to labels of test units
being duplicated. We haven't found the root cause yet and so we
deduplicate the labels manually.
---
Example (with additional formatting):
Consider the following test in the file `test/boost/dummy_test.cc`:
```
SEASTAR_TEST_CASE(dummy_case, *boost::unit_test::label("mylabel1")) {
return make_ready_future();
}
```
Before this commit:
```
$ ./build/dev/test/boost/dummy_test -- --list_json_content
[{"file": "test/boost/dummy_test.cc", "content": {"suites": [],
"tests": [{"name": "dummy_case", "labels": "mylabel1,mylabel1"}]}
}]
```
After this commit:
```
$ ./build/dev/test/boost/dummy_test -- --list_json_content
[{"file": "test/boost/dummy_test.cc", "content": {"suites": [],
"tests": [{"name": "dummy_case", "labels": "mylabel1"}]}
}]
```
Refs scylladb/scylladb#25415
The loops in `ongoing_rf_change()` perform explicit yields, but they
also perform coroutine operations which can yield implicitly. The
explicit yields are redundant.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
`effective_capacity` is a value used in size based load balancing. It contains the sum of available disk space of a node and all the tablet sizes.
This change adds this value to the virtual table `system.load_per_node`. This can be useful for debugging size based load balancing.
Size based load balancing is currently only on master, so no backport is needed.
Closesscylladb/scylladb#28220
* github.com:scylladb/scylladb:
docs: add effective_capacity to system keyspace docs
virtual_table: add effective_capacity to load_per_node
This test has to be adjusted in lock-step with scylladb.git, due to changes in https://github.com/scylladb/scylladb/pull/27836. It is simpler to just take the time and import it, so https://github.com/scylladb/scylladb/pull/27836 can patch all the affected tests, including this one.
All code is imported verbatim, then patched later, such that the series remains bisectable.
dtest import, no backport needed
Closesscylladb/scylladb#28085
* github.com:scylladb/scylladb:
test/cluster/dtest: remove is_win() and users
test/cluster/dtest/scrub_test.py: add license blurb
test/cluster/dtest: import scrub_test.py
test/cluster/dtest/ccmlib: scylla_node.py: adapt run_scylla_sstable() at al
test/cluster/dtest/ccmlib: scylla_node.py: import run_scylla_sstable()
The original scrub test was done by the Cassandra project, hence there
is two Licenses notices: one for the original work by Cassandra
(2015) and one for our modifications on top (2021).
And dependencies: get_sstables() and __gather_sstables().
Code is importend verbatim, but doesn't work yet (no users yet either).
Will be patched to work in the next commit.
The last remining tables in the v3 keyspace are those that are genuinely
distinct -- added by Cassandra 3.0 or >= ScyllaDB 2.0.
Move these out of the v3 keyspace too, with this the v3 keyspace is
defunct and removed.
The name variables of these tables is outside the v3 namespace but the
method defining their schema is in the v3 namespace. Relocate the
methods out from the v3 namespace, to the scope where the name variables
live.
The methods are moved to the private: part of system_keyspace, as they
don't have external users currently.
Those table names that are effectively just an alias of the their
counterpart outside of the v3 namespace (struct).
scylla_local() is made public. Currently it is private, but it has
external users, working around the private designation by using the
public v3::scylla_local() alias. This change just makes the existing
status clear.
This patch adds vector index options allowing to enable quantization and oversampling.
Specific quantization value will be used internally by vector store.
In the current implementation, `get_oversampling` allows us to decide how many times more candidates
to retrieve from vector store - final response is still trimmed to the given limit.
It is a first step to allow rescoring - recalculation of similarity metric and re-ranking.
Without rescoring oversampling will be also further optimized to happen internally in vector store.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-82
Ref https://scylladb.atlassian.net/browse/SCYLLADB-83
Currently, we only know about long reclaims from lsa-timing stall
reports. Shorter reclaims can go under the radar.
Those metrics will help to asses increase in LSA activity, which
translates to higher CPU cost of a workload.
reclaim tracks memory which goes to the standard allocator, e.g. when
entering and allocating_section or in the background reclaimer.
evict/compact count activity towrads building LSA reserve, in
allocating_section entry, or naked LSA allocation.
Closesscylladb/scylladb#27774
The way that test.py runs test/cqlpy tests requires that tests end their
session with all keyspaces deleted. If we forget to delete a keyspace,
test.py suspects some test fails and reports a failure. As reported in
issue #26291, the test file test/cqlpy/test_describe.py caused this check
to trigger, so this file was added to the blacklist "dirties_cluster"
in suite.yaml to force test.py to ignore this problem.
I believe the cause of the problem was as follows: test_describe.py
didn't really leave any undeleted keyspace. Rather, test_describe.py had
one test which used "USE" and this broke DESC KEYSPACES (Refs #26334) -
which test.py used to see which keyspaces remained.
We solved this problem not just once, but twice:
1. In pull request #26345, I fixed the test not to use "USE" on the main
CQL session.
2. In pull request #27971, I fixed DESC KEYSPACES implementation so even
if "USE" was in effect, it will return the correct results.
I checked manually, and after removing test_describe.py from the
dirties_cluster blacklist, all cqlpy tests now pass, without
spurious failures in the test following test_describe.py. So it's time
to remove it from the blacklist.
Fixes#26291
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27973
Commit d54d409 (audit: write out to both table and syslog) unified
create_audit and start_audit, which moved the audit service creation later
in the startup sequence. This broke startup when audit is enabled because
view_builder prepares CQL queries before start_audit runs, and
query preparation calls audit_instance().local_is_initialized()
which crashes on the non-existent sharded service.
Move start_audit to run before view_builder::start() and other components
that may prepare CQL queries during their initialization.
Fixes SCYLLADB-252
Closesscylladb/scylladb#28139
Our glossary is stuck in the past, still discussing token ownership in
terms of vnodes and cluster synchronization in terms of gossip.
This patch tries to improve this a bit, although much more work needs to
be done.
The term `Tablet` is added and the definition of `Token` and `Token
Range` is rephrased to be tablet inclusive.
The term `Cluster` is changed to mention raft as the synchronization
mechanism instead of gossip.
One oustanding problem is that our general architecture page describing
the ring acrhitecture is still Vnode only. We have a seprate Tablets
page, but the two don't link to each other and most documentation refer
only to the former. A casual reader might be able to spend a a lot of
time on our documentation page, without even seeing the word: tablets.
Closesscylladb/scylladb#28170
The loop that unwraps nested exception, rethrows nested exception and saves pointer to the temporary std::exception& inner on stack, then continues. This pointer is, thus, pointing to a released temporary
Closesscylladb/scylladb#28143
reader_permit::release_base_resources() is a soft evict for the permit:
it releases the resources aquired during admission. This is used in
cases where a single process owns multiple permits, creating a risk for
deadlock, like it is the case for repair. In this case,
release_base_resources() acts as a manual eviction mechanism to prevent
permits blockings each other from admission.
Recently we found a bad interaction between release_base_resources() and
permit eviction. Repair uses both mechanism: it marks its permits as
inactive and later it also uses release_base_resources(). This partice
might be worth reconsidering, but the fact remains that there is a bug
in the reader permit which causes the base resources to be released
twice when release_base_resources() is called on an already evicted
permit. This is incorrect and is fixed in this patch.
Improve release_base_resources():
* make _base_resources const
* move signal call into the if (_base_resources_consumed()) { }
* use reader_permit::impl::signal() instead of
reader_concurrency_semaphore::signal()
* all places where base resources are released now call
release_base_resources()
A reproducer unit test is added, which fails before and passes after the
fix.
Fixes: #28083Closesscylladb/scylladb#28155
This is a translation of Cassandra's CQL unit test source file
validation/operations/InsertUpdateIfConditionTest.java into our cqlpy
framework.
This test file checks various LWT conditional updates. After that
file became too big, the Cassandra developers split parts from it -
moving tests for LWT with collections, UDTs, and static columns to
separate test files - which I already translated (pull request #13663).
This patch translates the remaining, main, LWT tests.
Strangely, this test file also has, in the middle of the file, several
tests for conditional schema changes, like CREATE KEYSPACE IF NOT EXISTS,
a feature which has *nothing* to do with LWT so really didn't belong in
this file. But I translated those as well.
These new tests all pass on both ScyllaDB and Cassandra, and have not
uncovered any new bug.
However these tests do demonstrate yet again something that users and
developers of ScyllaDB's LWT must be aware of: Whereas usually
ScyllaDB's goal has been compatiblity with Cassandra's CQL, in LWT
this has *not* been the case: ScyllaDB deviated from Cassandra's
behavior in its LWT implementation in several places. These intentional
deviations were documented in docs/kb/lwt-differences.rst.
Accordingly, the tests here include almost a hundred (!) modificatons
(search for "if is_scylla") to allow the same test to pass on both
ScyllaDB and Cassandra, as well as many comments explaining the types
of differences we're seeing.
Although these deviations from Cassandra compatibility are known and
intentional, it's worth listing here the ones re-discovered by these
new tests:
1. On a successful conditional write, Cassandra returns just true, Scylla
also returns the old contents of the row.
2. Similarly, in an IF EXISTS write that failed (the row did not exist),
Cassandra returns just false, Scylla also returns extra null values for
each and every column of the row.
3. Cassandra allows in "IF v IN (?, ?)" to bind individual values to
UNSET_VALUE and skips them, Scylla treats this as an error. Refs #13659.
4. When there are static columns, Scylla's LWT response returns the static
column first, Cassandra returns the modified column first. Since both
also say which columns they return, neither is more correct than the other,
a normally users will address specific columns by name, not by position.
5. docs/kb/lwt-differences.rst explains that "the returned result set
contains an old row for every conditional statement in the batch".
Beyond this different, actually non-conditional updates in the batch will
also get a row in Scylla's result. Refs #27955.
6. For batch statement, ScyllaDB allows mixing `IF EXISTS`, `IF NOT EXISTS`,
and other conditions for the same row. Cassandra doesn't, so checks that
these combinations are not allowed were commented out.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27961
This PR marks system_replicated_keys as a system keyspace.
It was missing when the keyspace was added.
A side effect of that is that metrics that are not supposed to be reported are.
Fixes#27903Closesscylladb/scylladb#27954
* github.com:scylladb/scylladb:
distributed_loader: system_replicated_keys as system keyspace
replicated_key_provider: make KSNAME public
In storage_service::raft_topology_cmd_handler we pass a lambda
wrapped in coroutine::lambda to a function that creates streaming_task_impl.
The lambda is kept in streaming_task_impl that invokes it in its run
method.
The lambda captures may be destroyed before the lambda is called, leading
to use after free.
Do not wrap a lambda passed to streaming_task_impl into coroutine::lambda.
Use this auto dissociate the lambda lifetime from the calling statement.
Fixes: https://github.com/scylladb/scylladb/issues/28200.
Closesscylladb/scylladb#28201
Currently, if a rf change request is paused, it immediately changes
the system_schema.keyspaces to use rack list for this keyspace.
If the request is aborted, the co-location might not be finished.
Hence, we can end up with inconsistent schema and tablet replica state.
Update the system_schema.keyspaces only after the co-location is done (and
not when it's started).
Fixes: https://github.com/scylladb/scylladb/issues/28167
No backport needed; changes that introduced a bug are only on master
Closesscylladb/scylladb#28168
* github.com:scylladb/scylladb:
service: fin indentation
test: add test_numeric_rf_to_rack_list_conversion_abort
service: tasks: fix type of global_topology_request_virtual_task
service: do not change the schema while pausing the rf change
Refs #27429
Re-implement the dtest with same name as a scylla pytest, using a python level network proxy instead of tcpdump etc.
Both to avoid sudo and also to ensure we don't race.
Juggles different listen_address and broadcast_address values to insert a proxy measuring RPC traffic.
Note: the measuring relies on python network IO not splitting data chunks, since we don't really have packet-level view of the connections.
Note that a scylla change is required to make the ip address magic work, otherwise topology mechanism gets
confused. This should maybe at some point be looked into more, since we should be more resilient against various services in scylla binding to different addresses.
When this test is merged, we can drop the flaky test from dtest. And hope no new flakiness comes from this one...
Closesscylladb/scylladb#28133
* github.com:scylladb/scylladb:
test/cluster/test_internode_compression: Transpose test from dtest
gossiper/main: Extend special treatment of node ID resolve for rpc_address
All the tests under test/cqlpy/cassandra_tests/ were translated from
Cassandra's unit tests originally written in Java into our own test
framework, and accordinly carry a clear mention of their origin and
original license.
However, we did modify these original tests - even if the modification
was slight and mostly straightforward. Therefore I was asked to also
mention our own copyright (and license) for these modifications.
So this patch adds to every file in test/cqlpy/cassandra_tests/ text like:
# Modifications: Copyright 2026-present ScyllaDB
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
with the appropriate year instead of 2026.
Fixes#28215
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28216
Many tests want to assume that group0 leader runs on a particualr
server, typically the first server in the list.
And they cannot be easily made to work with arbitrary leader, becuase
they setup a particular topology and then stop particular nodes, and
want to assume the leader is stable. They open leader's log and
expect things to appear in that log.
It's much easier to ensure the leader, than to prepare tests to
handle failovers.
In case of decommission, it's not desirable because it's less urgent.
In case of removenode, it leads to failure of removenode operation
because scheduled co-locating migration will fail if the destination
is on the excluded node, and this failure will be interpreted as drain
failure and coordinator will cancel the request.
Not a problem before "parallel decommission" because this failure is
only a streaming failure, not a barrier failure, so exception doesn't
escape into the catch clause in transition stage handler, and the
migration is simply rolled back. Once draining happens in the tablet
migration track, streaming failure will be interpreted as drain
failure and cancel the request.
Reaching critical disk utilization on destination means the draining
either caused it, or at least works against reliveing it. So it's
better to cancel those requests. In case of decommission, if critical
disk utilization was caused by it due to not enough capacity, aborting
decomission will bring capacity back to the system and rebalancing
will relieve critical disk utlization.
In case of decommission, it's not desirable because it's less
urgent.
In case of removenode, it leads to failure of removenode operation
because scheduled co-locating migration will fail if the destination
is on the excluded node, and this failure will be interpreted as drain
failure and coordinator will cancel the request.
Not a problem before "parallel decommission" because this failure is
only a streaming failure, not a barrier failure, so exception doesn't
escape into the catch clause in transition stage handler, and the
migration is simply rolled back. Once draining happens in the tablet
migration track, streaming failure will be interpreted as drain
failure and cancel the request.
We want to be able to cancel decommission when it's still in the
tablet draining phase. Such a request is in a pending and paused
state, and can be safely canceled. We set the node's "draining" flag
back to false.
Allows other topology operations to execute while tablets are being
drained on decommission. In particular, bootstrap on scale-out. This
is important for elasticity.
Allows multiple decommission/removenode to happen in parallel, which
is important for efficiency.
Flow of decommission/removenode request:
1) pending and paused, has tablet replicas on target node.
Tablet scheduler will start draining tablets.
2) No tablets on target node, request is pending but not paused
3) Request is scheduled, node is in transition
4) Request is done
Nodes are considered draining as soon as there is a leave or remove
request on them. If there are tablet replicas present on the target
node, the request is in a paused state and will not be picked by
topology coordinator. The paused state is computed from topology state
automatically on reload.
When request is not paused, its execution starts in
write_both_read_old state. The old tablet_draining state is not
entered (it's deprecated now).
Tablet load balancing will yield the state machine as soon as some
request is no longer paused and ready to be scheduled, based on
standard preemption mechanics.
The test case test_explicit_tablet_movement_during_decommission is
removed. It verifies that tablet move API works during tablet draining
transition. After this PR, we no longer enter this transition, so the
test doesn't work. It loses its purpose, because movement during
normal tablet balancing is not special and tested elsewhere.
They are being drained of tablet replicas, tablet scheduler works to
move replicas away from such nodes. This state is set at the
beginning of decommission and removenode operations.
After parallel tablet draining, the validation at the time request
starts executing is too late, tablets will be already drained.
This trips tests which expect validation failure, but get tablet
draining failure instead.
Also, in case of decommission, it's a waste to go through draining
only to discover that the operation has to be rolled back due to
validation.
So avoid submitting a request altogether if it's invalid.
The validation at request execution start remains, for extra sefety.
validate_removing_node() was extracted out of topology_coordinator,
so that it can be called by storage_service on non-coordinator.
Some tests need adjusting for the fact that after failed removenode
the node may still not be marked as excluded, so we need to explicitly
exclude it or add to the list of ignored nodes in the next removenode
operation.
Since Vector Store service filtering API has been implemented (scylladb/vector-store#334), there is a need for the implementation of Scylla side part.
This patch should implement a `statement_restrictions` parsing into Vector Store filtering API compatible JSON objects.
Those objects should be added to ANN query vector POST requests as `filter` object.
After this patch, the subset of all operations ([Vector Search Filtering Milestone 1](https://scylladb.atlassian.net/wiki/spaces/RND/pages/156729450/Vector+Search+Filtering+Design+Document#Milestone-1)) happy path should be completed, allowing users to filter on primary key columns with single column `=` and `IN` or multiple column `()=()` and `() IN ()`.
The restrictions for other operations should be implemented in a PR on Vector Store service side.
---
This PR implements parsing the `statement_restrictions` into Vector Store filtering API compatible JSON objects.
The JSON objects are created and used in ANN vector queries with filtering.
It closes the Scylla side implementation of Vector Search filtering milestone 1.
Unit tests for `statement_restrictions` parsing are added. Integration tests will be added on Vector Store service side PR.
---
Fixes: SCYLLADB-249
New feature, should land into 2026.1
Closesscylladb/scylladb#28109
* github.com:scylladb/scylladb:
docs: update documentation on filtering with vector queries
test/vector_search: add test for filtered ANN with VS mock
test/vector_search: add restriction to JSON conversion unit tests
vector_search: cql: construct and use filter in ANN vector queries
select_statement: do not require post query ordering for vector queries
vector_search: add `statement_restrictions` to JSON parsing
seastar dd46b6f..e00f1513
```
e00f1513 Merge 'net: Add DNS TTL to the net::hostent' from Ernest Zaslavsky
8a69e1f4 net: extract common implementation of inet_address::find_all
cb469fd1 net: deprecate the addr_list in hostent
1d59c0ca net: expose DNS TTL via net::hostent
3c6d919f http: add virtual close() to connection_factory
bbd0001a Revert "net: expose DNS TTL via net::hostent"
```
Closesscylladb/scylladb#28147
Workload prioritization was added in scylladb/scylladb#22031.
The functionality of updating service levels was implemented as
a lambda coroutine, leaving room for the lambda coroutine fiasco.
The problem was noticed and addressed in scylladb/scylladb#26404.
There are currently three functions that call switch_tenant:
- update_user_scheduling_group_v1 and update_user_scheduling_group_v2
use the deducing this (this auto self) to ensure the proper
lifecycle of the lambda capture.
- update_control_connection_scheduling_group doesn’t use the deducing
this, but the lambda captures only `this`, which is used before
the first possible coroutine preemption. Therefore, it doesn’t seem
that any memory corruption or undefined behavior is possible here.
Nevertheless, it seems better to start using the deducing this in
update_control_connection_scheduling_group as well, to avoid problems
in the future if someone modifies the code and forgets to add it.
Fixes: SCYLLADB-284
Closesscylladb/scylladb#28158
The `make_key` lambda erroneously allocates a fixed 8-byte buffer
(`sizeof(s.size())`) for variable-length strings, potentially causing
uninitialized bytes to be included. If such bytes exist and they are
not valid UTF-8 characters, deserialization fails:
```
ERROR 2026-01-16 08:18:26,062 [shard 0:main] testlog - snapshot_list_contains_dropped_tables: cql env callback failed, error: exceptions::invalid_request_exception (Exception while binding column p1: marshaling error: Validation failed - non-UTF8 character in a UTF8 string, at byte offset 7)
```
Fixes#28195.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#28197
Continuing the read once it is aborted (e.g. due to timeout) is a waste
of resources, as the produced results will be discarded.
Poll the permit's abort exception in the memtable and cache reader's
fill_buffer(). This results in one poll per buffer filled (8KB of data).
We already have similar poll for sstable readers, as disk reads are
usually much heavier and therefore it is more important to stop them
ASAP after abort. Cache and memtable reads are usually quick but not
always, hence it is important to also have polling in the cache and
memtable readers.
Refs: #11469Fixes: #28148Closesscylladb/scylladb#28149
0ede8d154b introduced the dev doc for size
based load balancing, but also added spelling errors.
This PR fixed these errors.
Closesscylladb/scylladb#28196
Currently, the type of global_topology_request_virtual_task isn't
taken out of std::variant before printing, which results with
a task of type variant(actual_type).
Retrieve the type from the variant before passing it to task type.
Currently, if a rf change request is paused, it immediately changes
the system_schema.keyspaces to use rack list for this keyspace.
If the request is aborted, the co-location might not be finished.
Hence, we can end up with inconsistent schema and tablet replica state.
Update the system_schema.keyspaces only after the co-location is done (and
not when it's started).
We switched to the size-based load balancing, which now has more
strict requirements for load stats. We no longer need only per-node
stats, but also per-tablet stats.
Bootstrapping a node triggers stats refresh, but allocating tablets on
table creation didn't. So after creating a table, load balancer
couldn't make progress for up to 60s (stats refresh period).
This makes tests take longer, and can even cause failures if tests are
using a low-enough timeout.
Fixes https://github.com/scylladb/scylladb/issues/27921
No backport becuse only master is vulnerable (size-based load balancing).
Closesscylladb/scylladb#27926
* https://github.com/scylladb/scylladb:
test: cluster: Add reproducer for missed notification in topology coordinator
topology_coordinator: Wake up the state machine after stats refresh
topology_coordinator: Move tablet_load_stats_refresh_before_rebalancing injection earlier
topology_coordinator: Fix potential missed notification
topology_coordinator: Refresh load stats after table is created or altered
tablets: Do a group0 read barrier on tablet load stats refresh
topology_coordinator: Ensure stats are refreshed in the gossip scheduling group
test: Use ManagerClient.{disable,enable}_tablet_balancing()
test: Add missing calls to disable_tablet_balancing() in tests which use move_tablet() API
test: pylib: Introduce ManagerClient.{disable,enable}_tablet_balancing()
Add a description of available filtering options with ANN vector queries.
Provide an example of such query and a reference to `WHERE` clause restrictions.
Add `filter` option in `ann()` function to write the filter JSON
object as the POST request in ANN vector queries.
Adjust existing `vector_store_client_test` tests accordingly.
As there is only one `ORDER BY` clause with `ANN OF` ordering supported
in ANN vector queries, there is no need to require post query ordering
for the ANN vector queries. The standard ordering is not allowed here.
In fact the ordering is done on the Vector Store service side within
the ANN search, so that the returned primary keys are already sorted
accordingly.
If left unchanged, the filtering with `IN` clauses would cause
a `bad_function_call` server error as the filtering with `IN` clauses
require the post query ordering in a standard case.
Such rebuild has no read_from replica, but we know the tablet size will be 0.
If we don't, stats will be incomplete until the next refresh.
This is important for test cases which do removenode or replace while
all replicas are down. So for example test_replace from
test_tablets_removenode.py, which uses RF=1 and replaces a node.
Without this, the test waits for 60s needlessly after the first round
of rebuilding migrations before scheduling more migrations. This can
cause the test to time out.
Fixes#28115Closesscylladb/scylladb#28121
Most likely 817fdad uncovered the fact that our choice of primary replica was resonating with tablet allocation and we were ending up picking the same replica as primary within a scope instead of rotating primaryship among all replicas in the scope.
This created situations where for instance, restoring into a 9 nodes with primary_replica_only=true would put all data into 3 nodes, leaving the other 6 unused. The balancing of the dataset was performed by the subsequent repair step.
This PR fixes this by changing the formula for picking up the primary replica out of a set of eligible replicas from within the passed scope.
The PR also extends the testing scenarios in `test_backup.py` so we get to run restore for a set of topologies, for all combinations of scope, primary_replica_only and min_tablet_counts.
Most of the work was done by @bhalevy [here](https://github.com/scylladb/scylladb/compare/master...bhalevy:scylla:load-balance-primary-replica), this PR just splitted it and did touchups here and there.
Fixes#27281Closesscylladb/scylladb#27397
* github.com:scylladb/scylladb:
test: reduce dataset and number of test cases or debug builds
test: bump repair timeout up, it's sometimes not enough in CI
test: refactor test_refresh.py to match test_restore_with_streaming_scopes.
test: extend test_restore_with_streaming_scopes
test: Adjust test_restore_primary_replica_different_dc_scope_all
test: Refactor restoring code in test_backup to match SM pattern
test: add check_mutation_replicas calls after fresh creation of dataset
test: extend create_dataset to accept consistency_level
test: refactor check_mutation_replicas so it's more readable
test: make create_dataset async and refactor so it's configurable
test: use defaultdict in collect_mutations
test: add log marks to facilitate reusing server for restore
locator: tablets: Distribute data evenly among primary replicas during restore
To configure S3 storage, one needs to do
```
object_storage_endpoints:
- name: s3.us-east-1.amazonaws.com
port: 443
https: true
aws_region: us-east-1
```
and for GCS it's
```
object_storage_endpoints:
- name: https://storage.googleapis.com:433
type: gs
credentials_file: <gcp account credentials json file>
```
This PR updates the S3 part to look like
```
object_storage_endpoints:
- name: https://s3.us-east-1.amazonaws.com:443
aws_region: us-east-1
```
fixes: #26570
This is 2nd attempt, previous one (#27360) was reverted because it reported endpoint configs in new format via API and CQL always, even if the endpoint was configured in the old way. This "broke" scylla manager and some dtests. This version has this bug fixed, and endpoints are reported in the same format as they were configured with.
About correctness of the changes.
No modifications to existing tests are made here, so old format is respected correctly (as far as it's covered by tests). To prove the new format works the the test_get_object_store_endpoints is extended to validate both options. Some preparations to this test to make this happen come on their own with the PR #28111 to show that they are valid and pass before changing the core code.
Enhancing the way configuration is made, likely no need to backport.
Closesscylladb/scylladb#28112
* github.com:scylladb/scylladb:
test: Validate S3 endpoints new format works
docs: Update docs according to new endpoints config option format
object_storage: Create s3 client with "extended" endpoint name
s3/storage: Tune config updating
sstable: Shuffle args for s3_client_wrapper
test: Rename badconf variable into objconf
test: Split the object_store/test_get_object_store_endpoints test
When a counter write times out (due to rpc::timeout_error or timed_out_error),
the code was throwing mutation_write_timeout_exception but not marking the
write_timeouts metric. This resulted in counter write timeouts not being
counted in the scylla_storage_proxy_coordinator_write_timeouts metric.
Regular writes go through mutate_internal -> mutate_end, which catches
mutation_write_timeout_exception and marks the metric. However, counter
writes use a separate code path (mutate_counters) that has its own
exception handling but was missing the metric update.
This fix adds get_stats().write_timeouts.mark() before throwing the
timeout exception in the counter write path, consistent with how the
CAS path handles cas_write_timeouts.
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-245
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#28019
A replaced node may have pending operation on it. The replace operation
will move the node into the 'left' state and the request will never be
completed. More over the code does not expect left node to have a
request. It will try to process the request and will crash because the
node for the request will not be found.
The patch checks is the replaced node has peening request and completes
it with failure. It also changes topology loading code to skip requests
for nodes that are in a left state. This is not strictly needed, but
makes the code more robust.
Fixes#27990Closesscylladb/scylladb#28009
Refs #22733.
Adds runtime warning and docs info that replicated provider is deprecated and will be removed.
Fixes#27292Closesscylladb/scylladb#27270
* github.com:scylladb/scylladb:
docs::encryption: Add warning that replicated provider is deprecated
ent::encryption: Switch default key provider from replicated to local
replicated_key_provider: Add deprecation warning on usage
Refs #27429
re-implement the dtest with same name as a scylla pytest, using
a python level network proxy instead of tcpdump etc. Both to avoid
sudo and also to ensure we don't race.
v2:
* Included positive test (mode=all)
`test_schema_versioning_with_recovery` is currently flaky. It performs
a write with CL=ALL and then checks if the schema version is the same on
all nodes by calling `verify_table_versions_synced`. All nodes are expected
to sync their schema before handling the replica write. The node in
RECOVERY mode should do it through a schema pull, and other nodes should do
it through a group 0 read barrier.
The problem is in `verify_local_schema_versions_synced` that compares the
schema versions in `system.local`. The node in RECOVERY mode updates the
schema version in `system.local` after it acknowledges the replica write
as completed. Hence, the check can fail.
We fix the problem by making the function wait until the schema versions
match.
Note that RECOVERY mode is about to be retired together with the whole
gossip-based topology in 2026.2. So, this test is about to be deleted.
However, we still want to fix it, so that it doesn't bother us in older
branches.
Fixes#23803Closesscylladb/scylladb#28114
Problem
-------
Secondary indexes are implemented via materialized views under the
hood. The way an index behaves is determined by the configuration
of the view. Currently, it can be modified by performing the CQL
statement `ALTER MATERIALIZED VIEW` on it. However, that raises some
concerns.
Consider, for instance, the following scenario:
1. The user creates a secondary index on a table.
2. In parallel, the user performs writes to the base table.
3. The user modifies the underlying materialized view, e.g. by setting
the `synchronous_updates` to `true` [1].
Some of the writes that happened before step 3 used the default value
of the property (which is `false`). That had an actual consequence
on what happened later on: the view updates were performed
asynchronously. Only after step 3 had finished did it change.
Unfortunately, as of now, there is no way to avoid a situation like
that. Whenever the user wants to configure a secondary index they're
creating, they need to do it in another schema change. Since it's
not always possible to control how the database is manipulated in
the meantime, it leads to problems like the one described.
That's not all, though. The fact that it's not possible to configure
secondary indexes is inconsistent with other schema entities. When
it comes to tables or materialized views, the user always have a means
to set some or even all of the properties during their creation.
Solution
--------
The solution to this problem is extending the `CREATE INDEX` CQL
statement by view properties. The syntax is of form:
```
> CREATE INDEX <index name>
> .. ON <keyspace>.<table> (<columns>)
> .. WITH <properties>
```
where `<properties>` corresponds to both index-specific and view
properties [2, 3]. View properties can only be used with indexes
implemented with materialized views; for example, it will be impossible
to create a vector index when specifying any view property (see
examples below).
When a view property is provided, it will be applied when creating the
underlying materialized view. The behavior should be similar to how
other CQL statements responsible for creating schema entities work.
High-level implementation strategy
----------------------------------
1. Make auxiliary changes.
2. Introduce data structures representing the new set of index
properties: both index-specific and those corresponding to the
underlying view.
3. Extend `CREATE INDEX` to accept view properties.
4. Extend `DESCRIBE INDEX` and other `DESCRIBE` statements to include
view properties in their output.
User documentation is also updated at the steps to reflect the
corresponding changes.
Implementation considerations
-----------------------------
There are a number of schema properties that are now obsolete. They're
accepted by other CQL statements, but they have no effect. They
include:
* `index_interval`
* `replicate_on_write`
* `populate_io_cache_on_flush`
* `read_repair_chance`
* `dclocal_read_repair_chance`
If the user tries to create a secondary index specifying any of those
keywords, the statement will fail with an appropriate error (see
examples below).
Unlike materialized views, we forbid specifying the clustering order
when creating a secondary index [4]. This limitation may be lifted
later on, but it's a detail that may or may not prove troublesome. It's
better to postpone covering it to when we have a better perspective on
the consequences it would bring.
Examples
--------
Good examples
```
> CREATE INDEX idx ON ks.t (v);
> CREATE INDEX idx ON ks.t (v) WITH comment = 'ok view property';
> CREATE INDEX idx ON ks.t (v)
.. WITH comment = 'multiple view properties are ok'
.. AND synchronous_updates = true;
> CREATE INDEX idx ON ks.t (v)
.. WITH comment = 'default value ok'
.. AND synchronous_updates = false;
```
Bad examples
```
> CREATE INDEX idx ON ks.t (v) WITH replicate_on_write = true;
SyntaxException: Unknown property 'replicate_on_write'
> CREATE INDEX idx ON ks.t (v)
.. WITH OPTIONS = {'option1': 'value1'}
.. AND comment = 'some text';
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Cannot specify options for a non-CUSTOM index"
> CREATE CUSTOM INDEX idx ON ks.t (v)
.. WITH OPTIONS = {'option1': 'value1'}
.. AND comment = 'some text';
InvalidRequest: Error from server: code=2200 [Invalid query]
message="CUSTOM index requires specifying the index class"
> CREATE CUSTOM INDEX idx ON ks.t (v)
.. USING 'vector_index'
.. WITH OPTIONS = {'option1': 'value1'}
.. AND comment = 'some text';
InvalidRequest: Error from server: code=2200 [Invalid query]
message="You cannot use view properties with a vector index"
> CREATE INDEX idx ON ks.t (v) WITH CLUSTERING ORDER BY (v ASC);
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Indexes do not allow for specifying the clustering order"
```
and so on. For more examples, see the relevant tests.
References:
[1] https://docs.scylladb.com/manual/branch-2025.4/cql/cql-extensions.html#synchronous-materialized-views
[2] https://docs.scylladb.com/manual/branch-2025.4/cql/secondary-indexes.html#create-index
[3] https://docs.scylladb.com/manual/branch-2025.4/cql/mv.html#mv-options
[4] https://docs.scylladb.com/manual/branch-2025.4/cql/dml/select.html#ordering-clauseFixesscylladb/scylladb#16454
Backport: not needed. This is an enhancement.
Closesscylladb/scylladb#24977
* github.com:scylladb/scylladb:
cql3: Extend DESC INDEX by view properties
cql3: Forbid using CLUSTERING ORDER BY when creating index
cql3: Extend CREATE INDEX by MV properties
cql3/statements/create_index_statement: Allow for view options
cql3/statements/create_index_statement: Rename member
cql3/statements/index_prop_defs: Re-introduce index_prop_defs
cql3/statements/property_definitions: Add extract_property()
cql3/statements/index_prop_defs.cc: Add namespace
cql3/statements/index_prop_defs.hh: Rename type
cql3/statements/view_prop_defs.cc: Move validation logic into file
cql3/statements: Introduce view_prop_defs.{hh,cc}
cql3/statements/create_view_statement.cc: Move validation of ID
schema/schema.hh: Do not include index_prop_defs.hh
A data_sink that stores buffers into an in-memory collection had
appeared in seastar recently. In Scylla there's similar thing that uses
memory_data_sink_buffer as a container, so it's possible to drop the
data_sink_impl iself in favor of seastar implementation.
For that to work there should be append_buffers() overload for the
aforementioned container. For its nice implementation the container, in
turn, needs to get push_back() method and value_type trait. The method
already exists, but is called put(), so just rename it. There's one more
user of it this method in S3 client, and it can enjoy the added
append_buffers() helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28124
The instructions for building optimized clang neglected to mention
that the clang version to be built must be specified. Correct that.
Closesscylladb/scylladb#28135
fmt::localtime() is now deprecated, users should migrate to equivalents
from the standard libraries.
std::localtime is not thread safe, so a local wrapper is introduced,
based on the thread-safe localtime_r() (from libc).
Closesscylladb/scylladb#27821
In PR 5b6570be52 we introduced the config option
`sstable_compression_user_table_options` to allow adjusting the default
compression settings for user tables. However, the new option was hooked
into the CQL layer and applied only to CQL base tables, not to the whole
spectrum of user tables: CQL auxiliary tables (materialized views,
secondary indexes, CDC log tables), Alternator base tables, Alternator
auxiliary tables (GSIs, LSIs, Streams).
Fix this by moving the logic into the `schema_builder` via a schema
initializer. This ensures that the default compression settings are
applied uniformly regardless of how the table is created, while also
keeping the logic in a central place.
Register the initializer at startup in all executables where schemas are
being used (`scylla_main()`, `scylla_sstable_main()`, `cql_test_env`).
Finally, remove the ad-hoc logic from `create_table_statement`
(redundant as of this patch), remove the xfail markers from the relevant
tests and adjust `test_describe_cdc_log_table_create_statement` to
expect LZ4WithDicts as the default compressor.
Fixes#26914.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Extend the `static_configurator` mechanism to support initialization of
arbitrary schema properties, not only static ones, by passing a
`schema_builder` reference to the configurator interface.
As part of this change, rename `static_configurator` to
`schema_initializer` to better reflect its broader responsibility.
Add a checkpoint/restore mechanism to allow de-registering an
initializer (useful for testing; will be used in the next patch).
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Schemas maintain a set of so-called "static properties". These are not
user-visible schema properties; they are internal values carried by
in-memory `schema` objects for convenience (349bc1a9b6,
https://github.com/scylladb/scylladb/pull/13170#issuecomment-1469848086).
Currently, the initialization of these properties happens when a
`schema_builder` builds a schema (`schema_builder::build()`), by
invoking all registered "static configurators".
This patch moves the initialization of static properties into the
`schema_builder` constructor. With this change, the builder initializes
the properties once, stores them in a data member, and reuses them for
all schema objects that it builds. This doesn't affect correctness as
the values produced by static configurators are "static" by
nature; they do not depend on runtime state.
In the next patch, we will replace the "static configurator" pattern
with a more general pattern that also supports initialization of regular
schema properties, not just static ones. Regular properties cannot be
initialized in `build()` because users may have already explicitly set
values via setters, and there is no way to distinguish between default
values and explicitly assigned ones.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The `sstable_compression_user_table_options` config option determines
the default compression settings for user tables.
In patch 2fc812a1b9, the default value of this option was changed from
LZ4 to LZ4WithDicts and a fallback logic was introduced during startup
to temporarily revert the option to LZ4 until the dictionary compression
feature is enabled.
Replace this fallback logic with an accessor that returns the correct
settings depending on the feature flag. This is cleaner and more
consistent with the way we handle the `sstable_format` option, where the
same problem appears (see `get_preferred_sstable_version()`).
As a consequence, the configuration option must always be accessed
through this accessor. Add a comment to point this out.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
In patches 11f6a25d44 and 7b9428d8d7 we added tests to verify that
auxiliary tables for both CQL and Alternator have the same default
compression settings as their base tables. These tests do not check
where these defaults originate from; they just verify that they are
consistent.
Add some more tests to verify the actual source of the defaults, which
is expected to be the `sstable_compression_user_table_options`
from the configuration. Unlike the previous tests, these tests require
dedicated Scylla instances with custom configuration, so they must be
placed under `test/cluster/`.
Mark them as xfail-ing. The marker will be removed later in this series.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Read timeouts are a common occurence and they typically occur when the replica is overloaded. So throwing exceptions for read timeouts is very harmful. Be careful not to thow exceptions while propagating them up the future chain. Add a test to enfore and detect regressions.
Fixes: scylladb/scylladb#25062
Improvement, normally not a backport candidate, but we may decide to backport if customer(s) are found to suffer from this.
Closesscylladb/scylladb#25068
* github.com:scylladb/scylladb:
reader_permit: remove check_abort()
test/boost/database_test: add test for read timeout exceptions
sstables/mx/reader: don't throw exceptions on the read-path
readers/multishard: don't throw exceptions on the read-path
replica/table: don't throw exceptions on the read-path
multishard_mutation_query: fix indentation
multishard_mutation_query: don't throw exceptions on the read-path
service/storage_proxy: don't throw exceptions on the full-scan path
cql3/query_processor: don't throw exceptions on the read-path
reader_permit: add get_abort_exception()
Refs #27429
If running with broadcast_address != listen/cql/rpc address, topology
gets confused about the varying addresses. Need to special case
resolve both addresses as "self". I.e. extend broadcast_address
treatment to cql_address as well.
Added export of this via gossiper for symmetry.
At the end of the test case, the framework greps logs for errors and
backtraces. The servers are still running at this point. Some test
cases enable debug-level logging. If servers manage to produce new
lines between the python script processes them, the grep will never
return.
Protect against this by grepping over a file snapshot.
Fixes#28086Closesscylladb/scylladb#28088
This patch adds a new document, docs/alternator/network.md,
explaining the various mechanisms that can be used to reduce
network usage in Alternator. It explains compression of requests
and responses, header reduction, rack-aware routing, and RPC compression.
Many of these topics - especially support in the client libraries -
are work in progress, so some details are still missing in the new
document. Still, I think it is a good start that can be improved
later.
Fixes#27915.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27927
Refactor streams.cc - turn `.then` calls into coroutines.
Reduces amount of clutter, lambdas and referenced variables.
Note - the code is kept at the same indentation level to ease review,
the next commit will fix this.
Extend the test_get_object_store_endpoints() test to configure S3
endpoints in full-url format and check that they are rendered properly
via API/CQL.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
For this, add the s3::client::make(endpoint, ...) overload that accepts
endpoint in proto://host:port format. Then it parses the provided url
and calls the legacy one, that accepts raw host string and config with
port, https bit, etc.
The generic object_storage_endpoint_param no longer needs to carry the
internal s3::endpoint_config, the config option parsing changes
respectively.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Don't prepare s3::endpoint_config from generic code, jut pass the region
and iam_role_arn (those that can potentially change) to the callback.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it construct like gs_client_wrapper -- with generic endpoint param
reference and make the storage-specific casts/gets/whatever internally.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It tests two things -- the way object storage config is represented via
API and CQL (from sytem.config) and that updating config affects CREATE
KEYSPACE CQL (with keyspace storage options)
It's better to split the test, as its former part is going to be
extented to validate old/new config formats (see #26570)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
to test restoring with a different min_tablet_count
than the schema was originally created with.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
This patch refactors the restoring code in cluster/test_backup.py
so it matches better the way SM works.
The patch also refactors test_restore_with_streaming_scopes so to
facilitate running restore scenarios under all supported scopes
with or w/o primary_replica_only enabled by reusing the servers
and backups for a topology. This allows us to test a lot more scenarios
without making the test impossibly slow.
split from bhalevy/load-balance-primary-replica
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
to validate that mutation assertions are sane
split from bhalevy/load-balance-primary-replica
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Most likely 817fdad uncovered the fact that our choice of
primary replica was resonating with tablet allocation and we were ending up
picking the same replica as primary within a scope instead of rotating
primaryship among all replicas in the scope.
This created situations where for instance, restoring into a 9 nodes cluster
with primary_replica_only=true would put all data into 3 nodes, leaving
the other 6 unused. The balancing of the dataset was performed by the
subsequent repair step.
split from bhalevy/load-balance-primary-replica
Fixes#27281
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Address all errors reported by CodeQL as reported on https://github.com/scylladb/scylladb/security/quality.
This is a mixed bag, with some harmless issues, while others are severe problems which will result in the code breaking (if it is even run). I suspect some of the more severe problems were found in dead code that is not used at all -- hence nobody noticed.
Still, these issues are good to fix, so we can reduce noise in the reports and improve the maintainability of the code.
Code cleanup, no backport
Closesscylladb/scylladb#27838
* github.com:scylladb/scylladb:
pgo/pgo.py: don't mutate input params
test/pylib/coverage_utils.py: profdata_to_lcov: don't mutate defaulted param
test/cluster/dtest/tools/misc.py: add type annotations to list_to_hashed_dict()
idl-compiler.py: raise TypeError instead of raw str
test/pylib/lcov_utils.py: don't call set when iterating over it
configure.py: move away from .format(**locals())
test/cluster/object_store/conftest.py: add missing call to parent constructor
idl-compiler.py: add missing call to parent class constructor
tools/scyllatop/fake.py: pass correct number of args to _add_metric
This patch adds a second reproducer for issue #25839, which is about
scanning a secondary index which returns partial results. The new test
uses count(*) without requesting the row themselves, but still has the
same problem of counting only part of the rows. This is the problem that
a user reported in issue #28026.
Unlike the previous test, this test works correctly on older versions
of Scylla - by using larger data, like on Cassandra - without changing
a configuration variable that did not yet exist. So with this test we
can confirm that this bug is a Scylla 5.2 regression:
test/cqlpy/run --release 5.1 test_secondary_index.py::test_short_count
passes, while
test/cqlpy/run --release 5.2 test_secondary_index.py::test_short_count
fails. It also fails on master, so the new test is marked "xfail".
Refs #25839
Refs #28026
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28108
Addresses outstanding review comments from PR #22961 where SSL field
collection was refactored into generic_server::connection base class.
This patch consists of minor cosmetic enhancements for increased
readability, mainly, with some minor fixups explained in specific
commits.
Cosmetic changes, no need to backport.
Closesscylladb/scylladb#27575
* github.com:scylladb/scylladb:
test_ssl: fix indentation
generic_server: improve logging broken TLS connection
test_ssl: improve timeout and readability
alternator/server: update SSL comment
This method can cause performance regressions if used in the wrong place
-- namely if it is used to abort reads by throwing the abort exception.
Exceptions should be propagated during reads without throwing them,
otherwise they cause extra CPU load, making a bad situation worse.
Remove this method, so it doesn't accidentally get more users, migrate
remaining users to get_abort_exception().
Read timeouts shouldn't trigger exceptions thrown, exceptions should be
solely propagated via futures, otherwise they put extra strain on the
system at the worst possible time: when it is overload already enough
that reads started to time out.
The test covers both single partition reads and full scans, with two
scenarios:
* timeout while the read is queued
* timeout when the read is already ongoing
If the read is aborted via the permit (due to timeout) don't throw the
abort exception, instead propagate it via the future chain.
Also, use try_catch<> instead of try ... catch to decorate
malformed_sstable_exception with the file name.
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
Use coroutine::as_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
Not using coroutine::try_future(), because on the error path, the
querier has to be closed.
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
Will replace check_abort(). The latter throws an exception which is
something we want to avoid when a read is aborted, in particular when it
times out.
Also add a convenience get_abort_exception() method to mutation_reader.
We add a test that validates that indexed queries
do not throw a warning related to vector search paging
Fixes: SCYLLADB-248
Closesscylladb/scylladb#28077
We have a test in test_compressed_response.py that reproduces a bug
where in Alternator's signature checking code, if a header had multiple
consecutive spaces its signature isn't checked correctly.
This patch fixes this and that xfailing test begins to pass.
But it turns out that the handling of multiple consecutive spaces in
headers when calculating the authentication signature is just one example
of "header canonization" that the AWS Signature V4 specification requires
us to do. There are additional types of header canonization that Alternator
must do, and this patch also adds new tests in test_authorization.py for
checking *all* the types of canonization.
Fortunately, for all other types of canonizations, we already handled
them correctly - Alternator already lowercases header names, sorts them
alphabetically and removes leading and trailing spaces before calculating
the signature. So most of the new tests added pass also without this patch,
and only one of them, test_canonization_middle_whitespace, needs this
patch to pass. As usual, all the new tests also pass on DynamoDB.
Fixes#27775
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28102
With new UI Jenkins escaping the HTML tags during rendering to prevent
XSS. This will show just link without custom name as a string that can
be copied and then pasted to navigate to the failed directory.
Closesscylladb/scylladb#28062
To fix the problem, we need to remove the first, redundant definition of
test_gossiper_unreachable_endpoints (lines 19-24). The second definition
(lines 25-40) should be retained as it has more substantial test logic.
No other code changes or imports are needed, as the test logic is
preserved fully in the retained definition.
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27632
Potential fix for code scanning alert no. 167: Workflow does not contain permissions
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27819
Potential fix for code scanning alert no. 145: Workflow does not contain permissions
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27808
Try to teach CoPi a bit about how we'd like to see it implement tests, according to this repo best practices.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#28032
Following 954f2cbd2f, which added proxy protocol v2 listeners
for CQL, we do the same for alternator. We add two optional ports
for plain and TLS-wrapped HTTP.
We test each new port, that the old ports still work, and that
mixing up a port with no proxy protocol and a connection with proxy
protocol (or the opposite) fails. The latter serves to show
that the testing strategy is valid and doesn't just pass whatever
happens. We also verify that the correct addresses (and TLS mode)
show up in system.clients.
Closesscylladb/scylladb#27889
Currently Alternator supports compressed requests in the gzip format
with "Content-Encoding: gzip". We did not support any other compression
formats.
It turns out that DynamoDB also supports the "deflate" encoding.
The "deflate" format is just a small variant of gzip and also supported
by the same zlib library that we already use, so it is very easy
to add support for it as well. So this patch adds it.
Beyond compatibility with DynamoDB, another benefit of this patch is
symmetry with our response compression support (PR #27454), where
we supported both gzip and deflate compression of responses - so
we should support the same for requests.
This patch also adds tests for Content-Encoding: deflate, which pass
on DynamoDB (proving that "deflate" is indeed supported there).
On Alternator the new tests failed before this patch and pass with
this patch.
Refs #27243 (which asks to support more compression formats).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27917
This is a translation of Cassandra's CQL unit test source file
validation/operations/InsertInvalidateSizedRecordsTest.java into our
cqlpy framework.
This is one of the tests added to Cassandra as part of the vector
search work, but actually has nothing to do with vector search -
it checks what happens when key columns of different types exceeed
their maximum size (64KB).
Unfortunately, each one of the tests added here *fail* on ScyllaDB,
providing more reproducers for two already known issues (which
already had plenty of reproducers...):
Refs #8627 Cleanly reject updates with indexed values where value > 64k
Refs #12247 Better error reporting for oversized keys during INSERT
One of the tests also fails on Cassandra, due to CASSANDRA-19270.
It is not clear to me how this unit test actually passed on Cassandra,
I can only guess that the Python driver somehow makes the request
differently than what the Java unit tests use to make requests to
Cassandra.
One of the tests in the original Cassandra source file I did not
translate, readingEmptyStringsForDifferentTypes, because it tests
cqlsh, not pure CQL.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27944
It is considered a dangerous practice with possible unintended
side-effects, affecting later calls to the same function.
Found by CodeQL "Modification of parameter with default".
It is considered a dangerous practice as it creates a side-effect for
later calls to the same function.
Create a new variable instead and mutate that. Also remove the unused
update_known_ids parameter, which defaults to True and no caller changes
it. Passing False to this param also seem to have no effect. Instead of
trying to guess what the desired effect of passing False is and fixing
it, just remove this unused param.
Found by CodeQL "Modification of parameter with default".
To hopefully shut up CodeQL "Iterable can be either a string or a sequence".
This change makes the code more readable anyway, so it is more than just
a gratuitous change to make some code-scanner happy.
Unlike in C++, in Python one can only throw objects which inherit from
Exception. The message complains about wrong type so wrap it in
TypeError before passing to raise.
Found by CodeQL "Illegal raise".
Use f strings instead, they are just as convenient with the added bonus
of editors providing syntax highighting for it.
Additionally, this shuts up CodeQL complaint about "Suspicious unused
loop iteration variable" in loops where the loop variable was passed to
format indirectly via **locals().
Replace manual init of parent fields.
Found by CodeQL: "Missing call to superclass `__init__` during object
initialization".
The secret_key is not initialized to server.secret_key, instead of
server.access_key. This probably fixes a (benign) bug.
Since Scylla 6.0, service levels are manged by Raft group0.
This patch updates table name used by service levels and adds a
paragraph describing service levels on raft.
Fixesscylladb/scylladb#18177Closesscylladb/scylladb#26556
Write the boost logs into stdout in HRF format and in XML to the file. The XML file will be used for parsing and providing the error information in the summary section of the fail.
Fixes: https://github.com/scylladb/scylladb/issues/28045
Framework enhancements, no need to backport.
Closesscylladb/scylladb#28107
* github.com:scylladb/scylladb:
test.py: remove XML log from fail summary
test.py: fix truncated boost output to stdout file
Otherwise, coordinator may not react to changing stats after explicit
calls to trigger_load_stats_refresh() done on node replace or table
creation, if stats take longer to refresh than it takes the
coordinator to go idle.
The periodic refresh does wake up the topology coordinator, so the
issue is not dramatic in production, but it's annoying in tests, which
take longer because of that.
Fixes#25163
Refreshing stats will signal _topo_sm.event, so do it before waiting
for the event, to avoiding busy looping in the coordinator.
This will produce lots of logs in test cases which enable debug-level
logging in the raft logger.
Refs #28086
Checking for work is not atomic, so there is room for missed
notification. Especially that notifications are not always triggered
from fibers which take the group0 guard.
Fix by subscribing for the event before checking for work.
Fixes#27958
We switched to the size-based load balancing, which now has more
strict requirements for load stats. We no longer need only per-node
stats (capacity), but also per-tablet stats.
Bootstrapping a node triggers stats refresh, but allocating tablets on
table creation didn't. So after creating a table, load balancer
couldn't make progress for up to 60s (stats refresh period).
This makes tests take longer, and can even cause failures if tests are
using a low-enough timeout.
Fixes#27921
Stats refresh will be triggered on topology coordinator by events like
allocating new tablets on table creation. For refresh to be effective,
all replicas must see the new tablets, otherwise stats will be
incomplete.
If a test tries to move a tablet, it assumes the tablets are stable.
This fixes flakiness exposed by size-based load-balancing and a later
change to refresh stats sooner.
It's a global operation, so we can use any server.
It's not only convenient. The call via api.disable_tablet_balancing()
confuse people to think that it's a per-server operation. This leads
to proliferation of code which does it needlessly on all servers.
Change the behavior of the catching the boost log output. With this
change boost will output it's logging to stdour with HRF format and to
the tempfile in XML format. This will help for easier debuggint when all
messages will be in the output file and still in the fail summary.
This will move responsibility for running tests with pytest in the same manner as it was done with boost tests. From this commit, test.py is not responsible anymore for running python tests and relies completely on pytest.
This is another step for unification of test execution.
Convert skip_mode function to `pytest.mark` to be able to use to annotate the whole module instead of each test explicitly.
NOTE: this is a breaking change. From this commit, several directories with tests will require a path to the file to launch the test. Affected directories
test/alternator
test/broadcast_tables
test/cql
test/cqlpy
test/rest_api
Changes only in framework, so no backport.
This PR will increase the amount of the tests by 30 test, due to the fact that how test.py and pytest discover tests. test.py count a file as a test, and when skip used in suite.yaml it will exclude the tests from discovery completely.
While the pytest count test funstion as a test and uses skip_mode mark and will discover the tests, but it will skip them during execution, hence the difference
test.py output before PR:
```bash
> ./test.py --mode=release rest_api/test_compaction_task rest_api/test_task_manager --list --no-gather-metrics
```
test.py output in this PR:
```bash
> ./test.py --mode=release test/rest_api/test_compaction_task.py test/rest_api/test_task_manager.py --list
rest_api/test_compaction_task.py::test_global_major_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_major_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_cleanup_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_offstrategy_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_rewrite_sstables_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_reshaping_compaction_task.release.1
rest_api/test_compaction_task.py::test_resharding_compaction_task.release.1
rest_api/test_compaction_task.py::test_regular_compaction_task.release.1
rest_api/test_compaction_task.py::test_compaction_task_abort.release.1
rest_api/test_compaction_task.py::test_major_keyspace_compaction_task_async.release.1
rest_api/test_compaction_task.py::test_cleanup_keyspace_compaction_task_async.release.1
rest_api/test_compaction_task.py::test_offstrategy_keyspace_compaction_task_async.release.1
rest_api/test_compaction_task.py::test_rewrite_sstables_keyspace_compaction_task_async.release.1
rest_api/test_compaction_task.py::test_compaction_progress[major_keyspace_compaction_task_impl_run_fail].release.1
rest_api/test_compaction_task.py::test_compaction_progress[shard_major_keyspace_compaction_task_impl_run_fail].release.1
rest_api/test_compaction_task.py::test_compaction_progress[table_major_keyspace_compaction_task_impl_run_fail].release.1
rest_api/test_task_manager.py::test_task_manager_modules.release.1
rest_api/test_task_manager.py::test_task_manager_tasks.release.1
rest_api/test_task_manager.py::test_task_manager_status_running.release.1
rest_api/test_task_manager.py::test_task_manager_status_done.release.1
rest_api/test_task_manager.py::test_task_manager_status_failed.release.1
rest_api/test_task_manager.py::test_task_manager_not_abortable.release.1
rest_api/test_task_manager.py::test_task_manager_wait.release.1
rest_api/test_task_manager.py::test_task_manager_ttl.release.1
rest_api/test_task_manager.py::test_task_manager_user_ttl.release.1
rest_api/test_task_manager.py::test_task_manager_sequence_number.release.1
rest_api/test_task_manager.py::test_task_manager_recursive_status.release.1
rest_api/test_task_manager.py::test_module_not_exists.release.1
rest_api/test_task_manager.py::test_task_folding.release.1
rest_api/test_task_manager.py::test_abort_on_unregistered_task.release.1
```
Fixes: https://github.com/scylladb/scylladb/issues/27716Closesscylladb/scylladb#26395
* github.com:scylladb/scylladb:
test.py: fix test_vector_similarity.py
docs: add directories excluded from test.py
test.py: prevent file descriptors leaking
test.py: capture print inside the test
test.py: do not print header for collection with test.py
test.py: remove not supported functionality
test.py: switch of execution of several test directories by test.py runner
test.py: integrate python tests to be executed with pytest runner
test.py: fix test/vector_search_validator to be able to run with pytest
test.py: prepare base class for migration
test.py: move environment preparation to one method
test.py: introduce new environment variable TESTPY_PREPARED_ENVIRONMENT
repair: Implement auto repair for tablet repair
This patch implements the basic auto repair support for tablet repair.
It was decided to add no per table configuration for the initial
implementation, so two scylla yaml config options are introduced to set
the default auto repair configs for all the tablet tables.
- auto_repair_enabled_default
Set true to enable auto repair for tablet tables by default. The value
will be overridden by the per keyspace or per table configuration which
is not implemented yet.
- auto_repair_threshold_default_in_seconds
Set the default time in seconds for the auto repair threshold for tablet
tables. If the time since last repair is bigger than the configured
time, the tablet is eligible for auto repair. The value will be
overridden by the per keyspace or per table configuration which is not
implemented yet.
The following metrcis are added:
- auto_repair_needs_repair_nr
The number of tablets with auto repair enabled that needs repair
- auto_repair_enabled_nr
The number of tablets with auto repair enabled
The metrics are useful to tell if auto repair is falling behind.
In the future, more auto repair scheduling will be added, e.g.,
scheduling based on the repaired and unrepaired sstable set size,
tombstone ratio and so on, in addition to the time based scheduling.
Fixes SCYLLADB-99
New feature. No backport.
Closesscylladb/scylladb#27534
* github.com:scylladb/scylladb:
topology_coordinator: Add metrics for tablet repair
repair: Implement auto repair for tablet repair
This PR:
- Replaces a fixed version name with the variable for the current version in the instructions for installing a non-default version with Web Installer. This will make using the installer more user-friendly.
- Removes the instruction for Open Source from the Web Installer docs.
Fixes https://github.com/scylladb/scylladb/issues/28005
Fixes https://github.com/scylladb/scylladb/issues/28079Closesscylladb/scylladb#28046
* github.com:scylladb/scylladb:
doc: remove the instruction for Open Source from the Web Installer docs
doc: add the version variable to the Web Installer instructions
- scylla_tablet_ops_failed
Number of failed tablet {auto, user} repair
- scylla_tablet_ops_succeeded
Number of succeeded tablet {auto, user} repair
Currently auto_repair and user_repair tablet task are added. We can add
more tablet tasks later, e.g., rebuild, migration.
Create and drop view operations are currently performed on all shards, and their execution is not fully serialized. On slower processors this can lead to interleavings that leave stale entries in `system.scylla_views_build`
A problematic sequence looks like this:
* `on_create_view()` runs on shard 0 → entries for shard 0 and shard 1 are created
* `on_drop_view()` runs on shard 0 → entry for shard 0 is removed
* `on_create_view()` runs on shard 1 → entries for shard 0 and shard 1 are created again
* `on_drop_view()` runs on shard 1 → entry for shard 1 is removed, while the shard 0 entry remains
This results in a leftover row in `system.scylla_views_builds_in_progress`, causing `view_build_test.cc` to get stuck indefinitely in an eventual state and eventually be terminated by CI.
This patch fixes the issue by fully serializing all view create and drop operations through shard 0. Shard 0 becomes the single execution point and notifies other shards to perform their work in order. Requests originating.
new process:
- view_builder::on_create_view(...) runs only on shard 0 and kicks off dispatch_create_view(...) in the background.
- dispatch_create_view(...) (shard 0) first checks should_ignore_tablet_keyspace(...) and returns early if needed.
- dispatch_create_view(...) calls handle_seed_view_build_progress(...) on shard 0. That:
- writes the global “build progress” row across all shards via _sys_ks.register_view_for_building_for_all_shards(...).
- After seeding, dispatch_create_view(...) broadcasts to all shards with container().invoke_on_all(...).
- Each shard runs handle_create_view_local(...), which:
- waits for pending base writes/streams, flushes the base,
- resets the reader to the current token and adds the new view,
- handles errors and triggers _build_step to continue processing.
Drop view
- view_builder::on_drop_view(...) runs only on shard 0 and kicks off dispatch_drop_view(...) in the background.
- dispatch_drop_view(...) (shard 0) first checks should_ignore_tablet_keyspace(...) and returns early if needed.
- It broadcasts handle_drop_view_local(...) to all shards with invoke_on_all(...).
- Each shard runs handle_drop_view_local(...), which:
- removes the view from local build state (_base_to_build_step and _built_views) by scanning existing steps,
- ignores missing keyspace cases.
- After all shards finish local cleanup, shard 0 runs handle_drop_view_global_cleanup(...), which:
- removes global build progress, built‑view state, and view build status in system tables,
Shutdown
- drain() waits on _view_notification_sem before _sem so in‑flight dispatches finish before bookkeeping is halted.
In addition, the test is adjusted to remove the long eventual wait (596.52s / 30 iterations) and instead rely on the default wait of 17 iterations (~4.37 minutes), eliminating unnecessary delays while preserving correctness.
Fixes: https://github.com/scylladb/scylladb/issues/27898
Backport: not required as the problem happens on master
Closesscylladb/scylladb#27929
To prepare for implementation of filtering we skip validation
of where clauses in vector search queries. All queries that would
be blocked by the lack of ALLOW FILTERING now will pass through.
Fixes: VECTOR-410
Closesscylladb/scylladb#27758
We can run Alternator's tests against DynamoDB with `test/alternator/run --aws`, and our intention is that all except a few specially marked should pass on DynamoDB - indicating that the test itself is correct and checks compatibility with DynamoDB and not with some misunderstood spec.
Before this patch series, almost two dozen Alternator's tests failed on DynamoDB. This series fixes most of them.
Refs #26079 (it fixes almost all the problems but probably not all of them so let's keep the issue open for a while longer)
Closesscylladb/scylladb#27995
* github.com:scylladb/scylladb:
test/alternator: fix some expected error messages to fit DynamoDB
test/alternator: fix compressed request test on non-us-east1
test/alternator: fix test's expected error message on DynamoDB
test/alternator: mark Alternator-only test scylla_only
test/alternator: fix test on DynamoDB
test/alternator: increase wait_for_gsi() timeout
test/alternator: fix test passing a spurious parameter
These tools are deprecated and no longer shipped by ScyllaDB packages.
They no longer support the latest SSTable versions and ScyllaDB-only
features, like encryption and dictionary based compression.
Remove them from the documentation.
Closesscylladb/scylladb#27608
The interface of Jenkins has changed, and the instructions for creating
a token are out-of-date. This commit updates them.
Closesscylladb/scylladb#28054
There is a known limitation of the xdist.
Since it makes discovery in each thread, then compare it with master thread. The discovered lists of test should be the same. Sets are not order guaranteed, so they should not be used for parametrized testing, because discovery of the tests with using xdist will fail.
This PR just converts set to dist, to eliminate issue mentioned above.
Unused imports, unused variables and such.
Initially, there were no functional changes, just to get rid of some standard CodeQL warnings.
I've then broken the CI, as apparently there's a install time(!?) Python script creation for the sole purpose of product
naming. I changed it - we have it in etcdir, as SCYLLA-PRODUCT-FILE.
So added (copied from a different script) a get_product() helper function in scylla_util.py and used it instead.
While at it, also fixed the too broad import from scylla_util, which 'forced' me to also fix other specific imports (such as shutil).
Improvement - no need to backport.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#27883
We currently do it only for a bootstrapping node, which is a bug. The
missing IP can cause an internal error, for example, in the following
scenario:
- replace fails during streaming,
- all live nodes are shut down before the rollback of replace completes,
- all live nodes are restarted,
- live nodes start hitting internal error in all operations that
require IP of the replacing node (like client requests or REST API
requests coming from nodetool).
We fix the bug here, but we do it separately for replace with different
IP and replace with the same IP.
For replace with different IP, we persist the IP -> host ID mapping
in `system.peers` just like for bootstrap. That's necessary, since there
is no other way to determine IP of the replacing node on restart.
For replace with the same IP, we can't do the same. This would require
deleting the row corresponding to the node being replaced from
`system.peers`. That's fine in theory, as that node is permanently
banned, so its IP shouldn't be needed. Unfortunately, we have many
places in the code where we assume that IP of a topology member is always
present in the address map or that a topology member is always present in
the gossiper endpoint set. Examples of such places:
- nodetool operations,
- REST API endpoints,
- `db::hints::manager::store_hint`,
- `group0_voter_handler::update_nodes`.
We could fix all those places and verify that drivers work properly when
they see a node in the token metadata, but not in `system.peers`.
However, that would be too risky to backport.
We take a different approach. We recover IP of the replacing node on
restart based on the state of the topology state machine and
`system.peers` just after loading `system.peers`.
We rely on the fact that group 0 is set up at this point. The only case
where this assumption is incorrect is a restart in the Raft-based
recovery procedure. However, hitting this problem then seems improbable,
and even if it happens, we can restart the node again after ensuring
that no client and REST API requests come before replace is rolled back
on the new topology coordinator. Hence, it's not worth to complicate the
fix (by e.g. looking at the persistent topology state instead of the
in-memory state machine).
Fixes#28057
Backport this PR to all branches as it fixes a problematic bug.
Closesscylladb/scylladb#27435
* github.com:scylladb/scylladb:
gossiper: add_saved_endpoint: make generations of excluded nodes negative
test: introduce test_full_shutdown_during_replace
utils: error_injection: allow aborting wait_for_message
raft topology: preserve IP -> ID mapping of a replacing node on restart
Fixes#27992
When doing a commit log oversized allocation, we lock out all other writers by grabbing
the _request_controller semaphore fully (max capacity).
We thereafter assert that the semaphore is in fact zero. However, due to how things work
with the bookkeep here, the semaphore can in fact become negative (some paths will not
actually wait for the semaphore, because this could deadlock).
Thus, if, after we grab the semaphore and execution actually returns to us (task schedule),
new_buffer via segment::allocate is called (due to a non-fully-full segment), we might
in fact grab the segment overhead from zero, resulting in a negative semaphore.
The same problem applies later when we try to sanity check the return of our permits.
Fix is trivial, just accept less-than-zero values, and take same possible ltz-value
into account in exit check (returning units)
Added whitebox (special callback interface for sync) unit test that provokes/creates
the race condition explicitly (and reliably).
Closesscylladb/scylladb#27998
The current code:
```
try:
cql.execute(f"INSERT INTO {cf} (pk, t) VALUES (-1, 'x')", host=host[0], execution_profile=cl_one_profile).result()
except Exception:
pass
```
contains a typo: `host=host[0]` which throws an exception becase Host
object is not subscriptable. The test does not fail because the except
block is too broad and suppresses all exceptions.
Fixing the typo alone is insufficient. The write still succeeds because
the remaining nodes are UP and the query uses CL=ONE, so no failure
should be expected.
Another source of flakiness is data verification:
```
SELECT * FROM {cf} WHERE pk = 0;
```
Even when a coordinator is explicitly provided, using CL=ONE does not
guarantee a local read. The coordinator may forward the read request to
another replica, causing the verification to fail nondeterministically.
This patch rewrites the tests to address these issues:
- Fix the typo: `host[0]` to `hosts[0]`
- Verify data using `MUTATION_FRAGMENTS({cf})` which guarantees a local
read on the coordinator node
- Reconnect the driver after node restart
Fixes https://github.com/scylladb/scylladb/issues/27933Closesscylladb/scylladb#27934
With migration to the pytest, file descriptors will be hanged during the whole life of the process. Previously it was not an issue, because test.py was executing only one file with Popen, so descriptors will be freed with process done. With new approach they are blocked. This will allow to eliminate this.
Fix issue when we had issue with getting cluster and then trying to set it dirty while it None.
Put cluster to the pool only if it was created
In the current state pytest do not support the order of execution, so this parameter is removed. There is no big need in this due to the differences what pytest and test.py counted test. pytest run test functions in the threads, while test.py executed test files in the threads. That's why pytest's way is more granular and allows to fill threads better.
Remove skip node, since it already added as a pytest mark for each test in the file.
Remove pool_size, since this is not used by pytest at all. Pytest uses
xdist to set the amount of threads instead of pool_size used by test.py
With this commit test.py will lose ability to run tests by itself always bypassing execution to the pytest.
NOTE: this is a breaking change. From this commit, several directories
with tests will require a path to the file to launch the test.
Affected directories
test/alternator
test/broadcast_tables
test/cql
test/cqlpy
test/rest_api
With this commit test.py will be bypassing the tests execution to the pytest. However, it will still be able to run test by itself.
With providing test name like `broadcast_tables/test_broadcast_tables` it will execute test with test.py runner, but if the path to the file will be provided like `test/broadcast_tables/test_broadcast_tables.py` it will bypass execution to the pytest.
`--test-py-init` tells to run pytest session in test.py-compatible mode
Update the help text for the name parameter for test.py about changes
how it works and which directory is served by pytest
build_mode fixture have dynamic scope. It depends how the pytest is
executed. When it executed through test.py scope will be session and
since it's broader that package everything work fine. While with pure
pytest it will fail because build_mode will have module scope.
This fix allows to run tests with pure pytest, this needed for migration
test to be executed by pytest runner instead test.py.
Since all tests share the same base class and some of the tests executed by test.py and some with pytest, we need to handle two cases where configuration is located: suite.yaml and test_config.yaml
After full migration suite.yaml case will be removed
Since anyway these two methods should be called one by one in two different cases: when test.py executes test and pytest executes test, merging them into one. Additionally, set environment variable to show the underneath pytest process that environment was already prepared and there is no need to clean directories or start additional services.
Introduce the new environment variable that will be used to signalize to the pytest runner that environment war already prepared by test.py. This needed to be able to run the test with pytest and test.py(that actually will run pytest underneath).
Preiously we were logging a broken TLS connection and then this has been
logged later again, so now instead of logging we're constructing an
exception with a message extened with TLS info, which later will be
catched with its full message still logged.
1. With this change the test really waits 10s, previously (in case
something went wrong), the timeout could take way more than that.
2. Added `else` to above `if` to increase clarity of execution flow -
it doesn't change logic, but makes it more clear.
This patch implements the basic auto repair support for tablet repair.
It was decided to add no per table configuration for the initial
implementation, so two scylla yaml config options are introduced to set
the default auto repair configs for all the tablet tables.
- auto_repair_enabled_default
Set true to enable auto repair for tablet tables by default. The value
will be overridden by the per keyspace or per table configuration which
is not implemented yet.
- auto_repair_threshold_default_in_seconds
Set the default time in seconds for the auto repair threshold for tablet
tables. If the time since last repair is bigger than the configured
time, the tablet is eligible for auto repair. The value will be
overridden by the per keyspace or per table configuration which is not
implemented yet.
The following metrcis are added:
- auto_repair_needs_repair_nr
The number of tablets with auto repair enabled that needs repair
- auto_repair_enabled_nr
The number of tablets with auto repair enabled
The metrics are useful to tell if auto repair is falling behind.
In the future, more auto repair scheduling will be added, e.g.,
scheduling based on the repaired and unrepaired sstable set size,
tombstone ratio and so on, in addition to the time based scheduling.
Fixes SCYLLADB-99
Allow creating materialized views and secondary indexes in a tablets keyspace only if it's RF-rack-valid, and enforce RF-rack-validity while the keyspace has views by restricting some operations:
* Altering a keyspace's RF if it would make the keyspace RF-rack-invalid
* Adding a node in a new rack
* Removing / Decommissioning the last node in a rack
Previously the config option `rf_rack_valid_keyspaces` was required for creating views. We now remove this restriction - it's not needed because we always maintain RF-rack-validity for keyspaces with views.
The restrictions are relevant only for keyspaces with numerical RF. Keyspace with rack-list-based RF are always RF-rack-valid.
Fixesscylladb/scylladb#23345
Fixes https://github.com/scylladb/scylladb/issues/26820
backport to relevant versions for materialized views with tablets since it depends on rf-rack validity
Closesscylladb/scylladb#26354
* github.com:scylladb/scylladb:
docs: update RF-rack restrictions
cql3: don't apply RF-rack restrictions on vector indexes
cql3: add warning when creating mv/index with tablets about rf-rack
service/tablet_allocator: always allow tablet merge of tables with views
locator: extend rf-rack validation for rack lists
test: test rf-rack validity when creating keyspace during node ops
locator: fix rf-rack validation during node join/remove
test: test topology restrictions for views with tablets
test: add test_topology_ops_with_rf_rack_valid
topology coordinator: restrict node join/remove to preserve RF-rack validity
topology coordinator: add validation to node remove
locator: extend rf-rack validation functions
view: change validate_view_keyspace to allow MVs if RF=Racks
db: enforce rf-rack-validity for keyspaces with views
replica/db: add enforce_rf_rack_validity_for_keyspace helper
db: remove enforce parameter from check_rf_rack_validity
test: adjust test to not break rf-rack validity
Disabling of balancing waits for topology state machine to become idle, to guarantee that no migrations are happening or will happen after the call returns. But it doesn't interrupt the scheduler, which means the call can take arbitrary amount of time. It may wait for tablet repair to be finished, which can take many hours.
We should do it via topology request, which will interrupt the tablet scheduler.
Enabling of balancing can be immediate.
Fixes https://github.com/scylladb/scylladb/issues/27647Fixes#27210Closesscylladb/scylladb#27736
* https://github.com/scylladb/scylladb:
test: Verify that repair doesn't block disabling of tablet load balancing
tablets: Make balancing disabling call preempt tablet transitions
VECTOR_SEARCH_INDEXING permission didn't work on cdc tables as we mistakenly checked for vector indexes on the cdc table insted of the base.
This patch fixes that and adds a test that validates this behavior.
Fixes: VECTOR-476
Closesscylladb/scylladb#28050
Call discover_staging_sstables in view_update_generator::start() instead
of in the constructor, because the constructor is called during
initialization before sstables are loaded.
The initialization order was changed in 5d1f74b86a and caused this
regression. It means the view update generator won't discover staging
sstables on startup and view updates won't be generated for them. It
also causes issues in sstable cleanup.
view_update_generator::start() is called in a later stage of the
initialization, after sstable loading, so do the discovery of staging
sstables there.
Fixesscylladb/scylladb#27956Closesscylladb/scylladb#27970
Currently, database::truncate_table_on_all_shards calls the table::can_flush only on the coordinator shard
and therefore it may miss shards with dirty data if the coordinator shard happens to have empty memtables, leading to clearing the memtables with dirty data rather than flushing them.
This change fixes that by making flush safe to be called, even if the memtable list is empty, and calling it on every shard that can flush (i.e. seal_immediate_fn is engaged).
Also, change database_test::do_with_some_data is use random keys instead of hard-coded key names, to reproduce this issue with `snapshot_list_contains_dropped_tables`.
Fixes#27639
* The issue exists since forever and might cause data loss due to wrongly clearing the memtable, so it needs backport to all live versions
Closesscylladb/scylladb#27643
* github.com:scylladb/scylladb:
test: database_test: do_with_some_data: randomize keys
database: truncate_table_on_all_shards: drop outdated TODO comment
database: truncate_table_on_all_shards: consider can_flush on all shards
memtable_list: unify can_flush and may_flush
test: database_test: add test_flush_empty_table_waits_on_outstanding_flush
replica: table, storage_group, compaction_group: add needs_flush
test: database_test: do_with_some_data_in_thread: accept void callback function
2-DC cluster parallel non-RBNO rebuild failure when expanding RF in DC2.
Steps to reproduce:
1. Provision a cluster with 2 datacenters and at least 2 nodes in the second datacenter.
2. Let’s assume datacenter names are "dc1" and "dc2".
3. Create a keyspace ("keyspace1") with RF=0 in dc2.
4. Populate some data into dc1.
5. Change keyspace1 replication in dc2 to 2.
6. On 2 nodes in dc2 run the following command in parallel:
nodetool rebuild --source-dc dc1
Parallel execution of rebuilds is not possible with RBNO enabled.
This test is the repro for #27804Closesscylladb/scylladb#27747
Currently the function uses a regular expression
to check the system log for a specific message.
This is tangential to the ability to cleanly abort the restore task, plus the regular expression has a syntax error:
```
test/cluster/object_store/test_backup.py:534
/home/bhalevy/dev/scylla/test/cluster/object_store/test_backup.py:534: SyntaxWarning: "\(" is an invalid escape sequence. Such sequences will not work in the future. Did you mean "\\("? A raw string is also an option.
await wait_for_first_completed([l.wait_for("Failed to handle STREAM_MUTATION_FRAGMENTS \(receive and distribute phase\) for .+: Streaming aborted", timeout=10) for l in logs])
```
Thsi change modernizes the implementation by:
- using auto_dc_rack for manager.servers_add
- using new_test_keyspace to generate and auto delete the keyspace
- using async gatherio and a prepared statement to insert the data
- simplifing the keys and values by NOT using os.urandom (that is notoriously slow)
- inserting fewer keys in debug mode
- removing the log check
With that, the test can be reenabled in all modes.
* No backport needed since the test was disabled
Closesscylladb/scylladb#27892
* github.com:scylladb/scylladb:
test_backup: do_abort_restore: reduce data footprint
test_backup: do_abort_restore: use error injection
test_backup: do_abort_restore: use asyncio for cql
test_backup: do_abort_restore: use new_test_keyspace
test_backup: do_abort_restore: use logger rather than print
test_backup: do_abort_restore: pass auto_rack_dc to servers_add
This patch adds tablet repair progress report support so that the user
could use the /task_manager/task_status API to query the progress.
In order to support this, a new system table is introduced to record the
user request related info, i.e, start of the request and end of the
request.
The progress is accurate when tablet split or merge happens in the
middle of the request, since the tokens of the tablet are recorded when
the request is started and when repair of each tablet is finished. The
original tablet repair is considered as finished when the finished
ranges cover the original tablet token ranges.
After this patch, the /task_manager/task_status API will report correct
progress_total and progress_completed.
Fixes#22564Fixes#26896Closesscylladb/scylladb#27679
It was obseved:
```
test_repair_disjoint_row_2nodes_diff_shard_count was spuriously failing due to
segfault.
backtrace pointed to a failure when allocating an object from the chain of
freed objects, which indicates memory corruption.
(gdb) bt
at ./seastar/include/seastar/core/shared_ptr.hh:275
at ./seastar/include/seastar/core/shared_ptr.hh:430
Usual suspect is use-after-free, so ran the reproducer in the sanitize mode,
which indicated shared ptr was being copied into another cpu through the
multi shard writer:
seastar - shared_ptr accessed on non-owner cpu, at: ...
--------
seastar::smp_message_queue::async_work_item<mutation_writer::multishard_writer::make_shard_writer...
```
The multishard writer itself was fine, the problem was in the streaming consumer
for repair copying a shared ptr. It could work fine with same smp setting, since
there will be only 1 shard in the consumer path, from rpc handler all the way
to the consumer. But with mixed smp setting, the ptr would be copied into the
cpus involved, and since the shared ptr is not cpu safe, the refcount change
can go wrong, causing double free, use-after-free.
To fix, we pass a generic incremental repair handler to the streaming
consumer. The handler is safe to be copied to different shards. It will
be a no op if incremental repair is not enabled or on a different shard.
A reproducer test is added. The test could reproduce the crash
consistently before the fix and work well after the fix.
Fixes#27666Closesscylladb/scylladb#27870
Function skip_mode works only on function and only in cluster test. This if OK
when we need to skip one test, but it's not possible to use it with pytestmark
to automatically mark all tests in the file. The goal of this PR is to migrate
skip_mode to be dynamic pytest.mark that can be used as ordinary mark.
Closesscylladb/scylladb#27853
[avi: apply to test/cluster/test_tablets.py::test_table_creation_wakes_up_balancer]
The test if flaky, with failures in:
for server in servers:
> await check_node_log_for_failed_mutations(manager, server)
test/cluster/test_topology_ops_encrypted.py:84:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
manager = <test.pylib.manager_client.ManagerClient object at 0xffff602e8590>
server = ServerInfo(server_id=1769, ip_addr='127.82.127.43', rpc_address='127.82.127.43', datacenter='DEFAULT_DC', rack='DEFAULT_RACK', pid=186578)
async def check_node_log_for_failed_mutations(manager: ManagerClient, server: ServerInfo):
logging.info(f"Checking that node {server} had no failed mutations")
log = await manager.server_open_log(server.server_id)
occurrences = await log.grep(expr="Failed to apply mutation from", filter_expr="(TRACE|DEBUG|INFO)")
> assert len(occurrences) == 0
E AssertionError
test/cluster/util.py:319: AssertionError
As diagnosed by Gleb in https://github.com/scylladb/scylladb/issues/27942#issuecomment-3710013625:
"The fencing errors here look legit given that we do not wait for all
requests to complete while shutting down the storage proxy. The
scenario is this:
Test does writes to rf=3 keyspace with cl=one. One node is shutting
down while there is a tablet migration. Tablet migration executes
barrier and drain which fails on a node that is been shutdown. The
topology coordinator proceeds fencing the old topology, but there
still can be un-handled mutation requests from the shutting down node
on other nodes and they will generate fencing errors like they should.
They way to avoid it (though it is benign) is to wait for all outgoing
storage proxy requests to complete during shutdown, but even then the
error may still happen since a request may timeout before it is
processed by the other side, so it may be completed by a storage proxy
coordinator side, but still not handled by replica side. This what we
have fencing for in the first place."
Fix by diabling background tablet migrations, so that we have no
topology barriers concurrent with node shutdown.
Fixes#27942Closesscylladb/scylladb#28034
The driver must see server_c before we stop server_a, otherwise
there will be no live host in the pool when we attempt to drop
the keyspace:
```
@pytest.mark.asyncio
async def test_not_enough_token_owners(manager: ManagerClient):
"""
Test that:
- the first node in the cluster cannot be a zero-token node
- removenode and decommission of the only token owner fail in the presence of zero-token nodes
- removenode and decommission of a token owner fail in the presence of zero-token nodes if the number of token
owners would fall below the RF of some keyspace using tablets
"""
logging.info('Trying to add a zero-token server as the first server in the cluster')
await manager.server_add(config={'join_ring': False},
property_file={"dc": "dc1", "rack": "rz"},
expected_error='Cannot start the first node in the cluster as zero-token')
logging.info('Adding the first server')
server_a = await manager.server_add(property_file={"dc": "dc1", "rack": "r1"})
logging.info('Adding two zero-token servers')
# The second server is needed only to preserve the Raft majority.
server_b = (await manager.servers_add(2, config={'join_ring': False}, property_file={"dc": "dc1", "rack": "rz"}))[0]
logging.info(f'Trying to decommission the only token owner {server_a}')
await manager.decommission_node(server_a.server_id,
expected_error='Cannot decommission the last token-owning node in the cluster')
logging.info(f'Stopping {server_a}')
await manager.server_stop_gracefully(server_a.server_id)
logging.info(f'Trying to remove the only token owner {server_a} by {server_b}')
await manager.remove_node(server_b.server_id, server_a.server_id,
expected_error='cannot be removed because it is the last token-owning node in the cluster')
logging.info(f'Starting {server_a}')
await manager.server_start(server_a.server_id)
logging.info('Adding a normal server')
await manager.server_add(property_file={"dc": "dc1", "rack": "r2"})
cql = manager.get_cql()
await wait_for_cql_and_get_hosts(cql, [server_a], time.time() + 60)
> async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }") as ks_name:
test/cluster/test_not_enough_token_owners.py:57:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/lib64/python3.14/contextlib.py:221: in __aexit__
await anext(self.gen)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
manager = <test.pylib.manager_client.ManagerClient object at 0x7f37efe00830>
opts = "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }"
host = None
@asynccontextmanager
async def new_test_keyspace(manager: ManagerClient, opts, host=None):
"""
A utility function for creating a new temporary keyspace with given
options. It can be used in a "async with", as:
async with new_test_keyspace(ManagerClient, '...') as keyspace:
"""
keyspace = await create_new_test_keyspace(manager.get_cql(), opts, host)
try:
yield keyspace
except:
logger.info(f"Error happened while using keyspace '{keyspace}', the keyspace is left in place for investigation")
raise
else:
> await manager.get_cql().run_async("DROP KEYSPACE " + keyspace, host=host)
E cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.69.108.39:9042 dc1>: ConnectionException('Pool for 127.69.108.39:9042 is shutdown')})
test/cluster/util.py:544: NoHostAvailable
```
Fixes#28011Closesscylladb/scylladb#28040
This commit replaces a fixed version name with the variable for the current version
in the instructions for installing a non-default version with Web Installer.
This will make using the installer more user-friendly.
Fixes https://github.com/scylladb/scylladb/issues/28005
With randomized keys, and since we're inserting only 2 keys,
it is possible that they would end up owned only by a single shard,
reproducing #27639 in snapshot_list_contains_dropped_tables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The comment was added in 83323e155e
Since then, table::seal_active_memtable was improved to guarantee
waiting on oustanding flushes on success (See d55a2ac762), so
we can remove this TODO comment (it also not covered by any issue
so nobody is planned to ever work on it).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
can_flush might return a different value for each shard
so check it right before deciding whether to flush or clear a memtable
shard.
Note that under normal condition can_flush would always return true
now that it checks only the presence of the seal memtable function
rather than check memtable_list::empty().
Fixes#27639
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that we have a unit test proving that it's safe to flush an
empty memtable list there is no need to distinguish between
may_flush and can_flush.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Table needs flush if not all its memtable lists are empty.
To be used in the next patch for a unit test.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Many test cases already assume `func` is being called a seastar
thread and although the function they pass returns a (ready) future,
it serves no purpose other than to conform to the interface.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This `raft_topology_update_ip` call always returns after `t.find(raft_id)`
returns `nullptr`, so it effectively does nothing. It's not a bug, since
there is no reason to update `system.peers` for left nodes anyway. We
delete the rows corresponding to left nodes in `process_left_node` (called
just above).
Closesscylladb/scylladb#27899
Service levels cache is empty after upgrade to consistent topology
if no mutations are commited to `system.service_levels_v2` or rolling
restart is not done.
To fix the bug, this commit adds service levels cache reloading after
upgrading the SL data accessor to v2 in `storage_service::topology_state_load()`.
Fixes SCYLLADB-90
before doing migration to raft
There is no need to call `service_level_controller::upgrade_to_v2()`
on every topology state load, we only need to do it once.
All tests I am fixing in this patch do pass for me on DynamoDB, but
other developers report that they fail because some DynamoDB servers
apparently use slightly different error messages, with less detail about
the cause of an error. For example, some of our tests currently expect
an error message that looks like:
An error occurred (ValidationException) when calling the Query
operation: Invalid operator used in KeyConditionExpression:
attribute_exists
But some servers don't report the ": attribute_exists" at the end, so
we can't use the word "attribute_exists" it in the test to recognize
the correct error, and needs to use a different word (which both
versions of DynamoDB and Alternator all print).
As another example, the good old DynamoDB error:
An error occurred (ValidationException) when calling the Query
operation: 1 validation error detected: Value 'DOG' at
'conditionalOperator' failed to satisfy constraint: Member must
satisfy enum value set: [OR, AND]
Got replaced by the following less informative message:
An error occurred (ValidationException) when calling the Query
operation: Failed to satisfy constraint: Member must satisfy enum
value set: [ALL, OR]'
So we need to fix the test to allow it too.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test test_compressed_request.py::test_compressed_request coerces
boto3 to send a compressed request, and wrongly used region_name=us-east-1
to set up the connection. Theoretically, this doesn't matter because
we also set the correct URL (for either Alternator or the desired region
in AWS). But in fact it does matter, because region name is part of the
request's signature, and DynamoDB refuses the request if it comes to
a different region than it is signed for. So this test fails when run
on DynamoDB on any other region except us-east-1.
The fix is simple - don't use the constant "us-east-1", but pick up the
correct region name from the original connection.
The functions new_dynamodb_session(), new_dynamodb() and
new_dynamodb_stremas() had the same bug and we fix it too, but it didn't
break any test because the only tests using these functions were
Scylla-only so the AWS region problem didn't apply to them.
The Alternator test test_tag.py::test_tag_lsi_gsi expects to see an
error - it's not allowed to set a tag on a GSI or LSI - but the error
message that DynamoDB prints recently changed - instead of saying
"ResourceArn" the new error message says "resource arn".
Change the test to allow both forms, so it will pass on both Alternator
(which still uses the word ResourceArn - which is the name of the
parameter) and on DynamoDB (which uses "resource arn").
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test test_batch.py::test_batch_write_item_large_broken_connection
failed on DynamoDB (Refs #26079). It turns out this test has many
problems:
1. This test wrongly assumes a batch write needs to complete in one
attempt - and this fails on DynamoDB with low WCU capacity where
the batch needs to be resumed in multiple requests. Using boto3's
batch_writer() fixes this problem.
2. This test has NOTHING to do with batches - so is mis-named and
mis-placed. The batch write is just a way to prepare some data
in the table, and the real test is about Query'ing the data back
and observing the long response and reproducing issue #14454.
I did not rename or move the test, but left a comment explaining
the situation.
3. This test is written to assume the Query's response uses HTTP
chunked encoding. Which isn't actually true for DynamoDB, at least
not at the time of this writing. So the test fails on DynamoDB.
For the last reason, I made this test scylla_only. This test can't
really be run on DynamoDB without rewriting it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test test_batch.py::test_batch_write_item_large often fails when
running on DynamoDB, and this patch fixes it. The test checks that a
large but not over-the-limits large batch works. However, "works" only
means that the batch is not an error - it doesn't guarantee that all the
items in the batch are performed. If the WCU limits of the table are
exceeded DynamoDB may perform only part of the the batch and return the
remaining items as UnprocessedItems. This not only can happen, it
usually does happen on DynamoDB - because a new on-demand-billing table
always start with a very low WCU capacity.
So in this patch we update the test to recognize and perform the
UnprocessedItems, instead of assuming it needs to be empty.
The test continues to pass on Alternator, and finally passes on
DynamoDB.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In Alternator tests, the wait_for_gsi() utility function is used in
tests that add a GSI to an existing table, to wait for this new GSI
to become ready. Although this takes a fraction of a second on
Alternator, we noticed that this takes many minutes (!) on DynamoDB
so we used an absurdly high 10 minute timeout to allow tests to also
pass on DynamoDB.
But it turns out that 10 minutes wasn't absurdly high enough, and
tests using it in test_gsi_updatetable.py started to fail on DynamoDB.
Empirically, 10 minutes was enough in the past but it seems that today
adding a GSI to an empty table routinely takes as much as 20 minutes.
So this patch increases the wait_for_gsi() timeout to a whopping 30
minutes. After this patch, the tests in test_gsi_updatetable.py which
used to fail - test_gsi_backfill_with_lsi,
test_gsi_backfill_with_real_column, test_gsi_creates_and_deletes and
test_gsi_backfill_oversized_key now all pass on DynamoDB - but each
takes more than 20 minutes to pass.
To allow the test to fail much more quickly on Alternator (where
creating a GSI takes a fraction of a second), we set a much lower
but still very high timeout when running on Alternator - 60 seconds.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch modifies RESTful API handler which disables tablet
balancing to use topology request to wait for already running tablet
transitions. Before, it was just waiting for topology to be idle, so
it could wait much longer than necessary, also for operations which
are not affected by the flag, like repair. And repair can take hours.
New request type is introduced for this synchronization: noop_request.
It will preempt the tablet scheduler, and when the request executes,
we know all later tablet transitions will respect the "balancing
disabled" flag, and only things which are unuaffected by the flag,
like repair, will be scheduled.
Fixes#27647
The test test_streams.py::test_streams_putitem_new_item_overrides_old_lsi
failed on DynamoDB (Refs #26079) because we passed an unused parameter
NonKeyAttributes to the Projection setting an LSI. NonKeyAttributes is
only allowed when ProjectionType=INCLUDE, but we used ProjectionType=ALL.
DynamoDB refuses to create an LSI with such inconsistent parameters,
and we just need to remove this unnecessary parameter from this test.
The reason why this test didn't fail on Alternator is that Alternator
doesn't yet support or even parse the Projection parameter (Refs #5036).
We also add an xfailing test (passes on DynamoDB, fails on Alternator)
checking that a spurious NonKeyAttributes parameter is rejected. When
we get around to implement the projection feature (#5036), this will
be yet another acceptance test for this feature.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
To make the test fast, in particular in debug mode
insert fewer keys and do not rely on os.urandom
which is notoriously slow
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the test depends on timing and enough inserted
data to abort the restore tasks at exactly the right time.
This is flaky in nature, so instead, use error injection
to synchronize the abort with mutation streaming.
Note that with that we no longer get the STREAM_MUTATION_FRAGMENTS
log message, so waiting for it is dropped from the test.
The most imporant thing is that some restore tasks must fail.
(We cannot guarantee all would fail unfortunately)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use the more modern asyncio facility to run cql queries
and a prepared statement to insert data into the table.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To generate multi-rack cluster, otherwise we get the following error:
```
E cassandra.protocol.ConfigurationException: <Error from server: code=2300 [Query invalid because of configuration issue] message="Replication factor 3 exceeds the number of racks (1) in dc datacenter1">
```
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move KSNAME constant from internal static to public member of
replicated_key_provider_factory class.
It will be used to identify it as a system keyspace.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The explanation is in the new comment in `gossiper::add_saved_endpoint`.
We add a test for this change. It's "extremely white-box", but it's better
than nothing.
We currently do it only for a bootstrapping node, which is a bug. The
missing IP can cause an internal error, for example, in the following
scenario:
- replace fails during streaming,
- all live nodes are shut down before the rollback of replace completes,
- all live nodes are restarted,
- live nodes start hitting internal error in all operations that
require IP of the replacing node (like client requests or REST API
requests coming from nodetool).
We fix the bug here, but we do it separately for replace with different
IP and replace with the same IP.
For replace with different IP, we persist the IP -> host ID mapping
in `system.peers` just like for bootstrap. That's necessary, since there
is no other way to determine IP of the replacing node on restart.
For replace with the same IP, we can't do the same. This would require
deleting the row corresponding to the node being replaced from
`system.peers`. That's fine in theory, as that node is permanently
banned, so its IP shouldn't be needed. Unfortunately, we have many
places in the code where we assume that IP of a topology member is always
present in the address map or that a topology member is always present in
the gossiper endpoint set. Examples of such places:
- nodetool operations,
- REST API endpoints,
- `db::hints::manager::store_hint`,
- `group0_voter_handler::update_nodes`.
We could fix all those places and verify that drivers work properly when
they see a node in the token metadata, but not in `system.peers`.
However, that would be too risky to backport.
We take a different approach. We recover IP of the replacing node on
restart based on the state of the topology state machine and
`system.peers` just after loading `system.peers`.
We rely on the fact that group 0 is set up at this point. The only case
where this assumption is incorrect is a restart in the Raft-based
recovery procedure. However, hitting this problem then seems improbable,
and even if it happens, we can restart the node again after ensuring
that no client and REST API requests come before replace is rolled back
on the new topology coordinator. Hence, it's not worth to complicate the
fix (by e.g. looking at the persistent topology state instead of the
in-memory state machine).
Update the documentation about restrictions to tablets keyspaces related
to RF-rack.
* MV/SI require the keyspace to be RF-rack-valid
* topology operations are restricted if a keyspace has views to preserve
RF-rack-validity
When creating an index we validate that the keyspace is RF-rack-valid
and print a warning that the keyspace must remain RF-rack-valid.
This should apply only to indexes that are based on materialized views
for which there are consistency concerns when the keyspace is not
RF-rack-valid.
vector indexes are not based on materialized views, hence these
restrictions should not apply to them.
Creating a MV or index in a tablets-based keyspace now forces additional
restrictions on the keyspace. The keyspace must be RF-rack-valid and it
must remain RF-rack-valid while the view exists.
Add a CQL warning about these restrictions.
allow tablet merge of tables with views even if the
rf_rack_valid_keyspaces option is not set, because now keyspaces that
have views are enforced to always be rf-rack-valid, regardless of the
option value.
Extend the RF-rack validation in `assert_rf_rack_valid_keyspace` to
validate rack-list-based replication as well. Previously, validation was
done only for numeric replication.
If the replication is based on a rack list, we validate that all racks
that are required for replication are present in the topology rack map.
If some rack is needed for replication but is missing, or it doesn't
have normal token owner nodes, the validation fails with an error.
add tests that attempt to create a keyspace during different stages of
node join or remove, and verify that the rf-rack condition can't be
broken - either creating the keyspace should fail or the node operation
should fail, depending on the stage.
If a keyspace is created while a node is joining or being removed, it could
break the rf-rack invariant. For example:
1. We have 3 nodes in 3 racks, no keyspaces
2. A new node starts to join in a new rack - passes validation because
there are no keyspaces
3. Create a keyspace with rf=3 - passes validation because the joining
node is not a normal token owner yet
4. The new node becomes a normal token owner
5. The rf-rack invariant is broken. We have rf=3 and 4 racks
To fix this, we change the rf-rack check to consider a node as a token
owner if it's either a normal token owner or it has bootstrap tokens and
is about to become a normal token owner.
Now the condition can't be broken. Consider keyspace creation at
different stages of adding a node in our example:
* Before the node is assigned bootstrap tokens: the node is not
considered. We can create a keyspace with rf=3 as if the node doesn't
exist, and then node join will fail in the group0 operation that
assigns bootstrap tokens, because during this operation we check
rf-rack validity.
* Assigning bootstrap tokens is a single group0 operation that is
serialized with keyspace creation. During this operation we check that
adding the node as a token owner will maintain rf-rack validity for all
keyspaces.
* After the node is assigned bootstrap tokens and until it becomes a
normal token owner: it is considered as a transitioning token owner by
the rf-rack check and the rack is considered a transitioning rack. We
can't count the rack as a normal rack because the node join may still
fail and rollback. Trying to create a keyspace with either rf=3 or
rf=4 will fail because we can end up with either 3 or 4 racks.
Similarly, when removing a node, we validate that removing the node will
maintain rf-rack validity in the same group0 operation that changes the
node state to removing/decommissioning, after which the node becomes a
leaving endpoint, and it's not considered a normal token owner anymore
for the rf-rack check.
Add tests that verify the restrictions on topology operations when there
are keyspaces with tablets and materialized views.
For such keyspaces, RF=Racks must be enforced while they have
materialized views, therefore adding a node in a new rack or removing a
node that would eliminate a rack should be rejected.
add new tests for testing that RF-rack validity is maintained when doing
topology operations that may break them, such as adding nodes in new
racks or removing nodes.
when a new node joins or an existing node is removed / decommissioned,
check if the operation would violate the RF-rack-validity of some
keyspace. if so - reject the operation in order to preserve
RF-rack-validity.
Fixesscylladb/scylladb#23345Fixesscylladb/scylladb#26820
add validation to node remove / decommission, similar to node validation
when a node joins.
when starting node remove or decommission, the validation function
checks if the operation is valid and can proceed. if not, it's aborted
with an error message.
we change the return type of validate_joining_node so that it will be
similar and consistent with the new validate_removing_node.
Extend the locator function assert_rf_rack_valid_keyspace to accept
arbitrary topology dc-rack maps and nodes instead of using the current
token metadata.
This allows us to add a new variant of the function that checks rf-rack
validity given a topology change that we want to apply. we will use it
to check that rf-rack validity will be maintained before applying the
topology change.
The possible topology changes for the check are node add and node remove
/ decommission. These operations can change the number of normal racks -
if a new node is added to a new rack, or the last node is removed from a
rack.
The function validate_view_keyspace checks if a keyspace is eligible for
having materialized views, and it is used for validation when creating a
MV or a MV-based index.
Previously, it was required that the rf_rack_valid_keyspaces option is
set in order for tablets-based keyspaces to be considered eligible, and
the RF-rack condition was enforced when the option is set.
Instead of this, we change the validation to allow MVs in a keyspace if
the RF-rack condition is satisfied for the keyspace - regardless of the
config option.
We remove the config validation for views on startup that validates the
option `rf_rack_valid_keyspaces` is set if there are any views with
tablets, since this is not required anymore.
We can do this without worrying about upgrades because this change will
be effective from 2025.4 where MVs with tablets are first out of
experimental phase.
We update the test for MV and index restrictions in tablets keyspaces
according to the new requirements.
* Create MV/index: previously the test checked that it's allowed only if
the config option `rf_rack_valid_keyspaces` is set. This is changed
now so it's always allowed to create MV/index if the keyspace is
RF-rack-valid. Update the test to verify that we can create MV/index
when the keyspace is RF-rack-valid, even if the rf_rack option is not
set, and verify that it fails when the keyspace is RF-rack-invalid.
* Alter: Add a new test to verify that while a keyspace has views, it
can't be altered to become RF-rack-invalid.
Extend the RF-rack-validity enforcement to keyspaces that have views,
regardless of the option `rf_rack_valid_keyspaces`.
Previously, RF-rack-validity was enforced when `rf_rack_valid_keyspaces`
was set for all keyspaces. Now we want to allow creating MVs in tablet
keyspaces that are RF-rack-valid and enforce the RF-rack-validity even
if the config option is not set.
Add the helper function enforce_rf_rack_validity_for_keyspace that
returns true if RF-rack-validity should be enforced for a keyspace, and
use it wherever we need to check this instead of checking the config
option directly.
This is useful because this condition is used in multiple places, and
having it defined in a single helper function will make it easier to
see and change the enforcement conditions.
simple refactoring: the enforce parameter is always given the value of
the `rf_rack_valid_keyspaces` option. remove the parameter and use the
option value directly from the db config.
this will be useful for a later change to the enforcement conditions.
the test test_unfinished_writes_during_shutdown starts 3 nodes in 3
racks and creates a keyspace with RF=3, then adds a new node in a 4th
rack. this breaks rf-rack validity for the keyspace.
we change it instead to add the new node in an existing rack. it doesn't
matter for the test - the test only wants to add a new node to trigger
some topology change.
We're extending the logic of DESCRIBE INDEX to include properties of the
underlying materialized view. Tests are provided to ensure the
implementation works as intended.
This is a temporary solution as handling this property may require
a bit more attention or at least a bit more focus. For now, let's
forbid using it so it's clear it won't get applied. A simple test
is provided to cover it.
We document the restriction.
After the previous patch that extended the grammar and provided
basic functionalities to accommodate properties of materialized views
in indexes, this commit takes another step and actually applies them
to the underlying view when it's being created.
We're providing validation tests for each property, with the single
exception of CLUSTERING ORDER BY. That one will be handled separately
in an upcoming commit.
We also update the user documentation.
We're allowing CREATE INDEX to accept the same set of properties as
materialized views do. Our goal is to give the user an ability to
configure the underlying materialized view of an index directly,
when creating it.
This commit doesn't do anything except for extending the grammar
and passing the right pieces of information to the right destinations.
There's no validation and the options have no effect yet. That will
be done in the following patch.
The type represents a mix of both index-specific and view properties.
Since we cannot easily distinguish which properties belong to which
entity, let's use this abstraction and filter them from the C++ level.
This is a prerequisite for extending the capabilities of CREATE INDEX
by allowing it to configuring the underlying materialized view.
We rename the type `index_prop_defs` to `index_specific_prop_defs`.
The rationale for the change is to distinguish between properties
related directly to a index and properties related to the underlying
view (if applicable).
The type `index_prop_defs` will be re-introduced in an upcomming commit
where it'll encompass both index-related and view-related properties.
This is a prerequisite for it.
We're introducing a new type wrapping properties that can be used with
materialized views. Doing that, we achieve the following things:
(1) We can keep validation logic in one place.
(2) We differentiate between properties of a regular table and
properties of a materialized view.
(3) It provides better modularization and allows for reusing the code.
(4) It gets rid of inconsistencies in the existing code, e.g.
CREATE MV using one type for properties, while ALTER MV another.
The actual end goal of this commit is to be able to reuse at least part
of the validation logic of MVs in CREATE INDEX and, when it gets added,
ALTER INDEX: we want to endow those statements with an ability to modify
the underlying materialized view without having to modify it directly.
This patch does NOT implement the whole validation logic yet. It will be
done in a following commit.
Refs scylladb/scylladb#16454
The end goal we have in mind in this commit is to extract the validation
logic of the options used for creating and altering an MV to a separate
place and be able to call from different places in the code.
It will be useful when extending the capabilities of the CREATE INDEX
statement.
In this patch, we move the part of validation responsible for checking
the ID option to keep it close to the other parts of validation of the
options in their "raw" form.
One of the upcoming commits will lead to a cyclic dependency
of headers because `schema.hh` includes `index_prop_defs.hh`.
To prevent that, we remove the include and replace it with
a manually added alias.
This is not a perfect solution, but doing it properly would
require comprehensive changes. We can do that in a separate
task.
- Strive for simplicity and clarity, add complexity only when clearly justified
- Question requests: don't blindly implement requests - evaluate trade-offs, identify issues, and suggest better alternatives when appropriate
- Consider different approaches, weigh pros and cons, and recommend the best fit for the specific context
## Test Philosophy
- Performance matters. Tests should run as quickly as possible. Sleeps in the code are highly discouraged and should be avoided, to reduce run time and flakiness.
- Stability matters. Tests should be stable. New tests should be executed 100 times at least to ensure they pass 100 out of 100 times. (use --repeat 100 --max-failures 1 when running it)
- Unit tests should ideally test one thing and one thing only.
- Tests for bug fixes should run before the fix - and show the failure and after the fix - and show they now pass.
- Tests for bug fixes should have in their comments which bug fixes (GitHub or JIRA issue) they test.
- Tests in debug are always slower, so if needed, reduce number of iterations, rows, data used, cycles, etc. in debug mode.
- Tests should strive to be repeatable, and not use random input that will make their results unpredictable.
- Tests should consume as little resources as possible. Prefer running tests on a single node if it is sufficient, for example.
CLEANER_DIRS:test/unit exceptions alternator api auth cdc compaction db dht gms index lang message mutation mutation_writer node_ops raft redis replica service
// Note: we don't care when the notification of other shards will finish, as long as it will be done
// it's possible to get into race condition (next DescribeTable comes to other shard, that new shard doesn't have
// the size yet, so it will calculate it again) - this is not a problem, because it will call cache_newly_calculated_size_on_all_shards
// with expiry, which is extremely unlikely to be exactly the same as the previous one, all shards will keep the size coming with expiry that is further into the future.
// In case of the same expiry, some shards will have different size, which means DescribeTable will return different values depending on the shard
// which is also fine, as the specification doesn't give precision guarantees of any kind.
// A race condition is possible: if a DescribeTable request arrives on a different shard before
// that shard receives the cached size, it will recalculate independently. This is acceptable because:
// 1. Both calculations will cache their results with an expiry time
// 2. Expiry times are unlikely to be identical, so eventually all shards converge to the most recent value
// 3. Even if expiry times match, different shards may briefly return different table sizes
// 4. This temporary inconsistency is acceptable per DynamoDB specification, which doesn't guarantee
// exact precision for DescribeTable size information
co_returnapi_error::validation("GlobalSecondaryIndexes and LocalSecondaryIndexes with tablets require the rf_rack_valid_keyspaces option to be enabled.");
// Creating an index in tablets mode requires the keyspace to be RF-rack-valid.
// GSI and LSI indexes are based on materialized views which require RF-rack-validity to avoid consistency issues.
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to 'disabled' mode.",
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to incremental mode.",
"max cpu usage ratio (between 0 and 1) for compaction process. Not intended for setting in normal operations. Setting it to 1 or higher will disable it, recommended operational setting is 0.5.")
"If set to higher than 0, ignore the controller's output and set the memtable shares statically. Do not set this unless you know what you are doing and suspect a problem in the controller. This option will be retired when the controller reaches more maturity.")
"If set to higher than 0, ignore the controller's output and set the compaction shares statically. Do not set this unless you know what you are doing and suspect a problem in the controller. This option will be retired when the controller reaches more maturity.")
"Set the maximum shares of regular compaction to the specific value. Do not set this unless you know what you are doing and suspect a problem in the controller. This option will be retired when the controller reaches more maturity.")
"If set to true, enforce the min_threshold option for compactions strictly. If false (default), Scylla may decide to compact even if below min_threshold.")
"SELECT statements with aggregation or GROUP BYs or a secondary index may use this page size for their internal reading data, not the page size specified in the query options.")
,alternator_port(this,"alternator_port",value_status::Used,0,"Alternator API port.")
,alternator_https_port(this,"alternator_https_port",value_status::Used,0,"Alternator API HTTPS port.")
"Port on which the Alternator HTTPS API listens for clients using proxy protocol v2. Disabled (0) by default.")
,alternator_address(this,"alternator_address",value_status::Used,"0.0.0.0","Alternator API listening address.")
,alternator_enforce_authorization(this,"alternator_enforce_authorization",liveness::LiveUpdate,value_status::Used,false,"Enforce checking the authorization header for every request in Alternator.")
,alternator_warn_authorization(this,"alternator_warn_authorization",liveness::LiveUpdate,value_status::Used,false,"Count and log warnings about failed authentication or authorization")
"\tdisabled: New keyspaces use vnodes by default, unless enabled by the tablets={'enabled':true} option\n"
"\tenabled: New keyspaces use tablets by default, unless disabled by the tablets={'enabled':false} option\n"
"\tenforced: New keyspaces must use tablets. Tablets cannot be disabled using the CREATE KEYSPACE option")
,auto_repair_enabled_default(this,"auto_repair_enabled_default",liveness::LiveUpdate,value_status::Used,false,"Set true to enable auto repair for tablet tables by default. The value will be overridden by the per keyspace or per table configuration which is not implemented yet.")
,auto_repair_threshold_default_in_seconds(this,"auto_repair_threshold_default_in_seconds",liveness::LiveUpdate,value_status::Used,24*3600,"Set the default time in seconds for the auto repair threshold for tablet tables. If the time since last repair is bigger than the configured time, the tablet is eligible for auto repair. The value will be overridden by the per keyspace or per table configuration which is not implemented yet.")
,enable_create_table_with_compact_storage(this,"enable_create_table_with_compact_storage",liveness::LiveUpdate,value_status::Used,false,"Enable the deprecated feature of CREATE TABLE WITH COMPACT STORAGE. This feature will eventually be removed in a future version.")
"Sets the minimal tablet size for the load balancer. For any tablet smaller than this, the balancer will use this size instead of the actual tablet size.")
"max cpu usage ratio (between 0 and 1) for compaction process. Not intended for setting in normal operations. Setting it to 1 or higher will disable it, recommended operational setting is 0.5.")
"If set to higher than 0, ignore the controller's output and set the memtable shares statically. Do not set this unless you know what you are doing and suspect a problem in the controller. This option will be retired when the controller reaches more maturity.")
"If set to higher than 0, ignore the controller's output and set the compaction shares statically. Do not set this unless you know what you are doing and suspect a problem in the controller. This option will be retired when the controller reaches more maturity.")
"Set the maximum shares of regular compaction to the specific value. Do not set this unless you know what you are doing and suspect a problem in the controller. This option will be retired when the controller reaches more maturity.")
"If set to true, enforce the min_threshold option for compactions strictly. If false (default), Scylla may decide to compact even if below min_threshold.")
"Set the minimum interval in seconds between flushing all tables before each major compaction (default is 86400)."
"This option is useful for maximizing tombstone garbage collection by releasing all active commitlog segments."
"Set to 0 to disable automatic flushing all tables before major compaction.")
,default_log_level(this,"default_log_level",value_status::Used,seastar::log_level::info,"Default log level for log messages")
,logger_log_level(this,"logger_log_level",value_status::Used,{},"Map of logger name to log level. Valid log levels are 'error', 'warn', 'info', 'debug' and 'trace'")
,log_to_stdout(this,"log_to_stdout",value_status::Used,true,"Send log output to stdout")
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.