- Fix variable name error: host[0] → hosts[0] on line 98
- Add missing await keywords for async operations on lines 209 and 385
- Rename class random_content_file to RandomContentFile (PascalCase)
- Fix function name typo: test_autotoogle_compaction → test_autotoggle_compaction
Co-authored-by: mykaul <4655593+mykaul@users.noreply.github.com>
There are three tests in cluster/object_store suite that check how
backup fails in case either of its parameters doesn't really exists. All
three greatly duplicate each other, it makes sense to merge them into
one larger parametrized test.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#27695
This pull request introduces a new caching mechanism for client options in the Alternator and transport layers, refactors how client metadata is stored and accessed, and extends the `system.clients` virtual table to surface richer client information. The changes improve efficiency by deduplicating commonly used strings (like driver names/versions and client options), and ensure that client data is handled in a way that's safe for cross-shard access. Additionally, the test suite and virtual table schema are updated to reflect the new client options data.
**Caching and client metadata refactoring:**
* The largest and most repeatable items in the connection state before this PR were a `driver_name` and a `driver_version` which were stored as an `sstring` object which means that the corresponding memory consumption was 16 bytes per each such value at least (the smallest size of the `seastar`'s `sstring` object) **per-connection**. In reality the driver name is usually longer than 15 characters, e.g. "ScyllaDB Python Driver" is 23 characters and this is not the longest driver name there is. In such cases the actual memory usage of a corresponding `sstring` object jumps to 8 + 4 + 1 + (string length, 23 in our example) + 1.
So, for "ScyllaDB Python Driver" it would be 37 bytes (in reality it would be a bit more due to natural alignment of other allocations since the `contents` size is not well aligned (13 bytes), but let's ignore this for now).
* These bytes add up quickly as there are more connections and, sometimes we are talking about millions of connections per-shard.
* Using a smart pointer (`lw_shared_ptr`) referencing a corresponding cached value will effectively reduce the per-connection memory usage to be 8 bytes (a size of a pointer on 64-bit CPU platform) for each such value. While storing a corresponding `sstring` value only once.
* This will would reduce the "variable" (per-connection) memory usage by **at least 50%**. And in case of "ScyllaDB Python Driver" driver version - by 78%!
* And all this for a price of a single `loading_shared_values` object **per-shard** (implements a hash table) and a minor overhead for each value **stored** in it.
* Introduced a new cache type (`client_options_cache_type`) for deduplicating and sharing client option strings, and refactored `client_data`, `client_state`, and related classes to use `foreign_ptr<std::unique_ptr<client_data>>` and cached entry types for fields like driver name, driver version, and client options. (`client_data.hh`, `service/client_state.hh`, `alternator/server.hh`, `alternator/controller.hh`, `transport/controller.hh`, `transport/protocol_server.hh`) [[1]](diffhunk://#diff-664a3b19e905481bdf8eb3843fc4d34691067bb97ab11cfd6e652e74aac51d9fR33-R36) [[2]](diffhunk://#diff-664a3b19e905481bdf8eb3843fc4d34691067bb97ab11cfd6e652e74aac51d9fL40-R56) [[3]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL105-R107) [[4]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL154-R182) [[5]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL91-R92) [[6]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL110-R111) [[7]](diffhunk://#diff-31730ba8e7374f784a88dc27c1512291cf73b7f24e08768f7466a3c8cfcc7a1aL96-R96) [[8]](diffhunk://#diff-19a97c0247cc08155ee49b277e43859ca32d6ef8cbff0ed7368ec5fa19e0a11eL172-R172) [[9]](diffhunk://#diff-eea7e2db5d799a25e717a72ac8ce5842bd4adb72b694d38d8f47166d9cd926faL356-R356) [[10]](diffhunk://#diff-d0b4ec3a144bbc5dc993866cf0b940850a457ff6156064f7e2b4b10ad0a95fefL80-R80) [[11]](diffhunk://#diff-4293b94c444d9bd5ecd17ce7eda8c00685d35ecf6e07f844efc91a91bbe85be1L46-R48)
* Updated the methods for setting and getting driver name, driver version, and client options in `client_state` to be asynchronous and use the new cache. (`service/client_state.hh`, `service/client_state.cc`) [[1]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL154-R182) [[2]](diffhunk://#diff-99634aae22e2573f38b4e2f050ed2ac4f8173ff27f0ae8b3609d1f0cc1aeb775R347-R362)
**Virtual table and API enhancements:**
* Extended the `system.clients` virtual table schema and implementation to include a new `client_options` column (a map of option key/value pairs), and updated the table population logic to use the new cached types and foreign pointers. (`db/virtual_tables.cc`) [[1]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1R752) [[2]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L769-R770) [[3]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L809-R816) [[4]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L828-R879)
**API and interface changes:**
* Changed the signatures of `get_client_data` methods throughout the codebase to return vectors of `foreign_ptr<std::unique_ptr<client_data>>` instead of plain `client_data` objects, to ensure safe cross-shard access. (`alternator/controller.hh`, `alternator/controller.cc`, `alternator/server.hh`, `alternator/server.cc`, `transport/controller.hh`, `transport/protocol_server.hh`) [[1]](diffhunk://#diff-31730ba8e7374f784a88dc27c1512291cf73b7f24e08768f7466a3c8cfcc7a1aL96-R96) [[2]](diffhunk://#diff-19a97c0247cc08155ee49b277e43859ca32d6ef8cbff0ed7368ec5fa19e0a11eL172-R172) [[3]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL110-R111) [[4]](diffhunk://#diff-a7e2cda866c03a75afcf3b087de1c1dcd2e7aa996214db67f9a11ed6451e596dL988-R995) [[5]](diffhunk://#diff-eea7e2db5d799a25e717a72ac8ce5842bd4adb72b694d38d8f47166d9cd926faL356-R356) [[6]](diffhunk://#diff-d0b4ec3a144bbc5dc993866cf0b940850a457ff6156064f7e2b4b10ad0a95fefL80-R80) [[7]](diffhunk://#diff-4293b94c444d9bd5ecd17ce7eda8c00685d35ecf6e07f844efc91a91bbe85be1L46-R48)
**Testing and validation:**
* Updated the Python test for the `system.clients` table to verify the new `client_options` column and its contents, ensuring that driver name and version are present in the options map. (`test/cqlpy/test_virtual_tables.py`) [[1]](diffhunk://#diff-6dd8bd4a6a82cd642252a29dc70726f89a46ceefb991c3e63fc67e283f323f03R79) [[2]](diffhunk://#diff-6dd8bd4a6a82cd642252a29dc70726f89a46ceefb991c3e63fc67e283f323f03R88-R90)
Closesscylladb/scylladb#25746
* github.com:scylladb/scylladb:
transport/server: declare a new "CLIENT_OPTIONS" option as supported
service/client_state and alternator/server: use cached values for driver_name and driver_version fields
system.clients: add a client_options column
controller: update get_client_data to use foreign_ptr for client_data
The Boost ASSERTs in the digest functions of the randomized_nemesis_test
were not working well inside the state machine digest functions, leading
to unhelpful boost::execution_exception errors that terminated the apply
fiber, and didn't provide any helpful information.
Replaced by explicit checks with on_fatal_internal_error calls that
provide more context about the failure. Also added validation of the
digest value after appending or removing an element, which allows to
determine which operation resulted in causing the wrong value.
This effectively reverts the changes done in https://github.com/scylladb/scylladb/pull/19282,
but adds improved error reporting.
Refs: scylladb/scylladb#27307
Refs: scylladb/scylladb#17030Closesscylladb/scylladb#27791
The doc about DDL statements claims that an `ALTER KEYSPACE` will fail
in the presence of an ongoing global topology operation.
This limitation was specifically referring to RF changes, which Scylla
implements as global topology requests (`keyspace_rf_change`), and it
was true when it was first introduced (1b913dd880) because there was
no global topology request queue at that time, so only one ongoing
global request was allowed in the cluster.
This limitation was lifted with the introduction of the global topology
request queue (6489308ebc), and it was re-introduced again very
recently (2e7ba1f8ce) in a slightly different form; it now applies only
to RF changes (not to any request type) and only those that affect the
same keyspace. None of these two changes were ever reflected in the doc.
Synchronize the doc with the current state.
Fixes#27776.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#27786
This will allow to add custom XML attribute to the JUnit report. In this
case there will be path to the function that can be used to run with
pytest command. Parametrized tests will have path to the function
excluding parameter.
Closesscylladb/scylladb#27707
Preemptible reclaim is only done from the background reclaimer,
so backtrace is not useful. It's also normal that it takes a long time.
Skip the backtrace when reclaim is preemptible to reduce log noise.
Fixes the issue where background reclaim was printing unnecessary
backtraces in lsa-timing logs when operations took longer than the
stall detection threshold.
Closes: #27692
Co-authored-by: tgrabiec <283695+tgrabiec@users.noreply.github.com>
This patch consists of a few smaller follow-ups to the view building worker:
- catch general execption in staging task registrator
- remove unnecessary CV broadcast
- don't pollute function context with conditionally compiled variable
- avoid creating a copy of tasks map
- fix some typos
Refs https://github.com/scylladb/scylladb/issues/25929
Refs https://github.com/scylladb/scylladb/pull/26897
This PR doesn't fix any bugs but recently we're backporting some PRs to 2025.4, so let's also backport this one to avoid painful conflicts.
Closesscylladb/scylladb#26558
* github.com:scylladb/scylladb:
docs/dev/view-building-coordinator: fix typos
db/view/view_building_worker: remove unnnecessary empty lines
db/view/view_building_worker: fix typo
db/view/view_building_worker: avoid creating a copy of tasks map
db/view/view_building_worker: wrap conditionally compiled code in a scope
db/view/view_building_worker: remove unnecessary CV broadcast
db/view/view_building_worker: catch general execption in staging task registrator
Currently, we determine the live vs. total snapshot size by listing all files in the snapshot directory,
and for each name, look it up in the base table directory and see if it exists there, and if so, if it's the same file
as in the snapshot by looking to the fstat data for the dev id and inode number.
However, we do not look the names in the staging directory so staging sstable
would skew the results as the will falsely contribute to the live size, since they
wouldn't be found in the base directory.
This change processes both the staging directory and base table directory
and keeps the file capacity in a map, indexed by the files inode number, allowing us to easily
detect hard links and be resilient against concurrent move of files from the staging sub-directory
back into the base table directory.
Fixes#27635
* Minor issue, no backport required
Closesscylladb/scylladb#27636
* github.com:scylladb/scylladb:
table: get_snapshot_details: add FIXME comments
table: get_snapshot_details: lookup entries also in the staging directory
table: get_snapshot_details: optimize using the entry number_of_links
table: get_snapshot_details: continue loop for manifest and schema entries
table: get_snapshot_details: use directory_lister
These patches fix a bunch of variables defined in test/cqlpy tests, but not used. Besides wasting a few bytes on disk, these unused variables can add confusion for readers who see them and might think they have some use which they are missing.
All these unused variables were found by Copilot's "code quality" scanner, but I considered each of them, and fixed them manually.
Closesscylladb/scylladb#27667
* github.com:scylladb/scylladb:
test/cqlpy: remove unused variables
test/cqlpy: use unique partition in test
This commit adds a page with an overview of Vector Search under the Features section.
It includes a link to the VS documentation in ScyllaDB Cloud,
as the feature is only available in ScyllaDB Cloud.
The purpose of the page is to raise awareness of the feature.
Fixes https://scylladb.atlassian.net/browse/VECTOR-215Closesscylladb/scylladb#27787
For some reason, we might fail. Retry 10 times, and fail with an error code instead of 404 or whatnot.
Benign, I hope - no need to backport.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#27746
`-fexperimental-assignment-tracking` was added in fdd8b03d4b to
make coroutine debugging work.
However, since then, it became unnecessary, perhaps due to 87c0adb2fe,
or perhaps to a toolchain fix.
Drop it, so we can benefit from assignment tracking (whatever it is),
and to improve compatibility with sccache, which rejects this option.
I verified that the test added in fdd8b03d4b fails without the option
and passes with this patch; in other words we're not introducing a
regression here.
Closesscylladb/scylladb#27763
The `vector_store_client_test_dns_resolving_repeated` test had race
conditions causing it to be flaky. Two main issues were identified:
1. Race between initial refresh and manual trigger: The test assumes
a specific resolution sequence, but timing variations between the
initial DNS refresh (on client creation) and the first manual
trigger (in the test loop) can cause unexpected delayed scheduling.
2. Extra triggers from resolve_hostname fiber: During the client
refresh phase, the background DNS fiber clears the client list.
If resolve_hostname executes in the window after clearing but
before the update completes, pending triggers are processed,
incrementing the resolution count unexpectedly. At count 6, the
mock resolver returns a valid address (count % 3 == 0), causing
the test to fail.
The fix relaxes test assertions to verify retry behavior and client
clearing on DNS address loss, rather than enforcing exact resolution
counts.
Fixes: #27074Closesscylladb/scylladb#27685
Declare support for a 'CLIENT_OPTIONS' startup key.
This key is meant to be used by drivers for sending client-specific
configurations like request timeouts values, retry policy configuration, etc.
The value of this key can be any string in general (according to the CQL binary protocol),
however, it's expected to be some structured format, e.g. JSON.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Optimize memory usage changing types of driver_name and driver_version be
a reference to a cached value instead of an sstring.
These fields very often have the same value among different connections hence
it makes sense to cache these values and use references to them instead of duplicating
such strings in each connection state.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This new column is going to contain all OPTIONS sent in the
STARTUP frame of the corresponding CQL session.
The new column has a `frozen<map<text, text>>` type, and
we are also optimizing the amount of required memory for storing
corresponding keys and values by caching them on each shard level.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
get_client_data() is used to assemble `client_data` objects from each connection
on each CPU in the context of generation of the `system.clients` virtual table data.
After collected, `client_data` objects were std::moved and arranged into a
different structure to match the table's sorting requirements.
This didn't allow having not-cross-shard-movable objects as fields in the `client_data`,
e.g. lw_shared_ptr objects.
Since we are planning to add such fields to `client_data` in following patches this patch
is solving the limitation above by making get_client_data() return `foreign_ptr<std::unique_ptr<client_data>>`
objects instead of naked `client_data` ones.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This PR migrates schema management tests from dtest to this repository.
One reason is that there is an ongoing effort to migrate tests from dtest to here.
Test `TestLargePartitionAlterSchema.test_large_partition_with_drop_column` failed with timeout error once. The main suspect so far are infra related problems, like infra congestion. The [logs from the test execution](https://jenkins.scylladb.com/job/scylla-master/job/dtest-release/1062/testReport/junit/schema_management_test/TestLargePartitionAlterSchema/Run_Dtest_Parallel_Cloud_Machines___Dtest___full_split001___test_large_partition_with_drop_column/), linked in the issue [test_large_partition_with_drop_column failed on TimeoutError #26932](https://github.com/scylladb/scylladb/issues/26932) show the following:
- `populate` works as intended - it starts, then during populate/insert drop column happened, then an exception is raised and intentionally ignored in the test, so no `Finish populate DB` for 50 x 1490 records - expected
- drop column works as intended - interrupts `populate` and proceeds to flush
- flush **probably** works as intended - logs are consistent with what we expect and what I got in local test runs
- `read` is the only thing that visibly got stuck, all the way until timeout happened, 5 minutes after the start
Migrating the test to this repo will also give us test start and end times on CI machines, in the sql report database. It has start and end timestamp for each test executed. We will be able to see how long does it usually take when the test is successful. It can not be seen from the logs, because logs are not kept for successful tests.
Another thing this PR does is adding a log message at the end of `database::flush_all_tables`. This will let us know if a thread got stuck inside or finished successfully. This addresses the **probably** part of the flush analysis step described above. If the issue reoccurs, we will have more information.
The test `test_large_partition_with_add_column` has not been executing for ~5 years. It was never migrated to pytest. The name was left as `large_partition_with_add_column_test`, and was skipped. Now it is enabled and updated.
Both `test_large_partition_with_add_column` and `test_large_partition_with_drop_column` are improved.
Small performance improvements:
- Regex compilation extracted from the stress function to the module level, to avoid recompilation.
- Do not materialize list in `stress_object` for loop. Use a generator expression.
The tests in `TestLargePartitionAlterSchema` are `test_large_partition_with_add_column`
and `test_large_partition_with_drop_column`.
These tests need to replicate the following conditions that led to a bug before a fix from around 5 years ago.
The scenario in which the problem could have happened has to involve:
- a large partition with many rows, large enough for preemption (every 0.5ms) to happen during the scan of the partition.
- appending writes to the partition (not overwrites)
- scans of the partition
- schema alter of that table. The issue is exposed only by adding or dropping a column, such that the added/dropped
column lands in the middle (in alphabetical order) of the old column set.
The way the test is set up is:
- fixed number of writes per populate call
- fixed number of reads
This has the following implications:
- if the machine executing the test is fast, all the writes are done before the 10 seconds sleep
- there are too many reads - most of them get executed after the test logic is done
This patch solves these issues in the following way:
- populate lazily generates write data, and stops when instructed by `stop_populating` event
- read, which is done sequentially, stops when instructed by `stop_reading` event
- number of max operations is increased significantly, but the operations are stopped 1 second
after node flush; this makes sure there are enough operations during the test, but also that
the test does not take unnecessary time
Test execution time has been reduced severalfold. On dev machine the time the tests take is
reduced from 110 seconds to 34 seconds.
scylla-dtest PR that removes migrated tests:
[schema_management_test.py: remove tests already ported to scylladb repo #6427](https://github.com/scylladb/scylla-dtest/pull/6427)
Fixes#26932
This is a migration of existing tests to this repository. No need for backport.
Closesscylladb/scylladb#27106
* github.com:scylladb/scylladb:
test: dtest: schema_management_test.py: speed up `TestLargePartitionAlterSchema` tests
test: dtest: schema_management_test.py: fix large partition add column test
test: dtest: schema_management_test.py: add `TestSchemaManagement.prepare`
test: dtest: schema_management_test.py: test enhancements
test: dtest: schema_management_test.py: make the tests work
test: dtest: migrate setup and tools from dtest
test: dtest: copy unmodified schema_management_test.py
replica: database: flush_all_tables log on completion
`_verify_tasks_processed_metrics()` is used to check that the correct
service level is used to process requests. It takes two service levels
as arguments and executes numerous requests. After that, the number
of tasks processed by one of the service levels is expected to rise
by at least the number of executed requests. In contrast,
the second service level is expected to process fewer tasks than
the number of requests.
Unfortunately, background noise may cause some tasks to be executed
on the service level that is not supposed to process requests.
This patch increases the number of executed requests to eliminate
the chance of noise causing test failures.
Additionally, this commit extends logging to make future investigation
easier.
Fixes: https://github.com/scylladb/scylladb/issues/27715
No backport, fix for test on master.
Closesscylladb/scylladb#27735
* github.com:scylladb/scylladb:
test: remove unused `get_processed_tasks_for_group`
test: increase num of requests in driver_service_level tests
After 39cec4a node join may fail with either "init - Startup failed"
notification or occasionally because it was banned, depending on timing.
The change updates the test to handle both cases.
Fixes: scylladb/scylladb#27697
No backport: This failure is only present in master.
Closesscylladb/scylladb#27768
When a node without the required feature attempts to join a Raft-based
cluster with the feature enabled, there is a race between the join
rejection response ("Feature check failed") and the ban notification
("received notification of being banned"). Depending on timing, either
message may appear in the joining node's log.
This starts to happen after 39cec4a (which introduced informing the
nodes about being banned).
Updated the test to accept both error messages as valid, making the test
robust against this race condition, which is more likely in debug mode
or under slow execution.
Fixes: scylladb/scylladb#27603
No backport: This failure is only present in master.
Closesscylladb/scylladb#27760
The test had a sporadic failure due to a broken promise exception.
The issue was in `test_pinger::ping()` which captured the promise by
move into the subscription lambda, causing the promise to be destroyed
when the lambda was destroyed during coroutine unwinding.
Simplify `test_pinger::ping()` by replacing manual abort_source/promise
logic with `seastar::sleep_abortable()`.
This removes the risk of promise lifetime/race issues and makes the code
simpler and more robust.
Fixes: scylladb/scylladb#27136
Backport to active branches: This fixes a CI test issue, so it is
beneficial to backport the fix. As this is a test-only fix, it is a low
risk change.
Closesscylladb/scylladb#27737
This test starts a 3-node cluster and creates a large blob file so that one
node reaches critical disk utilization, triggering write rejections on that
node. The test then writes data with CL=QUORUM and validates that the data:
- did not reach the critically utilized node
- did reach the remaining two nodes
By default, tables use speculative retries to determine when coordinators may
query additional replicas.
Since the validation uses CL=ONE, it is possible that an additional request
is sent to satisfy the consistency level. As a result:
- the first check may fail if the additional request is sent to a node that
already contains data, making it appear as if data reached the critically
utilized node
- the second check may fail if the additional request is sent to the critically
utilized node, making it appear as if data did not reach the healthy node
The patch fixes the flakiness by disabling the speculative retries.
Fixes https://github.com/scylladb/scylladb/issues/27212Closesscylladb/scylladb#27488
The tests in `TestLargePartitionAlterSchema` are `test_large_partition_with_add_column`
and `test_large_partition_with_drop_column`.
These tests need to replicate the following conditions that led to a bug before a fix from around 5 years ago.
The scenario in which the problem could have happened has to involve:
- a large partition with many rows, large enough for preemption (every 0.5ms) to happen during the scan of the partition.
- appending writes to the partition (not overwrites)
- scans of the partition
- schema alter of that table. The issue is exposed only by adding or dropping a column, such that the added/dropped
column lands in the middle (in alphabetical order) of the old column set.
The way the test is set up is:
- fixed number of writes per populate call
- fixed number of reads
This has the following implications:
- if the machine executing the test is fast, all the writes are done before the 10 seconds sleep
- there are too many reads - most of them get executed after the test logic is done
This patch solves these issues in the following way:
- populate lazily generates write data, and stops when instructed by `stop_populating` event
- read, which is done sequentially, stops when instructed by `stop_reading` event
- number of max operations is increased significantly, but the operations are stopped 1 second
after node flush; this makes sure there are enough operations during the test, but also that
the test does not take unnecessary time
Test execution time has been reduced severalfold. On dev machine the time the tests take is
reduced from 110 seconds to 34 seconds.
The patch also introduces a few small improvements:
- `cs_run` renamed to `run_stress` for clarity
- Stopped checking if cluster is `ScyllaCluster`, since it is the only one we use
- `case_map` removed from `test_alter_table_in_parallel_to_read_and_write`, used `mixed` param directly
- Added explanation comment on why we do `data[i].append(None)`
- Replaced `alter_table` inner function with its body, for simplicity
- Removed unnecessary `ck_rows` variable in `populate`
- Removed unnecessary `isinstance(self.cluster. ScyllaCluster)`
- Adjusted `ThreadPoolExecutor` size in several places where 5 workers are not needed
- Replaced functional programming style expressions for `new_versions` and `columns_list` with
comprehension/generator statement python style code, improving readability
Refs #26932
fix
Currently some things are not supported for colocated tables: it's not
possible to repair a colocated table, and due to this it's also not
possible to use the tombstone_gc=repair mode on a colocated table.
Extend the documentation to explain what colocated tables are and
document these restrictions.
Fixesscylladb/scylladb#27261Closesscylladb/scylladb#27516
`large_partition_with_add_column_test` and `large_partition_with_drop_column_test`
were added on August 17th, 2020 in scylladb/scylla-dtest#1569.
Only `large_partition_with_drop_column_test` was migrated to pytest, and renamed
to `test_large_partition_with_drop_column` on March 31st, 2021 in scylladb/scylla-dtest#2051.
Since then this test has not been running.
This patch fixes it - the test is updated and renamed and the testing environment
now properly picks it up.
Refs #26932
Extract repeated cluster initialization code in `TestSchemaManagement`
into a separate `prepare` method. It holds all the common code for
cluster preparation, with just the necessary parameters.
Refs #26932
Extract regex compilation from the stress functions to the module level,
to avoid unnecessary regex compilation repetition.
Add descriptions to the stress functions.
Do not materialize list in `stress_object` for loop. Use a generator expression.
Make `_set_stress_val` an object method.
Refs #26932
Remove unused function markers.
Add wait_other_notice=True to cluster start method in
TestSchemaHistory.prepare function to make the test stable.
Enable the test in suite.yaml for dev and debug modes.
Fixes#26932
Copy schema_management_test.py from scylla-dtest to
test/cluster/dtest/schema_management_test.py.
Add license header.
Disable it for debug, dev, and release mode.
Refs #26932
Make the removenode operation go through the `left_token_ring` state, similar to decommission. This ensures that when removenode completes, all nodes in the cluster are aware of the topology change through a global token metadata barrier.
Previously, removenode would skip the `left_token_ring` state and go directly from `write_both_read_new` to `left` state. This meant that when the operation completed, some nodes might not yet know about the topology change, potentially causing issues with subsequent data plane requests.
Key changes:
- Both decommission and removenode now transition to `left_token_ring` state in the `write_both_read_new` handler
- In `left_token_ring` state, only decommissioning nodes receive the shutdown RPC (removed nodes are already dead)
- Updated documentation to reflect that both operations use this state
This change improves consistency guarantees for removenode operations by ensuring cluster-wide awareness before completion.
The change is protected by "REMOVENODE_WITH_LEFT_TOKEN_RING" feature flag to also support mixed clusters during e.g. upgrade.
Fixes: scylladb/scylladb#25530
No backport: This fixes and issue found in tests. It can theoretically happen in production too, but wasn't reported in any customer issue, so a backport is not needed.
Closesscylladb/scylladb#26931
* https://github.com/scylladb/scylladb:
topology: make removenode use left_token_ring state for global barrier
topology: allow removing nodes not having tokens
features: add feature flag for removenode via left token ring
The function `get_processed_tasks_for_group` was defined twice in
`test_raft_service_levels.py`. This change removes the unused
definition to avoid confusion and clean up the code.
`_verify_tasks_processed_metrics()` is used to check that the correct
service level is used to process requests. It takes two service levels
as arguments and executes numerous requests. After that, the number
of tasks processed by one of the service levels is expected to rise
by at least the number of executed requests. In contrast,
the second service level is expected to process fewer tasks than
the number of requests.
Unfortunately, background noise may cause some tasks to be executed
on the service level that is not supposed to process requests.
This patch increases the number of executed requests to eliminate
the chance of noise causing test failures.
Additionally, this commit extends logging to make future investigation
easier.
Fixes: scylladb/scylladb#27715
When learning a schema that has a linked cdc schema, we need to learn
also the cdc schema, and at the end the schema should point to the
learned cdc schema.
This is needed because the linked cdc schema is used for generating cdc
mutations, and when we process the mutations later it is assumed in some
places that the mutation's schema has a schema registry entry.
We fix a scenario where we could end up with a schema that points to a
cdc schema that doesn't have a schema registry entry. This could happen
for example if the schema is loaded before it is learned, so when we
learn it we see that it already has an entry. In that case, we need to
set the cdc schema to the learned cdc schema as well, because it could
have been loaded previously with a cdc schema that was not learned.
Fixesscylladb/scylladb#27610Closesscylladb/scylladb#27704
The test generates a staging sstable on a node and verifies whether
the view is correctly populated.
However view updates generated by a staging sstable
(`view_update_generator::generate_and_propagate_view_updates()`) aren't
awaited by sstable consumer.
It's possible that the view building coordinator may see the task as finished
(so the staging sstable was processed) but not all view updates were
writted yet.
This patch fixes the flakiness by waiting until
`scylla_database_view_update_backlog` drops down to 0 on all shards.
Fixesscylladb/scylladb#26683Closesscylladb/scylladb#27389
When calling a migration notification from the context of a notification
callback, this could lead to a deadlock with unregistering a listener:
A: the parent notification is called. it calls thread_for_each, where it
acquires a read lock on the vector of listeners, and calls the
callback function for each listener while holding the lock.
B: a listener is unregistered. it calls `remove` and tries to acquire a
write lock on the vector of listeners. it waits because the lock is
held.
A: the callback function calls another notification and calls
thread_for_each which tries to acquire the read lock again. but it
waits since there is a waiter.
Currently we have such concrete scenario when creating a table, where
the callback of `before_create_column_family` in the tablet allocator
calls `before_allocate_tablet_map`, and this could deadlock with node
shutdown where we unregister listeners.
Fix this by not acquiring the read lock again in the nested
notification. There is no need because the read lock is already held by
the parent notification while the child notification is running. We add
a function `thread_for_each_nested` that is similar to `thread_for_each`
except it assumes the read lock is already held and doesn't acquire it,
and it should be used for nested notifications instead of
`thread_for_each`.
Fixesscylladb/scylladb#27364Closesscylladb/scylladb#27637
Make the removenode operation go through the `left_token_ring` state,
similar to decommission. This ensures that when removenode completes,
all nodes in the cluster are aware of the topology change through a
global token metadata barrier.
Previously, removenode would skip the `left_token_ring` state and go
directly from `write_both_read_new` to `left` state. This meant that
when the operation completed, some nodes might not yet know about the
topology change, potentially causing issues with subsequent data plane
requests.
Key changes:
- Both decommission and removenode now transition to `left_token_ring`
state in the `write_both_read_new` handler
- In `left_token_ring` state, only decommissioning nodes receive the
shutdown RPC (removed nodes may already be dead)
- Updated documentation to reflect that both operations use this state
This change improves consistency guarantees for removenode operations
by ensuring cluster-wide awareness before completion.
Fixes: scylladb/scylladb#25530
For the changes to go through the left_token_ring state when
REMOVENODE_WITH_LEFT_TOKEN_RING feature is enabled, we need to allow
removing nodes to not have any tokens (similarly to decommissioning
nodes, which use the same sequence of states).
This means the tests also need to change to allow for this new behavior
- it can temporarily happen that a removing node has no tokens but is
still part of Raft group 0 (so there may be a temporary mismatch between
the token ring and group 0 membership).
Therefore, the `check_token_ring_and_group0_consistency` function is
replaced by `wait_for_token_ring_and_group0_consistency`, which waits
up to 30 seconds for consistency to be reached.
To improve the behavior of the removenode operation, we want to issue
a global topology barrier after the removenode has been applied.
However, this requires changing the topology state machine to add a new
state (left_token_ring) to the removenode flow, which is not supported
by older nodes.
To allow rolling upgrades, we add a feature flag
REMOVENODE_WITH_LEFT_TOKEN_RING that controls whether the new removenode
flow is used.
Currently the formatter converts it to json and then tries to emit into
the output context with the "...{{}}" format string. The intent was to
have the "...{<json text>}" output. However, the double curly brace in
format string means "print a curly brace", so the output of the above
formatting is "...{}", literally.
Fix by keeping a single curly brace. The "<json text>" thing will have
its own surrounding curly braces.
Fixes#27718
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#27687
This action is triggered when a new milestone is created in scylladb.git
It will call the main logic which will create the same milestone as Jira
releases in the SCYLLADB and CUSTOMER Jira projects.
Fixes: PM-100
Closesscylladb/scylladb#27717
If a keyspace has a numeric replication factor in a DC and rf < #racks,
then the replicas of tablets in this keyspace can be distributed among
all racks in the DC (different for each tablet). With rack list, we need all
tablet replicas to be placed on the same racks. Hence, the conversion
requires tablet co-location.
After this series, the conversion can be done using ALTER KEYSPACE
statement. The statement that does this conversion in any DC is not
allowed to change a rf in any DC. So, if we have dc1 and dc2 with 3 racks
each and a keyspace ks then with a single ALTER KEYSPACE we can do:
- {dc1 : 2} -> {dc1 : [r1, r2]};
- {dc1 : 2, dc2: 2} -> {dc1 : [r1, r2], dc2: [r2,r3]};
- {dc1 : 2, dc2: 2} -> {dc1 : [r1, r2], dc2: 2}
- {dc1 : 2} -> {dc1 : 2, dc2 : [r1]}
But we cannot do:
- {dc1 : 2} -> {dc1 : [r1, r2, r3]};
- {dc1 : 1, dc2 : [r1, r2] → dc1: [r1], dc2: [r1].
In order to do the co-locations rf change request is paused. Tablet
load balancer examines the paused rf change requests and schedules
necessary tablet migrations. During the process of co-location, no other
cross-rack migration is allowed.
Load balancer checks whether any paused rf change request is
ready to be resumed. If so, it puts the request back to global topology
request queue.
While an rf change request for a keyspace is running, any other rf change
of this keyspace will fail.
Fixes: #26398.
New feature, no backport
Closesscylladb/scylladb#27279
* github.com:scylladb/scylladb:
test: add est_rack_list_conversion_with_two_replicas_in_rack
test: test creating tablet_rack_list_colocation_plan
test: add test_numeric_rf_to_rack_list_conversion test
tasks: service: add global_topology_request_virtual_task
cql3: statements: allow altering from numeric rf to rack list
service: topology_coordinator: pause keyspace_rf_change request
service: implement make_rack_list_colocation_plan
service: add tablet_rack_list_colocation_plan
cql3: reject concurrent alter of the same keyspace
test: check paused rf change requests persistence
db: service: add paused_rf_change_requests to system.topology
service: pass topology and system_keyspace to load_balancer ctor
service: tablet_allocator: extract load updates
service: tablet_allocator: extract ensure_node
tasks, system_keyspace: Introduce get_topology_request_entry_opt()
node_ops: Drop get_pending_ids()
node_ops: Drop redundant get_status_helper()
runner.py defines a command-line option `--extra-scylla-cmdline-options`
with the default type=str. However, the function `merge_cmdline_options`,
which consumes this value to merge command-line options from multiple
sources, expects a list of strings.
This mismatch results in the following exception:
```
raise ValueError(f'invalid argument name {name}, all args {args}')
ValueError: invalid argument name o, all args --logger-log-level repair=debug --default-log-level=error
```
when a test is run with pytest using:
`--extra-scylla-cmdline-options='--logger-log-level repair=debug --default-log-level=error'`
Fix this by handling the option consistently and calling `.split()`.
Also change the default value from an empty list to an empty string
to avoid confusion both in runner.py and test.py.
Closesscylladb/scylladb#27523
Ref https://github.com/scylladb/seastar/pull/3163
We can optimize the stat calls we use here by
using open_directory to open the snapshot,
base, and staging directory once, and using statat
calls for the relative name instead of the full
blown file_stat that needs to traverse the whole
path prefix for every call (the dirents are likely
to be cached, but still why waste cpu cycles on that
over and over again).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If the number_of_linkes equals 1, we can be sure that
the file exists only in the snapshot directory so there is no need
to look it up in the data directory.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that we're using a simple loop in the coroutine just continue
the loop for files we want to ignore.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is more efficient to use the coroutine generator
to list the directory.
Brewing changes in seastar would make the generator buffered
as well as adding an extended generation that would
return the file stat data for each entry, that would become
useful in the next patch that optimizes the algorithm by
considering the entry's link count.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When translating Cassandra's unit tests, in a couple of places I accidentally used the same name for two tests, resulting in the first of each pair to never running.
Let's fix the name of the second of the each pair to be the real name it had in the original Cassandra test.
Closesscylladb/scylladb#27644
* github.com:scylladb/scylladb:
test/cqlpy: rename test with duplicate name
test/cqlpy: rename test with duplicate name
Fixing something that never bothered anyone but our automated "code quality" tool: there's an unnecessary call to "pass" in one of our tests. Just remove it.
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27645
This test was observed to fail multiple times recently in promotion,
because there were successful reads. The failure only reproduces on
arm64, it doesn't reproduce on x86.
The suspected reason is that the data set is too close to the edge,
where all reads fail due to too high memory consumption. Reduce the
number of sstables used by this test to 54 (from 64).
Fixes: #27248Closesscylladb/scylladb#27650
Copilot found in test/alternator a bunch of places where we unnecessarily assign a variable that we don't use, or had a duplicated statement which doesn't do anything. This patch fixes all of them. AI still doesn't know how to prepare a patch that looks anything close to reasonable, so I did this part manually, and also carefully investigated each and every change (this took **a lot** of human time).
These patches don't change anything in the functionality of any of the tests. It's all cosmetic.
Closesscylladb/scylladb#27655
* github.com:scylladb/scylladb:
test/alternator: remove unnecessary duplicate statement
test/alternator: remove unused variable assignments
We currently have races, like between moving an sstable from staging
using change_state, or when taking a snapshot, to e.g.
rewrite_statistics that replaces one of the sstable component files
when called, for example, from update_repaired_at by incremental repair.
Use a semaphore as a mutex to serialize those functions.
Note that there is no need for rwlock since the operations
are rare and read-only operations like snapshot don't
need to run in parallel.
Fixes#25919
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Previously, the scheduling_group column was updated during the switch_tenant function, which meant the update occurred only after the tenant change operation completed—updating rows one by one. With this change, the scheduling_group column is now updated before the switch_tenant logic runs, ensuring that the table reflects the correct scheduling groups for all rows as early as possible.
fixes: #26060fixes: #27295
backport: not required
this is a minor bug fix. Internal logic worked but the user couldnt see the change in the table if they would read the system.clients table
Closesscylladb/scylladb#26404
* github.com:scylladb/scylladb:
test: cqlpy: Remove test_switch_tenants and add test in cluster testing. The test needs to run twice, in two separate Scylla runs, using two different modes: gossip and raft. The cluster framework supports this setup, while cqlpy only runs against Scylla instances in raft mode. Therefore, the test was moved from cqlpy to the cluster-based framework. This commit both adds the test in cluster/ and removes the old version in cqlpy/.
server: Refactor update_control_connection_scheduling_group functionality This refactoring moves the logic that retrieves the scheduling group for driver_service_level_name out of switch_tenant. This change is possible because the scheduling group for the driver is retrieved from a map (LOOKUP). The lookup function is fully synchronized, non-coroutine, and returns immediately. For that reason, it’s better to perform this lookup outside of the switch_tenant function.
server: Refactor scheduling group update functionality. This change generalizes the scheduling-group update functionality and removes some copy-paste code, improving overall readability and maintainability. To achieve this, capturing lambdas were introduced. As a result, self-deducing this was added to those lambdas to avoid coroutine-related issues (“coroutine fiasco”).
server: Fix switch_tenant problem, When running on a V2 server, service-level data comes from service level cache. Because of this, we can use synchronized function to get the schedualing group. Since we are transitioning to a Raft-based architecture where all servers will be V2, we can safely implement this fix specifically for that case. This change adds get_cached_user_scheduling_group functionality and moves its usage out of switch_tenant function in update_scheduling_group_v2 usage.
server: Add update_service_level_scheduling_group_v1 functions to create placehholder for functionality that will introduce v2 implementation. The new functionality will allow usage of service level cache
Add a service::topo::global_topology_request_virtual_task, which
covers the replication factor changes.
Currently, the global_topology_request_virtual_task can be aborted
only if it is paused.
The progress of the rf change isn't counted.
Allow altering from numeric replication factor to rack list. Ensure
that a single ALTER KEYSPACE statement doesn't try to both convert
to rack list and change rf.
To do the conversion from numeric rf to rack list, we need to co-locate
tablets of the keyspace, so that all of them have replicas on the same racks.
Pause the keyspace_rf_change global topology request, so that the co-location
could be done before the ALTER KEYSPACE changes are applied.
The pause is needed if in any dc rf changes from numeric to rack list
and the co-location is necessary. In this case we don't finish the request.
Instead, we add the request to the paused request vector. No migrations are
started.
The make_rack_list_colocation_plan consists of two phases.
In the first phase (realized with find_required_rack_list_colocations),
we find the pairs of (replica to be co-located, destination dc and rack).
We skip the pairs related to the tablets that are in transition or for
which the load balancer migration is already planned. We group the pairs
by destination dc and rack. Thanks to that in the second phase we can
calculate the least loaded nodes and shards only once for each rack.
In the second phase, we calculate the load of the nodes in a cluster
based on current transition and previously scheduled migrations.
We utilize the map created in the first phase and choose the least
loaded targets in each rack. We skip the tablets for which the
co-location was already scheduled.
find_required_rack_list_colocations isn't a method of load_balancer,
because in the following changes it is going to be reused by topology
coordinator to determine whether the rf change should be paused.
Add tablet_rack_list_colocation_plan. Keep it in migration_plan.
The plan includes a request that is ready to resume. There can be
more than one such request at the time, but we consider them one
by one for clarity of code. Rack list co-locations will be kept together
with normal load balancer migrations.
Consider normal load balancer migrations before rack list co-locations.
During rack list co-location, allow load balancer migrations to happen
only within a single rack. Do not create the merge co-location plan if
there is ongoing rack list co-location (if there are any rf changes paused).
Generate rf change resume based on the plan. Add _request_to_resume back
to global requests queue.
make_rack_list_colocation_plan will be implemented in the following change.
Reject ALTER KEYSPACE request if there is unfinished (queued, pending,
or paused) alter request of the same keyspace.
This is required as in the following changes, global request queue
will contain rf change requests meant to be resumed.
In the following changes, we allow to alter from numeric rf to rack list.
Before the alter, two tablets of the same keyspace can have replicas
on different racks. To switch to rack list, we need to co-locate
the replicas. It will be achieved by pausing the keyspace_rf_change
and scheduling migrations.
We need to persist the ids of requests that are paused. A new column -
paused_rf_change_requests is added to system.topology table.
In this commit no data is kept in the new column.
Pass a pointer to service::topology and db::system_keyspace to load
balancer. It will be used in the following patches to create
rack_list_colocation plan.
Every pending request should also have an entry in
system.topology_requests so it's redundant.
And problematic, because we cannot build a full request entry from
just an id alone, so if we would return those requests, they would
have blank information, and logic which needs more information would
not work.
`system.client_routes` is a system table that sets the target address and ports for each `host_id`, for one or more connection (e.g., Private Link) represented by `connection_id`. Cloud will write the table via REST, and drivers will read it via CQL to override values obtained from `system.local` and `system.peers`.
This patch series contains:
- Introduction of `CLIENT_ROUTES` feature flag.
- Implementation of raft-based `system.client_routes` table
- Implementation of `v2/client-routes` POST/DELETE/GET endpoints
- Implementation of new `CLIENT_ROUTES_CHANGE` event that is sent to drivers when `system.client_routes` is changed
- New tests that verifies the aforementioned features
Ref: scylladb/scylla-enterprise#5699
For now, no automatic backport. However, the changes are planned to be release on `2025.4` either as a backport or a private build.
Closesscylladb/scylladb#27323
* https://github.com/scylladb/scylladb:
docs: describe CLIENT_ROUTES_CHANGE extension
test: add test for CLIENT_ROUTES event
service: transport: add CLIENT_ROUTES_CHANGE event
test: add cluster tests for client routes
test: add API tests for client_routes endpoints
test: add `timeout` parameter to `delete` in RESTClient
test: allow json_body in send
api: implement client_routes endpoints
api: add client_routes.json
service: main: add client_routes_service
db: add system.client_routes table
gms: add CLIENT_ROUTES feature
This reverts commit 866c96f536, reversing
changes made to 367633270a.
This change caused all longevities to fail, with a crash in parsing
scylla-metadata. The investigation is still ongoing, with no quick fix
in sight yet.
Fixes: #27496Closesscylladb/scylladb#27518
To fix this issue, the std_list_iterator class defined within
std_list.__iter__ should implement the full iterator protocol by defining
an __iter__() method that returns self. This change ensures any instance
of std_list_iterator can be used as an iterator in Python for loops and
other iteration contexts, as required. The fix is to add a small method
definition inside the std_list_iterator class, ideally after the __init__
or in a logical place with the other dunder methods.
Only the code inside the std_list class's __iter__ function (lines around
the definition of the inner class and its methods) needs to be edited.
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27642
Fixes#26744
If a segment to replay is broken such that the main header is not zero, but still broken, we throw header_checksum_error. This was not handled in replayer, which grouped this into the "user error/fundamental problem" category.
However, assuming we allow for "real" disk corruption, this should really be treated same as data corruption, i.e. reported data loss, not failure to start up.
The `test_one_big_mutation_corrupted_on_startup` test accidentally sometimes provoked this issue, by doing random file wrecking, which on rare occasions provoked this, and thus failed test due to scylla not starting up, instead of losing data as expected.
Closesscylladb/scylladb#27556
* github.com:scylladb/scylladb:
test::cluster::dtest::tools::files: Remove file
commitlog_replay: Handle fully corrupt files same as partial corruption.
test::pylib::suite::base: Split options.name test specifier only once
Fixes#17384
Bypasses enabling off-strategy storage/placement for repair streams
when table repaired is using tablets. Instead, the resulting sstable(s)
will be placed in the "normal" set of sstables, and bypass a post-repair
off-strategy compaction.
v2:
Bypass off-strat for whatever reason iff dest is tablets.
Closesscylladb/scylladb#27500
The test fails in CI sometimes, and we want a coredump from a failure
to debug that. We made the test send a `signal SIGSEGV` to Scylla
on failure, but apparently that doesn't work as intended on our CI
hosts. (The CI runner seemingly can't find any coredump afterwards).
We can use gdb's `gcore` command to produce a coredump in a more
predictable way.
Refs scylladb/scylladb#22501Closesscylladb/scylladb#27498
This series adds an xfailing reproducers for two issue: #8070 and #27037:
27037 is about where even with alternator_streams_increased_compatibility set to true, if an attribute
is set to the same value it had but using a different JSON representation - a Alternator Streams
event is unduly produced.
8070 is about the ability to write malformed values into the database and then fail during read - instead of failing, as expected, during the write. This issue was known for years, but we never really had a reproducer for it - it's not possible to reproduce it using clean boto3 code and we need to build a request manually.
The first two patches are two small cleanups (including fixes#27372) that I did while preparing the real tests - which are in the final two patches.
Closesscylladb/scylladb#27376
* github.com:scylladb/scylladb:
test/alternator: add reproducer for bug with storing invalid values
test/alternator: reproducer for issue 27375
utils/rjson: fix error messages from rjson::parse()
test/alternator: extract get_signed_request() to util.py
Introduce the CLIENT_ROUTES_CHANGE event to let drivers refresh
connections when `system.client_routes` is modified. Some deployments
(e.g., Private Link) require specific address/port mappings that can
change without topology changes and drivers need to adapt promptly
to avoid connectivity issues.
This new EVENT type carries a change indicator plus the affected
`connection_ids` and `host_ids`. The only change value is
`UPDATE_NODES`, meaning one or more client routes were inserted,
updated, or deleted.
Drivers subscribe using the existing events mechanism, so no additional
`cql_protocol_extension` key is required.
Ref: scylladb/scylla-enterprise#5699
Copilot detected a few cases of cqlpy tests setting a variable which
they don't use. In all the cases in this patch, we can just remove
the variable. Although the AI found all these unused variables, I
verified each case carefully before changing it in this patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
copilot noticed that test/alternator/test_scan.py had a duplicate
statement (call to full_scan()). It doesn't break the test, but also
adds nothing but confusion - so let's just remove it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
copilot noticed in that in in many of Alternator tests, we have some
unnecessary assignments. For example, in a few places, we use the idiom:
with pytest.raises(...):
ret = ...
The "ret=" part is unnecessary, as this test expects the statement to
fail (hence the raises()), and ret is never assigned. The assignment
was only there because we copied this statement from another place in
the test, which does expect the statement to pass and wants to validate
the returned value.
So we should just drop the "ret=" from these tests.
Another common occurance is that we used the idiom
response = table.do_something()
Without checking the response and no intention to check it (either we
know it will work, or we just want to check it doesn't throw). So we
can drop the "response=" here too.
All of the unused variables in this patch were discovered by Copilot,
but I reviewed each of them carefully myself and prepared this patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
It is traditional to use a unique (or random) partition key in cqlpy
tests, to allow multiple tests to share the same table and make the test
suite a bit faster. One of the tests, test_multi_column_relation_desc,
set up a unique key "k", but then forgot to use it and used partition
key 0 instead. Fix the test to use this k.
This problem was spotted by Copilot, who saw the unused variable k.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Fix multiple cases where the captured `std::exception_ptr` has been re-thrown via simple `throw eptr;`, which results in losing the original exception type and details.
Resolved at various places found by clang-tidy:
1. db::schema_applier
When applying schema changes, the previous implementation attempted to handle exceptions by catching and rethrowing them, but did so incorrectly: using `throw ex` with a `std::exception_ptr` loses the original exception type and details.
However, in this case, explicit exception handling is unnecessary. The only reason for catching was to ensure `ap.destroy()` is called before propagating the exception. This can be more cleanly and safely achieved using Seastar's `.finally()` continuation, which guarantees cleanup regardless of success or failure.
2. directories
The `std::exception_ptr()` has been captured for logging and then again re-thrown incorrectly via `throw ex;`. We could use `std::rethrow_exception()` here instead, but it seems to be simpler to just use regular `throw;` to rethrow the original exception, and only use the `std::current_exception()` for logging (which is a pattern used in other places as well).
3. storage_service
Here the exception has been re-thrown incorrectly in a coroutine. There it is best to use the `co_await coroutine::return_exception_ptr` to propagate exception more efficiently in a coroutine-friendly manner.
Fixes: SCYLLADB-94
Refs: scylladb/scylladb#27501
No backport: This fixes an error logging issue, that isn't a production problem by itself (only found in test), therefore not backporting to older branches.
Closesscylladb/scylladb#27613
* https://github.com/scylladb/scylladb:
db: schema_applier: improve exception-safe cleanup
directories: fix exception rethrowing
storage_service: use coroutine-friendly exception propagation in join_node_response_handler
The cqlpy test test_materialized_view.py::test_view_in_system_tables
checks that the system table "system.built_views" can inform us that
a view has been built. This test was flaky, starting to fail quite
often recently, and this patch fixes the problem in the test.
For historic reasons this test began by calling a utility function
wait_for_view_built() - which uses a different system table,
system_distributed.view_build_status, to wait until the view was built.
The test then immediately tries to verify that also system.built_views
lists this view.
But there is no real reason why we could assume - or want to assume -
that these two tables are updated in this order, or how much time
passed between the two tables being changed. The authors of this
test already acknowledged there is a problem - they included a hack
purporting to be a "read barrier" that claimed to solve this exact
problem - but it seems it doesn't, or at least no longer does after
recent changes to the view builder's implementation.
The solution is simple - just remove the call to wait_for_view_built()
and the "hack" after it. We should just wait in a loop (until a timeout)
for the system table that we really wanted to check - system.built_views.
It's as simple as that. No need for any other assumptions or hacks.
Fixes#27296
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27626
When translating Cassandra's test validation/operations/CreateTest.java
I accidentally used the same name for two tests, resulting in the first
of them never being run.
Let's fix the name of the second of the two to be the real name it had
in the original Cassandra test.
After this patch pytest reports 16 tests in this file, instead of 15
before this patch. The previously-ignored test was correct, and it
now passes in both Scylla and Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Currently, we first print the json contents into a stringstream buffer
and then we write it as a whole to the manifest.json file output stream.
This is not scalable and may cause large allocation for large enough number
of files.
Fixes#24216
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#27542
Add the JSON definitions for the POST, GET, and DELETE endpoints used
to modify client routes. These endpoints are intended for Cloud to
update the `system.client_routes` table.
The API is implemented in `/v2/` because the endpoints process arrays
of objects. Handling of such structures was improved between
Swagger 1.2 and 2.0 versions. There are already similar
`get_metrics_config` and `set_metrics_config` endpoints that operate
on similar structures and they are also in /v2/.
The introduced JSON files start with `, ` but it's intended because
the files are concatenated to the existing (metrics) JSON files,
and they need to represent valid JSON after the concatenation.
Ref: scylladb/scylla-enterprise#5699
Introduce `system.client_routes`, a system table that sets the target
address and ports for each `host_id`, for one or more connections
(e.g., Private Link) represented by `connection_id`. Cloud will write
the table via REST, and drivers will read it via CQL to override
values obtained from `system.local` and `system.peers`.
The table is Raft-managed to provide consistent replication across
nodes.
Schema overview: each row is identified by `(connection_id, host_id)`
and describes where clients should connect: `address` and one or more of
`port`, `tls_port`, `alternator_port`, `alternator_https_port`.
`host_id` is a UUID (just as in ScyllaDB) but `connection_id` can be
any string to accept formats of all cloud providers. `address` is
also a regular string because it can represent either an IP address or
a domain. Ports are optional in the sense that at least one of
the four must be provided.
Ref: scylladb/scylla-enterprise#5699
The feature will be used later in this patch series:
- To avoid unnecessary operations when the feature is not enabled
- To guard new API endpoints from being used before the cluster is
ready to use them.
- To implement update tests (by disabling/enabling the feature)
Ref: scylladb/scylla-enterprise#5699
The batchlog table contains an entry for each logged batch that is processed by the local node as coordinator. These entries are typically very short lived, they are inserted when the batch is processed and deleted immediately after the batch is successfully applied.
When a table has `tombstone_gc = {'mode': 'repair'}` enabled, every repair has to flush all hints and batchlogs, so that we can be certain that there is no live data in any of these, older than the last repair. Since batches can contain member queries from any number of tables, the whole batchlog has to be flushed, even if repair-mode tombstone-gc is enabled for a single table.
Flushing the batchlog table happens by doing a batchlog replay. This involves reading the entire content of this table, and attempting to replay+delete any live entries (that are old enough to be replayed). Under normal operating circumstances, 99%+ of the content of the batchlog table is partition tombstones. Because of this, scanning the content of this table has to process thousands to millions of tombstones. This was observed to require up to 20 minutes to finish, causing repairs to slow down to a crawl, as the batchlog-flush has to be repeated at the end of the repair of each token-range.
When trying to address this problem, the first idea was that we should expedite the garbage-collection of these accumulated tombstones. This experiment failed, see https://github.com/scylladb/scylladb/pull/23752. The commitlog proved to be an impossible to bypass barrier, preventing quick garbage-collection of tombstones. So long as a single commit-log segment is alive, holding content from the batchlog table, all tombstones written after are blocked from GC.
The second approach, represented by this PR, is to not rely in tombstone GC to reduce the tombstone amount. Instead restructure the table such that a single higher-order tombstone can be used to shadow and allow for the eviction of the myriads of individual batchlog entry tombstones. This is realized by reorganizing the batchlog table such that individual batches are rows, not partitions.
This new schema is introduced by the new `system.batchlog_v2` table, introduced by this PR:
CREATE TABLE system.batchlog_v2 (
version int,
stage int,
shard int,
written_at timestamp,
id uuid,
data blob,
PRIMARY KEY ((version, stage, shard), written_at, id));
The new schema organization has the following goals:
1) Make post-replay batchlog cleanup possible with a simple range-tombstone. This allows dropping the individual dead batchlog entries, as they are shadowed by a higher level tombstone. This enables dropping tombstones without tombstone GC.
2) To make the above possible, introduce the stage key component: batchlog entries that fail the first replay attempt, are moved to the failed_replay stage, so the initial stage can be cleaned up safely.
3) Spread out the data among Scylla shards, via the batchlog shard column.
4) Make batchlog entries ordered by the batchlog create time (id). This allows for selecting batchlogs to replay, without post-filtering of batchlogs that are too young to be replayed.
Fixes: https://github.com/scylladb/scylladb/issues/23358
This is an improvement, normally not a backport-candidate. We might override this and backport to allow wider use of `tombstone_gc: {'mode': 'repair'}`.
Closesscylladb/scylladb#26671
* github.com:scylladb/scylladb:
db/config: change batchlog_replay_cleanup_after_replays default to 1
test/boost/batchlog_manager_test: add test for batchlog cleanup
replica/mutation_dump: always set position weight for clustering positions
service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/
test/lib: introduce error_injection.hh
utils/error_injection: add debug log to disable() and disable_all()
test/lib/cql_test_env: forward config to batchlog
test/lib/cql_test_env: add batch type to execute_batch()
test/lib/cql_assertions: add with_size(predicate) overload
test/lib/cql_assertions: add source location to fail messages
test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row()
test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check
test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload
db/batchlog_manager: config: s/write_timeout/reply_timeot/
db,service: switch to system.batchlog_v2
db/system_keyspace: introduce system.batchlog_v2
service,db: extract generation of batchlog delete mutation
service,db: extract get_batchlog_mutation_for() from storage-proxy
db/batchlog_manager: only consider propagation delay with tombstone-gc=repair
db/batchlog_manager: don't drop entire batch if one mutations' table was dropped
data_dictionary: table: add get_truncation_time()
db/batchlog_manager: batch(): replace map_reduce() with simple loop
db/batchlog_manager: finish coroutinizing replay_all_failed_batches
db/batchlog_manager: improve replayAllFailedBatches logs
When translating Cassandra's test validation/operations/DeleteTest.java
I accidentally used the same name for two tests, resulting in the first
of them never being run.
Let's fix the name of the second of the two to be the real name it had
in the original Cassandra test.
After this patch pytest reports 52 tests in this file, instead of 51
before this patch. The previously-ignored test was correct, and it
now passes in both Scylla and Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When creating an alternator table with tablets, if it has an index, LSI
or GSI, require the config option rf_rack_valid_keyspaces to be enabled.
The option is required for materialized views in tablets keyspaces to
function properly and avoid consistency issues that could happen due to
cross-rack migrations and pairing switches when RF-rack validity is not
enforced.
Currently the option is validated when creating a materialized view via
the CQL interface, but it's missing from the alternator interface. Since
alternator indexes are based on materialized views, the same check
should be added there as well.
Fixesscylladb/scylladb#27612Closesscylladb/scylladb#27622
We currently allow restrictions on single column primary key,
but we ignore the restriction and return all results.
This can confuse the users. We change it so such a restriction
will throw an error and add a test to validate it.
Fixes: VECTOR-331
Closesscylladb/scylladb#27143
The cluster framework supports this setup, while cqlpy only runs against Scylla instances in raft mode.
Therefore, the test was moved from cqlpy to the cluster-based framework.
This commit both adds the test in cluster/ and removes the old version in cqlpy/.
This change is possible because the scheduling group for the driver is retrieved from a map (LOOKUP).
The lookup function is fully synchronized, non-coroutine, and returns immediately.
For that reason, it’s better to perform this lookup outside of the switch_tenant function.
To achieve this, capturing lambdas were introduced. As a result, self-deducing this was added to those lambdas to avoid coroutine-related issues (“coroutine fiasco”).
Since we are transitioning to a Raft-based architecture where all servers will be V2, we can safely implement this fix specifically for that case.
This change adds get_cached_user_scheduling_group functionality and moves its usage out of switch_tenant function in update_scheduling_group_v2 usage.
This patch series contains the following changes:
- Incorporation of `crypt_sha512.c` from musl to out codebase
- Conversion of `crypt_sha512.c` to C++ and coroutinization
- Coroutinization of `auth::passwords::check`
- Enabling use of `__crypt_sha512` orignated from `crypt_sha512.c` for
computing SHA 512 passwords of length <=255
- Addition of yielding in the aforementioned hashing implementation.
The alien thread was a solution for reactor stalls caused by indivisible
password‑hashing tasks (https://github.com/scylladb/scylladb/issues/24524).
However, because there is only one alien thread, overall hashing throughput was reduced
(see, e.g., https://github.com/scylladb/scylla-enterprise/issues/5711). To address this,
the alien‑thread solution is reverted, and a hashing implementation
with yielding is introduced in this patch series.
Before this patch series, ScyllaDB used SHA-512 hashing provided
by the `crypt_r` function, which in our case meant using the implementation
from the `libxcrypt` library. Adding yielding to this `libxcrypt`
implementation is problematic, both due to licensing (LGPL) and because the
implementation is split into many functions across multiple files. In
contrast, the SHA-512 implementation from `musl libc` has a more
permissive license and is concise, which makes it easier to incorporate
into the ScyllaDB codebase.
The performance of this solution was compared with the previous
implementation that used one alien thread and the implementation
after the alien thread was reverted. The results (median) of
`perf-cql-raw` with `--connection-per-request 1 --smp 10` parameters
are as follows:
- Alien thread: 41.5 new connections/s per shard
- Reverted alien thread: 244.1 new connections/s per shard
- This commit (yielding in hashing): 198.4 new connections/s per shard
The roughly 20% performance deterioration compared to
the old implementation without the alien thread comes from the fact
that the new hashing algorithm implemented in `utils/crypt_sha512.cc`
performs an expensive self-verification and stack cleanup.
On the other hand, with smp=10 the current implementation achieves
roughly 5x higher throughput than the alien thread. In addition,
due to yielding added in this commit, the algorithm is expected
to provide similar protection from stalls as the alien thread did.
In a test that in parallel started a cassandra-stress workload and
created thousands of new connections using python-driver, the values of
`scylla_reactor_stalls_count` metric were as follows:
- Alien thread: 109 stalls/shard total
- Reverted alien thread: 13186 stalls/shard total
- This commit (yielding in hashing): 149 stalls/shard total
Similarly, the `scylla_scheduler_time_spent_on_task_quota_violations_ms`
values were:
- Alien thread: 1087 ms/shard total
- Reverted alien thread: 72839 ms/shard total
- This commit (yielding in hashing): 1623 ms/shard total
To summarize, yielding during hashing computations achieves similar
throughput to the old solution without the alien thread but also
prevents stalls similarly to the alien thread.
Fixes: scylladb/scylladb#26859
Refs: scylladb/scylla-enterprise#5711
No automatic backport. After this PR is completed, the alien thread should be rather reverted from older branches (2025.2-2025.4 because on 2025.1 it's already removed). Backporting of the other commits needs further discussion.
Closesscylladb/scylladb#26860
* github.com:scylladb/scylladb:
test/boost: add too_long_password to auth_passwords_test
test/boost: add same_hashes_as_crypt_r to auth_passwords_test
auth: utils: add yielding to crypt_sha512
auth: change return type of passwords::check to future
auth: remove code duplication in verify_scheme
test/boost: coroutinize auth_passwords_test
utils: coroutinize crypt_sha512
utils: make crypt_sha512.cc to compile
utils: license: import crypt_sha512.c from musl to the project
Revert "auth: move passwords::check call to alien thread"
When applying schema changes, the previous implementation attempted to
handle exceptions by catching and rethrowing them, but did so
incorrectly: using `throw ex` with a `std::exception_ptr` loses the
original exception type and details. The correct approach is to use
`std::rethrow_exception()`.
However, in this case, explicit exception handling is unnecessary. The
only reason for catching was to ensure `ap.destroy()` is called before
propagating the exception. This can be more cleanly and safely achieved
using Seastar's `.finally()` continuation, which guarantees cleanup
regardless of success or failure.
This change removes the manual try/catch/rethrow and uses `.finally()`
to ensure proper cleanup, letting exceptions propagate naturally and
preserving their type and information.
Fixes: SCYLLADB-94
Refs: scylladb/scylladb#27501
Fix location identified by clang-tidy where `std::exception_ptr` was
incorrectly rethrown using `throw ep;`. The correct approach is to use
`std::rethrow_exception(ep)`, which preserves the original exception
type and stack trace.
But this can be even further simplified by logging the current exception
with `std::current_exception()` and rethrowing using `throw;` instead of
capturing and rethrowing a `std::exception_ptr`. This matches the
idiomatic pattern used elsewhere in the codebase and improves clarity.
This change ensures proper exception propagation and avoids type slicing
or loss of diagnostic information.
Improve exception handling in join_node_response_handler by using
`co_await coroutine::return_exception_ptr` to propagate exceptions.
This replaces the incorrect direct throw of `std::exception_ptr` and
ensures proper coroutine-friendly exception propagation.
Reverts commit 8192f45e84.
The merge exposed a critical bug where truncate operations during table drop with auto-snapshot fail, causing Raft applier fiber to stop with unhandled exceptions. This leads to schema inconsistencies across nodes and test failures with "Keyspace does not exist" errors.
**Root Cause**
Commit 19b6207f modified `truncate_table_on_all_shards` to set `use_sstable_identifier = true`:
```cpp
// Before (working)
co_await table::snapshot_on_all_shards(sharded_db, table_shards, name);
// After (broken)
auto opts = db::snapshot_options{.use_sstable_identifier = true};
co_await table::snapshot_on_all_shards(sharded_db, table_shards, name, opts);
```
This triggers exceptions during snapshot that propagate through Raft state machine, causing:
- Raft applier stops: `raft::state_machine_error` at `raft/server.cc:1369`
- Schema changes fail to propagate
- Nodes report non-existent keyspaces for valid schemas
**Changes**
Reverts 15 files (200 deletions, 74 insertions):
- Removes `use_sstable_identifier` from truncate/snapshot code paths
- Reverts `snapshot_options` struct back to simple `skip_flush` boolean
- Removes REST API and nodetool `--use-sstable-identifier` parameter
- Removes feature tests from `test/boost/database_test.cc`
No backport required - the original feature was merged to master only and never released.
<!-- START COPILOT ORIGINAL PROMPT -->
<details>
<summary>Original prompt</summary>
----
*This section details on the original issue you should resolve*
<issue_title>test_table_drop_with_auto_snapshot failed with InvalidRequest</issue_title>
<issue_description>Seen in: https://jenkins.scylladb.com/job/scylla-master/job/next/9968//testReport
Logs: [download](https://downloads.scylladb.com/unstable/scylla/master/testLogs/2025-12-08T15:05:42Z/)
Error message:
```
cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Keyspace test does not exist"
```
Stacktrace:
```
manager = <test.pylib.manager_client.ManagerClient object at 0xffff734e6c10>
@pytest.mark.asyncio
async def test_table_drop_with_auto_snapshot(manager: ManagerClient):
logger.info("Bootstrapping cluster")
cfg = { 'auto_snapshot': True }
servers = await manager.servers_add(3, config = cfg)
cql = manager.get_cql()
# Increases the chance of tablet migration concurrent with schema change
await inject_error_on(manager, "tablet_allocator_shuffle", servers)
for i in range(3):
await cql.run_async("DROP KEYSPACE IF EXISTS test;")
await cql.run_async("CREATE KEYSPACE IF NOT EXISTS test WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 8 };")
await cql.run_async("CREATE TABLE IF NOT EXISTS test.tbl_sample_kv (id int, value text, PRIMARY KEY (id));")
> await cql.run_async("INSERT INTO test.tbl_sample_kv (id, value) VALUES (1, 'ala');")
E cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Keyspace test does not exist"
test/cluster/test_tablets2.py:173: InvalidRequest
```
</issue_description>
<agent_instructions>this issue was exposed by commit 8192f45e84, please send a pull request reverting that merge commit and mark it as fixing this github issue.</agent_instructions>
<comments>
<comment_new><author>@yaronkaikov</author><body>
@denesb is this something in your team area? if not , please feel free to delegate it or un-assign yourself :-)</body></comment_new>
<comment_new><author>@nyh</author><body>
This is very strange. Clearly the keyspace `test` does exist at this point, because we created it two lines above and also we ran `CREATE TABLE .. test.tbl_sample_kv` which would have failed if the keyspace `test` didn't exist - so it must exit, no?
In the past, we had a bug where the running `CREATE KEYSPACE IF NOT EXISTS` forgot to set the "schema modified" event in the response so it failed to wait for schema agreement, but 1. we fixed this bug (https://github.com/scylladb/scylladb/pull/18819 by @nuivall ) and 2. this bug didn't happen in this case, where CREATE TABLE deed had work to do.
But I just realized something... Our fix in https://github.com/scylladb/scylladb/pull/18819 only applies to CREATE KEYSPACE / TABLE / VIEW / TYPE statements. It wasn't applied to `DROP KEYSPACE` - and it should have been....
But I don't have a good theory how a bug like https://github.com/scylladb/scylladb/pull/18819 can explain this specific test failure. Different schema operations are already linearized, so if a `CREATE TABLE test.tbl_sample_kv` succeeded, I don't see how there could possibly be any earlier `DROP KEYSPACE test` that suddenly springs to life. Unless we have a serious bug in our raft-based schema operations.</body></comment_new>
<comment_new><author>@nyh</author><body>
Another bug we could have in theory is that the Python driver's async `cql.run_async` might have a bug where it is not waiting for the schema agreement despite being told to wait. If it doesn't wait for schema agreement, this can easily explain this bug:
1. the CREATE KEYSPACE, CREATE TABLE both are sent to node A, but
2. the last INSERT INTO is sent to node B which is not yet aware of this new keyspace and table, and fails.
Copilot claims that **execute_async() does have this bug!**
> For schema-altering statements, schema agreement (meaning all nodes agree on the new schema) is important before running follow-up operations, but this is enforced only by synchronous helpers like Session.execute(), not the asynchronous version.
> If you use execute_async() for schema operations, you are responsible for checking schema agreement yourself, using [Session.check_schema_agreement()](https://docs.datastax.com/en/developer/python-driver/latest/api/cassandra/cluster/#cassandra.cluster.Session.check_schema_agreement) or (in newer code) ResponseFuture.check_schema_agreement.
> According to [a discussion on the DataStax support forum](https://support.datastax.com/s/article/Does-the-Python-Driver-for-Cassandra-Wait-for-Schema-Agreement-after-a-Schema-Change?language=en_US) and the [driver’s source code](7f12a5e1c6/cassandra/cluster.py (L487)), schema agreement is not ch...
</details>
<!-- START COPILOT CODING AGENT SUFFIX -->
- Fixesscylladb/scylladb#27501
<!-- START COPILOT CODING AGENT TIPS -->
---
Closesscylladb/scylladb#27604
* github.com:scylladb/scylladb:
Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy"
Initial plan
This reverts commit 8192f45e84.
The merge exposed a bug where truncate (via drop) fails and causes Raft
errors, leading to schema inconsistencies across nodes. This results in
test_table_drop_with_auto_snapshot failures with 'Keyspace test does not exist'
errors.
The specific problematic change was in commit 19b6207f which modified
truncate_table_on_all_shards to set use_sstable_identifier = true. This
causes exceptions during truncate that are not properly handled, leading
to Raft applier fiber stopping and nodes losing schema synchronization.
Add pull_request_target event with unlabeled type to trigger-scylla-ci
workflow. This allows automatic CI triggering when the 'conflicts' label
is removed from a PR, in addition to the existing manual trigger via
comment.
The workflow now runs when:
- A user posts a comment with '@scylladbbot trigger-ci' (existing)
- The 'conflicts' label is removed from a PR (new)
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-84Closesscylladb/scylladb#27521
Fixed a critical bug where `storage_group::for_each_compaction_group()` was incorrectly marked `noexcept`, causing `std::terminate` when actions threw exceptions (e.g., `utils::memory_limit_reached` during memory-constrained reader creation).
**Changes made:**
1. Removed `noexcept` from `storage_group::for_each_compaction_group()` declaration and implementation
2. Removed `noexcept` from `storage_group::compaction_groups()` overloads (they call for_each_compaction_group)
3. Removed `noexcept` from `storage_group::live_disk_space_used()` and `memtable_count()` (they call compaction_groups())
4. Kept `noexcept` on `storage_group::flush()` - it's a coroutine that automatically captures exceptions and returns them as exceptional futures
5. Removed `noexcept` from `table_load_stats()` functions in base class, table, and storage group managers
**Rationale:**
As noted by reviewers, there's no reason to kill the server if these functions throw. For coroutines returning futures, `noexcept` is appropriate because Seastar automatically captures exceptions and returns them as exceptional futures. For other functions, proper exception handling allows the system to recover gracefully instead of terminating.
Fixes#27475Closesscylladb/scylladb#27476
* github.com:scylladb/scylladb:
replica: Remove unnecessary noexcept
replica: Remove noexcept from compaction_groups() functions
replica: Remove noexcept from storage_group::for_each_compaction_group
There is no 'regular' incremental mode anymore.
The example seems have meant 'disabled'.
Fixes#27587
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Extend the Fixes validation pattern to also accept JIRA issue references
(format: [A-Z]+-\d+) in addition to GitHub issue references. This allows
backport PRs to reference JIRA issues in the format 'Fixes: PROJECT-123'.
Fixes: https://github.com/scylladb/scylladb/issues/27571Closesscylladb/scylladb#27572
This reverts commit faad0167d7. It causes
a regression in
test_two_tablets_concurrent_repair_and_migration_repair_writer_level
in debug mode (with ~5%-10% probability).
Fixes#27510.
Closesscylladb/scylladb#27560
This patch-set consolidates and corrects rjson string conversion handling.
It removes unnecessary string copies, ensures proper length usage and
replaces ad-hoc conversions with consistent helper functions.
Overall, the changes make rjson string handling safer, faster, and more uniform across the codebase.
Backport: no, it's a refactor
Closesscylladb/scylladb#27394
* github.com:scylladb/scylladb:
fix rjson::value to bytes conversion with missing GetStringLength call
alternator: change type from string to string_view in should_add_capacity
fix rjson::value to string_view conversion with missing GetStringLength call
use rjson::to_string_view when rjson::value gets converted using GetStringLength
use rjson::to_sstring and rjson::to_string for various string conversions
utils: use rjson document wrapper in instance_profile_credentials_provider::parse_creds
utils: move rjson::to_string_view func to string related place
utils: add to_sstring and to_string rjson helper
This patch adds a reproducer for a long-known bug, #8070, where
Alternator can store invalid values which are just blindly stored as
JSON, and we will only see the failure when reading the item back -
and either the client will fail to parse it, or sometimes even Alternator's
own code (e.g., FilterExpression) will fail to parse it. The right
behavior is to fail the write - not the read.
The included test checks writing different kinds of invalid values using
PutItem, UpdateItem, and BatchWriteItem. The new tests pass on DynamoDB,
but fail on Alternator so marked as "xfail".
Refs #8070.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a reproducer for issue #27375, where even with
alternator_streams_increased_compatibility set to true, if an attribute
is set to the same value it had but using a different JSON
representation - a Alternator Streams event is unduly produced.
For example, if a map {'dog': 1, 'cat': 2} is changed to
{'cat': 2, 'dog': 1}, this non-change should not be reported.
The new test added in this patch passes on DynamoDB (an event
is not generated) but fails on Alternator (an event is generated),
so the new test is marked with xfail.
Refs #27375.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
rjson::parse() when parsing JSON stored in a chunked_content (a vector
of temporary buffers) failed to initialize its byte counter to 0,
resulting in garbage positions in error messages like:
Parsing JSON failed: Missing a name for object member. at 1452254
These error messages were most noticable in Alternator, which parses
JSON requests using a chunked_content, and reports these errors back
to the user.
The fix is trivial: add the missing initialization of the counter.
The patch also adds a regression test for this bug - it sends a JSON
corrupt at position 1, and expect to see "at 1" and not some large
random number.
Fixes#27372
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
get_signed_request() started in test_manual_requests.py as a way to sign
a manually-created DynamoDB-API request - for sending requests that boto3
can't.
Over time, we started to use this function in additional test files, and
it's about time to move it to util.py - which is more natural to import
from multiple files.
This patch also adds a new function, manual_request(), which combines
get_signed_request() and actually sending the request via
requests.post(). New tests should prefer it, because it's easier to use.
We'll use the new function in tests that we add in the next patches.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
`test_insert_failure_doesnt_report_success` test in `test/cluster/dtest/audit_test.py`
has an insert statement that is expected to fail. Dtest environment uses
`FlakyRetryPolicy`, which has `max_retries = 5`. 1 initial fail and 5 retry fails
means we expect 6 error audit logs.
The test failed because `create keyspace ks` failed once, then succeeded on retry.
It allowed the test to proceed properly, but the last part of the test that expects
exactly 6 failed queries actually had 7.
The goal of this patch is to make sure there are exactly 6 = 1 + `max_retries` failed
queries, counting only the query expected to fail. If other queries fail with
successful retry, it's fine. If other queries fail without successful retry, the test
will fail, as it should in such situations. They are not related to this expected
failed insert statement.
Fixes#27322Closesscylladb/scylladb#27378
When waiting for the condition variable times out
we call on_internal_error, but unfortunately, the backtrace
it generates is obfuscated by
`coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume`.
To make the log more useful, print the error injection name
and the caller's source_location in the timeout error message.
Fixes#27531
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#27532
This contained only one routine; `corrupt_file`, which is
highly problematic, and not used. If you want to "corrupt" a
file, it should be done controlled, not at random.
Fixes#26744
If a segment to replay is broken such that the main header is not zero,
but still broken, we throw header_checksum_error. This was not handled in
replayer, which grouped this into the "user error/fundamental problem"
category. However, assuming we allow for "real" disk corruption, this should
really be treated same as data corruption, i.e. reported data loss, not
failure to start up.
The `test_one_big_mutation_corrupted_on_startup` test accidentally sometimes
provoked this issue, by doing random file wrecking, which on rare occasions
provoked this, and thus failed test due to scylla not starting up, instead
of loosing data as expected.
Changed test to consistently cause this exact error instead.
The test documents the current behavior of hashing algorithms that
fail if the passphrase has 512 bytes or more.
Moreover, it documents the behavior of the current bcrypt
implementation that compares only the first 72 bytes of the password.
Although we don't typically use bcrypt for password hashing, it is
possible to insert such a hash using
`CREATE ROLE ... WITH HASHED PASSWORD ...`.
Refs: scylladb/scylladb#26842
This change allows yielding during hashing computations to prevent
stalls.
The performance of this solution was compared with the previous
implementation that used one alien thread and the implementation
after the alien thread was reverted. The results (median) of
`perf-cql-raw` with `--connection-per-request 1 --smp 10` parameters
are as follows:
- Alien thread: 41.5 new connections/s per shard
- Reverted alien thread: 244.1 new connections/s per shard
- This commit (yielding in hashing): 198.4 new connections/s per shard
The alien thread is limited by a single-core hashing throughput,
which is roughly 400-500 hashes/s in the test environment. Therefore,
with smp=10, the throughput is below 50 hashes/s, and the difference
between the alien thread and other solutions further increases with
higer smp.
The roughly 20% performance deterioration compared to
the old implementation without the alien thread comes from the fact
that the new hashing algorithm implemented in `utils/crypt_sha512.cc`
performs an expensive self-verification and stack cleanup.
On the other hand, with smp=10 the current implementation achieves
roughly 5x higher throughput than the alien thread. In addition,
due to yielding added in this commit, the algorithm is expected
to provide similar protection from stalls as the alien thread did.
In a test that in parallel started a cassandra-stress workload and
created thousands of new connections using python-driver, the values of
`scylla_reactor_stalls_count` metric were as follows:
- Alien thread: 109 stalls/shard total
- Reverted alien thread: 13186 stalls/shard total
- This commit (yielding in hashing): 149 stalls/shard total
Similarly, the `scylla_scheduler_time_spent_on_task_quota_violations_ms`
values were:
- Alien thread: 1087 ms/shard total
- Reverted alien thread: 72839 ms/shard total
- This commit (yielding in hashing): 1623 ms/shard total
To summarize, yielding during hashing computations achieves similar
throughput to the old solution without the alien thread but also
prevents stalls similarly to the alien thread.
Fixes: scylladb/scylladb#26859
Refs: scylladb/scylla-enterprise#5711
Introduce a new `passwords::hash_with_salt_async` and change the return
type of `passwords::check` to `future<bool>`. This enables yielding
during password computations later in this patch series.
The old method, `hash_with_salt`, is marked as deprecated because
new code should use the new `hash_with_salt_async` function.
We are not removing `hash_with_salt` now to reduce the regression risk
of changing the hashing implementation—at least the methods that change
persistent hashes (CREATE, ALTER) will continue to use the old hashing
method. However, in the future, `hash_with_salt` should be entirely
removed.
Refs: scylladb/scylladb#26859
Refactoring: create a new function `verify_hashing_output` to reuse
code in `hash_with_salt` and `verify_scheme`. The change is introduced
to facilitate verification of hashing output when the implementation
is extended later in this patch series.
Refs: scylladb/scylladb#26859
This commit prepares `auth_passwords_test` for using coroutines,
because later in this patch series `auth::passwords::check` and other
similar functions will return Seastar futures.
Refs: scylladb/scylladb#26859
Change `sha512crypt` and `__crypt_sha512` to coroutines to allow
yielding during hash computations later in this patch series.
Refs: scylladb/scylladb#26859
This patch imports the `crypt_sha512.c` file from the musl library.
We need it to incorporate yielding in the `crypt_r` function to avoid
reactor stalls during long hashing computations.
Before this patch series, ScyllaDB used SHA-512 hashing provided
by the `crypt_r` function, which in our case meant using the implementation
from the `libxcrypt` library. Adding yielding to this `libxcrypt`
implementation is problematic, both due to licensing (LGPL) and because the
implementation is split into many functions across multiple files. In
contrast, the SHA-512 implementation from `musl libc` has a more
permissive license and is concise, which makes it easier to incorporate
into the ScyllaDB codebase.
Both `crypt_sha512.c` and musl license are obtained from
git.musl-libc.org:
- https://git.musl-libc.org/cgit/musl/tree/src/crypt/crypt_sha512.c
- https://git.musl-libc.org/cgit/musl/tree/COPYRIGHT
Import commit:
commit 1b76ff0767d01df72f692806ee5adee13c67ef88
Author: Alex Rønne Petersen <alex@alexrp.com>
Date: Sun Oct 12 05:35:19 2025 +0200
s390x: shuffle register usage in __tls_get_offset to avoid r0 as address
Refs: scylladb/scylladb#26859
The alien thread was a solution for reactor stalls caused by indivisible
password‑hashing tasks (scylladb/scylladb#24524). However, because
there is only one alien thread, overall hashing throughput was reduced
(see, e.g., scylladb/scylla-enterprise#5711). To address this,
the alien‑thread solution is reverted, and a hashing implementation
with yielding will be introduced later in this patch series.
This reverts commit 9574513ec1.
For some arcane reason, we split optional the test pattern given to
test.py twice across '::' to get the file + case specifiers later given
to pytest etc. This means that for a test with a class group (such as some
migrated dtests), we cannot really specify the exact test to run
(pattern <file>::<class>::test).
Simply splitting only on first '::' fixes this. Should not affect any
other tests.
Can potentially lead to unnecessary abort.
compaction_groups() and for_each_compaction_group() can throw.
Co-authored-by: bhalevy <20910904+bhalevy@users.noreply.github.com>
…eshold
The initial problem:
Some of the tests in test_protocol_exceptions.py started failing. The failure is on the condition that no more than `cpp_exception_threshold` happened.
Test logic:
These tests assert that specific code paths do not throw an exception anymore. Initial implementation ran a code path once, and asserted there were 0 exceptions. Sometimes an exception or several can occur, not directly related to the code paths the tests check, but those would fail the tests.
The solution was to run the tests multiple times. If there is a regression, there would be at least as many exceptions thrown as there are test runs. If there is no regression, a few exceptions might happen, up to 10 per 100 test runs. I have arbitrarily chosen `run_count = 100` and `cpp_exception_threshold = 10` values.
Note that the exceptions are counted per shard, not per code path.
The new problem:
The occassional exceptions thrown by some parts of the server now throw a bit more than before. Based on the logs linked on the issues, it is usually 12.
There are possibly multiple ways to resolve the issue. I have considered logging exceptions and parsing them. I would have to filter exception logs only for wanted exceptions. However, if a new, different exception is introduced, it might not be counted.
Another approach is to just increase the threshold a bit. The issue of throwing more exceptions than before in some other server modules should be addressed by a set of tests for that module, just like these tests check protocol exceptions, not caring who used protocol check code paths.
For those reasons, the solution implemented here is to increase `cpp_exception_threshold` to `20`. It will not make the tests unreliable, because, as mentioned, if there is a regression, there would be at least `run_count` exceptions per `run_count` test runs (1 exception per single test run).
Still, to make "background exceptions" occurence a bit more normalized, `run_count` too is doubled, from `100` to `200`. At the first glance this looks like nothing is changed, but actually doubling both run count and exception threshold here implies that the burst does not scale as much as run count, it is just that the "jitter" is bigger than the old threshold.
Also, this patch series enables debug logging for `exception` logger. This will allow us to inspect which exceptions happened if a protocol exceptions test fails again.
Fixes#27247Fixes#27325
Issue observed on master and branch-2025.4. The tests, in the same form, exist on master, branch-2025.4, branch-2025.3, branch-2025.2, and branch-2025.1. Code change is simple, and no issue is expected with backport automation. Thus, backports for all the aforementioned versions is requested.
Closesscylladb/scylladb#27412
* github.com:scylladb/scylladb:
test: cqlpy: test_protocol_exceptions.py: enable debug exception logging
test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions threshold
In some cases we unnecessarily convert to string which
causes a copy. In other we convert without calling
GetStringLength which causes iteration to dermine length
which is already known. In some cases we do even both.
This commit fixes that.
So that conversion code is common and it's easier
to avoid accidental type conversions. Additionally
according to rapid json library size must be checked
explicitly, this also avoids extra iteration in char*
to (s)string conversion.
Rebase to Fedora 43 with clang 21.1 and libstdc++ 15.
Fedora container image registry moved to registry.fedoraproject.org as
it seems to be updated more regularly.
Added python3-devel to the dependencies as some packages scylla-cqlsh
depends on aren't yet available in the form of wheels for Python 3.14,
and so have to be built locally. In any case it's better to reduce
dependency on those wheels even if the ones currently missing appear
eventually.
Added libev-devel to the dependencies so that the python driver
builds correctly even if "wheels" are not published. This reduces
our dependency on the python driver's binary release schedule.
Without libev-devel, TLS does not work correctly.
We no long remove the clang and clang-libs packages. Doxygen
started depending on clang-libs, and removing them removes
doxygen, breaking the build when it looks for that. The build
will still pick up the optimized clang, since /usr/local/bin
is earlier in the path. We keep the clang package, since it allows
us to mess a little less with the directory structure.
Optimized clang binaries generates and stored in
https://devpkg.scylladb.com/clang/clang-21.1.6-Fedora-43-aarch64.tar.gzhttps://devpkg.scylladb.com/clang/clang-21.1.6-Fedora-43-x86_64.tar.gz
With ./scripts/refresh-pgo-profiles.sh, the new compiler shows a small
performance improvement (instructions_per_op) in perf-simple-query:
clang 21:
259353.60 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35720 insns/op, 17427 cycles/op, 0 errors)
265940.08 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35725 insns/op, 17042 cycles/op, 0 errors)
262650.01 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35720 insns/op, 17240 cycles/op, 0 errors)
262881.22 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35675 insns/op, 17222 cycles/op, 0 errors)
264898.68 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35732 insns/op, 17070 cycles/op, 0 errors)
throughput:
mean= 263144.72 standard-deviation=2528.69
median= 262881.22 median-absolute-deviation=1753.96
maximum=265940.08 minimum=259353.60
instructions_per_op:
mean= 35714.47 standard-deviation=22.34
median= 35720.38 median-absolute-deviation=10.20
maximum=35732.14 minimum=35675.50
cpu_cycles_per_op:
mean= 17200.12 standard-deviation=154.62
median= 17221.70 median-absolute-deviation=129.77
maximum=17427.33 minimum=17041.57
clang 20:
254431.39 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35883 insns/op, 17708 cycles/op, 0 errors)
259701.02 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35883 insns/op, 17351 cycles/op, 0 errors)
261166.92 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35912 insns/op, 17270 cycles/op, 0 errors)
260656.31 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35869 insns/op, 17289 cycles/op, 0 errors)
259628.13 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35946 insns/op, 17370 cycles/op, 0 errors)
throughput:
mean= 259116.75 standard-deviation=2698.56
median= 259701.02 median-absolute-deviation=1539.55
maximum=261166.92 minimum=254431.39
instructions_per_op:
mean= 35898.42 standard-deviation=30.69
median= 35882.97 median-absolute-deviation=15.90
maximum=35945.63 minimum=35869.02
cpu_cycles_per_op:
mean= 17397.49 standard-deviation=178.35
median= 17351.35 median-absolute-deviation=108.79
maximum=17707.63 minimum=17269.68
Closesscylladb/scylladb#26773
Currently we have 3 explicit checks, and some of them are configurable:
- Jenkins job being stable. Can be disabled with --force
- Whether submodule update is happenning. It's not allowed by default, and
should be enabled with --allow-submodule option
- Target branch checking (recently merged #27249). Happens unconditionally
This PR unifies all checks in two ways.
First, each restriction can be lifted with --allow-foo options. The existing
--allow-submodule stays and two options are added:
- --allow-unstable to skip jenkins job check (like --force works now)
- --allow-any-branch to skip target branch check
Second, the --force option lifts all the known restrictions.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#27294
With the introduction of rack-lists and the reliance of materialized views on them, the `get_view_natural_endpoint` function can be greatly simplified. When using tablets, instead of doing any index-matching, we can now pair base tables with views only in the same rack.
In this series we remove no longer needed code and reorganize the needed code for better clarity.
After the changes, the `get_view_natural_endpoint` function goes down from 245 lines to 85 lines, while the whole pairing-related text goes down from 346 lines to 239 lines.
Fixes https://github.com/scylladb/scylladb/issues/26313Closesscylladb/scylladb#27383
* github.com:scylladb/scylladb:
mv: replace the simple/complex rack-aware pairing with exact rack matching
mv: split out vnode pairing code from get_view_natural_endpoint
mv: unify self-pairing and rack-aware pairing into one bool
mv: remove the workaround for left nodes when sending view updates
Scylla implements `LWT` in the` storage_proxy::cas` method. This method expects to be called on a specific shard, represented by the `cas_shard` parameter. Clients must create this object before calling `storage_proxy::cas`, check its `this_shard()` method, and jump to `cas_shard.shard()` if it returns false.
The nuance is that by the time the request reaches the destination shard, the tablet may have already advanced in its migration state machine. For example, a client may acquire a `cas_shard` at the `streaming` tablet state, then submit a request to another shard via `smp::submit_to(cas_shard.shard())`. However, the new `cas_shard` created on that other shard might already be in the `write_both_read_new` state, and its `cas_shard.shard()` would not be equal to `this_shard_id()`. Such broken invariant results in an `on_internal_error` in `storage_proxy::cas`.
Clients of `storage_proxy::cas` are expected to check` cas_shard.this_shard()` and recursively jump to another shard if it returns false. Most calls to `storage_proxy::cas` already implement this logic. The only exception is `executor::do_batch_write`, which currently checks `cas_shard.this_shard()` only once. This can break the invariant if the tablet state changes more than once during the operation.
This PR fixes the issue by implementing recursive `cas_shard.this_shard()` checks in `executor::do_batch_write`. It also adds a test that reproduces the problem.
Fixes: scylladb/scylladb#27353
backport: need to be backported to 2025.4
Closesscylladb/scylladb#27396
* github.com:scylladb/scylladb:
alternator/executor.cc: eliminate redundant dk copy
alternator/executor.cc: release cas_shard on the original shard
alternator/executor.cc: move shard check into cas_write
alternator/executor.cc: make cas_write a private method
alternator/executor.cc: make do_batch_write a private method
alternator/executor.cc: fix indent
test_alternator: add test_alternator_invalid_shard_for_lwt
Before this series, we kept the cas_shard on the original shard to
guard against tablet movements running in parallel with
storage_proxy::cas.
The bug addressed by this PR shows that this approach is flawed:
keeping the cas_shard on the original shard does not guarantee that
a new cas_shard acquired on the target shard won’t require another
jump.
We fixed this in the previous commit by checking cas_shard.this_shard()
on the target shard and continuing to jump to another shard if
necessary. Once cas_shard.this_shard() on the target shard returns
true, the storage_proxy::cas invariants are satisfied, and no other
cas_shard instances need to remain alive except the one passed
into storage_proxy::cas.
This change ensures that if cas_shard points to a different shard,
the executor will continue issuing shard jumps until
cas_shard.this_shard() returns true. The commit simply moves the
this_shard() check from the parallel_for_each lambda into cas_write,
with minimal functional changes.
We enable test_alternator_invalid_shard_for_lwt since now it should
pass.
Fixesscylladb/scylladb#27353
Refactor the way we decide the sstable belong to a tablet, fully or partially to simplify the flow and make it more readable. Also extract the logic and make it testable, add tests to cover changes
The change is purely aesthetic, no need to backport
Closesscylladb/scylladb#27101
* github.com:scylladb/scylladb:
streaming: remove unnecessary lambda creating sstable token range
streaming: simplify get_sstables_for_tablets logic
streaming: switch to range-based for loop
streaming: drop sstable skip microoptimization in tablet loop
streaming: replace reverse iterators with reverse view in sstables scan
streaming: return from get_sstables_for_tablets earlier
streaming: add get_sstables_by_tablet_range tests
test,sstables: add helper to set sstable first and last keys
streaming: refactor get_sstables_for_tablets to make it accessible
streaming: refactor get_sstables_for_tablets to make it testable
streaming: refactor tablet_sstable_streamer::stream by extracting SST filtering logic
Enable debug logging for "exception" logger inside protocol exception tests.
The exceptions will be logged, and it will be possible to see which ones
occured if a protocol exceptions test fails.
Refs #27272
Refs #27325
The initial problem:
Some of the tests in test_protocol_exceptions.py started failing. The failure is
on the condition that no more than `cpp_exception_threshold` happened.
Test logic:
These tests assert that specific code paths do not throw an exception anymore.
Initial implementation ran a code path once, and asserted there were 0 exceptions.
Sometimes an exception or several can occur, not directly related to the code paths
the tests check, but those would fail the tests.
The solution was to run the tests multiple times. If there is a regression, there
would be at least as many exceptions thrown as there are test runs. If there is no
regression, a few exceptions might happen, up to 10 per 100 test runs.
I have arbitrarily chosen `run_count = 100` and `cpp_exception_threshold = 10` values.
Note that the exceptions are counted per shard, not per code path.
The new problem:
The occassional exceptions thrown by some parts of the server now throw a bit more
than before. Based on the logs linked on the issues, it is usually 12.
There are possibly multiple ways to resolve the issue. I have considered logging
exceptions and parsing them. I would have to filter exception logs only for wanted
exceptions. However, if a new, different exception is introduced, it might not be
counted.
Another approach is to just increase the threshold a bit. The issue of throwing
more exceptions than before in some other server modules should be addressed by
a set of tests for that module, just like these tests check protocol exceptions,
not caring who used protocol check code paths.
For those reasons, the solution implemented here is to increase `cpp_exception_threshold`
to `20`. It will not make the tests unreliable, because, as mentioned, if there is a
regression, there would be at least `run_count` exceptions per `run_count` test runs
(1 exception per single test run).
Still, to make "background exceptions" occurence a bit more normalized, `run_count` too
is doubled, from `100` to `200`. At the first glance this looks like nothing is changed,
but actually doubling both run count and exception threshold here implies that the
exception burst does not scale as much as run count, it is just that the "jitter" is
bigger than the old threshold.
Fixes#27247Fixes#27325
Default number of retires in `eventually()` in `test_builder_with_concurrent_drop`
sometimes is not enough to observe changes in system tables on aarch64
builds.
This patch increases the number of retries to 30.
Fixesscylladb/scylladb#27370Closesscylladb/scylladb#27493
If the process is running returncode will be Node, otherwise it will
have some value (which can be 0 s well) and the current code treats 0
as if the process is still running.
Closesscylladb/scylladb#27490
Tablet migration transfers sstable files without changing origin
host-id. As it should, becuase those sstables were not written on the
destination host, and should be ignored by commit log replay.
So it's a normal situation, and it's confusing to see this warning in
logs.
Fixes#26957Closesscylladb/scylladb#27433
The distributed_loader::get_sstables_from_object_store() method accepts an endpoint parameter and internally wants to get storage type for that endpoint (s3 or gcs). This is needed to construct storage_options object to create an sstable object.
To get the type, the method scans db::config option, but there's much simpler way to get one.
Code cleanup, no need to backport
Closesscylladb/scylladb#27381
* github.com:scylladb/scylladb:
sstables_loader: Provide endpoint type for get_sstables_from_object_store()
storage_manager: Introduce get_endpoint_type() method
storage_manager: Split get_endpoint_client()
We rewrite the test to avoid flakiness. Instead of looking at the
metrics, we make a trade-off and start depending on a less reliable
mechanism -- logs. We grep all relevant messages printed by Scylla
in TRACE mode and make sure that they were all printed from a context
using the streaming scheduling group.
Although it's a "less proper" way of testing, it should be much more
dependable and avoid flakiness.
Fixesscylladb/scylladb#25957Closesscylladb/scylladb#26656
The test test_truncate_during_topology_change tests TRUNCATE TABLE while
bootstrapping a new node. With tablets enabled TRUNCATE is a global
topology operation which needs to serialize with boostrap.
When TRUNCATE TABLE is issued, it first checks if there is an already
queued truncate for the same table. This can happen if a previous
TRUNCATE operation has timed out, and the client retried. The newly
issued truncate will only join the queued one if it is waiting to be
processed, and will fail immediatelly if the TRUNCATE is already being
processed.
In this test, TRUNCATE will be retried after a timeout (1 minute) due to
the default retry policy, and will be retried up to 3 times, while the
bootstrap is delayed by 2 minutes. This means that the test can validate
the result of a truncate which was started after bootstrap was
completed.
Because of the way truncate joins existing truncate operations, we can
also have the following scenario:
- TRUNCATE times out after one minute because the new node is being
bootstrapped
- the client retries the TRUNCATE command which also times out after 1m
- the third attempt is received during TRUNCATE being processed which
fails the test
This patch changes the retry policy of the TRUNCATE operation to
FallthroughRetryPolicy which guarantees that TRUNCATE will not be
retried on timeout. It also increases the timeout of the TRUNCATE from 1
to 4 minutes. This way the test will actually validate the performance
of the TRUNCATE operation which was issued during bootstrap, instead of
the subsequent, retried TRUNCATEs which could have been issued after the
bootstrap was complete.
Fixes: #26347Closesscylladb/scylladb#27245
This patch adds tablet repair progress report support so that the user
could use the /task_manager/task_status API to query the progress.
In order to support this, a new system table is introduced to record the
user request related info, i.e, start of the request and end of the
request.
The progress is accurate when tablet split or merge happens in the
middle of the request, since the tokens of the tablet are recorded when
the request is started and when repair of each tablet is finished. The
original tablet repair is considered as finished when the finished
ranges cover the original tablet token ranges.
After this patch, the /task_manager/task_status API will report correct
progress_total and progress_completed.
Fixes#22564Fixes#26896Closesscylladb/scylladb#26924
There is a bug in current pytest's boost implementation. When timeout
reached process will be killed, but it was not correctly propagated,
that lead to a false positive result. This will fail test case when
timeout for the process is reached.
This is to prevent issues like this https://github.com/scylladb/scylladb/issues/27237Closesscylladb/scylladb#27463
The `sstable_token_range` lambda was only used once to create a token
range for an SSTable. Inline the construction directly where needed,
removing the extra lambda. This simplifies the code without changing
behavior.
Remove the use of the `overlaps` helper and unnest nested conditionals
in get_sstables_for_tablets. Straightforward `before` and `after` checks
are sufficient to decide how each SSTable should be handled.
Replace the explicit iterator loop with a range-based for loop. This
simplifies the code, enforces constness, and avoids the unnecessary
use of postfix increment. The behavior remains unchanged,but
readability and maintainability are improved.
Remove the microoptimization that advanced over SSTables ending before a
tablet range. This approach is misleading since SSTables are not sorted
by their end token, and the extra logic adds complexity with little to
no benefit. The streaming path here is not performance‑critical, so the
simpler loop is preferable.
Use a reverse view over the SSTables vector instead of reverse iterators.
This avoids awkward rbegin/rend usage and the mental overhead of tracking
inverted sort order. With a view, we can use standard begin/end iteration
while preserving the intended scan direction.
Add a comprehensive test suite that exercises various combinations of
SSTable containment within tablet ranges. These cases cover boundary
conditions, partial overlaps, and full containment to validate all
recent changes made to `get_sstables_by_tablet_range`.
Introduce a utility helper to set the first and last decorated keys on
an SSTable. This is intended for testing purposes, making it easier to
construct SSTables with defined boundaries in unit tests.
Create `get_sstables_for_tablets_for_tests` friend free function
for testing purposes. Adding this free function allows
direct testing without requiring the full streamer context.
Make the `get_sstables_for_tablets` member function `static`. This
is a step toward improved testability, allowing the function to be
invoked directly without requiring a full instance of the streamer.
This change adds a new option to the REST api and correspondingly, to scylla nodetool: use_sstable_identifier.
When set, we use the sstable identifier, if available, to name each sstable in the snapshots directory
and the manifest.json file, rather than using the sstable generation.
This can be used by the user (e.g. Scylla Manager) for global deduplication with tablets, where an sstable
may be migrated across shards or across nodes, and in this case, its generation may change, but its
sstable identifier remains sstable.
Currently, Scylla manager uses the sstable generation to detect sstables that are already backed up to
object storage and exist in previous backed up snapshots.
Historically, the sstable generation was guaranteed to be unique only per table per node,
so the dedup code currently checks for deduplication in the node scope.
However, with tablet migration, sstables are renamed when migrated to a different shard,
i.e. their generation changes, and they may be renamed when migrated to another node,
but even if they are not, the dedup logic still assumes uniqueness only within a node.
To address both cases, we keep the sstable_id stable throughout the sstable life cycle (since 3a12ad96c7).
Given the globally unique sstable identifier, scylla manager can now detect duplicate sstables
in a wider scope. This can be cluster-wide, but we practically need only rack-wide deduplication
or dc-wide, as tablets are migrated across racks only in rare occasions (like when converting from a
numerical replication factor to a rack list containing a subset of the available racks in a datacenter).
Fixes#27181
* New feature, no backport required
Closesscylladb/scylladb#27184
* github.com:scylladb/scylladb:
database: truncate_table_on_all_shards: set use_sstable_identifier to true
nodetool: snapshot: add --use-sstable-identifier option
api: storage_service: take_snapshot: add use_sstable_identifier option
test: database_test: add snapshot_use_sstable_identifier_works
test: database_test: snapshot_works: add validate_manifest
sstable: write_scylla_metadata: add random_sstable_identifier error injection
table: snapshot_on_all_shards: take snapshot_options
sstable: add get_format getter
sstable: snapshot: add use_sstable_identifier option
db: snapshot_ctl: snapshot_options: add use_sstable_identifier options
db: snapshot_ctl: move skip_flush to struct snapshot_options
We will need to access executor::_stats field from cas_write. We could
pass it as a paramter, but it seems simpler to just make cas_write
and instance method too.
This test reproduces scylladb/scylladb#27353 using two injection
points. First, the test triggers an intra-node tablet migration and
suspends it at the streaming stage using the
intranode_migration_streaming_wait injection. Next, it enables the
alternator_executor_batch_write_wait injection, which suspends a
batch write after its cas_shard has already been created.
The test then issues several batch writes and waits until one of them
hits this injection on the destination shard. At this point, the
cas_shard.erm for that write is still in the streaming state,
meaning the executor would need to jump back to the source shard.
The test then resumes the suspended tablet migration, allowing it to
update the ERM on the source shard to write_both_read_new. After that,
the test releases the suspended batch write and expects it to perform
two shard jumps: first from the destination to the source shard, and
then again back to the source shard.
This commit adds the alternator_executor_batch_write_wait injection to
alternator/executor.cc. Coroutines are intentionally avoided in the
parallel_for_each lambda to prevent unnecessary coroutine-frame
allocations.
This reverts commit ff1b212319. In this
commit, the python driver was updated to 3.29.6. That version has a
serious flaw - it rejects compression=None settings [1] which
cqlsh (legitimately) uses in copyutil.py.
The reason this hasn't caused numerous continuous integration failures
is that the submodule update commit did not update the frozen toolchain,
so the build was effectively running with an older version of the driver.
Fix by reverting the change. This allows us to regenerate the frozen
toolchain when we need to.
Reverted changes:
* tools/cqlsh 2240122...6badc99 (2):
> Update scylla-driver version to 3.29.6
> Revert "Migrate workflows to Blacksmith"
[1] 78f554236fClosesscylladb/scylladb#27473
This is an optimization follow-up [for this PR](https://github.com/scylladb/scylladb/pull/27396#issuecomment-3611410774): avoiding destruction of foreign objects on the wrong shard. Releasing objects allocated on a different shard causes their ::free calls to be executed remotely, which adds unnecessary load to the SMP subsystem.
Before this PR, a `std::vector<put_or_delete_item>` could be moved to another shard. When the vector was eventually destroyed, its ::free had to be marshalled back to the shard where the memory had originally been allocated. This change avoids that overhead by passing the vector by const reference instead.
backport: not needed, this is an optimization
Closesscylladb/scylladb#27432
* github.com:scylladb/scylladb:
alternator/executor.cc: avoid cross-shard free
storage_proxy: cas: take cas_request by raw reference
With python 3.14, the Process fails due to pickling issue with nodes objects.
This will eliminate this issue, so we can bump up the python version.
Closesscylladb/scylladb#27456
This commit is an optimization: avoiding destruction of
foreign objects on the wrong shard. Releasing objects allocated on a
different shard causes their ::free calls to be executed remotely,
which adds unnecessary load to the SMP subsystem.
Before this patch, a std::vector could be moved
to another shard. When the vector was eventually destroyed,
its ::free had to be marshalled back to the shard where the memory had
originally been allocated. This change avoids that overhead by passing
the vector by const reference instead.
The referenced objects lifetime correctness reasoning:
* the put_or_delete_item refs usages in put_or_delete_item_cas_request
are bound to its lifetime
* cas_request lifetime is bound to storage_proxy::cas future
* we don't release put_or_delete_item-s untill all storage_proxy::cas
calls are done.
In the next commit we want to add an optimization that relies on
precise control over the lifetime of cas_request. In particular, we
want the implementation of this interface in Alternator to operate on
raw references that are guaranteed to remain valid only until the
cas() future is resolved. We already depend on the same lifetime
assumptions in cas_request when used by modification_statement.
However, these assumptions are not clearly expressed in the current
interface: cas_request is taken by shared_ptr, and nothing prevents
cas() from storing that pointer inside paxos_response_handler, which
may outlive the cas() future.
This commit fixes that by taking cas_request by raw reference. This
makes it explicit that cas() does not assume ownership of the object.
Callers must ensure that the referenced object remains valid until
the returned future is resolved.
Large reserves in allocating_section can cause stalls. We already log
reserve increase, but we don't know which table it belongs to:
lsa - LSA allocation failure, increasing reserve in section 0x600009f94590 to 128 segments;
Allocating sections used for updating row cache on memtable flush are
notoriously problematic. Each table has its own row_cache, so its own
allocating_section(s). If we attached table name to those sections, we
could identify which table is causing problems. In some issues we
suspected system.raft, but we can't be sure.
This patch allows naming allocating_sections for the purpose of
identifying them in such log messages. I use abstract_formatter for
this purpose to avoid the cost of formatting strings on the hot path
(e.g. index_reader). And also to avoid duplicating strings which are
already stored elsewhere.
Fixes#25799Closesscylladb/scylladb#27470
Range tombstones are represented as entry attributes, which applies to
the interval between entries. So if a range tombstone covers many
rows, to apply it we have to update all covered entries. In some
workloads that could be many entries, even the whole cache. Before
the patch, we did this update without preemption, which can cause
reactor stalls in such workloads.
This scenario is already covered by mvcc_tests,
e.g. test_apply_to_incomplete_respects_continuity. And I verified that
the new preemption point is hit in the test.
perf-row-cache-update results show no significant stalls anymore (max
2ms scheduling delay, instead of previous 1.5 s):
Generated 1124195 rows
Memtable fill took 4179.457520 [ms], {count: 8295, 99%: 0.654949 [ms], max: 32.817176 [ms]}
Draining...
took 0.000616 [ms]
cache: 2506/2948 [MB], memtable: 781/1024 [MB], alloc/comp: 1051/662 [MB] (amp: 0.630)
update: 2874.157471 [ms], preemption: {count: 26650, 99%: 1.131752 [ms], max: 2.068762 [ms]}, cache: 3027/3973 [MB], alloc/comp: 3951/2424 [MB] (amp: 0.614), pr/me/dr 1124195/0/0
Fixes#23479Fixes#2578Closesscylladb/scylladb#27469
* github.com:scylladb/scylladb:
cache, mvcc: Preempt cache update when applying range tombstone from memtable
partition_snapshot_row_cursor: Clarify non-obvious semantic difference of range_tombstone()
perf-row-cache-update: Add scenario with large tombstone covering many rows
We saw that in large clusters direct failure detector may cause large task queues to be accumulated. The series address this issue and also moves the code into the correct scheduling group.
Fixes https://github.com/scylladb/scylladb/issues/27142
Backport to all version where 60f1053087 was backported to since it should improve performance in large clusters.
Closesscylladb/scylladb#27387
* github.com:scylladb/scylladb:
direct_failure_detector: run direct failure detector in the gossiper scheduling group
raft: drop invoke_on from the pinger verb handler
direct_failure_detector: pass timeout to direct_fd_ping verb
We switched to using v3 schema tables (in system_schema keyspace) in
2017, in 9eb91bc30b.
So no system should have the old schema any more.
No need to run legacy_schema_migrator on boot.
Closesscylladb/scylladb#27420
Range tombstones are represented as entry attributes, which applies to
the interval between entries. So if a range tombstone covers many
rows, to apply it we have to update all covered entries. In some
workloads that could be many entries, even the whole cache. Before
the patch, we did this update without preemption, which can cause
reactor stalls in such workloads.
This scenario is already covered by mvcc_tests,
e.g. test_apply_to_incomplete_respects_continuity. And I verified that
the new preemption point is hit in the test.
perf-row-cache-update results show no significant stalls anymore (max
2ms scheduling delay, instead of previous 1.5 s):
Generated 1124195 rows
Memtable fill took 4179.457520 [ms], {count: 8295, 99%: 0.654949 [ms], max: 32.817176 [ms]}
Draining...
took 0.000616 [ms]
cache: 2506/2948 [MB], memtable: 781/1024 [MB], alloc/comp: 1051/662 [MB] (amp: 0.630)
update: 2874.157471 [ms], preemption: {count: 26650, 99%: 1.131752 [ms], max: 2.068762 [ms]}, cache: 3027/3973 [MB], alloc/comp: 3951/2424 [MB] (amp: 0.614), pr/me/dr 1124195/0/0
Fixes#23479Fixes#2578
Fills memtable with rows and a tombstone which deletes all rows which
are already in cache.
Similar to raft log workload, but more extreme.
With -c1 -m4G, observed really bad performance:
update: 1711.976196 [ms], preemption: {count: 22603, 99%: 0.943127 [ms], max: 1494.571776 [ms]}, cache: 2148/2906 [MB], alloc/comp: 1334/869 [MB] (amp: 0.651), pr/me/dr 1062186/0/1062187
cache: 2148/2906 [MB], memtable: 738/1024 [MB], alloc/comp: 993/0 [MB] (amp: 0.000)
Which means that max reactor stall during cache update was 1.5 [s]
0.7 GB memtables. 2.1 GB in cache.
The code creates a local variable, so it's better to wrap it in a local
scope, to the conditionally compiled variable doesn't pollute the
external scope.
After scylladb/scylladb#26897 was merged, the worker doesn't use the
view building state machine CV to manage lifetime of batches, so the
broadcast is not needed.
In case of general exception in `view_building_worker::create_staging_sstable_tasks()`,
catch it, print it with error level and sleep 1s before retrying.
This will allow for the registrator to retry its work in case of failure
and it should be easier to detect any bugs in the method.
Test that taking a snapshot with the use_sstable_identifier
option (and injecting `random_sstable_identifier`) produces
different file names in the snapshot than the original
sstable names and validate te manifest.json file respectively.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Validate the manifest.json format by loading it using rjson::parse
and then validate its contents to ensure it lists exactly the
SSTables present in the snapshot directory.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be used by a unit test in the following patch for testing
the snapshot use_sstable_identifier option.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be used by the snapshot code in te following patch
for manufacturing a basename using the sstable_id rather
than its generation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When set to true, use the sstable_identifier as the sstable name
in the snapshot rather than its generation.
sstable::snapshot now returns the generation it used
for the sstable in the snapshot, based on the `use_sstable_identifier`
option, to be used by the upper layer generating the manifest.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When direct failure detector was introduces the idea was that it will
run on the same connection raft group0 verbs are running, but in
60f1053087 raft verbs were moved to run on the gossiper connection
while DIRECT_FD_PING was left where it was. This patch move it to
gossiper connection as well and fix the pinger code to run in gossiper
scheduling group.
Currently raft direct pinger verb jumps to shard 0 to check if group0 is
alive before replying. The verb runs relatively often, so it is not very
efficient. The patch distributes group0 liveness information (as it
changes) to all shard instead, so that the handler itself does not need
to jump to shard 0.
To be used for naming sstables in the snapshot by their
sstable identifiers rather than their generation, to
facilitate global deduplication of sstables in backup.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently direct_fd_ping runs without timeout, but the verb is not
waited forever, the wait is canceled after a timeout, this timeout
simply is not passed to the rpc. It may create a situation where the
rpc callback can runs on a destination but it is no longer waited on.
Change the code to pass timeout to rpc as well and return earlier from
the rpc handler if the timeout is reached by the time the callback is
called. This is backwards compatible since timeout is passed as
optional.
Now that batchlog cleanup is cheap, on account of memtable flush on the
system.batchlog table garbage-collecting tombstones (previous patch), we
can afford to do cleanup on each replay, keeping the memtable size small
and more importantly -- the amount of tombstones in the memtable small.
SELECT * FROM MUTATION_FRAGMENTS() queries have a transformed schema
(mutation-fragment schema), which is a superset of that of the queried
table's. The mutation fragment schema represents position_in_partition
of mutation fragments expressed as clustering columns. This presents
some challenges, as some position_in_partition fields are null for some
positions. This was solved by setting these clustering keys components
to bytes{}. In the process, a mistake was made: when the clustering key
is missing in the position_in_partition, the position_weight is also set
to bytes{}. This is not correct, it is possible for some positions to
have no key but to still have a position_weight. An example is
position_in_partition::before_all_clustered_rows().
Fix this by always filling in the position_weight for positions which
have region() == clustered, instead of the earlier condition on the key
presence.
This is a minor bug affecting range tombstone changes at the two
extremes: position_in_partition::{before,after}_all_clustered_rows(). In
both cases, the position_weight can be deduced by a human looking at the
results, based on the position of the range tombstone change, relative
to other fragments.
Rename to make it more explicit where the error injection happens.
Also change how the error is injected, use the lambda overload instead
of is_enabled(), the former leaves better trace in logs, which helps
when debugging tests.
For tests that contain multiple assert_that() invokations, identifying
the one that failed is very challenging. Add source location to fail
messages to allow convenient identification of the call-site.
To enable assertions on columns which are sometimes null.
One existing user of with_typed_column() needs adjustment, because the
previous version of with_typed_column() covered up silently for null
value, but after this patch this caused a failure.
Although the value of this item is indeed derived from the write timeout
config, the name doesn't reflect what it is used for. Change it to
reflect it better.
New batchlogs are written to the batchlog_v2 table and replay also uses
the v2 table.
The content of system.batchlog is attempted to be migrated to
system.batchlog_v2 after each start of the batchlog_manager service.
The migration is retried on each replay if it fails. This is reduntant
but simple.
Batchlog cleanup now doesn't involve flushing memtables, the only
remaining user of replica/database.hh is gone, so the include is
dropped.
Rearranges the system.batchlog schema as follows:
CREATE TABLE system.batchlog_v2 (
version int,
stage int,
shard int,
written_at timestamp,
id uuid,
data blob,
PRIMARY KEY ((version, stage, shard), written_at, id));
With the following goals:
1) Make post-replay batchlog cleanup possible with a simple
range-tombstone. This allows dropping the individual dead batchlog
entries, as they are shadowed by a higher level tombstone. This
enables dropping tombstones without tombstone GC.
2) To make the above possible, introduce the stage key component:
batchlog entries that fail the first replay attempt, are moved to the
failed_replay stage, so the initial stage can be cleaned up safely.
3) Spread out the data among Scylla shards, via the batchlog shard
column.
4) Make batchlog entries ordered by the batchlog create time (id). This
allows for selecting batchlogs to replay, without post-filtering of
batchlogs that are too young to be replayed.
Don't build batchlog delete mutations in storage-proxy code. Move this
code into db/batchlog_manager.cc, exposed via db/batchlog.hh.
This serves multiple goals:
1) Concentrates low-level batchlog related logic in
db/batchlog_manager.cc
2) Reduce current and future code duplication.
3) Make future changes to this logic easier.
Don't build batchlog mutations in storage-proxy code. Move this code
into db/batchlog_manager.cc, exposed via db/batchlog.hh.
This serves multiple goals:
1) Concentrates low-level batchlog related logic in
db/batchlog_manager.cc
2) Reduce current and future code duplication.
2) Make future changes to this logic easier.
Just skip the mutation(s) whose tables were dropped instead.
Use the newly introduced data_dictionary::table::get_truncation_time()
to avoid looking up real table object.
The map_reduce achieves no concurrency, both map and reduce are
synchronous. It only achieves two redundant lookups for the table and
hard-to-read code. Convert it into a simple loop. Preserve the
stall-protection by adding a maybe_yield() to the loop.
When the initial version of rack-aware pairing was introduced, materialized
views with tablets were still experimental. Since then, we decided that
we'll only allow materialized views in clusters where the base table and
the view are replicated on the same racks, with one replica of each tablet
on each rack.
This allows us to remove almost all logic from our base-view pairing. The
only check for the paired view replica is now whether it's in the same
rack as the base replica sending the update.
In this patch we replace the simple and complex rack-aware pairing with
the simple check above.
Because of this, we have to remove a test case from network_topology_strategy_test
which was testing complex pairing. The tested topology is not supported
for views with tablets (or is unlikely to be supported, as it's a random test),
so there's no use keeping the test.
The test case for simple rack aware pairing was kept, but now we only test
the case where each rack has one replica, not multiple.
Additionally, we split finding of an unpaired replica to a separate function
and partially rewrite it without reusing the helper stuctures that were
present when calculating the simple and complex rack-aware pairing.
We only look for an unpaired replica if we couldn't find a paired replica
ourselves or if the number of view replicas didn't match the base replicas.
If an unpaired replica appears while these conditions pass, we won't send
an extra update, but that would be a new bug altogether, because we only
expect the unpaired replica to appear during RF changes, so when these
conditions aren't fulfilled.
Fixes https://github.com/scylladb/scylladb/issues/26313
Add cleanup flag value to start message and drop cpu, it is redundant as
Scylla already adds the shard number to the logs.
Add all_replayed to finish message.
Currently the method scans db::config to find one. It has some
drawbacks. First, it's not very nice. Second, it needs to handle the
case when the endpoint is missing, while it relally never is. Third, the
type in config entry is not necessarily set.
It's nicer to get the type from storage manager.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
To avoid repeatedly checking whether we're using tablets and having
to use unnecesarily flexible code fitting both cases, we split out
the base-view pairing code for the case of vnodes to another function.
The get_view_natural_endpoint will now have only common steps,
a call to that function, and steps specific to tablets.
We always use "legacy self pairing" when not using tablets, and
the "rack aware pairing" has been enabled in every version where
views with tablets isn't experimental. So in practice, instead
of checking these variables we can just look at whether the
table uses tablets.
At one point, the get_view_natural_endpoint was using IP for the
view update (and hint) destinations, but the hint code was using
host_id for the destinations. When a node left, we could no longer
have a mapping for a IP to host_id and when trying to store a hint
for this IP, we'd crash.
We worked around this issue by dropping the view update completely
if the target is in the "left" state.
Since then, we also moved to host_id's in the view update code, so
there's no longer any translation needed when storing the hints.
Additionally, we now drain hints not when entering the "left" state,
but when the node actually stops owning tokens.
Because of that, the workaround is not needed anymore, so we remove
it in this commit.
The existing test_mv_tablets_empty_ip case verifies that indeed, we
do not crash in the original problematic scenario.
Extract the SST filtering logic into a dedicated member function. This
prepares the code for independent testing without requiring the entire
streamer to be initialized.
2025-11-30 18:27:15 +02:00
302 changed files with 8148 additions and 3282 deletions
co_returnapi_error::validation("GlobalSecondaryIndexes and LocalSecondaryIndexes with tablets require the rf_rack_valid_keyspaces option to be enabled.");
throwapi_error::validation(fmt::format("AttributeDefinitions redefined {} to {} already a key attribute of type {} in this table",def.name_as_text(),type,def_type));
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to incremental mode.",
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to 'disabled' mode.",
"Number of threads with which to deliver hints. In multiple data-center deployments, consider increasing this number because cross data-center handoff is generally slower.")
"Clean up batchlog memtable after every N replays. Replays are issued on a timer, every 60 seconds. So if batchlog_replay_cleanup_after_replays is set to 60, the batchlog memtable is flushed every 60 * 60 seconds.")
staticautoGET_COLUMN_MAPPING_QUERY=format("SELECT column_name, clustering_order, column_name_bytes, kind, position, type FROM system.{} WHERE cf_id = ? AND schema_version = ?",
format("SELECT * FROM system.{} WHERE done = false AND request_type IN ({}) ALLOW FILTERING",TOPOLOGY_REQUESTS,request_types_str));
// Requests which finished after end_time_limit.
autors_done=co_awaitexecute_cql(
format("SELECT * FROM system.{} WHERE end_time > {} AND request_type IN ('{}', '{}', '{}', '{}', '{}') ALLOW FILTERING",TOPOLOGY_REQUESTS,end_time_limit.time_since_epoch().count(),
format("SELECT * FROM system.{} WHERE end_time > {} AND request_type IN ({}) ALLOW FILTERING",TOPOLOGY_REQUESTS,end_time_limit.time_since_epoch().count(),request_types_str));
@@ -11,7 +11,7 @@ This section will guide you through the steps for setting up the cluster:
<https://hub.docker.com/r/scylladb/scylla/>, but add to every `docker run`
command a `-p 8000:8000` before the image name and
`--alternator-port=8000 --alternator-write-isolation=always` at the end.
The "alternator-port" option specifies on which port Scylla will listen for
The "alternator-port" option specifies on which port ScyllaDB will listen for
the (unencrypted) DynamoDB API, and the "alternator-write-isolation" chooses
whether or not Alternator will use LWT for every write.
For example,
@@ -24,10 +24,10 @@ This section will guide you through the steps for setting up the cluster:
By default, ScyllaDB run in this way will not have authentication or
authorization enabled, and any DynamoDB API request will be honored without
requiring them to be signed appropriately. See the
[Scylla Alternator for DynamoDB users](compatibility.md#authentication-and-authorization)
[ScyllaDB Alternator for DynamoDB users](compatibility.md#authentication-and-authorization)
document on how to configure authentication and authorization.
## Testing Scylla's DynamoDB API support:
## Testing ScyllaDB's DynamoDB API support:
### Running AWS Tic Tac Toe demo app to test the cluster:
1. Follow the instructions on the [AWS github page](https://github.com/awsdocs/amazon-dynamodb-developer-guide/blob/master/doc_source/TicTacToe.Phase1.md)
@@ -365,7 +365,7 @@ Modifying a keyspace with tablets enabled is possible and doesn't require any sp
- The replication factor (RF) can be increased or decreased by at most 1 at a time. To reach the desired RF value, modify the RF repeatedly.
- The ``ALTER`` statement rejects the ``replication_factor`` tag. List the DCs explicitly when altering a keyspace. See :ref:`NetworkTopologyStrategy <replication-strategy>`.
-If there's any other ongoing global topology operation, executing the``ALTER`` statement will fail (with an explicit and specific error) and needs to be repeated.
-An RF change cannot be requested while another RF change is pending for the same keyspace. Attempting to execute an``ALTER`` statement in this scenario will fail with an explicit error. Wait for the ongoing RF change to complete before issuing another ``ALTER`` statement.
- The ``ALTER`` statement may take longer than the regular query timeout, and even if it times out, it will continue to execute in the background.
- The replication strategy cannot be modified, as keyspaces with tablets only support ``NetworkTopologyStrategy``.
- The ``ALTER`` statement will fail if it would make the keyspace :term:`RF-rack-invalid <RF-rack-valid keyspace>`.
@@ -1043,6 +1043,8 @@ The following modes are available:
* - ``immediate``
- Tombstone GC is immediately performed. There is no wait time or repair requirement. This mode is useful for a table that uses the TWCS compaction strategy with no user deletes. After data is expired after TTL, ScyllaDB can perform compaction to drop the expired data immediately.
..warning:: The ``repair`` mode is not supported for :term:`Colocated Tables <Colocated Table>` in this version.
To learn more about TTL, and see a hands-on example, check out `this lesson <https://university.scylladb.com/courses/data-modeling/lessons/advanced-data-modeling/topic/expiring-data-with-ttl-time-to-live/>`_ on ScyllaDB University.
*`Video: Managing data expiration with Time-To-Live <https://www.youtube.com/watch?v=SXkbu7mFHeA>`_
*:doc:`Apache Cassandra Query Language (CQL) Reference </cql/index>`
*:doc:`KB Article:How to Change gc_grace_seconds for a Table </kb/gc-grace-seconds/>`
*:doc:`KB Article:Time to Live (TTL) and Compaction </kb/ttl-facts/>`
@@ -41,12 +41,12 @@ Unless the task was aborted, the worker will eventually reply that the task was
it temporarily saves list of ids of finished tasks and removes those tasks from group0 state (pernamently marking them as finished) in 200ms intervals. (*)
This batching of removing finished tasks is done in order to reduce number of generated group0 operations.
On the other hand, view buildind tasks can can also be aborted due to 2 main reasons:
On the other hand, view building tasks can can also be aborted due to 2 main reasons:
- a keyspace/view was dropped
- tablet operations (see [tablet operations section](#tablet-operations))
In the first case we simply delete relevant view building tasks as they are no longer needed.
But if a task needs to be aborted due to tablet operation, we're firstly setting the `aborted` flag to true. We need to do this because we need the task informations
to created a new adjusted tasks (if the operation succeeded) or rollback them (if the operation failed).
But if a task needs to be aborted due to tablet operation, we're firstly setting the `aborted` flag to true. We need to do this because we need the task information
to create new adjusted tasks (if the operation succeeded) or rollback them (if the operation failed).
Once a task is aborted by setting the flag, this cannot be revoked, so rolling back a task means creating its duplicate and removing the original task.
(*) - Because there is a time gap between when the coordinator learns that a task is finished (from the RPC response) and when the task is marked as completed,
This is the mapping used to decide on which stream IDs to use when making writes, as explained in the :doc:`./cdc-streams` document. It is a global property of the cluster: it doesn't depend on the table you're making writes to.
..caution::
The tables mentioned in the following sections: ``system_distributed.cdc_generation_timestamps`` and ``system_distributed.cdc_streams_descriptions_v2`` have been introduced in ScyllaDB 4.4. It is highly recommended to upgrade to 4.4 for efficient CDC usage. The last section explains how to run the below examples in ScyllaDB 4.3.
@@ -28,7 +28,8 @@ Incremental Repair is only supported for tables that use the tablets architectur
Incremental Repair Modes
------------------------
While incremental repair is the default and recommended mode, you can control its behavior for a given repair operation using the ``incremental_mode`` parameter. This is useful for situations where you might need to force a full data validation.
Incremental is currently disabled by default. You can control its behavior for a given repair operation using the ``incremental_mode`` parameter.
This is useful for enabling incremental repair, or in situations where you might need to force a full data validation.
@@ -20,7 +20,10 @@ You can run your ScyllaDB workloads on AWS, GCE, and Azure using a ScyllaDB imag
Amazon Web Services (AWS)
-----------------------------
The recommended instance types are :ref:`i3en <system-requirements-i3en-instances>`,:ref:`i4i <system-requirements-i4i-instances>`, :ref:`i7i <system-requirements-i7i-instances>`, and :ref:`i7ie <system-requirements-i7ie-instances>`.
The recommended instance types are :ref:`i3en <system-requirements-i3en-instances>`,
and :ref:`i8ge <system-requirements-i8ge-instances>`.
..note::
@@ -195,6 +198,118 @@ All i7i instances have the following specs:
See `Amazon EC2 I7i Instances <https://aws.amazon.com/ec2/instance-types/i7i/>`_ for details.
.._system-requirements-i8g-instances:
i8g instances
^^^^^^^^^^^^^^
The following i8g instances are supported.
..list-table::
:widths:30 20 20 30
:header-rows:1
* - Model
- vCPU
- Mem (GiB)
- Storage (GB)
* - i8g.large
- 2
- 16
- 1 x 468 GB
* - i8g.xlarge
- 4
- 32
- 1 x 937 GB
* - i8g.2xlarge
- 8
- 64
- 1 x 1,875 GB
* - i8g.4xlarge
- 16
- 128
- 1 x 3,750 GB
* - i8g.8xlarge
- 32
- 256
- 2 x 3,750 GB
* - i8g.12xlarge
- 48
- 384
- 3 x 3,750 GB
* - i8g.16xlarge
- 64
- 512
- 4 x 3,750 GB
All i8g instances have the following specs:
* Powered by AWS Graviton4 processors
* 3rd generation AWS Nitro SSD storage
* DDR5-5600 memory for improved throughput
* Up to 100 Gbps of networking bandwidth and up to 60 Gbps of bandwidth to
Amazon Elastic Block Store (EBS)
* Instance sizes offer up to 45 TB of total local NVMe instance storage
See `Amazon EC2 I8g Instances <https://aws.amazon.com/ec2/instance-types/i8g/>`_ for details.
.._system-requirements-i8ge-instances:
i8ge instances
^^^^^^^^^^^^^^
The following i8ge instances are supported.
..list-table::
:widths:30 20 20 30
:header-rows:1
* - Model
- vCPU
- Mem (GiB)
- Storage (GB)
* - i8ge.large
- 2
- 16
- 1 x 1,250 GB
* - i8ge.xlarge
- 4
- 32
- 1 x 2,500 GB
* - i8ge.2xlarge
- 8
- 64
- 2 x 2,500 GB
* - i8ge.3xlarge
- 12
- 96
- 1 x 7,500 GB
* - i8ge.6xlarge
- 24
- 192
- 2 x 7,500 GB
* - i8ge.12xlarge
- 48
- 384
- 4 x 7,500 GB
* - i8ge.18xlarge
- 72
- 576
- 6 x 7,500 GB
All i8ge instances have the following specs:
* Powered by AWS Graviton4 processors
* 3rd generation AWS Nitro SSD storage
* DDR5-5600 memory for improved throughput
* Up to 300 Gbps of networking bandwidth and up to 60 Gbps of bandwidth to
Amazon Elastic Block Store (EBS)
* Instance sizes offer up to 120 TB of total local NVMe instance storage
See `Amazon EC2 I8g Instances <https://aws.amazon.com/ec2/instance-types/i8g/>`_ for details.
Im4gn and Is4gen instances
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ScyllaDB supports Arm-based Im4gn and Is4gen instances. See `Amazon EC2 Im4gn and Is4gen instances <https://aws.amazon.com/ec2/instance-types/i4g/>`_ for specification details.
It is always recommended to run the latest version of ScyllaDB.
The latest stable release version is always available from the `Download Center <https://www.scylladb.com/download/>`_.
It is always recommended to run the latest stable version of ScyllaDB.
When you install ScyllaDB, it installs by default two services: **scylla-housekeeping-restart** and **scylla-housekeeping-daily**. These services check for the latest ScyllaDB version and prompt the user if they are using a version that is older than what is publicly available.
Information about your ScyllaDB deployment, including the ScyllaDB version currently used, as well as unique user and server identifiers, are collected by a centralized service.
@@ -9,6 +9,8 @@ Running ``cluster repair`` on a **single node** synchronizes all data on all nod
To synchronize all data in clusters that have both tablets-based and vnodes-based keyspaces, run :doc:`nodetool repair -pr </operating-scylla/nodetool-commands/repair/>` on **all**
of the nodes in the cluster, and :doc:`nodetool cluster repair </operating-scylla/nodetool-commands/cluster/repair/>` on **any** of the nodes in the cluster.
..warning:::term:`Colocated Tables <Colocated Table>` cannot be synchronized using cluster repair in this version.
To check if a keyspace enables tablets, use:
..code-block::cql
@@ -53,13 +55,13 @@ ScyllaDB nodetool cluster repair command supports the following options:
-``--incremental-mode`` specifies the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to incremental.
-``--incremental-mode`` specifies the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to 'disabled'.
Note that audit for every DML or QUERY might impact performance and consume a lot of storage.
Note that enabling audit may negatively impact performance and audit-to-table may consume extra storage. That's especially true when auditing DML and QUERY categories, which generate a high volume of audit messages.
Configuring Audit Storage
---------------------------
Auditing messages can be sent to :ref:`Syslog <auditing-syslog-storage>` or stored in a Scylla :ref:`table <auditing-table-storage>`.
Currently, auditing messages can only be saved to one location at a time. You cannot log into both a table and the Syslog.
Auditing messages can be sent to :ref:`Syslog <auditing-syslog-storage>` or stored in a Scylla :ref:`table <auditing-table-storage>` or both.
.._auditing-syslog-storage:
@@ -193,6 +192,23 @@ For example:
2018-03-18 00:00:00+0000 | 10.143.2.108 | 3429b1a5-2a94-11e8-8f4e-000000000001 | DDL | ONE | False | nba | DROP TABLE nba.team_roster ; | 127.0.0.1 | team_roster | Scylla |
(1 row)
.._auditing-table-and-syslog-storage:
Storing Audit Messages in a Table and Syslog Simultaneously
#. Follow both procedures from above, and set the ``audit`` parameter in the ``scylla.yaml`` file to both ``syslog`` and ``table``. You need to restart scylla only once.
To have both syslog and table you need to specify both backends separated by a comma:
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.