scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-04 14:03:06 +00:00

Author	SHA1	Message	Date
Abhinav Jha	08cd442ddc	raft: replication test: change rpc_propose_conf_change test to SEASTAR_THREAD_TEST_CASE The RAFT_TEST_CASE macro creates 2 test cases, one of them with random 20% packet loss, named name_drops. The framework makes hard-coded assumptions about the leader which don't hold in case of packet loss. This short-term fix disables the packet drop variant of the specified test. It should be safe to re-enable it once the whole framework is reworked to remove these hard-coded assumptions. This PR fixes a bug. Hence we need to backport it. Fixes: scylladb/scylladb#23816 Closes scylladb/scylladb#25489 (cherry picked from commit `a0ee5e4b85`) Closes scylladb/scylladb#25526	2025-08-18 12:26:17 +02:00
Jenkins Promoter	4724237537	Update ScyllaDB version to: 2025.1.7	2025-08-17 15:45:37 +03:00
Jenkins Promoter	f22ea03851	Update pgo profiles - aarch64	2025-08-15 04:32:33 +03:00
Jenkins Promoter	46ae7fcae1	Update pgo profiles - x86_64	2025-08-15 04:06:06 +03:00
Yaron Kaikov	88373e930a	Revert "dist/docker/debian/build_docker.sh: add scylla-server-dbg" This reverts commit `d7a02eceea`. This makes our containers MUCH larger than they need to be: 800.46 MB (2025.1.5) vs. 273.36 MB (2025.1.3). Fixes: https://github.com/scylladb/scylladb/issues/25479 Closes scylladb/scylladb#25102	2025-08-14 14:54:04 +03:00
Wojciech Przytuła	b9ae9473ba	Fix link to ScyllaDB manual The link would point to outdated OS docs. I fixed it to point to up-to-date Enterprise docs. Closes scylladb/scylladb#25328 (cherry picked from commit `7600ccfb20`) Closes scylladb/scylladb#25483	2025-08-13 11:17:03 +03:00
Dawid Mędrek	7f205fe063	test: Enable RF-rack-valid keyspaces in all Python suites We're enabling the configuration option `rf_rack_valid_keyspaces` in all Python test suites. All relevant tests have been adjusted to work with it enabled. That encompasses the following suites: * alternator, * broadcast_tables, * cluster (already enabled in scylladb/scylladb@ee96f8dcfc), * cql, * cqlpy (already enabled in scylladb/scylladb@be0877ce69), * nodetool, * rest_api. Two remaining suites that use tests written in Python, redis and scylla_gdb, are not affected, at least not directly. The redis suite requires creating an instance of Scylla manually, and the tests don't do anything that could violate the restriction. The scylla_gdb suite focuses on testing the capabilities of scylla-gdb.py, but even then it reuses the `run` file from the cqlpy suite. Fixes scylladb/scylladb#25126 Closes scylladb/scylladb#24617 (cherry picked from commit `b41151ff1a`) Closes scylladb/scylladb#25229	2025-08-13 09:24:09 +03:00
Asias He	dec3c84799	repair: Skip hints and batchlog flush in case of nodes down The flush API could not detect that a node is down and fail the flush before the timeout. This patch detects whether there is a down node and skips the flush if so, since the flush would fail after the timeout in this case anyway. The slowness due to the flush timeout in compaction_test.py::TestCompaction::test_delete_tombstone_gc_node_down is fixed with this patch. Fixes #22413 Closes scylladb/scylladb#22445 (cherry picked from commit `0682b1c716`) Closes scylladb/scylladb#25433	2025-08-13 09:23:43 +03:00
Dawid Mędrek	69307eaf2d	db/commitlog: Extend error messages for corrupted data We're providing additional information in error messages when throwing an exception related to data corruption: when a segment is truncated and when its content is invalid. That might prove helpful when debugging. Closes scylladb/scylladb#25190 (cherry picked from commit `408b45fa7e`) Closes scylladb/scylladb#25459	2025-08-13 09:22:50 +03:00
Szymon Malewski	38f4c3325d	test/alternator: enable more relevant logs in CI. This patch sets, for alternator test suite, all 'alternator-*' loggers and 'paxos' logger to trace level. This should significantly ease debugging of failed tests, while it has no effect on test time and increases log size only by 7%. This affects running alternator tests only with `test.py`, not with `test/alternator/run`. Closes #24645 Closes scylladb/scylladb#25327 (cherry picked from commit `eb11485969`) Closes scylladb/scylladb#25381 scylla-2025.1.6 scylla-2025.1.6-candidate-20250812093600	2025-08-11 07:04:52 +03:00
Botond Dénes	b7e606ac91	Merge '[Backport 2025.1] truncate: change check for write during truncate into a log warning' from Scylladb[bot] TRUNCATE TABLE performs a memtable flush and then discards the sstables of the table being truncated. It collects the highest replay position for both of these. When the highest replay position of the discarded sstables is higher than the highest replay position of the flushed memtable, that means that we have had writes during truncate which have been flushed to disk independently of the truncate process. We check for this and trigger an on_internal_error() which throws an exception, informing the user that writing data concurrently with TRUNCATE TABLE is not advised. The problem with this is that truncate is also called from DROP KEYSPACE and DROP TABLE. These are raft operations and exceptions thrown by them are caught by the (...) exception handler in the raft applier fiber, which then exits leaving the node without the ability to execute subsequent raft commands. This commit changes the on_internal_error() into a warning log entry. It also outputs to keyspace/table names, and the offending replay positions which caused the check to fail. This PR also adds a test which validates that TRUNCATE works correctly with concurrent writes. More specifically, it checks that: - all data written before TRUNCATE starts is deleted - none of the data after TRUNCATE completes is deleted Fixes: #25173 Fixes: #25013 Backport is needed in versions which check for truncate with concurrent writes using `on_internal_error()`: 2025.3 2025.2 2025.1 - (cherry picked from commit `268ec72dc9`) - (cherry picked from commit `33488ba943`) Parent PR: #25174 Closes scylladb/scylladb#25348 * github.com:scylladb/scylladb: truncate: add test for truncate with concurrent writes truncate: change check for write during truncate into a log warning	2025-08-11 07:02:55 +03:00
Taras Veretilnyk	1ae3cd310b	docs: fix typo in command name enbleautocompaction -> enableautocompaction Renamed the file and updated all references from 'enbleautocompaction' to the correct 'enableautocompaction'. Fixes scylladb/scylladb#25172 Closes scylladb/scylladb#25175 (cherry picked from commit `6b6622e07a`) Closes scylladb/scylladb#25215	2025-08-11 07:01:57 +03:00
Botond Dénes	7ab6911b03	Merge '[Backport 2025.1] storage_service: cancel all write requests after stopping transports' from Scylladb[bot] When a node shuts down, in storage service, after storage_proxy RPCs are stopped, some write handlers within storage_proxy may still be waiting for background writes to complete. These handlers hold appropriate ERMs to block schema changes before the write finishes. After the RPCs are stopped, these writes cannot receive the replies anymore. If, at the same time, there are RPC commands executing `barrier_and_drain`, they may get stuck waiting for these ERM holders to finish, potentially blocking node shutdown until the writes time out. This change introduces cancellation of all outstanding write handlers from storage_service after the storage proxy RPCs were stopped. Fixes scylladb/scylladb#23665 Backport: since this fixes an issue that frequently causes issues in CI, backport to 2025.1, 2025.2, and 2025.3. - (cherry picked from commit `bc934827bc`) - (cherry picked from commit `e0dc73f52a`) Parent PR: #24714 Closes scylladb/scylladb#25168 * github.com:scylladb/scylladb: storage_service: Cancel all write requests on storage_proxy shutdown test: Add test for unfinished writes during shutdown and topology change	2025-08-11 07:01:09 +03:00
Sergey Zolotukhin	4eab3b6a91	storage_service: Cancel all write requests on storage_proxy shutdown During a graceful node shutdown, RPC listeners are stopped in `storage_service::drain_on_shutdown` as one of the first steps. However, even after RPCs are shut down, some write handlers in `storage_proxy` may still be waiting for background writes to complete. These handlers retain the ERM. Since the RPC subsystem is no longer active, replies cannot be received, and if any RPC commands are concurrently executing `barrier_and_drain`, they may get stuck waiting for those writes. This can block the messaging server shutdown and delay the entire shutdown process until the write timeout occurs. This change introduces the cancellation of all outstanding write handlers in `storage_proxy` during shutdown to prevent unnecessary delays. Fixes scylladb/scylladb#23665 (cherry picked from commit `e0dc73f52a`)	2025-08-08 15:53:37 +02:00
Sergey Zolotukhin	f309d035f4	test: Add test for unfinished writes during shutdown and topology change This test reproduces an issue where a topology change and an ongoing write query during query coordinator shutdown can cause the node to get stuck. When a node receives a write request, it creates a write handler that holds a copy of the current table's ERM (Effective Replication Map). The ERM ensures that no topology or schema changes occur while the request is being processed. After the query coordinator receives the required number of replica write ACKs to satisfy the consistency level (CL), it sends a reply to the client. However, the write response handler remains alive until all replicas respond — the remaining writes are handled in the background. During shutdown, when all network connections are closed, these responses can no longer be received. As a result, the write response handler is only destroyed once the write timeout is reached. This becomes problematic because the ERM held by the handler blocks topology or schema change commands from executing. Since shutdown waits for these commands to complete, this can lead to unnecessary delays in node shutdown and restarts, and occasional test case failures. Test for: scylladb/scylladb#23665 (cherry picked from commit `bc934827bc`)	2025-08-08 15:53:17 +02:00
Taras Veretilnyk	599c2351d0	docs: Sort commands list in nodetool.rst Fixes scylladb/scylladb#25330 Closes scylladb/scylladb#25331 (cherry picked from commit `bcb90c42e4`) Closes scylladb/scylladb#25370	2025-08-07 13:14:55 +03:00
Ferenc Szili	a1e80365a7	truncate: add test for truncate with concurrent writes test_validate_truncate_with_concurrent_writes checks if truncate deletes all the data written before the truncate starts, and does not delete any data after truncate completes. (cherry picked from commit `33488ba943`)	2025-08-07 09:56:49 +02:00
Botond Dénes	b09297c0b9	Merge '[Backport 2025.1] repair: postpone repair until topology is not busy ' from Scylladb[bot] Currently, repair_service::repair_tablets starts repair if there are no ongoing tablet operations. The check does not consider global topology operations, like tablet resize finalization. Hence, if: - topology is in the tablet_resize_finalization state; - repair starts (as there are no tablet transitions) and holds the erm; - resize finalization finishes; then the repair sees a topology state different than the actual one - it does not see that the storage groups were already split. Repair code does not handle this case and it results in on_internal_error. Start repair when topology is not busy. The check isn't atomic, as it's done on shard 0. Thus, we compare the topology versions to ensure that the busyness check is valid. Fixes: https://github.com/scylladb/scylladb/issues/24195. Needs backport to all branches since they are affected - (cherry picked from commit `df152d9824`) - (cherry picked from commit `83c9af9670`) Parent PR: #24202 Closes scylladb/scylladb#24778 * github.com:scylladb/scylladb: test: add test for repair and resize finalization repair: postpone repair until topology is not busy	2025-08-07 06:25:21 +03:00
Nikos Dragazis	22dbdafd64	test: kmip: Fix segfault from premature destruction of port_promise `kmip_test_helper()` is a utility function to spawn a dedicated PyKMIP server for a particular Boost test case. The function runs the server as an external process and uses a thread to parse the port from the server's logs. The thread communicates the port to the main thread via a promise. The current implementation has a bug where the thread may set a value to the promise after its destruction, causing a segfault. This happens when the server does not start within 20 seconds, in which case the port future throws and the stack unwinding machinery destroys the port promise before the thread that writes to it. Fix the bug by declaring the promise before the cleanup action. The bug has been encountered in CI runs on slow machines, where the PyKMIP server takes too long to create its internal tables (due to slow fdatasync calls from SQLite). This patch does not improve CI stability - it only ensures that the error condition is properly reflected in the test output. This patch is not a backport. The same bug has been fixed in master as part of a larger rewrite of the `kmip_test_helper()` (see `722e2bce96`). Refs #24747, #24842. Fixes #24574. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#25028	2025-08-06 11:56:59 +03:00
Anna Stuchlik	f38b543937	doc: add the patch release upgrade procedure for version 2025.1 Fixes https://github.com/scylladb/scylladb/issues/25321 Closes scylladb/scylladb#25324	2025-08-06 11:23:37 +03:00
Aleksandra Martyniuk	f8d2f83c41	test: add test for repair and resize finalization Add test that checks whether repair does not start if there is an ongoing resize finalization. (cherry picked from commit `83c9af9670`)	2025-08-06 10:08:50 +02:00
Aleksandra Martyniuk	d976ab3933	repair: postpone repair until topology is not busy Currently, repair_service::repair_tablets starts repair if there are no ongoing tablet operations. The check does not consider global topology operations, like tablet resize finalization. This may cause a data race and unexpected behavior. Start repair when topology is not busy. (cherry picked from commit `df152d9824`)	2025-08-06 09:46:05 +02:00
Botond Dénes	f453b5bfa3	Merge '[Backport 2025.1] sstables: Fix quadratic space complexity in partitioned_sstable_set' from Scylladb[bot] Interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) intervals, since each such entry will have presence on all intervals it overlaps with. A trigger we observed was a memtable flush storm, which creates many small "L0" sstables that span roughly the entire token range. Since we cannot rely on insertion order, the solution is to store sstables with such wide ranges in a vector (unleveled). There should be no consequence for single-key reads, since the upper layer applies additional filtering based on the token of the key being queried. And for range scans, there can be an increase in memory usage, but not a significant one, because the sstables span a wide range and would have been selected in the combined reader if the range of the scan overlaps with them. Anyway, this is a protection against a storm of memtable flushes and shouldn't be the common scenario. It works both with tablets and vnodes, by adjusting the token range spanned by the compaction group accordingly. Fixes #23634. We can backport this into 2024.2, 2025.1, but we should let this cook in master for 1 month or so. - (cherry picked from commit `494ed6b887`) - (cherry picked from commit `59dad2121f`) - (cherry picked from commit `21d1e78457`) - (cherry picked from commit `c77f710a0c`) - (cherry picked from commit `d5bee4c814`) Parent PR: #23806 Closes scylladb/scylladb#24012 * github.com:scylladb/scylladb: test: Verify partitioned set store split and unsplit correctly sstables: Fix quadratic space complexity in partitioned_sstable_set compaction: Wire table_state into make_sstable_set() compaction: Introduce token_range() to table_state dht: Add overlap_ratio() for token range	2025-08-06 09:56:43 +03:00
Michał Jadwiszczak	5c8a2784e8	storage_service, group0_state_machine: move SL cache update from `topology_state_load()` to `load_snapshot()` Currently the service levels cache is unnecessarily updated in every call of `topology_state_load()`. But it is enough to reload it only when a snapshot is loaded. (The cache is also already updated when there is a change to one of `service_levels_v2`, `role_members`, `role_attributes` tables.) Fixes scylladb/scylladb#25114 Fixes scylladb/scylladb#23065 Closes scylladb/scylladb#25116 (cherry picked from commit `10214e13bd`) Closes scylladb/scylladb#25303	2025-08-06 09:54:24 +03:00
Nikos Dragazis	3e96b9a13d	test: Use in-memory SQLite for PyKMIP server The PyKMIP server uses an SQLite database to store artifacts such as encryption keys. By default, SQLite performs a full journal and data flush to disk on every CREATE TABLE operation. Each operation triggers three fdatasync(2) calls. If we multiply this by 16, that is the number of tables created by the server, we get a significant number of file syncs, which can last for several seconds on slow machines. This behavior has led to CI stability issues from KMIP unit tests where the server failed to complete its schema creation within the 20-second timeout (observed on spider9 and spider11). Fix this by configuring the server to use an in-memory SQLite. Fixes #24842. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#24995 (cherry picked from commit `2656fca504`) Closes scylladb/scylladb#25297	2025-08-06 09:53:18 +03:00
Aleksandra Martyniuk	a65117617d	api: storage_service: do not log the exception that is passed to user The exceptions that are thrown by the tasks started with API are propagated to users. Hence, there is no need to log it. Remove the logs about exception in user started tasks. Fixes: https://github.com/scylladb/scylladb/issues/16732. Closes scylladb/scylladb#25153 (cherry picked from commit `e607ef10cd`) Closes scylladb/scylladb#25295	2025-08-06 09:49:51 +03:00
Botond Dénes	f2a1e9f9ad	Merge '[Backport 2025.1] test/cqlpy: Adjust tests to RF-rack-valid keyspaces' from Scylladb[bot] In this PR, we adjust tests in the cqlpy test suite so they only use RF-rack-valid keyspaces. After that, we enable the configuration option `rf_rack_valid_keyspaces` in the suite by default. Refs scylladb/scylladb#23428 Fixes scylladb/scylladb#25306 Backport: backporting to 2025.1 so we can test the option there too. - (cherry picked from commit `6bde01bb59`) - (cherry picked from commit `958eaec056`) - (cherry picked from commit `a59842257a`) - (cherry picked from commit `be0877ce69`) Parent PR: #23489 Closes scylladb/scylladb#25307 * github.com:scylladb/scylladb: test/cqlpy: Enable rf_rack_valid_keyspaces by default test: Move test_alter_tablet_keyspace_rf to cluster suite test/cqlpy: Adjust tests to RF-rack-valid keyspaces test/cqlpy/cassandra_tests: Adjust to RF-rack-valid keyspaces	2025-08-06 09:49:18 +03:00
Ferenc Szili	75c1ed0e86	truncate: change check for write during truncate into a log warning TRUNCATE TABLE performs a memtable flush and then discards the sstables of the table being truncated. It collects the highest replay position for both of these. When the highest replay position of the discarded sstables is higher than the highest replay position of the flushed memtable, that means that we have had writes during truncate which have been flushed to disk independently of the truncate process. We check for this and trigger an on_internal_error() which throws an exception, informing the user that writing data concurrently with TRUNCATE TABLE is not advised. The problem with this is that truncate is also called from DROP KEYSPACE and DROP TABLE. These are raft operations and exceptions thrown by them are caught by the (...) exception handler in the raft applier fiber, which then exits leaving the node without the ability to execute subsequent raft commands. This commit changes the on_internal_error() into a warning log entry. It also outputs to keyspace/table names, the truncated_at timepoint, the offending replay positions which caused the check to fail. Fixes: #25173 Fixes: #25013 (cherry picked from commit `268ec72dc9`)	2025-08-06 00:51:06 +00:00
Avi Kivity	6f44cd672e	Merge '[Backport 2025.1] qos: don't populate effective service level cache until auth is migrated to raft' from Scylladb[bot] Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work. In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version. Fixes: scylladb/scylladb#24963 Should be backported to all versions which support upgrade to topology over raft - the issue described here may put the cluster into a state which is difficult to get out of (group0 apply fiber can break on multiple nodes, which necessitates their restart). - (cherry picked from commit `2bb800c004`) - (cherry picked from commit `3a082d314c`) Parent PR: #25188 Closes scylladb/scylladb#25283 * github.com:scylladb/scylladb: test: sl: verify that legacy auth is not queried in sl to raft upgrade qos: don't populate effective service level cache until auth is migrated to raft	2025-08-03 15:35:32 +03:00
Dawid Mędrek	9666c255a4	test/cqlpy: Enable rf_rack_valid_keyspaces by default All of the tests in the suite have been adjusted so they only use RF-rack-valid keyspaces, so let's start enabling the option by default. (cherry picked from commit `be0877ce69`)	2025-08-01 21:41:40 +02:00
Dawid Mędrek	254f9b9427	test: Move test_alter_tablet_keyspace_rf to cluster suite We move the test `test_alter_tablet_keyspace_rf` from the cqlpy to the cluster test suite. The reason behind the change is that the test cannot be run with `rf_rack_valid_keyspaces` turned on in the configuration. During the test, we make the keyspace RF-rack-invalid multiple times. Since RF-rack-validity is a very strong constraint, adjusting the test otherwise is impossible. By moving it to the cluster test suite, we're able to change the configuration of the node used in the test, and so the test can work again. (cherry picked from commit `a59842257a`)	2025-08-01 21:41:37 +02:00
Dawid Mędrek	1e7a6643fb	test/cqlpy: Adjust tests to RF-rack-valid keyspaces (cherry picked from commit `958eaec056`)	2025-08-01 21:41:34 +02:00
Dawid Mędrek	929aed3a30	test/cqlpy/cassandra_tests: Adjust to RF-rack-valid keyspaces We adjust three existing Cassandra tests so that they don't create RF-rack-invalid keyspaces. We modify the replication factor used in the problematic tests. The changes don't affect the tests as the value of the RF is unrelated to what they verify. Thanks to that, we can run them now even with enforced RF-rack-valid keyspaces. The drawback is that the modified ALTER statements do not modify the RF at all. However, since the tests seem to verify that the code responsible for VALIDATING a request works as intended, that should have little to no impact on them. (cherry picked from commit `6bde01bb59`)	2025-08-01 21:41:30 +02:00
Jenkins Promoter	edfdff5b1d	Update pgo profiles - aarch64	2025-08-01 04:46:49 +03:00
Jenkins Promoter	706ad5baa6	Update pgo profiles - x86_64	2025-08-01 04:39:41 +03:00
Piotr Dulikowski	e5a47753ce	test: sl: verify that legacy auth is not queried in sl to raft upgrade Adjust `test_service_levels_upgrade`: right before upgrade to topology on raft, enable an error injection which triggers when the standard role manager is about to query the legacy auth tables in the system_auth keyspace. The preceding commit which fixes scylladb/scylladb#24963 makes sure that the legacy tables are not queried during upgrade to topology on raft, so the error injection does not trigger and does not cause a problem; without that commit, the test fails. (cherry picked from commit `3a082d314c`)	2025-07-31 15:12:51 +00:00
Piotr Dulikowski	afe786843f	qos: don't populate effective service level cache until auth is migrated to raft Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work. In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version. Fixes: scylladb/scylladb#24963 (cherry picked from commit `2bb800c004`)	2025-07-31 15:12:50 +00:00
Jakub Smolar	7044e81e2d	gdb: handle zero-size reads in managed_bytes Fixes: https://github.com/scylladb/scylladb/issues/25048 Closes scylladb/scylladb#25050 (cherry picked from commit `6e0a063ce3`) Closes scylladb/scylladb#25138	2025-07-31 13:07:15 +03:00
Pavel Emelyanov	22147d053d	Merge '[Backport 2025.1] transport: remove throwing protocol_exception on connection start' from Dario Mirovic Note: The simplest approach to resolving `process_request_one` merge issues, since it has been refactored, was to include the three commits from before, and then the commits that are actually being backported. `protocol_exception` is thrown in several places. This has become a performance issue, especially when starting/restarting a server. To alleviate this issue, throwing the exception has to be replaced with returning it as a result or an exceptional future. This PR replaces throws in the `transport/server` module. This is achieved by using result_with_exception, and in some places, where suitable, just by creating and returning an exceptional future. There are four commits in this PR. The first commit introduces tests in `test/cqlpy`. The second commit refactors transport server `handle_error` to not rethrow exceptions. The third commit refactors reusable buffer writer callbacks. The fourth commit replaces throwing `protocol_exception` with returning it. Based on the comments on an issue linked in https://github.com/scylladb/scylladb/issues/24567, the main culprit from the side of protocol exceptions is the invalid protocol version one, so I tested that exception for performance. In order to see if there is a measurable difference, a modified version of the `test_protocol_version_mismatch` Python test is used, with 100'000 runs across 10 processes (not threads, to avoid the Python GIL). One test run consisted of 1 warm-up run and 5 measured runs. The first test run was executed on the current code, with throwing protocol exceptions. The second test run was executed on the new code, with returning protocol exceptions. The performance report is in https://github.com/scylladb/scylladb/pull/24738#issuecomment-3051611069. It shows ~10% gains in real, user, and sys time for this test. 
Testing
Build: `release`
Test file: `test/cqlpy/test_protocol_exceptions.py`
Test name: `test_protocol_version_mismatch` (modified for mass connection requests)
Test arguments:
```
max_attempts=100'000
num_parallel=10
```
Throwing `protocol_exception` results:
```
real=1:26.97 user=10:00.27 sys=2:34.55 cpu=867%
real=1:26.95 user=9:57.10 sys=2:32.50 cpu=862%
real=1:26.93 user=9:56.54 sys=2:35.59 cpu=865%
real=1:26.96 user=9:54.95 sys=2:32.33 cpu=859%
real=1:26.96 user=9:53.39 sys=2:33.58 cpu=859%
real=1:26.95 user=9:56.85 sys=2:34.11 cpu=862% # average
```
Returning `protocol_exception` as `result_with_exception` or an exceptional future:
```
real=1:18.46 user=9:12.21 sys=2:19.08 cpu=881%
real=1:18.44 user=9:04.03 sys=2:17.91 cpu=869%
real=1:18.47 user=9:12.94 sys=2:19.68 cpu=882%
real=1:18.49 user=9:13.60 sys=2:19.88 cpu=883%
real=1:18.48 user=9:11.76 sys=2:17.32 cpu=878%
real=1:18.47 user=9:10.91 sys=2:18.77 cpu=879% # average
```
This PR replaced `transport/server` throws of `protocol_exception` with returns. There are a few other places where protocol exceptions are thrown, and there are many places where `invalid_request_exception` is thrown. That is out of scope of this single PR, so the PR just refs, and does not resolve issue #24567. Refs: #24567 This PR improves performance in cases when protocol exceptions happen, for example during connection storms. It will require backporting. 
* (cherry picked from commit `7aaeed012e`) * (cherry picked from commit `30d424e0d3`) * (cherry picked from commit `9f4344a435`) * (cherry picked from commit `5390f92afc`) * (cherry picked from commit `4a6f71df68`) Parent PR: #24738 Closes scylladb/scylladb#25240 * github.com:scylladb/scylladb: test/cqlpy: add cpp exception metric test conditions transport/server: replace protocol_exception throws with returns utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception transport/server: avoid exception-throw overhead in handle_error test/cqlpy: add protocol_exception tests transport: remove redundant references in process_request_one transport: fix the indentation in process_request_one transport: add futures in CQL server exception handling	2025-07-31 12:20:41 +03:00
Anna Stuchlik	1ecf3fcfcd	doc: add tablets support information to the Drivers table This commit: - Extends the Drivers support table with information on which driver supports tablets and since which version. - Adds the driver support policy to the Drivers page. - Reorganizes the Drivers page to accommodate the updates. In addition: - The CPP-over-Rust driver is added to the table. - The information about Serverless (which we don't support) is removed and replaced with tablets to correctly describe the contents of the table. Fixes https://github.com/scylladb/scylladb/issues/19471 Refs https://github.com/scylladb/scylladb-docs-homepage/issues/69 Closes scylladb/scylladb#24635 (cherry picked from commit `18b4d4a77c`) Closes scylladb/scylladb#25247	2025-07-31 12:20:29 +03:00
Aleksandra Martyniuk	2bd1339620	streaming: close sink when exception is thrown If an exception is thrown in result_handling_cont in streaming, then the sink does not get closed. This leads to a node crash. Close sink in exception handler. Fixes: https://github.com/scylladb/scylladb/issues/25165. Closes scylladb/scylladb#25238 (cherry picked from commit `99ff08ae78`) Closes scylladb/scylladb#25266	2025-07-31 12:20:12 +03:00
Dario Mirovic	3e388e7910	test/cqlpy: add cpp exception metric test conditions Tested code paths should not throw exceptions. `scylla_reactor_cpp_exceptions` metric is used. This is a global metric. To address potential test flakiness, each test runs multiple times: - `run_count = 100` - `cpp_exception_threshold = 10` If a change in the code introduced an exception, expectation is that the number of registered exceptions will be > `cpp_exception_threshold` in `run_count` runs. In which case the test fails. Fixes: #25273 (cherry picked from commit `4a6f71df68`)	2025-07-30 21:57:25 +02:00
Dario Mirovic	e8478982dc	transport/server: replace protocol_exception throws with returns Replace throwing protocol_exception with returning it as a result or an exceptional future in the transport server module. This improves performance, for example during connection storms and server restarts, where protocol exceptions are more frequent. In functions already returning a future, protocol exceptions are propagated using an exceptional future. In functions not already returning a future, result_with_exception is used. Notable change is checking v.failed() before calling v.get() in process_request function, to avoid throwing in case of an exceptional future. Refs: #24567 Fixes: #25273 (cherry picked from commit `5390f92afc`)	2025-07-30 21:57:20 +02:00
Dario Mirovic	028de964c8	utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception Make make_bytes_ostream and make_fragmented_temporary_buffer accept writer callbacks that return utils::result_with_exception instead of forcing them to throw on error. This lets callers propagate failures by returning an error result rather than throwing an exception. Introduce buffer_writer_for, bytes_ostream_writer, and fragmented_buffer_writer concepts to simplify and document the template requirements on writer callbacks. This patch does not modify the actual callbacks passed, except for the syntax changes needed for successful compilation, without changing the logic. Refs: #24567 Fixes: #25273 (cherry picked from commit `9f4344a435`)	2025-07-30 21:57:15 +02:00
Dario Mirovic	2fbcdfd4b4	transport/server: avoid exception-throw overhead in handle_error Previously, connection::handle_error always called f.get() inside a try/catch, forcing every failed future to throw and immediately catch an exception just to classify it. This change eliminates that extra throw/catch cycle by first checking f.failed(), getting the stored std::exception_ptr via f.get_exception(), and then dispatching on its type via utils::try_catch<T>(eptr). The error-response logic is not changed - cassandra_exception, std::exception, and unknown exceptions are caught and processed, and any exceptions thrown by write_response while handling those exceptions continue to escape handle_error. Refs: #24567 Fixes: #25273 (cherry picked from commit `30d424e0d3`)	2025-07-30 21:57:09 +02:00
Dario Mirovic	96f5bcc5be	test/cqlpy: add protocol_exception tests Add a helper to fetch scylla_transport_cql_errors_total{type="protocol_error"} counter from Scylla's metrics endpoint. These metrics are used to track protocol error count before and after each test. Add cql_with_protocol context manager utility for session creation with parameterized protocol_version value. This is used for testing connection establishment with different protocol versions, and proper disposal of successfully established sessions. The tests cover two failure scenarios: - Protocol version mismatch in test_protocol_version_mismatch which tests both supported and unsupported protocol version - Malformed frames via raw socket in _protocol_error_impl, used by several test functions, and also test_no_protocol_exceptions test to assert that the error counters never decrease during test execution, catching unintended metric resets Refs: #24567 Fixes: #25273 (cherry picked from commit `7aaeed012e`)	2025-07-30 21:56:45 +02:00
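Reading a labelled counter such as `scylla_transport_cql_errors_total{type="protocol_error"}` from Prometheus-style metrics text can be sketched as below. This is an illustration under stated assumptions, not the helper added by the commit: `sum_counter` and `SAMPLE` are hypothetical, the sample values are made up, the `shard` label is assumed, and the HTTP fetch from the metrics endpoint is left out so that the block only parses text.

```python
import re

# Hypothetical sample of Prometheus text exposition output. The values and
# the `shard` label are assumptions made for this illustration.
SAMPLE = """\
# TYPE scylla_transport_cql_errors_total counter
scylla_transport_cql_errors_total{shard="0",type="protocol_error"} 3
scylla_transport_cql_errors_total{shard="1",type="protocol_error"} 2
scylla_transport_cql_errors_total{shard="0",type="unavailable"} 7
"""

# Matches lines of the form: metric_name{label="value",...} number
_LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)\{(.*)\}\s+(\S+)')
_LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def sum_counter(metrics_text: str, name: str, label: str, value: str) -> int:
    """Sum all samples of `name` whose `label` equals `value`."""
    total = 0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _LINE_RE.match(line)
        if match is None or match.group(1) != name:
            continue
        labels = dict(_LABEL_RE.findall(match.group(2)))
        if labels.get(label) == value:
            total += int(float(match.group(3)))
    return total


protocol_errors = sum_counter(
    SAMPLE, "scylla_transport_cql_errors_total", "type", "protocol_error"
)
```

A test would take this reading before and after sending a malformed frame and compare the two values, which is also how a check that the counter never decreases could be expressed.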
Tomasz Grabiec	24435185d7	Merge '[Backport 2025.1] streaming: Avoid deadlock by running view checks in a separate scheduling group' from Scylladb[bot] This issue happens with removenode, when RBNO is disabled, so range streamer is used. The deadlock happens in a scenario like this: 1. Start 3 nodes: {A, B, C}, RF=2 2. Node A is lost 3. removenode A 4. Both B and C gain ownership of ranges. 5. Streaming sessions are started with crossed directions: B->C, C->B Readers created by sender side exhaust streaming semaphore on B and C. Receiver side attempts to obtain a permit indirectly by calling check_needs_view_update_path(), which reads local tables. That read is blocked and times-out, causing streaming to fail. The streaming writer is already using a tracking-only permit. Even if we didn't deadlock, and the streaming semaphore was simply exhausted by other receiving sessions (via tracking-only permit), the query may still time-out due to starvation. To avoid that, run the query under a different scheduling group, which translates to the system semaphore instead of the maintenance semaphore, to break the dependency. The gossip group was chosen because it shouldn't be contended and this change should not interfere with it much. Fixes #24807 Fixes #24925 - (cherry picked from commit `ee2fa58bd6`) - (cherry picked from commit `dff2b01237`) Parent PR: #24929 Closes scylladb/scylladb#25052 * github.com:scylladb/scylladb: streaming: Avoid deadlock by running view checks in a separate scheduling group service: migration_manager: Run group0 barrier in gossip scheduling group	2025-07-30 02:22:51 +02:00
Tomasz Grabiec	2a9eecdb65	streaming: Avoid deadlock by running view checks in a separate scheduling group This issue happens with removenode, when RBNO is disabled, so range streamer is used. The deadlock happens in a scenario like this: 1. Start 3 nodes: {A, B, C}, RF=2 2. Node A is lost 3. removenode A 4. Both B and C gain ownership of ranges. 5. Streaming sessions are started with crossed directions: B->C, C->B Readers created by sender side exhaust streaming semaphore on B and C. Receiver side attempts to obtain a permit indirectly by calling check_needs_view_update_path(), which reads local tables. That read is blocked and times-out, causing streaming to fail. The streaming writer is already using a tracking-only permit. To avoid that, run the query under a different scheduling group, which translates to the system semaphore instead of the maintenance semaphore, to break the dependency. The gossip group was chosen because it shouldn't be contended and this change should not interfere with it much. Fixes: #24807 (cherry picked from commit `dff2b01237`)	2025-07-28 21:47:50 +02:00
Andrzej Jackowski	d6fa9c95c4	transport: remove redundant references in process_request_one The references were added and used in previous commits to limit the number of line changes for the reviewer's convenience. This commit removes the redundant references to make the code clearer and more concise. (cherry picked from commit `9b1f062827`)	2025-07-28 20:45:37 +02:00
Andrzej Jackowski	e8dbb82949	transport: fix the indentation in process_request_one Fix the indentation after the previous commit that intentionally had a wrong indent to limit the number of changed lines (cherry picked from commit `9c0f369cf8`)	2025-07-28 20:45:29 +02:00

46950 Commits