scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-24 02:20:37 +00:00

Author	SHA1	Message	Date
Marcin Maliszkiewicz	edf0148bee	perf-alternator: wait for alternator port before running workload This patch is mostly for the purpose of running pgo CI job. We may receive connection error if asyncio.sleep(5) in pgo.py is not sufficient waiting time. In pgo.py we do wait for port but only for cql, anyway it's better to have high level check than trying to wait for alternator port there.	2026-03-16 16:07:52 +01:00
Artsiom Mishuta	755d528135	test.py: fix warnings Changes in this commit: 1) rename class from 'TestContext' to 'Context' so pytest will not consider this class as a test 2) extend pytest filterwarnings list to ignore warnings from external libs 3) use datetime.datetime.now(datetime.UTC) instead of datetime.datetime.utcnow() 4) use ResultSet.one() instead of ResultSet[0] Fixes SCYLLADB-904 Fixes SCYLLADB-908 Related SCYLLADB-902 Closes scylladb/scylladb#28956	2026-03-15 12:00:10 +02:00
Piotr Dulikowski	d8b283e1fb	Merge 'Add CQL forwarding for strongly consistent tables' from Wojciech Mitros In this series we add support for forwarding strongly consistent CQL requests to suitable replicas, so that clients can issue reads/writes to any node and have the request executed on an appropriate tablet replica (and, for writes, on the Raft leader). We return the same CQL response as what the user would get while sending the request to the correct replica and we perform the same logging/stats updates on the request coordinator as if the coordinator was the appropriate replica. The core mechanism of forwarding a strongly consistent request is sending an RPC containing the user's cql request frame to the appropriate replica and returning back a ready, serialized `cql_transport::response`. We do this in the CQL server - it is most prepared for handling these types and forwarding a request containing a CQL frame allows us to reuse near-top-level methods for CQL request handling in the new RPC handler (such as the general `process`) For sending the RPC, the CQL server needs to obtain the information about who should it forward the request to. This requires knowledge about the tablet raft group members and leader. We obtain this information during the execution of a `cql3/strong_consistency` statement, and we return this information back to the CQL server using the generalized `bounce_to_shard` `response_message`, where we now store the information about either a shard, or a specific replica to which we should forward to. Similarly to `bounce_to_shard`, we need to handle this `result_message` in a loop - a replica may move during statement execution, or the Raft leader can change. 
We also use it for forwarding strongly consistent writes when we're not a member of the affected tablet raft group - in that case we need to forward the statement twice - once to any replica of the affected tablet, then that replica can find the leader and return this information to the coordinator, which allows the second request to be directed to the leader. This feature also allows passing through exception messages which happened on the target replica while executing the statement. For that, many methods of the `cql_transport::cql_server::connection` for creating error responses needed to be moved to `cql_transport::cql_server`. And for final exception handling on the coordinator, we added additional error info to the RPC response, so that the handling can be performed without having the `result_message::exception` or `exception_ptr` itself. Fixes [SCYLLADB-71](https://scylladb.atlassian.net/browse/SCYLLADB-71) [SCYLLADB-71]: https://scylladb.atlassian.net/browse/SCYLLADB-71?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#27517 * github.com:scylladb/scylladb: test: add tests for CQL forwarding transport: enable CQL forwarding for strong consistency statements transport: add remote statement preparation for CQL forwarding transport: handle redirect responses in CQL forwarding transport: add exception handling for forwarded CQL requests transport: add basic CQL request forwarding idl: add a representation of client_state for forwarding cql_server: handle query, execute, batch in one case transport: inline process_on_shard in cql_server::process transport: extract process() to cql_server transport: add messaging_service to cql_server transport: add response reconstruction helpers for forwarding transport: generalize the bounce result message for bouncing to other nodes strong consistency: redirect requests to live replicas from the same rack transport: pass foreign_ptr into 
sleep_until_timeout_passes and move it to cql_server transport: extract the error handling from process_request_one transport: move error response helpers from connection to cql_server	2026-03-13 15:03:10 +01:00
Tomasz Grabiec	518470e89e	Merge 'load_stats: improve tablet filtering for load stats' from Ferenc Szili When computing table sizes via load_stats to determine if a split/merge is needed, we are filtering tablets which are being migrated, in order to avoid counting them twice (both on leaving and pending replica) in the total table size. The tablets are filtered so that they are counted on the leaving replica until the streaming stage, and on the pending replica after the streaming stage. Currently, the procedure for collecting tablet sizes for load balancing also uses this same filter. This should be changed, because the load balancer needs to have as much information about tablet sizes as possible, and could ignore a node due to missing tablet sizes for tablets in the `write_both_read_new` and `use_new` stages. For tablet size collection, we should include all the tablets which are currently taking up disk space. This means: - on leaving replica, include all tablets until the `cleanup` stage - on pending replica, include all tablets starting with the `write_both_read_new` and later stages While this is an improvement, it causes problems with some of the tests, and therefore needs to be backported to 2026.1 Fixes: SCYLLADB-829 Closes scylladb/scylladb#28587 * github.com:scylladb/scylladb: load_stats: add filtering for tablet sizes load_stats: move tablet filtering for table size computation load_stats: bring the comment and code in sync	2026-03-13 13:08:11 +01:00
Gleb Natapov	fae5282c82	service level: fix crash during migration to driver server level Before `b59b3d4` the migration code checked that the service level controller is on the v2 version before migration, and the check also implicitly verified that the _sl_data_accessor field is already initialized. Now that the check is gone, the migration can start before the service level controller is fully initialized. Re-add the check, but in a different place. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1049 Closes scylladb/scylladb#29021	2026-03-13 11:24:26 +01:00
Dani Tweig	aa46a0f4e0	Add VECTOR to the list of synced milestones in scylladb.git - Added VECTOR to the comma-separated list of Jira project keys in `call_sync_milestone_to_jira.yml`. - The `jira_project_keys` value changed from `SCYLLADB,CUSTOMER,SMI,RELENG` to `SCYLLADB,CUSTOMER,SMI,RELENG,VECTOR`. - The VECTOR project needs to sync with scylladb.git milestones, so that when a GitHub milestone is created or closed in scylladb/scylladb, the corresponding Jira release is also created or released in the VECTOR project. - Previously only SCYLLADB, CUSTOMER, SMI, and RELENG projects were synced. Fixes:PM-220 Closes scylladb/scylladb#29014	2026-03-13 09:58:41 +02:00
Botond Dénes	fc8cebd671	Merge 'Verify components digests during component load and scrub in validate mode' from Taras Veretilnyk This PR adds integrity verification for SSTable component files during loading. When component digests are present in Scylla metadata, the loader now validates each component's CRC32 digest against the stored expected value, catching silent corruption of component files. Index, Rows and Partitions components digests are also validated during scrub in validate mode. Added corruption tests that write an SSTable, flip a bit in a specific component file, then verify that reloading the SSTable detects the corruption and throws the expected exception. Depends on https://github.com/scylladb/scylladb/pull/28338 Backport is not required, this is a new feature Fixes https://github.com/scylladb/scylladb/issues/20103 Closes scylladb/scylladb#28761 * github.com:scylladb/scylladb: test/cqlpy: test --ignore-component-digest-mismatch flag in scylla sstable upgrade docs: document --ignore-component-digest-mismatch flag for scylla sstable upgrade sstables: propagate ignore_component_digest_mismatch config to all load sites sstables: add option to ignore component digest mismatches sstable_compaction_test: Add scrub validate test for corrupted index sstables: add tests for component digest validation on corrupted SSTables sstables: validate index components digests during SSTable scrub in validate mode sstables: verify component digests on SSTable load sstables: add digest_file_random_access_reader for CRC32 digest computation	2026-03-13 09:55:55 +02:00
Avi Kivity	ae8a418744	Merge 'Await async calls in test tablets migration' from Benny Halevy Fix several test cases that did not await async tasks: - test_restart_leaving_replica_during_cleanup - test_restart_in_cleanup_stage_after_cleanup - test_tablet_back_and_forth_migration - test_staging_backlog_is_preserved_with_file_based_streaming Fixes SCYLLADB-910 * Minor fixes, no backport needed Closes scylladb/scylladb#28908 * github.com:scylladb/scylladb: test_tablets_migration: test_staging_backlog_is_preserved_with_file_based_streaming: convert for loop to asyncio.gather test_tablets_migration: test_tablet_back_and_forth_migration: await move_tablet test_tablets_migration: test_restart_in_cleanup_stage_after_cleanup: await move_task test_tablets_migration: test_restart_leaving_replica_during_cleanup: await move_task test_tablets_migration: drop unused imports from cassandra.query	2026-03-13 00:20:29 +02:00
Avi Kivity	b228eb26e6	Merge 'dbuild: Use slirp4netns network in dbuild nested containers' from Calle Wilund Fixes #25084 Add slirp4netns and use it for nested containers. This will allow nested container port aliasing, helping CI stability. Note: this contains an updated Dockerfile for the dbuild image, but because of the chicken-and-egg problem, for now the dbuild script force-installs slirp4netns before anything else. Updates the mock server handling to use ephemeral ports and query them from the container, ensuring we don't get port collisions (boost as well as pytest). Includes a timeout increase, and a tweak to our scylla_cluster handling, ensuring we don't deadlock when the pipe size is less than required for our sys notify messages. Closes scylladb/scylladb#28727 * github.com:scylladb/scylladb: gcs_fixture: Change to use docker helper aws_kms_fixture: Modify to use docker helper test/lib/proc_util: Add docker helper pytest: use ephemeral port publish for docker mock servers dbuild: Use container network in dbuild nested containers scylla_cluster: Read notify sock in background to prevent deadlock	2026-03-12 23:49:25 +02:00
Nadav Har'El	ad832c263e	test/cluster: mark test_alternator_concurrent_rmw_same_partition_different_server not strictly xfail A few days ago, in commit `7b30a39` we added to pytest.ini the option xfail_strict. This option causes every time a test XPASSes, i.e., an xfail test actually passes - to be considered an error and fail the test. But some tests demonstrate a timing-related bug and do not reproduce the bug every single time. An example we noticed in one CI run is: test/cluster/test_alternator.py::test_alternator_concurrent_rmw_same_partition_different_server This test reproduces a timing-related bug (if you do an LWT write to one partition on to two different coordinators "at the same time", you can get a failure), but only most of the time, not 100% of the time. The solution is to add "strict=False" for the xfail marker on this specific test. This undoes the xfail_strict for this specific test, accepting that this specific test can either pass or fail. Note that this does NOT make this test worthless - we still see this test failing most of the time, and when a developer finally fixes this issue, the test will begin to pass all the time. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-941 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#29016	2026-03-12 23:46:23 +02:00
Avi Kivity	03186ce60d	Merge 'Cleanup after auth v1 and default superuser code removal' from Marcin Maliszkiewicz This is short cleanup after recent removal of creating default cassandra superuser and auth-v1 code removal. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1036 Backport: no, just code cleanup Closes scylladb/scylladb#29004 * github.com:scylladb/scylladb: auth: remove DEFAULT_SUPERUSER_NAME constant and dead DEFAULT_USER_PASSWORD auth: use configurable default_superuser in describe_roles auth: move default_superuser to common, remove _superuser member auth: use LOCAL_ONE for all auth queries auth: remove get_auth_ks_name indirection	2026-03-12 23:44:32 +02:00
Avi Kivity	e2eeef3e01	Merge 'service level: remove remnants of version 1 service level' from Gleb Natapov can_use_effective_service_level_cache() always returns true now, so the function can be dropped entirely and all the code that assumes it may return false can be dropped as well. Also drop async versions of find_effective_service_level and get_user_scheduling_group since they are unused. No need to backport, code removal. Closes scylladb/scylladb#29002 * github.com:scylladb/scylladb: service level: make maybe_update_per_service_level_params synchronous service level: remove unused get_user_scheduling_group function service level: drop async find_effective_service_level service level: remove remnants of version 1 service level	2026-03-12 23:39:41 +02:00
Botond Dénes	eed3a6d407	sstables/mx/writer: move post-cell write yield to collection write loop Introduced by `54bddeb3b5`, the yield was added to write_cell(), to also help the general case where there is no collection. Arguably this was unnecessary and this patch moves the yield to write_collection(), to the cell write loop instead, so regular cells don't have to poll the preempt flag. Closes scylladb/scylladb#29013	2026-03-12 21:26:35 +02:00
Avi Kivity	e8a6706d6e	Merge 'shorten some sleeps to speed up bootstrap in tests' from Patryk Jędrzejczak This PR shortens two sleeps from 1s to 100ms to speed up bootstrap in tests. The changed sleeps are: - the pause duration in group0 discovery, - the retry period in `wait_for_cql`. Refs: https://scylladb.atlassian.net/browse/SCYLLADB-918 No backport: performance improvements mostly relevant to tests. Closes scylladb/scylladb#29020 * github.com:scylladb/scylladb: test: pylib: util: wait for CQL being ready with a shorter period group0: discovery: shorten the pause duration	2026-03-12 21:17:05 +02:00
Wojciech Mitros	32974770b0	test: add tests for CQL forwarding Add basic cluster tests for CQL forwarding. The test cases include: - basic reads and writes - prepared statements with binds - forwarding from a non-replica - exception passthrough during forwarding (using an injection) - re-preparing a statement on the target node, even if the user query is also an EXECUTE request on a prepared statement - verification of metric updates The existing test_basic_write_read was modified so that a few extra cases could be validated on the same cluster.	2026-03-12 19:43:35 +01:00
Wojciech Mitros	916a9995c1	transport: enable CQL forwarding for strong consistency statements We enable CQL forwarding by starting to return the bounce_to_node result message in redirect_statement() instead of throwing. The forwarding code introduced in the preceding patches reacts to these messages, allowing the requests to be forwarded. With the update, some tests assuming that requests can't be forwarded need to be adjusted, so we do that as well.	2026-03-12 19:43:35 +01:00
Wojciech Mitros	21a7b036a5	transport: add remote statement preparation for CQL forwarding During forwarding of CQL EXECUTE requests, the target node may not have the prepared statement in its cache. If we do have this statement as a coordinator, instead of returning PREPARED NOT FOUND to the client, we want to prepare the statement ourselves on the target node. For that, we add a new FORWARD_CQL_PREPARE RPC. We use the new RPC after getting the prepared_not_found status during forwarding. When we try to forward a request, we always have the query string (we decide whether to forward based on this query), so we can always use the new RPC when getting the prepared_not_found status. After receiving the response, we try forwarding the EXECUTE request again.	2026-03-12 19:43:35 +01:00
Wojciech Mitros	96a5e1c7ce	transport: handle redirect responses in CQL forwarding During CQL forwarding, when the target node can't handle the request, it will find another node which can execute the request or which knows where the request can be executed. We return this information in responses to CQL forwarding, and in this patch, we add handling of this kind of a response. After getting a redirect response, we retry forwarding to the returned host/shard until success or timeout. This can happen many times during a single request, when we first forward to a replica and later to the coordinator, or when a replica/coordinator migrated while we were performing the forwarding	2026-03-12 19:43:31 +01:00
Wojciech Mitros	8816d3038c	transport: add exception handling for forwarded CQL requests When a forwarded request fails on the remote node, we can't use the exception handling that happens in process_request_one because we don't go through this code path. Instead, we use the previously extracted cql_server::handle_exception handler, which performs all accounting on the forwarded-to node, and which prepares the response. For the read_failure_exception_with_timeout exception, we need to perform the sleep on the source node, so we return the timeout in the forwarding response and use it on the source node to know how long to sleep without any extra calculations. The handle_forward_execute() method is extracted from the inline handler lambda to make the error catching wrapper cleaner.	2026-03-12 19:41:37 +01:00
Wojciech Mitros	23bff5dfef	transport: add basic CQL request forwarding Add the infrastructure for forwarding CQL requests to other nodes. When a process() call results in a node bounce (as opposed to a shard bounce), the coordinator serializes the request and sends it via the FORWARD_CQL_EXECUTE RPC verb to the target node. In this patch we omit several features that allow handling more scenarios that can happen when trying to forward a CQL request, but the RPC request and response are already prepared for them. They will be handled in the following commits.	2026-03-12 19:41:35 +01:00
Avi Kivity	76b6784c1a	Merge 'cql3: track CQL parsing memory cost and use it for admission control' from Marcin Maliszkiewicz Use rolling_max_tracker to record gross bytes allocated during each CQL parse. The rolling maximum is then added to the memory estimate for incoming QUERY and PREPARE requests so that the admission control in the CQL transport layer accounts for parsing overhead. The measured memory footprint serves as an upper bound rather than an exact number, but its purpose is to prevent OOMs under a load heavy in unprepared statements. In a benchmark, a node with 1G of memory shows a decrease in non-LSA memory usage from a peak of 320MB (our coordinator budget is 10% of 1G) to 96MB, while tps drops from 1.2 kops to 0.8 kops. The drop in tps is expected as memory admission kicks in trying to prevent OOM. This is phase 1 of OOM prevention, potential next steps: - add second admission in query_processor::get_statement trying to prevent potential thundering herd problem - decrease cql_server memory pool size - count reads in the memory pool - add per service level memory pool and a shared one Related https://scylladb.atlassian.net/browse/SCYLLADB-740 Fixes https://scylladb.atlassian.net/browse/SCYLLADB-938 Backport: no, new feature, but we may reconsider if some customer needs it Closes scylladb/scylladb#28919 * github.com:scylladb/scylladb: cql3: track CQL parsing memory cost and use it for admission control utils: add rolling max tracker	2026-03-12 19:59:52 +02:00
Wojciech Mitros	170b82ddca	idl: add a representation of client_state for forwarding In the following patches, when we start allowing to forward CQL requests to other nodes, we'll need to use the same client state for executing the request on the destination node as we had on the source. client_state contains many fields and we need to create a new instance of it when we start handling the forwarded request, so to prepare for the forwarding RPC, we add a serializable format of the client_state as an IDL struct. The new class is missing some fields that are not used while executing requests, and some whose value is determined by the fact that the client state is used for a forwarded request. These include: - driver name, driver version, client options - not used for executing requests. Instead, we use these as data sources for the virtual "clients" system table. - auth_state - must be READY - we reached a bounce message, so we were able to try executing the request locally - _control_connection - used for altering a cql_server::connection, which we don't have on the target node - _default_timeout_config - used when updating service levels, also only per-connection - workload_type - used for deciding whether to allow shedding at the start of processing the request, and for getting per-connection service level params (for an API)	2026-03-12 17:48:58 +01:00
Wojciech Mitros	b4a7fefe20	cql_server: handle query, execute, batch in one case Currently we perform the same steps when handling query, execute and batch CQL requests. So instead of creating multiple functions performing these steps, we can handle them all in one fallthrough case in cql_server::connection::process_request_one.	2026-03-12 17:48:58 +01:00
Wojciech Mitros	dadb87047c	transport: inline process_on_shard in cql_server::process The process_on_shard method is relatively short, it's only used in the process() method, and the Process concept that it uses is as long as the function itself. This area will be made more complex by the following patches for CQL forwarding, so we simplify it by inlining process_on_shard in cql_server::process.	2026-03-12 17:48:58 +01:00
Wojciech Mitros	24cdc3a10d	transport: extract process() to cql_server Move process() and process_on_shard() from cql_server::connection to cql_server. The process() method is no longer a template - instead, it takes an opcode parameter and uses get_process_fn_for_opcode() to select the appropriate internal processing function. The process_query, process_execute, and process_batch wrappers on connection now delegate to _server.process() with the appropriate opcode. This refactoring is preparation for CQL request forwarding, where process() will need to be called from a context other than connection (the forwarding RPC handler).	2026-03-12 17:48:57 +01:00
Wojciech Mitros	0e3469e89c	transport: add messaging_service to cql_server The messaging service will be used by cql_server to register RPC handlers for forwarding CQL requests between nodes. We pass it through the controller to cql_server.	2026-03-12 17:48:57 +01:00
Wojciech Mitros	1376caf980	transport: add response reconstruction helpers for forwarding Expose response::flags() and response::extract_body(), and a new constructor. It will be needed for creating a cql_transport::response from the response body returned during CQL forwarding.	2026-03-12 17:48:57 +01:00
Wojciech Mitros	e44820ba1f	transport: generalize the bounce result message for bouncing to other nodes In the following patches, we'll start allowing forwarding requests to strongly consistent tables so that they'll get executed on the suitable tablet Raft group members. For that we'll reuse the approach that we already have for bouncing requests to other shards - we'll try to execute a request locally, and the result of that will be a bounce message with another replica as the target. In this patch we generalize the former bounce_to_shard result message so that it will be able to specify the target of the bounce as another shard or a specific replica. We also rename it to result_message::bounce so that it stops implying that only another shard may be its target. Aside from the host_id and the shard, the new message also includes the timeout, because the service handling the forwarding won't have access to it, and it's needed for specifying how long we should wait for the forwarded requests. It also includes information on whether this is a write request, so that the correct timeout response is returned in case the deadline is exceeded. We will return other hosts in the new bounce message when executing requests to strongly consistent tables when we can't handle the request because we aren't a suitable replica. We can't handle this message yet, so we don't return it anywhere and we still assume that every bounce message is a bounce to the same host.	2026-03-12 17:48:57 +01:00
Wojciech Mitros	b4d66fda2e	strong consistency: redirect requests to live replicas from the same rack Forwarding CQL requests is not implemented yet, but we're already prepared to return the target to forward to when trying to execute strongly consistent requests. Currently, if we're not a replica of the affected tablet, we redirect the request to the first replica in the list. This is not optimal, because this replica may be down or it may be in another rack, making us perform cross-rack requests during forwarding. Instead, we should forward the request to the replica from the same rack and handle the case where the replica is down. In this patch we change the replica selection for forwarding strongly consistent requests, so that when the coordinator isn't a replica, it redirects the request to the replica from the same rack. If the replica from the same rack is down, or there is no replica in our rack, we choose the next closest replica (preferring same-DC replicas over other DCs). If no replica is alive, the query fails - the driver should retry when some replica comes back up.	2026-03-12 17:48:54 +01:00
Alex	7fd39ba586	test/cluster: strengthen raft voters multi-DC test and tune debug runtime The test_raft_voters_multidc_kill_dc scenario had become weaker after group0 voter count was made always odd. In particular, the old num_nodes == 1 case (dc1=2, dc2=1, dc3=1) could pass even without the intended balancing logic, because with 3 voters total we naturally get one voter per DC. This change restores coverage of the original intent: - Replace num_nodes parametrization with explicit DC triples. - Use (3, 1, 1) to force a meaningful asymmetric topology where voter placement logic is required. - Keep a larger topology case (6, 3, 3) for broader coverage. - Mark (6, 3, 3) as skip_mode(debug) with reason: larger topology case is too slow in debug on minipcs. Also updated comments/docstring to match the new setup. Fixes: SCYLLADB-794 backport: None, it is done to deflake minipcs that will start working only on master Closes scylladb/scylladb#29000	2026-03-12 17:07:45 +01:00
Wojciech Mitros	309abc44d9	transport: pass foreign_ptr into sleep_until_timeout_passes and move it to cql_server Change sleep_until_timeout_passes() to accept a foreign_ptr<std::unique_ptr<response>>. We can easily create the foreign_ptr for the responses created in the CQL server, but we'll need this when we get responses when forwarding CQL statements - the responses may come from other shards. We also move it from cql_server::connection to cql_server, because for forwarded CQL requests, we'll need to handle it at the cql_server level. The method also loses its const qualifier - the abort_source that we pass into sleep_abortable needs to be non-const. Apparently, we could still use it in a const method of cql_server::connection because we passed it as _server._abort_source which caused the const qualifier to be lost.	2026-03-12 16:03:14 +01:00
Marcin Maliszkiewicz	975cd60e05	ldap: fix use-after-move crash in ldap_reuser::reap() After stop() moved _reaper, in-flight with_connection() callbacks could still call reap(), which accessed the moved-from future causing a SIGSEGV in future_base::detach_promise(). Add a seastar::gate so stop() waits for all in-flight operations before moving _reaper. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1043 Closes scylladb/scylladb#29015	2026-03-12 16:48:45 +02:00
Patryk Jędrzejczak	c50cf32793	test: pylib: util: wait for CQL being ready with a shorter period `wait_for_cql` is used in hundreds, if not thousands, of places in tests. We shouldn't waste up to 1s for every call. Also, the 1s period is clearly too long compared to the bootstrap time, which is usually 0-3s in dev mode. The following test speeds up from 50s to 42s with the change: ``` for _ in range(10): servers = await manager.servers_add(3) await manager.get_ready_cql(servers) ```	2026-03-12 15:40:19 +01:00
Patryk Jędrzejczak	f85628a9a0	group0: discovery: shorten the pause duration Nodes currently pause group0 discovery for 1s. This case is always hit while adding multiple nodes in parallel to an empty cluster by all nodes except the one that becomes the group0 leader. This is fine in production, but in tests, the slowdown is quite significant. Every `manager.servers_add(n)` call for n > 1 becomes 1s slower when the cluster is empty. Many cluster tests are affected. In this commit, we decrease the sleep duration from 1s to 100ms to speed up tests. The consequence of this change is that nodes might perform more steps in group0 discovery, but the increase in CPU usage and network traffic should be negligible.	2026-03-12 15:40:18 +01:00
Gleb Natapov	c67f876893	service level: make maybe_update_per_service_level_params synchronous It does not call async functions any more.	2026-03-12 15:53:08 +02:00
Benny Halevy	b3fec20960	test_tablets_migration: test_staging_backlog_is_preserved_with_file_based_streaming: convert for loop to asyncio.gather Currently the test iterates on all servers and calls manager.api.disable_injection but it doesn't await those calls. Use asyncio.gather to await all calls in parallel. Co-authored-by: Copilot CLI Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-03-12 15:26:40 +02:00
Benny Halevy	61d5a2df02	test_tablets_migration: test_tablet_back_and_forth_migration: await move_tablet Co-authored-by: Copilot CLI Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-03-12 15:26:40 +02:00
Benny Halevy	b8655748a2	test_tablets_migration: test_restart_in_cleanup_stage_after_cleanup: await move_task Co-authored-by: Copilot CLI Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-03-12 15:26:40 +02:00
Benny Halevy	10dccc2c4e	test_tablets_migration: test_restart_leaving_replica_during_cleanup: await move_task Co-authored-by: Copilot CLI Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-03-12 15:26:40 +02:00
Benny Halevy	c9d653fb1e	test_tablets_migration: drop unused imports from cassandra.query Co-authored-by: Copilot CLI Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-03-12 15:26:40 +02:00
Gleb Natapov	c30907b8f2	service level: remove unused get_user_scheduling_group function	2026-03-12 14:28:26 +02:00
Gleb Natapov	a934d8391d	service level: drop async find_effective_service_level find_cached_effective_service_level does exactly same thing now and it is synchronous.	2026-03-12 14:28:26 +02:00
Botond Dénes	15cfa5beeb	mutation/collection_mutation: don't copy the serialized collection serialize_collection_mutation() copies the serialized collection into the returned collection_mutation object. Change to move to avoid the copy. Fixes: SCYLLADB-1041 Closes scylladb/scylladb#29010	2026-03-12 13:57:40 +02:00
Gleb Natapov	f888f2dced	service level: remove remnants of version 1 service level can_use_effective_service_level_cache() always returns true now, so the function can be dropped entirely and all the code that assumes it may return false can be dropped as well.	2026-03-12 12:27:52 +02:00
Nadav Har'El	27f0510280	test/alternator: test_gzip_request_oversized now passes on AWS The Alternator test test_compressed_request.py::test_gzip_request_oversized checks that a very large request that compresses to a small size is still rejected. This test passed on Alternator, but used to fail on DynamoDB because DynamoDB didn't reject this case. This was a bug in DynamoDB (a "decompression bomb" vulnerability), and after I reported it, it was fixed. So now this test does pass on DynamoDB (after a small modification to allow for different error codes). So remove its scylla_only marker, and make the comment true to the current state. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#28820	2026-03-12 10:41:56 +01:00
Marcin Maliszkiewicz	b277d9d9aa	cql3: track CQL parsing memory cost and use it for admission control Use rolling_max_tracker to record gross bytes allocated during each CQL parse. The rolling maximum is then added to the memory estimate for incoming QUERY and PREPARE requests so that the admission control in the CQL transport layer accounts for parsing overhead. The measured memory footprint serves as an upper bound rather than an exact number, but its purpose is to prevent OOMs under a load heavy in unprepared statements. In a benchmark, a node with 1G of memory shows a decrease of non-LSA memory usage from a peak of 320MB (our coordinator budget is 10% of 1G) to 96MB, while tps drops from 1.2 kops to 0.8 kops. The drop in tps is expected as memory admission kicks in trying to prevent OOM.	2026-03-12 10:16:10 +01:00
Botond Dénes	0b19a6de85	tombstone_gc: tombstone_gc_state::for_tests(): remove unused param Closes scylladb/scylladb#28923	2026-03-12 10:01:42 +01:00
Marcin Maliszkiewicz	2d22eea2f9	Merge 'cql3: Replace SCYLLA_ASSERT and abort by throwing_assert' from Nadav Har'El In this patch we replace every single use of SCYLLA_ASSERT(), abort() and assert() in the cql3/ directory by throwing_assert(). The problem with SCYLLA_ASSERT()/abort()/assert() is that when it fails, it crashes Scylla. This is almost always a bad idea (see #7871 discussing why), but it's even riskier in front-end code like cql3/: In front-end code, there is a risk that due to a bug in our code, a specific user request can cause Scylla to crash. A malicious user can send this query to all nodes and crash the entire cluster. When the user is not malicious, it causes a small problem (a failing request) to become a much worse crash - and worse, the user has no idea which request is causing this crash and the crash will repeat if the same request is tried again. All of this is solved by using the new throwing_assert(), which is the same as SCYLLA_ASSERT() but throws an exception (using on_internal_error()) instead of crashing. The exception will prevent the code path with the invalid assumption from continuing, but will result in only the current user request being aborted, with a clear error message reporting the internal server error due to an assertion failure. I reviewed all the changes that I did in these patches to check that (to the best of my understanding) none of the assertions in cql3/ involve the sort of serious corruption that might require crashing the Scylla node entirely. throwing_assert() also improves logging of assertion failures compared to the original SCYLLA_ASSERT()/abort() - SCYLLA_ASSERT() printed a message to stderr which in many installations is lost, and abort() often prints no message at all. But throwing_assert() uses Scylla's standard logger, and also includes a backtrace in the log message. 
Fixes #13970 (Exorcise assertions from CQL code paths) Refs #7871 (Exorcise assertions from Scylla) Closes scylladb/scylladb#28847 * github.com:scylladb/scylladb: cql3: remove unnecessary assert() cql3: replace abort() by throwing_assert() cql3: Replace SCYLLA_ASSERT by throwing_assert	2026-03-12 09:09:24 +01:00
Szymon Malewski	3116db6c2d	test: fix `testJsonOrdering` The `test/cqlpy/cassandra_tests/validation/entities/json_test.py::testJsonOrdering` was failing because of differences between Cassandra and Scylla in printing JSON floating point values - e.g. Cassandra prints 30.0, where Scylla prints 30. Both are valid, so in this patch, instead of comparing strings, we compare parsed JSON using `EquivalentJson`. Fixes #28467 Closes scylladb/scylladb#28924	2026-03-12 09:07:08 +01:00
Marcin Maliszkiewicz	5b2a07b408	utils: add rolling max tracker We will use it later to track parser memory usage via per query samples. Tests runtime in dev: 1.6s	2026-03-12 08:56:41 +01:00
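
Commit b3fec20960 above replaces a loop of un-awaited calls with asyncio.gather. The following is a minimal sketch of that pattern, not the actual test code: `disable_injection` here is a hypothetical stand-in for `manager.api.disable_injection`, whose real signature is not shown in this log, and the server and injection names are made up.

```python
import asyncio

# Hypothetical stand-in for manager.api.disable_injection (assumption:
# the real call is a coroutine; its signature is not shown in the log).
async def disable_injection(server: str, injection: str, done: list) -> None:
    await asyncio.sleep(0)  # simulate an asynchronous API round trip
    done.append((server, injection))

async def disable_on_all(servers: list, injection: str) -> list:
    done = []
    # Calling disable_injection(...) in a plain for loop without await
    # only creates coroutine objects that never run. gather schedules
    # all of them concurrently and waits until every one has finished.
    await asyncio.gather(
        *(disable_injection(s, injection, done) for s in servers)
    )
    return done

result = asyncio.run(disable_on_all(["s1", "s2", "s3"], "some_injection"))
```

After the gather call returns, every server has had the call applied, which is the guarantee the un-awaited loop did not give.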

52583 Commits