scylladb

Author	SHA1	Message	Date
Pavel Emelyanov	937d008d3c	Merge 'Clean up partition_snapshot_reader' from Botond Dénes Move to `replica/`, drop `flat` from name and drop unused usages as well as unused includes. Code cleanup, no backport Closes scylladb/scylladb#28353 * github.com:scylladb/scylladb: replica/partition_snapshot_reader: remove unused includes partition_snapshot_reader: remove "flat" from name mv partition_snapshot_reader.hh -> replica/	2026-01-29 11:22:15 +03:00
Botond Dénes	f6d7f606aa	memtable_test: disable flushing_rate_is_reduced_if_compaction_doesnt_keep_up for debug This test case was observed to take over 2 minutes to run on CI machines, contributing to already bloated CI run times. Disable this test in debug mode. This test checks for memtable flush being slowed down when compaction can't keep up. So this test needs to overwhelm the CPU by definition. On the other hand, this is not a correctness test, there are such tests for the memtable and compaction already, so it is not critical to run this in debug mode, it is not expected to catch any use-after-free and such. Closes scylladb/scylladb#28407	2026-01-29 11:13:22 +03:00
Botond Dénes	482ffe06fd	Merge 'Improve load shedding on the replica side' from Łukasz Paszkowski When reads arrive, they have to wait for admission on the reader concurrency semaphore. If the node is overloaded, the reads will be queued. They can time out while in the queue, but will not time out once admitted. Once the shard is sufficiently loaded, it is possible that most queued reads will time out, because the average time it takes for a queued read to be admitted is around that of the timeout. If a read times out, any work we already did, or are about to do on it is wasted effort. Therefore, the patch tries to prevent it by checking if an admitted read has a chance to complete in time and abort it if not. It uses the following criterion: if the read's remaining time <= the read's timeout when it arrived at the semaphore * the live-updatable preemptive_abort_factor, the read is rejected and the next one from the wait list is considered. Fixes https://github.com/scylladb/scylladb/issues/14909 Fixes: SCYLLADB-353 Backport is not needed. Better to first observe its impact. Closes scylladb/scylladb#21649 * github.com:scylladb/scylladb: reader_concurrency_semaphore: Check during admission if read may timeout permit_reader::impl: Replace break with return after evicting inactive permit on timeout reader_concurrency_semaphore: Add preemptive_abort_factor to constructors config: Add parameters to control reads' preemptive_abort_factor permit_reader: Add a new state: preemptive_aborted reader_concurrency_semaphore: validate waiters counter when dequeueing a waiting permit reader_concurrency_semaphore: Remove cpu_concurrency's default value	2026-01-29 08:27:22 +02:00
Łukasz Paszkowski	7e1bbbd937	reader_concurrency_semaphore: Check during admission if read may timeout When a shard on a replica is overloaded, it breaks down completely, throughput collapses, latencies go through the roof and the node/shard can even become completely unresponsive to new connection attempts. When reads arrive, they have to wait for admission on the reader concurrency semaphore. If the node is overloaded, the reads will be queued and thus they can time out while being in the queue or during the execution. In the latter case, the timeout does not always result in the read being aborted. Once the shard is sufficiently loaded, it is possible that most queued reads will time out, because the average time it takes for a queued read to be admitted is around that of the timeout. If a read times out, any work we already did, or are about to do on it is wasted effort. Therefore, the patch tries to prevent it by checking if an admitted read has a chance to complete in time and abort it if not. It uses the following criterion: if the read's remaining time <= the read's timeout when it arrived at the semaphore * the preemptive factor, the read is rejected and the next one from the wait list is considered.	2026-01-28 14:24:45 +01:00
Łukasz Paszkowski	fde09fd136	reader_concurrency_semaphore: Add preemptive_abort_factor to constructors The new parameter parametrizes the factor used to reject a read during admission. Its value shall be between 0.0 and 1.0 where + 0.0 means a read will never get rejected during admission + 1.0 means a read will immediately get rejected during admission Although passing values outside the interval is possible, they will have the exact same effect as if they were clamped to [0.0, 1.0].	2026-01-28 14:20:01 +01:00
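The admission criterion and the factor semantics described in the two commits above can be sketched as follows. This is a simplified Python model, not the actual C++ `reader_concurrency_semaphore` code; the names `QueuedRead`, `clamp_factor`, `should_preemptively_abort` and `admit_next` are hypothetical and exist only for illustration.

```python
from collections import deque
from dataclasses import dataclass


def clamp_factor(factor: float) -> float:
    """Clamp the factor to [0.0, 1.0]; values outside behave like the bounds."""
    return max(0.0, min(1.0, factor))


@dataclass
class QueuedRead:
    timeout: float    # time budget the read had when it reached the semaphore (seconds)
    remaining: float  # time left when the read is considered for admission (seconds)


def should_preemptively_abort(read: QueuedRead, factor: float) -> bool:
    """Reject when remaining time <= original timeout * factor."""
    return read.remaining <= read.timeout * clamp_factor(factor)


def admit_next(wait_list: deque, factor: float):
    """Pop reads until one passes the check; return (admitted, aborted)."""
    aborted = []
    while wait_list:
        read = wait_list.popleft()
        if should_preemptively_abort(read, factor):
            aborted.append(read)  # rejected, consider the next waiter
            continue
        return read, aborted
    return None, aborted
```

With a factor of 0.0 only reads with no time left satisfy the check, so in practice nothing is rejected; with 1.0 every read satisfies it, because the remaining time never exceeds the original timeout.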
Avi Kivity	47315c63dc	treewide: include Seastar headers with angle brackets Seastar is a "system" library from our point of view, so should be included with angle brackets. Closes scylladb/scylladb#28395	2026-01-28 10:33:06 +02:00
Pavel Emelyanov	834921251b	test: Replace memory_data_source with seastar::util::as_input_stream The existing test-only implementation is a simplified version of the generic one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28339	2026-01-28 10:15:03 +02:00
Pavel Emelyanov	02af292869	Merge 'Introduce TTL and retries to address resolution' from Ernest Zaslavsky In production environments, we observed cases where the S3 client would repeatedly fail to connect due to DNS entries becoming stale. Because the existing logic only attempted the first resolved address and lacked a way to refresh DNS state, the client could get stuck in a failure loop. Introduce RR TTL and connection failure retry to - re-resolve the RR in a timely manner - forcefully reset and re-resolve addresses - add a special case when the TTL is 0 and the record must be resolved for every request Fixes: CUSTOMER-96 Fixes: CUSTOMER-139 Should be backported to 2025.3/4 and 2026.1 since we already encountered it in the production clusters for 2025.3 Closes scylladb/scylladb#27891 * github.com:scylladb/scylladb: connection_factory: includes cleanup dns_connection_factory: refine the move constructor connection_factory: retry on failure connection_factory: introduce TTL timer connection_factory: get rid of shared_future in dns_connection_factory connection_factory: extract connection logic into a member connection_factory: remove unnecessary `else` connection_factory: use all resolved DNS addresses s3_test: remove client double-close	2026-01-27 18:45:43 +03:00
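The TTL and retry behaviour described in this merge can be sketched as follows. This is a minimal Python model under stated assumptions, not the C++ `dns_connection_factory`: `TtlResolver`, `connect_with_retry`, and the injected `resolve`, `connect` and `clock` callables are all hypothetical names.

```python
import time
from typing import Callable, List, Optional, Tuple


class TtlResolver:
    """Caches resolved addresses until the record's TTL expires."""

    def __init__(self, host: str,
                 resolve: Callable[[str], Tuple[List[str], float]],
                 clock: Callable[[], float] = time.monotonic):
        self._host = host
        self._resolve = resolve      # returns (addresses, ttl_seconds)
        self._clock = clock
        self._addresses: List[str] = []
        self._expires_at: Optional[float] = None

    def addresses(self) -> List[str]:
        now = self._clock()
        # With ttl == 0 the entry is already expired on the next call,
        # so the record is resolved again for every request.
        if self._expires_at is None or now >= self._expires_at:
            self._addresses, ttl = self._resolve(self._host)
            self._expires_at = now + ttl
        return list(self._addresses)

    def invalidate(self) -> None:
        """Forcefully drop the cached addresses."""
        self._expires_at = None


def connect_with_retry(resolver: TtlResolver,
                       connect: Callable[[str], object],
                       max_attempts: int = 2):
    """Try every resolved address; on total failure re-resolve and retry."""
    last_error = None
    for _ in range(max_attempts):
        for address in resolver.addresses():
            try:
                return connect(address)
            except OSError as e:
                last_error = e
        resolver.invalidate()
    raise ConnectionError("all resolved addresses failed") from last_error
```

Injecting the resolver and the clock keeps the sketch testable without real DNS traffic.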
Łukasz Paszkowski	8829098e90	reader_concurrency_semaphore: Remove cpu_concurrency's default value The commit `59faa6d` introduces a new parameter called cpu_concurrency and sets its default value to 1, which violates the commit `fbb83dd` that removes all default values from constructors but the one used by the unit tests. The patch removes the default value of the cpu_concurrency parameter and alters tests to use the test-dedicated reader_concurrency_semaphore constructor wherever possible.	2026-01-27 15:40:11 +01:00
Gleb Natapov	9daa109d2c	test: get rid of consistent_cluster_management usage in test consistent_cluster_management is deprecated since scylla-5.2 and no longer used by ScyllaDB, so it should not be used by tests either. Closes scylladb/scylladb#28340	2026-01-27 11:31:30 +01:00
Avi Kivity	f1c6094150	Merge 'Remove buffer_input_stream and limiting_input_stream from core code' from Pavel Emelyanov These two streams mostly play together. The former provides an input_stream for reading from in-memory temporary buffers, the latter wraps it to limit the size of provided temporary buffers. Both are used to test the contiguous data consumer; the buffer_input_stream also has a caller in the sstables reversing reader. This PR removes the buffer_input_stream in favor of seastar memory_data_source, and moves the limiting_input_stream into test/lib. Enhancing testing code, not backporting Closes scylladb/scylladb#28352 * github.com:scylladb/scylladb: code: Move limiting data source to test/lib util: Simplify limiting_data_source API util: Remove buffer_input_stream test: Use seastar::util::temporary_buffer_data_source in data consumer test sstables: Use seastar::util::as_input_stream() in mx reader	2026-01-26 22:05:59 +02:00
Raphael S. Carvalho	0e07c6556d	test: Remove useless compaction group testing in database_test This compaction group testing is useless because the machinery for it to work was removed. This was useful in the early tablet days, where we wanted to test compaction groups directly. Today groups are stressed and tested on every tablet test. I see a ~40% reduction in run time after this patch, since database_test is one of the most (if not the most) time-consuming tests in the boost suite. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#28324	2026-01-26 19:16:27 +02:00
Botond Dénes	9d1933492a	mv partition_snapshot_reader.hh -> replica/ The partition snapshot lives in mutation/, however mutation/ is a lower level concept than a mutation reader. The next best place for this reader is the replica/ directory, where the memtable, its main user, also lives. Also move the code to the replica namespace. test/boost/mvcc_test.cc includes this header but doesn't use anything from it. Instead of updating the include path, just drop the unused include.	2026-01-26 16:52:08 +02:00
Pavel Emelyanov	77435206b9	code: Move limiting data source to test/lib Only two tests use it now -- the limit-data-source-test itself and a test that validates the continuous_data_consumer template. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-26 12:49:42 +03:00
Pavel Emelyanov	111b376d0d	util: Simplify limiting_data_source API The source maintains "limit generator" -- a function that returns the maximum size of bytes to return from the next buffer. Currently all callers just return constant numbers from it. Passing a function that returns non-constant one can, probably, be used for a fuzzy test, but even the limiting-data-source-test itself doesn't do it, so what's the point... Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-26 12:46:37 +03:00
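The simplification described above, replacing the "limit generator" with a constant limit, can be illustrated with a small Python sketch. It is only an analogy for the C++ `limiting_data_source`; the function name `limit_buffers` is hypothetical.

```python
from typing import Iterable, Iterator


def limit_buffers(buffers: Iterable[bytes], limit: int) -> Iterator[bytes]:
    """Re-chunk incoming buffers so that no yielded buffer exceeds `limit` bytes.

    A constant limit replaces a per-call limit generator, since the commit
    message notes that all existing callers return a constant anyway.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    for buf in buffers:
        for start in range(0, len(buf), limit):
            yield buf[start:start + limit]
```

For example, `list(limit_buffers([b"abcdefg", b"hi"], 3))` yields `[b"abc", b"def", b"g", b"hi"]`: buffers are split but never merged across boundaries.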
Pavel Emelyanov	4639681907	test: Use seastar::util::temporary_buffer_data_source in data consumer test The test creates buffer_data_source_impl and wraps it with limiting data source. The former data_source duplicates the functionality of the existing seastar temporary_buffer_data_source. This patch makes the test code use seastar facility. The buffer_data_source_impl will be removed soon. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-26 12:44:33 +03:00
Ernest Zaslavsky	cb2aa85cf5	aws_error: handle all restartable nested exception types Previously we only inspected std::system_error inside std::nested_exception to support a specific TLS-related failure mode. However, nested exceptions may contain any type, including other restartable (retryable) errors. This change unwraps one nested exception per iteration and re-applies all known handlers until a match is found or the chain is exhausted. Closes scylladb/scylladb#28240	2026-01-26 10:19:57 +03:00
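The unwrapping loop described above can be sketched in Python, where the `__cause__` chain plays the role that `std::nested_exception` plays in C++. This is an analogy only; `is_retryable` and the two example handlers are hypothetical names, not part of the `aws_error` code.

```python
from typing import Callable, Iterable, Optional

# A handler returns True/False when it recognises the exception, else None.
Handler = Callable[[BaseException], Optional[bool]]


def timeout_handler(exc: BaseException) -> Optional[bool]:
    return True if isinstance(exc, TimeoutError) else None


def connection_reset_handler(exc: BaseException) -> Optional[bool]:
    return True if isinstance(exc, ConnectionResetError) else None


def is_retryable(exc: BaseException, handlers: Iterable[Handler]) -> bool:
    """Unwrap one nesting level per iteration, re-applying all handlers."""
    handlers = list(handlers)
    current: Optional[BaseException] = exc
    while current is not None:
        for handler in handlers:
            verdict = handler(current)
            if verdict is not None:
                return verdict       # a handler matched at this level
        current = current.__cause__  # descend into the nested exception
    return False                     # chain exhausted without a match
```

The key point is that every known handler is tried at every nesting level, rather than inspecting only one exception type inside the wrapper.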
Ernest Zaslavsky	bd9d5ad75b	s3_test: remove client double-close `test_chunked_download_data_source_with_delays` was calling `close()` on a client twice, remove the unnecessary call	2026-01-25 15:42:48 +02:00
Pavel Emelyanov	cb6ee05391	Merge 'Extend snapshot manifest.json with tablet-aware metadata' from Benny Halevy This series extends the json manifest file we create when taking snapshots. It adds the following metadata: - manifest version and scope - snapshot name - created_at and expires_at timestamps (#24061) - node metadata (host_id, dc, rack) - keyspace and table metadata - tablet_count (#26352) - per-sstable metadata (#26352) Fixes [SCYLLADB-189](https://scylladb.atlassian.net/browse/SCYLLADB-189) Fixes [SCYLLADB-195](https://scylladb.atlassian.net/browse/SCYLLADB-195) Fixes [SCYLLADB-196](https://scylladb.atlassian.net/browse/SCYLLADB-196) * Enhancement, no backport needed [SCYLLADB-189]: https://scylladb.atlassian.net/browse/SCYLLADB-189?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [SCYLLADB-195]: https://scylladb.atlassian.net/browse/SCYLLADB-195?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [SCYLLADB-196]: https://scylladb.atlassian.net/browse/SCYLLADB-196?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#27945 * github.com:scylladb/scylladb: snapshot: keep per-sstable metadata in manifest.json snapshot: add table info and tablet_count to manifest.json snapshot: add basic support for snapshot ttl in manifest.json table: snapshot_on_all_shards: take snapshot_options db: snapshot_ctl: move skip_flush to struct snapshot_options snapshot: add snapshot name in manifest.json test: lib: cql_test_env: apply db::config::tablets_mode_for_new_keyspaces snapshot: add node info to manifest.json snapshot: add manifest info to manifest.json test: database_test: snapshot_works: add validate_manifest	2026-01-22 15:19:11 +03:00
Patryk Jędrzejczak	67045b5f17	Merge 'raft_topology, tablets: Drain tablets in parallel with other topology operations' from Tomasz Grabiec Allows other topology operations to execute while tablets are being drained on decommission. In particular, bootstrap on scale-out. This is important for elasticity. Allows multiple decommission/removenode to happen in parallel, which is important for efficiency. Flow of decommission/removenode request: 1) pending and paused, has tablet replicas on target node. Tablet scheduler will start draining tablets. 2) No tablets on target node, request is pending but not paused 3) Request is scheduled, node is in transition 4) Request is done Nodes are considered draining as soon as there is a leave or remove request on them. If there are tablet replicas present on the target node, the request is in a paused state and will not be picked by topology coordinator. The paused state is computed from topology state automatically on reload. When request is not paused, its execution starts in write_both_read_old state. The old tablet_draining state is not entered (it's deprecated now). Tablet load balancing will yield the state machine as soon as some request is no longer paused and ready to be scheduled, based on standard preemption mechanics. 
Fixes #21452 Closes scylladb/scylladb#24129 * https://github.com/scylladb/scylladb: docs: Document parallel decommission and removenode and relevant task API test: Add tests for parallel decommission/removenode test: util: Introduce ensure_group0_leader_on() test: tablets: Check that there are no migrations scheduled on draining nodes test: lib: topology_builder: Introduce add_draining_request() topology_coordinator, tablets: Fail draining operations when tablet migration fails due to critical disk utilization tablets: topology_coordinator: Refactor to propagate reason for migration rollback tablet_allocator: Skip co-location on draining nodes node_ops: task_manager_module: Populate entity field also for active requests tasks: node_ops: Put node id in the entity field tasks, node_ops: Unify setting of task_stats in get_status() and get_stats() topology: Protect against empty cancelation reason tasks, topology: Make pending node operations abortable doc: topology-over-raft.md: Fix diagram for replacing, tablet_draining is not engaged raft_topology, tablets: Drain tablets in parallel with other topology operations virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node locator: topology: Add "draining" flag to a node topology_coordinator: Extract generate_cancel_request_update() storage_service: Drop dependency in topology_state_machine.hh in the header locator: Extract common code in assert_rf_rack_valid_keyspace() topology_coordinator, storage_service: Validate node removal/decommission at request submission time	2026-01-22 13:06:53 +01:00
Benny Halevy	d6557764b9	snapshot: keep per-sstable metadata in manifest.json Adds a "sstables" array member to manifest.json. For each sstable, keep the following metadata: id - a uuid for the sstable (the sstable identifier if the use-sstable-identifier option was used, otherwise the sstable uuid generation); toc_name - the name of the TOC.txt file; data_size and index_size - in bytes; first_token and last_token - of the sstable first and last keys. Fixes: SCYLLADB-196 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-22 09:42:52 +02:00
Benny Halevy	dc9093303d	snapshot: add table info and tablet_count to manifest.json Add a table member to manifest.json with the keyspace_name, table_name, table_id, tablets_type, and, for tablets-enabled tables, get tablet_count on each shard and write the minimum to manifest.json. For vnodes-based tables, tablet_count=0. For now, `tablets_type` may be either `none` for vnodes tables, or `powof2` for tablets tables. In the future, when we support arbitrary tablet boundaries, this will be reflected here, and it is likely we would back up the whole tablets map separately to get all tablet boundaries. Fixes SCYLLADB-195 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-22 09:36:52 +02:00
Benny Halevy	91df129e21	snapshot: add basic support for snapshot ttl in manifest.json Store the snapshot `created_at` time and an optional `expires_at` time. Fixes SCYLLADB-189 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-22 09:12:56 +02:00
Benny Halevy	49a3e0914d	db: snapshot_ctl: move skip_flush to struct snapshot_options So we can easily extend it and add more options. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-22 09:12:56 +02:00
Benny Halevy	d9fc3b1c11	snapshot: add snapshot name in manifest.json Store the snapshot tag in the manifest file. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-22 09:12:56 +02:00
Benny Halevy	0d82e56078	snapshot: add node info to manifest.json Add metadata about the node: host_id, datacenter, and rack. This enables dc- or rack- aware restore. Today this information is "encoded" into the snapshot hierarchy prefixes, but if all manifest files would be stored in a flat directory, we'd need to encode that metadata in the object name, but it'd be better for the manifest contents to be self descriptive. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-22 09:12:56 +02:00
Benny Halevy	24040efc54	snapshot: add manifest info to manifest.json Add metadata about the manifest itself: A version and the manifest scope (currently "node", but in the future, may also be "shard", or "tablet") Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-22 09:12:56 +02:00
Benny Halevy	9e0f5410ae	test: database_test: snapshot_works: add validate_manifest Validate the manifest.json format by loading it using rjson::parse and then validate its contents to ensure it lists exactly the SSTables present in the snapshot directory. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-22 09:12:56 +02:00
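The manifest fields listed in the snapshot commits above, and the check that the manifest lists exactly the SSTables in the snapshot directory, can be sketched as follows. The field names come from the commit messages, but the JSON nesting, all example values, the TOC file name, and the helper `validate_manifest` as written here are assumptions for illustration; the real test uses `rjson::parse` in C++.

```python
import json

# Hypothetical layout: only the field names are taken from the commit messages.
EXAMPLE_MANIFEST = {
    "manifest": {"version": "1.0", "scope": "node"},
    "node": {"host_id": "00000000-0000-0000-0000-000000000001",
             "datacenter": "dc1", "rack": "rack1"},
    "snapshot": {"name": "my_snapshot", "created_at": 1769068800,
                 "expires_at": None},
    "table": {"keyspace_name": "ks", "table_name": "t",
              "table_id": "00000000-0000-0000-0000-000000000002",
              "tablets_type": "powof2", "tablet_count": 8},
    "sstables": [
        {"id": "00000000-0000-0000-0000-000000000003",
         "toc_name": "me-1-big-TOC.txt",
         "data_size": 4096, "index_size": 128,
         "first_token": -100, "last_token": 100},
    ],
}


def validate_manifest(manifest_text: str, snapshot_files: list) -> bool:
    """Return True when the manifest lists exactly the TOC files present."""
    manifest = json.loads(manifest_text)
    listed = {entry["toc_name"] for entry in manifest["sstables"]}
    present = {name for name in snapshot_files if name.endswith("-TOC.txt")}
    return listed == present
```

Comparing sets in both directions catches SSTables missing from the manifest as well as manifest entries with no file on disk.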
Botond Dénes	4281d18c2e	Merge 'schema: Apply `sstable_compression_user_table_options` to CQL aux and Alternator tables' from Nikos Dragazis In PR `5b6570be52` we introduced the config option `sstable_compression_user_table_options` to allow adjusting the default compression settings for user tables. However, the new option was hooked into the CQL layer and applied only to CQL base tables, not to the whole spectrum of user tables: CQL auxiliary tables (materialized views, secondary indexes, CDC log tables), Alternator base tables, Alternator auxiliary tables (GSIs, LSIs, Streams). This gap also led to inconsistent default compression algorithms after we changed the option’s default algorithm from LZ4 to LZ4WithDicts (`adf9c426c2`). This series introduces a general “schema initializer” mechanism in `schema_builder` and uses it to apply the default compression settings uniformly across all user tables. This ensures that all base and aux tables take their default compression settings from config. Fixes #26914. Backport justification: LZ4WithDicts is the new default since 2025.4, but the config option exists since 2025.2. Based on severity, I suggest we backport only to 2025.4 to maintain consistency of the defaults. Closes scylladb/scylladb#27204 * github.com:scylladb/scylladb: db/config: Update sstable_compression_user_table_options description schema: Add initializer for compression defaults schema: Generalize static configurators into schema initializers schema: Initialize static properties eagerly db: config: Add accessor for sstable_compression_user_table_options test: Check that CQL and Alternator tables respect compression config	2026-01-22 06:50:48 +02:00
Pavel Emelyanov	18b5a49b0c	Populate all sl:* groups into dedicated top-level supergroup This patch changes the layout of user-facing scheduling groups from / `- statement `- sl:default `- sl:* `- other groups (compaction, streaming, etc.) into / `- user (supergroup) `- statement `- sl:default `- sl:* `- other groups (compaction, streaming, etc.) The new supergroup has 1000 static shares and is name-less, in the sense that it only has a variable in the code to refer to it and is not exported via metrics (should be fixed in seastar if we want to). The moved groups don't change their names or shares, they only move inside the scheduling hierarchy. The goal of the change is to improve resource consumption of sl:* groups. Right now activities in low-shares service levels are scheduled on par with e.g. streaming activity, which is considered to be a low-priority one. Moving all sl:* groups into their own supergroup with 1000 shares changes the meaning of sl:* shares. From now on these share values describe the priorities of service levels relative to each other, and the user activities compete with the rest of the system with 1000 shares, regardless of how many service levels there are. Unit tests keep their user groups under the root supergroup (for simplicity) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28235	2026-01-21 14:14:48 +02:00
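The effect of the new supergroup on share arithmetic can be illustrated with a short Python sketch. It assumes an idealized proportional-share scheduler in which every group is always runnable; the share values for `sl:low` and `streaming` are made up, and `effective_fractions` is a hypothetical helper, not a Seastar API.

```python
from typing import Dict, Optional, Tuple

# A node is (shares, children); children is None for a leaf group.
Tree = Dict[str, Tuple[int, Optional[dict]]]


def effective_fractions(groups: Tree, budget: float = 1.0) -> Dict[str, float]:
    """Split `budget` among sibling groups proportionally to their shares."""
    total_shares = sum(shares for shares, _ in groups.values())
    result: Dict[str, float] = {}
    for name, (shares, children) in groups.items():
        portion = budget * shares / total_shares
        if children:
            # A supergroup's portion is divided among its own children.
            result.update(effective_fractions(children, portion))
        else:
            result[name] = portion
    return result


# Old layout: a low-share service level competes directly with streaming.
before = {"sl:low": (100, None), "streaming": (200, None)}

# New layout: service levels live under a 1000-share supergroup.
after = {"user": (1000, {"sl:low": (100, None)}), "streaming": (200, None)}
```

Under these assumed numbers `sl:low` gets 100/300 of the CPU in the old layout and 1000/1200 in the new one, because its own shares now only rank it against other service levels.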
Botond Dénes	a53f989d2f	db/row_cache: make_nonpopulating_reader(): pass cache tracker to snapshot The API contract in partition_version.hh states that when dealing with evictable entries, a real cache tracker pointer has to be passed to all methods that ask for it. The nonpopulating reader violates this, passing a nullptr to the snapshot. This was observed to cause a crash when a concurrent cache read accessed the snapshot with the null tracker. A reproducer is included which fails before and passes after the fix. Fixes: #26847 Closes scylladb/scylladb#28163	2026-01-20 12:34:37 +01:00
Avi Kivity	36347c3ce9	Merge 'db/system_keyspace: remove namespace v3' from Botond Dénes Cassandra changed their system tables in 3.0. We migrated to the new system table layout in 2017, in ScyllaDB 2.0. System tables introduced in Cassandra 3.0, as well as the 3.0 variant of pre-existing system tables were added to the db::system_table::v3 namespace. We ended up adding some new ScyllaDB-only system tables to this namespace as well. As the dust settled, most of the v3 system tables ended up being either simple aliases to non-v3 tables, or new tables. Either way, the codebase has used just one variant of each table for a long time now, so the v3:: distinction is pointless. Remove the v3 namespace and unify the table listing under the top-level db::system_keyspace scope. Code cleanup, no backport Closes scylladb/scylladb#28146 * github.com:scylladb/scylladb: db/system_keyspace: move remining tables out of v3 keyspace db/system_keyspace: relocate truncated() and commitlog_cleanups() db/system_keyspace: drop v3::local() db/system_keyspace: remove duplicate table names from v3	2026-01-19 20:54:38 +02:00
Botond Dénes	e01041d3ee	db/system_keyspace: move remining tables out of v3 keyspace The last remaining tables in the v3 keyspace are those that are genuinely distinct -- added by Cassandra 3.0 or >= ScyllaDB 2.0. Move these out of the v3 keyspace too; with this, the v3 keyspace is defunct and removed.	2026-01-19 12:32:21 +02:00
Ernest Zaslavsky	829bd9b598	aws_error: fix nested exception handling The loop that unwraps nested exceptions rethrows the nested exception, saves a pointer to the temporary std::exception& inner on the stack, and then continues. The pointer therefore points to a released temporary. Closes scylladb/scylladb#28143	2026-01-19 11:41:47 +03:00
Botond Dénes	b7bc48e7b7	reader_concurrency_semaphore: improve handling of base resources reader_permit::release_base_resources() is a soft evict for the permit: it releases the resources acquired during admission. This is used in cases where a single process owns multiple permits, creating a risk of deadlock, as is the case for repair. In this case, release_base_resources() acts as a manual eviction mechanism to prevent permits from blocking each other's admission. Recently we found a bad interaction between release_base_resources() and permit eviction. Repair uses both mechanisms: it marks its permits as inactive and later it also uses release_base_resources(). This practice might be worth reconsidering, but the fact remains that there is a bug in the reader permit which causes the base resources to be released twice when release_base_resources() is called on an already evicted permit. This is incorrect and is fixed in this patch. Improve release_base_resources(): * make _base_resources const * move signal call into the if (_base_resources_consumed()) { } * use reader_permit::impl::signal() instead of reader_concurrency_semaphore::signal() * all places where base resources are released now call release_base_resources() A reproducer unit test is added, which fails before and passes after the fix. Fixes: #28083 Closes scylladb/scylladb#28155	2026-01-19 11:37:51 +03:00
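The double-release problem and the shape of the fix can be modelled in a few lines of Python. This is a deliberately simplified sketch with hypothetical `Semaphore` and `Permit` classes; it only mirrors the idea of routing every release through one guarded method, not the real `reader_permit::impl`.

```python
class Semaphore:
    """Counts available resource units."""

    def __init__(self, count: int):
        self.initial = count
        self.available = count

    def consume(self, n: int) -> None:
        self.available -= n

    def signal(self, n: int) -> None:
        self.available += n


class Permit:
    def __init__(self, semaphore: Semaphore, base_resources: int):
        self._semaphore = semaphore
        self._base_resources = base_resources  # never modified after construction
        self._base_resources_consumed = False

    def admit(self) -> None:
        self._semaphore.consume(self._base_resources)
        self._base_resources_consumed = True

    def release_base_resources(self) -> None:
        # The signal happens only while the base resources are still held,
        # so a second call (for example after eviction) is a no-op.
        if self._base_resources_consumed:
            self._base_resources_consumed = False
            self._semaphore.signal(self._base_resources)

    def evict(self) -> None:
        # Eviction releases through the same guarded path.
        self.release_base_resources()
```

Evicting a permit and then calling `release_base_resources()` on it leaves the semaphore at its initial count instead of over-crediting it.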
Tomasz Grabiec	478b8f09df	test: tablets: Check that there are no migrations scheduled on draining nodes In case of decommission, it's not desirable because it's less urgent. In case of removenode, it leads to failure of removenode operation because scheduled co-locating migration will fail if the destination is on the excluded node, and this failure will be interpreted as drain failure and coordinator will cancel the request. Not a problem before "parallel decommission" because this failure is only a streaming failure, not a barrier failure, so exception doesn't escape into the catch clause in transition stage handler, and the migration is simply rolled back. Once draining happens in the tablet migration track, streaming failure will be interpreted as drain failure and cancel the request.	2026-01-18 15:36:07 +01:00
Nikos Dragazis	8aca7b0eb9	test: database_test: Fix serialization of partition key The `make_key` lambda erroneously allocates a fixed 8-byte buffer (`sizeof(s.size())`) for variable-length strings, potentially causing uninitialized bytes to be included. If such bytes exist and they are not valid UTF-8 characters, deserialization fails: ``` ERROR 2026-01-16 08:18:26,062 [shard 0:main] testlog - snapshot_list_contains_dropped_tables: cql env callback failed, error: exceptions::invalid_request_exception (Exception while binding column p1: marshaling error: Validation failed - non-UTF8 character in a UTF8 string, at byte offset 7) ``` Fixes #28195. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#28197	2026-01-17 20:32:06 +02:00
Botond Dénes	1e09a34686	replica: add abort polling to memtable and cache readers Continuing the read once it is aborted (e.g. due to timeout) is a waste of resources, as the produced results will be discarded. Poll the permit's abort exception in the memtable and cache reader's fill_buffer(). This results in one poll per buffer filled (8KB of data). We already have a similar poll for sstable readers, as disk reads are usually much heavier and therefore it is more important to stop them ASAP after abort. Cache and memtable reads are usually quick but not always, hence it is important to also have polling in the cache and memtable readers. Refs: #11469 Fixes: #28148 Closes scylladb/scylladb#28149	2026-01-16 18:03:04 +01:00
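The "one poll per buffer" pattern described above can be sketched as follows. This is an illustrative Python model; `Permit`, `BufferedReader`, and `check_abort` are hypothetical names standing in for the C++ reader permit and the readers' `fill_buffer()`.

```python
from typing import List, Optional


class Permit:
    """Carries an abort exception once the read has been aborted."""

    def __init__(self) -> None:
        self._abort_exception: Optional[BaseException] = None

    def abort(self, exc: BaseException) -> None:
        self._abort_exception = exc

    def check_abort(self) -> None:
        if self._abort_exception is not None:
            raise self._abort_exception


class BufferedReader:
    def __init__(self, rows: List[bytes], permit: Permit,
                 max_buffer_size: int = 8 * 1024):
        self._rows = rows
        self._permit = permit
        self._max_buffer_size = max_buffer_size
        self._pos = 0

    def fill_buffer(self) -> List[bytes]:
        # One poll per buffer: cheap enough for fast in-memory reads,
        # yet it stops an aborted read within one buffer of work.
        self._permit.check_abort()
        buffer: List[bytes] = []
        size = 0
        while self._pos < len(self._rows) and size < self._max_buffer_size:
            row = self._rows[self._pos]
            buffer.append(row)
            size += len(row)
            self._pos += 1
        return buffer
```

Polling at buffer granularity rather than per row keeps the overhead negligible while bounding the wasted work after an abort.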
Botond Dénes	122b7847e5	Merge 'index: Accept view properties in CREATE INDEX' from Dawid Mędrek Problem ------- Secondary indexes are implemented via materialized views under the hood. The way an index behaves is determined by the configuration of the view. Currently, it can be modified by performing the CQL statement `ALTER MATERIALIZED VIEW` on it. However, that raises some concerns. Consider, for instance, the following scenario: 1. The user creates a secondary index on a table. 2. In parallel, the user performs writes to the base table. 3. The user modifies the underlying materialized view, e.g. by setting the `synchronous_updates` to `true` [1]. Some of the writes that happened before step 3 used the default value of the property (which is `false`). That had an actual consequence on what happened later on: the view updates were performed asynchronously. Only after step 3 had finished did it change. Unfortunately, as of now, there is no way to avoid a situation like that. Whenever the user wants to configure a secondary index they're creating, they need to do it in another schema change. Since it's not always possible to control how the database is manipulated in the meantime, it leads to problems like the one described. That's not all, though. The fact that it's not possible to configure secondary indexes is inconsistent with other schema entities. When it comes to tables or materialized views, the user always has a means to set some or even all of the properties during their creation. Solution -------- The solution to this problem is extending the `CREATE INDEX` CQL statement by view properties. The syntax is of form: ``` > CREATE INDEX <index name> > .. ON <keyspace>.<table> (<columns>) > .. WITH <properties> ``` where `<properties>` corresponds to both index-specific and view properties [2, 3]. 
View properties can only be used with indexes implemented with materialized views; for example, it will be impossible to create a vector index when specifying any view property (see examples below). When a view property is provided, it will be applied when creating the underlying materialized view. The behavior should be similar to how other CQL statements responsible for creating schema entities work. High-level implementation strategy ---------------------------------- 1. Make auxiliary changes. 2. Introduce data structures representing the new set of index properties: both index-specific and those corresponding to the underlying view. 3. Extend `CREATE INDEX` to accept view properties. 4. Extend `DESCRIBE INDEX` and other `DESCRIBE` statements to include view properties in their output. User documentation is also updated at the steps to reflect the corresponding changes. Implementation considerations ----------------------------- There are a number of schema properties that are now obsolete. They're accepted by other CQL statements, but they have no effect. They include: * `index_interval` * `replicate_on_write` * `populate_io_cache_on_flush` * `read_repair_chance` * `dclocal_read_repair_chance` If the user tries to create a secondary index specifying any of those keywords, the statement will fail with an appropriate error (see examples below). Unlike materialized views, we forbid specifying the clustering order when creating a secondary index [4]. This limitation may be lifted later on, but it's a detail that may or may not prove troublesome. It's better to postpone covering it to when we have a better perspective on the consequences it would bring. Examples -------- Good examples ``` > CREATE INDEX idx ON ks.t (v); > CREATE INDEX idx ON ks.t (v) WITH comment = 'ok view property'; > CREATE INDEX idx ON ks.t (v) .. WITH comment = 'multiple view properties are ok' .. AND synchronous_updates = true; > CREATE INDEX idx ON ks.t (v) .. 
WITH comment = 'default value ok' .. AND synchronous_updates = false; ``` Bad examples ``` > CREATE INDEX idx ON ks.t (v) WITH replicate_on_write = true; SyntaxException: Unknown property 'replicate_on_write' > CREATE INDEX idx ON ks.t (v) .. WITH OPTIONS = {'option1': 'value1'} .. AND comment = 'some text'; InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot specify options for a non-CUSTOM index" > CREATE CUSTOM INDEX idx ON ks.t (v) .. WITH OPTIONS = {'option1': 'value1'} .. AND comment = 'some text'; InvalidRequest: Error from server: code=2200 [Invalid query] message="CUSTOM index requires specifying the index class" > CREATE CUSTOM INDEX idx ON ks.t (v) .. USING 'vector_index' .. WITH OPTIONS = {'option1': 'value1'} .. AND comment = 'some text'; InvalidRequest: Error from server: code=2200 [Invalid query] message="You cannot use view properties with a vector index" > CREATE INDEX idx ON ks.t (v) WITH CLUSTERING ORDER BY (v ASC); InvalidRequest: Error from server: code=2200 [Invalid query] message="Indexes do not allow for specifying the clustering order" ``` and so on. For more examples, see the relevant tests. References: [1] https://docs.scylladb.com/manual/branch-2025.4/cql/cql-extensions.html#synchronous-materialized-views [2] https://docs.scylladb.com/manual/branch-2025.4/cql/secondary-indexes.html#create-index [3] https://docs.scylladb.com/manual/branch-2025.4/cql/mv.html#mv-options [4] https://docs.scylladb.com/manual/branch-2025.4/cql/dml/select.html#ordering-clause Fixes scylladb/scylladb#16454 Backport: not needed. This is an enhancement. 
Closes scylladb/scylladb#24977 * github.com:scylladb/scylladb: cql3: Extend DESC INDEX by view properties cql3: Forbid using CLUSTERING ORDER BY when creating index cql3: Extend CREATE INDEX by MV properties cql3/statements/create_index_statement: Allow for view options cql3/statements/create_index_statement: Rename member cql3/statements/index_prop_defs: Re-introduce index_prop_defs cql3/statements/property_definitions: Add extract_property() cql3/statements/index_prop_defs.cc: Add namespace cql3/statements/index_prop_defs.hh: Rename type cql3/statements/view_prop_defs.cc: Move validation logic into file cql3/statements: Introduce view_prop_defs.{hh,cc} cql3/statements/create_view_statement.cc: Move validation of ID schema/schema.hh: Do not include index_prop_defs.hh	2026-01-14 09:54:27 +02:00
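The rejection rules described in this merge can be sketched as a small validation function. This is a minimal model only: `index_request`, `validate`, and the order of the checks are hypothetical and are not taken from the ScyllaDB sources. The error strings and the list of obsolete properties are copied from the commit message above.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of a parsed CREATE INDEX statement (not ScyllaDB's real types).
struct index_request {
    bool is_vector_index = false;         // CUSTOM index USING 'vector_index'
    bool has_clustering_order = false;    // WITH CLUSTERING ORDER BY (...)
    std::vector<std::string> properties;  // names of view properties given in WITH
};

// Returns an error message, or std::nullopt if the request is acceptable.
// The order of the checks is an assumption made for this sketch.
std::optional<std::string> validate(const index_request& req) {
    static const std::set<std::string> obsolete = {
        "index_interval", "replicate_on_write", "populate_io_cache_on_flush",
        "read_repair_chance", "dclocal_read_repair_chance"};
    for (const auto& name : req.properties) {
        if (obsolete.count(name) != 0) {
            return "Unknown property '" + name + "'";
        }
    }
    if (req.has_clustering_order) {
        return std::string("Indexes do not allow for specifying the clustering order");
    }
    if (req.is_vector_index && !req.properties.empty()) {
        return std::string("You cannot use view properties with a vector index");
    }
    return std::nullopt;
}
```

A request carrying only `comment` and `synchronous_updates` passes, while any of the three conditions above yields the corresponding message.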
Nikos Dragazis	d5ec66bc0c	schema: Generalize static configurators into schema initializers Extend the `static_configurator` mechanism to support initialization of arbitrary schema properties, not only static ones, by passing a `schema_builder` reference to the configurator interface. As part of this change, rename `static_configurator` to `schema_initializer` to better reflect its broader responsibility. Add a checkpoint/restore mechanism to allow de-registering an initializer (useful for testing; will be used in the next patch). Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-01-13 20:45:59 +02:00
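The initializer mechanism with checkpoint/restore might look roughly like the sketch below. Only the idea comes from the commit message: initializers receive a `schema_builder` reference, and a checkpoint allows de-registering them later. The `initializer_registry` class, its method names, and the single `compression` field are hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for schema_builder; only one property is modelled.
struct schema_builder {
    std::string compression = "none";
};

// An initializer may set any schema property through the builder reference.
using schema_initializer = std::function<void(schema_builder&)>;

class initializer_registry {
    std::vector<schema_initializer> _initializers;
public:
    void add(schema_initializer init) { _initializers.push_back(std::move(init)); }

    // A checkpoint is simply the number of initializers registered so far.
    std::size_t checkpoint() const { return _initializers.size(); }

    // Drops every initializer registered after the checkpoint was taken.
    void restore(std::size_t cp) {
        if (cp < _initializers.size()) {
            _initializers.erase(_initializers.begin() + static_cast<std::ptrdiff_t>(cp),
                                _initializers.end());
        }
    }

    // Runs all registered initializers, in registration order, on a new builder.
    void apply(schema_builder& b) const {
        for (const auto& init : _initializers) {
            init(b);
        }
    }
};
```

A test can take a checkpoint, register a temporary initializer, and call `restore()` afterwards so that later schemas are unaffected.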
Nikos Dragazis	76b2d0f961	db: config: Add accessor for sstable_compression_user_table_options The `sstable_compression_user_table_options` config option determines the default compression settings for user tables. In patch `2fc812a1b9`, the default value of this option was changed from LZ4 to LZ4WithDicts and a fallback logic was introduced during startup to temporarily revert the option to LZ4 until the dictionary compression feature is enabled. Replace this fallback logic with an accessor that returns the correct settings depending on the feature flag. This is cleaner and more consistent with the way we handle the `sstable_format` option, where the same problem appears (see `get_preferred_sstable_version()`). As a consequence, the configuration option must always be accessed through this accessor. Add a comment to point this out. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-01-13 18:30:38 +02:00
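The accessor approach can be illustrated with a minimal sketch. The class, the method name, and the plain-string values are assumptions made for illustration; only the LZ4 / LZ4WithDicts fallback rule is taken from the commit message.

```cpp
#include <cassert>
#include <string>

// Hypothetical config model; names and string values are illustrative only.
class config {
    // Raw option value. The default mirrors the one described in the commit message.
    std::string _sstable_compression_user_table_options = "LZ4WithDicts";
public:
    // Callers always go through this accessor instead of reading the raw field,
    // so the fallback is applied consistently.
    std::string user_table_compression(bool dict_compression_enabled) const {
        if (!dict_compression_enabled &&
            _sstable_compression_user_table_options == "LZ4WithDicts") {
            return "LZ4";  // fall back until the dictionary feature is enabled
        }
        return _sstable_compression_user_table_options;
    }
};
```

Because the fallback is computed on every call, nothing has to mutate the stored option at startup and revert it once the feature becomes available.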
Avi Kivity	489d1a0fbc	Merge 'replica: don't throw exceptions for read timeout' from Botond Dénes Read timeouts are a common occurrence and they typically occur when the replica is overloaded. So throwing exceptions for read timeouts is very harmful. Be careful not to throw exceptions while propagating them up the future chain. Add a test to enforce this and detect regressions. Fixes: scylladb/scylladb#25062 Improvement, normally not a backport candidate, but we may decide to backport if customer(s) are found to suffer from this. Closes scylladb/scylladb#25068 * github.com:scylladb/scylladb: reader_permit: remove check_abort() test/boost/database_test: add test for read timeout exceptions sstables/mx/reader: don't throw exceptions on the read-path readers/multishard: don't throw exceptions on the read-path replica/table: don't throw exceptions on the read-path multishard_mutation_query: fix indentation multishard_mutation_query: don't throw exceptions on the read-path service/storage_proxy: don't throw exceptions on the full-scan path cql3/query_processor: don't throw exceptions on the read-path reader_permit: add get_abort_exception()	2026-01-13 16:17:41 +02:00
Avi Kivity	c6dfae5661	treewide: #include Seastar headers with angle brackets Seastar is an external library from the point of view of ScyllaDB, so should be included with angle brackets. Closes scylladb/scylladb#27947	2026-01-13 14:56:15 +02:00
Botond Dénes	354c805e6a	reader_permit: remove check_abort() This method can cause performance regressions if used in the wrong place -- namely if it is used to abort reads by throwing the abort exception. Exceptions should be propagated during reads without throwing them, otherwise they cause extra CPU load, making a bad situation worse. Remove this method, so it doesn't accidentally get more users, migrate remaining users to get_abort_exception().	2026-01-13 10:47:57 +02:00
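The difference between throwing an abort exception and handing it up the call chain as a value can be shown without Seastar. The sketch below is a plain C++ model under stated assumptions: `permit`, `read_result`, `read_partition`, and `query` are hypothetical, and the signature given to `get_abort_exception()` is a guess rather than the real one. Note that the standard specifies `std::make_exception_ptr` as if by throw and catch; implementations may avoid the internal throw.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <variant>

// Hypothetical permit: records whether the read was aborted (e.g. timed out).
struct permit {
    bool timed_out = false;

    // Returns the abort exception as a value, or a null exception_ptr if there is none.
    std::exception_ptr get_abort_exception() const {
        return timed_out
            ? std::make_exception_ptr(std::runtime_error("read timed out"))
            : std::exception_ptr();
    }
};

// Either a row count or an error carried as a value.
using read_result = std::variant<int, std::exception_ptr>;

// Lowest layer: reports the abort by returning it, with no throw statement.
read_result read_partition(const permit& p) {
    if (auto ex = p.get_abort_exception()) {
        return ex;
    }
    return 42;
}

// Upper layer: forwards the error untouched, with no try/catch on the way up.
read_result query(const permit& p) {
    read_result res = read_partition(p);
    if (std::holds_alternative<std::exception_ptr>(res)) {
        return res;
    }
    return std::get<int>(res) + 1;
}
```

In this model no layer unwinds the stack on a timeout; the error is only inspected by whoever finally consumes the result.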
Botond Dénes	a0ddac655d	test/boost/database_test: add test for read timeout exceptions Read timeouts shouldn't cause exceptions to be thrown; exceptions should be propagated solely via futures, otherwise they put extra strain on the system at the worst possible time: when it is already overloaded enough that reads have started to time out. The test covers both single partition reads and full scans, with two scenarios: * timeout while the read is queued * timeout when the read is already ongoing	2026-01-13 10:47:57 +02:00
Michał Hudobski	c8aa49b196	vector search, paging: add test for paging warnings We add a test that validates that indexed queries do not emit a warning related to vector search paging. Fixes: SCYLLADB-248 Closes scylladb/scylladb#28077	2026-01-13 10:33:36 +02:00
Petr Gusev	889d7782ed	treewide: use coroutine::maybe_yield in coroutines It's more efficient since coroutine::maybe_yield returns a lightweight struct (awaitable), not the future. Closes scylladb/scylladb#28101	2026-01-12 10:38:47 +01:00
Alex	e430065c92	db: views: serialize create/drop view operations via shard 0 Create and drop view operations are currently performed on all shards, and their execution is not fully serialized. On slower processors this can lead to interleavings that leave stale entries in `system.scylla_views_builds_in_progress`. A problematic sequence looks like this: * `on_create_view()` runs on shard 0 → entries for shard 0 and shard 1 are created * `on_drop_view()` runs on shard 0 → entry for shard 0 is removed * `on_create_view()` runs on shard 1 → entries for shard 0 and shard 1 are created again * `on_drop_view()` runs on shard 1 → entry for shard 1 is removed, while the shard 0 entry remains This results in a leftover row in `system.scylla_views_builds_in_progress`, causing `view_build_test.cc` to get stuck indefinitely in an eventual state and eventually be terminated by CI. This patch fixes the issue by fully serializing all view create and drop operations through shard 0. Shard 0 becomes the single execution point and notifies other shards to perform their work in order. Requests originating. new process: - view_builder::on_create_view(...) runs only on shard 0 and kicks off dispatch_create_view(...) in the background. - dispatch_create_view(...) (shard 0) first checks should_ignore_tablet_keyspace(...) and returns early if needed. - dispatch_create_view(...) calls handle_seed_view_build_progress(...) on shard 0. That: - writes the global “build progress” row across all shards via _sys_ks.register_view_for_building_for_all_shards(...). - After seeding, dispatch_create_view(...) broadcasts to all shards with container().invoke_on_all(...). - Each shard runs handle_create_view_local(...), which: - waits for pending base writes/streams, flushes the base, - resets the reader to the current token and adds the new view, - handles errors and triggers _build_step to continue processing. Drop view - view_builder::on_drop_view(...) runs only on shard 0 and kicks off dispatch_drop_view(...) 
in the background. - dispatch_drop_view(...) (shard 0) first checks should_ignore_tablet_keyspace(...) and returns early if needed. - It broadcasts handle_drop_view_local(...) to all shards with invoke_on_all(...). - Each shard runs handle_drop_view_local(...), which: - removes the view from local build state (_base_to_build_step and _built_views) by scanning existing steps, - ignores missing keyspace cases. - After all shards finish local cleanup, shard 0 runs handle_drop_view_global_cleanup(...), which: - removes global build progress, built‑view state, and view build status in system tables. Shutdown - drain() waits on _view_notification_sem before _sem so in‑flight dispatches finish before bookkeeping is halted. In addition, the test is adjusted to remove the long eventual wait (596.52s / 30 iterations) and instead rely on the default wait of 17 iterations (~4.37 minutes), eliminating unnecessary delays while preserving correctness. Fixes: https://github.com/scylladb/scylladb/issues/27898 Backport: not required as the problem happens on master Closes scylladb/scylladb#27929	2026-01-12 09:23:22 +02:00
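The problematic interleaving from this commit message can be replayed with a toy model. The sketch reduces the progress table to a set of shard ids holding a row for one view; all function names are hypothetical, and the real view builder is far more involved.

```cpp
#include <cassert>
#include <set>

// Toy model of the progress table: shard ids that currently have a row for one view.
using progress_rows = std::set<unsigned>;

// As described above: a create handled on any shard writes rows for all shards.
void create_rows_for_all_shards(progress_rows& rows, unsigned shard_count) {
    for (unsigned s = 0; s < shard_count; ++s) {
        rows.insert(s);
    }
}

// Before the fix: a drop handled on shard `s` removes only that shard's row.
void drop_row_for_shard(progress_rows& rows, unsigned s) {
    rows.erase(s);
}

// Replays the unserialized sequence from the commit message with two shards.
progress_rows interleaved() {
    progress_rows rows;
    create_rows_for_all_shards(rows, 2);  // on_create_view() on shard 0
    drop_row_for_shard(rows, 0);          // on_drop_view() on shard 0
    create_rows_for_all_shards(rows, 2);  // on_create_view() on shard 1
    drop_row_for_shard(rows, 1);          // on_drop_view() on shard 1
    return rows;
}

// Serialized variant: one create, then one cleanup after every shard has finished.
progress_rows serialized() {
    progress_rows rows;
    create_rows_for_all_shards(rows, 2);  // seeded once, from shard 0
    for (unsigned s = 0; s < 2; ++s) {
        drop_row_for_shard(rows, s);      // cleanup covering all shards
    }
    return rows;
}
```

In the interleaved replay the row for shard 0 survives, which corresponds to the stale entry that kept the test waiting; the serialized variant leaves the set empty.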
Calle Wilund	a7cdb602e1	db::commitlog: Fix sanity check error on race between segment flushing and oversized alloc Fixes #27992 When doing a commit log oversized allocation, we lock out all other writers by grabbing the _request_controller semaphore fully (max capacity). We thereafter assert that the semaphore is in fact zero. However, due to how the bookkeeping works here, the semaphore can in fact become negative (some paths will not actually wait for the semaphore, because this could deadlock). Thus, if, after we grab the semaphore and execution actually returns to us (task schedule), new_buffer via segment::allocate is called (due to a non-fully-full segment), we might in fact grab the segment overhead from zero, resulting in a negative semaphore. The same problem applies later when we try to sanity check the return of our permits. The fix is trivial: accept less-than-zero values, and take the same possibly negative value into account in the exit check (returning units). Added a whitebox unit test (using a special callback interface for sync) that provokes the race condition explicitly (and reliably). Closes scylladb/scylladb#27998	2026-01-09 14:06:58 +02:00
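The arithmetic behind the failing sanity check is easy to model. The following sketch uses a plain signed counter in place of the real semaphore; `counting_sem`, `oversized_alloc_check`, and the example numbers are hypothetical and only illustrate why a `== 0` check can fail while a `<= 0` check holds.

```cpp
#include <cassert>

// Toy signed counter standing in for the request-controller semaphore.
struct counting_sem {
    long available;

    // Models paths that must not block: units are taken unconditionally,
    // so the count may drop below zero.
    void consume(long units) { available -= units; }
};

// Returns true if the sanity check passes after an oversized allocation
// raced with a buffer allocation that took `racing_overhead` units.
bool oversized_alloc_check(long capacity, long racing_overhead, bool accept_negative) {
    counting_sem sem{capacity};
    sem.consume(capacity);         // oversized alloc grabs the full capacity: count is 0
    sem.consume(racing_overhead);  // racing buffer allocation takes its overhead without waiting
    return accept_negative ? sem.available <= 0 : sem.available == 0;
}
```

With a capacity of 100 and a racing overhead of 8, the strict check sees -8 and fails, while the relaxed check accepts it; without a race both checks agree.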
Michał Hudobski	e2e479f20d	auth: fix cdc vector search indexing permission bug VECTOR_SEARCH_INDEXING permission didn't work on cdc tables as we mistakenly checked for vector indexes on the cdc table instead of the base. This patch fixes that and adds a test that validates this behavior. Fixes: VECTOR-476 Closes scylladb/scylladb#28050	2026-01-08 21:55:19 +02:00

4474 Commits