scylladb

Author	SHA1	Message	Date
Piotr Dulikowski	aba922ea65	Merge 'cdc: improve cdc metadata loading' from Michael Litvak When loading CDC streams metadata for tablets from the tables, read only new entries from the history table instead of reading all entries. This improves the CDC metadata reloading, making it more efficient and predictable. The CDC metadata is loaded as part of group0 reload whenever the internal CDC tables are modified. On tablet split / merge, we create a new CDC timestamp and streams by writing them to the cdc_streams_history table by group0 operation, and when it's applied we reload the in-memory CDC streams map by reading from the tables and constructing the updated map. Previously, on every update, we would read the entire cdc_streams_history entries for the changed table, constructing all its streams and creating a new map from scratch. We improve this now by reading only new entries from cdc_streams_history and appending them to the existing map. We can do this because we only append new entries to cdc_streams_history with a higher timestamp than all previous entries. This makes the reloading more efficient and predictable, because previously we would read a number of entries that depends on the number of tablet splits and merges, which increases over time and is unbounded, whereas now we read only a single stream set on each update. Fixes https://github.com/scylladb/scylladb/issues/26732 Backport to 2025.4, where CDC with tablets is introduced Closes scylladb/scylladb#26160 * github.com:scylladb/scylladb: test: cdc: extend cdc with tablets tests cdc: improve cdc metadata loading	2025-10-29 11:07:48 +01:00
Botond Dénes	ac618a53f4	Merge 'db: repair: do not update repair_time if batchlog replay failed' from Aleksandra Martyniuk Currently, batchlog replay is considered successful even if all batches fail to be sent (they are replayed later). However, repair requires all batches to be sent successfully. Currently, if batchlog isn't cleared, the repair never learns and updates the repair_time. If GC mode is set to "repair", this means that the tombstones written before the repair_time (minus propagation_delay) can be GC'd while not all batches were replayed. Consider a scenario: - Table t has a row with (pk=1, v=0); - There is an entry in the batchlog that sets (pk=1, v=1) in table t; - The row with pk=1 is deleted from table t; - Table t is repaired: - batchlog replay fails; - repair_time is updated; - propagation_delay seconds pass and the tombstone of pk=1 is GC'd; - batchlog is replayed and (pk=1, v=1) inserted - data resurrection! Do not update repair_time if sending any batch fails. The data is still repaired. For tablet repair the repair runs, but at the end the exception is passed to topology coordinator. Thanks to that the repair_time isn't updated. The repair request isn't removed either, so the repair will need to rerun. Apart from that, a batch is removed from the batchlog if its version is invalid or unknown. The condition on which we consider a batch too fresh to replay is updated to consider propagation_delay.
Fixes: https://github.com/scylladb/scylladb/issues/24415 Data resurrection fix; needs backport to all versions Closes scylladb/scylladb#26319 * github.com:scylladb/scylladb: db: fix indentation test: add reproducer for data resurrection repair: fail tablet repair if any batch wasn't sent successfully db/batchlog_manager: fix making decision to skip batch replay db: repair: throw if replay fails db/batchlog_manager: delete batch with incorrect or unknown version db/batchlog_manager: coroutinize replay_all_failed_batches	2025-10-28 14:52:59 +02:00
Botond Dénes	f3cec5f11a	Merge 'index: Set tombstone_gc when creating underlying view' from Dawid Mędrek Before this commit, when the underlying materialized view was created, it didn't have the property `tombstone_gc` set to any value. We fix the bug in this PR. Implementation strategy: 1. Move code responsible for producing the schema of a secondary index to the file that handles `CREATE INDEX`. 2. Set the property when creating the view. 3. Add reproducer tests. Fixes scylladb/scylladb#26542 Backport: we can discuss it. Closes scylladb/scylladb#26543 * github.com:scylladb/scylladb: index: Set tombstone_gc when creating secondary index index: Make `create_view_for_index` method of `create_index_statement` index: Move code for creating MV of secondary index to cql3 db, cql3: Move creation of underlying MV for index	2025-10-28 14:42:42 +02:00
Radosław Cybulski	ea6b22f461	Add max trace size output configuration variable In #24031 users complained that the trace message is truncated, namely it's no longer JSON-parsable and the table name might not be part of the output. This patch enables users to configure the maximum size of the trace message. In case a user wanted the `table` name, but didn't care about message size, #26634 will help. - add configuration variable `alternator_max_users_query_size_in_trace_output` with default value of 4096 (4 times old default value). - modify `truncated_content_view` function to use new configuration variable for truncation limit - update `truncated_content_view` to consistently truncate at given size, previously truncation would also happen when data arrived in more than one chunk - update `truncated_content_view` to better handle truncated value (limit number of copies) - fix `scylla_config_read` call - a call to `query` for a configuration name that does not exist will return an empty (but present) `Items` array - this would raise an array access exception a few lines below. - add test Refs #26634 Refs #24031 Closes scylladb/scylladb#26618	2025-10-28 13:29:15 +03:00
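The "consistently truncate at given size" point above can be illustrated with a small sketch. This is not the actual `truncated_content_view` implementation: the function name, the chunk representation and the `<truncated>` marker are all hypothetical. It only shows the idea that the result depends on the total size, not on how many chunks the data arrived in.

```python
def truncated_view(chunks, max_size, marker="<truncated>"):
    """Join chunks, keeping at most max_size characters in total.

    Hypothetical stand-in: the marker text and signature are assumptions,
    not taken from the Scylla source.
    """
    kept = []
    remaining = max_size
    truncated = False
    for chunk in chunks:
        if len(chunk) > remaining:
            # Only part of this chunk fits; keep the prefix and stop.
            kept.append(chunk[:remaining])
            truncated = True
            break
        kept.append(chunk)
        remaining -= len(chunk)
    text = "".join(kept)
    return text + marker if truncated else text
```

With this shape, input that fits the limit is returned unchanged even when it arrives in several chunks, which is the behaviour the commit message describes as the fix.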
Avi Kivity	d81796cae3	Merge 'Limit concurrent view updates from all sources' from Wojciech Mitros Before this patch, when a base table has many materialized views, each write to this table can start up to 128 view updates in parallel. With high client write concurrency, the actual concurrency of writes executed on the node may grow unexpectedly, which can lead to higher latency and higher memory usage compared to a sequential approach. In this patch we add a per-shard, per-service-level semaphore which limits the number of concurrent view updates processed on the shard in this service level to a constant value. We take one unit from the semaphore for each local view update write, and release it when the write finishes. The remote view updates do not take units from the semaphore because they don't consume nearly as much processing power and they are limited by another semaphore based on their memory usage. Fixes https://github.com/scylladb/scylladb/issues/25341 Closes scylladb/scylladb#25456 * github.com:scylladb/scylladb: mv: limit concurrent view updates from all sources database: rename _view_update_concurrency_sem to _view_update_memory_sem	2025-10-28 11:13:24 +02:00
Michael Litvak	8743422241	cdc: improve cdc metadata loading When loading CDC streams metadata for tablets from the tables, read only new entries from the history table instead of reading all entries. This improves the CDC metadata reloading, making it more efficient and predictable. The CDC metadata is loaded as part of group0 reload whenever the internal CDC tables are modified. On tablet split / merge, we create a new CDC timestamp and streams by writing them to the cdc_streams_history table by group0 operation, and when it's applied we reload the in-memory CDC streams map by reading from the tables and constructing the updated map. Previously, on every update, we would read the entire cdc_streams_history entries for the changed table, constructing all its streams and creating a new map from scratch. We improve this now by reading only new entries from cdc_streams_history and appending them to the existing map. We can do this because we only append new entries to cdc_streams_history with a higher timestamp than all previous entries. This makes the reloading more efficient and predictable, because previously we would read a number of entries that depends on the number of tablet splits and merges, which increases over time and is unbounded, whereas now we read only a single stream set on each update. Fixes scylladb/scylladb#26732	2025-10-28 08:54:09 +01:00
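A minimal sketch of the incremental reload described above, assuming the history is append-only and ordered by timestamp. `StreamsHistoryTable`, `read_after` and `reload_streams` are hypothetical names used for illustration; they are not the Scylla API, and the real table is a CQL table rather than an in-memory list.

```python
from bisect import bisect_right


class StreamsHistoryTable:
    """Stand-in for cdc_streams_history: append-only, ordered by timestamp."""

    def __init__(self):
        self._timestamps = []
        self._stream_sets = []
        self.rows_read = 0  # counts rows returned, to show the read cost

    def append(self, ts, streams):
        # New entries always carry a higher timestamp than all previous ones.
        assert not self._timestamps or ts > self._timestamps[-1]
        self._timestamps.append(ts)
        self._stream_sets.append(streams)

    def read_after(self, ts):
        """Return only rows with a timestamp strictly greater than ts."""
        start = 0 if ts is None else bisect_right(self._timestamps, ts)
        rows = list(zip(self._timestamps[start:], self._stream_sets[start:]))
        self.rows_read += len(rows)
        return rows


def reload_streams(streams_map, table):
    """Append entries newer than the latest one already in streams_map."""
    last_ts = max(streams_map, default=None)
    for ts, streams in table.read_after(last_ts):
        streams_map[ts] = streams
```

In this model each reload after a single split or merge reads one row, independent of how long the history has grown.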
Wojciech Mitros	f07a86d16e	mv: limit concurrent view updates from all sources Before this patch, when a base table has many materialized views, each write to this table can start up to 128 view updates in parallel. With high client write concurrency, the actual concurrency of writes executed on the node may grow unexpectedly, which can lead to higher latency and higher memory usage compared to a sequential approach. In this patch we add a per-shard, per-service-level semaphore which limits the number of concurrent view updates processed on the shard in this service level to a constant value. We take one unit from the semaphore for each local view update write, and release it when the write finishes. The remote view updates do not take units from the semaphore because they don't consume nearly as much processing power and they are limited by another semaphore based on their memory usage. The effect of this patch can also be observed when writing to a base table with a large number of materialized views, like in the materialized_views_test.py::TestMaterializedViews::test_many_mv_concurrent dtest. In that test, if we perform a full scan in parallel to a write workload with a concurrency of 100 to a table with 100 views, the scan would sometimes time out because it would effectively get 1/10000 of cpu. With this patch, the cpu concurrency of view updates was limited to 128 (we ran both writes and scan in the same service level), and the scan no longer timed out. Fixes https://github.com/scylladb/scylladb/issues/25341	2025-10-27 18:55:41 +01:00
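The semaphore pattern described above can be sketched with `asyncio`. This only models the idea of taking one unit per local view update and releasing it on completion; the function names are hypothetical, and the real implementation is a per-shard, per-service-level Seastar semaphore, not Python code.

```python
import asyncio


async def run_view_updates(num_updates, limit):
    """Run num_updates simulated local view updates, at most `limit` at once.

    Returns the peak number of updates that were in flight together.
    """
    sem = asyncio.Semaphore(limit)
    running = 0
    peak = 0

    async def local_view_update():
        nonlocal running, peak
        # One unit is taken per local view update and released when it ends.
        async with sem:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0)  # stand-in for the actual write
            running -= 1

    await asyncio.gather(*(local_view_update() for _ in range(num_updates)))
    return peak
```

However many updates are started, the number in flight never exceeds the limit, which is what keeps a concurrent scan from being starved in the scenario the commit describes.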
Aleksandra Martyniuk	6fc43f27d0	db: fix indentation	2025-10-23 10:39:43 +02:00
Aleksandra Martyniuk	e1b2180092	db/batchlog_manager: fix making decision to skip batch replay Currently, we skip batch replay if less than batch_log_timeout has passed since the batch was written. The batch_log_timeout value can be configured. If it is large, the batch won't be replayed for a long time. If the tombstone is GC'd before the batch is replayed, we risk data resurrection. To ensure safety we can skip only the batches that won't be GC'd. In this patch we skip replay of the batches for which: now() < written_at + min(timeout, propagation_delay) repair_time is set at the start of batchlog replay, so at the moment of the check we will have: repair_time <= now() So we know that: repair_time < written_at + propagation_delay With this condition we are sure that GC won't happen.	2025-10-23 10:38:31 +02:00
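The skip condition above can be written out as a small predicate. This sketch assumes the bound is the minimum of the two values `timeout` and `propagation_delay`, which is what the inequality `repair_time < written_at + propagation_delay` derived in the message requires; the function name and the use of plain numbers for times are hypothetical.

```python
def should_skip_replay(now, written_at, timeout, propagation_delay):
    """True if the batch is still too fresh to replay.

    Assumption: the bound is min(timeout, propagation_delay), so a skipped
    batch always satisfies now < written_at + propagation_delay.
    """
    return now < written_at + min(timeout, propagation_delay)


def gc_safe(repair_time, written_at, propagation_delay):
    """The property the commit message derives for every skipped batch."""
    return repair_time < written_at + propagation_delay
```

For example, with `written_at=100`, `timeout=3600` and `propagation_delay=10`, a batch is skipped at `now=105` but replayed at `now=111`, even though the configured timeout is much larger.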
Aleksandra Martyniuk	7f20b66eff	db: repair: throw if replay fails Return a flag determining whether all the batches were sent successfully in batchlog_manager::replay_all_failed_batches (batches skipped due to being too fresh are not counted). Throw in repair_flush_hints_batchlog_handler if not all batches were replayed, to ensure that repair_time isn't updated.	2025-10-23 10:38:31 +02:00
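A rough model of the control flow described above: replay reports whether every batch was sent, and the repair handler raises when it was not, so the caller never reaches the point where repair_time would be updated. The Python functions reuse the names from the message for readability, but their signatures, the `send` callback and `BatchlogReplayError` are hypothetical.

```python
class BatchlogReplayError(Exception):
    """Hypothetical error type signalling an incomplete batchlog replay."""


def replay_all_failed_batches(batches, send):
    """Try to send every batch; return True only if all of them succeeded."""
    all_sent = True
    for batch in batches:
        try:
            send(batch)
        except ConnectionError:
            # The batch stays in the batchlog and is retried in a later round.
            all_sent = False
    return all_sent


def repair_flush_hints_batchlog_handler(batches, send):
    """Raise if any batch was not replayed, so repair_time is not updated."""
    if not replay_all_failed_batches(batches, send):
        raise BatchlogReplayError("not all batches were replayed")
```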
Aleksandra Martyniuk	904183734f	db/batchlog_manager: delete batch with incorrect or unknown version batchlog_manager::replay_all_failed_batches skips batches that have an unknown or incorrect version. The next round will process these batches again. Such batches will probably be skipped every time, so there is no point in keeping them. Even if at some point the version becomes correct, we should not replay the batch - it might be old and this may lead to data resurrection.	2025-10-23 10:38:31 +02:00
Aleksandra Martyniuk	502b03dbc6	db/batchlog_manager: coroutinize replay_all_failed_batches	2025-10-23 10:38:31 +02:00
Wojciech Mitros	c0d0f8f85b	database: rename _view_update_concurrency_sem to _view_update_memory_sem In the following commit, we'll introduce a new semaphore for view updates that limits their concurrency by view update count. To avoid confusion, we rename the existing semaphore that tracks the memory used by concurrent view updates and related objects accordingly.	2025-10-23 10:00:15 +02:00
Tomasz Grabiec	ba692d1805	schema_tables: Keep "replication" column backwards-compatible by expanding rack lists to numeric RF In `380f243986` we added support for rack lists in replication options. Drivers which are not prepared to parse that (as of now, all of them), will not create metadata object for that keyspace. This breaks, for example, the "copy to/from" cqlsh command. Potentially other things too. To fix that, keep the "replication" column in the old format, and store numeric RF there, which corresponds to the number of replicas. Accurate options in the new format are put in "replication_v2". We set replication_v2 in the schema only when it differs from the old "replication" so that the new column is not set during upgrade, otherwise downgrade would fail. Partition tombstone is added to ensure that pre-alter replication_v2 value is deleted on alters which change replication to a value which is the same as the post-alter "replication" value. Fixes #26415 Closes scylladb/scylladb#26429	2025-10-21 09:11:25 +03:00
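The split between the two columns can be sketched as follows. The shape of the options map (rack lists as Python lists, numeric RF as strings) and both function names are assumptions made for illustration; the partition tombstone handling mentioned above is not modelled.

```python
def to_legacy_replication(options):
    """Expand rack lists to a numeric RF equal to the number of replicas.

    Assumed shape: a rack list is a Python list of rack names, and any
    other value (strategy class, numeric RF) is already a string.
    """
    legacy = {}
    for key, value in options.items():
        legacy[key] = str(len(value)) if isinstance(value, list) else value
    return legacy


def schema_columns(options):
    """Build the columns to store for a keyspace.

    "replication_v2" is only set when it differs from "replication".
    """
    legacy = to_legacy_replication(options)
    row = {"replication": legacy}
    if legacy != options:
        row["replication_v2"] = options
    return row
```

A keyspace that uses only numeric RF therefore produces no `replication_v2` value, which matches the stated goal of not setting the new column during upgrade.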
Piotr Dulikowski	f76917956c	view_building_worker: access tablet map through erm on sstable discovery Currently, the data returned by `database::get_tables_metadata()` and `database::get_token_metadata()` may not be consistent. Specifically, the tables metadata may contain some tablet-based tables before their tablet maps appear in the token metadata. This is going to be fixed after issue scylladb/scylladb#24414 is closed, but for the time being work around it by accessing the token metadata via `table`->effective_replication_map() - that token metadata is guaranteed to have the tablet map of the `table`. Fixes: scylladb/scylladb#26403 Closes scylladb/scylladb#26588	2025-10-21 00:14:39 +02:00
Dawid Mędrek	20761b5f13	db, cql3: Move creation of underlying MV for index The main goal of this patch is to give more control over the creation of the underlying view on an index to `create_index_statement.cc`. That goal is in line with how the other statements are executed: the schema is built in the cql3 module and only the ready schema_ptr is passed further. That should also make the code cleaner and easier to understand. There are a few important things to note here: * A call to `service::prepare_new_view_announcement` appears out of nowhere. Aside from some validation checks and logging, that function does pretty much the same as the pre-existing code we remove: a. It creates Raft mutations based on the passed `view_ptr`. b. It creates Raft mutations responsible for view building tasks. c. It notifies about a new column family. * We seemingly get rid of the code that creates view building tasks. That's not true: we still do that via `service::prepare_new_view_announcement`. That should explain why the change doesn't remove any relevant logic. On the other hand, it might be more difficult to explain why moving the code is correct. I'll touch on it below. Before that, it may also be important to highlight that this commit only affects the logic responsible for creating an index. There should be no effect on any other part of how Scylla behaves. --- Proving the correctness of the solution would take quite a lot of space, so I'll only summarize it. It relies on a few things: 1. Two schema changes cannot happen in one operation. We allow for more but only when those changes are dependent on each other and when the additional ones are internal for Scylla, e.g. creating an index leads to creating the underlying materialized view. 2. There are no entities or components that rely on indexes. 3. Each index is uniquely defined by the keyspace it belongs to and the name of the index. 4. 
There is a bijection between rows in `system_schema.indexes` and the currently existing indexes. 5. The name of an unnamed index depends on the name of the base table and the names of the indexed columns. The name of an unnamed index may have a number attached to it, but that number only depends on the state of the schema at the time of creation of the index, and it never changes later on. There are no other things the name of an unnamed index depends on. 6. Scylla doesn't allow for changing any column in the base table that has an index depending on it. Based on that, we conclude that every existing index has exactly one entry in `system_schema.indexes`, and the primary key of that entry never changes. The columns of `system_schema.indexes` that are not part of the primary key are: `kind` and `options`. Both values are only decided at the time of creation of an index, and currently there's no way to modify them. That implies that there are only two events when an entry in the system table can change: when creating an index and when dropping an index. --- When we consider the previous place of the logic that this commit moves to `cql3/statements/create_index_statement.cc`, it works like this: 1. We compare the sets of indexes defined on a specific table (in the form of a structure called `index_metadata`) before and after an operation. 2. We divide the entries into three sets: those present in both sets and those present in only one of them. 3. We handle each of those three sets separately. The structure `index_metadata` is a reflection of entries in `system_schema.indexes`. It stores one more parameter -- `local` -- but its value depends on the other values of an entry, so we can ignore it in this reasoning. Because an index cannot be modified -- it can only be created or dropped -- there are at most two non-empty sets: the set of new indexes and the set of dropped indexes. 
Those sets are only non-empty during an operation like `CREATE INDEX`, `DROP INDEX`, `DROP TABLE (base table)`, `DROP KEYSPACE`. Note that it's impossible to drop an index by dropping the underlying materialized view -- Scylla doesn't allow for that. However, the code in `migration_manager.cc` we call (`prepare_column_family_update_announcement`) and the code that we call in `schema_tables.cc` (`make_update_table_mutations`) is only triggered by updates related to the base table. In the context of `DROP TABLE` or `DROP KEYSPACE`, we'd call `prepare_column_family_drop_announcement` instead. In other words, we're only concerned with `CREATE INDEX` and `DROP INDEX`. --- A conclusion from this reasoning is that we only need to consider those two situations when talking about correctness of this change. The impact of this commit is that we may have potentially reordered mutations in the resulting vector that will be applied to the Raft log. The only mutations we may have reordered are the mutations responsible for creating the underlying view and the mutations responsible for updating columns in the base table. It's clear then that this commit brings no change at all: we only give `cql3/statements/create_index_statement.cc` more control over creating the underlying view. --- We leave a remnant of the code in `db/schema_tables.cc` responsible for dropping an index along with its underlying view. It would require changing a bit more of the logic, and we don't need it for the rest of this sequence of changes. Refs scylladb/scylladb#16454	2025-10-20 14:04:06 +02:00
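The three-way split of index sets used in the reasoning above can be modelled with plain set operations. Identifying an index by a `(keyspace, name)` tuple follows point 3 of the argument; the function name and the dictionary keys are hypothetical.

```python
def diff_indexes(before, after):
    """Split indexes into created, dropped and unchanged sets.

    `before` and `after` are sets of (keyspace, index_name) tuples taken
    before and after a schema operation.
    """
    return {
        "created": after - before,    # present only after the operation
        "dropped": before - after,    # present only before the operation
        "unchanged": before & after,  # present in both
    }
```

Because an index can only be created or dropped, never modified, the "unchanged" set needs no further handling in this model.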
Pavel Emelyanov	44ed3bbb7c	Merge 'RFC: Initial GCP storage backend for scylla (sstables + backup)' from Calle Wilund Integrates GCP object storage as a working storage backend for scylla sstables as well as backup storage. Adds an abstraction layer (atm very heavily designed around the s3 client interface and usage) to allow the "storage" etc layers of sstable management to pick transparently between "s3" and "gs" providers. This modifies the scylla config such that endpoints can optionally (through a "type" param) ref a GS backend. Similarly with storage_options. Also adds some IO wrapping primitives to make it more feasible to place some logic at a mid level of the implementation stack (such as making networked storage files, ranged reading etc). Test s3 fixture is replaced (where appropriate) with an `object_storage` fixture that multiplexes the test across both backends. Unit tests are duplicated and for the GS versions use a boost test fixture for GCS, default local fake. Fixes #25359 Fixes #26453 Closes scylladb/scylladb#26186 * github.com:scylladb/scylladb: docs::dev::object_storage: Add some initial info on GS storage docs/dev: Add mention of (nested) docker usage in testing.md sstables::object_storage_client: Forward memory limit semaphore to GS instance utils::gcp::object_storage: Add optional memory limits to up/download sstables::object_storage_client: Add multi-upload support for GS utils::gcp::storage: Add merge objects operation test_backup/test_basic: Make tests multiplex both s3 and gs backends test::cluster::conftest: Add support for multiple object storage backends boost::gcs_storage_test: reindent boost::gcs_storage_test: Convert to use fixture tests::boost: Add GS object storage cases to mirror S3 ones tests::lib::gcs_fixture: Add a reusable test fixture for real/fake GS/GCS tests::lib::test_utils: Add overloads/helpers for reading and (temp) writing env sstables::object_storage_client: Add google storage implementation test_services: Allow 
testing with GS object storage parameters utils::gcp::gcp_credentials: Add option to create uninitialized credentials utils::gcp::object_storage: Make create_download_source return seekable_data_source utils::gcp::object_storage: Add defensive copies of string_view params utils::gcp::object_storage: Add missing retry backoff increase utils::gcp::object_storage: Add timestamp to object listing utils::gcp::object_storage: Add paging support to list_objects object_storage_client: Add object_name wrapper type utils::gcp::object_storage: Add optional abort_source utils::rest::client: Add abort_source support sstables: Use object_storage_client for remote storage sstables::object_storage_client: Add abstraction layer for OS clients (s3 initial) s3::upload_progress: Promote to general util type storage_options: Abstract s3 to "object_storage" and add gs as option sstables::file_io_extension: Change "creator" callback to just data_source utils::io-wrappers: Add ranged data_source utils::io-wrappers: Add file wrapper type for seekable_source utils::seekable_source: Add a seekable IO source type object_storage_endpoint_param: Add gs storage as option config: break out object_storage_endpoint_param preparing for multi storage	2025-10-20 13:14:53 +03:00
Tomasz Grabiec	c4a87453a2	Merge 'Add experimental feature flag for strongly consistent tables and extend keyspace creation syntax to allow specifying consistency mode.' from Gleb Natapov The series adds an experimental flag for strongly consistent tables and extends the "CREATE KEYSPACE" DDL with a `consistency` option that allows specifying the consistency mode for the keyspace. Closes scylladb/scylladb#26116 * github.com:scylladb/scylladb: schema: Allow configuring consistency setting for a keyspace db: experimental consistent-tablets option	2025-10-16 21:48:06 +02:00
Tomasz Grabiec	e6c427953e	Merge 'schema_applier: unify handling of token_metadata during schema change' from Marcin Maliszkiewicz This patchset improves the atomicity and clarity of schema application in the presence of token metadata updates during schema changes. The primary focus is to ensure that changes to tablet metadata are applied atomically as part of the schema commit phase, rather than being replicated to all cores afterward, which previously violated atomicity guarantees. Key changes: - Introduced pending_token_metadata to unify handling of new and existing metadata. - Split token metadata replication into prepare and commit steps. - Abstracted schema dependencies in storage_service to support pending schema visibility. - Applied tablet metadata updates atomically within schema commit phase. Backport: no, it's a new feature Fixes: https://github.com/scylladb/scylladb/issues/24414 Closes scylladb/scylladb#25302 * github.com:scylladb/scylladb: db: schema_applier: update tablet metadata atomically db: replica: move tables_metadata locking to commit storage_service: abstract schema dependecies during token metadata update storage_service: split replicate_to_all_cores to steps db: schema_applier: unify token_metadata loading replica: schema_applier: obtain copy of token_metadata at the beginning of schema merge service: fix dependencies during migration_manager startup db: schema_applier: move pending_token_metadata to locator db: always use _tablet_hint as condition for tablet metadata change db: refactor new_token_metadata into pending_token_metadata db: rename new_token_metadata to pending_token_metadata db: schema_applier: move types storage init to merge_types func db: schema_applier: make merge functions non-static members db: remove unused proxy from create_keyspace_metadata	2025-10-16 21:43:49 +02:00
Gleb Natapov	c255740989	schema: Allow configuring consistency setting for a keyspace We want to add strongly consistent tables as an option. We will have two kinds of strongly consistent tables: globally consistent and locally consistent. The former means that requests from all DCs will be globally linearisable, while the latter means that only requests to the same DC will be linearisable. To allow configuring all the possibilities, the patch adds a new parameter to a keyspace definition, "consistency", that can be configured to be `eventual`, `global` or `local`. A non-eventual setting is supported for tablets-enabled keyspaces only. Since we want to start with implementing local consistency, configuring global consistency will result in an error for now.	2025-10-16 13:34:49 +03:00
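The validation rules stated above fit in a few lines. This is only a sketch of the described behaviour: the function name, the exception types and the error messages are hypothetical, and the real check lives in the keyspace DDL handling.

```python
VALID_CONSISTENCY = ("eventual", "global", "local")


def validate_consistency(consistency, tablets_enabled):
    """Check a keyspace "consistency" setting against the rules above."""
    if consistency not in VALID_CONSISTENCY:
        raise ValueError(f"unknown consistency mode: {consistency!r}")
    if consistency == "eventual":
        return
    if not tablets_enabled:
        # Non-eventual modes are supported for tablets-enabled keyspaces only.
        raise ValueError("strong consistency requires a tablets-enabled keyspace")
    if consistency == "global":
        # Only local consistency is being implemented for now.
        raise NotImplementedError("global consistency is not supported yet")
```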
Marcin Maliszkiewicz	47dba4203a	db: schema_applier: update tablet metadata atomically Previously, the mutable_token_metadata_ptr containing tablet changes was replicated to all cores in the post_commit phase, which violated the atomicity guarantee of schema_applier. Now it's incorporated into the per-shard commit phase. It uses the service::schema_getter abstraction introduced in an earlier commit to inject a "pending" schema which is not yet visible to the whole system.	2025-10-16 10:56:50 +02:00
Marcin Maliszkiewicz	e5fffa158f	db: replica: move tables_metadata locking to commit This keeps the locking scope minimal, and since unlocking is done in commit(), locking fits here as well.	2025-10-16 10:56:10 +02:00
Nadav Har'El	921d07a26b	cql: make SELECT's "internal page size" configurable In some uses of SELECT, such as aggregation (sum() et al.), GROUP BY or secondary index, it needs to perform internal scans. It uses an "internal page size" which before this patch was always DEFAULT_COUNT_PAGE_SIZE = 10000. There was an ad-hoc and undocumented way to override this default in C++ tests, using functions in test/lib/select_statement_utils.hh, but it was so non-obvious that the test that most needed to override this default - the very slow test test_indexing_paging_and_aggregation which would have been much faster with a lower setting - never used it. So in this patch we replace the ad-hoc configuration functions by a bona-fide Scylla configuration option named "select_internal_page_size". The few C++ tests that used the old configuration functions were modified to use the new configuration parameter. The slow test test_indexing_paging_and_aggregation still doesn't use the new configuration to become faster - we'll do this in the next patch. Another benefit of having this "internal page size" as a configuration option is that one day a user might realize that the default choice 10,000 is bad for some reason (which I can't envision right now), so having it configurable might come in handy. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-10-15 18:42:09 +03:00
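To make the role of the "internal page size" concrete, here is a toy model of an aggregation that scans its input page by page. It is not how Scylla executes SELECT; `paged_sum` is a hypothetical helper that only shows that the page size changes the number of internal pages fetched, not the result.

```python
def paged_sum(values, page_size):
    """Sum `values` by scanning them in pages of `page_size` rows.

    Returns (total, pages_fetched).
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    total = 0
    pages = 0
    for start in range(0, len(values), page_size):
        page = values[start:start + page_size]
        total += sum(page)
        pages += 1
    return total, pages
```

A test that wants to exercise paging logic can use a small page size and a small table instead of inserting more than 10,000 rows, which is the motivation given for making the value configurable.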
Gleb Natapov	eb9112a4a2	db: experimental consistent-tablets option The option will be used to hide the consistent tablets feature until it is ready.	2025-10-15 11:27:10 +03:00
Marcin Maliszkiewicz	b0f11b6d91	db: schema_applier: unify token_metadata loading Putting it into a single place gives more clarity on how _pending_token_metadata is made and avoids extra per shard copy when tablets change.	2025-10-14 10:56:37 +02:00
Marcin Maliszkiewicz	d67632bfe2	replica: schema_applier: obtain copy of token_metadata at the beginning of schema merge This copy is now used during the whole duration of schema merge. If it changes due to tablet_hint then it's replicated to all shards as before.	2025-10-14 10:56:36 +02:00
Marcin Maliszkiewicz	46bff28a38	db: schema_applier: move pending_token_metadata to locator It never belonged to tables and views and its placement stems from the location of the _tablet_hint handling code. In the following commits we'll reference it in storage_service.cc.	2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz	1a539f7151	db: always use _tablet_hint as condition for tablet metadata change When all schema_applier code uses this condition it's easier to grep than when we use different, derived conditions.	2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz	c112916215	db: refactor new_token_metadata into pending_token_metadata It prepares pending_token_metadata to handle both new and copy of existing metadata for consistent usage in later commit. It also adds shared_token_metatada getter so that we don't need to get it from db.	2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz	668231d97c	db: rename new_token_metadata to pending_token_metadata Part of the refactor done in following commit. Separated for easier review.	2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz	0c4c995c0d	db: schema_applier: move types storage init to merge_types func The merge_types function groups operations related to types, and the types storage fits this group.	2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz	794d68e44c	db: schema_applier: make merge functions non-static members This is a mechanical change which simplifies the code. The schema_applier class is an object which holds the intermediate state of schema merging, so it's fine that all schema merging functions have access to this state.	2025-10-14 10:56:25 +02:00
Marcin Maliszkiewicz	209563f478	db: remove unused proxy from create_keyspace_metadata	2025-10-14 10:56:25 +02:00
Dawid Mędrek	7d017748ab	db/commitlog: Extend segment truncation error messages We include more relevant information for debugging purposes: the remaining bytes and the size. It might be useful to determine where exactly an error occurred and help reason about it. Closes scylladb/scylladb#26486	2025-10-13 17:42:31 +03:00
Calle Wilund	144b550e4f	object_storage_client: Add object_name wrapper type Remaining S3-centric, but abstracting the object name to possible implementations not quite formatted the same.	2025-10-13 08:53:25 +00:00
Calle Wilund	5d4558df3b	sstables: Use object_storage_client for remote storage Replaces direct s3 interfaces with the abstraction layer, and opens the door to multiple implementations/backends	2025-10-13 08:53:25 +00:00
Calle Wilund	14dada350a	object_storage_endpoint_param: Add gs storage as option	2025-10-13 08:53:24 +00:00
Calle Wilund	78d9dda060	config: break out object_storage_endpoint_param preparing for multi storage Moves the config wrapper to own file (to reduce recompilation for modifying) and refactors to handle extending this parameter to non-s3 endpoint configs.	2025-10-13 08:53:24 +00:00
Botond Dénes	24c6476f73	mutation/mutation_compactor: add tombstone_gc_state to query ctor So tombstones can be purged correctly based on the tombstone gc mode. Currently, if repair-mode is used, tombstones are not purged at all, which can lead to purged tombstones being re-replicated via read-repair to replicas which have already purged them. This is not a correctness problem: tombstones are not included in the data query result or digest, so these purgeable tombstones are only a nuisance for read repair, where they can create extra differences between replicas. Note that for the read repair to trigger, some difference other than in purgeable tombstones has to exist, because as mentioned above, these are not included in digests. Fixes: scylladb/scylladb#24332 Closes scylladb/scylladb#26351	2025-10-12 17:48:15 +03:00
Botond Dénes	d9c3772e20	service/storage_proxy: send batches with CL=EACH_QUORUM Batches that fail on the initial send are retired later, until they succeed. These retires happen with CL=ALL, regardless of what the original CL of the batch was. This is unnecessarily strict. We tried to follow Cassandra here, but Cassandra has a big caveat in their use of CL=ALL for batches. They accept saving just a hint for any/all of the endpoints, so a batch which was just logged in hints is good enough for them. We do not plan on replicating this usage of hints at this time, so as a middle ground, the CL is changed to EACH_QUORUM. Fixes: scylladb/scylladb#25432 Closes scylladb/scylladb#26304	2025-10-12 17:18:41 +03:00
Piotr Dulikowski	0b800aab17	Merge 'db/view/view_building_worker: move `discover_existing_staging_sstables()` to the foreground' from Michał Jadwiszczak db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground This patch moves `discover_existing_staging_sstables()` to be executed from main level, instead of running it on the background fiber. This method needs to be run only once during the startup to collect existing staging sstables, so there is no need to do it in the background. This change will increase debuggability of any further issues related to it (like https://github.com/scylladb/scylladb/issues/26403). Fixes https://github.com/scylladb/scylladb/issues/26417 The patch should be backported to 2025.4 Closes scylladb/scylladb#26446 * github.com:scylladb/scylladb: db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground db/view/view_building_worker: futurize and rename `start_background_fibers()`	2025-10-09 18:24:50 +02:00
Michał Jadwiszczak	8d0d53016c	db/view/view_building_worker: update state again if some batch was finished during the update There was a race between the loop in `view_building_worker::run_view_building_state_observer()` and the moment when a batch finishes its work (`.finally()` callback in `view_building_worker::batch::start()`). The state observer waits on the `_vb_state_machine.event` CV and when it's awoken, it takes the group0 read apply mutex and updates its state. While updating the state, the observer looks at the `batch::state` field and reacts to it accordingly. On the other hand, when a batch finishes its work, it sets the `state` field to `batch_state::finished` and does a broadcast on the `_vb_state_machine.event` CV. So if the batch executes the callback in `.finally()` while the observer is updating its state, the observer may miss the event on the CV and never notice that the batch has finished. This patch fixes this by adding a `some_batch_finished` flag. Even if the worker doesn't see an event on the CV, it will notice that the flag was set and run another iteration. Fixes scylladb/scylladb#26204 Closes scylladb/scylladb#26289	2025-10-09 18:17:22 +02:00
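The lost wakeup described in 8d0d53016c can be reproduced in miniature. The following is only a sketch under stated assumptions: it uses Python's `asyncio.Condition` as a stand-in for the condition variable, the sleeps simulate an observer update that yields, and `run_observer`, `finish_batch`, and `observer` are hypothetical names. Only the `some_batch_finished` flag name comes from the commit message, and where the real patch sets or clears it is not shown there.

```python
import asyncio

async def run_observer(use_flag: bool) -> int:
    """Return how many state updates the observer performed."""
    cond = asyncio.Condition()
    some_batch_finished = False
    updates = 0

    async def finish_batch():
        nonlocal some_batch_finished
        await asyncio.sleep(0.01)      # finishes while the observer is mid-update
        some_batch_finished = True
        async with cond:
            cond.notify_all()          # nobody is waiting yet, so the wakeup is lost

    async def observer():
        nonlocal some_batch_finished, updates
        while True:
            some_batch_finished = False
            await asyncio.sleep(0.05)  # simulated state update that yields
            updates += 1
            if use_flag and some_batch_finished:
                continue               # a batch finished during the update: go again
            async with cond:
                await cond.wait()      # without the flag, this can block forever

    batch = asyncio.create_task(finish_batch())
    obs = asyncio.create_task(observer())
    await asyncio.sleep(0.3)           # give the scenario time to play out
    obs.cancel()
    await asyncio.gather(batch, obs, return_exceptions=True)
    return updates
```

Without the flag the observer performs one update and then waits on a notification that already happened; with the flag it notices the finished batch and performs a second update before waiting.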
Michał Jadwiszczak	84e4e34d81	db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground This patch moves `discover_existing_staging_sstables()` to be executed from main level, instead of running it on the background fiber. This method needs to be run only once during the startup to collect existing staging sstables, so there is no need to do it in the background. This change will increase debuggability of any further issues related to it (like scylladb/scylladb#26403). Fixes scylladb/scylladb#26417	2025-10-08 11:16:07 +02:00
Michał Jadwiszczak	575dce765e	db/view/view_building_worker: futurize and rename `start_background_fibers()` Next commit will move `discover_existing_staging_sstables()` to the foreground, so to prepare for this we need to futurize `start_background_fibers()` method and change its name to better reflect its purpose.	2025-10-08 10:19:41 +02:00
Andrzej Jackowski	8953f96609	system_keyspace: add service_level_driver_created This commit extends the system.scylla_local table with an additional key/value pair that can be used later in this patch series to record that `sl:driver` was already created. The purpose of storing this information is to ensure that `sl:driver` is not recreated after being intentionally removed. A new mutation is included in `register_raft_pull_snapshot` to keep `service_level_driver_created` in the state machine snapshot, which is required for proper propagation of the value when a new node is added to the cluster. Refs: scylladb/scylladb#24411	2025-10-08 08:24:23 +02:00
Piotr Dulikowski	380f243986	Merge 'Support replication factor rack list for tablet-based keyspaces' from Tomasz Grabiec This change extends the CQL replication options syntax so the replication factor can be stated as a list of rack names. For example: { 'mydatacenter': [ 'myrack1', 'myrack2', 'myrack4' ] } Rack-list based RF can coexist with the old numerical RF, even in the same keyspace for different DCs. Specifying the rack list also allows adding replicas on the specified racks (increasing the replication factor), or removing replicas from certain racks (by omitting them from the current datacenter rack-list). This will allow us to keep the keyspace rf-rack-valid, maintaining guarantees, while allowing adding/removing racks. In particular, this will allow us to add a new DC, which happens by incrementally increasing RF in that DC to cover existing racks. Migration from numerical RF to rack-list is not supported yet. Migration from rack-list to numerical RF is not planned to be supported. New feature, no backport required. 
Co-authored with @bhalevy Fixes https://github.com/scylladb/scylladb/issues/25269 Fixes https://github.com/scylladb/scylladb/issues/23525 Closes scylladb/scylladb#26358 * github.com:scylladb/scylladb: tablets: load_balancer: Recognize that tablets are confined to racks when computing desired tablet count locator: Make hasher for endpoint_dc_rack globally accessible test: tablets: Add test for replica allocation on rack list changes test: lib: topology_builder: generate unique rack names test: Add tests for rack list RF doc: Document rack-list replication factor topology_coordinator: Restore formatting topology_coordinator: Cancel keyspace alter on broader set of errors topology_coordinator: Make keyspace alter process options through as_ks_metadata_update() cql3: ks_prop_defs: Preserve old options cql3: ks_prop_defs: Introduce flattened() locator: Recognize rack list RF as valid in assert_rf_rack_valid_keyspace() tablet_allocator: Respect binding replicas to racks locator: network_topology_strategy: Respect rack list when reallocating tablets cql3: ks_prop_defs: Fail with more information when options are not in expected format locator, cql3: Support rack lists in replication options cql3: Fail early on vnode/tablet flavor alter cql3: Extract convert_property_map() out of Cql.g schema: Use definition from the header instead of open-coding it locator: Abstract obtaining the number of replicas from replication_strategy_config_option cql3, locator: Use type aliases for option maps locator: Add debug logging locator: Pass topology to replication strategy constructor abstract_replication_strategy, network_topology_strategy: add replication_factor_data class	2025-10-06 14:14:09 +02:00
Piotr Dulikowski	e7907b173a	Merge 'db/view: Require rf_rack_valid_keyspaces when creating materialized view' from Dawid Mędrek Materialized views are currently in the experimental phase and using them in tablet-based keyspaces requires starting Scylla with an experimental feature, `views-with-tablets`. Any attempts to create a materialized view or secondary index when it's not enabled will fail with an appropriate error. After considerable effort, we're drawing close to bringing views out of the experimental phase, and the experimental feature will no longer be needed. However, materialized views in tablet-based keyspaces will still be restricted, and creating them will only be possible after enabling the configuration option `rf_rack_valid_keyspaces`. That's what we do in this PR. In this patch, we adjust existing tests in the tree to work with the new restriction. That shouldn't have been necessary because we've already seemingly adjusted all of them to work with the configuration option, but some tests hid well. We fix that mistake now. After that, we introduce the new restriction. What's more, when starting Scylla, we verify that there is no materialized view that would violate the contract. If there are some that do, we list them, notify the user, and refuse to start. High-level implementation strategy: 1. Name the restrictions in form of a function. 2. Adjust existing tests. 3. Restrict materialized views by both the experimental feature and the configuration option. Add validation test. 4. Drop the requirement for the experimental feature. Adjust the added test and add a new one. 5. Update the user documentation. Fixes scylladb/scylladb#23030 Backport: 2025.4, as we are aiming to support materialized views for tablets from that version. 
Closes scylladb/scylladb#25802 * github.com:scylladb/scylladb: view: Stop requiring experimental feature db/view: Verify valid configuration for tablet-based views db/view: Require rf_rack_valid_keyspaces when creating view test/cluster/random_failures: Skip creating secondary indexes test/cluster/mv: Mark test_mv_rf_change as skipped test/cluster: Adjust MV tests to RF-rack-validity test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces db/view: Name requirement for views with tablets	2025-10-06 12:46:46 +02:00
Tomasz Grabiec	66755db062	locator, cql3: Support rack lists in replication options Allows per-DC replication factor to be either a string, holding a numerical value, or a list of strings, holding a list of rack names. The rack list is not respected yet by the tablet allocator; this is achieved in a subsequent commit. This changes the format of options stored in the flattened map in system_schema.keyspaces#replication. Values which are rack lists are converted into multiple entries, with the list index appended to the key with ':' as the separator. For example, this extended map: { 'dc1': '3', 'dc2': ['rack1', 'rack2'] } is stored as a flattened map: { 'dc1': '3', 'dc2:0': 'rack1', 'dc2:1': 'rack2' } Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>	2025-10-02 19:42:39 +02:00
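The flattening scheme in 66755db062 can be illustrated with a short Python sketch. This is not the ScyllaDB implementation (which is C++); the function names `flatten_replication_options` and `unflatten_replication_options` are hypothetical, and the inverse direction assumes that any key ending in `:<digits>` is a rack-list entry, which the commit message does not state.

```python
def flatten_replication_options(options):
    """Turn rack-list values into 'key:index' entries; keep string values as-is."""
    flat = {}
    for key, value in options.items():
        if isinstance(value, list):
            for index, rack in enumerate(value):
                flat[f"{key}:{index}"] = rack
        else:
            flat[key] = value
    return flat


def unflatten_replication_options(flat):
    """Inverse of the above, assuming 'key:<digits>' always marks a rack-list entry."""
    options = {}
    rack_entries = {}
    for key, value in flat.items():
        name, sep, index = key.rpartition(":")
        if sep and index.isdigit():
            rack_entries.setdefault(name, {})[int(index)] = value
        else:
            options[key] = value
    for name, by_index in rack_entries.items():
        # Rebuild each rack list in index order.
        options[name] = [by_index[i] for i in sorted(by_index)]
    return options


extended = {"dc1": "3", "dc2": ["rack1", "rack2"]}
flattened = flatten_replication_options(extended)
```

Applied to the extended map from the commit message, `flattened` matches the flattened map shown there, and `unflatten_replication_options(flattened)` recovers the original.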
Tomasz Grabiec	91e51a5dd1	cql3, locator: Use type aliases for option maps In preparation for changing their structure. 1) std::map<sstring, sstring> -> replication_strategy_config_options Parsed options. Values will become std::variant<sstring, rack_list> 2) std::map<sstring, sstring> -> property_definitions::map_type Flattened map of options, as stored in system tables.	2025-10-01 16:06:51 +02:00
Dawid Mędrek	b409e85c20	view: Stop requiring experimental feature We modify the requirements for using materialized views in tablet-based keyspaces. Before, it was necessary to enable the configuration option `rf_rack_valid_keyspaces`, have the cluster feature `VIEWS_WITH_TABLETS` enabled, and use the experimental feature `views-with-tablets`. We drop the last requirement. We adjust code to that change and provide a new validation test. We also update the user documentation to reflect the changes. Fixes scylladb/scylladb#23030	2025-10-01 09:01:53 +02:00

4566 Commits