scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-28 20:27:03 +00:00

Author	SHA1	Message	Date
Benny Halevy	b5d537283d	database: truncate_table_on_all_shards: drop outdated TODO comment The comment was added in `83323e155e`. Since then, table::seal_active_memtable was improved to guarantee waiting on outstanding flushes on success (see `d55a2ac762`), so we can remove this TODO comment (it is also not covered by any issue, so nobody plans to ever work on it). Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `93b827c185`)	2026-01-09 08:35:00 +02:00
Benny Halevy	fd9ad9a11c	database: truncate_table_on_all_shards: consider can_flush on all shards can_flush might return a different value for each shard so check it right before deciding whether to flush or clear a memtable shard. Note that under normal condition can_flush would always return true now that it checks only the presence of the seal memtable function rather than check memtable_list::empty(). Fixes #27639 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `2a803d2261`)	2026-01-09 08:34:31 +02:00
Benny Halevy	4968ea4ab6	memtable_list: unify can_flush and may_flush Now that we have a unit test proving that it's safe to flush an empty memtable list there is no need to distinguish between may_flush and can_flush. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `02ee341a03`)	2026-01-09 08:31:35 +02:00
Avi Kivity	3f343d70e4	database: fix overflow when computing data distribution over shards We store the per-shard chunk count in a uint64_t vector global_offset, and then convert the counts to offsets with a prefix sum: ```c++ // [1, 2, 3, 0] --> [0, 1, 3, 6] std::exclusive_scan(global_offset.begin(), global_offset.end(), global_offset.begin(), 0, std::plus()); ``` However, std::exclusive_scan takes the accumulator type from the initial value, 0, which is an int, instead of from the range being iterated, which is of uint64_t. As a result, the prefix sum is computed as a 32-bit integer value. If it exceeds 0x8000'0000, it becomes negative. It is then extended to 64 bits and stored. The result is a huge 64-bit number. Later on we try to find an sstable with this chunk and fail, crashing on an assertion. An example of the failure can be seen here: https://godbolt.org/z/6M8aEbo57 The fix is simple: the initial value is passed as uint64_t instead of int. Fixes https://github.com/scylladb/scylladb/issues/27417 Closes scylladb/scylladb#27418 (cherry picked from commit `9696ee64d0`)	2025-12-04 20:18:13 +02:00
Aleksandra Martyniuk	8a3932e4d9	replica: database: change type of tables_metadata::_ks_cf_to_uuid If there are a lot of tables, a node reports an oversized allocation in _ks_cf_to_uuid of type flat_hash_map. Change the type to std::unordered_map to prevent oversized allocations. Fixes: https://github.com/scylladb/scylladb/issues/26787. Closes scylladb/scylladb#27165 (cherry picked from commit `19a7d8e248`) Closes scylladb/scylladb#27198	2025-12-03 12:24:29 +03:00
Calle Wilund	2bbf3cf669	system_keyspace: Prune dropped tables from truncation on start/drop Fixes #25683 Once a table drop is complete, there should be no reason to retain truncation records for it, as any replay should skip mutations anyway (no CF), and iff we somehow resurrect a dropped table, this replay-resurrected data is the least problem anyway. Adds a prune phase to the startup drop_truncation_rp_records run, which ignores updating, and instead deletes records for non-existent tables (which should patch any existing servers with lingering data as well). Also does an explicit delete of records on actual table DROP, to ensure we don't grow this table more than needed even in long uptime nodes. Small unit test included. Closes scylladb/scylladb#25699 (cherry picked from commit `bc20861afb`) Closes scylladb/scylladb#25815	2025-09-05 19:02:39 +03:00
Dawid Mędrek	9652a1260f	main: Log RF-rack-invalid keyspaces at startup When the configuration option `rf_rack_valid_keyspaces` is enabled and there is an RF-rack-invalid keyspace, starting a node fails. However, when the configuration option is disabled, but there still is a keyspace that violates the condition, we'd like Scylla to print a warning informing the user about the fact. That's what happens in this commit. We provide a validation test. (cherry picked from commit `837d267cbf`)	2025-08-22 14:31:49 +00:00
Ferenc Szili	0248f555da	truncate: change check for write during truncate into a log warning TRUNCATE TABLE performs a memtable flush and then discards the sstables of the table being truncated. It collects the highest replay position for both of these. When the highest replay position of the discarded sstables is higher than the highest replay position of the flushed memtable, that means that we have had writes during truncate which have been flushed to disk independently of the truncate process. We check for this and trigger an on_internal_error() which throws an exception, informing the user that writing data concurrently with TRUNCATE TABLE is not advised. The problem with this is that truncate is also called from DROP KEYSPACE and DROP TABLE. These are raft operations and exceptions thrown by them are caught by the (...) exception handler in the raft applier fiber, which then exits leaving the node without the ability to execute subsequent raft commands. This commit changes the on_internal_error() into a warning log entry. It also outputs the keyspace/table names, the truncated_at timepoint, and the offending replay positions which caused the check to fail. Fixes: #25173 Fixes: #25013 (cherry picked from commit `268ec72dc9`)	2025-08-06 00:52:15 +00:00
Benny Halevy	c8043e05c1	replica: database: get and expose a mutable locator::shared_token_metadata Prepare for the next patch, which will use this shared_token_metadata to make mutable_token_metadata_ptr:s Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `493a2303da`)	2025-07-07 09:27:06 +03:00
Botond Dénes	ebd9420687	sstables: add corrupt_data_handler to sstables::sstables Similar to how large_data_handler is handled, propagate through sstables::sstables_manager and store its owner: replica::database. Tests and tools are also patched. Mostly mechanical changes, updating constructors and patching callers.	2025-06-25 08:41:26 +03:00
Avi Kivity	cd79a8fc25	Revert "Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz" This reverts commit `0b516da95b`, reversing changes made to `30199552ac`. It breaks cluster.random_failures.test_random_failures.test_random_failures in debug mode (at least). Fixes #24513	2025-06-16 22:38:12 +03:00
Tomasz Grabiec	0b516da95b	Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft. Pulling `database::apply()` out of schema merging code will allow to batch changes to subsystems. Future generic code will first call `prepare()` on all implementations, then single `database::apply()` and then `update()` on all implementations, then on each shard it will call `commit()` for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then `post_commit()`. Backport: no, it's a new feature Fixes: https://github.com/scylladb/scylladb/issues/19649 Closes scylladb/scylladb#20853 * github.com:scylladb/scylladb: storage_service: always wake up load balancer on update tablet metadata db: schema_applier: call destroy also when exception occurs db: replica: simplify seeding ERM during shema change db: remove cleanup from add_column_family db: abort on exception during schema commit phase db: make user defined types changes atomic replica: db: make keyspace schema changes atomic db: atomically apply changes to tables and views replica: make truncate_table_on_all_shards get whole schema from table_shards service: split update_tablet_metadata into two phases service: pull out update_tablet_metadata from migration_listener db: service: add store_service dependency to schema_applier service: simplify load_tablet_metadata and update_tablet_metadata db: don't perform move on tablet_hint reference replica: split add_column_family_and_make_directory into steps replica: db: split drop_table into steps db: don't move map references in merge_tables_and_views() db: introduce commit_on_shard function db: access types during schema merge via special storage replica: make non-preemptive keyspace create/update/delete functions 
public replica: split update keyspace into two phases replica: split creating keyspace into two functions db: rename create_keyspace_from_schema_partition db: decouple functions and aggregates schema change notification from merging code db: store functions and aggregates change batch in schema_applier db: decouple tables and views schema change notifications from merging code db: store tables and views schema diff in schema_applier db: decouple user type schema change notifications from types merging code service: unify keyspace notification functions arguments db: replica: decouple keyspace schema change notifications to a separate function db: add class encapsulating schema merging	2025-06-10 13:45:32 +02:00
Raphael S. Carvalho	2d716f3ffe	replica: Fix truncate assert failure Truncate doesn't really go well with concurrent writes. The fix (#23560) exposed a preexisting fragility which I missed. 1) truncate gets RP mark X, truncated_at = second T 2) new sstable written during snapshot or later, also at second T (difference of MS) 3) discard_sstables() get RP Y > saved RP X, since creation time of sstable with RP Y is equal to truncated_at = second T. So the problem is that truncate is using a clock of second granularity for filtering out sstables written later, and after we got low mark and truncate time, it can happen that a sstable is flushed later within the same second, but at a different millisecond. By switching to a millisecond clock (db_clock), we allow sstables written later within the same second from being filtered out. It's not perfect but extremely unlikely a new write lands and get flushed in the same millisecond we recorded truncated_at timepoint. In practice, truncate will not be used concurrently to writes, so this should be enough for our tests performing such concurrent actions. We're moving away from gc_clock which is our cheap lowres_clock, but time is only retrieved when creating sstable objects, which frequency of creation is low enough for not having significant consequences, and also db_clock should be cheap enough since it's usually syscall-less. Fixes #23771. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#24426	2025-06-08 15:59:15 +03:00
Marcin Maliszkiewicz	547bb1f663	db: replica: simplify seeding ERM during schema change We know that the caller is running on shard 0 so we can avoid some extra boilerplate.	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	97cdb72d4d	db: remove cleanup from add_column_family Since we abort now on failure during schema commit there is no need for cleanup as it only manages in-memory state. Explicit cf.stop was added to code paths outside of schema merging to avoid unnecessary regressions.	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	5b2e4140cc	replica: db: make keyspace schema changes atomic Now all keyspace related schema changes are observable on given shard as they would be applied atomically. This is achieved by commit_on_shard() function being non-preemptive (no futures, no co_awaits). In the future we'll extend this to the whole schema and also other subsystems.	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	556e89bc9d	db: atomically apply changes to tables and views In this commit we make use of the split functions introduced before. The pattern is as follows: - in merge_tables_and_views we call some preparatory functions - in schema_applier::update we call non-yielding step - in schema_applier::post_commit we call cleanups and other finalizing async functions Additionally we introduce frozen_schema_diff because converting schema_ptr to global_schema_ptr triggers schema registration and with atomic changes we need to place registration only in commit phase. Schema freezing is the same method global_schema_ptr uses to transport schema across shards (via schema_registry cache).	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	a27776b4ff	replica: make truncate_table_on_all_shards get whole schema from table_shards Before for views and indexes it was fetching base schema from db (and couple other properties). This is a problem once we introduce atomic tables and views deletion (in the following commit). Because once we delete table it can no longer be fetched from db object, and truncation is performed after atomically deleting all relevant tables/views/indexes. Now the whole relevant schema will be fetched via global_table_ptr (table_shards) object.	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	1ad14f02f1	replica: split add_column_family_and_make_directory into steps This is similar work as for drop_table in previous commit. add_column_family_and_make_directory() behaves exactly the same as before but calls to it in schema_applier will be replaced by calls directly to split steps. Other usages will remain intact as they don't need atomicity (like creating system tables at startup).	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	141a5643e5	replica: db: split drop_table into steps This is done so that actual dropping can be an atomic step which could be composed with other schema operations, and eventually all subsystems modified via raft so that we could introduce atomic changes which span across different subsystems. We split drop_table_on_all_shards() into: - prepare_tables_metadata_change_on_all_shards() - prepare_drop_table_on_all_shards() - drop_table() - cleanup_drop_table_on_all_shards() prepare_tables_metadata_change_on_all_shards() is necessary because when applying multiple schema changes at once (e.g. drop and add tables) we need to lock only once. We add legacy_drop_table_on_all_shards() which behaves exactly like old drop_table_on_all_shards() to be compatible with code which doesn't need to play with atomicity. Usages of legacy_drop_table_on_all_shards() in schema_applier will be replaced with direct calls to split functions in the following commits - that's the place we will take advantage of drop_table not yielding (as it returns void now).	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	7f057af1f2	replica: make non-preemptive keyspace create/update/delete functions public As those operations will be managed by schema_applier class. This will be implemented in following commit.	2025-05-27 20:01:35 +02:00
Marcin Maliszkiewicz	2daa630938	replica: split update keyspace into two phases - first phase is preemptive (prepare_update_keyspace) - second phase is non-preemptive (update_keyspace) This is done so that schema change can be applied atomically. Additionally, the create keyspace code was changed to share a common part with the update keyspace flow. This commit doesn't yet change the behaviour of the code, as it doesn't guarantee atomicity, it will be done in following commits.	2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz	fe0f4033ca	replica: split creating keyspace into two functions This is done so that in following commits insert_keyspace can be used to atomically change schema (as it doesn't yield).	2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz	aceb1f9659	db: rename create_keyspace_from_schema_partition It only creates keyspace metadata.	2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz	d7202586ca	db: replica: decouple keyspace schema change notifications to a separate function In following commits we want to separate updating code from committing the schema change (making it visible). Since notifications should be issued after the change is visible we need to separate them and call them after committing. In subsequent commits other notification types will be moved too. We change here the order of notification calls with regard to the rest of the schema updating code. I.e., before, keyspace notifications were triggered before tables were updated; after the change, they will trigger once everything is updated. There is no indication that notification listeners depend on this behaviour.	2025-05-27 19:59:47 +02:00
Wojciech Mitros	5920647617	mv: remove queue length limit from the view update read concurrency semaphore Each view update is correlated to a write that generates it (aside from view building which is throttled separately). These writes are limited by a throttling mechanism, which effectively works by performing the writes with CL=ALL if ongoing writes exceed some memory usage limit When writes generate view updates, they usually also need to perform a read. This read goes through a read concurrency semaphore where it can get delayed or killed. The semaphore allows up to 100 concurrent reads and puts all remaining reads in a queue. If the number of queued reads exceeds a specific limit, the view update will fail on the replica, causing inconsistencies. This limit is not necessary. When a read gets queued on the semaphore, the write that's causing the view update is paused, so the write takes part in the regular write throttling. If too many writes get stuck on view update reads, they will get throttled, so their number is limited and the number of queued reads is also limited to the same amount. In this patch we remove the specified queue length limit for the view update read concurrency semaphore. Instead of this limit, the queue will be now limited indirectly, by the base write throttling mechanism. This may allow the queue grow longer than with the previous limit, but it shouldn't ever cause issues - we only perform up to 100 actual reads at once, and the remaining ones that get queued use a tiny amount of memory, less than the writes that generated them and which are getting limited directly. Fixes https://github.com/scylladb/scylladb/issues/23319 Closes scylladb/scylladb#24112	2025-05-14 18:29:30 +03:00
Botond Dénes	ca7f557e86	readers/multishard: drop v2 from reader and related names	2025-05-09 07:53:29 -04:00
Botond Dénes	7ba3c3fec3	readers/multi_range: remove flat from name	2025-05-09 07:53:25 -04:00
Nadav Har'El	262530f27c	Merge 'mv: make base_info in view schemas immutable' from Wojciech Mitros Currently, the base_info may or may not be set in view schemas. Even when it's set, it may be modified. This necessitates extra checks when handling view schemas, as well as potentially causing errors when we forget to set it at some point. Instead, we want to make the base info an immutable member of view schemas (inside view_info). To achieve this, in this series we remove all base_info members that can change due to a base schema update, and we calculate the remaining values during view update generation, using the most up-to-date base schema version. To calculate the values that depend on the base schema version, we need to iterate over the view primary key and find the corresponding columns, which adds extra overhead for each batch of view updates. However, this overhead should be relatively small, as when creating a view update, we need to prepare each of its columns anyway. And if we need to read the old value of the base row, the relative overhead is even lower. After this change, the base info in view schemas stays the same for all base schema updates, so we'll no longer get issues with base_info being incompatible with a base schema version. Additionally, it's a step towards making the schema objects immutable, which we sometimes incorrectly assumed in the past (they're still not completely immutable yet, as some other fields in view_info other than base_info are initialized lazily and may depend on the base schema version). 
Fixes https://github.com/scylladb/scylladb/issues/9059 Fixes https://github.com/scylladb/scylladb/issues/21292 Fixes https://github.com/scylladb/scylladb/issues/22194 Fixes https://github.com/scylladb/scylladb/issues/22410 Closes scylladb/scylladb#23337 * github.com:scylladb/scylladb: test: remove flakiness from test_schema_is_recovered_after_dying mv: add a test for dropping an index while it's building base_info: remove the lw_shared_ptr variant view_info: don't re-set base_info after construction base_info: remove base_info snapshot semantics base_info: remove base schema from the base_info schema_registry: store base info instead of base schema for view entries base_info: make members non-const view_info: move the base info to a separate header view_info: move computation of view pk columns not in base pk to view_updates view_info: move base-dependent variables into base_info view_info: set base info on construction	2025-04-27 19:12:12 +03:00
Wojciech Mitros	d7bd86591e	view_info: don't re-set base_info after construction In the previous commits we made sure that the base info is not dependent on the base schema version, and the info dependent on the base schema version is calculated when it's needed. In this patch we remove the unnecessary re-setting of the base_info. The set_base_info method isn't removed completely, because it also has a secondary function - zeroing the view_info fields other than base_info. Because of this, in this patch we rename it accordingly and limit its use to the updates caused by a base schema change.	2025-04-24 01:08:40 +02:00
Aleksandra Martyniuk	c1618c7de5	test: test table drop during flush	2025-04-23 14:29:28 +02:00
Aleksandra Martyniuk	91b57e79f3	replica: skip flush of dropped table	2025-04-23 14:29:28 +02:00
Nadav Har'El	6db666a1c1	replica: fix 10-second pause during shutdown As noticed in issue #23687, if we shut down Scylla while a paged read is in progress - or even a paged read that the client had no intention of ever resuming - the shutdown pauses for 10 seconds. The problem was the stop() order - we must stop the "querier cache" before we can close sstables - the "querier cache" is what holds paged readers alive waiting for clients to resume those reads, and while a reader is alive it holds on to sstables so they can't be closed. The querier cache's querier_cache::default_entry_ttl is set to 10 seconds, which is why the shutdown was un-paused after 10 seconds. The fix in this patch is obvious: We need to stop the querier cache (and have it release all the readers it was holding) before we close the sstables. Fixes #23687 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23770	2025-04-16 20:35:44 +03:00
Pavel Emelyanov	d9853efa7c	Merge '[Out-of-space prevention] db: backup: prioritize sstables that were deleted from the table' from Benny Halevy The motivation behind this change is to free up disk space as early as possible. The reason is that snapshot locks the space of all SSTables in the snapshot, and deleting from the table, for example, by compaction, or tablet migration, won't free up their capacity until they are uploaded to object storage and deleted from the snapshot. This series adds prioritization of deleted sstables in two cases: First, after the snapshot dir is processed, the list of SSTable generations is cross-referenced with the list of SSTables presently in the table and any generation that is not in the table is prioritized to be uploaded earlier. In addition, a subscription mechanism was added to sstables_manager and it is used in backup to prioritize SSTables that get deleted from the table directory during backup. This is particularly important when backup happens during high disk utilization (e.g. 90%). Without it, even if the cluster is scaled up and tablets are migrated away from the full nodes to new nodes, tablet cleanup might not free any space if all the tablet sstables are hardlinked to the snapshot taken for backup. 
* Enhancement, no backport needed Closes scylladb/scylladb#23241 * github.com:scylladb/scylladb: db: snapshot: backup_task: prioritize sstables deleted during upload sstables_manager: add subscriptions db: snapshot: backup_task: limit concurrency sstables: directory_semaphore: expose get_units db: snapshot: backup_task: add sharded sstables_manager database: expose get_sstables_manager(schema) db: snapshot: backup_task: do_backup: prioritize sstables that are already deleted from the table db: snapshot-ctl: pass table_id to backup_task db: snapshot-ctl: expose sharded db() getter db: snapshot: backup_task: do_backup: organize components by sstable generation db: snapshot: coroutinize backup_task db: snapshot: backup_task: refactor backup_file out of uploads_worker db: snapshot: backup_task: refactor uploads_worker out of do_backup db: snapshot: backup_task: process_snapshot_dir: initialize total progress utils/s3: upload_progress: init members to 0 db: snapshot: backup_task: do_backup: refactor process_snapshot_dir db: snapshot: backup_task: keep expection as member	2025-04-09 15:32:11 +03:00
Benny Halevy	b270d552fb	database: expose get_sstables_manager(schema) Return either the system or user sstables manager. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:07 +03:00
Tomasz Grabiec	06b49bdf69	Merge 'row_cache: don't garbage-collect tombstones which cover data in memtables' from Botond Dénes The row cache can garbage-collect tombstones in two places: 1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it; 2) During reads - reads now compact data including garbage collection; In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables. This PR includes fixes for (2), which were not handled at all currently. (1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916. But the test added in this PR showed that the test is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included. Fixes: https://github.com/scylladb/scylladb/issues/23291 Fixes: https://github.com/scylladb/scylladb/issues/23252 The fix will need backport to all live release. Closes scylladb/scylladb#23255 * github.com:scylladb/scylladb: test/boost/row_cache_test: add memtable overlap check tests replica/table: add error injection to memtable post-flush phase utils/error_injection: add a way to set parameters from error injection points test/cluster: add test_data_resurrection_in_memtable.py test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts replica/mutation_dump: don't assume cells are live replica/database: do_apply() add error injection point replica: improve memtable overlap checks for the cache replica/memtable: add is_merging_to_cache() db/row_cache: add overlap-check for cache tombstone garbage collection mutation/mutation_compactor: copy key passed-in to consume_new_partition()	2025-04-08 17:26:58 +02:00
Raphael S. Carvalho	0f59deffaa	replica: Fix truncate and drop table after tablet migration happens When running those operations after a tablet replica is migrated away from a shard, an assert can fail resulting in a crash. Status quo (around the assert in truncate procedure): 1) Highest RP seen by table is saved in low_mark, and the current time in low_mark_at. 2) Then compaction is disabled in order to not mix data written before truncate, and data written later. 3) Then memtable is flushed in order for the data written before truncate to be available in sstables and then removed. 4) Now, current time is saved in truncated_at, which is supposedly the time of truncate to decide which sstables to remove. Note: truncated_at is likely above low_mark_at due to steps 2 and 3. The interesting part of the assert is: (truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp) Note: RP in the assert above is the highest RP among all sstables generated before truncated_at. RP is retrieved by table::discard_sstables(). If truncated_at > low_mark_at, maybe newer data was written during steps 2 and 3, and memtable's RP becomes greater than low_mark, resulting in a SSTable with RP > low_mark. So assert's 2nd condition is there to defend against the scenario above. truncated_at and low_mark_at uses millisecond granularity, so even if truncated_at == low_mark_at, data could have been written in steps 2 and 3 (during same MS window), failing the assert. This is fragile. Reproducer: To reproduce the problem, truncated_at must be > low_mark_at, which can easily happen with both drop table and truncate due to steps 2 and 3. If a shard has 2 or more tablets, the table's highest RP refer to just one tablet in that shard. If the tablet with the highest RP is migrated away, then the sstables in that shard will have lower RP than the recorded highest RP (it's a table wide state, which makes sense since CL is shared among tablets). 
So when either drop table or truncate runs, low_mark will be potentially bigger than highest RP retrieved from sstables. Proposed solution: The current assert is hacked to not fail if writes sneak in, during steps 2 and 3, but it's still fragile and seems not to serve its real purpose, since it's allowing for RP > low_mark. We should be able to say that low_mark >= RP, as a way of asserting we're not leaving data targeted by truncate behind (or that we're not removing the wrong data). But the problem is that we're saving low_mark in step 1, before preparation steps (2 and 3). When truncated_at is recorded in step 4, it's a way of saying all data written so far is targeted for removal. But as of today, low_mark refers to all data written up to step 1. So low_mark is now only one set before issuing flush, and also accounts for all potentially flushed data. Fixes #18059. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#23560	2025-04-08 07:32:58 +03:00
Botond Dénes	cb76cafb60	replica/database: do_apply() add error injection point So writes (to user tables) can be failed on a replica, via error injection. Should simplify tests which want to create differences in what writes different replicas receive.	2025-04-08 00:11:35 -04:00
Botond Dénes	d126ea09ba	replica: improve memtable overlap checks for the cache The current memtable overlap check that is used by the cache -- table::get_max_purgeable_fn_for_cache_underlying_reader() -- only checks the active memtable, so memtables which are either being flushed or are already flushed and also have active reads against them do not participate in the overlap check. This can result in temporary data resurrection, where a cache read can garbage-collect a tombstone which still covers data in a flushing or flushed memtable, which still have active read against it. To prevent this, extend the overlap check to also consider all of the memtable list. Furthermore, memtable_list::erase() now places the removed (flushed) memtable in an intrusive list. These entries are alive only as long as there are readers still keeping an `lw_shared_ptr<memtable>` alive. This list is now also consulted on overlap checks.	2025-04-08 00:11:35 -04:00
Pavel Emelyanov	10376b5b85	db: Re-use database::snapshot_table_on_all_shards() There are two snapshot-on-all-shards methods on the database -- the one that snapshots a keyspace and the one that snapshots a vector of tables. The latter snapshots a single table with a neat helper, while the former has the helper open-coded. Re-using the helper in keyspace snapshot is worth it, but needs to patch the helper to work on uuid, rather than ks:cf pair of strings. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23532	2025-04-07 11:55:43 +02:00
Michał Chojnowski	d920ab5366	database: add sample_data_files() Add a helper for sampling the Data files for a given table. We will use it to take samples for dictionary training.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	30a9d471fa	sstables: plug an `sstable_compressor_factory` into `sstables_manager` Create a `sstable_compressor_factory_impl` in `scylla_main`, and pipe it through constructors into `sstables_manager`. In the next commits, the factory available through the `sstables_manager` will be used to create compressors for SSTable readers and writers.	2025-04-01 00:07:28 +02:00
Piotr Dulikowski	288216a89e	Merge 'Ignore wrapped exceptions `gate_closed_exception` and `rpc::closed_error` when node shuts down.' from Sergey Zolotukhin Normally, when a node is shutting down, `gate_closed_exception` and `rpc::closed_error` in `send_to_live_endpoints` should be ignored. However, if these exceptions are wrapped in a `nested_exception`, an error message is printed, causing tests to fail. This commit adds handling for nested exceptions in this case to prevent unnecessary error messages. Fixes scylladb/scylladb#23325 Fixes scylladb/scylladb#23305 Fixes scylladb/scylladb#21815 Backport: looks like this is quite a frequent issue, therefore backport to 2025.1. Closes scylladb/scylladb#23336 * github.com:scylladb/scylladb: database: Pass schema_ptr as const ref in `wrap_commitlog_add_error` database: Unify exception handling in `do_apply` and `apply_with_commitlog` storage_proxy: Ignore wrapped `gate_closed_exception` and `rpc::closed_error` when node shuts down. exceptions: Add `try_catch_nested` to universally handle nested exceptions of the same type.	2025-03-27 11:39:42 +01:00
Sergey Zolotukhin	d448f3de77	database: Pass schema_ptr as const ref in `wrap_commitlog_add_error`	2025-03-26 11:15:26 +01:00
Sergey Zolotukhin	0d9d0fe60e	database: Unify exception handling in `do_apply` and `apply_with_commitlog` Move exception wrapping logic from `do_apply` and `apply_with_commitlog` to `wrap_commitlog_add_error` to ensure consistent error handling.	2025-03-26 11:15:18 +01:00
Avi Kivity	7646e1448a	Merge 'cql3: Introduce RF-rack-valid keyspaces' from Dawid Mędrek This PR is an introductory step towards enforcing RF-rack-valid keyspaces in Scylla. The scope of changes: * defining RF-rack-valid keyspaces, * introducing a configuration option enforcing RF-rack-valid keyspaces, * restricting the CREATE and ALTER KEYSPACE statements so that they never lead to RF-rack invalid keyspaces, * during the initialization of a node, it verifies that all existing keyspaces are RF-rack-valid. If not, the initialization fails. We provide tests verifying that the changes behave as intended. --- Note that there are a number of things that still need to be implemented. That includes, for instance, restricting topology operations too. --- Implementation strategy (going beyond the scope of this PR): 1. Introduce the new configuration option `rf_rack_valid_keyspaces`. 2. Start enforcing RF-rack-validity in keyspaces if the option is enabled. 3. Adjust the tests: in the tree and out of it. Explicitly enable the option in all tests. 4. Once the tests have been adjusted, change the default value of the option to enabled. 5. Stop explicitly enabling the option in tests. 6. Get rid of the option. --- Fixes scylladb/scylladb#20356 Fixes scylladb/scylladb#23276 Fixes scylladb/scylladb#23300 --- Backport: this is part of the requirements for releasing 2025.1. Closes scylladb/scylladb#23138 * github.com:scylladb/scylladb: main: Refuse to start node when RF-rack-invalid keyspace exists cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces db/config: Introduce RF-rack-valid keyspaces	2025-03-20 19:10:36 +02:00
Dawid Mędrek	0e04a6f3eb	main: Refuse to start node when RF-rack-invalid keyspace exists When a node is started with the option `rf_rack_valid_keyspaces` enabled, the initialization will fail if there is an RF-rack-invalid keyspace. We want to force the user to adjust their existing keyspaces when upgrading to 2025.* so that the invariant that every keyspace is RF-rack-valid is always satisfied. Fixes scylladb/scylladb#23300	2025-03-19 15:13:44 +01:00
Botond Dénes	fda3486770	Merge 'Remove some excessive ks:cf -> table_id conversions in API and schema_tables' from Pavel Emelyanov The main goal of this PR was to remove the parse_tables() helpers from api/ in favor of the more flexible (yet equally complex) parse_table_infos(), but it turned out that it also saves some lookups in database maps. There are several places in API and schema_tables that have a table_id at hand, but at some point drop it and carry the keyspace and table names over to a place that maps ks:cf back to table_id and then uses it to find the table object. This PR keeps the table_id in those places with the help of the table_info struct. This change allows removing the aforementioned parse_tables() helpers from api/ and also saves a few lookups in database maps. Removing parse_tables() from api/ continues the previous effort to reduce the set of helpers in api/ code that help handlers "parse" keyspace and table names (see #22742, #21533). Closes scylladb/scylladb#23216 * github.com:scylladb/scylladb: api: Remove the remaining parse_tables() overload database: Sanitize flush_tables_on_all_shards() schema_tables: Remove all_table_names() database: Make tables flushing helper use table_info-s, not names api: Make keyspace flush endpoint use parse_table_infos() (and a bit more) schema_tables,client_state: Switch to using all_table_infos() schema_tables: Tune up some methods to benefit from table_infos schema_tables: Introduce all_table_infos()	2025-03-17 15:40:41 +02:00
Pavel Emelyanov	89f3c1a91e	database: Sanitize flush_tables_on_all_shards() The previous patch left this method with a few uglinesses: the vector<table_id> argument is named table_names, the sstring keyspace argument is unused, and the keyspace argument is captured for no use. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 13:13:10 +03:00
Pavel Emelyanov	c2d23d7948	database: Make tables flushing helper use table_info-s, not names The database::flush_tables_on_all_shards() method accepts a keyspace name and a vector of table names. It then converts the ks:cf pair for each table name into a table_id and flushes the table with that ID. All the callers of that method already have, or can easily get, the vector of table_id-s rather than just names, so make use of this. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 13:11:32 +03:00

547 Commits