scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-29 12:47:02 +00:00

Author	SHA1	Message	Date
Nadav Har'El	6b35eea1a9	Merge '[Backport 2025.1] Alternator batch rcu' from Scylladb[bot] This series adds support for reporting consumed capacity in BatchGetItem operations in Alternator. It includes changes to the RCU accounting logic, exposing internal functionality to support batch-specific behavior, and adds corresponding tests for both simple and complex use cases involving multiple tables and consistency modes. Need backporting to 2025.1, as RCU and WCU are not fully supported Fixes #23690 - (cherry picked from commit `0eabf8b388`) - (cherry picked from commit `88095919d0`) - (cherry picked from commit `3acde5f904`) Parent PR: #23691 Closes scylladb/scylladb#23790 * github.com:scylladb/scylladb: test_returnconsumedcapacity.py: test RCU for batch get item alternator/executor: Add RCU support for batch get items alternator/consumed_capacity: make functionality public	2025-04-17 21:39:58 +03:00
Benny Halevy	fd6c7c53b8	token_group_based_splitting_mutation_writer: maybe_switch_to_new_writer: prevent double close Currently, maybe_switch_to_new_writer resets _current_writer only in a continuation after closing the current writer. This leaves a window of vulnerability if close() yields, and token_group_based_splitting_mutation_writer::close() is called. Seeing the engaged _current_writer, close() will call _current_writer->close() - which must be called exactly once. Solve this when switching to a new writer by resetting _current_writer before closing it and potentially yielding. Fixes #22715 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#22922 (cherry picked from commit `29b795709b`) Closes scylladb/scylladb#22965	2025-04-17 12:59:55 +02:00
Amnon Heiman	9434bd81b3	test_returnconsumedcapacity.py: test RCU for batch get item This patch adds tests for consumed capacity in batch get item. It tests both the simple case and the multi-item, multi-table case that combines consistent and non-consistent reads. (cherry picked from commit `3acde5f904`)	2025-04-17 10:30:18 +00:00
Amnon Heiman	0761eacf68	alternator/executor: Add RCU support for batch get items This patch adds RCU support for batch get items. With batch requests, multiple objects are read from multiple tables. While the criterion for adding the units is per the batch request, the units are calculated per table—and so is the read consistency. (cherry picked from commit `88095919d0`)	2025-04-17 10:30:18 +00:00
Amnon Heiman	8bb4ee49da	alternator/consumed_capacity: make functionality public The consumed_capacity_counter is not completely applicable for batch operations. This patch makes some of its functionality public so that batch get item can use the components to decide if it needs to send consumed capacity in the reply, to get the half units used by the metrics and returned result, and to allow an empty constructor for the RCU counter. (cherry picked from commit `0eabf8b388`)	2025-04-17 10:30:18 +00:00
Avi Kivity	a89cdfc253	scylla-gdb: small-objects: fix for very small objects Because of rounding and alignment, there are multiple pools for small sizes (e.g. 4 for size 32). Because the pool selection algorithm ignores alignment, different pools can be chosen for different object sizes. For example, an object size of 29 will choose the first pool of size 32, while an object size of 32 will choose the fourth pool of size 32. The small-objects command doesn't know about this and always considers just the first pool for a given size. This causes it to miss out on sister pools. While it's possible to adjust pool selection to always choose one of the pools, it may eat a precious cycle. So instead let's compensate in the small-objects command. Instead of finding one pool for a given size, find all of them, and iterate over all those pools. Fixes #23603 Closes scylladb/scylladb#23604 (cherry picked from commit `b4d4e48381`) Closes scylladb/scylladb#23749	2025-04-16 14:37:43 +03:00
Botond Dénes	998bfe908f	Merge '[Backport 2025.1] Fix EAR not applied on write to S3 (but on read).' from Scylladb[bot] Fixes #23225 Fixes #23185 Adds a "wrap_sink" (with default implementation) to sstables::file_io_extension, and moves extension wrapping of file and sink objects to storage level. (Wrapping/handling on sstable level would be problematic, because for file storage we typically re-use the sstable file objects for sinks, whereas for S3 we do not). This ensures we apply encryption on both read and write, whereas we previously only did so on read -> fail. Adds io wrapper objects for adapting file/sink for default implementation, as well as a proper encrypted sink implementation for EAR. Unit tests for io objects and a macro test for S3 encrypted storage included. - (cherry picked from commit `98a6d0f79c`) - (cherry picked from commit `e100af5280`) - (cherry picked from commit `d46dcbb769`) - (cherry picked from commit `e02be77af7`) - (cherry picked from commit `9ac9813c62`) - (cherry picked from commit `5c6337b887`) Parent PR: #23261 Closes scylladb/scylladb#23424 * github.com:scylladb/scylladb: encryption: Add "wrap_sink" to encryption sstable extension encrypted_file_impl: Add encrypted_data_sink sstables::storage: Move wrapping sstable components to storage provider sstables::file_io_extension: Add a "wrap_sink" method. sstables::file_io_extension: Make sstable argument to "wrap" const utils: Add "io-wrappers", useful IO helper types	2025-04-16 09:32:23 +03:00
Calle Wilund	0eed7f8f29	encryption: Add "wrap_sink" to encryption sstable extension Creates a more efficient data_sink wrapper for encrypted output stream (S3). (cherry picked from commit `5c6337b887`)	2025-04-15 11:00:22 +00:00
Calle Wilund	f174b419a4	encrypted_file_impl: Add encrypted_data_sink Adds a sibling type to encrypted file, a data_sink, that will write a data stream in the same block format as a file object would. Including end padding. For making encrypted data sink writing less cumbersome. (cherry picked from commit `9ac9813c62`)	2025-04-15 11:00:22 +00:00
Calle Wilund	ac4c7a7ad2	sstables::storage: Move wrapping sstable components to storage provider Fixes #23225 Fixes #23185 Moved wrapping component files/sinks to storage provider. Also ensures to wrap data_sinks as well as actual files. This ensures that we actually write encryption if active. (cherry picked from commit `e02be77af7`)	2025-04-15 11:00:22 +00:00
Calle Wilund	6feb95ffad	sstables::file_io_extension: Add a "wrap_sink" method. Similar to wrap file, should wrap a data_sink (used for sstable writers), in obvious write-only, simple stream mode. Default impl will detect if we wrap files for this component, and if so, generate a file wrapper for the input sink, wrap this, and then wrap it in a file_data_sink_impl. This is obviously not efficient, so extensions used in actual non-test code should implement the method. (cherry picked from commit `d46dcbb769`)	2025-04-15 11:00:22 +00:00
Calle Wilund	b6ec0961ca	sstables::file_io_extension: Make sstable argument to "wrap" const This matches the signature of call sites. Since the only "real" extension to actually make a marker in the sstable will do so in the scylla component, which is writable even in a const sstable, this is ok. (cherry picked from commit `e100af5280`)	2025-04-15 10:36:47 +00:00
Calle Wilund	9a10458500	utils: Add "io-wrappers", useful IO helper types Mainly to add a somewhat functional file-impl wrapping a data_sink. This can implement a rudimentary, write-only, file based on any output sink. For testing, and because they fit there, place memory sink and source types there as well. (cherry picked from commit `98a6d0f79c`)	2025-04-15 10:36:47 +00:00
Pavel Emelyanov	263416201c	Merge '[Backport 2025.1] audit: add semaphore to audit_syslog_storage_helper' from Scylladb[bot] audit_syslog_storage_helper::syslog_send_helper uses Seastar's net::datagram_channel to write to syslog device (usually /dev/log). However, datagram_channel.send() is not fiber-safe (ref seastar#2690), so unserialized use of send() results in packets overwriting its state. This, in turn, causes a corruption of audit logs, as well as assertion failures. To work around the problem, a new semaphore is introduced in audit_syslog_storage_helper. As storage_helper is a member of sharded audit service, the semaphore allows for one datagram_channel.send() on each shard. Each audit_syslog_storage_helper stores its own datagram_channel, therefore concurrent sends to datagram_channel are eliminated. This change: - Moved syslog_send_helper to audit_syslog_storage_helper - Coroutinize audit_syslog_storage_helper - Introduce semaphore with count=1 in audit_syslog_storage_helper. See https://github.com/scylladb/scylla-dtest/pull/5749 for related dtest Fixes: scylladb/scylladb#22973 Backport to 2025.1 should be considered, as https://github.com/scylladb/scylladb/issues/22973 is known to cause crashes of 2025.1. - (cherry picked from commit `dbd2acd2be`) - (cherry picked from commit `889fd5bc9f`) - (cherry picked from commit `c12f976389`) Parent PR: #23464 Closes scylladb/scylladb#23674 * github.com:scylladb/scylladb: audit: add semaphore to audit_syslog_storage_helper audit: corutinize audit_syslog_storage_helper audit: moved syslog_send_helper to audit_syslog_storage_helper	2025-04-15 12:37:48 +03:00
Jenkins Promoter	42db149393	Update ScyllaDB version to: 2025.1.2	2025-04-15 12:13:36 +03:00
Pavel Emelyanov	4c382fbe7e	cql: Remove unused "initial_tablets" mention from guardrails All tablets configuration was moved into its own "with tablets" section, this option name cannot be met among replication factors. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23555 (cherry picked from commit `d4f3a3ee4f`) Closes scylladb/scylladb#23676	2025-04-15 11:01:50 +03:00
David Garcia	7588789b02	fix: openapi not rendering in docs.scylladb.com/manual Closes scylladb/scylladb#23686 (cherry picked from commit `cf11d5eb69`) Closes scylladb/scylladb#23710	2025-04-15 10:58:59 +03:00
Jenkins Promoter	a0faf0bde0	Update pgo profiles - aarch64	2025-04-15 04:33:44 +03:00
Jenkins Promoter	a503e74bf5	Update pgo profiles - x86_64	2025-04-15 04:10:13 +03:00
Botond Dénes	c1dce79847	Merge '[Backport 2025.1] Finalize tablet splits earlier' from Scylladb[bot] Resize finalization is executed in a separate topology transition state, `tablet_resize_finalization`, to ensure it does not overlap with tablet transitions. The topology transitions into the `tablet_resize_finalization` state only when no tablet migrations are scheduled or being executed. If there is a large load-balancing backlog, split finalization might be delayed indefinitely, leaving the tables with large tablets. This PR fixes the issue by updating the load balancer to not schedule any migrations and to not make any repair plans when a resize finalization is pending in any table. Also added a testcase to verify the fix. Fixes #21762 - (cherry picked from commit `8cabc66f07`) - (cherry picked from commit `5b47d84399`) - (cherry picked from commit `dccce670c1`) Parent PR: #22148 Closes scylladb/scylladb#23633 * github.com:scylladb/scylladb: topology_coordinator: fix indentation in generate_migration_updates topology_coordinator: do not schedule migrations when there are pending resize finalizations load_balancer: make repair plans only when there is no pending resize finalization	2025-04-14 06:44:57 +03:00
Botond Dénes	251db77fcb	mutation/frozen_mutation: frozen_mutation_consumer_adaptor: fix end-of-partition handling This adaptor adapts a mutation reader pausable consumer to the frozen mutation visitor interface. The pausable consumer protocol allows the consumer to skip the remaining parts of the partition and resume the consumption with the next one. To do this, the consumer just has to return stop_iteration::yes from one of the consume() overloads for clustering elements, then return stop_iteration::no from consume_end_of_partition(). Due to a bug in the adaptor, this sequence leads to terminating the consumption completely -- so any remaining partitions are also skipped. This protocol implementation bug has user-visible effects, when the only user of the adaptor -- read repair -- happens during a query which has limitations on the amount of content in each partition. There are two such queries: select distinct ... and select ... with partition limit. When converting the repaired mutation to a query result, these queries will trigger the skip sequence in the consumer and due to the above described bug, will skip the remaining partitions in the results, omitting these from the final query result. This patch fixes the protocol bug, the return value of the underlying consumer's consume_end_of_partition() is now respected. A unit test is also added which reproduces the problem both with select distinct ... and select ... per partition limit. Follow-up work: * frozen_mutation_consumer_adaptor::on_end_of_partition() calls the underlying consumer's on_end_of_stream(), so when consuming multiple frozen mutations, the underlying's on_end_of_stream() is called for each partition. This is incorrect but benign. * Improve documentation of mutation_reader::consume_pausable(). Fixes: #20084 Closes scylladb/scylladb#23657 (cherry picked from commit `d67202972a`) Closes scylladb/scylladb#23694	2025-04-11 10:53:31 +03:00
Botond Dénes	f7761729cc	Merge '[Backport 2025.1] nodetool: cluster repair: add a command to repair tablet keyspaces' from Scylladb[bot] Add a new nodetool cluster super-command. Add nodetool cluster repair command to repair tablet keyspaces. It uses the new /storage_service/tablets/repair API. The nodetool cluster repair command allows you to specify the keyspace and tables to be repaired. A cluster repair of many tables will request /storage_service/tablets/repair and wait for the result synchronously for each table. The nodetool repair command, which was previously used to repair keyspaces of any type, now repairs only vnode keyspaces. Fixes: https://github.com/scylladb/scylladb/issues/22409. Needs backport to 2025.1 that introduces the new tablet repair API - (cherry picked from commit `cbde835792`) - (cherry picked from commit `b81c81c7f4`) - (cherry picked from commit `aa3973c850`) - (cherry picked from commit `8bbc5e8923`) - (cherry picked from commit `02fb71da42`) - (cherry picked from commit `9769d7a564`) Parent PR: #22905 Closes scylladb/scylladb#23672 * github.com:scylladb/scylladb: docs: nodetool: update repair and add tablet-repair docs test: nodetool: add tests for cluster repair command nodetool: add cluster repair command nodetool: repair: extract getting hosts and dcs to functions nodetool: repair: warn about repairing tablet keyspaces nodetool: repair: move keyspace_uses_tablets function	2025-04-11 10:53:03 +03:00
Raphael S. Carvalho	75cd8e9492	replica: Fix truncate and drop table after tablet migration happens When running those operations after a tablet replica is migrated away from a shard, an assert can fail resulting in a crash. Status quo (around the assert in truncate procedure): 1) Highest RP seen by table is saved in low_mark, and the current time in low_mark_at. 2) Then compaction is disabled in order to not mix data written before truncate, and data written later. 3) Then memtable is flushed in order for the data written before truncate to be available in sstables and then removed. 4) Now, current time is saved in truncated_at, which is supposedly the time of truncate to decide which sstables to remove. Note: truncated_at is likely above low_mark_at due to steps 2 and 3. The interesting part of the assert is: (truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp) Note: RP in the assert above is the highest RP among all sstables generated before truncated_at. RP is retrieved by table::discard_sstables(). If truncated_at > low_mark_at, maybe newer data was written during steps 2 and 3, and memtable's RP becomes greater than low_mark, resulting in a SSTable with RP > low_mark. So assert's 2nd condition is there to defend against the scenario above. truncated_at and low_mark_at uses millisecond granularity, so even if truncated_at == low_mark_at, data could have been written in steps 2 and 3 (during same MS window), failing the assert. This is fragile. Reproducer: To reproduce the problem, truncated_at must be > low_mark_at, which can easily happen with both drop table and truncate due to steps 2 and 3. If a shard has 2 or more tablets, the table's highest RP refer to just one tablet in that shard. If the tablet with the highest RP is migrated away, then the sstables in that shard will have lower RP than the recorded highest RP (it's a table wide state, which makes sense since CL is shared among tablets). 
So when either drop table or truncate runs, low_mark will be potentially bigger than highest RP retrieved from sstables. Proposed solution: The current assert is hacked to not fail if writes sneak in, during steps 2 and 3, but it's still fragile and seems not to serve its real purpose, since it's allowing for RP > low_mark. We should be able to say that low_mark >= RP, as a way of asserting we're not leaving data targeted by truncate behind (or that we're not removing the wrong data). But the problem is that we're saving low_mark in step 1, before preparation steps (2 and 3). When truncated_at is recorded in step 4, it's a way of saying all data written so far is targeted for removal. But as of today, low_mark refers to all data written up to step 1. So low_mark is now only one set before issuing flush, and also accounts for all potentially flushed data. Fixes #18059. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#23560 (cherry picked from commit `0f59deffaa`) (cherry picked from commit 7554d4bbe09967f9b7a55575b5dfdde4f6616862) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#23649	2025-04-11 10:52:37 +03:00
Raphael S. Carvalho	7007dabdf9	storage_service: Don't retry split when table is dropped The split monitor wasn't handling the scenario where the table being split is dropped. The monitor would be unable to find the tablet map of such a table, and the error would be treated as a retryable one causing the monitor to fall into an endless retry loop, with sleeps in between. And that would block further splits, since the monitor would be busy with the retries. The fix is about detecting table was dropped and skipping to the next candidate, if any. Fixes #21859. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#22933 (cherry picked from commit `4d8a333a7f`) Closes scylladb/scylladb#23480	2025-04-11 10:52:05 +03:00
Aleksandra Martyniuk	636ec802c3	service: tasks: hold token_metadata_ptr in tablet_virtual_task Hold token_metadata_ptr in tablet_virtual_task methods that iterate over tablets, to keep the tablet_map alive. Fixes: https://github.com/scylladb/scylladb/issues/22316. Closes scylladb/scylladb#22740 (cherry picked from commit `f8e4198e72`) Closes scylladb/scylladb#22937	2025-04-11 10:51:07 +03:00
Avi Kivity	3335557075	Merge '[Backport 2025.1] row_cache: don't garbage-collect tombstones which cover data in memtables' from Scylladb[bot] The row cache can garbage-collect tombstones in two places: 1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it; 2) During reads - reads now compact data including garbage collection; In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables. This PR includes fixes for (2), which were not handled at all currently. (1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916. But the test added in this PR showed that the test is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included. Fixes: https://github.com/scylladb/scylladb/issues/23291 Fixes: https://github.com/scylladb/scylladb/issues/23252 The fix will need backport to all live release. - (cherry picked from commit `c2518cdf1a`) - (cherry picked from commit `6b5b563ef7`) - (cherry picked from commit `7e600a0747`) - (cherry picked from commit `d126ea09ba`) - (cherry picked from commit `cb76cafb60`) - (cherry picked from commit `df09b3f970`) - (cherry picked from commit `e5afd9b5fb`) - (cherry picked from commit `34b18d7ef4`) - (cherry picked from commit `f7938e3f8b`) - (cherry picked from commit `6c1f6427b3`) - (cherry picked from commit `0d39091df2`) Parent PR: #23255 Closes scylladb/scylladb#23673 * github.com:scylladb/scylladb: test/boost/row_cache_test: add memtable overlap check tests replica/table: add error injection to memtable post-flush phase utils/error_injection: add a way to set parameters from error injection points test/cluster: add test_data_resurrection_in_memtable.py test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts replica/mutation_dump: don't assume cells are live replica/database: do_apply() add error injection point replica: improve memtable 
overlap checks for the cache replica/memtable: add is_merging_to_cache() db/row_cache: add overlap-check for cache tombstone garbage collection mutation/mutation_compactor: copy key passed-in to consume_new_partition()	2025-04-10 21:42:28 +03:00
Avi Kivity	6ff7927d67	sstables: store features early in write path sstable features indicate that an sstable has some extension, or that some bug was fixed. They allow us to know if we can rely on certain properties in a read sstables. Currently, sstable features are set early in the read path (when we read the scylla metadata file) and very late in the write path (when we write the scylla metadata file just before sealing the sstable). However, we happen to read features before we set them in the write path - when we resize the bloom filter for a newly written sstable we instantiate an index reader, and that depends on some features. As a result, we read a disengaged optional (for the scylla metadata component) as if it was engaged. This somehow worked so far, but fails with libstdc++ hash table implementation. Fix it by moving storage of the features to the sstable itself, and setting it early in the write path. Fixes #23484 Closes scylladb/scylladb#23485 (cherry picked from commit `73e4a3c581`) Closes scylladb/scylladb#23504	2025-04-10 21:41:09 +03:00
Pavel Emelyanov	1021a3d126	Merge '[Backport 2025.1] Allow abort during join_cluster' from Scylladb[bot] Bootstrap or replace can take a long time, but since `feef7d3fa1`, the stop_signal is checked only in checkpoints, and in particular, abort isn't requested during join_cluster. Fixes #23222 * requires backport on top of https://github.com/scylladb/scylladb/pull/23184 - (cherry picked from commit `0fc196991a`) - (cherry picked from commit `f269480f53`) - (cherry picked from commit `41f02c521d`) Parent PR: #23306 Closes scylladb/scylladb#23461 * github.com:scylladb/scylladb: main: allow abort during join_cluster main: add checkpoint before joining cluster storage_service: add start_sys_dist_ks	2025-04-10 19:03:46 +03:00
Avi Kivity	5d8bb068fa	Merge '[Backport 2025.1] streaming: fix the way a reason of streaming failure is determined' from Scylladb[bot] During streaming, the receiving node gets and processes mutation fragments. If this operation fails, receiver responds with -1 status code, unless it failed due to no_such_column_family in which case streaming of this table should be skipped. However, when the table was dropped, an exception handler on receiver side may get not only data_dictionary::no_such_column_family, but also seastar::nested_exception of two no_such_column_family. Encountered example: ``` ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14)) ``` In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family> clause and gets handled the same as any other exception type. Replace try_catch clause with table_sync_and_check that synchronizes the schema and checks if the table exists. Fixes: https://github.com/scylladb/scylladb/issues/22834. 
Needs backport to all live versions, as they all contain the bug - (cherry picked from commit `876cf32e9d`) - (cherry picked from commit `faf3aa13db`) - (cherry picked from commit `44748d624d`) - (cherry picked from commit `35bc1fe276`) Parent PR: #22868 Closes scylladb/scylladb#23290 * github.com:scylladb/scylladb: streaming: fix the way a reason of streaming failure is determined streaming: save a continuation lambda streaming: use streaming namespace in table_check.{cc,hh} repair: streaming: move table_check.{cc,hh} to streaming	2025-04-10 18:22:16 +03:00
Lakshmi Narayanan Sreethar	fb069f0fbf	topology_coordinator: fix indentation in generate_migration_updates Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `dccce670c1`)	2025-04-10 18:39:10 +05:30
Lakshmi Narayanan Sreethar	48077b160d	topology_coordinator: do not schedule migrations when there are pending resize finalizations Resize finalization is executed in a separate topology transition state, `tablet_resize_finalization`, to ensure it does not overlap with tablet transitions. The topology transitions into the `tablet_resize_finalization` state only when no tablet migrations are scheduled or being executed. If there is a large load-balancing backlog, split finalization might be delayed indefinitely, leaving the tables with large tablets. To fix this, do not schedule tablet migrations on any tables when there are pending resize finalizations. This ensures that migrations from the same table and other unrelated tables do not block resize finalization. Also added a testcase to verify the fix. Fixes #21762 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `5b47d84399`)	2025-04-10 18:39:10 +05:30
Lakshmi Narayanan Sreethar	c286fc231a	load_balancer: make repair plans only when there is no pending resize finalization Do not make repair plans if any table has pending resize finalization. This is to ensure that the finalization doesn't get delayed by repair tasks. Refs #21762 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `8cabc66f07`)	2025-04-10 18:20:00 +05:30
Botond Dénes	df4872b82a	test/boost/row_cache_test: add memtable overlap check tests Similar to test/cluster/test_data_resurrection_in_memtable.py but works on a single node and uses more low-level mechanism. These tests can also reproduce more advanced scenarios, like concurrent reads, with some reading from flushed memtables. (cherry picked from commit `0d39091df2`)	2025-04-10 06:52:18 -04:00
Botond Dénes	7943db9844	replica/table: add error injection to memtable post-flush phase After the memtable was flushed to disk, but before it is merged to cache. The injection point will only be active for the table specified in the "table_name" injection parameter. (cherry picked from commit `6c1f6427b3`)	2025-04-10 06:52:18 -04:00
Botond Dénes	bd8c584a01	utils/error_injection: add a way to set parameters from error injection points With this, now it is possible to have two-way communication between the error injection point and its enabler. The test can enable the error injection point, then wait until it is hit, before proceeding. (cherry picked from commit `f7938e3f8b`)	2025-04-10 06:52:18 -04:00
Botond Dénes	50c05abd14	test/cluster: add test_data_resurrection_in_memtable.py Reproducers for #23252 and #23291 -- cache garbage collecting tombstones resurrecting data in the memtable. (cherry picked from commit `34b18d7ef4`)	2025-04-10 06:52:18 -04:00
Aleksandra Martyniuk	3a49808707	streaming: fix the way a reason of streaming failure is determined During streaming, the receiving node gets and processes mutation fragments. If this operation fails, receiver responds with -1 status code, unless it failed due to no_such_column_family in which case streaming of this table should be skipped. However, when the table was dropped, an exception handler on receiver side may get not only data_dictionary::no_such_column_family, but also seastar::nested_exception of two no_such_column_family. Encountered example: ``` ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14)) ``` In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family> clause and gets handled the same as any other exception type. Replace try_catch clause with table_sync_and_check that synchronizes the schema and checks if the table exists. Fixes: https://github.com/scylladb/scylladb/issues/22834. (cherry picked from commit `35bc1fe276`)	2025-04-10 09:35:56 +02:00
Aleksandra Martyniuk	b57774dea6	streaming: save a continuation lambda In the following patches, an additional preemption point will be added to the coroutine lambda in register_stream_mutation_fragments. Assign a lambda to a variable to prolong the captures lifetime. (cherry picked from commit `44748d624d`)	2025-04-10 09:35:55 +02:00
Aleksandra Martyniuk	67b0ea99a0	streaming: use streaming namespace in table_check.{cc,hh} (cherry picked from commit `faf3aa13db`)	2025-04-10 09:35:54 +02:00
Aleksandra Martyniuk	7fa0e041eb	repair: streaming: move table_check.{cc,hh} to streaming (cherry picked from commit `876cf32e9d`)	2025-04-10 09:34:23 +02:00
Botond Dénes	de1d8372fa	test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts Such that a given index in the return hosts refers to the same underlying Scylla instance, as the same index in the passed-in nodes list. This is what users of this method intuitively expect, but currently the returned hosts list is unordered (has random order). (cherry picked from commit `e5afd9b5fb`)	2025-04-10 03:17:27 -04:00
Botond Dénes	dcc3604e02	replica/mutation_dump: don't assume cells are live Currently the dumper unconditionally extracts the value of atomic cells, assuming they are live. This doesn't always hold of course and attempting to get the value of a dead cell will lead to marshalling errors. Fix by checking is_live() before attempting to get the cell value. Fix for both regular and collection cells. (cherry picked from commit `df09b3f970`)	2025-04-10 03:17:27 -04:00
Botond Dénes	39ca3463b3	replica/database: do_apply() add error injection point So writes (to user tables) can be failed on a replica, via error injection. Should simplify tests which want to create differences in what writes different replicas receive. (cherry picked from commit `cb76cafb60`)	2025-04-10 03:17:27 -04:00
Botond Dénes	1c7a6ba140	replica: improve memtable overlap checks for the cache The current memtable overlap check that is used by the cache -- table::get_max_purgeable_fn_for_cache_underlying_reader() -- only checks the active memtable, so memtables which are either being flushed or are already flushed and also have active reads against them do not participate in the overlap check. This can result in temporary data resurrection, where a cache read can garbage-collect a tombstone which still covers data in a flushing or flushed memtable, which still have active read against it. To prevent this, extend the overlap check to also consider all of the memtable list. Furthermore, memtable_list::erase() now places the removed (flushed) memtable in an intrusive list. These entries are alive only as long as there are readers still keeping an `lw_shared_ptr<memtable>` alive. This list is now also consulted on overlap checks. (cherry picked from commit `d126ea09ba`)	2025-04-10 03:17:27 -04:00
Botond Dénes	4febf2a938	replica/memtable: add is_merging_to_cache() And set it when the memtable is merged to cache. (cherry picked from commit `7e600a0747`)	2025-04-10 03:17:27 -04:00
Botond Dénes	b43d024ffb	db/row_cache: add overlap-check for cache tombstone garbage collection The cache should not garbage-collect tombstones which cover data in the memtable. Add overlap checks (get_max_purgeable) to garbage collection to detect tombstones which cover data in the memtable and to prevent their garbage collection. (cherry picked from commit `6b5b563ef7`)	2025-04-10 03:17:27 -04:00
Botond Dénes	4bb1969a7f	mutation/mutation_compactor: copy key passed-in to consume_new_partition() This doesn't introduce additional work for single-partition queries: the key is copied anyway on consume_end_of_stream(). Multi-partition reads and compaction are not that sensitive to the additional copy. This change fixes a bug in the compacting_reader: currently the reader passes _last_uncompacted_partition_start.key() to the compactor's consume_new_partition(). When the compactor emits enough content for this partition, _last_uncompacted_partition_start is moved from to emit the partition start, which makes the key reference passed to the compactor corrupt (it refers to a moved-from value). This in turn means that subsequent GC checks done by the compactor will be done with a corrupt key and can therefore result in tombstones being garbage-collected while they still cover data elsewhere (data resurrection). The compacting reader is violating the API contract and normally the bug should be fixed there. We make an exception here because doing the fix in the mutation compactor better aligns with our future plans: * The fix simplifies the compactor (gets rid of _last_dk). * Prepares the way to get rid of the consume API used by the compactor. (cherry picked from commit `c2518cdf1a`)	2025-04-10 03:17:27 -04:00
Anna Stuchlik	6bcf513f11	doc: add enabling consistent topology updates to the 2025.1 upgrade guide-from-2024 This commit adds the procedure to enable consistent topology updates for upgrades from 2024.1 to 2025.1 (or from 2024.2 to 2025.1 if the feature wasn't enabled after upgrading from 2024.1 to 2024.2). Fixes https://github.com/scylladb/scylladb/issues/23650 Closes scylladb/scylladb#23651 (cherry picked from commit `93a7b3ac1d`) Closes scylladb/scylladb#23670	2025-04-10 10:09:23 +03:00
Botond Dénes	b1a995b571	Merge '[Backport 2025.1] tablets: Make tablet allocation equalize per-shard load ' from Scylladb[bot] Before, it was equalizing per-node load (tablet count), which is wrong in heterogeneous clusters. Nodes with fewer shards will end up with overloaded shards. Fixes #23378 - (cherry picked from commit `d6232a4f5f`) - (cherry picked from commit `6bff596fce`) Parent PR: #23478 Closes scylladb/scylladb#23635 * github.com:scylladb/scylladb: tablets: Make tablet allocation equalize per-shard load tablets: load_balancer: Fix reporting of total load per node	2025-04-10 10:08:38 +03:00
Botond Dénes	ec7da3d785	tools/scylla-nodetool: s/GetInt()/GetInt64()/ GetInt() was observed to fail when the integer JSON value overflows the int32_t type, which `GetInt()` uses for storage. When this happens, rapidjson will assign a distinct 64-bit integer type to the value, and attempting to access it as a 32-bit integer triggers the wrong-type error, resulting in an assert failure. This was hit in the field, where invoking nodetool netstats resulted in nodetool crashing when the streamed byte amounts were higher than maxint. To avoid such bugs in the future, replace all usage of GetInt() in nodetool with GetInt64(), just to be sure. A reproducer is added for the nodetool netstats crash. Fixes: scylladb/scylladb#23394 Closes scylladb/scylladb#23395 (cherry picked from commit `bd8973a025`) Closes scylladb/scylladb#23476	2025-04-10 10:05:18 +03:00
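
The failure mode described in the last entry above (tools/scylla-nodetool, `ec7da3d785`) can be illustrated with a minimal Python sketch. This is not rapidjson's API and not nodetool's code: the helper names `fits_int32`, `read_as_int32` and `read_as_int64` are hypothetical, and the sketch only assumes what the commit message states, namely that a value outside the int32_t range is rejected by a 32-bit accessor while a 64-bit accessor accepts it.

```python
# Hypothetical illustration only; these helpers are not part of rapidjson or nodetool.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def fits_int32(value: int) -> bool:
    """Return True if value is representable as a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


def read_as_int32(value: int) -> int:
    """Mimic a strict 32-bit accessor: refuse values outside the int32 range."""
    if not fits_int32(value):
        raise TypeError(f"{value} does not fit in a signed 32-bit integer")
    return value


def read_as_int64(value: int) -> int:
    """Mimic a 64-bit accessor: accepts everything the 32-bit one does, and more."""
    if not INT64_MIN <= value <= INT64_MAX:
        raise TypeError(f"{value} does not fit in a signed 64-bit integer")
    return value


# A streamed-bytes counter of 3 GiB exceeds INT32_MAX (2147483647).
streamed_bytes = 3 * 1024**3

try:
    read_as_int32(streamed_bytes)
    int32_ok = True
except TypeError:
    int32_ok = False  # the strict 32-bit read is rejected

int64_value = read_as_int64(streamed_bytes)  # the 64-bit read succeeds
```

Reading every integer through the wider accessor, as the commit does, trades nothing for small values and removes the failure for large ones.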

46562 Commits