scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-21 00:50:35 +00:00

Author	SHA1	Message	Date
Avi Kivity	b33dd2bd7d	Merge 'sstables/mx/writer: handle non-full prefix row keys' from Botond Dénes Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written; consequently, if the key is empty, it is omitted entirely. When parsing sstables, the parsing code unconditionally parses a full prefix. This mismatch results in parsing failures, as the parser parses part of the row content as a key, resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions. Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows with such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery. Add a full-stack test which checks that rows with bad keys are correctly handled. Fixes: https://github.com/scylladb/scylladb/issues/24489 The bug is present in all versions and has to be backported to all supported versions. 
Closes scylladb/scylladb#24492 * github.com:scylladb/scylladb: test/boost/sstable_datafile_test: add test for corrupt data sstables/mx/writer: handler rows with empty keys test/lib/cql_assertions: introduce columns_assertions sstables: add corrupt_data_handler to sstables::sstables tools/scylla-sstable: make large_data_handler a local db: introduce corrupt_data_handler mutation: introduce frozen_mutation_fragment_v2 mutation/mutation_partition_view: read_{clustering,static}_row(): return row type mutation/mutation_partition_view: extract de-ser of {clustering,static} row idl-compiler.py: generate skip() definition for enums serializers idl: extract full_position.idl from position_in_partition.idl db/system_keyspace: add apply_mutation() db/system_keyspace: introduce the corrupt_data table	2025-06-29 18:18:36 +03:00
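The writer/parser mismatch described in b33dd2bd7d can be illustrated with a toy length-prefixed format. This is only a sketch: the encoding, `NUM_CK_COMPONENTS`, `write_row`, `parse_row`, `write_row_checked` and the `corrupt_rows` list are hypothetical stand-ins and do not reflect the real mx sstable format or the `corrupt_data_handler` API.

```python
import struct

NUM_CK_COMPONENTS = 2  # hypothetical schema with two clustering columns


def write_row(ck_components, value):
    """Buggy writer: serializes only the clustering components that are present."""
    out = b""
    for c in ck_components:
        out += struct.pack(">H", len(c)) + c
    return out + struct.pack(">H", len(value)) + value


def parse_row(buf):
    """Parser: unconditionally reads a full clustering prefix, then the value."""
    pos, fields = 0, []
    for _ in range(NUM_CK_COMPONENTS + 1):
        (n,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        fields.append(buf[pos:pos + n])
        pos += n
    return fields[:-1], fields[-1]


corrupt_rows = []  # toy stand-in for the system.corrupt_data table


def write_row_checked(ck_components, value):
    """Fixed writer: rows with a non-full key are diverted instead of written."""
    if len(ck_components) != NUM_CK_COMPONENTS:
        corrupt_rows.append((list(ck_components), value))
        return b""
    return write_row(ck_components, value)
```

With a full key the round trip works. With a one-component key the parser consumes the value as the second key component and then runs off the end of the buffer, which is the kind of mis-parsing the merge message describes.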
Piotr Dulikowski	62efe6616a	Merge 'mapreduce: add tablet-aware dispatching algorithm' from Andrzej Jackowski The primary motivation for this change is to reduce the time during which the Effective Replication Map (ERM) is retained by the mapreduce service. This ensures that long aggregate queries do not block topology operations. As ScyllaDB is generally transitioning towards tablets, and using tablets simplifies work dispatching, the decision was made to design the new algorithm specifically for tablets. The goal of the algorithm is to divide the work in such a way that each `tablet_replica` (that is, a <host, shard> pair) processes two tablets at a time. The new algorithm can be summarized as follows: 1. Prepare a tablet_replica -> partition_range mapping where the values cover the entire space. 2. For each tablet_replica, in parallel, take two partition ranges and dispatch them to the node hosting the replica. The ERM is released and re-acquired in each iteration, allowing the destination (i.e., tablet_replica) to change for each partition range (in such cases, the partition range is assigned to the appropriate tablet_replica). In step 1, the main difference compared to the old algorithm (dispatch_to_vnodes) is that partition ranges are assigned to a tablet_replica rather than just the host. In step 2, the main difference is that the work is divided into smaller batches, and the ERM is released and re-acquired for each batch. In the current implementation, each node can correctly handle every partition range, even if the mapreduce supercoordinator does not retain the ERM and the range is absent locally. This is because mapreduce_service::execute_on_this_shard creates a new pager that coordinates the partition range read, including obtaining its own ERM. However, every partition range that is absent locally is handled by shard 0. Therefore, proper routing of partition ranges is necessary to avoid shard 0 overload. 
This is why, in step 2, the ERM is retained during each batch processing, and the tablet_replica is refreshed for each processed range. Additionally, shard_id is added to the mapreduce request. When shard_id is set, the entire partition range is handled by the specified shard. As the new tablet-aware mapreduce algorithm balances the workload across shards, shard_id ensures that the balance is preserved, even during events such as tablet splits. This patch series: - Refactors the mapreduce service a bit, to facilitate having two algorithm versions (one for vnodes and one for tablets). - Implements the tablet-aware dispatching algorithm. - Adds shard_id to the mapreduce request and uses the information to handle requests entirely by the selected shard. - Adds test_long_query_timeout_erm to verify the new functionality. Fixes: scylladb#21831 No backport, as it is a new feature rather than a bugfix. Closes scylladb/scylladb#24383 * github.com:scylladb/scylladb: mapreduce: add missing comma and space in mapreduce_request operator<< mapreduce: add shard_id_hint to mapreduce request test: add test_long_query_timeout_erm mapreduce: add tablet-aware dispatching algorithm storage_proxy: make storage_proxy::is_alive public mapreduce: remove _shared_token_metadata from mapreduce_service mapreduce: move dispatching logic to dispatch_to_vnodes mapreduce: remove underscores from variable names mapreduce: move req_with_modified_pr handling to a new function mapreduce: change next_vnode lambda to get_next_partition_range function	2025-06-26 12:25:39 +02:00
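The two steps of the tablet-aware algorithm in 62efe6616a can be sketched as a batching loop. This is a simplified, sequential model, not the actual `mapreduce_service` code: `dispatch`, `get_owner` and `send` are hypothetical names, `get_owner` stands in for a lookup in a freshly acquired ERM, and the real implementation dispatches to replicas in parallel.

```python
from collections import defaultdict

BATCH_SIZE = 2  # the merge message says each tablet_replica handles two tablets at a time


def dispatch(ranges, get_owner, send):
    """Dispatch every partition range, re-resolving ownership before each batch.

    get_owner(range) -> (host, shard) models a lookup in a freshly acquired ERM.
    send(replica, range) models the request carrying the shard hint.
    """
    pending = list(ranges)
    while pending:
        # "Acquire the ERM": group the remaining ranges by their current owner.
        by_replica = defaultdict(list)
        for r in pending:
            by_replica[get_owner(r)].append(r)
        pending = []
        # Each replica receives at most BATCH_SIZE ranges in this iteration.
        for replica, owned in by_replica.items():
            for r in owned[:BATCH_SIZE]:
                send(replica, r)
            pending.extend(owned[BATCH_SIZE:])
        # "Release the ERM": the next iteration looks ownership up again,
        # so a range whose tablet moved is routed to its new replica.
```

Because ownership is looked up once per iteration rather than once per query, the mapping is never held longer than one batch, which is the property the series is after.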
Andrzej Jackowski	9dbb1468b4	mapreduce: remove _shared_token_metadata from mapreduce_service Before this change, `mapreduce_service` used `_shared_token_metadata` to get the topology. However, the token was used in a part of the code that already had its own ERM with its own metadata token. Moreover, as mapreduce_service's token and ERM's token are not guaranteed to be the same, inconsistencies could occur. Therefore, this commit removes `_shared_token_metadata` and its usage.	2025-06-25 08:42:16 +02:00
Botond Dénes	aae212a87c	test/lib/cql_assertions: introduce columns_assertions To enable targeted and optionally typed assertions against individual columns in a row.	2025-06-25 08:41:29 +03:00
Botond Dénes	ebd9420687	sstables: add corrupt_data_handler to sstables::sstables Similar to how large_data_handler is handled, propagate through sstables::sstables_manager and store its owner: replica::database. Tests and tools are also patched. Mostly mechanical changes, updating constructors and patching callers.	2025-06-25 08:41:26 +03:00
Benny Halevy	15bee9f232	sstables: sstable_generation_generator: set last_generation=0 by default Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	f0f7c83705	test: lib: test_env: always use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	0310a03de6	test: sstable_test: always use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Avi Kivity	cd79a8fc25	Revert "Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz" This reverts commit `0b516da95b`, reversing changes made to `30199552ac`. It breaks cluster.random_failures.test_random_failures.test_random_failures in debug mode (at least). Fixes #24513	2025-06-16 22:38:12 +03:00
Tomasz Grabiec	0b516da95b	Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft. Pulling `database::apply()` out of schema merging code will allow to batch changes to subsystems. Future generic code will first call `prepare()` on all implementations, then single `database::apply()` and then `update()` on all implementations, then on each shard it will call `commit()` for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then `post_commit()`. Backport: no, it's a new feature Fixes: https://github.com/scylladb/scylladb/issues/19649 Closes scylladb/scylladb#20853 * github.com:scylladb/scylladb: storage_service: always wake up load balancer on update tablet metadata db: schema_applier: call destroy also when exception occurs db: replica: simplify seeding ERM during shema change db: remove cleanup from add_column_family db: abort on exception during schema commit phase db: make user defined types changes atomic replica: db: make keyspace schema changes atomic db: atomically apply changes to tables and views replica: make truncate_table_on_all_shards get whole schema from table_shards service: split update_tablet_metadata into two phases service: pull out update_tablet_metadata from migration_listener db: service: add store_service dependency to schema_applier service: simplify load_tablet_metadata and update_tablet_metadata db: don't perform move on tablet_hint reference replica: split add_column_family_and_make_directory into steps replica: db: split drop_table into steps db: don't move map references in merge_tables_and_views() db: introduce commit_on_shard function db: access types during schema merge via special storage replica: make non-preemptive keyspace create/update/delete functions 
public replica: split update keyspace into two phases replica: split creating keyspace into two functions db: rename create_keyspace_from_schema_partition db: decouple functions and aggregates schema change notification from merging code db: store functions and aggregates change batch in schema_applier db: decouple tables and views schema change notifications from merging code db: store tables and views schema diff in schema_applier db: decouple user type schema change notifications from types merging code service: unify keyspace notification functions arguments db: replica: decouple keyspace schema change notifications to a separate function db: add class encapsulating schema merging	2025-06-10 13:45:32 +02:00
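The phase ordering described in 0b516da95b (later reverted by cd79a8fc25) can be sketched as follows. The `Subsystem` class and `apply_change` function are hypothetical illustrations of the ordering only; they are not the `schema_applier` interface itself, and the "without preemption" property of the commit phase is not modelled.

```python
class Subsystem:
    """Hypothetical stand-in for one implementation of the applier interface."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def prepare(self):
        self.log.append(("prepare", self.name))

    def update(self):
        self.log.append(("update", self.name))

    def commit(self, shard):
        self.log.append(("commit", self.name, shard))

    def post_commit(self):
        self.log.append(("post_commit", self.name))


def apply_change(subsystems, database_apply, shards):
    """Run the phases in the order given in the merge message."""
    for s in subsystems:
        s.prepare()
    database_apply()  # a single apply for the whole batch
    for s in subsystems:
        s.update()
    for shard in shards:
        # In the real design this inner loop runs without preemption, so the
        # change is observed atomically across subsystems on that shard.
        for s in subsystems:
            s.commit(shard)
    for s in subsystems:
        s.post_commit()
```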
Raphael S. Carvalho	2d716f3ffe	replica: Fix truncate assert failure Truncate doesn't really go well with concurrent writes. The fix (#23560) exposed a preexisting fragility which I missed. 1) truncate gets RP mark X, truncated_at = second T 2) new sstable written during snapshot or later, also at second T (difference of MS) 3) discard_sstables() gets RP Y > saved RP X, since creation time of sstable with RP Y is equal to truncated_at = second T. So the problem is that truncate is using a clock of second granularity for filtering out sstables written later, and after we got low mark and truncate time, it can happen that an sstable is flushed later within the same second, but at a different millisecond. By switching to a millisecond clock (db_clock), we allow sstables written later within the same second to be filtered out. It's not perfect, but it is extremely unlikely that a new write lands and gets flushed in the same millisecond in which we recorded the truncated_at timepoint. In practice, truncate will not be used concurrently with writes, so this should be enough for our tests performing such concurrent actions. We're moving away from gc_clock, which is our cheap lowres_clock, but time is only retrieved when creating sstable objects, whose creation frequency is low enough not to have significant consequences, and also db_clock should be cheap enough since it's usually syscall-less. Fixes #23771. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#24426	2025-06-08 15:59:15 +03:00
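The granularity problem in 2d716f3ffe comes down to comparing timestamps after rounding them to clock ticks. A minimal sketch, assuming a hypothetical `selected_for_discard` predicate that models the time filter only (the replay-position check and the real `discard_sstables()` logic are not shown):

```python
def selected_for_discard(sstable_written_ms, truncated_at_ms, granularity_ms):
    """True if the sstable counts as written at or before the truncation point,
    as seen by a clock whose tick is granularity_ms milliseconds long."""
    def tick(t):
        return t // granularity_ms
    return tick(sstable_written_ms) <= tick(truncated_at_ms)


# Truncation recorded at 10.200 s, sstable flushed 500 ms later at 10.700 s.
print(selected_for_discard(10_700, 10_200, granularity_ms=1000))  # second-granularity clock
print(selected_for_discard(10_700, 10_200, granularity_ms=1))     # millisecond clock
```

With a one-second tick both timestamps fall in tick 10, so the later sstable is still selected. With a one-millisecond tick it is filtered out, unless the flush happens in the very same millisecond.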
Marcin Maliszkiewicz	a27776b4ff	replica: make truncate_table_on_all_shards get whole schema from table_shards Previously, for views and indexes, it fetched the base schema (and a couple of other properties) from db. This becomes a problem once we introduce atomic tables and views deletion (in the following commit), because once we delete a table it can no longer be fetched from the db object, and truncation is performed after atomically deleting all relevant tables/views/indexes. Now the whole relevant schema will be fetched via the global_table_ptr (table_shards) object.	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	92e3d69f79	db: service: add store_service dependency to schema_applier There is already implicit logical dependency via migration_notifier but in the next commits we'll be moving store_service out from it as we need better control (i.e. return a value from the call).	2025-06-06 08:50:33 +02:00
Dawid Mędrek	c60035cbf6	test/lib/cql_test_env.cc: Enable rf_rack_valid_keyspaces by default We've adjusted all of the Boost tests so they respect the invariant enforced by the `rf_rack_valid_keyspaces` configuration option, or explicitly disabled the option in those that turned out to be more problematic and will require more attention. Thanks to that, we can now enable it by default in the test suite.	2025-05-27 18:53:39 +02:00
Aleksandra Martyniuk	9c03255fd2	cql_test_env: main: move stream_manager initialization Currently, stream_manager is initialized after storage_service and so it is stopped before storage_service is. In its stop method, storage_service accesses stream_manager, which is no longer initialized at that time. Move stream_manager initialization before the storage_service initialization. Fixes: #23207. Closes scylladb/scylladb#24008	2025-05-15 17:17:35 +03:00
Avi Kivity	5e764d1de2	Merge 'Drop v2 and flat from reader and related names' from Botond Dénes Following a number of similar code cleanup PRs, this one aims to be the last one, definitively dropping flat from all reader and related names. Similarly, v2 is also dropped from reader names, although it still persists in mutation_fragment_v2, mutation_v2 and related names. This won't change in the foreseeable future, as we don't have plans to drop mutation (the v1 variant). The changes in this PR are entirely mechanical, mostly just search-and-replace. Code cleanup, no backport required. Closes scylladb/scylladb#24087 * github.com:scylladb/scylladb: test/boost/mutation_reader_another_test: drop v2 from reader and related names test/boost/mutation_reader: s/puppet_reader_v2/puppet_reader/ test/boost/sstable_datafile_test: s/sstable_reader_v2/sstable_mutation_reader/ test/boost/mutation_test: s/consumer_v2/consumer/ test/lib/mutation_reader_assertions: s/flat_reader_assertions_v2/mutation_reader_assertions/ readers/mutation_readers: s/generating_reader_v2/generating_reader/ readers/mutation_readers: s/delegating_reader_v2/delegating_reader/ readers/mutation_readers: s/empty_flat_reader_v2/empty_mutation_reader/ readers/mutation_source: s/make_reader_v2/make_mutation_reader/ readers/mutation_source: s/flat_reader_v2_factory_type/mutation_reader_factory/ readers/mutation_reader: s/reader_consumer_v2/mutation_reader_consumer/ mutation/mutation_compactor: drop v2 from compactor and related names replica/table: s/make_reader_v2/make_mutation_reader/ mutation_writer: s/bucket_writer_v2/bucket_writer/ readers/queue: drop v2 from reader and related names readers/multishard: drop v2 from reader and related names readers/evictable: drop v2 from reader and related names readers/multi_range: remove flat from name	2025-05-11 22:22:35 +03:00
Botond Dénes	17b667b116	test/lib/mutation_reader_assertions: s/flat_reader_assertions_v2/mutation_reader_assertions/	2025-05-09 07:53:30 -04:00
Botond Dénes	674d41e3e6	readers/mutation_source: s/make_reader_v2/make_mutation_reader/	2025-05-09 07:53:29 -04:00
Botond Dénes	ca7f557e86	readers/multishard: drop v2 from reader and related names	2025-05-09 07:53:29 -04:00
Michał Chojnowski	1bcf77951c	compress: distribute compression dictionaries over shards We don't want each shard to have its own copy of each dictionary. It would put unnecessary pressure on cache and memory. Instead, we want to share dictionaries between shards. Before this commit, all dictionaries live on shard 0. All other shards borrow foreign shared pointers from shard 0. There's a problem with this setup: dictionary blobs receive many random accesses. If shard 0 is on a remote NUMA node, this could pose a performance problem. Therefore, for each dictionary, we would like to have one copy per NUMA node, not one copy per the entire machine. And each shard should use the copy belonging to its own NUMA node. This is the main goal of this patch. There is another issue with putting all dicts on shard 0: it eats an asymmetric amount of memory from shard 0. This commit spreads the ownership of dicts over all shards within the NUMA group, to make the situation more symmetric. (Dict owner is decided based on the hash of dict contents). It should be noted that the last part isn't necessarily a good thing, though. While it makes the situation more symmetric within each node, it makes it less symmetric across the cluster, if different node sizes are present. If dicts occupy 1% of memory on each shard of a 100-shard node, then the same dicts would occupy 100% of memory on a 1-shard node. So for the sake of cluster-wide symmetry, we might later want to consider e.g. making the memory limit for dictionaries inversely proportional to the number of shards.	2025-05-07 14:43:18 +02:00
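Two points in 1bcf77951c lend themselves to a short sketch: choosing a dictionary's owner from a hash of its contents, and the 1% versus 100% arithmetic. Everything here is an assumption for illustration: `dict_owner` and `per_shard_fraction` are hypothetical helpers, SHA-256 is a stand-in for whatever hash the commit actually uses, and shards are assumed to have equal memory.

```python
import hashlib


def dict_owner(dict_bytes, numa_shards):
    """Pick the owning shard within one NUMA group from a hash of the contents."""
    digest = hashlib.sha256(dict_bytes).digest()
    index = int.from_bytes(digest[:8], "big") % len(numa_shards)
    return numa_shards[index]


def per_shard_fraction(total_dict_bytes, shard_memory_bytes, num_shards):
    """Fraction of each shard's memory used when dicts are spread evenly."""
    return total_dict_bytes / num_shards / shard_memory_bytes


# Dicts totalling one shard's worth of memory:
print(per_shard_fraction(100, 100, num_shards=100))  # 0.01, i.e. 1% per shard
print(per_shard_fraction(100, 100, num_shards=1))    # 1.0, i.e. 100% of the only shard
```

The same content always hashes to the same owner within a group, so shards in that group can agree on where a dictionary lives without coordination.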
Michał Chojnowski	8649adafa8	test: switch uses of make_sstable_compressor_factory() to a seastar::thread-dependent version In next patches, make_sstable_compressor_factory() will have to disappear. In preparation for that, we switch to a seastar::thread-dependent replacement.	2025-05-07 14:43:04 +02:00
Michał Chojnowski	0e4d0ded8d	test: remove sstables::test_env::do_with() `sstable_manager` depends on `sstable_compressor_factory&`. Currently, `test_env` obtains an implementation of this interface with the synchronous `make_sstable_compressor_factory()`. But after this patch, the only implementation of that interface `sstable_compressor_factory&` will use `sharded<...>`, so its construction will become asynchronous, and the synchronous `make_sstable_compressor_factory()` must disappear. There are several possible ways to deal with this, but I think the easiest one is to write an asynchronous replacement for `make_sstable_compressor_factory()` that will keep the same signature but will be only usable in a `seastar::thread`. All other uses of `make_sstable_compressor_factory()` outside of `test_env::do_with()` already are in seastar threads, so if we just get rid of `test_env::do_with()`, then we will be able to use that thread-dependent replacement. This is the purpose of this commit. We shouldn't be losing much.	2025-05-07 13:19:21 +02:00
Raphael S. Carvalho	21d1e78457	compaction: Wire table_state into make_sstable_set() This will be useful for feeding token range owned by compaction group into sstable set. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Raphael S. Carvalho	59dad2121f	compaction: Introduce token_range() to table_state This provides a way for compaction layer to know compaction group's token range. It will be important for sstable set impl to know the token range of underlying group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Avi Kivity	2dcd2b21ae	Merge 'tablets: Equalize per-table balance when allocating tablets for a new table' from Tomasz Grabiec Fixes the following scenario: 1. Scale out adds new nodes to each rack 2. Table is created - all tablets are allocated to new nodes because they have low load 3. Rebalancing moves tablets from old nodes to new nodes - table balance for the new table is not fixed We're wrong to try to equalize global load when allocating tablets, and we should equalize per-table load instead, and let background load balancing fix it in a fair way. It will add to the allocated storage imbalance, but: 1. The table is initially empty, so doesn't impact actual storage imbalance. 2. It's more important to avoid overloading CPU on the nodes - imbalance hurts this aspect immediately. 3. If the table was created before imbalance was formed, we would end up in the same situation as in the problematic scenario after the patch. 4. It's the job of the load balancing to keep up with storage growing, and if it's not, scale out should kick in. Before we have CPU-aware tablet allocation, and thus can prove we have CPU capacity on the small nodes, we should respect per-table balance as this is the way in which we achieve full CPU utilization. Fixes #23631 Backport to 2025.1 because load imbalance is a serious problem in production. Closes scylladb/scylladb#23708 * github.com:scylladb/scylladb: tablets: Equalize per-table balance when allocating tablets for a new table load_sketch: Tolerate missing tablet_map when selecting for a given table tests: tablets: Simplify tests by moving common code to topology_builder	2025-04-21 17:06:30 +03:00
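The scenario fixed by 2dcd2b21ae can be reproduced with a toy allocator. This is a sketch under simplifying assumptions (one rack, load measured as a plain tablet count, a hypothetical `allocate` function); it is not the load balancer's algorithm.

```python
def allocate(new_tablets, node_load, per_table=True):
    """Place the tablets of a new table one by one.

    node_load maps node -> number of tablets it already hosts.
    Returns {node: tablets of the new table placed on it}.
    """
    placed = {node: 0 for node in node_load}
    for _ in range(new_tablets):
        if per_table:
            # Equalize the new table's own tablets; global load only breaks ties.
            def key(node):
                return (placed[node], node_load[node])
        else:
            # Equalize global load: empty nodes attract everything.
            def key(node):
                return node_load[node] + placed[node]
        target = min(sorted(placed), key=key)
        placed[target] += 1
    return placed


load = {"old1": 100, "old2": 100, "new1": 0, "new2": 0}
print(allocate(8, load, per_table=False))  # all 8 tablets land on new1 and new2
print(allocate(8, load, per_table=True))   # 2 tablets on every node
```

In the global variant the new table lives only on the freshly added nodes, so its CPU load is concentrated there, which is the imbalance the merge message argues against.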
Pavel Emelyanov	eb5b52f598	Merge 'main: make DC and rack immutable after bootstrap' from Piotr Dulikowski Changing DC or rack on a node which was already bootstrapped is, in case of vnodes, very unsafe (almost guaranteed to cause data loss or unavailability), and is outright not supported if the cluster has tablet-backed keyspaces. Moreover, the possibility of doing that makes it impossible to uphold some of the invariants promised by the RF-rack-valid flag, which is eventually going to become unconditionally enabled. Get rid of the above problems by removing the possibility of changing the DC / rack of a node. A node will now fail to start if its snitch reports a different DC or rack than the one that was reported during the first boot. Fixes: scylladb/scylladb#23278 Fixes: scylladb/scylladb#22869 Marking for backport to 2025.1, as this is a necessary part of the RF-rack-valid saga Closes scylladb/scylladb#23800 * github.com:scylladb/scylladb: doc: changing topology when changing snitches is no longer supported test: cluster: introduce test_no_dc_rack_change storage_service: don't update DC/rack in update_topology_with_local_metadata main: make dc and rack immutable after bootstrap test: cluster: remove test_snitch_change	2025-04-21 15:52:55 +03:00
Piotr Dulikowski	ce2fab7cce	main: make dc and rack immutable after bootstrap Changing DC or rack on a node which was already bootstrapped is, in case of vnodes, very unsafe (almost guaranteed to cause data loss or unavailability), and is outright not supported if the cluster has tablet-backed keyspaces. Moreover, the possibility of doing that makes it impossible to uphold some of the invariants promised by the RF-rack-valid flag, which is eventually going to become unconditionally enabled. Get rid of the above problems by removing the possibility of changing the DC / rack of a node. A node will now fail to start if its snitch reports a different DC or rack than the one that was reported during the first boot. Fixes: scylladb/scylladb#23278	2025-04-17 16:22:26 +02:00
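The startup rule from ce2fab7cce amounts to comparing the location persisted on first boot with what the snitch reports now. A minimal sketch, assuming hypothetical names (`check_dc_rack`, `TopologyMismatch`) and a `(dc, rack)` tuple as the persisted value; how ScyllaDB actually stores and compares these is not shown in the log.

```python
class TopologyMismatch(RuntimeError):
    """Raised when the snitch disagrees with the location saved at first boot."""


def check_dc_rack(saved, reported):
    """Return the location to persist, or refuse to start.

    saved: (dc, rack) recorded during the first boot, or None on first boot.
    reported: (dc, rack) currently reported by the snitch.
    """
    if saved is None:
        return reported  # first boot: remember what the snitch says
    if saved != reported:
        raise TopologyMismatch(
            f"snitch reports {reported}, but node was bootstrapped as {saved}"
        )
    return saved
```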
Tomasz Grabiec	d493a8d736	tests: tablets: Simplify tests by moving common code to topology_builder Reduces code duplication.	2025-04-15 16:05:41 +02:00
Pavel Emelyanov	b25cb5af0c	Merge 'Use named gates' from Benny Halevy Name the gates and phased barriers we use to make it easy to debug gate_closed_exception Refs https://github.com/scylladb/seastar/pull/2688 * Enhancement only, no backport needed Closes scylladb/scylladb#23329 * github.com:scylladb/scylladb: utils: loading_cache: use named_gate utils: flush_queue: use named_gate sstables_manager: use named gate sstables_loader: use named gate utils: phased_barrier, pluggable: use named gate utils: s3::client::multipart_upload: use named gate utils: s3::client: use named_gate transport: controller: use named gate tracing: trace_keyspace_helper: use named gate task_manager: module: use named gate topology_coordinator: use named gate storage_service: use named gate storage_proxy: wait_for_hint_sync_point: use named gate storage_proxy: remote: use named gate service: session: use named gate service: raft: raft_rpc: use named gate service: raft: raft_group0: use named gate service: raft: persistent_discovery: use named gate service: raft: group0_state_machine: use named gate service: migration_manager: use named gate replica: table: use named gate replica: compaction_group, storage_group: use named gate redis: query_processor: use named gate repair: repair_meta: use named gate reader_concurrency_semaphore: use named gate raft: server_impl: use named gate querier_cache: use named gate gms: gossiper: use named gate generic_server: use named gate db: sstables_format_listener: use named gate db: snapshot: backup_task: use named gate db: snapshot_ctl: use named gate hints: hints_sender: use named gate hints: manager: use named gate hints: hint_endpoint_manager: use named gate commitlog: segment_manager: use named gate db: batchlog_manager: use named gate query_processor: remote: use named gate compaction: compaction_state: use named gate alternator/server: use named_gate	2025-04-14 20:56:32 +03:00
Pavel Emelyanov	1bd991a111	test: Inherit sstable_assertions from sstables::test The latter class was invented to let tests access private fields of an sstable (mostly methods). The former is in fact an extended version of it that also does some checks. However, they don't inherit from each other, and sstable_assertions partially duplicates some functionality of the test one. Add the inheritance, remove the duplicated methods from the child class, update the callers (the test class returns future<>s, the assertions one "knows" it runs in a seastar thread) and mark sstable::read_toc() private. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23697	2025-04-14 13:45:14 +03:00
Benny Halevy	e1fe82ed33	utils: phased_barrier, pluggable: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:00 +03:00
Tomasz Grabiec	0b9a75d7b6	virtual-tables: Introduce system.load_per_node Can be used to query per-node stats about load as seen by the load balancer. In particular, node's capacity will be used by tablet-mon.py to scale tablet columns so that equal height is equal node utilization.	2025-04-09 20:21:51 +02:00
Marcin Maliszkiewicz	b94acfb37b	test: remove alternator code from perf-simple-query This kind of benchmark was superseded by perf-alternator which has more options, workflows and most importantly measures overhead of http server layer (including json parsing). There is no need to maintain additional code in perf-simple-query. Closes scylladb/scylladb#23474	2025-04-06 18:15:16 +03:00
Botond Dénes	a0d8102a1f	replica/memtable: s/make_flat_reader/make_mutation_reader/ Following the recent refactoring of removing "flat" and "v2" from reader names, replacing all the fully qualified names with simply "mutation_reader". Closes scylladb/scylladb#23346	2025-04-01 17:58:13 +03:00
Pavel Emelyanov	2ee9cec1d3	Merge 'Remove object_storage.yaml and move the endpoints to scylla.yaml' from Robert Bindar Move `object_storage.yaml` endpoints to `scylla.yaml` This change also removes the `object_storage.yaml` file altogether and adds tests for fetching the endpoints via the `v2/config/object_storage_endpoints` REST API. Also, the `object_storage_config_file` option is moved to a deprecated state as it's no longer needed. This PR depends on #22951, the reviewers should review patch 393e1ac0ec066475ca94094265a5f88dbbdb1a1f Refs https://github.com/scylladb/scylladb/issues/22428 Closes scylladb/scylladb#22952 * github.com:scylladb/scylladb: Remove db::config::object_storage_config Move `object_storage.yaml` endpoints to `scylla.yaml`	2025-04-01 16:01:44 +03:00
Michał Chojnowski	b77c611c00	raft/group0_state_machine: on `system.dicts` mutations, pass the affected partition keys to the callback Before this patch, `system.dicts` contains only one dictionary, for RPC compression, with the fixed name "general". In later parts of this series, we will add more dictionaries to system.dicts, one per table, for SSTable compression. To enable that, this patch adjusts the callback mechanism for group0's `write_mutations` command, so that the mutation callbacks for group0-managed tables can see which partition keys were affected. This way, the callbacks can query only the modified partitions instead of doing a full scan. (This is necessary to prevent quadratic behaviours.) For now, only the `system.dicts` callback uses the partition keys.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	30a9d471fa	sstables: plug an `sstable_compressor_factory` into `sstables_manager` Create a `sstable_compressor_factory_impl` in `scylla_main`, and pipe it through constructors into `sstables_manager`. In next commits, the factory available through the `sstables_manager` will be used to create compressors for SSTable readers and writers.	2025-04-01 00:07:28 +02:00
Robert Bindar	b647196121	Remove db::config::object_storage_config That map became redundant once we added object_storage_endpoints in the config, this patch removes it and switches all the user code to use the new option. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-03-31 17:15:12 +03:00
Avi Kivity	7646e1448a	Merge 'cql3: Introduce RF-rack-valid keyspaces' from Dawid Mędrek This PR is an introductory step towards enforcing RF-rack-valid keyspaces in Scylla. The scope of changes: * defining RF-rack-valid keyspaces, * introducing a configuration option enforcing RF-rack-valid keyspaces, * restricting the CREATE and ALTER KEYSPACE statements so that they never lead to RF-rack invalid keyspaces, * during the initialization of a node, it verifies that all existing keyspaces are RF-rack-valid. If not, the initialization fails. We provide tests verifying that the changes behave as intended. --- Note that there are a number of things that still need to be implemented. That includes, for instance, restricting topology operations too. --- Implementation strategy (going beyond the scope of this PR): 1. Introduce the new configuration option `rf_rack_valid_keyspaces`. 2. Start enforcing RF-rack-validity in keyspaces if the option is enabled. 3. Adjust the tests: in the tree and out of it. Explicitly enable the option in all tests. 4. Once the tests have been adjusted, change the default value of the option to enabled. 5. Stop explicitly enabling the option in tests. 6. Get rid of the option. --- Fixes scylladb/scylladb#20356 Fixes scylladb/scylladb#23276 Fixes scylladb/scylladb#23300 --- Backport: this is part of the requirements for releasing 2025.1. Closes scylladb/scylladb#23138 * github.com:scylladb/scylladb: main: Refuse to start node when RF-rack-invalid keyspace exists cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces db/config: Introduce RF-rack-valid keyspaces	2025-03-20 19:10:36 +02:00
Botond Dénes	d06bc27979	Merge 'Don't export string filenames from sstable' from Pavel Emelyanov There are several sstring-returning methods on class sstable that return paths to files. Mostly these are used to print them into logs, sometimes are used to be put into exception messages. And there are places that use these strings as file names. Since now sstables can also be stored on S3, generic code shouldn't consider those strings as on disk file names. Other than that, even when the methods are used to put component names into logs, in many cases these log messages come with debug or trace level, so generated strings are immediately dropped on the floor, but generating it is not extremely cheap. Code would benefit from using lazily-printed names. This change introduces the component_name struct that wraps sstable reference and component ID (which is a numerical enum of several items). When printed, the component_name formatter calls the aforementioned filename generation, thus implementing lazy printing. And since there's no automatic conversion of component_name-s into strings, all the code that treats them as file paths, becomes explicit. 
refs: #14122 (previous ugly attempt to achieve the same goal) Closes scylladb/scylladb#23194 * github.com:scylladb/scylladb: sstable: Remove unused malformed_sstable_exctpion(string filename) sstables: Make filename() return component_name sstables: Make file_writer keep component_name on board sstables: Make get_filename() return component_name sstables: Make toc_filename() return component_name sstables: Make sstable::index_filename() return component_name sstables: Introduce struct component_name sstables: Remove unused sstable::component_filenames() method sstables: Do not print component filenames on load-and-stream wrap-up sstables: Explicitly format prefix in S3 object name making sstables: Don't include directory name in exception sstables: Use fmt::format instead of string concatenation sstables: Rename filename($component) calls to ${component}_filename() sstables: Rename local filename variable to component_name	2025-03-20 09:51:03 +02:00
Dawid Mędrek	0e04a6f3eb	main: Refuse to start node when RF-rack-invalid keyspace exists When a node is started with the option `rf_rack_valid_keyspaces` enabled, the initialization will fail if there is an RF-rack-invalid keyspace. We want to force the user to adjust their existing keyspaces when upgrading to 2025.* so that the invariant that every keyspace is RF-rack-valid is always satisfied. Fixes scylladb/scylladb#23300	2025-03-19 15:13:44 +01:00
Pavel Emelyanov	f06cc32812	sstables: Make filename() return component_name Similarly to toc_, index_ and data filenames, make the generic component name getter return back not string, but a wrapper object. Most of callers are log messages and exception generations. Other than that there are tests, filesystem storage driver and few more places in generic code who "know" that they work with real files, so make them use explicit fmt::to_string(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:03:29 +03:00
Pavel Emelyanov	0cdeed858c	sstables: Make toc_filename() return component_name Most of the callers use the returned value as log message parameter, some construct malformed_sstable_exception that was prepared by previous patch. The remaining callers explicitly use fmt::to_string(), these are - pending deletion log creation - filesystem storage code - tests - stream-blob code that re-loads sstable All but the last one are OK to use string toc name, the last one is not very correct in its usage of toc_filename string, but it needs more care to be fixed properly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:03:29 +03:00
Pavel Emelyanov	dcc9167734	sstables: Rename filename($component) calls to ${component}_filename() There's a generic sstable::filename(component_type) method that returns a file name for the given component. For "popular" components, namely TOC, Data and Index there are dedicated sstable methods to get their names. Fix existing callers of the generic method to use the former. It's shorter, nicer and makes further patching simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:21 +03:00
Botond Dénes	969b07fdfd	test/lib/fragment_scatterer: s/StreamedMutationConsumer/FlattenedConsumer/ The class actually implements the FlattenedConsumer, so fix the comment. This eliminates the only reference to the StreamedMutationConsumer concept.	2025-03-18 07:57:04 -04:00
Pavel Emelyanov	2bb455ec75	Merge 'Main: stop system_keyspace' from Benny Halevy This series adds an async guard to system_keyspace operations and adds a deferred action to stop the system_keyspace in main() before destroying the service. This helps to make sure that sys_ks is unplugged from its users and that all async operations using it are drained once it's stopped. * Enhancement, no backport needed Closes scylladb/scylladb#23113 * github.com:scylladb/scylladb: main: stop system keyspace system_keyspace: call shutdown from stop system_keyspace: shutdown: allow calling more than once database, compaction_manager, large_data_handler: use pluggable<system_keysapce> utils: add class pluggable	2025-03-14 13:23:28 +03:00
Avi Kivity	696ce4c982	Merge "convert some parts of the gossiper to host ids" from Gleb " This series starts the conversion of the gossiper to use host ids to index nodes. It does not touch the main map yet, but converts a lot of internal code to host id. There are also some unrelated cleanups that were done while working on the series. One of which is dropping code related to old shadow round. We replaced shadow round with explicit GOSSIP_GET_ENDPOINT_STATES verb in `cd7d64f588` which is in scylla-4.3.0, so there should be no compatibility problem. We already dropped a lot of old shadow round code in previous patches anyway. I tested manually that old and new nodes can co-exist in the same cluster. " * 'gleb/gossiper-host-id-v2' of github.com:scylladb/scylla-dev: (33 commits) gossiper: drop unneeded code gossiper: move _expire_time_endpoint_map to host_id gossiper: move _just_removed_endpoints to host id gossiper: drop unused get_msg_addr function messaging_service: change connection dropping notification to pass host id only messaging_service: pass host id to remove_rpc_client in down notification treewide: pass host id to endpoint_lifecycle_subscriber treewide: drop endpoint life cycle subscribers that do nothing load_meter: move to host id treewide: use host id directly in endpoint state change subscribers treewide: pass host id to endpoint state change subscribers gossiper: drop deprecated unsafe_assassinate_endpoint operation storage_service: drop unused code in handle_state_removed treewide: drop endpoint state change subscribers that do nothing gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory gossiper: start using host ids to send messages earlier messaging_service: add temporary address map entry on incoming connection topology_coordinator: notify about IP change from sync_raft_topology_nodes as well treewide: move everyone to use host id based gossiper::is_alive and drop ip based one storage_proxy: drop unused template ...	2025-03-13 13:36:31 +02:00
Avi Kivity	b1d9f80d85	Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec Before this patch, the load balancer was equalizing tablet count per shard, so it achieved balance assuming that: 1) tablets have the same size 2) shards have the same capacity That can cause imbalance of utilization if shards have different capacity, which can happen in heterogeneous clusters with different instance types. One of the causes for capacity difference is that larger instances run with fewer shards due to vCPUs being dedicated to IRQ handling. This makes those shards have more disk capacity, and more CPU power. After this patch, the load balancer equalizes shards' storage utilization, so it no longer assumes that shards have the same capacity. It still assumes that each tablet has equal size. So it's a middle step towards full size-aware balancing. One consequence is that to be able to balance, the load balancer needs to know about every node's capacity, which is collected with the same RPC which collects load_stats for average tablet size. This is not a significant setback because migrations cannot proceed anyway if nodes are down due to barriers. We could make intra-node migration scheduling work without capacity information, but it's pointless due to the above, so not implemented. Also, per-shard goal for tablet count is still the same for all nodes in the cluster, so nodes with less capacity will be below limit and nodes with more capacity will be slightly above limit. This shouldn't be a significant problem in practice, we could compensate for this by increasing the limit. 
Refs #23042 Closes scylladb/scylladb#23079 * github.com:scylladb/scylladb: tablets: Make load balancing capacity-aware topology_coordinator: Fix confusing log message topology_coordinator: Refresh load stats after adding a new node topology_coordinator: Allow capacity stats to be refreshed with some nodes down topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places test: boost: tablets_test: Always provide capacity in load_stats test: perf_load_balancing: Set node capacity test: perf_load_balancing: Convert to topology_builder config, disk_space_monitor: Allow overriding capacity via config storage_service, tablets: Collect per-node capacity in load_stats	2025-03-11 14:34:27 +02:00
Gleb Natapov	f0af3f261e	messaging_service: add temporary address map entry on incoming connection We want to move to use host ids as soon as possible. Currently it is possible only after the full gossiper exchange (because only at this point gossiper state is added and with it address map entry). To make it possible to move to host ids earlier this patch adds address map entries on incoming communication during CLIENT_ID verb processing. The patch also adds generation to CLIENT_ID to use it when address map is updated. It is done so that older gossiper entries can be overwritten with newer mapping in case of IP change.	2025-03-11 12:09:21 +02:00
Tomasz Grabiec	69c49fb1a7	test: boost: tablets_test: Always provide capacity in load_stats Move shared_load_stats to topology_builder.hh so that topology_builder can maintain it. It will set capacity for all created nodes. Needed after load balancer requires capacity to make decisions.	2025-03-06 13:35:37 +01:00


1493 Commits