scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-05 06:23:03 +00:00

Author	SHA1	Message	Date
Benny Halevy	bdd5a61139	commitlog: segment_manager: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Patryk Jędrzejczak	07a7a75b98	Merge 'raft: implement the limited voters feature' from Emil Maskovsky Currently if raft is enabled all nodes are voters in group0. However it is not necessary to have all nodes to be voters - it only slows down the raft group operation (since the quorum is large) and makes deployments with asymmetrical DCs problematic (2 DCs with 5 nodes along 1 DC with 10 nodes will lose the majority if large DC is isolated). The topology coordinator will now maintain a state where there are only limited number of voters, evenly distributed across the DCs and racks. After each node addition or removal the voters are recalculated and rebalanced if necessary. That means: * When a new node is added, it might become a voter depending on the current distribution of voters - either if there are still some voter "slots" available, or if the new node is a better candidate than some existing voter (in which case the existing node voter status might be revoked). * When a voter node is removed or stopped (shut down), its voter status is revoked and another node might become a voter instead (this can also depend on other circumstances, like e.g. changing the number of DCs). * If a node addition or removal causes a change in number of data centers (DCs) or racks, the rebalance action might become wider (as there are some special rules applying to 1 vs 2 vs more DCs, also changing the number of racks might cause similar effects in the voters distribution) Special conditions for various number of DCs: * 1 DC: Can have up to the maximum allowed number of voters (5 - see below) * 2 DCs: The distribution of the voters will be asymmetric (if possible), meaning that we can tolerate a loss of the DC with the smaller number of voters (if both would have the same number of voters we'd lose majority if any of the DCs is lost). For example, if we have 2 DCs with 2 nodes each, one of them will only have 1 voter (despite the limit of 5). Also, if one of the 2 DCs has more racks than the other and the node count allows it, the DC with the more racks will have more voters. * 3 and more DCs: The distribution of the voters will be so that every DC has strictly less than half of the total voters (so a loss of any of the DCs cannot lead to the majority loss). Again, DCs with more racks are being preferred in the voter distribution. At the moment we will be handling the zero-token nodes in the same way as the regular nodes (i.e. the zero-token nodes will not take any priority in the voter distribution). Technically it doesn't make much sense to have a zero-token node that is not a voter (when there are regular nodes in the same DC being voters), but currently the intended purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs, creating a third DC with zero-token nodes only), so for that intended purpose no special handling is needed and will work out of the box. If a preference of zero token nodes will eventually be needed/requested, it will be added separately from this PR. The maximum number of voters of 5 has been chosen as the smallest "safe" value. We can lose majority when multiple nodes (possibly in different dcs and racks) die independently in a short time span. With less than 5 voters, we would lose majority if 2 voters died, which is very unlikely to happen but not entirely impossible. With 5 voters, at least 3 voters must die to lose majority, which can be safely considered impossible in the case of independent failures. Currently the limit will not be configurable (we might introduce configurable limits later if that would be needed/requested). Tests added: * boost/group0_voter_registry_test.cc: run time on CI: ~3.5s * topology_custom/test_raft_voters.py: parametrized with 1 or 3 nodes per DC, the run time on CI: 1: ~20s. 3: ~40s, approx 1 min total Fixes: scylladb/scylladb#18793 No backport: This is a new feature that will not be backported. Closes scylladb/scylladb#21969 * https://github.com/scylladb/scylladb: raft: distribute voters by rack inside DC raft/test: fix lint warnings in `test_raft_no_quorum` raft/test: add the upgrade test for limited voters feature raft topology: handle on_up/on_down to add/remove node from voters raft: fix the indentation after the limited voters changes raft: implement the limited voters feature raft: drop the voter removal from the decommission raft/test: disable the `stop_before_becoming_raft_voter` test raft/test: stop the server less gracefully in the voters test	2025-04-10 15:29:15 +02:00
Avi Kivity	9559e53f55	Merge 'Adjust tablet-mon.py for capacity-aware load balancing' from Tomasz Grabiec After load-balancer was made capacity-aware it no longer equalizes tablet count per shard, but rather utilization of shard's storage. This makes the old presentation mode not useful in assessing whether balance was reached, since nodes with less capacity will get fewer tablets when in balanced state. This PR adds a new default presentation mode which scales tablet size by its storage utilization so that tablets which have equal shard utilization take equal space on the graph. To facilitate that, a new virtual table was added: system.load_per_node, which allows the tool to learn about load balancer's view on per-node capacity. It can also serve as a debugging interface to get a view of current balance according to the load-balancer. Closes scylladb/scylladb#23584 * github.com:scylladb/scylladb: tablet-mon.py: Add presentation mode which scales tablet size by its storage utilization tablet-mon.py: Center tablet id text properly in the vertical axis tablet-mon.py: Show migration stage tag in table mode only when migrating virtual-tables: Introduce system.load_per_node virtual_tables: memtable_filling_virtual_table: Propagate permit to execute() docs: virtual-tables: Fix instructions service: tablets: Keep load_stats inside tablet_allocator	2025-04-10 14:59:08 +03:00
Botond Dénes	d67202972a	mutation/frozen_mutation: frozen_mutation_consumer_adaptor: fix end-of-partition handling This adaptor adapts a mutation reader pausable consumer to the frozen mutation visitor interface. The pausable consumer protocol allows the consumer to skip the remaining parts of the partition and resume the consumption with the next one. To do this, the consumer just has to return stop_iteration::yes from one of the consume() overloads for clustering elements, then return stop_iteration::no from consume_end_of_partition(). Due to a bug in the adaptor, this sequence leads to terminating the consumption completely -- so any remaining partitions are also skipped. This protocol implementation bug has user-visible effects, when the only user of the adaptor -- read repair -- happens during a query which has limitations on the amount of content in each partition. There are two such queries: select distinct ... and select ... with partition limit. When converting the repaired mutation to to query result, these queries will trigger the skip sequence in the consumer and due to the above described bug, will skip the remaining partitions in the results, omitting these from the final query result. This patch fixes the protocol bug, the return value of the underlying consumer's consume_end_of_partition() is now respected. A unit test is also added which reproduces the problem both with select distinct ... and select ... per partition limit. Follow-up work: * frozen_mutation_consumer_adaptor::on_end_of_partition() calls the underlying consumer's on_end_of_stream(), so when consuming multiple frozen mutations, the underlying's on_end_of_stream() is called for each partition. This is incorrect but benign. * Improve documentation of mutation_reader::consume_pausable(). Fixes: #20084 Closes scylladb/scylladb#23657	2025-04-10 13:19:57 +03:00
Avi Kivity	ed3e4f33fd	Merge 'generic_server: throttle and shed incoming connections according to semaphore limit' from Marcin Maliszkiewicz Adds new live updatable config: uninitialized_connections_semaphore_cpu_concurrency. It should help to reduce cpu usage by limiting cpu concurrency for new connections. As a last resort when those connections are waiting for initial processing too long (over 1m) they are shed. New connections_shed and connections_blocked metrics are added for tracking. Testing: - manually via simple program creating high number of connection and constantly re-connecting - added benchmark Following are benchmark results: Before: ``` > build/release/test/perf/perf_generic_server --smp=1 170101.41 tps ( 13.1 allocs/op, 0.0 logallocs/op, 7.0 tasks/op, 4695 insns/op, 3178 cycles/op, 0 errors) [...] throughput: mean=173850.06 standard-deviation=1844.48 median=174509.66 median-absolute-deviation=874.23 maximum=175087.49 minimum=170588.54 instructions_per_op: mean=4725.59 standard-deviation=13.35 median=4729.38 median-absolute-deviation=12.49 maximum=4738.61 minimum=4709.96 cpu_cycles_per_op: mean=3135.08 standard-deviation=32.13 median=3122.68 median-absolute-deviation=22.29 maximum=3179.38 minimum=3103.15 ``` After: ``` > build/release/test/perf/perf_generic_server --smp=1 167373.19 tps ( 13.1 allocs/op, 0.0 logallocs/op, 7.0 tasks/op, 4821 insns/op, 3371 cycles/op, 0 errors) [...] throughput: mean= 171199.55 standard-deviation=2484.58 median= 171667.06 median-absolute-deviation=2087.63 maximum=173689.11 minimum=167904.76 instructions_per_op: mean= 4801.90 standard-deviation=16.54 median= 4796.78 median-absolute-deviation=9.32 maximum=4830.71 minimum=4789.81 cpu_cycles_per_op: mean= 3245.26 standard-deviation=32.28 median= 3230.44 median-absolute-deviation=16.52 maximum=3297.39 minimum=3215.62 ``` The patch adds around 67 insns/op so it's effect on performance should be negligible. Fixes: https://github.com/scylladb/scylladb/issues/22844 Closes scylladb/scylladb#22828 * github.com:scylladb/scylladb: transport: move on_connection_close into connection destructor test: perf: make aggregated_perf_results formatting more human readable transport: add blocked and shed connection metrics generic_server: throttle and shed incoming connections according to semaphore limit generic_server: add data source and sink wrappers bookkeeping network IO generic_server: coroutinize part of server::do_accepts test: add benchmark for generic_server test: perf: add option to count multiple ops per time_parallel iteration generic_server: add semaphore for limiting new connections concurrency generic_server: add config to the constructor generic_server: add on_connection_ready handler	2025-04-09 21:41:38 +03:00
Tomasz Grabiec	668094dc58	virtual_tables: memtable_filling_virtual_table: Propagate permit to execute() So that population can access read's timeout and mark the permit as awaiting.	2025-04-09 20:21:51 +02:00
Marcin Maliszkiewicz	ed82bede39	generic_server: add semaphore for limiting new connections concurrency It will be used in following commits.	2025-04-09 10:30:58 +02:00
Marcin Maliszkiewicz	33122d3f93	generic_server: add config to the constructor	2025-04-09 10:30:58 +02:00
Botond Dénes	0d39091df2	test/boost/row_cache_test: add memtable overlap check tests Similar to test/cluster/test_data_resurrection_in_memtable.py but works on a single node and uses more low-level mechanism. These tests can also reproduce more advanced scenarios, like concurrent reads, with some reading from flushed memtables.	2025-04-08 00:11:36 -04:00
Botond Dénes	6b5b563ef7	db/row_cache: add overlap-check for cache tombstone garbage collection The cache should not garbage-collect tombstone which cover data in the memtable. Add overlap checks (get_max_purgeable) to garbage collection to detect tombstones which cover data in the memtable and to prevent their garbage collection.	2025-04-08 00:11:35 -04:00
Emil Maskovsky	76ceaf129b	raft: distribute voters by rack inside DC Distribute the voters evenly across racks in the datacenters. When distributing the voters across datacenters, the datacenters with more racks will be preferred in case of a tie. Also, in case of asymmetric voter distribution (2 DCs), the DC with more racks will have more voters (if the node counts allow it). In case of a single datacenter, the voters will be distributed across racks evenly (in the similar manner as done for the whole datacenters). The intention is that similar to losing a datacenter, we want to avoid losing the majority if a rack goes down - so if there are multiple racks, we want to distribute the voters across them in such a way that losing the whole rack will not cause the majority loss (if possible).	2025-04-07 12:31:37 +02:00
Emil Maskovsky	1d06ea3a5a	raft: implement the limited voters feature Currently if raft is enabled all nodes are voters in group0. However it is not necessary to have all nodes to be voters - it only slows down the raft group operation (since the quorum is large) and makes deployments with asymmetrical DCs problematic (2 DCs with 5 nodes along 1 DC with 10 nodes will lose the majority if large DC is isolated). The topology coordinator will now maintain a state where there are only limited number of voters, evenly distributed across the DCs and racks. After each node addition or removal the voters are recalculated and rebalanced if necessary. That means: * When a new node is added, it might become a voter depending on the current distribution of voters - either if there are still some voter "slots" available, or if the new node is a better candidate than some existing voter (in which case the existing node voter status might be revoked). * When a voter node is removed or stopped (shut down), its voter status is revoked and another node might become a voter instead (this can also depend on other circumstances, like e.g. changing the number of DCs). * If a node addition or removal causes a change in number of datacenters (DCs) or racks, the rebalance action might become wider (as there are some special rules applying to 1 vs 2 vs more DCs, also changing the number of racks might cause similar effects in the voters distribution) Special conditions for various number of DCs: * 1 DC: Can have up to the maximum allowed number of voters (5 - see below) * 2 DCs: The distribution of the voters will be asymmetric (if possible), meaning that we can tolerate a loss of the DC with the smaller number of voters (if both would have the same number of voters we'd lose the majority if any of the DCs is lost). For example, if we have 2 DCs with 2 nodes each, one of them will only have 1 voter (despite the limit of 5). Also, if one of the 2 DCs has more racks than the other and the node count allows it, the DC with the more racks will have more voters. * 3 and more DCs: The distribution of the voters will be so that every DC has strictly less than half of the total voters (so a loss of any of the DCs cannot lead to the majority loss). Again, DCs with more racks are being preferred in the voter distribution. At the moment we will be handling the zero-token nodes in the same way as the regular nodes (i.e. the zero-token nodes will not take any priority in the voter distribution). Technically it doesn't make much sense to have a zero-token node that is not a voter (when there are regular nodes in the same DC being voters), but currently the intended purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs, creating a third DC with zero-token nodes only), so for that intended purpose no special handling is needed and will work out of the box. If a preference of zero token nodes will eventually be needed/requested, it will be added separately from this PR. Currently the voter limits will not be configurable (we might introduce configurable limits later if that would be needed/requested). The feature is enabled by the `group0_limited_voters` feature flag to avoid issues with cluster upgrade (the feature will be only enabled once all nodes in the cluster are upgraded to the version supporting the feature). Fixes: scylladb/scylladb#18793	2025-04-07 12:31:18 +02:00
Pavel Emelyanov	10376b5b85	db: Re-use database::snapshot_table_on_all_shards() There are two snapshot-on-all-shards methods on the database -- the one that snapshots a keyspace and the one that snapshots a vector of tables. The latter snapshots a single table with a neat helper, while the former has the helper open-coded. Re-using the helper in keyspace snapshot is worth it, but needs to patch the helper to work on uuid, rather than ks:cf pair of strings. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23532	2025-04-07 11:55:43 +02:00
Botond Dénes	1198213000	Merge 'tablets: Make tablet allocation equalize per-shard load ' from Tomasz Grabiec Before, it was equalizing per-node load (tablet count), which is wrong in heterogeneous clusters. Nodes with fewer shards will end up with overloaded shards. Refs #23378 Closes scylladb/scylladb#23478 * github.com:scylladb/scylladb: tablets: Make tablet allocation equalize per-shard load tablets: load_balancer: Fix reporting of total load per node	2025-04-03 16:32:53 +03:00
Botond Dénes	fcdae20fd1	Merge 'Add tablet enforcing option' from Benny Halevy This series add a new config option: `tablets_mode_for_new_keyspaces` that replaces the existing `enable_tablets` option. It can be set to the following values: disabled: New keyspaces use vnodes by default, unless enabled by the tablets={'enabled':true} option enabled: New keyspaces use tablets by default, unless disabled by the tablets={'disabled':true} option enforced: New keyspaces must use tablets. Tablets cannot be disabled using the CREATE KEYSPACE option `tablets_mode_for_new_keyspaces=disabled` or `tablets_mode_for_new_keyspaces=enabled` control whether tablets are disabled or enabled by default for new keyspaces, respectively. In either cases, tablets can be opted-in or out using the `tablets={'enabled':...}` keyspace option, when the keyspace is created. `tablets_mode_for_new_keyspaces=enforced` enables tablets by default for new keyspaces, like `tablets_mode_for_new_keyspaces=enabled`. However, it does not allow to opt-out when creating new keyspaces by setting `tablets = {'enabled': false}` Refs scylladb/scylla-enterprise#4355 * Requires backport to 2025.1 Closes scylladb/scylladb#22273 * github.com:scylladb/scylladb: boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option db/config: add tablets_mode_for_new_keyspaces option	2025-04-03 16:32:19 +03:00
Botond Dénes	a0d8102a1f	replica/memtable: s/make_flat_reader/make_mutation_reader/ Following the recent refactoring of removing "flat" and "v2" from reader names, replacing all the fully qualified names with simply "mutation_reader". Closes scylladb/scylladb#23346	2025-04-01 17:58:13 +03:00
Pavel Emelyanov	2ee9cec1d3	Merge 'Remove object_storage.yaml and move the endpoints to scylla.yaml' from Robert Bindar Move `object_storage.yaml` endpoints to `scylla.yaml` This change also removes the `object_storage.yaml` file altogether and adds tests for fetching the endpoints via the `v2/config/object_storage_endpoints` REST api. Also, `object_storage_config_file` options is moved to a deprecated state as it's no longer needed. This PR depends on #22951, the reviewers should review patch 393e1ac0ec066475ca94094265a5f88dbbdb1a1f Refs https://github.com/scylladb/scylladb/issues/22428 Closes scylladb/scylladb#22952 * github.com:scylladb/scylladb: Remove db::config::object_storage_config Move `object_storage.yaml` endpoints to `scylla.yaml`	2025-04-01 16:01:44 +03:00
Avi Kivity	69684e16d8	Merge 'sstables: add SSTable compression with shared dictionaries ' from Michał Chojnowski This PR extends Scylla's SSTable compression with the ability to use compression dictionaries shared across compression chunks. This involves several changes: - We refactor `compression_parameters` and friends (`compressor`, `sstables::local_compression`, `sstables::compression`) to prepare for making the construction of `compressor`s asynchronous, to enable sharing pieces of compressors (the dictionaries) across shards. - We introduce the notion of "hidden compression options" which are written to `CompressionInfo.db` and used to construct decompressors, like regular options, but don't appear in the schema. (We later stuff the SSTable's dictionary into `CompressionInfo.db` using a sequence of such options). - We add a cluster feature which guards the creation of dictionary-compressed SSTables. - We introduce a central "compressor factory" (one instance shared by all shards), which from this point onward is used to construct all `compressor` objects (one per SSTable) used to process the SSTables. When constructing a compressor for writing, it uses the "current"/"recommended" dictionary (which is passed to the factory from the actively-observed contents of the group0-managed `system.dicts`). When constructing a compressor for reading, it uses the dictionary written in the hidden compression options in CompressionInfo.db. And it keeps dictionaries deduplicated, so that each unique live dictionary blob has only one instance in memory, shared across shards. - We teach the relevant `lz4` and `zstd` compressor wrappers about the dictionaries. - We add a HTTP API call which samples pieces of the given table (i.e. the Data.db files) from across the cluster, trains a dictionary on it, and publishes it via `system.dicts` as the new current dictionary for that table. (And we add some RPC verbs to support that). - We add a HTTP API call which estimates the impact of various available compression configurations on the compression ratio. - We add an autotrainer fiber which periodically retrains dicts for dict-aware tables and publishes them if they seem to be a significant improvement. Known imperfections: - The factory currently keeps one dictionary instance on the entire node, but we probably want one copy per NUMA node. I didn't do that because exposing NUMA knowledge to Scylla seems to require some changes in Seastar first. New feature, no backporting involved. Closes scylladb/scylladb#23025 * github.com:scylladb/scylladb: docs: add user-facing documentation for SSTable compression with shared dicts docs/dev: add sstable-compression-dicts.md test: add test_sstable_compression_dictionaries_autotrain.py test: add test_sstable_compression_dictionaries_basic.py test/pylib/rest_client: add `keyspace_upgrade_sstables` helper main: run a sstable_dict_autotrainer api: add the estimate_compression_ratios API call dict_autotrainer: introduce sstable_dict_autotrainer db/system_keyspace: add query_dict_timestamp compress: add ZstdWithDictsCompressor and LZ4WithDictsCompressor main: clean up sstable compression dicts after table drops sstables/compress: discard hidden compression options after the decompressor is created compress: change compressor_ptr from shared_ptr to unique_ptr api: add the retrain_dict API call storage_service: add some dict-related routines main: in compression_dict_updated_callback, recognize and use SSTable compression dicts storage_service: add do_sample_sstables() messaging_service: add SAMPLE_SSTABLES and ESTIMATE_SSTABLE_VOLUME verbs db/system_keyspace: let `system.dicts` helpers be used for dicts other than the RPC compression dict raft/group0_state_machine: on `system.dicts` mutations, pass the affected partitition keys to the callback database: add sample_data_files() database: add take_sstable_set_snapshot() compress: teach `lz4_processor` about dictionaries compress: teach `zstd_processor` about dictionaries sstables: delegate compressor creation to the compressor factory sstables: plug an `sstable_compressor_factory` into `sstables_manager` sstables: introduce sstable_compressor_factory utils/hashers: add get_sha256() gms/feature_service: add the SSTABLE_COMPRESSION_DICTS cluster feature compress: add hidden dictionary options compress: remove `compression_parameters::get_compressor()` sstables/compress: remove get_sstable_compressor() sstables/compress: move ownership of `compressor` to `sstable::compression` compress: remove compressor::option_names() compress: clean up the constructor of zstd_processor compress: squash zstd.cc into compress.cc sstables/compress: break the dependency of `compression_parameters` on `compressor` compress.hh: switch compressor::name() from an instance member to a virtual call bytes: adapt fmt_hex to std::span<const std::byte>	2025-04-01 12:47:34 +03:00
Pavel Emelyanov	b5a124f60c	sstable_directory: Move highest_generation_seen() to distributed_loader.cc This method is only used by the loader code (and tests). Also, There's the highest_version_seen() peer that sits in the loader code either. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23324	2025-04-01 09:15:14 +03:00
Michał Chojnowski	10fa4abde7	compress: change compressor_ptr from shared_ptr to unique_ptr Cleanup patch. After we moved the ownership of compressors to sstables, compressor objects never have shared lifetime. `unique_ptr` is more appropriate for them than `shared_ptr` now. (And besides expressing the intent better, using `unique_ptr` prevents an accidental cross-shard `shared_ptr` copy).	2025-04-01 00:07:29 +02:00
Michał Chojnowski	11be7c0704	compress: remove `compression_parameters::get_compressor()` Following up on the previous commits, we avoid constructing compressors where not necessary, by checking things directly on `compression_parameters` instead.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	8e611536b0	sstables/compress: move ownership of `compressor` to `sstable::compression` SSTable readers and writers use `compressor` objects to compress and decompress chunks of SSTable data files. `compressor` objects are read-only, so only one of them is needed for each SSTable. Before this commit, each reader and writer has its own `compressor` object. This isn't necessary, but it's okay. But later in this series it will stop being okay, because the creation of a `compressor` will become an expensive cross-shard operation (because it might require sharing a compression dictionary from another shard). So we have to adjust the code so that there is only once `compressor` per sstable, not one per reader/writer. We stuff the ownership of this compressor into `sstable::compression`. To make the ownership clear, we remove `compression_ptr` shared pointers from readers and writers, and make them access the compressor via the `sstable::compression` instead.	2025-04-01 00:07:27 +02:00
Michał Chojnowski	cfe69e057f	sstables/compress: break the dependency of `compression_parameters` on `compressor` Note: this commit is meant to be a code refactoring only and is not intended to change the observable behaviour. Today `schema` contains a `compression_parameters`. `compression_parameters` contains an instance of `compressor`, and SSTable writers just share that instance. This is fine because `compressor` is a stateless object, functionally dependent on the schema. But in later parts of the series, we will break this functional dependency by adding dictionaries to compressors. Two writers for the same schema might have different dictionaries, so they won't be able to just share a single instance contained in the schema. And when that happens, having a `compressor` instance in the `schema`/`compression_parameters` will become awkward, since it won't be actually used. It will be only a container for options. In addition, for performance reasons, we will want to share some pieces of compressors across shards, which will require -- in the general case -- a construction of a compressor to be asynchronous, and therefore not possible inside the constructor of `compression_parameters`. This commit modifies `compression_parameters` so that it doesn't hold or construct instances of `compressor`. Before this patch, the `compressor` instance constructed in `compression_parameters` has an additional role of validating and holding compressor-specific options. (Today the only such option is the zstd compression level). This means that the pieces of logic responsible for compressor-specific options have to be rewritten. That ends up being the bulk of this commit.	2025-04-01 00:07:27 +02:00
Robert Bindar	b647196121	Remove db::config::object_storage_config That map became redundant once we added object_storage_endpoints in the config, this patch removes it and switches all the user code to use the new option. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-03-31 17:15:12 +03:00
Tomasz Grabiec	6bff596fce	tablets: Make tablet allocation equalize per-shard load Before, it was equalizing per-node load (tablet count), which is wrong in heterogenous clusters. Nodes with fewer shards will end up with overloaded shards. Refs #23378	2025-03-31 14:34:30 +02:00
Botond Dénes	90c20858ed	Merge 'test/database: Remove most of take_snapshot() helper overloads and re-use them more' from Pavel Emelyanov This helper facilitate snapshot creation by various test cases in database_test.cc. This PR generalizes all overloads into one that suits all callers and patches one more test case to use it as well. Closes scylladb/scylladb#23482 * github.com:scylladb/scylladb: test/database: Re-use take_snapshot() helper once more test/database: Remove most of take_snapshot() helper overloads	2025-03-31 15:20:51 +03:00
Benny Halevy	5f2ce0b022	loading_cache_test: test_loading_cache_reload_during_eviction: use manual_clock Rather than lowres_clock, as since `32b7cab917`, loading_cache_for_test uses manual_clock for timing and relying on lowres_clock to time the test might run out of memory on fast test machines. Fixes #23497 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#23498	2025-03-31 14:53:06 +03:00
Pavel Emelyanov	ac582efb44	test/database: Re-use take_snapshot() helper once more There's a test case that can call the recently patched take_snapshot() helper as well. This changes nothing, but makes further patching a bit simpler (not in this branch). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-31 13:18:06 +03:00
Pavel Emelyanov	7e6380b6bd	test/database: Remove most of take_snapshot() helper overloads There are 3 of those that help tests (re)shuffle cql_test_env/database, skip_flush == true/false options and keyspace/table/snapshot names. There's little sense in having that many of those, just one overload with default arguments suits most of the callers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-31 13:18:06 +03:00
Evgeniy Naydanov	c4ae4e247a	test.py: refactor paths constants and options Add path constants to `test` module and use them in different test suites instead of own dups of the same code: - TOP_SRC_DIR : ScyllaDB's source code root directory - TEST_DIR : the directory with test.py tests and libs - BUILD_DIR : directory with ScyllaDB's build artefacts Add TestSuite.log_dir attribute as a ScyllaDB's build mode subdir of a path provided using `--tmpdir` CLI argument. Don't use `tmpdir` name because it mixed up with pytest's built-in fixture and `--tmpdir` option itself. Change default value for `--tmdir` from `./testlog` to `TOP_SRC_DIR/testlog` Refactor `ResourceGather*` classes to use path from a `test` object instead of providing it separately. Move modes constants to `test` module and remove duplications. Move `prepare_dirs()` and `start_3rd_party_services()` from `pylib.util` to `pylib.suite.base` to avoid circular imports (with little refactoring to use `pathlib.Path` instead of `str` as paths.) Also, in some places refactor to use f-strings for formatting.	2025-03-30 03:19:29 +00:00
Piotr Dulikowski	288216a89e	Merge 'Ignore wrapped exceptions `gate_closed_exception` and `rpc::closed_error` when node shuts down.' from Sergey Zolotukhin Normally, when a node is shutting down, `gate_closed_exception` and `rpc::closed_error` in `send_to_live_endpoints` should be ignored. However, if these exceptions are wrapped in a `nested_exception`, an error message is printed, causing tests to fail. This commit adds handling for nested exceptions in this case to prevent unnecessary error messages. Fixes scylladb/scylladb#23325 Fixes scylladb/scylladb#23305 Fixes scylladb/scylladb#21815 Backport: looks like this is quite a frequent issue, therefore backport to 2025.1. Closes scylladb/scylladb#23336 * github.com:scylladb/scylladb: database: Pass schema_ptr as const ref in `wrap_commitlog_add_error` database: Unify exception handling in `do_apply` and `apply_with_commitlog` storage_proxy: Ignore wrapped `gate_closed_exception` and `rpc::closed_error` when node shuts down. exceptions: Add `try_catch_nested` to universally handle nested exceptions of the same type.	2025-03-27 11:39:42 +01:00
Sergey Zolotukhin	6abfed9817	exceptions: Add `try_catch_nested` to universally handle nested exceptions of the same type.	2025-03-26 11:15:13 +01:00
Benny Halevy	9fac0045d1	boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-24 15:39:53 +02:00
Benny Halevy	62aeba759b	tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option `tablets_mode_for_new_keyspaces=enforced` enables tablets by default for new keyspaces, like `tablets_mode_for_new_keyspaces=enabled`. However, it does not allow to opt-out when creating new keyspaces by setting `tablets = {'enabled': false}`. Refs scylladb/scylla-enterprise#4355 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-24 15:32:16 +02:00
Benny Halevy	c62865df90	db/config: add tablets_mode_for_new_keyspaces option The new option deprecates the existing `enable_tablets` option. It will be extended in the next patch with a 3rd value: "enforced" while will enable tablets by default for new keyspace but without the posibility to opt out using the `tablets = {'enabled': false}` keyspace schema option. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-24 14:54:45 +02:00
Avi Kivity	cc5fe542ed	test: ignore unused fmt::to_string() result fmt 11.1 apparently marks to_string() as [[nodiscard]]. Here we aren't interested in the result, so explicitly ignore it to avoid an error. Closes scylladb/scylladb#23403	2025-03-24 10:19:09 +03:00
Avi Kivity	cd04ab1a4e	test: avoid spaces when defining user-defined literal operator Clang 20 complains when it sees a user-defined literal operator defined with a space before the underscore. Assume it's adhering to the standard and comply. Closes scylladb/scylladb#23401	2025-03-24 10:17:12 +03:00
Pavel Emelyanov	d436fb8045	Merge 'Fix EAR not applied on write to S3 (but on read).' from Calle Wilund Fixes #23225 Fixes #23185 Adds a "wrap_sink" (with default implementation) to sstables::file_io_extension, and moves extension wrapping of file and sink objects to storage level. (Wrapping/handling on sstable level would be problematic, because for file storage we typically re-use the sstable file objects for sinks, whereas for S3 we do not). This ensures we apply encryption on both read and write, whereas we previously only did so on read -> fail. Adds io wrapper objects for adapting file/sink for default implementation, as well as a proper encrypted sink implementation for EAR. Unit tests for io objects and a macro test for S3 encrypted storage included. Closes scylladb/scylladb#23261 * github.com:scylladb/scylladb: encryption: Add "wrap_sink" to encryption sstable extension encrypted_file_impl: Add encrypted_data_sink sstables::storage: Move wrapping sstable components to storage provider sstables::file_io_extension: Add a "wrap_sink" method. sstables::file_io_extension: Make sstable argument to "wrap" const utils: Add "io-wrappers", useful IO helper types	2025-03-24 10:12:46 +03:00
Pavel Emelyanov	bd313c581f	test: Add unit test for newly introduced download source Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-21 12:01:06 +03:00
Calle Wilund	9ac9813c62	encrypted_file_impl: Add encrypted_data_sink Adds a sibling type to encrypted file, a data_sink, that will write a data stream in the same block format as a file object would. Including end padding. For making encrypted data sink writing less cumbersome.	2025-03-20 14:54:24 +00:00
Calle Wilund	e100af5280	sstables::file_io_extension: Make sstable argument to "wrap" const This matches the signature of call sites. Since the only "real" extension to actually make a marker in the sstable will do so in the scylla component, which is writable even in a const sstable, this is ok.	2025-03-20 14:54:09 +00:00
Calle Wilund	98a6d0f79c	utils: Add "io-wrappers", useful IO helper types Mainly to add a somewhat functional file-impl wrapping a data_sink. This can implement a rudimentary, write-only, file based on any output sink. For testing, and because they fit there, place memory sink and source types there as well.	2025-03-20 14:54:09 +00:00
Botond Dénes	d06bc27979	Merge 'Don't export string filenames from sstable' from Pavel Emelyanov There are several sstring-returning methods on class sstable that return paths to files. Mostly these are used to print them into logs, sometimes are used to be put into exception messages. And there are places that use these strings as file names. Since now sstables can also be stored on S3, generic code shouldn't consider those strings as on disk file names. Other than that, even when the methods are used to put component names into logs, in many cases these log messages come with debug or trace level, so generated strings are immediately dropped on the floor, but generating it is not extremely cheap. Code would benefit from using lazily-printed names. This change introduces the component_name struct that wraps sstable reference and component ID (which is a numerical enum of several items). When printed, the component_name formatter calls the aforementioned filename generation, thus implementing lazy printing. And since there's no automatic conversion of component_name-s into strings, all the code that treats them as file paths, becomes explicit. refs: #14122 (previous ugly attempt to achieve the same goal) Closes scylladb/scylladb#23194 * github.com:scylladb/scylladb: sstable: Remove unused malformed_sstable_exctpion(string filename) sstables: Make filename() return component_name sstables: Make file_writer keep component_name on board sstables: Make get_filename() return component_name sstables: Make toc_filename() return component_name sstables: Make sstable::index_filename() return component_name sstables: Introduce struct component_name sstables: Remove unused sstable::component_filenames() method sstables: Do not print component filenames on load-and-stream wrap-up sstables: Explicitly format prefix in S3 object name making sstables: Don't include directory name in exception sstables: Use fmt::format instead of string concatenation sstables: Rename filename($component) calls to ${component}_filename() sstables: Rename local filename variable to component_name	2025-03-20 09:51:03 +02:00
Avi Kivity	a62ab824e6	schema: deprecate schema_extension schema_extension allows making invisible changes to system_schema that evade upgrade rollback tests. They appear in system_schema as an encoded blob which reduces serviceability, as they cannot be read. Deprecate it and point users to adding explicit columns in scylla_tables. We could probably make use of the data structure, after we teach it to encode its payload into proper named and typed columns instead of using IDL. Closes scylladb/scylladb#23151	2025-03-19 20:36:16 +02:00
Pavel Emelyanov	1ba91e28cb	sstables: Make get_filename() return component_name Similarly to previous patches -- mostly the result is used as log argument. The remaining users include - scylla sstable tool that dumps component names to json output - API endpoint that returns component names to user - tests these are all good to explicitly convert component_names to strings. There are few more places that expect strings instead of component name objects. For now they also use fmt::to_string() explicitly, partially it will be fixed later, mostly -- as future follow-ups. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:03:29 +03:00
Pavel Emelyanov	420b5bee20	test/s3: Increase boost/s3_test log levels When something goes wrong, it's impossible to find anyting out without s3 and http logs, so increase them for boost tests. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23245	2025-03-18 15:59:05 +02:00
Botond Dénes	2795d83b32	Merge 'commitlog: Serialize file deletion and distribute replayed segments' from Calle Wilund Fixes #23017 When deleting segments while our footprint is over the limit, mainly when recycling/deleting segments after replay (recover boot) we can cause two deletion passes to be running at the same time. This is because delete is triggered by either a.) replay release b.) timer check (explicit) c.) timer initiated flush callback where the last one is in fact not even waited for. If we are considering many files for delete/recycle, we can, due to task switch, end up considering segments ok to keep, in parallel, even though one of them should be deleted. The end result will be us keeping one more segment than should be allowed. Now, eventually, this should be released, once we do deletion again, but this can take a while. Solution is to simply ensure we serialize deletion. This might cause some delay in processing cycles for recycle, but in practice, this should never happen when we are in fact under pressure. As noted in the issue above, when replaying a large commitlog from an unclean node, we can cause shard 0 db commitlog to reach footprint limit, and then remain there (because we never release segments lower than limit). This is wasteful with diskspace. But deleting segments early here is also wasteful; A better solution is to simply give the segments to all CL shards, thus distributing the available space. Closes scylladb/scylladb#23150 * github.com:scylladb/scylladb: main/commitlog: wait for file deletion and distribute recycled segments to shards commitlog: Serialize file deletion	2025-03-18 11:47:17 +02:00
Calle Wilund	4ed81e05bf	commitlog: Serialize file deletion Fixes #23017 When deleting segments while our footprint is over the limit, mainly when recycling/deleting segments after replay (recover boot) we can cause two deletion passes to be running at the same time. This is because delete is triggered by either a.) replay release b.) timer check (explicit) c.) timer initiated flush callback where the last one is in fact not even waited for. If we are considering many files for delete/recycle, we can, due to task switch, end up considering segments ok to keep, in parallel, even though one of them should be deleted. The end result will be us keeping one more segment than should be allowed. Now, eventually, this should be released, once we do deletion again, but this can take a while. Solution is to simply ensure we serialize deletion. This might cause some delay in processing cycles for recycle, but in practice, this should never happen when we are in fact under pressure. Small unit test included.	2025-03-17 12:09:00 +00:00
Avi Kivity	4416b0c732	treewide: use angle brackets for including seastar headers Seastar is an external library, so we use angle brackets to include its interfaces. Closes scylladb/scylladb#23301	2025-03-17 10:03:06 +02:00
Raphael S. Carvalho	e9944f0b7c	service: Introduce rack-aware co-location migrations for tablet merge Merge co-location can emit migrations across racks even when RF=#racks, reducing availability and affecting consistency of base-view pairing. Given replica set of sibling tablets T0 and T1 below: [T0: (rack1,rack3,rack2)] [T1: (rack2,rack1,rack3)] Merge will co-locate T1:rack2 into T0:rack1, T1 will be temporarily only at only a subset of racks, reducing availability. This is the main problem fixed by this patch. It also lays the ground for consistent base-view replica pairing, which is rack-based. For tables on which views can be created we plan to enforce the constraint that replicas don't move across racks and that all tablets use the same set of racks (RF=#racks). This patch avoids moving replicas across racks unless it's necessary, so if the constraint is satisfied before merge, there will be no co-locating migrations across racks. This constraint of RF=#racks is not enforced yet, it requires more extensive changes. Fixes #22994. Refs #17265. This patch is based on Raphael's work done in PR #23081. The main differences are: 1) Instead of sorting replicas by rack, we try to find replicas in sibling tablets which belong to the same rack. This is similar to how we match replicas within the same host. It reduces number of across-rack migrations even if RF!=#racks, which the original patch didn't handle. Unlike the original patch, it also avoids rack-overloaded in case RF!=#racks 2) We emit across-rack co-locating migrations if we have no other choice in order to finalize the merge This is ok, since views are not supported with tablets yet. Later, we will disallow this for tables which have views, and we will allow creating views in the first place only when no such migrations can happen (RF=#racks). 3) Added boost unit test which checks that rack overload is avoided during merge in case RF<#racks 4) Moved logging of across-rack migration to debug level 5) Exposed metric for across-rack co-locating migrations Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com> Closes scylladb/scylladb#23247	2025-03-16 22:45:00 +02:00

1 2 3 4 5 ...

3814 Commits