scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-25 19:10:42 +00:00

Author	SHA1	Message	Date
Kamil Braun	89ee2a6834	Merge 'drop ip addresses from token metadata' from Gleb Now that all topology related code uses host ids there is no point in maintaining ip to id (and back) mappings in the token metadata. After the patch the mapping will be maintained in the gossiper only. The rest of the system will use host ids and in rare cases where translation is needed (mostly for UX compatibility reasons) the translation will be done using gossiper. Fixes: scylladb/scylla#21777 * 'gleb/drop-ip-from-tm-v3' of github.com:scylladb/scylla-dev: (57 commits) hint manager: do not translate ip to id in case hint manager is stopped already locator: token_metadata: drop update_host_id() function that does nothing now locator: topology: drop indexing by ips repair: drop unneeded code storage_service: use host_id to look for a node in on_alive handler storage_proxy: translate ips to ids in forward array using gossiper locator: topology: remove unused functions storage_service: check for outdated ip in on_change notification in the peers table storage_proxy: translate id to ip using address map in tablets's describe_ring code instead of taking one from the topology topology coordinator: change connection dropping code to work on host ids cql3: report host id instead of ip in error during SELECT FROM MUTATION_FRAGMENTS query locator: drop unused function from tablet_effective_replication_map api: view_build_statuses: do not use IP from the topology, but translate id to ip using address map instead locator: token_metadata: remove unused ip based functions locator: network_topology_strategy: use host_id based function to check number of endpoints in dcs gossiper: drop get_unreachable_token_owners functions storage_service: use gossiper to map ip to id in node_ops operations storage_service: fix indentation after the last patch storage_service: drop loops from node ops replace_prepare handling since there can be only one replacing node token_metadata: drop no longer used functions ...	2025-01-17 11:00:52 +01:00
Pavel Emelyanov	14c3fbbf8c	Merge 'sstable_directory: do not load remote unshared sstables in process_descriptor()' from Lakshmi Narayanan Sreethar The sstable loader relied on the generation id to provide an efficient hint about the shard that owns an sstable. But, this hint was rendered ineffective with the introduction of UUID generation, as the shard id was no longer embedded in the generation id. This also became suboptimal with the introduction of tablets. Commit `0c77f77` addressed this issue by reading the minimum from disk to determine sstable ownership but this improvement was lost with commit `63f1969`, which optimistically assumed that hints would work most of the time, which isn't true. This commit restores that change - shard id of a table is deduced by reading minimally from disk and then the sstable is fully loaded only if it belongs to the local shard. This patch also adds a testcase to verify that the sstables are loaded only in their respective shards. Fixes #21015 This fixes a regression and should be backported. Closes scylladb/scylladb#22263 * github.com:scylladb/scylladb: sstable_directory: do not load remote sstables in process_descriptor sstable_directory: update `load_sstable()` definition sstable_directory: reintroduce `get_shards_for_this_sstable()`	2025-01-17 11:17:54 +03:00
Nadav Har'El	955ac1b7b7	test/alternator: close boto3 client before shutting down For several years now, we have seen a strange, and very rare, flakiness in Alternator tests described in issue #17564: We see all the tests pass, pytest declares them to have passed, and while Python is exiting, it crashes with a signal 11 (SIGSEGV). Because this happens exclusively in test/alternator and never in test/cqlpy, we suspect that something that test/alternator leaves behind but test/cqlpy does not, causes some race and crashes during shutdown. The immediate suspect is the boto3 library, or rather, the urllib3 library which it uses. This is more-or-less the only thing that test/alternator does which test/cqlpy doesn't. The urllib3 library keeps around pools of reusable connections, and it's possible (although I don't actually have any proof for it) that these open connections may cause a crash during shutdown. So in this patch I add to the "dynamodb" and "dynamodbstreams" fixtures (which all Alternator tests use to connect to the server), a teardown which calls close() for the boto3 client object. This close() call percolates down to calling clear() on urllib3's PoolManager. Hopefully, this will make some difference in the chance to crash during shutdown - and if it doesn't, it won't hurt. Refs #17564 Closes scylladb/scylladb#22341	2025-01-16 19:21:00 -05:00
Gleb Natapov	1e4b2f25dc	locator: token_metadata: drop update_host_id() function that does nothing now	2025-01-16 16:37:08 +02:00
Gleb Natapov	50fb22c8f9	locator: topology: drop indexing by ips Do not track id to ip mapping in the topology class any longer. There are no remaining users.	2025-01-16 16:37:08 +02:00
Gleb Natapov	97f95f1dbd	locator: token_metadata: remove unused ip based functions	2025-01-16 16:37:07 +02:00
Gleb Natapov	415e8de36e	locator: topology: change get_datacenter_endpoints and get_datacenter_racks to return host ids and amend users	2025-01-16 16:37:06 +02:00
Gleb Natapov	8433947932	locator: topology: remove get_location overload that works on ip and its last users	2025-01-16 16:37:06 +02:00
Gleb Natapov	1b6e1456e5	messaging_service: drop the usage of ip based token_metadata APIs We want to drop ips from token_metadata so move to use host id based counterparts. Messaging service gets a function that maps from ips to ids when it starts listening.	2025-01-16 16:37:06 +02:00
Gleb Natapov	542360e825	test: drop inet_address usage from network_topology_strategy_test Move the test to work on host ids. IPs will be dropped eventually.	2025-01-16 16:37:06 +02:00
Kefu Chai	8d7786cb0e	build: cmake: use wasm32-wasip1 as an alternative of wasm32-wasi wasm32-wasi has been removed in Rust 1.84 (Jan 5th, 2025). if one compiles the tree with Rust 1.84 or up, following build failure is expected: ``` [2/305] Building WASM /home/kefu/dev/scylladb/build/wasm/return_input.wasm FAILED: wasm/return_input.wasm /home/kefu/dev/scylladb/build/wasm/return_input.wasm cd /home/kefu/dev/scylladb/test/resource/wasm/rust && /usr/bin/cargo build --target=wasm32-wasi --example=return_input --locked --manifest-path=Cargo.toml --target-dir=/home/kefu/dev/scylladb/build/test/resource/wasm/rust && wasm-opt /home/kefu/dev/scylladb/build/test/resource/wasm/rust/wasm32-wasi//debug/examples/return_input.wasm -Oz -o /home/kefu/dev/scylladb/build/wasm/return_input.wasm && wasm-strip /home/kefu/dev/scylladb/build/wasm/return_input.wasm error: failed to run `rustc` to learn about target-specific information Caused by: process didn't exit successfully: `rustc - --crate-name ___ --print=file-names --target wasm32-wasi --crate-type bin --crate-type rlib --crate-type dylib --crate-type cdylib --crate-type staticlib --crate-type proc-macro --print=sysroot --print=split-debuginfo --print=crate-name --print=cfg` (exit status: 1) --- stderr error: Error loading target specification: Could not find specification for target "wasm32-wasi". Run `rustc --print target-list` for a list of built-in targets ``` in order to workaround this issue, let's check for supported target, and use wasm32-wasip1 if wasm32-wasi is not listed as the supported target. Refs #20878 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22320	2025-01-16 16:28:29 +03:00
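The fallback described in the commit above can be sketched in Python. This is only an illustration of the selection logic, not the CMake code from the patch: the function name `pick_wasm_target` is hypothetical, and it takes the text printed by `rustc --print target-list` as a plain string so the sketch does not need a Rust toolchain. Raising an error when neither target is listed is an assumption of this sketch.

```python
def pick_wasm_target(target_list_output: str) -> str:
    """Choose a WASI target from the output of `rustc --print target-list`.

    Hypothetical helper: the real change lives in the CMake build files.
    """
    supported = set(target_list_output.split())
    # Keep using the legacy name while the toolchain still lists it.
    if "wasm32-wasi" in supported:
        return "wasm32-wasi"
    # Newer toolchains (Rust 1.84 and up, per the commit) only list wasm32-wasip1.
    if "wasm32-wasip1" in supported:
        return "wasm32-wasip1"
    # Assumption of this sketch: fail loudly if neither target is available.
    raise RuntimeError("no supported WASI target found")
```

The chosen name would then be passed to `cargo build --target=...` in place of the hard-coded `wasm32-wasi`.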
Botond Dénes	b2a03e03f7	Merge 'raft: Handle non-critical config update errors in when changing voter status.' from Sergey Zolotukhin When a node is bootstrapped, joins a cluster as a non-voter, and then changes its role to a voter, errors can occur while committing a new Raft record, for instance, if the Raft leader changes during this time. These errors are not critical and should not cause a node crash, as the action can be retried. Fixes scylladb/scylladb#20814 Backport: This issue occurs frequently and disrupts the CI workflow to some extent. Backports are needed for versions 6.1 and 6.2. Closes scylladb/scylladb#22253 * github.com:scylladb/scylladb: raft: refactor `remove_from_raft_config` to use a timed `modify_config` call. raft: Refactor functions using `modify_config` to use a common wrapper for retrying. raft: Handle non-critical config update errors in when changing status to voter. test: Add test to check that a node does not fail on unknown commit status error when starting up. raft: Add run_op_with_retry in raft_group0.	2025-01-16 11:00:47 +02:00
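The idea of a common retry wrapper can be illustrated with a small Python sketch. The name `run_op_with_retry` is borrowed from the commit list above, but the signature, the `TransientConfigError` exception and the attempt limit are assumptions made for illustration; the actual implementation is C++ code in raft_group0 and may differ.

```python
import asyncio


class TransientConfigError(Exception):
    """Hypothetical stand-in for a retryable error, e.g. an unknown commit status."""


async def run_op_with_retry(op, max_attempts=5, delay=0.0):
    """Await op(); retry on transient errors instead of letting them crash the caller."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await op()
        except TransientConfigError:
            # Give up only after the last allowed attempt.
            if attempt == max_attempts:
                raise
            await asyncio.sleep(delay)
```

Any other exception type propagates immediately, which keeps genuinely critical failures visible.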
Piotr Dulikowski	72f28ce81e	Merge 'main, view: Pair view builder drain with its start' from Dawid Mędrek In this PR, we pair draining the view builder with its start. To better understand what was done and why, let's first look at the situation before this commit and the context of it: (a) The following things happened in order: 1. The view builder would be constructed. 2. Right after that, a deferred lambda would be created to stop the view builder during shutdown. 3. group0_service would be started. 4. A deferred lambda stopping group0_service would be created right after that. 5. The view builder would be started. (b) Because the view builder depends on group0_client, it couldn't be started before starting group0_service. On the other hand, other services depend on the view builder, e.g. the stream manager. That makes changing the order of initialization a difficult problem, so we want to avoid doing that unless we're sure it's the right choice. (c) Since the view builder uses group0_client, there was a possibility of running into a segmentation fault issue in the following scenario: 1. A call to `view_builder::mark_view_build_success()` is issued. 2. We stop group0_service. 3. `view_builder::mark_view_build_success()` calls `announce_with_raft()`, which leads to a use-after-free because group0_service has already been destroyed. This very scenario took place in scylladb/scylladb#20772. Initially, we decided to solve the issue by initializing group0_service a bit earlier (scylladb/scylladb@7bad8378c7). Unfortunately, it led to other issues described in scylladb/scylladb#21534, so we revert that patch. These changes are the second attempt to the problem where we want to solve it in a safer manner. The solution we came up with is to pair the start of the view builder with a deferred lambda that deinitializes it by calling `view_builder::drain()`. No other component of the system should be able to use the view builder anymore, so it's safe to do that. 
Furthermore, that pairing makes the analysis of initialization/deinitialization order much easier. We also solve the aforementioned use-after-free issue because the view builder itself will no longer attempt to use group0_client. Note that we still pair a deferred lambda calling `view_builder::stop()` with the construction of the view builder; that function will also call `view_builder::drain()`. Another notable thing is `view_builder::drain()` may be called earlier by `storage_service::do_drain()`. In other words, these changes cover the situation when Scylla runs into a problem when starting up. Backport: The patch I'm reverting made it to 6.2, so we want to backport this one there too. Fixes scylladb/scylladb#20772 Fixes scylladb/scylladb#21534 Closes scylladb/scylladb#21909 * github.com:scylladb/scylladb: test/topology_custom: Add test for Scylla with disabled view building main, view: Pair view builder drain with its start Revert "main,cql_test_env: start group0_service before view_builder"	2025-01-15 09:50:26 +01:00
Sergey Zolotukhin	8c48f7ad62	raft: Handle non-critical config update errors in when changing status to voter. When a node is bootstrapped and joins a cluster as a non-voter, errors can occur while committing a new Raft record, for instance, if the Raft leader changes during this time. These errors are not critical and should not cause a node crash, as the action can be retried. Fixes scylladb/scylladb#20814	2025-01-15 09:49:15 +01:00
Sergey Zolotukhin	16053a86f0	test: Add test to check that a node does not fail on unknown commit status error when starting up. Test that a node is starting successfully if while joining a cluster and becoming a voter, it receives an unknown commit status error. Test for scylladb/scylladb#20814	2025-01-14 17:12:06 +01:00
Kamil Braun	2eac7a2d61	Merge 'test/pylib: two trivial cleanups' from Kefu Chai - use "foo not in bar" instead of "not foo in bar" - test/pylib: use foo instead of `'{}'.format(foo)` --- it's a cleanup, hence no need to backport. Closes scylladb/scylladb#22066 * github.com:scylladb/scylladb: test/pylib: use `foo` instead of `'{}'.format(foo)` test/pylib: use "foo not in bar" instead of "not foo in bar"	2025-01-14 16:27:44 +01:00
Nadav Har'El	15c252fd8f	Merge 'docs: Update documentation on CREATE ROLE WITH HASHED PASSWORD' from Dawid Mędrek As part of #18750, we added a CQL statement CREATE ROLE WITH SALTED HASH that prevented hashing a password when creating a role, effectively leading to inserting a hash given by the user directly into the database. In #21350, we noticed that Cassandra had implemented a CQL statement of similar semantics but different syntax. We decided to rename Scylla's statement to be compatible with Cassandra. Unfortunately, we didn't notice one more difference between what we had in Scylla and what was part of Cassandra. Scylla's statement was originally supposed to only be used when restoring the schema and the user needn't have to be aware of its existence at all: the database produced a sequence of CQL statements that the user saved to a file and when a need to restore the schema arose, they would execute the contents of the file. That's why that although we documented the feature, it was only done in the necessary places. Those that weren't related to the backup & restore procedure were deliberately skipped. Cassandra, on the other hand, added the statement for a different purpose (for details, see the relevant issue) and it was supposed to be used by the user by design. The statement is also documented as such. Since we want to preserve compatibility with Cassandra, we document the statement and its semantics in the user documentation, explicitly implying that it can be used by the user. We also add a test verifying that logging in works correctly. Fixes scylladb/scylladb#21691 Backport: not needed. The relevant code didn't make it to 6.2 or any previous version of OSS. Closes scylladb/scylladb#21752 * github.com:scylladb/scylladb: docs: Update documentation on CREATE ROLE WITH HASHED PASSWORD test/boost: Add test for creating roles with hashed passwords	2025-01-14 15:33:30 +02:00
Kefu Chai	7215d4bfe9	utils: do not include unused headers these unused includes were identified by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. please note, because quite a few source files relied on `utils/to_string.hh` to pull in the specialization of `fmt::formatter<std::optional<T>>`, after removing `#include <fmt/std.h>` from `utils/to_string.hh`, we have to include `fmt/std.h` directly. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-14 07:56:39 -05:00
Aleksandra Martyniuk	592512fd0f	test: fix memtable_flush_period test memtable_flush_period test sets the flush period to 200ms and checks whether the data is flushed after 500ms. When flush period is set, the timer is armed with the given value. On expiration, memtables are flushed and then the timer is rearmed. There is no certainty that during 500ms the flush finishes, though. Check if after 500ms flush has started. Wait until there is an sstable. Fixes: #21965. Closes scylladb/scylladb#22162	2025-01-14 07:56:38 -05:00
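The "wait until there is an sstable" step in the commit above is an instance of polling for a condition rather than asserting after a fixed sleep. A minimal Python sketch of that pattern follows; `wait_for` and its parameters are hypothetical names and are not taken from the Scylla test code.

```python
import time


def wait_for(predicate, timeout=5.0, interval=0.05):
    """Poll predicate() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    # One final check so a condition met right at the deadline still counts.
    return bool(predicate())
```

A test would first check that the flush has started and then call something like `wait_for(lambda: sstable_count() > 0)`, where `sstable_count` is again a hypothetical helper.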
Botond Dénes	686a997c04	Merge 'Complete implementation of configuring IO bandwidth limits' from Pavel Emelyanov In Scylla there are two options that control IO bandwidth limit -- the /storage_service/(compaction\|stream)_throughput REST API endpoints. The endpoints are partially implemented and have no counterparts in the nodetool. This set implements the missing bits and adds tests for new functionality. Closes scylladb/scylladb#21877 * github.com:scylladb/scylladb: nodetool: Implement [gs]etstreamthroughput commands nodetool: Implement [gs]etcompationthroughput commands test: Add validation of how IO-updating endpoints work api: Implement /storage_service/(stream\|compaction)_throughput endpoints api: Disqualify const config reference api: Implement /storage_service/stream_throughput endpoint api: Move stream throughput set/get endpoints from storage service block api: Move set_compaction_throughput_mb_per_sec to config block util: Include fmt/ranges.h in config_file.hh	2025-01-14 07:56:38 -05:00
Aleksandra Martyniuk	94f4871352	test: start waiting for task before it gets aborted Ensure that the repair task was aborted after wait API acknowledged its existence. Fixes: #22011. Closes scylladb/scylladb#22012	2025-01-14 07:56:37 -05:00
Lakshmi Narayanan Sreethar	63100b34da	sstable_directory: do not load remote sstables in process_descriptor The sstable loader relied on the generation id to provide an efficient hint about the shard that owns an sstable. But, this hint was rendered ineffective with the introduction of UUID generation, as the shard id was no longer embedded in the generation id. This also became suboptimal with the introduction of tablets. Commit `0c77f77` addressed this issue by reading the minimum from disk to determine sstable ownership but this improvement was lost with commit `63f1969`, which optimistically assumed that hints would work most of the time, which isn't true. This commit restores that change - shard id of a table is deduced by reading minimally from disk and then the sstable is fully loaded only if it belongs to the local shard. This patch also adds a testcase to verify that the sstables are loaded only in their respective shards. Fixes #21015 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-01-13 20:01:30 +05:30
Nadav Har'El	321d0fd3b1	Merge 'Alternator: Add WCU suppport for update item' from Amnon Heiman This series adds WCU support for the Alternator update item. The motivation behind it is to have a rough estimate of what a similar operation would have cost, from a WCU perspective, if used with DynamoDB. The calculation is done with minimal overhead as the prime objective; the results are values that are less than or equal to what they would have been in DynamoDB. New feature, no need to backport. Closes scylladb/scylladb#21999 * github.com:scylladb/scylladb: alternator/test_returnconsumedcapacity.py: update item alternator/executor.cc: Add WCU for update_item	2025-01-13 14:35:46 +02:00
Kamil Braun	48a4efba2f	Merge 'Fix possible data corruption due to token keys clashing in read repair.' from Sergey Zolotukhin This update addresses an issue in the mutation diff calculation algorithm used during read repair. Previously, the algorithm used `token` as the hashmap key. Since `token` is calculated based on the Murmur3 hash function, it could generate duplicate values for different partition keys, causing corruption in the affected rows' values. Fixes scylladb/scylladb#19101 Since the issue affects all the relevant scylla versions, backport to: 6.1, 6.2 Closes scylladb/scylladb#21996 * github.com:scylladb/scylladb: storage_proxy/read_repair: Remove redundant 'schema' parameter from `data_read_resolver::resolve` function. storage_proxy/read_repair: Use `partition_key` instead of `token` key for mutation diff calculation hashmap. test: Add test case for checking read repair diff calculation when having conflicting keys.	2025-01-13 10:54:34 +01:00
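Why keying a map by a hash of the partition key is unsafe can be shown with a toy Python example. The `toy_token` function below is a deliberately weak stand-in with a tiny output space, chosen so a collision is easy to produce; it is not Murmur3, and the two index functions are hypothetical illustrations rather than the read repair code itself.

```python
def toy_token(partition_key: str) -> int:
    # Deliberately tiny hash space so a collision is easy to construct.
    # The real token is derived from Murmur3, where collisions are rare but possible.
    return sum(partition_key.encode()) % 8


def index_by_token(rows):
    """Key the map by token: a colliding key silently overwrites an earlier entry."""
    index = {}
    for key, value in rows:
        index[toy_token(key)] = value
    return index


def index_by_key(rows):
    """Key the map by the partition key itself: distinct keys never clash."""
    return {key: value for key, value in rows}


# "ab" and "ba" have the same byte sum, so they share a toy token.
rows = [("ab", "row-1"), ("ba", "row-2")]
```

With these two rows, `index_by_token(rows)` keeps a single entry while `index_by_key(rows)` keeps both, which mirrors the switch from `token` to `partition_key` described above.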
Kamil Braun	88a48f2355	Merge 'Load peers table into the gossiper on boot' from Gleb Since we manage ip to id mapping directly in gossiper now we need to load the mapping on boot. We already do it anyway, but only due to a bug which checks raft topology mode config before it is set, so the code thinks that it is in the gossiper mode and loads peers table into the gossiper and token metadata. Fix the bug and load peers into the gossiper only since token metadata is managed by raft. The series also removes address map related test that no longer checks anything and replace it with unit test. It also adds the dc/rack check to "join node" rpc. The check is done during shadow round now, but for it to work it requires dc/rack to be propagated through the gossiper and we want to eventually drop it. Ref: scylladb/scylladb#21777 * 'load-peers' of https://github.com/gleb-cloudius/scylla: topology coordinator: reject replace request if topology does not match gossiper: fix the logic of shadow_round parameter storage_service: do not add endpoint to the gossiper during topology loading. storage_service: load peers into gossiper on boot in raft topology mode storage_service: set raft topology change mode before using it in join_cluster locator: drop inet_address usage to figure out per dc/rack replication test: drop test_old_ip_notification_repro.py test: address_map: check generation handling during entry addition	2025-01-13 09:40:36 +01:00
Andrei Chekun	2aea2610e0	test.py: Wait for tasks finish before going further Developers using asyncio.gather() often assume that it waits for all the futures (awaitables) given. But this isn't true when the return_exceptions parameter is False, which is the default. In that case, as soon as one future completes with an exception, the gather() call will propagate this exception immediately, and some of the unfinished tasks may continue to run in the background. This is bad for applications that use gather() to ensure that a list of background tasks has all completed. So such applications must use asyncio.gather() with return_exceptions=True, to wait for all given futures to complete either successfully or unsuccessfully. Closes scylladb/scylladb#22252	2025-01-13 09:43:28 +02:00
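The difference between the two modes of `asyncio.gather()` can be reproduced with a short, self-contained Python sketch. The `worker` coroutine and the two driver functions are hypothetical names made up for this illustration and do not come from test.py.

```python
import asyncio


async def worker(name, delay, fail, finished):
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} failed")
    finished.append(name)


async def gather_default():
    """With the default return_exceptions=False, gather() raises early."""
    finished = []
    tasks = [
        asyncio.create_task(worker("fast", 0.01, True, finished)),
        asyncio.create_task(worker("slow", 0.05, False, finished)),
    ]
    try:
        await asyncio.gather(*tasks)
    except RuntimeError:
        pass
    # "slow" is still running in the background at this point.
    snapshot = list(finished)
    # Clean up so no task outlives the event loop.
    await asyncio.gather(*tasks, return_exceptions=True)
    return snapshot


async def gather_wait_all():
    """With return_exceptions=True, gather() waits for every awaitable."""
    finished = []
    results = await asyncio.gather(
        worker("fast", 0.01, True, finished),
        worker("slow", 0.05, False, finished),
        return_exceptions=True,
    )
    return finished, results
```

In the first function the snapshot is taken before "slow" has finished; in the second, the failure of "fast" is returned as a result value and "slow" has already completed by the time `gather()` returns.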
Kefu Chai	752e6561fb	test/pylib: log if scylla exits with non-zero status code When destroying a test cluster, ScyllaCluster.stop() calls ScyllaServer.stop() for each running server. Previously, non-zero exit status codes from scylla servers were silently ignored during test teardown. This change modifies the logging behavior to print the exit status code when a scylla server exits with a non-zero status. This helps developers quickly identify potential issues or unexpected terminations during test runs. Differences in handling: - Before: Non-zero exit codes were not logged - After: Non-zero exit codes are printed, providing visibility into server termination errors This improvement aids in diagnosing intermittent test failures or unexpected server shutdowns during test execution. Refs #21742 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21934	2025-01-13 09:09:43 +03:00
Kefu Chai	d815d7013c	sstables_loader: report progress with the unit of batch We restore a snapshot of table by streaming the sstables of the given snapshot of the table using `sstable_streamer::stream_sstable_mutations()` in batches. This function reads mutations from a set of sstables, and streams them to the target nodes. Due to the limit of this function, we are not able to track the progress in bytes. Previously, progress tracking used individual sstables as units, which caused inaccuracies with tablet-distributed tables, where: - An sstable spanning multiple tablets could be counted multiple times - Progress reporting could become misleading (e.g., showing "40" progress for a table with 10 sstables) This change introduces a more robust progress tracking method: - Use "batch" as the unit of progress instead of individual sstables. Each batch represents a tablet when restoring a table snapshot if the table being restored is distributed with tablets. When it comes to tables distributed with vnode, each batch represents an sstable. - Stream sstables for each tablet separately, handling both partially and fully contained sstables - Calculate progress based on the total number of sstables being streamed - Skip tablet IDs with no owned tokens For vnode-distributed tables, the number of "batches" directly corresponds to the number of sstables, ensuring: - Consistent progress reporting across different table distribution models - Simplified implementation - Accurate representation of restore progress The new approach provides a more reliable and uniform method of tracking restoration progress across different table distribution strategies. Also, corrected the use of `_sstables.size()` in `sstable_streamer::stream_sstables()`. It addressed a review comment from Pavel that was inadvertently overlooked during a previous rebase of commit `5ab4932f34`. 
Fixes scylladb/scylladb#21816 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21841	2025-01-13 09:04:35 +03:00
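The over-counting problem described in the commit above can be illustrated with a toy Python model. The mapping from tablet id to sstable names and both function names are hypothetical; in particular, skipping tablets with an empty sstable list is an assumption of this sketch and only loosely mirrors the "skip tablet IDs with no owned tokens" rule.

```python
def progress_by_sstable(tablet_to_sstables):
    """Old unit: one tick per (tablet, sstable) pair, so shared sstables are counted repeatedly."""
    return sum(len(sstables) for sstables in tablet_to_sstables.values())


def progress_by_batch(tablet_to_sstables):
    """New unit: one tick per batch, i.e. per tablet that has something to stream."""
    return sum(1 for sstables in tablet_to_sstables.values() if sstables)


# Ten sstables, each spanning all four tablets of a toy table.
layout = {tablet: [f"sst-{i}" for i in range(10)] for tablet in range(4)}
```

For this layout the per-sstable count reaches 40 even though only 10 sstables exist, while the per-batch count is 4, one per tablet.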
Dawid Mędrek	d1f960eee2	test/topology_custom: Add test for Scylla with disabled view building Before this commit, there doesn't seem to have been a test verifying that starting and shutting down Scylla behave correctly when the configuration option `view_building` is set to false. In these changes, we add one.	2025-01-13 00:41:27 +01:00
Dawid Mędrek	06ce976370	main, view: Pair view builder drain with its start In these changes, we pair draining the view builder with its start. To better understand what was done and why, let's first look at the situation before this commit and the context of it: (a) The following things happened in order: 1. The view builder would be constructed. 2. Right after that, a deferred lambda would be created to stop the view builder during shutdown. 3. group0_service would be started. 4. A deferred lambda stopping group0_service would be created right after that. 5. The view builder would be started. (b) Because the view builder depends on group0_client, it couldn't be started before starting group0_service. On the other hand, other services depend on the view builder, e.g. the stream manager. That makes changing the order of initialization a difficult problem, so we want to avoid doing that unless we're sure it's the right choice. (c) Since the view builder uses group0_client, there was a possibility of running into a segmentation fault issue in the following scenario: 1. A call to `view_builder::mark_view_build_success()` is issued. 2. We stop group0_service. 3. `view_builder::mark_view_build_success()` calls `announce_with_raft()`, which leads to a use-after-free because group0_service has already been destroyed. This very scenario took place in scylladb/scylladb#20772. Initially, we decided to solve the issue by initializing group0_service a bit earlier (scylladb/scylladb@7bad8378c7). Unfortunately, it led to other issues described in scylladb/scylladb#21534. We reverted that change in the previous commit. These changes are the second attempt to the problem where we want to solve it in a safer manner. The solution we came up with is to pair the start of the view builder with a deferred lambda that deinitializes it by calling `view_builder::drain()`. No other component of the system should be able to use the view builder anymore, so it's safe to do that. 
Furthermore, that pairing makes the analysis of initialization/deinitialization order much easier. We also solve the aforementioned use-after-free issue because the view builder itself will no longer attempt to use group0_client. Note that we still pair a deferred lambda calling `view_builder::stop()` with the construction of the view builder; that function will also call `view_builder::drain()`. Another notable thing is `view_builder::drain()` may be called earlier by `storage_service::do_drain()`. In other words, these changes cover the situation when Scylla runs into a problem when starting up. Fixes scylladb/scylladb#20772	2025-01-13 00:41:22 +01:00
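The ordering argument in the commit above can be modelled in Python with `contextlib.ExitStack`, whose callbacks run in reverse registration order, much like deferred lambdas that are destroyed in reverse order of creation. This is a sketch of the ordering only, under the assumption that each service can be represented by a log entry; `run_lifecycle` and its flag are hypothetical names and nothing here is the actual code in main.

```python
from contextlib import ExitStack


def run_lifecycle(fail_before_view_builder_start=False):
    """Record the order of init and deinit steps; each start is paired with a deferred action."""
    events = []
    try:
        with ExitStack() as deferred:
            events.append("construct view_builder")
            deferred.callback(events.append, "stop view_builder")

            events.append("start group0_service")
            deferred.callback(events.append, "stop group0_service")

            if fail_before_view_builder_start:
                raise RuntimeError("startup problem")

            events.append("start view_builder")
            # Paired with the start: runs first on shutdown, before group0_service stops.
            deferred.callback(events.append, "drain view_builder")
    except RuntimeError:
        pass
    return events
```

On a normal run the drain is recorded before "stop group0_service", so the view builder has stopped using group0 by the time it goes away. If startup fails before the view builder starts, no drain is registered and only the two stop actions run, which matches the note that `view_builder::stop()` remains paired with construction.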
Dawid Mędrek	a5715086a4	Revert "main,cql_test_env: start group0_service before view_builder" The patch solved a problem related to an initialization order (scylladb/scylladb#20772), but we ran into another one: scylladb/scylladb#21534. After moving the initialization of group0_service, it ended up being destroyed AFTER the CDC generation service would. Since CDC generations are accessed in `storage_service::topology_state_load()`: ``` for (const auto& gen_id : _topology_state_machine._topology.committed_cdc_generations) { rtlogger.trace("topology_state_load: process committed cdc generation {}", gen_id); co_await _cdc_gens.local().handle_cdc_generation(gen_id); ``` we started getting the following failure: ``` Service &seastar::sharded<cdc::generation_service>::local() [Service = cdc::generation_service]: Assertion `local_is_initialized()' failed. ``` We're reverting the patch to go back to a more stable version of Scylla and in the following commit, we'll solve the original issue in a more systematic way. This reverts commit `7bad8378c7`.	2025-01-12 18:13:56 +01:00
Avi Kivity	814942505f	Merge 'Introduce Encryption-at-Rest (EAR) for sstables and commitlog' from Calle Wilund Fixes https://github.com/scylladb/scylla-enterprise/issues/5016#issuecomment-2558464631 EAR - encryption at rest. Allows on-disk file encryption of sstables and commitlog data. Introduces OpenSSL based file level encrypted storage, managed via a set of providers ranging from local files to cloud KMS providers. For a more comprehensive explanation, see the included docs (or if possible, original source tree). Manual bulk merge of EAR feature from enterprise repo to main scylla repo. Breaks some features apart, but main EAR is still a humongous commit, because to separate this I would have to mess with code incrementally, adding time and risk. This PR includes the local file gen tool, tests and also p11 validation. Note: CI will not execute the full tests unless master CI is set to provide the same environment as the enterprise one. Not sure about the status of this ATM. Note: Includes code to compile against cryptsoft kmipc SDK, but not the SDK. If you happen to check out this tree in the scylla folder and configure, it will be linked against and KMIP functionality will be enabled, otherwise not. Closes scylladb/scylladb#22233 * github.com:scylladb/scylladb: docs: Add EAR docs main/build: Add p11-kit and initialize tools: Add local-file-key-generator tool tests: Add EAR tests tmpdir: shorten test tempdir path EAR: port the ear feature from enterprise cql_test_env: Add optional query timeout schema/migration_manager: Add schema validate sstables: add get_shared_components accessor config/config_file: Add exports and definitions of config_type_for<>	2025-01-12 16:10:46 +02:00
Benny Halevy	8d2ff8a915	utils: add disk_space_monitor Instantiated only on shard 0. Currently, the only subscriber is a unit test. A manual unit test using a loop mount was added. Note that the test requires sudo access and root access to /dev/loop, so it cannot run in a rootless podman instance, where it would fail with Permission denied. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#21523	2025-01-12 14:51:15 +02:00
Piotr Smaron	288f9b2b15	Introduce LDAP role manager & saslauthd authenticator This PR extends authentication with 2 mechanisms: - a new role_manager subclass, which allows managing users via LDAP server, - a new authenticator, which delegates plaintext authentication to a running saslauthd daemon. The features have been ported from the enterprise repository with their test.py tests and the documentation as part of changing license to source available. Fixes: scylladb/scylla-enterprise#5000 Fixes: scylladb/scylla-enterprise#5001 Closes scylladb/scylladb#22030	2025-01-12 14:50:29 +02:00
Calle Wilund	c596ae6eb1	tests: Add EAR tests Adds the migrated EAR/encryption tests. Note: Until scylla CI is updated to provide all the proper ENV vars, some tests will not execute.	2025-01-09 10:40:39 +00:00
Calle Wilund	ee62b61c84	tmpdir: shorten test tempdir path To make certain python tests work in CI	2025-01-09 10:37:35 +00:00
Sergey Zolotukhin	2f1731c551	test: Include parent test name in `ScyllaClusterManager` log file names. Add the test file name to `ScyllaClusterManager` log file names alongside the test function name. This avoids race conditions when tests with the same function names are executed simultaneously. Fixes scylladb/scylladb#21807 Backport: not needed since this is a fix in the testing scripts. Closes scylladb/scylladb#22192	2025-01-08 15:42:31 +02:00
Calle Wilund	e734fc11ec	cql_test_env: Add optional query timeout Some tests need queries to actually fail.	2025-01-08 12:50:03 +00:00
Kefu Chai	d0a3311ced	locator: do not include unused headers these unused includes were identified by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22199	2025-01-08 14:26:48 +02:00
Benny Halevy	e6efaa3b73	Update seastar submodule * seastar 3133ecdd...a9bef537 (24): > file: add file_system_space > future: avoid inheriting from future payload type > treewide: include fmt/ostream.h for using fmt::print() > build: remove messages used for debugging > demos: Rename websocket demo to websocket_server demo > demos: Add a way to set port from cmd line in websocket demo > tls: Add optional builder + future-wait to cert reload callback + expose rebuild > rwlock: add try_hold_{read,write}_lock methods > json: add moving push to json_list > github: add a step to build "check-include-style" > build: add a target for checking include style > scheduling_group: use map for key configs instead of vector > scheduling_group: fix indentation > scheduling_group: fix race between scheduling group and key creation > http: Make request writing functions public > http: Expose connection_factory implementations > metrics: Use separate type for shared metadata > file: unexpected throw from inside noexcept > metrics: Internalize metric label sets > thread: optimize maybe_yield > reactor: fix crash in pending registration task after poller dtor > net: Fix ipv6 socket_address comparision > reactor, linux-aio: factor out get_smp_count() lambda > reactor, linux-aio: restore "available_aio" meaning after "reserve_iocbs" Fixed usage of seastar metric label sets due to: scylladb/seastar@733420d57 Merge 'metrics: Internalize metric label sets' from Stephan Dollberg Closes scylladb/scylladb#22076	2025-01-08 09:37:16 +02:00
Wojciech Mitros	d04f376227	mv: add an experimental feature for creating views using tablets We still have a number of issues to be solved for views with tablets. Until they are fixed, we should prevent users from creating them, and use the vnode-based views instead. This patch prepares the feature for enabling views with tablets. The feature is disabled by default, but currently it has no effect. After all tests are adjusted to use the feature, we should depend on the feature for deciding whether we can create materialized views in tablet-enabled keyspaces. The unit tests are adjusted to enable this feature explicitly, and it's also added to the scylla sstable tool config - this tool treats all tables as if they were tablet-based (surprisingly, with SimpleStrategy), so for it to work on views, the new feature must be enabled. Refs scylladb/scylladb#21832 Closes scylladb/scylladb#21833	2025-01-07 15:52:36 +01:00
Asias He	935dcd69fa	repair: Remove repair_task_info only when repair is finished In case of error, repair will be moved into the end_repair stage. We should not remove repair_task_info in this case because the repair task requested by the user is not finished yet. To fix, we should remove repair_task_info at the end of repair stage. Tests are added to ensure failed repair is not reported as finished. Closes scylladb/scylladb#21973	2025-01-07 16:19:40 +02:00
Emil Maskovsky	2ac9ed2073	raft: test the limited voters feature Test the limited voters feature by creating a cluster with 3 DCs, one of them disproportionately larger than the others. The raft majority should not be lost in case the large DC goes down. Fixes: scylladb/scylla#21915 Refs: scylladb/scylla#18793 Closes scylladb/scylladb#21901	2025-01-07 15:09:49 +01:00
Michael Litvak	0617564123	db/commitlog: make the commit log hard limit mandatory mark the config parameter --commitlog-use-hard-size-limit as deprecated so the default 'true' is always used, making the hard limit mandatory. Fixes scylladb/scylladb#16471 Closes scylladb/scylladb#21804	2025-01-07 15:03:56 +02:00
Botond Dénes	b3f8c4faa7	Merge 'node_ops: filter topology_requests entries shown by node_ops_virtual_task' from Aleksandra Martyniuk node_ops_virtual_task does not filter the entries of system.topology_request and so it creates statuses of operations that aren't node ops. Filter the entries used by node_ops_virtual_task. With this change, the status of a bootstrap of the first node will not be visible. Fixes: https://github.com/scylladb/scylladb/issues/22008. Needs backport to 6.2 that introduced node_ops_virtual_task Closes scylladb/scylladb#22009 * github.com:scylladb/scylladb: test: truncate the table before node ops task checks node_ops: rename a method that get node ops entries node_ops: filter topology_requests entries	2025-01-07 14:17:01 +02:00
Kefu Chai	353b522ca0	treewide: migrate from boost::adaptors::reversed to std::views::reverse now that we are allowed to use C++23. we now have the luxury of using `std::views::reverse`. - replace `boost::adaptors::transformed` with `std::views::transform` - remove unused `#include <boost/range/adaptor/reversed.hpp>` this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-07 13:22:00 +02:00
Botond Dénes	69150f0680	Merge 'Fix edge case issues related to tablet draining ' from Tomasz Grabiec Main problem: If we're draining the last node in a DC, we won't have a chance to evaluate candidates and notice that constraints cannot be satisfied (N < RF). Draining will succeed and node will be removed with replicas still present on that node. This will cause later draining in the same DC to fail when we will have 2 replicas which need relocation for a given tablet. The expected behavior is for draining to fail, because we cannot keep the RF in the DC. This is consistent, for example, with what happens when removing a node in a 2-node cluster with RF=2. Fixes #21826 Secondary problem: We allowed tablet_draining transition to be exited with undrained nodes, leaving replicas on nodes in the "left" state. Third problem: We removed DOWN nodes from the candidate node set, even when draining. This is not safe because it may lead to overload. This also makes the "main problem" more likely by extending it to the scenario when the DC is DOWN. The overload part is not a problem in practice currently, since migrations will block on global topology barrier if there are DOWN nodes. Closes scylladb/scylladb#21928 * github.com:scylladb/scylladb: tablets: load_balancer: Fail when draining with no candidate nodes tablets: load_balancer: Ignore skip_list when draining tablets: topology_coordinator: Keep tablet_draining transition if nodes are not drained	2025-01-07 13:04:00 +02:00
Kefu Chai	e4463b11af	treewide: replace boost::algorithm::join() with fmt::join() Replace usages of `boost::algorithm::join()` with `fmt::join()` to improve performance and reduce dependency on Boost. `fmt::join()` allows direct formatting of ranges and tuples with custom separators without creating intermediate strings. When formatting comma-separated values into another string, fmt::join() avoids the overhead of temporary string creation that `boost::algorithm::join()` requires. This change also helps streamline our dependencies by leveraging the existing fmt library instead of Boost.Algorithm. To avoid the ambiguity, some caller sites were updated to call `seastar::format()` explicitly. See also - boost::algorithm::join(): https://www.boost.org/doc/libs/1_87_0/doc/html/string_algo/reference.html#doxygen.join_8hpp - fmt::join(): https://fmt.dev/11.0/api/#ranges-api Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22082	2025-01-07 12:45:05 +02:00
Amnon Heiman	7390116620	alternator/test_returnconsumedcapacity.py: update item This patch adds tests for return consumed capacity for update_item. The tests cover: a simple update for a small object, a missing item, an update with a very large attribute (where the attribute itself is more than 1KB), and an update of a big item that uses read-before-write.	2025-01-06 09:55:17 +02:00
Nadav Har'El	e919794db8	test/alternator: fix mistakes introduced with test_service_levels.py This patch undoes multiple mistakes done when introducing the test for service levels in pull request #22031: 1. The PR introduced in test/alternator/run and test/alternator/suite.yaml a permanent role and service level that the service-level test is supposed to use. This was a mistake - the test can create the service level for its own use, using CQL, it does not need to assume such a service level already exists. It's important to fix this to allow the service level test to run against an installation of Scylla not set up by our own scripts. Moreover, while the code in suite.yaml was correct, the code in "run" was incorrect (used an outdated keyspace name). This patch removes that incorrect code. 2. The PR introduced a duplicate "cql" fixture, copied verbatim from test_cql_rbac.py (including a comment that was correct only in the latter file :-)). Let's de-duplicate it, using the fixture that I moved to conftest.py in the previous patch. 3. The PR used temporary_grant(). This needlessly complicated the test and added even more duplicate code, and this patch removes all that stuff. This test is about service levels, not RBAC and "grant". This test should just use a superuser role that has the permissions to do everything, and doesn't need to be granted specific permissions. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-01-05 19:40:14 +02:00


8114 Commits