scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-29 04:37:00 +00:00

Author	SHA1	Message	Date
Nadav Har'El	78c10af960	test/cqlpy: add reproducer for INSERT JSON .. IF NOT EXISTS bug This patch adds an xfailing test reproducing a bug where, when IF NOT EXISTS is added to an INSERT JSON statement, the IF NOT EXISTS is ignored. This bug has been known for 4 years (issue #8682) and even has a FIXME referring to it in cql3/statements/update_statement.cc, but until now we didn't have a reproducing test. The tests in this patch also show that this bug is specific to INSERT JSON - regular INSERT works correctly - and also that Cassandra works correctly (and passes the test). Refs #8682 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25244	2025-07-30 20:14:50 +03:00
Piotr Smaron	8d5249420b	Update seastar submodule * seastar 60b2e7da...7c32d290 (14): > posix: Replace static_assert with concept > tls: Push iovec with the help of put(vector<temporary_buffer>) > io_queue: Narrow down friendship with reactor > util: drop concepts.hh > reactor: Re-use posix::to_timespec() helper > Fix incorrect defaults for io queue iops/bandwidth > net: functions describing ssl connection > Add label values to the duplicate metrics exception > Merge 'Nested scheduling groups (CPU only)' from Pavel Emelyanov test: Add unit test for cross-sched-groups wakeups test: Add unit test for fair CPU scheduling test: Add unit test for basic supergrops manipulations test: Add perf test for context switch latency scheduling: Add an internal method to get group's supergroup reactor: Add supergroup get_shares() API reactor: Add supergroup::set_shares() API reactor: Create scheduling groups in supergroups reactor: Supergroups destroying API reactor: Supergroups creating API reactor: Pass parent pointer to task_queue from caller reactor: Wakeup queue group on child activation reactor: Add pure virtual sched_entity::run_tasks() method reactor: Make task_queue_group be sched_entity too reactor: Split task_queue_group::run_some_tasks() reactor: Count and limit supergroup children reactor: Link sched entity to its parent reactor: Switch activate(task_queue) to work on sched_entity reactor: Move set_shares() to sched_entity() reactor: Make account_runtime() work with sched_entity reactor: Make insert_activating_task_queue() work on sched_entity reactor: Make pop_active_task_queue() work on sched_entity reactor: Make insert_active_task_queue() work on sched_entity reactor: Move timings to sched_entity reactor: Move active bit to sched_entity reactor: Move shares to sched_entity reactor: Move vruntime to sched_entity reactor: Introduce sched_entity reactor: Rename _activating_task_queues -> _activating reactor: Remove local atq variable reactor: Rename _active_task_queues 
-> _active reactor: Move account_runtime() to task_queue_group reactor: Move vruntime update from task_queue into _group reactor: Simplify task_queue_group::run_some_tasks() reactor: Move run_some_tasks() into task_queue_group reactor: Move insert_activating_task_queues() into task_queue_group reactor: Move pop_active_task_queue() into task_queue_group reactor: Move insert_active_task_queue() into task_queue_group reactor: Introduce and use task_queue_group::activate(task_queue) reactor: Introduce task_queue_group::active() reactor: Wrap scheduling fields into task_queue_group reactor: Simplify task_queue::activate() reactor: Rename task_queue::activate() -> wakeup() reactor: Make activate() method of class task_queue reactor: Make task_queue::run_tasks() return bool reactor: Simplify task_queue::run_tasks() reactor: Make run_tasks() method of class task_queue > Fix hang in io_queue for big write ioproperties numbers > split random io buffer size in 2 options > reactor: document run_in_background > Merge 'Add io_queue unit test for checking request rates' from Robert Bindar Add unit test for validating computed params in io_queue Move `disk_params` and `disk_config_params` to their own unit Add an overload for `disk_config_params::generate_config` Closes scylladb/scylladb#25254	2025-07-30 16:44:18 +03:00
Patryk Jędrzejczak	5ce16488c9	Merge 'test/cqlpy: two small fixes for "--release" feature' from Nadav Har'El This small series fixes two small bugs in the "--release" feature of test/cqlpy/run and test/alternator/run, which allows a developer to run single-node functional tests against any past release of Scylla. The two patches fix: 1. Allow "run --release" to be used when Scylla has not even been built from source. 2. Fix a mistake in choosing the most recent release when only a ".0" and RC releases are available. This is currently the case for the 2025.2 branch, which is why I discovered the bug now. Fixes #25223 This patch only affects the developer's experience when using the test/cqlpy/run script manually (these scripts are not used by CI), so it should not be backported. Closes scylladb/scylladb#25227 * https://github.com/scylladb/scylladb: test/cqlpy: fix fetch_scylla.py for .0 releases test/cqlpy: fix "run --release" when Scylla hasn't been built	2025-07-30 15:13:26 +02:00
Aleksandra Martyniuk	99ff08ae78	streaming: close sink when exception is thrown If an exception is thrown in result_handling_cont in streaming, then the sink does not get closed. This leads to a node crash. Close sink in exception handler. Fixes: https://github.com/scylladb/scylladb/issues/25165. Closes scylladb/scylladb#25238	2025-07-30 14:26:14 +03:00
Andrei Chekun	4c33ff791b	build: add pytest-timeout to the toolchain Adding this plugin allows using a timeout for a single test or for the whole session. This can be useful for the Unit Test Custom task in the pipeline to avoid running tests in batches, which would mess with the test names later in Jenkins. Closes #25210 [avi: regenerate frozen toolchain with optimized clang from https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz ] Closes scylladb/scylladb#25243	2025-07-30 12:53:10 +03:00
Botond Dénes	2985c343ed	Merge 'repair: Avoid too many fragments in a single repair_row_on_wire' from Asias He When repairing a partition with many rows, we can store many fragments in a repair_row_on_wire object which is sent as a rpc stream message. This could cause reactor stalls when the rpc stream compression is turned on, because the compression compresses the whole message without any split and compression. This patch solves the problem at the higher level by reducing the message size that is sent to the rpc stream. Tests are added to make sure the message split works. Fixes #24808 Closes scylladb/scylladb#25002 * github.com:scylladb/scylladb: repair: Avoid too many fragments in a single repair_row_on_wire repair: Change partition_key_and_mutation_fragments to use chunked_vector utils: Allow chunked_vector::erase to work with non-default-constructible type	2025-07-29 17:45:57 +03:00
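The message-splitting idea in the repair change above can be illustrated with a minimal Python sketch. This is not the repair code itself (which is C++); the function name `split_into_messages`, the `max_bytes` limit, and the use of plain `bytes` objects as fragments are all assumptions made for illustration. It only shows the general technique of capping how much is packed into one message so that no single message grows without bound.

```python
def split_into_messages(fragments, max_bytes):
    """Group fragments into batches whose total size stays within max_bytes.

    Hypothetical helper: 'fragments' is an iterable of bytes objects and
    'max_bytes' is an assumed per-message size limit. A single fragment
    larger than max_bytes is still sent, alone in its own batch.
    """
    batch, size = [], 0
    for frag in fragments:
        # Flush the current batch before it would exceed the limit.
        if batch and size + len(frag) > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(frag)
        size += len(frag)
    if batch:
        yield batch


# Example: five 40-byte fragments with a 100-byte limit.
example_batches = list(split_into_messages([b"x" * 40] * 5, max_bytes=100))
```

Each batch would then be sent as its own stream message, so a compressor working on whole messages never sees more than roughly `max_bytes` at a time.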
Patryk Jędrzejczak	8e43856ca7	Merge 'Pass more elaborated "reasons" to stop_ongoing_compactions()' from Pavel Emelyanov When running compactions are aborted by the aforementioned helper, a line like "Compaction for ks/cf was stopped due to: user-triggered operation" appears in the logs. This message could've been better, since the same "user-triggered operation" text may stand for several distinct reasons. With this PR the message will help tell "truncate", "cleanup", "rewrite" and "split" apart. Closes scylladb/scylladb#25136 * https://github.com/scylladb/scylladb: compaction: Pass "reason" to perform_task_on_all_files() compaction: Pass "reason" to run_with_compaction_disabled() compaction: Pass "reason" to stop_and_disable_compaction()	2025-07-29 16:06:17 +02:00
Pavel Emelyanov	286fad4da6	api: Simplify table_info::name extraction with std::views::transform Instead of using lambda, pass pointer to struct member. The result is the same, but the code is nicer. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25123	2025-07-29 15:56:58 +02:00
Nadav Har'El	22f845b128	docs/alternator: mention missing ShardFilter support Add in docs/alternator/compatibility.md a mention of the ShardFilter option which we don't support in Alternator Streams. This option was only introduced to DynamoDB a week ago, so it's not surprising we don't yet support it :-) Refs #25160 Closes scylladb/scylladb#25161	2025-07-29 14:37:24 +03:00
Andrei Chekun	a6a3d119e8	docs: update documentation with new way of running C++ tests The documentation had outdated information on how to run C++ tests. Additionally, some information was added about gathered test metrics. Closes scylladb/scylladb#25180	2025-07-29 14:36:19 +03:00
Dawid Mędrek	408b45fa7e	db/commitlog: Extend error messages for corrupted data We're providing additional information in error messages when throwing an exception related to data corruption: when a segment is truncated and when its content is invalid. That might prove helpful when debugging. Closes scylladb/scylladb#25190	2025-07-29 14:35:14 +03:00
Anna Stuchlik	b67bb641bc	doc: add OS support for ScyllaDB 2025.3 This commit adds the information about support for platforms in ScyllaDB version 2025.3. Fixes https://github.com/scylladb/scylladb/issues/24698 Closes scylladb/scylladb#25220	2025-07-29 14:33:12 +03:00
Anna Stuchlik	8365219d40	doc: add the upgrade guide from 2025.2 to 2025.3 This PR adds the upgrade guide from version 2025.2 to 2025.3. Also, it removes the upgrade guide existing for the previous version that is irrelevant in 2025.2 (upgrade from 2025.1 to 2025.2). Note that the new guide does not include the "Enable Consistent Topology Updates" page and note, as users upgrading to 2025.3 have consistent topology updates already enabled. Fixes https://github.com/scylladb/scylladb/issues/24696 Closes scylladb/scylladb#25219	2025-07-29 14:32:31 +03:00
Avi Kivity	11ee58090c	commitlog: replace std::enable_if with a constraint std::enable_if is obsolete and was replaced with concepts and constraints. Replace the std::is_fundamental_v enable_if constraint with std::integral. The latter is more accurate - std::ntoh() is not defined for floats, for example. In any case, we only read integrals in commitlog. Closes scylladb/scylladb#25226	2025-07-29 12:51:24 +02:00
Michał Chojnowski	6d27065f99	cql3/result_set: set GLOBAL_TABLES_SPEC in `metadata` if appropriate Unless the client uses the SKIP_METADATA flag, Scylla attaches some metadata to query results returned to the CQL client. In particular, it attaches the spec (keyspace name, table name, name, type) of the returned columns. By default, the keyspace name and table name are present in each column spec. However, since they are almost always the same for every column (I can't think of any case when they aren't the same; it would make sense if Cassandra supported joins, but it doesn't) that's a waste. So, as an optimization, the CQL protocol has the GLOBAL_TABLES_SPEC flag. The flag can be set if all columns belong to the same table, and if it is set, then the keyspace and table name are only written in the first column spec, and skipped in other column specs. Scylla sets this flag, if appropriate, in responses to PREPARE requests. But it never sets the flag in responses to queries. But it could. And this patch causes it to do that. Fixes #17788 Closes scylladb/scylladb#25205	2025-07-29 12:40:12 +03:00
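To make the size saving from GLOBAL_TABLES_SPEC concrete, here is a simplified Python sketch of result-metadata encoding. It is an illustration only, not Scylla's serializer: the function names, the `allow_global` switch, and the reduction of column types to a bare 16-bit id are assumptions, and paging state and the other metadata flags are left out. The flag value 0x0001 and the length-prefixed string layout follow the CQL binary protocol as I understand it.

```python
import struct

GLOBAL_TABLES_SPEC = 0x0001  # metadata flag bit in the CQL binary protocol


def cql_string(text):
    """Encode a protocol [string]: 2-byte big-endian length, then UTF-8 bytes."""
    data = text.encode("utf-8")
    return struct.pack(">H", len(data)) + data


def encode_metadata(columns, allow_global=True):
    """Encode simplified result metadata.

    'columns' is a list of (keyspace, table, name, type_id) tuples.
    'allow_global' is a hypothetical switch used here only to compare sizes.
    """
    tables = {(ks, tbl) for ks, tbl, _, _ in columns}
    use_global = allow_global and len(tables) == 1
    flags = GLOBAL_TABLES_SPEC if use_global else 0
    out = struct.pack(">ii", flags, len(columns))
    if use_global:
        # Keyspace and table names are written once for all columns.
        ks, tbl = next(iter(tables))
        out += cql_string(ks) + cql_string(tbl)
    for ks, tbl, name, type_id in columns:
        if not use_global:
            # Without the flag, every column spec repeats both names.
            out += cql_string(ks) + cql_string(tbl)
        out += cql_string(name) + struct.pack(">H", type_id)
    return out


# Three columns of the same table; 0x0009 and 0x000D are the int and varchar ids.
cols = [("ks", "tbl", "pk", 0x0009), ("ks", "tbl", "ck", 0x0009), ("ks", "tbl", "v", 0x000D)]
with_flag = encode_metadata(cols)
without_flag = encode_metadata(cols, allow_global=False)
```

With the flag set, the names are emitted once instead of once per column, so the saving grows with the number of columns in the result.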
Nadav Har'El	f6a3e6fbf0	sstables: don't depend on fmt 11.1 to build A recent commit `a0c29055e5` added some trace printouts which print an std::reference_wrapper<>. Apparently a formatter for this type was only added to fmt in version 11.1.0, and it doesn't exist on earlier versions, such as fmt 11.0.2 on Fedora 41. Let's avoid requiring shiny-new versions of fmt. The workaround is easy: just unwrap the reference_wrapper - print pr.get() instead of just pr, and Scylla returns to building correctly on Fedora 41. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25228	2025-07-29 11:32:06 +02:00
Patryk Jędrzejczak	3299ffba51	Merge 'raft_group0: split shutdown into abort-and-drain and destroy' from Petr Gusev Previously, `raft_group0::abort()` was called in `storage_service::do_drain` (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because `raft::server` depends on storage (via `raft_sys_table_storage` and `group0_state_machine`). However, this caused issues: services like `sstable_dict_autotrainer` and `auth::service`, which use `group0_client` but are not stopped by `storage_service`, could trigger use-after-free if `raft_group0` was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used. This PR reworks the shutdown logic: * Introduces `abort_and_drain()`, which aborts the server and waits for background tasks to finish, but keeps the server object alive. Clients will see `raft::stopped_error` if they try to access group0 after this method is called. * Final destruction now happens in `abort_and_destroy()`, called later from `main.cc`, ensuring safe cleanup. The `raft_server_for_group::aborted` is changed to a `shared_future`, as it is now awaited in both abort methods. Node startup can fail before reaching `storage_service`, in which case `drain_on_shutdown()` and `abort_and_drain()` are never called. To ensure proper cleanup, `raft_group0` deinitialization logic must be included in both `abort_and_drain()` and `abort_and_destroy()`. Refs #25115 Fixes #24625 Backport: the changes are complicated and not safe to backport, we'll backport a revert of the original patch (#24418) in a separate PR. Closes scylladb/scylladb#25151 * https://github.com/scylladb/scylladb: raft_group0: split shutdown into abort_and_drain and destroy Revert "main.cc: fix group0 shutdown order"	2025-07-29 10:39:00 +02:00
Asias He	e28c75aa79	repair: Avoid too many fragments in a single repair_row_on_wire When repairing a partition with many rows, we can store many fragments in a repair_row_on_wire object which is sent as a rpc stream message. This could cause reactor stalls when the rpc stream compression is turned on, because the compression compresses the whole message without any split and compression. This patch solves the problem at the higher level by reducing the message size that is sent to the rpc stream. Tests are added to make sure the message split works. Fixes #24808	2025-07-29 13:43:53 +08:00
Asias He	266a518e4c	repair: Change partition_key_and_mutation_fragments to use chunked_vector With the change in "repair: Avoid too many fragments in a single repair_row_on_wire", the std::list<frozen_mutation_fragment> _mfs; in partition_key_and_mutation_fragments will not contain large number of fragments any more. Switch to use chunked_vector.	2025-07-29 13:43:17 +08:00
Asias He	4a4fbae8f7	utils: Allow chunked_vector::erase to work with non-default-constructible type This is needed for chunked_vector<frozen_mutation_fragment> in repair.	2025-07-29 13:43:17 +08:00
Avi Kivity	d3cdb88fe7	tools: toolchain: dbuild: increase depth of nested podman configuration coverage The initial support for nested containers (`2d2a2ef277`) worked on my machine (tm) and even laptop, but does not work on fresh installs. This is likely due to changes in where persistent configuration is stored on the host between various podman versions; even though my podman is fully updated, it uses configuration created long ago. Make nested containers work on fresh installs by also configuring /etc/containers/storage.conf. The important piece is to set graphroot to the same location as the host. Verified both on my machine and on a fresh install. Closes scylladb/scylladb#25156	2025-07-29 08:23:41 +03:00
Botond Dénes	f3ed27bd9e	Merge 'Move feature-service config creation code out of feature-service itself' from Pavel Emelyanov Nowadays the way to configure an internal service is 1. service declares its config struct 2. caller (main/test/tool) fills the respective config with values it wants 3. the service is started with the config passed by value The feature service code behaves likewise, but provides a helper method to create its config out of db::config. This PR moves this helper out of gms code, so that it doesn't mess with system-wide db::config and only needs its own small struct feature_config. For the reference: similar changes with other services: #23705 , #20174 , #19166 Closes scylladb/scylladb#25118 * github.com:scylladb/scylladb: gms,init: Move get_disabled_features_from_db_config() from gms code: Update callers generating feature service config gms: Make feature_config a simple struct gms: Split feature_config_from_db_config() into two	2025-07-29 08:17:49 +03:00
Anna Stuchlik	18b4d4a77c	doc: add tablets support information to the Drivers table This commit: - Extends the Drivers support table with information on which driver supports tablets and since which version. - Adds the driver support policy to the Drivers page. - Reorganizes the Drivers page to accommodate the updates. In addition: - The CPP-over-Rust driver is added to the table. - The information about Serverless (which we don't support) is removed and replaced with tablets to correctly describe the contents of the table. Fixes https://github.com/scylladb/scylladb/issues/19471 Refs https://github.com/scylladb/scylladb-docs-homepage/issues/69 Closes scylladb/scylladb#24635	2025-07-29 08:11:42 +03:00
Avi Kivity	f7324a44a2	compaction: demote normal compaction start/end log messages to debug level Compaction is routine and the log messages pollute the log files, hiding important information. All the data is available via `nodetool compactionhistory`. Reduce noise by demoting those log messages to debug level. One test is adjusted to use debug level for compaction, since it listens for those messages. Closes scylladb/scylladb#24949	2025-07-29 08:02:22 +03:00
Nadav Har'El	e43828c10b	test/cqlpy: fix fetch_scylla.py for .0 releases The test/cqlpy/fetch_scylla.py script is used by test/cqlpy/run and test/alternator/run to implement their "--release" option - which allows you to run current tests against any official release of Scylla downloaded from Scylla's S3 bucket. When you ask to get release "2025.1", the idea is to fetch the latest release available in the 2025.1 stream - currently it is 2025.1.5. fetch_scylla.py does this by listing the available 2025.1 releases, sorting them and fetching the last one. We had a bug in the sort order - version 0 was sorted before version 0-rc1, which is incorrect (the version 2025.2.0 came after 2025.2.0~rc1). For most releases this didn't cause any problem - 0~rc1 was sorted after 0, but 5 (for example) came after both, so 2025.1.5 got downloaded. But when a release has only an rc and a .0 release, we incorrectly used the rc instead of the .0. This patch fixes the sort order by using the "/" character, which sorts before "0", in rc version strings when sorting the release numbers. Before this patch, we had this problem in "--release 2025.2" because currently 2025.2 only has RC releases (rc0 and rc1) and a .0 release, and we wrongly downloaded the rc1. After this patch, the .0 is chosen as expected: $ test/cqlpy/run --release 2025.2 Chosen download for ScyllaDB 2025.2: 2025.2.0 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-07-28 22:02:15 +03:00
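The ordering problem described above can be reproduced in a few lines of Python. The sketch below is not the code in fetch_scylla.py: the commit describes fixing the sort by substituting the "/" character in rc version strings, whereas this example swaps in a different technique, a tuple sort key with an explicit release-candidate marker. The names `release_sort_key` and `latest_release` and the assumed `X.Y.Z[~rcN]` version format are hypothetical.

```python
import re


def release_sort_key(version):
    """Sort key that places every rc of X.Y.Z before the final X.Y.Z.

    Assumes version strings of the form 'X.Y.Z' or 'X.Y.Z~rcN'.
    """
    m = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:~rc(\d+))?", version)
    if m is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch, rc = m.groups()
    # (0, N) for release candidate N, (1, 0) for the final release.
    stage = (0, int(rc)) if rc is not None else (1, 0)
    return (int(major), int(minor), int(patch), stage)


def latest_release(versions):
    """Pick the most recent release from a list of version strings."""
    return max(versions, key=release_sort_key)


available = ["2025.2.0~rc0", "2025.2.0~rc1", "2025.2.0"]
naive_choice = sorted(available)[-1]   # plain string sort puts the rc last
fixed_choice = latest_release(available)
```

A plain string sort picks the rc because "2025.2.0" is a prefix of "2025.2.0~rc1" and therefore sorts first; the explicit marker avoids relying on character ordering at all.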
Nadav Har'El	72358ee9f4	test/cqlpy: fix "run --release" when Scylla hasn't been built The "--release" option of test/cqlpy/run can be used to run current cqlpy tests against any official release of Scylla, which is automatically downloaded from Scylla's S3 bucket. You should be able to run tests like that even without having compiled Scylla from source. But we had a bug, where test/cqlpy/run looked for the built Scylla executable before parsing the "--release" option, and this bug is fixed in this patch. The Alternator version of the run script, test/alternator/run, doesn't need to be fixed because it already did things in the right order. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-07-28 21:42:02 +03:00
Dawid Mędrek	b41151ff1a	test: Enable RF-rack-valid keyspaces in all Python suites We're enabling the configuration option `rf_rack_valid_keyspaces` in all Python test suites. All relevant tests have been adjusted to work with it enabled. That encompasses the following suites: * alternator, * broadcast_tables, * cluster (already enabled in scylladb/scylladb@ee96f8dcfc), * cql, * cqlpy (already enabled in scylladb/scylladb@be0877ce69), * nodetool, * rest_api. Two remaining suites that use tests written in Python, redis and scylla_gdb, are not affected, at least not directly. The redis suite requires creating an instance of Scylla manually, and the tests don't do anything that could violate the restriction. The scylla_gdb suite focuses on testing the capabilities of scylla-gdb.py, but even then it reuses the `run` file from the cqlpy suite. Fixes scylladb/scylladb#25126 Closes scylladb/scylladb#24617	2025-07-28 16:32:59 +02:00
Gleb Natapov	198cfc6fe7	migration manager: do not use group0 on non zero shard Commit `ddc3b6dcf5` added a check of group0 state in get_schema_for_write(), but the group0 client can only be used on shard 0, and get_schema_for_write() can be called on any shard, so we cannot use _group0_client there directly. Move the assert to a place where we already use another group0 function and where it is guaranteed to run on shard 0. Closes scylladb/scylladb#25204	2025-07-28 14:10:01 +02:00
Nadav Har'El	b4fc3578fc	Merge 'LWT: enable for tablet-based tables' from Petr Gusev This PR enables LWT (Lightweight Transactions) support for tablet-based tables by leveraging colocated tables. Currently, storing Paxos state in system tables causes two major issues: * Loss of Paxos state during tablet migration or base table rebuilds * When a tablet is migrated or the base table is rebuilt, system tables don't retain Paxos state. * This breaks LWT correctness in certain scenarios. * Failing test cases demonstrating this: * test_lwt_state_is_preserved_on_tablet_migration * test_lwt_state_is_preserved_on_rebuild * Shard misalignment and performance overhead * Tablets may be placed on arbitrary shards by the tablet balancer. * Accessing Paxos state in system tables could require a shard jump, degrading performance. We move Paxos state into a dedicated Paxos table, colocated with the base table: * Each base table gets its own Paxos state table. * This table is lazily created on the first LWT operation. * Its tablets are colocated with those of the base table, ensuring: * Co-migration during tablet movement * Co-rebuilding with the base table * Shard alignment for local access to Paxos state Some reasoning for why this is sufficient to preserve LWT correctness is discussed in [2]. This PR addresses two issues from the "Why doesn't it work for tablets" section in [1]: * Tablet migration vs LWT correctness * Paxos table sharding Other issues ("bounce to shard" and "locking for intranode_migration") have already been resolved in previous PRs. 
References [1] - [LWT over tablets design](https://docs.google.com/document/d/1CPm0N9XFUcZ8zILpTkfP5O4EtlwGsXg_TU4-1m7dTuM/edit?tab=t.0#heading=h.goufx7gx24yu) [2] - [LWT: Paxos state and tablet balancer](https://docs.google.com/document/d/1-xubDo612GGgguc0khCj5ukmMGgLGCLWLIeG6GtHTY4/edit?tab=t.0) [3] - [Colocated tables PR](https://github.com/scylladb/scylladb/pull/22906#issuecomment-3027123886) [4] - [Possible LWT consistency violations after a topology change](https://github.com/scylladb/scylladb/issues/5251) Backport: not needed because this is a new feature. Closes scylladb/scylladb#24819 * github.com:scylladb/scylladb: create_keyspace: fix warning for tablets docs: fix lwt.rst docs: fix tablets.rst alternator: enable LWT random_failures: enable execute_lwt_transaction test_tablets_lwt: add test_paxos_state_table_permissions test_tablets_lwt: add test_lwt_for_tablets_is_not_supported_without_raft test_tablets_lwt: test timeout creating paxos state table test_tablets_lwt: add test_lwt_concurrent_base_table_recreation test_tablets_lwt: add test_lwt_state_is_preserved_on_rebuild test_tablets_lwt: migrate test_lwt_support_with_tablets test_tablets_lwt: add test_lwt_state_is_preserved_on_tablet_migration test_tablets_lwt: add simple test for LWT check_internal_table_permissions: handle Paxos state tables client_state: extract check_internal_table_permissions paxos_store: handle base table removal database: get_base_table_for_tablet_colocation: handle paxos state table paxos_state: use node_local_only mode to access paxos state query_options: add node_local_only mode storage_proxy: handle node_local_only in query storage_proxy: handle node_local_only in mutate storage_proxy: introduce node_local_only flag abstract_replication_strategy: remove unused using storage_proxy: add coordinator_mutate_options storage_proxy: rename create_write_response_handler -> make_write_response_handler storage_proxy: simplify mutate_prepare paxos_state: lazily create paxos state table 
migration_manager: add timeout to start_group0_operation and announce paxos_store: use non-internal queries qp: make make_internal_options public paxos_store: conditional cf_id filter paxos_store: coroutinize feature_service: add LWT_WITH_TABLETS feature paxos_state: inline system_keyspace functions into paxos_store paxos_state: extract state access functions into paxos_store	2025-07-28 13:19:23 +03:00
Taras Veretilnyk	6b6622e07a	docs: fix typo in command name enbleautocompaction -> enableautocompaction Renamed the file and updated all references from 'enbleautocompaction' to the correct 'enableautocompaction'. Fixes scylladb/scylladb#25172 Closes scylladb/scylladb#25175	2025-07-28 12:49:26 +03:00
Tomasz Grabiec	55116ee660	topology_coordinator: Trigger load stats refresh after replace Otherwise, tablet rebuild will be delayed for up to 60s, as the tablet scheduler needs load stats for the new node (replacing) to make decisions. Fixes #25163 Closes scylladb/scylladb#25181	2025-07-28 11:07:17 +02:00
Robert Bindar	d921a565de	Add open-coredump script dependencies to install-dependencies.sh Whilst the coredump script checks for prerequisites, the user experience is not ideal because you either have to go in the script and get the list of deps and install them or wait for the script to complain about lacking dependencies one by one. This commit completes the list of dependencies in the install script (some of them were already there for Fedora), so you already have them installed by the time you get to run the coredump script. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> [avi: - remove trailing whitespace - regenerate frozen toolchain Optimized clang binaries generated and stored in https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz ] Closes #22369 Closes scylladb/scylladb#25203	2025-07-28 06:45:01 +03:00
Avi Kivity	1930f3e67f	Merge 'sstables/mx/reader: accommodate inexact partition indexes' from Michał Chojnowski Unlike the currently-used sstable index files, BTI indexes don't store the entire partition keys. They only store prefixes of decorated keys, up to the minimum length needed to differentiate a key from its neighbours in the sstable. This saves space. However, it means that a BTI index query might be off by one partition (on each end of the queried partition range) with respect to the optimal Data position. For example, if the index stores prefixes `a`, `b`, `c`, the index has no way to know if the first index entry after key `bb` is `b` (which might correspond to `ba` as well as `bc`), or `c`. So the index reader conservatively has to pick the wider Data range, and the Data reader must ignore the superfluous partitions. (And there's no way around that.) Before this patch, the sstable reader expects the index query to return an exact (optimal) Data range. This patch adjusts the logic of the sstable reader to allow for inexact ranges. Note: the patch is more complicated than it looks. The logic of the sstable reader was already fairly hard to follow and this adds even more flags, more weird special states and more edge cases. I think I managed to write a decent test and it did find three or four edge cases I wouldn't have noticed otherwise. I think it should cover all the added logic, but I didn't verify code coverage. (Do our scripts for that even work nowadays?) Simplification ideas are welcome. Preparation for new functionality, no backporting needed. 
Closes scylladb/scylladb#25093 * github.com:scylladb/scylladb: sstables/index_reader: weaken some exactness guarantees in abstract_index_reader test/boost: add a test for inexact index lookups sstables/mx/reader: allow passing a custom index reader to the constructor sstables/index_reader: remove advance_to sstables/mx/reader: handle inexact lookups in `advance_context()` sstables/mx/reader: handle inexact lookups in `advance_to_next_partition()` sstables/index_reader: make the return value of `get_partition_key` optional sstables/mx/reader: handle "backward jumps" in forward_to sstables/mx/reader: filter out partitions outside the queried range sstables/mx/reader: update _pr after `fast_forward_to`	2025-07-27 19:39:36 +03:00
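The `a`, `b`, `c` example from the merge description above can be worked through in a short Python sketch. This is a toy model, not the BTI format or the sstable reader: keys are plain strings rather than decorated keys, the index is a sorted list of prefixes with one entry per partition, and the names `index_lower_bound` and `read_from` are hypothetical. It shows why the index has to answer conservatively and why the data reader then has to filter.

```python
from bisect import bisect_left


def index_lower_bound(prefixes, key):
    """Return the position of the first partition that *might* be >= key.

    'prefixes' is a sorted list holding one distinguishing prefix per partition.
    """
    i = bisect_left(prefixes, key)
    # If the preceding prefix is a prefix of the query key, the full key it
    # stands for could sort on either side of the query, so include it.
    if i > 0 and key.startswith(prefixes[i - 1]):
        i -= 1
    return i


def read_from(partitions, prefixes, key):
    """Read all partitions >= key, starting at the (possibly inexact) index hit."""
    start = index_lower_bound(prefixes, key)
    # The data reader drops any superfluous leading partition.
    return [k for k in partitions[start:] if k >= key]


prefixes = ["a", "b", "c"]
# Two different sstables that produce the same prefix index.
result_ba = read_from(["a1", "ba", "c3"], prefixes, "bb")
result_bc = read_from(["a1", "bc", "c3"], prefixes, "bb")
```

Both sstables yield the same index answer (start at `b`), yet only the second one actually has a matching partition at that position; the filter in `read_from` is what makes the two cases come out right.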
Avi Kivity	8180cbcf48	Merge 'tablets: prevent accidental copy of tablets_map' from Benny Halevy As they are wasteful in many cases, it is better to move the tablet_map if possible, or clone it gently in an async fiber. Add clone() and clone_gently() methods to allow explicit copies. * minor optimization, no backport needed Closes scylladb/scylladb#24978 * github.com:scylladb/scylladb: tablets: prevent accidental copy of tablets_map locator: tablets: get rid of synchronous mutate_tablet_map	2025-07-27 16:48:27 +03:00
Lakshmi Narayanan Sreethar	0c5fa8e154	locator/token_metadata.cc: use chunked_vector to store _sorted_tokens The `token_metadata_impl` stores the sorted tokens in an `std::vector`. With a large number of nodes, the size of this vector can grow quickly, and updating it might lead to oversized allocations. This commit changes `_sorted_tokens` to a `chunked_vector` to avoid such issues. It also updates all related code to use `chunked_vector` instead of `std::vector`. Fixes #24876 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#25027	2025-07-27 11:29:22 +03:00
Tomasz Grabiec	a1d7722c6d	Merge 'api: repair_async: refuse repairing tablet keyspaces' from Aleksandra Martyniuk A tablet repair started with /storage_service/repair_async/ API bypasses tablet repair scheduler and repairs only the tablets that are owned by the requested node. Due to that, to safely repair the whole keyspace, we need to first disable tablet migrations and then start repair on all nodes. With the new API - /storage_service/tablets/repair - tailored to tablet repair requirements, we do not need additional preparation before repair. We may request it on one node in a cluster only and, thanks to tablet repair scheduler, a whole keyspace will be safely repaired. Both nodetool and Scylla Manager have already started using the new API to repair tablets. Refuse repairing tablet keyspaces with /storage_service/repair_async - 403 Forbidden is returned. repair_async should still be used to repair vnode keyspaces. Fixes: https://github.com/scylladb/scylladb/issues/23008. Breaking change; no backport. Closes scylladb/scylladb#24678 * github.com:scylladb/scylladb: repair: remove unused code api: repair_async: forbid repairing tablet keyspaces	2025-07-27 09:25:42 +02:00
Piotr Dulikowski	44de563d38	Merge 'db/hints: Improve logging' from Dawid Mędrek We improve logging in critical functions in hinted handoff to capture more information about the behavior of the module. That should help us in debugging sessions. The logs should only be printed during more important events and so they should not clog the log files. Backport: not necessary. Closes scylladb/scylladb#25031 * github.com:scylladb/scylladb: db/hints/manager.cc: Add logs for changing host filter db/hints: Increase log level in critical functions	2025-07-27 09:25:42 +02:00
Michael Litvak	3ff388cd94	storage service: drain view builder before group0 The view builder uses group0 operations to coordinate view building, so we should drain the view builder before stopping group0. Fixes scylladb/scylladb#25096 Closes scylladb/scylladb#25101	2025-07-27 09:25:42 +02:00
Pavel Emelyanov	403a72918d	sstables/types.hh: Remove duplicate version.hh inclusion The latter header is included two times; one is enough. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25109	2025-07-27 09:25:42 +02:00
Pavel Emelyanov	1b9eb4cb9f	init.hh: Remove unused forward declarations The init.hh contains some bits that only main.cc needs. Some of its forward declarations are needed by neither the header itself nor the main.cc that includes it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25110	2025-07-27 09:25:42 +02:00
Petr Gusev	8b8b7adbe5	raft_group0: split shutdown into abort_and_drain and destroy Previously, raft_group0::abort() was called in storage_service::do_drain (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because raft::server depends on storage (via raft_sys_table_storage and group0_state_machine). However, this caused issues: services like sstable_dict_autotrainer and auth::service, which use group0_client but are not stopped by storage_service, could trigger use-after-free if raft_group0 was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used. This commit reworks the shutdown logic: * Introduces abort_and_drain(), which aborts the server and waits for background tasks to finish, but keeps the server object alive. Clients will see raft::stopped_error if they try to access group0 after abort_and_drain(). * Final destruction happens in a separate method destroy(), called later from main.cc. The raft_server_for_group::aborted is changed to a shared_future -- abort_server now returns a future so that we can wait for it in abort_and_drain(), it should return the future from the previous abort_server call, which can happen in the on_background_error callback. Node startup can fail before reaching storage_service, in which case ss.drain_on_shutdown() and abort_and_drain() are never called. To ensure proper cleanup, abort_and_drain() is called from main.cc before destroy(). Clients of raft_group_registry are expected to call destroy_server() for the servers they own. Currently, the only such client is raft_group0, which satisfies this requirement. As a result, raft_group_registry::stop_servers() is no longer needed. Instead, raft_group_registry::stop() now verifies that all servers have been properly destroyed. If any remain, it calls on_internal_error(). The call to drain_on_shutdown() in cql_test_env.cc appears redundant. 
The only source of raft::server instances in raft_group_registry is group0_service, and if group0_service.start() succeeds, both abort_and_drain() and destroy() are guaranteed to be called during shutdown.	2025-07-25 17:16:14 +02:00
Michał Chojnowski	b1da5f2d0f	sstables/index_reader: weaken some exactness guarantees in abstract_index_reader After making the sstable reader more permissive, we can weaken the abstract_index_reader interface.	2025-07-25 11:00:18 +02:00
Michał Chojnowski	be1f54c6d2	test/boost: add a test for inexact index lookups	2025-07-25 11:00:18 +02:00
Michał Chojnowski	810eb93ff0	sstables/mx/reader: allow passing a custom index reader to the constructor For tests. Will be used for testing how the data reader reacts to various combinations of inexact index lookup results.	2025-07-25 11:00:18 +02:00
Michał Chojnowski	fe8ee34024	sstables/index_reader: remove advance_to `advance_to` is unused now, so remove it.	2025-07-25 11:00:18 +02:00
Michał Chojnowski	03bf6347e2	sstables/mx/reader: handle inexact lookups in `advance_context()` `advance_context()` needs an ability to advance the index to the partition immediately following the reader's current partition. For this, it uses `abstract_index_reader::advance_to(dht::ring_position_view)` But BTI (and any index format which stores only the prefixes of keys instead of whole keys) can't implement `advance_to` with its current semantics. The Data position returned by the index for a generic `advance_to` might be off by one partition. E.g. if the index stores prefixes `a`, `b`, `c`, the index has no way to know if the first entry after `bb` is `b` (which might correspond to `ba` as well as `bc`), or `c`. However, BTI can be used exactly if the partition is known to be present in the sstable. (In the above example, if `bb` is known to be present in the sstable, then it must correspond to `b`. So the index can reliably advance to `bb` or the first partition after it). And this is enough for `advance_context()`, because the current partition is known to be present. So we can replace the usage of `advance_to` with an equivalent API call which only works with present keys, but in exchange is implementable by BTI. This makes `advance_to` unused, so we remove it.	2025-07-25 11:00:18 +02:00
Michał Chojnowski	11792850dd	sstables/mx/reader: handle inexact lookups in `advance_to_next_partition()` `advance_to_next_partition()` needs an ability to advance the index to the partition immediately following the reader's current partition. For this, it uses `abstract_index_reader::advance_to(dht::ring_position_view)` But BTI (and any index format which stores only the prefixes of keys instead of whole keys) can't implement `advance_to` with its current semantics. The Data position returned by the index for a generic `advance_to` might be off by one partition. E.g. if the index stores prefixes `a`, `b`, `c`, the index has no way to know if the first entry after `bb` is `b` (which might correspond to `ba` as well as `bc`), or `c`. However, BTI can be used exactly if the partition is known to be present in the sstable. (In the above example, if `bb` is known to be present in the sstable, then it must correspond to `b`. So the index can reliably advance to `bb` or the first partition after it). And this is enough for `advance_to_next_partition()`, because the current partition is known to be present. So we can replace the usage of `advance_to` with an equivalent API call which only works with present keys, but in exchange is implementable by BTI.	2025-07-25 11:00:18 +02:00
Michał Chojnowski	141895f9eb	sstables/index_reader: make the return value of `get_partition_key` optional BTI indexes only store encoded prefixes of partition keys, not the whole keys. They can't reliably implement `get_partition_key`. The index reader interface must be weakened and callers must be adapted.	2025-07-25 11:00:18 +02:00
Michał Chojnowski	a0c29055e5	sstables/mx/reader: handle "backward jumps" in forward_to A bunch of code assumes that the Data.db stream can only go forward. But with BTI indexes, if we perform an advance_to, the index can point to a position which the data reader has already passed, since the index is inexact. The logic of the data reader ensures that it has stopped within the last partition range, or just immediately after it, after reading the next partition key and noticing that it doesn't belong to the range. But forward_to can only be used with increasing ranges. The start of the next range must be greater than or equal to the end of the previous range. This means that the exact start of the next partition range must be no earlier than: 1. Before the partition key just read by the data reader, if the data reader is positioned immediately after a partition key. 2. The start of the first partition after the current data reader position, if the data reader isn't positioned immediately after a partition key. So, if the index returns a position smaller than the current data reader position, then: 1. If the reader is immediately after a partition key, we have to reuse this partition key (since we can't go back in the stream to read it again), and keep reading from the current position. 2. Otherwise we can safely walk the index to the first partition that lies no earlier than the current position.	2025-07-25 10:49:58 +02:00
Michał Chojnowski	218b2dffff	sstables/mx/reader: filter out partitions outside the queried range The current index format is exact: it always returns the position of the first partition in the queried partition range. But we are about to add an index format where that doesn't have to be the case. In BTI indexes, the lookup can be off by one partition sometimes. This patch prepares the reader for that, by skipping the partitions which were read by the data reader but don't belong to the queried range. Note: as of this patch, only the "normal path" is ever used. We add tests exercising these code paths later. Also note that, as of this patch, actually stepping outside the queried range would cause the reader to end up in a state where the underlying parser is positioned right after the partition key immediately following the queried range. If the reader was forwarded to that key in this state, it would trip an assert, because the parser can't handle backward jumps. We will add logic to handle this case in the next patch.	2025-07-25 10:49:57 +02:00

48709 Commits