test: test_remove_garbage_group0_members: wait for token ring and group0 consistency before removenode

The removenode initiator could have an outdated token ring (still considering the node removed by the previous removenode a token owner) and unexpectedly reject the operation.

Fix that by waiting for token ring and group0 consistency before removenode. Note that the test already checks that consistency, but only for one node, which is different from the removenode initiator.

This test has been removed in master together with the code being tested (the gossip-based topology). Hence, the fix is submitted directly to 2026.1.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1103

Backport to all supported branches (other than 2026.1), as the test can fail there.

Closes scylladb/scylladb#29108

(cherry picked from commit 1398a55d16)

Closes scylladb/scylladb#29205
database: Rate limit all tokens from a range
2026-03-24 16:09:02 +01:00 · 2026-03-24 16:04:01 +02:00 · 2026-03-23 23:50:15 +02:00 · 2026-03-20 11:00:38 +02:00 · 2026-03-20 11:00:11 +02:00 · 2026-03-20 10:59:26 +02:00
457 changed files with 16162 additions and 5165 deletions
--- a/.github/scripts/auto-backport.py
+++ b/.github/scripts/auto-backport.py
@@ -62,7 +62,7 @@ def create_pull_request(repo, new_branch_name, base_branch_name, pr, backport_pr
        if is_draft:
            labels_to_add.append("conflicts")
            pr_comment = f"@{pr.user.login} - This PR was marked as draft because it has conflicts\n"
-            pr_comment += "Please resolve them and mark this PR as ready for review"
+            pr_comment += "Please resolve them and remove the 'conflicts' label. The PR will be made ready for review automatically."
            backport_pr.create_issue_comment(pr_comment)
        
        # Apply all labels at once if we have any
@@ -142,20 +142,31 @@ def backport(repo, pr, version, commits, backport_base_branch, is_collaborator):


 def with_github_keyword_prefix(repo, pr):
-    pattern = rf"(?:fix(?:|es|ed))\s*:?\s*(?:(?:(?:{repo.full_name})?#)|https://github\.com/{repo.full_name}/issues/)(\d+)"
-    match = re.findall(pattern, pr.body, re.IGNORECASE)
-    if not match:
-        for commit in pr.get_commits():
-            match = re.findall(pattern, commit.commit.message, re.IGNORECASE)
-            if match:
-                print(f'{pr.number} has a valid close reference in commit message {commit.sha}')
-                break
-    if not match:
-        print(f'No valid close reference for {pr.number}')
-        return False
-    else:
+    # GitHub issue pattern: #123, scylladb/scylladb#123, or full GitHub URLs
+    github_pattern = rf"(?:fix(?:|es|ed))\s*:?\s*(?:(?:(?:{repo.full_name})?#)|https://github\.com/{repo.full_name}/issues/)(\d+)"
+    
+    # JIRA issue pattern: PKG-92 or https://scylladb.atlassian.net/browse/PKG-92
+    jira_pattern = r"(?:fix(?:|es|ed))\s*:?\s*(?:(?:https://scylladb\.atlassian\.net/browse/)?([A-Z]+-\d+))"
+    
+    # Check PR body for GitHub issues
+    github_match = re.findall(github_pattern, pr.body, re.IGNORECASE)
+    # Check PR body for JIRA issues
+    jira_match = re.findall(jira_pattern, pr.body, re.IGNORECASE)
+    
+    match = github_match or jira_match
+
+    if match:
        return True

+    for commit in pr.get_commits():
+        github_match = re.findall(github_pattern, commit.commit.message, re.IGNORECASE)
+        jira_match = re.findall(jira_pattern, commit.commit.message, re.IGNORECASE)
+        if github_match or jira_match:
+            print(f'{pr.number} has a valid close reference in commit message {commit.sha}')
+            return True
+
+    print(f'No valid close reference for {pr.number}')
+    return False

 def main():
    args = parse_args()
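
Note on the auto-backport.py change above: the two patterns can be tried in isolation. The following is a minimal sketch, not part of the patch. It copies the two regular expressions from the diff; `REPO_FULL_NAME` and `find_close_references` are hypothetical names standing in for `repo.full_name` and the matching done in `with_github_keyword_prefix`. Because both searches use `re.IGNORECASE`, the `[A-Z]+` part of the JIRA pattern also accepts lowercase project keys.

```python
import re

# Assumption: stands in for repo.full_name in the real script.
REPO_FULL_NAME = "scylladb/scylladb"

# Patterns copied from the diff above.
GITHUB_PATTERN = rf"(?:fix(?:|es|ed))\s*:?\s*(?:(?:(?:{REPO_FULL_NAME})?#)|https://github\.com/{REPO_FULL_NAME}/issues/)(\d+)"
JIRA_PATTERN = r"(?:fix(?:|es|ed))\s*:?\s*(?:(?:https://scylladb\.atlassian\.net/browse/)?([A-Z]+-\d+))"


def find_close_references(text):
    """Return (github_ids, jira_keys) found in text (hypothetical helper)."""
    github = re.findall(GITHUB_PATTERN, text, re.IGNORECASE)
    jira = re.findall(JIRA_PATTERN, text, re.IGNORECASE)
    return github, jira


if __name__ == "__main__":
    samples = [
        "Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1103",
        "Fixes: scylladb/scylladb#29108",
        "fixed pkg-92",
        "Closes scylladb/scylladb#29205",
    ]
    for sample in samples:
        print(sample, "->", find_close_references(sample))
```

A "Closes" line alone does not count as a close reference under either pattern; only the "fix", "fixes" and "fixed" keywords do.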
--- a/.github/workflows/backport-pr-fixes-validation.yaml
+++ b/.github/workflows/backport-pr-fixes-validation.yaml
@@ -18,7 +18,7 @@ jobs:
            
            // Regular expression pattern to check for "Fixes" prefix
            // Adjusted to dynamically insert the repository full name
-            const pattern = `Fixes:? (?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)`;
+            const pattern = `Fixes:? ((?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)|(?:https://scylladb\\.atlassian\\.net/browse/)?([A-Z]+-\\d+))`;
            const regex = new RegExp(pattern);
            
            if (!regex.test(body)) {
--- a/.github/workflows/call_backport_with_jira.yaml
+++ b/.github/workflows/call_backport_with_jira.yaml
@@ -0,0 +1,53 @@
+name: Backport with Jira Integration
+
+on:
+  push:
+    branches:
+      - master
+      - next-*.*
+      - branch-*.*
+  pull_request_target:
+    types: [labeled, closed]
+    branches: 
+      - master
+      - next
+      - next-*.*
+      - branch-*.*
+
+jobs:
+  backport-on-push:
+    if: github.event_name == 'push'
+    uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main
+    with:
+      event_type: 'push'
+      base_branch: ${{ github.ref }}
+      commits: ${{ github.event.before }}..${{ github.sha }}
+    secrets:
+      gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
+      jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
+
+  backport-on-label:
+    if: github.event_name == 'pull_request_target' && github.event.action == 'labeled'
+    uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main
+    with:
+      event_type: 'labeled'
+      base_branch: refs/heads/${{ github.event.pull_request.base.ref }}
+      pull_request_number: ${{ github.event.pull_request.number }}
+      head_commit: ${{ github.event.pull_request.base.sha }}
+      label_name: ${{ github.event.label.name }}
+      pr_state: ${{ github.event.pull_request.state }}
+    secrets:
+      gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
+      jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
+
+  backport-chain:
+    if: github.event_name == 'pull_request_target' && github.event.action == 'closed' && github.event.pull_request.merged == true
+    uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main
+    with:
+      event_type: 'chain'
+      base_branch: refs/heads/${{ github.event.pull_request.base.ref }}
+      pull_request_number: ${{ github.event.pull_request.number }}
+      pr_body: ${{ github.event.pull_request.body }}
+    secrets:
+      gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
+      jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
--- a/.github/workflows/trigger-scylla-ci.yaml
+++ b/.github/workflows/trigger-scylla-ci.yaml
@@ -3,19 +3,63 @@ name: Trigger Scylla CI Route
 on:
  issue_comment:
    types: [created]
+  pull_request_target:
+    types:
+      - unlabeled

 jobs:
  trigger-jenkins:
-    if: github.event.comment.user.login != 'scylladbbot' && contains(github.event.comment.body, '@scylladbbot') && contains(github.event.comment.body, 'trigger-ci')
+    if: (github.event_name == 'issue_comment' && github.event.comment.user.login != 'scylladbbot') || github.event.label.name == 'conflicts'
    runs-on: ubuntu-latest
    steps:
+      - name: Verify Org Membership
+        id: verify_author
+        env:
+          EVENT_NAME: ${{ github.event_name }}
+          PR_AUTHOR: ${{ github.event.pull_request.user.login }}
+          PR_ASSOCIATION: ${{ github.event.pull_request.author_association }}
+          COMMENT_AUTHOR: ${{ github.event.comment.user.login }}
+          COMMENT_ASSOCIATION: ${{ github.event.comment.author_association }}
+        shell: bash
+        run: |
+          if [[ "$EVENT_NAME" == "pull_request_target" ]]; then
+            AUTHOR="$PR_AUTHOR"
+            ASSOCIATION="$PR_ASSOCIATION"
+          else
+            AUTHOR="$COMMENT_AUTHOR"
+            ASSOCIATION="$COMMENT_ASSOCIATION"
+          fi
+          ORG="scylladb"
+          if gh api "/orgs/${ORG}/members/${AUTHOR}" --silent 2>/dev/null; then
+            echo "member=true" >> $GITHUB_OUTPUT
+          else
+            echo "::warning::${AUTHOR} is not a member of ${ORG}; skipping CI trigger."
+            echo "member=false" >> $GITHUB_OUTPUT
+          fi
+
+      - name: Validate Comment Trigger
+        if: github.event_name == 'issue_comment'
+        id: verify_comment
+        env:
+          COMMENT_BODY: ${{ github.event.comment.body }}
+        shell: bash
+        run: |
+          CLEAN_BODY=$(echo "$COMMENT_BODY" | grep -v '^[[:space:]]*>')
+
+          if echo "$CLEAN_BODY" | grep -qi '@scylladbbot' && echo "$CLEAN_BODY" | grep -qi 'trigger-ci'; then
+            echo "trigger=true" >> $GITHUB_OUTPUT
+          else
+            echo "trigger=false" >> $GITHUB_OUTPUT
+          fi
+
      - name: Trigger Scylla-CI-Route Jenkins Job
+        if: steps.verify_author.outputs.member == 'true' && (github.event_name == 'pull_request_target' || steps.verify_comment.outputs.trigger == 'true')
        env:
          JENKINS_USER: ${{ secrets.JENKINS_USERNAME }}
          JENKINS_API_TOKEN: ${{ secrets.JENKINS_TOKEN }}
          JENKINS_URL: "https://jenkins.scylladb.com"
+          PR_NUMBER: "${{ github.event.issue.number || github.event.pull_request.number }}"
+          PR_REPO_NAME: "${{ github.event.repository.full_name }}"
        run: |
-          PR_NUMBER=${{ github.event.issue.number }}
-          PR_REPO_NAME=${{ github.event.repository.full_name }}
          curl -X POST "$JENKINS_URL/job/releng/job/Scylla-CI-Route/buildWithParameters?PR_NUMBER=$PR_NUMBER&PR_REPO_NAME=$PR_REPO_NAME" \
-          --user "$JENKINS_USER:$JENKINS_API_TOKEN" --fail -i -v
+            --user "$JENKINS_USER:$JENKINS_API_TOKEN" --fail
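
Note on the "Validate Comment Trigger" step above: it drops quoted lines (lines whose first non-blank character is `>`) and then requires both `@scylladbbot` and `trigger-ci` to appear, case-insensitively, somewhere in what remains. The following is a rough Python model of that shell logic, offered only to illustrate the behaviour; `comment_requests_ci` is a hypothetical name and the workflow itself uses `grep`, not Python.

```python
import re


def comment_requests_ci(body: str) -> bool:
    """Model of the workflow's comment check (hypothetical helper)."""
    # Equivalent of: grep -v '^[[:space:]]*>'  (drop quoted reply lines)
    kept = [line for line in body.splitlines() if not re.match(r"\s*>", line)]
    clean = "\n".join(kept).lower()
    # Equivalent of the two 'grep -qi' checks; the two strings may sit
    # on different lines of the comment.
    return "@scylladbbot" in clean and "trigger-ci" in clean


if __name__ == "__main__":
    print(comment_requests_ci("@scylladbbot trigger-ci"))
    print(comment_requests_ci("> @scylladbbot trigger-ci\nthanks"))
```

Under this model, a reply that only quotes an earlier trigger comment does not start a new CI run, which appears to be the purpose of the filtering step.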
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,6 +1,6 @@
 [submodule "seastar"]
 	path = seastar
-	url = ../seastar
+	url = ../scylla-seastar
 	ignore = dirty
 [submodule "swagger-ui"]
 	path = swagger-ui
--- a/2
+++ b/2
@@ -78,7 +78,7 @@ fi

 # Default scylla product/version tags
 PRODUCT=scylla
-VERSION=2025.4.0-dev
+VERSION=2025.4.6

 if test -f version
 then
--- a/alternator/controller.cc
+++ b/alternator/controller.cc
@@ -136,6 +136,7 @@ future<> controller::start_server() {
                [this, addr, alternator_port, alternator_https_port, creds = std::move(creds)] (server& server) mutable {
            return server.init(addr, alternator_port, alternator_https_port, creds,
                    _config.alternator_enforce_authorization,
+                    _config.alternator_warn_authorization,
                    &_memory_limiter.local().get_semaphore(),
                    _config.max_concurrent_requests_per_shard);
        }).handle_exception([this, addr, alternator_port, alternator_https_port] (std::exception_ptr ep) {
--- a/alternator/executor.cc
+++ b/alternator/executor.cc
@@ -16,6 +16,7 @@
 #include "cdc/cdc_options.hh"
 #include "auth/service.hh"
 #include "db/config.hh"
+#include "db/view/view_build_status.hh"
 #include "mutation/tombstone.hh"
 #include "utils/log.hh"
 #include "schema/schema_builder.hh"
@@ -107,6 +108,20 @@ extern const sstring TTL_TAG_KEY("system:ttl_attribute");
 // following ones are base table's keys added as needed or range key list will be empty.
 static const sstring SPURIOUS_RANGE_KEY_ADDED_TO_GSI_AND_USER_DIDNT_SPECIFY_RANGE_KEY_TAG_KEY("system:spurious_range_key_added_to_gsi_and_user_didnt_specify_range_key");

+// The following tags also have the "system:" prefix but are NOT used
+// by Alternator to store table properties - only the user ever writes to
+// them, as a way to configure the table. As such, these tags are writable
+// (and readable) by the user, and not hidden by tag_key_is_internal().
+// The reason why both hidden (internal) and user-configurable tags share the
+// same "system:" prefix is historic.
+
+// Setting the tag with a numeric value will enable a specific initial number
+// of tablets (setting the value to 0 means enabling tablets with
+// an automatic selection of the best number of tablets).
+// Setting this tag to any non-numeric value (e.g., an empty string or the
+// word "none") will ask to disable tablets.
+static constexpr auto INITIAL_TABLETS_TAG_KEY = "system:initial_tablets";
+

 enum class table_status {
    active = 0,
@@ -129,7 +144,8 @@ static std::string_view table_status_to_sstring(table_status tbl_status) {
    return "UNKNOWN";
 }

-static lw_shared_ptr<keyspace_metadata> create_keyspace_metadata(std::string_view keyspace_name, service::storage_proxy& sp, gms::gossiper& gossiper, api::timestamp_type, const std::map<sstring, sstring>& tags_map, const gms::feature_service& feat);
+static lw_shared_ptr<keyspace_metadata> create_keyspace_metadata(std::string_view keyspace_name, service::storage_proxy& sp, gms::gossiper& gossiper, api::timestamp_type,
+        const std::map<sstring, sstring>& tags_map, const gms::feature_service& feat, const db::tablets_mode_t::mode tablets_mode);

 static map_type attrs_type() {
    static thread_local auto t = map_type_impl::get_instance(utf8_type, bytes_type, true);
@@ -243,7 +259,8 @@ executor::executor(gms::gossiper& gossiper,
      _mm(mm),
      _sdks(sdks),
      _cdc_metadata(cdc_metadata),
-      _enforce_authorization(_proxy.data_dictionary().get_config().alternator_enforce_authorization()),
+      _enforce_authorization(_proxy.data_dictionary().get_config().alternator_enforce_authorization),
+      _warn_authorization(_proxy.data_dictionary().get_config().alternator_warn_authorization),
      _ssg(ssg),
      _parsed_expression_cache(std::make_unique<parsed::expression_cache>(
        parsed::expression_cache::config{_proxy.data_dictionary().get_config().alternator_max_expression_cache_entries_per_shard},
@@ -879,15 +896,37 @@ future<executor::request_return_type> executor::describe_table(client_state& cli
    co_return rjson::print(std::move(response));
 }

+// This function increments the authorization_failures counter, and may also
+// log a warn-level message and/or throw an access_denied exception, depending
+// on what enforce_authorization and warn_authorization are set to.
+// Note that if enforce_authorization is false, this function will return
+// without throwing. So a caller that doesn't want to continue after an
+// authorization_error must explicitly return after calling this function.
+static void authorization_error(alternator::stats& stats, bool enforce_authorization, bool warn_authorization, std::string msg) {
+    stats.authorization_failures++;
+    if (enforce_authorization) {
+        if (warn_authorization) {
+            elogger.warn("alternator_warn_authorization=true: {}", msg);
+        }
+        throw api_error::access_denied(std::move(msg));
+    } else {
+        if (warn_authorization) {
+            elogger.warn("If you set alternator_enforce_authorization=true the following will be enforced: {}", msg);
+        }
+    }
+}
+
 // Check CQL's Role-Based Access Control (RBAC) permission_to_check (MODIFY,
 // SELECT, DROP, etc.) on the given table. When permission is denied an
 // appropriate user-readable api_error::access_denied is thrown.
 future<> verify_permission(
    bool enforce_authorization,
+    bool warn_authorization,
    const service::client_state& client_state,
    const schema_ptr& schema,
-    auth::permission permission_to_check) {
-    if (!enforce_authorization) {
+    auth::permission permission_to_check,
+    alternator::stats& stats) {
+    if (!enforce_authorization && !warn_authorization) {
        co_return;
    }
    // Unfortunately, the fix for issue #23218 did not modify the function
@@ -902,31 +941,33 @@ future<> verify_permission(
                if (client_state.user() && client_state.user()->name) {
                    username = client_state.user()->name.value();
                }
-                throw api_error::access_denied(fmt::format(
+                authorization_error(stats, enforce_authorization, warn_authorization, fmt::format(
                    "Write access denied on internal table {}.{} to role {} because it is not a superuser",
                    schema->ks_name(), schema->cf_name(), username));
+                co_return;
        }
    }
    auto resource = auth::make_data_resource(schema->ks_name(), schema->cf_name());
-    if (!co_await client_state.check_has_permission(auth::command_desc(permission_to_check, resource))) {
+    if (!client_state.user() || !client_state.user()->name ||
+        !co_await client_state.check_has_permission(auth::command_desc(permission_to_check, resource))) {
        sstring username = "<anonymous>";
        if (client_state.user() && client_state.user()->name) {
            username = client_state.user()->name.value();
        }
        // Using exceptions for errors makes this function faster in the
        // success path (when the operation is allowed).
-        throw api_error::access_denied(format(
-            "{} access on table {}.{} is denied to role {}",
+        authorization_error(stats, enforce_authorization, warn_authorization, fmt::format(
+            "{} access on table {}.{} is denied to role {}, client address {}",
            auth::permissions::to_string(permission_to_check),
-            schema->ks_name(), schema->cf_name(), username));
+            schema->ks_name(), schema->cf_name(), username, client_state.get_client_address()));
    }
 }

 // Similar to verify_permission() above, but just for CREATE operations.
 // Those do not operate on any specific table, so require permissions on
 // ALL KEYSPACES instead of any specific table.
-future<> verify_create_permission(bool enforce_authorization, const service::client_state& client_state) {
-    if (!enforce_authorization) {
+static future<> verify_create_permission(bool enforce_authorization, bool warn_authorization, const service::client_state& client_state, alternator::stats& stats) {
+    if (!enforce_authorization && !warn_authorization) {
        co_return;
    }
    auto resource = auth::resource(auth::resource_kind::data);
@@ -935,7 +976,7 @@ future<> verify_create_permission(bool enforce_authorization, const service::cli
        if (client_state.user() && client_state.user()->name) {
            username = client_state.user()->name.value();
        }
-        throw api_error::access_denied(format(
+        authorization_error(stats, enforce_authorization, warn_authorization, fmt::format(
            "CREATE access on ALL KEYSPACES is denied to role {}", username));
    }
 }
@@ -952,7 +993,7 @@ future<executor::request_return_type> executor::delete_table(client_state& clien

    schema_ptr schema = get_table(_proxy, request);
    rjson::value table_description = co_await fill_table_description(schema, table_status::deleting, _proxy, client_state, trace_state, permit);
-    co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::DROP);
+    co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, schema, auth::permission::DROP, _stats);
    co_await _mm.container().invoke_on(0, [&, cs = client_state.move_to_other_shard()] (service::migration_manager& mm) -> future<> {
        size_t retries = mm.get_concurrent_ddl_retries();
        for (;;) {
@@ -966,8 +1007,8 @@ future<executor::request_return_type> executor::delete_table(client_state& clien
                throw api_error::resource_not_found(fmt::format("Requested resource not found: Table: {} not found", table_name));
            }

-            auto m = co_await service::prepare_column_family_drop_announcement(_proxy, keyspace_name, table_name, group0_guard.write_timestamp(), service::drop_views::yes);
-            auto m2 = co_await service::prepare_keyspace_drop_announcement(_proxy, keyspace_name, group0_guard.write_timestamp());
+            auto m = co_await service::prepare_column_family_drop_announcement(p.local(), keyspace_name, table_name, group0_guard.write_timestamp(), service::drop_views::yes);
+            auto m2 = co_await service::prepare_keyspace_drop_announcement(p.local(), keyspace_name, group0_guard.write_timestamp());

            std::move(m2.begin(), m2.end(), std::back_inserter(m));

@@ -1204,12 +1245,13 @@ void rmw_operation::set_default_write_isolation(std::string_view value) {
 // Alternator uses tags whose keys start with the "system:" prefix for
 // internal purposes. Those should not be readable by ListTagsOfResource,
 // nor writable with TagResource or UntagResource (see #24098).
-// Only a few specific system tags, currently only system:write_isolation,
-// are deliberately intended to be set and read by the user, so are not
-// considered "internal".
+// Only a few specific system tags, currently only "system:write_isolation"
+// and "system:initial_tablets", are deliberately intended to be set and read
+// by the user, so are not considered "internal".
 static bool tag_key_is_internal(std::string_view tag_key) {
-    return tag_key.starts_with("system:") &&
-        tag_key != rmw_operation::WRITE_ISOLATION_TAG_KEY;
+    return tag_key.starts_with("system:")
+        && tag_key != rmw_operation::WRITE_ISOLATION_TAG_KEY
+        && tag_key != INITIAL_TABLETS_TAG_KEY;
 }

 enum class update_tags_action { add_tags, delete_tags };
@@ -1290,7 +1332,7 @@ future<executor::request_return_type> executor::tag_resource(client_state& clien
    if (tags->Size() < 1) {
        co_return api_error::validation("The number of tags must be at least 1") ;
    }
-    co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::ALTER);
+    co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, schema, auth::permission::ALTER, _stats);
    co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [tags](std::map<sstring, sstring>& tags_map) {
        update_tags_map(*tags, tags_map, update_tags_action::add_tags);
    });
@@ -1311,7 +1353,7 @@ future<executor::request_return_type> executor::untag_resource(client_state& cli

    schema_ptr schema = get_table_from_arn(_proxy, rjson::to_string_view(*arn));
    get_stats_from_schema(_proxy, *schema)->api_operations.untag_resource++;
-    co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::ALTER);
+    co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, schema, auth::permission::ALTER, _stats);
    co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [tags](std::map<sstring, sstring>& tags_map) {
        update_tags_map(*tags, tags_map, update_tags_action::delete_tags);
    });
@@ -1496,8 +1538,26 @@ bytes extract_from_attrs_column_computation::compute_value(const schema&, const
    on_internal_error(elogger, "extract_from_attrs_column_computation::compute_value called without row");
 }

+// Because `CreateTable` request creates GSI/LSI together with the base table (so the base table is empty),
+// we can skip view building process and immediately mark the view as built on all nodes.
+//
+// However, we can do this only for tablet-based views because `view_building_worker` will automatically propagate
+// this information to `system.built_views` table (see `view_building_worker::update_built_views()`).
+// For vnode-based views, `view_builder` will process the view and mark it as built.
+static future<> mark_view_schemas_as_built(utils::chunked_vector<mutation>& out, std::vector<schema_ptr> schemas, api::timestamp_type ts, service::storage_proxy& sp) {
+    auto token_metadata = sp.get_token_metadata_ptr();
+    for (auto& schema: schemas) {
+        if (schema->is_view()) {
+            for (auto& host_id: token_metadata->get_topology().get_all_host_ids()) {
+                auto view_status_mut = co_await sp.system_keyspace().make_view_build_status_mutation(ts, {schema->ks_name(), schema->cf_name()}, host_id, db::view::build_status::SUCCESS);
+                out.push_back(std::move(view_status_mut));
+            }
+        }
+    }
+}

-static future<executor::request_return_type> create_table_on_shard0(service::client_state&& client_state, tracing::trace_state_ptr trace_state, rjson::value request, service::storage_proxy& sp, service::migration_manager& mm, gms::gossiper& gossiper, bool enforce_authorization) {
+static future<executor::request_return_type> create_table_on_shard0(service::client_state&& client_state, tracing::trace_state_ptr trace_state, rjson::value request,
+            service::storage_proxy& sp, service::migration_manager& mm, gms::gossiper& gossiper, bool enforce_authorization, bool warn_authorization, stats& stats, const db::tablets_mode_t::mode tablets_mode) {
    SCYLLA_ASSERT(this_shard_id() == 0);

    // We begin by parsing and validating the content of the CreateTable
@@ -1703,7 +1763,7 @@ static future<executor::request_return_type> create_table_on_shard0(service::cli
    set_table_creation_time(tags_map, db_clock::now());
    builder.add_extension(db::tags_extension::NAME, ::make_shared<db::tags_extension>(tags_map));

-    co_await verify_create_permission(enforce_authorization, client_state);
+    co_await verify_create_permission(enforce_authorization, warn_authorization, client_state, stats);

    schema_ptr schema = builder.build();
    for (auto& view_builder : view_builders) {
@@ -1724,7 +1784,7 @@ static future<executor::request_return_type> create_table_on_shard0(service::cli
        auto group0_guard = co_await mm.start_group0_operation();
        auto ts = group0_guard.write_timestamp();
        utils::chunked_vector<mutation> schema_mutations;
-        auto ksm = create_keyspace_metadata(keyspace_name, sp, gossiper, ts, tags_map, sp.features());
+        auto ksm = create_keyspace_metadata(keyspace_name, sp, gossiper, ts, tags_map, sp.features(), tablets_mode);
        // Alternator Streams doesn't yet work when the table uses tablets (#23838)
        if (stream_specification && stream_specification->IsObject()) {
            auto stream_enabled = rjson::find(*stream_specification, "StreamEnabled");
@@ -1733,10 +1793,15 @@ static future<executor::request_return_type> create_table_on_shard0(service::cli
                auto rs = locator::abstract_replication_strategy::create_replication_strategy(ksm->strategy_name(), params);
                if (rs->uses_tablets()) {
                    co_return api_error::validation("Streams not yet supported on a table using tablets (issue #23838). "
-                    "If you want to use streams, create a table with vnodes by setting the tag 'experimental:initial_tablets' set to 'none'.");
+                    "If you want to use streams, create a table with vnodes by setting the tag 'system:initial_tablets' set to 'none'.");
                }
            }
        }
+        // Creating an index in tablets mode requires the rf_rack_valid_keyspaces option to be enabled.
+        // GSI and LSI indexes are based on materialized views which require this option to avoid consistency issues.
+        if (!view_builders.empty() && ksm->uses_tablets() && !sp.data_dictionary().get_config().rf_rack_valid_keyspaces()) {
+            co_return api_error::validation("GlobalSecondaryIndexes and LocalSecondaryIndexes with tablets require the rf_rack_valid_keyspaces option to be enabled.");
+        }
        try {
            schema_mutations = service::prepare_new_keyspace_announcement(sp.local_db(), ksm, ts);
        } catch (exceptions::already_exists_exception&) {
@@ -1754,6 +1819,9 @@ static future<executor::request_return_type> create_table_on_shard0(service::cli
            schemas.push_back(view_builder.build());
        }
        co_await service::prepare_new_column_families_announcement(schema_mutations, sp, *ksm, schemas, ts);
+        if (ksm->uses_tablets()) {
+            co_await mark_view_schemas_as_built(schema_mutations, schemas, ts, sp);
+        }

        // If a role is allowed to create a table, we must give it permissions to
        // use (and eventually delete) the specific table it just created (and
@@ -1800,9 +1868,10 @@ future<executor::request_return_type> executor::create_table(client_state& clien
    _stats.api_operations.create_table++;
    elogger.trace("Creating table {}", request);

-    co_return co_await _mm.container().invoke_on(0, [&, tr = tracing::global_trace_state_ptr(trace_state), request = std::move(request), &sp = _proxy.container(), &g = _gossiper.container(), client_state_other_shard = client_state.move_to_other_shard(), enforce_authorization = bool(_enforce_authorization)]
+    co_return co_await _mm.container().invoke_on(0, [&, tr = tracing::global_trace_state_ptr(trace_state), request = std::move(request), &sp = _proxy.container(), &g = _gossiper.container(), client_state_other_shard = client_state.move_to_other_shard(), enforce_authorization = bool(_enforce_authorization), warn_authorization = bool(_warn_authorization)]
                                        (service::migration_manager& mm) mutable -> future<executor::request_return_type> {
-        co_return co_await create_table_on_shard0(client_state_other_shard.get(), tr, std::move(request), sp.local(), mm, g.local(), enforce_authorization);
+        const db::tablets_mode_t::mode tablets_mode = _proxy.data_dictionary().get_config().tablets_mode_for_new_keyspaces(); // type cast
+        co_return co_await create_table_on_shard0(client_state_other_shard.get(), tr, std::move(request), sp.local(), mm, g.local(), enforce_authorization, warn_authorization, _stats, std::move(tablets_mode));
    });
 }

@@ -1855,7 +1924,7 @@ future<executor::request_return_type> executor::update_table(client_state& clien
        verify_billing_mode(request);
    }

-    co_return co_await _mm.container().invoke_on(0, [&p = _proxy.container(), request = std::move(request), gt = tracing::global_trace_state_ptr(std::move(trace_state)), enforce_authorization = bool(_enforce_authorization), client_state_other_shard = client_state.move_to_other_shard(), empty_request]
+    co_return co_await _mm.container().invoke_on(0, [&p = _proxy.container(), request = std::move(request), gt = tracing::global_trace_state_ptr(std::move(trace_state)), enforce_authorization = bool(_enforce_authorization), warn_authorization = bool(_warn_authorization), client_state_other_shard = client_state.move_to_other_shard(), empty_request, &e = this->container()]
                                                (service::migration_manager& mm) mutable -> future<executor::request_return_type> {
        schema_ptr schema;
        size_t retries = mm.get_concurrent_ddl_retries();
@@ -1886,7 +1955,7 @@ future<executor::request_return_type> executor::update_table(client_state& clien
                    if (stream_enabled->GetBool()) {
                        if (p.local().local_db().find_keyspace(tab->ks_name()).get_replication_strategy().uses_tablets()) {
                        co_return api_error::validation("Streams not yet supported on a table using tablets (issue #23838). "
-                            "If you want to enable streams, re-create this table with vnodes (with the tag 'experimental:initial_tablets' set to 'none').");
+                            "If you want to enable streams, re-create this table with vnodes (with the tag 'system:initial_tablets' set to 'none').");
                        }
                        if (tab->cdc_options().enabled()) {
                            co_return api_error::validation("Table already has an enabled stream: TableName: " + tab->cf_name());
@@ -1953,6 +2022,10 @@ future<executor::request_return_type> executor::update_table(client_state& clien
                            co_return api_error::validation(fmt::format(
                                "LSI {} already exists in table {}, can't use same name for GSI", index_name, table_name));
                        }
+                        if (p.local().local_db().find_keyspace(keyspace_name).get_replication_strategy().uses_tablets() &&
+                                !p.local().data_dictionary().get_config().rf_rack_valid_keyspaces()) {
+                            co_return api_error::validation("GlobalSecondaryIndexes with tablets require the rf_rack_valid_keyspaces option to be enabled.");
+                        }

                        elogger.trace("Adding GSI {}", index_name);
                        // FIXME: read and handle "Projection" parameter. This will
@@ -2026,7 +2099,7 @@ future<executor::request_return_type> executor::update_table(client_state& clien
                co_return api_error::validation("UpdateTable requires one of GlobalSecondaryIndexUpdates, StreamSpecification or BillingMode to be specified");
            }

-            co_await verify_permission(enforce_authorization, client_state_other_shard.get(), schema, auth::permission::ALTER);
+            co_await verify_permission(enforce_authorization, warn_authorization, client_state_other_shard.get(), schema, auth::permission::ALTER, e.local()._stats);
            auto m = co_await service::prepare_column_family_update_announcement(p.local(), schema, std::vector<view_ptr>(), group0_guard.write_timestamp());
            for (view_ptr view : new_views) {
                auto m2 = co_await service::prepare_new_view_announcement(p.local(), view, group0_guard.write_timestamp());
@@ -2789,7 +2862,7 @@ future<executor::request_return_type> executor::put_item(client_state& client_st
    tracing::add_table_name(trace_state, op->schema()->ks_name(), op->schema()->cf_name());
    const bool needs_read_before_write = op->needs_read_before_write();

-    co_await verify_permission(_enforce_authorization, client_state, op->schema(), auth::permission::MODIFY);
+    co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, op->schema(), auth::permission::MODIFY, _stats);

    auto cas_shard = op->shard_for_execute(needs_read_before_write);

@@ -2892,7 +2965,7 @@ future<executor::request_return_type> executor::delete_item(client_state& client
    tracing::add_table_name(trace_state, op->schema()->ks_name(), op->schema()->cf_name());
    const bool needs_read_before_write = _proxy.data_dictionary().get_config().alternator_force_read_before_write() || op->needs_read_before_write();

-    co_await verify_permission(_enforce_authorization, client_state, op->schema(), auth::permission::MODIFY);
+    co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, op->schema(), auth::permission::MODIFY, _stats);

    auto cas_shard = op->shard_for_execute(needs_read_before_write);

@@ -2954,12 +3027,15 @@ struct primary_key_equal {
 // done is known prior to starting the operation). Nevertheless, we want to
 // do this mutation via LWT to ensure that it is serialized with other LWT
 // mutations to the same partition.
+//
+// The std::vector<put_or_delete_item> must remain alive until the
+// storage_proxy::cas() future is resolved.
 class put_or_delete_item_cas_request : public service::cas_request {
    schema_ptr schema;
-    std::vector<put_or_delete_item> _mutation_builders;
+    const std::vector<put_or_delete_item>& _mutation_builders;
 public:
-    put_or_delete_item_cas_request(schema_ptr s, std::vector<put_or_delete_item>&& b) :
-        schema(std::move(s)), _mutation_builders(std::move(b)) { }
+    put_or_delete_item_cas_request(schema_ptr s, const std::vector<put_or_delete_item>& b) :
+        schema(std::move(s)), _mutation_builders(b) { }
    virtual ~put_or_delete_item_cas_request() = default;
    virtual std::optional<mutation> apply(foreign_ptr<lw_shared_ptr<query::result>> qr, const query::partition_slice& slice, api::timestamp_type ts) override {
        std::optional<mutation> ret;
@@ -2975,11 +3051,38 @@ public:
    }
 };

-static future<> cas_write(service::storage_proxy& proxy, schema_ptr schema, service::cas_shard cas_shard, dht::decorated_key dk, std::vector<put_or_delete_item>&& mutation_builders,
-        service::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit) {
+future<> executor::cas_write(schema_ptr schema, service::cas_shard cas_shard, const dht::decorated_key& dk,
+        const std::vector<put_or_delete_item>& mutation_builders, service::client_state& client_state,
+        tracing::trace_state_ptr trace_state, service_permit permit)
+{
+    if (!cas_shard.this_shard()) {
+        _stats.shard_bounce_for_lwt++;
+        return container().invoke_on(cas_shard.shard(), _ssg,
+                    [cs = client_state.move_to_other_shard(),
+                    &mb = mutation_builders,
+                    &dk,
+                    ks = schema->ks_name(),
+                    cf = schema->cf_name(),
+                    gt = tracing::global_trace_state_ptr(trace_state),
+                    permit = std::move(permit)]
+                    (executor& self) mutable {
+            return do_with(cs.get(), [&mb, &dk, ks = std::move(ks), cf = std::move(cf),
+                                    trace_state = tracing::trace_state_ptr(gt), &self]
+                                    (service::client_state& client_state) mutable {
+                auto schema = self._proxy.data_dictionary().find_schema(ks, cf);
+                service::cas_shard cas_shard(*schema, dk.token());
+
+                //FIXME: Instead of passing empty_service_permit() to the background operation,
+                // the current permit's lifetime should be prolonged, so that it's destructed
+                // only after all background operations are finished as well.
+                return self.cas_write(schema, std::move(cas_shard), dk, mb, client_state, std::move(trace_state), empty_service_permit());
+            });
+        });
+    }
+
    auto timeout = executor::default_timeout();
-    auto op = seastar::make_shared<put_or_delete_item_cas_request>(schema, std::move(mutation_builders));
-    return proxy.cas(schema, std::move(cas_shard), op, nullptr, to_partition_ranges(dk),
+    auto op = seastar::make_shared<put_or_delete_item_cas_request>(schema, mutation_builders);
+    return _proxy.cas(schema, std::move(cas_shard), op, nullptr, to_partition_ranges(dk),
            {timeout, std::move(permit), client_state, trace_state},
            db::consistency_level::LOCAL_SERIAL, db::consistency_level::LOCAL_QUORUM,
            timeout, timeout).discard_result();
@@ -3005,13 +3108,11 @@ struct schema_decorated_key_equal {

 // FIXME: if we failed writing some of the mutations, need to return a list
 // of these failed mutations rather than fail the whole write (issue #5650).
-static future<> do_batch_write(service::storage_proxy& proxy,
-        smp_service_group ssg,
+future<> executor::do_batch_write(
        std::vector<std::pair<schema_ptr, put_or_delete_item>> mutation_builders,
        service::client_state& client_state,
        tracing::trace_state_ptr trace_state,
-        service_permit permit,
-        stats& stats) {
+        service_permit permit) {
    if (mutation_builders.empty()) {
        return make_ready_future<>();
    }
@@ -3031,7 +3132,7 @@ static future<> do_batch_write(service::storage_proxy& proxy,
        for (auto& b : mutation_builders) {
            mutations.push_back(b.second.build(b.first, now));
        }
-        return proxy.mutate(std::move(mutations),
+        return _proxy.mutate(std::move(mutations),
                db::consistency_level::LOCAL_QUORUM,
                executor::default_timeout(),
                trace_state,
@@ -3042,48 +3143,41 @@ static future<> do_batch_write(service::storage_proxy& proxy,
        // Multiple mutations may be destined for the same partition, adding
        // or deleting different items of one partition. Join them together
        // because we can do them in one cas() call.
-        std::unordered_map<schema_decorated_key, std::vector<put_or_delete_item>, schema_decorated_key_hash, schema_decorated_key_equal>
-            key_builders(1, schema_decorated_key_hash{}, schema_decorated_key_equal{});
-        for (auto& b : mutation_builders) {
-            auto dk = dht::decorate_key(*b.first, b.second.pk());
-            auto [it, added] = key_builders.try_emplace(schema_decorated_key{b.first, dk});
+        using map_type = std::unordered_map<schema_decorated_key,
+            std::vector<put_or_delete_item>,
+            schema_decorated_key_hash,
+            schema_decorated_key_equal>;
+            schema_decorated_key_equal>;
+        auto key_builders = std::make_unique<map_type>(1, schema_decorated_key_hash{}, schema_decorated_key_equal{});
+        for (auto&& b : std::move(mutation_builders)) {
+            auto [it, added] = key_builders->try_emplace(schema_decorated_key {
+                .schema = b.first,
+                .dk = dht::decorate_key(*b.first, b.second.pk())
+            });
            it->second.push_back(std::move(b.second));
        }
-        return parallel_for_each(std::move(key_builders), [&proxy, &client_state, &stats, trace_state, ssg, permit = std::move(permit)] (auto& e) {
-            stats.write_using_lwt++;
+        auto* key_builders_ptr = key_builders.get();
+        return parallel_for_each(*key_builders_ptr, [this, &client_state, trace_state, permit = std::move(permit)] (const auto& e) {
+            _stats.write_using_lwt++;
            auto desired_shard = service::cas_shard(*e.first.schema, e.first.dk.token());
-            if (desired_shard.this_shard()) {
-                return cas_write(proxy, e.first.schema, std::move(desired_shard), e.first.dk, std::move(e.second), client_state, trace_state, permit);
-            } else {
-                stats.shard_bounce_for_lwt++;
-                return proxy.container().invoke_on(desired_shard.shard(), ssg,
-                            [cs = client_state.move_to_other_shard(),
-                             mb = e.second,
-                             dk = e.first.dk,
-                             ks = e.first.schema->ks_name(),
-                             cf = e.first.schema->cf_name(),
-                             gt =  tracing::global_trace_state_ptr(trace_state),
-                             permit = std::move(permit)]
-                            (service::storage_proxy& proxy) mutable {
-                    return do_with(cs.get(), [&proxy, mb = std::move(mb), dk = std::move(dk), ks = std::move(ks), cf = std::move(cf),
-                                              trace_state = tracing::trace_state_ptr(gt)]
-                                              (service::client_state& client_state) mutable {
-                        auto schema = proxy.data_dictionary().find_schema(ks, cf);
+            auto s = e.first.schema;

-                        // The desired_shard on the original shard remains alive for the duration
-                        // of cas_write on this shard and prevents any tablet operations.
-                        // However, we need a local instance of cas_shard on this shard
-                        // to pass it to sp::cas, so we just create a new one.
-                        service::cas_shard cas_shard(*schema, dk.token());
-
-                        //FIXME: Instead of passing empty_service_permit() to the background operation,
-                        // the current permit's lifetime should be prolonged, so that it's destructed
-                        // only after all background operations are finished as well.
-                        return cas_write(proxy, schema, std::move(cas_shard), dk, std::move(mb), client_state, std::move(trace_state), empty_service_permit());
-                    });
-                }).finally([desired_shard = std::move(desired_shard)]{});
-            }
-        });
+            static const auto* injection_name = "alternator_executor_batch_write_wait";
+            return utils::get_local_injector().inject(injection_name, [s = std::move(s)] (auto& handler) -> future<> {
+                const auto ks = handler.get("keyspace");
+                const auto cf = handler.get("table");
+                const auto shard = std::atoll(handler.get("shard")->data());
+                if (ks == s->ks_name() && cf == s->cf_name() && shard == this_shard_id()) {
+                    elogger.info("{}: hit", injection_name);
+                    co_await handler.wait_for_message(std::chrono::steady_clock::now() + std::chrono::minutes{5});
+                    elogger.info("{}: continue", injection_name);
+                }
+            }).then([&e, desired_shard = std::move(desired_shard),
+                 &client_state, trace_state = std::move(trace_state), permit = std::move(permit), this]() mutable
+            {
+                return cas_write(e.first.schema, std::move(desired_shard), e.first.dk,
+                    e.second, client_state, std::move(trace_state), std::move(permit));
+            });
+        }).finally([key_builders = std::move(key_builders)]{});
    }
 }

@@ -3163,7 +3257,7 @@ future<executor::request_return_type> executor::batch_write_item(client_state& c
        per_table_wcu.emplace_back(std::make_pair(per_table_stats, schema));
    }
    for (const auto& b : mutation_builders) {
-        co_await verify_permission(_enforce_authorization, client_state, b.first, auth::permission::MODIFY);
+        co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, b.first, auth::permission::MODIFY, _stats);
    }
    // If alternator_force_read_before_write is true we will first get the previous item size
    // and only then do send the mutation.
@@ -3228,7 +3322,7 @@ future<executor::request_return_type> executor::batch_write_item(client_state& c
    _stats.wcu_total[stats::DELETE_ITEM] += wcu_delete_units;
    _stats.api_operations.batch_write_item_batch_total += total_items;
    _stats.api_operations.batch_write_item_histogram.add(total_items);
-    co_await do_batch_write(_proxy, _ssg, std::move(mutation_builders), client_state, trace_state, std::move(permit), _stats);
+    co_await do_batch_write(std::move(mutation_builders), client_state, trace_state, std::move(permit));
    // FIXME: Issue #5650: If we failed writing some of the updates,
    // need to return a list of these failed updates in UnprocessedItems
    // rather than fail the whole write (issue #5650).
@@ -3636,16 +3730,16 @@ future<std::vector<rjson::value>> executor::describe_multi_item(schema_ptr schem
        shared_ptr<cql3::selection::selection> selection,
        foreign_ptr<lw_shared_ptr<query::result>> query_result,
        shared_ptr<const std::optional<attrs_to_get>> attrs_to_get,
-        uint64_t& rcu_half_units) {
+        noncopyable_function<void(uint64_t)> item_callback) {
    cql3::selection::result_set_builder builder(*selection, gc_clock::now());
    query::result_view::consume(*query_result, slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));
    auto result_set = builder.build();
    std::vector<rjson::value> ret;
    for (auto& result_row : result_set->rows()) {
        rjson::value item = rjson::empty_object();
-        rcu_consumed_capacity_counter consumed_capacity;
-        describe_single_item(*selection, result_row, *attrs_to_get, item, &consumed_capacity._total_bytes);
-        rcu_half_units += consumed_capacity.get_half_units();
+        uint64_t item_length_in_bytes = 0;
+        describe_single_item(*selection, result_row, *attrs_to_get, item, &item_length_in_bytes);
+        item_callback(item_length_in_bytes);
        ret.push_back(std::move(item));
        co_await coroutine::maybe_yield();
    }
@@ -4365,7 +4459,7 @@ future<executor::request_return_type> executor::update_item(client_state& client
    tracing::add_table_name(trace_state, op->schema()->ks_name(), op->schema()->cf_name());
    const bool needs_read_before_write = _proxy.data_dictionary().get_config().alternator_force_read_before_write() || op->needs_read_before_write();

-    co_await verify_permission(_enforce_authorization, client_state, op->schema(), auth::permission::MODIFY);
+    co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, op->schema(), auth::permission::MODIFY, _stats);

    auto cas_shard = op->shard_for_execute(needs_read_before_write);

@@ -4475,7 +4569,7 @@ future<executor::request_return_type> executor::get_item(client_state& client_st
    const rjson::value* expression_attribute_names = rjson::find(request, "ExpressionAttributeNames");
    verify_all_are_used(expression_attribute_names, used_attribute_names, "ExpressionAttributeNames", "GetItem");
    rcu_consumed_capacity_counter add_capacity(request, cl == db::consistency_level::LOCAL_QUORUM);
-    co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::SELECT);
+    co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, schema, auth::permission::SELECT, _stats);
    service::storage_proxy::coordinator_query_result qr =
        co_await _proxy.query(
            schema, std::move(command), std::move(partition_ranges), cl,
@@ -4584,7 +4678,6 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
        }
    };
    std::vector<table_requests> requests;
-    std::vector<std::vector<uint64_t>> responses_sizes;
    uint batch_size = 0;
    for (auto it = request_items.MemberBegin(); it != request_items.MemberEnd(); ++it) {
        table_requests rs(get_table_from_batch_request(_proxy, it));
@@ -4604,7 +4697,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
    }

    for (const table_requests& tr : requests) {
-        co_await verify_permission(_enforce_authorization, client_state, tr.schema, auth::permission::SELECT);
+        co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, tr.schema, auth::permission::SELECT, _stats);
    }

    _stats.api_operations.batch_get_item_batch_total += batch_size;
@@ -4612,11 +4705,10 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
    // If we got here, all "requests" are valid, so let's start the
    // requests for the different partitions all in parallel.
    std::vector<future<std::vector<rjson::value>>> response_futures;
-    responses_sizes.resize(requests.size());
-    size_t responses_sizes_pos = 0;
-    for (const auto& rs : requests) {
-        responses_sizes[responses_sizes_pos].resize(rs.requests.size());
-        size_t pos = 0;
+    std::vector<uint64_t> consumed_rcu_half_units_per_table(requests.size());
+    for (size_t i = 0; i < requests.size(); i++) {
+        const table_requests& rs = requests[i];
+        bool is_quorum = rs.cl == db::consistency_level::LOCAL_QUORUM;
        lw_shared_ptr<stats> per_table_stats = get_stats_from_schema(_proxy, *rs.schema);
        per_table_stats->api_operations.batch_get_item_histogram.add(rs.requests.size());
        for (const auto &r : rs.requests) {
@@ -4639,16 +4731,17 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
            auto command = ::make_lw_shared<query::read_command>(rs.schema->id(), rs.schema->version(), partition_slice, _proxy.get_max_result_size(partition_slice),
                    query::tombstone_limit(_proxy.get_tombstone_limit()));
            command->allow_limit = db::allow_per_partition_rate_limit::yes;
+            const auto item_callback = [is_quorum, &rcus_per_table = consumed_rcu_half_units_per_table[i]](uint64_t size) {
+                rcus_per_table += rcu_consumed_capacity_counter::get_half_units(size, is_quorum);
+            };
            future<std::vector<rjson::value>> f = _proxy.query(rs.schema, std::move(command), std::move(partition_ranges), rs.cl,
                    service::storage_proxy::coordinator_query_options(executor::default_timeout(), permit, client_state, trace_state)).then(
-                    [schema = rs.schema, partition_slice = std::move(partition_slice), selection = std::move(selection), attrs_to_get = rs.attrs_to_get, &response_size = responses_sizes[responses_sizes_pos][pos]] (service::storage_proxy::coordinator_query_result qr) mutable {
+                    [schema = rs.schema, partition_slice = std::move(partition_slice), selection = std::move(selection), attrs_to_get = rs.attrs_to_get, item_callback = std::move(item_callback)] (service::storage_proxy::coordinator_query_result qr) mutable {
                utils::get_local_injector().inject("alternator_batch_get_item", [] { throw std::runtime_error("batch_get_item injection"); });
-                return describe_multi_item(std::move(schema), std::move(partition_slice), std::move(selection), std::move(qr.query_result), std::move(attrs_to_get), response_size);
+                return describe_multi_item(std::move(schema), std::move(partition_slice), std::move(selection), std::move(qr.query_result), std::move(attrs_to_get), std::move(item_callback));
            });
-            pos++;
            response_futures.push_back(std::move(f));
        }
-        responses_sizes_pos++;
    }

    // Wait for all requests to complete, and then return the response.
@@ -4660,14 +4753,11 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
    rjson::value response = rjson::empty_object();
    rjson::add(response, "Responses", rjson::empty_object());
    rjson::add(response, "UnprocessedKeys", rjson::empty_object());
-    size_t rcu_half_units;
    auto fut_it = response_futures.begin();
-    responses_sizes_pos = 0;
    rjson::value consumed_capacity = rjson::empty_array();
-    for (const auto& rs : requests) {
+    for (size_t i = 0; i < requests.size(); i++) {
+        const table_requests& rs = requests[i];
        std::string table = table_name(*rs.schema);
-        size_t pos = 0;
-        rcu_half_units = 0;
        for (const auto &r : rs.requests) {
            auto& pk = r.first;
            auto& cks = r.second;
@@ -4682,7 +4772,6 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
                for (rjson::value& json : results) {
                    rjson::push_back(response["Responses"][table], std::move(json));
                }
-                rcu_half_units += rcu_consumed_capacity_counter::get_half_units(responses_sizes[responses_sizes_pos][pos], rs.cl == db::consistency_level::LOCAL_QUORUM);
            } catch(...) {
                eptr = std::current_exception();
                // This read of potentially several rows in one partition,
@@ -4706,8 +4795,8 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
                    rjson::push_back(response["UnprocessedKeys"][table]["Keys"], std::move(*ck.second));
                }
            }
-            pos++;
        }
+        uint64_t rcu_half_units = consumed_rcu_half_units_per_table[i];
        _stats.rcu_half_units_total += rcu_half_units;
        lw_shared_ptr<stats> per_table_stats = get_stats_from_schema(_proxy, *rs.schema);
        per_table_stats->rcu_half_units_total += rcu_half_units;
@@ -4717,7 +4806,6 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
            rjson::add(entry, "CapacityUnits", rcu_half_units*0.5);
            rjson::push_back(consumed_capacity, std::move(entry));
        }
-        responses_sizes_pos++;
    }

    if (should_add_rcu) {
@@ -5029,13 +5117,15 @@ static rjson::value encode_paging_state(const schema& schema, const service::pag
    }
    auto pos = paging_state.get_position_in_partition();
    if (pos.has_key()) {
-        auto exploded_ck = pos.key().explode();
-        auto exploded_ck_it = exploded_ck.begin();
-        for (const column_definition& cdef : schema.clustering_key_columns()) {
-            rjson::add_with_string_name(last_evaluated_key, std::string_view(cdef.name_as_text()), rjson::empty_object());
-            rjson::value& key_entry = last_evaluated_key[cdef.name_as_text()];
-            rjson::add_with_string_name(key_entry, type_to_string(cdef.type), json_key_column_value(*exploded_ck_it, cdef));
-            ++exploded_ck_it;
+        // Alternator itself allows at most one column in the clustering key,
+        // but users can use the Alternator API to access system tables, which
+        // might have multiple clustering key columns, so handle that case here.
+        auto cdef_it = schema.clustering_key_columns().begin();
+        for (const auto& exploded_ck : pos.key().explode()) {
+            rjson::add_with_string_name(last_evaluated_key, std::string_view(cdef_it->name_as_text()), rjson::empty_object());
+            rjson::value& key_entry = last_evaluated_key[cdef_it->name_as_text()];
+            rjson::add_with_string_name(key_entry, type_to_string(cdef_it->type), json_key_column_value(exploded_ck, *cdef_it));
+            ++cdef_it;
        }
    }
    // To avoid possible conflicts (and thus having to reserve these names) we
@@ -5069,10 +5159,11 @@ static future<executor::request_return_type> do_query(service::storage_proxy& pr
        filter filter,
        query::partition_slice::option_set custom_opts,
        service::client_state& client_state,
-        cql3::cql_stats& cql_stats,
+        alternator::stats& stats,
        tracing::trace_state_ptr trace_state,
        service_permit permit,
-        bool enforce_authorization) {
+        bool enforce_authorization,
+        bool warn_authorization) {
    lw_shared_ptr<service::pager::paging_state> old_paging_state = nullptr;

    tracing::trace(trace_state, "Performing a database query");
@@ -5099,7 +5190,7 @@ static future<executor::request_return_type> do_query(service::storage_proxy& pr
        old_paging_state = make_lw_shared<service::pager::paging_state>(pk, pos, query::max_partitions, query_id::create_null_id(), service::pager::paging_state::replicas_per_token_range{}, std::nullopt, 0);
    }

-    co_await verify_permission(enforce_authorization, client_state, table_schema, auth::permission::SELECT);
+    co_await verify_permission(enforce_authorization, warn_authorization, client_state, table_schema, auth::permission::SELECT, stats);

    auto regular_columns =
            table_schema->regular_columns() | std::views::transform(&column_definition::id)
@@ -5134,10 +5225,10 @@ static future<executor::request_return_type> do_query(service::storage_proxy& pr
    if (paging_state) {
        rjson::add(items_descr, "LastEvaluatedKey", encode_paging_state(*table_schema, *paging_state));
    }
-    if (has_filter){
-        cql_stats.filtered_rows_read_total += p->stats().rows_read_total;
+    if (has_filter) {
+        stats.cql_stats.filtered_rows_read_total += p->stats().rows_read_total;
        // update our "filtered_row_matched_total" for all the rows matched, despited the filter
-        cql_stats.filtered_rows_matched_total += size;
+        stats.cql_stats.filtered_rows_matched_total += size;
    }
    if (opt_items) {
        if (opt_items->size() >= max_items_for_rapidjson_array) {
@@ -5261,7 +5352,7 @@ future<executor::request_return_type> executor::scan(client_state& client_state,
    verify_all_are_used(expression_attribute_values, used_attribute_values, "ExpressionAttributeValues", "Scan");

    return do_query(_proxy, schema, exclusive_start_key, std::move(partition_ranges), std::move(ck_bounds), std::move(attrs_to_get), limit, cl,
-            std::move(filter), query::partition_slice::option_set(), client_state, _stats.cql_stats, trace_state, std::move(permit), _enforce_authorization);
+            std::move(filter), query::partition_slice::option_set(), client_state, _stats, trace_state, std::move(permit), _enforce_authorization, _warn_authorization);
 }

 static dht::partition_range calculate_pk_bound(schema_ptr schema, const column_definition& pk_cdef, const rjson::value& comp_definition, const rjson::value& attrs) {
@@ -5742,7 +5833,7 @@ future<executor::request_return_type> executor::query(client_state& client_state
    query::partition_slice::option_set opts;
    opts.set_if<query::partition_slice::option::reversed>(!forward);
    return do_query(_proxy, schema, exclusive_start_key, std::move(partition_ranges), std::move(ck_bounds), std::move(attrs_to_get), limit, cl,
-            std::move(filter), opts, client_state, _stats.cql_stats, std::move(trace_state), std::move(permit), _enforce_authorization);
+            std::move(filter), opts, client_state, _stats, std::move(trace_state), std::move(permit), _enforce_authorization, _warn_authorization);
 }

 future<executor::request_return_type> executor::list_tables(client_state& client_state, service_permit permit, rjson::value request) {
@@ -5870,7 +5961,8 @@ future<executor::request_return_type> executor::describe_continuous_backups(clie
 // of nodes in the cluster: A cluster with 3 or more live nodes, gets RF=3.
 // A smaller cluster (presumably, a test only), gets RF=1. The user may
 // manually create the keyspace to override this predefined behavior.
-static lw_shared_ptr<keyspace_metadata> create_keyspace_metadata(std::string_view keyspace_name, service::storage_proxy& sp, gms::gossiper& gossiper, api::timestamp_type ts, const std::map<sstring, sstring>& tags_map, const gms::feature_service& feat) {
+static lw_shared_ptr<keyspace_metadata> create_keyspace_metadata(std::string_view keyspace_name, service::storage_proxy& sp, gms::gossiper& gossiper, api::timestamp_type ts,
+            const std::map<sstring, sstring>& tags_map, const gms::feature_service& feat, const db::tablets_mode_t::mode tablets_mode) {
    int endpoint_count = gossiper.num_endpoints();
    int rf = 3;
    if (endpoint_count < rf) {
@@ -5880,21 +5972,18 @@ static lw_shared_ptr<keyspace_metadata> create_keyspace_metadata(std::string_vie
    }
    auto opts = get_network_topology_options(sp, gossiper, rf);

-    // Even if the "tablets" experimental feature is available, we currently
-    // do not enable tablets by default on Alternator tables because LWT is
-    // not yet fully supported with tablets.
-    // The user can override the choice of whether or not to use tablets at
-    // table-creation time by supplying the following tag with a numeric value
-    // (setting the value to 0 means enabling tablets with automatic selection
-    // of the best number of tablets).
+    // Whether to use tablets for the table (actually for the keyspace of the
+    // table) is determined by tablets_mode (taken from the configuration
+    // option "tablets_mode_for_new_keyspaces"), as well as the presence and
+    // the value of a per-table tag system:initial_tablets
+    // (INITIAL_TABLETS_TAG_KEY).
+    // Setting the tag with a numeric value will enable a specific initial number
+    // of tablets (setting the value to 0 means enabling tablets with
+    // an automatic selection of the best number of tablets).
    // Setting this tag to any non-numeric value (e.g., an empty string or the
    // word "none") will ask to disable tablets.
-    // If we make this tag a permanent feature, it will get a "system:" prefix -
-    // until then we give it the "experimental:" prefix to not commit to it.
-    static constexpr auto INITIAL_TABLETS_TAG_KEY = "experimental:initial_tablets";
-    // initial_tablets currently defaults to unset, so tablets will not be
-    // used by default on new Alternator tables. Change this initialization
-    // to 0 enable tablets by default, with automatic number of tablets.
+    // If the tag value asks for vnodes but the configuration enforces tablets,
+    // a validation error is returned to the client.
    std::optional<unsigned> initial_tablets;
    if (feat.tablets) {
        auto it = tags_map.find(INITIAL_TABLETS_TAG_KEY);
@@ -5904,8 +5993,21 @@ static lw_shared_ptr<keyspace_metadata> create_keyspace_metadata(std::string_vie
            // initial_tablets to a disengaged optional.
            try {
                initial_tablets = std::stol(tags_map.at(INITIAL_TABLETS_TAG_KEY));
-            } catch(...) {
+            } catch (...) {
+                if (tablets_mode == db::tablets_mode_t::mode::enforced) {
+                    throw api_error::validation(format("Tag {} containing non-numerical value requests vnodes, but vnodes are forbidden by configuration option `tablets_mode_for_new_keyspaces: enforced`", INITIAL_TABLETS_TAG_KEY));
+                }
                initial_tablets = std::nullopt;
+                elogger.trace("Following {} tag containing non-numerical value, Alternator will attempt to create a keyspace {} with vnodes.", INITIAL_TABLETS_TAG_KEY, keyspace_name);
+            }
+        } else {
+            // No per-table tag present, use the value from config
+            if (tablets_mode == db::tablets_mode_t::mode::enabled || tablets_mode == db::tablets_mode_t::mode::enforced) {
+                initial_tablets = 0;
+                elogger.trace("Following the `tablets_mode_for_new_keyspaces` flag from the settings, Alternator will attempt to create a keyspace {} with tablets.", keyspace_name);
+            } else {
+                initial_tablets = std::nullopt;
+                elogger.trace("Following the `tablets_mode_for_new_keyspaces` flag from the settings, Alternator will attempt to create a keyspace {} with vnodes.", keyspace_name);
            }
        }
    }
--- a/alternator/executor.hh
+++ b/alternator/executor.hh
@@ -40,6 +40,7 @@ namespace cql3::selection {

 namespace service {
    class storage_proxy;
+    class cas_shard;
 }

 namespace cdc {
@@ -57,6 +58,7 @@ class schema_builder;
 namespace alternator {

 class rmw_operation;
+class put_or_delete_item;

 schema_ptr get_table(service::storage_proxy& proxy, const rjson::value& request);
 bool is_alternator_keyspace(const sstring& ks_name);
@@ -139,6 +141,7 @@ class executor : public peering_sharded_service<executor> {
    db::system_distributed_keyspace& _sdks;
    cdc::metadata& _cdc_metadata;
    utils::updateable_value<bool> _enforce_authorization;
+    utils::updateable_value<bool> _warn_authorization;
    // An smp_service_group to be used for limiting the concurrency when
    // forwarding Alternator request between shards - if necessary for LWT.
    smp_service_group _ssg;
@@ -218,6 +221,16 @@ private:

    static void describe_key_schema(rjson::value& parent, const schema&, std::unordered_map<std::string,std::string> * = nullptr, const std::map<sstring, sstring> *tags = nullptr);

+    future<> do_batch_write(
+        std::vector<std::pair<schema_ptr, put_or_delete_item>> mutation_builders,
+        service::client_state& client_state,
+        tracing::trace_state_ptr trace_state,
+        service_permit permit);
+
+    future<> cas_write(schema_ptr schema, service::cas_shard cas_shard, const dht::decorated_key& dk,
+        const std::vector<put_or_delete_item>& mutation_builders, service::client_state& client_state,
+        tracing::trace_state_ptr trace_state, service_permit permit);
+
 public:
    static void describe_key_schema(rjson::value& parent, const schema& schema, std::unordered_map<std::string,std::string>&, const std::map<sstring, sstring> *tags = nullptr);

@@ -228,12 +241,15 @@ public:
        const std::optional<attrs_to_get>&,
        uint64_t* = nullptr);

+    // Converts a multi-row selection result to JSON compatible with DynamoDB.
+    // For each row, this method calls item_callback, which takes the size of
+    // the item as the parameter.
    static future<std::vector<rjson::value>> describe_multi_item(schema_ptr schema,
        const query::partition_slice&& slice,
        shared_ptr<cql3::selection::selection> selection,
        foreign_ptr<lw_shared_ptr<query::result>> query_result,
        shared_ptr<const std::optional<attrs_to_get>> attrs_to_get,
-        uint64_t& rcu_half_units);
+        noncopyable_function<void(uint64_t)> item_callback = {});

    static void describe_single_item(const cql3::selection::selection&,
        const std::vector<managed_bytes_opt>&,
@@ -261,7 +277,7 @@ bool is_big(const rjson::value& val, int big_size = 100'000);
 // Check CQL's Role-Based Access Control (RBAC) permission (MODIFY,
 // SELECT, DROP, etc.) on the given table. When permission is denied an
 // appropriate user-readable api_error::access_denied is thrown.
-future<> verify_permission(bool enforce_authorization, const service::client_state&, const schema_ptr&, auth::permission);
+future<> verify_permission(bool enforce_authorization, bool warn_authorization, const service::client_state&, const schema_ptr&, auth::permission, alternator::stats& stats);

 /**
 * Make return type for serializing the object "streamed",
--- a/alternator/serialization.cc
+++ b/alternator/serialization.cc
@@ -282,15 +282,23 @@ std::string type_to_string(data_type type) {
    return it->second;
 }

-bytes get_key_column_value(const rjson::value& item, const column_definition& column) {
+std::optional<bytes> try_get_key_column_value(const rjson::value& item, const column_definition& column) {
    std::string column_name = column.name_as_text();
    const rjson::value* key_typed_value = rjson::find(item, column_name);
    if (!key_typed_value) {
-        throw api_error::validation(fmt::format("Key column {} not found", column_name));
+        return std::nullopt;
    }
    return get_key_from_typed_value(*key_typed_value, column);
 }

+bytes get_key_column_value(const rjson::value& item, const column_definition& column) {
+    auto value = try_get_key_column_value(item, column);
+    if (!value) {
+        throw api_error::validation(fmt::format("Key column {} not found", column.name_as_text()));
+    }
+    return std::move(*value);
+}
+
 // Parses the JSON encoding for a key value, which is a map with a single
 // entry whose key is the type and the value is the encoded value.
 // If this type does not match the desired "type_str", an api_error::validation
@@ -380,20 +388,38 @@ clustering_key ck_from_json(const rjson::value& item, schema_ptr schema) {
        return clustering_key::make_empty();
    }
    std::vector<bytes> raw_ck;
-    // FIXME: this is a loop, but we really allow only one clustering key column.
+    // Note: it's possible to get more than one clustering column here, as
+    // Alternator can be used to read Scylla internal tables.
    for (const column_definition& cdef : schema->clustering_key_columns()) {
-        bytes raw_value = get_key_column_value(item,  cdef);
+        auto raw_value = get_key_column_value(item,  cdef);
        raw_ck.push_back(std::move(raw_value));
    }

    return clustering_key::from_exploded(raw_ck);
 }

-position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema) {
-    auto ck = ck_from_json(item, schema);
-    if (is_alternator_keyspace(schema->ks_name())) {
-        return position_in_partition::for_key(std::move(ck));
+clustering_key_prefix ck_prefix_from_json(const rjson::value& item, schema_ptr schema) {
+    if (schema->clustering_key_size() == 0) {
+        return clustering_key_prefix::make_empty();
    }
+    std::vector<bytes> raw_ck;
+    for (const column_definition& cdef : schema->clustering_key_columns()) {
+        auto raw_value = try_get_key_column_value(item,  cdef);
+        if (!raw_value) {
+            break;
+        }
+        raw_ck.push_back(std::move(*raw_value));
+    }
+
+    return clustering_key_prefix::from_exploded(raw_ck);
+}
+
+position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema) {
+    const bool is_alternator_ks = is_alternator_keyspace(schema->ks_name());
+    if (is_alternator_ks) {
+        return position_in_partition::for_key(ck_from_json(item, schema));
+    }
+
    const auto region_item = rjson::find(item, scylla_paging_region);
    const auto weight_item = rjson::find(item, scylla_paging_weight);
    if (bool(region_item) != bool(weight_item)) {
@@ -413,8 +439,9 @@ position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema)
        } else {
            throw std::runtime_error(fmt::format("Invalid value for weight: {}", weight_view));
        }
-        return position_in_partition(region, weight, region == partition_region::clustered ? std::optional(std::move(ck)) : std::nullopt);
+        return position_in_partition(region, weight, region == partition_region::clustered ? std::optional(ck_prefix_from_json(item, schema)) : std::nullopt);
    }
+    auto ck = ck_from_json(item, schema);
    if (ck.is_empty()) {
        return position_in_partition::for_partition_start();
    }
--- a/alternator/server.cc
+++ b/alternator/server.cc
@@ -31,6 +31,7 @@
 #include "utils/overloaded_functor.hh"
 #include "utils/aws_sigv4.hh"
 #include "client_data.hh"
+#include "utils/updateable_value.hh"

 static logging::logger slogger("alternator-server");

@@ -270,24 +271,57 @@ protected:
    }
 };

+// This function increments the authentication_failures counter, and may also
+// log a warn-level message and/or throw an exception, depending on what
+// enforce_authorization and warn_authorization are set to.
+// The username and client address are only used for logging purposes -
+// they are not included in the error message returned to the client, since
+// the client knows who it is.
+// Note that if enforce_authorization is false, this function will return
+// without throwing. So a caller that doesn't want to continue after an
+// authentication_error must explicitly return after calling this function.
+template<typename Exception>
+static void authentication_error(alternator::stats& stats, bool enforce_authorization, bool warn_authorization, Exception&& e, std::string_view user, gms::inet_address client_address) {
+    stats.authentication_failures++;
+    if (enforce_authorization) {
+        if (warn_authorization) {
+            slogger.warn("alternator_warn_authorization=true: {} for user {}, client address {}", e.what(), user, client_address);
+        }
+        throw std::move(e);
+    } else {
+        if (warn_authorization) {
+            slogger.warn("If you set alternator_enforce_authorization=true the following will be enforced: {} for user {}, client address {}", e.what(), user, client_address);
+        }
+    }
+}
+
 future<std::string> server::verify_signature(const request& req, const chunked_content& content) {
-    if (!_enforce_authorization) {
+    if (!_enforce_authorization.get() && !_warn_authorization.get()) {
        slogger.debug("Skipping authorization");
        return make_ready_future<std::string>();
    }
    auto host_it = req._headers.find("Host");
    if (host_it == req._headers.end()) {
-        throw api_error::invalid_signature("Host header is mandatory for signature verification");
+        authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),
+            api_error::invalid_signature("Host header is mandatory for signature verification"),
+            "", req.get_client_address());
+        return make_ready_future<std::string>();
    }
    auto authorization_it = req._headers.find("Authorization");
    if (authorization_it == req._headers.end()) {
-        throw api_error::missing_authentication_token("Authorization header is mandatory for signature verification");
+        authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),
+            api_error::missing_authentication_token("Authorization header is mandatory for signature verification"),
+            "", req.get_client_address());
+        return make_ready_future<std::string>();
    }
    std::string host = host_it->second;
    std::string_view authorization_header = authorization_it->second;
    auto pos = authorization_header.find_first_of(' ');
    if (pos == std::string_view::npos || authorization_header.substr(0, pos) != "AWS4-HMAC-SHA256") {
-        throw api_error::invalid_signature(fmt::format("Authorization header must use AWS4-HMAC-SHA256 algorithm: {}", authorization_header));
+        authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),
+            api_error::invalid_signature(fmt::format("Authorization header must use AWS4-HMAC-SHA256 algorithm: {}", authorization_header)),
+            "", req.get_client_address());
+        return make_ready_future<std::string>();
    }
    authorization_header.remove_prefix(pos+1);
    std::string credential;
@@ -322,7 +356,9 @@ future<std::string> server::verify_signature(const request& req, const chunked_c

    std::vector<std::string_view> credential_split = split(credential, '/');
    if (credential_split.size() != 5) {
-        throw api_error::validation(fmt::format("Incorrect credential information format: {}", credential));
+        authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),
+            api_error::validation(fmt::format("Incorrect credential information format: {}", credential)), "", req.get_client_address());
+        return make_ready_future<std::string>();
    }
    std::string user(credential_split[0]);
    std::string datestamp(credential_split[1]);
@@ -346,7 +382,7 @@ future<std::string> server::verify_signature(const request& req, const chunked_c
    auto cache_getter = [&proxy = _proxy, &as = _auth_service] (std::string username) {
        return get_key_from_roles(proxy, as, std::move(username));
    };
-    return _key_cache.get_ptr(user, cache_getter).then([this, &req, &content,
+    return _key_cache.get_ptr(user, cache_getter).then_wrapped([this, &req, &content,
                                                    user = std::move(user),
                                                    host = std::move(host),
                                                    datestamp = std::move(datestamp),
@@ -354,18 +390,32 @@ future<std::string> server::verify_signature(const request& req, const chunked_c
                                                    signed_headers_map = std::move(signed_headers_map),
                                                    region = std::move(region),
                                                    service = std::move(service),
-                                                    user_signature = std::move(user_signature)] (key_cache::value_ptr key_ptr) {
+                                                    user_signature = std::move(user_signature)] (future<key_cache::value_ptr> key_ptr_fut) {
+        key_cache::value_ptr key_ptr(nullptr);
+        try {
+            key_ptr = key_ptr_fut.get();
+        } catch (const api_error& e) {
+            authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),
+                e, user, req.get_client_address());
+            return std::string();
+        }
        std::string signature;
        try {
            signature = utils::aws::get_signature(user, *key_ptr, std::string_view(host), "/", req._method,
                datestamp, signed_headers_str, signed_headers_map, &content, region, service, "");
        } catch (const std::exception& e) {
-            throw api_error::invalid_signature(e.what());
+            authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),
+                api_error::invalid_signature(fmt::format("invalid signature: {}", e.what())),
+                user, req.get_client_address());
+            return std::string();
        }

        if (signature != std::string_view(user_signature)) {
            _key_cache.remove(user);
-            throw api_error::unrecognized_client("The security token included in the request is invalid.");
+            authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),
+                api_error::unrecognized_client("wrong signature"),
+                user, req.get_client_address());
+            return std::string();
        }
        return user;
    });
@@ -597,9 +647,11 @@ server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gos
 }

 future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
-        utils::updateable_value<bool> enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests) {
+        utils::updateable_value<bool> enforce_authorization, utils::updateable_value<bool> warn_authorization,
+        semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests) {
    _memory_limiter = memory_limiter;
    _enforce_authorization = std::move(enforce_authorization);
+    _warn_authorization = std::move(warn_authorization);
    _max_concurrent_requests = std::move(max_concurrent_requests);
    if (!port && !https_port) {
        return make_exception_future<>(std::runtime_error("Either regular port or TLS port"
--- a/alternator/server.hh
+++ b/alternator/server.hh
@@ -43,6 +43,7 @@ class server : public peering_sharded_service<server> {

    key_cache _key_cache;
    utils::updateable_value<bool> _enforce_authorization;
+    utils::updateable_value<bool> _warn_authorization;
    utils::small_vector<std::reference_wrapper<seastar::httpd::http_server>, 2> _enabled_servers;
    named_gate _pending_requests;
    // In some places we will need a CQL updateable_timeout_config object even
@@ -94,7 +95,8 @@ public:
    server(executor& executor, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& service, qos::service_level_controller& sl_controller);

    future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
-            utils::updateable_value<bool> enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);
+            utils::updateable_value<bool> enforce_authorization, utils::updateable_value<bool> warn_authorization,
+            semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);
    future<> stop();
    // get_client_data() is called (on each shard separately) when the virtual
    // table "system.clients" is read. It is expected to generate a list of
--- a/alternator/stats.cc
+++ b/alternator/stats.cc
@@ -176,6 +176,16 @@ static void register_metrics_with_optional_table(seastar::metrics::metric_groups
            seastar::metrics::make_total_operations("expression_cache_misses", stats.expression_cache.requests[stats::expression_types::PROJECTION_EXPRESSION].misses,
                    seastar::metrics::description("Counts number of misses of cached expressions"), labels)(expression_label("ProjectionExpression")).aggregate(aggregate_labels).set_skip_when_empty()
    });
+
+    // Only register the following metrics for the global metrics, not per-table
+    if (!has_table) {
+        metrics.add_group("alternator", {
+            seastar::metrics::make_counter("authentication_failures", stats.authentication_failures,
+                seastar::metrics::description("total number of authentication failures"), labels).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
+            seastar::metrics::make_counter("authorization_failures", stats.authorization_failures,
+                seastar::metrics::description("total number of authorization failures"), labels).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
+        });
+    }
 }

 void register_metrics(seastar::metrics::metric_groups& metrics, const stats& stats) {
--- a/alternator/stats.hh
+++ b/alternator/stats.hh
@@ -79,6 +79,17 @@ public:
        utils::estimated_histogram batch_get_item_histogram{22}; // a histogram that covers the range 1 - 100
        utils::estimated_histogram batch_write_item_histogram{22}; // a histogram that covers the range 1 - 100
    } api_operations;
+    // Count of authentication and authorization failures, counted if either
+    // alternator_enforce_authorization or alternator_warn_authorization is
+    // set to true. If both are false, no authentication or authorization
+    // checks are performed, so failures are not recognized or counted.
+    // "authentication" failure means the request was not signed with a valid
+    // user and key combination. "authorization" failure means the request was
+    // authenticated to a valid user - but this user did not have permissions
+    // to perform the operation (considering RBAC settings and the user's
+    // superuser status).
+    uint64_t authentication_failures = 0;
+    uint64_t authorization_failures = 0;
    // Miscellaneous event counters
    uint64_t total_operations = 0;
    uint64_t unsupported_operations = 0;
--- a/alternator/streams.cc
+++ b/alternator/streams.cc
@@ -828,7 +828,7 @@ future<executor::request_return_type> executor::get_records(client_state& client

    tracing::add_table_name(trace_state, schema->ks_name(), schema->cf_name());

-    co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::SELECT);
+    co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, schema, auth::permission::SELECT, _stats);

    db::consistency_level cl = db::consistency_level::LOCAL_QUORUM;
    partition_key pk = iter.shard.id.to_partition_key(*schema);
--- a/alternator/ttl.cc
+++ b/alternator/ttl.cc
@@ -94,7 +94,7 @@ future<executor::request_return_type> executor::update_time_to_live(client_state
    }
    sstring attribute_name(v->GetString(), v->GetStringLength());

-    co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::ALTER);
+    co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, schema, auth::permission::ALTER, _stats);
    co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [&](std::map<sstring, sstring>& tags_map) {
        if (enabled) {
            if (tags_map.contains(TTL_TAG_KEY)) {
@@ -747,7 +747,7 @@ static future<bool> scan_table(
        auto my_host_id = erm->get_topology().my_host_id();
        const auto &tablet_map = erm->get_token_metadata().tablets().get_tablet_map(s->id());
        for (std::optional tablet = tablet_map.first_tablet(); tablet; tablet = tablet_map.next_tablet(*tablet)) {
-            auto tablet_primary_replica = tablet_map.get_primary_replica(*tablet);
+            auto tablet_primary_replica = tablet_map.get_primary_replica(*tablet, erm->get_topology());
            // check if this is the primary replica for the current tablet
            if (tablet_primary_replica.host == my_host_id && tablet_primary_replica.shard == this_shard_id()) {
                co_await scan_tablet(*tablet, proxy, abort_source, page_sem, expiration_stats, scan_ctx, tablet_map);
--- a/api/api-doc/storage_service.json
+++ b/api/api-doc/storage_service.json
@@ -898,6 +898,14 @@
                          "type":"string",
                          "paramType":"query",
                          "enum": ["all", "dc", "rack", "node"]
+                      },
+                      {
+                         "name":"primary_replica_only",
+                         "description":"Load the sstables and stream to the primary replica node within the scope, if one is specified. If not, stream to the global primary replica.",
+                         "required":false,
+                         "allowMultiple":false,
+                         "type":"boolean",
+                         "paramType":"query"
                      }
                  ]
              }
@@ -984,7 +992,7 @@
         ]
      },
      {
-         "path":"/storage_service/cleanup_all",
+         "path":"/storage_service/cleanup_all/",
         "operations":[
            {
               "method":"POST",
@@ -994,6 +1002,30 @@
               "produces":[
                  "application/json"
               ],
+               "parameters":[
+                    {
+                     "name":"global",
+                     "description":"true if cleanup of the entire cluster is requested",
+                     "required":false,
+                     "allowMultiple":false,
+                     "type":"boolean",
+                     "paramType":"query"
+                  }
+               ]
+            }
+         ]
+      },
+      {
+         "path":"/storage_service/mark_node_as_clean",
+         "operations":[
+            {
+               "method":"POST",
+               "summary":"Mark the node as clean. Afterwards, the node is not considered to need cleanup during the automatic cleanup that some topology operations trigger",
+               "type":"void",
+               "nickname":"reset_cleanup_needed",
+               "produces":[
+                  "application/json"
+               ],
               "parameters":[]
            }
         ]
@@ -2924,7 +2956,7 @@
                  },
                  {
                     "name":"incremental_mode",
-                     "description":"Set the incremental repair mode. Can be 'disabled', 'regular', or 'full'. 'regular': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to regular.",
+                     "description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to 'disabled' mode.",
                     "required":false,
                     "allowMultiple":false,
                     "type":"string",
--- a/api/api-doc/tasks.json
+++ b/api/api-doc/tasks.json
@@ -42,6 +42,14 @@
                     "allowMultiple":false,
                     "type":"boolean",
                     "paramType":"query"
+                  },
+                  {
+                     "name":"consider_only_existing_data",
+                     "description":"Set to \"true\" to flush all memtables and force tombstone garbage collection to check only the sstables being compacted (false by default). The memtable, commitlog and other uncompacted sstables will not be checked during tombstone garbage collection.",
+                     "required":false,
+                     "allowMultiple":false,
+                     "type":"boolean",
+                     "paramType":"query"
                  }
               ]
            }
--- a/api/storage_service.cc
+++ b/api/storage_service.cc
@@ -20,6 +20,7 @@
 #include "utils/hash.hh"
 #include <optional>
 #include <sstream>
+#include <stdexcept>
 #include <time.h>
 #include <algorithm>
 #include <functional>
@@ -496,6 +497,7 @@ void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>&
        auto bucket = req->get_query_param("bucket");
        auto prefix = req->get_query_param("prefix");
        auto scope = parse_stream_scope(req->get_query_param("scope"));
+        auto primary_replica_only = validate_bool_x(req->get_query_param("primary_replica_only"), false);

        // TODO: the http_server backing the API does not use content streaming
        // should use it for better performance
@@ -506,7 +508,7 @@ void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>&
        auto sstables = parsed.GetArray() |
            std::views::transform([] (const auto& s) { return sstring(rjson::to_string_view(s)); }) |
            std::ranges::to<std::vector>();
-        auto task_id = co_await sst_loader.local().download_new_sstables(keyspace, table, prefix, std::move(sstables), endpoint, bucket, scope);
+        auto task_id = co_await sst_loader.local().download_new_sstables(keyspace, table, prefix, std::move(sstables), endpoint, bucket, scope, primary_replica_only);
        co_return json::json_return_type(fmt::to_string(task_id));
    });

@@ -723,8 +725,14 @@ rest_cdc_streams_check_and_repair(sharded<service::storage_service>& ss, std::un
 static
 future<json::json_return_type>
 rest_cleanup_all(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
-        apilog.info("cleanup_all");
-        auto done = co_await ss.invoke_on(0, [] (service::storage_service& ss) -> future<bool> {
+        bool global = true;
+        if (auto global_param = req->get_query_param("global"); !global_param.empty()) {
+            global = validate_bool(global_param);
+        }
+
+        apilog.info("cleanup_all global={}", global);
+
+        auto done = !global ? false : co_await ss.invoke_on(0, [] (service::storage_service& ss) -> future<bool> {
            if (!ss.is_topology_coordinator_enabled()) {
                co_return false;
            }
@@ -734,14 +742,35 @@ rest_cleanup_all(http_context& ctx, sharded<service::storage_service>& ss, std::
        if (done) {
            co_return json::json_return_type(0);
        }
-        // fall back to the local global cleanup if topology coordinator is not enabled
+        // fall back to the local cleanup if topology coordinator is not enabled or local cleanup is requested
        auto& db = ctx.db;
        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
        auto task = co_await compaction_module.make_and_start_task<compaction::global_cleanup_compaction_task_impl>({}, db);
        co_await task->done();
+
+        // Mark this node as clean
+        co_await ss.invoke_on(0, [] (service::storage_service& ss) -> future<> {
+            if (ss.is_topology_coordinator_enabled()) {
+                co_await ss.reset_cleanup_needed();
+            }
+        });
+
        co_return json::json_return_type(0);
 }

+static
+future<json::json_return_type>
+rest_reset_cleanup_needed(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
+        apilog.info("reset_cleanup_needed");
+        co_await ss.invoke_on(0, [] (service::storage_service& ss) {
+            if (!ss.is_topology_coordinator_enabled()) {
+                throw std::runtime_error("mark_node_as_clean is only supported when topology over raft is enabled");
+            }
+            return ss.reset_cleanup_needed();
+        });
+        co_return json_void();
+}
+
 static
 future<json::json_return_type>
 rest_force_flush(http_context& ctx, std::unique_ptr<http::request> req) {
@@ -1723,6 +1752,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
    ss::get_natural_endpoints.set(r, rest_bind(rest_get_natural_endpoints, ctx, ss));
    ss::cdc_streams_check_and_repair.set(r, rest_bind(rest_cdc_streams_check_and_repair, ss));
    ss::cleanup_all.set(r, rest_bind(rest_cleanup_all, ctx, ss));
+    ss::reset_cleanup_needed.set(r, rest_bind(rest_reset_cleanup_needed, ctx, ss));
    ss::force_flush.set(r, rest_bind(rest_force_flush, ctx));
    ss::force_keyspace_flush.set(r, rest_bind(rest_force_keyspace_flush, ctx));
    ss::decommission.set(r, rest_bind(rest_decommission, ss));
@@ -1800,6 +1830,7 @@ void unset_storage_service(http_context& ctx, routes& r) {
    ss::get_natural_endpoints.unset(r);
    ss::cdc_streams_check_and_repair.unset(r);
    ss::cleanup_all.unset(r);
+    ss::reset_cleanup_needed.unset(r);
    ss::force_flush.unset(r);
    ss::force_keyspace_flush.unset(r);
    ss::decommission.unset(r);
--- a/api/tasks.cc
+++ b/api/tasks.cc
@@ -38,76 +38,78 @@ static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {
    };
 }

+static future<shared_ptr<compaction::major_keyspace_compaction_task_impl>> force_keyspace_compaction(http_context& ctx, std::unique_ptr<http::request> req) {
+    auto& db = ctx.db;
+    auto [ keyspace, table_infos ] = parse_table_infos(ctx, *req, "cf");
+    auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);
+    auto consider_only_existing_data = validate_bool_x(req->get_query_param("consider_only_existing_data"), false);
+    apilog.info("force_keyspace_compaction: keyspace={} tables={}, flush={} consider_only_existing_data={}", keyspace, table_infos, flush, consider_only_existing_data);
+
+    auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
+    std::optional<compaction::flush_mode> fmopt;
+    if (!flush && !consider_only_existing_data) {
+        fmopt = compaction::flush_mode::skip;
+    }
+    return compaction_module.make_and_start_task<compaction::major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt, consider_only_existing_data);
+}
+
+static future<shared_ptr<compaction::upgrade_sstables_compaction_task_impl>> upgrade_sstables(http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) {
+    auto& db = ctx.db;
+    bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);
+
+    apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);
+
+    auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
+    return compaction_module.make_and_start_task<compaction::upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, table_infos, exclude_current_version);
+}
+
+static future<shared_ptr<compaction::cleanup_keyspace_compaction_task_impl>> force_keyspace_cleanup(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
+    auto& db = ctx.db;
+    auto [keyspace, table_infos] = parse_table_infos(ctx, *req);
+    const auto& rs = db.local().find_keyspace(keyspace).get_replication_strategy();
+    if (rs.is_local() || !rs.is_vnode_based()) {
+        auto reason = rs.is_local() ? "require" : "support";
+        apilog.info("Keyspace {} does not {} cleanup", keyspace, reason);
+        co_return nullptr;
+    }
+    apilog.info("force_keyspace_cleanup: keyspace={} tables={}", keyspace, table_infos);
+    if (!co_await ss.local().is_cleanup_allowed(keyspace)) {
+        auto msg = "Can not perform cleanup operation when topology changes";
+        apilog.warn("force_keyspace_cleanup: keyspace={} tables={}: {}", keyspace, table_infos, msg);
+        co_await coroutine::return_exception(std::runtime_error(msg));
+    }
+
+    auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
+    co_return co_await compaction_module.make_and_start_task<compaction::cleanup_keyspace_compaction_task_impl>(
+        {}, std::move(keyspace), db, table_infos, compaction::flush_mode::all_tables, tasks::is_user_task::yes);
+}
+
 void set_tasks_compaction_module(http_context& ctx, routes& r, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& snap_ctl) {
    t::force_keyspace_compaction_async.set(r, [&ctx](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
-        auto& db = ctx.db;
-        auto [ keyspace, table_infos ] = parse_table_infos(ctx, *req, "cf");
-        auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);
-        apilog.debug("force_keyspace_compaction_async: keyspace={} tables={}, flush={}", keyspace, table_infos, flush);
-
-        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
-        std::optional<compaction::flush_mode> fmopt;
-        if (!flush) {
-            fmopt = compaction::flush_mode::skip;
-        }
-        auto task = co_await compaction_module.make_and_start_task<compaction::major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt);
-
+        auto task = co_await force_keyspace_compaction(ctx, std::move(req));
        co_return json::json_return_type(task->get_status().id.to_sstring());
    });

    ss::force_keyspace_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
-        auto& db = ctx.db;
-        auto [ keyspace, table_infos ] = parse_table_infos(ctx, *req, "cf");
-        auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);
-        auto consider_only_existing_data = validate_bool_x(req->get_query_param("consider_only_existing_data"), false);
-        apilog.info("force_keyspace_compaction: keyspace={} tables={}, flush={} consider_only_existing_data={}", keyspace, table_infos, flush, consider_only_existing_data);
-
-        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
-        std::optional<compaction::flush_mode> fmopt;
-        if (!flush && !consider_only_existing_data) {
-            fmopt = compaction::flush_mode::skip;
-        }
-        auto task = co_await compaction_module.make_and_start_task<compaction::major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt, consider_only_existing_data);
+        auto task = co_await force_keyspace_compaction(ctx, std::move(req));
        co_await task->done();
        co_return json_void();
    });

    t::force_keyspace_cleanup_async.set(r, [&ctx, &ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
-        auto& db = ctx.db;
-        auto [keyspace, table_infos] = parse_table_infos(ctx, *req);
-        apilog.info("force_keyspace_cleanup_async: keyspace={} tables={}", keyspace, table_infos);
-        if (!co_await ss.local().is_cleanup_allowed(keyspace)) {
-            auto msg = "Can not perform cleanup operation when topology changes";
-            apilog.warn("force_keyspace_cleanup_async: keyspace={} tables={}: {}", keyspace, table_infos, msg);
-            co_await coroutine::return_exception(std::runtime_error(msg));
+        tasks::task_id id = tasks::task_id::create_null_id();
+        auto task = co_await force_keyspace_cleanup(ctx, ss, std::move(req));
+        if (task) {
+            id = task->get_status().id;
        }
-
-        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
-        auto task = co_await compaction_module.make_and_start_task<compaction::cleanup_keyspace_compaction_task_impl>({}, std::move(keyspace), db, table_infos, compaction::flush_mode::all_tables, tasks::is_user_task::yes);
-
-        co_return json::json_return_type(task->get_status().id.to_sstring());
+        co_return json::json_return_type(id.to_sstring());
    });

    ss::force_keyspace_cleanup.set(r, [&ctx, &ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
-        auto& db = ctx.db;
-        auto [keyspace, table_infos] = parse_table_infos(ctx, *req);
-        const auto& rs = db.local().find_keyspace(keyspace).get_replication_strategy();
-        if (rs.is_local() || !rs.is_vnode_based()) {
-            auto reason = rs.is_local() ? "require" : "support";
-            apilog.info("Keyspace {} does not {} cleanup", keyspace, reason);
-            co_return json::json_return_type(0);
+        auto task = co_await force_keyspace_cleanup(ctx, ss, std::move(req));
+        if (task) {
+            co_await task->done();
        }
-        apilog.info("force_keyspace_cleanup: keyspace={} tables={}", keyspace, table_infos);
-        if (!co_await ss.local().is_cleanup_allowed(keyspace)) {
-            auto msg = "Can not perform cleanup operation when topology changes";
-            apilog.warn("force_keyspace_cleanup: keyspace={} tables={}: {}", keyspace, table_infos, msg);
-            co_await coroutine::return_exception(std::runtime_error(msg));
-        }
-
-        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
-        auto task = co_await compaction_module.make_and_start_task<compaction::cleanup_keyspace_compaction_task_impl>(
-            {}, std::move(keyspace), db, table_infos, compaction::flush_mode::all_tables, tasks::is_user_task::yes);
-        co_await task->done();
        co_return json::json_return_type(0);
    });

@@ -129,25 +131,12 @@ void set_tasks_compaction_module(http_context& ctx, routes& r, sharded<service::
    }));

    t::upgrade_sstables_async.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) -> future<json::json_return_type> {
-        auto& db = ctx.db;
-        bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);
-
-        apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);
-
-        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
-        auto task = co_await compaction_module.make_and_start_task<compaction::upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, table_infos, exclude_current_version);
-
+        auto task = co_await upgrade_sstables(ctx, std::move(req), std::move(keyspace), std::move(table_infos));
        co_return json::json_return_type(task->get_status().id.to_sstring());
    }));

    ss::upgrade_sstables.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) -> future<json::json_return_type> {
-        auto& db = ctx.db;
-        bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);
-
-        apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);
-
-        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
-        auto task = co_await compaction_module.make_and_start_task<compaction::upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, table_infos, exclude_current_version);
+        auto task = co_await upgrade_sstables(ctx, std::move(req), std::move(keyspace), std::move(table_infos));
        co_await task->done();
        co_return json::json_return_type(0);
    }));
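
Note on the api/tasks.cc refactoring above: each async/sync endpoint pair now shares one helper that starts the task, and the cleanup helper may return a null pointer when the keyspace needs no cleanup. The following is a minimal standalone sketch of that shape in plain C++, without Seastar coroutines. The names `task`, `start_cleanup`, `cleanup_async`, `cleanup_sync` and the `"null-id"` string are hypothetical stand-ins, not ScyllaDB identifiers.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for a compaction task; not the ScyllaDB type.
struct task {
    std::string id;
    bool finished = false;
    void wait() { finished = true; }  // stand-in for `co_await task->done()`
};

// Shared helper: starts a task, or returns nullptr when the keyspace
// needs no cleanup (mirrors the `co_return nullptr` path in the diff).
std::shared_ptr<task> start_cleanup(const std::string& keyspace, bool is_local) {
    if (is_local) {
        return nullptr;
    }
    return std::make_shared<task>(task{"task-for-" + keyspace});
}

// Async-style endpoint: returns the task id, or a null id when nothing started.
std::string cleanup_async(const std::string& keyspace, bool is_local) {
    std::string id = "null-id";
    if (auto t = start_cleanup(keyspace, is_local)) {
        id = t->id;
    }
    return id;
}

// Sync-style endpoint: waits for the task if one exists, then reports success.
int cleanup_sync(const std::string& keyspace, bool is_local) {
    if (auto t = start_cleanup(keyspace, is_local)) {
        t->wait();
    }
    return 0;
}
```

Both wrappers have to check for the null task before using it. In the diff this corresponds to the `if (task)` guards in `force_keyspace_cleanup_async` and `force_keyspace_cleanup`.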
--- a/auth/ldap_role_manager.cc
+++ b/auth/ldap_role_manager.cc
@@ -233,9 +233,9 @@ future<role_set> ldap_role_manager::query_granted(std::string_view grantee_name,
 }

 future<role_to_directly_granted_map>
-ldap_role_manager::query_all_directly_granted() {
+ldap_role_manager::query_all_directly_granted(::service::query_state& qs) {
    role_to_directly_granted_map result;
-    auto roles = co_await query_all();
+    auto roles = co_await query_all(qs);
    for (auto& role: roles) {
        auto granted_set = co_await query_granted(role, recursive_role_query::no);
        for (auto& granted: granted_set) {
@@ -247,8 +247,8 @@ ldap_role_manager::query_all_directly_granted() {
    co_return result;
 }

-future<role_set> ldap_role_manager::query_all() {
-    return _std_mgr.query_all();
+future<role_set> ldap_role_manager::query_all(::service::query_state& qs) {
+    return _std_mgr.query_all(qs);
 }

 future<> ldap_role_manager::create_role(std::string_view role_name) {
@@ -311,12 +311,12 @@ future<bool> ldap_role_manager::can_login(std::string_view role_name) {
 }

 future<std::optional<sstring>> ldap_role_manager::get_attribute(
-        std::string_view role_name, std::string_view attribute_name) {
-    return _std_mgr.get_attribute(role_name, attribute_name);
+        std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {
+    return _std_mgr.get_attribute(role_name, attribute_name, qs);
 }

-future<role_manager::attribute_vals> ldap_role_manager::query_attribute_for_all(std::string_view attribute_name) {
-    return _std_mgr.query_attribute_for_all(attribute_name);
+future<role_manager::attribute_vals> ldap_role_manager::query_attribute_for_all(std::string_view attribute_name, ::service::query_state& qs) {
+    return _std_mgr.query_attribute_for_all(attribute_name, qs);
 }

 future<> ldap_role_manager::set_attribute(
--- a/auth/ldap_role_manager.hh
+++ b/auth/ldap_role_manager.hh
@@ -75,9 +75,9 @@ class ldap_role_manager : public role_manager {

    future<role_set> query_granted(std::string_view, recursive_role_query) override;

-    future<role_to_directly_granted_map> query_all_directly_granted() override;
+    future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;

-    future<role_set> query_all() override;
+    future<role_set> query_all(::service::query_state&) override;

    future<bool> exists(std::string_view) override;

@@ -85,9 +85,9 @@ class ldap_role_manager : public role_manager {

    future<bool> can_login(std::string_view) override;

-    future<std::optional<sstring>> get_attribute(std::string_view, std::string_view) override;
+    future<std::optional<sstring>> get_attribute(std::string_view, std::string_view, ::service::query_state&) override;

-    future<role_manager::attribute_vals> query_attribute_for_all(std::string_view) override;
+    future<role_manager::attribute_vals> query_attribute_for_all(std::string_view, ::service::query_state&) override;

    future<> set_attribute(std::string_view, std::string_view, std::string_view, ::service::group0_batch& mc) override;

--- a/auth/maintenance_socket_role_manager.cc
+++ b/auth/maintenance_socket_role_manager.cc
@@ -78,11 +78,11 @@ future<role_set> maintenance_socket_role_manager::query_granted(std::string_view
    return operation_not_supported_exception<role_set>("QUERY GRANTED");
 }

-future<role_to_directly_granted_map> maintenance_socket_role_manager::query_all_directly_granted() {
+future<role_to_directly_granted_map> maintenance_socket_role_manager::query_all_directly_granted(::service::query_state&) {
    return operation_not_supported_exception<role_to_directly_granted_map>("QUERY ALL DIRECTLY GRANTED");
 }

-future<role_set> maintenance_socket_role_manager::query_all() {
+future<role_set> maintenance_socket_role_manager::query_all(::service::query_state&) {
    return operation_not_supported_exception<role_set>("QUERY ALL");
 }

@@ -98,11 +98,11 @@ future<bool> maintenance_socket_role_manager::can_login(std::string_view role_na
    return make_ready_future<bool>(true);
 }

-future<std::optional<sstring>> maintenance_socket_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name) {
+future<std::optional<sstring>> maintenance_socket_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) {
    return operation_not_supported_exception<std::optional<sstring>>("GET ATTRIBUTE");
 }

-future<role_manager::attribute_vals> maintenance_socket_role_manager::query_attribute_for_all(std::string_view attribute_name) {
+future<role_manager::attribute_vals> maintenance_socket_role_manager::query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) {
    return operation_not_supported_exception<role_manager::attribute_vals>("QUERY ATTRIBUTE");
 }

--- a/auth/maintenance_socket_role_manager.hh
+++ b/auth/maintenance_socket_role_manager.hh
@@ -53,9 +53,9 @@ public:

    virtual future<role_set> query_granted(std::string_view grantee_name, recursive_role_query) override;

-    virtual future<role_to_directly_granted_map> query_all_directly_granted() override;
+    virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;

-    virtual future<role_set> query_all() override;
+    virtual future<role_set> query_all(::service::query_state&) override;

    virtual future<bool> exists(std::string_view role_name) override;

@@ -63,9 +63,9 @@ public:

    virtual future<bool> can_login(std::string_view role_name) override;

-    virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name) override;
+    virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) override;

-    virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name) override;
+    virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) override;

    virtual future<> set_attribute(std::string_view role_name, std::string_view attribute_name, std::string_view attribute_value, ::service::group0_batch& mc) override;

--- a/auth/permission.cc
+++ b/auth/permission.cc
@@ -36,7 +36,8 @@ static const std::unordered_map<sstring, auth::permission> permission_names({
        {"MODIFY", auth::permission::MODIFY},
        {"AUTHORIZE", auth::permission::AUTHORIZE},
        {"DESCRIBE", auth::permission::DESCRIBE},
-        {"EXECUTE", auth::permission::EXECUTE}});
+        {"EXECUTE", auth::permission::EXECUTE},
+        {"VECTOR_SEARCH_INDEXING", auth::permission::VECTOR_SEARCH_INDEXING}});

 const sstring& auth::permissions::to_string(permission p) {
    for (auto& v : permission_names) {
--- a/auth/permission.hh
+++ b/auth/permission.hh
@@ -33,6 +33,7 @@ enum class permission {
    // data access
    SELECT, // required for SELECT.
    MODIFY, // required for INSERT, UPDATE, DELETE, TRUNCATE.
+    VECTOR_SEARCH_INDEXING, // required for SELECT from tables with vector indexes if SELECT permission is not granted.

    // permission management
    AUTHORIZE, // required for GRANT and REVOKE.
@@ -54,7 +55,8 @@ typedef enum_set<
                permission::MODIFY,
                permission::AUTHORIZE,
                permission::DESCRIBE,
-                permission::EXECUTE>> permission_set;
+                permission::EXECUTE,
+                permission::VECTOR_SEARCH_INDEXING>> permission_set;

 bool operator<(const permission_set&, const permission_set&);

--- a/auth/resource.cc
+++ b/auth/resource.cc
@@ -41,22 +41,26 @@ static const std::unordered_map<resource_kind, std::size_t> max_parts{
        {resource_kind::functions, 2}};

 static permission_set applicable_permissions(const data_resource_view& dv) {
-    if (dv.table()) {
-        return permission_set::of<
+    
+    // We only support VECTOR_SEARCH_INDEXING permission for ALL KEYSPACES.
+
+    auto set = permission_set::of<
                permission::ALTER,
                permission::DROP,
                permission::SELECT,
                permission::MODIFY,
                permission::AUTHORIZE>();
+
+    if (!dv.table()) {
+        set.add(permission_set::of<permission::CREATE>());
    }

-    return permission_set::of<
-            permission::CREATE,
-            permission::ALTER,
-            permission::DROP,
-            permission::SELECT,
-            permission::MODIFY,
-            permission::AUTHORIZE>();
+    if (!dv.table() && !dv.keyspace()) {
+        set.add(permission_set::of<permission::VECTOR_SEARCH_INDEXING>());
+    }
+
+    return set;
+        
 }

 static permission_set applicable_permissions(const role_resource_view& rv) {
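
Note on the auth/resource.cc hunk above: the rewritten `applicable_permissions` starts from a base set, adds CREATE when the resource is not a table, and adds VECTOR_SEARCH_INDEXING only at the root level (ALL KEYSPACES). The sketch below restates that rule with `std::bitset` as a stand-in for `permission_set`. The names `data_resource`, `permission_bits`, `applicable` and `has` are hypothetical and exist only for this illustration.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <initializer_list>

enum class permission : std::size_t {
    CREATE, ALTER, DROP, SELECT, MODIFY, AUTHORIZE, VECTOR_SEARCH_INDEXING, COUNT
};

using permission_bits = std::bitset<static_cast<std::size_t>(permission::COUNT)>;

// Hypothetical simplified view of a data resource:
// root = no keyspace and no table, keyspace = keyspace only, table = both.
struct data_resource {
    bool has_keyspace;
    bool has_table;
};

bool has(const permission_bits& s, permission p) {
    return s.test(static_cast<std::size_t>(p));
}

permission_bits applicable(const data_resource& r) {
    permission_bits s;
    // Base set shared by every data resource level.
    for (auto p : {permission::ALTER, permission::DROP, permission::SELECT,
                   permission::MODIFY, permission::AUTHORIZE}) {
        s.set(static_cast<std::size_t>(p));
    }
    // CREATE applies to the root and to keyspaces, but not to tables.
    if (!r.has_table) {
        s.set(static_cast<std::size_t>(permission::CREATE));
    }
    // VECTOR_SEARCH_INDEXING applies to the root level only.
    if (!r.has_table && !r.has_keyspace) {
        s.set(static_cast<std::size_t>(permission::VECTOR_SEARCH_INDEXING));
    }
    return s;
}
```

Under these assumptions the root level gets seven applicable permissions, a keyspace six, and a table five.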
--- a/auth/role_manager.hh
+++ b/auth/role_manager.hh
@@ -17,12 +17,17 @@
 #include <seastar/core/format.hh>
 #include <seastar/core/sstring.hh>

+#include "auth/common.hh"
 #include "auth/resource.hh"
 #include "cql3/description.hh"
 #include "seastarx.hh"
 #include "exceptions/exceptions.hh"
 #include "service/raft/raft_group0_client.hh"

+namespace service {
+class query_state;
+};
+
 namespace auth {

 struct role_config final {
@@ -167,9 +172,9 @@ public:
    ///   (role2, role3)
    /// }
    ///  
-    virtual future<role_to_directly_granted_map> query_all_directly_granted() = 0;
+    virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state& = internal_distributed_query_state()) = 0;

-    virtual future<role_set> query_all() = 0;
+    virtual future<role_set> query_all(::service::query_state& = internal_distributed_query_state()) = 0;

    virtual future<bool> exists(std::string_view role_name) = 0;

@@ -186,12 +191,12 @@ public:
    ///
    /// \returns the value of the named attribute, if one is set.
    ///
-    virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name) = 0;
+    virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& = internal_distributed_query_state()) = 0;

    ///
    /// \returns a mapping of each role's value for the named attribute, if one is set for the role.
    ///
-    virtual future<attribute_vals> query_attribute_for_all(std::string_view attribute_name) = 0;
+    virtual future<attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state& = internal_distributed_query_state()) = 0;

    /// Sets `attribute_name` with `attribute_value` for `role_name`.
    /// \returns an exceptional future with nonexistant_role if the role does not exist.
--- a/auth/service.hh
+++ b/auth/service.hh
@@ -231,6 +231,17 @@ struct command_desc {
    } type_ = type::OTHER;
 };

+/// Similar to command_desc, but used in cases where multiple permissions allow access to the resource.
+struct command_desc_with_permission_set {
+    permission_set permission;
+    const ::auth::resource& resource;
+    enum class type {
+        ALTER_WITH_OPTS,
+        ALTER_SYSTEM_WITH_ALLOWED_OPTS,
+        OTHER
+    } type_ = type::OTHER;
+};
+
 ///
 /// Protected resources cannot be modified even if the performer has permissions to do so.
 ///
--- a/auth/standard_role_manager.cc
+++ b/auth/standard_role_manager.cc
@@ -663,21 +663,30 @@ future<role_set> standard_role_manager::query_granted(std::string_view grantee_n
    });
 }

-future<role_to_directly_granted_map> standard_role_manager::query_all_directly_granted() {
+future<role_to_directly_granted_map> standard_role_manager::query_all_directly_granted(::service::query_state& qs) {
    const sstring query = seastar::format("SELECT * FROM {}.{}",
            get_auth_ks_name(_qp),
            meta::role_members_table::name);

+    const auto results = co_await _qp.execute_internal(
+            query,
+            db::consistency_level::ONE,
+            qs,
+            cql3::query_processor::cache_internal::yes);
+
    role_to_directly_granted_map roles_map;
-    co_await _qp.query_internal(query, [&roles_map] (const cql3::untyped_result_set_row& row) -> future<stop_iteration> {
-        roles_map.insert({row.get_as<sstring>("member"), row.get_as<sstring>("role")});
-        co_return stop_iteration::no;
-    });
+    std::transform(
+            results->begin(),
+            results->end(),
+            std::inserter(roles_map, roles_map.begin()),
+            [] (const cql3::untyped_result_set_row& row) {
+                return std::make_pair(row.get_as<sstring>("member"), row.get_as<sstring>("role")); }
+    );

    co_return roles_map;
 }

-future<role_set> standard_role_manager::query_all() {
+future<role_set> standard_role_manager::query_all(::service::query_state& qs) {
    const sstring query = seastar::format("SELECT {} FROM {}.{}",
            meta::roles_table::role_col_name,
            get_auth_ks_name(_qp),
@@ -695,7 +704,7 @@ future<role_set> standard_role_manager::query_all() {
    const auto results = co_await _qp.execute_internal(
            query,
            db::consistency_level::QUORUM,
-            internal_distributed_query_state(),
+            qs,
            cql3::query_processor::cache_internal::yes);

    role_set roles;
@@ -727,11 +736,11 @@ future<bool> standard_role_manager::can_login(std::string_view role_name) {
    });
 }

-future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name) {
+future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {
    const sstring query = seastar::format("SELECT name, value FROM {}.{} WHERE role = ? AND name = ?",
            get_auth_ks_name(_qp),
            meta::role_attributes_table::name);
-    const auto result_set = co_await _qp.execute_internal(query, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes);
+    const auto result_set = co_await _qp.execute_internal(query, db::consistency_level::ONE, qs, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes);
    if (!result_set->empty()) {
        const cql3::untyped_result_set_row &row = result_set->one();
        co_return std::optional<sstring>(row.get_as<sstring>("value"));
@@ -739,11 +748,11 @@ future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_
    co_return std::optional<sstring>{};
 }

-future<role_manager::attribute_vals> standard_role_manager::query_attribute_for_all (std::string_view attribute_name) {
-    return query_all().then([this, attribute_name] (role_set roles) {
-        return do_with(attribute_vals{}, [this, attribute_name, roles = std::move(roles)] (attribute_vals &role_to_att_val) {
-            return parallel_for_each(roles.begin(), roles.end(), [this, &role_to_att_val, attribute_name] (sstring role) {
-                return get_attribute(role, attribute_name).then([&role_to_att_val, role] (std::optional<sstring> att_val) {
+future<role_manager::attribute_vals> standard_role_manager::query_attribute_for_all (std::string_view attribute_name, ::service::query_state& qs) {
+    return query_all(qs).then([this, attribute_name, &qs] (role_set roles) {
+        return do_with(attribute_vals{}, [this, attribute_name, roles = std::move(roles), &qs] (attribute_vals &role_to_att_val) {
+            return parallel_for_each(roles.begin(), roles.end(), [this, &role_to_att_val, attribute_name, &qs] (sstring role) {
+                return get_attribute(role, attribute_name, qs).then([&role_to_att_val, role] (std::optional<sstring> att_val) {
                    if (att_val) {
                        role_to_att_val.emplace(std::move(role), std::move(*att_val));
                    }
@@ -788,7 +797,7 @@ future<> standard_role_manager::remove_attribute(std::string_view role_name, std
 future<std::vector<cql3::description>> standard_role_manager::describe_role_grants() {
    std::vector<cql3::description> result{};

-    const auto grants = co_await query_all_directly_granted();
+    const auto grants = co_await query_all_directly_granted(internal_distributed_query_state());
    result.reserve(grants.size());

    for (const auto& [grantee_role, granted_role] : grants) {
--- a/auth/standard_role_manager.hh
+++ b/auth/standard_role_manager.hh
@@ -66,9 +66,9 @@ public:

    virtual future<role_set> query_granted(std::string_view grantee_name, recursive_role_query) override;

-    virtual future<role_to_directly_granted_map> query_all_directly_granted() override;
+    virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;

-    virtual future<role_set> query_all() override;
+    virtual future<role_set> query_all(::service::query_state&) override;

    virtual future<bool> exists(std::string_view role_name) override;

@@ -76,9 +76,9 @@ public:

    virtual future<bool> can_login(std::string_view role_name) override;

-    virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name) override;
+    virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) override;

-    virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name) override;
+    virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) override;

    virtual future<> set_attribute(std::string_view role_name, std::string_view attribute_name, std::string_view attribute_value, ::service::group0_batch& mc) override;
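
Note on the role manager signature changes above: `role_manager` declares the new `::service::query_state&` parameter with a default argument, while the overrides in the derived classes declare it without one. In C++ a default argument is chosen from the static type of the call expression, so the default is available through a `role_manager` reference but not through the concrete type. This appears consistent with `describe_role_grants` passing `internal_distributed_query_state()` explicitly. The sketch below shows the language rule only; `query_state`, `internal_state` and the string return values are hypothetical simplifications.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for ::service::query_state.
struct query_state {
    std::string name;
};

// Hypothetical stand-in for internal_distributed_query_state().
query_state& internal_state() {
    static query_state qs{"internal"};
    return qs;
}

struct role_manager {
    virtual ~role_manager() = default;
    // The default argument lives on the base declaration only.
    virtual std::string query_all(query_state& qs = internal_state()) = 0;
};

struct standard_role_manager : role_manager {
    // The override repeats no default, so calls made through this type
    // must pass a query_state explicitly.
    std::string query_all(query_state& qs) override {
        return "queried with " + qs.name;
    }
};
```

Calling `query_all()` with no argument on a `standard_role_manager` object directly would not compile, because the derived declaration hides the base one and carries no default.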

--- a/cdc/generation.cc
+++ b/cdc/generation.cc
@@ -1209,6 +1209,23 @@ future<mutation> create_table_streams_mutation(table_id table, db_clock::time_po
    co_return std::move(m);
 }

+future<mutation> create_table_streams_mutation(table_id table, db_clock::time_point stream_ts, const utils::chunked_vector<cdc::stream_id>& stream_ids, api::timestamp_type ts) {
+    auto s = db::system_keyspace::cdc_streams_state();
+
+    mutation m(s, partition_key::from_single_value(*s,
+        data_value(table.uuid()).serialize_nonnull()
+    ));
+    m.set_static_cell("timestamp", stream_ts, ts);
+
+    for (const auto& sid : stream_ids) {
+        auto ck = clustering_key::from_singular(*s, dht::token::to_int64(sid.token()));
+        m.set_cell(ck, "stream_id", data_value(sid.to_bytes()), ts);
+        co_await coroutine::maybe_yield();
+    }
+
+    co_return std::move(m);
+}
+
 utils::chunked_vector<mutation>
 make_drop_table_streams_mutations(table_id table, api::timestamp_type ts) {
    utils::chunked_vector<mutation> mutations;
@@ -1235,32 +1252,50 @@ future<> generation_service::load_cdc_tablet_streams(std::optional<std::unordere
        tables_to_process = _cdc_metadata.get_tables_with_cdc_tablet_streams() | std::ranges::to<std::unordered_set<table_id>>();
    }

-    auto read_streams_state = [this] (const std::optional<std::unordered_set<table_id>>& tables, noncopyable_function<future<>(table_id, db_clock::time_point, std::vector<cdc::stream_id>)> f) -> future<> {
+    auto read_streams_state = [this] (const std::optional<std::unordered_set<table_id>>& tables, noncopyable_function<future<>(table_id, db_clock::time_point, utils::chunked_vector<cdc::stream_id>)> f) -> future<> {
        if (tables) {
            for (auto table : *tables) {
-                co_await _sys_ks.local().read_cdc_streams_state(table, [&] (table_id table, db_clock::time_point base_ts, std::vector<cdc::stream_id> base_stream_set) -> future<> {
+                co_await _sys_ks.local().read_cdc_streams_state(table, [&] (table_id table, db_clock::time_point base_ts, utils::chunked_vector<cdc::stream_id> base_stream_set) -> future<> {
                    return f(table, base_ts, std::move(base_stream_set));
                });
            }
        } else {
-            co_await _sys_ks.local().read_cdc_streams_state(std::nullopt, [&] (table_id table, db_clock::time_point base_ts, std::vector<cdc::stream_id> base_stream_set) -> future<> {
+            co_await _sys_ks.local().read_cdc_streams_state(std::nullopt, [&] (table_id table, db_clock::time_point base_ts, utils::chunked_vector<cdc::stream_id> base_stream_set) -> future<> {
                return f(table, base_ts, std::move(base_stream_set));
            });
        }
    };

-    co_await read_streams_state(changed_tables, [this, &tables_to_process] (table_id table, db_clock::time_point base_ts, std::vector<cdc::stream_id> base_stream_set) -> future<> {
+    co_await read_streams_state(changed_tables, [this, &tables_to_process] (table_id table, db_clock::time_point base_ts, utils::chunked_vector<cdc::stream_id> base_stream_set) -> future<> {
        table_streams new_table_map;

-        auto append_stream = [&new_table_map] (db_clock::time_point stream_tp, std::vector<cdc::stream_id> stream_set) {
+        auto append_stream = [&new_table_map] (db_clock::time_point stream_tp, utils::chunked_vector<cdc::stream_id> stream_set) {
            auto ts = std::chrono::duration_cast<api::timestamp_clock::duration>(stream_tp.time_since_epoch()).count();
            new_table_map[ts] = committed_stream_set {stream_tp, std::move(stream_set)};
        };

-        append_stream(base_ts, std::move(base_stream_set));
+        // if we already have a loaded streams map, and the base timestamp is unchanged, then read
+        // the history entries starting from the latest one we have and append them to the existing map.
+        // this is safe because we only append new rows with higher timestamps to the history table.
+        std::optional<std::reference_wrapper<const committed_stream_set>> from_streams;
+        std::optional<db_clock::time_point> from_ts;
+        const auto& all_streams = _cdc_metadata.get_all_tablet_streams();
+        if (auto it = all_streams.find(table); it != all_streams.end()) {
+            const auto& current_map = *it->second;
+            if (current_map.cbegin()->second.ts == base_ts) {
+                const auto& latest_entry = current_map.crbegin()->second;
+                from_streams = std::cref(latest_entry);
+                from_ts = latest_entry.ts;
+            }
+        }

-        co_await _sys_ks.local().read_cdc_streams_history(table, [&] (table_id tid, db_clock::time_point ts, cdc_stream_diff diff) -> future<> {
-            const auto& prev_stream_set = std::crbegin(new_table_map)->second.streams;
+        if (!from_ts) {
+            append_stream(base_ts, std::move(base_stream_set));
+        }
+
+        co_await _sys_ks.local().read_cdc_streams_history(table, from_ts, [&] (table_id tid, db_clock::time_point ts, cdc_stream_diff diff) -> future<> {
+            const auto& prev_stream_set = new_table_map.empty() ?
+                    from_streams->get().streams : std::crbegin(new_table_map)->second.streams;

            append_stream(ts, co_await cdc::metadata::construct_next_stream_set(
                    prev_stream_set, std::move(diff.opened_streams), diff.closed_streams));
@@ -1272,7 +1307,11 @@ future<> generation_service::load_cdc_tablet_streams(std::optional<std::unordere
                new_table_map_copy[ts] = entry;
                co_await coroutine::maybe_yield();
            }
-            svc._cdc_metadata.load_tablet_streams_map(table, std::move(new_table_map_copy));
+            if (!from_ts) {
+                svc._cdc_metadata.load_tablet_streams_map(table, std::move(new_table_map_copy));
+            } else {
+                svc._cdc_metadata.append_tablet_streams_map(table, std::move(new_table_map_copy));
+            }
        }));

        tables_to_process.erase(table);
@@ -1306,7 +1345,7 @@ future<> generation_service::query_cdc_timestamps(table_id table, bool ascending
    }
 }

-future<> generation_service::query_cdc_streams(table_id table, noncopyable_function<future<>(db_clock::time_point, const std::vector<cdc::stream_id>& current, cdc::cdc_stream_diff)> f) {
+future<> generation_service::query_cdc_streams(table_id table, noncopyable_function<future<>(db_clock::time_point, const utils::chunked_vector<cdc::stream_id>& current, cdc::cdc_stream_diff)> f) {
    const auto& all_tables = _cdc_metadata.get_all_tablet_streams();
    auto table_it = all_tables.find(table);
    if (table_it == all_tables.end()) {
@@ -1363,8 +1402,8 @@ future<> generation_service::generate_tablet_resize_update(utils::chunked_vector
        co_return;
    }

-    std::vector<cdc::stream_id> new_streams;
-    new_streams.reserve(new_tablet_map.tablet_count());
+    utils::chunked_vector<cdc::stream_id> new_streams;
+    co_await utils::reserve_gently(new_streams, new_tablet_map.tablet_count());
    for (auto tid : new_tablet_map.tablet_ids()) {
        new_streams.emplace_back(new_tablet_map.get_last_token(tid), 0);
        co_await coroutine::maybe_yield();
@@ -1386,4 +1425,113 @@ future<> generation_service::generate_tablet_resize_update(utils::chunked_vector
    muts.emplace_back(std::move(mut));
 }

+future<utils::chunked_vector<mutation>> get_cdc_stream_gc_mutations(table_id table, db_clock::time_point base_ts, const utils::chunked_vector<cdc::stream_id>& base_stream_set, api::timestamp_type ts) {
+    utils::chunked_vector<mutation> muts;
+    muts.reserve(2);
+
+    auto gc_now = gc_clock::now();
+    auto tombstone_ts = ts - 1;
+
+    {
+        // write the new base stream set to cdc_streams_state
+        auto s = db::system_keyspace::cdc_streams_state();
+        mutation m(s, partition_key::from_single_value(*s,
+            data_value(table.uuid()).serialize_nonnull()
+        ));
+        m.partition().apply(tombstone(tombstone_ts, gc_now));
+        m.set_static_cell("timestamp", data_value(base_ts), ts);
+
+        for (const auto& sid : base_stream_set) {
+            co_await coroutine::maybe_yield();
+            auto ck = clustering_key::from_singular(*s, dht::token::to_int64(sid.token()));
+            m.set_cell(ck, "stream_id", data_value(sid.to_bytes()), ts);
+        }
+        muts.emplace_back(std::move(m));
+    }
+
+    {
+        // remove all entries from cdc_streams_history up to the new base
+        auto s = db::system_keyspace::cdc_streams_history();
+        mutation m(s, partition_key::from_single_value(*s,
+            data_value(table.uuid()).serialize_nonnull()
+        ));
+        auto range = query::clustering_range::make_ending_with({
+                clustering_key_prefix::from_single_value(*s, timestamp_type->decompose(base_ts)), true});
+        auto bv = bound_view::from_range(range);
+        m.partition().apply_delete(*s, range_tombstone{bv.first, bv.second, tombstone{ts, gc_now}});
+        muts.emplace_back(std::move(m));
+    }
+
+    co_return std::move(muts);
+}
+
+table_streams::const_iterator get_new_base_for_gc(const table_streams& streams_map, std::chrono::seconds ttl) {
+    // find the most recent timestamp that is at least `ttl` old, which will become the new base.
+    // all streams with older timestamps can be removed because they have been closed for at least `ttl`
+    // (they were all replaced by streams with the newer timestamp).
+
+    auto ts_upper_bound = db_clock::now() - ttl;
+
+    auto it = streams_map.begin();
+    while (it != streams_map.end()) {
+        auto next_it = std::next(it);
+        if (next_it == streams_map.end()) {
+            break;
+        }
+
+        auto next_tp = next_it->second.ts;
+        if (next_tp <= ts_upper_bound) {
+            // the next timestamp is at least `ttl` old, so the current one is obsolete
+            it = next_it;
+        } else {
+            break;
+        }
+    }
+
+    return it;
+}
+
+future<utils::chunked_vector<mutation>> generation_service::garbage_collect_cdc_streams_for_table(table_id table, std::optional<std::chrono::seconds> ttl, api::timestamp_type ts) {
+    const auto& table_streams = *_cdc_metadata.get_all_tablet_streams().at(table);
+
+    // if TTL is not provided by the caller then use the table's CDC TTL
+    auto base_schema = cdc::get_base_table(_db, *_db.find_schema(table));
+    ttl = ttl.or_else([&] -> std::optional<std::chrono::seconds> {
+        auto ttl_seconds = base_schema->cdc_options().ttl();
+        if (ttl_seconds > 0) {
+            return std::chrono::seconds(ttl_seconds);
+        } else {
+            // ttl=0 means no ttl
+            return std::nullopt;
+        }
+    });
+    if (!ttl) {
+        co_return utils::chunked_vector<mutation>{};
+    }
+
+    auto new_base_it = get_new_base_for_gc(table_streams, *ttl);
+    if (new_base_it == table_streams.begin() || new_base_it == table_streams.end()) {
+        // nothing to gc
+        co_return utils::chunked_vector<mutation>{};
+    }
+
+    for (auto it = table_streams.begin(); it != new_base_it; ++it) {
+        cdc_log.info("Garbage collecting CDC stream metadata for table {}: removing generation {} because it is older than the CDC TTL of {} seconds",
+                table, it->second.ts, *ttl);
+    }
+
+    co_return co_await get_cdc_stream_gc_mutations(table, new_base_it->second.ts, new_base_it->second.streams, ts);
+}
+
+future<> generation_service::garbage_collect_cdc_streams(utils::chunked_vector<canonical_mutation>& muts, api::timestamp_type ts) {
+    for (auto table : _cdc_metadata.get_tables_with_cdc_tablet_streams()) {
+        co_await coroutine::maybe_yield();
+
+        auto table_muts = co_await garbage_collect_cdc_streams_for_table(table, std::nullopt, ts);
+        for (auto&& m : table_muts) {
+            muts.emplace_back(std::move(m));
+        }
+    }
+}
+
 } // namespace cdc
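Reviewer note on get_new_base_for_gc above: the loop advances while the *next* entry is already at or before the cutoff (now - ttl), so the newest entry is never dropped and the returned iterator is the entry that stays as the new base. Below is a minimal standalone sketch of that selection rule, not the patch's code. The names toy_entry, toy_streams and pick_new_base are hypothetical, and plain integer seconds stand in for db_clock::time_point.

```cpp
#include <cassert>
#include <iterator>
#include <map>

// Hypothetical stand-in for committed_stream_set: only the timestamp matters here.
struct toy_entry {
    long ts; // seconds since an arbitrary epoch
};

// Hypothetical stand-in for table_streams (ordered by timestamp key).
using toy_streams = std::map<long, toy_entry>;

// Same shape as the loop in get_new_base_for_gc: keep advancing while the
// next entry is at or before the cutoff; stop at the last entry regardless.
toy_streams::const_iterator pick_new_base(const toy_streams& m, long now, long ttl) {
    assert(ttl > 0); // the patch treats ttl == 0 as "no TTL" and skips GC entirely
    const long cutoff = now - ttl;
    auto it = m.begin();
    while (it != m.end()) {
        auto next = std::next(it);
        if (next == m.end() || next->second.ts > cutoff) {
            break;
        }
        it = next; // `it` was superseded at least `ttl` ago, so it is obsolete
    }
    return it;
}
```

With entries at 10, 20 and 30, now = 100 and ttl = 75 (cutoff 25), the entry at 20 becomes the new base and only the entry at 10 is collected. When the result equals begin() or end(), there is nothing to collect, which matches the early return in garbage_collect_cdc_streams_for_table.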
--- a/cdc/generation.hh
+++ b/cdc/generation.hh
@@ -143,12 +143,12 @@ stream_state read_stream_state(int8_t val);

 struct committed_stream_set {
    db_clock::time_point ts;
-    std::vector<cdc::stream_id> streams;
+    utils::chunked_vector<cdc::stream_id> streams;
 };

 struct cdc_stream_diff {
-    std::vector<stream_id> closed_streams;
-    std::vector<stream_id> opened_streams;
+    utils::chunked_vector<stream_id> closed_streams;
+    utils::chunked_vector<stream_id> opened_streams;
 };

 using table_streams = std::map<api::timestamp_type, committed_stream_set>;
@@ -220,8 +220,11 @@ future<utils::chunked_vector<mutation>> get_cdc_generation_mutations_v3(
    size_t mutation_size_threshold, api::timestamp_type mutation_timestamp);

 future<mutation> create_table_streams_mutation(table_id, db_clock::time_point, const locator::tablet_map&, api::timestamp_type);
+future<mutation> create_table_streams_mutation(table_id, db_clock::time_point, const utils::chunked_vector<cdc::stream_id>&, api::timestamp_type);
 utils::chunked_vector<mutation> make_drop_table_streams_mutations(table_id, api::timestamp_type ts);

 future<mutation> get_switch_streams_mutation(table_id table, db_clock::time_point stream_ts, cdc_stream_diff diff, api::timestamp_type ts);
+future<utils::chunked_vector<mutation>> get_cdc_stream_gc_mutations(table_id table, db_clock::time_point base_ts, const utils::chunked_vector<cdc::stream_id>& base_stream_set, api::timestamp_type ts);
+table_streams::const_iterator get_new_base_for_gc(const table_streams&, std::chrono::seconds ttl);

 } // namespace cdc
--- a/cdc/generation_service.hh
+++ b/cdc/generation_service.hh
@@ -149,10 +149,13 @@ public:
    future<> load_cdc_tablet_streams(std::optional<std::unordered_set<table_id>> changed_tables);

    future<> query_cdc_timestamps(table_id table, bool ascending, noncopyable_function<future<>(db_clock::time_point)> f);
-    future<> query_cdc_streams(table_id table, noncopyable_function<future<>(db_clock::time_point, const std::vector<cdc::stream_id>& current, cdc::cdc_stream_diff)> f);
+    future<> query_cdc_streams(table_id table, noncopyable_function<future<>(db_clock::time_point, const utils::chunked_vector<cdc::stream_id>& current, cdc::cdc_stream_diff)> f);

    future<> generate_tablet_resize_update(utils::chunked_vector<canonical_mutation>& muts, table_id table, const locator::tablet_map& new_tablet_map, api::timestamp_type ts);

+    future<utils::chunked_vector<mutation>> garbage_collect_cdc_streams_for_table(table_id table, std::optional<std::chrono::seconds> ttl, api::timestamp_type ts);
+    future<> garbage_collect_cdc_streams(utils::chunked_vector<canonical_mutation>& muts, api::timestamp_type ts);
+
 private:
    /* Retrieve the CDC generation which starts at the given timestamp (from a distributed table created for this purpose)
     * and start using it for CDC log writes if it's not obsolete.
--- a/cdc/log.cc
+++ b/cdc/log.cc
@@ -67,10 +67,15 @@ shared_ptr<locator::abstract_replication_strategy> generate_replication_strategy
    return locator::abstract_replication_strategy::create_replication_strategy(ksm.strategy_name(), params);
 }

+// When dropping a column from a CDC log table, we set the drop timestamp
+// `column_drop_leeway` seconds into the future to ensure that for writes concurrent
+// with column drop, the write timestamp is before the column drop timestamp.
+constexpr auto column_drop_leeway = std::chrono::seconds(5);
+
 } // anonymous namespace

 namespace cdc {
-static schema_ptr create_log_schema(const schema&, const replica::database&, const keyspace_metadata&,
+static schema_ptr create_log_schema(const schema&, const replica::database&, const keyspace_metadata&, api::timestamp_type,
        std::optional<table_id> = {}, schema_ptr = nullptr);
 }

@@ -182,7 +187,7 @@ public:
        muts.emplace_back(std::move(mut));
    }

-    void on_pre_create_column_families(const keyspace_metadata& ksm, std::vector<schema_ptr>& cfms) override {
+    void on_pre_create_column_families(const keyspace_metadata& ksm, std::vector<schema_ptr>& cfms, api::timestamp_type ts) override {
        std::vector<schema_ptr> new_cfms;

        for (auto sp : cfms) {
@@ -201,7 +206,7 @@ public:
            }

            // in seastar thread
-            auto log_schema = create_log_schema(schema, db, ksm);
+            auto log_schema = create_log_schema(schema, db, ksm, ts);
            new_cfms.push_back(std::move(log_schema));
        }

@@ -248,7 +253,7 @@ public:
            }

            std::optional<table_id> maybe_id = log_schema ? std::make_optional(log_schema->id()) : std::nullopt;
-            auto new_log_schema = create_log_schema(new_schema, db, *keyspace.metadata(), std::move(maybe_id), log_schema);
+            auto new_log_schema = create_log_schema(new_schema, db, *keyspace.metadata(), timestamp, std::move(maybe_id), log_schema);

            auto log_mut = log_schema 
                ? db::schema_tables::make_update_table_mutations(_ctxt._proxy, keyspace.metadata(), log_schema, new_log_schema, timestamp)
@@ -580,7 +585,7 @@ bytes log_data_column_deleted_elements_name_bytes(const bytes& column_name) {
 }

 static schema_ptr create_log_schema(const schema& s, const replica::database& db,
-        const keyspace_metadata& ksm, std::optional<table_id> uuid, schema_ptr old)
+        const keyspace_metadata& ksm, api::timestamp_type timestamp, std::optional<table_id> uuid, schema_ptr old)
 {
    schema_builder b(s.ks_name(), log_name(s.cf_name()));
    b.with_partitioner(cdc::cdc_partitioner::classname);
@@ -616,6 +621,28 @@ static schema_ptr create_log_schema(const schema& s, const replica::database& db
    b.with_column(log_meta_column_name_bytes("ttl"), long_type);
    b.with_column(log_meta_column_name_bytes("end_of_batch"), boolean_type);
    b.set_caching_options(caching_options::get_disabled_caching_options());
+
+    auto validate_new_column = [&] (const sstring& name) {
+        // When dropping a column from a CDC log table, we set the drop timestamp to be
+        // `column_drop_leeway` seconds into the future (see the `without_column` call below).
+        // Therefore, when recreating a column with the same name, we need to validate
+        // that it's not recreated too soon and that the drop timestamp has passed.
+        if (old && old->dropped_columns().contains(name)) {
+            const auto& drop_info = old->dropped_columns().at(name);
+            auto create_time = api::timestamp_clock::time_point(api::timestamp_clock::duration(timestamp));
+            auto drop_time = api::timestamp_clock::time_point(api::timestamp_clock::duration(drop_info.timestamp));
+            if (drop_time > create_time) {
+                throw exceptions::invalid_request_exception(format("Cannot add column {} because a column with the same name was dropped too recently. Please retry after {} seconds",
+                        name, std::chrono::duration_cast<std::chrono::seconds>(drop_time - create_time).count() + 1));
+            }
+        }
+    };
+
+    auto add_column = [&] (sstring name, data_type type) {
+        validate_new_column(name);
+        b.with_column(to_bytes(name), type);
+    };
+
    auto add_columns = [&] (const schema::const_iterator_range_type& columns, bool is_data_col = false) {
        for (const auto& column : columns) {
            auto type = column.type;
@@ -637,9 +664,9 @@ static schema_ptr create_log_schema(const schema& s, const replica::database& db
                    }
                ));
            }
-            b.with_column(log_data_column_name_bytes(column.name()), type);
+            add_column(log_data_column_name(column.name_as_text()), type);
            if (is_data_col) {
-                b.with_column(log_data_column_deleted_name_bytes(column.name()), boolean_type);
+                add_column(log_data_column_deleted_name(column.name_as_text()), boolean_type);
            }
            if (column.type->is_multi_cell()) {
                auto dtype = visit(*type, make_visitor(
@@ -655,7 +682,7 @@ static schema_ptr create_log_schema(const schema& s, const replica::database& db
                        throw std::invalid_argument("Should not reach");
                    }
                ));
-                b.with_column(log_data_column_deleted_elements_name_bytes(column.name()), dtype);
+                add_column(log_data_column_deleted_elements_name(column.name_as_text()), dtype);
            }
        }
    };
@@ -669,7 +696,7 @@ static schema_ptr create_log_schema(const schema& s, const replica::database& db
    }

    auto rs = generate_replication_strategy(ksm);
-    auto tombstone_gc_ext = seastar::make_shared<tombstone_gc_extension>(get_default_tombstone_gc_mode(*rs, db.get_token_metadata()));
+    auto tombstone_gc_ext = seastar::make_shared<tombstone_gc_extension>(get_default_tombstone_gc_mode(*rs, db.get_token_metadata(), false));
    b.add_extension(tombstone_gc_extension::NAME, std::move(tombstone_gc_ext));

    /**
@@ -681,7 +708,8 @@ static schema_ptr create_log_schema(const schema& s, const replica::database& db
        // not super efficient, but we don't do this often.
        for (auto& col : old->all_columns()) {
            if (!b.has_column({col.name(), col.name_as_text() })) {
-                b.without_column(col.name_as_text(), col.type, api::new_timestamp());
+                auto drop_ts = api::timestamp_clock::now() + column_drop_leeway;
+                b.without_column(col.name_as_text(), col.type, drop_ts.time_since_epoch().count());
            }
        }
    }
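Reviewer note on the column-drop leeway above: the drop timestamp is placed column_drop_leeway seconds in the future, and validate_new_column rejects re-adding a column of the same name until that timestamp has passed. The following is a small sketch of the retry-delay arithmetic in the error message, under stated assumptions: retry_after_seconds is a hypothetical helper, and timestamps are modelled as std::chrono::microseconds on the assumption that api::timestamp_clock ticks in microseconds.

```cpp
#include <chrono>
#include <optional>

// Hypothetical helper mirroring the check in validate_new_column.
// Returns std::nullopt when the column may be re-added, otherwise the number
// of whole seconds to wait, rounded up past the drop timestamp by adding 1.
std::optional<long long> retry_after_seconds(std::chrono::microseconds drop_ts,
                                             std::chrono::microseconds create_ts) {
    if (drop_ts <= create_ts) {
        return std::nullopt; // the drop timestamp has already passed
    }
    auto remaining = std::chrono::duration_cast<std::chrono::seconds>(drop_ts - create_ts);
    return remaining.count() + 1; // duration_cast truncates, so add one second
}
```

Because duration_cast truncates toward zero, the extra second means the advertised wait never undershoots the drop timestamp. A column dropped "now" therefore reports a wait of 6 seconds with the 5 second leeway.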
--- a/cdc/metadata.cc
+++ b/cdc/metadata.cc
@@ -54,7 +54,7 @@ cdc::stream_id get_stream(
 }

 static cdc::stream_id get_stream(
-        const std::vector<cdc::stream_id>& streams,
+        const utils::chunked_vector<cdc::stream_id>& streams,
        dht::token tok) {
    if (streams.empty()) {
        on_internal_error(cdc_log, "get_stream: streams empty");
@@ -159,7 +159,7 @@ cdc::stream_id cdc::metadata::get_vnode_stream(api::timestamp_type ts, dht::toke
    return ret;
 }

-const std::vector<cdc::stream_id>& cdc::metadata::get_tablet_stream_set(table_id tid, api::timestamp_type ts) const {
+const utils::chunked_vector<cdc::stream_id>& cdc::metadata::get_tablet_stream_set(table_id tid, api::timestamp_type ts) const {
    auto now = api::new_timestamp();
    if (ts > now + get_generation_leeway().count()) {
        throw exceptions::invalid_request_exception(seastar::format(
@@ -259,10 +259,10 @@ bool cdc::metadata::prepare(db_clock::time_point tp) {
    return !it->second;
 }

-future<std::vector<cdc::stream_id>> cdc::metadata::construct_next_stream_set(
-        const std::vector<cdc::stream_id>& prev_stream_set,
-        std::vector<cdc::stream_id> opened,
-        const std::vector<cdc::stream_id>& closed) {
+future<utils::chunked_vector<cdc::stream_id>> cdc::metadata::construct_next_stream_set(
+        const utils::chunked_vector<cdc::stream_id>& prev_stream_set,
+        utils::chunked_vector<cdc::stream_id> opened,
+        const utils::chunked_vector<cdc::stream_id>& closed) {

    if (closed.size() == prev_stream_set.size()) {
        // all previous streams are closed, so the next stream set is just the opened streams.
@@ -273,8 +273,8 @@ future<std::vector<cdc::stream_id>> cdc::metadata::construct_next_stream_set(
    // streams and removing the closed streams. we assume each stream set is
    // sorted by token, and the result is sorted as well.

-    std::vector<cdc::stream_id> next_stream_set;
-    next_stream_set.reserve(prev_stream_set.size() + opened.size() - closed.size());
+    utils::chunked_vector<cdc::stream_id> next_stream_set;
+    co_await utils::reserve_gently(next_stream_set, prev_stream_set.size() + opened.size() - closed.size());

    auto next_prev = prev_stream_set.begin();
    auto next_closed = closed.begin();
@@ -306,6 +306,10 @@ void cdc::metadata::load_tablet_streams_map(table_id tid, table_streams new_tabl
    _tablet_streams[tid] = make_lw_shared(std::move(new_table_map));
 }

+void cdc::metadata::append_tablet_streams_map(table_id tid, table_streams new_table_map) {
+    _tablet_streams[tid]->insert(std::make_move_iterator(new_table_map.begin()), std::make_move_iterator(new_table_map.end()));
+}
+
 void cdc::metadata::remove_tablet_streams_map(table_id tid) {
    _tablet_streams.erase(tid);
 }
@@ -314,8 +318,8 @@ std::vector<table_id> cdc::metadata::get_tables_with_cdc_tablet_streams() const
    return _tablet_streams | std::views::keys | std::ranges::to<std::vector<table_id>>();
 }

-future<cdc::cdc_stream_diff> cdc::metadata::generate_stream_diff(const std::vector<stream_id>& before, const std::vector<stream_id>& after) {
-    std::vector<stream_id> closed, opened;
+future<cdc::cdc_stream_diff> cdc::metadata::generate_stream_diff(const utils::chunked_vector<stream_id>& before, const utils::chunked_vector<stream_id>& after) {
+    utils::chunked_vector<stream_id> closed, opened;

    auto before_it = before.begin();
    auto after_it = after.begin();
--- a/cdc/metadata.hh
+++ b/cdc/metadata.hh
@@ -37,7 +37,9 @@ class metadata final {
    using container_t = std::map<api::timestamp_type, std::optional<topology_description>>;
    container_t _gens;

-    using table_streams_ptr = lw_shared_ptr<const table_streams>;
+    // per-table streams map for tables in tablets-based keyspaces.
+    // the streams map is shared with the virtual tables reader, hence we can only insert new entries into it, not erase them.
+    using table_streams_ptr = lw_shared_ptr<table_streams>;
    using tablet_streams_map = std::unordered_map<table_id, table_streams_ptr>;

    tablet_streams_map _tablet_streams;
@@ -47,7 +49,7 @@ class metadata final {

    container_t::const_iterator gen_used_at(api::timestamp_type ts) const;

-    const std::vector<stream_id>& get_tablet_stream_set(table_id tid, api::timestamp_type ts) const;
+    const utils::chunked_vector<stream_id>& get_tablet_stream_set(table_id tid, api::timestamp_type ts) const;

 public:
    /* Is a generation with the given timestamp already known or obsolete? It is obsolete if and only if
@@ -100,6 +102,7 @@ public:
    bool prepare(db_clock::time_point ts);

    void load_tablet_streams_map(table_id tid, table_streams new_table_map);
+    void append_tablet_streams_map(table_id tid, table_streams new_table_map);
    void remove_tablet_streams_map(table_id tid);

    const tablet_streams_map& get_all_tablet_streams() const {
@@ -108,14 +111,14 @@ public:

    std::vector<table_id> get_tables_with_cdc_tablet_streams() const;

-    static future<std::vector<stream_id>> construct_next_stream_set(
-        const std::vector<cdc::stream_id>& prev_stream_set,
-        std::vector<cdc::stream_id> opened,
-        const std::vector<cdc::stream_id>& closed);
+    static future<utils::chunked_vector<stream_id>> construct_next_stream_set(
+        const utils::chunked_vector<cdc::stream_id>& prev_stream_set,
+        utils::chunked_vector<cdc::stream_id> opened,
+        const utils::chunked_vector<cdc::stream_id>& closed);

    static future<cdc_stream_diff> generate_stream_diff(
-        const std::vector<stream_id>& before,
-        const std::vector<stream_id>& after);
+        const utils::chunked_vector<stream_id>& before,
+        const utils::chunked_vector<stream_id>& after);

 };

--- a/compaction/compaction_manager.cc
+++ b/compaction/compaction_manager.cc
@@ -1506,13 +1506,15 @@ future<> compaction_manager::maybe_wait_for_sstable_count_reduction(compaction_g
        co_return;
    }
    auto num_runs_for_compaction = [&, this] -> future<size_t> {
-        auto& cs = t.get_compaction_strategy();
+        auto cs = t.get_compaction_strategy();
        auto desc = co_await cs.get_sstables_for_compaction(t, get_strategy_control());
        co_return std::ranges::size(desc.sstables
            | std::views::transform(std::mem_fn(&sstables::sstable::run_identifier))
            | std::ranges::to<std::unordered_set>());
    };
-    const auto threshold = size_t(std::max(schema->max_compaction_threshold(), 32));
+    const auto threshold = utils::get_local_injector().inject_parameter<size_t>("set_sstable_count_reduction_threshold")
+        .value_or(size_t(std::max(schema->max_compaction_threshold(), 32)));
+
    auto count = co_await num_runs_for_compaction();
    if (count <= threshold) {
        cmlog.trace("No need to wait for sstable count reduction in {}: {} <= {}",
@@ -1527,9 +1529,7 @@ future<> compaction_manager::maybe_wait_for_sstable_count_reduction(compaction_g
    auto& cstate = get_compaction_state(&t);
    try {
        while (can_perform_regular_compaction(t) && co_await num_runs_for_compaction() > threshold) {
-            co_await cstate.compaction_done.wait([this, &t] {
-                return !can_perform_regular_compaction(t);
-            });
+            co_await cstate.compaction_done.wait();
        }
    } catch (const broken_condition_variable&) {
        co_return;
--- a/compaction/compaction_strategy.cc
+++ b/compaction/compaction_strategy.cc
@@ -804,9 +804,9 @@ compaction_strategy_state compaction_strategy_state::make(const compaction_strat
        case compaction_strategy_type::incremental:
            return compaction_strategy_state(default_empty_state{});
        case compaction_strategy_type::leveled:
-            return compaction_strategy_state(leveled_compaction_strategy_state{});
+            return compaction_strategy_state(seastar::make_shared<leveled_compaction_strategy_state>());
        case compaction_strategy_type::time_window:
-            return compaction_strategy_state(time_window_compaction_strategy_state{});
+            return compaction_strategy_state(seastar::make_shared<time_window_compaction_strategy_state>());
        default:
            throw std::runtime_error("strategy not supported");
    }
--- a/compaction/compaction_strategy_state.hh
+++ b/compaction/compaction_strategy_state.hh
@@ -18,7 +18,7 @@ namespace compaction {
 class compaction_strategy_state {
 public:
    struct default_empty_state {};
-    using states_variant = std::variant<default_empty_state, leveled_compaction_strategy_state, time_window_compaction_strategy_state>;
+    using states_variant = std::variant<default_empty_state, leveled_compaction_strategy_state_ptr, time_window_compaction_strategy_state_ptr>;
 private:
    states_variant _state;
 public:
--- a/compaction/leveled_compaction_strategy.cc
+++ b/compaction/leveled_compaction_strategy.cc
@@ -14,12 +14,12 @@

 namespace compaction {

-leveled_compaction_strategy_state& leveled_compaction_strategy::get_state(compaction_group_view& table_s) const {
-    return table_s.get_compaction_strategy_state().get<leveled_compaction_strategy_state>();
+leveled_compaction_strategy_state_ptr leveled_compaction_strategy::get_state(compaction_group_view& table_s) const {
+    return table_s.get_compaction_strategy_state().get<leveled_compaction_strategy_state_ptr>();
 }

 future<compaction_descriptor> leveled_compaction_strategy::get_sstables_for_compaction(compaction_group_view& table_s, strategy_control& control) {
-    auto& state = get_state(table_s);
+    auto state = get_state(table_s);
    auto candidates = co_await control.candidates(table_s);
    // NOTE: leveled_manifest creation may be slightly expensive, so later on,
    // we may want to store it in the strategy itself. However, the sstable
@@ -27,10 +27,10 @@ future<compaction_descriptor> leveled_compaction_strategy::get_sstables_for_comp
    // sstable in it may be marked for deletion after compacted.
    // Currently, we create a new manifest whenever it's time for compaction.
    leveled_manifest manifest = leveled_manifest::create(table_s, candidates, _max_sstable_size_in_mb, _stcs_options);
-    if (!state.last_compacted_keys) {
-        generate_last_compacted_keys(state, manifest);
+    if (!state->last_compacted_keys) {
+        generate_last_compacted_keys(*state, manifest);
    }
-    auto candidate = manifest.get_compaction_candidates(*state.last_compacted_keys, state.compaction_counter);
+    auto candidate = manifest.get_compaction_candidates(*state->last_compacted_keys, state->compaction_counter);

    if (!candidate.sstables.empty()) {
        auto main_set = co_await table_s.main_sstable_set();
@@ -78,12 +78,12 @@ compaction_descriptor leveled_compaction_strategy::get_major_compaction_job(comp
 }

 void leveled_compaction_strategy::notify_completion(compaction_group_view& table_s, const std::vector<sstables::shared_sstable>& removed, const std::vector<sstables::shared_sstable>& added) {
-    auto& state = get_state(table_s);
+    auto state = get_state(table_s);
    // All the update here is only relevant for regular compaction's round-robin picking policy, and if
    // last_compacted_keys wasn't generated by regular, it means regular is disabled since last restart,
    // therefore we can skip the updates here until regular runs for the first time. Once it runs,
    // it will be able to generate last_compacted_keys correctly by looking at metadata of files.
-    if (removed.empty() || added.empty() || !state.last_compacted_keys) {
+    if (removed.empty() || added.empty() || !state->last_compacted_keys) {
        return;
    }
    auto min_level = std::numeric_limits<uint32_t>::max();
@@ -99,16 +99,16 @@ void leveled_compaction_strategy::notify_completion(compaction_group_view& table
        }
        target_level = std::max(target_level, int(candidate->get_sstable_level()));
    }
-    state.last_compacted_keys.value().at(min_level) = last->get_last_decorated_key();
+    state->last_compacted_keys.value().at(min_level) = last->get_last_decorated_key();

    for (int i = leveled_manifest::MAX_LEVELS - 1; i > 0; i--) {
-        state.compaction_counter[i]++;
+        state->compaction_counter[i]++;
    }
-    state.compaction_counter[target_level] = 0;
+    state->compaction_counter[target_level] = 0;

    if (leveled_manifest::logger.level() == logging::log_level::debug) {
-        for (auto j = 0U; j < state.compaction_counter.size(); j++) {
-            leveled_manifest::logger.debug("CompactionCounter: {}: {}", j, state.compaction_counter[j]);
+        for (auto j = 0U; j < state->compaction_counter.size(); j++) {
+            leveled_manifest::logger.debug("CompactionCounter: {}: {}", j, state->compaction_counter[j]);
        }
    }
 }
--- a/compaction/leveled_compaction_strategy.hh
+++ b/compaction/leveled_compaction_strategy.hh
@@ -36,6 +36,8 @@ struct leveled_compaction_strategy_state {
    leveled_compaction_strategy_state();
 };

+using leveled_compaction_strategy_state_ptr = seastar::shared_ptr<leveled_compaction_strategy_state>;
+
 class leveled_compaction_strategy : public compaction_strategy_impl {
    static constexpr int32_t DEFAULT_MAX_SSTABLE_SIZE_IN_MB = 160;
    static constexpr auto SSTABLE_SIZE_OPTION = "sstable_size_in_mb";
@@ -45,7 +47,7 @@ class leveled_compaction_strategy : public compaction_strategy_impl {
 private:
    int32_t calculate_max_sstable_size_in_mb(std::optional<sstring> option_value) const;

-    leveled_compaction_strategy_state& get_state(compaction_group_view& table_s) const;
+    leveled_compaction_strategy_state_ptr get_state(compaction_group_view& table_s) const;
 public:
    static unsigned ideal_level_for_input(const std::vector<sstables::shared_sstable>& input, uint64_t max_sstable_size);
    static void validate_options(const std::map<sstring, sstring>& options, std::map<sstring, sstring>& unchecked_options);
--- a/compaction/time_window_compaction_strategy.cc
+++ b/compaction/time_window_compaction_strategy.cc
@@ -13,6 +13,7 @@
 #include "sstables/sstables.hh"
 #include "sstables/sstable_set_impl.hh"
 #include "compaction_strategy_state.hh"
+#include "utils/error_injection.hh"

 #include <ranges>

@@ -22,8 +23,8 @@ extern logging::logger clogger;

 using timestamp_type = api::timestamp_type;

-time_window_compaction_strategy_state& time_window_compaction_strategy::get_state(compaction_group_view& table_s) const {
-    return table_s.get_compaction_strategy_state().get<time_window_compaction_strategy_state>();
+time_window_compaction_strategy_state_ptr time_window_compaction_strategy::get_state(compaction_group_view& table_s) const {
+    return table_s.get_compaction_strategy_state().get<time_window_compaction_strategy_state_ptr>();
 }

 const std::unordered_map<sstring, std::chrono::seconds> time_window_compaction_strategy_options::valid_window_units = {
@@ -335,7 +336,7 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<sstables::shared_

 future<compaction_descriptor>
 time_window_compaction_strategy::get_sstables_for_compaction(compaction_group_view& table_s, strategy_control& control) {
-    auto& state = get_state(table_s);
+    auto state = get_state(table_s);
    auto compaction_time = gc_clock::now();
    auto candidates = co_await control.candidates(table_s);

@@ -344,7 +345,7 @@ time_window_compaction_strategy::get_sstables_for_compaction(compaction_group_vi
    }

    auto now = db_clock::now();
-    if (now - state.last_expired_check > _options.expired_sstable_check_frequency) {
+    if (now - state->last_expired_check > _options.expired_sstable_check_frequency) {
        clogger.debug("[{}] TWCS expired check sufficiently far in the past, checking for fully expired SSTables", fmt::ptr(this));

        // Find fully expired SSTables. Those will be included no matter what.
@@ -356,12 +357,14 @@ time_window_compaction_strategy::get_sstables_for_compaction(compaction_group_vi
        // Keep checking for fully_expired_sstables until we don't find
        // any among the candidates, meaning they are either already compacted
        // or registered for compaction.
-        state.last_expired_check = now;
+        state->last_expired_check = now;
    } else {
        clogger.debug("[{}] TWCS skipping check for fully expired SSTables", fmt::ptr(this));
    }

-    auto compaction_candidates = get_next_non_expired_sstables(table_s, control, std::move(candidates), compaction_time);
+    co_await utils::get_local_injector().inject("twcs_get_sstables_for_compaction", utils::wait_for_message(30s));
+
+    auto compaction_candidates = get_next_non_expired_sstables(table_s, control, std::move(candidates), compaction_time, *state);
    clogger.debug("[{}] Going to compact {} non-expired sstables", fmt::ptr(this), compaction_candidates.size());
    co_return compaction_descriptor(std::move(compaction_candidates));
 }
@@ -384,8 +387,8 @@ time_window_compaction_strategy::compaction_mode(const time_window_compaction_st

 std::vector<sstables::shared_sstable>
 time_window_compaction_strategy::get_next_non_expired_sstables(compaction_group_view& table_s, strategy_control& control,
-        std::vector<sstables::shared_sstable> non_expiring_sstables, gc_clock::time_point compaction_time) {
-    auto most_interesting = get_compaction_candidates(table_s, control, non_expiring_sstables);
+        std::vector<sstables::shared_sstable> non_expiring_sstables, gc_clock::time_point compaction_time, time_window_compaction_strategy_state& state) {
+    auto most_interesting = get_compaction_candidates(table_s, control, non_expiring_sstables, state);

    if (!most_interesting.empty()) {
        return most_interesting;
@@ -410,14 +413,14 @@ time_window_compaction_strategy::get_next_non_expired_sstables(compaction_group_
 }

 std::vector<sstables::shared_sstable>
-time_window_compaction_strategy::get_compaction_candidates(compaction_group_view& table_s, strategy_control& control, std::vector<sstables::shared_sstable> candidate_sstables) {
-    auto& state = get_state(table_s);
+time_window_compaction_strategy::get_compaction_candidates(compaction_group_view& table_s, strategy_control& control,
+    std::vector<sstables::shared_sstable> candidate_sstables, time_window_compaction_strategy_state& state) {
    auto [buckets, max_timestamp] = get_buckets(std::move(candidate_sstables), _options);
    // Update the highest window seen, if necessary
    state.highest_window_seen = std::max(state.highest_window_seen, max_timestamp);

    return newest_bucket(table_s, control, std::move(buckets), table_s.min_compaction_threshold(), table_s.schema()->max_compaction_threshold(),
-        state.highest_window_seen);
+        state.highest_window_seen, state);
 }

 timestamp_type
@@ -465,8 +468,7 @@ namespace compaction {

 std::vector<sstables::shared_sstable>
 time_window_compaction_strategy::newest_bucket(compaction_group_view& table_s, strategy_control& control, std::map<timestamp_type, std::vector<sstables::shared_sstable>> buckets,
-        int min_threshold, int max_threshold, timestamp_type now) {
-    auto& state = get_state(table_s);
+        int min_threshold, int max_threshold, timestamp_type now, time_window_compaction_strategy_state& state) {
    clogger.debug("time_window_compaction_strategy::newest_bucket:\n  now {}\n{}", now, buckets);

    for (auto&& [key, bucket] : buckets | std::views::reverse) {
@@ -517,7 +519,7 @@ time_window_compaction_strategy::trim_to_threshold(std::vector<sstables::shared_
 }

 future<int64_t> time_window_compaction_strategy::estimated_pending_compactions(compaction_group_view& table_s) const {
-    auto& state = get_state(table_s);
+    auto state = get_state(table_s);
    auto min_threshold = table_s.min_compaction_threshold();
    auto max_threshold = table_s.schema()->max_compaction_threshold();
    auto main_set = co_await table_s.main_sstable_set();
@@ -526,7 +528,7 @@ future<int64_t> time_window_compaction_strategy::estimated_pending_compactions(c

    int64_t n = 0;
    for (auto& [bucket_key, bucket] : buckets) {
-        switch (compaction_mode(state, bucket, bucket_key, max_timestamp, min_threshold)) {
+        switch (compaction_mode(*state, bucket, bucket_key, max_timestamp, min_threshold)) {
        case bucket_compaction_mode::size_tiered:
            n += size_tiered_compaction_strategy::estimated_pending_compactions(bucket, min_threshold, max_threshold, _stcs_options);
            break;
--- a/compaction/time_window_compaction_strategy.hh
+++ b/compaction/time_window_compaction_strategy.hh
@@ -67,6 +67,8 @@ struct time_window_compaction_strategy_state {
    std::unordered_set<api::timestamp_type> recent_active_windows;
 };

+using time_window_compaction_strategy_state_ptr = seastar::shared_ptr<time_window_compaction_strategy_state>;
+
 class time_window_compaction_strategy : public compaction_strategy_impl {
    time_window_compaction_strategy_options _options;
    size_tiered_compaction_strategy_options _stcs_options;
@@ -87,7 +89,7 @@ public:

    static void validate_options(const std::map<sstring, sstring>& options, std::map<sstring, sstring>& unchecked_options);
 private:
-    time_window_compaction_strategy_state& get_state(compaction_group_view& table_s) const;
+    time_window_compaction_strategy_state_ptr get_state(compaction_group_view& table_s) const;

    static api::timestamp_type
    to_timestamp_type(time_window_compaction_strategy_options::timestamp_resolutions resolution, int64_t timestamp_from_sstable) {
@@ -110,9 +112,11 @@ private:
    compaction_mode(const time_window_compaction_strategy_state&, const bucket_t& bucket, api::timestamp_type bucket_key, api::timestamp_type now, size_t min_threshold) const;

    std::vector<sstables::shared_sstable>
-    get_next_non_expired_sstables(compaction_group_view& table_s, strategy_control& control, std::vector<sstables::shared_sstable> non_expiring_sstables, gc_clock::time_point compaction_time);
+    get_next_non_expired_sstables(compaction_group_view& table_s, strategy_control& control, std::vector<sstables::shared_sstable> non_expiring_sstables,
+        gc_clock::time_point compaction_time, time_window_compaction_strategy_state& state);

-    std::vector<sstables::shared_sstable> get_compaction_candidates(compaction_group_view& table_s, strategy_control& control, std::vector<sstables::shared_sstable> candidate_sstables);
+    std::vector<sstables::shared_sstable> get_compaction_candidates(compaction_group_view& table_s, strategy_control& control,
+        std::vector<sstables::shared_sstable> candidate_sstables, time_window_compaction_strategy_state& state);
 public:
    // Find the lowest timestamp for window of given size
    static api::timestamp_type
@@ -126,7 +130,7 @@ public:

    std::vector<sstables::shared_sstable>
    newest_bucket(compaction_group_view& table_s, strategy_control& control, std::map<api::timestamp_type, std::vector<sstables::shared_sstable>> buckets,
-        int min_threshold, int max_threshold, api::timestamp_type now);
+        int min_threshold, int max_threshold, api::timestamp_type now, time_window_compaction_strategy_state& state);

    static std::vector<sstables::shared_sstable>
    trim_to_threshold(std::vector<sstables::shared_sstable> bucket, int max_threshold);
--- a/conf/scylla.yaml
+++ b/conf/scylla.yaml
@@ -855,7 +855,7 @@ maintenance_socket: ignore
 # enable_create_table_with_compact_storage: false

 # Control tablets for new keyspaces.
-# Can be set to: disabled|enabled
+# Can be set to: disabled|enabled|enforced
 #
 # When enabled, newly created keyspaces will have tablets enabled by default.
 # That can be explicitly disabled in the CREATE KEYSPACE query
@@ -888,9 +888,18 @@ rf_rack_valid_keyspaces: false
 #
 # Vector Store options
 #
-# A comma-separated list of URIs for the vector store using DNS name. Only HTTP schema is supported. Port number is mandatory.
-# Default is empty, which means that the vector store is not used.
+# HTTP and HTTPS schemes are supported. Port number is mandatory.
+# If both `vector_store_primary_uri` and `vector_store_secondary_uri` are unset or empty, vector search is disabled.
+#
+# A comma-separated list of primary vector store node URIs. These nodes are preferred for vector search operations.
 # vector_store_primary_uri: http://vector-store.dns.name:{port}
+#
+# A comma-separated list of secondary vector store node URIs. These nodes are used as a fallback when all primary nodes are unavailable, and are typically located in a different availability zone for high availability.
+# vector_store_secondary_uri: http://vector-store.dns.name:{port}
+#
+# Options for encrypted connections to the vector store. These options are used for HTTPS URIs in vector_store_primary_uri and vector_store_secondary_uri.
+# vector_store_encryption_options:
+#    truststore: <not set, use system trust>

 # 
 # io-streaming rate limiting
--- a/configure.py
+++ b/configure.py
@@ -640,7 +640,8 @@ raft_tests = set([

 vector_search_tests = set([
    'test/vector_search/vector_store_client_test',
-    'test/vector_search/load_balancer_test'
+    'test/vector_search/load_balancer_test',
+    'test/vector_search/client_test'
 ])

 wasms = set([
@@ -1078,7 +1079,6 @@ scylla_core = (['message/messaging_service.cc',
                'utils/s3/client.cc',
                'utils/s3/retryable_http_client.cc',
                'utils/s3/retry_strategy.cc',
-                'utils/s3/s3_retry_strategy.cc',
                'utils/s3/credentials_providers/aws_credentials_provider.cc',
                'utils/s3/credentials_providers/environment_aws_credentials_provider.cc',
                'utils/s3/credentials_providers/instance_profile_credentials_provider.cc',
@@ -1263,6 +1263,9 @@ scylla_core = (['message/messaging_service.cc',
                'utils/disk_space_monitor.cc',
                'vector_search/vector_store_client.cc',
                'vector_search/dns.cc',
+                'vector_search/client.cc',
+                'vector_search/clients.cc',
+                'vector_search/truststore.cc'
                ] + [Antlr3Grammar('cql3/Cql.g')] \
                  + scylla_raft_core
               )
@@ -1570,6 +1573,7 @@ deps['test/boost/combined_tests'] += [
    'test/boost/query_processor_test.cc',
    'test/boost/reader_concurrency_semaphore_test.cc',
    'test/boost/repair_test.cc',
+    'test/boost/replicator_test.cc',
    'test/boost/restrictions_test.cc',
    'test/boost/role_manager_test.cc',
    'test/boost/row_cache_test.cc',
@@ -1657,6 +1661,7 @@ deps['test/raft/discovery_test'] =  ['test/raft/discovery_test.cc',

 deps['test/vector_search/vector_store_client_test'] =  ['test/vector_search/vector_store_client_test.cc'] + scylla_tests_dependencies
 deps['test/vector_search/load_balancer_test'] = ['test/vector_search/load_balancer_test.cc'] + scylla_tests_dependencies
+deps['test/vector_search/client_test'] = ['test/vector_search/client_test.cc'] + scylla_tests_dependencies

 wasm_deps = {}

--- a/cql3/Cql.g
+++ b/cql3/Cql.g
@@ -1224,7 +1224,7 @@ listPermissionsStatement returns [std::unique_ptr<list_permissions_statement> st
    ;

 permission returns [auth::permission perm = auth::permission{}]
-    : p=(K_CREATE | K_ALTER | K_DROP | K_SELECT | K_MODIFY | K_AUTHORIZE | K_DESCRIBE | K_EXECUTE)
+    : p=(K_CREATE | K_ALTER | K_DROP | K_SELECT | K_MODIFY | K_AUTHORIZE | K_DESCRIBE | K_EXECUTE | K_VECTOR_SEARCH_INDEXING)
    { $perm = auth::permissions::from_string($p.text); }
    ;

@@ -2398,6 +2398,8 @@ K_EXECUTE:     E X E C U T E;

 K_MUTATION_FRAGMENTS:    M U T A T I O N '_' F R A G M E N T S;

+K_VECTOR_SEARCH_INDEXING: V E C T O R '_' S E A R C H '_' I N D E X I N G;
+
 // Case-insensitive alpha characters
 fragment A: ('a'|'A');
 fragment B: ('b'|'B');
--- a/cql3/expr/expression.cc
+++ b/cql3/expr/expression.cc
@@ -1349,7 +1349,7 @@ static managed_bytes reserialize_value(View value_bytes,
    if (type.is_map()) {
        std::vector<std::pair<managed_bytes, managed_bytes>> elements = partially_deserialize_map(value_bytes);

-        const map_type_impl mapt = dynamic_cast<const map_type_impl&>(type);
+        const map_type_impl& mapt = dynamic_cast<const map_type_impl&>(type);
        const abstract_type& key_type = mapt.get_keys_type()->without_reversed();
        const abstract_type& value_type = mapt.get_values_type()->without_reversed();

@@ -1391,7 +1391,7 @@ static managed_bytes reserialize_value(View value_bytes,
        const vector_type_impl& vtype = dynamic_cast<const vector_type_impl&>(type);
        std::vector<managed_bytes> elements = vtype.split_fragmented(value_bytes);

-        auto elements_type = vtype.get_elements_type()->without_reversed();
+        const auto& elements_type = vtype.get_elements_type()->without_reversed();

        if (elements_type.bound_value_needs_to_be_reserialized()) {
            for (size_t i = 0; i < elements.size(); i++) {
--- a/cql3/restrictions/statement_restrictions.cc
+++ b/cql3/restrictions/statement_restrictions.cc
@@ -1322,6 +1322,10 @@ const std::vector<expr::expression>& statement_restrictions::index_restrictions(
    return _index_restrictions;
 }

+bool statement_restrictions::is_empty() const {
+    return !_where.has_value();
+}
+
 // Current score table:
 // local and restrictions include full partition key: 2
 // global: 1
--- a/cql3/restrictions/statement_restrictions.hh
+++ b/cql3/restrictions/statement_restrictions.hh
@@ -408,6 +408,8 @@ public:

    /// Checks that the primary key restrictions don't contain null values, throws invalid_request_exception otherwise.
    void validate_primary_key(const query_options& options) const;
+
+    bool is_empty() const;
 };

 statement_restrictions analyze_statement_restrictions(
--- a/cql3/statements/alter_table_statement.cc
+++ b/cql3/statements/alter_table_statement.cc
@@ -422,7 +422,14 @@ std::pair<schema_ptr, std::vector<view_ptr>> alter_table_statement::prepare_sche
                throw exceptions::invalid_request_exception(format("The synchronous_updates option is only applicable to materialized views, not to base tables"));
            }

-            _properties->apply_to_builder(cfm, std::move(schema_extensions), db, keyspace());
+            if (is_cdc_log_table) {
+                auto gc_opts = _properties->get_tombstone_gc_options(schema_extensions);
+                if (gc_opts && gc_opts->mode() == tombstone_gc_mode::repair) {
+                    throw exceptions::invalid_request_exception("The 'repair' mode for tombstone_gc is not allowed on CDC log tables.");
+                }
+            }
+
+            _properties->apply_to_builder(cfm, std::move(schema_extensions), db, keyspace(), !is_cdc_log_table);
        }
        break;

--- a/cql3/statements/alter_view_statement.cc
+++ b/cql3/statements/alter_view_statement.cc
@@ -55,8 +55,29 @@ view_ptr alter_view_statement::prepare_view(data_dictionary::database db) const
    auto schema_extensions = _properties->make_schema_extensions(db.extensions());
    _properties->validate(db, keyspace(), schema_extensions);

+    bool is_colocated = [&] {
+        if (!db.find_keyspace(keyspace()).get_replication_strategy().uses_tablets()) {
+            return false;
+        }
+        auto base_schema = db.find_schema(schema->view_info()->base_id());
+        if (!base_schema) {
+            return false;
+        }
+        return std::ranges::equal(
+            schema->partition_key_columns(),
+            base_schema->partition_key_columns(),
+            [](const column_definition& a, const column_definition& b) { return a.name() == b.name(); });
+    }();
+
+    if (is_colocated) {
+        auto gc_opts = _properties->get_tombstone_gc_options(schema_extensions);
+        if (gc_opts && gc_opts->mode() == tombstone_gc_mode::repair) {
+            throw exceptions::invalid_request_exception("The 'repair' mode for tombstone_gc is not allowed on co-located materialized view tables.");
+        }
+    }
+
    auto builder = schema_builder(schema);
-    _properties->apply_to_builder(builder, std::move(schema_extensions), db, keyspace());
+    _properties->apply_to_builder(builder, std::move(schema_extensions), db, keyspace(), !is_colocated);

    if (builder.get_gc_grace_seconds() == 0) {
        throw exceptions::invalid_request_exception(
--- a/cql3/statements/cf_prop_defs.cc
+++ b/cql3/statements/cf_prop_defs.cc
@@ -136,9 +136,7 @@ void cf_prop_defs::validate(const data_dictionary::database db, sstring ks_name,
            throw exceptions::configuration_exception(sstring("Missing sub-option '") + compression_parameters::SSTABLE_COMPRESSION + "' for the '" + KW_COMPRESSION + "' option.");
        }
        compression_parameters cp(*compression_options);
-        cp.validate(
-            compression_parameters::dicts_feature_enabled(bool(db.features().sstable_compression_dicts)),
-            compression_parameters::dicts_usage_allowed(db.get_config().sstable_compression_dictionaries_allow_in_ddl()));
+        cp.validate(compression_parameters::dicts_feature_enabled(bool(db.features().sstable_compression_dicts)));
    }

    auto per_partition_rate_limit_options = get_per_partition_rate_limit_options(schema_extensions);
@@ -286,7 +284,7 @@ std::optional<db::tablet_options::map_type> cf_prop_defs::get_tablet_options() c
    return std::nullopt;
 }

-void cf_prop_defs::apply_to_builder(schema_builder& builder, schema::extensions_map schema_extensions, const data_dictionary::database& db, sstring ks_name) const {
+void cf_prop_defs::apply_to_builder(schema_builder& builder, schema::extensions_map schema_extensions, const data_dictionary::database& db, sstring ks_name, bool supports_repair) const {
    if (has_property(KW_COMMENT)) {
        builder.set_comment(get_string(KW_COMMENT, ""));
    }
@@ -372,7 +370,7 @@ void cf_prop_defs::apply_to_builder(schema_builder& builder, schema::extensions_
    }
    // Set default tombstone_gc mode.
    if (!schema_extensions.contains(tombstone_gc_extension::NAME)) {
-        auto ext = seastar::make_shared<tombstone_gc_extension>(get_default_tombstone_gc_mode(db, ks_name));
+        auto ext = seastar::make_shared<tombstone_gc_extension>(get_default_tombstone_gc_mode(db, ks_name, supports_repair));
        schema_extensions.emplace(tombstone_gc_extension::NAME, std::move(ext));
    }
    builder.set_extensions(std::move(schema_extensions));
--- a/cql3/statements/cf_prop_defs.hh
+++ b/cql3/statements/cf_prop_defs.hh
@@ -110,7 +110,7 @@ public:
    bool get_synchronous_updates_flag() const;
    std::optional<db::tablet_options::map_type> get_tablet_options() const;

-    void apply_to_builder(schema_builder& builder, schema::extensions_map schema_extensions, const data_dictionary::database& db, sstring ks_name) const;
+    void apply_to_builder(schema_builder& builder, schema::extensions_map schema_extensions, const data_dictionary::database& db, sstring ks_name, bool supports_repair) const;
    void validate_minimum_int(const sstring& field, int32_t minimum_value, int32_t default_value) const;
 };

--- a/cql3/statements/create_index_statement.cc
+++ b/cql3/statements/create_index_statement.cc
@@ -10,7 +10,10 @@

 #include <seastar/core/coroutine.hh>
 #include "create_index_statement.hh"
+#include "db/config.hh"
+#include "db/view/view.hh"
 #include "exceptions/exceptions.hh"
+#include "index/vector_index.hh"
 #include "prepared_statement.hh"
 #include "types/types.hh"
 #include "validation.hh"
@@ -92,9 +95,17 @@ std::vector<::shared_ptr<index_target>> create_index_statement::validate_while_e
        throw exceptions::invalid_request_exception(format("index names shouldn't be more than {:d} characters long (got \"{}\")", schema::NAME_LENGTH, _index_name.c_str()));
    }

-    if (!db.features().views_with_tablets && db.find_keyspace(keyspace()).get_replication_strategy().uses_tablets()) {
-        throw exceptions::invalid_request_exception(format("Secondary indexes are not supported on base tables with tablets (keyspace '{}')", keyspace()));
+    // Regular secondary indexes require rf-rack-validity.
+    // Custom indexes need to validate this property themselves, if they need it.
+    if (!_properties || !_properties->custom_class) {
+        try {
+            db::view::validate_view_keyspace(db, keyspace());
+        } catch (const std::exception& e) {
+            // The type of the thrown exception is not specified, so we need to wrap it here.
+            throw exceptions::invalid_request_exception(e.what());
+        }
    }
+
    validate_for_local_index(*schema);

    std::vector<::shared_ptr<index_target>> targets;
@@ -108,7 +119,7 @@ std::vector<::shared_ptr<index_target>> create_index_statement::validate_while_e
            throw exceptions::invalid_request_exception(format("Non-supported custom class \'{}\' provided", *(_properties->custom_class)));
        }
        auto custom_index = (*custom_index_factory)();
-        custom_index->validate(*schema, *_properties, targets, db.features());
+        custom_index->validate(*schema, *_properties, targets, db.features(), db);
        _properties->index_version = custom_index->index_version(*schema);
    }

@@ -375,6 +386,15 @@ std::optional<create_index_statement::base_schema_with_new_index> create_index_s
                    format("Index {} is a duplicate of existing index {}", index.name(), existing_index.value().name()));
        }
    }
+    bool existing_vector_index = _properties->custom_class && _properties->custom_class == "vector_index" && secondary_index::vector_index::has_vector_index_on_column(*schema, targets[0]->column_name());
+    bool custom_index_with_same_name = _properties->custom_class && db.existing_index_names(keyspace()).contains(_index_name);
+    if (existing_vector_index || custom_index_with_same_name) {
+        if (_if_not_exists) {
+            return {};
+        } else {
+            throw exceptions::invalid_request_exception("There exists a duplicate custom index");
+        }
+    }
    auto index_table_name = secondary_index::index_table_name(accepted_name);
    if (db.has_schema(keyspace(), index_table_name)) {
        // We print this error even if _if_not_exists - in this case the user
--- a/cql3/statements/create_keyspace_statement.cc
+++ b/cql3/statements/create_keyspace_statement.cc
@@ -113,8 +113,7 @@ future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, utils::chun
        if (rs->uses_tablets()) {
            warnings.push_back(
                "Tables in this keyspace will be replicated using Tablets "
-                "and will not support Materialized Views, Secondary Indexes and counters features. "
-                "To use Materialized Views, Secondary Indexes or counters, drop this keyspace and re-create it "
+                "and will not support counters features. To use counters, drop this keyspace and re-create it "
                "without tablets by adding AND TABLETS = {'enabled': false} to the CREATE KEYSPACE statement.");
            if (ksm->initial_tablets().value()) {
                warnings.push_back("Keyspace `initial` tablets option is deprecated.  Use per-table tablet options instead.");
--- a/cql3/statements/create_table_statement.cc
+++ b/cql3/statements/create_table_statement.cc
@@ -31,8 +31,6 @@
 #include "db/config.hh"
 #include "compaction/time_window_compaction_strategy.hh"

-bool is_internal_keyspace(std::string_view name);
-
 namespace cql3 {

 namespace statements {
@@ -124,11 +122,7 @@ void create_table_statement::apply_properties_to(schema_builder& builder, const
        addColumnMetadataFromAliases(cfmd, Collections.singletonList(valueAlias), defaultValidator, ColumnDefinition.Kind.COMPACT_VALUE);
 #endif

-    if (!_properties->get_compression_options() && !is_internal_keyspace(keyspace())) {
-        builder.set_compressor_params(db.get_config().sstable_compression_user_table_options());
-    }
-
-    _properties->apply_to_builder(builder, _properties->make_schema_extensions(db.extensions()), db, keyspace());
+    _properties->apply_to_builder(builder, _properties->make_schema_extensions(db.extensions()), db, keyspace(), true);
 }

 void create_table_statement::add_column_metadata_from_aliases(schema_builder& builder, std::vector<bytes> aliases, const std::vector<data_type>& types, column_kind kind) const
--- a/cql3/statements/create_view_statement.cc
+++ b/cql3/statements/create_view_statement.cc
@@ -152,9 +152,13 @@ std::pair<view_ptr, cql3::cql_warnings_vec> create_view_statement::prepare_view(

    schema_ptr schema = validation::validate_column_family(db, _base_name.get_keyspace(), _base_name.get_column_family());

-    if (!db.features().views_with_tablets && db.find_keyspace(keyspace()).get_replication_strategy().uses_tablets()) {
-        throw exceptions::invalid_request_exception(format("Materialized views are not supported on base tables with tablets"));
+    try {
+        db::view::validate_view_keyspace(db, keyspace());
+    } catch (const std::exception& e) {
+        // The type of the thrown exception is not specified, so we need to wrap it here.
+        throw exceptions::invalid_request_exception(e.what());
    }
+
    if (schema->is_counter()) {
        throw exceptions::invalid_request_exception(format("Materialized views are not supported on counter tables"));
    }
@@ -369,7 +373,30 @@ std::pair<view_ptr, cql3::cql_warnings_vec> create_view_statement::prepare_view(
            db::view::create_virtual_column(builder, def->name(), def->type);
        }
    }
-    _properties.properties()->apply_to_builder(builder, std::move(schema_extensions), db, keyspace());
+
+    bool is_colocated = [&] {
+        if (!db.find_keyspace(keyspace()).get_replication_strategy().uses_tablets()) {
+            return false;
+        }
+        if (target_partition_keys.size() != schema->partition_key_columns().size()) {
+            return false;
+        }
+        for (size_t i = 0; i < target_partition_keys.size(); ++i) {
+            if (target_partition_keys[i] != &schema->partition_key_columns()[i]) {
+                return false;
+            }
+        }
+        return true;
+    }();
+
+    if (is_colocated) {
+        auto gc_opts = _properties.properties()->get_tombstone_gc_options(schema_extensions);
+        if (gc_opts && gc_opts->mode() == tombstone_gc_mode::repair) {
+            throw exceptions::invalid_request_exception("The 'repair' mode for tombstone_gc is not allowed on co-located materialized view tables.");
+        }
+    }
+
+    _properties.properties()->apply_to_builder(builder, std::move(schema_extensions), db, keyspace(), !is_colocated);

    if (builder.default_time_to_live().count() > 0) {
        throw exceptions::invalid_request_exception(
--- a/cql3/statements/describe_statement.cc
+++ b/cql3/statements/describe_statement.cc
@@ -23,6 +23,7 @@
 #include "index/vector_index.hh"
 #include "schema/schema.hh"
 #include "service/client_state.hh"
+#include "service/paxos/paxos_state.hh"
 #include "types/types.hh"
 #include "cql3/query_processor.hh"
 #include "cql3/cql_statement.hh"
@@ -329,6 +330,19 @@ future<std::vector<description>> table(const data_dictionary::database& db, cons
                "*/",
                *table_desc.create_statement);

+        table_desc.create_statement = std::move(os).to_managed_string();
+    } else if (service::paxos::paxos_store::try_get_base_table(name)) {
+        // Paxos state table is internally managed by Scylla and it shouldn't be exposed to the user.
+        // The table is allowed to be described as a comment to ease administrative work but it's hidden from all listings.
+        fragmented_ostringstream os{};
+
+        fmt::format_to(os.to_iter(),
+                "/* Do NOT execute this statement! It's only for informational purposes.\n"
+                "   A paxos state table is created automatically when enabling LWT on a base table.\n"
+                "\n{}\n"
+                "*/",
+                *table_desc.create_statement);
+
        table_desc.create_statement = std::move(os).to_managed_string();
    }
    result.push_back(std::move(table_desc));
@@ -364,7 +378,7 @@ future<std::vector<description>> table(const data_dictionary::database& db, cons
 future<std::vector<description>> tables(const data_dictionary::database& db, const lw_shared_ptr<keyspace_metadata>& ks, std::optional<bool> with_internals = std::nullopt) {
    auto& replica_db = db.real_database();
    auto tables = ks->tables() | std::views::filter([&replica_db] (const schema_ptr& s) {
-        return !cdc::is_log_for_some_table(replica_db, s->ks_name(), s->cf_name());
+        return !cdc::is_log_for_some_table(replica_db, s->ks_name(), s->cf_name()) && !service::paxos::paxos_store::try_get_base_table(s->cf_name());
    }) | std::ranges::to<std::vector<schema_ptr>>();
    std::ranges::sort(tables, std::ranges::less(), std::mem_fn(&schema::cf_name));

--- a/cql3/statements/select_statement.cc
+++ b/cql3/statements/select_statement.cc
@@ -21,6 +21,7 @@
 #include "exceptions/exceptions.hh"
 #include <seastar/core/future.hh>
 #include <seastar/coroutine/exception.hh>
+#include "index/vector_index.hh"
 #include "service/broadcast_tables/experimental/lang.hh"
 #include "service/qos/qos_common.hh"
 #include "vector_search/vector_store_client.hh"
@@ -245,7 +246,9 @@ future<> select_statement::check_access(query_processor& qp, const service::clie
        auto& cf_name = s->is_view()
            ? s->view_info()->base_name()
            : (cdc ? cdc->cf_name() : column_family());
-        co_await state.has_column_family_access(keyspace(), cf_name, auth::permission::SELECT);
+        const schema_ptr& base_schema = cdc ? cdc : _schema;
+        bool is_vector_indexed = secondary_index::vector_index::has_vector_index(*base_schema);
+        co_await state.has_column_family_access(keyspace(), cf_name, auth::permission::SELECT, auth::command_desc::type::OTHER, is_vector_indexed);
    } catch (const data_dictionary::no_such_column_family& e) {
        // Will be validated afterwards.
        co_return;
@@ -1026,7 +1029,7 @@ indexed_table_select_statement::prepare(data_dictionary::database db,
        if (it == indexes.end()) {
            throw exceptions::invalid_request_exception("ANN ordering by vector requires the column to be indexed using 'vector_index'");
        } else {
-            if (index_opt || parameters->allow_filtering() || restrictions->need_filtering() || check_needs_allow_filtering_anyway(*restrictions)) {
+            if (index_opt || parameters->allow_filtering() || !(restrictions->is_empty()) || check_needs_allow_filtering_anyway(*restrictions)) {
                throw exceptions::invalid_request_exception("ANN ordering by vector does not support filtering");
            }
            index_opt = *it;
@@ -1182,6 +1185,11 @@ future<shared_ptr<cql_transport::messages::result_message>> indexed_table_select
        if (stats) {
            stats->add_latency(duration);
        }
+        auto limit = get_limit(options, _limit);
+        auto page_size = options.get_page_size();
+        if (_prepared_ann_ordering.has_value() && page_size > 0 && (uint64_t) page_size < limit) {
+            result->add_warning("Paging is not supported for Vector Search queries. The entire result set has been returned.");
+        }
        co_return result;
 }

@@ -1217,11 +1225,18 @@ indexed_table_select_statement::actually_do_execute(query_processor& qp,

        auto [ann_column, ann_vector_expr] = _prepared_ann_ordering.value();

-        auto values = value_cast<vector_type_impl::native_type>(ann_column->type->deserialize(expr::evaluate(ann_vector_expr, options).to_bytes()));
+        auto expr_value = expr::evaluate(ann_vector_expr, options);
+
+        if (expr_value.is_null()) {
+            throw exceptions::invalid_request_exception(fmt::format("Unsupported null value for column {}", _prepared_ann_ordering->first->name_as_text()));
+        }
+
+        auto values = value_cast<vector_type_impl::native_type>(ann_column->type->deserialize(std::move(expr_value).to_bytes()));
        auto ann_vector = util::to_vector<float>(values);

-        auto as = abort_source();
-        auto pkeys = co_await qp.vector_store_client().ann(_schema->ks_name(), _index.metadata().name(), _schema , std::move(ann_vector), limit, as);
+        auto timeout = db::timeout_clock::now() + get_timeout(state.get_client_state(), options);
+        auto aoe = abort_on_expiry(timeout);
+        auto pkeys = co_await qp.vector_store_client().ann(_schema->ks_name(), _index.metadata().name(), _schema , std::move(ann_vector), limit, aoe.abort_source());
        if (!pkeys.has_value()) {
            co_await coroutine::return_exception(
                    exceptions::invalid_request_exception(std::visit(vector_search::vector_store_client::ann_error_visitor{}, pkeys.error())));
--- a/db/batchlog_manager.cc
+++ b/db/batchlog_manager.cc
@@ -59,27 +59,31 @@ db::batchlog_manager::batchlog_manager(cql3::query_processor& qp, db::system_key
    });
 }

-future<> db::batchlog_manager::do_batch_log_replay(post_replay_cleanup cleanup) {
-    return container().invoke_on(0, [cleanup] (auto& bm) -> future<> {
+future<db::all_batches_replayed> db::batchlog_manager::do_batch_log_replay(post_replay_cleanup cleanup) {
+    return container().invoke_on(0, [cleanup] (auto& bm) -> future<db::all_batches_replayed> {
        auto gate_holder = bm._gate.hold();
        auto sem_units = co_await get_units(bm._sem, 1);

        auto dest = bm._cpu++ % smp::count;
        blogger.debug("Batchlog replay on shard {}: starts", dest);
        auto last_replay = gc_clock::now();
+        all_batches_replayed all_replayed = all_batches_replayed::yes;
        if (dest == 0) {
-            co_await bm.replay_all_failed_batches(cleanup);
+            all_replayed = co_await bm.replay_all_failed_batches(cleanup);
        } else {
-            co_await bm.container().invoke_on(dest, [cleanup] (auto& bm) {
+            all_replayed = co_await bm.container().invoke_on(dest, [cleanup] (auto& bm) {
                return with_gate(bm._gate, [&bm, cleanup] {
                    return bm.replay_all_failed_batches(cleanup);
                });
            });
        }
-        co_await bm.container().invoke_on_all([last_replay] (auto& bm) {
-            bm._last_replay = last_replay;
-        });
+        if (all_replayed == all_batches_replayed::yes) {
+            co_await bm.container().invoke_on_all([last_replay] (auto& bm) {
+                bm._last_replay = last_replay;
+            });
+        }
        blogger.debug("Batchlog replay on shard {}: done", dest);
+        co_return all_replayed;
    });
 }

@@ -159,124 +163,127 @@ db_clock::duration db::batchlog_manager::get_batch_log_timeout() const {
    return _write_request_timeout * 2;
 }

-future<> db::batchlog_manager::replay_all_failed_batches(post_replay_cleanup cleanup) {
+future<db::all_batches_replayed> db::batchlog_manager::replay_all_failed_batches(post_replay_cleanup cleanup) {
    typedef db_clock::rep clock_type;

+    db::all_batches_replayed all_replayed = all_batches_replayed::yes;
    // rate limit is in bytes per second. Uses Double.MAX_VALUE if disabled (set to 0 in cassandra.yaml).
    // max rate is scaled by the number of nodes in the cluster (same as for HHOM - see CASSANDRA-5272).
    auto throttle = _replay_rate / _qp.proxy().get_token_metadata_ptr()->count_normal_token_owners();
    auto limiter = make_lw_shared<utils::rate_limiter>(throttle);

-    auto batch = [this, limiter](const cql3::untyped_result_set::row& row) -> future<stop_iteration> {
+    auto schema = _qp.db().find_schema(system_keyspace::NAME, system_keyspace::BATCHLOG);
+    auto delete_batch = [this, schema = std::move(schema)] (utils::UUID id) {
+        auto key = partition_key::from_singular(*schema, id);
+        mutation m(schema, key);
+        auto now = service::client_state(service::client_state::internal_tag()).get_timestamp();
+        m.partition().apply_delete(*schema, clustering_key_prefix::make_empty(), tombstone(now, gc_clock::now()));
+        return _qp.proxy().mutate_locally(m, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
+    };
+
+    auto batch = [this, limiter, delete_batch = std::move(delete_batch), &all_replayed](const cql3::untyped_result_set::row& row) -> future<stop_iteration> {
        auto written_at = row.get_as<db_clock::time_point>("written_at");
        auto id = row.get_as<utils::UUID>("id");
        // enough time for the actual write + batchlog entry mutation delivery (two separate requests).
+        auto now = db_clock::now();
        auto timeout = get_batch_log_timeout();
-        if (db_clock::now() < written_at + timeout) {
-            blogger.debug("Skipping replay of {}, too fresh", id);
-            return make_ready_future<stop_iteration>(stop_iteration::no);
-        }

        if (utils::get_local_injector().is_enabled("skip_batch_replay")) {
            blogger.debug("Skipping batch replay due to skip_batch_replay injection");
-            return make_ready_future<stop_iteration>(stop_iteration::no);
+            all_replayed = all_batches_replayed::no;
+            co_return stop_iteration::no;
        }

        // check version of serialization format
        if (!row.has("version")) {
            blogger.warn("Skipping logged batch because of unknown version");
-            return make_ready_future<stop_iteration>(stop_iteration::no);
+            co_await delete_batch(id);
+            co_return stop_iteration::no;
        }

        auto version = row.get_as<int32_t>("version");
        if (version != netw::messaging_service::current_version) {
-            blogger.warn("Skipping logged batch because of incorrect version");
-            return make_ready_future<stop_iteration>(stop_iteration::no);
+            blogger.warn("Skipping logged batch because of incorrect version {}; current version = {}", version, netw::messaging_service::current_version);
+            co_await delete_batch(id);
+            co_return stop_iteration::no;
        }

        auto data = row.get_blob_unfragmented("data");

        blogger.debug("Replaying batch {}", id);

-        auto fms = make_lw_shared<std::deque<canonical_mutation>>();
-        auto in = ser::as_input_stream(data);
-        while (in.size()) {
-            fms->emplace_back(ser::deserialize(in, std::type_identity<canonical_mutation>()));
-        }
-
-        auto size = data.size();
-
-        return map_reduce(*fms, [this, written_at] (canonical_mutation& fm) {
-            const auto& cf = _qp.proxy().local_db().find_column_family(fm.column_family_id());
-            return make_ready_future<canonical_mutation*>(written_at > cf.get_truncation_time() ? &fm : nullptr);
-        },
-        utils::chunked_vector<mutation>(),
-        [this] (utils::chunked_vector<mutation> mutations, canonical_mutation* fm) {
-            if (fm) {
-                schema_ptr s = _qp.db().find_schema(fm->column_family_id());
-                mutations.emplace_back(fm->to_mutation(s));
+        try {
+            auto fms = make_lw_shared<std::deque<canonical_mutation>>();
+            auto in = ser::as_input_stream(data);
+            while (in.size()) {
+                fms->emplace_back(ser::deserialize(in, std::type_identity<canonical_mutation>()));
+                schema_ptr s = _qp.db().find_schema(fms->back().column_family_id());
+                timeout = std::min(timeout, std::chrono::duration_cast<db_clock::duration>(s->tombstone_gc_options().propagation_delay_in_seconds()));
            }
-            return mutations;
-        }).then([this, limiter, written_at, size, fms] (utils::chunked_vector<mutation> mutations) {
-            if (mutations.empty()) {
-                return make_ready_future<>();
+
+            if (now < written_at + timeout) {
+                blogger.debug("Skipping replay of {}, too fresh", id);
+                co_return stop_iteration::no;
            }
-            const auto ttl = [written_at]() -> clock_type {
-                /*
-                 * Calculate ttl for the mutations' hints (and reduce ttl by the time the mutations spent in the batchlog).
-                 * This ensures that deletes aren't "undone" by an old batch replay.
-                 */
-                auto unadjusted_ttl = std::numeric_limits<gc_clock::rep>::max();
-                warn(unimplemented::cause::HINT);
-#if 0
-                for (auto& m : *mutations) {
-                    unadjustedTTL = Math.min(unadjustedTTL, HintedHandOffManager.calculateHintTTL(mutation));
+
+            auto size = data.size();
+
+            auto mutations = co_await map_reduce(*fms, [this, written_at] (canonical_mutation& fm) {
+                const auto& cf = _qp.proxy().local_db().find_column_family(fm.column_family_id());
+                return make_ready_future<canonical_mutation*>(written_at > cf.get_truncation_time() ? &fm : nullptr);
+            },
+            utils::chunked_vector<mutation>(),
+            [this] (utils::chunked_vector<mutation> mutations, canonical_mutation* fm) {
+                if (fm) {
+                    schema_ptr s = _qp.db().find_schema(fm->column_family_id());
+                    mutations.emplace_back(fm->to_mutation(s));
                }
-#endif
-                return unadjusted_ttl - std::chrono::duration_cast<gc_clock::duration>(db_clock::now() - written_at).count();
-            }();
-
-            if (ttl <= 0) {
-                return make_ready_future<>();
-            }
-            // Origin does the send manually, however I can't see a super great reason to do so.
-            // Our normal write path does not add much redundancy to the dispatch, and rate is handled after send
-            // in both cases.
-            // FIXME: verify that the above is reasonably true.
-            return limiter->reserve(size).then([this, mutations = std::move(mutations)] {
-                _stats.write_attempts += mutations.size();
-                // #1222 - change cl level to ALL, emulating origins behaviour of sending/hinting
-                // to all natural end points.
-                // Note however that origin uses hints here, and actually allows for this
-                // send to partially or wholly fail in actually sending stuff. Since we don't
-                // have hints (yet), send with CL=ALL, and hope we can re-do this soon.
-                // See below, we use retry on write failure.
-                auto timeout = db::timeout_clock::now() + write_timeout;
-                return _qp.proxy().send_batchlog_replay_to_all_replicas(std::move(mutations), timeout);
+                return mutations;
            });
-        }).then_wrapped([this, id](future<> batch_result) {
-            try {
-                batch_result.get();
-            } catch (data_dictionary::no_such_keyspace& ex) {
-                // should probably ignore and drop the batch
-            } catch (const data_dictionary::no_such_column_family&) {
-                // As above -- we should drop the batch if the table doesn't exist anymore.
-            } catch (...) {
-                blogger.warn("Replay failed (will retry): {}", std::current_exception());
-                // timeout, overload etc.
-                // Do _not_ remove the batch, assuning we got a node write error.
-                // Since we don't have hints (which origin is satisfied with),
-                // we have to resort to keeping this batch to next lap.
-                return make_ready_future<>();
+
+            if (!mutations.empty()) {
+                const auto ttl = [written_at]() -> clock_type {
+                    /*
+                    * Calculate ttl for the mutations' hints (and reduce ttl by the time the mutations spent in the batchlog).
+                    * This ensures that deletes aren't "undone" by an old batch replay.
+                    */
+                    auto unadjusted_ttl = std::numeric_limits<gc_clock::rep>::max();
+                    warn(unimplemented::cause::HINT);
+#if 0
+                    for (auto& m : *mutations) {
+                        unadjustedTTL = Math.min(unadjustedTTL, HintedHandOffManager.calculateHintTTL(mutation));
+                    }
+#endif
+                    return unadjusted_ttl - std::chrono::duration_cast<gc_clock::duration>(db_clock::now() - written_at).count();
+                }();
+
+                if (ttl > 0) {
+                    // Origin does the send manually, however I can't see a super great reason to do so.
+                    // Our normal write path does not add much redundancy to the dispatch, and rate is handled after send
+                    // in both cases.
+                    // FIXME: verify that the above is reasonably true.
+                    co_await limiter->reserve(size);
+                    _stats.write_attempts += mutations.size();
+                    auto timeout = db::timeout_clock::now() + write_timeout;
+                    co_await _qp.proxy().send_batchlog_replay_to_all_replicas(std::move(mutations), timeout);
+                }
            }
-            // delete batch
-            auto schema = _qp.db().find_schema(system_keyspace::NAME, system_keyspace::BATCHLOG);
-            auto key = partition_key::from_singular(*schema, id);
-            mutation m(schema, key);
-            auto now = service::client_state(service::client_state::internal_tag()).get_timestamp();
-            m.partition().apply_delete(*schema, clustering_key_prefix::make_empty(), tombstone(now, gc_clock::now()));
-            return _qp.proxy().mutate_locally(m, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
-        }).then([] { return make_ready_future<stop_iteration>(stop_iteration::no); });
+        } catch (data_dictionary::no_such_keyspace& ex) {
+            // should probably ignore and drop the batch
+        } catch (const data_dictionary::no_such_column_family&) {
+            // As above -- we should drop the batch if the table doesn't exist anymore.
+        } catch (...) {
+            blogger.warn("Replay failed (will retry): {}", std::current_exception());
+            all_replayed = all_batches_replayed::no;
+            // timeout, overload etc.
+            // Do _not_ remove the batch, assuming we got a node write error.
+            // Since we don't have hints (which origin is satisfied with),
+            // we have to resort to keeping this batch to next lap.
+            co_return stop_iteration::no;
+        }
+        // delete batch
+        co_await delete_batch(id);
+        co_return stop_iteration::no;
    };

    co_await with_gate(_gate, [this, cleanup, batch = std::move(batch)] () mutable -> future<> {
@@ -298,4 +305,6 @@ future<> db::batchlog_manager::replay_all_failed_batches(post_replay_cleanup cle
            blogger.debug("Finished replayAllFailedBatches");
        });
    });
+
+    co_return all_replayed;
 }
--- a/db/batchlog_manager.hh
+++ b/db/batchlog_manager.hh
@@ -31,6 +31,8 @@ namespace db {

 class system_keyspace;

+using all_batches_replayed = bool_class<struct all_batches_replayed_tag>;
+
 struct batchlog_manager_config {
    std::chrono::duration<double> write_request_timeout;
    uint64_t replay_rate = std::numeric_limits<uint64_t>::max();
@@ -69,7 +71,7 @@ private:

    gc_clock::time_point _last_replay;

-    future<> replay_all_failed_batches(post_replay_cleanup cleanup);
+    future<all_batches_replayed> replay_all_failed_batches(post_replay_cleanup cleanup);
 public:
    // Takes a QP, not a distributes. Because this object is supposed
    // to be per shard and does no dispatching beyond delegating the the
@@ -80,7 +82,7 @@ public:
    future<> drain();
    future<> stop();

-    future<> do_batch_log_replay(post_replay_cleanup cleanup);
+    future<all_batches_replayed> do_batch_log_replay(post_replay_cleanup cleanup);

    future<size_t> count_all_batches() const;
    db_clock::duration get_batch_log_timeout() const;
--- a/db/commitlog/commitlog.cc
+++ b/db/commitlog/commitlog.cc
@@ -502,6 +502,9 @@ public:
    void flush_segments(uint64_t size_to_remove);
    void check_no_data_older_than_allowed();

+    // whitebox testing
+    std::function<future<>()> _oversized_pre_wait_memory_func;
+
 private:
    class shutdown_marker{};

@@ -1597,8 +1600,15 @@ future<> db::commitlog::segment_manager::oversized_allocation(entry_writer& writ

    scope_increment_counter allocating(totals.active_allocations);

+    // #27992 - whitebox testing. signal we are trying to lock out
+    // all allocators
+    if (_oversized_pre_wait_memory_func) {
+        co_await _oversized_pre_wait_memory_func();
+    }
+
    auto permit = co_await std::move(fut);
-    SCYLLA_ASSERT(_request_controller.available_units() == 0);
+    // #27992 - task reordering _can_ force the available units to negative. this is ok.
+    SCYLLA_ASSERT(_request_controller.available_units() <= 0);

    decltype(permit) fake_permit; // can't have allocate+sync release semaphore.
    bool failed = false;
@@ -1859,13 +1869,15 @@ future<> db::commitlog::segment_manager::oversized_allocation(entry_writer& writ
            }
        }
    }
-    SCYLLA_ASSERT(_request_controller.available_units() == 0);
+
+    auto avail = _request_controller.available_units();
+    SCYLLA_ASSERT(avail <= 0);
    SCYLLA_ASSERT(permit.count() == max_request_controller_units());
    auto nw = _request_controller.waiters();
    permit.return_all();
    // #20633 cannot guarantee controller avail is now full, since we could have had waiters when doing
    // return all -> now will be less avail
-    SCYLLA_ASSERT(nw > 0 || _request_controller.available_units() == ssize_t(max_request_controller_units()));
+    SCYLLA_ASSERT(nw > 0 || _request_controller.available_units() == (avail + ssize_t(max_request_controller_units())));

    if (!failed) {
        clogger.trace("Oversized allocation succeeded.");
@@ -1974,13 +1986,13 @@ future<> db::commitlog::segment_manager::replenish_reserve() {
            }
            continue;
        } catch (shutdown_marker&) {
-            _reserve_segments.abort(std::current_exception());
            break;
        } catch (...) {
            clogger.warn("Exception in segment reservation: {}", std::current_exception());
        }
        co_await sleep(100ms);
    }
+    _reserve_segments.abort(std::make_exception_ptr(shutdown_marker()));
 }

 future<std::vector<db::commitlog::descriptor>>
@@ -3624,6 +3636,10 @@ db::commitlog::read_log_file(const replay_state& state, sstring filename, sstrin
            auto old = pos;
            pos = next_pos(off);
            clogger.trace("Pos {} -> {} ({})", old, pos, off);
+            // #24346 check eof status whenever we move file pos.
+            if (pos >= file_size) {
+                eof = true;
+            }
        }

        future<> read_entry() {
@@ -3939,6 +3955,9 @@ void db::commitlog::update_max_data_lifetime(std::optional<uint64_t> commitlog_d
    _segment_manager->cfg.commitlog_data_max_lifetime_in_seconds = commitlog_data_max_lifetime_in_seconds;
 }

+void db::commitlog::set_oversized_pre_wait_memory_func(std::function<future<>()> f) {
+    _segment_manager->_oversized_pre_wait_memory_func = std::move(f);
+}

 future<std::vector<sstring>> db::commitlog::get_segments_to_replay() const {
    return _segment_manager->get_segments_to_replay();
--- a/db/commitlog/commitlog.hh
+++ b/db/commitlog/commitlog.hh
@@ -385,6 +385,9 @@ public:
    // (Re-)set data mix lifetime.
    void update_max_data_lifetime(std::optional<uint64_t> commitlog_data_max_lifetime_in_seconds);

+    // Whitebox testing. Do not use for production
+    void set_oversized_pre_wait_memory_func(std::function<future<>()>);
+
    using commit_load_reader_func = std::function<future<>(buffer_and_replay_position)>;

    class segment_error : public std::exception {};
--- a/db/commitlog/commitlog_replayer.cc
+++ b/db/commitlog/commitlog_replayer.cc
@@ -54,12 +54,14 @@ public:
        uint64_t applied_mutations = 0;
        uint64_t corrupt_bytes = 0;
        uint64_t truncated_at = 0;
+        uint64_t broken_files = 0;

        stats& operator+=(const stats& s) {
            invalid_mutations += s.invalid_mutations;
            skipped_mutations += s.skipped_mutations;
            applied_mutations += s.applied_mutations;
            corrupt_bytes += s.corrupt_bytes;
+            broken_files += s.broken_files;
            return *this;
        }
        stats operator+(const stats& s) const {
@@ -192,6 +194,8 @@ db::commitlog_replayer::impl::recover(const commitlog::descriptor& d, const comm
            s->corrupt_bytes += e.bytes();
        } catch (commitlog::segment_truncation& e) {
            s->truncated_at = e.position();
+        } catch (commitlog::header_checksum_error&) {
+            ++s->broken_files;
        } catch (...) {
            throw;
        }
@@ -370,6 +374,9 @@ future<> db::commitlog_replayer::recover(std::vector<sstring> files, sstring fna
                    if (stats.truncated_at != 0) {
                        rlogger.warn("Truncated file: {} at position {}.", f, stats.truncated_at);
                    }
+                    if (stats.broken_files != 0) {
+                        rlogger.warn("Corrupted file header: {}. Skipped.", f);
+                    }
                    rlogger.debug("Log replay of {} complete, {} replayed mutations ({} invalid, {} skipped)"
                                    , f
                                    , stats.applied_mutations
--- a/db/config.cc
+++ b/db/config.cc
@@ -1171,6 +1171,17 @@ db::config::config(std::shared_ptr<db::extensions> exts)
        "* default_weight: (Default: 1 **)  How many requests are handled during each turn of the RoundRobin.\n"
        "* weights: (Default: Keyspace: 1)  Takes a list of keyspaces. It sets how many requests are handled during each turn of the RoundRobin, based on the request_scheduler_id.")
    /**
+    * @Group Vector search settings
+    * @GroupDescription Settings for configuring and tuning vector search functionality.
+    */
+    , vector_store_primary_uri(this, "vector_store_primary_uri", liveness::LiveUpdate, value_status::Used, "",
+        "A comma-separated list of primary vector store node URIs. These nodes are preferred for vector search operations.")
+    , vector_store_secondary_uri(this, "vector_store_secondary_uri", liveness::LiveUpdate, value_status::Used, "",
+        "A comma-separated list of secondary vector store node URIs. These nodes are used as a fallback when all primary nodes are unavailable, and are typically located in a different availability zone for high availability.")
+    , vector_store_encryption_options(this, "vector_store_encryption_options", value_status::Used, {},
+        "Options for encrypted connections to the vector store. These options are used for HTTPS URIs in `vector_store_primary_uri` and `vector_store_secondary_uri`. The available options are:\n"
+        "* truststore: (Default: <not set, use system truststore>) Location of the truststore containing the trusted certificate for authenticating remote servers.")
+    /**
    * @Group Security properties
    * @GroupDescription Server and client security settings.
    */
@@ -1318,15 +1329,15 @@ db::config::config(std::shared_ptr<db::extensions> exts)
    , enable_sstables_mc_format(this, "enable_sstables_mc_format", value_status::Unused, true, "Enable SSTables 'mc' format to be used as the default file format.  Deprecated, please use \"sstable_format\" instead.")
    , enable_sstables_md_format(this, "enable_sstables_md_format", value_status::Unused, true, "Enable SSTables 'md' format to be used as the default file format.  Deprecated, please use \"sstable_format\" instead.")
    , sstable_format(this, "sstable_format", liveness::LiveUpdate, value_status::Used, "me", "Default sstable file format", {"md", "me", "ms"})
-    , sstable_compression_user_table_options(this, "sstable_compression_user_table_options", value_status::Used, compression_parameters{},
+    , sstable_compression_user_table_options(this, "sstable_compression_user_table_options", value_status::Used, compression_parameters{compression_parameters::algorithm::lz4_with_dicts},
        "Server-global user table compression options. If enabled, all user tables"
        "will be compressed using the provided options, unless overridden"
-        "by compression options in the table schema. The available options are:\n"
-        "* sstable_compression: The compression algorithm to use. Supported values: LZ4Compressor (default), LZ4WithDictsCompressor, SnappyCompressor, DeflateCompressor, ZstdCompressor, ZstdWithDictsCompressor, '' (empty string; disables compression).\n"
+        "by compression options in the table schema. User tables are all tables in non-system keyspaces. The available options are:\n"
+        "* sstable_compression: The compression algorithm to use. Supported values: LZ4Compressor, LZ4WithDictsCompressor (default), SnappyCompressor, DeflateCompressor, ZstdCompressor, ZstdWithDictsCompressor, '' (empty string; disables compression).\n"
        "* chunk_length_in_kb: (Default: 4) The size of chunks to compress in kilobytes. Allowed values are powers of two between 1 and 128.\n"
        "* crc_check_chance: (Default: 1.0) Not implemented (option value is ignored).\n"
        "* compression_level: (Default: 3) Compression level for ZstdCompressor and ZstdWithDictsCompressor. Higher levels provide better compression ratios at the cost of speed. Allowed values are integers between 1 and 22.")
-    , sstable_compression_dictionaries_allow_in_ddl(this, "sstable_compression_dictionaries_allow_in_ddl", liveness::LiveUpdate, value_status::Used, true,
+    , sstable_compression_dictionaries_allow_in_ddl(this, "sstable_compression_dictionaries_allow_in_ddl", liveness::LiveUpdate, value_status::Deprecated, true,
        "Allows for configuring tables to use SSTable compression with shared dictionaries. "
        "If the option is disabled, Scylla will reject CREATE and ALTER statements which try to set dictionary-based sstable compressors.\n"
        "This is only enforced when this node validates a new DDL statement; disabling the option won't disable dictionary-based compression "
@@ -1426,7 +1437,8 @@ db::config::config(std::shared_ptr<db::extensions> exts)
    , alternator_port(this, "alternator_port", value_status::Used, 0, "Alternator API port.")
    , alternator_https_port(this, "alternator_https_port", value_status::Used, 0, "Alternator API HTTPS port.")
    , alternator_address(this, "alternator_address", value_status::Used, "0.0.0.0", "Alternator API listening address.")
-    , alternator_enforce_authorization(this, "alternator_enforce_authorization", value_status::Used, false, "Enforce checking the authorization header for every request in Alternator.")
+    , alternator_enforce_authorization(this, "alternator_enforce_authorization", liveness::LiveUpdate, value_status::Used, false, "Enforce checking the authorization header for every request in Alternator.")
+    , alternator_warn_authorization(this, "alternator_warn_authorization", liveness::LiveUpdate, value_status::Used, false, "Count and log warnings about failed authentication or authorization.")
    , alternator_write_isolation(this, "alternator_write_isolation", value_status::Used, "", "Default write isolation policy for Alternator.")
    , alternator_streams_time_window_s(this, "alternator_streams_time_window_s", value_status::Used, 10, "CDC query confidence window for alternator streams.")
    , alternator_timeout_in_ms(this, "alternator_timeout_in_ms", liveness::LiveUpdate, value_status::Used, 10000,
@@ -1448,7 +1460,8 @@ db::config::config(std::shared_ptr<db::extensions> exts)
        false,
        "Allow writing to system tables using the .scylla.alternator.system prefix")
    , alternator_max_expression_cache_entries_per_shard(this, "alternator_max_expression_cache_entries_per_shard", liveness::LiveUpdate, value_status::Used, 2000, "Maximum number of cached parsed request expressions, per shard.")
-    , vector_store_primary_uri(this, "vector_store_primary_uri", liveness::LiveUpdate, value_status::Used, "", "A comma-separated list of vector store node URIs. If not set, vector search is disabled.")
+    , alternator_max_users_query_size_in_trace_output(this, "alternator_max_users_query_size_in_trace_output", liveness::LiveUpdate, value_status::Used, uint64_t(4096),
+            "Maximum size of user's command in trace output (`alternator_op` entry). Larger traces will be truncated and have `<truncated>` message appended - which doesn't count toward the maximum limit.")
    , abort_on_ebadf(this, "abort_on_ebadf", value_status::Used, true, "Abort the server on incorrect file descriptor access. Throws exception when disabled.")
    , sanitizer_report_backtrace(this, "sanitizer_report_backtrace", value_status::Used, false,
            "In debug mode, report log-structured allocator sanitizer violations with a backtrace. Slow.")
@@ -1524,9 +1537,9 @@ db::config::config(std::shared_ptr<db::extensions> exts)
    , error_injections_at_startup(this, "error_injections_at_startup", error_injection_value_status, {}, "List of error injections that should be enabled on startup.")
    , topology_barrier_stall_detector_threshold_seconds(this, "topology_barrier_stall_detector_threshold_seconds", value_status::Used, 2, "Report sites blocking topology barrier if it takes longer than this.")
    , enable_tablets(this, "enable_tablets", value_status::Used, false, "Enable tablets for newly created keyspaces. (deprecated)")
-    , tablets_mode_for_new_keyspaces(this, "tablets_mode_for_new_keyspaces", value_status::Used, tablets_mode_t::mode::unset, "Control tablets for new keyspaces.  Can be set to the following values:\n"
+    , tablets_mode_for_new_keyspaces(this, "tablets_mode_for_new_keyspaces", liveness::LiveUpdate, value_status::Used, tablets_mode_t::mode::unset, "Control tablets for new keyspaces.  Can be set to the following values:\n"
            "\tdisabled: New keyspaces use vnodes by default, unless enabled by the tablets={'enabled':true} option\n"
-            "\tenabled:  New keyspaces use tablets by default, unless disabled by the tablets={'disabled':true} option\n"
+            "\tenabled:  New keyspaces use tablets by default, unless disabled by the tablets={'enabled':false} option\n"
            "\tenforced: New keyspaces must use tablets. Tablets cannot be disabled using the CREATE KEYSPACE option")
    , view_flow_control_delay_limit_in_ms(this, "view_flow_control_delay_limit_in_ms", liveness::LiveUpdate, value_status::Used, 1000,
        "The maximal amount of time that materialized-view update flow control may delay responses "
@@ -1740,6 +1753,21 @@ const db::extensions& db::config::extensions() const {
    return *_extensions;
 }

+compression_parameters db::config::get_sstable_compression_user_table_options(bool dicts_feature_enabled) const {
+    if (sstable_compression_user_table_options.is_set()
+            || dicts_feature_enabled
+            || !sstable_compression_user_table_options().uses_dictionary_compressor()) {
+        return sstable_compression_user_table_options();
+    } else {
+        // Fall back to non-dict if dictionary compression is not enabled cluster-wide.
+        auto options = sstable_compression_user_table_options();
+        auto params = options.get_options();
+        auto algo = compression_parameters::non_dict_equivalent(options.get_algorithm());
+        params[compression_parameters::SSTABLE_COMPRESSION] = sstring(compression_parameters::algorithm_to_name(algo));
+        return compression_parameters{params};
+    }
+}
+
 std::map<sstring, db::experimental_features_t::feature> db::experimental_features_t::map() {
    // We decided against using the construct-on-first-use idiom here:
    // https://github.com/scylladb/scylla/pull/5369#discussion_r353614807
@@ -1756,7 +1784,7 @@ std::map<sstring, db::experimental_features_t::feature> db::experimental_feature
        {"broadcast-tables", feature::BROADCAST_TABLES},
        {"keyspace-storage-options", feature::KEYSPACE_STORAGE_OPTIONS},
        {"tablets", feature::UNUSED},
-        {"views-with-tablets", feature::VIEWS_WITH_TABLETS}
+        {"views-with-tablets", feature::UNUSED}
    };
 }

--- a/db/config.hh
+++ b/db/config.hh
@@ -136,8 +136,7 @@ struct experimental_features_t {
        UDF,
        ALTERNATOR_STREAMS,
        BROADCAST_TABLES,
-        KEYSPACE_STORAGE_OPTIONS,
-        VIEWS_WITH_TABLETS
+        KEYSPACE_STORAGE_OPTIONS
    };
    static std::map<sstring, feature> map(); // See enum_option.
    static std::vector<enum_option<experimental_features_t>> all();
@@ -364,6 +363,9 @@ public:
    named_value<sstring> request_scheduler;
    named_value<sstring> request_scheduler_id;
    named_value<string_map> request_scheduler_options;
+    named_value<sstring> vector_store_primary_uri;
+    named_value<sstring> vector_store_secondary_uri;
+    named_value<string_map> vector_store_encryption_options;
    named_value<sstring> authenticator;
    named_value<sstring> internode_authenticator;
    named_value<sstring> authorizer;
@@ -432,7 +434,13 @@ public:
    named_value<bool> enable_sstables_mc_format;
    named_value<bool> enable_sstables_md_format;
    named_value<sstring> sstable_format;
+
+    // NOTE: Do not use this option directly.
+    // Use get_sstable_compression_user_table_options() instead.
    named_value<compression_parameters> sstable_compression_user_table_options;
+
+    compression_parameters get_sstable_compression_user_table_options(bool dicts_feature_enabled) const;
+
    named_value<bool> sstable_compression_dictionaries_allow_in_ddl;
    named_value<bool> sstable_compression_dictionaries_enable_writing;
    named_value<float> sstable_compression_dictionaries_memory_budget_fraction;
@@ -478,6 +486,7 @@ public:
    named_value<uint16_t> alternator_https_port;
    named_value<sstring> alternator_address;
    named_value<bool> alternator_enforce_authorization;
+    named_value<bool> alternator_warn_authorization;
    named_value<sstring> alternator_write_isolation;
    named_value<uint32_t> alternator_streams_time_window_s;
    named_value<uint32_t> alternator_timeout_in_ms;
@@ -486,8 +495,7 @@ public:
    named_value<uint32_t> alternator_max_items_in_batch_write;
    named_value<bool> alternator_allow_system_table_write;
    named_value<uint32_t> alternator_max_expression_cache_entries_per_shard;
-
-    named_value<sstring> vector_store_primary_uri;
+    named_value<uint64_t> alternator_max_users_query_size_in_trace_output;

    named_value<bool> abort_on_ebadf;

--- a/db/hints/internal/hint_endpoint_manager.cc
+++ b/db/hints/internal/hint_endpoint_manager.cc
@@ -248,7 +248,7 @@ future<db::commitlog> hint_endpoint_manager::add_store() noexcept {
            // which is larger than the segment ID of the RP of the last written hint.
            cfg.base_segment_id = _last_written_rp.base_id();

-            return commitlog::create_commitlog(std::move(cfg)).then([this] (commitlog l) -> future<commitlog> {
+            return commitlog::create_commitlog(std::move(cfg)).then([this] (this auto, commitlog l) -> future<commitlog> {
                // add_store() is triggered every time hint files are forcefully flushed to I/O (every hints_flush_period).
                // When this happens we want to refill _sender's segments only if it has finished with the segments he had before.
                if (_sender.have_segments()) {
--- a/db/hints/manager.cc
+++ b/db/hints/manager.cc
@@ -643,6 +643,12 @@ future<> manager::drain_for(endpoint_id host_id, gms::inet_address ip) noexcept
        co_return;
    }

+    if (!replay_allowed()) {
+        auto reason = seastar::format("Precondition violated while trying to drain {} / {}: "
+                "hint replay is not allowed", host_id, ip);
+        on_internal_error(manager_logger, std::move(reason));
+    }
+
    manager_logger.info("Draining starts for {}", host_id);

    const auto holder = seastar::gate::holder{_draining_eps_gate};
--- a/db/hints/manager.hh
+++ b/db/hints/manager.hh
@@ -318,6 +318,10 @@ public:
    /// In both cases - removes the corresponding hints' directories after all hints have been drained and erases the
    /// corresponding hint_endpoint_manager objects.
    ///
+    /// Preconditions:
+    /// * Hint replay must be allowed (i.e. `replay_allowed()` must be true) throughout
+    ///   the execution of this function.
+    ///
    /// \param host_id host ID of the node that left the cluster
    /// \param ip the IP of the node that left the cluster
    future<> drain_for(endpoint_id host_id, gms::inet_address ip) noexcept;
@@ -342,15 +346,15 @@ public:
        return _state.contains(state::started);
    }

+    bool replay_allowed() const noexcept {
+        return _state.contains(state::replay_allowed);
+    }
+
 private:
    void set_started() noexcept {
        _state.set(state::started);
    }

-    bool replay_allowed() const noexcept {
-        return _state.contains(state::replay_allowed);
-    }
-
    void set_draining_all() noexcept {
        _state.set(state::draining_all);
    }
--- a/db/row_cache.cc
+++ b/db/row_cache.cc
@@ -850,7 +850,7 @@ mutation_reader row_cache::make_nonpopulating_reader(schema_ptr schema, reader_p
                    std::move(permit),
                    e.key(),
                    query::clustering_key_filter_ranges(slice.row_ranges(*schema, e.key().key())),
-                    e.partition().read(_tracker.region(), _tracker.memtable_cleaner(), nullptr, phase_of(pos)),
+                    e.partition().read(_tracker.region(), _tracker.memtable_cleaner(), &_tracker, phase_of(pos)),
                    false,
                    _tracker.region(),
                    _read_section,
--- a/db/schema_tables.cc
+++ b/db/schema_tables.cc
@@ -95,16 +95,16 @@ static logging::logger diff_logger("schema_diff");
 /** system.schema_* tables used to store keyspace/table/type attributes prior to C* 3.0 */
 namespace db {
 namespace {
-    const auto set_use_schema_commitlog = schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
-        if (ks_name == schema_tables::NAME) {
-            props.enable_schema_commitlog();
+    const auto set_use_schema_commitlog = schema_builder::register_schema_initializer([](schema_builder& builder) {
+        if (builder.ks_name() == schema_tables::NAME) {
+            builder.enable_schema_commitlog();
        }
    });
    const auto set_group0_table_options =
-        schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
-            if (ks_name == schema_tables::NAME) {
+        schema_builder::register_schema_initializer([](schema_builder& builder) {
+            if (builder.ks_name() == schema_tables::NAME) {
                // all schema tables are group0 tables
-                props.is_group0_table = true;
+                builder.set_is_group0_table();
            }
        });
 }
@@ -1911,7 +1911,7 @@ static void make_update_indices_mutations(
        if (!view_should_exist(index)) {
            return view_ptr(nullptr);
        }
-        auto view = cf.get_index_manager().create_view_for_index(index);
+        auto view = cf.get_index_manager().create_view_for_index(index, db.as_data_dictionary());
        auto view_mutations = make_view_mutations(view, timestamp, true);
        view_mutations.copy_to(mutations);
        return view;
@@ -1945,7 +1945,7 @@ static void make_update_indices_mutations(
                for (auto& replica: tablet_map.get_tablet_info(tid).replicas) {
                    auto id = utils::UUID_gen::get_time_UUID();
                    view::view_building_task task {
-                        id, view::view_building_task::task_type::build_range, view::view_building_task::task_state::idle,
+                        id, view::view_building_task::task_type::build_range, false,
                        new_table->id(), view->id(), replica, last_token
                    };

--- a/db/system_distributed_keyspace.cc
+++ b/db/system_distributed_keyspace.cc
@@ -42,11 +42,11 @@ extern logging::logger cdc_log;

 namespace db {
 namespace {
-    const auto set_wait_for_sync_to_commitlog = schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
-        if ((ks_name == system_distributed_keyspace::NAME_EVERYWHERE && cf_name == system_distributed_keyspace::CDC_GENERATIONS_V2) ||
-            (ks_name == system_distributed_keyspace::NAME && cf_name == system_distributed_keyspace::CDC_TOPOLOGY_DESCRIPTION))
+    const auto set_wait_for_sync_to_commitlog = schema_builder::register_schema_initializer([](schema_builder& builder) {
+        if ((builder.ks_name() == system_distributed_keyspace::NAME_EVERYWHERE && builder.cf_name() == system_distributed_keyspace::CDC_GENERATIONS_V2) ||
+            (builder.ks_name() == system_distributed_keyspace::NAME && builder.cf_name() == system_distributed_keyspace::CDC_TOPOLOGY_DESCRIPTION))
        {
-            props.wait_for_sync_to_commitlog = true;
+            builder.set_wait_for_sync_to_commitlog(true);
        }
    });
 }
--- a/db/system_keyspace.cc
+++ b/db/system_keyspace.cc
@@ -55,6 +55,7 @@
 #include "utils/shared_dict.hh"
 #include "replica/database.hh"
 #include "db/compaction_history_entry.hh"
+#include "mutation/async_utils.hh"

 #include <unordered_map>

@@ -65,59 +66,44 @@ static thread_local auto sstableinfo_type = user_type_impl::get_instance(

 namespace db {
 namespace {
-    const auto set_null_sharder = schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
+    const auto set_null_sharder = schema_builder::register_schema_initializer([](schema_builder& builder) {
        // tables in the "system" keyspace which need to use null sharder
        static const std::unordered_set<sstring> tables = {
                // empty
        };
-        if (ks_name == system_keyspace::NAME && tables.contains(cf_name)) {
-            props.use_null_sharder = true;
+        if (builder.ks_name() == system_keyspace::NAME && tables.contains(builder.cf_name())) {
+            builder.set_use_null_sharder(true);
        }
    });
-    const auto set_wait_for_sync_to_commitlog = schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
+    const auto set_wait_for_sync_to_commitlog = schema_builder::register_schema_initializer([](schema_builder& builder) {
        static const std::unordered_set<sstring> tables = {
            system_keyspace::PAXOS,
        };
-        if (ks_name == system_keyspace::NAME && tables.contains(cf_name)) {
-            props.wait_for_sync_to_commitlog = true;
+        if (builder.ks_name() == system_keyspace::NAME && tables.contains(builder.cf_name())) {
+            builder.set_wait_for_sync_to_commitlog(true);
        }
    });
-    const auto set_use_schema_commitlog = schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
+    const auto set_use_schema_commitlog = schema_builder::register_schema_initializer([](schema_builder& builder) {
        static const std::unordered_set<sstring> tables = {
            schema_tables::SCYLLA_TABLE_SCHEMA_HISTORY,
            system_keyspace::BROADCAST_KV_STORE,
-            system_keyspace::CDC_GENERATIONS_V3,
            system_keyspace::RAFT,
            system_keyspace::RAFT_SNAPSHOTS,
            system_keyspace::RAFT_SNAPSHOT_CONFIG,
            system_keyspace::GROUP0_HISTORY,
            system_keyspace::DISCOVERY,
-            system_keyspace::TABLETS,
-            system_keyspace::TOPOLOGY,
-            system_keyspace::TOPOLOGY_REQUESTS,
            system_keyspace::LOCAL,
            system_keyspace::PEERS,
-            system_keyspace::SCYLLA_LOCAL,
            system_keyspace::COMMITLOG_CLEANUPS,
-            system_keyspace::SERVICE_LEVELS_V2,
-            system_keyspace::VIEW_BUILD_STATUS_V2,
-            system_keyspace::CDC_STREAMS_STATE,
-            system_keyspace::CDC_STREAMS_HISTORY,
-            system_keyspace::ROLES,
-            system_keyspace::ROLE_MEMBERS,
-            system_keyspace::ROLE_ATTRIBUTES,
-            system_keyspace::ROLE_PERMISSIONS,
            system_keyspace::v3::CDC_LOCAL,
-            system_keyspace::DICTS,
-            system_keyspace::VIEW_BUILDING_TASKS,
        };
-        if (ks_name == system_keyspace::NAME && tables.contains(cf_name)) {
-            props.enable_schema_commitlog();
+        if (builder.ks_name() == system_keyspace::NAME && tables.contains(builder.cf_name())) {
+            builder.enable_schema_commitlog();
        }
    });

    const auto set_group0_table_options =
-        schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
+        schema_builder::register_schema_initializer([](schema_builder& builder) {
            static const std::unordered_set<sstring> tables = {
                // scylla_local may store a replicated tombstone related to schema
                // (see `make_group0_schema_version_mutation`), so we include it in the group0 tables list.
@@ -137,9 +123,10 @@ namespace {
                system_keyspace::ROLE_PERMISSIONS,
                system_keyspace::DICTS,
                system_keyspace::VIEW_BUILDING_TASKS,
+                system_keyspace::REPAIR_TASKS,
            };
-            if (ks_name == system_keyspace::NAME && tables.contains(cf_name)) {
-                props.is_group0_table = true;
+            if (builder.ks_name() == system_keyspace::NAME && tables.contains(builder.cf_name())) {
+                builder.set_is_group0_table();
            }
        });
 }
@@ -462,6 +449,24 @@ schema_ptr system_keyspace::repair_history() {
    return schema;
 }

+schema_ptr system_keyspace::repair_tasks() {
+    static thread_local auto schema = [] {
+        auto id = generate_legacy_id(NAME, REPAIR_TASKS);
+        return schema_builder(NAME, REPAIR_TASKS, std::optional(id))
+            .with_column("task_uuid", uuid_type, column_kind::partition_key)
+            .with_column("operation", utf8_type, column_kind::clustering_key)
+            // First and last token of the tablet
+            .with_column("first_token", long_type, column_kind::clustering_key)
+            .with_column("last_token", long_type, column_kind::clustering_key)
+            .with_column("timestamp", timestamp_type)
+            .with_column("table_uuid", uuid_type, column_kind::static_column)
+            .set_comment("Record tablet repair tasks")
+            .with_hash_version()
+            .build();
+    }();
+    return schema;
+}
+
 schema_ptr system_keyspace::built_indexes() {
    static thread_local auto built_indexes = [] {
        schema_builder builder(generate_legacy_id(NAME, BUILT_INDEXES), NAME, BUILT_INDEXES,
@@ -1667,7 +1672,7 @@ schema_ptr system_keyspace::view_building_tasks() {
                .with_column("key", utf8_type, column_kind::partition_key)
                .with_column("id", timeuuid_type, column_kind::clustering_key)
                .with_column("type", utf8_type)
-                .with_column("state", utf8_type)
+                .with_column("aborted", boolean_type)
                .with_column("base_id", uuid_type)
                .with_column("view_id", uuid_type)
                .with_column("last_token", long_type)
@@ -2463,14 +2468,14 @@ future<bool> system_keyspace::cdc_is_rewritten() {
 }

 future<> system_keyspace::read_cdc_streams_state(std::optional<table_id> table,
-        noncopyable_function<future<>(table_id, db_clock::time_point, std::vector<cdc::stream_id>)> f) {
+        noncopyable_function<future<>(table_id, db_clock::time_point, utils::chunked_vector<cdc::stream_id>)> f) {
    static const sstring all_tables_query = format("SELECT table_id, timestamp, stream_id FROM {}.{}", NAME, CDC_STREAMS_STATE);
    static const sstring single_table_query = format("SELECT table_id, timestamp, stream_id FROM {}.{} WHERE table_id = ?", NAME, CDC_STREAMS_STATE);

    struct cur_t {
        table_id tid;
        db_clock::time_point ts;
-        std::vector<cdc::stream_id> streams;
+        utils::chunked_vector<cdc::stream_id> streams;
    };
    std::optional<cur_t> cur;

@@ -2487,7 +2492,7 @@ future<> system_keyspace::read_cdc_streams_state(std::optional<table_id> table,
            if (cur) {
                co_await f(cur->tid, cur->ts, std::move(cur->streams));
            }
-            cur = { tid, ts, std::vector<cdc::stream_id>() };
+            cur = { tid, ts, utils::chunked_vector<cdc::stream_id>() };
        }
        cur->streams.push_back(std::move(stream_id));

@@ -2499,9 +2504,10 @@ future<> system_keyspace::read_cdc_streams_state(std::optional<table_id> table,
    }
 }

-future<> system_keyspace::read_cdc_streams_history(table_id table,
+future<> system_keyspace::read_cdc_streams_history(table_id table, std::optional<db_clock::time_point> from,
        noncopyable_function<future<>(table_id, db_clock::time_point, cdc::cdc_stream_diff)> f) {
-    static const sstring query = format("SELECT table_id, timestamp, stream_state, stream_id FROM {}.{} WHERE table_id = ?", NAME, CDC_STREAMS_HISTORY);
+    static const sstring query_all = format("SELECT table_id, timestamp, stream_state, stream_id FROM {}.{} WHERE table_id = ?", NAME, CDC_STREAMS_HISTORY);
+    static const sstring query_from = format("SELECT table_id, timestamp, stream_state, stream_id FROM {}.{} WHERE table_id = ? AND timestamp > ?", NAME, CDC_STREAMS_HISTORY);

    struct cur_t {
        table_id tid;
@@ -2510,7 +2516,11 @@ future<> system_keyspace::read_cdc_streams_history(table_id table,
    };
    std::optional<cur_t> cur;

-    co_await _qp.query_internal(query, db::consistency_level::ONE, {table.uuid()}, 1000, [&] (const cql3::untyped_result_set_row& row) -> future<stop_iteration> {
+    co_await _qp.query_internal(from ? query_from : query_all,
+            db::consistency_level::ONE,
+            from ? data_value_list{table.uuid(), *from} : data_value_list{table.uuid()},
+            1000,
+            [&] (const cql3::untyped_result_set_row& row) -> future<stop_iteration> {
        auto tid = table_id(row.get_as<utils::UUID>("table_id"));
        auto ts = row.get_as<db_clock::time_point>("timestamp");
        auto stream_state = cdc::read_stream_state(row.get_as<int8_t>("stream_state"));
@@ -2594,6 +2604,7 @@ std::vector<schema_ptr> system_keyspace::all_tables(const db::config& cfg) {
                    corrupt_data(),
                    scylla_local(), db::schema_tables::scylla_table_schema_history(),
                    repair_history(),
+                    repair_tasks(),
                    v3::views_builds_in_progress(), v3::built_views(),
                    v3::scylla_views_builds_in_progress(),
                    v3::truncated(),
@@ -2842,6 +2853,32 @@ future<> system_keyspace::get_repair_history(::table_id table_id, repair_history
    });
 }

+future<utils::chunked_vector<canonical_mutation>> system_keyspace::get_update_repair_task_mutations(const repair_task_entry& entry, api::timestamp_type ts) {
+    // By default, expire repair task entries after 10 days; this should give the management tools enough time to query them.
+    constexpr int ttl = 10 * 24 * 3600;
+    sstring req = format("INSERT INTO system.{} (task_uuid, operation, first_token, last_token, timestamp, table_uuid) VALUES (?, ?, ?, ?, ?, ?) USING TTL {}", REPAIR_TASKS, ttl);
+    auto muts = co_await _qp.get_mutations_internal(req, internal_system_query_state(), ts,
+            {entry.task_uuid.uuid(), repair_task_operation_to_string(entry.operation),
+            entry.first_token, entry.last_token, entry.timestamp, entry.table_uuid.uuid()});
+    utils::chunked_vector<canonical_mutation> cmuts(muts.begin(), muts.end());
+    co_return cmuts;
+}
+
+future<> system_keyspace::get_repair_task(tasks::task_id task_uuid, repair_task_consumer f) {
+    sstring req = format("SELECT * from system.{} WHERE task_uuid = {}", REPAIR_TASKS, task_uuid);
+    co_await _qp.query_internal(req, [&f] (const cql3::untyped_result_set::row& row) mutable -> future<stop_iteration> {
+        repair_task_entry ent;
+        ent.task_uuid = tasks::task_id(row.get_as<utils::UUID>("task_uuid"));
+        ent.operation = repair_task_operation_from_string(row.get_as<sstring>("operation"));
+        ent.first_token = row.get_as<int64_t>("first_token");
+        ent.last_token = row.get_as<int64_t>("last_token");
+        ent.timestamp = row.get_as<db_clock::time_point>("timestamp");
+        ent.table_uuid = ::table_id(row.get_as<utils::UUID>("table_uuid"));
+        co_await f(std::move(ent));
+        co_return stop_iteration::no;
+    });
+}
+
 future<gms::generation_type> system_keyspace::increment_and_get_generation() {
    auto req = format("SELECT gossip_generation FROM system.{} WHERE key='{}'", LOCAL, LOCAL);
    auto rs = co_await _qp.execute_internal(req, cql3::query_processor::cache_internal::yes);
@@ -3057,14 +3094,14 @@ future<mutation> system_keyspace::make_remove_view_build_status_on_host_mutation
 static constexpr auto VIEW_BUILDING_KEY = "view_building";

 future<db::view::building_tasks> system_keyspace::get_view_building_tasks() {
-    static const sstring query = format("SELECT id, type, state, base_id, view_id, last_token, host_id, shard FROM {}.{} WHERE key = '{}'", NAME, VIEW_BUILDING_TASKS, VIEW_BUILDING_KEY);
+    static const sstring query = format("SELECT id, type, aborted, base_id, view_id, last_token, host_id, shard FROM {}.{} WHERE key = '{}'", NAME, VIEW_BUILDING_TASKS, VIEW_BUILDING_KEY);
    using namespace db::view;

    building_tasks tasks;
    co_await _qp.query_internal(query, [&] (const cql3::untyped_result_set_row& row) -> future<stop_iteration> {
        auto id = row.get_as<utils::UUID>("id");
        auto type = task_type_from_string(row.get_as<sstring>("type"));
-        auto state = task_state_from_string(row.get_as<sstring>("state"));
+        auto aborted = row.get_as<bool>("aborted");
        auto base_id = table_id(row.get_as<utils::UUID>("base_id"));
        auto view_id = row.get_opt<utils::UUID>("view_id").transform([] (const utils::UUID& uuid) { return table_id(uuid); });
        auto last_token = dht::token::from_int64(row.get_as<int64_t>("last_token"));
@@ -3072,7 +3109,7 @@ future<db::view::building_tasks> system_keyspace::get_view_building_tasks() {
        auto shard = unsigned(row.get_as<int32_t>("shard"));

        locator::tablet_replica replica{host_id, shard};
-        view_building_task task{id, type, state, base_id, view_id, replica, last_token};
+        view_building_task task{id, type, aborted, base_id, view_id, replica, last_token};

        switch (type) {
        case db::view::view_building_task::task_type::build_range:
@@ -3091,7 +3128,7 @@ future<db::view::building_tasks> system_keyspace::get_view_building_tasks() {
 }

 future<mutation> system_keyspace::make_view_building_task_mutation(api::timestamp_type ts, const db::view::view_building_task& task) {
-    static const sstring stmt = format("INSERT INTO {}.{}(key, id, type, state, base_id, view_id, last_token, host_id, shard) VALUES ('{}', ?, ?, ?, ?, ?, ?, ?, ?)", NAME, VIEW_BUILDING_TASKS, VIEW_BUILDING_KEY);
+    static const sstring stmt = format("INSERT INTO {}.{}(key, id, type, aborted, base_id, view_id, last_token, host_id, shard) VALUES ('{}', ?, ?, ?, ?, ?, ?, ?, ?)", NAME, VIEW_BUILDING_TASKS, VIEW_BUILDING_KEY);
    using namespace db::view;

    data_value_or_unset view_id = unset_value{};
@@ -3102,7 +3139,7 @@ future<mutation> system_keyspace::make_view_building_task_mutation(api::timestam
        view_id = data_value(task.view_id->uuid());
    }
    auto muts = co_await _qp.get_mutations_internal(stmt, internal_system_query_state(), ts, {
-            task.id, task_type_to_sstring(task.type), task_state_to_sstring(task.state),
+            task.id, task_type_to_sstring(task.type), task.aborted,
            task.base_id.uuid(), view_id, dht::token::to_int64(task.last_token),
            task.replica.host.uuid(), int32_t(task.replica.shard)
    });
@@ -3112,18 +3149,6 @@ future<mutation> system_keyspace::make_view_building_task_mutation(api::timestam
    co_return std::move(muts[0]);
 }

-future<mutation> system_keyspace::make_update_view_building_task_state_mutation(api::timestamp_type ts, utils::UUID id, db::view::view_building_task::task_state state) {
-    static const sstring stmt = format("UPDATE {}.{} SET state = ? WHERE key = '{}' AND id = ?", NAME, VIEW_BUILDING_TASKS, VIEW_BUILDING_KEY);
-
-    auto muts = co_await _qp.get_mutations_internal(stmt, internal_system_query_state(), ts, {
-            task_state_to_sstring(state), id
-    });
-    if (muts.size() != 1) {
-        on_internal_error(slogger, fmt::format("expected 1 mutation got {}", muts.size()));
-    }
-    co_return std::move(muts[0]);
-}
-
 future<mutation> system_keyspace::make_remove_view_building_task_mutation(api::timestamp_type ts, utils::UUID id) {
    static const sstring stmt = format("DELETE FROM {}.{} WHERE key = '{}' AND id = ?", NAME, VIEW_BUILDING_TASKS, VIEW_BUILDING_KEY);

@@ -3255,7 +3280,9 @@ future<mutation> system_keyspace::get_group0_history(sharded<replica::database>&
    SCYLLA_ASSERT(rs);
    auto& ps = rs->partitions();
    for (auto& p: ps) {
-        auto mut = p.mut().unfreeze(s);
+        // Note: we could decorate the frozen_mutation's key to check if it's the expected one
+        // but since this is a single partition table, we can just check after unfreezing the whole mutation.
+        auto mut = co_await unfreeze_gently(p.mut(), s);
        auto partition_key = value_cast<sstring>(utf8_type->deserialize(mut.key().get_component(*s, 0)));
        if (partition_key == GROUP0_HISTORY_KEY) {
            co_return mut;
@@ -3479,7 +3506,7 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
            supported_features = decode_features(deserialize_set_column(*topology(), row, "supported_features"));
        }

-        if (row.has("topology_request")) {
+        if (row.has("topology_request") && nstate != service::node_state::left) {
            auto req = service::topology_request_from_string(row.get_as<sstring>("topology_request"));
            ret.requests.emplace(host_id, req);
            switch(req) {
@@ -4000,4 +4027,35 @@ future<> system_keyspace::apply_mutation(mutation m) {
    return _qp.proxy().mutate_locally(m, {}, db::commitlog::force_sync(m.schema()->static_props().wait_for_sync_to_commitlog), db::no_timeout);
 }

+// The names are persisted in system tables so should not be changed.
+static const std::unordered_map<system_keyspace::repair_task_operation, sstring> repair_task_operation_to_name = {
+    {system_keyspace::repair_task_operation::requested, "requested"},
+    {system_keyspace::repair_task_operation::finished, "finished"},
+};
+
+static const std::unordered_map<sstring, system_keyspace::repair_task_operation> repair_task_operation_from_name = std::invoke([] {
+    std::unordered_map<sstring, system_keyspace::repair_task_operation> result;
+    for (auto&& [v, s] : repair_task_operation_to_name) {
+        result.emplace(s, v);
+    }
+    return result;
+});
+
+sstring system_keyspace::repair_task_operation_to_string(system_keyspace::repair_task_operation op) {
+    auto i = repair_task_operation_to_name.find(op);
+    if (i == repair_task_operation_to_name.end()) {
+        on_internal_error(slogger, format("Invalid repair task operation: {}", static_cast<int>(op)));
+    }
+    return i->second;
+}
+
+system_keyspace::repair_task_operation system_keyspace::repair_task_operation_from_string(const sstring& name) {
+    return repair_task_operation_from_name.at(name);
+}
+
 } // namespace db
+
+auto fmt::formatter<db::system_keyspace::repair_task_operation>::format(const db::system_keyspace::repair_task_operation& op, fmt::format_context& ctx) const
+        -> decltype(ctx.out()) {
+    return fmt::format_to(ctx.out(), "{}", db::system_keyspace::repair_task_operation_to_string(op));
+}
--- a/db/system_keyspace.hh
+++ b/db/system_keyspace.hh
@@ -57,6 +57,8 @@ namespace paxos {
 struct topology_request_state;

 class group0_guard;
+
+class raft_group0_client;
 }

 namespace netw {
@@ -184,6 +186,7 @@ public:
    static constexpr auto RAFT_SNAPSHOTS = "raft_snapshots";
    static constexpr auto RAFT_SNAPSHOT_CONFIG = "raft_snapshot_config";
    static constexpr auto REPAIR_HISTORY = "repair_history";
+    static constexpr auto REPAIR_TASKS = "repair_tasks";
    static constexpr auto GROUP0_HISTORY = "group0_history";
    static constexpr auto DISCOVERY = "discovery";
    static constexpr auto BROADCAST_KV_STORE = "broadcast_kv_store";
@@ -198,6 +201,15 @@ public:
    static constexpr auto VIEW_BUILD_STATUS_V2 = "view_build_status_v2";
    static constexpr auto DICTS = "dicts";
    static constexpr auto VIEW_BUILDING_TASKS = "view_building_tasks";
+    static constexpr auto VERSIONS = "versions";
+    static constexpr auto BATCHES = "batches";
+    static constexpr auto AVAILABLE_RANGES = "available_ranges";
+    static constexpr auto VIEWS_BUILDS_IN_PROGRESS = "views_builds_in_progress";
+    static constexpr auto BUILT_VIEWS = "built_views";
+    static constexpr auto SCYLLA_VIEWS_BUILDS_IN_PROGRESS = "scylla_views_builds_in_progress";
+    static constexpr auto CDC_LOCAL = "cdc_local";
+    static constexpr auto CDC_TIMESTAMPS = "cdc_timestamps";
+    static constexpr auto CDC_STREAMS = "cdc_streams";

    // auth
    static constexpr auto ROLES = "roles";
@@ -282,6 +294,7 @@ public:
    static schema_ptr raft();
    static schema_ptr raft_snapshots();
    static schema_ptr repair_history();
+    static schema_ptr repair_tasks();
    static schema_ptr group0_history();
    static schema_ptr discovery();
    static schema_ptr broadcast_kv_store();
@@ -420,6 +433,22 @@ public:
        int64_t range_end;
    };

+    enum class repair_task_operation {
+        requested,
+        finished,
+    };
+    static sstring repair_task_operation_to_string(repair_task_operation op);
+    static repair_task_operation repair_task_operation_from_string(const sstring& name);
+
+    struct repair_task_entry {
+        tasks::task_id task_uuid;
+        repair_task_operation operation;
+        int64_t first_token;
+        int64_t last_token;
+        db_clock::time_point timestamp;
+        table_id table_uuid;
+    };
+
    struct topology_requests_entry {
        utils::UUID id;
        utils::UUID initiating_host;
@@ -441,6 +470,10 @@ public:
    using repair_history_consumer = noncopyable_function<future<>(const repair_history_entry&)>;
    future<> get_repair_history(table_id, repair_history_consumer f);

+    future<utils::chunked_vector<canonical_mutation>> get_update_repair_task_mutations(const repair_task_entry& entry, api::timestamp_type ts);
+    using repair_task_consumer = noncopyable_function<future<>(const repair_task_entry&)>;
+    future<> get_repair_task(tasks::task_id task_uuid, repair_task_consumer f);
+
    future<> save_truncation_record(const replica::column_family&, db_clock::time_point truncated_at, db::replay_position);
    future<replay_positions> get_truncated_positions(table_id);
    future<> drop_truncation_rp_records();
@@ -576,7 +609,6 @@ public:
    // system.view_building_tasks
    future<db::view::building_tasks> get_view_building_tasks();
    future<mutation> make_view_building_task_mutation(api::timestamp_type ts, const db::view::view_building_task& task);
-    future<mutation> make_update_view_building_task_state_mutation(api::timestamp_type ts, utils::UUID id, db::view::view_building_task::task_state state);
    future<mutation> make_remove_view_building_task_mutation(api::timestamp_type ts, utils::UUID id);

    // system.scylla_local, view_building_processing_base key
@@ -601,8 +633,8 @@ public:
    future<bool> cdc_is_rewritten();
    future<> cdc_set_rewritten(std::optional<cdc::generation_id_v1>);

-    future<> read_cdc_streams_state(std::optional<table_id> table, noncopyable_function<future<>(table_id, db_clock::time_point, std::vector<cdc::stream_id>)> f);
-    future<> read_cdc_streams_history(table_id table, noncopyable_function<future<>(table_id, db_clock::time_point, cdc::cdc_stream_diff)> f);
+    future<> read_cdc_streams_state(std::optional<table_id> table, noncopyable_function<future<>(table_id, db_clock::time_point, utils::chunked_vector<cdc::stream_id>)> f);
+    future<> read_cdc_streams_history(table_id table, std::optional<db_clock::time_point> from, noncopyable_function<future<>(table_id, db_clock::time_point, cdc::cdc_stream_diff)> f);

    // Load Raft Group 0 id from scylla.local
    future<utils::UUID> get_raft_group0_id();
@@ -746,3 +778,8 @@ public:
 }; // class system_keyspace

 } // namespace db
+
+template <>
+struct fmt::formatter<db::system_keyspace::repair_task_operation> : fmt::formatter<string_view> {
+    auto format(const db::system_keyspace::repair_task_operation&, fmt::format_context& ctx) const -> decltype(ctx.out());
+};
--- a/db/view/view.cc
+++ b/db/view/view.cc
@@ -26,6 +26,7 @@
 #include <seastar/coroutine/maybe_yield.hh>
 #include <flat_map>

+#include "db/config.hh"
 #include "db/view/base_info.hh"
 #include "db/view/view_build_status.hh"
 #include "db/view/view_consumer.hh"
@@ -929,8 +930,7 @@ bool view_updates::can_skip_view_updates(const clustering_or_static_row& update,
    const row& existing_row = existing.cells();
    const row& updated_row = update.cells();

-    const bool base_has_nonexpiring_marker = update.marker().is_live() && !update.marker().is_expiring();
-    return std::ranges::all_of(_base->regular_columns(), [this, &updated_row, &existing_row, base_has_nonexpiring_marker] (const column_definition& cdef) {
+    return std::ranges::all_of(_base->regular_columns(), [this, &updated_row, &existing_row] (const column_definition& cdef) {
        const auto view_it = _view->columns_by_name().find(cdef.name());
        const bool column_is_selected = view_it != _view->columns_by_name().end();

@@ -938,49 +938,29 @@ bool view_updates::can_skip_view_updates(const clustering_or_static_row& update,
        // as part of its PK, there are NO virtual columns corresponding to the unselected columns in the view.
        // Because of that, we don't generate view updates when the value in an unselected column is created
        // or changes.
-        if (!column_is_selected && _base_info.has_base_non_pk_columns_in_view_pk) {
+        if (!column_is_selected) {
            return true;
        }

-        //TODO(sarna): Optimize collections case - currently they do not go under optimization
-        if (!cdef.is_atomic()) {
-            return false;
-        }
-
-        // We cannot skip if the value was created or deleted, unless we have a non-expiring marker
+        // We cannot skip if the value was created or deleted
        const auto* existing_cell = existing_row.find_cell(cdef.id);
        const auto* updated_cell = updated_row.find_cell(cdef.id);
        if (existing_cell == nullptr || updated_cell == nullptr) {
-            return existing_cell == updated_cell || (!column_is_selected && base_has_nonexpiring_marker);
+            return existing_cell == updated_cell;
        }
+
+        if (!cdef.is_atomic()) {
+            return existing_cell->as_collection_mutation().data == updated_cell->as_collection_mutation().data;
+        }
+
        atomic_cell_view existing_cell_view = existing_cell->as_atomic_cell(cdef);
        atomic_cell_view updated_cell_view = updated_cell->as_atomic_cell(cdef);

        // We cannot skip when a selected column is changed
-        if (column_is_selected) {
-            if (view_it->second->is_view_virtual()) {
-                return atomic_cells_liveness_equal(existing_cell_view, updated_cell_view);
-            }
-            return compare_atomic_cell_for_merge(existing_cell_view, updated_cell_view) == 0;
+        if (view_it->second->is_view_virtual()) {
+            return atomic_cells_liveness_equal(existing_cell_view, updated_cell_view);
        }
-
-        // With non-expiring row marker, liveness checks below are not relevant
-        if (base_has_nonexpiring_marker) {
-            return true;
-        }
-
-        if (existing_cell_view.is_live() != updated_cell_view.is_live()) {
-            return false;
-        }
-
-        // We cannot skip if the change updates TTL
-        const bool existing_has_ttl = existing_cell_view.is_live_and_has_ttl();
-        const bool updated_has_ttl = updated_cell_view.is_live_and_has_ttl();
-        if (existing_has_ttl || updated_has_ttl) {
-            return existing_has_ttl == updated_has_ttl && existing_cell_view.expiry() == updated_cell_view.expiry();
-        }
-
-        return true;
+        return compare_atomic_cell_for_merge(existing_cell_view, updated_cell_view) == 0;
    });
 }

@@ -3305,15 +3285,6 @@ public:
                          _step.base->schema()->cf_name(), _step.current_token(), view_names);
        }
        if (_step.reader.is_end_of_stream() && _step.reader.is_buffer_empty()) {
-            if (_step.current_key.key().is_empty()) {
-                // consumer got end-of-stream without consuming a single partition
-                vlogger.debug("Reader didn't produce anything, marking views as built");
-                while (!_step.build_status.empty()) {
-                    _built_views.views.push_back(std::move(_step.build_status.back()));
-                    _step.build_status.pop_back();
-                }
-            }
-
            // before going back to the minimum token, advance current_key to the end
            // and check for built views in that range.
            _step.current_key = { _step.prange.end().value_or(dht::ring_position::max()).value().token(), partition_key::make_empty()};
@@ -3332,6 +3303,7 @@ public:

 // Called in the context of a seastar::thread.
 void view_builder::execute(build_step& step, exponential_backoff_retry r) {
+    inject_failure("dont_start_build_step");
    gc_clock::time_point now = gc_clock::now();
    auto compaction_state = make_lw_shared<compact_for_query_state>(
            *step.reader.schema(),
@@ -3365,6 +3337,7 @@ void view_builder::execute(build_step& step, exponential_backoff_retry r) {
    seastar::when_all_succeed(bookkeeping_ops.begin(), bookkeeping_ops.end()).handle_exception([] (std::exception_ptr ep) {
        vlogger.warn("Failed to update materialized view bookkeeping ({}), continuing anyway.", ep);
    }).get();
+    utils::get_local_injector().inject("delay_finishing_build_step", utils::wait_for_message(60s)).get();
 }

 future<> view_builder::mark_as_built(view_ptr view) {
@@ -3715,5 +3688,22 @@ sstring build_status_to_sstring(build_status status) {
    on_internal_error(vlogger, fmt::format("Unknown view build status: {}", (int)status));
 }

+void validate_view_keyspace(const data_dictionary::database& db, std::string_view keyspace_name) {
+    const bool tablet_views_enabled = db.features().views_with_tablets;
+    // Note: if the configuration option `rf_rack_valid_keyspaces` is enabled, we can be
+    //       sure that all tablet-based keyspaces are RF-rack-valid. We check that
+    //       at start-up and then we don't allow for creating RF-rack-invalid keyspaces.
+    const bool rf_rack_valid_keyspaces = db.get_config().rf_rack_valid_keyspaces();
+    const bool required_config = tablet_views_enabled && rf_rack_valid_keyspaces;
+
+    const bool uses_tablets = db.find_keyspace(keyspace_name).get_replication_strategy().uses_tablets();
+
+    if (!required_config && uses_tablets) {
+        throw std::logic_error("Materialized views and secondary indexes are not supported on base tables with tablets. "
+                "To be able to use them, enable the configuration option `rf_rack_valid_keyspaces` and make sure "
+                "that the cluster feature `VIEWS_WITH_TABLETS` is enabled.");
+    }
+}
+
 } // namespace view
 } // namespace db
--- a/db/view/view.hh
+++ b/db/view/view.hh
@@ -309,6 +309,18 @@ endpoints_to_update get_view_natural_endpoint(
    bool use_tablets_basic_rack_aware_view_pairing,
    replica::cf_stats& cf_stats);

+/// Verify that the provided keyspace is eligible for storing materialized views.
+///
+/// Result:
+/// * If the keyspace is eligible, no effect.
+/// * If the keyspace is not eligible, an exception is thrown. Its type is not specified,
+///   and the user of this function cannot make any assumption about it. The carried exception
+///   message will be worded in a way that can be directly passed on to the end user.
+///
+/// Preconditions:
+/// * The provided `keyspace_name` must correspond to an existing keyspace.
+void validate_view_keyspace(const data_dictionary::database&, std::string_view keyspace_name);
+
 }

 }
--- a/db/view/view_building_coordinator.cc
+++ b/db/view/view_building_coordinator.cc
@@ -29,6 +29,10 @@
 #include "db/view/view_building_task_mutation_builder.hh"
 #include "utils/assert.hh"
 #include "idl/view.dist.hh"
+#include "utils/error_injection.hh"
+#include "utils/log.hh"
+
+using namespace std::chrono_literals;

 static logging::logger vbc_logger("view_building_coordinator");

@@ -102,6 +106,8 @@ future<> view_building_coordinator::run() {
        _vb_sm.event.broadcast();
    });

+    auto finished_tasks_gc_fiber = finished_task_gc_fiber();
+
    while (!_as.abort_requested()) {
        co_await utils::get_local_injector().inject("view_building_coordinator_pause_main_loop", utils::wait_for_message(std::chrono::minutes(2)));
        if (utils::get_local_injector().enter("view_building_coordinator_skip_main_loop")) {
@@ -119,12 +125,7 @@ future<> view_building_coordinator::run() {
                continue;
            }

-            auto started_new_work = co_await work_on_view_building(std::move(*guard_opt));
-            if (started_new_work) {
-                // If any tasks were started, do another iteration, so the coordinator can attach itself to the tasks (via RPC)
-                vbc_logger.debug("view building coordinator started new tasks, do next iteration without waiting for event");
-                continue;
-            }
+            co_await work_on_view_building(std::move(*guard_opt));
            co_await await_event();
        } catch (...) {
            handle_coordinator_error(std::current_exception());
@@ -140,6 +141,66 @@ future<> view_building_coordinator::run() {
            }
        }
    }
+
+    co_await std::move(finished_tasks_gc_fiber);
+}
+
+future<> view_building_coordinator::finished_task_gc_fiber() {
+    static auto task_gc_interval = 200ms;
+
+    while (!_as.abort_requested()) {
+        try {
+            co_await clean_finished_tasks();
+            co_await sleep_abortable(task_gc_interval, _as);
+        } catch (abort_requested_exception&) {
+            vbc_logger.debug("view_building_coordinator::finished_task_gc_fiber got abort_requested_exception");
+        } catch (service::group0_concurrent_modification&) {
+            vbc_logger.info("view_building_coordinator::finished_task_gc_fiber got group0_concurrent_modification");
+        } catch (raft::request_aborted&) {
+            vbc_logger.debug("view_building_coordinator::finished_task_gc_fiber got raft::request_aborted");
+        } catch (service::term_changed_error&) {
+            vbc_logger.debug("view_building_coordinator::finished_task_gc_fiber notices term change {} -> {}", _term, _raft.get_current_term());
+        } catch (raft::commit_status_unknown&) {
+            vbc_logger.warn("view_building_coordinator::finished_task_gc_fiber got raft::commit_status_unknown");
+        } catch (...) {
+            vbc_logger.error("view_building_coordinator::finished_task_gc_fiber got error: {}", std::current_exception());
+        }
+    }
+}
+
+future<> view_building_coordinator::clean_finished_tasks() {
+    // Avoid acquiring a group0 operation if there are no tasks.
+    if (_finished_tasks.empty()) {
+        co_return;
+    }
+
+    auto guard = co_await start_operation();
+    auto lock = co_await get_unique_lock(_mutex);
+
+    if (!_vb_sm.building_state.currently_processed_base_table || std::ranges::all_of(_finished_tasks, [] (auto& e) { return e.second.empty(); })) {
+        co_return;
+    }
+
+    view_building_task_mutation_builder builder(guard.write_timestamp());
+    for (auto& [replica, tasks]: _finished_tasks) {
+        for (auto& task_id: tasks) {
+            // The task might be aborted in the meantime. In this case we cannot remove it because we need it to create a new task.
+            //
+            // TODO: When we're aborting a view building task (for instance due to tablet migration),
+            //       we can look if we already finished it (check if it's in `_finished_tasks`).
+            //       If yes, we can just remove it instead of aborting it.
+            auto task_opt = _vb_sm.building_state.get_task(*_vb_sm.building_state.currently_processed_base_table, replica, task_id);
+            if (task_opt && !task_opt->get().aborted) {
+                builder.del_task(task_id);
+                vbc_logger.debug("Removing finished task with ID: {}", task_id);
+            }
+        }
+    }
+
+    co_await commit_mutations(std::move(guard), {builder.build()}, "remove finished view building tasks");
+    for (auto& [_, tasks_set]: _finished_tasks) {
+        tasks_set.clear();
+    }
 }

 future<std::optional<service::group0_guard>> view_building_coordinator::update_state(service::group0_guard guard) {
@@ -299,18 +360,16 @@ future<> view_building_coordinator::update_views_statuses(const service::group0_
    }
 }

-future<bool> view_building_coordinator::work_on_view_building(service::group0_guard guard) {
+future<> view_building_coordinator::work_on_view_building(service::group0_guard guard) {
    if (!_vb_sm.building_state.currently_processed_base_table) {
        vbc_logger.debug("No base table is selected, nothing to do.");
-        co_return false;
+        co_return;
    }

-    utils::chunked_vector<mutation> muts;
-    std::unordered_set<locator::tablet_replica> _remote_work_keys_to_erase;
+    // Acquire the unique lock guarding `_finished_tasks` to ensure each replica has its own entry in it
+    // and to select tasks for the replicas.
+    auto lock = co_await get_unique_lock(_mutex);
    for (auto& replica: get_replicas_with_tasks()) {
-        // Check whether the coordinator already waits for the remote work on the replica to be finished.
-        // If so: check if the work is done and and remove the shared_future, skip this replica otherwise.
-        bool skip_work_on_this_replica = false;
        if (_remote_work.contains(replica)) {
            if (!_remote_work[replica].available()) {
                vbc_logger.debug("Replica {} is still doing work", replica);
@@ -318,51 +377,25 @@ future<bool> view_building_coordinator::work_on_view_building(service::group0_gu
            }

            auto remote_results_opt = co_await _remote_work[replica].get_future();
-            if (remote_results_opt) {
-                auto results_muts = co_await update_state_after_work_is_done(guard, replica, std::move(*remote_results_opt));
-                muts.insert(muts.end(), std::make_move_iterator(results_muts.begin()), std::make_move_iterator(results_muts.end()));
-                // If the replica successfully finished its work, we need to commit mutations generated above before selecting next task
-                skip_work_on_this_replica = !results_muts.empty();
-            }
-
-            // If there were no mutations for this replica, we can just remove the entry from `_remote_work` map
-            // and start new work in the same iteration.
-            // Otherwise, the entry needs to be removed after the mutations are committed successfully.
-            if (skip_work_on_this_replica) {
-                _remote_work_keys_to_erase.insert(replica);
-            } else {
-                _remote_work.erase(replica);
-            }
+            _remote_work.erase(replica);
        }
-        if (!_gossiper.is_alive(replica.host)) {
+
+        const bool ignore_gossiper = utils::get_local_injector().enter("view_building_coordinator_ignore_gossiper");
+        if (!_gossiper.is_alive(replica.host) && !ignore_gossiper) {
            vbc_logger.debug("Replica {} is dead", replica);
            continue;
        }
-        if (skip_work_on_this_replica) {
-            continue;
+
+        if (!_finished_tasks.contains(replica)) {
+            _finished_tasks.insert({replica, {}});
        }

-        if (auto already_started_ids = _vb_sm.building_state.get_started_tasks(*_vb_sm.building_state.currently_processed_base_table, replica); !already_started_ids.empty()) {
-            // If the replica has any task in `STARTED` state, attach the coordinator to the work.
-            attach_to_started_tasks(replica, std::move(already_started_ids));
-        } else if (auto todo_ids = select_tasks_for_replica(replica); !todo_ids.empty()) {
-            // If the replica has no started tasks and there are tasks to do, mark them as started.
-            // The coordinator will attach itself to the work in next iteration.
-            auto new_mutations = co_await start_tasks(guard, std::move(todo_ids));
-            muts.insert(muts.end(), std::make_move_iterator(new_mutations.begin()), std::make_move_iterator(new_mutations.end()));
+        if (auto todo_ids = select_tasks_for_replica(replica); !todo_ids.empty()) {
+            start_remote_worker(replica, std::move(todo_ids));
        } else {
            vbc_logger.debug("Nothing to do for replica {}", replica);
        }
    }
-
-    if (!muts.empty()) {
-        co_await commit_mutations(std::move(guard), std::move(muts), "start view building tasks");
-        for (auto& key: _remote_work_keys_to_erase) {
-            _remote_work.erase(key);
-        }
-        co_return true;
-    }
-    co_return false;
 }

 std::set<locator::tablet_replica> view_building_coordinator::get_replicas_with_tasks() {
@@ -385,7 +418,7 @@ std::vector<utils::UUID> view_building_coordinator::select_tasks_for_replica(loc
    // Select only building tasks and return theirs ids
    auto filter_building_tasks = [] (const std::vector<view_building_task>& tasks) -> std::vector<utils::UUID> {
        return tasks | std::views::filter([] (const view_building_task& t) {
-            return t.type == view_building_task::task_type::build_range && t.state == view_building_task::task_state::idle;
+            return t.type == view_building_task::task_type::build_range && !t.aborted;
        }) | std::views::transform([] (const view_building_task& t) {
            return t.id;
        }) | std::ranges::to<std::vector>();
@@ -399,7 +432,29 @@ std::vector<utils::UUID> view_building_coordinator::select_tasks_for_replica(loc
    }

    auto& tablet_map = _db.get_token_metadata().tablets().get_tablet_map(*_vb_sm.building_state.currently_processed_base_table);
-    for (auto& [token, tasks]: _vb_sm.building_state.collect_tasks_by_last_token(*_vb_sm.building_state.currently_processed_base_table, replica)) {
+    auto tasks_by_last_token = _vb_sm.building_state.collect_tasks_by_last_token(*_vb_sm.building_state.currently_processed_base_table, replica);
+
+    // Remove completed tasks in `_finished_tasks` from `tasks_by_last_token`
+    auto it = tasks_by_last_token.begin();
+    while (it != tasks_by_last_token.end()) {
+        auto task_it = it->second.begin();
+        while (task_it != it->second.end()) {
+            if (_finished_tasks.at(replica).contains(task_it->id)) {
+                task_it = it->second.erase(task_it);
+            } else {
+                ++task_it;
+            }
+        }
+
+        // Remove the entry from `tasks_by_last_token` if its vector is empty
+        if (it->second.empty()) {
+            it = tasks_by_last_token.erase(it);
+        } else {
+            ++it;
+        }
+    }
+
+    for (auto& [token, tasks]: tasks_by_last_token) {
        auto tid = tablet_map.get_tablet_id(token);
        if (tablet_map.get_tablet_transition_info(tid)) {
            vbc_logger.debug("Tablet {} on replica {} is in transition.", tid, replica);
@@ -411,7 +466,7 @@ std::vector<utils::UUID> view_building_coordinator::select_tasks_for_replica(loc
            return building_tasks;
        } else {
            return tasks | std::views::filter([] (const view_building_task& t) {
-                return t.state == view_building_task::task_state::idle;
+                return !t.aborted;
            }) | std::views::transform([] (const view_building_task& t) {
                return t.id;
            }) | std::ranges::to<std::vector>();
@@ -421,71 +476,41 @@ std::vector<utils::UUID> view_building_coordinator::select_tasks_for_replica(loc
    return {};
 }

-future<utils::chunked_vector<mutation>> view_building_coordinator::start_tasks(const service::group0_guard& guard, std::vector<utils::UUID> tasks) {
-    vbc_logger.info("Starting tasks {}", tasks);
-
-    utils::chunked_vector<mutation> muts;
-    for (auto& t: tasks) {
-        auto mut = co_await _sys_ks.make_update_view_building_task_state_mutation(guard.write_timestamp(), t, view_building_task::task_state::started);
-        muts.push_back(std::move(mut));
-    }
-    co_return muts;
-}
-
-void view_building_coordinator::attach_to_started_tasks(const locator::tablet_replica& replica, std::vector<utils::UUID> tasks) {
+void view_building_coordinator::start_remote_worker(const locator::tablet_replica& replica, std::vector<utils::UUID> tasks) {
    vbc_logger.debug("Attaching to started tasks {} on replica {}", tasks, replica);
-    shared_future<std::optional<remote_work_results>> work = work_on_tasks(replica, std::move(tasks));
+    shared_future<std::optional<std::vector<utils::UUID>>> work = work_on_tasks(replica, std::move(tasks));
    _remote_work.insert({replica, std::move(work)});
 }

-future<std::optional<view_building_coordinator::remote_work_results>> view_building_coordinator::work_on_tasks(locator::tablet_replica replica, std::vector<utils::UUID> tasks) {
-    std::vector<view_task_result> remote_results;
+future<std::optional<std::vector<utils::UUID>>> view_building_coordinator::work_on_tasks(locator::tablet_replica replica, std::vector<utils::UUID> tasks) {
+    constexpr auto backoff_duration = std::chrono::seconds(1);
+    static thread_local logger::rate_limit rate_limit{backoff_duration};
+
+    std::vector<utils::UUID> remote_results;
+    bool rpc_failed = false;
+
    try {
-        remote_results = co_await ser::view_rpc_verbs::send_work_on_view_building_tasks(&_messaging, replica.host, _as, tasks);
+        remote_results = co_await ser::view_rpc_verbs::send_work_on_view_building_tasks(&_messaging, replica.host, _as, _term, replica.shard, tasks);
    } catch (...) {
-        vbc_logger.warn("Work on tasks {} on replica {}, failed with error: {}", tasks, replica, std::current_exception());
+        vbc_logger.log(log_level::warn, rate_limit, "Work on tasks {} on replica {}, failed with error: {}",
+                tasks, replica, std::current_exception());
+        rpc_failed = true;
+    }
+
+    if (rpc_failed) {
+        co_await seastar::sleep(backoff_duration);
        _vb_sm.event.broadcast();
        co_return std::nullopt;
    }

-    if (tasks.size() != remote_results.size()) {
-        on_internal_error(vbc_logger, fmt::format("Number of tasks ({}) and results ({}) do not match for replica {}", tasks.size(), remote_results.size(), replica));
-    }
+    // In `view_building_coordinator::work_on_view_building()` we made sure that
+    // each replica has its own entry in `_finished_tasks`, so here a shared lock suffices: there is at most
+    // one instance of this method per replica, and it only inserts its finished tasks into that replica's bucket.
+    auto lock = co_await get_shared_lock(_mutex);
+    _finished_tasks.at(replica).insert_range(remote_results);

-    remote_work_results results;
-    for (size_t i = 0; i < tasks.size(); ++i) {
-        results.push_back({tasks[i], remote_results[i]});
-    }
    _vb_sm.event.broadcast();
-    co_return results;
-}
-
-// Mark finished task as done (remove them from the table).
-// Retry failed tasks if possible (if failed tasks wasn't aborted).
-future<utils::chunked_vector<mutation>> view_building_coordinator::update_state_after_work_is_done(const service::group0_guard& guard, const locator::tablet_replica& replica, view_building_coordinator::remote_work_results results) {
-    vbc_logger.debug("Got results from replica {}: {}", replica, results);
-
-    utils::chunked_vector<mutation> muts;
-    for (auto& result: results) {
-        vbc_logger.info("Task {} was finished with result: {}", result.first, result.second);
-
-        if (!_vb_sm.building_state.currently_processed_base_table) {
-            continue;
-        }
-
-        // A task can be aborted by deleting it or by setting its state to `ABORTED`.
-        // If the task was aborted by changing the state,
-        // we shouldn't remove it here because it might be needed
-        // to generate updated after tablet operation (migration/resize)
-        // is finished.
-        auto task_opt = _vb_sm.building_state.get_task(*_vb_sm.building_state.currently_processed_base_table, replica, result.first);
-        if (task_opt && task_opt->get().state != view_building_task::task_state::aborted) {
-            // Otherwise, the task was completed successfully and we can remove it.
-            auto delete_mut = co_await _sys_ks.make_remove_view_building_task_mutation(guard.write_timestamp(), result.first);
-            muts.push_back(std::move(delete_mut));
-        }
-    }
-    co_return muts;
+    co_return remote_results;
 }

 future<> view_building_coordinator::stop() {
@@ -515,7 +540,7 @@ void view_building_coordinator::generate_tablet_migration_updates(utils::chunked
    auto create_task_copy_on_pending_replica = [&] (const view_building_task& task) {
        auto new_id = builder.new_id();
        builder.set_type(new_id, task.type)
-                .set_state(new_id, view_building_task::task_state::idle)
+                .set_aborted(new_id, false)
                .set_base_id(new_id, task.base_id)
                .set_last_token(new_id, task.last_token)
                .set_replica(new_id, *trinfo.pending_replica);
@@ -583,7 +608,7 @@ void view_building_coordinator::generate_tablet_resize_updates(utils::chunked_ve
    auto create_task_copy = [&] (const view_building_task& task, dht::token last_token) -> utils::UUID {
        auto new_id = builder.new_id();
        builder.set_type(new_id, task.type)
-                .set_state(new_id, view_building_task::task_state::idle)
+                .set_aborted(new_id, false)
                .set_base_id(new_id, task.base_id)
                .set_last_token(new_id, last_token)
                .set_replica(new_id, task.replica);
@@ -652,7 +677,7 @@ void view_building_coordinator::abort_tasks(utils::chunked_vector<canonical_muta
    auto abort_task_map = [&] (const task_map& task_map) {
        for (auto& [id, _]: task_map) {
            vbc_logger.debug("Aborting task {}", id);
-            builder.set_state(id, view_building_task::task_state::aborted);
+            builder.set_aborted(id, true);
        }
    };

@@ -682,7 +707,7 @@ void abort_view_building_tasks(const view_building_state_machine& vb_sm,
        for (auto& [id, task]: task_map) {
            if (task.last_token == last_token) {
                vbc_logger.debug("Aborting task {}", id);
-                builder.set_state(id, view_building_task::task_state::aborted);
+                builder.set_aborted(id, true);
            }
        }
    };
@@ -698,10 +723,10 @@ void abort_view_building_tasks(const view_building_state_machine& vb_sm,

 static void rollback_task_map(view_building_task_mutation_builder& builder, const task_map& task_map) {
    for (auto& [id, task]: task_map) {
-        if (task.state == view_building_task::task_state::aborted) {
+        if (task.aborted) {
            auto new_id = builder.new_id();
            builder.set_type(new_id, task.type)
-                .set_state(new_id, view_building_task::task_state::idle)
+                .set_aborted(new_id, false)
                .set_base_id(new_id, task.base_id)
                .set_last_token(new_id, task.last_token)
                .set_replica(new_id, task.replica);
--- a/db/view/view_building_coordinator.hh
+++ b/db/view/view_building_coordinator.hh
@@ -54,9 +54,9 @@ class view_building_coordinator : public service::endpoint_lifecycle_subscriber
    const raft::term_t _term;
    abort_source& _as;

-
-    using remote_work_results = std::vector<std::pair<utils::UUID, db::view::view_task_result>>;
-    std::unordered_map<locator::tablet_replica, shared_future<std::optional<remote_work_results>>> _remote_work;
+    std::unordered_map<locator::tablet_replica, shared_future<std::optional<std::vector<utils::UUID>>>> _remote_work;
+    shared_mutex _mutex; // guards `_finished_tasks` field
+    std::unordered_map<locator::tablet_replica, std::unordered_set<utils::UUID>> _finished_tasks;

 public:
    view_building_coordinator(replica::database& db, raft::server& raft, service::raft_group0& group0,
@@ -86,9 +86,11 @@ private:
    future<> commit_mutations(service::group0_guard guard, utils::chunked_vector<mutation> mutations, std::string_view description);
    void handle_coordinator_error(std::exception_ptr eptr);

+    future<> finished_task_gc_fiber();
+    future<> clean_finished_tasks();
+
    future<std::optional<service::group0_guard>> update_state(service::group0_guard guard);
-    // Returns if any new tasks were started
-    future<bool> work_on_view_building(service::group0_guard guard);
+    future<> work_on_view_building(service::group0_guard guard);

    future<> mark_view_build_status_started(const service::group0_guard& guard, table_id view_id, utils::chunked_vector<mutation>& out);
    future<> mark_all_remaining_view_build_statuses_started(const service::group0_guard& guard, table_id base_id, utils::chunked_vector<mutation>& out);
@@ -97,10 +99,8 @@ private:
    std::set<locator::tablet_replica> get_replicas_with_tasks();
    std::vector<utils::UUID> select_tasks_for_replica(locator::tablet_replica replica);

-    future<utils::chunked_vector<mutation>> start_tasks(const service::group0_guard& guard, std::vector<utils::UUID> tasks);
-    void attach_to_started_tasks(const locator::tablet_replica& replica, std::vector<utils::UUID> tasks);
-    future<std::optional<remote_work_results>> work_on_tasks(locator::tablet_replica replica, std::vector<utils::UUID> tasks);
-    future<utils::chunked_vector<mutation>> update_state_after_work_is_done(const service::group0_guard& guard, const locator::tablet_replica& replica, remote_work_results results);
+    void start_remote_worker(const locator::tablet_replica& replica, std::vector<utils::UUID> tasks);
+    future<std::optional<std::vector<utils::UUID>>> work_on_tasks(locator::tablet_replica replica, std::vector<utils::UUID> tasks);
 };

 void abort_view_building_tasks(const db::view::view_building_state_machine& vb_sm,
--- a/db/view/view_building_state.cc
+++ b/db/view/view_building_state.cc
@@ -13,10 +13,10 @@ namespace db {

 namespace view {

-view_building_task::view_building_task(utils::UUID id, task_type type, task_state state, table_id base_id, std::optional<table_id> view_id, locator::tablet_replica replica, dht::token last_token)
+view_building_task::view_building_task(utils::UUID id, task_type type, bool aborted, table_id base_id, std::optional<table_id> view_id, locator::tablet_replica replica, dht::token last_token)
        : id(id)
        , type(type)
-        , state(state)
+        , aborted(aborted)
        , base_id(base_id)
        , view_id(view_id)
        , replica(replica)
@@ -49,30 +49,6 @@ seastar::sstring task_type_to_sstring(view_building_task::task_type type) {
    }
 }

-view_building_task::task_state task_state_from_string(std::string_view str) {
-    if (str == "IDLE") {
-        return view_building_task::task_state::idle;
-    }
-    if (str == "STARTED") {
-        return view_building_task::task_state::started;
-    }
-    if (str == "ABORTED") {
-        return view_building_task::task_state::aborted;
-    }
-    throw std::runtime_error(fmt::format("Unknown view building task state: {}", str));
-}
-
-seastar::sstring task_state_to_sstring(view_building_task::task_state state) {
-    switch (state) {
-    case view_building_task::task_state::idle:
-        return "IDLE";
-    case view_building_task::task_state::started:
-        return "STARTED";
-    case view_building_task::task_state::aborted:
-        return "ABORTED";
-    }
-}
-
 std::optional<std::reference_wrapper<const view_building_task>> view_building_state::get_task(table_id base_id, locator::tablet_replica replica, utils::UUID id) const {
    if (!tasks_state.contains(base_id) || !tasks_state.at(base_id).contains(replica)) {
        return {};
@@ -151,46 +127,6 @@ std::map<dht::token, std::vector<view_building_task>> view_building_state::colle
    return tasks;
 }

-// Returns all tasks for `_vb_sm.building_state.currently_processed_base_table` and `replica` with `STARTED` state.
-std::vector<utils::UUID> view_building_state::get_started_tasks(table_id base_table_id, locator::tablet_replica replica) const {   
-    if (!tasks_state.contains(base_table_id) || !tasks_state.at(base_table_id).contains(replica)) {
-        // No tasks for this replica
-        return {};
-    }
-
-    std::vector<view_building_task> tasks;
-    auto& replica_tasks = tasks_state.at(base_table_id).at(replica);
-    for (auto& [_, view_tasks]: replica_tasks.view_tasks) {
-        for (auto& [_, task]: view_tasks) {
-            if (task.state == view_building_task::task_state::started) {
-                tasks.push_back(task);
-            }
-        }
-    }
-    for (auto& [_, task]: replica_tasks.staging_tasks) {
-        if (task.state == view_building_task::task_state::started) {
-            tasks.push_back(task);
-        }
-    }
-
-    // All collected tasks should have the same: type, base_id and last_token,
-    // so they can be executed in the same view_building_worker::batch.
-#ifdef SEASTAR_DEBUG
-    if (!tasks.empty()) {
-        auto& task = tasks.front();
-        for (auto& t: tasks) {
-            SCYLLA_ASSERT(task.type == t.type);
-            SCYLLA_ASSERT(task.base_id == t.base_id);
-            SCYLLA_ASSERT(task.last_token == t.last_token);
-        }
-    }
-#endif
-
-    return tasks | std::views::transform([] (const view_building_task& t) {
-        return t.id;
-    }) | std::ranges::to<std::vector>();
-}
-
 }

 }
--- a/db/view/view_building_state.hh
+++ b/db/view/view_building_state.hh
@@ -39,28 +39,17 @@ struct view_building_task {
        process_staging,
    };

-    // When a task is created, it starts with `IDLE` state.
-    // Then, the view building coordinator will decide to do the task and it will
-    // set the state to `STARTED`.
-    // When a task is finished the entry is removed.
-    //
-    // If a task is in progress when a tablet operation (migration/resize) starts,
-    // the task's state is set to `ABORTED`.
-    enum class task_state {
-        idle,
-        started,
-        aborted,
-    };
+
    utils::UUID id;
    task_type type;
-    task_state state;
+    bool aborted;

    table_id base_id;
    std::optional<table_id> view_id; // nullopt when task_type is `process_staging`
    locator::tablet_replica replica;
    dht::token last_token;

-    view_building_task(utils::UUID id, task_type type, task_state state,
+    view_building_task(utils::UUID id, task_type type, bool aborted,
            table_id base_id, std::optional<table_id> view_id,
            locator::tablet_replica replica, dht::token last_token);
 };
@@ -92,7 +81,6 @@ struct view_building_state {
    std::vector<std::reference_wrapper<const view_building_task>> get_tasks_for_host(table_id base_id, locator::host_id host) const;
    std::map<dht::token, std::vector<view_building_task>> collect_tasks_by_last_token(table_id base_table_id) const;
    std::map<dht::token, std::vector<view_building_task>> collect_tasks_by_last_token(table_id base_table_id, const locator::tablet_replica& replica) const;
-    std::vector<utils::UUID> get_started_tasks(table_id base_table_id, locator::tablet_replica replica) const;
 };

 // Represents global state of tablet-based views.
@@ -113,18 +101,8 @@ struct view_building_state_machine {
    condition_variable event;
 };

-struct view_task_result {
-    enum class command_status: uint8_t {
-        success = 0,
-        abort = 1,
-    };
-    db::view::view_task_result::command_status status;
-};
-
 view_building_task::task_type task_type_from_string(std::string_view str);
 seastar::sstring task_type_to_sstring(view_building_task::task_type type);
-view_building_task::task_state task_state_from_string(std::string_view str);
-seastar::sstring task_state_to_sstring(view_building_task::task_state state);

 } // namespace view_building

@@ -136,17 +114,11 @@ template <> struct fmt::formatter<db::view::view_building_task::task_type> : fmt
    }
 };

-template <> struct fmt::formatter<db::view::view_building_task::task_state> : fmt::formatter<string_view> {
-    auto format(db::view::view_building_task::task_state state, fmt::format_context& ctx) const {
-        return fmt::format_to(ctx.out(), "{}", db::view::task_state_to_sstring(state));
-    }
-};
-
 template <> struct fmt::formatter<db::view::view_building_task> : fmt::formatter<string_view> {
    auto format(db::view::view_building_task task, fmt::format_context& ctx) const {
        auto view_id = task.view_id ? fmt::to_string(*task.view_id) : "nullopt";
-        return fmt::format_to(ctx.out(), "view_building_task{{type: {}, state: {}, base_id: {}, view_id: {}, last_token: {}}}",
-                task.type, task.state, task.base_id, view_id, task.last_token);
+        return fmt::format_to(ctx.out(), "view_building_task{{type: {}, aborted: {}, base_id: {}, view_id: {}, last_token: {}}}",
+                task.type, task.aborted, task.base_id, view_id, task.last_token);
    }
 };

@@ -161,18 +133,3 @@ template <> struct fmt::formatter<db::view::replica_tasks> : fmt::formatter<stri
        return fmt::format_to(ctx.out(), "{{view_tasks: {}, staging_tasks: {}}}", replica_tasks.view_tasks, replica_tasks.staging_tasks);
    }
 };
-
-template <> struct fmt::formatter<db::view::view_task_result> : fmt::formatter<string_view> {
-    auto format(db::view::view_task_result result, fmt::format_context& ctx) const {
-        std::string_view res;
-        switch (result.status) {
-            case db::view::view_task_result::command_status::success:
-            res = "success";
-            break;
-        case db::view::view_task_result::command_status::abort:
-            res = "abort";
-            break;
-        }
-        return format_to(ctx.out(), "{}", res);
-    }
-};
--- a/db/view/view_building_task_mutation_builder.cc
+++ b/db/view/view_building_task_mutation_builder.cc
@@ -25,8 +25,8 @@ view_building_task_mutation_builder& view_building_task_mutation_builder::set_ty
    _m.set_clustered_cell(get_ck(id), "type", data_value(task_type_to_sstring(type)), _ts);
    return *this;
 }
-view_building_task_mutation_builder& view_building_task_mutation_builder::set_state(utils::UUID id, db::view::view_building_task::task_state state) {
-    _m.set_clustered_cell(get_ck(id), "state", data_value(task_state_to_sstring(state)), _ts);
+view_building_task_mutation_builder& view_building_task_mutation_builder::set_aborted(utils::UUID id, bool aborted) {
+    _m.set_clustered_cell(get_ck(id), "aborted", data_value(aborted), _ts);
    return *this;
 }
 view_building_task_mutation_builder& view_building_task_mutation_builder::set_base_id(utils::UUID id, table_id base_id) {
--- a/db/view/view_building_task_mutation_builder.hh
+++ b/db/view/view_building_task_mutation_builder.hh
@@ -32,7 +32,7 @@ public:
    static utils::UUID new_id();

    view_building_task_mutation_builder& set_type(utils::UUID id, db::view::view_building_task::task_type type);
-    view_building_task_mutation_builder& set_state(utils::UUID id, db::view::view_building_task::task_state state);
+    view_building_task_mutation_builder& set_aborted(utils::UUID id, bool aborted);
    view_building_task_mutation_builder& set_base_id(utils::UUID id, table_id base_id);
    view_building_task_mutation_builder& set_view_id(utils::UUID id, table_id view_id);
    view_building_task_mutation_builder& set_last_token(utils::UUID id, dht::token last_token);
--- a/db/view/view_building_worker.cc
+++ b/db/view/view_building_worker.cc
@@ -22,6 +22,7 @@
 #include "replica/database.hh"
 #include "service/storage_proxy.hh"
 #include "service/raft/raft_group0_client.hh"
+#include "service/raft/raft_group0.hh"
 #include "schema/schema_fwd.hh"
 #include "idl/view.dist.hh"
 #include "sstables/sstables.hh"
@@ -114,11 +115,11 @@ static locator::tablet_id get_sstable_tablet_id(const locator::tablet_map& table
    return tablet_id;
 }

-view_building_worker::view_building_worker(replica::database& db, db::system_keyspace& sys_ks, service::migration_notifier& mnotifier, service::raft_group0_client& group0_client, view_update_generator& vug, netw::messaging_service& ms, view_building_state_machine& vbsm)
+view_building_worker::view_building_worker(replica::database& db, db::system_keyspace& sys_ks, service::migration_notifier& mnotifier, service::raft_group0& group0, view_update_generator& vug, netw::messaging_service& ms, view_building_state_machine& vbsm)
        : _db(db)
        , _sys_ks(sys_ks)
        , _mnotifier(mnotifier)
-        , _group0_client(group0_client)
+        , _group0(group0)
        , _vug(vug)
        , _messaging(ms)
        , _vb_state_machine(vbsm)
@@ -127,8 +128,9 @@ view_building_worker::view_building_worker(replica::database& db, db::system_key
    init_messaging_service();
 }

-void view_building_worker::start_background_fibers() {
+future<> view_building_worker::init() {
    SCYLLA_ASSERT(this_shard_id() == 0);
+    co_await discover_existing_staging_sstables();
    _staging_sstables_registrator = run_staging_sstables_registrator();
    _view_building_state_observer = run_view_building_state_observer();
    _mnotifier.register_listener(this);
@@ -144,6 +146,7 @@ future<> view_building_worker::drain() {
    if (!_as.abort_requested()) {
        _as.request_abort();
    }
+    _state._mutex.broken();
    _staging_sstables_mutex.broken();
    _sstables_to_register_event.broken();
    if (this_shard_id() == 0) {
@@ -153,8 +156,7 @@ future<> view_building_worker::drain() {
        co_await std::move(state_observer);
        co_await _mnotifier.unregister_listener(this);
    }
-    co_await _state.clear_state();
-    _state.state_updated_cv.broken();
+    co_await _state.clear();
    co_await uninit_messaging_service();
 }

@@ -195,8 +197,6 @@ future<> view_building_worker::register_staging_sstable_tasks(std::vector<sstabl
 }

 future<> view_building_worker::run_staging_sstables_registrator() {
-    co_await discover_existing_staging_sstables();
-
    while (!_as.abort_requested()) {
        try {
            auto lock = co_await get_units(_staging_sstables_mutex, 1, _as);
@@ -225,44 +225,42 @@ future<> view_building_worker::create_staging_sstable_tasks() {

    utils::chunked_vector<canonical_mutation> cmuts;

-    auto guard = co_await _group0_client.start_operation(_as);
+    auto guard = co_await _group0.client().start_operation(_as);
    auto my_host_id = _db.get_token_metadata().get_topology().my_host_id();
    for (auto& [table_id, sst_infos]: _sstables_to_register) {
        for (auto& sst_info: sst_infos) {
            view_building_task task {
-                utils::UUID_gen::get_time_UUID(), view_building_task::task_type::process_staging, view_building_task::task_state::idle,
+                utils::UUID_gen::get_time_UUID(), view_building_task::task_type::process_staging, false,
                table_id, ::table_id{}, {my_host_id, sst_info.shard}, sst_info.last_token
            };
-            auto mut = co_await _group0_client.sys_ks().make_view_building_task_mutation(guard.write_timestamp(), task);
+            auto mut = co_await _group0.client().sys_ks().make_view_building_task_mutation(guard.write_timestamp(), task);
            cmuts.emplace_back(std::move(mut));
        }
    }

    vbw_logger.debug("Creating {} process_staging view_building_tasks", cmuts.size());
-    auto cmd = _group0_client.prepare_command(service::write_mutations{std::move(cmuts)}, guard, "create view building tasks");
-    co_await _group0_client.add_entry(std::move(cmd), std::move(guard), _as);
+    auto cmd = _group0.client().prepare_command(service::write_mutations{std::move(cmuts)}, guard, "create view building tasks");
+    co_await _group0.client().add_entry(std::move(cmd), std::move(guard), _as);

    // Move staging sstables from `_sstables_to_register` (on shard0) to `_staging_sstables` on corresponding shards.
    // Firstly reorgenize `_sstables_to_register` for easier movement.
    // This is done in separate loop after commiting the group0 command, because we need to move values from `_sstables_to_register`
    // (`staging_sstable_task_info` is non-copyable because of `foreign_ptr` field).
-    std::unordered_map<shard_id, std::unordered_map<table_id, std::unordered_map<dht::token, std::vector<foreign_ptr<sstables::shared_sstable>>>>> new_sstables_per_shard;
+    std::unordered_map<shard_id, std::unordered_map<table_id, std::vector<foreign_ptr<sstables::shared_sstable>>>> new_sstables_per_shard;
    for (auto& [table_id, sst_infos]: _sstables_to_register) {
        for (auto& sst_info: sst_infos) {
-            new_sstables_per_shard[sst_info.shard][table_id][sst_info.last_token].push_back(std::move(sst_info.sst_foreign_ptr));
+            new_sstables_per_shard[sst_info.shard][table_id].push_back(std::move(sst_info.sst_foreign_ptr));
        }
    }

    for (auto& [shard, sstables_per_table]: new_sstables_per_shard) {
        co_await container().invoke_on(shard, [sstables_for_this_shard = std::move(sstables_per_table)] (view_building_worker& local_vbw) mutable {
-            for (auto& [tid, ssts_map]: sstables_for_this_shard) {
-                for (auto& [token, ssts]: ssts_map) {
-                    auto unwrapped_ssts = ssts | std::views::as_rvalue | std::views::transform([] (auto&& fptr) {
-                        return fptr.unwrap_on_owner_shard();
-                    }) | std::ranges::to<std::vector>();
-                    auto& tid_ssts = local_vbw._staging_sstables[tid][token];
-                    tid_ssts.insert(tid_ssts.end(), std::make_move_iterator(unwrapped_ssts.begin()), std::make_move_iterator(unwrapped_ssts.end()));
-                }
+            for (auto& [tid, ssts]: sstables_for_this_shard) {
+                auto unwrapped_ssts = ssts | std::views::as_rvalue | std::views::transform([] (auto&& fptr) {
+                    return fptr.unwrap_on_owner_shard();
+                }) | std::ranges::to<std::vector>();
+                auto& tid_ssts = local_vbw._staging_sstables[tid];
+                tid_ssts.insert(tid_ssts.end(), std::make_move_iterator(unwrapped_ssts.begin()), std::make_move_iterator(unwrapped_ssts.end()));
            }
        });
    }
@@ -310,7 +308,10 @@ std::unordered_map<table_id, std::vector<view_building_worker::staging_sstable_t
            return;
        }

-        auto& tablet_map = _db.get_token_metadata().tablets().get_tablet_map(table_id);
+        // scylladb/scylladb#26403: Make sure to access the tablets map via the effective replication map of the table object.
+        // The token metadata object pointed to by the database (`_db.get_token_metadata()`) may not contain
+        // the tablets map of the currently processed table yet. After #24414 is fixed, this should not matter anymore.
+        auto& tablet_map = table->get_effective_replication_map()->get_token_metadata().tablets().get_tablet_map(table_id);
        auto sstables = table->get_sstables();
        for (auto sstable: *sstables) {
            if (!sstable->requires_view_building()) {
@@ -326,7 +327,7 @@ std::unordered_map<table_id, std::vector<view_building_worker::staging_sstable_t
                //                 or maybe it can be registered to view_update_generator directly.
                tasks_to_create[table_id].emplace_back(table_id, shard, last_token, make_foreign(std::move(sstable)));
            } else {
-                _staging_sstables[table_id][last_token].push_back(std::move(sstable));
+                _staging_sstables[table_id].push_back(std::move(sstable));
            }
        }
    });
@@ -342,10 +343,10 @@ future<> view_building_worker::run_view_building_state_observer() {
        bool sleep = false;
        try {
            vbw_logger.trace("view_building_state_observer() iteration");
-            auto read_apply_mutex_holder = co_await _group0_client.hold_read_apply_mutex(_as);
+            auto read_apply_mutex_holder = co_await _group0.client().hold_read_apply_mutex(_as);

            co_await update_built_views();
-            co_await update_building_state();
+            co_await check_for_aborted_tasks();
            _as.check();

            read_apply_mutex_holder.return_all();
@@ -376,7 +377,7 @@ future<> view_building_worker::update_built_views() {
        auto schema = _db.find_schema(table_id);
        return std::make_pair(schema->ks_name(), schema->cf_name());
    };
-    auto& sys_ks = _group0_client.sys_ks();
+    auto& sys_ks = _group0.client().sys_ks();

    std::set<std::pair<sstring, sstring>> built_views;
    for (auto& [id, statuses]: _vb_state_machine.views_state.status_map) {
@@ -405,22 +406,35 @@ future<> view_building_worker::update_built_views() {
    }
 }

-future<> view_building_worker::update_building_state() {
-    co_await _state.update(*this);
-    co_await _state.finish_completed_tasks();
-    _state.state_updated_cv.broadcast();
-}
+// Must be executed on shard0
+future<> view_building_worker::check_for_aborted_tasks() {
+    return container().invoke_on_all([building_state = _vb_state_machine.building_state] (view_building_worker& vbw) -> future<> {
+        auto lock = co_await get_units(vbw._state._mutex, 1, vbw._as);
+        co_await vbw._state.update_processing_base_table(vbw._db, building_state, vbw._as);
+        if (!vbw._state._batch) {
+            co_return;
+        }

-bool view_building_worker::is_shard_free(shard_id shard) {
-    return !std::ranges::any_of(_state.tasks_map, [&shard] (auto& task_entry) {
-        return task_entry.second->replica.shard == shard && task_entry.second->state == view_building_worker::batch_state::in_progress;
+        auto my_host_id = vbw._db.get_token_metadata().get_topology().my_host_id();
+        auto my_replica = locator::tablet_replica{my_host_id, this_shard_id()};
+        auto tasks_map = vbw._state._batch->tasks; // Potentially, we'll remove elements from the map, so we need a copy to iterate over it
+        for (auto& [id, t]: tasks_map) {
+            auto task_opt = building_state.get_task(t.base_id, my_replica, id);
+            if (!task_opt || task_opt->get().aborted) {
+                co_await vbw._state._batch->abort_task(id);
+            }
+        }
+
+        if (vbw._state._batch->tasks.empty()) {
+            co_await vbw._state.clean_up_after_batch();
+        }
    });
 }

 void view_building_worker::init_messaging_service() {
-    ser::view_rpc_verbs::register_work_on_view_building_tasks(&_messaging, [this] (std::vector<utils::UUID> ids) -> future<std::vector<view_task_result>> {
-        return container().invoke_on(0, [ids = std::move(ids)] (view_building_worker& vbw) mutable -> future<std::vector<view_task_result>> {
-            return vbw.work_on_tasks(std::move(ids));
+    ser::view_rpc_verbs::register_work_on_view_building_tasks(&_messaging, [this] (raft::term_t term, shard_id shard, std::vector<utils::UUID> ids) -> future<std::vector<utils::UUID>> {
+        return container().invoke_on(shard, [term, ids = std::move(ids)] (auto& vbw) mutable -> future<std::vector<utils::UUID>> {
+            return vbw.work_on_tasks(term, std::move(ids));
        });
    });
 }
@@ -429,235 +443,53 @@ future<> view_building_worker::uninit_messaging_service() {
    return ser::view_rpc_verbs::unregister(&_messaging);
 }

-future<std::vector<view_task_result>> view_building_worker::work_on_tasks(std::vector<utils::UUID> ids) {
-    vbw_logger.debug("Got request for results of tasks: {}", ids);
-    auto guard = co_await _group0_client.start_operation(_as, service::raft_timeout{});
-    auto processing_base_table = _state.processing_base_table;
-
-    auto are_tasks_finished = [&] () {
-        return std::ranges::all_of(ids, [this] (const utils::UUID& id) {
-            return _state.finished_tasks.contains(id) || _state.aborted_tasks.contains(id);
-        });
-    };
-
-    auto get_results = [&] () -> std::vector<view_task_result> {
-        std::vector<view_task_result> results;
-        for (const auto& id: ids) {
-            if (_state.finished_tasks.contains(id)) {
-                results.emplace_back(view_task_result::command_status::success);
-            } else if (_state.aborted_tasks.contains(id)) {
-                results.emplace_back(view_task_result::command_status::abort);
-            } else {
-                // This means that the task was aborted. Throw an error,
-                // so the coordinator will refresh its state and retry without aborted IDs.
-                throw std::runtime_error(fmt::format("No status for task {}", id));
-            }
-        }
-        return results;
-    };
-
-    if (are_tasks_finished()) {
-        // If the batch is already finished, we can return the results immediately.
-        vbw_logger.debug("Batch with tasks {} is already finished, returning results", ids);
-        co_return get_results();
-    }
-
-    // All of the tasks should be executed in the same batch
-    // (their statuses are set to started in the same group0 operation).
-    // If any ID is not present in the `tasks_map`, it means that it was aborted and we should fail this RPC call,
-    // so the coordinator can retry without aborted IDs.
-    // That's why we can identify the batch by random (.front()) ID from the `ids` vector.
-    auto id = ids.front();
-    while (!_state.tasks_map.contains(id) && processing_base_table == _state.processing_base_table) {
-        vbw_logger.warn("Batch with task {} is not found in tasks map, waiting until worker updates its state", id);
-        service::release_guard(std::move(guard));
-        co_await _state.state_updated_cv.wait();
-        guard = co_await _group0_client.start_operation(_as, service::raft_timeout{});
-    }
-
-    if (processing_base_table != _state.processing_base_table) {
-        // If the processing base table was changed, we should fail this RPC call because the tasks were aborted.
-        throw std::runtime_error(fmt::format("Processing base table was changed to {} ", _state.processing_base_table));
-    }
-
-    // Validate that any of the IDs wasn't aborted.
-    for (const auto& tid: ids) {
-        if (!_state.tasks_map[id]->tasks.contains(tid)) {
-            vbw_logger.warn("Task {} is not found in the batch", tid);
-            throw std::runtime_error(fmt::format("Task {} is not found in the batch", tid));
-        }
-    }
-
-    if (_state.tasks_map[id]->state == view_building_worker::batch_state::idle) {
-        vbw_logger.debug("Starting batch with tasks {}", _state.tasks_map[id]->tasks);
-        if (!is_shard_free(_state.tasks_map[id]->replica.shard)) {
-            throw std::runtime_error(fmt::format("Tried to start view building tasks ({}) on shard {} but the shard is busy", _state.tasks_map[id]->tasks, _state.tasks_map[id]->replica.shard, _state.tasks_map[id]->tasks));
-        }
-        _state.tasks_map[id]->start();
-    }
-
-    service::release_guard(std::move(guard));
-    while (!_as.abort_requested()) {
-        auto read_apply_mutex_holder = co_await _group0_client.hold_read_apply_mutex(_as);
-
-        if (are_tasks_finished()) {
-            co_return get_results();
-        }
-
-        // Check if the batch is still alive
-        if (!_state.tasks_map.contains(id)) {
-            throw std::runtime_error(fmt::format("Batch with task {} is not found in tasks map anymore.", id));
-        }
-
-        read_apply_mutex_holder.return_all();
-        co_await _state.tasks_map[id]->batch_done_cv.wait();
-    }
-    throw std::runtime_error("View building worker was aborted");
-}
-
-// Validates if the task can be executed in a batch on the same shard.
-static bool validate_can_be_one_batch(const view_building_task& t1, const view_building_task& t2) {
-    return t1.type == t2.type && t1.base_id == t2.base_id && t1.replica == t2.replica && t1.last_token == t2.last_token;
-}
-
 static std::unordered_set<table_id> get_ids_of_all_views(replica::database& db, table_id table_id) {
    return db.find_column_family(table_id).views() | std::views::transform([] (view_ptr vptr) {
        return vptr->id();
    }) | std::ranges::to<std::unordered_set>();;
 }

-future<> view_building_worker::local_state::flush_table(view_building_worker& vbw, table_id table_id) {
-    // `table_id` should point to currently processing base table but
-    // `view_building_worker::local_state::processing_base_table` may not be set to it yet, 
-    // so we need to pass it directly
-    co_await vbw.container().invoke_on_all([table_id] (view_building_worker& local_vbw) -> future<> {
-        auto base_cf = local_vbw._db.find_column_family(table_id).shared_from_this();
-        co_await when_all(base_cf->await_pending_writes(), base_cf->await_pending_streams());
-        co_await flush_base(base_cf, local_vbw._as);
-    });
-
-    flushed_views = get_ids_of_all_views(vbw._db, table_id);
-}
-
-future<> view_building_worker::local_state::update(view_building_worker& vbw) {
-    const auto& vb_state = vbw._vb_state_machine.building_state;
-
-    // Check if the base table to process was changed.
-    // If so, we clear the state, aborting tasks for previous base table and starting new ones for the new base table.
-    if (processing_base_table != vb_state.currently_processed_base_table) {
-        co_await clear_state();
-
-        if (vb_state.currently_processed_base_table) {
-            // When we start to process new base table, we need to flush its current data, so we can build the view.
-            co_await flush_table(vbw, *vb_state.currently_processed_base_table);
-        }
-
-        processing_base_table = vb_state.currently_processed_base_table;
-        vbw_logger.info("Processing base table was changed to: {}", processing_base_table);
-    }
-
-    if (!processing_base_table) {
-        vbw_logger.debug("No base table is selected to be processed.");
-        co_return;
-    }
-
-    std::vector<table_id> new_views;
-    auto all_view_ids = get_ids_of_all_views(vbw._db, *processing_base_table);
-    std::ranges::set_difference(all_view_ids, flushed_views, std::back_inserter(new_views));
-    if (!new_views.empty()) {
-        // Flush base table again in any new view was created, so the view building tasks will see up-to-date sstables.
-        // Otherwise, we may lose mutations created after previous flush but before the new view was created.
-        co_await flush_table(vbw, *processing_base_table);
-    }
-
-    auto erm = vbw._db.find_column_family(*processing_base_table).get_effective_replication_map();
-    auto my_host_id = erm->get_topology().my_host_id();
-    auto current_tasks_for_this_host = vb_state.get_tasks_for_host(*processing_base_table, my_host_id);
-
-    // scan view building state, collect alive and new (in STARTED state but not started by this worker) tasks
-    std::unordered_map<shard_id, std::vector<view_building_task>> new_tasks;
-    std::unordered_set<utils::UUID> alive_tasks; // save information about alive tasks to cleanup done/aborted ones
-    for (auto& task_ref: current_tasks_for_this_host) {
-        auto& task = task_ref.get();
-        auto id = task.id;
-
-        if (task.state != view_building_task::task_state::aborted) {
-            alive_tasks.insert(id);
-        }
-
-        if (tasks_map.contains(id) || finished_tasks.contains(id)) {
-            continue;
-        }
-        else if (task.state == view_building_task::task_state::started) {
-            auto shard = task.replica.shard;
-            if (new_tasks.contains(shard) && !validate_can_be_one_batch(new_tasks[shard].front(), task)) {
-                // Currently we allow only one batch per shard at a time
-                on_internal_error(vbw_logger, fmt::format("Got not-compatible tasks for the same shard. Task: {}, other: {}", new_tasks[shard].front(), task));
-            }
-            new_tasks[shard].push_back(task);
-        }
-        co_await coroutine::maybe_yield();
-    }
-
-    auto tasks_map_copy = tasks_map;
-
-    // Clear aborted tasks from tasks_map
-    for (auto it = tasks_map_copy.begin(); it != tasks_map_copy.end();) {
-        if (!alive_tasks.contains(it->first)) {
-            vbw_logger.debug("Aborting task {}", it->first);
-            aborted_tasks.insert(it->first);
-            co_await it->second->abort_task(it->first);
-            it = tasks_map_copy.erase(it);
-        } else {
-            ++it;
-        }
-    }
-
-    // Create batches for new tasks
-    for (const auto& [shard, shard_tasks]: new_tasks) {
-        auto tasks = shard_tasks | std::views::transform([] (const view_building_task& t) {
-            return std::make_pair(t.id, t);
-        }) | std::ranges::to<std::unordered_map>();
-        auto batch = seastar::make_shared<view_building_worker::batch>(vbw.container(), tasks, shard_tasks.front().base_id, shard_tasks.front().replica);
-
-        for (auto& [id, _]: tasks) {
-            tasks_map_copy.insert({id, batch});
-        }
-        co_await coroutine::maybe_yield();
-    }
-
-    tasks_map = std::move(tasks_map_copy);
-}
-
-future<> view_building_worker::local_state::finish_completed_tasks() {
-    for (auto it = tasks_map.begin(); it != tasks_map.end();) {
-        if (it->second->state == view_building_worker::batch_state::idle) {
-            ++it;
-        } else if (it->second->state == view_building_worker::batch_state::in_progress) {
-            vbw_logger.debug("Task {} is still in progress", it->first);
-            ++it;
-        } else {
-            co_await it->second->work.get_future();
-            finished_tasks.insert(it->first);
-            vbw_logger.info("Task {} was completed", it->first);
-            it->second->batch_done_cv.broadcast();
-            it = tasks_map.erase(it);
+// If `state::processing_base_table` is different from `view_building_state::currently_processed_base_table`,
+// clear the state, then save and flush the new base table.
+future<> view_building_worker::state::update_processing_base_table(replica::database& db, const view_building_state& building_state, abort_source& as) {
+    if (processing_base_table != building_state.currently_processed_base_table) {
+        co_await clear();
+        if (building_state.currently_processed_base_table) {
+            co_await flush_base_table(db, *building_state.currently_processed_base_table, as);
        }
+        processing_base_table = building_state.currently_processed_base_table;
    }
 }

-future<> view_building_worker::local_state::clear_state() {
-    for (auto& [_, batch]: tasks_map) {
-        co_await batch->abort();
+// If `_batch` points to a valid object, co_await its `work` future, save the completed tasks and delete the object
+future<> view_building_worker::state::clean_up_after_batch() {
+    if (_batch) {
+        co_await std::move(_batch->work);
+        for (auto& [id, _]: _batch->tasks) {
+            completed_tasks.insert(id);
+        }
+        _batch = nullptr;
    }
+}

+// Flush the base table, set it as the currently processed base table and save which views exist at the time of the flush
+future<> view_building_worker::state::flush_base_table(replica::database& db, table_id base_table_id, abort_source& as) {
+    auto cf = db.find_column_family(base_table_id).shared_from_this();
+    co_await when_all(cf->await_pending_writes(), cf->await_pending_streams());
+    co_await flush_base(cf, as);
+    processing_base_table = base_table_id;
+    flushed_views = get_ids_of_all_views(db, base_table_id);
+}
+
+future<> view_building_worker::state::clear() {
+    if (_batch) {
+        _batch->as.request_abort();
+        co_await std::move(_batch->work);
+        _batch = nullptr;
+    }
    processing_base_table.reset();
+    completed_tasks.clear();
    flushed_views.clear();
-    tasks_map.clear();
-    finished_tasks.clear();
-    aborted_tasks.clear();
-    state_updated_cv.broadcast();
-    vbw_logger.debug("View building worker state was cleared.");
 }

 view_building_worker::batch::batch(sharded<view_building_worker>& vbw, std::unordered_map<utils::UUID, view_building_task> tasks, table_id base_id, locator::tablet_replica replica)
@@ -667,16 +499,12 @@ view_building_worker::batch::batch(sharded<view_building_worker>& vbw, std::unor
    , _vbw(vbw) {}

 void view_building_worker::batch::start() {
-    if (this_shard_id() != 0) {
-        on_internal_error(vbw_logger, "view_building_worker::batch should be started on shard0");
+    if (this_shard_id() != replica.shard) {
+        on_internal_error(vbw_logger, "view_building_worker::batch should be started on replica shard");
    }

-    state = batch_state::in_progress;
-    work = smp::submit_to(replica.shard, [this] () -> future<> {
-        return do_work();
-    }).finally([this] () {
-        state = batch_state::finished;
-        _vbw.local()._vb_state_machine.event.broadcast();
+    work = do_work().finally([this] {
+        promise.set_value();
    });
 }

@@ -691,10 +519,6 @@ future<> view_building_worker::batch::abort() {
    co_await smp::submit_to(replica.shard, [this] () {
        as.request_abort();
    });
-
-    if (work.valid()) {
-        co_await work.get_future();
-    }
 }

 future<> view_building_worker::batch::do_work() {
@@ -837,15 +661,174 @@ future<> view_building_worker::do_build_range(table_id base_id, std::vector<tabl
 }

 future<> view_building_worker::do_process_staging(table_id table_id, dht::token last_token) {
-    if (_staging_sstables[table_id][last_token].empty()) {
+    if (_staging_sstables[table_id].empty()) {
        co_return;
    }

    auto table = _db.get_tables_metadata().get_table(table_id).shared_from_this();
-    auto sstables = std::exchange(_staging_sstables[table_id][last_token], {});
-    co_await _vug.process_staging_sstables(std::move(table), std::move(sstables));
+    auto& tablet_map = table->get_effective_replication_map()->get_token_metadata().tablets().get_tablet_map(table_id);
+    auto tid = tablet_map.get_tablet_id(last_token);
+    auto tablet_range = tablet_map.get_token_range(tid);
+
+    // Select sstables belonging to the tablet (identified by `last_token`)
+    std::vector<sstables::shared_sstable> sstables_to_process;
+    for (auto& sst: _staging_sstables[table_id]) {
+        auto sst_last_token = sst->get_last_decorated_key().token();
+        if (tablet_range.contains(sst_last_token, dht::token_comparator())) {
+            sstables_to_process.push_back(sst);
+        }
+    }
+
+    co_await _vug.process_staging_sstables(std::move(table), sstables_to_process);
+
+    try {
+        // Remove processed sstables from `_staging_sstables` map
+        auto lock = co_await get_units(_staging_sstables_mutex, 1, _as);
+        std::unordered_set<sstables::shared_sstable> sstables_to_remove(sstables_to_process.begin(), sstables_to_process.end());
+        auto [first, last] = std::ranges::remove_if(_staging_sstables[table_id], [&] (auto& sst) {
+            return sstables_to_remove.contains(sst);
+        });
+        _staging_sstables[table_id].erase(first, last);
+    } catch (semaphore_aborted&) {
+        vbw_logger.warn("Semaphore was aborted while waiting to remove processed sstables for table {}", table_id);
+    }
 }

+void view_building_worker::load_sstables(table_id table_id, std::vector<sstables::shared_sstable> ssts) {
+    std::ranges::copy_if(std::move(ssts), std::back_inserter(_staging_sstables[table_id]), [] (auto& sst) {
+        return sst->state() == sstables::sstable_state::staging;
+    });
+}
+
+void view_building_worker::cleanup_staging_sstables(locator::effective_replication_map_ptr erm, table_id table_id, locator::tablet_id tid) {
+    auto& tablet_map = erm->get_token_metadata().tablets().get_tablet_map(table_id);
+    auto tablet_range = tablet_map.get_token_range(tid);
+
+    auto [first, last] = std::ranges::remove_if(_staging_sstables[table_id], [&] (auto& sst) {
+        auto sst_last_token = sst->get_last_decorated_key().token();
+        return tablet_range.contains(sst_last_token, dht::token_comparator());
+    });
+    _staging_sstables[table_id].erase(first, last);
+}
+
+future<view_building_state> view_building_worker::get_latest_view_building_state(raft::term_t term) {
+    return smp::submit_to(0, [&sharded_vbw = container(), term] () -> future<view_building_state> {
+        auto& vbw = sharded_vbw.local();
+        // auto guard = vbw._group0.client().start_operation(vbw._as);
+
+        auto& raft_server = vbw._group0.group0_server();
+        auto group0_holder = vbw._group0.hold_group0_gate();
+        co_await raft_server.read_barrier(&vbw._as);
+        if (raft_server.get_current_term() != term) {
+           throw std::runtime_error(fmt::format("Invalid raft term. Got {} but current term is {}", term, raft_server.get_current_term()));
+        }
+
+        co_return vbw._vb_state_machine.building_state;
+    });
+}
+
+future<std::vector<utils::UUID>> view_building_worker::work_on_tasks(raft::term_t term, std::vector<utils::UUID> ids) {
+    auto collect_completed_tasks = [&] {
+        std::vector<utils::UUID> completed;
+        for (auto& id: ids) {
+            if (_state.completed_tasks.contains(id)) {
+                completed.push_back(id);
+            }
+        }
+        return completed;
+    };
+
+    auto lock = co_await get_units(_state._mutex, 1, _as);
+    // Firstly check if there is any batch that is finished but wasn't cleaned up.
+    if (_state._batch && _state._batch->promise.available()) {
+        co_await _state.clean_up_after_batch();
+    }
+
+    // Check if tasks were already completed.
+    // If only part of the tasks were finished, return the subset and don't execute the remaining tasks.
+    std::vector<utils::UUID> completed = collect_completed_tasks();
+    if (!completed.empty()) {
+        co_return completed;
+    }
+    lock.return_all();
+
+    auto building_state = co_await get_latest_view_building_state(term);
+
+    lock = co_await get_units(_state._mutex, 1, _as);
+    co_await _state.update_processing_base_table(_db, building_state, _as);
+    // If there is no running batch, create it.
+    if (!_state._batch) {
+        if (!_state.processing_base_table) {
+            throw std::runtime_error("view_building_worker::state::processing_base_table needs to be set to work on view building");
+        }
+
+        auto my_host_id = _db.get_token_metadata().get_topology().my_host_id();
+        auto my_replica = locator::tablet_replica{my_host_id, this_shard_id()};
+        std::unordered_map<utils::UUID, view_building_task> tasks;
+        for (auto& id: ids) {
+            auto task_opt = building_state.get_task(*_state.processing_base_table, my_replica, id);
+            if (!task_opt) {
+                throw std::runtime_error(fmt::format("Task {} was not found for base table {} on replica {}", id, *building_state.currently_processed_base_table, my_replica));
+            }
+            tasks.insert({id, *task_opt});
+        }
+#ifdef SEASTAR_DEBUG
+        auto& some_task = tasks.begin()->second;
+        for (auto& [_, t]: tasks) {
+            SCYLLA_ASSERT(t.base_id == some_task.base_id);
+            SCYLLA_ASSERT(t.last_token == some_task.last_token);
+            SCYLLA_ASSERT(t.replica == some_task.replica);
+            SCYLLA_ASSERT(t.type == some_task.type);
+            SCYLLA_ASSERT(t.replica.shard == this_shard_id());
+        }
+#endif
+
+        // If any view was added after we did the initial flush, we need to do it again
+        if (std::ranges::any_of(tasks | std::views::values, [&] (const view_building_task& t) {
+            return t.view_id && !_state.flushed_views.contains(*t.view_id);
+        })) {
+            co_await _state.flush_base_table(_db, *_state.processing_base_table, _as);
+        }
+
+        // Create and start the batch
+        _state._batch = std::make_unique<batch>(container(), std::move(tasks), *building_state.currently_processed_base_table, my_replica);
+        _state._batch->start();
+    }
+
+    if (std::ranges::all_of(ids, [&] (auto& id) { return !_state._batch->tasks.contains(id); })) {
+        throw std::runtime_error(fmt::format(
+                "None of the tasks requested to work on is executed in current view building batch. Batch executes: {}, the RPC requested: {}",
+                _state._batch->tasks | std::views::keys, ids));
+    }
+    auto batch_future = _state._batch->promise.get_shared_future();
+    lock.return_all();
+
+    co_await std::move(batch_future);
+
+    lock = co_await get_units(_state._mutex, 1, _as);
+    co_await _state.clean_up_after_batch();
+    co_return collect_completed_tasks();
+}
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
 }

 }
--- a/db/view/view_building_worker.hh
+++ b/db/view/view_building_worker.hh
@@ -14,7 +14,9 @@
 #include <seastar/core/shared_future.hh>
 #include <unordered_map>
 #include <unordered_set>
+#include "locator/abstract_replication_strategy.hh"
 #include "locator/tablets.hh"
+#include "raft/raft.hh"
 #include "seastar/core/gate.hh"
 #include "db/view/view_building_state.hh"
 #include "sstables/shared_sstable.hh"
@@ -30,7 +32,7 @@ class messaging_service;
 }

 namespace service {
-class raft_group0_client;
+class raft_group0;
 }

 namespace db {
@@ -64,27 +66,16 @@ class view_building_worker : public seastar::peering_sharded_service<view_buildi
     *
     * When `work` future is finished, it means all tasks in `tasks_ids` are done.
     *
-     * The batch lives on shard 0 exclusively.
-     * When the batch starts to execute its tasks, it firstly copies all necessary data
-     * to the designated shard, then the work is done on the local copy of the data only.
+     * The batch lives exclusively on the shard where it executes its work.
     */
-
-    enum class batch_state {
-        idle,
-        in_progress,
-        finished,
-    };
-
    class batch {
    public:
-        batch_state state = batch_state::idle;
        table_id base_id;
        locator::tablet_replica replica;
        std::unordered_map<utils::UUID, view_building_task> tasks;

-        shared_future<> work;
-        condition_variable batch_done_cv;
-        // The abort has to be used only on `replica.shard`
+        shared_promise<> promise;
+        future<> work = make_ready_future();
        abort_source as;

        batch(sharded<view_building_worker>& vbw, std::unordered_map<utils::UUID, view_building_task> tasks, table_id base_id, locator::tablet_replica replica);
@@ -100,34 +91,18 @@ class view_building_worker : public seastar::peering_sharded_service<view_buildi

    friend class batch;

-    struct local_state {
+    struct state {
        std::optional<table_id> processing_base_table = std::nullopt;
-        // Stores ids of views for which the flush was done.
-        // When a new view is created, we need to flush the base table again,
-        // as data might be inserted.
+        std::unordered_set<utils::UUID> completed_tasks;
+        std::unique_ptr<batch> _batch = nullptr;
        std::unordered_set<table_id> flushed_views;
-        std::unordered_map<utils::UUID, shared_ptr<batch>> tasks_map;

-        std::unordered_set<utils::UUID> finished_tasks;
-        std::unordered_set<utils::UUID> aborted_tasks;
-
-        condition_variable state_updated_cv;
-
-        // Clears completed/aborted tasks and creates batches (without starting them) for started tasks.
-        // Returns a map of tasks per shard to execute.
-        future<> update(view_building_worker& vbw);
-
-        future<> finish_completed_tasks();
-
-        // The state can be aborted if, for example, a view is dropped, then all its tasks
-        // are aborted and the coordinator may choose new base table to process.
-        // This method aborts all batches as we stop to processing the current base table.
-        future<> clear_state();
-
-        // Flush table with `table_id` on all shards.
-        // This method should be used only on currently processing base table and
-        // it updates `flushed_views` field.
-        future<> flush_table(view_building_worker& vbw, table_id table_id);
+        semaphore _mutex = semaphore(1);
+        // All of the methods below should be executed while holding `_mutex` unit!
+        future<> update_processing_base_table(replica::database& db, const view_building_state& building_state, abort_source& as);
+        future<> flush_base_table(replica::database& db, table_id base_table_id, abort_source& as);
+        future<> clean_up_after_batch();
+        future<> clear();
    };

    // Wrapper which represents information needed to create
@@ -145,28 +120,28 @@ private:
    replica::database& _db;
    db::system_keyspace& _sys_ks;
    service::migration_notifier& _mnotifier;
-    service::raft_group0_client& _group0_client;
+    service::raft_group0& _group0;
    view_update_generator& _vug;
    netw::messaging_service& _messaging;
    view_building_state_machine& _vb_state_machine;
    abort_source _as;
    named_gate _gate;

-    local_state _state;
+    state _state;
    std::unordered_set<table_id> _views_in_progress;
    future<> _view_building_state_observer = make_ready_future<>();

    condition_variable _sstables_to_register_event;
    semaphore _staging_sstables_mutex = semaphore(1);
    std::unordered_map<table_id, std::vector<staging_sstable_task_info>> _sstables_to_register;
-    std::unordered_map<table_id, std::unordered_map<dht::token, std::vector<sstables::shared_sstable>>> _staging_sstables;
+    std::unordered_map<table_id, std::vector<sstables::shared_sstable>> _staging_sstables;
    future<> _staging_sstables_registrator = make_ready_future<>();

 public:
    view_building_worker(replica::database& db, db::system_keyspace& sys_ks, service::migration_notifier& mnotifier,
-            service::raft_group0_client& group0_client, view_update_generator& vug, netw::messaging_service& ms,
+            service::raft_group0& group0, view_update_generator& vug, netw::messaging_service& ms,
            view_building_state_machine& vbsm);
-    void start_background_fibers();
+    future<> init();

    future<> register_staging_sstable_tasks(std::vector<sstables::shared_sstable> ssts, table_id table_id);

@@ -177,11 +152,17 @@ public:
    virtual void on_update_view(const sstring& ks_name, const sstring& view_name, bool columns_changed) override {};
    virtual void on_drop_view(const sstring& ks_name, const sstring& view_name) override;

+    // Used ONLY to load staging sstables migrated during intra-node tablet migration.
+    void load_sstables(table_id table_id, std::vector<sstables::shared_sstable> ssts);
+    // Used in cleanup/cleanup-target tablet transition stage
+    void cleanup_staging_sstables(locator::effective_replication_map_ptr erm, table_id table_id, locator::tablet_id tid);
+
 private:
+    future<view_building_state> get_latest_view_building_state(raft::term_t term);
+    future<> check_for_aborted_tasks();
+
    future<> run_view_building_state_observer();
    future<> update_built_views();
-    future<> update_building_state();
-    bool is_shard_free(shard_id shard);

    dht::token_range get_tablet_token_range(table_id table_id, dht::token last_token);
    future<> do_build_range(table_id base_id, std::vector<table_id> views_ids, dht::token last_token, abort_source& as);
@@ -195,7 +176,7 @@ private:

    void init_messaging_service();
    future<> uninit_messaging_service();
-    future<std::vector<view_task_result>> work_on_tasks(std::vector<utils::UUID> ids);
+    future<std::vector<utils::UUID>> work_on_tasks(raft::term_t term, std::vector<utils::UUID> ids);
 };

 }
--- a/db/view/view_update_generator.cc
+++ b/db/view/view_update_generator.cc
@@ -102,13 +102,13 @@ view_update_generator::view_update_generator(replica::database& db, sharded<serv
        , _early_abort_subscription(as.subscribe([this] () noexcept { do_abort(); }))
 {
    setup_metrics();
-    discover_staging_sstables();
    _db.plug_view_update_generator(*this);
 }

 view_update_generator::~view_update_generator() {}

 future<> view_update_generator::start() {
+    discover_staging_sstables();
    _started = seastar::async([this]() mutable {
        auto drop_sstable_references = defer([&] () noexcept {
            // Clear sstable references so sstables_manager::stop() doesn't hang.
--- a/db/virtual_tables.cc
+++ b/db/virtual_tables.cc
@@ -605,8 +605,8 @@ public:
    }

    static schema_ptr build_schema() {
-        auto id = generate_legacy_id(system_keyspace::NAME, "versions");
-        return schema_builder(system_keyspace::NAME, "versions", std::make_optional(id))
+        auto id = generate_legacy_id(system_keyspace::NAME, system_keyspace::VERSIONS);
+        return schema_builder(system_keyspace::NAME, system_keyspace::VERSIONS, std::make_optional(id))
            .with_column("key", utf8_type, column_kind::partition_key)
            .with_column("version", utf8_type)
            .with_column("build_mode", utf8_type)
@@ -1206,8 +1206,8 @@ public:

 private:
    static schema_ptr build_schema() {
-        auto id = generate_legacy_id(system_keyspace::NAME, "cdc_timestamps");
-        return schema_builder(system_keyspace::NAME, "cdc_timestamps", std::make_optional(id))
+        auto id = generate_legacy_id(system_keyspace::NAME, system_keyspace::CDC_TIMESTAMPS);
+        return schema_builder(system_keyspace::NAME, system_keyspace::CDC_TIMESTAMPS, std::make_optional(id))
            .with_column("keyspace_name", utf8_type, column_kind::partition_key)
            .with_column("table_name", utf8_type, column_kind::partition_key)
            .with_column("timestamp", reversed_type_impl::get_instance(timestamp_type), column_kind::clustering_key)
@@ -1278,7 +1278,7 @@ public:
            static_assert(int(cdc::stream_state::current) < int(cdc::stream_state::closed));
            static_assert(int(cdc::stream_state::closed) < int(cdc::stream_state::opened));

-            co_await _ss.query_cdc_streams(table, [&] (db_clock::time_point ts, const std::vector<cdc::stream_id>& current, cdc::cdc_stream_diff diff) -> future<> {
+            co_await _ss.query_cdc_streams(table, [&] (db_clock::time_point ts, const utils::chunked_vector<cdc::stream_id>& current, cdc::cdc_stream_diff diff) -> future<> {
                co_await emit_stream_set(ts, cdc::stream_state::current, current);
                co_await emit_stream_set(ts, cdc::stream_state::closed, diff.closed_streams);
                co_await emit_stream_set(ts, cdc::stream_state::opened, diff.opened_streams);
@@ -1289,8 +1289,8 @@ public:
    }
 private:
    static schema_ptr build_schema() {
-        auto id = generate_legacy_id(system_keyspace::NAME, "cdc_streams");
-        return schema_builder(system_keyspace::NAME, "cdc_streams", std::make_optional(id))
+        auto id = generate_legacy_id(system_keyspace::NAME, system_keyspace::CDC_STREAMS);
+        return schema_builder(system_keyspace::NAME, system_keyspace::CDC_STREAMS, std::make_optional(id))
            .with_column("keyspace_name", utf8_type, column_kind::partition_key)
            .with_column("table_name", utf8_type, column_kind::partition_key)
            .with_column("timestamp", timestamp_type, column_kind::clustering_key)
--- a/dht/i_partitioner.cc
+++ b/dht/i_partitioner.cc
@@ -204,7 +204,7 @@ ring_position_range_sharder::next(const schema& s) {
    return ring_position_range_and_shard{std::move(_range), shard};
 }

-ring_position_range_vector_sharder::ring_position_range_vector_sharder(const sharder& sharder, dht::partition_range_vector ranges)
+ring_position_range_vector_sharder::ring_position_range_vector_sharder(const sharder& sharder, utils::chunked_vector<dht::partition_range> ranges)
        : _ranges(std::move(ranges))
        , _sharder(sharder)
        , _current_range(_ranges.begin()) {
--- a/dht/sharder.hh
+++ b/dht/sharder.hh
@@ -11,6 +11,7 @@
 #include "dht/ring_position.hh"
 #include "dht/token-sharding.hh"
 #include "utils/interval.hh"
+#include "utils/chunked_vector.hh"

 #include <vector>

@@ -89,7 +90,7 @@ struct ring_position_range_and_shard_and_element : ring_position_range_and_shard
 //
 // During migration uses a view on shard routing for reads.
 class ring_position_range_vector_sharder {
-    using vec_type = dht::partition_range_vector;
+    using vec_type = utils::chunked_vector<dht::partition_range>;
    vec_type _ranges;
    const sharder& _sharder;
    vec_type::iterator _current_range;
@@ -104,7 +105,7 @@ public:
    // Initializes the `ring_position_range_vector_sharder` with the ranges to be processesd.
    // Input ranges should be non-overlapping (although nothing bad will happen if they do
    // overlap).
-    ring_position_range_vector_sharder(const sharder& sharder, dht::partition_range_vector ranges);
+    ring_position_range_vector_sharder(const sharder& sharder, utils::chunked_vector<dht::partition_range> ranges);
    // Fetches the next range-shard mapping. When the input range is exhausted, std::nullopt is
    // returned. Within an input range, results are contiguous and non-overlapping (but since input
    // ranges usually are discontiguous, overall the results are not contiguous). Together, the results
--- a/dht/token.hh
+++ b/dht/token.hh
@@ -30,6 +30,31 @@ enum class token_kind {
    after_all_keys,
 };

+// Represents a token for partition keys.
+// Has a disengaged state, which sorts before all engaged states.
+struct raw_token {
+    int64_t value;
+
+    /// Constructs a disengaged token.
+    raw_token() : value(std::numeric_limits<int64_t>::min()) {}
+
+    /// Constructs an engaged token.
+    /// The token must be of token_kind::key kind.
+    explicit raw_token(const token&);
+
+    explicit raw_token(int64_t v) : value(v) {};
+
+    std::strong_ordering operator<=>(const raw_token& o) const noexcept = default;
+    std::strong_ordering operator<=>(const token& o) const noexcept;
+
+    /// Returns true iff engaged.
+    explicit operator bool() const noexcept {
+        return value != std::numeric_limits<int64_t>::min();
+    }
+};
+
+using raw_token_opt = seastar::optimized_optional<raw_token>;
+
 class token {
    // INT64_MIN is not a legal token, but a special value used to represent
    // infinity in token intervals.
@@ -52,6 +77,10 @@ public:

    constexpr explicit token(int64_t d) noexcept : token(kind::key, normalize(d)) {}

+    token(raw_token raw) noexcept
+        : token(raw ? kind::key : kind::before_all_keys, raw.value)
+    { }
+
    // This constructor seems redundant with the bytes_view constructor, but
    // it's necessary for IDL, which passes a deserialized_bytes_proxy here.
    // (deserialized_bytes_proxy is convertible to bytes&&, but not bytes_view.)
@@ -223,6 +252,29 @@ public:
    }
 };

+inline
+raw_token::raw_token(const token& t)
+    : value(t.raw())
+{
+#ifdef DEBUG
+    assert(t._kind == token::kind::key);
+#endif
+}
+
+inline
+std::strong_ordering raw_token::operator<=>(const token& o) const noexcept {
+    switch (o._kind) {
+        case token::kind::after_all_keys:
+            return std::strong_ordering::less;
+        case token::kind::before_all_keys:
+            // before_all_keys has a raw value set to the same raw value as a disengaged raw_token, and sorts before all keys.
+            // So we can order them by just comparing raw values.
+            [[fallthrough]];
+        case token::kind::key:
+            return value <=> o._data;
+    }
+}
+
 inline constexpr std::strong_ordering tri_compare_raw(const int64_t l1, const int64_t l2) noexcept {
    if (l1 == l2) {
        return std::strong_ordering::equal;
@@ -329,6 +381,17 @@ struct fmt::formatter<dht::token> : fmt::formatter<string_view> {
    }
 };

+template <>
+struct fmt::formatter<dht::raw_token> : fmt::formatter<string_view> {
+    template <typename FormatContext>
+    auto format(const dht::raw_token& t, FormatContext& ctx) const {
+        if (!t) {
+            return fmt::format_to(ctx.out(), "null");
+        }
+        return fmt::format_to(ctx.out(), "{}", t.value);
+    }
+};
+
 namespace std {

 template<>
--- a/dist/common/scripts/scylla_io_setup
+++ b/dist/common/scripts/scylla_io_setup
@@ -131,6 +131,28 @@ def configure_iotune_open_fd_limit(shards_count):
        logging.error(f"Required FDs count: {precalculated_fds_count}, default limit: {fd_limits}!")
        sys.exit(1)

+def force_random_request_size_of_4k():
+    """
+    It is a known bug that on i4i, i7i, i8g, i8ge instances, the disk controller reports the wrong
+    physical sector size of 512 bytes, while the actual physical sector size is 4096 bytes. This function
+    helps us work around that issue until AWS manages to get a fix for it. It returns 4096 if it
+    detects that it is running on one of the affected instance types; otherwise it returns None and IOTune
+    will use the physical sector size reported by the disk.
+    """
+    path="/sys/devices/virtual/dmi/id/product_name"
+
+    try:
+        with open(path, "r") as f:
+            instance_type = f.read().strip()
+    except FileNotFoundError:
+        logging.warning(f"Couldn't find {path}. Falling back to IOTune using the physical sector size reported by disk.")
+        return
+
+    prefixes = ["i7i", "i4i", "i8g", "i8ge"]
+    if any(instance_type.startswith(p) for p in prefixes):
+        return 4096
+
+
 def run_iotune():
            if "SCYLLA_CONF" in os.environ:
                conf_dir = os.environ["SCYLLA_CONF"]
@@ -173,6 +195,8 @@ def run_iotune():

            configure_iotune_open_fd_limit(cpudata.nr_shards())

+            if (reqsize := force_random_request_size_of_4k()):
+                iotune_args += ["--random-write-io-buffer-size", f"{reqsize}"]
            try:
                subprocess.check_call([bindir() + "/iotune",
                                       "--format", "envfile",
--- a/dist/common/scripts/scylla_raid_setup
+++ b/dist/common/scripts/scylla_raid_setup
@@ -17,6 +17,7 @@ import stat
 import logging
 import pyudev
 import psutil
+import platform
 from pathlib import Path
 from scylla_util import *
 from subprocess import run, SubprocessError
@@ -102,6 +103,21 @@ def is_selinux_enabled():
                return True
    return False

+def is_kernel_version_at_least(major, minor):
+    """Check if the Linux kernel version is at least major.minor"""
+    try:
+        kernel_version = platform.release()
+        # Extract major.minor from version string like "5.15.0-56-generic"
+        version_parts = kernel_version.split('.')
+        if len(version_parts) >= 2:
+            kernel_major = int(version_parts[0])
+            kernel_minor = int(version_parts[1])
+            return (kernel_major, kernel_minor) >= (major, minor)
+    except (ValueError, IndexError):
+        # If we can't parse the version, assume older kernel for safety
+        pass
+    return False
+
 if __name__ == '__main__':
    if os.getuid() > 0:
        print('Requires root permission.')
@@ -231,8 +247,17 @@ if __name__ == '__main__':
    # see https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/mkfs/xfs_mkfs.c .
    # and it also cannot be smaller than the sector size.
    block_size = max(1024, sector_size)
+
    run('udevadm settle', shell=True, check=True)
-    run(f'mkfs.xfs -b size={block_size} {fsdev} -K -m rmapbt=0 -m reflink=0', shell=True, check=True)
+
+    # On Linux 5.12+, sub-block overwrites are supported well, so keep the default block
+    # size, which will play better with the SSD.
+    if is_kernel_version_at_least(5, 12):
+        block_size_opt = ""
+    else:
+        block_size_opt = f"-b size={block_size}"
+
+    run(f'mkfs.xfs {block_size_opt} {fsdev} -K -m rmapbt=0 -m reflink=0', shell=True, check=True)
    run('udevadm settle', shell=True, check=True)

    if is_debian_variant():
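
The kernel check added to scylla_raid_setup above looks only at the first two dot-separated components of `platform.release()`. Below is a minimal standalone sketch of the same parsing. It takes the release string as a parameter so it can be tried without depending on the running kernel. The name `release_at_least` and the sample release strings are illustrative assumptions, not part of the patch.

```python
def release_at_least(release, major, minor):
    """Return True if a kernel release string is at least major.minor.

    Hypothetical helper mirroring the parsing in is_kernel_version_at_least():
    only the first two dot-separated components are considered.
    """
    parts = release.split('.')
    if len(parts) < 2:
        # Not enough components to compare; assume an older kernel.
        return False
    try:
        return (int(parts[0]), int(parts[1])) >= (major, minor)
    except ValueError:
        # Unparsable component (for example "12-rc1"); assume an older kernel.
        return False


if __name__ == "__main__":
    for sample in ("5.15.0-56-generic", "5.4.0-1", "6.1.0", "5.12-rc1"):
        print(sample, release_at_least(sample, 5, 12))
```

Under this parsing, a release string whose second component is not a plain integer, such as "5.12-rc1", is treated as older than 5.12, so the explicit `-b size=...` option would still be passed to mkfs.xfs in that case.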
--- a/dist/common/sysconfig/scylla-node-exporter
+++ b/dist/common/sysconfig/scylla-node-exporter
@@ -1 +1 @@
-SCYLLA_NODE_EXPORTER_ARGS="--collector.interrupts --no-collector.hwmon --no-collector.bcache --no-collector.btrfs --no-collector.fibrechannel --no-collector.infiniband --no-collector.ipvs --no-collector.nfs --no-collector.nfsd --no-collector.powersupplyclass --no-collector.rapl --no-collector.tapestats --no-collector.thermal_zone --no-collector.udp_queues --no-collector.zfs"
+SCYLLA_NODE_EXPORTER_ARGS="--collector.interrupts --collector.ethtool.metrics-include='(bw_in_allowance_exceeded|bw_out_allowance_exceeded|conntrack_allowance_exceeded|conntrack_allowance_available|linklocal_allowance_exceeded)' --collector.ethtool --no-collector.hwmon --no-collector.bcache --no-collector.btrfs --no-collector.fibrechannel --no-collector.infiniband --no-collector.ipvs --no-collector.nfs --no-collector.nfsd --no-collector.powersupplyclass --no-collector.rapl --no-collector.tapestats --no-collector.thermal_zone --no-collector.udp_queues --no-collector.zfs"
--- a/dist/docker/commandlineparser.py
+++ b/dist/docker/commandlineparser.py
@@ -31,4 +31,5 @@ def parse():
    parser.add_argument('--replace-address-first-boot', default=None, dest='replaceAddressFirstBoot', help="[[deprecated]] IP address of a dead node to replace.")
    parser.add_argument('--dc', default=None, dest='dc', help="The datacenter name for this node, for use with the snitch GossipingPropertyFileSnitch.")
    parser.add_argument('--rack', default=None, dest='rack', help="The rack name for this node, for use with the snitch GossipingPropertyFileSnitch.")
+    parser.add_argument('--blocked-reactor-notify-ms', default='25', dest='blocked_reactor_notify_ms', help="Set the blocked reactor notification timeout in milliseconds. Defaults to 25.")
    return parser.parse_known_args()
--- a/dist/docker/redhat/build_docker.sh
+++ b/dist/docker/redhat/build_docker.sh
@@ -97,7 +97,9 @@ bcp LICENSE-ScyllaDB-Source-Available.md /licenses/

 run microdnf clean all
 run microdnf --setopt=tsflags=nodocs -y update
-run microdnf --setopt=tsflags=nodocs -y install hostname kmod procps-ng python3 python3-pip
+run microdnf --setopt=tsflags=nodocs -y install hostname kmod procps-ng python3 python3-pip cpio
+# Extract only systemctl binary from systemd package to avoid installing the whole systemd in the container.
+run bash -ec "microdnf download systemd && rpm2cpio systemd-*.rpm | cpio -idmv ./usr/bin/systemctl && rm -rf systemd-*.rpm"
 run curl -L --output /etc/yum.repos.d/scylla.repo ${repo_file_url}
 run pip3 install --no-cache-dir --prefix /usr supervisor
 run bash -ec "echo LANG=C.UTF-8 > /etc/locale.conf"
@@ -106,6 +108,8 @@ run bash -ec "cat /scylla_bashrc >> /etc/bash.bashrc"
 run mkdir -p /var/log/scylla
 run chown -R scylla:scylla /var/lib/scylla
 run sed -i -e 's/^SCYLLA_ARGS=".*"$/SCYLLA_ARGS="--log-to-syslog 0 --log-to-stdout 1 --network-stack posix"/' /etc/sysconfig/scylla-server
+# Cleanup packages not needed in the final image and clean package manager cache to reduce image size.
+run bash -ec "microdnf remove -y cpio && microdnf clean all"

 run mkdir -p /opt/scylladb/supervisor
 run touch /opt/scylladb/SCYLLA-CONTAINER-FILE
--- a/dist/docker/scyllasetup.py
+++ b/dist/docker/scyllasetup.py
@@ -46,6 +46,7 @@ class ScyllaSetup:
        self._extra_args = extra_arguments
        self._dc = arguments.dc
        self._rack = arguments.rack
+        self._blocked_reactor_notify_ms = arguments.blocked_reactor_notify_ms

    def _run(self, *args, **kwargs):
        logging.info('running: {}'.format(args))
@@ -205,7 +206,7 @@ class ScyllaSetup:
        elif self._replaceAddressFirstBoot is not None:
            args += ["--replace-address-first-boot %s" % self._replaceAddressFirstBoot]

-        args += ["--blocked-reactor-notify-ms 999999999"]
+        args += ["--blocked-reactor-notify-ms %s" % self._blocked_reactor_notify_ms]

        with open("/etc/scylla.d/docker.conf", "w") as cqlshrc:
            cqlshrc.write("SCYLLA_DOCKER_ARGS=\"%s\"\n" % (" ".join(args) + " " + " ".join(self._extra_args)))
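Reviewer sketch (not part of the patch): the two Docker hunks together route a new command-line option into `SCYLLA_DOCKER_ARGS`. The following is a minimal, self-contained approximation of that flow, assuming the unrecognized arguments returned by `parse_known_args` play the role of `extra_arguments`; `build_docker_args` is a hypothetical helper, not a function in the repository.

```python
import argparse


def build_docker_args(argv):
    """Parse argv and return the line that would be written to docker.conf."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--blocked-reactor-notify-ms', default='25',
                        dest='blocked_reactor_notify_ms')
    # Unknown options are kept and passed through unchanged.
    arguments, extra = parser.parse_known_args(argv)
    args = ["--blocked-reactor-notify-ms %s" % arguments.blocked_reactor_notify_ms]
    return 'SCYLLA_DOCKER_ARGS="%s"' % (" ".join(args) + " " + " ".join(extra))
```

With no arguments, the sketch yields the default of 25 ms instead of the previously hard-coded 999999999, so blocked-reactor reports are no longer effectively disabled inside the container unless the user raises the value explicitly.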
--- a/docs/_static/data/os-support.json
+++ b/docs/_static/data/os-support.json
@@ -1,16 +1,25 @@
 {
    "Linux Distributions": {
      "Ubuntu": ["22.04", "24.04"],
-      "Debian": ["11"],
+      "Debian": ["11", "12"],
      "Rocky / CentOS / RHEL": ["8", "9", "10"],
      "Amazon Linux": ["2023"]
    },
    "ScyllaDB Versions": [
+      {
+        "version": "ScyllaDB 2025.4",
+        "supported_OS": {
+          "Ubuntu": ["22.04", "24.04"],
+          "Debian": ["11", "12"],
+          "Rocky / CentOS / RHEL": ["8", "9", "10"],
+          "Amazon Linux": ["2023"]
+        }
+      },
      {
        "version": "ScyllaDB 2025.3",
        "supported_OS": {
          "Ubuntu": ["22.04", "24.04"],
-          "Debian": ["11"],
+          "Debian": ["11", "12"],
          "Rocky / CentOS / RHEL": ["8", "9", "10"],
          "Amazon Linux": ["2023"]
        }
@@ -19,7 +28,7 @@
        "version": "ScyllaDB 2025.2",
        "supported_OS": {
          "Ubuntu": ["22.04", "24.04"],
-          "Debian": ["11"],
+          "Debian": ["11", "12"],
          "Rocky / CentOS / RHEL": ["8", "9"],
          "Amazon Linux": ["2023"]
        }
@@ -28,7 +37,7 @@
        "version": "ScyllaDB 2025.1",
        "supported_OS": {
          "Ubuntu": ["22.04", "24.04"],
-          "Debian": ["11"],
+          "Debian": ["11", "12"],
          "Rocky / CentOS / RHEL": ["8", "9"],
          "Amazon Linux": ["2023"]
        }
--- a/docs/_utils/redirects.yaml
+++ b/docs/_utils/redirects.yaml
@@ -1,6 +1,18 @@
 ### a dictionary of redirections
 #old path: new path

+# Move the driver information to another project
+
+/stable/using-scylla/drivers/index.html: https://docs.scylladb.com/stable/drivers/index.html
+/stable/using-scylla/drivers/dynamo-drivers/index.html: https://docs.scylladb.com/stable/drivers/dynamo-drivers.html
+/stable/using-scylla/drivers/cql-drivers/index.html: https://docs.scylladb.com/stable/drivers/cql-drivers.html
+/stable/using-scylla/drivers/cql-drivers/scylla-python-driver.html: https://docs.scylladb.com/stable/drivers/cql-drivers.html
+/stable/using-scylla/drivers/cql-drivers/scylla-java-driver.html: https://docs.scylladb.com/stable/drivers/cql-drivers.html
+/stable/using-scylla/drivers/cql-drivers/scylla-go-driver.html: https://docs.scylladb.com/stable/drivers/cql-drivers.html
+/stable/using-scylla/drivers/cql-drivers/scylla-gocqlx-driver.html: https://docs.scylladb.com/stable/drivers/cql-drivers.html
+/stable/using-scylla/drivers/cql-drivers/scylla-cpp-driver.html: https://docs.scylladb.com/stable/drivers/cql-drivers.html
+/stable/using-scylla/drivers/cql-drivers/scylla-rust-driver.html: https://docs.scylladb.com/stable/drivers/cql-drivers.html
+
 # Redirect 2025.1 upgrade guides that are not on master but were indexed by Google (404 reported)

 /master/upgrade/upgrade-guides/upgrade-guide-from-2024.x-to-2025.1/upgrade-guide-from-2024.x-to-2025.1.html: https://docs.scylladb.com/manual/stable/upgrade/index.html
--- a/docs/alternator/alternator.md
+++ b/docs/alternator/alternator.md
@@ -134,10 +134,6 @@ want modify a non-top-level attribute directly (e.g., a.b[3].c) need RMW:
 Alternator implements such requests by reading the entire top-level
 attribute a, modifying only a.b[3].c, and then writing back a.

-Currently, Alternator doesn't use Tablets. That's because Alternator relies
-on LWT (lightweight transactions), and LWT is not supported in keyspaces
-with Tablets enabled.
-
 ```{eval-rst}
 .. toctree::
    :maxdepth: 2
--- a/docs/alternator/compatibility.md
+++ b/docs/alternator/compatibility.md
@@ -109,6 +109,32 @@ to do what, configure the following in ScyllaDB's configuration:
    alternator_enforce_authorization: true
 ```

+Note: switching `alternator_enforce_authorization` from `false` to `true`
+before the client application has the proper secret keys and permission
+tables set up will cause the application's requests to immediately fail.
+Therefore, we recommend starting with `alternator_enforce_authorization`
+set to `false` and setting `alternator_warn_authorization` to `true`.
+This setting will continue to allow all requests without failing on
+authentication or authorization errors - but will _count_ would-be
+authentication and authorization failures in the two metrics:
+
+* `scylla_alternator_authentication_failures`
+* `scylla_alternator_authorization_failures`
+
+`alternator_warn_authorization=true` also generates a WARN-level log message
+on each authentication or authorization failure. Each of these log messages
+includes the string `alternator_enforce_authorization=true`, and information
+that can help pinpoint the source of the error - such as the username
+involved in the attempt, and the address of the client sending the request.
+
+When you see that both metrics are not increasing (or, alternatively, that no
+more log messages appear), you can be sure that the application is properly
+set up and can finally set `alternator_enforce_authorization` to `true`.
+You can leave `alternator_warn_authorization` set or unset, depending on
+whether or not you want to see log messages when requests fail on
+authentication/authorization (in any case, the metric counts these failures,
+and the client will also get the error).
+
 Alternator implements the same [signature protocol](https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html)
 as DynamoDB and the rest of AWS. Clients use, as usual, an access key ID and
 a secret access key to prove their identity and the authenticity of their
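Reviewer sketch (not part of the patch): the new documentation advises waiting until the two failure counters stop increasing before enabling enforcement. A minimal way to automate that comparison is sketched below, assuming two snapshots of the node's metrics in Prometheus text format without trailing timestamps; the helper names and the `shard` label in the usage example are assumptions, and only the two metric names come from the text above.

```python
METRICS = (
    "scylla_alternator_authentication_failures",
    "scylla_alternator_authorization_failures",
)


def failure_totals(exposition_text):
    """Sum each failure counter over all label sets in a metrics snapshot."""
    totals = dict.fromkeys(METRICS, 0.0)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name in totals:
            totals[name] += float(value)
    return totals


def safe_to_enforce(earlier, later):
    """True if neither failure counter grew between the two snapshots."""
    before, after = failure_totals(earlier), failure_totals(later)
    return all(after[m] == before[m] for m in METRICS)
```

For example, two identical snapshots containing `scylla_alternator_authentication_failures{shard="0"} 3` make `safe_to_enforce` return `True`, while a later snapshot showing `4` makes it return `False`. The interval between snapshots should be long enough to cover the application's normal request patterns.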