tests: add tests for per-role timeouts

The test cases verify that timeout parameters can be set per role and that their values are validated.

docs: add a paragraph about per-role parameters
2020-11-27 12:43:53 +01:00 · 2020-11-27 12:43:53 +01:00 · 2020-11-27 12:37:27 +01:00 · 2020-11-27 12:37:27 +01:00 · 2020-11-27 12:37:17 +01:00 · 2020-11-26 17:56:55 +01:00
1483 changed files with 19073 additions and 49712 deletions
--- a/.github/workflows/pages.yml
+++ b/.github/workflows/pages.yml
@@ -1,33 +0,0 @@
-name: "CI Docs"
-
-on:
-  push:
-    branches:
-    - master
-    paths:
-    - 'docs/**'
-jobs:
-  release:
-    name: Build
-    runs-on: ubuntu-latest
-    env:
-      LATEST_VERSION: master
-    steps:
-    - name: Checkout
-      uses: actions/checkout@v2
-      with:
-        persist-credentials: false
-        fetch-depth: 0
-    - name: Set up Python
-      uses: actions/setup-python@v1
-      with:
-        python-version: 3.7
-    - name: Build docs
-      run: |
-        export PATH=$PATH:~/.local/bin
-        cd docs
-        make multiversion
-    - name: Deploy
-      run : ./docs/_utils/deploy.sh
-      env:
-        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
--- a/.gitignore
+++ b/.gitignore
@@ -25,5 +25,3 @@ tags
 testlog
 test/*/*.reject
 .vscode
-docs/_build
-docs/poetry.lock
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,6 +1,6 @@
 [submodule "seastar"]
 	path = seastar
-	url = ../scylla-seastar
+	url = ../seastar
 	ignore = dirty
 [submodule "swagger-ui"]
 	path = swagger-ui
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -498,7 +498,6 @@ set(scylla_sources
    mutation_writer/multishard_writer.cc
    mutation_writer/shard_based_splitting_writer.cc
    mutation_writer/timestamp_based_splitting_writer.cc
-    mutation_writer/feed_writers.cc
    partition_slice_builder.cc
    partition_version.cc
    querier.cc
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1,18 +1,11 @@
-# Contributing to Scylla
+# Asking questions or requesting help

-## Asking questions or requesting help
+Use the [ScyllaDB user mailing list](https://groups.google.com/forum/#!forum/scylladb-users) or the [Slack workspace](http://slack.scylladb.com) for general questions and help.

-Use the [Scylla Users mailing list](https://groups.google.com/g/scylladb-users) or the [Slack workspace](http://slack.scylladb.com) for general questions and help.
+# Reporting an issue

-Join the [Scylla Developers mailing list](https://groups.google.com/g/scylladb-dev) for deeper technical discussions and to discuss your ideas for contributions.
+Please use the [Issue Tracker](https://github.com/scylladb/scylla/issues/) to report issues.  Fill in as much information as you can in the issue template, especially for performance problems.

-## Reporting an issue
+# Contributing Code to Scylla

-Please use the [issue tracker](https://github.com/scylladb/scylla/issues/) to report issues or to suggest features. Fill in as much information as you can in the issue template, especially for performance problems.
-
-## Contributing code to Scylla
-
-Before you can contribute code to Scylla for the first time, you should sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send the signed form cla@scylladb.com. You can then submit your changes as patches to the to the [scylladb-dev mailing list](https://groups.google.com/forum/#!forum/scylladb-dev) or as a pull request to the [Scylla project on github](https://github.com/scylladb/scylla).
-If you need help formatting or sending patches, [check out these instructions](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches).
-
-The Scylla C++ source code uses the [Seastar coding style](https://github.com/scylladb/seastar/blob/master/coding-style.md) so please adhere to that in your patches. Note that Scylla code is written with `using namespace seastar`, so should not explicitly add the `seastar::` prefix to Seastar symbols. You will usually not need to add `using namespace seastar` to new source files, because most Scylla header files have `#include "seastarx.hh"`, which does this.
+To contribute code to Scylla, you need to sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send your changes as [patches](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches) to the [mailing list](https://groups.google.com/forum/#!forum/scylladb-dev). We don't accept pull requests on GitHub.
--- a/DEDICATION.txt
+++ b/DEDICATION.txt
@@ -1 +0,0 @@
-Dedicated to the memory of Alberto José Araújo, a coworker and a friend.
--- a/NOTICE.txt
+++ b/NOTICE.txt
@@ -5,5 +5,3 @@ It includes files from https://github.com/antonblanchard/crc32-vpmsum (author An
 These files are located in utils/arch/powerpc/crc32-vpmsum. Their license may be found in licenses/LICENSE-crc32-vpmsum.TXT.

 It includes modified code from https://gitbox.apache.org/repos/asf?p=cassandra-dtest.git (owned by The Apache Software Foundation)
-
-It includes modified tests from https://github.com/etcd-io/etcd.git (owned by The etcd Authors)
--- a/README.md
+++ b/README.md
@@ -42,7 +42,7 @@ For further information, please see:
 * [Docker image build documentation] for information on how to build Docker images.

 [developer documentation]: HACKING.md
-[build documentation]: docs/guides/building.md
+[build documentation]: docs/building.md
 [docker image build documentation]: dist/docker/redhat/README.md

 ## Running Scylla
@@ -65,7 +65,7 @@ $ ./tools/toolchain/dbuild ./build/release/scylla --help

 ## Testing

-See [test.py manual](docs/guides/testing.md).
+See [test.py manual](docs/testing.md).

 ## Scylla APIs and compatibility
 By default, Scylla is compatible with Apache Cassandra and its APIs - CQL and
@@ -78,7 +78,10 @@ and the current compatibility of this feature as well as Scylla-specific extensi

 ## Documentation

-Documentation can be found [here](https://scylla.docs.scylladb.com).
+Documentation can be found in [./docs](./docs) and on the
+[wiki](https://github.com/scylladb/scylla/wiki). There is currently no clear
+definition of what goes where, so when looking for something be sure to check
+both.
 Seastar documentation can be found [here](http://docs.seastar.io/master/index.html).
 User documentation can be found [here](https://docs.scylladb.com/).

--- a/2
+++ b/2
@@ -1,7 +1,7 @@
 #!/bin/sh

 PRODUCT=scylla
-VERSION=4.5.7
+VERSION=4.4.dev

 if test -f version
 then
--- a/alternator/auth.cc
+++ b/alternator/auth.cc
@@ -62,14 +62,6 @@ static std::string apply_sha256(std::string_view msg) {
    return to_hex(hasher.finalize());
 }

-static std::string apply_sha256(const std::vector<temporary_buffer<char>>& msg) {
-    sha256_hasher hasher;
-    for (const temporary_buffer<char>& buf : msg) {
-        hasher.update(buf.get(), buf.size());
-    }
-    return to_hex(hasher.finalize());
-}
-
 static std::string format_time_point(db_clock::time_point tp) {
    time_t time_point_repr = db_clock::to_time_t(tp);
    std::string time_point_str;
@@ -99,7 +91,7 @@ void check_expiry(std::string_view signature_date) {

 std::string get_signature(std::string_view access_key_id, std::string_view secret_access_key, std::string_view host, std::string_view method,
        std::string_view orig_datestamp, std::string_view signed_headers_str, const std::map<std::string_view, std::string_view>& signed_headers_map,
-        const std::vector<temporary_buffer<char>>& body_content, std::string_view region, std::string_view service, std::string_view query_string) {
+        std::string_view body_content, std::string_view region, std::string_view service, std::string_view query_string) {
    auto amz_date_it = signed_headers_map.find("x-amz-date");
    if (amz_date_it == signed_headers_map.end()) {
        throw api_error::invalid_signature("X-Amz-Date header is mandatory for signature verification");
--- a/alternator/auth.hh
+++ b/alternator/auth.hh
@@ -39,7 +39,7 @@ using key_cache = utils::loading_cache<std::string, std::string>;

 std::string get_signature(std::string_view access_key_id, std::string_view secret_access_key, std::string_view host, std::string_view method,
        std::string_view orig_datestamp, std::string_view signed_headers_str, const std::map<std::string_view, std::string_view>& signed_headers_map,
-        const std::vector<temporary_buffer<char>>& body_content, std::string_view region, std::string_view service, std::string_view query_string);
+        std::string_view body_content, std::string_view region, std::string_view service, std::string_view query_string);

 future<std::string> get_key_from_roles(cql3::query_processor& qp, std::string username);

--- a/alternator/conditions.cc
+++ b/alternator/conditions.cc
@@ -123,7 +123,7 @@ struct rjson_engaged_ptr_comp {
 // as internally they're stored in an array, and the order of elements is
 // not important in set equality. See issue #5021
 static bool check_EQ_for_sets(const rjson::value& set1, const rjson::value& set2) {
-    if (!set1.IsArray() || !set2.IsArray() || set1.Size() != set2.Size()) {
+    if (set1.Size() != set2.Size()) {
        return false;
    }
    std::set<const rjson::value*, rjson_engaged_ptr_comp> set1_raw;
@@ -137,107 +137,45 @@ static bool check_EQ_for_sets(const rjson::value& set1, const rjson::value& set2
    }
    return true;
 }
-// Moreover, the JSON being compared can be a nested document with outer
-// layers of lists and maps and some inner set - and we need to get to that
-// inner set to compare it correctly with check_EQ_for_sets() (issue #8514).
-static bool check_EQ(const rjson::value* v1, const rjson::value& v2);
-static bool check_EQ_for_lists(const rjson::value& list1, const rjson::value& list2) {
-    if (!list1.IsArray() || !list2.IsArray() || list1.Size() != list2.Size()) {
-        return false;
-    }
-    auto it1 = list1.Begin();
-    auto it2 = list2.Begin();
-    while (it1 != list1.End()) {
-        // Note: Alternator limits an item's depth (rjson::parse() limits
-        // it to around 37 levels), so this recursion is safe.
-        if (!check_EQ(&*it1, *it2)) {
-            return false;
-        }
-        ++it1;
-        ++it2;
-    }
-    return true;
-}
-static bool check_EQ_for_maps(const rjson::value& list1, const rjson::value& list2) {
-    if (!list1.IsObject() || !list2.IsObject() || list1.MemberCount() != list2.MemberCount()) {
-        return false;
-    }
-    for (auto it1 = list1.MemberBegin(); it1 != list1.MemberEnd(); ++it1) {
-        auto it2 = list2.FindMember(it1->name);
-        if (it2 == list2.MemberEnd() || !check_EQ(&it1->value, it2->value)) {
-            return false;
-        }
-    }
-    return true;
-}

 // Check if two JSON-encoded values match with the EQ relation
 static bool check_EQ(const rjson::value* v1, const rjson::value& v2) {
-    if (v1 && v1->IsObject() && v1->MemberCount() == 1 && v2.IsObject() && v2.MemberCount() == 1) {
-        auto it1 = v1->MemberBegin();
-        auto it2 = v2.MemberBegin();
-        if (it1->name != it2->name) {
-            return false;
-        }
-        if (it1->name == "SS" || it1->name == "NS" || it1->name == "BS") {
-            return check_EQ_for_sets(it1->value, it2->value);
-        } else if(it1->name == "L") {
-            return check_EQ_for_lists(it1->value, it2->value);
-        } else if(it1->name == "M") {
-            return check_EQ_for_maps(it1->value, it2->value);
-        } else {
-            // Other, non-nested types (number, string, etc.) can be compared
-            // literally, comparing their JSON representation.
-            return it1->value == it2->value;
-        }
-    } else {
-        // If v1 and/or v2 are missing (IsNull()) the result should be false.
-        // In the unlikely case that the object is malformed (issue #8070),
-        // let's also return false.
+    if (!v1) {
        return false;
    }
+    if (v1->IsObject() && v1->MemberCount() == 1 && v2.IsObject() && v2.MemberCount() == 1) {
+        auto it1 = v1->MemberBegin();
+        auto it2 = v2.MemberBegin();
+        if ((it1->name == "SS" && it2->name == "SS") || (it1->name == "NS" && it2->name == "NS") || (it1->name == "BS" && it2->name == "BS")) {
+            return check_EQ_for_sets(it1->value, it2->value);
+        }
+    }
+    return *v1 == v2;
 }

 // Check if two JSON-encoded values match with the NE relation
 static bool check_NE(const rjson::value* v1, const rjson::value& v2) {
-    return !check_EQ(v1, v2);
+    return !v1 || *v1 != v2; // null is unequal to anything.
 }

 // Check if two JSON-encoded values match with the BEGINS_WITH relation
-bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2,
-                       bool v1_from_query, bool v2_from_query) {
-    bool bad = false;
-    if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {
-        if (v1_from_query) {
-            throw api_error::validation("begins_with() encountered malformed argument");
-        } else {
-            bad = true;
-        }
-    } else if (v1->MemberBegin()->name != "S" && v1->MemberBegin()->name != "B") {
-        if (v1_from_query) {
-            throw api_error::validation(format("begins_with supports only string or binary type, got: {}", *v1));
-        } else {
-            bad = true;
-        }
-    }
+static bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2) {
+    // BEGINS_WITH requires that its single operand (v2) be a string or
+    // binary - otherwise it's a validation error. However, problems with
+    // the stored attribute (v1) will just return false (no match).
    if (!v2.IsObject() || v2.MemberCount() != 1) {
-        if (v2_from_query) {
-            throw api_error::validation("begins_with() encountered malformed argument");
-        } else {
-            bad = true;
-        }
-    } else if (v2.MemberBegin()->name != "S" && v2.MemberBegin()->name != "B") {
-        if (v2_from_query) {
-            throw api_error::validation(format("begins_with() supports only string or binary type, got: {}", v2));
-        } else {
-            bad = true;
-        }
+        throw api_error::validation(format("BEGINS_WITH operator encountered malformed AttributeValue: {}", v2));
    }
-    if (bad) {
+    auto it2 = v2.MemberBegin();
+    if (it2->name != "S" && it2->name != "B") {
+        throw api_error::validation(format("BEGINS_WITH operator requires String or Binary type in AttributeValue, got {}", it2->name));
+    }
+
+
+    if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {
        return false;
    }
    auto it1 = v1->MemberBegin();
-    auto it2 = v2.MemberBegin();
    if (it1->name != it2->name) {
        return false;
    }
@@ -341,40 +279,24 @@ static bool check_NOT_NULL(const rjson::value* val) {
    return val != nullptr;
 }

-// Only types S, N or B (string, number or bytes) may be compared by the
-// various comparion operators - lt, le, gt, ge, and between.
-// Note that in particular, if the value is missing (v->IsNull()), this
-// check returns false.
-static bool check_comparable_type(const rjson::value& v) {
-    if (!v.IsObject() || v.MemberCount() != 1) {
-        return false;
-    }
-    const rjson::value& type = v.MemberBegin()->name;
-    return type == "S" || type == "N" || type == "B";
-}
-
 // Check if two JSON-encoded values match with cmp.
 template <typename Comparator>
-bool check_compare(const rjson::value* v1, const rjson::value& v2, const Comparator& cmp,
-                   bool v1_from_query, bool v2_from_query) {
-    bool bad = false;
-    if (!v1 || !check_comparable_type(*v1)) {
-        if (v1_from_query) {
-            throw api_error::validation(format("{} allow only the types String, Number, or Binary", cmp.diagnostic));
-        }
-        bad = true;
+bool check_compare(const rjson::value* v1, const rjson::value& v2, const Comparator& cmp) {
+    if (!v2.IsObject() || v2.MemberCount() != 1) {
+        throw api_error::validation(
+                        format("{} requires a single AttributeValue of type String, Number, or Binary",
+                               cmp.diagnostic));
    }
-    if (!check_comparable_type(v2)) {
-        if (v2_from_query) {
-            throw api_error::validation(format("{} allow only the types String, Number, or Binary", cmp.diagnostic));
-        }
-        bad = true;
+    const auto& kv2 = *v2.MemberBegin();
+    if (kv2.name != "S" && kv2.name != "N" && kv2.name != "B") {
+        throw api_error::validation(
+                        format("{} requires a single AttributeValue of type String, Number, or Binary",
+                               cmp.diagnostic));
    }
-    if (bad) {
+    if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {
        return false;
    }
    const auto& kv1 = *v1->MemberBegin();
-    const auto& kv2 = *v2.MemberBegin();
    if (kv1.name != kv2.name) {
        return false;
    }
@@ -388,8 +310,7 @@ bool check_compare(const rjson::value* v1, const rjson::value& v2, const Compara
    if (kv1.name == "B") {
        return cmp(base64_decode(kv1.value), base64_decode(kv2.value));
    }
-    // cannot reach here, as check_comparable_type() verifies the type is one
-    // of the above options.
+    clogger.error("check_compare panic: LHS type equals RHS type, but one is in {N,S,B} while the other isn't");
    return false;
 }

@@ -420,71 +341,56 @@ struct cmp_gt {
    static constexpr const char* diagnostic = "GT operator";
 };

-// True if v is between lb and ub, inclusive.  Throws or returns false
-// (depending on bounds_from_query parameter) if lb > ub.
+// True if v is between lb and ub, inclusive.  Throws if lb > ub.
 template <typename T>
-static bool check_BETWEEN(const T& v, const T& lb, const T& ub, bool bounds_from_query) {
+static bool check_BETWEEN(const T& v, const T& lb, const T& ub) {
    if (cmp_lt()(ub, lb)) {
-        if (bounds_from_query) {
-            throw api_error::validation(
-                format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
-        } else {
-            return false;
-        }
+        throw api_error::validation(
+                        format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
    }
    return cmp_ge()(v, lb) && cmp_le()(v, ub);
 }

-static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const rjson::value& ub,
-                          bool v_from_query, bool lb_from_query, bool ub_from_query) {
-    if ((v && v_from_query && !check_comparable_type(*v)) ||
-        (lb_from_query && !check_comparable_type(lb)) ||
-        (ub_from_query && !check_comparable_type(ub))) {
-        throw api_error::validation("between allow only the types String, Number, or Binary");
-
-    }
-    if (!v || !v->IsObject() || v->MemberCount() != 1 ||
-        !lb.IsObject() || lb.MemberCount() != 1 ||
-        !ub.IsObject() || ub.MemberCount() != 1) {
+static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const rjson::value& ub) {
+    if (!v) {
        return false;
    }
+    if (!v->IsObject() || v->MemberCount() != 1) {
+        throw api_error::validation(format("BETWEEN operator encountered malformed AttributeValue: {}", *v));
+    }
+    if (!lb.IsObject() || lb.MemberCount() != 1) {
+        throw api_error::validation(format("BETWEEN operator encountered malformed AttributeValue: {}", lb));
+    }
+    if (!ub.IsObject() || ub.MemberCount() != 1) {
+        throw api_error::validation(format("BETWEEN operator encountered malformed AttributeValue: {}", ub));
+    }

    const auto& kv_v = *v->MemberBegin();
    const auto& kv_lb = *lb.MemberBegin();
    const auto& kv_ub = *ub.MemberBegin();
-    bool bounds_from_query = lb_from_query && ub_from_query;
    if (kv_lb.name != kv_ub.name) {
-        if (bounds_from_query) {
-           throw api_error::validation(
+        throw api_error::validation(
                format("BETWEEN operator requires the same type for lower and upper bound; instead got {} and {}",
                       kv_lb.name, kv_ub.name));
-        } else {
-            return false;
-        }
    }
    if (kv_v.name != kv_lb.name) { // Cannot compare different types, so v is NOT between lb and ub.
        return false;
    }
    if (kv_v.name == "N") {
        const char* diag = "BETWEEN operator";
-        return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag), bounds_from_query);
+        return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag));
    }
    if (kv_v.name == "S") {
        return check_BETWEEN(std::string_view(kv_v.value.GetString(), kv_v.value.GetStringLength()),
                             std::string_view(kv_lb.value.GetString(), kv_lb.value.GetStringLength()),
-                             std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()),
-                             bounds_from_query);
+                             std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()));
    }
    if (kv_v.name == "B") {
-        return check_BETWEEN(base64_decode(kv_v.value), base64_decode(kv_lb.value), base64_decode(kv_ub.value), bounds_from_query);
+        return check_BETWEEN(base64_decode(kv_v.value), base64_decode(kv_lb.value), base64_decode(kv_ub.value));
    }
-    if (v_from_query) {
-        throw api_error::validation(
-            format("BETWEEN operator requires AttributeValueList elements to be of type String, Number, or Binary; instead got {}",
+    throw api_error::validation(
+        format("BETWEEN operator requires AttributeValueList elements to be of type String, Number, or Binary; instead got {}",
               kv_lb.name));
-    } else {
-        return false;
-    }
 }

 // Verify one Expect condition on one attribute (whose content is "got")
@@ -531,19 +437,19 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
            return check_NE(got, (*attribute_value_list)[0]);
        case comparison_operator_type::LT:
            verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
-            return check_compare(got, (*attribute_value_list)[0], cmp_lt{}, false, true);
+            return check_compare(got, (*attribute_value_list)[0], cmp_lt{});
        case comparison_operator_type::LE:
            verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
-            return check_compare(got, (*attribute_value_list)[0], cmp_le{}, false, true);
+            return check_compare(got, (*attribute_value_list)[0], cmp_le{});
        case comparison_operator_type::GT:
            verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
-            return check_compare(got, (*attribute_value_list)[0], cmp_gt{}, false, true);
+            return check_compare(got, (*attribute_value_list)[0], cmp_gt{});
        case comparison_operator_type::GE:
            verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
-            return check_compare(got, (*attribute_value_list)[0], cmp_ge{}, false, true);
+            return check_compare(got, (*attribute_value_list)[0], cmp_ge{});
        case comparison_operator_type::BEGINS_WITH:
            verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
-            return check_BEGINS_WITH(got, (*attribute_value_list)[0], false, true);
+            return check_BEGINS_WITH(got, (*attribute_value_list)[0]);
        case comparison_operator_type::IN:
            verify_operand_count(attribute_value_list, nonempty(), *comparison_operator);
            return check_IN(got, *attribute_value_list);
@@ -555,8 +461,7 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
            return check_NOT_NULL(got);
        case comparison_operator_type::BETWEEN:
            verify_operand_count(attribute_value_list, exact_size(2), *comparison_operator);
-            return check_BETWEEN(got, (*attribute_value_list)[0], (*attribute_value_list)[1],
-                                 false, true, true);
+            return check_BETWEEN(got, (*attribute_value_list)[0], (*attribute_value_list)[1]);
        case comparison_operator_type::CONTAINS:
            {
                verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
@@ -668,8 +573,7 @@ static bool calculate_primitive_condition(const parsed::primitive_condition& con
            // Shouldn't happen unless we have a bug in the parser
            throw std::logic_error(format("Wrong number of values {} in BETWEEN primitive_condition", cond._values.size()));
        }
-        return check_BETWEEN(&calculated_values[0], calculated_values[1], calculated_values[2],
-                             cond._values[0].is_constant(), cond._values[1].is_constant(), cond._values[2].is_constant());
+        return check_BETWEEN(&calculated_values[0], calculated_values[1], calculated_values[2]);
    case parsed::primitive_condition::type::IN:
        return check_IN(calculated_values);
    case parsed::primitive_condition::type::VALUE:
@@ -700,17 +604,13 @@ static bool calculate_primitive_condition(const parsed::primitive_condition& con
    case parsed::primitive_condition::type::NE:
        return check_NE(&calculated_values[0], calculated_values[1]);
    case parsed::primitive_condition::type::GT:
-        return check_compare(&calculated_values[0], calculated_values[1], cmp_gt{},
-            cond._values[0].is_constant(), cond._values[1].is_constant());
+        return check_compare(&calculated_values[0], calculated_values[1], cmp_gt{});
    case parsed::primitive_condition::type::GE:
-        return check_compare(&calculated_values[0], calculated_values[1], cmp_ge{},
-            cond._values[0].is_constant(), cond._values[1].is_constant());
+        return check_compare(&calculated_values[0], calculated_values[1], cmp_ge{});
    case parsed::primitive_condition::type::LT:
-        return check_compare(&calculated_values[0], calculated_values[1], cmp_lt{},
-            cond._values[0].is_constant(), cond._values[1].is_constant());
+        return check_compare(&calculated_values[0], calculated_values[1], cmp_lt{});
    case parsed::primitive_condition::type::LE:
-        return check_compare(&calculated_values[0], calculated_values[1], cmp_le{},
-            cond._values[0].is_constant(), cond._values[1].is_constant());
+        return check_compare(&calculated_values[0], calculated_values[1], cmp_le{});
    default:
        // Shouldn't happen unless we have a bug in the parser
        throw std::logic_error(format("Unknown type {} in primitive_condition object", (int)(cond._op)));
--- a/alternator/conditions.hh
+++ b/alternator/conditions.hh
@@ -52,7 +52,6 @@ bool verify_expected(const rjson::value& req, const rjson::value* previous_item)
 bool verify_condition(const rjson::value& condition, bool require_all, const rjson::value* previous_item);

 bool check_CONTAINS(const rjson::value* v1, const rjson::value& v2);
-bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2, bool v1_from_query, bool v2_from_query);

 bool verify_condition_expression(
        const parsed::condition_expression& condition_expression,
--- a/alternator/error.hh
+++ b/alternator/error.hh
@@ -59,9 +59,6 @@ public:
    static api_error invalid_signature(std::string msg) {
        return api_error("InvalidSignatureException", std::move(msg));
    }
-    static api_error missing_authentication_token(std::string msg) {
-        return api_error("MissingAuthenticationTokenException", std::move(msg));
-    }
    static api_error unrecognized_client(std::string msg) {
        return api_error("UnrecognizedClientException", std::move(msg));
    }
@@ -80,9 +77,6 @@ public:
    static api_error trimmed_data_access_exception(std::string msg) {
        return api_error("TrimmedDataAccessException", std::move(msg));
    }
-    static api_error request_limit_exceeded(std::string msg) {
-        return api_error("RequestLimitExceeded", std::move(msg));
-    }
    static api_error internal(std::string msg) {
        return api_error("InternalServerError", std::move(msg), reply::status_type::internal_server_error);
    }
--- a/alternator/executor.cc
+++ b/alternator/executor.cc
@@ -55,7 +55,7 @@
 #include "schema.hh"
 #include "alternator/tags_extension.hh"
 #include "alternator/rmw_operation.hh"
-#include <seastar/core/coroutine.hh>
+
 #include <boost/range/adaptors.hpp>

 logging::logger elogger("alternator-executor");
@@ -202,7 +202,7 @@ static schema_ptr get_table(service::storage_proxy& proxy, const rjson::value& r
    if (!schema) {
        // if we get here then the name was missing, since syntax or missing actual CF 
        // checks throw. Slow path, but just call get_table_name to generate exception. 
-        get_table_name(request);
+        get_table_name(request);        
    }
    return schema;
 }
@@ -220,7 +220,7 @@ static std::tuple<bool, std::string_view, std::string_view> try_get_internal_tab
    std::string_view ks_name = table_name.substr(0, delim);
    table_name.remove_prefix(ks_name.size() + 1);
    // Only internal keyspaces can be accessed to avoid leakage
-    if (!is_internal_keyspace(ks_name)) {
+    if (!is_internal_keyspace(sstring(ks_name))) {
        return {false, "", ""};
    }
    return {true, ks_name, table_name};
@@ -476,8 +476,8 @@ future<executor::request_return_type> executor::delete_table(client_state& clien
        return make_ready_future<request_return_type>(api_error::resource_not_found(
                format("Requested resource not found: Table: {} not found", table_name)));
    }
-    return _mm.announce_column_family_drop(keyspace_name, table_name, service::migration_manager::drop_views::yes).then([this, keyspace_name] {
-        return _mm.announce_keyspace_drop(keyspace_name);
+    return _mm.announce_column_family_drop(keyspace_name, table_name, false, service::migration_manager::drop_views::yes).then([this, keyspace_name] {
+        return _mm.announce_keyspace_drop(keyspace_name, false);
    }).then([table_name = std::move(table_name)] {
        // FIXME: need more attributes?
        rjson::value table_description = rjson::empty_object();
@@ -704,48 +704,52 @@ static void update_tags_map(const rjson::value& tags, std::map<sstring, sstring>
 static future<> update_tags(service::migration_manager& mm, schema_ptr schema, std::map<sstring, sstring>&& tags_map) {
    schema_builder builder(schema);
    builder.add_extension(tags_extension::NAME, ::make_shared<tags_extension>(std::move(tags_map)));
-    return mm.announce_column_family_update(builder.build(), false, std::vector<view_ptr>());
+    return mm.announce_column_family_update(builder.build(), false, std::vector<view_ptr>(), false);
 }

 future<executor::request_return_type> executor::tag_resource(client_state& client_state, service_permit permit, rjson::value request) {
    _stats.api_operations.tag_resource++;

-    const rjson::value* arn = rjson::find(request, "ResourceArn");
-    if (!arn || !arn->IsString()) {
-        co_return api_error::access_denied("Incorrect resource identifier");
-    }
-    schema_ptr schema = get_table_from_arn(_proxy, rjson::to_string_view(*arn));
-    std::map<sstring, sstring> tags_map = get_tags_of_table(schema);
-    const rjson::value* tags = rjson::find(request, "Tags");
-    if (!tags || !tags->IsArray()) {
-        co_return api_error::validation("Cannot parse tags");
-    }
-    if (tags->Size() < 1) {
-        co_return api_error::validation("The number of tags must be at least 1") ;
-    }
-    update_tags_map(*tags, tags_map,  update_tags_action::add_tags);
-    co_await update_tags(_mm, schema, std::move(tags_map));
-    co_return json_string("");
+    return seastar::async([this, &client_state, request = std::move(request)] () mutable -> request_return_type {
+        const rjson::value* arn = rjson::find(request, "ResourceArn");
+        if (!arn || !arn->IsString()) {
+            return api_error::access_denied("Incorrect resource identifier");
+        }
+        schema_ptr schema = get_table_from_arn(_proxy, rjson::to_string_view(*arn));
+        std::map<sstring, sstring> tags_map = get_tags_of_table(schema);
+        const rjson::value* tags = rjson::find(request, "Tags");
+        if (!tags || !tags->IsArray()) {
+            return api_error::validation("Cannot parse tags");
+        }
+        if (tags->Size() < 1) {
+            return api_error::validation("The number of tags must be at least 1") ;
+        }
+        update_tags_map(*tags, tags_map,  update_tags_action::add_tags);
+        update_tags(_mm, schema, std::move(tags_map)).get();
+        return json_string("");
+    });
 }

 future<executor::request_return_type> executor::untag_resource(client_state& client_state, service_permit permit, rjson::value request) {
    _stats.api_operations.untag_resource++;

-    const rjson::value* arn = rjson::find(request, "ResourceArn");
-    if (!arn || !arn->IsString()) {
-        co_return api_error::access_denied("Incorrect resource identifier");
-    }
-    const rjson::value* tags = rjson::find(request, "TagKeys");
-    if (!tags || !tags->IsArray()) {
-        co_return api_error::validation(format("Cannot parse tag keys"));
-    }
+    return seastar::async([this, &client_state, request = std::move(request)] () -> request_return_type {
+        const rjson::value* arn = rjson::find(request, "ResourceArn");
+        if (!arn || !arn->IsString()) {
+            return api_error::access_denied("Incorrect resource identifier");
+        }
+        const rjson::value* tags = rjson::find(request, "TagKeys");
+        if (!tags || !tags->IsArray()) {
+            return api_error::validation(format("Cannot parse tag keys"));
+        }

-    schema_ptr schema = get_table_from_arn(_proxy, rjson::to_string_view(*arn));
+        schema_ptr schema = get_table_from_arn(_proxy, rjson::to_string_view(*arn));

-    std::map<sstring, sstring> tags_map = get_tags_of_table(schema);
-    update_tags_map(*tags, tags_map, update_tags_action::delete_tags);
-    co_await update_tags(_mm, schema, std::move(tags_map));
-    co_return json_string("");
+        std::map<sstring, sstring> tags_map = get_tags_of_table(schema);
+        update_tags_map(*tags, tags_map, update_tags_action::delete_tags);
+        update_tags(_mm, schema, std::move(tags_map)).get();
+        return json_string("");
+    });
 }

 future<executor::request_return_type> executor::list_tags_of_resource(client_state& client_state, service_permit permit, rjson::value request) {
@@ -981,7 +985,7 @@ future<executor::request_return_type> executor::create_table(client_state& clien
    return create_keyspace(keyspace_name).handle_exception_type([] (exceptions::already_exists_exception&) {
            // Ignore the fact that the keyspace may already exist. See discussion in #6340
        }).then([this, table_name, request = std::move(request), schema, view_builders = std::move(view_builders), tags_map = std::move(tags_map)] () mutable {
-        return futurize_invoke([&] { return _mm.announce_new_column_family(schema); }).then([this, table_info = std::move(request), schema, view_builders = std::move(view_builders), tags_map = std::move(tags_map)] () mutable {
+        return futurize_invoke([&] { return _mm.announce_new_column_family(schema, false); }).then([this, table_info = std::move(request), schema, view_builders = std::move(view_builders), tags_map = std::move(tags_map)] () mutable {
            return parallel_for_each(std::move(view_builders), [this, schema] (schema_builder builder) {
                return _mm.announce_new_view(view_ptr(builder.build()));
            }).then([this, table_info = std::move(table_info), schema, tags_map = std::move(tags_map)] () mutable {
@@ -1237,16 +1241,10 @@ mutation put_or_delete_item::build(schema_ptr schema, api::timestamp_type ts) co
    return m;
 }

-// The DynamoDB API doesn't let the client control the server's timeout, so
-// we have a global default_timeout() for Alternator requests. The value of
-// default_timeout is overwritten by main.cc based on the
-// "alternator_timeout_in_ms" configuration parameter.
-db::timeout_clock::duration executor::s_default_timeout = 10s;
-void executor::set_default_timeout(db::timeout_clock::duration timeout) {
-    s_default_timeout = timeout;
-}
+// The DynamoDB API doesn't let the client control the server's timeout.
+// Let's pick something reasonable:
 db::timeout_clock::time_point executor::default_timeout() {
-    return db::timeout_clock::now() + s_default_timeout;
+    return db::timeout_clock::now() + 10s;
 }
        
 static future<std::unique_ptr<rjson::value>> get_previous_item(
@@ -1882,182 +1880,18 @@ static std::string get_item_type_string(const rjson::value& v) {
    return it->name.GetString();
 }

-// attrs_to_get saves for each top-level attribute an attrs_to_get_node,
-// a hierarchy of subparts that need to be kept. The following function
-// takes a given JSON value and drops its parts which weren't asked to be
-// kept. It modifies the given JSON value, or returns false to signify that
-// the entire object should be dropped.
-// Note that The JSON value is assumed to be encoded using the DynamoDB
-// conventions - i.e., it is really a map whose key has a type string,
-// and the value is the real object.
-template<typename T>
-static bool hierarchy_filter(rjson::value& val, const attribute_path_map_node<T>& h) {
-    if (!val.IsObject() || val.MemberCount() != 1) {
-        // This shouldn't happen. We shouldn't have stored malformed objects.
-        // But today Alternator does not validate the structure of nested
-        // documents before storing them, so this can happen on read.
-        throw api_error::internal(format("Malformed value object read: {}", val));
-    }
-    const char* type = val.MemberBegin()->name.GetString();
-    rjson::value& v = val.MemberBegin()->value;
-    if (h.has_members()) {
-        const auto& members = h.get_members();
-        if (type[0] != 'M' || !v.IsObject()) {
-            // If v is not an object (dictionary, map), none of the members
-            // can match.
-            return false;
-        }
-        rjson::value newv = rjson::empty_object();
-        for (auto it = v.MemberBegin(); it != v.MemberEnd(); ++it) {
-            std::string attr = it->name.GetString();
-            auto x = members.find(attr);
-            if (x != members.end()) {
-                if (x->second) {
-                    // Only a part of this attribute is to be filtered, do it.
-                    if (hierarchy_filter(it->value, *x->second)) {
-                        rjson::set_with_string_name(newv, attr, std::move(it->value));
-                    }
-                } else {
-                    // The entire attribute is to be kept
-                    rjson::set_with_string_name(newv, attr, std::move(it->value));
-                }
-            }
-        }
-        if (newv.MemberCount() == 0) {
-            return false;
-        }
-        v = newv;
-    } else if (h.has_indexes()) {
-        const auto& indexes = h.get_indexes();
-        if (type[0] != 'L' || !v.IsArray()) {
-            return false;
-        }
-        rjson::value newv = rjson::empty_array();
-        const auto& a = v.GetArray();
-        for (unsigned i = 0; i < v.Size(); i++) {
-            auto x = indexes.find(i);
-            if (x != indexes.end()) {
-                if (x->second) {
-                    if (hierarchy_filter(a[i], *x->second)) {
-                        rjson::push_back(newv, std::move(a[i]));
-                    }
-                } else {
-                    // The entire attribute is to be kept
-                    rjson::push_back(newv, std::move(a[i]));
-                }
-            }
-        }
-        if (newv.Size() == 0) {
-            return false;
-        }
-        v = newv;
-    }
-    return true;
-}
-
-// Add a path to a attribute_path_map. Throws a validation error if the path
-// "overlaps" with one already in the filter (one is a sub-path of the other)
-// or "conflicts" with it (both a member and index is requested).
-template<typename T>
-void attribute_path_map_add(const char* source, attribute_path_map<T>& map, const parsed::path& p, T value = {}) {
-   using node = attribute_path_map_node<T>;
-    // The first step is to look for the top-level attribute (p.root()):
-    auto it = map.find(p.root());
-    if (it == map.end()) {
-        if (p.has_operators()) {
-            it = map.emplace(p.root(), node {std::nullopt}).first;
-        } else {
-            (void) map.emplace(p.root(), node {std::move(value)}).first;
-            // Value inserted for top-level node. We're done.
-            return;
-        }
-    } else if(!p.has_operators()) {
-        // If p is top-level and we already have it or a part of it
-        // in map, it's a forbidden overlapping path.
-        throw api_error::validation(format(
-            "Invalid {}: two document paths overlap at {}", source, p.root()));
-    } else if (it->second.has_value()) {
-        // If we're here, it != map.end() && p.has_operators && it->second.has_value().
-        // This means the top-level attribute already has a value, and we're
-        // trying to add a non-top-level value. It's an overlap.
-        throw api_error::validation(format("Invalid {}: two document paths overlap at {}", source, p.root()));
-    }
-    node* h = &it->second;
-    // The second step is to walk h from the top-level node to the inner node
-    // where we're supposed to insert the value:
-    for (const auto& op : p.operators()) {
-        std::visit(overloaded_functor {
-            [&] (const std::string& member) {
-                if (h->is_empty()) {
-                    *h = node {typename node::members_t()};
-                } else if (h->has_indexes()) {
-                    throw api_error::validation(format("Invalid {}: two document paths conflict at {}", source, p));
-                } else if (h->has_value()) {
-                    throw api_error::validation(format("Invalid {}: two document paths overlap at {}", source, p));
-                }
-                typename node::members_t& members = h->get_members();
-                auto it = members.find(member);
-                if (it == members.end()) {
-                    it = members.insert({member, std::make_unique<node>()}).first;
-                }
-                h = it->second.get();
-            },
-            [&] (unsigned index) {
-                if (h->is_empty()) {
-                    *h = node {typename node::indexes_t()};
-                } else if (h->has_members()) {
-                    throw api_error::validation(format("Invalid {}: two document paths conflict at {}", source, p));
-                } else if (h->has_value()) {
-                    throw api_error::validation(format("Invalid {}: two document paths overlap at {}", source, p));
-                }
-                typename node::indexes_t& indexes = h->get_indexes();
-                auto it = indexes.find(index);
-                if (it == indexes.end()) {
-                    it = indexes.insert({index, std::make_unique<node>()}).first;
-                }
-                h = it->second.get();
-            }
-        }, op);
-    }
-    // Finally, insert the value in the node h.
-    if (h->is_empty()) {
-        *h = node {std::move(value)};
-    } else {
-        throw api_error::validation(format("Invalid {}: two document paths overlap at {}", source, p));
-    }
-}
-
-// A very simplified version of the above function for the special case of
-// adding only top-level attribute. It's not only simpler, we also use a
-// different error message, referring to a "duplicate attribute"instead of
-// "overlapping paths". DynamoDB also has this distinction (errors in
-// AttributesToGet refer to duplicates, not overlaps, but errors in
-// ProjectionExpression refer to overlap - even if it's an exact duplicate).
-template<typename T>
-void attribute_path_map_add(const char* source, attribute_path_map<T>& map, const std::string& attr, T value = {}) {
-   using node = attribute_path_map_node<T>;
-    auto it = map.find(attr);
-    if (it == map.end()) {
-        map.emplace(attr, node {std::move(value)});
-    } else {
-        throw api_error::validation(format(
-            "Invalid {}: Duplicate attribute: {}", source, attr));
-    }
-}
-
 // calculate_attrs_to_get() takes either AttributesToGet or
 // ProjectionExpression parameters (having both is *not* allowed),
 // and returns the list of cells we need to read, or an empty set when
 // *all* attributes are to be returned.
-// However, in our current implementation, only top-level attributes are
-// stored as separate cells - a nested document is stored serialized together
-// (as JSON) in the same cell. So this function return a map - each key is the
-// top-level attribute we will need need to read, and the value for each
-// top-level attribute is the partial hierarchy (struct hierarchy_filter)
-// that we will need to extract from that serialized JSON.
-// For example, if ProjectionExpression lists a.b and a.c[2], we
-// return one top-level attribute name, "a", with the value "{b, c[2]}".
-static attrs_to_get calculate_attrs_to_get(const rjson::value& req, std::unordered_set<std::string>& used_attribute_names) {
+// In our current implementation, only top-level attributes are stored
+// as cells, and nested documents are stored serialized as JSON.
+// So this function currently returns only the top-level attributes
+// but we also need to add, after the query, filtering to keep only
+// the parts of the JSON attributes that were chosen in the paths'
+// operators. Because we don't have such filtering yet (FIXME), we fail here
+// if the requested paths are anything but top-level attributes.
+std::unordered_set<std::string> calculate_attrs_to_get(const rjson::value& req, std::unordered_set<std::string>& used_attribute_names) {
    const bool has_attributes_to_get = req.HasMember("AttributesToGet");
    const bool has_projection_expression = req.HasMember("ProjectionExpression");
    if (has_attributes_to_get && has_projection_expression) {
@@ -2066,9 +1900,9 @@ static attrs_to_get calculate_attrs_to_get(const rjson::value& req, std::unorder
    }
    if (has_attributes_to_get) {
        const rjson::value& attributes_to_get = req["AttributesToGet"];
-        attrs_to_get ret;
+        std::unordered_set<std::string> ret;
        for (auto it = attributes_to_get.Begin(); it != attributes_to_get.End(); ++it) {
-            attribute_path_map_add("AttributesToGet", ret, it->GetString());
+            ret.insert(it->GetString());
        }
        return ret;
    } else if (has_projection_expression) {
@@ -2081,13 +1915,24 @@ static attrs_to_get calculate_attrs_to_get(const rjson::value& req, std::unorder
            throw api_error::validation(e.what());
        }
        resolve_projection_expression(paths_to_get, expression_attribute_names, used_attribute_names);
-        attrs_to_get ret;
-        for (const parsed::path& p : paths_to_get) {
-            attribute_path_map_add("ProjectionExpression", ret, p);
-        }
+        std::unordered_set<std::string> seen_column_names;
+        auto ret = boost::copy_range<std::unordered_set<std::string>>(paths_to_get |
+            boost::adaptors::transformed([&] (const parsed::path& p) {
+                if (p.has_operators()) {
+                    // FIXME: this check will need to change when we support non-toplevel attributes
+                    throw api_error::validation("Non-toplevel attributes in ProjectionExpression not yet implemented");
+                }
+                if (!seen_column_names.insert(p.root()).second) {
+                    // FIXME: this check will need to change when we support non-toplevel attributes
+                    throw api_error::validation(
+                            format("Invalid ProjectionExpression: two document paths overlap with each other: {} and {}.",
+                                    p.root(), p.root()));
+                }
+                return p.root();
+            }));
        return ret;
    }
-    // An empty map asks to read everything
+    // An empty set asks to read everything
    return {};
 }

@@ -2108,7 +1953,7 @@ static attrs_to_get calculate_attrs_to_get(const rjson::value& req, std::unorder
 */ 
 void executor::describe_single_item(const cql3::selection::selection& selection,
    const std::vector<bytes_opt>& result_row,
-    const attrs_to_get& attrs_to_get,
+    const std::unordered_set<std::string>& attrs_to_get,
    rjson::value& item,
    bool include_all_embedded_attributes) 
 {
@@ -2129,16 +1974,7 @@ void executor::describe_single_item(const cql3::selection::selection& selection,
                std::string attr_name = value_cast<sstring>(entry.first);
                if (include_all_embedded_attributes || attrs_to_get.empty() || attrs_to_get.contains(attr_name)) {
                    bytes value = value_cast<bytes>(entry.second);
-                    rjson::value v = deserialize_item(value);
-                    auto it = attrs_to_get.find(attr_name);
-                    if (it != attrs_to_get.end()) {
-                        // attrs_to_get may have asked for only part of this attribute:
-                        if (hierarchy_filter(v, it->second)) {
-                            rjson::set_with_string_name(item, attr_name, std::move(v));
-                        }
-                    } else {
-                        rjson::set_with_string_name(item, attr_name, std::move(v));
-                    }
+                    rjson::set_with_string_name(item, attr_name, deserialize_item(value));
                }
            }
        }
@@ -2150,7 +1986,7 @@ std::optional<rjson::value> executor::describe_single_item(schema_ptr schema,
        const query::partition_slice& slice,
        const cql3::selection::selection& selection,
        const query::result& query_result,
-        const attrs_to_get& attrs_to_get) {
+        const std::unordered_set<std::string>& attrs_to_get) {
    rjson::value item = rjson::empty_object();

    cql3::selection::result_set_builder builder(selection, gc_clock::now(), cql_serialization_format::latest());
@@ -2186,16 +2022,8 @@ static bool check_needs_read_before_write(const parsed::value& v) {
    }, v._value);
 }

-static bool check_needs_read_before_write(const attribute_path_map<parsed::update_expression::action>& update_expression) {
-    return boost::algorithm::any_of(update_expression, [](const auto& p) {
-        if (!p.second.has_value()) {
-            // If the action is not on the top-level attribute, we need to
-            // read the old item: we change only a part of the top-level
-            // attribute, and write the full top-level attribute back.
-            return true;
-        }
-        // Otherwise, the action p.second.get_value() is just on top-level
-        // attribute. Check if it needs read-before-write:
+static bool check_needs_read_before_write(const parsed::update_expression& update_expression) {
+    return boost::algorithm::any_of(update_expression.actions(), [](const parsed::update_expression::action& action) {
        return std::visit(overloaded_functor {
            [&] (const parsed::update_expression::action::set& a) -> bool {
                return check_needs_read_before_write(a._rhs._v1) || (a._rhs._op != 'v' && check_needs_read_before_write(a._rhs._v2));
@@ -2209,7 +2037,7 @@ static bool check_needs_read_before_write(const attribute_path_map<parsed::updat
            [&] (const parsed::update_expression::action::del& a) -> bool {
                return true;
            }
-        }, p.second.get_value()._action);
+        }, action._action);
    });
 }

@@ -2218,11 +2046,7 @@ public:
    // Some information parsed during the constructor to check for input
    // errors, and cached to be used again during apply().
    rjson::value* _attribute_updates;
-    // Instead of keeping a parsed::update_expression with an unsorted list
-    // list of actions, we keep them in an attribute_path_map which groups
-    // them by top-level attribute, and detects forbidden overlaps/conflicts.
-    attribute_path_map<parsed::update_expression::action> _update_expression;
-
+    parsed::update_expression _update_expression;
    parsed::condition_expression _condition_expression;

    update_item_operation(service::storage_proxy& proxy, rjson::value&& request);
@@ -2253,22 +2077,16 @@ update_item_operation::update_item_operation(service::storage_proxy& proxy, rjso
            throw api_error::validation("UpdateExpression must be a string");
        }
        try {
-            parsed::update_expression expr = parse_update_expression(update_expression->GetString());
-            resolve_update_expression(expr,
+            _update_expression = parse_update_expression(update_expression->GetString());
+            resolve_update_expression(_update_expression,
                    expression_attribute_names, expression_attribute_values,
                    used_attribute_names, used_attribute_values);
-            if (expr.empty()) {
-                throw api_error::validation("Empty expression in UpdateExpression is not allowed");
-            }
-            for (auto& action : expr.actions()) {
-                // Unfortunately we need to copy the action's path, because
-                // we std::move the action object.
-                auto p = action._path;
-                attribute_path_map_add("UpdateExpression", _update_expression, p, std::move(action));
-            }
        } catch(expressions_syntax_error& e) {
            throw api_error::validation(e.what());
        }
+        if (_update_expression.empty()) {
+            throw api_error::validation("Empty expression in UpdateExpression is not allowed");
+        }
    }
    _attribute_updates = rjson::find(_request, "AttributeUpdates");
    if (_attribute_updates) {
@@ -2310,187 +2128,6 @@ update_item_operation::needs_read_before_write() const {
           (_returnvalues != returnvalues::NONE && _returnvalues != returnvalues::UPDATED_NEW);
 }

-// action_result() returns the result of applying an UpdateItem action -
-// this result is either a JSON object or an unset optional which indicates
-// the action was a deletion. The caller (update_item_operation::apply()
-// below) will either write this JSON as the content of a column, or
-// use it as a piece in a bigger top-level attribute.
-static std::optional<rjson::value> action_result(
-        const parsed::update_expression::action& action,
-        const rjson::value* previous_item) {
-    return std::visit(overloaded_functor {
-        [&] (const parsed::update_expression::action::set& a) -> std::optional<rjson::value> {
-            return calculate_value(a._rhs, previous_item);
-        },
-        [&] (const parsed::update_expression::action::remove& a) -> std::optional<rjson::value> {
-            return std::nullopt;
-        },
-        [&] (const parsed::update_expression::action::add& a) -> std::optional<rjson::value> {
-            parsed::value base;
-            parsed::value addition;
-            base.set_path(action._path);
-            addition.set_constant(a._valref);
-            rjson::value v1 = calculate_value(base, calculate_value_caller::UpdateExpression, previous_item);
-            rjson::value v2 = calculate_value(addition, calculate_value_caller::UpdateExpression, previous_item);
-            rjson::value result;
-            // An ADD can be used to create a new attribute (when
-            // v1.IsNull()) or to add to a pre-existing attribute:
-            if (v1.IsNull()) {
-                std::string v2_type = get_item_type_string(v2);
-                if (v2_type == "N" || v2_type == "SS" || v2_type == "NS" || v2_type == "BS") {
-                    result = v2;
-                } else {
-                    throw api_error::validation(format("An operand in the update expression has an incorrect data type: {}", v2));
-                }
-            } else {
-                std::string v1_type = get_item_type_string(v1);
-                if (v1_type == "N") {
-                    if (get_item_type_string(v2) != "N") {
-                        throw api_error::validation(format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
-                    }
-                    result = number_add(v1, v2);
-                } else if (v1_type == "SS" || v1_type == "NS" || v1_type == "BS") {
-                    if (get_item_type_string(v2) != v1_type) {
-                        throw api_error::validation(format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
-                    }
-                    result = set_sum(v1, v2);
-                } else {
-                    throw api_error::validation(format("An operand in the update expression has an incorrect data type: {}", v1));
-                }
-            }
-            return result;
-        },
-        [&] (const parsed::update_expression::action::del& a) -> std::optional<rjson::value> {
-            parsed::value base;
-            parsed::value subset;
-            base.set_path(action._path);
-            subset.set_constant(a._valref);
-            rjson::value v1 = calculate_value(base, calculate_value_caller::UpdateExpression, previous_item);
-            rjson::value v2 = calculate_value(subset, calculate_value_caller::UpdateExpression, previous_item);
-            if (!v1.IsNull()) {
-                return set_diff(v1, v2);
-            }
-            // When we return nullopt here, we ask to *delete* this attribute,
-            // which is unnecessary because we know the attribute does not
-            // exist anyway. This is a waste, but a small one. Note that also
-            // for the "remove" action above we don't bother to check if the
-            // previous_item add anything to remove.
-            return std::nullopt;
-        }
-    }, action._action);
-}
-
-// Print an attribute_path_map_node<action> as the list of paths it contains:
-static std::ostream& operator<<(std::ostream& out, const attribute_path_map_node<parsed::update_expression::action>& h) {
-    if (h.has_value()) {
-        out << " " << h.get_value()._path;
-    } else if (h.has_members()) {
-        for (auto& member : h.get_members()) {
-            out << *member.second;
-        }
-    } else if (h.has_indexes()) {
-        for (auto& index : h.get_indexes()) {
-            out << *index.second;
-        }
-    }
-    return out;
-}
-
-// Apply the hierarchy of actions in an attribute_path_map_node<action> to a
-// JSON object which uses DynamoDB's serialization conventions. The complete,
-// unmodified, previous_item is also necessary for the right-hand sides of the
-// actions. Modifies obj in-place or returns false if it is to be removed.
-static bool hierarchy_actions(
-        rjson::value& obj,
-        const attribute_path_map_node<parsed::update_expression::action>& h,
-        const rjson::value* previous_item)
-{
-    if (!obj.IsObject() || obj.MemberCount() != 1) {
-        // This shouldn't happen. We shouldn't have stored malformed objects.
-        // But today Alternator does not validate the structure of nested
-        // documents before storing them, so this can happen on read.
-        throw api_error::validation(format("Malformed value object read: {}", obj));
-    }
-    const char* type = obj.MemberBegin()->name.GetString();
-    rjson::value& v = obj.MemberBegin()->value;
-    if (h.has_value()) {
-        // Action replacing everything in this position in the hierarchy
-        std::optional<rjson::value> newv = action_result(h.get_value(), previous_item);
-        if (newv) {
-            obj = std::move(*newv);
-        } else {
-            return false;
-        }
-    } else if (h.has_members()) {
-        if (type[0] != 'M' || !v.IsObject()) {
-            // A .something on a non-map doesn't work.
-            throw api_error::validation(format("UpdateExpression: document paths not valid for this item:{}", h));
-        }
-        for (const auto& member : h.get_members()) {
-            std::string attr = member.first;
-            const attribute_path_map_node<parsed::update_expression::action>& subh = *member.second;
-            rjson::value *subobj = rjson::find(v, attr);
-            if (subobj) {
-                if (!hierarchy_actions(*subobj, subh, previous_item)) {
-                    rjson::remove_member(v, attr);
-                }
-            } else {
-                // When a.b does not exist, setting a.b itself (i.e.
-                // subh.has_value()) is fine, but setting a.b.c is not.
-                if (subh.has_value()) {
-                    std::optional<rjson::value> newv = action_result(subh.get_value(), previous_item);
-                    if (newv) {
-                        rjson::set_with_string_name(v, attr, std::move(*newv));
-                    } else {
-                        // Removing a.b when a is a map but a.b doesn't exist
-                        // is silently ignored. It's not considered an error.
-                    }
-                } else {
-                    throw api_error::validation(format("UpdateExpression: document paths not valid for this item:{}", h));
-                }
-            }
-        }
-    } else if (h.has_indexes()) {
-        if (type[0] != 'L' || !v.IsArray()) {
-            // A [i] on a non-list doesn't work.
-            throw api_error::validation(format("UpdateExpression: document paths not valid for this item:{}", h));
-        }
-        unsigned nremoved = 0;
-        for (const auto& index : h.get_indexes()) {
-            unsigned i = index.first - nremoved;
-            const attribute_path_map_node<parsed::update_expression::action>& subh = *index.second;
-            if (i < v.Size()) {
-                if (!hierarchy_actions(v[i], subh, previous_item)) {
-                    v.Erase(v.Begin() + i);
-                    // If we have the actions "REMOVE a[1] SET a[3] = :val",
-                    // the index 3 refers to the original indexes, before any
-                    // items were removed. So we offset the next indexes
-                    // (which are guaranteed to be higher than i - indexes is
-                    // a sorted map) by an increased "nremoved".
-                    nremoved++;
-                }
-            } else {
-                // If a[7] does not exist, setting a[7] itself (i.e.
-                // subh.has_value()) is fine - and appends an item, though
-                // not necessarily with index 7. But setting a[7].b will
-                // not work.
-                if (subh.has_value()) {
-                    std::optional<rjson::value> newv = action_result(subh.get_value(), previous_item);
-                    if (newv) {
-                        rjson::push_back(v, std::move(*newv));
-                    } else {
-                        // Removing a[7] when the list has fewer elements is
-                        // silently ignored. It's not considered an error.
-                    }
-                } else {
-                    throw api_error::validation(format("UpdateExpression: document paths not valid for this item:{}", h));
-                }
-            }
-        }
-    }
-    return true;
-}
-
 std::optional<mutation>
 update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) const {
    if (!verify_expected(_request, previous_item.get()) ||
@@ -2505,37 +2142,17 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
    auto& row = m.partition().clustered_row(*_schema, _ck);
    attribute_collector attrs_collector;
    bool any_updates = false;
-    auto do_update = [&] (bytes&& column_name, const rjson::value& json_value,
-                          const attribute_path_map_node<parsed::update_expression::action>* h = nullptr) {
+    auto do_update = [&] (bytes&& column_name, const rjson::value& json_value) {
        any_updates = true;
-        if (_returnvalues == returnvalues::ALL_NEW) {
-            rjson::replace_with_string_name(_return_attributes,
-                to_sstring_view(column_name), rjson::copy(json_value));
-        } else if (_returnvalues == returnvalues::UPDATED_NEW) {
-            rjson::value&& v = rjson::copy(json_value);
-            if (h) {
-                // If the operation was only on specific attribute paths,
-                // leave only them in _return_attributes.
-                if (hierarchy_filter(v, *h)) {
-                    rjson::set_with_string_name(_return_attributes,
-                        to_sstring_view(column_name), std::move(v));
-                }
-            } else {
-                rjson::set_with_string_name(_return_attributes,
-                    to_sstring_view(column_name), std::move(v));
-            }
+        if (_returnvalues == returnvalues::ALL_NEW ||
+            _returnvalues == returnvalues::UPDATED_NEW) {
+            rjson::set_with_string_name(_return_attributes,
+                    to_sstring_view(column_name), rjson::copy(json_value));
        } else if (_returnvalues == returnvalues::UPDATED_OLD && previous_item) {
            std::string_view cn =  to_sstring_view(column_name);
            const rjson::value* col = rjson::find(*previous_item, cn);
            if (col) {
-                rjson::value&& v = rjson::copy(*col);
-                if (h) {
-                    if (hierarchy_filter(v, *h)) {
-                        rjson::set_with_string_name(_return_attributes, cn, std::move(v));
-                    }
-                } else {
-                    rjson::set_with_string_name(_return_attributes, cn, std::move(v));
-                }
+                rjson::set_with_string_name(_return_attributes, cn, rjson::copy(*col));
            }
        }
        const column_definition* cdef = _schema->get_column_definition(column_name);
@@ -2577,7 +2194,7 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
    // can just move previous_item later, when we don't need it any more.
    if (_returnvalues == returnvalues::ALL_NEW) {
        if (previous_item) {
-            _return_attributes = rjson::copy(*previous_item);
+            _return_attributes = std::move(*previous_item);
        } else {
            // If there is no previous item, usually a new item is created
             // and contains the given key. This may be cancelled at the end
@@ -2590,44 +2207,77 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
    }

    if (!_update_expression.empty()) {
-        for (auto& actions : _update_expression) {
-            // The actions of _update_expression are grouped by top-level
-            // attributes. Here, all actions in actions.second share the same
-            // top-level attribute actions.first.
-            std::string column_name = actions.first;
+        std::unordered_set<std::string> seen_column_names;
+        for (auto& action : _update_expression.actions()) {
+            if (action._path.has_operators()) {
+                // FIXME: implement this case
+                throw api_error::validation("UpdateItem support for nested updates not yet implemented");
+            }
+            std::string column_name = action._path.root();
            const column_definition* cdef = _schema->get_column_definition(to_bytes(column_name));
            if (cdef && cdef->is_primary_key()) {
-                throw api_error::validation(format("UpdateItem cannot update key column {}", column_name));
+                throw api_error::validation(
+                        format("UpdateItem cannot update key column {}", column_name));
            }
-            if (actions.second.has_value()) {
-                // An action on a top-level attribute column_name. The single
-                // action is actions.second.get_value(). We can simply invoke
-                // the action and replace the attribute with its result:
-                std::optional<rjson::value> result = action_result(actions.second.get_value(), previous_item.get());
-                if (result) {
-                    do_update(to_bytes(column_name), *result);
-                } else {
+            // DynamoDB forbids multiple updates in the same expression to
+            // modify overlapping document paths. Updates of one expression
+            // have the same timestamp, so it's unclear which would "win".
+            // FIXME: currently, without full support for document paths,
+            // we only check if the paths' roots are the same.
+            if (!seen_column_names.insert(column_name).second) {
+                throw api_error::validation(
+                        format("Invalid UpdateExpression: two document paths overlap with each other: {} and {}.",
+                                column_name, column_name));
+            }
+            std::visit(overloaded_functor {
+                [&] (const parsed::update_expression::action::set& a) {
+                    auto value = calculate_value(a._rhs, previous_item.get());
+                    do_update(to_bytes(column_name), value);
+                },
+                [&] (const parsed::update_expression::action::remove& a) {
                    do_delete(to_bytes(column_name));
+                },
+                [&] (const parsed::update_expression::action::add& a) {
+                    parsed::value base;
+                    parsed::value addition;
+                    base.set_path(action._path);
+                    addition.set_constant(a._valref);
+                    rjson::value v1 = calculate_value(base, calculate_value_caller::UpdateExpression, previous_item.get());
+                    rjson::value v2 = calculate_value(addition, calculate_value_caller::UpdateExpression, previous_item.get());
+                    rjson::value result;
+                    std::string v1_type = get_item_type_string(v1);
+                    if (v1_type == "N") {
+                        if (get_item_type_string(v2) != "N") {
+                            throw api_error::validation(format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
+                        }
+                        result = number_add(v1, v2);
+                    } else if (v1_type == "SS" || v1_type == "NS" || v1_type == "BS") {
+                        if (get_item_type_string(v2) != v1_type) {
+                            throw api_error::validation(format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
+                        }
+                        result = set_sum(v1, v2);
+                    } else {
+                        throw api_error::validation(format("An operand in the update expression has an incorrect data type: {}", v1));
+                    }
+                    do_update(to_bytes(column_name), result);
+                },
+                [&] (const parsed::update_expression::action::del& a) {
+                    parsed::value base;
+                    parsed::value subset;
+                    base.set_path(action._path);
+                    subset.set_constant(a._valref);
+                    rjson::value v1 = calculate_value(base, calculate_value_caller::UpdateExpression, previous_item.get());
+                    rjson::value v2 = calculate_value(subset, calculate_value_caller::UpdateExpression, previous_item.get());
+                    if (!v1.IsNull()) {
+                        std::optional<rjson::value> result  = set_diff(v1, v2);
+                        if (result) {
+                            do_update(to_bytes(column_name), *result);
+                        } else {
+                            do_delete(to_bytes(column_name));
+                        }
+                    }
                }
-            } else {
-                // We have actions on a path or more than one path in the same
-                // top-level attribute column_name - but not on the top-level
-                // attribute as a whole. We already read the full top-level
-                // attribute (see check_needs_read_before_write()), and now we
-                // need to modify pieces of it and write back the entire
-                // top-level attribute.
-                if (!previous_item) {
-                    throw api_error::validation(format("UpdateItem cannot update nested document path on non-existent item"));
-                }
-                const rjson::value *toplevel = rjson::find(*previous_item, column_name);
-                if (!toplevel) {
-                    throw api_error::validation(format("UpdateItem cannot update document path: missing attribute {}",
-                        column_name));
-                }
-                rjson::value result = rjson::copy(*toplevel);
-                hierarchy_actions(result, actions.second, previous_item.get());
-                do_update(to_bytes(column_name), std::move(result), &actions.second);
-            }
+            }, action._action);
        }
    }
    if (_returnvalues == returnvalues::ALL_OLD && previous_item) {
@@ -2745,7 +2395,7 @@ static rjson::value describe_item(schema_ptr schema,
        const query::partition_slice& slice,
        const cql3::selection::selection& selection,
        const query::result& query_result,
-        const attrs_to_get& attrs_to_get) {
+        const std::unordered_set<std::string>& attrs_to_get) {
    std::optional<rjson::value> opt_item = executor::describe_single_item(std::move(schema), slice, selection, std::move(query_result), attrs_to_get);
    if (!opt_item) {
        // If there is no matching item, we're supposed to return an empty
@@ -2817,7 +2467,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
    struct table_requests {
        schema_ptr schema;
        db::consistency_level cl;
-        ::shared_ptr<const attrs_to_get> attrs_to_get;
+        std::unordered_set<std::string> attrs_to_get;
        struct single_request {
            partition_key pk;
            clustering_key ck;
@@ -2832,7 +2482,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
        tracing::add_table_name(trace_state, sstring(executor::KEYSPACE_NAME_PREFIX) + rs.schema->cf_name(), rs.schema->cf_name());
        rs.cl = get_read_consistency(it->value);
        std::unordered_set<std::string> used_attribute_names;
-        rs.attrs_to_get = ::make_shared<const attrs_to_get>(calculate_attrs_to_get(it->value, used_attribute_names));
+        rs.attrs_to_get = calculate_attrs_to_get(it->value, used_attribute_names);
        verify_all_are_used(request, "ExpressionAttributeNames", used_attribute_names, "GetItem");
        auto& keys = (it->value)["Keys"];
        for (const rjson::value& key : keys.GetArray()) {
@@ -2862,7 +2512,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
            future<std::tuple<std::string, std::optional<rjson::value>>> f = _proxy.query(rs.schema, std::move(command), std::move(partition_ranges), rs.cl,
                    service::storage_proxy::coordinator_query_options(executor::default_timeout(), permit, client_state, trace_state)).then(
                    [schema = rs.schema, partition_slice = std::move(partition_slice), selection = std::move(selection), attrs_to_get = rs.attrs_to_get] (service::storage_proxy::coordinator_query_result qr) mutable {
-                std::optional<rjson::value> json = describe_single_item(schema, partition_slice, *selection, *qr.query_result, *attrs_to_get);
+                std::optional<rjson::value> json = describe_single_item(schema, partition_slice, *selection, *qr.query_result, std::move(attrs_to_get));
                return make_ready_future<std::tuple<std::string, std::optional<rjson::value>>>(
                        std::make_tuple(schema->cf_name(), std::move(json)));
            });
@@ -3031,7 +2681,7 @@ void filter::for_filters_on(const noncopyable_function<void(std::string_view)>&
 class describe_items_visitor {
    typedef std::vector<const column_definition*> columns_t;
    const columns_t& _columns;
-    const attrs_to_get& _attrs_to_get;
+    const std::unordered_set<std::string>& _attrs_to_get;
    std::unordered_set<std::string> _extra_filter_attrs;
    const filter& _filter;
    typename columns_t::const_iterator _column_it;
@@ -3040,7 +2690,7 @@ class describe_items_visitor {
    size_t _scanned_count;

 public:
-    describe_items_visitor(const columns_t& columns, const attrs_to_get& attrs_to_get, filter& filter)
+    describe_items_visitor(const columns_t& columns, const std::unordered_set<std::string>& attrs_to_get, filter& filter)
            : _columns(columns)
            , _attrs_to_get(attrs_to_get)
            , _filter(filter)
@@ -3089,12 +2739,6 @@ public:
                    std::string attr_name = value_cast<sstring>(entry.first);
                    if (_attrs_to_get.empty() || _attrs_to_get.contains(attr_name) || _extra_filter_attrs.contains(attr_name)) {
                        bytes value = value_cast<bytes>(entry.second);
-                        // Even if _attrs_to_get asked to keep only a part of a
-                        // top-level attribute, we keep the entire attribute
-                        // at this stage, because the item filter might still
-                        // need the other parts (it was easier for us to keep
-                        // extra_filter_attrs at top-level granularity). We'll
-                        // filter the unneeded parts after item filtering.
                        rjson::set_with_string_name(_item, attr_name, deserialize_item(value));
                    }
                }
@@ -3105,24 +2749,11 @@ public:

    void end_row() {
        if (_filter.check(_item)) {
-            // As noted above, we kept entire top-level attributes listed in
-            // _attrs_to_get. We may need to only keep parts of them.
-            for (const auto& attr: _attrs_to_get) {
-                // If !attr.has_value() it means we were asked not to keep
-                // attr entirely, but just parts of it.
-                if (!attr.second.has_value()) {
-                    rjson::value* toplevel= rjson::find(_item, attr.first);
-                    if (toplevel && !hierarchy_filter(*toplevel, attr.second)) {
-                        rjson::remove_member(_item, attr.first);
-                    }
-                }
-            }
            // Remove the extra attributes _extra_filter_attrs which we had
            // to add just for the filter, and not requested to be returned:
            for (const auto& attr : _extra_filter_attrs) {
                rjson::remove_member(_item, attr);
            }
-
            rjson::push_back(_items, std::move(_item));
        }
        _item = rjson::empty_object();
@@ -3138,7 +2769,7 @@ public:
    }
 };

-static rjson::value describe_items(schema_ptr schema, const query::partition_slice& slice, const cql3::selection::selection& selection, std::unique_ptr<cql3::result_set> result_set, attrs_to_get&& attrs_to_get, filter&& filter) {
+static rjson::value describe_items(schema_ptr schema, const query::partition_slice& slice, const cql3::selection::selection& selection, std::unique_ptr<cql3::result_set> result_set, std::unordered_set<std::string>&& attrs_to_get, filter&& filter) {
    describe_items_visitor visitor(selection.get_columns(), attrs_to_get, filter);
    result_set->visit(visitor);
    auto scanned_count = visitor.get_scanned_count();
@@ -3157,7 +2788,7 @@ static rjson::value encode_paging_state(const schema& schema, const service::pag
    for (const column_definition& cdef : schema.partition_key_columns()) {
        rjson::set_with_string_name(last_evaluated_key, std::string_view(cdef.name_as_text()), rjson::empty_object());
        rjson::value& key_entry = last_evaluated_key[cdef.name_as_text()];
-        rjson::set_with_string_name(key_entry, type_to_string(cdef.type), json_key_column_value(*exploded_pk_it, cdef));
+        rjson::set_with_string_name(key_entry, type_to_string(cdef.type), rjson::parse(to_json_string(*cdef.type, *exploded_pk_it)));
        ++exploded_pk_it;
    }
    auto ck = paging_state.get_clustering_key();
@@ -3167,7 +2798,7 @@ static rjson::value encode_paging_state(const schema& schema, const service::pag
        for (const column_definition& cdef : schema.clustering_key_columns()) {
            rjson::set_with_string_name(last_evaluated_key, std::string_view(cdef.name_as_text()), rjson::empty_object());
            rjson::value& key_entry = last_evaluated_key[cdef.name_as_text()];
-            rjson::set_with_string_name(key_entry, type_to_string(cdef.type), json_key_column_value(*exploded_ck_it, cdef));
+            rjson::set_with_string_name(key_entry, type_to_string(cdef.type), rjson::parse(to_json_string(*cdef.type, *exploded_ck_it)));
            ++exploded_ck_it;
        }
    }
@@ -3179,7 +2810,7 @@ static future<executor::request_return_type> do_query(service::storage_proxy& pr
        const rjson::value* exclusive_start_key,
        dht::partition_range_vector&& partition_ranges,
        std::vector<query::clustering_range>&& ck_bounds,
-        attrs_to_get&& attrs_to_get,
+        std::unordered_set<std::string>&& attrs_to_get,
        uint32_t limit,
        db::consistency_level cl,
        filter&& filter,
@@ -3219,7 +2850,7 @@ static future<executor::request_return_type> do_query(service::storage_proxy& pr
    auto p = service::pager::query_pagers::pager(schema, selection, *query_state_ptr, *query_options, command, std::move(partition_ranges), nullptr);

    return p->fetch_page(limit, gc_clock::now(), executor::default_timeout()).then(
-            [p = std::move(p), schema, cql_stats, partition_slice = std::move(partition_slice),
+            [p, schema, cql_stats, partition_slice = std::move(partition_slice),
             selection = std::move(selection), query_state_ptr = std::move(query_state_ptr),
             attrs_to_get = std::move(attrs_to_get),
             query_options = std::move(query_options),
@@ -3905,10 +3536,26 @@ future<> executor::create_keyspace(std::string_view keyspace_name) {
        }
        auto opts = get_network_topology_options(rf);
        auto ksm = keyspace_metadata::new_keyspace(keyspace_name_str, "org.apache.cassandra.locator.NetworkTopologyStrategy", std::move(opts), true);
-        return _mm.announce_new_keyspace(ksm, api::new_timestamp());
+        return _mm.announce_new_keyspace(ksm, api::new_timestamp(), false);
    });
 }

+static tracing::trace_state_ptr create_tracing_session() {
+    tracing::trace_state_props_set props;
+    props.set<tracing::trace_state_props::full_tracing>();
+    return tracing::tracing::get_local_tracing_instance().create_session(tracing::trace_type::QUERY, props);
+}
+
+tracing::trace_state_ptr executor::maybe_trace_query(client_state& client_state, sstring_view op, sstring_view query) {
+    tracing::trace_state_ptr trace_state;
+    if (tracing::tracing::get_local_tracing_instance().trace_next_query()) {
+        trace_state = create_tracing_session();
+        tracing::add_query(trace_state, query);
+        tracing::begin(trace_state, format("Alternator {}", op), client_state.get_client_address());
+    }
+    return trace_state;
+}
+
 future<> executor::start() {
    // Currently, nothing to do on initialization. We delay the keyspace
    // creation (create_keyspace()) until a table is actually created.
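Note on the overlap check in update_item_operation::apply() above: it compares only the root of each document path, and it relies on std::unordered_set::insert() reporting through the bool in its return value whether the name was already present. A minimal standalone sketch of that idiom follows. The helper name first_overlapping_root is hypothetical and is not part of this patch.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not in the patch): given the root attribute name of
// each action in an update expression, return the first root that appears
// more than once, or std::nullopt if all roots are distinct.
std::optional<std::string> first_overlapping_root(const std::vector<std::string>& roots) {
    std::unordered_set<std::string> seen;
    for (const auto& root : roots) {
        // insert() returns {iterator, bool}; the bool is false when the
        // element was already in the set, i.e. the root was seen before.
        if (!seen.insert(root).second) {
            return root;
        }
    }
    return std::nullopt;
}
```

In the patch itself a duplicate root triggers api_error::validation rather than a return value; the sketch returns the offending name so that the check can be exercised in isolation.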
--- a/alternator/executor.hh
+++ b/alternator/executor.hh
@@ -53,10 +53,6 @@ namespace service {
    class storage_service;
 }

-namespace cdc {
-    class metadata;
-}
-
 namespace alternator {

 class rmw_operation;
@@ -74,77 +70,11 @@ public:
    std::string to_json() const override;
 };

-namespace parsed {
-class path;
-};
-
-// An attribute_path_map object is used to hold data for various attribute
-// paths (parsed::path) in a hierarchy of attribute paths. Each attribute path
-// has a root attribute, which is then modified by member and index operators -
-// for example in "a.b[2].c" we have "a" as the root, then ".b" member, then
-// "[2]" index, and finally ".c" member.
-// Data can be added to an attribute_path_map using the add() function, but
-// requires that attributes with data not be *overlapping* or *conflicting*:
-//
-// 1. Two attribute paths which are identical or an ancestor of one another
-//    are considered *overlapping* and not allowed. If a.b.c has data,
-//    we can't add more data in a.b.c or any of its descendants like a.b.c.d.
-//
-// 2. Two attribute paths which need the same parent to have both a member and
-//    an index are considered *conflicting* and not allowed. E.g., if a.b has
-//    data, you can't add a[1]. The meaning of adding both would be that the
-//    attribute a is both a map and an array, which isn't sensible.
-//
-// These two requirements are common to the two places where Alternator uses
-// this abstraction to describe how a hierarchical item is to be transformed:
-//
-// 1. In ProjectionExpression: for filtering from a full top-level attribute
-//    only the parts the user asked for in ProjectionExpression.
-//
-// 2. In UpdateExpression: for taking the previous value of a top-level
-//    attribute, and modifying it based on the instructions the user
-//    wrote in UpdateExpression.
-
-template<typename T>
-class attribute_path_map_node {
-public:
-    using data_t = T;
-    // We need the extra unique_ptr<> here because libstdc++ unordered_map
-    // doesn't work with incomplete types :-(
-    using members_t =  std::unordered_map<std::string, std::unique_ptr<attribute_path_map_node<T>>>;
-    // The indexes list is sorted because DynamoDB requires handling writes
-    // beyond the end of a list in index order.
-    using indexes_t = std::map<unsigned, std::unique_ptr<attribute_path_map_node<T>>>;
-    // The prohibition on "overlap" and "conflict" explained above means
-    // that only one of data, members or indexes is non-empty.
-    std::optional<std::variant<data_t, members_t, indexes_t>> _content;
-
-    bool is_empty() const { return !_content; }
-    bool has_value() const { return _content && std::holds_alternative<data_t>(*_content); }
-    bool has_members() const { return _content && std::holds_alternative<members_t>(*_content); }
-    bool has_indexes() const { return _content && std::holds_alternative<indexes_t>(*_content); }
-    // get_members() assumes that has_members() is true
-    members_t& get_members() { return std::get<members_t>(*_content); }
-    const members_t& get_members() const { return std::get<members_t>(*_content); }
-    indexes_t& get_indexes() { return std::get<indexes_t>(*_content); }
-    const indexes_t& get_indexes() const { return std::get<indexes_t>(*_content); }
-    T& get_value() { return std::get<T>(*_content); }
-    const T& get_value() const { return std::get<T>(*_content); }
-};
-
-template<typename T>
-using attribute_path_map = std::unordered_map<std::string, attribute_path_map_node<T>>;
-
-using attrs_to_get_node = attribute_path_map_node<std::monostate>;
-using attrs_to_get = attribute_path_map<std::monostate>;
-
-
 class executor : public peering_sharded_service<executor> {
    service::storage_proxy& _proxy;
    service::migration_manager& _mm;
    db::system_distributed_keyspace& _sdks;
    service::storage_service& _ss;
-    cdc::metadata& _cdc_metadata;
    // An smp_service_group to be used for limiting the concurrency when
     // forwarding Alternator requests between shards - if necessary for LWT.
    smp_service_group _ssg;
@@ -157,8 +87,8 @@ public:
    static constexpr auto KEYSPACE_NAME_PREFIX = "alternator_";
    static constexpr std::string_view INTERNAL_TABLE_PREFIX = ".scylla.alternator.";

-    executor(service::storage_proxy& proxy, service::migration_manager& mm, db::system_distributed_keyspace& sdks, service::storage_service& ss, cdc::metadata& cdc_metadata, smp_service_group ssg)
-        : _proxy(proxy), _mm(mm), _sdks(sdks), _ss(ss), _cdc_metadata(cdc_metadata), _ssg(ssg) {}
+    executor(service::storage_proxy& proxy, service::migration_manager& mm, db::system_distributed_keyspace& sdks, service::storage_service& ss, smp_service_group ssg)
+        : _proxy(proxy), _mm(mm), _sdks(sdks), _ss(ss), _ssg(ssg) {}

    future<request_return_type> create_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
    future<request_return_type> describe_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
@@ -187,12 +117,10 @@ public:

    future<> create_keyspace(std::string_view keyspace_name);

+    static tracing::trace_state_ptr maybe_trace_query(client_state& client_state, sstring_view op, sstring_view query);
+
    static sstring table_name(const schema&);
    static db::timeout_clock::time_point default_timeout();
-    static void set_default_timeout(db::timeout_clock::duration timeout);
-private:
-    static db::timeout_clock::duration s_default_timeout;
-public:
    static schema_ptr find_table(service::storage_proxy&, const rjson::value& request);

 private:
@@ -208,14 +136,16 @@ public:
        const query::partition_slice&,
        const cql3::selection::selection&,
        const query::result&,
-        const attrs_to_get&);
+        const std::unordered_set<std::string>&);

    static void describe_single_item(const cql3::selection::selection&,
        const std::vector<bytes_opt>&,
-        const attrs_to_get&,
+        const std::unordered_set<std::string>&,
        rjson::value&,
        bool = false);

+
+
    void add_stream_options(const rjson::value& stream_spec, schema_builder&) const;
    void supplement_table_info(rjson::value& descr, const schema& schema) const;
    void supplement_table_stream_info(rjson::value& descr, const schema& schema) const;
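Note on the attrs_to_get type change above: with a flat std::unordered_set<std::string>, selection happens at top-level attribute granularity only, and an empty set means "return every attribute". The sketch below mirrors the condition used in describe_items_visitor in executor.cc. The names attr_set and keep_attribute are hypothetical, and count() is used in place of contains() only so that the sketch also compiles before C++20.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

using attr_set = std::unordered_set<std::string>;

// Hypothetical helper (not in the patch): decide whether a top-level
// attribute should be kept while reading an item.
//  - attrs_to_get: attributes the request asked for; empty means "all".
//  - extra_filter_attrs: attributes needed only to evaluate the filter;
//    they are kept here and would be removed again after filtering.
bool keep_attribute(const attr_set& attrs_to_get,
                    const attr_set& extra_filter_attrs,
                    const std::string& name) {
    return attrs_to_get.empty()
        || attrs_to_get.count(name) > 0
        || extra_filter_attrs.count(name) > 0;
}
```

Since only whole top-level attributes are named in the set, a request for a nested path such as a.b cannot be represented; that matches the removal of the hierarchy_filter() post-processing step in this diff.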
--- a/alternator/expressions.cc
+++ b/alternator/expressions.cc
@@ -130,27 +130,6 @@ void condition_expression::append(condition_expression&& a, char op) {
    }, _expression);
 }

-void path::check_depth_limit() {
-    if (1 + _operators.size() > depth_limit) {
-        throw expressions_syntax_error(format("Document path exceeded {} nesting levels", depth_limit));
-    }
-}
-
-std::ostream& operator<<(std::ostream& os, const path& p) {
-    os << p.root();
-    for (const auto& op : p.operators()) {
-        std::visit(overloaded_functor {
-            [&] (const std::string& member) {
-                os << '.' << member;
-            },
-            [&] (unsigned index) {
-                os << '[' << index << ']';
-            }
-        }, op);
-    }
-    return os;
-}
-
 } // namespace parsed

 // The following resolve_*() functions resolve references in parsed
@@ -172,9 +151,10 @@ std::ostream& operator<<(std::ostream& os, const path& p) {
 // we need to resolve the expression just once but then use it many times
 // (once for each item to be filtered).

-static std::optional<std::string> resolve_path_component(const std::string& column_name,
+static void resolve_path(parsed::path& p,
        const rjson::value* expression_attribute_names,
        std::unordered_set<std::string>& used_attribute_names) {
+    const std::string& column_name = p.root();
    if (column_name.size() > 0 && column_name.front() == '#') {
        if (!expression_attribute_names) {
            throw api_error::validation(
@@ -186,30 +166,7 @@ static std::optional<std::string> resolve_path_component(const std::string& colu
                    format("ExpressionAttributeNames missing entry '{}' required by expression", column_name));
        }
        used_attribute_names.emplace(column_name);
-        return std::string(rjson::to_string_view(*value));
-    }
-    return std::nullopt;
-}
-
-static void resolve_path(parsed::path& p,
-        const rjson::value* expression_attribute_names,
-        std::unordered_set<std::string>& used_attribute_names) {
-    std::optional<std::string> r = resolve_path_component(p.root(), expression_attribute_names, used_attribute_names);
-    if (r) {
-        p.set_root(std::move(*r));
-    }
-    for (auto& op : p.operators()) {
-        std::visit(overloaded_functor {
-            [&] (std::string& s) {
-                r = resolve_path_component(s, expression_attribute_names, used_attribute_names);
-                if (r) {
-                    s = std::move(*r);
-                }
-            },
-            [&] (unsigned index) {
-                // nothing to resolve
-            }
-        }, op);
+        p.set_root(std::string(rjson::to_string_view(*value)));
    }
 }

@@ -646,8 +603,52 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {
            }
            rjson::value v1 = calculate_value(f._parameters[0], caller, previous_item);
            rjson::value v2 = calculate_value(f._parameters[1], caller, previous_item);
-            return to_bool_json(check_BEGINS_WITH(v1.IsNull() ? nullptr : &v1,  v2,
-                                    f._parameters[0].is_constant(), f._parameters[1].is_constant()));
+            // TODO: There's duplication here with check_BEGINS_WITH().
+            // But unfortunately, the two functions differ a bit.
+
+            // If one of v1 or v2 is malformed or has an unsupported type
+            // (not B or S), what we do depends on whether it came from
+            // the user's query (is_constant()), or the item. Unsupported
+            // values in the query result in an error, but if they are in
+            // the item, we silently return false (no match).
+            bool bad = false;
+            if (!v1.IsObject() || v1.MemberCount() != 1) {
+                bad = true;
+                if (f._parameters[0].is_constant()) {
+                    throw api_error::validation(format("{}: begins_with() encountered malformed AttributeValue: {}", caller, v1));
+                }
+            } else if (v1.MemberBegin()->name != "S" && v1.MemberBegin()->name != "B") {
+                bad = true;
+                if (f._parameters[0].is_constant()) {
+                    throw api_error::validation(format("{}: begins_with() supports only string or binary in AttributeValue: {}", caller, v1));
+                }
+            }
+            if (!v2.IsObject() || v2.MemberCount() != 1) {
+                bad = true;
+                if (f._parameters[1].is_constant()) {
+                    throw api_error::validation(format("{}: begins_with() encountered malformed AttributeValue: {}", caller, v2));
+                }
+            } else if (v2.MemberBegin()->name != "S" && v2.MemberBegin()->name != "B") {
+                bad = true;
+                if (f._parameters[1].is_constant()) {
+                    throw api_error::validation(format("{}: begins_with() supports only string or binary in AttributeValue: {}", caller, v2));
+                }
+            }
+            bool ret = false;
+            if (!bad) {
+                auto it1 = v1.MemberBegin();
+                auto it2 = v2.MemberBegin();
+                if (it1->name == it2->name) {
+                    if (it2->name == "S") {
+                        std::string_view val1 = rjson::to_string_view(it1->value);
+                        std::string_view val2 = rjson::to_string_view(it2->value);
+                        ret = val1.starts_with(val2);
+                    } else /* it2->name == "B" */ {
+                        ret = base64_begins_with(rjson::to_string_view(it1->value), rjson::to_string_view(it2->value));
+                    }
+                }
+            }
+            return to_bool_json(ret);
        }
    },
    {"contains", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
@@ -666,55 +667,6 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {
    },
 };

-// Given a parsed::path and an item read from the table, extract the value
-// of a certain attribute path, such as "a" or "a.b.c[3]". Returns a null
-// value if the item or the requested attribute does not exist.
-// Note that the item is assumed to be encoded in JSON using DynamoDB
-// conventions - each level of a nested document is a map with one key -
-// a type (e.g., "M" for map) - and its value is the representation of
-// that value.
-static rjson::value extract_path(const rjson::value* item,
-        const parsed::path& p, calculate_value_caller caller) {
-    if (!item) {
-        return rjson::null_value();
-    }
-    const rjson::value* v = rjson::find(*item, p.root());
-    if (!v) {
-        return rjson::null_value();
-    }
-    for (const auto& op : p.operators()) {
-        if (!v->IsObject() || v->MemberCount() != 1) {
-            // This shouldn't happen. We shouldn't have stored malformed
-            // objects. But today Alternator does not validate the structure
-            // of nested documents before storing them, so this can happen on
-            // read.
-            throw api_error::validation(format("{}: malformed item read: {}", *item));
-        }
-        const char* type = v->MemberBegin()->name.GetString();
-        v = &(v->MemberBegin()->value);
-        std::visit(overloaded_functor {
-            [&] (const std::string& member) {
-                if (type[0] == 'M' && v->IsObject()) {
-                    v = rjson::find(*v, member);
-                } else {
-                    v = nullptr;
-                }
-            },
-            [&] (unsigned index) {
-                if (type[0] == 'L' && v->IsArray() && index < v->Size()) {
-                    v = &(v->GetArray()[index]);
-                } else {
-                    v = nullptr;
-                }
-            }
-        }, op);
-        if (!v) {
-            return rjson::null_value();
-        }
-    }
-    return rjson::copy(*v);
-}
-
 // Given a parsed::value, which can refer either to a constant value from
 // ExpressionAttributeValues, to the value of some attribute, or to a function
 // of other values, this function calculates the resulting value.
@@ -732,12 +684,21 @@ rjson::value calculate_value(const parsed::value& v,
            auto function_it = function_handlers.find(std::string_view(f._function_name));
            if (function_it == function_handlers.end()) {
                throw api_error::validation(
-                        format("{}: unknown function '{}' called.", caller, f._function_name));
+                        format("UpdateExpression: unknown function '{}' called.", f._function_name));
            }
            return function_it->second(caller, previous_item, f);
        },
        [&] (const parsed::path& p) -> rjson::value {
-            return extract_path(previous_item, p, caller);
+            if (!previous_item) {
+                return rjson::null_value();
+            }
+            std::string update_path = p.root();
+            if (p.has_operators()) {
+                // FIXME: support this
+                throw api_error::validation("Reading attribute paths not yet implemented");
+            }
+            const rjson::value* previous_value = rjson::find(*previous_item, update_path);
+            return previous_value ? rjson::copy(*previous_value) : rjson::null_value();
        }
    }, v._value);
 }
--- a/alternator/expressions_types.hh
+++ b/alternator/expressions_types.hh
@@ -49,23 +49,15 @@ class path {
    // dot (e.g., ".xyz").
    std::string _root;
    std::vector<std::variant<std::string, unsigned>> _operators;
-    // It is useful to limit the depth of a user-specified path, because is
-    // allows us to use recursive algorithms without worrying about recursion
-    // depth. DynamoDB officially limits the length of paths to 32 components
-    // (including the root) so let's use the same limit.
-    static constexpr unsigned depth_limit = 32;
-    void check_depth_limit();
 public:
    void set_root(std::string root) {
        _root = std::move(root);
    }
    void add_index(unsigned i) {
        _operators.emplace_back(i);
-        check_depth_limit();
    }
    void add_dot(std::string(name)) {
        _operators.emplace_back(std::move(name));
-        check_depth_limit();
    }
    const std::string& root() const {
        return _root;
@@ -73,13 +65,6 @@ public:
    bool has_operators() const {
        return !_operators.empty();
    }
-    const std::vector<std::variant<std::string, unsigned>>& operators() const {
-        return _operators;
-    }
-    std::vector<std::variant<std::string, unsigned>>& operators() {
-        return _operators;
-    }
-    friend std::ostream& operator<<(std::ostream&, const path&);
 };

 // When an expression is first parsed, all constants are references, like
--- a/alternator/server.cc
+++ b/alternator/server.cc
@@ -22,8 +22,6 @@
 #include "alternator/server.hh"
 #include "log.hh"
 #include <seastar/http/function_handlers.hh>
-#include <seastar/http/short_streams.hh>
-#include <seastar/core/coroutine.hh>
 #include <seastar/json/json_elements.hh>
 #include "seastarx.hh"
 #include "error.hh"
@@ -61,40 +59,6 @@ inline std::vector<std::string_view> split(std::string_view text, char separator
    return tokens;
 }

-// Handle CORS (Cross-origin resource sharing) in the HTTP request:
-// If the request has the "Origin" header specifying where the script which
-// makes this request comes from, we need to reply with the header
-// "Access-Control-Allow-Origin: *" saying that this (and any) origin is fine.
-// Additionally, if preflight==true (i.e., this is an OPTIONS request),
-// the script can also "request" in headers that the server allows it to use
-// some HTTP methods and headers in the followup request, and the server
-// should respond by "allowing" them in the response headers.
-// We also add the header "Access-Control-Expose-Headers" to let the script
-// access additional headers in the response.
-// This handle_CORS() should be used when handling any HTTP method - both the
-// usual GET and POST, and also the "preflight" OPTIONS method.
-static void handle_CORS(const request& req, reply& rep, bool preflight) {
-    if (!req.get_header("origin").empty()) {
-        rep.add_header("Access-Control-Allow-Origin", "*");
-        // This is the list that DynamoDB returns for expose headers. I am
-        // not sure why not just return "*" here, what's the risk?
-        rep.add_header("Access-Control-Expose-Headers", "x-amzn-RequestId,x-amzn-ErrorType,x-amzn-ErrorMessage,Date");
-        if (preflight) {
-            sstring s = req.get_header("Access-Control-Request-Headers");
-            if (!s.empty()) {
-                rep.add_header("Access-Control-Allow-Headers", std::move(s));
-            }
-            s = req.get_header("Access-Control-Request-Method");
-            if (!s.empty()) {
-                rep.add_header("Access-Control-Allow-Methods", std::move(s));
-            }
-            // Our CORS response never change anyway, let the browser cache it
-            // for two hours (Chrome's maximum):
-            rep.add_header("Access-Control-Max-Age", "7200");
-        }
-    }
-}
-
 // DynamoDB HTTP error responses are structured as follows
 // https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html
 // Our handlers throw an exception to report an error. If the exception
@@ -129,10 +93,6 @@ public:
                 [&] (const json::json_return_type& json_return_value) {
                     slogger.trace("api_handler success case");
                     if (json_return_value._body_writer) {
-                         // Unfortunately, write_body() forces us to choose
-                         // from a fixed and irrelevant list of "mime-types"
-                         // at this point. But we'll override it with the
-                         // one (application/x-amz-json-1.0) below.
                         rep->write_body("json", std::move(json_return_value._body_writer));
                     } else {
                         rep->_content += json_return_value._res;
@@ -145,16 +105,14 @@ public:

             return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
         });
-    }) { }
+    }), _type("json") { }

    api_handler(const api_handler&) = default;
    future<std::unique_ptr<reply>> handle(const sstring& path,
            std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {
-        handle_CORS(*req, *rep, false);
        return _f_handle(std::move(req), std::move(rep)).then(
                [this](std::unique_ptr<reply> rep) {
-                    rep->set_mime_type("application/x-amz-json-1.0");
-                    rep->done();
+                    rep->done(_type);
                    return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
                });
    }
@@ -168,6 +126,7 @@ protected:
    }

    future_handler_function _f_handle;
+    sstring _type;
 };

 class gated_handler : public handler_base {
@@ -187,7 +146,6 @@ public:
    health_handler(seastar::gate& pending_requests) : gated_handler(pending_requests) {}
 protected:
    virtual future<std::unique_ptr<reply>> do_handle(const sstring& path, std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {
-        handle_CORS(*req, *rep, false);
        rep->set_status(reply::status_type::ok);
        rep->write_body("txt", format("healthy: {}", req->get_header("Host")));
        return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
@@ -220,23 +178,7 @@ protected:
    }
 };

-// The CORS (Cross-origin resource sharing) protocol can send an OPTIONS
-// request before ("pre-flight") the main request. The response to this
-// request can be empty, but needs to have the right headers (which we
-// fill with handle_CORS())
-class options_handler : public gated_handler {
-public:
-    options_handler(seastar::gate& pending_requests) : gated_handler(pending_requests) {}
-protected:
-    virtual future<std::unique_ptr<reply>> do_handle(const sstring& path, std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {
-        handle_CORS(*req, *rep, true);
-        rep->set_status(reply::status_type::ok);
-        rep->write_body("txt", sstring(""));
-        return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
-    }
-};
-
-future<> server::verify_signature(const request& req, const chunked_content& content) {
+future<> server::verify_signature(const request& req) {
    if (!_enforce_authorization) {
        slogger.debug("Skipping authorization");
        return make_ready_future<>();
@@ -247,34 +189,27 @@ future<> server::verify_signature(const request& req, const chunked_content& con
    }
    auto authorization_it = req._headers.find("Authorization");
    if (authorization_it == req._headers.end()) {
-        throw api_error::missing_authentication_token("Authorization header is mandatory for signature verification");
+        throw api_error::invalid_signature("Authorization header is mandatory for signature verification");
    }
    std::string host = host_it->second;
-    std::string_view authorization_header = authorization_it->second;
-    auto pos = authorization_header.find_first_of(' ');
-    if (pos == std::string_view::npos || authorization_header.substr(0, pos) != "AWS4-HMAC-SHA256") {
-        throw api_error::invalid_signature(format("Authorization header must use AWS4-HMAC-SHA256 algorithm: {}", authorization_header));
-    }
-    authorization_header.remove_prefix(pos+1);
+    std::vector<std::string_view> credentials_raw = split(authorization_it->second, ' ');
    std::string credential;
    std::string user_signature;
    std::string signed_headers_str;
    std::vector<std::string_view> signed_headers;
-    do {
-        // Either one of a comma or space can mark the end of an entry
-        pos = authorization_header.find_first_of(" ,");
-        std::string_view entry = authorization_header.substr(0, pos);
-        if (pos != std::string_view::npos) {
-            authorization_header.remove_prefix(pos + 1);
-        }
-        if (entry.empty()) {
-            continue;
-        }
+    for (std::string_view entry : credentials_raw) {
        std::vector<std::string_view> entry_split = split(entry, '=');
        if (entry_split.size() != 2) {
+            if (entry != "AWS4-HMAC-SHA256") {
+                throw api_error::invalid_signature(format("Only AWS4-HMAC-SHA256 algorithm is supported. Found: {}", entry));
+            }
            continue;
        }
        std::string_view auth_value = entry_split[1];
+        // Commas appear as an additional (quite redundant) delimiter
+        if (auth_value.back() == ',') {
+            auth_value.remove_suffix(1);
+        }
        if (entry_split[0] == "Credential") {
            credential = std::string(auth_value);
        } else if (entry_split[0] == "Signature") {
@@ -284,8 +219,7 @@ future<> server::verify_signature(const request& req, const chunked_content& con
            signed_headers = split(auth_value, ';');
            std::sort(signed_headers.begin(), signed_headers.end());
        }
-    } while (pos != std::string_view::npos);
-
+    }
    std::vector<std::string_view> credential_split = split(credential, '/');
    if (credential_split.size() != 5) {
        throw api_error::validation(format("Incorrect credential information format: {}", credential));
@@ -312,7 +246,7 @@ future<> server::verify_signature(const request& req, const chunked_content& con
    auto cache_getter = [&qp = _qp] (std::string username) {
        return get_key_from_roles(qp, std::move(username));
    };
-    return _key_cache.get_ptr(user, cache_getter).then([this, &req, &content,
+    return _key_cache.get_ptr(user, cache_getter).then([this, &req,
                                                    user = std::move(user),
                                                    host = std::move(host),
                                                    datestamp = std::move(datestamp),
@@ -322,7 +256,7 @@ future<> server::verify_signature(const request& req, const chunked_content& con
                                                    service = std::move(service),
                                                    user_signature = std::move(user_signature)] (key_cache::value_ptr key_ptr) {
        std::string signature = get_signature(user, *key_ptr, std::string_view(host), req._method,
-                datestamp, signed_headers_str, signed_headers_map, content, region, service, "");
+                datestamp, signed_headers_str, signed_headers_map, req.content, region, service, "");

        if (signature != std::string_view(user_signature)) {
            _key_cache.remove(user);
@@ -331,91 +265,43 @@ future<> server::verify_signature(const request& req, const chunked_content& con
    });
 }

-static tracing::trace_state_ptr create_tracing_session(tracing::tracing& tracing_instance) {
-    tracing::trace_state_props_set props;
-    props.set<tracing::trace_state_props::full_tracing>();
-    props.set_if<tracing::trace_state_props::log_slow_query>(tracing_instance.slow_query_tracing_enabled());
-    return tracing_instance.create_session(tracing::trace_type::QUERY, props);
-}
-
-// truncated_content_view() prints a potentially long chunked_content for
-// debugging purposes. In the common case when the content is not excessively
-// long, it just returns a view into the given content, without any copying.
-// But when the content is very long, it is truncated after some arbitrary
-// max_len (or one chunk, whichever comes first), with "<truncated>" added at
-// the end. To do this modification to the string, we need to create a new
-// std::string, so the caller must pass us a reference to one, "buf", where
-// we can store the content. The returned view is only alive for as long this
-// buf is kept alive.
-static std::string_view truncated_content_view(const chunked_content& content, std::string& buf) {
-    constexpr size_t max_len = 1024;
-    if (content.empty()) {
-        return std::string_view();
-    } else if (content.size() == 1 && content.begin()->size() <= max_len) {
-        return std::string_view(content.begin()->get(), content.begin()->size());
-    } else {
-        buf = std::string(content.begin()->get(), std::min(content.begin()->size(), max_len)) + "<truncated>";
-        return std::string_view(buf);
-    }
-}
-
-static tracing::trace_state_ptr maybe_trace_query(service::client_state& client_state, sstring_view op, const chunked_content& query) {
-    tracing::trace_state_ptr trace_state;
-    tracing::tracing& tracing_instance = tracing::tracing::get_local_tracing_instance();
-    if (tracing_instance.trace_next_query() || tracing_instance.slow_query_tracing_enabled()) {
-        trace_state = create_tracing_session(tracing_instance);
-        std::string buf;
-        tracing::add_session_param(trace_state, "alternator_op", op);
-        tracing::add_query(trace_state, truncated_content_view(query, buf));
-        tracing::begin(trace_state, format("Alternator {}", op), client_state.get_client_address());
-    }
-    return trace_state;
-}
-
-future<executor::request_return_type> server::handle_api_request(std::unique_ptr<request> req) {
+future<executor::request_return_type> server::handle_api_request(std::unique_ptr<request>&& req) {
    _executor._stats.total_operations++;
    sstring target = req->get_header(TARGET);
    std::vector<std::string_view> split_target = split(target, '.');
    //NOTICE(sarna): Target consists of Dynamo API version followed by a dot '.' and operation type (e.g. CreateTable)
    std::string op = split_target.empty() ? std::string() : std::string(split_target.back());
-    // JSON parsing can allocate up to roughly 2x the size of the raw
-    // document, + a couple of bytes for maintenance.
-    // TODO: consider the case where req->content_length is missing. Maybe
-    // we need to take the content_length_limit and return some of the units
-    // when we finish read_content_and_verify_signature?
-    size_t mem_estimate = req->content_length * 2 + 8000;
-    auto units_fut = get_units(*_memory_limiter, mem_estimate);
-    if (_memory_limiter->waiters()) {
-        ++_executor._stats.requests_blocked_memory;
-    }
-    auto units = co_await std::move(units_fut);
-    assert(req->content_stream);
-    chunked_content content = co_await httpd::read_entire_stream(*req->content_stream);
-    co_await verify_signature(*req, content);
-
-    if (slogger.is_enabled(log_level::trace)) {
-        std::string buf;
-        slogger.trace("Request: {} {} {}", op, truncated_content_view(content, buf), req->_headers);
-    }
-    auto callback_it = _callbacks.find(op);
-    if (callback_it == _callbacks.end()) {
-        _executor._stats.unsupported_operations++;
-        co_return api_error::unknown_operation(format("Unsupported operation {}", op));
-    }
-    if (_pending_requests.get_count() >= _max_concurrent_requests) {
-        _executor._stats.requests_shed++;
-        co_return api_error::request_limit_exceeded(format("too many in-flight requests (configured via max_concurrent_requests_per_shard): {}", _pending_requests.get_count()));
-    }
-    _pending_requests.enter();
-    auto leave = defer([this] { _pending_requests.leave(); });
-    //FIXME: Client state can provide more context, e.g. client's endpoint address
-    // We use unique_ptr because client_state cannot be moved or copied
-    executor::client_state client_state{executor::client_state::internal_tag()};
-    tracing::trace_state_ptr trace_state = maybe_trace_query(client_state, op, content);
-    tracing::trace(trace_state, op);
-    rjson::value json_request = co_await _json_parser.parse(std::move(content));
-    co_return co_await callback_it->second(_executor, client_state, trace_state,
-            make_service_permit(std::move(units)), std::move(json_request), std::move(req));
+    slogger.trace("Request: {} {} {}", op, req->content, req->_headers);
+    return verify_signature(*req).then([this, op, req = std::move(req)] () mutable {
+        auto callback_it = _callbacks.find(op);
+        if (callback_it == _callbacks.end()) {
+            _executor._stats.unsupported_operations++;
+            throw api_error::unknown_operation(format("Unsupported operation {}", op));
+        }
+        return with_gate(_pending_requests, [this, callback_it = std::move(callback_it), op = std::move(op), req = std::move(req)] () mutable {
+            //FIXME: Client state can provide more context, e.g. client's endpoint address
+            // We use unique_ptr because client_state cannot be moved or copied
+            return do_with(std::make_unique<executor::client_state>(executor::client_state::internal_tag()),
+                    [this, callback_it = std::move(callback_it), op = std::move(op), req = std::move(req)] (std::unique_ptr<executor::client_state>& client_state) mutable {
+                tracing::trace_state_ptr trace_state = executor::maybe_trace_query(*client_state, op, req->content);
+                tracing::trace(trace_state, op);
+                // JSON parsing can allocate up to roughly 2x the size of the raw document, + a couple of bytes for maintenance.
+                // FIXME: by this time, the whole HTTP request was already read, so some memory is already occupied.
+                // Once HTTP allows working on streams, we should grab the permit *before* reading the HTTP payload.
+                size_t mem_estimate = req->content.size() * 3 + 8000;
+                auto units_fut = get_units(*_memory_limiter, mem_estimate);
+                if (_memory_limiter->waiters()) {
+                    ++_executor._stats.requests_blocked_memory;
+                }
+                return units_fut.then([this, callback_it = std::move(callback_it), &client_state, trace_state, req = std::move(req)] (semaphore_units<> units) mutable {
+                    return _json_parser.parse(req->content).then([this, callback_it = std::move(callback_it), &client_state, trace_state,
+                            units = std::move(units), req = std::move(req)] (rjson::value json_request) mutable {
+                        return callback_it->second(_executor, *client_state, trace_state, make_service_permit(std::move(units)), std::move(json_request), std::move(req)).finally([trace_state] {});
+                    });
+                });
+            });
+        });
+    });
 }

 void server::set_routes(routes& r) {
@@ -437,7 +323,6 @@ void server::set_routes(routes& r) {
    // scan an entire subnet for nodes responding to the health request,
    // or even just scan for open ports.
    r.put(operation_type::GET, "/localnodes", new local_nodelist_handler(_pending_requests));
-    r.put(operation_type::OPTIONS, "/", new options_handler(_pending_requests));
 }

 //FIXME: A way to immediately invalidate the cache should be considered,
@@ -520,10 +405,9 @@ server::server(executor& exec, cql3::query_processor& qp)
 }

 future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
-        bool enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests) {
+        bool enforce_authorization, semaphore* memory_limiter) {
    _memory_limiter = memory_limiter;
    _enforce_authorization = enforce_authorization;
-    _max_concurrent_requests = std::move(max_concurrent_requests);
    if (!port && !https_port) {
        return make_exception_future<>(std::runtime_error("Either regular port or TLS port"
                " must be specified in order to init an alternator HTTP server instance"));
@@ -535,14 +419,12 @@ future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std:
            if (port) {
                set_routes(_http_server._routes);
                _http_server.set_content_length_limit(server::content_length_limit);
-                _http_server.set_content_streaming(true);
                _http_server.listen(socket_address{addr, *port}).get();
                _enabled_servers.push_back(std::ref(_http_server));
            }
            if (https_port) {
                set_routes(_https_server._routes);
                _https_server.set_content_length_limit(server::content_length_limit);
-                _https_server.set_content_streaming(true);
                _https_server.set_tls_credentials(creds->build_reloadable_server_credentials([](const std::unordered_set<sstring>& files, std::exception_ptr ep) {
                    if (ep) {
                        slogger.warn("Exception loading {}: {}", files, ep);
@@ -580,7 +462,7 @@ server::json_parser::json_parser() : _run_parse_json_thread(async([this] {
                return;
            }
            try {
-                _parsed_document = rjson::parse_yieldable(std::move(_raw_document));
+                _parsed_document = rjson::parse_yieldable(_raw_document);
                _current_exception = nullptr;
            } catch (...) {
                _current_exception = std::current_exception();
@@ -590,12 +472,12 @@ server::json_parser::json_parser() : _run_parse_json_thread(async([this] {
    })) {
 }

-future<rjson::value> server::json_parser::parse(chunked_content&& content) {
+future<rjson::value> server::json_parser::parse(std::string_view content) {
    if (content.size() < yieldable_parsing_threshold) {
-        return make_ready_future<rjson::value>(rjson::parse(std::move(content)));
+        return make_ready_future<rjson::value>(rjson::parse(content));
    }
-    return with_semaphore(_parsing_sem, 1, [this, content = std::move(content)] () mutable {
-        _raw_document = std::move(content);
+    return with_semaphore(_parsing_sem, 1, [this, content] {
+        _raw_document = content;
        _document_waiting.signal();
        return _document_parsed.wait().then([this] {
            if (_current_exception) {
--- a/alternator/server.hh
+++ b/alternator/server.hh
@@ -28,13 +28,10 @@
 #include <optional>
 #include "alternator/auth.hh"
 #include "utils/small_vector.hh"
-#include "utils/updateable_value.hh"
 #include <seastar/core/units.hh>

 namespace alternator {

-using chunked_content = rjson::chunked_content;
-
 class server {
    static constexpr size_t content_length_limit = 16*MB;
    using alternator_callback = std::function<future<executor::request_return_type>(executor&, executor::client_state&,
@@ -53,11 +50,10 @@ class server {
    alternator_callbacks_map _callbacks;

    semaphore* _memory_limiter;
-    utils::updateable_value<uint32_t> _max_concurrent_requests;

    class json_parser {
        static constexpr size_t yieldable_parsing_threshold = 16*KB;
-        chunked_content _raw_document;
+        std::string_view _raw_document;
        rjson::value _parsed_document;
        std::exception_ptr _current_exception;
        semaphore _parsing_sem{1};
@@ -67,10 +63,7 @@ class server {
        future<> _run_parse_json_thread;
    public:
        json_parser();
-        // Moving a chunked_content into parse() allows parse() to free each
-        // chunk as soon as it is parsed, so when chunks are relatively small,
-        // we don't need to store the sum of unparsed and parsed sizes.
-        future<rjson::value> parse(chunked_content&& content);
+        future<rjson::value> parse(std::string_view content);
        future<> stop();
    };
    json_parser _json_parser;
@@ -79,12 +72,12 @@ public:
    server(executor& executor, cql3::query_processor& qp);

    future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
-            bool enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);
+            bool enforce_authorization, semaphore* memory_limiter);
    future<> stop();
 private:
    void set_routes(seastar::httpd::routes& r);
-    future<> verify_signature(const seastar::httpd::request&, const chunked_content&);
-    future<executor::request_return_type> handle_api_request(std::unique_ptr<request> req);
+    future<> verify_signature(const seastar::httpd::request& r);
+    future<executor::request_return_type> handle_api_request(std::unique_ptr<request>&& req);
 };

 }
--- a/alternator/stats.cc
+++ b/alternator/stats.cc
@@ -38,7 +38,6 @@ stats::stats() : api_operations{} {
 #define OPERATION_LATENCY(name, CamelCaseName) \
                seastar::metrics::make_histogram("op_latency", \
                        seastar::metrics::description("Latency histogram of an operation via Alternator API"), {op(CamelCaseName)}, [this]{return to_metrics_histogram(api_operations.name);}),
-            OPERATION(batch_get_item, "BatchGetItem")
            OPERATION(batch_write_item, "BatchWriteItem")
            OPERATION(create_backup, "CreateBackup")
            OPERATION(create_global_table, "CreateGlobalTable")
@@ -97,8 +96,6 @@ stats::stats() : api_operations{} {
                    seastar::metrics::description("number writes that had to be bounced from this shard because of LWT requirements")),
            seastar::metrics::make_total_operations("requests_blocked_memory", requests_blocked_memory,
                    seastar::metrics::description("Counts a number of requests blocked due to memory pressure.")),
-            seastar::metrics::make_total_operations("requests_shed", requests_shed,
-                    seastar::metrics::description("Counts a number of requests shed due to overload.")),
            seastar::metrics::make_total_operations("filtered_rows_read_total", cql_stats.filtered_rows_read_total,
                    seastar::metrics::description("number of rows read during filtering operations")),
            seastar::metrics::make_total_operations("filtered_rows_matched_total", cql_stats.filtered_rows_matched_total,
--- a/alternator/stats.hh
+++ b/alternator/stats.hh
@@ -92,7 +92,6 @@ public:
    uint64_t write_using_lwt = 0;
    uint64_t shard_bounce_for_lwt = 0;
    uint64_t requests_blocked_memory = 0;
-    uint64_t requests_shed = 0;
    // CQL-derived stats
    cql3::cql_stats cql_stats;
 private:
--- a/alternator/streams.cc
+++ b/alternator/streams.cc
@@ -34,7 +34,6 @@
 #include "cdc/log.hh"
 #include "cdc/generation.hh"
 #include "cdc/cdc_options.hh"
-#include "cdc/metadata.hh"
 #include "db/system_distributed_keyspace.hh"
 #include "utils/UUID_gen.hh"
 #include "cql3/selection/selection.hh"
@@ -471,7 +470,8 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
    auto status = "DISABLED";

    if (opts.enabled()) {
-        if (!_cdc_metadata.streams_available()) {
+        auto& metadata = _ss.get_cdc_metadata();
+        if (!metadata.streams_available()) {
            status = "ENABLING";
        } else {
            status = "ENABLED";
@@ -499,11 +499,19 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
    // TODO: creation time

    auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();
+    // We cannot really "resume" the query; we must iterate over all data, because we can query neither on "time" (pk) > something
+    // nor on expired...
+    // TODO: maybe add secondary index to topology table to enable this?
+    return _sdks.cdc_get_versioned_streams({ normal_token_owners }).then([this, &db, schema, shard_start, limit, ret = std::move(ret), stream_desc = std::move(stream_desc), ttl](std::map<db_clock::time_point, cdc::streams_version> topologies) mutable {

-    // filter out cdc generations older than the table or now() - cdc::ttl (typically dynamodb_streams_max_window - 24h)
-    auto low_ts = std::max(as_timepoint(schema->id()), db_clock::now() - ttl);
+        // filter out cdc generations older than the table or now() - cdc::ttl (typically dynamodb_streams_max_window - 24h)
+        auto low_ts = std::max(as_timepoint(schema->id()), db_clock::now() - ttl);

-    return _sdks.cdc_get_versioned_streams(low_ts, { normal_token_owners }).then([this, &db, shard_start, limit, ret = std::move(ret), stream_desc = std::move(stream_desc)] (std::map<db_clock::time_point, cdc::streams_version> topologies) mutable {
+        auto i = topologies.lower_bound(low_ts);
+        // need first gen _intersecting_ the timestamp.
+        if (i != topologies.begin()) {
+            i = std::prev(i);
+        }

        auto e = topologies.end();
        auto prev = e;
@@ -511,7 +519,9 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl

        std::optional<shard_id> last;

-        auto i = topologies.begin();
+        // i is now at the first (oldest) generation we include. Remember it.
+        auto first = i;
+
        // if we're a paged query, skip to the generation where we left of.
        if (shard_start) {
            i = topologies.find(shard_start->time);
@@ -537,7 +547,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
        };

        // need a prev even if we are skipping stuff
-        if (i != topologies.begin()) {
+        if (i != first) {
            prev = std::prev(i);
        }

@@ -845,18 +855,16 @@ future<executor::request_return_type> executor::get_records(client_state& client
    static const bytes op_column_name = cdc::log_meta_column_name_bytes("operation");
    static const bytes eor_column_name = cdc::log_meta_column_name_bytes("end_of_batch");

-    auto key_names = boost::copy_range<attrs_to_get>(
+    auto key_names = boost::copy_range<std::unordered_set<std::string>>(
        boost::range::join(std::move(base->partition_key_columns()), std::move(base->clustering_key_columns()))
-        | boost::adaptors::transformed([&] (const column_definition& cdef) {
-            return std::make_pair<std::string, attrs_to_get_node>(cdef.name_as_text(), {}); })
+        | boost::adaptors::transformed([&] (const column_definition& cdef) { return cdef.name_as_text(); })
    );
    // Include all base table columns as values (in case pre or post is enabled).
    // This will include attributes not stored in the frozen map column
-    auto attr_names = boost::copy_range<attrs_to_get>(base->regular_columns()
+    auto attr_names = boost::copy_range<std::unordered_set<std::string>>(base->regular_columns()
        // this will include the :attrs column, which we will also force evaluating. 
        // But not having this set empty forces out any cdc columns from actual result 
-        | boost::adaptors::transformed([] (const column_definition& cdef) {
-            return std::make_pair<std::string, attrs_to_get_node>(cdef.name_as_text(), {}); })
+        | boost::adaptors::transformed([] (const column_definition& cdef) { return cdef.name_as_text(); })
    );

    std::vector<const column_definition*> columns;
@@ -1020,9 +1028,7 @@ future<executor::request_return_type> executor::get_records(client_state& client
        }

        // ugh. figure out if we are and end-of-shard
-        auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();
-        
-        return _sdks.cdc_current_generation_timestamp({ normal_token_owners }).then([this, iter, high_ts, start_time, ret = std::move(ret)](db_clock::time_point ts) mutable {
+        return cdc::get_local_streams_timestamp().then([this, iter, high_ts, start_time, ret = std::move(ret)](db_clock::time_point ts) mutable {
            auto& shard = iter.shard;            

            if (shard.time < ts && ts < high_ts) {
--- a/api/api-doc/column_family.json
+++ b/api/api-doc/column_family.json
@@ -2925,10 +2925,6 @@
         "id":"toppartitions_query_results",
         "description":"nodetool toppartitions query results",
         "properties":{
-            "read_cardinality":{
-               "type":"long",
-               "description":"Number of the unique operations in the sample set"
-            },
            "read":{
               "type":"array",
               "items":{
@@ -2936,10 +2932,6 @@
               },
               "description":"Read results"
            },
-            "write_cardinality":{
-               "type":"long",
-               "description":"Number of the unique operations in the sample set"
-            },
            "write":{
               "type":"array",
               "items":{
--- a/api/api-doc/gossiper.json
+++ b/api/api-doc/gossiper.json
@@ -148,30 +148,6 @@
               ]
            }
         ]
-      },
-      {
-         "path":"/gossiper/force_remove_endpoint/{addr}",
-         "operations":[
-            {
-               "method":"POST",
-               "summary":"Force remove an endpoint from gossip",
-               "type":"void",
-               "nickname":"force_remove_endpoint",
-               "produces":[
-                  "application/json"
-               ],
-               "parameters":[
-                  {
-                     "name":"addr",
-                     "description":"The endpoint address",
-                     "required":true,
-                     "allowMultiple":false,
-                     "type":"string",
-                     "paramType":"path"
-                  }
-               ]
-            }
-         ]
      }
   ]
 }
--- a/api/api-doc/messaging_service.json
+++ b/api/api-doc/messaging_service.json
@@ -76,7 +76,7 @@
               "items":{
                  "type":"message_counter"
               },
-               "nickname":"get_replied_messages",
+               "nickname":"get_completed_messages",
               "produces":[
                  "application/json"
               ],
--- a/api/api-doc/storage_service.json
+++ b/api/api-doc/storage_service.json
@@ -104,68 +104,6 @@
            }
         ]
      },
-      {
-         "path":"/storage_service/toppartitions/",
-         "operations":[
-            {
-               "method":"GET",
-               "summary":"Toppartitions query",
-               "type":"toppartitions_query_results",
-               "nickname":"toppartitions_generic",
-               "produces":[
-                  "application/json"
-               ],
-               "parameters":[
-                  {
-                     "name":"table_filters",
-                     "description":"Optional list of table name filters in keyspace:name format",
-                     "required":false,
-                     "allowMultiple":false,
-                     "type":"array",
-                     "items":{
-                        "type":"string"
-                     },
-                     "paramType":"query"
-                  },
-                  {
-                     "name":"keyspace_filters",
-                     "description":"Optional list of keyspace filters",
-                     "required":false,
-                     "allowMultiple":false,
-                     "type":"array",
-                     "items":{
-                        "type":"string"
-                     },
-                     "paramType":"query"
-                  },
-                  {
-                     "name":"duration",
-                     "description":"Duration (in milliseconds) of monitoring operation",
-                     "required":true,
-                     "allowMultiple":false,
-                     "type": "long",
-                     "paramType":"query"
-                  },
-                  {
-                    "name":"list_size",
-                    "description":"number of the top partitions to list",
-                    "required":false,
-                    "allowMultiple":false,
-                    "type": "long",
-                    "paramType":"query"
-                 },
-                 {
-                    "name":"capacity",
-                    "description":"capacity of stream summary: determines amount of resources used in query processing",
-                    "required":false,
-                    "allowMultiple":false,
-                    "type": "long",
-                    "paramType":"query"
-                 }
-              ]
-            }
-         ]
-      },
      {
         "path":"/storage_service/nodes/leaving",
         "operations":[
@@ -1032,14 +970,6 @@
                     "type":"string",
                     "paramType":"query"
                  },
-                  {
-                     "name":"ignore_nodes",
-                     "description":"Which hosts are to ignore in this repair. Multiple hosts can be listed separated by commas.",
-                     "required":false,
-                     "allowMultiple":false,
-                     "type":"string",
-                     "paramType":"query"
-                  },
                  {
                     "name":"trace",
                     "description":"If the value is the string 'true' with any capitalization, enable tracing of the repair.",
@@ -1175,14 +1105,6 @@
                     "allowMultiple":false,
                     "type":"string",
                     "paramType":"query"
-                  },
-                  {
-                     "name":"ignore_nodes",
-                     "description":"List of dead nodes to ingore in removenode operation",
-                     "required":false,
-                     "allowMultiple":false,
-                     "type":"string",
-                     "paramType":"query"
                  }
               ]
            }
@@ -1834,22 +1756,6 @@
                     "allowMultiple":false,
                     "type":"string",
                     "paramType":"query"
-                  },
-                  {
-                     "name":"load_and_stream",
-                     "description":"Load the sstables and stream to all replica nodes that owns the data",
-                     "required":false,
-                     "allowMultiple":false,
-                     "type":"string",
-                     "paramType":"query"
-                  },
-                  {
-                     "name":"primary_replica_only",
-                     "description":"Load the sstables and stream to primary replica node that owns the data. Repair is needed after the load and stream process",
-                     "required":false,
-                     "allowMultiple":false,
-                     "type":"string",
-                     "paramType":"query"
                  }
               ]
            }
@@ -1960,14 +1866,6 @@
                     "allowMultiple":false,
                     "type":"long",
                     "paramType":"query"
-                  },
-                  {
-                     "name":"fast",
-                     "description":"Lightweight tracing mode: if true, slow queries tracing records only session headers",
-                     "required":false,
-                     "allowMultiple":false,
-                     "type":"boolean",
-                     "paramType":"query"
                  }
               ]
            },
@@ -2466,10 +2364,6 @@
            "threshold":{
               "type":"long",
               "description":"The slow query logging threshold in microseconds. Queries that takes longer, will be logged"
-            },
-            "fast":{
-               "type":"boolean",
-               "description":"Is lightweight tracing mode enabled. In that mode tracing ignore events and tracks only sessions."
            }
         }
      },
--- a/api/api-doc/system.json
+++ b/api/api-doc/system.json
@@ -52,22 +52,6 @@
            }
         ]
      },
-      {
-         "path":"/system/drop_sstable_caches",
-         "operations":[
-            {
-               "method":"POST",
-               "summary":"Drop in-memory caches for data which is in sstables",
-               "type":"void",
-               "nickname":"drop_sstable_caches",
-               "produces":[
-                  "application/json"
-               ],
-               "parameters":[
-               ]
-            }
-         ]
-      },
      {
         "path":"/system/uptime_ms",
         "operations":[
--- a/api/column_family.cc
+++ b/api/column_family.cc
@@ -28,7 +28,6 @@
 #include <algorithm>
 #include "db/system_keyspace_view_types.hh"
 #include "db/data_listeners.hh"
-#include "storage_service.hh"

 extern logging::logger apilog;

@@ -181,7 +180,7 @@ static future<json::json_return_type> get_cf_unleveled_sstables(http_context& ct

 static int64_t min_partition_size(column_family& cf) {
    int64_t res = INT64_MAX;
-    for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
+    for (auto i: *cf.get_sstables() ) {
        res = std::min(res, i->get_stats_metadata().estimated_partition_size.min());
    }
    return (res == INT64_MAX) ? 0 : res;
@@ -189,7 +188,7 @@ static int64_t min_partition_size(column_family& cf) {

 static int64_t max_partition_size(column_family& cf) {
    int64_t res = 0;
-    for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
+    for (auto i: *cf.get_sstables() ) {
        res = std::max(i->get_stats_metadata().estimated_partition_size.max(), res);
    }
    return res;
@@ -197,7 +196,7 @@ static int64_t max_partition_size(column_family& cf) {

 static integral_ratio_holder mean_partition_size(column_family& cf) {
    integral_ratio_holder res;
-    for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
+    for (auto i: *cf.get_sstables() ) {
        auto c = i->get_stats_metadata().estimated_partition_size.count();
        res.sub += i->get_stats_metadata().estimated_partition_size.mean() * c;
        res.total += c;
@@ -275,7 +274,7 @@ public:

 static double get_compression_ratio(column_family& cf) {
    sum_ratio<double> result;
-    for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
+    for (auto i : *cf.get_sstables()) {
        auto compression_ratio = i->get_compression_ratio();
        if (compression_ratio != sstables::metadata_collector::NO_COMPRESSION_RATIO) {
            result(compression_ratio);
@@ -311,8 +310,8 @@ void set_column_family(http_context& ctx, routes& r) {
        return res;
    });

-    cf::get_column_family.set(r, [&ctx] (std::unique_ptr<request> req){
-            std::list<cf::column_family_info> res;
+    cf::get_column_family.set(r, [&ctx] (const_req req){
+            vector<cf::column_family_info> res;
            for (auto i: ctx.db.local().get_column_families_mapping()) {
                cf::column_family_info info;
                info.ks = i.first.first;
@@ -320,7 +319,7 @@ void set_column_family(http_context& ctx, routes& r) {
                info.type = "ColumnFamilies";
                res.push_back(info);
            }
-            return make_ready_future<json::json_return_type>(json::stream_range_as_array(std::move(res), std::identity()));
+            return res;
        });

    cf::get_column_family_name_keyspace.set(r, [&ctx] (const_req req){
@@ -332,15 +331,15 @@ void set_column_family(http_context& ctx, routes& r) {
    });

    cf::get_memtable_columns_count.set(r, [&ctx] (std::unique_ptr<request> req) {
-        return map_reduce_cf(ctx, req->param["name"], uint64_t{0}, [](column_family& cf) {
+        return map_reduce_cf(ctx, req->param["name"], 0, [](column_family& cf) {
            return cf.active_memtable().partition_count();
-        }, std::plus<>());
+        }, std::plus<int>());
    });

    cf::get_all_memtable_columns_count.set(r, [&ctx] (std::unique_ptr<request> req) {
-        return map_reduce_cf(ctx, uint64_t{0}, [](column_family& cf) {
+        return map_reduce_cf(ctx, 0, [](column_family& cf) {
            return cf.active_memtable().partition_count();
-        }, std::plus<>());
+        }, std::plus<int>());
    });

    cf::get_memtable_on_heap_size.set(r, [] (const_req req) {
@@ -425,7 +424,7 @@ void set_column_family(http_context& ctx, routes& r) {
    cf::get_estimated_row_size_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
        return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
            utils::estimated_histogram res(0);
-            for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
+            for (auto i: *cf.get_sstables() ) {
                res.merge(i->get_stats_metadata().estimated_partition_size);
            }
            return res;
@@ -437,7 +436,7 @@ void set_column_family(http_context& ctx, routes& r) {
    cf::get_estimated_row_count.set(r, [&ctx] (std::unique_ptr<request> req) {
        return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](column_family& cf) {
            uint64_t res = 0;
-            for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
+            for (auto i: *cf.get_sstables() ) {
                res += i->get_stats_metadata().estimated_partition_size.count();
            }
            return res;
@@ -448,7 +447,7 @@ void set_column_family(http_context& ctx, routes& r) {
    cf::get_estimated_column_count_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
        return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
            utils::estimated_histogram res(0);
-            for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
+            for (auto i: *cf.get_sstables() ) {
                res.merge(i->get_stats_metadata().estimated_cells_count);
            }
            return res;
@@ -600,8 +599,7 @@ void set_column_family(http_context& ctx, routes& r) {

    cf::get_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
-            auto sstables = cf.get_sstables();
-            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
+            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
                return s + sst->filter_get_false_positive();
            });
        }, std::plus<uint64_t>());
@@ -609,8 +607,7 @@ void set_column_family(http_context& ctx, routes& r) {

    cf::get_all_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
        return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
-            auto sstables = cf.get_sstables();
-            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
+            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
                return s + sst->filter_get_false_positive();
            });
        }, std::plus<uint64_t>());
@@ -618,8 +615,7 @@ void set_column_family(http_context& ctx, routes& r) {

    cf::get_recent_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
-            auto sstables = cf.get_sstables();
-            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
+            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
                return s + sst->filter_get_recent_false_positive();
            });
        }, std::plus<uint64_t>());
@@ -627,8 +623,7 @@ void set_column_family(http_context& ctx, routes& r) {

    cf::get_all_recent_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
        return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
-            auto sstables = cf.get_sstables();
-            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
+            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
                return s + sst->filter_get_recent_false_positive();
            });
        }, std::plus<uint64_t>());
@@ -660,54 +655,48 @@ void set_column_family(http_context& ctx, routes& r) {

    cf::get_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {
        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
-            auto sstables = cf.get_sstables();
-            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
-                return s + sst->filter_size();
+            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
+                return s + sst->filter_size();
            });
        }, std::plus<uint64_t>());
    });

    cf::get_all_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {
        return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
-            auto sstables = cf.get_sstables();
-            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
-                return s + sst->filter_size();
+            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
+                return s + sst->filter_size();
            });
        }, std::plus<uint64_t>());
    });

    cf::get_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
-            auto sstables = cf.get_sstables();
-            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
-                return s + sst->filter_memory_size();
+            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
+                return s + sst->filter_memory_size();
            });
        }, std::plus<uint64_t>());
    });

    cf::get_all_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
        return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
-            auto sstables = cf.get_sstables();
-            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
-                return s + sst->filter_memory_size();
+            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
+                return s + sst->filter_memory_size();
            });
        }, std::plus<uint64_t>());
    });

    cf::get_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
-            auto sstables = cf.get_sstables();
-            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
-                return s + sst->get_summary().memory_footprint();
+            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
+                return s + sst->get_summary().memory_footprint();
            });
        }, std::plus<uint64_t>());
    });

    cf::get_all_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
        return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
-            auto sstables = cf.get_sstables();
-            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
-                return s + sst->get_summary().memory_footprint();
+            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
+                return s + sst->get_summary().memory_footprint();
            });
        }, std::plus<uint64_t>());
    });
@@ -984,20 +973,42 @@ void set_column_family(http_context& ctx, routes& r) {
        });
    });

-
    cf::toppartitions.set(r, [&ctx] (std::unique_ptr<request> req) {
-        auto name = req->param["name"];
-        auto [ks, cf] = parse_fully_qualified_cf_name(name);
+        auto name_param = req->param["name"];
+        auto [ks, cf] = parse_fully_qualified_cf_name(name_param);

        api::req_param<std::chrono::milliseconds, unsigned> duration{*req, "duration", 1000ms};
        api::req_param<unsigned> capacity(*req, "capacity", 256);
        api::req_param<unsigned> list_size(*req, "list_size", 10);

        apilog.info("toppartitions query: name={} duration={} list_size={} capacity={}",
-            name, duration.param, list_size.param, capacity.param);
+            name_param, duration.param, list_size.param, capacity.param);

-        return seastar::do_with(db::toppartitions_query(ctx.db, {{ks, cf}}, {}, duration.value, list_size, capacity), [&ctx] (db::toppartitions_query& q) {
-            return run_toppartitions_query(q, ctx, true);
+        return seastar::do_with(db::toppartitions_query(ctx.db, ks, cf, duration.value, list_size, capacity), [&ctx](auto& q) {
+            return q.scatter().then([&q] {
+                return sleep(q.duration()).then([&q] {
+                    return q.gather(q.capacity()).then([&q] (auto topk_results) {
+                        apilog.debug("toppartitions query: processing results");
+                        cf::toppartitions_query_results results;
+
+                        for (auto& d: topk_results.read.top(q.list_size())) {
+                            cf::toppartitions_record r;
+                            r.partition = sstring(d.item);
+                            r.count = d.count;
+                            r.error = d.error;
+                            results.read.push(r);
+                        }
+                        for (auto& d: topk_results.write.top(q.list_size())) {
+                            cf::toppartitions_record r;
+                            r.partition = sstring(d.item);
+                            r.count = d.count;
+                            r.error = d.error;
+                            results.write.push(r);
+                        }
+                        return make_ready_future<json::json_return_type>(results);
+                    });
+                });
+            });
        });
    });

--- a/api/column_family.hh
+++ b/api/column_family.hh
@@ -116,7 +116,4 @@ future<json::json_return_type>  get_cf_stats(http_context& ctx, const sstring& n
 future<json::json_return_type>  get_cf_stats(http_context& ctx,
        int64_t column_family_stats::*f);

-
-std::tuple<sstring, sstring> parse_fully_qualified_cf_name(sstring name);
-
 }
--- a/api/compaction_manager.cc
+++ b/api/compaction_manager.cc
@@ -58,7 +58,6 @@ void set_compaction_manager(http_context& ctx, routes& r) {

            for (const auto& c : cm.get_compactions()) {
                cm::summary s;
-                s.id = c->compaction_uuid.to_sstring();
                s.ks = c->ks_name;
                s.cf = c->cf_name;
                s.unit = "keys";
--- a/api/gossiper.cc
+++ b/api/gossiper.cc
@@ -66,13 +66,6 @@ void set_gossiper(http_context& ctx, routes& r) {
            return make_ready_future<json::json_return_type>(json_void());
        });
    });
-
-    httpd::gossiper_json::force_remove_endpoint.set(r, [](std::unique_ptr<request> req) {
-        gms::inet_address ep(req->param["addr"]);
-        return gms::get_local_gossiper().force_remove_endpoint(ep).then([] {
-            return make_ready_future<json::json_return_type>(json_void());
-        });
-    });
 }

 }
--- a/api/lsa.cc
+++ b/api/lsa.cc
@@ -26,7 +26,6 @@
 #include <seastar/http/exception.hh>
 #include "utils/logalloc.hh"
 #include "log.hh"
-#include "database.hh"

 namespace api {

--- a/api/messaging_service.cc
+++ b/api/messaging_service.cc
@@ -96,10 +96,6 @@ void set_messaging_service(http_context& ctx, routes& r, sharded<netw::messaging
        return c.get_stats().sent_messages;
    }));

-    get_replied_messages.set(r, get_client_getter(ms, [](const shard_info& c) {
-        return c.get_stats().replied;
-    }));
-
    get_dropped_messages.set(r, get_client_getter(ms, [](const shard_info& c) {
        // We don't have the same drop message mechanism
        // as origin has.
@@ -159,7 +155,6 @@ void set_messaging_service(http_context& ctx, routes& r, sharded<netw::messaging
 void unset_messaging_service(http_context& ctx, routes& r) {
    get_timeout_messages.unset(r);
    get_sent_messages.unset(r);
-    get_replied_messages.unset(r);
    get_dropped_messages.unset(r);
    get_exception_messages.unset(r);
    get_pending_messages.unset(r);
--- a/api/storage_service.cc
+++ b/api/storage_service.cc
@@ -23,13 +23,10 @@
 #include "api/api-doc/storage_service.json.hh"
 #include "db/config.hh"
 #include "db/schema_tables.hh"
-#include "utils/hash.hh"
-#include <sstream>
+#include <optional>
 #include <time.h>
 #include <boost/range/adaptor/map.hpp>
 #include <boost/range/adaptor/filtered.hpp>
-#include <boost/algorithm/string/trim_all.hpp>
-#include <boost/functional/hash.hpp>
 #include "service/storage_service.hh"
 #include "service/load_meter.hh"
 #include "db/commitlog/commitlog.hh"
@@ -49,9 +46,6 @@
 #include "transport/controller.hh"
 #include "thrift/controller.hh"
 #include "locator/token_metadata.hh"
-#include "cdc/generation_service.hh"
-
-extern logging::logger apilog;

 namespace api {

@@ -100,37 +94,6 @@ static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {
    };
 }

-seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request) {
-    namespace cf = httpd::column_family_json;
-    return q.scatter().then([&q, legacy_request] {
-        return sleep(q.duration()).then([&q, legacy_request] {
-            return q.gather(q.capacity()).then([&q, legacy_request] (auto topk_results) {
-                apilog.debug("toppartitions query: processing results");
-                cf::toppartitions_query_results results;
-
-                results.read_cardinality = topk_results.read.size();
-                results.write_cardinality = topk_results.write.size();
-
-                for (auto& d: topk_results.read.top(q.list_size())) {
-                    cf::toppartitions_record r;
-                    r.partition = (legacy_request ? "" : "(" + d.item.schema->ks_name() + ":" + d.item.schema->cf_name() + ") ") + sstring(d.item);
-                    r.count = d.count;
-                    r.error = d.error;
-                    results.read.push(r);
-                }
-                for (auto& d: topk_results.write.top(q.list_size())) {
-                    cf::toppartitions_record r;
-                    r.partition = (legacy_request ? "" : "(" + d.item.schema->ks_name() + ":" + d.item.schema->cf_name() + ") ") + sstring(d.item);
-                    r.count = d.count;
-                    r.error = d.error;
-                    results.write.push(r);
-                }
-                return make_ready_future<json::json_return_type>(results);
-            });
-        });
-    });
-}
-
 future<json::json_return_type> set_tables_autocompaction(http_context& ctx, const sstring &keyspace, std::vector<sstring> tables, bool enabled) {
    if (tables.empty()) {
        tables = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());
@@ -196,7 +159,7 @@ void unset_rpc_controller(http_context& ctx, routes& r) {
 void set_repair(http_context& ctx, routes& r, sharded<netw::messaging_service>& ms) {
    ss::repair_async.set(r, [&ctx, &ms](std::unique_ptr<request> req) {
        static std::vector<sstring> options = {"primaryRange", "parallelism", "incremental",
-                "jobThreads", "ranges", "columnFamilies", "dataCenters", "hosts", "ignore_nodes", "trace",
+                "jobThreads", "ranges", "columnFamilies", "dataCenters", "hosts", "trace",
                "startToken", "endToken" };
        std::unordered_map<sstring, sstring> options_map;
        for (auto o : options) {
@@ -262,7 +225,7 @@ void set_repair(http_context& ctx, routes& r, sharded<netw::messaging_service>&
            try {
                res = fut.get0();
            } catch (std::exception& e) {
-                return make_exception_future<json::json_return_type>(httpd::bad_param_exception(e.what()));
+                return make_exception_future<json::json_return_type>(httpd::server_error_exception(e.what()));
            }
            return make_ready_future<json::json_return_type>(json::json_return_type(res));
        });
@@ -324,56 +287,6 @@ void set_storage_service(http_context& ctx, routes& r) {
        }));
    });

-    ss::toppartitions_generic.set(r, [&ctx] (std::unique_ptr<request> req) {
-        bool filters_provided = false;
-
-        std::unordered_set<std::tuple<sstring, sstring>, utils::tuple_hash> table_filters {};
-        if (req->query_parameters.contains("table_filters")) {
-            filters_provided = true;
-            auto filters = req->get_query_param("table_filters");
-            std::stringstream ss { filters };
-            std::string filter;
-            while (!filters.empty() && ss.good()) {
-                std::getline(ss, filter, ',');
-                table_filters.emplace(parse_fully_qualified_cf_name(filter));
-            }
-        }
-
-        std::unordered_set<sstring> keyspace_filters {};
-        if (req->query_parameters.contains("keyspace_filters")) {
-            filters_provided = true;
-            auto filters = req->get_query_param("keyspace_filters");
-            std::stringstream ss { filters };
-            std::string filter;
-            while (!filters.empty() && ss.good()) {
-                std::getline(ss, filter, ',');
-                keyspace_filters.emplace(std::move(filter));
-            }
-        }
-
-        // when the query is empty return immediately
-        if (filters_provided && table_filters.empty() && keyspace_filters.empty()) {
-            apilog.debug("toppartitions query: processing results");
-            httpd::column_family_json::toppartitions_query_results results;
-
-            results.read_cardinality = 0;
-            results.write_cardinality = 0;
-
-            return make_ready_future<json::json_return_type>(results);
-        }
-
-        api::req_param<std::chrono::milliseconds, unsigned> duration{*req, "duration", 1000ms};
-        api::req_param<unsigned> capacity(*req, "capacity", 256);
-        api::req_param<unsigned> list_size(*req, "list_size", 10);
-
-        apilog.info("toppartitions query: #table_filters={} #keyspace_filters={} duration={} list_size={} capacity={}",
-            !table_filters.empty() ? std::to_string(table_filters.size()) : "all", !keyspace_filters.empty() ? std::to_string(keyspace_filters.size()) : "all", duration.param, list_size.param, capacity.param);
-
-        return seastar::do_with(db::toppartitions_query(ctx.db, std::move(table_filters), std::move(keyspace_filters), duration.value, list_size, capacity), [&ctx] (db::toppartitions_query& q) {
-            return run_toppartitions_query(q, ctx);
-        });
-    });
-
    ss::get_leaving_nodes.set(r, [&ctx](const_req req) {
        return container_to_vec(ctx.get_token_metadata().get_leaving_endpoints());
    });
@@ -487,7 +400,7 @@ void set_storage_service(http_context& ctx, routes& r) {
    });

    ss::cdc_streams_check_and_repair.set(r, [&ctx] (std::unique_ptr<request> req) {
-        return service::get_local_storage_service().get_cdc_generation_service().check_and_repair_cdc_streams().then([] {
+        return service::get_local_storage_service().check_and_repair_cdc_streams().then([] {
            return make_ready_future<json::json_return_type>(json_void());
        });
    });
@@ -583,22 +496,7 @@ void set_storage_service(http_context& ctx, routes& r) {

    ss::remove_node.set(r, [](std::unique_ptr<request> req) {
        auto host_id = req->get_query_param("host_id");
-        std::vector<sstring> ignore_nodes_strs= split(req->get_query_param("ignore_nodes"), ",");
-        auto ignore_nodes = std::list<gms::inet_address>();
-        for (std::string n : ignore_nodes_strs) {
-            try {
-                std::replace(n.begin(), n.end(), '\"', ' ');
-                std::replace(n.begin(), n.end(), '\'', ' ');
-                boost::trim_all(n);
-                if (!n.empty()) {
-                    auto node = gms::inet_address(n);
-                    ignore_nodes.push_back(node);
-                }
-            } catch (...) {
-                throw std::runtime_error(format("Failed to parse ignore_nodes parameter: ignore_nodes={}, node={}", ignore_nodes_strs, n));
-            }
-        }
-        return service::get_local_storage_service().removenode(host_id, std::move(ignore_nodes)).then([] {
+        return service::get_local_storage_service().removenode(host_id).then([] {
            return make_ready_future<json::json_return_type>(json_void());
        });
    });
@@ -818,19 +716,11 @@ void set_storage_service(http_context& ctx, routes& r) {
    ss::load_new_ss_tables.set(r, [&ctx](std::unique_ptr<request> req) {
        auto ks = validate_keyspace(ctx, req->param);
        auto cf = req->get_query_param("cf");
-        auto stream = req->get_query_param("load_and_stream");
-        auto primary_replica = req->get_query_param("primary_replica_only");
-        boost::algorithm::to_lower(stream);
-        boost::algorithm::to_lower(primary_replica);
-        bool load_and_stream = stream == "true" || stream == "1";
-        bool primary_replica_only = primary_replica == "true" || primary_replica == "1";
        // No need to add the keyspace, since all we want is to avoid always sending this to the same
        // CPU. Even then I am being overzealous here. This is not something that happens all the time.
        auto coordinator = std::hash<sstring>()(cf) % smp::count;
-        return service::get_storage_service().invoke_on(coordinator,
-                [ks = std::move(ks), cf = std::move(cf),
-                load_and_stream, primary_replica_only] (service::storage_service& s) {
-            return s.load_new_sstables(ks, cf, load_and_stream, primary_replica_only);
+        return service::get_storage_service().invoke_on(coordinator, [ks = std::move(ks), cf = std::move(cf)] (service::storage_service& s) {
+            return s.load_new_sstables(ks, cf);
        }).then_wrapped([] (auto&& f) {
            if (f.failed()) {
                auto msg = fmt::format("Failed to load new sstables: {}", f.get_exception());
@@ -886,7 +776,6 @@ void set_storage_service(http_context& ctx, routes& r) {
        res.enable = tracing::tracing::get_local_tracing_instance().slow_query_tracing_enabled();
        res.ttl = tracing::tracing::get_local_tracing_instance().slow_query_record_ttl().count() ;
        res.threshold = tracing::tracing::get_local_tracing_instance().slow_query_threshold().count();
-        res.fast = tracing::tracing::get_local_tracing_instance().ignore_trace_events_enabled();
        return res;
    });

@@ -894,9 +783,8 @@ void set_storage_service(http_context& ctx, routes& r) {
        auto enable = req->get_query_param("enable");
        auto ttl = req->get_query_param("ttl");
        auto threshold = req->get_query_param("threshold");
-        auto fast = req->get_query_param("fast");
        try {
-            return tracing::tracing::tracing_instance().invoke_on_all([enable, ttl, threshold, fast] (auto& local_tracing) {
+            return tracing::tracing::tracing_instance().invoke_on_all([enable, ttl, threshold] (auto& local_tracing) {
                if (threshold != "") {
                    local_tracing.set_slow_query_threshold(std::chrono::microseconds(std::stol(threshold.c_str())));
                }
@@ -906,9 +794,6 @@ void set_storage_service(http_context& ctx, routes& r) {
                if (enable != "") {
                    local_tracing.set_slow_query_enabled(strcasecmp(enable.c_str(), "true") == 0);
                }
-                if (fast != "") {
-                    local_tracing.set_ignore_trace_events(strcasecmp(fast.c_str(), "true") == 0);
-                }
            }).then([] {
                return make_ready_future<json::json_return_type>(json_void());
            });
@@ -1078,7 +963,7 @@ void set_storage_service(http_context& ctx, routes& r) {
                        tst.keyspace = schema->ks_name();
                        tst.table = schema->cf_name();

-                        for (auto sstables = t->get_sstables_including_compacted_undeleted(); auto sstable : *sstables) {
+                        for (auto sstable : *t->get_sstables_including_compacted_undeleted()) {
                            auto ts = db_clock::to_time_t(sstable->data_file_write_time());
                            ::tm t;
                            ::gmtime_r(&ts, &t);
--- a/api/storage_service.hh
+++ b/api/storage_service.hh
@@ -23,7 +23,6 @@

 #include <seastar/core/sharded.hh>
 #include "api.hh"
-#include "db/data_listeners.hh"

 namespace cql_transport { class controller; }
 class thrift_controller;
@@ -41,6 +40,5 @@ void set_rpc_controller(http_context& ctx, routes& r, thrift_controller& ctl);
 void unset_rpc_controller(http_context& ctx, routes& r);
 void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_ctl);
 void unset_snapshot(http_context& ctx, routes& r);
-seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request = false);

 }
--- a/api/system.cc
+++ b/api/system.cc
@@ -25,9 +25,6 @@
 #include <seastar/core/reactor.hh>
 #include <seastar/http/exception.hh>
 #include "log.hh"
-#include "database.hh"
-
-extern logging::logger apilog;

 namespace api {

@@ -73,16 +70,6 @@ void set_system(http_context& ctx, routes& r) {
        }
        return json::json_void();
    });
-
-    hs::drop_sstable_caches.set(r, [&ctx](std::unique_ptr<request> req) {
-        apilog.info("Dropping sstable caches");
-        return ctx.db.invoke_on_all([] (database& db) {
-            return db.drop_caches();
-        }).then([] {
-            apilog.info("Caches dropped");
-            return json::json_return_type(json::json_void());
-        });
-    });
 }

 }
--- a/atomic_cell.cc
+++ b/atomic_cell.cc
@@ -24,130 +24,142 @@
 #include "counters.hh"
 #include "types.hh"

+/// LSA migrator for cells with irrelevant type
+///
+///
+const data::type_imr_descriptor& no_type_imr_descriptor() {
+    static thread_local data::type_imr_descriptor state(data::type_info::make_variable_size());
+    return state;
+}
+
 atomic_cell atomic_cell::make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
-    return atomic_cell_type::make_dead(timestamp, deletion_time);
+    auto& imr_data = no_type_imr_descriptor();
+    return atomic_cell(
+            imr_data.type_info(),
+            imr_object_type::make(data::cell::make_dead(timestamp, deletion_time), &imr_data.lsa_migrator())
+    );
 }

 atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value, atomic_cell::collection_member cm) {
-    return atomic_cell_type::make_live(timestamp, single_fragment_range(value));
-}
-
-atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, managed_bytes_view value, atomic_cell::collection_member cm) {
-    return atomic_cell_type::make_live(timestamp, fragment_range(value));
+    auto& imr_data = type.imr_state();
+    return atomic_cell(
+        imr_data.type_info(),
+        imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())
+    );
 }

 atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value, atomic_cell::collection_member cm) {
-    return atomic_cell_type::make_live(timestamp, value);
+    auto& imr_data = type.imr_state();
+    return atomic_cell(
+        imr_data.type_info(),
+        imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())
+    );
 }

 atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value, collection_member cm)
 {
-    return atomic_cell_type::make_live(timestamp, value);
+    auto& imr_data = type.imr_state();
+    return atomic_cell(
+        imr_data.type_info(),
+        imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())
+    );
 }

 atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value,
                             gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {
-    return atomic_cell_type::make_live(timestamp, single_fragment_range(value), expiry, ttl);
-}
-
-atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, managed_bytes_view value,
-                             gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {
-    return atomic_cell_type::make_live(timestamp, fragment_range(value), expiry, ttl);
+    auto& imr_data = type.imr_state();
+    return atomic_cell(
+        imr_data.type_info(),
+        imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())
+    );
 }

 atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,
                             gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {
-    return atomic_cell_type::make_live(timestamp, value, expiry, ttl);
+    auto& imr_data = type.imr_state();
+    return atomic_cell(
+        imr_data.type_info(),
+        imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())
+    );
 }

 atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,
                                   gc_clock::time_point expiry, gc_clock::duration ttl, collection_member cm)
 {
-    return atomic_cell_type::make_live(timestamp, value, expiry, ttl);
+    auto& imr_data = type.imr_state();
+    return atomic_cell(
+        imr_data.type_info(),
+        imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())
+    );
 }

 atomic_cell atomic_cell::make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
-    return atomic_cell_type::make_live_counter_update(timestamp, value);
+    auto& imr_data = no_type_imr_descriptor();
+    return atomic_cell(
+        imr_data.type_info(),
+        imr_object_type::make(data::cell::make_live_counter_update(timestamp, value), &imr_data.lsa_migrator())
+    );
 }

 atomic_cell atomic_cell::make_live_uninitialized(const abstract_type& type, api::timestamp_type timestamp, size_t size) {
-    return atomic_cell_type::make_live_uninitialized(timestamp, size);
+    auto& imr_data = no_type_imr_descriptor();
+    return atomic_cell(
+        imr_data.type_info(),
+        imr_object_type::make(data::cell::make_live_uninitialized(imr_data.type_info(), timestamp, size), &imr_data.lsa_migrator())
+    );
+}
+
+static imr::utils::object<data::cell::structure> copy_cell(const data::type_imr_descriptor& imr_data, const uint8_t* ptr)
+{
+    using imr_object_type = imr::utils::object<data::cell::structure>;
+
+    // If the cell doesn't own any memory it is trivial and can be copied with
+    // memcpy.
+    auto f = data::cell::structure::get_member<data::cell::tags::flags>(ptr);
+    if (!f.template get<data::cell::tags::external_data>()) {
+        data::cell::context ctx(f, imr_data.type_info());
+        // XXX: We may be better off storing the total cell size in memory. Measure!
+        auto size = data::cell::structure::serialized_object_size(ptr, ctx);
+        return imr_object_type::make_raw(size, [&] (uint8_t* dst) noexcept {
+            std::copy_n(ptr, size, dst);
+        }, &imr_data.lsa_migrator());
+    }
+
+    return imr_object_type::make(data::cell::copy_fn(imr_data.type_info(), ptr), &imr_data.lsa_migrator());
 }

 atomic_cell::atomic_cell(const abstract_type& type, atomic_cell_view other)
-    : _data(other._view) {
-    set_view(_data);
-}
-
-// Based on:
-//  - org.apache.cassandra.db.AbstractCell#reconcile()
-//  - org.apache.cassandra.db.BufferExpiringCell#reconcile()
-//  - org.apache.cassandra.db.BufferDeletedCell#reconcile()
-int
-compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right) {
-    if (left.timestamp() != right.timestamp()) {
-        return left.timestamp() > right.timestamp() ? 1 : -1;
-    }
-    if (left.is_live() != right.is_live()) {
-        return left.is_live() ? -1 : 1;
-    }
-    if (left.is_live()) {
-        auto c = compare_unsigned(left.value(), right.value());
-        if (c != 0) {
-            return c;
-        }
-        if (left.is_live_and_has_ttl() != right.is_live_and_has_ttl()) {
-            // prefer expiring cells.
-            return left.is_live_and_has_ttl() ? 1 : -1;
-        }
-        if (left.is_live_and_has_ttl()) {
-            if (left.expiry() != right.expiry()) {
-                return left.expiry() < right.expiry() ? -1 : 1;
-            } else {
-                // prefer the cell that was written later,
-                // so it survives longer after it expires, until purged.
-                if (left.ttl() != right.ttl()) {
-                    return left.ttl() < right.ttl() ? 1 : -1;
-                } else {
-                    return 0;
-                }
-            }
-        }
-    } else {
-        // Both are deleted
-        if (left.deletion_time() != right.deletion_time()) {
-            // Origin compares big-endian serialized deletion time. That's because it
-            // delegates to AbstractCell.reconcile() which compares values after
-            // comparing timestamps, which in case of deleted cells will hold
-            // serialized expiry.
-            return (uint64_t) left.deletion_time().time_since_epoch().count()
-                   < (uint64_t) right.deletion_time().time_since_epoch().count() ? -1 : 1;
-        }
-    }
-    return 0;
-}
+    : atomic_cell(type.imr_state().type_info(),
+                  copy_cell(type.imr_state(), other._view.raw_pointer()))
+{ }

 atomic_cell_or_collection atomic_cell_or_collection::copy(const abstract_type& type) const {
-    if (_data.empty()) {
+    if (!_data.get()) {
        return atomic_cell_or_collection();
    }
-    return atomic_cell_or_collection(managed_bytes(_data));
+    auto& imr_data = type.imr_state();
+    return atomic_cell_or_collection(
+        copy_cell(imr_data, _data.get())
+    );
 }

 atomic_cell_or_collection::atomic_cell_or_collection(const abstract_type& type, atomic_cell_view acv)
-    : _data(acv._view)
+    : _data(copy_cell(type.imr_state(), acv._view.raw_pointer()))
 {
 }

 bool atomic_cell_or_collection::equals(const abstract_type& type, const atomic_cell_or_collection& other) const
 {
-    if (_data.empty() || other._data.empty()) {
-        return _data.empty() && other._data.empty();
+    auto ptr_a = _data.get();
+    auto ptr_b = other._data.get();
+
+    if (!ptr_a || !ptr_b) {
+        return !ptr_a && !ptr_b;
    }

    if (type.is_atomic()) {
-        auto a = atomic_cell_view::from_bytes(type, _data);
-        auto b = atomic_cell_view::from_bytes(type, other._data);
+        auto a = atomic_cell_view::from_bytes(type.imr_state().type_info(), _data);
+        auto b = atomic_cell_view::from_bytes(type.imr_state().type_info(), other._data);
        if (a.timestamp() != b.timestamp()) {
            return false;
        }
@@ -179,7 +191,28 @@ bool atomic_cell_or_collection::equals(const abstract_type& type, const atomic_c

 size_t atomic_cell_or_collection::external_memory_usage(const abstract_type& t) const
 {
-    return _data.external_memory_usage();
+    if (!_data.get()) {
+        return 0;
+    }
+    auto ctx = data::cell::context(_data.get(), t.imr_state().type_info());
+
+    auto view = data::cell::structure::make_view(_data.get(), ctx);
+    auto flags = view.get<data::cell::tags::flags>();
+
+    size_t external_value_size = 0;
+    if (flags.get<data::cell::tags::external_data>()) {
+        if (flags.get<data::cell::tags::collection>()) {
+            external_value_size = as_collection_mutation().data.size_bytes();
+        } else {
+            auto cell_view = data::cell::atomic_cell_view(t.imr_state().type_info(), view);
+            external_value_size = cell_view.value_size();
+        }
+        // Add overhead of chunk headers. The last one is a special case.
+        external_value_size += (external_value_size - 1) / data::cell::effective_external_chunk_length * data::cell::external_chunk_overhead;
+        external_value_size += data::cell::external_last_chunk_overhead;
+    }
+    return data::cell::structure::serialized_object_size(_data.get(), ctx)
+        + imr_object_type::size_overhead + external_value_size;
 }

 std::ostream&
@@ -188,7 +221,7 @@ operator<<(std::ostream& os, const atomic_cell_view& acv) {
        return fmt_print(os, "atomic_cell{{{},ts={:d},expiry={:d},ttl={:d}}}",
            acv.is_counter_update()
                    ? "counter_update_value=" + to_sstring(acv.counter_update_value())
-                    : to_hex(to_bytes(acv.value())),
+                    : to_hex(acv.value().linearize()),
            acv.timestamp(),
            acv.is_live_and_has_ttl() ? acv.expiry().time_since_epoch().count() : -1,
            acv.is_live_and_has_ttl() ? acv.ttl().count() : 0);
@@ -214,11 +247,12 @@ operator<<(std::ostream& os, const atomic_cell_view::printer& acvp) {
                cell_value_string_builder << "counter_update_value=" << acv.counter_update_value();
            } else {
                cell_value_string_builder << "shards: ";
-                auto ccv = counter_cell_view(acv);
-                cell_value_string_builder << ::join(", ", ccv.shards());
+                counter_cell_view::with_linearized(acv, [&cell_value_string_builder] (counter_cell_view& ccv) {
+                    cell_value_string_builder << ::join(", ", ccv.shards());
+                });
            }
        } else {
-            cell_value_string_builder << type.to_string(to_bytes(acv.value()));
+            cell_value_string_builder << type.to_string(acv.value().linearize());
        }
        return fmt_print(os, "atomic_cell{{{},ts={:d},expiry={:d},ttl={:d}}}",
            cell_value_string_builder.str(),
@@ -237,11 +271,12 @@ operator<<(std::ostream& os, const atomic_cell::printer& acp) {
 }

 std::ostream& operator<<(std::ostream& os, const atomic_cell_or_collection::printer& p) {
-    if (p._cell._data.empty()) {
+    if (!p._cell._data.get()) {
        return os << "{ null atomic_cell_or_collection }";
    }
+    using dc = data::cell;
    os << "{ ";
-    if (p._cdef.type->is_multi_cell()) {
+    if (dc::structure::get_member<dc::tags::flags>(p._cell._data.get()).get<dc::tags::collection>()) {
        os << "collection ";
        auto cmv = p._cell.as_collection_mutation();
        os << collection_mutation_view::printer(*p._cdef.type, cmv);
--- a/atomic_cell.hh
+++ b/atomic_cell.hh
@@ -26,12 +26,12 @@
 #include "tombstone.hh"
 #include "gc_clock.hh"
 #include "utils/managed_bytes.hh"
-#include "utils/fragment_range.hh"
 #include <seastar/net//byteorder.hh>
-#include <seastar/util/bool_class.hh>
 #include <cstdint>
 #include <iosfwd>
-#include <concepts>
+#include "data/cell.hh"
+#include "data/schema_info.hh"
+#include "imr/utils.hh"
 #include "utils/fragmented_temporary_buffer.hh"

 #include "serializer.hh"
@@ -40,191 +40,41 @@ class abstract_type;
 class collection_type_impl;
 class atomic_cell_or_collection;

-using atomic_cell_value = managed_bytes;
-template <mutable_view is_mutable>
-using atomic_cell_value_basic_view = managed_bytes_basic_view<is_mutable>;
-using atomic_cell_value_view = atomic_cell_value_basic_view<mutable_view::no>;
-using atomic_cell_value_mutable_view = atomic_cell_value_basic_view<mutable_view::yes>;
-
-template <typename T>
-requires std::is_trivial_v<T>
-static void set_field(atomic_cell_value_mutable_view& out, unsigned offset, T val) {
-    auto out_view = managed_bytes_mutable_view(out);
-    out_view.remove_prefix(offset);
-    write<T>(out_view, val);
-}
-
-template <typename T>
-requires std::is_trivial_v<T>
-static void set_field(atomic_cell_value& out, unsigned offset, T val) {
-    auto out_view = atomic_cell_value_mutable_view(out);
-    set_field(out_view, offset, val);
-}
-
-template <FragmentRange Buffer>
-static void set_value(managed_bytes& b, unsigned value_offset, const Buffer& value) {
-    auto v = managed_bytes_mutable_view(b).substr(value_offset, value.size_bytes());
-    for (auto frag : value) {
-        write_fragmented(v, single_fragmented_view(frag));
-    }
-}
-
-template <typename T, FragmentedView Input>
-requires std::is_trivial_v<T>
-static T get_field(Input in, unsigned offset = 0) {
-    in.remove_prefix(offset);
-    return read_simple<T>(in);
-}
-
-/*
- * Represents atomic cell layout. Works on serialized form.
- *
- * Layout:
- *
- *  <live>  := <int8_t:flags><int64_t:timestamp>(<int64_t:expiry><int32_t:ttl>)?<value>
- *  <dead>  := <int8_t:    0><int64_t:timestamp><int64_t:deletion_time>
- */
-class atomic_cell_type final {
-private:
-    static constexpr int8_t LIVE_FLAG = 0x01;
-    static constexpr int8_t EXPIRY_FLAG = 0x02; // When present, expiry field is present. Set only for live cells
-    static constexpr int8_t COUNTER_UPDATE_FLAG = 0x08; // Cell is a counter update.
-    static constexpr unsigned flags_size = 1;
-    static constexpr unsigned timestamp_offset = flags_size;
-    static constexpr unsigned timestamp_size = 8;
-    static constexpr unsigned expiry_offset = timestamp_offset + timestamp_size;
-    static constexpr unsigned expiry_size = 8;
-    static constexpr unsigned deletion_time_offset = timestamp_offset + timestamp_size;
-    static constexpr unsigned deletion_time_size = 8;
-    static constexpr unsigned ttl_offset = expiry_offset + expiry_size;
-    static constexpr unsigned ttl_size = 4;
-    friend class counter_cell_builder;
-private:
-    static bool is_counter_update(atomic_cell_value_view cell) {
-        return cell.front() & COUNTER_UPDATE_FLAG;
-    }
-    static bool is_live(atomic_cell_value_view cell) {
-        return cell.front() & LIVE_FLAG;
-    }
-    static bool is_live_and_has_ttl(atomic_cell_value_view cell) {
-        return cell.front() & EXPIRY_FLAG;
-    }
-    static bool is_dead(atomic_cell_value_view cell) {
-        return !is_live(cell);
-    }
-    // Can be called on live and dead cells
-    static api::timestamp_type timestamp(atomic_cell_value_view cell) {
-        return get_field<api::timestamp_type>(cell, timestamp_offset);
-    }
-    static void set_timestamp(atomic_cell_value_mutable_view& cell, api::timestamp_type ts) {
-        set_field(cell, timestamp_offset, ts);
-    }
-    // Can be called on live cells only
-private:
-    template <mutable_view is_mutable>
-    static managed_bytes_basic_view<is_mutable> do_get_value(managed_bytes_basic_view<is_mutable> cell) {
-        auto expiry_field_size = bool(cell.front() & EXPIRY_FLAG) * (expiry_size + ttl_size);
-        auto value_offset = flags_size + timestamp_size + expiry_field_size;
-        cell.remove_prefix(value_offset);
-        return cell;
-    }
-public:
-    static atomic_cell_value_view value(managed_bytes_view cell) {
-        return do_get_value(cell);
-    }
-    static atomic_cell_value_mutable_view value(managed_bytes_mutable_view cell) {
-        return do_get_value(cell);
-    }
-    // Can be called on live counter update cells only
-    static int64_t counter_update_value(atomic_cell_value_view cell) {
-        return get_field<int64_t>(cell, flags_size + timestamp_size);
-    }
-    // Can be called only when is_dead() is true.
-    static gc_clock::time_point deletion_time(atomic_cell_value_view cell) {
-        assert(is_dead(cell));
-        return gc_clock::time_point(gc_clock::duration(get_field<int64_t>(cell, deletion_time_offset)));
-    }
-    // Can be called only when is_live_and_has_ttl() is true.
-    static gc_clock::time_point expiry(atomic_cell_value_view cell) {
-        assert(is_live_and_has_ttl(cell));
-        auto expiry = get_field<int64_t>(cell, expiry_offset);
-        return gc_clock::time_point(gc_clock::duration(expiry));
-    }
-    // Can be called only when is_live_and_has_ttl() is true.
-    static gc_clock::duration ttl(atomic_cell_value_view cell) {
-        assert(is_live_and_has_ttl(cell));
-        return gc_clock::duration(get_field<int32_t>(cell, ttl_offset));
-    }
-    static managed_bytes make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
-        managed_bytes b(managed_bytes::initialized_later(), flags_size + timestamp_size + deletion_time_size);
-        b[0] = 0;
-        set_field(b, timestamp_offset, timestamp);
-        set_field(b, deletion_time_offset, static_cast<int64_t>(deletion_time.time_since_epoch().count()));
-        return b;
-    }
-    template <FragmentRange Buffer>
-    static managed_bytes make_live(api::timestamp_type timestamp, const Buffer& value) {
-        auto value_offset = flags_size + timestamp_size;
-        managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size_bytes());
-        b[0] = LIVE_FLAG;
-        set_field(b, timestamp_offset, timestamp);
-        set_value(b, value_offset, value);
-        return b;
-    }
-    static managed_bytes make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
-        auto value_offset = flags_size + timestamp_size;
-        managed_bytes b(managed_bytes::initialized_later(), value_offset + sizeof(value));
-        b[0] = LIVE_FLAG | COUNTER_UPDATE_FLAG;
-        set_field(b, timestamp_offset, timestamp);
-        set_field(b, value_offset, value);
-        return b;
-    }
-    template <FragmentRange Buffer>
-    static managed_bytes make_live(api::timestamp_type timestamp, const Buffer& value, gc_clock::time_point expiry, gc_clock::duration ttl) {
-        auto value_offset = flags_size + timestamp_size + expiry_size + ttl_size;
-        managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size_bytes());
-        b[0] = EXPIRY_FLAG | LIVE_FLAG;
-        set_field(b, timestamp_offset, timestamp);
-        set_field(b, expiry_offset, static_cast<int64_t>(expiry.time_since_epoch().count()));
-        set_field(b, ttl_offset, static_cast<int32_t>(ttl.count()));
-        set_value(b, value_offset, value);
-        return b;
-    }
-    static managed_bytes make_live_uninitialized(api::timestamp_type timestamp, size_t size) {
-        auto value_offset = flags_size + timestamp_size;
-        managed_bytes b(managed_bytes::initialized_later(), value_offset + size);
-        b[0] = LIVE_FLAG;
-        set_field(b, timestamp_offset, timestamp);
-        return b;
-    }
-    template <mutable_view is_mutable>
-    friend class basic_atomic_cell_view;
-    friend class atomic_cell;
-};
+using atomic_cell_value_view = data::value_view;
+using atomic_cell_value_mutable_view = data::value_mutable_view;

 /// View of an atomic cell
 template<mutable_view is_mutable>
 class basic_atomic_cell_view {
 protected:
-    managed_bytes_basic_view<is_mutable> _view;
-	friend class atomic_cell;
+    data::cell::basic_atomic_cell_view<is_mutable> _view;
+    friend class atomic_cell;
+public:
+    using pointer_type = std::conditional_t<is_mutable == mutable_view::no, const uint8_t*, uint8_t*>;
 protected:
-    void set_view(managed_bytes_basic_view<is_mutable> v) {
-        _view = v;
-    }
-    basic_atomic_cell_view() = default;
-    explicit basic_atomic_cell_view(managed_bytes_basic_view<is_mutable> v) : _view(std::move(v)) { }
+    explicit basic_atomic_cell_view(data::cell::basic_atomic_cell_view<is_mutable> v)
+        : _view(std::move(v)) { }
+
+    basic_atomic_cell_view(const data::type_info& ti, pointer_type ptr)
+        : _view(data::cell::make_atomic_cell_view(ti, ptr))
+    { }
+
    friend class atomic_cell_or_collection;
 public:
    operator basic_atomic_cell_view<mutable_view::no>() const noexcept {
        return basic_atomic_cell_view<mutable_view::no>(_view);
    }

+    void swap(basic_atomic_cell_view& other) noexcept {
+        using std::swap;
+        swap(_view, other._view);
+    }
+
    bool is_counter_update() const {
-        return atomic_cell_type::is_counter_update(_view);
+        return _view.is_counter_update();
    }
    bool is_live() const {
-        return atomic_cell_type::is_live(_view);
+        return _view.is_live();
    }
    bool is_live(tombstone t, bool is_counter) const {
        return is_live() && !is_covered_by(t, is_counter);
@@ -233,72 +83,73 @@ public:
        return is_live() && !is_covered_by(t, is_counter) && !has_expired(now);
    }
    bool is_live_and_has_ttl() const {
-        return atomic_cell_type::is_live_and_has_ttl(_view);
+        return _view.is_expiring();
    }
    bool is_dead(gc_clock::time_point now) const {
-        return atomic_cell_type::is_dead(_view) || has_expired(now);
+        return !is_live() || has_expired(now);
    }
    bool is_covered_by(tombstone t, bool is_counter) const {
        return timestamp() <= t.timestamp || (is_counter && t.timestamp != api::missing_timestamp);
    }
    // Can be called on live and dead cells
    api::timestamp_type timestamp() const {
-        return atomic_cell_type::timestamp(_view);
+        return _view.timestamp();
    }
    void set_timestamp(api::timestamp_type ts) {
-        atomic_cell_type::set_timestamp(_view, ts);
+        _view.set_timestamp(ts);
    }
    // Can be called on live cells only
-    atomic_cell_value_basic_view<is_mutable> value() const {
-        return atomic_cell_type::value(_view);
+    data::basic_value_view<is_mutable> value() const {
+        return _view.value();
    }
    // Can be called on live cells only
    size_t value_size() const {
-        return atomic_cell_type::value(_view).size();
+        return _view.value_size();
    }
    bool is_value_fragmented() const {
-        return _view.is_fragmented();
+        return _view.is_value_fragmented();
    }
    // Can be called on live counter update cells only
    int64_t counter_update_value() const {
-        return atomic_cell_type::counter_update_value(_view);
+        return _view.counter_update_value();
    }
    // Can be called only when is_dead(gc_clock::time_point)
    gc_clock::time_point deletion_time() const {
-        return !is_live() ? atomic_cell_type::deletion_time(_view) : expiry() - ttl();
+        return !is_live() ? _view.deletion_time() : expiry() - ttl();
    }
    // Can be called only when is_live_and_has_ttl()
    gc_clock::time_point expiry() const {
-        return atomic_cell_type::expiry(_view);
+        return _view.expiry();
    }
    // Can be called only when is_live_and_has_ttl()
    gc_clock::duration ttl() const {
-        return atomic_cell_type::ttl(_view);
+        return _view.ttl();
    }
    // Can be called on live and dead cells
    bool has_expired(gc_clock::time_point now) const {
        return is_live_and_has_ttl() && expiry() <= now;
    }

-    managed_bytes_view serialize() const {
-        return _view;
+    bytes_view serialize() const {
+        return _view.serialize();
    }
 };

 class atomic_cell_view final : public basic_atomic_cell_view<mutable_view::no> {
-    atomic_cell_view(managed_bytes_view v)
-        : basic_atomic_cell_view(v) {}
+    atomic_cell_view(const data::type_info& ti, const uint8_t* data)
+        : basic_atomic_cell_view<mutable_view::no>(ti, data) {}

    template<mutable_view is_mutable>
-    atomic_cell_view(basic_atomic_cell_view<is_mutable> view)
-        : basic_atomic_cell_view<mutable_view::no>(view) {}
+    atomic_cell_view(data::cell::basic_atomic_cell_view<is_mutable> view)
+        : basic_atomic_cell_view<mutable_view::no>(view) { }
    friend class atomic_cell;
 public:
-    static atomic_cell_view from_bytes(const abstract_type& t, managed_bytes_view v) {
-        return atomic_cell_view(v);
+    static atomic_cell_view from_bytes(const data::type_info& ti, const imr::utils::object<data::cell::structure>& data) {
+        return atomic_cell_view(ti, data.get());
    }
-    static atomic_cell_view from_bytes(const abstract_type& t, bytes_view v) {
-        return atomic_cell_view(managed_bytes_view(v));
+
+    static atomic_cell_view from_bytes(const data::type_info& ti, bytes_view bv) {
+        return atomic_cell_view(ti, reinterpret_cast<const uint8_t*>(bv.begin()));
    }

    friend std::ostream& operator<<(std::ostream& os, const atomic_cell_view& acv);
@@ -313,11 +164,11 @@ public:
 };

 class atomic_cell_mutable_view final : public basic_atomic_cell_view<mutable_view::yes> {
-    atomic_cell_mutable_view(managed_bytes_mutable_view data)
-        : basic_atomic_cell_view(data) {}
+    atomic_cell_mutable_view(const data::type_info& ti, uint8_t* data)
+        : basic_atomic_cell_view<mutable_view::yes>(ti, data) {}
 public:
-    static atomic_cell_mutable_view from_bytes(const abstract_type& t, managed_bytes_mutable_view v) {
-        return atomic_cell_mutable_view(v);
+    static atomic_cell_mutable_view from_bytes(const data::type_info& ti, imr::utils::object<data::cell::structure>& data) {
+        return atomic_cell_mutable_view(ti, data.get());
    }

    friend class atomic_cell;
@@ -326,31 +177,26 @@ public:
 using atomic_cell_ref = atomic_cell_mutable_view;

 class atomic_cell final : public basic_atomic_cell_view<mutable_view::yes> {
-    managed_bytes _data;
-    atomic_cell(managed_bytes b) : _data(std::move(b))  {
-        set_view(_data);
-    }
-
+    using imr_object_type = imr::utils::object<data::cell::structure>;
+    imr_object_type _data;
+    atomic_cell(const data::type_info& ti, imr::utils::object<data::cell::structure>&& data)
+        : basic_atomic_cell_view<mutable_view::yes>(ti, data.get()), _data(std::move(data)) {}
 public:
    class collection_member_tag;
    using collection_member = bool_class<collection_member_tag>;

-    atomic_cell(atomic_cell&& o) noexcept : _data(std::move(o._data)) {
-        set_view(_data);
-    }
+    atomic_cell(atomic_cell&&) = default;
    atomic_cell& operator=(const atomic_cell&) = delete;
-    atomic_cell& operator=(atomic_cell&& o) {
-        _data = std::move(o._data);
-        set_view(_data);
-        return *this;
+    atomic_cell& operator=(atomic_cell&&) = default;
+    void swap(atomic_cell& other) noexcept {
+        basic_atomic_cell_view<mutable_view::yes>::swap(other);
+        _data.swap(other._data);
    }
-    operator atomic_cell_view() const { return atomic_cell_view(managed_bytes_view(_data)); }
+    operator atomic_cell_view() const { return atomic_cell_view(_view); }
    atomic_cell(const abstract_type& t, atomic_cell_view other);
    static atomic_cell make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time);
    static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value,
                                 collection_member = collection_member::no);
-    static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, managed_bytes_view value,
-                                 collection_member = collection_member::no);
    static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,
                                 collection_member = collection_member::no);
    static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,
@@ -362,8 +208,6 @@ public:
    static atomic_cell make_live_counter_update(api::timestamp_type timestamp, int64_t value);
    static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, bytes_view value,
        gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
-    static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, managed_bytes_view value,
-        gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
    static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,
        gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
    static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,
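The liveness predicates that `basic_atomic_cell_view` keeps after this change (`is_dead`, `has_expired`, `deletion_time`) can be read in isolation. What follows is a minimal sketch under stated assumptions, not the Scylla implementation: `toy_cell` and its fields are hypothetical stand-ins for the IMR-backed view, and `std::chrono::seconds` stands in for `gc_clock`.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

// Hypothetical stand-ins; the real code uses gc_clock and an IMR-backed view.
using time_point = std::chrono::seconds;
using duration = std::chrono::seconds;

struct toy_cell {
    bool live = false;
    time_point deletion{};                // meaningful for dead cells only
    std::optional<time_point> expiry_at;  // set for live cells that carry a TTL
    duration ttl_len{};

    bool is_live() const { return live; }
    bool is_live_and_has_ttl() const { return live && expiry_at.has_value(); }
    bool has_expired(time_point now) const {
        return is_live_and_has_ttl() && *expiry_at <= now;
    }
    // Same shape as the rewritten predicate: "not live" or "TTL elapsed".
    bool is_dead(time_point now) const { return !is_live() || has_expired(now); }
    // Only meaningful when is_dead(now) holds.
    time_point deletion_time() const {
        return !is_live() ? deletion : *expiry_at - ttl_len;
    }
};
```

In this sketch an expired live cell reports the moment it was written (`expiry - ttl`) as its deletion time, which is the same arithmetic the hunk keeps for `deletion_time()`.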
--- a/atomic_cell_hash.hh
+++ b/atomic_cell_hash.hh
@@ -52,7 +52,9 @@ struct appending_hash<atomic_cell_view> {
        feed_hash(h, cell.timestamp());
        if (cell.is_live()) {
            if (cdef.is_counter()) {
-                ::feed_hash(h, counter_cell_view(cell));
+                counter_cell_view::with_linearized(cell, [&] (counter_cell_view ccv) {
+                    ::feed_hash(h, ccv);
+                });
                return;
            }
            if (cell.is_live_and_has_ttl()) {
--- a/atomic_cell_or_collection.hh
+++ b/atomic_cell_or_collection.hh
@@ -26,14 +26,20 @@
 #include "schema.hh"
 #include "hashing.hh"

+#include "imr/utils.hh"
+
 // A variant type that can hold either an atomic_cell, or a serialized collection.
 // Which type is stored is determined by the schema.
-// Has an "empty" state.
-// Objects moved-from are left in an empty state.
 class atomic_cell_or_collection final {
-    managed_bytes _data;
+    // FIXME: This has made us lose the small-buffer optimisation. Unfortunately,
+    // due to the changed cell format it would be less effective now anyway.
+    // Measure the actual impact first: any attempt to fix this will become
+    // irrelevant once rows are converted to the IMR as well, so we may be
+    // able to live with it as is.
+    using imr_object_type = imr::utils::object<data::cell::structure>;
+    imr_object_type _data;
 private:
-    atomic_cell_or_collection(managed_bytes&& data) : _data(std::move(data)) {}
+    atomic_cell_or_collection(imr::utils::object<data::cell::structure>&& data) : _data(std::move(data)) {}
 public:
    atomic_cell_or_collection() = default;
    atomic_cell_or_collection(atomic_cell_or_collection&&) = default;
@@ -43,16 +49,20 @@ public:
    atomic_cell_or_collection(atomic_cell ac) : _data(std::move(ac._data)) {}
    atomic_cell_or_collection(const abstract_type& at, atomic_cell_view acv);
    static atomic_cell_or_collection from_atomic_cell(atomic_cell data) { return { std::move(data._data) }; }
-    atomic_cell_view as_atomic_cell(const column_definition& cdef) const { return atomic_cell_view::from_bytes(*cdef.type, _data); }
-    atomic_cell_mutable_view as_mutable_atomic_cell(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(*cdef.type, _data); }
+    atomic_cell_view as_atomic_cell(const column_definition& cdef) const { return atomic_cell_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
+    atomic_cell_ref as_atomic_cell_ref(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
+    atomic_cell_mutable_view as_mutable_atomic_cell(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
    atomic_cell_or_collection(collection_mutation cm) : _data(std::move(cm._data)) { }
    atomic_cell_or_collection copy(const abstract_type&) const;
    explicit operator bool() const {
-        return !_data.empty();
+        return bool(_data);
    }
    static constexpr bool can_use_mutable_view() {
        return true;
    }
+    void swap(atomic_cell_or_collection& other) noexcept {
+        _data.swap(other._data);
+    }
    static atomic_cell_or_collection from_collection_mutation(collection_mutation data) { return std::move(data._data); }
    collection_mutation_view as_collection_mutation() const;
    bytes_view serialize() const;
@@ -72,3 +82,12 @@ public:
    };
    friend std::ostream& operator<<(std::ostream&, const printer&);
 };
+
+namespace std {
+
+inline void swap(atomic_cell_or_collection& a, atomic_cell_or_collection& b) noexcept
+{
+    a.swap(b);
+}
+
+}
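The member `swap` plus free `swap` pairing above can be sketched on its own. This is an illustrative example only: `demo::holder` and `exchange_values` are hypothetical names, and the free function is placed in the type's own namespace (an assumption of this sketch, differing from the hunk above) so that the two-step `using std::swap; swap(a, b);` idiom seen in `basic_atomic_cell_view::swap` finds it by argument-dependent lookup.

```cpp
#include <cassert>
#include <utility>
#include <vector>

namespace demo {

// Hypothetical owner type with a noexcept member swap.
class holder {
    std::vector<int> _data;
public:
    explicit holder(std::vector<int> d) : _data(std::move(d)) {}
    void swap(holder& other) noexcept { _data.swap(other._data); }
    const std::vector<int>& data() const { return _data; }
};

// Free function next to the type; unqualified swap(a, b) finds it via ADL.
inline void swap(holder& a, holder& b) noexcept { a.swap(b); }

} // namespace demo

// Generic code: fall back to std::swap, but prefer a type-specific overload.
template <typename T>
void exchange_values(T& a, T& b) {
    using std::swap;
    swap(a, b);
}
```

With this arrangement, generic callers pick up the cheap member swap without naming the type.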
--- a/auth/common.cc
+++ b/auth/common.cc
@@ -82,7 +82,7 @@ static future<> create_metadata_table_if_missing_impl(
    b.set_uuid(uuid);
    schema_ptr table = b.build();
    return ignore_existing([&mm, table = std::move(table)] () {
-        return mm.announce_new_column_family(table);
+        return mm.announce_new_column_family(table, false);
    });
 }

--- a/auth/password_authenticator.cc
+++ b/auth/password_authenticator.cc
@@ -66,6 +66,7 @@ constexpr std::string_view password_authenticator_name("org.apache.cassandra.aut

 // name of the hash column.
 static constexpr std::string_view SALTED_HASH = "salted_hash";
+static constexpr std::string_view OPTIONS = "options";
 static constexpr std::string_view DEFAULT_USER_NAME = meta::DEFAULT_SUPERUSER_NAME;
 static const sstring DEFAULT_USER_PASSWORD = sstring(meta::DEFAULT_SUPERUSER_NAME);

@@ -203,11 +204,11 @@ bool password_authenticator::require_authentication() const {
 }

 authentication_option_set password_authenticator::supported_options() const {
-    return authentication_option_set{authentication_option::password};
+    return authentication_option_set{authentication_option::password, authentication_option::options};
 }

 authentication_option_set password_authenticator::alterable_options() const {
-    return authentication_option_set{authentication_option::password};
+    return authentication_option_set{authentication_option::password, authentication_option::options};
 }

 future<authenticated_user> password_authenticator::authenticate(
@@ -262,21 +263,46 @@ future<authenticated_user> password_authenticator::authenticate(
    });
 }

+future<> password_authenticator::maybe_update_custom_options(std::string_view role_name, const authentication_options& options) const {
+    static const sstring query = format("UPDATE {} SET {} = ? WHERE {} = ?",
+            meta::roles_table::qualified_name,
+            OPTIONS,
+            meta::roles_table::role_col_name);
+
+    if (!options.options) {
+        return make_ready_future<>();
+    }
+
+    std::vector<std::pair<data_value, data_value>> entries;
+    for (const auto& entry : *options.options) {
+        entries.push_back({data_value(entry.first), data_value(entry.second)});
+    }
+    auto map_value = make_map_value(map_type_impl::get_instance(utf8_type, utf8_type, false), entries);
+
+    return _qp.execute_internal(
+            query,
+            consistency_for_user(role_name),
+            internal_distributed_query_state(),
+            {std::move(map_value), sstring(role_name)}).discard_result();
+}
+
 future<> password_authenticator::create(std::string_view role_name, const authentication_options& options) const {
    if (!options.password) {
-        return make_ready_future<>();
+        return maybe_update_custom_options(role_name, options);
    }

    return _qp.execute_internal(
            update_row_query(),
            consistency_for_user(role_name),
            internal_distributed_query_state(),
-            {passwords::hash(*options.password, rng_for_salt), sstring(role_name)}).discard_result();
+            {passwords::hash(*options.password, rng_for_salt), sstring(role_name)}).discard_result().then([this, role_name, &options] {
+                return maybe_update_custom_options(role_name, options);
+            });
 }

 future<> password_authenticator::alter(std::string_view role_name, const authentication_options& options) const {
    if (!options.password) {
-        return make_ready_future<>();
+        return maybe_update_custom_options(role_name, options);
    }

    static const sstring query = format("UPDATE {} SET {} = ? WHERE {} = ?",
@@ -288,7 +314,9 @@ future<> password_authenticator::alter(std::string_view role_name, const authent
            query,
            consistency_for_user(role_name),
            internal_distributed_query_state(),
-            {passwords::hash(*options.password, rng_for_salt), sstring(role_name)}).discard_result();
+            {passwords::hash(*options.password, rng_for_salt), sstring(role_name)}).discard_result().then([this, role_name, &options] {
+                return maybe_update_custom_options(role_name, options);
+            });
 }

 future<> password_authenticator::drop(std::string_view name) const {
@@ -304,7 +332,22 @@ future<> password_authenticator::drop(std::string_view name) const {
 }

 future<custom_options> password_authenticator::query_custom_options(std::string_view role_name) const {
-    return make_ready_future<custom_options>();
+    static const sstring query = format("SELECT {} FROM {} WHERE {} = ?",
+            OPTIONS,
+            meta::roles_table::qualified_name,
+            meta::roles_table::role_col_name);
+
+    return _qp.execute_internal(
+            query, consistency_for_user(role_name),
+            internal_distributed_query_state(),
+            {sstring(role_name)}).then([](::shared_ptr<cql3::untyped_result_set> rs) {
+        custom_options opts;
+        const auto& row = rs->one();
+        if (row.has(OPTIONS)) {
+            row.get_map_data<sstring, sstring>(OPTIONS, std::inserter(opts, opts.end()), utf8_type, utf8_type);
+        }
+        return opts;
+    });
 }

 const resource_set& password_authenticator::protected_resources() const {
--- a/auth/password_authenticator.hh
+++ b/auth/password_authenticator.hh
@@ -94,6 +94,8 @@ public:
    virtual ::shared_ptr<sasl_challenge> new_sasl_challenge() const override;

 private:
+    future<> maybe_update_custom_options(std::string_view role_name, const authentication_options& options) const;
+
    bool legacy_metadata_exists() const;

    future<> migrate_legacy_metadata() const;
--- a/auth/roles-metadata.cc
+++ b/auth/roles-metadata.cc
@@ -43,7 +43,8 @@ std::string_view creation_query() {
            "  can_login boolean,"
            "  is_superuser boolean,"
            "  member_of set<text>,"
-            "  salted_hash text"
+            "  salted_hash text,"
+            "  options frozen<map<text, text>>,"
            ")",
            qualified_name,
            role_col_name);
--- a/auth/service.cc
+++ b/auth/service.cc
@@ -154,7 +154,7 @@ future<> service::create_keyspace_if_missing(::service::migration_manager& mm) c

        // We use min_timestamp so that default keyspace metadata will lose to any manual adjustments.
        // See issue #2129.
-        return mm.announce_new_keyspace(ksm, api::min_timestamp);
+        return mm.announce_new_keyspace(ksm, api::min_timestamp, false);
    }

    return make_ready_future<>();
--- a/bytes.hh
+++ b/bytes.hh
@@ -28,7 +28,6 @@
 #include <iosfwd>
 #include <functional>
 #include "utils/mutable_view.hh"
-#include <xxhash.h>

 using bytes = basic_sstring<int8_t, uint32_t, 31, false>;
 using bytes_view = std::basic_string_view<int8_t>;
@@ -36,10 +35,6 @@ using bytes_mutable_view = basic_mutable_view<bytes_view::value_type>;
 using bytes_opt = std::optional<bytes>;
 using sstring_view = std::string_view;

-inline bytes to_bytes(bytes&& b) {
-    return std::move(b);
-}
-
 inline sstring_view to_sstring_view(bytes_view view) {
    return {reinterpret_cast<const char*>(view.data()), view.size()};
 }
@@ -48,6 +43,17 @@ inline bytes_view to_bytes_view(sstring_view view) {
    return {reinterpret_cast<const int8_t*>(view.data()), view.size()};
 }

+namespace std {
+
+template <>
+struct hash<bytes_view> {
+    size_t operator()(bytes_view v) const {
+        return hash<sstring_view>()({reinterpret_cast<const char*>(v.begin()), v.size()});
+    }
+};
+
+}
+
 struct fmt_hex {
    bytes_view& v;
    fmt_hex(bytes_view& v) noexcept : v(v) {}
@@ -88,30 +94,6 @@ struct appending_hash<bytes_view> {
    }
 };

-struct bytes_view_hasher : public hasher {
-    XXH64_state_t _state;
-    bytes_view_hasher(uint64_t seed = 0) noexcept {
-        XXH64_reset(&_state, seed);
-    }
-    void update(const char* ptr, size_t length) noexcept {
-        XXH64_update(&_state, ptr, length);
-    }
-    size_t finalize() {
-        return static_cast<size_t>(XXH64_digest(&_state));
-    }
-};
-
-namespace std {
-template <>
-struct hash<bytes_view> {
-    size_t operator()(bytes_view v) const {
-        bytes_view_hasher h;
-        appending_hash<bytes_view>{}(h, v);
-        return h.finalize();
-    }
-};
-} // namespace std
-
 inline int32_t compare_unsigned(bytes_view v1, bytes_view v2) {
  auto size = std::min(v1.size(), v2.size());
  if (size) {
--- a/bytes_ostream.hh
+++ b/bytes_ostream.hh
@@ -24,10 +24,9 @@
 #include <boost/range/iterator_range.hpp>

 #include "bytes.hh"
+#include <seastar/core/unaligned.hh>
 #include "hashing.hh"
 #include <seastar/core/simple-stream.hh>
-#include <concepts>
-
 /**
 * Utility for writing data into a buffer when its final size is not known up front.
 *
@@ -40,7 +39,7 @@ public:
    using size_type = bytes::size_type;
    using value_type = bytes::value_type;
    using fragment_type = bytes_view;
-    static constexpr size_type max_chunk_size() { return max_alloc_size() - sizeof(chunk); }
+    static constexpr size_type max_chunk_size() { return 128 * 1024; }
 private:
    static_assert(sizeof(value_type) == 1, "value_type is assumed to be one byte long");
    struct chunk {
@@ -60,7 +59,6 @@ private:
        void operator delete(void* ptr) { free(ptr); }
    };
    static constexpr size_type default_chunk_size{512};
-    static constexpr size_type max_alloc_size() { return 128 * 1024; }
 private:
    std::unique_ptr<chunk> _begin;
    chunk* _current;
@@ -134,15 +132,16 @@ private:
        return _current->size - _current->offset;
    }
    // Figure out next chunk size.
-    //   - must be enough for data_size + sizeof(chunk)
+    //   - must be enough for data_size
    //   - must be at least _initial_chunk_size
    //   - try to double each time to prevent too many allocations
-    //   - should not exceed max_alloc_size, unless data_size requires so
+    //   - do not exceed max_chunk_size
    size_type next_alloc_size(size_t data_size) const {
        auto next_size = _current
                ? _current->size * 2
                : _initial_chunk_size;
-        next_size = std::min(next_size, max_alloc_size());
+        next_size = std::min(next_size, max_chunk_size());
+        // FIXME: check for overflow?
        return std::max<size_type>(next_size, data_size + sizeof(chunk));
    }
    // Makes room for a contiguous region of given size.
@@ -234,9 +233,9 @@ public:
    };

    // Returns a place holder for a value to be written later.
-    template <std::integral T>
+    template <typename T>
    inline
-    place_holder<T>
+    std::enable_if_t<std::is_fundamental<T>::value, place_holder<T>>
    write_place_holder() {
        return place_holder<T>{alloc(sizeof(T))};
    }
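The chunk growth policy in `next_alloc_size` after this change (double the previous chunk, cap at `max_chunk_size()`, but never return less than the request plus the chunk header) can be sketched as a free function. This is a simplified illustration, not the `bytes_ostream` code: the function takes the current chunk size as a parameter, and `chunk_header_size` is an assumed stand-in for `sizeof(chunk)`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Values mirror the hunk above, except chunk_header_size, which is assumed.
constexpr uint32_t max_chunk_size = 128 * 1024;
constexpr uint32_t initial_chunk_size = 512;
constexpr uint32_t chunk_header_size = 16;  // hypothetical sizeof(chunk)

// current_size == 0 means no chunk has been allocated yet.
uint32_t next_alloc_size(uint32_t current_size, size_t data_size) {
    uint32_t next = current_size ? current_size * 2 : initial_chunk_size;
    next = std::min(next, max_chunk_size);
    // The request always wins, so a large write can still exceed the cap.
    return std::max<uint32_t>(next, static_cast<uint32_t>(data_size + chunk_header_size));
}
```

Note that, as in the hunk, the cap only limits the doubling; a single request larger than the cap is still honoured in full.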
--- a/cache_flat_mutation_reader.hh
+++ b/cache_flat_mutation_reader.hh
@@ -102,7 +102,7 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
    // Points to the underlying reader conforming to _schema,
    // either to *_underlying_holder or _read_context->underlying().underlying().
    flat_mutation_reader* _underlying = nullptr;
-    flat_mutation_reader_opt _underlying_holder;
+    std::optional<flat_mutation_reader> _underlying_holder;

    future<> do_fill_buffer(db::timeout_clock::time_point);
    future<> ensure_underlying(db::timeout_clock::time_point);
@@ -112,7 +112,6 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
    void move_to_next_range();
    void move_to_range(query::clustering_row_ranges::const_iterator);
    void move_to_next_entry();
-    void maybe_drop_last_entry() noexcept;
    void add_to_buffer(const partition_snapshot_row_cursor&);
    void add_clustering_row_to_buffer(mutation_fragment&&);
    void add_to_buffer(range_tombstone&&);
@@ -123,7 +122,6 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
    bool can_populate() const;
    // Marks the range between _last_row (exclusive) and _next_row (exclusive) as continuous,
    // provided that the underlying reader still matches the latest version of the partition.
-    // Invalidates _last_row.
    void maybe_update_continuity();
    // Tries to ensure that the lower bound of the current population range exists.
    // Returns false if it failed and range cannot be populated.
@@ -165,12 +163,11 @@ public:
    cache_flat_mutation_reader(const cache_flat_mutation_reader&) = delete;
    cache_flat_mutation_reader(cache_flat_mutation_reader&&) = delete;
    virtual future<> fill_buffer(db::timeout_clock::time_point timeout) override;
-    virtual future<> next_partition() override {
+    virtual void next_partition() override {
        clear_buffer_to_next_partition();
        if (is_buffer_empty()) {
            _end_of_stream = true;
        }
-        return make_ready_future<>();
    }
    virtual future<> fast_forward_to(const dht::partition_range&, db::timeout_clock::time_point timeout) override {
        clear_buffer();
@@ -267,9 +264,6 @@ future<> cache_flat_mutation_reader::do_fill_buffer(db::timeout_clock::time_poin
        }
        _state = state::reading_from_underlying;
        _population_range_starts_before_all_rows = _lower_bound.is_before_all_clustered_rows(*_schema);
-        if (!_read_context->partition_exists()) {
-            return read_from_underlying(timeout);
-        }
        auto end = _next_row_in_range ? position_in_partition(_next_row.position())
                                      : position_in_partition(_upper_bound);
        return _underlying->fast_forward_to(position_range{_lower_bound, std::move(end)}, timeout).then([this, timeout] {
@@ -334,6 +328,7 @@ future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::tim
                }
                if (_next_row_in_range) {
                    maybe_update_continuity();
+                    _last_row = _next_row;
                    add_to_buffer(_next_row);
                    try {
                        move_to_next_entry();
@@ -346,14 +341,14 @@ future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::tim
                    if (no_clustering_row_between(*_schema, _upper_bound, _next_row.position())) {
                        this->maybe_update_continuity();
                    } else if (can_populate()) {
-                        rows_entry::tri_compare cmp(*_schema);
+                        rows_entry::compare less(*_schema);
                        auto& rows = _snp->version()->partition().clustered_rows();
                        if (query::is_single_row(*_schema, *_ck_ranges_curr)) {
                            with_allocator(_snp->region().allocator(), [&] {
                                auto e = alloc_strategy_unique_ptr<rows_entry>(
                                    current_allocator().construct<rows_entry>(_ck_ranges_curr->start()->value()));
                                // Use _next_row iterator only as a hint, because there could be insertions after _upper_bound.
-                                auto insert_result = rows.insert_before_hint(_next_row.get_iterator_in_latest_version(), *e, cmp);
+                                auto insert_result = rows.insert_check(_next_row.get_iterator_in_latest_version(), *e, less);
                                auto inserted = insert_result.second;
                                auto it = insert_result.first;
                                if (inserted) {
@@ -369,7 +364,7 @@ future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::tim
                                auto e = alloc_strategy_unique_ptr<rows_entry>(
                                    current_allocator().construct<rows_entry>(*_schema, _upper_bound, is_dummy::yes, is_continuous::yes));
                                // Use _next_row iterator only as a hint, because there could be insertions after _upper_bound.
-                                auto insert_result = rows.insert_before_hint(_next_row.get_iterator_in_latest_version(), *e, cmp);
+                                auto insert_result = rows.insert_check(_next_row.get_iterator_in_latest_version(), *e, less);
                                auto inserted = insert_result.second;
                                if (inserted) {
                                    clogger.trace("csm {}: inserted dummy at {}", fmt::ptr(this), _upper_bound);
@@ -379,7 +374,6 @@ future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::tim
                                    clogger.trace("csm {}: mark {} as continuous", fmt::ptr(this), insert_result.first->position());
                                    insert_result.first->set_continuous(true);
                                }
-                                maybe_drop_last_entry();
                            });
                        }
                    } else {
@@ -410,12 +404,12 @@ bool cache_flat_mutation_reader::ensure_population_lower_bound() {
    if (!_last_row.is_in_latest_version()) {
        with_allocator(_snp->region().allocator(), [&] {
            auto& rows = _snp->version()->partition().clustered_rows();
-            rows_entry::tri_compare cmp(*_schema);
+            rows_entry::compare less(*_schema);
            // FIXME: Avoid the copy by inserting an incomplete clustering row
            auto e = alloc_strategy_unique_ptr<rows_entry>(
                current_allocator().construct<rows_entry>(*_schema, *_last_row));
            e->set_continuous(false);
-            auto insert_result = rows.insert_before_hint(rows.end(), *e, cmp);
+            auto insert_result = rows.insert_check(rows.end(), *e, less);
            auto inserted = insert_result.second;
            if (inserted) {
                clogger.trace("csm {}: inserted lower bound dummy at {}", fmt::ptr(this), e->position());
@@ -433,7 +427,6 @@ void cache_flat_mutation_reader::maybe_update_continuity() {
        with_allocator(_snp->region().allocator(), [&] {
            rows_entry& e = _next_row.ensure_entry_in_latest().row;
            e.set_continuous(true);
-            maybe_drop_last_entry();
        });
    } else {
        _read_context->cache().on_mispopulate();
@@ -462,17 +455,17 @@ void cache_flat_mutation_reader::maybe_add_to_cache(const clustering_row& cr) {
    clogger.trace("csm {}: populate({})", fmt::ptr(this), clustering_row::printer(*_schema, cr));
    _lsa_manager.run_in_update_section_with_allocator([this, &cr] {
        mutation_partition& mp = _snp->version()->partition();
-        rows_entry::tri_compare cmp(*_schema);
+        rows_entry::compare less(*_schema);

        if (_read_context->digest_requested()) {
            cr.cells().prepare_hash(*_schema, column_kind::regular_column);
        }
        auto new_entry = alloc_strategy_unique_ptr<rows_entry>(
-            current_allocator().construct<rows_entry>(*_schema, cr.key(), cr.as_deletable_row()));
+            current_allocator().construct<rows_entry>(*_schema, cr.key(), cr.tomb(), cr.marker(), cr.cells()));
        new_entry->set_continuous(false);
        auto it = _next_row.iterators_valid() ? _next_row.get_iterator_in_latest_version()
-                                              : mp.clustered_rows().lower_bound(cr.key(), cmp);
-        auto insert_result = mp.clustered_rows().insert_before_hint(it, *new_entry, cmp);
+                                              : mp.clustered_rows().lower_bound(cr.key(), less);
+        auto insert_result = mp.clustered_rows().insert_check(it, *new_entry, less);
        if (insert_result.second) {
            _snp->tracker()->insert(*new_entry);
            new_entry.release();
@@ -528,6 +521,7 @@ void cache_flat_mutation_reader::copy_from_cache_to_buffer() {
    // We add the row to the buffer even when it's full.
    // This simplifies the code. For more info see #3139.
    if (_next_row_in_range) {
+        _last_row = _next_row;
        add_to_buffer(_next_row);
        move_to_next_entry();
    } else {
@@ -576,8 +570,8 @@ void cache_flat_mutation_reader::move_to_range(query::clustering_row_ranges::con
                clogger.trace("csm {}: insert dummy at {}", fmt::ptr(this), _lower_bound);
                auto it = with_allocator(_lsa_manager.region().allocator(), [&] {
                    auto& rows = _snp->version()->partition().clustered_rows();
-                    auto new_entry = alloc_strategy_unique_ptr<rows_entry>(current_allocator().construct<rows_entry>(*_schema, _lower_bound, is_dummy::yes, is_continuous::no));
-                    return rows.insert_before(_next_row.get_iterator_in_latest_version(), std::move(new_entry));
+                    auto new_entry = current_allocator().construct<rows_entry>(*_schema, _lower_bound, is_dummy::yes, is_continuous::no);
+                    return rows.insert_before(_next_row.get_iterator_in_latest_version(), *new_entry);
                });
                _snp->tracker()->insert(*it);
                _last_row = partition_snapshot_row_weakref(*_snp, it, true);
@@ -589,38 +583,6 @@ void cache_flat_mutation_reader::move_to_range(query::clustering_row_ranges::con
    }
 }

-// Drops _last_row entry when possible without changing logical contents of the partition.
-// Call only when _last_row and _next_row are valid.
-// Calling after ensure_population_lower_bound() is ok.
-// _next_row must have a greater position than _last_row.
-// Invalidates references but keeps the _next_row valid.
-inline
-void cache_flat_mutation_reader::maybe_drop_last_entry() noexcept {
-    // Drop dummy entry if it falls inside a continuous range.
-    // This prevents unnecessary dummy entries from accumulating in cache and slowing down scans.
-    //
-    // Eviction can happen only from oldest versions to preserve the continuity non-overlapping rule
-    // (See docs/design-notes/row_cache.md)
-    //
-    if (_last_row
-            && _last_row->dummy()
-            && _last_row->continuous()
-            && _snp->at_latest_version()
-            && _snp->at_oldest_version()) {
-
-        with_allocator(_snp->region().allocator(), [&] {
-            _last_row->on_evicted(_read_context->cache()._tracker);
-        });
-        _last_row = nullptr;
-
-        // There could be iterators pointing to _last_row, invalidate them
-        _snp->region().allocator().invalidate_references();
-
-        // Don't invalidate _next_row, move_to_next_entry() expects it to be still valid.
-        _next_row.force_valid();
-    }
-}
-
 // _next_row must be inside the range.
 inline
 void cache_flat_mutation_reader::move_to_next_entry() {
@@ -628,18 +590,14 @@ void cache_flat_mutation_reader::move_to_next_entry() {
    if (no_clustering_row_between(*_schema, _next_row.position(), _upper_bound)) {
        move_to_next_range();
    } else {
-        auto new_last_row = partition_snapshot_row_weakref(_next_row);
        if (!_next_row.next()) {
            move_to_end();
            return;
        }
-        _last_row = std::move(new_last_row);
        _next_row_in_range = !after_current_range(_next_row.position());
        clogger.trace("csm {}: next={}, cont={}, in_range={}", fmt::ptr(this), _next_row.position(), _next_row.continuous(), _next_row_in_range);
        if (!_next_row.continuous()) {
            start_reading_from_underlying();
-        } else {
-            maybe_drop_last_entry();
        }
    }
 }
@@ -660,13 +618,6 @@ void cache_flat_mutation_reader::add_to_buffer(const partition_snapshot_row_curs
    if (!row.dummy()) {
        _read_context->cache().on_row_hit();
        add_clustering_row_to_buffer(mutation_fragment(*_schema, _permit, row.row(_read_context->digest_requested())));
-    } else {
-        position_in_partition::less_compare less(*_schema);
-        if (less(_lower_bound, row.position())) {
-            _lower_bound = row.position();
-            _lower_bound_changed = true;
-        }
-        _read_context->cache()._tracker.on_dummy_row_hit();
    }
 }

--- a/canonical_mutation.cc
+++ b/canonical_mutation.cc
@@ -37,7 +37,7 @@
 #include "idl/mutation.dist.impl.hh"
 #include <iostream>

-canonical_mutation::canonical_mutation(bytes_ostream data)
+canonical_mutation::canonical_mutation(bytes data)
        : _data(std::move(data))
 { }

@@ -45,7 +45,8 @@ canonical_mutation::canonical_mutation(const mutation& m)
 {
    mutation_partition_serializer part_ser(*m.schema(), m.partition());

-    ser::writer_of_canonical_mutation<bytes_ostream> wr(_data);
+    bytes_ostream out;
+    ser::writer_of_canonical_mutation<bytes_ostream> wr(out);
    std::move(wr).write_table_id(m.schema()->id())
                 .write_schema_version(m.schema()->version())
                 .write_key(m.key())
@@ -53,6 +54,7 @@ canonical_mutation::canonical_mutation(const mutation& m)
                 .partition([&] (auto wr) {
                     part_ser.write(std::move(wr));
                 }).end_canonical_mutation();
+    _data = to_bytes(out.linearize());
 }

 utils::UUID canonical_mutation::column_family_id() const {
--- a/canonical_mutation.hh
+++ b/canonical_mutation.hh
@@ -32,9 +32,9 @@
 // Safe to access from other shards via const&.
 // Safe to pass serialized across nodes.
 class canonical_mutation {
-    bytes_ostream _data;
+    bytes _data;
 public:
-    explicit canonical_mutation(bytes_ostream);
+    explicit canonical_mutation(bytes);
    explicit canonical_mutation(const mutation&);

    canonical_mutation(canonical_mutation&&) = default;
@@ -51,7 +51,7 @@ public:

    utils::UUID column_family_id() const;

-    const bytes_ostream& representation() const { return _data; }
+    const bytes& representation() const { return _data; }

    friend std::ostream& operator<<(std::ostream& os, const canonical_mutation& cm);
 };
--- a/cdc/generation.cc
+++ b/cdc/generation.cc
@@ -22,13 +22,10 @@
 #include <boost/type.hpp>
 #include <random>
 #include <unordered_set>
-#include <algorithm>
 #include <seastar/core/sleep.hh>
-#include <seastar/core/coroutine.hh>

 #include "keys.hh"
 #include "schema_builder.hh"
-#include "database.hh"
 #include "db/config.hh"
 #include "db/system_keyspace.hh"
 #include "db/system_distributed_keyspace.hh"
@@ -39,8 +36,6 @@
 #include "gms/gossiper.hh"

 #include "cdc/generation.hh"
-#include "cdc/cdc_options.hh"
-#include "cdc/generation_service.hh"

 extern logging::logger cdc_log;

@@ -179,29 +174,10 @@ bool topology_description::operator==(const topology_description& o) const {
    return _entries == o._entries;
 }

-const std::vector<token_range_description>& topology_description::entries() const& {
+const std::vector<token_range_description>& topology_description::entries() const {
    return _entries;
 }

-std::vector<token_range_description>&& topology_description::entries() && {
-    return std::move(_entries);
-}
-
-static std::vector<stream_id> create_stream_ids(
-        size_t index, dht::token start, dht::token end, size_t shard_count, uint8_t ignore_msb) {
-    std::vector<stream_id> result;
-    result.reserve(shard_count);
-    dht::sharder sharder(shard_count, ignore_msb);
-    for (size_t shard_idx = 0; shard_idx < shard_count; ++shard_idx) {
-        auto t = dht::find_first_token_for_shard(sharder, start, end, shard_idx);
-        // compose the id from token and the "index" of the range end owning vnode
-        // as defined by token sort order. Basically grouping within this
-        // shard set.
-        result.emplace_back(stream_id(t, index));
-    }
-    return result;
-}
-
 class topology_description_generator final {
    const db::config& _cfg;
    const std::unordered_set<dht::token>& _bootstrap_tokens;
@@ -241,9 +217,18 @@ class topology_description_generator final {
        desc.token_range_end = end;

        auto [shard_count, ignore_msb] = get_sharding_info(end);
-        desc.streams = create_stream_ids(index, start, end, shard_count, ignore_msb);
+        desc.streams.reserve(shard_count);
        desc.sharding_ignore_msb = ignore_msb;

+        dht::sharder sharder(shard_count, ignore_msb);
+        for (size_t shard_idx = 0; shard_idx < shard_count; ++shard_idx) {
+            auto t = dht::find_first_token_for_shard(sharder, start, end, shard_idx);
+            // compose the id from token and the "index" of the range end owning vnode
+            // as defined by token sort order. Basically grouping within this
+            // shard set.
+            desc.streams.emplace_back(stream_id(t, index));
+        }
+
        return desc;
    }
 public:
@@ -309,39 +294,8 @@ future<db_clock::time_point> get_local_streams_timestamp() {
    });
 }

-// non-static for testing
-size_t limit_of_streams_in_topology_description() {
-    // Each stream takes 16B and we don't want to exceed 4MB so we can have
-    // at most 262144 streams but not less than 1 per vnode.
-    return 4 * 1024 * 1024 / 16;
-}
-
-// non-static for testing
-topology_description limit_number_of_streams_if_needed(topology_description&& desc) {
-    int64_t streams_count = 0;
-    for (auto& tr_desc : desc.entries()) {
-        streams_count += tr_desc.streams.size();
-    }
-
-    size_t limit = std::max(limit_of_streams_in_topology_description(), desc.entries().size());
-    if (limit >= streams_count) {
-        return std::move(desc);
-    }
-    size_t streams_per_vnode_limit = limit / desc.entries().size();
-    auto entries = std::move(desc).entries();
-    auto start = entries.back().token_range_end;
-    for (size_t idx = 0; idx < entries.size(); ++idx) {
-        auto end = entries[idx].token_range_end;
-        if (entries[idx].streams.size() > streams_per_vnode_limit) {
-            entries[idx].streams =
-                create_stream_ids(idx, start, end, streams_per_vnode_limit, entries[idx].sharding_ignore_msb);
-        }
-        start = end;
-    }
-    return topology_description(std::move(entries));
-}
-
-future<db_clock::time_point> make_new_cdc_generation(
+// Run inside seastar::async context.
+db_clock::time_point make_new_cdc_generation(
        const db::config& cfg,
        const std::unordered_set<dht::token>& bootstrap_tokens,
        const locator::token_metadata_ptr tmptr,
@@ -352,25 +306,13 @@ future<db_clock::time_point> make_new_cdc_generation(
    using namespace std::chrono;
    auto gen = topology_description_generator(cfg, bootstrap_tokens, tmptr, g).generate();

-    // If the cluster is large we may end up with a generation that contains
-    // large number of streams. This is problematic because we store the
-    // generation in a single row. For a generation with large number of rows
-    // this will lead to a row that can be as big as 32MB. This is much more
-    // than the limit imposed by commitlog_segment_size_in_mb. If the size of
-    // the row that describes a new generation grows above
-    // commitlog_segment_size_in_mb, the write will fail and the new node won't
-    // be able to join. To avoid such problem we make sure that such row is
-    // always smaller than 4MB. We do that by removing some CDC streams from
-    // each vnode if the total number of streams is too large.
-    gen = limit_number_of_streams_if_needed(std::move(gen));
-
    // Begin the race.
    auto ts = db_clock::now() + (
            (!add_delay || ring_delay == milliseconds(0)) ? milliseconds(0) : (
                2 * ring_delay + duration_cast<milliseconds>(generation_leeway)));
-    co_await sys_dist_ks.insert_cdc_topology_description(ts, std::move(gen), { tmptr->count_normal_token_owners() });
+    sys_dist_ks.insert_cdc_topology_description(ts, std::move(gen), { tmptr->count_normal_token_owners() }).get();

-    co_return ts;
+    return ts;
 }

 std::optional<db_clock::time_point> get_streams_timestamp_for(const gms::inet_address& endpoint, const gms::gossiper& g) {
@@ -379,581 +321,63 @@ std::optional<db_clock::time_point> get_streams_timestamp_for(const gms::inet_ad
    return gms::versioned_value::cdc_streams_timestamp_from_string(streams_ts_string);
 }

-static future<> do_update_streams_description(
+// Run inside seastar::async context.
+static void do_update_streams_description(
        db_clock::time_point streams_ts,
        db::system_distributed_keyspace& sys_dist_ks,
        db::system_distributed_keyspace::context ctx) {
-    if (co_await sys_dist_ks.cdc_desc_exists(streams_ts, ctx)) {
-        cdc_log.info("Generation {}: streams description table already updated.", streams_ts);
-        co_return;
+    if (sys_dist_ks.cdc_desc_exists(streams_ts, ctx).get0()) {
+        cdc_log.debug("update_streams_description: description of generation {} already inserted", streams_ts);
+        return;
    }

    // We might race with another node also inserting the description, but that's ok. It's an idempotent operation.

-    auto topo = co_await sys_dist_ks.read_cdc_topology_description(streams_ts, ctx);
+    auto topo = sys_dist_ks.read_cdc_topology_description(streams_ts, ctx).get0();
    if (!topo) {
-        throw no_generation_data_exception(streams_ts);
+        throw std::runtime_error(format("could not find streams data for timestamp {}", streams_ts));
    }

-    co_await sys_dist_ks.create_cdc_desc(streams_ts, *topo, ctx);
+    std::set<cdc::stream_id> streams_set;
+    for (auto& entry: topo->entries()) {
+        streams_set.insert(entry.streams.begin(), entry.streams.end());
+    }
+
+    std::vector<cdc::stream_id> streams_vec(streams_set.begin(), streams_set.end());
+
+    sys_dist_ks.create_cdc_desc(streams_ts, streams_vec, ctx).get();
    cdc_log.info("CDC description table successfully updated with generation {}.", streams_ts);
 }

-future<> update_streams_description(
+void update_streams_description(
        db_clock::time_point streams_ts,
        shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
        noncopyable_function<unsigned()> get_num_token_owners,
        abort_source& abort_src) {
    try {
-        co_await do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() });
-    } catch (...) {
+        do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() });
+    } catch(...) {
        cdc_log.warn(
            "Could not update CDC description table with generation {}: {}. Will retry in the background.",
            streams_ts, std::current_exception());

        // It is safe to discard this future: we keep system distributed keyspace alive.
-        (void)(([] (db_clock::time_point streams_ts,
-                    shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
-                    noncopyable_function<unsigned()> get_num_token_owners,
-                    abort_source& abort_src) -> future<> {
+        (void)seastar::async([
+            streams_ts, sys_dist_ks, get_num_token_owners = std::move(get_num_token_owners), &abort_src
+        ] {
            while (true) {
-                co_await sleep_abortable(std::chrono::seconds(60), abort_src);
+                sleep_abortable(std::chrono::seconds(60), abort_src).get();
                try {
-                    co_await do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() });
-                    co_return;
+                    do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() });
+                    return;
                } catch (...) {
                    cdc_log.warn(
                        "Could not update CDC description table with generation {}: {}. Will try again.",
                        streams_ts, std::current_exception());
                }
            }
-        })(streams_ts, std::move(sys_dist_ks), std::move(get_num_token_owners), abort_src));
-    }
-}
-
-static db_clock::time_point as_timepoint(const utils::UUID& uuid) {
-    return db_clock::time_point{std::chrono::milliseconds(utils::UUID_gen::get_adjusted_timestamp(uuid))};
-}
-
-static future<std::vector<db_clock::time_point>> get_cdc_desc_v1_timestamps(
-        db::system_distributed_keyspace& sys_dist_ks,
-        abort_source& abort_src,
-        const noncopyable_function<unsigned()>& get_num_token_owners) {
-    while (true) {
-        try {
-            co_return co_await sys_dist_ks.get_cdc_desc_v1_timestamps({ get_num_token_owners() });
-        } catch (...) {
-            cdc_log.warn(
-                    "Failed to retrieve generation timestamps for rewriting: {}. Retrying in 60s.",
-                    std::current_exception());
-        }
-        co_await sleep_abortable(std::chrono::seconds(60), abort_src);
-    }
-}
-
-// Contains a CDC log table's creation time (extracted from its schema's id)
-// and its CDC TTL setting.
-struct time_and_ttl {
-    db_clock::time_point creation_time;
-    int ttl;
-};
-
-/*
- * See `maybe_rewrite_streams_descriptions`.
- * This is the long-running-in-the-background part of that function.
- * It returns the timestamp of the last rewritten generation (if any).
- */
-static future<std::optional<db_clock::time_point>> rewrite_streams_descriptions(
-        std::vector<time_and_ttl> times_and_ttls,
-        shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
-        noncopyable_function<unsigned()> get_num_token_owners,
-        abort_source& abort_src) {
-    cdc_log.info("Retrieving generation timestamps for rewriting...");
-    auto tss = co_await get_cdc_desc_v1_timestamps(*sys_dist_ks, abort_src, get_num_token_owners);
-    cdc_log.info("Generation timestamps retrieved.");
-
-    // Find first generation timestamp such that some CDC log table may contain data before this timestamp.
-    // This predicate is monotonic w.r.t the timestamps.
-    auto now = db_clock::now();
-    std::sort(tss.begin(), tss.end());
-    auto first = std::partition_point(tss.begin(), tss.end(), [&] (db_clock::time_point ts) {
-        // partition_point finds first element that does *not* satisfy the predicate.
-        return std::none_of(times_and_ttls.begin(), times_and_ttls.end(),
-                [&] (const time_and_ttl& tat) {
-            // In this CDC log table there are no entries older than the table's creation time
-            // or (now - the table's ttl). We subtract 10s to account for some possible clock drift.
-            // If ttl is set to 0 then entries in this table never expire. In that case we look
-            // only at the table's creation time.
-            auto no_entries_older_than =
-                (tat.ttl == 0 ? tat.creation_time : std::max(tat.creation_time, now - std::chrono::seconds(tat.ttl)))
-                    - std::chrono::seconds(10);
-            return no_entries_older_than < ts;
        });
-    });
-
-    // Find first generation timestamp such that some CDC log table may contain data in this generation.
-    // This and all later generations need to be written to the new streams table.
-    if (first != tss.begin()) {
-        --first;
    }
-
-    if (first == tss.end()) {
-        cdc_log.info("No generations to rewrite.");
-        co_return std::nullopt;
-    }
-
-    cdc_log.info("First generation to rewrite: {}", *first);
-
-    bool each_success = true;
-    co_await max_concurrent_for_each(first, tss.end(), 10, [&] (db_clock::time_point ts) -> future<> {
-        while (true) {
-            try {
-                co_return co_await do_update_streams_description(ts, *sys_dist_ks, { get_num_token_owners() });
-            } catch (const no_generation_data_exception& e) {
-                cdc_log.error("Failed to rewrite streams for generation {}: {}. Giving up.", ts, e);
-                each_success = false;
-                co_return;
-            } catch (...) {
-                cdc_log.warn("Failed to rewrite streams for generation {}: {}. Retrying in 60s.", ts, std::current_exception());
-            }
-            co_await sleep_abortable(std::chrono::seconds(60), abort_src);
-        }
-    });
-
-    if (each_success) {
-        cdc_log.info("Rewriting stream tables finished successfully.");
-    } else {
-        cdc_log.info("Rewriting stream tables finished, but some generations could not be rewritten (check the logs).");
-    }
-
-    if (first != tss.end()) {
-        co_return *std::prev(tss.end());
-    }
-
-    co_return std::nullopt;
-}
-
-future<> maybe_rewrite_streams_descriptions(
-        const database& db,
-        shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
-        noncopyable_function<unsigned()> get_num_token_owners,
-        abort_source& abort_src) {
-    if (!db.has_schema(sys_dist_ks->NAME, sys_dist_ks->CDC_DESC_V1)) {
-        // This cluster never went through a Scylla version which used this table
-        // or the user deleted the table. Nothing to do.
-        co_return;
-    }
-
-    if (co_await db::system_keyspace::cdc_is_rewritten()) {
-        co_return;
-    }
-
-    if (db.get_config().cdc_dont_rewrite_streams()) {
-        cdc_log.warn("Stream rewriting disabled. Manual administrator intervention may be required...");
-        co_return;
-    }
-
-    // For each CDC log table get the TTL setting (from CDC options) and the table's creation time
-    std::vector<time_and_ttl> times_and_ttls;
-    for (auto& [_, cf] : db.get_column_families()) {
-        auto& s = *cf->schema();
-        auto base = cdc::get_base_table(db, s.ks_name(), s.cf_name());
-        if (!base) {
-            // Not a CDC log table.
-            continue;
-        }
-        auto& cdc_opts = base->cdc_options();
-        if (!cdc_opts.enabled()) {
-            // This table is named like a CDC log table but it's not one.
-            continue;
-        }
-
-        times_and_ttls.push_back(time_and_ttl{as_timepoint(s.id()), cdc_opts.ttl()});
-    }
-
-    if (times_and_ttls.empty()) {
-        // There's no point in rewriting old generations' streams (they don't contain any data).
-        cdc_log.info("No CDC log tables present, not rewriting stream tables.");
-        co_return co_await db::system_keyspace::cdc_set_rewritten(std::nullopt);
-    }
-
-    // It's safe to discard this future: the coroutine keeps system_distributed_keyspace alive
-    // and the abort source's lifetime extends the lifetime of any other service.
-    (void)(([_times_and_ttls = std::move(times_and_ttls), _sys_dist_ks = std::move(sys_dist_ks),
-                _get_num_token_owners = std::move(get_num_token_owners), &_abort_src = abort_src] () mutable -> future<> {
-        auto times_and_ttls = std::move(_times_and_ttls);
-        auto sys_dist_ks = std::move(_sys_dist_ks);
-        auto get_num_token_owners = std::move(_get_num_token_owners);
-        auto& abort_src = _abort_src;
-
-        // This code is racing with node startup. At this point, we're most likely still waiting for gossip to settle
-        // and some nodes that are UP may still be marked as DOWN by us.
-        // Let's sleep a bit to increase the chance that the first attempt at rewriting succeeds (it's still ok if
-        // it doesn't - we'll retry - but it's nice if we succeed without any warnings).
-        co_await sleep_abortable(std::chrono::seconds(10), abort_src);
-
-        cdc_log.info("Rewriting stream tables in the background...");
-        auto last_rewritten = co_await rewrite_streams_descriptions(
-                std::move(times_and_ttls),
-                std::move(sys_dist_ks),
-                std::move(get_num_token_owners),
-                abort_src);
-
-        co_await db::system_keyspace::cdc_set_rewritten(last_rewritten);
-    })());
-}
-
-static void assert_shard_zero(const sstring& where) {
-    if (this_shard_id() != 0) {
-        on_internal_error(cdc_log, format("`{}`: must be run on shard 0", where));
-    }
-}
-
-class and_reducer {
-private:
-    bool _result = true;
-public:
-    future<> operator()(bool value) {
-        _result = value && _result;
-        return make_ready_future<>();
-    }
-    bool get() {
-        return _result;
-    }
-};
-
-class or_reducer {
-private:
-    bool _result = false;
-public:
-    future<> operator()(bool value) {
-        _result = value || _result;
-        return make_ready_future<>();
-    }
-    bool get() {
-        return _result;
-    }
-};
-
-class generation_handling_nonfatal_exception : public std::runtime_error {
-    using std::runtime_error::runtime_error;
-};
-
-constexpr char could_not_retrieve_msg_template[]
-        = "Could not retrieve CDC streams with timestamp {} upon gossip event. Reason: \"{}\". Action: {}.";
-
-generation_service::generation_service(
-            const db::config& cfg, gms::gossiper& g, sharded<db::system_distributed_keyspace>& sys_dist_ks,
-            abort_source& abort_src, const locator::shared_token_metadata& stm)
-        : _cfg(cfg), _gossiper(g), _sys_dist_ks(sys_dist_ks), _abort_src(abort_src), _token_metadata(stm) {
-}
-
-future<> generation_service::stop() {
-    if (this_shard_id() == 0) {
-        co_await _gossiper.unregister_(shared_from_this());
-    }
-
-    _stopped = true;
-}
-
-generation_service::~generation_service() {
-    assert(_stopped);
-}
-
-future<> generation_service::after_join(std::optional<db_clock::time_point>&& startup_gen_ts) {
-    assert_shard_zero(__PRETTY_FUNCTION__);
-    assert(db::system_keyspace::bootstrap_complete());
-
-    _gen_ts = std::move(startup_gen_ts);
-    _gossiper.register_(shared_from_this());
-
-    _joined = true;
-
-    // Retrieve the latest CDC generation seen in gossip (if any).
-    co_await scan_cdc_generations();
-}
-
-void generation_service::on_join(gms::inet_address ep, gms::endpoint_state ep_state) {
-    assert_shard_zero(__PRETTY_FUNCTION__);
-
-    auto val = ep_state.get_application_state_ptr(gms::application_state::CDC_STREAMS_TIMESTAMP);
-    if (!val) {
-        return;
-    }
-
-    on_change(ep, gms::application_state::CDC_STREAMS_TIMESTAMP, *val);
-}
-
-void generation_service::on_change(gms::inet_address ep, gms::application_state app_state, const gms::versioned_value& v) {
-    assert_shard_zero(__PRETTY_FUNCTION__);
-
-    if (app_state != gms::application_state::CDC_STREAMS_TIMESTAMP) {
-        return;
-    }
-
-    auto ts = gms::versioned_value::cdc_streams_timestamp_from_string(v.value);
-    cdc_log.debug("Endpoint: {}, CDC generation timestamp change: {}", ep, ts);
-
-    handle_cdc_generation(ts).get();
-}
-
-future<> generation_service::check_and_repair_cdc_streams() {
-    if (!_joined) {
-        throw std::runtime_error("check_and_repair_cdc_streams: node not initialized yet");
-    }
-
-    auto latest = _gen_ts;
-    const auto& endpoint_states = _gossiper.get_endpoint_states();
-    for (const auto& [addr, state] : endpoint_states) {
-        if (!_gossiper.is_normal(addr))  {
-            throw std::runtime_error(format("All nodes must be in NORMAL state while performing check_and_repair_cdc_streams"
-                    " ({} is in state {})", addr, _gossiper.get_gossip_status(state)));
-        }
-
-        const auto ts = get_streams_timestamp_for(addr, _gossiper);
-        if (!latest || (ts && *ts > *latest)) {
-            latest = ts;
-        }
-    }
-
-    bool should_regenerate = false;
-    std::optional<topology_description> gen;
-
-    static const auto timeout_msg = "Timeout while fetching CDC topology description";
-    static const auto topology_read_error_note = "Note: this is likely caused by"
-            " node(s) being down or unreachable. It is recommended to check the network and"
-            " restart/remove the failed node(s), then retry checkAndRepairCdcStreams command";
-    static const auto exception_translating_msg = "Translating the exception to `request_execution_exception`";
-    const auto tmptr = _token_metadata.get();
-    auto sys_dist_ks = get_sys_dist_ks();
-    try {
-        gen = co_await sys_dist_ks->read_cdc_topology_description(
-                *latest, { tmptr->count_normal_token_owners() });
-    } catch (exceptions::request_timeout_exception& e) {
-        cdc_log.error("{}: \"{}\". {}.", timeout_msg, e.what(), exception_translating_msg);
-        throw exceptions::request_execution_exception(exceptions::exception_code::READ_TIMEOUT,
-                format("{}. {}.", timeout_msg, topology_read_error_note));
-    } catch (exceptions::unavailable_exception& e) {
-        static const auto unavailable_msg = "Node(s) unavailable while fetching CDC topology description";
-        cdc_log.error("{}: \"{}\". {}.", unavailable_msg, e.what(), exception_translating_msg);
-        throw exceptions::request_execution_exception(exceptions::exception_code::UNAVAILABLE,
-                format("{}. {}.", unavailable_msg, topology_read_error_note));
-    } catch (...) {
-        const auto ep = std::current_exception();
-        if (is_timeout_exception(ep)) {
-            cdc_log.error("{}: \"{}\". {}.", timeout_msg, ep, exception_translating_msg);
-            throw exceptions::request_execution_exception(exceptions::exception_code::READ_TIMEOUT,
-                    format("{}. {}.", timeout_msg, topology_read_error_note));
-        }
-        // On exotic errors proceed with regeneration
-        cdc_log.error("Exception while reading CDC topology description: \"{}\". Regenerating streams anyway.", ep);
-        should_regenerate = true;
-    }
-
-    if (!gen) {
-        cdc_log.error(
-            "Could not find CDC generation with timestamp {} in distributed system tables (current time: {}),"
-            " even though some node gossiped about it.",
-            latest, db_clock::now());
-        should_regenerate = true;
-    } else {
-        std::unordered_set<dht::token> gen_ends;
-        for (const auto& entry : gen->entries()) {
-            gen_ends.insert(entry.token_range_end);
-        }
-        for (const auto& metadata_token : tmptr->sorted_tokens()) {
-            if (!gen_ends.contains(metadata_token)) {
-                cdc_log.warn("CDC generation {} missing token {}. Regenerating.", latest, metadata_token);
-                should_regenerate = true;
-                break;
-            }
-        }
-    }
-
-    if (!should_regenerate) {
-        if (latest != _gen_ts) {
-            co_await do_handle_cdc_generation(*latest);
-        }
-        cdc_log.info("CDC generation {} does not need repair", latest);
-        co_return;
-    }
-    const auto new_gen_ts = co_await make_new_cdc_generation(_cfg,
-            {}, std::move(tmptr), _gossiper, *sys_dist_ks,
-            std::chrono::milliseconds(_cfg.ring_delay_ms()), true /* add delay */);
-    // Need to artificially update our STATUS so other nodes handle the timestamp change
-    auto status = _gossiper.get_application_state_ptr(
-            utils::fb_utilities::get_broadcast_address(), gms::application_state::STATUS);
-    if (!status) {
-        cdc_log.error("Our STATUS is missing");
-        cdc_log.error("Aborting CDC generation repair due to missing STATUS");
-        co_return;
-    }
-    // Update _gen_ts first, so that do_handle_cdc_generation (which will get called due to the status update)
-    // won't try to update the gossiper, which would result in a deadlock inside add_local_application_state
-    _gen_ts = new_gen_ts;
-    co_await _gossiper.add_local_application_state({
-            { gms::application_state::CDC_STREAMS_TIMESTAMP, gms::versioned_value::cdc_streams_timestamp(new_gen_ts) },
-            { gms::application_state::STATUS, *status }
-    });
-    co_await db::system_keyspace::update_cdc_streams_timestamp(new_gen_ts);
-}
-
-future<> generation_service::handle_cdc_generation(std::optional<db_clock::time_point> ts) {
-    assert_shard_zero(__PRETTY_FUNCTION__);
-
-    if (!ts) {
-        co_return;
-    }
-
-    if (!db::system_keyspace::bootstrap_complete() || !_sys_dist_ks.local_is_initialized()
-            || !_sys_dist_ks.local().started()) {
-        // The service should not be listening for generation changes until after the node
-        // is bootstrapped. Therefore we would previously assume that this condition
-        // can never become true and call on_internal_error here, but it turns out that
-        // it may become true on decommission: the node enters NEEDS_BOOTSTRAP
-        // state before leaving the token ring, so bootstrap_complete() becomes false.
-        // In that case we can simply return.
-        co_return;
-    }
-
-    if (co_await container().map_reduce(and_reducer(), [ts = *ts] (generation_service& svc) {
-        return !svc._cdc_metadata.prepare(ts);
-    })) {
-        co_return;
-    }
-
-    bool using_this_gen = false;
-    try {
-        using_this_gen = co_await do_handle_cdc_generation_intercept_nonfatal_errors(*ts);
-    } catch (generation_handling_nonfatal_exception& e) {
-        cdc_log.warn(could_not_retrieve_msg_template, ts, e.what(), "retrying in the background");
-        async_handle_cdc_generation(*ts);
-        co_return;
-    } catch (...) {
-        cdc_log.error(could_not_retrieve_msg_template, ts, std::current_exception(), "not retrying");
-        co_return; // Exotic ("fatal") exception => do not retry
-    }
-
-    if (using_this_gen) {
-        cdc_log.info("Starting to use generation {}", *ts);
-        co_await update_streams_description(*ts, get_sys_dist_ks(),
-                [tmptr = _token_metadata.get()] { return tmptr->count_normal_token_owners(); },
-                _abort_src);
-    }
-}
-
-void generation_service::async_handle_cdc_generation(db_clock::time_point ts) {
-    assert_shard_zero(__PRETTY_FUNCTION__);
-
-    (void)(([] (db_clock::time_point ts, shared_ptr<generation_service> svc) -> future<> {
-        while (true) {
-            co_await sleep_abortable(std::chrono::seconds(5), svc->_abort_src);
-
-            try {
-                bool using_this_gen = co_await svc->do_handle_cdc_generation_intercept_nonfatal_errors(ts);
-                if (using_this_gen) {
-                    cdc_log.info("Starting to use generation {}", ts);
-                    co_await update_streams_description(ts, svc->get_sys_dist_ks(),
-                            [tmptr = svc->_token_metadata.get()] { return tmptr->count_normal_token_owners(); },
-                            svc->_abort_src);
-                }
-                co_return;
-            } catch (generation_handling_nonfatal_exception& e) {
-                cdc_log.warn(could_not_retrieve_msg_template, ts, e.what(), "continuing to retry in the background");
-            } catch (...) {
-                cdc_log.error(could_not_retrieve_msg_template, ts, std::current_exception(), "not retrying anymore");
-                co_return; // Exotic ("fatal") exception => do not retry
-            }
-
-            if (co_await svc->container().map_reduce(and_reducer(), [ts] (generation_service& svc) {
-                return svc._cdc_metadata.known_or_obsolete(ts);
-            })) {
-                co_return;
-            }
-        }
-    })(ts, shared_from_this()));
-}
-
-future<> generation_service::scan_cdc_generations() {
-    assert_shard_zero(__PRETTY_FUNCTION__);
-
-    std::optional<db_clock::time_point> latest;
-    for (const auto& ep: _gossiper.get_endpoint_states()) {
-        auto ts = get_streams_timestamp_for(ep.first, _gossiper);
-        if (!latest || (ts && *ts > *latest)) {
-            latest = ts;
-        }
-    }
-
-    if (latest) {
-        cdc_log.info("Latest generation seen during startup: {}", *latest);
-        co_await handle_cdc_generation(latest);
-    } else {
-        cdc_log.info("No generation seen during startup.");
-    }
-}
-
-future<bool> generation_service::do_handle_cdc_generation_intercept_nonfatal_errors(db_clock::time_point ts) {
-    assert_shard_zero(__PRETTY_FUNCTION__);
-
-    try {
-        co_return co_await do_handle_cdc_generation(ts);
-    } catch (exceptions::request_timeout_exception& e) {
-        throw generation_handling_nonfatal_exception(e.what());
-    } catch (exceptions::unavailable_exception& e) {
-        throw generation_handling_nonfatal_exception(e.what());
-    } catch (exceptions::read_failure_exception& e) {
-        throw generation_handling_nonfatal_exception(e.what());
-    } catch (...) {
-        const auto ep = std::current_exception();
-        if (is_timeout_exception(ep)) {
-            throw generation_handling_nonfatal_exception(format("{}", ep));
-        }
-        throw;
-    }
-}
-
-future<bool> generation_service::do_handle_cdc_generation(db_clock::time_point ts) {
-    assert_shard_zero(__PRETTY_FUNCTION__);
-
-    auto sys_dist_ks = get_sys_dist_ks();
-    auto gen = co_await sys_dist_ks->read_cdc_topology_description(
-            ts, { _token_metadata.get()->count_normal_token_owners() });
-    if (!gen) {
-        throw std::runtime_error(format(
-            "Could not find CDC generation with timestamp {} in distributed system tables (current time: {}),"
-            " even though some node gossiped about it.",
-            ts, db_clock::now()));
-    }
-
-    // If we're not gossiping our own generation timestamp (because we've upgraded from a non-CDC/old version,
-    // or we somehow lost it due to a byzantine failure), start gossiping someone else's timestamp.
-    // This is to avoid the upgrade check on every restart (see `should_propose_first_cdc_generation`).
-    // And if we notice that `ts` is higher than our timestamp, we will start gossiping it instead,
-    // so if the node that initially gossiped `ts` leaves the cluster while `ts` is still the latest generation,
-    // the cluster will remember.
-    if (!_gen_ts || *_gen_ts < ts) {
-        _gen_ts = ts;
-        co_await db::system_keyspace::update_cdc_streams_timestamp(ts);
-        co_await _gossiper.add_local_application_state(
-                gms::application_state::CDC_STREAMS_TIMESTAMP, gms::versioned_value::cdc_streams_timestamp(ts));
-    }
-
-    // Return `true` iff the generation was inserted on any of our shards.
-    co_return co_await container().map_reduce(or_reducer(), [ts, &gen] (generation_service& svc) {
-        auto gen_ = *gen;
-        return svc._cdc_metadata.insert(ts, std::move(gen_));
-    });
-}
-
-shared_ptr<db::system_distributed_keyspace> generation_service::get_sys_dist_ks() {
-    assert_shard_zero(__PRETTY_FUNCTION__);
-
-    if (!_sys_dist_ks.local_is_initialized()) {
-        throw std::runtime_error("system distributed keyspace not initialized");
-    }
-
-    return _sys_dist_ks.local_shared();
 }

 } // namespace cdc
--- a/cdc/generation.hh
+++ b/cdc/generation.hh
@@ -41,7 +41,6 @@
 #include "db_clock.hh"
 #include "dht/token.hh"
 #include "locator/token_metadata.hh"
-#include "utils/chunked_vector.hh"

 namespace seastar {
    class abort_source;
@@ -66,7 +65,6 @@ public:

    stream_id() = default;
    stream_id(bytes);
-    stream_id(dht::token, size_t);

    bool is_set() const;
    bool operator==(const stream_id&) const;
@@ -80,6 +78,9 @@ public:

    partition_key to_partition_key(const schema& log_schema) const;
    static int64_t token_from_bytes(bytes_view);
+private:
+    friend class topology_description_generator;
+    stream_id(dht::token, size_t);
 };

 /* Describes a mapping of tokens to CDC streams in a token range.
@@ -112,8 +113,7 @@ public:
    topology_description(std::vector<token_range_description> entries);
    bool operator==(const topology_description&) const;

-    const std::vector<token_range_description>& entries() const&;
-    std::vector<token_range_description>&& entries() &&;
+    const std::vector<token_range_description>& entries() const;
 };

 /**
@@ -122,19 +122,14 @@ public:
 */ 
 class streams_version {
 public:
-    utils::chunked_vector<stream_id> streams;
+    std::vector<stream_id> streams;
    db_clock::time_point timestamp;
+    std::optional<db_clock::time_point> expired;

-    streams_version(utils::chunked_vector<stream_id> s, db_clock::time_point ts)
+    streams_version(std::vector<stream_id> s, db_clock::time_point ts, std::optional<db_clock::time_point> exp)
        : streams(std::move(s))
        , timestamp(ts)
-    {}
-};
-
-class no_generation_data_exception : public std::runtime_error {
-public:
-    no_generation_data_exception(db_clock::time_point generation_ts)
-        : std::runtime_error(format("could not find generation data for timestamp {}", generation_ts))
+        , expired(std::move(exp))
    {}
 };

@@ -167,7 +162,7 @@ future<db_clock::time_point> get_local_streams_timestamp();
 * (not guaranteed in the current implementation, but expected to be the common case;
 *  we assume that `ring_delay` is enough for other nodes to learn about the new generation).
 */
-future<db_clock::time_point> make_new_cdc_generation(
+db_clock::time_point make_new_cdc_generation(
        const db::config& cfg,
        const std::unordered_set<dht::token>& bootstrap_tokens,
        const locator::token_metadata_ptr tmptr,
@@ -190,22 +185,13 @@ std::optional<db_clock::time_point> get_streams_timestamp_for(const gms::inet_ad
 *
 * Returning from this function does not mean that the table update was successful: the function
 * might run an asynchronous task in the background.
+ *
+ * Run inside seastar::async context.
 */
-future<> update_streams_description(
+void update_streams_description(
        db_clock::time_point,
        shared_ptr<db::system_distributed_keyspace>,
        noncopyable_function<unsigned()> get_num_token_owners,
        abort_source&);

-/* Part of the upgrade procedure. Useful in case where the version of Scylla that we're upgrading from
- * used the "cdc_streams_descriptions" table. This procedure ensures that the new "cdc_streams_descriptions_v2"
- * table contains streams of all generations that were present in the old table and may still contain data
- * (i.e. there exist CDC log tables that may contain rows with partition keys being the stream IDs from
- * these generations). */
-future<> maybe_rewrite_streams_descriptions(
-        const database&,
-        shared_ptr<db::system_distributed_keyspace>,
-        noncopyable_function<unsigned()> get_num_token_owners,
-        abort_source&);
-
 } // namespace cdc
--- a/cdc/generation_service.hh
+++ b/cdc/generation_service.hh
@@ -1,138 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements.  See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership.  The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- *
- * Modified by ScyllaDB
- * Copyright (C) 2021 ScyllaDB
- *
- */
-
-#pragma once
-
-#include "cdc/metadata.hh"
-#include "gms/i_endpoint_state_change_subscriber.hh"
-
-namespace db {
-class system_distributed_keyspace;
-}
-
-namespace gms {
-class gossiper;
-}
-
-namespace cdc {
-
-class generation_service : public peering_sharded_service<generation_service>
-                         , public async_sharded_service<generation_service>
-                         , public gms::i_endpoint_state_change_subscriber {
-
-    bool _stopped = false;
-
-    // The node has joined the token ring. Set to `true` on `after_join` call.
-    bool _joined = false;
-
-    const db::config& _cfg;
-    gms::gossiper& _gossiper;
-    sharded<db::system_distributed_keyspace>& _sys_dist_ks;
-    abort_source& _abort_src;
-    const locator::shared_token_metadata& _token_metadata;
-
-    /* Maintains the set of known CDC generations used to pick streams for log writes (i.e., the partition keys of these log writes).
-     * Updated in response to certain gossip events (see the handle_cdc_generation function).
-     */
-    cdc::metadata _cdc_metadata;
-
-    /* The latest known generation timestamp and the timestamp that we're currently gossiping
-     * (as CDC_STREAMS_TIMESTAMP application state).
-     *
-     * Only shard 0 manages this, hence it will be std::nullopt on all shards other than 0.
-     * This timestamp is also persisted in the system.cdc_local table.
-     *
-     * On shard 0 this may be nullopt only in one special case: rolling upgrade, when we upgrade
-     * from an old version of Scylla that didn't support CDC. In that case one node in the cluster
-     * will create the first generation and start gossiping it; it may be us, or it may be some
-     * different node. In any case, eventually - after one of the nodes gossips the first timestamp
-     * - we'll catch on and this variable will be updated with that generation.
-     */
-    std::optional<db_clock::time_point> _gen_ts;
-public:
-    generation_service(const db::config&, gms::gossiper&,
-            sharded<db::system_distributed_keyspace>&, abort_source&, const locator::shared_token_metadata&);
-
-    future<> stop();
-    ~generation_service();
-
-    /* After the node bootstraps and creates a new CDC generation, or restarts and loads the last
-     * known generation timestamp from persistent storage, this function should be called with
-     * that generation timestamp moved in as the `startup_gen_ts` parameter.
-     * This passes the responsibility of managing generations from the node startup code to this service;
-     * until then, the service remains dormant.
-     * At the time of writing this comment, the startup code is in `storage_service::join_token_ring`, hence
-     * `after_join` should be called at the end of that function.
-     * Precondition: the node has completed bootstrapping and system_distributed_keyspace is initialized.
-     * Must be called on shard 0 - that's where the generation management happens.
-     */
-    future<> after_join(std::optional<db_clock::time_point>&& startup_gen_ts);
-
-    cdc::metadata& get_cdc_metadata() {
-        return _cdc_metadata;
-    }
-
-    virtual void before_change(gms::inet_address, gms::endpoint_state, gms::application_state, const gms::versioned_value&) override {}
-    virtual void on_alive(gms::inet_address, gms::endpoint_state) override {}
-    virtual void on_dead(gms::inet_address, gms::endpoint_state) override {}
-    virtual void on_remove(gms::inet_address) override {}
-    virtual void on_restart(gms::inet_address, gms::endpoint_state) override {}
-
-    virtual void on_join(gms::inet_address, gms::endpoint_state) override;
-    virtual void on_change(gms::inet_address, gms::application_state, const gms::versioned_value&) override;
-
-    future<> check_and_repair_cdc_streams();
-
-private:
-    /* Retrieve the CDC generation which starts at the given timestamp (from a distributed table created for this purpose)
-     * and start using it for CDC log writes if it's not obsolete.
-     */
-    future<> handle_cdc_generation(std::optional<db_clock::time_point>);
-
-    /* If `handle_cdc_generation` fails, it schedules an asynchronous retry in the background
-     * using `async_handle_cdc_generation`.
-     */
-    void async_handle_cdc_generation(db_clock::time_point);
-
-    /* Wrapper around `do_handle_cdc_generation` which intercepts timeout/unavailability exceptions.
-     * Returns: do_handle_cdc_generation(ts). */
-    future<bool> do_handle_cdc_generation_intercept_nonfatal_errors(db_clock::time_point);
-
-    /* Returns `true` iff we started using the generation (it was not obsolete or already known),
-     * which means that this node might write some CDC log entries using streams from this generation. */
-    future<bool> do_handle_cdc_generation(db_clock::time_point);
-
-    /* Scan CDC generation timestamps gossiped by other nodes and retrieve the latest one.
-     * This function should be called once at the end of the node startup procedure
-     * (after the node is started and running normally, it will retrieve generations on gossip events instead).
-     */
-    future<> scan_cdc_generations();
-
-    /* generation_service code might be racing with system_distributed_keyspace deinitialization
-     * (the deinitialization order is broken).
-     * Therefore, whenever we want to access sys_dist_ks in a background task,
-     * we need to check if the instance is still there. Storing the shared pointer will keep it alive.
-     */
-    shared_ptr<db::system_distributed_keyspace> get_sys_dist_ks();
-};
-
-} // namespace cdc
--- a/cdc/log.cc
+++ b/cdc/log.cc
@@ -32,7 +32,6 @@
 #include "cdc/split.hh"
 #include "cdc/cdc_options.hh"
 #include "cdc/change_visitor.hh"
-#include "cdc/metadata.hh"
 #include "bytes.hh"
 #include "database.hh"
 #include "db/config.hh"
@@ -49,9 +48,6 @@
 #include "cql3/untyped_result_set.hh"
 #include "log.hh"
 #include "utils/rjson.hh"
-#include "utils/UUID_gen.hh"
-#include "utils/managed_bytes.hh"
-#include "utils/fragment_range.hh"
 #include "types.hh"
 #include "concrete_types.hh"
 #include "types/listlike_partial_deserializing_iterator.hh"
@@ -74,7 +70,7 @@ using namespace std::chrono_literals;
 logging::logger cdc_log("cdc");

 namespace cdc {
-static schema_ptr create_log_schema(const schema&, std::optional<utils::UUID> = {}, schema_ptr = nullptr);
+static schema_ptr create_log_schema(const schema&, std::optional<utils::UUID> = {});
 }

 static constexpr auto cdc_group_name = "cdc";
@@ -221,10 +217,10 @@ public:
                return;
            }

-            auto new_log_schema = create_log_schema(new_schema, log_schema ? std::make_optional(log_schema->id()) : std::nullopt, log_schema);
+            auto new_log_schema = create_log_schema(new_schema, log_schema ? std::make_optional(log_schema->id()) : std::nullopt);

            auto log_mut = log_schema 
-                ? db::schema_tables::make_update_table_mutations(db, keyspace.metadata(), log_schema, new_log_schema, timestamp, false)
+                ? db::schema_tables::make_update_table_mutations(keyspace.metadata(), log_schema, new_log_schema, timestamp, false)
                : db::schema_tables::make_create_table_mutations(keyspace.metadata(), new_log_schema, timestamp)
                ;

@@ -281,8 +277,8 @@ private:
    }
 };

-cdc::cdc_service::cdc_service(service::storage_proxy& proxy, cdc::metadata& cdc_metadata)
-    : cdc_service(db_context::builder(proxy, cdc_metadata).build())
+cdc::cdc_service::cdc_service(service::storage_proxy& proxy)
+    : cdc_service(db_context::builder(proxy).build())
 {}

 cdc::cdc_service::cdc_service(db_context ctxt)
@@ -490,7 +486,7 @@ bytes log_data_column_deleted_elements_name_bytes(const bytes& column_name) {
    return to_bytes(cdc_deleted_elements_column_prefix) + column_name;
 }

-static schema_ptr create_log_schema(const schema& s, std::optional<utils::UUID> uuid, schema_ptr old) {
+static schema_ptr create_log_schema(const schema& s, std::optional<utils::UUID> uuid) {
    schema_builder b(s.ks_name(), log_name(s.cf_name()));
    b.with_partitioner("com.scylladb.dht.CDCPartitioner");
    b.set_compaction_strategy(sstables::compaction_strategy_type::time_window);
@@ -571,25 +567,11 @@ static schema_ptr create_log_schema(const schema& s, std::optional<utils::UUID>
        b.set_uuid(*uuid);
    }

-    /**
-     * #10473 - if we are redefining the log table, we need to ensure any dropped
-     * columns are registered in "dropped_columns" table, otherwise clients will not
-     * be able to read data older than now.
-     */
-    if (old) {
-        // not super efficient, but we don't do this often.
-        for (auto& col : old->all_columns()) {
-            if (!b.has_column({col.name(), col.name_as_text() })) {
-                b.without_column(col.name_as_text(), col.type, api::new_timestamp());
-            }
-        }
-    }
-
    return b.build();
 }

-db_context::builder::builder(service::storage_proxy& proxy, cdc::metadata& cdc_metadata)
-    : _proxy(proxy), _cdc_metadata(cdc_metadata)
+db_context::builder::builder(service::storage_proxy& proxy) 
+    : _proxy(proxy) 
 {}

 db_context::builder& db_context::builder::with_migration_notifier(service::migration_notifier& migration_notifier) {
@@ -597,11 +579,22 @@ db_context::builder& db_context::builder::with_migration_notifier(service::migra
    return *this;
 }

+db_context::builder& db_context::builder::with_token_metadata(const locator::token_metadata& token_metadata) {
+    _token_metadata = token_metadata;
+    return *this;
+}
+
+db_context::builder& db_context::builder::with_cdc_metadata(cdc::metadata& cdc_metadata) {
+    _cdc_metadata = cdc_metadata;
+    return *this;
+}
+
 db_context db_context::builder::build() {
    return db_context{
        _proxy,
        _migration_notifier ? _migration_notifier->get() : service::get_local_storage_service().get_migration_notifier(),
-        _cdc_metadata,
+        _token_metadata ? _token_metadata->get() : service::get_local_storage_service().get_token_metadata(),
+        _cdc_metadata ? _cdc_metadata->get() : service::get_local_storage_service().get_cdc_metadata(),
    };
 }

@@ -679,14 +672,6 @@ void collection_iterator<bytes_view>::parse() {
    _current = k;
 }

-template<>
-void collection_iterator<managed_bytes_view>::parse() {
-    assert(_rem > 0);
-    _next = _v;
-    auto k = read_collection_value(_next, cql_serialization_format::internal());
-    _current = k;
-}
-
 template<typename Container, typename T>
 class maybe_back_insert_iterator : public std::back_insert_iterator<Container> {
    const abstract_type& _type;
@@ -730,16 +715,16 @@ private:
       }
       return false;
    }
-    int32_t compare(const T&, const value_type& v);
+    bool compare(const T&, const value_type& v);
 };

 template<>
-int32_t maybe_back_insert_iterator<std::vector<std::pair<bytes_view, bytes_view>>, bytes_view>::compare(const bytes_view& t, const value_type& v) {
+bool maybe_back_insert_iterator<std::vector<std::pair<bytes_view, bytes_view>>, bytes_view>::compare(const bytes_view& t, const value_type& v) {
    return _type.compare(t, v.first);
 }

 template<>
-int32_t maybe_back_insert_iterator<std::vector<bytes_view>, bytes_view>::compare(const bytes_view& t, const value_type& v) {
+bool maybe_back_insert_iterator<std::vector<bytes_view>, bytes_view>::compare(const bytes_view& t, const value_type& v) {
    return _type.compare(t, v);
 }

@@ -787,18 +772,18 @@ static bytes merge(const set_type_impl& ctype, const bytes_opt& prev, const byte
    return set_type_impl::serialize_partially_deserialized_form(res, cql_serialization_format::internal());
 }
 static bytes merge(const user_type_impl& type, const bytes_opt& prev, const bytes_opt& next, const bytes_opt& deleted) {
-    std::vector<managed_bytes_view_opt> res(type.size());
-    udt_for_each(prev, [&res, i = res.begin()](managed_bytes_view_opt k) mutable {
+    std::vector<bytes_view_opt> res(type.size());
+    udt_for_each(prev, [&res, i = res.begin()](bytes_view_opt k) mutable {
        *i++ = k;
    });
-    udt_for_each(next, [&res, i = res.begin()](managed_bytes_view_opt k) mutable {
+    udt_for_each(next, [&res, i = res.begin()](bytes_view_opt k) mutable {
        if (k) {
            *i = k;
        }
        ++i;
    });
-    collection_iterator<managed_bytes_view> e, d(deleted);
-    std::for_each(d, e, [&res](managed_bytes_view k) {
+    collection_iterator<bytes_view> e, d(deleted);
+    std::for_each(d, e, [&res](bytes_view k) {
        auto index = deserialize_field_index(k);
        res[index] = std::nullopt;
    });
@@ -837,13 +822,13 @@ static bytes_opt get_preimage_col_value(const column_definition& cdef, const cql
                auto v = pirow->get_view(cdef.name_as_text());
                auto f = cql_serialization_format::internal();
                auto n = read_collection_size(v, f);
-                std::vector<bytes> tmp;
+                std::vector<bytes_view> tmp;
                tmp.reserve(n);
                while (n--) {
-                    tmp.emplace_back(read_collection_value(v, f).linearize()); // key
+                    tmp.emplace_back(read_collection_value(v, f)); // key
                    read_collection_value(v, f); // value. ignore.
                }
-                return set_type_impl::serialize_partially_deserialized_form({tmp.begin(), tmp.end()}, f);
+                return set_type_impl::serialize_partially_deserialized_form(tmp, f);
            },
            [&] (const abstract_type& o) -> bytes {
                return pirow->get_blob(cdef.name_as_text());
@@ -999,13 +984,13 @@ private:
 };

 static bytes get_bytes(const atomic_cell_view& acv) {
-    return to_bytes(acv.value());
+    return acv.value().linearize();
 }

-static bytes_view get_bytes_view(const atomic_cell_view& acv, std::forward_list<bytes>& buf) {
+static bytes_view get_bytes_view(const atomic_cell_view& acv, std::vector<bytes>& buf) {
    return acv.value().is_fragmented()
-        ? bytes_view{buf.emplace_front(to_bytes(acv.value()))}
-        : acv.value().current_fragment();
+        ? bytes_view{buf.emplace_back(acv.value().linearize())}
+        : acv.value().first_fragment();
 }

 static ttl_opt get_ttl(const atomic_cell_view& acv) {
@@ -1158,10 +1143,10 @@ struct process_row_visitor {
                _touched_parts.set<stats::part_type::UDT>();

                struct udt_visitor : public collection_visitor {
-                    std::vector<bytes_view_opt> _added_cells;
-                    std::forward_list<bytes>& _buf;
+                    std::vector<bytes_opt> _added_cells;
+                    std::vector<bytes>& _buf;

-                    udt_visitor(ttl_opt& ttl_column, size_t num_keys, std::forward_list<bytes>& buf)
+                    udt_visitor(ttl_opt& ttl_column, size_t num_keys, std::vector<bytes>& buf)
                        : collection_visitor(ttl_column), _added_cells(num_keys), _buf(buf) {}

                    void live_collection_cell(bytes_view key, const atomic_cell_view& cell) {
@@ -1170,7 +1155,7 @@ struct process_row_visitor {
                    }
                };

-                std::forward_list<bytes> buf;
+                std::vector<bytes> buf;
                udt_visitor v(_ttl_column, type.size(), buf);

                visit_collection(v);
@@ -1189,9 +1174,9 @@ struct process_row_visitor {

                struct map_or_list_visitor : public collection_visitor {
                    std::vector<std::pair<bytes_view, bytes_view>> _added_cells;
-                    std::forward_list<bytes>& _buf;
+                    std::vector<bytes>& _buf;

-                    map_or_list_visitor(ttl_opt& ttl_column, std::forward_list<bytes>& buf)
+                    map_or_list_visitor(ttl_opt& ttl_column, std::vector<bytes>& buf)
                        : collection_visitor(ttl_column), _buf(buf) {}

                    void live_collection_cell(bytes_view key, const atomic_cell_view& cell) {
@@ -1200,7 +1185,7 @@ struct process_row_visitor {
                    }
                };

-                std::forward_list<bytes> buf;
+                std::vector<bytes> buf;
                map_or_list_visitor v(_ttl_column, buf);

                visit_collection(v);
@@ -1312,13 +1297,6 @@ struct process_change_visitor {
                _clustering_row_states, _generate_delta_values);
        visit_row_cells(v);

-        if (_enable_updating_state) {
-            // #7716: if there are no regular columns, our visitor would not have visited any cells,
-            // hence it would not have created a row_state for this row. In effect, postimage wouldn't be produced.
-            // Ensure that the row state exists.
-            _clustering_row_states.try_emplace(ckey);
-        }
-
        _builder.set_operation(log_ck, v._cdc_op);
        _builder.set_ttl(log_ck, v._ttl_column);
    }
@@ -1670,7 +1648,13 @@ public:
      try {
        return _ctx._proxy.query(_schema, std::move(command), std::move(partition_ranges), select_cl, service::storage_proxy::coordinator_query_options(default_timeout(), empty_service_permit(), client_state)).then(
                [s = _schema, partition_slice = std::move(partition_slice), selection = std::move(selection)] (service::storage_proxy::coordinator_query_result qr) -> lw_shared_ptr<cql3::untyped_result_set> {
-            return make_lw_shared<cql3::untyped_result_set>(*s, std::move(qr.query_result), *selection, partition_slice);
+                    cql3::selection::result_set_builder builder(*selection, gc_clock::now(), cql_serialization_format::latest());
+                    query::result_view::consume(*qr.query_result, partition_slice, cql3::selection::result_set_builder::visitor(builder, *s, *selection));
+                    auto result_set = builder.build();
+                    if (!result_set || result_set->empty()) {
+                        return {};
+                    }
+                    return make_lw_shared<cql3::untyped_result_set>(*result_set);
        });
      } catch (exceptions::unavailable_exception& e) {
        // `query` can throw `unavailable_exception`, which is seen by clients as ~ "NoHostAvailable". 
@@ -1714,7 +1698,7 @@ public:
                    // as there will be no clustering row data to load into the state.
                    return;
                }
-                ck_parts.emplace_back(v->linearize());
+                ck_parts.emplace_back(*v);
            }
            auto ck = clustering_key::from_exploded(std::move(ck_parts));

--- a/cdc/log.hh
+++ b/cdc/log.hh
@@ -80,7 +80,7 @@ class cdc_service final : public async_sharded_service<cdc::cdc_service> {
    std::unique_ptr<impl> _impl;
 public:
    future<> stop();
-    cdc_service(service::storage_proxy&, cdc::metadata&);
+    cdc_service(service::storage_proxy&);
    cdc_service(db_context);
    ~cdc_service();

@@ -100,16 +100,20 @@ public:
 struct db_context final {
    service::storage_proxy& _proxy;
    service::migration_notifier& _migration_notifier;
+    const locator::token_metadata& _token_metadata;
    cdc::metadata& _cdc_metadata;

    class builder final {
        service::storage_proxy& _proxy;
-        cdc::metadata& _cdc_metadata;
        std::optional<std::reference_wrapper<service::migration_notifier>> _migration_notifier;
+        std::optional<std::reference_wrapper<const locator::token_metadata>> _token_metadata;
+        std::optional<std::reference_wrapper<cdc::metadata>> _cdc_metadata;
    public:
-        builder(service::storage_proxy& proxy, cdc::metadata&);
+        builder(service::storage_proxy& proxy);

        builder& with_migration_notifier(service::migration_notifier& migration_notifier);
+        builder& with_token_metadata(const locator::token_metadata& token_metadata);
+        builder& with_cdc_metadata(cdc::metadata&);

        db_context build();
    };
--- a/cdc/metadata.cc
+++ b/cdc/metadata.cc
@@ -51,8 +51,7 @@ static cdc::stream_id get_stream(
    return entry.streams[shard_id];
 }

-// non-static for testing
-cdc::stream_id get_stream(
+static cdc::stream_id get_stream(
        const std::vector<cdc::token_range_description>& entries,
        dht::token tok) {
    if (entries.empty()) {
--- a/checked-file-impl.hh
+++ b/checked-file-impl.hh
@@ -31,7 +31,10 @@ class checked_file_impl : public file_impl {
 public:

    checked_file_impl(const io_error_handler& error_handler, file f)
-            : file_impl(*get_file_impl(f)),  _error_handler(error_handler), _file(f) {
+            : _error_handler(error_handler), _file(f) {
+        _memory_dma_alignment = f.memory_dma_alignment();
+        _disk_read_dma_alignment = f.disk_read_dma_alignment();
+        _disk_write_dma_alignment = f.disk_write_dma_alignment();
    }

    virtual future<size_t> write_dma(uint64_t pos, const void* buffer, size_t len, const io_priority_class& pc) override {
--- a/clustering_bounds_comparator.hh
+++ b/clustering_bounds_comparator.hh
@@ -67,8 +67,8 @@ public:
        int operator()(const clustering_key_prefix& p1, int32_t w1, const clustering_key_prefix& p2, int32_t w2) const {
            auto type = _s.get().clustering_key_prefix_type();
            auto res = prefix_equality_tri_compare(type->types().begin(),
-                type->begin(p1.representation()), type->end(p1.representation()),
-                type->begin(p2.representation()), type->end(p2.representation()),
+                type->begin(p1), type->end(p1),
+                type->begin(p2), type->end(p2),
                ::tri_compare);
            if (res) {
                return res;
--- a/clustering_ranges_walker.hh
+++ b/clustering_ranges_walker.hh
@@ -65,11 +65,6 @@ private:
                _current_start = position_in_partition_view::for_range_start(_current_range.front());
                _current_end = position_in_partition_view::for_range_end(_current_range.front());
            }
-        } else {
-             // If the first range is contiguous with the static row, then advance _current_end as much as we can
-             if (_current_range && !_current_range.front().start()) {
-                 _current_end = position_in_partition_view::for_range_end(_current_range.front());
-             }
        }
    }

--- a/collection_mutation.cc
+++ b/collection_mutation.cc
@@ -22,6 +22,7 @@
 #include "types/collection.hh"
 #include "types/user.hh"
 #include "concrete_types.hh"
+#include "atomic_cell_or_collection.hh"
 #include "mutation_partition.hh"
 #include "compaction_garbage_collector.hh"
 #include "combine.hh"
@@ -29,28 +30,40 @@
 #include "collection_mutation.hh"

 collection_mutation::collection_mutation(const abstract_type& type, collection_mutation_view v)
-    : _data(v.data) {}
+    : _data(imr_object_type::make(data::cell::make_collection(v.data), &type.imr_state().lsa_migrator())) {}

-collection_mutation::collection_mutation(const abstract_type& type, managed_bytes data)
-    : _data(std::move(data)) {}
+collection_mutation::collection_mutation(const abstract_type& type, const bytes_ostream& data)
+	: _data(imr_object_type::make(data::cell::make_collection(fragment_range_view(data)), &type.imr_state().lsa_migrator())) {}
+
+static collection_mutation_view get_collection_mutation_view(const uint8_t* ptr)
+{
+    auto f = data::cell::structure::get_member<data::cell::tags::flags>(ptr);
+    auto ti = data::type_info::make_collection();
+    data::cell::context ctx(f, ti);
+    auto view = data::cell::structure::get_member<data::cell::tags::cell>(ptr).as<data::cell::tags::collection>(ctx);
+    auto dv = data::cell::variable_value::make_view(view, f.get<data::cell::tags::external_data>());
+    return collection_mutation_view { dv };
+}

 collection_mutation::operator collection_mutation_view() const
 {
-    return collection_mutation_view{managed_bytes_view(_data)};
+    return get_collection_mutation_view(_data.get());
 }

 collection_mutation_view atomic_cell_or_collection::as_collection_mutation() const {
-    return collection_mutation_view{managed_bytes_view(_data)};
+    return get_collection_mutation_view(_data.get());
 }

 bool collection_mutation_view::is_empty() const {
-    auto in = collection_mutation_input_stream(fragment_range(data));
+    auto in = collection_mutation_input_stream(data);
    auto has_tomb = in.read_trivial<bool>();
    return !has_tomb && in.read_trivial<uint32_t>() == 0;
 }

-bool collection_mutation_view::is_any_live(const abstract_type& type, tombstone tomb, gc_clock::time_point now) const {
-    auto in = collection_mutation_input_stream(fragment_range(data));
+template <typename F>
+requires std::is_invocable_r_v<const data::type_info&, F, collection_mutation_input_stream&>
+static bool is_any_live(const atomic_cell_value_view& data, tombstone tomb, gc_clock::time_point now, F&& read_cell_type_info) {
+    auto in = collection_mutation_input_stream(data);
    auto has_tomb = in.read_trivial<bool>();
    if (has_tomb) {
        auto ts = in.read_trivial<api::timestamp_type>();
@@ -60,10 +73,9 @@ bool collection_mutation_view::is_any_live(const abstract_type& type, tombstone

    auto nr = in.read_trivial<uint32_t>();
    for (uint32_t i = 0; i != nr; ++i) {
-        auto key_size = in.read_trivial<uint32_t>();
-        in.skip(key_size);
+        auto& type_info = read_cell_type_info(in);
        auto vsize = in.read_trivial<uint32_t>();
-        auto value = atomic_cell_view::from_bytes(type, in.read(vsize));
+        auto value = atomic_cell_view::from_bytes(type_info, in.read(vsize));
        if (value.is_live(tomb, now, false)) {
            return true;
        }
@@ -72,8 +84,33 @@ bool collection_mutation_view::is_any_live(const abstract_type& type, tombstone
    return false;
 }

-api::timestamp_type collection_mutation_view::last_update(const abstract_type& type) const {
-    auto in = collection_mutation_input_stream(fragment_range(data));
+bool collection_mutation_view::is_any_live(const abstract_type& type, tombstone tomb, gc_clock::time_point now) const {
+    return visit(type, make_visitor(
+    [&] (const collection_type_impl& ctype) {
+        auto& type_info = ctype.value_comparator()->imr_state().type_info();
+        return ::is_any_live(data, tomb, now, [&type_info] (collection_mutation_input_stream& in) -> const data::type_info& {
+            auto key_size = in.read_trivial<uint32_t>();
+            in.skip(key_size);
+            return type_info;
+        });
+    },
+    [&] (const user_type_impl& utype) {
+        return ::is_any_live(data, tomb, now, [&utype] (collection_mutation_input_stream& in) -> const data::type_info& {
+            auto key_size = in.read_trivial<uint32_t>();
+            auto key = in.read(key_size);
+            return utype.type(deserialize_field_index(key))->imr_state().type_info();
+        });
+    },
+    [&] (const abstract_type& o) -> bool {
+        throw std::runtime_error(format("collection_mutation_view::is_any_live: unknown type {}", o.name()));
+    }
+    ));
+}
+
+template <typename F>
+requires std::is_invocable_r_v<const data::type_info&, F, collection_mutation_input_stream&>
+static api::timestamp_type last_update(const atomic_cell_value_view& data, F&& read_cell_type_info) {
+    auto in = collection_mutation_input_stream(data);
    api::timestamp_type max = api::missing_timestamp;
    auto has_tomb = in.read_trivial<bool>();
    if (has_tomb) {
@@ -83,16 +120,39 @@ api::timestamp_type collection_mutation_view::last_update(const abstract_type& t

    auto nr = in.read_trivial<uint32_t>();
    for (uint32_t i = 0; i != nr; ++i) {
-        const auto key_size = in.read_trivial<uint32_t>();
-        in.skip(key_size);
+        auto& type_info = read_cell_type_info(in);
        auto vsize = in.read_trivial<uint32_t>();
-        auto value = atomic_cell_view::from_bytes(type, in.read(vsize));
+        auto value = atomic_cell_view::from_bytes(type_info, in.read(vsize));
        max = std::max(value.timestamp(), max);
    }

    return max;
 }

+
+api::timestamp_type collection_mutation_view::last_update(const abstract_type& type) const {
+    return visit(type, make_visitor(
+    [&] (const collection_type_impl& ctype) {
+        auto& type_info = ctype.value_comparator()->imr_state().type_info();
+        return ::last_update(data, [&type_info] (collection_mutation_input_stream& in) -> const data::type_info& {
+            auto key_size = in.read_trivial<uint32_t>();
+            in.skip(key_size);
+            return type_info;
+        });
+    },
+    [&] (const user_type_impl& utype) {
+        return ::last_update(data, [&utype] (collection_mutation_input_stream& in) -> const data::type_info& {
+            auto key_size = in.read_trivial<uint32_t>();
+            auto key = in.read(key_size);
+            return utype.type(deserialize_field_index(key))->imr_state().type_info();
+        });
+    },
+    [&] (const abstract_type& o) -> api::timestamp_type {
+        throw std::runtime_error(format("collection_mutation_view::last_update: unknown type {}", o.name()));
+    }
+    ));
+}
+
 std::ostream& operator<<(std::ostream& os, const collection_mutation_view::printer& cmvp) {
    fmt::print(os, "{{collection_mutation_view ");
    cmvp._cmv.with_deserialized(cmvp._type, [&os, &type = cmvp._type] (const collection_mutation_view_description& cmvd) {
@@ -218,31 +278,28 @@ static collection_mutation serialize_collection_mutation(
    auto size = accumulate(cells, (size_t)4, element_size);
    size += 1;
    if (tomb) {
-        size += sizeof(int64_t) + sizeof(int64_t);
+        size += sizeof(tomb.timestamp) + sizeof(tomb.deletion_time);
    }
-    managed_bytes ret(managed_bytes::initialized_later(), size);
-    managed_bytes_mutable_view out(ret);
-    write<uint8_t>(out, uint8_t(bool(tomb)));
+    bytes_ostream ret;
+    ret.reserve(size);
+    auto out = ret.write_begin();
+    *out++ = bool(tomb);
    if (tomb) {
-        write<int64_t>(out, tomb.timestamp);
-        write<int64_t>(out, tomb.deletion_time.time_since_epoch().count());
+        write(out, tomb.timestamp);
+        write(out, tomb.deletion_time.time_since_epoch().count());
    }
-    auto writek = [&out] (bytes_view v) {
-        write<int32_t>(out, v.size());
-        write_fragmented(out, single_fragmented_view(v));
-    };
-    auto writev = [&out] (managed_bytes_view v) {
-        write<int32_t>(out, v.size());
-        write_fragmented(out, v);
+    auto writeb = [&out] (bytes_view v) {
+        serialize_int32(out, v.size());
+        out = std::copy_n(v.begin(), v.size(), out);
    };
    // FIXME: overflow?
-    write<int32_t>(out, boost::distance(cells));
+    serialize_int32(out, boost::distance(cells));
    for (auto&& kv : cells) {
        auto&& k = kv.first;
        auto&& v = kv.second;
-        writek(k);
+        writeb(k);

-        writev(v.serialize());
+        writeb(v.serialize());
    }
    return collection_mutation(type, ret);
 }
@@ -391,12 +448,13 @@ deserialize_collection_mutation(const abstract_type& type, collection_mutation_i
    return visit(type, make_visitor(
    [&] (const collection_type_impl& ctype) {
        // value_comparator(), ugh
-        return deserialize_collection_mutation(in, [&ctype] (collection_mutation_input_stream& in) {
+        auto& type_info = ctype.value_comparator()->imr_state().type_info();
+        return deserialize_collection_mutation(in, [&type_info] (collection_mutation_input_stream& in) {
            // FIXME: we could probably avoid the need for size
            auto ksize = in.read_trivial<uint32_t>();
            auto key = in.read(ksize);
            auto vsize = in.read_trivial<uint32_t>();
-            auto value = atomic_cell_view::from_bytes(*ctype.value_comparator(), in.read(vsize));
+            auto value = atomic_cell_view::from_bytes(type_info, in.read(vsize));
            return std::make_pair(key, value);
        });
    },
@@ -406,7 +464,8 @@ deserialize_collection_mutation(const abstract_type& type, collection_mutation_i
            auto ksize = in.read_trivial<uint32_t>();
            auto key = in.read(ksize);
            auto vsize = in.read_trivial<uint32_t>();
-            auto value = atomic_cell_view::from_bytes(*utype.type(deserialize_field_index(key)), in.read(vsize));
+            auto value = atomic_cell_view::from_bytes(
+                    utype.type(deserialize_field_index(key))->imr_state().type_info(), in.read(vsize));
            return std::make_pair(key, value);
        });
    },
--- a/collection_mutation.hh
+++ b/collection_mutation.hh
@@ -31,6 +31,7 @@
 #include <iosfwd>

 class abstract_type;
+class bytes_ostream;
 class compaction_garbage_collector;
 class row_tombstone;

@@ -69,7 +70,7 @@ struct collection_mutation_view_description {
    collection_mutation serialize(const abstract_type&) const;
 };

-using collection_mutation_input_stream = utils::linearizing_input_stream<fragment_range<managed_bytes_view>, marshal_exception>;
+using collection_mutation_input_stream = utils::linearizing_input_stream<atomic_cell_value_view, marshal_exception>;

 // Given a linearized collection_mutation_view, returns an auxiliary struct allowing the inspection of each cell.
 // The struct is an observer of the data given by the collection_mutation_view and is only valid while the
@@ -79,7 +80,7 @@ collection_mutation_view_description deserialize_collection_mutation(const abstr

 class collection_mutation_view {
 public:
-    managed_bytes_view data;
+    atomic_cell_value_view data;

    // Is this a noop mutation?
    bool is_empty() const;
@@ -96,7 +97,7 @@ public:
    // calls it on the corresponding description of `this`.
    template <typename F>
    inline decltype(auto) with_deserialized(const abstract_type& type, F f) const {
-        auto stream = collection_mutation_input_stream(fragment_range(data));
+        auto stream = collection_mutation_input_stream(data);
        return f(deserialize_collection_mutation(type, stream));
    }

@@ -121,11 +122,12 @@ public:
 //  The mutation may also contain a collection-wide tombstone.
 class collection_mutation {
 public:
-    managed_bytes _data;
+    using imr_object_type =  imr::utils::object<data::cell::structure>;
+    imr_object_type _data;

    collection_mutation() {}
    collection_mutation(const abstract_type&, collection_mutation_view);
-    collection_mutation(const abstract_type&, managed_bytes);
+    collection_mutation(const abstract_type& type, const bytes_ostream& data);
    operator collection_mutation_view() const;
 };

@@ -134,4 +136,4 @@ collection_mutation merge(const abstract_type&, collection_mutation_view, collec
 collection_mutation difference(const abstract_type&, collection_mutation_view, collection_mutation_view);

 // Serializes the given collection of cells to a sequence of bytes ready to be sent over the CQL protocol.
-bytes_ostream serialize_for_cql(const abstract_type&, collection_mutation_view, cql_serialization_format);
+bytes serialize_for_cql(const abstract_type&, collection_mutation_view, cql_serialization_format);
--- a/compatible_ring_position.hh
+++ b/compatible_ring_position.hh
@@ -43,7 +43,7 @@ public:
    const ::schema& schema() const {
        return *_schema;
    }
-    friend std::strong_ordering tri_compare(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
+    friend int tri_compare(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
        return dht::ring_position_tri_compare(*x._schema, *x._rpv, *y._rpv);
    }
    friend bool operator<(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
@@ -83,7 +83,7 @@ public:
    const ::schema& schema() const {
        return *_schema;
    }
-    friend std::strong_ordering tri_compare(const compatible_ring_position& x, const compatible_ring_position& y) {
+    friend int tri_compare(const compatible_ring_position& x, const compatible_ring_position& y) {
        return dht::ring_position_tri_compare(*x._schema, *x._rp, *y._rp);
    }
    friend bool operator<(const compatible_ring_position& x, const compatible_ring_position& y) {
@@ -133,7 +133,7 @@ public:
        };
        return std::visit(rpv_accessor{}, *_crp_or_view);
    }
-    friend std::strong_ordering tri_compare(const compatible_ring_position_or_view& x, const compatible_ring_position_or_view& y) {
+    friend int tri_compare(const compatible_ring_position_or_view& x, const compatible_ring_position_or_view& y) {
        struct schema_accessor {
            const ::schema& operator()(const compatible_ring_position& crp) {
                return crp.schema();
--- a/compound.hh
+++ b/compound.hh
@@ -73,19 +73,12 @@ private:
     *   <len(value1)><value1><len(value2)><value2>...<len(value_n)><value_n>
     *
     */
-    template<typename RangeOfSerializedComponents, FragmentedMutableView Out>
-    static void serialize_value(RangeOfSerializedComponents&& values, Out out) {
+    template<typename RangeOfSerializedComponents, typename CharOutputIterator>
+    static void serialize_value(RangeOfSerializedComponents&& values, CharOutputIterator& out) {
        for (auto&& val : values) {
            assert(val.size() <= std::numeric_limits<size_type>::max());
            write<size_type>(out, size_type(val.size()));
-            using val_type = std::remove_cvref_t<decltype(val)>;
-            if constexpr (FragmentedView<val_type>) {
-                write_fragmented(out, val);
-            } else if constexpr (std::same_as<val_type, managed_bytes>) {
-                write_fragmented(out, managed_bytes_view(val));
-            } else {
-                write_fragmented(out, single_fragmented_view(val));
-            }
+            out = std::copy(val.begin(), val.end(), out);
        }
    }
    template <typename RangeOfSerializedComponents>
@@ -97,27 +90,25 @@ private:
        return len;
    }
 public:
-    managed_bytes serialize_single(managed_bytes&& v) const {
-        return serialize_value({std::move(v)});
-    }
-    managed_bytes serialize_single(bytes&& v) const {
+    bytes serialize_single(bytes&& v) const {
        return serialize_value({std::move(v)});
    }
    template<typename RangeOfSerializedComponents>
-    static managed_bytes serialize_value(RangeOfSerializedComponents&& values) {
+    static bytes serialize_value(RangeOfSerializedComponents&& values) {
        auto size = serialized_size(values);
        if (size > std::numeric_limits<size_type>::max()) {
            throw std::runtime_error(format("Key size too large: {:d} > {:d}", size, std::numeric_limits<size_type>::max()));
        }
-        managed_bytes b(managed_bytes::initialized_later(), size);
-        serialize_value(values, managed_bytes_mutable_view(b));
+        bytes b(bytes::initialized_later(), size);
+        auto i = b.begin();
+        serialize_value(values, i);
        return b;
    }
    template<typename T>
-    static managed_bytes serialize_value(std::initializer_list<T> values) {
+    static bytes serialize_value(std::initializer_list<T> values) {
        return serialize_value(boost::make_iterator_range(values.begin(), values.end()));
    }
-    managed_bytes serialize_optionals(const std::vector<bytes_opt>& values) const {
+    bytes serialize_optionals(const std::vector<bytes_opt>& values) const {
        return serialize_value(values | boost::adaptors::transformed([] (const bytes_opt& bo) -> bytes_view {
            if (!bo) {
                throw std::logic_error("attempted to create key component from empty optional");
@@ -125,7 +116,7 @@ public:
            return *bo;
        }));
    }
-    managed_bytes serialize_value_deep(const std::vector<data_value>& values) const {
+    bytes serialize_value_deep(const std::vector<data_value>& values) const {
        // TODO: Optimize
        std::vector<bytes> partial;
        partial.reserve(values.size());
@@ -136,26 +127,25 @@ public:
        }
        return serialize_value(partial);
    }
-    managed_bytes decompose_value(const value_type& values) const {
+    bytes decompose_value(const value_type& values) const {
        return serialize_value(values);
    }
    class iterator {
    public:
        using iterator_category = std::input_iterator_tag;
-        using value_type = const managed_bytes_view;
+        using value_type = const bytes_view;
        using difference_type = std::ptrdiff_t;
-        using pointer = const value_type*;
-        using reference = const value_type&;
+        using pointer = const bytes_view*;
+        using reference = const bytes_view&;
    private:
-        managed_bytes_view _v;
-        managed_bytes_view _current;
-        size_t _remaining = 0;
+        bytes_view _v;
+        bytes_view _current;
    private:
        void read_current() {
-            _remaining = _v.size_bytes();
            size_type len;
            {
                if (_v.empty()) {
+                    _v = bytes_view(nullptr, 0);
                    return;
                }
                len = read_simple<size_type>(_v);
@@ -163,16 +153,15 @@ public:
                    throw_with_backtrace<marshal_exception>(format("compound_type iterator - not enough bytes, expected {:d}, got {:d}", len, _v.size()));
                }
            }
-            _current = _v.prefix(len);
-            _v.remove_prefix(_current.size_bytes());
+            _current = bytes_view(_v.begin(), len);
+            _v.remove_prefix(len);
        }
    public:
        struct end_iterator_tag {};
-        iterator(const managed_bytes_view& v) : _v(v) {
+        iterator(const bytes_view& v) : _v(v) {
            read_current();
        }
-        iterator(end_iterator_tag, const managed_bytes_view& v) : _v() {}
-        iterator() {}
+        iterator(end_iterator_tag, const bytes_view& v) : _v(nullptr, 0) {}
        iterator& operator++() {
            read_current();
            return *this;
@@ -184,40 +173,29 @@ public:
        }
        const value_type& operator*() const { return _current; }
        const value_type* operator->() const { return &_current; }
-        bool operator==(const iterator& i) const { return _remaining == i._remaining; }
+        bool operator!=(const iterator& i) const { return _v.begin() != i._v.begin(); }
+        bool operator==(const iterator& i) const { return _v.begin() == i._v.begin(); }
    };
-    static iterator begin(managed_bytes_view v) {
+    static iterator begin(const bytes_view& v) {
        return iterator(v);
    }
-    static iterator end(managed_bytes_view v) {
+    static iterator end(const bytes_view& v) {
        return iterator(typename iterator::end_iterator_tag(), v);
    }
-    static boost::iterator_range<iterator> components(managed_bytes_view v) {
+    static boost::iterator_range<iterator> components(const bytes_view& v) {
        return { begin(v), end(v) };
    }
-    value_type deserialize_value(managed_bytes_view v) const {
+    value_type deserialize_value(bytes_view v) const {
        std::vector<bytes> result;
        result.reserve(_types.size());
        std::transform(begin(v), end(v), std::back_inserter(result), [] (auto&& v) {
-            return to_bytes(v);
+            return bytes(v.begin(), v.end());
        });
        return result;
    }
-    bool less(managed_bytes_view b1, managed_bytes_view b2) const {
-        return with_linearized(b1, [&] (bytes_view bv1) {
-            return with_linearized(b2, [&] (bytes_view bv2) {
-                return less(bv1, bv2);
-            });
-        });
-    }
    bool less(bytes_view b1, bytes_view b2) const {
        return compare(b1, b2) < 0;
    }
-    size_t hash(managed_bytes_view v) const{
-        return with_linearized(v, [&] (bytes_view v) {
-            return hash(v);
-        });
-    }
    size_t hash(bytes_view v) const {
        if (_byte_order_equal) {
            return std::hash<bytes_view>()(v);
@@ -230,13 +208,6 @@ public:
        }
        return h;
    }
-    int compare(managed_bytes_view b1, managed_bytes_view b2) const {
-        return with_linearized(b1, [&] (bytes_view bv1) {
-            return with_linearized(b2, [&] (bytes_view bv2) {
-                return compare(bv1, bv2);
-            });
-        });
-    }
    int compare(bytes_view b1, bytes_view b2) const {
        if (_byte_order_comparable) {
            if (_is_reversed) {
@@ -251,21 +222,15 @@ public:
            });
    }
    // Retruns true iff given prefix has no missing components
-    bool is_full(managed_bytes_view v) const {
+    bool is_full(bytes_view v) const {
        assert(AllowPrefixes == allow_prefixes::yes);
        return std::distance(begin(v), end(v)) == (ssize_t)_types.size();
    }
-    bool is_empty(managed_bytes_view v) const {
-        return v.empty();
-    }
-    bool is_empty(const managed_bytes& v) const {
-        return v.empty();
-    }
    bool is_empty(bytes_view v) const {
        return begin(v) == end(v);
    }
-    void validate(managed_bytes_view v) const {
-        std::vector<managed_bytes_view> values(begin(v), end(v));
+    void validate(bytes_view v) const {
+        std::vector<bytes_view> values(begin(v), end(v));
        if (AllowPrefixes == allow_prefixes::no && values.size() < _types.size()) {
            throw marshal_exception(fmt::format("compound::validate(): non-prefixable compound cannot be a prefix"));
        }
@@ -278,13 +243,6 @@ public:
            _types[i]->validate(values[i], cql_serialization_format::internal());
        }
    }
-    bool equal(managed_bytes_view v1, managed_bytes_view v2) const {
-        return with_linearized(v1, [&] (bytes_view bv1) {
-            return with_linearized(v2, [&] (bytes_view bv2) {
-                return equal(bv1, bv2);
-            });
-        });
-    }
    bool equal(bytes_view v1, bytes_view v2) const {
        if (_byte_order_equal) {
            return compare_unsigned(v1, v2) == 0;
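
Note on the layout documented for serialize_value() in the compound.hh hunks above: each component is stored as a length prefix followed by the component bytes, i.e. <len(value1)><value1>...<len(value_n)><value_n>. The block below is a minimal standalone sketch of that layout and is not the Scylla implementation. It assumes a 16-bit big-endian length prefix, uses std::string and std::string_view as stand-ins for bytes and bytes_view, and the names serialize_components and split_components are hypothetical.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

// Assumption: the length prefix is a 16-bit unsigned integer, written big-endian.
using size_type = uint16_t;

// Hypothetical helper: packs components as <len><bytes><len><bytes>...
std::string serialize_components(const std::vector<std::string>& values) {
    std::string out;
    for (const auto& v : values) {
        if (v.size() > std::numeric_limits<size_type>::max()) {
            throw std::runtime_error("component too large for length prefix");
        }
        const auto len = static_cast<size_type>(v.size());
        out.push_back(static_cast<char>(len >> 8));    // high byte first
        out.push_back(static_cast<char>(len & 0xff));  // then low byte
        out.append(v);
    }
    return out;
}

// Hypothetical helper: walks the packed buffer and returns a view of each
// component. The views point into `packed` and are valid only while the
// underlying buffer is alive.
std::vector<std::string_view> split_components(std::string_view packed) {
    std::vector<std::string_view> result;
    while (!packed.empty()) {
        if (packed.size() < sizeof(size_type)) {
            throw std::runtime_error("truncated length prefix");
        }
        const auto len = static_cast<size_type>(
            (static_cast<uint8_t>(packed[0]) << 8) | static_cast<uint8_t>(packed[1]));
        packed.remove_prefix(sizeof(size_type));
        if (packed.size() < len) {
            throw std::runtime_error("not enough bytes for component");
        }
        result.push_back(packed.substr(0, len));
        packed.remove_prefix(len);
    }
    return result;
}
```

As in the iterator shown in the diff, decoding reads a length, checks that enough bytes remain, takes a view of that many bytes, and advances past them. An empty component is still encoded as a zero length prefix, so it survives a round trip.
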
--- a/compound_compat.hh
+++ b/compound_compat.hh
@@ -54,9 +54,9 @@ template <typename CompoundType>
 class legacy_compound_view {
    static_assert(!CompoundType::is_prefixable, "Legacy view not defined for prefixes");
    CompoundType& _type;
-    managed_bytes_view _packed;
+    bytes_view _packed;
 public:
-    legacy_compound_view(CompoundType& c, managed_bytes_view packed)
+    legacy_compound_view(CompoundType& c, bytes_view packed)
        : _type(c)
        , _packed(packed)
    { }
@@ -147,18 +147,18 @@ public:
        { }

        // @k1 and @k2 must be serialized using @type, which was passed to the constructor.
-        int operator()(managed_bytes_view k1, managed_bytes_view k2) const {
+        int operator()(bytes_view k1, bytes_view k2) const {
            if (_type.is_singular()) {
                return compare_unsigned(*_type.begin(k1), *_type.begin(k2));
            }
            return lexicographical_tri_compare(
                _type.begin(k1), _type.end(k1),
                _type.begin(k2), _type.end(k2),
-                [] (const managed_bytes_view& c1, const managed_bytes_view& c2) -> int {
+                [] (const bytes_view& c1, const bytes_view& c2) -> int {
                    if (c1.size() != c2.size() || !c1.size()) {
                        return c1.size() < c2.size() ? -1 : c1.size() ? 1 : 0;
                    }
-                    return compare_unsigned(c1, c2);
+                    return memcmp(c1.begin(), c2.begin(), c1.size());
                });
        }
    };
@@ -188,7 +188,7 @@ public:
 // @packed is assumed to be serialized using supplied @type.
 template <typename CompoundType>
 static inline
-bytes to_legacy(CompoundType& type, managed_bytes_view packed) {
+bytes to_legacy(CompoundType& type, bytes_view packed) {
    legacy_compound_view<CompoundType> lv(type, packed);
    bytes legacy_form(bytes::initialized_later(), lv.size());
    std::copy(lv.begin(), lv.end(), legacy_form.begin());
@@ -264,12 +264,6 @@ private:
    static void write_value(Value&& val, CharOutputIterator& out) {
        out = std::copy(val.begin(), val.end(), out);
    }
-    template<typename CharOutputIterator>
-    static void write_value(managed_bytes_view val, CharOutputIterator& out) {
-        for (bytes_view frag : fragment_range(val)) {
-            out = std::copy(frag.begin(), frag.end(), out);
-        }
-    }
    template <typename CharOutputIterator>
    static void write_value(const data_value& val, CharOutputIterator& out) {
        val.serialize(out);
@@ -411,7 +405,6 @@ public:
        iterator(end_iterator_tag) : _v(nullptr, 0) {}

    public:
-        iterator() : iterator(end_iterator_tag()) {}
        iterator& operator++() {
            read_current();
            return *this;
--- a/conf/scylla.yaml
+++ b/conf/scylla.yaml
@@ -99,8 +99,8 @@ listen_address: localhost
 # listen_on_broadcast_address: false

 # port for the CQL native transport to listen for clients on
-# For security reasons, you should not expose this port to the internet. Firewall it if needed.
-# To disable the CQL native transport, remove this option and configure native_transport_port_ssl.
+# For security reasons, you should not expose this port to the internet.  Firewall it if needed.
+# To disable the CQL native transport, set this option to 0.
 native_transport_port: 9042

 # Like native_transport_port, but clients are forwarded to specific shards, based on the
--- a/configure.py
+++ b/configure.py
@@ -59,9 +59,6 @@ i18n_xlat = {
 }

 python3_dependencies = subprocess.run('./install-dependencies.sh --print-python3-runtime-packages', shell=True, capture_output=True, encoding='utf-8').stdout.strip()
-node_exporter_filename = subprocess.run('./install-dependencies.sh --print-node-exporter-filename', shell=True, capture_output=True, encoding='utf-8').stdout.strip()
-node_exporter_dirname = os.path.basename(node_exporter_filename).rstrip('.tar.gz')
-

 def pkgname(name):
    if name in i18n_xlat:
@@ -126,21 +123,18 @@ def ensure_tmp_dir_exists():
        os.makedirs(tempfile.tempdir)


-def try_compile_and_link(compiler, source='', flags=[], verbose=False):
+def try_compile_and_link(compiler, source='', flags=[]):
    ensure_tmp_dir_exists()
    with tempfile.NamedTemporaryFile() as sfile:
        ofile = tempfile.mktemp()
        try:
            sfile.file.write(bytes(source, 'utf-8'))
            sfile.file.flush()
-            ret = subprocess.run([compiler, '-x', 'c++', '-o', ofile, sfile.name] + args.user_cflags.split() + flags,
-                                 capture_output=True)
-            if verbose:
-                print(f"Compilation failed: {compiler} -x c++ -o {ofile} {sfile.name} {args.user_cflags} {flags}")
-                print(source)
-                print(ret.stdout.decode('utf-8'))
-                print(ret.stderr.decode('utf-8'))
-            return ret.returncode == 0
+            # We can't write to /dev/null, since in some cases (-ftest-coverage) gcc will create an auxiliary
+            # output file based on the name of the output file, and "/dev/null.gcsa" is not a good name
+            return subprocess.call([compiler, '-x', 'c++', '-o', ofile, sfile.name] + args.user_cflags.split() + flags,
+                                   stdout=subprocess.DEVNULL,
+                                   stderr=subprocess.DEVNULL) == 0
        finally:
            if os.path.exists(ofile):
                os.unlink(ofile)
@@ -167,21 +161,7 @@ def linker_flags(compiler):
            link_flags.append(threads_flag)
        return ' '.join(link_flags)
    else:
-        linker = ''
-        try:
-            subprocess.call(["gold", "-v"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
-            linker = 'gold'
-        except:
-            pass
-        try:
-            subprocess.call(["lld", "-v"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
-            linker = 'lld'
-        except:
-            pass
-        if linker:
-            print(f'Linker {linker} found, but the compilation attempt failed, defaulting to default system linker')
-        else:
-            print('Note: neither lld nor gold found; using default system linker')
+        print('Note: neither lld nor gold found; using default system linker')
        return ''


@@ -275,34 +255,29 @@ modes = {
        'cxxflags': '-DDEBUG -DSANITIZE -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION',
        'cxx_ld_flags': '',
        'stack-usage-threshold': 1024*40,
-        'optimization-level': 'g',
    },
    'release': {
-        'cxxflags': '-ffunction-sections -fdata-sections ',
+        'cxxflags': '-O3 -ffunction-sections -fdata-sections ',
        'cxx_ld_flags': '-Wl,--gc-sections',
        'stack-usage-threshold': 1024*13,
-        'optimization-level': '3',
    },
    'dev': {
-        'cxxflags': '-DDEVEL -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSCYLLA_ENABLE_ERROR_INJECTION',
+        'cxxflags': '-O1 -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSCYLLA_ENABLE_ERROR_INJECTION',
        'cxx_ld_flags': '',
        'stack-usage-threshold': 1024*21,
-        'optimization-level': '2',
    },
    'sanitize': {
-        'cxxflags': '-DDEBUG -DSANITIZE -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION',
+        'cxxflags': '-Os -DDEBUG -DSANITIZE -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION',
        'cxx_ld_flags': '',
        'stack-usage-threshold': 1024*50,
-        'optimization-level': 's',
    }
 }

 scylla_tests = set([
    'test/boost/UUID_test',
-    'test/boost/cdc_generation_test',
    'test/boost/aggregate_fcts_test',
    'test/boost/allocation_strategy_test',
-    'test/boost/alternator_unit_test',
+    'test/boost/alternator_base64_test',
    'test/boost/anchorless_list_test',
    'test/boost/auth_passwords_test',
    'test/boost/auth_resource_test',
@@ -354,8 +329,8 @@ scylla_tests = set([
    'test/boost/gossip_test',
    'test/boost/gossiping_property_file_snitch_test',
    'test/boost/hash_test',
-    'test/boost/hashers_test',
    'test/boost/idl_test',
+    'test/boost/imr_test',
    'test/boost/input_stream_test',
    'test/boost/json_cql_query_test',
    'test/boost/json_test',
@@ -369,10 +344,10 @@ scylla_tests = set([
    'test/boost/estimated_histogram_test',
    'test/boost/logalloc_test',
    'test/boost/managed_vector_test',
-    'test/boost/managed_bytes_test',
    'test/boost/intrusive_array_test',
    'test/boost/map_difference_test',
    'test/boost/memtable_test',
+    'test/boost/meta_test',
    'test/boost/multishard_mutation_query_test',
    'test/boost/murmur_hash_test',
    'test/boost/mutation_fragment_test',
@@ -397,7 +372,6 @@ scylla_tests = set([
    'test/boost/schema_change_test',
    'test/boost/schema_registry_test',
    'test/boost/secondary_index_test',
-    'test/boost/tracing',
    'test/boost/index_with_paging_test',
    'test/boost/serialization_test',
    'test/boost/serialized_action_test',
@@ -412,7 +386,6 @@ scylla_tests = set([
    'test/boost/sstable_directory_test',
    'test/boost/sstable_test',
    'test/boost/sstable_move_test',
-    'test/boost/statement_restrictions_test',
    'test/boost/storage_proxy_test',
    'test/boost/top_k_test',
    'test/boost/transport_test',
@@ -428,13 +401,9 @@ scylla_tests = set([
    'test/boost/vint_serialization_test',
    'test/boost/virtual_reader_test',
    'test/boost/bptree_test',
-    'test/boost/btree_test',
-    'test/boost/radix_tree_test',
    'test/boost/double_decker_test',
    'test/boost/stall_free_test',
-    'test/boost/raft_address_map_test',
-    'test/boost/raft_sys_table_storage_test',
-    'test/boost/sstable_set_test',
+    'test/boost/imr_test',
    'test/manual/ec2_snitch_test',
    'test/manual/enormous_table_scan_test',
    'test/manual/gce_snitch_test',
@@ -453,7 +422,6 @@ scylla_tests = set([
    'test/perf/perf_mutation',
    'test/perf/perf_collection',
    'test/perf/perf_row_cache_update',
-    'test/perf/perf_row_cache_reads',
    'test/perf/perf_simple_query',
    'test/perf/perf_sstable',
    'test/unit/lsa_async_eviction_test',
@@ -461,11 +429,7 @@ scylla_tests = set([
    'test/unit/row_cache_alloc_stress_test',
    'test/unit/row_cache_stress_test',
    'test/unit/bptree_stress_test',
-    'test/unit/btree_stress_test',
    'test/unit/bptree_compaction_test',
-    'test/unit/btree_compaction_test',
-    'test/unit/radix_tree_stress_test',
-    'test/unit/radix_tree_compaction_test',
 ])

 perf_tests = set([
@@ -479,15 +443,13 @@ perf_tests = set([

 raft_tests = set([
    'test/raft/replication_test',
-    'test/raft/fsm_test',
-    'test/raft/etcd_test',
+    'test/boost/raft_fsm_test',
 ])

 apps = set([
    'scylla',
    'test/tools/cql_repl',
    'tools/scylla-types',
-    'tools/scylla-sstable-index',
 ])

 tests = scylla_tests | perf_tests | raft_tests
@@ -527,8 +489,6 @@ arg_parser.add_argument('--dpdk-target', action='store', dest='dpdk_target', def
                        help='Path to DPDK SDK target location (e.g. <DPDK SDK dir>/x86_64-native-linuxapp-gcc)')
 arg_parser.add_argument('--debuginfo', action='store', dest='debuginfo', type=int, default=1,
                        help='Enable(1)/disable(0)compiler debug information generation')
-arg_parser.add_argument('--optimization-level', action='append', dest='mode_o_levels', metavar='MODE=LEVEL', default=[],
-                        help=f'Override default compiler optimization level for mode (defaults: {" ".join([x+"="+modes[x]["optimization-level"] for x in modes])})')
 arg_parser.add_argument('--static-stdc++', dest='staticcxx', action='store_true',
                        help='Link libgcc and libstdc++ statically')
 arg_parser.add_argument('--static-thrift', dest='staticthrift', action='store_true',
@@ -551,34 +511,25 @@ arg_parser.add_argument('--with-antlr3', dest='antlr3_exec', action='store', def
                        help='path to antlr3 executable')
 arg_parser.add_argument('--with-ragel', dest='ragel_exec', action='store', default='ragel',
        help='path to ragel executable')
+arg_parser.add_argument('--build-raft', dest='build_raft', action='store_true', default=False,
+                        help='build raft code')
 add_tristate(arg_parser, name='stack-guards', dest='stack_guards', help='Use stack guards')
 arg_parser.add_argument('--verbose', dest='verbose', action='store_true',
                        help='Make configure.py output more verbose (useful for debugging the build process itself)')
 arg_parser.add_argument('--test-repeat', dest='test_repeat', action='store', type=str, default='1',
                         help='Set number of times to repeat each unittest.')
 arg_parser.add_argument('--test-timeout', dest='test_timeout', action='store', type=str, default='7200')
-arg_parser.add_argument('--clang-inline-threshold', action='store', type=int, dest='clang_inline_threshold', default=-1,
-                        help="LLVM-specific inline threshold compilation parameter")
 args = arg_parser.parse_args()

+if not args.build_raft:
+    all_artifacts.difference_update(raft_tests)
+    tests.difference_update(raft_tests)
+
 defines = ['XXH_PRIVATE_API',
           'SEASTAR_TESTING_MAIN',
 ]

-extra_cxxflags = {
-    'debug': {},
-    'dev': {},
-    'release': {},
-    'sanitize': {}
-}
-
-scylla_raft_core = [
-    'raft/raft.cc',
-    'raft/server.cc',
-    'raft/fsm.cc',
-    'raft/tracker.cc',
-    'raft/log.cc',
-]
+extra_cxxflags = {}

 scylla_core = (['database.cc',
                'absl-flat_hash_map.cc',
@@ -621,16 +572,14 @@ scylla_core = (['database.cc',
                'counters.cc',
                'compress.cc',
                'zstd.cc',
+                'sstables/mp_row_consumer.cc',
                'sstables/sstables.cc',
                'sstables/sstables_manager.cc',
-                'sstables/sstable_set.cc',
-                'sstables/mx/reader.cc',
                'sstables/mx/writer.cc',
-                'sstables/kl/reader.cc',
                'sstables/kl/writer.cc',
                'sstables/sstable_version.cc',
                'sstables/compress.cc',
-                'sstables/sstable_mutation_reader.cc',
+                'sstables/partition.cc',
                'sstables/compaction.cc',
                'sstables/compaction_strategy.cc',
                'sstables/size_tiered_compaction_strategy.cc',
@@ -887,6 +836,7 @@ scylla_core = (['database.cc',
                'vint-serialization.cc',
                'utils/arch/powerpc/crc32-vpmsum/crc32_wrapper.cc',
                'querier.cc',
+                'data/cell.cc',
                'mutation_writer/multishard_writer.cc',
                'multishard_mutation_query.cc',
                'reader_concurrency_semaphore.cc',
@@ -897,16 +847,8 @@ scylla_core = (['database.cc',
                'utils/error_injection.cc',
                'mutation_writer/timestamp_based_splitting_writer.cc',
                'mutation_writer/shard_based_splitting_writer.cc',
-                'mutation_writer/feed_writers.cc',
                'lua.cc',
-                'service/raft/schema_raft_state_machine.cc',
-                'service/raft/raft_sys_table_storage.cc',
-                'serializer.cc',
-                'service/raft/raft_rpc.cc',
-                'service/raft/raft_gossip_failure_detector.cc',
-                'service/raft/raft_services.cc',
-                ] + [Antlr3Grammar('cql3/Cql.g')] + [Thrift('interface/cassandra.thrift', 'Cassandra')] \
-                  + scylla_raft_core
+                ] + [Antlr3Grammar('cql3/Cql.g')] + [Thrift('interface/cassandra.thrift', 'Cassandra')]
               )

 api = ['api/api.cc',
@@ -1002,7 +944,6 @@ idls = ['idl/gossip_digest.idl.hh',
        'idl/view.idl.hh',
        'idl/messaging_service.idl.hh',
        'idl/paxos.idl.hh',
-        'idl/raft.idl.hh',
        ]

 headers = find_headers('.', excluded_dirs=['idl', 'build', 'seastar', '.git'])
@@ -1027,14 +968,20 @@ scylla_tests_dependencies = scylla_core + idls + scylla_tests_generic_dependenci
    'test/lib/random_schema.cc',
 ]

-scylla_raft_dependencies = scylla_raft_core + ['utils/uuid.cc']
+scylla_raft_dependencies = [
+    'raft/raft.cc',
+    'raft/server.cc',
+    'raft/fsm.cc',
+    'raft/progress.cc',
+    'raft/log.cc',
+    'utils/uuid.cc'
+]

 deps = {
    'scylla': idls + ['main.cc', 'release.cc', 'utils/build_id.cc'] + scylla_core + api + alternator + redis,
    'test/tools/cql_repl': idls + ['test/tools/cql_repl.cc'] + scylla_core + scylla_tests_generic_dependencies,
    #FIXME: we don't need all of scylla_core here, only the types module, need to modularize scylla_core.
    'tools/scylla-types': idls + ['tools/scylla-types.cc'] + scylla_core,
-    'tools/scylla-sstable-index': idls + ['tools/scylla-sstable-index.cc'] + scylla_core,
 }

 pure_boost_tests = set([
@@ -1054,13 +1001,13 @@ pure_boost_tests = set([
    'test/boost/dynamic_bitset_test',
    'test/boost/enum_option_test',
    'test/boost/enum_set_test',
-    'test/boost/hashers_test',
    'test/boost/idl_test',
    'test/boost/json_test',
    'test/boost/keys_test',
    'test/boost/like_matcher_test',
    'test/boost/linearizing_input_stream_test',
    'test/boost/map_difference_test',
+    'test/boost/meta_test',
    'test/boost/nonwrapping_range_test',
    'test/boost/observable_test',
    'test/boost/range_test',
@@ -1070,13 +1017,11 @@ pure_boost_tests = set([
    'test/boost/top_k_test',
    'test/boost/vint_serialization_test',
    'test/boost/bptree_test',
-    'test/boost/utf8_test',
-    'test/boost/btree_test',
    'test/manual/streaming_histogram_test',
 ])

 tests_not_using_seastar_test_framework = set([
-    'test/boost/alternator_unit_test',
+    'test/boost/alternator_base64_test',
    'test/boost/small_vector_test',
    'test/manual/gossip',
    'test/manual/message',
@@ -1091,11 +1036,7 @@ tests_not_using_seastar_test_framework = set([
    'test/unit/lsa_sync_eviction_test',
    'test/unit/row_cache_alloc_stress_test',
    'test/unit/bptree_stress_test',
-    'test/unit/btree_stress_test',
    'test/unit/bptree_compaction_test',
-    'test/unit/btree_compaction_test',
-    'test/unit/radix_tree_stress_test',
-    'test/unit/radix_tree_compaction_test',
    'test/manual/sstable_scan_footprint_test',
 ]) | pure_boost_tests

@@ -1138,6 +1079,8 @@ deps['test/boost/estimated_histogram_test'] = ['test/boost/estimated_histogram_t
 deps['test/boost/anchorless_list_test'] = ['test/boost/anchorless_list_test.cc']
 deps['test/perf/perf_fast_forward'] += ['release.cc']
 deps['test/perf/perf_simple_query'] += ['release.cc']
+deps['test/boost/meta_test'] = ['test/boost/meta_test.cc']
+deps['test/boost/imr_test'] = ['test/boost/imr_test.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
 deps['test/boost/reusable_buffer_test'] = [
    "test/boost/reusable_buffer_test.cc",
    "test/lib/log.cc",
@@ -1152,11 +1095,10 @@ deps['test/boost/linearizing_input_stream_test'] = [
 ]

 deps['test/boost/duration_test'] += ['test/lib/exception_utils.cc']
-deps['test/boost/alternator_unit_test'] += ['alternator/base64.cc']
+deps['test/boost/alternator_base64_test'] += ['alternator/base64.cc']

 deps['test/raft/replication_test'] = ['test/raft/replication_test.cc'] + scylla_raft_dependencies
-deps['test/raft/fsm_test'] =  ['test/raft/fsm_test.cc', 'test/lib/log.cc'] + scylla_raft_dependencies
-deps['test/raft/etcd_test'] =  ['test/raft/etcd_test.cc', 'test/lib/log.cc'] + scylla_raft_dependencies
+deps['test/boost/raft_fsm_test'] =  ['test/boost/raft_fsm_test.cc', 'test/lib/log.cc'] + scylla_raft_dependencies

 deps['utils/gz/gen_crc_combine_table'] = ['utils/gz/gen_crc_combine_table.cc']

@@ -1197,6 +1139,7 @@ warnings = [
    '-Wno-delete-non-abstract-non-virtual-dtor',
    '-Wno-unknown-attributes',
    '-Wno-braced-scalar-init',
+    '-Wno-unused-value',
    '-Wno-range-loop-construct',
    '-Wno-unused-function',
    '-Wno-implicit-int-float-conversion',
@@ -1212,18 +1155,9 @@ warnings = [w

 warnings = ' '.join(warnings + ['-Wno-error=deprecated-declarations'])

-def clang_inline_threshold():
-    if args.clang_inline_threshold != -1:
-        return args.clang_inline_threshold
-    elif platform.machine() == 'aarch64':
-        # we see miscompiles with 1200 and above with format("{}", uuid)
-        return 600
-    else:
-        return 2500
-
 optimization_flags = [
    '--param inline-unit-growth=300', # gcc
-    f'-mllvm -inline-threshold={clang_inline_threshold()}',  # clang
+    '-mllvm -inline-threshold=2500',  # clang
 ]
 optimization_flags = [o
                      for o in optimization_flags
@@ -1234,15 +1168,6 @@ if flag_supported(flag='-Wstack-usage=4096', compiler=args.cxx):
    for mode in modes:
        modes[mode]['cxxflags'] += f' -Wstack-usage={modes[mode]["stack-usage-threshold"]} -Wno-error=stack-usage='

-for mode_level in args.mode_o_levels:
-    ( mode, level ) = mode_level.split('=', 2)
-    if mode not in modes:
-        raise Exception(f'Mode {mode} is missing, cannot configure optimization level for it')
-    modes[mode]['optimization-level'] = level
-
-for mode in modes:
-    modes[mode]['cxxflags'] += f' -O{modes[mode]["optimization-level"]}'
-
 linker_flags = linker_flags(compiler=args.cxx)

 dbgflag = '-g -gz' if args.debuginfo else ''
@@ -1308,8 +1233,7 @@ compiler_test_src = '''
 int main() { return 0; }
 '''
 if not try_compile_and_link(compiler=args.cxx, source=compiler_test_src):
-    try_compile_and_link(compiler=args.cxx, source=compiler_test_src, verbose=True)
-    print('Wrong compiler version or incorrect flags. Scylla needs GCC >= 10.1.1 with coroutines (-fcoroutines) or clang >= 10.0.0 to compile.')
+    print('Wrong GCC version. Scylla needs GCC >= 10.1.1 to compile.')
    sys.exit(1)

 if not try_compile(compiler=args.cxx, source='#include <boost/version.hpp>'):
@@ -1360,9 +1284,7 @@ scylla_release = file.read().strip()
 file = open(f'{outdir}/SCYLLA-PRODUCT-FILE', 'r')
 scylla_product = file.read().strip()

-for m in ['debug', 'release', 'sanitize', 'dev']:
-    cxxflags = "-DSCYLLA_VERSION=\"\\\"" + scylla_version + "\\\"\" -DSCYLLA_RELEASE=\"\\\"" + scylla_release + "\\\"\" -DSCYLLA_BUILD_MODE=\"\\\"" + m + "\\\"\""
-    extra_cxxflags[m]["release.cc"] = cxxflags
+extra_cxxflags["release.cc"] = "-DSCYLLA_VERSION=\"\\\"" + scylla_version + "\\\"\" -DSCYLLA_RELEASE=\"\\\"" + scylla_release + "\\\"\""

 for m in ['debug', 'release', 'sanitize']:
    modes[m]['cxxflags'] += ' ' + dbgflag
@@ -1516,7 +1438,6 @@ abseil_libs = ['absl/' + lib for lib in [
    'numeric/libabsl_int128.a',
    'hash/libabsl_city.a',
    'hash/libabsl_hash.a',
-    'hash/libabsl_wyhash.a',
    'base/libabsl_malloc_internal.a',
    'base/libabsl_spinlock_wait.a',
    'base/libabsl_base.a',
@@ -1537,6 +1458,9 @@ libs = ' '.join([maybe_static(args.staticyamlcpp, '-lyaml-cpp'), '-latomic', '-l
 if not args.staticboost:
    args.user_cflags += ' -DBOOST_TEST_DYN_LINK'

+if args.build_raft:
+    args.user_cflags += ' -DENABLE_SCYLLA_RAFT'
+
 # thrift version detection, see #4538
 proc_res = subprocess.run(["thrift", "-version"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
 proc_res_output = proc_res.stdout.decode("utf-8")
@@ -1817,8 +1741,8 @@ with open(buildfile_tmp, 'w') as f:
        for obj in compiles:
            src = compiles[obj]
            f.write('build {}: cxx.{} {} || {} {}\n'.format(obj, mode, src, seastar_dep, gen_headers_dep))
-            if src in extra_cxxflags[mode]:
-                f.write('    cxxflags = {seastar_cflags} $cxxflags $cxxflags_{mode} {extra_cxxflags}\n'.format(mode=mode, extra_cxxflags=extra_cxxflags[mode][src], **modeval))
+            if src in extra_cxxflags:
+                f.write('    cxxflags = {seastar_cflags} $cxxflags $cxxflags_{mode} {extra_cxxflags}\n'.format(mode=mode, extra_cxxflags=extra_cxxflags[src], **modeval))
        for swagger in swaggers:
            hh = swagger.headers(gen_dir)[0]
            cc = swagger.sources(gen_dir)[0]
@@ -1874,7 +1798,7 @@ with open(buildfile_tmp, 'w') as f:
        f.write(textwrap.dedent('''\
            build $builddir/{mode}/iotune: copy $builddir/{mode}/seastar/apps/iotune/iotune
            ''').format(**locals()))
-        f.write('build $builddir/{mode}/dist/tar/{scylla_product}-package.tar.gz: package $builddir/{mode}/scylla $builddir/{mode}/iotune $builddir/SCYLLA-RELEASE-FILE $builddir/SCYLLA-VERSION-FILE $builddir/debian/debian $builddir/node_exporter | always\n'.format(**locals()))
+        f.write('build $builddir/{mode}/dist/tar/{scylla_product}-package.tar.gz: package $builddir/{mode}/scylla $builddir/{mode}/iotune $builddir/SCYLLA-RELEASE-FILE $builddir/SCYLLA-VERSION-FILE $builddir/debian/debian | always\n'.format(**locals()))
        f.write('  mode = {mode}\n'.format(**locals()))
        f.write(f'build $builddir/dist/{mode}/redhat: rpmbuild $builddir/{mode}/dist/tar/{scylla_product}-package.tar.gz\n')
        f.write(f'  mode = {mode}\n')
@@ -1883,7 +1807,7 @@ with open(buildfile_tmp, 'w') as f:
        f.write(f'build dist-server-{mode}: phony $builddir/dist/{mode}/redhat $builddir/dist/{mode}/debian\n')
        f.write(f'build dist-jmx-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-jmx-package.tar.gz dist-jmx-rpm dist-jmx-deb\n')
        f.write(f'build dist-tools-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-tools-package.tar.gz dist-tools-rpm dist-tools-deb\n')
-        f.write(f'build dist-python3-{mode}: phony dist-python3-tar dist-python3-rpm dist-python3-deb\n')
+        f.write(f'build dist-python3-{mode}: phony dist-python3-tar dist-python3-rpm dist-python3-deb compat-python3-rpm compat-python3-deb\n')
        f.write(f'build dist-unified-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-unified-package-{scylla_version}.{scylla_release}.tar.gz\n')
        f.write(f'build $builddir/{mode}/dist/tar/{scylla_product}-unified-package-{scylla_version}.{scylla_release}.tar.gz: unified $builddir/{mode}/dist/tar/{scylla_product}-package.tar.gz $builddir/{mode}/dist/tar/{scylla_product}-python3-package.tar.gz $builddir/{mode}/dist/tar/{scylla_product}-jmx-package.tar.gz $builddir/{mode}/dist/tar/{scylla_product}-tools-package.tar.gz | always\n')
        f.write(f'  mode = {mode}\n')
@@ -1949,6 +1873,22 @@ with open(buildfile_tmp, 'w') as f:
        build dist-tools-tar: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-tools-package.tar.gz'.format(mode=mode, scylla_product=scylla_product) for mode in build_modes])}
        build dist-tools: phony dist-tools-tar dist-tools-rpm dist-tools-deb

+        rule compat-python3-reloc
+          command = mkdir -p $builddir/release && ln -f $dir/$artifact $builddir/release/
+        rule compat-python3-rpm
+          command = cd $dir && ./reloc/build_rpm.sh --reloc-pkg $artifact --builddir ../../build/redhat
+        rule compat-python3-deb
+          command = cd $dir && ./reloc/build_deb.sh --reloc-pkg $artifact --builddir ../../build/debian
+        build $builddir/release/{scylla_product}-python3-package.tar.gz: compat-python3-reloc tools/python3/build/{scylla_product}-python3-package.tar.gz
+          dir = tools/python3
+          artifact = $builddir/{scylla_product}-python3-package.tar.gz
+        build compat-python3-rpm: compat-python3-rpm tools/python3/build/{scylla_product}-python3-package.tar.gz
+          dir = tools/python3
+          artifact = $builddir/{scylla_product}-python3-package.tar.gz
+        build compat-python3-deb: compat-python3-deb tools/python3/build/{scylla_product}-python3-package.tar.gz
+          dir = tools/python3
+          artifact = $builddir/{scylla_product}-python3-package.tar.gz
+
        build tools/python3/build/{scylla_product}-python3-package.tar.gz: build-submodule-reloc
          reloc_dir = tools/python3
          args = --packages "{python3_dependencies}"
@@ -1959,7 +1899,7 @@ with open(buildfile_tmp, 'w') as f:
          dir = tools/python3
          artifact = $builddir/{scylla_product}-python3-package.tar.gz
        build dist-python3-tar: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-python3-package.tar.gz'.format(mode=mode, scylla_product=scylla_product) for mode in build_modes])}
-        build dist-python3: phony dist-python3-tar dist-python3-rpm dist-python3-deb $builddir/release/{scylla_product}-python3-package.tar.gz
+        build dist-python3: phony dist-python3-tar dist-python3-rpm dist-python3-deb $builddir/release/{scylla_product}-python3-package.tar.gz compat-python3-rpm compat-python3-deb
        build dist-deb: phony dist-server-deb dist-python3-deb dist-jmx-deb dist-tools-deb
        build dist-rpm: phony dist-server-rpm dist-python3-rpm dist-jmx-rpm dist-tools-rpm
        build dist-tar: phony dist-unified-tar dist-server-tar dist-python3-tar dist-jmx-tar dist-tools-tar
@@ -2017,9 +1957,6 @@ with open(buildfile_tmp, 'w') as f:
        rule debian_files_gen
            command = ./dist/debian/debian_files_gen.py
        build $builddir/debian/debian: debian_files_gen | always
-        rule extract_node_exporter
-            command = tar -C build -xvpf {node_exporter_filename} --no-same-owner && rm -rfv build/node_exporter && mv -v build/{node_exporter_dirname} build/node_exporter
-        build $builddir/node_exporter: extract_node_exporter | always
        ''').format(**globals()))

 os.rename(buildfile_tmp, buildfile)
--- a/converting_mutation_partition_applier.cc
+++ b/converting_mutation_partition_applier.cc
@@ -36,9 +36,9 @@ converting_mutation_partition_applier::upgrade_cell(const abstract_type& new_typ
                                atomic_cell::collection_member cm) {
    if (cell.is_live() && !old_type.is_counter()) {
        if (cell.is_live_and_has_ttl()) {
-            return atomic_cell::make_live(new_type, cell.timestamp(), cell.value(), cell.expiry(), cell.ttl(), cm);
+            return atomic_cell::make_live(new_type, cell.timestamp(), cell.value().linearize(), cell.expiry(), cell.ttl(), cm);
        }
-        return atomic_cell::make_live(new_type, cell.timestamp(), cell.value(), cm);
+        return atomic_cell::make_live(new_type, cell.timestamp(), cell.value().linearize(), cm);
    } else {
        return atomic_cell(new_type, cell);
    }
--- a/counters.cc
+++ b/counters.cc
@@ -19,10 +19,16 @@
 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.
 */

+#include "service/storage_service.hh"
 #include "counters.hh"
 #include "mutation.hh"
 #include "combine.hh"

+counter_id counter_id::local()
+{
+    return counter_id(service::get_local_storage_service().get_local_id());
+}
+
 std::ostream& operator<<(std::ostream& os, const counter_id& id) {
    return os << id.to_uuid();
 }
@@ -118,14 +124,16 @@ void counter_cell_view::apply(const column_definition& cdef, atomic_cell_or_coll

    assert(!dst_ac.is_counter_update());
    assert(!src_ac.is_counter_update());
+ with_linearized(dst_ac, [&] (counter_cell_view dst_ccv) {
+  with_linearized(src_ac, [&] (counter_cell_view src_ccv) {

-    auto src_ccv = counter_cell_view(src_ac);
-    auto dst_ccv = counter_cell_view(dst_ac);
    if (dst_ccv.shard_count() >= src_ccv.shard_count()) {
        auto dst_amc = dst.as_mutable_atomic_cell(cdef);
        auto src_amc = src.as_mutable_atomic_cell(cdef);
-        if (apply_in_place(cdef, dst_amc, src_amc)) {
-            return;
+        if (!dst_amc.is_value_fragmented() && !src_amc.is_value_fragmented()) {
+            if (apply_in_place(cdef, dst_amc, src_amc)) {
+                return;
+            }
        }
    }

@@ -140,6 +148,8 @@ void counter_cell_view::apply(const column_definition& cdef, atomic_cell_or_coll

    auto cell = result.build(std::max(dst_ac.timestamp(), src_ac.timestamp()));
    src = std::exchange(dst, atomic_cell_or_collection(std::move(cell)));
+  });
+ });
 }

 std::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, atomic_cell_view b)
@@ -154,8 +164,8 @@ std::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, ato
        return { };
    }

-    auto a_ccv = counter_cell_view(a);
-    auto b_ccv = counter_cell_view(b);
+ return with_linearized(a, [&] (counter_cell_view a_ccv) {
+  return with_linearized(b, [&] (counter_cell_view b_ccv) {
    auto a_shards = a_ccv.shards();
    auto b_shards = b_ccv.shards();

@@ -182,13 +192,15 @@ std::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, ato
        diff = atomic_cell::make_live(*counter_type, a.timestamp(), bytes_view());
    }
    return diff;
+  });
+ });
 }


-void transform_counter_updates_to_shards(mutation& m, const mutation* current_state, uint64_t clock_offset, utils::UUID local_id) {
+void transform_counter_updates_to_shards(mutation& m, const mutation* current_state, uint64_t clock_offset) {
    // FIXME: allow current_state to be frozen_mutation

-    auto transform_new_row_to_shards = [&s = *m.schema(), clock_offset, local_id] (column_kind kind, auto& cells) {
+    auto transform_new_row_to_shards = [&s = *m.schema(), clock_offset] (column_kind kind, auto& cells) {
        cells.for_each_cell([&] (column_id id, atomic_cell_or_collection& ac_o_c) {
            auto& cdef = s.column_at(kind, id);
            auto acv = ac_o_c.as_atomic_cell(cdef);
@@ -196,7 +208,7 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st
                return; // continue -- we are in lambda
            }
            auto delta = acv.counter_update_value();
-            auto cs = counter_shard(counter_id(local_id), delta, clock_offset + 1);
+            auto cs = counter_shard(counter_id::local(), delta, clock_offset + 1);
            ac_o_c = counter_cell_builder::from_single_shard(acv.timestamp(), cs);
        });
    };
@@ -211,7 +223,7 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st

    clustering_key::less_compare cmp(*m.schema());

-    auto transform_row_to_shards = [&s = *m.schema(), clock_offset, local_id] (column_kind kind, auto& transformee, auto& state) {
+    auto transform_row_to_shards = [&s = *m.schema(), clock_offset] (column_kind kind, auto& transformee, auto& state) {
        std::deque<std::pair<column_id, counter_shard>> shards;
        state.for_each_cell([&] (column_id id, const atomic_cell_or_collection& ac_o_c) {
            auto& cdef = s.column_at(kind, id);
@@ -219,13 +231,14 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st
            if (!acv.is_live()) {
                return; // continue -- we are in lambda
            }
-            auto ccv = counter_cell_view(acv);
-            auto cs = ccv.get_shard(counter_id(local_id));
+          counter_cell_view::with_linearized(acv, [&] (counter_cell_view ccv) {
+            auto cs = ccv.local_shard();
            if (!cs) {
                return; // continue
            }
            shards.emplace_back(std::make_pair(id, counter_shard(*cs)));
          });
+        });

        transformee.for_each_cell([&] (column_id id, atomic_cell_or_collection& ac_o_c) {
            auto& cdef = s.column_at(kind, id);
@@ -240,7 +253,7 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st
            auto delta = acv.counter_update_value();

            if (shards.empty() || shards.front().first > id) {
-                auto cs = counter_shard(counter_id(local_id), delta, clock_offset + 1);
+                auto cs = counter_shard(counter_id::local(), delta, clock_offset + 1);
                ac_o_c = counter_cell_builder::from_single_shard(acv.timestamp(), cs);
            } else {
                auto& cs = shards.front().second;
--- a/counters.hh
+++ b/counters.hh
@@ -61,6 +61,8 @@ public:
        return !(*this == other);
    }
 public:
+    static counter_id local();
+
    // For tests.
    static counter_id generate_random() {
        return counter_id(utils::make_random_uuid());
@@ -81,20 +83,21 @@ class basic_counter_shard_view {
        total_size = unsigned(logical_clock) + sizeof(int64_t),
    };
 private:
-    managed_bytes_basic_view<is_mutable> _base;
+    using pointer_type = std::conditional_t<is_mutable == mutable_view::no, const signed char*, signed char*>;
+    pointer_type _base;
 private:
    template<typename T>
    T read(offset off) const {
-        auto v = _base;
-        v.remove_prefix(size_t(off));
-        return read_simple_native<T>(v);
+        T value;
+        std::copy_n(_base + static_cast<unsigned>(off), sizeof(T), reinterpret_cast<signed char*>(&value));
+        return value;
    }
 public:
    static constexpr auto size = size_t(offset::total_size);
 public:
    basic_counter_shard_view() = default;
-    explicit basic_counter_shard_view(managed_bytes_basic_view<is_mutable> v) noexcept
-        : _base(v) { }
+    explicit basic_counter_shard_view(pointer_type ptr) noexcept
+        : _base(ptr) { }

    counter_id id() const { return read<counter_id>(offset::id); }
    int64_t value() const { return read<int64_t>(offset::value); }
@@ -105,24 +108,15 @@ public:
        static constexpr size_t size = size_t(offset::total_size) - off;

        signed char tmp[size];
-        auto tmp_view = single_fragmented_mutable_view(bytes_mutable_view(std::data(tmp), std::size(tmp)));
-
-        managed_bytes_mutable_view this_view = _base.substr(off, size);
-        managed_bytes_mutable_view other_view = other._base.substr(off, size);
-
-        copy_fragmented_view(tmp_view, this_view);
-        copy_fragmented_view(this_view, other_view);
-        copy_fragmented_view(other_view, tmp_view);
+        std::copy_n(_base + off, size, tmp);
+        std::copy_n(other._base + off, size, _base + off);
+        std::copy_n(tmp, size, other._base + off);
    }

    void set_value_and_clock(const basic_counter_shard_view& other) noexcept {
        static constexpr size_t off = size_t(offset::value);
        static constexpr size_t size = size_t(offset::total_size) - off;
-
-        managed_bytes_mutable_view this_view = _base.substr(off, size);
-        managed_bytes_mutable_view other_view = other._base.substr(off, size);
-
-        copy_fragmented_view(this_view, other_view);
+        std::copy_n(other._base + off, size, _base + off);
    }

    bool operator==(const basic_counter_shard_view& other) const {
@@ -148,6 +142,11 @@ class counter_shard {
    counter_id _id;
    int64_t _value;
    int64_t _logical_clock;
+private:
+    template<typename T>
+    static void write(const T& value, bytes::iterator& out) {
+        out = std::copy_n(reinterpret_cast<const signed char*>(&value), sizeof(T), out);
+    }
 private:
    // Shared logic for applying counter_shards and counter_shard_views.
    // T is either counter_shard or basic_counter_shard_view<U>.
@@ -198,10 +197,10 @@ public:
    static constexpr size_t serialized_size() {
        return counter_shard_view::size;
    }
-    void serialize(atomic_cell_value_mutable_view& out) const {
-        write_native<counter_id>(out, _id);
-        write_native<int64_t>(out, _value);
-        write_native<int64_t>(out, _logical_clock);
+    void serialize(bytes::iterator& out) const {
+        write(_id, out);
+        write(_value, out);
+        write(_logical_clock, out);
    }
 };

@@ -238,7 +237,7 @@ public:
    size_t serialized_size() const {
        return _shards.size() * counter_shard::serialized_size();
    }
-    void serialize(atomic_cell_value_mutable_view& out) const {
+    void serialize(bytes::iterator& out) const {
        for (auto&& cs : _shards) {
            cs.serialize(out);
        }
@@ -249,18 +248,31 @@ public:
    }

    atomic_cell build(api::timestamp_type timestamp) const {
+        // If we can assume that the counter shards never cross fragment boundaries
+        // the serialisation code gets much simpler.
+        static_assert(data::cell::maximum_external_chunk_length % counter_shard::serialized_size() == 0);
+
        auto ac = atomic_cell::make_live_uninitialized(*counter_type, timestamp, serialized_size());

-        auto dst = ac.value();
+        auto dst_it = ac.value().begin();
+        auto dst_current = *dst_it++;
        for (auto&& cs : _shards) {
-            cs.serialize(dst);
+            if (dst_current.empty()) {
+                dst_current = *dst_it++;
+            }
+            assert(!dst_current.empty());
+            auto value_dst = dst_current.data();
+            cs.serialize(value_dst);
+            dst_current.remove_prefix(counter_shard::serialized_size());
        }
        return ac;
    }

    static atomic_cell from_single_shard(api::timestamp_type timestamp, const counter_shard& cs) {
+        // We don't really need to bother with fragmentation here.
+        static_assert(data::cell::maximum_external_chunk_length >= counter_shard::serialized_size());
        auto ac = atomic_cell::make_live_uninitialized(*counter_type, timestamp, counter_shard::serialized_size());
-        auto dst = ac.value();
+        auto dst = ac.value().first_fragment().begin();
        cs.serialize(dst);
        return ac;
    }
@@ -299,7 +311,12 @@ public:
 template<mutable_view is_mutable>
 class basic_counter_cell_view {
 protected:
+    using linearized_value_view = std::conditional_t<is_mutable == mutable_view::no,
+                                                     bytes_view, bytes_mutable_view>;
+    using pointer_type = std::conditional_t<is_mutable == mutable_view::no,
+                                            bytes_view::const_pointer, bytes_mutable_view::pointer>;
    basic_atomic_cell_view<is_mutable> _cell;
+    linearized_value_view _value;
 private:
    class shard_iterator {
    public:
@@ -309,12 +326,12 @@ private:
        using pointer = basic_counter_shard_view<is_mutable>*;
        using reference = basic_counter_shard_view<is_mutable>&;
    private:
-        managed_bytes_basic_view<is_mutable> _current;
+        pointer_type _current;
        basic_counter_shard_view<is_mutable> _current_view;
-        size_t _pos = 0;
    public:
-        shard_iterator(managed_bytes_basic_view<is_mutable> v, size_t offset) noexcept
-            : _current(v), _current_view(_current), _pos(offset) { }
+        shard_iterator() = default;
+        shard_iterator(pointer_type ptr) noexcept
+            : _current(ptr), _current_view(ptr) { }

        basic_counter_shard_view<is_mutable>& operator*() noexcept {
            return _current_view;
@@ -323,8 +340,8 @@ private:
            return &_current_view;
        }
        shard_iterator& operator++() noexcept {
-            _pos += counter_shard_view::size;
-            _current_view = basic_counter_shard_view<is_mutable>(_current.substr(_pos, counter_shard_view::size));
+            _current += counter_shard_view::size;
+            _current_view = basic_counter_shard_view<is_mutable>(_current);
            return *this;
        }
        shard_iterator operator++(int) noexcept {
@@ -333,8 +350,8 @@ private:
            return it;
        }
        shard_iterator& operator--() noexcept {
-            _pos -= counter_shard_view::size;
-            _current_view = basic_counter_shard_view<is_mutable>(_current.substr(_pos, counter_shard_view::size));
+            _current -= counter_shard_view::size;
+            _current_view = basic_counter_shard_view<is_mutable>(_current);
            return *this;
        }
        shard_iterator operator--(int) noexcept {
@@ -343,29 +360,31 @@ private:
            return it;
        }
        bool operator==(const shard_iterator& other) const noexcept {
-            return _pos == other._pos;
+            return _current == other._current;
+        }
+        bool operator!=(const shard_iterator& other) const noexcept {
+            return !(*this == other);
        }
    };
 public:
    boost::iterator_range<shard_iterator> shards() const {
-        auto value = _cell.value();
-        auto begin = shard_iterator(value, 0);
-        auto end = shard_iterator(value, value.size());
+        auto begin = shard_iterator(_value.data());
+        auto end = shard_iterator(_value.data() + _value.size());
        return boost::make_iterator_range(begin, end);
    }

    size_t shard_count() const {
-        return _cell.value().size() / counter_shard_view::size;
+        return _cell.value().size_bytes() / counter_shard_view::size;
    }
-public:
+protected:
    // ac must be a live counter cell
-    explicit basic_counter_cell_view(basic_atomic_cell_view<is_mutable> ac) noexcept
-        : _cell(ac)
+    explicit basic_counter_cell_view(basic_atomic_cell_view<is_mutable> ac, linearized_value_view vv) noexcept
+        : _cell(ac), _value(vv)
    {
        assert(_cell.is_live());
        assert(!_cell.is_counter_update());
    }
-
+public:
    api::timestamp_type timestamp() const { return _cell.timestamp(); }

    static data_type total_value_type() { return long_type; }
@@ -386,6 +405,11 @@ public:
        return *it;
    }

+    std::optional<counter_shard_view> local_shard() const {
+        // TODO: consider caching local shard position
+        return get_shard(counter_id::local());
+    }
+
    bool operator==(const basic_counter_cell_view& other) const {
        return timestamp() == other.timestamp() && boost::equal(shards(), other.shards());
    }
@@ -394,6 +418,14 @@ public:
 struct counter_cell_view : basic_counter_cell_view<mutable_view::no> {
    using basic_counter_cell_view::basic_counter_cell_view;

+    template<typename Function>
+    static decltype(auto) with_linearized(basic_atomic_cell_view<mutable_view::no> ac, Function&& fn) {
+        return ac.value().with_linearized([&] (bytes_view value_view) {
+            counter_cell_view ccv(ac, value_view);
+            return fn(ccv);
+        });
+    }
+
    // Reversibly applies two counter cells, at least one of them must be live.
    static void apply(const column_definition& cdef, atomic_cell_or_collection& dst, atomic_cell_or_collection& src);

@@ -408,8 +440,9 @@ struct counter_cell_mutable_view : basic_counter_cell_view<mutable_view::yes> {
    using basic_counter_cell_view::basic_counter_cell_view;

    explicit counter_cell_mutable_view(atomic_cell_mutable_view ac) noexcept
-        : basic_counter_cell_view<mutable_view::yes>(ac)
+        : basic_counter_cell_view<mutable_view::yes>(ac, ac.value().first_fragment())
    {
+        assert(!ac.value().is_fragmented());
    }

    void set_timestamp(api::timestamp_type ts) { _cell.set_timestamp(ts); }
@@ -418,7 +451,7 @@ struct counter_cell_mutable_view : basic_counter_cell_view<mutable_view::yes> {
 // Transforms mutation dst from counter updates to counter shards using state
 // stored in current_state.
 // If current_state is present it has to be in the same schema as dst.
-void transform_counter_updates_to_shards(mutation& dst, const mutation* current_state, uint64_t clock_offset, utils::UUID local_id);
+void transform_counter_updates_to_shards(mutation& dst, const mutation* current_state, uint64_t clock_offset);

 template<>
 struct appending_hash<counter_shard_view> {
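Note on the counters.hh hunks above: `read()` and `write()` now copy a value's object representation to or from a raw byte pointer with `std::copy_n`, instead of going through fragmented views. A minimal sketch of that pattern follows, assuming a hypothetical `demo_shard` with three `int64_t` fields. The names `demo_shard`, `write_raw`, `read_raw`, `serialize` and `deserialize` are illustrative only and are not part of the patch. The sketch uses `unsigned char`, whereas the patch uses `signed char`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a counter shard: three fixed-size fields.
struct demo_shard {
    std::int64_t id;
    std::int64_t value;
    std::int64_t logical_clock;
};

// Serialized size of one demo_shard (fields are written back to back).
constexpr std::size_t demo_shard_size = 3 * sizeof(std::int64_t);

// Copy the object representation of `value` to `out` and advance `out`.
template <typename T>
void write_raw(const T& value, unsigned char*& out) {
    out = std::copy_n(reinterpret_cast<const unsigned char*>(&value), sizeof(T), out);
}

// Read a T stored at `base + off`; no alignment is assumed.
template <typename T>
T read_raw(const unsigned char* base, std::size_t off) {
    T value;
    std::copy_n(base + off, sizeof(T), reinterpret_cast<unsigned char*>(&value));
    return value;
}

// Write the three fields in order; `out` ends up one shard further.
void serialize(const demo_shard& s, unsigned char*& out) {
    write_raw(s.id, out);
    write_raw(s.value, out);
    write_raw(s.logical_clock, out);
}

// Rebuild a demo_shard from the bytes starting at `base`.
demo_shard deserialize(const unsigned char* base) {
    return demo_shard{
        read_raw<std::int64_t>(base, 0),
        read_raw<std::int64_t>(base, sizeof(std::int64_t)),
        read_raw<std::int64_t>(base, 2 * sizeof(std::int64_t)),
    };
}
```

Because every record has the same size, the n-th record starts at `base + n * demo_shard_size`. This is the property the new `shard_iterator` appears to rely on when it advances a plain pointer by `counter_shard_view::size`.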
--- a/cql3/Cql.g
+++ b/cql3/Cql.g
@@ -394,7 +394,6 @@ selectStatement returns [std::unique_ptr<raw::select_statement> expr]
        bool allow_filtering = false;
        bool is_json = false;
        bool bypass_cache = false;
-        auto attrs = std::make_unique<cql3::attributes::raw>();
    }
    : K_SELECT (
                ( K_JSON { is_json = true; } )?
@@ -409,12 +408,11 @@ selectStatement returns [std::unique_ptr<raw::select_statement> expr]
      ( K_LIMIT rows=intValue { limit = rows; } )?
      ( K_ALLOW K_FILTERING  { allow_filtering = true; } )?
      ( K_BYPASS K_CACHE { bypass_cache = true; })?
-      ( usingClause[attrs] )?
      {
          auto params = make_lw_shared<raw::select_statement::parameters>(std::move(orderings), is_distinct, allow_filtering, is_json, bypass_cache);
          $expr = std::make_unique<raw::select_statement>(std::move(cf), std::move(params),
            std::move(sclause), std::move(wclause), std::move(limit), std::move(per_partition_limit),
-            std::move(gbcolumns), std::move(attrs));
+            std::move(gbcolumns));
      }
    ;

@@ -523,7 +521,6 @@ usingClause[std::unique_ptr<cql3::attributes::raw>& attrs]
 usingClauseObjective[std::unique_ptr<cql3::attributes::raw>& attrs]
    : K_TIMESTAMP ts=intValue { attrs->timestamp = ts; }
    | K_TTL t=intValue { attrs->time_to_live = t; }
-    | K_TIMEOUT to=term { attrs->timeout = to; }
    ;

 /**
@@ -931,7 +928,7 @@ alterKeyspaceStatement returns [std::unique_ptr<cql3::statements::alter_keyspace
 alterTableStatement returns [std::unique_ptr<alter_table_statement> expr]
    @init {
        alter_table_statement::type type;
-        auto props = cql3::statements::cf_prop_defs();
+        auto props = make_shared<cql3::statements::cf_prop_defs>();
        std::vector<alter_table_statement::column_change> column_changes;
        std::vector<std::pair<shared_ptr<cql3::column_identifier::raw>, shared_ptr<cql3::column_identifier::raw>>> renames;
    }
@@ -947,7 +944,7 @@ alterTableStatement returns [std::unique_ptr<alter_table_statement> expr]
            | '('     id1=cident { column_changes.emplace_back(alter_table_statement::column_change{id1}); }
                 (',' idn=cident { column_changes.emplace_back(alter_table_statement::column_change{idn}); } )* ')'
            )
-          | K_WITH  properties[props]                 { type = alter_table_statement::type::opts; }
+          | K_WITH  properties[*props]                 { type = alter_table_statement::type::opts; }
          | K_RENAME                                  { type = alter_table_statement::type::rename; }
               id1=cident K_TO toId1=cident { renames.emplace_back(id1, toId1); }
               ( K_AND idn=cident K_TO toIdn=cident { renames.emplace_back(idn, toIdn); } )*
@@ -987,9 +984,9 @@ alterTypeStatement returns [std::unique_ptr<alter_type_statement> expr]
 */
 alterViewStatement returns [std::unique_ptr<alter_view_statement> expr]
    @init {
-        auto props = cql3::statements::cf_prop_defs();
+        auto props = make_shared<cql3::statements::cf_prop_defs>();
    }
-    : K_ALTER K_MATERIALIZED K_VIEW cf=columnFamilyName K_WITH properties[props]
+    : K_ALTER K_MATERIALIZED K_VIEW cf=columnFamilyName K_WITH properties[*props]
    {
        $expr = std::make_unique<alter_view_statement>(std::move(cf), std::move(props));
    }
@@ -1124,7 +1121,7 @@ dataResource returns [uninitialized<auth::resource> res]
    : K_ALL K_KEYSPACES { $res = auth::resource(auth::resource_kind::data); }
    | K_KEYSPACE ks = keyspaceName { $res = auth::make_data_resource($ks.id); }
    | ( K_COLUMNFAMILY )? cf = columnFamilyName
-      { $res = auth::make_data_resource($cf.name.has_keyspace() ? $cf.name.get_keyspace() : "", $cf.name.get_column_family()); }
+      { $res = auth::make_data_resource($cf.name->get_keyspace(), $cf.name->get_column_family()); }
    ;

 roleResource returns [uninitialized<auth::resource> res]
@@ -1261,8 +1258,8 @@ ident returns [shared_ptr<cql3::column_identifier> id]

 // Keyspace & Column family names
 keyspaceName returns [sstring id]
-    @init { auto name = cql3::cf_name(); }
-    : ksName[name] { $id = name.get_keyspace(); }
+    @init { auto name = make_shared<cql3::cf_name>(); }
+    : ksName[*name] { $id = name->get_keyspace(); }
    ;

 indexName returns [::shared_ptr<cql3::index_name> name]
@@ -1270,9 +1267,9 @@ indexName returns [::shared_ptr<cql3::index_name> name]
    : (ksName[*name] '.')? idxName[*name]
    ;

-columnFamilyName returns [cql3::cf_name name]
-    @init { $name = cql3::cf_name(); }
-    : (ksName[name] '.')? cfName[name]
+columnFamilyName returns [::shared_ptr<cql3::cf_name> name]
+    @init { $name = ::make_shared<cql3::cf_name>(); }
+    : (ksName[*name] '.')? cfName[*name]
    ;

 userTypeName returns [uninitialized<cql3::ut_name> name]
@@ -1549,10 +1546,6 @@ relation[std::vector<cql3::relation_ptr>& clauses]
          {
              $clauses.emplace_back(cql3::multi_column_relation::create_non_in_relation(ids, type, literal));
          }
-      | type=relationType K_SCYLLA_CLUSTERING_BOUND literal=tupleLiteral /* (a, b, c) > (1, 2, 3) or (a, b, c) > (?, ?, ?) */
-          {
-              $clauses.emplace_back(cql3::multi_column_relation::create_scylla_clustering_bound_non_in_relation(ids, type, literal));
-          }
      | type=relationType tupleMarker=markerForTuple /* (a, b, c) >= ? */
          { $clauses.emplace_back(cql3::multi_column_relation::create_non_in_relation(ids, type, tupleMarker)); }
      )
@@ -1768,7 +1761,6 @@ basic_unreserved_keyword returns [sstring str]
        | K_PER
        | K_PARTITION
        | K_GROUP
-        | K_TIMEOUT
        ) { $str = $k.text; }
    ;

@@ -1919,15 +1911,11 @@ K_PARTITION:   P A R T I T I O N;

 K_SCYLLA_TIMEUUID_LIST_INDEX: S C Y L L A '_' T I M E U U I D '_' L I S T '_' I N D E X;
 K_SCYLLA_COUNTER_SHARD_LIST: S C Y L L A '_' C O U N T E R '_' S H A R D '_' L I S T; 
-K_SCYLLA_CLUSTERING_BOUND: S C Y L L A '_' C L U S T E R I N G '_' B O U N D;
-

 K_GROUP:       G R O U P;

 K_LIKE:        L I K E;

-K_TIMEOUT:     T I M E O U T;
-
 // Case-insensitive alpha characters
 fragment A: ('a'|'A');
 fragment B: ('b'|'B');
--- a/cql3/abstract_marker.cc
+++ b/cql3/abstract_marker.cc
@@ -70,11 +70,11 @@ abstract_marker::raw::raw(int32_t bind_index)
 ::shared_ptr<term> abstract_marker::raw::prepare(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const
 {
    if (receiver->type->is_collection()) {
-        if (receiver->type->without_reversed().is_list()) {
+        if (receiver->type->get_kind() == abstract_type::kind::list) {
            return ::make_shared<lists::marker>(_bind_index, receiver);
-        } else if (receiver->type->without_reversed().is_set()) {
+        } else if (receiver->type->get_kind() == abstract_type::kind::set) {
            return ::make_shared<sets::marker>(_bind_index, receiver);
-        } else if (receiver->type->without_reversed().is_map()) {
+        } else if (receiver->type->get_kind() == abstract_type::kind::map) {
            return ::make_shared<maps::marker>(_bind_index, receiver);
        }
        assert(0);
--- a/cql3/attributes.cc
+++ b/cql3/attributes.cc
@@ -44,13 +44,12 @@
 namespace cql3 {

 std::unique_ptr<attributes> attributes::none() {
-    return std::unique_ptr<attributes>{new attributes{{}, {}, {}}};
+    return std::unique_ptr<attributes>{new attributes{{}, {}}};
 }

-attributes::attributes(::shared_ptr<term>&& timestamp, ::shared_ptr<term>&& time_to_live, ::shared_ptr<term>&& timeout)
+attributes::attributes(::shared_ptr<term>&& timestamp, ::shared_ptr<term>&& time_to_live)
    : _timestamp{std::move(timestamp)}
    , _time_to_live{std::move(time_to_live)}
-    , _timeout{std::move(timeout)}
 { }

 bool attributes::is_timestamp_set() const {
@@ -61,10 +60,6 @@ bool attributes::is_time_to_live_set() const {
    return bool(_time_to_live);
 }

-bool attributes::is_timeout_set() const {
-    return bool(_timeout);
-}
-
 int64_t attributes::get_timestamp(int64_t now, const query_options& options) {
    if (!_timestamp) {
        return now;
@@ -77,12 +72,14 @@ int64_t attributes::get_timestamp(int64_t now, const query_options& options) {
    if (tval.is_unset_value()) {
        return now;
    }
+  return with_linearized(*tval, [&] (bytes_view val) {
    try {
-        data_type_for<int64_t>()->validate(*tval, options.get_cql_serialization_format());
+        data_type_for<int64_t>()->validate(val, options.get_cql_serialization_format());
    } catch (marshal_exception& e) {
        throw exceptions::invalid_request_exception("Invalid timestamp value");
    }
-    return value_cast<int64_t>(data_type_for<int64_t>()->deserialize(*tval));
+    return value_cast<int64_t>(data_type_for<int64_t>()->deserialize(val));
+  });
 }

 int32_t attributes::get_time_to_live(const query_options& options) {
@@ -96,15 +93,16 @@ int32_t attributes::get_time_to_live(const query_options& options) {
    if (tval.is_unset_value()) {
        return 0;
    }
-
+  auto ttl = with_linearized(*tval, [&] (bytes_view val) {
    try {
-        data_type_for<int32_t>()->validate(*tval, options.get_cql_serialization_format());
+        data_type_for<int32_t>()->validate(val, options.get_cql_serialization_format());
    }
    catch (marshal_exception& e) {
        throw exceptions::invalid_request_exception("Invalid TTL value");
    }
-    auto ttl = value_cast<int32_t>(data_type_for<int32_t>()->deserialize(*tval));

+    return value_cast<int32_t>(data_type_for<int32_t>()->deserialize(val));
+  });
    if (ttl < 0) {
        throw exceptions::invalid_request_exception("A TTL must be greater or equal to 0");
    }
@@ -117,25 +115,6 @@ int32_t attributes::get_time_to_live(const query_options& options) {
    return ttl;
 }

-
-db::timeout_clock::duration attributes::get_timeout(const query_options& options) const {
-    auto timeout = _timeout->bind_and_get(options);
-    if (timeout.is_null() || timeout.is_unset_value()) {
-        throw exceptions::invalid_request_exception("Timeout value cannot be unset/null");
-    }
-    cql_duration duration = value_cast<cql_duration>(duration_type->deserialize(*timeout));
-    if (duration.months || duration.days) {
-        throw exceptions::invalid_request_exception("Timeout values cannot be expressed in days/months");
-    }
-    if (duration.nanoseconds % 1'000'000 != 0) {
-        throw exceptions::invalid_request_exception("Timeout values cannot have granularity finer than milliseconds");
-    }
-    if (duration.nanoseconds < 0) {
-        throw exceptions::invalid_request_exception("Timeout values must be non-negative");
-    }
-    return std::chrono::duration_cast<db::timeout_clock::duration>(std::chrono::nanoseconds(duration.nanoseconds));
-}
-
 void attributes::collect_marker_specification(variable_specifications& bound_names) const {
    if (_timestamp) {
        _timestamp->collect_marker_specification(bound_names);
@@ -143,16 +122,12 @@ void attributes::collect_marker_specification(variable_specifications& bound_nam
    if (_time_to_live) {
        _time_to_live->collect_marker_specification(bound_names);
    }
-    if (_timeout) {
-        _timeout->collect_marker_specification(bound_names);
-    }
 }

 std::unique_ptr<attributes> attributes::raw::prepare(database& db, const sstring& ks_name, const sstring& cf_name) const {
    auto ts = !timestamp ? ::shared_ptr<term>{} : timestamp->prepare(db, ks_name, timestamp_receiver(ks_name, cf_name));
    auto ttl = !time_to_live ? ::shared_ptr<term>{} : time_to_live->prepare(db, ks_name, time_to_live_receiver(ks_name, cf_name));
-    auto to = !timeout ? ::shared_ptr<term>{} : timeout->prepare(db, ks_name, timeout_receiver(ks_name, cf_name));
-    return std::unique_ptr<attributes>{new attributes{std::move(ts), std::move(ttl), std::move(to)}};
+    return std::unique_ptr<attributes>{new attributes{std::move(ts), std::move(ttl)}};
 }

 lw_shared_ptr<column_specification> attributes::raw::timestamp_receiver(const sstring& ks_name, const sstring& cf_name) const {
@@ -163,8 +138,4 @@ lw_shared_ptr<column_specification> attributes::raw::time_to_live_receiver(const
    return make_lw_shared<column_specification>(ks_name, cf_name, ::make_shared<column_identifier>("[ttl]", true), data_type_for<int32_t>());
 }

-lw_shared_ptr<column_specification> attributes::raw::timeout_receiver(const sstring& ks_name, const sstring& cf_name) const {
-    return make_lw_shared<column_specification>(ks_name, cf_name, ::make_shared<column_identifier>("[timeout]", true), duration_type);
-}
-
 }
--- a/cql3/attributes.hh
+++ b/cql3/attributes.hh
@@ -54,39 +54,31 @@ class attributes final {
 private:
    const ::shared_ptr<term> _timestamp;
    const ::shared_ptr<term> _time_to_live;
-    const ::shared_ptr<term> _timeout;
 public:
    static std::unique_ptr<attributes> none();
 private:
-    attributes(::shared_ptr<term>&& timestamp, ::shared_ptr<term>&& time_to_live, ::shared_ptr<term>&& timeout);
+    attributes(::shared_ptr<term>&& timestamp, ::shared_ptr<term>&& time_to_live);
 public:
    bool is_timestamp_set() const;

    bool is_time_to_live_set() const;

-    bool is_timeout_set() const;
-
    int64_t get_timestamp(int64_t now, const query_options& options);

    int32_t get_time_to_live(const query_options& options);

-    db::timeout_clock::duration get_timeout(const query_options& options) const;
-
    void collect_marker_specification(variable_specifications& bound_names) const;

    class raw final {
    public:
        ::shared_ptr<term::raw> timestamp;
        ::shared_ptr<term::raw> time_to_live;
-        ::shared_ptr<term::raw> timeout;

        std::unique_ptr<attributes> prepare(database& db, const sstring& ks_name, const sstring& cf_name) const;
    private:
        lw_shared_ptr<column_specification> timestamp_receiver(const sstring& ks_name, const sstring& cf_name) const;

        lw_shared_ptr<column_specification> time_to_live_receiver(const sstring& ks_name, const sstring& cf_name) const;
-
-        lw_shared_ptr<column_specification> timeout_receiver(const sstring& ks_name, const sstring& cf_name) const;
    };
 };

--- a/cql3/authorized_prepared_statements_cache.hh
+++ b/cql3/authorized_prepared_statements_cache.hh
@@ -35,28 +35,6 @@ struct authorized_prepared_statements_cache_size {
 class authorized_prepared_statements_cache_key {
 public:
    using cache_key_type = std::pair<auth::authenticated_user, typename cql3::prepared_cache_key_type::cache_key_type>;
-
-    struct view {
-        const auth::authenticated_user& user_ref;
-        const cql3::prepared_cache_key_type& prep_cache_key_ref;
-    };
-
-    struct view_hasher {
-        size_t operator()(const view& kv) {
-            return cql3::authorized_prepared_statements_cache_key::hash(kv.user_ref, kv.prep_cache_key_ref.key());
-        }
-    };
-
-    struct view_equal {
-        bool operator()(const authorized_prepared_statements_cache_key& k1, const view& k2) {
-            return k1.key().first == k2.user_ref && k1.key().second == k2.prep_cache_key_ref.key();
-        }
-
-        bool operator()(const view& k2, const authorized_prepared_statements_cache_key& k1) {
-            return operator()(k1, k2);
-        }
-    };
-
 private:
    cache_key_type _key;

@@ -122,12 +100,10 @@ private:

 public:
    using key_type = cache_key_type;
-    using key_view_type = typename key_type::view;
-    using key_view_hasher = typename key_type::view_hasher;
-    using key_view_equal = typename key_type::view_equal;
    using value_type = checked_weak_ptr;
    using entry_is_too_big = typename cache_type::entry_is_too_big;
-    using value_ptr = typename cache_type::value_ptr;
+    using iterator = typename cache_type::iterator;
+
 private:
    cache_type _cache;
    logging::logger& _logger;
@@ -148,12 +124,38 @@ public:
        }).discard_result();
    }

-    value_ptr find(const auth::authenticated_user& user, const cql3::prepared_cache_key_type& prep_cache_key) {
-        return _cache.find(key_view_type{user, prep_cache_key}, key_view_hasher(), key_view_equal());
+    iterator find(const auth::authenticated_user& user, const cql3::prepared_cache_key_type& prep_cache_key) {
+        struct key_view {
+            const auth::authenticated_user& user_ref;
+            const cql3::prepared_cache_key_type& prep_cache_key_ref;
+        };
+
+        struct hasher {
+            size_t operator()(const key_view& kv) {
+                return cql3::authorized_prepared_statements_cache_key::hash(kv.user_ref, kv.prep_cache_key_ref.key());
+            }
+        };
+
+        struct equal {
+            bool operator()(const key_type& k1, const key_view& k2) {
+                return k1.key().first == k2.user_ref && k1.key().second == k2.prep_cache_key_ref.key();
+            }
+
+            bool operator()(const key_view& k2, const key_type& k1) {
+                return operator()(k1, k2);
+            }
+        };
+
+        return _cache.find(key_view{user, prep_cache_key}, hasher(), equal());
+    }
+
+    iterator end() {
+        return _cache.end();
    }

    void remove(const auth::authenticated_user& user, const cql3::prepared_cache_key_type& prep_cache_key) {
-        _cache.remove(key_view_type{user, prep_cache_key}, key_view_hasher(), key_view_equal());
+        iterator it = find(user, prep_cache_key);
+        _cache.remove(it);
    }

    size_t size() const {
--- a/cql3/constants.hh
+++ b/cql3/constants.hh
@@ -192,12 +192,9 @@ public:

        virtual ::shared_ptr<terminal> bind(const query_options& options) override {
            auto bytes = bind_and_get(options);
-            if (bytes.is_null()) {
+            if (!bytes) {
                return ::shared_ptr<terminal>{};
            }
-            if (bytes.is_unset_value()) {
-                return UNSET_VALUE;
-            }
            return ::make_shared<constants::value>(std::move(cql3::raw_value::make_value(to_bytes(*bytes))));
        }
    };
@@ -230,7 +227,9 @@ public:
            } else if (value.is_unset_value()) {
                return;
            }
-            auto increment = value_cast<int64_t>(long_type->deserialize_value(*value));
+            auto increment = with_linearized(*value, [] (bytes_view value_view) {
+                return value_cast<int64_t>(long_type->deserialize_value(value_view));
+            });
            m.set_cell(prefix, column, make_counter_update_cell(increment, params));
        }
    };
@@ -245,7 +244,9 @@ public:
            } else if (value.is_unset_value()) {
                return;
            }
-            auto increment = value_cast<int64_t>(long_type->deserialize_value(*value));
+            auto increment = with_linearized(*value, [] (bytes_view value_view) {
+                return value_cast<int64_t>(long_type->deserialize_value(value_view));
+            });
            if (increment == std::numeric_limits<int64_t>::min()) {
                throw exceptions::invalid_request_exception(format("The negation of {:d} overflows supported counter precision (signed 8 bytes integer)", increment));
            }
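Note on the `with_linearized` calls introduced in attributes.cc and constants.hh above: the callback receives a contiguous `bytes_view` even when the underlying value may be stored in several fragments. The sketch below illustrates the general idea only and is not Scylla's implementation. `demo_fragmented_view` and `decode_int64_be` are hypothetical names, and the big-endian decoding is an assumption made for the demo.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical fragmented value: the bytes are split across several chunks.
struct demo_fragmented_view {
    std::vector<std::string_view> fragments;

    // Calls fn with one contiguous view of the whole value.
    // A single fragment is passed through without copying; several
    // fragments are first joined into a temporary buffer.
    // The view is only valid during the call, so fn must not return it.
    template <typename Function>
    decltype(auto) with_linearized(Function&& fn) const {
        std::string joined;  // used only when there are several fragments
        std::string_view view;
        if (fragments.size() == 1) {
            view = fragments.front();
        } else {
            for (std::string_view f : fragments) {
                joined.append(f);
            }
            view = joined;
        }
        return fn(view);
    }
};

// Decodes an 8-byte big-endian integer (an assumption made for this demo).
std::int64_t decode_int64_be(std::string_view bytes) {
    if (bytes.size() != 8) {
        throw std::invalid_argument("expected exactly 8 bytes");
    }
    std::uint64_t v = 0;
    for (unsigned char c : bytes) {
        v = (v << 8) | c;
    }
    return static_cast<std::int64_t>(v);
}
```

The caller never needs to know how the value is laid out, which matches how the changed call sites wrap `deserialize` and `validate` in a lambda that takes a `bytes_view`.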
--- a/cql3/cql_statement.hh
+++ b/cql3/cql_statement.hh
@@ -59,8 +59,6 @@ class result_message;

 namespace cql3 {

-class query_processor;
-
 class metadata;
 shared_ptr<const metadata> make_empty_metadata();

@@ -101,9 +99,11 @@ public:
     * @param options options for this query (consistency, variables, pageSize, ...)
     */
    virtual future<::shared_ptr<cql_transport::messages::result_message>>
-        execute(query_processor& qp, service::query_state& state, const query_options& options) const = 0;
+        execute(service::storage_proxy& proxy, service::query_state& state, const query_options& options) const = 0;

-    virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const = 0;
+    virtual bool depends_on_keyspace(const sstring& ks_name) const = 0;
+
+    virtual bool depends_on_column_family(const sstring& cf_name) const = 0;

    virtual shared_ptr<const metadata> get_result_metadata() const = 0;

--- a/cql3/expr/expression.cc
+++ b/cql3/expr/expression.cc
@@ -27,9 +27,7 @@
 #include <fmt/ostream.h>
 #include <unordered_map>

-#include "cql3/constants.hh"
 #include "cql3/lists.hh"
-#include "cql3/statements/request_validations.hh"
 #include "cql3/tuples.hh"
 #include "index/secondary_index_manager.hh"
 #include "types/list.hh"
@@ -45,8 +43,7 @@ using boost::adaptors::transformed;

 namespace {

-static
-bytes_opt do_get_value(const schema& schema,
+std::optional<atomic_cell_value_view> do_get_value(const schema& schema,
        const column_definition& cdef,
        const partition_key& key,
        const clustering_key_prefix& ckey,
@@ -54,9 +51,9 @@ bytes_opt do_get_value(const schema& schema,
        gc_clock::time_point now) {
    switch (cdef.kind) {
        case column_kind::partition_key:
-            return to_bytes(key.get_component(schema, cdef.component_index()));
+            return atomic_cell_value_view(key.get_component(schema, cdef.component_index()));
        case column_kind::clustering_key:
-            return to_bytes(ckey.get_component(schema, cdef.component_index()));
+            return atomic_cell_value_view(ckey.get_component(schema, cdef.component_index()));
        default:
            auto cell = cells.find_cell(cdef.id);
            if (!cell) {
@@ -64,7 +61,7 @@ bytes_opt do_get_value(const schema& schema,
            }
            assert(cdef.is_atomic());
            auto c = cell->as_atomic_cell(cdef);
-            return c.is_dead(now) ? std::nullopt : bytes_opt(to_bytes(c.value()));
+            return c.is_dead(now) ? std::nullopt : std::optional<atomic_cell_value_view>(c.value());
    }
 }

@@ -141,8 +138,9 @@ bytes_opt get_value_from_partition_slice(

 /// Returns col's value from a mutation.
 bytes_opt get_value_from_mutation(const column_value& col, row_data_from_mutation data) {
-    return do_get_value(
+    const auto v = do_get_value(
            data.schema_, *col.col, data.partition_key_, data.clustering_key_, data.other_columns, data.now);
+    return v ? v->linearize() : bytes_opt();
 }

 /// Returns col's value from the fetched data.
@@ -156,7 +154,7 @@ bytes_opt get_value(const column_value& col, const column_value_eval_bag& bag) {

 /// Type for comparing results of get_value().
 const abstract_type* get_value_comparator(const column_definition* cdef) {
-    return &cdef->type->without_reversed();
+    return cdef->type->is_reversed() ? cdef->type->underlying_type().get() : cdef->type.get();
 }

 /// Type for comparing results of get_value().
@@ -357,12 +355,16 @@ bytes_opt next_value(query::result_row_view::iterator_type& iter, const column_d
    if (cdef->type->is_multi_cell()) {
        auto cell = iter.next_collection_cell();
        if (cell) {
-            return linearized(*cell);
+            return cell->with_linearized([] (bytes_view data) {
+                return bytes(data.cbegin(), data.cend());
+            });
        }
    } else {
        auto cell = iter.next_atomic_cell();
        if (cell) {
-            return linearized(cell->value());
+            return cell->value().with_linearized([] (bytes_view data) {
+                return bytes(data.cbegin(), data.cend());
+            });
        }
    }
    return std::nullopt;
@@ -415,8 +417,6 @@ bool is_one_of(const column_value& col, term& rhs, const column_value_eval_bag&
    } else if (auto mkr = dynamic_cast<lists::marker*>(&rhs)) {
        // This is `a IN ?`.  RHS elements are values representable as bytes_opt.
        const auto values = static_pointer_cast<lists::value>(mkr->bind(bag.options));
-        statements::request_validations::check_not_null(
-                values, "Invalid null value for column %s", col.col->name_as_text());
        return boost::algorithm::any_of(values->get_elements(), [&] (const bytes_opt& b) {
                return equal(b, col, bag);
            });
@@ -568,8 +568,7 @@ const auto deref = boost::adaptors::transformed([] (const bytes_opt& b) { return

 /// Returns possible values from t, which must be RHS of IN.
 value_list get_IN_values(
-        const ::shared_ptr<term>& t, const query_options& options, const serialized_compare& comparator,
-        sstring_view column_name) {
+        const ::shared_ptr<term>& t, const query_options& options, const serialized_compare& comparator) {
    // RHS is prepared differently for different CQL cases.  Cast it dynamically to discern which case this is.
    if (auto dv = dynamic_pointer_cast<lists::delayed_value>(t)) {
        // Case `a IN (1,2,3)`.
@@ -579,12 +578,8 @@ value_list get_IN_values(
        return to_sorted_vector(std::move(result_range), comparator);
    } else if (auto mkr = dynamic_pointer_cast<lists::marker>(t)) {
        // Case `a IN ?`.  Collect all list-element values.
-        const auto val = mkr->bind(options);
-        if (val == constants::UNSET_VALUE) {
-            throw exceptions::invalid_request_exception(format("Invalid unset value for column {}", column_name));
-        }
-        statements::request_validations::check_not_null(val, "Invalid null value for column %s", column_name);
-        return to_sorted_vector(static_pointer_cast<lists::value>(val)->get_elements() | non_null | deref, comparator);
+        const auto val = static_pointer_cast<lists::value>(mkr->bind(options));
+        return to_sorted_vector(val->get_elements() | non_null | deref, comparator);
    }
    throw std::logic_error(format("get_IN_values(single column) on invalid term {}", *t));
 }
@@ -611,6 +606,22 @@ value_list get_IN_values(const ::shared_ptr<term>& t, size_t k, const query_opti

 static constexpr bool inclusive = true, exclusive = false;

+/// A range of all X such that X op val.
+nonwrapping_range<bytes> to_range(oper_t op, const bytes& val) {
+    switch (op) {
+    case oper_t::GT:
+        return nonwrapping_range<bytes>::make_starting_with(interval_bound(val, exclusive));
+    case oper_t::GTE:
+        return nonwrapping_range<bytes>::make_starting_with(interval_bound(val, inclusive));
+    case oper_t::LT:
+        return nonwrapping_range<bytes>::make_ending_with(interval_bound(val, exclusive));
+    case oper_t::LTE:
+        return nonwrapping_range<bytes>::make_ending_with(interval_bound(val, inclusive));
+    default:
+        throw std::logic_error(format("to_range: unknown comparison operator {}", op));
+    }
+}
+
 } // anonymous namespace

 expression make_conjunction(expression a, expression b) {
@@ -639,7 +650,7 @@ bool is_satisfied_by(
 std::vector<bytes_opt> first_multicolumn_bound(
        const expression& restr, const query_options& options, statements::bound bnd) {
    auto found = find_atom(restr, [bnd] (const binary_operator& oper) {
-        return matches(oper.op, bnd) && is_multi_column(oper);
+        return matches(oper.op, bnd) && std::holds_alternative<std::vector<column_value>>(oper.lhs);
    });
    if (found) {
        return static_pointer_cast<tuples::value>(found->rhs->bind(options))->get_elements();
@@ -648,27 +659,6 @@ std::vector<bytes_opt> first_multicolumn_bound(
    }
 }

-template<typename T>
-nonwrapping_range<T> to_range(oper_t op, const T& val) {
-    static constexpr bool inclusive = true, exclusive = false;
-    switch (op) {
-    case oper_t::EQ:
-        return nonwrapping_range<T>::make_singular(val);
-    case oper_t::GT:
-        return nonwrapping_range<T>::make_starting_with(interval_bound(val, exclusive));
-    case oper_t::GTE:
-        return nonwrapping_range<T>::make_starting_with(interval_bound(val, inclusive));
-    case oper_t::LT:
-        return nonwrapping_range<T>::make_ending_with(interval_bound(val, exclusive));
-    case oper_t::LTE:
-        return nonwrapping_range<T>::make_ending_with(interval_bound(val, inclusive));
-    default:
-        throw std::logic_error(format("to_range: unknown comparison operator {}", op));
-    }
-}
-
-template nonwrapping_range<clustering_key_prefix> to_range(oper_t, const clustering_key_prefix&);
-
 value_set possible_lhs_values(const column_definition* cdef, const expression& expr, const query_options& options) {
    const auto type = cdef ? get_value_comparator(cdef) : long_type.get();
    return std::visit(overloaded_functor{
@@ -696,7 +686,7 @@ value_set possible_lhs_values(const column_definition* cdef, const expression& e
                                return oper.op == oper_t::EQ ? value_set(value_list{*val})
                                        : to_range(oper.op, *val);
                            } else if (oper.op == oper_t::IN) {
-                                return get_IN_values(oper.rhs, options, type->as_less_comparator(), cdef->name_as_text());
+                                return get_IN_values(oper.rhs, options, type->as_less_comparator());
                            }
                            throw std::logic_error(format("possible_lhs_values: unhandled operator {}", oper));
                        },
@@ -786,11 +776,9 @@ bool is_supported_by(const expression& expr, const secondary_index::index& idx)
                            return idx.supports_expression(*col.col, oper.op);
                        },
                        [&] (const std::vector<column_value>& cvs) {
-                            if (cvs.size() == 1) {
-                                return idx.supports_expression(*cvs[0].col, oper.op);
-                            }
-                            // We don't use index table for multi-column restrictions, as it cannot avoid filtering.
-                            return false;
+                            return boost::algorithm::any_of(cvs, [&] (const column_value& c) {
+                                return idx.supports_expression(*c.col, oper.op);
+                            });
                        },
                        [&] (const token&) { return false; },
                    }, oper.lhs);
@@ -812,7 +800,7 @@ bool has_supporting_index(
 }

 std::ostream& operator<<(std::ostream& os, const column_value& cv) {
-    os << cv.col->name_as_text();
+    os << *cv.col;
    if (cv.sub) {
        os << '[' << *cv.sub << ']';
    }
@@ -827,10 +815,10 @@ std::ostream& operator<<(std::ostream& os, const expression& expr) {
                std::visit(overloaded_functor{
                        [&] (const token& t) { os << "TOKEN"; },
                        [&] (const column_value& col) {
-                            fmt::print(os, "{}", col);
+                            fmt::print(os, "({})", col);
                        },
                        [&] (const std::vector<column_value>& cvs) {
-                            fmt::print(os, "({})", fmt::join(cvs, ","));
+                            fmt::print(os, "(({}))", fmt::join(cvs, ","));
                        },
                    }, opr.lhs);
                os << ' ' << opr.op << ' ' << *opr.rhs;
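
Note on the `to_range(oper_t, const bytes&)` helper added in this file: it maps each slice operator to a one-sided range whose single bound is inclusive or exclusive. The following is a minimal standalone sketch of that mapping, not the Scylla implementation. The names `cmp_op`, `bound`, `simple_range` and `to_simple_range` are hypothetical stand-ins for `oper_t`, `interval_bound` and `nonwrapping_range<bytes>`, and plain `std::string` ordering is assumed in place of the column type's comparator.

```cpp
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the slice subset of oper_t.
enum class cmp_op { GT, GTE, LT, LTE };

// Hypothetical stand-in for interval_bound: a value plus an inclusiveness flag.
struct bound {
    std::string value;
    bool inclusive;
};

// Hypothetical stand-in for nonwrapping_range<bytes>.
struct simple_range {
    std::optional<bound> start; // empty: unbounded below
    std::optional<bound> end;   // empty: unbounded above

    // Assumes std::string ordering; the real code compares with the column type.
    bool contains(const std::string& x) const {
        if (start && (x < start->value || (x == start->value && !start->inclusive))) {
            return false;
        }
        if (end && (x > end->value || (x == end->value && !end->inclusive))) {
            return false;
        }
        return true;
    }
};

// The set of all X such that `X op val`.
inline simple_range to_simple_range(cmp_op op, const std::string& val) {
    switch (op) {
    case cmp_op::GT:  return {bound{val, false}, std::nullopt};
    case cmp_op::GTE: return {bound{val, true}, std::nullopt};
    case cmp_op::LT:  return {std::nullopt, bound{val, false}};
    case cmp_op::LTE: return {std::nullopt, bound{val, true}};
    }
    throw std::logic_error("to_simple_range: unknown comparison operator");
}
```

In this sketch `to_simple_range(cmp_op::GT, "b")` excludes `"b"` itself, while the `GTE` variant includes it. Equality is deliberately absent, matching the non-template version in the diff, where the caller handles `EQ` separately as a single-value list.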
--- a/cql3/expr/expression.hh
+++ b/cql3/expr/expression.hh
@@ -73,18 +73,11 @@ struct token {};

 enum class oper_t { EQ, NEQ, LT, LTE, GTE, GT, IN, CONTAINS, CONTAINS_KEY, IS_NOT, LIKE };

-/// Describes the nature of clustering-key comparisons.  Useful for implementing SCYLLA_CLUSTERING_BOUND.
-enum class comparison_order : char {
-    cql, ///< CQL order. (a,b)>(1,1) is equivalent to a>1 OR (a=1 AND b>1).
-    clustering, ///< Table's clustering order. (a,b)>(1,1) means any row past (1,1) in storage.
-};
-
 /// Operator restriction: LHS op RHS.
 struct binary_operator {
    std::variant<column_value, std::vector<column_value>, token> lhs;
    oper_t op;
    ::shared_ptr<term> rhs;
-    comparison_order order = comparison_order::cql;
 };

 /// A conjunction of restrictions.
@@ -138,10 +131,6 @@ extern value_set possible_lhs_values(const column_definition*, const expression&
 /// Turns value_set into a range, unless it's a multi-valued list (in which case this throws).
 extern nonwrapping_range<bytes> to_range(const value_set&);

-/// A range of all X such that X op val.
-template<typename T>
-nonwrapping_range<T> to_range(oper_t op, const T& val);
-
 /// True iff the index can support the entire expression.
 extern bool is_supported_by(const expression&, const secondary_index::index&);

@@ -193,8 +182,7 @@ inline const binary_operator* find(const expression& e, oper_t op) {
 }

 inline bool needs_filtering(oper_t op) {
-    return (op == oper_t::CONTAINS) || (op == oper_t::CONTAINS_KEY) || (op == oper_t::LIKE) ||
-           (op == oper_t::IS_NOT) || (op == oper_t::NEQ) ;
+    return (op == oper_t::CONTAINS) || (op == oper_t::CONTAINS_KEY) || (op == oper_t::LIKE);
 }

 inline auto find_needs_filtering(const expression& e) {
@@ -223,10 +211,6 @@ inline bool is_compare(oper_t op) {
    }
 }

-inline bool is_multi_column(const binary_operator& op) {
-    return holds_alternative<std::vector<column_value>>(op.lhs);
-}
-
 inline bool has_token(const expression& e) {
    return find_atom(e, [] (const binary_operator& o) { return std::holds_alternative<token>(o.lhs); });
 }
@@ -235,14 +219,6 @@ inline bool has_slice_or_needs_filtering(const expression& e) {
    return find_atom(e, [] (const binary_operator& o) { return is_slice(o.op) || needs_filtering(o.op); });
 }

-inline bool is_clustering_order(const binary_operator& op) {
-    return op.order == comparison_order::clustering;
-}
-
-inline auto find_clustering_order(const expression& e) {
-    return find_atom(e, is_clustering_order);
-}
-
 /// True iff binary_operator involves a collection.
 extern bool is_on_collection(const binary_operator&);

--- a/cql3/functions/aggregate_fcts.cc
+++ b/cql3/functions/aggregate_fcts.cc
@@ -219,7 +219,7 @@ struct aggregate_type_for<simple_date_native_type> {

 template<>
 struct aggregate_type_for<timeuuid_native_type> {
-    using type = timeuuid_native_type;
+    using type = timeuuid_native_type::primary_type;
 };

 template<>
@@ -227,7 +227,6 @@ struct aggregate_type_for<time_native_type> {
    using type = time_native_type::primary_type;
 };

-// WARNING: never invoke this on temporary values; it will return a dangling reference.
 template <typename Type>
 const Type& max_wrapper(const Type& t1, const Type& t2) {
    using std::max;
@@ -242,10 +241,6 @@ inline const net::inet_address& max_wrapper(const net::inet_address& t1, const n
    return std::memcmp(t1.data(), t2.data(), len) >= 0 ? t1 : t2;
 }

-inline const timeuuid_native_type& max_wrapper(const timeuuid_native_type& t1, const timeuuid_native_type& t2) {
-    return t1.uuid.timestamp() > t2.uuid.timestamp() ? t1 : t2;
-}
-
 template <typename Type>
 class impl_max_function_for final : public aggregate_function::aggregate {
   std::optional<typename aggregate_type_for<Type>::type> _max{};
@@ -328,7 +323,6 @@ make_max_function() {
    return make_shared<max_function_for<Type>>();
 }

-// WARNING: never invoke this on temporary values; it will return a dangling reference.
 template <typename Type>
 const Type& min_wrapper(const Type& t1, const Type& t2) {
    using std::min;
@@ -343,10 +337,6 @@ inline const net::inet_address& min_wrapper(const net::inet_address& t1, const n
    return std::memcmp(t1.data(), t2.data(), len) <= 0 ? t1 : t2;
 }

-inline timeuuid_native_type min_wrapper(timeuuid_native_type t1, timeuuid_native_type t2) {
-    return t1.uuid.timestamp() < t2.uuid.timestamp() ? t1 : t2;
-}
-
 template <typename Type>
 class impl_min_function_for final : public aggregate_function::aggregate {
   std::optional<typename aggregate_type_for<Type>::type> _min{};
--- a/cql3/functions/error_injection_fcts.cc
+++ b/cql3/functions/error_injection_fcts.cc
@@ -24,7 +24,6 @@
 #include "error_injection_fcts.hh"
 #include "utils/error_injection.hh"
 #include "types/list.hh"
-#include <seastar/core/map_reduce.hh>

 namespace cql3
 {
--- a/cql3/functions/functions.cc
+++ b/cql3/functions/functions.cc
@@ -54,7 +54,7 @@ std::ostream& operator<<(std::ostream& os, const std::vector<data_type>& arg_typ
 namespace cql3 {
 namespace functions {

-logging::logger log("cql3_fuctions");
+static logging::logger log("cql3_fuctions");

 bool abstract_function::requires_thread() const { return false; }

@@ -181,18 +181,13 @@ inline
 shared_ptr<function>
 make_from_json_function(database& db, const sstring& keyspace, data_type t) {
    return make_native_scalar_function<true>("fromjson", t, {utf8_type},
-            [&db, keyspace, t](cql_serialization_format sf, const std::vector<bytes_opt>& parameters) -> bytes_opt {
-        try {
-            rjson::value json_value = rjson::parse(utf8_type->to_string(parameters[0].value()));
-            bytes_opt parsed_json_value;
-            if (!json_value.IsNull()) {
-                parsed_json_value.emplace(from_json_object(*t, json_value, sf));
-            }
-            return parsed_json_value;
-        } catch(rjson::error& e) {
-            throw exceptions::function_execution_exception("fromJson",
-                format("Failed parsing fromJson parameter: {}", e.what()), keyspace, {t->name()});
+            [&db, &keyspace, t](cql_serialization_format sf, const std::vector<bytes_opt>& parameters) -> bytes_opt {
+        rjson::value json_value = rjson::parse(utf8_type->to_string(parameters[0].value()));
+        bytes_opt parsed_json_value;
+        if (!json_value.IsNull()) {
+            parsed_json_value.emplace(from_json_object(*t, json_value, sf));
        }
+        return parsed_json_value;
    });
 }

--- a/cql3/functions/native_scalar_function.hh
+++ b/cql3/functions/native_scalar_function.hh
@@ -78,22 +78,7 @@ public:
        return Pure;
    }
    virtual bytes_opt execute(cql_serialization_format sf, const std::vector<bytes_opt>& parameters) override {
-        try {
-            return _func(sf, parameters);
-        } catch(exceptions::cassandra_exception&) {
-            // If the function's code took the time to produce an official
-            // cassandra_exception, pass it through. Otherwise, below we will
-            // wrap the unknown exception in a function_execution_exception.
-            throw;
-        } catch(...) {
-            std::vector<sstring> args;
-            args.reserve(arg_types().size());
-            for (const data_type& a : arg_types()) {
-                args.push_back(a->name());
-            }
-            throw exceptions::function_execution_exception(name().name,
-                format("Failed execution of function {}: {}", name(), std::current_exception()), name().keyspace, std::move(args));
-        }
+        return _func(sf, parameters);
    }
 };

--- a/cql3/functions/user_function.cc
+++ b/cql3/functions/user_function.cc
@@ -21,13 +21,9 @@

 #include "user_function.hh"
 #include "lua.hh"
-#include "log.hh"

 namespace cql3 {
 namespace functions {
-
-extern logging::logger log;
-
 user_function::user_function(function_name name, std::vector<data_type> arg_types, std::vector<sstring> arg_names,
        sstring body, sstring language, data_type return_type, bool called_on_null_input, sstring bitcode,
        lua::runtime_config cfg)
@@ -60,9 +56,7 @@ bytes_opt user_function::execute(cql_serialization_format sf, const std::vector<
        }
        values.push_back(bytes ? type->deserialize(*bytes) : data_value::make_null(type));
    }
-    if (!seastar::thread::running_in_thread()) {
-        on_internal_error(log, "User function cannot be executed in this context");
-    }
+
    return lua::run_script(lua::bitcode_view{_bitcode}, values, return_type(), _cfg).get0();
 }
 }
--- a/cql3/index_name.cc
+++ b/cql3/index_name.cc
@@ -53,11 +53,11 @@ const sstring& index_name::get_idx() const
    return _idx_name;
 }

-cf_name index_name::get_cf_name() const
+::shared_ptr<cf_name> index_name::get_cf_name() const
 {
-    cf_name cf;
+    auto cf = ::make_shared<cf_name>();
    if (has_keyspace()) {
-        cf.set_keyspace(get_keyspace(), true);
+        cf->set_keyspace(get_keyspace(), true);
    }
    return cf;
 }
--- a/cql3/index_name.hh
+++ b/cql3/index_name.hh
@@ -55,7 +55,7 @@ public:

    const sstring& get_idx() const;

-    cf_name get_cf_name() const;
+    ::shared_ptr<cf_name> get_cf_name() const;

    virtual sstring to_string() const override;
 };
--- a/cql3/keyspace_element_name.cc
+++ b/cql3/keyspace_element_name.cc
@@ -55,7 +55,6 @@ bool keyspace_element_name::has_keyspace() const

 const sstring& keyspace_element_name::get_keyspace() const
 {
-    assert(_ks_name);
    return *_ks_name;
 }

--- a/cql3/lists.cc
+++ b/cql3/lists.cc
@@ -25,8 +25,8 @@
 #include "cql3_type.hh"
 #include "constants.hh"
 #include <boost/iterator/transform_iterator.hpp>
+#include <boost/range/adaptor/reversed.hpp>
 #include "types/list.hh"
-#include "utils/UUID_gen.hh"

 namespace cql3 {

@@ -40,7 +40,7 @@ lw_shared_ptr<column_specification>
 lists::value_spec_of(const column_specification& column) {
    return make_lw_shared<column_specification>(column.ks_name, column.cf_name,
            ::make_shared<column_identifier>(format("value({})", *column.name), true),
-                dynamic_cast<const list_type_impl&>(column.type->without_reversed()).get_elements_type());
+                dynamic_pointer_cast<const list_type_impl>(column.type)->get_elements_type());
 }

 lw_shared_ptr<column_specification>
@@ -87,7 +87,7 @@ lists::literal::prepare(database& db, const sstring& keyspace, lw_shared_ptr<col

 void
 lists::literal::validate_assignable_to(database& db, const sstring keyspace, const column_specification& receiver) const {
-    if (!receiver.type->without_reversed().is_list()) {
+    if (!dynamic_pointer_cast<const list_type_impl>(receiver.type)) {
        throw exceptions::invalid_request_exception(format("Invalid list literal for {} of type {}",
                *receiver.name, receiver.type->as_cql3_type()));
    }
@@ -125,11 +125,18 @@ lists::literal::to_string() const {

 lists::value
 lists::value::from_serialized(const fragmented_temporary_buffer::view& val, const list_type_impl& type, cql_serialization_format sf) {
+    return with_linearized(val, [&] (bytes_view v) {
+        return from_serialized(v, type, sf);
+    });
+}
+
+lists::value
+lists::value::from_serialized(bytes_view v, const list_type_impl& type, cql_serialization_format sf) {
    try {
        // Collections have this small hack that validate cannot be called on a serialized object,
        // but compose does the validation (so we're fine).
        // FIXME: deserializeForNativeProtocol()?!
-        auto l = value_cast<list_type_impl::native_type>(type.deserialize(val, sf));
+        auto l = value_cast<list_type_impl::native_type>(type.deserialize(v, sf));
        std::vector<bytes_opt> elements;
        elements.reserve(l.size());
        for (auto&& element : l) {
@@ -220,15 +227,17 @@ lists::delayed_value::bind(const query_options& options) {
 ::shared_ptr<terminal>
 lists::marker::bind(const query_options& options) {
    const auto& value = options.get_value_at(_bind_index);
-    auto& ltype = dynamic_cast<const list_type_impl&>(_receiver->type->without_reversed());
+    auto& ltype = static_cast<const list_type_impl&>(*_receiver->type);
    if (value.is_null()) {
        return nullptr;
    } else if (value.is_unset_value()) {
        return constants::UNSET_VALUE;
    } else {
        try {
-            ltype.validate(*value, options.get_cql_serialization_format());
-            return make_shared<lists::value>(value::from_serialized(*value, ltype, options.get_cql_serialization_format()));
+            return with_linearized(*value, [&] (bytes_view v) {
+                ltype.validate(v, options.get_cql_serialization_format());
+                return make_shared<lists::value>(value::from_serialized(v, ltype, options.get_cql_serialization_format()));
+            });
        } catch (marshal_exception& e) {
            throw exceptions::invalid_request_exception(
                    format("Exception while binding column {:s}: {:s}", _receiver->name->to_cql_string(), e.what()));
@@ -236,6 +245,20 @@ lists::marker::bind(const query_options& options) {
    }
 }

+constexpr db_clock::time_point lists::precision_time::REFERENCE_TIME;
+thread_local lists::precision_time lists::precision_time::_last = {db_clock::time_point::max(), 0};
+
+lists::precision_time
+lists::precision_time::get_next(db_clock::time_point millis) {
+    // FIXME: and if time goes backwards?
+    assert(millis <= _last.millis);
+    auto next =  millis < _last.millis
+            ? precision_time{millis, 9999}
+            : precision_time{millis, std::max(0, _last.nanos - 1)};
+    _last = next;
+    return next;
+}
+
 void
 lists::setter::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
    auto value = _t->bind(params._options);
@@ -285,7 +308,9 @@ lists::setter_by_index::execute(mutation& m, const clustering_key_prefix& prefix
        return;
    }

-    auto idx = value_cast<int32_t>(data_type_for<int32_t>()->deserialize(*index));
+    auto idx = with_linearized(*index, [] (bytes_view v) {
+        return value_cast<int32_t>(data_type_for<int32_t>()->deserialize(v));
+    });
    auto&& existing_list_opt = params.get_prefetched_list(m.key(), prefix, column);
    if (!existing_list_opt) {
        throw exceptions::invalid_request_exception("Attempted to set an element on a list which is null");
@@ -373,18 +398,10 @@ lists::do_append(shared_ptr<term> value,
        collection_mutation_description appended;
        appended.cells.reserve(to_add.size());
        for (auto&& e : to_add) {
-            try {
-                auto uuid1 = utils::UUID_gen::get_time_UUID_bytes_from_micros_and_submicros(
-                    params.timestamp(),
-                    params._options.next_list_append_seq());
-                auto uuid = bytes(reinterpret_cast<const int8_t*>(uuid1.data()), uuid1.size());
-                // FIXME: can e be empty?
-                appended.cells.emplace_back(
-                    std::move(uuid),
-                    params.make_cell(*ltype->value_comparator(), *e, atomic_cell::collection_member::yes));
-            } catch (utils::timeuuid_submicro_out_of_range) {
-                throw exceptions::invalid_request_exception("Too many list values per single CQL statement or batch");
-            }
+            auto uuid1 = utils::UUID_gen::get_time_UUID_bytes();
+            auto uuid = bytes(reinterpret_cast<const int8_t*>(uuid1.data()), uuid1.size());
+            // FIXME: can e be empty?
+            appended.cells.emplace_back(std::move(uuid), params.make_cell(*ltype->value_comparator(), *e, atomic_cell::collection_member::yes));
        }
        m.set_cell(prefix, column, appended.serialize(*ltype));
    } else {
@@ -408,42 +425,20 @@ lists::prepender::execute(mutation& m, const clustering_key_prefix& prefix, cons

    auto&& lvalue = dynamic_pointer_cast<lists::value>(std::move(value));
    assert(lvalue);
-
-    // For prepend we need to be able to generate a unique but decreasing
-    // timeuuid. We achieve that by by using a time in the past which
-    // is 2x the distance between the original timestamp (it
-    // would be the current timestamp, user supplied timestamp, or
-    // unique monotonic LWT timestsamp, whatever is in query
-    // options) and a reference time of Jan 1 2010 00:00:00.
-    // E.g. if query timestamp is Jan 1 2020 00:00:00, the prepend
-    // timestamp will be Jan 1, 2000, 00:00:00.
-
-    // 2010-01-01T00:00:00+00:00 in api::timestamp_time format (microseconds)
-    static constexpr int64_t REFERENCE_TIME_MICROS = 1262304000L * 1000 * 1000;
-
-    int64_t micros = params.timestamp();
-    if (micros > REFERENCE_TIME_MICROS) {
-        micros = REFERENCE_TIME_MICROS - (micros - REFERENCE_TIME_MICROS);
-    } else {
-        // Scylla, unlike Cassandra, respects user-supplied timestamps
-        // in prepend, but there is nothing useful it can do with
-        // a timestamp less than Jan 1, 2010, 00:00:00.
-        throw exceptions::invalid_request_exception("List prepend custom timestamp must be greater than Jan 1 2010 00:00:00");
-    }
+    auto time = precision_time::REFERENCE_TIME - (db_clock::now() - precision_time::REFERENCE_TIME);

    collection_mutation_description mut;
    mut.cells.reserve(lvalue->get_elements().size());
-
+    // We reverse the order of insertion, so that the last element gets the latest time
+    // (lists are sorted by time)
    auto ltype = static_cast<const list_type_impl*>(column.type.get());
-    int clockseq = params._options.next_list_prepend_seq(lvalue->_elements.size(), utils::UUID_gen::SUBMICRO_LIMIT);
-    for (auto&& v : lvalue->_elements) {
-        try {
-            auto uuid = utils::UUID_gen::get_time_UUID_bytes_from_micros_and_submicros(micros, clockseq++);
-            mut.cells.emplace_back(bytes(uuid.data(), uuid.size()), params.make_cell(*ltype->value_comparator(), *v, atomic_cell::collection_member::yes));
-        } catch (utils::timeuuid_submicro_out_of_range) {
-            throw exceptions::invalid_request_exception("Too many list values per single CQL statement or batch");
-        }
+    for (auto&& v : lvalue->_elements | boost::adaptors::reversed) {
+        auto&& pt = precision_time::get_next(time);
+        auto uuid = utils::UUID_gen::get_time_UUID_bytes(pt.millis.time_since_epoch().count(), pt.nanos);
+        mut.cells.emplace_back(bytes(uuid.data(), uuid.size()), params.make_cell(*ltype->value_comparator(), *v, atomic_cell::collection_member::yes));
    }
+    // now reverse again, to get the original order back
+    std::reverse(mut.cells.begin(), mut.cells.end());
    m.set_cell(prefix, column, mut.serialize(*ltype));
 }

--- a/cql3/lists.hh
+++ b/cql3/lists.hh
@@ -43,6 +43,7 @@

 #include "cql3/abstract_marker.hh"
 #include "to_string.hh"
+#include "utils/UUID_gen.hh"
 #include "operation.hh"

 namespace cql3 {
@@ -72,6 +73,7 @@ public:
    };

    class value : public multi_item_terminal, collection_terminal {
+        static value from_serialized(bytes_view v, const list_type_impl& type, cql_serialization_format sf);
    public:
        std::vector<bytes_opt> _elements;
    public:
@@ -120,6 +122,28 @@ public:
        virtual ::shared_ptr<terminal> bind(const query_options& options) override;
    };

+    /*
+     * For prepend, we need to be able to generate unique but decreasing time
+     * UUIDs, which is a bit challenging. To do that, given a time in milliseconds,
+     * we add a number representing the 100-nanosecond precision and make sure
+     * that within the same millisecond, that number is always decreasing. We
+     * do rely on the fact that the user will only provide decreasing
+     * millisecond timestamps for that purpose.
+     */
+private:
+    class precision_time {
+    public:
+        // Our reference time (1 Jan 2010, 00:00:00) in milliseconds.
+        static constexpr db_clock::time_point REFERENCE_TIME{std::chrono::milliseconds(1262304000000)};
+    private:
+        static thread_local precision_time _last;
+    public:
+        db_clock::time_point millis;
+        int32_t nanos;
+
+        static precision_time get_next(db_clock::time_point millis);
+    };
+
 public:
    class setter : public operation {
    public:
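
Note on `precision_time`: the prepend path in `lists.cc` first reflects the current time around the reference time, then asks `get_next()` for a sub-millisecond counter that decreases within one millisecond. The following is a minimal standalone sketch of that logic, assuming plain integer milliseconds instead of `db_clock::time_point`. The names `tick`, `decreasing_clock` and `reflect` are hypothetical, and the sketch uses an instance member where the diff uses a `thread_local` static.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for precision_time: a millisecond value plus a
// sub-millisecond counter (called `nanos` in the diff).
struct tick {
    int64_t millis;
    int32_t sub;
};

class decreasing_clock {
    // Starts at the maximum so that the first request counts as "earlier".
    tick _last{std::numeric_limits<int64_t>::max(), 0};
public:
    // Returns a tick ordered before every tick returned so far, provided the
    // caller never passes a later `millis` than before (the diff asserts this).
    tick next(int64_t millis) {
        tick t = millis < _last.millis
                ? tick{millis, 9999}
                : tick{millis, std::max<int32_t>(0, _last.sub - 1)};
        _last = t;
        return t;
    }
};

// Mirrors a time around a reference point, so that later times map to
// earlier ones: reference - (now - reference).
inline int64_t reflect(int64_t now_ms, int64_t reference_ms) {
    return reference_ms - (now_ms - reference_ms);
}
```

One property visible in the sketch: the counter is clamped at 0, so more than 10000 requests within the same millisecond would stop producing distinct values.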
--- a/cql3/maps.cc
+++ b/cql3/maps.cc
@@ -55,14 +55,14 @@ lw_shared_ptr<column_specification>
 maps::key_spec_of(const column_specification& column) {
    return make_lw_shared<column_specification>(column.ks_name, column.cf_name,
                ::make_shared<column_identifier>(format("key({})", *column.name), true),
-                dynamic_cast<const map_type_impl&>(column.type->without_reversed()).get_keys_type());
+                 dynamic_pointer_cast<const map_type_impl>(column.type)->get_keys_type());
 }

 lw_shared_ptr<column_specification>
 maps::value_spec_of(const column_specification& column) {
    return make_lw_shared<column_specification>(column.ks_name, column.cf_name,
                ::make_shared<column_identifier>(format("value({})", *column.name), true),
-                 dynamic_cast<const map_type_impl&>(column.type->without_reversed()).get_values_type());
+                 dynamic_pointer_cast<const map_type_impl>(column.type)->get_values_type());
 }

 ::shared_ptr<term>
@@ -88,9 +88,7 @@ maps::literal::prepare(database& db, const sstring& keyspace, lw_shared_ptr<colu

        values.emplace(k, v);
    }
-    delayed_value value(
-            dynamic_cast<const map_type_impl&>(receiver->type->without_reversed()).get_keys_type()->as_less_comparator(),
-            values);
+    delayed_value value(static_pointer_cast<const map_type_impl>(receiver->type)->get_keys_type()->as_less_comparator(), values);
    if (all_terminal) {
        return value.bind(query_options::DEFAULT);
    } else {
@@ -100,7 +98,7 @@ maps::literal::prepare(database& db, const sstring& keyspace, lw_shared_ptr<colu

 void
 maps::literal::validate_assignable_to(database& db, const sstring& keyspace, const column_specification& receiver) const {
-    if (!receiver.type->without_reversed().is_map()) {
+    if (!dynamic_pointer_cast<const map_type_impl>(receiver.type)) {
        throw exceptions::invalid_request_exception(format("Invalid map literal for {} of type {}", *receiver.name, receiver.type->as_cql3_type()));
    }
    auto&& key_spec = maps::key_spec_of(receiver);
@@ -160,13 +158,15 @@ maps::value::from_serialized(const fragmented_temporary_buffer::view& fragmented
        // Collections have this small hack that validate cannot be called on a serialized object,
        // but compose does the validation (so we're fine).
        // FIXME: deserialize_for_native_protocol?!
-        auto m = value_cast<map_type_impl::native_type>(type.deserialize(fragmented_value, sf));
+      return with_linearized(fragmented_value, [&] (bytes_view value) {
+        auto m = value_cast<map_type_impl::native_type>(type.deserialize(value, sf));
        std::map<bytes, bytes, serialized_compare> map(type.get_keys_type()->as_less_comparator());
        for (auto&& e : m) {
            map.emplace(type.get_keys_type()->decompose(e.first),
                        type.get_values_type()->decompose(e.second));
        }
        return maps::value { std::move(map) };
+      });
    } catch (marshal_exception& e) {
        throw exceptions::invalid_request_exception(e.what());
    }
@@ -263,16 +263,14 @@ maps::marker::bind(const query_options& options) {
        return constants::UNSET_VALUE;
    }
    try {
-        _receiver->type->validate(*val, options.get_cql_serialization_format());
+        with_linearized(*val, [&] (bytes_view value) {
+            _receiver->type->validate(value, options.get_cql_serialization_format());
+        });
    } catch (marshal_exception& e) {
        throw exceptions::invalid_request_exception(
                format("Exception while binding column {:s}: {:s}", _receiver->name->to_cql_string(), e.what()));
    }
-    return ::make_shared<maps::value>(
-            maps::value::from_serialized(
-                    *val,
-                    dynamic_cast<const map_type_impl&>(_receiver->type->without_reversed()),
-                    options.get_cql_serialization_format()));
+    return ::make_shared<maps::value>(maps::value::from_serialized(*val, static_cast<const map_type_impl&>(*_receiver->type), options.get_cql_serialization_format()));
 }

 void
@@ -307,12 +305,6 @@ maps::setter_by_key::execute(mutation& m, const clustering_key_prefix& prefix, c
    assert(column.type->is_multi_cell()); // "Attempted to set a value for a single key on a frozen map"m
    auto key = _k->bind_and_get(params._options);
    auto value = _t->bind_and_get(params._options);
-    if (value.is_unset_value()) {
-        return;
-    }
-    if (key.is_unset_value()) {
-        throw invalid_request_exception("Invalid unset map key");
-    }
    if (!key) {
        throw invalid_request_exception("Invalid null map key");
    }
--- a/cql3/multi_column_relation.hh
+++ b/cql3/multi_column_relation.hh
@@ -59,32 +59,30 @@ namespace cql3 {
 *  - SELECT ... WHERE (a, b) IN ?
 */
 class multi_column_relation final : public relation {
-public:
-    using mode = expr::comparison_order;
 private:
    std::vector<shared_ptr<column_identifier::raw>> _entities;
    shared_ptr<term::multi_column_raw> _values_or_marker;
    std::vector<shared_ptr<term::multi_column_raw>> _in_values;
    shared_ptr<tuples::in_raw> _in_marker;
-    mode _mode;
+
 public:
+
    multi_column_relation(std::vector<shared_ptr<column_identifier::raw>> entities,
        expr::oper_t relation_type, shared_ptr<term::multi_column_raw> values_or_marker,
-        std::vector<shared_ptr<term::multi_column_raw>> in_values, shared_ptr<tuples::in_raw> in_marker, mode m = mode::cql)
+        std::vector<shared_ptr<term::multi_column_raw>> in_values, shared_ptr<tuples::in_raw> in_marker)
        : relation(relation_type)
        , _entities(std::move(entities))
        , _values_or_marker(std::move(values_or_marker))
        , _in_values(std::move(in_values))
        , _in_marker(std::move(in_marker))
-        , _mode(m)
    { }

    static shared_ptr<multi_column_relation> create_multi_column_relation(
        std::vector<shared_ptr<column_identifier::raw>> entities, expr::oper_t relation_type,
        shared_ptr<term::multi_column_raw> values_or_marker, std::vector<shared_ptr<term::multi_column_raw>> in_values,
-        shared_ptr<tuples::in_raw> in_marker, mode m = mode::cql) {
+        shared_ptr<tuples::in_raw> in_marker) {
        return ::make_shared<multi_column_relation>(std::move(entities), relation_type, std::move(values_or_marker),
-            std::move(in_values), std::move(in_marker), m);
+            std::move(in_values), std::move(in_marker));
    }

    /**
@@ -101,15 +99,6 @@ public:
        return create_multi_column_relation(std::move(entities), relation_type, std::move(values_or_marker), {}, {});
    }

-    /**
-     * Same as above, but sets the magic mode that causes us to treat the restrictions as "raw" clustering bounds
-     */
-    static shared_ptr<multi_column_relation> create_scylla_clustering_bound_non_in_relation(std::vector<shared_ptr<column_identifier::raw>> entities,
-                                                                    expr::oper_t relation_type, shared_ptr<term::multi_column_raw> values_or_marker) {
-        assert(relation_type != expr::oper_t::IN);
-        return create_multi_column_relation(std::move(entities), relation_type, std::move(values_or_marker), {}, {}, mode::clustering);
-    }
-
    /**
     * Creates a multi-column IN relation with a list of IN values or markers.
     * For example: "SELECT ... WHERE (a, b) IN ((0, 1), (2, 3))"
@@ -202,7 +191,7 @@ protected:
            return cs->column_specification;
        });
        auto t = to_term(col_specs, *get_value(), db, schema->ks_name(), bound_names);
-        return ::make_shared<restrictions::multi_column_restriction::slice>(schema, rs, bound, inclusive, t, _mode);
+        return ::make_shared<restrictions::multi_column_restriction::slice>(schema, rs, bound, inclusive, t);
    }

    virtual shared_ptr<restrictions::restriction> new_contains_restriction(database& db, schema_ptr schema,
--- a/cql3/prepared_statements_cache.hh
+++ b/cql3/prepared_statements_cache.hh
@@ -102,7 +102,13 @@ private:
    using cache_key_type = typename prepared_cache_key_type::cache_key_type;
    using cache_type = utils::loading_cache<cache_key_type, prepared_cache_entry, utils::loading_cache_reload_enabled::no, prepared_cache_entry_size, utils::tuple_hash, std::equal_to<cache_key_type>, prepared_cache_stats_updater>;
    using cache_value_ptr = typename cache_type::value_ptr;
+    using cache_iterator = typename cache_type::iterator;
    using checked_weak_ptr = typename statements::prepared_statement::checked_weak_ptr;
+    struct value_extractor_fn {
+        checked_weak_ptr operator()(prepared_cache_entry& e) const {
+            return e->checked_weak_from_this();
+        }
+    };

 public:
    static const std::chrono::minutes entry_expiry;
@@ -110,9 +116,12 @@ public:
    using key_type = prepared_cache_key_type;
    using value_type = checked_weak_ptr;
    using statement_is_too_big = typename cache_type::entry_is_too_big;
+    /// \note both iterator::reference and iterator::value_type are checked_weak_ptr
+    using iterator = boost::transform_iterator<value_extractor_fn, cache_iterator>;

 private:
    cache_type _cache;
+    value_extractor_fn _value_extractor_fn;

 public:
    prepared_statements_cache(logging::logger& logger, size_t size)
@@ -126,12 +135,16 @@ public:
        });
    }

-    value_type find(const key_type& key) {
-        cache_value_ptr vp = _cache.find(key.key());
-        if (vp) {
-            return (*vp)->checked_weak_from_this();
-        }
-        return value_type();
+    iterator find(const key_type& key) {
+        return boost::make_transform_iterator(_cache.find(key.key()), _value_extractor_fn);
+    }
+
+    iterator end() {
+        return boost::make_transform_iterator(_cache.end(), _value_extractor_fn);
+    }
+
+    iterator begin() {
+        return boost::make_transform_iterator(_cache.begin(), _value_extractor_fn);
    }

    template <typename Pred>
--- a/cql3/query_options.cc
+++ b/cql3/query_options.cc
@@ -42,14 +42,12 @@
 #include "cql3/cql_config.hh"
 #include "query_options.hh"
 #include "version.hh"
-#include "db/consistency_level_type.hh"

 namespace cql3 {

 const cql_config default_cql_config;

-thread_local const query_options::specific_options query_options::specific_options::DEFAULT{
-    -1, {}, db::consistency_level::SERIAL, api::missing_timestamp};
+thread_local const query_options::specific_options query_options::specific_options::DEFAULT{-1, {}, {}, api::missing_timestamp};

 thread_local query_options query_options::DEFAULT{default_cql_config,
    db::consistency_level::ONE, std::nullopt,
--- a/cql3/query_options.hh
+++ b/cql3/query_options.hh
@@ -81,16 +81,7 @@ private:
    const specific_options _options;
    cql_serialization_format _cql_serialization_format;
    std::optional<std::vector<query_options>> _batch_options;
-    // We must use the same microsecond-precision timestamp for
-    // all cells created by an LWT statement or when a statement
-    // has a user-provided timestamp. In case the statement or
-    // a BATCH appends many values to a list, each value should
-    // get a unique and monotonic timeuuid. This sequence is
-    // used to make all time-based UUIDs:
-    // 1) share the same microsecond,
-    // 2) monotonic
-    // 3) unique.
-    mutable int _list_append_seq = 0;
+
 private:
    /**
     * @brief Batch query_options constructor.
@@ -242,39 +233,6 @@ public:
        return _cql_config;
    }

-    // Generate a next unique list sequence for list append, e.g.
-    // a = a + [val1, val2, ...]
-    int next_list_append_seq() const {
-        return _list_append_seq++;
-    }
-
-    // To preserve prepend monotonicity within a batch, each next
-    // value must get a timestamp that's smaller than the previous one:
-    // BEGIN BATCH
-    //      UPDATE t SET l = [1, 2] + l WHERE pk = 0;
-    //      UPDATE t SET l = [3] + l WHERE pk = 0;
-    //      UPDATE t SET l = [4] + l WHERE pk = 0;
-    // APPLY BATCH
-    // SELECT l FROM t WHERE pk = 0;
-    //  l
-    // ------------
-    // [4, 3, 1, 2]
-    //
-    // This function reserves the given number of prepend entries
-    // and returns an id for the first prepended entry (it
-    // got to be the smallest one, to preserve the order of
-    // a multi-value append).
-    //
-    // @retval sequence number of the first entry of a multi-value
-    // append. To get the next value, add 1.
-    int next_list_prepend_seq(int num_entries, int max_entries) const {
-        if (_list_append_seq + num_entries < max_entries) {
-            _list_append_seq += num_entries;
-            return max_entries - _list_append_seq;
-        }
-        return max_entries;
-    }
-
    void prepare(const std::vector<lw_shared_ptr<column_specification>>& specs);
 private:
    void fill_value_views();
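
The sequence helpers removed from `query_options.hh` in the hunk above can be illustrated in isolation. The following is a minimal sketch, not the project's code: `list_seq_counter` is a hypothetical stand-in for the removed `_list_append_seq` member, and the `max_entries` value used in the usage note is an arbitrary assumption.

```cpp
#include <cassert>

// Hypothetical stand-in for the counter that the diff removes from query_options.
struct list_seq_counter {
    int seq = 0;

    // Appends take increasing sequence numbers: 0, 1, 2, ...
    int next_append_seq() {
        return seq++;
    }

    // Prepends reserve num_entries slots counted down from max_entries, so a
    // later prepend receives a smaller starting id than an earlier one.
    // Returns the id of the first reserved entry; add 1 for each following entry.
    // When the reservation would not fit, max_entries is returned, as in the
    // removed code.
    int next_prepend_seq(int num_entries, int max_entries) {
        if (seq + num_entries < max_entries) {
            seq += num_entries;
            return max_entries - seq;
        }
        return max_entries;
    }
};
```

With an assumed `max_entries` of 100, the three prepends from the removed comment (`[1, 2]`, then `[3]`, then `[4]`) start at ids 98, 97 and 96. Sorting by id gives `[4, 3, 1, 2]`, which matches the order shown in that comment.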
--- a/cql3/query_processor.cc
+++ b/cql3/query_processor.cc
@@ -84,12 +84,11 @@ public:
    }
 };

-query_processor::query_processor(service::storage_proxy& proxy, database& db, service::migration_notifier& mn, service::migration_manager& mm, query_processor::memory_config mcfg, cql_config& cql_cfg)
+query_processor::query_processor(service::storage_proxy& proxy, database& db, service::migration_notifier& mn, query_processor::memory_config mcfg, cql_config& cql_cfg)
        : _migration_subscriber{std::make_unique<migration_subscriber>(this)}
        , _proxy(proxy)
        , _db(db)
        , _mnotifier(mn)
-        , _mm(mm)
        , _cql_config(cql_cfg)
        , _internal_state(new internal_state())
        , _prepared_cache(prep_cache_log, mcfg.prepared_statment_cache_size)
@@ -528,7 +527,7 @@ query_processor::process_authorized_statement(const ::shared_ptr<cql_statement>

    statement->validate(_proxy, client_state);

-    auto fut = statement->execute(*this, query_state, options);
+    auto fut = statement->execute(_proxy, query_state, options);

    return fut.then([statement] (auto msg) {
        if (msg) {
@@ -667,13 +666,10 @@ struct internal_query_state {
    bool more_results = true;
 };

-::shared_ptr<internal_query_state> query_processor::create_paged_state(
-        const sstring& query_string,
-        db::consistency_level cl,
-        const std::initializer_list<data_value>& values,
-        int32_t page_size) {
+::shared_ptr<internal_query_state> query_processor::create_paged_state(const sstring& query_string,
+        const std::initializer_list<data_value>& values, int32_t page_size) {
    auto p = prepare_internal(query_string);
-    auto opts = make_internal_options(p, values, cl, page_size);
+    auto opts = make_internal_options(p, values, db::consistency_level::ONE, page_size);
    ::shared_ptr<internal_query_state> res = ::make_shared<internal_query_state>(
            internal_query_state{
                    query_string,
@@ -752,7 +748,7 @@ future<> query_processor::for_each_cql_result(

 future<::shared_ptr<untyped_result_set>>
 query_processor::execute_paged_internal(::shared_ptr<internal_query_state> state) {
-    return state->p->statement->execute(*this, *_internal_state, *state->opts).then(
+    return state->p->statement->execute(_proxy, *_internal_state, *state->opts).then(
            [state, this](::shared_ptr<cql_transport::messages::result_message> msg) mutable {
        class visitor : public result_message::visitor_base {
            ::shared_ptr<internal_query_state> _state;
@@ -826,7 +822,7 @@ query_processor::execute_with_params(
        const std::initializer_list<data_value>& values) {
    auto opts = make_internal_options(p, values, cl);
    return do_with(std::move(opts), [this, &query_state, p = std::move(p)](auto & opts) {
-        return p->statement->execute(*this, query_state, opts).then([](auto msg) {
+        return p->statement->execute(_proxy, query_state, opts).then([](auto msg) {
            return make_ready_future<::shared_ptr<untyped_result_set>>(::make_shared<untyped_result_set>(msg));
        });
    });
@@ -854,7 +850,7 @@ query_processor::execute_batch(
                }
                log.trace("execute_batch({}): {}", batch->get_statements().size(), oss.str());
            }
-            return batch->execute(*this, query_state, options);
+            return batch->execute(_proxy, query_state, options);
        });
    });
 }
@@ -943,22 +939,20 @@ bool query_processor::migration_subscriber::should_invalidate(
        sstring ks_name,
        std::optional<sstring> cf_name,
        ::shared_ptr<cql_statement> statement) {
-    return statement->depends_on(ks_name, cf_name);
+    return statement->depends_on_keyspace(ks_name) && (!cf_name || statement->depends_on_column_family(*cf_name));
 }

-future<> query_processor::query_internal(
+future<> query_processor::query(
        const sstring& query_string,
-        db::consistency_level cl,
        const std::initializer_list<data_value>& values,
-        int32_t page_size,
        noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f) {
-    return for_each_cql_result(create_paged_state(query_string, cl, values, page_size), std::move(f));
+    return for_each_cql_result(create_paged_state(query_string, values), std::move(f));
 }

-future<> query_processor::query_internal(
+future<> query_processor::query(
        const sstring& query_string,
        noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f) {
-    return query_internal(query_string, db::consistency_level::ONE, {}, 1000, std::move(f));
+    return for_each_cql_result(create_paged_state(query_string, {}), std::move(f));
 }

 }
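
The `should_invalidate` change above replaces a single `depends_on(ks_name, cf_name)` call with a keyspace check combined with an optional column family check. A minimal sketch of that predicate follows, assuming a hypothetical `statement_stub` type that depends on exactly one keyspace and one table; the real `cql_statement` interface is not reproduced here.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for a prepared statement bound to one keyspace and table.
struct statement_stub {
    std::string ks;
    std::string cf;

    bool depends_on_keyspace(const std::string& name) const { return ks == name; }
    bool depends_on_column_family(const std::string& name) const { return cf == name; }
};

// Mirrors the expression in the diff: the keyspace must match, and the column
// family is only consulted when the caller supplied one.
bool should_invalidate(const std::string& ks_name,
                       const std::optional<std::string>& cf_name,
                       const statement_stub& s) {
    return s.depends_on_keyspace(ks_name)
        && (!cf_name || s.depends_on_column_family(*cf_name));
}
```

In this sketch a keyspace-wide event (no `cf_name`) invalidates every statement in that keyspace, while a table-level event only invalidates statements on that table.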
--- a/cql3/query_processor.hh
+++ b/cql3/query_processor.hh
@@ -58,10 +58,6 @@
 #include "service/query_state.hh"
 #include "transport/messages/result_message.hh"

-namespace service {
-class migration_manager;
-}
-
 namespace cql3 {

 namespace statements {
@@ -119,7 +115,6 @@ private:
    service::storage_proxy& _proxy;
    database& _db;
    service::migration_notifier& _mnotifier;
-    service::migration_manager& _mm;
    const cql_config& _cql_config;

    struct stats {
@@ -154,7 +149,7 @@ public:

    static std::unique_ptr<statements::raw::parsed_statement> parse_statement(const std::string_view& query);

-    query_processor(service::storage_proxy& proxy, database& db, service::migration_notifier& mn, service::migration_manager& mm, memory_config mcfg, cql_config& cql_cfg);
+    query_processor(service::storage_proxy& proxy, database& db, service::migration_notifier& mn, memory_config mcfg, cql_config& cql_cfg);

    ~query_processor();

@@ -170,19 +165,16 @@ public:
        return _proxy;
    }

-    const service::migration_manager& get_migration_manager() const noexcept { return _mm; }
-    service::migration_manager& get_migration_manager() noexcept { return _mm; }
-
    cql_stats& get_cql_stats() {
        return _cql_stats;
    }

    statements::prepared_statement::checked_weak_ptr get_prepared(const std::optional<auth::authenticated_user>& user, const prepared_cache_key_type& key) {
        if (user) {
-            auto vp = _authorized_prepared_cache.find(*user, key);
-            if (vp) {
+            auto it = _authorized_prepared_cache.find(*user, key);
+            if (it != _authorized_prepared_cache.end()) {
                try {
-                    return vp->get()->checked_weak_from_this();
+                    return it->get()->checked_weak_from_this();
                } catch (seastar::checked_ptr_is_null_exception&) {
                    // If the prepared statement got invalidated - remove the corresponding authorized_prepared_statements_cache entry as well.
                    _authorized_prepared_cache.remove(*user, key);
@@ -193,7 +185,11 @@ public:
    }

    statements::prepared_statement::checked_weak_ptr get_prepared(const prepared_cache_key_type& key) {
-        return _prepared_cache.find(key);
+        auto it = _prepared_cache.find(key);
+        if (it == _prepared_cache.end()) {
+            return statements::prepared_statement::checked_weak_ptr();
+        }
+        return *it;
    }

    future<::shared_ptr<cql_transport::messages::result_message>>
@@ -227,49 +223,75 @@ public:
    /*!
     * \brief iterate over all cql results using paging
     *
-     * You create a statement with optional parameters and pass
-     * a function that goes over the result rows.
+     * You Create a statement with optional paraemter and pass
+     * a function that goes over the results.
     *
-     * The passed function would be called for all rows; return future<stop_iteration::yes>
-     * to stop iteration.
+     * The passed function would be called for all the results, return stop_iteration::yes
+     * to stop during iteration.
     *
     * For example:
-            return query_internal(
-                    "SELECT * from system.compaction_history",
-                    db::consistency_level::ONE,
-                    {},
-                    [&history] (const cql3::untyped_result_set::row& row) mutable {
+            return query("SELECT * from system.compaction_history",
+                         [&history] (const cql3::untyped_result_set::row& row) mutable {
+                ....
+                ....
+                return stop_iteration::no;
+            });
+
+     * You can use place holder in the query, the prepared statement will only be done once.
+     *
+     *
+     * query_string - the cql string, can contain place holder
+     * f - a function to be run on each of the query result, if the function return false the iteration would stop
+     * args - arbitrary number of query parameters
+     */
+    template<typename... Args>
+    future<> query(
+            const sstring& query_string,
+            std::function<stop_iteration(const cql3::untyped_result_set_row&)>&& f,
+            Args&&... args) {
+        return for_each_cql_result(
+                create_paged_state(query_string, { data_value(std::forward<Args>(args))... }), std::move(f));
+    }
+
+    /*!
+     * \brief iterate over all cql results using paging
+     *
+     * You Create a statement with optional paraemter and pass
+     * a function that goes over the results.
+     *
+     * The passed function would be called for all the results, return future<stop_iteration::yes>
+     * to stop during iteration.
+     *
+     * For example:
+            return query("SELECT * from system.compaction_history",
+                         [&history] (const cql3::untyped_result_set::row& row) mutable {
                ....
                ....
                return make_ready_future<stop_iteration>(stop_iteration::no);
            });

-     * You can use placeholders in the query, the statement will only be prepared once.
+     * You can use place holder in the query, the prepared statement will only be done once.
     *
-     * query_string - the cql string, can contain placeholders
-     * cl - consistency level of the query
-     * values - values to be substituted for the placeholders in the query
-     * page_size - maximum page size
-     * f - a function to be run on each row of the query result,
-     *     if the function returns stop_iteration::yes the iteration will stop
+     *
+     * query_string - the cql string, can contain place holder
+     * values - query parameters value
+     * f - a function to be run on each of the query result, if the function return stop_iteration::no the iteration
+     * would stop
     */
-    future<> query_internal(
+    future<> query(
            const sstring& query_string,
-            db::consistency_level cl,
            const std::initializer_list<data_value>& values,
-            int32_t page_size,
            noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f);

    /*
     * \brief iterate over all cql results using paging
-     * An overload of query_internal without query parameters
-     * using CL = ONE, no timeout, and page size = 1000.
+     * An overload of the query with future function without query parameters.
     *
-     * query_string - the cql string, can contain placeholders
-     * f - a function to be run on each row of the query result,
-     *     if the function returns stop_iteration::yes the iteration will stop
+     * query_string - the cql string, can contain place holder
+     * f - a function to be run on each of the query result, if the function return stop_iteration::no the iteration
+     * would stop
     */
-    future<> query_internal(
+    future<> query(
            const sstring& query_string,
            noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f);

@@ -335,9 +357,8 @@ private:
     */
    ::shared_ptr<internal_query_state> create_paged_state(
            const sstring& query_string,
-            db::consistency_level,
-            const std::initializer_list<data_value>&,
-            int32_t page_size);
+            const std::initializer_list<data_value>& = { },
+            int32_t page_size = 1000);

    /*!
     * \brief run a query using paging
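
The paged `query` overloads documented above call a user function per row and stop early when it asks to. The sketch below shows that control flow only, under loud simplifications: it is synchronous, rows are plain `int` values in a `std::vector`, and `for_each_row_paged` and this local `stop_iteration` enum are hypothetical names rather than the Seastar or Scylla types. Iteration stops when the callback returns `stop_iteration::yes`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Local stand-in for the real stop_iteration type (assumption for this sketch).
enum class stop_iteration { no, yes };

// Walks rows one page at a time and invokes f on each row.
// Returns the number of rows the callback was invoked on.
std::size_t for_each_row_paged(const std::vector<int>& rows,
                               std::size_t page_size,
                               const std::function<stop_iteration(int)>& f) {
    if (page_size == 0) {
        return 0; // nothing can be fetched with an empty page
    }
    std::size_t visited = 0;
    for (std::size_t start = 0; start < rows.size(); start += page_size) {
        // One "page": at most page_size rows starting at start.
        const std::size_t end = std::min(start + page_size, rows.size());
        for (std::size_t i = start; i < end; ++i) {
            ++visited;
            if (f(rows[i]) == stop_iteration::yes) {
                return visited; // the callback asked to stop; skip remaining pages
            }
        }
    }
    return visited;
}
```

For example, iterating five rows with a page size of 2 and a callback that returns `stop_iteration::yes` on the fourth row visits four rows and never fetches the third page.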
--- a/cql3/restrictions/multi_column_restriction.hh
+++ b/cql3/restrictions/multi_column_restriction.hh
@@ -377,24 +377,21 @@ protected:

 class multi_column_restriction::slice final : public multi_column_restriction {
    using restriction_shared_ptr = ::shared_ptr<clustering_key_restrictions>;
-    using mode = expr::comparison_order;
+private:
    term_slice _slice;
-    mode _mode;

-    slice(schema_ptr schema, std::vector<const column_definition*> defs, term_slice slice, mode m)
+    slice(schema_ptr schema, std::vector<const column_definition*> defs, term_slice slice)
        : multi_column_restriction(schema, std::move(defs))
        , _slice(slice)
-        , _mode(m)
    { }
 public:
-    slice(schema_ptr schema, std::vector<const column_definition*> defs, statements::bound bound, bool inclusive, shared_ptr<term> term, mode m = mode::cql)
-        : slice(schema, defs, term_slice::new_instance(bound, inclusive, term), m)
+    slice(schema_ptr schema, std::vector<const column_definition*> defs, statements::bound bound, bool inclusive, shared_ptr<term> term)
+        : slice(schema, defs, term_slice::new_instance(bound, inclusive, term))
    {
        expression = expr::binary_operator{
            std::vector<expr::column_value>(defs.cbegin(), defs.cend()),
            expr::pick_operator(bound, inclusive),
-            std::move(term),
-            m};
+            std::move(term)};
    }

    virtual bool is_supported_by(const secondary_index::index& index) const override {
@@ -407,7 +404,7 @@ public:
    }

    virtual std::vector<bounds_range_type> bounds_ranges(const query_options& options) const override {
-        if (_mode == mode::clustering || !is_mixed_order()) {
+        if (!is_mixed_order()) {
            return bounds_ranges_unified_order(options);
        } else {
            return bounds_ranges_mixed_order(options);
@@ -443,11 +440,6 @@ public:
                   get_columns_in_commons(other));
        auto other_slice = static_pointer_cast<slice>(other);

-        static auto mode2str = [](auto m) { return m == mode::cql ? "plain" : "SCYLLA_CLUSTERING_BOUND"; };
-        check_true(other_slice->_mode == this->_mode, 
-                    "Invalid combination of restrictions (%s / %s)",
-                    mode2str(this->_mode), mode2str(other_slice->_mode)
-                    );
        check_false(_slice.has_bound(statements::bound::START) && other_slice->_slice.has_bound(statements::bound::START),
                    "More than one restriction was found for the start bound on %s",
                    get_columns_in_commons(other));
@@ -492,7 +484,7 @@ private:
            auto end_prefix = clustering_key_prefix::from_optional_exploded(*_schema, end_components);
            end_bound = bounds_range_type::bound(std::move(end_prefix), _slice.is_inclusive(statements::bound::END));
        }
-        if (_mode == mode::cql && !is_asc_order()) {
+        if (!is_asc_order()) {
            std::swap(start_bound, end_bound);
        }
        auto range = bounds_range_type(start_bound, end_bound);
--- a/cql3/restrictions/single_column_primary_key_restrictions.hh
+++ b/cql3/restrictions/single_column_primary_key_restrictions.hh
@@ -171,7 +171,8 @@ public:

    virtual void merge_with(::shared_ptr<restriction> restriction) override {
        if (find_atom(restriction->expression, [] (const expr::binary_operator& b) {
-                    return std::holds_alternative<std::vector<expr::column_value>>(b.lhs);
+                    return std::holds_alternative<std::vector<expr::column_value>>(b.lhs)
+                            && std::get<std::vector<expr::column_value>>(b.lhs).size() > 1;
                })) {
            throw exceptions::invalid_request_exception(
                "Mixing single column relations and multi column relations on clustering columns is not allowed");
@@ -212,22 +213,30 @@ private:
    std::vector<range_type> compute_bounds(const query_options& options) const {
        std::vector<range_type> ranges;

+        static constexpr auto invalid_null_msg = std::is_same<ValueType, partition_key>::value
+            ? "Invalid null value for partition key part %s" : "Invalid null value for clustering key part %s";
+
        // TODO: rewrite this to simply invoke possible_lhs_values on each clustering column, find the first
        // non-list, and take Cartesian product of that prefix.  No need for to_range() and std::get() here.
        if (_restrictions->is_all_eq()) {
+            if (_restrictions->size() == 1) {
+                auto&& e = *restrictions().begin();
+                const auto b = std::get<expr::binary_operator>(e.second->expression).rhs->bind_and_get(options);
+                if (!b) {
+                    throw exceptions::invalid_request_exception(sprint(invalid_null_msg, e.first->name_as_text()));
+                }
+                return {range_type::make_singular(ValueType::from_single_value(*_schema, to_bytes(b)))};
+            }
            std::vector<bytes> components;
            components.reserve(_restrictions->size());
            for (auto&& e : restrictions()) {
                const column_definition* def = e.first;
                assert(components.size() == _schema->position(*def));
-                // Because _restrictions is all EQ, possible_lhs_values must return a list, not a range.
-                const auto b = std::get<expr::value_list>(possible_lhs_values(e.first, e.second->expression, options));
-                // Furthermore, this list is either a single element (when all RHSs are the same) or empty (when at
-                // least two are different, so the restrictions cannot hold simultaneously -- ie, c=1 AND c=2).
-                if (b.empty()) {
-                    return {};
+                const auto b = std::get<expr::binary_operator>(e.second->expression).rhs->bind_and_get(options);
+                if (!b) {
+                    throw exceptions::invalid_request_exception(sprint(invalid_null_msg, e.first->name_as_text()));
                }
-                components.emplace_back(b.front());
+                components.emplace_back(to_bytes(b));
            }
            return {range_type::make_singular(ValueType::from_exploded(*_schema, std::move(components)))};
        }
@@ -315,7 +324,7 @@ public:
        std::vector<bytes_opt> res;
        for (const ValueType& r : src) {
            for (const auto& component : r.components()) {
-                res.emplace_back(to_bytes(component));
+                res.emplace_back(component);
            }
        }
        return res;
--- a/cql3/restrictions/single_column_restrictions.hh
+++ b/cql3/restrictions/single_column_restrictions.hh
@@ -108,9 +108,6 @@ public:
            return bytes_opt{};
        } else {
            const auto values = std::get<expr::value_list>(possible_lhs_values(&cdef, it->second->expression, options));
-            if (values.empty()) {
-                return bytes_opt{};
-            }
            assert(values.size() == 1);
            return values.front();
        }
@@ -122,7 +119,7 @@ public:
     * @param column_def the column definition
     * @return the restriction associated to the specified column
     */
-    ::shared_ptr<single_column_restriction> get_restriction(const column_definition& column_def) const {
+    ::shared_ptr<restriction> get_restriction(const column_definition& column_def) const {
        auto i = _restrictions.find(&column_def);
        if (i == _restrictions.end()) {
            return {};
Author	SHA1	Message	Date
Piotr Sarna	fcb349b026	tests: add tests for per-role timeouts The test cases verify that setting timeout parameters per-role works and is validated.	2020-11-27 12:43:53 +01:00
Piotr Sarna	28c558af95	docs: add a paragaph about per-role parameters This paragraph is also the first one in newly crated roles.md, which should be later filled with more information about roles.	2020-11-27 12:43:53 +01:00
Piotr Sarna	83b47ae394	cql3: add validating per-role timeout options Per-role timeout options are now validated when set: - they should represent a valid duration - the duration should have millisecond granularity, since the timeout clock does not support micro/nanoseconds.	2020-11-27 12:37:27 +01:00
Piotr Sarna	391d1f2b21	client_state: add updating per-role params Per-role parameters (currently: read_timeout and write_timeout) are now updated when a new connection is established. Also, the changes are immediately propagated for the connection which sent the CREATE ROLE/ALTER ROLE statement. The other connections which have the changed role are currently not immediately reloaded. It can be done in the future if needed, but all sessions with given roles should be tracked, or, alternatively, all sessions should be iterated and changed.	2020-11-27 12:37:27 +01:00
Piotr Sarna	137a8a0161	auth: add options support to password authenticator Custom options will be used later to provide per-role timeouts and other useful parameters.	2020-11-27 12:37:17 +01:00
Piotr Sarna	c473cb4a2d	treewide: remove timeout config from query options Timeout config is now stored in each connection, so there's no point in tracking it inside each query as well. This patch removes timeout_config from query_options and follows by removing now unnecessary parameters of many functions and constructors.	2020-11-26 17:56:55 +01:00
Piotr Sarna	98fac66361	cql3: use timeout config from client state instead of query options ... in batch statement, in order to be able to remove the timeout from query options later.	2020-11-26 17:55:29 +01:00
Piotr Sarna	2cbeb3678f	cql3: use timeout config from client state instead of query options ... in modification statement, in order to be able to remove the timeout from query options later.	2020-11-26 17:55:29 +01:00
Piotr Sarna	d61e1fd174	cql3: use timeout config from client state instead of query options ... in select statement, in order to be able to remove the timeout from query options later.	2020-11-26 17:55:29 +01:00
Piotr Sarna	f31ac0a8ca	service: add timeout config to client state Future patches will use this per-connection timeout config to allow setting different timeouts for each session, based on roles.	2020-11-26 17:55:14 +01:00
				`@@ -1 +0,0 @@`
				`Dedicated to the memory of Alberto José Araújo, a coworker and a friend.`