Compare commits

..

20 Commits

Author SHA1 Message Date
Yaniv Michael Kaul
8c293dfe36 schema/schema_registry: reformat code style
Reformat constructor initializer lists, brace placement, and
line wrapping for consistency.

The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around trace calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:41 +02:00
Yaniv Michael Kaul
97fb93a5b4 utils/gcp/object_storage: fix dead code and format string bug
Severity: low (dead code) + correctness (format string bug)

- Remove unused error_msg variable (parsed JSON but result discarded)
- Fix format string: "()" should be "({})" so _session_path is
  actually included in the debug log instead of being silently discarded
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:41 +02:00
Yaniv Michael Kaul
357b94ac60 utils/logalloc: remove redundant format() wrapper in reclamation rate log
Severity: low

format("{:.3f}", ...) was eagerly allocating a string to pass to
the logger. Move the precision specifier directly into the log format
string, letting the logger handle formatting lazily.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:40 +02:00
Yaniv Michael Kaul
037f291d48 dht/boot_strapper: reformat code style
Reformat function parameters, constructor initializer lists, and
line wrapping for consistency.

The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around debug calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:40 +02:00
Yaniv Michael Kaul
b246a70404 dht/range_streamer: reformat code style
Reformat indentation, brace placement, lambda formatting, and
line wrapping for consistency.

The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around debug calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:40 +02:00
Yaniv Michael Kaul
a071d54160 cdc/split: reformat code style
Reformat indentation, brace placement, lambda formatting, and
line wrapping for consistency.

The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around trace calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:40 +02:00
Yaniv Michael Kaul
62d369c113 auth/ldap_role_manager: reformat code style
Reformat indentation, brace placement, and line wrapping for
consistency.

The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around debug calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:40 +02:00
Yaniv Michael Kaul
bce09e055c locator/topology: defer expensive format() calls in debug log with value_of()
Severity: medium

Five format("{}", ...) ternary expressions create temporary strings
for a debug log on every topology node update. Use seastar::value_of()
to defer the formatting to log-time, avoiding 5 heap allocations when
debug logging is disabled.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:40 +02:00
Yaniv Michael Kaul
37004ac4dc locator/tablets: reformat code style
Reformat indentation, brace placement, lambda formatting, and
line wrapping for consistency.

The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around debug calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:40 +02:00
Yaniv Michael Kaul
6f619908ce locator/network_topology_strategy: reformat code style
Reformat indentation, brace placement, and line wrapping for
consistency.

The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around debug calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:40 +02:00
Yaniv Michael Kaul
0fe552b074 cql3/query_processor: reformat code style
Reformat indentation, brace placement, lambda formatting, and
line wrapping for consistency.

The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around debug calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:39 +02:00
Yaniv Michael Kaul
14482e2e07 gms/gossiper: guard expensive gossip message formatting in debug/trace logs
Severity: high (includes correctness bug fix)

- Guard formatting of gossip_digest_syn + ack at debug level (high: large
  message structures formatted per gossip round)
- Guard formatting of gossip_digest_ack at trace level (high: includes
  full endpoint_state_map)
- Fix use-after-move bug: ack2_msg was read after std::move at line 405.
  Capture formatted string before the move, guard both log calls (high:
  correctness bug + waste)
- Guard nodes vector formatting inside parallel_for_each loop (high:
  O(N^2) formatting across N iterations)
- Guard live_endpoints + _endpoints_to_talk_with formatting (medium)
- Guard live_members set formatting called twice per gossip round (medium)
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:39 +02:00
Yaniv Michael Kaul
4bda6dded5 replica/database: defer eager set formatting in debug log with value_of()
Severity: medium

fmt::format("{}", ks_names_set) in a ternary expression eagerly
formats an unordered_set<sstring> even when debug logging is disabled.
Use seastar::value_of() to defer the formatting to log-time.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:39 +02:00
Yaniv Michael Kaul
0be4fac5e5 db/schema_applier: reformat code style
Reformat indentation, brace placement, lambda formatting, and
line wrapping for consistency.

The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around trace calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:39 +02:00
Yaniv Michael Kaul
8818a0347e streaming/stream_transfer_task: reformat code style
Reformat indentation, brace placement, lambda formatting, and
line wrapping for consistency.

The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around debug calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:39 +02:00
Yaniv Michael Kaul
489bdb2f93 repair/repair: defer percentage calculations in debug log with value_of()
Severity: low

Six percentage calculation function calls were evaluated eagerly as
debug log arguments on every repair range completion. Use
seastar::value_of() to defer the computations to log-time, skipping
them when debug logging is disabled.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:39 +02:00
Yaniv Michael Kaul
9efd94199d repair/row_level: avoid redundant coroutine invocations in debug logs
Severity: high

working_row_hashes().get() was being called multiple times purely for
debug log arguments, rebuilding the entire hash set each time. Cache
the result and wrap debug logs in is_enabled() guards to avoid the
expensive coroutine re-invocation when debug logging is disabled.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:39 +02:00
Yaniv Michael Kaul
10085ddb25 compaction/leveled_compaction_strategy: guard async I/O in debug log
Severity: high

co_await table_s.main_sstable_set() was called solely to get the
sstable count for a debug log message. This performs async I/O on
every LCS compaction even when debug logging is disabled. Wrap in
an is_enabled() guard.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:38 +02:00
Yaniv Michael Kaul
8635a2aa6f compaction/time_window_compaction_strategy: defer O(n^2) overlap computation in debug log with value_of()
Severity: high

Two sstable_set_overlapping_count() calls (O(n^2) overlap computation)
were passed as direct arguments to a debug log, causing them to be
evaluated eagerly even when debug logging is disabled. Use
seastar::value_of() to defer the computation to log-time.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:38 +02:00
Yaniv Michael Kaul
c7bb7b34c0 compaction/compaction_manager: avoid eager evaluation in debug log arguments
Severity: low-to-medium

- Remove redundant format() wrapping in debug log at consume_sstable()
  (low: one heap allocation per sstable consumed)
- Move get_compactions(filter).size() inside is_enabled() guard in
  do_stop_ongoing_compactions() (medium: builds entire compaction vector
  just for .size() in log message)
- Guard eager ::format() of descriptor.sstables with is_enabled(debug)
  (medium-high: formats entire sstable vector on every minor compaction)
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:38 +02:00
251 changed files with 8612 additions and 18029 deletions

View File

@@ -7,11 +7,6 @@ on:
- synchronize
- reopened
permissions:
contents: read
pull-requests: write
statuses: write
jobs:
validate_pr_author_email:
uses: scylladb/github-automation/.github/workflows/validate_pr_author_email.yml@main

View File

@@ -2,12 +2,6 @@ cmake_minimum_required(VERSION 3.27)
project(scylla)
# Disable CMake's automatic -fcolor-diagnostics injection (CMake 3.24+ adds
# it for Clang+Ninja). configure.py does not add any color diagnostics flags,
# so we clear the internal CMake variable to prevent injection.
set(CMAKE_CXX_COMPILE_OPTIONS_COLOR_DIAGNOSTICS "")
set(CMAKE_C_COMPILE_OPTIONS_COLOR_DIAGNOSTICS "")
list(APPEND CMAKE_MODULE_PATH
${CMAKE_CURRENT_SOURCE_DIR}/cmake
${CMAKE_CURRENT_SOURCE_DIR}/seastar/cmake)
@@ -57,16 +51,6 @@ set(CMAKE_CXX_EXTENSIONS ON CACHE INTERNAL "")
set(CMAKE_CXX_SCAN_FOR_MODULES OFF CACHE INTERNAL "")
set(CMAKE_VISIBILITY_INLINES_HIDDEN ON)
# Global defines matching configure.py
# Since gcc 13, libgcc doesn't need the exception workaround
add_compile_definitions(SEASTAR_NO_EXCEPTION_HACK)
# Hacks needed to expose internal APIs for xxhash dependencies
add_compile_definitions(XXH_PRIVATE_API)
# SEASTAR_TESTING_MAIN is added later (after add_subdirectory(seastar) and
# add_subdirectory(abseil)) to avoid leaking into the seastar subdirectory.
# If SEASTAR_TESTING_MAIN is defined globally before seastar, it causes a
# duplicate 'main' symbol in seastar_testing.
if(is_multi_config)
find_package(Seastar)
# this is atypical compared to standard ExternalProject usage:
@@ -114,31 +98,10 @@ else()
set(Seastar_IO_URING ON CACHE BOOL "" FORCE)
set(Seastar_SCHEDULING_GROUPS_COUNT 21 CACHE STRING "" FORCE)
set(Seastar_UNUSED_RESULT_ERROR ON CACHE BOOL "" FORCE)
# Match configure.py's build_seastar_shared_libs: Debug and Dev
# build Seastar as a shared library, others build it static.
if(CMAKE_BUILD_TYPE STREQUAL "Debug" OR CMAKE_BUILD_TYPE STREQUAL "Dev")
set(BUILD_SHARED_LIBS ON CACHE BOOL "" FORCE)
else()
set(BUILD_SHARED_LIBS OFF CACHE BOOL "" FORCE)
endif()
add_subdirectory(seastar)
# Coverage mode sets cmake_build_type='Debug' for Seastar
# (configure.py:515), so Seastar's pkg-config output includes sanitizer
# link flags in seastar_libs_coverage (configure.py:2514,2649).
# Seastar's own CMake only activates sanitizer targets for Debug/Sanitize
# configs, so we inject link options on the seastar target for Coverage.
# Using PUBLIC ensures they propagate to all targets linking Seastar
# (but not standalone tools like patchelf), matching configure.py's
# behavior. Compile-time flags and defines are handled globally in
# cmake/mode.Coverage.cmake.
if(CMAKE_BUILD_TYPE STREQUAL "Coverage")
target_link_options(seastar
PUBLIC
-fsanitize=address
-fsanitize=undefined
-fsanitize=vptr)
endif()
target_compile_definitions (seastar
PRIVATE
SEASTAR_NO_EXCEPTION_HACK)
endif()
set(ABSL_PROPAGATE_CXX_STD ON CACHE BOOL "" FORCE)
@@ -148,10 +111,8 @@ if(Scylla_ENABLE_LTO)
endif()
find_package(Sanitizers QUIET)
# Match configure.py:2192 — abseil gets sanitizer flags with -fno-sanitize=vptr
# to exclude vptr checks which are incompatible with abseil's usage.
list(APPEND absl_cxx_flags
$<$<CONFIG:Debug,Sanitize>:$<TARGET_PROPERTY:Sanitizers::address,INTERFACE_COMPILE_OPTIONS>;$<TARGET_PROPERTY:Sanitizers::undefined_behavior,INTERFACE_COMPILE_OPTIONS>;-fno-sanitize=vptr>)
$<$<CONFIG:Debug,Sanitize>:$<TARGET_PROPERTY:Sanitizers::address,INTERFACE_COMPILE_OPTIONS>;$<TARGET_PROPERTY:Sanitizers::undefined_behavior,INTERFACE_COMPILE_OPTIONS>>)
if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
list(APPEND ABSL_GCC_FLAGS ${absl_cxx_flags})
elseif(CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
@@ -176,38 +137,9 @@ add_library(absl::headers ALIAS absl-headers)
# unfortunately.
set_target_properties(absl_strerror PROPERTIES EXCLUDE_FROM_ALL TRUE)
# Now that seastar and abseil subdirectories are fully processed, add
# SEASTAR_TESTING_MAIN globally. This matches configure.py's global define
# without leaking into seastar (which would cause duplicate main symbols).
add_compile_definitions(SEASTAR_TESTING_MAIN)
# System libraries dependencies
find_package(Boost REQUIRED
COMPONENTS filesystem program_options system thread regex unit_test_framework)
# When using shared Boost libraries, define BOOST_ALL_DYN_LINK (matching configure.py)
if(NOT Boost_USE_STATIC_LIBS)
add_compile_definitions(BOOST_ALL_DYN_LINK)
endif()
# CMake's Boost package config adds per-component defines like
# BOOST_UNIT_TEST_FRAMEWORK_DYN_LINK, BOOST_REGEX_DYN_LINK, etc. on the
# imported targets. configure.py only uses BOOST_ALL_DYN_LINK (which covers
# all components), so strip the per-component defines to align the two build
# systems.
foreach(_boost_target
Boost::unit_test_framework
Boost::regex
Boost::filesystem
Boost::program_options
Boost::system
Boost::thread)
if(TARGET ${_boost_target})
# Completely remove all INTERFACE_COMPILE_DEFINITIONS from the Boost target.
# This prevents per-component *_DYN_LINK and *_NO_LIB defines from
# propagating. BOOST_ALL_DYN_LINK (set globally) covers all components.
set_property(TARGET ${_boost_target} PROPERTY INTERFACE_COMPILE_DEFINITIONS)
endif()
endforeach()
target_link_libraries(Boost::regex
INTERFACE
ICU::i18n
@@ -264,10 +196,6 @@ if (Scylla_USE_PRECOMPILED_HEADER)
message(STATUS "Using precompiled header for Scylla - remember to add `sloppiness = pch_defines,time_macros` to ccache.conf, if you're using ccache.")
target_precompile_headers(scylla-precompiled-header PRIVATE "stdafx.hh")
target_compile_definitions(scylla-precompiled-header PRIVATE SCYLLA_USE_PRECOMPILED_HEADER)
# Match configure.py: -fpch-validate-input-files-content tells the compiler
# to check content of stdafx.hh if timestamps don't match (important for
# ccache/git workflows where timestamps may not be preserved).
add_compile_options(-fpch-validate-input-files-content)
endif()
else()
set(Scylla_USE_PRECOMPILED_HEADER_USE OFF)

2
abseil

Submodule abseil updated: 255c84dadd...d7aaad83b4

View File

@@ -699,17 +699,6 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
// for such a size.
co_return api_error::payload_too_large(fmt::format("Request content length limit of {} bytes exceeded", request_content_length_limit));
}
// Check the concurrency limit early, before acquiring memory and
// reading the request body, to avoid piling up memory from excess
// requests that will be rejected anyway. This mirrors the CQL
// transport which also checks concurrency before memory acquisition
// (transport/server.cc).
if (_pending_requests.get_count() >= _max_concurrent_requests) {
_executor._stats.requests_shed++;
co_return api_error::request_limit_exceeded(format("too many in-flight requests (configured via max_concurrent_requests_per_shard): {}", _pending_requests.get_count()));
}
_pending_requests.enter();
auto leave = defer([this] () noexcept { _pending_requests.leave(); });
// JSON parsing can allocate up to roughly 2x the size of the raw
// document, + a couple of bytes for maintenance.
// If the Content-Length of the request is not available, we assume
@@ -771,6 +760,12 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
_executor._stats.unsupported_operations++;
co_return api_error::unknown_operation(fmt::format("Unsupported operation {}", op));
}
if (_pending_requests.get_count() >= _max_concurrent_requests) {
_executor._stats.requests_shed++;
co_return api_error::request_limit_exceeded(format("too many in-flight requests (configured via max_concurrent_requests_per_shard): {}", _pending_requests.get_count()));
}
_pending_requests.enter();
auto leave = defer([this] () noexcept { _pending_requests.leave(); });
executor::client_state client_state(service::client_state::external_tag(),
_auth_service, &_sl_controller, _timeout_config.current_values(), req->get_client_address());
if (!username.empty()) {

View File

@@ -33,7 +33,7 @@
#include "data_dictionary/data_dictionary.hh"
#include "utils/rjson.hh"
static logging::logger slogger("alternator-streams");
static logging::logger elogger("alternator-streams");
/**
* Base template type to implement rapidjson::internal::TypeHelper<...>:s
@@ -437,7 +437,7 @@ const cdc::stream_id& find_parent_shard_in_previous_generation(db_clock::time_po
if (prev_streams.empty()) {
// something is really wrong - streams are empty
// let's try internal_error in hope it will be notified and fixed
on_internal_error(slogger, fmt::format("streams are empty for cdc generation at {} ({})", prev_timestamp, prev_timestamp.time_since_epoch().count()));
on_internal_error(elogger, fmt::format("streams are empty for cdc generation at {} ({})", prev_timestamp, prev_timestamp.time_since_epoch().count()));
}
auto it = std::lower_bound(prev_streams.begin(), prev_streams.end(), child.token(), [](const cdc::stream_id& id, const dht::token& t) {
return id.token() < t;
@@ -787,18 +787,16 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&
struct event_id {
cdc::stream_id stream;
utils::UUID timestamp;
size_t index = 0;
static constexpr auto marker = 'E';
event_id(cdc::stream_id s, utils::UUID ts, size_t index)
event_id(cdc::stream_id s, utils::UUID ts)
: stream(s)
, timestamp(ts)
, index(index)
{}
friend std::ostream& operator<<(std::ostream& os, const event_id& id) {
fmt::print(os, "{}{}:{}:{}", marker, id.stream.to_bytes(), id.timestamp, id.index);
fmt::print(os, "{}{}:{}", marker, id.stream.to_bytes(), id.timestamp);
return os;
}
};
@@ -810,19 +808,7 @@ struct rapidjson::internal::TypeHelper<ValueType, alternator::event_id>
{};
namespace alternator {
namespace {
struct managed_bytes_ptr_hash {
size_t operator()(const managed_bytes *k) const noexcept {
return std::hash<managed_bytes>{}(*k);
}
};
struct managed_bytes_ptr_equal {
bool operator()(const managed_bytes *a, const managed_bytes *b) const noexcept {
return *a == *b;
}
};
}
future<executor::request_return_type> executor::get_records(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request) {
_stats.api_operations.get_records++;
auto start_time = std::chrono::steady_clock::now();
@@ -893,12 +879,6 @@ future<executor::request_return_type> executor::get_records(client_state& client
auto pks = schema->partition_key_columns();
auto cks = schema->clustering_key_columns();
auto base_cks = base->clustering_key_columns();
if (base_cks.size() > 1) {
throw api_error::internal(fmt::format("invalid alternator table, clustering key count ({}) is bigger than one", base_cks.size()));
}
const bytes *clustering_key_column_name = !base_cks.empty() ? &base_cks.front().name() : nullptr;
std::transform(pks.begin(), pks.end(), std::back_inserter(columns), [](auto& c) { return &c; });
std::transform(cks.begin(), cks.end(), std::back_inserter(columns), [](auto& c) { return &c; });
@@ -953,40 +933,42 @@ future<executor::request_return_type> executor::get_records(client_state& client
return cdef->name->name() == eor_column_name;
})
);
auto clustering_key_index = clustering_key_column_name ? std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [&](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == *clustering_key_column_name;
})
) : 0;
std::optional<utils::UUID> timestamp;
struct Record {
rjson::value record;
rjson::value dynamodb;
};
const managed_bytes empty_managed_bytes;
std::unordered_map<const managed_bytes*, Record, managed_bytes_ptr_hash, managed_bytes_ptr_equal> records_map;
auto dynamodb = rjson::empty_object();
auto record = rjson::empty_object();
const auto dc_name = _proxy.get_token_metadata_ptr()->get_topology().get_datacenter();
using op_utype = std::underlying_type_t<cdc::operation>;
auto maybe_add_record = [&] {
if (!dynamodb.ObjectEmpty()) {
rjson::add(record, "dynamodb", std::move(dynamodb));
dynamodb = rjson::empty_object();
}
if (!record.ObjectEmpty()) {
rjson::add(record, "awsRegion", rjson::from_string(dc_name));
rjson::add(record, "eventID", event_id(iter.shard.id, *timestamp));
rjson::add(record, "eventSource", "scylladb:alternator");
rjson::add(record, "eventVersion", "1.1");
rjson::push_back(records, std::move(record));
record = rjson::empty_object();
--limit;
}
};
for (auto& row : result_set->rows()) {
auto op = static_cast<cdc::operation>(value_cast<op_utype>(data_type_for<op_utype>()->deserialize(*row[op_index])));
auto ts = value_cast<utils::UUID>(data_type_for<utils::UUID>()->deserialize(*row[ts_index]));
auto eor = row[eor_index].has_value() ? value_cast<bool>(boolean_type->deserialize(*row[eor_index])) : false;
const managed_bytes* cs_ptr = clustering_key_column_name ? &*row[clustering_key_index] : &empty_managed_bytes;
auto records_it = records_map.emplace(cs_ptr, Record{});
auto &record = records_it.first->second;
if (records_it.second) {
record.dynamodb = rjson::empty_object();
record.record = rjson::empty_object();
if (!dynamodb.HasMember("Keys")) {
auto keys = rjson::empty_object();
describe_single_item(*selection, row, key_names, keys);
rjson::add(record.dynamodb, "Keys", std::move(keys));
rjson::add(record.dynamodb, "ApproximateCreationDateTime", utils::UUID_gen::unix_timestamp_in_sec(ts).count());
rjson::add(record.dynamodb, "SequenceNumber", sequence_number(ts));
rjson::add(record.dynamodb, "StreamViewType", type);
rjson::add(dynamodb, "Keys", std::move(keys));
rjson::add(dynamodb, "ApproximateCreationDateTime", utils::UUID_gen::unix_timestamp_in_sec(ts).count());
rjson::add(dynamodb, "SequenceNumber", sequence_number(ts));
rjson::add(dynamodb, "StreamViewType", type);
// TODO: SizeBytes
}
@@ -1010,10 +992,6 @@ future<executor::request_return_type> executor::get_records(client_state& client
* flags on CDC log, instead we use data to
* drive what is returned. This is (afaict)
* consistent with dynamo streams
*
* Note: BatchWriteItem will generate multiple records with
* the same timestamp, when write isolation is set to always
* (which triggers lwt), so we need to unpack them based on clustering key.
*/
switch (op) {
case cdc::operation::pre_image:
@@ -1022,14 +1000,14 @@ future<executor::request_return_type> executor::get_records(client_state& client
auto item = rjson::empty_object();
describe_single_item(*selection, row, attr_names, item, nullptr, true);
describe_single_item(*selection, row, key_names, item);
rjson::add(record.dynamodb, op == cdc::operation::pre_image ? "OldImage" : "NewImage", std::move(item));
rjson::add(dynamodb, op == cdc::operation::pre_image ? "OldImage" : "NewImage", std::move(item));
break;
}
case cdc::operation::update:
rjson::add(record.record, "eventName", "MODIFY");
rjson::add(record, "eventName", "MODIFY");
break;
case cdc::operation::insert:
rjson::add(record.record, "eventName", "INSERT");
rjson::add(record, "eventName", "INSERT");
break;
case cdc::operation::service_row_delete:
case cdc::operation::service_partition_delete:
@@ -1037,41 +1015,28 @@ future<executor::request_return_type> executor::get_records(client_state& client
auto user_identity = rjson::empty_object();
rjson::add(user_identity, "Type", "Service");
rjson::add(user_identity, "PrincipalId", "dynamodb.amazonaws.com");
rjson::add(record.record, "userIdentity", std::move(user_identity));
rjson::add(record.record, "eventName", "REMOVE");
rjson::add(record, "userIdentity", std::move(user_identity));
rjson::add(record, "eventName", "REMOVE");
break;
}
default:
rjson::add(record.record, "eventName", "REMOVE");
rjson::add(record, "eventName", "REMOVE");
break;
}
if (eor) {
size_t index = 0;
for (auto& [_, rec] : records_map) {
rjson::add(rec.record, "awsRegion", rjson::from_string(dc_name));
rjson::add(rec.record, "eventID", event_id(iter.shard.id, *timestamp, index++));
rjson::add(rec.record, "eventSource", "scylladb:alternator");
rjson::add(rec.record, "eventVersion", "1.1");
rjson::add(rec.record, "dynamodb", std::move(rec.dynamodb));
rjson::push_back(records, std::move(rec.record));
}
records_map.clear();
maybe_add_record();
timestamp = ts;
if (records.Size() >= limit) {
// Note: we might have more than limit rows here - BatchWriteItem will emit multiple items
// with the same timestamp and we have no way of resume iteration midway through those,
// so we return all of them here.
if (limit == 0) {
break;
}
}
}
auto ret = rjson::empty_object();
auto nrecords = records.Size();
rjson::add(ret, "Records", std::move(records));
if (timestamp) {
if (nrecords != 0) {
// #9642. Set next iterators threshold to > last
shard_iterator next_iter(iter.table, iter.shard, *timestamp, false);
// Note that here we unconditionally return NextShardIterator,
@@ -1122,7 +1087,6 @@ bool executor::add_stream_options(const rjson::value& stream_specification, sche
cdc::options opts;
opts.enabled(true);
// cdc::delta_mode is ignored by Alternator, so aim for the least overhead.
opts.set_delta_mode(cdc::delta_mode::keys);
opts.ttl(std::chrono::duration_cast<std::chrono::seconds>(dynamodb_streams_max_window).count());

View File

@@ -743,7 +743,7 @@
"parameters":[
{
"name":"tag",
"description":"The snapshot tag to delete. If omitted, all snapshots are removed.",
"description":"the tag given to the snapshot",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -751,7 +751,7 @@
},
{
"name":"kn",
"description":"Comma-separated list of keyspace names to delete snapshots from. If omitted, snapshots are deleted from all keyspaces.",
"description":"Comma-separated keyspaces name that their snapshot will be deleted",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -759,7 +759,7 @@
},
{
"name":"cf",
"description":"A table name used to filter which table's snapshots are deleted. If omitted or empty, snapshots for all tables are eligible. When provided together with 'kn', the table is looked up in each listed keyspace independently. For secondary indexes, the logical index name (e.g. 'myindex') can be used and is resolved automatically.",
"description":"an optional table name that its snapshot will be deleted",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -3166,83 +3166,6 @@
]
},
{
"path":"/storage_service/vnode_tablet_migrations/keyspaces/{keyspace}",
"operations":[{
"method":"POST",
"summary":"Start vnodes-to-tablets migration for all tables in a keyspace",
"type":"void",
"nickname":"create_vnode_tablet_migration",
"produces":["application/json"],
"parameters":[
{
"name":"keyspace",
"description":"Keyspace name",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
},
{
"method":"GET",
"summary":"Get a keyspace's vnodes-to-tablets migration status",
"type":"vnode_tablet_migration_status",
"nickname":"get_vnode_tablet_migration",
"produces":["application/json"],
"parameters":[
{
"name":"keyspace",
"description":"Keyspace name",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}]
},
{
"path":"/storage_service/vnode_tablet_migrations/node/storage_mode",
"operations":[{
"method":"PUT",
"summary":"Set the intended storage mode for this node during vnodes-to-tablets migration",
"type":"void",
"nickname":"set_vnode_tablet_migration_node_storage_mode",
"produces":["application/json"],
"parameters":[
{
"name":"intended_mode",
"description":"Intended storage mode (tablets or vnodes)",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}]
},
{
"path":"/storage_service/vnode_tablet_migrations/keyspaces/{keyspace}/finalization",
"operations":[{
"method":"POST",
"summary":"Finalize vnodes-to-tablets migration for all tables in a keyspace",
"type":"void",
"nickname":"finalize_vnode_tablet_migration",
"produces":["application/json"],
"parameters":[
{
"name":"keyspace",
"description":"Keyspace name",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}]
},
{
"path":"/storage_service/quiesce_topology",
"operations":[
@@ -3860,45 +3783,6 @@
"description":"The resulting compression ratio (estimated on a random sample of files)"
}
}
},
"vnode_tablet_migration_node_status":{
"id":"vnode_tablet_migration_node_status",
"description":"Node storage mode info during vnodes-to-tablets migration",
"properties":{
"host_id":{
"type":"string",
"description":"The host ID"
},
"current_mode":{
"type":"string",
"description":"The current storage mode: `vnodes` or `tablets`"
},
"intended_mode":{
"type":"string",
"description":"The intended storage mode: `vnodes` or `tablets`"
}
}
},
"vnode_tablet_migration_status":{
"id":"vnode_tablet_migration_status",
"description":"Vnodes-to-tablets migration status for a keyspace",
"properties":{
"keyspace":{
"type":"string",
"description":"The keyspace name"
},
"status":{
"type":"string",
"description":"The migration status: `vnodes` (not started), `migrating_to_tablets` (in progress), or `tablets` (complete)"
},
"nodes":{
"type":"array",
"items":{
"$ref":"vnode_tablet_migration_node_status"
},
"description":"Per-node storage mode information. Empty if the keyspace is not being migrated."
}
}
}
}
}

View File

@@ -23,7 +23,7 @@ void set_error_injection(http_context& ctx, routes& r) {
hf::enable_injection.set(r, [](std::unique_ptr<request> req) -> future<json::json_return_type> {
sstring injection = req->get_path_param("injection");
bool one_shot = strcasecmp(req->get_query_param("one_shot").c_str(), "true") == 0;
bool one_shot = req->get_query_param("one_shot") == "True";
auto params = co_await util::read_entire_stream_contiguous(*req->content_stream);
const size_t max_params_size = 1024 * 1024;

View File

@@ -30,7 +30,6 @@
#include <fmt/ranges.h>
#include "service/raft/raft_group0_client.hh"
#include "service/storage_service.hh"
#include "service/topology_state_machine.hh"
#include "service/load_meter.hh"
#include "gms/feature_service.hh"
#include "gms/gossiper.hh"
@@ -573,6 +572,14 @@ void unset_view_builder(http_context& ctx, routes& r) {
cf::get_built_indexes.unset(r);
}
static future<json::json_return_type> describe_ring_as_json(sharded<service::storage_service>& ss, sstring keyspace) {
co_return json::json_return_type(stream_range_as_array(co_await ss.local().describe_ring(keyspace), token_range_endpoints_to_json));
}
static future<json::json_return_type> describe_ring_as_json_for_table(const sharded<service::storage_service>& ss, sstring keyspace, sstring table) {
co_return json::json_return_type(stream_range_as_array(co_await ss.local().describe_ring_for_table(keyspace, table), token_range_endpoints_to_json));
}
namespace {
template <typename Key, typename Value>
storage_service_json::mapper map_to_json(const std::pair<Key, Value>& i) {
@@ -670,16 +677,13 @@ rest_describe_ring(http_context& ctx, sharded<service::storage_service>& ss, std
if (!req->param.exists("keyspace")) {
throw bad_param_exception("The keyspace param is not provided");
}
auto keyspace = validate_keyspace(ctx, req);
auto keyspace = req->get_path_param("keyspace");
auto table = req->get_query_param("table");
utils::chunked_vector<dht::token_range_endpoints> ranges;
if (!table.empty()) {
auto table_id = validate_table(ctx.db.local(), keyspace, table);
ranges = co_await ss.local().describe_ring_for_table(table_id);
} else {
ranges = co_await ss.local().describe_ring(keyspace);
validate_table(ctx.db.local(), keyspace, table);
return describe_ring_as_json_for_table(ss, keyspace, table);
}
co_return json::json_return_type(stream_range_as_array(std::move(ranges), token_range_endpoints_to_json));
return describe_ring_as_json(ss, validate_keyspace(ctx, req));
}
static
@@ -1723,69 +1727,6 @@ rest_tablet_balancing_enable(sharded<service::storage_service>& ss, std::unique_
co_return json_void();
}
static
future<json::json_return_type>
rest_create_vnode_tablet_migration(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
if (!ss.local().get_feature_service().vnodes_to_tablets_migrations) {
apilog.warn("create_vnode_tablet_migration: called before the cluster feature was enabled");
throw std::runtime_error("vnodes-to-tablets migration requires all nodes to support the VNODES_TO_TABLETS_MIGRATIONS cluster feature");
}
auto keyspace = validate_keyspace(ctx, req);
co_await ss.local().prepare_for_tablets_migration(keyspace);
co_return json_void();
}
static
future<json::json_return_type>
rest_get_vnode_tablet_migration(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
if (!ss.local().get_feature_service().vnodes_to_tablets_migrations) {
apilog.warn("get_vnode_tablet_migration: called before the cluster feature was enabled");
throw std::runtime_error("vnodes-to-tablets migration requires all nodes to support the VNODES_TO_TABLETS_MIGRATIONS cluster feature");
}
auto keyspace = validate_keyspace(ctx, req);
auto status = co_await ss.local().get_tablets_migration_status(keyspace);
ss::vnode_tablet_migration_status result;
result.keyspace = status.keyspace;
result.status = status.status;
result.nodes._set = true;
for (const auto& node : status.nodes) {
ss::vnode_tablet_migration_node_status n;
n.host_id = fmt::to_string(node.host_id);
n.current_mode = node.current_mode;
n.intended_mode = node.intended_mode;
result.nodes.push(n);
}
co_return result;
}
static
future<json::json_return_type>
rest_set_vnode_tablet_migration_node_storage_mode(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
if (!ss.local().get_feature_service().vnodes_to_tablets_migrations) {
apilog.warn("set_vnode_tablet_migration_node_storage_mode: called before the cluster feature was enabled");
throw std::runtime_error("vnodes-to-tablets migration requires all nodes to support the VNODES_TO_TABLETS_MIGRATIONS cluster feature");
}
auto mode_str = req->get_query_param("intended_mode");
auto mode = service::intended_storage_mode_from_string(mode_str);
co_await ss.local().set_node_intended_storage_mode(mode);
co_return json_void();
}
static
future<json::json_return_type>
rest_finalize_vnode_tablet_migration(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
if (!ss.local().get_feature_service().vnodes_to_tablets_migrations) {
apilog.warn("finalize_vnode_tablet_migration: called before the cluster feature was enabled");
throw std::runtime_error("vnodes-to-tablets migration requires all nodes to support the VNODES_TO_TABLETS_MIGRATIONS cluster feature");
}
auto keyspace = validate_keyspace(ctx, req);
validate_keyspace(ctx, keyspace);
co_await ss.local().finalize_tablets_migration(keyspace);
co_return json_void();
}
static
future<json::json_return_type>
rest_quiesce_topology(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
@@ -1936,10 +1877,6 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::del_tablet_replica.set(r, rest_bind(rest_del_tablet_replica, ctx, ss));
ss::repair_tablet.set(r, rest_bind(rest_repair_tablet, ctx, ss));
ss::tablet_balancing_enable.set(r, rest_bind(rest_tablet_balancing_enable, ss));
ss::create_vnode_tablet_migration.set(r, rest_bind(rest_create_vnode_tablet_migration, ctx, ss));
ss::get_vnode_tablet_migration.set(r, rest_bind(rest_get_vnode_tablet_migration, ctx, ss));
ss::set_vnode_tablet_migration_node_storage_mode.set(r, rest_bind(rest_set_vnode_tablet_migration_node_storage_mode, ctx, ss));
ss::finalize_vnode_tablet_migration.set(r, rest_bind(rest_finalize_vnode_tablet_migration, ctx, ss));
ss::quiesce_topology.set(r, rest_bind(rest_quiesce_topology, ss));
sp::get_schema_versions.set(r, rest_bind(rest_get_schema_versions, ss));
ss::drop_quarantined_sstables.set(r, rest_bind(rest_drop_quarantined_sstables, ctx, ss));
@@ -2019,10 +1956,6 @@ void unset_storage_service(http_context& ctx, routes& r) {
ss::del_tablet_replica.unset(r);
ss::repair_tablet.unset(r);
ss::tablet_balancing_enable.unset(r);
ss::create_vnode_tablet_migration.unset(r);
ss::get_vnode_tablet_migration.unset(r);
ss::set_vnode_tablet_migration_node_storage_mode.unset(r);
ss::finalize_vnode_tablet_migration.unset(r);
ss::quiesce_topology.unset(r);
sp::get_schema_versions.unset(r);
ss::drop_quarantined_sstables.unset(r);
@@ -2113,8 +2046,6 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
co_await snap_ctl.local().take_column_family_snapshot(keynames[0], column_families, tag, opts);
}
co_return json_void();
} catch (const data_dictionary::no_such_column_family& e) {
throw httpd::bad_param_exception(e.what());
} catch (...) {
apilog.error("take_snapshot failed: {}", std::current_exception());
throw;
@@ -2151,8 +2082,6 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
try {
co_await snap_ctl.local().clear_snapshot(tag, keynames, column_family);
co_return json_void();
} catch (const data_dictionary::no_such_column_family& e) {
throw httpd::bad_param_exception(e.what());
} catch (...) {
apilog.error("del_snapshot failed: {}", std::current_exception());
throw;

View File

@@ -14,7 +14,6 @@
#include <fmt/ranges.h>
#include "utils/to_string.hh"
#include "utils/error_injection.hh"
#include "data_dictionary/data_dictionary.hh"
#include "cql3/query_processor.hh"
#include "db/config.hh"
@@ -106,9 +105,6 @@ auth::authentication_option_set auth::certificate_authenticator::alterable_optio
}
future<std::optional<auth::authenticated_user>> auth::certificate_authenticator::authenticate(session_dn_func f) const {
if (auto user = utils::get_local_injector().inject_parameter("transport_early_auth_bypass")) {
co_return auth::authenticated_user{sstring(*user)};
}
if (!f) {
co_return std::nullopt;
}

View File

@@ -32,7 +32,7 @@ namespace {
logger mylog{"ldap_role_manager"}; // `log` is taken by math.
struct url_desc_deleter {
void operator()(LDAPURLDesc *p) {
void operator()(LDAPURLDesc* p) {
ldap_free_urldesc(p);
}
};
@@ -40,7 +40,7 @@ struct url_desc_deleter {
using url_desc_ptr = std::unique_ptr<LDAPURLDesc, url_desc_deleter>;
url_desc_ptr parse_url(std::string_view url) {
LDAPURLDesc *desc = nullptr;
LDAPURLDesc* desc = nullptr;
if (ldap_url_parse(url.data(), &desc)) {
mylog.error("error in ldap_url_parse({})", url);
}
@@ -53,8 +53,12 @@ std::vector<sstring> get_attr_values(LDAP* ld, LDAPMessage* res, const char* att
mylog.debug("Analyzing search results");
for (auto e = ldap_first_entry(ld, res); e; e = ldap_next_entry(ld, e)) {
struct deleter {
void operator()(berval** p) { ldap_value_free_len(p); }
void operator()(char* p) { ldap_memfree(p); }
void operator()(berval** p) {
ldap_value_free_len(p);
}
void operator()(char* p) {
ldap_memfree(p);
}
};
const std::unique_ptr<char, deleter> dname(ldap_get_dn(ld, e));
mylog.debug("Analyzing entry {}", dname.get());
@@ -75,32 +79,29 @@ std::vector<sstring> get_attr_values(LDAP* ld, LDAPMessage* res, const char* att
namespace auth {
ldap_role_manager::ldap_role_manager(
std::string_view query_template, std::string_view target_attr, std::string_view bind_name, std::string_view bind_password,
uint32_t permissions_update_interval_in_ms,
utils::observer<uint32_t> permissions_update_interval_in_ms_observer,
cql3::query_processor& qp, ::service::raft_group0_client& rg0c, ::service::migration_manager& mm, cache& cache)
: _std_mgr(qp, rg0c, mm, cache), _group0_client(rg0c), _query_template(query_template), _target_attr(target_attr), _bind_name(bind_name)
, _bind_password(bind_password)
, _permissions_update_interval_in_ms(permissions_update_interval_in_ms)
, _permissions_update_interval_in_ms_observer(std::move(permissions_update_interval_in_ms_observer))
, _connection_factory(bind(std::mem_fn(&ldap_role_manager::reconnect), std::ref(*this)))
, _cache(cache)
, _cache_pruner(make_ready_future<>()) {
ldap_role_manager::ldap_role_manager(std::string_view query_template, std::string_view target_attr, std::string_view bind_name, std::string_view bind_password,
uint32_t permissions_update_interval_in_ms, utils::observer<uint32_t> permissions_update_interval_in_ms_observer, cql3::query_processor& qp,
::service::raft_group0_client& rg0c, ::service::migration_manager& mm, cache& cache)
: _std_mgr(qp, rg0c, mm, cache)
, _group0_client(rg0c)
, _query_template(query_template)
, _target_attr(target_attr)
, _bind_name(bind_name)
, _bind_password(bind_password)
, _permissions_update_interval_in_ms(permissions_update_interval_in_ms)
, _permissions_update_interval_in_ms_observer(std::move(permissions_update_interval_in_ms_observer))
, _connection_factory(bind(std::mem_fn(&ldap_role_manager::reconnect), std::ref(*this)))
, _cache(cache)
, _cache_pruner(make_ready_future<>()) {
}
ldap_role_manager::ldap_role_manager(cql3::query_processor& qp, ::service::raft_group0_client& rg0c, ::service::migration_manager& mm, cache& cache)
: ldap_role_manager(
qp.db().get_config().ldap_url_template(),
qp.db().get_config().ldap_attr_role(),
qp.db().get_config().ldap_bind_dn(),
qp.db().get_config().ldap_bind_passwd(),
qp.db().get_config().permissions_update_interval_in_ms(),
qp.db().get_config().permissions_update_interval_in_ms.observe([this] (const uint32_t& v) { _permissions_update_interval_in_ms = v; }),
qp,
rg0c,
mm,
cache) {
: ldap_role_manager(qp.db().get_config().ldap_url_template(), qp.db().get_config().ldap_attr_role(), qp.db().get_config().ldap_bind_dn(),
qp.db().get_config().ldap_bind_passwd(), qp.db().get_config().permissions_update_interval_in_ms(),
qp.db().get_config().permissions_update_interval_in_ms.observe([this](const uint32_t& v) {
_permissions_update_interval_in_ms = v;
}),
qp, rg0c, mm, cache) {
}
std::string_view ldap_role_manager::qualified_java_name() const noexcept {
@@ -113,17 +114,16 @@ const resource_set& ldap_role_manager::protected_resources() const {
future<> ldap_role_manager::start() {
if (!parse_url(get_url("dummy-user"))) { // Just need host and port -- any user should do.
return make_exception_future(
std::runtime_error(fmt::format("error getting LDAP server address from template {}", _query_template)));
return make_exception_future(std::runtime_error(fmt::format("error getting LDAP server address from template {}", _query_template)));
}
_cache_pruner = futurize_invoke([this] () -> future<> {
_cache_pruner = futurize_invoke([this]() -> future<> {
while (true) {
try {
co_await seastar::sleep_abortable(std::chrono::milliseconds(_permissions_update_interval_in_ms), _as);
} catch (const seastar::sleep_aborted&) {
co_return; // ignore
}
co_await _cache.container().invoke_on_all([] (cache& c) -> future<> {
co_await _cache.container().invoke_on_all([](cache& c) -> future<> {
try {
co_await c.reload_all_permissions();
} catch (...) {
@@ -165,7 +165,7 @@ future<conn_ptr> ldap_role_manager::connect() {
future<conn_ptr> ldap_role_manager::reconnect() {
unsigned retries_left = 5;
using namespace std::literals::chrono_literals;
conn_ptr conn = co_await exponential_backoff_retry::do_until_value(1s, 32s, _as, [this, &retries_left] () -> future<std::optional<conn_ptr>> {
conn_ptr conn = co_await exponential_backoff_retry::do_until_value(1s, 32s, _as, [this, &retries_left]() -> future<std::optional<conn_ptr>> {
if (!retries_left) {
co_return conn_ptr{};
}
@@ -188,11 +188,13 @@ future<conn_ptr> ldap_role_manager::reconnect() {
future<> ldap_role_manager::stop() {
_as.request_abort();
return std::move(_cache_pruner).then([this] {
return _std_mgr.stop();
}).then([this] {
return _connection_factory.stop();
});
return std::move(_cache_pruner)
.then([this] {
return _std_mgr.stop();
})
.then([this] {
return _connection_factory.stop();
});
}
future<> ldap_role_manager::create(std::string_view name, const role_config& config, ::service::group0_batch& mc) {
@@ -221,43 +223,42 @@ future<role_set> ldap_role_manager::query_granted(std::string_view grantee_name,
if (!desc) {
return make_exception_future<role_set>(std::runtime_error(format("Error parsing URL {}", url)));
}
return _connection_factory.with_connection([this, desc = std::move(desc), grantee_name_ = sstring(grantee_name)]
(ldap_connection& conn) -> future<role_set> {
sstring grantee_name = std::move(grantee_name_);
ldap_msg_ptr res = co_await conn.search(desc->lud_dn, desc->lud_scope, desc->lud_filter, desc->lud_attrs,
/*attrsonly=*/0, /*serverctrls=*/nullptr, /*clientctrls=*/nullptr,
/*timeout=*/nullptr, /*sizelimit=*/0);
mylog.trace("query_granted: got search results");
const auto mtype = ldap_msgtype(res.get());
if (mtype != LDAP_RES_SEARCH_ENTRY && mtype != LDAP_RES_SEARCH_RESULT && mtype != LDAP_RES_SEARCH_REFERENCE) {
mylog.error("ldap search yielded result {} of type {}", static_cast<const void*>(res.get()), mtype);
co_return coroutine::exception(std::make_exception_ptr(std::runtime_error("ldap_role_manager: search result has wrong type")));
}
std::vector<sstring> values = get_attr_values(conn.get_ldap(), res.get(), _target_attr.c_str());
auth::role_set valid_roles{grantee_name};
// Each value is a role to be granted.
co_await parallel_for_each(values, [this, &valid_roles] (const sstring& ldap_role) {
return _std_mgr.exists(ldap_role).then([&valid_roles, &ldap_role] (bool exists) {
if (exists) {
valid_roles.insert(ldap_role);
} else {
mylog.error("unrecognized role received from LDAP: {}", ldap_role);
return _connection_factory.with_connection(
[this, desc = std::move(desc), grantee_name_ = sstring(grantee_name)](ldap_connection& conn) -> future<role_set> {
sstring grantee_name = std::move(grantee_name_);
ldap_msg_ptr res = co_await conn.search(desc->lud_dn, desc->lud_scope, desc->lud_filter, desc->lud_attrs,
/*attrsonly=*/0, /*serverctrls=*/nullptr, /*clientctrls=*/nullptr,
/*timeout=*/nullptr, /*sizelimit=*/0);
mylog.trace("query_granted: got search results");
const auto mtype = ldap_msgtype(res.get());
if (mtype != LDAP_RES_SEARCH_ENTRY && mtype != LDAP_RES_SEARCH_RESULT && mtype != LDAP_RES_SEARCH_REFERENCE) {
mylog.error("ldap search yielded result {} of type {}", static_cast<const void*>(res.get()), mtype);
co_return coroutine::exception(std::make_exception_ptr(std::runtime_error("ldap_role_manager: search result has wrong type")));
}
});
});
std::vector<sstring> values = get_attr_values(conn.get_ldap(), res.get(), _target_attr.c_str());
auth::role_set valid_roles{grantee_name};
co_return std::move(valid_roles);
});
// Each value is a role to be granted.
co_await parallel_for_each(values, [this, &valid_roles](const sstring& ldap_role) {
return _std_mgr.exists(ldap_role).then([&valid_roles, &ldap_role](bool exists) {
if (exists) {
valid_roles.insert(ldap_role);
} else {
mylog.error("unrecognized role received from LDAP: {}", ldap_role);
}
});
});
co_return std::move(valid_roles);
});
}
future<role_to_directly_granted_map>
ldap_role_manager::query_all_directly_granted(::service::query_state& qs) {
future<role_to_directly_granted_map> ldap_role_manager::query_all_directly_granted(::service::query_state& qs) {
role_to_directly_granted_map result;
auto roles = co_await query_all(qs);
for (auto& role: roles) {
for (auto& role : roles) {
auto granted_set = co_await query_granted(role, recursive_role_query::no);
for (auto& granted: granted_set) {
for (auto& granted : granted_set) {
if (granted != role) {
result.insert({role, granted});
}
@@ -271,7 +272,7 @@ future<role_set> ldap_role_manager::query_all(::service::query_state& qs) {
}
future<> ldap_role_manager::create_role(std::string_view role_name) {
return smp::submit_to(0, [this, role_name] () -> future<> {
return smp::submit_to(0, [this, role_name]() -> future<> {
int retries = 10;
while (true) {
auto guard = co_await _group0_client.start_operation(_as, ::service::raft_timeout{});
@@ -283,8 +284,8 @@ future<> ldap_role_manager::create_role(std::string_view role_name) {
} catch (const role_already_exists&) {
// ok
} catch (const ::service::group0_concurrent_modification& ex) {
mylog.warn("Failed to auto-create role \"{}\" due to guard conflict.{}.",
role_name, retries ? " Retrying" : " Number of retries exceeded, giving up");
mylog.warn("Failed to auto-create role \"{}\" due to guard conflict.{}.", role_name,
retries ? " Retrying" : " Number of retries exceeded, giving up");
if (retries--) {
continue;
}
@@ -329,8 +330,7 @@ future<bool> ldap_role_manager::can_login(std::string_view role_name) {
return _std_mgr.can_login(role_name);
}
future<std::optional<sstring>> ldap_role_manager::get_attribute(
std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {
future<std::optional<sstring>> ldap_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {
return _std_mgr.get_attribute(role_name, attribute_name, qs);
}

View File

@@ -76,14 +76,14 @@ struct partition_deletion {
using clustered_column_set = std::map<clustering_key, cdc::one_kind_column_set, clustering_key::less_compare>;
template<typename Container>
template <typename Container>
concept EntryContainer = requires(Container& container) {
// Parenthesized due to https://bugs.llvm.org/show_bug.cgi?id=45088
{ (container.atomic_entries) } -> std::same_as<std::vector<atomic_column_update>&>;
{ (container.nonatomic_entries) } -> std::same_as<std::vector<nonatomic_column_update>&>;
};
template<EntryContainer Container>
template <EntryContainer Container>
static void add_columns_affected_by_entries(cdc::one_kind_column_set& cset, const Container& cont) {
for (const auto& entry : cont.atomic_entries) {
cset.set(entry.id);
@@ -134,7 +134,7 @@ struct batch {
ret.emplace(clustering_key::make_empty(), all_columns);
}
auto process_change_type = [&] (const auto& changes) {
auto process_change_type = [&](const auto& changes) {
for (const auto& change : changes) {
auto& cset = ret[change.key];
cset.resize(s.regular_columns_count());
@@ -211,7 +211,9 @@ private:
public:
extract_collection_visitor(column_id id, std::map<change_key_t, row_update>& updates)
: _id(id), _updates(updates) {}
: _id(id)
, _updates(updates) {
}
void collection_tombstone(const tombstone& t) {
auto& entry = get_or_append_entry(t.timestamp + 1, gc_clock::duration(0));
@@ -226,7 +228,9 @@ public:
cell(key, c);
}
constexpr bool finished() const { return false; }
constexpr bool finished() const {
return false;
}
};
/* Visits all cells and tombstones in a row, putting the encountered changes into buckets
@@ -249,41 +253,46 @@ struct extract_row_visitor {
void collection_column(const column_definition& cdef, auto&& visit_collection) {
visit(*cdef.type, make_visitor(
[&] (const collection_type_impl& ctype) {
struct collection_visitor : public extract_collection_visitor<collection_visitor> {
data_type _value_type;
[&](const collection_type_impl& ctype) {
struct collection_visitor : public extract_collection_visitor<collection_visitor> {
data_type _value_type;
collection_visitor(column_id id, std::map<change_key_t, row_update>& updates, const collection_type_impl& ctype)
: extract_collection_visitor<collection_visitor>(id, updates), _value_type(ctype.value_comparator()) {}
collection_visitor(column_id id, std::map<change_key_t, row_update>& updates, const collection_type_impl& ctype)
: extract_collection_visitor<collection_visitor>(id, updates)
, _value_type(ctype.value_comparator()) {
}
data_type get_value_type(bytes_view) {
return _value_type;
}
} v(cdef.id, _updates, ctype);
data_type get_value_type(bytes_view) {
return _value_type;
}
} v(cdef.id, _updates, ctype);
visit_collection(v);
},
[&] (const user_type_impl& utype) {
struct udt_visitor : public extract_collection_visitor<udt_visitor> {
const user_type_impl& _utype;
visit_collection(v);
},
[&](const user_type_impl& utype) {
struct udt_visitor : public extract_collection_visitor<udt_visitor> {
const user_type_impl& _utype;
udt_visitor(column_id id, std::map<change_key_t, row_update>& updates, const user_type_impl& utype)
: extract_collection_visitor<udt_visitor>(id, updates), _utype(utype) {}
udt_visitor(column_id id, std::map<change_key_t, row_update>& updates, const user_type_impl& utype)
: extract_collection_visitor<udt_visitor>(id, updates)
, _utype(utype) {
}
data_type get_value_type(bytes_view key) {
return _utype.type(deserialize_field_index(key));
}
} v(cdef.id, _updates, utype);
data_type get_value_type(bytes_view key) {
return _utype.type(deserialize_field_index(key));
}
} v(cdef.id, _updates, utype);
visit_collection(v);
},
[&] (const abstract_type& o) {
throw std::runtime_error(format("extract_changes: unknown collection type:", o.name()));
}
));
visit_collection(v);
},
[&](const abstract_type& o) {
throw std::runtime_error(format("extract_changes: unknown collection type:", o.name()));
}));
}
constexpr bool finished() const { return false; }
constexpr bool finished() const {
return false;
}
};
struct extract_changes_visitor {
@@ -293,12 +302,8 @@ struct extract_changes_visitor {
extract_row_visitor v;
visit_row_cells(v);
for (auto& [ts_ttl, row_update]: v._updates) {
_result[ts_ttl.first].static_updates.push_back({
ts_ttl.second,
std::move(row_update.atomic_entries),
std::move(row_update.nonatomic_entries)
});
for (auto& [ts_ttl, row_update] : v._updates) {
_result[ts_ttl.first].static_updates.push_back({ts_ttl.second, std::move(row_update.atomic_entries), std::move(row_update.nonatomic_entries)});
}
}
@@ -319,24 +324,18 @@ struct extract_changes_visitor {
} v;
visit_row_cells(v);
for (auto& [ts_ttl, row_update]: v._updates) {
for (auto& [ts_ttl, row_update] : v._updates) {
// It is important that changes in the resulting `set_of_changes` are listed
// in increasing TTL order. The reason is explained in a comment in cdc/log.cc,
// search for "#6070".
auto [ts, ttl] = ts_ttl;
if (v._marker && ts == v._marker_ts && ttl == v._marker_ttl) {
_result[ts].clustered_inserts.push_back({
ttl,
ckey,
*v._marker,
std::move(row_update.atomic_entries),
{}
});
_result[ts].clustered_inserts.push_back({ttl, ckey, *v._marker, std::move(row_update.atomic_entries), {}});
auto& cr_insert = _result[ts].clustered_inserts.back();
bool clustered_update_exists = false;
for (auto& nonatomic_up: row_update.nonatomic_entries) {
for (auto& nonatomic_up : row_update.nonatomic_entries) {
// Updating a collection column with an INSERT statement implies inserting a tombstone.
//
// For example, suppose that we have:
@@ -362,12 +361,7 @@ struct extract_changes_visitor {
cr_insert.nonatomic_entries.push_back(std::move(nonatomic_up));
} else {
if (!clustered_update_exists) {
_result[ts].clustered_updates.push_back({
ttl,
ckey,
{},
{}
});
_result[ts].clustered_updates.push_back({ttl, ckey, {}, {}});
// Multiple iterations of this `for` loop (for different collection columns)
// might want to put their `nonatomic_up`s into an UPDATE change;
@@ -390,12 +384,7 @@ struct extract_changes_visitor {
}
}
} else {
_result[ts].clustered_updates.push_back({
ttl,
ckey,
std::move(row_update.atomic_entries),
std::move(row_update.nonatomic_entries)
});
_result[ts].clustered_updates.push_back({ttl, ckey, std::move(row_update.atomic_entries), std::move(row_update.nonatomic_entries)});
}
}
}
@@ -412,7 +401,9 @@ struct extract_changes_visitor {
_result[t.timestamp].partition_deletions = partition_deletion{t};
}
constexpr bool finished() const { return false; }
constexpr bool finished() const {
return false;
}
};
set_of_changes extract_changes(const mutation& m) {
@@ -426,13 +417,23 @@ namespace cdc {
struct find_timestamp_visitor {
api::timestamp_type _ts = api::missing_timestamp;
bool finished() const { return _ts != api::missing_timestamp; }
bool finished() const {
return _ts != api::missing_timestamp;
}
void visit(api::timestamp_type ts) { _ts = ts; }
void visit(const atomic_cell_view& cell) { visit(cell.timestamp()); }
void visit(api::timestamp_type ts) {
_ts = ts;
}
void visit(const atomic_cell_view& cell) {
visit(cell.timestamp());
}
void live_atomic_cell(const column_definition&, const atomic_cell_view& cell) { visit(cell); }
void dead_atomic_cell(const column_definition&, const atomic_cell_view& cell) { visit(cell); }
void live_atomic_cell(const column_definition&, const atomic_cell_view& cell) {
visit(cell);
}
void dead_atomic_cell(const column_definition&, const atomic_cell_view& cell) {
visit(cell);
}
void collection_tombstone(const tombstone& t) {
// A collection tombstone with timestamp T can be created with:
// UPDATE ks.t USING TIMESTAMP T + 1 SET X = null WHERE ...
@@ -441,15 +442,33 @@ struct find_timestamp_visitor {
// with cdc$time using timestamp T + 1 instead of T.
visit(t.timestamp + 1);
}
void live_collection_cell(bytes_view, const atomic_cell_view& cell) { visit(cell); }
void dead_collection_cell(bytes_view, const atomic_cell_view& cell) { visit(cell); }
void collection_column(const column_definition&, auto&& visit_collection) { visit_collection(*this); }
void marker(const row_marker& rm) { visit(rm.timestamp()); }
void static_row_cells(auto&& visit_row_cells) { visit_row_cells(*this); }
void clustered_row_cells(const clustering_key&, auto&& visit_row_cells) { visit_row_cells(*this); }
void clustered_row_delete(const clustering_key&, const tombstone& t) { visit(t.timestamp); }
void range_delete(const range_tombstone& t) { visit(t.tomb.timestamp); }
void partition_delete(const tombstone& t) { visit(t.timestamp); }
void live_collection_cell(bytes_view, const atomic_cell_view& cell) {
visit(cell);
}
void dead_collection_cell(bytes_view, const atomic_cell_view& cell) {
visit(cell);
}
void collection_column(const column_definition&, auto&& visit_collection) {
visit_collection(*this);
}
void marker(const row_marker& rm) {
visit(rm.timestamp());
}
void static_row_cells(auto&& visit_row_cells) {
visit_row_cells(*this);
}
void clustered_row_cells(const clustering_key&, auto&& visit_row_cells) {
visit_row_cells(*this);
}
void clustered_row_delete(const clustering_key&, const tombstone& t) {
visit(t.timestamp);
}
void range_delete(const range_tombstone& t) {
visit(t.tomb.timestamp);
}
void partition_delete(const tombstone& t) {
visit(t.timestamp);
}
};
/* Find some timestamp inside the given mutation.
@@ -505,8 +524,12 @@ struct should_split_visitor {
virtual ~should_split_visitor() = default;
inline bool finished() const { return _result; }
inline void stop() { _result = true; }
inline bool finished() const {
return _result;
}
inline void stop() {
_result = true;
}
void visit(api::timestamp_type ts, gc_clock::duration ttl = gc_clock::duration(0)) {
if (_ts != api::missing_timestamp && _ts != ts) {
@@ -517,15 +540,23 @@ struct should_split_visitor {
if (_ttl && *_ttl != ttl) {
return stop();
}
_ttl = { ttl };
_ttl = {ttl};
}
void visit(const atomic_cell_view& cell) { visit(cell.timestamp(), get_ttl(cell)); }
void visit(const atomic_cell_view& cell) {
visit(cell.timestamp(), get_ttl(cell));
}
void live_atomic_cell(const column_definition&, const atomic_cell_view& cell) { visit(cell); }
void dead_atomic_cell(const column_definition&, const atomic_cell_view& cell) { visit(cell); }
void live_atomic_cell(const column_definition&, const atomic_cell_view& cell) {
visit(cell);
}
void dead_atomic_cell(const column_definition&, const atomic_cell_view& cell) {
visit(cell);
}
void collection_tombstone(const tombstone& t) { visit(t.timestamp + 1); }
void collection_tombstone(const tombstone& t) {
visit(t.timestamp + 1);
}
virtual void live_collection_cell(bytes_view, const atomic_cell_view& cell) {
if (_had_row_marker) {
@@ -534,8 +565,12 @@ struct should_split_visitor {
}
visit(cell);
}
void dead_collection_cell(bytes_view, const atomic_cell_view& cell) { visit(cell); }
void collection_column(const column_definition&, auto&& visit_collection) { visit_collection(*this); }
void dead_collection_cell(bytes_view, const atomic_cell_view& cell) {
visit(cell);
}
void collection_column(const column_definition&, auto&& visit_collection) {
visit_collection(*this);
}
virtual void marker(const row_marker& rm) {
_had_row_marker = true;
@@ -606,8 +641,8 @@ bool should_split(const mutation& m, const per_request_options& options) {
cdc::inspect_mutation(m, v);
return v._result
// A mutation with no timestamp will be split into 0 mutations:
|| v._ts == api::missing_timestamp;
// A mutation with no timestamp will be split into 0 mutations:
|| v._ts == api::missing_timestamp;
}
// Returns true if the row state and the atomic and nonatomic entries represent
@@ -642,7 +677,7 @@ static bool entries_match_row_state(const schema_ptr& base_schema, const cell_ma
if (current_values.size() != update.cells.size()) {
return false;
}
std::unordered_map<sstring_view, bytes> current_values_map;
for (const auto& entry : current_values) {
const auto attr_name = std::string_view(value_cast<sstring>(entry.first));
@@ -711,8 +746,8 @@ bool should_skip(batch& changes, const mutation& base_mutation, change_processor
return true;
}
void process_changes_with_splitting(const mutation& base_mutation, change_processor& processor,
bool enable_preimage, bool enable_postimage, bool alternator_strict_compatibility) {
void process_changes_with_splitting(
const mutation& base_mutation, change_processor& processor, bool enable_preimage, bool enable_postimage, bool alternator_strict_compatibility) {
const auto base_schema = base_mutation.schema();
auto changes = extract_changes(base_mutation);
auto pk = base_mutation.key();
@@ -824,8 +859,8 @@ void process_changes_with_splitting(const mutation& base_mutation, change_proces
}
}
void process_changes_without_splitting(const mutation& base_mutation, change_processor& processor,
bool enable_preimage, bool enable_postimage, bool alternator_strict_compatibility) {
void process_changes_without_splitting(
const mutation& base_mutation, change_processor& processor, bool enable_preimage, bool enable_postimage, bool alternator_strict_compatibility) {
if (alternator_strict_compatibility) {
auto changes = extract_changes(base_mutation);
if (should_skip(changes.begin()->second, base_mutation, processor)) {
@@ -842,7 +877,7 @@ void process_changes_without_splitting(const mutation& base_mutation, change_pro
one_kind_column_set columns{base_schema->static_columns_count()};
if (!p.static_row().empty()) {
p.static_row().get().for_each_cell([&] (column_id id, const atomic_cell_or_collection& cell) {
p.static_row().get().for_each_cell([&](column_id id, const atomic_cell_or_collection& cell) {
columns.set(id);
});
processor.produce_preimage(nullptr, columns);
@@ -855,7 +890,7 @@ void process_changes_without_splitting(const mutation& base_mutation, change_pro
// Row deleted - include all columns in preimage
columns.set(0, base_schema->regular_columns_count(), true);
} else {
cr.row().cells().for_each_cell([&] (column_id id, const atomic_cell_or_collection& cell) {
cr.row().cells().for_each_cell([&](column_id id, const atomic_cell_or_collection& cell) {
columns.set(id);
});
}

View File

@@ -1,47 +0,0 @@
#
# Copyright 2025-present ScyllaDB
#
#
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
#
# Custom FindLua module that uses pkg-config, matching configure.py's
# approach. CMake's built-in FindLua resolves to the versioned library
# (e.g. liblua-5.4.so) instead of the unversioned symlink (liblua.so),
# causing a name mismatch between the two build systems.
find_package(PkgConfig REQUIRED)
# configure.py: lua53 on Debian-like, lua on others
pkg_search_module(PC_lua QUIET lua53 lua)
find_library(Lua_LIBRARY
NAMES lua lua5.3 lua53
HINTS
${PC_lua_LIBDIR}
${PC_lua_LIBRARY_DIRS})
find_path(Lua_INCLUDE_DIR
NAMES lua.h
HINTS
${PC_lua_INCLUDEDIR}
${PC_lua_INCLUDE_DIRS})
mark_as_advanced(
Lua_LIBRARY
Lua_INCLUDE_DIR)
include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(Lua
REQUIRED_VARS
Lua_LIBRARY
Lua_INCLUDE_DIR
VERSION_VAR PC_lua_VERSION)
if(Lua_FOUND)
set(LUA_LIBRARIES ${Lua_LIBRARY})
set(LUA_INCLUDE_DIR ${Lua_INCLUDE_DIR})
endif()

View File

@@ -1,5 +1,5 @@
set(CMAKE_CXX_FLAGS_COVERAGE
"-fprofile-instr-generate -fcoverage-mapping"
"-fprofile-instr-generate -fcoverage-mapping -fprofile-list=${CMAKE_SOURCE_DIR}/coverage_sources.list"
CACHE
INTERNAL
"")
@@ -8,33 +8,18 @@ update_build_flags(Coverage
OPTIMIZATION_LEVEL "g")
set(scylla_build_mode_Coverage "coverage")
# Coverage mode sets cmake_build_type='Debug' for Seastar
# (configure.py:515), so Seastar's pkg-config --cflags output
# (configure.py:2252-2267, queried at configure.py:3039) includes debug
# defines, sanitizer compile flags, and -fstack-clash-protection.
# Seastar's CMake generator expressions only activate these for
# Debug/Sanitize configs, so we add them explicitly for Coverage.
set(Seastar_DEFINITIONS_COVERAGE
SCYLLA_BUILD_MODE=${scylla_build_mode_Coverage}
SEASTAR_DEBUG
SEASTAR_DEFAULT_ALLOCATOR
SEASTAR_SHUFFLE_TASK_QUEUE
SEASTAR_DEBUG_SHARED_PTR
SEASTAR_DEBUG_PROMISE
SEASTAR_TYPE_ERASE_MORE)
DEBUG
SANITIZE
DEBUG_LSA_SANITIZER
SCYLLA_ENABLE_ERROR_INJECTION)
foreach(definition ${Seastar_DEFINITIONS_COVERAGE})
add_compile_definitions(
$<$<CONFIG:Coverage>:${definition}>)
endforeach()
add_compile_options(
$<$<CONFIG:Coverage>:-fsanitize=address>
$<$<CONFIG:Coverage>:-fsanitize=undefined>
$<$<CONFIG:Coverage>:-fsanitize=vptr>
$<$<CONFIG:Coverage>:-fstack-clash-protection>)
set(CMAKE_EXE_LINKER_FLAGS_COVERAGE
set(CMAKE_STATIC_LINKER_FLAGS_COVERAGE
"-fprofile-instr-generate -fcoverage-mapping")
maybe_limit_stack_usage_in_KB(40 Coverage)

View File

@@ -131,7 +131,6 @@ function(maybe_limit_stack_usage_in_KB stack_usage_threshold_in_KB config)
check_cxx_compiler_flag(${_stack_usage_threshold_flag} _stack_usage_flag_supported)
if(_stack_usage_flag_supported)
add_compile_options($<$<CONFIG:${config}>:${_stack_usage_threshold_flag}>)
add_compile_options($<$<CONFIG:${config}>:-Wno-error=stack-usage=>)
endif()
endfunction()
@@ -261,23 +260,6 @@ endif()
# Force SHA1 build-id generation
add_link_options("LINKER:--build-id=sha1")
# Match configure.py: add -fno-lto globally. configure.py adds -fno-lto to
# all binaries (except standalone cpp_apps like patchelf) via the per-binary
# $libs variable. LTO-enabled targets (scylla binary in RelWithDebInfo) will
# override with -flto=thin -ffat-lto-objects via enable_lto().
add_link_options(-fno-lto)
# Match configure.py:2633-2636 — sanitizer link flags for standalone binaries
# (e.g. patchelf) that don't link Seastar. Seastar-linked targets get these
# via seastar_libs (configure.py:2649).
# Coverage mode gets sanitizer link flags via the seastar target instead
# (see CMakeLists.txt), matching configure.py where only seastar_libs_coverage
# carries -fsanitize (not cxx_ld_flags).
add_link_options(
$<$<CONFIG:Debug,Sanitize>:-fsanitize=address>
$<$<CONFIG:Debug,Sanitize>:-fsanitize=undefined>)
include(CheckLinkerFlag)
set(Scylla_USE_LINKER
""

View File

@@ -44,7 +44,6 @@
#include "dht/partition_filter.hh"
#include "mutation_writer/shard_based_splitting_writer.hh"
#include "mutation_writer/partition_based_splitting_writer.hh"
#include "mutation_writer/token_group_based_splitting_writer.hh"
#include "mutation/mutation_source_metadata.hh"
#include "mutation/mutation_fragment_stream_validator.hh"
#include "utils/assert.hh"
@@ -1934,7 +1933,6 @@ class resharding_compaction final : public compaction {
};
std::vector<estimated_values> _estimation_per_shard;
std::vector<sstables::run_id> _run_identifiers;
bool _reshard_vnodes;
private:
// return estimated partitions per sstable for a given shard
uint64_t partitions_per_sstable(shard_id s) const {
@@ -1947,11 +1945,7 @@ public:
: compaction(table_s, std::move(descriptor), cdata, progress_monitor, use_backlog_tracker::no)
, _estimation_per_shard(smp::count)
, _run_identifiers(smp::count)
, _reshard_vnodes(descriptor.options.as<compaction_type_options::reshard>().vnodes_resharding)
{
if (_reshard_vnodes && !_owned_ranges) {
on_internal_error(clogger, "Resharding vnodes requires owned_ranges");
}
for (auto& sst : _sstables) {
const auto& shards = sst->get_shards_for_this_sstable();
auto size = sst->bytes_on_disk();
@@ -1989,25 +1983,8 @@ public:
}
mutation_reader_consumer make_interposer_consumer(mutation_reader_consumer end_consumer) override {
auto owned_ranges = _reshard_vnodes ? _owned_ranges : nullptr;
return [end_consumer = std::move(end_consumer), owned_ranges = std::move(owned_ranges)] (mutation_reader reader) mutable -> future<> {
if (owned_ranges) {
auto classify = [owned_ranges, it = owned_ranges->begin(), idx = mutation_writer::token_group_id(0)] (dht::token t) mutable -> mutation_writer::token_group_id {
dht::token_comparator cmp;
while (it != owned_ranges->end() && it->after(t, cmp)) {
clogger.debug("Token {} is after current range {}: advancing to the next range", t, *it);
++it;
++idx;
}
if (it == owned_ranges->end() || !it->contains(t, cmp)) {
on_internal_error(clogger, fmt::format("Token {} is outside of owned ranges", t));
}
return idx;
};
return mutation_writer::segregate_by_token_group(std::move(reader), std::move(classify), std::move(end_consumer));
} else {
return mutation_writer::segregate_by_shard(std::move(reader), std::move(end_consumer));
}
return [end_consumer = std::move(end_consumer)] (mutation_reader reader) mutable -> future<> {
return mutation_writer::segregate_by_shard(std::move(reader), std::move(end_consumer));
};
}

View File

@@ -87,8 +87,6 @@ public:
drop_unfixable_sstables drop_unfixable = drop_unfixable_sstables::no;
};
struct reshard {
// If set, resharding compaction will apply the owned_ranges to segregate sstables in vnode boundaries.
bool vnodes_resharding = false;
};
struct reshape {
};
@@ -117,8 +115,8 @@ public:
return compaction_type_options(reshape{});
}
static compaction_type_options make_reshard(bool vnodes_resharding = false) {
return compaction_type_options(reshard{.vnodes_resharding = vnodes_resharding});
static compaction_type_options make_reshard() {
return compaction_type_options(reshard{});
}
static compaction_type_options make_regular() {

View File

@@ -946,7 +946,7 @@ sstables::shared_sstable sstables_task_executor::consume_sstable() {
auto sst = _sstables.back();
_sstables.pop_back();
--_cm._stats.pending_tasks; // from this point on, switch_state(pending|active) works the same way as any other task
cmlog.debug("{}", format("consumed {}", sst->get_filename()));
cmlog.debug("consumed {}", sst->get_filename());
return sst;
}
@@ -1208,7 +1208,6 @@ future<> compaction_manager::await_tasks(std::vector<shared_ptr<compaction_task_
std::vector<shared_ptr<compaction_task_executor>>
compaction_manager::do_stop_ongoing_compactions(sstring reason, std::function<bool(const compaction_group_view*)> filter, std::optional<compaction_type> type_opt) noexcept {
auto ongoing_compactions = get_compactions(filter).size();
auto tasks = _tasks
| std::views::filter([&filter, type_opt] (const auto& task) {
return filter(task.compacting_table()) && (!type_opt || task.compaction_type() == *type_opt);
@@ -1217,6 +1216,7 @@ compaction_manager::do_stop_ongoing_compactions(sstring reason, std::function<bo
| std::ranges::to<std::vector<shared_ptr<compaction_task_executor>>>();
logging::log_level level = tasks.empty() ? log_level::debug : log_level::info;
if (cmlog.is_enabled(level)) {
auto ongoing_compactions = get_compactions(filter).size();
std::string scope = "";
if (!tasks.empty()) {
const compaction_group_view* t = tasks.front()->compacting_table();
@@ -1426,11 +1426,17 @@ protected:
compaction_strategy cs = t.get_compaction_strategy();
compaction_descriptor descriptor = co_await cs.get_sstables_for_compaction(t, _cm.get_strategy_control());
int weight = calculate_weight(descriptor);
cmlog.debug("Started minor compaction sstables={} sstables_reapired_at={} range={} uuid={} compaction_uuid={}",
descriptor.sstables, compacting_table()->get_sstables_repaired_at(),
compacting_table()->token_range(), uuid, _compaction_data.compaction_uuid);
bool debug_enabled = cmlog.is_enabled(log_level::debug);
if (debug_enabled) {
cmlog.debug("Started minor compaction sstables={} sstables_reapired_at={} range={} uuid={} compaction_uuid={}",
descriptor.sstables, compacting_table()->get_sstables_repaired_at(),
compacting_table()->token_range(), uuid, _compaction_data.compaction_uuid);
}
auto old_sstables = ::format("{}", descriptor.sstables);
sstring old_sstables;
if (debug_enabled) {
old_sstables = ::format("{}", descriptor.sstables);
}
if (descriptor.sstables.empty() || !can_proceed() || t.is_auto_compaction_disabled_by_user()) {
cmlog.debug("{}: sstables={} can_proceed={} auto_compaction={}", *this, descriptor.sstables.size(), can_proceed(), t.is_auto_compaction_disabled_by_user());
@@ -1460,8 +1466,10 @@ protected:
try {
bool should_update_history = this->should_update_history(descriptor.options.type());
compaction_result res = co_await compact_sstables(std::move(descriptor), _compaction_data, on_replace);
cmlog.debug("Finished minor compaction old_sstables={} new_sstables={} sstables_reapired_at={} range={} uuid={} compaction_uuid={}",
old_sstables, res.new_sstables, compacting_table()->get_sstables_repaired_at(), compacting_table()->token_range(), uuid, _compaction_data.compaction_uuid);
if (debug_enabled) {
cmlog.debug("Finished minor compaction old_sstables={} new_sstables={} sstables_reapired_at={} range={} uuid={} compaction_uuid={}",
old_sstables, res.new_sstables, compacting_table()->get_sstables_repaired_at(), compacting_table()->token_range(), uuid, _compaction_data.compaction_uuid);
}
finish_compaction();
if (should_update_history) {
// update_history can take a long time compared to

View File

@@ -33,8 +33,10 @@ future<compaction_descriptor> leveled_compaction_strategy::get_sstables_for_comp
auto candidate = manifest.get_compaction_candidates(*state->last_compacted_keys, state->compaction_counter);
if (!candidate.sstables.empty()) {
auto main_set = co_await table_s.main_sstable_set();
leveled_manifest::logger.debug("leveled: Compacting {} out of {} sstables", candidate.sstables.size(), main_set->size());
if (leveled_manifest::logger.is_enabled(logging::log_level::debug)) {
auto main_set = co_await table_s.main_sstable_set();
leveled_manifest::logger.debug("leveled: Compacting {} out of {} sstables", candidate.sstables.size(), main_set->size());
}
co_return candidate;
}

View File

@@ -132,7 +132,7 @@ distribute_reshard_jobs(sstables::sstable_directory::sstable_open_info_vector so
// A creator function must be passed that will create an SSTable object in the correct shard,
// and an I/O priority must be specified.
future<> reshard(sstables::sstable_directory& dir, sstables::sstable_directory::sstable_open_info_vector shared_info, replica::table& table,
compaction::compaction_sstable_creator_fn creator, compaction::owned_ranges_ptr owned_ranges_ptr, bool vnodes_resharding, tasks::task_info parent_info)
compaction::compaction_sstable_creator_fn creator, compaction::owned_ranges_ptr owned_ranges_ptr, tasks::task_info parent_info)
{
// Resharding doesn't like empty sstable sets, so bail early. There is nothing
// to reshard in this shard.
@@ -160,22 +160,13 @@ future<> reshard(sstables::sstable_directory& dir, sstables::sstable_directory::
// There is a semaphore inside the compaction manager in run_resharding_jobs. So we
// parallel_for_each so the statistics about pending jobs are updated to reflect all
// jobs. But only one will run in parallel at a time
//
// The compaction group view is used here only for job registration and gate-holding;
// resharding never reads or writes the group's own SSTables. With static (vnode)
// sharding there is exactly one group per shard; with tablets there may be many.
// In either case, any registered group suffices.
auto* cg = table.get_any_compaction_group();
if (!cg) {
on_internal_error(tasks::tmlogger, format("No compaction group found for table {}.{}", table.schema()->ks_name(), table.schema()->cf_name()));
}
auto& t = cg->view_for_unrepaired_data();
auto& t = table.try_get_compaction_group_view_with_static_sharding();
co_await coroutine::parallel_for_each(buckets, [&] (std::vector<sstables::shared_sstable>& sstlist) mutable {
return table.get_compaction_manager().run_custom_job(t, compaction_type::Reshard, "Reshard compaction", [&] (compaction_data& info, compaction_progress_monitor& progress_monitor) -> future<> {
auto erm = table.get_effective_replication_map(); // keep alive around compaction.
compaction_descriptor desc(sstlist);
desc.options = compaction_type_options::make_reshard(vnodes_resharding);
desc.options = compaction_type_options::make_reshard();
desc.creator = creator;
desc.sharder = &erm->get_sharder(*table.schema());
desc.owned_ranges = owned_ranges_ptr;
@@ -915,7 +906,7 @@ future<> table_resharding_compaction_task_impl::run() {
if (_owned_ranges_ptr) {
local_owned_ranges_ptr = make_lw_shared<const dht::token_range_vector>(*_owned_ranges_ptr);
}
auto task = co_await compaction_module.make_and_start_task<shard_resharding_compaction_task_impl>(parent_info, _status.keyspace, _status.table, _status.id, _dir, db, _creator, std::move(local_owned_ranges_ptr), _vnodes_resharding, destinations);
auto task = co_await compaction_module.make_and_start_task<shard_resharding_compaction_task_impl>(parent_info, _status.keyspace, _status.table, _status.id, _dir, db, _creator, std::move(local_owned_ranges_ptr), destinations);
co_await task->done();
}));
@@ -935,14 +926,12 @@ shard_resharding_compaction_task_impl::shard_resharding_compaction_task_impl(tas
replica::database& db,
compaction_sstable_creator_fn creator,
compaction::owned_ranges_ptr local_owned_ranges_ptr,
bool vnodes_resharding,
std::vector<replica::reshard_shard_descriptor>& destinations) noexcept
: resharding_compaction_task_impl(module, tasks::task_id::create_random_id(), 0, "shard", std::move(keyspace), std::move(table), "", parent_id)
, _dir(dir)
, _db(db)
, _creator(std::move(creator))
, _local_owned_ranges_ptr(std::move(local_owned_ranges_ptr))
, _vnodes_resharding(vnodes_resharding)
, _destinations(destinations)
{
_expected_workload = _destinations[this_shard_id()].size();
@@ -952,7 +941,7 @@ future<> shard_resharding_compaction_task_impl::run() {
auto& table = _db.find_column_family(_status.keyspace, _status.table);
auto info_vec = std::move(_destinations[this_shard_id()].info_vec);
tasks::task_info info{_status.id, _status.shard};
co_await reshard(_dir.local(), std::move(info_vec), table, _creator, std::move(_local_owned_ranges_ptr), _vnodes_resharding, info);
co_await reshard(_dir.local(), std::move(info_vec), table, _creator, std::move(_local_owned_ranges_ptr), info);
co_await _dir.local().move_foreign_sstables(_dir);
}

View File

@@ -693,7 +693,6 @@ private:
sharded<replica::database>& _db;
compaction_sstable_creator_fn _creator;
compaction::owned_ranges_ptr _owned_ranges_ptr;
bool _vnodes_resharding;
public:
table_resharding_compaction_task_impl(tasks::task_manager::module_ptr module,
std::string keyspace,
@@ -701,14 +700,12 @@ public:
sharded<sstables::sstable_directory>& dir,
sharded<replica::database>& db,
compaction_sstable_creator_fn creator,
compaction::owned_ranges_ptr owned_ranges_ptr,
bool vnodes_resharding) noexcept
compaction::owned_ranges_ptr owned_ranges_ptr) noexcept
: resharding_compaction_task_impl(module, tasks::task_id::create_random_id(), module->new_sequence_number(), "table", std::move(keyspace), std::move(table), "", tasks::task_id::create_null_id())
, _dir(dir)
, _db(db)
, _creator(std::move(creator))
, _owned_ranges_ptr(std::move(owned_ranges_ptr))
, _vnodes_resharding(vnodes_resharding)
{}
protected:
virtual future<> run() override;
@@ -721,7 +718,6 @@ private:
replica::database& _db;
compaction_sstable_creator_fn _creator;
compaction::owned_ranges_ptr _local_owned_ranges_ptr;
bool _vnodes_resharding;
std::vector<replica::reshard_shard_descriptor>& _destinations;
public:
shard_resharding_compaction_task_impl(tasks::task_manager::module_ptr module,
@@ -732,7 +728,6 @@ public:
replica::database& db,
compaction_sstable_creator_fn creator,
compaction::owned_ranges_ptr local_owned_ranges_ptr,
bool vnodes_resharding,
std::vector<replica::reshard_shard_descriptor>& destinations) noexcept;
protected:
virtual future<> run() override;

View File

@@ -15,6 +15,7 @@
#include "compaction_strategy_state.hh"
#include "utils/error_injection.hh"
#include <seastar/util/lazy.hh>
#include <ranges>
namespace compaction {
@@ -28,12 +29,12 @@ time_window_compaction_strategy_state_ptr time_window_compaction_strategy::get_s
}
const std::unordered_map<sstring, std::chrono::seconds> time_window_compaction_strategy_options::valid_window_units = {
{ "MINUTES", 60s }, { "HOURS", 3600s }, { "DAYS", 86400s }
};
{"MINUTES", 60s}, {"HOURS", 3600s}, {"DAYS", 86400s}};
const std::unordered_map<sstring, time_window_compaction_strategy_options::timestamp_resolutions> time_window_compaction_strategy_options::valid_timestamp_resolutions = {
{ "MICROSECONDS", timestamp_resolutions::microsecond },
{ "MILLISECONDS", timestamp_resolutions::millisecond },
const std::unordered_map<sstring, time_window_compaction_strategy_options::timestamp_resolutions>
time_window_compaction_strategy_options::valid_timestamp_resolutions = {
{"MICROSECONDS", timestamp_resolutions::microsecond},
{"MILLISECONDS", timestamp_resolutions::millisecond},
};
static std::chrono::seconds validate_compaction_window_unit(const std::map<sstring, sstring>& options) {
@@ -43,7 +44,8 @@ static std::chrono::seconds validate_compaction_window_unit(const std::map<sstri
if (tmp_value) {
auto valid_window_units_it = time_window_compaction_strategy_options::valid_window_units.find(tmp_value.value());
if (valid_window_units_it == time_window_compaction_strategy_options::valid_window_units.end()) {
throw exceptions::configuration_exception(fmt::format("Invalid window unit {} for {}", tmp_value.value(), time_window_compaction_strategy_options::COMPACTION_WINDOW_UNIT_KEY));
throw exceptions::configuration_exception(
fmt::format("Invalid window unit {} for {}", tmp_value.value(), time_window_compaction_strategy_options::COMPACTION_WINDOW_UNIT_KEY));
}
window_unit = valid_window_units_it->second;
}
@@ -59,10 +61,12 @@ static std::chrono::seconds validate_compaction_window_unit(const std::map<sstri
static int validate_compaction_window_size(const std::map<sstring, sstring>& options) {
auto tmp_value = compaction_strategy_impl::get_value(options, time_window_compaction_strategy_options::COMPACTION_WINDOW_SIZE_KEY);
int window_size = cql3::statements::property_definitions::to_long(time_window_compaction_strategy_options::COMPACTION_WINDOW_SIZE_KEY, tmp_value, time_window_compaction_strategy_options::DEFAULT_COMPACTION_WINDOW_SIZE);
int window_size = cql3::statements::property_definitions::to_long(time_window_compaction_strategy_options::COMPACTION_WINDOW_SIZE_KEY, tmp_value,
time_window_compaction_strategy_options::DEFAULT_COMPACTION_WINDOW_SIZE);
if (window_size <= 0) {
throw exceptions::configuration_exception(fmt::format("{} value ({}) must be greater than 1", time_window_compaction_strategy_options::COMPACTION_WINDOW_SIZE_KEY, window_size));
throw exceptions::configuration_exception(
fmt::format("{} value ({}) must be greater than 1", time_window_compaction_strategy_options::COMPACTION_WINDOW_SIZE_KEY, window_size));
}
return window_size;
@@ -82,26 +86,30 @@ static db_clock::duration validate_expired_sstable_check_frequency_seconds(const
try {
expired_sstable_check_frequency = std::chrono::seconds(std::stol(tmp_value.value()));
} catch (const std::exception& e) {
throw exceptions::syntax_exception(fmt::format("Invalid long value {} for {}", tmp_value.value(), time_window_compaction_strategy_options::EXPIRED_SSTABLE_CHECK_FREQUENCY_SECONDS_KEY));
throw exceptions::syntax_exception(fmt::format(
"Invalid long value {} for {}", tmp_value.value(), time_window_compaction_strategy_options::EXPIRED_SSTABLE_CHECK_FREQUENCY_SECONDS_KEY));
}
}
return expired_sstable_check_frequency;
}
static db_clock::duration validate_expired_sstable_check_frequency_seconds(const std::map<sstring, sstring>& options, std::map<sstring, sstring>& unchecked_options) {
static db_clock::duration validate_expired_sstable_check_frequency_seconds(
const std::map<sstring, sstring>& options, std::map<sstring, sstring>& unchecked_options) {
db_clock::duration expired_sstable_check_frequency = validate_expired_sstable_check_frequency_seconds(options);
unchecked_options.erase(time_window_compaction_strategy_options::EXPIRED_SSTABLE_CHECK_FREQUENCY_SECONDS_KEY);
return expired_sstable_check_frequency;
}
static time_window_compaction_strategy_options::timestamp_resolutions validate_timestamp_resolution(const std::map<sstring, sstring>& options) {
time_window_compaction_strategy_options::timestamp_resolutions timestamp_resolution = time_window_compaction_strategy_options::timestamp_resolutions::microsecond;
time_window_compaction_strategy_options::timestamp_resolutions timestamp_resolution =
time_window_compaction_strategy_options::timestamp_resolutions::microsecond;
auto tmp_value = compaction_strategy_impl::get_value(options, time_window_compaction_strategy_options::TIMESTAMP_RESOLUTION_KEY);
if (tmp_value) {
if (!time_window_compaction_strategy_options::valid_timestamp_resolutions.contains(tmp_value.value())) {
throw exceptions::configuration_exception(fmt::format("Invalid timestamp resolution {} for {}", tmp_value.value(), time_window_compaction_strategy_options::TIMESTAMP_RESOLUTION_KEY));
throw exceptions::configuration_exception(fmt::format(
"Invalid timestamp resolution {} for {}", tmp_value.value(), time_window_compaction_strategy_options::TIMESTAMP_RESOLUTION_KEY));
} else {
timestamp_resolution = time_window_compaction_strategy_options::valid_timestamp_resolutions.at(tmp_value.value());
}
@@ -110,7 +118,8 @@ static time_window_compaction_strategy_options::timestamp_resolutions validate_t
return timestamp_resolution;
}
static time_window_compaction_strategy_options::timestamp_resolutions validate_timestamp_resolution(const std::map<sstring, sstring>& options, std::map<sstring, sstring>& unchecked_options) {
static time_window_compaction_strategy_options::timestamp_resolutions validate_timestamp_resolution(
const std::map<sstring, sstring>& options, std::map<sstring, sstring>& unchecked_options) {
time_window_compaction_strategy_options::timestamp_resolutions timestamp_resolution = validate_timestamp_resolution(options);
unchecked_options.erase(time_window_compaction_strategy_options::TIMESTAMP_RESOLUTION_KEY);
return timestamp_resolution;
@@ -145,7 +154,7 @@ void time_window_compaction_strategy_options::validate(const std::map<sstring, s
compaction_strategy_impl::validate_min_max_threshold(options, unchecked_options);
auto it = options.find("enable_optimized_twcs_queries");
if (it != options.end() && it->second != "true" && it->second != "false") {
if (it != options.end() && it->second != "true" && it->second != "false") {
throw exceptions::configuration_exception(fmt::format("enable_optimized_twcs_queries value ({}) must be \"true\" or \"false\"", it->second));
}
unchecked_options.erase("enable_optimized_twcs_queries");
@@ -162,7 +171,9 @@ class classify_by_timestamp {
std::vector<int64_t> _known_windows;
public:
explicit classify_by_timestamp(time_window_compaction_strategy_options options) : _options(std::move(options)) { }
explicit classify_by_timestamp(time_window_compaction_strategy_options options)
: _options(std::move(options)) {
}
int64_t operator()(api::timestamp_type ts) {
const auto window = time_window_compaction_strategy::get_window_for(_options, ts);
if (const auto it = std::ranges::find(_known_windows, window); it != _known_windows.end()) {
@@ -190,7 +201,7 @@ uint64_t time_window_compaction_strategy::adjust_partition_estimate(const mutati
auto estimated_window_count = max_data_segregation_window_count;
auto default_ttl = std::chrono::duration_cast<std::chrono::microseconds>(s->default_time_to_live());
bool min_and_max_ts_available = ms_meta.min_timestamp && ms_meta.max_timestamp;
auto estimate_window_count = [this] (timestamp_type min_window, timestamp_type max_window) {
auto estimate_window_count = [this](timestamp_type min_window, timestamp_type max_window) {
const auto window_size = get_window_size(_options);
return (max_window + (window_size - 1) - min_window) / window_size;
};
@@ -210,21 +221,19 @@ uint64_t time_window_compaction_strategy::adjust_partition_estimate(const mutati
return partition_estimate / std::max(1UL, uint64_t(estimated_window_count));
}
mutation_reader_consumer time_window_compaction_strategy::make_interposer_consumer(const mutation_source_metadata& ms_meta, mutation_reader_consumer end_consumer) const {
if (ms_meta.min_timestamp && ms_meta.max_timestamp
&& get_window_for(_options, *ms_meta.min_timestamp) == get_window_for(_options, *ms_meta.max_timestamp)) {
mutation_reader_consumer time_window_compaction_strategy::make_interposer_consumer(
const mutation_source_metadata& ms_meta, mutation_reader_consumer end_consumer) const {
if (ms_meta.min_timestamp && ms_meta.max_timestamp &&
get_window_for(_options, *ms_meta.min_timestamp) == get_window_for(_options, *ms_meta.max_timestamp)) {
return end_consumer;
}
return [options = _options, end_consumer = std::move(end_consumer)] (mutation_reader rd) mutable -> future<> {
return mutation_writer::segregate_by_timestamp(
std::move(rd),
classify_by_timestamp(std::move(options)),
end_consumer);
return [options = _options, end_consumer = std::move(end_consumer)](mutation_reader rd) mutable -> future<> {
return mutation_writer::segregate_by_timestamp(std::move(rd), classify_by_timestamp(std::move(options)), end_consumer);
};
}
compaction_descriptor
time_window_compaction_strategy::get_reshaping_job(std::vector<sstables::shared_sstable> input, schema_ptr schema, reshape_config cfg) const {
compaction_descriptor time_window_compaction_strategy::get_reshaping_job(
std::vector<sstables::shared_sstable> input, schema_ptr schema, reshape_config cfg) const {
auto mode = cfg.mode;
std::vector<sstables::shared_sstable> single_window;
std::vector<sstables::shared_sstable> multi_window;
@@ -239,7 +248,7 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<sstables::shared_
// Sort input sstables by first_key order
// to allow efficient reshaping of disjoint sstables.
std::sort(input.begin(), input.end(), [&schema] (const sstables::shared_sstable& a, const sstables::shared_sstable& b) {
std::sort(input.begin(), input.end(), [&schema](const sstables::shared_sstable& a, const sstables::shared_sstable& b) {
return dht::ring_position(a->get_first_decorated_key()).less_compare(*schema, dht::ring_position(b->get_first_decorated_key()));
});
@@ -253,31 +262,34 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<sstables::shared_
}
}
auto is_disjoint = [&schema, mode, max_sstables] (const std::vector<sstables::shared_sstable>& ssts) {
auto is_disjoint = [&schema, mode, max_sstables](const std::vector<sstables::shared_sstable>& ssts) {
size_t tolerance = (mode == reshape_mode::relaxed) ? max_sstables : 0;
return sstable_set_overlapping_count(schema, ssts) <= tolerance;
};
clogger.debug("time_window_compaction_strategy::get_reshaping_job: offstrategy_threshold={} max_sstables={} multi_window={} disjoint={} single_window={} disjoint={}",
offstrategy_threshold, max_sstables,
multi_window.size(), !multi_window.empty() && sstable_set_overlapping_count(schema, multi_window) == 0,
single_window.size(), !single_window.empty() && sstable_set_overlapping_count(schema, single_window) == 0);
clogger.debug("time_window_compaction_strategy::get_reshaping_job: offstrategy_threshold={} max_sstables={} multi_window={} disjoint={} "
"single_window={} disjoint={}",
offstrategy_threshold, max_sstables, multi_window.size(), seastar::value_of([&] {
return !multi_window.empty() && sstable_set_overlapping_count(schema, multi_window) == 0;
}),
single_window.size(), seastar::value_of([&] {
return !single_window.empty() && sstable_set_overlapping_count(schema, single_window) == 0;
}));
auto get_job_size = [] (const std::vector<sstables::shared_sstable>& ssts) {
auto get_job_size = [](const std::vector<sstables::shared_sstable>& ssts) {
return std::ranges::fold_left(ssts | std::views::transform(std::mem_fn(&sstables::sstable::bytes_on_disk)), uint64_t(0), std::plus{});
};
// Targets a space overhead of 10%. All disjoint sstables can be compacted together as long as they won't
// cause an overhead above target. Otherwise, the job targets a maximum of #max_threshold sstables.
auto need_trimming = [&] (const std::vector<sstables::shared_sstable>& ssts, const uint64_t job_size, bool is_disjoint) {
auto need_trimming = [&](const std::vector<sstables::shared_sstable>& ssts, const uint64_t job_size, bool is_disjoint) {
const size_t min_sstables = 2;
auto is_above_target_size = job_size > target_job_size;
return (ssts.size() > max_sstables && !is_disjoint) ||
(ssts.size() > min_sstables && is_above_target_size);
return (ssts.size() > max_sstables && !is_disjoint) || (ssts.size() > min_sstables && is_above_target_size);
};
auto maybe_trim_job = [&need_trimming] (std::vector<sstables::shared_sstable>& ssts, uint64_t job_size, bool is_disjoint) {
auto maybe_trim_job = [&need_trimming](std::vector<sstables::shared_sstable>& ssts, uint64_t job_size, bool is_disjoint) {
while (need_trimming(ssts, job_size, is_disjoint)) {
auto sst = ssts.back();
ssts.pop_back();
@@ -294,7 +306,7 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<sstables::shared_
// For example, if there are N sstables spanning window W, where N <= 32, then we can produce all data for W
// in a single compaction round, removing the need to later compact W to reduce its number of files.
auto sort_size = std::min(max_sstables, multi_window.size());
std::ranges::partial_sort(multi_window, multi_window.begin() + sort_size, std::ranges::less(), [] (const sstables::shared_sstable &a) {
std::ranges::partial_sort(multi_window, multi_window.begin() + sort_size, std::ranges::less(), [](const sstables::shared_sstable& a) {
return a->get_stats_metadata().max_timestamp;
});
maybe_trim_job(multi_window, job_size, disjoint);
@@ -334,8 +346,7 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<sstables::shared_
return compaction_descriptor();
}
future<compaction_descriptor>
time_window_compaction_strategy::get_sstables_for_compaction(compaction_group_view& table_s, strategy_control& control) {
future<compaction_descriptor> time_window_compaction_strategy::get_sstables_for_compaction(compaction_group_view& table_s, strategy_control& control) {
auto state = get_state(table_s);
auto compaction_time = gc_clock::now();
auto candidates = co_await control.candidates(table_s);
@@ -369,10 +380,8 @@ time_window_compaction_strategy::get_sstables_for_compaction(compaction_group_vi
co_return compaction_descriptor(std::move(compaction_candidates));
}
time_window_compaction_strategy::bucket_compaction_mode
time_window_compaction_strategy::compaction_mode(const time_window_compaction_strategy_state& state,
const bucket_t& bucket, timestamp_type bucket_key,
timestamp_type now, size_t min_threshold) const {
time_window_compaction_strategy::bucket_compaction_mode time_window_compaction_strategy::compaction_mode(
const time_window_compaction_strategy_state& state, const bucket_t& bucket, timestamp_type bucket_key, timestamp_type now, size_t min_threshold) const {
// STCS will also be performed on older window buckets, to avoid a bad write and
// space amplification when something like read repair cause small updates to
// those past windows.
@@ -385,8 +394,7 @@ time_window_compaction_strategy::compaction_mode(const time_window_compaction_st
return bucket_compaction_mode::none;
}
std::vector<sstables::shared_sstable>
time_window_compaction_strategy::get_next_non_expired_sstables(compaction_group_view& table_s, strategy_control& control,
std::vector<sstables::shared_sstable> time_window_compaction_strategy::get_next_non_expired_sstables(compaction_group_view& table_s, strategy_control& control,
std::vector<sstables::shared_sstable> non_expiring_sstables, gc_clock::time_point compaction_time, time_window_compaction_strategy_state& state) {
auto most_interesting = get_compaction_candidates(table_s, control, non_expiring_sstables, state);
@@ -400,31 +408,29 @@ time_window_compaction_strategy::get_next_non_expired_sstables(compaction_group_
// if there is no sstable to compact in standard way, try compacting single sstable whose droppable tombstone
// ratio is greater than threshold.
std::erase_if(non_expiring_sstables, [this, compaction_time, &table_s] (const sstables::shared_sstable& sst) -> bool {
std::erase_if(non_expiring_sstables, [this, compaction_time, &table_s](const sstables::shared_sstable& sst) -> bool {
return !worth_dropping_tombstones(sst, compaction_time, table_s);
});
if (non_expiring_sstables.empty()) {
return {};
}
auto it = std::ranges::min_element(non_expiring_sstables, [] (auto& i, auto& j) {
auto it = std::ranges::min_element(non_expiring_sstables, [](auto& i, auto& j) {
return i->get_stats_metadata().min_timestamp < j->get_stats_metadata().min_timestamp;
});
return { *it };
return {*it};
}
std::vector<sstables::shared_sstable>
time_window_compaction_strategy::get_compaction_candidates(compaction_group_view& table_s, strategy_control& control,
std::vector<sstables::shared_sstable> candidate_sstables, time_window_compaction_strategy_state& state) {
std::vector<sstables::shared_sstable> time_window_compaction_strategy::get_compaction_candidates(compaction_group_view& table_s, strategy_control& control,
std::vector<sstables::shared_sstable> candidate_sstables, time_window_compaction_strategy_state& state) {
auto [buckets, max_timestamp] = get_buckets(std::move(candidate_sstables), _options);
// Update the highest window seen, if necessary
state.highest_window_seen = std::max(state.highest_window_seen, max_timestamp);
return newest_bucket(table_s, control, std::move(buckets), table_s.min_compaction_threshold(), table_s.schema()->max_compaction_threshold(),
state.highest_window_seen, state);
state.highest_window_seen, state);
}
timestamp_type
time_window_compaction_strategy::get_window_lower_bound(std::chrono::seconds sstable_window_size, timestamp_type timestamp) {
timestamp_type time_window_compaction_strategy::get_window_lower_bound(std::chrono::seconds sstable_window_size, timestamp_type timestamp) {
using namespace std::chrono;
// mask out window size from timestamp to get lower bound of its window
auto num_windows = microseconds(timestamp) / sstable_window_size;
@@ -432,8 +438,8 @@ time_window_compaction_strategy::get_window_lower_bound(std::chrono::seconds sst
return duration_cast<microseconds>(num_windows * sstable_window_size).count();
}
std::pair<std::map<timestamp_type, std::vector<sstables::shared_sstable>>, timestamp_type>
time_window_compaction_strategy::get_buckets(std::vector<sstables::shared_sstable> files, const time_window_compaction_strategy_options& options) {
std::pair<std::map<timestamp_type, std::vector<sstables::shared_sstable>>, timestamp_type> time_window_compaction_strategy::get_buckets(
std::vector<sstables::shared_sstable> files, const time_window_compaction_strategy_options& options) {
std::map<timestamp_type, std::vector<sstables::shared_sstable>> buckets;
timestamp_type max_timestamp = 0;
@@ -450,11 +456,13 @@ time_window_compaction_strategy::get_buckets(std::vector<sstables::shared_sstabl
return std::make_pair(std::move(buckets), max_timestamp);
}
}
} // namespace compaction
template <>
struct fmt::formatter<std::map<compaction::timestamp_type, std::vector<sstables::shared_sstable>>> {
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
constexpr auto parse(format_parse_context& ctx) {
return ctx.begin();
}
auto format(const std::map<compaction::timestamp_type, std::vector<sstables::shared_sstable>>& buckets, fmt::format_context& ctx) const {
auto out = fmt::format_to(ctx.out(), " buckets = {{\n");
for (auto& [timestamp, sstables] : buckets | std::views::reverse) {
@@ -466,9 +474,9 @@ struct fmt::formatter<std::map<compaction::timestamp_type, std::vector<sstables:
namespace compaction {
std::vector<sstables::shared_sstable>
time_window_compaction_strategy::newest_bucket(compaction_group_view& table_s, strategy_control& control, std::map<timestamp_type, std::vector<sstables::shared_sstable>> buckets,
int min_threshold, int max_threshold, timestamp_type now, time_window_compaction_strategy_state& state) {
std::vector<sstables::shared_sstable> time_window_compaction_strategy::newest_bucket(compaction_group_view& table_s, strategy_control& control,
std::map<timestamp_type, std::vector<sstables::shared_sstable>> buckets, int min_threshold, int max_threshold, timestamp_type now,
time_window_compaction_strategy_state& state) {
clogger.debug("time_window_compaction_strategy::newest_bucket:\n now {}\n{}", now, buckets);
for (auto&& [key, bucket] : buckets | std::views::reverse) {
@@ -509,8 +517,7 @@ time_window_compaction_strategy::newest_bucket(compaction_group_view& table_s, s
return {};
}
std::vector<sstables::shared_sstable>
time_window_compaction_strategy::trim_to_threshold(std::vector<sstables::shared_sstable> bucket, int max_threshold) {
std::vector<sstables::shared_sstable> time_window_compaction_strategy::trim_to_threshold(std::vector<sstables::shared_sstable> bucket, int max_threshold) {
auto n = std::min(bucket.size(), size_t(max_threshold));
// Trim the largest sstables off the end to meet the maxThreshold
std::ranges::partial_sort(bucket, bucket.begin() + n, std::ranges::less(), std::mem_fn(&sstables::sstable::ondisk_data_size));
@@ -542,8 +549,8 @@ future<int64_t> time_window_compaction_strategy::estimated_pending_compactions(c
co_return n;
}
std::vector<compaction_descriptor>
time_window_compaction_strategy::get_cleanup_compaction_jobs(compaction_group_view& table_s, std::vector<sstables::shared_sstable> candidates) const {
std::vector<compaction_descriptor> time_window_compaction_strategy::get_cleanup_compaction_jobs(
compaction_group_view& table_s, std::vector<sstables::shared_sstable> candidates) const {
std::vector<compaction_descriptor> ret;
for (auto&& [_, sstables] : get_buckets(std::move(candidates), _options).first) {
auto per_window_jobs = size_tiered_compaction_strategy(_stcs_options).get_cleanup_compaction_jobs(table_s, std::move(sstables));
@@ -556,4 +563,4 @@ std::unique_ptr<sstables::sstable_set_impl> time_window_compaction_strategy::mak
return std::make_unique<sstables::time_series_sstable_set>(ts.schema(), _options.enable_optimized_twcs_queries);
}
}
} // namespace compaction

View File

@@ -488,7 +488,6 @@ sstable_format: ms
# compressed.
# can be: all - all traffic is compressed
# dc - traffic between different datacenters is compressed
# rack - traffic between different racks is compressed
# none - nothing is compressed.
# internode_compression: none
@@ -584,7 +583,8 @@ sstable_format: ms
audit: "table"
#
# List of statement categories that should be audited.
audit_categories: "DCL,DDL,AUTH,ADMIN"
# Possible categories are: QUERY, DML, DCL, DDL, AUTH, ADMIN
audit_categories: "DCL,AUTH,ADMIN"
#
# List of tables that should be audited.
# audit_tables: "<keyspace_name>.<table_name>,<keyspace_name>.<table_name>"

View File

@@ -1708,14 +1708,12 @@ deps['test/boost/combined_tests'] += [
'test/boost/sstable_compression_config_test.cc',
'test/boost/sstable_directory_test.cc',
'test/boost/sstable_set_test.cc',
'test/boost/sstable_tablet_streaming.cc',
'test/boost/statement_restrictions_test.cc',
'test/boost/storage_proxy_test.cc',
'test/boost/tablets_test.cc',
'test/boost/tracing_test.cc',
'test/boost/user_function_test.cc',
'test/boost/user_types_test.cc',
'test/boost/vector_index_test.cc',
'test/boost/view_build_test.cc',
'test/boost/view_complex_test.cc',
'test/boost/view_schema_ckey_test.cc',
@@ -2233,20 +2231,16 @@ abseil_libs = ['absl/' + lib for lib in [
'container/libabsl_raw_hash_set.a',
'synchronization/libabsl_synchronization.a',
'synchronization/libabsl_graphcycles_internal.a',
'synchronization/libabsl_kernel_timeout_internal.a',
'debugging/libabsl_stacktrace.a',
'debugging/libabsl_symbolize.a',
'debugging/libabsl_debugging_internal.a',
'debugging/libabsl_demangle_internal.a',
'debugging/libabsl_demangle_rust.a',
'debugging/libabsl_decode_rust_punycode.a',
'debugging/libabsl_utf8_for_code_point.a',
'debugging/libabsl_borrowed_fixup_buffer.a',
'time/libabsl_time.a',
'time/libabsl_time_zone.a',
'numeric/libabsl_int128.a',
'hash/libabsl_hash.a',
'hash/libabsl_city.a',
'hash/libabsl_low_level_hash.a',
'base/libabsl_malloc_internal.a',
'base/libabsl_spinlock_wait.a',
'base/libabsl_base.a',

View File

@@ -48,13 +48,15 @@ const sstring query_processor::CQL_VERSION = "3.3.1";
const std::chrono::minutes prepared_statements_cache::entry_expiry = std::chrono::minutes(60);
struct query_processor::remote {
remote(service::migration_manager& mm, service::mapreduce_service& fwd,
service::storage_service& ss, service::raft_group0_client& group0_client,
service::strong_consistency::coordinator& _sc_coordinator)
: mm(mm), mapreducer(fwd), ss(ss), group0_client(group0_client)
, sc_coordinator(_sc_coordinator)
, gate("query_processor::remote")
{}
remote(service::migration_manager& mm, service::mapreduce_service& fwd, service::storage_service& ss, service::raft_group0_client& group0_client,
service::strong_consistency::coordinator& _sc_coordinator)
: mm(mm)
, mapreducer(fwd)
, ss(ss)
, group0_client(group0_client)
, sc_coordinator(_sc_coordinator)
, gate("query_processor::remote") {
}
service::migration_manager& mm;
service::mapreduce_service& mapreducer;
@@ -77,24 +79,34 @@ static service::query_state query_state_for_internal_call() {
return {service::client_state::for_internal_calls(), empty_service_permit()};
}
query_processor::query_processor(service::storage_proxy& proxy, data_dictionary::database db, service::migration_notifier& mn, vector_search::vector_store_client& vsc, query_processor::memory_config mcfg, cql_config& cql_cfg, utils::loading_cache_config auth_prep_cache_cfg, lang::manager& langm)
: _migration_subscriber{std::make_unique<migration_subscriber>(this)}
, _proxy(proxy)
, _db(db)
, _mnotifier(mn)
, _vector_store_client(vsc)
, _mcfg(mcfg)
, _cql_config(cql_cfg)
, _prepared_cache(prep_cache_log, _mcfg.prepared_statment_cache_size)
, _authorized_prepared_cache(std::move(auth_prep_cache_cfg), authorized_prepared_statements_cache_log)
, _auth_prepared_cache_cfg_cb([this] (uint32_t) { (void) _authorized_prepared_cache_config_action.trigger_later(); })
, _authorized_prepared_cache_config_action([this] { update_authorized_prepared_cache_config(); return make_ready_future<>(); })
, _authorized_prepared_cache_update_interval_in_ms_observer(_db.get_config().permissions_update_interval_in_ms.observe(_auth_prepared_cache_cfg_cb))
, _authorized_prepared_cache_validity_in_ms_observer(_db.get_config().permissions_validity_in_ms.observe(_auth_prepared_cache_cfg_cb))
, _lang_manager(langm)
, _write_consistency_levels_warned_observer(_db.get_config().write_consistency_levels_warned.observe([this](const auto& v) { _write_consistency_levels_warned = to_consistency_level_set(v); }))
, _write_consistency_levels_disallowed_observer(_db.get_config().write_consistency_levels_disallowed.observe([this](const auto& v) { _write_consistency_levels_disallowed = to_consistency_level_set(v); }))
{
query_processor::query_processor(service::storage_proxy& proxy, data_dictionary::database db, service::migration_notifier& mn,
vector_search::vector_store_client& vsc, query_processor::memory_config mcfg, cql_config& cql_cfg, utils::loading_cache_config auth_prep_cache_cfg,
lang::manager& langm)
: _migration_subscriber{std::make_unique<migration_subscriber>(this)}
, _proxy(proxy)
, _db(db)
, _mnotifier(mn)
, _vector_store_client(vsc)
, _mcfg(mcfg)
, _cql_config(cql_cfg)
, _prepared_cache(prep_cache_log, _mcfg.prepared_statment_cache_size)
, _authorized_prepared_cache(std::move(auth_prep_cache_cfg), authorized_prepared_statements_cache_log)
, _auth_prepared_cache_cfg_cb([this](uint32_t) {
(void)_authorized_prepared_cache_config_action.trigger_later();
})
, _authorized_prepared_cache_config_action([this] {
update_authorized_prepared_cache_config();
return make_ready_future<>();
})
, _authorized_prepared_cache_update_interval_in_ms_observer(_db.get_config().permissions_update_interval_in_ms.observe(_auth_prepared_cache_cfg_cb))
, _authorized_prepared_cache_validity_in_ms_observer(_db.get_config().permissions_validity_in_ms.observe(_auth_prepared_cache_cfg_cb))
, _lang_manager(langm)
, _write_consistency_levels_warned_observer(_db.get_config().write_consistency_levels_warned.observe([this](const auto& v) {
_write_consistency_levels_warned = to_consistency_level_set(v);
}))
, _write_consistency_levels_disallowed_observer(_db.get_config().write_consistency_levels_disallowed.observe([this](const auto& v) {
_write_consistency_levels_disallowed = to_consistency_level_set(v);
})) {
_write_consistency_levels_warned = to_consistency_level_set(_db.get_config().write_consistency_levels_warned());
_write_consistency_levels_disallowed = to_consistency_level_set(_db.get_config().write_consistency_levels_disallowed());
namespace sm = seastar::metrics;
@@ -102,7 +114,7 @@ query_processor::query_processor(service::storage_proxy& proxy, data_dictionary:
using clevel = db::consistency_level;
sm::label cl_label("consistency_level");
sm::label who_label("who"); // Who queried system tables
sm::label who_label("who"); // Who queried system tables
const auto user_who_label_instance = who_label("user");
const auto internal_who_label_instance = who_label("internal");
@@ -110,17 +122,11 @@ query_processor::query_processor(service::storage_proxy& proxy, data_dictionary:
const auto system_ks_label_instance = ks_label("system");
std::vector<sm::metric_definition> qp_group;
qp_group.push_back(sm::make_counter(
"statements_prepared",
_stats.prepare_invocations,
sm::description("Counts the total number of parsed CQL requests.")));
qp_group.push_back(sm::make_counter("statements_prepared", _stats.prepare_invocations, sm::description("Counts the total number of parsed CQL requests.")));
for (auto cl = size_t(clevel::MIN_VALUE); cl <= size_t(clevel::MAX_VALUE); ++cl) {
qp_group.push_back(
sm::make_counter(
"queries",
_stats.queries_by_cl[cl],
sm::description("Counts queries by consistency level."),
{cl_label(clevel(cl)), basic_level}).set_skip_when_empty());
qp_group.push_back(sm::make_counter(
"queries", _stats.queries_by_cl[cl], sm::description("Counts queries by consistency level."), {cl_label(clevel(cl)), basic_level})
.set_skip_when_empty());
}
_metrics.add_group("query_processor", qp_group);
@@ -521,29 +527,23 @@ query_processor::query_processor(service::storage_proxy& proxy, data_dictionary:
std::vector<sm::metric_definition> cql_cl_group;
for (auto cl = size_t(clevel::MIN_VALUE); cl <= size_t(clevel::MAX_VALUE); ++cl) {
cql_cl_group.push_back(
sm::make_counter(
"writes_per_consistency_level",
_cql_stats.writes_per_consistency_level[cl],
sm::description("Counts the number of writes for each consistency level."),
{cl_label(clevel(cl)), basic_level}).set_skip_when_empty());
cql_cl_group.push_back(sm::make_counter("writes_per_consistency_level", _cql_stats.writes_per_consistency_level[cl],
sm::description("Counts the number of writes for each consistency level."), {cl_label(clevel(cl)), basic_level})
.set_skip_when_empty());
}
_metrics.add_group("cql", cql_cl_group);
_metrics.add_group("cql", {
sm::make_counter(
"write_consistency_levels_disallowed_violations",
_cql_stats.write_consistency_levels_disallowed_violations,
sm::description("Counts the number of write_consistency_levels_disallowed guardrail violations, "
"i.e. attempts to write with a forbidden consistency level."),
{basic_level}),
sm::make_counter(
"write_consistency_levels_warned_violations",
_cql_stats.write_consistency_levels_warned_violations,
sm::description("Counts the number of write_consistency_levels_warned guardrail violations, "
"i.e. attempts to write with a discouraged consistency level."),
{basic_level}),
});
_metrics.add_group(
"cql", {
sm::make_counter("write_consistency_levels_disallowed_violations", _cql_stats.write_consistency_levels_disallowed_violations,
sm::description("Counts the number of write_consistency_levels_disallowed guardrail violations, "
"i.e. attempts to write with a forbidden consistency level."),
{basic_level}),
sm::make_counter("write_consistency_levels_warned_violations", _cql_stats.write_consistency_levels_warned_violations,
sm::description("Counts the number of write_consistency_levels_warned guardrail violations, "
"i.e. attempts to write with a discouraged consistency level."),
{basic_level}),
});
_mnotifier.register_listener(_migration_subscriber.get());
}
@@ -554,15 +554,13 @@ query_processor::~query_processor() {
}
}
std::pair<std::reference_wrapper<service::strong_consistency::coordinator>, gate::holder>
query_processor::acquire_strongly_consistent_coordinator() {
std::pair<std::reference_wrapper<service::strong_consistency::coordinator>, gate::holder> query_processor::acquire_strongly_consistent_coordinator() {
auto [remote_, holder] = remote();
return {remote_.get().sc_coordinator, std::move(holder)};
}
void query_processor::start_remote(service::migration_manager& mm, service::mapreduce_service& mapreducer,
service::storage_service& ss, service::raft_group0_client& group0_client,
service::strong_consistency::coordinator& sc_coordinator) {
void query_processor::start_remote(service::migration_manager& mm, service::mapreduce_service& mapreducer, service::storage_service& ss,
service::raft_group0_client& group0_client, service::strong_consistency::coordinator& sc_coordinator) {
_remote = std::make_unique<struct remote>(mm, mapreducer, ss, group0_client, sc_coordinator);
}
@@ -582,7 +580,9 @@ future<> query_processor::stop() {
}
future<::shared_ptr<cql_transport::messages::result_message>> query_processor::execute_with_guard(
std::function<future<::shared_ptr<cql_transport::messages::result_message>>(service::query_state&, ::shared_ptr<cql_statement>, const query_options&, std::optional<service::group0_guard>)> fn,
std::function<future<::shared_ptr<cql_transport::messages::result_message>>(
service::query_state&, ::shared_ptr<cql_statement>, const query_options&, std::optional<service::group0_guard>)>
fn,
::shared_ptr<cql_statement> statement, service::query_state& query_state, const query_options& options) {
// execute all statements that need group0 guard on shard0
if (this_shard_id() != 0) {
@@ -591,13 +591,13 @@ future<::shared_ptr<cql_transport::messages::result_message>> query_processor::e
auto [remote_, holder] = remote();
size_t retries = remote_.get().mm.get_concurrent_ddl_retries();
while (true) {
while (true) {
try {
auto guard = co_await remote_.get().mm.start_group0_operation();
co_return co_await fn(query_state, statement, options, std::move(guard));
} catch (const service::group0_concurrent_modification& ex) {
log.warn("Failed to execute statement \"{}\" due to guard conflict.{}.",
statement->raw_cql_statement, retries ? " Retrying" : " Number of retries exceeded, giving up");
log.warn("Failed to execute statement \"{}\" due to guard conflict.{}.", statement->raw_cql_statement,
retries ? " Retrying" : " Number of retries exceeded, giving up");
if (retries--) {
continue;
}
@@ -606,29 +606,30 @@ future<::shared_ptr<cql_transport::messages::result_message>> query_processor::e
}
}
template<typename... Args>
future<::shared_ptr<result_message>>
query_processor::execute_maybe_with_guard(service::query_state& query_state, ::shared_ptr<cql_statement> statement, const query_options& options,
future<::shared_ptr<result_message>>(query_processor::*fn)(service::query_state&, ::shared_ptr<cql_statement>, const query_options&, std::optional<service::group0_guard>, Args...), Args... args) {
template <typename... Args>
future<::shared_ptr<result_message>> query_processor::execute_maybe_with_guard(service::query_state& query_state, ::shared_ptr<cql_statement> statement,
const query_options& options,
future<::shared_ptr<result_message>> (query_processor::*fn)(
service::query_state&, ::shared_ptr<cql_statement>, const query_options&, std::optional<service::group0_guard>, Args...),
Args... args) {
if (!statement->needs_guard(*this, query_state)) {
return (this->*fn)(query_state, std::move(statement), options, std::nullopt, std::forward<Args>(args)...);
}
static auto exec = [fn] (query_processor& qp, Args... args, service::query_state& query_state, ::shared_ptr<cql_statement> statement, const query_options& options, std::optional<service::group0_guard> guard) {
static auto exec = [fn](query_processor& qp, Args... args, service::query_state& query_state, ::shared_ptr<cql_statement> statement,
const query_options& options, std::optional<service::group0_guard> guard) {
return (qp.*fn)(query_state, std::move(statement), options, std::move(guard), std::forward<Args>(args)...);
};
return execute_with_guard(std::bind_front(exec, std::ref(*this), std::forward<Args>(args)...), std::move(statement), query_state, options);
}
future<::shared_ptr<result_message>>
query_processor::execute_direct_without_checking_exception_message(const std::string_view& query_string, service::query_state& query_state, dialect d, query_options& options) {
future<::shared_ptr<result_message>> query_processor::execute_direct_without_checking_exception_message(
const std::string_view& query_string, service::query_state& query_state, dialect d, query_options& options) {
log.trace("execute_direct: \"{}\"", query_string);
tracing::trace(query_state.get_trace_state(), "Parsing a statement");
auto p = get_statement(query_string, query_state.get_client_state(), d);
auto statement = p->statement;
if (statement->get_bound_terms() != options.get_values_count()) {
const auto msg = format("Invalid amount of bind variables: expected {:d} received {:d}",
statement->get_bound_terms(),
options.get_values_count());
const auto msg = format("Invalid amount of bind variables: expected {:d} received {:d}", statement->get_bound_terms(), options.get_values_count());
throw exceptions::invalid_request_exception(msg);
}
options.prepare(p->bound_names);
@@ -639,17 +640,13 @@ query_processor::execute_direct_without_checking_exception_message(const std::st
metrics.regularStatementsExecuted.inc();
#endif
auto user = query_state.get_client_state().user();
tracing::trace(query_state.get_trace_state(), "Processing a statement for authenticated user: {}", user ? (user->name ? *user->name : "anonymous") : "no user authenticated");
tracing::trace(query_state.get_trace_state(), "Processing a statement for authenticated user: {}",
user ? (user->name ? *user->name : "anonymous") : "no user authenticated");
return execute_maybe_with_guard(query_state, std::move(statement), options, &query_processor::do_execute_direct, std::move(p->warnings));
}
future<::shared_ptr<result_message>>
query_processor::do_execute_direct(
service::query_state& query_state,
shared_ptr<cql_statement> statement,
const query_options& options,
std::optional<service::group0_guard> guard,
cql3::cql_warnings_vec warnings) {
future<::shared_ptr<result_message>> query_processor::do_execute_direct(service::query_state& query_state, shared_ptr<cql_statement> statement,
const query_options& options, std::optional<service::group0_guard> guard, cql3::cql_warnings_vec warnings) {
auto access_future = co_await coroutine::as_future(statement->check_access(*this, query_state.get_client_state()));
if (access_future.failed()) {
co_await audit::inspect(statement, query_state, options, true);
@@ -674,26 +671,16 @@ query_processor::do_execute_direct(
co_return std::move(m);
}
future<::shared_ptr<result_message>>
query_processor::execute_prepared_without_checking_exception_message(
service::query_state& query_state,
shared_ptr<cql_statement> statement,
const query_options& options,
statements::prepared_statement::checked_weak_ptr prepared,
cql3::prepared_cache_key_type cache_key,
bool needs_authorization) {
return execute_maybe_with_guard(query_state, std::move(statement), options, &query_processor::do_execute_prepared, std::move(prepared), std::move(cache_key), needs_authorization);
future<::shared_ptr<result_message>> query_processor::execute_prepared_without_checking_exception_message(service::query_state& query_state,
shared_ptr<cql_statement> statement, const query_options& options, statements::prepared_statement::checked_weak_ptr prepared,
cql3::prepared_cache_key_type cache_key, bool needs_authorization) {
return execute_maybe_with_guard(
query_state, std::move(statement), options, &query_processor::do_execute_prepared, std::move(prepared), std::move(cache_key), needs_authorization);
}
future<::shared_ptr<result_message>>
query_processor::do_execute_prepared(
service::query_state& query_state,
shared_ptr<cql_statement> statement,
const query_options& options,
std::optional<service::group0_guard> guard,
statements::prepared_statement::checked_weak_ptr prepared,
cql3::prepared_cache_key_type cache_key,
bool needs_authorization) {
future<::shared_ptr<result_message>> query_processor::do_execute_prepared(service::query_state& query_state, shared_ptr<cql_statement> statement,
const query_options& options, std::optional<service::group0_guard> guard, statements::prepared_statement::checked_weak_ptr prepared,
cql3::prepared_cache_key_type cache_key, bool needs_authorization) {
if (needs_authorization) {
co_await statement->check_access(*this, query_state.get_client_state());
try {
@@ -707,8 +694,8 @@ query_processor::do_execute_prepared(
co_return co_await process_authorized_statement(std::move(statement), query_state, options, std::move(guard));
}
future<::shared_ptr<result_message>>
query_processor::process_authorized_statement(const ::shared_ptr<cql_statement> statement, service::query_state& query_state, const query_options& options, std::optional<service::group0_guard> guard) {
future<::shared_ptr<result_message>> query_processor::process_authorized_statement(const ::shared_ptr<cql_statement> statement,
service::query_state& query_state, const query_options& options, std::optional<service::group0_guard> guard) {
auto& client_state = query_state.get_client_state();
++_stats.queries_by_cl[size_t(options.get_consistency())];
@@ -718,43 +705,39 @@ query_processor::process_authorized_statement(const ::shared_ptr<cql_statement>
auto msg = co_await statement->execute_without_checking_exception_message(*this, query_state, options, std::move(guard));
if (msg) {
co_return std::move(msg);
co_return std::move(msg);
}
co_return ::make_shared<result_message::void_message>();
}
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
query_processor::prepare(sstring query_string, service::query_state& query_state, cql3::dialect d) {
future<::shared_ptr<cql_transport::messages::result_message::prepared>> query_processor::prepare(
sstring query_string, service::query_state& query_state, cql3::dialect d) {
auto& client_state = query_state.get_client_state();
return prepare(std::move(query_string), client_state, d);
}
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
query_processor::prepare(sstring query_string, const service::client_state& client_state, cql3::dialect d) {
future<::shared_ptr<cql_transport::messages::result_message::prepared>> query_processor::prepare(
sstring query_string, const service::client_state& client_state, cql3::dialect d) {
try {
auto key = compute_id(query_string, client_state.get_raw_keyspace(), d);
auto prep_entry = co_await _prepared_cache.get_pinned(key, [this, &query_string, &client_state, d] {
auto prepared = get_statement(query_string, client_state, d);
prepared->calculate_metadata_id();
auto bound_terms = prepared->statement->get_bound_terms();
if (bound_terms > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(
format("Too many markers(?). {:d} markers exceed the allowed maximum of {:d}",
bound_terms,
std::numeric_limits<uint16_t>::max()));
}
throwing_assert(bound_terms == prepared->bound_names.size());
return make_ready_future<std::unique_ptr<statements::prepared_statement>>(std::move(prepared));
});
auto prepared = get_statement(query_string, client_state, d);
prepared->calculate_metadata_id();
auto bound_terms = prepared->statement->get_bound_terms();
if (bound_terms > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(
format("Too many markers(?). {:d} markers exceed the allowed maximum of {:d}", bound_terms, std::numeric_limits<uint16_t>::max()));
}
throwing_assert(bound_terms == prepared->bound_names.size());
return make_ready_future<std::unique_ptr<statements::prepared_statement>>(std::move(prepared));
});
co_await utils::get_local_injector().inject("query_processor_prepare_wait_after_cache_get", utils::wait_for_message(std::chrono::seconds(60)));
co_await utils::get_local_injector().inject(
"query_processor_prepare_wait_after_cache_get",
utils::wait_for_message(std::chrono::seconds(60)));
auto msg = ::make_shared<result_message::prepared::cql>(prepared_cache_key_type::cql_id(key), std::move(prep_entry),
client_state.is_protocol_extension_set(cql_transport::cql_protocol_extension::LWT_ADD_METADATA_MARK));
client_state.is_protocol_extension_set(cql_transport::cql_protocol_extension::LWT_ADD_METADATA_MARK));
co_return std::move(msg);
} catch(typename prepared_statements_cache::statement_is_too_big&) {
} catch (typename prepared_statements_cache::statement_is_too_big&) {
throw prepared_statement_is_too_big(query_string);
}
}
@@ -765,15 +748,11 @@ static std::string hash_target(std::string_view query_string, std::string_view k
return ret;
}
prepared_cache_key_type query_processor::compute_id(
std::string_view query_string,
std::string_view keyspace,
dialect d) {
prepared_cache_key_type query_processor::compute_id(std::string_view query_string, std::string_view keyspace, dialect d) {
return prepared_cache_key_type(md5_hasher::calculate(hash_target(query_string, keyspace)), d);
}
std::unique_ptr<prepared_statement>
query_processor::get_statement(const std::string_view& query, const service::client_state& client_state, dialect d) {
std::unique_ptr<prepared_statement> query_processor::get_statement(const std::string_view& query, const service::client_state& client_state, dialect d) {
// Measuring allocation cost requires that no yield points exist
// between bytes_before and bytes_after. It needs fixing if this
// function is ever futurized.
@@ -798,8 +777,7 @@ query_processor::get_statement(const std::string_view& query, const service::cli
return p;
}
std::unique_ptr<raw::parsed_statement>
query_processor::parse_statement(const std::string_view& query, dialect d) {
std::unique_ptr<raw::parsed_statement> query_processor::parse_statement(const std::string_view& query, dialect d) {
try {
{
const char* error_injection_key = "query_processor-parse_statement-test_failure";
@@ -824,8 +802,7 @@ query_processor::parse_statement(const std::string_view& query, dialect d) {
}
}
std::vector<std::unique_ptr<raw::parsed_statement>>
query_processor::parse_statements(std::string_view queries, dialect d) {
std::vector<std::unique_ptr<raw::parsed_statement>> query_processor::parse_statements(std::string_view queries, dialect d) {
try {
auto statements = util::do_with_parser(queries, d, std::mem_fn(&cql3_parser::CqlParser::queries));
if (statements.empty()) {
@@ -854,15 +831,10 @@ std::pair<std::reference_wrapper<struct query_processor::remote>, gate::holder>
on_internal_error(log, "attempted to perform distributed query when `query_processor::remote` is unavailable");
}
query_options query_processor::make_internal_options(
const statements::prepared_statement::checked_weak_ptr& p,
const std::vector<data_value_or_unset>& values,
db::consistency_level cl,
int32_t page_size,
service::node_local_only node_local_only) const {
query_options query_processor::make_internal_options(const statements::prepared_statement::checked_weak_ptr& p, const std::vector<data_value_or_unset>& values,
db::consistency_level cl, int32_t page_size, service::node_local_only node_local_only) const {
if (p->bound_names.size() != values.size()) {
throw std::invalid_argument(
format("Invalid number of values. Expecting {:d} but got {:d}", p->bound_names.size(), values.size()));
throw std::invalid_argument(format("Invalid number of values. Expecting {:d} but got {:d}", p->bound_names.size(), values.size()));
}
auto ni = p->bound_names.begin();
raw_value_vector_with_unset bound_values;
@@ -870,32 +842,28 @@ query_options query_processor::make_internal_options(
bound_values.unset.resize(values.size());
for (auto& var : values) {
auto& n = *ni;
std::visit(overloaded_functor {
[&] (const data_value& v) {
if (v.type() == bytes_type) {
bound_values.values.emplace_back(cql3::raw_value::make_value(value_cast<bytes>(v)));
} else if (v.is_null()) {
bound_values.values.emplace_back(cql3::raw_value::make_null());
} else {
bound_values.values.emplace_back(cql3::raw_value::make_value(n->type->decompose(v)));
}
}, [&] (const unset_value&) {
bound_values.values.emplace_back(cql3::raw_value::make_null());
bound_values.unset[std::distance(p->bound_names.begin(), ni)] = true;
}
}, var);
std::visit(overloaded_functor{[&](const data_value& v) {
if (v.type() == bytes_type) {
bound_values.values.emplace_back(cql3::raw_value::make_value(value_cast<bytes>(v)));
} else if (v.is_null()) {
bound_values.values.emplace_back(cql3::raw_value::make_null());
} else {
bound_values.values.emplace_back(cql3::raw_value::make_value(n->type->decompose(v)));
}
},
[&](const unset_value&) {
bound_values.values.emplace_back(cql3::raw_value::make_null());
bound_values.unset[std::distance(p->bound_names.begin(), ni)] = true;
}},
var);
++ni;
}
return query_options(
cl,
std::move(bound_values),
cql3::query_options::specific_options {
.page_size = page_size,
.state = {},
.serial_consistency = db::consistency_level::SERIAL,
.timestamp = api::missing_timestamp,
.node_local_only = node_local_only
});
return query_options(cl, std::move(bound_values),
cql3::query_options::specific_options{.page_size = page_size,
.state = {},
.serial_consistency = db::consistency_level::SERIAL,
.timestamp = api::missing_timestamp,
.node_local_only = node_local_only});
}
statements::prepared_statement::checked_weak_ptr query_processor::prepare_internal(const sstring& query_string) {
@@ -917,11 +885,7 @@ struct internal_query_state {
};
internal_query_state query_processor::create_paged_state(
const sstring& query_string,
db::consistency_level cl,
const data_value_list& values,
int32_t page_size,
std::optional<service::query_state> qs) {
const sstring& query_string, db::consistency_level cl, const data_value_list& values, int32_t page_size, std::optional<service::query_state> qs) {
auto p = prepare_internal(query_string);
auto opts = make_internal_options(p, values, cl, page_size);
if (!qs) {
@@ -935,8 +899,7 @@ bool query_processor::has_more_results(cql3::internal_query_state& state) const
}
future<> query_processor::for_each_cql_result(
cql3::internal_query_state& state,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set::row&)> f) {
cql3::internal_query_state& state, noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set::row&)> f) {
do {
auto msg = co_await execute_paged_internal(state);
for (auto& row : *msg) {
@@ -947,17 +910,18 @@ future<> query_processor::for_each_cql_result(
} while (has_more_results(state));
}
future<::shared_ptr<untyped_result_set>>
query_processor::execute_paged_internal(internal_query_state& state) {
future<::shared_ptr<untyped_result_set>> query_processor::execute_paged_internal(internal_query_state& state) {
state.p->statement->validate(*this, service::client_state::for_internal_calls());
::shared_ptr<cql_transport::messages::result_message> msg =
co_await state.p->statement->execute(*this, *state.qs, *state.opts, std::nullopt);
::shared_ptr<cql_transport::messages::result_message> msg = co_await state.p->statement->execute(*this, *state.qs, *state.opts, std::nullopt);
class visitor : public result_message::visitor_base {
internal_query_state& _state;
query_processor& _qp;
public:
visitor(internal_query_state& state, query_processor& qp) : _state(state), _qp(qp) {
visitor(internal_query_state& state, query_processor& qp)
: _state(state)
, _qp(qp) {
}
virtual ~visitor() = default;
void visit(const result_message::rows& rmrs) override {
@@ -986,23 +950,14 @@ query_processor::execute_paged_internal(internal_query_state& state) {
co_return ::make_shared<untyped_result_set>(msg);
}
future<::shared_ptr<untyped_result_set>>
query_processor::execute_internal(
const sstring& query_string,
db::consistency_level cl,
const data_value_list& values,
cache_internal cache) {
future<::shared_ptr<untyped_result_set>> query_processor::execute_internal(
const sstring& query_string, db::consistency_level cl, const data_value_list& values, cache_internal cache) {
auto qs = query_state_for_internal_call();
co_return co_await execute_internal(query_string, cl, qs, values, cache);
}
future<::shared_ptr<untyped_result_set>>
query_processor::execute_internal(
const sstring& query_string,
db::consistency_level cl,
service::query_state& query_state,
const data_value_list& values,
cache_internal cache) {
future<::shared_ptr<untyped_result_set>> query_processor::execute_internal(
const sstring& query_string, db::consistency_level cl, service::query_state& query_state, const data_value_list& values, cache_internal cache) {
if (log.is_enabled(logging::log_level::trace)) {
log.trace("execute_internal: {}\"{}\" ({})", cache ? "(cached) " : "", query_string, fmt::join(values, ", "));
@@ -1020,10 +975,7 @@ query_processor::execute_internal(
}
future<utils::chunked_vector<mutation>> query_processor::get_mutations_internal(
const sstring query_string,
service::query_state& query_state,
api::timestamp_type timestamp,
std::vector<data_value_or_unset> values) {
const sstring query_string, service::query_state& query_state, api::timestamp_type timestamp, std::vector<data_value_or_unset> values) {
log.debug("get_mutations_internal: \"{}\" ({})", query_string, fmt::join(values, ", "));
auto stmt = prepare_internal(query_string);
auto mod_stmt = dynamic_pointer_cast<cql3::statements::modification_statement>(stmt->statement);
@@ -1041,12 +993,8 @@ future<utils::chunked_vector<mutation>> query_processor::get_mutations_internal(
co_return co_await mod_stmt->get_mutations(*this, opts, timeout, true, timestamp, query_state, json_cache, std::move(keys));
}
future<::shared_ptr<untyped_result_set>>
query_processor::execute_with_params(
statements::prepared_statement::checked_weak_ptr p,
db::consistency_level cl,
service::query_state& query_state,
const data_value_list& values) {
future<::shared_ptr<untyped_result_set>> query_processor::execute_with_params(
statements::prepared_statement::checked_weak_ptr p, db::consistency_level cl, service::query_state& query_state, const data_value_list& values) {
auto opts = make_internal_options(p, values, cl);
auto statement = p->statement;
@@ -1054,30 +1002,24 @@ query_processor::execute_with_params(
co_return ::make_shared<untyped_result_set>(msg);
}
future<::shared_ptr<result_message>>
query_processor::do_execute_with_params(
service::query_state& query_state,
shared_ptr<cql_statement> statement,
const query_options& options, std::optional<service::group0_guard> guard) {
future<::shared_ptr<result_message>> query_processor::do_execute_with_params(
service::query_state& query_state, shared_ptr<cql_statement> statement, const query_options& options, std::optional<service::group0_guard> guard) {
statement->validate(*this, service::client_state::for_internal_calls());
co_return co_await coroutine::try_future(statement->execute(*this, query_state, options, std::move(guard)));
}
future<::shared_ptr<cql_transport::messages::result_message>>
query_processor::execute_batch_without_checking_exception_message(
::shared_ptr<statements::batch_statement> batch,
service::query_state& query_state,
query_options& options,
future<::shared_ptr<cql_transport::messages::result_message>> query_processor::execute_batch_without_checking_exception_message(
::shared_ptr<statements::batch_statement> batch, service::query_state& query_state, query_options& options,
std::unordered_map<prepared_cache_key_type, authorized_prepared_statements_cache::value_type> pending_authorization_entries) {
auto access_future = co_await coroutine::as_future(batch->check_access(*this, query_state.get_client_state()));
co_await coroutine::parallel_for_each(pending_authorization_entries, [this, &query_state] (auto& e) -> future<> {
try {
co_await _authorized_prepared_cache.insert(*query_state.get_client_state().user(), e.first, std::move(e.second));
} catch (...) {
log.error("failed to cache the entry: {}", std::current_exception());
}
});
co_await coroutine::parallel_for_each(pending_authorization_entries, [this, &query_state](auto& e) -> future<> {
try {
co_await _authorized_prepared_cache.insert(*query_state.get_client_state().user(), e.first, std::move(e.second));
} catch (...) {
log.error("failed to cache the entry: {}", std::current_exception());
}
});
bool failed = access_future.failed();
co_await audit::inspect(batch, query_state, options, failed);
if (access_future.failed()) {
@@ -1086,30 +1028,28 @@ query_processor::execute_batch_without_checking_exception_message(
batch->validate();
batch->validate(*this, query_state.get_client_state());
_stats.queries_by_cl[size_t(options.get_consistency())] += batch->get_statements().size();
if (log.is_enabled(logging::log_level::trace)) {
if (log.is_enabled(logging::log_level::trace)) {
std::ostringstream oss;
for (const auto& s: batch->get_statements()) {
oss << std::endl << s.statement->raw_cql_statement;
for (const auto& s : batch->get_statements()) {
oss << std::endl << s.statement->raw_cql_statement;
}
log.trace("execute_batch({}): {}", batch->get_statements().size(), oss.str());
}
co_return co_await batch->execute(*this, query_state, options, std::nullopt);
}
future<service::broadcast_tables::query_result>
query_processor::execute_broadcast_table_query(const service::broadcast_tables::query& query) {
future<service::broadcast_tables::query_result> query_processor::execute_broadcast_table_query(const service::broadcast_tables::query& query) {
auto [remote_, holder] = remote();
co_return co_await service::broadcast_tables::execute(remote_.get().group0_client, query);
}
future<query::mapreduce_result>
query_processor::mapreduce(query::mapreduce_request req, tracing::trace_state_ptr tr_state) {
future<query::mapreduce_result> query_processor::mapreduce(query::mapreduce_request req, tracing::trace_state_ptr tr_state) {
auto [remote_, holder] = remote();
co_return co_await remote_.get().mapreducer.dispatch(std::move(req), std::move(tr_state));
}
future<::shared_ptr<messages::result_message>>
query_processor::execute_schema_statement(const statements::schema_altering_statement& stmt, service::query_state& state, const query_options& options, service::group0_batch& mc) {
future<::shared_ptr<messages::result_message>> query_processor::execute_schema_statement(
const statements::schema_altering_statement& stmt, service::query_state& state, const query_options& options, service::group0_batch& mc) {
if (this_shard_id() != 0) {
on_internal_error(log, "DDL must be executed on shard 0");
}
@@ -1163,7 +1103,8 @@ future<> query_processor::announce_schema_statement(const statements::schema_alt
co_await remote_.get().mm.announce(std::move(m), std::move(guard), description);
}
query_processor::migration_subscriber::migration_subscriber(query_processor* qp) : _qp{qp} {
query_processor::migration_subscriber::migration_subscriber(query_processor* qp)
: _qp{qp} {
}
void query_processor::migration_subscriber::on_create_keyspace(const sstring& ks_name) {
@@ -1189,10 +1130,7 @@ void query_processor::migration_subscriber::on_create_view(const sstring& ks_nam
void query_processor::migration_subscriber::on_update_keyspace(const sstring& ks_name) {
}
void query_processor::migration_subscriber::on_update_column_family(
const sstring& ks_name,
const sstring& cf_name,
bool columns_changed) {
void query_processor::migration_subscriber::on_update_column_family(const sstring& ks_name, const sstring& cf_name, bool columns_changed) {
// #1255: Ignoring columns_changed deliberately.
log.info("Column definitions for {}.{} changed, invalidating related prepared statements", ks_name, cf_name);
remove_invalid_prepared_statements(ks_name, cf_name);
@@ -1207,9 +1145,7 @@ void query_processor::migration_subscriber::on_update_function(const sstring& ks
void query_processor::migration_subscriber::on_update_aggregate(const sstring& ks_name, const sstring& aggregate_name) {
}
void query_processor::migration_subscriber::on_update_view(
const sstring& ks_name,
const sstring& view_name, bool columns_changed) {
void query_processor::migration_subscriber::on_update_view(const sstring& ks_name, const sstring& view_name, bool columns_changed) {
// scylladb/scylladb#16392 - Materialized views are also tables so we need at least handle
// them as such when changed.
on_update_column_family(ks_name, view_name, columns_changed);
@@ -1238,39 +1174,28 @@ void query_processor::migration_subscriber::on_drop_view(const sstring& ks_name,
remove_invalid_prepared_statements(ks_name, view_name);
}
void query_processor::migration_subscriber::remove_invalid_prepared_statements(
sstring ks_name,
std::optional<sstring> cf_name) {
_qp->_prepared_cache.remove_if([&] (::shared_ptr<cql_statement> stmt) {
void query_processor::migration_subscriber::remove_invalid_prepared_statements(sstring ks_name, std::optional<sstring> cf_name) {
_qp->_prepared_cache.remove_if([&](::shared_ptr<cql_statement> stmt) {
return this->should_invalidate(ks_name, cf_name, stmt);
});
}
bool query_processor::migration_subscriber::should_invalidate(
sstring ks_name,
std::optional<sstring> cf_name,
::shared_ptr<cql_statement> statement) {
bool query_processor::migration_subscriber::should_invalidate(sstring ks_name, std::optional<sstring> cf_name, ::shared_ptr<cql_statement> statement) {
return statement->depends_on(ks_name, cf_name);
}
future<> query_processor::query_internal(
const sstring& query_string,
db::consistency_level cl,
const data_value_list& values,
int32_t page_size,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f,
std::optional<service::query_state> qs) {
future<> query_processor::query_internal(const sstring& query_string, db::consistency_level cl, const data_value_list& values, int32_t page_size,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f, std::optional<service::query_state> qs) {
auto query_state = create_paged_state(query_string, cl, values, page_size, std::move(qs));
co_return co_await for_each_cql_result(query_state, std::move(f));
}
future<> query_processor::query_internal(
const sstring& query_string,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f) {
future<> query_processor::query_internal(const sstring& query_string, noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f) {
return query_internal(query_string, db::consistency_level::ONE, {}, 1000, std::move(f));
}
shared_ptr<cql_transport::messages::result_message> query_processor::bounce_to_shard(unsigned shard, cql3::computed_function_values cached_fn_calls, bool track) {
shared_ptr<cql_transport::messages::result_message> query_processor::bounce_to_shard(
unsigned shard, cql3::computed_function_values cached_fn_calls, bool track) {
if (track) {
_proxy.get_stats().replica_cross_shard_ops++;
}
@@ -1278,7 +1203,8 @@ shared_ptr<cql_transport::messages::result_message> query_processor::bounce_to_s
return ::make_shared<cql_transport::messages::result_message::bounce>(my_host_id, shard, std::move(cached_fn_calls));
}
shared_ptr<cql_transport::messages::result_message> query_processor::bounce_to_node(locator::tablet_replica replica, cql3::computed_function_values cached_fn_calls, seastar::lowres_clock::time_point timeout, bool is_write) {
shared_ptr<cql_transport::messages::result_message> query_processor::bounce_to_node(
locator::tablet_replica replica, cql3::computed_function_values cached_fn_calls, seastar::lowres_clock::time_point timeout, bool is_write) {
get_cql_stats().forwarded_requests++;
return ::make_shared<cql_transport::messages::result_message::bounce>(replica.host, replica.shard, std::move(cached_fn_calls), timeout, is_write);
}
@@ -1295,7 +1221,7 @@ void query_processor::update_authorized_prepared_cache_config() {
utils::loading_cache_config cfg;
cfg.max_size = _mcfg.authorized_prepared_cache_size;
cfg.expiry = std::min(std::chrono::milliseconds(_db.get_config().permissions_validity_in_ms()),
std::chrono::duration_cast<std::chrono::milliseconds>(prepared_statements_cache::entry_expiry));
std::chrono::duration_cast<std::chrono::milliseconds>(prepared_statements_cache::entry_expiry));
cfg.refresh = std::chrono::milliseconds(_db.get_config().permissions_update_interval_in_ms());
if (!_authorized_prepared_cache.update_config(std::move(cfg))) {
@@ -1307,4 +1233,4 @@ void query_processor::reset_cache() {
_authorized_prepared_cache.reset();
}
}
} // namespace cql3

View File

@@ -201,10 +201,6 @@ public:
return _clustering_columns_restrictions;
}
const expr::expression& get_nonprimary_key_restrictions() const {
return _nonprimary_key_restrictions;
}
// Get a set of columns restricted by the IS NOT NULL restriction.
// IS NOT NULL is a special case that is handled separately from other restrictions.
const std::unordered_set<const column_definition*> get_not_null_columns() const;

View File

@@ -8,7 +8,6 @@
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#include <boost/algorithm/string.hpp>
#include <seastar/core/coroutine.hh>
#include "create_index_statement.hh"
#include "db/config.hh"
@@ -36,10 +35,8 @@
#include "db/schema_tables.hh"
#include "index/secondary_index_manager.hh"
#include "types/concrete_types.hh"
#include "types/vector.hh"
#include "db/tags/extension.hh"
#include "tombstone_gc_extension.hh"
#include "index/secondary_index.hh"
#include <stdexcept>
@@ -119,58 +116,6 @@ static data_type type_for_computed_column(cql3::statements::index_target::target
}
}
// Cassandra SAI compatibility: detect the StorageAttachedIndex class name
// used by Cassandra to create vector and metadata indexes.
static bool is_sai_class_name(const sstring& class_name) {
return class_name == "org.apache.cassandra.index.sai.StorageAttachedIndex"
|| boost::iequals(class_name, "storageattachedindex")
|| boost::iequals(class_name, "sai");
}
// Returns true if the custom class name refers to a vector-capable index
// (either ScyllaDB's native vector_index or Cassandra's SAI).
static bool is_vector_capable_class(const sstring& class_name) {
return class_name == "vector_index" || is_sai_class_name(class_name);
}
// When the custom class is SAI, verify that at least one target is a
// vector column and rewrite the class to ScyllaDB's native "vector_index".
// Non-vector single-column targets and multi-column (local-index partition
// key) targets are skipped — they are treated as filtering columns by
// vector_index::check_target().
static void maybe_rewrite_sai_to_vector_index(
const schema& schema,
const std::vector<::shared_ptr<index_target>>& targets,
index_specific_prop_defs& props) {
if (!props.custom_class || !is_sai_class_name(*props.custom_class)) {
return;
}
for (const auto& target : targets) {
auto* ident = std::get_if<::shared_ptr<column_identifier>>(&target->value);
if (!ident) {
// Multi-column target (local-index partition key) — skip.
continue;
}
auto cd = schema.get_column_definition((*ident)->name());
if (!cd) {
// Nonexistent column — skip; vector_index::validate() will catch it.
continue;
}
if (dynamic_cast<const vector_type_impl*>(cd->type.get())) {
props.custom_class = "vector_index";
return;
}
}
throw exceptions::invalid_request_exception(
"StorageAttachedIndex (SAI) is only supported on vector columns; "
"use a secondary index for non-vector columns");
}
static bool is_vector_index(const index_options_map& options) {
auto class_it = options.find(db::index::secondary_index::custom_class_option_name);
return class_it != options.end() && is_vector_capable_class(class_it->second);
}
view_ptr create_index_statement::create_view_for_index(const schema_ptr schema, const index_metadata& im,
const data_dictionary::database& db) const
{
@@ -320,8 +265,8 @@ create_index_statement::validate(query_processor& qp, const service::client_stat
_idx_properties->validate();
const bool is_vector_index = _idx_properties->custom_class && is_vector_capable_class(*_idx_properties->custom_class);
// FIXME: This is ugly and can be improved.
const bool is_vector_index = _idx_properties->custom_class && *_idx_properties->custom_class == "vector_index";
const bool uses_view_properties = _view_properties.properties()->count() > 0
|| _view_properties.use_compact_storage()
|| _view_properties.defined_ordering().size() > 0;
@@ -407,8 +352,6 @@ create_index_statement::validate_while_executing(data_dictionary::database db, l
targets.emplace_back(raw_target->prepare(*schema));
}
maybe_rewrite_sai_to_vector_index(*schema, targets, *_idx_properties);
if (_idx_properties && _idx_properties->custom_class) {
auto custom_index_factory = secondary_index::secondary_index_manager::get_custom_class_factory(*_idx_properties->custom_class);
if (!custom_index_factory) {
@@ -754,9 +697,7 @@ index_metadata create_index_statement::make_index_metadata(const std::vector<::s
const index_options_map& options)
{
index_options_map new_options = options;
auto target_option = is_vector_index(options)
? secondary_index::vector_index::serialize_targets(targets)
: secondary_index::target_parser::serialize_targets(targets);
auto target_option = secondary_index::target_parser::serialize_targets(targets);
new_options.emplace(index_target::target_option_name, target_option);
const auto& first_target = targets.front()->value;

View File

@@ -52,7 +52,6 @@ future<shared_ptr<result_message>> modification_statement::execute_without_check
}
auto [coordinator, holder] = qp.acquire_strongly_consistent_coordinator();
const auto mutate_result = co_await coordinator.get().mutate(_statement->s,
keys[0].start()->value().token(),
[&](api::timestamp_type ts) {
@@ -66,7 +65,7 @@ future<shared_ptr<result_message>> modification_statement::execute_without_check
raw_cql_statement, muts.size()));
}
return std::move(*muts.begin());
}, timeout, qs.get_client_state().get_abort_source());
});
using namespace service::strong_consistency;
if (const auto* redirect = get_if<need_redirect>(&mutate_result)) {

View File

@@ -42,7 +42,7 @@ future<::shared_ptr<result_message>> select_statement::do_execute(query_processo
const auto timeout = db::timeout_clock::now() + get_timeout(state.get_client_state(), options);
auto [coordinator, holder] = qp.acquire_strongly_consistent_coordinator();
auto query_result = co_await coordinator.get().query(_query_schema, *read_command,
key_ranges, state.get_trace_state(), timeout, state.get_client_state().get_abort_source());
key_ranges, state.get_trace_state(), timeout);
using namespace service::strong_consistency;
if (const auto* redirect = get_if<need_redirect>(&query_result)) {
@@ -54,4 +54,4 @@ future<::shared_ptr<result_message>> select_statement::do_execute(query_processo
read_command, options, now);
}
}
}

View File

@@ -250,8 +250,8 @@ void keyspace_metadata::validate(const gms::feature_service& fs, const locator::
if (params.consistency && !fs.strongly_consistent_tables) {
throw exceptions::configuration_exception("The strongly_consistent_tables feature must be enabled to use a consistency option");
}
if (params.consistency && *params.consistency == data_dictionary::consistency_config_option::local) {
throw exceptions::configuration_exception("Local consistency is not supported yet");
if (params.consistency && *params.consistency == data_dictionary::consistency_config_option::global) {
throw exceptions::configuration_exception("Global consistency is not supported yet");
}
}

View File

@@ -1582,7 +1582,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"\tnone : No auditing enabled.\n"
"\tsyslog : Audit messages sent to Syslog.\n"
"\ttable : Audit messages written to column family named audit.audit_log.\n")
, audit_categories(this, "audit_categories", liveness::LiveUpdate, value_status::Used, "DCL,DDL,AUTH,ADMIN", "Comma separated list of operation categories that should be audited.")
, audit_categories(this, "audit_categories", liveness::LiveUpdate, value_status::Used, "DCL,AUTH,ADMIN", "Comma separated list of operation categories that should be audited.")
, audit_tables(this, "audit_tables", liveness::LiveUpdate, value_status::Used, "", "Comma separated list of table names (<keyspace>.<table>) that will be audited.")
, audit_keyspaces(this, "audit_keyspaces", liveness::LiveUpdate, value_status::Used, "", "Comma separated list of keyspaces that will be audited. All tables in those keyspaces will be audited")
, audit_unix_socket_path(this, "audit_unix_socket_path", value_status::Used, "/dev/log", "The path to the unix socket used for writing to syslog. Only applicable when audit is set to syslog.")

View File

@@ -22,7 +22,7 @@ corrupt_data_handler::corrupt_data_handler(register_metrics rm) {
_metrics.add_group("corrupt_data", {
sm::make_counter("entries_reported", _stats.corrupt_data_reported,
sm::description("Counts the number of corrupt data instances reported to the corrupt data handler. "
"A non-zero value indicates that the database suffered data corruption.")).set_skip_when_empty()
"A non-zero value indicates that the database suffered data corruption."))
});
}
}

View File

@@ -50,7 +50,9 @@ future<> hint_endpoint_manager::do_store_hint(schema_ptr s, lw_shared_ptr<const
size_t mut_size = fm->representation().size();
shard_stats().size_of_hints_in_progress += mut_size;
co_await utils::get_local_injector().inject("slow_down_writing_hints", std::chrono::seconds(10));
if (utils::get_local_injector().enter("slow_down_writing_hints")) {
co_await seastar::sleep(std::chrono::seconds(10));
}
try {
const auto shared_lock = co_await get_shared_lock(file_update_mutex());

View File

@@ -186,7 +186,7 @@ void manager::register_metrics(const sstring& group_name) {
sm::description("Number of unexpected errors during sending, sending will be retried later")),
sm::make_counter("corrupted_files", _stats.corrupted_files,
sm::description("Number of hints files that were discarded during sending because the file was corrupted.")).set_skip_when_empty(),
sm::description("Number of hints files that were discarded during sending because the file was corrupted.")),
sm::make_gauge("pending_drains",
sm::description("Number of tasks waiting in the queue for draining hints"),

View File

@@ -206,7 +206,7 @@ void rate_limiter_base::register_metrics() {
sm::description("Number of times a lookup returned an already allocated entry.")),
sm::make_counter("failed_allocations", _metrics.failed_allocations,
sm::description("Number of times the rate limiter gave up trying to allocate.")).set_skip_when_empty(),
sm::description("Number of times the rate limiter gave up trying to allocate.")),
sm::make_counter("probe_count", _metrics.probe_count,
sm::description("Number of probes made during lookups.")),

View File

@@ -174,7 +174,7 @@ cache_tracker::setup_metrics() {
sm::make_counter("sstable_reader_recreations", sm::description("number of times sstable reader was recreated due to memtable flush"), _stats.underlying_recreations),
sm::make_counter("sstable_partition_skips", sm::description("number of times sstable reader was fast forwarded across partitions"), _stats.underlying_partition_skips),
sm::make_counter("sstable_row_skips", sm::description("number of times sstable reader was fast forwarded within a partition"), _stats.underlying_row_skips),
sm::make_counter("pinned_dirty_memory_overload", sm::description("amount of pinned bytes that we tried to unpin over the limit. This should sit constantly at 0, and any number different than 0 is indicative of a bug"), _stats.pinned_dirty_memory_overload).set_skip_when_empty(),
sm::make_counter("pinned_dirty_memory_overload", sm::description("amount of pinned bytes that we tried to unpin over the limit. This should sit constantly at 0, and any number different than 0 is indicative of a bug"), _stats.pinned_dirty_memory_overload),
sm::make_counter("rows_processed_from_memtable", _stats.rows_processed_from_memtable,
sm::description("total number of rows in memtables which were processed during cache update on memtable flush")),
sm::make_counter("rows_dropped_from_memtable", _stats.rows_dropped_from_memtable,

View File

@@ -63,15 +63,14 @@ namespace db {
namespace schema_tables {
static constexpr std::initializer_list<table_kind> all_table_kinds = {
table_kind::table,
table_kind::view
};
static constexpr std::initializer_list<table_kind> all_table_kinds = {table_kind::table, table_kind::view};
static schema_ptr get_table_holder(table_kind k) {
switch (k) {
case table_kind::table: return tables();
case table_kind::view: return views();
case table_kind::table:
return tables();
case table_kind::view:
return views();
}
abort();
}
@@ -94,15 +93,18 @@ void table_selector::add(sstring name) {
}
}
}
} // namespace schema_tables
}
} // namespace db
template <> struct fmt::formatter<db::schema_tables::table_kind> {
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
template <>
struct fmt::formatter<db::schema_tables::table_kind> {
constexpr auto parse(format_parse_context& ctx) {
return ctx.begin();
}
auto format(db::schema_tables::table_kind k, fmt::format_context& ctx) const {
switch (k) {
using enum db::schema_tables::table_kind;
using enum db::schema_tables::table_kind;
case table:
return fmt::format_to(ctx.out(), "table");
case view:
@@ -125,11 +127,8 @@ static std::optional<table_id> table_id_from_mutations(const schema_mutations& s
return table_id(table_row.get_nonnull<utils::UUID>("id"));
}
static
future<std::map<table_id, schema_mutations>>
read_tables_for_keyspaces(sharded<service::storage_proxy>& proxy, const std::set<sstring>& keyspace_names, table_kind kind,
const std::unordered_map<sstring, table_selector>& tables_per_keyspace)
{
static future<std::map<table_id, schema_mutations>> read_tables_for_keyspaces(sharded<service::storage_proxy>& proxy, const std::set<sstring>& keyspace_names,
table_kind kind, const std::unordered_map<sstring, table_selector>& tables_per_keyspace) {
std::map<table_id, schema_mutations> result;
for (auto&& [keyspace_name, sel] : tables_per_keyspace) {
if (!sel.tables.contains(kind)) {
@@ -149,32 +148,30 @@ read_tables_for_keyspaces(sharded<service::storage_proxy>& proxy, const std::set
// Extracts the names of tables affected by a schema mutation.
// The mutation must target one of the tables in schema_tables_holding_schema_mutations().
static
table_selector get_affected_tables(const sstring& keyspace_name, const mutation& m) {
static table_selector get_affected_tables(const sstring& keyspace_name, const mutation& m) {
const schema& s = *m.schema();
auto get_table_name = [&] (const clustering_key& ck) {
auto get_table_name = [&](const clustering_key& ck) {
// The first component of the clustering key in each table listed in
// schema_tables_holding_schema_mutations contains the table name.
return value_cast<sstring>(utf8_type->deserialize(ck.get_component(s, 0)));
};
table_selector result;
if (m.partition().partition_tombstone()) {
slogger.trace("Mutation of {}.{} for keyspace {} contains a partition tombstone",
m.schema()->ks_name(), m.schema()->cf_name(), keyspace_name);
slogger.trace("Mutation of {}.{} for keyspace {} contains a partition tombstone", m.schema()->ks_name(), m.schema()->cf_name(), keyspace_name);
result.all_in_keyspace = true;
}
for (auto&& e : m.partition().row_tombstones()) {
const range_tombstone& rt = e.tombstone();
if (rt.start.size(s) == 0 || rt.end.size(s) == 0) {
slogger.trace("Mutation of {}.{} for keyspace {} contains a multi-table range tombstone",
m.schema()->ks_name(), m.schema()->cf_name(), keyspace_name);
slogger.trace(
"Mutation of {}.{} for keyspace {} contains a multi-table range tombstone", m.schema()->ks_name(), m.schema()->cf_name(), keyspace_name);
result.all_in_keyspace = true;
break;
}
auto table_name = get_table_name(rt.start);
if (table_name != get_table_name(rt.end)) {
slogger.trace("Mutation of {}.{} for keyspace {} contains a multi-table range tombstone",
m.schema()->ks_name(), m.schema()->cf_name(), keyspace_name);
slogger.trace(
"Mutation of {}.{} for keyspace {} contains a multi-table range tombstone", m.schema()->ks_name(), m.schema()->cf_name(), keyspace_name);
result.all_in_keyspace = true;
break;
}
@@ -183,16 +180,17 @@ table_selector get_affected_tables(const sstring& keyspace_name, const mutation&
for (auto&& row : m.partition().clustered_rows()) {
result.add(get_table_name(row.key()));
}
slogger.trace("Mutation of {}.{} for keyspace {} affects tables: {}, all_in_keyspace: {}",
m.schema()->ks_name(), m.schema()->cf_name(), keyspace_name, result.tables, result.all_in_keyspace);
slogger.trace("Mutation of {}.{} for keyspace {} affects tables: {}, all_in_keyspace: {}", m.schema()->ks_name(), m.schema()->cf_name(), keyspace_name,
result.tables, result.all_in_keyspace);
return result;
}
future<schema_result>
static read_schema_for_keyspaces(sharded<service::storage_proxy>& proxy, const sstring& schema_table_name, const std::set<sstring>& keyspace_names)
{
auto map = [&proxy, schema_table_name] (const sstring& keyspace_name) { return read_schema_partition_for_keyspace(proxy, schema_table_name, keyspace_name); };
auto insert = [] (schema_result&& result, auto&& schema_entity) {
future<schema_result> static read_schema_for_keyspaces(
sharded<service::storage_proxy>& proxy, const sstring& schema_table_name, const std::set<sstring>& keyspace_names) {
auto map = [&proxy, schema_table_name](const sstring& keyspace_name) {
return read_schema_partition_for_keyspace(proxy, schema_table_name, keyspace_name);
};
auto insert = [](schema_result&& result, auto&& schema_entity) {
if (!schema_entity.second->empty()) {
result.insert(std::move(schema_entity));
}
@@ -202,11 +200,11 @@ static read_schema_for_keyspaces(sharded<service::storage_proxy>& proxy, const s
}
// Returns names of live table definitions of given keyspace
future<std::vector<sstring>>
static read_table_names_of_keyspace(sharded<service::storage_proxy>& proxy, const sstring& keyspace_name, schema_ptr schema_table) {
future<std::vector<sstring>> static read_table_names_of_keyspace(
sharded<service::storage_proxy>& proxy, const sstring& keyspace_name, schema_ptr schema_table) {
auto pkey = dht::decorate_key(*schema_table, partition_key::from_singular(*schema_table, keyspace_name));
auto&& rs = co_await db::system_keyspace::query(proxy.local().get_db(), schema_table->ks_name(), schema_table->cf_name(), pkey);
co_return rs->rows() | std::views::transform([schema_table] (const query::result_set_row& row) {
co_return rs->rows() | std::views::transform([schema_table](const query::result_set_row& row) {
const sstring name = schema_table->clustering_key_columns().begin()->name_as_text();
return row.get_nonnull<sstring>(name);
}) | std::ranges::to<std::vector>();
@@ -242,8 +240,7 @@ static void maybe_delete_schema_version(mutation& m) {
}
}
future<> schema_applier::merge_keyspaces()
{
future<> schema_applier::merge_keyspaces() {
/*
* - we don't care about entriesOnlyOnLeft() or entriesInCommon(), because only the changes are of interest to us
* - of all entriesOnlyOnRight(), we only care about ones that have live columns; it's possible to have a ColumnFamily
@@ -280,21 +277,16 @@ future<> schema_applier::merge_keyspaces()
for (auto& name : created) {
slogger.info("Creating keyspace {}", name);
auto sk_after_v = _after.scylla_keyspaces.contains(name) ? _after.scylla_keyspaces.at(name) : nullptr;
auto ksm = co_await create_keyspace_metadata(
schema_result_value_type{name, _after.keyspaces.at(name)}, sk_after_v);
auto ksm = co_await create_keyspace_metadata(schema_result_value_type{name, _after.keyspaces.at(name)}, sk_after_v);
_affected_keyspaces.created.push_back(
co_await replica::database::prepare_create_keyspace_on_all_shards(
sharded_db, _proxy, *ksm, _pending_token_metadata));
co_await replica::database::prepare_create_keyspace_on_all_shards(sharded_db, _proxy, *ksm, _pending_token_metadata));
_affected_keyspaces.names.created.insert(name);
}
for (auto& name : altered) {
slogger.info("Altering keyspace {}", name);
auto sk_after_v = _after.scylla_keyspaces.contains(name) ? _after.scylla_keyspaces.at(name) : nullptr;
auto tmp_ksm = co_await create_keyspace_metadata(
schema_result_value_type{name, _after.keyspaces.at(name)}, sk_after_v);
_affected_keyspaces.altered.push_back(
co_await replica::database::prepare_update_keyspace_on_all_shards(
sharded_db, *tmp_ksm, _pending_token_metadata));
auto tmp_ksm = co_await create_keyspace_metadata(schema_result_value_type{name, _after.keyspaces.at(name)}, sk_after_v);
_affected_keyspaces.altered.push_back(co_await replica::database::prepare_update_keyspace_on_all_shards(sharded_db, *tmp_ksm, _pending_token_metadata));
_affected_keyspaces.names.altered.insert(name);
}
for (auto& key : _affected_keyspaces.names.dropped) {
@@ -327,7 +319,7 @@ static std::vector<column_definition> get_primary_key_definition(const schema_pt
static std::vector<bytes> get_primary_key(const std::vector<column_definition>& primary_key, const query::result_set_row* row) {
std::vector<bytes> key;
for (const auto& column : primary_key) {
const data_value *val = row->get_data_value(column.name_as_text());
const data_value* val = row->get_data_value(column.name_as_text());
key.push_back(val->serialize_nonnull());
}
return key;
@@ -338,7 +330,7 @@ static std::map<std::vector<bytes>, const query::result_set_row*> build_row_map(
const std::vector<query::result_set_row>& rows = result.rows();
auto primary_key = get_primary_key_definition(result.schema());
std::map<std::vector<bytes>, const query::result_set_row*> ret;
for (const auto& row: rows) {
for (const auto& row : rows) {
auto key = get_primary_key(primary_key, &row);
ret.insert(std::pair(std::move(key), &row));
}
@@ -391,8 +383,8 @@ struct aggregate_diff {
std::vector<std::pair<const query::result_set_row*, const query::result_set_row*>> dropped;
};
static aggregate_diff diff_aggregates_rows(const schema_result& aggr_before, const schema_result& aggr_after,
const schema_result& scylla_aggr_before, const schema_result& scylla_aggr_after) {
static aggregate_diff diff_aggregates_rows(
const schema_result& aggr_before, const schema_result& aggr_after, const schema_result& scylla_aggr_before, const schema_result& scylla_aggr_after) {
using map = std::map<std::vector<bytes>, const query::result_set_row*>;
auto aggr_diff = difference(aggr_before, aggr_after, indirect_equal_to<lw_shared_ptr<query::result_set>>());
@@ -436,15 +428,11 @@ static aggregate_diff diff_aggregates_rows(const schema_result& aggr_before, con
for (const auto& k : diff.entries_only_on_left) {
auto entry = scylla_aggr_rows_before.find(k);
dropped.push_back({
aggr_before_rows.find(k)->second, (entry != scylla_aggr_rows_before.end()) ? entry->second : nullptr
});
dropped.push_back({aggr_before_rows.find(k)->second, (entry != scylla_aggr_rows_before.end()) ? entry->second : nullptr});
}
for (const auto& k : diff.entries_only_on_right) {
auto entry = scylla_aggr_rows_after.find(k);
created.push_back({
aggr_after_rows.find(k)->second, (entry != scylla_aggr_rows_after.end()) ? entry->second : nullptr
});
created.push_back({aggr_after_rows.find(k)->second, (entry != scylla_aggr_rows_after.end()) ? entry->second : nullptr});
}
}
@@ -452,11 +440,10 @@ static aggregate_diff diff_aggregates_rows(const schema_result& aggr_before, con
}
// see the comments for merge_keyspaces()
future<> schema_applier::merge_types()
{
future<> schema_applier::merge_types() {
auto diff = diff_rows(_before.types, _after.types);
co_await _affected_user_types.start();
co_await _affected_user_types.invoke_on_all([&] (affected_user_types_per_shard& af) mutable -> future<> {
co_await _affected_user_types.invoke_on_all([&](affected_user_types_per_shard& af) mutable -> future<> {
auto& db = _proxy.local().get_db().local();
std::map<sstring, std::reference_wrapper<replica::keyspace>> new_keyspaces_per_shard;
@@ -478,16 +465,12 @@ future<> schema_applier::merge_types()
// version of view to "before" version of base table and "after" to "after"
// respectively.
enum class schema_diff_side {
left, // old, before
left, // old, before
right, // new, after
};
static schema_diff_per_shard diff_table_or_view(sharded<service::storage_proxy>& proxy,
const std::map<table_id, schema_mutations>& before,
const std::map<table_id, schema_mutations>& after,
bool reload,
noncopyable_function<schema_ptr (schema_mutations sm, schema_diff_side)> create_schema)
{
static schema_diff_per_shard diff_table_or_view(sharded<service::storage_proxy>& proxy, const std::map<table_id, schema_mutations>& before,
const std::map<table_id, schema_mutations>& after, bool reload, noncopyable_function<schema_ptr(schema_mutations sm, schema_diff_side)> create_schema) {
schema_diff_per_shard d;
auto diff = difference(before, after);
for (auto&& key : diff.entries_only_on_left) {
@@ -507,10 +490,10 @@ static schema_diff_per_shard diff_table_or_view(sharded<service::storage_proxy>&
d.altered.emplace_back(schema_diff_per_shard::altered_schema{s_before, s});
}
if (reload) {
for (auto&& key: diff.entries_in_common) {
for (auto&& key : diff.entries_in_common) {
auto s = create_schema(std::move(after.at(key)), schema_diff_side::right);
slogger.info("Reloading {}.{} id={} version={}", s->ks_name(), s->cf_name(), s->id(), s->version());
d.altered.emplace_back(schema_diff_per_shard::altered_schema {s, s});
d.altered.emplace_back(schema_diff_per_shard::altered_schema{s, s});
}
}
return d;
@@ -524,7 +507,9 @@ static schema_diff_per_shard diff_table_or_view(sharded<service::storage_proxy>&
constexpr size_t max_concurrent = 8;
in_progress_types_storage_per_shard::in_progress_types_storage_per_shard(replica::database& db, const affected_keyspaces& affected_keyspaces, const affected_user_types& affected_types) : _stored_user_types(db.as_user_types_storage()) {
in_progress_types_storage_per_shard::in_progress_types_storage_per_shard(
replica::database& db, const affected_keyspaces& affected_keyspaces, const affected_user_types& affected_types)
: _stored_user_types(db.as_user_types_storage()) {
// initialize metadata for new keyspaces
for (auto& ks_per_shard : affected_keyspaces.created) {
auto metadata = ks_per_shard[this_shard_id()]->metadata();
@@ -552,7 +537,7 @@ in_progress_types_storage_per_shard::in_progress_types_storage_per_shard(replica
auto& ks_name = type->_keyspace;
_in_progress_types[ks_name].remove_type(type);
}
for (const auto &ks_name : affected_keyspaces.names.dropped) {
for (const auto& ks_name : affected_keyspaces.names.dropped) {
// can't reference a type when it's keyspace is being dropped
_in_progress_types[ks_name] = data_dictionary::user_types_metadata();
}
@@ -570,8 +555,9 @@ std::shared_ptr<data_dictionary::user_types_storage> in_progress_types_storage_p
return _stored_user_types;
}
future<> in_progress_types_storage::init(sharded<replica::database>& sharded_db, const affected_keyspaces& affected_keyspaces, const affected_user_types& affected_types) {
co_await sharded_db.invoke_on_all([&] (replica::database& db) {
future<> in_progress_types_storage::init(
sharded<replica::database>& sharded_db, const affected_keyspaces& affected_keyspaces, const affected_user_types& affected_types) {
co_await sharded_db.invoke_on_all([&](replica::database& db) {
shards[this_shard_id()] = make_foreign(seastar::make_shared<in_progress_types_storage_per_shard>(db, affected_keyspaces, affected_types));
});
}
@@ -585,8 +571,7 @@ in_progress_types_storage_per_shard& in_progress_types_storage::local() {
// that when a base schema and a subset of its views are modified together (i.e.,
// upon an alter table or alter type statement), then they are published together
// as well, without any deferring in-between.
future<> schema_applier::merge_tables_and_views()
{
future<> schema_applier::merge_tables_and_views() {
auto& user_types = _types_storage.local();
co_await _affected_tables_and_views.tables_and_views.start();
@@ -597,10 +582,10 @@ future<> schema_applier::merge_tables_and_views()
// Create CDC tables before non-CDC base tables, because we want the base tables with CDC enabled
// to point to their CDC tables.
local_cdc = diff_table_or_view(_proxy, _before.cdc, _after.cdc, _reload, [&] (schema_mutations sm, schema_diff_side) {
local_cdc = diff_table_or_view(_proxy, _before.cdc, _after.cdc, _reload, [&](schema_mutations sm, schema_diff_side) {
return create_table_from_mutations(_proxy, std::move(sm), user_types, nullptr);
});
local_tables = diff_table_or_view(_proxy, _before.tables, _after.tables, _reload, [&] (schema_mutations sm, schema_diff_side side) {
local_tables = diff_table_or_view(_proxy, _before.tables, _after.tables, _reload, [&](schema_mutations sm, schema_diff_side side) {
// If the table has CDC enabled, find the CDC schema version and set it in the table schema.
// If the table is created or altered with CDC enabled, then the CDC
// table is also created or altered in the same operation, so we can
@@ -636,7 +621,7 @@ future<> schema_applier::merge_tables_and_views()
return create_table_from_mutations(_proxy, std::move(sm), user_types, cdc_schema);
});
local_views = diff_table_or_view(_proxy, _before.views, _after.views, _reload, [&] (schema_mutations sm, schema_diff_side side) {
local_views = diff_table_or_view(_proxy, _before.views, _after.views, _reload, [&](schema_mutations sm, schema_diff_side side) {
// The view schema mutation should be created with reference to the base table schema because we definitely know it by now.
// If we don't do it we are leaving a window where write commands to this schema are illegal.
// There are 3 possibilities:
@@ -683,31 +668,26 @@ future<> schema_applier::merge_tables_and_views()
frozen_schema_diff tables_frozen = co_await local_tables.freeze();
frozen_schema_diff cdc_frozen = co_await local_cdc.freeze();
frozen_schema_diff views_frozen = co_await local_views.freeze();
co_await _affected_tables_and_views.tables_and_views.invoke_on_others([this, &tables_frozen, &cdc_frozen, &views_frozen] (affected_tables_and_views_per_shard& tables_and_views) -> future<> {
auto& db = _proxy.local().get_db().local();
tables_and_views.tables = co_await schema_diff_per_shard::copy_from(
db, _types_storage, tables_frozen);
tables_and_views.cdc = co_await schema_diff_per_shard::copy_from(
db, _types_storage, cdc_frozen);
tables_and_views.views = co_await schema_diff_per_shard::copy_from(
db, _types_storage, views_frozen);
});
co_await _affected_tables_and_views.tables_and_views.invoke_on_others(
[this, &tables_frozen, &cdc_frozen, &views_frozen](affected_tables_and_views_per_shard& tables_and_views) -> future<> {
auto& db = _proxy.local().get_db().local();
tables_and_views.tables = co_await schema_diff_per_shard::copy_from(db, _types_storage, tables_frozen);
tables_and_views.cdc = co_await schema_diff_per_shard::copy_from(db, _types_storage, cdc_frozen);
tables_and_views.views = co_await schema_diff_per_shard::copy_from(db, _types_storage, views_frozen);
});
auto& db = _proxy.local().get_db();
co_await max_concurrent_for_each(local_views.dropped, max_concurrent, [&db, this] (schema_ptr& dt) -> future<> {
co_await max_concurrent_for_each(local_views.dropped, max_concurrent, [&db, this](schema_ptr& dt) -> future<> {
auto uuid = dt->id();
_affected_tables_and_views.table_shards.insert({uuid,
co_await replica::database::prepare_drop_table_on_all_shards(db, uuid)});
_affected_tables_and_views.table_shards.insert({uuid, co_await replica::database::prepare_drop_table_on_all_shards(db, uuid)});
});
co_await max_concurrent_for_each(local_tables.dropped, max_concurrent, [&db, this] (schema_ptr& dt) -> future<> {
co_await max_concurrent_for_each(local_tables.dropped, max_concurrent, [&db, this](schema_ptr& dt) -> future<> {
auto uuid = dt->id();
_affected_tables_and_views.table_shards.insert({uuid,
co_await replica::database::prepare_drop_table_on_all_shards(db, uuid)});
_affected_tables_and_views.table_shards.insert({uuid, co_await replica::database::prepare_drop_table_on_all_shards(db, uuid)});
});
co_await max_concurrent_for_each(local_cdc.dropped, max_concurrent, [&db, this] (schema_ptr& dt) -> future<> {
co_await max_concurrent_for_each(local_cdc.dropped, max_concurrent, [&db, this](schema_ptr& dt) -> future<> {
auto uuid = dt->id();
_affected_tables_and_views.table_shards.insert({uuid,
co_await replica::database::prepare_drop_table_on_all_shards(db, uuid)});
_affected_tables_and_views.table_shards.insert({uuid, co_await replica::database::prepare_drop_table_on_all_shards(db, uuid)});
});
}
@@ -719,8 +699,8 @@ future<frozen_schema_diff> schema_diff_per_shard::freeze() const {
}
for (const auto& a : altered) {
result.altered.push_back(frozen_schema_diff::altered_schema{
.old_schema = extended_frozen_schema(a.old_schema),
.new_schema = extended_frozen_schema(a.new_schema),
.old_schema = extended_frozen_schema(a.old_schema),
.new_schema = extended_frozen_schema(a.new_schema),
});
co_await coroutine::maybe_yield();
}
@@ -743,8 +723,8 @@ future<schema_diff_per_shard> schema_diff_per_shard::copy_from(replica::database
}
for (const auto& a : oth.altered) {
result.altered.push_back(schema_diff_per_shard::altered_schema{
.old_schema = a.old_schema.unfreeze(commited_ctxt),
.new_schema = a.new_schema.unfreeze(ctxt),
.old_schema = a.old_schema.unfreeze(commited_ctxt),
.new_schema = a.new_schema.unfreeze(ctxt),
});
co_await coroutine::maybe_yield();
}
@@ -758,7 +738,7 @@ future<schema_diff_per_shard> schema_diff_per_shard::copy_from(replica::database
static future<> notify_tables_and_views(service::migration_notifier& notifier, const affected_tables_and_views& diff) {
auto it = diff.tables_and_views.local().columns_changed.cbegin();
auto notify = [&] (auto& r, auto&& f) -> future<> {
auto notify = [&](auto& r, auto&& f) -> future<> {
co_await max_concurrent_for_each(r, max_concurrent, std::move(f));
};
@@ -767,24 +747,41 @@ static future<> notify_tables_and_views(service::migration_notifier& notifier, c
const auto& views = diff.tables_and_views.local().views;
// View drops are notified first, because a table can only be dropped if its views are already deleted
co_await notify(views.dropped, [&] (auto&& dt) { return notifier.drop_view(view_ptr(dt)); });
co_await notify(tables.dropped, [&] (auto&& dt) { return notifier.drop_column_family(dt); });
co_await notify(cdc.dropped, [&] (auto&& dt) { return notifier.drop_column_family(dt); });
co_await notify(views.dropped, [&](auto&& dt) {
return notifier.drop_view(view_ptr(dt));
});
co_await notify(tables.dropped, [&](auto&& dt) {
return notifier.drop_column_family(dt);
});
co_await notify(cdc.dropped, [&](auto&& dt) {
return notifier.drop_column_family(dt);
});
// Table creations are notified first, in case a view is created right after the table
co_await notify(tables.created, [&] (auto&& gs) { return notifier.create_column_family(gs); });
co_await notify(cdc.created, [&] (auto&& gs) { return notifier.create_column_family(gs); });
co_await notify(views.created, [&] (auto&& gs) { return notifier.create_view(view_ptr(gs)); });
co_await notify(tables.created, [&](auto&& gs) {
return notifier.create_column_family(gs);
});
co_await notify(cdc.created, [&](auto&& gs) {
return notifier.create_column_family(gs);
});
co_await notify(views.created, [&](auto&& gs) {
return notifier.create_view(view_ptr(gs));
});
// Table altering is notified first, in case new base columns appear
co_await notify(tables.altered, [&] (auto&& altered) { return notifier.update_column_family(altered.new_schema, *it++); });
co_await notify(cdc.altered, [&] (auto&& altered) { return notifier.update_column_family(altered.new_schema, *it++); });
co_await notify(views.altered, [&] (auto&& altered) { return notifier.update_view(view_ptr(altered.new_schema), *it++); });
co_await notify(tables.altered, [&](auto&& altered) {
return notifier.update_column_family(altered.new_schema, *it++);
});
co_await notify(cdc.altered, [&](auto&& altered) {
return notifier.update_column_family(altered.new_schema, *it++);
});
co_await notify(views.altered, [&](auto&& altered) {
return notifier.update_view(view_ptr(altered.new_schema), *it++);
});
}
static void drop_cached_func(replica::database& db, const query::result_set_row& row) {
auto language = row.get_nonnull<sstring>("language");
if (language == "wasm") {
cql3::functions::function_name name{
row.get_nonnull<sstring>("keyspace_name"), row.get_nonnull<sstring>("function_name")};
cql3::functions::function_name name{row.get_nonnull<sstring>("keyspace_name"), row.get_nonnull<sstring>("function_name")};
auto arg_types = read_arg_types(row, name.keyspace, db.user_types());
db.lang().remove(name, arg_types);
}
@@ -793,14 +790,13 @@ static void drop_cached_func(replica::database& db, const query::result_set_row&
future<> schema_applier::merge_functions() {
auto diff = diff_rows(_before.functions, _after.functions);
co_await _functions_batch.start();
co_await _functions_batch.invoke_on_all(coroutine::lambda([&] (cql3::functions::change_batch& batch) -> future<> {
co_await _functions_batch.invoke_on_all(coroutine::lambda([&](cql3::functions::change_batch& batch) -> future<> {
auto& db = _proxy.local().get_db().local();
for (const auto& val : diff.created) {
batch.add_function(co_await create_func(db, *val, _types_storage.local()));
}
for (const auto& val : diff.dropped) {
cql3::functions::function_name name{
val->get_nonnull<sstring>("keyspace_name"), val->get_nonnull<sstring>("function_name")};
cql3::functions::function_name name{val->get_nonnull<sstring>("keyspace_name"), val->get_nonnull<sstring>("function_name")};
auto commited_storage = _types_storage.local().committed_storage();
auto arg_types = read_arg_types(*val, name.keyspace, *commited_storage);
// as we don't yield between dropping cache and committing batch
@@ -818,14 +814,13 @@ future<> schema_applier::merge_functions() {
future<> schema_applier::merge_aggregates() {
auto diff = diff_aggregates_rows(_before.aggregates, _after.aggregates, _before.scylla_aggregates, _after.scylla_aggregates);
co_await _functions_batch.invoke_on_all([&] (cql3::functions::change_batch& batch)-> future<> {
co_await _functions_batch.invoke_on_all([&](cql3::functions::change_batch& batch) -> future<> {
auto& db = _proxy.local().get_db().local();
for (const auto& val : diff.created) {
batch.add_function(create_aggregate(db, *val.first, val.second, batch, _types_storage.local()));
}
for (const auto& val : diff.dropped) {
cql3::functions::function_name name{
val.first->get_nonnull<sstring>("keyspace_name"), val.first->get_nonnull<sstring>("aggregate_name")};
cql3::functions::function_name name{val.first->get_nonnull<sstring>("keyspace_name"), val.first->get_nonnull<sstring>("aggregate_name")};
auto commited_storage = _types_storage.local().committed_storage();
auto arg_types = read_arg_types(*val.first, name.keyspace, *commited_storage);
batch.remove_aggregate(name, arg_types);
@@ -860,15 +855,15 @@ future<schema_persisted_state> schema_applier::get_schema_persisted_state() {
auto [tables, cdc] = extract_cdc(std::move(tables_and_cdc));
schema_persisted_state v{
.keyspaces = co_await read_schema_for_keyspaces(_proxy, KEYSPACES, _keyspaces),
.scylla_keyspaces = co_await read_schema_for_keyspaces(_proxy, SCYLLA_KEYSPACES, _keyspaces),
.tables = std::move(tables),
.types = co_await read_schema_for_keyspaces(_proxy, TYPES, _keyspaces),
.views = co_await read_tables_for_keyspaces(_proxy, _keyspaces, table_kind::view, _affected_tables),
.cdc = std::move(cdc),
.functions = co_await read_schema_for_keyspaces(_proxy, FUNCTIONS, _keyspaces),
.aggregates = co_await read_schema_for_keyspaces(_proxy, AGGREGATES, _keyspaces),
.scylla_aggregates = co_await read_schema_for_keyspaces(_proxy, SCYLLA_AGGREGATES, _keyspaces),
.keyspaces = co_await read_schema_for_keyspaces(_proxy, KEYSPACES, _keyspaces),
.scylla_keyspaces = co_await read_schema_for_keyspaces(_proxy, SCYLLA_KEYSPACES, _keyspaces),
.tables = std::move(tables),
.types = co_await read_schema_for_keyspaces(_proxy, TYPES, _keyspaces),
.views = co_await read_tables_for_keyspaces(_proxy, _keyspaces, table_kind::view, _affected_tables),
.cdc = std::move(cdc),
.functions = co_await read_schema_for_keyspaces(_proxy, FUNCTIONS, _keyspaces),
.aggregates = co_await read_schema_for_keyspaces(_proxy, AGGREGATES, _keyspaces),
.scylla_aggregates = co_await read_schema_for_keyspaces(_proxy, SCYLLA_AGGREGATES, _keyspaces),
};
co_return v;
}
@@ -924,10 +919,11 @@ class pending_schema_getter : public service::schema_getter {
private:
schema_applier& _sa;
sharded<replica::database>& _db;
public:
pending_schema_getter(schema_applier& sa) :
_sa(sa), _db(sa._proxy.local().get_db()) {
};
pending_schema_getter(schema_applier& sa)
: _sa(sa)
, _db(sa._proxy.local().get_db()) {};
virtual flat_hash_map<sstring, locator::replication_strategy_ptr> get_keyspaces_replication() const override {
flat_hash_map<sstring, locator::replication_strategy_ptr> out;
@@ -989,8 +985,7 @@ future<> schema_applier::update_tablets() {
if (_tablet_hint) {
slogger.info("Tablet metadata changed");
pending_schema_getter getter{*this};
_token_metadata_change = co_await _ss.local().prepare_token_metadata_change(
_pending_token_metadata.local(), getter);
_token_metadata_change = co_await _ss.local().prepare_token_metadata_change(_pending_token_metadata.local(), getter);
}
}
@@ -999,8 +994,7 @@ future<> schema_applier::update_tablets() {
future<> schema_applier::load_mutable_token_metadata() {
locator::mutable_token_metadata_ptr current_token_metadata = co_await _ss.local().get_mutable_token_metadata_ptr();
if (_tablet_hint) {
auto new_token_metadata = co_await _ss.local().prepare_tablet_metadata(
_tablet_hint, current_token_metadata);
auto new_token_metadata = co_await _ss.local().prepare_tablet_metadata(_tablet_hint, current_token_metadata);
co_return co_await _pending_token_metadata.assign(new_token_metadata);
}
co_await _pending_token_metadata.assign(current_token_metadata);
@@ -1115,14 +1109,13 @@ future<> schema_applier::commit() {
// However, we can only acquire the (write) lock after preparing all
// entities for the pending schema change that need to iterate over tables_metadata;
// otherwise, such iteration would deadlock.
_metadata_locks = std::make_unique<replica::tables_metadata_lock_on_all_shards>(
co_await replica::database::lock_tables_metadata(sharded_db));
_metadata_locks = std::make_unique<replica::tables_metadata_lock_on_all_shards>(co_await replica::database::lock_tables_metadata(sharded_db));
// Run func first on shard 0
// to allow "seeding" of the effective_replication_map
// with a new e_r_m instance.
SCYLLA_ASSERT(this_shard_id() == 0);
commit_on_shard(sharded_db.local());
co_await sharded_db.invoke_on_others([this] (replica::database& db) {
co_await sharded_db.invoke_on_others([this](replica::database& db) {
commit_on_shard(db);
});
// unlock as some functions in post_commit() may read data under those locks
@@ -1154,12 +1147,11 @@ future<> schema_applier::finalize_tables_and_views() {
if (_tablet_hint) {
auto& db = sharded_db.local();
co_await db.get_compaction_manager().get_shared_tombstone_gc_state().
flush_pending_repair_time_update(db);
co_await db.get_compaction_manager().get_shared_tombstone_gc_state().flush_pending_repair_time_update(db);
_ss.local().wake_up_topology_state_machine();
}
co_await sharded_db.invoke_on_all([&diff] (replica::database& db) -> future<> {
co_await sharded_db.invoke_on_all([&diff](replica::database& db) -> future<> {
const auto& tables = diff.tables_and_views.local().tables;
const auto& cdc = diff.tables_and_views.local().cdc;
const auto& views = diff.tables_and_views.local().views;
@@ -1184,15 +1176,14 @@ future<> schema_applier::finalize_tables_and_views() {
//
// Drop column mapping entries for dropped tables since these will not be TTLed automatically
// and will stay there forever if we don't clean them up manually
co_await max_concurrent_for_each(diff.tables_and_views.local().tables.created, max_concurrent, [this] (const schema_ptr& gs) -> future<> {
co_await max_concurrent_for_each(diff.tables_and_views.local().tables.created, max_concurrent, [this](const schema_ptr& gs) -> future<> {
co_await store_column_mapping(_proxy, gs, false);
});
co_await max_concurrent_for_each(diff.tables_and_views.local().tables.altered, max_concurrent, [this] (const schema_diff_per_shard::altered_schema& altered) -> future<> {
co_await when_all_succeed(
store_column_mapping(_proxy, altered.old_schema, true),
store_column_mapping(_proxy, altered.new_schema, false));
});
co_await max_concurrent_for_each(diff.tables_and_views.local().tables.dropped, max_concurrent, [this] (const schema_ptr& s) -> future<> {
co_await max_concurrent_for_each(
diff.tables_and_views.local().tables.altered, max_concurrent, [this](const schema_diff_per_shard::altered_schema& altered) -> future<> {
co_await when_all_succeed(store_column_mapping(_proxy, altered.old_schema, true), store_column_mapping(_proxy, altered.new_schema, false));
});
co_await max_concurrent_for_each(diff.tables_and_views.local().tables.dropped, max_concurrent, [this](const schema_ptr& s) -> future<> {
co_await drop_column_mapping(_sys_ks.local(), s->id(), s->version());
});
}
@@ -1200,7 +1191,7 @@ future<> schema_applier::finalize_tables_and_views() {
future<> schema_applier::post_commit() {
co_await finalize_tables_and_views();
auto& sharded_db = _proxy.local().get_db();
co_await sharded_db.invoke_on_all([&] (replica::database& db) -> future<> {
co_await sharded_db.invoke_on_all([&](replica::database& db) -> future<> {
auto& notifier = db.get_notifier();
// notify about keyspaces
for (const auto& name : _affected_keyspaces.names.created) {
@@ -1260,8 +1251,8 @@ static future<> execute_do_merge_schema(sharded<service::storage_proxy>& proxy,
co_await ap.post_commit();
}
static future<> do_merge_schema(sharded<service::storage_proxy>& proxy, sharded<service::storage_service>& ss, sharded<db::system_keyspace>& sys_ks, utils::chunked_vector<mutation> mutations, bool reload)
{
static future<> do_merge_schema(sharded<service::storage_proxy>& proxy, sharded<service::storage_service>& ss, sharded<db::system_keyspace>& sys_ks,
utils::chunked_vector<mutation> mutations, bool reload) {
slogger.trace("do_merge_schema: {}", mutations);
schema_applier ap(proxy, ss, sys_ks, reload);
co_await execute_do_merge_schema(proxy, ap, std::move(mutations)).finally([&ap]() {
@@ -1278,22 +1269,22 @@ static future<> do_merge_schema(sharded<service::storage_proxy>& proxy, sharded
* @throws ConfigurationException If one of metadata attributes has invalid value
* @throws IOException If data was corrupted during transportation or failed to apply fs operations
*/
future<> merge_schema(sharded<db::system_keyspace>& sys_ks, sharded<service::storage_proxy>& proxy, sharded<service::storage_service>& ss, utils::chunked_vector<mutation> mutations, bool reload)
{
future<> merge_schema(sharded<db::system_keyspace>& sys_ks, sharded<service::storage_proxy>& proxy, sharded<service::storage_service>& ss,
utils::chunked_vector<mutation> mutations, bool reload) {
if (this_shard_id() != 0) {
// mutations must be applied on the owning shard (0).
co_await smp::submit_to(0, coroutine::lambda([&, fmuts = freeze(mutations)] () mutable -> future<> {
co_await smp::submit_to(0, coroutine::lambda([&, fmuts = freeze(mutations)]() mutable -> future<> {
co_await merge_schema(sys_ks, proxy, ss, co_await unfreeze_gently(fmuts), reload);
}));
co_return;
}
co_await with_merge_lock([&] () mutable -> future<> {
co_await with_merge_lock([&]() mutable -> future<> {
co_await do_merge_schema(proxy, ss, sys_ks, std::move(mutations), reload);
auto version = co_await get_group0_schema_version(sys_ks.local());
co_await update_schema_version_and_announce(sys_ks, proxy, version);
});
}
}
} // namespace schema_tables
}
} // namespace db

View File

@@ -144,7 +144,7 @@ static std::vector<sstring> get_keyspaces(const schema& s, const replica::databa
/**
* Makes a wrapping range of ring_position from a nonwrapping range of token, used to select sstables.
*/
static dht::partition_range as_ring_position_range(const dht::token_range& r) {
static dht::partition_range as_ring_position_range(dht::token_range& r) {
std::optional<wrapping_interval<dht::ring_position>::bound> start_bound, end_bound;
if (r.start()) {
start_bound = {{ dht::ring_position(r.start()->value(), dht::ring_position::token_bound::start), r.start()->is_inclusive() }};
@@ -156,14 +156,11 @@ static dht::partition_range as_ring_position_range(const dht::token_range& r) {
}
/**
* Add a new range_estimates for the specified range, considering the sstables associated
* with the table identified by `cf_id` across all shards.
* Add a new range_estimates for the specified range, considering the sstables associated with `cf`.
*/
static future<system_keyspace::range_estimates> estimate(replica::database& db, table_id cf_id, schema_ptr schema, const token_range& r) {
struct shard_estimate {
int64_t count = 0;
utils::estimated_histogram hist{0};
};
static future<system_keyspace::range_estimates> estimate(const replica::column_family& cf, const token_range& r) {
int64_t count{0};
utils::estimated_histogram hist{0};
auto from_bytes = [] (auto& b) {
return dht::token::from_sstring(utf8_type->to_string(b));
};
@@ -172,35 +169,14 @@ static future<system_keyspace::range_estimates> estimate(replica::database& db,
wrapping_interval<dht::token>({{ from_bytes(r.start), false }}, {{ from_bytes(r.end) }}),
dht::token_comparator(),
[&] (auto&& rng) { ranges.push_back(std::move(rng)); });
// Estimate partition count and size distribution from sstables on a single shard.
auto estimate_on_shard = [cf_id, ranges] (replica::database& local_db) -> future<shard_estimate> {
auto table_ptr = local_db.get_tables_metadata().get_table_if_exists(cf_id);
if (!table_ptr) {
co_return shard_estimate{};
for (auto&& r : ranges) {
auto rp_range = as_ring_position_range(r);
for (auto&& sstable : cf.select_sstables(rp_range)) {
count += co_await sstable->estimated_keys_for_range(r);
hist.merge(sstable->get_stats_metadata().estimated_partition_size);
}
auto& cf = *table_ptr;
shard_estimate result;
for (auto&& r : ranges) {
auto rp_range = as_ring_position_range(r);
for (auto&& sstable : cf.select_sstables(rp_range)) {
result.count += co_await sstable->estimated_keys_for_range(r);
result.hist.merge(sstable->get_stats_metadata().estimated_partition_size);
}
}
co_return result;
};
// Combine partial results from two shards.
auto reduce = [] (shard_estimate a, const shard_estimate& b) {
a.count += b.count;
a.hist.merge(b.hist);
return a;
};
auto aggregate = co_await db.container().map_reduce0(std::move(estimate_on_shard), shard_estimate{}, std::move(reduce));
int64_t mean_size = aggregate.count > 0 ? aggregate.hist.mean() : 0;
co_return system_keyspace::range_estimates{std::move(schema), r.start, r.end, aggregate.count, mean_size};
}
co_return system_keyspace::range_estimates{cf.schema(), r.start, r.end, count, count > 0 ? hist.mean() : 0};
}
/**
@@ -345,7 +321,7 @@ size_estimates_mutation_reader::estimates_for_current_keyspace(std::vector<token
auto rows_to_estimate = range.slice(rows, virtual_row_comparator(_schema));
for (auto&& r : rows_to_estimate) {
auto& cf = _db.find_column_family(*_current_partition, utf8_type->to_string(r.cf_name));
estimates.push_back(co_await estimate(_db, cf.schema()->id(), cf.schema(), r.tokens));
estimates.push_back(co_await estimate(cf, r.tokens));
if (estimates.size() >= _slice.partition_row_limit()) {
co_return estimates;
}

View File

@@ -18,11 +18,8 @@
#include <seastar/coroutine/parallel_for_each.hh>
#include "db/snapshot-ctl.hh"
#include "db/snapshot/backup_task.hh"
#include "db/schema_tables.hh"
#include "index/secondary_index_manager.hh"
#include "replica/database.hh"
#include "replica/global_table_ptr.hh"
#include "replica/schema_describe_helper.hh"
#include "sstables/sstables_manager.hh"
#include "service/storage_proxy.hh"
@@ -157,56 +154,14 @@ future<> snapshot_ctl::do_take_cluster_column_family_snapshot(std::vector<sstrin
);
}
sstring snapshot_ctl::resolve_table_name(const sstring& ks_name, const sstring& name) const {
try {
_db.local().find_uuid(ks_name, name);
return name;
} catch (const data_dictionary::no_such_column_family&) {
// The name may be a logical index name (e.g. "myindex").
// Only indexes with a backing view have a separate backing table
// that can be snapshotted. Custom indexes such as vector indexes
// do not, so keep rejecting them here rather than mapping them to
// a synthetic name.
auto schema = _db.local().find_indexed_table(ks_name, name);
if (schema) {
const auto& im = schema->all_indices().at(name);
if (db::schema_tables::view_should_exist(im)) {
return secondary_index::index_table_name(name);
}
}
throw;
}
}
future<> snapshot_ctl::do_take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, snapshot_options opts) {
for (auto& t : tables) {
t = resolve_table_name(ks_name, t);
}
co_await check_snapshot_not_exist(ks_name, tag, tables);
co_await replica::database::snapshot_tables_on_all_shards(_db, ks_name, std::move(tables), std::move(tag), opts);
}
future<> snapshot_ctl::clear_snapshot(sstring tag, std::vector<sstring> keyspace_names, sstring cf_name) {
co_return co_await run_snapshot_modify_operation([this, tag = std::move(tag), keyspace_names = std::move(keyspace_names), cf_name = std::move(cf_name)] (this auto) -> future<> {
// clear_snapshot enumerates keyspace_names and uses cf_name as a
// filter in each. When cf_name needs resolution (e.g. logical index
// name -> backing table name), the result may differ per keyspace,
// so resolve and clear individually.
if (!cf_name.empty() && !keyspace_names.empty()) {
std::vector<std::pair<sstring, sstring>> resolved_targets;
resolved_targets.reserve(keyspace_names.size());
// Resolve every keyspace first so a later failure doesn't delete
// snapshots that were already matched in earlier keyspaces.
for (const auto& ks_name : keyspace_names) {
resolved_targets.emplace_back(ks_name, resolve_table_name(ks_name, cf_name));
}
for (auto& [ks_name, resolved_cf_name] : resolved_targets) {
co_await _db.local().clear_snapshot(tag, {ks_name}, std::move(resolved_cf_name));
}
co_return;
}
co_await _db.local().clear_snapshot(std::move(tag), std::move(keyspace_names), cf_name);
return run_snapshot_modify_operation([this, tag = std::move(tag), keyspace_names = std::move(keyspace_names), cf_name = std::move(cf_name)] {
return _db.local().clear_snapshot(tag, keyspace_names, cf_name);
});
}
@@ -215,26 +170,7 @@ snapshot_ctl::get_snapshot_details() {
using snapshot_map = std::unordered_map<sstring, db_snapshot_details>;
co_return co_await run_snapshot_list_operation(coroutine::lambda([this] () -> future<snapshot_map> {
auto details = co_await _db.local().get_snapshot_details();
for (auto& [snapshot_name, snapshot_details] : details) {
for (auto& table : snapshot_details) {
auto schema = _db.local().as_data_dictionary().try_find_table(
table.ks, table.cf);
if (!schema || !schema->schema()->is_view()) {
continue;
}
auto helper = replica::make_schema_describe_helper(
schema->schema(), _db.local().as_data_dictionary());
if (helper.type == schema_describe_helper::type::index) {
table.cf = secondary_index::index_name_from_table_name(
table.cf);
}
}
}
co_return details;
return _db.local().get_snapshot_details();
}));
}
@@ -299,4 +235,4 @@ future<int64_t> snapshot_ctl::true_snapshots_size(sstring ks, sstring cf) {
}));
}
}
}

View File

@@ -133,12 +133,6 @@ private:
future<> check_snapshot_not_exist(sstring ks_name, sstring name, std::optional<std::vector<sstring>> filter = {});
// Resolve a user-provided table name that may be a logical index name
// (e.g. "myindex") to its backing column family name (e.g.
// "myindex_index"). Returns the name unchanged if it already
// matches a column family.
sstring resolve_table_name(const sstring& ks_name, const sstring& name) const;
future<> run_snapshot_modify_operation(noncopyable_function<future<>()> &&);
template <typename Func>
@@ -157,4 +151,4 @@ private:
future<> do_take_cluster_column_family_snapshot(std::vector<sstring> ks_names, std::vector<sstring> tables, sstring tag, snapshot_options opts = {});
};
}
}

View File

@@ -281,7 +281,6 @@ schema_ptr system_keyspace::topology() {
.with_column("cleanup_status", utf8_type)
.with_column("supported_features", set_type_impl::get_instance(utf8_type, true))
.with_column("request_id", timeuuid_type)
.with_column("intended_storage_mode", utf8_type)
.with_column("ignore_nodes", set_type_impl::get_instance(uuid_type, true), column_kind::static_column)
.with_column("new_cdc_generation_data_uuid", timeuuid_type, column_kind::static_column)
.with_column("new_keyspace_rf_change_ks_name", utf8_type, column_kind::static_column) // deprecated
@@ -324,7 +323,6 @@ schema_ptr system_keyspace::topology_requests() {
.with_column("snapshot_tag", utf8_type)
.with_column("snapshot_expiry", timestamp_type)
.with_column("snapshot_skip_flush", boolean_type)
.with_column("finalize_migration_ks_name", utf8_type)
.set_comment("Topology request tracking")
.with_hash_version()
.build();
@@ -3171,11 +3169,6 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
}
}
std::optional<service::intended_storage_mode> storage_mode;
if (row.has("intended_storage_mode")) {
storage_mode = service::intended_storage_mode_from_string(row.get_as<sstring>("intended_storage_mode"));
}
std::unordered_map<raft::server_id, service::replica_state>* map = nullptr;
if (nstate == service::node_state::normal) {
map = &ret.normal_nodes;
@@ -3200,7 +3193,7 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
map->emplace(host_id, service::replica_state{
nstate, std::move(datacenter), std::move(rack), std::move(release_version),
ring_slice, shard_count, ignore_msb, std::move(supported_features),
service::cleanup_status_from_string(cleanup_status), request_id, storage_mode});
service::cleanup_status_from_string(cleanup_status), request_id});
}
}
@@ -3513,9 +3506,6 @@ system_keyspace::topology_requests_entry system_keyspace::topology_request_row_t
entry.snapshot_expiry = row.get_as<db_clock::time_point>("snapshot_expiry");
}
}
if (row.has("finalize_migration_ks_name")) {
entry.finalize_migration_ks_name = row.get_as<sstring>("finalize_migration_ks_name");
}
return entry;
}

View File

@@ -427,7 +427,6 @@ public:
std::optional<sstring> snapshot_tag;
std::optional<db_clock::time_point> snapshot_expiry;
bool snapshot_skip_flush;
std::optional<sstring> finalize_migration_ks_name;
};
using topology_requests_entries = std::unordered_map<utils::UUID, system_keyspace::topology_requests_entry>;

View File

@@ -143,18 +143,10 @@ dht::token_range view_building_worker::get_tablet_token_range(table_id table_id,
}
future<> view_building_worker::drain() {
auto drain_started = std::exchange(_drain_started, started_drain::yes);
if (drain_started == started_drain::no) {
_drain_finished = shared_future(do_drain());
}
return _drain_finished.get_future();
}
future<> view_building_worker::do_drain() {
if (!_as.abort_requested()) {
_as.request_abort();
}
co_await _staging_sstables_mutex.wait();
_state._mutex.broken();
_staging_sstables_mutex.broken();
_sstables_to_register_event.broken();
if (this_shard_id() == 0) {
@@ -164,9 +156,7 @@ future<> view_building_worker::do_drain() {
co_await std::move(state_observer);
co_await _mnotifier.unregister_listener(this);
}
co_await _state._mutex.wait();
_state._mutex.broken();
co_await _state.drain();
co_await _state.clear();
co_await uninit_messaging_service();
}
@@ -210,7 +200,9 @@ future<> view_building_worker::run_staging_sstables_registrator() {
while (!_as.abort_requested()) {
bool sleep = false;
try {
auto lock = co_await get_units(_staging_sstables_mutex, 1, _as);
co_await create_staging_sstable_tasks();
lock.return_all();
_as.check();
co_await _sstables_to_register_event.when();
} catch (semaphore_aborted&) {
@@ -235,45 +227,13 @@ future<> view_building_worker::run_staging_sstables_registrator() {
}
}
future<std::vector<foreign_ptr<semaphore_units<>>>> view_building_worker::lock_staging_mutex_on_multiple_shards(std::flat_set<shard_id> shards) {
SCYLLA_ASSERT(this_shard_id() == 0);
// Collect `_staging_sstables_mutex` locks from multiple shards,
// so other shards won't interact with their `_staging_sstables` map
// until the caller releases them.
std::vector<foreign_ptr<semaphore_units<>>> locks;
locks.resize(smp::count);
// Locks are acquired from multiple shards in parallel.
// This is the only place where multiple-shard locks are acquired at once
// and the method is called only once at a time (from `create_staging_sstable_tasks()`
// on shard 0), so no deadlock may occur.
co_await coroutine::parallel_for_each(shards, [&locks, &sharded_vbw = container()] (auto shard_id) -> future<> {
auto lock_ptr = co_await smp::submit_to(shard_id, [&sharded_vbw] () -> future<foreign_ptr<semaphore_units<>>> {
auto& vbw = sharded_vbw.local();
auto lock = co_await get_units(vbw._staging_sstables_mutex, 1, vbw._as);
co_return make_foreign(std::move(lock));
});
locks[shard_id] = std::move(lock_ptr);
});
co_return std::move(locks);
}
future<> view_building_worker::create_staging_sstable_tasks() {
// Explicitly lock shard0 beforehand to prevent other shards from modifying `_sstables_to_register` from `register_staging_sstable_tasks()`
auto lock0 = co_await get_units(_staging_sstables_mutex, 1, _as);
if (_sstables_to_register.empty()) {
co_return;
}
auto shards = _sstables_to_register
| std::views::values
| std::views::join
| std::views::transform([] (const auto& sst_info) { return sst_info.shard; })
| std::ranges::to<std::flat_set<shard_id>>();
shards.erase(0); // We're already holding shard0 lock
auto locks = co_await lock_staging_mutex_on_multiple_shards(std::move(shards));
utils::chunked_vector<canonical_mutation> cmuts;
auto guard = co_await _group0.client().start_operation(_as);
auto my_host_id = _db.get_token_metadata().get_topology().my_host_id();
for (auto& [table_id, sst_infos]: _sstables_to_register) {
@@ -500,16 +460,6 @@ static std::unordered_set<table_id> get_ids_of_all_views(replica::database& db,
}) | std::ranges::to<std::unordered_set>();;
}
void view_building_worker::state::start_batch(std::unique_ptr<batch> batch) {
if (_drained) {
on_internal_error(vbw_logger, "view_building_worker::state was already drained");
} else if (_batch) {
on_internal_error(vbw_logger, fmt::format("view_building_worker::state::start_batch(): some batch (tasks: {}) is already running", _batch->tasks | std::views::keys));
}
_batch = std::move(batch);
_batch->start();
}
// If `state::processing_base_table` is different that the `view_building_state::currently_processed_base_table`,
// clear the state, save and flush new base table
future<> view_building_worker::state::update_processing_base_table(replica::database& db, const view_building_state& building_state, abort_source& as) {
@@ -535,10 +485,6 @@ future<> view_building_worker::state::clean_up_after_batch() {
// Flush base table, set is as currently processing base table and save which views exist at the time of flush
future<> view_building_worker::state::flush_base_table(replica::database& db, table_id base_table_id, abort_source& as) {
if (_drained) {
on_internal_error(vbw_logger, "view_building_worker::state was already drained");
}
auto cf = db.find_column_family(base_table_id).shared_from_this();
co_await when_all(cf->await_pending_writes(), cf->await_pending_streams());
co_await flush_base(cf, as);
@@ -557,11 +503,6 @@ future<> view_building_worker::state::clear() {
flushed_views.clear();
}
future<> view_building_worker::state::drain() {
_drained = true;
co_await clear();
}
view_building_worker::batch::batch(sharded<view_building_worker>& vbw, std::unordered_map<utils::UUID, view_building_task> tasks, table_id base_id, locator::tablet_replica replica)
: base_id(base_id)
, replica(replica)
@@ -726,34 +667,24 @@ future<> view_building_worker::do_build_range(table_id base_id, std::vector<tabl
}
future<> view_building_worker::do_process_staging(table_id table_id, dht::token last_token) {
if (_staging_sstables[table_id].empty()) {
co_return;
}
auto table = _db.get_tables_metadata().get_table(table_id).shared_from_this();
auto& tablet_map = table->get_effective_replication_map()->get_token_metadata().tablets().get_tablet_map(table_id);
auto tid = tablet_map.get_tablet_id(last_token);
auto tablet_range = tablet_map.get_token_range(tid);
// Select sstables belonging to the tablet (identified by `last_token`)
std::vector<sstables::shared_sstable> sstables_to_process;
try {
// Acquire `_staging_sstables_mutex` to prevent `create_staging_sstable_tasks()` from
// concurrently modifying `_staging_sstables` (moving entries from `_sstables_to_register`)
// while we read them.
auto lock = co_await get_units(_staging_sstables_mutex, 1, _as);
auto& tablet_map = table->get_effective_replication_map()->get_token_metadata().tablets().get_tablet_map(table_id);
auto tid = tablet_map.get_tablet_id(last_token);
auto tablet_range = tablet_map.get_token_range(tid);
// Select sstables belonging to the tablet (identified by `last_token`)
for (auto& sst: _staging_sstables[table_id]) {
auto sst_last_token = sst->get_last_decorated_key().token();
if (tablet_range.contains(sst_last_token, dht::token_comparator())) {
sstables_to_process.push_back(sst);
}
for (auto& sst: _staging_sstables[table_id]) {
auto sst_last_token = sst->get_last_decorated_key().token();
if (tablet_range.contains(sst_last_token, dht::token_comparator())) {
sstables_to_process.push_back(sst);
}
lock.return_all();
} catch (semaphore_aborted&) {
vbw_logger.warn("Semaphore was aborted while waiting to removed processed sstables for table {}", table_id);
co_return;
}
if (sstables_to_process.empty()) {
co_return;
}
co_await _vug.process_staging_sstables(std::move(table), sstables_to_process);
try {
@@ -868,8 +799,8 @@ future<std::vector<utils::UUID>> view_building_worker::work_on_tasks(raft::term_
}
// Create and start the batch
auto batch = std::make_unique<view_building_worker::batch>(container(), std::move(tasks), *building_state.currently_processed_base_table, my_replica);
_state.start_batch(std::move(batch));
_state._batch = std::make_unique<batch>(container(), std::move(tasks), *building_state.currently_processed_base_table, my_replica);
_state._batch->start();
}
if (std::ranges::all_of(ids, [&] (auto& id) { return !_state._batch->tasks.contains(id); })) {

View File

@@ -14,7 +14,6 @@
#include <seastar/core/shared_future.hh>
#include <unordered_map>
#include <unordered_set>
#include <flat_set>
#include "locator/abstract_replication_strategy.hh"
#include "locator/tablets.hh"
#include "raft/raft.hh"
@@ -99,14 +98,11 @@ class view_building_worker : public seastar::peering_sharded_service<view_buildi
std::unordered_set<table_id> flushed_views;
semaphore _mutex = semaphore(1);
bool _drained = false;
// All of the methods below should be executed while holding `_mutex` unit!
void start_batch(std::unique_ptr<batch> batch);
future<> update_processing_base_table(replica::database& db, const view_building_state& building_state, abort_source& as);
future<> flush_base_table(replica::database& db, table_id base_table_id, abort_source& as);
future<> clean_up_after_batch();
future<> clear();
future<> drain();
};
// Wrapper which represents information needed to create
@@ -173,24 +169,14 @@ private:
future<> do_process_staging(table_id base_id, dht::token last_token);
future<> run_staging_sstables_registrator();
// Acquires `_staging_sstables_mutex` on all shards internally,
// so callers must not hold `_staging_sstables_mutex` when invoking it.
// Caller must hold units from `_staging_sstables_mutex`
future<> create_staging_sstable_tasks();
future<> discover_existing_staging_sstables();
std::unordered_map<table_id, std::vector<staging_sstable_task_info>> discover_local_staging_sstables(building_tasks building_tasks);
// Acquire `_staging_sstables_mutex` on multiple shards in parallel.
// Must be called only from shard 0.
// Must be called ONLY by `create_staging_sstable_tasks()` and only once at a time to avoid deadlock.
future<std::vector<foreign_ptr<semaphore_units<>>>> lock_staging_mutex_on_multiple_shards(std::flat_set<shard_id> shards);
void init_messaging_service();
future<> uninit_messaging_service();
future<std::vector<utils::UUID>> work_on_tasks(raft::term_t term, std::vector<utils::UUID> ids);
using started_drain = bool_class<struct started_drain_tag>;
started_drain _drain_started = started_drain::no;
shared_future<> _drain_finished;
future<> do_drain();
};
}

View File

@@ -99,7 +99,7 @@ public:
set_cell(cr, "up", gossiper.is_alive(hostid));
if (gossiper.is_shutdown(endpoint)) {
set_cell(cr, "status", "shutdown");
set_cell(cr, "status", gossiper.get_gossip_status(endpoint));
} else {
set_cell(cr, "status", boost::to_upper_copy<std::string>(fmt::format("{}", ss.get_node_state(hostid))));
}
@@ -224,12 +224,12 @@ public:
}
if (_db.find_keyspace(e.name).get_replication_strategy().uses_tablets()) {
co_await _db.get_tables_metadata().for_each_table_gently([&, this] (table_id tid, lw_shared_ptr<replica::table> table) -> future<> {
co_await _db.get_tables_metadata().for_each_table_gently([&, this] (table_id, lw_shared_ptr<replica::table> table) -> future<> {
if (table->schema()->ks_name() != e.name) {
co_return;
}
const auto& table_name = table->schema()->cf_name();
utils::chunked_vector<dht::token_range_endpoints> ranges = co_await _ss.describe_ring_for_table(tid);
utils::chunked_vector<dht::token_range_endpoints> ranges = co_await _ss.describe_ring_for_table(e.name, table_name);
co_await emit_ring(result, e.key, table_name, std::move(ranges));
});
} else {

View File

@@ -29,8 +29,8 @@ static logging::logger blogger("boot_strapper");
namespace dht {
future<> boot_strapper::bootstrap(streaming::stream_reason reason, gms::gossiper& gossiper, service::frozen_topology_guard topo_guard,
locator::host_id replace_address) {
future<> boot_strapper::bootstrap(
streaming::stream_reason reason, gms::gossiper& gossiper, service::frozen_topology_guard topo_guard, locator::host_id replace_address) {
blogger.debug("Beginning bootstrap process: sorted_tokens={}", get_token_metadata().sorted_tokens());
sstring description;
if (reason == streaming::stream_reason::bootstrap) {
@@ -41,7 +41,8 @@ future<> boot_strapper::bootstrap(streaming::stream_reason reason, gms::gossiper
throw std::runtime_error("Wrong stream_reason provided: it can only be replace or bootstrap");
}
try {
auto streamer = make_lw_shared<range_streamer>(_db, _stream_manager, _token_metadata_ptr, _abort_source, _tokens, _address, _dr, description, reason, topo_guard);
auto streamer = make_lw_shared<range_streamer>(
_db, _stream_manager, _token_metadata_ptr, _abort_source, _tokens, _address, _dr, description, reason, topo_guard);
auto nodes_to_filter = gossiper.get_unreachable_members();
if (reason == streaming::stream_reason::replace) {
nodes_to_filter.insert(std::move(replace_address));
@@ -71,7 +72,8 @@ std::unordered_set<token> boot_strapper::get_random_bootstrap_tokens(const token
}
if (num_tokens == 1) {
blogger.warn("Picking random token for a single vnode. You should probably add more vnodes; failing that, you should probably specify the token manually");
blogger.warn(
"Picking random token for a single vnode. You should probably add more vnodes; failing that, you should probably specify the token manually");
}
auto tokens = get_random_tokens(std::move(tmptr), num_tokens);
@@ -86,7 +88,8 @@ std::unordered_set<token> boot_strapper::get_bootstrap_tokens(token_metadata_ptr
return get_bootstrap_tokens(std::move(tmptr), cfg.initial_token(), cfg.num_tokens(), check);
}
std::unordered_set<token> boot_strapper::get_bootstrap_tokens(const token_metadata_ptr tmptr, sstring tokens_string, uint32_t num_tokens, check_token_endpoint check) {
std::unordered_set<token> boot_strapper::get_bootstrap_tokens(
const token_metadata_ptr tmptr, sstring tokens_string, uint32_t num_tokens, check_token_endpoint check) {
std::unordered_set<sstring> initial_tokens;
try {
boost::split(initial_tokens, tokens_string, boost::is_any_of(sstring(", ")));
@@ -102,7 +105,8 @@ std::unordered_set<token> boot_strapper::get_bootstrap_tokens(const token_metada
for (auto& token_string : initial_tokens) {
auto token = dht::token::from_sstring(token_string);
if (check && tmptr->get_endpoint(token)) {
throw std::runtime_error(format("Bootstrapping to existing token {} is not allowed (decommission/removenode the old node first).", token_string));
throw std::runtime_error(
format("Bootstrapping to existing token {} is not allowed (decommission/removenode the old node first).", token_string));
}
tokens.insert(token);
}

View File

@@ -26,10 +26,9 @@ static logging::logger logger("range_streamer");
using inet_address = gms::inet_address;
std::unordered_map<locator::host_id, dht::token_range_vector>
range_streamer::get_range_fetch_map(const std::unordered_map<dht::token_range, std::vector<locator::host_id>>& ranges_with_sources,
const std::unordered_set<std::unique_ptr<i_source_filter>>& source_filters,
const sstring& keyspace) {
std::unordered_map<locator::host_id, dht::token_range_vector> range_streamer::get_range_fetch_map(
const std::unordered_map<dht::token_range, std::vector<locator::host_id>>& ranges_with_sources,
const std::unordered_set<std::unique_ptr<i_source_filter>>& source_filters, const sstring& keyspace) {
std::unordered_map<locator::host_id, dht::token_range_vector> range_fetch_map_map;
const auto& topo = _token_metadata_ptr->get_topology();
for (const auto& x : ranges_with_sources) {
@@ -79,8 +78,8 @@ range_streamer::get_range_fetch_map(const std::unordered_map<dht::token_range, s
}
// Must be called from a seastar thread
std::unordered_map<dht::token_range, std::vector<locator::host_id>>
range_streamer::get_all_ranges_with_sources_for(const sstring& keyspace_name, const locator::vnode_effective_replication_map* erm, dht::token_range_vector desired_ranges) {
std::unordered_map<dht::token_range, std::vector<locator::host_id>> range_streamer::get_all_ranges_with_sources_for(
const sstring& keyspace_name, const locator::vnode_effective_replication_map* erm, dht::token_range_vector desired_ranges) {
logger.debug("{} ks={}", __func__, keyspace_name);
auto range_addresses = erm->get_range_host_ids().get();
@@ -114,24 +113,24 @@ range_streamer::get_all_ranges_with_sources_for(const sstring& keyspace_name, co
}
// Must be called from a seastar thread
std::unordered_map<dht::token_range, std::vector<locator::host_id>>
range_streamer::get_all_ranges_with_strict_sources_for(const sstring& keyspace_name, const locator::vnode_effective_replication_map* erm, dht::token_range_vector desired_ranges, gms::gossiper& gossiper) {
std::unordered_map<dht::token_range, std::vector<locator::host_id>> range_streamer::get_all_ranges_with_strict_sources_for(
const sstring& keyspace_name, const locator::vnode_effective_replication_map* erm, dht::token_range_vector desired_ranges, gms::gossiper& gossiper) {
logger.debug("{} ks={}", __func__, keyspace_name);
SCYLLA_ASSERT (_tokens.empty() == false);
SCYLLA_ASSERT(_tokens.empty() == false);
auto& strat = erm->get_replication_strategy();
//Active ranges
// Active ranges
auto metadata_clone = get_token_metadata().clone_only_token_map().get();
auto range_addresses = strat.get_range_host_ids(metadata_clone).get();
//Pending ranges
// Pending ranges
metadata_clone.update_topology(_address, _dr);
metadata_clone.update_normal_tokens(_tokens, _address).get();
auto pending_range_addresses = strat.get_range_host_ids(metadata_clone).get();
auto pending_range_addresses = strat.get_range_host_ids(metadata_clone).get();
metadata_clone.clear_gently().get();
//Collects the source that will have its range moved to the new node
// Collects the source that will have its range moved to the new node
std::unordered_map<dht::token_range, std::vector<locator::host_id>> range_sources;
logger.debug("keyspace={}, desired_ranges.size={}, range_addresses.size={}", keyspace_name, desired_ranges.size(), range_addresses.size());
@@ -150,11 +149,12 @@ range_streamer::get_all_ranges_with_strict_sources_for(const sstring& keyspace_n
}
std::unordered_set<locator::host_id> new_endpoints(it->second.begin(), it->second.end());
//Due to CASSANDRA-5953 we can have a higher RF then we have endpoints.
//So we need to be careful to only be strict when endpoints == RF
// Due to CASSANDRA-5953 we can have a higher RF then we have endpoints.
// So we need to be careful to only be strict when endpoints == RF
if (old_endpoints.size() == erm->get_replication_factor()) {
std::erase_if(old_endpoints,
[&new_endpoints] (locator::host_id ep) { return new_endpoints.contains(ep); });
std::erase_if(old_endpoints, [&new_endpoints](locator::host_id ep) {
return new_endpoints.contains(ep);
});
if (old_endpoints.size() != 1) {
throw std::runtime_error(format("Expected 1 endpoint but found {:d}", old_endpoints.size()));
}
@@ -163,7 +163,7 @@ range_streamer::get_all_ranges_with_strict_sources_for(const sstring& keyspace_n
}
}
//Validate
// Validate
auto it = range_sources.find(desired_range);
if (it == range_sources.end()) {
throw std::runtime_error(format("No sources found for {}", desired_range));
@@ -176,7 +176,9 @@ range_streamer::get_all_ranges_with_strict_sources_for(const sstring& keyspace_n
locator::host_id source_id = it->second.front();
if (gossiper.is_enabled() && !gossiper.is_alive(source_id)) {
throw std::runtime_error(format("A node required to move the data consistently is down ({}). If you wish to move the data from a potentially inconsistent replica, restart the node with consistent_rangemovement=false", source_id));
throw std::runtime_error(format("A node required to move the data consistently is down ({}). If you wish to move the data from a potentially "
"inconsistent replica, restart the node with consistent_rangemovement=false",
source_id));
}
}
@@ -188,12 +190,8 @@ bool range_streamer::use_strict_sources_for_ranges(const sstring& keyspace_name,
auto nr_nodes_in_ring = get_token_metadata().get_normal_token_owners().size();
bool everywhere_topology = erm.get_replication_strategy().get_type() == locator::replication_strategy_type::everywhere_topology;
// Use strict when number of nodes in the ring is equal or more than RF
auto strict = _db.local().get_config().consistent_rangemovement()
&& !_tokens.empty()
&& !everywhere_topology
&& nr_nodes_in_ring >= rf;
logger.debug("use_strict_sources_for_ranges: ks={}, nr_nodes_in_ring={}, rf={}, strict={}",
keyspace_name, nr_nodes_in_ring, rf, strict);
auto strict = _db.local().get_config().consistent_rangemovement() && !_tokens.empty() && !everywhere_topology && nr_nodes_in_ring >= rf;
logger.debug("use_strict_sources_for_ranges: ks={}, nr_nodes_in_ring={}, rf={}, strict={}", keyspace_name, nr_nodes_in_ring, rf, strict);
return strict;
}
@@ -214,34 +212,36 @@ void range_streamer::add_rx_ranges(const sstring& keyspace_name, std::unordered_
}
// TODO: This is the legacy range_streamer interface, it is add_rx_ranges which adds rx ranges.
future<> range_streamer::add_ranges(const sstring& keyspace_name, locator::static_effective_replication_map_ptr erm, dht::token_range_vector ranges, gms::gossiper& gossiper, bool is_replacing) {
return seastar::async([this, keyspace_name, ermp = std::move(erm), ranges= std::move(ranges), &gossiper, is_replacing] () mutable {
if (_nr_tx_added) {
throw std::runtime_error("Mixed sending and receiving is not supported");
}
_nr_rx_added++;
auto erm = ermp->maybe_as_vnode_effective_replication_map();
SCYLLA_ASSERT(erm != nullptr);
auto ranges_for_keyspace = !is_replacing && use_strict_sources_for_ranges(keyspace_name, *erm)
? get_all_ranges_with_strict_sources_for(keyspace_name, erm, std::move(ranges), gossiper)
: get_all_ranges_with_sources_for(keyspace_name, erm, std::move(ranges));
if (logger.is_enabled(logging::log_level::debug)) {
for (auto& x : ranges_for_keyspace) {
logger.debug("{} : keyspace {} range {} exists on {}", _description, keyspace_name, x.first, x.second);
future<> range_streamer::add_ranges(const sstring& keyspace_name, locator::static_effective_replication_map_ptr erm, dht::token_range_vector ranges,
gms::gossiper& gossiper, bool is_replacing) {
return seastar::async([this, keyspace_name, ermp = std::move(erm), ranges = std::move(ranges), &gossiper, is_replacing]() mutable {
if (_nr_tx_added) {
throw std::runtime_error("Mixed sending and receiving is not supported");
}
}
_nr_rx_added++;
auto erm = ermp->maybe_as_vnode_effective_replication_map();
SCYLLA_ASSERT(erm != nullptr);
auto ranges_for_keyspace = !is_replacing && use_strict_sources_for_ranges(keyspace_name, *erm)
? get_all_ranges_with_strict_sources_for(keyspace_name, erm, std::move(ranges), gossiper)
: get_all_ranges_with_sources_for(keyspace_name, erm, std::move(ranges));
std::unordered_map<locator::host_id, dht::token_range_vector> range_fetch_map = get_range_fetch_map(ranges_for_keyspace, _source_filters, keyspace_name);
utils::clear_gently(ranges_for_keyspace).get();
if (logger.is_enabled(logging::log_level::debug)) {
for (auto& x : range_fetch_map) {
logger.debug("{} : keyspace={}, ranges={} from source={}, range_size={}", _description, keyspace_name, x.second, x.first, x.second.size());
if (logger.is_enabled(logging::log_level::debug)) {
for (auto& x : ranges_for_keyspace) {
logger.debug("{} : keyspace {} range {} exists on {}", _description, keyspace_name, x.first, x.second);
}
}
}
_to_stream.emplace(keyspace_name, std::move(range_fetch_map));
});
std::unordered_map<locator::host_id, dht::token_range_vector> range_fetch_map =
get_range_fetch_map(ranges_for_keyspace, _source_filters, keyspace_name);
utils::clear_gently(ranges_for_keyspace).get();
if (logger.is_enabled(logging::log_level::debug)) {
for (auto& x : range_fetch_map) {
logger.debug("{} : keyspace={}, ranges={} from source={}, range_size={}", _description, keyspace_name, x.second, x.first, x.second.size());
}
}
_to_stream.emplace(keyspace_name, std::move(range_fetch_map));
});
}
future<> range_streamer::stream_async() {
@@ -250,73 +250,73 @@ future<> range_streamer::stream_async() {
_token_metadata_ptr = nullptr;
logger.info("{} starts, nr_ranges_remaining={}", _description, _nr_ranges_remaining);
auto start = lowres_clock::now();
return do_for_each(_to_stream, [this, description = _description] (auto& stream) {
return do_for_each(_to_stream, [this, description = _description](auto& stream) {
const auto& keyspace = stream.first;
auto& ip_range_vec = stream.second;
auto ips = ip_range_vec | std::views::keys | std::ranges::to<std::list>();
// Fetch from or send to peer node in parallel
logger.info("{} with {} for keyspace={} started, nodes_to_stream={}", description, ips, keyspace, ip_range_vec.size());
return parallel_for_each(ip_range_vec, [this, description, keyspace] (auto& ip_range) {
auto& source = ip_range.first;
auto& range_vec = ip_range.second;
return seastar::with_semaphore(_limiter, 1, [this, description, keyspace, source, &range_vec] () mutable {
return seastar::async([this, description, keyspace, source, &range_vec] () mutable {
// TODO: It is better to use fiber instead of thread here because
// creating a thread per peer can be some memory in a large cluster.
auto start_time = lowres_clock::now();
unsigned sp_index = 0;
unsigned nr_ranges_streamed = 0;
size_t nr_ranges_total = range_vec.size();
auto do_streaming = [&] (dht::token_range_vector&& ranges_to_stream) {
auto sp = stream_plan(_stream_manager.local(), format("{}-{}-index-{:d}", description, keyspace, sp_index++),
_reason, _topo_guard);
auto abort_listener = _abort_source.subscribe([&] () noexcept { sp.abort(); });
_abort_source.check();
logger.info("{} with {} for keyspace={}, streaming [{}, {}) out of {} ranges",
description, source, keyspace,
nr_ranges_streamed, nr_ranges_streamed + ranges_to_stream.size(), nr_ranges_total);
auto ranges_streamed = ranges_to_stream.size();
if (_nr_rx_added) {
sp.request_ranges(source, keyspace, std::move(ranges_to_stream), _tables);
} else if (_nr_tx_added) {
sp.transfer_ranges(source, keyspace, std::move(ranges_to_stream), _tables);
}
sp.execute().discard_result().get();
// Update finished percentage
nr_ranges_streamed += ranges_streamed;
_nr_ranges_remaining -= ranges_streamed;
float percentage = _nr_total_ranges == 0 ? 1 : (_nr_total_ranges - _nr_ranges_remaining) / (float)_nr_total_ranges;
_stream_manager.local().update_finished_percentage(_reason, percentage);
logger.info("Finished {} out of {} ranges for {}, finished percentage={}",
_nr_total_ranges - _nr_ranges_remaining, _nr_total_ranges, _reason, percentage);
};
dht::token_range_vector ranges_to_stream;
try {
for (auto it = range_vec.begin(); it < range_vec.end();) {
ranges_to_stream.push_back(*it);
++it;
auto fraction = _db.local().get_config().stream_plan_ranges_fraction();
size_t nr_ranges_per_stream_plan = nr_ranges_total * fraction;
if (ranges_to_stream.size() < nr_ranges_per_stream_plan) {
continue;
} else {
do_streaming(std::exchange(ranges_to_stream, {}));
it = range_vec.erase(range_vec.begin(), it);
return parallel_for_each(ip_range_vec, [this, description, keyspace](auto& ip_range) {
auto& source = ip_range.first;
auto& range_vec = ip_range.second;
return seastar::with_semaphore(_limiter, 1, [this, description, keyspace, source, &range_vec]() mutable {
return seastar::async([this, description, keyspace, source, &range_vec]() mutable {
// TODO: It is better to use fiber instead of thread here because
// creating a thread per peer can be some memory in a large cluster.
auto start_time = lowres_clock::now();
unsigned sp_index = 0;
unsigned nr_ranges_streamed = 0;
size_t nr_ranges_total = range_vec.size();
auto do_streaming = [&](dht::token_range_vector&& ranges_to_stream) {
auto sp = stream_plan(_stream_manager.local(), format("{}-{}-index-{:d}", description, keyspace, sp_index++), _reason, _topo_guard);
auto abort_listener = _abort_source.subscribe([&]() noexcept {
sp.abort();
});
_abort_source.check();
logger.info("{} with {} for keyspace={}, streaming [{}, {}) out of {} ranges", description, source, keyspace, nr_ranges_streamed,
nr_ranges_streamed + ranges_to_stream.size(), nr_ranges_total);
auto ranges_streamed = ranges_to_stream.size();
if (_nr_rx_added) {
sp.request_ranges(source, keyspace, std::move(ranges_to_stream), _tables);
} else if (_nr_tx_added) {
sp.transfer_ranges(source, keyspace, std::move(ranges_to_stream), _tables);
}
sp.execute().discard_result().get();
// Update finished percentage
nr_ranges_streamed += ranges_streamed;
_nr_ranges_remaining -= ranges_streamed;
float percentage = _nr_total_ranges == 0 ? 1 : (_nr_total_ranges - _nr_ranges_remaining) / (float)_nr_total_ranges;
_stream_manager.local().update_finished_percentage(_reason, percentage);
logger.info("Finished {} out of {} ranges for {}, finished percentage={}", _nr_total_ranges - _nr_ranges_remaining, _nr_total_ranges,
_reason, percentage);
};
dht::token_range_vector ranges_to_stream;
try {
for (auto it = range_vec.begin(); it < range_vec.end();) {
ranges_to_stream.push_back(*it);
++it;
auto fraction = _db.local().get_config().stream_plan_ranges_fraction();
size_t nr_ranges_per_stream_plan = nr_ranges_total * fraction;
if (ranges_to_stream.size() < nr_ranges_per_stream_plan) {
continue;
} else {
do_streaming(std::exchange(ranges_to_stream, {}));
it = range_vec.erase(range_vec.begin(), it);
}
}
if (ranges_to_stream.size() > 0) {
do_streaming(std::exchange(ranges_to_stream, {}));
range_vec.clear();
}
} catch (...) {
auto t = std::chrono::duration_cast<std::chrono::duration<float>>(lowres_clock::now() - start_time).count();
logger.warn("{} with {} for keyspace={} failed, took {} seconds: {}", description, source, keyspace, t, std::current_exception());
throw;
}
if (ranges_to_stream.size() > 0) {
do_streaming(std::exchange(ranges_to_stream, {}));
range_vec.clear();
}
} catch (...) {
auto t = std::chrono::duration_cast<std::chrono::duration<float>>(lowres_clock::now() - start_time).count();
logger.warn("{} with {} for keyspace={} failed, took {} seconds: {}", description, source, keyspace, t, std::current_exception());
throw;
}
auto t = std::chrono::duration_cast<std::chrono::duration<float>>(lowres_clock::now() - start_time).count();
logger.info("{} with {} for keyspace={} succeeded, took {} seconds", description, source, keyspace, t);
});
});
logger.info("{} with {} for keyspace={} succeeded, took {} seconds", description, source, keyspace, t);
});
});
});
}).finally([this, start] {
auto t = std::chrono::duration_cast<std::chrono::seconds>(lowres_clock::now() - start).count();
@@ -344,4 +344,4 @@ size_t range_streamer::nr_ranges_to_stream() {
return nr_ranges_remaining;
}
} // dht
} // namespace dht

View File

@@ -9,7 +9,6 @@
import os
import sys
import shlex
import argparse
import psutil
from pathlib import Path
@@ -104,41 +103,16 @@ if __name__ == '__main__':
run('dd if=/dev/zero of={} bs=1M count={}'.format(swapfile, swapsize_mb), shell=True, check=True)
swapfile.chmod(0o600)
run('mkswap -f {}'.format(swapfile), shell=True, check=True)
mount_point = find_mount_point(swap_directory)
mount_unit = out(f'systemd-escape -p --suffix=mount {shlex.quote(str(mount_point))}')
# Add DefaultDependencies=no to the swap unit to avoid getting the default
# Before=swap.target dependency. We apply this to all clouds, but the
# requirement came from Azure:
#
# On Azure, the swap directory is on the Azure ephemeral disk (mounted on /mnt).
# However, cloud-init makes this mount (i.e., the mnt.mount unit) depend on
# the network (After=network-online.target). By extension, this means that
# the swap unit depends on the network. If we didn't use DefaultDependencies=no,
# then the swap unit would be part of the swap.target which other services
# assume to be a local boot target, so we would end up with dependency cycles
# such as:
#
# swap.target -> mnt-swapfile.swap -> mnt.mount -> network-online.target -> network.target -> systemd-resolved.service -> tmp.mount -> swap.target
#
# By removing the automatic Before=swap.target, the swap unit is no longer
# part of swap.target, avoiding such cycles. The swap will still be
# activated via WantedBy=multi-user.target.
unit_data = '''
[Unit]
Description=swapfile
DefaultDependencies=no
After={}
Conflicts=umount.target
Before=umount.target
[Swap]
What={}
[Install]
WantedBy=multi-user.target
'''[1:-1].format(mount_unit, swapfile)
'''[1:-1].format(swapfile)
with swapunit.open('w') as f:
f.write(unit_data)
systemd_unit.reload()

View File

@@ -1,12 +1,6 @@
### a dictionary of redirections
#old path: new path
# Move the Upgrade Support (About Upgrade) page
/stable/upgrade/about-upgrade.html: https://docs.scylladb.com/stable/versioning/upgrade-policy.html
/branch-2025.4/upgrade/about-upgrade.html: https://docs.scylladb.com/stable/versioning/upgrade-policy.html
/branch-2026.1/upgrade/about-upgrade.html: https://docs.scylladb.com/stable/versioning/upgrade-policy.html
# Move the OS Support page
/stable/getting-started/os-support.html: https://docs.scylladb.com/stable/versioning/os-support-per-version.html

View File

@@ -31,7 +31,7 @@ was used. Alternator currently supports two compression algorithms, `gzip`
and `deflate`, both standardized in ([RFC 9110](https://www.rfc-editor.org/rfc/rfc9110.html)).
Other standard compression types which are listed in
[IANA's HTTP Content Coding Registry](https://www.iana.org/assignments/http-parameters/http-parameters.xhtml#content-coding),
including `zstd` ([RFC 8878](https://www.rfc-editor.org/rfc/rfc8878.html)),
including `zstd` ([RFC 8878][https://www.rfc-editor.org/rfc/rfc8878.html]),
are not yet supported by Alternator.
Note that HTTP's compression only compresses the request's _body_ - not the

View File

@@ -261,51 +261,8 @@ The following options are supported for vector indexes. All of them are optional
| | * ``true``: Enable rescoring. | |
| | * ``false``: Disable rescoring. | |
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
| ``source_model`` | The name of the embedding model that produced the vectors (e.g., ``"ada002"``). Cassandra client | *(none)* |
| | libraries such as CassIO send this option to tag the index with the model. Cassandra SAI rejects it as | |
| | an unrecognized property; ScyllaDB accepts and preserves it in ``DESCRIBE`` output for compatibility | |
| | with those libraries, but does not act on it. | |
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
.. _cassandra-sai-compatibility:
Cassandra SAI Compatibility for Vector Search
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ScyllaDB accepts the Cassandra ``StorageAttachedIndex`` (SAI) class name in ``CREATE CUSTOM INDEX``
statements **for vector columns**. Cassandra libraries such as
`CassIO <https://cassio.org/>`_ and `LangChain <https://www.langchain.com/>`_ use SAI to create
vector indexes; ScyllaDB recognizes these statements for compatibility.
When ScyllaDB encounters an SAI class name on a **vector column**, the index is automatically
created as a native ``vector_index``. The following class names are recognized:
* ``org.apache.cassandra.index.sai.StorageAttachedIndex`` (exact case required)
* ``StorageAttachedIndex`` (case-insensitive)
* ``SAI`` (case-insensitive)
Example::
-- Cassandra SAI statement accepted by ScyllaDB:
CREATE CUSTOM INDEX ON my_table (embedding)
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
WITH OPTIONS = {'similarity_function': 'COSINE'};
-- Equivalent to:
CREATE CUSTOM INDEX ON my_table (embedding)
USING 'vector_index'
WITH OPTIONS = {'similarity_function': 'COSINE'};
The ``similarity_function`` option is supported by both Cassandra SAI and ScyllaDB.
.. note::
SAI class names are only supported on **vector columns**. Using an SAI class name on a
non-vector column (e.g., ``text`` or ``int``) will result in an error. General SAI
indexing of non-vector columns is not supported by ScyllaDB; use a
:doc:`secondary index </cql/secondary-indexes>` instead.
.. _drop-index-statement:
DROP INDEX

111
docs/dev/audit.md Normal file
View File

@@ -0,0 +1,111 @@
# Introduction
Similar to the approach described in CASSANDRA-12151, we add the
concept of an audit specification. An audit has a target (syslog or a
table) and a set of events/actions that it wants recorded. We
introduce new CQL syntax for Scylla users to describe and manipulate
audit specifications.
Prior art:
- Microsoft SQL Server [audit
description](https://docs.microsoft.com/en-us/sql/relational-databases/security/auditing/sql-server-audit-database-engine?view=sql-server-ver15)
- pgAudit [docs](https://github.com/pgaudit/pgaudit/blob/master/README.md)
- MySQL audit_log docs in
[MySQL](https://dev.mysql.com/doc/refman/8.0/en/audit-log.html) and
[Azure](https://docs.microsoft.com/en-us/azure/mysql/concepts-audit-logs)
- DynamoDB can [use CloudTrail](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/logging-using-cloudtrail.html) to log all events
# CQL extensions
## Create an audit
```cql
CREATE AUDIT [IF NOT EXISTS] audit-name WITH TARGET { SYSLOG | table-name }
[ AND TRIGGER KEYSPACE IN (ks1, ks2, ks3) ]
[ AND TRIGGER TABLE IN (tbl1, tbl2, tbl3) ]
[ AND TRIGGER ROLE IN (usr1, usr2, usr3) ]
[ AND TRIGGER CATEGORY IN (cat1, cat2, cat3) ]
;
```
From this point on, every database event that matches all present
triggers will be recorded in the target. When the target is a table,
it behaves like the [current
design](https://docs.scylladb.com/operating-scylla/security/auditing/#table-storage).
The audit name must be different from all other audits, unless IF NOT
EXISTS precedes it, in which case the existing audit must be identical
to the new definition. Case sensitivity and length limit are the same
as for table names.
A trigger kind (ie, `KEYSPACE`, `TABLE`, `ROLE`, or `CATEGORY`) can be
specified at most once.
## Show an audit
```cql
DESCRIBE AUDIT [audit-name ...];
```
Prints definitions of all audits named herein. If no names are
provided, prints all audits.
## Delete an audit
```cql
DROP AUDIT audit-name;
```
Stops logging events specified by this audit. Doesn't impact the
already logged events. If the target is a table, it remains as it is.
## Alter an audit
```cql
ALTER AUDIT audit-name WITH {same syntax as CREATE}
```
Any trigger provided will be updated (or newly created, if previously
absent). To drop a trigger, use `IN *`.
## Permissions
Only superusers can modify audits or turn them on and off.
Only superusers can read tables that are audit targets; no user can
modify them. Only superusers can drop tables that are audit targets,
after the audit itself is dropped. If a superuser doesn't drop a
target table, it remains in existence indefinitely.
# Implementation
## Efficient trigger evaluation
```c++
namespace audit {
/// Stores triggers from an AUDIT statement.
class triggers {
// Use trie structures for speedy string lookup.
optional<trie> _ks_trigger, _tbl_trigger, _usr_trigger;
// A logical-AND filter.
optional<unsigned> _cat_trigger;
public:
/// True iff every non-null trigger matches the corresponding ainf element.
bool should_audit(const audit_info& ainf);
};
} // namespace audit
```
To prevent modification of target tables, `audit::inspect()` will
check the statement and throw if it is disallowed, similar to what
`check_access()` currently does.
## Persisting audit definitions
Obviously, an audit definition must survive a server restart and stay
consistent among all nodes in a cluster. We'll accomplish both by
storing audits in a system table.

View File

@@ -1,155 +0,0 @@
# Comparing Build Systems: configure.py vs CMake
ScyllaDB has two build systems: the primary `configure.py` + Ninja pipeline
and an alternative CMake build (used mainly for IDE integration — CLion,
clangd, etc.). Both must produce equivalent compilation and link commands.
`scripts/compare_build_systems.py` verifies this by parsing the `build.ninja`
files generated by each system and comparing:
1. **Per-file compilation flags** — defines, warnings, optimization, language
flags for every Scylla source file.
2. **Link target sets** — are the same executables produced by both systems?
3. **Per-target linker settings** — link flags and libraries for every common
executable.
`configure.py` is treated as the baseline. CMake should match it.
## Quick start
```bash
# Compare a single mode
scripts/compare_build_systems.py -m dev
# Compare all modes
scripts/compare_build_systems.py
# Verbose output — show per-file and per-target differences
scripts/compare_build_systems.py -m debug -v
```
The script automatically configures both build systems into a temporary
directory for every run — the user's existing build tree is never touched.
No manual `configure.py` or `cmake` invocation is required.
## Mode mapping
| configure.py | CMake |
|--------------|------------------|
| `debug` | `Debug` |
| `dev` | `Dev` |
| `release` | `RelWithDebInfo` |
| `sanitize` | `Sanitize` |
| `coverage` | `Coverage` |
## Examples
```bash
# Check dev mode only (fast, most common during development)
scripts/compare_build_systems.py -m dev
# Check all modes
scripts/compare_build_systems.py
# CI mode: quiet, strict (exit 1 on any diff)
scripts/compare_build_systems.py --ci
# Verbose output for debugging a specific mode
scripts/compare_build_systems.py -m sanitize -v
# Quiet mode — only prints summary and errors
scripts/compare_build_systems.py -m dev -q
```
## Exit codes
| Code | Meaning |
|------|--------------------------------------------------------------------------|
| `0` | All checked modes match |
| `1` | Differences found |
| `2` | Configuration failure or some modes could not be compared (e.g. skipped) |
## What it ignores
The script intentionally ignores certain structural differences that are
inherent to how the two build systems work:
- **Include paths** (`-I`, `-isystem`) — directory layout differs between
the two systems.
- **LTO/PGO flags** — these are configuration-dependent options, not
mode-inherent.
- **Internal library targets** — CMake creates intermediate static/shared
libraries (e.g., `scylla-main`, `test-lib`, abseil targets) while
`configure.py` links `.o` files directly.
- **Per-component Boost defines** — CMake adds `BOOST_REGEX_DYN_LINK` etc.
per component; `configure.py` uses a single `BOOST_ALL_DYN_LINK`.
## Typical workflow
After modifying `CMakeLists.txt` or `cmake/mode.*.cmake`:
```bash
# 1. Run the comparison (auto-configures both systems in a temp dir)
scripts/compare_build_systems.py -m dev -v
# 2. Fix any differences, repeat
```
## AI agent workflow
When the script reports mismatches, you can paste its summary output into
an AI coding agent (GitHub Copilot, etc.) and ask it to fix the
discrepancies. The agent has access to both `configure.py` and the
CMake files and can resolve most differences automatically.
### Example interaction
**1. Run the script:**
```bash
scripts/compare_build_systems.py
```
**2. Copy the summary and paste it to the agent:**
> I ran `scripts/compare_build_systems.py` and got:
>
> ```
> Summary
> ══════════════════════════════════════════════════════════════════════
> debug (CMake: Debug ): ✗ MISMATCH
> Compilation: 3 files with flag diffs, 1 sources only in configure.py
> only-configure.py defines: -DSOME_FLAG (3 files)
> Link targets: 1 only in configure.py
> Linker: 2 targets with lib diffs
> lib only in CMake: boost_filesystem (2 targets)
> dev (CMake: Dev ): ✗ MISMATCH
> Compilation: 1 sources only in configure.py
> Link targets: 1 only in configure.py
> release (CMake: RelWithDebInfo ): ✓ MATCH
> sanitize (CMake: Sanitize ): ✓ MATCH
> coverage (CMake: Coverage ): ✓ MATCH
> ```
>
> Please fix all issues and commit according to project guidelines.
**3. The agent will:**
- Identify each discrepancy (missing sources, missing targets, extra
libraries, missing defines).
- Trace root causes — e.g., a test added to `configure.py` but not to
`test/boost/CMakeLists.txt`, or an unnecessary `Boost::filesystem`
link in a CMake target.
- Apply fixes to the appropriate `CMakeLists.txt` files.
- Re-run cmake and the comparison script to verify the fix.
- Commit each fix to the correct commit in the series (using
`git commit --fixup` + `git rebase --autosquash`).
### Tips
- **Paste the full summary block** — the inline diff details (compilation,
link targets, linker) give the agent enough context to act without
scrolling through verbose output.
- **Use `-v` for stubborn issues** — if the agent needs per-file or
per-target detail, re-run with `-v` and paste the relevant section.

View File

@@ -1,81 +0,0 @@
# Counters
Counters are special kinds of cells which value can only be incremented, decremented, read and (with some limitations) deleted. In particular, once deleted, that counter cannot be used again. For example:
```cql
> UPDATE cf SET my_counter = my_counter + 6 WHERE pk = 0
> SELECT * FROM cf;
pk | my_counter
----+------------
0 | 6
(1 rows)
> UPDATE cf SET my_counter = my_counter - 1 WHERE pk = 0
> SELECT * FROM cf;
pk | my_counter
----+------------
0 | 5
(1 rows)
> DELETE my_counter FROM cf WHERE pk = 0;
> SELECT * FROM cf;
pk | my_counter
----+------------
(0 rows)
> UPDATE cf SET my_counter = my_counter + 3 WHERE pk = 0
> SELECT * FROM cf;
pk | my_counter
----+------------
(0 rows)
```
## Counters representation
Counters are represented as sets of, so called, shards which are triples containing:
* counter id uuid identifying the writer owning that shard (see below)
* logical clock incremented each time the owning writer modifies the shard value
* current value sum of increments and decrements done by the owning writer
During each write operation one of the replicas is chosen as a leader. The leader reads its shard, increments logical clock, updates current value and then sends the new version of its shard to the other replicas.
Shards owned by the same writer are merged (see below) so that each counter cell contains only one shard per counter id. Reading the actual counter value requires summing values of all shards.
### Counter id
The counter id is a 128-bit UUID that identifies which writer owns a shard. How it is assigned depends on whether the table uses vnodes or tablets.
**Vnodes:** the counter id is the host id of the node that owns the shard. Each node in the cluster gets a unique counter id, so the number of shards in a counter cell grows with the number of distinct nodes that have ever written to it.
**Tablets:** the counter id is rack-based rather than node-based. It is a deterministic type-3 (name-based) UUID derived from the string `"<datacenter>:<rack>"`. All nodes in the same rack share the same counter id.
During tablet migration, since there are two active replicas in a rack and in order to avoid conflicts, the node that is a *pending replica* uses the **negated** rack UUID as its counter id.
This bounds the number of shards in a counter cell to at most `2 × (number of racks)` regardless of node replacements.
### Merging and reconciliation
Reconciliation of two counters requires merging all shards belonging to the same counter id. The rule is: the shard with the highest logical clock wins.
Since support of deleting counters is limited so that once deleted they cannot be used again, during reconciliation tombstones win with live counter cell regardless of their timestamps.
### Digest
Computing a digest of counter cells needs to be done based solely on the shard contents (counter id, value, logical clock) rather than any structural metadata.
## Writes
1. Counter update starts with a client sending counter delta as a long (CQL3 `bigint`) to the coordinator.
2. CQL3 creates a `CounterMutation` containing a `counter_update` cell which is just a delta.
3. Coordinator chooses the leader of the counter update and sends it the mutation. The leader is always one of the replicas owning the partition the modified counter belongs to.
4. Now, the leader needs to transform counter deltas into shards. To do that it reads the current value of the shard it owns, and produces a new shard with the value modified by the delta and the logical clock incremented.
5. The mutation with the newly created shard is both used to update the memtable on the leader as well as sent to the other nodes for replication.
### Choosing leader
Choosing a replica which becomes a leader for a counter update is completely at the coordinator discretion. It is not a static role in any way and any concurrent update could be forwarded to a different leader. This means that all problems related to leader election are avoided.
The coordinator chooses the leader using the following algorithm:
1. If the coordinator can be a leader it chooses itself.
2. Otherwise, a random replica from the local DC is chosen.
3. If there is no eligible node available in the local DC the replica closest to the coordinator (according to the snitch) is chosen.
## Reads
Querying counter values is much simpler than updating it. First part of the read operation is performed as for all other cell types. When counter cells from different sources are being reconciled their shards are merged. Once the final counter cell value is known and the `CounterCell` is serialised, current values of all shards are summed up and the output of serialisation is a long integer.

View File

@@ -192,10 +192,14 @@ For example, to configure ScyllaDB to use listen address `10.0.0.5`:
$ docker run --name some-scylla -d scylladb/scylla --listen-address 10.0.0.5
```
**Since: 1.4**
#### `--alternator-address ADDR`
The `--alternator-address` command line option configures the Alternator API listen address. The default value is the same as `--listen-address`.
**Since: 3.2**
#### `--alternator-port PORT`
The `--alternator-port` command line option configures the Alternator API listen port. The Alternator API is disabled by default. You need to specify the port to enable it.
@@ -206,16 +210,22 @@ For example, to configure ScyllaDB to listen to Alternator API at port `8000`:
$ docker run --name some-scylla -d scylladb/scylla --alternator-port 8000
```
**Since: 3.2**
#### `--alternator-https-port PORT`
The `--alternator-https-port` option is similar to `--alternator-port`, just enables an encrypted (HTTPS) port. Either the `--alternator-https-port` or `--alternator-http-port`, or both, can be used to enable Alternator.
Note that the `--alternator-https-port` option also requires that files `/etc/scylla/scylla.crt` and `/etc/scylla/scylla.key` be inserted into the image. These files contain an SSL certificate and key, respectively.
**Since: 4.2**
#### `--alternator-write-isolation policy`
The `--alternator-write-isolation` command line option chooses between four allowed write isolation policies described in docs/alternator/alternator.md. This option must be specified if Alternator is enabled - it does not have a default.
**Since: 4.1**
#### `--broadcast-address ADDR`
The `--broadcast-address` command line option configures the IP address the ScyllaDB instance tells other ScyllaDB nodes in the cluster to connect to.
@@ -294,6 +304,8 @@ For example, to skip running I/O setup:
$ docker run --name some-scylla -d scylladb/scylla --io-setup 0
```
**Since: 4.3**
#### `--cpuset CPUSET`
The `--cpuset` command line option restricts ScyllaDB to run on only on CPUs specified by `CPUSET`.
@@ -329,18 +341,26 @@ For example, to enable the User Defined Functions (UDF) feature:
$ docker run --name some-scylla -d scylladb/scylla --experimental-feature=udf
```
**Since: 2.0**
#### `--disable-version-check`
The `--disable-version-check` disable the version validation check.
**Since: 2.2**
#### `--authenticator AUTHENTICATOR`
The `--authenticator` command lines option allows to provide the authenticator class ScyllaDB will use. By default ScyllaDB uses the `AllowAllAuthenticator` which performs no credentials checks. The second option is using the `PasswordAuthenticator` parameter, which relies on username/password pairs to authenticate users.
**Since: 2.3**
#### `--authorizer AUTHORIZER`
The `--authorizer` command lines option allows to provide the authorizer class ScyllaDB will use. By default ScyllaDB uses the `AllowAllAuthorizer` which allows any action to any user. The second option is using the `CassandraAuthorizer` parameter, which stores permissions in `system.permissions` table.
**Since: 2025.4**
#### `--dc NAME`
The `--dc` command line option sets the datacenter name for the ScyllaDB node.

View File

@@ -37,17 +37,8 @@ Global index's target is usually just the indexed column name, unless the index
- index on map, set or list values: VALUES(v)
- index on map entries: ENTRIES(v)
Their serialization uses lowercase type names as prefixes, except for `full` which is serialized
as just the column name (without any prefix):
`"v"`, `"keys(v)"`, `"values(v)"`, `"entries(v)"` are valid targets; a frozen full collection
index on column `v` is stored simply as `"v"` (same as a regular index).
If the column name contains characters that could be confused with the above formats
(e.g., a name containing parentheses or braces), it is escaped using the CQL
quoted-identifier syntax (column_identifier::to_cql_string()), which wraps the
name in double quotes and doubles any embedded double-quote characters. For example,
a column named `hEllo` is stored as `"hEllo"`, and a column named `keys(m)` is
stored as `"keys(m)"`.
Their serialization is just string representation, so:
"v", "FULL(v)", "KEYS(v)", "VALUES(v)", "ENTRIES(v)" are all valid targets.
## Local index

View File

@@ -1,67 +0,0 @@
# System Keyspaces Overview
This page gives a high-level overview of several internal keyspaces and what they are used for.
## Table of Contents
- [system_replicated_keys](#system_replicated_keys)
- [system_distributed](#system_distributed)
- [system_distributed_everywhere](#system_distributed_everywhere)
- [system_auth](#system_auth)
- [system](#system)
- [system_schema](#system_schema)
- [system_traces](#system_traces)
- [system_audit/audit](#system_auditaudit)
## `system_replicated_keys`
Internal keyspace for encryption-at-rest key material used by the replicated key provider. It stores encrypted data keys so nodes can retrieve the correct key IDs when reading encrypted data.
This keyspace is created as an internal system keyspace and uses `EverywhereStrategy` so key metadata is available on every node. It is not intended for user data.
## `system_distributed`
Internal distributed metadata keyspace used for cluster-wide coordination data that is shared across nodes.
In practice, it is used for metadata such as:
- materialized view build coordination state
- CDC stream/timestamp metadata exposed to clients
- service level definitions used by workload prioritization
This keyspace is managed by Scylla and is not intended for application tables.
It is created as an internal keyspace (historically with `SimpleStrategy` and RF=3 by default).
## `system_distributed_everywhere`
Legacy keyspace. It is no longer used.
## `system_auth`
Legacy auth keyspace name kept primarily for compatibility.
Auth tables have moved to the `system` keyspace (`roles`, `role_members`, `role_permissions`, and related auth state). `system_auth` may still exist for compatibility with legacy tooling/queries, but it is no longer where current auth state is primarily stored.
## `system`
This keyspace is local one, so each node has its own, independent content for tables in this keyspace. For some tables, the content is coordinated at a higher level (RAFT), but not via the traditional replication systems (storage proxy).
See the detailed table-level documentation here: [system_keyspace](system_keyspace.md)
## `system_schema`
This keyspace is local one, so each node has its own, independent content for tables in this keyspace. All tables in this keyspace are coordinated via the schema replication system.
See the detailed table-level documentation here: [system_schema_keyspace](system_schema_keyspace.md)
## `system_traces`
Internal tracing keyspace used for query tracing and slow-query logging records (`sessions`, `events`, and related index/log tables).
This keyspace is written by Scylla's tracing subsystem for diagnostics and observability. It is operational metadata, not user application data (historically created with `SimpleStrategy` and RF=2).
## `system_audit`/`audit`
Internal audit-logging keyspace used to persist audit events when table-backed auditing is enabled.
Scylla's audit table storage is implemented as an internal audit keyspace for audit records (for example, auth/admin/DCL activity depending on audit configuration). In current code this keyspace is named `audit`, while operational material may refer to it as its historical name (`system_audit`). It is intended for security/compliance observability, not for application data.

View File

@@ -1611,7 +1611,6 @@ CREATE TABLE system.topology (
cleanup_status text,
datacenter text,
ignore_msb int,
intended_storage_mode text,
node_state text,
num_tokens int,
rack text,
@@ -1664,7 +1663,6 @@ CREATE TABLE system.topology (
- `tokens_string`: Alternative representation of tokens
- `shard_count`: Number of shards on the node
- `ignore_msb`: MSB bits to ignore for token calculation
- `intended_storage_mode`: Intended storage mode for tables under vnodes-to-tablets migration. The node switches to this mode on next restart.
- `cleanup_status`: Status of cleanup operations
- `supported_features`: Features supported by this node
- `request_id`: ID of the current topology request for this node

View File

@@ -700,7 +700,6 @@ CREATE TABLE system.topology (
host_id uuid,
datacenter text,
ignore_msb int,
intended_storage_mode text,
node_state text,
num_tokens int,
rack text,
@@ -742,7 +741,6 @@ Each node has a clustering row in the table where its `host_id` is the clusterin
- `datacenter` - a name of the datacenter the node belongs to
- `rack` - a name of the rack the node belongs to
- `ignore_msb` - the value of the node's `murmur3_partitioner_ignore_msb_bits` parameter
- `intended_storage_mode` - if set, it indicates the intended storage mode for tables under vnodes-to-tablets migration
- `shard_count` - the node's `smp::count`
- `release_version` - the node's `version::current()` (corresponding to a Cassandra version, used by drivers)
- `node_state` - current state of the node (as described earlier)

View File

@@ -1,10 +0,0 @@
# Vector index in Scylla
Vector indexes are custom indexes (USING 'vector\_index'). Their `target` option in `system_schema.indexes` uses following format:
- Simple single-column vector index `(v)`: just the (escaped) column name, e.g. `v`
- Vector index with filtering columns `(v, f1, f2)`: JSON with `tc` (target column) and `fc` (filtering columns): `{"tc":"v","fc":["f1","f2"]}`
- Local vector index `((p1, p2), v)`: JSON with `tc` and `pk` (partition key columns): `{"tc":"v","pk":["p1","p2"]}`
- Local vector index with filtering columns `((p1, p2), v, f1, f2)`: JSON with `tc`, `pk`, and `fc`: `{"tc":"v","pk":["p1","p2"],"fc":["f1","f2"]}`
The `target` option acts as the interface for the vector-store service, providing the metadata necessary to determine which columns are indexed and how they are structured.

View File

@@ -289,7 +289,7 @@ Yes, but it will require running a full repair (or cleanup) to change the replic
- If you're reducing the replication factor, run ``nodetool cleanup <updated Keyspace>`` on the keyspace you modified to remove surplus replicated data.
Cleanup runs on a per-node basis.
- If you're increasing the replication factor, refer to :doc:`How to Safely Increase the RF </kb/rf-increase>`
- Note that you need to provide the keyspace name. If you do not, the cleanup or repair operation runs on all keyspaces for the specific node.
- Note that you need to provide the keyspace namr. If you do not, the cleanup or repair operation runs on all keyspaces for the specific node.
Why can't I set ``listen_address`` to listen to 0.0.0.0 (all my addresses)?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

View File

@@ -181,7 +181,6 @@ internode_compression controls whether traffic between nodes is compressed.
* all - all traffic is compressed.
* dc - traffic between different datacenters is compressed.
* rack - traffic between different racks is compressed.
* none - nothing is compressed (default).
Configuring TLS/SSL in scylla.yaml

View File

@@ -10,7 +10,6 @@ ScyllaDB Configuration Procedures
How to do a Rolling Restart <rolling-restart>
Advanced Internode (RPC) Compression <advanced-internode-compression>
Shared-dictionary compression for SSTables <sstable-dictionary-compression>
Migrate a Keyspace from Vnodes to Tablets <migrate-vnodes-to-tablets>
Procedures to change ScyllaDB Configuration settings.
@@ -23,5 +22,3 @@ Procedures to change ScyllaDB Configuration settings.
* :doc:`Advanced Internode (RPC) Compression </operating-scylla/procedures/config-change/advanced-internode-compression>`
* :doc:`Shared-dictionary compression for SSTables </operating-scylla/procedures/config-change/sstable-dictionary-compression>`
* :doc:`Migrate a Keyspace from Vnodes to Tablets </operating-scylla/procedures/config-change/migrate-vnodes-to-tablets>`

View File

@@ -1,393 +0,0 @@
Migrate a Keyspace from Vnodes to Tablets
==========================================
This procedure describes how to migrate an existing keyspace from vnodes
to tablets. Tablets are designed to be the long-term replacement for vnodes,
offering numerous benefits such as faster topology operations, automatic load
balancing, automatic cleanups, and improved streaming performance. Migrating to
tablets is strongly recommended. See :doc:`Data Distribution with Tablets </architecture/tablets/>`
for details.
.. note::
The migration is an online operation. This means that the keyspace remains
fully available to users throughout the migration, provided that its
replication factor is greater than 1. Reads and writes continue to be served
using vnodes until the migration is finished.
.. warning::
During the migration, you should expect degraded performance on the migrating
keyspace. The reasons are the following:
* **Rolling restart**: Each node must upgrade its storage from vnodes to
tablets. This is an offline operation happening on startup, so a restart is
needed. Upon restart, each node performs a heavy and time-consuming
resharding operation to reorganize its data based on tablets, and remains
offline until this operation completes. Resharding may last from minutes to
hours, depending on the amount of data that the node holds. At this time,
the node cannot serve any requests.
* **Unbalanced tablets**: The initial tablet layout mirrors the vnode layout.
The tablet load balancer does not rebalance tablets until the migration is
finished, so some shards may carry more data than others during the
migration. The imbalance is expected to be more prominent in clusters with
very large nodes (hundreds of vCPUs).
* **Loss of shard awareness**: During the migration and until the rolling
restart is complete, the cluster is in a mixed state with some nodes using
vnodes and others using tablets. In this state, queries may cause
cross-shard operations within nodes, reducing performance.
The performance will return to normal after the migration finishes and the
tablet load balancer rebalances the data.
Prerequisites
-------------
* All nodes in the cluster must be **up and running**. You can check the status
of all nodes with
:doc:`nodetool status </operating-scylla/nodetool-commands/status/>`.
* All nodes must be running ScyllaDB 2026.2 or later.
Limitations
-----------
The current migration procedure has the following limitations:
* The total number of **vnode tokens** in the cluster must be a **power of two**
and the tokens must be **evenly spaced** across the token ring. This is
verified automatically when starting the migration.
* **No schema changes** during the migration. Do not create, alter, or drop
tables in the migrating keyspace until the migration is finished.
* **No topology changes** during the migration. Do not add, remove, decommission,
or replace nodes while a migration is in progress.
* **No TRUNCATE** on tables in the migrating keyspace during the migration.
* Only **CQL base tables** can be migrated. Materialized views, secondary
indexes, CDC tables, and Alternator tables are not supported.
* Tables with **counters** or **LWTs** cannot be migrated.
Overview
--------
The migration consists of three phases:
1. **Prepare**: Create tablet maps for all tables in the keyspace. Each tablet
inherits its token range and replica set from the corresponding vnode range.
2. **Storage upgrade**: Restart each node one at a time, upgrading its storage
from vnodes to tablets. Upon restart, the node begins resharding data into
tablets. This is a storage-layer operation and is unrelated to ScyllaDB
version upgrades.
3. **Finalize**: Once all nodes have been upgraded, commit the migration by
clearing the migration state and switching the keyspace schema to tablets.
During the first two phases, the migration is reversible; you can roll back to
vnodes. However, once the migration is finalized, it cannot be reversed.
.. note::
In the following sections, any reference to "upgrade" or "downgrade" of a
node will refer to the migration of its storage from vnodes to tablets or
vice versa. Do not confuse it with version upgrades/downgrades.
Procedure
---------
#. Prepare the keyspace for migration:
#. Create tablet maps for all tables in the keyspace:
.. code-block:: console
scylla nodetool migrate-to-tablets start <keyspace>
#. Verify that the keyspace is in ``migrating_to_tablets`` state and all nodes are still using vnodes:
.. code-block:: console
scylla nodetool migrate-to-tablets status <keyspace>
**Example:**
.. code-block:: console
$ scylla nodetool migrate-to-tablets status ks
Keyspace: ks
Status: migrating_to_tablets
Nodes:
Host ID Status
99d8de76-3954-4727-911a-6a07251b180c uses vnodes
0b5fd6f6-9670-4faf-a480-ad58cf119007 uses vnodes
017dd39a-3d06-4c8a-8ac4-379f9e595607 uses vnodes
.. _upgrade-nodes:
#. Upgrade all nodes to tablets:
#. Pick a node.
#. Mark the node for upgrade to tablets:
.. note::
This is a node-local operation. Use the IP address of the node that
you are upgrading.
.. caution::
Do not mark more than one node for upgrade at the same time. Even if
you restart them serially, unexpected restarts can happen for various
reasons (crashes, power failures, etc.) leading to parallel node
upgrades which can reduce availability.
.. code-block:: console
scylla nodetool -h <node-ip> migrate-to-tablets upgrade
#. Verify that the node status changed from ``vnodes`` to ``migrating to tablets``:
.. code-block:: console
scylla nodetool migrate-to-tablets status <keyspace>
**Example:**
.. code-block:: console
$ scylla nodetool migrate-to-tablets status ks
Keyspace: ks
Status: migrating_to_tablets
Nodes:
Host ID Status
99d8de76-3954-4727-911a-6a07251b180c migrating to tablets <---
0b5fd6f6-9670-4faf-a480-ad58cf119007 uses vnodes
017dd39a-3d06-4c8a-8ac4-379f9e595607 uses vnodes
#. Drain and stop the node:
.. code-block:: console
scylla nodetool -h <node-ip> drain
.. include:: /rst_include/scylla-commands-stop-index.rst
#. Restart the node:
.. include:: /rst_include/scylla-commands-start-index.rst
#. Wait until the node is UP and has returned to the ScyllaDB cluster using :doc:`nodetool status </operating-scylla/nodetool-commands/status/>`.
This operation may take a long time due to resharding. To monitor
resharding progress, use the task manager API:
.. code-block:: console
scylla nodetool tasks list compaction -h <node-ip> --keyspace <keyspace> | grep -i reshard
#. Verify that the node status changed from ``migrating to tablets`` to ``uses tablets``:
.. code-block:: console
scylla nodetool migrate-to-tablets status <keyspace>
**Example:**
.. code-block:: console
$ scylla nodetool migrate-to-tablets status ks
Keyspace: ks
Status: migrating_to_tablets
Nodes:
Host ID Status
99d8de76-3954-4727-911a-6a07251b180c uses tablets <---
0b5fd6f6-9670-4faf-a480-ad58cf119007 uses vnodes
017dd39a-3d06-4c8a-8ac4-379f9e595607 uses vnodes
#. Move to the next node and repeat from step a until all nodes are upgraded.
#. Finalize the migration:
.. warning::
Finalization **cannot be undone**. Once the migration is finalized, the
keyspace cannot be switched back to vnodes.
#. Issue the finalization request:
.. code-block:: console
scylla nodetool migrate-to-tablets finalize <keyspace>
#. Verify that the keyspace status changed to ``tablets``:
.. code-block:: console
scylla nodetool migrate-to-tablets status <keyspace>
**Example:**
.. code-block:: console
$ scylla nodetool migrate-to-tablets finalize ks
Keyspace: ks
Status: tablets
Rollback Procedure
------------------
.. note::
Rollback is only possible **before finalization**. Once the migration is
finalized, it cannot be reversed.
If you need to abort the migration **before finalization**, you can roll back
by downgrading each node back to vnodes. The rollback procedure is the
following:
#. Find all nodes that have been upgraded to tablets (status: ``uses tablets``)
or they are in the process of upgrading to tablets (status: ``migrating to tablets``):
.. code-block:: console
scylla nodetool migrate-to-tablets status <keyspace>
**Example:**
.. code-block:: console
$ scylla nodetool migrate-to-tablets status ks
Keyspace: ks
Status: migrating_to_tablets
Nodes:
Host ID Status
99d8de76-3954-4727-911a-6a07251b180c uses tablets <---
0b5fd6f6-9670-4faf-a480-ad58cf119007 migrating to tablets <---
017dd39a-3d06-4c8a-8ac4-379f9e595607 uses vnodes
#. For **each upgraded or upgrading node** in the cluster, perform a downgrade
(one node at a time):
#. Mark the node for downgrade:
.. code-block:: console
scylla nodetool -h <node-ip> migrate-to-tablets downgrade
#. Check the node status. The status for a previously upgraded node should
change from ``uses tablets`` to ``migrating to vnodes``. The status for a
previously upgrading node should change from ``migrating to tablets`` to
``uses vnodes`` or ``migrating to vnodes``:
.. code-block:: console
scylla nodetool migrate-to-tablets status <keyspace>
**Example:**
.. code-block:: console
$ scylla nodetool migrate-to-tablets status ks
Keyspace: ks
Status: migrating_to_tablets
Nodes:
Host ID Status
99d8de76-3954-4727-911a-6a07251b180c migrating to vnodes <---
0b5fd6f6-9670-4faf-a480-ad58cf119007 migrating to tablets
017dd39a-3d06-4c8a-8ac4-379f9e595607 uses vnodes
#. If the node status is ``uses vnodes``, the downgrade is complete. Move to
the next node and repeat from step a.
#. If the node is ``migrating to vnodes``, restart it to complete the
downgrade:
#. Drain and stop the node:
.. code-block:: console
scylla nodetool -h <node-ip> drain
.. include:: /rst_include/scylla-commands-stop-index.rst
#. Restart the node:
.. include:: /rst_include/scylla-commands-start-index.rst
#. Wait until the node is UP and has returned to the ScyllaDB cluster using :doc:`nodetool status </operating-scylla/nodetool-commands/status/>`.
This operation may take a long time due to resharding. To monitor
resharding progress, use the task manager API:
.. code-block:: console
scylla nodetool tasks list compaction -h <node-ip> --keyspace <keyspace> | grep -i reshard
#. Verify that the node status changed from ``migrating to vnodes`` to ``uses vnodes``:
.. code-block:: console
scylla nodetool migrate-to-tablets status <keyspace>
**Example:**
.. code-block:: console
$ scylla nodetool migrate-to-tablets status ks
Keyspace: ks
Status: migrating_to_tablets
Nodes:
Host ID Status
99d8de76-3954-4727-911a-6a07251b180c uses vnodes <---
0b5fd6f6-9670-4faf-a480-ad58cf119007 migrating to tablets
017dd39a-3d06-4c8a-8ac4-379f9e595607 uses vnodes
#. Move to the next node and repeat from step a until all nodes are
downgraded.
#. Once all nodes have been downgraded, finalize the rollback:
.. code-block:: console
scylla nodetool migrate-to-tablets finalize <keyspace>
Migrating multiple keyspaces
----------------------------
Migrating multiple keyspaces simultaneously is supported. The procedure is the
same as with a single keyspace except that the preparation and finalization
steps need to be repeated for each keyspace. However, note that a new migration
cannot be started once another migration is in the upgrade phase. The migrations
need to be prepared and finalized together.
To migrate multiple keyspaces simultaneously, follow these steps:
#. For **each keyspace**, prepare it for migration:
.. code-block:: console
scylla nodetool migrate-to-tablets start <keyspace1>
scylla nodetool migrate-to-tablets start <keyspace2>
...
Verify that all keyspaces are in ``migrating_to_tablets`` state before
proceeding:
.. code-block:: console
scylla nodetool migrate-to-tablets status <keyspace1>
scylla nodetool migrate-to-tablets status <keyspace2>
...
#. Upgrade all nodes in the cluster following the same :ref:`procedure <upgrade-nodes>`
as for a single keyspace. Each node restart reshards all keyspaces under
migration in one pass.
#. For **each keyspace**, finalize the migration:
.. code-block:: console
scylla nodetool migrate-to-tablets finalize <keyspace1>
scylla nodetool migrate-to-tablets finalize <keyspace2>
...

View File

@@ -2,8 +2,8 @@
ScyllaDB Auditing Guide
========================
Auditing allows the administrator to monitor activities on a ScyllaDB cluster, including CQL queries and data changes, as well as Alternator (DynamoDB-compatible API) requests.
The information is stored in a Syslog or a ScyllaDB table.
Auditing allows the administrator to monitor activities on a Scylla cluster, including queries and data changes.
The information is stored in a Syslog or a Scylla table.
Prerequisite
------------
@@ -14,15 +14,15 @@ Enable ScyllaDB :doc:`Authentication </operating-scylla/security/authentication>
Enabling Audit
---------------
By default, auditing is **enabled** with the ``table`` backend. Enabling auditing is controlled by the ``audit:`` parameter in the ``scylla.yaml`` file.
By default, table auditing is **enabled**. Enabling auditing is controlled by the ``audit:`` parameter in the ``scylla.yaml`` file.
You can set the following options:
* ``none`` - Audit is disabled.
* ``table`` - Audit is enabled, and messages are stored in a ScyllaDB table (default).
* ``table`` - Audit is enabled, and messages are stored in a Scylla table (default).
* ``syslog`` - Audit is enabled, and messages are sent to Syslog.
* ``syslog,table`` - Audit is enabled, and messages are stored in a ScyllaDB table and sent to Syslog.
* ``syslog,table`` - Audit is enabled, and messages are stored in a Scylla table and sent to Syslog.
Configuring any other value results in an error at ScyllaDB startup.
Configuring any other value results in an error at Scylla startup.
Configuring Audit
-----------------
@@ -34,9 +34,7 @@ Flag Default Value Description
================== ================================== ========================================================================================================================
audit_categories "DCL,AUTH,ADMIN" Comma-separated list of statement categories that should be audited
------------------ ---------------------------------- ------------------------------------------------------------------------------------------------------------------------
audit_tables “” Comma-separated list of table names that should be audited, in the format ``<keyspace_name>.<table_name>``.
For Alternator tables use the ``alternator.<table_name>`` format (see :ref:`alternator-auditing`).
audit_tables “” Comma-separated list of table names that should be audited, in the format of <keyspacename>.<tablename>
------------------ ---------------------------------- ------------------------------------------------------------------------------------------------------------------------
audit_keyspaces “” Comma-separated list of keyspaces that should be audited. You must specify at least one keyspace.
If you leave this option empty, no keyspace will be audited.
@@ -49,137 +47,30 @@ You can use DCL, AUTH, and ADMIN audit categories without including any keyspace
audit_categories parameter description
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
========= ========================================================================================= ====================
Parameter Logs Description Applies To
========= ========================================================================================= ====================
AUTH Logs login events CQL
--------- ----------------------------------------------------------------------------------------- --------------------
DML Logs insert, update, delete, and other data manipulation language (DML) events CQL, Alternator
--------- ----------------------------------------------------------------------------------------- --------------------
DDL Logs object and role create, alter, drop, and other data definition language (DDL) events CQL, Alternator
--------- ----------------------------------------------------------------------------------------- --------------------
DCL Logs grant, revoke, create role, drop role, and list roles events CQL
--------- ----------------------------------------------------------------------------------------- --------------------
QUERY Logs all queries CQL, Alternator
--------- ----------------------------------------------------------------------------------------- --------------------
ADMIN Logs service level operations: create, alter, drop, attach, detach, list. CQL
========= =========================================================================================
Parameter Logs Description
========= =========================================================================================
AUTH Logs login events
--------- -----------------------------------------------------------------------------------------
DML Logs insert, update, delete, and other data manipulation language (DML) events
--------- -----------------------------------------------------------------------------------------
DDL Logs object and role create, alter, drop, and other data definition language (DDL) events
--------- -----------------------------------------------------------------------------------------
DCL Logs grant, revoke, create role, drop role, and list roles events
--------- -----------------------------------------------------------------------------------------
QUERY Logs all queries
--------- -----------------------------------------------------------------------------------------
ADMIN Logs service level operations: create, alter, drop, attach, detach, list.
For :ref:`service level <workload-priorization-service-level-management>`
auditing.
========= ========================================================================================= ====================
For details on auditing Alternator operations, see :ref:`alternator-auditing`.
========= =========================================================================================
Note that enabling audit may negatively impact performance and audit-to-table may consume extra storage. That's especially true when auditing DML and QUERY categories, which generate a high volume of audit messages.
.. _alternator-auditing:
Auditing Alternator Requests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When auditing is enabled, Alternator (DynamoDB-compatible API) requests are audited using the same
backends and the same filtering configuration (``audit_categories``, ``audit_keyspaces``,
``audit_tables``) as CQL operations. No additional configuration is needed.
Both successful and failed Alternator requests are audited.
Alternator Operation Categories
""""""""""""""""""""""""""""""""
Each Alternator API operation is assigned to one of the standard audit categories:
========= ====================================================================================================
Category Alternator Operations
========= ====================================================================================================
DDL CreateTable, DeleteTable, UpdateTable, TagResource, UntagResource, UpdateTimeToLive
--------- ----------------------------------------------------------------------------------------------------
DML PutItem, UpdateItem, DeleteItem, BatchWriteItem
--------- ----------------------------------------------------------------------------------------------------
QUERY GetItem, BatchGetItem, Query, Scan, DescribeTable, ListTables, DescribeEndpoints,
ListTagsOfResource, DescribeTimeToLive, DescribeContinuousBackups,
ListStreams, DescribeStream, GetShardIterator, GetRecords
========= ====================================================================================================
.. note:: AUTH, DCL, and ADMIN categories do not apply to Alternator operations. These categories
are specific to CQL authentication, authorization, and service-level management.
Operation Field Format
"""""""""""""""""""""""
For CQL operations, the ``operation`` field in the audit log contains the raw CQL query string.
For Alternator operations, the format is:
.. code-block:: none
<OperationName>|<JSON request body>
For example:
.. code-block:: none
PutItem|{"TableName":"my_table","Item":{"p":{"S":"pk_val"},"c":{"S":"ck_val"},"v":{"S":"data"}}}
.. note:: The full JSON request body is included in the ``operation`` field. For batch operations
(such as BatchWriteItem), this can be very large (up to 16 MB).
Keyspace and Table Filtering for Alternator
""""""""""""""""""""""""""""""""""""""""""""
The real keyspace name of an Alternator table ``T`` is ``alternator_T``.
The ``audit_tables`` config flag uses the shorthand format ``alternator.T`` to refer to such
tables -- the parser expands it to the real keyspace name automatically.
For ``audit_keyspaces``, use the real keyspace name directly.
For example, to audit an Alternator table called ``my_table_name`` use either of the below:
.. code-block:: yaml
# Using audit_tables - use 'alternator' as the keyspace name:
audit_tables: "alternator.my_table_name"
# Using audit_keyspaces - use the real keyspace name:
audit_keyspaces: "alternator_my_table_name"
**Global and batch operations**: Some Alternator operations are not scoped to a single table:
* ``ListTables`` and ``DescribeEndpoints`` have no associated keyspace or table.
* ``BatchWriteItem`` and ``BatchGetItem`` may span multiple tables.
These operations are logged whenever their category matches ``audit_categories``, regardless of
``audit_keyspaces`` or ``audit_tables`` filters. Their ``keyspace_name`` field is empty, and for
batch operations the ``table_name`` field contains a pipe-separated (``|``) list of all involved table names.
**DynamoDB Streams operations**: For streams-related operations (``DescribeStream``, ``GetShardIterator``,
``GetRecords``), the ``table_name`` field contains the base table name and the CDC log table name
separated by a pipe (e.g., ``my_table|my_table_scylla_cdc_log``).
Alternator Audit Log Examples
""""""""""""""""""""""""""""""
Syslog output example (PutItem):
.. code-block:: shell
Mar 18 10:15:03 ip-10-143-2-108 scylla-audit[28387]: node="10.143.2.108", category="DML", cl="LOCAL_QUORUM", error="false", keyspace="alternator_my_table", query="PutItem|{\"TableName\":\"my_table\",\"Item\":{\"p\":{\"S\":\"pk_val\"}}}", client_ip="127.0.0.1", table="my_table", username="anonymous"
Table output example (PutItem):
.. code-block:: shell
SELECT * FROM audit.audit_log ;
returns:
.. code-block:: none
date | node | event_time | category | consistency | error | keyspace_name | operation | source | table_name | username |
-------------------------+--------------+--------------------------------------+----------+--------------+-------+-----------------------+----------------------------------------------------------------------------------+-----------+------------+-----------+
2026-03-18 00:00:00+0000 | 10.143.2.108 | 3429b1a5-2a94-11e8-8f4e-000000000001 | DML | LOCAL_QUORUM | False | alternator_my_table | PutItem|{"TableName":"my_table","Item":{"p":{"S":"pk_val"}}} | 127.0.0.1 | my_table | anonymous |
(1 row)
Configuring Audit Storage
---------------------------
Auditing messages can be sent to :ref:`Syslog <auditing-syslog-storage>` or stored in a ScyllaDB :ref:`table <auditing-table-storage>` or both.
Auditing messages can be sent to :ref:`Syslog <auditing-syslog-storage>` or stored in a Scylla :ref:`table <auditing-table-storage>` or both.
.. _auditing-syslog-storage:
@@ -208,13 +99,13 @@ Storing Audit Messages in Syslog
# All tables in those keyspaces will be audited
audit_keyspaces: "mykespace"
#. Restart the ScyllaDB node.
#. Restart the Scylla node.
.. include:: /rst_include/scylla-commands-restart-index.rst
By default, audit messages are written to the same destination as ScyllaDB :doc:`logging </getting-started/logging>`, with ``scylla-audit`` as the process name.
By default, audit messages are written to the same destination as Scylla :doc:`logging </getting-started/logging>`, with ``scylla-audit`` as the process name.
Logging output example (CQL drop table):
Logging output example (drop table):
.. code-block:: shell
@@ -232,7 +123,7 @@ To redirect the Syslog output to a file, follow the steps below (available only
Storing Audit Messages in a Table
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Messages are stored in a ScyllaDB table named ``audit.audit_log``.
Messages are stored in a Scylla table named ``audit.audit_log``.
For example:
@@ -279,11 +170,11 @@ For example:
# All tables in those keyspaces will be audited
audit_keyspaces: "mykespace"
#. Restart the ScyllaDB node.
#. Restart Scylla node.
.. include:: /rst_include/scylla-commands-restart-index.rst
Table output example (CQL drop table):
Table output example (drop table):
.. code-block:: shell
@@ -305,7 +196,7 @@ Storing Audit Messages in a Table and Syslog Simultaneously
**Procedure**
#. Follow both procedures from above, and set the ``audit`` parameter in the ``scylla.yaml`` file to both ``syslog`` and ``table``. You need to restart ScyllaDB only once.
#. Follow both procedures from above, and set the ``audit`` parameter in the ``scylla.yaml`` file to both ``syslog`` and ``table``. You need to restart scylla only once.
To have both syslog and table you need to specify both backends separated by a comma:

View File

@@ -0,0 +1,41 @@
================
About Upgrade
================
ScyllaDB upgrade is a rolling procedure - it does not require a full cluster
shutdown and is performed without any downtime or disruption of service.
To ensure a successful upgrade, follow
the :doc:`documented upgrade procedures <upgrade-guides/index>` tested by
ScyllaDB. This means that:
* You should follow the upgrade policy:
* Starting with version **2025.4**, upgrades can **skip minor versions** if:
* They remain within the same major version (for example, upgrading
directly from *2025.1 → 2025.4* is supported).
* You upgrade to the next major version (for example, upgrading
directly from *2025.3 → 2026.1* is supported).
* For versions **prior to 2025.4**, upgrades must be performed consecutively—
each successive X.Y version must be installed in order, **without skipping
any major or minor version** (for example, upgrading directly from 2025.1 → 2025.3
is not supported).
* You cannot skip major versions. Upgrades must move from one major version to
the next using the documented major-version upgrade path.
* You should upgrade to a supported version of ScyllaDB.
See `ScyllaDB Version Support <https://docs.scylladb.com/stable/versioning/version-support.html>`_.
* Before you upgrade to the next version, the whole cluster (each node) must
be upgraded to the previous version.
* You cannot perform an upgrade by replacing the nodes in the cluster with new
nodes with a different ScyllaDB version. You should never add a new node with
a different version to a cluster - if you
:doc:`add a node </operating-scylla/procedures/cluster-management/add-node-to-cluster>`,
it must have the same X.Y.Z (major.minor.patch) version as the other nodes in
the cluster.
Upgrading to each patch version by following the Maintenance Release Upgrade
Guide is optional. However, we recommend upgrading to the latest patch release
for your version before upgrading to a new version.

View File

@@ -5,6 +5,7 @@ Upgrade ScyllaDB
.. toctree::
:titlesonly:
About Upgrade <about-upgrade>
Upgrade Guides <upgrade-guides/index>

View File

@@ -5,7 +5,6 @@ Upgrade ScyllaDB
.. toctree::
ScyllaDB 2025.x to ScyllaDB 2026.1 <upgrade-guide-from-2025.x-to-2026.1/index>
ScyllaDB 2026.x Patch Upgrades <upgrade-guide-from-2026.x.y-to-2026.x.z>
ScyllaDB Image <ami-upgrade>

View File

@@ -20,7 +20,7 @@ This guide covers upgrading ScyllaDB on Red Hat Enterprise Linux (RHEL), CentOS,
and Ubuntu. See `OS Support by Platform and Version <https://docs.scylladb.com/stable/versioning/os-support-per-version.html>`_
for information about supported versions. It also applies when using the ScyllaDB official image on EC2, GCP, or Azure.
See `Upgrade Policy <https://docs.scylladb.com/stable/versioning/upgrade-policy.html>`_ for the ScyllaDB upgrade policy.
See :doc:`About Upgrade </upgrade/about-upgrade/>` for the ScyllaDB upgrade policy.
Before You Upgrade ScyllaDB
==============================

View File

@@ -1,268 +0,0 @@
.. |SCYLLA_NAME| replace:: ScyllaDB
.. |SRC_VERSION| replace:: 2026.x.y
.. |NEW_VERSION| replace:: 2026.x.z
==========================================================================
Upgrade - |SCYLLA_NAME| |SRC_VERSION| to |NEW_VERSION| (Patch Upgrades)
==========================================================================
This document describes a step-by-step procedure for upgrading from
|SCYLLA_NAME| |SRC_VERSION| to |SCYLLA_NAME| |NEW_VERSION| (where "z" is
the latest available version), and rolling back to version |SRC_VERSION|
if necessary.
This guide covers upgrading ScyllaDB on Red Hat Enterprise Linux (RHEL),
CentOS, Debian, and Ubuntu.
See `OS Support by Platform and Version <https://docs.scylladb.com/stable/versioning/os-support-per-version.html>`_
for information about supported versions.
It also applies to the ScyllaDB official image on EC2, GCP, or Azure.
See `Upgrade Policy <https://docs.scylladb.com/stable/versioning/upgrade-policy.html>`_ for the ScyllaDB upgrade policy.
Upgrade Procedure
=================
.. note::
Apply the following procedure **serially** on each node. Do not move to the next
node before validating that the node is up and running the new version.
A ScyllaDB upgrade is a rolling procedure that does **not** require a full cluster
shutdown. For each of the nodes in the cluster, you will:
#. Drain the node and back up the data.
#. Backup configuration file.
#. Stop ScyllaDB.
#. Download and install new ScyllaDB packages.
#. Start ScyllaDB.
#. Validate that the upgrade was successful.
**Before** upgrading, check which version you are running now using
``scylla --version``. Note the current version in case you want to roll back
the upgrade.
**During** the rolling upgrade it is highly recommended:
* Not to use new |NEW_VERSION| features.
* Not to run administration functions, like repairs, refresh, rebuild or add
or remove nodes. See
`sctool <https://manager.docs.scylladb.com/stable/sctool/>`_ for suspending
ScyllaDB Manager's scheduled or running repairs.
* Not to apply schema changes.
Upgrade Steps
=============
Back up the data
------------------------------
Back up all the data to an external device. We recommend using
`ScyllaDB Manager <https://manager.docs.scylladb.com/stable/backup/index.html>`_
to create backups.
Alternatively, you can use the ``nodetool snapshot`` command.
For **each** node in the cluster, run the following:
.. code:: sh
nodetool drain
nodetool snapshot
Take note of the directory name that nodetool gives you, and copy all
the directories with this name under ``/var/lib/scylla`` to a backup device.
When the upgrade is completed on all nodes, remove the snapshot with the
``nodetool clearsnapshot -t <snapshot>`` command to prevent running out of
space.
Back up the configuration file
------------------------------
Back up the ``scylla.yaml`` configuration file and the ScyllaDB packages
in case you need to roll back the upgrade.
.. tabs::
.. group-tab:: Debian/Ubuntu
.. code:: sh
sudo cp -a /etc/scylla/scylla.yaml /etc/scylla/scylla.yaml.backup
sudo cp /etc/apt/sources.list.d/scylla.list ~/scylla.list-backup
.. group-tab:: RHEL/CentOS
.. code:: sh
sudo cp -a /etc/scylla/scylla.yaml /etc/scylla/scylla.yaml.backup
sudo cp /etc/yum.repos.d/scylla.repo ~/scylla.repo-backup
Gracefully stop the node
------------------------
.. code:: sh
sudo service scylla-server stop
Download and install the new release
------------------------------------
You dont need to update the ScyllaDB DEB or RPM repo when you upgrade to
a patch release.
.. tabs::
.. group-tab:: Debian/Ubuntu
To install a patch version on Debian or Ubuntu, run:
.. code:: sh
sudo apt-get clean all
sudo apt-get update
sudo apt-get dist-upgrade scylla
Answer y to the first two questions.
.. group-tab:: RHEL/CentOS
To install a patch version on RHEL or CentOS, run:
.. code:: sh
sudo yum clean all
sudo yum update scylla\* -y
.. group-tab:: EC2/GCP/Azure Ubuntu Image
If you're using the ScyllaDB official image (recommended), see
the **Debian/Ubuntu** tab for upgrade instructions.
If you're using your own image and have installed ScyllaDB packages for
Ubuntu or Debian, you need to apply an extended upgrade procedure:
#. Install the new ScyllaDB version with the additional
``scylla-machine-image`` package:
.. code-block:: console
sudo apt-get clean all
sudo apt-get update
sudo apt-get dist-upgrade scylla
sudo apt-get dist-upgrade scylla-machine-image
#. Run ``scylla_setup`` without ``running io_setup``.
#. Run ``sudo /opt/scylladb/scylla-machine-image/scylla_cloud_io_setup``.
Start the node
--------------
.. code:: sh
sudo service start scylla-server
Validate
--------
#. Check cluster status with ``nodetool status`` and make sure **all** nodes,
including the one you just upgraded, are in UN status.
#. Use ``curl -X GET "http://localhost:10000/storage_service/scylla_release_version"``
to check the ScyllaDB version.
#. Use ``journalctl _COMM=scylla`` to check there are no new errors in the log.
#. Check again after 2 minutes to validate that no new issues are introduced.
Once you are sure the node upgrade is successful, move to the next node in
the cluster.
Rollback Procedure
==================
The following procedure describes a rollback from ScyllaDB release
|NEW_VERSION| to |SRC_VERSION|. Apply this procedure if an upgrade from
|SRC_VERSION| to |NEW_VERSION| failed before completing on all nodes.
* Use this procedure only on nodes you upgraded to |NEW_VERSION|.
* Execute the following commands one node at a time, moving to the next node only
after the rollback procedure is completed successfully.
ScyllaDB rollback is a rolling procedure that does **not** require a full
cluster shutdown. For each of the nodes to roll back to |SRC_VERSION|, you will:
#. Drain the node and stop ScyllaDB.
#. Downgrade to the previous release.
#. Restore the configuration file.
#. Restart ScyllaDB.
#. Validate the rollback success.
Rollback Steps
==============
Gracefully shutdown ScyllaDB
-----------------------------
.. code:: sh
nodetool drain
sudo service stop scylla-server
Downgrade to the previous release
----------------------------------
.. tabs::
.. group-tab:: Debian/Ubuntu
To downgrade to |SRC_VERSION| on Debian or Ubuntu, run:
.. code-block:: console
:substitutions:
sudo apt-get install scylla=|SRC_VERSION|\* scylla-server=|SRC_VERSION|\* scylla-tools=|SRC_VERSION|\* scylla-tools-core=|SRC_VERSION|\* scylla-kernel-conf=|SRC_VERSION|\* scylla-conf=|SRC_VERSION|\*
Answer y to the first two questions.
.. group-tab:: RHEL/CentOS
To downgrade to |SRC_VERSION| on RHEL or CentOS, run:
.. code-block:: console
:substitutions:
sudo yum downgrade scylla\*-|SRC_VERSION|-\* -y
.. group-tab:: EC2/GCP/Azure Ubuntu Image
If youre using the ScyllaDB official image (recommended), see
the **Debian/Ubuntu** tab for upgrade instructions.
If youre using your own image and have installed ScyllaDB packages for
Ubuntu or Debian, you need to additionally downgrade
the ``scylla-machine-image`` package.
.. code-block:: console
:substitutions:
sudo apt-get install scylla=|SRC_VERSION|\* scylla-server=|SRC_VERSION|\* scylla-tools=|SRC_VERSION|\* scylla-tools-core=|SRC_VERSION|\* scylla-kernel-conf=|SRC_VERSION|\* scylla-conf=|SRC_VERSION|\*
sudo apt-get install scylla-machine-image=|SRC_VERSION|\*
Answer y to the first two questions.
Restore the configuration file
------------------------------
.. code:: sh
sudo rm -rf /etc/scylla/scylla.yaml
sudo cp -a /etc/scylla/scylla.yaml.backup /etc/scylla/scylla.yaml
Start the node
--------------
.. code:: sh
sudo service scylla-server start
Validate
--------
Check upgrade instruction above for validation. Once you are sure the node
rollback is successful, move to the next node in the cluster.

View File

@@ -227,19 +227,13 @@ Security
Indexing and Caching
^^^^^^^^^^^^^^^^^^^^^
+----------------------------------------------------------------+--------------------------------------------------------------------------------------+
| Options | Support |
+================================================================+======================================================================================+
|:doc:`Secondary Index </features/secondary-indexes>` | |v| |
+----------------------------------------------------------------+--------------------------------------------------------------------------------------+
|StorageAttachedIndex (SAI) | |x| |
+----------------------------------------------------------------+--------------------------------------------------------------------------------------+
|:ref:`SAI for vector search <cassandra-sai-compatibility>` | |v| :sup:`*` |
+----------------------------------------------------------------+--------------------------------------------------------------------------------------+
|:doc:`Materialized Views </features/materialized-views>` | |v| |
+----------------------------------------------------------------+--------------------------------------------------------------------------------------+
:sup:`*` SAI class name on vector columns is rewritten to native ``vector_index``
+--------------------------------------------------------------+--------------------------------------------------------------------------------------+
| Options | Support |
+==============================================================+======================================================================================+
|:doc:`Secondary Index </features/secondary-indexes>` | |v| |
+--------------------------------------------------------------+--------------------------------------------------------------------------------------+
|:doc:`Materialized Views </features/materialized-views>` | |v| |
+--------------------------------------------------------------+--------------------------------------------------------------------------------------+
Additional Features

View File

@@ -380,7 +380,7 @@ public:
}
template<typename HostType, typename CacheType, typename ConfigType>
shared_ptr<HostType> get_host(const sstring& host, CacheType& cache, const ConfigType& config_map, std::string_view config_entry_name) {
shared_ptr<HostType> get_host(const sstring& host, CacheType& cache, const ConfigType& config_map) {
auto& host_cache = cache[this_shard_id()];
auto it = host_cache.find(host);
if (it != host_cache.end()) {
@@ -394,26 +394,23 @@ public:
return result;
}
throw std::invalid_argument(fmt::format(
"Encryption host \"{}\" is not defined in scylla.yaml. "
"Make sure it is listed under the \"{}\" section.",
host, config_entry_name));
throw std::invalid_argument("No such host: " + host);
}
shared_ptr<kmip_host> get_kmip_host(const sstring& host) override {
return get_host<kmip_host>(host, _per_thread_kmip_host_cache, _cfg->kmip_hosts(), "kmip_hosts");
return get_host<kmip_host>(host, _per_thread_kmip_host_cache, _cfg->kmip_hosts());
}
shared_ptr<kms_host> get_kms_host(const sstring& host) override {
return get_host<kms_host>(host, _per_thread_kms_host_cache, _cfg->kms_hosts(), "kms_hosts");
return get_host<kms_host>(host, _per_thread_kms_host_cache, _cfg->kms_hosts());
}
shared_ptr<gcp_host> get_gcp_host(const sstring& host) override {
return get_host<gcp_host>(host, _per_thread_gcp_host_cache, _cfg->gcp_hosts(), "gcp_hosts");
return get_host<gcp_host>(host, _per_thread_gcp_host_cache, _cfg->gcp_hosts());
}
shared_ptr<azure_host> get_azure_host(const sstring& host) override {
return get_host<azure_host>(host, _per_thread_azure_host_cache, _cfg->azure_hosts(), "azure_hosts");
return get_host<azure_host>(host, _per_thread_azure_host_cache, _cfg->azure_hosts());
}

View File

@@ -437,6 +437,7 @@ void ldap_connection::poll_results() {
const auto found = _msgid_to_promise.find(id);
if (found == _msgid_to_promise.end()) {
mylog.error("poll_results: got valid result for unregistered id {}, dropping it", id);
ldap_msgfree(result);
} else {
found->second.set_value(std::move(result_ptr));
_msgid_to_promise.erase(found);

View File

@@ -41,7 +41,7 @@ public:
_ip == other._ip;
}
explicit endpoint_state(inet_address ip) noexcept
endpoint_state(inet_address ip) noexcept
: _heart_beat_state()
, _update_timestamp(clk::now())
, _ip(ip)

View File

@@ -179,7 +179,6 @@ public:
gms::feature topology_noop_request { *this, "TOPOLOGY_NOOP_REQUEST"sv };
gms::feature tablets_intermediate_fallback_cleanup { *this, "TABLETS_INTERMEDIATE_FALLBACK_CLEANUP"sv };
gms::feature batchlog_v2 { *this, "BATCHLOG_V2"sv };
gms::feature vnodes_to_tablets_migrations { *this, "VNODES_TO_TABLETS_MIGRATIONS"sv };
public:
const std::unordered_map<sstring, std::reference_wrapper<feature>>& registered_features() const;

File diff suppressed because it is too large Load Diff

View File

@@ -91,6 +91,7 @@ struct loaded_endpoint_state {
class gossiper : public seastar::async_sharded_service<gossiper>, public seastar::peering_sharded_service<gossiper> {
public:
using clk = seastar::lowres_system_clock;
using ignore_features_of_local_node = bool_class<class ignore_features_of_local_node_tag>;
using generation_for_nodes = std::unordered_map<locator::host_id, generation_type>;
private:
using messaging_verb = netw::messaging_verb;
@@ -197,7 +198,18 @@ private:
endpoint_locks_map _endpoint_locks;
public:
static constexpr std::array DEAD_STATES{
versioned_value::REMOVED_TOKEN,
versioned_value::STATUS_LEFT,
};
static constexpr std::array SILENT_SHUTDOWN_STATES{
versioned_value::REMOVED_TOKEN,
versioned_value::STATUS_LEFT,
versioned_value::STATUS_BOOTSTRAPPING,
versioned_value::STATUS_UNKNOWN,
};
static constexpr std::chrono::milliseconds INTERVAL{1000};
static constexpr std::chrono::hours A_VERY_LONG_TIME{24 * 3};
// Maximum difference between remote generation value and generation
// value this node would get if this node were restarted that we are
@@ -229,6 +241,7 @@ private:
/* initial seeds for joining the cluster */
std::set<inet_address> _seeds;
std::map<locator::host_id, clk::time_point> _expire_time_endpoint_map;
bool _in_shadow_round = false;
@@ -328,6 +341,13 @@ private:
utils::chunked_vector<gossip_digest> make_random_gossip_digest() const;
public:
/**
* Handles switching the endpoint's state from REMOVING_TOKEN to REMOVED_TOKEN
*
* @param endpoint
* @param host_id
*/
future<> advertise_token_removed(locator::host_id host_id, permit_id);
/**
* Do not call this method unless you know what you are doing.
@@ -343,6 +363,7 @@ public:
future<generation_type> get_current_generation_number(locator::host_id endpoint) const;
future<version_type> get_current_heart_beat_version(locator::host_id endpoint) const;
bool is_safe_for_bootstrap(inet_address endpoint) const;
private:
/**
* Returns true if the chosen target was also a seed. False otherwise
@@ -362,6 +383,7 @@ private:
future<> do_gossip_to_unreachable_member(gossip_digest_syn message);
public:
clk::time_point get_expire_time_for_endpoint(locator::host_id endpoint) const noexcept;
// Gets a shared pointer to the endpoint_state, if exists.
// Otherwise, returns a null ptr.
@@ -445,7 +467,7 @@ private:
public:
bool is_alive(locator::host_id id) const;
bool is_left(const endpoint_state& eps) const;
bool is_dead_state(const endpoint_state& eps) const;
// Wait for nodes to be alive on all shards
future<> wait_alive(std::vector<gms::inet_address> nodes, std::chrono::milliseconds timeout);
future<> wait_alive(std::vector<locator::host_id> nodes, std::chrono::milliseconds timeout);
@@ -566,12 +588,17 @@ public:
public:
bool is_enabled() const;
public:
void add_expire_time_for_endpoint(locator::host_id endpoint, clk::time_point expire_time);
static clk::time_point compute_expire_time();
public:
bool is_seed(const inet_address& endpoint) const;
bool is_shutdown(const locator::host_id& endpoint) const;
bool is_shutdown(const endpoint_state& eps) const;
bool is_normal(const locator::host_id& endpoint) const;
bool is_cql_ready(const locator::host_id& endpoint) const;
bool is_silent_shutdown_state(const endpoint_state& ep_state) const;
void force_newer_generation();
public:
std::string_view get_gossip_status(const endpoint_state& ep_state) const noexcept;
@@ -588,8 +615,12 @@ private:
gossip_address_map& _address_map;
gossip_config _gcfg;
condition_variable _failure_detector_loop_cv;
// Get features supported by a particular node
std::set<sstring> get_supported_features(locator::host_id endpoint) const;
locator::token_metadata_ptr get_token_metadata_ptr() const noexcept;
std::string get_node_status(const locator::host_id& endpoint) const noexcept;
public:
// Get features supported by all the nodes this node knows about
std::set<sstring> get_supported_features(const std::unordered_map<locator::host_id, sstring>& loaded_peer_features, ignore_features_of_local_node ignore_local_node) const;
private:
seastar::metrics::metric_groups _metrics;
public:

View File

@@ -10,6 +10,10 @@
#include "gms/versioned_value.hh"
#include "message/messaging_service.hh"
#include <boost/algorithm/string/split.hpp>
#include <boost/algorithm/string/classification.hpp>
#include <charconv>
namespace gms {
static_assert(std::is_nothrow_default_constructible_v<versioned_value>);
@@ -19,6 +23,11 @@ versioned_value versioned_value::network_version() {
return versioned_value(format("{}", netw::messaging_service::current_version));
}
sstring versioned_value::make_full_token_string(const std::unordered_set<dht::token>& tokens) {
return fmt::to_string(fmt::join(tokens | std::views::transform([] (const dht::token& t) {
return t.to_sstring(); }), ";"));
}
sstring versioned_value::make_token_string(const std::unordered_set<dht::token>& tokens) {
if (tokens.empty()) {
return "";
@@ -26,4 +35,16 @@ sstring versioned_value::make_token_string(const std::unordered_set<dht::token>&
return tokens.begin()->to_sstring();
}
} // namespace gms
std::unordered_set<dht::token> versioned_value::tokens_from_string(const sstring& s) {
if (s.size() == 0) {
return {}; // boost::split produces one element for empty string
}
std::vector<sstring> tokens;
boost::split(tokens, s, boost::is_any_of(";"));
std::unordered_set<dht::token> ret;
for (auto str : tokens) {
ret.emplace(dht::token::from_sstring(str));
}
return ret;
}
}

View File

@@ -18,6 +18,7 @@
#include "schema/schema_fwd.hh"
#include "service/state_id.hh"
#include "version.hh"
#include "cdc/generation_id.hh"
#include <set>
#include <unordered_set>
@@ -44,7 +45,11 @@ public:
// values for ApplicationState.STATUS
static constexpr std::string_view STATUS_UNKNOWN{"UNKNOWN"};
static constexpr std::string_view STATUS_BOOTSTRAPPING{"BOOT"};
static constexpr std::string_view STATUS_NORMAL{"NORMAL"};
static constexpr std::string_view STATUS_LEFT{"LEFT"};
static constexpr std::string_view REMOVED_TOKEN{"removed"};
static constexpr std::string_view SHUTDOWN{"shutdown"};
@@ -75,18 +80,26 @@ public:
: _version(-1) {
}
static versioned_value clone_with_higher_version(const versioned_value& value) noexcept {
return versioned_value(value.value());
}
private:
static sstring version_string(const std::initializer_list<sstring>& args) {
return fmt::to_string(fmt::join(args, versioned_value::DELIMITER));
}
static sstring make_full_token_string(const std::unordered_set<dht::token>& tokens);
static sstring make_token_string(const std::unordered_set<dht::token>& tokens);
static sstring make_cdc_generation_id_string(std::optional<cdc::generation_id>);
// Reverse of `make_full_token_string`.
static std::unordered_set<dht::token> tokens_from_string(const sstring&);
static versioned_value clone_with_higher_version(const versioned_value& value) noexcept {
return versioned_value(value.value());
}
static versioned_value bootstrapping(const std::unordered_set<dht::token>& tokens) {
return versioned_value(version_string({sstring(versioned_value::STATUS_BOOTSTRAPPING),
make_token_string(tokens)}));
}
public:
static versioned_value normal(const std::unordered_set<dht::token>& tokens) {
return versioned_value(version_string({sstring(versioned_value::STATUS_NORMAL),
make_token_string(tokens)}));
@@ -100,10 +113,24 @@ public:
return versioned_value(new_version.to_sstring());
}
static versioned_value left(const std::unordered_set<dht::token>& tokens, int64_t expire_time) {
return versioned_value(version_string({sstring(versioned_value::STATUS_LEFT),
make_token_string(tokens),
std::to_string(expire_time)}));
}
static versioned_value host_id(const locator::host_id& host_id) {
return versioned_value(host_id.to_sstring());
}
static versioned_value tokens(const std::unordered_set<dht::token>& tokens) {
return versioned_value(make_full_token_string(tokens));
}
static versioned_value removed_nonlocal(const locator::host_id& host_id, int64_t expire_time) {
return versioned_value(sstring(REMOVED_TOKEN) + sstring(DELIMITER) + host_id.to_sstring() + sstring(DELIMITER) + to_sstring(expire_time));
}
static versioned_value shutdown(bool value) {
return versioned_value(sstring(SHUTDOWN) + sstring(DELIMITER) + (value ? "true" : "false"));
}
@@ -142,6 +169,10 @@ public:
return versioned_value(private_ip);
}
static versioned_value severity(double value) {
return versioned_value(to_sstring(value));
}
static versioned_value supported_features(const std::set<std::string_view>& features) {
return versioned_value(fmt::to_string(fmt::join(features, ",")));
}

View File

@@ -202,10 +202,6 @@ std::optional<sstring> secondary_index_manager::custom_index_class(const schema&
// This function returns a factory, as the custom index class should be lightweight, preferably not holding any state.
// We prefer this over a static custom index class instance, as it allows us to avoid any issues with thread safety.
//
// Note: SAI class names (StorageAttachedIndex, sai) are not listed here
// because maybe_rewrite_sai_to_vector_index() in create_index_statement.cc
// rewrites them to "vector_index" before the index metadata is persisted.
std::optional<std::function<std::unique_ptr<custom_index>()>> secondary_index_manager::get_custom_class_factory(const sstring& class_name) {
sstring lower_class_name = class_name;
std::transform(lower_class_name.begin(), lower_class_name.end(), lower_class_name.begin(), ::tolower);

View File

@@ -20,7 +20,6 @@
#include "types/concrete_types.hh"
#include "types/types.hh"
#include "utils/managed_string.hh"
#include <ranges>
#include <seastar/core/sstring.hh>
#include <boost/algorithm/string.hpp>
@@ -104,126 +103,19 @@ const static std::unordered_map<sstring, std::function<void(const sstring&, cons
{"oversampling", std::bind_front(validate_factor_option, 1.0f, 100.0f)},
// 'rescoring' enables recalculating of similarity scores of candidates retrieved from vector store when quantization is used.
{"rescoring", std::bind_front(validate_enumerated_option, boolean_values)},
// 'source_model' is a Cassandra SAI option specifying the embedding model name.
// Used by Cassandra libraries (e.g., CassIO) to tag indexes with the model that produced the vectors.
// Accepted for compatibility but not used by ScyllaDB.
{"source_model", [](const sstring&, const sstring&) { /* accepted for Cassandra compatibility */ }},
};
static constexpr auto TC_TARGET_KEY = "tc";
static constexpr auto PK_TARGET_KEY = "pk";
static constexpr auto FC_TARGET_KEY = "fc";
// Convert a serialized targets string (as produced by serialize_targets())
// back into the CQL column list used inside CREATE INDEX ... ON table(<here>).
//
// JSON examples:
// {"tc":"v","fc":["f1","f2"]} -> "v, f1, f2"
// {"tc":"v","pk":["p1","p2"]} -> "(p1, p2), v"
// {"tc":"v","pk":["p1","p2"],"fc":["f1"]} -> "(p1, p2), v, f1"
static sstring targets_to_cql(const sstring& targets) {
sstring get_vector_index_target_column(const sstring& targets) {
std::optional<rjson::value> json_value = rjson::try_parse(targets);
if (!json_value || !json_value->IsObject()) {
return cql3::util::maybe_quote(cql3::statements::index_target::column_name_from_target_string(targets));
return target_parser::get_target_column_name_from_string(targets);
}
sstring result;
const rjson::value* pk = rjson::find(*json_value, PK_TARGET_KEY);
rjson::value* pk = rjson::find(*json_value, "pk");
if (pk && pk->IsArray() && !pk->Empty()) {
result += "(";
auto pk_cols = std::views::all(pk->GetArray()) | std::views::transform([&](const rjson::value& col) {
return cql3::util::maybe_quote(sstring(rjson::to_string_view(col)));
}) | std::ranges::to<std::vector<sstring>>();
result += boost::algorithm::join(pk_cols, ", ");
result += "), ";
return sstring(rjson::to_string_view(pk->GetArray()[0]));
}
const rjson::value* tc = rjson::find(*json_value, TC_TARGET_KEY);
if (tc && tc->IsString()) {
result += cql3::util::maybe_quote(sstring(rjson::to_string_view(*tc)));
}
const rjson::value* fc = rjson::find(*json_value, FC_TARGET_KEY);
if (fc && fc->IsArray()) {
for (rapidjson::SizeType i = 0; i < fc->Size(); ++i) {
result += ", ";
result += cql3::util::maybe_quote(sstring(rjson::to_string_view((*fc)[i])));
}
}
return result;
}
// Serialize vector index targets into a format using:
// "tc" for the target (vector) column,
// "pk" for partition key columns (local index),
// "fc" for filtering columns.
// For a simple single-column vector index, returns just the column name.
// Examples:
// (v) -> "v"
// (v, f1, f2) -> {"tc":"v","fc":["f1","f2"]}
// ((p1, p2), v) -> {"tc":"v","pk":["p1","p2"]}
// ((p1, p2), v, f1, f2) -> {"tc":"v","pk":["p1","p2"],"fc":["f1","f2"]}
sstring vector_index::serialize_targets(const std::vector<::shared_ptr<cql3::statements::index_target>>& targets) {
using cql3::statements::index_target;
if (targets.size() == 0) {
throw exceptions::invalid_request_exception("Vector index must have at least one target column");
}
if (targets.size() == 1) {
auto tc = targets[0]->value;
if (!std::holds_alternative<index_target::single_column>(tc)) {
throw exceptions::invalid_request_exception("Missing vector column target for local vector index");
}
return index_target::escape_target_column(*std::get<index_target::single_column>(tc));
}
const bool has_pk = std::holds_alternative<index_target::multiple_columns>(targets.front()->value);
const size_t tc_idx = has_pk ? 1 : 0;
const size_t fc_count = targets.size() - tc_idx - 1;
if (!std::holds_alternative<index_target::single_column>(targets[tc_idx]->value)) {
throw exceptions::invalid_request_exception("Vector index target column must be a single column");
}
rjson::value json_map = rjson::empty_object();
rjson::add_with_string_name(json_map, TC_TARGET_KEY, rjson::from_string(std::get<index_target::single_column>(targets[tc_idx]->value)->text()));
if (has_pk) {
rjson::value pk_json = rjson::empty_array();
for (const auto& col : std::get<index_target::multiple_columns>(targets.front()->value)) {
rjson::push_back(pk_json, rjson::from_string(col->text()));
}
rjson::add_with_string_name(json_map, PK_TARGET_KEY, std::move(pk_json));
}
if (fc_count > 0) {
rjson::value fc_json = rjson::empty_array();
for (size_t i = tc_idx + 1; i < targets.size(); ++i) {
if (!std::holds_alternative<index_target::single_column>(targets[i]->value)) {
throw exceptions::invalid_request_exception("Vector index filtering column must be a single column");
}
rjson::push_back(fc_json, rjson::from_string(std::get<index_target::single_column>(targets[i]->value)->text()));
}
rjson::add_with_string_name(json_map, FC_TARGET_KEY, std::move(fc_json));
}
return rjson::print(json_map);
}
sstring vector_index::get_target_column(const sstring& targets) {
std::optional<rjson::value> json_value = rjson::try_parse(targets);
if (!json_value || !json_value->IsObject()) {
return cql3::statements::index_target::column_name_from_target_string(targets);
}
rjson::value* tc = rjson::find(*json_value, TC_TARGET_KEY);
if (tc && tc->IsString()) {
return sstring(rjson::to_string_view(*tc));
}
return cql3::statements::index_target::column_name_from_target_string(targets);
return target_parser::get_target_column_name_from_string(targets);
}
bool vector_index::is_rescoring_enabled(const index_options_map& properties) {
@@ -254,37 +146,12 @@ bool vector_index::view_should_exist() const {
}
std::optional<cql3::description> vector_index::describe(const index_metadata& im, const schema& base_schema) const {
static const std::unordered_set<sstring> system_options = {
cql3::statements::index_target::target_option_name,
db::index::secondary_index::custom_class_option_name,
db::index::secondary_index::index_version_option_name,
};
fragmented_ostringstream os;
os << "CREATE CUSTOM INDEX " << cql3::util::maybe_quote(im.name()) << " ON " << cql3::util::maybe_quote(base_schema.ks_name()) << "."
<< cql3::util::maybe_quote(base_schema.cf_name()) << "(" << targets_to_cql(im.options().at(cql3::statements::index_target::target_option_name)) << ")"
os << "CREATE CUSTOM INDEX " << cql3::util::maybe_quote(im.name()) << " ON "
<< cql3::util::maybe_quote(base_schema.ks_name()) << "." << cql3::util::maybe_quote(base_schema.cf_name())
<< "(" << cql3::util::maybe_quote(im.options().at(cql3::statements::index_target::target_option_name)) << ")"
<< " USING 'vector_index'";
// Collect user-provided options (excluding system keys like target, class_name, index_version).
std::map<sstring, sstring> user_options;
for (const auto& [key, value] : im.options()) {
if (!system_options.contains(key)) {
user_options.emplace(key, value);
}
}
if (!user_options.empty()) {
os << " WITH OPTIONS = {";
bool first = true;
for (const auto& [key, value] : user_options) {
if (!first) {
os << ", ";
}
os << "'" << key << "': '" << value << "'";
first = false;
}
os << "}";
}
return cql3::description{
.keyspace = base_schema.ks_name(),
.type = "index",
@@ -479,7 +346,7 @@ bool vector_index::is_vector_index_on_column(const index_metadata& im, const sst
auto target_it = im.options().find(cql3_parser::index_target::target_option_name);
if (class_it != im.options().end() && target_it != im.options().end()) {
auto custom_class = secondary_index_manager::get_custom_class_factory(class_it->second);
return custom_class && dynamic_cast<vector_index*>((*custom_class)().get()) && get_target_column(target_it->second) == target_name;
return custom_class && dynamic_cast<vector_index*>((*custom_class)().get()) && get_vector_index_target_column(target_it->second) == target_name;
}
return false;
}

View File

@@ -37,9 +37,6 @@ public:
static bool is_vector_index_on_column(const index_metadata& im, const sstring& target_name);
static void check_cdc_options(const schema& schema);
static sstring serialize_targets(const std::vector<::shared_ptr<cql3::statements::index_target>>& targets);
static sstring get_target_column(const sstring& targets);
static bool is_rescoring_enabled(const index_options_map& properties);
static float get_oversampling(const index_options_map& properties);
static sstring get_cql_similarity_function_name(const index_options_map& properties);

View File

@@ -42,14 +42,7 @@ void everywhere_replication_strategy::validate_options(const gms::feature_servic
sstring everywhere_replication_strategy::sanity_check_read_replicas(const effective_replication_map& erm, const host_id_vector_replica_set& read_replicas) const {
const auto replication_factor = erm.get_replication_factor();
if (const auto& topo_info = erm.get_token_metadata().get_topology_change_info(); topo_info && topo_info->read_new) {
if (read_replicas.size() > replication_factor + 1) {
return seastar::format(
"everywhere_replication_strategy: the number of replicas for everywhere_replication_strategy is {}, "
"cannot be higher than replication factor {} + 1 during the 'read from new replicas' stage of a topology change",
read_replicas.size(), replication_factor);
}
} else if (read_replicas.size() > replication_factor) {
if (read_replicas.size() > replication_factor) {
return seastar::format("everywhere_replication_strategy: the number of replicas for everywhere_replication_strategy is {}, cannot be higher than replication factor {}", read_replicas.size(), replication_factor);
}
return {};

View File

@@ -33,15 +33,14 @@ size_t hash<locator::endpoint_dc_rack>::operator()(const locator::endpoint_dc_ra
return utils::tuple_hash()(std::tie(v.dc, v.rack));
}
}
} // namespace std
namespace locator {
static logging::logger logger("network_topology_strategy");
network_topology_strategy::network_topology_strategy(replication_strategy_params params, const topology* topo) :
abstract_replication_strategy(params,
replication_strategy_type::network_topology) {
network_topology_strategy::network_topology_strategy(replication_strategy_params params, const topology* topo)
: abstract_replication_strategy(params, replication_strategy_type::network_topology) {
auto opts = _config_options;
logger.debug("options={}", opts);
@@ -65,8 +64,7 @@ network_topology_strategy::network_topology_strategy(replication_strategy_params
if (boost::equals(key, "replication_factor")) {
on_internal_error(rslogger, "replication_factor should have been replaced with a DC:RF mapping by now");
} else {
throw exceptions::configuration_exception(format(
"'{}' is not a valid option, did you mean (lowercase) 'replication_factor'?", key));
throw exceptions::configuration_exception(format("'{}' is not a valid option, did you mean (lowercase) 'replication_factor'?", key));
}
}
@@ -109,8 +107,8 @@ class natural_endpoints_tracker {
, _rf_left(std::min(rf, node_count))
// If there aren't enough racks in this DC to fill the RF, we'll still use at least one node from each rack,
// and the difference is to be filled by the first encountered nodes.
, _acceptable_rack_repeats(rf - rack_count)
{}
, _acceptable_rack_repeats(rf - rack_count) {
}
/**
* Attempts to add an endpoint to the replicas for this datacenter, adding to the endpoints set if successful.
@@ -201,8 +199,7 @@ public:
, _tp(_tm.get_topology())
, _dc_rep_factor(dc_rep_factor)
, _token_owners(_tm.get_datacenter_token_owners())
, _racks(_tm.get_datacenter_racks_token_owners())
{
, _racks(_tm.get_datacenter_racks_token_owners()) {
// not aware of any cluster members
SCYLLA_ASSERT(!_token_owners.empty() && !_racks.empty());
@@ -251,16 +248,14 @@ public:
for (const auto& [dc, rf_data] : dc_rf) {
auto rf = rf_data.count();
if (rf > endpoints_in(dc)) {
throw exceptions::configuration_exception(seastar::format(
"Datacenter {} doesn't have enough token-owning nodes for replication_factor={}", dc, rf));
throw exceptions::configuration_exception(
seastar::format("Datacenter {} doesn't have enough token-owning nodes for replication_factor={}", dc, rf));
}
}
}
};
future<host_id_set>
network_topology_strategy::calculate_natural_endpoints(
const token& search_token, const token_metadata& tm) const {
future<host_id_set> network_topology_strategy::calculate_natural_endpoints(const token& search_token, const token_metadata& tm) const {
natural_endpoints_tracker tracker(tm, _dc_rep_factor);
@@ -285,12 +280,14 @@ void network_topology_strategy::validate_options(const gms::feature_service& fs,
for (auto& c : _config_options) {
if (c.first == sstring("replication_factor")) {
on_internal_error(rslogger, fmt::format("'replication_factor' tag should be unrolled into a list of DC:RF by now."
"_config_options:{}", _config_options));
"_config_options:{}",
_config_options));
}
auto dc = dcs.find(c.first);
if (dc == dcs.end()) {
throw exceptions::configuration_exception(format("Unrecognized strategy option {{{}}} "
"passed to NetworkTopologyStrategy", this->to_qualified_class_name(c.first)));
"passed to NetworkTopologyStrategy",
this->to_qualified_class_name(c.first)));
}
auto racks = dc->second | std::views::keys | std::ranges::to<std::unordered_set<sstring>>();
auto rf = parse_replication_factor(c.second);
@@ -311,8 +308,8 @@ future<tablet_map> network_topology_strategy::allocate_tablets_for_new_table(sch
rslogger.info("Rounding up tablet count from {} to {} for table {}.{}", tablet_count, aligned_tablet_count, s->ks_name(), s->cf_name());
tablet_count = aligned_tablet_count;
}
co_return co_await reallocate_tablets(std::move(s), std::move(tm),
tablet_map(tablet_count, get_consistency() != data_dictionary::consistency_config_option::eventual));
co_return co_await reallocate_tablets(
std::move(s), std::move(tm), tablet_map(tablet_count, get_consistency() != data_dictionary::consistency_config_option::eventual));
}
future<tablet_map> network_topology_strategy::reallocate_tablets(schema_ptr s, token_metadata_ptr tm, tablet_map tablets) const {
@@ -321,16 +318,15 @@ future<tablet_map> network_topology_strategy::reallocate_tablets(schema_ptr s, t
co_await load.populate_with_normalized_load();
co_await load.populate(std::nullopt, s->id());
tablet_logger.debug("Allocating tablets for {}.{} ({}): dc_rep_factor={} tablet_count={}", s->ks_name(), s->cf_name(), s->id(), _dc_rep_factor, tablets.tablet_count());
tablet_logger.debug(
"Allocating tablets for {}.{} ({}): dc_rep_factor={} tablet_count={}", s->ks_name(), s->cf_name(), s->id(), _dc_rep_factor, tablets.tablet_count());
for (tablet_id tb : tablets.tablet_ids()) {
auto tinfo = tablets.get_tablet_info(tb);
tinfo.replicas = co_await reallocate_tablets(s, tm, load, tablets, tb);
if (tablets.has_raft_info()) {
if (!tablets.get_tablet_raft_info(tb).group_id) {
tablets.set_tablet_raft_info(tb, tablet_raft_info {
.group_id = raft::group_id{utils::UUID_gen::get_time_UUID()}
});
tablets.set_tablet_raft_info(tb, tablet_raft_info{.group_id = raft::group_id{utils::UUID_gen::get_time_UUID()}});
}
}
tablets.set_tablet(tb, std::move(tinfo));
@@ -340,7 +336,8 @@ future<tablet_map> network_topology_strategy::reallocate_tablets(schema_ptr s, t
co_return tablets;
}
future<tablet_replica_set> network_topology_strategy::reallocate_tablets(schema_ptr s, token_metadata_ptr tm, load_sketch& load, const tablet_map& cur_tablets, tablet_id tb) const {
future<tablet_replica_set> network_topology_strategy::reallocate_tablets(
schema_ptr s, token_metadata_ptr tm, load_sketch& load, const tablet_map& cur_tablets, tablet_id tb) const {
tablet_replica_set replicas;
// Current number of replicas per dc
std::unordered_map<sstring, size_t> nodes_per_dc;
@@ -364,8 +361,8 @@ future<tablet_replica_set> network_topology_strategy::reallocate_tablets(schema_
if (new_rf && new_rf->is_rack_based()) {
auto diff = diff_racks(old_racks_per_dc[dc], new_rf->get_rack_list());
tablet_logger.debug("reallocate_tablets {}.{} tablet_id={} dc={} old_racks={} add_racks={} del_racks={}",
s->ks_name(), s->cf_name(), tb, dc, old_racks_per_dc[dc], diff.added, diff.removed);
tablet_logger.debug("reallocate_tablets {}.{} tablet_id={} dc={} old_racks={} add_racks={} del_racks={}", s->ks_name(), s->cf_name(), tb, dc,
old_racks_per_dc[dc], diff.added, diff.removed);
if (!diff) {
continue;
@@ -395,23 +392,18 @@ future<tablet_replica_set> network_topology_strategy::reallocate_tablets(schema_
co_return replicas;
}
tablet_replica_set network_topology_strategy::drop_tablets_in_racks(schema_ptr s,
token_metadata_ptr tm,
load_sketch& load,
tablet_id tb,
const tablet_replica_set& cur_replicas,
const sstring& dc,
const rack_list& racks_to_drop) const {
tablet_replica_set network_topology_strategy::drop_tablets_in_racks(schema_ptr s, token_metadata_ptr tm, load_sketch& load, tablet_id tb,
const tablet_replica_set& cur_replicas, const sstring& dc, const rack_list& racks_to_drop) const {
auto& topo = tm->get_topology();
tablet_replica_set filtered;
auto is_rack_to_drop = [&racks_to_drop] (const sstring& rack) {
auto is_rack_to_drop = [&racks_to_drop](const sstring& rack) {
return std::ranges::contains(racks_to_drop, rack);
};
for (const auto& tr : cur_replicas) {
auto& node = topo.get_node(tr.host);
if (node.dc_rack().dc == dc && is_rack_to_drop(node.dc_rack().rack)) {
tablet_logger.debug("drop_tablets_in_rack {}.{} tablet_id={} dc={} rack={} removing replica: {}",
s->ks_name(), s->cf_name(), tb, node.dc_rack().dc, node.dc_rack().rack, tr);
tablet_logger.debug("drop_tablets_in_rack {}.{} tablet_id={} dc={} rack={} removing replica: {}", s->ks_name(), s->cf_name(), tb, node.dc_rack().dc,
node.dc_rack().rack, tr);
load.unload(tr.host, tr.shard, 1, service::default_target_tablet_size);
} else {
filtered.emplace_back(tr);
@@ -420,22 +412,17 @@ tablet_replica_set network_topology_strategy::drop_tablets_in_racks(schema_ptr s
return filtered;
}
tablet_replica_set network_topology_strategy::add_tablets_in_racks(schema_ptr s,
token_metadata_ptr tm,
load_sketch& load,
tablet_id tb,
const tablet_replica_set& cur_replicas,
const sstring& dc,
const rack_list& racks_to_add) const {
tablet_replica_set network_topology_strategy::add_tablets_in_racks(schema_ptr s, token_metadata_ptr tm, load_sketch& load, tablet_id tb,
const tablet_replica_set& cur_replicas, const sstring& dc, const rack_list& racks_to_add) const {
auto nodes = tm->get_datacenter_racks_token_owners_nodes();
auto& dc_nodes = nodes.at(dc);
auto new_replicas = cur_replicas;
for (auto&& rack: racks_to_add) {
for (auto&& rack : racks_to_add) {
host_id min_node;
double min_load = std::numeric_limits<double>::max();
for (auto&& node: dc_nodes.at(rack)) {
for (auto&& node : dc_nodes.at(rack)) {
if (!node.get().is_normal()) {
continue;
}
@@ -450,29 +437,26 @@ tablet_replica_set network_topology_strategy::add_tablets_in_racks(schema_ptr s,
}
if (!min_node) {
throw std::runtime_error(
fmt::format("No candidate node in rack {}.{} to allocate tablet replica", dc, rack));
throw std::runtime_error(fmt::format("No candidate node in rack {}.{} to allocate tablet replica", dc, rack));
}
auto new_replica = tablet_replica{min_node, load.next_shard(min_node, 1, service::default_target_tablet_size)};
new_replicas.push_back(new_replica);
tablet_logger.trace("add_tablet_in_rack {}.{} tablet_id={} dc={} rack={} load={} new_replica={}",
s->ks_name(), s->cf_name(), tb.id, dc, rack, min_load, new_replica);
tablet_logger.trace("add_tablet_in_rack {}.{} tablet_id={} dc={} rack={} load={} new_replica={}", s->ks_name(), s->cf_name(), tb.id, dc, rack, min_load,
new_replica);
}
return new_replicas;
}
future<tablet_replica_set> network_topology_strategy::add_tablets_in_dc(schema_ptr s, token_metadata_ptr tm, load_sketch& load, tablet_id tb,
std::map<sstring, std::unordered_set<locator::host_id>>& replicas_per_rack,
const tablet_replica_set& cur_replicas,
sstring dc, size_t dc_node_count, size_t dc_rf) const {
std::map<sstring, std::unordered_set<locator::host_id>>& replicas_per_rack, const tablet_replica_set& cur_replicas, sstring dc, size_t dc_node_count,
size_t dc_rf) const {
static thread_local std::default_random_engine rnd_engine{std::random_device{}()};
auto replicas = cur_replicas;
// all_dc_racks is ordered lexicographically on purpose
auto all_dc_racks = tm->get_datacenter_racks_token_owners_nodes().at(dc)
| std::ranges::to<std::map>();
auto all_dc_racks = tm->get_datacenter_racks_token_owners_nodes().at(dc) | std::ranges::to<std::map>();
// Track all nodes with no replicas on them for this tablet, per rack.
struct node_load {
@@ -481,7 +465,7 @@ future<tablet_replica_set> network_topology_strategy::add_tablets_in_dc(schema_p
};
// for sorting in descending load order
// (in terms of load)
auto node_load_cmp = [] (const node_load& a, const node_load& b) {
auto node_load_cmp = [](const node_load& a, const node_load& b) {
return a.load > b.load;
};
@@ -533,7 +517,7 @@ future<tablet_replica_set> network_topology_strategy::add_tablets_in_dc(schema_p
// ensure fairness across racks (in particular if rf < number_of_racks)
// by rotating the racks order
auto append_candidate_racks = [&] (candidates_list& racks) {
auto append_candidate_racks = [&](candidates_list& racks) {
if (auto size = racks.size()) {
auto it = racks.begin() + tb.id % size;
std::move(it, racks.end(), std::back_inserter(candidate_racks));
@@ -545,20 +529,19 @@ future<tablet_replica_set> network_topology_strategy::add_tablets_in_dc(schema_p
append_candidate_racks(existing_racks);
if (candidate_racks.empty()) {
on_internal_error(tablet_logger,
seastar::format("allocate_replica {}.{}: no candidate racks found for dc={} allocated={} rf={}: existing={}",
s->ks_name(), s->cf_name(), dc, dc_node_count, dc_rf, replicas_per_rack));
on_internal_error(tablet_logger, seastar::format("allocate_replica {}.{}: no candidate racks found for dc={} allocated={} rf={}: existing={}",
s->ks_name(), s->cf_name(), dc, dc_node_count, dc_rf, replicas_per_rack));
}
auto candidate_rack = candidate_racks.begin();
auto allocate_replica = [&] (candidates_list::iterator& candidate) {
auto allocate_replica = [&](candidates_list::iterator& candidate) {
const auto& rack = candidate->rack;
auto& nodes = candidate->nodes;
if (nodes.empty()) {
on_internal_error(tablet_logger,
seastar::format("allocate_replica {}.{} tablet_id={}: candidates vector for rack={} is empty for allocating tablet replicas in dc={} allocated={} rf={}",
s->ks_name(), s->cf_name(), tb.id, rack, dc, dc_node_count, dc_rf));
on_internal_error(tablet_logger, seastar::format("allocate_replica {}.{} tablet_id={}: candidates vector for rack={} is empty for allocating "
"tablet replicas in dc={} allocated={} rf={}",
s->ks_name(), s->cf_name(), tb.id, rack, dc, dc_node_count, dc_rf));
}
auto host_id = nodes.back().host;
auto replica = tablet_replica{host_id, load.next_shard(host_id, 1, service::default_target_tablet_size)};
@@ -566,13 +549,13 @@ future<tablet_replica_set> network_topology_strategy::add_tablets_in_dc(schema_p
auto inserted = replicas_per_rack[node.dc_rack().rack].insert(host_id).second;
// Sanity check that a node is not used more than once
if (!inserted) {
on_internal_error(tablet_logger,
seastar::format("allocate_replica {}.{} tablet_id={}: allocated replica={} node already used when allocating tablet replicas in dc={} allocated={} rf={}: replicas={}",
s->ks_name(), s->cf_name(), tb.id, replica, dc, dc_node_count, dc_rf, replicas));
on_internal_error(tablet_logger, seastar::format("allocate_replica {}.{} tablet_id={}: allocated replica={} node already used when allocating "
"tablet replicas in dc={} allocated={} rf={}: replicas={}",
s->ks_name(), s->cf_name(), tb.id, replica, dc, dc_node_count, dc_rf, replicas));
}
nodes.pop_back();
tablet_logger.trace("allocate_replica {}.{} tablet_id={}: allocated tablet replica={} dc={} rack={}: nodes remaining in rack={}",
s->ks_name(), s->cf_name(), tb.id, replica, node.dc_rack().dc, node.dc_rack().rack, nodes.size());
tablet_logger.trace("allocate_replica {}.{} tablet_id={}: allocated tablet replica={} dc={} rack={}: nodes remaining in rack={}", s->ks_name(),
s->cf_name(), tb.id, replica, node.dc_rack().dc, node.dc_rack().rack, nodes.size());
if (nodes.empty()) {
candidate = candidate_racks.erase(candidate);
} else {
@@ -583,7 +566,8 @@ future<tablet_replica_set> network_topology_strategy::add_tablets_in_dc(schema_p
}
if (tablet_logger.is_enabled(log_level::trace)) {
if (candidate != candidate_racks.end()) {
tablet_logger.trace("allocate_replica {}.{} tablet_id={}: next rack={} nodes={}", s->ks_name(), s->cf_name(), tb.id, candidate->rack, candidate->nodes.size());
tablet_logger.trace("allocate_replica {}.{} tablet_id={}: next rack={} nodes={}", s->ks_name(), s->cf_name(), tb.id, candidate->rack,
candidate->nodes.size());
} else {
tablet_logger.trace("allocate_replica {}.{} tablet_id={}: no candidate racks left", s->ks_name(), s->cf_name(), tb.id);
}
@@ -591,15 +575,15 @@ future<tablet_replica_set> network_topology_strategy::add_tablets_in_dc(schema_p
return replica;
};
tablet_logger.debug("allocate_replica {}.{} tablet_id={}: allocating tablet replicas in dc={} allocated={} rf={}",
s->ks_name(), s->cf_name(), tb.id, dc, dc_node_count, dc_rf);
tablet_logger.debug("allocate_replica {}.{} tablet_id={}: allocating tablet replicas in dc={} allocated={} rf={}", s->ks_name(), s->cf_name(), tb.id, dc,
dc_node_count, dc_rf);
for (size_t remaining = dc_rf - dc_node_count; remaining; --remaining) {
co_await coroutine::maybe_yield();
if (candidate_rack == candidate_racks.end()) {
on_internal_error(tablet_logger,
format("allocate_replica {}.{} tablet_id={}: ran out of candidates for allocating tablet replicas in dc={} allocated={} rf={}: remaining={}",
s->ks_name(), s->cf_name(), tb.id, dc, dc_node_count, dc_rf, remaining));
on_internal_error(tablet_logger, format("allocate_replica {}.{} tablet_id={}: ran out of candidates for allocating tablet replicas in dc={} "
"allocated={} rf={}: remaining={}",
s->ks_name(), s->cf_name(), tb.id, dc, dc_node_count, dc_rf, remaining));
}
replicas.emplace_back(allocate_replica(candidate_rack));
}
@@ -608,9 +592,9 @@ future<tablet_replica_set> network_topology_strategy::add_tablets_in_dc(schema_p
}
tablet_replica_set network_topology_strategy::drop_tablets_in_dc(schema_ptr s, const locator::topology& topo, load_sketch& load, tablet_id tb,
const tablet_replica_set& cur_replicas,
sstring dc, size_t dc_node_count, size_t dc_rf) const {
tablet_logger.debug("drop_tablets_in_dc {}.{} tablet_id={}: deallocating tablet replicas in dc={} allocated={} rf={}", s->ks_name(), s->cf_name(), tb.id, dc, dc_node_count, dc_rf);
const tablet_replica_set& cur_replicas, sstring dc, size_t dc_node_count, size_t dc_rf) const {
tablet_logger.debug("drop_tablets_in_dc {}.{} tablet_id={}: deallocating tablet replicas in dc={} allocated={} rf={}", s->ks_name(), s->cf_name(), tb.id,
dc, dc_node_count, dc_rf);
// Leave dc_rf replicas in dc, effectively deallocating in reverse order,
// to maintain replica pairing between the base table and its materialized views.
@@ -629,8 +613,7 @@ tablet_replica_set network_topology_strategy::drop_tablets_in_dc(schema_ptr s, c
return filtered;
}
sstring network_topology_strategy::sanity_check_read_replicas(const effective_replication_map& erm,
const host_id_vector_replica_set& read_replicas) const {
sstring network_topology_strategy::sanity_check_read_replicas(const effective_replication_map& erm, const host_id_vector_replica_set& read_replicas) const {
const auto& topology = erm.get_topology();
struct rf_node_count {
@@ -663,4 +646,4 @@ sstring network_topology_strategy::sanity_check_read_replicas(const effective_re
using registry = class_registrator<abstract_replication_strategy, network_topology_strategy, replication_strategy_params, const topology*>;
static registry registrator("org.apache.cassandra.locator.NetworkTopologyStrategy");
static registry registrator_short_name("NetworkTopologyStrategy");
}
} // namespace locator

File diff suppressed because it is too large Load Diff

View File

@@ -367,8 +367,6 @@ future<> token_metadata_impl::clear_gently() noexcept {
co_await utils::clear_gently(_sorted_tokens);
co_await _topology.clear_gently();
co_await _tablets.clear_gently();
co_await utils::clear_gently(_topology_change_info);
_topology_change_info.reset();
co_return;
}

View File

@@ -26,12 +26,16 @@
struct node_printer {
const locator::node* v;
node_printer(const locator::node* n) noexcept : v(n) {}
node_printer(const locator::node* n) noexcept
: v(n) {
}
};
template <>
struct fmt::formatter<node_printer> {
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
constexpr auto parse(format_parse_context& ctx) {
return ctx.begin();
}
auto format(const node_printer& np, fmt::format_context& ctx) const {
const locator::node* node = np.v;
auto out = fmt::format_to(ctx.out(), "node={}", fmt::ptr(node));
@@ -43,7 +47,9 @@ struct fmt::formatter<node_printer> {
};
static auto lazy_backtrace() {
return seastar::value_of([] { return current_backtrace(); });
return seastar::value_of([] {
return current_backtrace();
});
}
namespace locator {
@@ -51,11 +57,12 @@ namespace locator {
static logging::logger tlogger("topology");
thread_local const endpoint_dc_rack endpoint_dc_rack::default_location = {
.dc = locator::production_snitch_base::default_dc,
.rack = locator::production_snitch_base::default_rack,
.dc = locator::production_snitch_base::default_dc,
.rack = locator::production_snitch_base::default_rack,
};
node::node(const locator::topology* topology, locator::host_id id, endpoint_dc_rack dc_rack, state state, shard_id shard_count, bool excluded, this_node is_this_node, node::idx_type idx, bool draining)
node::node(const locator::topology* topology, locator::host_id id, endpoint_dc_rack dc_rack, state state, shard_id shard_count, bool excluded,
this_node is_this_node, node::idx_type idx, bool draining)
: _topology(topology)
, _host_id(id)
, _dc_rack(std::move(dc_rack))
@@ -64,10 +71,11 @@ node::node(const locator::topology* topology, locator::host_id id, endpoint_dc_r
, _excluded(excluded)
, _draining(draining)
, _is_this_node(is_this_node)
, _idx(idx)
{}
, _idx(idx) {
}
node_holder node::make(const locator::topology* topology, locator::host_id id, endpoint_dc_rack dc_rack, state state, shard_id shard_count, bool excluded, node::this_node is_this_node, node::idx_type idx, bool draining) {
node_holder node::make(const locator::topology* topology, locator::host_id id, endpoint_dc_rack dc_rack, state state, shard_id shard_count, bool excluded,
node::this_node is_this_node, node::idx_type idx, bool draining) {
return std::make_unique<node>(topology, std::move(id), std::move(dc_rack), std::move(state), shard_count, excluded, is_this_node, idx, draining);
}
@@ -77,14 +85,22 @@ node_holder node::clone() const {
std::string node::to_string(node::state s) {
switch (s) {
case state::none: return "none";
case state::bootstrapping: return "bootstrapping";
case state::replacing: return "replacing";
case state::normal: return "normal";
case state::being_decommissioned: return "being_decommissioned";
case state::being_removed: return "being_removed";
case state::being_replaced: return "being_replaced";
case state::left: return "left";
case state::none:
return "none";
case state::bootstrapping:
return "bootstrapping";
case state::replacing:
return "replacing";
case state::normal:
return "normal";
case state::being_decommissioned:
return "being_decommissioned";
case state::being_removed:
return "being_removed";
case state::being_replaced:
return "being_replaced";
case state::left:
return "left";
}
__builtin_unreachable();
}
@@ -101,21 +117,19 @@ future<> topology::clear_gently() noexcept {
}
topology::topology(shallow_copy, config cfg)
: _shard(this_shard_id())
, _cfg(cfg)
, _sort_by_proximity(true)
{
: _shard(this_shard_id())
, _cfg(cfg)
, _sort_by_proximity(true) {
// constructor for shallow copying of token_metadata_impl
}
topology::topology(config cfg)
: _shard(this_shard_id())
, _cfg(cfg)
, _sort_by_proximity(!cfg.disable_proximity_sorting)
, _random_engine(std::random_device{}())
{
tlogger.trace("topology[{}]: constructing using config: endpoint={} id={} dc={} rack={}", fmt::ptr(this),
cfg.this_endpoint, cfg.this_host_id, cfg.local_dc_rack.dc, cfg.local_dc_rack.rack);
: _shard(this_shard_id())
, _cfg(cfg)
, _sort_by_proximity(!cfg.disable_proximity_sorting)
, _random_engine(std::random_device{}()) {
tlogger.trace("topology[{}]: constructing using config: endpoint={} id={} dc={} rack={}", fmt::ptr(this), cfg.this_endpoint, cfg.this_host_id,
cfg.local_dc_rack.dc, cfg.local_dc_rack.rack);
add_node(cfg.this_host_id, cfg.local_dc_rack, node::state::none);
}
@@ -131,8 +145,7 @@ topology::topology(topology&& o) noexcept
, _dc_racks(std::move(o._dc_racks))
, _sort_by_proximity(o._sort_by_proximity)
, _datacenters(std::move(o._datacenters))
, _random_engine(std::move(o._random_engine))
{
, _random_engine(std::move(o._random_engine)) {
SCYLLA_ASSERT(_shard == this_shard_id());
tlogger.trace("topology[{}]: move from [{}]", fmt::ptr(this), fmt::ptr(&o));
@@ -153,16 +166,18 @@ topology& topology::operator=(topology&& o) noexcept {
void topology::set_host_id_cfg(host_id this_host_id) {
if (_cfg.this_host_id) {
on_internal_error(tlogger, fmt::format("topology[{}] set_host_id_cfg can be caller only once current id {} new id {}", fmt::ptr(this), _cfg.this_host_id, this_host_id));
on_internal_error(tlogger,
fmt::format("topology[{}] set_host_id_cfg can be caller only once current id {} new id {}", fmt::ptr(this), _cfg.this_host_id, this_host_id));
}
if (_nodes.size() != 1) {
on_internal_error(tlogger, fmt::format("topology[{}] set_host_id_cfg called while nodes size is greater than 1", fmt::ptr(this)));
on_internal_error(tlogger, fmt::format("topology[{}] set_host_id_cfg called while nodes size is greater than 1", fmt::ptr(this)));
}
if (!_this_node) {
on_internal_error(tlogger, fmt::format("topology[{}] set_host_id_cfg called while _this_nodes is null", fmt::ptr(this)));
on_internal_error(tlogger, fmt::format("topology[{}] set_host_id_cfg called while _this_nodes is null", fmt::ptr(this)));
}
if (_this_node->host_id()) {
on_internal_error(tlogger, fmt::format("topology[{}] set_host_id_cfg called while _this_nodes has non null id {}", fmt::ptr(this), _this_node->host_id()));
on_internal_error(
tlogger, fmt::format("topology[{}] set_host_id_cfg called while _this_nodes has non null id {}", fmt::ptr(this), _this_node->host_id()));
}
remove_node(*_this_node);
@@ -203,7 +218,8 @@ const node& topology::add_node(node_holder nptr) {
if (nptr->topology() != this) {
if (nptr->topology()) {
on_fatal_internal_error(tlogger, seastar::format("topology[{}]: {} belongs to different topology={}", fmt::ptr(this), node_printer(node), fmt::ptr(node->topology())));
on_fatal_internal_error(tlogger,
seastar::format("topology[{}]: {} belongs to different topology={}", fmt::ptr(this), node_printer(node), fmt::ptr(node->topology())));
}
nptr->set_topology(this);
}
@@ -219,7 +235,8 @@ const node& topology::add_node(node_holder nptr) {
try {
if (is_configured_this_node(*node)) {
if (_this_node) {
on_internal_error(tlogger, seastar::format("topology[{}]: {}: local node already mapped to {}", fmt::ptr(this), node_printer(node), node_printer(this_node())));
on_internal_error(tlogger,
seastar::format("topology[{}]: {}: local node already mapped to {}", fmt::ptr(this), node_printer(node), node_printer(this_node())));
}
locator::node& n = *_nodes.back();
n._is_this_node = node::this_node::yes;
@@ -238,14 +255,25 @@ const node& topology::add_node(node_holder nptr) {
return *node;
}
void topology::update_node(node& node, std::optional<host_id> opt_id, std::optional<endpoint_dc_rack> opt_dr, std::optional<node::state> opt_st, std::optional<shard_id> opt_shard_count) {
void topology::update_node(node& node, std::optional<host_id> opt_id, std::optional<endpoint_dc_rack> opt_dr, std::optional<node::state> opt_st,
std::optional<shard_id> opt_shard_count) {
tlogger.debug("topology[{}]: update_node: {}: to: host_id={} dc={} rack={} state={} shard_count={}, at {}", fmt::ptr(this), node_printer(&node),
opt_id ? format("{}", *opt_id) : "unchanged",
opt_dr ? format("{}", opt_dr->dc) : "unchanged",
opt_dr ? format("{}", opt_dr->rack) : "unchanged",
opt_st ? format("{}", *opt_st) : "unchanged",
opt_shard_count ? format("{}", *opt_shard_count) : "unchanged",
lazy_backtrace());
seastar::value_of([&] {
return opt_id ? format("{}", *opt_id) : "unchanged";
}),
seastar::value_of([&] {
return opt_dr ? format("{}", opt_dr->dc) : "unchanged";
}),
seastar::value_of([&] {
return opt_dr ? format("{}", opt_dr->rack) : "unchanged";
}),
seastar::value_of([&] {
return opt_st ? format("{}", *opt_st) : "unchanged";
}),
seastar::value_of([&] {
return opt_shard_count ? format("{}", *opt_shard_count) : "unchanged";
}),
lazy_backtrace());
bool changed = false;
if (opt_id) {
@@ -257,7 +285,8 @@ void topology::update_node(node& node, std::optional<host_id> opt_id, std::optio
on_internal_error(tlogger, seastar::format("This node host_id is already set: {}: new host_id={}", node_printer(&node), *opt_id));
}
if (_nodes_by_host_id.contains(*opt_id)) {
on_internal_error(tlogger, seastar::format("Cannot update node host_id: {}: new host_id already exists: {}", node_printer(&node), node_printer(find_node(*opt_id))));
on_internal_error(tlogger, seastar::format("Cannot update node host_id: {}: new host_id already exists: {}", node_printer(&node),
node_printer(find_node(*opt_id))));
}
changed = true;
} else {
@@ -442,11 +471,11 @@ const node* topology::find_node(node::idx_type idx) const noexcept {
return _nodes.at(idx).get();
}
const node& topology::add_or_update_endpoint(host_id id, std::optional<endpoint_dc_rack> opt_dr, std::optional<node::state> opt_st, std::optional<shard_id> shard_count)
{
tlogger.trace("topology[{}]: add_or_update_endpoint: host_id={} dc={} rack={} state={} shards={}, at {}", fmt::ptr(this),
id, opt_dr.value_or(endpoint_dc_rack{}).dc, opt_dr.value_or(endpoint_dc_rack{}).rack, opt_st.value_or(node::state::none), shard_count,
lazy_backtrace());
const node& topology::add_or_update_endpoint(
host_id id, std::optional<endpoint_dc_rack> opt_dr, std::optional<node::state> opt_st, std::optional<shard_id> shard_count) {
tlogger.trace("topology[{}]: add_or_update_endpoint: host_id={} dc={} rack={} state={} shards={}, at {}", fmt::ptr(this), id,
opt_dr.value_or(endpoint_dc_rack{}).dc, opt_dr.value_or(endpoint_dc_rack{}).rack, opt_st.value_or(node::state::none), shard_count,
lazy_backtrace());
auto* n = find_node(id);
if (n) {
@@ -454,14 +483,10 @@ const node& topology::add_or_update_endpoint(host_id id, std::optional<endpoint_
return *n;
}
return add_node(id,
opt_dr.value_or(endpoint_dc_rack::default_location),
opt_st.value_or(node::state::none),
shard_count.value_or(0));
return add_node(id, opt_dr.value_or(endpoint_dc_rack::default_location), opt_st.value_or(node::state::none), shard_count.value_or(0));
}
bool topology::remove_endpoint(locator::host_id host_id)
{
bool topology::remove_endpoint(locator::host_id host_id) {
auto node = find_node(host_id);
tlogger.debug("topology[{}]: remove_endpoint: host_id={}: {}", fmt::ptr(this), host_id, node_printer(node));
// Do not allow removing yourself from the topology
@@ -490,10 +515,6 @@ const endpoint_dc_rack& topology::get_location_slow(host_id id) const {
throw std::runtime_error(format("Requested location for node {} not in topology. backtrace {}", id, lazy_backtrace()));
}
utils::UUID topology::get_rack_uuid() const {
return utils::UUID_gen::get_name_UUID(format("{}:{}", get_location().dc, get_location().rack));
}
void topology::sort_by_proximity(locator::host_id address, host_id_vector_replica_set& addresses) const {
if (can_sort_by_proximity()) {
do_sort_by_proximity(address, addresses);
@@ -506,7 +527,7 @@ void topology::do_sort_by_proximity(locator::host_id address, host_id_vector_rep
locator::host_id id;
int distance;
};
auto host_infos = addresses | std::views::transform([&] (locator::host_id id) {
auto host_infos = addresses | std::views::transform([&](locator::host_id id) {
const auto& loc1 = get_location(id);
return info{id, distance(address, loc, id, loc1)};
}) | std::ranges::to<utils::small_vector<info, host_id_vector_replica_set::internal_capacity()>>();
@@ -568,11 +589,12 @@ std::unordered_set<locator::host_id> topology::get_all_host_ids() const {
return ids;
}
std::unordered_map<sstring, std::unordered_set<host_id>>
topology::get_datacenter_host_ids() const {
std::unordered_map<sstring, std::unordered_set<host_id>> topology::get_datacenter_host_ids() const {
std::unordered_map<sstring, std::unordered_set<host_id>> ret;
for (auto& [dc, nodes] : _dc_nodes) {
ret[dc] = nodes | std::views::transform([] (const node& n) { return n.host_id(); }) | std::ranges::to<std::unordered_set>();
ret[dc] = nodes | std::views::transform([](const node& n) {
return n.host_id();
}) | std::ranges::to<std::unordered_set>();
}
return ret;
}

View File

@@ -344,8 +344,6 @@ public:
return get_location(id).rack;
}
utils::UUID get_rack_uuid() const;
auto get_local_dc_filter() const noexcept {
return [ this, local_dc = get_datacenter() ] (auto ep) {
return get_datacenter(ep) == local_dc;

12
main.cc
View File

@@ -1807,17 +1807,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
checkpoint(stop_signal, "starting repair service");
auto max_memory_repair = memory::stats().total_memory() * 0.1;
auto repair_config = sharded_parameter([&] {
return repair_service::config{
.enable_small_table_optimization_for_rbno = cfg->enable_small_table_optimization_for_rbno,
.repair_hints_batchlog_flush_cache_time_in_ms = cfg->repair_hints_batchlog_flush_cache_time_in_ms,
.repair_partition_count_estimation_ratio = cfg->repair_partition_count_estimation_ratio,
.critical_disk_utilization_level = cfg->critical_disk_utilization_level,
.repair_multishard_reader_buffer_hint_size = cfg->repair_multishard_reader_buffer_hint_size,
.repair_multishard_reader_enable_read_ahead = cfg->repair_multishard_reader_enable_read_ahead,
};
});
repair.start(std::ref(tsm), std::ref(gossiper), std::ref(messaging), std::ref(db), std::ref(proxy), std::ref(bm), std::ref(sys_ks), std::ref(view_builder), std::ref(view_building_worker), std::ref(task_manager), std::ref(mm), max_memory_repair, std::move(repair_config)).get();
repair.start(std::ref(tsm), std::ref(gossiper), std::ref(messaging), std::ref(db), std::ref(proxy), std::ref(bm), std::ref(sys_ks), std::ref(view_builder), std::ref(view_building_worker), std::ref(task_manager), std::ref(mm), max_memory_repair).get();
auto stop_repair_service = defer_verbose_shutdown("repair service", [&repair] {
repair.stop().get();
});

View File

@@ -175,9 +175,11 @@ std::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, ato
}
void transform_counter_updates_to_shards(mutation& m, const mutation* current_state, uint64_t clock_offset, counter_id local_id) {
void transform_counter_updates_to_shards(mutation& m, const mutation* current_state, uint64_t clock_offset, locator::host_id local_host_id) {
// FIXME: allow current_state to be frozen_mutation
utils::UUID local_id = local_host_id.uuid();
auto transform_new_row_to_shards = [&s = *m.schema(), clock_offset, local_id] (column_kind kind, auto& cells) {
cells.for_each_cell([&] (column_id id, atomic_cell_or_collection& ac_o_c) {
auto& cdef = s.column_at(kind, id);
@@ -186,7 +188,7 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st
return; // continue -- we are in lambda
}
auto delta = acv.counter_update_value();
auto cs = counter_shard(local_id, delta, clock_offset + 1);
auto cs = counter_shard(counter_id(local_id), delta, clock_offset + 1);
ac_o_c = counter_cell_builder::from_single_shard(acv.timestamp(), cs);
});
};
@@ -210,7 +212,7 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st
return; // continue -- we are in lambda
}
auto ccv = counter_cell_view(acv);
auto cs = ccv.get_shard(local_id);
auto cs = ccv.get_shard(counter_id(local_id));
if (!cs) {
return; // continue
}
@@ -230,7 +232,7 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st
auto delta = acv.counter_update_value();
if (shards.empty() || shards.front().first > id) {
auto cs = counter_shard(local_id, delta, clock_offset + 1);
auto cs = counter_shard(counter_id(local_id), delta, clock_offset + 1);
ac_o_c = counter_cell_builder::from_single_shard(acv.timestamp(), cs);
} else {
auto& cs = shards.front().second;

View File

@@ -370,7 +370,7 @@ struct counter_cell_mutable_view : basic_counter_cell_view<mutable_view::yes> {
// Transforms mutation dst from counter updates to counter shards using state
// stored in current_state.
// If current_state is present it has to be in the same schema as dst.
void transform_counter_updates_to_shards(mutation& dst, const mutation* current_state, uint64_t clock_offset, counter_id local_id);
void transform_counter_updates_to_shards(mutation& dst, const mutation* current_state, uint64_t clock_offset, locator::host_id local_id);
template<>
struct appending_hash<counter_shard_view> {

View File

@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:54662978b9ce4a6e25790b1b0a5099e6063173ffa95a399a6287cf474376ed09
size 6595952
oid sha256:e59fe56eac435fd03c2f0d7dfc11c6998d7c0750e1851535575497dd13d96015
size 6505524

View File

@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0cf44ea1fb2ae20de45d26fe8095054e60cb8700cddcb2fd79ef79705484b18a
size 6603780
oid sha256:34a0955d2c5a88e18ddab0f1df085e10a17e14129c3e21de91e4f27ef949b6c4
size 6502668

View File

@@ -1098,8 +1098,7 @@ std::optional<std::pair<read_id, index_t>> fsm::start_read_barrier(server_id req
// Make sure that only a leader or a node that is part of the config can request read barrier
// Nodes outside of the config may never get the data, so they will not be able to read it.
follower_progress* opt_progress = leader_state().tracker.find(requester);
if (requester != _my_id && opt_progress == nullptr) {
if (requester != _my_id && leader_state().tracker.find(requester) == nullptr) {
throw std::runtime_error(fmt::format("Read barrier requested by a node outside of the configuration {}", requester));
}
@@ -1110,23 +1109,19 @@ std::optional<std::pair<read_id, index_t>> fsm::start_read_barrier(server_id req
return {};
}
// Optimization for read barriers requested on non-voters. A non-voter doesn't receive the read_quorum message, so
// it might update its commit index only after another leader tick, which would slow down wait_for_apply() at the
// end of the read barrier. Prevent that by replicating to the non-voting requester here.
if (requester != _my_id && opt_progress->commit_idx < _commit_idx && opt_progress->match_idx == _log.last_idx()
&& !opt_progress->can_vote) {
logger.trace("start_read_barrier[{}]: replicate to {} because follower commit_idx={} < commit_idx={}, "
"follower match_idx={} == last_idx={}, and follower can_vote={}",
_my_id, requester, opt_progress->commit_idx, _commit_idx, opt_progress->match_idx,
_log.last_idx(), opt_progress->can_vote);
replicate_to(*opt_progress, true);
}
read_id id = next_read_id();
logger.trace("start_read_barrier[{}] starting read barrier with id {}", _my_id, id);
return std::make_pair(id, _commit_idx);
}
void fsm::maybe_update_commit_idx_for_read(index_t read_idx) {
// read_idx from the leader might not be replicated to the local node yet.
const bool in_local_log = read_idx <= _log.last_idx();
if (in_local_log && log_term_for(read_idx) == get_current_term()) {
advance_commit_idx(read_idx);
}
}
void fsm::stop() {
if (is_leader()) {
// Become follower to stop accepting requests

View File

@@ -480,6 +480,15 @@ public:
std::optional<std::pair<read_id, index_t>> start_read_barrier(server_id requester);
// Update the commit index to the read index (a read barrier result from the leader) if the local entry with the
// read index belongs to the current term.
//
// Satisfying the condition above guarantees that the local log matches the current leader's log up to the read
// index (the Log Matching Property), so the current leader won't drop the local entry with the read index.
// Moreover, this entry has been committed by the leader, so future leaders also won't drop it (the Leader
// Completeness Property). Hence, updating the commit index is safe.
void maybe_update_commit_idx_for_read(index_t read_idx);
size_t in_memory_log_size() const {
return _log.in_memory_size();
}

View File

@@ -1109,18 +1109,6 @@ future<> server_impl::process_fsm_output(index_t& last_stable, fsm_output&& batc
// case.
co_await _persistence->store_term_and_vote(batch.term_and_vote->first, batch.term_and_vote->second);
_stats.store_term_and_vote++;
// When the term advances, any in-flight snapshot transfers
// belong to an outdated term: the progress tracker has been
// reset in become_leader() or we are now a follower.
// Abort them before we dispatch this batch's messages, which
// may start fresh transfers for the new term.
//
// A vote may also change independently of the term (e.g. a
// follower voting for a candidate at the same term), but in
// that case there are no in-flight transfers and the abort
// is a no-op.
abort_snapshot_transfers();
}
if (batch.snp) {
@@ -1230,6 +1218,8 @@ future<> server_impl::process_fsm_output(index_t& last_stable, fsm_output&& batc
// quickly) stop happening (we're outside the config after all).
co_await _apply_entries.push_eventually(removed_from_config{});
}
// request aborts of snapshot transfers
abort_snapshot_transfers();
// abort all read barriers
for (auto& r : _reads) {
r.promise.set_value(not_a_leader{_fsm->current_leader()});
@@ -1571,6 +1561,7 @@ future<> server_impl::read_barrier(seastar::abort_source* as) {
co_return stop_iteration::no;
}
read_idx = std::get<index_t>(res);
_fsm->maybe_update_commit_idx_for_read(read_idx);
co_return stop_iteration::yes;
});

View File

@@ -113,6 +113,8 @@ void tracker::set_configuration(const configuration& configuration, index_t next
}
auto newp = this->progress::find(s.addr.id);
if (newp != this->progress::end()) {
// Processing joint configuration and already added
// an entry for this id.
continue;
}
auto oldp = old_progress.find(s.addr.id);
@@ -121,7 +123,7 @@ void tracker::set_configuration(const configuration& configuration, index_t next
} else {
newp = this->progress::emplace(s.addr.id, follower_progress{s.addr.id, next_idx}).first;
}
newp->second.can_vote = configuration.can_vote(s.addr.id);
newp->second.can_vote = s.can_vote;
}
};
emplace_simple_config(configuration.current, _current_voters);

View File

@@ -137,9 +137,6 @@ public:
// unless the reader is fast-forwarded to a new range.
bool _end_of_stream = false;
// Set by fill buffer for segregating output by partition range.
std::optional<dht::partition_range> _next_range;
schema_ptr _schema;
reader_permit _permit;
friend class mutation_reader;

View File

@@ -46,9 +46,7 @@ private:
const dht::sharder& remote_sharder,
unsigned remote_shard,
gc_clock::time_point compaction_time,
incremental_repair_meta inc,
uint64_t multishard_reader_buffer_hint_size,
bool multishard_reader_enable_read_ahead);
incremental_repair_meta inc);
public:
repair_reader(
@@ -62,9 +60,7 @@ public:
uint64_t seed,
read_strategy strategy,
gc_clock::time_point compaction_time,
incremental_repair_meta inc,
uint64_t multishard_reader_buffer_hint_size,
bool multishard_reader_enable_read_ahead);
incremental_repair_meta inc);
future<mutation_fragment_opt>
read_mutation_fragment();

File diff suppressed because it is too large Load Diff

View File

@@ -47,6 +47,7 @@
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/all.hh>
#include <seastar/coroutine/as_future.hh>
#include "db/config.hh"
#include "db/system_keyspace.hh"
#include "service/storage_proxy.hh"
#include "db/batchlog_manager.hh"
@@ -286,9 +287,7 @@ mutation_reader repair_reader::make_reader(
const dht::sharder& remote_sharder,
unsigned remote_shard,
gc_clock::time_point compaction_time,
incremental_repair_meta inc,
uint64_t multishard_reader_buffer_hint_size,
bool multishard_reader_enable_read_ahead) {
incremental_repair_meta inc) {
switch (strategy) {
case read_strategy::local: {
auto ms = mutation_source([&cf, compaction_time] (
@@ -314,11 +313,12 @@ mutation_reader repair_reader::make_reader(
}
case read_strategy::multishard_split: {
std::optional<size_t> multishard_reader_buffer_size;
if (multishard_reader_buffer_hint_size) {
const auto& dbconfig = db.local().get_config();
if (dbconfig.repair_multishard_reader_buffer_hint_size()) {
// Setting the repair buffer size as the multishard reader's buffer
// size helps avoid extra cross-shard round-trips and possible
// evict-recreate cycles.
multishard_reader_buffer_size = multishard_reader_buffer_hint_size;
multishard_reader_buffer_size = dbconfig.repair_multishard_reader_buffer_hint_size();
}
return make_multishard_streaming_reader(db, _schema, _permit, [this] {
auto shard_range = _sharder.next();
@@ -326,7 +326,7 @@ mutation_reader repair_reader::make_reader(
return std::optional<dht::partition_range>(dht::to_partition_range(*shard_range));
}
return std::optional<dht::partition_range>();
}, compaction_time, multishard_reader_buffer_size, read_ahead(multishard_reader_enable_read_ahead));
}, compaction_time, multishard_reader_buffer_size, read_ahead(dbconfig.repair_multishard_reader_enable_read_ahead()));
}
case read_strategy::multishard_filter: {
return make_filtering_reader(make_multishard_streaming_reader(db, _schema, _permit, _range, compaction_time, {}, read_ahead::yes),
@@ -354,17 +354,14 @@ repair_reader::repair_reader(
uint64_t seed,
read_strategy strategy,
gc_clock::time_point compaction_time,
incremental_repair_meta inc,
uint64_t multishard_reader_buffer_hint_size,
bool multishard_reader_enable_read_ahead)
incremental_repair_meta inc)
: _schema(s)
, _permit(std::move(permit))
, _range(dht::to_partition_range(range))
, _sharder(remote_sharder, range, remote_shard)
, _seed(seed)
, _local_read_op(strategy == read_strategy::local ? std::optional(cf.read_in_progress()) : std::nullopt)
, _reader(make_reader(db, cf, strategy, remote_sharder, remote_shard, compaction_time, inc,
multishard_reader_buffer_hint_size, multishard_reader_enable_read_ahead))
, _reader(make_reader(db, cf, strategy, remote_sharder, remote_shard, compaction_time, inc))
{ }
future<mutation_fragment_opt>
@@ -1324,9 +1321,7 @@ private:
return read_strategy;
}),
_compaction_time,
_incremental_repair_meta,
_rs.get_config().repair_multishard_reader_buffer_hint_size(),
bool(_rs.get_config().repair_multishard_reader_enable_read_ahead()));
_incremental_repair_meta);
}
try {
while (cur_size < _max_row_buf_size) {
@@ -2182,13 +2177,8 @@ public:
auto& cm = table.get_compaction_manager();
int64_t repaired_at = _incremental_repair_meta.sstables_repaired_at + 1;
// Keep the new sstables marked as being_repaired until repair_update_compaction_ctrl
// is called (after sstables_repaired_at is committed to Raft). This is an additional
// in-memory guard; the classifier itself also protects these sstables via the
// repaired_at > sstables_repaired_at check.
auto modifier = [repaired_at, session = _frozen_topology_guard] (sstables::sstable& new_sst) {
auto modifier = [repaired_at] (sstables::sstable& new_sst) {
new_sst.update_repaired_at(repaired_at);
new_sst.mark_as_being_repaired(session);
};
std::unordered_map<compaction::compaction_group_view*, std::vector<sstables::shared_sstable>> sstables_by_group;
@@ -2635,7 +2625,7 @@ future<repair_flush_hints_batchlog_response> repair_service::repair_flush_hints_
auto permit = co_await seastar::get_units(_flush_hints_batchlog_sem, 1);
bool updated = false;
auto now = gc_clock::now();
auto cache_time = std::chrono::milliseconds(_config.repair_hints_batchlog_flush_cache_time_in_ms());
auto cache_time = std::chrono::milliseconds(get_db().local().get_config().repair_hints_batchlog_flush_cache_time_in_ms());
auto cache_disabled = cache_time == std::chrono::milliseconds(0);
auto flush_time = now;
db::all_batches_replayed all_replayed = db::all_batches_replayed::yes;
@@ -3263,10 +3253,13 @@ private:
// sequentially because the rows from repair follower 1 to
// repair master might reduce the amount of missing data
// between repair master and repair follower 2.
repair_hash_set set_diff = get_set_diff(master.peer_row_hash_sets(node_idx), master.working_row_hashes().get());
auto working_hashes = master.working_row_hashes().get();
repair_hash_set set_diff = get_set_diff(master.peer_row_hash_sets(node_idx), working_hashes);
// Request missing sets from peer node
rlogger.debug("Before get_row_diff to node {}, local={}, peer={}, set_diff={}",
node, master.working_row_hashes().get().size(), master.peer_row_hash_sets(node_idx).size(), set_diff.size());
if (rlogger.is_enabled(logging::log_level::debug)) {
rlogger.debug("Before get_row_diff to node {}, local={}, peer={}, set_diff={}",
node, working_hashes.size(), master.peer_row_hash_sets(node_idx).size(), set_diff.size());
}
// If we need to pull all rows from the peer. We can avoid
// sending the row hashes on wire by setting needs_all_rows flag.
auto needs_all_rows = repair_meta::needs_all_rows_t(set_diff.size() == master.peer_row_hash_sets(node_idx).size());
@@ -3279,7 +3272,9 @@ private:
master.get_row_diff(std::move(set_diff), needs_all_rows, node, node_idx, dst_cpu_id);
ns.state = repair_state::get_row_diff_finished;
}
rlogger.debug("After get_row_diff node {}, hash_sets={}", master.myhostid(), master.working_row_hashes().get().size());
if (rlogger.is_enabled(logging::log_level::debug)) {
rlogger.debug("After get_row_diff node {}, hash_sets={}", master.myhostid(), master.working_row_hashes().get().size());
}
} catch (...) {
rlogger.warn("repair[{}]: get_row_diff: got error from node={}, keyspace={}, table={}, range={}, error={}",
_shard_task.global_repair_id.uuid(), node, _shard_task.get_keyspace(), _cf_name, _range, std::current_exception());
@@ -3505,7 +3500,7 @@ public:
// To save memory and have less different conditions, we
// use the estimation for RBNO repair as well.
_estimated_partitions *= _shard_task.rs.get_config().repair_partition_count_estimation_ratio();
_estimated_partitions *= _shard_task.db.local().get_config().repair_partition_count_estimation_ratio();
}
parallel_for_each(master.all_nodes(), coroutine::lambda([&] (repair_node_state& ns) -> future<> {
@@ -3641,8 +3636,7 @@ repair_service::repair_service(sharded<service::topology_state_machine>& tsm,
sharded<db::view::view_building_worker>& vbw,
tasks::task_manager& tm,
service::migration_manager& mm,
size_t max_repair_memory,
config cfg)
size_t max_repair_memory)
: _tsm(tsm)
, _gossiper(gossiper)
, _messaging(ms)
@@ -3657,7 +3651,6 @@ repair_service::repair_service(sharded<service::topology_state_machine>& tsm,
, _node_ops_metrics(_repair_module)
, _max_repair_memory(max_repair_memory)
, _memory_sem(max_repair_memory)
, _config(std::move(cfg))
{
tm.register_module("repair", _repair_module);
if (this_shard_id() == 0) {
@@ -3668,7 +3661,7 @@ repair_service::repair_service(sharded<service::topology_state_machine>& tsm,
future<> repair_service::start(utils::disk_space_monitor* dsm) {
if (dsm && (this_shard_id() == 0)) {
_out_of_space_subscription = dsm->subscribe(_config.critical_disk_utilization_level, [this] (auto threshold_reached) {
_out_of_space_subscription = dsm->subscribe(_db.local().get_config().critical_disk_utilization_level, [this] (auto threshold_reached) {
if (threshold_reached) {
return container().invoke_on_all([] (repair_service& rs) { return rs.drain(); });
}

Some files were not shown because too many files have changed in this diff Show More