scylladb/service/client_state.cc
Avi Kivity 5f8484897b Merge 'cdc: use a new internal table for exchanging generations' from Kamil Braun
Reopening #8286 since the token metadata fix that allows `Everywhere` strategy tables to work with RBO (#8536) has been merged.

---
Currently when a node wants to create and broadcast a new CDC generation
it performs the following steps:
1. choose the generation's stream IDs and mapping (how this is done is
   irrelevant for the current discussion)
2. choose the generation's timestamp by taking the current time
   (according to its local clock) and adding 2 * ring_delay
3. insert the generation's data (mapping and stream IDs) into
   system_distributed.cdc_generation_descriptions, using the
   generation's timestamp as the partition key (we call this table
   the "old internal table" below)
4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP"
   application state.

The timestamp spreads epidemically through the gossip protocol. When
nodes see the timestamp, they retrieve the generation data from the
old internal table.
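
The timestamp arithmetic from step 2 can be sketched as follows. This is a minimal illustration, not Scylla's code: `wall_clock`, `choose_generation_timestamp` and `learned_in_time` are hypothetical names, and `std::chrono::system_clock` merely stands in for the node's local clock.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for a node's local wall clock.
using wall_clock = std::chrono::system_clock;

// Step 2 of the old procedure: timestamp = local time + 2 * ring_delay.
inline wall_clock::time_point choose_generation_timestamp(
        wall_clock::time_point now, std::chrono::milliseconds ring_delay) {
    assert(ring_delay.count() >= 0);
    return now + 2 * ring_delay;
}

// A node learned about the generation "in time" if its local clock
// has not yet moved past the generation's timestamp.
inline bool learned_in_time(wall_clock::time_point local_now,
        wall_clock::time_point generation_ts) {
    return local_now < generation_ts;
}
```

With a ring delay of 30 seconds, for example, every node has 60 seconds (by the creator's clock) to see the gossiped timestamp and fetch the data.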

Unfortunately, due to the schema of the old internal table, where
the entire generation data is stored in a single cell, step 3 may fail for
sufficiently large generations (there is a size threshold above which step
3 will always fail - retrying the operation won't help). Also the old
internal table lies in the system_distributed keyspace that uses
SimpleStrategy with replication factor 3, which is also problematic; for
example, when nodes restart, they must reach at least 2 out of these 3
specific replicas in order to retrieve the current generation (we write
and read the generation data with QUORUM, unless we're a single-node
cluster, where we use ONE). Until this happens, a restarting
node can't coordinate writes to CDC-enabled tables. It would be better
if the node could access the last known generation locally.

The commit introduces a new table for broadcasting generation data with
the following properties:
-  it uses a better schema that stores the data in multiple rows, each
   of manageable size
-  it resides in a new keyspace that uses EverywhereStrategy so the
   data will be written to every node in the cluster that has a token in
   the token ring
-  the data will be written using CL=ALL and read using CL=ONE; thanks
   to this, a restarting node won't have to communicate with other nodes
   to retrieve the data of the last known generation. Note that writing
   with CL=ALL does not reduce availability: creating a new generation
   *requires* all nodes to be available anyway, because they must learn
   about the generation before their clocks go past the generation's
   timestamp; if they don't, partitions won't be mapped to stream IDs
   consistently across the cluster
-  the partition key is no longer the generation's timestamp. Because it
   was that way in the old internal table, it forced the algorithm to
   choose the timestamp *before* the generation data was inserted into
   the table. If the insert took a long time, this increased the
   chance that nodes would learn about the generation too late (after
   their clocks moved past its timestamp). With the new schema we will
   first insert the generation data using a randomly generated UUID as
   the partition key, *then* choose the timestamp, then gossip both the
   timestamp and the UUID.
   Observe that after a node learns, through gossip, about a generation
   broadcast using this new method, it will retrieve its data very
   quickly: it is one of the replicas, and it can read with CL=ONE
   because the data was written with CL=ALL.

The generation's timestamp and the UUID mentioned in the last point form
a "generation identifier" for this new generation. For passing these new
identifiers around, we introduce the cdc::generation_id_v2 type.
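
The ordering described above (write the data first, pick the timestamp afterwards, then gossip both) can be sketched roughly as below. Everything here is an assumption made for illustration: the field layout of `generation_id_v2`, the in-memory `generations_table`, the 64-bit integer standing in for a UUID, and the reuse of the `2 * ring_delay` formula for the new procedure.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <random>
#include <string>
#include <utility>
#include <vector>

using wall_clock = std::chrono::system_clock;

// Hypothetical shape of a v2 identifier: the timestamp plus the
// partition key under which the generation data was stored.
struct generation_id_v2 {
    wall_clock::time_point ts;
    std::uint64_t id;  // stands in for a randomly generated UUID
};

// Toy stand-in for the new internal table: partition key -> rows.
struct generations_table {
    std::map<std::uint64_t, std::vector<std::string>> partitions;
};

// `now` is any callable returning the local time; it is deliberately
// invoked only after the data has been inserted.
template <typename NowFn>
generation_id_v2 publish_generation(generations_table& table,
        std::vector<std::string> rows,
        std::chrono::milliseconds ring_delay,
        std::mt19937_64& rng,
        NowFn now) {
    assert(!rows.empty());
    const std::uint64_t id = rng();              // 1. random partition key
    table.partitions[id] = std::move(rows);      // 2. insert data (CL=ALL in the real system)
    const wall_clock::time_point ts = now() + 2 * ring_delay;  // 3. choose timestamp last
    return generation_id_v2{ts, id};             // 4. the pair to gossip
}
```

Because the timestamp is taken after the insert completes, a slow write no longer eats into the window that other nodes have to learn about the generation.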

Fixes #7961.

---

For optimal review experience it is best to first read the updated design notes (you can read them rendered here: https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md), specifically the ["Generation switching"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#generation-switching) section followed by the ["Internal generation descriptions table V1 and upgrade procedure"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#internal-generation-descriptions-table-v1-and-upgrade-procedure) section, then read the commits in topological order.

dtest gating run (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/1160/
unit tests (dev) passed locally

Closes #8643

* github.com:scylladb/scylla:
  docs: update cdc.md with info about the new internal table
  sys_dist_ks: don't create old CDC generations table on service initialization
  sys_dist_ks: rename all_tables() to ensured_tables()
  cdc: when creating new generations, use format v2 if possible
  main: pass feature_service to cdc::generation_service
  gms: introduce CDC_GENERATIONS_V2 feature
  cdc: introduce retrieve_generation_data
  test: cdc: include new generations table in permissions test
  sys_dist_ks: increase timeout for create_cdc_desc
  sys_dist_ks: new table for exchanging CDC generations
  tree-wide: introduce cdc::generation_id_v2
2021-05-27 17:13:44 +03:00


/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/*
 * Copyright (C) 2016 ScyllaDB
 *
 * Modified by ScyllaDB
 */

/*
 * This file is part of Scylla.
 *
 * Scylla is free software: you can redistribute it and/or modify
 * it under the terms of the GNU Affero General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * Scylla is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
 */

#include "client_state.hh"
#include "auth/authorizer.hh"
#include "auth/authenticator.hh"
#include "auth/common.hh"
#include "auth/resource.hh"
#include "exceptions/exceptions.hh"
#include "validation.hh"
#include "db/system_keyspace.hh"
#include "db/schema_tables.hh"
#include "tracing/trace_keyspace_helper.hh"
#include "db/system_distributed_keyspace.hh"
#include "database.hh"
#include "cdc/log.hh"
#include <seastar/core/coroutine.hh>

thread_local api::timestamp_type service::client_state::_last_timestamp_micros = 0;

void service::client_state::set_login(auth::authenticated_user user) {
    _user = std::move(user);
}

future<> service::client_state::check_user_can_login() {
    if (auth::is_anonymous(*_user)) {
        return make_ready_future();
    }

    auto& role_manager = _auth_service->underlying_role_manager();

    return role_manager.exists(*_user->name).then([this](bool exists) mutable {
        if (!exists) {
            throw exceptions::authentication_exception(
                    format("User {} doesn't exist - create it with CREATE USER query first",
                            *_user->name));
        }
        return make_ready_future();
    }).then([this, &role_manager] {
        return role_manager.can_login(*_user->name).then([this](bool can_login) {
            if (!can_login) {
                throw exceptions::authentication_exception(format("{} is not permitted to log in", *_user->name));
            }
            return make_ready_future();
        });
    });
}

void service::client_state::validate_login() const {
    if (!_user) {
        throw exceptions::unauthorized_exception("You have not logged in");
    }
}

void service::client_state::ensure_not_anonymous() const {
    validate_login();
    if (auth::is_anonymous(*_user)) {
        throw exceptions::unauthorized_exception("You have to be logged in and not anonymous to perform this request");
    }
}

future<> service::client_state::has_all_keyspaces_access(
        auth::permission p) const {
    if (_is_internal) {
        return make_ready_future();
    }
    validate_login();

    return do_with(auth::resource(auth::resource_kind::data), [this, p](const auto& r) {
        return ensure_has_permission({p, r});
    });
}

future<> service::client_state::has_keyspace_access(const sstring& ks,
        auth::permission p) const {
    return do_with(ks, auth::make_data_resource(ks), [this, p](auto const& ks, auto const& r) {
        return has_access(ks, {p, r});
    });
}

future<> service::client_state::has_column_family_access(const database& db, const sstring& ks,
        const sstring& cf, auth::permission p, auth::command_desc::type t) const {
    validation::validate_column_family(db, ks, cf);

    return do_with(ks, auth::make_data_resource(ks, cf), [this, p, t](const auto& ks, const auto& r) {
        return has_access(ks, {p, r, t});
    });
}

future<> service::client_state::has_schema_access(const schema& s, auth::permission p) const {
    return do_with(
            s.ks_name(),
            auth::make_data_resource(s.ks_name(), s.cf_name()),
            [this, p](auto const& ks, auto const& r) {
        return has_access(ks, {p, r});
    });
}

future<> service::client_state::has_access(const sstring& ks, auth::command_desc cmd) const {
    if (ks.empty()) {
        throw exceptions::invalid_request_exception("You have not set a keyspace for this session");
    }
    if (_is_internal) {
        return make_ready_future();
    }

    validate_login();

    static const auto alteration_permissions = auth::permission_set::of<
            auth::permission::CREATE, auth::permission::ALTER, auth::permission::DROP>();

    // we only care about schema modification.
    if (alteration_permissions.contains(cmd.permission)) {
        // prevent system keyspace modification
        auto name = ks;
        std::transform(name.begin(), name.end(), name.begin(), ::tolower);
        if (is_system_keyspace(name)) {
            throw exceptions::unauthorized_exception(ks + " keyspace is not user-modifiable.");
        }

        //
        // we want to disallow dropping any contents of TRACING_KS and disallow dropping the `auth::meta::AUTH_KS`
        // keyspace.
        //

        const bool dropping_anything_in_tracing = (name == tracing::trace_keyspace_helper::KEYSPACE_NAME)
                && (cmd.permission == auth::permission::DROP);

        const bool dropping_auth_keyspace = (cmd.resource == auth::make_data_resource(auth::meta::AUTH_KS))
                && (cmd.permission == auth::permission::DROP);

        if (dropping_anything_in_tracing || dropping_auth_keyspace) {
            throw exceptions::unauthorized_exception(
                    format("Cannot {} {}", auth::permissions::to_string(cmd.permission), cmd.resource));
        }
    }

    static thread_local std::unordered_set<auth::resource> readable_system_resources = [] {
        std::unordered_set<auth::resource> tmp;
        for (auto cf : { db::system_keyspace::LOCAL, db::system_keyspace::PEERS }) {
            tmp.insert(auth::make_data_resource(db::system_keyspace::NAME, cf));
        }
        for (auto cf : db::schema_tables::all_table_names(db::schema_features::full())) {
            tmp.insert(auth::make_data_resource(db::schema_tables::NAME, cf));
        }
        return tmp;
    }();

    if (cmd.permission == auth::permission::SELECT && readable_system_resources.contains(cmd.resource)) {
        return make_ready_future();
    }
    if (alteration_permissions.contains(cmd.permission)) {
        if (auth::is_protected(*_auth_service, cmd)) {
            throw exceptions::unauthorized_exception(format("{} is protected", cmd.resource));
        }
    }

    if (cmd.resource.kind() == auth::resource_kind::data) {
        const auto resource_view = auth::data_resource_view(cmd.resource);
        if (resource_view.table()) {
            if (cmd.permission == auth::permission::DROP) {
                if (cdc::is_log_for_some_table(ks, *resource_view.table())) {
                    throw exceptions::unauthorized_exception(
                            format("Cannot {} cdc log table {}", auth::permissions::to_string(cmd.permission), cmd.resource));
                }
            }

            static constexpr auto cdc_topology_description_forbidden_permissions = auth::permission_set::of<
                    auth::permission::ALTER, auth::permission::DROP>();

            if (cdc_topology_description_forbidden_permissions.contains(cmd.permission)) {
                if ((ks == db::system_distributed_keyspace::NAME || ks == db::system_distributed_keyspace::NAME_EVERYWHERE)
                        && (resource_view.table() == db::system_distributed_keyspace::CDC_DESC_V2
                        || resource_view.table() == db::system_distributed_keyspace::CDC_TOPOLOGY_DESCRIPTION
                        || resource_view.table() == db::system_distributed_keyspace::CDC_TIMESTAMPS
                        || resource_view.table() == db::system_distributed_keyspace::CDC_GENERATIONS_V2)) {
                    throw exceptions::unauthorized_exception(
                            format("Cannot {} {}", auth::permissions::to_string(cmd.permission), cmd.resource));
                }
            }
        }
    }

    return ensure_has_permission(cmd);
}

future<bool> service::client_state::check_has_permission(auth::command_desc cmd) const {
    if (_is_internal) {
        return make_ready_future<bool>(true);
    }

    return do_with(cmd.resource.parent(), [this, cmd](const std::optional<auth::resource>& parent_r) {
        return auth::get_permissions(*_auth_service, *_user, cmd.resource).then(
                [this, p = cmd.permission, &parent_r](auth::permission_set set) {
            if (set.contains(p)) {
                return make_ready_future<bool>(true);
            }
            if (parent_r) {
                return check_has_permission({p, *parent_r});
            }
            return make_ready_future<bool>(false);
        });
    });
}

future<> service::client_state::ensure_has_permission(auth::command_desc cmd) const {
    return check_has_permission(cmd).then([this, cmd](bool ok) {
        if (!ok) {
            throw exceptions::unauthorized_exception(
                    format("User {} has no {} permission on {} or any of its parents",
                            *_user,
                            auth::permissions::to_string(cmd.permission),
                            cmd.resource));
        }
    });
}

void service::client_state::set_keyspace(database& db, std::string_view keyspace) {
    // Skip keyspace validation for non-authenticated users. Apparently, some client libraries
    // call set_keyspace() before calling login(), and we have to handle that.
    if (_user && !db.has_keyspace(keyspace)) {
        throw exceptions::invalid_request_exception(format("Keyspace '{}' does not exist", keyspace));
    }
    _keyspace = sstring(keyspace);
}

future<> service::client_state::ensure_exists(const auth::resource& r) const {
    return _auth_service->exists(r).then([&r](bool exists) {
        if (!exists) {
            throw exceptions::invalid_request_exception(format("{} doesn't exist.", r));
        }
        return make_ready_future<>();
    });
}

future<> service::client_state::maybe_update_per_service_level_params() {
    if (_sl_controller && _user && _user->name) {
        auto& role_manager = _auth_service->underlying_role_manager();
        auto role_set = co_await role_manager.query_granted(_user->name.value(), auth::recursive_role_query::yes);
        auto slo_opt = co_await _sl_controller->find_service_level(role_set);
        if (!slo_opt) {
            co_return;
        }
        auto slo_timeout_or = [&] (const lowres_clock::duration& default_timeout) {
            return std::visit(overloaded_functor{
                [&] (const qos::service_level_options::unset_marker&) -> lowres_clock::duration {
                    return default_timeout;
                },
                [&] (const qos::service_level_options::delete_marker&) -> lowres_clock::duration {
                    return default_timeout;
                },
                [&] (const lowres_clock::duration& d) -> lowres_clock::duration {
                    return d;
                },
            }, slo_opt->timeout);
        };
        _timeout_config.read_timeout = slo_timeout_or(_default_timeout_config.read_timeout);
        _timeout_config.write_timeout = slo_timeout_or(_default_timeout_config.write_timeout);
        _timeout_config.range_read_timeout = slo_timeout_or(_default_timeout_config.range_read_timeout);
        _timeout_config.counter_write_timeout = slo_timeout_or(_default_timeout_config.counter_write_timeout);
        _timeout_config.truncate_timeout = slo_timeout_or(_default_timeout_config.truncate_timeout);
        _timeout_config.cas_timeout = slo_timeout_or(_default_timeout_config.cas_timeout);
        _timeout_config.other_timeout = slo_timeout_or(_default_timeout_config.other_timeout);
        _workload_type = slo_opt->workload;
    }
}
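
The `slo_timeout_or` lambda in `maybe_update_per_service_level_params()` falls back to the default timeout unless the service level carries a concrete duration. A standalone sketch of that pattern follows; it is only an approximation of the real types. `overloaded`, `unset_marker`, `delete_marker`, `timeout_option` and `timeout_or` are local stand-ins written for this example, not the definitions of `overloaded_functor` or `qos::service_level_options`.

```cpp
#include <cassert>
#include <chrono>
#include <variant>

// Local helper that merges several lambdas into one overload set,
// playing the role that overloaded_functor plays in the code above.
template <typename... Fs>
struct overloaded : Fs... { using Fs::operator()...; };
template <typename... Fs>
overloaded(Fs...) -> overloaded<Fs...>;

// Hypothetical stand-ins for the service level timeout alternatives.
struct unset_marker {};
struct delete_marker {};
using duration = std::chrono::milliseconds;
using timeout_option = std::variant<unset_marker, delete_marker, duration>;

// Returns the configured timeout if one is set, else the default.
inline duration timeout_or(const timeout_option& opt, duration default_timeout) {
    assert(default_timeout.count() >= 0);
    return std::visit(overloaded{
        [&](const unset_marker&) { return default_timeout; },
        [&](const delete_marker&) { return default_timeout; },
        [](const duration& d) { return d; },
    }, opt);
}
```

Handling each alternative with its own lambda means that adding a new alternative to the variant produces a compile error at the visit site rather than a silent fallthrough.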