Reopening #8286 since the token metadata fix that allows `Everywhere` strategy tables to work with RBO (#8536) has been merged.
---
Currently when a node wants to create and broadcast a new CDC generation
it performs the following steps:
1. choose the generation's stream IDs and mapping (how this is done is
irrelevant for the current discussion)
2. choose the generation's timestamp by taking the current time
(according to its local clock) and adding 2 * ring_delay
3. insert the generation's data (mapping and stream IDs) into
system_distributed.cdc_generation_descriptions, using the
generation's timestamp as the partition key (we call this table
the "old internal table" below)
4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP"
application state.
The timestamp spreads epidemically through the gossip protocol. When
nodes see the timestamp, they retrieve the generation data from the
old internal table.
Unfortunately, due to the schema of the old internal table, where
the entire generation data is stored in a single cell, step 3 may fail for
sufficiently large generations (there is a size threshold for which step
3 will always fail - retrying the operation won't help). Also the old
internal table lies in the system_distributed keyspace that uses
SimpleStrategy with replication factor 3, which is also problematic; for
example, when nodes restart, they must reach at least 2 out of these 3
specific replicas in order to retrieve the current generation (we write
and read the generation data with QUORUM, unless we're a single-node
cluster, where we use ONE). Until this happens, a restarting
node can't coordinate writes to CDC-enabled tables. It would be better
if the node could access the last known generation locally.
The commit introduces a new table for broadcasting generation data with
the following properties:
- it uses a better schema that stores the data in multiple rows, each
of manageable size
- it resides in a new keyspace that uses EverywhereStrategy so the
data will be written to every node in the cluster that has a token in
the token ring
- the data will be written using CL=ALL and read using CL=ONE; thanks
to this, restarting node won't have to communicate with other nodes
to retrieve the data of the last known generation. Note that writing
with CL=ALL does not reduce availability: creating a new generation
*requires* all nodes to be available anyway, because they must learn
about the generation before their clocks go past the generation's
timestamp; if they don't, partitions won't be mapped to stream IDs
consistently across the cluster
- the partition key is no longer the generation's timestamp. Because it
was that way in the old internal table, it forced the algorithm to
choose the timestamp *before* the generation data was inserted into
the table. What if the inserting took a long time? It increased the
chance that nodes would learn about the generation too late (after
their clocks moved past its timestamp). With the new schema we will
first insert the generation data using a randomly generated UUID as
the partition key, *then* choose the timestamp, then gossip both the
timestamp and the UUID.
Observe that after a node learns about a generation broadcasted using
this new method through gossip it will retrieve its data very quickly
since it's one of the replicas and it can use CL=ONE as it was
written using CL=ALL.
The generation's timestamp and the UUID mentioned in the last point form
a "generation identifier" for this new generation. For passing these new
identifiers around, we introduce the cdc::generation_id_v2 type.
Fixes#7961.
---
For optimal review experience it is best to first read the updated design notes (you can read them rendered here: https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md), specifically the ["Generation switching"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#generation-switching) section followed by the ["Internal generation descriptions table V1 and upgrade procedure"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#internal-generation-descriptions-table-v1-and-upgrade-procedure) section, then read the commits in topological order.
dtest gating run (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/1160/
unit tests (dev) passed locally
Closes#8643
* github.com:scylladb/scylla:
docs: update cdc.md with info about the new internal table
sys_dist_ks: don't create old CDC generations table on service initialization
sys_dist_ks: rename all_tables() to ensured_tables()
cdc: when creating new generations, use format v2 if possible
main: pass feature_service to cdc::generation_service
gms: introduce CDC_GENERATIONS_V2 feature
cdc: introduce retrieve_generation_data
test: cdc: include new generations table in permissions test
sys_dist_ks: increase timeout for create_cdc_desc
sys_dist_ks: new table for exchanging CDC generations
tree-wide: introduce cdc::generation_id_v2
This draft extends and obsoletes #8123 by introducing a way of determining the workload type from service level parameters, and then using this context to qualify requests for shedding.
The rough idea is that when the admission queue in the CQL server is hit, it might make more sense to start shedding surplus requests instead of accumulating them on the semaphore. The assumption that interactive workloads are more interested in the success rate of as many requests as possible, and hanging on a semaphore reduces the chances for a request to succeed. Thus, it may make sense to shed some requests to reduce the load on this coordinator and let the existing requests to finish.
It's a draft, because I only performed local guided tests. #8123 was followed by some experiments on a multinode cluster which I want to rerun first.
Closes#8680
* github.com:scylladb/scylla:
test: add a case for conflicting workload types
cql-pytest: add basic tests for service level workload types
docs: describe workload types for service levels
sys_dist_ks: fix redundant parsing in get_service_level
sys_dist_ks: make get_service_level exception-safe
transport: start shedding requests during potential overload
client_state: hook workload type from service levels
cql3: add listing service level workload type
cql3: add persisting service level workload type
qos: add workload_type service level parameter
Some subclasses want to maintain state, which constness needlessly precludes.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#8721
Currently when a node wants to create and broadcast a new CDC generation
it performs the following steps:
1. choose the generation's stream IDs and mapping (how this is done is
irrelevant for the current discussion)
2. choose the generation's timestamp by taking the current time
(according to its local clock) and adding 2 * ring_delay
3. insert the generation's data (mapping and stream IDs) into
system_distributed.cdc_generation_descriptions, using the
generation's timestamp as the partition key (we call this table
the "old internal table" below)
4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP"
application state.
The timestamp spreads epidemically through the gossip protocol. When
nodes see the timestamp, they retrieve the generation data from the
old internal table.
Unfortunately, due to the schema of the old internal table, where
the entire generation data is stored in a single cell, step 3 may fail for
sufficiently large generations (there is a size threshold for which step
3 will always fail - retrying the operation won't help). Also the old
internal table lies in the system_distributed keyspace that uses
SimpleStrategy with replication factor 3, which is also problematic; for
example, when nodes restart, they must reach at least 2 out of these 3
specific replicas in order to retrieve the current generation (we write
and read the generation data with QUORUM, unless we're a single-node
cluster, where we use ONE). Until this happens, a restarting
node can't coordinate writes to CDC-enabled tables. It would be better
if the node could access the last known generation locally.
The commit introduces a new table for broadcasting generation data with
the following properties:
- it uses a better schema that stores the data in multiple rows, each
of manageable size
- it resides in the `system_distributed_everywhere` keyspace so the
data will be written to every node in the cluster that has a token in
the token ring
- the data will be written using CL=ALL and read using CL=ONE; thanks
to this, restarting node won't have to communicate with other nodes
to retrieve the data of the last known generation. Note that writing
with CL=ALL does not reduce availability: creating a new generation
*requires* all nodes to be available anyway, because they must learn
about the generation before their clocks go past the generation's
timestamp; if they don't, partitions won't be mapped to stream IDs
consistently across the cluster
- the partition key is no longer the generation's timestamp. Because it
was that way in the old internal table, it forced the algorithm to
choose the timestamp *before* the generation data was inserted into
the table. What if the inserting took a long time? It increased the
chance that nodes would learn about the generation too late (after
their clocks moved past its timestamp). With the new schema we will
first insert the generation data using a randomly generated UUID as
the partition key, *then* choose the timestamp, then gossip both the
timestamp and the UUID. The timestamp and the UUID form the
"generation identifier" of this new generation; this should explain
why we introduced the generation_id_v2 type in previous commits.
Observe that after a node learns about a generation broadcasted using
this new method through gossip it will retrieve its data very quickly
since it's one of the replicas and it can use CL=ONE as it was
written using CL=ALL.
Note that the node is still using the old method - the actual switch
will be done in a later commit.
`system_distributed_everywhere` is a new keyspace that uses Everywhere
replication strategy. This is useful, for example, when we want to store
internal data that should be accessible by every node; the data can be
written using CL=ALL (e.g. during node operations such as node
bootstrap, which require all nodes to be alive - at least currently) and
then read by each node locally using CL=ONE (e.g. during node restarts).
Closes#8457
Until now, the lists of streams in the `cdc_streams_descriptions` table
for a given generation were stored in a single collection. This solution
has multiple problems when dealing with large clusters (which produce
large lists of streams):
1. large allocations
2. reactor stalls
3. mutations too large to even fit in commitlog segments
This commit changes the schema of the table as described in issue #7993.
The streams are grouped according to token ranges, each token range
being represented by a separate clustering row. Rows are inserted in
reasonably large batches for efficiency.
The table is renamed to enable easy upgrade. On upgrade, the latest CDC
generation's list of streams will be (re-)inserted into the new table.
Yet another table is added: one that contains only the generation
timestamps clustered in a single partition. This makes it easy for CDC
clients to learn about new generations. It also enables an elegant
two-phase insertion procedure of the generation description: first we
insert the streams; only after ensuring that a quorum of replicas
contains them, we insert the timestamp. Thus, if any client observes a
timestamp in the timestamps table (even using a ONE query),
it means that a quorum of replicas must contain the list of streams.
The client_state::check_access() calls for global storage service
to get the features from it and check if the CDC feature is on.
The latter is needed to perform CDC-specific checks.
However it was noticed, that the check for the feature is excessive
as all the guarded if-s will resolve to false in case CDC is off
and the check_access will effectively work as it would with the
feature check.
With that observation, it's possible to ditch one more global storage
service reference.
tests: unit(dev), dtest(dev, auth)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210105063651.7081-1-xemul@scylladb.com>
The previous patch brought the databse reference arg. And since
the currently called validate_column_family() overload _just_
gets the database from global proxy, it's better to shortcut.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is called from cql3/statements' check_access methods and from thrift
handlers. The former have proxy argument from which they can get the
database. The latter already have the database itself on board.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
These alterations cannot break the database irreparably, so allow
them.
Expand command_desc as required.
Add a type (rather than command_desc) parameter to
has_column_family_access() to minimize code changes.
Fixes#7057
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
`count` function was often used in various ways.
`contains` does not only express the intend of the code better but also
does it in more unified way.
This commit replaces all the occurences of the `count` with the
`contains`.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <b4ef3b4bc24f49abe04a2aba0ddd946009c9fcb2.1597314640.git.piotr@scylladb.com>
No change in the implementation since it was already copying the
string. Taking a std::string_view is just a bit more flexible.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
_user cannot outlive client_state class instance, so there is no point
in holding it in shared_ptr.
Tested: debug test.py and dtest auth_test.py
Message-Id: <20191128131217.26294-5-gleb@scylladb.com>
If each client_state has its own copy of the variable two clients may
generate timestamps that clash and needlessly create contention. Making
the variable shared between all client_state on the same shard will make
sure this will not happen to two clients on the same shard. It may still
happen for two client on two different shards or two different nodes.
Commit 7e3805ed3d removed the load balancing code from cql
server, but it did not remove most of the craft that load balancing
introduced. The most of the complexity (and probably the main reason the
code never worked properly) is around service::client_state class which
is copied before been passed to the request processor (because in the past
the processing could have happened on another shard) and then merged back
into the "master copy" because a request processing may have changed it.
This commit remove all this copying. The client_request is passed as a
reference all the way to the lowest layer that needs it and it copy
construction is removed to make sure nobody copies it by mistake.
tests: dev, default c-s load of 3 node cluster
Message-Id: <20190906083050.GA21796@scylladb.com>
db::schema_tables::ALL and db::schema_tables::all_tables() are both supposed
to list the same schema tables - the former is the list of their names, and
the latter is the list of their schemas. This code duplication makes it easy
to forget to update one of them, and indeed recently the new
"view_virtual_columns" was added to all_tables() but not to ALL.
What this patch does is to make ALL a function instead of constant vector.
The newly named all_table_names() function uses all_tables() so the list
of schema tables only appears once.
So that nobody worries about the performance impact, all_table_names()
caches the list in a per-thread vector that is only prepared once per thread.
Because after this patch all_table_names() has the "view_virtual_columns"
that was previously missing, this patch also fixes#4339, which was about
virtual columns in materialized views not being propagated to other nodes.
Unfortunately, to test the fix for #4339 we need a test with multiple
nodes, so we cannot test it here in a unit test, and will instead use
the dtest framework, in a separate patch.
Fixes#4339
Branches: 3.0
Tests: all unit tests (release and debug mode), new dtest for #4339. The unit test mutation_reader_test failed in debug mode but not in release mode, but this probably has nothing to do with this patch (?).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190320063437.32731-1-nyh@scylladb.com>
This header, which is easily replaced with a forward declaration,
introduces a dependency on database.hh everywhere. Remove it and scatter
includes of database.hh in source files that really need it.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
There is not reason to use an std::set for it since we don't care about
the ordering - only about the existance of a particular entry.
Hash table will be more efficient for this use case.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1528220892-5784-2-git-send-email-vladz@scylladb.com>
This change allows for seamless migration of the legacy users metadata
to the new role-based metadata tables. This process is summarized in
`docs/migrating-from-users-to-roles.md`.
In general, if any nondefault metadata exists in the new tables, then
no migration happens. If, in this case, legacy metadata still exists
then a warning is written to the log.
If no nondefault metadata exists in the new tables and the legacy tables
exist, then each node will copy the data from the legacy tables to the
new tables, performing transformations as necessary. An informational
message is written to the log when the migration process starts, and
when the process ends. During the process of copying, data is
overwritten so that multiple nodes racing to migrate data do not
conflict.
Since Apache Cassandra's auth. schema uses the same table for managing
roles and authentication information, some useful functions in
`roles-metadata.hh` have been added to avoid code duplication.
Because a superuser should be able to drop the legacy users tables from
`system_auth` once the cluster has migrated to roles and is functioning
correctly, we remove the restriction on altering anything in the
"system_auth" keyspace. Individual tables in `system_auth` are still
protected later in the function.
When a cluster is upgrading from one that does not support roles to one
that does, some nodes will be running old code which accesses old
metadata and some will be running new code which access new metadata.
With the help of the gossiper `feature` mechanism, clients connecting to
upgraded nodes will be notified (through code in the relevant CQL
statements) that modifications are not allowed until the entire cluster
has upgraded.
auth: Decouple authorization and role management
Access control in Scylla consists of three main modules: authentication,
authorization, and role-management.
Each of these modules is intended to be interchangeable with alternative
implementations. The `auth::service` class composes these modules
together to perform all access-control functionality, including caching.
This architecture implies two main properties of the individual
access-control modules:
- Independence of modules. An implementation of authentication should
have no dependence or knowledge of authorization or role-management,
for example.
- Simplicity of implementing the interface. Functionality that is common
to all implementations should not have to be duplicated in each
implementation. The abstract interface for a module should capture
only the differences between particular implementations.
Previously, the authorization interface depended on an instance of
`auth::service` for certain operations, since it required aggregation
over all the roles granted to a particular role or required checking if
a given role had superuser.
This change decouples authorization entirely from role-management: the
authorizer now manages only permissions granted directly to a role, and
not those inherited through other roles.
When a query needs to be authorized, `auth::service::get_permissions`
first uses the role manager to check if the role has superuser. Then, it
aggregates calls to `auth::authorizer::authorize` for each role granted
to the role (again, from the role-manager) to determine the sum-total
permission set. This information is cached for future queries.
This structure allows for easier error handling and
management (something I hope to improve in the future for both the
authorizer and authenticator interfaces), easier system testing, easier
implementation of the abstract interfaces, and clearer system
boundaries (so the code is easier to grok).
Some authorizers, like the "TransitionalAuthorizer", grant permissions
to anonymous users. Therefore, we could not unconditionally authorize an
empty permission set in `auth::service` for anonymous users. To account
for this, the interface of the authorizer has changed to accept an
optional name in `authorize`.
One additional notable change to the authorizer is the
`auth::authorizer::list`: previously, the filtering happened at the CQL
query layer and depended on the roles granted to the role in question.
I've changed the function to simply query for all roles and I do the
filtering in `auth::system` in-memory with the STL. This was necessary
to allow the authorizer to be decoupled from role-management. This
function is only called for LIST PERMISSIONS (so performance is not a
concern), and it significantly reduces demand on the implementation.
Finally, we unconditionally create a user in `cql_test_env` since
authorization requires its existence.
Previously, a "data" auth. resource knew how to check it's own existence by
accessing a global variable.
This patch accomplishes two things: it adds existence checking to all
kinds of resources, and moves these checks outside of `auth::resource`
itself and into `auth::service` (so that global variables are no longer
accessed).
This has the dual benefit of not enforcing copying on implementations of
the abstract interface and also limiting unnecessary copies.
As usual with Seastar, we follow the convention that a reference
parameter to a function is assumed valid for the duration of the
`future` that is returned. `do_with` helps here.
By adding some constants for root resources, we can avoid using
`seastar::do_with` at some call-sites involving `resource` instances.
The motivation behind this change is the idea that constructing a new
instance of an object is the job of the constructor.
One big benefit of this structure (with the addition of helpers for
convenience) is that calls for emplacing instances (like
`std::make_shared`, or `std::vector::emplace_back`) work without any
difficulty. This would not be true for static construction functions.
The most important change is replacing `auth::authenticated_user::name`
with a public `std::optional<sstring>` member. Anonymous users have no
name. This replaces the insecure and bug-prone special-string of
"anonymous" for anonymous users, which does unfortunate things with the
authorizer.
The new `auth::is_anonymous` function exists for convenience since
checking the absence of a `std::optional` value can be tedious.
When a caller really wants a name unconditionally, a new stream output
function is also available.
This is a large change, but it's a necessary evil.
This change brings us to a minimally-functional implementation of roles.
There are many additional changes that are necessary, including refined
grammar, bug fixes, code hygiene, and internal code structure changes.
In the interest of keeping this patch somewhat read-able, those changes
will come in subsequent patches. Until that time, roles are still marked
"unimplemented".
IMPORTANT: This code does not include any mechanism for transitioning a
cluster from user-based access-control to role-based access control. All
existing access-control metadata will be ignored (though not deleted).
Specific changes:
- All user-specific CQL statements now delegate to their roles
equivalent. The statements are effectively the same, but CREATE USER
will include LOGIN automatically. Also, LIST USERS only lists roles
with LOGIN.
- A call to LIST PERMISSIONS will now also list permissions of roles
that have been granted to the caller, in addition to permissions which
have been granted directly.
- Much of the logic of creating, altering, and deleting roles has been
moved to `auth::service`, since these operations require cooperation
between the authenticator, authorizer, and role-manager.
- LIST USERS actually works as expected now (fixes#2968).
client_state used in the process_request_one(...) contains all sorts of information irrelevant
to the caller (process_request(...)), e.g. Tracing state. Therefore instead of returning
the whole client_state object (which becomes even a bigger problem if process_one(...) and process_request_one(...)
are executed on different shards) we will return only the pieces of information we really need.
To do that we introduce a new class - processing_result, which is cross-shard-access-ready to begin with.
We are going to return a instance of this new class from the process_request_one(...).
Fixes#2351
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Move the requests-handling-related state into the client_state. This is needed to properly
define the interface between the process_request(...) and process_request_one(...).
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Since the connection::_client_state is the only generator of new timestamps
now there is no need for this merge.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
A new constructor creates a copy of the current client_status to be
used in the context of the handling of a single request.
The copy may take place at a shard different from the one where the
request has been received.
In order to ensure the monotonicity of the timestamps used by the request handled
on the same connection the created copy of the client_state is going to use the same timestamp provided by the
caller instead of generating it.
It's the caller's responsibility to ensure the monotonicity of given timestamps.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Applicable permission sets will soon be specific to each kind of
resource. This change prepares us for dynamic querying of permission
sets by resource.
This change generalizes the implementation of a `resource` to many
different kinds of resources, though there is still only one
kind (`data`). In the future, we also expect resource kinds for roles,
user-defined functions (UDFs), and possibly on particular REST
end-points.
I considered several approaches to generalizing to different kinds of
resources.
One approach is to have a base class that is inherited from by different
resource kinds. The common functionality would be accessed through
virtual member functions and kind-specific functions would exist in
sub-classes. I rejected this approach because dealing with different
kinds of resources uniformly requires storage and life-time management
through something like `std::unique_ptr<auth::resource>`, which means
that we lose value semantics (including comparison) and must deal with
complications around ownership.
Another option was to use `boost::variant` (or, in future,
`std::variant`). This is closer to what we want, since there a static
set of resource kinds that we support. I rejected this approach for two
reasons. The first is that all resource kinds share the same data (a
list of segments and a root identifier), which would be duplicated in
each type that composed the variant. The second is that the complexity
and source-code overhead of `boost::variant` didn't seem warranted.
The solution I ended up with is home-grown variant. All resources are
described in the same `final` class: `auth::resource`. This class has
value semantics, supports equality comparison, and has a strict
ordering. All resources have in common a tag ("kind") and a list of
parts. Most operations on resources don't care about the kind of
resource (like getting its name, parsing a name, querying for the
parent, etc). These are just member functions of the class.
When we care about a kind-specific interpretation of a resource, we can
produce a "view" of the resource. For example, `data_resource_view`
allows for accessing the (optional) keyspace and table names.
I anticipate in the future to add functions for creating role
resources (`auth::resource::role`) and also `role_resource_view`.
The functional behaviour of the system should be unchanged with this
patch.
I've added new unit tests in `auth_resource_test.cc` and removed the old
test from `auth_test.cc`.
Fixes#3027.