CDC and $paxos tables are managed internally by Scylla. Users are
already prohibited from running ALTER and DROP commands on CDC tables.
In this commit, we extend the same restrictions to $paxos tables to
prevent users from shooting themselves in the foot.
Other commands are generally allowed for CDC and $paxos tables. An
important distinction is that CDC tables are meant to be accessed
directly by users, so appropriate permissions must be set for
non-superusers. In contrast, $paxos tables are not intended for direct
access by users. Therefore, this commit explicitly disallows
non-superusers from accessing them. Superusers are still allowed
access for debugging and troubleshooting purposes.
Note that these restrictions apply even if explicit permissions have
been granted. For example, a non-superuser may be granted SELECT
permissions on a $paxos table, but the restriction above will
still take precedence. We don't try to restrict users
from giving permissions to $paxos tables for simplicity.
This is a refactoring commit — it extracts the CDC permissions handling
logic into a separate function: check_internal_table_permissions.
This is a preparatory step for the next commit, where we'll handle
paxos state tables similarly to CDC tables.
Before this patch, granting a user MODIFY permissions on ALL KEYSPACES allowed the user to write to system tables, where the user could also set himself to "superuser" granting him all other permissions. After this patch, MODIFY permissions on ALL KEYSPACES is limited only to non-system keyspaces.
Fixes: scylladb/scylladb#23218Closesscylladb/scylladb#23219
There are few more places left that can use all_table_infos() as a
replacement for all_table_names(), patch them.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's possible to modify 'memtable_flush_period_in_ms' option only and as
single option, not with any other options together
Refs #20999Fixes#21223Closesscylladb/scylladb#22536
Similarly to `maybe_update_per_service_level_params`, the method update
connection's params but it gets `service_level_options` as an argument
instead of asking `service_level_controller`.
before this change, we rely on `using namespace seastar` to use
`seastar::format()` without qualifying the `format()` with its
namespace. this works fine until we changed the parameter type
of format string `seastar::format()` from `const char*` to
`fmt::format_string<...>`. this change practically invited
`seastar::format()` to the club of `std::format()` and `fmt::format()`,
where all members accept a templated parameter as its `fmt`
parameter. and `seastar::format()` is not the best candidate anymore.
despite that argument-dependent lookup (ADT for short) favors the
function which is in the same namespace as its parameter, but
`using namespace` makes `seastar::format()` more competitive,
so both `std::format()` and `seastar::format()` are considered
as the condidates.
that is what is happening scylladb in quite a few caller sites of
`format()`, hence ADT is not able to tell which function the winner
in the name lookup:
```
/__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous
265 | return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id());
| ^~~~~~
/usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
4290 | format(format_string<_Args...> __fmt, _Args&&... __args)
| ^
/__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
143 | format(fmt::format_string<A...> fmt, A&&... a) {
| ^
```
in this change, we
change all `format()` to either `fmt::format()` or `seastar::format()`
with following rules:
- if the caller expects an `sstring` or `std::string_view`, change to
`seastar::format()`
- if the caller expects an `std::string`, change to `fmt::format()`.
because, `sstring::operator std::basic_string` would incur a deep
copy.
we will need another change to enable scylladb to compile with the
latest seastar. namely, to pass the format string as a templated
parameter down to helper functions which format their parameters.
to miminize the scope of this change, let's include that change when
bumping up the seastar submodule. as that change will depend on
the seastar change.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Write down definitions of `service level` and `effective service level`
in service/qos/service_level_controller.hh.
Until now, effective service level was only used as result of
`LIST EFFECTIVE SERVICE LEVEL OF <role>`.
Now we want to have quick access to effective service level of
each role and introduce cache of effective sl to do it.
New definitions clarify things.
The commit also renames:
- `update_service_levels_from_distributed_data` -> `update_service_levels_cache`
Later we will introduce effective_service_level_cache, so this change
standarizes the names.
- `find_service_level` -> `find_effective_service_level`
The function actualy returns effective service level.
db::schema_tables::all_table_names() returns std::vector<sstring>.
Usage of range-for loop without reference results in copying each
of the elements of the traversed container. Such copying is redundant.
This change introduces usage of const reference to avoid copying.
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Closesscylladb/scylladb#16983
Checking keyspace/table presence should not be part of authorization code
and it is not done consistently today. For instance keyspace presence
is not checked in "alter keyspace" during authorization, but during
statement execution. Make it consistent.
Checking if a table is CDC log and cannot be dropped should not be done
as part of authentication (this has nothing to do with auth), but in the
drop statement itself. Throwing unauthorized_exception is wrong as well,
but unfortunately it is enshrined with a test. Not sure if it is a good
idea to change it now.
In some places, the parameter name used when constructing
a resource object was 'function_name', while the actual
argument was the signature of a function, which is particularly
confusing, because function names also appear frequently in these
contexts. This patch changes the identifiers to more accurately
reflect, what they represent.
After fcb8d040 ("treewide: use Software Package Data Exchange
(SPDX) license identifiers"), many dual-licensed files were
left with empty comments on top. Remove them to avoid visual
noise.
Closes#10562
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
And instantly convert the validate_keyspace() as it's not called
from anywhere but the validate_column_family().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This db argument is only needed to be pushed into
cdc::is_log_for_some_table() helper. All callers already have
the d._d.::database at hands and convert it into .real_database()
call-time, so this patch effectively generalizes those calls to
the .real_database().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's now called with d._d.::database converted to .real_database()
right in the argument passing, so this change can be treated as
the generalization of that .real_database() call.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Straightforward replacement. Internals of the has_column_family_access()
temporarily get .real_database(), but it will be changed soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Straightforward replacement. Internals of the has_keyspace_access()
temporarily get .real_database(), but it will be changed soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.
References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.
scylla-gdb.py is adjusted to look for both the new and old names.
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.
As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
No functional changes, but makes the code shorter and gets rid
of a few allocations.
Coroutinizing has_column_family_access is deliberately skipped and
commented, since some callers expect this function to throw instead
of returning an exceptional future.
Message-Id: <958848a1eeeef490b162d2d2b805c8a14fc9082b.1636704996.git.sarna@scylladb.com>
All callers has been patched already. This argument can now
be used to replace get_local_storage_proxy().get_db().local()
call.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Use a forward declaration of cql3::expr::oper_t to reduce the
number of translation units depending on expression.hh.
Before:
$ find build/dev -name '*.d' | xargs cat | grep -c expression.hh
272
After:
$ find build/dev -name '*.d' | xargs cat | grep -c expression.hh
154
Some translation units adjust their includes to restore access
to required headers.
Closes#9229
In order to avoid needless throwing, exceptions are passed
directly wherever possible. Two mechanisms which help with that are:
1. make_exception_future<> for futures
2. co_return coroutine::exception(...) for coroutines
which return future<T> (the mechanism does not work for future<>
without parameters, unfortunately)
Reopening #8286 since the token metadata fix that allows `Everywhere` strategy tables to work with RBO (#8536) has been merged.
---
Currently when a node wants to create and broadcast a new CDC generation
it performs the following steps:
1. choose the generation's stream IDs and mapping (how this is done is
irrelevant for the current discussion)
2. choose the generation's timestamp by taking the current time
(according to its local clock) and adding 2 * ring_delay
3. insert the generation's data (mapping and stream IDs) into
system_distributed.cdc_generation_descriptions, using the
generation's timestamp as the partition key (we call this table
the "old internal table" below)
4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP"
application state.
The timestamp spreads epidemically through the gossip protocol. When
nodes see the timestamp, they retrieve the generation data from the
old internal table.
Unfortunately, due to the schema of the old internal table, where
the entire generation data is stored in a single cell, step 3 may fail for
sufficiently large generations (there is a size threshold for which step
3 will always fail - retrying the operation won't help). Also the old
internal table lies in the system_distributed keyspace that uses
SimpleStrategy with replication factor 3, which is also problematic; for
example, when nodes restart, they must reach at least 2 out of these 3
specific replicas in order to retrieve the current generation (we write
and read the generation data with QUORUM, unless we're a single-node
cluster, where we use ONE). Until this happens, a restarting
node can't coordinate writes to CDC-enabled tables. It would be better
if the node could access the last known generation locally.
The commit introduces a new table for broadcasting generation data with
the following properties:
- it uses a better schema that stores the data in multiple rows, each
of manageable size
- it resides in a new keyspace that uses EverywhereStrategy so the
data will be written to every node in the cluster that has a token in
the token ring
- the data will be written using CL=ALL and read using CL=ONE; thanks
to this, restarting node won't have to communicate with other nodes
to retrieve the data of the last known generation. Note that writing
with CL=ALL does not reduce availability: creating a new generation
*requires* all nodes to be available anyway, because they must learn
about the generation before their clocks go past the generation's
timestamp; if they don't, partitions won't be mapped to stream IDs
consistently across the cluster
- the partition key is no longer the generation's timestamp. Because it
was that way in the old internal table, it forced the algorithm to
choose the timestamp *before* the generation data was inserted into
the table. What if the inserting took a long time? It increased the
chance that nodes would learn about the generation too late (after
their clocks moved past its timestamp). With the new schema we will
first insert the generation data using a randomly generated UUID as
the partition key, *then* choose the timestamp, then gossip both the
timestamp and the UUID.
Observe that after a node learns about a generation broadcasted using
this new method through gossip it will retrieve its data very quickly
since it's one of the replicas and it can use CL=ONE as it was
written using CL=ALL.
The generation's timestamp and the UUID mentioned in the last point form
a "generation identifier" for this new generation. For passing these new
identifiers around, we introduce the cdc::generation_id_v2 type.
Fixes#7961.
---
For optimal review experience it is best to first read the updated design notes (you can read them rendered here: https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md), specifically the ["Generation switching"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#generation-switching) section followed by the ["Internal generation descriptions table V1 and upgrade procedure"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#internal-generation-descriptions-table-v1-and-upgrade-procedure) section, then read the commits in topological order.
dtest gating run (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/1160/
unit tests (dev) passed locally
Closes#8643
* github.com:scylladb/scylla:
docs: update cdc.md with info about the new internal table
sys_dist_ks: don't create old CDC generations table on service initialization
sys_dist_ks: rename all_tables() to ensured_tables()
cdc: when creating new generations, use format v2 if possible
main: pass feature_service to cdc::generation_service
gms: introduce CDC_GENERATIONS_V2 feature
cdc: introduce retrieve_generation_data
test: cdc: include new generations table in permissions test
sys_dist_ks: increase timeout for create_cdc_desc
sys_dist_ks: new table for exchanging CDC generations
tree-wide: introduce cdc::generation_id_v2
This draft extends and obsoletes #8123 by introducing a way of determining the workload type from service level parameters, and then using this context to qualify requests for shedding.
The rough idea is that when the admission queue in the CQL server is hit, it might make more sense to start shedding surplus requests instead of accumulating them on the semaphore. The assumption that interactive workloads are more interested in the success rate of as many requests as possible, and hanging on a semaphore reduces the chances for a request to succeed. Thus, it may make sense to shed some requests to reduce the load on this coordinator and let the existing requests to finish.
It's a draft, because I only performed local guided tests. #8123 was followed by some experiments on a multinode cluster which I want to rerun first.
Closes#8680
* github.com:scylladb/scylla:
test: add a case for conflicting workload types
cql-pytest: add basic tests for service level workload types
docs: describe workload types for service levels
sys_dist_ks: fix redundant parsing in get_service_level
sys_dist_ks: make get_service_level exception-safe
transport: start shedding requests during potential overload
client_state: hook workload type from service levels
cql3: add listing service level workload type
cql3: add persisting service level workload type
qos: add workload_type service level parameter
Some subclasses want to maintain state, which constness needlessly precludes.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#8721
Currently when a node wants to create and broadcast a new CDC generation
it performs the following steps:
1. choose the generation's stream IDs and mapping (how this is done is
irrelevant for the current discussion)
2. choose the generation's timestamp by taking the current time
(according to its local clock) and adding 2 * ring_delay
3. insert the generation's data (mapping and stream IDs) into
system_distributed.cdc_generation_descriptions, using the
generation's timestamp as the partition key (we call this table
the "old internal table" below)
4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP"
application state.
The timestamp spreads epidemically through the gossip protocol. When
nodes see the timestamp, they retrieve the generation data from the
old internal table.
Unfortunately, due to the schema of the old internal table, where
the entire generation data is stored in a single cell, step 3 may fail for
sufficiently large generations (there is a size threshold for which step
3 will always fail - retrying the operation won't help). Also the old
internal table lies in the system_distributed keyspace that uses
SimpleStrategy with replication factor 3, which is also problematic; for
example, when nodes restart, they must reach at least 2 out of these 3
specific replicas in order to retrieve the current generation (we write
and read the generation data with QUORUM, unless we're a single-node
cluster, where we use ONE). Until this happens, a restarting
node can't coordinate writes to CDC-enabled tables. It would be better
if the node could access the last known generation locally.
The commit introduces a new table for broadcasting generation data with
the following properties:
- it uses a better schema that stores the data in multiple rows, each
of manageable size
- it resides in the `system_distributed_everywhere` keyspace so the
data will be written to every node in the cluster that has a token in
the token ring
- the data will be written using CL=ALL and read using CL=ONE; thanks
to this, restarting node won't have to communicate with other nodes
to retrieve the data of the last known generation. Note that writing
with CL=ALL does not reduce availability: creating a new generation
*requires* all nodes to be available anyway, because they must learn
about the generation before their clocks go past the generation's
timestamp; if they don't, partitions won't be mapped to stream IDs
consistently across the cluster
- the partition key is no longer the generation's timestamp. Because it
was that way in the old internal table, it forced the algorithm to
choose the timestamp *before* the generation data was inserted into
the table. What if the inserting took a long time? It increased the
chance that nodes would learn about the generation too late (after
their clocks moved past its timestamp). With the new schema we will
first insert the generation data using a randomly generated UUID as
the partition key, *then* choose the timestamp, then gossip both the
timestamp and the UUID. The timestamp and the UUID form the
"generation identifier" of this new generation; this should explain
why we introduced the generation_id_v2 type in previous commits.
Observe that after a node learns about a generation broadcasted using
this new method through gossip it will retrieve its data very quickly
since it's one of the replicas and it can use CL=ONE as it was
written using CL=ALL.
Note that the node is still using the old method - the actual switch
will be done in a later commit.
`system_distributed_everywhere` is a new keyspace that uses Everywhere
replication strategy. This is useful, for example, when we want to store
internal data that should be accessible by every node; the data can be
written using CL=ALL (e.g. during node operations such as node
bootstrap, which require all nodes to be alive - at least currently) and
then read by each node locally using CL=ONE (e.g. during node restarts).
Closes#8457
Until now, the lists of streams in the `cdc_streams_descriptions` table
for a given generation were stored in a single collection. This solution
has multiple problems when dealing with large clusters (which produce
large lists of streams):
1. large allocations
2. reactor stalls
3. mutations too large to even fit in commitlog segments
This commit changes the schema of the table as described in issue #7993.
The streams are grouped according to token ranges, each token range
being represented by a separate clustering row. Rows are inserted in
reasonably large batches for efficiency.
The table is renamed to enable easy upgrade. On upgrade, the latest CDC
generation's list of streams will be (re-)inserted into the new table.
Yet another table is added: one that contains only the generation
timestamps clustered in a single partition. This makes it easy for CDC
clients to learn about new generations. It also enables an elegant
two-phase insertion procedure of the generation description: first we
insert the streams; only after ensuring that a quorum of replicas
contains them, we insert the timestamp. Thus, if any client observes a
timestamp in the timestamps table (even using a ONE query),
it means that a quorum of replicas must contain the list of streams.
The client_state::check_access() calls for global storage service
to get the features from it and check if the CDC feature is on.
The latter is needed to perform CDC-specific checks.
However it was noticed, that the check for the feature is excessive
as all the guarded if-s will resolve to false in case CDC is off
and the check_access will effectively work as it would with the
feature check.
With that observation, it's possible to ditch one more global storage
service reference.
tests: unit(dev), dtest(dev, auth)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210105063651.7081-1-xemul@scylladb.com>
The previous patch brought the databse reference arg. And since
the currently called validate_column_family() overload _just_
gets the database from global proxy, it's better to shortcut.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is called from cql3/statements' check_access methods and from thrift
handlers. The former have proxy argument from which they can get the
database. The latter already have the database itself on board.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
These alterations cannot break the database irreparably, so allow
them.
Expand command_desc as required.
Add a type (rather than command_desc) parameter to
has_column_family_access() to minimize code changes.
Fixes#7057
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>