-O1 complains that client_state::_remote_addr is not initialized
(and it is right). The call site is tracing, which likely won't be
invoked for internal queries, but still.
Message-Id: <20180401150410.13651-1-avi@scylladb.com>
Add a convenience base class for view notifications, which provides
a default implementation for all other types of notifications.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds support for the nodetool viewbuildstatus command,
which shows the progress of a materialized view build across the
cluster.
A view can be absent from the result, successfully built, or
currently being built.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
At some points while bootstrapping [1], new non-seed Scylla nodes wait
for schema agreement among all known endpoints in the cluster.
The check for schema agreement was in
`service::migration_manager::is_ready_for_bootstrap`. This function
would return `true` if, at the time of its invocation, the node was
aware of at least one `UP` peer (not itself) and that all `UP` peers had
the same schema version as the node.
We wish to re-use this check in the `auth` sub-system to ensure that
the schema for the internal system tables used for access control has
propagated to the entire cluster.
Unlike in `service/storage_service.cc`, where `is_ready_for_bootstrap`
was only invoked for seed nodes, we wish to wait for schema agreement
for all nodes regardless of whether or not they are seeds.
For a single-node cluster with itself as a seed,
`is_ready_for_bootstrap` would always return `false`.
We therefore change the conditions for schema agreement. Schema
agreement is now reached when there are no known peers (so the endpoint
map of the gossiper consists only of ourselves), or when there is at
least one `UP` peer and all `UP` peers have the same schema version as
us.
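In rough pseudo-C++, the new condition amounts to the sketch below (the map
and field names are illustrative, not the actual gossiper API):

    #include <string>
    #include <unordered_map>

    // Illustrative stand-in for the gossiper's view of each peer.
    struct peer_state {
        bool up;
        std::string schema_version;
    };

    bool has_schema_agreement(const std::unordered_map<std::string, peer_state>& peers,
                              const std::string& our_version) {
        if (peers.empty()) {
            return true;              // only ourselves in the endpoint map
        }
        bool saw_up_peer = false;
        for (const auto& [endpoint, state] : peers) {
            if (!state.up) {
                continue;             // DOWN peers are ignored
            }
            saw_up_peer = true;
            if (state.schema_version != our_version) {
                return false;         // an UP peer disagrees with us
            }
        }
        return saw_up_peer;           // otherwise require at least one UP peer
    }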
This change should not impact any bootstrap behavior in
`storage_service` because seed nodes do not invoke the function and
non-seed nodes wait for peer visibility before checking for schema
agreement.
Since this function is no longer checking for schema agreement only in
the context of bootstrapping non-seed nodes, we rename it to reflect its
generality.
[1] http://thelastpickle.com/blog/2017/05/23/auto-bootstrapping-part1.html
"
Since f8613a8415 we have reader-caching
on replicas for single-partition queries. This caching works best when
all pages of a query are sent to the same replicas consistently and thus
they can reuse the cached readers there.
The probability-based nature of read-repair works against this, as on any
given page a read-repair will be attempted or not based on probability.
This will cause high drop-rates on the replicas used for read-repair, as
the cached reader will not be reusable if the replica was skipped for
one or more pages.
To fix this make the repair-decision once, on the first page of the
query and store the decision in the paging-state. On all remaining
pages of the query use this stored decision.
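Roughly, the decision logic becomes the following sketch (the enum and
parameter names are illustrative, not the exact Scylla types):

    #include <optional>
    #include <random>

    enum class read_repair_decision { none, global };

    // First page: roll the dice once. Later pages: reuse the decision that
    // was stored in the paging-state.
    read_repair_decision decide_read_repair(std::optional<read_repair_decision> stored,
                                            double read_repair_chance) {
        if (stored) {
            return *stored;
        }
        static thread_local std::mt19937 rng{std::random_device{}()};
        return std::uniform_real_distribution<double>(0.0, 1.0)(rng) < read_repair_chance
                ? read_repair_decision::global
                : read_repair_decision::none;
    }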
Tests: unit-tests(release, debug), dtest(paging_advanced_tests.py)
Refs: #1865
"
* 'per_query_repair_decision/v2' of https://github.com/denesb/scylla:
Make the read-repair decision only once
storage_proxy: add coordinator_query_options and coordinator_query_result
Add query_read_repair_decision to paging-state
"
These patches add support for C* 2.2 file(name) format.
Namely:
* It forces Scylla to write files in la format.
* Adds storage-service feature for them.
* cf and ks are determined from directory, not from file-name (for 2.2 format).
* Adds some other fixes to make dtest happy.
* Unit tests work with la format or with both formats.
"
* 'danfiala/filename-format-2.2-v4' of https://github.com/hagrid-the-developer/scylla:
tests/sstables: Tests use la format or iterate over both formats.
tests/sstables: Helper functions support 2.2 format directory structure.
sstables: Use 2.2 (la) format as a default format to store sstables if it is enabled by feature-bits.
storage_service: Support la sstable storage format as a feature.
sstables: make_descriptor accepts sstable-directory, because it is necessary to determine cf and ks in 2.2 format.
sstables: Throw a more detailed exception for an unknown item in reverse_map.
sstables/compaction: Suppress NaN in a report of a throughput.
Make the read-repair decision on the first page of a paged-query and use
it for all the remaining pages. This helps querier-cache hit-rates as
reads to nodes will be sent consistently throughout the query.
As yet more parameters and return-values are about to be added to all
storage_proxy::query_* methods we need a way that scales better than
changing the signatures every time. To this end we aggregate all
non-mandatory query parameters into `coordinator_query_options` and all
return values into `coordinator_query_result`.
This way new fields can be simply added to the respective structs while
the signatures of the methods themselves and their client code can
remain unchanged.
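The shape of the change is roughly the following (the members shown are
examples only):

    #include <chrono>
    #include <optional>
    #include <vector>

    using replica_id = unsigned;   // stand-in for however replicas are identified

    // All non-mandatory inputs are grouped here...
    struct coordinator_query_options {
        std::optional<std::chrono::milliseconds> timeout;
        std::vector<replica_id> preferred_replicas;
    };

    // ...and all extra outputs here, so a new field touches neither the
    // signatures of the query_*() methods nor their callers.
    struct coordinator_query_result {
        std::vector<replica_id> last_replicas;
    };

    coordinator_query_result query_singular(/* mandatory parameters, */
                                            coordinator_query_options options = {}) {
        coordinator_query_result result;
        // ... perform the read, honouring options.preferred_replicas ...
        return result;
    }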
This new field will store the repair-decision made on the first page of
the query. This decision will be sticky to all pages of the query.
In mixed clusters the decision might not happen on the first page and it
might even change during the query, as old coordinators will neither store
nor respect the decision.
After the shadow round and the feature checking, we remove from the state
any endpoints that contacted us, before re-adding them. This is because
the nodes that replied would have been marked as alive in the endpoint
state map (but not fully: they'd be absent from the live endpoints list),
and re-adding them marks them as dead.
If the shadow round failed, after doing the feature checking against
the system tables, we were not clearing the state map and re-adding
the endpoints. This left the alive marker set, and prevented
real_mark_alive() from eventually being called.
Fixes #3301
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch came about because of an important (and obvious, in
hindsight) realization: instances of the authorizer, role manager, and
authenticator are clients for access-control state and not the state
itself. This is reflected directly in Scylla: `auth::service` is
sharded across cores and this is possible because each instance queries
and modifies the same global state.
To give more examples, the value of an instance of `std::vector<int>` is
the structure of the container and its contents. The value of `int
file_descriptor` is an identifier for state maintained elsewhere.
Having watched an excellent talk by Herb Sutter [1] and having read an
informative blog post [2], it's clear that a member function marked
`const` communicates that the observable state of the instance is not
modified.
Thus, a member function of the role-manager, authenticator, and authorizer
clients should be left non-`const` only if it observably changes the state
of the client itself. By this principle, member functions which do not
change the state of the client, but which mutate the global state the
client is associated with (for example, by creating a role),
are marked `const`.
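A toy illustration of the principle (the class below is hypothetical, not
the actual Scylla interface):

    #include <set>
    #include <string>

    // The shared, external access-control state (in reality, system tables).
    struct metadata_store {
        std::set<std::string> roles;
    };

    class role_manager_client {
        metadata_store& _store;   // the client merely refers to external state
        std::string _name;        // local, observable client state
    public:
        role_manager_client(metadata_store& store, std::string name)
            : _store(store), _name(std::move(name)) {}

        // Mutates only the external state; the client's observable value is
        // unchanged, so the function is marked const.
        void create_role(const std::string& role) const {
            _store.roles.insert(role);
        }

        // Changes the client's own observable state, so it is not const.
        void rename(std::string new_name) {
            _name = std::move(new_name);
        }
    };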
The `start` (and `stop`) functions of the client have the dual role of
initializing (finalizing) both the local client state and the
external state; they are not marked `const`.
[1] https://herbsutter.com/2013/01/01/video-you-dont-know-const-and-mutable/
[2] http://talesofcpp.fusionfenix.com/post-2/episode-one-to-be-or-not-to-be-const
"
Terms
-----
querier: A class encapsulating all the logic and state needed to fill a
page. This includes the reader, the compact_mutation object, and all
associated state.
Preamble
--------
Currently for paged-queries we throw away all readers, compactors and
all associated state that contributed to filling the page and on the
next page we create them from scratch again. Thus on each page we throw
away a considerable amount of work, only to redo it again on the next
page. This has been one of the major contributors to latencies as from
the point of view of a replica each page is as much work as a fresh
query.
Solution
--------
The solution presented in this patch-series is to save queriers after
filling a page and reuse them on the next pages, thus doing the
considerable amount of work involved in creating them only once.
On each page the coordinator will generate a UUID that identifies this
page. This UUID is used as the key, under which the contributing
queriers will be saved in the cache. On the next page, the UUID from the
previous page will be used to look up saved queriers, and the UUID of the
current page will be used to save them afterwards (if the query isn't
finished).
These UUIDs (reader_recall_uuid and reader_save_uuid) are attached to
the page-state. Also attached to the page state is the list of replicas
hit on the last page. On the next page this list will be consulted to
hit the same replicas again, thus reusing the queriers saved on them.
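Conceptually the replica-side cache behaves like the sketch below (types
are simplified; the real cache is per-shard and also tracks resource
usage):

    #include <memory>
    #include <string>
    #include <unordered_map>

    struct querier { /* reader, compaction state, ... */ };
    using uuid = std::string;   // stand-in for utils::UUID

    class querier_cache {
        std::unordered_map<uuid, std::unique_ptr<querier>> _entries;
    public:
        // Called at the start of a page with the reader_recall_uuid from the
        // paging-state; a miss simply means a fresh querier is created.
        std::unique_ptr<querier> lookup(const uuid& key) {
            auto it = _entries.find(key);
            if (it == _entries.end()) {
                return nullptr;
            }
            auto q = std::move(it->second);
            _entries.erase(it);
            return q;
        }

        // Called at the end of a page with the reader_save_uuid, unless the
        // query has finished.
        void save(uuid key, std::unique_ptr<querier> q) {
            _entries.emplace(std::move(key), std::move(q));
        }
    };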
Cached queriers will be evicted after a certain period of time to avoid
unnecessary resource consumption by abandoned reads.
Cached queriers may also be evicted when the shard faces
resource-pressure, to free up resources.
Splitting up the work
---------------------
This series only fixes the singular-mutation query path, that is, queries
that fetch either a single partition or several single partitions (IN
queries). The fix for the scanning query path will be done in a
follow-up series, however much of the infrastructure needed for the
general querier reuse is already introduced by this series.
Ref #1865
Tests: unit-tests(debug, release), dtests(paging_test, paging_additional_test)
Benchmarking summary (read-from-disk)
-------------------------------------
1) Latency
BEFORE
latency mean : 58.0
latency median : 57.4
latency 95th percentile : 68.8
latency 99th percentile : 79.9
latency 99.9th percentile : 93.6
latency max : 93.6
AFTER
latency mean : 41.3
latency median : 40.5
latency 95th percentile : 50.8
latency 99th percentile : 68.9
latency 99.9th percentile : 89.2
latency max : 89.2
2) Throughput (single partition query)
sum(scylla_cql_reads):
BEFORE: 173'567
AFTER: 427'774
+246%
3) Throughput (IN query, 2 partitions)
sum(scylla_cql_reads):
BEFORE: 85'637
AFTER: 127'431
+148%
"
* '1865/singular-mutations/v8.2' of https://github.com/denesb/scylla: (23 commits)
Add unit test for resource based cache eviction
Add unit tests for querier_cache
Add counters to monitor querier-cache efficiency
Memory based cache eviction
Add buffer_size() to flat_mutation_reader
Resource-based cache eviction
Time-based cache eviction
Save and restore queriers in mutation_query() and data_query()
Add the querier_cache_context helper
Add querier_cache
Add querier
Add are_limits_reached() compact_mutation_state
Add start_new_page() to compact_mutation_state
Save last key of the page and method to query it
Make compact_mutation reusable
Add the CompactedFragmentsConsumer
Use the last_replicas stored in the page_state
query_singular(): return the used replicas
Consider preferred replicas when choosing endpoints for query_singular()
Add preferred and last replicas to the signature of query()
...
Pass the last_replicas from the page_state as the preferred_replicas
for query() and save the returned last_replicas as the last_replicas
field of the next page_state. The circle is now complete. The first page
of any query will pass an empty list as the preferred replicas (having
no previous paging_state) so the replicas will be selected according to
the load-balancing strategy. Any subsequent page will use the last
replicas from the last page as the preferred ones for the current one.
Thus if all goes well all pages of a query will hit the same replicas.
This patch implements the last_replicas returning part of the query()
signature changes for singular queries. It allows for client code to
save the last returned replicas and pass them to query() on the next page
as the preferred-replicas parameter, thus helping the read requests for
the next page hit the same replicas.
Propagate the preferred_replicas to db::filter_for_query() and consider
them when selecting the endpoints. The algorithm for selecting the
endpoints is as follows (a sketch follows the list):
* Compute the intersection of the endpoint candidates and the
preferred endpoints.
* If this yields a set of endpoints that already satisfies the CL
requirements use this set.
* Otherwise select the remaining endpoints according to the
load-balancing strategy, just like before.
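A sketch of this selection, with plain std::vectors standing in for the
real endpoint types:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    using endpoint = unsigned;

    std::vector<endpoint> select_endpoints(const std::vector<endpoint>& candidates,
                                           const std::vector<endpoint>& preferred,
                                           std::size_t needed_for_cl) {
        std::vector<endpoint> selected;
        // 1. Start with the candidates that are also preferred, i.e. that
        //    served the previous page and may still hold a cached reader.
        for (endpoint e : candidates) {
            if (std::find(preferred.begin(), preferred.end(), e) != preferred.end()) {
                selected.push_back(e);
            }
        }
        // 2. If that alone satisfies the consistency level, use it.
        if (selected.size() >= needed_for_cl) {
            selected.resize(needed_for_cl);
            return selected;
        }
        // 3. Otherwise top up from the remaining candidates, in the order
        //    the load-balancing strategy produced them.
        for (endpoint e : candidates) {
            if (selected.size() >= needed_for_cl) {
                break;
            }
            if (std::find(selected.begin(), selected.end(), e) == selected.end()) {
                selected.push_back(e);
            }
        }
        return selected;
    }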
preferred_replicas are added to the parameters and last_replicas are
added to the return type. The preferred replicas will be used as a hint
for the selection of the replicas to send the read requests to. The last
replicas (returned) are the replicas actually selected for the read.
This will allow queries to consistently hit the same replicas for each
page thus reusing readers created on these replicas.
For convenience a query() overload is provided that doesn't take or
return the preferred and last replicas.
This patch only adds the parameters and propagates them down to
query_singular() and query_partition_key_range(). The code to actually
use these preferred-replicas will be added in later patches.
The reason for separating this out is to reduce noise and improve the
reviewability of those functional changes later.
Helps paged queries consistently hit the same replicas for each
subsequent page. Replicas that already served a page will keep the
readers used for filling it around in a cache. Subsequent page requests
hitting the same replicas can reuse these readers to fill the pages
avoiding the work of creating these readers from scratch on every page.
In a mixed cluster older coordinators will ignore this value.
The value of last_replicas may change between pages as nodes may become
available/unavailable or the coordinator may decide to send the read
requests to different replicas at its discretion.
Replicas are identified by an opaque uuid which should only make sense
to the storage-proxy.
This patch adds the parameter to read_command which is needed for
caching of readers across the multiple pages of a paged query, which
we will introduce in the next patches.
The query_uuid is a UUID of a previously saved reader, which
the replica is now asked to recall and resume (if this saved reader is
no longer in the cache, that is fine; a new reader will be started).
Additionally a helper flag is_first_page is added so that the replica
can avoid doing any cache lookups (and incrementing miss counters) for
the first page.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds to the "paging_state", the opaque cookie that clients are
supposed to provide when asking for the next page on a paged query, a
unique id field. This new field will be used to tell that a new request
for a page really continues the previous page, and doesn't just by chance
start at the same position where the previous page stopped.
We need to support setups with mixed versions - a client may get a paging
state from a coordinator running a new version of Scylla and send it to
a different coordinator running an old version - or vice versa. The new
uuid field therefore defaults to UUID() (a recognizable invalid all-zero
uuid): new versions receiving no uuid from an old version will use this
invalid uuid, and old versions receiving a uuid from a new version will
simply ignore it.
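A sketch of the compatibility rule (the uuid type is simplified here):

    #include <cstdint>
    #include <optional>

    struct uuid {
        std::uint64_t msb = 0;
        std::uint64_t lsb = 0;   // default-constructed uuid{} is the invalid all-zero uuid
    };

    inline bool is_valid(const uuid& u) {
        return u.msb != 0 || u.lsb != 0;
    }

    // Deserializing a paging state sent by an old coordinator yields no uuid
    // field at all, so it falls back to the invalid uuid; the coordinator then
    // knows it cannot tell whether this request continues the previous page.
    inline uuid read_query_uuid(const std::optional<uuid>& serialized_field) {
        return serialized_field.value_or(uuid{});
    }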
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This change allows for seamless migration of the legacy users metadata
to the new role-based metadata tables. This process is summarized in
`docs/migrating-from-users-to-roles.md`.
In general, if any nondefault metadata exists in the new tables, then
no migration happens. If, in this case, legacy metadata still exists
then a warning is written to the log.
If no nondefault metadata exists in the new tables and the legacy tables
exist, then each node will copy the data from the legacy tables to the
new tables, performing transformations as necessary. An informational
message is written to the log when the migration process starts, and
when the process ends. During the process of copying, data is
overwritten so that multiple nodes racing to migrate data do not
conflict.
Since Apache Cassandra's auth. schema uses the same table for managing
roles and authentication information, some useful functions in
`roles-metadata.hh` have been added to avoid code duplication.
Because a superuser should be able to drop the legacy users tables from
`system_auth` once the cluster has migrated to roles and is functioning
correctly, we remove the restriction on altering anything in the
"system_auth" keyspace. Individual tables in `system_auth` are still
protected later in the function.
When a cluster is upgrading from one that does not support roles to one
that does, some nodes will be running old code which accesses old
metadata and some will be running new code which access new metadata.
With the help of the gossiper `feature` mechanism, clients connecting to
upgraded nodes will be notified (through code in the relevant CQL
statements) that modifications are not allowed until the entire cluster
has upgraded.
auth: Decouple authorization and role management
Access control in Scylla consists of three main modules: authentication,
authorization, and role-management.
Each of these modules is intended to be interchangeable with alternative
implementations. The `auth::service` class composes these modules
together to perform all access-control functionality, including caching.
This architecture implies two main properties of the individual
access-control modules:
- Independence of modules. An implementation of authentication should
have no dependence or knowledge of authorization or role-management,
for example.
- Simplicity of implementing the interface. Functionality that is common
to all implementations should not have to be duplicated in each
implementation. The abstract interface for a module should capture
only the differences between particular implementations.
Previously, the authorization interface depended on an instance of
`auth::service` for certain operations, since it required aggregation
over all the roles granted to a particular role or required checking if
a given role had superuser.
This change decouples authorization entirely from role-management: the
authorizer now manages only permissions granted directly to a role, and
not those inherited through other roles.
When a query needs to be authorized, `auth::service::get_permissions`
first uses the role manager to check if the role has superuser. Then, it
aggregates calls to `auth::authorizer::authorize` for each role granted
to the role (again, from the role-manager) to determine the sum-total
permission set. This information is cached for future queries.
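Ignoring futures and caching details, the flow is roughly the following
(the free functions are stubs standing in for the role-manager and
authorizer calls described above):

    #include <set>
    #include <string>

    using permission_set = std::set<std::string>;

    // Stubs standing in for the role-manager and authorizer.
    bool is_superuser(const std::string&) { return false; }
    std::set<std::string> all_granted_roles(const std::string& role) { return {role}; }
    permission_set authorize_one(const std::string&) { return {"SELECT"}; }

    permission_set get_permissions(const std::string& role) {
        if (is_superuser(role)) {
            return {"ALL"};                                     // superusers short-circuit
        }
        permission_set total;
        for (const auto& granted : all_granted_roles(role)) {   // role + inherited roles
            auto p = authorize_one(granted);                    // directly-granted permissions
            total.insert(p.begin(), p.end());                   // union across all roles
        }
        return total;                                           // the result is then cached
    }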
This structure allows for easier error handling and
management (something I hope to improve in the future for both the
authorizer and authenticator interfaces), easier system testing, easier
implementation of the abstract interfaces, and clearer system
boundaries (so the code is easier to grok).
Some authorizers, like the "TransitionalAuthorizer", grant permissions
to anonymous users. Therefore, we could not unconditionally authorize an
empty permission set in `auth::service` for anonymous users. To account
for this, the interface of the authorizer has changed to accept an
optional name in `authorize`.
One additional notable change to the authorizer is the
`auth::authorizer::list`: previously, the filtering happened at the CQL
query layer and depended on the roles granted to the role in question.
I've changed the function to simply query for all roles and I do the
filtering in `auth::system` in-memory with the STL. This was necessary
to allow the authorizer to be decoupled from role-management. This
function is only called for LIST PERMISSIONS (so performance is not a
concern), and it significantly reduces demand on the implementation.
Finally, we unconditionally create a user in `cql_test_env` since
authorization requires its existence.
Previously, a "data" auth. resource knew how to check it's own existence by
accessing a global variable.
This patch accomplishes two things: it adds existence checking to all
kinds of resources, and moves these checks outside of `auth::resource`
itself and into `auth::service` (so that global variables are no longer
accessed).
This has the dual benefit of not enforcing copying on implementations of
the abstract interface and also limiting unnecessary copies.
As usual with Seastar, we follow the convention that a reference
parameter to a function is assumed valid for the duration of the
`future` that is returned. `do_with` helps here.
By adding some constants for root resources, we can avoid using
`seastar::do_with` at some call-sites involving `resource` instances.
The motivation behind this change is the idea that constructing a new
instance of an object is the job of the constructor.
One big benefit of this structure (with the addition of helpers for
convenience) is that calls for emplacing instances (like
`std::make_shared`, or `std::vector::emplace_back`) work without any
difficulty. This would not be true for static construction functions.
The most important change is replacing `auth::authenticated_user::name`
with a public `std::optional<sstring>` member. Anonymous users have no
name. This replaces the insecure and bug-prone special-string of
"anonymous" for anonymous users, which does unfortunate things with the
authorizer.
The new `auth::is_anonymous` function exists for convenience since
checking the absence of a `std::optional` value can be tedious.
When a caller really wants a name unconditionally, a new stream output
function is also available.
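The shape of the new API is roughly the following (simplified, with
std::string instead of sstring):

    #include <optional>
    #include <ostream>
    #include <string>

    struct authenticated_user {
        std::optional<std::string> name;   // disengaged for anonymous users
    };

    inline bool is_anonymous(const authenticated_user& u) {
        return !u.name.has_value();
    }

    // For callers that unconditionally want something printable.
    inline std::ostream& operator<<(std::ostream& os, const authenticated_user& u) {
        return os << (is_anonymous(u) ? "anonymous" : *u.name);
    }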
This is a large change, but it's a necessary evil.
This change brings us to a minimally-functional implementation of roles.
There are many additional changes that are necessary, including refined
grammar, bug fixes, code hygiene, and internal code structure changes.
In the interest of keeping this patch somewhat read-able, those changes
will come in subsequent patches. Until that time, roles are still marked
"unimplemented".
IMPORTANT: This code does not include any mechanism for transitioning a
cluster from user-based access-control to role-based access control. All
existing access-control metadata will be ignored (though not deleted).
Specific changes:
- All user-specific CQL statements now delegate to their roles
equivalent. The statements are effectively the same, but CREATE USER
will include LOGIN automatically. Also, LIST USERS only lists roles
with LOGIN.
- A call to LIST PERMISSIONS will now also list permissions of roles
that have been granted to the caller, in addition to permissions which
have been granted directly.
- Much of the logic of creating, altering, and deleting roles has been
moved to `auth::service`, since these operations require cooperation
between the authenticator, authorizer, and role-manager.
- LIST USERS actually works as expected now (fixes #2968).
When the cluster is large or num_tokens is big, calculate_pending_ranges
can take a long time to complete. It currently runs in the gossip thread,
so it can block gossip processing. Another problem is that it runs in a
plain for loop and can cause a reactor stall.
Users see this stall during decommission operations.
I can reproduce a stall of up to 4 seconds in a two-node cluster, each node
with `--num-tokens 3072`, during decommission.
Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout
Fixes #3203
* tag 'asias/issue_3203_v2.1' of github.com:scylladb/seastar-dev:
storage_service: Do not wait for update_pending_ranges in handle_state_leaving
token_metadata: Handle affected_ranges with do_for_each
token_metadata: Split token_metadata::calculate_pending_ranges
token_metadata: Futurize calculate_pending_ranges
storage_service: Futurize storage_service::do_update_pending_ranges
token_metadata: Speed up token_metadata::get_endpoint
The call chain is:
storage_service::on_change() -> storage_service::handle_state_leaving()
-> storage_service::update_pending_ranges()
Listeners run as part of gossip message processing, which is
serialized. This means we won't be processing any gossip messages until
update_pending_ranges completes. update_pending_ranges takes time to
complete.
Since we no longer wait for update_pending_ranges to complete, multiple
update_pending_ranges operations could run at the same time, so we use
serialized_action to serialize them.
Tested with update_cluster_layout_tests.py
Now, do_update_pending_ranges is futurized. We can finally futurize
token_metadata::calculate_pending_ranges in order to convert the loops
inside it to do_for_each instead of plain for loops, to avoid reactor
stalls.
Preparation work for futurizing the time-consuming
token_metadata::calculate_pending_ranges.
In addition, we use do_for_each for the loop. It is better than the
plain for loop because the reactor can yield to avoid stalls when
there are tons of keyspaces.
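The transformation looks roughly like this (header and namespace placement
vary between Seastar versions; `keyspace` and the per-element work are
illustrative):

    #include <seastar/core/future.hh>
    #include <seastar/core/future-util.hh>
    #include <vector>

    struct keyspace { /* ... */ };

    // Before: a plain loop that never yields, stalling the reactor when
    // there are many keyspaces:
    //
    //     for (auto& ks : keyspaces) {
    //         process(ks);
    //     }

    // After: do_for_each lets the reactor run between iterations.
    seastar::future<> process_all(std::vector<keyspace>& keyspaces) {
        return seastar::do_for_each(keyspaces, [] (keyspace& ks) {
            // process(ks);   // per-keyspace work goes here
            return seastar::make_ready_future<>();
        });
    }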
Set the option that enables the underlying memtable and cache readers
to request caching of a cell's hash, for requests that require a
digest.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We add a cluster feature that informs whether the xxHash algorithm is
supported, and allow nodes to switch to it. We use a cluster feature
because older versions are not ready to receive a different digest
algorithm than MD5 when answering a data request.
If we ever should add a new hash algorithm, we would also need to
add a new cluster feature for that algorithm. The alternative would be
to add code so a coordinator could negotiate what digest algorithm to
use with the set of replicas it is contacting.
Fixes #2884
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
While not strictly needed, specify which algorithm to use when requesting
a digest from a remote node. This is more flexible than relying on a
cluster wide feature, although that's what we'll do in subsequent
patches. It also makes the verb more consistent with the data request.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>