Timeouts are a global property. However, for tables in keyspaces like
the system keyspace, we don't want to uphold that timeout--in fact, we
want no timeout there at all.
We already apply such a configuration to requests waiting in the
sstable queue: system keyspace requests won't be removed. However, the
storage proxy will insert its own timeouts in those requests, causing
them to fail.
This patch changes the storage proxy read layer so that the timeout is
applied based on the column family configuration, which is in turn
inherited from the keyspace configuration. This matches our usual
way of passing db parameters down.
In terms of implementation, we can either move the timeout inside the
abstract read executor or keep it external. The former is a bit cleaner,
but the latter has the nice property that all executors generated will
share the exact same timeout point. In this patch, we chose the latter.
We are also careful to propagate the timeout information to the replica.
So even if we are talking about the local replica, when we add the
request to the concurrency queue, we will do it in accordance with the
timeout specified by the storage proxy layer.
After this patch, Scylla is able to start just fine with very low
timeouts--since read timeouts in the system keyspace are now ignored.
Fixes #2462
Implementation notes, and general comments about open discussion in 2462:
* Because we are not bypassing the timeout, just setting it high enough,
I consider the concerns about the batchlog moot: if we fail for any
other reason, that will be propagated. As a last resort, because the timeout
is per-CF, we could do what we do for the dirty memory manager and
move the batchlog alone to use a different timeout setting.
* Storage proxy likes specifying its timeouts as a time_point, whereas
when we get low enough to deal with the read_concurrency_config,
we are talking about deltas. So at some point we need to convert time_points
to durations. We do that in the database query functions.
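To illustrate the time_point-to-duration conversion mentioned in the last note, here is a minimal sketch using only std::chrono. The names `remaining_timeout` and `no_timeout` are hypothetical and are not taken from the patch; the real conversion lives in the database query functions and may differ.

```cpp
#include <cassert>
#include <chrono>

using clock_type = std::chrono::steady_clock;

// Hypothetical sentinel: a deadline of time_point::max() means "never time out".
inline constexpr clock_type::time_point no_timeout = clock_type::time_point::max();

// Convert an absolute deadline into the delta that is left as of `now`.
// A deadline that has already passed yields zero rather than a negative delta.
inline clock_type::duration remaining_timeout(clock_type::time_point deadline,
                                              clock_type::time_point now) {
    if (deadline == no_timeout) {
        return clock_type::duration::max();
    }
    if (deadline <= now) {
        return clock_type::duration::zero();
    }
    return deadline - now;
}

// Usage: a deadline 5s ahead leaves exactly 5s of budget.
inline void usage_example() {
    const auto now = clock_type::now();
    assert(remaining_timeout(now + std::chrono::seconds(5), now) == std::chrono::seconds(5));
    assert(remaining_timeout(no_timeout, now) == clock_type::duration::max());
}
```

Passing `now` explicitly keeps the conversion deterministic, so every consumer of the same deadline can compute its delta against the same reference point.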
v2:
- use per-request instead of per-table timeouts.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"This series revives the round-robin load balancing added by Pekka back in 2015.
If somebody tries to enable it with the current master it would quite quickly
lead to a crash due to a few unresolved issues in the corresponding code.
Fixes #2351
Fixes #3118"
* 'fix-round-robin-balancing-v2' of github.com:vladzcloudius/scylla:
transport::server::process_request(): avoid extra copy of the client_state
service::cql_server::connection::process_request: use client_state "request copy" constructor
service::client_state: introduce "request copy" copy-constructor
service::storage_service: add the get_local_auth_service() accessor
service::client_state: remove the unused _tracing_session_id field
A new constructor creates a copy of the current client_state to be
used in the context of the handling of a single request.
The copy may take place at a shard different from the one where the
request has been received.
In order to ensure the monotonicity of the timestamps used by requests
handled on the same connection, the created copy of the client_state uses
the timestamp provided by the caller instead of generating one itself.
It's the caller's responsibility to ensure the monotonicity of given timestamps.
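A minimal sketch of the idea, assuming a heavily simplified client_state: the class layout, the `request_copy_tag` type and the `next_timestamp()` helper are hypothetical and only illustrate how a caller-provided timestamp keeps per-connection ordering.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, heavily simplified stand-in for service::client_state.
class client_state {
    std::string _keyspace;
    std::int64_t _last_timestamp_micros = 0;
public:
    struct request_copy_tag {};

    client_state() = default;

    // "Request copy": per-connection state is copied, but the timestamp is
    // the one handed in by the caller rather than a freshly generated one.
    client_state(request_copy_tag, const client_state& orig, std::int64_t timestamp_micros)
        : _keyspace(orig._keyspace)
        , _last_timestamp_micros(timestamp_micros) {}

    // Connection-side generator: strictly increasing even if the clock stalls.
    std::int64_t next_timestamp(std::int64_t now_micros) {
        _last_timestamp_micros = now_micros > _last_timestamp_micros
                                     ? now_micros
                                     : _last_timestamp_micros + 1;
        return _last_timestamp_micros;
    }

    std::int64_t timestamp() const { return _last_timestamp_micros; }
};

// Usage: two requests on one connection get increasing timestamps even
// though the (simulated) clock reads the same value both times.
inline void usage_example() {
    client_state connection;
    const auto t1 = connection.next_timestamp(1000);
    const auto t2 = connection.next_timestamp(1000);
    client_state first(client_state::request_copy_tag{}, connection, t1);
    client_state second(client_state::request_copy_tag{}, connection, t2);
    assert(first.timestamp() < second.timestamp());
}
```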
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Technically all that matters is the proportion among the shares so this
change is functionally a noop. However, the CPU scheduler being proposed
has shares that go all the way up to 1000. In the hopes of being able to
unify I/O and CPU controllers one day, this patch brings the I/O shares
more in line with what Avi is doing for the CPU scheduler.
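As a quick check of the "only the proportion matters" claim, the sketch below compares each class's fraction of the total before and after scaling all shares by the same factor. The share values are made up for the example and are not the ones used by the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Fraction of capacity a class is entitled to, kept exact as num/den.
struct fraction {
    std::uint64_t num;
    std::uint64_t den;
};

inline fraction allotment(const std::vector<std::uint64_t>& shares, std::size_t idx) {
    assert(idx < shares.size());
    const std::uint64_t total = std::accumulate(shares.begin(), shares.end(), std::uint64_t{0});
    assert(total > 0);
    return {shares[idx], total};
}

// Compare two fractions by cross-multiplication to avoid rounding.
inline bool same_fraction(fraction a, fraction b) {
    return a.num * b.den == b.num * a.den;
}

// True if every class keeps the same fraction under both share assignments.
inline bool same_proportions(const std::vector<std::uint64_t>& before,
                             const std::vector<std::uint64_t>& after) {
    if (before.size() != after.size()) {
        return false;
    }
    for (std::size_t i = 0; i < before.size(); ++i) {
        if (!same_fraction(allotment(before, i), allotment(after, i))) {
            return false;
        }
    }
    return true;
}
```

For instance, shares {1, 2, 5} and {100, 200, 500} give every class the same fraction, while {1, 2, 5} and {100, 200, 1000} do not.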
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In commit 1f4f71e619, an
stdx::optional<std::vector<sstring>> parameter was added to storage_proxy's
constructor. However, this parameter was not made optional, and
tests/cql_test_env.cc failed to compile because it didn't provide this
parameter.
This patch makes this parameter optional (if missing, it's like an empty
stdx::optional) so cql_test_env.cc compiles.
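The fix boils down to giving the parameter a default value. A minimal sketch, assuming `std::optional` in place of `stdx::optional` and a hypothetical `proxy_sketch` class standing in for storage_proxy (the parameter name `extra` is also made up):

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for storage_proxy; only the defaulted parameter matters.
class proxy_sketch {
    std::optional<std::vector<std::string>> _extra;
public:
    // Callers that omit the argument get a disengaged optional.
    explicit proxy_sketch(std::optional<std::vector<std::string>> extra = std::nullopt)
        : _extra(std::move(extra)) {}

    bool has_extra() const { return _extra.has_value(); }
};

inline void usage_example() {
    proxy_sketch from_test_env;                                   // argument omitted
    proxy_sketch from_main(std::vector<std::string>{"a", "b"});   // argument supplied
    assert(!from_test_env.has_extra());
    assert(from_main.has_extra());
}
```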
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20171218132121.18782-1-nyh@scylladb.com>
Add a widely used method that returns TRUE if a given address is a broadcast
address of the local node.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
"This patch series refines the security model for the upcoming switch to
roles-based access control. Roles still do not have any function,
but CQL statements related to roles manipulate metadata. The next major
patch series after this one will switch the system to roles.
Previously, most operations around roles required superuser, but this
violates an important idea in security called the "principle of least
privilege": that a user should have only the minimum access possible to
resources in order to achieve their objective.
To that end, this patch series introduces permissions on role resources.
For example, to grant a role to a user, the performing user must have
been granted AUTHORIZE on the role being granted.
In the table below, a user (role) that has been granted the permission
in the left-most column can perform the CQL query in the right columns
depending on if the permission has been granted to the root role
resource (all roles), or a particular role resource.
Perm.      All roles              Specific role (r)
---------------------------------------------------------
CREATE     CREATE ROLE
ALTER      ALTER ROLE *           ALTER ROLE r
DROP       DROP ROLE *            DROP ROLE r
AUTHORIZE  GRANT ROLE */          GRANT ROLE r/
           REVOKE ROLE *          REVOKE ROLE r
DESCRIBE   LIST ROLES
The following restrictions around superuser exist:
- CREATE ROLE: Only a superuser can create a superuser role.
- ALTER ROLE: Only a superuser can alter the superuser status of a role.
- ALTER ROLE: You cannot alter the superuser status of yourself or of a
role granted to you.
- DROP ROLE: Only a superuser can drop a role that has superuser.
The following additional "escape hatches" apply:
- ALTER ROLE: You can alter yourself (except to give yourself
superuser).
- LIST ROLES: You can list your own roles and list the roles of any role
granted to you.
Finally, a note on terminology: I like to say a role (or user) "is"
superuser if the role (user) has directly been marked as a superuser. A
role (user) "has" superuser if they have been granted a role that is a
superuser. The second statement encompasses the first, since a role can
always be said to have been granted to itself.
Fixes #2988."
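The "is" versus "has" superuser terminology can be sketched as follows. This is an illustration only: the in-memory `role_map` is hypothetical, and the sketch assumes that grants are followed transitively.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory role model, for illustration only.
struct role_info {
    bool is_superuser = false;          // directly marked as superuser
    std::vector<std::string> granted;   // roles granted to this role
};

using role_map = std::map<std::string, role_info>;

// "is" superuser: the role itself carries the flag.
inline bool is_superuser(const role_map& roles, const std::string& name) {
    auto it = roles.find(name);
    return it != roles.end() && it->second.is_superuser;
}

// "has" superuser: the role itself, or any role reachable through grants,
// "is" superuser. Assumes grants are transitive.
inline bool has_superuser(const role_map& roles, const std::string& name) {
    std::set<std::string> seen;
    std::vector<std::string> pending{name};
    while (!pending.empty()) {
        const std::string current = pending.back();
        pending.pop_back();
        if (!seen.insert(current).second) {
            continue; // already visited, guards against grant cycles
        }
        auto it = roles.find(current);
        if (it == roles.end()) {
            continue;
        }
        if (it->second.is_superuser) {
            return true;
        }
        pending.insert(pending.end(), it->second.granted.begin(), it->second.granted.end());
    }
    return false;
}

inline void usage_example() {
    role_map roles;
    roles["admin"].is_superuser = true;
    roles["alice"].granted.push_back("admin");
    assert(!is_superuser(roles, "alice") && has_superuser(roles, "alice"));
    assert(is_superuser(roles, "admin") && has_superuser(roles, "admin"));
}
```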
* 'jhk/role_permissions/v2' of https://github.com/hakuch/scylla: (24 commits)
auth: Move permissions cache instance to service
auth: Add roles query function to service
cql3: Update access checks for `revoke_role_statement`
cql3: Update access checks in `grant_role_statement`
cql3: Update access checks in `list_roles_statement`
cql3: Update access checks in `drop_role_statement`
cql3: Update access checks in `alter_role_statement`
cql3: Update access checks in `create_role_statement`
tests: Switch to dedicated testing superuser
auth: Publicize enforcing check for service
tests: Expose client state from test env
Allow checking permissions from `client_state`
auth: Support querying for granted superuser
auth/service.hh: Document the class
cql3: Change `create_role_statement` base
cql3/Cql.g: Add role resources to grammar
cql3/Cql.g: Avoid extra copy of `auth::resource`
auth:resource.cc: Use `string_view` in reverse map
auth: Add `role` resource kind
auth: Add the DESCRIBE permission
...
Previously, this function was private and only `ensure_has_permission`
was public. `ensure_has_permission` throws in the absence of a
permission, but it can also be useful to query a permission without it
being an error.
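The difference between the two styles of check can be sketched like this. Only `ensure_has_permission` is named in this message; the `has_permission` name, the `access_checker` class and the string-based permission model are assumptions made for the example.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical checker keyed by (permission, resource name).
class access_checker {
    std::set<std::pair<std::string, std::string>> _granted;
public:
    void grant(const std::string& permission, const std::string& resource) {
        _granted.insert(std::make_pair(permission, resource));
    }

    // Query form: a missing permission is an ordinary "false" answer.
    bool has_permission(const std::string& permission, const std::string& resource) const {
        return _granted.count(std::make_pair(permission, resource)) > 0;
    }

    // Enforcing form: a missing permission is an error.
    void ensure_has_permission(const std::string& permission, const std::string& resource) const {
        if (!has_permission(permission, resource)) {
            throw std::runtime_error("no " + permission + " permission on " + resource);
        }
    }
};

inline void usage_example() {
    access_checker checker;
    checker.grant("SELECT", "data/ks/tbl");
    assert(checker.has_permission("SELECT", "data/ks/tbl"));
    assert(!checker.has_permission("MODIFY", "data/ks/tbl"));
}
```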
"Fixes #2866
Fixes #2894
Changes gossip propagation to allow "atomic" grouping of values to ensure
their respective order.
Modifies gossip bootstrap startup to potentially wait longer in cases
where stabilization (messages done) takes time, to avoid data loss
in repair."
* 'calle/gossip' of github.com:scylladb/seastar-dev:
gossip: wait for stabilized gossip on bootstrap
gossiper: Prevent race condition in propagation
utils::to_string: Add printers for pairs+maps
utils::in: Add helper type for perfect forwarding initializer lists
Applicable permission sets will soon be specific to each kind of
resource. This change prepares us for dynamic querying of permission
sets by resource.
This change generalizes the implementation of a `resource` to many
different kinds of resources, though there is still only one
kind (`data`). In the future, we also expect resource kinds for roles,
user-defined functions (UDFs), and possibly on particular REST
end-points.
I considered several approaches to generalizing to different kinds of
resources.
One approach is to have a base class that is inherited from by different
resource kinds. The common functionality would be accessed through
virtual member functions and kind-specific functions would exist in
sub-classes. I rejected this approach because dealing with different
kinds of resources uniformly requires storage and life-time management
through something like `std::unique_ptr<auth::resource>`, which means
that we lose value semantics (including comparison) and must deal with
complications around ownership.
Another option was to use `boost::variant` (or, in future,
`std::variant`). This is closer to what we want, since there is a static
set of resource kinds that we support. I rejected this approach for two
reasons. The first is that all resource kinds share the same data (a
list of segments and a root identifier), which would be duplicated in
each type that composed the variant. The second is that the complexity
and source-code overhead of `boost::variant` didn't seem warranted.
The solution I ended up with is a home-grown variant. All resources are
described in the same `final` class: `auth::resource`. This class has
value semantics, supports equality comparison, and has a strict
ordering. All resources have in common a tag ("kind") and a list of
parts. Most operations on resources don't care about the kind of
resource (like getting its name, parsing a name, querying for the
parent, etc). These are just member functions of the class.
When we care about a kind-specific interpretation of a resource, we can
produce a "view" of the resource. For example, `data_resource_view`
allows for accessing the (optional) keyspace and table names.
In the future, I anticipate adding functions for creating role
resources (`auth::resource::role`) and also `role_resource_view`.
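The shape of this design can be sketched roughly as below. It is a simplified illustration, not the code in the patch: the root identifier is left out, names are not parsed, and every member name other than `data_resource_view` is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

enum class resource_kind { data };

// One final class with value semantics for every kind of resource.
class resource final {
    resource_kind _kind;
    std::vector<std::string> _parts;
public:
    resource(resource_kind kind, std::vector<std::string> parts)
        : _kind(kind), _parts(std::move(parts)) {}

    resource_kind kind() const { return _kind; }
    const std::vector<std::string>& parts() const { return _parts; }

    // Kind-agnostic operation: the parent drops the last part; the root has none.
    std::optional<resource> parent() const {
        if (_parts.empty()) {
            return std::nullopt;
        }
        std::vector<std::string> up(_parts.begin(), _parts.end() - 1);
        return resource(_kind, std::move(up));
    }

    friend bool operator==(const resource& a, const resource& b) {
        return std::tie(a._kind, a._parts) == std::tie(b._kind, b._parts);
    }
    friend bool operator<(const resource& a, const resource& b) {
        return std::tie(a._kind, a._parts) < std::tie(b._kind, b._parts);
    }
};

// Kind-specific interpretation of a resource; does not own the resource.
class data_resource_view {
    const resource& _resource;

    std::optional<std::string> part(std::size_t i) const {
        if (i < _resource.parts().size()) {
            return _resource.parts()[i];
        }
        return std::nullopt;
    }
public:
    explicit data_resource_view(const resource& r) : _resource(r) {
        assert(r.kind() == resource_kind::data);
    }

    std::optional<std::string> keyspace() const { return part(0); }
    std::optional<std::string> table() const { return part(1); }
};
```

Because the view only borrows the resource, resources themselves stay cheap to copy, compare and store in ordered containers.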
The functional behaviour of the system should be unchanged with this
patch.
I've added new unit tests in `auth_resource_test.cc` and removed the old
test from `auth_test.cc`.
Fixes #3027.
Fixes #2866
Instead of a raw 30s sleep waiting for gossip to stabilize/set up
ranges on bootstrap, use logic similar to 'wait_for_gossip_to_settle'
and loop for said 30s or more, until the endpoint set neither grows nor
shrinks and we are no longer processing ACKs.
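The loop can be sketched as below, with the sleeping and the gossiper replaced by a sampling callback so that the logic is easy to follow. The names, the polling model (the 30s expressed as a minimum number of polls) and the upper bound on polls are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// One observation of the gossiper (hypothetical structure).
struct gossip_sample {
    std::size_t endpoint_count;
    bool processing_acks;
};

// Poll until at least `min_polls` intervals have elapsed AND the latest poll
// saw an unchanged endpoint set with no ACKs in flight. Returns the number
// of polls performed; `max_polls` is a safety cap added for this sketch.
inline std::size_t wait_for_stable_gossip(const std::function<gossip_sample()>& sample,
                                          std::size_t min_polls,
                                          std::size_t max_polls) {
    assert(min_polls > 0 && min_polls <= max_polls);
    std::size_t polls = 0;
    std::size_t last_count = sample().endpoint_count;
    while (polls < max_polls) {
        // A real implementation would sleep for one polling interval here.
        const gossip_sample s = sample();
        ++polls;
        const bool stable = s.endpoint_count == last_count && !s.processing_acks;
        last_count = s.endpoint_count;
        if (polls >= min_polls && stable) {
            break;
        }
    }
    return polls;
}
```

With a quiet cluster the loop costs exactly the minimum wait; with a busy one it keeps going until a poll finds nothing changing.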
Fixes #2894
Allow applying certain application states as monotonic sets,
i.e. allow a set of states as input, and ensure the values are
re-versioned and all applied together.
Then do so for certain states that are by design coupled
(status/tokens).
This is a similar solution to Origin's, as the issue was copied from there.
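A rough sketch of applying a group of states together with fresh, ordered versions. The class and member names are hypothetical, and the single-threaded staging below only illustrates the ordering guarantee, not the gossiper's actual propagation.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct versioned_value {
    std::string value;
    int version = 0;
};

// Hypothetical, simplified holder of a node's application states.
class endpoint_state_sketch {
    std::map<std::string, versioned_value> _states;
    int _next_version = 1;
public:
    // Re-version every value in the order given, then install the whole
    // group in one step so no reader sees only part of it.
    void apply_group(const std::vector<std::pair<std::string, std::string>>& group) {
        assert(!group.empty());
        std::map<std::string, versioned_value> staged;
        for (const auto& [key, value] : group) {
            staged[key] = versioned_value{value, _next_version++};
        }
        for (auto& [key, vv] : staged) {
            _states[key] = std::move(vv);
        }
    }

    const versioned_value* find(const std::string& key) const {
        auto it = _states.find(key);
        return it == _states.end() ? nullptr : &it->second;
    }
};
```

Listing the tokens before the status in the group gives the tokens the lower version, which preserves their relative order.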
This patch ensures we correctly serialize range tombstones for dense
non-compound schemas, which until now assumed the bounds were compound
composite. We also fix the reading function, which assumed the same
thing. This affected Apache Cassandra compatibility.
Fixes #2986
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a cluster feature to enable correct serialization of
non-compound range tombstones. We thus support rollbacks during an
upgrade, as we will only change range tombstone serialization when the
cluster is fully upgraded and all nodes are capable of reading the new
format.
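The gating logic can be sketched as follows, assuming a feature counts as enabled only when every known node advertises it. The feature name `EXAMPLE_FEATURE`, the types and the function names are placeholders for illustration.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

enum class tombstone_format { legacy, corrected };

using feature_set = std::set<std::string>;

// A feature is usable only if every known node advertises it.
inline bool cluster_supports(const std::vector<feature_set>& nodes, const std::string& feature) {
    assert(!nodes.empty()); // the local node is always known
    for (const auto& features : nodes) {
        if (features.count(feature) == 0) {
            return false;
        }
    }
    return true;
}

// Keep writing the old format until the whole cluster can read the new one.
inline tombstone_format choose_format(const std::vector<feature_set>& nodes) {
    return cluster_supports(nodes, "EXAMPLE_FEATURE") ? tombstone_format::corrected
                                                      : tombstone_format::legacy;
}
```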
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This change appears quite large, but is logically fairly simple.
Previously, the `auth` module was structured around global state in a
number of ways:
- There existed global instances for the authenticator and the
authorizer, which were accessed pervasively throughout the system
through `auth::authenticator::get()` and `auth::authorizer::get()`,
respectively. These instances needed to be initialized before they
could be used with `auth::authenticator::setup(sstring type_name)`
and `auth::authorizer::setup(sstring type_name)`.
- The implementation of the `auth::auth` functions and the authenticator
and authorizer depended on resources accessed globally through
`cql3::get_local_query_processor()` and
`service::get_local_migration_manager()`.
- CQL statements would check for access and manage users through static
functions in `auth::auth`. These functions would access the global
authenticator and authorizer instances and depended on the necessary
systems being started before they were used.
This change eliminates global state from all of these.
The specific changes are:
- Move out `allow_all_authenticator` and `allow_all_authorizer` into
their own files so that they're constructed like any other
authenticator or authorizer.
- Delete `auth.hh` and `auth.cc`. Constants and helper functions useful
for implementing functionality in the `auth` module have moved to
`common.hh`.
- Remove silent global dependency in
`auth::authenticated_user::is_super()` on the auth* service in favour
of a new function `auth::is_super_user()` with an explicit auth*
service argument.
- Remove global authenticator and authorizer instances, as well as the
`setup()` functions.
- Expose dependency on the auth* service in
`auth::authorizer::authorize()` and `auth::authorizer::list()`, which
is necessary to check for superuser status.
- Add an explicit `service::migration_manager` argument to the
authenticators and authorizers so they can announce metadata tables.
- The permissions cache now requires an auth* service reference instead
of just an authorizer since authorizing also requires this.
- The permissions cache configuration can now easily be created from the
DB configuration.
- Move the static functions in `auth::auth` to the new `auth::service`.
Where possible, previously static resources like the `delayed_tasks`
are now members.
- Validating `cql3::user_options` requires an authenticator, which was
previously accessed globally.
- Instances of the auth* service are accessed through `external`
instances of `client_state` instead of globally. This includes several
CQL statements including `alter_user_statement`,
`create_user_statement`, `drop_user_statement`, `grant_statement`,
`list_permissions_statement`, `permissions_altering_statement`, and
`revoke_statement`. For `internal` `client_state`, this is `nullptr`.
- Since the `cql_server` is responsible for instantiating connections
and each connection gets a new `client_state`, the `cql_server` is
instantiated with a reference to the auth* service.
- Similarly, the Thrift server is now also instantiated with a reference
to the auth* service.
- Since the storage service is responsible for instantiating and
starting the sharded servers, it is instantiated with the sharded
auth* service which it threads through. All relevant factory functions
have been updated.
- The storage service is still responsible for starting the auth*
service it has been provided, and shutting it down.
- The `cql_test_env` is now instantiated with an instance of the auth*
service, and can be accessed through a member function.
- All unit tests have been updated and pass.
Fixes #2929.
This change is motivated partly by aesthetics, but more significantly
due to the future work to refactor `auth` into a sharded service. Since
doing so will require writing `auth::auth` from scratch, these
constants (and other common functionality) need a new home.
"The main problem fixed is slow processing of application state changes.
This may lead to a bootstrapping node not having an up-to-date view of the
ring, and serving incorrect data.
Fixes #2855."
* tag 'tgrabiec/gossip-performance-v3' of github.com:scylladb/seastar-dev:
gms/gossiper: Remove periodic replication of endpoint state map
gossiper: Check for features in the change listener
gms/gossiper: Replicate changes incrementally to other shards
gms/gossiper: Document validity of endpoint_state properties
storage_service: Update token_metadata after changing endpoint_state
gms/gossiper: Process endpoints in parallel
gms/gossiper: Serialize state changes and notifications for given node
utils/loading_shared_values: Allow Loader to return non-future result
gms/gossiper: Encapsulate lookup of endpoint_state
storage_service: Batch token metadata and endpoint state replication
utils/serialized_action: Introduce trigger_later()
gossiper: Add and improve logging
gms/gossiper: Don't fire change listeners when there is no change
gms/gossiper: Allow parallel apply_state_locally()
gms/gossiper: Avoid copies in endpoint_state::add_application_state()
gms/failure_detector: Ignore short update intervals
storage_service depends on endpoint states to be replicated to all
shards before token metadata is replicated. Currently this is taken
care of by storage_service::replicate_to_all_cores(), invoked from
storage_service's change listener. It copies whole endpoint state map,
which is expensive in large clusters. It's more efficient to replicate
only incremental changes, and only once, rather than for each
application state.
There is a requirement that whatever is present in token_metadata,
should also be present in endpoint_state. Because of that, we should update
endpoint_state first (set_gossip_tokens).
Apache Cassandra switched to this order as well in commit
b39d984f7bd682c7638415d65dcc4ac9bcb74e5f.
"Histograms are a native prometheus type, and there are many functions
available that operate on them. There is extensive documentation about
them at https://prometheus.io/docs/practices/histograms/
One example is the function histogram_quantile(), that can extract
useful quantiles from the histograms. Currently, those functions don't
work well.
The reasons are twofold:
1) We are only exporting 16 metrics, starting from 1usec. That means
that the highest latency we can differentiate is 4ms. After that,
everything falls into the same bin.
2) The format that prometheus expects is that each bin will contain
the total number of points seen *up until that bin*, while we
currently export the total number of points that fall between bins.
IOW, it is a cumulative histogram.
About point two, granted it is a bit hidden in their website, but it is
there. The following phrase about a caveat makes it clear:
"Note that we divide the sum of both buckets. The reason is that the
histogram buckets are cumulative. The le="0.3" bucket is also contained
in the le="1.2" bucket; dividing it by 2 corrects for that."
It is also not needed to accumulate things that fall over the last bin:
the _count component of the histogram will already account for that."
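Point 2 amounts to a running sum over the per-bucket counts. A minimal sketch, where the function name `to_cumulative` is made up for the example:

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Turn "points that fall into each bin" into "points seen up to and
// including each bin", which is the cumulative form described above.
inline std::vector<std::uint64_t> to_cumulative(const std::vector<std::uint64_t>& per_bucket) {
    std::vector<std::uint64_t> cumulative(per_bucket.size());
    std::partial_sum(per_bucket.begin(), per_bucket.end(), cumulative.begin());
    return cumulative;
}

inline void usage_example() {
    const std::vector<std::uint64_t> per_bucket{3, 0, 2, 5};
    const std::vector<std::uint64_t> expected{3, 3, 5, 10};
    assert(to_cumulative(per_bucket) == expected);
}
```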
Acked-by: Amnon Heiman <amnon@scylladb.com>
Acked-by: Gleb Natapov <gleb@scylladb.com>
* 'prometheus-histograms' of github.com:glommer/scylla:
storage_proxy: change reporting of estimated histograms
estimated_histogram: bring histogram closer to what prometheus expects.
We were taking a reference to a temporary value in different places.
Fix them by using get_application_state_ptr(), which also avoids a copy.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Make it use get_endpoint_state_for_endpoint_ptr(), check if gossiper is
enabled, mark it as const, and have some callers use it instead of open
coding the logic.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
gossiper::get_endpoint_state_for_endpoint() returns a copy of
endpoint_state, which we've seen can be very expensive.
This patch adds a similar function which returns a pointer instead,
and changes the call sites where using the pointer-returning variant
is deemed safe (the pointer neither escapes the function, nor crosses
any defer point).
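The two accessor styles can be contrasted with a small sketch. The function names follow the ones mentioned in this message, but the signatures, the string-keyed map and the `gossiper_sketch` class are simplifications assumed for illustration.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>

struct endpoint_state {
    std::map<std::string, std::string> application_states;
};

// Hypothetical, simplified stand-in for the gossiper's endpoint state map.
class gossiper_sketch {
    std::map<std::string, endpoint_state> _endpoint_state_map;
public:
    void set_endpoint_state(const std::string& ep, endpoint_state st) {
        _endpoint_state_map[ep] = std::move(st);
    }

    // Copying variant: safe to keep around, but copies the whole state.
    std::optional<endpoint_state> get_endpoint_state_for_endpoint(const std::string& ep) const {
        auto it = _endpoint_state_map.find(ep);
        if (it == _endpoint_state_map.end()) {
            return std::nullopt;
        }
        return it->second;
    }

    // Pointer variant: no copy, but the pointer must not outlive the current
    // synchronous section, since the map may change afterwards.
    const endpoint_state* get_endpoint_state_for_endpoint_ptr(const std::string& ep) const {
        auto it = _endpoint_state_map.find(ep);
        return it == _endpoint_state_map.end() ? nullptr : &it->second;
    }
};

inline void usage_example() {
    gossiper_sketch g;
    g.set_endpoint_state("10.0.0.1", endpoint_state{{{"STATUS", "NORMAL"}}});
    const endpoint_state* p = g.get_endpoint_state_for_endpoint_ptr("10.0.0.1");
    assert(p != nullptr && p->application_states.at("STATUS") == "NORMAL");
}
```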
Fixes #764
Signed-off-by: Duarte Nunes <duarte@scylladb.com>