Checking if the role to be dropped has superuser requires that the role
exists, which means `auth::nonexistent_role` was thrown even when IF
EXISTS was specified.
This is a large change, but it's a necessary evil.
This change brings us to a minimally-functional implementation of roles.
There are many additional changes that are necessary, including refined
grammar, bug fixes, code hygiene, and internal code structure changes.
In the interest of keeping this patch somewhat readable, those changes
will come in subsequent patches. Until that time, roles are still marked
"unimplemented".
IMPORTANT: This code does not include any mechanism for transitioning a
cluster from user-based access-control to role-based access control. All
existing access-control metadata will be ignored (though not deleted).
Specific changes:
- All user-specific CQL statements now delegate to their role
equivalents. The statements are effectively the same, but CREATE USER
will include LOGIN automatically. Also, LIST USERS only lists roles
with LOGIN.
- A call to LIST PERMISSIONS will now also list permissions of roles
that have been granted to the caller, in addition to permissions which
have been granted directly.
- Much of the logic of creating, altering, and deleting roles has been
moved to `auth::service`, since these operations require cooperation
between the authenticator, authorizer, and role-manager.
- LIST USERS actually works as expected now (fixes #2968).
The previous code had an off-by-one error because the iterator was
incremented unconditionally before being compared to the end of the
collection.
This new version is also shorter thanks to `seastar::do_until`.
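To illustrate the class of bug described above, here is a minimal sketch in plain standard C++. It is not the original Scylla code: `list_users_buggy`, `list_users_fixed`, and `do_until_sync` are hypothetical names, and `do_until_sync` is only a synchronous stand-in with the same "check the stop condition, then act" shape as `seastar::do_until`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical synchronous analogue of a "test the stop condition, then act"
// loop. This is NOT seastar::do_until, only the same control-flow shape.
template <typename StopCondition, typename Action>
void do_until_sync(StopCondition stop, Action action) {
    while (!stop()) {
        action();
    }
}

// Assumed shape of the bug: the iterator is advanced before it is compared
// with end(), so the first element is never visited.
inline std::vector<std::string> list_users_buggy(const std::vector<std::string>& users) {
    std::vector<std::string> out;
    auto it = users.begin();
    while (it != users.end()) {
        ++it;                      // incremented unconditionally ...
        if (it == users.end()) {   // ... and only then compared
            break;
        }
        out.push_back(*it);
    }
    return out;
}

// The stop condition is checked before every use of the iterator.
inline std::vector<std::string> list_users_fixed(const std::vector<std::string>& users) {
    std::vector<std::string> out;
    auto it = users.begin();
    do_until_sync(
        [&] { return it == users.end(); },
        [&] { out.push_back(*it); ++it; });
    return out;
}
```

With three users, the buggy variant returns only the last two, while the fixed variant returns all three.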
The components of access-control (authentication, authorization, and
role-management) are designed as abstract interfaces, but due to
decisions of Apache Cassandra, certain implementations are dependent on
other particular implementations.
This change throws a new exception,
`auth::incompatible_module_combination`, when a dependency is not
satisfied.
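A dependency check of this kind might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `sketch` namespace, the `module_selection` struct, the short module names, and the single rule being enforced do not come from the patch.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

namespace sketch {

// Stand-in for auth::incompatible_module_combination (base class assumed).
class incompatible_module_combination : public std::invalid_argument {
public:
    using std::invalid_argument::invalid_argument;
};

// Hypothetical description of the three configured access-control modules.
struct module_selection {
    std::string authenticator;
    std::string authorizer;
    std::string role_manager;
};

// Hypothetical rule: the "password" authenticator only works together with
// the "standard" role manager. Real rules may differ.
inline void validate(const module_selection& m) {
    if (m.authenticator == "password" && m.role_manager != "standard") {
        throw incompatible_module_combination(
            "authenticator 'password' requires role manager 'standard', got '" +
            m.role_manager + "'");
    }
}

} // namespace sketch
```

Running the check once at startup, before any module is started, keeps the failure early and the error message specific.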
The set of allowed options is quite small, so we benefit from a static
representation (member variables) over a dynamic map.
We also logically move the "OPTIONS" option to the domain of the
authenticator (from user management), since this is where it is applied.
This refactor also aims to reduce compilation time by moving
`authentication_options` into its own header file.
While changes to `user_options` were necessary to accommodate the new
structure, that class will be deprecated shortly in the switch to roles.
Therefore, the changes are strictly temporary.
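The idea of a static representation can be sketched as follows. This is an illustration only: the struct name and its exact members are assumptions, not the contents of the real header.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical static representation: each allowed option is a member, so a
// misspelled option name is a compile error rather than a failed map lookup.
struct authentication_options_sketch {
    std::optional<std::string> password;
    // Free-form "OPTIONS" payload, interpreted by the authenticator.
    std::optional<std::map<std::string, std::string>> options;
};

// True if the statement supplied at least one authentication option.
inline bool any_set(const authentication_options_sketch& o) {
    return o.password.has_value() || o.options.has_value();
}
```

Compared with a `std::map<std::string, std::string>` keyed by option name, this trades extensibility for type checking, which is reasonable when the set of options is small and fixed.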
When the cluster is large or num_tokens is big, calculate_pending_ranges
can take a long time to complete. It currently runs in the gossip thread, so it
can block gossip processing. Another problem is that it runs in a plain for loop
and can cause reactor stalls.
Users see this stall with decommission operations.
I can reproduce a stall of up to 4 seconds in a two-node cluster, each node
started with `--num-tokens 3072`, during decommission.
Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout
Fixes #3203
* tag 'asias/issue_3203_v2.1' of github.com:scylladb/seastar-dev:
storage_service: Do not wait for update_pending_ranges in handle_state_leaving
token_metadata: Handle affected_ranges with do_for_each
token_metadata: Split token_metadata::calculate_pending_ranges
token_metadata: Futurize calculate_pending_ranges
storage_service: Futurize storage_service::do_update_pending_ranges
token_metadata: Speed up token_metadata::get_endpoint
The call chain is:
storage_service::on_change() -> storage_service::handle_state_leaving()
-> storage_service::update_pending_ranges()
Listeners run as part of gossip message processing, which is
serialized. This means we won't be processing any gossip messages until
update_pending_ranges completes. update_pending_ranges takes time to
complete.
Since we no longer wait for update_pending_ranges to complete,
multiple update_pending_ranges operations can run at the same time. Use
serialized_action to serialize them.
Tested with update_cluster_layout_tests.py
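The serialization idea can be sketched in plain, synchronous C++ as below. This is not Scylla's `serialized_action`; the class name and its coalescing behaviour (any number of triggers that arrive during a run collapse into one extra run) are assumptions made for illustration.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical sketch: at most one run of the action is in progress at any
// time; triggers that arrive while it runs are coalesced into one more run.
class serialized_action_sketch {
    std::function<void()> _action;
    bool _running = false;
    bool _pending = false;
public:
    explicit serialized_action_sketch(std::function<void()> action)
        : _action(std::move(action)) {}

    void trigger() {
        if (_running) {
            _pending = true;   // remember the request, do not start a second run
            return;
        }
        _running = true;
        do {
            _pending = false;
            _action();         // may call trigger() again (re-entrantly)
        } while (_pending);
        _running = false;
    }
};
```

For simplicity the sketch does not handle exceptions thrown by the action; a real implementation would also have to deal with asynchronous completion.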
affected_ranges can be very large in a large cluster or on a node with a big
num_tokens count. calculate_natural_endpoints takes more time to
process in this case as well.
Futurize calculate_pending_ranges_for_leaving and handle the loop with
do_for_each to give the reactor some time to breathe, so it does not
block.
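As a rough, synchronous illustration of why such a loop helps, the sketch below processes a range but hands control to a caller-supplied yield hook every `budget` elements. It is not `seastar::do_for_each`; the function name, the fixed budget, and the hook are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: run `action` on every element, but call `yield` after
// every `budget` elements so other work can make progress in between.
// `budget` must be greater than zero. Returns the number of yields taken.
template <typename Range, typename Action, typename Yield>
std::size_t for_each_with_yield(const Range& range, std::size_t budget,
                                Action action, Yield yield) {
    std::size_t yields = 0;
    std::size_t since_last_yield = 0;
    for (const auto& element : range) {
        action(element);
        if (++since_last_yield == budget) {
            since_last_yield = 0;
            ++yields;
            yield();
        }
    }
    return yields;
}
```

A plain for loop corresponds to a budget that is never reached: the whole range is processed in one uninterrupted stretch, which is what shows up as a stall when the range is large.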
token_metadata::calculate_pending_ranges is a complicated function.
Split it into 3 parts: the leaving operation, the moving operation, and
the bootstrap operation.
Now, do_update_pending_ranges is futurized. We can finally futurize
token_metadata::calculate_pending_ranges in order to convert the loops
inside it to do_for_each instead of plain for loops to avoid reactor
stalls.
Preparation work for futurizing the time-consuming
token_metadata::calculate_pending_ranges.
In addition, we use do_for_each for the loop. It is better than the
plain for loop because the reactor can yield to avoid stalls when
there are many keyspaces.
The token_to_endpoint map can get so big that converting it to a
vector causes a large allocation warning.
This patch replaces the implementation so that the returned JSON array is
created directly from the map by using the stream_range_as_array helper
function.
Fixes #3185
Message-Id: <20180207153306.30921-1-amnon@scylladb.com>
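The general technique of emitting the array straight from the map, without building an intermediate vector, can be sketched as follows. This is not the `stream_range_as_array` helper itself: the function name, the `key`/`value` element shape, and the lack of string escaping are simplifying assumptions.

```cpp
#include <cassert>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical sketch: write one JSON object per map entry directly to the
// output stream. Only one element is formatted at a time, so no contiguous
// buffer proportional to the map size is allocated.
// NOTE: keys and values are written verbatim; real code must escape them.
inline void stream_map_as_json_array(std::ostream& os,
                                     const std::map<std::string, std::string>& m) {
    os << '[';
    bool first = true;
    for (const auto& [key, value] : m) {
        if (!first) {
            os << ',';
        }
        first = false;
        os << "{\"key\":\"" << key << "\",\"value\":\"" << value << "\"}";
    }
    os << ']';
}
```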
Container indices are size_t, and in other places we gratuitously
declare a limit as unsigned and the loop index as signed.
Tests: unit (release)
Message-Id: <20180212121642.10525-1-avi@scylladb.com>
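The hazard that -Wsign-compare guards against can be shown with a small standalone sketch. The function names are hypothetical, and the explicit cast only spells out the conversion the language performs implicitly when an `int` is compared with an `unsigned`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// What `a < b` means for (int, unsigned): `a` is converted to unsigned first,
// so a negative value turns into a very large one.
inline bool less_after_implicit_conversion(int a, unsigned b) {
    return static_cast<unsigned>(a) < b;
}

// Value-aware comparison: a negative int is always less than any unsigned.
inline bool less_by_value(int a, unsigned b) {
    return a < 0 || static_cast<unsigned>(a) < b;
}

// Using size_t for both the index and the limit avoids the mixed comparison.
inline std::size_t count_positive(const std::vector<int>& v) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i] > 0) {
            ++n;
        }
    }
    return n;
}
```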
All of the adjustments to _remain already ensure it is greater than 0,
and indeed a negative _remain doesn't make sense.
Switching to an unsigned type allows us to re-enable -Wsign-compare.
Tests: unit (release)
Message-Id: <20180212121636.10463-1-avi@scylladb.com>
Commit cce1a2bce8 ("Use the CPU scheduler")
placed some compaction manager code in a scheduling_group. Unfortunately,
downstream code relied on the callers not deferring, so that it could rely
on the column_family's existence. That assumption breaks if the column_family
is removed quickly, as with_scheduling_group() always defers.
Fix by applying the scheduling group after we've taken the lock and guaranteed
the stability of the column_family object.
Fixes #3196.
Message-Id: <20180211165155.18179-1-avi@scylladb.com>
71495691aa removed sstable::get_index_reader(),
but forgot to update its callers in tests/. Update the callers to construct
a temporary shared_index_list and create the index_reader directly.
This is none too clean, but shared_index_lists needs to be retired, and then
the changes in this patch can go away too.
Tests: unit (release)
Message-Id: <20180211164739.17862-1-avi@scylladb.com>
With the changes introduced in #2981, it is no longer safe to share
index_entries among multiple sstable_mutation_readers.
The original intent behind sharing index_entries among index_readers was
to avoid re-reading the same pages, as we have two index readers -
lower and upper bound - for every sstable_mutation_reader. In fact, the
shared entries were held at the sstable object level so index_readers
from different sstable_mutation_readers could have accessed them.
Now, with calls to index_reader::advance_to(pos)/index_reader::advance_past(pos),
index_entry can be accessed in a way that modifies its state if we need
to read more promoted index blocks. It is safe to keep sharing them
between two index_readers within the same sstable_mutation_reader as the
invariant is maintained that readers can be only moved forward.
We cannot safely assume, however, that this invariant holds for multiple
sstable_mutation_readers as it may happen that one of them has read and
thrown away some promoted index blocks that another one needs. So we
restrict sharing to per-sstable_mutation_reader level.
Fixes #3189.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <83957d007621fe4c62af49aebf1838bb2f32ee55.1518226793.git.vladimir@scylladb.com>
"The motivation is that it's no longer needed after new resharding
algorithm, which is solely responsible for working with shared
sstables; regular compaction will not work with those!
So resharding will schedule deletion of shared sstables once it's
certain that shards that own them have the new unshared sstables.
The manager was needed for orchestrating deletion of shared sstables
across shards. It brings extra complexity that's no longer needed,
and it was also overloading shard 0, but the latter could have
been fixed.
Tests:
- unit: release mode
- dtest: resharding_test.py"
* 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla:
Remove SSTable's atomic deletion manager
Stop using SSTable's atomic deletion manager
database: split column_family::rebuild_sstable_list
"We have noticed in the past that the compiler is too conservative when it comes
to deciding which functions to inline. Since inlining functions enables further
optimisations such as constant folding, in some cases the difference in performance
was significant enough to force us to add the [[gnu::always_inline]] attribute in
numerous places. However, this is neither a practical nor an elegant solution.
A better way to deal with the problem is to adjust the compiler tunables that
control the heuristics used for making inlining decisions. In particular,
inline-unit-growth seems to affect the performance of the emitted code most.
Apart from making the compiler more eager to inline functions, bumping the
optimisation level to -O3 also seems to have a positive impact on
performance.
Fixes #1644.
Tests: unit-test (release)
Performance tested with gcc 7.3.
Macrobenchmark
perf_simple_query
Flags: -c4 --duration 60
All results are medians.
        ./before     ./after   diff
read   338662.12   405377.80  19.7%
write  387378.89   466744.15  20.5%
Microbenchmarks
single run duration: 1.000s
number of runs: 5
BEFORE
test                                iterations     median        mad        min        max
combined.one_row                        858933  536.389ns    0.819ns  534.823ns  537.208ns
combined.single_active                    8469   77.131us   11.000ns   77.118us   77.145us
combined.many_overlapping                 1199  664.105us  160.807ns  663.818us  668.527us
combined.disjoint_interleaved             8100   75.522us   22.254ns   75.500us   75.732us
combined.disjoint_ranges                  8288   72.580us   10.571ns   72.568us   72.599us
memtable.one_partition_one_row         1216233  825.581ns    0.446ns  821.450ns  826.027ns
memtable.one_partition_many_rows        127336    7.855us    2.153ns    7.853us    7.898us
memtable.many_partitions_one_row         57919   17.356us    6.028ns   17.259us   17.362us
memtable.many_partitions_many_rows        4751  210.496us  102.339ns  210.393us  211.188us
AFTER
test                                iterations     median        mad        min        max
combined.one_row                       1002321  450.292ns    0.313ns  447.202ns  450.605ns
combined.single_active                    9605   67.086us    8.620ns   67.073us   67.115us
combined.many_overlapping                 1476  519.554us    5.334ns  519.549us  519.953us
combined.disjoint_interleaved             9280   64.363us    5.328ns   64.335us   64.369us
combined.disjoint_ranges                  9481   61.893us    3.620ns   61.885us   61.903us
memtable.one_partition_one_row         1432668  699.775ns    0.106ns  696.023ns  699.918ns
memtable.one_partition_many_rows        153692    6.536us    6.885ns    6.501us    6.543us
memtable.many_partitions_one_row         63319   15.879us    5.080ns   15.793us   15.884us
memtable.many_partitions_many_rows        5659  176.717us   66.770ns  176.650us  177.778us"
* tag 'optimise-and-inline/v2' of https://github.com/pdziepak/scylla:
configure.py: set optimisation level to -O3
configure.py: set inline-unit-growth to 300
configure.py: flag_supported: support flags with spaces
configure.py: rename warning_supported to flag_supported
configure.py: pass optimisation flags to seastar/configure.py
cql3/select_statement: do not capture stack variables by reference
In this patchset I am resubmitting Avi's enablement of the CPU scheduler
on his behalf. I've done a ton of testing on the series and there are
some improvements / changes that I had previously sent as a separate series.
What you see here is the result of merging that work.
After this patchset is applied, workloads are smoother and we are able to
uphold the pre-defined shares among the various actors.
We also finally have everything we need to merge the CPU and I/O controllers.
With that done, the code is much simpler. As a bonus,
controllers that were previously available for I/O only (compactions) are
enabled for CPU as well.
* git@github.com:glommer/scylla.git cpusched-v7:
Avi Kivity (4):
database, sstables, compaction: convert use of thread_scheduling_group
to seastar cpu scheduler
memtable, database: make memtable::clear_gently() inherit
scheduling_group
config: mark background_writer_scheduling_quota as Unused
database: place data_query execution stage into scheduling_group
Glauber Costa (9):
database, main: set up scheduling_groups for our main tasks
row_cache: actually use the scheduling group for update_cache
allow update_cache and clear_gently to use the entire task quota.
database: remove cpu_flush_quota metric
controllers: retire auto_adjust_flush_quota
controllers: allow memtable I/O controller to have shares statically
set
controllers: update control points for memtable I/O controller
controllers: allow a static priority to override the controller output
controllers: unify the I/O and CPU controllers
It has been discovered that the compiler is too conservative when
deciding which functions to inline. In particular, the limiting tunable
turned out to be inline-unit-growth which limits inlining in large
translation units.