There is a set of per-table metrics that should only be registered for
user tables.
Over time, more internal (non-user) keyspaces have been added, and there
is now a single function that covers all those cases.
This patch replaces the implementation to use is_internal_keyspace.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
to_metrics_summary is a helper function that creates a summary-type
metric from a timed_rate_moving_average_with_summary object.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Currently, there are two metrics reporting mechanisms: the metrics layer
and the API. In most cases, they use the same data sources. The main
difference is around histograms and rates.
The API calculates an exponentially weighted moving average using a
timer that decays the average on each time tick. It calculates a
poor man's histogram by holding the last few entries (typically the last
256 entries). The caller to the API uses those last entries to build a
histogram.
We want to add summaries to Scylla. Similar to the API rate and
histogram, summaries are calculated per time interval.
This patch creates a unified mechanism by introducing an object that
would hold both the old-style histogram and the new one
(estimated_histogram). On each time tick, a summary would be calculated.
In the future, we'll replace the API to report summaries instead of the
old-style histogram and deprecate the old style completely.
summary_calculator uses two estimated_histogram objects to calculate a summary.
timed_rate_moving_average_summary_and_histogram is a unified class for
ihistogram, rates, summary, and estimated_histogram, and will replace
timed_rate_moving_average_and_histogram.
Follow-up patches will move code from
timed_rate_moving_average_and_histogram to
timed_rate_moving_average_summary_and_histogram. Keeping the API
unchanged makes the transition easy.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch splits the timed_rate_moving_average functionality in two: a
data class, rates_moving_average, and a wrapper class,
timed_rate_moving_average, that uses a timer to update the rates
periodically.
To make the transition as simple as possible, timed_rate_moving_average
keeps the original API.
A new helper class meter_timer was introduced to handle the timer update
functionality.
This change required minimal code adaptation in some other parts of the
code.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch fixes a bug in should_sample that uses its bitmask
incorrectly.
basic_ihistogram has a feature that allows it to sample values instead
of reading the clock each time.
To decide if it should sample or not, it uses a bitmask. The bitmask
is of the form 2^n-1, which means 1 out of 2^n will be sampled.
For example, if the mask is 0x1 (2^1-1), 1 out of 2 will be sampled.
If the mask is 0x7 (2^3-1) 1 out of 8 will be sampled.
There was a bug in the should_sample() method.
The correct form is (value & mask) == mask.
Ref #2747
It does not solve all of #2747, just the bug part of it.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Prevent stalls in this path as seen in performance testing.
Also, add a respective rest_api test.
Fixes #11114
Closes #11115
* github.com:scylladb/scylla:
storage_service: reserve space in get_range_to_address_map and friends
storage_service: coroutinize get_range_to_address_map and friends
storage_service: pass replication map to get_range_to_address_map and friends
storage_service: get_range_to_address_map: move selection of arbitrary ks to api layer
test: rest_api: test range_to_endpoint_map and describe_ring
Merging empty results was already allowed, but in one way only:
empty.merge(nonempty, r); // was permitted
nonempty.merge(empty, r); // not permitted
With this commit, both methods are permitted.
In order to remove copying, the other result is now taken
by rvalue reference, with all call sites being updated
accordingly.
Fixes #10446
Fixes #10174
Closes #11064
* round up reported time to microseconds
* add backtrace if stall detected
* add call site name (hierarchical when timers are nested)
* put timers in more places
* reduce possible logspam in nested timers by making sure to report on things only once and to not report on durations smaller than those already reported on
Closes #10576
* github.com:scylladb/scylla:
utils: logalloc: fix indentation
utils: logalloc: split the reclaim_timer in compact_and_evict_locked()
utils: logalloc: report segment stats if reclaim_segments() times out
utils: logalloc: reclaim_timer: add optional extra log callback
utils: logalloc: reclaim_timer: report non-decreasing durations
utils: logalloc: have reclaim_timer print reserve limits
utils: logalloc: move reclaim timer destructor for more readability
utils: logalloc: define a proper bundle type for reclaim_timer stats
utils: logalloc: add arithmetic operations to segment_pool::stats
utils: logalloc: have reclaim timers detect being nested
utils: logalloc: add more reclaim_timers
utils: logalloc: move reclaim_timer to compact_and_evict_locked
utils: logalloc: pull reclaim_timer definition forward
utils: logalloc: reclaim_timer make tracker optional
utils: logalloc: reclaim_timer: print backtrace if stall detected
utils: logalloc: reclaim_timer: get call site name
utils: logalloc: reclaim_timer: rename set_result
utils: logalloc: reclaim_timer: rename _reserve_segments member
utils: logalloc: reclaim_timer round up microseconds
And add calls to maybe_yield to prevent stalls in this path
as seen in performance testing.
Also, add a respective rest_api test.
Fixes #11114
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
A series of refactors to the `raft_group0` service.
Read the commits in topological order for best experience.
This PR is more or less equivalent to the second-to-last commit of PR https://github.com/scylladb/scylla/pull/10835, I split it so we could have an easier time reviewing and pushing it through.
Closes #11024
* github.com:scylladb/scylla:
service: storage_service: additional assertions and comments
service/raft: raft_group0: additional logging, assertions, comments
service/raft: raft_group0: pass seed list and `as_voter` flag to `join_group0`
service/raft: raft_group0: rewrite `remove_from_group0`
service/raft: raft_group0: rewrite `leave_group0`
service/raft: raft_group0: split `leave_group0` from `remove_from_group0`
service/raft: raft_group0: introduce `setup_group0`
service/raft: raft_group0: introduce `load_my_addr`
service/raft: raft_group0: make some calls abortable
service/raft: raft_group0: remove some temporary variables
service/raft: raft_group0: refactor `do_discover_group0`.
service/raft: raft_group0: rename `create_server_for_group` to `create_server_for_group0`
service/raft: raft_group0: extract `start_server_for_group0` function
service/raft: raft_group0: create a private section
service/raft: discovery: `seeds` may contain `self`
Before they are made asynchronous in the next patch,
so they work on a coherent snapshot of the token_metadata and
replication map as their caller.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We could yield between updating the list of servers in raft/fsm
and updating the raft_address_map, e.g. in case of a set_configuration.
If tick_leader happens before the raft_address_map is updated,
is_alive will be called with server_id that is not in the map yet.
Fix: scylladb/scylla-dtest#2753
Closes #11111
It is only needed for the "storage_service/describe_ring" api
and service/storage_service shouldn't bother with it.
It's an api sugar coating.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, the WHERE clause grammar is constrained to a conjunction of
relations: `WHERE a = ? AND b = ? AND c > ?`. The restriction happens in three
places:
1. the grammar will refuse to parse anything else
2. our filtering code isn't prepared for generic expressions
3. the interface between the grammar and the rest of the cql3 layer is via a vector of terms rather than an expression
While most of the work will be in extending the filtering code, this series tackles the
interface; it changes the `whereClause` production to return an expression rather than
a vector. Since much of the cql3 layer is interested in terms, a new boolean_factors() function
is introduced to convert an expression to its boolean terms.
Closes #11105
* github.com:scylladb/scylla:
cql3: grammar: make where clause return an expression
cql3: util: deinline where clause utilities
cql3: util: change where clause utilities to accept a single expression rather than a vector of terms
cql3: statement_restrictions: accept a single expression rather than a vector
cql3: statement_restrictions: merge `if` and `for`
cql3: select_statement: remove wrong but harmless std::move() in prepare_restrictions
cql3: expr: add boolean_factors() function to factorize an expression
cql3: expression: define operator==() for expressions
cql3: values: add operator==() for raw_value
"scylla task_histogram" and "scylla fiber" will now show coroutine "promises".
Refs #10894
Closes #11071
* github.com:scylladb/scylla:
test: gdb: test that "task_histogram -a" finds some coroutines
scylla-gdb.py: recognize coroutine-related symbols as task types
scylla-gdb.py: whitelist the .text section for task "vtables"
scylla-gdb.py: fix an error message
The cql-pytest cassandra_tests/validation/operations/select_test.py::
testSelectWithAlias uses a TTL but not because it wants to test the TTL
feature - it just wants to check the SELECT aliasing feature. The test
writes a TTL of 100 and then reads it back using an alias. We would
normally expect to read back 100 or 99, but to guard against a very slow
test machine, the test verified that we read back something between 70
and 100. I thought that allowing a ridiculous 30 second delay between
the write and the read requests was more than enough.
But in one run of the aarch64 debug build, this ridiculous 30 seconds
wasn't ridiculous enough - the delay ended up 35 seconds, and the
test failed!
So in this patch, I just make it even more ridiculous - we write 1000
and expect to read something over 100 - allowing a 900 second delay
in the test.
Note that neither the earlier 30-second nor the current 900-second delay
slows down the test in any way - this test will normally complete in
milliseconds.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11085
In preparation of the relaxation of the grammar to return any expression,
change the whereClause production to return an expression rather than
terms. Note that the expression is still constrained to be a conjunction
of relations, and our filtering code isn't prepared for more.
Before the patch, if the WHERE clause was optional, the grammar would
pass an empty vector of expressions (which is exactly correct). After
the patch, it would pass a default-constructed expression. Now that
happens to be an empty conjunction, which is exactly what's needed, but
it is too accidental, so the patch changes optional WHERE clauses to
explicitly generate an empty conjunction if the WHERE clause wasn't
specified.
Move closer to the goal of accepting a generic expression for WHERE
clause by accepting a generic expression in statement_restrictions. The
various callers will synthesize it from a vector of terms.
std::move(_where_clause) is wrong, because _where_clause is used later
(when analyzing GROUP BY), but also harmless (because the
statement_restrictions constructor accepts it by const reference).
To avoid confusion in the next patch where we'll pass _where_clause
to a different function, remove the bad std::move() in advance here.
When analyzing a WHERE clause, we want to separate individual
factors (usually relations), and later partition them into
partition key, clustering key, and regular column relations. The
first step is separation, for which this helper is added.
Currently, it is not required since the grammar supplies the
expression in separated form, but this will not work once it is
relaxed to allow any expression in the WHERE clause.
A unit test is added.
This is useful for implementing operator==() for expressions, which in
turn require comparing constants, which contain raw_values.
Note that this is not CQL comparison (that would be implemented
in cql3::expr::evaluate() and would return a CQL boolean, not a C++
boolean), but a traditional C++ value comparison.
Fixes https://github.com/scylladb/scylla-docs/issues/4041
I've added the upgrade guides from 2022.x.y to 2022.x.z. They are based on the previous upgrade guides for patch releases.
Closes #11104
* github.com:scylladb/scylla:
doc: add the new upgrade guide to the toctree
doc: add the upgrade guides from 2022.x.y to 2022.x.z
The criterion is too permissive because coroutine symbols (those
without the "[clone .resume]" part at the end, anyway) look like
normal function names; hopefully this won't give too many false
positives to become a problem.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Actual vtables do not reside there, but coroutine object vptrs point
at the actual coroutine code, which does reside in .text.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Expiring entries are added when a message is received from an unknown
host. If the host is later added to the raft configuration, they become
non-expiring. After that they can only be removed when the host is
dropped from the configuration, but they should never become expiring
again.
Refs #10826
This patch avoids unnecessary CACHE_HITRATES updates through gossip.
After this patch:
Publish CACHE_HITRATES when:
- We haven't published it at all
- The diff is bigger than 1% and we haven't published in the last 5 seconds
- The diff is really big (bigger than 10%)
Note: A peer node can learn the cache hitrate through the read_data,
read_mutation_data, and read_digest RPC verbs, which include
cache_temperature in the response. So there is no need to update
CACHE_HITRATES through gossip at high frequency.
We do the recalculation faster if the diff is bigger than 0.01. It is useful to
do the calculation even if we do not publish the CACHE_HITRATES through gossip,
since the recalculation will call table->set_global_cache_hit_rate to set
the hitrate.
Fixes #5971
Closes #11079
In issue #10966, a user noticed that Alternator writes may be reordered
(a later write to an item is ignored with the earlier write to the same
item "winning") if Scylla nodes do not have synchronized time and if
always_use_lwt write isolation mode is not used.
In this patch I add to docs/alternator/compatibility.md a section about
this issue, what causes it, and how to solve or at least mitigate it.
Fixes#10966
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11094
Move some rare logs from TRACE to INFO level.
Add some assertions.
Write some more comments, including FIXMEs and TODOs.
Remove unnecessary `_shutdown_gate.hold()` (this is not a background
task).
Group 0 discovery would internally fetch the seed list from gossiper.
Gossiper would return the seed list from conf/scylla.yaml. This seed
list is proper for the bootstrapping scenario - we specify the initial
contact points for a node that joins a cluster.
We'll have to use a different list of seeds for group 0 discovery for
the upgrade scenario. Prepare for that by taking the seed list as a
parameter.
In the bootstrap scenario we'll pass the seed list down from
`storage_service::join_cluster`.
Additionally, `join_group0` now takes an `as_voter` flag, which is
`false` in the bootstrap scenario (we initially join as a non-voter) but
will be `true` in the upgrade scenario.
See previous commit. `remove_from_group0` had a problem similar to
`leave_group0`: it would handle the case where `raft_group0::_group0`
variant was not `raft::group_id` (i.e. we haven't joined group 0), but
RAFT local feature was enabled - i.e. the yet-unimplemented upgrade case
- by running discovery and calling `send_group0_modify_config`.
Instead, if we see that we've joined group 0 before, assume that we're
still a member and simply use the Raft `modify_config` API to remove
another server. If we're not a member it means we either decommissioned
or were removed by someone else; then we have no business trying to
remove others. There's also the unimplemented upgrade case but that will
come in another pull request.
Finally, add some logic for handling an edge case: suppose we joined
group 0 recently and we still didn't fully update our RPC address map
(it's being updated asynchronously by Raft's io_fiber). Thus we may fail
to find a member of group 0 in the address map. To handle this, ensure
we're up-to-date by performing a Raft read barrier.
State some assumptions in a comment.
Add a TODO for handling failures.
Remove unnecessary `_shutdown_gate.hold()` (this is not a background
task).
One of the following cases is true:
1. RAFT local feature is disabled. Then we don't do anything related to
group 0.
2. RAFT local feature is enabled and when we bootstrapped, we joined
group 0. Then `raft_group0::_group0` variant holds the
`raft::group_id` alternative.
3. RAFT local feature is enabled and when we bootstrapped we didn't join
group 0. This means the RAFT local feature was disabled when we
bootstrapped and we're in the (unimplemented yet) upgrade scenario.
`raft_group0::_group0` variant holds the `std::monostate` alternative.
The problem with the previous implementation was that it checked for the
conditions of the third case above - that RAFT local feature is enabled
but `_group0` does not hold `raft::group_id` - and if those conditions
were true, it executed some logic that didn't really make sense: it ran
the discovery algorithm and called `send_group0_modify_config` RPC.
In this rewrite I state some assumptions that `leave_group0` makes:
- we've finished the startup procedure.
- we're being run during decommission - after the node entered LEFT
status.
In the new implementation, if `_group0` does not hold `raft::group_id`
(checked by the internal `joined_group0()` helper), we simply return.
This is the yet-unimplemented upgrade case left for a follow-up PR.
Otherwise we fetch our Raft server ID (at this point it must be present
- otherwise it's a fatal error) and simply call `modify_config` from the
`raft::server` API.
Remove unnecessary call to `_shutdown_gate.hold()` (this is not a
background task).
`leave_group0` was responsible for both removing a different node from
group 0 and removing ourselves (leaving) group 0. The two scenarios are
a bit different and the handling will be rewritten in following commits.
Split `leave_group0` into two functions. Remove the incorrect comment
about idempotency: saying that the procedure is idempotent is an
oversimplification; one could argue it's incorrect, since the second call
simply hangs, at least in the case of leaving group 0. Following commits
will state what's happening more precisely.
Add some additional logging and assertions where the two functions are
called in `storage_service`.
Contains all logic for deciding to join (or not join) group 0.
Prepare for the case where we don't want to join group 0 immediately on
startup - the upgrade scenario (will be implemented in a follow-up).
Move the group 0 setup step earlier in `storage_service::join_cluster`.
`join_group0()` is now a private member of `raft_group0`. Some more
comments were written.
Compared to `load_or_create_my_addr`, this function assumes that
the address is already present on disk; if not, it's a fatal error.
Use it in places where it would indeed be a fatal error
if the address was missing.
There are some calls to `modify_config` which should react to aborts
(e.g. when we shutdown Scylla).
There are also calls to `send_group0_modify_config` which should
probably also react to aborts, but the functions don't take
an abort_source parameter. This is fixable but I left TODOs for now.