Range tombstones are kept in memory (cache/memtable) in
range_tombstone_list. It keeps them deoverlapped, so applying a range
tombstone which covers many range tombstones will erase existing range
tombstones from the list. This operation needs to be exception-safe,
so range_tombstone_list maintains an undo log. This undo log will
receive a record for each range tombstone which is removed. For
exception safety reasons, before pushing an undo log entry, we reserve
space in the log by calling std::vector::reserve(size() + 1). This is
O(N) where N is the number of undo log entries. Therefore, the whole
application is O(N^2).
This can cause reactor stalls and availability issues when replicas
apply such deletions.
This patch avoids the problem by reserving exponentially increasing
amount of space. Also, to avoid large allocations, switches the
container to chunked_vector.
Fixes#11211Closes#11215
Don't open-code calling the region_impl
_listeners->moved() in region move-constructor
and move-assignment op.
The other._impl->_region might be different then &other
post region::merge so let the region_impl
decide which region* is moved from.
The new_region is also set to region_impl->_region
so need to open-code that either in the said call sites.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The other _impl is presumed to be engaged already,
so just call other.get_impl() once for both use cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We can't be sure that the other_impl->_region == &other
since it could be a result of a previous merge,
so don't decide for it which region to unlisten to,
let it use its current _region.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Current ~region and region::operator= open-code
region_impl::unlisten. Just call it so it will be
easier to maintain.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
First check if _impl is engaged before accessing it
to set its _region = this in the move constructor and
move assignment operator.
Add unit test for these odd orner cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This series is the first step in the effort to reduce the number of metrics reported by Scylla.
The series focuses on the per-table metrics.
The combination of histograms, per-tables, and per shard makes the number of metrics in a cluster explode.
The following series uses multiple tools to reduce the number of metrics.
1. Multiple metrics should only be reported for the user tables and the condition that checked it was not updated when more non-user keyspaces were added.
2. Second, instead of a histogram, per table, per shard, it will report a summary per table, per shard, and a single histogram per node.
3. Histograms, summaries, and counters will be reported only if they are used (for example, the cas-related metrics will not be reported for tables that are not using cas).
Closes#11058
* github.com:scylladb/scylla:
Add summary_test
database: Reduce the number of per-table metrics
replica/table.cc: Do not register per-table metrics for system
histogram_metrics_helper.hh: Add to_metrics_summary function
Unified histogram, estimated_histogram, rates, and summaries
Split the timed_rate_moving_average into data and timer
utils/histogram.hh: should_sample should use a bitmask
estimated_histogram: add missing getter method
The to_metrics_summary is a helper function that create a metrics type
summary from a timed_rate_moving_average_with_summary object.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Currently, there are two metrics reporting mechanisms: the metrics layer
and the API. In most cases, they use the same data sources. The main
difference is around histograms and rate.
The API calculates an exponentially weighted moving average using a
timer that decays the average on each time tick. It calculates a
poor-man histogram by holding the last few entries (typically the last
256 entries). The caller to the API uses those last entries to build a
histogram.
We want to add summaries to Scylla. Similar to the API rate and
histogram, summaries are calculated per time interval.
This patch creates a unified mechanism by introducing an object that
would hold both the old-style histogram and the new
(estimated_histogram). On each time tick, a summary would be calculated.
In the future, we'll replace the API to report summaries instead of the
old-style histogram and deprecate the old style completely.
summary_calculator uses two estimated_histogram to calculate a summary.
timed_rate_moving_average_summary_and_histogram is a unifed class for
ihistogram, rates, summary, and estimated_histogram and will replace
timed_rate_moving_average_and_histogram.
Follow-up patches would move code from using
timed_rate_moving_average_and_histogram to
timed_rate_moving_average_summary_and_histogram. By keeping the API it
would make the transition easy.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
It was added in d20fae96a2
as a precaution not to invalidate iterators while
traversing _regions. However it is not requried as no allocation
is done on this synchronous path - therefore there is no
point in preventing reclaim.
This will allow making the respective functions const
as they merely return stats and do not modify the tracker impl.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To make the implementation inline and to prepare
for the next patch that adds a const overload of
this method.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Some methods were also marked inline when declared in the class
definition and in the ir definition site to provide a hint to
the compiler to inline them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
dynamic_bitset allocates only when constructed.
then on it doesn't throw.
Though not that accessing bits out of range
is undefined behavior.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Maintain the const chain by returning a const segment*
from segment_from_idx() const overload.
And add a respective mutable overload to return a mutable segment*.
This is done for a similar change in idx_from_segment.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a const noexcept overload of `find_empty()` so that
can_allocate_more_segments can be const noexcept as well.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch split the timed_rate_moving_average functionality into two, a
data class: rates_moving_average, and a wrapper class
timed_rate_moving_average that uses a timer to update the rates
periodically.
To make the transition as simple as possible timed_rate_moving_average,
takes the original API.
A new helper class meter_timer was introduced to handle the timer update
functionality.
This change required minimal code adaptation in some other parts of the
code.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch fixes a bug in should_sample that uses its bitmask
incorrectly.
basic_ihistogram has a feature that allows it to sample values instead
of taking a timer each time.
To decide if it should sample or not, it uses a bitmask. The bitmask
is of the form 2^n-1, which means 1 out of 2^n will be sampled.
For example, if the mask is 0x1 (2^2-1) 1 out of 2 will be sampled.
If the mask is 0x7 (2^3-1) 1 out of 8 will be sampled.
There was a bug in the should_sampled() method.
The correct form is (value&mask) == mask
Ref #2747
It does not solve all of #2747, just the bug part of it.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
With the region heap handle removed from logalloc::region, there is
nothing remaining there that needs violation of the abstraction
boundary, so we can drop these hacks.
The region_group mechanism used an intrusive heap handle embedded in
logalloc::region to allow region_group:s to track the largest region. But
with region_group moved out of logalloc, the handle is out of place.
Move it out, introducing a new intermediate class size_tracked_region
to hold the heap handle. We might eventually merge the new class into
memtable (which derives from it), but that requires a large rearrangement
of unit tests, so defer that.
Currently, a region_listener is added during construction and removed
during destruction. This was done to mimick the old region(region_group&)
constructor, as region_listener replaces region_group.
However, this makes moving the binomial heap handle outside logalloc
difficult. The natural place for the handle is in a derived class
of logalloc::region (e.g. memtable), but members of this derived class
will be destroyed earlier than the logalloc::region here. We could play
trickes with an earlier base class but it's better to just decouple
region lifecycle from listener lifecycle.
Do that be adding listen()/unlisten() methods. Some small awkwardness
remains in that merge() implicitly unlistens (see comment in
region::unlisten).
Unit tests are adjusted.
region_group is an abstraction that allows accounting for groups of
regions, but the cost/benefit ratio of maintaining the abstraction
is poor. Each time we need to change decision algorithm of memtable
flushing (admittedly rarely), we need to distill that into an abstraction
for region_groups and then use it. An example is virtual regions groups;
we wanted to account for the partially flushed memtables and had to
invent region groups to stand in their place.
Rather than continuing to invest in the abstraction, break it now
and move it to the memtable dirty memory manager which is responsible
for making those decisions. The relevant code is moved to
dirty_memory_manager.hh and dirty_memory_manager.cc (new file), and
a new unit test file is added as well.
A downside of the change is that unit testing will be more difficult.
Right now tracker_reclaim_lock uses tracker::impl::reclaiming_lock,
which won't be visible if we want to expose tracker_reclaim_lock and
use it from another translation unit. However, it's simple to switch
to an implementation that doesn't require an unknown-size data member,
and instead increment a counter via a pointer, so do that.