Commit Graph

1278 Commits

Author SHA1 Message Date
Michał Chojnowski
4563cbe595 logalloc: prevent false positives in reclaim_timer
reclaim_timer uses a coarse clock, but does not account for
the measurement error introduced by that -- it can falsely
report reclaims as stalls, even if they are shorter by a full
coarse clock tick from the requested threshold
(blocked-reactor-notify-ms).

Notably, if the stall threshold happens to be smaller or equal to coarse
clock resolution, Scylla's log gets spammed with false stall reports.
The resolution of coarse clocks in Linux is 1/CONFIG_HZ. This is
typically equal to 1 ms or 4 ms, and stall thresholds of this order
can occur in practice.

Eliminate false positives by requiring the measured reclaim duration to
be at least 1 clock tick longer than the configured threshold for it to
be considered a stall.

Fixes #10981

Closes #11680
2022-10-02 13:41:40 +03:00
Avi Kivity
2cec417426 Merge 'tools: use the standard allocator' from Botond Dénes
Tools want to be as little disrupting to the environment they run in as possible, because they might be run in a production environment, next to a running scylladb production server. As such, the usual behavior of seastar applications w.r.t. memory is an anti-pattern for tools: they don't want to reserve most of the system memory, in fact they don't want to reserve any amount, instead consuming as much as needed on-demand.
To achieve this, tools want to use the standard allocator. To achieve this they need a seastar option to to instruct seastar to *not* configure and use the seastar allocator and they need LSA to cooperate with the standard allocator.
The former is provided by https://github.com/scylladb/seastar/pull/1211.
The latter is solved by introducing the concept of a `segment_store_backend`, which abstracts away how the memory arena for segments is acquired and managed. We then refactor the existing segment store so that the seastar allocator specific parts are moved to an implementation of this backend concept, then we introduce another backend implementation appropriate to the standard allocator.
Finally, tools configure seastar with the newly introduced option to use the standard allocator and similarly configure LSA to use the standard allocator appropriate backend.

Refs: https://github.com/scylladb/scylladb/issues/9882
This is the last major code piece in scylla for making tools production ready.

Closes #11510

* github.com:scylladb/scylladb:
  test/boost: add alternative variant of logalloc test
  tools: use standard allocator
  utils/logalloc: add use_standard_allocator_segment_pool_backend()
  utils/logalloc: introduce segment store backend for standard allocator
  utils/logalloc: rebase release segment-store on segment-store-backend
  utils/logalloc: introduce segment_store_backend
  utils/logalloc: push segment alloc/dealloc to segment_store
  test/boost/logalloc_test: make test_compaction_with_multiple_regions exception-safe
2022-09-20 12:59:34 +03:00
Nadav Har'El
4c93a694b7 cql: validate bloom_filter_fp_chance up-front
Scylla's Bloom filter implementation has a minimal false-positive rate
that it can support (6.71e-5). When setting bloom_filter_fp_chance any
lower than that, the compute_bloom_spec() function, which writes the bloom
filter, throws an exception. However, this is too late - it only happens
while flushing the memtable to disk, and a failure at that point causes
Scylla to crash.

Instead, we should refuse the table creation with the unsupported
bloom_filter_fp_chance. This is also what Cassandra did six years ago -
see CASSANDRA-11920.

This patch also includes a regression test, which crashes Scylla before
this patch but passes after the patch (and also passes on Cassandra).

Fixes #11524.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11576
2022-09-20 06:18:51 +03:00
Botond Dénes
a55903c839 utils/logalloc: add use_standard_allocator_segment_pool_backend()
Creating a standard-memory-allocator backend for the segment store.
This is targeted towards tools, which want to configure LSA with a
segment store backend that is appropriate for the standard allocator
(which they want to use).
We want to be able to use this in both release and debug mode. The
former will be used by tools and the latter will be used to run the
logalloc tests with this new backend, making sure it works and doesn't
regress. For this latter, we have to allow the release and debug stores
to coexist in the same build and for the debug store to be able to
delegate to the release store when the standard allocator backend is
used.
2022-09-16 13:02:40 +03:00
Botond Dénes
c1c74005b7 utils/logalloc: introduce segment store backend for standard allocator
To be used by tools, this store backend is compatible with the standard
allocator as it acquires the memory arena for segments via mmap().
2022-09-16 12:16:57 +03:00
Botond Dénes
d2a7ebbe66 utils/logalloc: rebase release segment-store on segment-store-backend
Rebase the seastar allocator based segment store implementation on the
recently introduced segment store backend which is now abstracts away
how memory for segments is obtained.
This patch also introduces an explicit `segment_npos` to be used for
cases when a segment -> index mapping fails (segment doesn't belong to
the store). Currently the seastar allocator based store simply doesn't
handle this case, while the standard allocator based store uses 0 as the
implicit invalid index.
2022-09-16 12:16:57 +03:00
Botond Dénes
3717f7740d utils/logalloc: introduce segment_store_backend
We want to make it possible to select the segment-store to be used for
LSA -- the seastar allocator based one or the standard allocator based
on -- at runtime. Currently this choice is made at compile time via
preprocessor switches.
The current standard memory based store is specialized for debug build,
we want something more similar to the seastar standard memory allocator
based one. So we introduce a segment store backend for the current
seastar allocator based store, which abstracts how the backing memory
for all segments is allocated/freed, while keeping the segment <-> index
mapping common. In the next patches we will rebase the current seastar
allocator based segment store on this backend and later introduce
another backend for standard allocator, targeted for release builds.
2022-09-16 12:16:57 +03:00
Botond Dénes
5ea4d7fb39 utils/logalloc: push segment alloc/dealloc to segment_store
Currently the actual alloc/dealloc of memory for segments is located
outside the segment stores. We want to abstract away how segments are
allocated, so we move this logic too into the segment store. For now
this results in duplicate code in the two segment store implementations,
but this will soon be gone.
2022-09-16 12:16:57 +03:00
Pavel Emelyanov
fe48b66c0a cross-shard-barrier: Capture shared barrier in complete
When cross-shard barrier is abort()-ed it spawns a background fiber
that will wake-up other shards (if they are sleeping) with exception.

This fiber is implicitly waited by the owning sharded service .stop,
because barrier usage is like this:

    sharded<service> s;
    co_await s.invoke_on_all([] {
        ...
        barrier.abort();
    });
    ...
    co_await s.stop();

If abort happens, the invoke_on_all() will only resolve _after_ it
queues up the waking lambdas into smp queues, thus the subseqent stop
will queue its stopping lambdas after barrier's ones.

However, in debug mode the queue can be shuffled, so the owning service
can suddenly be freed from under the barrier's feet causing use after
free. Fortunately, this can be easily fixed by capturing the shared
pointer on the shared barrier instead of a regular pointer on the
shard-local barrier.

fixes: #11303

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11553
2022-09-16 08:21:02 +03:00
Avi Kivity
d3b8c0c8a6 logalloc: don't crash while reporting reclaim stalls if --abort-on-seastar-bad-alloc is specified
The logger is proof against allocation failures, except if
--abort-on-seastar-bad-alloc is specified. If it is, it will crash.

The reclaim stall report is likely to be called in low memory conditions
(reclaim's job is to alleviate these conditions after all), so we're
likely to crash here if we're reclaiming a very low memory condition
and have a large stall simultaneously (AND we're running in a debug
environment).

Prevent all this by disabling --abort-on-seastar-bad-alloc temporarily.

Fixes #11549

Closes #11555
2022-09-15 19:24:39 +02:00
Pavel Emelyanov
43131976e9 updateable_value: Update comment about cross-shard copying
refs: #7316

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11538
2022-09-14 12:35:56 +02:00
Michał Chojnowski
af7ace3926 utils: config_file: fix handling of workdir,W in the YAML file
Option names given in db/config.cc are handled for the command line by passing
them to boost::program_options, and by YAML by comparing them with YAML
keys.
boost::program_options has logic for understanding the
long_name,short_name syntax, so for a "workdir,W" option both --workdir and -W
worked, as intended. But our YAML config parsing doesn't have this logic
and expected "workdir,W" verbatim, which is obviously not intended. Fix that.

Fixes #7478
Fixes #9500
Fixes #11503

Closes #11506
2022-09-09 18:05:46 +02:00
Michał Chojnowski
c61b901828 utils: logalloc: prefer memory::free_memory() to memory::stats().free_memory
The former is a shortcut that does not involve a copy of all stats.
This saves some instructions in the hot path.

Closes #11495
2022-09-08 14:12:20 +03:00
Botond Dénes
5bc499080d utils/logalloc: remove reclaim_timer:: globals
One of them (_active_timer) is moved to shard tracker, the other is made
a simple local in reclaim_timer.
2022-08-23 10:38:58 +03:00
Botond Dénes
5f8971173e utils/logalloc: make s_sanitizer_report_backtrace global a member of tracker
We want to consolidate all the logalloc state into a single object: the
shard tracker. Replacing this global with a member in said object is
part of this effort.
2022-08-23 10:38:58 +03:00
Botond Dénes
499b9a3a7c utils/logalloc: tracker_reclaimer_lock: get shard tracker via constructor arg 2022-08-23 10:38:58 +03:00
Botond Dénes
7d17d675af utils/logalloc: move global stat accessors to tracker
These are pretend free functions, accessing globals in the background,
make them a member of the tracker instead, which everything needed
locally to compute them. Callers still have to access these stats
through the global tracker instance, but this can be changed to happen
through a local instance. Soon....
2022-08-23 10:38:58 +03:00
Botond Dénes
f406151a86 utils/logalloc: allocating_section: don't use the global tracker
Instead, get the tracker instance from the region. This requires adding
a `region&` parameter to `with_reserve()`.
This brings us one step closer to eliminating the global tracker.
2022-08-23 10:38:58 +03:00
Botond Dénes
e968866fa1 utils/logalloc: pass down tracker::impl reference to segment_pool
To get rid of some usages of `shard_tracker()`.
2022-08-23 10:38:58 +03:00
Botond Dénes
3bd94e41bf utils/logalloc: move segment pool into tracker
Instead of a separate global segment pool instance, make it a member of
the already global tracker. Most users are inside the tracker instance
anyway. Outside users can access the pool through the global tracker
instance.
2022-08-23 10:38:58 +03:00
Botond Dénes
5b86dfc35a utils/logalloc: add tracker member to basic_region_impl
For now this member is initialized from the global tracker instance. But
it allows the members of region impl to be detached from said global,
making a step towards removing it.
2022-08-23 10:38:58 +03:00
Botond Dénes
f4056bd344 utils/logalloc: make segment independent of segment pool
segment has some members, which simply forward the call to a
segment_pool method, via the global segment_pool instance. Remove these
and make the callers use the segment pool directly instead.
2022-08-23 10:38:58 +03:00
Avi Kivity
be44fd63f9 Merge 'Make get_range_addresses async and hold effective_replication_map_ptr around it' from Benny Halevy
This series converts the synchronous `effective_replication_map::get_range_addresses` to async
by calling the replication strategy async entry point with the same name, as its callers are already async
or can be made so easily.

To allow it to yield and work on a coherent view of the token_metadata / topology / replication_map,
let the callers of this patch hold a effective_replication_map per keyspace and pass it down
to the (now asynchronous) functions that use it (making affected storage_service methods static where possible
if they no longer depend on the storage_service instance).

Also, the repeated calls to everywhere_replication_strategy::calculate_natural_endpoints
are optimized in this series by introducing a virtual abstract_replication_strategy::has_static_natural_endpoints predicate
that is true for local_strategy and everywhere_replication_strategy, and is false otherwise.
With it, functions repeatedly calling calculate_natural_endpoints in a loop, for every token, will call it only once since it will return the same result every time anyhow.

Refs #11005

Doesn't fix the issue as the large allocation still remains until we make change dht::token_range_vector chunked (chunked_vector cannot be used as is at the moment since we require the ability to push also to the front when unwrapping)

Closes #11009

* github.com:scylladb/scylladb:
  effective_replication_map: make get_range_addresses asynchronous
  range_streamer: add_ranges and friends: get erm as param
  storage_service: get_new_source_ranges: get erm as param
  storage_service: get_changed_ranges_for_leaving: get erm as param
  storage_service: get_ranges_for_endpoint: get erm as param
  repair: use get_non_local_strategy_keyspaces_erms
  database: add get_non_local_strategy_keyspaces_erms
  database: add get_non_local_strategy_keyspaces
  storage_service: coroutinize update_pending_ranges
  effective_replication_map: add get_replication_strategy
  effective_replication_map: get_range_addresses: use the precalculated replication_map
  abstract_replication_strategy: get_pending_address_ranges: prevent extra vector copies
  abstract_replication_strategy: reindent
  utils: sequenced_set: expose set and `contains` method
  abstract_replication_strategy: calculate_natural_endpoints: return endpoint_set
  utils: sequenced_set: templatize VectorType
  utils: sanitize sequenced_set
  utils: sequenced_set: delete mutable get_vector method
2022-08-09 13:25:53 +03:00
Benny Halevy
ebe1edc091 utils: sequenced_set: expose set and contains method
And use that in sights using the endpoint set
returned by abstract_replication_strategy::calculate_natural_endpoints.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:00 +03:00
Benny Halevy
7017ad6822 abstract_replication_strategy: calculate_natural_endpoints: return endpoint_set
So it could be used also for easily searching for an endpoint.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:00 +03:00
Benny Halevy
38934413d4 utils: sequenced_set: templatize VectorType
Se we can use basic_sequenced_set<T, std::small_vector<T, N>>
as locator::endpoint_set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:00 +03:00
Benny Halevy
df380c30b9 utils: sanitize sequenced_set
And templatize its Vector type so it can be used
with a small_vector for inet_address_vector_replica_set.

Mark the methods const/noexcept as needed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:00 +03:00
Benny Halevy
57d9275d4a utils: sequenced_set: delete mutable get_vector method
It is dangerous to use since modifying the
sequenced_set vector will make it go out of sync
with the associated unordered_set member, making
the object unusable.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:00 +03:00
Benny Halevy
c71ef330b2 query-request, everywhere: define and use query_id as a strong type
Define query_id as a tagged_uuid
So it can be differentiated from other uuid-class types.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:13:28 +03:00
Benny Halevy
257d74bb34 schema, everywhere: define and use table_id as a strong type
Define table_id as a distinct utils::tagged_uuid modeled after raft
tagged_id, so it can be differentiated from other uuid-class types,
in particular from table_schema_version.

Fixes #11207

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:09:41 +03:00
Benny Halevy
a390b8475b utils: uuid: define appending_hash<utils::tagged_uuid<Tag>>
And simplify usage for appending_hash<counter_shard_view>
respectively.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:28 +03:00
Benny Halevy
8235cfdf7a utils: tagged_uuid: rename to_uuid() to uuid()
To make it more generic, similar to other uuid() get
methods we have.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
e9cc24bc18 counters: base counter_id on utils::tagged_uuid
Use the common base class for uuid-based types.

tagged_uuid::to_uuid defined here for backward
compatibility, but it will be renamed in the next patch
to uuid().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
082d5efca8 utils: tagged_uuid: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
1b78f8ba82 utils: tagged_uuid: bool: reuse uuid::bool operator
utils::UUID defined operator bool the same way,
rely on it rather than reimplementing it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
6436c614d7 raft: migrate tagged_id definition to utils::tagged_uuid
So it can be used for other types in the system outside
of raft, like counter_id, table_id, table_schema_version,
and more.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
f0567ab853 utils: uuid: mark functions noexcept
Before we define a tagged_uuid template.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
c68e929b80 utils: bit_cast: require TriviallyCopyable To
Like std::bit_cast (https://en.cppreference.com/w/cpp/numeric/bit_cast)
we only require To to be trivially copyable.

There's no need for it to be a trivial type.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
79000bc02e bit_cast: use std::bit_cast
Now that scylla requries c++20 there's no
need to define our own implementation in utils/bit_cast.hh

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
37b7a9cce2 utils: get rid of joinpoint
Now that it is no longer used.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Tomasz Grabiec
7f80602b01 db: range_tombstone_list: Avoid quadratic behavior when applying
Range tombstones are kept in memory (cache/memtable) in
range_tombstone_list. It keeps them deoverlapped, so applying a range
tombstone which covers many range tombstones will erase existing range
tombstones from the list. This operation needs to be exception-safe,
so range_tombstone_list maintains an undo log. This undo log will
receive a record for each range tombstone which is removed. For
exception safety reasons, before pushing an undo log entry, we reserve
space in the log by calling std::vector::reserve(size() + 1). This is
O(N) where N is the number of undo log entries. Therefore, the whole
application is O(N^2).

This can cause reactor stalls and availability issues when replicas
apply such deletions.

This patch avoids the problem by reserving exponentially increasing
amount of space. Also, to avoid large allocations, switches the
container to chunked_vector.

Fixes #11211

Closes #11215
2022-08-05 20:34:07 +03:00
Benny Halevy
1d9862dab3 logalloc: region_impl: add moved method
Don't open-code calling the region_impl
_listeners->moved() in region move-constructor
and move-assignment op.

The other._impl->_region might be different then &other
post region::merge so let the region_impl
decide which region* is moved from.

The new_region is also set to region_impl->_region
so need to open-code that either in the said call sites.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-28 10:49:49 +03:00
Benny Halevy
cd4dbb1cae logalloc: region: merge: optimize getting other impl
The other _impl is presumed to be engaged already,
so just call other.get_impl() once for both use cases.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-28 10:49:36 +03:00
Benny Halevy
a547cb79e8 logalloc: region: merge: call region_impl::unlisten
We can't be sure that the other_impl->_region == &other
since it could be a result of a previous merge,
so don't decide for it which region to unlisten to,
let it use its current _region.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-28 10:49:27 +03:00
Benny Halevy
003216de59 logalloc: region: call unlisten rather than open coding it
Current ~region and region::operator= open-code
region_impl::unlisten.  Just call it so it will be
easier to maintain.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-28 10:49:11 +03:00
Benny Halevy
cff953535c logalloc: region move-ctor: initialize _impl
There's no need to default-initialize it
and then move-assign it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-28 10:49:05 +03:00
Benny Halevy
c7d77e4076 logalloc: region: get_impl might be called on disengaged _impl when moved
First check if _impl is engaged before accessing it
to set its _region = this in the move constructor and
move assignment operator.

Add unit test for these odd orner cases.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-28 10:48:58 +03:00
Avi Kivity
2c0932cc41 Merge 'Reduce the amount of per-table metrics' from Amnon Heiman
This series is the first step in the effort to reduce the number of metrics reported by Scylla.
The series focuses on the per-table metrics.

The combination of histograms, per-tables, and per shard makes the number of metrics in a cluster explode.
The following series uses multiple tools to reduce the number of metrics.
1. Multiple metrics should only be reported for the user tables and the condition that checked it was not updated when more non-user keyspaces were added.
2. Second, instead of a histogram, per table, per shard, it will report a summary per table, per shard, and a single histogram per node.
3. Histograms, summaries, and counters will be reported only if they are used (for example, the cas-related metrics will not be reported for tables that are not using cas).

Closes #11058

* github.com:scylladb/scylla:
  Add summary_test
  database: Reduce the number of per-table metrics
  replica/table.cc: Do not register per-table metrics for system
  histogram_metrics_helper.hh: Add to_metrics_summary function
  Unified histogram, estimated_histogram, rates, and summaries
  Split the timed_rate_moving_average into data and timer
  utils/histogram.hh: should_sample should use a bitmask
  estimated_histogram: add missing getter method
2022-07-27 22:01:08 +03:00
Amnon Heiman
9a3e70adfb histogram_metrics_helper.hh: Add to_metrics_summary function
The to_metrics_summary is a helper function that create a metrics type
summary from a timed_rate_moving_average_with_summary object.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2022-07-27 16:58:52 +03:00
Amnon Heiman
c220e3a00f Unified histogram, estimated_histogram, rates, and summaries
Currently, there are two metrics reporting mechanisms: the metrics layer
and the API. In most cases, they use the same data sources. The main
difference is around histograms and rate.

The API calculates an exponentially weighted moving average using a
timer that decays the average on each time tick. It calculates a
poor-man histogram by holding the last few entries (typically the last
256 entries). The caller to the API uses those last entries to build a
histogram.

We want to add summaries to Scylla. Similar to the API rate and
histogram, summaries are calculated per time interval.

This patch creates a unified mechanism by introducing an object that
would hold both the old-style histogram and the new
(estimated_histogram). On each time tick, a summary would be calculated.
In the future, we'll replace the API to report summaries instead of the
old-style histogram and deprecate the old style completely.

summary_calculator uses two estimated_histogram to calculate a summary.

timed_rate_moving_average_summary_and_histogram is a unifed class for
ihistogram, rates, summary, and estimated_histogram and will replace
timed_rate_moving_average_and_histogram.

Follow-up patches would move code from using
timed_rate_moving_average_and_histogram to
timed_rate_moving_average_summary_and_histogram.  By keeping the API it
would make the transition easy.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2022-07-27 16:58:25 +03:00