This series is a step towards non-LRU cache algorithms.
Our cache items are able to unlink themselves from the LRU list. (In other words, they can be unlinked solely via a pointer to the item, without access to the containing list head). Some places in the code make use of that, e.g. by relying on auto-unlink of items in their destructor.
However, to implement algorithms smarter than LRU, we might want to update some cache-wide metadata on item removal. But any cache-wide structures are unreachable through an item pointer, since items only have access to themselves and their immediate neighbours. Therefore, we don't want items to unlink themselves — we want `cache.remove(item)`, rather than `item.remove_self()`, because the former can update the metadata in `cache`.
This series inserts explicit item unlink calls in places that were previously relying on destructors, gets rid of other self-unlinks, and adds an assert which ensures that every item is explicitly unlinked before destruction.
Closes#11716
* github.com:scylladb/scylladb:
utils: lru: assert that evictables are unlinked before destruction
utils: lru: remove unlink_from_lru()
cache: make all cache unlinks explicit
Previous patches introduce the assumption that evictables are manually unlinked
before destruction, to allow for correct bookkeeping within the cache.
This assert assures that this assumptions is correct.
This is particularly important because the switch from automatic to explicit
unlinking had to be done manually. Destructor calls are invisible, so it's
possible that we have missed some automatic destruction site.
unlink_from_lru() allows for unlinking elements from cache without notifying
the cache. This messes up any potential cache bookkeeping.
Improved that by replacing all uses of unlink_from_lru() with calls to
lru::remove(), which does have access to cache's metadata.
Our LSA cache is implemented as an auto_unlink Boost intrusive list, meaning
that elements of the list unlink themselves from the list automatically on
destruction. Some parts of the code rely on that, and don't unlink them
manually.
However, this precludes accurate bookkeeping about the cache. Elements only have
access to themselves and their neighbours, not to any bookkeeping context.
Therefore, a destructor cannot update the relevant metadata.
In this patch, we fix this by adding explicit unlink calls to places where it
would be done by a destructor. In a following patch, we will add an assert to
the destructor to check that every element is unlinked before destruction.
This series adds support for detecting collections that have too many items
and recording them in `system.large_cells`.
A configuration variable was added to db/config: `compaction_collection_items_count_warning_threshold` set by default to 10000.
Collections that have more items than this threshold will be warned about and will be recorded as a large cell in the `system.large_cells` table. Documentation has been updated respectively.
A new column was added to system.large_cells: `collection_items`.
Similar to the `rows` column in system.large_partition, `collection_items` holds the number of items in a collection when the large cell is a collection, or 0 if it isn't. Note that the collection may be recorded in system.large_cells either due to its size, like any other cell, and/or due to the number of items in it, if it cross the said threshold.
Note that #11449 called for a new system.large_collections table, but extending system.large_cells follows the logic of system.large_partitions is a smaller change overall, hence it was preferred.
Since the system keyspace schema is hard coded, the schema version of system.large_cells was bumped, and since the change is not backward compatible, we added a cluster feature - `LARGE_COLLECTION_DETECTION` - to enable using it.
The large_data_handler large cell detection record function will populate the new column only when the new cluster feature is enabled.
In addition, unit tests were added in sstable_3_x_test for testing large cells detection by cell size, and large_collection detection by the number of items.
Closes#11449Closes#11674
* github.com:scylladb/scylladb:
sstables: mx/writer: optimize large data stats members order
sstables: mx/writer: keep large data stats entry as members
db: large_data_handler: dynamically update config thresholds
utils/updateable_value: add transforming_value_updater
db/large_data_handler: cql_table_large_data_handler: record large_collections
db/large_data_handler: pass ref to feature_service to cql_table_large_data_handler
db/large_data_handler: cql_table_large_data_handler: move ctor out of line
docs: large-rows-large-cells-tables: fix typos
db/system_keyspace: add collection_elements column to system.large_cells
gms/feature_service: add large_collection_detection cluster feature
test: sstable_3_x_test: add test_sstable_too_many_collection_elements
test: lib: simple_schema: add support for optional collection column
test: lib: simple_schema: build schema in ctor body
test: lib: simple_schema: cql: define s1 as static only if built this way
db/large_data_handler: maybe_record_large_cells: consider collection_elements
db/large_data_handler: debug cql_table_large_data_handler::delete_large_data_entries
sstables: mx/writer: pass collection_elements to writer::maybe_record_large_cells
sstables: mx/writer: add large_data_type::elements_in_collection
db/large_data_handler: get the collection_elements_count_threshold
db/config: add compaction_collection_elements_count_warning_threshold
test: sstable_3_x_test: add test_sstable_write_large_cell
test: sstable_3_x_test: pass cell_threshold_bytes to large_data_handler
test: sstable_3_x_test: large_data_handler: prepare callback for testing large_cells
test: sstable_3_x_test: large_data tests: use BOOST_REQUIRE_[GL]T
test: sstable_3_x_test: test_sstable_log_too_many_rows: use tests::random
Automatically updates a value from a utils::updateable_value
Where they can be of different types.
An optional transfom function can provide an additional transformation
when updating the value, like multiplying it by a factor for unit conversion,
for example.
To be used for auto-updating the large data thresholds
from the db::config.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
reclaim_timer uses a coarse clock, but does not account for
the measurement error introduced by that -- it can falsely
report reclaims as stalls, even if they are shorter by a full
coarse clock tick from the requested threshold
(blocked-reactor-notify-ms).
Notably, if the stall threshold happens to be smaller or equal to coarse
clock resolution, Scylla's log gets spammed with false stall reports.
The resolution of coarse clocks in Linux is 1/CONFIG_HZ. This is
typically equal to 1 ms or 4 ms, and stall thresholds of this order
can occur in practice.
Eliminate false positives by requiring the measured reclaim duration to
be at least 1 clock tick longer than the configured threshold for it to
be considered a stall.
Fixes#10981Closes#11680
Tools want to be as little disrupting to the environment they run in as possible, because they might be run in a production environment, next to a running scylladb production server. As such, the usual behavior of seastar applications w.r.t. memory is an anti-pattern for tools: they don't want to reserve most of the system memory, in fact they don't want to reserve any amount, instead consuming as much as needed on-demand.
To achieve this, tools want to use the standard allocator. To achieve this they need a seastar option to to instruct seastar to *not* configure and use the seastar allocator and they need LSA to cooperate with the standard allocator.
The former is provided by https://github.com/scylladb/seastar/pull/1211.
The latter is solved by introducing the concept of a `segment_store_backend`, which abstracts away how the memory arena for segments is acquired and managed. We then refactor the existing segment store so that the seastar allocator specific parts are moved to an implementation of this backend concept, then we introduce another backend implementation appropriate to the standard allocator.
Finally, tools configure seastar with the newly introduced option to use the standard allocator and similarly configure LSA to use the standard allocator appropriate backend.
Refs: https://github.com/scylladb/scylladb/issues/9882
This is the last major code piece in scylla for making tools production ready.
Closes#11510
* github.com:scylladb/scylladb:
test/boost: add alternative variant of logalloc test
tools: use standard allocator
utils/logalloc: add use_standard_allocator_segment_pool_backend()
utils/logalloc: introduce segment store backend for standard allocator
utils/logalloc: rebase release segment-store on segment-store-backend
utils/logalloc: introduce segment_store_backend
utils/logalloc: push segment alloc/dealloc to segment_store
test/boost/logalloc_test: make test_compaction_with_multiple_regions exception-safe
Scylla's Bloom filter implementation has a minimal false-positive rate
that it can support (6.71e-5). When setting bloom_filter_fp_chance any
lower than that, the compute_bloom_spec() function, which writes the bloom
filter, throws an exception. However, this is too late - it only happens
while flushing the memtable to disk, and a failure at that point causes
Scylla to crash.
Instead, we should refuse the table creation with the unsupported
bloom_filter_fp_chance. This is also what Cassandra did six years ago -
see CASSANDRA-11920.
This patch also includes a regression test, which crashes Scylla before
this patch but passes after the patch (and also passes on Cassandra).
Fixes#11524.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11576
Creating a standard-memory-allocator backend for the segment store.
This is targeted towards tools, which want to configure LSA with a
segment store backend that is appropriate for the standard allocator
(which they want to use).
We want to be able to use this in both release and debug mode. The
former will be used by tools and the latter will be used to run the
logalloc tests with this new backend, making sure it works and doesn't
regress. For this latter, we have to allow the release and debug stores
to coexist in the same build and for the debug store to be able to
delegate to the release store when the standard allocator backend is
used.
Rebase the seastar allocator based segment store implementation on the
recently introduced segment store backend which is now abstracts away
how memory for segments is obtained.
This patch also introduces an explicit `segment_npos` to be used for
cases when a segment -> index mapping fails (segment doesn't belong to
the store). Currently the seastar allocator based store simply doesn't
handle this case, while the standard allocator based store uses 0 as the
implicit invalid index.
We want to make it possible to select the segment-store to be used for
LSA -- the seastar allocator based one or the standard allocator based
on -- at runtime. Currently this choice is made at compile time via
preprocessor switches.
The current standard memory based store is specialized for debug build,
we want something more similar to the seastar standard memory allocator
based one. So we introduce a segment store backend for the current
seastar allocator based store, which abstracts how the backing memory
for all segments is allocated/freed, while keeping the segment <-> index
mapping common. In the next patches we will rebase the current seastar
allocator based segment store on this backend and later introduce
another backend for standard allocator, targeted for release builds.
Currently the actual alloc/dealloc of memory for segments is located
outside the segment stores. We want to abstract away how segments are
allocated, so we move this logic too into the segment store. For now
this results in duplicate code in the two segment store implementations,
but this will soon be gone.
When cross-shard barrier is abort()-ed it spawns a background fiber
that will wake-up other shards (if they are sleeping) with exception.
This fiber is implicitly waited by the owning sharded service .stop,
because barrier usage is like this:
sharded<service> s;
co_await s.invoke_on_all([] {
...
barrier.abort();
});
...
co_await s.stop();
If abort happens, the invoke_on_all() will only resolve _after_ it
queues up the waking lambdas into smp queues, thus the subseqent stop
will queue its stopping lambdas after barrier's ones.
However, in debug mode the queue can be shuffled, so the owning service
can suddenly be freed from under the barrier's feet causing use after
free. Fortunately, this can be easily fixed by capturing the shared
pointer on the shared barrier instead of a regular pointer on the
shard-local barrier.
fixes: #11303
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#11553
The logger is proof against allocation failures, except if
--abort-on-seastar-bad-alloc is specified. If it is, it will crash.
The reclaim stall report is likely to be called in low memory conditions
(reclaim's job is to alleviate these conditions after all), so we're
likely to crash here if we're reclaiming a very low memory condition
and have a large stall simultaneously (AND we're running in a debug
environment).
Prevent all this by disabling --abort-on-seastar-bad-alloc temporarily.
Fixes#11549Closes#11555
Option names given in db/config.cc are handled for the command line by passing
them to boost::program_options, and by YAML by comparing them with YAML
keys.
boost::program_options has logic for understanding the
long_name,short_name syntax, so for a "workdir,W" option both --workdir and -W
worked, as intended. But our YAML config parsing doesn't have this logic
and expected "workdir,W" verbatim, which is obviously not intended. Fix that.
Fixes#7478Fixes#9500Fixes#11503Closes#11506
We want to consolidate all the logalloc state into a single object: the
shard tracker. Replacing this global with a member in said object is
part of this effort.
These are pretend free functions, accessing globals in the background,
make them a member of the tracker instead, which everything needed
locally to compute them. Callers still have to access these stats
through the global tracker instance, but this can be changed to happen
through a local instance. Soon....
Instead, get the tracker instance from the region. This requires adding
a `region&` parameter to `with_reserve()`.
This brings us one step closer to eliminating the global tracker.
Instead of a separate global segment pool instance, make it a member of
the already global tracker. Most users are inside the tracker instance
anyway. Outside users can access the pool through the global tracker
instance.
For now this member is initialized from the global tracker instance. But
it allows the members of region impl to be detached from said global,
making a step towards removing it.
segment has some members, which simply forward the call to a
segment_pool method, via the global segment_pool instance. Remove these
and make the callers use the segment pool directly instead.
This series converts the synchronous `effective_replication_map::get_range_addresses` to async
by calling the replication strategy async entry point with the same name, as its callers are already async
or can be made so easily.
To allow it to yield and work on a coherent view of the token_metadata / topology / replication_map,
let the callers of this patch hold a effective_replication_map per keyspace and pass it down
to the (now asynchronous) functions that use it (making affected storage_service methods static where possible
if they no longer depend on the storage_service instance).
Also, the repeated calls to everywhere_replication_strategy::calculate_natural_endpoints
are optimized in this series by introducing a virtual abstract_replication_strategy::has_static_natural_endpoints predicate
that is true for local_strategy and everywhere_replication_strategy, and is false otherwise.
With it, functions repeatedly calling calculate_natural_endpoints in a loop, for every token, will call it only once since it will return the same result every time anyhow.
Refs #11005
Doesn't fix the issue as the large allocation still remains until we make change dht::token_range_vector chunked (chunked_vector cannot be used as is at the moment since we require the ability to push also to the front when unwrapping)
Closes#11009
* github.com:scylladb/scylladb:
effective_replication_map: make get_range_addresses asynchronous
range_streamer: add_ranges and friends: get erm as param
storage_service: get_new_source_ranges: get erm as param
storage_service: get_changed_ranges_for_leaving: get erm as param
storage_service: get_ranges_for_endpoint: get erm as param
repair: use get_non_local_strategy_keyspaces_erms
database: add get_non_local_strategy_keyspaces_erms
database: add get_non_local_strategy_keyspaces
storage_service: coroutinize update_pending_ranges
effective_replication_map: add get_replication_strategy
effective_replication_map: get_range_addresses: use the precalculated replication_map
abstract_replication_strategy: get_pending_address_ranges: prevent extra vector copies
abstract_replication_strategy: reindent
utils: sequenced_set: expose set and `contains` method
abstract_replication_strategy: calculate_natural_endpoints: return endpoint_set
utils: sequenced_set: templatize VectorType
utils: sanitize sequenced_set
utils: sequenced_set: delete mutable get_vector method
And use that in sights using the endpoint set
returned by abstract_replication_strategy::calculate_natural_endpoints.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And templatize its Vector type so it can be used
with a small_vector for inet_address_vector_replica_set.
Mark the methods const/noexcept as needed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is dangerous to use since modifying the
sequenced_set vector will make it go out of sync
with the associated unordered_set member, making
the object unusable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Define table_id as a distinct utils::tagged_uuid modeled after raft
tagged_id, so it can be differentiated from other uuid-class types,
in particular from table_schema_version.
Fixes#11207
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use the common base class for uuid-based types.
tagged_uuid::to_uuid defined here for backward
compatibility, but it will be renamed in the next patch
to uuid().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
So it can be used for other types in the system outside
of raft, like counter_id, table_id, table_schema_version,
and more.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that scylla requries c++20 there's no
need to define our own implementation in utils/bit_cast.hh
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Range tombstones are kept in memory (cache/memtable) in
range_tombstone_list. It keeps them deoverlapped, so applying a range
tombstone which covers many range tombstones will erase existing range
tombstones from the list. This operation needs to be exception-safe,
so range_tombstone_list maintains an undo log. This undo log will
receive a record for each range tombstone which is removed. For
exception safety reasons, before pushing an undo log entry, we reserve
space in the log by calling std::vector::reserve(size() + 1). This is
O(N) where N is the number of undo log entries. Therefore, the whole
application is O(N^2).
This can cause reactor stalls and availability issues when replicas
apply such deletions.
This patch avoids the problem by reserving exponentially increasing
amount of space. Also, to avoid large allocations, switches the
container to chunked_vector.
Fixes#11211Closes#11215
Don't open-code calling the region_impl
_listeners->moved() in region move-constructor
and move-assignment op.
The other._impl->_region might be different then &other
post region::merge so let the region_impl
decide which region* is moved from.
The new_region is also set to region_impl->_region
so need to open-code that either in the said call sites.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The other _impl is presumed to be engaged already,
so just call other.get_impl() once for both use cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>