logalloc::tracker has some members with the same names as types from
namespace scope. gcc (rightfully) complains that this changes
the meaning of the name. Qualify the types to disambiguate.
Recently, clang started complaining about std::unexpected_handler being
deprecated:
```
In file included from utils/exceptions.cc:18:
./utils/abi/eh_ia64.hh:26:10: warning: 'unexpected_handler' is deprecated [-Wdeprecated-declarations]
std::unexpected_handler unexpectedHandler;
^
/usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/exception:84:18: note: 'unexpected_handler' has been explicitly marked deprecated here
typedef void (*_GLIBCXX11_DEPRECATED unexpected_handler) ();
^
/usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/x86_64-redhat-linux/bits/c++config.h:2343:32: note: expanded from macro '_GLIBCXX11_DEPRECATED'
^
/usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/x86_64-redhat-linux/bits/c++config.h:2334:46: note: expanded from macro '_GLIBCXX_DEPRECATED'
^
1 warning generated.
```
According to cppreference.com, it was deprecated in C++11 and removed in
C++17 (!).
This commit gets rid of the warning by inlining the
std::unexpected_handler typedef, which is defined as a pointer to a
function taking no arguments and returning void.
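For reference, the change amounts to spelling the function-pointer type out inline (a minimal sketch of the pattern; the surrounding code in eh_ia64.hh is abbreviated):
```
// Before: uses the typedef deprecated in C++11 and removed in C++17.
// std::unexpected_handler unexpectedHandler;

// After: the typedef inlined -- a pointer to a function taking no
// arguments and returning void.
void (*unexpectedHandler)();
```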
Fixes: #12022
Closes#12074
The lister accepts a sort of filter -- what kind of entries to list: regular files, directories, or both. It currently uses an unordered_set, but enum_set is shorter and better describes the intent.
Closes#12017
* github.com:scylladb/scylladb:
lister: Make lister::dir_entry_types an enum_set
database: Avoid useless local variable
This type is currently an unordered_set, but holds at most two
elements. Making it an enum_set turns it into a single size_t variable
and better describes the intention.
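The gist of the change, sketched in plain C++ (Scylla's actual enum_set utility and the lister's entry types differ in detail; names here are illustrative):
```
#include <cstddef>

enum class dir_entry_type { regular, directory };

// A set of enum values packed into the bits of a single word, instead
// of an unordered_set allocating hash-table nodes for at most two
// elements.
struct entry_type_set {
    std::size_t mask = 0;
    void set(dir_entry_type t) {
        mask |= std::size_t(1) << std::size_t(t);
    }
    bool contains(dir_entry_type t) const {
        return mask & (std::size_t(1) << std::size_t(t));
    }
};
```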
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fix a typo introduced by the recent patch fixing the spelling of
Barrett. The patch introduced a typo in the aarch64 version of the code,
which wasn't found by promotion, as that only builds on X86_64.
Closes#12006
As P. T. Barnum famously said, "write what you like but spell my name
correctly". Following that, we correct the spelling of Barrett's name
in the source tree.
Closes#11989
We use Barrett tables (misspelled in the code unfortunately) to fold
crc computations of multiple buffers into a single crc. This is important
because it turns out to be faster to compute crc of three different buffers
in parallel rather than compute the crc of one large buffer, since the crc
instruction has latency 3.
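The combining step itself is a standard CRC property; zlib exposes the same operation, shown here purely to illustrate the math that the fold tables implement (Scylla's code uses its own tables, not zlib):
```
#include <zlib.h>

// crc(A ++ B) can be computed from crc(A), crc(B) and len(B) alone, so
// the per-buffer CRCs can be computed in parallel and folded afterwards.
uLong crc_of_concat(uLong crc_a, uLong crc_b, long len_b) {
    return crc32_combine(crc_a, crc_b, len_b);
}
```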
Currently, we have a separate code generation step to compute the
fold tables. The step generates a new C++ source file with the tables.
But modern C++ allows us to do this computation at compile time, avoiding
the code generation step. This simplifies the build.
This series does that. There is some complication in that the code uses
compiler intrinsics for the computation, and these are not constexpr friendly.
So we first introduce constexpr-friendly alternatives and use them.
To prove the transformation is correct, I compared the generated code from
before the series and from just before the last step (where we use constexpr
evaluation but still retain the generated file) and saw no difference in the values.
Note that constexpr is not strictly needed - we could have run the code in the
global variables' initializer. But that would cause a crash if we run on a pre-clmul
machine, and is not as fun.
Closes#11957
* github.com:scylladb/scylladb:
test: crc: add unit tests for constexpr clmul and barrett fold
utils: crc combine table: generate at compile time
utils: barrett: inline functions in header
utils: crc combine table: generate tables at compile time
utils: crc combine table: extract table generation into a constexpr function
utils: crc combine table: extract "pow table" code into constexpr function
utils: crc combine table: store tables in std::array rather than C array
utils: barrett: make the barrett reduction constexpr friendly
utils: clmul: add 64-bit constexpr clmul
utils: barrett: extract barrett reduction constants
utils: barrett: reorder functions
utils: make clmul() constexpr
By now the crc combine tables are generated at compile time,
but still in a separate code generation step. We now eliminate
the code generation step and instead link the global variables
directly into the main executable. The global variables have
been conveniently named exactly as the code generation step
names them, so we don't need to touch any users.
Move the tables into global constinit variables that are
generated at compile time. Note the code that creates
the generated crc32_combine_table.cc is still called; it
transforms compile-time generated tables into a C++ source
file that contains the same values, as literals.
If we generate a diff between gen/utils/gz/crc_combine_table.cc
before this series and after this patch, we see the only change
in the file is the type of the variable (which changed to
std::array), proving our constexpr code is correct.
Move the code to a constexpr function, so we can later generate the tables at
compile time. Note that although the function is constexpr, it is still
evaluated at runtime, since the calling function (main()) isn't constexpr
itself.
A "pow table" is used to generate the Barrett fold tables. Extract its
code into a constexpr function so we can later generate the fold tables
at compile time.
C arrays cannot be returned from functions and therefore aren't suitable
for constexpr processing. std::array<> is a regular value and so is
constexpr friendly.
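A generic illustration of the difference (not the actual table code):
```
#include <array>

// A constexpr function can build and return std::array by value; a C
// array cannot be returned from a function at all.
constexpr std::array<int, 4> make_table() {
    std::array<int, 4> t{};
    for (int i = 0; i < 4; ++i) {
        t[i] = i * i;
    }
    return t;
}

// Forces compile-time evaluation; the values are baked into the binary.
constexpr auto table = make_table();
```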
This is used when generating the Barrett reduction tables, and also when
applying the Barrett reduction at runtime, so we need it to be constexpr
friendly.
clmul() is a pure function and so should already be constexpr,
but it uses intrinsics that aren't defined as constexpr and
so the compiler can't really compute it at compile time.
Fix by defining a constexpr variant and dispatching based
on whether we're being constant-evaluated or not.
The implementation is simple, but in any case proof that it
is correct will be provided later on.
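A sketch of the dispatch pattern, assuming C++20, x86_64, and a translation unit built with -mpclmul (simplified relative to the actual Scylla code):
```
#include <cstdint>
#include <type_traits>
#include <wmmintrin.h> // _mm_clmulepi64_si128

// Portable carry-less multiply: long multiplication where the partial
// products are combined with XOR instead of addition.
constexpr uint64_t clmul_portable(uint32_t a, uint32_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 32; ++i) {
        if (b & (uint32_t(1) << i)) {
            r ^= uint64_t(a) << i;
        }
    }
    return r;
}

constexpr uint64_t clmul(uint32_t a, uint32_t b) {
    if (std::is_constant_evaluated()) {
        return clmul_portable(a, b); // compile-time path
    }
    // Runtime path: a single PCLMULQDQ instruction.
    __m128i x = _mm_set_epi64x(0, a);
    __m128i y = _mm_set_epi64x(0, b);
    return uint64_t(_mm_cvtsi128_si64(_mm_clmulepi64_si128(x, y, 0x00)));
}
```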
Currently when we set a single value we need
to call broadcast_to_all_shards to let observers on all
shards get notified of the new value.
However, the latter broadcasts all values to all shards,
so it's terribly inefficient.
Instead, add async set_value_on_all_shards functions
to broadcast a value to all shards.
Use those in system_keyspace for db_config_table virtual table
and in task_manager_test to update the task_manager ttl.
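A minimal sketch of the pattern, with illustrative names rather than the actual Scylla signatures (the hypothetical `config` type stands in for the real per-shard state):
```
#include <seastar/core/future.hh>
#include <seastar/core/sharded.hh>

// Hypothetical per-shard config holding one observable value.
struct config {
    int ttl = 0;
    void set_ttl(int v) { ttl = v; /* real code notifies observers here */ }
};

// Set a single value on every shard, rather than re-broadcasting the
// whole config, so only the relevant per-shard observers fire.
seastar::future<> set_value_on_all_shards(seastar::sharded<config>& cfg, int new_ttl) {
    return cfg.invoke_on_all([new_ttl] (config& local) {
        local.set_ttl(new_ttl);
    });
}
```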
Refs #7316
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This series is a step towards non-LRU cache algorithms.
Our cache items are able to unlink themselves from the LRU list. (In other words, they can be unlinked solely via a pointer to the item, without access to the containing list head). Some places in the code make use of that, e.g. by relying on auto-unlink of items in their destructor.
However, to implement algorithms smarter than LRU, we might want to update some cache-wide metadata on item removal. But any cache-wide structures are unreachable through an item pointer, since items only have access to themselves and their immediate neighbours. Therefore, we don't want items to unlink themselves — we want `cache.remove(item)`, rather than `item.remove_self()`, because the former can update the metadata in `cache`.
This series inserts explicit item unlink calls in places that were previously relying on destructors, gets rid of other self-unlinks, and adds an assert which ensures that every item is explicitly unlinked before destruction.
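The shape of the change, sketched with Boost.Intrusive directly (illustrative types, not the actual Scylla code):
```
#include <cassert>
#include <cstddef>
#include <boost/intrusive/list.hpp>
namespace bi = boost::intrusive;

struct item : bi::list_base_hook<bi::link_mode<bi::safe_link>> {
    ~item() { assert(!is_linked()); } // must be unlinked explicitly first
};

struct cache {
    bi::list<item> _lru;
    std::size_t _size = 0; // cache-wide metadata, unreachable from an item

    void remove(item& i) {
        _lru.erase(_lru.iterator_to(i)); // unlink explicitly...
        --_size;                         // ...and update the bookkeeping
    }
};
```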
Closes#11716
* github.com:scylladb/scylladb:
utils: lru: assert that evictables are unlinked before destruction
utils: lru: remove unlink_from_lru()
cache: make all cache unlinks explicit
Previous patches introduce the assumption that evictables are manually unlinked
before destruction, to allow for correct bookkeeping within the cache.
This assert ensures that this assumption is correct.
This is particularly important because the switch from automatic to explicit
unlinking had to be done manually. Destructor calls are invisible, so it's
possible that we have missed some automatic destruction site.
unlink_from_lru() allows for unlinking elements from the LRU list without
notifying the cache. This messes up any potential cache bookkeeping.
Improve that by replacing all uses of unlink_from_lru() with calls to
lru::remove(), which does have access to the cache's metadata.
Our LSA cache is implemented as an auto_unlink Boost intrusive list, meaning
that elements of the list unlink themselves from the list automatically on
destruction. Some parts of the code rely on that, and don't unlink them
manually.
However, this precludes accurate bookkeeping about the cache. Elements only have
access to themselves and their neighbours, not to any bookkeeping context.
Therefore, a destructor cannot update the relevant metadata.
In this patch, we fix this by adding explicit unlink calls to places where it
would be done by a destructor. In a following patch, we will add an assert to
the destructor to check that every element is unlinked before destruction.
This series adds support for detecting collections that have too many items
and recording them in `system.large_cells`.
A configuration variable was added to db/config: `compaction_collection_elements_count_warning_threshold`, set by default to 10000.
Collections that have more items than this threshold will be warned about and recorded as a large cell in the `system.large_cells` table. Documentation has been updated accordingly.
A new column was added to system.large_cells: `collection_elements`.
Similar to the `rows` column in system.large_partitions, `collection_elements` holds the number of items in a collection when the large cell is a collection, or 0 if it isn't. Note that a collection may be recorded in system.large_cells either due to its size, like any other cell, and/or due to the number of items in it, if it crosses the said threshold.
Note that #11449 called for a new system.large_collections table, but extending system.large_cells follows the logic of system.large_partitions and is a smaller change overall, hence it was preferred.
Since the system keyspace schema is hard coded, the schema version of system.large_cells was bumped, and since the change is not backward compatible, we added a cluster feature - `LARGE_COLLECTION_DETECTION` - to enable using it.
The large_data_handler large cell detection record function will populate the new column only when the new cluster feature is enabled.
In addition, unit tests were added in sstable_3_x_test for testing large cells detection by cell size, and large_collection detection by the number of items.
Closes#11449
Closes#11674
* github.com:scylladb/scylladb:
sstables: mx/writer: optimize large data stats members order
sstables: mx/writer: keep large data stats entry as members
db: large_data_handler: dynamically update config thresholds
utils/updateable_value: add transforming_value_updater
db/large_data_handler: cql_table_large_data_handler: record large_collections
db/large_data_handler: pass ref to feature_service to cql_table_large_data_handler
db/large_data_handler: cql_table_large_data_handler: move ctor out of line
docs: large-rows-large-cells-tables: fix typos
db/system_keyspace: add collection_elements column to system.large_cells
gms/feature_service: add large_collection_detection cluster feature
test: sstable_3_x_test: add test_sstable_too_many_collection_elements
test: lib: simple_schema: add support for optional collection column
test: lib: simple_schema: build schema in ctor body
test: lib: simple_schema: cql: define s1 as static only if built this way
db/large_data_handler: maybe_record_large_cells: consider collection_elements
db/large_data_handler: debug cql_table_large_data_handler::delete_large_data_entries
sstables: mx/writer: pass collection_elements to writer::maybe_record_large_cells
sstables: mx/writer: add large_data_type::elements_in_collection
db/large_data_handler: get the collection_elements_count_threshold
db/config: add compaction_collection_elements_count_warning_threshold
test: sstable_3_x_test: add test_sstable_write_large_cell
test: sstable_3_x_test: pass cell_threshold_bytes to large_data_handler
test: sstable_3_x_test: large_data_handler: prepare callback for testing large_cells
test: sstable_3_x_test: large_data tests: use BOOST_REQUIRE_[GL]T
test: sstable_3_x_test: test_sstable_log_too_many_rows: use tests::random
Automatically updates a value from a utils::updateable_value,
where the two can be of different types.
An optional transform function can provide an additional transformation
when updating the value, such as multiplying it by a factor for unit
conversion.
To be used for auto-updating the large data thresholds
from the db::config.
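A simplified illustration of the concept (not the actual implementation in utils/updateable_value):
```
#include <cstddef>
#include <functional>

// Keeps a target variable in sync with a source value of a possibly
// different type, applying a transform on every update.
template <typename From, typename To>
class transforming_updater {
    To& _target;
    std::function<To (const From&)> _transform;
public:
    transforming_updater(To& target, std::function<To (const From&)> t)
        : _target(target), _transform(std::move(t)) {}
    void operator()(const From& v) { _target = _transform(v); }
};

// Usage: keep a threshold in bytes in sync with a config value in MB.
// std::size_t threshold_bytes = 0;
// transforming_updater<int, std::size_t> updater(threshold_bytes,
//         [] (const int& mb) { return std::size_t(mb) << 20; });
// updater(10); // threshold_bytes == 10485760
```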
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
reclaim_timer uses a coarse clock, but does not account for
the measurement error introduced by that -- it can falsely
report reclaims as stalls, even if they are shorter than the
requested threshold (blocked-reactor-notify-ms) by up to a full
coarse clock tick.
Notably, if the stall threshold happens to be smaller or equal to coarse
clock resolution, Scylla's log gets spammed with false stall reports.
The resolution of coarse clocks in Linux is 1/CONFIG_HZ. This is
typically equal to 1 ms or 4 ms, and stall thresholds of this order
can occur in practice.
Eliminate false positives by requiring the measured reclaim duration to
be at least 1 clock tick longer than the configured threshold for it to
be considered a stall.
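The fix in essence (a sketch, not the actual reclaim_timer code):
```
#include <chrono>

// A coarse clock can over-report a duration by up to one tick, so only
// durations measured at least one full tick above the threshold are
// guaranteed to really exceed it.
bool is_stall(std::chrono::milliseconds measured,
              std::chrono::milliseconds threshold,
              std::chrono::milliseconds tick) {
    return measured >= threshold + tick;
}
```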
Fixes#10981
Closes#11680
Tools want to disturb the environment they run in as little as possible, because they might be run in a production environment, next to a running scylladb production server. As such, the usual behavior of seastar applications w.r.t. memory is an anti-pattern for tools: they don't want to reserve most of the system memory; in fact they don't want to reserve any amount, instead consuming as much as needed on-demand.
To achieve this, tools want to use the standard allocator: they need a seastar option to instruct seastar to *not* configure and use the seastar allocator, and they need LSA to cooperate with the standard allocator.
The former is provided by https://github.com/scylladb/seastar/pull/1211.
The latter is solved by introducing the concept of a `segment_store_backend`, which abstracts away how the memory arena for segments is acquired and managed. We then refactor the existing segment store so that the seastar allocator specific parts are moved to an implementation of this backend concept, then we introduce another backend implementation appropriate to the standard allocator.
Finally, tools configure seastar with the newly introduced option to use the standard allocator, and similarly configure LSA to use the backend appropriate to the standard allocator.
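The shape of the abstraction (a sketch under assumed names; the actual interface and segment-size handling differ):
```
#include <cstddef>

// How the memory arena for segments is acquired is hidden behind a
// backend; the segment <-> index mapping stays common.
class segment_store_backend {
public:
    virtual ~segment_store_backend() = default;
    virtual void* allocate_segment(std::size_t segment_size) = 0;
    virtual void free_segment(void* seg) noexcept = 0;
};

// Backend for the standard allocator: no pre-reserved arena; memory is
// taken from the global heap on demand, as tools want.
class standard_allocator_backend final : public segment_store_backend {
public:
    void* allocate_segment(std::size_t segment_size) override {
        return ::operator new(segment_size);
    }
    void free_segment(void* seg) noexcept override {
        ::operator delete(seg);
    }
};
```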
Refs: https://github.com/scylladb/scylladb/issues/9882
This is the last major code piece in scylla for making tools production ready.
Closes#11510
* github.com:scylladb/scylladb:
test/boost: add alternative variant of logalloc test
tools: use standard allocator
utils/logalloc: add use_standard_allocator_segment_pool_backend()
utils/logalloc: introduce segment store backend for standard allocator
utils/logalloc: rebase release segment-store on segment-store-backend
utils/logalloc: introduce segment_store_backend
utils/logalloc: push segment alloc/dealloc to segment_store
test/boost/logalloc_test: make test_compaction_with_multiple_regions exception-safe
Scylla's Bloom filter implementation has a minimum false-positive rate
that it can support (6.71e-5). When bloom_filter_fp_chance is set any
lower than that, the compute_bloom_spec() function, which writes the bloom
filter, throws an exception. However, this is too late - it only happens
while flushing the memtable to disk, and a failure at that point causes
Scylla to crash.
Instead, we should refuse the table creation with the unsupported
bloom_filter_fp_chance. This is also what Cassandra did six years ago -
see CASSANDRA-11920.
This patch also includes a regression test, which crashes Scylla before
this patch but passes after the patch (and also passes on Cassandra).
Fixes#11524.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11576
Create a standard-memory-allocator backend for the segment store.
This is targeted towards tools, which want to configure LSA with a
segment store backend that is appropriate for the standard allocator
(which they want to use).
We want to be able to use this in both release and debug mode. The
former will be used by tools and the latter will be used to run the
logalloc tests with this new backend, making sure it works and doesn't
regress. For the latter, we have to allow the release and debug stores
to coexist in the same build and for the debug store to be able to
delegate to the release store when the standard allocator backend is
used.
Rebase the seastar allocator based segment store implementation on the
recently introduced segment store backend, which now abstracts away
how memory for segments is obtained.
This patch also introduces an explicit `segment_npos` to be used for
cases when a segment -> index mapping fails (segment doesn't belong to
the store). Currently the seastar allocator based store simply doesn't
handle this case, while the standard allocator based store uses 0 as the
implicit invalid index.
We want to make it possible to select the segment-store to be used for
LSA -- the seastar allocator based one or the standard allocator based
one -- at runtime. Currently this choice is made at compile time via
preprocessor switches.
The current standard memory based store is specialized for the debug
build; we want something more similar to the seastar allocator based
one. So we introduce a segment store backend for the current
seastar allocator based store, which abstracts how the backing memory
for all segments is allocated/freed, while keeping the segment <-> index
mapping common. In the next patches we will rebase the current seastar
allocator based segment store on this backend and later introduce
another backend for standard allocator, targeted for release builds.
Currently the actual alloc/dealloc of memory for segments is located
outside the segment stores. We want to abstract away how segments are
allocated, so we move this logic too into the segment store. For now
this results in duplicate code in the two segment store implementations,
but this will soon be gone.
When a cross-shard barrier is abort()-ed it spawns a background fiber
that will wake up other shards (if they are sleeping) with an exception.
This fiber is implicitly waited for by the owning sharded service's .stop(),
because barrier usage is like this:
```
sharded<service> s;
co_await s.invoke_on_all([] {
    ...
    barrier.abort();
});
...
co_await s.stop();
```
If abort happens, the invoke_on_all() will only resolve _after_ it
queues up the waking lambdas into smp queues, thus the subsequent stop
will queue its stopping lambdas after the barrier's ones.
However, in debug mode the queue can be shuffled, so the owning service
can suddenly be freed from under the barrier's feet, causing a
use-after-free. Fortunately, this can be easily fixed by capturing a
shared pointer to the shared barrier instead of a regular pointer to the
shard-local barrier.
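The pattern of the fix, illustrated with std types rather than seastar's fibers and shared pointers (the real code uses the barrier's own shared state):
```
#include <memory>
#include <thread>

struct barrier_state { /* wake-up machinery lives here */ };

// The detached fiber captures shared ownership of the state it touches;
// a raw pointer into a shard-local object could dangle if the owning
// service is stopped first.
void spawn_wakeup_fiber(std::shared_ptr<barrier_state> st) {
    std::thread([st] { /* st keeps barrier_state alive until done */ }).detach();
}
```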
Fixes: #11303
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#11553
The logger is proof against allocation failures, except if
--abort-on-seastar-bad-alloc is specified. If it is, it will crash.
The reclaim stall report is likely to be emitted in low memory conditions
(reclaim's job is to alleviate these conditions, after all), so we're
likely to crash here if we're reclaiming in a very low memory condition
and have a large stall simultaneously (AND we're running in a debug
environment).
Prevent all this by disabling --abort-on-seastar-bad-alloc temporarily.
Fixes#11549
Closes#11555
Option names given in db/config.cc are handled for the command line by
passing them to boost::program_options, and for YAML by comparing them
with YAML keys.
boost::program_options has logic for understanding the
long_name,short_name syntax, so for a "workdir,W" option both --workdir and -W
worked, as intended. But our YAML config parsing doesn't have this logic
and expected "workdir,W" verbatim, which is obviously not intended. Fix that.
Fixes#7478
Fixes#9500
Fixes#11503
Closes#11506
We want to consolidate all the logalloc state into a single object: the
shard tracker. Replacing this global with a member in said object is
part of this effort.
These are pretend free functions, accessing globals in the background;
make them members of the tracker instead, which has everything needed
locally to compute them. Callers still have to access these stats
through the global tracker instance, but this can be changed to happen
through a local instance. Soon....
Instead, get the tracker instance from the region. This requires adding
a `region&` parameter to `with_reserve()`.
This brings us one step closer to eliminating the global tracker.
Instead of a separate global segment pool instance, make it a member of
the already global tracker. Most users are inside the tracker instance
anyway. Outside users can access the pool through the global tracker
instance.
For now this member is initialized from the global tracker instance. But
it allows the members of region_impl to be detached from said global,
making a step towards removing it.