Option names given in db/config.cc are handled for the command line by passing
them to boost::program_options, and by YAML by comparing them with YAML
keys.
boost::program_options has logic for understanding the
long_name,short_name syntax, so for a "workdir,W" option both --workdir and -W
worked, as intended. But our YAML config parsing doesn't have this logic
and expected "workdir,W" verbatim, which is obviously not intended. Fix that.
Fixes#7478Fixes#9500Fixes#11503Closes#11506
(cherry picked from commit af7ace3926)
Scylla's Bloom filter implementation has a minimal false-positive rate
that it can support (6.71e-5). When setting bloom_filter_fp_chance any
lower than that, the compute_bloom_spec() function, which writes the bloom
filter, throws an exception. However, this is too late - it only happens
while flushing the memtable to disk, and a failure at that point causes
Scylla to crash.
Instead, we should refuse the table creation with the unsupported
bloom_filter_fp_chance. This is also what Cassandra did six years ago -
see CASSANDRA-11920.
This patch also includes a regression test, which crashes Scylla before
this patch but passes after the patch (and also passes on Cassandra).
Fixes#11524.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11576
(cherry picked from commit 4c93a694b7)
Range tombstones are kept in memory (cache/memtable) in
range_tombstone_list. It keeps them deoverlapped, so applying a range
tombstone which covers many range tombstones will erase existing range
tombstones from the list. This operation needs to be exception-safe,
so range_tombstone_list maintains an undo log. This undo log will
receive a record for each range tombstone which is removed. For
exception safety reasons, before pushing an undo log entry, we reserve
space in the log by calling std::vector::reserve(size() + 1). This is
O(N) where N is the number of undo log entries. Therefore, the whole
application is O(N^2).
This can cause reactor stalls and availability issues when replicas
apply such deletions.
This patch avoids the problem by reserving exponentially increasing
amount of space. Also, to avoid large allocations, switches the
container to chunked_vector.
Fixes#11211Closes#11215
(cherry picked from commit 7f80602b01)
Scylla's coding standard requires that each header is self-sufficient,
i.e., it includes whatever other headers it needs - so it can be included
without having to include any other header before it.
We have a test for this, "ninja dev-headers", but it isn't run very
frequently, and it turns out our code deviated from this requirement
in a few places. This patch fixes those places, and after it
"ninja dev-headers" succeeds again.
This is needed because our CI runs "ninja dev-headers".
Fixes#10995
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11457
This series enforces a minimum size of the unprivileged section when
performing `shrink()` operation.
When the cache is shrunk, we still drop entries first from unprivileged
section (as before this commit), however, if this section is already small
(smaller than `max_size / 2`), we will drop entries from the privileged
section.
This is necessary, as before this change the unprivileged section could
be starved. For example if the cache could store at most 50 entries and
there are 49 entries in privileged section, after adding 5 entries (that would
go to unprivileged section) 4 of them would get evicted and only the 5th one
would stay. This caused problems with BATCH statements where all
prepared statements in the batch have to stay in cache at the same time
for the batch to correctly execute.
To correctly check if the unprivileged section might get too small after
dropping an entry, `_current_size` variable, which tracked the overall size
of cache, is changed to two variables: `_unprivileged_section_size` and
`_privileged_section_size`, tracking section sizes separately.
New tests are added to check this new behavior and bookkeeping of the section
sizes. A test is added, that sets up a CQL environment with a very small
prepared statement cache, reproduces issue in #10440 and stresses the cache.
Fixes#10440.
Closes#10456
* github.com:scylladb/scylla:
loading_cache_test: test prepared stmts cache
loading_cache: force minimum size of unprivileged
loading_cache: extract dropping entries to lambdas
loading_cache: separately track size of sections
loading_cache: fix typo in 'privileged'
(cherry picked from commit 5169ce40ef)
There are two issues with current implementation of remove/remove_if:
1) If it happens concurrently with get_ptr(), the latter may still
populate the cache using value obtained from before remove() was
called. remove() is used to invalidate caches, e.g. the prepared
statements cache, and the expected semantic is that values
calculated from before remove() should not be present in the cache
after invalidation.
2) As long as there is any active pointer to the cached value
(obtained by get_ptr()), the old value from before remove() will be
still accessible and returned by get_ptr(). This can make remove()
have no effect indefinitely if there is persistent use of the cache.
One of the user-perceived effects of this bug is that some prepared
statements may not get invalidated after a schema change and still use
the old schema (until next invalidation). If the schema change was
modifying UDT, this can cause statement execution failures. CQL
coordinator will try to interpret bound values using old set of
fields. If the driver uses the new schema, the coordinaotr will fail
to process the value with the following exception:
User Defined Type value contained too many fields (expected 5, got 6)
The patch fixes the problem by making remove()/remove_if() erase old
entries from _loading_values immediately.
The predicate-based remove_if() variant has to also invalidate values
which are concurrently loading to be safe. The predicate cannot be
avaluated on values which are not ready. This may invalidate some
values unnecessarily, but I think it's fine.
Fixes#10117
Message-Id: <20220309135902.261734-1-tgrabiec@scylladb.com>
(cherry picked from commit 8fa704972f)
timestamped_val (and two other type aliases) are nested inside loading_cache,
but indented as if they were top-level names. Adjust the indent to
avoid confusion.
Closes#10118
(cherry picked from commit d1a394fd97)
Ref #10117 - backport prerequisite
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.
Found during code review, no user impact.
Fixes#10364.
Message-Id: <20220411224741.644113-1-tgrabiec@scylladb.com>
(cherry picked from commit 0c365818c3)
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.
Found during code review, no known user impact.
Fixes#10363.
Message-Id: <20220411222605.641614-1-tgrabiec@scylladb.com>
(cherry picked from commit 01eeb33c6e)
If reserve() allocates more than one chunk, push_back() should not
work with the last chunk. This can result in items being pushed to the
wrong chunk, breaking internal invariants.
Also, pop_back() should not work with the last chunk. This breaks when
there is more than one chunk.
Currently, the container is only used in the sstable partition index
cache.
Manifests by crashes in sstable reader which touch sstables which have
partition index pages with more than 1638 partition entries.
Introduced in 78e5b9fd85 (4.6.0)
Fixes#10290
Message-Id: <20220407174023.527059-1-tgrabiec@scylladb.com>
(cherry picked from commit 41fe01ecff)
cached_page::on_evicted() is invoked in the LSA allocator context, set in the
reclaimer callback installed by the cache_tracker. However,
cached_pages are allocated in the standard allocator context (note:
page content is allocated inside LSA via lsa_buffer). The LSA region
will happily deallocate these, thinking that they these are large
objects which were delegated to the standard allocator. But the
_non_lsa_memory_in_use metric will underflow. When it underflows
enough, shard_segment_pool.total_memory() will become 0 and memory
reclamation will stop doing anything, leading to apparent OOM.
The fix is to switch to the standard allocator context inside
cached_page::on_evicted(). evict_range() was also given the same
treatment as a precaution, it currently is only invoked in the
standard allocator context.
The series also adds two safety checks to LSA to catch such problems earlier.
Fixes#10056
\cc @slivne @bhalevy
Closes#10130
* github.com:scylladb/scylla:
lsa: Abort when trying to free a standard allocator object not allocated through the region
lsa: Abort when _non_lsa_memory_in_use goes negative
tests: utils: cached_file: Validate occupancy after eviction
test: sstable_partition_index_cache_test: Fix alloc-dealloc mismatch
utils: cached_file: Fix alloc-dealloc mismatch during eviction
(cherry picked from commit ff2cd72766)
If memory reclamation is triggered inside _cache.emplace(), the _cache
btree can get corrupted. Reclaimers erase from it, and emplace()
assumes that the tree is not modified during its execution. It first
locates the target node and then does memory allocation.
Fix by running emplace() under allocating section, which disables
memory reclamation.
The bug manifests with assert failures, e.g:
./utils/bptree.hh:1699: void bplus::node<unsigned long, cached_file::cached_page, cached_file::page_idx_less_comparator, 12, bplus::key_search::linear, bplus::with_debug::no>::refill(Less) [Key = unsigned long, T = cached_file::cached_page, Less = cached_file::page_idx_less_comparator, NodeSize = 12, Search = bplus::key_search::linear, Debug = bplus::with_debug::no]: Assertion `p._kids[i].n == this' failed.
Fixes#9915
Message-Id: <20220130175639.15258-1-tgrabiec@scylladb.com>
Fixes#9922
storage proxy uses is_timeout_exception to traverse different code paths.
a6202ae079 broke this (because bit rot and
intermixing), by wrapping exception for information purposes.
This adds check of nested types in exception handling, as well as a test
for the routine itself.
Closes#9932
* github.com:scylladb/scylla:
database/storage_proxy: Use "is_timeout_exception" instead of catch match
utils::is_timeout_exception: Ensure we handle nested exception types
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
Fixes#9922
storage proxy uses is_timeout_exception to traverse different code paths.
a6202ae079 broke this (because bit rot and
intermixing), by wrapping exception for information purposes.
This adds check of nested types in exception handling, as well as a test
for the routine itself.
This series greatly reduces gossipers' dependence on `seastar::async` (yet, not completely).
`i_endpoint_state_change_subscriber` callbacks are converted to return futures (again, to get rid of `seastar::async` dependency), all users are adjusted appropriately (e.g. `storage_service`, `cdc::generation_service`, `streaming::stream_manager`, `view_update_backlog_broker` and `migration_manager`).
This includes futurizing and coroutinizing the whole function call chain up to the `i_endpoint_state_change_subscriber` callback functions.
To aid the conversion process, a non-`seastar::async` dependent variant of `utils::atomic_vector::for_each` is introduced (`for_each_futurized`). A different name is used to clearly distinguish converted and non-converted code, so that the last step (remove `seastar::async()` wrappers around callback-calling code in gossiper) is easier. This is left for a follow-up series, though.
Tests: unit(dev)
Closes#9844
* github.com:scylladb/scylla:
service: storage_service: coroutinize `set_gossip_tokens`
service: storage_service: coroutinize `leave_ring`
service: storage_service: coroutinize `handle_state_left`
service: storage_service: coroutinize `handle_state_leaving`
service: storage_service: coroutinize `handle_state_removing`
service: storage_service: coroutinize `do_drain`
service: storage_service: coroutinize `shutdown_protocol_servers`
service: storage_service: coroutinize `excise`
service: storage_service: coroutinize `remove_endpoint`
service: storage_service: coroutinize `handle_state_replacing`
service: storage_service: coroutinize `handle_state_normal`
service: storage_service: coroutinize `update_peer_info`
service: storage_service: coroutinize `do_update_system_peers_table`
service: storage_service: coroutinize `update_table`
service: storage_service: coroutinize `handle_state_bootstrap`
service: storage_service: futurize `notify_*` functions
service: storage_service: coroutinize `handle_state_replacing_update_pending_ranges`
repair: row_level_repair_gossip_helper: coroutinize `remove_row_level_repair`
locator: reconnectable_snitch_helper: coroutinize `reconnect`
gms: i_endpoint_state_change_subscriber: make callbacks to return futures
utils: atomic_vector: introduce future-returning `for_each` function
utils: atomic_vector: rename `for_each` to `thread_for_each`
gms: gossiper: coroutinize `start_gossiping`
gms: gossiper: coroutinize `force_remove_endpoint`
gms: gossiper: coroutinize `do_status_check`
gms: gossiper: coroutinize `remove_endpoint`
Refs: #9555
When running the "Kraken" dynamodb streams test to provoke the issued observed by QA, I noticed on my setup mainly two things: Large allocation stalls (+ warnings) and timeouts on read semaphores in DB.
This tries to address the first issue, partly by making query_result_view serialization using chunked vector instead of linear one, and by introducing a streaming option for json return objects, avoiding linearizing to string before wire.
Note that the latter has some overhead issues of its own, mainly data copying, since we essentially will be triple buffering (local, wrapped http stream, and final output stream). Still, normal string output will typically do a lot of realloc which is potential extra copies as well, so...
This is not really performance tested, but with these tweaks I no longer get large alloc stalls at least, so that is a plus. :-)
Closes#9713
* github.com:scylladb/scylla:
alternator::executor: Use streamed result for scan etc if large result
alternator::streams: Use streamed result in get_records if large result
executor/server: Add routine to make stream object return
rjson: Add print to stream of rjson::value
query_idl: Make qr_partition::rows/query_result::partitions chunked
Allows direct stream of object to seastar::stream. While not 100%
efficient, it has the advantage of avoiding large allocations
(long string) for huge result messages.
seastar::later() was recently deprecated and replaced with two
alternatives: a cheap seastar::yield() and an expensive (but more
powerful) seastar::check_for_io_immediately(), that corresponds to
the original later().
This patch replaces all later() calls with the weaker yield(). In
all cases except one, it's unambiguously correct. In one case
(test/perf scheduling_latency_measurer::stop()) it's not so ambiguous,
since check_for_io_immediately() will additionally force a poll and
so will cause more work to be done (but no additional tasks to be
executed). However, I think that any measurement that relies on
the measuring the work on the last tick to be inaccurate (you need
thousands of ticks to get any amount of confidence in the
measurement) that in the end it doesn't matter what we pick.
Tests: unit (dev)
Closes#9904
To emphasize that the function requires `seastar::thread`
context to function properly.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The header file <gm/inet_address.hh> is included, directly or
indirectly, from 291 source files in Scylla. It is hard to reduce this
number because Scylla relies heavily on IP addresses as keys to
different things. So it is important that this header file be fast to
include. Unfortunately it wasn't... ClangBuildAnalyzer measurements
showed that each inclusion of this header file added a whopping 2 seconds
(in dev build mode) to the build. A total of 600 CPU seconds - 10 CPU
minutes - were spent just on this header file. It was actually worse
because the build also spent additional time on template instantiation
(more on this below).
So in this patch we:
1. Remove some unnecessary stuff from gms/inet_address.hh, and avoid
including it in one place that doesn't need it. This is just
cosmetic, and doesn't significantly speed up the build.
2. Move the to_sstring() implementation for the .hh to .cc. This saves
a lot of time on template instantiations - previously every source
file instantiated this to_sstring(), which was slow (that "format"
thing is slow).
3. Do not include <seastar/net/ip.hh> which is a huge file including
half the world. All we need from it is the type "ipv4_address",
so instead include just the new <seastar/net/ipv4_address.hh>.
This change brings most of the performance improvement.
So source files forgot to include various Seastar header files
because the includes-everything ip.hh did it - so we need to add
these missing includes in this patch.
After this patch, ClangBuildAnalyzer's reports that the cost of
inclusion of <gms/inet_address.hh> is down from 2 seconds to 0.326
seconds. Additionally the format<inet_address> template instantiation
291 times - about half a second each - is also gone.
All in all, this patch should reduce around 10 CPU minutes from the build.
Refs #1
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
alloc_buf() calls new_buf_active() when there is no active segment to
allocate a new active segment. new_buf_active() allocates memory
(e.g. a new segment) so may cause memory reclamation, which may cause
segment compaction, which may call alloc_buf() and re-enter
new_buf_active(). The first call to new_buf_active() would then
override _buf_active and cause the segment allocated during segment
compaction to be leaked.
This then causes abort when objects from the leaked segment are freed
because the segment is expected to be present in _closed_segments, but
isn't. boost::intrusive::list::erase() will fail on assertion that the
object being erased is linked.
Introduced in b5ca0eb2a2.
Fixes#9821Fixes#9192Fixes#9825Fixes#9544Fixes#9508
Refs #9573
Message-Id: <20211229201443.119812-1-tgrabiec@scylladb.com>
This change makes row cache support reverse reads natively so that reversing wrappers are not needed when reading from cache and thus the read can be executed efficiently, with similar cost as the forward-order read.
The database is serving reverse reads from cache by default after this. Before, it was bypassing cache by default after 703aed3277.
Refs: #1413
Tests:
- unit [dev]
- manual query with build/dev/scylla and cache tracing on
Closes#9454
* github.com:scylladb/scylla:
tests: row_cache: Extend test_concurrent_reads_and_eviction to run reverse queries
row_cache: partition_snapshot_row_cursor: Print more details about the current version vector
row_cache: Improve trace-level logging
config: Use cache for reversed reads by default
config: Adjust reversed_reads_auto_bypass_cache description
row_cache: Support reverse reads natively
mvcc: partition_snapshot: Support slicing range tombstones in reverse
test: flat_mutation_reader_assertions: Consume expected range tombstones before end_of_partition
row_cache: Log produced range tombstones
test: Make produces_range_tombstone() report ck_ranges
tests: lib: random_mutation_generator: Extract make_random_range_tombstone()
partition_snapshot_row_cursor: Support reverse iteration
utils: immutable-collection: Make movable
intrusive_btree: Make default-initialized iterator cast to false
Currently this parse function reads only 100KB worth
of members in eac hiteration.
Since the default max_chunk_capacity is 128KB,
100KB underutilize the chunk capacity, and it could
be safely increased to the max to reduce the number of
allocations and corresponding calls to read_exactly
for large arrays.
Expose utils::chunked_vector::max_chunk_capacity
so that the caler wouldn't have to guess this number
and use it in parse().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211222103126.1819289-2-bhalevy@scylladb.com>
Seastar moved the read_entire_stream(), read_entire_stream_contiguous()
and skip_entire_stream() from the "httpd" namespace to the "util"
namespace. Using them with their old names causes deprecation warnings
when compiling alternator/server.cc.
This patch fixes the namespace (and adds the new include) to get rid of
the deprecation warnings.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211209132759.1319420-1-nyh@scylladb.com>
The stall report uses the millisecond unit, but actually reports
nanoseconds.
Switch to microseconds (milliseconds are a bit too coarse) and
use the safer "duration / 1us" style rather than "duration::count()"
that leads to unit confusion.
Fixes#9733.
Closes#9734
Provide a template parameter to provide a static callbacks object to
increment a counter of evictions from the unprivileged section.
If entries are evicted from the cache while still in the unprivileged
section indicates a not efficient usage of the cache and should be
investigated.
This patch instruments authorized_prepared_statements_cache and a
prepared_statements_cache objects to provide non-empty callbacks.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This patch implements a simple variation of LFRU eviction policy:
* We define 2 dynamic cache sections which total size should not exceed the maximum cache size.
* New cache entry is always added to the "unprivileged" section.
* After a cache entry is read more than SectionHitThreshold times it moves to the second cache section.
* Both sections' entries obey expiration and reload rules in the same way as before this patch.
* When cache entries need to be evicted due to a size restriction "unprivileged" section's
least recently used entries are evicted first.
Note:
With a 2 sections cache it's not enough for a new entry to have the latest timestamp
in order not be evicted right after insertion: e.g. if all all other entries
are from the privileged section.
And obviously we want to allow new cache entries to be added to a cache.
Therefore we can no longer first add a new entry and then shrink the cache.
Switching the order of these two operations resolves the culprit.
Fixes#8674
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
* Store a reference to a parent (loading_cache) object instead of holding
references to separate fields.
* Access loading_cache fields via accessors.
* Move the LRU "touch" logic to the loading_cache.
* Keep only a plain "list entry" logic in the lru_entry class.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Hide internal classes inside the loading_cache class:
* Simpler calls - no need for a tricky back-referencing to access loading_cache fields.
* Cleaner interface.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Use std::pmr::polymorphic_allocator instead of
std::allocator - the former allows not to define the
allocated object during the template specification.
As a result we won't have to have lru_entry defined
before loading_cache, which in line would allow us
to rearrange classes making all classes internal to
loading_cache and hence simplifying the interface.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
fmt 8 checks format strings at compile time, and requires that
non-compile-time format strings be wrapped with fmt::runtime().
Do that, and to allow coexistence with fmt 7, supply our own
do-nothing version of fmt::runtime() if needed. Strictly speaking
we shouldn't be introducing names into the fmt namespace, but this
is transitional only.
Closes#9640
We've been observing hard to explain crashes recently around
lsa_buffer destruction, where the containing segment is absent in
_segment_descs which causes log_heap::adjust_up to abort. Add more
checks to catch certain impossible senarios which can lead to this
sooner.
Refs #9192.
Message-Id: <20211116122346.814437-1-tgrabiec@scylladb.com>
We cannot recover from a failure in this method. The implementation
makes sure it never happens. Invariants will be broken if this
throws. Detect violations early by marking as noexcept.
We could make it exception safe and try to leave the data structures
in a consistent state but the reclaimer cannot make progress if this throws, so
it's pointless.
Refs #9192
Message-Id: <20211116122019.813418-1-tgrabiec@scylladb.com>
A config option value is reported as 'text' type and contains
a string as it would looks like in json config.
The table is UPDATE-able. Only the 'value' columnt can be set
and the value accepted must be string. It will be converted into
the option type automatically, however in current implementation
is't not 100% precise -- conversion is lexicographical cast which
only works for simple types. However, liveupdate-able values are
only of those types, so it works in supported cases.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There soon will appear an updateable system.config table that
will push sstrings into names_value-s. Prepare for this change
by adding the respective .set_value() call. Since the update
only works for LiveUpdate-able options, and inability to do it
can be propagated back to the caller make this method return
true/false whether the update took place or not.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When a named_value is .set_value()-d the caller may specify the reason
for this change. If not specified it's set to None, but None means
"it was there by default and was't changed" so it's a bit of a lie.
Add an explicit Internal reason. It's actually used by the directories
thing that update all directories according to --workdir option.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Those named_values that support .empty() check can be "selected"
like this
auto& v = option_a() || option_b() || option_c();
This code will put into v a reference to the first non-empty
named_value out of a/b/c.
This "selection" is actually used on start when scylla decides
which config options to use as listen/broadcact/rpc/etc. addresses.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>