With implicit conversion of tagged integers to untagged ones going away,
explicitly tag (or untag, as necessary) the operands of the following
operations, in "test/raft/randomized_nemesis_test.cc":
- addition of tagged and untagged (both should be tagged)
- taking the minimum of an index difference and a container size (both
should be untagged)
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
With implicit conversion of tagged integers to untagged ones going away,
unpack and clean up the relatively complex
first_to_remain = max(snap.idx + 1 - preserve_log_entries, 0)
calculation in persistence::store_snapshot().
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
With implicit conversion of tagged integers to untagged ones going away,
explicitly untag the operands / arguments of the following operations, in
"test/raft/replication.hh":
- assignment to raft_cluster::_seen
- call to hasher_int::hash_range()
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
raft_cluster::get_states() passes a "start_idx" to create_log(), and
create_log() uses it as an "index_t" object. Match the type of "start_idx"
to its name.
This patch is best viewed with "git show -W".
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
In test_case::get_first_val(), the asssignment
first_val = initial_snapshots[initial_leader].snap.idx;
*both* relies on implicit conversion of the tagged integer type "index_t"
to the underlying "uint64_t", *and* is a logic bug, as reported at
<https://github.com/scylladb/scylladb/issues/20151>.
For now, wean the buggy asssignment off the disappearing
tagged-to-untaggged conversion.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Properly annotate index_t and term_t constants for use in
BOOST_CHECK_EQUAL() and BOOST_CHECK().
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Properly annotate index_t and term_t constants for use in
BOOST_CHECK_EQUAL(), BOOST_CHECK(). Clean up the first args of
read_quorum() calls -- stay in term_t space.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
The "from" and "to" parameters of compare_log_entries() are raft log
indices; change them to raft::index_t, and update the callers.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
In raft_sys_table_storage::store_snapshot_descriptor(), the condition
preserve_log_entries > snap.idx
*both* relies on implicit conversion of the tagged integer type "index_t"
to the underlying "uint64_t", *and* is a logic bug, as reported at
<https://github.com/scylladb/scylladb/issues/20080>.
Ticket#20080 explains that this condition always evaluates to false in
practice, and that the "else" branch handles all cases correctly anyway.
For now, wean the buggy expression off the disappearing
tagged-to-untaggged conversion.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
With conversion of tagged integers to untagged ones going away, replace
static_cast<uint64_t>(snap.idx)
with
snap.idx.value()
Furthermore, casting "preserve_log_entries" (of type "size_t") to
"uint64_t" is redundant (both "snap.idx" and "preserve_log_entries" carry
nonnegative values, and the mathematical difference is expected to be
nonnegative); remove the cast.
Finally, simplify the initialization syntax.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
With implicit conversion of tagged integers to untagged ones going away,
explicitly untag index_t and term_t values in the following two contexts:
- when they are passed to CQL queries as int64_t,
- when they are default-constructed as fallbacks for int64_t fields
missing from CQL result sets.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
With implicit conversion of tagged integers to untagged ones going away,
explicitly tag (or untag, as necessary) the operands of the following
operations, in "raft/server.cc":
- addition of tagged and untagged (both should be tagged)
- subscripting an array by tagged (should be untagged)
- comparing a size-like threshold against tagged (should be untagged)
- exposing tagged via gauges (should be untagged)
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
With implicit conversion of tagged integers to untagged ones going away,
explicitly tag (or untag, as necessary) the operands of the following
operations, in "raft/fsm.cc":
- addition of tagged and untagged (both should be tagged)
- comparison (relop) between tagged an untagged (both should be tagged)
- subscripting or sizing an array by tagged (should be untagged)
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
With implicit conversion of tagged integers to untagged ones going away,
explicitly tag (or untag, as necessary) the operands of the following
operations, in raft/log.{cc,h}:
- addition of tagged and untagged (both should be tagged)
- comparison (relop) between tagged an untagged (both should be tagged)
- subscripting an array, or offsetting an iterator, by tagged (should be
untagged)
- comparing an array bound against tagged (should be untagged)
- subtracting tagged from an array bound (should be untagged)
Note: these files mix uniform initialization syntax (index_t{...}) with
constructor call syntax (index_t()), with the former being more frequent.
Stick with the former here too, for consistency.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Internally, increment_and_get_generation() produces a
"gms::generation_type" value.
In turn, all callers of increment_and_get_generation() -- namely
scylla_main() [main.cc] and single_node_cql_env::run_in_thread()
[test/lib/cql_test_env.cc] -- pass the resolved value to
storage_service::init_address_map() and storage_service::join_cluster(),
both of which take a "gms::generation_type".
Therefore it is pointless to "untag" the generation value temporarily
between the producer and the consumers. Correct the return type of
increment_and_get_generation().
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
The callers of gossiper::compare_endpoint_startup() need not (should not)
learn of any particular (tagged or untagged) difference of generations;
they only care about the ordering of generations. Change the return type
of compare_endpoint_startup() to "std::strong_ordering", and delegate the
comparison to tagged_tagged_integer::operator<=>.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
In do_sort(), we need to drop to "int32_t" temporarily, so that we can
call ::abs() on the version difference. Do that explicitly.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
All lambdas passed to test_using_reusable_sst() conform to the prototype
void (test_env&, sstable_ptr)
All lambdas passed to test_using_reusable_sst_returning() conform to the
prototype
NON_VOID (test_env&, sstable_ptr)
The common parameter list of both prototypes can be expressed with the
concept
std::invocable<test_env&, sstable_ptr>
Once a "Func" template parameter (i.e., function type) satisfying this
concept is taken, then "Func"'s void or non-void return type can be
commonly expressed with
std::invoke_result_t<Func, test_env&, sstable_ptr>
In turn, test_env::do_with_async_returning<...> can be instantiated with
this return type, even if it happens to be "void".
([stmt.return] specifies, "[a] return statement with an operand of type
void shall be used only in a function that has a cv void return type",
meaning that
return func(env)
will do the right thing in the body of
test_env::do_with_async_returning<void>().)
Merge test_using_reusable_sst() and test_using_reusable_sst_returning()
into one. Preserve the function name from the former, and the
test_env::do_with_async_returning<...>() call from the latter.
Suggested-by: Avi Kivity <avi@scylladb.com>
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Closesscylladb/scylladb#20090
Unit testing for the SSTable validation API happens in
`sstable_validate_test`. Currently, this test checks the API against
some invalid SSTables with out-of-order clustering rows and out-of-order
partitions. However, both are types of content-level corruption that do
not trigger `malformed_sstable_exception` errors.
Extend the test to cover cases of file-level corruption as well, i.e.,
cases that would raise a `malformed_sstable_exception`. Construct an
SSTable with an invalid checksum to trigger this.
This is part of the effort to improve scrub to handle all kinds of
corruption.
Fixesscylladb/scylladb#19057
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#20096
`maybe_rehash` is complimentary and is not strictly require to succeed. If it fails, it will retry on the next call, but there's no reason to throw an exception that will fail its caller, since `maybe_rehash` is called as the final step after the caller has already succeeded with its action.
Minor enhancement for the error path, no backport required.
Closesscylladb/scylladb#19910
* github.com:scylladb/scylladb:
cell_locker: maybe_rehash: reindent
cell_locker: maybe_rehash: ignore allocation failures
Currently, each change to tablet metadata triggers a full metadata reload from disk. This is very wasteful, especially if the metadata change affects only a single row in the `system.tablets` table. This is the case when the tablet load balancer triggers a migration, this will affect a single row in the table, but today will trigger a full reload.
We expect tablet count to potentially grow to thousands and beyond and the overhead of this full reload can become significant.
This PR makes tablet metadata reload partial, instead of reloading all metadata on topology or schema changes, reload only the partitions that are affected by the change. Copy the rest from the in-memory state.
This is done with two passes: first the change mutations are scanned and a hint is produced. This hint is then passed down to the reload code, which will use it to only reload parts (rows/partitions) of the metadata that has actually changed.
The performance difference between full reload and partial reload is quite drastic:
```
INFO 2024-07-25 05:06:27,347 [shard 0:stat] testlog - Tablet metadata reload:
full 616.39ms
partial 0.18ms
```
This was measured with the modified (by this PR) `perf_tablets`, which creates 100 tables, each with 2K tablets. The test was modified to change a single tablet, then do a full and partial reload respectively, measuring the time it takes for reach.
Fixes: #15294
New feature, no backport needed.
Closesscylladb/scylladb#15541
* github.com:scylladb/scylladb:
test/perf/perf_tablets: add tablet metadata reload perf measurement
test/boost/tablets_test: add test for partial tablet metadata updates
db/schema_tables: pass tablet hint to update_tablet_metadata()
service/storage_service: load_tablet_metadata(): add hint parameter
service/migration_listener: update_tablet_metadata(): add hint parameter
service/raft/group0_state_machine: provide tablet change hint on topology change
service/storage_service: topology_state_load(): allow providing change hint
replica/tablets: add update_tablet_metadata()
replica/tablets: fix indentation
replica/tablets: extract tablet_metadata builder logic
replica/tablets: add get_tablet_metadata_change_hint() and update_tablet_metadata_change_hint()
locator/tablets: add tablet_map::clear_tablet_transition_info()
locator/tablets: make tablet_metadata cheap to copy
mutation/canonical_mutation: add key()
Measure reload perf of full reload vs. partial reload, after changing a
single tablet.
While at it, modify the `--tablets-per-table` parameter, so that it has
a default parameter which works OOTB. The previous default was both too
large (causing oversized commitlog entry errors) and not a power of two.
Replace the has_tablet_mutations in `merge_tables_and_views()` with a
hint parameter, which is calculated in the caller, from the original
schema change mutations. This hint is then forwarded to the notifier's
`update_tablet_metadata()` so that subscribers can refresh only the
tablet partitions that changed.
The hint contains information related to what exactly changed, allowing
listeners to do partial updates, instead of reloading all metadata on
each notification.
Extract a hint of what a tablet mutation changed. The hint can be later
used to selectively reload only the changed parts from disk.
Two variants are added:
* get_tablet_metadata_change_hint() - extracts a hint from a list of
tablet mutations
* update_tablet_metadata_change_hint() - updates an existing hint based
on a single mutation, allowing for incremental hint extraction
Keep lw_shared_ptr<tablet_map> in the tablet map and use COW semantics.
To prevent accidental changes to shared tablet_map instances, all
modifications to a tablet_map have to go through a new
`mutate_tablet_map()` method, which implements the copy-modify-swap
idiom.
Fixes#19960
Write path for sstables/commitlog need to handle the fact that IO extensions can
generate errors, some of which should be considered retry-able, and some that should,
similar to system IO errors, cause the node to go into isolate mode.
One option would of course be for extensions to simply generate std::system_errors,
with system_category and appropriate codes. But this is probably a bad idea, since
it makes it more muddy at which level an error happened, as well as limits the
expressibility of the error.
This adds three distinct types (sharing base) distinguishing permission, availabilty
and configuration errors. These are treated akin to EACCESS, ENOENT and EINVAL in
disk error handler and memtable write loop.
Tests updated to use and verify behaviour.
Closesscylladb/scylladb#19961
After removal of rwlock (53a6ec05ed), the race was introduced because the order that
compaction groups of a tablet are closed, is no longer deterministic.
Some background first:
Split compaction runs in main (unsplit) group, and adds sstable to left and right groups
on completion.
The race works as follow:
1) split compaction starts on main group of tablet X
2) tablet X reaches cleanup stage, so its compaction groups are closed in parallel
3) left or right group are closed before main (more likely when only main has flush work to do)
4) split compaction completes, and adds sstable to left and right
5) if e.g left is closed, adjusting backlog tracker will trigger an exception, and since that
happens in row cache update's execute(), node crashes.
The problem manifested as follow:
[shard 0: gms] raft_topology - Initiating tablet cleanup of 5739b9b0-49d4-11ef-828f-770894013415:15 on 102a904a-0b15-4661-ba3f-f9085a5ad03c:0
...
[shard 0:strm] compaction - [Split keyspace1.standard1 009e2f80-49e5-11ef-85e3-7161200fb137] Splitting [/var/lib/scylla/data/keyspace1/...]
...
[shard 0:strm] cache - Fatal error during cache update: std::out_of_range (Compaction state for table [0x600007772740] not found),
at: ...
--------
seastar::continuation<seastar::internal::promise_base_with_type<void>, row_cache::do_update(...
--------
seastar::internal::do_with_state<std::tuple<row_cache::external_updater, std::function<seastar::future<void> ()> >, seastar::future<void> >
--------
seastar::internal::coroutine_traits_base<void>::promise_type
--------
seastar::internal::coroutine_traits_base<void>::promise_type
--------
seastar::(anonymous namespace)::thread_wake_task
--------
seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::async<sstables::compaction::run(...
seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::future<sstables::compaction_resu...
From the log above, it can be seen cache update failure happens under streaming sched group and
during compaction completion, which was good evidence to the cause.
Problem was reproduced locally with the help of tablet shuffling.
Fixes: #19873.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#19987
If parent_info argument of compaction_manager::perform_compaction
is std::nullopt, then created compaction executor isn't tracked by task
manager. Currently, all compaction operations should by visible in task
manager.
Modify split methods to keep split executor in task manager. Get rid of
the option to bypass task manager.
Closesscylladb/scylladb#19995
* github.com:scylladb/scylladb:
compaction: replace optional<task_info> with task_info param
compaction: keep split executor in task manager
There are some bugs missed in task handler:
- wait_for_task does not wait until virtual tasks are done, but returns the status immediately;
- wait_for_task suffers from use after return;
- get_status_recursively does not set the kind of task essentials.
Fix the aforementioned.
Closesscylladb/scylladb#19930
* github.com:scylladb/scylladb:
test: add test to check that task handler is fixed
tasks: fix task handler
This patch adds `suppress_features` error injection. It allows to revoke
support for some features and it can be used to simulate upgrade process
in test.py.
Features to suppress are passed as injection's value, separated by `;`.
Example: `PARALLELIZED_AGGREGATION;UDA_NATIVE_PARALLELIZED_AGGREGATION`
Fixesscylladb/scylladb#20034Closesscylladb/scylladb#20055
before this change, we use the default options for
performing read on the input. and the default options
is like
```c++
struct file_input_stream_options {
size_t buffer_size = 8192; ///< I/O buffer size
unsigned read_ahead = 0; ///< Maximum number of extra read-ahead operations
};
```
which is not able to offer good throughput when
reading from disk, when we stream to S3.
so, in this change, we use options which allows better throughput.
Refs 061def001d
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20074
Before these changes, we didn't specify which I/O scheduling
group commitlog instances in hinted handoff should use.
In this commit, we set it explicitly to the commitlog
scheduling group. The rationale for this choice is the fact
we don't want to cause a bottleneck on the write path
-- if hints are written too slowly, new incoming mutations
(NOT hints) might be rejected due to a too high number
of hints currently being written to disk; see
`storage_proxy::create_write_response_handler_helper()`
for more context.
Fixesscylladb/scylladb#18654Closesscylladb/scylladb#19170
This patch makes all cql connections update theirs service level parameters automatically when:
- any service level is created or changed
- one role is granted to another
- any service level is attached to/detached from a role
First of all, the patch defines what a service level and an effective service level are 938aa10509. No new type of service levels are introduced, the commit only clarifies definitions and names what an effective service level is.
(Effective service level is created by merging all service levels which are attached to all roles granted to the user. It represents exact values of connection's parameters.)
Previously, to find an effective service level of a user, it required O(n) internal queries: O(n) queries to recursively find all granted roles (`standard_role_manager::query_granted()`) and a query for each role to get its service level (`standard_role_manager::get_attribute()`, which sums to O(n) queries).
Because we want to reload SL parameters for all opened cql connections, we don't want to do O(n) queries for every connection, every time we create or change any service level/grant one role to another/attach or detach a service level to/from a role.
To speed it up, the patch adds another layer of service level controller cache, which stored `role_name -> effective_service_level` mapping. This way finding a effective service level for a role is only a lookup to a map.
Building the new cache requires only 2 queries: one to obtain all role hierarchy one to get all roles' service level.
Fixesscylladb/scylladb#12923Closesscylladb/scylladb#19085
* github.com:scylladb/scylladb:
test/auth_cluster/test_raft_service_levels: add test for automatic connection update
api/cql_server_test: add CQL server testing API
transport/cql_server: subscribe to sl effective cache reloaded
transport/controller: coroutinize `subscribe_server` and `unsubscribe_server`
transport/cql_server: add method to update service level params on all connections
generic_server: use async function in `for_each_gently()`
service/qos/sl_controller: use effective service levels cache
service/qos/service_level_controller: notify subscribers on effective cache reloaded
service/raft/group0_state_machine: update effective service levels cache
service/topology_coordinator: migrate service levels before auth
service/qos/service_level_controller: effective service levels cache
utils/sorting: allow to pass any container as verticies
service/qos/service_level_controller: replace shard check to assert
service/qos: define effective service level
service/qos/qos_common: use const reference in `init_effective_names()`
service/qos/service_level_controller: remove unused field
auth: return map of directly granted roles
test/auth/test_auth_v2_migration: create sl1 in the test
We have two mechanism to give visibility into reads having to process many tombstones:
* a warning in the logs, triggered if a read processed more the `tombstone_warn_threshold` dead rows/tombstones
* a trace message, which includes stats of the amount of rows in the page, including the amount of live and dead rows as well as tombstones
This series extends this to also include information on cells, so we have visibility into the case where a read has to process an excessive amount of cell tombstones (mainly because of collections).
A log line is now also logged if the amount of dead cells/tombstones in the page exceeds `tombstone_warn_threshold`. The trace message is also extended to contain cell stats.
The `tombstone_warn_threshold` log lines now receive a 10s rate-limit to avoid excessive log spamming. The rate-limit is separate for the row and cell logs.
Example of the new log line (`tombstone_warn_threshold=10` ):
```
WARN 2024-05-30 07:56:44,979 [shard 0:stmt] querier - Read 98 live cells and 126 dead cells/tombstones for system_schema.scylla_tables <partition-range-scan> (-inf, +inf) (see tombstone_warn_threshold)
```
Example of the new tracing message:
```
Page stats: 1 partition(s), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 13 cell(s) (1 live, 12 dead) [shard 0] | 2024-05-30 08:13:19.690803 | 127.0.0.1 | 6114 | 127.0.0.1
```
Fixes: https://github.com/scylladb/scylladb/issues/18996
Improvement, not a backport candidate.
Closesscylladb/scylladb#18997
* github.com:scylladb/scylladb:
test/boost: mutation_test: add test for cell compaction stats
mutation/compact_and_expire_result: drop operator bool()
querier: consume_page(): add rate-limiting to tombstone warnings
querier: consume_page(): add cell stats to page stats trace message
querier: consume_page(): add tombstone warning for cell tombstones
querier: consume_page(): extract code which logs tombstone warning
mutation/mutation_compactor: collect and aggregate cell compaction stats
mutation: row::compact_and_expire(): use compact_and_expire_result
collection_mutation: compact_and_expire(): use compact_and_expire_result
mutation: introduce compact_and_expire_result
Fixes#13334
All required code paths (see enterprise) now uses
extensions::is_extension_internal_keyspace.
The old mechanism can be removed. One less global var.
Closesscylladb/scylladb#20047
This is a followup to #19937, for #19803. See in particular [this comment](https://github.com/scylladb/scylladb/issues/19803#issuecomment-2258371923).
The primary conversion target is coroutines. However, while coroutines are the most convenient style, they are only infrequently usable in this case, for the following reasons:
- Wherever we have a `future::finally()` that calls a cleanup function that returns a future (which must be awaited), we cannot use `co_await`. We can only use `seastar::async()` with `deferred_close` or `defer()`.
- The code passes lots of lambdas, and `co_await` cannot be used in lambdas. First, I tried, and the compiler rejects it; second, a capturing lambda that is a coroutine is a trap [[1]](https://devblogs.microsoft.com/oldnewthing/20211103-00/?p=105870) [[2]](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rcoro-capture).
In most cases, I didn't have to use naked `seastar::async()`; there were specialized wrappers in place already. Thus, most of the changes target `seastar::thread` context under existent `seastar::async()` wrappers, and only a few functions end up as coroutines.
The last patch in the series (`test/sstable: remove useless variable from promoted_index_read()`) is an independent micro-cleanup, the opportunity for which I thought to have noticed while reading the code.
The tail of `test/boost/sstable_test.cc` (the stuff following `promoted_index_read()`) is already written as `seastar::thread`. That's already better (for readability) than future chaining; but could have I perhaps further converted those functions to coroutines? My answer was "no":
- Some of the candidate functions relied on deferred cleanups that might need to yield (all three variants of `count_rows()`).
- Some had been implemented by passing lambdas to wrappers of `seastar::async()` (`sub_partition_read()`, `sub_partitions_read()`).
- The test case `test_skipping_in_compressed_stream()` initially looked promising for co-routinization (from its starting point `seastar::async()`), because it seemed to employ no deferred cleanup (that might need to yield). However, the function uses three lambdas that must be able to yield internally, and one of those (`make_is()`) is even capturing.
- The rest (`test_empty_key_view_comparison()`, `test_parse_path_good()`, `test_parse_path_bad()`) was synchronous code to begin with.
```
test/boost/sstable_test.cc | 188 +++++++++-----------
1 file changed, 83 insertions(+), 105 deletions(-)
```
Refactoring; no backport needed.
Closesscylladb/scylladb#20011
* github.com:scylladb/scylladb:
test/sstable: remove useless variable from promoted_index_read()
test/sstable: rewrite promoted_index_read() with async()
test/sstable: unfuturize lambda invocation in test_using_reusable_sst*()
test/sstable: rewrite wrong_range() with async()
test/sstable: simplify not_find_key_composite_bucket0() under test_using_reusable_sst()
test/sstable: rewrite full_index_search() with async()
test/sstable: simplify find_key*(), all_in_place() under test_using_reusable_sst()
test/sstable: rewrite (un)compressed_random_access_read() with async()
test/sstable: simplify write_and_validate_sst()
test/sstable: simplify check_toc_func() under async()
test/sstable: simplify check_statistics_func() under async()
test/sstable: simplify check_summary_func() under async()
test/sstable: coroutinize check_component_integrity()
test/sstable: rewrite write_sst_info() with async()
test/sstable: simplify missing_summary_first_last_sane()
test/sstable: coroutinize summary_query_fail()
test/sstable: rewrite summary_query() with async()
test/sstable: coroutinize (simple/composite)_index_read()
test/sstable: rewrite index_read() with async()
test/sstable: rewrite test_using_reusable_sst() with async()
test/sstable: rewrite test_using_working_sst() with async()