Adjusts do_query so that it propagates and returns failed results. The
query_result method is added which is result-aware, and the old query
method was changed to call query_result.
Now, the logic of handling exceptions returned in reconcile() from
data_resolver->done() was changed so that the failed result does not
need to be converted to an exceptional future.
Now, the logic of handling exceptions returned in execute() from
data_resolver->done() was changed so that the failed result does not
need to be converted to an exceptional future.
This method is not overrided by any of the derived classes, so it does
not need to be virtual.
(cherry picked from commit b7fb93dc46531bca8db535301a069df52991f9d9)
Adds `utils::result_try` and `utils::result_futurize_try` - functions which allow to convert existing try..catch blocks into a version which handles C++ exceptions, failed results with exception containers and, depending on the function variant, exceptional futures using the same exception handling logic.
For example, you can convert the following try..catch block:
try {
return a_function_that_may_throw();
} catch (const my_exception& ex) {
return 123;
} catch (...) {
throw;
}
...to this:
return utils::result_try([&] {
return a_function_that_may_throw_or_return_a_failed_result();
}, utils::result_catch<my_exception>([&] (const Ex&) {
return 123;
}), utils::result_catch_dots([&] (auto&& handle) {
return handle.into_result();
});
Similarly, `utils::result_futurize_try` can be used to migrate `then_wrapped` or `f.handle_exception()` constructs.
As an example of the usability of the new constructs, two places in the current code which need to simultaneously handle exceptions and failed results are converted to use `result_try` and `result_futurize_try`.
Results of `perf_simple_query --smp 1 --operations-per-shard 1000000 --write`:
```
127041.61 tps ( 67.2 allocs/op, 14.2 tasks/op, 52422 insns/op)
126958.60 tps ( 67.2 allocs/op, 14.2 tasks/op, 52409 insns/op)
127088.37 tps ( 67.2 allocs/op, 14.2 tasks/op, 52411 insns/op)
127560.84 tps ( 67.2 allocs/op, 14.2 tasks/op, 52424 insns/op)
127826.61 tps ( 67.2 allocs/op, 14.2 tasks/op, 52406 insns/op)
126801.02 tps ( 67.2 allocs/op, 14.2 tasks/op, 52420 insns/op)
125371.51 tps ( 67.2 allocs/op, 14.2 tasks/op, 52425 insns/op)
126498.51 tps ( 67.2 allocs/op, 14.2 tasks/op, 52427 insns/op)
126359.41 tps ( 67.2 allocs/op, 14.2 tasks/op, 52423 insns/op)
126298.27 tps ( 67.2 allocs/op, 14.2 tasks/op, 52410 insns/op)
```
The number of tasks and allocations is unchanged. The number of instructions per operations seems similar, it may have increased slightly (by 10-20) but it's hard to tell for sure because of the noisiness of the results.
Tests: unit(dev)
Closes#10045
* github.com:scylladb/scylla:
transport: use result_try in process_request_one
storage_proxy: use result_futurize_try in mutate_end
storage_proxy: temporarily throw exception from result in mutate_end
utils: add result_try and result_futurize_try
Segregates result utilities into:
- result.hh - basic definitions related to results with exception
containers,
- result_combinators.hh - combinators for working with results in
conjunction with futures,
- result_loop.hh - loop-like combinators, currently has only
result_parallel_for_each.
The motivation for the split is:
1. In headers, usually only result.hh will be needed, so no need to
force most .cc files to compile definitions from other files,
2. Less files need to be recompiled when a combinator is added to
result_combinators or result_loop.
As a bonus, `result_with_exception` was moved from `utils::internal` to
just `utils`.
Adapts the mutate_end exception handling logic so that it uses the new
utils::result_futurize_try function to handle both exceptional futures
and failed results in an unified way.
Temporarily removes the logic which handles failed results in a
non-throwing way. Exceptions from failed results are thrown and handled
in try..catch.
The reason for this change is that it makes the following commit, which
migrates the whole try..catch block to utils::result_futurize_try much
nicer. The next commit will also bring back the non-throwing handling of
the failed result.
Changes the interface of `mutate_with_triggers` so that it returns
`future<result<>>` instead of `future<>`. No intermediate
`mutate_with_triggers_result` method is introduced because all call
sites will be changed in this PR so that they properly handle failed
`result<>`s with exceptions-as-values.
Similarly to `mutate_result` introduced in the previous commit,
`mutate_atomically_result` is introduced which returns some exceptions
inside `result<>`. The pre-existing `mutate_atomically` keeps the same
interface but uses `mutate_atomically_result` internally, converting
failed `result<>` to exceptional future if needed.
In order to be able to propagate exceptions-as-values from storage_proxy
but without having to modify all call sites of `mutate`, an in-between
method `mutate_result` is introduced which returns some exceptions
inside `result<>`. Now, `mutate` just calls the latter and converts
those exceptions to exceptional future if needed.
Instead of stupidly rethrowing the exception in failed result<>, the
`storage_proxy::mutate_end` function now inspects it with a visitor,
which does not involve any rethrows. Moreover, mutate_end now also
returns a `future<result<>>` instead of just `future<>`.
Changes the `storage_proxy::mutate_end` method to accept a
`future<result<>>` instead of `future<>`.
For the time being, all call call sites of that method pass a future
which is either exceptional or contains a result<> with a value.
Moreover, in case of a failed result<>, mutate_end just rethrows the
exception. Both of these will change in the upcoming commits of this PR.
Changes the type of the _ready promise in
abstract_write_response_handler - a promise used by the coordinator
logic to wait until the write operation is complete - to keep a
`result<>` instead of `void`. Now, a timeout is signalled by setting the
promise to a value containing a `result<>` with a mutation write timeout
exception - previously it was signalled by setting the promise to an
exceptional value.
This is just a first step on a long road of throwless propagation of the
error to the cql_server - for now, a failed result is immediately
converted to an exceptional future in `storage_proxy::response_wait`.
Fixes#9922
storage proxy uses is_timeout_exception to traverse different code paths.
a6202ae079 broke this (because bit rot and
intermixing), by wrapping exception for information purposes.
This adds check of nested types in exception handling, as well as a test
for the routine itself.
Closes#9932
* github.com:scylladb/scylla:
database/storage_proxy: Use "is_timeout_exception" instead of catch match
utils::is_timeout_exception: Ensure we handle nested exception types
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.
References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.
scylla-gdb.py is adjusted to look for both the new and old names.
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.
As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistency to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that is
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.
To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.
In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:
1) GC a tombstone after gc_grace_seconds
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;
This is the default mode. If no tombstone_gc option is specified by the
user. The old gc_grace_seconds based gc will be used.
2) Never GC a tombstone
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};
3) GC a tombstone immediately
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};
4) GC a tombstone after repair
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};
In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.
A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.
Tests: compaction_test.py, ninja test
Fixes#3560
[avi: resolve conflicts vs data_dictionary]
The default for get_unlimited_query_max_result_size() is 100MB (adjustable through config), whereas query::result_memory_limiter::maximum_result_size is 1MB (hard coded, should be enough for everybody)
This limit is then used by the replica to decide when to break pages and, in case of reversed clustering order reads, when to fail the read when accumulated data crosses the threshold. The latter behavior stems from the fact that reversed reads had to accumulate all the data (read in forward order) before they can reverse it and return the result. Reverse reads thus need a higher limit so that they have a higher chance of succeeding.
Most readers are now supporting reading in reverse natively, and only reversing wrappers (make_reversing_reader()) inserted on top of ka/la sstable readers need to accumulate all the data. In other cases, we could break pages sooner. This should lead to better stability (less memory usage) and performance (lower page build latency, higher read concurrency due to less memory footprint).
Tests: unit(dev)
Closes#9815
* github.com:scylladb/scylla:
storage_proxy: Send page_size in the read_command
gms: add SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT feature
result_memory_accounter: use new max_result_size::get_page_size in check_local_limit
max_result_size: Add page_size field
calculate_delay() implicitly converts a lowres_clock::duration to
std::chrono::microseconds. This fails if lowres_clock::duration has
higher resolution than microseconds.
Fix by using an explicit conversion, which always works.
When the whole cluster is already supporting
separate_page_size_and_safety_limit,
start sending page_size in read_command. This new value will be used
for determining the page size instead of hard_limit.
Fixes#9487Fixes#7586
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Instead of calling get_local_storage_proxy in paxos_state, get it from the
caller (who is, in fact, storage_proxy or one of its components).
Some of the callers, although they are storage_proxy components, don't
have a storage_proxy reference handy and so they ignomiously call
get_local_storage_proxy() themselves. This will be adjusted later.
The other callers who are, in fact, storage_proxy, have to take special
care not to cross a shard boundary. When they do, smp::submit_to()
is converted to sharded::invoke_on() in order to get the correct local instance.
Test: unit (dev)
Closes#9824
Probably storage_proxy is not the correct place to supply
data_dictionary, but it is available to practically all of
the coordinator code, so it is convenient.
storage_proxy::query_partition_key_range_concurrent() iterates through
vnodes produced by its argument query_ranges_to_vnodes_generator&&
ranges_to_vnodes and tries to merge them. This commit introduces
checking if subsequent vnodes are contiguous with each other, before
merging them.
Fixes#9167Closes#9175
And rename to get_batchlog_mutation_for while at it,
as it's about the batchlog, not batch_log.
This resolves a circular dependency between the
batchlog_manager and the storage_proxy that required
it in the case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There's nothing in this function that actually requries
the batchlog manager instance.
It uses a random number engine that's moved along with it
to class gossiper.
This resolves a circular dependency between the
batchlog_manager and storage_proxy.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
Add a sharded locator::effective_replication_map_factory that holds
shared effective_replication_maps.
To search for e_r_m in the factory, we use a compound `factory_key`:
<replication_strategy type, replication_strategy options, token_metadata ring version>.
Start the sharded factory in main (plus cql_test_env and tools/schema_loader)
and pass a reference to it to storage_proxy and storage_server.
For each keyspace, use the registry to create the effective_replication_map.
When registered, effective_replication_map objects erase themselves
from the factory when destroyed. effective_replication_map then schedules
a background task to clear_gently its contents, protected by the e_r_m_f::stop()
function.
Note that for non-shard 0 instances, if the map
is not found in the registry, we construct it
by cloning the precalculated replication_map
from shard 0 to save the cpu cycles of re-calculating
it time and again on every shard.
Test: unit(dev), schema_loader_test(debug)
DTest: bootstrap_test.py:TestBootstrap.decommissioned_wiped_node_can_join_test update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_new_node_while_schema_changes_with_repair_test (dev)
"
* tag 'effective_replication_map_factory-v7' of https://github.com/bhalevy/scylla:
effective_replication_map: clear_gently when destroyed
database: shutdown keyspaces
test: cql_test_env: stop view_update_generator before database shuts down
effective_replication_map_factory: try cloning replication map from shard 0
tools: schema_loader: start a sharded erm_factory
storage_service: use erm_factory to create effective_replication_map
keyspace: use erm_factory to create effective_replication_map
effective_replication_map: erase from factory when destroyed
effective_replication_map_factory: add create_effective_replication_map
effective_replication_map: enable_lw_shared_from_this
effective_replication_map: define factory_key
keyspace: get a reference to the erm_factory
main: pass erm_factory to storage_service
main: pass erm_factory to storage_proxy
locator: add effective_replication_map_factory