We'd like to change the vector to a more involved type, so to avoid
repeating it everywhere, give it a name. The actual type isn't changed
in this patch.
storage_proxy uses std::vector<inet_address> for small lists of nodes - for replication (often 2-3 replicas per operation) and for pending operations (usually 0-1). These vectors require an allocation, sometimes more than one if reserve() is not used correctly.
This series switches storage_proxy to use utils::small_vector instead, removing the allocations in the common case.
Test results (perf_simple_query --smp 1 --task-quota-ms 10):
```
before: median 184810.98 tps ( 91.1 allocs/op, 20.1 tasks/op, 54564 insns/op)
after: median 192125.99 tps ( 87.1 allocs/op, 20.1 tasks/op, 53673 insns/op)
```
4 allocations and ~900 instructions are removed (the tps figure is also improved, but it is less reliable due to cpu frequency changes).
The type change is unfortunately not contained in storage_proxy - the abstraction leaks to providers of replica sets and topology change vectors. This is sad but IMO the benefits make it worthwhile.
I expect more such changes can be applied in storage_proxy, specifically std::unordered_set<gms::inet_address> and vectors of response handles.
Closes#8592
* github.com:scylladb/scylla:
storage_proxy, treewide: use utils::small_vector inet_address_vector:s
storage_proxy, treewide: introduce names for vectors of inet_address
utils: small_vector: add print operator for std::ostream
hints: messages.hh: add missing #include
storage_proxy works with vectors of inet_addresses for replica sets
and for topology changes (pending endpoints, dead nodes). This patch
introduces new names for these (without changing the underlying
type - it's still std::vector<gms::inet_address>). This is so that
the following patch, that changes those types to utils::small_vector,
will be less noisy and highlight the real changes that take place.
Defer picking the appropriate max result size for unlimited queries to
the database, which is already the place where we make query classifying
decisions. This move means that all these decisions are now centralized
in the database, not scattered in different places and fixing one fixes
all users.
Both hinted handoff and repair are meant to improve the consistency of the cluster's data. HH does this by storing records of failed replica writes and replaying them later, while repair goes through all data on all participaring replicas and makes sure the same data is stored on all nodes. The former is generally cheaper and sometimes (but not always) can bring back full consistency on its own; repair, while being more costly, is a sure way to bring back current data to full consistency.
When hinted handoff and repair are running at the same time, some of the work can be unnecessarily duplicated. For example, if a row is repaired first, then hints towards it become unnecessary. However, repair needs to do less work if data already has good consistency, so if hints finish first, then the repair will be shorter.
This PR introduces a possibility to wait for hints to be replayed before continuing with user-issued repair. The coordinator of the repair operation asks all nodes participating in the repair operation (including itself) to mark a point at the end of all hint queues pointing towards other nodes participating in repair. Then, it waits until hint replay in all those queues reaches marked point, or configured timeout is reached.
This operation is currently opt-in and can be turned on by setting the `wait_for_hint_replay_before_repair_in_ms` config option to a positive value.
Fixes#8102
Tests:
- unit(dev)
- some manual tests:
- shutting down repair coordinator during hints replay,
- shutting down node participating in repair during hints replay,
Closes#8452
* github.com:scylladb/scylla:
repair: introduce abort_source for repair abort
repair: introduce abort_source for shutdown
storage_proxy: add abort_source to wait_for_hints_to_be_replayed
storage_proxy: stop waiting for hints replay when node goes down
hints: dismiss segment waiters when hint queue can't send
repair: plug in waiting for hints to be sent before repair
repair: add get_hosts_participating_in_repair
storage_proxy: coordinate waiting for hints to be sent
config: add wait_for_hint_replay_before_repair option
storage_proxy: implement verbs for hint sync points
messaging_service: add verbs for hint sync points
storage_proxy: add functions for syncing with hints queue
db/hints: make it possible to wait until current hints are sent
db/hints: add a metric for counting processed files
db/hints: allow to forcefully update segment list on flush
Adds a `wait_for_hints_to_be_replayed` function which waits until all
hints between specified endpoints are replayed.
For each node, a hint sync point is created. Then, repair coordinator
waits until the hint sync point is reached on every node, or timeout
occurs. This is done by querying each host participating in repair every
second in order if the sync point is still there.
Adds two methods to `storage_proxy`:
- `create_hint_queue_sync_point` - creates a "hint sync point" which
is kept present in storage_proxy until all hint queues on the local
node reach their curent end. It will also disappear if given deadline
is reached first.
- `check_hint_queue_sync_point` - checks if given hint sync point still
exists.
The created sync point waits for hint queues in all hint managers, on
all shards.
These two helpers are now namespace-scoped methods, but both
need the migration manager instance inside. All their callers
are now patched to have the migration manager at hands, so
the helpers can be turned into methods.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch is the bridge between the previous one and the
next one and is quite messy to be merged with either.
No heavy changes -- just copy the migration manager's ptr
onto rpc lambdas. Will be used in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The proxy's messaging code uses migration manager to obtain schema.
Since proxy is more low-level service than migration manager, it's
incorrect to make proxy reference the manager directly. Instead,
push the shared_ptr into proxy's messaging code. This kills two
birst with one stone:
1: let proxy use migration manager
2: makes sure that by the time migration manager is stopped
the proxy's use of this pointer is gone (unregistered from
rpc)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A follow up for the patch for #7611. This change was requested
during review and moved out of #7611 to reduce its scope.
The patch switches UUID_gen API from using plain integers to
hold time units to units from std::chrono.
For one, we plan to switch the entire code base to std::chrono units,
to ensure type safety. Secondly, using std::chrono units allows to
increase code reuse with template metaprogramming and remove a few
of UUID_gen functions that beceme redundant as a result.
* switch get_time_UUID(), unix_timestamp(), get_time_UUID_raw(), switch
min_time_UUID(), max_time_UUID(), create_time_safe() to
std::chrono
* remove unused variant of from_unix_timestamp()
* remove unused get_time_UUID_bytes(), create_time_unsafe(),
redundant get_adjusted_timestamp()
* inline get_raw_UUID_bytes()
* collapse to similar implementations of get_time_UUID()
* switch internal constants to std::chrono
* remove unnecessary unique_ptr from UUID_gen::_instance
Message-Id: <20210406130152.3237914-2-kostja@scylladb.com>
If a table that is not replicated to a certain DC (rf=0) is accessed
with LOCAL_QUORUM on that DC the current code will crash since the
'targets' array will be empty and read executor does not handle it.
Fix it by replying with empty result.
Fixes#8354
Message-Id: <YGro+l2En3fF80CO@scylladb.com>
This just causes unneeded and slower recompliations. Instead replace
with forward declarations, or includes of smaller headers that were
incidentally brought in by the one removed. The .cc files that really
need it gain the include, but they are few.
Ref #1.
Closes#8403
Convert tri_comparators to return std::strong_ordering rather than int,
to prevent confusion with less comparators. Downstream users are either
also converted, or adjust the return type back to int, whichever happens
to be simpler; in all cases the change it trivial.
Coordinator prefers himself as the "counter leader", so if another
endpoint is chosen as the leader, we know that coordinator was
not a member of replica set. We can use this information to
increment relevant metric (which used to neglect the counters
completely).
Fixes#4337
This not only reduces internode traffic but is also needed for a
later change in this PR: metrics for non-token-aware writes
including counter updates.
Currently range scans build their result using the `reconcilable_result`
format and then convert it to `query::result`. This is inefficient for
multiple reasons:
1) it introduces an additional intermediate result format and a
subsequent conversion to the final one;
2) the reconcilable result format was designed for reconciliation so it
contains all data, including columns unselected by the query, dead
rows and tombstones, which takes much more memory to build;
There is no reason to go through all this trouble, if there ever was one
in the past it doesn't stand anymore. So switch to the newly introduced
`query_data_on_all_shards()` when doing normal data range scans, but
only if all the nodes in the cluster supports it, to avoid artificial
differences in page sizes due to how reconcilable result and
query::result calculates result size and the consequent false-positive
read repair.
The transition to this new more efficient method is coordinated by a
cluster feature and whether to use it is decided by the coordinator
(instead of each replica individually). This is to avoid needless
reconciliation due to the different page sizes the two formats will
produce.
The error is benign but if it is not handled "unhandled exception" error
will be printed in the logs.
Message-Id: <20210209150313.GA1708015@scylladb.com>
Since fea5067df we enforce a limit on the memory consumption of
otherwise non-limited queries like reverse and non-paged queries. This
limit is sent down to the replicas by the coordinator, ensuring that
each replica is working with the same limit. This however doesn't work
in a mixed cluster, when upgrading from a version which doesn't have
this series. This has been worked around by falling back to the old
max_result_size constant of 1MB in mixed clusters. This however resulted
in a regression when upgrading from a pre fea5067df to a post fea5067df
one. Pre fea5067df already had a limit for reverse queries, which was
generalized to also cover non-paged ones too by fea5067df.
The regression manifested in previously working reverse queries being
aborted. This happened because even though the user has set a generous
limit for them before the upgrade, in the mix cluster replicas fall back
to the much stricter 1MB limit temporarily ignoring the configured limit
if the coordinator is an old node. This patch solves this problem by
using the locally configured limit instead of the max_result_size
constant. This means that the user has to take extra care to configure
the same limit on all replicas, but at least they will have working
reverse queries during the upgrade.
Fixes: #8022
Tests: unit(release), manual test by user who reported the issue
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210209075947.1004164-1-bdenes@scylladb.com>
After support for mixed cluster compatibility feature
DIGEST_MULTIPARTITION_READ was dropped in 854a44ff9b
range_slice_read_executor and never_speculating_read_executor become
identical, so remove the former for good.
Message-Id: <20210124122731.GA1122499@scylladb.com>
This commit makes it possible to change hints manager's configuration at
runtime through HTTP API.
To preserve backwards compatibility, we keep the old behavior of not
creating and checking hints directories if they are not enabled at
startup. Instead, hint directories are lazily initialized when hints are
enabled for the first time through HTTP API.
Now, the hints manager object for regular hints is always created, even
if hints are disabled in configuration. Please note that the behavior of
hints will be unchanged - no hints will be sent when they are disabled.
The intent of this change is to make enabling and disabling hints in
runtime easier to implement.
Uses db::hints::host_filter as the type of hinted_handoff_enabled
configuration option.
Previously, hinted_handoff_enabled used to be a string option, and it
was parsed later in a separate function during startup. The function
returned a std::optional<std::unordered_set<sstring>>, whose meaning in
the context of hints is rather enigmatic for an observer not familiar
with hints.
Now, hinted_handoff_enabled has type of db::hints::host_filter, and it
is plugged into the config parsing framework, so there is no need for
later post-processing.
This change modifies db::hints::resource_manager so that it is now
possible to add hints::managers after it was started.
This change will make it possible to register the regular hints manager
later in runtime, if it wasn't enabled at boot time.
Materialized view updates participate in a retirement program,
which makes sure that they are immediately taken down once their
target node is down, without having to wait for timeout (since
views are a background operation and it's wasteful to wait in the
background for minutes). However, this mechanism has very delicate
lifetime issues, and it already caused problems more than once,
most recently in #5459.
In order to make another bug in this area less likely, the two
implementations of the mechanism, in on_down() and drain_on_shutdown(),
are unified.
Possibly refs #7572Closes#7624
This miniseries adds metrics which can help the users detect potential overloads:
* due to having too many in-flight hints
* due to exceeding the capacity of the read admission queue, on replica side
Closes#7584
* github.com:scylladb/scylla:
reader_concurrency_semaphore: add metrics for shed reads
storage_proxy: add metrics for too many in-flight hints failures
get() the latest token_metadata_ptr from the
shared_token_metadata before each use.
expose get_token_metadata_ptr() rather than get_token_metadata()
so that caller can keep it across continuations.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And use it to get a token_metadata& compatible
with current usage, until the services are converted to
use token_metadata_ptr.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When there are too many in-flight hints, writes start returning
overloaded exceptions. We're missing metrics for that, and these could
be useful when judging if the system is in overloaded state.
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accomodate it, don't
use structured bindings for variables that are later
captured.
Unavailable exception means that operation was not started and it can be
retried safely. If lwt fails in the learn stage though it most
certainly means that its effect will be observable already. The patch
returns timeout exception instead which means uncertainty.
Fixes#7258
Message-Id: <20201001130724.GA2283830@scylladb.com>
xxhash algorithm is supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
Write failure reply is supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
Digest multipartition read is supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
With the new hashing routine, null values are taken into account
when computing row digest. Previous behavior had a regression
which stopped computing the hash after the first null value
is encountered, but the original behavior was also prone
to errors - e.g. row [1, NULL, 2] was not distinguishable
from [1, 2, NULL], because their hashes were identical.
This hashing is not yet active - it will only be used after
the next commit introduces a proper cluster feature for it.
We are using mutate_locally to handle hint mutations that arrived
through RPC. The current implementation makes no distinction whether
the mutation came through hint verb or a mutation verb resulting in
using the same smp group for both. This commit adds the ability to
reference different smp group in mutate_locally private calls and
makes the handlers pass the correct smp group to mutate_locally.
Hints and regular writes currently uses the same cross shard
operation semaphore, which can lead to priority inversion, making
cross shard writes wait for cross shard hints. This commit adds
an smp_service_group for hints and adds it usage in the mutate_hint
function.
This patch causes orphaned hints (hints that were written towards a node
that is no longer their replica) to be sent with a default write
timeout. This is what is currently done for non-orphaned hints.
Previously, the timeout was hardcoded to one hour. This could cause a
long delay while shutting down, as hints manager waits until all ongoing
hint sending operation finish before stopping itself.
Fixes: #7051
"
We keep refrences to locator::token_metadata in many places.
Most of them are for read-only access and only a few want
to modify the token_metadata.
Recently, in 94995acedb,
we added yielding loops that access token_metadata in order
to avoid cpu stalls. To make that possible we need to make
sure they token_metadata object they are traversing won't change
mid-loop.
This series is a first step in ensuring the serialization of
updates to shared token metadata to reading it.
Test: unit(dev)
Dtest: bootstrap_test:TestBootstrap.start_stop_test{,_node}, update_cluster_layout_tests.py -a next-gating(dev)
"
* tag 'constify-token-metadata-access-v2' of github.com:bhalevy/scylla:
api/http_context: keep a const sharded<locator::token_metadata>&
gossiper: keep a const token_metadata&
storage_service: separate get_mutable_token_metadata
range_streamer: keep a const token_metadata&
storage_proxy: delete unused get_restricted_ranges declaration
storage_proxy: keep a const token_metadata&
storage_proxy: get rid of mutable get_token_metadata getter
database: keep const token_metadata&
database: keyspace_metadata: pass const locator::token_metadata& around
everywhere_replication_strategy: move methods out of line
replication_strategy: keep a const token_metadata&
abstract_replication_strategy: get_ranges: accept const token_metadata&
token_metadata: rename calculate_pending_ranges to update_pending_ranges
token_metadata: mark const methods
token_ranges: pending_endpoints_for: return empty vector if keyspace not found
token_ranges: get_pending_ranges: return empty vector if keyspace not found
token_ranges: get rid of unused get_pending_ranges variant
replication_strategy: calculate_natural_endpoints: make token_metadata& param const
token_metadata: add get_datacenter_racks() const variant