In order to combine multiple service level options coming from
multiple roles, a helper function is provided to merge two
of them. The semantics depend on each parameter, but for timeouts,
which are the only parameters at the time of writing this message,
the minimum value of the two is taken. That in particular means
that when service level A has timeout = 50ms and service level B
has timeout = 1s, the resulting service level options
would set the timeout to 50ms.
In order to avoid needless schema disagreements, a way of announcing
a schema change with fixed timestamp is added.
That way, when nodes update schemas of their internal tables (e.g.
during updates), it's possible for all nodes to use an identical
timestamp for this operation, which in turn makes their digests
identical.
storage_proxy uses std::vector<inet_address> for small lists of nodes - for replication (often 2-3 replicas per operation) and for pending operations (usually 0-1). These vectors require an allocation, sometimes more than one if reserve() is not used correctly.
This series switches storage_proxy to use utils::small_vector instead, removing the allocations in the common case.
Test results (perf_simple_query --smp 1 --task-quota-ms 10):
```
before: median 184810.98 tps ( 91.1 allocs/op, 20.1 tasks/op, 54564 insns/op)
after: median 192125.99 tps ( 87.1 allocs/op, 20.1 tasks/op, 53673 insns/op)
```
4 allocations and ~900 instructions are removed (the tps figure is also improved, but it is less reliable due to cpu frequency changes).
The type change is unfortunately not contained in storage_proxy - the abstraction leaks to providers of replica sets and topology change vectors. This is sad but IMO the benefits make it worthwhile.
I expect more such changes can be applied in storage_proxy, specifically std::unordered_set<gms::inet_address> and vectors of response handles.
Closes#8592
* github.com:scylladb/scylla:
storage_proxy, treewide: use utils::small_vector inet_address_vector:s
storage_proxy, treewide: introduce names for vectors of inet_address
utils: small_vector: add print operator for std::ostream
hints: messages.hh: add missing #include
* scylla-dev/raft-snapshot-fixes-v4:
raft: document that add entry my throw commit_status_unknown
raft: test: add test of a leadership change during ongoing snapshot transfer
raft: test: retry submitting an entry if it was dropped
raft: test: wait for the log to be fully replicated on new leader only
raft: drop waiters with outdated terms
raft: make snapshot transfer abortable
raft: accept snapshots transfer from multiple nodes simultaneously
raft: do not send probes while transferring snapshot
raft: handle messages sending errors
raft: test: return error from rpc module if nodes are disconnected
raft: fix a typo in a variable name
In commit 323f72e48a (repair: Switch to
use NODE_OPS_CMD for replace operation), we switched replace operation
to use the new NODE_OPS_CMD infrastructure.
In this patch set, we continue the work to switch decommission and bootstrap
operation to use NODE_OPS_CMD.
Fixes#8472Fixes#8471Closes#8481
* github.com:scylladb/scylla:
repair: Switch to use NODE_OPS_CMD for bootstrap operation
repair: Switch to use NODE_OPS_CMD for decommission operation
A snapshot transfer may take a lot of time and meanwhile a leader doing
it may lose the leadership. If that happens the ongoing snapshot transfer
becomes obsolete since the snapshot will be rejected by the receiving
node as coming from an old leader. Make snapshot transfer abortable and
abort them when leader changes.
storage_proxy works with vectors of inet_addresses for replica sets
and for topology changes (pending endpoints, dead nodes). This patch
introduces new names for these (without changing the underlying
type - it's still std::vector<gms::inet_address>). This is so that
the following patch, that changes those types to utils::small_vector,
will be less noisy and highlight the real changes that take place.
"
Choosing the max-result-size for unlimited queries is broken for unknown
scheduling groups. In this case the system limit (unlimited) will be
chosen. A prime example of this break-down is when service levels are
used.
This series fixes this in the same spirit as the similar semaphore
selection issue (#8508) was fixed: use the user limit as the fall-back
in case of unknown scheduling groups.
To ensure future fixes automatically apply to both query-classification
related configurations, selecting the max result size for unlimited
queries is now delegated to the database, sharing the query
classification logic with the semaphore selection.
Fixes: #8591
Tests: unit(dev)
"
* 'query-max-size-service-level-fix/v2' of https://github.com/denesb/scylla:
service/storage_proxy: get_max_result_size() defer to db for unlimited queries
database: add get_unlimited_query_max_result_size()
query_class_config: add operator== for max_result_size
database: get_reader_concurrency_semaphore(): extract query classification logic
Defer picking the appropriate max result size for unlimited queries to
the database, which is already the place where we make query classifying
decisions. This move means that all these decisions are now centralized
in the database, not scattered in different places and fixing one fixes
all users.
Every time db/config.hh is modified (e.g., to add a new configuration
option), 110 source files need to be recompiled. Many of those 110 didn't
really care about configuration options, and just got the dependency
accidentally by including some other header file.
In this patch, I remove the include of "db/config.hh" from all header
files. It is only needed in source files - and header files only
need forward declarations. In some cases, source files were missing
certain includes which they got incidentally from db/config.hh, so I
had to add these includes explicitly.
After this patch, the number of source files that get recompiled after a
change to db/config.hh goes down from 110 to 45.
It also means that 65 source files now compile faster because they don't
include db/config.hh and whatever it included.
Additionally, this patch also eliminates a few unnecessary inclusions
of database.hh in other header files, which can use a forward declaration
or database_fwd.hh. Some of the source files including one of those
header files relied on one of the many header files brought in by
database.hh, so we need to include those explicitly.
In view_update_generator.hh something interesting happened - it *needs*
database.hh because of code in the header file, but only included
database_fwd.hh, and the only reason this worked was that the files
including view_update_generator.hh already happened to unnecessarily
include database.hh. So we fix that too.
Refs #1
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505102111.955470-1-nyh@scylladb.com>
Both hinted handoff and repair are meant to improve the consistency of the cluster's data. HH does this by storing records of failed replica writes and replaying them later, while repair goes through all data on all participaring replicas and makes sure the same data is stored on all nodes. The former is generally cheaper and sometimes (but not always) can bring back full consistency on its own; repair, while being more costly, is a sure way to bring back current data to full consistency.
When hinted handoff and repair are running at the same time, some of the work can be unnecessarily duplicated. For example, if a row is repaired first, then hints towards it become unnecessary. However, repair needs to do less work if data already has good consistency, so if hints finish first, then the repair will be shorter.
This PR introduces a possibility to wait for hints to be replayed before continuing with user-issued repair. The coordinator of the repair operation asks all nodes participating in the repair operation (including itself) to mark a point at the end of all hint queues pointing towards other nodes participating in repair. Then, it waits until hint replay in all those queues reaches marked point, or configured timeout is reached.
This operation is currently opt-in and can be turned on by setting the `wait_for_hint_replay_before_repair_in_ms` config option to a positive value.
Fixes#8102
Tests:
- unit(dev)
- some manual tests:
- shutting down repair coordinator during hints replay,
- shutting down node participating in repair during hints replay,
Closes#8452
* github.com:scylladb/scylla:
repair: introduce abort_source for repair abort
repair: introduce abort_source for shutdown
storage_proxy: add abort_source to wait_for_hints_to_be_replayed
storage_proxy: stop waiting for hints replay when node goes down
hints: dismiss segment waiters when hint queue can't send
repair: plug in waiting for hints to be sent before repair
repair: add get_hosts_participating_in_repair
storage_proxy: coordinate waiting for hints to be sent
config: add wait_for_hint_replay_before_repair option
storage_proxy: implement verbs for hint sync points
messaging_service: add verbs for hint sync points
storage_proxy: add functions for syncing with hints queue
db/hints: make it possible to wait until current hints are sent
db/hints: add a metric for counting processed files
db/hints: allow to forcefully update segment list on flush
An out of block log print resulted in repeated prints about removal of
the default service level. The period of this print is every time the
configuration is scanned for changes. It happens when the default
service level is one of the last on the map (sorted as in the map).
Fixes#8567Closes#8576
Introduce a tagged id struct for `group_id`.
Raft code would want to generate quite a lot of unique
raft groups in the future (e.g. tablets). UUID is designed
exactly for that (e.g. larger capacity than `uint64_t`, obviously,
and also has built-in procedures to generate random ids).
Also, this is a preparation to make "raft group 0" use a random
ID instead of a literal fixed `0` as a group id.
The purpose is that every scylla cluster must have a unique ID
for "raft group 0" since we don't want the nodes from some other
cluster to disrupt the current cluster. This can happen if,
for some reason, a foreign node happens to contact a node in
our cluster.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210429170630.533596-3-pa.solodovnikov@scylladb.com>
In commit 323f72e48a (repair: Switch to
use NODE_OPS_CMD for replace operation), we switched replace operation
to use the new NODE_OPS_CMD infrastructure.
In this patch, we continue the work to switch bootstrap operation to use
NODE_OPS_CMD.
The benefits:
- It is more reliable to detect pending node operations, to avoid
multiple topology changes at the same time, than using gossip status.
- The cluster reverts to a state before the bootstrap operation
automatically in case of error much faster than gossip.
- Allows users to pass a list of dead nodes to ignore during bootstrap
explicitly.
- The BOOTSTRAP gossip status is not needed any more. This is one step
closer to achieve gossip-less topology change.
Fixes#8472
Adds a `wait_for_hints_to_be_replayed` function which waits until all
hints between specified endpoints are replayed.
For each node, a hint sync point is created. Then, repair coordinator
waits until the hint sync point is reached on every node, or timeout
occurs. This is done by querying each host participating in repair every
second in order if the sync point is still there.
Adds two methods to `storage_proxy`:
- `create_hint_queue_sync_point` - creates a "hint sync point" which
is kept present in storage_proxy until all hint queues on the local
node reach their curent end. It will also disappear if given deadline
is reached first.
- `check_hint_queue_sync_point` - checks if given hint sync point still
exists.
The created sync point waits for hint queues in all hint managers, on
all shards.
The configuration detection is based on a loop that
advances two iterators and compares the two collection
for deducing the configuration change. In order to
correctly deduce the changes the iteration have to be
according to the key (service level name) order for both
of the collections. If it doesn't happen the results are
undefined and in some cases can lead to a crash of the
system. The bug is that the _service_level_db field was
implemented using an unordered_map which obviously don't
guarantie the configuration change detection assumption.
The fix was simply to change the field type to a map
instead of unordered_map.
Another problem is that when a static service level (i.e
default) is at the end of the keys list, it is repeatedly
being deleted - which doesn't really do anything since deleting
a static service level is just retaining it's defult values
but it is stil wrong.
Exceptions around the loop polling were not handled properly.
This is an issue due to the fact that if an unhandled exception
slips out to the configuration polling loop itself it will break
it. When the configuration polling loop is broken, any further
change to the configuration will not be acted uppon in the nodes
where the loop is broken until the node is restarted. The chances
for exceptions are now greater than before since in one of the
previous commits we started quering the workload prioritization
configuration table with a sensible, shorter timeout.
This change also adds a logger for the workload prioritization
module and some logging mainly arround the configuration polling loop.
Most logs are added in the info level since they are not expected to
happen frequently but when they do we would like to have some
information by default regarding what broke the loop.
"
There are few places left that call for migration manager
by global reference. This set patches all those places
and makes the migration manager a service that locally
lives in main(). Surprisingly, the largest changes are to
get rid of global migration manager calls from ... the
migration manager itself.
Two tricks here. First, repair code gets its private global
migration manager pointer. That's not nice, but it aligned
with current repair design -- all its references are now
"global". Some day they all will be moved into sharded
repair service, for now these globals just describe the real
dependencies of the repair code.
Second is storage proxy that needs to call migration manager
to get schema. Proper layering makes migration manager sit
on top of storage proxy, so the direct back-reference is
not nice. To overcome this the proxy gets migration manager's
shared_from_this() pointer and drops all of them on stop.
This makes sure that by the time migration manager stops
no references from proxy exist.
tests: unit(dev), start-stop, start-drain-stop
"
* 'br-turn-migration-manager-local' of https://github.com/xemul/scylla: (21 commits)
migration_manager: Make it main-local
tests: Have own migration manager instances
tests: Use migration_manager from cql_test_env
migration_manager: Call maybe_sync from this
migration_manager: Make get_schema_for_... methods
migration_manager: Hide get_schema_definition
streaming: Keep migration_manager ptr in rpc lambdas
storage_proxy: Keep migration_manager ptr in rpc lambdas
streaming: Get migration_manager shared_ptr in messaging
storage_proxy: Get migration_manager shared_ptr in messaging
migration_manager: Make maybe_sync a method
migration_manager: Open-code merge lambda
migration_manager: Turn do_announce_new_type non-static
migration_manager: Make announce() non-static method
storage_servive: Use local migration manager
storage_service: Keep migration manager on board
migration_manager: Use 'this' where appropriate
repair: Use private migration manager pointer
repair: Keep private sharded migration manager pointer
redis: Carry sharded migration manager over init
...
Now everybody is patched to use component-local instance of migration
manager and its global instance can be moved into main() scope.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The only caller of maybe_sync() method is now the method itself
and can stop using global migration manager instance and switch
to using 'this'.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
These two helpers are now namespace-scoped methods, but both
need the migration manager instance inside. All their callers
are now patched to have the migration manager at hands, so
the helpers can be turned into methods.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method is exclusively used inside migration manager code,
so (for now) no use in keeping it exposed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch is the bridge between the previous one and the
next one and is quite messy to be merged with either.
No heavy changes -- just copy the migration manager's ptr
onto rpc lambdas. Will be used in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The proxy's messaging code uses migration manager to obtain schema.
Since proxy is more low-level service than migration manager, it's
incorrect to make proxy reference the manager directly. Instead,
push the shared_ptr into proxy's messaging code. This kills two
birst with one stone:
1: let proxy use migration manager
2: makes sure that by the time migration manager is stopped
the proxy's use of this pointer is gone (unregistered from
rpc)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now the maybe_sync is namespace-scope function. Turn it into
a migration_manager method so that it can use 'this' instead of
get_local_migration_manager().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This lambda uses global migration manager instance. Open-coding
this short lambda makes further patching simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's the only place that calls recently patched .announce()
method, so instead of grabbing global migration manager,
use 'this'.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method needs to get migration manager instance to call
methods on it, so turn it non-static to have the instance in
'this'. Caller (yes, only one) gets local migration manager
itself, but will be patched soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The storage service needs migration manager to sync schema
on lifecycle notifiers and to stop the guy on drain. So this
patch just pushes the migration manager reference all the
way through the storage service constructor.
Few words about tests. Since now storage service needs the
migration manager in constructor, some tests should take it
from somewhere. The cql_test_env already has (and uses) it,
all the others can just provide a not-started sharded one,
it won't be in use in _those_ tests anyway.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some its non-static method call get_local_migration_manager
instead of using 'this'. None of these places use this to
get cross-shard instance, so it's safe to use 'this' there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The migration_task is the class with the single static method
that's called from a single place in migration manager and
this method calls migration manager back right at once. There's
no much sense in keeping this abstraction, merge it into the
migration manager.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Migration manager has a function to get a schema (for read or write),
this function queries a peer node and retrieves the schema from it. One
scenario where it can happen is if an old node, queries an old not fixed
index.
This makes a hole through which views that are only adjusted for reading
can slip through.
Here we plug the hole by fixing such views before they are registered.
Closes#8509
The service level controller spawns an updating thread,
which wasn't properly waited for during shutdown.
This behavior is now fixed.
In order to make the shutdown order more standardized,
the operation is split into two phases - draining and stopping.
Tests: manual
Fixes#8468
In commit 323f72e48a (repair: Switch to
use NODE_OPS_CMD for replace operation), we switched replace operation
to use the new NODE_OPS_CMD infrastructure.
In this patch, we continue the work to switch decommission operation to use
NODE_OPS_CMD.
The benefits:
- A UUID is used to identify each node operation across the cluster.
- It is more reliable to detect pending node operations, to avoid
multiple topology changes at the same time.
- The cluster reverts to a state before the decommission operation
automatically in case of error. Without this patch, the node to be
decommissioned will be stuck in decommission status forever until it
is restarted and goes back to normal status.
- Allows users to pass a list of dead nodes to ignore for decommission
explicitly.
- The LEAVING gossip status is not needed any more. This is one step
closer to achieve gossip-less topology change.
- Allows us to trigger of off-strategy easily on the node receiving the
ranges
Fixes#8471
storage_proxy.hh is huge and includes many headers itself, so
remove its inclusions from headers and re-add smaller headers
where needed (and storage_proxy.hh itself in source files that
need it).
Ref #1.
Nested classes cannot be forward declared, and
storage_proxy::coordinator_query_result is used in pagers, where
we'd like to forward-declare it. Unnest it and introduce an alias
for compatibility.
`system_distributed_everywhere` is a new keyspace that uses Everywhere
replication strategy. This is useful, for example, when we want to store
internal data that should be accessible by every node; the data can be
written using CL=ALL (e.g. during node operations such as node
bootstrap, which require all nodes to be alive - at least currently) and
then read by each node locally using CL=ONE (e.g. during node restarts).
Closes#8457