There are two reasons for the move. First is that cql server lifetime
is shorter than storage_proxy one and the later stores references to
the semaphore in each service_permit it holds. Second - we want thrift
(and in the future other user APIs) to share the same admission control
memory pool.
Fixes#4844
Message-Id: <20190814142614.GT17984@scylladb.com>
We in fact do stop cql server in storage_service::drain_on_shutdown()
which is called in main.cc during shutdown.
Message-Id: <20190814085027.GP17984@scylladb.com>
We should wait for a future returned from add_local_application_state() to
resolve before issuing new calculation, otherwise two
add_local_application_state() may run simultaneously for the same state.
Fixes#4838.
Message-Id: <20190812082158.GE17984@scylladb.com>
"
Current admission control takes a permit when cql requests starts and
releases it when reply is sent, but some requests may leave background
work behind after that point (some because there is genuine background
work to do like complete a write or do a read repair, and some because
a read/write may stuck in a queue longer than the request's timeout), so
after Scylla replies with a timeout some resources are still occupied.
The series fixes this by passing the permit down to storage_proxy where
it is held until all background work is completed.
Fixes#4768
"
* 'gleb/admission-v3' of github.com:scylladb/seastar-dev:
transport: add a metric to follow memory available for service permit.
storage_proxy: store a permit in a read executor
storage_proxy: store a permit in a write response handler
Pass service permit to storage_proxy
transport: introduce service_permit class and use it instead of semaphore_units
transport: hold admission a permit until a reply is sent
transport: remove cql server load balancer
A read executor exists until read operation completes in its entirety
so storing a permit there guaranties that it will be freed only after
no background work left for the request on this server.
A write response handler exists until write operation completes in its
entirety so storing a permit there guaranties that it will be freed only
after no background work left for the request on this server.
Current cql transport code acquire a permit before processing a query and
release it when the query gets a reply, but some quires leave work behind.
If the work is allowed to accumulate without any limit a server may
eventually run out of memory. To prevent that the permit system should
account for the background work as well. The patch is a first step in
this direction. It passes a permit down to storage proxy where it will
be later hold by background work.
Merged patch series from Avi Kivity:
In rare but valid cases (reconciling many tombstones, paging disabled),
a reconciled_result can grow large. This triggers large allocation
warnings. Switch to chunked_vector to avoid the large allocation.
In passing, fix chunked_vector's begin()/end() const correctness, and
add the reverse iterator function family which is needed by the conversion.
Fixes#4780.
Tests: unit (dev)
Commit Summary
utils: chunked_vector: make begin()/end() const correct
utils::chunked_vector: add rbegin() and related iterators
reconcilable_result: use chunked_vector to hold partitions
Fixes#4751
Adds using a preferred address family to dns name lookups related to
listen address and rpc address, adhering to the respective "prefer" options.
API, prometheus and broadcast address are all considered to be covered by
the "listen_interface_prefer_ipv6" option.
Note: scylla does not yet support actual interface binding, but these
options should apply equally to address name parameters.
Setting a "prefer_ipv6" option automtially enables ipv6 dns family query.
When a node is restarted, there is a race between gossip starts (other
nodes will mark this node up again and send requests) and the tokens are
replicated to other shards. Here is an example:
- n1, n2
- n2 is down, n1 think n2 is down
- n2 starts again, n2 starts gossip service, n1 thinks n2 is up and sends
reads/writes to n2, but n2 hasn't replicated the token_metadata to all
the shards.
- n2 complains:
token_metadata - sorted_tokens is empty in first_token_index!
token_metadata - sorted_tokens is empty in first_token_index!
token_metadata - sorted_tokens is empty in first_token_index!
token_metadata - sorted_tokens is empty in first_token_index!
token_metadata - sorted_tokens is empty in first_token_index!
token_metadata - sorted_tokens is empty in first_token_index!
storage_proxy - Failed to apply mutation from $ip#4: std::runtime_error
(sorted_tokens is empty in first_token_index!)
The code path looks like below:
0 stoarge_service::init_server
1 prepare_to_join()
2 add gossip application state of NET_VERSION, SCHEMA and so on.
3 _gossiper.start_gossiping().get()
4 join_token_ring()
5 _token_metadata.update_normal_tokens(tokens, get_broadcast_address());
6 replicate_to_all_cores().get()
7 storage_service::set_gossip_tokens() which adds the gossip application state of TOKENS and STATUS
The race talked above is at line 3 and line 6.
To fix, we can replicate the token_metadata early after it is filled
with the tokens read from system table before gossip starts. So that
when other nodes think this restarting node is up, the tokens are
already replicated to all the shards.
In addition, this patch also fixes the issue that other nodes might see
a node miss the TOKENS and STATUS application state in gossip if that
node failed in the middle of a restarting process, i.e., it is killed
after line 3 and before line 7. As a result we could not replace the
node.
Tests: update_cluster_layout_tests.py
Fixes: #4709Fixes: #4723
Usually, a reconcilable_result holds very few partitions (1 is common),
since the page size is limited by 1MB. But if we have paging disabled or
if we are reconciling a range full of tombstones, we may see many more.
This can cause large allocations.
Change to chunked_vector to prevent those large allocations, as they
can be quite expensive.
Fixes#4780.
"
A fix for #4338 "storage_proxy add a counter for cql requests
that arrived to a non replica"
Such requests should be tracked since forwarding them to a correct
replica can create a lot network noise and incur significant performance
penalty.
The current metrics are considered insufficient after introduction
of heat-weighted load balancing.
"
Fixes#4388.
* 'gh-4338' of https://github.com/kostja/scylla:
metrics: introduce a metric for non-local reads
metrics: account writes forwarded by a coordinator in an own metric.
Computed columns feature should be checked before creating
index schemas the new way - by adding computed column names
to system_schema.computed_columns.
The function announce_column_family_drop() drops (deletes) a base table
and all the materialized-views used for its secondary indexes, but not
other materialized views - if there are any, the operation refuses to
continue. This is exactly what CQL's "DROP TABLE" needs, because it is
not allowed to drop a table before manually dropping its views.
But there is no inherent reason why it we can't support an operation
to delete a table and *all* its views - not just those related to indexes.
This patch adds such an option to announce_column_family_drop().
This option is not used by the existing CQL layer, but can be used
by other code automating operations programatically without CQL.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190716150559.11806-1-nyh@scylladb.com>
A read which arrived to a non-replica and had to be forwarded to a
replica by the coordinator is accounted in an own metric,
reads_coordinator_outside_replica_set.
Most often such read is produced by a driver which is unaware of
token distribution on the ring.
If a read was forwarded to another replica due to heat weighted
load balancing or query preference set by the user, it's not accounted
in the metric.
In case of a multi-partition read (a query using IN statement,
e.g. x in (1, 2, 3)), if any of the keys is read from a
non-local node the read is accounted as a non-local.
The rationale behind it is that if the user tries to be careful and send
IN queries only to the same vnode, they are rewarded with the counter
staying at zero, while if they send multi-partition IN queries without
any precautions, they will see the metric go up which gives them a
starting point for investigating performance problems.
Closes#4338
Add a metric to account writes which arrived to a non-replica and
had to be forwarded by a coordinator to a replica.
The name of the added metric is 'writes_coordinator_outside_replica_set'.
Do not account forwarded read repair writes, since they are already
accounted by a reads_coordinator_outside_replica_set metric, added in a
subsequent patch.
In scope of #4338.
This series makes sure new schema is propagated to repair master and
follower nodes before repair.
Fixes#4575
* dev.git asias/repair_pull_schema_v2:
migration_manager: Add sync_schema
repair: Sync schema from follower nodes before repair
As soon as it's agreed that the cluster supports sstables in mc format,
infinite bound range deletions in statements can be safely enabled.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Recently, in merge commit 2718c90448,
we added the ability to cancel pending view-update requests when we detect
that the target node went down. This is important for view updates because
these have a very long timeout (5 minutes), and we wanted to make this
timeout even longer.
However, the implementation caused a race: Between *creating* the update's
request handler (create_write_response_handler()) and actually starting
the request with this handler (mutate_begin()), there is a preemption point
and we may end up deleting the request handler before starting the request.
So mutate_begin() must gracefully handle the case of a missing request
handler, and not crash with a segmentation fault as it did before this patch.
Eventually the lifetime management of request handlers could be refactored
to avoid this delicate fix (which requires more comments to explain than
code), or even better, it would be more correct to cancel individual writes
when a node goes down, not drop the entire handler (see issue #4523).
However, for now, let's not do such invasive changes and just fix bug that
we set out to fix.
Fixes#4386.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190620123949.22123-1-nyh@scylladb.com>
We saw a node crashing today with nodetool clearsnapshot being called.
After investigation, the reason is that nodetool clearsnapshot ws called
at the same time a new snapshot was created with the same tag. nodetool
clearsnapshot can't delete all files in the directory, because new files
had by then been created in that directory, and crashes on I/O error.
There are, many problems with allowing those operations to proceed in
parallel. Even if we fix the code not to crash and return an error on
directory non-empty, the moment they do any amount of work in parallel
the result of the operation becomes undefined. Some files in the
snapshot may have been deleted by clear, for example, and a user may
then not be able to properly restore from the backup if this snapshot
was used to generate a backup.
Moreover, although we could lock at the granularity of a keyspace or
column family, I think we should use a big hammer here and lock the
entire snapshot creation/deletion to avoid surprises (for example, if a
user requests creation of a snapshot for all keyspaces, and another
process requests clear of a single keyspace)
Fixes#4554
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190614174438.9002-1-glauber@scylladb.com>
Currently, each shard protects itself by not reading from rpc and the native
transport if in-flight requests consume too much memory for that shard. However,
if all shards then forward their requests to some other shard, then that shard
can easily run out of memory since its load can be multiplied by the number of
shards that send it requests.
To protect against this, use the new Seastar smp_service_group infrastructure.
We create three groups: read, write, and write ack (the latter is needed to
avoid ABBA deadlocks is shard A exhausts all its resources sending writes to shard B,
and shard B simulateously does the same; neither will be able to send
acknowledgements, so if the writes are throttled, they will never be unthrottled
until a timeout occurs).
Range scans are not addressed by this patch since they are handled by
multishard_mutation_query, which has its own complex cross-shard communication
scheme, but it be a similar solution.
Ref #1105 (missing range scan protection)
Tests: unit (dev)
Message-Id: <20190512142243.17795-1-avi@scylladb.com>
The calculation is done in a non preemptable loop over all tables, so if
numbers of tables is very large it may take a while since we also build
a string for gossiper state. Make the loop preemtable and also make
the string calculation more efficient by preallocating memory for it.
Message-Id: <20190516132748.6469-3-gleb@scylladb.com>
invoke_on_all() copies provided function for each shard it is executed
on, so by moving stats map into the capture we copy it for each shard
too. Avoid it by putting it into the top level object which is already
captured by reference.
Message-Id: <20190516132748.6469-2-gleb@scylladb.com>
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.
That's not a problem during normal cluster operation because the
tombstones will expire at all nodes at the same time, and schema
digest, although changes, will change to the same value on all nodes
at about the same time.
This fix changes digest calculation to not feed any digest for
partitions which are empty after compaction.
The digest returned by schema_mutations::digest() is left unchanged by
this patch. It affects the table schema version calculation. It's not
changed because the version is calculated on boot, where we don't yet
know all the cluster features. It's possible to fix this but it's more
complicated, so this patch defers that.
Refs #4485.
Asd
batchlog make copies of topology and endpoint array in batchlog endpoint
choosing code. There is a remark that at least endpoint copy is
deliberate because Cassandra code has it. We do not have to follow. Our
endpoint calculation code is atomic, so we can use a reference.
Message-Id: <20190506115815.GK21208@scylladb.com>
After 7c87405, schema sync includes system_schema.view_virtual_columns
in the schema digest. Old nodes don't know about this table and will
not include it in the digest calculation. As a result, there will be
schema disagreement until the whole cluster is upgraded.
Fix this by taking the new table into account only when the whole
cluster is upgraded.
The table should not be used for anything before this happens. This is
not currently enforced, but should be.
Fixes#4457.
Needed for determining if all nodes in the cluster are aware of the
new schema table. Only when all nodes are aware of it we can take it
into account when calculating schema digest, otherwise there would be
permanent schema disagreement in during rolling upgrade.
In order to avoid schema disagreements during upgrades (which may lead
to deadlocks), system distributed keyspace initialization is moved
right before starting the bootstrapping process, after the schema
agreement checks already succeeded.
Fixes#3976
Message-Id: <932e642659df1d00a2953df988f939a81275774a.1556204185.git.sarna@scylladb.com>
Currently stop returns ready future immediately. This is not a problem
since calculation loop holds a shared pointer to the local service, so
it will not be destroyed until calculation completes and global database
object db, that also used by the calculation, is never destroyed. But the
later is just a workaround for a shutdown sequence that cannot handle
it and will be changed one day. Make cache hitrate calculation service
ready for it.
Message-Id: <20190422113538.GR21208@scylladb.com>
non_system_filter lambda is defined static which means it is initialized
only once, so the 'this' that is will capture will belong to a shard
where the function runs first. During service destruction the function
may run on different shard and access already other's shard service that
may be already freed.
Fixed#4425
Message-Id: <20190421152139.GN21208@scylladb.com>
This requires introduction of storage_service::get_known_features
and using it with check_knows_remote_features.
Otherwise a node joining the existing cluster won't be able to
join because it does not support unbounded range tombstones yet.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
There were many calls to read_keyspace_mutation. One in each function
that prepares a mutation for some other schema change.
With this patch they are all moved to a single location.
Tests: unit (dev, debug)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190328024440.26201-1-espindola@scylladb.com>
"
This series removes the usage of the static gossiper object in init.cc
and storage_service.
Follow up series will remove more in other components. This is the
effort to clean up the component dependencies and have better shutdown
procedure.
Tests: tests/gossip_test, tests/cql_query_test, tests/sstable_mutation_test, dtests.
"
* tag 'asias/storage_service_gossiper_dep_v5' of github.com:cloudius-systems/seastar-dev:
storage_service: Do not use the global gms::get_local_gossiper()
storage_service: Pass gossiper object to storage_service
gms: Remove i_failure_detector.hh
gossip: Get rid of the gms::get_local_failure_detector static object
dht: Do not use failure_detector::is_alive in failure_detector_source_filter
tests: Fix stop snitch in gossip_test.cc
gossiper: Do not use value_factory from storage_service object
gossiper: Use cfg options from _cfg instead of get_local_storage_service
gossiper: Pass db::config object to gossiper class
init: Pass gossiper object to init_ms_fd_gossiper
When a view replica becomes unavailable, updates to it are stored as
hints at the paired based replica. This on-disk queue of pending view
updates grows as long as there are view updated and the view replica
remains unavailable. Currently, we take that relative queue size into
account when calculating the delay for new base writes, in the context
of the backpressure algorithm for materialized views.
However, the way we're calculating that on-disk backlog is wrong,
since we calculate it per-device and then feed it to all the hints
managers for that device. This means that normal hints will show up as
backlog for the view hints manager, which in turn introduces delays.
This can make the view backpressure mechanism kick-in even if the
cluster uses no materialized views.
There's yet another way in which considering the view hints backlog is
wrong: a view replica that is unavailable for some period of time can
cause the backlog to grow to a point where all base writes are applied
the maximum delay of 1 second. This turns a single-node failure into
cluster unavailability.
The fix to both issues is to simply not take this on-disk backlog into
account for the backpressure algorithm.
Fixes#4351Fixes#4352
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190321170418.25953-1-duarte@scylladb.com>