Commit Graph

1446 Commits

Author SHA1 Message Date
Gleb Natapov
1779c3b7f6 move admission control semaphore from cql server to storage_service
There are two reasons for the move. First is that cql server lifetime
is shorter than storage_proxy one and the later stores references to
the semaphore in each service_permit it holds. Second - we want thrift
(and in the future other user APIs) to share the same admission control
memory pool.

Fixes #4844

Message-Id: <20190814142614.GT17984@scylladb.com>
2019-08-14 18:49:56 +03:00
Gleb Natapov
a1e9e6faa2 storage_service: remove outdated comment
We in fact do stop cql server in storage_service::drain_on_shutdown()
which is called in main.cc during shutdown.

Message-Id: <20190814085027.GP17984@scylladb.com>
2019-08-14 11:52:49 +03:00
Gleb Natapov
00c4078af3 cache_hitrate_calculator: do not ignore a future returned from gossiper::add_local_application_state
We should wait for a future returned from add_local_application_state() to
resolve before issuing new calculation, otherwise two
add_local_application_state() may run simultaneously for the same state.

Fixes #4838.

Message-Id: <20190812082158.GE17984@scylladb.com>
2019-08-13 11:48:38 +03:00
Avi Kivity
e6cde72d2b Merge "Fix cql server admission control to take all leftover work into account" from Gleb
"
Current admission control takes a permit when cql requests starts and
releases it when reply is sent, but some requests may leave background
work behind after that point (some because there is genuine background
work to do like complete a write or do a read repair, and some because
a read/write may stuck in a queue longer than the request's timeout), so
after Scylla replies with a timeout some resources are still occupied.

The series fixes this by passing the permit down to storage_proxy where
it is held until all background work is completed.

Fixes #4768
"

* 'gleb/admission-v3' of github.com:scylladb/seastar-dev:
  transport: add a metric to follow memory available for service permit.
  storage_proxy: store a permit in a read executor
  storage_proxy: store a permit in a write response handler
  Pass service permit to storage_proxy
  transport: introduce service_permit class and use it instead of semaphore_units
  transport: hold admission a permit until a reply is sent
  transport: remove cql server load balancer
2019-08-12 11:02:37 +03:00
Gleb Natapov
7d7b1685aa storage_proxy: store a permit in a read executor
A read executor exists until read operation completes in its entirety
so storing a permit there guaranties that it will be freed only after
no background work left for the request on this server.
2019-08-12 10:20:43 +03:00
Gleb Natapov
d5ced800f0 storage_proxy: store a permit in a write response handler
A write response handler exists until write operation completes in its
entirety so storing a permit there guaranties that it will be freed only
after no background work left for the request on this server.
2019-08-12 10:20:43 +03:00
Gleb Natapov
6a4207f202 Pass service permit to storage_proxy
Current cql transport code acquire a permit before processing a query and
release it when the query gets a reply, but some quires leave work behind.
If the work is allowed to accumulate without any limit a server may
eventually run out of memory. To prevent that the permit system should
account for the background work as well. The patch is a first step in
this direction. It passes a permit down to storage proxy where it will
be later hold by background work.
2019-08-12 10:20:43 +03:00
Gleb Natapov
7e3805ed3d transport: remove cql server load balancer
It is buggy, unused and unnecessary complicates the code.
2019-08-11 16:08:52 +03:00
Nadav Har'El
f9d6eaf5ff reconcilable_result: switch to chunked_vector
Merged patch series from Avi Kivity:

In rare but valid cases (reconciling many tombstones, paging disabled),
a reconciled_result can grow large. This triggers large allocation
warnings. Switch to chunked_vector to avoid the large allocation.
In passing, fix chunked_vector's begin()/end() const correctness, and
add the reverse iterator function family which is needed by the conversion.

Fixes #4780.

Tests: unit (dev)

Commit Summary

    utils: chunked_vector: make begin()/end() const correct
    utils::chunked_vector: add rbegin() and related iterators
    reconcilable_result: use chunked_vector to hold partitions
2019-08-11 16:03:13 +03:00
Calle Wilund
6c62e5741e init: Use the "prefer_ipv6" options available for rpc/listen address/interface
Fixes #4751

Adds using a preferred address family to dns name lookups related to
listen address and rpc address, adhering to the respective "prefer" options.

API, prometheus and broadcast address are all considered to be covered by
the "listen_interface_prefer_ipv6" option.

Note: scylla does not yet support actual interface binding, but these
options should apply equally to address name parameters.

Setting a "prefer_ipv6" option automtially enables ipv6 dns family query.
2019-08-06 08:32:10 +00:00
Asias He
3b39a59135 storage_service: Replicate and advertise tokens early in the boot up process
When a node is restarted, there is a race between gossip starts (other
nodes will mark this node up again and send requests) and the tokens are
replicated to other shards. Here is an example:

- n1, n2
- n2 is down, n1 think n2 is down
- n2 starts again, n2 starts gossip service, n1 thinks n2 is up and sends
  reads/writes to n2, but n2 hasn't replicated the token_metadata to all
  the shards.
- n2 complains:
  token_metadata - sorted_tokens is empty in first_token_index!
  token_metadata - sorted_tokens is empty in first_token_index!
  token_metadata - sorted_tokens is empty in first_token_index!
  token_metadata - sorted_tokens is empty in first_token_index!
  token_metadata - sorted_tokens is empty in first_token_index!
  token_metadata - sorted_tokens is empty in first_token_index!
  storage_proxy - Failed to apply mutation from $ip#4: std::runtime_error
  (sorted_tokens is empty in first_token_index!)

The code path looks like below:

0 stoarge_service::init_server
1    prepare_to_join()
2          add gossip application state of NET_VERSION, SCHEMA and so on.
3         _gossiper.start_gossiping().get()
4    join_token_ring()
5           _token_metadata.update_normal_tokens(tokens, get_broadcast_address());
6           replicate_to_all_cores().get()
7           storage_service::set_gossip_tokens() which adds the gossip application state of TOKENS and STATUS

The race talked above is at line 3 and line 6.

To fix, we can replicate the token_metadata early after it is filled
with the tokens read from system table before gossip starts. So that
when other nodes think this restarting node is up, the tokens are
already replicated to all the shards.

In addition, this patch also fixes the issue that other nodes might see
a node miss the TOKENS and STATUS application state in gossip if that
node failed in the middle of a restarting process, i.e., it is killed
after line 3 and before line 7. As a result we could not replace the
node.

Tests: update_cluster_layout_tests.py
Fixes: #4709
Fixes: #4723
2019-08-04 15:18:31 +03:00
Avi Kivity
093d2cd7e5 reconcilable_result: use chunked_vector to hold partitions
Usually, a reconcilable_result holds very few partitions (1 is common),
since the page size is limited by 1MB. But if we have paging disabled or
if we are reconciling a range full of tombstones, we may see many more.
This can cause large allocations.

Change to chunked_vector to prevent those large allocations, as they
can be quite expensive.

Fixes #4780.
2019-08-01 18:49:13 +03:00
Avi Kivity
47b0f40d27 Merge "introduce metrics for non-local queries" from Konstantin
"
A fix for #4338 "storage_proxy add a counter for cql requests
that arrived to a non replica"

Such requests should be tracked since forwarding them to a correct
replica can create a lot network noise and incur significant performance
penalty.

The current metrics are considered insufficient after introduction
of heat-weighted load balancing.
"

Fixes #4388.

* 'gh-4338' of https://github.com/kostja/scylla:
  metrics: introduce a metric for non-local reads
  metrics: account writes forwarded by a coordinator in an own metric.
2019-08-01 10:09:33 +03:00
Piotr Sarna
a0e02df36a service: add computed columns feature
Computed columns feature should be checked before creating
index schemas the new way - by adding computed column names
to system_schema.computed_columns.
2019-07-19 11:58:42 +02:00
Nadav Har'El
997b92a666 migration_manager: allow dropping table and all its views
The function announce_column_family_drop() drops (deletes) a base table
and all the materialized-views used for its secondary indexes, but not
other materialized views - if there are any, the operation refuses to
continue. This is exactly what CQL's "DROP TABLE" needs, because it is
not allowed to drop a table before manually dropping its views.

But there is no inherent reason why it we can't support an operation
to delete a table and *all* its views - not just those related to indexes.
This patch adds such an option to announce_column_family_drop().
This option is not used by the existing CQL layer, but can be used
by other code automating operations programatically without CQL.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190716150559.11806-1-nyh@scylladb.com>
2019-07-18 13:26:25 +02:00
Konstantin Osipov
56f3bda4c7 metrics: introduce a metric for non-local reads
A read which arrived to a non-replica and had to be forwarded to a
replica by the coordinator is accounted in an own metric,
reads_coordinator_outside_replica_set.
Most often such read is produced by a driver which is unaware of
token distribution on the ring.

If a read was forwarded to another replica due to heat weighted
load balancing or query preference set by the user, it's not accounted
in the metric.

In case of a multi-partition read (a query using IN statement,
e.g. x in (1, 2, 3)), if any of the keys is read from a
non-local node the read is accounted as a non-local.
The rationale behind it is that if the user tries to be careful and send
IN queries only to the same vnode, they are rewarded with the counter
staying at zero, while if they send multi-partition IN queries without
any precautions, they will see the metric go up which gives them a
starting point for investigating performance problems.

Closes #4338
2019-07-08 19:23:38 +03:00
Konstantin Osipov
da1d1b74da metrics: account writes forwarded by a coordinator in an own metric.
Add a metric to account writes which arrived to a non-replica and
had to be forwarded by a coordinator to a replica.

The name of the added metric is 'writes_coordinator_outside_replica_set'.

Do not account forwarded read repair writes, since they are already
accounted by a reads_coordinator_outside_replica_set metric, added in a
subsequent patch.

In scope of #4338.
2019-07-08 18:17:48 +03:00
Calle Wilund
4ef940169f Replace use of "ipv4_addr" with socket_address
Allows the various sockets to use ipv6 address binding if so configured.
2019-07-08 14:13:09 +00:00
Tomasz Grabiec
269e65a8db Merge "Sync schema before repair" from Asias
This series makes sure new schema is propagated to repair master and
follower nodes before repair.

Fixes #4575

* dev.git asias/repair_pull_schema_v2:
  migration_manager: Add sync_schema
  repair: Sync schema from follower nodes before repair
2019-06-25 19:05:29 +03:00
Asias He
14c1a71860 migration_manager: Add sync_schema
Makes sure this node knows about all schema changes known by
"nodes" that were made prior to this call.

Refs: #4575
Backports: 3.1
2019-06-25 17:13:47 +08:00
Piotr Sarna
c19fdc4c90 service: enable infinite bound range deletions with mc
As soon as it's agreed that the cluster supports sstables in mc format,
infinite bound range deletions in statements can be safely enabled.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-06-24 15:58:28 +03:00
Nadav Har'El
6e87bca65d storage_proxy: fix race and crash in case of MV and other node shutdown
Recently, in merge commit 2718c90448,
we added the ability to cancel pending view-update requests when we detect
that the target node went down. This is important for view updates because
these have a very long timeout (5 minutes), and we wanted to make this
timeout even longer.

However, the implementation caused a race: Between *creating* the update's
request handler (create_write_response_handler()) and actually starting
the request with this handler (mutate_begin()), there is a preemption point
and we may end up deleting the request handler before starting the request.
So mutate_begin() must gracefully handle the case of a missing request
handler, and not crash with a segmentation fault as it did before this patch.

Eventually the lifetime management of request handlers could be refactored
to avoid this delicate fix (which requires more comments to explain than
code), or even better, it would be more correct to cancel individual writes
when a node goes down, not drop the entire handler (see issue #4523).
However, for now, let's not do such invasive changes and just fix bug that
we set out to fix.

Fixes #4386.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190620123949.22123-1-nyh@scylladb.com>
2019-06-23 16:03:06 +03:00
Glauber Costa
91b71a0b1a do not allow multiple snapshot operations at the same time
We saw a node crashing today with nodetool clearsnapshot being called.
After investigation, the reason is that nodetool clearsnapshot ws called
at the same time a new snapshot was created with the same tag. nodetool
clearsnapshot can't delete all files in the directory, because new files
had by then been created in that directory, and crashes on I/O error.

There are, many problems with allowing those operations to proceed in
parallel. Even if we fix the code not to crash and return an error on
directory non-empty, the moment they do any amount of work in parallel
the result of the operation becomes undefined. Some files in the
snapshot may have been deleted by clear, for example, and a user may
then not be able to properly restore from the backup if this snapshot
was used to generate a backup.

Moreover, although we could lock at the granularity of a keyspace or
column family, I think we should use a big hammer here and lock the
entire snapshot creation/deletion to avoid surprises (for example, if a
user requests creation of a snapshot for all keyspaces, and another
process requests clear of a single keyspace)

Fixes #4554

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190614174438.9002-1-glauber@scylladb.com>
2019-06-16 10:30:13 +03:00
Gleb Natapov
9213d56a06 storage_proxy: align background and foreground repair metric names
One is plural another is not, make them all plural.

Message-Id: <20190605135940.GI25001@scylladb.com>
2019-06-10 11:34:36 +03:00
Avi Kivity
591d2968cc storage_proxy: limit resources consumed in cross-shard operations
Currently, each shard protects itself by not reading from rpc and the native
transport if in-flight requests consume too much memory for that shard. However,
if all shards then forward their requests to some other shard, then that shard
can easily run out of memory since its load can be multiplied by the number of
shards that send it requests.

To protect against this, use the new Seastar smp_service_group infrastructure.
We create three groups: read, write, and write ack (the latter is needed to
avoid ABBA deadlocks is shard A exhausts all its resources sending writes to shard B,
and shard B simulateously does the same; neither will be able to send
acknowledgements, so if the writes are throttled, they will never be unthrottled
until a timeout occurs).

Range scans are not addressed by this patch since they are handled by
multishard_mutation_query, which has its own complex cross-shard communication
scheme, but it be a similar solution.

Ref #1105 (missing range scan protection)

Tests: unit (dev)
Message-Id: <20190512142243.17795-1-avi@scylladb.com>
2019-06-07 10:53:23 +02:00
Konstantin Osipov
29c27bfc28 storage_proxy: remove unnecessary lambdas in metrics binding
Remove unnecessasry lambdas when binding metrics of the storage proxy.
Message-Id: <20190603133753.1724-1-kostja@scylladb.com>
2019-06-03 16:55:16 +03:00
Gleb Natapov
31bf4cfb5e cache_hitrate_calculator: make cache hitrate calculation preemptable
The calculation is done in a non preemptable loop over all tables, so if
numbers of tables is very large it may take a while since we also build
a string for gossiper state. Make the loop preemtable and also make
the string calculation more efficient by preallocating memory for it.
Message-Id: <20190516132748.6469-3-gleb@scylladb.com>
2019-05-16 15:32:36 +02:00
Gleb Natapov
4517c56a57 cache_hitrate_calculator: do not copy stats map for each cpu
invoke_on_all() copies provided function for each shard it is executed
on, so by moving stats map into the capture we copy it for each shard
too. Avoid it by putting it into the top level object which is already
captured by reference.
Message-Id: <20190516132748.6469-2-gleb@scylladb.com>
2019-05-16 15:32:24 +02:00
Tomasz Grabiec
9de071d214 schema_tables, storage_service: Make schema digest insensitive to expired tombstones in empty partition
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.

That's not a problem during normal cluster operation because the
tombstones will expire at all nodes at the same time, and schema
digest, although changes, will change to the same value on all nodes
at about the same time.

This fix changes digest calculation to not feed any digest for
partitions which are empty after compaction.

The digest returned by schema_mutations::digest() is left unchanged by
this patch. It affects the table schema version calculation. It's not
changed because the version is calculated on boot, where we don't yet
know all the cluster features. It's possible to fix this but it's more
complicated, so this patch defers that.

Refs #4485.

Asd
2019-05-14 10:43:06 +02:00
Avi Kivity
985a30a01c storage_proxy: fix pessimizing moves
Remove pessimizing moves, as reported by gcc 9.
2019-05-07 09:56:09 +03:00
Gleb Natapov
b8188e1e2f storage_proxy: avoid copying of a topology and endpoint array in batchlog code
batchlog make copies of topology and endpoint array in batchlog endpoint
choosing code. There is a remark that at least endpoint copy is
deliberate because Cassandra code has it. We do not have to follow. Our
endpoint calculation code is atomic, so we can use a reference.

Message-Id: <20190506115815.GK21208@scylladb.com>
2019-05-06 17:36:50 +03:00
Tomasz Grabiec
c96ee9882b db/schema_tables: Include view_virtual_columns in the digest only when all nodes do
After 7c87405, schema sync includes system_schema.view_virtual_columns
in the schema digest. Old nodes don't know about this table and will
not include it in the digest calculation. As a result, there will be
schema disagreement until the whole cluster is upgraded.

Fix this by taking the new table into account only when the whole
cluster is upgraded.

The table should not be used for anything before this happens. This is
not currently enforced, but should be.

Fixes #4457.
2019-04-28 15:50:13 +02:00
Tomasz Grabiec
a108df09f9 storage_service: Introduce the VIEW_VIRTUAL_COLUMNS cluster feature
Needed for determining if all nodes in the cluster are aware of the
new schema table. Only when all nodes are aware of it we can take it
into account when calculating schema digest, otherwise there would be
permanent schema disagreement in during rolling upgrade.
2019-04-28 15:50:13 +02:00
Tomasz Grabiec
3cb7b2d72e treewide: Propagate schema_features to db::schema::all_tables() 2019-04-28 15:50:13 +02:00
Tomasz Grabiec
1d9b88dceb service/storage_service: Introduce cluster_schema_features() 2019-04-28 15:50:12 +02:00
Tomasz Grabiec
6e2c190b5f schema_tables: Propagate storage_service& to merge_schema()
We will need to calculate cluster schema features at the time we
calculate the schema digest.
2019-04-28 12:33:10 +02:00
Piotr Sarna
037b517c85 service: initialize system distributed keyspace after schema agreement
In order to avoid schema disagreements during upgrades (which may lead
to deadlocks), system distributed keyspace initialization is moved
right before starting the bootstrapping process, after the schema
agreement checks already succeeded.

Fixes #3976
Message-Id: <932e642659df1d00a2953df988f939a81275774a.1556204185.git.sarna@scylladb.com>
2019-04-25 18:44:08 +02:00
Gleb Natapov
c6b3b9ff13 cache_hitrate_calculator: wait for ongoing calculation to complete during stop
Currently stop returns ready future immediately. This is not a problem
since calculation loop holds a shared pointer to the local service, so
it will not be destroyed until calculation completes and global database
object db, that also used by the calculation, is never destroyed. But the
later is just a workaround for a shutdown sequence that cannot handle
it and will be changed one day. Make cache hitrate calculation service
ready for it.

Message-Id: <20190422113538.GR21208@scylladb.com>
2019-04-22 14:44:42 +03:00
Gleb Natapov
306f5b99b5 cache_hitrate_calculator: fix use after free in non_system_filter lambda
non_system_filter lambda is defined static which means it is initialized
only once, so the 'this' that is will capture will belong to a shard
where the function runs first. During service destruction the function
may run on different shard and access already other's shard service that
may be already freed.

Fixed #4425

Message-Id: <20190421152139.GN21208@scylladb.com>
2019-04-21 18:22:31 +03:00
Piotr Jastrzebski
2c599122e1 Update supported features on format change
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-04-12 10:38:31 +02:00
Piotr Jastrzebski
9c7e3dd470 Add _unbounded_range_tombstones_feature
This requires introduction of storage_service::get_known_features
and using it with check_knows_remote_features.
Otherwise a node joining the existing cluster won't be able to
join because it does not support unbounded range tombstones yet.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-04-12 10:37:12 +02:00
Piotr Jastrzebski
96ad8f7df9 Use _sstables_format to determine current format
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-04-12 10:37:12 +02:00
Piotr Jastrzebski
7339e9de30 Add service::read_sstables_format
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-04-12 10:37:12 +02:00
Piotr Jastrzebski
9934740c39 Register feature listeners in storage_service
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-04-12 10:36:58 +02:00
Piotr Jastrzebski
081542cf00 storage_service: add _sstables_format field
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-04-12 09:33:40 +02:00
Rafael Ávila de Espíndola
6191fd7701 Avoid duplicated read_keyspace_mutation calls
There were many calls to read_keyspace_mutation. One in each function
that prepares a mutation for some other schema change.

With this patch they are all moved to a single location.

Tests: unit (dev, debug)

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190328024440.26201-1-espindola@scylladb.com>
2019-04-07 09:26:56 +03:00
Vlad Zolotarov
0dc0a6025d query_pager::fetch_page: cosmetics: fix code alignment
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20190401214030.5570-2-vladz@scylladb.com>
2019-04-02 11:53:10 +03:00
Avi Kivity
f259a4c3b4 Merge "Remove usage of static gossiper object in init.cc and storage_service" from Asias
"
This series removes the usage of the static gossiper object in init.cc
and storage_service.

Follow up series will remove more in other components. This is the
effort to clean up the component dependencies and have better shutdown
procedure.

Tests: tests/gossip_test, tests/cql_query_test, tests/sstable_mutation_test,  dtests.
"

* tag 'asias/storage_service_gossiper_dep_v5' of github.com:cloudius-systems/seastar-dev:
  storage_service: Do not use the global gms::get_local_gossiper()
  storage_service: Pass gossiper object to storage_service
  gms: Remove i_failure_detector.hh
  gossip: Get rid of the gms::get_local_failure_detector static object
  dht: Do not use failure_detector::is_alive in failure_detector_source_filter
  tests: Fix stop snitch in gossip_test.cc
  gossiper: Do not use value_factory from storage_service object
  gossiper: Use cfg options from _cfg instead of get_local_storage_service
  gossiper: Pass db::config object to gossiper class
  init: Pass gossiper object to init_ms_fd_gossiper
2019-03-26 08:54:46 +02:00
Duarte Nunes
93a1c27b31 service/storage_proxy: Don't consider view hints for MV backpressure
When a view replica becomes unavailable, updates to it are stored as
hints at the paired based replica. This on-disk queue of pending view
updates grows as long as there are view updated and the view replica
remains unavailable. Currently, we take that relative queue size into
account when calculating the delay for new base writes, in the context
of the backpressure algorithm for materialized views.

However, the way we're calculating that on-disk backlog is wrong,
since we calculate it per-device and then feed it to all the hints
managers for that device. This means that normal hints will show up as
backlog for the view hints manager, which in turn introduces delays.
This can make the view backpressure mechanism kick-in even if the
cluster uses no materialized views.

There's yet another way in which considering the view hints backlog is
wrong: a view replica that is unavailable for some period of time can
cause the backlog to grow to a point where all base writes are applied
the maximum delay of 1 second. This turns a single-node failure into
cluster unavailability.

The fix to both issues is to simply not take this on-disk backlog into
account for the backpressure algorithm.

Fixes #4351
Fixes #4352

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190321170418.25953-1-duarte@scylladb.com>
2019-03-24 20:29:56 +02:00
Asias He
7447c92d63 storage_service: Do not use the global gms::get_local_gossiper()
Use the gossiper object stored in _gossiper member from storage_service.
2019-03-22 09:11:26 +08:00