scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-03 21:47:10 +00:00

Author	SHA1	Message	Date
Gleb Natapov	1779c3b7f6	move admission control semaphore from cql server to storage_service There are two reasons for the move. First is that cql server lifetime is shorter than storage_proxy one and the later stores references to the semaphore in each service_permit it holds. Second - we want thrift (and in the future other user APIs) to share the same admission control memory pool. Fixes #4844 Message-Id: <20190814142614.GT17984@scylladb.com>	2019-08-14 18:49:56 +03:00
Gleb Natapov	a1e9e6faa2	storage_service: remove outdated comment We in fact do stop cql server in storage_service::drain_on_shutdown() which is called in main.cc during shutdown. Message-Id: <20190814085027.GP17984@scylladb.com>	2019-08-14 11:52:49 +03:00
Gleb Natapov	00c4078af3	cache_hitrate_calculator: do not ignore a future returned from gossiper::add_local_application_state We should wait for a future returned from add_local_application_state() to resolve before issuing new calculation, otherwise two add_local_application_state() may run simultaneously for the same state. Fixes #4838. Message-Id: <20190812082158.GE17984@scylladb.com>	2019-08-13 11:48:38 +03:00
Avi Kivity	e6cde72d2b	Merge "Fix cql server admission control to take all leftover work into account" from Gleb " Current admission control takes a permit when cql requests starts and releases it when reply is sent, but some requests may leave background work behind after that point (some because there is genuine background work to do like complete a write or do a read repair, and some because a read/write may stuck in a queue longer than the request's timeout), so after Scylla replies with a timeout some resources are still occupied. The series fixes this by passing the permit down to storage_proxy where it is held until all background work is completed. Fixes #4768 " * 'gleb/admission-v3' of github.com:scylladb/seastar-dev: transport: add a metric to follow memory available for service permit. storage_proxy: store a permit in a read executor storage_proxy: store a permit in a write response handler Pass service permit to storage_proxy transport: introduce service_permit class and use it instead of semaphore_units transport: hold admission a permit until a reply is sent transport: remove cql server load balancer	2019-08-12 11:02:37 +03:00
Gleb Natapov	7d7b1685aa	storage_proxy: store a permit in a read executor A read executor exists until read operation completes in its entirety so storing a permit there guaranties that it will be freed only after no background work left for the request on this server.	2019-08-12 10:20:43 +03:00
Gleb Natapov	d5ced800f0	storage_proxy: store a permit in a write response handler A write response handler exists until write operation completes in its entirety so storing a permit there guaranties that it will be freed only after no background work left for the request on this server.	2019-08-12 10:20:43 +03:00
Gleb Natapov	6a4207f202	Pass service permit to storage_proxy Current cql transport code acquire a permit before processing a query and release it when the query gets a reply, but some quires leave work behind. If the work is allowed to accumulate without any limit a server may eventually run out of memory. To prevent that the permit system should account for the background work as well. The patch is a first step in this direction. It passes a permit down to storage proxy where it will be later hold by background work.	2019-08-12 10:20:43 +03:00
Gleb Natapov	7e3805ed3d	transport: remove cql server load balancer It is buggy, unused and unnecessary complicates the code.	2019-08-11 16:08:52 +03:00
Nadav Har'El	f9d6eaf5ff	reconcilable_result: switch to chunked_vector Merged patch series from Avi Kivity: In rare but valid cases (reconciling many tombstones, paging disabled), a reconciled_result can grow large. This triggers large allocation warnings. Switch to chunked_vector to avoid the large allocation. In passing, fix chunked_vector's begin()/end() const correctness, and add the reverse iterator function family which is needed by the conversion. Fixes #4780. Tests: unit (dev) Commit Summary utils: chunked_vector: make begin()/end() const correct utils::chunked_vector: add rbegin() and related iterators reconcilable_result: use chunked_vector to hold partitions	2019-08-11 16:03:13 +03:00
Calle Wilund	6c62e5741e	init: Use the "prefer_ipv6" options available for rpc/listen address/interface Fixes #4751 Adds using a preferred address family to dns name lookups related to listen address and rpc address, adhering to the respective "prefer" options. API, prometheus and broadcast address are all considered to be covered by the "listen_interface_prefer_ipv6" option. Note: scylla does not yet support actual interface binding, but these options should apply equally to address name parameters. Setting a "prefer_ipv6" option automtially enables ipv6 dns family query.	2019-08-06 08:32:10 +00:00
Asias He	3b39a59135	storage_service: Replicate and advertise tokens early in the boot up process When a node is restarted, there is a race between gossip starts (other nodes will mark this node up again and send requests) and the tokens are replicated to other shards. Here is an example: - n1, n2 - n2 is down, n1 think n2 is down - n2 starts again, n2 starts gossip service, n1 thinks n2 is up and sends reads/writes to n2, but n2 hasn't replicated the token_metadata to all the shards. - n2 complains: token_metadata - sorted_tokens is empty in first_token_index! token_metadata - sorted_tokens is empty in first_token_index! token_metadata - sorted_tokens is empty in first_token_index! token_metadata - sorted_tokens is empty in first_token_index! token_metadata - sorted_tokens is empty in first_token_index! token_metadata - sorted_tokens is empty in first_token_index! storage_proxy - Failed to apply mutation from $ip#4: std::runtime_error (sorted_tokens is empty in first_token_index!) The code path looks like below: 0 stoarge_service::init_server 1 prepare_to_join() 2 add gossip application state of NET_VERSION, SCHEMA and so on. 3 _gossiper.start_gossiping().get() 4 join_token_ring() 5 _token_metadata.update_normal_tokens(tokens, get_broadcast_address()); 6 replicate_to_all_cores().get() 7 storage_service::set_gossip_tokens() which adds the gossip application state of TOKENS and STATUS The race talked above is at line 3 and line 6. To fix, we can replicate the token_metadata early after it is filled with the tokens read from system table before gossip starts. So that when other nodes think this restarting node is up, the tokens are already replicated to all the shards. In addition, this patch also fixes the issue that other nodes might see a node miss the TOKENS and STATUS application state in gossip if that node failed in the middle of a restarting process, i.e., it is killed after line 3 and before line 7. As a result we could not replace the node. Tests: update_cluster_layout_tests.py Fixes: #4709 Fixes: #4723	2019-08-04 15:18:31 +03:00
Avi Kivity	093d2cd7e5	reconcilable_result: use chunked_vector to hold partitions Usually, a reconcilable_result holds very few partitions (1 is common), since the page size is limited by 1MB. But if we have paging disabled or if we are reconciling a range full of tombstones, we may see many more. This can cause large allocations. Change to chunked_vector to prevent those large allocations, as they can be quite expensive. Fixes #4780.	2019-08-01 18:49:13 +03:00
Avi Kivity	47b0f40d27	Merge "introduce metrics for non-local queries" from Konstantin " A fix for #4338 "storage_proxy add a counter for cql requests that arrived to a non replica" Such requests should be tracked since forwarding them to a correct replica can create a lot network noise and incur significant performance penalty. The current metrics are considered insufficient after introduction of heat-weighted load balancing. " Fixes #4388. * 'gh-4338' of https://github.com/kostja/scylla: metrics: introduce a metric for non-local reads metrics: account writes forwarded by a coordinator in an own metric.	2019-08-01 10:09:33 +03:00
Piotr Sarna	a0e02df36a	service: add computed columns feature Computed columns feature should be checked before creating index schemas the new way - by adding computed column names to system_schema.computed_columns.	2019-07-19 11:58:42 +02:00
Nadav Har'El	997b92a666	migration_manager: allow dropping table and all its views The function announce_column_family_drop() drops (deletes) a base table and all the materialized-views used for its secondary indexes, but not other materialized views - if there are any, the operation refuses to continue. This is exactly what CQL's "DROP TABLE" needs, because it is not allowed to drop a table before manually dropping its views. But there is no inherent reason why it we can't support an operation to delete a table and all its views - not just those related to indexes. This patch adds such an option to announce_column_family_drop(). This option is not used by the existing CQL layer, but can be used by other code automating operations programatically without CQL. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20190716150559.11806-1-nyh@scylladb.com>	2019-07-18 13:26:25 +02:00
Konstantin Osipov	56f3bda4c7	metrics: introduce a metric for non-local reads A read which arrived to a non-replica and had to be forwarded to a replica by the coordinator is accounted in an own metric, reads_coordinator_outside_replica_set. Most often such read is produced by a driver which is unaware of token distribution on the ring. If a read was forwarded to another replica due to heat weighted load balancing or query preference set by the user, it's not accounted in the metric. In case of a multi-partition read (a query using IN statement, e.g. x in (1, 2, 3)), if any of the keys is read from a non-local node the read is accounted as a non-local. The rationale behind it is that if the user tries to be careful and send IN queries only to the same vnode, they are rewarded with the counter staying at zero, while if they send multi-partition IN queries without any precautions, they will see the metric go up which gives them a starting point for investigating performance problems. Closes #4338	2019-07-08 19:23:38 +03:00
Konstantin Osipov	da1d1b74da	metrics: account writes forwarded by a coordinator in an own metric. Add a metric to account writes which arrived to a non-replica and had to be forwarded by a coordinator to a replica. The name of the added metric is 'writes_coordinator_outside_replica_set'. Do not account forwarded read repair writes, since they are already accounted by a reads_coordinator_outside_replica_set metric, added in a subsequent patch. In scope of #4338.	2019-07-08 18:17:48 +03:00
Calle Wilund	4ef940169f	Replace use of "ipv4_addr" with socket_address Allows the various sockets to use ipv6 address binding if so configured.	2019-07-08 14:13:09 +00:00
Tomasz Grabiec	269e65a8db	Merge "Sync schema before repair" from Asias This series makes sure new schema is propagated to repair master and follower nodes before repair. Fixes #4575 * dev.git asias/repair_pull_schema_v2: migration_manager: Add sync_schema repair: Sync schema from follower nodes before repair	2019-06-25 19:05:29 +03:00
Asias He	14c1a71860	migration_manager: Add sync_schema Makes sure this node knows about all schema changes known by "nodes" that were made prior to this call. Refs: #4575 Backports: 3.1	2019-06-25 17:13:47 +08:00
Piotr Sarna	c19fdc4c90	service: enable infinite bound range deletions with mc As soon as it's agreed that the cluster supports sstables in mc format, infinite bound range deletions in statements can be safely enabled. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-06-24 15:58:28 +03:00
Nadav Har'El	6e87bca65d	storage_proxy: fix race and crash in case of MV and other node shutdown Recently, in merge commit `2718c90448`, we added the ability to cancel pending view-update requests when we detect that the target node went down. This is important for view updates because these have a very long timeout (5 minutes), and we wanted to make this timeout even longer. However, the implementation caused a race: Between creating the update's request handler (create_write_response_handler()) and actually starting the request with this handler (mutate_begin()), there is a preemption point and we may end up deleting the request handler before starting the request. So mutate_begin() must gracefully handle the case of a missing request handler, and not crash with a segmentation fault as it did before this patch. Eventually the lifetime management of request handlers could be refactored to avoid this delicate fix (which requires more comments to explain than code), or even better, it would be more correct to cancel individual writes when a node goes down, not drop the entire handler (see issue #4523). However, for now, let's not do such invasive changes and just fix bug that we set out to fix. Fixes #4386. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20190620123949.22123-1-nyh@scylladb.com>	2019-06-23 16:03:06 +03:00
Glauber Costa	91b71a0b1a	do not allow multiple snapshot operations at the same time We saw a node crashing today with nodetool clearsnapshot being called. After investigation, the reason is that nodetool clearsnapshot ws called at the same time a new snapshot was created with the same tag. nodetool clearsnapshot can't delete all files in the directory, because new files had by then been created in that directory, and crashes on I/O error. There are, many problems with allowing those operations to proceed in parallel. Even if we fix the code not to crash and return an error on directory non-empty, the moment they do any amount of work in parallel the result of the operation becomes undefined. Some files in the snapshot may have been deleted by clear, for example, and a user may then not be able to properly restore from the backup if this snapshot was used to generate a backup. Moreover, although we could lock at the granularity of a keyspace or column family, I think we should use a big hammer here and lock the entire snapshot creation/deletion to avoid surprises (for example, if a user requests creation of a snapshot for all keyspaces, and another process requests clear of a single keyspace) Fixes #4554 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20190614174438.9002-1-glauber@scylladb.com>	2019-06-16 10:30:13 +03:00
Gleb Natapov	9213d56a06	storage_proxy: align background and foreground repair metric names One is plural another is not, make them all plural. Message-Id: <20190605135940.GI25001@scylladb.com>	2019-06-10 11:34:36 +03:00
Avi Kivity	591d2968cc	storage_proxy: limit resources consumed in cross-shard operations Currently, each shard protects itself by not reading from rpc and the native transport if in-flight requests consume too much memory for that shard. However, if all shards then forward their requests to some other shard, then that shard can easily run out of memory since its load can be multiplied by the number of shards that send it requests. To protect against this, use the new Seastar smp_service_group infrastructure. We create three groups: read, write, and write ack (the latter is needed to avoid ABBA deadlocks is shard A exhausts all its resources sending writes to shard B, and shard B simulateously does the same; neither will be able to send acknowledgements, so if the writes are throttled, they will never be unthrottled until a timeout occurs). Range scans are not addressed by this patch since they are handled by multishard_mutation_query, which has its own complex cross-shard communication scheme, but it be a similar solution. Ref #1105 (missing range scan protection) Tests: unit (dev) Message-Id: <20190512142243.17795-1-avi@scylladb.com>	2019-06-07 10:53:23 +02:00
Konstantin Osipov	29c27bfc28	storage_proxy: remove unnecessary lambdas in metrics binding Remove unnecessasry lambdas when binding metrics of the storage proxy. Message-Id: <20190603133753.1724-1-kostja@scylladb.com>	2019-06-03 16:55:16 +03:00
Gleb Natapov	31bf4cfb5e	cache_hitrate_calculator: make cache hitrate calculation preemptable The calculation is done in a non preemptable loop over all tables, so if numbers of tables is very large it may take a while since we also build a string for gossiper state. Make the loop preemtable and also make the string calculation more efficient by preallocating memory for it. Message-Id: <20190516132748.6469-3-gleb@scylladb.com>	2019-05-16 15:32:36 +02:00
Gleb Natapov	4517c56a57	cache_hitrate_calculator: do not copy stats map for each cpu invoke_on_all() copies provided function for each shard it is executed on, so by moving stats map into the capture we copy it for each shard too. Avoid it by putting it into the top level object which is already captured by reference. Message-Id: <20190516132748.6469-2-gleb@scylladb.com>	2019-05-16 15:32:24 +02:00
Tomasz Grabiec	9de071d214	schema_tables, storage_service: Make schema digest insensitive to expired tombstones in empty partition Schema digest is calculated by querying for mutations of all schema tables, then compacting them so that all tombstones in them are dropped. However, even if the mutation becomes empty after compaction, we still feed its partition key. If the same mutations were compacted prior to the query, because the tombstones expire, we won't get any mutation at all and won't feed the partition key. So schema digest will change once an empty partition of some schema table is compacted away. That's not a problem during normal cluster operation because the tombstones will expire at all nodes at the same time, and schema digest, although changes, will change to the same value on all nodes at about the same time. This fix changes digest calculation to not feed any digest for partitions which are empty after compaction. The digest returned by schema_mutations::digest() is left unchanged by this patch. It affects the table schema version calculation. It's not changed because the version is calculated on boot, where we don't yet know all the cluster features. It's possible to fix this but it's more complicated, so this patch defers that. Refs #4485. Asd	2019-05-14 10:43:06 +02:00
Avi Kivity	985a30a01c	storage_proxy: fix pessimizing moves Remove pessimizing moves, as reported by gcc 9.	2019-05-07 09:56:09 +03:00
Gleb Natapov	b8188e1e2f	storage_proxy: avoid copying of a topology and endpoint array in batchlog code batchlog make copies of topology and endpoint array in batchlog endpoint choosing code. There is a remark that at least endpoint copy is deliberate because Cassandra code has it. We do not have to follow. Our endpoint calculation code is atomic, so we can use a reference. Message-Id: <20190506115815.GK21208@scylladb.com>	2019-05-06 17:36:50 +03:00
Tomasz Grabiec	c96ee9882b	db/schema_tables: Include view_virtual_columns in the digest only when all nodes do After `7c87405`, schema sync includes system_schema.view_virtual_columns in the schema digest. Old nodes don't know about this table and will not include it in the digest calculation. As a result, there will be schema disagreement until the whole cluster is upgraded. Fix this by taking the new table into account only when the whole cluster is upgraded. The table should not be used for anything before this happens. This is not currently enforced, but should be. Fixes #4457.	2019-04-28 15:50:13 +02:00
Tomasz Grabiec	a108df09f9	storage_service: Introduce the VIEW_VIRTUAL_COLUMNS cluster feature Needed for determining if all nodes in the cluster are aware of the new schema table. Only when all nodes are aware of it we can take it into account when calculating schema digest, otherwise there would be permanent schema disagreement in during rolling upgrade.	2019-04-28 15:50:13 +02:00
Tomasz Grabiec	3cb7b2d72e	treewide: Propagate schema_features to db::schema::all_tables()	2019-04-28 15:50:13 +02:00
Tomasz Grabiec	1d9b88dceb	service/storage_service: Introduce cluster_schema_features()	2019-04-28 15:50:12 +02:00
Tomasz Grabiec	6e2c190b5f	schema_tables: Propagate storage_service& to merge_schema() We will need to calculate cluster schema features at the time we calculate the schema digest.	2019-04-28 12:33:10 +02:00
Piotr Sarna	037b517c85	service: initialize system distributed keyspace after schema agreement In order to avoid schema disagreements during upgrades (which may lead to deadlocks), system distributed keyspace initialization is moved right before starting the bootstrapping process, after the schema agreement checks already succeeded. Fixes #3976 Message-Id: <932e642659df1d00a2953df988f939a81275774a.1556204185.git.sarna@scylladb.com>	2019-04-25 18:44:08 +02:00
Gleb Natapov	c6b3b9ff13	cache_hitrate_calculator: wait for ongoing calculation to complete during stop Currently stop returns ready future immediately. This is not a problem since calculation loop holds a shared pointer to the local service, so it will not be destroyed until calculation completes and global database object db, that also used by the calculation, is never destroyed. But the later is just a workaround for a shutdown sequence that cannot handle it and will be changed one day. Make cache hitrate calculation service ready for it. Message-Id: <20190422113538.GR21208@scylladb.com>	2019-04-22 14:44:42 +03:00
Gleb Natapov	306f5b99b5	cache_hitrate_calculator: fix use after free in non_system_filter lambda non_system_filter lambda is defined static which means it is initialized only once, so the 'this' that is will capture will belong to a shard where the function runs first. During service destruction the function may run on different shard and access already other's shard service that may be already freed. Fixed #4425 Message-Id: <20190421152139.GN21208@scylladb.com>	2019-04-21 18:22:31 +03:00
Piotr Jastrzebski	2c599122e1	Update supported features on format change Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2019-04-12 10:38:31 +02:00
Piotr Jastrzebski	9c7e3dd470	Add _unbounded_range_tombstones_feature This requires introduction of storage_service::get_known_features and using it with check_knows_remote_features. Otherwise a node joining the existing cluster won't be able to join because it does not support unbounded range tombstones yet. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2019-04-12 10:37:12 +02:00
Piotr Jastrzebski	96ad8f7df9	Use _sstables_format to determine current format Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2019-04-12 10:37:12 +02:00
Piotr Jastrzebski	7339e9de30	Add service::read_sstables_format Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2019-04-12 10:37:12 +02:00
Piotr Jastrzebski	9934740c39	Register feature listeners in storage_service Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2019-04-12 10:36:58 +02:00
Piotr Jastrzebski	081542cf00	storage_service: add _sstables_format field Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2019-04-12 09:33:40 +02:00
Rafael Ávila de Espíndola	6191fd7701	Avoid duplicated read_keyspace_mutation calls There were many calls to read_keyspace_mutation. One in each function that prepares a mutation for some other schema change. With this patch they are all moved to a single location. Tests: unit (dev, debug) Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20190328024440.26201-1-espindola@scylladb.com>	2019-04-07 09:26:56 +03:00
Vlad Zolotarov	0dc0a6025d	query_pager::fetch_page: cosmetics: fix code alignment Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <20190401214030.5570-2-vladz@scylladb.com>	2019-04-02 11:53:10 +03:00
Avi Kivity	f259a4c3b4	Merge "Remove usage of static gossiper object in init.cc and storage_service" from Asias " This series removes the usage of the static gossiper object in init.cc and storage_service. Follow up series will remove more in other components. This is the effort to clean up the component dependencies and have better shutdown procedure. Tests: tests/gossip_test, tests/cql_query_test, tests/sstable_mutation_test, dtests. " * tag 'asias/storage_service_gossiper_dep_v5' of github.com:cloudius-systems/seastar-dev: storage_service: Do not use the global gms::get_local_gossiper() storage_service: Pass gossiper object to storage_service gms: Remove i_failure_detector.hh gossip: Get rid of the gms::get_local_failure_detector static object dht: Do not use failure_detector::is_alive in failure_detector_source_filter tests: Fix stop snitch in gossip_test.cc gossiper: Do not use value_factory from storage_service object gossiper: Use cfg options from _cfg instead of get_local_storage_service gossiper: Pass db::config object to gossiper class init: Pass gossiper object to init_ms_fd_gossiper	2019-03-26 08:54:46 +02:00
Duarte Nunes	93a1c27b31	service/storage_proxy: Don't consider view hints for MV backpressure When a view replica becomes unavailable, updates to it are stored as hints at the paired based replica. This on-disk queue of pending view updates grows as long as there are view updated and the view replica remains unavailable. Currently, we take that relative queue size into account when calculating the delay for new base writes, in the context of the backpressure algorithm for materialized views. However, the way we're calculating that on-disk backlog is wrong, since we calculate it per-device and then feed it to all the hints managers for that device. This means that normal hints will show up as backlog for the view hints manager, which in turn introduces delays. This can make the view backpressure mechanism kick-in even if the cluster uses no materialized views. There's yet another way in which considering the view hints backlog is wrong: a view replica that is unavailable for some period of time can cause the backlog to grow to a point where all base writes are applied the maximum delay of 1 second. This turns a single-node failure into cluster unavailability. The fix to both issues is to simply not take this on-disk backlog into account for the backpressure algorithm. Fixes #4351 Fixes #4352 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Reviewed-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20190321170418.25953-1-duarte@scylladb.com>	2019-03-24 20:29:56 +02:00
Asias He	7447c92d63	storage_service: Do not use the global gms::get_local_gossiper() Use the gossiper object stored in _gossiper member from storage_service.	2019-03-22 09:11:26 +08:00

1 2 3 4 5 ...

1446 Commits