scylladb

Author	SHA1	Message	Date
Calle Wilund	8c257c40b4	storage_service: Only replicate token metadata iff modified in on_change Fixes #2869 Message-Id: <20171101105629.22104-1-calle@scylladb.com>	2017-11-01 14:56:55 +02:00
Duarte Nunes	044b8deae4	Merge 'Solves problems related to gossip which can be observed in a large cluster' from Tomasz "The main problem fixed is slow processing of application state changes. This may lead to a bootstrapping node not having up to date view on the ring, and serve incorrect data. Fixes #2855." * tag 'tgrabiec/gossip-performance-v3' of github.com:scylladb/seastar-dev: gms/gossiper: Remove periodic replication of endpoint state map gossiper: Check for features in the change listener gms/gossiper: Replicate changes incrementally to other shards gms/gossiper: Document validity of endpoint_state properties storage_service: Update token_metadata after changing endpoint_state gms/gossiper: Process endpoints in parallel gms/gossiper: Serialize state changes and notifications for given node utils/loading_shared_values: Allow Loader to return non-future result gms/gossiper: Encapsulate lookup of endpoint_state storage_service: Batch token metadata and endpoint state replication utils/serialized_action: Introduce trigger_later() gossiper: Add and improve logging gms/gossiper: Don't fire change listeners when there is no change gms/gossiper: Allow parallel apply_state_locally() gms/gossiper: Avoid copies in endpoint_state::add_application_state() gms/failure_detector: Ignore short update intervals	2017-10-18 10:13:25 +01:00
Tomasz Grabiec	2d5fb9d109	gms/gossiper: Replicate changes incrementally to other shards storage_service depends on endpoint states to be replicated to all shards before token metadata is replicated. Currently this is taken care of by storage_service::replicate_to_all_cores(), invoked from storage_service's change listener. It copies whole endpoint state map, which is expensive in large clusters. It's more efficient to replicate only incremental changes, and only once, rather than for each application state.	2017-10-18 08:49:53 +02:00
Tomasz Grabiec	cf113ed295	storage_service: Update token_metadata after changing endpoint_state There is a requirement that whatever is present in token_metadata, should also be present in endpoint_state. Because of that, we should update endpoint_state first (set_gossip_tokens). Apache Cassandra switched to this order as well in commit b39d984f7bd682c7638415d65dcc4ac9bcb74e5f.	2017-10-18 08:49:53 +02:00
Tomasz Grabiec	6263b0ebb6	storage_service: Batch token metadata and endpoint state replication Replication needs to be serialized. We can batch replication requests which are waiting to start. Use serialized_action, which does this.	2017-10-18 08:49:52 +02:00
Pekka Enberg	ae92055b52	Merge "Bring histogram closer to what Prometheus expects" from Glauber "Histograms are a native prometheus type, and there are many functions available that operate on them. There is extensive documentation about them at https://prometheus.io/docs/practices/histograms/ One example is the function histogram_quantile(), that can extract useful quantiles from the histograms. Currently, those functions don't work well. The reasons are twofold: 1) We are only exporting 16 metrics, starting from 1usec. That means that the highest latency we can differentiate is 4ms. After that, everything falls into the same bin. 2) The format that prometheus expects is that each bin will contain the total number of points seen up until that bin, while we currently export the total number of points that falls between bins. IOW, it is a cumulative histogram. About point two, granted it is a bit hidden in their website, but it is there. The following phrase about a caveat makes it clear: "Note that we divide the sum of both buckets. The reason is that the histogram buckets are cumulative. The le="0.3" bucket is also contained in the le="1.2" bucket; dividing it by 2 corrects for that." It is also not needed to accumulate things that fall over the last bin: the _count component of the histogram will already account for that." Acked-by: Amnon Heiman <amnon@scylladb.com> Acked-by: Gleb Natapov <gleb@scylladb.com> * 'prometheus-histograms' of github.com:glommer/scylla: storage_proxy: change reporting of estimated histograms estimated_histogram: bring histogram closer to what prometheus expects.	2017-10-17 20:23:10 +03:00
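Point 2 of the merge message above (per-bin counts versus cumulative counts) can be illustrated with a minimal sketch. The function name `to_cumulative` is hypothetical and is not taken from the Scylla sources; it only shows the conversion the message describes.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper, not Scylla's implementation: turn "points that fall
// between bins" into "points seen up until that bin", the cumulative layout
// the commit message says Prometheus expects.
std::vector<uint64_t> to_cumulative(const std::vector<uint64_t>& per_bucket) {
    std::vector<uint64_t> cumulative(per_bucket.size());
    // Running sum: element i becomes per_bucket[0] + ... + per_bucket[i].
    std::partial_sum(per_bucket.begin(), per_bucket.end(), cumulative.begin());
    return cumulative;
}
```

With per-bin counts {1, 2, 3}, the cumulative counts are {1, 3, 6}, so each bucket also contains every bucket below it.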
Duarte Nunes	e9358c1c83	service/storage_service: Avoid copies in prepare_replacement_info() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	674f5d8eaf	service/storage_service: Cleanup get_application_state_value() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	0ccb9211d7	service/storage_service: Cleanup handle_state_removing() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	bdee795876	service/storage_service: Cleanup get_rpc_address() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	03e6fc95ba	service/migration_manager: Avoid copies in is_ready_for_bootstrap() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	72ca6b34ef	service/migration_manager: Cleanup has_compatible_schema_tables_version() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	976324bbb8	service/migration_manager: Fix usages of get_application_state() We were taking a reference to a temporary value in different places. Fix them by using get_application_state_ptr(), which also avoids a copy. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	25b0654312	service/load_broadcaster: Avoid copy in on_join() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	2210d10552	gms/gossiper: Cleanup is_alive() Make it use get_endpoint_state_for_endpoint_ptr(), check if gossiper is enabled, mark it as const, and have some callers use it instead of open coding the logic. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	ceebbe14cc	gossiper: Avoid endpoint_state copies gossiper::get_endpoint_state_for_endpoint() returns a copy of endpoint_state, which we've seen can be very expensive. This patch adds a similar function which returns a pointer instead, and changes the call sites where using the pointer-returning variant is deemed safe (the pointer neither escapes the function, nor crosses any defer point). Fixes #764 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-10 13:48:02 +01:00
Duarte Nunes	198b1b76b5	storage_service: Remove duplicate endpoint state check We already performed the check, so we don't need to do it again. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-10 13:25:34 +01:00
Glauber Costa	189ef02596	storage_proxy: change reporting of estimated histograms We are currently collapsing the histograms in 16 points, exponentially increasing in value, starting from 1. While reducing the number of points is a worthy goal, the current configuration caps us at 4ms. Our latencies tend to be higher than this. Starting from 1 is also a bit of an exaggeration: rarely are our latencies in that range. This patch changes reporting so that we report 20 points starting from 32. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-10-04 20:01:15 -04:00
Tomasz Grabiec	b704710954	migration_manager: Make sure schema pulls eventually happen when schema_tables_v3 is enabled We don't pull schema during rolling upgrade, that is until schema_tables_v3 feature is enabled on all nodes. Because features are enabled from gossiper timer, there is a race between feature enablement and processing of endpoint states which may trigger schema pull. It can happen that we first try to pull, but only later enable the feature. In that case the schema pull will not happen until the next schema change. The fix is to ensure that pulls abandoned due to feature not being enabled will be retried when it is enabled. Fixes sporadic failure in dtest: repair_additional_test.py:RepairAdditionalTest.repair_schema_test Message-Id: <1506428715-8182-2-git-send-email-tgrabiec@scylladb.com>	2017-09-27 12:00:07 +01:00
Asias He	4b1034b9cd	storage_service: Remove the stream_hints Our hinted handoff implementation will not use the db::system_keyspace::HINTS system table to store hints. No need to stream them. Acked-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <3b9190e250b54321ceb87767f4722c7458d41797.1506391500.git.asias@scylladb.com>	2017-09-26 19:05:21 +03:00
Tomasz Grabiec	8e46d15f91	storage_service: Register features before joining Since commit `8378fe190`, we disable schema sync in a mixed cluster. The detection is done using gossiper features. We need to make sure the features are registered, and thus can be enabled, before the bootstrapping of a non-seed node happens. Otherwise the bootstrap will hang waiting on schema sync which will not happen. Message-Id: <1505893837-27876-2-git-send-email-tgrabiec@scylladb.com>	2017-09-25 09:13:02 +01:00
Tomasz Grabiec	b92dcb0284	storage_service: Extract register_features() Message-Id: <1505893837-27876-1-git-send-email-tgrabiec@scylladb.com>	2017-09-25 09:12:46 +01:00
Asias He	ebc3bada12	storage_service: Check gossip feature update in replicate_tm_and_ep_map This is another place we can update endpoint_state_map in addition to gossiper::run(). Call gossiper::maybe_enable_features() so that we won't miss a gossip feature update.	2017-09-20 16:58:33 +08:00
Asias He	173cba67ba	storage_service: Remove rpc client on all shards in on_dead We should close connections to nodes that are down on all shards instead of the shard which runs the on_dead gossip callback. Found by Gleb. Message-Id: <527a14105a07218066e9f1da943693d9de6993e5.1505894260.git.asias@scylladb.com>	2017-09-20 10:23:31 +02:00
Avi Kivity	55e0b63e65	storage_proxy: scan more nodes exponentially to achieve target result set size The current sequential scan can take a long time on a small or empty table with a large (nr_nodes * nr_vnodes) count, and can time out. Switching to exponential scan reduces the time. Fixes #1230. Message-Id: <20170912173803.8277-1-avi@scylladb.com>	2017-09-18 15:15:15 +02:00
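The exponential scan described in the commit above can be sketched as follows. This is an illustration under assumptions, not the storage_proxy code: `exponential_scan`, `scan_result`, and the `rows_in_range` callback are hypothetical names, and the doubling factor is assumed for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>

// Hypothetical result type for the sketch.
struct scan_result {
    std::size_t rows;    // rows collected so far
    std::size_t rounds;  // number of scan rounds issued
};

// Scan ranges in rounds, doubling the number of ranges per round (assumed
// growth factor) until the target row count is reached or ranges run out.
scan_result exponential_scan(std::size_t nr_ranges, std::size_t target_rows,
                             const std::function<std::size_t(std::size_t)>& rows_in_range) {
    std::size_t rows = 0, rounds = 0, next = 0, batch = 1;
    while (next < nr_ranges && rows < target_rows) {
        const std::size_t end = std::min(nr_ranges, next + batch);
        for (; next < end; ++next) {
            rows += rows_in_range(next);
        }
        ++rounds;
        batch *= 2;  // the next round covers twice as many ranges
    }
    return {rows, rounds};
}
```

On an empty table with 1000 ranges, this sketch finishes in 10 rounds (1 + 2 + ... + 512 covers 1023 ranges), where a one-range-per-round sequential scan would need 1000 rounds.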
Tomasz Grabiec	5a92c18e63	migration_manager: Disable pulls during rolling upgrade from 1.7 If there is a schema pull during rolling upgrade between two 2.0 nodes, then schema merge will delete the persisted schema version. When the node loads that table again, e.g. on restart, it will generate a version which is different than the one which 1.7 nodes use. This will cause reads and writes to fail. To avoid this, disable pulls until all nodes are upgraded. Fixes #2802.	2017-09-14 20:26:31 +02:00
Tomasz Grabiec	713d75fd51	storage_service: Introduce SCHEMA_TABLES_V3 feature	2017-09-14 20:26:31 +02:00
Gleb Natapov	31e803a36c	storage_proxy: wire up percentile speculative read properly Collect coordinator side read statistic per CF and use them in percentile speculative read executor. Getting percentile from estimated_histogram object is rather expensive, so cache it and recalculate only once per second (or if requested percentile changes). Fixes #2757 Message-Id: <20170911131752.27369-3-gleb@scylladb.com>	2017-09-14 10:31:26 +03:00
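The caching policy in the commit above (recalculate at most once per second, or when the requested percentile changes) can be sketched like this. The class `cached_percentile` and its `compute` callback are hypothetical stand-ins; the real code reads from an estimated_histogram object, which is not modeled here.

```cpp
#include <chrono>
#include <functional>
#include <optional>
#include <utility>

// Hypothetical cache, not the storage_proxy implementation.
class cached_percentile {
    using clock = std::chrono::steady_clock;
    std::function<double(double)> _compute;  // stand-in for the expensive histogram query
    double _percentile = 0;
    double _value = 0;
    std::optional<clock::time_point> _last;  // empty until the first computation
public:
    explicit cached_percentile(std::function<double(double)> compute)
        : _compute(std::move(compute)) {}

    // `now` is a parameter so the refresh policy can be exercised deterministically.
    double get(double percentile, clock::time_point now = clock::now()) {
        const bool stale = !_last || now - *_last >= std::chrono::seconds(1);
        if (stale || percentile != _percentile) {
            _value = _compute(percentile);
            _percentile = percentile;
            _last = now;
        }
        return _value;
    }
};
```

Repeated calls with the same percentile inside one second return the cached value without invoking `compute` again.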
Avi Kivity	aebab377d9	storage_service: add missing include	2017-09-11 20:09:45 +03:00
Avi Kivity	9b540eccb0	database: remove dependency on compaction.hh and compaction_manager.hh	2017-09-11 20:09:45 +03:00
Gleb Natapov	d0d8bdf615	storage_proxy: remove unused parameter from get_restricted_ranges() function Message-Id: <20170911084653.GH24167@scylladb.com>	2017-09-11 11:58:44 +02:00
Gleb Natapov	f66e9377d4	storage_proxy: do not keep reference to a keyspace during write A keyspace can be deleted while write is ongoing, so the object cannot be used after defer point. The keyspace reference is only used to check how many replies a write operation should wait for and this can be precalculated during write handler creation. Fixes #2777 Message-Id: <20170911084436.GG24167@scylladb.com>	2017-09-11 11:57:00 +02:00
Asias He	bb9dbc5ade	storage_service: Do not use c_str() in the logger Use logger.info("{}", msg) instead. Message-Id: <d2f15007a54554b58e29fd05331c06ae030d582f.1504832296.git.asias@scylladb.com>	2017-09-10 18:10:24 +03:00
Paweł Dziepak	ecd2bf128b	storage_service: introduce CORRECT_COUNTER_ORDER feature Scylla 1.7.4 used incorrect ordering of counter shards. In order to fix this problem a new feature is introduced that will be used to determine when nodes with that bug fixed can start sending counter shard in the correct order.	2017-09-05 10:32:48 +01:00
Paweł Dziepak	9d82a1ebfd	abstract_read_executor: make make_requests() exception safe Message-Id: <20170821162934.25386-5-pdziepak@scylladb.com>	2017-08-22 12:09:42 +02:00
Avi Kivity	e428805ba5	Merge "Optimize query result partition and row counts" from Duarte "Now that range queries go through the normal digest path, we rely on query::result::calculate_counts() to count the amount of partitions and rows returned. This series optimizes it, in case it is needed, and also changes the result message to include the partition and row counts, avoiding the calculation altogether." * 'calculate-counts/v3' of github.com:duarten/scylla: query-result: Send row and partition count over the wire query::result: Optimize calculate_counts()	2017-08-17 13:41:21 +03:00
Duarte Nunes	ec75eac37d	ring_position_exponential_vector_sharder: Take ranges by rvalue Avoids some copies. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170814093310.29200-1-duarte@scylladb.com>	2017-08-14 12:55:43 +03:00
Duarte Nunes	d7bab684ea	query::result: Optimize calculate_counts() Now that range queries go through the normal digest path, we rely on query::result::calculate_counts() to count the amount of partitions and rows returned. This patch makes it a bit faster. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-08-14 10:28:29 +02:00
Duarte Nunes	bcf21aacc2	storage_proxy: Directly call query_nonsingular_mutations_locally Instead of duplicating the branch. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170811001559.25788-1-duarte@scylladb.com>	2017-08-11 09:06:01 +03:00
Duarte Nunes	a3ee99554b	service/storage_proxy: Remove out of date comment Now that we don't go directly to reconciliation for range queries, the result isn't required to have the row and partition counts calculated (we no longer transform a reconciled_result to a query::result). Furthermore, this line was causing a lot of dtests to fail on account of them not expecting an error line in the logs. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170810225351.12610-1-duarte@scylladb.com>	2017-08-11 09:04:23 +03:00
Asias He	49360992d9	storage_service: Use the new range_streamer interface for removenode So that removenode operation will now stream small ranges at a time and restream the failed ranges.	2017-08-07 16:31:48 +08:00
Asias He	6b8dc85f12	storage_service: Use the new range_streamer interface for decommission So that decommission operation will now stream small ranges at a time and restream the failed ranges.	2017-08-07 16:31:48 +08:00
Asias He	24584b8509	storage_service: Use the new range_streamer interface for rebuild So that rebuild operation will now stream small ranges at a time and restream the failed ranges.	2017-08-07 16:31:47 +08:00
Gleb Natapov	d2a2a6d471	storage_proxy: make range_slice_read_executor go through digest matching state Currently scanning reads go to reconciliation stage directly which requires asking for mutation data from all peers. This patch makes it try matching digests first like a single partition read. The change requires internode protocol changes since currently it is not possible to ask for multi partition data/digest over RPC. It means that the capability has to be guarded by new gossip feature flag which the patch also adds.	2017-08-03 11:37:03 +03:00
Gleb Natapov	3b7d8c8767	storage_proxy: add capability to read data/digest for non singular ranges Currently only mutation_data read supports non singular ranges. This patch extends data/digest reads to support them too.	2017-08-03 10:35:09 +03:00
Gleb Natapov	c619ef258b	storage_proxy: remove redundant parameter from never_speculating_read_executor constructor never_speculating_read_executor always waits for all targets so block_for parameter is always equal to targets.size(). No need to pass it explicitly.	2017-08-03 10:08:44 +03:00
Tomasz Grabiec	e09220dbff	migration_manager: Log schema pulls	2017-07-27 20:08:25 +02:00
Tomasz Grabiec	350d98d4e1	migration_manager: Prevent pull requests from accumulating If schema merging completes at lower rate than incoming pull requests, then merge processes will accumulate and needlessly request and hold schema mutations. In rare cases, when there are constant schema changes, they may even overflow memory. This was seen in dtest: concurrent_schema_changes_test.py:TestConcurrentSchemaChanges.create_lots_of_schema_churn_test Allowing only one active and one queued pull request per remote endpoint is enough.	2017-07-27 20:08:25 +02:00
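The "one active and one queued pull request per remote endpoint" rule from the commit above can be sketched as a small state machine. `pull_slot` is a hypothetical name and the sketch is synchronous; the actual migration_manager code is asynchronous and is not reproduced here. One such slot would be kept per remote endpoint.

```cpp
#include <cassert>

// Hypothetical per-endpoint slot: at most one pull runs and at most one waits.
class pull_slot {
    bool _active = false;  // a pull is currently running
    bool _queued = false;  // a follow-up pull is waiting to start
public:
    // Returns true if the caller should start a pull now. Requests arriving
    // while a pull is active collapse into a single queued pull.
    bool request() {
        if (!_active) {
            _active = true;
            return true;
        }
        _queued = true;
        return false;
    }

    // Called when the active pull finishes. Returns true if the queued pull
    // should start immediately; the slot then stays active.
    bool complete() {
        assert(_active);
        if (_queued) {
            _queued = false;
            return true;
        }
        _active = false;
        return false;
    }
};
```

However many requests arrive during a running pull, only one more pull follows it, so pending merges cannot pile up without bound.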
Vlad Zolotarov	e98adb13d5	service::storage_service: initialize auth and tracing after we joined the ring Initialize the system_auth and system_traces keyspaces and their tables after the Node joins the token ring because as a part of system_auth initialization there are going to be issued SELECT and possibly INSERT CQL statements. This patch effectively reverts the `d3b8b67` patch and brings the initialization order to how it was before that patch. Fixes #2273 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1500417217-16677-1-git-send-email-vladz@scylladb.com>	2017-07-27 10:54:36 +02:00
Vlad Zolotarov	9086c643a6	service::storage_proxy: add a trace points pair in the SELECT replica flow Add two trace points: at the beginning and at the end of the replica flow on the replica shard. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1499961542-16263-1-git-send-email-vladz@scylladb.com>	2017-07-20 16:44:25 +02:00

1083 Commits