We don't pull schema during a rolling upgrade, that is, until the
schema_tables_v3 feature is enabled on all nodes.
Because features are enabled from the gossiper timer, there is a race
between feature enablement and the processing of endpoint states which may
trigger a schema pull. It can happen that we first try to pull, and
only later enable the feature. In that case the schema pull will not
happen until the next schema change.
The fix is to ensure that pulls abandoned because the feature was not yet
enabled are retried once it becomes enabled.
Fixes sporadic failure in dtest:
repair_additional_test.py:RepairAdditionalTest.repair_schema_test
Message-Id: <1506428715-8182-2-git-send-email-tgrabiec@scylladb.com>
Since commit 8378fe190, we disable schema sync in a mixed cluster.
The detection is done using gossiper features. We need to make sure
the features are registered, and thus can be enabled, before the
bootstrapping of a non-seed node happens. Otherwise the bootstrap will
hang waiting on schema sync which will not happen.
Message-Id: <1505893837-27876-2-git-send-email-tgrabiec@scylladb.com>
This is another place we can update endpoint_state_map in addition to
gossiper::run().
Call gossiper::maybe_enable_features() so that we won't miss a gossip
feature update.
The current sequential scan can take a long time on a small or empty table
with a large (nr_nodes * nr_vnodes) count, and can time out. Switching to
exponential scan reduces the time.
Fixes #1230.
Message-Id: <20170912173803.8277-1-avi@scylladb.com>
If there is a schema pull between two 2.0 nodes during a rolling upgrade,
then the schema merge will delete the persisted schema version. When the node
loads that table again, e.g. on restart, it will generate a version
which is different from the one the 1.7 nodes use. This will
cause reads and writes to fail.
To avoid this, disable pulls until all nodes are upgraded.
Fixes #2802.
Collect coordinator-side read statistics per CF and use them in the
percentile speculative read executor. Getting a percentile from an
estimated_histogram object is rather expensive, so cache it and recalculate
only once per second (or when the requested percentile changes).
Fixes #2757
Message-Id: <20170911131752.27369-3-gleb@scylladb.com>
A keyspace can be deleted while a write is ongoing, so the object cannot
be used after a defer point. The keyspace reference is only used to check
how many replies a write operation should wait for, and this can be
precalculated during write handler creation.
Fixes #2777
Message-Id: <20170911084436.GG24167@scylladb.com>
Scylla 1.7.4 used incorrect ordering of counter shards. To fix
this problem a new feature is introduced that will be used to determine
when the nodes that have this bug fixed can start sending counter shards
in the correct order.
"Now that range queries go through the normal digest path, we rely on
query::result::calculate_counts() to count the amount of partitions
and rows returned.
This series optimizes it, in case it is needed, and also changes the
result message to include the partition and row counts, avoiding the
calculation altogether."
* 'calculate-counts/v3' of github.com:duarten/scylla:
query-result: Send row and partition count over the wire
query::result: Optimize calculate_counts()
Now that range queries go through the normal digest path, we rely on
query::result::calculate_counts() to count the amount of partitions
and rows returned. This patch makes it a bit faster.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Now that we don't go directly to reconciliation for range queries, the
result isn't required to have the row and partition counts calculated
(we no longer transform a reconciled_result to a query::result).
Furthermore, this line was causing a lot of dtests to fail on account
of them not expecting an error line in the logs.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170810225351.12610-1-duarte@scylladb.com>
Currently scanning reads go to the reconciliation stage directly, which
requires asking for mutation data from all peers. This patch makes
them try matching digests first, like a single-partition read.
The change requires internode protocol changes since currently it is not
possible to ask for multi partition data/digest over RPC. It means that
the capability has to be guarded by new gossip feature flag which the
patch also adds.
never_speculating_read_executor always waits for all targets, so the
block_for parameter is always equal to targets.size(). No need to
pass it explicitly.
If schema merging completes at a lower rate than pull requests arrive,
then merge processes will accumulate and needlessly request and hold schema mutations.
In rare cases, when there are constant schema changes, they may even
overflow memory. This was seen in dtest:
concurrent_schema_changes_test.py:TestConcurrentSchemaChanges.create_lots_of_schema_churn_test
Allowing only one active and one queued pull request per remote
endpoint is enough.
Initialize the system_auth and system_traces keyspaces and their tables after
the node joins the token ring, because system_auth initialization issues
SELECT and possibly INSERT CQL statements.
This patch effectively reverts commit d3b8b67 and restores the
initialization order to how it was before that patch.
Fixes #2273
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1500417217-16677-1-git-send-email-vladz@scylladb.com>
Some places remained where code looked directly at
system_keyspace::NAME to determine if a keyspace is
considered special/system/protected, including
schema digest calculation.
Export "is_system_keyspace" and use it accordingly.
Message-Id: <1500469809-23546-1-git-send-email-calle@scylladb.com>
In storage_proxy we arrange the mutations sent by the replicas in a
vector of vectors, such that each row corresponds to a partition key
and each column contains the mutation, possibly empty, as sent by a
particular replica.
There is reconciliation-related code that assumes that all the
mutations sent by a particular replica can be found in a single
column, but that isn't guaranteed by the way we initially arrange the
mutations.
This patch fixes this and enforces the expected order.
Fixes #2531
Fixes #2593
Signed-off-by: Gleb Natapov <gleb@scylladb.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170713162014.15343-1-duarte@scylladb.com>
The old nodes which are still using v2 schema tables will fail to
apply our response, with error messages complaining about not being
able to locate schema of certain versions (new schema tables). This
change inhibits such errors by responding with an empty mutation list.
Currently it results in scary error messages in the logs about not being
able to find the schema of a given version. It's benign, but may scare
users. In the future, incompatibilities could result in more subtle
errors. Better to inhibit them completely.
This will allow expressing lack of information about certain ranges of
rows (including the static row), which will be used in cache to
determine if information in cache is complete or not.
Continuity is represented internally using flags on row entries. The
key range between two consecutive entries is continuous iff
rows_entry::continuous() is true for the later entry. The range
starting after the last entry is assumed to be continuous. The range
corresponding to the key of the entry is continuous iff
rows_entry::dummy() is false.
[tgrabiec:
- based on the following commits:
4a5bf75 - Piotr Jastrzebski : mutation_partition: introduce dummy rows_entry
773070e - Piotr Jastrzebski : mutation_partition: add continuity flag to rows_entry
- documented that partition tombstone is always complete
- require specifying the partition tombstone when creating an incomplete entry
- replaced rows_entry(dummy_tag, ...) constructor with more general
rows_entry(position_in_partition, ...)
- documented continuity semantics on mutation_partition
- fixed _static_row_cached being lost by mutation_partition copy constructors
- fixed conversion to streamed_mutation to ignore dummy entries
- fixed mutation_partition serializer to drop dummy entries
- documented semantics of continuity on mutation_partition level
- dropped assumptions that dummy entries can be only at the last position
- changed equality to ignore continuity completely, rather than
partially (it was not ignoring dummy entries, but ignoring
continuity flag)
- added printout of continuity information in mutation_partition
- fixed handling of empty entries in apply_reversibly() with regards
to continuity; we no longer can remove empty entries before
merging, since that may affect continuity of the right-hand
mutation. Added _erased flag.
- fixed mutation_partition::clustered_row() with dummy==true to not ignore the key
- fixed partition_builder to not ignore continuity
- renamed dummy_tag_t to dummy_tag. _t suffix is reserved.
- standardized all APIs on is_dummy and is_continuous bool_class:es
- replaced add_dummy_entry() with ensure_last_dummy() with safer semantics
- dropped unused remove_dummy_entry()
- simplified and inlined cache_entry::add_dummy_entry()
- fixed mutation_partition(incomplete_tag) constructor to mark all row ranges as discontinuous
]
The code filters CFs by name to not include the system keyspace, but v3
schema added yet another system namespace. Better to filter according to
replication strategy, to accommodate schema v4 adding even more
system keyspaces.
Fixes: #2516
Message-Id: <20170621073816.GB3944@scylladb.com>
Boost 1.55 (Ubuntu 14) fails to compile because an iterator produced by
boost::adaptors::transformed(), when a std::ref to a lambda is passed to
it, does not match the iterator concept. It cannot be default-constructed
because std::reference_wrapper is not default-constructible.
boost::range::min_element() never actually default-constructs it, but the
concept is checked anyway. The patch fixes it by providing an explicit
functor that is default-constructible.
Message-Id: <20170618131836.GD3944@scylladb.com>
When making the schema mutations for a view update, we should only
include the base table schema mutations (in case the target node
doesn't contain them) when the view is being directly updated. When it
is being updated as a side effect of updating the base table, then
including the base schema mutations will hide the actual changes being
performed on the base.
Fixes #2500
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1497782822-2711-1-git-send-email-duarte@scylladb.com>
This patch makes the storage proxy choose replicas to read from based on
their cache hit rates. Replicas with higher cache hit rates will see
more requests, while replicas with lower hit rates will see fewer. The local
node has a special bonus and will get more requests even if another node
has a slightly higher cache hit rate (same goes for local vs. remote DC),
but after the patch it is no longer guaranteed that the coordinator node
will be chosen as a replica for the read (if the feature is enabled).
Currently the storage proxy has to loop over the remaining replicas to
search for a suitable extra replica, but doing it in filter_for_query() is
extremely easy, so do it there instead.
When a node starts, it does not have any information about the cache
temperature of other nodes in the cluster, and it is hard (if not
impossible) to make the right guess. During cluster startup all nodes have
cold caches, so there is no point in redirecting reads to other nodes even
though the local cache is cold; but if only that node restarted, then the
other nodes have populated caches and reads should be redirected.
The node will get up-to-date information about other nodes' caches,
but only after receiving the first reply; until then it does not have the
information to make the right decisions, which may cause unwanted spikes
immediately after restart. Having cache temperature in the gossiper helps
to solve the problem.
This patch adds a new class, cache_hitrate_calculator, whose responsibility
is to periodically calculate the average cache hit rate across all shards
for each CF.