scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-01 20:46:56 +00:00

Author	SHA1	Message	Date
Botond Dénes	b2f75a6c53	Add counters to monitor querier-cache efficiency Add the following counters: (1) querier_cache_lookups (2) querier_cache_misses (3) querier_cache_drops (4) querier_cache_time_based_evictions (5) querier_cache_resource_based_evictions (6) querier_cache_memory_based_evictions (7) querier_cache_population (1) counts the total number of querier cache lookups. Not all page-fetches will result in a querier lookup. For example the first page of a query will not do a lookup as there was no previous page to reuse the querier from. The second, and all subsequent pages however should attempt to reuse the querier from the previous page. (2) counts the subset of (1) where the read has missed the querier cache (failed to find a matching saved querier). (3) counts the subset of (1) where the querier was recalled and dropped immediately. This can happen for example if the querier was at the wrong position. (4) counts the cached queriers that were evicted due to their TTL expiring. (5) counts the cached queriers that were evicted due to reader-resource (those limited by reader-concurrency limits) shortage. (6) counts the cached queriers that were evicted due to reaching the cache's memory limits (currently set to 4% of the shards' memory). (7) is the current number of entries in the cache. Note: * The count of cache hits can be derived from these counters as (1) - (2). * cache_drop (3) also implies a cache hit (see above). This means that the number of actually reused queriers is: (1) - (2) - (3)	2018-03-13 10:34:34 +02:00
Botond Dénes	8513549b55	Memory based cache eviction To bound the memory consumption of the querier-cache the total memory consumption of the cached queriers is limited to 4% of the shard's total memory. When inserting a new querier it is first checked whether its insertion would cause the limit to be crossed. If this is the case existing entries are evicted until the memory consumption is sufficiently reduced so that after inserting the querier it stays below the limit. Cached queriers are evicted in LRU order as the oldest queriers are the most likely to be evicted based on their TTL anyway. To calculate the memory consumption of the cached queriers flat_mutation_reader::buffer_size() is used. While this is not very precise as it doesn't include object sizes and member containers it gives a good picture of the memory consumption of the queriers. Memory based cache eviction overlaps with resource-based cache eviction but only to some degree as that only accounts for the memory consumption of sstable readers.	2018-03-13 10:34:34 +02:00
Botond Dénes	f488ae3917	Add buffer_size() to flat_mutation_reader buffer_size() exposes the collective size of the external memory consumed by the mutation-fragments in the flat reader's buffer. This provides a basis to build basic memory accounting on. Although this is not the entire memory consumption of any given reader it is the most volatile component and usually by far the largest one too.	2018-03-13 10:34:34 +02:00
Botond Dénes	212b2dabc4	Resource-based cache eviction Readers serving user-reads need to obtain a permit to start reading. There exists a restriction on how many active readers can be admitted based on their count and their memory consumption. Since the saved readers of cached queriers are technically active (they hold a permit) they can block new readers from obtaining a permit. New readers have a higher priority because a cached reader might be abandoned or used later at best so in the face of memory pressure we evict cached readers to free up permits for new readers. Cached queriers are evicted in LRU order as the oldest queriers are the most likely to be evicted based on their TTL anyway.	2018-03-13 10:34:34 +02:00
Botond Dénes	d5bcadcfda	Time-based cache eviction Cached queriers should not sit in the cache indefinitely otherwise abandoned reads would cause excess and unnecessary resource-usage. Attach an expiry timer to each cache-entry which evicts it after the TTL passes.	2018-03-13 10:34:34 +02:00
Botond Dénes	ff808d9ce6	Save and restore queriers in mutation_query() and data_query() Use the querier_cache (represented by the passed-in querier_cache_context) object to lookup saved queriers at the start of the page and save them at the end of it if it is likely that there will be more page requests.	2018-03-13 10:34:34 +02:00
Botond Dénes	cab38c9f81	Add the querier_cache_context helper querier_cache_context is supposed to make propagating the cache and the key down the layers easier. It comes bundled with some of the required parameters (the lookup and save state) and also hides all of the boiler-plate of dealing with the cache (checking whether the key is non-empty, etc.). It also makes it possible to not use the cache and hide this from the lower layers.	2018-03-13 10:34:34 +02:00
Botond Dénes	bbfe17437e	Add querier_cache This is the cache where suspended queriers are going to be saved between pages. This is not a general purpose cache. It caters to the specific needs of the querier recall mechanism. More specifically: (1) Cache entries are of single-use, they are inserted once and the first lookup removes them. Multiple items may be stored under a single key. Identifying the correct one happens based on additional information like the query range. Lookup knows to drop queriers when they cannot be used to serve the next page. (2) Cache entries are evicted after a certain time to avoid the depletion of resources due to abandoned reads. (3) Cache entries are evicted when facing reader-permit shortage, until either enough permits are freed up or all entries are evicted. (4) A memory limiter is set up which keeps the total memory consumption of the cache under a limit (4% of memory) by evicting the oldest entries when inserting a new one would cause the total memory consumption to go above the limit. (5) It updates the relevant counters of the db_stats. This patch only implements (1), the other features will be implemented in their own patches.	2018-03-13 10:34:34 +02:00
Botond Dénes	7a5143a670	Add querier The querier encapsulates all objects needed to serve queries, except result builders. It is designed to be suspendable, savable and resumable. It contains all logic needed to suspend, resume and determine whether the querier can be resumed or not. It is the foundation upon which the "reader-reuse" mechanism is built.	2018-03-13 10:34:34 +02:00
Botond Dénes	84d872babf	Add are_limits_reached() compact_mutation_state are_limits_reached() allows querying whether the compactor reached the page's limits. This is needed to determine whether there will be more pages and thus whether the compact_mutation_state has to be kept around.	2018-03-13 10:34:34 +02:00
Botond Dénes	2c1081b0e9	Add start_new_page() to compact_mutation_state start_new_page() resets the limits to the current page's ones and sets the _empty_partition flag so that the partition header (if the last page finished inside a partition) will be reemitted.	2018-03-13 10:34:34 +02:00
Botond Dénes	3fca8aaefb	Save last key of the page and method to query it Make a copy of the current decorated-key in consume_end_of_stream() so that it persists while the compaction state is suspended. Also add current_partition() to allow client code to query the partition the compaction is positioned in. This is needed to determine whether the start position of the next page matches that of the compact_mutation_state.	2018-03-13 10:34:34 +02:00
Botond Dénes	2fcc99fe43	Make compact_mutation reusable Currently compact_mutation is used as a use-once-then-throw-away object. After it satisfies its consumer it's destroyed together with the consumer. This conflicts with the effort to save and reuse readers and associated infrastructure between pages of a query. To resolve this conflict compact_mutation is split into two classes: (1) compact_mutation_state (2) compact_mutation compact_mutation_state encapsulates all the compaction logic and state, while compact_mutation continues to provide the same API using compact_mutation_state behind the scenes. compact_mutation_state doesn't store the consumer, instead its consume_* methods are templated on the consumer and take it as an argument. This allows compact_mutation_state to be independent of the consumer's type. Additionally compact_mutation can now be constructed from a shared pointer to compact_mutation_state. This allows client code to pre-construct a compaction state and retain it after the compact_mutation object is destroyed. These changes allow the state of a compaction to be saved and restored later while code that is only interested in storing the saved state can stay independent of the consumer's type. This patch only contains the splitting of compact_mutation into compact_mutation and compact_mutation_state. The next patches will add the missing functionality that is needed to make compact_mutation_state truly reusable across pages.	2018-03-13 10:34:34 +02:00
Botond Dénes	7bd500049d	Add the CompactedFragmentsConsumer Undust the commented CompactMutationConsumer concept, make it usable and rename it to CompactedFragmentsConsumer (as we now have flat readers).	2018-03-13 10:34:34 +02:00
Botond Dénes	f1171803b5	Use the last_replicas stored in the page_state Pass the last_replicas from the page_state as the preferred_replicas for query() and save the returned last_replicas as the last_replicas field of the next page_state. The circle is now complete. The first page of any query will pass an empty list as the preferred replicas (having no previous paging_state) so the replicas will be selected according to the load-balancing strategy. Any subsequent page will use the last replicas from the last page as the preferred ones for the current one. Thus if all goes well all pages of a query will hit the same replicas.	2018-03-13 10:34:34 +02:00
Botond Dénes	536a32bb5e	query_singular(): return the used replicas This patch implements the last_replicas returning part of the query() signature changes for singular queries. It allows for client code to save the last returned replicas and pass it to query() on the next page as the preferred-replicas parameter, thus facilitating the read requests for the next page hitting the same replicas.	2018-03-13 10:34:34 +02:00
Botond Dénes	aaf67bcbaa	Consider preferred replicas when choosing endpoints for query_singular() Propagate the preferred_replicas to db::filter_for_query() and consider them when selecting the endpoints. The algorithm for selecting the endpoints is as follows: * Compute the intersection of the endpoint candidates and the preferred endpoints. * If this yields a set of endpoints that already satisfies the CL requirements use this set. * Otherwise select the remaining endpoints according to the load-balancing strategy, just like before.	2018-03-13 10:34:34 +02:00
Botond Dénes	eac597d726	Add preferred and last replicas to the signature of query() preferred_replicas are added to the parameters and last_replicas are added to the return type. The preferred replicas will be used as a hint for the selection of the replicas to send the read requests to. The last replicas (returned) are the replicas actually selected for the read. This will allow queries to consistently hit the same replicas for each page thus reusing readers created on these replicas. For convenience a query() overload is provided that doesn't take or return the preferred and last replicas. This patch only adds the parameters and propagates them down to query_singular() and query_partition_key_range(). The code to actually use these preferred-replicas will be added in later patches. The reason for separating this is to reduce noise and improve reviewability for those functional changes later.	2018-03-13 10:34:34 +02:00
Botond Dénes	f281b3e923	Add last_replicas to paging_state Helps paged queries consistently hit the same replicas for each subsequent page. Replicas that already served a page will keep the readers used for filling it around in a cache. Subsequent page request hitting the same replicas can reuse these readers to fill the pages avoiding the work of creating these readers from scratch on every page. In a mixed cluster older coordinators will ignore this value. The value of last_replicas may change between pages as nodes may become available/unavailable or the coordinator may decide to send the read requests to different replicas at its discretion. Replicas are identified by an opaque uuid which should only make sense to the storage-proxy.	2018-03-13 10:34:34 +02:00
Nadav Har'El	fa284f6307	Add query UUID to read command This patch adds the parameter to read_command which is needed for caching of readers during multiple pages of a paged query, which we will introduce in the next patches. The query_uuid is a UUID of a previously saved reader, which the replica is now asked to recall and resume (if this saved reader is no longer in the cache, it is fine, a new reader will be started). Additionally a helper flag is_first_page is added so that the replica can avoid doing any cache lookups (and incrementing miss counters) for the first page. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2018-03-13 10:34:34 +02:00
Nadav Har'El	ec7c56d18a	Add query UUID to paging state This patch adds to the "paging_state", the opaque cookie that clients are supposed to provide when asking for the next page on a paged query, a unique id field. This new field will be used to tell that a new request for a page really continues the previous page, and doesn't just by chance start at the same position the previous page stopped. We need to support setups with mixed versions - a client may get a paging state from a coordinator running a new version of Scylla and send it to a different coordinator running an old version - or vice versa. So the new uuid field is set up to have a default uuid of UUID() (a recognizable invalid uuid 0), so new versions receiving no uuid from an old version will set this invalid uuid, and old versions receiving a uuid from a new version will simply ignore it. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2018-03-13 10:34:34 +02:00
Avi Kivity	78a9ab827e	Merge seastar upstream * seastar 42159d4...bcfbe0c (1): > core: fix directory scanning by returning actual entry type Fixes #3274 (hopefully).	2018-03-12 20:58:44 +02:00
Duarte Nunes	36b8c1043d	Merge 'Reduce dependencies on messaging_service.hh' from Avi Refactor some includes to reduce dependencies on messaging_service.hh, which can change quite a lot as it includes many unrelated items itself. Tests: build * tag 'includes/messaging_service.hh/v1' of https://github.com/avikivity/scylla: tests: reduce dependencies in test_services.hh migration_manager: remove dependency on messaging_service.hh in header messaging_service: move msg_addr into its own header file	2018-03-12 18:49:13 +00:00
Avi Kivity	bd7881066a	tests: reduce dependencies in test_services.hh Convert storage_service_for_test to a pimpl implementation to reduce dependencies. Tests that depended on those includes were fixed to include their dependencies directly.	2018-03-12 20:05:23 +02:00
Avi Kivity	5f2600a71d	migration_manager: remove dependency on messaging_service.hh in header Use the new msg_addr.hh header to remove a dependency on messaging_service.hh.	2018-03-12 20:05:23 +02:00
Avi Kivity	dd12214628	messaging_service: move msg_addr into its own header file Make it possible to use msg_addr without depending on messaging_service.hh.	2018-03-12 20:05:23 +02:00
Avi Kivity	af383228fb	locator: remove empty file locator.cc Empty but for compiler-time-consuming includes. Message-Id: <20180312073018.21646-1-avi@scylladb.com>	2018-03-12 10:32:26 +01:00
Avi Kivity	29d0a46220	locator: add copyright and license statements to production_snitch_base.cc Message-Id: <20180312073104.21840-1-avi@scylladb.com>	2018-03-12 10:30:48 +01:00
Asias He	8624467e26	utils: Remove utils/utils.cc It is used to make sure the header compiles in the early days. Message-Id: <531fc6570805bd163afedd53f5d71e1b79a477d1.1520840644.git.asias@scylladb.com>	2018-03-12 09:47:40 +02:00
Duarte Nunes	0ccf1c581a	Merge 'Reduce gratuitous inclusions of system_keyspace.hh' from Avi Try to avoid recompilations by reducing inclusions of system_keyspace.hh in other header files. Tests: unit (release) * tag 'system_keyspace.hh/v1' of https://github.com/avikivity/scylla: storage_service: remove system_keyspace.hh include locator: de-inline reconnectable_snitch_helper locator: de-inline production_snitch_base cql3: remove #include of system_keyspace.hh	2018-03-11 22:56:20 +00:00
Avi Kivity	cd668061fc	storage_service: remove system_keyspace.hh include Re-distribute include among the files that really need it.	2018-03-11 18:53:49 +02:00
Avi Kivity	b946f8b308	locator: de-inline reconnectable_snitch_helper Reduce dependencies by de-inlining reconnectable_snitch_helper. A new home is found in production_snitch_base.cc, which is somewhat related.	2018-03-11 18:31:05 +02:00
Avi Kivity	84004a2574	locator: de-inline production_snitch_base De-inlining allows us to remove some dependencies, and those functions are too complex to inline anyway. A few always-throwing functions get the [[noreturn]] attribute to avoid damaging code generation.	2018-03-11 18:22:49 +02:00
Avi Kivity	4f6b892aa1	cql3: remove #include of system_keyspace.hh We include system_keyspace for just the string "system" (and a related is_system_keyspace() function). Replace with forward-declared functions.	2018-03-11 18:02:23 +02:00
Avi Kivity	7441c7153f	Merge seastar upstream * seastar 08e02dc...42159d4 (9): > memory: avoid unconditional calls to __tls_init > io_tester: bring back information about think time > Merge "Avoid continuations in I/O Scheduler path" from Glauber > Merge "Extend io_tester to support CPU loads" from Glauber > tutorial: fix undue complication in semaphore get_units() example > Tutorial: in HTML target, inline code snippets shouldn't be gray > tutorial: add build target for split HTML file > tutorial: mention seastar::thread as option for object lifetime management > tutorial: document new seastar::future::wait()	2018-03-11 15:45:42 +02:00
Avi Kivity	9569ba5e38	Update scylla-ami submodule * dist/ami/files/scylla-ami 3aa87a7...5170011 (3): > scylla_install_ami: install enhanced networking NIC drivers > scylla_install_ami: set kernel-ml as default kernel > scylla_install_ami: fix NIC down with enhanced networking on new base AMI	2018-03-11 15:45:05 +02:00
Raphael S. Carvalho	fb8ce14a36	sstables: don't set clustering components twice when loading sstable already called in update_info_for_opened_data() which is called by open_data(); no need for clustering components to be set early either. found it when auditing the code. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20180310225213.26017-1-raphaelsc@scylladb.com>	2018-03-11 10:10:35 +02:00
Tomasz Grabiec	3937352a9a	doc: Fix row_cache.md Dropped unfinished sentence and added missing "after". Message-Id: <1520615404-18458-1-git-send-email-tgrabiec@scylladb.com>	2018-03-10 16:27:04 +02:00
Raphael S. Carvalho	87035bd8d1	sstables: fix min and max timestamp when negative timestamp is specified unsigned type was incorrectly used for keeping track of min and max timestamp, so a negative number would be treated as a very high number that would incorrectly end up as max timestamp in sstable metadata. Fixes #3000. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20180308162217.18963-1-raphaelsc@scylladb.com>	2018-03-08 18:31:30 +02:00
Avi Kivity	596a9d0fb3	Merge "Make reader concurrency dual-restricted by count and memory" from Botond " Refs #2692 Fixes #3246 The current restricting algorithm [1] restricts the active-reader queue based on the memory consumption of the existing active readers. When this memory consumption is above the limit new readers are not admitted. The inactive reader queue on the other hand has a fixed length. This caused performance regressions on two workloads: * read-only: since the inactive-reader queue length is severely limited (compared to the previous situation) reads will timeout at loads comfortably handled before. * mixed: since the memory consumption happens only at admission time (already created active readers are not limited) memory consumption grew significantly causing problems when compactions kicked in. The solution is to reintroduce the old limit of 100 active concurrent user-reads while still keeping the memory-based limit as well. For workloads that don't consume a lot of memory or on large boxes with lots of memory the count-based limit will be reached which is reverting to the old well-known behaviour. For memory-hungry workloads or on small boxes with little memory the memory-based limit will kick in sooner avoiding memory overconsumption. [1] introduced by `bdbbfe9390` " * 'restricted-reader-dual-limit/v3' of https://github.com/denesb/scylla: Modify unit tests so that they test the dual-limits Use the reader_concurrency_semaphore to limit reader concurrency Add reader_concurrency_semaphore Add reader_resource_tracker param to mutation_source mv reader_resource_tracker.hh -> reader_concurrency_semaphore.hh	2018-03-08 14:36:05 +02:00
Botond Dénes	341ddd096a	Modify unit tests so that they test the dual-limits	2018-03-08 14:12:12 +02:00
Botond Dénes	1259031af3	Use the reader_concurrency_semaphore to limit reader concurrency	2018-03-08 14:12:12 +02:00
Botond Dénes	dfa04c3fea	Add reader_concurrency_semaphore This semaphore implements the new dual, count and memory based active reader limiting. As purely memory-based limiting proved to cause problems on big boxes admitting a large number of readers (more than any disk could handle) the previous count-based limit is reintroduced in addition to the existing memory-based limit. When creating new readers first the count-based limit is checked. If that clears the memory limit is checked before admitting the reader. reader_concurrency_semaphore wraps the two semaphores that implement these limits and enforces the correct order of limit checking. This class also completely replaces the restricted_reader_config struct, it encapsulates all data and related functionality of the latter, making client code simpler.	2018-03-08 14:12:12 +02:00
Botond Dénes	872fd369ba	Add reader_resource_tracker param to mutation_source Soon, reader_resource_tracker will only be constructible after the reader has been admitted. This means that the resource tracker cannot be preconstructed and just captured by the lambda stored in the mutation source and instead has to be passed in along the other parameters.	2018-03-08 14:12:09 +02:00
Botond Dénes	d5bb8a47fc	mv reader_resource_tracker.hh -> reader_concurrency_semaphore.hh In preparation to reader_concurrency_semaphore being added to the file. The reader_resource_tracker is really only a helper class for reader_concurrency_semaphore so the latter is better suited to provide the name of the file.	2018-03-08 10:29:16 +02:00
Avi Kivity	0ebfe448e3	Merge "Row-level eviction" from Tomasz " This series switches granularity of memory-pressure-induced eviction in cache from a partition to a row. Since `9b21a9b` cache can store partial partitions with row granularity but they were still evicted as a unit. This is problematic for the following reasons: - more is evicted than necessary, which decreases cache efficiency. In the worst case, whole cache gets evicted at once - evicting large amounts of memory (large partitions) at once may impact latency badly Fixes #2576. See the documentation added in patch titled "doc: Document row cache eviction" for details on how eviction works. Open issues to be fixed incrementally: - range tombstones are not evictable - cache update still has partition granularity, which causes bad latency on memtable flush with large partitions " * tag 'tgrabiec/row-level-eviction-v3' of github.com:scylladb/seastar-dev: (43 commits) doc: Document row cache eviction tests: cache: Add tests for row-level eviction tests: cache: Check that data is evictable after schema change tests: cache: Move definitions to the top tests: perf_cache_eviction: Switch eviction counter to row granularity tests: row_cache_alloc_stress: Avoid quadratic behavior cache: Introduce unlink_from_lru() cache: Add row-level stats about cache update from memtable mvcc: Propagate information if insertion happened from ensure_entry_if_complete() cache: Track number of rows and row invalidations cache: Evict with row granularity cache: Track static row insertions separately from regular rows tests: mvcc: Use apply_to_incomplete() to create versions tests: mvcc: Fix test_apply_to_incomplete() tests: cache: Do not depend on particular granularity of eviction tests: cache: Make sure readers touch rows in test_eviction() mvcc: Store complete rows in each version in evictable entries mvcc: Introduce partition_snapshot_row_cursor::ensure_entry_in_latest() tests: cache: Invoke partial eviction in test_concurrent_reads_and_eviction cache: Ensure all evictable partition_versions have a dummy after all rows ...	2018-03-07 17:57:07 +02:00
Tomasz Grabiec	4caeed7e40	doc: Document row cache eviction	2018-03-07 16:52:59 +01:00
Tomasz Grabiec	180a877db3	tests: cache: Add tests for row-level eviction	2018-03-07 16:52:59 +01:00
Tomasz Grabiec	9fab5068c6	tests: cache: Check that data is evictable after schema change	2018-03-07 16:52:59 +01:00
Tomasz Grabiec	f0e0c79a70	tests: cache: Move definitions to the top	2018-03-07 16:52:59 +01:00
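
The endpoint selection described in commit aaf67bcbaa (intersect the candidates with the preferred replicas, use the result if it satisfies the consistency level, otherwise fill up from the load-balanced candidates) can be sketched roughly as below. This is an illustration only, not the code of db::filter_for_query(): the `endpoint` alias, the `select_endpoints` name and the `required` count standing in for the CL requirement are all hypothetical, and trimming the result to exactly `required` entries is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a replica identifier.
using endpoint = std::string;

// candidates: live replicas, assumed to be already ordered by the
//             load-balancing strategy (most desirable first).
// preferred:  replicas that served the previous page (may be empty).
// required:   number of replicas needed to satisfy the consistency level.
std::vector<endpoint> select_endpoints(const std::vector<endpoint>& candidates,
                                       const std::vector<endpoint>& preferred,
                                       std::size_t required) {
    std::vector<endpoint> selected;

    // Step 1: intersection of candidates and preferred replicas,
    // kept in candidate order.
    for (const auto& ep : candidates) {
        if (std::find(preferred.begin(), preferred.end(), ep) != preferred.end()) {
            selected.push_back(ep);
        }
    }

    // Step 2: if the intersection alone satisfies the requirement, use it
    // (trimmed to the required count, which is an assumption of this sketch).
    if (selected.size() >= required) {
        selected.resize(required);
        return selected;
    }

    // Step 3: otherwise fill the remaining slots from the candidates in
    // load-balancing order, skipping those already selected.
    for (const auto& ep : candidates) {
        if (selected.size() == required) {
            break;
        }
        if (std::find(selected.begin(), selected.end(), ep) == selected.end()) {
            selected.push_back(ep);
        }
    }

    assert(selected.size() <= required);
    return selected;
}
```

With an empty `preferred` list, as on the first page of a query, the sketch falls through to step 3 and returns the first `required` candidates, which matches the "just like before" behaviour the commit message describes.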

14805 Commits