scylladb

Author	SHA1	Message	Date
Botond Dénes	b2f75a6c53	Add counters to monitor querier-cache efficiency Add the following counters: (1) querier_cache_lookups (2) querier_cache_misses (3) querier_cache_drops (4) querier_cache_time_based_evictions (5) querier_cache_resource_based_evictions (6) querier_cache_memory_based_evictions (7) querier_cache_population (1) counts the total number of querier cache lookups. Not all page-fetches will result in a querier lookup. For example the first page of a query will not do a lookup as there was no previous page to reuse the querier from. The second, and all subsequent pages however should attempt to reuse the querier from the previous page. (2) counts the subset of (1) where the read missed the querier cache (failed to find a matching saved querier). (3) counts the subset of (1) where the querier was recalled and dropped immediately. This can happen for example if the querier was at the wrong position. (4) counts the cached queriers that were evicted due to their TTL expiring. (5) counts the cached queriers that were evicted due to reader-resource (those limited by reader-concurrency limits) shortage. (6) counts the cached queriers that were evicted due to reaching the cache's memory limits (currently set to 4% of the shards' memory). (7) is the current number of entries in the cache. Note: * The count of cache hits can be derived from these counters as (1) - (2). * cache_drop (3) also implies a cache hit (see above). This means that the number of actually reused queriers is: (1) - (2) - (3)	2018-03-13 10:34:34 +02:00
Botond Dénes	212b2dabc4	Resource-based cache eviction Readers serving user-reads need to obtain a permit to start reading. There exists a restriction on how many active readers can be admitted based on their count and their memory consumption. Since the saved readers of cached queriers are technically active (they hold a permit) they can block new readers from obtaining a permit. New readers have a higher priority because a cached reader might be abandoned or used later at best so in the face of memory pressure we evict cached readers to free up permits for new readers. Cached queriers are evicted in LRU order as the oldest queriers are the most likely to be evicted based on their TTL anyway.	2018-03-13 10:34:34 +02:00
Botond Dénes	ff808d9ce6	Save and restore queriers in mutation_query() and data_query() Use the querier_cache (represented by the passed-in querier_cache_context) object to look up saved queriers at the start of the page and save them at the end of it if it is likely that there will be more page requests.	2018-03-13 10:34:34 +02:00
Botond Dénes	1259031af3	Use the reader_concurrency_semaphore to limit reader concurrency	2018-03-08 14:12:12 +02:00
Raphael S. Carvalho	aa75684ee7	sstables: Warn when an extra-large partition is written Based on https://issues.apache.org/jira/browse/CASSANDRA-9643 With the compaction_large_partition_warning_threshold_mb option set to 1, an example output follows: WARN 2018-02-22 19:52:11,029 [shard 0] sstable - Writing large row system/local:{key: pk{00056c6f63616c}, token:-7564491331177403445} (1276758 bytes) Fixes #2209. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20180306175912.19259-1-raphaelsc@scylladb.com>	2018-03-07 15:49:46 +00:00
Duarte Nunes	76e6423910	database: Truncate views when truncating the base table Fixes #3200 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180211124218.41373-1-duarte@scylladb.com>	2018-02-27 15:54:43 +02:00
Avi Kivity	d973445a94	Merge "sstable/schema extensions" from Calle " Adds extension points to schema/sstables to enable hooking in stuff, like, say, something that modifies how sstable disk io works. (Cough, cough, encryption) Extensions are processed as property keywords in CQL. To add an extension, a "module" must register it into the extensions object on boot time. To avoid globals (and yet don't), extensions are reachable from config (and thus from db). Table/view tables already contain an extension element, so we utilize this to persist config. schema_tables tables/views from mutations now require a "context" object (currently only extensions, but abstracted for easier further changes). Because of how schemas currently operate, there is a super lame workaround to allow "schema_registry" access to config and by extension extensions. DB, upon instantiation, calls a thread local global "init" in schema_registry and registers the config. It, in turn, can then call table_from_mutations as required. Includes the (modified) patch to encapsulate compression into objects, mainly because it is nice to encapsulate, and isolate a little. 
" * 'calle/extensions-v5' of github.com:scylladb/seastar-dev: extensions: Small unit test sstables: Process extensions on file open sstables::types: Add optional extensions attribute to scylla metadata sstables::disk_types: Add hash and comparator(sstring) to disk_string schema_tables: Load/save extensions table cql: Add schema extensions processing to properties schema_tables: Require context object in schema load path schema_tables: Add opaque context object config_file_impl: Remove ostream operators main/init: Formalize configurables + add extensions to init call db::config: Add extensions as a config sub-object db::extensions: Configuration object to store various extensions cql3::statements::property_definitions: Use std::variant instead of any sstables: Add extension type for wrapping file io schema: Add opaque type to represent extensions sstables::compress/compress: Make compression a virtual object	2018-02-26 17:15:29 +02:00
Botond Dénes	c4b5249a46	backlog_controller::adjust(): fix heap-overflow Make sure idx will not be equal to _control_points.size() (and thus overflow the vector) when looking for the first control-point with a backlog not smaller than the current one, by stopping when it's equal to _control_points.size() - 1. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <47841592792573d820650d570fa1ab7e58bdac2c.1518700405.git.bdenes@scylladb.com>	2018-02-26 13:47:38 +02:00
Raphael S. Carvalho	f59f423f3c	Make sstable loading faster by not invoking all shards for each sstable Before `312bd9ce25`, boot had to call all shards for each sstable such that they would agree/disagree on their deletion, an atomic deletion manager requirement. After its removal, we can afford to call only the shards that own a given sstable, reducing the number of operations from (SSTABLES) * (SHARD_COUNT) to usually (SSTABLES). It may be the same as before after resharding, but resharding is a one-off operation. Boot time should be significantly reduced for nodes with a high smp count and column family using leveled strategy (which can end up with thousands of sstables). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20180220032554.17776-1-raphaelsc@scylladb.com>	2018-02-22 09:39:56 +00:00
Avi Kivity	432268f582	Merge "branch 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla" from Raphael "The motivation is that it's no longer needed after the new resharding algorithm, which is solely responsible for working with shared sstables; regular compaction will not work with those! So resharding will schedule deletion of shared sstables once it's certain that shards that own them have the new unshared sstables. The manager was needed for orchestrating deletion of shared sstables across shards. It brings extra complexity that's no longer needed, and it was also overloading shard 0, but the latter could have been fixed. Tests: - unit: release mode - dtest: resharding_test.py" * 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla: Remove SSTable's atomic deletion manager Stop using SSTable's atomic deletion manager database: split column_family::rebuild_sstable_list	2018-02-08 19:10:16 +02:00
Avi Kivity	404172652e	Merge "Use xxHash for digest instead of MD5" from Duarte "This series changes digest calculation to use a faster algorithm (xxHash) and to also cache calculated cell hashes that can be kept in memory to speed up subsequent digest requests. The MD5 hash function has proved to be slow for large cell values: size = 256; elapsed = 4us size = 512; elapsed = 8us size = 1024; elapsed = 14us size = 2048; elapsed = 21us size = 4096; elapsed = 33us size = 8192; elapsed = 51us size = 16384; elapsed = 86us size = 32768; elapsed = 150us size = 65536; elapsed = 278us size = 131072; elapsed = 531us size = 262144; elapsed = 1032us size = 524288; elapsed = 2026us size = 1048576; elapsed = 4004us size = 2097152; elapsed = 7943us size = 4194304; elapsed = 15800us size = 8388608; elapsed = 31731us size = 16777216; elapsed = 64681us size = 33554432; elapsed = 130752us size = 67108864; elapsed = 263154us The xxHash is a non-cryptographic, 64bit (there's work in progress on the 128 version) hash that can be used to replace MD5. It performs much better: size = 256; elapsed = 2us size = 512; elapsed = 1us size = 1024; elapsed = 1us size = 2048; elapsed = 2us size = 4096; elapsed = 2us size = 8192; elapsed = 3us size = 16384; elapsed = 5us size = 32768; elapsed = 8us size = 65536; elapsed = 14us size = 131072; elapsed = 28us size = 262144; elapsed = 59us size = 524288; elapsed = 116us size = 1048576; elapsed = 226us size = 2097152; elapsed = 456us size = 4194304; elapsed = 935us size = 8388608; elapsed = 1848us size = 16777216; elapsed = 4723us size = 33554432; elapsed = 10507us size = 67108864; elapsed = 21622us Performance was tested using a 3 node cluster with 1 cpu and 8GB, and with the following cassandra-stress loaders. Measurements are for the read workload. 
sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=5000000 -schema 'replication(factor=3)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100 sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..5000000,5000000,500000)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100 xxhash + caching: Results: op rate : 32699 [READ:32699] partition rate : 32699 [READ:32699] row rate : 32699 [READ:32699] latency mean : 3.0 [READ:3.0] latency median : 3.0 [READ:3.0] latency 95th percentile : 3.9 [READ:3.9] latency 99th percentile : 4.5 [READ:4.5] latency 99.9th percentile : 6.6 [READ:6.6] latency max : 24.0 [READ:24.0] Total partitions : 10000000 [READ:10000000] Total errors : 0 [READ:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:05:05 END md5: Results: op rate : 25241 [READ:25241] partition rate : 25241 [READ:25241] row rate : 25241 [READ:25241] latency mean : 3.9 [READ:3.9] latency median : 3.9 [READ:3.9] latency 95th percentile : 5.1 [READ:5.1] latency 99th percentile : 5.8 [READ:5.8] latency 99.9th percentile : 8.0 [READ:8.0] latency max : 24.8 [READ:24.8] Total partitions : 10000000 [READ:10000000] Total errors : 0 [READ:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:06:36 END This translates into a 21% improvement for this workload. 
Bigger cell values were also tested: sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=1000000 -schema 'replication(factor=3)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100 sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..1000000,500000,100000)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100 xxhash + caching: Results: op rate : 19964 [READ:19964] partition rate : 19964 [READ:19964] row rate : 19964 [READ:19964] latency mean : 4.9 [READ:4.9] latency median : 4.6 [READ:4.6] latency 95th percentile : 7.2 [READ:7.2] latency 99th percentile : 11.5 [READ:11.5] latency 99.9th percentile : 13.6 [READ:13.6] latency max : 29.2 [READ:29.2] Total partitions : 10000000 [READ:10000000] Total errors : 0 [READ:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:08:20 END md5: Results: op rate : 12773 [READ:12773] partition rate : 12773 [READ:12773] row rate : 12773 [READ:12773] latency mean : 7.7 [READ:7.7] latency median : 7.3 [READ:7.3] latency 95th percentile : 10.2 [READ:10.2] latency 99th percentile : 16.8 [READ:16.8] latency 99.9th percentile : 19.2 [READ:19.2] latency max : 71.5 [READ:71.5] Total partitions : 10000000 [READ:10000000] Total errors : 0 [READ:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:13:02 END This translates into a 37% improvement for this workload. Fixes #2884 Tests: unit-tests (release), dtests (smp=2) Note: dtests are kinda broken in master (> 30 failures), so take the tests tag with a grain of himalayan salt." 
* 'xxhash/v5' of https://github.com/duarten/scylla: (29 commits) tests/row_cache_test: Test hash caching tests/memtable_test: Test hash caching tests/mutation_test: Use xxHash instead of MD5 for some tests tests/mutation_test: Test xx_hasher alongside md5_hasher schema: Remove unneeded include service/storage_proxy: Enable hash caching service/storage_service: Add and use xxhash feature message/messaging_service: Specify algorithm when requesting digest storage_proxy: Extract decision about digest algorithm to use cache_flat_mutation_reader: Pre-calculate cell hash partition_snapshot_reader: Pre-calculate cell hash query::partition_slice: Add option to specify when digest is requested row: Use cached hash for hash calculation mutation_partition: Replace hash_row_slice with appending_hash mutation_partition: Allow caching cell hashes mutation_partition: Force vector_storage internal storage size test.py: Increase memory for row_cache_stress_test atomic_cell_hash: Add specialization for atomic_cell_or_collection query-result: Use digester instead of md5_hasher range_tombstone: Replace feed_hash() member function with appending_hash ...	2018-02-08 18:24:58 +02:00
Raphael S. Carvalho	312bd9ce25	Remove SSTable's atomic deletion manager Not used anymore, can be deleted. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2018-02-07 22:38:45 -02:00
Raphael S. Carvalho	1472cfcc19	Stop using SSTable's atomic deletion manager The motivation is that it's no longer needed after the new resharding algorithm, which is solely responsible for working with shared sstables; regular compaction will not work with those! So resharding will schedule deletion of shared sstables once it's certain that shards that own them have the new unshared sstables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2018-02-07 22:27:17 -02:00
Raphael S. Carvalho	b78881c0e9	database: split column_family::rebuild_sstable_list The motivation is that resharding will not want the code that is specific to regular compaction after atomic deletion is removed. Resharding will eventually only need to replace old tables with new ones, and it will be in charge of deletion of old tables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2018-02-07 22:18:18 -02:00
Glauber Costa	4272279bbb	controllers: unify the I/O and CPU controllers We have had so far an I/O controller, for compactions and memtables, and a CPU controller, for memtables only -- since the scheduling was still quota-based. Now that the CPU scheduler is fully functional, it is time to do away with the differences and integrate them both into one. We now have a memtable controller and a compaction controller, and they control both CPU and I/O. In the future, we may want to control processes that don't do one of them, like cache updates. If that ever happens, we'll try to make controlling one of them optional. But for now, since the I/O and CPU controllers for our main two processes would look exactly the same we should integrate them. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-02-07 17:19:30 -05:00
Glauber Costa	7b6f188e27	controllers: allow a static priority to override the controller output We have merged the I/O controller without this, but we want to integrate the CPU and I/O controllers into one. Currently, the quota can be statically set for the CPU controller. For now, until we gain more experience with it we should allow a static value to override the controller's output as well. That is particularly important since we don't yet control some strategies like LCS and the time-based ones. Users in the field may be using one of those strategies with a static value for background quota. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-02-07 17:19:29 -05:00
Glauber Costa	b895d495cc	controllers: allow memtable I/O controller to have shares statically set This is so it looks more like the CPU controller. The end goal is to integrate them. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-02-07 17:19:29 -05:00
Glauber Costa	c099c98676	controllers: retire auto_adjust_flush_quota It no longer makes sense now that we have the full scheduler + controllers. In its place, we will provide an option to statically set the controller's shares as a safeguard against us getting this wrong. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-02-07 17:19:29 -05:00
Glauber Costa	2c1d5cf966	database: remove cpu_flush_quota metric We can now grab that from the CPU scheduler, that exports both runtime and shares. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-02-07 17:19:29 -05:00
Avi Kivity	ce94e6deb7	database: place data_query execution stage into scheduling_group Because execution stages defer and batch processing of the function they run, they escape their fiber's context and therefore the scheduling group. Fix (for data_query) by initializing the execution_stage with the query scheduling_group. To do that we have to move the execution stage into the database object, so it has access to the scheduling group during initialization.	2018-02-07 17:19:29 -05:00
Glauber Costa	956af9f099	database, main: set up scheduling_groups for our main tasks Set up scheduling groups for streaming, compaction, memtable flush, query, and commitlog. The background writer scheduling group is retired; it is split into the memtable flush and compaction groups. Comments from Glauber: This patch is based on a patch from Avi with the same subject, but the differences are significant enough so that I reset authorship. In particular: 1) A bug/regression is fixed with the boundary calculations for the memtable controller sampling function. 2) A leftover is removed, where after flushing a memtable we would go back to the main group before going to the cache group again 3) As per Tomek's suggestion, now the submission of compactions themselves is run in the compaction scheduling group. Having that working is what changes this patch the most: we now store the scheduling group in the compaction manager and let the compaction manager itself enforce the scheduling group. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-02-07 17:19:29 -05:00
Avi Kivity	641aaba12c	database, sstables, compaction: convert use of thread_scheduling_group to seastar cpu scheduler thread_scheduling_groups are converted to plain scheduling_group. Due to differences in initialization (scheduling_group initialization defers), we create the scheduling_groups in main.cc and propagate them to users via a new class database_config. The sstable writer loses its thread_scheduling_group parameter and instead inherits scheduling from its caller. Since shares are in the 1-1000 range vs. 0-1 for thread scheduling quotas, the flush controller was adjusted to return values within the higher ranges.	2018-02-07 17:19:29 -05:00
Calle Wilund	2b56bbfa7d	schema_tables: Require context object in schema load path Requires "workaround" fix for schema_registry and frozen_mutation, since the former is a free-float thread local, and the latter is a pure data carrier. frozen_schema can take a parameter for unfreeze, but schema registry requires being told which the system extensions are.	2018-02-07 10:11:46 +00:00
Duarte Nunes	6b4b429883	query-result: Introduce class result_options Introduce class result_options to carry result options through the request pipeline, which at this point means the result type and the digest algorithm. This class allows us to encapsulate the concrete digest algorithm to use. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-02-01 00:22:50 +00:00
Nadav Har'El	2ea1922a4d	Materialized views: serialize read-modify-update of base table Before this patch, our Materialized Views implementation can produce incorrect results when given concurrent updates of the same base-table row. Such concurrent updates may result, in certain cases, in two different rows added to the view table, instead of just one with the latest data. In this patch we add locking which serializes the two conflicting updates, and solves this problem. The locking for a single base-table column_family is implemented by the row_locker class introduced in a previous patch. A long comment in the code of this patch explains in more detail why this locking is needed, when, and what types of locks are needed: We sometimes need to lock a single clustering row, sometimes an entire partition, sometimes an exclusive lock and sometimes a shared lock. Fixes #3168 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2018-01-30 16:21:43 +02:00
Piotr Jastrzebski	5636a97c81	Remove unused query_state::reader Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2018-01-24 20:56:48 +01:00
Piotr Jastrzebski	39ec13133f	row_cache: rename make_flat_reader to make_reader Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2018-01-24 20:54:45 +01:00
Amnon Heiman	a0a1961b6d	database: correct the label creation for database reads The labels in database active_reads metrics were not defined correctly. Labels should be created so that it is possible to select based on their value. The current implementation defines a label "class" with three instances: user, streaming, system. Fixes: #2770 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20180123125206.23660-1-amnon@scylladb.com>	2018-01-24 20:09:40 +01:00
Glauber Costa	0c00667206	streaming big: keep write_monitor alive until the end of flush After the new compaction controller code, the monitor has to be kept alive until the sstable is added to the SSTable set. This is correctly handled for all the writers, except the streaming big. That flusher is a bit confusing, as it builds an sstable list first and only later adds the elements in the list to the sstable set. The monitors are destroyed at the end of phase 1, so we will SIGSEGV later when calling add_sstable(). The fix for this is to make sure the lifetime of the monitors is tied to the lifetime of the sstables being handled by the big streaming flush process. Caught by dtests, update_cluster_layout_tests.py:TestUpdateClusterLayout.add_node_with_large_partition3_test Fixes #3131 Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout.add_node_with_large_partition3_test now passes. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20180118202230.17107-1-glauber@scylladb.com>	2018-01-21 14:09:43 +02:00
Tomasz Grabiec	16e06b5b46	Merge "remove ability to create a non-flat mutation reader" from Piotr * seastar-dev.git haaawk/flat_reader_clean_up_mutation_source_v3: test_range_queries: create flat reader from source run_sstable_resharding_test: create flat reader from source make_sstable_containing: create flat reader from source test_cache_delegates_to_underlying_only_once_multiple_mutation: use flat reader Migrate materalized views to flat_mutation_reader test_can_write_and_read_non_compound_range_tombstone_as_compound: use flat reader test_writing_combined_stream_with_tombstones_at_the_same_position: use flat reader Add flat_mutation_reader::peek() Add flat_mutation_reader_assertions::produces_range_tombstone Accept clustering_row_ranges in flat_mutation_reader_assertions::produces Add flat_mutation_reader_assertions::produces_eos_or_empty_mutation Add flat_mutation_reader_assertions::fast_forward_to overload test_query_only_static_row: use flat reader Move mutation_rebuilder to header test_streamed_mutation_forwarding_is_consistent_with_slicing: use flat reader test_clustering_slices: use flat reader test_streamed_mutation_forwarding_guarantees: use flat reader test_streamed_mutation_forwarding_across_range_tombstones: use flat reader test_streamed_mutation_slicing_returns_only_relevant_tombstones: use flat reader Add flat_mutation_reader_assertions::is_buffer_full test_fast_forwarding_across_partitions_to_empty_range: use flat reader Remove unused mutation_source::operator() mutation_source: rename make_flat_mutation_reader to make_reader Clean up imports in tests	2018-01-19 12:43:50 +01:00
Piotr Jastrzebski	d266eaa01e	mutation_source: rename make_flat_mutation_reader to make_reader Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2018-01-19 09:30:12 +01:00
Piotr Jastrzebski	4c74b8c7e7	Migrate materalized views to flat_mutation_reader Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2018-01-18 07:32:35 +01:00
Tomasz Grabiec	b5d5bf5bc4	database: Invalidate only affected ranges from flush_streaming_mutations() Invalidating whole range causes larger latency spikes. Regression from 2.0 introduced in `d22fdf4261`. Refs #3119 Tests: units (release) Message-Id: <1516046938-26855-1-git-send-email-tgrabiec@scylladb.com>	2018-01-16 11:17:57 +02:00
Glauber Costa	08a0c3714c	allow request-specific read timeouts in storage proxy reads Timeouts are a global property. However, for tables in keyspaces like the system keyspace, we don't want to uphold that timeout--in fact, we want no timeout there at all. We already apply such configuration for requests waiting in the queued sstable queue: system keyspace requests won't be removed. However, the storage proxy will insert its own timeouts in those requests, causing them to fail. This patch changes the storage proxy read layer so that the timeout is applied based on the column family configuration, which is in turn inherited from the keyspace configuration. This matches our usual way of passing db parameters down. In terms of implementation, we can either move the timeout inside the abstract read executor or keep it external. The former is a bit cleaner, the latter has the nice property that all executors generated will share the exact same timeout point. In this patch, we chose the latter. We are also careful to propagate the timeout information to the replica. So even if we are talking about the local replica, when we add the request to the concurrency queue, we will do it in accordance with the timeout specified by the storage proxy layer. After this patch, Scylla is able to start just fine with very low timeouts--since read timeouts in the system keyspace are now ignored. Fixes #2462 Implementation notes, and general comments about open discussion in 2462: * Because we are not bypassing the timeout, just setting it high enough, I consider the concerns about the batchlog moot: if we fail for any other reason that will be propagated. Last case, because the timeout is per-CF, we could do what we do for the dirty memory manager and move the batchlog alone to use a different timeout setting. * Storage proxy likes specifying its timeouts as a time_point, whereas when we get low enough as to deal with the read_concurrency_config, we are talking about deltas. 
So at some point we need to convert time_points to durations. We do that in the database query functions. v2: - use per-request instead of per-table timeouts. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-12 07:43:21 -05:00
Glauber Costa	3c9eeea4cf	restricted_mutation_reader: don't pass timeouts through the config structure This patch enables passing a timeout to the restricted_mutation_reader through the read path interface -- using fill_buffer and friends. This will serve as a basis for having per-request timeouts. The config structure still has a timeout, but that is so far only used to actually pass the value to the query interface. Once that starts coming from the storage proxy layer (next patch) we will remove it. The query callers are patched so that we pass the timeout down. We patch the callers in database.cc, but leave the streaming ones alone. That can be safely done because the default for the query path is now no_timeout, and that is what the streaming code wants. So there is no need to complicate the interface to allow for passing a timeout that we intend to disable. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-12 07:43:21 -05:00
Glauber Costa	5140aaea00	add a timeout to fast forward to In the last patch, we enabled timeouts in fill_buffer. There are many places, though, in which we fast_forward_to before we fill_buffer, so in order to make that effective we need to propagate the timeouts to fast_forward_to as well. In the same way as fill_buffer, we make the argument optional wherever possible in the high level callers, making them mandatory in the implementations. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-12 07:43:19 -05:00
Glauber Costa	80c4a211d8	consolidate timeout_clock At the moment, various different subsystems use their different ideas of what a timeout_clock is. This makes it a bit harder to pass timeouts between them because although most are actually a lowres_clock, that is not guaranteed to be the case. As a matter of fact, the timeout for restricted reads is expressed as nanoseconds, which is not a valid duration in the lowres_clock. As a first step towards fixing this, we'll consolidate all of the existing timeout_clocks in one, now called db::timeout_clock. Other things that tend to be expressed in terms of that clock--like the fact that the maximum time_point means no timeout and a semaphore that wait()s with that resolution are also moved to the common header. In the upcoming patch we will fix the restricted reader timeouts to be expressed in terms of the new timeout_clock. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-11 12:07:41 -05:00
Glauber Costa	40c428dc19	database: delete unused function no in-tree users. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-11 12:07:41 -05:00
Avi Kivity	72c673fcc3	Merge "I/O Controller for memtables and compactions" from Glauber "This patchset implements the compaction controller for I/O shares. The goal is to automatically adjust compaction shares based on a strategy-specific backlog. A higher backlog will translate into higher shares. As compaction progresses, that reduces the backlog. As new data is flushed, that increases the backlog. The goal of the controller is to keep the backlog constant at a certain rate, so that we go neither too fast nor too slow. Tracking reads and writes: ========================== Tracking of reads and writes happens through the read_monitor and the write_monitor. The write monitor is an existing interface that has the purpose of releasing the write permit at particular points of the write process. We enhance it so as to get a reference to an instance that tracks the current offset inside the sstables::file_writer. This way the backlog tracker can always know for sure what's the offset of the current write. A similar thing is done for reads. The data_consumer already tracks the position of the current read, and we isolate that into a structure to which we can get a reference. A read_monitor allows us to connect the compaction to that reference. Lifetime management: ==================== In general, tracking objects will be owned by their callers and passed down as references. The compaction object will own the read monitors and the compaction write monitors and the memtable flush write monitor will be kept alive in a do_with block around the flush itself. The backlog_{write,read}_progress_manager needs to be kept alive until the SSTable is no longer in progress. For writes, that means until we are able to add the SSTable charges in full, and for reads (compaction) that means until we are able to remove the charges in full. It is important to do that to avoid spikes in the graph. 
If we remove the progress managers in a different operation than updating the SSTable list we will be left in a temporary state where charges appear or disappear abruptly, to be fixed when the final add_sstable/remove_sstable happens. So we want those things to happen together. The compaction_backlog_tracker is kept alive until the strategy changes, for example, through ALTER TABLE. Current charges are transferred to the new strategy's compaction_backlog_tracker object when we do that. If the type of strategy changes, the current read charges are forgotten. We can do that because those running compactions will not really contribute to decreasing the backlog of the new compaction strategy. Transfer of Charges =================== When ALTER TABLE happens, we need to transfer ongoing writes to the new backlog manager. Ongoing reads will still be tracked by the backlog_manager that originated them. The rationale for that is that reads still belong to the current compaction, with the strategy that generated them. But new Tables being written will add to the backlog of the new strategy. Note that ALTER TABLE operations do not necessarily cause a change of Strategy. We can be using the same strategy but just changing properties. If that is the case, we expect no discontinuity in the backlog graph (tested). Resharding ========== Resharding compactions are more complex than normal compactions because the SSTables are created in one shard and later sent to another shard. It is better, then, to track resharding compactions separately and let them have their own backlog tracker, which will insert backlog in proportion to the amount of data to be resharded. Memtable Flush I/O Controller ============================= With the current infrastructure it becomes trivial to add a new controller, for either I/O or CPU. This patchset then adds an I/O controller for memtable flushes, using the same backlog algorithm that we already used for CPU." 
* 'compaction-controller-io-v5' of github.com:glommer/scylla: database: add a controller for I/O on memtable flushes. document the compaction controller compaction: adjust shares for compactions backlog_controllers: implement generic I/O controller factor out some of the controller code io shares: multiply all shares by 10 compaction_strategy: implement backlog manager for the SizeTiered strategy infrastructure for backlog estimator for compaction work. sstables: notify about end of data component write sstables: add read_monitor_generator sstables: add read_monitor sstables: enhance data consumer with a position tracker sstables: enhance the file_writer with an offset tracker sstables: pass references instead of pointers for write_monitor compaction: control destruction of readers	2018-01-07 15:00:10 +02:00
Glauber Costa	4f1b875784	database: add a controller for I/O on memtable flushes. The algorithm and principle of operation is the same as the CPU controller. It is, however, always enabled and we will operate on I/O shares. I/O-bound workloads are expected to hit the maximum once virtual dirty fills up and stay there while the load is steady. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-03 19:58:57 -05:00
Glauber Costa	244c564aac	compaction: adjust shares for compactions Compactions can be a heavy disk user and the I/O scheduler can always guarantee that they use their fair share of disk. Such a fair share can, however, be a lot more than what compaction actually needs. This patch draws on the controllers infrastructure to adjust the I/O shares that the compaction class will get so that compaction bandwidth is dynamically adjusted. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-03 19:58:57 -05:00
Glauber Costa	4b44a22236	backlog_controllers: implement generic I/O controller Like the CPU controller, but will act on I/O priorities. Shares can go from 0 to 1000. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-03 19:56:54 -05:00
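The controller commits above describe a mapping in which a higher backlog yields higher I/O shares, bounded to the range 0 to 1000. The log does not give the control function itself, so the following is only a minimal sketch under stated assumptions: the piecewise-linear shape, the normalized backlog scale, the `CONTROL_POINTS` values, and the name `shares_for_backlog` are all hypothetical and are not taken from the ScyllaDB source.

```python
from bisect import bisect_right

# Hypothetical (normalized backlog, shares) control points; illustrative only.
CONTROL_POINTS = [(0.0, 0.0), (0.5, 200.0), (1.0, 1000.0)]


def shares_for_backlog(backlog, points=CONTROL_POINTS):
    """Map a backlog value to I/O shares by linear interpolation.

    Values outside the control points are clamped, so the result
    always stays within the 0..1000 range mentioned in the commit.
    """
    xs = [x for x, _ in points]
    if backlog <= xs[0]:
        return points[0][1]
    if backlog >= xs[-1]:
        return points[-1][1]
    # Find the segment [x0, x1] that contains the backlog value.
    i = bisect_right(xs, backlog)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (backlog - x0) * (y1 - y0) / (x1 - x0)


low = shares_for_backlog(0.25)
high = shares_for_backlog(0.75)
saturated = shares_for_backlog(2.0)
```

In this sketch the shares grow slowly while the backlog is small and steeply once it approaches the upper control point, which is one way to express "neither too fast nor too slow".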
Glauber Costa	1671d9c433	factor out some of the controller code The control algorithm we are using for memtables has proven itself quite successful. We will very likely use the same for other processes, like compactions. Make the code a bit more generic, so that a new controller only has to set the desired parameters. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-03 19:56:54 -05:00
Raphael S. Carvalho	818830715f	Fix potential infinite recursion when combining mutations for leveled compaction The issue is triggered by compaction of sstables of level higher than 0. The problem happens when the interval map of the partitioned sstable set stores intervals such as the following: [-9223362900961284625 : -3695961740249769322 ] (-3695961740249769322 : -3695961103022958562 ] When the selector is called for the first interval above, the exclusive lower bound of the second interval is returned as the next token, but the inclusiveness info is not returned. So reader_selector was reporting that there were new readers when the current token was -3695961740249769322 because it was stored in the selector position field as inclusive, but it's actually exclusive. This false positive was leading to infinite recursion in the combined reader because the sstable set's incremental selector itself knew that there were actually no new readers, and therefore no progress could be made. The fix is to use ring_position in reader_selector, so that inclusiveness is respected. So reader_selector::has_new_readers() won't return a false positive under the conditions described above. Fixes #2908. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2018-01-03 16:23:01 -02:00
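The false positive described in this commit can be illustrated with a small model. This is a sketch only, not the ScyllaDB implementation: the `Position` class stands in for the idea of a ring position that carries inclusiveness, and both `has_new_readers` helpers are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Position:
    """Hypothetical stand-in for a position that remembers inclusiveness.

    Ordering is by (token, after_token), so "at the token" sorts before
    "just after the token" for the same token value.
    """
    token: int
    after_token: bool = False


def has_new_readers_token_only(next_token, current_token):
    # Buggy variant: the exclusive lower bound is compared as a bare token,
    # so it is implicitly treated as inclusive.
    return current_token >= next_token


def has_new_readers(next_pos, current_pos):
    # Variant that respects inclusiveness of the next position.
    return current_pos >= next_pos


T = -3695961740249769322  # exclusive lower bound of the second interval

# The current read position is exactly at T, which the second interval excludes.
buggy = has_new_readers_token_only(T, T)
fixed = has_new_readers(Position(T, after_token=True), Position(T))
```

With the bare token comparison the selector claims new readers at `T` even though the second interval starts strictly after `T`; once the bound carries its exclusiveness, the same query reports none, and a position past `T` still reports new readers.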
Glauber Costa	ca284174d0	infrastructure for backlog estimator for compaction work. This patch adds infrastructure in various points in the system to allow us to determine the amount of work present as backlog from compactions. What needs to be done can be explained in three major pieces: 1) Add hooks in the points where sstables are added or inserted to a column family (or more precisely, to a compaction_strategy object). 2) Add hooks in read and write monitors that allow a compaction backlog estimator (tracker) to become aware of bytes that are partially written and compacted away. 3) Add a per-column family class (compaction_backlog_tracker) that can be used to track work that is done and relevant to compactions (like the two above), and a compaction manager to provide a system-wide backlog based on the response of the individual trackers. The definition of how much backlog one has is strategy-specific. The Null strategy is easy, as it never really has any backlog, and so is the major strategy - since what really matters is the backlog of the underlying compaction strategy. Although backlogs are strategy-specific, they should be "compatible", in the sense that if a particular strategy has more work to do, it should yield a higher number than its counterparts. All the others are presented in this patch as unimplemented: they will always advertise a mild backlog that should yield a constant CPU-utilization if used alone. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:07 -05:00
Glauber Costa	86d7c160fd	sstables: notify about end of data component write We need to notify the monitor that the offset tracker that we are using is about to be destroyed and will no longer be valid. While we could modify the file_writer interface so that we could capture the offset_tracker and take ownership of it - guaranteeing it is alive until we reach the existing on_write_completed(), this feels like a layer violation. It is also potentially useful in general to offer the monitor callers with knowledge that writing the data portion is done. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:07 -05:00
Glauber Costa	3bd6bceaf0	sstables: add read_monitor_generator Passing the read monitor down to the sstable readers is tricky. The points of interest - like compaction - are usually very far from the interfaces that register the monitor, like read_rows. Between the two, there is usually a mutation_reader, which is and ought to be totally unaware of the read monitor: technically, a mutation_reader may not even know it is backed by sstables. The solution is to create a read_monitor_generator that can be passed from the upper layers, like compaction, to the layers that are actually making the decision of which sstables to create readers for. Note that we don't need an equivalent piece of infrastructure for writes, because writes don't happen through hidden layers and have all the information they need to initialize their monitors. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:07 -05:00
Glauber Costa	110b8531f4	sstables: enhance the file_writer with an offset tracker Callers, like the memtable flusher or compactions will be able to find out the current amount of bytes written at any time. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:07 -05:00
Glauber Costa	00df0a5ad3	sstables: pass references instead of pointers for write_monitor This came from Avi's review on the read_monitors. He suggests we wouldn't keep shared pointers, and would instead have the caller ensuring lifetime. That makes sense, but having the writer interface using shared_ptr and the read interface using references would lead to an inconsistent interface. For the sake of consistency we will change the write monitor to take references before we do that. From database.cc's perspective, we could now keep the monitors in a do_with() block, but we will keep the shared_ptrs to manage their lifetime in anticipation of upcoming patches in this series, where we'll have to pass them somewhere else. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:06 -05:00
Avi Kivity	8795238869	Merge "Fix handling of range tombstones starting at same position" from Tomasz "When we get two range tombstones with the same lower bound from different data sources (e.g. two sstables), which need to be combined into a single stream, they need to be de-overlapped, because each mutation fragment in the stream must have a different position. If we have range tombstones [1, 10) and [1, 20), the result of that de-overlapping will be [1, 10) and [10, 20]. The problem is that if the stream corresponds to a clustering slice with upper bound greater than 1, but lower than 10, the second range tombstone would appear as being out of the query range. This is currently violating assumptions made by some consumers, like the cache populator. One effect of this may be that a reader will miss rows which are in the range (1, 10) (after the start of the first range tombstone, and before the start of the second range tombstone), if the second range tombstone happens to be the last fragment which was read for a discontinuous range in cache and we stopped reading at that point because of a full buffer and cache was evicted before we resumed reading, so we went to reading from the sstable reader again. There could be more cases in which this violation may resurface. There is also a related bug in mutation_fragment_merger. If the reader is in forwarding mode, and the current range is [1, 5], the reader would still emit range_tombstone([10, 20]). If that reader is later fast forwarded to another range, say [6, 8], it may produce fragments with smaller positions than those which were emitted before, violating monotonicity of fragment positions in the stream. A similar bug was also present in partition_snapshot_flat_reader. 
Possible solutions: 1) relax the assumption (in cache) that streams contain only relevant range tombstones, and only require that they contain at least all relevant tombstones 2) allow subsequent range tombstones in a stream to share the same starting position (position is weakly monotonic), then we don't need to de-overlap the tombstones in readers. 3) teach combining readers about query restrictions so that they can drop fragments which fall outside the range 4) force leaf readers to trim all range tombstones to query restrictions This patch implements solution no. 2. It simplifies combining readers, which don't need to accumulate and trim range tombstones. I don't like solution 3, because it makes combining readers more complicated, slower, and harder to properly construct (currently combining readers don't need to know restrictions of the leaf streams). Solution 4 is confined to implementations of leaf readers, but also has the disadvantage of making those more complicated and slower. There is only one consumer which needs the tombstones with monotonic positions, and that is the sstable writer. Fixes #3093." 
* tag 'tgrabiec/fix-out-of-range-tombstones-v1' of github.com:scylladb/seastar-dev: tests: row_cache: Introduce test for concurrent read, population and eviction tests: sstables: Add test for writing combined stream with range tombstones at same position tests: memtable: Test that combined mutation source is a mutation source tests: memtable: Test that memtable with many versions is a mutation source tests: mutation_source: Add test for stream invariants with overlapping tombstones tests: mutation_reader: Test fast forwarding of combined reader with overlapping range tombstones tests: mutation_reader: Test combined reader slicing on random mutations tests: mutation_source_test: Extract random_mutation_generator::make_partition_keys() mutation_fragment: Introduce range() clustering_interval_set: Introduce overlaps() clustering_interval_set: Extract private make_interval() mutation_reader: Allow range tombstones with same position in the fragment stream sstables: Handle consecutive range_tombstone fragments with same position tests: streamed_mutation_assertions: Merge range_tombstones with the same position in produces_range_tombstone() streamed_mutation: Introduce peek() mutation_fragment: Extract mergeable_with() mutation_reader: Move definition of combining mutation reader to source file mutation_reader: Use make_combined_reader() to create combined reader	2018-01-02 18:32:09 +02:00
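The difference between de-overlapping and the weakly monotonic stream chosen in this merge can be sketched as follows. This is an illustration under simplifying assumptions, not ScyllaDB code: tombstones are modeled as plain `(start, end)` tuples treated as half-open ranges, timestamps are ignored, and `deoverlap_same_start` and `starts_inside` are hypothetical helpers.

```python
def deoverlap_same_start(a, b):
    """Split two ranges that share a start into non-overlapping pieces.

    Models the strictly monotonic stream: after splitting, every piece
    has a distinct start position.
    """
    short, long_ = sorted([a, b], key=lambda r: r[1])
    if short[0] != long_[0]:
        raise ValueError("sketch only handles equal start positions")
    if short[1] == long_[1]:
        return [short]
    return [short, (short[1], long_[1])]


def starts_inside(rng, slice_):
    """True if the range starts within the half-open query slice."""
    return slice_[0] <= rng[0] < slice_[1]


query_slice = (1, 5)  # clustering slice with an upper bound between 1 and 10

# Strictly monotonic stream: the second piece starts at 10, past the slice.
strict = deoverlap_same_start((1, 10), (1, 20))
strict_out_of_range = [r for r in strict if not starts_inside(r, query_slice)]

# Weakly monotonic stream (solution 2): both tombstones keep start position 1.
weak = [(1, 10), (1, 20)]
weak_out_of_range = [r for r in weak if not starts_inside(r, query_slice)]
```

Under de-overlapping, the second piece lands outside the query slice, which is the condition the merge message says consumers such as the cache populator did not expect. Allowing equal start positions leaves nothing out of range in this example.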

1004 Commits