scylladb

Author	SHA1	Message	Date
Paweł Dziepak	295689d16f	db: include counter writes on leader in metrics Counters write path on leader is completely different than on any other replica (non-leaders share write path between counters and regular columns). This patch makes sure that counter writes performed on leader are added to appropriate metrics. Message-Id: <20170725153346.31238-1-pdziepak@scylladb.com>	2017-07-25 18:31:43 +02:00
Raphael S. Carvalho	637f3bfa50	db: refresh row cache's underlying data source after compaction The underlying data source in the row cache holds a reference to the sstable set from before compaction, which isn't released until a memtable flush. This means file descriptors of deleted sstables remain open, wasting disk space. The fix is to refresh the underlying data source in the row cache. Fixes #2570. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-24 15:49:11 -03:00
Raphael S. Carvalho	e3ad676433	db: atomically synchronize cache with changes to the snapshot updates to cache and snapshot (i.e. sstable set) aren't synchronized, so it may happen that cache update for memtable flush will use wrong snapshot version, and that violates cache invariant of each partition entry only reflecting one snapshot. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-24 15:45:05 -03:00
Tomasz Grabiec	714d609605	database: Fix reversed order of keyspace and table names in a log message Message-Id: <1500649623-25377-1-git-send-email-tgrabiec@scylladb.com>	2017-07-21 17:10:17 +02:00
Tomasz Grabiec	408cea66cd	database: Allow disabling auto snapshots during drop/truncate Message-Id: <1500573920-31478-1-git-send-email-tgrabiec@scylladb.com>	2017-07-21 16:56:29 +02:00
Avi Kivity	c5ee62a6a4	Merge "restrict background writers with scheduling groups" from Glauber "This patchset restricts background writers - such as compactions, streaming flushes and memtable flushes - to a maximum amount of CPU usage through a seastar::thread_scheduling_group. The recommended maximum is 50 % - it is disabled by default, but can be adjusted through a configuration option until we are able to auto-tune this. The second patch in this series provides a preview of what such auto-tuning would look like. By implementing a simple controller we automatically adjust the quota for the memtable writer processes, so that the rate at which bytes come in is equal to the rate at which bytes are flushed. Tail latencies are greatly reduced by this series, and heavy spikes that previously appeared on CPU-bound workloads are no more." * 'memtable-controller-v5' of https://github.com/glommer/scylla: simple controller for memtable/streaming writer shares. restrict background writers to 50 % of CPU.	2017-07-20 10:58:53 +03:00
Calle Wilund	247c36e048	system_schema: Fix remaining places not handling two system keyspaces Some places remained where code looked directly at system_keyspace::NAME to determine whether a ks is considered special/system/protected, including schema digest calculation. Export "is_system_keyspace" and use accordingly. Message-Id: <1500469809-23546-1-git-send-email-calle@scylladb.com>	2017-07-19 16:18:45 +03:00
Glauber Costa	c9a529ebee	simple controller for memtable/streaming writer shares. This patch introduces a simple controller that will adjust memtable CPU shares, trying to keep virtual dirty around the soft limit: if we start going below it means we're too fast (unless we are idle) and shares are adjusted downwards. If we start going above it means we're too slow and shares are adjusted upwards. I have tested this extensively in a single-CPU setup with various CPU-bound workloads while tracking virtual dirty and the results are good, with virtual dirty fluctuating only slightly, somewhere within the desired range. Exceptions to this are: 1) when the load is very light - the idle system goes faster, and that's ok 2) when the load is very high - as foreground requests dominate we can't flush fast enough and hit the hard limit. However, in such scenarios the memtable shares do hit their maximum, and the results are no worse than they are right now and this will only be fixed by CPU-limiting the actual requests. This feature can be disabled with a config option - that is scheduled to go away as we acquire more confidence in this. When the feature is disabled, all background writers (streaming, compaction, memtables) will share the same scheduling group, with static quotas. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-07-18 23:35:47 -04:00
Glauber Costa	4f01ec0910	restrict background writers to 50 % of CPU. In scylla, we have foreground processes, which are latency sensitive and need to be responded to as fast as possible in order to maintain good latency profiles, and background processes, which are less so. The most important background processes we have during normal write workload operations are memtable writes and sstable compactions. Those processes are quite CPU-intensive, and left unchecked will easily dominate the CPU. Lower values of task-quota usually help, as they force those processes to preempt more, but aren't enough to guarantee good isolation. We have seen boxes with good NVMe storage having their throughput reduced to less than half of the original baseline in a short dive down for the duration of a compaction. In the long run, our goal is to leverage the CPU scheduler to make sure that those processes are balanced with respect to all the others. However, the current state of affairs is causing grievances at this very moment. Thankfully, those processes live in a seastar::thread, that ships with its own rudimentary bandwidth control mechanism: the scheduling group. The goal of this patch is to wrap background processes together in a scheduling group, and assign to such group 50 % of our CPU power; the remainder being left to foreground processes. While we pride ourselves in dynamically adjusting things to the workload, we won't be able to do this properly before the CPU scheduler lands - and let's face it, letting background processes run wild is not adaptive either. Every workload would benefit most from a different value for such shares, but 50 % is as fair as it gets if we really need static partitioning in the meantime. As a defense against unforeseen consequences, we'll leave the actual value as an option, but will do our best to hide it - as this is not a tunable that we want to be part of a normal Scylla setup. 
The most convenient place for this tunable is still db::config, so we can easily pass it down to the database layer - but we will not document it in the yaml, and will clearly note in the help string that it is not supposed to be tuned. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-07-18 23:35:33 -04:00
Duarte Nunes	2c711922cc	database: Drop mutations that raced with truncate Mutations that race with a truncate can just be dropped. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-16 00:08:05 +02:00
Duarte Nunes	0825c9c805	database: Rename replay_position_reordered_exception Rename replay_position_reordered_exception to mutation_reordered_with_truncate_exception for more precision, since this is the only situation where this exception can be thrown. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-16 00:08:05 +02:00
Duarte Nunes	5f24e9a4a5	memtable: Stop tracking the highest flushed rp Since we no longer enforce that mutations are applied in memory ordered by their replay_positions, the way the highest_flush_rp is being tracked is no longer correct. The invariant it was used to maintain no longer exists, so we can get rid of it together with the assertion on the highest_flush_rp on flush(). Fixes #2074 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:56:06 +02:00
Duarte Nunes	003941cd95	column_family: Stop using flush_queue Since commitlog ordering requirements have been relaxed, we now keep the set of replay_positions seen by a memtable in a set, which we then use to clean up relevant segments in the commitlog. This means that the guarantees provided by the flush_queue are no longer necessary. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:56:00 +02:00
Duarte Nunes	7e6fe5895e	column_family: Don't bother closing the flush_queue on stop() When stopping a column family we issue a flush(), for which we wait. Since writes are supposed to have stopped coming in, and also new flush requests, there's no need to call and wait for the flush_queue to be closed. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:58 +02:00
Duarte Nunes	a1f4536ffb	column_family: Don't rely on flush_queue to guarantee flushes finished We now don't ensure mutations are applied in memory following the order of their replay positions, so we can't rely on the replay position to order memtable flushes. So, use a phased_barrier() to ensure that calling flush() returns a future that completes when all flushes up to that point have finished. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:58 +02:00
Duarte Nunes	1b320496e2	dirty_memory_manager: Remove unnecessary check from flush_one() We don't need to check whether a memtable is empty in flush_one(), as that must be checked later, during the actual sealing. The condition itself is rare and is checked already after the potentially contended semaphore has been acquired. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:57 +02:00
Duarte Nunes	59bdaed02b	column_family: More precise count of pending flushes This patch ensures we update the count of pending flushes in the same place as we update the stats across column families, which is more correct since it only accounts for actual flushes and not those of empty memtables or that have been coalesced together. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:25 +02:00
Duarte Nunes	3e27c335a9	column_family: Fix typo in pending_tasks metric name Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:25 +02:00
Duarte Nunes	a11724c6e1	column_family: More precise count of switched memtables The memtable_switch_count metric is supposed to count the number of times a flush has resulted in the memtable being switched out, but we were incrementing the count regardless of whether we tried to flush an empty memtable or two or more flushes were coalesced into one. This patch fixes this by moving the metric to where the memtable is actually switched. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:25 +02:00
Duarte Nunes	bca1b19ce9	commitlog: Always flush latest memtable We now don't ensure mutations are applied in memory following the order of their replay positions, so we can't rely on the replay position to order memtable flushes. When flushing commit log segments, ensure we flush the latest memtable. Refs #2074 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:25 +02:00
Duarte Nunes	3df6777b9b	database: Load views after loading tables Since base tables no longer look for their views, we need to parse base tables first so that when we add a view we can fetch and connect it to its base table. When announcing view table mutations to other nodes we always include the base table mutations, so there's no need to expect a view being added before its base table. Found out while testing view building. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170712172115.2960-1-duarte@scylladb.com>	2017-07-13 11:14:02 +02:00
Duarte Nunes	136accdbf6	database: Fix typos in metric descriptions Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170709145522.19534-1-duarte@scylladb.com>	2017-07-09 18:35:17 +03:00
Botond Dénes	b1082641f9	Make sure keyspace strategy class is stored in qualified form Even when it's provided in unqualified (short) form. Fixes #767 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <4379f8864843e64c097d432fd06129ce4025f100.1499322476.git.bdenes@scylladb.com>	2017-07-06 14:50:00 +03:00
Raphael S. Carvalho	972a0237ef	database: restore indentation for cleanup_sstables Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170630035324.19881-2-raphaelsc@scylladb.com>	2017-07-03 12:48:54 +03:00
Raphael S. Carvalho	b9d0645199	database: fix potential use-after-free in sstable cleanup when do_for_each is in its last iteration and with_semaphore defers because there's an ongoing cleanup, sstable object will be used after freed because it was taken by ref and the container it lives in was destroyed prematurely. Let's fix it with a do_with, also making code nicer. Fixes #2537. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170630035324.19881-1-raphaelsc@scylladb.com>	2017-07-03 12:48:53 +03:00
Avi Kivity	fc966c0c4c	Merge "tombstone removal compaction" from Raphael "This feature is intended to make compaction more efficient at getting rid of droppable tombstones and expired data wasting disk space. So far, people have been dealing with it manually through major compaction. With strategies other than date tiered, large sstables will be left untouched for a long time even though all their data is expired. Date tiered suffers from it when mixing data with different TTLs because it only includes for compaction sstables that are fully expired. sstables keep as metadata a histogram which allows us to easily estimate the droppable data ratio from gc_before. sstables whose droppable data ratio is above 20% (default value for tombstone_threshold option) will be considered candidates for the operation. Like in C, we will only do tombstone removal compaction when there's nothing to compact in the standard way. It would be interesting to trigger it too when disk usage is above a given threshold, but I decided to leave this for later. Fixes #2306." 
'tombstone_removal_compaction_v4' of github.com:raphaelsc/scylla: tests: more testing for tombstone compaction options tests: basic tombstone compaction test for date tiered compaction/dtcs: add support for tombstone compaction tests: basic test of tombstone compaction with lcs compaction/lcs: add support for tombstone compaction tests: basic tombstone compaction test for size tiered compaction/stcs: add support for tombstone compaction tests: add test for estimation of droppable tombstone ratio sstables: introduce function to estimate droppable tombstone ratio compaction_manager: periodically submit cfs for compaction streaming_histogram: fix coding style tests: add streaming_histogram_test streaming_histogram: implement sum tests: add test for sstable with bad tombstone histogram sstables: discard bad streaming histogram for future use tests: add sstable tombstone histogram test streaming_histogram: fix update streaming_histogram: move it to utils streaming_histogram: do not limit it to be used by sstables sstables: update tombstone_histogram for cells with expiration time	2017-06-29 10:19:59 +03:00
Raphael S. Carvalho	a3a73899bc	database: remove outdated FIXME comments Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170621002253.29660-1-raphaelsc@scylladb.com>	2017-06-28 11:06:02 +02:00
Raphael S. Carvalho	fb9bc609c6	streaming_histogram: do not limit it to be used by sstables streaming histogram will later be placed in /utils, so we want it to use std::unordered_map<> instead of disk_hash<>. That also requires implementing serialization/deserialization functions for it. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-27 16:51:52 -03:00
Nadav Har'El	6cf44f6817	Optimize column_family::make_sstable_reader() for one partition This patch does the same thing to column_family::make_sstable_reader() as commit `186f031` did to sstable::as_mutation_source(). Although usually one can fast_forward_to() on the result of a column_family::make_sstable_reader(), earlier we had an optimization where if a single partition was specified, it was read exactly, and fast_forward_to() was NOT allowed. With the mutation_reader::forwarding flag patch, when this flag was on - requesting fast_forward_to() - we disabled this optimization. This makes sense, but is not backward compatible with the code that previously assumed this optimization exists. In particular, column_family::data_query() does a single partition read but does not specify forwarding::no explicitly. So this patch restores this optimization, despite this meaning that we blatantly ignore the fwd_mr flag in that case. Fixes #2524. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170626141121.30322-1-nyh@scylladb.com>	2017-06-26 17:13:03 +03:00
Avi Kivity	9b21a9bfb6	Merge "Implement partial cache" from Tomasz and Piotr "This series enables cache to keep partial partitions. Reads no longer have to read whole partition from sstables in order to cache the result. The 10MB threshold for partition size in cache is lifted. Known issues: - There is no partial eviction yet, whole partitions are still evicted, and partition snapshots held by active reads are not evictable at all - Information about range continuity is not recorded if that would require inserting a dummy entry, or if previous entry doesn't belong to the latest snapshot - Cache update after memtable flush happening concurrently with reads may inhibit that reads' ability to populate cache (new issue) - Cache update from flushed memtables has partition granularity, so may cause latency problems with large partition - Schema is still tracked per-partition, so after schema changes reads may induce high latency due to whole partition needing to be converted atomically - Range tombstones are repeated in the stream for every range between cache entries they cover (new issue) - Populating scans for both small and large partitions (perf_fast_forward) experienced a 40% reduction of throughput, CPU bound How was this tested: - test.py --mode release - row_cache_stress_test -c1 -m1G - perf_fast_forward, passes except for the test case checking range continuity population which would require inserting a dummy entry (mentioned above) - perf_simple_query (-c1 -m1G --duration 32): before: 90k [ops/s] stdev: 4k [ops/s] after: 94k [ops/s] stdev: 2k [ops/s]" * tag 'tgrabiec/introduce-partial-cache-v8' of github.com:cloudius-systems/seastar-dev: (130 commits) tests: row_cache: Add test_tombstone_merging_in_partial_partition test case tests: Introduce row_cache_stress_test utils: Add helpers for dealing with nonwrapping_range<int> tests: simple_schema: Allow passing the tombstone to make_range_tombstone() tests: simple_schema: Accept value by reference tests: 
simple_schema: Make add_row() accept optional timestamp tests: simple_schema: Make new_timestamp() public tests: simple_schema: Introduce make_ckeys() tests: simple_schema: Introduce get_value(const clustered_row&) helper tests: simple_schema: Fix comment tests: simple_schema: Add missing include row_cache: Introduce evict() tests: Add cache_streamed_mutation_test tests: mutation_assertions: Allow expecting fragments mutation_fragment: Implement equality check tests: row_cache: Add test for population of random partitions tests: row_cache: Add test for partition tombstone population tests: row_cache: Test reading randomly populated partition tests: row_cache: Add test_single_partition_update() tests: row_cache: Add test_scan_with_partial_partitions ...	2017-06-26 14:54:37 +03:00
Avi Kivity	555621b537	Disentangle memtables from sstables Remove sstable::write_components(memtable), replacing it with a helper. Fixes #2354 Message-Id: <20170624142639.16662-1-avi@scylladb.com>	2017-06-26 09:37:11 +02:00
Tomasz Grabiec	1828e28bbb	database: Invalidate cache atomically with attaching streaming sstables Not doing so may cause reads to see partial writes, if another update+read happens in between.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	896196b841	database: Invalidate cache from seal_active_streaming_memtable_immediate() Cache must be synchronized atomically with changing the underlying mutation source, otherwise write atomicity may not hold.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	8ba6366610	row_cache: Switch to using snapshot_source Currently every time cache needs to create reader for missing data it obtains a reader which is most up to date. That reader includes writes from later populate phases, for which update() was not yet called. This will be problematic once we allow partitions to be partially populated, because different parts of the partition could be partially populated using readers using different sets of writes, and break write atomicity. The solution will be to always populate given partition using the same set of writes, using reader created from the current snapshot. The snapshot changes only on update(), with update() gradually converting each partition to the new snapshot.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	446bcdb00d	database: Add missing cache invalidation after attaching sstables This violation of the contract is currently benign, because there are no reads from those tables before they are populated. If there were, the cache would mark the whole (empty) range as continuous and the table would appear empty. It will cause similar problem once cache starts using snapshots of the underlying mutation source. Then this lack of invalidate() will also result in cache thinking that the table is still empty.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	e23c7e2f34	row_cache: Rework invalidate() implementation 1) Reduce duplication by delegating to more general overloads 2) Improve documentation to not mention effects in terms of population (detail) but rather write visibility 3) Rename clear() to invalidate() and merge with the range variant, it has the same semantics	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	c82c6ec6ed	database: Allow obtaining snapshot_source for sstables	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	77f944880c	cache: Remove support for wide partitions This will be handled by row cache now. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Avi Kivity	8585a356eb	Revert "Revert "db: prevent latency spikes during streaming/repair"" This reverts commit `399d219cab`. Turns out it was not the culprit.	2017-06-21 16:58:04 +03:00
Avi Kivity	399d219cab	Revert "db: prevent latency spikes during streaming/repair" This reverts commit bdfa2ed923245e236837f58925c797e26df32361; prevents nodes from joining.	2017-06-21 11:28:29 +03:00
Avi Kivity	bdfa2ed923	db: prevent latency spikes during streaming/repair The memtable destructor can take a long time if the memtable is full; use clear_gently() to clear it without impacting latency. Fixes #2477. Message-Id: <20170620093550.16121-1-avi@scylladb.com>	2017-06-20 13:03:43 +02:00
Nadav Har'El	3018df11b5	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As a result of this, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means (a) we will never read-ahead beyond last_end, and (b) fast_forward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::forwarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now accept also the option mutation_reader::forwarding. The exceptions are functions which are known to read only a single partition, and not support fast_forward_to() a different partition range. 
We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not improve anything if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170619152629.11703-1-nyh@scylladb.com>	2017-06-19 18:31:32 +03:00
Avi Kivity	6e2c9ef9fb	Revert "Allow reading exactly desired byte ranges and fast_forward_to" This reverts commit `317d7fc253` (and also the related `2c57ab84b2`). It causes crashes during range scans, reported by Gleb: "To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s dataset and 3 node cluster. Backtrace: at /home/gleb/work/seastar/seastar/core/apply.hh:36 rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57 range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142 at ./seastar/core/future.hh:890 at /home/gleb/work/seastar/seastar/core/future-util.hh:119 at /home/gleb/work/seastar/seastar/core/future-util.hh:142	2017-06-18 16:10:21 +03:00
Avi Kivity	9cf6db3de5	Merge	2017-06-15 19:11:07 +03:00
Nadav Har'El	317d7fc253	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As a result of this, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means (a) we will never read-ahead beyond last_end, and (b) fast_forward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::forwarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now accept also the option mutation_reader::forwarding. The exceptions are functions which are known to read only a single partition, and not support fast_forward_to() a different partition range. 
We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not improve anything if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170614072122.13473-1-nyh@scylladb.com>	2017-06-15 13:22:46 +01:00
Avi Kivity	da24bd7c34	Merge "Balance read requests according to CF's cache hit ratio" from Gleb "During read query with CL<ALL not all replicas are contacted. It is possible for some replicas to cache less data for some CF's (for instance because of node restart), so the replica choice may have a big impact on request's completion latency and on amount of work it generates in a cluster. This patch series keep track of per CF cached hit ratio and uses this information to choose best replicas for a request. Nodes with lower hit ratios are still contacted in order to populate their cache, but less frequently." * 'gleb/cache-hitrate' of github.com:cloudius-systems/seastar-dev: storage_proxy: load balance read requests according to cache hit rates choose extra replica for speculation in filter_for_query() consistency_level: drop filter_for_query_dc_local function database: reset node's hit rate information on connection drop messaging_service: connection drop notifier Store cluster wide cache hit statistics in CF messaging_service: return cache hit ratio as part of data read Distribute cache temperature over gossiper. periodically calculate avg cache hit rate between all shards database: introduce cache_temperature class Rename load_broadcaster.cc to misc_services.cc storage_proxy: use db::count_local_endpoints function instead open code it	2017-06-15 14:33:08 +03:00
Calle Wilund	525730e135	database: Fix assert in truncate to handle empty memtables+sstables If we do two truncates in a row, the second will have neither memtable nor sstable data. Thus we will not write/remove sstables, and thus get no resulting truncation replay position. Message-Id: <1497378469-6063-1-git-send-email-calle@scylladb.com>	2017-06-14 11:21:21 +02:00
Gleb Natapov	ca812a8ea0	database: reset node's hit rate information on connection drop Node may go down, so after it restarts cache hit rate info will be incorrect and it can be overwhelmed with traffic until new and up-to-date cache hit rate arrives. Solve this by dropping node's information on connection reset, it is more accurate than relying on gossip which may be slow and miss reboot of a node.	2017-06-13 09:57:14 +03:00
Gleb Natapov	0e4d5bc2f3	Store cluster wide cache hit statistics in CF	2017-06-13 09:57:14 +03:00
Gleb Natapov	69c5526301	messaging_service: return cache hit ratio as part of data read	2017-06-13 09:57:14 +03:00
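
Note on commit c9a529ebee (simple controller for memtable/streaming writer shares): the message describes raising the memtable writer's CPU shares when virtual dirty memory rises above the soft limit and lowering them when it falls below. The following is a minimal sketch of that feedback rule only, not the code from the patch. The struct name, its fields, the proportional gain and the clamping bounds are all hypothetical, chosen for illustration.

```cpp
#include <algorithm>

// Hypothetical illustration of a proportional share controller.
// None of these names come from the Scylla source tree.
struct share_controller {
    double soft_limit;  // target amount of virtual dirty memory (bytes)
    double min_shares;  // lower bound on the shares we hand out
    double max_shares;  // upper bound on the shares we hand out
    double gain;        // shares added per unit of relative error (assumed)

    // Returns the adjusted share count for the memtable writer.
    // dirty > soft_limit: flushing is too slow, so shares go up.
    // dirty < soft_limit: flushing is too fast, so shares go down.
    double adjust(double current_shares, double dirty) const {
        const double error = (dirty - soft_limit) / soft_limit;
        const double next = current_shares + gain * error;
        return std::min(max_shares, std::max(min_shares, next));
    }
};
```

Clamping at max_shares mirrors the high-load case in the commit message, where the memtable shares hit their maximum and the hard limit can still be reached.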

841 Commits
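
Note on merge fc966c0c4c (tombstone removal compaction): the cover letter says that sstables whose estimated droppable data ratio is above tombstone_threshold (20% by default) become candidates, and that this only happens when there is nothing to compact in the standard way. The sketch below illustrates just that selection rule. The type sstable_info, the function tombstone_candidates and the flag standard_compaction_pending are hypothetical names, and the ratio is taken as a precomputed input rather than estimated from the sstable's histogram and gc_before.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an sstable plus its estimated droppable ratio.
struct sstable_info {
    std::string name;
    double droppable_ratio;  // assumed already estimated, in [0, 1]
};

// Selects sstables for tombstone removal compaction.
// Returns nothing while standard compaction still has work to do.
std::vector<std::string> tombstone_candidates(
        const std::vector<sstable_info>& sstables,
        bool standard_compaction_pending,
        double tombstone_threshold = 0.2) {
    std::vector<std::string> candidates;
    if (standard_compaction_pending) {
        return candidates;
    }
    for (const auto& sst : sstables) {
        // "Above" the threshold is read here as strictly greater.
        if (sst.droppable_ratio > tombstone_threshold) {
            candidates.push_back(sst.name);
        }
    }
    return candidates;
}
```

With the default threshold, an sstable at a ratio of 0.35 would be selected and one at 0.05 would not, provided no standard compaction is pending.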