scylladb

Author	SHA1	Message	Date
Duarte Nunes	a85232dd82	Fix compilation errors on GCC 6 GCC 6 inconsistently requires explicitly calling a member function through "this->" for lambda functions capturing "this". Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170731143755.21970-1-duarte@scylladb.com>	2017-07-31 17:40:44 +03:00
Avi Kivity	3fe6731436	Merge "Reduce the effect of the latency metrics" from Amnon "This series reduces that effect in two ways: 1. Remove the latency counters from the system keyspaces 2. Reduce the histogram size by limiting the maximum number of buckets and stop the last bucket." Fixes #2650. * 'amnon/remove_cf_latency_v2' of github.com:cloudius-systems/seastar-dev: database: remove latency from the system table estimated histogram: return a smaller histogram	2017-07-31 15:58:30 +03:00
Duarte Nunes	c81431ad16	column_family: Re-acquire flush permit in case of error If we fail to flush an sstable, after creating the flush_reader, then we will have released the flush permit when we retry the flush. Ensure that when retrying, we re-acquire the flush permit. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Duarte Nunes	9162e016da	column_family: Don't hold sstable read lock when retrying flush Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Duarte Nunes	1a33cc6847	sstables: Release the flush permit before fsyncing This allows a queued flush to start while we fsync the current sstable, which helps reduce the overall time new writes are blocked on dirty memory. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Duarte Nunes	a2b732c156	dirty_memory_manager: Refactor flush permit lifetime management This patch refactors how the flush permit lifetime is managed, dropping the current hash table in favour of a RAII approach. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Duarte Nunes	f647f5b14a	dirty_memory_manager: Invert permit acquisition order For an upcoming fix it is required to invert the permit acquisition order: first we acquire the background work permit and then the single flush permit. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Duarte Nunes	e371accac8	memtable_list: Register different seal functions for each behaviour Instead of passing a flush_behaviour to the seal function, use two different functions for each of the behaviours. This will be important in the forthcoming patches, which will require the signatures of those functions to differ. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Avi Kivity	e855a28fae	Revert "Merge "memtable flush: Fixes and improvements" from Duarte" This reverts commit `733a64a1df`, reversing changes made to `e11e66723a`. Breaks sstable_test and perf_fast_forward.	2017-07-31 12:44:28 +03:00
Duarte Nunes	0f1bd81523	column_family: Re-acquire flush permit in case of error If we fail to flush an sstable, after creating the flush_reader, then we will have released the flush permit when we retry the flush. Ensure that when retrying, we re-acquire the flush permit. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Duarte Nunes	2f4cffc7f6	column_family: Don't hold sstable read lock when retrying flush Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Duarte Nunes	5e64839e85	sstables: Release the flush permit before fsyncing This allows a queued flush to start while we fsync the current sstable, which helps reduce the overall time new writes are blocked on dirty memory. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Duarte Nunes	ef1275e9dd	dirty_memory_manager: Refactor flush permit lifetime management This patch refactors how the flush permit lifetime is managed, dropping the current hash table in favour of a RAII approach. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Duarte Nunes	cfc8fae33f	dirty_memory_manager: Invert permit acquisition order For an upcoming fix it is required to invert the permit acquisition order: first we acquire the background work permit and then the single flush permit. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Duarte Nunes	7e68e4677d	memtable_list: Register different seal functions for each behaviour Instead of passing a flush_behaviour to the seal function, use two different functions for each of the behaviours. This will be important in the forthcoming patches, which will require the signatures of those functions to differ. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Amnon Heiman	a71b9e498a	database: remove latency from the system table This patch removes the latency histograms from the system table and extends the already existing exclusion to all system keyspaces. It also uses the new get_histogram API to set a minimal bucket size of 100 microseconds.	2017-07-27 11:41:15 +03:00
Paweł Dziepak	295689d16f	db: include counter writes on leader in metrics Counters write path on leader is completely different than on any other replica (non-leaders share write path between counters and regular columns). This patch makes sure that counter writes performed on leader are added to appropriate metrics. Message-Id: <20170725153346.31238-1-pdziepak@scylladb.com>	2017-07-25 18:31:43 +02:00
Raphael S. Carvalho	637f3bfa50	db: refresh row cache's underlying data source after compaction The underlying data source in row cache holds a reference to the sstable set prior to compaction, which isn't released until a memtable flush. This means file descriptors of deleted sstables remain open, wasting disk space. The fix is to refresh the underlying data source in row cache. Fixes #2570. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-24 15:49:11 -03:00
Raphael S. Carvalho	e3ad676433	db: atomically synchronize cache with changes to the snapshot Updates to cache and snapshot (i.e. sstable set) aren't synchronized, so a cache update for a memtable flush may use the wrong snapshot version, which violates the cache invariant that each partition entry reflects only one snapshot. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-24 15:45:05 -03:00
Tomasz Grabiec	714d609605	database: Fix reversed order of keyspace and table names in a log message Message-Id: <1500649623-25377-1-git-send-email-tgrabiec@scylladb.com>	2017-07-21 17:10:17 +02:00
Tomasz Grabiec	408cea66cd	database: Allow disabling auto snapshots during drop/truncate Message-Id: <1500573920-31478-1-git-send-email-tgrabiec@scylladb.com>	2017-07-21 16:56:29 +02:00
Avi Kivity	c5ee62a6a4	Merge "restrict background writers with scheduling groups" from Glauber "This patchset restricts background writers - such as compactions, streaming flushes and memtable flushes - to a maximum amount of CPU usage through a seastar::thread_scheduling_group. The recommended maximum is 50 %. It is disabled by default, but can be adjusted through a configuration option until we are able to auto-tune this. The second patch in this series provides a preview of what such auto-tuning would look like. By implementing a simple controller we automatically adjust the quota for the memtable writer processes, so that the rate at which bytes come in is equal to the rate at which bytes are flushed. Tail latencies are greatly reduced by this series, and heavy spikes that previously appeared on CPU-bound workloads are no more." * 'memtable-controller-v5' of https://github.com/glommer/scylla: simple controller for memtable/streaming writer shares. restrict background writers to 50 % of CPU.	2017-07-20 10:58:53 +03:00
Calle Wilund	247c36e048	system_schema: Fix remaining places not handling two system keyspaces Some places remained where code looked directly at system_keyspace::NAME to determine iff a ks is considered special/system/protected. Including schema digest calculation. Export "is_system_keyspace" and use accordingly. Message-Id: <1500469809-23546-1-git-send-email-calle@scylladb.com>	2017-07-19 16:18:45 +03:00
Glauber Costa	c9a529ebee	simple controller for memtable/streaming writer shares. This patch introduces a simple controller that will adjust memtables CPU shares, trying to keep it around the soft limit: if we start going below it means we're too fast (unless we are idle) and shares are adjusted downwards. If we start going above it means we're too slow and shares are adjusted upwards. I have tested this extensively in a single-CPU setup with various CPU-bound workloads while tracking virtual dirty and the results are good, with virtual dirty fluctuating only slightly, somewhere within the desired range. Exceptions to this are: 1) when the load is very light - the idle system goes faster, and that's ok 2) when the load is very high - as foreground requests dominate we can't flush fast enough and hit the hard limit. However, in such scenarios the memtable shares do hit their maximum, and the results are no worse than they are right now and this will only be fixed by CPU-limiting the actual requests. This feature can be disabled with a config option - that is scheduled to go away as we acquire more confidence in this. When the feature is disabled, all background writers (streaming, compaction, memtables) will share the same scheduling group, with static quotas. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-07-18 23:35:47 -04:00
Glauber Costa	4f01ec0910	restrict background writers to 50 % of CPU. In scylla, we have foreground processes, which are latency sensitive and need to be responded to as fast as possible in order to maintain good latency profiles, and background processes, which are less so. The most important background processes we have during normal write workload operations are memtable writes and sstable compactions. Those processes are quite CPU-intensive, and left unchecked will easily dominate the CPU. Lower values of task-quota usually help, as it will force those processes to preempt more, but aren't enough to guarantee good isolation. We have seen boxes with good NVMe storage having their throughput reduced to less than half of the original baseline in a short dive down for the duration of a compaction. In the long run, our goal is to leverage the CPU scheduler to make sure that those processes are balanced with respect to all the others. However, the current state of affairs is causing grievances at this very moment. Thankfully, those processes live in a seastar::thread, that ships with its own rudimentary bandwidth control mechanism: the scheduling group. The goal of this patch is to wrap background processes together in a scheduling group, and assign to such group 50 % of our CPU power; the remainder being left to foreground processes. While we pride ourselves in dynamically adjusting things to the workload, we won't be able to do this properly before the CPU scheduler lands - and let's face it, letting background processes run wild is not adaptive either. Every workload would benefit most from a different value for such shares, but 50 % is as fair as it gets if we really need static partitioning in the meantime. As a defense against unforeseen consequences, we'll leave the actual value as an option, but will do our best to hide it - as this is not a tunable that we want to be part of a normal Scylla setup. The most convenient place for this tunable is still db::config, so we can easily pass it down to the database layer - but we will not document it in the yaml, and will clearly note in the help string that it is not supposed to be tuned. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-07-18 23:35:33 -04:00
Duarte Nunes	2c711922cc	database: Drop mutations that raced with truncate Mutations that race with a truncate can just be dropped. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-16 00:08:05 +02:00
Duarte Nunes	0825c9c805	database: Rename replay_position_reordered_exception Rename replay_position_reordered_exception to mutation_reordered_with_truncate_exception for more precision, since this is the only situation where this exception can be thrown. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-16 00:08:05 +02:00
Duarte Nunes	5f24e9a4a5	memtable: Stop tracking the highest flushed rp Since we no longer enforce that mutations are applied in memory ordered by their replay_positions, the way the highest_flush_rp is being tracked is no longer correct. The invariant it was used to maintain no longer exists, so we can get rid of it together with the assertion on the highest_flush_rp on flush(). Fixes #2074 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:56:06 +02:00
Duarte Nunes	003941cd95	column_family: Stop using flush_queue Since commitlog ordering requirements have been relaxed, we now keep the set of replay_positions seen by a memtable in a set, which we then use to clean up relevant segments in the commitlog. This means that the guarantees provided by the flush_queue are no longer necessary. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:56:00 +02:00
Duarte Nunes	7e6fe5895e	column_family: Don't bother closing the flush_queue on stop() When stopping a column family we issue a flush(), for which we wait. Since writes are supposed to have stopped coming in, and also new flush requests, there's no need to call and wait for the flush_queue to be closed. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:58 +02:00
Duarte Nunes	a1f4536ffb	column_family: Don't rely on flush_queue to guarantee flushes finished We now don't ensure mutations are applied in memory following the order of their replay positions, so we can't rely on the replay position to order memtable flushes. So, use a phased_barrier() to ensure that calling flush() returns a future that completes when all flushes up to that point have finished. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:58 +02:00
Duarte Nunes	1b320496e2	dirty_memory_manager: Remove unnecessary check from flush_one() We don't need to check whether a memtable is empty in flush_one(), as that must be checked later, during the actual sealing. The condition itself is rare and is checked already after the potentially contended semaphore has been acquired. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:57 +02:00
Duarte Nunes	59bdaed02b	column_family: More precise count of pending flushes This patch ensures we update the count of pending flushes in the same place as we update the stats across column families, which is more correct since it only accounts for actual flushes and not those of empty memtables or that have been coalesced together. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:25 +02:00
Duarte Nunes	3e27c335a9	column_family: Fix typo in pending_tasks metric name Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:25 +02:00
Duarte Nunes	a11724c6e1	column_family: More precise count of switched memtables The memtable_switch_count metric is supposed to count the number of times a flush has resulted in the memtable being switched out, but we were incrementing the count regardless of whether we tried to flush an empty memtable or two or more flushes were coalesced into one. This patch fixes this by moving the metric to where the memtable is actually switched. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:25 +02:00
Duarte Nunes	bca1b19ce9	commitlog: Always flush latest memtable We now don't ensure mutations are applied in memory following the order of their replay positions, so we can't rely on the replay position to order memtable flushes. When flushing commit log segments, ensure we flush the latest memtable. Refs #2074 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:25 +02:00
Duarte Nunes	3df6777b9b	database: Load views after loading tables Since base tables no longer look for their views, we need to parse base tables first so that when we add a view we can fetch and connect it to its base table. When announcing view table mutations to other nodes we always include the base table mutations, so there's no need to expect a view being added before its base table. Found out while testing view building. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170712172115.2960-1-duarte@scylladb.com>	2017-07-13 11:14:02 +02:00
Duarte Nunes	136accdbf6	database: Fix typos in metric descriptions Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170709145522.19534-1-duarte@scylladb.com>	2017-07-09 18:35:17 +03:00
Botond Dénes	b1082641f9	Make sure keyspace strategy class is stored in qualified form Even when it's provided in unqualified (short) form. Fixes #767 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <4379f8864843e64c097d432fd06129ce4025f100.1499322476.git.bdenes@scylladb.com>	2017-07-06 14:50:00 +03:00
Raphael S. Carvalho	972a0237ef	database: restore indentation for cleanup_sstables Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170630035324.19881-2-raphaelsc@scylladb.com>	2017-07-03 12:48:54 +03:00
Raphael S. Carvalho	b9d0645199	database: fix potential use-after-free in sstable cleanup When do_for_each is in its last iteration and with_semaphore defers because there's an ongoing cleanup, the sstable object will be used after being freed, because it was taken by ref and the container it lives in was destroyed prematurely. Let's fix it with a do_with, also making code nicer. Fixes #2537. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170630035324.19881-1-raphaelsc@scylladb.com>	2017-07-03 12:48:53 +03:00
Avi Kivity	fc966c0c4c	Merge "tombstone removal compaction" from Raphael "This feature is intended to make compaction more efficient at getting rid of droppable tombstones and expired data wasting disk space. So far, people have been dealing with it manually through major compaction. With strategies other than date tiered, large sstables will be left untouched for a long time even though it's all expired. Date tiered suffers from it when mixing data with different TTL because it only includes fully expired sstables for compaction. sstables keep a histogram as metadata, which allows us to easily estimate the droppable data ratio from gc_before. sstables whose droppable data ratio is above 20% (default value for tombstone_threshold option) will be considered candidates for the operation. Like in C, we will only do tombstone removal compaction when there's nothing to compact in standard way. It would be interesting to trigger it too when disk usage is above a given threshold, but I decided to leave this for later. Fixes #2306." * 'tombstone_removal_compaction_v4' of github.com:raphaelsc/scylla: tests: more testing for tombstone compaction options tests: basic tombstone compaction test for date tiered compaction/dtcs: add support for tombstone compaction tests: basic test of tombstone compaction with lcs compaction/lcs: add support for tombstone compaction tests: basic tombstone compaction test for size tiered compaction/stcs: add support for tombstone compaction tests: add test for estimation of droppable tombstone ratio sstables: introduce function to estimate droppable tombstone ratio compaction_manager: periodically submit cfs for compaction streaming_histogram: fix coding style tests: add streaming_histogram_test streaming_histogram: implement sum tests: add test for sstable with bad tombstone histogram sstables: discard bad streaming histogram for future use tests: add sstable tombstone histogram test streaming_histogram: fix update streaming_histogram: move it to utils streaming_histogram: do not limit it to be used by sstables sstables: update tombstone_histogram for cells with expiration time	2017-06-29 10:19:59 +03:00
Raphael S. Carvalho	a3a73899bc	database: remove outdated FIXME comments Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170621002253.29660-1-raphaelsc@scylladb.com>	2017-06-28 11:06:02 +02:00
Raphael S. Carvalho	fb9bc609c6	streaming_histogram: do not limit it to be used by sstables streaming histogram will later be placed in /utils, so we want it to use std::unordered_map<> instead of disk_hash<>. That also requires implementing serialization/deserialization functions for it. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-27 16:51:52 -03:00
Nadav Har'El	6cf44f6817	Optimize column_family::make_sstable_reader() for one partition This patch does the same thing to column_family::make_sstable_reader() as commit `186f031` did to sstable::as_mutation_source(). Although usually one can fast_forward_to() on the result of a column_family::make_sstable_reader(), earlier we had an optimization where if a single partition was specified, it was read exactly, and fast_forward_to() was NOT allowed. With the mutation_reader::forwarding flag patch, when this flag was on - requesting fast_forward_to() - we disabled this optimization. This makes sense, but is not backward compatible with the code which previously assumed this optimization exists. In particular, column_family::data_query() does a single partition read but does not specify forwarding::no explicitly. So this patch returns this optimization, despite this meaning that we blatantly ignore the fwd_mr flag in that case. Fixes #2524. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170626141121.30322-1-nyh@scylladb.com>	2017-06-26 17:13:03 +03:00
Avi Kivity	9b21a9bfb6	Merge "Implement partial cache" from Tomasz and Piotr "This series enables cache to keep partial partitions. Reads no longer have to read whole partition from sstables in order to cache the result. The 10MB threshold for partition size in cache is lifted. Known issues: - There is no partial eviction yet, whole partitions are still evicted, and partition snapshots held by active reads are not evictable at all - Information about range continuity is not recorded if that would require inserting a dummy entry, or if previous entry doesn't belong to the latest snapshot - Cache update after memtable flush happening concurrently with reads may inhibit those reads' ability to populate cache (new issue) - Cache update from flushed memtables has partition granularity, so may cause latency problems with large partition - Schema is still tracked per-partition, so after schema changes reads may induce high latency due to whole partition needing to be converted atomically - Range tombstones are repeated in the stream for every range between cache entries they cover (new issue) - Populating scans for both small and large partitions (perf_fast_forward) experienced a 40% reduction of throughput, CPU bound How was this tested: - test.py --mode release - row_cache_stress_test -c1 -m1G - perf_fast_forward, passes except for the test case checking range continuity population which would require inserting a dummy entry (mentioned above) - perf_simple_query (-c1 -m1G --duration 32): before: 90k [ops/s] stdev: 4k [ops/s] after: 94k [ops/s] stdev: 2k [ops/s]" * tag 'tgrabiec/introduce-partial-cache-v8' of github.com:cloudius-systems/seastar-dev: (130 commits) tests: row_cache: Add test_tombstone_merging_in_partial_partition test case tests: Introduce row_cache_stress_test utils: Add helpers for dealing with nonwrapping_range<int> tests: simple_schema: Allow passing the tombstone to make_range_tombstone() tests: simple_schema: Accept value by reference tests: simple_schema: Make add_row() accept optional timestamp tests: simple_schema: Make new_timestamp() public tests: simple_schema: Introduce make_ckeys() tests: simple_schema: Introduce get_value(const clustered_row&) helper tests: simple_schema: Fix comment tests: simple_schema: Add missing include row_cache: Introduce evict() tests: Add cache_streamed_mutation_test tests: mutation_assertions: Allow expecting fragments mutation_fragment: Implement equality check tests: row_cache: Add test for population of random partitions tests: row_cache: Add test for partition tombstone population tests: row_cache: Test reading randomly populated partition tests: row_cache: Add test_single_partition_update() tests: row_cache: Add test_scan_with_partial_partitions ...	2017-06-26 14:54:37 +03:00
Avi Kivity	555621b537	Disentangle memtables from sstables Remove sstable::write_components(memtable), replacing it with a helper. Fixes #2354 Message-Id: <20170624142639.16662-1-avi@scylladb.com>	2017-06-26 09:37:11 +02:00
Tomasz Grabiec	1828e28bbb	database: Invalidate cache atomically with attaching streaming sstables Not doing so may cause reads to see partial writes, if another update+read happens in between.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	896196b841	database: Invalidate cache from seal_active_streaming_memtable_immediate() Cache must be synchronized atomically with changing the underlying mutation source, otherwise write atomicity may not hold.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	8ba6366610	row_cache: Switch to using snapshot_source Currently every time cache needs to create reader for missing data it obtains a reader which is most up to date. That reader includes writes from later populate phases, for which update() was not yet called. This will be problematic once we allow partitions to be partially populated, because different parts of the partition could be partially populated using readers using different sets of writes, and break write atomicity. The solution will be to always populate given partition using the same set of writes, using reader created from the current snapshot. The snapshot changes only on update(), with update() gradually converting each partition to the new snapshot.	2017-06-24 18:06:11 +02:00

857 Commits