scylladb

Author	SHA1	Message	Date
Tzach Livyatan	12fb975282	Fix typos in metrics description Fixes #2658 Signed-off-by: Tzach Livyatan <tzach@scylladb.com> Message-Id: <20170803121732.19640-1-tzach@scylladb.com>	2017-08-28 10:48:28 +03:00
Glauber Costa	83323e155e	database: add gate for generic async operations to column family run_with_compaction_disabled(), which is called by truncate, has a pretty large defer point in remove(). When the code gets to finally execute, we can't guarantee that the column family will still be alive. That is true in particular if we issued a drop table command following truncate: by the time truncate gets to resume, the CF will be gone. Before the column family is dropped, it will always call its stop() method, which means we have an opportunity to do some waiting there. We already wait for flushes and current compactions to end. Traditionally, we have been solving similar problems by adding a gate that will catch asynchronous operations and making sure that potentially asynchronous operations will enter the gate before executing. Let's do the same thing here. We will close() the gate during stop(). Fixes #2726 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-08-24 13:12:57 -04:00
Glauber Costa	d090e7be35	database: make sure that column family is always stopped when dropped truncate can throw exceptions. If it does, cf->stop() will never be called because it is contained in a .then clause instead of finally. One of the things that truncate does - in a finally block of its own - is initiate a final compaction. If it returns an exception nobody will wait for that compaction to finish (since cf->stop() is the one doing that) and we'll crash. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-08-24 13:01:47 -04:00
Amnon Heiman	abbd78367c	Add configuration to disable per keyspace and column family metrics The number of keyspace and column family metrics reported is proportional to the number of shards times the number of keyspaces/column families. This can cause a performance issue both on the reporting system and on the collecting system. This patch adds a configuration flag (set to false by default) to enable or disable those metrics. Fixes #2701 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20170821113843.1036-1-amnon@scylladb.com>	2017-08-22 19:19:54 +03:00
Raphael S. Carvalho	10eaa2339e	compaction: Make resharding go through compaction manager Two reasons for this change: 1) every compaction should be multiplexed to manager which in turn will make decision when to schedule. improvements on it will immediately benefit every existing compaction type. 2) active tasks metric will now track ongoing reshard jobs. Fixes #2671. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170817224334.6402-1-raphaelsc@scylladb.com>	2017-08-20 11:35:14 +03:00
Botond Dénes	e70cfc8f36	incremental_reader_selector: account for possibly disengaged lower bound In addition to the constructor (fixed previously) the check for no sstables on the first call to select() also has to be prepared for the lower bound of the range being disengaged. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <4ab1296c71814fcd492996fa36fd00fd7bbbbc7f.1502949875.git.bdenes@scylladb.com>	2017-08-17 10:07:26 +03:00
Botond Dénes	af83b7f57b	incremental_reader_selector: use lazy_deref instead of ternary operator Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <4f4b884c6a1f517bd654f3b27608d854b17a66e1.1502948635.git.bdenes@scylladb.com>	2017-08-17 08:45:46 +03:00
Avi Kivity	8df6dd1fa0	database: make incremental_reader_selector robust vs. full-range partition_range incremental_reader_selector assumes the partition_range it receives has a lower bound, but it was seen in mutation_test that this is not so. Fix by checking whether the bound exists or not. Message-Id: <20170815095852.14149-1-avi@scylladb.com>	2017-08-15 11:03:22 +01:00
Duarte Nunes	7fb6a74302	combined_mutation_reader: Drop exhausted readers if not in FF mode Exhausted readers can be fast forwarded, so we have to keep them around. However, if the current reader is not fast forwardable, then we can drop those readers and their buffers. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-08-14 14:37:27 +02:00
Botond Dénes	9ee9988097	Add combined_mutation_reader_test unit test	2017-08-10 12:38:10 +03:00
Botond Dénes	3e97a5cd6b	Remove range_sstable_reader range_sstable_reader is replaced with combined_mutation_reader, using the incremental_reader_selector.	2017-08-10 12:38:10 +03:00
Botond Dénes	bfc74f1312	Add incremental_reader_selector incremental_reader_selector is a specialization of reader_selector for the case when sstables have narrow and/or disjoint token ranges. To exploit this it creates new readers on-demand when their sstable's token range intersects with the current ring position.	2017-08-10 12:38:02 +03:00
Botond Dénes	94fc550e68	sstable_set::incremental_selector: select() now returns a selection A selection contains - in addition to the list of sstables - a next_token which is a hint as to what is the next best token to call select() with. This should be the smallest token such that at the next call to select() the least number of new sstables will be returned, without skipping any.	2017-08-09 16:27:33 +03:00
Glauber Costa	4a911879a3	add active streaming reads metric In commit `f38e4ff3f`, we have separated streaming reads from normal reads for the purpose of determining the maximum number of reads going on. However, we'll now be totally unaware of how many reads will be happening on behalf of streaming and that can be important information when debugging issues. This patch adds this metric so we don't fly blind. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <1501909973-32519-1-git-send-email-glauber@scylladb.com>	2017-08-05 11:06:37 +03:00
Avi Kivity	f38e4ff3f9	database: prevent streaming reads from blocking normal reads Streaming reads and normal reads share a semaphore, so if a bunch of streaming reads use all available slots, no normal reads can proceed. Fix by assigning streaming reads their own semaphore; they will compete with normal reads once issued, and the I/O scheduler will determine the winner. Fixes #2663. Message-Id: <20170802153107.939-1-avi@scylladb.com>	2017-08-03 10:23:01 +01:00
Avi Kivity	911536960a	database: remove streaming read queue length limit If we fail a streaming read due to queue overload, we will fail the entire repair. Remove the limit for streaming, and trust the caller (repair) to have bounded concurrency. Fixes #2659. Message-Id: <20170802143448.28311-1-avi@scylladb.com>	2017-08-03 10:21:07 +01:00
Duarte Nunes	a85232dd82	Fix compilation errors on GCC 6 GCC 6 inconsistently requires explicitly calling a member function through "this->" for lambda functions capturing "this". Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170731143755.21970-1-duarte@scylladb.com>	2017-07-31 17:40:44 +03:00
Avi Kivity	3fe6731436	Merge "Reduce the effect of the latency metrics" from Amnon "This series reduces that effect in two ways: 1. Remove the latency counters from the system keyspaces 2. Reduce the histogram size by limiting the maximum number of buckets and stop the last bucket." Fixes #2650. * 'amnon/remove_cf_latency_v2' of github.com:cloudius-systems/seastar-dev: database: remove latency from the system table estimated histogram: return a smaller histogram	2017-07-31 15:58:30 +03:00
Duarte Nunes	c81431ad16	column_family: Re-acquire flush permit in case of error If we fail to flush an sstable, after creating the flush_reader, then we will have released the flush permit when we retry the flush. Ensure that when retrying, we re-acquire the flush permit. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Duarte Nunes	9162e016da	column_family: Don't hold sstable read lock when retrying flush Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Duarte Nunes	1a33cc6847	sstables: Release the flush permit before fsyncing This allows a queued flush to start while we fsync the current sstable, which helps reduce the overall time new writes are blocked on dirty memory. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Duarte Nunes	a2b732c156	dirty_memory_manager: Refactor flush permit lifetime management This patch refactors how the flush permit lifetime is managed, dropping the current hash table in favour of a RAII approach. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Duarte Nunes	f647f5b14a	dirty_memory_manager: Invert permit acquisition order For an upcoming fix it is required to invert the permit acquisition order: first we acquire the background work permit and then the single flush permit. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Duarte Nunes	e371accac8	memtable_list: Register different seal functions for each behaviour Instead of passing a flush_behaviour to the seal function, use two different functions for each of the behaviours. This will be important in the forthcoming patches, which will require the signatures of those functions to differ. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Avi Kivity	e855a28fae	Revert "Merge "memtable flush: Fixes and improvements" from Duarte" This reverts commit `733a64a1df`, reversing changes made to `e11e66723a`. Breaks sstable_test and perf_fast_forward.	2017-07-31 12:44:28 +03:00
Duarte Nunes	0f1bd81523	column_family: Re-acquire flush permit in case of error If we fail to flush an sstable, after creating the flush_reader, then we will have released the flush permit when we retry the flush. Ensure that when retrying, we re-acquire the flush permit. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Duarte Nunes	2f4cffc7f6	column_family: Don't hold sstable read lock when retrying flush Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Duarte Nunes	5e64839e85	sstables: Release the flush permit before fsyncing This allows a queued flush to start while we fsync the current sstable, which helps reduce the overall time new writes are blocked on dirty memory. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Duarte Nunes	ef1275e9dd	dirty_memory_manager: Refactor flush permit lifetime management This patch refactors how the flush permit lifetime is managed, dropping the current hash table in favour of a RAII approach. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Duarte Nunes	cfc8fae33f	dirty_memory_manager: Invert permit acquisition order For an upcoming fix it is required to invert the permit acquisition order: first we acquire the background work permit and then the single flush permit. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Duarte Nunes	7e68e4677d	memtable_list: Register different seal functions for each behaviour Instead of passing a flush_behaviour to the seal function, use two different functions for each of the behaviours. This will be important in the forthcoming patches, which will require the signatures of those functions to differ. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Amnon Heiman	a71b9e498a	database: remove latency from the system table This patch removes the latency histograms from the system table; it also extends the already existing exclusion to all system keyspaces. It also uses the new get_histogram API to set a minimal bucket size of 100 microseconds.	2017-07-27 11:41:15 +03:00
Paweł Dziepak	295689d16f	db: include counter writes on leader in metrics Counters write path on leader is completely different than on any other replica (non-leaders share write path between counters and regular columns). This patch makes sure that counter writes performed on leader are added to appropriate metrics. Message-Id: <20170725153346.31238-1-pdziepak@scylladb.com>	2017-07-25 18:31:43 +02:00
Raphael S. Carvalho	637f3bfa50	db: refresh row cache's underlying data source after compaction Underlying data source in row cache holds a reference to sstable set prior to compaction which isn't released until a memtable flush, which means file descriptors of deleted sstables remain open, wasting disk space. The fix is to refresh underlying data source in row cache. Fixes #2570. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-24 15:49:11 -03:00
Raphael S. Carvalho	e3ad676433	db: atomically synchronize cache with changes to the snapshot updates to cache and snapshot (i.e. sstable set) aren't synchronized, so it may happen that cache update for memtable flush will use wrong snapshot version, and that violates cache invariant of each partition entry only reflecting one snapshot. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-24 15:45:05 -03:00
Tomasz Grabiec	714d609605	database: Fix reversed order of keyspace and table names in a log message Message-Id: <1500649623-25377-1-git-send-email-tgrabiec@scylladb.com>	2017-07-21 17:10:17 +02:00
Tomasz Grabiec	408cea66cd	database: Allow disabling auto snapshots during drop/truncate Message-Id: <1500573920-31478-1-git-send-email-tgrabiec@scylladb.com>	2017-07-21 16:56:29 +02:00
Avi Kivity	c5ee62a6a4	Merge "restrict background writers with scheduling groups" from Glauber "This patchset restricts background writers - such as compactions, streaming flushes and memtable flushes - to a maximum amount of CPU usage through a seastar::thread_scheduling_group. The said maximum is recommended to be set to 50 % - it is disabled by default, but can be adjusted through a configuration option until we are able to auto-tune this. The second patch in this series provides a preview of what such auto-tuning would look like. By implementing a simple controller we automatically adjust the quota for the memtable writer processes, so that the rate at which bytes come in is equal to the rate at which bytes are flushed. Tail latencies are greatly reduced by this series, and heavy spikes that previously appeared on CPU-bound workloads are no more." * 'memtable-controller-v5' of https://github.com/glommer/scylla: simple controller for memtable/streaming writer shares. restrict background writers to 50 % of CPU.	2017-07-20 10:58:53 +03:00
Calle Wilund	247c36e048	system_schema: Fix remaining places not handling two system keyspaces Some places remained where code looked directly at system_keyspace::NAME to determine iff a ks is considered special/system/protected. Including schema digest calculation. Export "is_system_keyspace" and use accordingly. Message-Id: <1500469809-23546-1-git-send-email-calle@scylladb.com>	2017-07-19 16:18:45 +03:00
Glauber Costa	c9a529ebee	simple controller for memtable/streaming writer shares. This patch introduces a simple controller that will adjust memtables CPU shares, trying to keep it around the soft limit: if we start going below it means we're too fast (unless we are idle) and shares are adjusted downwards. If we start going above it means we're too slow and shares are adjusted upwards. I have tested this extensively in a single-CPU setup with various CPU-bound workloads while tracking virtual dirty and the results are good, with virtual dirty fluctuating only slightly, somewhere within the desired range. Exceptions to this are: 1) when the load is very light - the idle system goes faster, and that's ok 2) when the load is very high - as foreground requests dominate we can't flush fast enough and hit the hard limit. However, in such scenarios the memtable shares do hit their maximum, and the results are no worse than they are right now and this will only be fixed by CPU-limiting the actual requests. This feature can be disabled with a config option - that is scheduled to go away as we acquire more confidence in this. When the feature is disabled, all background writers (streaming, compaction, memtables) will share the same scheduling group, with static quotas. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-07-18 23:35:47 -04:00
Glauber Costa	4f01ec0910	restrict background writers to 50 % of CPU. In scylla, we have foreground processes, which are latency sensitive and need to be responded to as fast as possible in order to maintain good latency profiles, and background processes, which are less so. The most important background processes we have during normal write workload operations are memtable writes and sstable compactions. Those processes are quite CPU-intensive, and left unchecked will easily dominate the CPU. Lower values of task-quota usually help, as it will force those processes to preempt more, but aren't enough to guarantee good isolation. We have seen boxes with good NVMe storage having their throughput reduced to less than half of the original baseline in a short dive down for the duration of a compaction. In the long run, our goal is to leverage the CPU scheduler to make sure that those processes are balanced with respect to all the others. However, the current state of affairs is causing grievances at this very moment. Thankfully, those processes live in a seastar::thread, that ships with its own rudimentary bandwidth control mechanism: the scheduling group. The goal of this patch is to wrap background processes together in a scheduling group, and assign to such group 50 % of our CPU power; the remainder being left to foreground processes. While we pride ourselves in dynamically adjusting things to the workload, we won't be able to do this properly before the CPU scheduler lands - and let's face it, leaving background processes run wild is not adaptive either. Every workload would benefit most from a different value for such shares, but 50 % is as fair as it gets if we really need static partitioning in the mean time. As a defense against unforeseen consequences, we'll leave the actual value as an option, but will do our best to hide it - as this is not a tunable that we want to be part of a normal Scylla setup. The most convenient place for this tunable is still db::config, so we can easily pass it down to the database layer - but we will not document it in the yaml, and will clearly note in the help string that it is not supposed to be tuned. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-07-18 23:35:33 -04:00
Duarte Nunes	2c711922cc	database: Drop mutations that raced with truncate Mutations that race with a truncate can just be dropped. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-16 00:08:05 +02:00
Duarte Nunes	0825c9c805	database: Rename replay_position_reordered_exception Rename replay_position_reordered_exception to mutation_reordered_with_truncate_exception for more precision, since this is the only situation where this exception can be thrown. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-16 00:08:05 +02:00
Duarte Nunes	5f24e9a4a5	memtable: Stop tracking the highest flushed rp Since we no longer enforce that mutations are applied in memory ordered by their replay_positions, the way the highest_flush_rp is being tracked is no longer correct. The invariant it was used to maintain no longer exists, so we can get rid of it together with the assertion on the highest_flush_rp on flush(). Fixes #2074 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:56:06 +02:00
Duarte Nunes	003941cd95	column_family: Stop using flush_queue Since commitlog ordering requirements have been relaxed, we now keep the set of replay_positions seen by a memtable in a set, which we then use to clean up relevant segments in the commitlog. This means that the guarantees provided by the flush_queue are no longer necessary. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:56:00 +02:00
Duarte Nunes	7e6fe5895e	column_family: Don't bother closing the flush_queue on stop() When stopping a column family we issue a flush(), for which we wait. Since writes are supposed to have stopped coming in, and also new flush requests, there's no need to call and wait for the flush_queue to be closed. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:58 +02:00
Duarte Nunes	a1f4536ffb	column_family: Don't rely on flush_queue to guarantee flushes finished We now don't ensure mutations are applied in memory following the order of their replay positions, so we can't rely on the replay position to order memtable flushes. So, use a phased_barrier() to ensure that calling flush() returns a future that completes when all flushes up to that point have finished. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:58 +02:00
Duarte Nunes	1b320496e2	dirty_memory_manager: Remove unnecessary check from flush_one() We don't need to check whether a memtable is empty in flush_one(), as that must be checked later, during the actual sealing. The condition itself is rare and is checked already after the potentially contended semaphore has been acquired. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:57 +02:00
Duarte Nunes	59bdaed02b	column_family: More precise count of pending flushes This patch ensures we update the count of pending flushes in the same place as we update the stats across column families, which is more correct since it only accounts for actual flushes and not those of empty memtables or that have been coalesced together. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:25 +02:00
Duarte Nunes	3e27c335a9	column_family: Fix typo in pending_tasks metric name Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:25 +02:00
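
Commit 83323e155e describes guarding asynchronous column family operations with a gate that stop() closes. The following is only a rough illustration of that idea in plain standard C++, not Seastar's gate and not the code from the commit. The names simple_gate, gate_holder and toy_column_family are hypothetical, and a blocking mutex and condition variable stand in for futures.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <stdexcept>

// Hypothetical stand-in for a gate: counts in-flight operations and
// lets close() wait until all of them have left.
class simple_gate {
    std::mutex _mutex;
    std::condition_variable _drained;
    std::size_t _in_flight = 0;
    bool _closed = false;

public:
    // Register an operation; refuse once close() has been called.
    void enter() {
        std::lock_guard<std::mutex> lock(_mutex);
        if (_closed) {
            throw std::runtime_error("gate closed");
        }
        ++_in_flight;
    }

    // Unregister an operation and wake a pending close() when drained.
    void leave() {
        std::lock_guard<std::mutex> lock(_mutex);
        assert(_in_flight > 0);
        if (--_in_flight == 0) {
            _drained.notify_all();
        }
    }

    // Reject new entries, then block until in-flight operations have left.
    void close() {
        std::unique_lock<std::mutex> lock(_mutex);
        _closed = true;
        _drained.wait(lock, [this] { return _in_flight == 0; });
    }

    std::size_t in_flight() {
        std::lock_guard<std::mutex> lock(_mutex);
        return _in_flight;
    }
};

// RAII helper so every exit path of an operation leaves the gate.
class gate_holder {
    simple_gate& _gate;

public:
    explicit gate_holder(simple_gate& gate) : _gate(gate) { _gate.enter(); }
    ~gate_holder() { _gate.leave(); }
    gate_holder(const gate_holder&) = delete;
    gate_holder& operator=(const gate_holder&) = delete;
};

// Toy object: truncate() runs inside the gate, stop() closes it.
struct toy_column_family {
    simple_gate async_gate;
    int truncations = 0;

    void truncate() {
        gate_holder hold(async_gate);  // throws if stop() already ran
        ++truncations;
    }

    void stop() { async_gate.close(); }
};
```

In this sketch a truncate() that starts after stop() fails with an exception instead of touching an object that is being torn down, and stop() does not return while a truncate() is still inside the gate.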


873 Commits
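
Commit c9a529ebee describes a controller that raises memtable writer shares when virtual dirty memory rises above a soft limit and lowers them when it falls below. A minimal sketch of one way such a rule could look is given here, assuming a purely proportional adjustment clamped to a fixed range. The struct share_controller, its fields and the gain value are hypothetical; the commit message does not state the actual control law.

```cpp
#include <algorithm>

// Hypothetical proportional controller; the real control law in the
// commit is not described in the log, so this is an assumption.
struct share_controller {
    double soft_limit;  // target amount of virtual dirty memory
    double min_shares;  // lower bound for the writer's shares
    double max_shares;  // upper bound for the writer's shares
    double gain;        // shares added per unit of dirty memory above target

    // Above the soft limit the writer is too slow, so shares go up;
    // below it the writer is too fast, so shares go down.
    double adjust(double current_shares, double virtual_dirty) const {
        const double error = virtual_dirty - soft_limit;
        const double proposed = current_shares + gain * error;
        return std::max(min_shares, std::min(max_shares, proposed));
    }
};
```

With soft_limit = 100, gain = 0.5 and shares bounded to [1, 100], a writer at 50 shares moves to 60 when virtual dirty is 120 and to 40 when it is 80. The clamp mirrors the commit's remark that under very high load the shares simply hit their maximum.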