scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-28 12:17:02 +00:00

Author	SHA1	Message	Date
Glauber Costa	956af9f099	database, main: set up scheduling_groups for our main tasks Set up scheduling groups for streaming, compaction, memtable flush, query, and commitlog. The background writer scheduling group is retired; it is split into the memtable flush and compaction groups. Comments from Glauber: This patch is based on a patch from Avi with the same subject, but the differences are significant enough that I reset authorship. In particular: 1) A bug/regression is fixed with the boundary calculations for the memtable controller sampling function. 2) A leftover is removed, where after flushing a memtable we would go back to the main group before going to the cache group again. 3) As per Tomek's suggestion, the submission of compactions is now itself run in the compaction scheduling group. Having that working is what changes this patch the most: we now store the scheduling group in the compaction manager and let the compaction manager itself enforce the scheduling group. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-02-07 17:19:29 -05:00
Avi Kivity	641aaba12c	database, sstables, compaction: convert use of thread_scheduling_group to seastar cpu scheduler thread_scheduling_groups are converted to plain scheduling_group. Due to differences in initialization (scheduling_group initialization defers), we create the scheduling_groups in main.cc and propagate them to users via a new class database_config. The sstable writer loses its thread_scheduling_group parameter and instead inherits scheduling from its caller. Since shares are in the 1-1000 range vs. 0-1 for thread scheduling quotas, the flush controller was adjusted to return values within the higher ranges.	2018-02-07 17:19:29 -05:00
Avi Kivity	72c673fcc3	Merge "I/O Controller for memtables and compactions" from Glauber "This patchset implements the compaction controller for I/O shares. The goal is to automatically adjust compaction shares based on a strategy-specific backlog. A higher backlog will translate into higher shares. As compaction progresses, that reduces the backlog. As new data is flushed, that increases the backlog. The goal of the controller is to keep the backlog constant at a certain rate, so that we go neither too fast nor too slow. Tracking reads and writes: ========================== Tracking of reads and writes happens through the read_monitor and the write_monitor. The write monitor is an existing interface that has the purpose of releasing the write permit at particular points of the write process. We enhance it to get a reference to an instance that tracks the current offset inside the sstables::file_writer. This way the backlog tracker can always know for sure what the offset of the current write is. A similar thing is done for reads. The data_consumer already tracks the position of the current read, and we isolate that into a structure to which we can get a reference. A read_monitor allows us to connect the compaction to that reference. Lifetime management: ==================== In general, tracking objects will be owned by their callers and passed down as references. The compaction object will own the read monitors and the compaction write monitors and the memtable flush write monitor will be kept alive in a do_with block around the flush itself. The backlog_{write,read}_progress_manager needs to be kept alive until the SSTable is no longer in progress. For writes, that means until we are able to add the SSTable charges in full, and for reads (compaction) that means until we are able to remove the charges in full. It is important to do that to avoid spikes in the graph. 
If we remove the progress managers in a different operation than updating the SSTable list we will be left in a temporary state where charges appear or disappear abruptly, to be fixed when the final add_sstable/remove_sstable happens. So we want those things to happen together. The compaction_backlog_tracker is kept alive until the strategy changes, for example, through ALTER TABLE. Current charges are transferred to the new strategy's compaction_backlog_tracker object when we do that. If the type of strategy changes, the current read charges are forgotten. We can do that because those running compactions will not really contribute to decreasing the backlog of the new compaction strategy. Transfer of Charges ================== When ALTER TABLE happens, we need to transfer ongoing writes to the new backlog manager. Ongoing reads will still be tracked by the backlog_manager that originated them. The rationale for that is that reads still belong to the current compaction, with the strategy that generated them. But new Tables being written will add to the backlog of the new strategy. Note that ALTER TABLE operations do not necessarily cause a change of Strategy. We can be using the same strategy but just changing properties. If that is the case, we expect no discontinuity in the backlog graph (tested). Resharding ========== Resharding compactions are more complex than normal compactions because the SSTables are created in one shard and later sent to another shard. It is better, then, to track resharding compactions separately and let them have their own backlog tracker, which will insert backlog in proportion to the amount of data to be resharded. Memtable Flush I/O Controller ============================= With the current infrastructure it becomes trivial to add a new controller, for either I/O or CPU. This patchset then adds an I/O controller for memtable flushes, using the same backlog algorithm that we already used for CPU." 
* 'compaction-controller-io-v5' of github.com:glommer/scylla: database: add a controller for I/O on memtable flushes. document the compaction controller compaction: adjust shares for compactions backlog_controllers: implement generic I/O controller factor out some of the controller code io shares: multiply all shares by 10 compaction_strategy: implement backlog manager for the SizeTiered strategy infrastructure for backlog estimator for compaction work. sstables: notify about end of data component write sstables: add read_monitor_generator sstables: add read_monitor sstables: enhance data consumer with a position tracker sstables: enhance the file_writer with an offset tracker sstables: pass references instead of pointers for write_monitor compaction: control destruction of readers	2018-01-07 15:00:10 +02:00
Raphael S. Carvalho	e29b598c5f	sstables: make compaction_descriptor's ctor explicit to avoid bad conversion perf sstable used old sstables::compact_sstables() interface and still compiled due to bad implicit conversion. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20180103041900.21186-1-raphaelsc@scylladb.com>	2018-01-03 12:37:12 +02:00
Glauber Costa	ca284174d0	infrastructure for backlog estimator for compaction work. This patch adds infrastructure in various points in the system to allow us to determine the amount of work present as backlog from compactions. What needs to be done can be explained in three major pieces: 1) Add hooks in the points where sstables are added or inserted to a column family (or more precisely, to a compaction_strategy object). 2) Add hooks in read and write monitors that allow a compaction backlog estimator (tracker) to become aware of bytes that are partially written and compacted away. 3) Add a per-column family class (compaction_backlog_tracker) that can be used to track work that is done and relevant to compactions (like the two above), and a compaction manager to provide a system-wide backlog based on the response of the individual trackers. The definition of how much backlog one has is strategy-specific. The Null strategy is easy, as it never really has any backlog, and so is the major strategy - since what really matters is the backlog of the underlying compaction strategy. Although backlogs are strategy-specific, they should be "compatible", in the sense that if a particular strategy has more work to do, it should yield a higher number than its counterparts. All the others are presented in this patch as unimplemented: they will always advertise a mild backlog that should yield a constant CPU-utilization if used alone. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:07 -05:00
Raphael S. Carvalho	eff62bc61e	compaction_manager: serialize compaction of same size tier for different cfs Currently, compaction manager will serialize compaction of same size tier (or weight) if they belong to the same column family. However, it fails to do so if the compaction jobs belong to different column families. That can lead to an ungodly number of running compactions, which gets worse the higher the number of shards and active column families. The problem is that it may affect overall system performance due to excessive resource usage. It's easy to trigger it during bootstrapping after loading node with new sstables or repairing, or if lots of cfs are being actively written. That being said, compaction jobs of same size tier are now serialized on a given shard, such that maximum number of compactions (system-wide) is now: (SHARDS) * (SIZE TIERS) instead of: (SHARDS) * (COLUMN FAMILIES) * (SIZE TIERS) We'll work hard to release a size tier (weight) for a column family waiting on it as fast as possible, given that we wouldn't like to underutilize resources available for compaction. We want one starting after the other. Compaction for a column family that cannot run now because the size tier is taken, will be postponed. There's a worker that will be sleeping on a condition variable that will be signalled whenever a compaction completes. FIFO ordering is used on postponed list for fairness. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-17 17:42:48 -02:00
Raphael S. Carvalho	49f3cfe746	sstables: improve compact_sstables() interface Motivation is that a new field in the descriptor will be forwarded to compaction procedure without extending parameter list even more. Also beautifies the interface, making it concise and easier to play with. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-17 17:22:19 -02:00
Raphael S. Carvalho	d2ab154f12	sstables: switch to const ref wherever possible Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-06 19:52:33 -02:00
Raphael S. Carvalho	d916c8cdad	sstables: use gc_clock::time_point for gc_before Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-06 19:52:33 -02:00
Raphael S. Carvalho	45c11865fa	sstables: change return value type of get_fully_expired_sstables unordered_set will allow us to quickly extract fully expired tables from a set of compacting sstables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-06 18:45:55 -02:00
Raphael S. Carvalho	cb6d060d8e	compaction: make size_tiered_most_interesting_bucket static method of stcs class Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-11-14 13:24:03 -02:00
Raphael S. Carvalho	e34c1db642	db: update compaction history outside the sstable write lock The reason to do that is because compaction can deadlock if refresh disables write which waits for compaction, and compaction in turn waits for dirty memory[1] that would be released by memtable write. Dirty memory manager for non-system cfs was being used for system cfs, which was useful for exposing this problem. [1]: when updating compaction history. Fixes #2769. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170918215238.9810-2-raphaelsc@scylladb.com>	2017-09-26 19:51:12 +02:00
Avi Kivity	578bf55371	sstables: reduce dependencies Use shared_sstable.hh instead of sstables.hh.	2017-09-12 10:43:36 +03:00
Avi Kivity	0efa444a56	compaction.hh: add missing includes	2017-09-12 10:42:45 +03:00
Glauber Costa	4f01ec0910	restrict background writers to 50 % of CPU. In scylla, we have foreground processes, which are latency sensitive and need to be responded to as fast as possible in order to maintain good latency profiles, and background processes, which are less so. The most important background processes we have during normal write workload operations are memtable writes and sstable compactions. Those processes are quite CPU-intensive, and left unchecked will easily dominate the CPU. Lower values of task-quota usually help, as it will force those processes to preempt more, but aren't enough to guarantee good isolation. We have seen boxes with good NVMe storage having their throughput reduced to less than half of the original baseline in a short dive down for the duration of a compaction. In the long run, our goal is to leverage the CPU scheduler to make sure that those processes are balanced with respect to all the others. However, the current state of affairs is causing grievances at this very moment. Thankfully, those processes live in a seastar::thread, that ships with its own rudimentary bandwidth control mechanism: the scheduling group. The goal of this patch is to wrap background processes together in a scheduling group, and assign to such group 50 % of our CPU power; the remainder being left to foreground processes. While we pride ourselves in dynamically adjusting things to the workload, we won't be able to do this properly before the CPU scheduler lands - and let's face it, leaving background processes run wild is not adaptive either. Every workload would benefit most from a different value for such shares, but 50 % is as fair as it gets if we really need static partitioning in the meantime. As a defense against unforeseen consequences, we'll leave the actual value as an option, but will do our best to hide it - as this is not a tunable that we want to be part of a normal Scylla setup. 
The most convenient place for this tunable is still db::config, so we can easily pass it down to the database layer - but we will not document it in the yaml, and will clearly note in the help string that it is not supposed to be tuned. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-07-18 23:35:33 -04:00
Raphael S. Carvalho	b350352e6c	compaction: keep only one variant of size_tiered_most_interesting_bucket two variants of size_tiered_most_interesting_bucket existed to avoid copy, but subsequent work will make lcs use vector for each level of sstables, so let's only keep one variant. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-04 03:34:51 -03:00
Raphael S. Carvalho	4351e0a996	compaction: introduce new compaction type for reshard so now user can look at nodetool compactionstats and determine whether or not resharding is running, for example: $ ./bin/nodetool compactionstats pending tasks: 3 id compaction type keyspace table completed total unit progress <none> RESHARD system compaction_history 11 256 keys 4.30% <none> RESHARD system compaction_history 2 256 keys 0.78% <none> RESHARD system compaction_history 10 256 keys 3.91% <none> RESHARD system compaction_history 8 256 keys 3.12% <none> RESHARD system compaction_history 7 256 keys 2.73% Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170620175733.25882-1-raphaelsc@scylladb.com>	2017-06-22 14:48:38 +03:00
Raphael S. Carvalho	61229ab88c	compaction: fix type for cleanup After compaction revamp, compaction type set by cleanup at its ctor is being overwritten at compaction::setup(). Consequently, cleanup would not be stopped by 'nodetool stop cleanup' and cleanup would be listed as regular compaction in 'nodetool compactionstats'. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170504013046.23522-1-raphaelsc@scylladb.com>	2017-05-04 11:48:50 +03:00
Raphael S. Carvalho	13477075e2	compaction_strategy: implement resharding strategy for compaction strategies Strategies other than leveled will reshard one shared sstable at a time, and the target shard, shard at which job will run, for each job will be chosen in a round-robin fashion. For leveled strategy, we will reshard together smp::count adjacent sstables that belong to same level. The reason for that is because resharding one sstable at a time may result in creation of file for each shard, meaning after resharding we could end up with NO_SSTABLES*NO_SHARDS. These resharding strategies will be used for our new resharding algorithm. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:24 -03:00
Raphael S. Carvalho	0127309820	sstables: extend compaction for new resharding Extends compaction for new resharding algorithm. Not wired yet. New resharding will compact shared sstable(s) and create one sstable for each owner. It's up to the caller to open these new unshared sstables at their respective column families. This new approach will save a lot of bandwidth because we'll no longer read the entire shared sstable #smp::count times. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:08 -03:00
Raphael S. Carvalho	e9076f39be	compaction: implement function to get fully expired sstables Strongly based on org.apache.cassandra.db.compaction. CompactionController.getFullyExpiredSSTables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 02:11:47 -03:00
Raphael S. Carvalho	80d8c5ef6f	compaction: use proper type in constructor Correctness is not affected due to long type, but an unsigned long type should be definitely used instead. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <d3ab15a3206306de195aeb3d78f9b5bc4ca9208e.1465908970.git.raphaelsc@scylladb.com>	2016-06-14 17:02:32 +03:00
Raphael S. Carvalho	0b2cd41daf	database: remember sstable level when cleaning it up Cleanup operation wasn't preserving level of sstables. That will have a bad impact on performance because compaction work is lost. Fixes #1317. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <35ce8fbbb4590725bb0414e6a5450fcbe6cb7212.1465843387.git.raphaelsc@scylladb.com>	2016-06-14 08:06:00 +03:00
Raphael S. Carvalho	8fe7524e46	sstables: enable leveled strategy feature to prevent L0 from falling behind If level 0 falls behind, size tiered strategy is used on it to reduce overhead until we can catch up on the higher levels. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <17bf15b7d12cd5dc652cc92939c0c68f921662a2.1459976469.git.raphaelsc@scylladb.com>	2016-04-11 11:52:00 +03:00
Pekka Enberg	38a54df863	Fix pre-ScyllaDB copyright statements People keep tripping over the old copyrights and copy-pasting them to new files. Search and replace "Cloudius Systems" with "ScyllaDB". Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>	2016-04-08 08:12:47 +03:00
Raphael S. Carvalho	ed61fe5831	sstables: make compaction stop report user-friendly When scylla stopped an ongoing compaction, the event was reported as an error. This patch introduces a specialized exception for compaction stop so that the event can be handled appropriately. Before: ERROR [shard 0] compaction_manager - compaction failed: read exception: std::runtime_error (Compaction for keyspace1/standard1 was deliberately stopped.) After: INFO [shard 0] compaction_manager - compaction info: Compaction for keyspace1/standard1 was stopped due to shutdown. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <1f85d4e5c24d23a1b4e7e0370a2cffc97cbc6d44.1455034236.git.raphaelsc@scylladb.com>	2016-02-11 12:16:53 +02:00
Raphael S. Carvalho	a46aa47ab1	make sstables::compact_sstables return list of created sstables Now, sstables::compact_sstables() receives as input a list of sstables to be compacted, and outputs a list of sstables generated by compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <0d8397f0395ce560a7c83cccf6e897a7f464d030.1454110234.git.raphaelsc@scylladb.com>	2016-01-31 12:39:20 +02:00
Raphael S. Carvalho	ba4260ea8f	api: print proper compaction type There are several compaction types, and we should print the correct one when listing ongoing compaction. Currently, we only support compaction types: COMPACTION and CLEANUP. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <c96b1508a8216bf5405b1a0b0f8489d5cc4be844.1453851299.git.raphaelsc@scylladb.com>	2016-01-28 13:47:00 +02:00
Raphael S. Carvalho	3bd240d9e8	compaction: add ability to stop an ongoing compaction That's needed for nodetool stop, which is called to stop all ongoing compaction. The implementation is about informing an ongoing compaction that it was asked to stop, so the compaction itself will trigger an exception. Compaction manager will catch this exception and re-schedule the compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-01-19 23:15:18 -02:00
Raphael S. Carvalho	ec4c73d451	compaction: rename compaction_stats to compaction_info compaction_info makes more sense because this structure doesn't only store stats about ongoing compaction. Soon, we will add information to it about whether or not a user asked to stop the respective ongoing compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-01-19 23:15:18 -02:00
Raphael S. Carvalho	ed80ed82ef	sstables: prepare compact_sstables to work with cleanup Cleanup is about rewriting a sstable discarding any keys that are irrelevant, i.e. keys that don't belong to current node. Parameter cleanup was added to compact_sstables. If set to true, irrelevant code such as the one that updates compaction history will be skipped. Logic was also added to discard irrelevant keys. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-01-11 21:43:40 -02:00
Raphael S. Carvalho	1fba394dd0	sstables: store keyspace and cf in compaction_stats The reason behind this change is that we will need ks and cf for the compaction stats API. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2015-12-15 09:50:02 -02:00
Raphael S. Carvalho	ac1a67c8bc	sstables: move compaction_stats to header file Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2015-12-15 09:49:45 -02:00
Raphael S. Carvalho	35b75e9b67	adapt compaction procedure to support leveled strategy Adapt our compaction code to start writing a new sstable if the one being written reached its maximum size. Leveled strategy works with that concept. If a strategy other than leveled is being used, everything will work as before. Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>	2015-10-16 01:54:52 -03:00
Raphael S. Carvalho	fbec0d0254	convert LeveledManifest to C++ Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>	2015-10-16 01:46:15 -03:00
Avi Kivity	d5cf0fb2b1	Add license notices	2015-09-20 10:43:39 +03:00
Paweł Dziepak	969fe6b878	sstables: make compact_sstables() take ref to column_family Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>	2015-09-07 21:20:32 +02:00
Raphael S. Carvalho	8faa202e98	sstables: add function to return candidates using size-tiered strategy That's helpful for the purpose of testing, and leveled compaction may also end up using size-tiered compaction strategy for selecting candidates. Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>	2015-07-20 12:27:33 -03:00
Nadav Har'El	0b297b9f6c	sstable compaction: simplify compact_sstables() function Instead of requiring the user to subclass a "sstable_creator" class to specify how to create a new sstable (or in the future, several of them), switch to an std::function. In practice, it is much easier to specify a lambda than a class, especially since C++11 made it easy to capture variables into lambdas - but not into local classes. The "commit()" function is also unnecessary. The intention there was to provide a function to "commit" the new sstables (i.e., rename them). But the caller doesn't need to supply this function - it can just wait for the future of the end of compaction, and run its own commit code right then. Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>	2015-06-24 16:44:11 +03:00
Nadav Har'El	f26dae3bf9	sstable: basic compaction function This patch adds the basic compaction function sstables::compact_sstables, which takes a list of input sstables, and creates several (currently one) merged sstable. This implementation is pretty simple once we have all the infrastructure in place (combining reader, writer, and a pipe between them to reduce context switches). This is already working compaction, but not quite complete: We'll need to add compaction strategies (which sstables to compact, and when), better cardinality estimator, sstable management and renaming, and a lot of other details, and we'll probably still need to change the API. But we can already write a test for compacting existing sstables (see the next patch), and I wanted to get this patch out of the way, so we can start working on applying compaction in a real use case. Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>	2015-06-23 09:48:58 +03:00
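
Commit eff62bc61e in the table above describes serializing compaction jobs of the same size tier (weight) per shard, postponing jobs whose tier is taken, and retrying them in FIFO order when a compaction completes. The following is only an illustrative sketch of that idea in Python, not Scylla's C++ implementation: the class `ShardCompactionSerializer`, the helper `size_tier`, and the choice of floor(log4(bytes)) as the weight are all assumptions made for this example, and the condition-variable worker mentioned in the commit is simplified to a direct retry inside `complete()`.

```python
from collections import deque


def size_tier(total_bytes):
    """Hypothetical weight: floor(log4(total_bytes)), computed with integers."""
    return (max(total_bytes, 1).bit_length() - 1) // 2


def max_concurrent(shards, size_tiers, column_families=1):
    """Upper bound on running compactions.

    With per-shard serialization the bound is shards * size_tiers;
    pass column_families to get the bound the commit says applied before.
    """
    return shards * column_families * size_tiers


class ShardCompactionSerializer:
    """Allows at most one running compaction per size tier on one shard."""

    def __init__(self):
        self._busy_tiers = set()    # tiers that currently have a running job
        self._postponed = deque()   # FIFO list of (table, total_bytes)
        self.running = []           # (table, tier) pairs, in start order

    def submit(self, table, total_bytes):
        """Start the job if its tier is free, otherwise postpone it."""
        tier = size_tier(total_bytes)
        if tier in self._busy_tiers:
            self._postponed.append((table, total_bytes))
            return False
        self._busy_tiers.add(tier)
        self.running.append((table, tier))
        return True

    def complete(self, table, tier):
        """Release the tier, then retry postponed jobs in FIFO order."""
        self.running.remove((table, tier))
        self._busy_tiers.discard(tier)
        pending, self._postponed = self._postponed, deque()
        for postponed_table, postponed_bytes in pending:
            # Jobs whose tier is still busy are re-queued in the same order.
            self.submit(postponed_table, postponed_bytes)
```

For example, under these assumptions two jobs of roughly 1 KB from different column families fall in the same tier, so the second is postponed until the first completes, while a 100 KB job in another tier starts immediately.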

40 Commits