scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-01 12:36:56 +00:00

Author	SHA1	Message	Date
Botond Dénes	94fc550e68	sstable_set::incremental_selector: select() now returns a selection A selection contains - in addition to the list of sstables - a next_token which is a hint as to what is the next best token to call select() with. This should be the smallest token such that at the next call to select() the least number of new sstables will be returned, without skipping any.	2017-08-09 16:27:33 +03:00
Glauber Costa	4f01ec0910	restrict background writers to 50 % of CPU. In scylla, we have foreground processes, which are latency sensitive and need to be responded to as fast as possible in order to maintain good latency profiles, and background processes, which are less so. The most important background processes we have during normal write workload operations are memtable writes and sstable compactions. Those processes are quite CPU-intensive, and left unchecked will easily dominate the CPU. Lower values of task-quota usually help, as it will force those processes to preempt more, but aren't enough to guarantee good isolation. We have seen boxes with good NVMe storage having their throughput reduced to less than half of the original baseline in a short dive down for the duration of a compaction. In the long run, our goal is to leverage the CPU scheduler to make sure that those processes are balanced with respect to all the others. However, the current state of affairs is causing grievances at this very moment. Thankfully, those processes live in a seastar::thread, that ships with its own rudimentary bandwidth control mechanism: the scheduling group. The goal of this patch is to wrap background processes together in a scheduling group, and assign to such group 50 % of our CPU power; the remainder being left to foreground processes. While we pride ourselves in dynamically adjusting things to the workload, we won't be able to do this properly before the CPU scheduler lands - and let's face it, leaving background processes run wild is not adaptive either. Every workload would benefit most from a different value for such shares, but 50 % is as fair as it gets if we really need static partitioning in the meantime. As a defense against unforeseen consequences, we'll leave the actual value as an option, but will do our best to hide it - as this is not a tunable that we want to be part of a normal Scylla setup. The most convenient place for this tunable is still db::config, so we can easily pass it down to the database layer - but we will not document it in the yaml, and will clearly note in the help string that it is not supposed to be tuned. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-07-18 23:35:33 -04:00
Raphael S. Carvalho	4351e0a996	compaction: introduce new compaction type for reshard so now user can look at nodetool compactionstats and determine whether or not resharding is running, for example: $ ./bin/nodetool compactionstats pending tasks: 3 id compaction type keyspace table completed total unit progress <none> RESHARD system compaction_history 11 256 keys 4.30% <none> RESHARD system compaction_history 2 256 keys 0.78% <none> RESHARD system compaction_history 10 256 keys 3.91% <none> RESHARD system compaction_history 8 256 keys 3.12% <none> RESHARD system compaction_history 7 256 keys 2.73% Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170620175733.25882-1-raphaelsc@scylladb.com>	2017-06-22 14:48:38 +03:00
Raphael S. Carvalho	41137c7fb6	compaction: use sstable::bytes_on_disk for calculating start and end size Currently, start and end size of compaction are calculated using the uncompressed size of data component. bytes_on_disk() returns size used by all components. NOTE: start and end size are written to compaction history, so users who monitor it should be aware of this change. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170525212129.6758-1-raphaelsc@scylladb.com>	2017-05-28 11:33:24 +03:00
Avi Kivity	ebaeefa02b	Merge seastar upstream (seastar namespace) - introduced "seastarx.hh" header, which does a "using namespace seastar"; - 'net' namespace conflicts with seastar::net, renamed to 'netw'. - 'transport' namespace conflicts with seastar::transport, renamed to cql_transport. - "logger" global variables now conflict with logger global type, renamed to xlogger. - other minor changes	2017-05-21 12:26:15 +03:00
Raphael S. Carvalho	ddc1d80c28	compaction: remove dead function declaration Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170504013046.23522-2-raphaelsc@scylladb.com>	2017-05-04 11:48:51 +03:00
Raphael S. Carvalho	61229ab88c	compaction: fix type for cleanup After compaction revamp, compaction type set by cleanup at its ctor is being overwritten at compaction::setup(). Consequently, cleanup would not be stopped by 'nodetool stop cleanup' and cleanup would be listed as regular compaction in 'nodetool compactionstats'. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170504013046.23522-1-raphaelsc@scylladb.com>	2017-05-04 11:48:50 +03:00
Raphael S. Carvalho	3071b9052a	compaction: make cleanup_compaction inherit from regular_compaction Some fields that belong to regular and cleanup aren't needed for resharding_compaction, such as incremental selector (which is used for determining max purgeable timestamp for a given decorated key) Better move those fields to regular and make cleanup inherit from regular compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170428195611.9196-1-raphaelsc@scylladb.com>	2017-04-30 19:37:09 +03:00
Raphael S. Carvalho	687a4bb0c2	dtcs: do not compact fully expired sstable whose ancestor is not deleted yet Currently, fully expired sstable[1] is unconditionally chosen for compaction by DTCS, but that may lead to a compaction loop under certain conditions. Let's consider that an almost expired sstable is compacted, and it's not deleted yet, and that the new sstable becomes expired before its ancestor is deleted. Because this new sstable is expired, it will be chosen by DTCS, but it will not be purged because 'compacted undeleted' sstables are taken into account by calculation of max purgeable timestamp and prevents expired data from being purged. The problem is that this sequence of events can keep happening forever as reported by issue #2260. NOTE: This problem was easier to reproduce before improvement on compaction of expired cells, because fully expired sstable was being converted into a sstable full of tombstones, which is also considered fully expired. Fixes #2260. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170428233554.13744-1-raphaelsc@scylladb.com>	2017-04-30 19:35:46 +03:00
Duarte Nunes	4e693383f7	mutation_partion: Use row_tombstone This patch replaces the current row tombstone representation by a row_tombstone. The intent of the patch is thus to reify the idea of shadowable tombstones, that up until now we considered all materialized view row tombstones to be. We need to distinguish shadowable from non-shadowable row tombstones to support scenarios such as, when inserting to a table with a materialized view: 1. insert into base (p, v1, v2) values (3, 1, 3) using timestamp 1 2. delete from base using timestamp 2 where p = 3 3. insert into base (p, v1) values (3, 1) using timestamp 3 These should yield a view row where v2 is definitely null, but with the current implementation, v2 will pop back with its value v2=3@TS=1, even though it's dead in the base row. This is because the row tombstone inserted at 2) is a shadowable one. This patch only addresses the memory representation of such row_tombstones. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-04-25 11:46:33 +02:00
Raphael S. Carvalho	0127309820	sstables: extend compaction for new resharding Extends compaction for new resharding algorithm. Not wired yet. New resharding will compact shared sstable(s) and create one sstable for each owner. It's up to the caller to open these new unshared sstables at their respective column families. This new approach will save a lot of bandwidth because we'll no longer read the entire shared sstable #smp::count times. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:08 -03:00
Raphael S. Carvalho	2a437ab427	compaction: rework compacting_sstable_writer to work with multiple writers compacting_sstable_writer only allowed one writer so far, but we will need multiple ones for resharding. It's done by moving writer management to compaction. finish_sstable_writer() is added for compaction impl to stop all writers, whereas stop_sstable_writer() will only stop current writer (needed when current sstable reaches max limit size for example). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:05 -03:00
Raphael S. Carvalho	a35a3a9647	compaction: prepare compacting_sstable_writer to work with writers No need for compacting_sstable_writer to store items that are available in compaction class. Also, that's a step towards supporting multiple writers for compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:03 -03:00
Raphael S. Carvalho	38ed83e2f7	sstables: rework compaction to make it easy to extend compact_sstables() supported both regular and cleanup compaction, but with lots of conditions that made it ugly and hard to extend. In the future, we want to introduce a new type of compaction for resharding that will create one sstable for every shard owning the sstable(s) given as input. That will be easier now. Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:02 -03:00
Tomasz Grabiec	124dde30db	sstables: Extract writer parameters into config objects Also enables users to change the default promoted index block size.	2017-03-10 14:42:22 +01:00
Tomasz Grabiec	045b9fd7c1	sstables: Calculate key hash only once during compaction Improves compaction performance.	2016-12-22 13:24:46 +01:00
Asias He	937f28d2f1	Convert to use dht::partition_range_vector and dht::token_range_vector	2016-12-19 14:08:50 +08:00
Asias He	d1178fa299	Convert to use dht::token_range	2016-12-19 08:04:29 +08:00
Raphael S. Carvalho	fcfc84e836	compaction: reduce bloom filter overhead with incremental selector The procedure to calculate max purgeable timestamp is optimized by only visiting sstables that overlap with key being currently compacted. That's done using incremental sstable selector. Function to calculate maximum purgeable timestamp is made 10 times faster when compacting sstables overlap with 10% of all sstables. Fixes #1322. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-12-09 16:17:17 -02:00
Avi Kivity	a35136533d	Convert ring_position and token ranges to be nonwrapping Wrapping ranges are a pain, so we are moving wrap handling to the edges. Since cql can't generate wrapping ranges, this means thrift and the ring maintenance code; also range->ring transformations need to merge the first and last ranges. Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>	2016-11-02 21:04:11 +02:00
Paweł Dziepak	93cc4454a6	streamed_mutation: emit range_tombstones directly Originally, streamed_mutations guaranteed that emitted tombstones are disjoint. In order to achieve that two separate objects were produced for each range tombstone: range_tombstone_begin and range_tombstone_end. Unfortunately, this forced sstable writer to accumulate all clustering rows between range_tombstone_begin and range_tombstone_end. However, since there is no need to write disjoint tombstones to sstables (see #1153 "Write range tombstones to sstables like Cassandra does") it is also not necessary for streamed_mutations to produce disjoint range tombstones. This patch changes that by making streamed_mutation produce range_tombstone objects directly. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-13 09:51:18 +01:00
Tomasz Grabiec	8c4b5e4283	db: Avoiding checking bloom filters during compaction Checking bloom filters of sstables to compute max purgeable timestamp for compaction is expensive in terms of CPU time. We can avoid calculating it if we're not about to GC any tombstone. This patch changes compacting functions to accept a function instead of ready value for max_purgeable. I verified that bloom filter operations no longer appear on flame graphs during compaction-heavy workload (without tombstones). Refs #1322.	2016-07-10 09:54:20 +02:00
Avi Kivity	02530faeb2	compaction: fix tombstones not being garbage collected during compaction `2a46410f4a` changed sstable_list from a map to a set, so it is no longer sorted by generation. The code for finding the list of sstables not being compacted relied on this sort order, and now broke, returning a longer list than needed (including some of the sstables being compacted). As a result, the compaction code preserved the tombstones, incorrectly thinking there was still live data they referenced. Fix by sorting the set explicitly. Fixes #1429. Message-Id: <1467793026-6571-1-git-send-email-avi@scylladb.com>	2016-07-06 10:22:31 +02:00
Raphael S. Carvalho	e9076f39be	compaction: implement function to get fully expired sstables Strongly based on org.apache.cassandra.db.compaction.CompactionController.getFullyExpiredSSTables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 02:11:47 -03:00
Raphael S. Carvalho	8d38fa49d4	sstables: move code to get uncompacting sstables to a function Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 01:33:55 -03:00
Avi Kivity	e22517bafc	Merge "Optimize reads from leveled sstables" In a leveled column family, there can be many thousands of sstables, since each sstable is limited to a relatively small size (160M by default). With the current approach of reading from all sstables in parallel, cpu quickly becomes a bottleneck as we need to check the bloom filter for each of these sstables. This patch addresses the problem by introducing a compaction-strategy-specific data structure for holding sstables. This data structure has a method to obtain the sstables used for a read. For leveled compaction strategy, this data structure is an interval map, which can be efficiently used to select the right sstables.	2016-07-04 16:00:35 +03:00
Avi Kivity	2a46410f4a	Change sstable_list from a map to a set sstable_list is now a map<generation, sstable>; change it to a set in preparation for replacing it with sstable_set. The change simplifies a lot of code; the only casualty is the code that computes the highest generation number.	2016-07-03 10:26:57 +03:00
Paweł Dziepak	c2f0ee9b5f	sstables: add consumer-style sstable compactor This patch moves compaction logic to a consumer that can be used with consume_flattened_in_thread(). Internally, sstable_writer is used to write individual sstables. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 11:39:01 +01:00
Paweł Dziepak	599ed7f1ed	sstables: restore indentation Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 11:37:54 +01:00
Paweł Dziepak	e7ff20b3bb	sstables: run compaction code inside a thread Currently, each sstable write has its separate thread. However, the goal is to have compaction use consume_flattened() with a consumer that creates and writes the sstables. consume_flattened() needs to be executed inside a thread, since sstable writer may defer. This patch is a first step in preparations and it just makes whole compaction logic run inside a thread. That makes little sense now, since all sstable writes spawn their own threads but that's going to change in the following patches. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 11:37:54 +01:00
Paweł Dziepak	b6f78a8e2f	sstable: make sstable reads return streamed_mutation Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:50 +01:00
Paweł Dziepak	737eb73499	mutation_reader: make readers return streamed_mutations Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:50 +01:00
Vlad Zolotarov	baf3614e8f	sstables: don't backup sstables that are a result of a compaction According to incremental backup description (http://docs.datastax.com/en/cassandra_win/2.2/cassandra/operations/opsBackupIncremental.html) sstables that are a result of a compaction process should not be backed up since original sstables had already been backed up. Fixes #1308 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <1466338622-7323-1-git-send-email-vladz@cloudius-systems.com>	2016-06-20 09:52:30 +03:00
Raphael S. Carvalho	29db5f5e1f	sstables: move compaction strategy code to a new source file Moving compaction strategy code from sstables/compaction.cc to sstables/compaction_strategy.cc That improves readability. Strategy code should be separated from the generic compaction code. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <5af6fc8f7321351a071fc0ce03c80ffea21f8396.1460761821.git.raphaelsc@scylladb.com>	2016-04-19 08:45:43 +03:00
Pekka Enberg	3f2286d02e	Merge "Delete compacted sstables atomically" from Avi "If we compact sstables A, B into a new sstable C we must either delete both A and B, or none of them. This is because a tombstone in B may delete data in A, and during compaction, both the tombstone and the data are removed. If only B is deleted, then the data gets resurrected. Non-atomic deletion occurs because the filesystem does not support atomic deletion of multiple files; but the window for that is small and is not addressed in this patchset. Another case is when A is shared across multiple shards (as is the case when changing shard count, or migrating from existing Cassandra sstables). This case is covered by this patchset. Fixes #1181."	2016-04-14 22:04:15 +03:00
Avi Kivity	a843aea547	db: delete compacted sstables atomically If sstables A, B are compacted, A and B must be deleted atomically. Otherwise, if A has data that is covered by a tombstone in B, and that tombstone is deleted, and if B is deleted while A is not, then the data in A is resurrected. Fixes #1181.	2016-04-14 17:14:26 +03:00
Raphael S. Carvalho	c28d168619	sstables: allow user to specify max sstable size with leveled strategy This change will allow user to specify the maximum size of a new sstable created as a result of leveled compaction. Example of using this setting: ALTER TABLE ks.test5 with compaction = {'sstable_size_in_mb': '1000', 'class': 'LeveledCompactionStrategy'} Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <ebb9844401af74388bda12586c2435283f6d8db8.1460486043.git.raphaelsc@scylladb.com>	2016-04-13 09:13:33 +03:00
Raphael S. Carvalho	8fe7524e46	sstables: enable leveled strategy feature to prevent L0 from falling behind If level 0 falls behind, size tiered strategy is used on it to reduce overhead until we can catch up on the higher levels. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <17bf15b7d12cd5dc652cc92939c0c68f921662a2.1459976469.git.raphaelsc@scylladb.com>	2016-04-11 11:52:00 +03:00
Pekka Enberg	38a54df863	Fix pre-ScyllaDB copyright statements People keep tripping over the old copyrights and copy-pasting them to new files. Search and replace "Cloudius Systems" with "ScyllaDB". Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>	2016-04-08 08:12:47 +03:00
Glauber Costa	d5c1366e85	compaction: be verbose about which table is causing an exception When we, for some reason, fail to compact an SSTable, we do not log the file name leaving us with cryptic messages that tell us what happened, but not where it happened. This patch adds logging in compaction so that we'll know what's going on. Please note that readers are more of a concern, because the SSTable being written technically does not exist yet. Still, better safe than sorry: if open_data fails, or we leave an unfinished SSTable, it is still good to know which one was the culprit. Some argument can be made about whether we should log this at the lower SSTable level, or at the compaction level. The reason I am logging this at the compaction level, is that we don't really know which exception will trigger, and where: it may be the case that we're seeing exceptions that are not SSTable specific, and may not have the chance to log it properly. In particular, if the exception happens inside the reader: read_rows() and friends only return a mutation reader, which doesn't really do anything until we call read(). But at that time, we don't hold any pointers to the SSTable anymore. In summary, logging at the compaction level guarantees that we always do it no matter what. Exceptions that are part of the main SSTable path can log the file name as well if they want: if that's the case, we'll be left with the name appearing twice. That's totally harmless, and better than none. Fixes #1123 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <c5c969fb6aeb788a037bd7a4ea69979c1042cb34.1459263847.git.glauber@scylladb.com>	2016-03-29 18:15:56 +03:00
Raphael S. Carvalho	bb48f1b06c	sstables: use system clock's epoch for timestamp in compaction history As pointed out by Tomek, the type of column used is timestamp, therefore system's clock epoch (db_clock) should be used instead. Fixes #817. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <f80f9f411d673cf2d653e193ccb8ebaa36bc891b.1456317766.git.raphaelsc@scylladb.com>	2016-02-24 14:49:21 +02:00
Raphael S. Carvalho	9cb8a43684	start using notation ks.cf everywhere Some places were using the notation ks/cf to represent a keyspace and column family pair. ks.cf is the notation used by C*, so we should use it everywhere. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <939449af92565b79d1823890784dc4d1dc3cdc84.1455830989.git.raphaelsc@scylladb.com>	2016-02-21 11:15:09 +02:00
Raphael S. Carvalho	ed61fe5831	sstables: make compaction stop report user-friendly When scylla stopped an ongoing compaction, the event was reported as an error. This patch introduces a specialized exception for compaction stop so that the event can be handled appropriately. Before: ERROR [shard 0] compaction_manager - compaction failed: read exception: std::runtime_error (Compaction for keyspace1/standard1 was deliberately stopped.) After: INFO [shard 0] compaction_manager - compaction info: Compaction for keyspace1/standard1 was stopped due to shutdown. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <1f85d4e5c24d23a1b4e7e0370a2cffc97cbc6d44.1455034236.git.raphaelsc@scylladb.com>	2016-02-11 12:16:53 +02:00
Raphael S. Carvalho	a46aa47ab1	make sstables::compact_sstables return list of created sstables Now, sstables::compact_sstables() receives as input a list of sstables to be compacted, and outputs a list of sstables generated by compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <0d8397f0395ce560a7c83cccf6e897a7f464d030.1454110234.git.raphaelsc@scylladb.com>	2016-01-31 12:39:20 +02:00
Raphael S. Carvalho	ee84f310d9	move deletion of sstables generated by interrupted compaction This deletion should be handled by sstables::compact_sstables, which is responsible for the creation of new sstables. It also simplifies the code. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <541206be2e910ab4edb1500b098eb5ebf29c6509.1454110234.git.raphaelsc@scylladb.com>	2016-01-31 12:39:20 +02:00
Raphael S. Carvalho	45c446d6eb	compaction: pass dht::token by reference Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-01-27 13:25:41 -02:00
Raphael S. Carvalho	fc541e2f08	compaction: remove code to sort local ranges storage_service::get_local_ranges returns sorted ranges, which are not overlapping nor wrap-around. As a result, there is no need for the consumer to do anything. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-01-27 13:15:36 -02:00
Glauber Costa	3f94070d4e	use auto&& instead of auto& for priority classes. By Avi's request, who reminds us that auto& is more suited for situations in which we are assigning to the variable in question. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <87c76520f4df8b8c152e60cac3b5fba5034f0b50.1453820373.git.glauber@scylladb.com>	2016-01-26 17:00:20 +02:00
Glauber Costa	b63611e148	mark I/O operations with priority classes After this patch, our I/O operations will be tagged into a specific priority class. The available classes are 5, and were defined in the previous patch: 1) memtable flush 2) commitlog writes 3) streaming mutation 4) SSTable compaction 5) CQL query Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-01-25 15:20:38 -05:00
Pekka Enberg	b5833e8002	Merge "Enable incremental backups option" from Vlad "This series moves the "backup" logic into the sstable::write_components() methods, adds support for enabling backup for sstables flushed in the compaction flow (in addition to a regular flushing flow which had this support already) and enables the "incremental_backups" configuration option." I fixed up a merge conflict with commit `5e953b5` ("Merge "Add support to stop ongoing compaction" from Raphael").	2016-01-21 18:52:07 +02:00

105 Commits