scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-09 00:13:31 +00:00

Author	SHA1	Message	Date
Avi Kivity	d973445a94	Merge "sstable/schema extensions" from Calle " Adds extension points to schema/sstables to enable hooking in stuff, like, say, something that modifies how sstable disk io works. (Cough, cough, encryption) Extensions are processed as property keywords in CQL. To add an extension, a "module" must register it into the extensions object at boot time. To avoid globals (and yet don't), extensions are reachable from config (and thus from db). Table/view tables already contain an extension element, so we utilize this to persist config. schema_tables tables/views from mutations now require a "context" object (currently only extensions, but abstracted for easier further changes). Because of how schemas currently operate, there is a super lame workaround to allow "schema_registry" access to config and by extension extensions. DB, upon instantiation, calls a thread local global "init" in schema_registry and registers the config. It, in turn, can then call table_from_mutations as required. Includes the (modified) patch to encapsulate compression into objects, mainly because it is nice to encapsulate, and isolate a little.
" * 'calle/extensions-v5' of github.com:scylladb/seastar-dev: extensions: Small unit test sstables: Process extensions on file open sstables::types: Add optional extensions attribute to scylla metadata sstables::disk_types: Add hash and comparator(sstring) to disk_string schema_tables: Load/save extensions table cql: Add schema extensions processing to properties schema_tables: Require context object in schema load path schema_tables: Add opaque context object config_file_impl: Remove ostream operators main/init: Formalize configurables + add extensions to init call db::config: Add extensions as a config sub-object db::extensions: Configuration object to store various extensions cql3::statements::property_definitions: Use std::variant instead of any sstables: Add extension type for wrapping file io schema: Add opaque type to represent extensions sstables::compress/compress: Make compression a virtual object	2018-02-26 17:15:29 +02:00
Duarte Nunes	e75f7c41d9	Merge 'Proper clean-up on closing index_reader' from Vladimir With the changes introduced in #2981 and #3189, the lifetime management of the objects used by index_reader became more complicated. This patchset addresses the immediate problems caused by lack of proper handling. The more holistic approach to this will take more time and is to be implemented under #3220. The current fix, however, should be good enough as a stop-gap solution. * 'issues/3213/v3' of https://github.com/argenet/scylla: Close promoted index streams when closing index_readers. Support proper closing of prepended_input_stream.	2018-02-21 01:02:16 +00:00
Vladimir Krivopalov	c996191411	Close promoted index streams when closing index_readers. Promoted index input streams must be explicitly closed when closing the index_reader in order to ensure all the pending read-aheads are completed. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-02-20 16:04:15 -08:00
Vladimir Krivopalov	8d52d809f7	Support proper closing of prepended_input_stream. When the stream is being closed, the call is forwarded to the stored data_source. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-02-20 16:04:05 -08:00
Vladimir Krivopalov	721bd3eef6	Added missing 'override' to skip() in buffer_input_stream and prepended_input_stream. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com> Message-Id: <4e91bead8de7f6fa9b3bfdab8bda73efdb22749d.1519152303.git.vladimir@scylladb.com>	2018-02-20 19:49:11 +00:00
Avi Kivity	87f10bc853	sstables: continuous_data_consumer: make _remain an unsigned type All of the adjustments to _remain already ensure it is greater than 0, and indeed a negative _remain doesn't make sense. Switching to an unsigned type allows us to re-enable -Wsign-compare. Tests: unit (release) Message-Id: <20180212121636.10463-1-avi@scylladb.com>	2018-02-12 12:25:21 +00:00
Avi Kivity	55168592ad	compaction_manager: fix use-after-free of column_family Commit `cce1a2bce8` ("Use the CPU scheduler") placed some compaction manager code in a scheduling_group. Unfortunately, downstream code relied on the callers not deferring, so that it could rely on the column_family's existence. That doesn't hold if the column_family is removed quickly, as with_scheduling_group() always defers. Fix by applying the scheduling group after we've taken the lock and guaranteed the stability of the column_family object. Fixes #3196. Message-Id: <20180211165155.18179-1-avi@scylladb.com>	2018-02-11 17:53:35 +00:00
Vladimir Krivopalov	71495691aa	Use separate shared_index_lists per sstable_mutation_reader instead of a single one per sstable. With the changes introduced in #2981, it is no longer safe to share index_entries among multiple sstable_mutation_readers. The original intent behind sharing index_entries among index_readers was to avoid re-reading same pages twice as we have two index readers - lower and upper bound - for every sstable_mutation_reader. In fact, the shared entries were held at the sstable object level so index_readers from different sstable_mutation_readers could have accessed them. Now, with calls to index_reader::advance_to(pos)/index_reader::advance_past(pos), index_entry can be accessed in a way that modifies its state if we need to read more promoted index blocks. It is safe to keep sharing them between two index_readers within the same sstable_mutation_reader as the invariant is maintained that readers can be only moved forward. We cannot safely assume, however, that this invariant holds for multiple sstable_mutation_readers as it may happen that one of them has read and thrown away some promoted index blocks that another one needs. So we restrict sharing to per-sstable_mutation_reader level. Fixes #3189. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com> Message-Id: <83957d007621fe4c62af49aebf1838bb2f32ee55.1518226793.git.vladimir@scylladb.com>	2018-02-10 15:08:45 +02:00
Avi Kivity	432268f582	Merge "branch 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla" from Raphael "The motivation is that it's no longer needed after the new resharding algorithm, which is solely responsible for working with shared sstables; regular compaction will not work with those! So resharding will schedule deletion of shared sstables once it's certain that shards that own them have the new unshared sstables. The manager was needed for orchestrating deletion of shared sstables across shards. It brings extra complexity that's no longer needed, and it was also overloading shard 0, but the latter could have been fixed. Tests: - unit: release mode - dtest: resharding_test.py" * 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla: Remove SSTable's atomic deletion manager Stop using SSTable's atomic deletion manager database: split column_family::rebuild_sstable_list	2018-02-08 19:10:16 +02:00
Tomasz Grabiec	cce1a2bce8	Merge "Use the CPU scheduler" from Glauber & Avi In this patchset I am resubmitting Avi's enablement of the CPU scheduler on his behalf. I've done a ton of testing in the series and there are some improvements / changes that I had previously sent as a separate series. What you see here is the result of merging that work. After this patchset is applied, workloads are smoother and we are able to uphold the pre-defined shares among the various actors. We also finally have everything we need to merge the CPU and I/O controllers. After that is done the code is much simpler. But also, as a bonus, controllers that were previously available for I/O only (compactions) are enabled for CPU as well. * git@github.com:glommer/scylla.git cpusched-v7: Avi Kivity (4): database, sstables, compaction: convert use of thread_scheduling_group to seastar cpu scheduler memtable, database: make memtable::clear_gently() inherit scheduling_group config: mark background_writer_scheduling_quota as Unused database: place data_query execution stage into scheduling_group Glauber Costa (9): database, main: set up scheduling_groups for our main tasks row_cache: actually use the scheduling group for update_cache allow update_cache and clear_gently to use the entire task quota. database: remove cpu_flush_quota metric controllers: retire auto_adjust_flush_quota controllers: allow memtable I/O controller to have shares statically set controllers: update control points for memtable I/O controller controllers: allow a static priority to override the controller output controllers: unify the I/O and CPU controllers	2018-02-08 15:58:40 +01:00
Raphael S. Carvalho	312bd9ce25	Remove SSTable's atomic deletion manager Not used anymore, can be deleted. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2018-02-07 22:38:45 -02:00
Raphael S. Carvalho	1472cfcc19	Stop using SSTable's atomic deletion manager The motivation is that it's no longer needed after the new resharding algorithm, which is solely responsible for working with shared sstables; regular compaction will not work with those! So resharding will schedule deletion of shared sstables once it's certain that shards that own them have the new unshared sstables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2018-02-07 22:27:17 -02:00
Glauber Costa	956af9f099	database, main: set up scheduling_groups for our main tasks Set up scheduling groups for streaming, compaction, memtable flush, query, and commitlog. The background writer scheduling group is retired; it is split into the memtable flush and compaction groups. Comments from Glauber: This patch is based on a patch from Avi with the same subject, but the differences are significant enough that I reset authorship. In particular: 1) A bug/regression is fixed with the boundary calculations for the memtable controller sampling function. 2) A leftover is removed, where after flushing a memtable we would go back to the main group before going to the cache group again 3) As per Tomek's suggestion, the submission of compactions themselves is now run in the compaction scheduling group. Having that working is what changes this patch the most: we now store the scheduling group in the compaction manager and let the compaction manager itself enforce the scheduling group. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-02-07 17:19:29 -05:00
Avi Kivity	641aaba12c	database, sstables, compaction: convert use of thread_scheduling_group to seastar cpu scheduler thread_scheduling_groups are converted to plain scheduling_group. Due to differences in initialization (scheduling_group initialization defers), we create the scheduling_groups in main.cc and propagate them to users via a new class database_config. The sstable writer loses its thread_scheduling_group parameter and instead inherits scheduling from its caller. Since shares are in the 1-1000 range vs. 0-1 for thread scheduling quotas, the flush controller was adjusted to return values within the higher ranges.	2018-02-07 17:19:29 -05:00
Calle Wilund	264b9d2da0	sstables: Process extensions on file open Allowing them to wrap/replace an opened file, and add to/read from scylla metadata.	2018-02-07 10:11:46 +00:00
Calle Wilund	b0c0c3c0ad	sstables::types: Add optional extensions attribute to scylla metadata Allowing storing key:value pairs.	2018-02-07 10:11:46 +00:00
Calle Wilund	68fc076f80	sstables::disk_types: Add hash and comparator(sstring) to disk_string	2018-02-07 10:11:46 +00:00
Calle Wilund	0dcf287230	sstables: Add extension type for wrapping file io	2018-02-07 10:11:45 +00:00
Calle Wilund	74758c87cd	sstables::compress/compress: Make compression a virtual object Make a "compressor" an actual class, that can be implemented and registered via class registry. For "common" compressors, the objects will be shared, but complex implementors can be semi-stateful. sstable compression is split into two parts: The "static" config which is shared across shards, and a "local" one, which holds a compressor pointer. The latter is encapsulated, along with actual compressed data writers, in sstables/compress.cc. For compression (write), compression writer is instantiated with the settings active in table metadata. For decompression (read), compression reader is instantiated with the settings stored in sstable metadata, which can differ from the currently active table metadata. v2: * Structured patch sets differently (dependencies) * Added more comments/api descs * Added patch to move all sstable compression into compress.cc, effectively separating top-level virtual compressor object from sstable io knowledge v3: * Rebased v4: * Moved all sstable compression logic/knowledge into compress.cc (local compression). Merged the two patches (separation just confuses reader).	2018-02-07 10:11:45 +00:00
Raphael S. Carvalho	09f4ee808f	sstables/compress: Fix race condition in segmented offset reading of shared sstable Race condition was introduced by commit `028c7a0888`, which introduces chunk offset compression, because a reading state is kept in the compress structure which is supposed to be immutable and can be shared among shards owning the same sstable. So it may happen that shard A updates state while shard B relies on information previously set, which leads to incorrect decompression, which in turn leads to reads misbehaving. We could serialize access to at(), which would only lead to contention issues for shared sstables, but that can be avoided by moving state out of the compress structure, which is expected to be immutable after the sstable is loaded and fed to the shards that own it. Sequential accessor (wraps state and reference to segmented_offset) is added to prevent at() and push_back() interfaces from being polluted. Tests: release mode. Fixes #3148. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20180205192432.23405-1-raphaelsc@scylladb.com>	2018-02-06 12:10:10 +02:00
Vladimir Krivopalov	b91c3fd47e	Use advance_past for single partition upper bound. Instead of advancing to the next partition, first try to find the more precise position using promoted index blocks. advance_past() only seeks within currently available PI blocks (or reads the first batch, if never read before) and uses the position if found, otherwise resorts to advance_to_next_partition() Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-01-29 11:57:45 -08:00
Vladimir Krivopalov	6f8c6a0933	Remove obsolete types and methods. These types and methods are no longer in use since the index_reader is now consuming promoted index incrementally. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-01-29 11:57:35 -08:00
Vladimir Krivopalov	0a7a56edd5	Simplify continuous_data_consumer::consume_input() interface. Remove redundant input parameter as continuous_data_consumer derivatives would only use themselves as a context. So take it internally and make the function regular (non-template) and having no parameters. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-01-29 11:57:26 -08:00
Vladimir Krivopalov	7e15e436de	Parse promoted index entries lazily upon request rather than immediately. Now promoted index is converted into an input_stream and skipped over instead of being consumed immediately and stored as a single buffer. The only part that is read right away is the deletion time as it is likely to be there in the already read buffer and reading it should both be cheap and prevent from reading the whole promoted index if only deletion time mark is needed. When accessed, promoted index is parsed in chunks, buffer by buffer, to limit memory consumption. Fixes #2981 Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-01-29 11:57:15 -08:00
Vladimir Krivopalov	9fdf4b24b5	Add helper input streams: buffer_input_stream and prepended_input_stream. buffer_input_stream is a simple input_stream wrapping a single temporary_buffer. prepended_input_stream suits for the case when some data has been read into a buffer and the rest is still in a stream. It accepts a buffer and a data_source and first reads from the buffer and then, when it ends, proceeds reading from the data_source. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-01-29 11:57:04 -08:00
Vladimir Krivopalov	5dca3100ed	Support skipping over bytes from input stream in parsers based on continuous_data_consumer Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-01-29 11:56:55 -08:00
Raphael S. Carvalho	2c181b69c9	sstables: fix wildly inaccurate sstable key estimation after dynamic index sampling The reason sstable key estimation is inaccurate is that it doesn't account for index sampling now being dynamic. The estimation is done as follows: uint64_t get_estimated_key_count() const { return ((uint64_t)_components->summary.header.size_at_full_sampling + 1) * _components->summary.header.min_index_interval; } The biggest problem is that _components->summary.header.min_index_interval isn't actually the minimum interval, but instead the default interval value set in the schema. So the estimation gets worse the larger the average partition, because the larger the average partition the lower the index sampling interval. One of the problems is that estimation has a big influence on bloom filter size, and so for large partitions we were generating bigger filters than we had to. From now on, size at full sampling is calculated as if sampling were static (which was the case until commit `8726ee937d` which introduced size-based sampling), using minimum index as a strict sampling interval. Tests: units (release) Fixes #3113. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20180122233612.11147-1-raphaelsc@scylladb.com>	2018-01-23 10:42:24 +02:00
Glauber Costa	5140aaea00	add a timeout to fast forward to In the last patch, we enabled per-request timeouts, we enable timeouts in fill_buffer. There are many places, though, in which we fast_forward_to before we fill_buffer, so in order to make that effective we need to propagate the timeouts to fast_forward_to as well. In the same way as fill_buffer, we make the argument optional wherever possible in the high level callers, making them mandatory in the implementations. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-12 07:43:19 -05:00
Glauber Costa	d965af42b0	add a timeout to fill_buffer As part of the work to enable per-request timeouts, we enable timeouts in fill_buffer. The argument is made optional at the main classes, but mandatory in all the ::impl versions. This way we'll make sure we didn't forget anything. At this point we're still mostly passing that information around and don't have any entity that will act on those timeouts. In the next patch we will wire that up. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-11 12:07:41 -05:00
Duarte Nunes	cbbdfde979	sstables/compaction_backlog_tracker: Constify backlog() Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180111004914.25796-1-duarte@scylladb.com>	2018-01-11 13:20:57 +02:00
Duarte Nunes	43ad5bd182	sstables/compaction_backlog_manager: Fix use-after-free If the compaction_backlog_manager's lifetime ends before the linked compaction_backlog_tracker's, the latter's _manager pointer is not cleared, which can lead to a use-after-free error when running ~compaction_backlog_tracker(), as evidenced by failing unit tests. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180111004914.25796-2-duarte@scylladb.com>	2018-01-11 13:20:55 +02:00
Raphael S. Carvalho	4610e994e1	sstables: cure our blindness on sstable read failure After `611774b`, we're blind again on which sstable caused a compaction to fail, leaving us with a cryptic message such as: compaction_manager - compaction failed: std::runtime_error (compressed chunk failed checksum) After this change, a read failure in either compaction or a regular read will report the guilty sstable, see: compaction_manager - compaction failed: std::runtime_error (SSTable reader found an exception when reading sstable ./data/.../keyspace1-standard1 ka-1-Data.db : std::runtime_error(compressed chunk failed checksum)) Fixes #3006. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20180102230752.14701-1-raphaelsc@scylladb.com>	2018-01-08 13:43:13 +02:00
Avi Kivity	72c673fcc3	Merge "I/O Controller for memtables and compactions" from Glauber "This patchset implements the compaction controller for I/O shares. The goal is to automatically adjust compaction shares based on a strategy-specific backlog. A higher backlog will translate into higher shares. As compaction progresses, that reduces the backlog. As new data is flushed, that increases the backlog. The goal of the controller is to keep the backlog constant at a certain rate, so that we go neither too fast nor too slow. Tracking reads and writes: ========================== Tracking of reads and writes happens through the read_monitor and the write_monitor. The write monitor is an existing interface that has the purpose of releasing the write permit at particular points of the write process. We enhance it so as to get a reference to an instance that tracks the current offset inside the sstables::file_writer. This way the backlog tracker can always know for sure what the offset of the current write is. A similar thing is done for reads. The data_consumer already tracks the position of the current read, and we isolate that into a structure to which we can get a reference. A read_monitor allows us to connect the compaction to that reference. Lifetime management: ==================== In general, tracking objects will be owned by their callers and passed down as references. The compaction object will own the read monitors and the compaction write monitors and the memtable flush write monitor will be kept alive in a do_with block around the flush itself. The backlog_{write,read}_progress_manager needs to be kept alive until the SSTable is no longer in progress. For writes, that means until we are able to add the SSTable charges in full, and for reads (compaction) that means until we are able to remove the charges in full. It is important to do that to avoid spikes in the graph.
If we remove the progress managers in a different operation than updating the SSTable list we will be left in a temporary state where charges appear or disappear abruptly, to be fixed when the final add_sstable/remove_sstable happens. So we want those things to happen together. The compaction_backlog_tracker is kept alive until the strategy changes, for example, through ALTER TABLE. Current charges are transferred to the new strategy's compaction_backlog_tracker object when we do that. If the type of strategy changes, the current read charges are forgotten. We can do that because those running compactions will not really contribute to decreasing the backlog of the new compaction strategy. Transfer of Charges =================== When ALTER TABLE happens, we need to transfer ongoing writes to the new backlog manager. Ongoing reads will still be tracked by the backlog_manager that originated them. The rationale for that is that reads still belong to the current compaction, with the strategy that generated them. But new Tables being written will add to the backlog of the new strategy. Note that ALTER TABLE operations do not necessarily cause a change of Strategy. We can be using the same strategy but just changing properties. If that is the case, we expect no discontinuity in the backlog graph (tested). Resharding ========== Resharding compactions are more complex than normal compactions because the SSTables are created in one shard and later sent to another shard. It is better, then, to track resharding compactions separately and let them have their own backlog tracker, which will insert backlog in proportion to the amount of data to be resharded. Memtable Flush I/O Controller ============================= With the current infrastructure it becomes trivial to add a new controller, for either I/O or CPU. This patchset then adds an I/O controller for memtable flushes, using the same backlog algorithm that we already used for CPU."
* 'compaction-controller-io-v5' of github.com:glommer/scylla: database: add a controller for I/O on memtable flushes. document the compaction controller compaction: adjust shares for compactions backlog_controllers: implement generic I/O controller factor out some of the controller code io shares: multiply all shares by 10 compaction_strategy: implement backlog manager for the SizeTiered strategy infrastructure for backlog estimator for compaction work. sstables: notify about end of data component write sstables: add read_monitor_generator sstables: add read_monitor sstables: enhance data consumer with a position tracker sstables: enhance the file_writer with an offset tracker sstables: pass references instead of pointers for write_monitor compaction: control destruction of readers	2018-01-07 15:00:10 +02:00
Avi Kivity	375ed938b4	Merge "Fix potential infinite recursion in leveled compaction" from Raphael '"The issue is triggered by compaction of sstables of level higher than 0. The problem happens when the interval map of the partitioned sstable set stores intervals such as the following: [-9223362900961284625 : -3695961740249769322 ] (-3695961740249769322 : -3695961103022958562 ] When the selector is called for the first interval above, the exclusive lower bound of the second interval is returned as next token, but the inclusiveness info is not returned. So reader_selector was returning that there were new readers when the current token was -3695961740249769322 because it was stored in selector position field as inclusive, but it's actually exclusive. This false positive was leading to infinite recursion in combined reader because sstable set's incremental selector itself knew that there were actually no new readers, and therefore no progress could be made." Fixes #2908.' * 'high_level_compaction_infinite_recursion_fix_v4' of github.com:raphaelsc/scylla: tests: test for infinite recursion bug when doing high-level compaction Fix potential infinite recursion when combining mutations for leveled compaction dht: make it easier to create ring_position_view from token dht: introduce is_min/max for ring_position	2018-01-07 13:22:17 +02:00
Raphael S. Carvalho	818830715f	Fix potential infinite recursion when combining mutations for leveled compaction The issue is triggered by compaction of sstables of level higher than 0. The problem happens when the interval map of the partitioned sstable set stores intervals such as the following: [-9223362900961284625 : -3695961740249769322 ] (-3695961740249769322 : -3695961103022958562 ] When the selector is called for the first interval above, the exclusive lower bound of the second interval is returned as next token, but the inclusiveness info is not returned. So reader_selector was returning that there were new readers when the current token was -3695961740249769322 because it was stored in selector position field as inclusive, but it's actually exclusive. This false positive was leading to infinite recursion in combined reader because sstable set's incremental selector itself knew that there were actually no new readers, and therefore no progress could be made. Fix is to use ring_position in reader_selector, such that inclusiveness would be respected. So reader_selector::has_new_readers() won't return a false positive under the conditions described above. Fixes #2908. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2018-01-03 16:23:01 -02:00
Raphael S. Carvalho	e29b598c5f	sstables: make compaction_descriptor's ctor explicit to avoid bad conversion perf sstable used old sstables::compact_sstables() interface and still compiled due to bad implicit conversion. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20180103041900.21186-1-raphaelsc@scylladb.com>	2018-01-03 12:37:12 +02:00
Glauber Costa	074a13ecf1	compaction_strategy: implement backlog manager for the SizeTiered strategy The SizeTiered backlog for a single SSTable is defined as: Bi = Ei * log4(T / Si) Where: - Si is the size of this individual SSTable - T is the sum of sizes for all individual SSTables - Ei is the effective bytes in this SSTable. The Effective size of an SSTable is: - The uncompacted size for an SSTable under compaction - The partially written size for an SSTable being written - The SSTable size for an SSTable that is not undergoing any of those processes. The Aggregate Backlog for the entire Table is just the sum of all individual SSTable backlogs, including the SSTables currently being written. Care is taken to avoid iterating over all SSTables, by separating the aggregate backlog into a static component (sstables not changing) and a component of SSTables that are undergoing change. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:07 -05:00
Glauber Costa	ca284174d0	infrastructure for backlog estimator for compaction work. This patch adds infrastructure in various points in the system to allow us to determine the amount of work present as backlog from compactions. What needs to be done can be explained in three major pieces: 1) Add hooks in the points where sstables are added or inserted to a column family (or more precisely, to a compaction_strategy object). 2) Add hooks in read and write monitors that allow a compaction backlog estimator (tracker) to become aware of bytes that are partially written and compacted away. 3) Add a per-column family class (compaction_backlog_tracker) that can be used to track work that is done and relevant to compactions (like the two above), and a compaction manager to provide a system-wide backlog based on the response of the individual trackers. The definition of how much backlog one has is strategy-specific. The Null strategy is easy, as it never really has any backlog, and so is the major strategy - since what really matters is the backlog of the underlying compaction strategy. Although backlogs are strategy-specific, they should be "compatible", in the sense that if a particular strategy has more work to do, it should yield a higher number than its counterparts. All the others are presented in this patch as unimplemented: they will always advertise a mild backlog that should yield a constant CPU-utilization if used alone. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:07 -05:00
Glauber Costa	86d7c160fd	sstables: notify about end of data component write We need to notify the monitor that the offset tracker that we are using is about to be destroyed and will no longer be valid. While we could modify the file_writer interface so that we could capture the offset_tracker and take ownership of it - guaranteeing it is alive until we reach the existing on_write_completed(), this feels like a layer violation. It is also potentially useful in general to offer the monitor callers with knowledge that writing the data portion is done. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:07 -05:00
Glauber Costa	3bd6bceaf0	sstables: add read_monitor_generator Passing the read monitor down to the sstable readers is tricky. The point of interest - like compaction - are usually very far from the interfaces that register the monitor, like read_rows. Between the two, there is usually a mutation_reader, which is and ought to be totally unaware of the read monitor: technically, a mutation_reader may not even know it is backed by sstables. The solution is to create a read_monitor_generator, that can be passed from the upper layers, like compaction, to the layers that are actually making the decision of which sstables to create readers for. Note that we don't need an equivalent piece of infrastructure for writes, because writes don't happen through hidden layers and have all the information they need to initialize their monitors. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:07 -05:00
Glauber Costa	9702a0935b	sstables: add read_monitor Similar to the write_monitor, it will track progress of an sstable being read. In the current interface, we will notify interested users about what is the current position in the data file. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:07 -05:00
Glauber Costa	f0391bf9a0	sstables: enhance data consumer with a position tracker Callers, like compactions, will be able to know at any time the current progress of a read. As we do that, the currently unimplemented position() method of data_consume_context becomes redundant and is removed. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:07 -05:00
Glauber Costa	110b8531f4	sstables: enhance the file_writer with an offset tracker Callers, like the memtable flusher or compactions will be able to find out the current amount of bytes written at any time. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:07 -05:00
Glauber Costa	00df0a5ad3	sstables: pass references instead of pointers for write_monitor This came from Avi's review on the read_monitors. He suggests we wouldn't keep shared pointers, and would instead have the caller ensuring lifetime. That makes sense, but having the writer interface using shared_ptr and the read interface using references would lead to an inconsistent interface. For the sake of consistency we will change the write monitor to take references before we do that. From database.cc's perspective, we could now keep the monitors in a do_with() block, but we will keep the shared_ptrs to manage their lifetime in anticipation of upcoming patches in this series, where we'll have to pass them somewhere else. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:06 -05:00
Glauber Costa	d4109ebb80	compaction: control destruction of readers Compactions run from a seastar::thread, in run(). They will either fail or succeed, and from the point of view of ordering of destruction between the compaction object and its readers: - if compaction succeeds, we have no control over who gets destructed first since both objects will be going out of scope. - if it fails, we will forcibly destruct the compaction object, at which point the readers are still alive. From the point of view of lifetime management, it would be nice to make sure that the compaction object outlives whichever other objects it needs during compaction. This nice-to-have will become paramount when we start adding read_monitors to the compaction object, which themselves have to outlive the readers. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:06 -05:00
Avi Kivity	8795238869	Merge "Fix handling of range tombstones starting at same position" from Tomasz "When we get two range tombstones with the same lower bound from different data sources (e.g. two sstables), which need to be combined into a single stream, they need to be de-overlapped, because each mutation fragment in the stream must have a different position. If we have range tombstones [1, 10) and [1, 20), the result of that de-overlapping will be [1, 10) and [10, 20]. The problem is that if the stream corresponds to a clustering slice with upper bound greater than 1, but lower than 10, the second range tombstone would appear as being out of the query range. This is currently violating assumptions made by some consumers, like cache populator. One effect of this may be that a reader will miss rows which are in the range (1, 10) (after the start of the first range tombstone, and before the start of the second range tombstone), if the second range tombstone happens to be the last fragment which was read for a discontinuous range in cache and we stopped reading at that point because of a full buffer and cache was evicted before we resumed reading, so we went to reading from the sstable reader again. There could be more cases in which this violation may resurface. There is also a related bug in mutation_fragment_merger. If the reader is in forwarding mode, and the current range is [1, 5], the reader would still emit range_tombstone([10, 20]). If that reader is later fast forwarded to another range, say [6, 8], it may produce fragments with positions smaller than those which were emitted before, violating monotonicity of fragment positions in the stream. A similar bug was also present in partition_snapshot_flat_reader. 
Possible solutions: 1) relax the assumption (in cache) that streams contain only relevant range tombstones, and only require that they contain at least all relevant tombstones 2) allow subsequent range tombstones in a stream to share the same starting position (position is weakly monotonic), then we don't need to de-overlap the tombstones in readers. 3) teach combining readers about query restrictions so that they can drop fragments which fall outside the range 4) force leaf readers to trim all range tombstones to query restrictions This patch implements solution no. 2. It simplifies combining readers, which don't need to accumulate and trim range tombstones. I don't like solution 3, because it makes combining readers more complicated, slower, and harder to properly construct (currently combining readers don't need to know restrictions of the leaf streams). Solution 4 is confined to implementations of leaf readers, but also has the disadvantage of making those more complicated and slower. There is only one consumer which needs the tombstones with monotonic positions, and that is the sstable writer. Fixes #3093." 
* tag 'tgrabiec/fix-out-of-range-tombstones-v1' of github.com:scylladb/seastar-dev: tests: row_cache: Introduce test for concurrent read, population and eviction tests: sstables: Add test for writing combined stream with range tombstones at same position tests: memtable: Test that combined mutation source is a mutation source tests: memtable: Test that memtable with many versions is a mutation source tests: mutation_source: Add test for stream invariants with overlapping tombstones tests: mutation_reader: Test fast forwarding of combined reader with overlapping range tombstones tests: mutation_reader: Test combined reader slicing on random mutations tests: mutation_source_test: Extract random_mutation_generator::make_partition_keys() mutation_fragment: Introduce range() clustering_interval_set: Introduce overlaps() clustering_interval_set: Extract private make_interval() mutation_reader: Allow range tombstones with same position in the fragment stream sstables: Handle consecutive range_tombstone fragments with same position tests: streamed_mutation_assertions: Merge range_tombstones with the same position in produces_range_tombstone() streamed_mutation: Introduce peek() mutation_fragment: Extract mergeable_with() mutation_reader: Move definition of combining mutation reader to source file mutation_reader: Use make_combined_reader() to create combined reader	2018-01-02 18:32:09 +02:00
Raphael S. Carvalho	3dcf00ec67	sstables: feed new sstable with its owner shard Missed opportunity to feed shard id to sstable being written when working on `67c5c8dc67`, so when sstable is reopened after sealed, its shard doesn't need to be recomputed by open procedure. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20171231024529.13664-1-raphaelsc@scylladb.com>	2018-01-01 10:17:07 +02:00
Raphael S. Carvalho	c76356fb39	sstables: make shard computation resilient to empty sharding metadata Scylla metadata could be empty due to bugs like the one introduced by `115ff10`. Let's make shard computation resilient to empty sharding metadata by falling back to the approach that uses first and last keys to compute shards. Refs #2932. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20171223120140.3642-2-raphaelsc@scylladb.com>	2017-12-28 14:07:06 +02:00
Raphael S. Carvalho	fa5a26f12d	sstables: fail sstable write if unable to generate sharding metadata SSTable can generate an empty sharding metadata after a bug like the one introduced here `115ff10`, that results in tokens being generated using base table for the view table. That leads to sstable being deleted in subsequent boot because all shards will agree on its deletion given that it will not belong to anybody, and also compaction to crash because this relies on resulting sstable belonging to one shard at least. I wouldn't like to spend days debugging it again because sstable write silently generated empty sharding metadata, so let's make write fail when it happens (see issue #2932 for details). Refs #2932. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20171223120140.3642-1-raphaelsc@scylladb.com>	2017-12-28 14:07:05 +02:00
Duarte Nunes	2618209c2d	Remove obsolete includes and fix build move.hh was deleted, but files weren't updated to reflect that. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-12-28 12:03:44 +00:00
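
The de-overlapping problem described in commit 8795238869 can be illustrated with a minimal C++ sketch. This is not ScyllaDB code: the `tombstone` struct, `deoverlap_same_start()` and `starts_within()` are hypothetical names introduced here, clustering positions are modelled as plain integers, and bound inclusiveness and tombstone timestamps are deliberately ignored.

```cpp
#include <cassert>
#include <utility>

// Hypothetical, simplified range tombstone: covers positions from start to end.
// Bound inclusiveness and deletion timestamps are not modelled.
struct tombstone {
    int start;
    int end;
};

// De-overlap two tombstones that share the same lower bound, so that the
// two resulting fragments have different starting positions.
// The shorter one is kept as-is; the longer one is trimmed to start where
// the shorter one ends (the remainder is empty if both ends are equal).
std::pair<tombstone, tombstone> deoverlap_same_start(tombstone a, tombstone b) {
    assert(a.start == b.start);  // precondition of this sketch
    if (b.end < a.end) {
        std::swap(a, b);
    }
    return {a, tombstone{a.end, b.end}};
}

// True if the tombstone starts at or before the upper bound of a
// clustering slice, i.e. it does not appear to be out of the query range.
bool starts_within(const tombstone& t, int slice_upper_bound) {
    return t.start <= slice_upper_bound;
}
```

With the tombstones from the commit message, `deoverlap_same_start({1, 10}, {1, 20})` yields fragments starting at 1 and at 10. For a slice with upper bound 5, the second fragment fails `starts_within()`, which is the out-of-range appearance the message describes. If both tombstones are instead allowed to keep the same starting position 1 (solution 2), both pass the check.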

1253 Commits