scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-22 01:20:39 +00:00

Author	SHA1	Message	Date
Glauber Costa	3bd6bceaf0	sstables: add read_monitor_generator Passing the read monitor down to the sstable readers is tricky. The point of interest - like compaction - are usually very far from the interfaces that register the monitor, like read_rows. Between the two, there is usually a mutation_reader, which is and ought to be totally unaware of the read monitor: technically, a mutation_reader may not even know it is backed by sstables. The solution is to create a read_monitor_generator, that can be passed from the upper layers, like compaction, to the layers that are actually making the decision of which sstables to create readers for. Note that we don't need an equivalent piece of infrastructure for writes, because writes don't happen through hidden layers and have all the information they need to initialize their monitors. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:07 -05:00
Glauber Costa	110b8531f4	sstables: enhance the file_writer with an offset tracker Callers, like the memtable flusher or compactions will be able to find out the current amount of bytes written at any time. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:07 -05:00
Glauber Costa	00df0a5ad3	sstables: pass references instead of pointers for write_monitor This came from Avi's review on the read_monitors. He suggests we wouldn't keep shared pointers, and would instead have the caller ensuring lifetime. That makes sense, but having the writer interface using shared_ptr and the read interface using references would lead to an inconsistent interface. For the sake of consistency we will change the write monitor to take references before we do that. From database.cc's perspective, we could now keep the monitors in a do_with() block, but we will keep the shared_ptrs to manage their lifetime in anticipation of upcoming patches in this series, where we'll have to pass them somewhere else. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:06 -05:00
Avi Kivity	8795238869	Merge "Fix handling of range tombstones starting at same position" from Tomasz "When we get two range tombstones with the same lower bound from different data sources (e.g. two sstables), which need to be combined into a single stream, they need to be de-overlapped, because each mutation fragment in the stream must have a different position. If we have range tombstones [1, 10) and [1, 20), the result of that de-overlapping will be [1, 10) and [10, 20). The problem is that if the stream corresponds to a clustering slice with upper bound greater than 1, but lower than 10, the second range tombstone would appear as being out of the query range. This is currently violating assumptions made by some consumers, like cache populator. One effect of this may be that a reader will miss rows which are in the range (1, 10) (after the start of the first range tombstone, and before the start of the second range tombstone), if the second range tombstone happens to be the last fragment which was read for a discontinuous range in cache and we stopped reading at that point because of a full buffer and cache was evicted before we resumed reading, so we went to reading from the sstable reader again. There could be more cases in which this violation may resurface. There is also a related bug in mutation_fragment_merger. If the reader is in forwarding mode, and the current range is [1, 5], the reader would still emit range_tombstone([10, 20]). If that reader is later fast forwarded to another range, say [6, 8], it may produce fragments with smaller positions which were emitted before, violating monotonicity of fragment positions in the stream. A similar bug was also present in partition_snapshot_flat_reader. 
Possible solutions: 1) relax the assumption (in cache) that streams contain only relevant range tombstones, and only require that they contain at least all relevant tombstones 2) allow subsequent range tombstones in a stream to share the same starting position (position is weakly monotonic), then we don't need to de-overlap the tombstones in readers. 3) teach combining readers about query restrictions so that they can drop fragments which fall outside the range 4) force leaf readers to trim all range tombstones to query restrictions This patch implements solution no 2. It simplifies combining readers, which don't need to accumulate and trim range tombstones. I don't like solution 3, because it makes combining readers more complicated, slower, and harder to properly construct (currently combining readers don't need to know restrictions of the leaf streams). Solution 4 is confined to implementations of leaf readers, but also has disadvantage of making those more complicated and slower. There is only one consumer which needs the tombstones with monotonic positions, and that is the sstable writer. Fixes #3093." 
* tag 'tgrabiec/fix-out-of-range-tombstones-v1' of github.com:scylladb/seastar-dev: tests: row_cache: Introduce test for concurrent read, population and eviction tests: sstables: Add test for writing combined stream with range tombstones at same position tests: memtable: Test that combined mutation source is a mutation source tests: memtable: Test that memtable with many versions is a mutation source tests: mutation_source: Add test for stream invariants with overlapping tombstones tests: mutation_reader: Test fast forwarding of combined reader with overlapping range tombstones tests: mutation_reader: Test combined reader slicing on random mutations tests: mutation_source_test: Extract random_mutation_generator::make_partition_keys() mutation_fragment: Introduce range() clustering_interval_set: Introduce overlaps() clustering_interval_set: Extract private make_interval() mutation_reader: Allow range tombstones with same position in the fragment stream sstables: Handle consecutive range_tombstone fragments with same position tests: streamed_mutation_assertions: Merge range_tombstones with the same position in produces_range_tombstone() streamed_mutation: Introduce peek() mutation_fragment: Extract mergeable_with() mutation_reader: Move definition of combining mutation reader to source file mutation_reader: Use make_combined_reader() to create combined reader	2018-01-02 18:32:09 +02:00
Duarte Nunes	89b353cd95	Delete unused nway_merger.hh Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1514463536-7732-1-git-send-email-duarte@scylladb.com>	2017-12-28 14:21:40 +02:00
Tomasz Grabiec	52285a9e73	mutation_reader: Use make_combined_reader() to create combined reader So that we can hide the definition of combined_mutation_reader. It's also less verbose.	2017-12-21 21:24:11 +01:00
Piotr Jastrzebski	308ec43ea5	cf::for_all_partitions::iteration_state: don't store schema_ptr Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-12-21 11:47:07 +01:00
Piotr Jastrzebski	570703a169	read_mutation_from_flat_mutation_reader: don't take schema_ptr Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-12-21 11:47:07 +01:00
Paweł Dziepak	3cf46a31a6	flat_multi_range_mutation_reader: disallow streamed_mutation::forwarding Properly implementing streamed_mutation::forwarding::yes in multi range reader would noticeably increase its complexity and is not needed.	2017-12-20 14:50:11 +00:00
Avi Kivity	2137d753b3	Merge "Serialize compaction of same size tier for different cfs" from Raphael "Currently, compaction manager will serialize compaction of same size tier (or weight) if they belong to the same column family. However, it fails to do so if the compaction jobs belong to different column families. That can lead to an ungodly amount of running compaction which gets worse the higher the number of shards and active column families. The problem is that it may affect overall system performance due to excessive resource usage. It's easy to trigger it during bootstrapping after loading node with new sstables or repairing, or if lots of cfs are being actively written." Fixes #1295. * 'similar_sized_compaction_serialization_v4' of github.com:raphaelsc/scylla: sstables: remove column_family from compaction_weight_registration compaction_manager: serialize compaction of same size tier for different cfs sstables: introduces deregister() and weight() to compaction_weight_registration sstables: move compaction_weight_registration to its own header sstables: improve compact_sstables() interface	2017-12-19 16:32:27 +02:00
Piotr Jastrzebski	570fc5afed	Use row_cache::make_flat_reader in column_family::make_reader Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <ba1659ceed8676f45942ce6e7506158026947345.1513687259.git.piotr@scylladb.com>	2017-12-19 14:42:32 +02:00
Raphael S. Carvalho	eff62bc61e	compaction_manager: serialize compaction of same size tier for different cfs Currently, compaction manager will serialize compaction of same size tier (or weight) if they belong to the same column family. However, it fails to do so if the compaction jobs belong to different column families. That can lead to an ungodly amount of running compaction which gets worse the higher the number of shards and active column families. The problem is that it may affect overall system performance due to excessive resource usage. It's easy to trigger it during bootstrapping after loading node with new sstables or repairing, or if lots of cfs are being actively written. That being said, compaction jobs of same size tier are now serialized on a given shard, such that maximum number of compaction (system wise) is now: (SHARDS) * (SIZE TIERS) instead of: (SHARDS) * (COLUMN FAMILIES) * (SIZE TIERS) We'll work hard to release a size tier (weight) for a column family waiting on it as fast as possible, given that we wouldn't like to underutilize resources available for compaction. We want one starting after the other. Compaction for a column family that cannot run now because the size tier is taken, will be postponed. There's a worker that will be sleeping on a condition variable that will be signalled whenever a compaction completes. FIFO ordering is used on postponed list for fairness. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-17 17:42:48 -02:00
Raphael S. Carvalho	49f3cfe746	sstables: improve compact_sstables() interface Motivation is that a new field in the descriptor will be forwarded to compaction procedure without extending parameter list even more. Also beautifies the interface, making it concise and easier to play with. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-17 17:22:19 -02:00
Paweł Dziepak	8e0da776ab	db: convert single_key_sstable_reader to flat streams Before flat mutation readers sstable::read_row() returned a future<streamed_mutation>. That required a helper reader that would wait for the streamed_mutations from all relevant sstables to be created and then construct a mutation merger. With flat mutation readers sstable::read_row_flat() returns a flat_mutation_reader (no futures) so that the code can be simplified by collecting all the relevant readers and creating a combined reader without suspension points. The unfortunate disadvantage of the flat_mutation_reader-based approach is the fact that combined reader now needlessly compares the partition keys even though we know that we read only a single partition, but optimising that is out of scope of this patch.	2017-12-13 12:01:03 +00:00
Paweł Dziepak	24026a0c7d	db: fully convert incremental_reader_selector to flat readers	2017-12-13 12:01:03 +00:00
Paweł Dziepak	73b3d02cc0	db: make make_range_sstable_reader() return flat reader	2017-12-13 12:01:03 +00:00
Paweł Dziepak	8b3c3fc832	db: make column_family::make_reader() return flat reader	2017-12-13 12:01:03 +00:00
Paweł Dziepak	e12959616c	db: make column_family::make_sstable_reader() return a flat reader	2017-12-13 12:01:03 +00:00
Paweł Dziepak	a0a13ceb46	filtering_reader: switch to flat mutation fragment streams	2017-12-13 12:01:03 +00:00
Paweł Dziepak	3bbb3b300d	filtering_reader: pass a const dht::decorated_key& to the callback All users of the filtering reader need only the decorated key of a partition, but currently the predicate is given a reference to streamed_mutations which are obsolete now.	2017-12-13 11:57:27 +00:00
Paweł Dziepak	f3901eb154	db: use make_restricted_flat_reader	2017-12-13 10:46:41 +00:00
Glauber Costa	1aabbc75ab	database: delete created SSTables if streaming writes fail We have had an issue recently where failed SSTable writes left the generated SSTables dangling in a potentially invalid state. If the write had, for instance, started and generated tmp TOCs but not finished, those files would be left for dead. We had fixed this in commit `b7e1575ad4`, but streaming memtables still have the same issue. Note that we can't fix this in the common function write_memtable_to_sstable because different flushers have different retry policies. Fixes #3062 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20171213011741.8156-1-glauber@scylladb.com>	2017-12-13 10:09:20 +02:00
Avi Kivity	d934ca55a7	Merge "SSTable resharding fixes" from Raphael "Didn't affect any release. Regression introduced in `301358e`. Fixes #3041" * 'resharding_fix_v4' of github.com:raphaelsc/scylla: tests: add sstable resharding test to test.py tests: fix sstable resharding test sstables: Fix resharding by not filtering out mutation that belongs to other shard db: introduce make_range_sstable_reader rename make_range_sstable_reader to make_local_shard_sstable_reader db: extract sstable reader creation from incremental_reader_selector db: reuse make_range_sstable_reader in make_sstable_reader	2017-12-07 16:42:48 +02:00
Raphael S. Carvalho	f1b65a115a	db: introduce make_range_sstable_reader introduce reader variant that will allow its caller to read a range in a given table without any filter applied. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-07 03:15:26 -02:00
Raphael S. Carvalho	d1b146baa6	rename make_range_sstable_reader to make_local_shard_sstable_reader Tomek says: "I think that the least surprising behavior for a function named like this is to read the sstables unfiltered (it just reads them), and the filtering should be indicated specially in the name or by accepting a parameter." Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-07 03:15:25 -02:00
Raphael S. Carvalho	3d725d6823	db: extract sstable reader creation from incremental_reader_selector step closer to divorcing incremental_selector from sstables Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-07 01:53:16 -02:00
Raphael S. Carvalho	ab82bacddd	db: reuse make_range_sstable_reader in make_sstable_reader Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-07 01:53:14 -02:00
Raphael S. Carvalho	1d0e6496ec	gc_clock: introduce operator<<(ostream&, gc_clock::time_point) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-06 19:52:32 -02:00
Paweł Dziepak	ce9a890940	incremental_reader_selector: do not use read_range_rows()	2017-12-05 14:53:14 +00:00
Paweł Dziepak	bccca90207	database: use read_row_flat() instead of read_row()	2017-12-05 14:52:57 +00:00
Botond Dénes	8731c1bc66	Flatten the implementation of combined_mutation_reader In fact flatten mutation_reader_merger and adjust combined_mutation_reader accordingly.	2017-12-04 07:57:43 +02:00
Botond Dénes	3f8110b5b6	Make combined_mutation_reader a flat_mutation_reader For now only the interface is converted; behind the scenes the previous implementation remains, and its output is simply converted by flat_mutation_reader_from_mutation_reader. The implementation will be converted in the following patches.	2017-12-04 07:57:43 +02:00
Tomasz Grabiec	fd7ab5fe99	database: Move operator<<() overloads to appropriate source files	2017-12-01 10:52:37 +01:00
Paweł Dziepak	32eb6437fd	memtable: make make_flush_reader() return flat_mutation_reader	2017-11-27 20:07:22 +01:00
Paweł Dziepak	11b32276e6	sstables: switch write_components() to flat_mutation_reader	2017-11-23 18:14:31 +00:00
Piotr Jastrzebski	6cd4b6b09c	Remove sstable_range_wrapping_reader The wrapper is no longer needed because read_range_rows returns ::mutation_reader instead of sstables::mutation_reader and the reader returned from it keeps the pointer to shared_sstable that was used to create the reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-15 10:40:02 +01:00
Paweł Dziepak	dca93bea23	db: convert make_streaming_reader() to flat_mutation_reader	2017-11-13 16:49:52 +00:00
Paweł Dziepak	37640f223b	db: drop single-range make_streaming_reader()	2017-11-13 16:49:52 +00:00
Glauber Costa	a6b2226562	dirty_memory_manager: block if we hit the real dirty limit Since we started accounting virtual dirty memory we no longer have a cap on real dirty memory. In most situations that is not needed, since real dirty will just be at most twice as much as virtual dirty (current flushing memtable plus new memtable). However, due to things like cache updates and component flushing we can end up having a lot of memtables that are virtually freed but not yet fully released, leading real dirty memory to explode, using all the box's memory. This patch adds a cap on real dirty memory as well. Because of the hierarchical nature of region_group, if the parent blocks due to memory depletion, so will the child (virtual dirty region group). A next step is to add a controller that will increase the priority of the tasks involved in releasing real dirty memory if we get dangerously close to the threshold. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-11-08 16:21:44 -05:00
Avi Kivity	d6cd44a725	Revert "Merge 'Single key sstable reader optimization' from Botond" This reverts commit `5e9cd128ad`, reversing changes made to `1f4e6759a7`. Tomek found some serious issues.	2017-10-19 12:47:21 +03:00
Botond Dénes	dfe312ca3a	Add counters for the single-key reader optimization Add two counters, one to determine how many of the reads fall into the optimization, and a second one to determine its effectiveness. The first one is single_key_reader_optimization_hit_rate. It contains the rate of reads that the optimization applies to out of all the reads that go into the single_key_sstable_reader. The second one, single_key_reader_optimization_extra_read_proportion, is a histogram of the efficiency of the optimization. It contains the proportion of extra data-sources read. It's a number between 0 and 1, where 0 is the best case (only one data-source was read) and 1 is the worst case (all data-sources were read eventually). This is the same number that is used for the threshold option (see previous patch). Each of the histogram's buckets covers a chunk of 0.1 from the [0, 1] range. Note that single_key_parallel_scan_threshold effectively provides an upper bound for the proportion as the optimization is turned off as soon as it goes above that number. The counters are disabled if single_key_parallel_scan_threshold is set to 0, disabling the optimization entirely.	2017-10-18 17:24:03 +03:00
Botond Dénes	08502f2d48	Add single_key_parallel_scan_threshold option This option regulates when exactly the single-key optimization is considered ineffective and turned off. The threshold is the proportion of the extra data source candidates that can be read before the optimization is considered ineffective and disabled. The proportion is calculated as follows: (read_data_sources - 1) / (total_data_sources - 1) We subtract 1 from the read_data_sources and total_data_sources to effectively measure the rate of extra data sources we read. This makes sure that the proportion is meaningful even if e.g. we only have a total of 2 data-sources and we read only 1 (best case). Whenever this number goes above the threshold the optimization is disabled. The threshold is a number between 0 and 1; 0 forces the optimization off and 1 forces it on. Increase the threshold to favor throughput over latency for single-row reads, decrease the threshold to improve latency at the expense of throughput. If the threshold is > 0 (it's not force disabled) and the optimization is disabled due to a read crossing the threshold, we will issue "probing" reads (every 100th read) to determine if the optimization is worth re-enabling. Probing reads are allowed to run through the optimization path and if they go below the threshold the optimization is re-enabled.	2017-10-18 17:24:03 +03:00
Botond Dénes	3c1fa3ecc1	single_key_sstable_reader: optimize single-row queries For single-row queries that only query atomic cells one can put a lower bound on the timestamps which may affect the query results and thus rule out entire data sources. This allows the query to read only those sstables that actually contribute to the result. To do this we incrementally move through the sstables overlapping with the query range, checking after each read mutation whether we already have a value for all required cells and whether the lower-bound of their timestamps is higher than the upper-bound of the timestamps of all the remaining data-sources. When this condition is met we terminate the read.	2017-10-18 17:24:03 +03:00
Botond Dénes	5fc44c4307	single_key_sstable_reader: move reading code into its own method	2017-10-18 17:24:03 +03:00
Paweł Dziepak	c28e31eac4	database: fix build (auto shards&)	2017-10-18 13:10:00 +01:00
Duarte Nunes	446e5f53db	database: Avoid superfluous shards_for_this_sstable vector copies Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20171018112643.40411-1-duarte@scylladb.com>	2017-10-18 15:00:52 +03:00
Duarte Nunes	baeec0935f	Replace query::full_slice with schema::full_slice() query::full_slice doesn't select any regular or static columns, which is at odds with the expectations of its users. This patch replaces it with the schema::full_slice() version. Refs #2885 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>	2017-10-17 11:25:53 +02:00
Tomasz Grabiec	6d5a0f8a98	db: Add debug-level logging related to streaming Message-Id: <1505896395-30203-1-git-send-email-tgrabiec@scylladb.com>	2017-10-16 18:49:10 +01:00
Raphael S. Carvalho	16dd0d15fc	sstables: make get_shards_for_this_sstable return const ref Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20171012072850.12681-1-raphaelsc@scylladb.com>	2017-10-12 11:58:23 +02:00
Duarte Nunes	bb89b97cbb	cache_hit_rate: Avoid copies in get_hit_rate() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
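
The proportion described in commit 08502f2d48 can be illustrated with a small sketch. This is not the ScyllaDB implementation: the function names below are hypothetical, and the handling of a single data source (returning 0) is an assumption, since the commit message does not say what happens when total_data_sources is 1.

```python
def extra_read_proportion(read_data_sources: int, total_data_sources: int) -> float:
    """Proportion of extra data sources read, per the formula in the commit message:
    (read_data_sources - 1) / (total_data_sources - 1)."""
    if total_data_sources <= 1:
        # Assumption: with a single data source there is nothing "extra" to read.
        return 0.0
    return (read_data_sources - 1) / (total_data_sources - 1)


def optimization_stays_enabled(read_data_sources: int,
                               total_data_sources: int,
                               threshold: float) -> bool:
    """True while the proportion has not gone above the threshold.
    A threshold of 0 forces the optimization off, as the commit message states."""
    if threshold <= 0:
        return False
    return extra_read_proportion(read_data_sources, total_data_sources) <= threshold


if __name__ == "__main__":
    # Illustrative numbers only: 3 of 5 data sources were read.
    print(extra_read_proportion(3, 5))
    print(optimization_stays_enabled(3, 5, threshold=0.4))
```

With a threshold of 1 the proportion can never go above it, which matches the statement that 1 forces the optimization on.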


958 Commits
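
The de-overlapping problem described in merge 8795238869 can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not ScyllaDB code: range tombstones are modelled as half-open integer ranges `(start, end)` without timestamps, and both function names are hypothetical.

```python
def deoverlap_same_start(a, b):
    """Split two half-open ranges that share a lower bound into
    non-overlapping pieces, so each piece has a distinct start position."""
    assert a[0] == b[0], "this toy model only covers ranges with the same start"
    shorter, longer = sorted([a, b], key=lambda r: r[1])
    if shorter[1] == longer[1]:
        return [shorter]
    # The tail of the longer range now starts where the shorter one ends.
    return [shorter, (shorter[1], longer[1])]


def starts_within(range_tombstone, slice_upper_bound):
    """True if the tombstone starts before the upper bound of the queried slice."""
    return range_tombstone[0] < slice_upper_bound


if __name__ == "__main__":
    pieces = deoverlap_same_start((1, 10), (1, 20))
    print(pieces)
    # For a slice whose upper bound is 5, the second piece starts at 10 and so
    # appears out of range, although the original (1, 20) tombstone did not.
    print([starts_within(p, 5) for p in pieces])
    print(starts_within((1, 20), 5))
```

Allowing both tombstones to keep position 1 (solution 2 in the merge message) avoids producing the out-of-range piece in the first place.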