scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-31 20:16:43 +00:00

Author	SHA1	Message	Date
Raphael S. Carvalho	405e41e9a8	database: export column family dir Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:19 -03:00
Raphael S. Carvalho	2b774c5bc3	database: inform if column family has shared tables That's gonna be useful to quickly determine if it's worth resharding a column family. Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:17 -03:00
Raphael S. Carvalho	2d119287b7	sstables: add method to export ancestors Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:16 -03:00
Raphael S. Carvalho	f2f8a2f5c7	lcs: implement get_level_count Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:14 -03:00
Raphael S. Carvalho	585596cede	compaction_manager: introduce method to check if manager stopped Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:12 -03:00
Raphael S. Carvalho	d82a8dfae0	lcs: restore invariant instead of sending overlapping sst to L0 A large token span sstable may find its way into high level due to resharding, which means the strategy invariant is broken. The invariant is restored by compacting first set of overlapping sstables, meaning that the restoration is done incrementally for multiple overlapping sets. Invariant is restored by regular compaction after resharding puts new unshared sstables into their original level, where level > 0. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:09 -03:00
Raphael S. Carvalho	0127309820	sstables: extend compaction for new resharding Extends compaction for new resharding algorithm. Not wired yet. New resharding will compact shared sstable(s) and create one sstable for each owner. It's up to the caller to open these new unshared sstables at their respective column families. This new approach will save a lot of bandwidth because we'll no longer read the entire shared sstable #smp::count times. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:08 -03:00
Raphael S. Carvalho	758bc38e7a	sstables: allow shard A to correctly create sstable for shard B That's possible by shard A explicitly saying that sstable is created for shard B. If we don't do that, sharding metadata isn't correct, and consequently sstable will report wrong owners. We'll need this for resharding which will create sstables for all shards that own the shared sstable. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:06 -03:00
Raphael S. Carvalho	2a437ab427	compaction: rework compacting_sstable_writer to work with multiple writers compacting_sstable_writer only allowed one writer so far, but we will need multiple ones for resharding. It's done by moving writer management to compaction. finish_sstable_writer() is added for compaction impl to stop all writers, whereas stop_sstable_writer() will only stop current writer (needed when current sstable reaches max limit size for example). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:05 -03:00
Raphael S. Carvalho	a35a3a9647	compaction: prepare compacting_sstable_writer to work with writers No need for compacting_sstable_writer to store items that are available in compaction class. Also, that's a step towards supporting multiple writers for compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:03 -03:00
Raphael S. Carvalho	38ed83e2f7	sstables: rework compaction to make it easy to extend compact_sstables() supported both regular and cleanup compaction, but with lots of conditions that made it ugly and hard to extend. In the future, we want to introduce a new type of compaction for resharding that will create one sstable for every shard owning the sstable(s) given as input. That will be easier now. Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:02 -03:00
Avi Kivity	fdcf64520d	Merge seastar upstream * seastar 2eec212...194d80f (4): > removing the collectd tests > fix fstream metrics reporting. > do_for_each: Make it check for need preempt > core/sharded: introduce copy method to foreign_ptr	2017-04-21 22:14:01 +03:00
Avi Kivity	fccbf2c51f	Merge "Reduce memory reclamation latency" from Tomasz "Currently eviction is performed until occupancy of the whole region drops below the 85% threshold. This may take a while if region had high occupancy and is large. We could improve the situation by only evicting until occupancy of the sparsest segment drops below the threshold, as is done by this change. I tested this using a c-s read workload in which the condition triggers in the cache region, with 1G per shard: lsa-timing - Reclamation cycle took 12.934 us. lsa-timing - Reclamation cycle took 47.771 us. lsa-timing - Reclamation cycle took 125.946 us. lsa-timing - Reclamation cycle took 144356 us. lsa-timing - Reclamation cycle took 655.765 us. lsa-timing - Reclamation cycle took 693.418 us. lsa-timing - Reclamation cycle took 509.869 us. lsa-timing - Reclamation cycle took 1139.15 us. The 144ms pause is when large eviction is necessary. Statistics for reclamation pauses for a read workload over larger-than-memory data set: Before: avg = 865.796362 stdev = 10253.498038 min = 93.891000 max = 264078.000000 sum = 574022.988000 samples = 663 After: avg = 513.685650 stdev = 275.270157 min = 212.286000 max = 1089.670000 sum = 340573.586000 samples = 663 Refs #1634." * tag 'tgrabiec/lsa-reduce-reclaim-latency-v3' of github.com:cloudius-systems/seastar-dev: lsa: Reduce reclamation latency tests: Add test for log_histogram log_histogram: Allow non-power-of-two minimum values lsa: Use regular compaction threshold in on-idle compaction tests: row_cache_test: Induce update failure more reliably lsa: Add getter for region's eviction function	2017-04-21 17:47:06 +03:00
Tomasz Grabiec	20f4c9bf23	lsa: Reduce reclamation latency Currently eviction is performed until occupancy of the whole region drops below the 85% threshold. This may take a while if region had high occupancy and is large. We could improve the situation by only evicting until occupancy of the sparsest segment drops below the threshold, as is done by this change. I tested this using a c-s read workload in which the condition triggers in the cache region, with 1G per shard: lsa-timing - Reclamation cycle took 12.934 us. lsa-timing - Reclamation cycle took 47.771 us. lsa-timing - Reclamation cycle took 125.946 us. lsa-timing - Reclamation cycle took 144356 us. lsa-timing - Reclamation cycle took 655.765 us. lsa-timing - Reclamation cycle took 693.418 us. lsa-timing - Reclamation cycle took 509.869 us. lsa-timing - Reclamation cycle took 1139.15 us. The 144ms pause is when large eviction is necessary. Statistics for reclamation pauses for a read workload over larger-than-memory data set: Before: avg = 865.796362 stdev = 10253.498038 min = 93.891000 max = 264078.000000 sum = 574022.988000 samples = 663 After: avg = 513.685650 stdev = 275.270157 min = 212.286000 max = 1089.670000 sum = 340573.586000 samples = 663 Refs #1634. Message-Id: <1484730859-11969-1-git-send-email-tgrabiec@scylladb.com>	2017-04-21 12:52:31 +02:00
Tomasz Grabiec	4313641c03	tests: Add test for log_histogram	2017-04-21 12:52:31 +02:00
Tomasz Grabiec	c83768d6bb	log_histogram: Allow non-power-of-two minimum values We will want to reuse the min_size mechanism for the whole compaction threshold, including the occupancy threshold. That threshold is close to the segment size and we cannot pick a power of two which would be close enough to what we need. Therefore, change log_histogram to support arbitrary minimum base. bucket_of() was moved into log_histogram_options so that it can be used in number_of_buckets(), which makes for a simple and much less error-prone implementation.	2017-04-21 10:54:50 +02:00
Tomasz Grabiec	7a800c54bf	lsa: Use regular compaction threshold in on-idle compaction Idle-time compaction should not produce not-compactible segments because that means we would have to evict a lot when we finally need to reclaim some memory, so that occupancy falls below the regular compaction threshold. This may cause latency spikes. Refs #1634.	2017-04-20 15:00:15 +02:00
Tomasz Grabiec	e054ccc037	tests: row_cache_test: Induce update failure more reliably After changing the region evictability condition to be less strict, cache update stopped failing because reclamation was able to compact a dense region. Induce failure by installing an evictor which refuses to evict from cache beyond a few elements.	2017-04-20 14:51:47 +02:00
Tomasz Grabiec	7aa286439f	lsa: Add getter for region's eviction function	2017-04-20 14:51:42 +02:00
Vlad Zolotarov	9c1d803157	fix_system_distributed_tables.py: add --node and --port parameters Allow giving a non-default IP address and a port to connect to the cluster. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1491316458-18420-1-git-send-email-vladz@scylladb.com>	2017-04-20 14:49:26 +03:00
Avi Kivity	68f0df12ee	Merge "Optimize reads with clustering restrictions" from Tomasz "This series makes several optimizations to sstable mutation reader relevant for large partitions. Some highlights: One optimization is to use the index for skipping across clustering restrictions. Currently we read the whole partition in such cases. That includes the case when we need to read a static row and then jump to some clustering row in the middle of the partition. Another case is having more than one clustering restriction, e.g. selecting multiple single rows from the same partition. Another optimization is using information from the index for creation of streamed_mutation. That can save us the cost of reading the partition header from the data file in case we would not continue reading, but skip to the middle of that partition. Or we may not even attempt to read anything from that partition, if after we determine the key that reader will be put behind other readers, which will exhaust the query limit first. Another optimization is switching single-partition queries to use the index_reader infrastructure. Index lookups via index_reader are faster than find_disk_ranges(). This is also a cleanup, a step towards converting all code to use the index_reader."
* tag 'tgrabiec/optimize-sstable-reads-with-restrictions-v2' of github.com:cloudius-systems/seastar-dev: (44 commits) sstables: Remove unused code sstables: mutation_reader: Use index_reader::advance_to_next_partition() to skip to next partition sstables: mutation_reader: Use index_reader for single-partition reads sstables: mutation_reader: Add trace-level logging sstables: mutation_reader: Move partition reading code to sstable_data_source sstables: mutation_reader: Move definitions out of the class body sstables: Move binary_search() to a header database: Pass partition_range to single_key_sstable_reader to avoid copies and decorating sstables: index_reader: Introduce advance_to_next_partition() sstables: index_reader: Introduce advance_and_check_if_present() sstables: index_reader: Introduce advance_past() sstables: index_reader: Make copyable sstables: index_reader: Optimize advancing to extreme positions sstables: index_reader: Keep two last pages alive dht: ring_position_view: Add key getter dht: ring_position_view: Add constructor and factory from ring_position_view sstables: mutation_reader: Advance to next partition using index in some cases sstables: index_reader: Expose access to partition key and tombstone sstables: index_reader: Introduce promoted_index_view sstables: mutation_reader: Move _index_in_current to sstable_data_source ...	2017-04-20 13:58:37 +03:00
Tomasz Grabiec	3472a74de4	sstables: Remove unused code	2017-04-20 11:23:05 +02:00
Tomasz Grabiec	c1059ca8e4	sstables: mutation_reader: Use index_reader::advance_to_next_partition() to skip to next partition It's cheaper than a key-based lookup, so use it when we can.	2017-04-20 11:23:05 +02:00
Tomasz Grabiec	4742008b70	sstables: mutation_reader: Use index_reader for single-partition reads This switches single-partition query to use the index_reader infrastructure. Index lookups via index_reader are faster than find_disk_ranges(). perf_fast_forward, rows: 1000000, value size: 100 Before: Testing forwarding with clustering restriction in a large partition: pk-scan time [s] frags frag/s aio [KiB] blocked dropped idx hit idx miss idx blk cpu no 0.002182 2 916 3 152 2 0 0 1 1 88.1% After: Testing forwarding with clustering restriction in a large partition: pk-scan time [s] frags frag/s aio [KiB] blocked dropped idx hit idx miss idx blk cpu no 0.000758 2 2639 3 152 2 0 0 1 1 48.6% This is also a cleanup, a step towards converting all code to use the index_reader.	2017-04-20 11:23:05 +02:00
Tomasz Grabiec	9d8795089d	sstables: mutation_reader: Add trace-level logging	2017-04-20 11:18:55 +02:00
Tomasz Grabiec	b198c31c46	sstables: mutation_reader: Move partition reading code to sstable_data_source It will be reused for read_row(), which doesn't create a mutation_reader instance, only sstable_data_source.	2017-04-20 11:18:26 +02:00
Tomasz Grabiec	6e4bca0be6	sstables: mutation_reader: Move definitions out of the class body To make further refactoring easier to review. No functional changes here.	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	4ed7e529db	sstables: Move binary_search() to a header There are instantiations of binary_search() used in sstables.cc, but defined in partition.cc. The instantiations are explicitly declared in partition.cc, but the types changed and they became obsolete. The thing worked because partition.cc also instantiated it with the right type. But after that code will be removed, it no longer would, and we would get a linker error. To avoid such problems, define binary_search() in a header.	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	bedd0ab6f9	database: Pass partition_range to single_key_sstable_reader to avoid copies and decorating	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	0b5ba13230	sstables: index_reader: Introduce advance_to_next_partition()	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	4b81844d2e	sstables: index_reader: Introduce advance_and_check_if_present()	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	b92f095bf0	sstables: index_reader: Introduce advance_past()	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	6780756258	sstables: index_reader: Make copyable	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	7db83fa3fe	sstables: index_reader: Optimize advancing to extreme positions	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	f66443c01c	sstables: index_reader: Keep two last pages alive The idea behind caching is that when we have two index readers where one is catching up with the other, each page will be read only once. Currently that's not always the case. There is a case when advance_to() may need to read two pages. That's when the target position is not found in the first page as determined by the summary index. The second reader which catches up would have to read the first page as well, but it would not be in cache any more. To avoid this extra I/O let's keep a reference to the two last pages touched by the index.	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	c7b9c5dfd3	dht: ring_position_view: Add key getter	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	5b71e0b9ab	dht: ring_position_view: Add constructor and factory from ring_position_view	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	3e8795494e	sstables: mutation_reader: Advance to next partition using index in some cases To produce a streamed_mutation for the next partition, we need to read its key and the tombstone. Currently we always do that by consuming the partition header from the data file. In some cases that may cause unnecessary IO. It's better to obtain partition information from the index if we already have it. We can save on IO if the user will skip past the front of partition immediately after. It is also better to pay the cost of reading the index if we know that we will need to use the index anyway soon. This patch predicts that by checking if there are any clustering restrictions. If there are any, we will almost surely need_skip() and use the index anyway. This change also lays the ground for unification of multi and single partition queries without loss of performance.	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	e35fe7492c	sstables: index_reader: Expose access to partition key and tombstone	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	ae72c159b1	sstables: index_reader: Introduce promoted_index_view So that we have a nice way of extracting the tombstone out of it. We do not always need a fully parsed index.	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	0ef33b7f29	sstables: mutation_reader: Move _index_in_current to sstable_data_source sstable_data_source holds a shared state between mutation_reader and streamed_mutation for sstables. The information whether index is in current partition will have to be accessed by both in the following patches.	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	885f53d905	sstables: mutation_reader: Avoid resetting the walker Before the change, the following scenario was happening: 1) we try to skip based on clustering restrictions 2) we find the page and fast forward to it, recording walker's lower bound counter 3) we read the first fragment, it's not a tombstone, so we reset the walker, and its lower bound counter too 4) the fragment is not in range (the range starts in the middle of the page) 5) needs_skip() is true, we redo the index lookup, which wastes some CPU This change fixes the problem by avoiding resetting the walker. We can do that because leading tombstones are checked with a non-mutable contains_tombstone()	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	bf21aa3a1f	clustering_ranges_walker: Introduce contains_tombstone()	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	b030ce693d	sstables: mutation_reader: Don't try to read index to skip to static row Static row is always at the beginning, there's no point in doing index lookups.	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	3e060659f1	sstables: mutation_reader: Don't try to read static row if table doesn't have any	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	b1860a8a24	clustering_ranges_walker: Allow excluding the static row	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	77d3e30239	sstables: mutation_reader: Use index to skip across clustering restrictions Improves scans with clustering restrictions. Before the change such scans would scan whole partition. Below are results of a test case from perf_fast_forward which selects few rows from a large partition using query restrictions (not fast forwarding). Before: stride rows time [s] frags frag/s aio [KiB] blocked dropped idx hit idx miss idx blk cpu 1000000 1 0.000609 1 1642 3 152 2 1 0 1 1 38.0% 500000 2 0.242255 2 8 511 64152 398 4 0 1 1 98.6% 250000 4 0.281592 4 14 749 95832 564 4 0 1 1 98.4% 125000 8 0.328056 8 24 873 111704 657 4 0 1 1 98.4% 62500 16 0.306700 16 52 935 119640 751 4 0 1 1 99.4% After: stride rows time [s] frags frag/s aio [KiB] blocked dropped idx hit idx miss idx blk cpu 1000000 1 0.000711 1 1406 3 152 2 1 0 1 1 42.1% 500000 2 0.000910 2 2197 5 216 3 2 0 1 1 39.2% 250000 4 0.001384 4 2891 9 344 5 4 0 1 1 35.3% 125000 8 0.003197 8 2502 21 728 13 8 0 1 1 53.1% 62500 16 0.006664 16 2401 41 1368 25 16 0 1 1 58.2%	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	05a1f92cbc	clustering_ranges_walker: Introduce lower_bound_change_counter() Allows detecting changes of lower_bound(). Result of advance_to() is not enough. When we get false from advance_to() twice in a row, lower bound may or may not have changed.	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	461f2af0a1	sstables: mutation_reader: Avoid index lookups when out of range	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	10c92d37d1	sstables: mutation_reader: Simplify fast_forward_to()	2017-04-20 10:54:37 +02:00
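
The log_histogram commit above (c83768d6bb) describes supporting an arbitrary, non-power-of-two minimum base and reusing bucket_of() inside number_of_buckets(). A minimal Python sketch of that idea follows. It is not Scylla's C++ implementation: the function signatures, the bucket numbering (bucket 0 for values below the minimum, then one bucket per doubling), and the example sizes are assumptions made for illustration.

```python
def bucket_of(value: int, min_size: int) -> int:
    """Hypothetical bucket index for a log-scale histogram.

    Bucket 0 holds values below min_size; bucket k (k >= 1) holds values in
    [min_size * 2**(k-1), min_size * 2**k). min_size may be any positive
    integer, not only a power of two.
    """
    if min_size <= 0 or value < 0:
        raise ValueError("min_size must be positive and value non-negative")
    # Integer arithmetic only: the bit length of the quotient counts doublings.
    return (value // min_size).bit_length()


def number_of_buckets(max_size: int, min_size: int) -> int:
    # Derived from bucket_of(), so both functions always agree on boundaries.
    return bucket_of(max_size, min_size) + 1


if __name__ == "__main__":
    min_size, max_size = 3000, 12000  # assumed example sizes
    for v in (2999, 3000, 5999, 6000, 12000):
        print(v, bucket_of(v, min_size))
    print("buckets:", number_of_buckets(max_size, min_size))
```

Deriving the bucket count from bucket_of() rather than from a separate formula is what keeps the two from drifting apart, which is the "less error-prone" property the commit message points at.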

11764 Commits