scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-28 20:27:03 +00:00

Author	SHA1	Message	Date
Raphael S. Carvalho	e7a6f3926a	sstable_set: introduce for_each_sstable() This new method is preferred over all() for iteration purposes, because all() may have to copy sstables into a temporary. For example, the all() implementation of the upcoming compound_sstable_set will have no choice but to merge all sstables from N managed sets into a temporary. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210311163009.42210-1-raphaelsc@scylladb.com>	2021-03-11 18:47:16 +02:00
Botond Dénes	361ba473c7	sstables: get rid of mp_row_consumer.{hh,cc} Move stuff contained therein to `sstable_mutation_reader.{hh,cc}` which will serve as the collection point of utility stuff needed by all reader implementations.	2021-03-11 12:17:13 +02:00
Botond Dénes	3ba782bddd	sstables: get rid of row.hh Move stuff contained therein to `sstable_mutation_reader.{hh,cc}` which will serve as the collection point of utility stuff needed by all reader implementations.	2021-03-11 12:17:13 +02:00
Botond Dénes	f5b0657fa5	sstables/mp_row_consumer.hh: remove unused struct new_mutation	2021-03-11 12:17:13 +02:00
Botond Dénes	cecc7f8064	sstables: move mx specific context and consumer to mx/reader.cc Move all the mx format specific context and consumer code to mx/reader.cc and add a factory function `mx::make_reader()` which takes over the job of instantiating the `sstable_mutation_reader` with the mx specific context and consumer.	2021-03-11 12:17:13 +02:00
Botond Dénes	4e3ae9d913	sstables: move kl specific context and consumer to kl/reader.cc Move all the kl format specific context and consumer code to kl/reader* and add a factory function `kl::make_reader()` which takes over the job of instantiating the `sstable_mutation_reader` with the kl specific context and consumer. Code which is used by tests is moved to kl/reader_impl.hh, while code that can be hidden is moved to kl/reader.cc. Users who just want to create a reader only have to include kl/reader.hh.	2021-03-11 12:17:13 +02:00
Botond Dénes	0ec040921d	sstables: mv partition.cc sstable_mutation_reader.hh The sstable reader currently knows the definition of all the different consumers and contexts. But it doesn't really need to, as it is a template. Exploit this and prepare for an organization scheme where the consumers and contexts live hidden in a cc file which includes and instantiates the sstable reader template. As a first step expose `sstable_mutation_reader` in a header.	2021-03-11 12:17:13 +02:00
Avi Kivity	5342d79461	Merge "Preparatory work in sstable_set for the upcoming compound_sstable_set_impl" from Raphael * 'preparatory_work_for_compound_set' of github.com:raphaelsc/scylla: sstable_set: move all() implementation into sstable_set_impl sstable_set: preparatory work to change sstable_set::all() api sstables: remove bag_sstable_set	2021-03-10 19:19:26 +02:00
Raphael S. Carvalho	c3b8757fa1	sstable_set: move all() implementation into sstable_set_impl The main motivation behind this is that by moving all() impl into sstable_set_impl, sstable_set no longer needs to maintain a list with all sstables, which in turn may disagree with the respective sstable_set_impl. This will be very important for compound_sstable_set_impl which will be built from existing sets, and will implement all() by combining the all() of its managed sets. Without this patch, we'd have to insert the same sstable at both compound set and also the set managed by it, to guarantee all() of compound set would return the correct data, which would be expensive and error prone. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-10 12:02:13 -03:00
Raphael S. Carvalho	05b07c7161	sstable_set: preparatory work to change sstable_set::all() api users of sstable_set::all() rely on the set itself keeping a reference to the returned list, so user can iterate through the list assuming that it is alive all the way through. this will change in the future though, because there will be a compound set impl which will have to merge the all() of multiple managed sets, and the result is a temporary value. so even range-based loops on all() have to keep a ref to the returned list, to avoid the list from being prematurely destroyed. so the following code for (auto& sst : sstable_set.all()) { ...} becomes for (auto sstables = sstable_set.all(); auto& sst : sstables) { ... } Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-10 12:02:12 -03:00
Avi Kivity	746798fd56	Merge "sstables: get rid of data_consume_context" from Botond " This class is basically a wrapper around a unique pointer and a few short convenience methods, but is otherwise a distraction in trying to untangle the maze that is the sstable reader class hierarchy. So this patchset folds it into its only real user: the sstable reader. " * 'data_consume_context_bye' of https://github.com/denesb/scylla: sstable: move data_consume_* factory methods to row.hh sstables: fold data_consume_context: into its users sstables: partition.cc: remove data_consume_* forward declarations	2021-03-10 16:45:32 +02:00
Botond Dénes	1aa2424dcf	sstable: move data_consume_* factory methods to row.hh	2021-03-10 15:40:50 +02:00
Botond Dénes	a06465a8f3	sstables: fold data_consume_context: into its users `data_consume_context` is a thin wrapper over the real context object and it does little more than forward method calls to it. The few methods doing more than mere forwarding can be folded into its single real user: `sstable_reader`.	2021-03-10 15:38:58 +02:00
Botond Dénes	37eb547224	sstables: partition.cc: remove data_consume_* forward declarations They don't seem to serve any purpose, everything builds fine without them.	2021-03-10 15:23:54 +02:00
Raphael S. Carvalho	f7cc431477	compaction_manager: Fix use-after-free in rewrite_sstables() Use-after-free introduced by `2cf0c4bbf1`. That's because compacting is moved into then_wrapped() lambda, so it's potentially freed on the next iteration of repeat(). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210309232940.433490-1-raphaelsc@scylladb.com>	2021-03-10 13:18:38 +02:00
Raphael S. Carvalho	863b95aa34	sstables: remove bag_sstable_set bag_sstable_set can be replaced with partitioned_sstable_set, which will provide the same functionality, given that L0 sstables go to a "bag" rather than interval map. STCS, for example, will only have L0 sstables, so it will get exactly the same behavior with partitioned_sstable_set. it also gives us the benefit of keeping the leveled sstables in the interval map if user has switched from LCS to STCS, until they're all compacted into size-tiered ssts. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-09 08:39:48 -03:00
Raphael S. Carvalho	1226fc755f	compaction_manager: Increase cleanup compaction resilience when low on disk space In a scenario where node is running out of disk space, which is a common cause of cluster expansion, it's very important to clean up the smallest files first to increase the chances of success when the biggest files are reached down the road. That's possible given that cleanup operates on a single file at a time, and that the smaller the file the smaller the space requirement. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210303165520.55563-1-raphaelsc@scylladb.com>	2021-03-04 15:11:06 +02:00
Avi Kivity	5f4bf18387	Revert "Merge 'sstables: add versioning to the sstable_set ' from Wojciech Mitros" This reverts commit `31909515b3`, reversing changes made to `ef97adc72a`. It shows many serious regressions in dtest. Fixes #8197.	2021-03-02 13:21:22 +02:00
Raphael S. Carvalho	2cf0c4bbf1	compaction: Prevent cleanup and regular from compacting the same sstable Due to regression introduced by `463d0ab`, regular can compact in parallel a sstable being compacted by cleanup, scrub or upgrade. This redundancy causes resources to be wasted, write amplification is increased and so does the operation time, etc. That's a potential source of data resurrection because the now-owned data from a sstable being compacted by both cleanup and regular will still exist in the node afterwards, so resurrection can happen if node regains ownership. Fixes #8155. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210225172641.787022-1-raphaelsc@scylladb.com>	2021-03-01 20:34:35 +02:00
Avi Kivity	31909515b3	Merge 'sstables: add versioning to the sstable_set ' from Wojciech Mitros Currently, the sstable_set in a table is copied before every change to allow accessing the unchanged version by existing sstable readers. This patch changes the sstable_set to a structure that keeps all its versions that are referenced somewhere and provides a way of getting a reference to an immutable version of the set. Each sstable in the set is associated with the versions it is alive in, and is removed when all such versions don't have references anymore. To avoid copying, the object holding all sstables in the set version is changed to a new structure, sstable_list, which was previously an alias for std::unordered_set<shared_sstable>, and which implements most of the methods of an unordered_set, but its iterator uses the actual set with all sstables from all referenced versions and iterates over those sstables that belong to the captured version. The methods that modify the set's contents give strong exception guarantee by trying to insert new sstables to its containers, and erasing them in the case of a caught exception. To release shared_sstables as soon as possible (i.e. when all references to versions that contain them die), each time a version is removed, all sstables that were referenced exclusively by this version are erased. We are able to find these sstables efficiently by storing, for each version, all sstables that were added and erased in it, and, when a version is removed, merging it with the next one. When a version that adds an sstable gets merged with a version that removes it, this sstable is erased. 
Fixes #2622 Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com> Closes #8111 * github.com:scylladb/scylla: sstables: add test for checking the latency of updating the sstable_set in a table sstables: move column_family_test class from test/boost to test/lib sstables: use fast copying of the sstable_set instead of rebuilding it sstables: replace the sstable_set with a versioned structure sstables: remove potential ub sstables: make sstable_set constructor less error-prone	2021-03-01 14:16:36 +02:00
Botond Dénes	f0b284dab8	sstables: enable token monotonicity validation by default Partition key order violations in data written to sstables can be very disruptive. All our components in the storage layers assume that partitions are in order, which means that reading out-of-order partitions triggers undefined behaviour. Computer scientists often joke that undefined behaviour can erase your hard drive and in this case the damage done by undefined behaviour caused by out-of-order partitions is very close to that. The corruption is known to mutate, causing crashes, corrupting more data and even losing data. For this reason it is imperative that out-of-order partitions cannot get into sstables. This patch enables token monotonicity validation unconditionally in the sstable writer. As partition key monotonicity checks involve a key copy per partition, which might have an impact on the performance, we do the next best thing instead and enable only token monotonicity validation.	2021-03-01 07:49:23 +02:00
Botond Dénes	694f8a4ec6	mutation_fragment_stream_validating_filter: make validation levels more fine-grained Currently key order validation for the mutation fragment stream validating filter is all or nothing. Either no keys (partition or clustering) are validated or all of them. As we suspect that clustering key order validation would add a significant overhead, this discourages turning key validation on, which means we miss out on partition key monotonicity validation which has a much more moderate cost. This patch makes this configurable in a more fine-grained fashion, providing separate levels for partition and clustering key monotonicity validation. As the choice for the default validation level is not as clear-cut as before, the default value for the validation level is removed in the validating filter's constructor.	2021-03-01 07:49:23 +02:00
Botond Dénes	1d9b5911fe	time_series_sstable_set::create_single_key_sstable_reader(): fix use-after-free The optimal path of said method mistakenly captures `pos` (a local variable) in its reader factory method and passes a temporary range implicitly constructed from said `pos` as the range parameter to the sstable reader. This will lead to the sstable reader using a dangling range and will result in returning no result for queries. This patch fixes this bug and adds a unit test to cover this code path. Fixes #8138. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210226143111.104591-2-bdenes@scylladb.com>	2021-02-26 23:57:25 +02:00
Raphael S. Carvalho	7bf0744d36	reshape/TWCS: Fix off-by-one in threshold check A given time bucket should also be reshaped if its # of sstables has reached the threshold. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210223182634.570648-1-raphaelsc@scylladb.com>	2021-02-24 15:12:40 +02:00
Raphael S. Carvalho	21608bd677	sstables: Fix TWCS reshape for windows with at least min_threshold sstables TWCS reshape was silently ignoring windows which contain at least min_threshold sstables (can happen with data segregation). When resizing candidates, size of multi_window was incorrectly used and it was always empty in this path, which means candidates was always cleared. Fixes #8147. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210224125322.637128-1-raphaelsc@scylladb.com>	2021-02-24 15:11:19 +02:00
Raphael S. Carvalho	81d773e5d8	compaction_manager: Redefine weight for better control of parallel compactions Compaction manager allows compaction of different weights to proceed in parallel. For example, a small-sized compaction job can happen in parallel to a large-sized one, but similar-sized jobs are serialized. The problem is the current definition of weight, which is the log (base 4) of total size (size of all sstables) of a job. This is what we get with the current weight definition:
weight=5 for sizes=[1K, 3K]
weight=6 for sizes=[4K, 15K]
weight=7 for sizes=[16K, 63K]
weight=8 for sizes=[64K, 255K]
weight=9 for sizes=[258K, 1019K]
weight=10 for sizes=[1M, 3M]
weight=11 for sizes=[4M, 15M]
weight=12 for sizes=[16M, 63M]
weight=13 for sizes=[64M, 254M]
weight=14 for sizes=[256M, 1022M]
weight=15 for sizes=[1033M, 4078M]
weight=16 for sizes=[4119M, 10188M]
total weights: 12
Note that for jobs smaller than 1MB, we have 5 different weights, meaning 5 jobs smaller than 1MB could proceed in parallel. High number of parallel compactions can be observed after repair, which potentially produces tons of small sstables of varying sizes. That causes compaction to use a significant amount of resources. To fix this problem, let's add a fixed tax to the size before taking the log, so that jobs smaller than 1M will all have the same weight. Look at what we get with the new weight definition:
weight=10 for sizes=[1K, 2M]
weight=11 for sizes=[3M, 14M]
weight=12 for sizes=[15M, 62M]
weight=13 for sizes=[63M, 254M]
weight=14 for sizes=[256M, 1022M]
weight=15 for sizes=[1033M, 4078M]
weight=16 for sizes=[4119M, 10188M]
total weights: 7
Fixes #8124. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210217123022.241724-1-raphaelsc@scylladb.com>	2021-02-22 15:50:29 +02:00
Avi Kivity	78d1afeabd	Merge "Use radix tree to store cells on a row" from Pavel E " Current storage of cells in a row is a union of vector and set. The vector holds 5 cell_and_hash's inline, up to 32 ones in the external storage and then it's switched to std::set. Once switched, the whole union becomes a waste of space, as its size is sizeof(vector head) + 5 * sizeof(cell and hash) = 90+ bytes and only 3 pointers from it are used (std::set header). Also the overhead to keep cell_and_hash as a set entry is more than the size of the structure itself. Column ids are 32-bit integers that most likely come sequentially. For this kind of a search key a radix tree (with some care for non-sequential cases) can be beneficial. This set introduces a compact radix tree, that uses 7-bit sub values from the search key to index on each node and compacts the nodes themselves for better memory usage. Then the row::_storage is replaced with the new tree. The most notable result is the memory footprint decrease, for wide rows by up to 2x. The performance of micro-benchmarks is a bit lower for small rows and (!) higher for longer (8+ cells). 
The numbers are in patch #12 (spoiler: they are better than for v2) v3: - trimmed size of radix down to 7 bits - simplified the nodes layouts, now there are 2 of them (was 4) - enhanced perf_mutation to test N-cells schema - added AVX intra-nodes search for medium-sized nodes - added .clone_from() method that helped to improve perf_mutation - minor - changed functions not to return values via refs-arguments - fixed nested classes to properly use language constructors - renamed index_to to key_t to distinguish from node_index_t - improved recurring variadic templates not to use sentinel argument - use standard concepts v2: - fixed potential mis-compilation due to strict-aliasing violation - added oracle test (radix tree is compared with std::map) - added radix to perf_collection - cosmetic changes (concepts, comments, names) A note on item 1 from v2 changelog. The nodes are no longer packed perfectly, each has grown 3 bytes. But it turned out that when used as cells container most of this growth drowned in lsa alignments. next todo: - aarch64 version of 16-keys node search tests: unit(dev), unit(debug for radix), pref(dev) " 'br-radix-tree-for-cells-3' of https://github.com/xemul/scylla: test/memory_footpring: Print radix tree node sizes row: Remove old storages row: Prepare row::equal for switch row: Prepare row::difference for switch row: Introduce radix tree storage type row-equal: Re-declare the cells_equal lambda test: Add tests for radix tree utils: Compact radix tree array-search: Add helpers to search for a byte in array test/perf_collection: Add callback to check the speed of clone test/perf_mutation: Add option to run with more than 1 columns test/perf_mutation: Prepare to have several regular columns test/perf_mutation: Use builder to build schema	2021-02-18 21:19:14 +02:00
Raphael S. Carvalho	5206a97915	compaction: Fix leak of expired sstable in the backlog tracker expired sstables are skipped in the compaction setup phase, because they don't need to be actually compacted, but rather only deleted at the end. that is causing such sstables to not be removed from the backlog tracker, meaning that backlog caused by expired sstables will not be removed even after their deletion, which means shares will be higher than needed, making compaction potentially more aggressive than it has to be. to fix this bug, let's manually register these sstables into the monitor, such that they'll be removed from the tracker once compaction completes. Fixes #6054. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210216203700.189362-1-raphaelsc@scylladb.com>	2021-02-18 11:12:00 +02:00
Botond Dénes	ba7a9d2ac3	imr: switch back to open-coded description of structures Commit `aab6b0ee27` introduced the controversial new IMR format, which relied on a very template-heavy infrastructure to generate serialization and deserialization code via template meta-programming. The promise was that this new format, beyond solving the problems the previous open-coded representation had (working on linearized buffers), will speed up migrating other components to this IMR format, as the IMR infrastructure reduces code bloat, makes the code more readable via declarative type descriptions as well as safer. However, the results were almost the opposite. The template meta-programming used by the IMR infrastructure proved very hard to understand. Developers don't want to read or modify it. Maintainers don't want to see it being used anywhere else. In short, nobody wants to touch it. This commit does a conceptual revert of `aab6b0ee27`. A verbatim revert is not possible because related code evolved a lot since the merge. Also, going back to the previous code would mean we regress as we'd revert the move to fragmented buffers. So this revert is only conceptual, it changes the underlying infrastructure back to the previous open-coded one, but keeps the fragmented buffers, as well as the interface of the related components (to the extent possible). Fixes: #5578	2021-02-16 23:43:07 +01:00
Pavel Emelyanov	1bdfa355ea	row: Remove old storages Now when the 3rd storage type (radix tree) is all in, old storage can be safely removed. The result is:
1. memory footprint
sizeof(class row): 112 => 16 bytes
sizeof(rows_entry): 126 => 120 bytes
the "in cache" value depends on the number of cells:
num of cells  master  patch
1             752     656
2             808     712
3             864     768
4             920     824
5             968     936
6             1136    992
...
16            1840    1672
17            1904    1992 (+88)
18            1976    2048 (+72)
19            2048    2104 (+56)
20            2120    2160 (+40)
21            2184    2208 (+24)
22            2256    2264 ( +8)
23            2328    2320
...
32            2960    2808
After 32 cells the storage switches into rbtree with 24-bytes per-cell overhead and the radix tree improvement rocketlaunches
64            7872    6056
128           15040   9512
256           29376   18568
2. perf_mutation test is enhanced by this series and the results differ depending on the number of columns used
tps value
--column-count  master  patch
1               59.9k   57.6k (-3.8%)
2               59.9k   57.5k
4               59.8k   57.6k
8               57.6k   57.7k <- eq
16              56.3k   57.6k
32              53.2k   57.4k (+7.9%)
A note on this. Last time 1-column test was ~5% worse which was explained by inline storage of 5 cells that's present on current implementation and was absent in radix tree. An attempt to make inline storage for small radix trees resulted in complete loss of memory footprint gain, but gave fraction of percent to perf_mutation performance. So this version doesn't have inline nodes. The 1.2% improvement from v2 surprisingly came from the tree::clone_from() which in v2 was work-around-ed by slow walk+emplace sequence while this version has the optimized API call for cloning. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 20:35:06 +03:00
Wojciech Mitros	aa0cd940d6	sstables: replace the sstable_set with a versioned structure Currently, the sstable_set in a table is copied before every change to allow accessing the unchanged version by existing sstable readers. This patch changes the sstable_set to a structure that allows copying without actually copying all the sstables in the set, while providing the same methods (and some extra) without majorly decreasing their speed. This is achieved by associating all copies with sstable_set versions which hold the changes that were performed in them, and references to the versions that were copied, a.k.a. their parents. The set represented by a version is the result of combining all changes of its ancestors. This causes most methods of the version to have a time complexity dependent on the number of its ancestors. To limit this number, versions that represent copies that have already been deleted are merged with their descendants. The strategy used for deciding when and with which of its children a version should be merged heavily depends on the use case of sstable_sets: there is a main copy of the set in a table class which undergoes many insertions and deletions, and there are copies of it in compaction or mutation readers which are further copied or edited few or zero times. It's worth mentioning that when a copy is made, the copied set should not be modified anymore, because it would also modify the results given by the copy. In order to still allow modifying the copied set, if a change is to be performed on it, the version associated with this set is replaced with a new version depending on the previous one. As we can see, in our use case there is a main chain of versions (with changes from the table), and smaller branches of versions that start from a version from this chain, but are deleted soon after. 
In such case we can merge a version when it has exactly one descendant, as this limits the number of concurrent ancestors of a version to the number of copies of its ancestors that are concurrently used. During each such merge, the parent version is removed and the child version is modified so that all operations on it give the same results. In order to preserve the same interface, the sstable_set still contains a lw_shared_ptr<sstable_list>, but sstable_list (previously an alias for unordered_set<shared_sstable>) is now a new structure. Each sstable_set contains a sstable_list but not every sstable_list has to be contained by a sstable_set, and we also want to allow fast copying of sstable_lists, so the reference to the sstable_set_version is kept by the sstable_lists and the sstable_set can access the sstable_set_version it's associated with through its sstable_list. Accessing sstables that are elements of a certain sstable_set copy (so the select, select_sstable_runs and sstable_list's iterator) get results from containers that hold all sstables from all versions (which are stored in a single, shared "versioned_sstable_set_data" structure), and then filter out those sstables that aren't present in the version in question. This version of the sstable_set allows adding and erasing the same sstable repeatedly. Inserting and erasing from the set modifies the containers in a version only when it has an actual effect: if an sstable has been added in the parent version, and hasn't been erased in the child version, adding it again will have no effect. This ensures that when merging versions, the versions have disjoint sets of added, and erased sstables (an sstable can still be added in one and erased in the second). 
It's worth noting that if an sstable has been added in one of the merged sets and erased in the second, the version that remains after merging doesn't need to have any info about the sstable's inclusion in the set - it can be inferred from the changes in previous versions (and it doesn't matter if the sstable has been erased before or after being added). To release pointers to sstables as soon as possible (i.e. when all references to versions that contain them die), if an sstable is added/erased in all child versions that are based on a version which has no external references, this change gets removed from these versions and added to the parent version. If an sstable's insertion gets overwritten as a result, we might be able to remove the sstable completely from the set. We know how many times this needs to happen by counting, for each sstable, in how many different versions it has been added. When a change that adds an sstable gets merged with a change that removes it, or when such a change simply gets deleted alongside its associated version, this count is reduced, and when an sstable gets added to a version that doesn't already contain it, this count is increased. The methods that modify the set's contents give strong exception guarantee by trying to insert new sstables to its containers, and erasing them in the case of a caught exception. Fixes #2622 Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-02-11 11:02:55 +01:00
Wojciech Mitros	e1b494633b	sstables: make sstable_set constructor less error-prone Adding an non-empty set of sstables as the set of all sstables in an sstable_set could cause inconsistencies with the values returned by select_sstable_runs because the _all_runs map would still be initialized empty. For similar reasons, the provided sstable_set_impl should also be empty. Dispel doubts by removing the unordered_set from the constructor, and adding a check of emptiness of the sstable_set_impl. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-02-11 11:02:55 +01:00
Avi Kivity	7f3083739f	Merge "sstables: Share partition index pages between readers" from Tomasz " Before this patch, each index reader had its own cache of partition index pages. Now there is a shared cache, owned by the sstable object. This allows concurrent reads to share partition index pages and thus reduce the amount of I/O. It used to be like that a few years ago, but we moved to per-reader cache to implement incremental promoted index parsing, to avoid OOMs with large partitions. At that time, the solution involved caching input streams inside partition index entries, which couldn't be reused between readers. This could have been solved differently. Instead of caching input streams, we can cache information needed to create them (temporary_buffer<>). This solution takes this approach. This series is also needed before we can implement promoted index caching. That's because before the promoted index can be shared by readers, the partition index entries, which hold the promoted index, must also be shareable. The pages live as long as there is at least one index reader referencing them. So it only helps when there is concurrent access. In the future we will keep them for longer and evict on memory pressure. Promoted index cursor is no longer created when the partition index entry is parsed, but it's created on-demand when the top-level cursor enters the partition. The promoted index cursor is owned by the top-level cursor, not by the partition index entry. Below are the results of an experiment performed on my laptop which demonstrates the improvement in performance. Load driver command line: ./scylla-bench \ -workload uniform \ -mode read \ --partition-count=10 \ -clustering-row-count=1 \ -concurrency 100 Scylla command line: scylla --developer-mode=1 -c1 -m1G --enable-cache=0 The workload is IO-bound. Before, we needed 2 I/O per read, now we need 1 (amortized). The throughput is ~70% higher. 
Before:
time  ops/s  rows/s  errors  max   99.9th  99th  95th  90th  median  mean
1s    4706   4706    0       35ms  30ms    27ms  25ms  24ms  21ms    21ms
2s    4646   4646    0       42ms  31ms    31ms  27ms  25ms  21ms    22ms
3.1s  4670   4670    0       40ms  27ms    26ms  25ms  25ms  21ms    21ms
4.1s  4581   4581    0       39ms  33ms    33ms  27ms  26ms  21ms    22ms
5.1s  4345   4345    0       40ms  37ms    35ms  32ms  31ms  21ms    23ms
6.1s  4328   4328    0       49ms  40ms    34ms  32ms  31ms  22ms    23ms
7.1s  4198   4198    0       45ms  36ms    35ms  31ms  30ms  22ms    24ms
8.2s  3913   3913    0       51ms  50ms    50ms  39ms  35ms  24ms    26ms
9.2s  4524   4524    0       34ms  31ms    30ms  28ms  27ms  21ms    22ms
After:
time  ops/s  rows/s  errors  max   99.9th  99th  95th  90th  median  mean
1s    7913   7913    0       25ms  25ms    20ms  15ms  14ms  12ms    13ms
2s    7913   7913    0       18ms  18ms    18ms  16ms  14ms  12ms    13ms
3s    8125   8125    0       20ms  20ms    17ms  15ms  14ms  12ms    12ms
4s    5609   5609    0       41ms  35ms    29ms  28ms  27ms  13ms    18ms
5.1s  8020   8020    0       18ms  17ms    17ms  15ms  14ms  12ms    13ms
6.1s  7102   7102    0       27ms  27ms    24ms  19ms  18ms  13ms    14ms
7.1s  5780   5780    0       26ms  26ms    26ms  23ms  22ms  17ms    18ms
8.1s  6530   6530    0       37ms  34ms    26ms  22ms  20ms  15ms    15ms
9.1s  7937   7937    0       19ms  19ms    17ms  17ms  16ms  12ms    13ms
Tests: - unit [release] - scylla-bench " * tag 'share-partition-index-v1' of github.com:tgrabiec/scylla: sstables: Share partition index pages between readers sstables: index_reader: Drop now unnecessary index_entry::close_pi_stream() sstables: index_reader: Do not store cluster index cursor inside partition indexes	2021-02-04 17:27:49 +02:00
Tomasz Grabiec	63188abb87	sstables: Share partition index pages between readers Before this patch, each index reader had its own cache of partition index pages. Now there is a shared cache, owned by the sstable object. This allows concurrent reads to share partition index pages and thus reduce the amount of I/O. This change is also needed before we can implement promoted index caching. That's because before the promoted index can be shared by readers, the partition index entries, which hold the promoted index, must also be shareable. The pages live as long as there is at least one index reader referencing them. So it only helps when there is concurrent access. In the future we will keep them for longer and evict on memory pressure. Promoted index cursor is no longer created when the partition index entry is parsed, but it's created on-demand when the top-level cursor enters the partition. The promoted index cursor is owned by the top-level cursor, not by the partition index entry.	2021-02-04 15:24:07 +01:00
Tomasz Grabiec	c232d71fc8	sstables: index_reader: Drop now unnecessary index_entry::close_pi_stream()	2021-02-04 15:24:07 +01:00
Tomasz Grabiec	5ed559c8c6	sstables: index_reader: Do not store cluster index cursor inside partition indexes Currently, the partition index page parser will create and store promoted index cursors for each entry. The assumption is that partition index pages are not shared by readers so each promoted index cursor will be used by a single index_reader (the top-level cursor). In order to be able to share partition index entries we must make the entries immutable and thus move the cursor outside. The promoted index cursor is now created and owned by each index_reader. There is at most one such active cursor per index_reader bound (lower/upper).	2021-02-04 15:23:55 +01:00
Benny Halevy	ba4b8dd6e5	sstables: row.hh: no need to include reader_concurrency_semaphore.hh Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210204113413.1027893-1-bhalevy@scylladb.com>	2021-02-04 13:42:06 +02:00
Tomasz Grabiec	7b17969a6e	Merge 'sstable: reader: preempt after every fragment' from Avi Kivity Whenever we push a fragment, we check whether the buffer is full and return proceed::no if so, so that the state machine pauses and lets the consumer continue. This patch adds an additional condition - if preemption is needed, we also return proceed::no. This drops us back to the outer loop (in sstable_mutation_reader::fill_buffer), which will yield to the reactor as part of seastar::do_until(). Two cases (partition_start and partition_end) did not have the check for is_buffer_full(); it is added now. This can trigger if the partition has no rows. Unlike the previous attempt, push_ready_fragments() is not touched. The extra preemption opportunities triggered a preexisting bug in clustering_ranges_walker; it is fixed in the first patch of the series. I tested this by reading from a large partition with a simple schema (pk int, ck int, primary key(pk, ck)) with BYPASS CACHE. However, even without the patch I only got sporadic stalls with the detector set to 1ms, so it's possible I'm not testing correctly. Test: unit (dev, debug, release) Fixes #7883. Closes #7928 * github.com:scylladb/scylla: sstable: reader: preempt after every fragment clustering_range_walker: fix false discontiguity detected after a static row	2021-02-02 12:21:58 +01:00
Avi Kivity	db4b9215dd	sstable: reader: preempt after every fragment Whenever we push a fragment, we check whether the buffer is full and return proceed::no if so, so that the state machine pauses and lets the consumer continue. This patch adds an additional condition - if preemption is needed, we also return proceed::no. This drops us back to the outer loop (in sstable_mutation_reader::fill_buffer), which will yield to the reactor as part of seastar::do_until(). Two cases (partition_start and partition_end) did not have the check for is_buffer_full(); it is added now. This can trigger if the partition has no rows. Unlike the previous attempt, push_ready_fragments() is not touched. I tested this by reading from a large partition with a simple schema (pk int, ck int, primary key(pk, ck)) with BYPASS CACHE. However, even without the patch I only got sporadic stalls with the detector set to 1ms, so it's possible I'm not testing correctly. Test: unit (dev) Fixes #7883.	2021-02-01 19:32:07 +02:00
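The control flow in this commit message (pause the state machine when the buffer is full or when preemption is needed) can be sketched as below. This is an illustration under stated assumptions, not the actual reader code: `fragment_consumer`, its `int` fragments, and the injected `need_preempt` callback are hypothetical, with the callback standing in for the reactor's preemption check.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

enum class proceed { no, yes };

// Hypothetical consumer: fragments are plain ints, and the preemption check
// is injected as a callback instead of being queried from a reactor.
class fragment_consumer {
    std::vector<int> _buffer;
    std::size_t _max_buffer_size;
    std::function<bool()> _need_preempt;
public:
    fragment_consumer(std::size_t max_buffer_size, std::function<bool()> need_preempt)
        : _max_buffer_size(max_buffer_size)
        , _need_preempt(std::move(need_preempt)) {}

    bool is_buffer_full() const { return _buffer.size() >= _max_buffer_size; }

    std::size_t buffered() const { return _buffer.size(); }

    // Push one fragment, then tell the parser whether it may continue.
    // Returning proceed::no hands control back to the outer loop, which is
    // where the caller would yield.
    proceed push(int fragment) {
        _buffer.push_back(fragment);
        if (is_buffer_full() || _need_preempt()) {
            return proceed::no;
        }
        return proceed::yes;
    }
};
```

The fragment is always stored before the check, so a `proceed::no` result never loses data; it only stops the parser from producing further fragments until the outer loop resumes it.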
Benny Halevy	4b309e0829	compaction: log sstable origin Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-01 16:45:52 +02:00
Benny Halevy	77328a936a	sstables: scylla_metadata: add support for sstable_origin Add new scylla_metadata_type::SSTableOrigin. Store and retrieve an sstring in the scylla metadata component. Pass sstable_writer_config::origin from the mx sstable writer and ignore it in the k_l writer. Add unit test to verify the sstable_origin extension using both an empty and a random string. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-01 16:45:52 +02:00
Benny Halevy	22f6023ac3	sstables: sstable_writer_config: add origin member Add a string describing where the sstables originated from (e.g. memtable, repair, streaming, compaction, etc.) If configure_writer is called with a nullptr, the origin will be equal to an empty string. Introduce test_env_sstables_manager that provides an overload of configure_writer with no parameters that calls the base-class' configure_writer with "test" origin. This was to reduce the code churn in this patch and to keep the tests simple. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-01 16:45:52 +02:00
Botond Dénes	6024ef5dad	sstable_mutation_reader: consolidate constructors The two remaining sstable constructors are very similar apart from the content of the initialize lambda. Speaking of which, the two remaining initializer lambdas can be easily merged into one too. So this patch does just that: it consolidates the two constructors into one, and merges the initializer lambdas and extracts them into a member method. This means we have to store the previously captured variables as members, but this is actually a good thing: when debugging we can see the range and slice the reader is reading, and we are not actually paying for it either -- they were already stored, just out of sight.	2021-01-27 17:38:17 +02:00
Botond Dénes	43ad64db78	sstables: sstable_mutation_reader: remove now unused whole sstable constructor	2021-01-27 17:38:17 +02:00
Botond Dénes	ec6c540c30	sstables: stats: remove now unused sstable_partition_reads counter	2021-01-27 17:38:17 +02:00
Botond Dénes	5f18e9eb37	sstable: remove read_.row._flat() methods	2021-01-27 17:38:17 +02:00
Botond Dénes	c3b4e990a2	tree-wide: use sstables::make_reader() instead of the read_.row._flat() methods	2021-01-27 17:38:17 +02:00
Botond Dénes	080bc2ffec	sstables: pass partition_range to create_single_key_sstable_reader() We want to unify the various sstable reader creation methods and this method taking a ring position instead of a partition range like everybody else stands in the way of that. This in effect reverts `68663d0de`.	2021-01-27 17:38:14 +02:00
Botond Dénes	a5a8037f6e	sstables: sstable: add make_reader() This will be the only method to create sstable readers with. For now we leave the other variants; they, as well as their users, will be removed in a following patch.	2021-01-27 15:20:06 +02:00
Benny Halevy	1847d49971	test: test_env: pick the highest sstable version by default If possible, test the highest sstable format version, as it is the most used. If there are pre-written sstables from an older version that we need to load from the test directory, either specify their version explicitly, or use the new test_env::reusable_sst method that looks up the latest sstable version in the given directory and generation. Test: unit(release) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20201210161822.2833510-1-bhalevy@scylladb.com>	2021-01-24 10:38:55 +02:00


2374 Commits