scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-12 19:02:12 +00:00

Author	SHA1	Message	Date
Botond Dénes	fe127a2155	sstables: clamp estimated_partitions to [1, +inf) in writers In some cases estimated number of partitions can be 0, which is albeit a legit estimation result, breaks many low-level sstable writer code, so some of these have assertions to ensure estimated partitions is > 0. To avoid hitting this assert all users of the sstable writers do the clamping, to ensure estimated partitions is at least 1. However leaving this to the callers is error prone as #6913 has shown it. As this clamping is standard practice, it is better to do it in the writers themselves, avoiding this problem altogether. This is exactly what this patch does. It also adds two unit tests, one that reproduces the crash in #6913, and another one that ensures all sstable writers are fine with estimated partitions being 0 now. Call sites previously doing the clamping are changed to not do it, it is unnecessary now as the writer does it itself. Fixes #6913 Tests: unit(dev) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20200724120227.267184-1-bdenes@scylladb.com>	2020-07-27 09:19:37 +02:00
Rafael Ávila de Espíndola	87b261ab32	sstables: Rename _writer to _compaction_writer Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-07-22 08:15:55 -07:00
Rafael Ávila de Espíndola	97b7fee78e	sstables: Move compaction_write_monitor to compaction_writer There is one monitor per writer, so we now keep them together in the compaction_writer struct. This trivially guarantees that the monitor is always destroyed before the writer. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-07-22 08:15:53 -07:00
Rafael Ávila de Espíndola	f8cc582e4a	sstables: Add couple of writer() getters to garbage_collected_sstable_writer This just reduces the noise of an upcoming patch. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-07-22 07:46:05 -07:00
Rafael Ávila de Espíndola	c740c66840	sstables: Move compaction_write_monitor earlier in the file This will be used by followup patches. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-07-22 07:46:05 -07:00
Raphael S. Carvalho	b67066cae2	table: Fix Staging SSTables being incorrectly added or removed from the backlog tracker Staging SSTables can be incorrectly added or removed from the backlog tracker, after an ALTER TABLE or TRUNCATE, because the add and removal don't take into account if the SSTable requires view building, so a Staging SSTable can be added to the tracker after a ALTER table, or removed after a TRUNCATE, even though not added previously, potentially causing the backlog to become negative. Fixes #6798. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200716180737.944269-1-raphaelsc@scylladb.com>	2020-07-20 10:57:38 +03:00
Avi Kivity	5371be71e9	Merge "Reduce fanout of some mutation-related headers" from Pavel E " The set's goal is to reduce the indirect fanout of 3 headers only, but likely affects more. The measured improvement rates are flat_mutation_reader.hh: -80% mutation.hh : -70% mutation_partition.hh : -20% tests: dev-build, 'checkheaders' for changed headers (the tree-wide fails on master) " * 'br-debloat-mutation-headers' of https://github.com/xemul/scylla: headers:: Remove flat_mutation_reader.hh from several other headers migration_manager: Remove db/schema_tables.hh inclustion into header storage_proxy: Remove frozen_mutation.hh inclustion storage_proxy: Move paxos/*.hh inclusions from .hh to .cc storage_proxy: Move hint_wrapper from .hh to .cc headers: Remove mutation.hh from trace_state.hh	2020-07-19 19:47:59 +03:00
Pavel Emelyanov	92f58f62f2	headers:: Remove flat_mutation_reader.hh from several other headers They can all live with a forward declaration of the f._m._r. plus a seastar header in commitlog code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-07-17 17:54:47 +03:00
Rafael Ávila de Espíndola	9fe4dc91d7	sstables: Move noop_write_monitor to a .cc file There is no need to expose a type that is only used via a virtual interface. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Reviewed-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20200717021215.545525-1-espindola@scylladb.com>	2020-07-17 11:59:03 +03:00
Benny Halevy	eb1d558d00	compaction: print uuid in log messages By convention, print the following information in all compaction log messages: [{compaction.type} {keyspace}.{table} {compaction.uuid}] Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-07-16 13:55:23 +03:00
Benny Halevy	dec751cfbe	compaction: report_(start|finish): just return description Rather than logging the message in the virtual callee method just return a string description and make the logger call in the common caller. 1. There is no need to do the logger call in the callee, it is simpler to format the log message in the caller and just retrieve the per-compaction-type description. 2. Prepare to centrally print the compaction uuid. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-07-16 13:55:23 +03:00
Benny Halevy	e39fbe1849	compaction: move compaction uuid generation to compaction_info We'd like to use the same uuid both for printing compaction log messages and to update compaction_history. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-07-16 13:55:23 +03:00
Pavel Emelyanov	1e15c06889	dht: Detach ring_position_comparator_for_sstables Next patches will generalize ring_position_comparator with templates to replace cache_entry's and memtable_entry's comparators. The overload of operator() for sstables has its own implementation, that differs from the "generic" one, for smoother generalization it's better to detach it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-07-14 16:30:02 +03:00
Benny Halevy	d4615f4293	sstables: sstable_version_types: implement operator<=> Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20200707061715.578604-1-bhalevy@scylladb.com>	2020-07-08 14:23:11 +03:00
Piotr Sarna	e4b74356bb	Merge 'view_update_generator: use partitioned sstable set' from Botond. Recently it was observed (#6603) that since 4e6400293ea, the staging reader is reading from a lot of sstables (200+). This consumes a lot of memory, and after this reaches a certain threshold -- the entire memory amount of the streaming reader concurrency semaphore -- it can cause a deadlock within the view update generation. To reduce this memory usage, we exploit the fact that the staging sstables are usually disjoint, and use the partitioned sstable set to create the staging reader. This should ensure that only the minimum number of sstable readers will be opened at any time. Refs: #6603 Fixes: #6707 Tests: unit(dev) * 'view-update-generator-use-partitioned-set/v1' of https://github.com/denesb/scylla: db/view: view_update_generator: use partitioned sstable set sstables: make_partitioned_sstable_set(): return an sstable_set	2020-07-06 14:36:08 +02:00
Botond Dénes	84b5d6d6d0	sstables: make_partitioned_sstable_set(): return an sstable_set Instead of an `std::unique_ptr<sstable_set_impl>`. The latter doesn't have a publicly available destructor, so it can only be called from within `sstables/compaction_strategy.cc` where its definition resides. Thus it is not really usable as a public function in its current form, which shows as it has no users either. This patch makes it usable by returning an `sstable_set`. That is what potential callers would want anyway. In fact this patch prepares the ground for the next one, which wishes to use this function for just that but can't in its current form.	2020-07-06 13:38:23 +03:00
Benny Halevy	fc89018146	sstables: random_access_reader: make methods noexcept handle all exceptions in read_exactly, seek, and close and specify them as noexcept. Also, specify eof() as noexcept as it trivially is. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-07-05 19:40:48 +03:00
Benny Halevy	94460f3199	sstables: random_access_reader: futurize seek And adjust its callers to wait on the returned future. With this, there is no need for a gate to serialize close() with the background work seek() used to leave behind. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-07-05 19:40:26 +03:00
Benny Halevy	765c5752c2	sstables: random_access_reader: unify input stream close code Define a close_if_needed() helper function, to be called from seek() and close(). A future patch will call it with a possibly disengaged `_in` so it will close it only if it was engaged. close_if_needed() captures the input stream unique ptr so it will remain valid throughout close. This was missing from close(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-07-05 19:37:39 +03:00
Benny Halevy	e7fdadd748	sstables: random_access_reader: let file_random_access_reader set the input stream Allow file_random_access_reader constructor to set the input stream to prepare for futurizing seek() by adding a protected set() method. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-07-05 19:37:36 +03:00
Benny Halevy	0bb1c0f37d	sstables: random_access_reader: move functions out of line These are not good candidates for inlining. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-07-05 18:47:04 +03:00
Asias He	07e253542d	compaction_manager: Avoid stall in perform_cleanup The following stall was seen during a cleanup operation: scylla: Reactor stalled for 16262 ms on shard 4. \| std::_MakeUniq<locator::tokens_iterator_impl>::__single_object std::make_unique<locator::tokens_iterator_impl, locator::tokens_iterator_impl&>(locator::tokens_iterator_impl&) at /usr/include/fmt/format.h:1158 \| (inlined by) locator::token_metadata::tokens_iterator::tokens_iterator(locator::token_metadata::tokens_iterator const&) at ./locator/token_metadata.cc:1602 \| locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at simple_strategy.cc:? \| (inlined by) locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at ./locator/simple_strategy.cc:56 \| locator::abstract_replication_strategy::get_ranges(gms::inet_address, locator::token_metadata&) const at /usr/include/fmt/format.h:1158 \| locator::abstract_replication_strategy::get_ranges(gms::inet_address) const at /usr/include/fmt/format.h:1158 \| service::storage_service::get_ranges_for_endpoint(seastar::basic_sstring<char, unsigned int, 15u, true> const&, gms::inet_address const&) const at /usr/include/fmt/format.h:1158 \| service::storage_service::get_local_ranges(seastar::basic_sstring<char, unsigned int, 15u, true> const&) const at /usr/include/fmt/format.h:1158 \| (inlined by) operator() at ./sstables/compaction_manager.cc:691 \| (inlined by) _M_invoke at /usr/include/c++/9/bits/std_function.h:286 \| std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>::operator()(table const&) const at /usr/include/fmt/format.h:1158 \| (inlined by) compaction_manager::rewrite_sstables(table, sstables::compaction_options, std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table 
const&)>) at ./sstables/compaction_manager.cc:604 \| compaction_manager::perform_cleanup(table) at /usr/include/fmt/format.h:1158 To fix, we futurize the function to get local ranges and sstables. In addition, this patch removes the dependency on the global storage_service object. Fixes #6662	2020-07-01 15:03:50 +08:00
Asias He	868e2da1c4	compaction_manager: Return exception future in perform_cleanup We should return the exception future instead of throwing a plain exception. Refs #6662	2020-07-01 15:00:01 +08:00
Raphael S. Carvalho	cf352e7c14	sstables: optimize procedure that checks if a sstable needs cleanup needs_cleanup() returns true if a sstable needs cleanup. Turns out it's very slow because it iterates through all the local ranges for all sstables in the set, making its complexity: O(num_sstables * local_ranges) We can optimize it by taking into account that abstract_replication_strategy documents that get_ranges() will return a list of ranges that is sorted and non-overlapping. Compaction for cleanup already takes advantage of that when checking if a given partition can be actually purged. So needs_cleanup() can be optimized into O(num_sstables * log(local_ranges)). With num_sstables=1000, RF=3, then local_ranges=256(num_tokens)*3, it means the max # of checks performed will go from 768000 to ~9584. Fixes #6730. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200629171355.45118-2-raphaelsc@scylladb.com>	2020-06-30 12:58:43 +03:00
Raphael S. Carvalho	a9eebdc778	sstables: export needs_cleanup() May be needed elsewhere, like in a unit test. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200629171355.45118-1-raphaelsc@scylladb.com>	2020-06-30 12:58:43 +03:00
Raphael S. Carvalho	68e12bd17e	sstables: sstable_directory: place debug message in logger this message, intended for debugging purposes, is not going through the logger. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200629184642.53348-1-raphaelsc@scylladb.com>	2020-06-30 12:47:17 +03:00
Raphael S. Carvalho	593c1e00c8	sstables:: kill unused sstables::sstable_open_info Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2020-06-29 14:23:48 -03:00
Raphael S. Carvalho	c7ba495691	sstables: kill unused sstable::load_shared_components() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2020-06-29 14:23:45 -03:00
Benny Halevy	a843945115	comapction: restore % in compaction completion message The % sign fell off in `c4841fa735` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20200625151352.736561-1-bhalevy@scylladb.com>	2020-06-25 18:11:59 +02:00
Raphael S. Carvalho	b17d20b5f4	reshape: LCS: avoid unnecessary work on level 0 No need to sort level 0 as we only check if levels > 0 are disjoint. Also taking the opportunity to avoid copies when sorting. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200624151921.20160-1-raphaelsc@scylladb.com>	2020-06-24 18:27:22 +03:00
Raphael S. Carvalho	864eb20002	reshape: Fix reshaping procedure for LCS The function that determines if a level L, where L > 0, is disjoint, is returning false if the level is disjoint. That's because it incorrectly counts an overlapping SSTable in the level as a disjoint SSTable. So we need to invert the logic. The side effect is that boot will always try to reshape levels greater than 0 because the reshape procedure incorrectly thinks that levels are overlapping when they're actually disjoint. Fixes #6695. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200623180221.229695-1-raphaelsc@scylladb.com>	2020-06-24 12:50:19 +03:00
Rafael Ávila de Espíndola	64c8164e6c	everywhere: Update to seastar api v4 (when_all_succeed returning a tuple) We now just need to replace a few calls to then with then_unpack. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200618172100.111147-1-espindola@scylladb.com>	2020-06-23 19:40:18 +03:00
Raphael S. Carvalho	47f63d021a	sstables/sstable_directory: improve log message in reshape() We were blind about the table which needed reshape and its compaction strategy, so let's improve log message. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200622192502.187532-4-raphaelsc@scylladb.com>	2020-06-23 19:40:18 +03:00
Raphael S. Carvalho	9033fa82d7	compaction: Reduce boilerplate to create new compaction type Run id and compaction type can now be figured out from the base class. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200622160645.177707-1-raphaelsc@scylladb.com>	2020-06-22 20:27:57 +02:00
Raphael S. Carvalho	2a171ee470	reshape: LCS: fix the target level of reshaping job LCS reshape job may pick a wrong level because we iterate through levels from index 1 and stop the iteration as soon as the current level is NOT disjoint, so it happens that we never reach the upper levels, meaning the level of the first NOT disjoint level is used, and not the actual maximum filled level. That's fixed by doing the iteration in the inverse order. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200618154112.8335-1-raphaelsc@scylladb.com>	2020-06-22 16:40:57 +03:00
Raphael S. Carvalho	52180f91d4	compaction: Fix the 2x disk space requirement in SSTable upgrade SSTable upgrade is requiring 2x the space of input SSTables because we aren't releasing references of the SSTables that were already upgraded. So if we're upgrading 1TB, it means that up to 2TB may be required for the upgrade operation to succeed. That can be fixed by moving all input SSTables when rewrite_sstables() asks for the set of SSTables to be compacted, so allowing their space to be released as soon as there is no longer any ref to them. Spotted while auditing code. Fixes #6682. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200619205701.92891-1-raphaelsc@scylladb.com>	2020-06-22 14:03:13 +03:00
Benny Halevy	a3918bdc96	distributed_loader: reenable verify_owner_and_mode when loading new sstables The call to `verify_owner_and_mode` from `flush_upload_dir` fell between the cracks in `b34c0c2ff6` (distributed_loader: rework uploading of SSTables). It causes https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/528/testReport/nodetool_additional_test/TestNodetool/nodetool_refresh_with_wrong_upload_modes_test/ to fail like this: ``` /Directory cannot be accessed .* write/ not found in 'Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/7351db7cab7bbf907172940d0bbf8b90afde90ba/scylla-tools-java/bin/nodetool -h 127.0.87.1 -p 7187 refresh -- keyspace1 standard1' failed; exit status: 1; stdout: nodetool: Scylla API server HTTP POST to URL '/storage_service/sstables/keyspace1' failed: Failed to load new sstables: std::filesystem::__cxx11::filesystem_error (error system:13, filesystem error: remove failed: Permission denied [/jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-rqzo7km7/test/node1/data/keyspace1/standard1-8a57a660b29611eabf0c000000000000/upload/mc-3-big-TOC.txt]) ``` Reenabling it in this patch makes the dtest pass again. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20200621140439.85843-1-bhalevy@scylladb.com>	2020-06-22 14:03:13 +03:00
Avi Kivity	7351db7cab	Merge "Reshape upload files and reshard+reshape at boot" from Glauber " This patchset adds a reshape operation to each compaction strategy; that is a strategy-specific way of detecting if SSTables are in-strategy or off-strategy, and in case they are off-strategy moving them to in-strategy. Often times the number of SSTables in a particular slice of the sstable set matters for that decision (number of SSTables in the same time window for TWCS, number of SSTables per tier for STCS, number of L0 SSTables for LCS). We want to be more lenient for operations that keep the node offline, like reshape at boot, but stricter for operations like upload, which run in maintenance mode. To accommodate that, the threshold for considering a slice of the SSTable set off-strategy is passed as a parameter. Once this patchset is applied, the upload directory will reshape the SSTables before moving them to the main directory (if needed). One side effect of it is that it is no longer necessary to take locks for the refresh operation nor disable writes in the table. With the infrastructure that we have built in the upload directory, we can apply the same set of steps to populate_column_family. Using the sstable_directory to scan the files we can reshard and reshape (usually if we resharded a reshape will be necessary) with the node still offline. This has the benefit of never adding shared SSTables to the table. Applying this patchset will unlock a host of cleanups: - we can get rid of all testing for shared sstables, sstable_need_rewrite, etc. - we can remove the resharding backlog tracker. and many others. Most cleanups are deferred for a later patchset, though. 
" * 'reshard-reshape-v4' of github.com:glommer/scylla: distributed_loader: reshard before the node is made online distributed_loader: rework uploading of SSTables sstable_directory: add helper to reshape existing unshared sstables compaction_strategy: add method to reshape SSTables compaction: add a new compaction type, Reshape compaction: add a size and throught pretty printer. compaction: add default implementation for some pure functions tests: fix fragile database tests distributed_loader.cc: add a helper function to extract the highest SSTable version found distributed_loader.cc : extract highest_generation_seen code compaction_manager: rename run_resharding_job distributed_loader: assume populate_column_families is run in shard 0 api: do not allow user to meddle with auto compaction too early upload: use custom error handler for upload directory sstable_directory: fix debug message	2020-06-18 17:04:53 +03:00
Glauber Costa	e40aa042a7	distributed_loader: reshard before the node is made online This patch moves the resharding process to use the new directory_with_sstables_handler infrastructure. There is no longer a clear reshard step, and that just becomes a natural part of populate_column_family. In main.cc, a couple of changes are necessary to make that happen. The first one obviously is to stop calling reshard. We also need to make sure that: - The compaction manager is started much earlier, so we can register resharding jobs with it. - auto compactions are disabled in the populate method, so resharding doesn't have to fight for bandwidth with auto compactions. Now that we are resharding through the sstable_directory, the old resharding code can be deleted. There is also no need to deal with the resharding backlog either, because the SSTables are not yet added to the sstable set at this point. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-18 09:37:18 -04:00
Glauber Costa	b34c0c2ff6	distributed_loader: rework uploading of SSTables Uploading of SSTables is problematic: for historical reasons it takes a lock that may have to wait for ongoing compactions to finish, then it disables writes in the table, and then it goes loading SSTables as if it knew nothing about them. With the sstable_directory infrastructure we can do much better: * we can reshard and reshape the SSTables in place, keeping the number of SSTables in check. Because this is a background process we can be fairly aggressive and set the reshape mode to strict. * we can then move the SSTables directly into the main directory. Because we know they are few in number we can call the more elegant add_sstable_and_invalidate_cache instead of the open coding currently done by load_new_sstables * we know they are not shared (if they were, we resharded them), simplifying the load process even further. The major change after this patch is applied is that all compactions (resharding and reshape) needed to make the SSTables in-strategy are done in the streaming class, which reduces the impact of this operation on the node. When the SSTables are loaded, subsequent reads will not suffer as we will not be adding shared SSTables in potential high numbers, nor will we reshard in the compaction class. There is also no more need for a lock in the upload process so in the fast path where users are uploading a set of SSTables from a backup this should essentially be instantaneous. The lock, as well as the code to disable and enable table writes is removed. A future improvement is to bypass the staging directory too, in which case the reshaping compaction would already generate the view updates. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-18 09:37:18 -04:00
Glauber Costa	4d6aacb265	sstable_directory: add helper to reshape existing unshared sstables Before moving SSTables to the main directory, we may need to reshape them into in-strategy. This patch provides helper code that reshapes the SSTables that are known to be unshared local in the sstable directory, and updates the sstable directory with the result. Reshaping can be made more or less aggressive by passing a reshape mode (relaxed or strict), which will influence the number of SSTables reshape can tolerate to consider a particular slice of the SSTable set offstrategy. Because the compaction expects an std::vector everywhere, we changed our chunked vector for the unshared sstables to a std::vector so we can more easily pass it around without conversions. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-18 09:37:18 -04:00
Glauber Costa	3c254dd49d	compaction_strategy: add method to reshape SSTables Some SSTable sets are considered to be off-strategy: they are in a shape that is at best not optimal and at worst adversarial to the current compaction strategy. This patch introduces the compaction strategy-specific method get_reshaping_job(). Given an SSTable set, it returns one compaction that can be done to bring the table closer to being in-strategy. The caller can then call this repeatedly until the table is fully in-strategy. As an example of how this is supposed to work, consider TWCS: some SSTables will belong to a single window -> in which case they are already in-strategy and don't need to be compacted, and others span multiple windows in which case they are considered off-strategy and have to be compacted. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-18 09:37:18 -04:00
Glauber Costa	0467bd0a94	compaction: add a new compaction type, Reshape From the point of view of selecting SSTables and its expected output, Reshaping really is just a normal compaction. However, there are some key differences that we would like to uphold: - Reshaping is done separately from the main SSTable set. It can be done with the node offline, or it can be done in a separate priority class. Either way, we don't want those SSTables to count towards backlog. For reads, because the SSTables are not yet registered in the backlog tracker (if offline or coming from upload), if we were to deduct compaction charges from it we would go negative. For writes, we don't want to deal with backlog management here because we will add the SSTable at once when reshaping is finished. - We don't need to do early replacements. - We would like to clearly mark the Reshaping compactions as such in the logs For the reasons above, it is nicer to add a new Reshape compaction type, a subclass of compaction, that upholds such properties. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-18 09:37:18 -04:00
Glauber Costa	c4841fa735	compaction: add a size and throught pretty printer. This is so we don't always use MB. Sometimes it is best to report GB, TB, and their equivalent throughput metrics. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-18 09:37:18 -04:00
Glauber Costa	ef85a2cec5	compaction: add default implementation for some pure functions There are some functions that are today pure that have an obvious implementation (for example on_new_partition, do nothing). We'll add default implementations to the compaction class, which reduces the boilerplate needed to add a new compaction type. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-18 09:00:28 -04:00
Glauber Costa	9902af894a	compaction_manager: rename run_resharding_job It will be used to run any custom job where the caller provides a function. One such example is indeed resharding, but reshaping SSTables can also fall here. The semaphore is also renamed, and we'll allow only one custom job at a time (across all possible types). We also remove the assumption of the scheduling group. The caller has to have already placed the code in the correct CPU scheduling group. The I/O priority class comes from the descriptor. To make sure that we don't regress, we wrap the entire reshard-at-boot code in the compaction class. Currently the setup would be done in the main group, and the actual resharding in the compaction group. Note that this is temporary, as this code is about to change. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-18 09:00:27 -04:00
Rafael Ávila de Espíndola	f6e407ecd2	everywhere: Prepare for seastar api v4 (when_all_succeed return value) The seastar api v4 changes the return type of when_all_succeed. This patch adds discard_result when that is the best solution to handle the change. This doesn't do the actual update to v4 since there are still a few issues left to fix in seastar. A patch doing just the update will follow. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200617233150.918110-1-espindola@scylladb.com>	2020-06-18 15:13:56 +03:00
Raphael S. Carvalho	03db448a92	sstables/backlog_tracker: Fix incorrect calculation of Compaction backlog When debugging this for the first time (c412a7a), I thought the problem, which causes backlog to be negative, was a bug in the implementation of the formula, but it turns out that the bug is actually in the formula itself. Not limiting the scope of this bug to STCS because its tracker is inherited by the trackers of other strategies, meaning they're also affected by this. The backlog for a SSTable is known to be Bi = Ei * log(T / Si) Where T = total Size minus compacted bytes for a table, Ci = Compacted Bytes for a SSTable, Si = Size of a SSTable, Ei = Si - Ci The problem was that we were assuming T > Si, but it can happen that T is lower than Si if the table in question is decreasing in size. If we rewrite SSTable backlog as Bi = Ei * log (T) - Ei * log(Si) It becomes even clearer why T cannot be lower than Si whatsoever, or the backlog calculation can go wrong because first term becomes lower than the second. Fixing the formula consists of changing it to Bi = Ei * log (T / Ei) Bi = Ei * log (T) - Ei * log (Si - Ci) After this change, the backlog still behaves in a very similar way as before, which can be confirmed via this graph: https://user-images.githubusercontent.com/1409139/79627762-71afdf80-8111-11ea-9ebc-0831c4e3d9c6.png Fixes #6021. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200616174712.16505-1-raphaelsc@scylladb.com>	2020-06-18 13:56:47 +03:00
Avi Kivity	9322c07c71	Merge "Use binary search in sstable promoted index" from Tomasz

"
The "promoted index" is what the sstable format calls the clustering key index within a given partition. Large partitions with many rows have it. It's embedded in the partition index entry.

Currently, lookups in the promoted index are done by scanning the index linearly, so the lookup is O(N). For large partitions that's inefficient. It consumes both a lot of CPU and I/O.

We could do better and use binary search in the index. This patch series switches the mc-format index reader to do that. Other formats use the old way.

The "mc" format promoted index has an extra structure at the end of the index called "offset map". It's a vector of offsets of consecutive promoted index entries. This allows us to access random entries in the index without reading the whole index.

The location of the offset entry for a given promoted index entry can be derived by knowing where the offset vector ends in the index file, so the offset map also doesn't have to be read completely into memory.

The most tricky part is caching. We need to cache blocks read from the index file to amortize the cost of binary search:

- if the promoted index fits in the 32 KiB which was read from the index when looking for the partition entry, we don't want to issue any additional I/O to search the promoted index.

- with large promoted indexes, the last few bisections will fall into the same I/O block and we want to reuse that block.

- we don't want the cache to grow too big, we don't want to cache the whole promoted index as the read progresses over the index. Scanning reads may skip multiple times.

This series implements a rather simple approach which meets all the above requirements and is not worse than the current state of affairs:

- Each index cursor has its own cache of the index file area which corresponds to the promoted index. This is managed by the cached_file class.

- Each index cursor has its own cache of parsed blocks. This allows the upper bound estimation to reuse information obtained during lower bound lookup. This estimation is used to limit read-aheads in the data file.

- Each cursor drops entries that it walked past so that memory footprint stays O(log N)

- Cached buffers are accounted to read's reader_permit.

Later, we could have a single cache shared by many readers. For that, we need to come up with an eviction policy.

Fixes #4007.

TESTING RESULTS

* Point reads, large promoted index:

Config: rows: 10000000, value size: 2000
Partition size: 20 GB
Index size: 7 MB

Notes:

- Slicing read into the middle of partition (offset=5000000, read=1) is a clear win for the binary search:

    time: 1.9ms vs 22.9ms
    CPU utilization: 8.9% vs 92.3%
    I/O: 21 reqs / 172 KiB vs 29 reqs / 3'520 KiB

  It's 12x faster, CPU utilization is 10x smaller, disk utilization is 20x smaller.

- Slicing at the front (offset=0) is a mixed bag.

    time is similar: 1.8ms
    CPU utilization is 6.7x smaller for bsearch: 8.5% vs 57.7%
    disk bandwidth utilization is smaller for bsearch but uses more IOs:
      4 reqs / 320 KiB (scan) vs 17 reqs / 188 KiB (bsearch)

  bsearch uses less bandwidth because the series reduces buffer size used for index file I/O.

  scan is issuing:

    2 * 128 KB (index page)
    2 * 32 KB (data file)

  bsearch is issuing:

    1 * 64 KB (index page)
    15 * 4 KB (promoted index)
    1 * 64 KB (data file)

  The 1 * 64 KB is chosen dynamically by seastar. Sometimes it chooses 2 * 32 KB (with read-ahead). 32 KB is the minimum I/O currently.

  Disk utilization could be further improved by changing the way seastar's dynamic I/O adjustments work so that it uses 1 * 4 KB when it suffices. This is left for the follow-up.

Command:

  perf_fast_forward --datasets=large-part-ds1 \
      --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1

Before:

 offset  read   time (s)  iterations    frags  frag/s  mad f/s  max f/s  min f/s   avg aio     aio     (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk    cpu     mem
      0     1   0.001836         172        1     545        9      563      175       4.0       4       320        2        2        0         1        1      0       0      0  57.7%       0
      0    32   0.001858         502       32   17220      126    17776    11526       3.2       3       324        2        1        0         1        1      0       0      0  56.4%       0
      0   256   0.002833         339      256   90374      427    91757    85931       7.0       7       776        3        1        0         1        1      0       0      0  41.1%       0
      0  4096   0.017211          58     4096  237984     2011   241802   233870      66.1      66      8376       59        2        0         1        1      0       0      0  21.4%       0
5000000     1   0.022952          42        1      44        1       45       41      29.2      29      3520       22        2        0         1        1      0       0      0  92.3%       0
5000000    32   0.023052          43       32    1388       14     1414     1331      31.1      32      3588       26        2        0         1        1      0       0      0  91.7%       0
5000000   256   0.024795          41      256   10325      129    10721     9993      43.1      39      4544       29        2        0         1        1      0       0      0  86.4%       0
5000000  4096   0.038856          27     4096  105414      398   106918   103162      95.2      95     12160       78        5        0         1        1      0       0      0  61.4%       0

After (v2):

 offset  read   time (s)  iterations    frags  frag/s  mad f/s  max f/s  min f/s   avg aio     aio     (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk    cpu     mem
      0     1   0.001831         248        1     546       21      581      252      17.6      17       188        2        0        0         1        1      0       0      0   8.5%       0
      0    32   0.001910         535       32   16751      626    17770    13896      17.9      19       160        3        0        0         1        1      0       0      0   8.8%       0
      0   256   0.003545         266      256   72207     2333    89076    62852      26.9      24       764        7        0        0         1        1      0       0      0   9.7%       0
      0  4096   0.016800          56     4096  243812      524   245430   239736      83.6      83      8700       64        0        0         1        1      0       0      0  16.6%       0
5000000     1   0.001968         351        1     508       19      538      380      21.3      21       172        2        0        0         1        1      0       0      0   8.9%       0
5000000    32   0.002273         431       32   14077      436    15503    11551      22.7      22       268        3        0        0         1        1      0       0      0   8.9%       0
5000000   256   0.003889         257      256   65824     2197    81833    57813      34.0      37       652       18        0        0         1        1      0       0      0  11.2%       0
5000000  4096   0.017115          54     4096  239324      834   241310   231993      88.3      88      8844       65        0        0         1        1      0       0      0  16.8%       0

After (v1):

 offset  read   time (s)  iterations    frags  frag/s  mad f/s  max f/s  min f/s   avg aio     aio     (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk    cpu     mem
      0     1   0.001886         259        1     530        4      545      261      18.0      18       376        2        2        0         1        1      0       0      0   9.1%       0
      0    32   0.001954         513       32   16381       93    16844    15618      19.0      19       408        3        2        0         1        1      0       0      0   9.3%       0
      0   256   0.003266         318      256   78393     1820    81567    61663      30.8      26      1272        7        2        0         1        1      0       0      0  10.4%       0
      0  4096   0.017991          57     4096  227666      855   231915   225781      83.1      83      8888       55        5        0         1        1      0       0      0  15.5%       0
5000000     1   0.002353         232        1     425        2      432      232      23.0      23       396        2        2        0         1        1      0       0      0   8.7%       0
5000000    32   0.002573         384       32   12437       47    12571      429      25.0      25       460        4        2        0         1        1      0       0      0   8.5%       0
5000000   256   0.003994         259      256   64101     2904    67924    51427      37.0      35      1484       11        2        0         1        1      0       0      0  10.6%       0
5000000  4096   0.018567          56     4096  220609      448   227395   219029      89.8      89      9036       59        5        0         1        1      0       0      0  15.1%       0

* Point reads, small promoted index (two blocks):

Config: rows: 400, value size: 200
Partition size: 84 KiB
Index size: 65 B

Notes:

- No significant difference in time
- the same disk utilization
- similar CPU utilization

Command:

  perf_fast_forward --datasets=large-part-ds1 \
      --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1

Before:

 offset  read   time (s)  iterations    frags  frag/s  mad f/s  max f/s  min f/s   avg aio     aio     (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk    cpu     mem
      0     1   0.000279         470        1    3587       31     3829      478       3.0       3        68        2        1        0         1        1      0       0      0  21.1%       0
      0    32   0.000276        3498       32  116038      811   122756   104033       3.0       3        68        2        1        0         1        1      0       0      0  24.0%       0
      0   256   0.000412        2554      256  621044     1778   732150   559221       2.0       2        72        2        0        0         1        1      0       0      0  32.6%       0
      0  4096   0.000510        1901      400  783883     4078   819058   665616       2.0       2        88        2        0        0         1        1      0       0      0  36.4%       0
    200     1   0.000339        2712        1    2951        8     3001     2569       2.0       2        72        2        0        0         1        1      0       0      0  17.8%       0
    200    32   0.000352        2586       32   91019      266    92427    83411       2.0       2        72        2        0        0         1        1      0       0      0  20.8%       0
    200   256   0.000458        2073      200  436503     1618   453945   385501       2.0       2        88        2        0        0         1        1      0       0      0  29.4%       0
    200  4096   0.000458        2097      200  436475     1676   458349   381558       2.0       2        88        2        0        0         1        1      0       0      0  29.0%       0

After (v1):

Testing slicing of large partition using clustering keys:

 offset  read   time (s)  iterations    frags  frag/s  mad f/s  max f/s  min f/s   avg aio     aio     (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk    cpu     mem
      0     1   0.000278         492        1    3598       30     3831      500       3.0       3        68        2        1        0         1        1      0       0      0  19.4%       0
      0    32   0.000275        3433       32  116153      753   122915    92559       3.0       3        68        2        1        0         1        1      0       0      0  22.5%       0
      0   256   0.000458        2576      256  559437     2978   728075   504375       2.1       2        88        2        0        0         1        1      0       0      0  29.0%       0
      0  4096   0.000506        1888      400  790064     3306   822360   623109       2.0       2        88        2        0        0         1        1      0       0      0  36.6%       0
    200     1   0.000382        2493        1    2619       10     2675     2268       2.0       2        88        2        0        0         1        1      0       0      0  16.3%       0
    200    32   0.000398        2393       32   80422      333    84759    22281       2.0       2        88        2        0        0         1        1      0       0      0  19.0%       0
    200   256   0.000459        2096      200  435943     1608   453989   380749       2.0       2        88        2        0        0         1        1      0       0      0  30.5%       0
    200  4096   0.000458        2097      200  436410     1651   455779   382485       2.0       2        88        2        0        0         1        1      0       0      0  29.2%       0

* Scan with skips, large index:

Config: rows: 10000000, value size: 2000
Partition size: 20 GB
Index size: 7 MB

Notes:

- Similar time, slightly worse for binary search: 36.1 s (scan) vs 36.4 (bsearch)

- Slightly more I/O for bsearch: 153'932 reqs / 19'703'260 KiB (scan) vs 155'651 reqs / 19'704'088 KiB (bsearch)

  Binary search reads more by 828 KB and by 1719 IOs. It does more I/O to read the promoted index offset map.

- similar (low) memory footprint. The danger here is that by caching index blocks which we touch as we scan we would end up caching the whole index. But this is protected against by eviction as demonstrated by the last "mem" column.

Command:

  perf_fast_forward --datasets=large-part-ds1 \
      --run-tests=large-partition-skips -c1 --test-case-duration=1

Before:

   read  skip   time (s)  iterations    frags  frag/s  mad f/s  max f/s  min f/s   avg aio     aio     (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk    cpu     mem
      1     1  36.103451           4  5000000  138491       38   138601   138453  153932.0  153932  19703260   153561        1        0         1        1      0       0      0  31.5%  502690

After (v2):

   read  skip   time (s)  iterations    frags  frag/s  mad f/s  max f/s  min f/s   avg aio     aio     (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk    cpu     mem
      1     1  37.000145           4  5000000  135135        6   135146   135128  155651.0  155651  19704088   138968        0        0         1        1      0       0      0  34.2%       0

After (v1):

   read  skip   time (s)  iterations    frags  frag/s  mad f/s  max f/s  min f/s   avg aio     aio     (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk    cpu     mem
      1     1  36.965520           4  5000000  135261       30   135311   135231  155628.0  155628  19704216   139133        1        0         1        1      0       0      0  33.9%  248738

Also in:

  git@github.com:tgrabiec/scylla.git sstable-use-index-offset-map-v2

Tests:

- unit (all modes)
- manual using perf_fast_forward
"

* tag 'sstable-use-index-offset-map-v2' of github.com:tgrabiec/scylla:
  sstables: Add promoted index cache metrics
  position_in_partition: Introduce external_memory_usage()
  cached_file, sstables: Add tracing to index binary search and page cache
  sstables: Dynamically adjust I/O size for index reads
  sstables, tests: Allow disabling binary search in promoted index from perf tests
  sstables: mc: Use binary search over the promoted index
  utils: Introduce cached_file
  sstables: clustered_index: Relax scope of validity of entry_info
  sstables: index_entry: Introduce owning promoted_index_block_position
  compound_compat: Allow constructing composite from a view
  sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view
  sstables: mc: Extract parser for promoted index block
  sstables: mc: Extract parser for clustering out of the promoted index block parser
  sstables: consumer: Extract primitive_consumer
  sstables: Abstract the clustering index cursor behavior
  sstables: index_reader: Rearrange to reduce branching and optionals	2020-06-18 12:09:39 +03:00
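The offset-map idea described in 9322c07c71 can be sketched with a toy layout. This is not the real "mc" on-disk format: the entry encoding (2-byte length plus key), the 4-byte little-endian offsets, and the names `build_index`, `read_key` and `lower_bound` are all assumptions made for illustration. The point is only that a trailing vector of fixed-width offsets lets a reader locate entry i from the end of the index, so a binary search touches O(log N) entries instead of scanning all of them.

```python
import struct

def build_index(keys):
    """Serialize variable-length entries followed by a trailing offset map."""
    body = bytearray()
    offsets = []
    for key in keys:
        offsets.append(len(body))          # where this entry starts
        raw = key.encode()
        body += struct.pack("<H", len(raw)) + raw
    offset_map = b"".join(struct.pack("<I", off) for off in offsets)
    return bytes(body) + offset_map, len(keys)

def read_key(blob, count, i):
    """Read entry i by first looking up its offset in the trailing map."""
    map_start = len(blob) - 4 * count      # derived from where the map ends
    (off,) = struct.unpack_from("<I", blob, map_start + 4 * i)
    (length,) = struct.unpack_from("<H", blob, off)
    return blob[off + 2 : off + 2 + length].decode()

def lower_bound(blob, count, target):
    """Return (index of first key >= target, number of entries probed)."""
    lo, hi = 0, count
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if read_key(blob, count, mid) < target:
            lo = mid + 1
        else:
            hi = mid
    return lo, probes

# Hypothetical clustering keys, already sorted.
keys = [f"ck{i:06d}" for i in range(1000)]
blob, count = build_index(keys)

idx, probes = lower_bound(blob, count, "ck000500")
print(idx, probes)
```

A real reader would additionally cache the I/O blocks that the probes fall into, as the merge message describes, since the last few bisections tend to land in the same block.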
Raphael S. Carvalho	2f680b3458	size_tiered_backlog_tracker: Rename total_bytes Reader can assume total_bytes and _total_bytes have the same meaning, but they don't, so let's give the former a more descriptive name. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200616175055.16771-1-raphaelsc@scylladb.com>	2020-06-17 13:39:30 +03:00


2133 Commits