scylladb

Author	SHA1	Message	Date
Nadav Har'El	7922b9eb8f	materialized views: reduce recompilation when db/view/view.hh changes. Before this patch, when db/view/view.hh was modified, 89 source files had to be recompiled. After this patch, this number is down to 5. Most of the irrelevant source files got view.hh by including database.hh, which included view.hh just for the definition of statistics. So in this patch we split the view statistics to a separate header file, view_stats.hh, and database.hh only includes that. A few source files which included only database.hh and also needed view.hh (for materialized-view related functions) now need to include view.hh explicitly. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20200319121031.540-1-nyh@scylladb.com>	2020-03-19 15:46:14 +02:00
Piotr Sarna	0c11e07faf	view,table: fix waiting for view updates during building View updates sent as part of the view building process should never be ignored, but `fd49fd7` introduced a bug which may cause exactly that: the updates are mistakenly sent to the background, so the view builder will not receive negative feedback if an update fails, and consequently will not retry it. As a result, view building may report that it "finished" building a view while some of the updates were lost. A simple fix is to restore the previous behaviour: all updates triggered by view building are now waited for. Fixes #6038 Tests: unit(dev), dtest: interrupt_build_process_with_resharding_low_to_half_test	2020-03-19 10:50:54 +02:00
Piotr Sarna	fd49fd773c	db,view: move putting view updates to background to mutate_MV Currently, view updates are launched as an asynchronous background job by not waiting for the mutate_MV() future in table::generate_and_propagate_view_updates. That has a big downside: mutate_MV() handles all view updates for all views of a table, so it's not possible to wait for each view independently. Per-view granularity is required in order to implement synchronous view updates of local views, because then we'll synchronously wait for all views that write to the local node (due to having a matching partition key with the base), while remote view updates will still be sent asynchronously. To do that, we now properly wait for mutate_MV, and instead launch the asynchronous, unwaited-for futures inside mutate_MV. Effectively, this means no change for view updates so far: all updates are still fired in the background. A later patch will introduce a way to wait for selected updates to finish.	2020-03-11 09:05:56 +01:00
Piotr Sarna	3b3659e8cd	db,view: drop default parameter for mutate_MV::allow_hints Default parameters are considered harmful, so as part of a cleanup before editing view.cc code, the default value for the allow_hints parameter is removed.	2020-03-11 09:05:56 +01:00
Raphael S. Carvalho	3ba3ee2a7b	distributed_loader: trigger regular compaction on resharding completion Regular compaction relies on the compaction manager to run compaction jobs until the compaction strategy is satisfied. Resharding, on the other hand, is a one-off operation which runs only once in the compaction manager and leaves the sstable set in a state where the strategy is very likely unsatisfied. We need to trigger regular compaction whenever a resharding job replaces a shared sstable with an unshared sstable, so that compaction does not fall far behind due to the many new sstables created by the resharding process. Fixes #5262. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200217144946.20338-1-raphaelsc@scylladb.com>	2020-03-04 16:08:13 +02:00
Avi Kivity	906784639d	Merge "Clean sstables from using global objects" from Pavel E " This set cleans sstable_writer_config and surrounding sstables code from using global storage_ and feature_ service-s and database by moving the configuration logic onto sstables_manager (which was supposed to do so since `eebc3701a5`). Most of the complexity is hidden around sstable_writer_config creation; this set makes the sstables_manager create this object with an explicit call. All the rest are consequences of this change. Tests: unit(debug), manual start-stop " * 'br-clean-sstables-manager-2' of https://github.com/xemul/scylla: sstables: Move get_highest_supported_format sstables: Remove global get_config() helper sstables: Use manager's config() in .new_sstable_component_file() sstable_writer_config: Extend with more db::config stuff sstables_manager: Don't use global helper to generate writer config sstable_writer_config: Sanitize out some features fields initialization sstable_writer_config: Factor out some field initialization sstables: Generate writer config via manager only sstables: Keep reference on manager test: Re-use existing global sstables_manager table: Pass sstable_writer_config into write_memtable_to_sstable	2020-03-03 18:33:01 +02:00
Avi Kivity	157fe4bd19	Merge "Remove default timeouts" from Botond " Timeouts defaulted to `db::no_timeout` are dangerous. They allow any modification to the code to drop timeouts and introduce a source of unbounded request queues into the system. This series removes the last such default timeouts from the code. No problems were found; only test code had to be updated. tests: unit(dev) " * 'no-default-timeouts/v1' of https://github.com/denesb/scylla: database: database::query(), database::apply(): remove default timeouts database: table::query(): remove default timeout mutation_query: data_query(): remove default timeout mutation_query: mutation_query(): remove default timeout multishard_mutation_query: query_mutations_on_all_shards(): remove default timeout reader_concurrency_semaphore: wait_admission(): remove default timeout utils/logallog: run_when_memory_available(): remove default timeout	2020-03-01 17:29:17 +02:00
Botond Dénes	8da88e6cb9	mutation_query: data_query(): remove default timeout	2020-02-27 19:02:40 +02:00
Botond Dénes	7bdeec4b00	flat_mutation_reader: make_reversing_reader(): add memory limit If reversing requires more memory than the limit, the read is aborted. All users are updated to pass a meaningful limit taken from the respective table object, except, of course, for tests.	2020-02-27 18:11:54 +02:00
Pavel Emelyanov	7363d56946	sstables: Move get_highest_supported_format The global get_highest_supported_format helper and its declaration are scattered all over the code, so clean this up and prepare the ground for moving _sstables_format from the storage_service onto the sstables_manager (not this set). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-02-25 14:31:45 +03:00
Pavel Emelyanov	5adce3390c	sstables: Generate writer config via manager only The sstable_writer_config creation looks simple (just declare the struct instance), but behind the scenes it references storage and feature services, messes with database config, etc. This patch teaches the sstables_manager to generate the writer config and makes the rest of the code use it. For future safety, creating the sstable_writer_config by hand is prohibited. The manager is referenced through table-s and sstable-s; since the two existing sstables_managers live on the database object, and table-s and sstable-s both live shorter than the database, this reference is safe. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-02-25 14:31:04 +03:00
Pavel Emelyanov	961f1642c7	table: Pass sstable_writer_config into write_memtable_to_sstable The latter creates the config by hand, but the plan is to create it via sstables_manager. Callers of this helper are the final frontiers where the manager will be safely accessible. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-02-25 13:54:40 +03:00
Raphael S. Carvalho	f93912f344	Revert "Revert "streaming: Do not invalidate cache if no sstable is added in flush_streaming_mutations"" With #4446 fixed, this commit can be reverted. This reverts commit `454e7e0109`.	2020-02-20 10:55:50 -03:00
Raphael S. Carvalho	fb81f2aa7c	table: Fix stale data being returned due to lack of cache invalidation Row cache needs to be invalidated whenever data in sstables changes. Cleanup removes data from sstables which doesn't belong to the node anymore, which means the cache must be invalidated on cleanup. Currently, stale data can be returned when a node re-owns ranges whose data are still stored in the node's row cache, because cleanup didn't invalidate the cache. To prevent data that belongs to the node from being purged from the row cache, cleanup will only invalidate the cache with a set of token ranges that do not overlap with any of the ranges owned by the node. update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_decommission_node_2_test now passes. Fixes #4446. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2020-02-20 10:55:50 -03:00
Raphael S. Carvalho	65b4fc8bcd	sstables/compaction: Introduce compaction_completion_desc This descriptor contains all the information needed for the table to be properly updated on compaction completion. A new member will be added to it soon, which will store the ranges to be invalidated in the row cache on behalf of cleanup compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2020-02-19 19:29:32 -03:00
Avi Kivity	454e7e0109	Revert "streaming: Do not invalidate cache if no sstable is added in flush_streaming_mutations" This reverts commit `5e9925b9f0`. It causes data resurrection in simple_decommission_node_2_test. Fixes #5838.	2020-02-18 20:13:10 +02:00
Avi Kivity	6c7aa18238	Merge "Introduce schema::get_partitioner" from Piotr " Introduce schema::get_partitioner and use it instead of dht::global_partitioner. Fixes #5493 Tests: unit(dev, release, debug) " * 'per_table_partitioner_prep' of https://github.com/haaawk/scylla: (35 commits) cdc: stop using partitioners partitioner_test: stop calling set_global_partitioner storage_service: stop calling global_partitioner() mutation_writer_test: stop calling global_partitioner() schema: reduce number of global_partitioner() calls test_services: stop calling global_partitioner() sstable_utils: stop calling global_partitioner() sstable_resharding_test: stop depending on global partitioner sstable_mutation_test: stop calling global_partitioner() sstable_data_file_test: stop calling global_partitioner() random_schema: stop taking partitioner in constructor mutation_reader_test: stop calling global_partitioner() multishard_mutation_query_test: stop calling global_partitioner() row_level repair: stop calling global_partitioner() distribute_reader_and_consume_on_shards: don't take partitioner thrift: reduce global_partitioner() calls binary_search: stop calling global_partitioner() index_entry: stop calling global_partitioner() mc writer: stop calling global_partitioner() sstable: stop calling global_partitioner() ...	2020-02-17 18:12:53 +02:00
Tomasz Grabiec	76d1dd7ec6	Merge "nodetool scrub: implement validation and the skip-corrupted flag" from Botond Nodetool scrub rewrites all sstables, validating their data. If corrupt data is found, the scrub is aborted. If the skip-corrupted flag is set, corrupt data is instead logged (just the keys) and skipped. The scrubbing algorithm itself is fairly simple, especially since we already have a mutation stream validator that we can use to validate the data. However, scrub is currently piggy-backed on top of cleanup compaction. To implement this flag, we have to make scrub a separate compaction type and propagate the flag down. This required some massaging of the code: * Add support for more than two (cleanup or not) compaction types. * Allow passing custom options for each compaction type. * Allow stopping a compaction without the manager retrying it later. Additionally, the validator itself needed some changes to allow different ways to handle errors, as needed by the scrub. Fixes: #5487 * https://github.com/denesb/nodetool-scrub-skip-corrupted/v7: table: cleanup_sstables(): only short-circuit on actual cleanup compaction: compaction_type: add Upgrade compaction: introduce compaction_options compaction: compaction_descriptor: use compaction options instead of cleanup flag compaction_manager: collect all cleanup related logic in perform_cleanup() sstables: compaction_stop_exception: add retry flag mutation_fragment_stream_validator: split into low-level and high-level API compaction: introduce scrub_compaction compaction_manager: scrub: don't piggy-back on upgrade_sstables() test: sstable_datafile_test: add scrub unit test	2020-02-17 15:28:07 +02:00
Piotr Jastrzebski	2d7532f87f	dht: add dht::get_token and replace all calls to dht::global_partitioner().get_token dht::get_token is better because it takes schema and uses it to obtain partitioner instead of using a global partitioner. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-02-17 10:59:15 +01:00
Piotr Jastrzebski	ca4a89d239	dht: add dht::decorate_key and replace all dht::global_partitioner().decorate_key with dht::decorate_key It is an improvement because dht::decorate_key takes the schema and uses it to obtain the partitioner, instead of using the global partitioner as before. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-02-17 10:59:06 +01:00
Piotr Jastrzebski	abd76e566f	dht::shard_of: stop calling global_partitioner() Take const schema& as a parameter of shard_of and use it to obtain partitioner instead of calling global_partitioner(). Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-02-17 10:23:16 +01:00
Asias He	5e9925b9f0	streaming: Do not invalidate cache if no sstable is added in flush_streaming_mutations table::flush_streaming_mutations was used in the days when streaming data went to memtables. After switching to the new streaming, data goes directly to sstables during streaming, so the sstables generated in table::flush_streaming_mutations will be empty. It is unnecessary to invalidate the cache if no sstables are added. To avoid unnecessary cache invalidation, which pokes holes in the cache, skip calling _cache.invalidate() if the sstable list is empty. The steps are: - STREAM_MUTATION_DONE verb is sent when streaming is done with old or new streaming - table::flush_streaming_mutations is called in the verb handler - cache is invalidated for the streaming ranges In summary, this patch will avoid a lot of cache invalidation for streaming. Backports: 3.0 3.1 3.2 Fixes: #5769	2020-02-16 11:22:30 +02:00
Pavel Emelyanov	b11cf6e950	cql3/query_processor.hh: Debloat from other headers This reduces the recompilation caused by touching it by ~30% (251 jobs -> 181 jobs). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20200212225828.3374-1-xemul@scylladb.com>	2020-02-16 11:22:30 +02:00
Botond Dénes	8014c7124d	compaction_manager: collect all cleanup related logic in perform_cleanup() Currently the call chain for a cleanup compaction looks like this: compaction_manager::perform_cleanup() -> compaction_manager::rewrite_sstables() -> table::cleanup_sstables() ... `perform_cleanup()` is essentially empty, immediately deferring to `rewrite_sstables()`. Cleanup-related logic is scattered between the latter two methods on the call chain. These methods, however, recently started serving as generic methods for compactions that want to rewrite each sstable one-by-one, collecting cleanup-related ifs in various places. The reason is historic: we first had cleanup, then bolted others on top, trying to share the underlying code as much as possible. It is time this is cleaned up (pun intended). Make `perform_cleanup()` the place where all cleanup-related logic lives, with the rest of the stack made truly generic.	2020-02-11 17:47:44 +02:00
Botond Dénes	b2dc5d4895	compaction: compaction_descriptor: use compaction options instead of cleanup flag Instead of the restrictive `cleanup` boolean flag, which allows for choosing between only two compaction types, use `compaction_options`, which in addition to allowing any number of compaction types to be selected, also allows seamlessly passing specific options to them.	2020-02-11 17:47:44 +02:00
Botond Dénes	0b53ccaecd	table: cleanup_sstables(): only short-circuit on actual cleanup Currently the cleanup call is short-circuited if it is determined that cleanup is not needed for the sstable to be cleaned up. This is undesirable because cleanup is not the only user of this routine: sstable-upgrade and sstable-scrub also use it to rewrite sstables, and they want to go on with rewriting the sstable even if all the data in it belongs to the current node. Fix: #5699	2020-02-11 17:47:44 +02:00
Eliran Sinvani	8cfc2aad57	internalize storage proxy statistics metric registration The storage proxy statistics structure did not contain a method for registering the statistics with metric groups; instead, each user had to register some of the metrics by itself. There is no real reason for separating the metrics registration from the statistics data, and even less justification for doing this only for part of the stats, as is the case for these statistics. This commit internalizes the metrics registration in the storage_proxy stats structures. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2020-01-30 15:01:40 +01:00
Botond Dénes	dfc8b2fc45	treewide: replace reader_resource_tracker with reader_permit The former was never really more than a reader_permit with one additional method. Currently, using it doesn't even save one from any includes. Now that readers will be using reader_permit, we would have to pass both down to mutation_source. Instead, get rid of reader_resource_tracker and just use reader_permit. Instead of making it the last, optional parameter that is easy to ignore, make it a first-class parameter, right after schema, to signify that permits are now a prominent part of the reader API. This -- mostly mechanical -- patch essentially refactors mutation_source to ask for the reader_permit instead of reader_resource_tracker and updates all usage sites.	2020-01-28 08:13:16 +02:00
Amnon Heiman	028525daeb	database: add schema.cql file when creating a snapshot When creating a snapshot we need to add a schema.cql file in the snapshot directory that describes the table in that snapshot. This patch adds the file using the schema describe method. get_snapshot_details and manifest_json_filter were modified to ignore the schema.cql file. Fixes #4192 Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2020-01-15 15:06:00 +02:00
Benny Halevy	718e9eb341	table: move_sstables_from_staging: fix use after free of shared_sstable Introduced in `4b3243f5b9`. Reproducible with materialized_views_test:TestMaterializedViews.mv_populating_from_existing_data_during_node_remove_test and read_amplification_test:ReadAmplificationTest.no_read_amplification_on_repair_with_mv_test ==955382==ERROR: AddressSanitizer: heap-use-after-free on address 0x60200023de18 at pc 0x00000051d788 bp 0x7f8a0563fcc0 sp 0x7f8a0563fcb0 READ of size 8 at 0x60200023de18 thread T1 (reactor-1) #0 0x51d787 in seastar::lw_shared_ptr<sstables::sstable>::lw_shared_ptr(seastar::lw_shared_ptr<sstables::sstable> const&) /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/shared_ptr.hh:289 #1 0x10ba189 in apply<table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()>::<lambda(std::set<seastar::basic_sstring<char, unsigned int, 15> >&)>::<lambda(sstables::shared_sstable)>&, const seastar::lw_shared_ptr<sstables::sstable>&> /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1530 #2 0x109c4f1 in apply<table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()>::<lambda(std::set<seastar::basic_sstring<char, unsigned int, 15> >&)>::<lambda(sstables::shared_sstable)>&, const seastar::lw_shared_ptr<sstables::sstable>&> /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1556 #3 0x106941a in do_for_each<__gnu_cxx::__normal_iterator<const seastar::lw_shared_ptr<sstables::sstable>*, std::vector<seastar::lw_shared_ptr<sstables::sstable> > >, table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()>::<lambda(std::set<seastar::basic_sstring<char, unsigned int, 15> >&)>::<lambda(sstables::shared_sstable)> > /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future-util.hh:618 #4 0x1069203 in operator() /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future-util.hh:626 #5 0x10ba589 in apply /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/apply.hh:36 #6 0x10ba668 in apply<seastar::do_for_each(Iterator, Iterator, AsyncAction) [with Iterator = __gnu_cxx::__normal_iterator<const seastar::lw_shared_ptr<sstables::sstable>*, std::vector<seastar::lw_shared_ptr<sstables::sstable> > >; AsyncAction = table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()>::<lambda(std::set<seastar::basic_sstring<char, unsigned int, 15> >&)>::<lambda(sstables::shared_sstable)>]::<lambda()>&> /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/apply.hh:44 #7 0x10ba7c0 in apply<seastar::do_for_each(Iterator, Iterator, AsyncAction) [with Iterator = __gnu_cxx::__normal_iterator<const seastar::lw_shared_ptr<sstables::sstable>*, std::vector<seastar::lw_shared_ptr<sstables::sstable> > >; AsyncAction = table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()>::<lambda(std::set<seastar::basic_sstring<char, unsigned int, 15> >&)>::<lambda(sstables::shared_sstable)>]::<lambda()>&> /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1563 ... 0x60200023de18 is located 8 bytes inside of 16-byte region [0x60200023de10,0x60200023de20) freed by thread T1 (reactor-1) here: #0 0x7f8a153b796f in operator delete(void*) (/lib64/libasan.so.5+0x11096f) #1 0x6ab4d1 in __gnu_cxx::new_allocator<seastar::lw_shared_ptr<sstables::sstable> >::deallocate(seastar::lw_shared_ptr<sstables::sstable>*, unsigned long) /usr/include/c++/9/ext/new_allocator.h:128 #2 0x612052 in std::allocator_traits<std::allocator<seastar::lw_shared_ptr<sstables::sstable> > >::deallocate(std::allocator<seastar::lw_shared_ptr<sstables::sstable> >&, seastar::lw_shared_ptr<sstables::sstable>*, unsigned long) /usr/include/c++/9/bits/alloc_traits.h:470 #3 0x58fdfb in std::_Vector_base<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > >::_M_deallocate(seastar::lw_shared_ptr<sstables::sstable>*, unsigned long) /usr/include/c++/9/bits/stl_vector.h:351 #4 0x52a790 in std::_Vector_base<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > >::~_Vector_base() /usr/include/c++/9/bits/stl_vector.h:332 #5 0x52a99b in std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > >::~vector() /usr/include/c++/9/bits/stl_vector.h:680 #6 0xff60fa in ~<lambda> /local/home/bhalevy/dev/scylla/table.cc:2477 #7 0xff7202 in operator() /local/home/bhalevy/dev/scylla/table.cc:2496 #8 0x106af5b in apply<table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()> > /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1573 #9 0x102f5d5 in futurize_apply<table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()> > /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1645 #10 0x102f9ee in operator()<seastar::semaphore_units<seastar::named_semaphore_exception_factory> > /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/semaphore.hh:488 #11 0x109d2f1 in apply /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/apply.hh:36 #12 0x109d42c in apply<seastar::with_semaphore(seastar::basic_semaphore<ExceptionFactory, Clock>&, size_t, Func&&) [with ExceptionFactory = seastar::named_semaphore_exception_factory; Func = table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()>; Clock = std::chrono::_V2::steady_clock]::<lambda(auto:51)>&, seastar::semaphore_units<seastar::named_semaphore_exception_factory, std::chrono::_V2::steady_clock>&&> /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/apply.hh:44 #13 0x109d595 in apply<seastar::with_semaphore(seastar::basic_semaphore<ExceptionFactory, Clock>&, size_t, Func&&) [with ExceptionFactory = seastar::named_semaphore_exception_factory; Func = table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()>; Clock = std::chrono::_V2::steady_clock]::<lambda(auto:51)>&, seastar::semaphore_units<seastar::named_semaphore_exception_factory, std::chrono::_V2::steady_clock>&&> /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1563 ... Fixes #5511 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20191222214326.1229714-1-bhalevy@scylladb.com>	2019-12-23 15:20:41 +02:00
Benny Halevy	4b3243f5b9	table: move_sstables_from_staging_in_thread with _sstable_deletion_sem Hold the _sstable_deletion_sem while moving sstables from the staging directory, so as not to move them out from under table::snapshot. Fixes #5340 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-12-17 12:20:20 +02:00
Benny Halevy	6efef84185	sstable: return future from move_to_new_dir distributed_loader::probe_file needlessly creates a seastar thread for it and the next patch will use it as part of a parallel_for_each loop to move a list of sstables (and sync the directories once at the end). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-12-17 12:20:20 +02:00
Piotr Sarna	79c3a508f4	table: Reduce read amplification in view update generation This commit makes sure that single-partition readers for read-before-write do not have fast-forwarding enabled, as it may lead to huge read amplification. The observed case was: 1. Creating an index. CREATE INDEX index1 ON myks2.standard1 ("C1"); 2. Running cassandra-stress in order to generate view updates. cassandra-stress write no-warmup n=1000000 cl=ONE -schema \ 'replication(factor=2) compaction(strategy=LeveledCompactionStrategy)' \ keyspace=myks2 -pop seq=4000000..8000000 -rate threads=100 -errors skip-read-validation -node 127.0.0.1; Without disabling fast-forwarding, single-partition readers were turned into scanning readers in cache, which resulted in reading 36GB (sic!) on a workload which generates less than 1GB of view updates. After applying the fix, the number dropped down to less than 1GB, as expected. Refs #5409 Fixes #4615 Fixes #5418	2019-12-05 11:58:34 +02:00
Avi Kivity	fd951a36e3	Merge "Let compaction wait on background deletions" from Benny " In several cases in distributed testing (dtest) we trigger compaction using nodetool compact assuming that when it is done, it is indeed really done. However, the way compaction is currently implemented in scylla, it may leave behind some background tasks to delete the old sstables that were compacted. This commit changes major compaction (triggered via the ss::force_keyspace_compaction api) so it would wait on the background deletes and will return only when they finish. Fixes #4909 Tests: unit(dev), nodetool_refresh_with_data_perms_test, test_nodetool_snapshot_during_major_compaction "	2019-12-04 11:18:41 +02:00
Piotr Sarna	9c5a5a5ac2	treewide: add names to semaphores By default, semaphore exceptions bring along very little context: either that a semaphore was broken or that it timed out. In order to make debugging easier without introducing significant runtime costs, the notion of a named semaphore is added. A named semaphore is simply a semaphore with a statically defined name, which is included in its errors, bringing valuable context. A semaphore defined as `auto sem = semaphore(0);` will present the following message when it breaks: "Semaphore broken". However, a named semaphore defined as `auto named_sem = named_semaphore(0, named_semaphore_exception_factory{"io_concurrency_sem"});` will present a message with at least some debugging context: "Semaphore broken: io_concurrency_sem". It's not much, but it would really help in pinpointing bugs without having to inspect core dumps. At the same time, it does not add any cost to normal semaphore operations (except for its creation); it only uses more CPU when an error is actually thrown, which is expected to be rare and off the hot path. Refs #4999 Tests: unit(dev), manual: hardcoding a failure in view building code	2019-11-26 15:14:21 +02:00
Benny Halevy	f9e93bba38	sstables: compaction: move cleanup parameter to compaction_descriptor Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20191117165806.3234-1-bhalevy@scylladb.com>	2019-11-18 10:52:20 +01:00
Kamil Braun	a67e887dea	sstables: fix sstable file I/O CQL tracing when reading multiple files (#5285) CQL tracing would only report file I/O involving one sstable, even if multiple sstables were read from during the query. Steps to reproduce: create a table with NullCompactionStrategy; insert a row, flush memtables; insert a row, flush memtables; restart Scylla; tracing on; select * from table. The trace would only report DMA reads from one of the two sstables. Kudos to @denesb for catching this. Related issue: #4908	2019-11-17 00:38:37 -08:00
Piotr Dulikowski	59fbbb993f	memtables: add partition/row hit/miss counters Adds per-table metrics for counting partition and row reuse in memtables. New metrics are as follows: - memtable_partition_writes - number of write operations performed on partitions in memtables, - memtable_partition_hits - number of write operations performed on partitions that previously existed in a memtable, - memtable_row_writes - number of row write operations performed in memtables, - memtable_row_hits - number of row write operations that overwrote rows previously present in a memtable. Tests: unit(release)	2019-11-12 13:35:41 +01:00
Vladimir Davydov	b75862610e	paxos_state: account paxos round latency This patch adds the following per-table stats: cas_prepare_latency, cas_propose_latency, cas_commit_latency. They are equivalent to the CasPrepare, CasPropose and CasCommit metrics exposed by Cassandra.	2019-10-29 19:26:18 +03:00
Kamil Braun	394c36835a	sstables: report sstable data file I/O in CQL tracing Use tracing::make_traced_file when creating an sstable input_stream. To achieve that, trace_state needs to be plumbed down through some functions.	2019-10-25 14:10:28 +02:00
Raphael S. Carvalho	7f1a2156c7	table: Don't account for shared SSTables in compaction backlog tracker We don't want to add shared sstables to the table's backlog tracker because: 1) the table's backlog tracker only influences regular compaction; 2) shared sstables are never regular-compacted; they're handled by resharding, which has its own backlog tracker. Such sstables belong to more than one shard, meaning that currently they're added to the backlog tracker of every shard that owns them. However, such sstables end up being resharded on a shard that may be completely random, so increasing the backlog of all the shards they belong to won't lead to faster resharding. Also, the table's backlog tracker is supposed to deal only with regular compaction. Accounting for shared sstables in the table's tracker may lead to an incorrect speed-up of regular compactions, because the controller is not aware that some relevant part of the backlog is due to pending resharding. The fix is to ignore sstables that will be resharded, let the table's backlog tracker account only for sstables that can be worked on by regular compaction, and rely on resharding controlling itself with its own tracker. NOTE: this doesn't fix the resharding controlling issue completely, as described in #4952. We'll still need to throttle regular compaction on behalf of resharding. So subsequent work may be about: - moving resharding to its own priority class, perhaps streaming. - making resharding's backlog tracker account for sstables in all of its pending jobs, not only the ongoing ones (currently limited to 1 per shard). - limiting compaction shares when resharding is in progress. This only fixes the issue in which the controller for regular compaction accounted for sstables that are exclusive to resharding. Fixes #5077. Refs #4952. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20190924022109.17400-1-raphaelsc@scylladb.com>	2019-10-13 10:14:13 +03:00
Tomasz Grabiec	b0e0f29b06	db: read: Filter out sstables using their first and last keys Affects single-partition reads only. Refs #5113 When executing a query on the replica we do several things in order to narrow down the sstable set we read from. For tables which use LeveledCompactionStrategy, we store sstables in an interval set and we select only sstables whose partition ranges overlap with the queried range. Other compaction strategies don't organize the sstables and will select all sstables at this stage. The reasoning behind this is that for non-LCS compaction strategies the sstables' ranges will typically overlap and using interval sets in this case would not be effective and would result in quadratic (in sstable count) memory consumption. The assumption of overlap does not hold if the sstables come from repair or streaming, which generates non-overlapping sstables. At a later stage, for single-partition queries, we use the sstables' bloom filter (kept in memory) to drop sstables which surely don't contain the given partition. Then we proceed to sstable indexes to narrow down the data file range. Tables which don't use LCS will do unnecessary I/O to read index pages for single-partition reads if the partition is outside of the sstable's range and the bloom filter is ineffective (Refs #5112). This patch fixes the problem by consulting the sstable's partition range in addition to the bloom filter, so that non-overlapping sstables will be filtered out with certainty rather than depending on the bloom filter's efficiency. It's also faster to drop sstables based on the keys than the bloom filter. Tests: - unit (dev) - manual using cqlsh Reviewed-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20190927122505.21932-1-tgrabiec@scylladb.com>	2019-09-28 19:42:57 +03:00
Benny Halevy	19b67d82c9	table::on_compaction_completion: fix indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-09-02 12:15:38 +03:00
Benny Halevy	8dd6e13468	table::on_compaction_completion: wait for background deletes Don't let background deletes accumulate uncontrollably. Fixes #4909 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-09-02 12:15:38 +03:00
Benny Halevy	da6645dc2c	table: refresh_snapshot before deleting any sstables The row cache must not hold references to any sstable we're about to delete. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-09-02 12:15:29 +03:00
Botond Dénes	136fc856c5	treewide: silence discarded future warnings for questionable discards This patch silences the remaining discarded future warnings, those where it cannot be determined with reasonable confidence that this was indeed the actual intent of the author, or that the discarding of the future could lead to problems. For all those places a FIXME is added, with the intent that these will soon be followed up with an actual fix. I deliberately haven't fixed any of these, even if the fix seems trivial. It is too easy to overlook a bad fix mixed in with so many mechanical changes.	2019-08-26 19:28:43 +03:00
Botond Dénes	fddd9a88dd	treewide: silence discarded future warnings for legit discards This patch silences those future discard warnings where it is clear that discarding the future was actually the intent of the original author, and they took the necessary precautions (handling errors). The patch also adds some trivial error handling (logging the error) in some places which lacked it but otherwise look ok. No functional changes.	2019-08-26 18:54:44 +03:00
Piotr Sarna	1ab07b80b4	database: assign proper io priority for streaming view updates Streamed view updates piggybacked on the write I/O priority, which is reserved for user writes; they are now properly bound to the streaming write priority.	2019-08-20 00:24:50 +02:00
Avi Kivity	77686ab889	Merge "Make SSTable cleanup run aware" from Raphael " Fixes #4663. Fixes #4718. " * 'make_cleanup_run_aware_v3' of https://github.com/raphaelsc/scylla: tests/sstable_datafile_test: Check cleaned sstable is generated with expected run id table: Make SSTable cleanup run aware compaction: introduce constants for compaction descriptor compaction: Make it possible to config the identifier of the output sstable run table: do not rely on undefined behavior in cleanup_sstables	2019-07-31 19:10:22 +03:00
Tomasz Grabiec	7604980d63	database: Add missing partition slicing on streaming reader recreation streaming_reader_lifecycle_policy::create_reader() was ignoring the partition_slice passed to it and always creating the reader for the full slice. That's wrong because create_reader() is called when recreating a reader after it's evicted. If the reader stopped in the middle of a partition we need to start from that point. Otherwise, fragments in the mutation stream will appear duplicated or out of order, violating assumptions of the consumers. This was observed to result in repair writing incorrect sstables with duplicated clustering rows, which results in malformed_sstable_exception on read from those sstables. Fixes #4659. In v2: - Added an overload without partition_slice to avoid changing existing users which never slice Tests: - unit (dev) - manual (3 node ccm + repair) Backport: 3.1 Reviewed-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <1563451506-8871-1-git-send-email-tgrabiec@scylladb.com>	2019-07-18 18:35:28 +03:00

100 Commits