A race condition takes place when one of the sstables selected by a
snapshot is deleted by compaction. The snapshot fails because it tries
to link an sstable that was previously unlinked by compaction's sstable
deletion.
Refs #4051.
(master commit 1b7cad3531)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190110194048.26051-1-raphaelsc@scylladb.com>
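Below is a toy reproduction of the failure mode described above, using plain
std::filesystem instead of Scylla's snapshot code; the file names are made up.

    #include <cstdio>
    #include <filesystem>
    #include <system_error>

    namespace fs = std::filesystem;

    int main() {
        // Stand-in sstable component and snapshot target (names are made up).
        fs::path data = "la-1-big-Data.db";
        fs::path snap = "snapshots/s1/la-1-big-Data.db";

        if (std::FILE* f = std::fopen(data.string().c_str(), "w")) {
            std::fclose(f);                                 // "flush" creates the sstable
        }
        fs::create_directories(snap.parent_path());

        fs::remove(data);                                   // "compaction" unlinks it first...
        std::error_code ec;
        fs::create_hard_link(data, snap, ec);               // ...so the snapshot's link fails
        std::printf("hard link failed: %s\n", ec.message().c_str());
    }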
If the compaction manager is started, compactions may start (regardless
of whether or not we trigger them). The problem with that is that they
start at a time when we are flushing the commitlog, and the
initialization procedure waits for the commitlog to be fully flushed and
the resulting memtables flushed before we move on.
Because there are no incoming writes, the amount of shares given to
memtable flushes decreases as the memory used decreases, and that can
cause the startup procedure to take a long time.
We have recently started to bump the shares manually for manual flushes.
While that guarantees that we will not drive the shares to zero, I will
make the argument that we can do better by making sure that those things
are, at this point, running alone: user experience is affected by
startup times, and the bump we give to user-triggered operations will
only do so much. Even if we increase the shares a lot, flushes will
still be fighting for resources with compactions and startup will take
longer than it could.
By making sure that flushes are at this point running alone, we improve
the user experience by making startup as fast as it can be.
There is a similar problem at the drain level, which is also fixed in this
series.
Fixes #3958
* git@github.com:glommer/scylla.git faster-restart
compaction_manager: delay initialization of the compaction manager.
drain: stop compactions early
(cherry picked from commit 3e70ae1d06)
"
As the amount of pending view updates increases we know that there’s a
mismatch between the rate at which the base receives writes and the
rate at which the view retires them. We react by applying backpressure
to decrease the rate of incoming base writes, allowing the slow view
replicas to catch up. We want to delay the client’s next writes to a
base replica and we use the base’s backlog of view updates to derive
this delay.
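As a sketch of the idea only (the actual d(x) used in this series differs and
is not reproduced here), the delay can be any monotonic function of the base
replica's view-update backlog, clamped at a maximum; the constants below are
assumptions:

    #include <algorithm>
    #include <chrono>
    #include <cstdio>

    // Hypothetical mapping from view-update backlog to write delay; a monotonic
    // function clamped at a maximum delay. Not the d(x) from the series.
    std::chrono::milliseconds delay_for_backlog(std::size_t backlog_bytes) {
        constexpr std::size_t max_backlog = 100u << 20;        // assume 100 MiB counts as "full"
        constexpr std::chrono::milliseconds max_delay{1000};   // assume a 1s cap, as in the d(x)=1s test
        double x = std::min<double>(backlog_bytes, max_backlog) / max_backlog;
        return std::chrono::milliseconds{static_cast<long>(max_delay.count() * x)};
    }

    int main() {
        for (std::size_t mib : {0, 10, 50, 200}) {
            auto d = delay_for_backlog(mib << 20);
            std::printf("backlog %3zu MiB -> delay %4lld ms\n",
                        mib, static_cast<long long>(d.count()));
        }
    }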
To validate this approach we tested a 3-node Scylla cluster on GCE,
using n1-standard-4 instances with NVMe disks. A loader running on an
n1-standard-8 instance ran cassandra-stress with 100 threads. With the
delay function d(x) set to 1s, we see no base write timeouts. With the
delay function as defined in the series, we see that backlogs stabilize
at some (arbitrary) point, as predicted, but this stabilization
coexists with base write timeouts. However, the system overall behaves
better than the current version, with the 100 view update limit, and
also better than the version without such a limit or any backpressure.
More work is necessary to further stabilize the system. Namely, we want
to keep delaying until we see the backlog is decreasing. This will
require us to add more delay beyond the stabilization point, which in
turn should minimize the base write timeouts, and will also minimize the
amount of memory the backlog takes at each base replica.
Design document:
https://docs.google.com/document/d/1J6GeLBvN8_c3SbLVp8YsOXHcLc9nOLlRY7pC6MH3JWo
Fixes #2538
"
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
* 'materialized-views/backpressure/v2' of https://github.com/duarten/scylla: (32 commits)
service/storage_proxy: Release mutation as early as possible
service/storage_proxy: Delay replica writes based on view update backlog
service/storage_proxy: Get the backlog of a particular base replica
service/storage_proxy: Add counters for delayed base writes
main: Start and stop the view_update_backlog_broker
service: Distribute a node's view update backlog
service: Advertise view update backlog over gossip
service/storage_proxy: Send view update backlog from replicas
service/storage_proxy: Prepare to receive replica view update backlog
service/storage_proxy: Expose local view update backlog
tests/view_schema_test: Add simple test for db::view::node_update_backlog
db/view: Introduce node_update_backlog class
db/hints: Initialize current backlog
database: Add counter for current view backlog
database: Expose current memory view update backlog
idl: Add db::view::update_backlog
db/view: Add view_update_backlog
database: Wait on view update semaphore for view building
service/storage_proxy: Use near-infinite timeouts for view updates
database: generate_and_propagate_view_updates no longer needs a timeout
...
(cherry picked from commit b66f59aa3d)
"
This series attempts to solve the regressions recently discovered in
performance of multi-partition range-scans. Namely that they:
* Flood the reader concurrency semaphore's queues, trampling other
reads.
* Behave very badly when too many of them are running concurrently
(thrashing).
* May deadlock if enough of them are running without a timeout.
The solution for these problems is to make inactive shard readers
evictable. This should address all three issues listed above, to varying
degrees:
* Shard readers will now not cling to their permits for the entire
duration of the scan, which might be a long time.
* Will be less affected by infinite concurrency (more than the node can
handle) as each scan now can make progress by evicting inactive shard
readers belonging to other scans.
* Will not deadlock at all.
In addition to the above fix, this series also bundles two further
improvements:
* Add a mechanism to `reader_concurrency_semaphore` to be notified of
newly inserted evictables.
* General cleanups and fixes for `multishard_combining_reader` and
`foreign_reader`.
I can unbundle these mini-series and send them separately, if the
maintainers so prefer, although considering that this series will have to
be backported to 3.0, I think this present form is better.
Fixes: #3835
"
* 'evictable-inactive-shard-readers/v7' of https://github.com/denesb/scylla: (27 commits)
tests/multishard_mutation_query_test: test stateless query too
tests/querier_cache: fail resource-based eviction test gracefully
tests/querier_cache: simplify resource-based eviction test
tests/mutation_reader_test: add test_multishard_combining_reader_next_partition
tests/mutation_reader_test: restore indentation
tests/mutation_reader_test: enrich pause-related multishard reader test
multishard_combining_reader: use pause-resume API
query::partition_slice: add clear_ranges() method
position_in_partition: add region() accessor
foreign_reader: add pause-resume API
tests/mutation_reader_test: implement the pause-resume API
query_mutations_on_all_shards(): implement pause-resume API
make_multishard_streaming_reader(): implement the pause-resume API
database: add accessors for user and streaming concurrency semaphores
reader_lifecycle_policy: extend with a pause-resume API
query_mutations_on_all_shards(): restore indentation
query_mutations_on_all_shards(): simplify the state-machine
multishard_combining_reader: use the reader lifecycle policy
multishard_combining_reader: add reader lifecycle policy
multishard_combining_reader: drop unnecessary `reader_promise` member
...
(cherry picked from commit 414b14a6bd)
"
This patchset addresses two recently discovered bugs, both triggered by
summary regeneration.
Tests: unit {release}
+ Validated with a debug build of Scylla (ASAN) that no use-after-free
occurs when re-generating Summary.db.
"
* 'projects/sstables-30/summary-regeneration/v1' of https://github.com/argenet/scylla:
tests: Add test reading SSTables in 'mc' format with missing summary.
sstables: When loading, read statistics before summary.
database: Capture io_priority_class by reference to avoid dangling ref.
(cherry picked from commit 009cbd3dcb)
"
This mini-series ensures that system tables are not checked
for having view updates, because they never have any.
What's more, a distributed system table is used in the process,
so it's unsafe to query that table while streaming it.
Tests: unit (release), dtest(update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_decommission_node_2_test)
"
* 'fix_checking_if_system_tables_need_view_updates_3' of https://github.com/psarna/scylla:
streaming: don't check view building of system tables
database: add is_internal_keyspace
streaming: remove unused sstable_is_staging bool class
(cherry picked from commit d09d4bbd91)
In (almost) all SSTable write paths, we need to inform the monitor that
the write has failed as well. The monitor will remove the SSTable from
controller's tracking at that point.
Except there is one place where we are not doing that: streaming of big
mutations. Streaming of big mutations is an interesting use case, in
which the write is done in two parts: if the writing of the SSTable
fails right away, then we do the correct thing.
But the SSTables are not committed at that point and the monitors are
still kept around with the SSTables until a later time, when they are
finally committed. Between those two points in time, it is possible that
the streaming code will detect a failure and manually call
fail_streaming_mutations(), which marks the SSTable for deletion. At
that point we should propagate that information to the monitor as well,
but we don't.
Fixes #3732 (hopefully)
Tests: unit (release)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181114213618.16789-1-glauber@scylladb.com>
(cherry picked from commit 9f403334c8)
During streaming, there are cases when we should invoke the view write
path. In particular, if we're streaming because of repair or if a view
has not yet finished building and we're bootstrapping a new node.
The design constraints are:
1) The streamed writes should be visible to new writes, but the
sstable should not participate in compaction, or we would lose the
ability to exclude the streamed writes on a restart;
2) The streamed writes must not be considered when generating view
updates for them;
3) Resilient to node restarts;
4) Resilient to concurrent stream sessions, possibly streaming mutations for overlapping ranges.
We achieve this by writing the streamed writes to an sstable in a
different folder, call it "staging". We achieve 1) by publishing the
sstable to the column family sstable set, but excluding it from
compactions. We repeat these steps upon boot, by looking at the staging
directory, thus achieving 3).
Fixes #3275
* 'streaming_view_to_staging_sstables_9' of https://github.com/psarna/scylla: (29 commits)
tests: add materialized views test
tests: add view update generator to cql test env
main: add registering staging sstables read from disk
database: add a check if loaded sstable is already staging
database: add get_staging_sstable method
streaming: stream tables with views through staging sstables
streaming: add system distributed keyspace ref to streaming
streaming: add view update generator reference to streaming
main: add generating missed mv updates from staging sstables
storage_service: move initializing sys_dist_ks before bootstrap
db/view: add view_update_from_staging_generator service
db/view: add view updating consumer
table: add stream_view_replica_updates
table: split push_view_replica_updates
table: add as_mutation_source_excluding
table: move push_view_replica_updates to table.cc
database: add populating tables with staging sstables
database: add creating /staging directory for sstables
database: add sstable-excluding reader
table: add move_sstable_from_staging_in_thread function
...
(cherry picked from commit a38f6078fb)
We have found issues when a flush is requested outside the usual
memtable flush loop: because there is not a lot of data, the
controller will not have a high amount of shares.
To prevent this, this patch guarantees some minimum amount of shares
when extraneous operations (nodetool flush, commitlog-driven flush, etc.)
are requested.
Another option would be to add shares instead of guaranteeing a minimum.
But in my view the approach I am taking here has two main advantages:
1) It won't cause spikes when those operations are requested.
2) It is cumbersome to add shares in the current infrastructure, as just
adding backlog can cause shares to spike. Consider this example:
the backlog is within the first range of very low backlog (~0.2). Shares
for this would be around ~20. If we want to add 200 shares, that is
equivalent to a backlog of 0.8. Once we add those two backlogs
together, we end up with 1 (max backlog).
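The difference between the two options can be sketched as below; the linear
backlog-to-shares mapping and the constants are made up for illustration and
do not match the real controller's curve:

    #include <algorithm>
    #include <cstdio>

    constexpr float max_shares = 1000.0f;

    // Made-up mapping: shares grow with backlog and saturate at max_shares
    // when backlog reaches 1.0 (the real controller's curve is different).
    float shares_from_backlog(float backlog) {
        return std::min(backlog, 1.0f) * max_shares;
    }

    int main() {
        const float extra = 200.0f;                 // shares we want manual flushes to get
        for (float backlog : {0.05f, 0.9f}) {
            // Option taken by the patch: guarantee a floor; no spike at high backlog.
            float floored = std::max(shares_from_backlog(backlog), extra);
            // Rejected option: translate the extra shares into extra backlog and add;
            // at high backlog this saturates straight to the maximum (a spike).
            float added = shares_from_backlog(backlog + extra / max_shares);
            std::printf("backlog %.2f -> floor: %4.0f shares, add-backlog: %4.0f shares\n",
                        backlog, floored, added);
        }
    }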
Fixes #3761
Tests: unit (release)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180927131904.8826-1-glauber@scylladb.com>
"
This patchset makes it possible to use SSTables 'mc' format, commonly
referred to as 'SSTables 3.x', when running Scylla instance.
Several bugs found along the way are fixed. Also, a configuration option
is introduced to allow running Scylla either with 'mc' or 'la' format
as default.
Tests: unit {release}
+ tested Scylla with both 'la' and 'mc' formats to work fine:
cqlsh> CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> USE test;
cqlsh:test> CREATE TABLE cfsst3 (pk int, ck int, rc int, PRIMARY KEY (pk, ck)) WITH compression = {'sstable_compression': ''};
cqlsh:test> INSERT INTO cfsst3 (pk, ck, rc) VALUES ( 4, 7, 8);
<<flush>>
cqlsh:test> DELETE from cfsst3 WHERE pk = 4 and ck> 3 and ck < 8;
<<flush>>
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 2, 3);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 4, 6);
cqlsh:test> SELECT * FROM cfsst3 ;
pk | ck | rc
----+----+------
2 | 3 | null
4 | 6 | null
(2 rows)
<<Scylla restart>>
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 5, 7);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 6, 8);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 7, 9);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 8, 10);
cqlsh:test> SELECT * from cfsst3 ;
pk | ck | rc
----+----+------
5 | 7 | null
8 | 10 | null
2 | 3 | null
4 | 6 | null
7 | 9 | null
6 | 8 | null
(6 rows)
"
* 'projects/sstables-30/try-runtime/v8' of https://github.com/argenet/scylla:
database: Honour enable_sstables_mc_format configuration option.
sstables: Support SSTables 'mc' format as a feature.
db: Add configuration option for enabling SSTables 'mc' format.
tests: Add test for reading a complex column with zero subcolumns (SST3).
sstables: Fix parsing of complex columns with zero subcolumns.
sstables: Explicitly cast api::timestamp_type to uint64_t when delta-encoding.
sstables: Use parser_type instead of abstract_type::parse_type in column_translation.
bytes: Add helper for turning bytes_view into sstring_view.
sstables: Only forward the call to fast_forwarding_to in mp_row_consumer_m if filter exists.
sstables: Fix string formatting for exception messages in m_format_read_helpers.
sstables: Don't validate timestamps against the max value on parsing.
sstables: Always store only min bases in serialization_header.
sstables: Support 'mc' version parsing from filename.
SST3: Make sure we call consume_partition_end
Only enable SSTables 'mc' format if the entire cluster supports it and
it is enabled in the configuration file.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
SSTable format 'mc' doesn't write ancestors to metadata, so resharding
will not work with this new format because it relies on ancestors to
replace new unshared sstables with old shared ones.
The fix is to stop relying on ancestors metadata for this operation.
Fixes #3777.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180922211933.1987-1-raphaelsc@scylladb.com>
Currently timeout is opt-in, that is, all methods that even have it
default it to `db::no_timeout`. This means that ensuring timeout is used
where it should be is completely up to the author and the reviewers of
the code. As humans are notoriously prone to mistakes, this has resulted
in a very inconsistent usage of timeout, many clients of
`flat_mutation_reader` passing the timeout only to some members and only
on certain call sites. This is small wonder considering that some core
operations like `operator()()` only recently received a timeout
parameter and others like `peek()` didn't even have one until this
patch. Both of these methods call `fill_buffer()` which potentially
talks to the lower layers and is supposed to propagate the timeout.
All this makes the `flat_mutation_reader`'s timeout effectively useless.
To bring order to this chaos, make the timeout parameter mandatory
on all `flat_mutation_reader` methods that need it. This ensures that
humans now get a reminder from the compiler when they forget to pass the
timeout. Clients can still opt-out from passing a timeout by passing
`db::no_timeout` (the previous default value) but this will be now
explicit and developers should think before typing it.
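A self-contained sketch of the before/after shape of the API (simplified;
these are not the real `flat_mutation_reader` signatures, and std::chrono
stands in for `db::timeout_clock`):

    #include <chrono>
    #include <cstdio>

    using timeout_point = std::chrono::steady_clock::time_point;
    const timeout_point no_timeout = timeout_point::max();   // stand-in for db::no_timeout

    // Before: the timeout is opt-in via a default argument, so a call site
    // that forgets it silently runs without one.
    void fill_buffer_before(timeout_point timeout = no_timeout) {
        std::printf("before: timeout %s\n", timeout == no_timeout ? "omitted" : "set");
    }

    // After: the parameter is mandatory; opting out requires spelling out
    // no_timeout at the call site, where a reviewer will see it.
    void fill_buffer_after(timeout_point timeout) {
        std::printf("after:  timeout %s\n", timeout == no_timeout ? "explicitly none" : "set");
    }

    int main() {
        fill_buffer_before();                         // compiles; timeout silently dropped
        // fill_buffer_after();                       // would no longer compile
        fill_buffer_after(no_timeout);                // explicit opt-out
        fill_buffer_after(std::chrono::steady_clock::now() + std::chrono::seconds(1));
    }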
There were surprisingly few core call sites to fix up. Where a timeout
was available nearby I propagated it to be able to pass it to the
reader, where I couldn't I passed `db::no_timeout`. Authors of the
latter kind of code (view, streaming and repair are some of the notable
examples) should maybe consider propagating down a timeout if needed.
In the test code (the vast majority of the changes) I just used
`db::no_timeout` everywhere.
Tests: unit(release, debug)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <1edc10802d5eb23de8af28c9f48b8d3be0f1a468.1536744563.git.bdenes@scylladb.com>
Add badness counters that allow tracking problems. The following
counters are added:
1) multishard_query_unpopped_fragments
2) multishard_query_unpopped_bytes
3) multishard_query_failed_reader_stops
4) multishard_query_failed_reader_saves
The first pair of counters observes the amount of work range scan queries
have to undo on each page. It is normal for these counters to be
non-zero, however sudden spikes in their values can indicate problems.
This undoing of work is needed for stateful range-scans to work.
When stateful queries are enabled the `multishard_combining_reader` is
dismantled and all unconsumed fragments in its and any of its
intermediate readers' buffers are pushed back into the originating shard
reader's buffer (via `unpop_mutation_fragment()`). This also includes
the `partition_start`, the `static_row` (if there is one) and all
extracted and active `range_tombstone` fragments. Together this can
amount to a substantial number of fragments.
(1) counts the number of fragments moved back, while (2) counts the
number of bytes. Monitoring size and quantity separately allows for
detecting edge cases like moving many small fragments or just a few huge
ones. The counters count the fragments/bytes moved back to readers
located on the shard they belong to.
The second pair of counters are added to detect any problems around
saving readers. Since the failure to save a reader will not fail the
read itself, it is necessary to add visibility to these failures by
other means.
(3) counts the number of times stopping a shard reader (waiting
on pending read-aheads and next-partitions) failed while (4)
counts the number of times inserting the reader into the `querier_cache`
failed.
Contrary to the first two counters, which will almost certainly never be
zero, these latter two counters should always be zero. Any other value
indicates problems in the respective shards/nodes.
When we load new SSTables, we use the directory information from the
entry descriptor to build information about those SSTables. When the
descriptor is created by flush_upload_dir, the sstable directory used in
the descriptor contains the `upload` part. Therefore, we will try to
load SSTables that are in the upload directory when we have already
moved them out, and fail.
Since the generation also changes, we have been historically fixing the
generation manually, but not the SSTable directory. The reason for that
is that up until recently, the SSTable directory was passed statically
to open_sstables, ignoring whatever the entry descriptor said. Now that
the sstable directory is also derived from the entry descriptor, we
should fix that too.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180829165326.12183-1-glauber@scylladb.com>
"
Previous work (71471bb322) converted the CQL layer to inheriting
execution stages, paving the way to multiple users sharing the front-end.
This patchset does the same thing to the back-end, converting more execution
stages to preserve the caller's scheduling_group. Since RPC now (8c993e0728)
assigns the correct scheduling group within the replica, we can extend that
work so a statement is executed with the same scheduling group all the way
to sstable parsing, even if we cross nodes in the process. This improves
performance isolation and paves the way to multi-user SLA guarantees.
"
* tag 'inherit-sched_group/v1' of https://github.com/avikivity/scylla:
database: make database's mutation apply stage inherit its scheduling group from the caller
database: make database::_mutation_query_stage inherit the scheduling group
database: make database::_data_query_stage inheriting its caller's scheduling_group
storage_proxy: make _mutate_stage inherit its caller's scheduling_group
There could be soft pressure, but the soft-pressure flusher may not be
able to make progress (Refs #3716). It will keep trying to flush empty
memtables, which block on earlier flushes to complete, and thus
allocate continuations in memory. Those continuations accumulate in
memory and can cause OOM.
Meanwhile, the flush will take longer to complete and, due to scheduling
group isolation, the soft-pressure flusher will keep getting the CPU.
This causes bad_alloc and crashes of dtest:
limits_test.py:TestLimits.max_cells_test
Fixes #3717
Message-Id: <1535102520-23039-1-git-send-email-tgrabiec@scylladb.com>
The flusher picks the memtable list which contains the largest region
according to region_impl::evictable_occupancy().total_space(), which
follows region::occupancy().total_space(). But only the latest
memtable in the list can start flushing. It can happen that the
memtable corresponding to the largest region was already flushed to an
sstable (flush permit released), but not yet fsynced or moved to
cache, so it's still in the memtable list.
The latest memtable in the winning list may be small, or empty, in
which case the soft pressure flusher will not be able to make much
progress. There could be other memtable lists with non-empty
(flushable) latest memtables. This can lead to writes unnecessarily
blocking on dirty.
I observed this for the system memtable group, where it's easy for the
memtables to overshoot small soft pressure limits. The flusher kept
trying to flush empty memtables, while the previous non-empty memtable
was still in the group.
The CPU scheduler makes this worse, because it runs memtable_to_cache
in a separate scheduling group, so it further defers in time the
removal of the flushed memtable from the memtable list.
This patch fixes the problem by making regions corresponding to
memtables which started flushing report evictable_occupancy() as 0, so
that they're picked by the flusher last.
Fixes #3716.
Message-Id: <1535040132-11153-2-git-send-email-tgrabiec@scylladb.com>
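A toy model of the selection change above (illustrative only; the region and
memtable-list types are stand-ins, not the real dirty memory manager): once a
region's memtable has started flushing it reports zero evictable occupancy, so
a list whose only big memtable is already mid-flush no longer wins over a list
with a smaller but actually flushable memtable.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Stand-in for an LSA region backing one memtable.
    struct region {
        std::size_t total_space;
        bool flushing;                               // memtable already started flushing
        std::size_t evictable_occupancy() const { return flushing ? 0 : total_space; }
    };

    // A memtable list scores as the largest evictable occupancy among its regions.
    std::size_t score(const std::vector<region>& list) {
        std::size_t best = 0;
        for (const auto& r : list) best = std::max(best, r.evictable_occupancy());
        return best;
    }

    int main() {
        std::vector<std::vector<region>> lists = {
            {{10u << 20, true}, {64, false}},        // huge memtable, but already being flushed
            {{2u << 20, false}},                     // smaller memtable that can actually flush
        };
        auto winner = std::max_element(lists.begin(), lists.end(),
            [](const auto& a, const auto& b) { return score(a) < score(b); });
        std::printf("flusher picks list %td (score %zu bytes)\n",
                    winner - lists.begin(), score(*winner));
    }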
Like the two preceding patches, convert the mutation apply stage
to an inheriting_concrete_execution_stage. This change has two
added benefits: we get rid of a thread_local, and we drop a
with_scheduling_group() inside an execution stage which just creates a bunch
of continuations and somewhat undoes the benefit of the execution stage.
Now (8c993e0728) that replica-side operations run under the correct
scheduling group, we can inherit the scheduling_group for _data_query_stage
from the caller. By itself this doesn't do much, but it will later allow us
to have multiple groups for statement executions.
dirty memory managers run code on behalf of their callers
in a background fiber, so provide that background fiber with
the scheduling group appropriate to their caller.
- system: main (we want to let system writes through quickly)
- dirty: statement (normal user writes)
- streaming: streaming (streaming writes)
"
While Cassandra supports multiple data directories, we have been
historically supporting just one. The one-directory model suits us
better because of the I/O Scheduler and so far we have seen very few
requests -- if any, to support this.
Still, the infrastructure needed to support multiple directories can be
beneficial so I am trying to bring this in.
For simplicity, we will treat the first directory in the list as the
main directory. By being able to still associate one singular directory
with a table, most of the code doesn't have to change and we don't have
to worry about how to distribute data between the directories.
In this design:
- We scan all data directories for existing data.
- resharding only happens within a particular data directory.
- snapshot details are accumulated with data for all directories that
host snapshots for the tables we are examining
- snapshots are created with files in their own directories, but the
manifest file goes to the main directory. For this one, note that in
Cassandra the same thing happens, except that there is no "main"
directory. Still, the manifest file is just in one of them.
- SSTables are flushed into the main directory.
- Compactions write data into the main directory
Despite the restrictions, one example of usage of this is recovery. If
we have network attached devices for instance, we can quickly attach a
network device to an existing node and make the data immediately
available as it is compacted back to main storage.
Tests: unit (release)
"
* 'multi-data-file-v2' of github.com:glommer/scylla:
database: change ident
database: support multiple data directories
database: allow resharding to specify a directory
database: support multiple directories in get_snapshot_details
database: move get_snapshot_info into a seastar::thread
snapshots: always create the snapshot directory
sstables: pass sstable dir with entry descriptor
database: make nodetool listsnapshots print correct information
sstables: correctly create descriptors for snapshots
Since we can write mutations to an sstable directly in streaming, we need
to add those sstables to the system so they can be seen by queries.
Also we need to update the cache so the query reflects the latest data.
This will be used to create an sstable for the streaming receiver to write
the mutations received from the network to an sstable file instead of
writing to a memtable.
It's not very helpful in normal operation, and generates much noise,
especially when there are many tables.
Message-Id: <20180708070051.8508-1-avi@scylladb.com>
While Cassandra supports multiple data directories, we have been
historically supporting just one. The one-directory model suits us
better because of the I/O Scheduler and so far we have seen very few
requests -- if any, to support this.
Still, the infrastructure needed to support multiple directories can be
beneficial so I am trying to bring this in.
For simplicity, we will treat the first directory in the list as the
main directory. By being able to still associate one singular directory
with a table, most of the code doesn't have to change and we don't have
to worry about how to distribute data between the directories.
In this design:
- We scan all data directories for existing data.
- resharding only happens within a particular data directory.
- snapshot details are accumulated with data for all directories that
host snapshots for the tables we are examining
- snapshots are created with files in their own directories, but the
manifest file goes to the main directory. For this one, note that in
Cassandra the same thing happens, except that there is no "main"
directory. Still, the manifest file is just in one of them.
- SSTables are flushed into the main directory.
- Compactions write data into the main directory
Signed-off-by: Glauber Costa <glauber@scylladb.com>
resharding assumes that all SSTables will be in cf->dir(), but in
reality we will soon have sstables in other places. We can specify a
directory in get_all_shared_sstables and specify that directory from the
resharding process.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
I am about to add another level of indentation and this code already
shifts right too much. It is not performance critical, so let's use a
thread for that. seastar::threads did not exist when this was first
written.
Also remove one unused continuation from inside the inner scan,
simplifying its code.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We currently don't always create the snapshot directory as an
optimization. We have a test at sync time handling this use case.
This works well when all SSTables are created in the same directory, but
if we have more than one data directory then it may not work if we don't
have SSTables in all data directories.
We can fix it by unconditionally creating the directory.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We have been assuming that all SSTables for a table will be in the same
directory, and we pass the directory name to make_descriptor only
because that's the way in ka and la to find out the keyspace and table
names.
However, SSTables for a given column family could be spread into
multiple directories. So let's pass it down with the descriptor so we
can load from the right place.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
nodetool listsnapshots is currently printing zero sizes for all snapshots.
The reason for that is that we are moving the snapshot directory name in
the capture list, and the compiler is allowed to evaluate that capture
before the same name is used as the function parameter.
Fixes #3572
Signed-off-by: Glauber Costa <glauber@scylladb.com>
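The bug pattern, reduced to a self-contained C++ example (the helper and the
names are made up; this is not the Scylla code): the same variable is moved in
a lambda init-capture and passed, by value, as another argument of the same
call, and the compiler may evaluate the capture first.

    #include <cstdio>
    #include <string>
    #include <utility>

    // Made-up helper standing in for the snapshot-size routine: runs the body,
    // then sizes the directory it was handed (by value).
    template <typename Func>
    void for_snapshot_dir(std::string dir, Func body) {
        body();
        std::printf("sizing snapshot dir: '%s'\n", dir.c_str());
    }

    int main() {
        std::string snapshot_dir = "/var/lib/scylla/data/ks/t/snapshots/s1";
        // Function arguments are evaluated in an unspecified order, so the
        // init-capture below may move out of snapshot_dir *before* the first
        // argument copies it -- the helper then sees an empty path, which is
        // how listsnapshots ended up reporting zero sizes.
        for_snapshot_dir(snapshot_dir,
                         [d = std::move(snapshot_dir)] { std::printf("snapshotting '%s'\n", d.c_str()); });
        // Fix: don't consume the name in the capture list of the same call --
        // copy it for the capture, or move it only after the call.
    }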
Passing the current read position to the
`incremental_selector::select()` can lead to "jumping" through sstables.
This can happen when the currently open sstables have no partition that
intersects with a yet unselected sstable that has an intersecting range
nevertheless, in other words there is a gap in the selected sstables
that this unselected one completely fits into. In this case the
unselected sstable will be completely omitted from the read.
The solution is to avoid calling `select()` with a position that is
larger than the `next_position` returned from the previous `select()`
call. Instead, call `select()` repeatedly with the `next_position` from
the previous call, until either at least one new sstable is selected or
the current read position is surpassed. This guarantees that no sstables
will be jumped over. In other words, the incremental selector is advanced
at a pace defined by itself, guaranteeing that no sstable will be
jumped over.
sstable_set::incremental_selector was migrated to ring position; follow
suit and migrate the reader_selector to use ring_position as well. Beyond
correctness this also improves efficiency in the case of dense tables,
avoiding prematurely selecting sstables that share the token but start
at different keys, although one could argue that this is a niche case.
Currently `sstable_set::incremental_selector` works in terms of tokens.
Sstables can be selected with tokens and internally the token-space is
partitioned (in `partitioned_sstable_set`, used for LCS) with tokens as
well. This is problematic for several reasons.
The sub-range of the token-space that sstables cover is defined in terms
of decorated keys. It is even possible that multiple sstables cover
multiple non-overlapping sub-ranges of a single token. The current
system is unable to model this and will at best result in selecting
unnecessary sstables.
The usage of token for providing the next position where the
intersecting sstables change [1] causes further problems. Attempting to
walk over the token-space by repeatedly calling `select()` with the
`next_position` returned from the previous call will quite possibly lead
to an infinite loop as a token cannot express inclusiveness/exclusiveness
and thus the incremental selector will not be able to make progress when
the upper and lower bounds of two neighbouring intervals share the same
token with different inclusiveness e.g. [t1, t2](t2, t3].
To solve these problems update incremental_selector to work in terms of
ring position. This makes it possible to partition the token-space
among sstables at decorated key granularity. It also makes it possible
for select() to return a next_position that is guaranteed to make
progress.
partitioned_sstable_set now builds the internal interval map using the
decorated key of the sstables, not just the tokens.
incremental_selector::select() now uses `dht::ring_position_view` as
both the selector and the next_position. ring_position_view can express
positions between keys so it can also include information about
inclusiveness/exclusiveness of the next interval guaranteeing forward
progress.
[1] `sstable_set::incremental_selector::selection::next_position`
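The stalling case can be seen with a minimal position type (a sketch, not the
real `dht::ring_position_view`): a bare token cannot distinguish "at t2" from
"just after t2", so the selector cannot step from [t1, t2] into (t2, t3],
while a position carrying an inclusiveness bit can.

    #include <cstdio>
    #include <tuple>

    // Sketch of a position that, unlike a bare token, can point "just after"
    // a token (the role ring_position_view plays for the real selector).
    struct position {
        int token;
        bool after;   // false: at the token, true: immediately after it
    };

    bool operator<(const position& a, const position& b) {
        return std::tie(a.token, a.after) < std::tie(b.token, b.after);
    }

    int main() {
        int t2 = 2;
        position end_of_first{t2, false};     // upper bound of [t1, t2]
        position start_of_second{t2, true};   // lower bound of (t2, t3]
        // With tokens only, both bounds are just "t2", so re-selecting at t2
        // keeps returning the first interval and never advances.
        std::printf("tokens only: %d vs %d -> no way to make progress\n", t2, t2);
        // With positions, next_position can be "after t2", which is strictly
        // greater, so the selector is guaranteed to move forward.
        std::printf("ring positions: end-of-first < start-of-second? %s\n",
                    end_of_first < start_of_second ? "yes, progress" : "no");
    }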
"
Partition snapshots go away when the last read using the snapshot is done.
Currently we will synchronously attempt to merge partition versions on this event.
If partitions are large, that may stall the reactor for a significant amount of time,
depending on the size of newer versions. Cache update on memtable flush can
create especially large versions.
The solution implemented in this series is to allow merging to be preemptable,
and continue in the background. Background merging is done by the mutation_cleaner
associated with the container (memtable, cache). There is a single merging process
per mutation_cleaner. The merging worker runs in a separate scheduling group,
introduced here, called "mem_compaction".
When the last user of a snapshot goes away, the snapshot is slid to the
oldest unreferenced version first so that the version is no longer reachable
from partition_entry::read(). The cleaner will then keep merging preceding
(newer) versions into it, until it merges a version which is referenced. The
merging is preemptable. If the initial merging is preempted, the snapshot is
enqueued into the cleaner, the worker woken up, and merging will continue
asynchronously.
When memtable is merged with cache, its cleaner is merged with cache cleaner,
so any outstanding background merges will be continued by the cache cleaner
without disruption.
This reduces scheduling latency spikes in tests/perf_row_cache_update
for the case of a large partition with many rows. For -c1 -m1G I saw
them dropping from >23ms to 1-2ms. System-level benchmark using scylla-bench
shows a similar improvement.
"
* tag 'tgrabiec/merge-snapshots-gradually-v4' of github.com:tgrabiec/scylla:
tests: perf_row_cache_update: Test with an active reader surviving memtable flush
memtable, cache: Run mutation_cleaner worker in its own scheduling group
mutation_cleaner: Make merge() redirect old instance to the new one
mvcc: Use RAII to ensure that partition versions are merged
mvcc: Merge partition version versions gradually in the background
mutation_partition: Make merging preemtable
tests: mvcc: Use the standard maybe_merge_versions() to merge snapshots
The worker is responsible for merging MVCC snapshots, which is similar
to merging sstables, but in memory. The new scheduling group will
therefore be called "memory compaction".
We should run it in a separate scheduling group instead of
main/memtables, so that it doesn't disrupt writes and other system
activities. It's also nice for monitoring how much CPU time we spend
on this.
"
With DateTiered and TimeWindow, there is a read optimization enabled
which excludes sstables based on overlap with recorded min/max values
of clustering key components. The problem is that it doesn't take into
account partition tombstones and static rows, which should still be
returned by the reader even if there is no overlap in the query's
clustering range. A read which returns no clustering rows can
mispopulate the cache, which will appear as a partition deletion or writes
to the static row being lost, until node restart or eviction of the
partition entry.
There is also a bad interaction between cache population on read and
that optimization. When the clustering range of the query doesn't
overlap with any sstable, the reader will return no partition markers
for the read, which leads cache populator to assume there is no
partition in sstables and it will cache an empty partition. This will
cause later reads of that partition to miss prior writes to that
partition until it is evicted from cache or node is restarted.
Disable until a more elaborate fix is implemented.
Fixes #3552
Fixes #3553
"
* tag 'tgrabiec/disable-min-max-sstable-filtering-v1' of github.com:tgrabiec/scylla:
tests: Add test for slicing a mutation source with date tiered compaction strategy
tests: Check that database conforms to mutation source
database: Disable sstable filtering based on min/max clustering key components
drop_column_family now waits for both writes and reads in progress.
It solves possible liveness issues with row cache, when column_family
could be dropped prematurely, before the read request was finished.
The phaser operation is passed inside the database::query() call.
There are other places where reading logic is applied (e.g. view
replicas), but these are guarded with different synchronization
mechanisms, while _pending_reads_phaser applies to regular reads only.
Fixes#3357
Reported-by: Duarte Nunes <duarte@scylladb.com>
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
Message-Id: <d58a5ee10596d0d62c765ee2114ac171b6f087d2.1529928323.git.sarna@scylladb.com>
With DateTiered and TimeWindow, there is a read optimization enabled
which excludes sstables based on overlap with recorded min/max values
of clustering key components. The problem is that it doesn't take into
account partition tombstones and static rows, which should still be
returned by the reader even if there is no overlap in the query's
clustering range. A read which returns no clustering rows can
mispopulate the cache, which will appear as a partition deletion or writes
to the static row being lost, until node restart or eviction of the
partition entry.
There is also a bad interaction between cache population on read and
that optimization. When the clustering range of the query doesn't
overlap with any sstable, the reader will return no partition markers
for the read, which leads cache populator to assume there is no
partition in sstables and it will cache an empty partition. This will
cause later reads of that partition to miss prior writes to that
partition until it is evicted from cache or node is restarted.
Disable until a more elaborate fix is implemented.
Fixes #3552
Fixes #3553
Row cache tracker has numerous implicit dependencies on other objects
(e.g. LSA migrators for data held by mutation_cleaner). The fact that
both cache tracker and some of those dependencies are thread local
objects makes it hard to guarantee correct destruction order.
Let's deglobalise the cache tracker and put it in the database class.
The name "column_family" is both awkward and obsolete. Rename to
the modern and accurate "table".
An alias is kept to avoid huge code churn.
To prevent a One Definition Rule violation, a preexisting "table"
type is moved to a new namespace row_cache_stress_test.
Tests: unit (release)
Message-Id: <20180624065238.26481-1-avi@scylladb.com>
"
We are seeing some workloads with large datasets where the compaction
controller ends up with a lot of shares. Regardless of whether or not
we'll change the algorithm, this patchset handles a more basic issue,
which is the fact that the current controller doesn't set a maximum
explicitly, so if the input is larger than the maximum input point the
shares will keep growing without bounds.
It also pushes the maximum input point of the compaction controller from
10 to 30, allowing us to err on the side of caution for the 2.2 release.
"
* 'tame-controller' of github.com:glommer/scylla:
controller: do not increase shares of controllers for inputs higher than the maximum
controller: adjust constants for compaction controller
When dropping a table, wait for the column family to quiesce so that
no pending writes compete with the truncate operation and possibly
leave data behind on disk.
Fixes #2562
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180618193134.31971-1-duarte@scylladb.com>