Commit Graph

656 Commits

Author SHA1 Message Date
Avi Kivity
1bcc5a1b5c Merge "database: assign proper io priority for streaming view updates" from Piotr
"
Streamed view updates were parasitizing the write io priority, which is
reserved for user writes; they are now properly bound to the streaming
write priority.

Verified manually by checking appropriate io metrics: scylla_io_queue_total_bytes{class="streaming_write" ...} vs scylla_io_queue_total_bytes{class="query" ...}

Tests: unit(dev)
"

Fixes #4615.

* 'assign_proper_io_priority_to_streaming_view_updates' of https://github.com/psarna/scylla:
  db,view: wrap view update generation in stream scheduling group
  database: assign proper io priority for streaming view updates

(cherry picked from commit 2c7435418a)
2019-08-22 16:21:42 +03:00
Benny Halevy
1e62fc8aac table: document _sstables_lock/_sstable_deletion_sem locking order
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0e4567c881)
2019-08-07 17:09:47 +03:00
Benny Halevy
ebb14d93c9 table: uninline enable_sstable_write
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit bbbd749f70)
2019-08-07 17:04:08 +03:00
Botond Dénes
7b94264ae5 multishard_mutation_query(): use correct reader concurrency semaphore
The multishard mutation query used the semaphore obtained from
`database::user_read_concurrency_sem()` to pause-resume shard readers.
This presented a problem when `multishard_mutation_query()` was reading
from system tables. In this case the readers themselves would obtain
their permits from the system read concurrency semaphore. Since the
pausing of shard readers used the user read semaphore, pausing failed to
fulfill its objective of alleviating pressure on the semaphore the reads
obtained their permits from. In some cases this led to a deadlock
during system reads.
To ensure the correct semaphore is used for pausing-resuming readers,
obtain the semaphore from the `table` object. To avoid looking up the
table on every pause or resume call, cache the semaphores when readers
are created.
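A minimal sketch of the caching idea, using hypothetical names rather than the real reader classes: the reader stores a reference to the semaphore it took its permit from at creation, so pause/resume always operate on that same semaphore.

```cpp
#include <cassert>
#include <cstddef>

// Minimal stand-in for a reader concurrency semaphore (illustrative only).
struct concurrency_semaphore {
    std::size_t permits = 0;
    void release() { ++permits; }
    bool try_acquire() { if (permits == 0) return false; --permits; return true; }
};

// A shard reader caches the semaphore it took its permit from at creation
// time, so pause() and resume() always touch the *same* semaphore --
// avoiding the user/system semaphore mix-up described above.
class shard_reader {
    concurrency_semaphore& _sem;  // cached at creation, never looked up again
    bool _has_permit = false;
public:
    explicit shard_reader(concurrency_semaphore& sem) : _sem(sem) {
        _has_permit = _sem.try_acquire();
    }
    void pause()  { if (_has_permit) { _sem.release(); _has_permit = false; } }
    void resume() { if (!_has_permit) { _has_permit = _sem.try_acquire(); } }
};

int main() {
    concurrency_semaphore system_sem{1};
    shard_reader r(system_sem);   // permit comes from the system semaphore
    r.pause();                    // ...and is returned to the same semaphore
    assert(system_sem.permits == 1);
    r.resume();
    assert(system_sem.permits == 0);
}
```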

Fixes: #4096

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <c784a3cd525ce29642d7216fbe92638fa7884e88.1547729119.git.bdenes@scylladb.com>
(cherry picked from commit 4537ec7426)
2019-01-17 18:08:01 +02:00
Raphael S. Carvalho
6a3f4fb3f9 database: Fix race condition in sstable snapshot
The race condition takes place when one of the sstables selected by snapshot
is deleted by compaction. The snapshot fails because it tries to link an
sstable that was previously unlinked by compaction's sstable deletion.

Refs #4051.

(master commit 1b7cad3531)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190110194048.26051-1-raphaelsc@scylladb.com>
2019-01-11 13:48:12 +02:00
Avi Kivity
dbe347811c Merge "materialized views: Apply backpressure from view replicas" from Duarte
"
As the amount of pending view updates increases we know that there’s a
mismatch between the rate at which the base receives writes and the
rate at which the view retires them. We react by applying backpressure
to decrease the rate of incoming base writes, allowing the slow view
replicas to catch up. We want to delay the client’s next writes to a
base replica and we use the base’s backlog of view updates to derive
this delay.

To validate this approach we tested a 3 node Scylla cluster on GCE,
using n1-standard-4 instances with NVMEs. A loader running on an
n1-standard-8 instance ran cassandra-stress with 100 threads. With the
delay function d(x) set to 1s, we see no base write timeouts. With the
delay function as defined in the series, we see that backlogs stabilize
at some (arbitrary) point, as predicted, but this stabilization
co-exists with base write timeouts. However, the system overall behaves
better than the current version, with the 100 view update limit, and
also better than the version without such limit or any backpressure.

More work is necessary to further stabilize the system. Namely, we want
to keep delaying until we see the backlog is decreasing. This will
require us to add more delay beyond the stabilization point, which in
turn should minimize the base write timeouts, and will also minimize the
amount of memory the backlog takes at each base replica.

Design document:
    https://docs.google.com/document/d/1J6GeLBvN8_c3SbLVp8YsOXHcLc9nOLlRY7pC6MH3JWo

Fixes #2538
"

Reviewed-by: Nadav Har'El <nyh@scylladb.com>

* 'materialized-views/backpressure/v2' of https://github.com/duarten/scylla: (32 commits)
  service/storage_proxy: Release mutation as early as possible
  service/storage_proxy: Delay replica writes based on view update backlog
  service/storage_proxy: Get the backlog of a particular base replica
  service/storage_proxy: Add counters for delayed base writes
  main: Start and stop the view_update_backlog_broker
  service: Distribute a node's view update backlog
  service: Advertise view update backlog over gossip
  service/storage_proxy: Send view update backlog from replicas
  service/storage_proxy: Prepare to receive replica view update backlog
  service/storage_proxy: Expose local view update backlog
  tests/view_schema_test: Add simple test for db::view::node_update_backlog
  db/view: Introduce node_update_backlog class
  db/hints: Initialize current backlog
  database: Add counter for current view backlog
  database: Expose current memory view update backlog
  idl: Add db::view::update_backlog
  db/view: Add view_update_backlog
  database: Wait on view update semaphore for view building
  service/storage_proxy: Use near-infinite timeouts for view updates
  database: generate_and_propagate_view_updates no longer needs a timeout
  ...

(cherry picked from commit b66f59aa3d)
2018-12-20 19:11:56 +02:00
Avi Kivity
16ee3b3ebe Merge "Make inactive shard readers evictable" from Botond
"
This series attempts to solve the regressions recently discovered in
performance of multi-partition range-scans. Namely that they:
* Flood the reader concurrency semaphore's queues, trampling other
  reads.
* Behave very badly when too many of them are running concurrently
  (thrashing).
* May deadlock if enough of them are running without a timeout.

The solution for these problems is to make inactive shard readers
evictable. This should address all three issues listed above, to varying
degrees:
* Shard readers will no longer cling to their permits for the entire
  duration of the scan, which might be a long time.
* Will be less affected by infinite concurrency (more than the node can
  handle) as each scan now can make progress by evicting inactive shard
  readers belonging to other scans.
* Will not deadlock at all.

In addition to the above fix, this series also bundles two further
improvements:
* Add a mechanism to `reader_concurrency_semaphore` to be notified of
  newly inserted evictables.
* General cleanups and fixes for `multishard_combining_reader` and
  `foreign_reader`.

I can unbundle these mini-series and send them separately, if the
maintainers so prefer, although considering that this series will have to
be backported to 3.0, I think this present form is better.

Fixes: #3835
"

* 'evictable-inactive-shard-readers/v7' of https://github.com/denesb/scylla: (27 commits)
  tests/multishard_mutation_query_test: test stateless query too
  tests/querier_cache: fail resource-based eviction test gracefully
  tests/querier_cache: simplify resource-based eviction test
  tests/mutation_reader_test: add test_multishard_combining_reader_next_partition
  tests/mutation_reader_test: restore indentation
  tests/mutation_reader_test: enrich pause-related multishard reader test
  multishard_combining_reader: use pause-resume API
  query::partition_slice: add clear_ranges() method
  position_in_partition: add region() accessor
  foreign_reader: add pause-resume API
  tests/mutation_reader_test: implement the pause-resume API
  query_mutations_on_all_shards(): implement pause-resume API
  make_multishard_streaming_reader(): implement the pause-resume API
  database: add accessors for user and streaming concurrency semaphores
  reader_lifecycle_policy: extend with a pause-resume API
  query_mutations_on_all_shards(): restore indentation
  query_mutations_on_all_shards(): simplify the state-machine
  multishard_combining_reader: use the reader lifecycle policy
  multishard_combining_reader: add reader lifecycle policy
  multishard_combining_reader: drop unnecessary `reader_promise` member
  ...

(cherry picked from commit 414b14a6bd)
2018-12-04 12:13:13 +02:00
Duarte Nunes
b72a94b53e Merge 'Fix checking if system tables need view updates' from Piotr
"
This miniseries ensures that system tables are not checked
for having view updates, because they never have any.
What's more, a distributed system table is used in the process,
so it's unsafe to query that table while streaming it.

Tests: unit (release), dtest(update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_decommission_node_2_test)
"

* 'fix_checking_if_system_tables_need_view_updates_3' of https://github.com/psarna/scylla:
  streaming: don't check view building of system tables
  database: add is_internal_keyspace
  streaming: remove unused sstable_is_staging bool class

(cherry picked from commit d09d4bbd91)
2018-11-28 15:39:34 +00:00
Glauber Costa
f81fa5f75c remove monitor if sstable write failed
In (almost) all SSTable write paths, we need to inform the monitor that
the write has failed as well. The monitor will remove the SSTable from
controller's tracking at that point.

Except there is one place where we are not doing that: streaming of big
mutations. Streaming of big mutations is an interesting use case because
it is done in two parts: if writing the SSTable fails right away, then we
do the correct thing.

But the SSTables are not committed at that point and the monitors are
still kept around with the SSTables until a later time, when they are
finally committed. Between those two points in time, it is possible that
the streaming code will detect a failure and manually call
fail_streaming_mutations(), which marks the SSTable for deletion. At
that point we should propagate that information to the monitor as well,
but we don't.
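A sketch of the missing propagation, with hypothetical types; the point is that fail_streaming_mutations() now informs the monitor too.

```cpp
#include <cstdio>

// Illustrative sketch: every sstable write path must tell the monitor when
// the write fails, so the controller stops tracking the sstable. The fix
// extends this to the deferred-commit streaming path as well.
struct write_monitor {
    bool tracking = true;
    void write_failed() {
        tracking = false;
        std::puts("monitor: dropped sstable from controller tracking");
    }
};

void fail_streaming_mutations(write_monitor& mon) {
    // Previously this marked the sstable for deletion but skipped the
    // monitor; now the failure is propagated to the monitor as well.
    std::puts("marking sstable for deletion");
    mon.write_failed();
}

int main() {
    write_monitor mon;
    fail_streaming_mutations(mon);
}
```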

Fixes #3732 (hopefully)
Tests: unit (release)

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181114213618.16789-1-glauber@scylladb.com>
(cherry picked from commit 9f403334c8)
2018-11-20 19:27:54 +02:00
Duarte Nunes
9776a048e7 Merge 'Generating view updates during streaming' from Piotr
During streaming, there are cases when we should invoke the view write
path. In particular, if we're streaming because of repair or if a view
has not yet finished building and we're bootstrapping a new node.

The design constraints are:
1) The streamed writes should be visible to new writes, but the
   sstable should not participate in compaction, or we would lose the
   ability to exclude the streamed writes on a restart;
2) The streamed writes must not be considered when generating view
   updates for them;
3) Resilient to node restarts;
4) Resilient to concurrent stream sessions, possibly streaming mutations for overlapping ranges.

We achieve this by writing the streamed writes to an sstable in a
different folder, call it "staging". We achieve 1) by publishing the
sstable to the column family sstable set, but excluding it from
compactions. We do these steps upon boot, by looking at the staging
directory, thus achieving 3).
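A sketch of the scheme under simplified, hypothetical types: staging sstables are published for reads but flagged so compaction skips them.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Illustrative sketch of the staging scheme: streamed sstables are written
// under a "staging" subdirectory and flagged so the compaction strategy
// skips them, while reads still see them via the table's sstable set.
struct sstable {
    std::string dir;
    bool staging;  // staging sstables are published for reads but
                   // excluded from compaction until view updates are done
};

sstable make_streamed_sstable(const std::string& table_dir) {
    return {table_dir + "/staging", true};
}

int main() {
    std::vector<sstable> sstable_set;
    sstable_set.push_back(make_streamed_sstable("/var/lib/scylla/data/ks/t"));
    for (const auto& sst : sstable_set) {
        std::printf("%s: %s\n", sst.dir.c_str(),
                    sst.staging ? "readable, not compacted"
                                : "readable, compactable");
    }
}
```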

Fixes #3275

* 'streaming_view_to_staging_sstables_9' of https://github.com/psarna/scylla: (29 commits)
  tests: add materialized views test
  tests: add view update generator to cql test env
  main: add registering staging sstables read from disk
  database: add a check if loaded sstable is already staging
  database: add get_staging_sstable method
  streaming: stream tables with views through staging sstables
  streaming: add system distributed keyspace ref to streaming
  streaming: add view update generator reference to streaming
  main: add generating missed mv updates from staging sstables
  storage_service: move initializing sys_dist_ks before bootstrap
  db/view: add view_update_from_staging_generator service
  db/view: add view updating consumer
  table: add stream_view_replica_updates
  table: split push_view_replica_updates
  table: add as_mutation_source_excluding
  table: move push_view_replica_updates to table.cc
  database: add populating tables with staging sstables
  database: add creating /staging directory for sstables
  database: add sstable-excluding reader
  table: add move_sstable_from_staging_in_thread function
  ...

(cherry picked from commit a38f6078fb)
2018-11-15 17:46:20 +02:00
Raphael S. Carvalho
745e35fa82 database: Fix sstable resharding for mc format
The sstable mc format doesn't write ancestors to metadata, so resharding
will not work with this new format, because it relies on ancestors to
replace old shared sstables with new unshared ones.
The fix is to stop relying on ancestor metadata for this operation.

Fixes #3777.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180922211933.1987-1-raphaelsc@scylladb.com>
2018-09-25 18:37:48 +03:00
Botond Dénes
253407bdc8 multishard_mutation_query: add badness counters
Add badness counters that allow tracking problems. The following
counters are added:
1) multishard_query_unpopped_fragments
2) multishard_query_unpopped_bytes
3) multishard_query_failed_reader_stops
4) multishard_query_failed_reader_saves

The first pair of counters observe the amount of work range scan queries
have to undo on each page. It is normal for these counters to be
non-zero, however sudden spikes in their values can indicate problems.
This undoing of work is needed for stateful range-scans to work.
When stateful queries are enabled the `multishard_combining_reader` is
dismantled and all unconsumed fragments in its own and any of its
intermediate readers' buffers are pushed back into the originating shard
reader's buffer (via `unpop_mutation_fragment()`). This also includes
the `partition_start`, the `static_row` (if there is one) and all
extracted and active `range_tombstone` fragments. Together these can
amount to a substantial number of fragments.
(1) counts the number of fragments moved back, while (2) counts the
number of bytes. Monitoring size and quantity separately allows for
detecting edge cases like moving many small fragments or just a few huge
ones. The counters count the fragments/bytes moved back to readers
located on the shard they belong to.

The second pair of counters are added to detect any problems around
saving readers. Since the failure to save a reader will not fail the
read itself, it is necessary to add visibility to these failures by
other means.
(3) counts the number of times stopping a shard reader (waiting
on pending read-aheads and next-partitions) failed while (4)
counts the number of times inserting the reader into the `querier_cache`
failed.
Contrary to the first two counters, which will almost certainly never be
zero, these latter two counters should always be zero. Any other value
indicates problems in the respective shards/nodes.
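A sketch of the counters as plain integers (in Scylla they are exported as metrics; names below mirror the list above):

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative sketch of the four counters described above.
struct multishard_query_stats {
    uint64_t unpopped_fragments = 0;   // (1) fragments pushed back per page
    uint64_t unpopped_bytes = 0;       // (2) bytes pushed back per page
    uint64_t failed_reader_stops = 0;  // (3) should always be zero
    uint64_t failed_reader_saves = 0;  // (4) should always be zero
};

// Called when a fragment is returned to its originating shard reader's buffer.
void on_unpop(multishard_query_stats& stats, uint64_t fragment_bytes) {
    stats.unpopped_fragments += 1;
    stats.unpopped_bytes += fragment_bytes;
}

int main() {
    multishard_query_stats stats;
    on_unpop(stats, 128);   // e.g. a partition_start pushed back at page end
    on_unpop(stats, 4096);  // e.g. an active range_tombstone pushed back
    std::printf("unpopped: %llu fragments, %llu bytes\n",
                (unsigned long long)stats.unpopped_fragments,
                (unsigned long long)stats.unpopped_bytes);
}
```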
2018-09-03 10:31:44 +03:00
Botond Dénes
97364c7ad9 database: add query_mutations_on_all_shards()
This method allows for querying a range or ranges on all shards of the
node. Under the hood it uses the multishard_combining_reader for
executing the query.
It supports paging and stateful queries (saving and reusing the readers
between pages). All this is transparent to the client, who only needs to
supply the same query::read_command::query_uuid through the pages of the
query (and supply a correct start position on each page, matching the
stop position of the previous page).
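A sketch of the paging contract from the client's point of view, with made-up types: the same uuid is sent on every page, and the server resumes from the readers saved under that uuid.

```cpp
#include <cstdio>
#include <map>
#include <string>

// Illustrative sketch of the paging contract: the client sends the same
// query uuid on every page, and each page starts where the last stopped.
// The server uses the uuid to find the saved readers for that query.
struct read_command { std::string query_uuid; int start_position; };

std::map<std::string, int> saved_readers;  // uuid -> resume position

int run_page(const read_command& cmd, int page_size) {
    auto it = saved_readers.find(cmd.query_uuid);
    int pos = (it != saved_readers.end()) ? it->second : cmd.start_position;
    int stop = pos + page_size;            // "read" one page
    saved_readers[cmd.query_uuid] = stop;  // save readers for the next page
    std::printf("page [%d, %d)\n", pos, stop);
    return stop;
}

int main() {
    read_command cmd{"3f2a-example-uuid", 0};  // same uuid on every page
    cmd.start_position = run_page(cmd, 100);   // page 1: [0, 100)
    cmd.start_position = run_page(cmd, 100);   // page 2: [100, 200)
}
```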
2018-09-03 10:31:44 +03:00
Botond Dénes
5f726e9a89 querier: move all to query namespace
To avoid name clashes.
2018-09-03 10:31:44 +03:00
Avi Kivity
37f9a3c566 database: make database's mutation apply stage inherit its scheduling group from the caller
Like the two preceding patches, convert the mutation apply stage
to an inheriting_concrete_scheduling_group.  This change has two
added benefits: we get rid of a thread_local, and we drop a
with_scheduling_group() inside an execution stage which just creates a bunch
of continuations and somewhat undoes the benefit of the execution stage.
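A toy model of the difference, not Seastar's actual execution-stage API: an inheriting stage captures the caller's scheduling group at enqueue time and runs each batched item under it.

```cpp
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

// Illustrative sketch: an inheriting execution stage batches calls but
// remembers the scheduling group of each caller, instead of running the
// whole batch under one fixed group.
struct scheduling_group { std::string name; };
thread_local scheduling_group current_group{"main"};

struct inheriting_stage {
    struct item { scheduling_group group; std::function<void()> work; };
    std::vector<item> queue;

    void enqueue(std::function<void()> work) {
        queue.push_back({current_group, std::move(work)});  // capture caller's group
    }
    void flush() {
        for (auto& it : queue) {
            auto saved = current_group;
            current_group = it.group;  // run under the enqueuer's group
            it.work();
            current_group = saved;
        }
        queue.clear();
    }
};

int main() {
    inheriting_stage stage;
    current_group = {"statement"};
    stage.enqueue([] { std::printf("applied in '%s'\n", current_group.name.c_str()); });
    current_group = {"streaming"};
    stage.enqueue([] { std::printf("applied in '%s'\n", current_group.name.c_str()); });
    stage.flush();
}
```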
2018-08-24 19:04:49 +03:00
Avi Kivity
596fb6f2f7 database: make database::_data_query_stage inherit its caller's scheduling_group
Now (8c993e0728) that replica-side operations run under the correct
scheduling group, we can inherit the scheduling_group for _data_query_stage
from the caller.  By itself this doesn't do much, but it will later allow us
to have multiple groups for statement executions.
2018-08-24 19:04:49 +03:00
Avi Kivity
ef9b36376c Merge "database: support multiple data directories" from Glauber
"
While Cassandra supports multiple data directories, we have been
historically supporting just one. The one-directory model suits us
better because of the I/O Scheduler and so far we have seen very few
requests -- if any, to support this.

Still, the infrastructure needed to support multiple directories can be
beneficial so I am trying to bring this in.

For simplicity, we will treat the first directory in the list as the
main directory. By being able to still associate one singular directory
with a table, most of the code doesn't have to change and we don't have
to worry about how to distribute data between the directories.

In this design:
- We scan all data directories for existing data.
- resharding only happens within a particular data directory.
- snapshot details are accumulated with data for all directories that
  host snapshots for the tables we are examining
- snapshots are created with files in their own directories, but the
  manifest file goes to the main directory. For this one, note that in
  Cassandra the same thing happens, except that there is no "main"
  directory. Still, the manifest file is just in one of them.
- SSTables are flushed into the main directory.
- Compactions write data into the main directory

Despite the restrictions, one example use of this is recovery. If
we have network attached devices for instance, we can quickly attach a
network device to an existing node and make the data immediately
available as it is compacted back to main storage.

Tests: unit (release)
"

* 'multi-data-file-v2' of github.com:glommer/scylla:
  database: change ident
  database: support multiple data directories
  database: allow resharing to specify a directory
  database: support multiple directories in get_snapshot_details
  database: move get_snapshot_info into a seastar::thread
  snapshots: always create the snapshot directory
  sstables: pass sstable dir with entry descriptor
  database: make nodetool listsnapshots print correct information
  sstables: correctly create descriptors for snapshots
2018-07-15 13:31:04 +03:00
Asias He
6540051f77 database: Add add_sstable_and_update_cache
Since streaming can write mutations directly to sstables, we need
to add those sstables to the system so they can be seen by queries.
We also need to update the cache so queries reflect the latest data.
2018-07-13 08:36:45 +08:00
Asias He
dfc2739625 database: Add make_streaming_sstable_for_write
This will be used to create an sstable for the streaming receiver, to
write the mutations received from the network to an sstable file instead
of to a memtable.
2018-07-13 08:36:45 +08:00
Glauber Costa
99c8a1917f database: support multiple data directories
While Cassandra supports multiple data directories, we have been
historically supporting just one. The one-directory model suits us
better because of the I/O Scheduler, and so far we have seen very few
requests -- if any -- to support this.

Still, the infrastructure needed to support multiple directories can be
beneficial so I am trying to bring this in.

For simplicity, we will treat the first directory in the list as the
main directory. By being able to still associate one singular directory
with a table, most of the code doesn't have to change and we don't have
to worry about how to distribute data between the directories.

In this design:
 - We scan all data directories for existing data.
 - resharding only happens within a particular data directory.
 - snapshot details are accumulated with data for all directories that
   host snapshots for the tables we are examining
 - snapshots are created with files in their own directories, but the
   manifest file goes to the main directory. For this one, note that in
   Cassandra the same thing happens, except that there is no "main"
   directory. Still, the manifest file is just in one of them.
 - SSTables are flushed into the main directory.
 - Compactions write data into the main directory

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:58:39 -04:00
Avi Kivity
f3da043230 Merge "Make in-memory partition version merging preemptable" from Tomasz
"
Partition snapshots go away when the last read using the snapshot is done.
Currently we will synchronously attempt to merge partition versions on this event.
If partitions are large, that may stall the reactor for a significant amount of time,
depending on the size of newer versions. Cache update on memtable flush can
create especially large versions.

The solution implemented in this series is to allow merging to be preemptable,
and continue in the background. Background merging is done by the mutation_cleaner
associated with the container (memtable, cache). There is a single merging process
per mutation_cleaner. The merging worker runs in a separate scheduling group,
introduced here, called "mem_compaction".

When the last user of a snapshot goes away the snapshot is slided to the
oldest unreferenced version first so that the version is no longer reachable
from partition_entry::read(). The cleaner will then keep merging preceding
(newer) versions into it, until it merges a version which is referenced. The
merging is preemtable. If the initial merging is preempted, the snapshot is
enqueued into the cleaner, the worker woken up, and merging will continue
asynchronously.

When memtable is merged with cache, its cleaner is merged with cache cleaner,
so any outstanding background merges will be continued by the cache cleaner
without disruption.

This reduces scheduling latency spikes in tests/perf_row_cache_update
for the case of a large partition with many rows. For -c1 -m1G I saw
them dropping from >23ms to 1-2ms. System-level benchmark using scylla-bench
shows a similar improvement.
"

* tag 'tgrabiec/merge-snapshots-gradually-v4' of github.com:tgrabiec/scylla:
  tests: perf_row_cache_update: Test with an active reader surviving memtable flush
  memtable, cache: Run mutation_cleaner worker in its own scheduling group
  mutation_cleaner: Make merge() redirect old instance to the new one
  mvcc: Use RAII to ensure that partition versions are merged
  mvcc: Merge partition version versions gradually in the background
  mutation_partition: Make merging preemtable
  tests: mvcc: Use the standard maybe_merge_versions() to merge snapshots
2018-07-01 15:32:51 +03:00
Tomasz Grabiec
074be4d4e8 memtable, cache: Run mutation_cleaner worker in its own scheduling group
The worker is responsible for merging MVCC snapshots, which is similar
to merging sstables, but in memory. The new scheduling group will
therefore be called "memory compaction".

We should run it in a separate scheduling group instead of
main/memtables, so that it doesn't disrupt writes and other system
activities. It's also nice for monitoring how much CPU time we spend
on this.
2018-06-27 21:51:04 +02:00
Piotr Sarna
e1a867cbe3 database: add phaser for reads
Currently drop_column_family waits on the write_in_progress phaser,
but there's no such mechanism for reads. This commit adds
a corresponding phaser for reads.
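A minimal model of what a phaser provides (the real implementation is asynchronous): drop_column_family() waits until every read that entered before the drop has left.

```cpp
#include <cstdio>

// Illustrative sketch of a phaser for reads, mirroring write_in_progress.
struct phaser {
    long in_progress = 0;
    void enter() { ++in_progress; }
    void leave() { --in_progress; }
    bool quiesced() const { return in_progress == 0; }
};

int main() {
    phaser reads;
    reads.enter();                       // a read starts
    std::printf("reads in progress: %ld\n", reads.in_progress);
    reads.leave();                       // the read finishes
    // drop_column_family() would (asynchronously) wait for quiesced() here.
    std::printf("safe to drop: %s\n", reads.quiesced() ? "yes" : "no");
}
```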

Refs #3357

Reported-by: Duarte Nunes <duarte@scylladb.com>
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
Message-Id: <70b5fdd44efbc24df61585baef024b809cabe527.1529928323.git.sarna@scylladb.com>
2018-06-27 10:02:56 +01:00
Paweł Dziepak
96b0577343 row_cache: deglobalise row cache tracker
The row cache tracker has numerous implicit dependencies on other objects
(e.g. LSA migrators for data held by mutation_cleaner). The fact that
both cache tracker and some of those dependencies are thread local
objects makes it hard to guarantee correct destruction order.

Let's deglobalise the cache tracker and put it in the database class.
2018-06-25 09:37:43 +01:00
Avi Kivity
cb549c767a database: rename column_family to table
The name "column_family" is both awkward and obsolete. Rename to
the modern and accurate "table".

An alias is kept to avoid huge code churn.

To prevent a One Definition Rule violation, a preexisting "table"
type is moved to a new namespace row_cache_stress_test.

Tests: unit (release)
Message-Id: <20180624065238.26481-1-avi@scylladb.com>
2018-06-24 14:54:46 +03:00
Glauber Costa
290d553c3a compaction_strategy: allow the user to tell us if min_threshold has to be strict
Now that we have the controller, we would like to take min_threshold as
a hint. If there is nothing to compact, we can ignore that and start
compacting fewer than min_threshold SSTables so that the backlog keeps
reducing.

But there are cases in which we don't want min_threshold to be a hint
and we want to enforce it strictly -- for instance, when write amplification
is more of a concern than space amplification.

This patch adds a YAML option that allows the user to tell us that. We will
default to false, meaning min_threshold is not strictly enforced.
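A sketch of the decision the option controls; the parameter name below is a stand-in for the YAML option, not necessarily its actual key:

```cpp
#include <cstdio>

// Illustrative decision logic only; the behavior is as the commit message
// describes, the function itself is a sketch.
bool should_compact(int candidates, int min_threshold, bool strict_min_threshold) {
    if (strict_min_threshold) {
        return candidates >= min_threshold;  // enforce min_threshold strictly
    }
    return candidates >= 2;  // hint: any mergeable set keeps the backlog shrinking
}

int main() {
    // 3 candidate sstables, min_threshold = 4:
    std::printf("hint mode:   %d\n", should_compact(3, 4, false));  // 1 (compact)
    std::printf("strict mode: %d\n", should_compact(3, 4, true));   // 0 (wait)
}
```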

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-15 13:42:43 -04:00
Gleb Natapov
f41575a156 Provide available memory size to database object during creation 2018-06-11 15:34:13 +03:00
Avi Kivity
2582f53b44 Merge "database and API: Add column_family::get_sstables_by_key" from Amnon
"
This series is for nodetool getsstables.

This patch is based on:
8daaf9833a

With some minor adjustments because of the code change in sstables.

The idea is to allow searching for all the sstables that contain a
given key.

After this patch, if there is a table t1 in keyspace k1 that has a key
called aa, then

curl -X GET "http://localhost:10000/column_family/sstables/by_key/k1%3At1?key=aa"

will return the list of sstable file names that contain that key.
"

* 'amnon/sstable_for_key_v4' of github.com:scylladb/seastar-dev:
  Add the API implementation to get_sstables_by_key
  api: column_family.json make the get_sstables_for_key doc clearer
  column_family: Add the get_sstables_by_partition_key method
  sstable test: add has_partition_key test
  sstable: Add has_partition_key method
  keys_test: add a test for nodetool_style string
  keys: Add from_nodetool_style_string factory method
2018-06-10 16:53:56 +03:00
Amnon Heiman
acb0a738eb column_family: Add the get_sstables_by_partition_key method
The get_sstables_by_partition_key method is used by the API to return the
set of sstable names that hold a given partition key.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-06-10 16:13:01 +03:00
Asias He
6496cdf0fb db: Get rid of the streaming memtable delayed flush
In 455d5a5 (streaming memtables: coalesce incoming writes), we
introduced the delayed flush to coalesce incoming streaming mutations
from different stream_plans.

However, most of the time there is only one stream plan at a time; the
next stream plan won't start until the previous one is finished. So the
current coalescing does not really work.

The delayed flush adds 2s of delay for each stream session. If we have lots
of tables to stream, we will waste a lot of time.

We stream a keyspace in around 10 stream plans, i.e., 10% of the ranges at a
time. If we have 5000 tables, even if the tables are almost empty, the
delay will waste 5000 * 10 * 2 = 100,000 seconds -- about 28 hours.

To stream a keyspace with 4 tables, each table has 1000 rows.

Before:

 [shard 0] stream_session - [Stream #944373d0-5d9c-11e8-9cdb-000000000000] Executing streaming plan for Bootstrap-ks-index-0 with peers={127.0.0.1}, master
 [shard 0] stream_session - [Stream #944373d0-5d9c-11e8-9cdb-000000000000] Streaming plan for Bootstrap-ks-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1030 KiB, 125.21 KiB/s
 [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks succeeded, took 8.233 seconds

After:

 [shard 0] stream_session - [Stream #e00bf6a0-5d99-11e8-a7b8-000000000000] Executing streaming plan for Bootstrap-ks-index-0 with peers={127.0.0.1}, master
 [shard 0] stream_session - [Stream #e00bf6a0-5d99-11e8-a7b8-000000000000] Streaming plan for Bootstrap-ks-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1030 KiB, 4772.32 KiB/s
 [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks succeeded, took 0.216 seconds

Fixes #3436

Message-Id: <cb2dde263782d2a2915ddfe678c74f9637ffd65b.1526979175.git.asias@scylladb.com>
2018-06-06 10:16:02 +03:00
Piotr Sarna
f8237dd664 database: do not truncate already removed views
This commit clears the table's views before truncating it
in the drop_column_family function. The only case in which
views are not empty during a drop is when they're backing secondary
indexes of a base table and they are all atomically dropped
in the same go as the base table itself.
This change prevents trying to truncate views that were
already dropped, which used to result in a no_such_column_family error.

References #3202
2018-05-22 21:10:51 +02:00
Duarte Nunes
a3bbd52e2e Merge 'Add materialized view metrics' from Piotr
"
This series introduces materialized view statistics, as stated in issue #3385:
 - updates pushed
 - updates failed
 - row lock stats

It also addresses issue #3416 by decoupling user write stats from view
update stats.
"

* 'materialized_view_metrics_9' of https://github.com/psarna/scylla:
  view: adapt view_stats to act as write stats
  storage_proxy: decouple write_stats from stats
  db: add row locking metrics
  view: add view metrics
2018-05-22 18:41:51 +01:00
Piotr Sarna
9246bb36bc db: add row locking metrics
This commit adds statistics to the row_locker class. Metrics are
independently counted for all lock types: row<->partition and
exclusive<->shared.

Metrics gathered:
 - total acquisitions
 - operations that wait on the lock
 - histogram of the time spent waiting on this type of lock

References #3385
References #3416
2018-05-22 16:52:58 +02:00
Piotr Sarna
49bebcfa25 view: add view metrics
This commit introduces view statistics:
 - updates pushed to local/remote replicas
 - updates failed to be pushed to local/remote replicas

Metrics are kept on a per-table basis, i.e. updates_pushed_remote
shows the total number of updates (mutations) pushed to all paired
mv replicas that this particular table has.
Every single update is taken into consideration, so if a view update
requires removing a row from one view and adding a row to another,
it will be counted as 2 updates.

References #3385
References #3416
2018-05-22 16:52:58 +02:00
Glauber Costa
d758a416f8 backlog_controller: move compaction controller to the compaction manager
There was recently an attempt to add minimum shares to major compactions
which ended up being harder than it should be due to all the plumbing
necessary to call the compaction controller from inside the compaction
manager -- since it is currently a database object. We had this problem
again when trying to return fixed shares in case of an exception.

Taking a step back, all of those problems stem from the fact that the
compaction controller really shouldn't be a part of the database: as it
deals with compactions and its consequences it is a lot more natural to
have it inside the compaction manager to begin with.

Once we do that, all the aforementioned problems go away. So let's move
it there, where it belongs.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:24:19 -04:00
Duarte Nunes
c053275a48 db/view/row_locking: Add timeout when waiting for the lock
This ensures we respect the write timeout set by the client when
applying base writes, in case a write takes too long to acquire the
row lock for the read-before-write phase of a materialized view
update.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180507132755.8751-1-duarte@scylladb.com>
2018-05-07 18:22:39 +01:00
Duarte Nunes
4b3562c3f5 db/view: Limit number of pending view updates
This patch adds a simple and naive mechanism to ensure a base replica
doesn't overwhelm a potentially overloaded view replica by sending too
many concurrent view updates. We add a semaphore to limit the number of
outstanding view updates to 100. We limit globally per shard, and
not per destination view replica. We also limit statically.
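A sketch of the limiting scheme with a plain counting semaphore (the real code uses an asynchronous semaphore and waits rather than polling):

```cpp
#include <cstdio>

// Illustrative per-shard limit: a counting semaphore with 100 units, one
// taken per outstanding view update regardless of destination replica.
struct view_update_semaphore {
    int units;
    bool try_take() { if (units == 0) return false; --units; return true; }
    void give_back() { ++units; }
};

int main() {
    view_update_semaphore sem{100};  // global per shard, static limit
    int in_flight = 0, waiting = 0;
    for (int i = 0; i < 150; ++i) {  // 150 updates arrive at once
        if (sem.try_take()) {
            ++in_flight;             // update is sent
        } else {
            ++waiting;               // would wait for a unit to free up
        }
    }
    std::printf("in flight: %d, waiting: %d\n", in_flight, waiting);
}
```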

Refs #2538

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180426134457.21290-2-duarte@scylladb.com>
2018-05-07 11:25:27 +03:00
Piotr Sarna
fe02c3d0e2 database, sstables, tests: add large_partition_handler
This commit makes database, sstables and tests aware
of which large_partition_handler they use.
The proper large_partition_handler is retrievable from config information
and is based on the existing compaction_large_partition_warning_threshold_mb
entry. Right now the CQL TABLE variant of large_partition_handler is used
in the database.

Tests use a NOP version of large_partition_handler, which does not
depend on CQL queries at all.
2018-05-04 14:38:13 +02:00
Duarte Nunes
f298f57137 column_family: Add function to populate views
The populate_views() function takes a set of views to update, a
token to select base table partitions, and the set of sstables to
query. This lays the foundation for a view building mechanism to exist,
which walks over a given base table, reads data token-by-token,
calculates view updates (in a simplified way, compared to the existing
functions that push view updates), and sends them to the paired view
replicas.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
67dd3e6e5d column_family: Allow synchronizing with in-progress writes
This patch adds a mechanism to class column_family through which we
can synchronize with in-progress writes. This is useful for code that,
after some modification, needs to ensure that new writes will see it
before it can proceed.

In particular, this will be used by the view building code, which needs
to wait until the in-progress writes, which may have missed that there
is now a view, are observable to the view building code.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
9b9ba525f7 database: Add get_views() function
Returns all the schemas that are views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Glauber Costa
9188059427 database: group statements in their own scheduling group
When we introduced the CPU scheduler, we also introduced a group
for commitlog -- but never used it. There is also doubtful value in
separating reads from writes, since they are often part of the same
workload.

To accommodate that, let's rename the query group to "statement"
(query is not incorrect, just confusing), and move the write path,
currently ungrouped, inside it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-20 16:58:36 -04:00
Botond Dénes
c0009750c3 Add unit test for resource based cache eviction
Specifically for the reader-permit based eviction. This test lives in a
separate executable as it uses with_cql_test_env() and thus needs a
main() of its own.
2018-03-13 16:20:50 +02:00
Botond Dénes
d5bcadcfda Time-based cache eviction
Cached queriers should not sit in the cache indefinitely; otherwise
abandoned reads would cause excess and unnecessary resource usage. Attach
an expiry timer to each cache entry, which evicts it after the TTL
passes.
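A sketch of the expiry logic, modeling the per-entry timer as an explicit check against "now" (not the actual querier_cache code):

```cpp
#include <chrono>
#include <cstdio>

// Illustrative sketch: each cached entry records its expiry; the per-entry
// timer is modeled here by an explicit check.
struct cached_querier {
    std::chrono::steady_clock::time_point expiry;
    bool expired(std::chrono::steady_clock::time_point now) const {
        return now >= expiry;
    }
};

int main() {
    auto now = std::chrono::steady_clock::now();
    auto ttl = std::chrono::seconds(30);
    cached_querier q{now + ttl};
    std::printf("at now+60s, evict: %s\n",
                q.expired(now + std::chrono::seconds(60)) ? "yes" : "no"); // yes
}
```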
2018-03-13 10:34:34 +02:00
Botond Dénes
ff808d9ce6 Save and restore queriers in mutation_query() and data_query()
Use the querier_cache (represented by the passed-in
querier_cache_context) object to look up saved queriers at the start of
the page and save them at the end of it if it is likely that there will
be more page requests.
2018-03-13 10:34:34 +02:00
Botond Dénes
1259031af3 Use the reader_concurrency_semaphore to limit reader concurrency 2018-03-08 14:12:12 +02:00
Botond Dénes
d5bb8a47fc mv reader_resource_tracker.hh -> reader_concurrency_semaphore.hh
In preparation for reader_concurrency_semaphore being added to the file.
The reader_resource_tracker is really only a helper class for
reader_concurrency_semaphore so the latter is better suited to provide
the name of the file.
2018-03-08 10:29:16 +02:00
Raphael S. Carvalho
aa75684ee7 sstables: Warn when an extra-large partition is written
Based on https://issues.apache.org/jira/browse/CASSANDRA-9643

With the compaction_large_partition_warning_threshold_mb option set to 1,
an example of the output follows:

WARN  2018-02-22 19:52:11,029 [shard 0] sstable - Writing large
row system/local:{key: pk{00056c6f63616c}, token:-7564491331177403445}
(1276758 bytes)

Fixes #2209.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180306175912.19259-1-raphaelsc@scylladb.com>
2018-03-07 15:49:46 +00:00
Duarte Nunes
76e6423910 database: Truncate views when truncating the base table
Fixes #3200

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180211124218.41373-1-duarte@scylladb.com>
2018-02-27 15:54:43 +02:00
Avi Kivity
432268f582 Merge "branch 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla" from Raphael
"The motivation is that it's no longer needed after new resharding
algorithm that is the sole responsible for working with shared
sstables and regular compaction will not work with those!
So resharding will schedule deletion of shared sstables once it's
certain that shards that own them have the new unshared sstables.
The manager was needed for orchestrating deletion of shared sstable
across shards. It brings extra complexity that's not longer needed,
and it was also overloading shard 0, but the latter could have
been fixed.

Tests:
- unit: release mode
- dtest: resharding_test.py"

* 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla:
  Remove SSTable's atomic deletion manager
  Stop using SSTable's atomic deletion manager
  database: split column_family::rebuild_sstable_list
2018-02-08 19:10:16 +02:00