Now memtables live in compaction_group. Also introduced a function
that selects a group based on a token, but today table always returns
the single group it manages. Once multiple groups are supported,
the function should interpret the token to select the correct
group.
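A minimal sketch of that selection function, with hypothetical class and member names, assuming the single-group state described above:
```
// Hypothetical sketch: token-based group selection while only one group exists.
#include <cstdint>
#include <memory>

struct compaction_group {};            // stand-in for the real class

class table {
    // Today the table owns exactly one compaction group.
    std::unique_ptr<compaction_group> _single_group = std::make_unique<compaction_group>();
public:
    // Once multiple groups are supported, this should map the token's range
    // to the owning group; for now the token is ignored.
    compaction_group& compaction_group_for_token(int64_t token) {
        (void)token;
        return *_single_group;
    }
};
```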
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The group is now responsible for providing the compound set.
table still has one compound set, which will span all groups for
the cases where we want to ignore group isolation.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This commit is restricted to moving the maintenance set into compaction_group.
Next, we'll introduce the compound set into it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This commit is restricted to moving the main set into compaction_group.
Next, we'll move the maintenance set into it and finally the memtable.
A method is introduced to figure out which group an sstable belongs
to, but it's still unimplemented as table is still limited to
a single group.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Compaction group is a new abstraction used to group SSTables
that are eligible to be compacted together. By this definition,
a table in a given shard has a single compaction group.
The problem with this approach is that data from different vnodes
is intermixed in the same sstable, making it hard to move data
in a given sstable around.
Therefore, we'll want to have multiple groups per table.
A group can be thought of as an isolated LSM tree whose memtables
and sstable files are isolated from other groups.
As for the implementation, the idea is to take a very incremental
approach.
In this commit, we're introducing a single compaction group to
table.
Next, we'll migrate sstable and maintenance set from table
into that single compaction group. And finally, the memtable.
Cache will be shared among the groups, for simplicity.
It works due to its ability to invalidate a subset of the
token range.
There will be a 1:1 relationship between compaction_group and
table_state.
We can later rename table_state to compaction_group_state.
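A rough sketch of the intended end state, with made-up member names, just to illustrate how a group isolates its own LSM tree while the cache stays at the table level:
```
// Illustrative only: each compaction_group is an isolated LSM tree; the
// cache remains shared by the table and is invalidated per token range.
#include <memory>
#include <vector>

struct memtable_list {};       // group-private write path
struct sstable_set {};         // main or maintenance set
struct row_cache {};           // shared, supports per-range invalidation

struct compaction_group {
    memtable_list memtables;
    sstable_set   main_set;          // sstables eligible for regular compaction
    sstable_set   maintenance_set;   // sstables kept out of regular compaction
};

class table {
    std::vector<std::unique_ptr<compaction_group>> _compaction_groups;  // one today, many later
    row_cache _cache;                                                   // shared among groups
};
```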
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
await_pending_ops() is today marked noexcept, so it doesn't have to
be implemented with finally() semantics.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
and override it in table::table_state to get the tombstone_gc_state
from the table's compaction_manager.
It is going to be used in the next patches to pass the gc state
from the compaction_strategy down to sstables and compaction.
table_state_for_test was modified to just keep a null
tombstone_gc_state.
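A hedged sketch of that shape (only the identifiers named above are taken from the patch; everything else is simplified):
```
// Simplified illustration of the virtual accessor and its override.
struct tombstone_gc_state;

struct table_state {
    virtual ~table_state() = default;
    // New virtual accessor introduced by this patch (signature simplified here).
    virtual const tombstone_gc_state* get_tombstone_gc_state() const = 0;
};

struct compaction_manager {
    const tombstone_gc_state* gc_state = nullptr;
    const tombstone_gc_state* get_tombstone_gc_state() const { return gc_state; }
};

// table::table_state forwards to the table's compaction_manager.
struct table_table_state : table_state {
    compaction_manager& cm;
    explicit table_table_state(compaction_manager& m) : cm(m) {}
    const tombstone_gc_state* get_tombstone_gc_state() const override {
        return cm.get_tombstone_gc_state();
    }
};

// table_state_for_test simply keeps (and returns) a null tombstone_gc_state.
struct table_state_for_test : table_state {
    const tombstone_gc_state* get_tombstone_gc_state() const override { return nullptr; }
};
```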
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be used by generate_update() for getting the
tombstone_gc_state via the table's compaction_manager.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
on_compaction_completion() is not very descriptive. Let's rename
it, following the example of
update_sstable_lists_on_off_strategy_completion().
Also, let's coroutinize it, to remove the restriction that it can
only run inside a thread.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #11407
Aborting too soon on ENOSPC is too harsh, leading to loss of
the node's availability for reads, while restarting it won't
solve the ENOSPC condition.
Fixes #11245
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #11246
Many tombstones in a partition are a problem that has been plaguing queries since the inception of Scylla (and even before that, as they are a pain in Apache Cassandra too). Tombstones don't count towards the query's page limits, neither the size limit nor the row-count one. Hence, large spans of tombstones (be they row- or range-tombstones) are problematic: the query can time out while processing such a span of tombstones, as it waits for more live rows to fill the page. In the extreme case a partition becomes entirely unreadable, with all read attempts timing out, until compaction manages to purge the tombstones.
The solution proposed in this PR is to pass down a tombstone limit to replicas: when this limit is reached, the replica cuts the page and marks it as a short one, even if the page is currently empty. To make this work, we use the last-position infrastructure added recently by 3131cbea62, so that replicas can provide the position of the last processed item to continue the next page from. Without this, no forward progress could be made in the case of an empty page: the query would continue from the same position on the next page, having to process the same span of tombstones.
The limit can be configured with the newly added `query_tombstone_limit` configuration item, defaulted to 10000. The coordinator will pass this to the newly added `tombstone_limit` field of `read_command`, if the `replica_empty_pages` cluster feature is set.
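As a hedged sketch of the coordinator-side decision (names are illustrative, not the actual code):
```
// Illustrative: pass the configured limit in read_command only once the
// cluster-wide feature guaranteeing empty-page support on replicas is on.
#include <cstdint>

struct db_config    { uint64_t query_tombstone_limit = 10000; };  // new config item
struct gms_features { bool replica_empty_pages = false; };        // new cluster feature

uint64_t get_tombstone_limit(const gms_features& features, const db_config& cfg) {
    if (!features.replica_empty_pages) {
        return 0;   // 0 keeps the old behaviour: tombstones are not counted
    }
    return cfg.query_tombstone_limit;
}
```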
An upgrade sanity test was conducted as follows:
* Created cluster of 3 nodes with RF=3 with master version
* Wrote small dataset of 1000 rows.
* Deleted prefix of 980 rows.
* Started read workload: `scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100`
* Also did some manual queries via `cqlsh` with smaller page size and tracing on.
* Stopped and upgraded each node one-by-one. New nodes were started with `--query-tombstone-page-limit=10`.
* Confirmed there are no errors or read-repairs.
Perf regression test:
```
build/release/test/perf/perf_simple_query_g -c1 -m2G --concurrency=1000 --task-quota-ms 10 --duration=60
```
Before:
```
median 133665.96 tps ( 62.0 allocs/op, 12.0 tasks/op, 43007 insns/op, 0 errors)
median absolute deviation: 973.40
maximum: 135511.63
minimum: 104978.74
```
After:
```
median 129984.90 tps ( 62.0 allocs/op, 12.0 tasks/op, 43181 insns/op, 0 errors)
median absolute deviation: 2979.13
maximum: 134538.13
minimum: 114688.07
```
Diff: +~200 instructions/op.
Fixes: https://github.com/scylladb/scylla/issues/7689
Fixes: https://github.com/scylladb/scylla/issues/3914
Fixes: https://github.com/scylladb/scylla/issues/7933
Refs: https://github.com/scylladb/scylla/issues/3672
Closes #11053
* github.com:scylladb/scylladb:
test/cql-pytest: add test for query tombstone page limit
query-result-writer: stop when tombstone-limit is reached
service/pager: prepare for empty pages
service/storage_proxy: set smallest continue pos as query's continue pos
service/storage_proxy: propagate last position on digest reads
query: result_merger::get() don't reset last-pos on short-reads and last pages
query: add tombstone-limit to read-command
service/storage_proxy: add get_tombstone_limit()
query: add tombstone_limit type
db/config: add config item for query tombstone limit
gms: add cluster feature for empty replica pages
tree: don't use query::read_command's IDL constructor
The query result writer now counts tombstones and cuts the page (marking
it as a short one) when the tombstone limit is reached. This is to avoid
timing out on large spans of tombstones, especially prefixes.
In the case of unpaged queries, we fail the read instead, similarly to
how we do with max result size.
If the limit is 0, the previous behaviour is used: tombstones are not
taken into consideration at all.
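A minimal sketch of the counting-and-cutting behaviour, with invented names; the real result writer is a streaming consumer and considerably more involved:
```
// Illustrative: count tombstones and cut the page short once the limit is hit.
#include <cstdint>

struct tombstone_accounting {
    uint64_t limit = 0;       // 0 means the previous behaviour: never cut on tombstones
    uint64_t seen = 0;
    bool page_cut_short = false;

    // Called for each row- or range-tombstone processed while building a page.
    // Returns false when the consumer should stop, even if the page holds no live rows.
    bool on_tombstone() {
        ++seen;
        if (limit != 0 && seen >= limit) {
            page_cut_short = true;   // mark the page as short so paging continues
            return false;
        }
        return true;
    }
};
```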
This reverts commit bcadd8229b, reversing
changes made to cf528d7df9. Since
4bd4aa2e88 ("Merge 'memtable, cache: Eagerly
compact data with tombstones' from Tomasz Grabiec"), the memtable is
self-compacting and the extra compaction step only reduces throughput.
The unit test in memtable_test.cc is not reverted, serving as proof that the
revert does not cause a regression.
Closes #11243
Add maybe_yield calls in a tight loop, potentially
over thousands of sstable names, to prevent reactor stalls.
Although the per-sstable cost is very small, we've experienced
stalls related to printing in O(#sstables) in compaction.
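For illustration, the general pattern looks roughly like this (the surrounding function is made up):
```
// Yield to the reactor periodically while formatting a potentially huge name list.
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <string>
#include <vector>

seastar::future<std::string> format_sstable_names(std::vector<std::string> names) {
    std::string out;
    for (const auto& name : names) {
        out += name;
        out += '\n';
        co_await seastar::coroutine::maybe_yield();   // only suspends if preemption is needed
    }
    co_return out;
}
```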
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Handle exceptions, making sure the output
stream is properly closed in all cases,
and that an intermediate error, if any, is returned as the
final future.
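A sketch of the pattern (the function and its arguments are invented for illustration):
```
// Close the stream in every code path and surface any intermediate error.
#include <seastar/core/coroutine.hh>
#include <seastar/core/iostream.hh>
#include <exception>
#include <string>

seastar::future<> write_and_close(seastar::output_stream<char> out, std::string body) {
    std::exception_ptr err;
    try {
        co_await out.write(body.data(), body.size());
        co_await out.flush();
    } catch (...) {
        err = std::current_exception();   // remember the intermediate error
    }
    co_await out.close();                 // the stream is closed in all cases
    if (err) {
        std::rethrow_exception(err);      // ...and the error becomes the final future
    }
}
```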
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that snapshot orchestration in snapshot_on_all_shards
doesn't use snapshot_manager, get rid of the data structure.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that snapshot orchestration is done solely
in snapshot_on_all_shards, the per-shard
snapshot function can be deleted.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Call take_snapshot on each shard and collect the
returned snapshot_file_sets.
When all are done, move the vector<snapshot_file_set>
to finalize_snapshot.
All that is done without resorting to the snapshot_manager
or calling table::snapshot.
Both will be deleted in the following patches.
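In outline, the coordination could look like this sketch (type aliases and helper names are simplified stand-ins, built on the real smp::submit_to and foreign_ptr facilities):
```
// Illustrative orchestration: every shard produces its file set locally,
// wrapped in a foreign_ptr so it is freed on its owning shard, and the
// coordinator finalizes the snapshot once all shards are done.
#include <seastar/core/coroutine.hh>
#include <seastar/core/smp.hh>
#include <seastar/core/sharded.hh>
#include <memory>
#include <string>
#include <unordered_set>
#include <vector>

using snapshot_file_set = seastar::foreign_ptr<std::unique_ptr<std::unordered_set<std::string>>>;

seastar::future<snapshot_file_set> take_snapshot_on_this_shard() {
    // Placeholder: the real per-shard code links/copies the shard's sstables.
    co_return seastar::make_foreign(std::make_unique<std::unordered_set<std::string>>());
}

seastar::future<> finalize_snapshot(std::vector<snapshot_file_set> file_sets) {
    // Placeholder: the real code writes the manifest and schema.cql.
    (void)file_sets;
    co_return;
}

seastar::future<> snapshot_on_all_shards() {
    std::vector<snapshot_file_set> file_sets;
    file_sets.reserve(seastar::smp::count);
    for (unsigned shard = 0; shard < seastar::smp::count; ++shard) {
        file_sets.push_back(co_await seastar::smp::submit_to(shard, [] {
            return take_snapshot_on_this_shard();
        }));
    }
    co_await finalize_snapshot(std::move(file_sets));
}
```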
Fixes #11132
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that seal_snapshot doesn't need to look up
the snapshot_manager in pending_snapshots to
get to the file_sets, erasing the snapshot_manager
object can be done in table::snapshot, which
also inserted it there.
This will make it easier to get rid of it in a later patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
and pass it to seal_snapshot, so that the latter won't
need to look up and access the snapshot_manager object.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Write schema.cql and the files manifest in finalize_snapshot.
Currently it is called from table::snapshot, but in a later patch
it will be called by snapshot_on_all_shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To simplify processing of the per-shard file names
for generating the manifest.
We only need to print them to the manifest at the
end of the process, so there's no point in copying
them around along the way; just move the
foreign unique unordered_set.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, the sstable file names are created
and destroyed on each shard and are copied by the
"coordinator" shard using submit_to, while the
coroutine holds the source on its stack frame.
To prepare for the next patches that refactor this
code so that the coordinator shard will submit_to
each shard to perform `take_snapshot` and return
the set of sstrings in the future result, we need
to wrap the result in a foreign_ptr so it gets
freed on the shard that created it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There could be thousands of sstables, so we'd better
consider yielding in the tight loop that copies
the sstable names into the unordered_set we return.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Don't catch exceptions but rather just return
them in the returned future, as the exception
is handled by the caller.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move the actual snapshot-taking code into a per-shard
take_snapshot function, to be called from
snapshot_on_all_shards in a following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Called from the respective database entry points.
It will also be called from the database drop / truncate path
and will be used for central coordination of per-shard
table::snapshot, so we don't have to depend on the snapshot_manager
mechanism, which is fragile and currently causes an abort if we fail
to allocate it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Called from try_flush_memtable_to_sstable,
maybe_wait_for_sstable_count_reduction will wait for
compaction to catch up with memtable flush if the
bucket to compact is inflated, having too many
sstables. In that case we don't want to add fuel
to the fire by creating yet another sstable.
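A hedged sketch of the idea (names and the threshold are invented):
```
// Illustrative backpressure: don't flush yet another sstable into an already
// inflated bucket; wait for compaction to reduce the count first.
#include <seastar/core/coroutine.hh>
#include <seastar/core/condition-variable.hh>
#include <cstddef>

struct flush_backpressure {
    seastar::condition_variable sstable_count_reduced;   // signalled when compaction finishes
    std::size_t sstables_in_bucket = 0;                  // size of the bucket to compact
    std::size_t threshold = 64;                          // illustrative threshold

    seastar::future<> maybe_wait_for_sstable_count_reduction() {
        // Returns immediately when the bucket is not inflated.
        co_await sstable_count_reduced.wait([this] {
            return sstables_in_bucket < threshold;
        });
    }
};
```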
Fixes #4116
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This series is the first step in the effort to reduce the number of metrics reported by Scylla.
The series focuses on the per-table metrics.
The combination of histograms, per-table metrics, and per-shard reporting makes the number of metrics in a cluster explode.
The following series uses multiple tools to reduce the number of metrics.
1. Several metrics should only be reported for user tables, but the condition that checked this was not updated when more non-user keyspaces were added.
2. Instead of a histogram per table per shard, it will report a summary per table per shard and a single histogram per node.
3. Histograms, summaries, and counters will be reported only if they are used (for example, the cas-related metrics will not be reported for tables that are not using cas); a sketch of this conditional registration follows below.
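A hedged sketch of the conditional-registration approach using the Seastar metrics API (the stats struct, metric names, and labels here are illustrative, not the actual ones):
```
// Illustrative conditional registration: per-table metrics are skipped for
// internal keyspaces, and cas metrics are only registered when cas is used.
#include <seastar/core/metrics.hh>
#include <cstdint>
#include <string>

namespace sm = seastar::metrics;

struct table_stats {
    uint64_t cas_prepare = 0;
};

struct table_metrics {
    sm::metric_groups groups;

    void register_metrics(const std::string& ks, const std::string& cf,
                          table_stats& stats, bool internal_keyspace, bool uses_cas) {
        if (internal_keyspace) {
            return;                       // per-table metrics only for user tables
        }
        if (uses_cas) {                   // skip cas metrics for tables not using cas
            groups.add_group("column_family", {
                sm::make_counter("cas_prepare", stats.cas_prepare,
                        sm::description("Number of CAS prepare rounds"),
                        {sm::label("ks")(ks), sm::label("cf")(cf)}),
            });
        }
    }
};
```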
Closes #11058
* github.com:scylladb/scylla:
Add summary_test
database: Reduce the number of per-table metrics
replica/table.cc: Do not register per-table metrics for system
histogram_metrics_helper.hh: Add to_metrics_summary function
Unified histogram, estimated_histogram, rates, and summaries
Split the timed_rate_moving_average into data and timer
utils/histogram.hh: should_sample should use a bitmask
estimated_histogram: add missing getter method
This patch reduces the number of metrics that are reported per table when
the per-table flag is on.
When possible, it moves from time_estimated_histogram and
timed_rate_moving_average_and_histogram to use the unified timer.
Instead of a histogram per shard, it will now report a summary per shard
and a histogram per node.
Counters, histograms, and summaries will not be reported if they were
never used.
The API was updated accordingly so it would not break.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
There is a set of per-table metrics that should only be registered for
user tables.
As time passes, more keyspaces that are not user keyspaces have
been added, and there is now a function that covers all those cases.
This patch changes the implementation to use is_internal_keyspace.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
If we haven't been able to flush the memtable
in ~30 minutes (based on the number of retries),
just abort, assuming that the OOM
condition is permanent rather than transient.
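For illustration only, the shape of such a bail-out (the retry count is an invented stand-in for the ~30-minute budget):
```
// Illustrative give-up logic: after enough failed flush retries (roughly
// corresponding to ~30 minutes of backoff), treat the condition as permanent.
#include <cstdlib>

struct flush_retry_tracker {
    static constexpr int max_retries = 20;   // invented value standing in for "~30 minutes"
    int retries = 0;

    void on_flush_failure() {
        if (++retries >= max_retries) {
            std::abort();   // assume the OOM condition is permanent, not transient
        }
    }
};
```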
Refs #4344
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently when we can't write the flushed sstable
due to corruption in the memtable we get into
an infinite retry loop (see #10498).
Until we can go into maintenance mode, the next best thing
would be to abort, though there is still a risk that
commitlog replay will reproduce the corruption in the
memtable and we'd end up with an infinite crash loop.
(hence #10498 is not Fixed by this patch)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And let seal_active_memtable decide how to handle them,
as all flush error handling logic is now implemented there.
In particular, unlike today, sstable write errors will
cause an internal error rather than looping forever.
Also, check for shutdown earlier to ignore errors
like semaphore_broken that might happen when
the table is stopped.
Refs #10498
(The issue will be considered fixed when going
into maintenance mode on write errors rather than
throwing internal error and potentially retrying forever)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently flush is retried both by dirty_memory_manager::flush_when_needed
and table::seal_active_memtable, which may be called by other paths
like table::flush.
Unify the retry logic into seal_active_memtable so that
we have similar error handling semantics on all paths.
Refs #4174
Refs #10498
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that everything prior to flush_one is noexcept,
make table::seal_active_memtable and the paths that call it
noexcept, making sure that any errors are returned only
as exceptional futures, and handle them in flush_when_needed().
The original handle_exception had a broader scope than now needed,
so this change is mostly technical, to show that we can narrow down
the error handling to the continuation of flush_one - and verify that
the unit test is not broken.
A later patch moves this error handling logic away to seal_active_memtable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If an exception is thrown in `consume` before
write_memtable_to_sstable is called, or if the latter fails,
we must close the reader passed to it.
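A hedged sketch of the pattern (the reader type and surrounding function are simplified stand-ins):
```
// Illustrative fix: whatever throws, the reader handed to the flush is closed,
// and the original error still becomes the result of the returned future.
#include <seastar/core/coroutine.hh>
#include <exception>

struct reader {
    seastar::future<> close() { return seastar::make_ready_future<>(); }   // stand-in
};

seastar::future<> write_memtable_to_sstable(reader& rd) {
    (void)rd;
    co_return;   // placeholder for the real flush, which may fail
}

seastar::future<> consume(reader rd) {
    std::exception_ptr err;
    try {
        // ... work that may throw before the reader is handed off ...
        co_await write_memtable_to_sstable(rd);
    } catch (...) {
        err = std::current_exception();
    }
    co_await rd.close();              // the reader is closed in all cases
    if (err) {
        std::rethrow_exception(err);
    }
}
```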
Fixes #11075
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>