Refs #6148
The commitlog disk limit was previously a "soft" limit, in that we allowed allocating new segments even when we were over the disk usage maximum. With bad timing between needing and releasing segments, this could also cause us to create new segments while deleting old ones, in turn causing useless disk IO for pre-allocation/zeroing.
This patch set does:
* Make the limit a hard limit. If disk usage > max, we wait for a delete or recycle (see the sketch after this list).
* Make the flush threshold configurable. The default is to ask for a flush when over 50% usage. (We do not wait for the result.)
* Make the flush "partial". We flush only part of the used space (used - threshold/2) and set the replay-position limit accordingly. This means we try to clear the N oldest segments, not all of them, i.e. a "lighter" flush. Of course, if the CL is wholly dominated by a single CF, this will not really help much. But when more than one CF is in use, it means we can skip those that have no unflushed data below the requested replay position.
* Force more eager flush/recycle if we're out of segments
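A minimal sketch of the intended allocation path (names here are illustrative, not the actual commitlog code):

    future<segment_ptr> segment_manager::allocate_segment() {
        // Over the flush threshold: ask for a partial flush of the oldest
        // segments' data, but do not wait for the result.
        if (disk_usage() > max_disk_size * _flush_threshold) {
            request_flush();
        }
        // Hard limit: never allocate while over max; wait for a segment
        // to be deleted or recycled instead.
        while (disk_usage() > max_disk_size) {
            co_await _segment_freed.wait();
        }
        co_return co_await make_new_segment();
    }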
Note: the flush threshold is not exposed in the scylla config (yet), because I am unsure of the wording, and even of whether it should be.
Note: testing is sparse, esp. in regard to the latency/timeouts added in high-usage scenarios. While I can fairly easily provoke "stalls" (i.e. forced waiting for segments to free up) with simple c-s runs, it is hard to say exactly where latencies will start accumulating in a more sane config (I set my limits very low).
Closes #7879
* github.com:scylladb/scylla:
commitlog: Force earlier cycle/flush iff segment reserve is empty
commitlog: Make segment allocation wait iff disk usage > max
commitlog: Do partial (memtable) flushing based on threshold
commitlog: Make flush threshold configurable
table: Add a flush RP mark to table, and shortcut if not above
we're unconditionally using make_combined_mutation_source(), which causes extra allocations even when the memtable was flushed into a single sstable, which is the most common case. A memtable will only be flushed into more than one sstable when TWCS is used and the memtable had old data written into it due to out-of-order writes.
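A sketch of the shortcut; as_mutation_source()/sources_of() stand in for whatever helpers the real code uses:

    mutation_source make_flushed_mutation_source(std::vector<sstables::shared_sstable> ssts) {
        if (ssts.size() == 1) {
            // Common case: one sstable, no combining layer, no extra allocations.
            return ssts.front()->as_mutation_source();
        }
        // TWCS with out-of-order writes: combine the flushed sstables.
        return make_combined_mutation_source(sources_of(std::move(ssts)));
    }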
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210205182028.439948-1-raphaelsc@scylladb.com>
Add a string describing where the sstables originated
from (e.g. memtable, repair, streaming, compaction, etc.)
If configure_writer is called with a nullptr, the origin will be the empty string.
Introduce test_env_sstables_manager, which provides an overload of configure_writer with no parameters that calls the base class' configure_writer with the "test" origin. This reduces the code churn in this patch and keeps the tests simple.
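Roughly (a sketch; only the "test" origin is taken from this patch, the rest of the shape is assumed):

    class test_env_sstables_manager : public sstables::sstables_manager {
    public:
        using sstables_manager::configure_writer; // keep the base overload visible
        // Tests use this parameterless overload; production callers pass the
        // real origin, e.g. "memtable", "repair", "streaming", "compaction".
        sstable_writer_config configure_writer() const {
            return sstables_manager::configure_writer("test");
        }
    };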
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We want to unify the various sstable reader creation methods, and this method taking a ring position, instead of a partition range like everybody else, stands in the way of that.
This in effect reverts 68663d0de.
storage_service: Introduce load_and_stream
=== Introduction ===
This feature extends the nodetool refresh to allow loading arbitrary sstables
that do not belong to a node into the cluster. It loads the sstables from disk
and calculates the owning nodes of the data and streams to the owners
automatically.
For example, say the old cluster has 6 nodes and the new cluster has 3 nodes.
We can copy the sstables from the old cluster to any of the new nodes and
trigger the load and stream process.
This can make restores and migrations much easier.
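The flow, sketched with hypothetical helper names:

    future<> load_and_stream(sstring ks_name, sstring cf_name) {
        auto ssts = co_await get_sstables_from_upload_dir(ks_name, cf_name);
        auto reader = make_streaming_reader(ssts);
        // Walk the loaded data and route each partition to its owners.
        while (auto partition = co_await read_next_partition(reader)) {
            auto owners = get_owning_nodes(partition->token());
            co_await stream_to(owners, std::move(*partition));
        }
        // Unlike plain refresh, the source sstables are deleted when done.
        co_await remove_sstables(std::move(ssts));
    }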
=== Performance ===
I managed to get 40MB/s per shard on my build machine.
CPU: AMD Ryzen 7 1800X Eight-Core Processor
DISK: Samsung SSD 970 PRO 512GB
Assume 1TB of sstables per node; with each shard doing 40MB/s and each node having 32 shards, we can finish loading and streaming 1TB of data in about 13 mins on each node:
1 TB / (40 MB/s * 32 shards) ≈ 781 s ≈ 13 mins
=== Tests ===
backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_test
which creates a cluster with 4 nodes and inserts data, then uses load_and_stream to restore to a 2-node cluster.
=== Usage ===
curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true"
=== Notes ===
Btw, with the old nodetool refresh, the node will not pick up data that does not belong to it, but it will not delete that data either. One has to run nodetool cleanup to remove it manually, which is a surprise to me and probably to users as well. With load and stream, the process deletes the sstables once it finishes streaming, so no nodetool cleanup is needed.
The name of this feature, load and stream, follows load and store in the CPU world.
Fixes #7831
Closes #7846
* github.com:scylladb/scylla:
storage_service: Introduce load_and_stream
distributed_loader: Add get_sstables_from_upload_dir
table: Add make_streaming_reader for given sstables set
Adds a second RP to table, marking where we flushed last.
If a new flush request comes in that is below this mark, we
can skip a second flush.
This is to (in future) support incremental CL flush.
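In sketch form (the `_flushed_rp` member name is assumed):

    future<> table::flush(db::replay_position rp) {
        if (rp < _flushed_rp) {
            co_return; // already flushed past this point; skip the second flush
        }
        co_await do_flush();
        _flushed_rp = std::max(_flushed_rp, rp);
    }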
From now on, memtable flush will use the strategy's interposer consumer iff split_during_flush is enabled (disabled by default).
It has effect only for TWCS users, as TWCS is the only strategy that implements this interposer consumer, which consists of segregating data according to the window configuration.
Fixes #4617.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
As a preparation for the interposer on flush, let's allow the database write monitor to store a shared sstable write permit, which will be released as soon as any of the sstable writers reaches the sealing stage.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This extension is needed for future work where a memtable will be segregated
during flush into one sstable or more. So now multiple sstables can be added
to the set after a memtable flush, and compaction is only triggered at the
end.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
On every compaction completion, the sstable set is rebuilt from scratch. With LCS and ~160G of data per shard, this means we have to create a new sstable set with ~1000 entries whenever a compaction completes, which will likely result in the reactor stalling for a significant amount of time.
This is fixed by futurizing build_new_sstable_list(), so it will
yield whenever needed.
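A sketch of the futurized rebuild, assuming seastar::coroutine::maybe_yield():

    future<std::vector<sstables::shared_sstable>>
    build_new_sstable_list(const std::vector<sstables::shared_sstable>& all,
                           const std::unordered_set<sstables::shared_sstable>& removed) {
        std::vector<sstables::shared_sstable> ret;
        for (auto& sst : all) {
            if (!removed.contains(sst)) {
                ret.push_back(sst);
            }
            // With ~1000 entries per rebuild, give the reactor a chance
            // to run other tasks instead of stalling.
            co_await coroutine::maybe_yield();
        }
        co_return ret;
    }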
Fixes #7758.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
row cache now allows the updater to first prepare the work, and then execute the update atomically as the last step. Let's do that when rebuilding the set: the new set is created in the preparation phase, and it replaces the old one in the execution phase, satisfying the atomicity requirement of row cache.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
the procedure is changed to return the new set, so the caller is responsible for replacing the old set with the new one. This will allow our future work where building the new set and enabling it will be decoupled.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
An external updater may do some preparatory work, like constructing a new sstable list, and at the end atomically replace the old list with the new one.
Decoupling the preparation from the execution gives us the following benefits:
- the preparation step can now yield if needed to avoid reactor stalls, as it has been futurized.
- the execution step can now provide strong exception guarantees, as it is decoupled from the preparation step, which may not be exception-safe.
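The resulting interface, roughly:

    // Sketch of the decoupled updater (names illustrative):
    class external_updater {
    public:
        virtual ~external_updater() = default;
        // May yield and may fail; builds the new sstable list off to the side.
        virtual future<> prepare() = 0;
        // Must not yield or throw; atomically swaps in the prepared list.
        virtual void execute() noexcept = 0;
    };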
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Fixes #7732
When truncating with auto_snapshot on, we try to verify the low rp mark
from the CF against the sstables discarded by the truncation timestamp.
However, in a scenario like:
1. Fill memtables
2. Flush
3. Truncate with snapshot A
4. Fill memtables some more
5. Truncate
6. Move snapshot A to upload + refresh (load old tables)
7. Truncate
The last op will assert, because while we have sstables loaded, which will be discarded now, we did not in fact generate any _new_ ones (since the memtables are empty), and the RP we get back from the discard is one from an earlier generation set.
(Any permutation of events that creates the situation "empty memtables" + "non-empty sstables with only old tables" will generate the same error.)
Added a check before flushing for whether we actually have any data; if we do not, the RP relation assert is not upheld.
Closes #7799
Pending flushes can participate in races when a table
with auto_snapshot==false is dropped. The race is as follows:
1. A flush of table T is initiated
2. The flush operation is preempted
3. Table T is dropped without flushing, because it has auto_snapshot off
4. The flush operation from (2.) wakes up and continues
working on table T, which is already dropped
5. Segfault/memory corruption
To prevent such races, a phaser for pending flushes is introduced.
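A sketch of the synchronization, assuming a phased_barrier-like interface (e.g. utils::phased_barrier with start()/advance_and_await()):

    future<> table::flush() {
        auto op = _pending_flushes_phaser.start(); // held across the flush
        co_await do_flush();
    } // `op` released here; a pending drop can now proceed

    future<> database::drop_table(table& t) {
        // Wait out any in-flight flushes before tearing the table down.
        co_await t._pending_flushes_phaser.advance_and_await();
        co_await t.stop();
    }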
This reverts commit dc77d128e9. It was reverted due to a strange and unexplained diff, which is now explained: the HEAD of the working directory being pulled from had been set back, so git thought it was merging the intended commits plus all the work that had been committed between that HEAD and master. So it is safe to restore it.
This reverts commit 0aa1f7c70a, reversing changes made to 72c59e8000. The diff is strange, including unrelated commits. We do not understand the cause, so to be safe, revert and try again.
instead of partition_range.
It would be best to pass `partition_key` or `decorated_key` here.
However, the implementation of this function needs a `partition_range`
to pass into `sstable_set::select`, and `partition_range` must be
constructed from `ring_position`s. We could create the `ring_position`
internally from the key but that would involve a copy which we want to
avoid.
Lower level functions such as `create_single_key_sstable_reader`
were made methods of `sstable_set`.
The motivation is that each concrete sstable_set may decide to use a better sstable reading algorithm, specific to the data structures used by that sstable_set. For this it needs access to the set's internals.
A nice side effect is that we moved some code out of table.cc
and database.hh which are huge files.
Commit e5be3352cf ("database, streaming, messaging: drop
streaming memtables") removed streaming memtables; this removes
the mechanisms to synchronize them: _streaming_flush_gate and
_streaming_flush_phaser. The memory manager for streaming is removed,
and its 10% reserve is evenly distributed between memtables and
general use (e.g. cache).
Note that _streaming_flush_gate and _streaming_flush_phaser no longer synchronize anything - the gate is only used to protect the phaser, and the phaser isn't used for anything.
Closes #7454
Users can change `durable_writes` anytime with ALTER KEYSPACE.
Cassandra reads the value of `durable_writes` every time when applying
a mutation, so changes to that setting take effect immediately. That is,
mutations are added to the commitlog only when `durable_writes` is `true`
at the moment of their application.
Scylla reads the value of `durable_writes` only at `keyspace` construction time,
so changes to that setting take effect only after Scylla is restarted.
This patch fixes the inconsistency.
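Conceptually, the write path becomes (a sketch, not the literal code):

    future<> database::apply(schema_ptr s, const frozen_mutation& m) {
        auto& ks = find_keyspace(s->ks_name());
        // Re-read durable_writes on every mutation, so ALTER KEYSPACE
        // takes effect immediately, as in Cassandra.
        if (ks.metadata()->durable_writes()) {
            co_await add_to_commitlog(s, m);
        }
        co_await apply_in_memory(s, m);
    }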
Fixes #3034
Closes #7533
Require a schema and an operation name to be given to each permit when
created. The schema is that of the table the read is executed against, and the operation name is some name identifying the operation the permit is part of. Ideally this should be different for each site the permit is created at, to be able to discern not only different kinds of reads, but also different code paths a read took.
As not all reads can be associated with one schema, the schema is allowed to be null.
The name will be used for debugging purposes, both for coredump
debugging and runtime logging of permit-related diagnostics.
"
max_concurrent_for_each was added to seastar to replace sstable_directory::parallel_for_each_restricted with more efficient concurrency control that doesn't create an unlimited number of continuations.
The series replaces the use of sstable_directory::parallel_for_each_restricted with max_concurrent_for_each and exposes sstable_directory::do_for_each_sstable via a static method.
This method is used here by table::snapshot to limit the concurrency of snapshot operations, which suffer from the same unbounded-concurrency problem that sstable_directory solved.
In addition, sstable_directory::_load_semaphore, which was used across calls to do_for_each_sstable, was replaced by a static per-shard semaphore that caps concurrency across all calls to `do_for_each_sstable` on that shard. This makes sense since the disk is a shared resource.
In the future, we may want to have a load semaphore per device rather than
a single global one. We should experiment with that.
Test: unit(dev)
"
* tag 'max_concurrent_for_each-v5' of github.com:bhalevy/scylla:
table: snapshot: use max_concurrent_for_each
sstable_directory: use an external load_semaphore
test: sstable_directory_test: extract sstable_directory creation into with_sstable_directory
distributed_loader: process_upload_dir: use initial_sstable_loading_concurrency
sstables: sstable_directory: use max_concurrent_for_each
Tables may have thousands of sstables and a number of component files for each sstable. Using parallel_for_each on all sstables (and parallel_for_each in sstables::create_links for each file) needlessly overloads the system with an unbounded number of continuations.
Use max_concurrent_for_each and acquire the db sst_dir_semaphore to
limit parallelism.
Note that although snapshot is called after scylla has already loaded the sstables, we use the configured initial_sstable_loading_concurrency().
As a future follow-up we may want to define yet another config variable for ongoing operations on sstable directories, if we see that it warrants a different setting than the initial loading concurrency.
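The snapshot loop, roughly (assuming seastar::max_concurrent_for_each and the existing create_links helper):

    future<> table::snapshot_sstables(std::vector<sstables::shared_sstable> ssts,
                                      sstring jsondir) {
        co_await max_concurrent_for_each(ssts, initial_sstable_loading_concurrency(),
                [&] (sstables::shared_sstable sst) -> future<> {
            // The static per-shard semaphore caps concurrency across all
            // callers, since the disk is a shared resource.
            auto units = co_await get_units(sst_dir_semaphore(), 1);
            co_await sst->create_links(jsondir);
        });
    }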
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
on_compaction_completion tries to handle a gate_closed_exception, but
with_gate() throws rather than creating an exceptional future, so
the extra handling is lost. This is relatively benign since it will
just fail the compaction, requiring that work to be redone later.
Fix by using the safer try_with_gate().
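That is, in sketch form (surrounding names assumed):

    // with_gate() throws gate_closed_exception synchronously, bypassing the
    // continuation below; try_with_gate() turns it into an exceptional
    // future, so the handler is actually reached:
    return try_with_gate(_gate, [this, &desc] {
        return do_on_compaction_completion(desc);
    }).handle_exception_type([] (seastar::gate_closed_exception&) {
        // The table is stopping; the compaction work will be redone later.
    });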
on_compaction_completion() updates _sstables_compacted_but_not_deleted
through a temporary to avoid an exception causing a partial update:
1. copy _sstables_compacted_but_not_deleted to a temporary
2. update temporary
3. do dangerous stuff
4. move temporary to _sstables_compacted_but_not_deleted
This is racy when we have parallel compactions, since step 3 yields.
We can have two invocations running in parallel, taking snapshots
of the same _sstables_compacted_but_not_deleted in step 1, each
modifying it in different ways, and only one of them winning the
race and assigning in step 4. With the right timing we can end
with extra sstables in _sstables_compacted_but_not_deleted.
Before a5369881b3, this was a benign race (only resulting in
deleted file space not being reclaimed until the service is shut
down), but afterwards, extra sstable references result in the service
refusing to shut down. This was observed in database_test in debug
mode, where the race more or less reliably happens for system.truncated.
Fix by using a different method to protect
_sstables_compacted_but_not_deleted. We unconditionally update it,
and also unconditionally fix it up (on success or failure) using
seastar::defer(). The fixup includes a call to rebuild_statistics()
which must happen every time we touch the sstable list.
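Sketched, assuming seastar::defer and hypothetical helper names:

    future<> table::on_compaction_completion(compaction_desc desc) {
        // Update the shared list up front - no private snapshot to race on.
        add_compacted_but_not_deleted(desc.old_sstables);
        auto fixup = defer([&] () noexcept {
            // Runs on success and on failure; every touch of the sstable
            // list must be paired with rebuild_statistics().
            remove_compacted_but_not_deleted(desc.old_sstables);
            rebuild_statistics();
        });
        co_await do_dangerous_stuff(desc); // may yield; parallel runs are now safe
    }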
Fixes #7331.
Not used yet, this patch does all the churn of propagating a permit
to each impl.
In the next patch we will use it to track the memory consumption of `_buffer`.
Drop references to a table's sstables when stopping it, so that the sstable_manager can start deleting them. This includes staging sstables.
Although the table is no longer in use at this time, we keep the cache consistent by calling row_cache::invalidate() (this also has the benefit of avoiding a stall in row_cache's destructor). We also refresh the cache's view of the sstable set to drop the cache's references.
Take the gate in table::query() so that stop() waits for queries. The gate
is already waited for in table::stop().
This allows us to know we are no longer using the table's sstables in table::stop().
"
Before seastar is updated with the {fmt} engine under the logging hood, some changes need to be made in scylla to conform to {fmt} standards.
Compilation and tests were checked against both the old (current) and the new seastar.
tests: unit(dev), manual
"
* 'br-logging-update' of https://github.com/xemul/scylla:
code: Force formatting of pointer in .debug and .trace
code: Format { and } as {fmt} needs
streaming: Do not reveal raw pointer in info message
mp_row_consumer: Provide hex-formatting wrapper for bytes_view
heat_load_balance: Include fmt/ranges.h
... and tests. Printing a pointer in logs is considered to be a bad practice, so the proposal is to keep this explicit (with fmt::ptr) and allow it only for the .debug and .trace cases.
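For example (logger name illustrative):

    // {fmt} will not format an object pointer implicitly; make it explicit:
    tlog.debug("table {} stopping", fmt::ptr(this));
    // ...while .info and above should not reveal raw pointers at all.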
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is no longer called once for a given view_info, so the name
"initialize" is not appropriate.
This patch splits the "initialize" method into the "make" part, which
makes a new base_info object, and the "set" part, which changes the
current base_info object attached to the view.
The view building process was accessing mutation fragments using
current table's schema. This is not correct, fragments must be
accessed using the schema of the generating reader.
This could lead to undefined behavior when the column set of the base table changes: out_of_range exceptions could be observed, or data in the view could end up in the wrong column.
Refs #7061.
The fix has two parts. First, we always use the reader's schema to
access fragments generated by the reader.
Second, when calling populate_views() we upgrade the fragment-wrapping
reader's schema to the base table schema so that it matches the base
table schema of view_and_base snapshots passed to populate_views().
The view_info object, which is attached to the schema object of the
view, contains a data structure called
"base_non_pk_columns_in_view_pk". This data structure contains column
ids of the base table so is valid only for a particular version of the
base table schema. This data structure is used by materialized view
code to interpret mutations of the base table, those coming from base
table writes, or reads of the base table done as part of view updates
or view building.
The base table schema version of that data structure must match the
schema version of the mutation fragments, otherwise we hit undefined
behavior. This may include aborts, exceptions, segfaults, or data
corruption (e.g. writes landing in the wrong column in the view).
Before this patch, we could get schema version mismatch here after the
base table was altered. That's because the view schema does not change
when the base table is altered.
Part of the fix is to extract base_non_pk_columns_in_view_pk into a third entity called base_dependent_view_info, which changes on both base table schema changes and view schema changes.
It is managed by a shared pointer so that we can take immutable
snapshots of it, just like with schema_ptr. When starting the view
update, the base table schema_ptr and the corresponding
base_dependent_view_info have to match. So we must obtain them
atomically, and base_dependent_view_info cannot change during update.
Also, whenever the base table schema changes, we must update
base_dependent_view_infos of all attached views (atomically) so that
it matches the base table schema.
Refs #7061.
Currently we assign the reference to the vector of selected sstables to
`auto sst`. This makes a copy and we pass this local variable to
`do_for_each()`, which will result in a use-after-free if the latter
defers.
Fix by not making a copy and instead just keeping the reference.
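The bug in miniature (names illustrative):

    // Before: `auto` copies the selected vector into a local, and the local
    // dies when we return the future, so do_for_each() iterates freed
    // memory if it defers:
    auto sstables = selector.select(range);          // copy!
    return do_for_each(sstables, use_one_sstable);   // may dangle on defer

    // After: a reference to storage that outlives the loop:
    const auto& sstables = selector.select(range);
    return do_for_each(sstables, use_one_sstable);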
Fixes: #7060
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200818091241.2341332-1-bdenes@scylladb.com>
After 8014c7124, cleanup can potentially pick a compacting SSTable.
Upgrade and scrub can also pick a compacting SSTable.
The problem is that table::candidates_for_compaction() was badly named. It misleads the user into thinking that the SSTables returned are perfect candidates for compaction, when the manager still needs to filter out the compacting SSTables from the returned set. So it is being renamed.
When the same SSTable is compacted in parallel, the strategy invariant can be broken, such as overlap being introduced in LCS, and deletions can fail as more than one compaction process tries to delete the same files.
Let's fix scrub, cleanup and upgrade by calling the manager function which gets the correct candidates for compaction.
Fixes #6938.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200811200135.25421-1-raphaelsc@scylladb.com>
C++20 introduced a `contains` member function on maps and sets for checking whether an element is present in the collection. Previously the `count` function was often used in various ways.
`contains` not only expresses the intent of the code better, but also does so in a more uniform way.
This commit replaces all such occurrences of `count` with `contains`.
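For example:

    #include <set>

    bool has_partition(const std::set<int>& parts, int key) {
        // Before: return parts.count(key);  -- works, but states a count.
        return parts.contains(key);          // C++20: states the intent.
    }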
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <b4ef3b4bc24f49abe04a2aba0ddd946009c9fcb2.1597314640.git.piotr@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/7018
by Piotr Sarna:
This series addresses various issues with metrics and semaphores - it mainly adds missing metrics, which make it possible to see the length of the queues attached to the semaphores. In the case of view building and view update generation, metrics were not present in these services at all, so a first, basic implementation is added.
More precise semaphore metrics will ease the testing and development of load shedding and admission control.
view_builder: add metrics
db, view: add view update generator metrics
hints: track resource_manager sending queue length
hints: add drain queue length to metrics
table: add metrics for sstable deletion semaphore
database: remove unused semaphore
Now that it is safe to filter md-format sstables by min/max column names, we can remove the `filtering_broken` variable that disabled filtering in 19b76bf75b to fix #4442.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To prevent https://github.com/scylladb/scylla/issues/3552
we want to ensure that whenever the partition exists in any sstable, we emit partition_start/end, even when returning no rows.
In the first filtering pass, filter_sstable_for_reader_by_pk filters the input sstables based on the partition key, and num_sstables is set to the size of the sstables list after this first pass.
An empty sstables list at this stage means there are indeed no sstables
with the required partition so returning an empty result will leave the
cache in the desired state.
Otherwise, we filter again, using filter_sstable_for_reader_by_ck, and examine the list of the remaining readers.
If num_readers != num_sstables, we know that some sstables were filtered out by clustering key, so we append a flat_mutation_reader_from_mutations to the list of readers and return a combined reader as before.
This ensures that we always emit partition_start/end mutations for the queried partition, even if the filtered readers emit no rows.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
With the md sstable format, min/max column names in the metadata now
track clustering rows (with or without row tombstones),
range tombstones, and partition tombstones (that are
reflected with empty min/max column names - indicating
the full range).
As such, min and max column names may be of different lengths
due to range tombstones and potentially short clustering key
prefixes with compact storage, so the current matching algorithm
must be changed to take this into account.
To determine if a slice range overlaps the min/max range
we are using position_range::overlaps.
sstable::clustering_components_ranges was renamed to position_range
as it now holds a single position_range rather than a vector of bytes_view ranges.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move contains_rows from table code to sstable::may_contain_rows, since its implementation now has too much specific knowledge of sstable internals.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Static rows aren't reflected in the sstable min/max clustering keys metadata.
Since we don't have any indication in the metadata that the sstable stores
static rows, we must read all sstables if a static column is requested.
Refs #3553
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>