scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-06 06:53:12 +00:00

Author	SHA1	Message	Date
Jesse Haber-Kucharsky	682805b22c	auth: Use finite time-out for all QUORUM reads Commit `e664f9b0c6` transitioned internal CQL queries in the auth sub-system to be executed with finite time-outs instead of infinite ones. It should have also modified the functions in `auth/roles-metadata.cc` to have finite time-outs. This change fixes some previously failing dtests, particularly around repair. Without this change, the QUORUM query fails to terminate when the necessary consistency level cannot be achieved. Fixes #3736. Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com> Message-Id: <e244dc3e731b4019f3be72c52a91f23ee4bb68d1.1536163859.git.jhaberku@scylladb.com>	2018-09-05 21:55:26 +03:00
Tomasz Grabiec	82270c8699	storage_proxy: Fix misqualification of reads as foreground or background in some cases The foreground reads metric is derived from the number of live read executors minus the number of background reads. Background reads are counted down when their resolver times out. However, a read executor may still be around for a while, resulting in such reads being accounted as foreground. Usually, the gap in which this happens is short, because executor reference holders timeout quickly as well. It's not always the case though. For instance, local read executor doesn't time out quickly when the target shard has an overloaded CPU, and it takes a while before the request goes through all the queues, even if IO is not involved. Observed in #3628. Fixes #3734. Another problem is that all reads which received CL responses are accounted as background, until all replicas respond, but if such read needs reconciliation, it's still practically a foreground read and should be accounted as such. Found during code review. Fixes #3745. This patch fixes both issues by rearranging accounting to track foreground reads instead of background reads, and considering all reads as foreground until the resulting promise is resolved. Message-Id: <1535999620-25784-1-git-send-email-tgrabiec@scylladb.com>	2018-09-05 20:42:51 +03:00
Avi Kivity	c168805ca6	Merge "Filtering and fast-forwarding of range tombstones in SSTables 3.x" from Vladimir " This patchset adds proper support for sliced reads of partitions containing range tombstones. Given the SSTables 3.x representation of range tombstones by separate start and end markers, we refer to the index for the information about the currently opened range tombstone, if any, when skipping to the next promoted index block. Note that for this we have to take the promoted index block immediately preceding the one we are jumping to. Tests: unit {release} " * 'projects/sstables-30/range-tombstones-slicing/v3' of https://github.com/argenet/scylla: tests: Test filtering and forwarding on a partition with interleaved rows and RTs. tests: Add tests for reading wide partitions with range tombstones. sstables: Support slicing for range tombstones. sstables: Set/reset range tombstone start from end open marker. sstables: Fix end_open_marker population in promoted index blocks. sstables: Add need_skip() helper to data_consume_context. sstables: For end_open_marker, return both position in partition and deletion time.	2018-09-05 20:38:39 +03:00
Vladimir Krivopalov	3d13ee3909	tests: Test filtering and forwarding on a partition with interleaved rows and RTs. In this test, rows lie inside range tombstones so we split them on reading. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-09-05 09:48:17 -07:00
Vladimir Krivopalov	d39e58a97a	tests: Add tests for reading wide partitions with range tombstones. Test the case where rows lie outside range tombstones. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-09-05 09:48:17 -07:00
Vladimir Krivopalov	ec2047e1e6	sstables: Support slicing for range tombstones. Both filtering on queried ranges and fast-forwarding are supported. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-09-05 09:48:17 -07:00
Vladimir Krivopalov	d57380f44c	sstables: Set/reset range tombstone start from end open marker. When we skip through a wide partition using the promoted index, we may land on a position that lies in the middle of a range tombstone, so we need to be aware of it. For this, we check if the previous promoted block has an end open marker and either set the range tombstone start using it or reset it if missing. Note several things about the implementation. Firstly, we have to peek back at the previous promoted index block for the end open marker, and so we have to always preserve one more promoted index block when we read the next batch so that we can still access it. Secondly, we use the previous promoted block end position to build the position in partition for the range tombstone start. Lastly, we don't have a notion of end open marker in older consumers that work with SSTables of ka/la formats, so we only call the corresponding methods if the consumer supports them. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-09-05 09:48:17 -07:00
Vladimir Krivopalov	939e4893ef	sstables: Fix end_open_marker population in promoted index blocks. We should not access the internal object stored in std::optional when passing the end_open_marker, especially since it can be disengaged. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-09-05 09:48:17 -07:00
Vladimir Krivopalov	84bff86fbc	sstables: Add need_skip() helper to data_consume_context. This method tells whether we will need to skip to reach the input position or not. It can be used for skipping with index when reading SSTables 3.x because we only want to set/reset the open range tombstone bound when we actually move to another promoted index block. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-09-05 09:48:17 -07:00
Tomasz Grabiec	cd201d1987	db/batchlog_manager: Do not return a value from timer callback Timer callbacks are std::function<void()>. Exposed by changing callback_t to noncopyable_function<>. Message-Id: <1536138045-29209-1-git-send-email-tgrabiec@scylladb.com>	2018-09-05 12:32:21 +03:00
Asias He	89b769a073	storage_service: Wait for range setup before announcing join status When a joining node announces its join status through gossip, other existing nodes will send writes to the joining node. At this time, it is possible that the joining node hasn't learnt the tokens of other nodes, which causes errors like the one below: token_metadata - sorted_tokens is empty in first_token_index! storage_proxy - Failed to apply mutation from 127.0.4.1#0: std::runtime_error (sorted_tokens is empty in first_token_index!) To fix, wait for the token range setup before announcing the join status. Fixes: #3382 Tests: 60 runs of materialized_views_test.py:TestMaterializedViews.add_dc_during_mv_update_test Message-Id: <01abb21ae3315ae275297e507c5956e5774557ef.1536128531.git.asias@scylladb.com>	2018-09-05 10:51:43 +03:00
Vlad Zolotarov	dae70e1166	tests: loading_cache_test: configure a validity timeout in test_loading_cache_loading_different_keys to a greater value Change the validity timeout from 1s to 1h in order to avoid false alarms on busy systems: for a short value there is a chance that (loading_cache.size() == num_loaders) check is going to run after some elements of the cache have already been evicted. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <20180904193026.7304-1-vladz@scylladb.com>	2018-09-05 10:19:59 +03:00
Vladimir Krivopalov	ac0c71bdc1	sstables: For end_open_marker, return both position in partition and deletion time. Prior to this fix, the end_open_marker has been only accessible as a plain deletion_time structure. Now it also contains the start position of a promoted index block so that it can be used for setting range tombstone open bound. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-09-04 18:16:21 -07:00
Piotr Sarna	f494d03c3f	tests: add test case for filtering with DESC clustering order Refs #3741 Message-Id: <1b8eab8d668eb000b306686c15324e6acde8e616.1535981852.git.sarna@scylladb.com>	2018-09-04 16:05:19 +03:00
Piotr Sarna	8e52b66516	cql3: fix filtering with descending clustering order When the slice::is_satisfied_by() restriction check is performed on raw data represented as bytes, it should always use a regular type comparator, not a reversed one. Reversed types are used to preserve descending clustering order, but comparison with constants should use the regular underlying type comparator (for x < 1 to actually mean 'less than 1' instead of 'greater than 1, because the clustering order is reversed'). Fixes #3741 Message-Id: <3e25fc66688c9253287f2c4f31ede8339b9bbe23.1535981852.git.sarna@scylladb.com>	2018-09-04 16:05:15 +03:00
Piotr Sarna	5b5c9f2707	cql3: fix a 'pratition_key' typo partition_key got misspelled with 'pratition_key' typo in the original series. Message-Id: <de59fe6161df5442b19d8ba4336e2f828b7ede32.1535981852.git.sarna@scylladb.com>	2018-09-04 16:05:09 +03:00
Takuya ASADA	bd8a5664b8	dist/common/scripts/scylla_raid_setup: create scylla-server.service.d when it doesn't exist When /etc/systemd/system/scylla-server.service.d/capabilities.conf is not installed, we don't have /etc/systemd/system/scylla-server.service.d/, need to create it. Fixes #3738 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <20180904015841.18433-1-syuu@scylladb.com>	2018-09-04 10:12:32 +03:00
Tomasz Grabiec	4fb3f7e8eb	managed_vector: Make external_memory_usage() ignore reserved space This ensures that row::external_memory_usage() is invariant to insertion order of cells. It should be so, so that accounting of a clustering_row, merged from multiple MVCC versions by the partition_snapshot_flat_reader on behalf of a memtable flush, doesn't give a greater result than what is used by the memtable region. Overaccounting leads to assertion failure in ~flush_memory_accounter. Fixes #3625 (hopefully). Message-Id: <1535982513-19922-1-git-send-email-tgrabiec@scylladb.com>	2018-09-03 17:09:54 +03:00
Takuya ASADA	d78762d627	dist/debian: fix broken debian/changelog It also needs $MUSTACHE_DIST. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <20180903094558.3862-1-syuu@scylladb.com>	2018-09-03 14:04:01 +03:00
Duarte Nunes	e49a14e308	Merge 'Stateful range scans' from Botond " This series extends the query statefulness, introduced by `f8613a841` to point queries, to range scans as well. This means that queriers will be saved and reused for range scans too. This series builds heavily on the infrastructure introduced by stateful point queries, namely the querier object and the querier_cache. It also builds on another critical piece of infrastructure, the multishard_combining_reader, introduced by `2d126a79b`. To make the range scan on a given node suspendable and resumable we move away from the current code in `storage_proxy::query_nonsingular_mutations_locally()` and use a multishard_combining_reader to execute the read. When the page is filled this reader is dismantled and its shard readers are saved in the querier cache. There are of course a lot more details to it but this is the gist of it. Tests: unit(release, debug), dtest(paging_test.py, paging_additional_test.py) " * '1865/range-scans/v7.1' of https://github.com/denesb/scylla: (33 commits) query_pagers: generate query_uuid for range-scans as well storage_proxy: use preferred/last replicas storage_proxy: add preferred/last replicas to the signature of query_partition_key_range_concurrent db::consistency_level::filter_for_query() add preferred_endpoints storage_proxy: use query_mutations_from_all_shards() for range scans tests: add unit test for multishard_mutation_query() tests/mutation_assertions.hh: add missing include multishard_mutation_query: add badness counters database: add query_mutations_on_all_shards() mutation_compactor: add detach_state() flat_mutation_reader: add unpop_mutation_fragment() Move reconcilable_result_builder declaration to mutation_query.hh mutation_source_test: add an additional REQUIRE() mutation: add missing assert to mutation from reader querier: add shard_mutation_querier querier: prepare for multi-ranges tests/querier_cache: add tests specific for multiple entry-types querier: split querier into separate data and mutation querier types querier: move consume_page logic into a free function querier: move all matching related logic into free functions ...	2018-09-03 09:09:17 +01:00
Botond Dénes	cd49c23a66	query_pagers: generate query_uuid for range-scans as well And thus enable stateful range scans.	2018-09-03 10:31:44 +03:00
Botond Dénes	6486d6c8bd	storage_proxy: use preferred/last replicas	2018-09-03 10:31:44 +03:00
Botond Dénes	577a06ce1b	storage_proxy: add preferred/last replicas to the signature of query_partition_key_range_concurrent	2018-09-03 10:31:44 +03:00
Botond Dénes	6e59cee244	db::consistency_level::filter_for_query() add preferred_endpoints To the second overload (the one without read-repair related params) too.	2018-09-03 10:31:44 +03:00
Botond Dénes	2f66bde26f	storage_proxy: use query_mutations_from_all_shards() for range scans	2018-09-03 10:31:44 +03:00
Botond Dénes	6779b63dfe	tests: add unit test for multishard_mutation_query()	2018-09-03 10:31:44 +03:00
Botond Dénes	c678b665b4	tests/mutation_assertions.hh: add missing include	2018-09-03 10:31:44 +03:00
Botond Dénes	253407bdc8	multishard_mutation_query: add badness counters Add badness counters that allow tracking problems. The following counters are added: 1) multishard_query_unpopped_fragments 2) multishard_query_unpopped_bytes 3) multishard_query_failed_reader_stops 4) multishard_query_failed_reader_saves The first pair of counters observe the amount of work range scan queries have to undo on each page. It is normal for these counters to be non-zero, however sudden spikes in their values can indicate problems. This undoing of work is needed for stateful range-scans to work. When stateful queries are enabled the `multishard_combining_reader` is dismantled and all unconsumed fragments in its and any of its intermediate reader's buffers are pushed back into the originating shard reader's buffer (via `unpop_mutation_fragment()`). This also includes the `partition_start`, the `static_row` (if there is one) and all extracted and active `range_tombstone` fragments. This together can amount to a substantial amount of fragments. (1) counts the amount of fragments moved back, while (2) counts the number of bytes. Monitoring size and quantity separately allows for detecting edge cases like moving many small fragments or just a few huge ones. The counters count the fragments/bytes moved back to readers located on the shard they belong to. The second pair of counters are added to detect any problems around saving readers. Since the failure to save a reader will not fail the read itself, it is necessary to add visibility to these failures by other means. (3) counts the number of times stopping a shard reader (waiting on pending read-aheads and next-partitions) failed while (4) counts the number of times inserting the reader into the `querier_cache` failed. Contrary to the first two counters, which will almost certainly never be zero, these latter two counters should always be zero. Any other value indicates problems in the respective shards/nodes.	2018-09-03 10:31:44 +03:00
Botond Dénes	97364c7ad9	database: add query_mutations_on_all_shards() This method allows for querying a range or ranges on all shards of the node. Under the hood it uses the multishard_combining_reader for executing the query. It supports paging and stateful queries (saving and reusing the readers between pages). All this is transparent to the client, who only needs to supply the same query::read_command::query_uuid through the pages of the query (and supply correct start positions on each page, that match the stop position of the last page).	2018-09-03 10:31:44 +03:00
Botond Dénes	33d72efa49	mutation_compactor: add detach_state() Allow the state of the compaction to be detached. The detached state is a set of mutation fragments, which if replayed through a new compactor object will result in the latter being in the same state as the previous one was. This allows for storing the compaction state in the compacted reader by using `unpop_mutation_fragment()` to push back the fragments that comprise the detached state into the reader. This way, if a new compaction object is created it can just consume the reader and continue where the previous compaction left off.	2018-09-03 10:31:44 +03:00
Botond Dénes	48054ed810	flat_mutation_reader: add unpop_mutation_fragment() This is the inverse of `pop_mutation_fragment()`. Allow fragments to be pushed back into the buffer of the reader to undo a previous consumption of the fragments.	2018-09-03 10:31:44 +03:00
Botond Dénes	3bcd577907	Move reconcilable_result_builder declaration to mutation_query.hh It will be used by code outside of mutation_partition.cc so it needs to be public. The definition remains in mutation_partition.cc.	2018-09-03 10:31:44 +03:00
Botond Dénes	b8b34223a4	mutation_source_test: add an additional REQUIRE() test_streamed_mutation_forwarding_is_consistent_with_slicing already has a REQUIRE() for the mutation read with the slicing reader. Add another one for the forwarding reader. This makes it more consistent and also helps finding problems with either the forwarding or slicing reader.	2018-09-03 10:31:44 +03:00
Botond Dénes	d347866664	mutation: add missing assert to mutation from reader read_mutation_from_flat_mutation_reader's internal adapter can build a single mutation only and hence can consume only a single partition. If more than one partition is pushed down from the producer, the adapter will very likely crash. To avoid unnecessary investigations add an assert() to fail early and make it clear what the real problem is. All other consume_ methods have an assert() already for their invariants so this is just following suit.	2018-09-03 10:31:44 +03:00
Botond Dénes	ecb1e79bcc	querier: add shard_mutation_querier The querier to be used for saving shard readers belonging to a multishard range scan. This querier doesn't provide a `consume_page` method as it doesn't support reading from it directly. It is more of a storage to allow caching the reader and any objects it depends on.	2018-09-03 10:31:44 +03:00
Botond Dénes	07cdf766c5	querier: prepare for multi-ranges In the next patch a querier will be added that reads multiple ranges as opposed to a single range that data and mutation queriers read. To keep `querier_cache` code seamless regarding this difference change all range-matching logic to work in terms of `dht::partition_ranges_view`. This allows for cheap and seamless way of having a single code-base for the insert/lookup logic. Code actually matching ranges is updated to be able to handle both singular and multi-ranges while maintaining backward compatibility.	2018-09-03 10:31:44 +03:00
Botond Dénes	88a7effd8d	tests/querier_cache: add tests specific for multiple entry-types	2018-09-03 10:31:44 +03:00
Botond Dénes	c12008b8cb	querier: split querier into separate data and mutation querier types Instead of hiding what compaction method the querier uses (and only exposing it via rejecting `can_be_used_for_page()`) make it very explicit that these are really two different queriers. This allows using different indexes for the two queriers in `querier_cache` and eliminating the possibility of picking up a querier with the wrong compaction method (read kind). This also makes it possible to add new querier type(s) that suit the multishard-query's needs without making a confusing mess of `querier` by making it a union of all querying logic. Splitting the queriers this way changes what happens when a lookup finds a querier of the wrong kind (e.g. emit_only_live::yes for an emit_only_live::no command). As opposed to dropping the found (but wrong) querier, the querier will now simply not be found by the lookup. This is a result of using separate search indexes for the different mutation kinds. This change should have no practical implications. Splitting is done by making querier templated on `emit_only_live_rows`. It doesn't make sense to duplicate the entire querier as the two share 99% of the code.	2018-09-03 10:31:44 +03:00
Botond Dénes	e46251ebf6	querier: move consume_page logic into a free function In preparation for the now single querier being split into multiple more specialized ones. Make it possible for the multiple queriers to share the same implementation. Also, the code can now be reused by outside code as well, not just queriers.	2018-09-03 10:31:44 +03:00
Botond Dénes	c53f17ddb8	querier: move all matching related logic into free functions So that they can be used for multiple querier classes easily, without inheritance. The functions are not visible from the header. Also update the comments on `querier` w.r.t. the removed checking functions. Change the language to be more general. In practice these checks are never done by client code, instead they are done by the `querier_cache`.	2018-09-03 10:31:44 +03:00
Botond Dénes	43f464c52d	querier: inline querier::current_position() and make it public	2018-09-03 10:31:44 +03:00
Botond Dénes	86a61ded7d	querier: s/position/position_view/ Also treat it as a view, that is take it by value in functions, instead of reference.	2018-09-03 10:31:44 +03:00
Botond Dénes	6e4ec53679	querier: move position outside of querier In preparation for having multiple querier types that can share code without inheritance.	2018-09-03 10:31:44 +03:00
Botond Dénes	a172dfec4e	querier: move clustering_position_tracker outside of querier In preparation for having multiple querier types that can share code without inheritance.	2018-09-03 10:31:44 +03:00
Botond Dénes	7bd955e993	querier_cache: move insert/lookup related logic into free functions In preparation for introducing support for multiple entry types in the querier_cache, move all insert/lookup related logic into free functions. Later these functions will be templated so they can handle multiple entry types with the same code.	2018-09-03 10:31:44 +03:00
Botond Dénes	cded477b94	querier: return std::optional<querier> instead of using create_fun() Requiring the caller of lookup() to pass in a `create_fun()` was not such a good idea in hindsight. It leads to awkward call sites and even more awkward code when trying to find out whether the lookup was successful or not. Returning an optional gives calling code much more flexibility and makes the code cleaner.	2018-09-03 10:31:44 +03:00
Botond Dénes	5f726e9a89	querier: move all to query namespace To avoid name clashes.	2018-09-03 10:31:44 +03:00
Botond Dénes	867f69b9d1	dht::i_partitioner: add partition_ranges_view	2018-09-03 10:31:44 +03:00
Botond Dénes	a011a9ebf2	mutation_reader: multishard_combining_reader: support custom dismantler Add a dismantler functor parameter. When the multishard reader is destroyed this functor will be called for each shard reader, passing a future to a `stopped_foreign_reader`. This future becomes available when the shard reader is stopped, that is, when it finished all in-progress read-aheads and/or pending next partition calls. The intended use case for the dismantler functor is a client that needs to be notified when readers are destroyed and/or has to have access to any unconsumed fragments from the foreign readers wrapping the shard readers.	2018-09-03 10:31:44 +03:00
Botond Dénes	f13b878a94	mutation_reader: pass all standard reader params to `remote_reader_factory` Extend `remote_reader_factory` interface so that it accepts all standard mutation reader creation parameters. This allows factory lambdas to be truly stateless, not having to capture any standard parameters that are needed for creating the reader. Standard parameters are those accepted by `mutation_source::make_reader()`.	2018-09-03 10:31:44 +03:00
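
The comparator issue described in commit `8e52b66516` (filtering with descending clustering order) can be illustrated with a minimal sketch. This is not the Scylla implementation: `regular_compare`, `reversed_compare` and `satisfies_less_than` are hypothetical stand-ins, and plain `int` values replace the serialized bytes the real code compares. It only shows why a restriction such as `x < 1` must be evaluated with the regular comparator even when rows are stored in reversed order.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-ins for illustration; none of these names come from Scylla.
using comparator = std::function<int(int, int)>;

// Regular three-way comparison of the underlying type:
// negative if a < b, zero if equal, positive if a > b.
inline int regular_compare(int a, int b) {
    return (a > b) - (a < b);
}

// Reversed comparison, of the kind used to keep rows in DESC clustering order.
inline int reversed_compare(int a, int b) {
    return regular_compare(b, a);
}

// Evaluates the restriction `value < bound` using the supplied comparator.
inline bool satisfies_less_than(int value, int bound, const comparator& cmp) {
    return cmp(value, bound) < 0;
}
```

With `regular_compare`, `satisfies_less_than(0, 1, ...)` holds and `satisfies_less_than(2, 1, ...)` does not, which matches the meaning of `x < 1`. With `reversed_compare` the results flip, so a filter would accept 2 and reject 0; this is the kind of misbehaviour the commit message describes.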

16456 Commits