scylladb

Author	SHA1	Message	Date
Rafael Ávila de Espíndola	0da6807f7e	auth: Turn allow_all_authenticator_name into a std::string_view variable There is no constexpr operator+ for std::string_view, so we have to concatenate the strings ourselves. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-08-04 16:38:27 -07:00
Calle Wilund	30a700c5b0	system_keyspace: Remove support for legacy truncation records Fixes #6341 Since scylla no longer supports upgrading from a version without the "new" (dedicated) truncation record table, we can remove support for these and the migtration thereof. Make sure the above holds whereever this is committed. Note that this does not remove the "truncated_at" field in system.local.	2020-08-03 17:16:26 +03:00
Avi Kivity	257c17a87a	Merge "Don't depend on seastar::make_(lw_)?shared idiosyncrasies" from Rafael " While working on another patch I was getting odd compiler errors saying that a call to ::make_shared was ambiguous. The reason was that seastar has both: template <typename T, typename... A> shared_ptr<T> make_shared(A&&... a); template <typename T> shared_ptr<T> make_shared(T&& a); The second variant doesn't exist in std::make_shared. This series drops the dependency in scylla, so that a future change can make seastar::make_shared a bit more like std::make_shared. " * 'espindola/make_shared' of https://github.com/espindola/scylla: Everywhere: Explicitly instantiate make_lw_shared Everywhere: Add a make_shared_schema helper Everywhere: Explicitly instantiate make_shared cql3: Add a create_multi_column_relation helper main: Return a shared_ptr from defer_verbose_shutdown	2020-08-02 19:51:24 +03:00
Avi Kivity	fea5067dfa	Merge "Limit non-paged query memory consumption" from Botond " Non-paged queries completely ignore the query result size limiter mechanism. They consume all the memory they want. With sufficiently large datasets this can easily lead to a handful or even a single unpaged query producing an OOM. This series continues the work started by `134d5a5f7`, by introducing a configurable pair of soft/hard limit (default to 1MB/100MB) that is applied to otherwise unlimited queries, like reverse and unpaged ones. When an unlimited query reaches the soft limit a warning is logged. This should give users some heads-up to adjust their application. When the hard limit is reached the query is aborted. The idea is to not greet users with failing queries after an upgrade while at the same time protect the database from the really bad queries. The hard limit should be decreased from time to time gradually approaching the desired goal of 1MB. We don't want to limit internal queries, we trust ourselves to either use another form of memory usage control, or read only small datasets. So the limit is selected according to the query class. User reads use the `max_memory_for_unlimited_query_{soft,hard}_limit` configuration items, while internal reads are not limited. The limit is obtained by the coordinator, who passes it down to replicas using the existing `max_result_size` parameter (which is not a special type containing the two limits), which is now passed on every verb, instead of once per connection. This ensures that all replicas work with the same limits. For normal paged queries `max_result_size` is set to the usual `query::result_memory_limiter::maximum_result_size` For queries that can consume unlimited amount of memory -- unpaged and reverse queries -- this is set to the value of the aforementioned `max_memory_for_unlimited_query_{soft,hard}_limit` configuration item, but only for user reads, internal reads are not limited. This has the side-effect that reverse reads now send entire partitions in a single page, but this is not that bad. The data was already read, and its size was below the limit, the replica might as well send it all. Fixes: #5870 " * 'nonpaged-query-limit/v5' of https://github.com/denesb/scylla: (26 commits) test: database_test: add test for enforced max result limit mutation_partition: abort read when hard limit is exceeded for non-paged reads query-result.hh: move the definition of short_read to the top test: cql_test_env: set the max_memory_unlimited_query_{soft,hard}_limit test: set the allow_short_read slice option for paged queries partition_slice_builder: add with_option() result_memory_accounter: remove default constructor query_*(): use the coordinator specified memory limit for unlimited queries storage_proxy: use read_command::max_result_size to pass max result size around query: result_memory_limiter: use the new max_result_size type query: read_command: add max_result_size query: read_command: use tagged ints for limit ctor params query: read_command: add separate convenience constructor service: query_pager: set the allow_short_read flag result_memory_accounter: check(): use _maximum_result_size instead of hardcoded limit storage_proxy: add get_max_result_size() result_memory_limiter: add unlimited_result_size constant database: add get_statement_scheduling_group() database: query_mutations(): obtain the memory accounter inside query: query_class_config: use max_result_size for the max_memory_for_unlimited_query field ...	2020-07-29 13:41:53 +03:00
Botond Dénes	3804dfcc0c	test: database_test: add test for enforced max result limit Two tests are added: one that works on the low-level database API, and another one that works on the CQL API.	2020-07-29 08:32:34 +03:00
Botond Dénes	f7a4d19fb1	mutation_partition: abort read when hard limit is exceeded for non-paged reads If the read is not paged (short read is not allowed) abort the query if the hard memory limit is reached. On reaching the soft memory limit a warning is logged. This should allow users to adjust their application code while at the same time protecting the database from the really bad queries. The enforcement happens inside the memory accounter and doesn't require cooperation from the result builders. This ensures memory limit set for the query is respected for all kind of reads. Previously non-paged reads simply ignored the memory accounter requesting the read to stop and consumed all the memory they wanted.	2020-07-29 08:32:31 +03:00
Botond Dénes	648ce473ab	test: set the allow_short_read slice option for paged queries Some tests use the lower level methods directly and meant to use paging but didn't and nobody noticed. This was revealed by the enforcement of max result size (introduced in a later patch), which caused these tests to fail due to exceeding the max result size. This patch fixes this by setting the `allow_short_reads` slice option.	2020-07-28 18:00:29 +03:00
Botond Dénes	6660a5df51	result_memory_accounter: remove default constructor If somebody wants to bypass proper memory accounting they should at the very least be forced to consider if that is indeed wise and think a second about the limit they want to apply.	2020-07-28 18:00:29 +03:00
Botond Dénes	9eab5bca27	query_*(): use the coordinator specified memory limit for unlimited queries It is important that all replicas participating in a read use the same memory limits to avoid artificial differences due to different amount of results. The coordinator now passes down its own memory limit for reads, in the form of max_result_size (or max_size). For unpaged or reverse queries this has to be used now instead of the locally set max_memory_unlimited_query configuration item. To avoid the replicas accidentally using the local limit contained in the `query_class_config` returned from `database::make_query_class_config()`, we refactor the latter into `database::get_reader_concurrency_semaphore()`. Most of its callers were only interested in the semaphore only anyway and those that were interested in the limit as well should get it from the coordinator instead, so this refactoring is a win-win.	2020-07-28 18:00:29 +03:00
Botond Dénes	159d37053d	storage_proxy: use read_command::max_result_size to pass max result size around Use the recently added `max_result_size` field of `query::read_command` to pass the max result size around, including passing it to remote nodes. This means that the max result size will be sent along each read, instead of once per connection. As we want to select the appropriate `max_result_size` based on the type of the query as well as based on the query class (user or internal) the previous method won't do anymore. If the remote doesn't fill this field, the old per-connection value is used.	2020-07-28 18:00:29 +03:00
Botond Dénes	fbbbc3e05c	query: result_memory_limiter: use the new max_result_size type	2020-07-28 18:00:29 +03:00
Botond Dénes	92a7b16cba	query: read_command: add max_result_size This field will replace max size which is currently passed once per established rpc connection via the CLIENT_ID verb and stored as an auxiliary value on the client_info. For now it is unused, but we update all sites creating a read command to pass the correct value to it. In the next patch we will phase out the old max size and use this field to pass max size on each verb instead.	2020-07-28 18:00:29 +03:00
Botond Dénes	8992bcd1f8	query: read_command: use tagged ints for limit ctor params The convenience constructor of read_command now has two integer parameter next to each other. In the next patch we intend to add another one. This is recipe for disaster, so to avoid mistakes this patch converts these parameters to tagged integers. This makes sure callers pass what they meant to pass. As a matter of fact, while fixing up call-sites, I already found several ones passing `query::max_partitions` to the `row_limit` parameter. No harm done yet, as `query::max_partitions` == `query::max_rows` but this shows just how easy it is to mix up parameters with the same type.	2020-07-28 18:00:29 +03:00
Botond Dénes	2ca118b2d5	query: read_command: add separate convenience constructor query::read_command currently has a single constructor, which serves both as an idl constructor (order of parameters is fixed) and a convenience one (most parameters have default values). This makes it very error prone to add new parameters, that everyone should fill. The new parameter has to be added as last, with a default value, as the previous ones have a default value as well. This means the compiler's help cannot be enlisted to make sure all usages are updated. This patch adds a separate convenience constructor to be used by normal code. The idl constructor looses all default parameters. New parameters can be added to any position in the convenience constructor (to force users to fill in a meaningful value) while the removed default parameters from the idl constructor means code cannot accidentally use it without noticing.	2020-07-28 18:00:29 +03:00
Botond Dénes	c364c7c6a2	result_memory_limiter: add unlimited_result_size constant To be used as the max result size for internal queries.	2020-07-28 18:00:29 +03:00
Botond Dénes	d5cc932a0b	database: query_mutations(): obtain the memory accounter inside Instead of requesting callers to do it and pass it as a parameter. This is in line with data_query().	2020-07-28 18:00:29 +03:00
Botond Dénes	92ce39f014	query: query_class_config: use max_result_size for the max_memory_for_unlimited_query field We want to switch from using a single limit to a dual soft/hard limit. As a first step we switch the limit field of `query_class_config` to use the recently introduced type for this. As this field has a single user at the moment -- reverse queries (and not a lot of propagation) -- we update it in this same patch to use the soft/hard limit: warn on reaching the soft limit and abort on the hard limit (the previous behaviour).	2020-07-28 18:00:29 +03:00
Botond Dénes	46d5b651eb	db/config: introduce max_memory_for_unlimited_query_soft_limit and max_memory_for_unlimited_query_hard_limit This pair of limits replace the old max_memory_for_unlimited_query one, which remains as an alias to the hard limit. The soft limit inherits the previous value of the limit (1MB), when this limit is reached a warning will be logged allowing the users to adjust their client codes without downtime. The hard limit starts out with a more permissive default of 100MB. When this is reached queries are aborted, the same behaviour as with the previous single limit. The idea is to allow clients a grace period for fixing their code, while at the same time protecting the database from the really bad queries.	2020-07-28 18:00:29 +03:00
Piotr Sarna	ee35c4c3d6	db: handle errors when loading view build progress Currently, encountering an error when loading view build progress would result in view builder refusing to start - which also means that future views would not be built until the server restarts. A more user-friendly solution would be to log an error message, but continue to boot the view builder as if no views are currently in progress, which would at least allow future views to be built correctly. The test case is also amended, since now it expects the call to return that "no view builds are in progress" instead of an exception. Fixes #6934 Tests: unit(dev) Message-Id: <9f26de941d10e6654883a919fd43426066cee89c.1595922374.git.sarna@scylladb.com>	2020-07-28 11:32:09 +03:00
Piotr Sarna	0dbcaa1fd9	test: add a case for disengaged optional values in system tables Following the patch which fixes incorrect access to disengaged optionals, a test case which used to reproduce the problem is added. Message-Id: <99174d47c1c55ed8730b4998d5e5e464990d36e3.1595834092.git.sarna@scylladb.com>	2020-07-28 10:06:42 +03:00
Avi Kivity	3f84d41880	Merge "messaging: make verb handler registering independent of current scheduling group" from Botond " `0c6bbc8` refactored `get_rpc_client_idx()` to select different clients for statement verbs depending on the current scheduling group. The goal was to allow statement verbs to be sent on different connections depending on the current scheduling group. The new connections use per-connection isolation. For backward compatibility the already existing connections fall-back to per-handler isolation used previously. The old statement connection, called the default statement connection, also used this. `get_rpc_client_idx()` was changed to select the default statement connection when the current scheduling group is the statement group, and a non-default connection otherwise. This inadvertently broke `scheduling_group_for_verb()` which also used this method to get the scheduling group to be used to isolate a verb at handle register time. This method needs the default client idx for each verb, but if verb registering is run under the system group it instead got the non-default one, resulting in the per-handler isolation not being set-up for the default statement connection, resulting in default statement verb handlers running in whatever scheduling group the process loop of the rpc is running in, which is the system scheduling group. This caused all sorts of problems, even beyond user queries running in the system group. Also as of `0c6bbc8` queries on the replicas are classified based on the scheduling group they are running on, so user reads also ended up using the system concurrency semaphore. In particular this caused severe problems with ranges scans, which in some cases ended up using different semaphores per page resulting in a crash. This could happen because when the page was read locally the code would run in the statement scheduling group, but when the request arrived from a remote coordinator via rpc, it was read in a system scheduling group. This caused a mismatch between the semaphore the saved reader was created with and the one the new page was read with. The result was that in some cases when looking up a paused reader from the wrong semaphore, a reader belonging to another read was returned, creating a disconnect between the lifecycle between readers and that of the slice and range they were referencing. This series fixes the underlying problem of the scheduling group influencing the verb handler registration, as well as adding some additional defenses if this semaphore mismatch ever happens in the future. Inactive read handles are now unique across all semaphores, meaning that it is not possible anymore that a handle succeeds in looking up a reader when used with the wrong semaphore. The range scan algorithm now also makes sure there is no semaphore mismatch between the one used for the current page and that of the saved reader from the previous page. I manually checked that each individual defense added is already preventing the crash from happening. Fixes: #6613 Fixes: #6907 Fixes: #6908 Tests: unit(dev), manual(run the crash reproducer, observe no crash) " * 'query-classification-regressions/v1' of https://github.com/denesb/scylla: multishard_mutation_query: use cached semaphore messaging: make verb handler registering independent of current scheduling group multishard_mutation_query: validate the semaphore of the looked-up reader reader_concurrency_semaphore: make inactive read handles unique across semaphores reader_concurrency_semaphore: add name() accessor reader_concurrency_semaphore: allow passing name to no-limit constructor	2020-07-27 13:56:52 +03:00
Nadav Har'El	f488eaebaf	merge: db/view: view_update_generator: make staging reader evictable Merged patch set by Botond Dénes: The view update generation process creates two readers. One is used to read the staging sstables, the data which needs view updates to be generated for, and another reader for each processed mutation, which reads the current value (pre-image) of each row in said mutation. The staging reader is created first and is kept alive until all staging data is processed. The pre-image reader is created separately for each processed mutation. The staging reader is not restricted, meaning it does not wait for admission on the relevant reader concurrency semaphore, but it does register its resource usage on it. The pre-image reader however is restricted. This creates a situation, where the staging reader possibly consumes all resources from the semaphore, leaving none for the later created pre-image reader, which will not be able to start reading. This will block the view building process meaning that the staging reader will not be destroyed, causing a deadlock. This patch solves this by making the staging reader restricted and making it evictable. To prevent thrashing -- evicting the staging reader after reading only a really small partition -- we only make the staging reader evictable after we have read at least 1MB worth of data from it. test/boost: view_build_test: add test_view_update_generator_buffering test/boost: view_build_test: add test test_view_update_generator_deadlock reader_permit: reader_resources: add operator- and operator+ reader_concurrency_semaphore: add initial_resources() test: cql_test_env: allow overriding database_config mutation_reader: expose new_reader_base_cost db/view: view_updating_consumer: allow passing custom update pusher db/view: view_update_generator: make staging reader evictable db/view: view_updating_consumer: move implementation from table.cc to view.cc database: add make_restricted_range_sstable_reader() Signed-off-by: Botond Dénes <bdenes@scylladb.com> --- db/view/view_updating_consumer.hh \| 51 ++++++++++++++++++++++++++++--- db/view/view.cc \| 39 +++++++++++++++++------ db/view/view_update_generator.cc \| 19 +++++++++--- 3 files changed, 91 insertions(+), 18 deletions(-)	2020-07-27 09:19:37 +02:00
Botond Dénes	fe127a2155	sstables: clamp estimated_partitions to [1, +inf) in writers In some cases estimated number of partitions can be 0, which is albeit a legit estimation result, breaks many low-level sstable writer code, so some of these have assertions to ensure estimated partitions is > 0. To avoid hitting this assert all users of the sstable writers do the clamping, to ensure estimated partitions is at least 1. However leaving this to the callers is error prone as #6913 has shown it. As this clamping is standard practice, it is better to do it in the writers themselves, avoiding this problem altogether. This is exactly what this patch does. It also adds two unit tests, one that reproduces the crash in #6913, and another one that ensures all sstable writers are fine with estimated partitions being 0 now. Call sites previously doing the clamping are changed to not do it, it is unnecessary now as the writer does it itself. Fixes #6913 Tests: unit(dev) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20200724120227.267184-1-bdenes@scylladb.com>	2020-07-27 09:19:37 +02:00
Piotr Sarna	e7c18963e4	test: check sizes before dereferencing the vector It's better to assert a certain vector size first and only then dereference its elements - otherwise, if a bug causes the size to be different, the test can crash with a segfault on an invalid dereference instead of graciously failing with a test assertion.	2020-07-23 16:49:35 +03:00
Botond Dénes	11105cbb78	reader_concurrency_semaphore: make inactive read handles unique across semaphores Currently inactive read handles are only unique within the same semaphore, allowing for an unregister against another semaphore to potentially succeed. This can lead to disasters ranging from crashes to data corruption. While a handle should never be used with another semaphore in the first place, we have recently seen a bug (#6613) causing exactly that, so in this patch we prevent such unregister operations from ever succeeding by making handles unique across all semaphores. This is achieved by adding a pointer to the semaphore to the handle.	2020-07-23 16:43:33 +03:00
Rafael Ávila de Espíndola	e15c8ee667	Everywhere: Explicitly instantiate make_lw_shared seastar::make_lw_shared has a constructor taking a T&&. There is no such constructor in std::make_shared: https://en.cppreference.com/w/cpp/memory/shared_ptr/make_shared This means that we have to move from make_lw_shared(T(...) to make_lw_shared<T>(...) If we don't want to depend on the idiosyncrasies of seastar::make_lw_shared. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-07-21 10:33:49 -07:00
Rafael Ávila de Espíndola	efeaded427	Everywhere: Add a make_shared_schema helper This replaces a lot of make_lw_shared(schema(...)) with make_shared_schema(...). This makes it easier to drop a dependency on the differences between seastar::make_shared and std::make_shared. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-07-21 10:33:49 -07:00
Botond Dénes	929cdd3a15	test/boost: view_build_test: add test_view_update_generator_buffering To exercise the new buffering and pausing logic of the view updating consumer.	2020-07-20 14:32:45 +03:00
Botond Dénes	e316796b3f	test/boost: view_build_test: add test test_view_update_generator_deadlock A test case which reproduces the view update generator hang, where the staging reader consumes all resources and leaves none for the pre-image reader which blocks on the semaphore indefinitely.	2020-07-20 14:32:13 +03:00
Botond Dénes	5de0afdab7	mutation_reader: expose new_reader_base_cost So that test code can use it.	2020-07-20 11:23:39 +03:00
Avi Kivity	5371be71e9	Merge "Reduce fanout of some mutation-related headers" from Pavel E " The set's goal is to reduce the indirect fanout of 3 headers only, but likely affects more. The measured improvement rates are flat_mutation_reader.hh: -80% mutation.hh : -70% mutation_partition.hh : -20% tests: dev-build, 'checkheaders' for changed headers (the tree-wide fails on master) " * 'br-debloat-mutation-headers' of https://github.com/xemul/scylla: headers:: Remove flat_mutation_reader.hh from several other headers migration_manager: Remove db/schema_tables.hh inclustion into header storage_proxy: Remove frozen_mutation.hh inclustion storage_proxy: Move paxos/*.hh inclusions from .hh to .cc storage_proxy: Move hint_wrapper from .hh to .cc headers: Remove mutation.hh from trace_state.hh	2020-07-19 19:47:59 +03:00
Rafael Ávila de Espíndola	9fd2682bfd	restrictions_test: Fix use after return The query_options constructor captures a reference to the cql_config. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200718013221.640926-1-espindola@scylladb.com>	2020-07-19 15:44:38 +03:00
Pavel Emelyanov	8618a02815	migration_manager: Remove db/schema_tables.hh inclustion into header The schema_tables.hh -> migration_manager.hh couple seems to work as one of "single header for everyhing" creating big blot for many seemingly unrelated .hh's. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-07-17 17:54:43 +03:00
Rafael Ávila de Espíndola	66d866427d	sstable_datafile_test: Use BOOST_REQUIRE_EQUAL This only works for types that can be printed, but produces a better error message if the check fails. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Reviewed-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20200716232700.521414-1-espindola@scylladb.com>	2020-07-17 11:58:58 +03:00
Avi Kivity	7bf51b8c6c	Merge 'Distinguish single-column expressions in AST' from Dejan " Fix #6825 by explicitly distinguishing single- from multi-column expressions in AST. Tests: unit (dev), dtest secondary_indexes_test.py (dev) " * dekimir-single-multiple-ast: cql3/restrictions: Separate AST for single column cql3/restrictions: Single-column helper functions	2020-07-16 16:59:14 +03:00
Pavel Solodovnikov	5ff5df1afd	storage_proxy: un-hardcode force sync flag for `mutate_locally(mutation)` overload Corresponding overload of `storage_proxy::mutate_locally` was hardcoded to pass `db::commitlog::force_sync::no` to the `database::apply`. Unhardcode it and substitute `force_sync::no` to all existing call sites (as it were before). `force_sync::yes` will be used later for paxos learn writes when trying to apply mutations upgraded from an obsolete schema version (similar to the current case when applying locally a `frozen_mutation` stored in accepted proposal). Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20200716124915.464789-1-pa.solodovnikov@scylladb.com>	2020-07-16 16:38:48 +03:00
Dejan Mircevski	0047e1e44d	cql3/restrictions: Separate AST for single column Existing AST assumes the single-column expression is a special case of multi-column expressions, so it cannot distinguish `c=(0)` from `(c)=(0)`. This leads to incorrect behaviour and dtest failures. Fix it by separating the two cases explicitly in the AST representation. Modify AST-creation code to create different AST for single- and multi-column expressions. Modify AST-consuming code to handle column_name separately from vector<column_name>. Drop code relying on cardinality testing to distinguisn single-column cases. Add a new unit test for `c=(0)`. Fixes #6825. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2020-07-16 12:27:25 +02:00
Nadav Har'El	09a71ccd84	merge: cql3/restrictions: exclude NULLs from comparison in filtering Merge pull request https://github.com/scylladb/scylla/pull/6834 by Juliusz Stasiewicz: NULLs used to give false positives in GT, LT, GEQ and LEQ ops performed upon ALLOW FILTERING. That was a consequence of not distinguishing NULL from an empty buffer. This patch excludes NULLs on high level, preventing them from entering LHS of any comparison, i.e. it assumes that any binary operation should return false whenever the LHS operand is NULL (note: at the moment filters with RHS NULL, such as ...WHERE x=NULL ALLOW FILTERING, return empty sets anyway). Fixes #6295 * '6295-do-not-compare-nulls-v2' of github.com:jul-stas/scylla: filtering_test: check that NULLs do not compare to normal values cql3/restrictions: exclude NULLs from comparison in filtering	2020-07-15 18:32:14 +03:00
Juliusz Stasiewicz	c25398e8cf	filtering_test: check that NULLs do not compare to normal values Tested operators are: `<` and `>`. Tests all types of NULLs except `duration` because durations are explicitly not comparable. Values to compare against were chosen arbitrarily.	2020-07-14 15:37:17 +02:00
Pavel Emelyanov	cf1315cde5	double-decker: A combination of B+tree with array The collection is K:V store bplus::tree<Key = K, Value = array_trusted_bounds<V>> It will be used as partitions cache. The outer tree is used to quickly map token to cache_entry, the inner array -- to resolve (expected to be rare) hash collisions. It also must be equipped with two comparators -- less one for keys and full one for values. The latter is not kept on-board, but it required on all calls. The core API consists of just 2 calls - Heterogenuous lower_bound(search_key) -> iterator : finds the element that's greater or equal to the provided search key Other than the iterator the call returns a "hint" object that helps the next call. - emplace_before(iterator, key, hint, ...) : the call construct the element right before the given iterator. The key and hint are needed for more optimal algo, but strictly speaking not required. Adding an entry to the double_decker may result in growing the node's array. Here to B+ iterator's .reconstruct() method comes into play. The new array is created, old elements are moved onto it, then the fresh node replaces the old one. // TODO: Ideally this should be turned into the // template <typename OuterCollection, typename InnerCollection> // but for now the double_decker still has some intimate knowledge // about what outer and inner collections are. Insertion into this collection _may_ invalidate iterators, but may leave intact. Invalidation only happens in case of hashing conflict, which can be clearly seen from the hint object, so there's a good room for improvement. The main usage by row_cache (the find_or_create_entry) looks like cache_entry find_or_create_entry() { bound_hint hint; it = lower_bound(decorated_key, &hint); if (!hint.found) { it = emplace_before(it, decorated_key.token(), hint, <constructor args>) } return *it; } Now the hint. It contains 3 booleans, that are - match: set to true when the "greater or equal" condition evaluated to "equal". This frees the caller from the need to manually check whether the entry returned matches the search key or the new one should be inserted. This is the "!found" check from the above snippet. To explain the next 2 bools, here's a small example. Consider the tree containing two elements {token, partition key}: { 3, "a" }, { 5, "z" } As the collection is sorted they go in the order shown. Next, this is what the lower_bound would return for some cases: { 3, "z" } -> { 5, "z" } { 4, "a" } -> { 5, "z" } { 5, "a" } -> { 5, "z" } Apparently, the lower bound for those 3 elements are the same, but the code-flows of emplacing them before one differ drastically. { 3, "z" } : need to get previous element from the tree and push the element to it's vector's back { 4, "a" } : need to create new element in the tree and populate its empty vector with the single element { 5, "a" } : need to put the new element in the found tree element right before the found vector position To make one of the above decisions the .emplace_before would need to perform another set of comparisons of keys and elements. Fortunately, the needed information was already known inside the lower_bound call and can be reported via the hint. Said that, - key_match: set to true if tree.lower_bound() found the element for the Key (which is token). For above examples this will be true for cases 3z and 5a. - key_tail: set to true if the tree element was found, but when comparing values from array the bounding element turned out to belong to the next tree element and the iterator was ++-ed. For above examples this would be true for case 3z only. And the last, but not least -- the "erase self" feature. Which is given only the cache_entry pointer at hands remove it from the collection. To make this happen we need to make two steps: 1. get the array the entry sits in 2. get the b+ tree node the vectors sits in Both methods are provided by array_trusted_bounds and bplus::tree. So, when we need to get iterator from the given T pointer, the algo looks like - Walk back the T array untill hitting the head element - Call array_trusted_bounds::from_element() getting the array - Construct b+ iterator from obtained array - Construct the double_decker iterator from b+ iterator and from the number of "steps back" from above - Call double_decker::iterator.erase() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-07-14 16:29:53 +03:00
Pavel Emelyanov	eb70644c1c	intrusive-array: Array with trusted bounds A plain array of elements that grows and shrinks by constructing the new instance from an existing one and moving the elements from it. Behaves similarly to vector's external array, but has 0-bytes overhead. The array bounds (0-th and N-th elemements) are determined by checking the flags on the elements themselves. For this the type must support getters and setters for the flags. To remove an element from array there's also a nothrow option that drops the requested element from array, shifts the righter ones left and keeps the trailing unused memory (so called "train") until reconstruction or destruction. Also comes with lower_bound() helper that helps keeping the elements sotred and the from_element() one that returns back reference to the array in which the element sits. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-07-14 16:29:49 +03:00
Pavel Emelyanov	95f15ea383	utils: B+ tree implementation // The story is at // https://groups.google.com/forum/#!msg/scylladb-dev/sxqTHM9rSDQ/WqwF1AQDAQAJ This is the B+ version which satisfies several specific requirements to be suitable for row-cache usage. 1. Insert/Remove doesn't invalidate iterators 2. Elements should be LSA-compactable 3. Low overhead of data nodes (1 pointer) 4. External less-only comparator 5. As little actions on insert/delete as possible 6. Iterator walks the sorted keys The design, briefly is: There are 3 types of nodes: inner, leaf and data, inner and leaf keep build-in array of N keys and N(+1) nodes. Leaf nodes sit in a doubly linked list. Data nodes live separately from the leaf ones and keep pointers on them. Tree handler keeps pointers on root and left-most and right-most leaves. Nodes do _not_ keep pointers or references on the tree (except 3 of them, see below). changes in v9: - explicitly marked keys/kids indices with type aliases - marked the whole erase/clear stuff noexcept - disposers now accept object pointer instead of reference - clear tree in destructor - added more comments - style/readability review comments fixed Prior changes - Add noexcepts where possible - Restrict Less-comparator constraint -- it must be noexcept - Generalized node_id - Packed code for beging()/cbegin() - Unsigned indices everywhere - Cosmetics changes - Const iterators - C++20 concepts - The index_for() implmenetation is templatized the other way to make it possible for AVX key search specialization (further patching) - Insertion tries to push kids to siblings before split Before this change insertion into full node resulted into this node being split into two equal parts. This behaviour for random keys stress gives a tree with ~2/3 of nodes half-filled. With this change before splitting the full node try to push one element to each of the siblings (if they exist and not full). This slows the insertion a bit (but it's still way faster than the std::set), but gives 15% less total number of nodes. - Iterator method to reconstruct the data at the given position The helper creates a new data node, emplaces data into it and replaces the iterator's one with it. Needed to keep arrays of data in tree. - Milli-optimize erase() - Return back an iterator that will likely be not re-validated - Do not try to update ancestors separation key for leftmost kid This caused the clear()-like workload work poorly as compared to std:set. In particular the row_cache::invalidate() method does exactly this and this change improves its timing. - Perf test to measure drain speed - Helper call to collect tree counters - Fix corner case of iterator.emplace_before() - Clean heterogenous lookup API - Handle exceptions from nodes allocations - Explicitly mark places where the key is copied (for future) - Extend the tree.lower_bound() API to report back whether the bound hit the key or not - Addressed style/cleanness review comments Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-07-14 16:29:43 +03:00
Juliusz Stasiewicz	c69075bbef	cql3/restrictions: exclude NULLs from comparison in filtering NULLs used to give false positives in GT, LT, GEQ and LEQ ops performed upon `ALLOW FILTERING`. That was a consequence of not distinguishing NULL from an empty buffer. This patch excludes NULLs on high level, preventing them from entering any comparison, i.e. it assumes that any binary operator should return `false` whenever one of the operands is NULL (note: ATM filters such as `...WHERE x=NULL ALLOW FILTERING` return empty sets anyway). `restriction_test/regular_col_slice` had to be updated accordingly. Fixes #6295	2020-07-14 12:59:01 +02:00
Benny Halevy	3ce86a7160	test: restrictions_test: set_contains: uncomment check depnding on #6797 Now that #6797 is fixed. Refs #5763 Cc: Dejan Mircevski <dejan@scylladb.com> Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Test: restrictions_test(debug) Message-Id: <20200709123703.955897-1-bhalevy@scylladb.com>	2020-07-09 17:56:09 +03:00
Nadav Har'El	9042161ba3	merge: cdc: better pre/postimages for complicated batches Merged pull request https://github.com/scylladb/scylla/pull/6741 by Piotr Dulikowski: This PR changes the algorithm used to generate preimages and postimages in CDC log. While its behavior is the same for non-batch operations (with one exception described later), it generates pre/postimages that are organized more nicely, and account for multiple updates to the same row in one CQL batch. Fixes #6597, #6598 Tests: - unit(dev), for each consecutive commit - unit(debug), for the last commit Previous method The previous method worked on a per delta row basis. First, the base table is queried for the current state of the rows being modified in the processed mutation (this is called the "preimage query"). Then, for each delta row (representing a modification of a row): If preimage is enabled and the row was already present in the table, a corresponding preimage row is inserted before the delta row. The preimage row contains data taken directly from the preimage query result. Only columns that are modified by the delta are included in the preimage. If postimage is enabled, then a postimage row is inserted after the delta row. The postimage row contains data which was a result of taking row data directly from the preimage query result and applying the change the corresponding delta row represented. All columns of the row are included in the postimage. The above works well for simple cases such like singular CQL INSERT, UPDATE, DELETE, or simple CQL BATCH-es. An example: cqlsh:ks> BEGIN UNLOGGED BATCH INSERT INTO tbl (pk, ck, v) VALUES (0, 1, 111); INSERT INTO tbl (pk, ck, v) VALUES (0, 2, 222); APPLY BATCH; cqlsh:ks> SELECT "cdc$batch_seq_no", "cdc$operation", "cdc$ttl", pk, ck, v from ks.tbl_scylla_cdc_log ; cdc$batch_seq_no \| cdc$operation \| cdc$ttl \| pk \| ck \| v ------------------+---------------+---------+----+----+----- ...snip... 0 \| 0 \| null \| 0 \| 1 \| 100 1 \| 2 \| null \| 0 \| 1 \| 111 2 \| 9 \| null \| 0 \| 1 \| 111 3 \| 0 \| null \| 0 \| 2 \| 200 4 \| 2 \| null \| 0 \| 2 \| 222 5 \| 9 \| null \| 0 \| 2 \| 222 Preimage rows are represented by cdc operation 0, and postimage by 9. Please note that all rows presented above share the same value of cdc$time column, which was not shown here for brevity. Problems with previous approach This simple algorithm has some conceptual and implementational problems which arise when processing more complicated CQL BATCH-es. Consider the following example: cqlsh:ks> BEGIN UNLOGGED BATCH INSERT INTO tbl (pk, ck, v1) VALUES (0, 0, 1) USING TTL 1000; INSERT INTO tbl (pk, ck, v2) VALUES (0, 0, 2) USING TTL 2000; APPLY BATCH; cqlsh:ks> SELECT "cdc$batch_seq_no", "cdc$operation", "cdc$ttl", pk, ck, v1, v2 FROM tbl_scylla_cdc_log; cdc$batch_seq_no \| cdc$operation \| cdc$ttl \| pk \| ck \| v1 \| v2 ------------------+---------------+---------+----+----+------+------ ...snip... 0 \| 0 \| null \| 0 \| 0 \| null \| 0 1 \| 2 \| 2000 \| 0 \| 0 \| null \| 2 2 \| 9 \| null \| 0 \| 0 \| 0 \| 2 3 \| 0 \| null \| 0 \| 0 \| 0 \| null 4 \| 1 \| 1000 \| 0 \| 0 \| 1 \| null 5 \| 9 \| null \| 0 \| 0 \| 1 \| 0 A single cdc group (corresponding to rows sharing the same cdc$time) might have more than one delta that modify the same row. For example, this happens when modifying two columns of the same row with different TTLs - due to our choice of CDC log schema, we must represent such change with two delta rows. It does not make sense to present a postimage after the first delta and preimage before the second - both deltas are applied simultaneously by the same CQL BATCH, so the middle "image" is purely imaginary and does not appear at any point in the table. Moreover, in this example, the last postimage is wrong - v1 is updated, but v2 is not. None of the postimages presented above represent the final state of the row. New algorithm The new algorithm works now on per cdc group basis, not delta row. When starting processing a CQL BATCH: Load preimage query results into a data structure representing current state of the affected rows. For each cdc group: For each row modified within the group, a preimage is produced, regardless if the row was present in the table. The preimage is calculated based on the current state. Only include columns that are modified for this row within the group. For each delta, produce a delta row and update the current state accordingly. Produce postimages in the same way as preimages - but include all columns for each row in the postimage. The new algorithm produces postimage correctly when multiple deltas affect one, because the state of the row is updated on the fly. This algorithm moves preimage and postimage rows to the beginning and the end of the cdc group, accordingly. This solves the problem of imaginary preimages and postimages appearing inside a cdc group. Unfortunately, it is possible for one CQL BATCH to contain changes that use multiple timestamps. This will result in one CQL BATCH creating multiple cdc groups, with different cdc$time. As it is impossible, with our choice of schema, to tell that those cdc groups were created from one CQL BATCH, instead we pretend as if those groups were separate CQL operations. By tracking the state of the affected rows, we make sure that preimage in later groups will reflect changes introduces in previous groups. One more thing - this algorithm should have the same results for singular CQL operations and simple CQL BATCH-es, with one exception. Previously, preimage not produced if a row was not present in the table. Now, the preimage row will appear unconditionally - it will have nulls in place of column values. * 'cdc-pre-postimage-persistence' of github.com:piodul/scylla: cdc: fix indentation cdc: don't update partition state when not needed cdc: implement pre/postimage persistence cdc: add interface for producing pre/postimages cdc: load preimage query result into partition state fields cdc: introduce fields for keeping partition state cdc: rename set_pk_columns -> allocate_new_log_row cdc: track batch_no inside transformer cdc: move cdc$time generation to transformer cdc: move find_timestamp to split.cc cdc: introduce change_processor interface cdc: remove redundant schema arguments from cdc functions cdc: move management of generated mutations inside transformer cdc: move preimage result set into a field of transformer cdc: keep ts and tuuid inside transformer cdc: track touched parts of mutations inside transformer cdc: always include preimage for affected rows	2020-07-09 16:55:55 +03:00
Dejan Mircevski	d956233a80	cql_query_test: Drop get() on cquery_nofail result cquery_nofail returns the query result, not a future. Invoking .get() on its result is unnecessary. This just happened to compile because shared_ptr has a get() method with the same signature as future::get. Tests: cql_query_test unit test (dev) Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2020-07-09 13:52:52 +03:00
Piotr Dulikowski	246f8da6f6	cdc: implement pre/postimage persistence Moves responsibility for generating pre/postimage rows from the "process_change" method to "produce_preimage" and "produce_postimage". This commit actually affects the contents of generated CDC log mutations. Added a unit test that verifies more complicated cases with CQL BATCH.	2020-07-08 15:36:41 +02:00
Piotr Dulikowski	f907cab156	cdc: remove redundant schema arguments from cdc functions A `mutation` object already has a reference to its schema. It does not make sense to call functions changed in this commit with a different schema.	2020-07-08 15:36:40 +02:00
Piotr Dulikowski	027d20c654	cdc: always include preimage for affected rows This changes the current algorithm so that the preimage row will not be skipped if the corresponding rows was not present in preimage query results.	2020-07-08 15:36:40 +02:00
Avi Kivity	7ea9ee27dd	Merge 'aggregates: Use type-specific comparators in min/max' from Juliusz " For collections and UDTs the `MIN()` and `MAX()` functions are generated on the fly. Until now they worked by comparing just the byte representations of their arguments. This patch employs specific per-type comparators to provide semantically sensible, dynamically created aggregates. Fixes #6768 " * jul-stas-6768-use-type-comparators-for-minmax: tests: Test min/max on set aggregate_fcts: Use per-type comparators for dynamic types	2020-07-08 15:07:57 +03:00

1 2 3 4 5 ...

480 Commits