Return the pre-6773563d3 behavior of demanding ALLOW FILTERING when a partition slice is requested on a potentially unlimited number of partitions. Put it behind a flag defaulting to "off" for now.
Fixes #7608; see comments there for justification.
Tests: unit (debug, dev), dtest (cql_additional_test, paging_test)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #9126
* github.com:scylladb/scylla:
cql3: Demand ALLOW FILTERING for unlimited, sliced partitions
cql3: Track warnings in prepared_statement
test: Use ALLOW FILTERING more strictly
cql3: Add statement_restrictions::to_string
When a query requests a partition slice but doesn't limit the number
of partitions, require that it also say ALLOW FILTERING. Although
do_filter() isn't invoked for such queries, the query can still be
unexpectedly slow, and we want to signal that to the user by
demanding they explicitly say ALLOW FILTERING.
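A minimal sketch of an affected query shape, assuming a hypothetical
table t(p int, c int, v int, PRIMARY KEY (p, c)):

SELECT * FROM t WHERE c = 0;                 -- no partition key restriction; rejected when the flag is on
SELECT * FROM t WHERE c = 0 ALLOW FILTERING; -- accepted

The first form reads the c = 0 slice from every partition in the
table, which is why we now ask for the explicit keyword.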
Because we now reject queries that worked fine before, existing
applications can break. Therefore, the behavior is controlled by a
flag currently defaulting to off. We will default to "on" in the next
Scylla version.
Fixes #7608; see comments there for justification.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Removes the old code used to calculate the local-index clustering range
and replaces it with new code based on the expression variant.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
"
`function_call` AST nodes are created for each function
with side effects in a CQL query, i.e. non-deterministic
functions (`uuid()`, `now()` and some other timeuuid-related
functions). These nodes are evaluated either when the query itself
is executed or when query restrictions are computed (e.g.
partition/clustering key ranges for LWT requests).
We need to cache the calls: otherwise, when handling a
`bounce_to_shard` request for an LWT query, we could enter an
infinite bouncing loop if a function is used to calculate the
partition key ranges, since the results can differ on each
evaluation. Furthermore, we don't support bouncing more than once:
returning a `bounce_to_shard` message a second time results in a
crash.
Caching works only for LWT statements and only for the function
calls that affect partition key range computation for the query.
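For illustration, a hedged sketch with a hypothetical table
t(id uuid PRIMARY KEY, v int):

INSERT INTO t (id, v) VALUES (uuid(), 1) IF NOT EXISTS;

Here uuid() determines the partition key and therefore the owning
shard. If the call were re-evaluated after a bounce_to_shard hop,
the key (and thus the shard) could change, bouncing the request
again.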
The `variable_specifications` class is renamed to `prepare_context`
and generalized to record information about each `function_call`
AST node and modify it as needed:
* Check whether a given function call is part of a partition-key
statement restriction.
* Assign ids for caching if the above is true and the call is part
of an LWT statement.
Function calls are indexed by the order in which they appear
within a statement while parsing. There is no need to include any
kind of statement identifier in the cache key, since `query_options`
(which holds the cache) is limited to a single statement anyway.
Note that `function_call::raw` AST nodes are not created
for selection clauses of a SELECT statement, hence they
can accept only one of the following as parameters:
* Other function calls.
* Literal values.
* Parameter markers.
In other words, only parameters that can be immediately reduced
to a byte buffer are allowed, and we don't need to handle
database inputs to non-pure functions separately since they
are not possible in this context. In any case, we don't have
a single non-pure function that accepts arguments, so such
precautions are not needed at the moment.
Add a test written in the `cql-pytest` framework to verify
that both prepared and unprepared LWT statements handle
`bounce_to_shard` messages correctly in such a scenario.
Fixes: #8604
Tests: unit(dev, debug)
NOTE: the patchset uses `query_options` as a container for
cached values. This doesn't look clean, and `service::query_state`
seems a better place to store them. But it's not forwarded to most
of the CQL code, and forwarding it would mean amending a huge
number of places.
The series presents a trade-off to avoid forwarding `query_state`
everywhere (but maybe it's the thing that needs to be done, nonetheless).
"
* 'lwt_bounce_to_shard_cached_fn_v6' of https://github.com/ManManson/scylla:
cql-pytest: add a test for non-pure CQL functions
cql3: cache function calls evaluation for non-deterministic functions
cql3: rename `variable_specifications` to `prepare_context`
hierarchy with expressions' from Avi Kivity
Currently, the grammar has two parallel hierarchies. One hierarchy is
used in the WHERE clause, and is based on a combination of `term`
and expressions. The other is used in the SELECT clause, and uses
the cql3::selection::selectable hierarchy. There is some overlap
between the hierarchies: both can name columns. Logically, however,
they overlap completely - in SQL anything you can select you can
filter on, and vice versa. So merging the two hierarchies is important if
we want to enrich CQL. This series does that, partially (see below),
converting the SELECT clause to expressions.
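As a hypothetical example of the end goal, a single expression
hierarchy would let the same construct appear in either clause:

SELECT v + 1 FROM t WHERE v + 1 > 10;

The grammar does not accept such expressions yet (see the note on
on_internal_error() below); the merge is groundwork for allowing
them.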
There is another hierarchy split: between the "raw", pre-prepare object
hierarchy, and post-prepare non-raw. This series limits itself to converting
the raw hierarchy and leaves the non-raw hierarchy alone.
An important design choice is not to have this raw/non-raw split in expressions.
Note that most of the hierarchy is completely parallel: addition is addition
both before prepare and after prepare (but see [1]). The main difference
is around identifiers - before preparation they are unresolved, and after
preparation they become `column_definition` objects. We resolve that by
having two separate types: `unresolved_identifier` for the pre-prepare phase,
and the existing `column_value` for post-prepare phase.
Alternative choices would be to keep a separate expression::raw variant, or
to template the expression variant on whether it is raw or not. I think
either would cause undue bloat and confusion.
Note the series introduces many on_internal_error() calls. This is because
there is not a lot of overlap in the hierarchies today; you can't have a cast in
the WHERE clause, for example. These on_internal_error() calls cannot be
triggered since the grammar does not yet allow such expressions to be
expressed. As we expand the grammar, they will have to be replaced with
working implementations.
Lastly, field selection is expressible in both hierarchies. This series does not yet
merge the two representations (`column_value.sub` vs `field_selection`), but it
should be easy to do so later.
[1] the `+` operator can also be translated to list concatenation, which we may
choose to represent by yet another type.
Test: unit(dev)
Closes #9087
* github.com:scylladb/scylla:
cql3: expression: update find_atom, count_if for function_call, cast, field_selection
cql3: expressions: fix printing of nested expressions
cql3: selection: replace selectable::raw with expression
cql3: expression: convert selectable::with_field_selection::raw to expression
cql3: expression: convert selectable::with_cast::raw to expression
cql3: expression: convert selectable::with_anonymous_function::raw to expression
cql3: expression: convert selectable::with_function_call::raw to expressions
cql3: selectable: make selectable::raw forward-declarable
cql3: expressions: convert writetime_or_ttl::raw to expression
cql3: expression: add convenience constructor from expression element to nested expression
utils: introduce variant_element.hh
cql3: expression: use nested_expression in binary_operator
cql3: expression: introduce nested_expression class
Convert column_identifier_raw's use as selectable to expressions
make column_identifier::raw forward declarable
cql3: introduce selectable::with_expression::raw
And reuse these values when handling `bounce_to_shard` messages.
Otherwise such a function (e.g. `uuid()`) can yield a different
value when a statement is re-executed on the other shard.
This can lead to an infinite stream of `bounce_to_shard` messages
if the function value is used to calculate partition key ranges
for the query, which in turn will cause a crash, since we don't
support bouncing more than once.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Now that all selectable::raw subclasses have been converted to
cql3::selectable::with_expression::raw, the class structure is
just a wrapper around expressions. Peel it, converting the
virtual member functions to free functions, and replacing
object instances with expression or nested_expression as the
case allows.
Rather than creating a new variant element in expression, we extend
function_call to handle both named and anonymous functions, since
most of the processing is the same.
Introduce unresolved_identifier as an unprepared counterpart to column_value.
column_identifier_raw no longer inherits from selectable::raw, but
keeps its methods for now to reduce churn.
The class is repurposed to be more generic and to hold additional
metadata related to function calls within a CQL statement. All
methods are renamed appropriately.
Visitor functions in AST nodes (`collect_marker_specification`)
are also renamed to the more generic `fill_prepare_context`.
The name `prepare_context` designates that this metadata
structure is a byproduct of the `stmt::raw::prepare()` call and
is needed only for the "prepare" step of query execution.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
When using an index, restrictions like token(p) <= x were ignored.
Because of this, a query like the following would select all rows
where r = 0, disregarding the token restriction:
SELECT * FROM tab WHERE r = 0 AND token(p) > 0;
This patch adds proper handling of token restrictions to queries
that use indexes.
Old indexes represented the token as a blob, which complicates
clustering bounds. Special code is included to translate token
clustering bounds to blob clustering bounds.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Returning a function parameter is implicitly treated as a move and
does not require a std::move(). Enable -Wredundant-move to warn us
that the move is unneeded, and gain slightly more readable code. A
few violations are trivially adjusted.
Closes #9004
This warning prevents using std::move() where it can hurt
- on an unnamed temporary or a named automatic variable being
returned from a function. In both cases the value could be
constructed directly in its final destination, but std::move()
prevents it.
Fix the handful of cases (all trivial), and enable the warning.
Closes #8992
After 845e36e76 "cql3: Use expr for global-index partition slice",
there is more dead code than was initially dropped; remove it.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #8981
Instead of creating a single_column_clustering_key_restrictions
object, create an equivalent vector of expr::expressions and calculate
from it the clustering ranges just like we do for base-table queries.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
The original code that introduced page-limit enforcement for indexed
statements created a new constant for the max result size in bytes.
Botond reported that we already have such a constant, so it is now
used instead of reinventing it from scratch.
Closes #8839
Indexed select statements fetch primary key information from
their internal materialized views and then use it to query
the base table. Unfortunately, the current mechanism for retrieving
base table rows makes it easy to overwhelm the replicas with unbounded
concurrency - the number of concurrent ops is increased exponentially
until a short read is encountered, but that alone is not enough to
cap the concurrency: if data is fetched row-by-row, short reads
usually don't occur, and as a result it's easy to see concurrency of 1M or
higher. In order to avoid overloading the replicas, the concurrency
of indexed queries is now capped at 4096 and additionally throttled
if enough results are already fetched. For paged queries it means that
the query returns as soon as 1MB of data is ready, and for unpaged ones
the concurrency will no longer be doubled as soon as the previous
iteration fetched 1MB of results.
The fixed value of 4096 can be debated; the reasoning is as follows:
2KiB rows, so moderately large but not huge, result in fetching
around 10MB of data, which is the granularity used by replicas.
200B rows, which are rather small, would still yield around 1MB.
At the same time, 4096 separate tasks also mean 4096 allocations,
so increasing the number further strains the allocator.
Fixes #8799
Tests: unit(release),
manual: observing metrics of modified index_paging_test
Closes #8814
* github.com:scylladb/scylla:
cql3: limit the transitional result size for indexed queries
cql3: return indexed pages after 1MB worth of data
cql3: limit the concurrency of indexed statements
Unpaged indexed queries already have a concurrency limit of 4096,
but now the concurrency is further limited by the number of bytes
previously fetched. Once this number reaches 1MB, the concurrency
is not increased in consecutive iterations, to avoid overload.
Currently there's no practical limit on the resulting page size
for an indexed query, because it simply translates a page's worth
of base primary keys into base rows. To avoid sending overly large
pages, the result is now returned after hitting a 1MB limit.
When recreating the paging state from an indexed query,
a bunch of panic checks were introduced to make sure that
the code is correct. However, one of the checks is too eager:
it throws an error if the base column type is not equal
to the view column type. That usually works, unless the
base column is a clustering key column with DESC clustering order,
in which case its type is actually "reversed". From the point of view
of paging state generation this is not important, because both
types deserialize in the same way, so the check should be less
strict and allow the base type to be reversed.
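A hedged sketch of a schema that trips the check (names hypothetical):

CREATE TABLE t (p int, c int, v int, PRIMARY KEY (p, c))
    WITH CLUSTERING ORDER BY (c DESC);
CREATE INDEX ON t (v);
SELECT * FROM t WHERE v = 0;

Paging this query rebuilds the base clustering key c, whose base
type is the reversed counterpart of the type stored in the view.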
Tests: unit(release), along with the additional test case
introduced in this series; the test also passes
on Cassandra
Fixes #8666
We will remove bounds_ranges when we kill the restrictions class
hierarchy. Of the several call sites, two can be easily modified to
avoid it. Others are more complicated and will be modified in a
subsequent commit.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
storage_proxy.hh is huge and includes many headers itself, so
remove its inclusions from headers and re-add smaller headers
where needed (and storage_proxy.hh itself in source files that
need it).
Ref #1.
The indexed queries will now record which index was chosen
for fetching the base table keys.
Example output:
activity
------------------------------------------------------------------------------------------------------------------------
Parsing a statement
Processing a statement
Consulting index my_v2_idx for a single slice of keys
Creating read executor for token -3248873570005575792 with all: {127.0.0.1} targets: {127.0.0.1} repair decision: NONE
read_data: querying locally
Start querying singular range {{-3248873570005575792, pk{000400000002}}}
Querying cache for range {{-3248873570005575792, pk{000400000002}}} and slice {(-inf, +inf)}
Querying is done
Done processing - preparing a result
We want to change the internals of cql3::raw_value{_view}.
However, users of cql3::raw_value and cql3::raw_value_view often
extract the internal representation, which will be different
after the planned change.
This commit prepares us for the change by routing all accesses to
the value inside cql3::raw_value{_view} through helper methods that
don't expose the internal representation publicly.
After this commit we are free to change the internal representation
of raw_value{_view} without breaking their users.
When querying an index table, we assemble clustering-column
restrictions for that query by going over the base table token,
partition columns, and clustering columns. But if one of those
columns is the indexed column, there is a problem: the indexed column
is the index table's partition key, not a clustering key. We end up
with an invalid clustering slice, which can cause problems downstream.
Fix this by skipping the indexed column when assembling the clustering
restrictions.
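A hedged illustration with a hypothetical schema:

CREATE TABLE t (p int, c int, v int, PRIMARY KEY (p, c));
CREATE INDEX ON t (c);
SELECT * FROM t WHERE c = 0;

The index query looks up c = 0 as the index table's partition key;
folding c into the clustering restrictions as well would yield an
invalid slice.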
Tests: unit (dev)
Fixes #7888
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #8320
Currently the statement's execute() method accepts the storage
proxy as the first argument. This is enough for all of them
but the schema-altering ones, because the latter need to call
the migration manager's announce.
To provide the migration manager to those who need it, we need
some higher-level service than the proxy. The query processor
seems to be a good candidate.
That said, all the .execute()s now accept the query processor
instead of the proxy and get the proxy itself from the query
processor.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is a revival of #7490.
Quoting #7490:
The managed_bytes class now uses implicit linearization: outside LSA, data is never fragmented, and within LSA, data is linearized on-demand, as long as the code is running within with_linearized_managed_bytes() scope.
We would like to stop linearizing managed_bytes and keep it fragmented at all times, since linearization can require large contiguous chunks. Large contiguous allocations are hard to satisfy and cause latency spikes.
As a first step towards that, we remove all implicitly linearizing accessors and replace them with an explicit linearization accessor, with_linearized().
Some of the linearization happens long before use, by creating a bytes_view of the managed_bytes object and passing it onwards, perhaps storing it for later use. This does not work with with_linearized(), which creates a temporary linearized view, and does not work towards the longer term goal of never linearizing. As a substitute a managed_bytes_view class is introduced that acts as a view for managed_bytes (for interoperability it can also be a view for bytes and is compatible with bytes_view).
By the end of the series, all linearizations are temporary, within the scope of a with_linearized() call and can be converted to fragmented consumption of the data at leisure.
This has limited practical value directly, as current uses of managed_bytes are limited to keys (which are limited to 64k). However, it enables converting the atomic_cell layer back to managed_bytes (so we can remove IMR) and the CQL layer to managed_bytes/managed_bytes_view, removing contiguous allocations from the coordinator.
Closes #7820
* github.com:scylladb/scylla:
test: add hashers_test
memtable: fix accounting of managed_bytes in partition_snapshot_accounter
test: add managed_bytes_test
utils: fragment_range: add a fragment iterator for FragmentedView
keys: update comments after changes and remove an unused method
mutation_test: use the correct preferred_max_contiguous_allocation in measuring_allocator
row_cache: more indentation fixes
utils: remove unused linearization facilities in `managed_bytes` class
misc: fix indentation
treewide: remove remaining `with_linearized_managed_bytes` uses
memtable, row_cache: remove `with_linearized_managed_bytes` uses
utils: managed_bytes: remove linearizing accessors
keys, compound: switch from bytes_view to managed_bytes_view
sstables: writer: add write_* helpers for managed_bytes_view
compound_compat: transition legacy_compound_view from bytes_view to managed_bytes_view
types: change equal() to accept managed_bytes_view
types: add parallel interfaces for managed_bytes_view
types: add to_managed_bytes(const sstring&)
serializer_impl: handle managed_bytes without linearizing
utils: managed_bytes: add managed_bytes_view::operator[]
utils: managed_bytes: introduce managed_bytes_view
utils: fragment_range: add serialization helpers for FragmentedMutableView
bytes: implement std::hash using appending_hash
utils: mutable_view: add substr()
utils: fragment_range: add compare_unsigned
utils: managed_bytes: make the constructors from bytes and bytes_view explicit
utils: managed_bytes: introduce with_linearized()
utils: managed_bytes: constrain with_linearized_managed_bytes()
utils: managed_bytes: avoid internal uses of managed_bytes::data()
utils: managed_bytes: extract do_linearize_pure()
thrift: do not depend on implicit conversion of keys to bytes_view
clustering_bounds_comparator: do not depend on implicit conversion of keys to bytes_view
cql3: expression: linearize get_value_from_mutation() earlier
bytes: add to_bytes(bytes)
cql3: expression: mark do_get_value() as static
The keys classes (partition_key et al) already use managed_bytes,
but they assume the data is not fragmented and make liberal use
of that by casting to bytes_view. The view classes use bytes_view.
Change that to managed_bytes_view, and adjust return values
to managed_bytes/managed_bytes_view.
The callers are adjusted. In some places linearization (to_bytes())
is needed, but this isn't too bad as keys are always <= 64k and thus
will not be fragmented when out of LSA. We can remove this
linearization later.
The serialize_value() template is called from a long chain, and
can be reached with either bytes_view or managed_bytes_view.
Rather than trace and adjust all the callers, we patch it now
with constexpr if.
operator bytes_view (in keys) is converted to operator
managed_bytes_view, allowing callers to defer or avoid
linearization.
This patch enables SELECT CQL statements where collection columns
are selected in queries whose clustering column is restricted by
the IN CQL operator. Such queries have been accepted by Cassandra
since v4.0.
The internals already provide correct support for this feature;
this patch simply removes the relevant CQL query check.
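A minimal sketch of a now-permitted query, with a hypothetical schema:

CREATE TABLE t (p int, c int, m map<int, int>, PRIMARY KEY (p, c));
SELECT m FROM t WHERE p = 0 AND c IN (1, 2);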
Tests: cql-pytest (testInRestrictionWithCollection)
Fixes #7743
Fixes #4251
Signed-off-by: Vojtech Havel <vojtahavel@gmail.com>
Message-Id: <20210104223422.81519-1-vojtahavel@gmail.com>
Functions for checking if the keyspace is system/internal were based
on sstring references, which is impractical compared to string views
and may lead to unnecessary creation of sstring instances.
First of all, the select statement is extended with an 'attrs' field,
which keeps the per-query attributes. Currently, only the TIMEOUT
parameter is legal to use, since TIMESTAMP and TTL bear no meaning
for reads.
Secondly, if the TIMEOUT attribute is set, it is used as the effective
timeout for that particular query.
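A brief example of the syntax (table and column names hypothetical):

SELECT * FROM t WHERE p = 0 USING TIMEOUT 50ms;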
"
The validate_column_family() helper uses the global proxy
reference to get the database from. Fortunately, all its callers
can provide one via an argument.
tests: unit(dev)
"
* 'br-no-proxy-in-validate' of https://github.com/xemul/scylla:
validation: Remove get_local_storage_proxy call
client_state: Call validate_column_family() with database arg
client_state: Add database& arg to has_column_family_access
storage_proxy: Add .local_db() getters
validate: Mark database argument const
It is called from cql3/statements' check_access methods and from thrift
handlers. The former have a proxy argument from which they can get the
database. The latter already have the database itself on board.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, each internal page fetched during aggregation
gets a timeout based on the time the page fetch was started,
rather than the query start time. This means the query can
continue processing long after the client has abandoned it
due to its own timeout, which is based on the query start time.
Fix by establishing the timeout once when the query starts and
not advancing it.
Test: manual (SELECT count(*) FROM a large table).
Fixes #1175.
Closes #7662
Get() the latest token_metadata_ptr from the
shared_token_metadata before each use.
Expose get_token_metadata_ptr() rather than get_token_metadata()
so that the caller can keep it across continuations.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Switches token column computation to the (new) token_column_computation,
which fixes #7443: the new token column is compared using
signed comparisons, not the previous unsigned comparison of the CQL
bytes type.
This column computation type is only set if the cluster supports the
correct_idx_token_in_secondary_index feature, to make sure that all
nodes will be able to compute the (new) token_column_computation. Old
indexes will need to be rebuilt to take advantage of this fix, as the
new token column computation type is only set for new indexes.
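Assuming rebuilding an index means dropping and recreating it (index
and table names hypothetical), that would look like:

DROP INDEX ks.t_by_v;
CREATE INDEX t_by_v ON ks.t (v);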
Before this change, the internal page size used for aggregate, GROUP BY,
and non-paged filtering queries was hard-coded to DEFAULT_COUNT_PAGE_SIZE.
This made testing hard (timeouts in debug builds), because the tests had
to be large to cover cases with multiple internal pages.
This change adds a new internal_paging_size variable, configurable
via the set_internal_paging_size and reset_internal_paging_size
free functions. This functionality is meant for testing purposes only.
Fixes two issues related to improper paging on indexed SELECTs. As those
two issues are closely related (fixing one without fixing the other
causes invalid results of queries), they are in a single commit.
The first issue is that when using slice.set_range, the existing
_row_ranges (which specify clustering key prefixes) are not taken into
account. This caused the wrong rows to be included in the result, as the
clustering key bound was set to a half-open range:
CREATE TABLE ks.t(a int, b int, c int, PRIMARY KEY ((a, b), c));
CREATE INDEX kst_index ON ks.t(c);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 3);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 4);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 5);
SELECT COUNT(*) FROM ks.t WHERE c = 3;
count
-------
2
(the expected result here is 1)
This change fixes this issue by properly trimming row_ranges.
The second fixed problem is related to setting the paging_state
in internal_options. It was improperly set just after reading from
the index, making the base query start from an invalid paging_state.
This change fixes this by setting the paging_state after both the
index and base table queries are done. Moreover, the paging_state is
now set based on the paging_state of the index query and the results
of the base table query (as the base query can return more rows than
the index query).
Fixes the first two failing examples from issue #7355.
Before the change, GROUP BY SELECTs with some WHERE restrictions on an
indexed column would return invalid results (same grouped column values
appearing multiple times):
CREATE TABLE ks.t(pk int, ck int, v int, PRIMARY KEY(pk, ck));
CREATE INDEX ks_t on ks.t(v);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 2, 3);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 4, 3);
SELECT pk FROM ks.t WHERE v=3 GROUP BY pk;
pk
----
1
1
This is fixed by correctly passing _group_by_cell_indices to
result_set_builder. Fixes the third failing example from issue #7355.