mirror of
https://github.com/scylladb/scylladb.git
synced 2026-05-22 07:42:16 +00:00
Add user-facing documentation for the new CQL per-row TTL feature, in docs/cql/cql-extensions.md. Also mention (and link) the new alternative TTL feature in a few relevant documents about the old (per-write) TTL, about CDC, and about the CREATE TABLE and ALTER TABLE commands. Signed-off-by: Nadav Har'El <nyh@scylladb.com>
484 lines
19 KiB
Markdown
# ScyllaDB CQL Extensions

ScyllaDB extends the CQL language to provide a few extra features. This document
lists those extensions.

## BYPASS CACHE clause

The `BYPASS CACHE` clause on `SELECT` statements informs the database that the data
being read is unlikely to be read again in the near future, and also
was unlikely to have been read in the near past; therefore no attempt
should be made to read it from the cache or to populate the cache with
the data. This is mostly useful for range scans; these typically
process large amounts of data with no temporal locality and do not
benefit from the cache.

The clause is placed immediately after the optional `ALLOW FILTERING`
clause:

    SELECT ... FROM ...
    WHERE ...
    ALLOW FILTERING          -- optional
    BYPASS CACHE

## "Paxos grace seconds" per-table option

The `paxos_grace_seconds` option sets the number of seconds used as the TTL
for data in Paxos tables when LWT queries are run against the base table.

This value is intentionally decoupled from `gc_grace_seconds` since,
in general, the base table could use a completely different strategy to garbage
collect entries, e.g. it can set `gc_grace_seconds` to 0 if it doesn't use
deletions and hence doesn't need to repair.

However, Paxos tables still rely on repair to achieve consistency, and
the user is required to execute repair within `paxos_grace_seconds`.

The default value is equal to `DEFAULT_GC_GRACE_SECONDS`, which is 10 days.

The option can be specified in `CREATE TABLE` or `ALTER TABLE` statements in the same
way as other options, by using the `WITH` clause:

    CREATE TABLE tbl ...
    WITH paxos_grace_seconds=1234

## USING TIMEOUT

The `TIMEOUT` extension allows specifying per-query timeouts. This parameter accepts a single
duration and applies it as a timeout specific to a single particular query.
The parameter is supported for prepared statements as well.
The parameter acts as part of the `USING` clause, and thus can be combined with other
parameters, like timestamps and time-to-live.
For example, one can use `USING TIMEOUT ... AND TTL ...` to specify both a non-default timeout and a TTL.

Examples:
```cql
SELECT * FROM t USING TIMEOUT 200ms;
```
```cql
INSERT INTO t(a,b,c) VALUES (1,2,3) USING TIMESTAMP 42 AND TIMEOUT 50ms;
```
```cql
TRUNCATE TABLE t USING TIMEOUT 5m;
```

Prepared statements work as usual: the timeout parameter can be
explicitly defined or provided as a marker:

```cql
SELECT * FROM t USING TIMEOUT ?;
```
```cql
INSERT INTO t(a,b,c) VALUES (?,?,?) USING TIMESTAMP 42 AND TIMEOUT 50ms;
```

The timeout parameter can be applied to the following data modification queries:
INSERT, UPDATE, DELETE, PRUNE MATERIALIZED VIEW, BATCH,
and to the TRUNCATE data definition query.

The timeout parameter can also be applied to SELECT queries.

## PRUNE MATERIALIZED VIEW statements

A special statement is dedicated to pruning ghost rows from materialized views.
A ghost row is an inconsistency that manifests itself as a row
in a materialized view which does not correspond to any base table row.
Such inconsistencies should be prevented altogether and ScyllaDB is striving to avoid
them, but *if* they happen, this statement can be used to restore a materialized view
to a fully consistent state without rebuilding it from scratch.

Example usages:
```cql
PRUNE MATERIALIZED VIEW my_view;
PRUNE MATERIALIZED VIEW my_view WHERE token(v) > 7 AND token(v) < 1535250;
PRUNE MATERIALIZED VIEW my_view WHERE v = 19;
```

The statement works by fetching requested rows from a materialized view
and then trying to fetch their corresponding rows from the base table.
If it turns out that the base row does not exist, the row is considered
a ghost row and is thus deleted. The statement implicitly works with
consistency level ALL when fetching from the base table to avoid false
positives. As the example shows, a materialized view can be pruned
in one go, but one can also restrict the statement to specific primary keys or token ranges,
which is recommended in order to make the operation less heavyweight
and allow for running multiple parallel pruning statements for non-overlapping
token ranges.

By default, the PRUNE MATERIALIZED VIEW statement is relatively slow, only
performing one base read or write at a time. This can be changed with the
USING CONCURRENCY clause. If the clause is used, the concurrency of reads
and writes from the base table will be allowed to increase up to the specified
value. For example, to run the PRUNE with 100 parallel reads/writes, you can use:
```cql
PRUNE MATERIALIZED VIEW my_view WHERE v = 19 USING CONCURRENCY 100;
```

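The advice above about pruning non-overlapping token ranges in parallel can be sketched as a small statement generator. This is only an illustration under stated assumptions: `prune_statements` is a hypothetical helper, not part of ScyllaDB or any driver; it assumes the default Murmur3 partitioner (64-bit signed tokens, with the minimum value never assigned to a key), and it assumes the `WHERE` clause of `PRUNE MATERIALIZED VIEW` accepts `<=` in the same way it accepts `<`.

```python
# Token bounds of the Murmur3 partitioner (assumption: default partitioner).
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def prune_statements(view, key_column, pieces, concurrency=None):
    """Build PRUNE statements over `pieces` adjacent, non-overlapping token ranges.

    Hypothetical helper: it only formats CQL text, it does not execute anything.
    Each range is (lo, hi], so together the ranges cover (MIN_TOKEN, MAX_TOKEN].
    """
    if pieces < 1:
        raise ValueError("pieces must be at least 1")
    span = MAX_TOKEN - MIN_TOKEN
    # Integer arithmetic only, so the boundaries are exact.
    bounds = [MIN_TOKEN + span * i // pieces for i in range(pieces + 1)]
    suffix = f" USING CONCURRENCY {concurrency}" if concurrency else ""
    return [
        f"PRUNE MATERIALIZED VIEW {view} "
        f"WHERE token({key_column}) > {lo} AND token({key_column}) <= {hi}{suffix};"
        for lo, hi in zip(bounds, bounds[1:])
    ]


if __name__ == "__main__":
    for statement in prune_statements("my_view", "v", 4, concurrency=100):
        print(statement)
```

Each generated statement could then be submitted separately, so that a failure affects only one range and can be retried on its own.
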
## Synchronous materialized views

Usually, when a table with materialized views is updated, the update to the
views happens _asynchronously_, i.e., in the background. This means that
the user cannot know when the view updates have all finished - or even be
sure that they succeeded.

ScyllaDB allows marking a view as synchronous. When a view
is marked synchronous, base-table updates will wait for that view to be
updated before returning. A base table may have multiple views marked
synchronous, and will wait for all of them. The consistency level of a
write applies to synchronous views as well as to the base table: for
example, writing with QUORUM consistency level returns only after a
quorum of the base-table replicas was updated *and* a quorum of
each synchronous view table was updated as well.

Synchronous views tend to reduce the observed availability of the base table,
because a base-table write would only succeed if enough synchronous view
updates also succeed. On the other hand, failed view updates would be
detected immediately, and appropriate action can be taken, such as retrying
the write or pruning the materialized view (as mentioned in the previous
section). This can improve the consistency of the base table with its views.

To create a new materialized view with synchronous updates, use:

```cql
CREATE MATERIALIZED VIEW main.mv
  AS SELECT * FROM main.t
  WHERE v IS NOT NULL
  PRIMARY KEY (v, id)
  WITH synchronous_updates = true;
```

To make an existing materialized view synchronous, use:

```cql
ALTER MATERIALIZED VIEW main.mv WITH synchronous_updates = true;
```

To return a materialized view to the default behavior (which, as explained
above, _usually_ means asynchronous updates), use:

```cql
ALTER MATERIALIZED VIEW main.mv WITH synchronous_updates = false;
```

Even in an asynchronous view, _some_ view updates may be done synchronously.
This happens when the materialized-view replica is on the same node as the
base-table replica. This happens, for example, in tables using vnodes where
the base table and the view have the same partition key; but it is not the case
if the table uses tablets: with tablets, the base and view tablets may migrate
to different nodes. In general, users should not, and cannot, rely on these
serendipitous synchronous view updates; if synchronous view updates are
important, mark the view explicitly with `synchronous_updates = true`.

### Synchronous global secondary indexes

Synchronous updates can also be turned on for global secondary indexes.
At the time of writing this paragraph there is no direct syntax to do that,
but it's possible to mark the underlying materialized view of an index
as synchronous. ScyllaDB's implementation of secondary indexes is based
on materialized views. The generated view's name can be extracted
from the schema tables, and is generally constructed by appending the `_index`
suffix to the index name:

```cql
create table main.t(id int primary key, v int);
create index on main.t(v);

select * from system_schema.indexes ;

 keyspace_name | table_name | index_name | kind       | options
---------------+------------+------------+------------+-----------------
          main |          t |    t_v_idx | COMPOSITES | {'target': 'v'}

(1 rows)


select keyspace_name, view_name from system_schema.views ;

 keyspace_name | view_name
---------------+---------------
          main | t_v_idx_index

(1 rows)

alter materialized view t_v_idx_index with synchronous_updates = true;
```

Local secondary indexes already have synchronous updates, so there's no need
to explicitly mark them as such.

## Expressions

### NULL

Scylla aims for uniform handling of NULL values in expressions, inspired
by SQL: The overarching principle is that a NULL signifies an _unknown value_,
so most expressions calculated based on a NULL also result in a NULL.
For example, the results of `x + NULL`, `x = NULL` or `x < NULL` are all NULL,
no matter what `x` is. Even the expression `NULL = NULL` evaluates to NULL,
not TRUE.

But not all expressions of NULL evaluate to NULL. An interesting example
is boolean conjunction: `FALSE AND NULL` returns FALSE - not NULL. This is
because no matter which unknown value the NULL represents, ANDing it with
FALSE will always result in FALSE. So the return value is not unknown - it
is a FALSE. In contrast, `TRUE AND NULL` does return NULL, because if we AND
a TRUE with an unknown value the result is also unknown: `TRUE AND TRUE` is
TRUE but `TRUE AND FALSE` is FALSE.

Because `x = NULL` always evaluates to NULL, a `SELECT` filter `WHERE x = NULL`
matches no row (_matching_ means evaluating to TRUE). It does **not** match
rows where x is missing. If you really want to match rows with missing x,
SQL offers a different syntax, `x IS NULL` (and similarly, also `x IS NOT
NULL`), but Scylla does not yet implement this syntax.

In contrast, Cassandra is less consistent in its handling of nulls.
There, the example `x = NULL` is considered an error, not a valid expression
whose result is NULL.

The rules explained above apply to most expressions, in particular to `WHERE`
filters in `SELECT`. However, the evaluation rules for LWT IF clauses
(_conditional updates_) are _different_: an `IF x = NULL` condition succeeds
if `x` is unset. This non-standard behavior of NULLs in IF expressions may
be made configurable in a future version.

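The rules above for `WHERE` filters can be modelled in a few lines. The sketch below is not ScyllaDB code; it is a minimal Python model in which `None` stands for NULL, and the function names `and3`, `eq3` and `matches` are made up for illustration. It does not model the different rules that apply to LWT IF clauses.

```python
def and3(a, b):
    """Three-valued AND; None stands for NULL (an unknown value)."""
    if a is False or b is False:
        return False  # FALSE wins, whatever the unknown operand might be
    if a is None or b is None:
        return None   # otherwise an unknown operand makes the result unknown
    return True


def eq3(a, b):
    """Three-valued equality: comparing anything with NULL yields NULL."""
    if a is None or b is None:
        return None
    return a == b


def matches(condition):
    """A WHERE filter matches a row only if the condition evaluates to TRUE."""
    return condition is True


if __name__ == "__main__":
    print(and3(False, None))          # FALSE AND NULL
    print(and3(True, None))           # TRUE AND NULL
    print(matches(eq3(None, None)))   # WHERE x = NULL, with x missing
```

In this model `matches(eq3(x, None))` is false for every `x`, including `None`, which mirrors why `WHERE x = NULL` matches no row.
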
## `NULL` is valid input for LWT IF clause element access

The LWT IF clauses

```cql
IF some_map[:var] = 3
```

or

```cql
IF some_map[:var] != 3
```

are an error if `:var` is `NULL` on Cassandra, but are accepted by
Scylla. The result of the comparison, for both `=` and `!=`, is `FALSE`.

## `NULL` is valid input for LWT IF clause `LIKE` patterns

The LWT IF clause

```cql
IF some_column LIKE :pattern
```

is an error if `:pattern` is `NULL` on Cassandra, but is accepted by
Scylla. The result of the pattern match is `FALSE`.

For more details, see:
- [Lightweight Transactions](../features/lwt.rst)
- [How does ScyllaDB LWT Differ from Apache Cassandra?](../kb/lwt-differences.rst)

## REDUCEFUNC for UDA

The `REDUCEFUNC` extension adds an optional reduction function to a user-defined aggregate (UDA).
This makes it possible to speed up aggregation queries by distributing the calculations
to other nodes and reducing the partial results into a final one.
The reduction function has to be a scalar function with two arguments,
both of the same type as the UDA's state, and it must also return the state type.

```cql
CREATE FUNCTION row_fct(acc tuple<bigint, int>, val int)
  RETURNS NULL ON NULL INPUT
  RETURNS tuple<bigint, int>
  LANGUAGE lua
  AS $$
    return { acc[1]+val, acc[2]+1 }
  $$;

CREATE FUNCTION reduce_fct(acc tuple<bigint, int>, acc2 tuple<bigint, int>)
  RETURNS NULL ON NULL INPUT
  RETURNS tuple<bigint, int>
  LANGUAGE lua
  AS $$
    return { acc[1]+acc2[1], acc[2]+acc2[2] }
  $$;

CREATE FUNCTION final_fct(acc tuple<bigint, int>)
  RETURNS NULL ON NULL INPUT
  RETURNS double
  LANGUAGE lua
  AS $$
    return acc[1]/acc[2]
  $$;

CREATE AGGREGATE custom_avg(int)
  SFUNC row_fct
  STYPE tuple<bigint, int>
  REDUCEFUNC reduce_fct
  FINALFUNC final_fct
  INITCOND (0, 0);
```

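To see why a reduction function lets the work be split, the `custom_avg` aggregate above can be mirrored in plain Python. This is only a model of the arithmetic: how ScyllaDB actually partitions rows between nodes is not described here, so the `chunks` argument is a stand-in for whatever subsets the nodes end up processing, and NULL handling (`RETURNS NULL ON NULL INPUT`) is not modelled.

```python
from functools import reduce


def row_fct(acc, val):
    """State function: fold one value into the (sum, count) state."""
    return (acc[0] + val, acc[1] + 1)


def reduce_fct(acc, acc2):
    """Reduction function: merge two partial (sum, count) states."""
    return (acc[0] + acc2[0], acc[1] + acc2[1])


def final_fct(acc):
    """Final function: turn the (sum, count) state into an average."""
    return acc[0] / acc[1]


def partial_state(values, initcond=(0, 0)):
    """What a single node would compute over its share of the rows."""
    return reduce(row_fct, values, initcond)


def custom_avg_distributed(chunks):
    """Aggregate each chunk independently, then merge the partial states."""
    partials = [partial_state(chunk) for chunk in chunks]
    return final_fct(reduce(reduce_fct, partials, (0, 0)))


if __name__ == "__main__":
    values = [1, 2, 3, 4, 5, 6]
    print(final_fct(partial_state(values)))                   # single pass
    print(custom_avg_distributed([[1, 2], [3], [4, 5, 6]]))   # split in three
```

Both calls produce the same average because `reduce_fct` is associative and treats the initial condition `(0, 0)` as a neutral element, which is the property a reduction function needs for the split to be safe.
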
### Behavior of bind variable references with the same name

If a bind variable is referred to twice (example: `WHERE aa = :var AND bb = :var`; `:var`
is referenced twice), ScyllaDB and Cassandra treat it differently:

- Cassandra ignores the double reference and treats the two as two separate variables. They
  can have different types, and occupy two slots in the bind variable metadata (used by
  drivers when the user provides a bind variable tuple rather than a map).
- ScyllaDB treats the two references as referring to the same variable. The two references
  must have the same type, and occupy one slot in the bind variable metadata.

ScyllaDB can revert to the Cassandra treatment by setting the configuration item
`cql_duplicate_bind_variable_names_refer_to_same_variable` to `false`.

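The difference in metadata slots described above can be illustrated with a toy model. `bind_slots` is a hypothetical function written for this example, not a driver or server API, and the assumption that a shared variable keeps the position of its first reference is mine, not something stated above.

```python
def bind_slots(references, same_name_is_same_variable=True):
    """Return the bind-variable metadata slots for named references, in order.

    `references` lists the variable names in the order they appear in the
    statement. The flag mirrors the meaning of the configuration item
    cql_duplicate_bind_variable_names_refer_to_same_variable.
    """
    if not same_name_is_same_variable:
        # Cassandra-style: every reference gets its own slot.
        return list(references)
    # ScyllaDB-style: one slot per distinct name (first occurrence kept).
    return list(dict.fromkeys(references))


if __name__ == "__main__":
    refs = ["var", "var"]  # WHERE aa = :var AND bb = :var
    print(bind_slots(refs))         # ScyllaDB default
    print(bind_slots(refs, False))  # Cassandra treatment
```

This matters mostly to applications that pass bind values positionally: under the ScyllaDB treatment the statement in the example takes one value, under the Cassandra treatment it takes two.
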
### List elements for filtering

Subscripting a list in a `WHERE` clause is supported, just as it is for maps.

```cql
WHERE some_list[:index] = :value
```

## Per-partition rate limit

The `per_partition_rate_limit` option can be used to limit the allowed
rate of requests to each partition in a given table. When the cluster detects
that the rate of requests exceeds the configured limit, the cluster will start
rejecting some of them in order to bring the throughput back to the configured
limit. Rejected requests are less costly, which can help reduce overload.

_NOTE_: Due to ScyllaDB's distributed nature, tracking per-partition request rates
is not perfect and the actual rate of accepted requests may be higher by up to
a factor of the keyspace's `RF`. This feature should not be used to enforce precise
limits but rather serve as an overload protection feature.

_NOTE_: This feature works best when shard-aware drivers are used (rejected
requests have the least cost).

Limits are configured separately for reads and writes. Some examples:

```cql
ALTER TABLE t WITH per_partition_rate_limit = {
    'max_reads_per_second': 100,
    'max_writes_per_second': 200
};
```

Limit reads only, no limit for writes:
```cql
ALTER TABLE t WITH per_partition_rate_limit = {
    'max_reads_per_second': 200
};
```

Rejected requests receive the ScyllaDB-specific "Rate limit exceeded" error.
If the driver doesn't support it, `Config_error` will be sent instead.

For more details, see:

- Detailed [design notes](https://github.com/scylladb/scylla/blob/master/docs/dev/per-partition-rate-limit.md)
- Description of the [rate limit exceeded](https://github.com/scylladb/scylla/blob/master/docs/dev/protocol-extensions.md#rate-limit-error) error

## Effective service level

The actual values of a service level's options may come from different service levels, not only from the one assigned to the user.

To show which values come from which service level, use the `LIST EFFECTIVE SERVICE LEVEL OF <role_name>` command:
```cql
> LIST EFFECTIVE SERVICE LEVEL OF role2;

 service_level_option | effective_service_level | value
----------------------+-------------------------+-------------
        workload_type |                     sl2 | batch
              timeout |                     sl1 | 2s
```

For more details, check the [Service Levels docs](https://github.com/scylladb/scylla/blob/master/docs/cql/service-levels.rst).

## DESCRIBE SCHEMA WITH INTERNALS [AND PASSWORDS]

We extended the semantics of `DESCRIBE SCHEMA WITH INTERNALS`: aside from describing the elements of the schema,
it also describes authentication/authorization and service levels. Additionally, we introduced a new tier of the
statement: `DESCRIBE SCHEMA WITH INTERNALS AND PASSWORDS`, which also includes the information about hashed passwords of the roles.

For more details, see [the article on DESCRIBE SCHEMA](./describe-schema.rst).

## Per-row TTL

CQL's traditional time-to-live (TTL) feature attaches an expiration time to
each cell - i.e., each value in each column. For example, the statement:
```cql
UPDATE tbl USING TTL 60 SET x = 1 WHERE p = 2
```
sets a new value for the column `x` in row `p = 2`, and asks for this value to
expire in 60 seconds. When a row is updated incrementally, with different
columns set at different times, this can result in different pieces of the
row expiring at different times. Applications rarely want partially-expired
rows, so they often need to re-write an entire row each time the row needs
updating. In particular, it is not possible to change the expiration time of
an existing row without re-writing it.

Per-row time-to-live (TTL) is a new CQL feature that is an alternative to
the traditional per-cell TTL. One column is designated as the "expiration
time" column, and the value of this column determines when the entire row
will expire. It becomes possible to update pieces of a row without changing
its expiration time, and vice versa - to change a row's expiration time
without rewriting its data.

The expiration-time column of a table can be chosen when it is created by
adding the keyword "TTL" to one of the columns:
```cql
CREATE TABLE tab (
    id int PRIMARY KEY,
    t text,
    expiration timestamp TTL
);
```
The TTL column's name, in this example `expiration`, can be anything.

Per-row TTL can also be enabled on an existing table by adding the "TTL"
designation to one of the existing columns, with:
```cql
ALTER TABLE tab TTL colname
```
Or per-row TTL can be disabled (rows will never expire), with:
```cql
ALTER TABLE tab TTL NULL
```

It is not possible to enable per-row TTL if it's already enabled, or disable
it when already disabled. If you have TTL enabled on one column and want to
enable it instead on a second column, you must do it in two steps: First
disable TTL and then re-enable it on the second column.


The designated TTL column must have the type `timestamp` or `bigint`,
and specifies the absolute time when the row should expire (the `bigint`
type is interpreted as seconds since the UNIX epoch). It must be a regular
column (not a primary key column or a static column), and there can only be
one such column.

The 32-bit type `int` (specifying number of seconds since the UNIX epoch)
is also supported, but not recommended because it will wrap around in 2038.
Unless you must use the `int` type because of pre-existing expiration data
with that type, please prefer `timestamp` or `bigint`.

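Because the TTL column holds an absolute time rather than a duration, an application has to compute "now plus the desired lifetime" itself. The following sketch shows that arithmetic for a `bigint` or `int` column, and why `int` stops working in 2038. The helper names are made up for this example; how the value is then written depends on the driver in use and is not shown.

```python
from datetime import datetime, timedelta, timezone

INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def expiration_epoch_seconds(now, ttl):
    """Absolute expiration time as whole seconds since the UNIX epoch."""
    return int((now + ttl).timestamp())


def fits_int32(epoch_seconds):
    """True if the value can be stored in a 32-bit `int` column."""
    return INT32_MIN <= epoch_seconds <= INT32_MAX


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    expires = expiration_epoch_seconds(now, timedelta(days=30))
    print(expires, fits_int32(expires))
    # The last instant representable in a 32-bit column:
    print(datetime.fromtimestamp(INT32_MAX, tz=timezone.utc))
```

Using timezone-aware `datetime` values avoids accidentally interpreting a local time as UTC when converting to epoch seconds.
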
Another important feature of per-row TTL is that if CDC is enabled, when a
row expires a deletion event appears in the CDC log - something that doesn't
happen in per-cell TTL. This deletion event can be distinguished from
user-initiated deletes: Whereas user-initiated deletes have `cdc_operation` set to
3 (`row_delete`) or 4 (`partition_delete`), those generated by expiration have
`cdc_operation` -3 (`service_row_delete`) or -4 (`service_partition_delete`).

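A CDC consumer that needs to tell expirations apart from user deletes could branch on the operation codes listed above. The sketch below uses only those four codes; `classify_delete` is a hypothetical helper, and reading the value out of the CDC log row is left to the consuming application.

```python
# Operation codes taken from the paragraph above.
USER_DELETES = {3: "row_delete", 4: "partition_delete"}
EXPIRATION_DELETES = {-3: "service_row_delete", -4: "service_partition_delete"}


def classify_delete(cdc_operation):
    """Return (origin, operation_name) for a delete event, or None otherwise.

    `origin` is "user" for user-initiated deletes and "expiration" for
    deletes generated by per-row TTL.
    """
    if cdc_operation in USER_DELETES:
        return ("user", USER_DELETES[cdc_operation])
    if cdc_operation in EXPIRATION_DELETES:
        return ("expiration", EXPIRATION_DELETES[cdc_operation])
    return None  # not one of the delete operations discussed here


if __name__ == "__main__":
    for op in (3, -3, 4, -4, 1):
        print(op, classify_delete(op))
```
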
Unlike per-cell TTL where a value becomes unreadable at the precise specified
second, the per-row TTL's expiration is _eventual_ - the row will expire
some time _after_ its requested expiration time, where this "some time" can
be controlled by the configuration `alternator_ttl_period_in_seconds`. Until
the row is actually deleted, it can still be read, and even written.
Importantly, the CDC event will appear immediately after the row is finally
deleted.

It's important to reiterate that the per-cell TTL and per-row TTL features
are separate and distinct: they use different CQL syntax, have different
implementations and provide different guarantees. It is possible to use
both features in the same table, or even the same row.