Commit Graph

642 Commits

Author SHA1 Message Date
Calle Wilund
a079c3dbbe alternator::streams: Special case single table in list_streams
Avoid iterating all tables (at least multiple times).
2023-01-24 09:14:33 +00:00
Calle Wilund
9412d8f259 alternator::streams: Only sort tables iff limit < # tables or ExclusiveStartStreamArn set
Avoid sorts for request that will be answered immediately.
2023-01-24 08:48:20 +00:00
Calle Wilund
9886788a46 alternator::streams: Set default list_streams limit to 100 as per spec
AWS docs says so.
2023-01-24 08:24:42 +00:00
Calle Wilund
da8adb4d26 alterator::streams: Sort tables in list_streams to ensure no duplicates
Fixes #12601 (maybe?)

Sort the set of tables on ID. This should ensure we never
generate duplicates in a paged listing here. Can obviously miss things if they
are added between paged calls and end up with a "smaller" UUID/ARN, but that
is to be expected.
2023-01-23 11:41:40 +00:00
Marcin Maliszkiewicz
7230841431 alternator: unify json streaming heuristic
Main assumption here is that if is_big is good enough for
GetBatchItems operation it should work well also for Scan,
Query and GetRecords. And it's easier to maintain more unified
code.

Additionally 'future<> print' documentation used for streaming
suggests that there is quite big overhead so since it seems the
only motivation for streaming was to reduce contiguous allocation
size below some threshold we should not stream when this threshold
is not exceeded.

Closes #12164
2023-01-19 16:40:43 +02:00
Marcin Maliszkiewicz
4c33791f96 alternator: eliminate regexes from the hot path
This decreases the whole alternator::get_table cpu time by 78%
(from 2.8 us to 0.6 us on my cpu).

In perf_simple_query it decreases allocs/op by 1.6% (by removing 4 allocations)
and increases median tps by 3.4%.

Raw results from running:

./build/release/test/perf/perf_simple_query_g --smp 1 \
         --alternator forbid --default-log-level error \
         --random-seed=1235000092 --duration=180 --write

Before the patch:

median 46903.65 tps (197.2 allocs/op,  12.1 tasks/op,  170886 insns/op, 0 errors)
median absolute deviation: 210.15
maximum: 47354.59
minimum: 42535.63

After the patch:

median 48484.76 tps (194.1 allocs/op,  12.1 tasks/op,  168512 insns/op, 0 errors)
median absolute deviation: 317.32
maximum: 49247.69
minimum: 44656.38

Closes #12445
2023-01-19 13:23:24 +02:00
Marcin Maliszkiewicz
6f055ca5f9 alternator: evaluate expressions as false for stored malformed binary
data

We'll try to distinguish the case when data comes from the storage rather
than user reuqest. Such attribute can be used in expressions and
when it can't be decoded it should make expression evaluate as
false to simply exclude the row during filter query or scan.

Note that this change focuses on binary type, for other types we
may have some inconsistencies in the implementation.
2023-01-16 15:15:27 +01:00
Marcin Maliszkiewicz
bcbaccc143 rjson: avoid copy constructors in from_string calls when possible
This function anyway copies the value so no need to do extra copy.
2023-01-16 15:15:26 +01:00
Marcin Maliszkiewicz
668fffb6c5 alternator: remove unused parameters from describe_items func 2023-01-16 14:36:23 +01:00
Avi Kivity
2739ac66ed treewide: drop cql_serialization_format
Now that we don't accept cql protocol version 1 or 2, we can
drop cql_serialization format everywhere, except when in the IDL
(since it's part of the inter-node protocol).

A few functions had duplicate versions, one with and one without
a cql_serialization_format parameter. They are deduplicated.

Care is taken that `partition_slice`, which communicates
the cql_serialization_format across nodes, still presents
a valid cql_serialization_format to other nodes when
transmitting itself and rejects protocol 1 and 2 serialization\
format when receiving. The IDL is unchanged.

One test checking the 16-bit serialization format is removed.
2023-01-03 19:54:13 +02:00
Botond Dénes
6daa1e973f Merge 'alternator: fix hangs related to TTL scanning' from Nadav Har'El
The first patch in this small series fixes a hang during shutdown when the expired-item scanning thread can hang in a retry loop instead of quitting.  These hangs were seen in some test runs (issue #12145).

The second patch is a failsafe against additional bugs like those solved by the first patch: If any bugs causes the same page fetch to repeatedly time out, let's stop the attempts after 10 retries instead of retrying for ever. When we stop the retries, a warning will be printed to the log, Scylla will wait until the next scan period and start a new scan from scratch - from a random position in the database, instead of hanging potentially-forever waiting for the same page.

Closes #12152

* github.com:scylladb/scylladb:
  alternator ttl: in scanning thread, don't retry the same page too many times
  alternator: fix hang during shutdown of expiration-scanning thread
2022-12-06 06:44:22 +02:00
Pavel Emelyanov
4c6bfc078d code: Use http::re(quest|ply) instead of httpd:: ones
Recent seastar update deprecated those from httpd namespace.

fixes: #12142

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #12161
2022-12-01 17:33:35 +02:00
Nadav Har'El
5eda8ce4fd alternator ttl: in scanning thread, don't retry the same page too many times
Since fixing issue #11737, when the expiration scanner times out reading
a page of data, it retries asking for the same page instead of giving up
on the scan and starting anew later. This retry was infinite - which can
cause problems if we have a bug in the code or several nodes down, which
can lead to getting hung in the same place in the scan for a very long
(potentially infinite) time without making any progress.

An example of such a bug was issue #12145, where we forgot to handle
shutdowns, so on shutdown of the cluster we just hung forever repeating
the same request that will never succeed. It's better in this case to
just give up on the current scan, and start it anew (from a random
position) later.

Refs #12145 (that issue was already fixed, by a different patch which
stops the iteration when shutting down - not waiting for an infinite
number of iterations and not even one more).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-11-30 18:42:37 +02:00
Nadav Har'El
d08eef5a30 alternator: fix hang during shutdown of expiration-scanning thread
The expiration-scanning thread is a long-running thread which can scan
data for hours, but checks for its abort-source before fetching each
page to allow for timely shutdown. Recently, we added the ability to
retry the page fetching in case of timeout, for forgot to check the
abort source in this new retry loop - which lead to an infinitely-long
shutdown in some tests while the retry loop retries forever.

In this patch we fix this bug by using sleep_abortable() instead of
sleep(). sleep_abortable() will throw an exception if the abort source
was triggered before or during the sleep - and this exception will
stop the scan immediately.

Fixes #12145

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-11-30 18:38:17 +02:00
Nadav Har'El
1e59c3f9ef alternator: if TTL scan times out, continue immediately
The Alternator TTL expiration scanner scans an entire table using many
small pages. If any of those pages time out for some reason (e.g., an
overload situation), we currently consider the entire scan to have failed
and wait for the next scan period (which by default is 24 hours) when
we start the scan from scratch (at a random position). There is a risk
that if these timeouts are common enough to occur once or more per
scan, the result is that we double or more the effective expiration lag.

A better solution, done in this patch, is to retry from the same position
if a single page timed out - immediately (or almost immediately, we add
a one-second sleep).

Fixes #11737

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12092
2022-11-28 11:30:00 +02:00
Marcin Maliszkiewicz
2bf2ffd3ed alternator: add maybe_quote to secondary indexes 'where' condition
This bug doesn't affect anything, the reason is descibed in the commit:
'alternator: fix wrong 'where' condition for GSI range key'.

But it's theoretically correct to escape those key names and
the difference can be observed via CQL's describe table. Before
the patch 'where' condition is missing one double quote in variable
name making it mismatched with corresponding column name.
2022-11-22 11:08:23 +01:00
Marcin Maliszkiewicz
d6d20134de alternator: fix wrong 'where' condition for GSI range key
This bug doesn't manifest in a visible way to the user.

Adding the index to an existing table via GlobalSecondaryIndexUpdates is not supported
so we don't need to consider what could happen for empty values of index range key.
After the index is added the only interesting value user can set is omitting
the value (null or empty are not allowed, see test_gsi_empty_value and
test_gsi_null_value).

In practice no matter of 'where' condition the underlaying materialized
view code is skipping row updates with missing keys as per this comment:
'If one of the key columns is missing, set has_new_row = false
meaning that after the update there will be no view row'.

Thats why the added test passes both before and after the patch.
But it's still usefull to include it to exercise those code paths.

Fixes #11800
2022-11-22 11:08:23 +01:00
Botond Dénes
f1a039fc2b treewide: use ::for_partition_start() instead of ::partition_start_tag_t{}
We just added a convenience static factory method for partition start,
change the present users of the clunky constructor+tag to use it
instead.
2022-11-11 09:58:18 +02:00
Avi Kivity
f0643d1713 alternator: ttl: do not copy mutation while constructing a vector
The vector(initializer_list<T>) constructor copies the T since
initializer_list is read-only. Move the mutation instead.

This happens to fix a use-after-return on clang 15 on aarch64.
I'm fairly sure that's a miscompile, but the fix is worthwhile
regardless.

Closes #11818
2022-10-21 10:04:00 +03:00
Nadav Har'El
264f453b9d Merge 'Associate alternator user with its service level configuration' from Piotr Sarna
Until now, authentication in alternator served only two purposes:
 - refusing clients without proper credentials
 - printing user information with logs

After this series, this user information is passed to lower layers, which also means that users are capable of attaching service levels to roles, and this service level configuration will be effective with alternator requests.

tests: manually by adding more debug logs and inspecting that per-service-level timeout value was properly applied for an authenticated alternator user

Fixes #11379

Closes #11380

* github.com:scylladb/scylladb:
  alternator: propagate authenticated user in client state
  client_state: add internal constructor with auth_service
  alternator: pass auth_service and sl_controller to server
2022-10-19 23:27:48 +03:00
Alexander Turetskiy
636e14cc77 Alternator: Projection field added to return from DescribeTable which describes GSIs and LSIs.
The return from DescribeTable which describes GSIs and LSIs is missing
the Projection field. We do not yet support all the settings Projection
(see #5036), but the default which we support is ALL, and DescribeTable
should return that in its description.

Fixes #11470

Closes #11693
2022-10-19 19:01:08 +03:00
Nadav Har'El
4a7794fb64 alternator: better error message when adding a GSI to an existing table
Due to issue #11567, Alternator do not yet support adding a GSI to an
existing table via UpdateTable with the GlobalSecondaryIndexUpdates
parameter.

However, currently, we print a misleading error message in this case,
complaining about the AttributeDefinitions parameter. This parameter
is also required with GlobalSecondaryIndexUpdates, but it's not the
main problem, and the user is likely to be confused why the error message
points to that specific paramter and what it means that this parameter
is claimed to be "not supported" (while it is supported, in CreateTable).
With this patch, we report that GlobalSecondaryIndexUpdates is not
supported.

This patch does not fix the unsupported feature - it just improves
the error message saying that it's not supported.

Refs #11567

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11650
2022-09-29 09:00:31 +03:00
Piotr Sarna
481240b8b4 Merge 'Alternator: Run more TTL tests by default (and add a test for metrics)' from Nadav Har'El
We had quite a few tests for Alternator TTL in test/alternator, but most
of them did not run as part of the usual Jenkins test suite, because
they were considered "very slow" (and require a special "--runveryslow"
flag to run).

In this series we enable six tests which run quickly enough to run by
default, without an additional flag. We also make them even quicker -
the six tests now take around 2.5 seconds.

I also noticed that we don't have a test for the Alternator TTL metrics
- and added one.

Fixes #11374.
Refs https://github.com/scylladb/scylla-monitoring/issues/1783

Closes #11384

* github.com:scylladb/scylladb:
  test/alternator: insert test names into Scylla logs
  rest api: add a new /system/log operation
  alternator ttl: log warning if scan took too long.
  alternator,ttl: allow sub-second TTL scanning period, for tests
  test/alternator: skip fewer Alternator TTL tests
  test/alternator: test Alternator TTL metrics
2022-09-22 09:47:50 +02:00
Piotr Sarna
5597bc8573 Merge 'Alternator: test and fix crashes and errors...
when using ":attrs" attribute' from Nadav Har'El

This PR improves the testing for issue #5009 and fixes most of it (but
not all - see below). Issue #5009 is about what happens when a user
tries to use the name `:attrs` for an attribute - while Alternator uses
a map column with that name to hold all the schema-less attributes of an
item.  The tests we had for this issue were partial, and missed the
worst cases which could result in Scylla crashing on specially-crafted
PutItem or UpdateItem requests.

What the tests missed were the cases that `:attrs` is used as a
**non-key**. So in this PR we add additional tests for this case,
several of them fail or even crash Scylla, and then we fix all these
cases.

Issue #5009 remains open because using `:attrs` as the name of a **key**
is still not allowed. But because it results in a clean error message
when attempting to create a table with such a key, I consider this
remaining problem very minor.

Refs #5009.

Closes #11572

* github.com:scylladb/scylladb:
  alternator: fix crashes an errors when using ":attrs" attribute
  alternator: improve tests for reserved attribute name ":attrs"
2022-09-19 09:48:06 +02:00
Nadav Har'El
999ca2d588 alternator: fix crashes an errors when using ":attrs" attribute
Alternator uses a single column, a map, with the deliberately strange
name ":attrs", to hold all the schema-less attributes of an item.
The existing code is buggy when the user tries to write to an attribute
with this strange name ":attrs". Although it is extremely unlikely that
any user would happen to choose such a name, it is nevertheless a legal
attribute name in DynamoDB, and should definitely not cause Scylla to crash
as it does in some cases today.

The bug was caused by the code assuming that to check whether an attribute
is stored in its own column in the schema, we just need to check whether
a column with that name exists. This is almost true, except for the name
":attrs" - a column with this name exists, but it is a map - the attribute
with that name should be stored *in* the map, not as the map. The fix
is to modify that check to special-case ":attrs".

This fix makes the relevant tests, which used to crash or fail, now pass.

This fix solves most of #5009, but one point is not yet solved (and
perhaps we don't need to solve): It is still not allowed to use the
name ":attrs" for a **key** attribute. But trying to do that fails cleanly
(during the table creation) with an appropriate error message, so is only
a very minor compatibility issue.

Refs #5009

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-19 10:30:11 +03:00
Nadav Har'El
33e6a88d9a alternator ttl: comment fixes
This patch fixes a few errors and out-of-date descriptions in comments
in alternator/ttl.cc. No functional changes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-15 00:03:43 +03:00
Nadav Har'El
b9792ffb06 alternator ttl: log warning if scan took too long.
Currently, we log at "info" level how much time remained at the end of
a full TTL scan until the next scanning period (we sleep for that time).
If the scan was slower than the period, we didn't print anything.

Let's print a warning in this case - it can be useful for debugging,
and also users should know when their desired scan period is not being
honored because the full scan is taking longer than the desired scan
period.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-12 10:32:56 +03:00
Nadav Har'El
e7e9adc519 alternator,ttl: allow sub-second TTL scanning period, for tests
Alternator has the "alternator_ttl_period_in_seconds" parameter for
controlling how often the expiration thread looks for expired items to
delete. It is usually a very large number of seconds, but for tests
to finish quickly, we set it to 1 second.
With 1 second expiration latency, test/alternator/test_ttl.py took 5
seconds to run.

In this patch, we change the parameter to allow a  floating-point number
of seconds instead of just an integer. Then, this allows us to halve the
TTL period used by tests to 0.5 seconds, and as a result, the run time of
test_ttl.py halves to 2.5 seconds. I think this is fast enough for now.

I verified that even if I change the period to 0.1, there is no noticable
slowdown to other Alternator tests, so 0.5 is definitely safe.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-12 10:32:56 +03:00
Piotr Sarna
2379a25ade alternator: propagate authenticated user in client state
From now on, when an alternator user correctly passed an authentication
step, their assigned client_state will have that information,
which also means proper access to service level configuration.
Previously the username was only used in tracing.
2022-09-05 10:43:29 +02:00
Piotr Sarna
9511c21686 alternator: pass auth_service and sl_controller to server
It's going to be needed to recreate a client state for an authenticated
user.
2022-09-05 10:03:00 +02:00
Nadav Har'El
941c719a23 alternator: return ProvisionedThroughput in DescribeTable
DescribeTable is currently hard-coded to return PAY_PER_REQUEST billing
mode. Nevertheless, even in PAY_PER_REQUEST mode, the DescribeTable
operation must return a ProvisionedThroughput structure, listing both
ReadCapacityUnits and WriteCapacityUnits as 0. This requirement is not
stated in some DynamoDB documentation but is explictly mentioned in
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_ProvisionedThroughput.html
Also in empirically, DynamoDB returns ProvisionedThroughput with zeros
even in PAY_PER_REQUEST mode. We even had an xfailing test to confirm this.

The ProvisionedThroughput structure being missing was a problem for
applications like DynamoDB connectors for Spark, if they implicitly
assume that ProvisionedThroughput is returned by DescribeTable, and
fail (as described in issue #11222) if it's outright missing.

So this patch adds the missing ProvisionedThroughput structure, and
the xfailing test starts to pass.

Note that this patch doesn't change the fact that attempting to set
a table to PROVISIONED billing mode is ignored: DescribeTable continues
to always return PAY_PER_REQUEST as the billing mode and zero as the
provisioned capacities.

Fixes #11222

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11298
2022-08-22 09:58:09 +02:00
Botond Dénes
d1d53f1b84 query: add tombstone-limit to read-command
Propagate the tombstone-limit from coordinator to replicas, to make sure
all is using the same limit.
2022-08-10 06:01:47 +03:00
Benny Halevy
c71ef330b2 query-request, everywhere: define and use query_id as a strong type
Define query_id as a tagged_uuid
So it can be differentiated from other uuid-class types.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:13:28 +03:00
Benny Halevy
257d74bb34 schema, everywhere: define and use table_id as a strong type
Define table_id as a distinct utils::tagged_uuid modeled after raft
tagged_id, so it can be differentiated from other uuid-class types,
in particular from table_schema_version.

Fixes #11207

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:09:41 +03:00
Kamil Braun
ab946e392f alternator: ttl: pass gossiper& to expiration_service
This allows us to remove the `gossiper()` getter from `storage_proxy`.
2022-08-04 12:12:43 +02:00
Michał Sala
041cb77ad0 alternator, db: move the tag code to db/tags
Tags are a useful mechanism that could be used outside of alternator
namespace. My motivation to move tags_extension and other utilities to
db/tags/ was that I wanted to use them to mark "synchronous mode" views.

I have extracted `get_tags_of_table`, `find_tag` and `update_tags`
method to db/tags/utils.cc and moved alternator/tags_extension.hh to
db/tags/.

The signature of `get_tags_of_table` was changed from `const
std::map<sstring, sstring>&` to `const std::map<sstring, sstring>*`
Original behavior of this function was to throw an
`alternator::api_error` exception. This was undesirable, as it
introduced a dependency on the alternator module. I chose to change it
to return a potentially null value, and added a wrapper function to the
alternator module - `get_tags_of_table_or_throw` to keep the previous
throwing behavior.
2022-07-25 09:53:33 +02:00
Benny Halevy
acae3cc223 treewide: stop use of deprecated coroutine::make_exception
Convert most use sites from `co_return coroutine::make_exception`
to `co_await coroutine::return_exception{,_ptr}` where possible.

In cases this is done in a catch clause, convert to
`co_return coroutine::exception`, generating an exception_ptr
if needed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10972
2022-07-07 15:02:16 +03:00
Botond Dénes
2b6eeadc07 alternator: use position-in-partition in paging cookie only when reading CQL tables
Recently, we added full position-in-partition support to alternator's
paging cookie, so it can support stopping at arbitrary positions. This
support however is only really needed when tables have range tombstones
and alternator tables never have them. So to avoid having to make the
new fields in 'ExclusiveStartKey' reserved, we avoid filling these in
when reading an alternator table, as in this case it is safe to assume
the position is `after_key($clustring_key)`. We do include these new
members however when reading CQL tables through alternator. As this is
only supported for system tables, we can also be sure that the elaborate
names we used for these fields are enough to avoid naming clashes.
The condition in the code implementing this is actually even more
general: it only includes the region/weight members when the position
differs from that of a normal alternator one.
2022-06-30 15:10:30 +03:00
Botond Dénes
52058ea974 alternator: make is_alternator_keyspace() a standalone method 2022-06-30 14:18:29 +03:00
Avi Kivity
3131cbea62 Merge 'query: allow replica to provide arbitrary continue position' from Botond Dénes
Currently, we use the last row in the query result set as the position where the query is continued from on the next page. Since only live rows make it into query result set, this mandates the query to be stopped on a live row on the replica, lest any dead rows or tombstones processed after the live rows, would have to be re-processed on the next page (and the saved reader would have to be thrown away due to position mismatch). This requirement of having to stop on a live row is problematic with datasets which have lots of dead rows or tombstones, especially if these form a prefix. In the extreme case, a query can time out before it can process a single live row and the data-set becomes effectively unreadable until compaction gets rid of the tombstones.
This series prepares the way for the solution: it allows the replica to determine what position the query should continue from on the next page. This position can be that of a dead row, if the query stopped on a dead row. For now, the replica supplies the same position that would have been obtained with looking at the last row in the result set, this series merely introduces the infrastructure for transferring a position together with the query result, and it prepares the paging logic to make use of this position. If the coordinator is not prepared for the new field, it will simply fall-back to the old way of looking at the last row in the result set. As I said for now this is still the same as the content of the new field so there is no problem in mixed clusters.

Refs: https://github.com/scylladb/scylla/issues/3672
Refs: https://github.com/scylladb/scylla/issues/7689
Refs: https://github.com/scylladb/scylla/issues/7933

Tests: manual upgrade test.
I wrote a data set with:
```
./scylla-bench -mode=write -workload=sequential -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -clustering-row-size=8096 -partition-count=1000
```
This creates large, 80MB partitions, which should fill many pages if read in full. Then I started a read workload:
```
./scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100
```
I confirmed that paging is happening as expected, then upgraded the nodes one-by-one to this PR (while the read-load was ongoing). I observed no read errors or any other errors in the logs.

Closes #10829

* github.com:scylladb/scylla:
  query: have replica provide the last position
  idl/query: add last_position to query_result
  mutlishard_mutation_query: propagate compaction state to result builder
  multishard_mutation_query: defer creating result builder until needed
  querier: use full_position instead of ad-hoc struct
  querier: rely on compactor for position tracking
  mutation_compactor: add current_full_position() convenience accessor
  mutation_compactor: s/_last_clustering_pos/_last_pos/
  mutation_compactor: add state accessor to compact_mutation
  introduce full_position
  idl: move position_in_partition into own header
  service/paging: use position_in_partition instead of clustering_key for last row
  alternator/serialization: extract value object parsing logic
  service/pagers/query_pagers.cc: fix indentation
  position_in_partition: add to_string(partition_region) and parse_partition_region()
  mutation_fragment.hh: move operator<<(partition_region) to position_in_partition.hh
2022-06-27 12:23:21 +03:00
Avi Kivity
dab56b82fa Merge 'Per-partition rate limiting' from Piotr Dulikowski
Due to its sharded and token-based architecture, Scylla works best when the user workload is more or less uniformly balanced across all nodes and shards. However, a common case when this assumption is broken is the "hot partition" - suddenly, a single partition starts getting a lot more reads and writes in comparison to other partitions. Because the shards owning the partition have only a fraction of the total cluster capacity, this quickly causes latency problems for other partitions within the same shard and vnode.

This PR introduces per-partition rate limiting feature. Now, users can choose to apply per-partition limits to their tables of choice using a schema extension:

```
ALTER TABLE ks.tbl
WITH per_partition_rate_limit = {
	'max_writes_per_second': 100,
	'max_reads_per_second': 200
};
```

Reads and writes which are detected to go over that quota are rejected to the client using a new RATE_LIMIT_ERROR CQL error code - existing error codes didn't really fit well with the rate limit error, so a new error code is added. This code is implemented as a part of a CQL protocol extension and returned to clients only if they requested the extension - if not, the existing CONFIG_ERROR will be used instead.

Limits are tracked and enforced on the replica side. If a write fails with some replicas reporting rate limit being reached, the rate limit error is propagated to the client. Additionally, the following optimization is implemented: if the coordinator shard/node is also a replica, we account the operation into the rate limit early and return an error in case of exceeding the rate limit before sending any messages to other replicas at all.

The PR covers regular, non-batch writes and single-partition reads. LWT and counters are not covered here.

Results of `perf_simple_query --smp=1 --operations-per-shard=1000000`:

- Write mode:
  ```
  8f690fdd47 (PR base):
  129644.11 tps ( 56.2 allocs/op,  13.2 tasks/op,   49785 insns/op)
  This PR:
  125564.01 tps ( 56.2 allocs/op,  13.2 tasks/op,   49825 insns/op)
  ```
- Read mode:
  ```
  8f690fdd47 (PR base):
  150026.63 tps ( 63.1 allocs/op,  12.1 tasks/op,   42806 insns/op)
  This PR:
  151043.00 tps ( 63.1 allocs/op,  12.1 tasks/op,   43075 insns/op)
  ```

Manual upgrade test:
- Start 3 nodes, 4 shards each, Scylla version 8f690fdd47
- Create a keyspace with scylla-bench, RF=3
- Start reading and writing with scylla-bench with CL=QUORUM
- Manually upgrade nodes one by one to the version from this PR
- Upgrade succeeded, apart from a small number of operations which failed when each node was being put down all reads/writes succeeded
- Successfully altered the scylla-bench table to have a read and write limit and those limits were enforced as expected

Fixes: #4703

Closes #9810

* github.com:scylladb/scylla:
  storage_proxy: metrics for per-partition rate limiting of reads
  storage_proxy: metrics for per-partition rate limiting of writes
  database: add stats for per partition rate limiting
  tests: add per_partition_rate_limit_test
  config: add add_per_partition_rate_limit_extension function for testing
  cf_prop_defs: guard per-partition rate limit with a feature
  query-request: add allow_limit flag
  storage_proxy: add allow rate limit flag to get_read_executor
  storage_proxy: resultize return type of get_read_executor
  storage_proxy: add per partition rate limit info to read RPC
  storage_proxy: add per partition rate limit info to query_result_local(_digest)
  storage_proxy: add allow rate limit flag to mutate/mutate_result
  storage_proxy: add allow rate limit flag to mutate_internal
  storage_proxy: add allow rate limit flag to mutate_begin
  storage_proxy: choose the right per partition rate limit info in write handler
  storage_proxy: resultize return types of write handler creation path
  storage_proxy: add per partition rate limit to mutation_holders
  storage_proxy: add per partition rate limit info to write RPC
  storage_proxy: add per partition rate limit info to mutate_locally
  database: apply per-partition rate limiting for reads/writes
  database: move and rename: classify_query -> classify_request
  schema: add per_partition_rate_limit schema extension
  db: add rate_limiter
  storage_proxy: propagate rate_limit_exception through read RPC
  gms: add TYPED_ERRORS_IN_READ_RPC cluster feature
  storage_proxy: pass rate_limit_exception through write RPC
  replica: add rate_limit_exception and a simple serialization framework
  docs: design doc for per-partition rate limiting
  transport: add rate_limit_error
2022-06-24 01:32:13 +03:00
Botond Dénes
2b0bc11f2e service/paging: use position_in_partition instead of clustering_key for last row
The former allows for expressing more positions, like a position
before/after a clustering key. This practically enables the coordinator
side paging logic, for a query to be stopped at a tombstone (which can
have said positions).
2022-06-23 13:36:20 +03:00
Botond Dénes
adabe3b5a3 alternator/serialization: extract value object parsing logic
To make it reusable by a method added by the next patch.
2022-06-23 11:33:18 +03:00
Piotr Dulikowski
a7ad70600d query-request: add allow_limit flag
Adds allow_limit flag to the read_command. The flag decides whether rate
limiting of this operation is allowed.
2022-06-22 20:16:49 +02:00
Piotr Dulikowski
e6beab3106 storage_proxy: add allow rate limit flag to mutate/mutate_result
Now, mutate/mutate_result accept a flag which decides whether the write
should be rate limited or not.

The new parameter is mandatory and all call sites were updated.
2022-06-22 20:16:49 +02:00
Pavel Emelyanov
98a4d41e31 alternator: Get rack/datacenter from topology
It's needed in two places, both can get topology from the proxy's token
metadata.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-22 11:47:26 +03:00
Nadav Har'El
3aca1ca572 alternator: make BatchGetItem group reads by partition
DynamoDB API's BatchGetItem invokes a number (up to 25) of read requests
in parallel, returning when all results are available. Alternator naively
implemented this by sending all read requests in parallel, no matter which
requests these were.

That implementation was inefficient when all the requests are to different
items (clustering rows) of the same partition. In a multi-node setup this
will end up sending 25 separate requests to the same remote node(s). Even
on a single-node setup, this may result in reading from disk more than
once, and even if the partition is cached - doing an O(logN) search in
each multiple times.

What we do in this patch, instead, is to group all the BatchGetItem
requests that aimed at the same partition into a single read request
asking for a (sorted) list of clustering keys. This is similar to an
"IN" request in CQL.

As an example of the performance benefit of this patch, I tried a
BatchGetItem request asking for 20 random items from a 10-million item
partition. I measured the latency of this request on a single-node
Scylla. Before this patch, I saw a latency of 17-21 ms (the lower number
is when the request is retried and the requested items are already in
the cache). After this patch, the latency is 10-14 ms. The performance
improvement on multi-node clusters are expected to be even higher.

Unfortunately the patch is less trivial than I hoped it would be,
because some of the old code was organized under the assumption that
each read request only returned one item (and if it failed, it means
only one item failed), so this part of the code had to be reorganized
(and, for making the code more readable, coroutinized).

An unintended benefit of the code reorganization is that it also gave
me an opportunity to fail an attempt to ask BatchGetItem the same
item more than once (issue #10757).

The patch also adds a few more corner cases in the tests, to be even
more sure that the code reorganization doesn't introduce a regression
in BatchGetItem.

Fixes #10753
Fixes #10757

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-06-19 14:47:57 +03:00
Nadav Har'El
e20233dab1 alternator: improve error handling when trying to tag a GSI or LSI
In issue #10786, we raised the idea of maybe allowing to tag (with
TagResource) GSIs and LSIs, not just base tables. However, currently,
neither DynamoDB nor Syclla allows it. So in this patch we add a
test that confirms this. And while at it, we fix Alternator to
return the same error message as DynamoDB in this case.

Refs #10786.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-06-13 18:14:42 +03:00
Nadav Har'El
8866c326de alternator: forbid duplicate index (LSI and GSI) names
Adding an LSI and GSI with the same name to the same Alternator table
should be forbidden - because if both exists only one of them (the GSI)
would actually be usable. DynamoDB also forbids such duplicate name.

So in this patch we add a test for this issue, and fix it.

Since the patch involves a few more uses of the IndexName string,
we also clean up its handling a bit, to use std::string_view instead
of the old-style std::string&.

Fixes #10789

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-06-13 18:14:42 +03:00
Nadav Har'El
00866a75d8 alternator: add ARN for indexes (LSI and GSI)
DynamoDB gives an ARN ("Amazon Resource Name") to LSIs and GSIs. These
look like BASEARN/index/INDEXNAME, where BASEARN is the ARN of the base
table, and INDEXNAME is the name of the LSI or the GSI.

These ARNs should be returned by DescribeTable as part of its
description of each index, and this patch adds that missing IndexArn
field.

The ARN we're adding here is hardly useful (e.g., as explained in
issue #10786, it can't be used to add tags to the index table),
but nevertheless should exist for compatibility with DynamoDB.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-06-13 18:14:42 +03:00