Commit Graph

377 Commits

Author SHA1 Message Date
Nadav Har'El
d6aba8232b alternator: configurable override for DescribeEndpoints
The AWS C++ SDK has a bug (https://github.com/aws/aws-sdk-cpp/issues/2554)
where even if a user specifies a specific enpoint URL, the SDK uses
DescribeEndpoints to try to "refresh" the endpoint. The problem is that
DescribeEndpoints can't return a scheme (http or https) and the SDK
arbitrarily picks https - making it unable to communicate with Alternator
over http. As an example, the new "dynamodb shell" (written in C++)
cannot communicate with Alternator running over http.

This patch adds a configuration option, "alternator_describe_endpoints",
which can be used to override what DescribeEndpoints does:

1. Empty string (the default) leaves the current behavior -
   DescribeEndpoints echos the request's "Host" header.

2. The string "disabled" disables the DescribeEndpoints (it will return
   an UnknownOperationException). This is how DynamoDB Local behaves,
   and the AWS C++ SDK and the Dynamodb Shell work well in this mode.

3. Any other string is a fixed string to be returned by DescribeEndpoints.
   It can be useful in setups that should return a known address.

Note that this patch does not, by default, change the current behaivor
of DescribeEndpoints. But it us the future to override its behavior
in a user experiences problems in the field - without code changes.

Fixes #14410.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14432
2023-07-07 11:08:10 +02:00
Marcin Maliszkiewicz
6424dd5ec4 alternator: close output_stream when exception is thrown during response streaming
When exception occurs and we omit closing output_stream then the whole process is brought down
by an assertion in ~output_stream.

Fixes https://github.com/scylladb/scylladb/issues/14453
Relates https://github.com/scylladb/scylladb/issues/14403

Closes #14454
2023-07-04 16:15:08 +03:00
Harsh Soni
d8c3b144cb alternator: optimize validate_table_name call
Prior to this `table_name` was validated for every request in `find_table_name` leading to unnecessary overhead (although small, but unnecessary). Now, the `table_name` is only validated while creation reqeust and in other requests iff the table does not exist (to keep compatibility with DynamoDB's exception).

Fixes: #12538

Closes #13966
2023-06-12 10:46:13 +03:00
Nadav Har'El
d2e089777b Merge 'Yield while building large results in Alternator - rjson::print, executor::batch_get_item' from Marcin Maliszkiewicz
Adds preemption points used in Alternator when:
 - sending bigger json response
 - building results for BatchGetItem

I've tested manually by inserting in preemptible sections (e.g. before `os.write`) code similar to:

    auto start  = std::chrono::steady_clock::now();
    do { } while ((std::chrono::steady_clock::now() - start) < 100ms);

and seeing reactor stall times. After the patch they
were not increasing while before they kept building up due to no preemption.

Refs #7926
Fixes #13689

Closes #12351

* github.com:scylladb/scylladb:
  alternator: remove redundant flush call in make_streamed
  utils: yield when streaming json in print()
  alternator: yield during BatchGetItem operation
2023-06-04 23:22:51 +03:00
Nadav Har'El
8a1334cf6f Merge 'alternator: eliminate duplicated rjson::find() of ExpressionAttributeNames and ExpressionAttributeValues' from Marcin Maliszkiewicz
Summary of the patch set:
- eliminates not needed calls to rjson::find (~1% tps improvement in `perf-simple-query --write`)
- adds some very specific test in this area (more general cases were covered already)
- fixes some minor validation bug

Fixes https://github.com/scylladb/scylladb/issues/13251

Closes #12675

* github.com:scylladb/scylladb:
  alternator: fix unused ExpressionAttributeNames validation when used as a part of BatchGetItem
  alternator: eliminate duplicated rjson::find() of ExpressionAttributeNames and ExpressionAttributeValues
2023-06-04 23:10:12 +03:00
Alexey Novikov
ffd4fcceec Alternator: return full table description on return of DeleteTable
The DeleteTable operation in Alternator shoudl return a TableDescription
object describing the table which has just been deleted, similar to what
DescribeTable returns

Fixes scylladb#11472

Closes #11628
2023-06-04 21:00:26 +03:00
Marcin Maliszkiewicz
9ce65270d5 alternator: fix unused ExpressionAttributeNames validation when used as a part of BatchGetItem
BatchGetItem request is a map of table names and 'sub-requests', ExpressionAttributeNames is defined on
'sub-request' level but the code was instead checking the top level, obtaining nullptr every time which
effectively disables unused names check.

Fixes #13251
2023-05-26 15:03:15 +02:00
Marcin Maliszkiewicz
fb5c325cfd alternator: eliminate duplicated rjson::find() of ExpressionAttributeNames and ExpressionAttributeValues
rjson::find is not a very cheap function, it involves bunch of function calls and loop iteration.
Overall it costs 120-170 intructions even for small requests. Some example profile of alternator::executor::query
execution shows ~18 rjson::find calls, taking in total around 7% of query internal proessing time (note that
JSON parse/print and http handling are not part of this function).

This patch eliminates 2 rjson::find calls for most request types. I saw 1-2% tps improvement
in `perf-simple-query --write` although it does improve tps I suspect real percentage is smaller and
don't have much confindentce in this particular number, observed benchmark variance is too high to measure it reliably.
2023-05-26 15:03:15 +02:00
Gleb Natapov
a429018a8a migration_manager: add wait_for_schema_agreement() function
Several subsystems re-implement the same logic for waiting for schema
agreement. Provide the function in the migration_manager and use it
instead.
2023-05-25 14:44:53 +03:00
Botond Dénes
0cff0ffa08 Merge 'alternator,config: make alternator_timeout_in_ms live-updateable' from Kefu Chai
before this change, alternator_timeout_in_ms is not live-updatable,
as after setting executor's default timeout right before creating
sharded executor instances, they never get updated with this option
anymore. but many users would like to set the driver timers based on
server timers. we need to enable them to configure timeout even
when the server is still running.

in this change,

* `alternator_timeout_in_ms` is marked as live-updateable
* `executor::_s_default_timeout` is changed to a thread_local variable,
   so it can be updated by a per-shard updateable_value. and
   it is now a updateable_value, so its variable name is updated
   accordingly. this value is set in the ctor of executor, and
   it is disconnected from the corresponding named_value<> option
   in the dtor of executor.
* alternator_timeout_in_ms is passed to the constructor of
   executor via sharded_parameter, so `executor::_timeout_in_ms` can
   be initialized on per-shard basis
* `executor::set_default_timeout()` is dropped, as we already pass
   the option to executor in its ctor.

Fixes #12232

Closes #13300

* github.com:scylladb/scylladb:
  alternator: split the param list of executor ctor into multi lines
  alternator,config: make alternator_timeout_in_ms live-updateable
2023-05-15 10:16:29 +03:00
Nadav Har'El
e57252092c Merge 'cql3: result_set, selector: change value type to managed_bytes_opt' from Avi Kivity
CQL evolved several expression evaluation mechanisms: WHERE clause,
selectors (the SELECT clause), and the LWT IF clause are just some
examples. Most now use expressions, which use managed_bytes_opt
as the underlying value representation, but selectors still use bytes_opt.

This poses two problems:
1. bytes_opt generates large contiguous allocations when used with large blobs, impacting latency
2. trying to use expressions with bytes_opt will incur a copy, reducing performance

To solve the problem, we harmonize the data types to managed_bytes_opt
(#13216 notwithstanding). This is somewhat difficult since the source of the values
are views into a bytes_ostream. However, luckily bytes_ostream and managed_bytes_view
are mostly compatible so with a little effort this can be done.

The series is neutral wrt performance:

before:
```
222118.61 tps ( 61.1 allocs/op,  12.1 tasks/op,   43092 insns/op,        0 errors)
224250.14 tps ( 61.1 allocs/op,  12.1 tasks/op,   43094 insns/op,        0 errors)
224115.66 tps ( 61.1 allocs/op,  12.1 tasks/op,   43092 insns/op,        0 errors)
223508.70 tps ( 61.1 allocs/op,  12.1 tasks/op,   43107 insns/op,        0 errors)
223498.04 tps ( 61.1 allocs/op,  12.1 tasks/op,   43087 insns/op,        0 errors)
```

after:
```
220708.37 tps ( 61.1 allocs/op,  12.1 tasks/op,   43118 insns/op,        0 errors)
225168.99 tps ( 61.1 allocs/op,  12.1 tasks/op,   43081 insns/op,        0 errors)
222406.00 tps ( 61.1 allocs/op,  12.1 tasks/op,   43088 insns/op,        0 errors)
224608.27 tps ( 61.1 allocs/op,  12.1 tasks/op,   43102 insns/op,        0 errors)
225458.32 tps ( 61.1 allocs/op,  12.1 tasks/op,   43098 insns/op,        0 errors)
```

Though I expect with some more effort we can eliminate some copies.

Closes #13637

* github.com:scylladb/scylladb:
  cql3: untyped_result_set: switch to managed_bytes_view as the cell type
  cql3: result_set: switch cell data type from bytes_opt to managed_bytes_opt
  cql3: untyped_result_set: always own data
  types: abstract_type: add mixed-type versions of compare() and equal()
  utils/managed_bytes, serializer: add conversion between buffer_view<bytes_ostream> and managed_bytes_view
  utils: managed_bytes: add bidirectional conversion between bytes_opt and managed_bytes_opt
  utils: managed_bytes: add managed_bytes_view::with_linearized()
  utils: managed_bytes: mark managed_bytes_view::is_linearized() const
2023-05-10 15:01:45 +03:00
Kefu Chai
5fa459bd1a treewide: do not include unused header
since #13452, we switched most of the caller sites from std::regex
to boost::regex. in this change, all occurences of `#include <regex>`
are dropped unless std::regex is used in the same source file.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13765
2023-05-07 19:01:29 +03:00
Avi Kivity
42a1ced73b cql3: result_set: switch cell data type from bytes_opt to managed_bytes_opt
The expression system uses managed_bytes_opt for values, but result_set
uses bytes_opt. This means that processing values from the result set
in expressions requires a copy.

Out of the two, managed_bytes_opt is the better choice, since it prevents
large contiguous allocations for large blobs. So we switch result_set
to use managed_bytes_opt. Users of the result_set API are adjusted.

The db::function interface is not modified to limit churn; instead we
convert the types on entry and exit. This will be adjusted in a following
patch.
2023-05-07 17:17:36 +03:00
Tomasz Grabiec
5b046043ea migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
It will be extended with listener notification firing, which is an
async operation.
2023-04-24 10:49:37 +02:00
Botond Dénes
dba1d36aa6 Merge 'alternator: fix isolation of concurrent modifications to tags' from Nadav Har'El
Alternator's implementation of TagResource, UntagResource and UpdateTimeToLive (the latter uses tags to store the TTL configuration) was unsafe for concurrent modifications - some of these modifications may be lost. This short series fixes the bug, and also adds (in the last patch) a test that reproduces the bug and verifies that it's fixed.

The cause of the incorrect isolation was that we separately read the old tags and wrote the modified tags. In this series we introduce a new function, `modify_tags()` which can do both under one lock, so concurrent tag operations are serialized and therefore isolated as expected.

Fixes #6389.

Closes #13150

* github.com:scylladb/scylladb:
  test/alternator: test concurrent TagResource / UntagResource
  db/tags: drop unsafe update_tags() utility function
  alternator: isolate concurrent modification to tags
  db/tags: add safe modify_tags() utility functions
  migration_manager: expose access to storage_proxy
2023-04-11 11:17:23 +03:00
Kefu Chai
500eeeb12c mutation: specialize fmt::formatter<partition_region>
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `partition_region` with the help of fmt::ostream.

to help with the review process, the corresponding `to_string()` is
dropped, and its callers now switch over to `fmt::to_string()` in
this change as well. to use `fmt::to_string()` helps with consolidating
all places to use fmtlib for printing/formatting.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-03-31 19:03:14 +08:00
Kefu Chai
69c21f490a alternator,config: make alternator_timeout_in_ms live-updateable
before this change, alternator_timeout_in_ms is not live-updatable,
as after setting executor's default timeout right before creating
sharded executor instances, they never get updated with this option
anymore.

in this change,

* alternator_timeout_in_ms is marked as live-updateable
* executor::_s_default_timeout is changed to a thread_local variable,
  so it can be updated by a per-shard updateable_value. and
  it is now a updateable_value, so its variable name is updated
  accordingly. this value is set in the ctor of executor, and
  it is disconnected from the corresponding named_value<> option
  in the dtor of executor.
* alternator_timeout_in_ms is passed to the constructor of
  executor via sharded_parameter, so executor::_timeout_in_ms can
  be initialized on per-shard basis
* executor::set_default_timeout() is dropped, as we already pass
  the option to executor in its ctor.

please note, in the ctor of executor, we always update the cached
value of `s_default_timeout` with the value of `_timeout_in_ms`,
and we set the default timeout to 10s in `alternator_test_env`.
this is a design decision to avoid bending the production code for
testing, as in production, we always set the timeout with the value
specified either by the default value of yaml conf file.

Fixes #12232
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-03-23 20:57:08 +08:00
Nadav Har'El
c196bd78de alternator: isolate concurrent modification to tags
Alternator modifies tags in three operations - TagResource, UntagResource
and UpdateTimeToLive (the latter uses a tag to store the TTL configuration).

All three operations were implemented by three separate steps:

 1. Read the current tags.
 2. Modify the tags according to the desired operation.
 3. Write the modified tags back with update_tags().

This implementation was not safe for concurrent operations - some
modifications may be be lost. We fix this in this patch by using the new
modify_tags() function introduced in the previous patch, which performs
all three steps under one lock so the tag operations are serialized and
correctly isolated.

Fixes #6389

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2023-03-13 12:25:03 +02:00
Kefu Chai
a37610f66a alternator: using chrono_literals before using it
we should assume that some included header does this for us.

we'd have following compiling failure if seastar's
src/http/request_parser.rl does not `using namespace httpd;` anymore.

```
/home/kefu/dev/scylladb/alternator/streams.cc:433:55: error: no matching literal operator for call to 'operator""h' with argument of type 'unsigned long long' or 'const char *', and no matching literal operator template
static constexpr auto dynamodb_streams_max_window = 24h;
                                                      ^
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-03-07 18:20:36 +08:00
Kefu Chai
0cb842797a treewide: do not define/capture unused variables
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-15 22:57:18 +02:00
Avi Kivity
69a385fd9d Introduce schema/ module
Schema related files are moved there. This excludes schema files that
also interact with mutations, because the mutation module depends on
the schema. Those files will have to go into a separate module.

Closes #12858
2023-02-15 11:01:50 +02:00
Marcin Maliszkiewicz
0b5655021a alternator: remove redundant flush call in make_streamed
As output_stream close() is doing flush anyway.
2023-01-23 13:46:06 +01:00
Marcin Maliszkiewicz
f2788f5391 alternator: yield during BatchGetItem operation 2023-01-20 14:44:24 +01:00
Marcin Maliszkiewicz
7230841431 alternator: unify json streaming heuristic
Main assumption here is that if is_big is good enough for
GetBatchItems operation it should work well also for Scan,
Query and GetRecords. And it's easier to maintain more unified
code.

Additionally 'future<> print' documentation used for streaming
suggests that there is quite big overhead so since it seems the
only motivation for streaming was to reduce contiguous allocation
size below some threshold we should not stream when this threshold
is not exceeded.

Closes #12164
2023-01-19 16:40:43 +02:00
Marcin Maliszkiewicz
4c33791f96 alternator: eliminate regexes from the hot path
This decreases the whole alternator::get_table cpu time by 78%
(from 2.8 us to 0.6 us on my cpu).

In perf_simple_query it decreases allocs/op by 1.6% (by removing 4 allocations)
and increases median tps by 3.4%.

Raw results from running:

./build/release/test/perf/perf_simple_query_g --smp 1 \
         --alternator forbid --default-log-level error \
         --random-seed=1235000092 --duration=180 --write

Before the patch:

median 46903.65 tps (197.2 allocs/op,  12.1 tasks/op,  170886 insns/op, 0 errors)
median absolute deviation: 210.15
maximum: 47354.59
minimum: 42535.63

After the patch:

median 48484.76 tps (194.1 allocs/op,  12.1 tasks/op,  168512 insns/op, 0 errors)
median absolute deviation: 317.32
maximum: 49247.69
minimum: 44656.38

Closes #12445
2023-01-19 13:23:24 +02:00
Marcin Maliszkiewicz
bcbaccc143 rjson: avoid copy constructors in from_string calls when possible
This function anyway copies the value so no need to do extra copy.
2023-01-16 15:15:26 +01:00
Marcin Maliszkiewicz
668fffb6c5 alternator: remove unused parameters from describe_items func 2023-01-16 14:36:23 +01:00
Avi Kivity
2739ac66ed treewide: drop cql_serialization_format
Now that we don't accept cql protocol version 1 or 2, we can
drop cql_serialization format everywhere, except when in the IDL
(since it's part of the inter-node protocol).

A few functions had duplicate versions, one with and one without
a cql_serialization_format parameter. They are deduplicated.

Care is taken that `partition_slice`, which communicates
the cql_serialization_format across nodes, still presents
a valid cql_serialization_format to other nodes when
transmitting itself and rejects protocol 1 and 2 serialization\
format when receiving. The IDL is unchanged.

One test checking the 16-bit serialization format is removed.
2023-01-03 19:54:13 +02:00
Marcin Maliszkiewicz
2bf2ffd3ed alternator: add maybe_quote to secondary indexes 'where' condition
This bug doesn't affect anything, the reason is descibed in the commit:
'alternator: fix wrong 'where' condition for GSI range key'.

But it's theoretically correct to escape those key names and
the difference can be observed via CQL's describe table. Before
the patch 'where' condition is missing one double quote in variable
name making it mismatched with corresponding column name.
2022-11-22 11:08:23 +01:00
Marcin Maliszkiewicz
d6d20134de alternator: fix wrong 'where' condition for GSI range key
This bug doesn't manifest in a visible way to the user.

Adding the index to an existing table via GlobalSecondaryIndexUpdates is not supported
so we don't need to consider what could happen for empty values of index range key.
After the index is added the only interesting value user can set is omitting
the value (null or empty are not allowed, see test_gsi_empty_value and
test_gsi_null_value).

In practice no matter of 'where' condition the underlaying materialized
view code is skipping row updates with missing keys as per this comment:
'If one of the key columns is missing, set has_new_row = false
meaning that after the update there will be no view row'.

Thats why the added test passes both before and after the patch.
But it's still usefull to include it to exercise those code paths.

Fixes #11800
2022-11-22 11:08:23 +01:00
Botond Dénes
f1a039fc2b treewide: use ::for_partition_start() instead of ::partition_start_tag_t{}
We just added a convenience static factory method for partition start,
change the present users of the clunky constructor+tag to use it
instead.
2022-11-11 09:58:18 +02:00
Alexander Turetskiy
636e14cc77 Alternator: Projection field added to return from DescribeTable which describes GSIs and LSIs.
The return from DescribeTable which describes GSIs and LSIs is missing
the Projection field. We do not yet support all the settings Projection
(see #5036), but the default which we support is ALL, and DescribeTable
should return that in its description.

Fixes #11470

Closes #11693
2022-10-19 19:01:08 +03:00
Nadav Har'El
4a7794fb64 alternator: better error message when adding a GSI to an existing table
Due to issue #11567, Alternator do not yet support adding a GSI to an
existing table via UpdateTable with the GlobalSecondaryIndexUpdates
parameter.

However, currently, we print a misleading error message in this case,
complaining about the AttributeDefinitions parameter. This parameter
is also required with GlobalSecondaryIndexUpdates, but it's not the
main problem, and the user is likely to be confused why the error message
points to that specific paramter and what it means that this parameter
is claimed to be "not supported" (while it is supported, in CreateTable).
With this patch, we report that GlobalSecondaryIndexUpdates is not
supported.

This patch does not fix the unsupported feature - it just improves
the error message saying that it's not supported.

Refs #11567

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11650
2022-09-29 09:00:31 +03:00
Nadav Har'El
999ca2d588 alternator: fix crashes an errors when using ":attrs" attribute
Alternator uses a single column, a map, with the deliberately strange
name ":attrs", to hold all the schema-less attributes of an item.
The existing code is buggy when the user tries to write to an attribute
with this strange name ":attrs". Although it is extremely unlikely that
any user would happen to choose such a name, it is nevertheless a legal
attribute name in DynamoDB, and should definitely not cause Scylla to crash
as it does in some cases today.

The bug was caused by the code assuming that to check whether an attribute
is stored in its own column in the schema, we just need to check whether
a column with that name exists. This is almost true, except for the name
":attrs" - a column with this name exists, but it is a map - the attribute
with that name should be stored *in* the map, not as the map. The fix
is to modify that check to special-case ":attrs".

This fix makes the relevant tests, which used to crash or fail, now pass.

This fix solves most of #5009, but one point is not yet solved (and
perhaps we don't need to solve): It is still not allowed to use the
name ":attrs" for a **key** attribute. But trying to do that fails cleanly
(during the table creation) with an appropriate error message, so is only
a very minor compatibility issue.

Refs #5009

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-19 10:30:11 +03:00
Nadav Har'El
941c719a23 alternator: return ProvisionedThroughput in DescribeTable
DescribeTable is currently hard-coded to return PAY_PER_REQUEST billing
mode. Nevertheless, even in PAY_PER_REQUEST mode, the DescribeTable
operation must return a ProvisionedThroughput structure, listing both
ReadCapacityUnits and WriteCapacityUnits as 0. This requirement is not
stated in some DynamoDB documentation but is explictly mentioned in
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_ProvisionedThroughput.html
Also in empirically, DynamoDB returns ProvisionedThroughput with zeros
even in PAY_PER_REQUEST mode. We even had an xfailing test to confirm this.

The ProvisionedThroughput structure being missing was a problem for
applications like DynamoDB connectors for Spark, if they implicitly
assume that ProvisionedThroughput is returned by DescribeTable, and
fail (as described in issue #11222) if it's outright missing.

So this patch adds the missing ProvisionedThroughput structure, and
the xfailing test starts to pass.

Note that this patch doesn't change the fact that attempting to set
a table to PROVISIONED billing mode is ignored: DescribeTable continues
to always return PAY_PER_REQUEST as the billing mode and zero as the
provisioned capacities.

Fixes #11222

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11298
2022-08-22 09:58:09 +02:00
Botond Dénes
d1d53f1b84 query: add tombstone-limit to read-command
Propagate the tombstone-limit from coordinator to replicas, to make sure
all is using the same limit.
2022-08-10 06:01:47 +03:00
Benny Halevy
c71ef330b2 query-request, everywhere: define and use query_id as a strong type
Define query_id as a tagged_uuid
So it can be differentiated from other uuid-class types.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:13:28 +03:00
Michał Sala
041cb77ad0 alternator, db: move the tag code to db/tags
Tags are a useful mechanism that could be used outside of alternator
namespace. My motivation to move tags_extension and other utilities to
db/tags/ was that I wanted to use them to mark "synchronous mode" views.

I have extracted `get_tags_of_table`, `find_tag` and `update_tags`
method to db/tags/utils.cc and moved alternator/tags_extension.hh to
db/tags/.

The signature of `get_tags_of_table` was changed from `const
std::map<sstring, sstring>&` to `const std::map<sstring, sstring>*`
Original behavior of this function was to throw an
`alternator::api_error` exception. This was undesirable, as it
introduced a dependency on the alternator module. I chose to change it
to return a potentially null value, and added a wrapper function to the
alternator module - `get_tags_of_table_or_throw` to keep the previous
throwing behavior.
2022-07-25 09:53:33 +02:00
Benny Halevy
acae3cc223 treewide: stop use of deprecated coroutine::make_exception
Convert most use sites from `co_return coroutine::make_exception`
to `co_await coroutine::return_exception{,_ptr}` where possible.

In cases this is done in a catch clause, convert to
`co_return coroutine::exception`, generating an exception_ptr
if needed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10972
2022-07-07 15:02:16 +03:00
Botond Dénes
2b6eeadc07 alternator: use position-in-partition in paging cookie only when reading CQL tables
Recently, we added full position-in-partition support to alternator's
paging cookie, so it can support stopping at arbitrary positions. This
support however is only really needed when tables have range tombstones
and alternator tables never have them. So to avoid having to make the
new fields in 'ExclusiveStartKey' reserved, we avoid filling these in
when reading an alternator table, as in this case it is safe to assume
the position is `after_key($clustring_key)`. We do include these new
members however when reading CQL tables through alternator. As this is
only supported for system tables, we can also be sure that the elaborate
names we used for these fields are enough to avoid naming clashes.
The condition in the code implementing this is actually even more
general: it only includes the region/weight members when the position
differs from that of a normal alternator one.
2022-06-30 15:10:30 +03:00
Botond Dénes
52058ea974 alternator: make is_alternator_keyspace() a standalone method 2022-06-30 14:18:29 +03:00
Avi Kivity
3131cbea62 Merge 'query: allow replica to provide arbitrary continue position' from Botond Dénes
Currently, we use the last row in the query result set as the position where the query is continued from on the next page. Since only live rows make it into query result set, this mandates the query to be stopped on a live row on the replica, lest any dead rows or tombstones processed after the live rows, would have to be re-processed on the next page (and the saved reader would have to be thrown away due to position mismatch). This requirement of having to stop on a live row is problematic with datasets which have lots of dead rows or tombstones, especially if these form a prefix. In the extreme case, a query can time out before it can process a single live row and the data-set becomes effectively unreadable until compaction gets rid of the tombstones.
This series prepares the way for the solution: it allows the replica to determine what position the query should continue from on the next page. This position can be that of a dead row, if the query stopped on a dead row. For now, the replica supplies the same position that would have been obtained with looking at the last row in the result set, this series merely introduces the infrastructure for transferring a position together with the query result, and it prepares the paging logic to make use of this position. If the coordinator is not prepared for the new field, it will simply fall-back to the old way of looking at the last row in the result set. As I said for now this is still the same as the content of the new field so there is no problem in mixed clusters.

Refs: https://github.com/scylladb/scylla/issues/3672
Refs: https://github.com/scylladb/scylla/issues/7689
Refs: https://github.com/scylladb/scylla/issues/7933

Tests: manual upgrade test.
I wrote a data set with:
```
./scylla-bench -mode=write -workload=sequential -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -clustering-row-size=8096 -partition-count=1000
```
This creates large, 80MB partitions, which should fill many pages if read in full. Then I started a read workload:
```
./scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100
```
I confirmed that paging is happening as expected, then upgraded the nodes one-by-one to this PR (while the read-load was ongoing). I observed no read errors or any other errors in the logs.

Closes #10829

* github.com:scylladb/scylla:
  query: have replica provide the last position
  idl/query: add last_position to query_result
  mutlishard_mutation_query: propagate compaction state to result builder
  multishard_mutation_query: defer creating result builder until needed
  querier: use full_position instead of ad-hoc struct
  querier: rely on compactor for position tracking
  mutation_compactor: add current_full_position() convenience accessor
  mutation_compactor: s/_last_clustering_pos/_last_pos/
  mutation_compactor: add state accessor to compact_mutation
  introduce full_position
  idl: move position_in_partition into own header
  service/paging: use position_in_partition instead of clustering_key for last row
  alternator/serialization: extract value object parsing logic
  service/pagers/query_pagers.cc: fix indentation
  position_in_partition: add to_string(partition_region) and parse_partition_region()
  mutation_fragment.hh: move operator<<(partition_region) to position_in_partition.hh
2022-06-27 12:23:21 +03:00
Avi Kivity
dab56b82fa Merge 'Per-partition rate limiting' from Piotr Dulikowski
Due to its sharded and token-based architecture, Scylla works best when the user workload is more or less uniformly balanced across all nodes and shards. However, a common case when this assumption is broken is the "hot partition" - suddenly, a single partition starts getting a lot more reads and writes in comparison to other partitions. Because the shards owning the partition have only a fraction of the total cluster capacity, this quickly causes latency problems for other partitions within the same shard and vnode.

This PR introduces per-partition rate limiting feature. Now, users can choose to apply per-partition limits to their tables of choice using a schema extension:

```
ALTER TABLE ks.tbl
WITH per_partition_rate_limit = {
	'max_writes_per_second': 100,
	'max_reads_per_second': 200
};
```

Reads and writes which are detected to go over that quota are rejected to the client using a new RATE_LIMIT_ERROR CQL error code - existing error codes didn't really fit well with the rate limit error, so a new error code is added. This code is implemented as a part of a CQL protocol extension and returned to clients only if they requested the extension - if not, the existing CONFIG_ERROR will be used instead.

Limits are tracked and enforced on the replica side. If a write fails with some replicas reporting rate limit being reached, the rate limit error is propagated to the client. Additionally, the following optimization is implemented: if the coordinator shard/node is also a replica, we account the operation into the rate limit early and return an error in case of exceeding the rate limit before sending any messages to other replicas at all.

The PR covers regular, non-batch writes and single-partition reads. LWT and counters are not covered here.

Results of `perf_simple_query --smp=1 --operations-per-shard=1000000`:

- Write mode:
  ```
  8f690fdd47 (PR base):
  129644.11 tps ( 56.2 allocs/op,  13.2 tasks/op,   49785 insns/op)
  This PR:
  125564.01 tps ( 56.2 allocs/op,  13.2 tasks/op,   49825 insns/op)
  ```
- Read mode:
  ```
  8f690fdd47 (PR base):
  150026.63 tps ( 63.1 allocs/op,  12.1 tasks/op,   42806 insns/op)
  This PR:
  151043.00 tps ( 63.1 allocs/op,  12.1 tasks/op,   43075 insns/op)
  ```

Manual upgrade test:
- Start 3 nodes, 4 shards each, Scylla version 8f690fdd47
- Create a keyspace with scylla-bench, RF=3
- Start reading and writing with scylla-bench with CL=QUORUM
- Manually upgrade nodes one by one to the version from this PR
- Upgrade succeeded, apart from a small number of operations which failed when each node was being put down all reads/writes succeeded
- Successfully altered the scylla-bench table to have a read and write limit and those limits were enforced as expected

Fixes: #4703

Closes #9810

* github.com:scylladb/scylla:
  storage_proxy: metrics for per-partition rate limiting of reads
  storage_proxy: metrics for per-partition rate limiting of writes
  database: add stats for per partition rate limiting
  tests: add per_partition_rate_limit_test
  config: add add_per_partition_rate_limit_extension function for testing
  cf_prop_defs: guard per-partition rate limit with a feature
  query-request: add allow_limit flag
  storage_proxy: add allow rate limit flag to get_read_executor
  storage_proxy: resultize return type of get_read_executor
  storage_proxy: add per partition rate limit info to read RPC
  storage_proxy: add per partition rate limit info to query_result_local(_digest)
  storage_proxy: add allow rate limit flag to mutate/mutate_result
  storage_proxy: add allow rate limit flag to mutate_internal
  storage_proxy: add allow rate limit flag to mutate_begin
  storage_proxy: choose the right per partition rate limit info in write handler
  storage_proxy: resultize return types of write handler creation path
  storage_proxy: add per partition rate limit to mutation_holders
  storage_proxy: add per partition rate limit info to write RPC
  storage_proxy: add per partition rate limit info to mutate_locally
  database: apply per-partition rate limiting for reads/writes
  database: move and rename: classify_query -> classify_request
  schema: add per_partition_rate_limit schema extension
  db: add rate_limiter
  storage_proxy: propagate rate_limit_exception through read RPC
  gms: add TYPED_ERRORS_IN_READ_RPC cluster feature
  storage_proxy: pass rate_limit_exception through write RPC
  replica: add rate_limit_exception and a simple serialization framework
  docs: design doc for per-partition rate limiting
  transport: add rate_limit_error
2022-06-24 01:32:13 +03:00
Botond Dénes
2b0bc11f2e service/paging: use position_in_partition instead of clustering_key for last row
The former allows for expressing more positions, like a position
before/after a clustering key. This practically enables the coordinator
side paging logic, for a query to be stopped at a tombstone (which can
have said positions).
2022-06-23 13:36:20 +03:00
Piotr Dulikowski
a7ad70600d query-request: add allow_limit flag
Adds allow_limit flag to the read_command. The flag decides whether rate
limiting of this operation is allowed.
2022-06-22 20:16:49 +02:00
Piotr Dulikowski
e6beab3106 storage_proxy: add allow rate limit flag to mutate/mutate_result
Now, mutate/mutate_result accept a flag which decides whether the write
should be rate limited or not.

The new parameter is mandatory and all call sites were updated.
2022-06-22 20:16:49 +02:00
Pavel Emelyanov
98a4d41e31 alternator: Get rack/datacenter from topology
It's needed in two places, both can get topology from the proxy's token
metadata.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-22 11:47:26 +03:00
Nadav Har'El
3aca1ca572 alternator: make BatchGetItem group reads by partition
DynamoDB API's BatchGetItem invokes a number (up to 25) of read requests
in parallel, returning when all results are available. Alternator naively
implemented this by sending all read requests in parallel, no matter which
requests these were.

That implementation was inefficient when all the requests are to different
items (clustering rows) of the same partition. In a multi-node setup this
will end up sending 25 separate requests to the same remote node(s). Even
on a single-node setup, this may result in reading from disk more than
once, and even if the partition is cached - doing an O(logN) search in
each multiple times.

What we do in this patch, instead, is to group all the BatchGetItem
requests that aimed at the same partition into a single read request
asking for a (sorted) list of clustering keys. This is similar to an
"IN" request in CQL.

As an example of the performance benefit of this patch, I tried a
BatchGetItem request asking for 20 random items from a 10-million item
partition. I measured the latency of this request on a single-node
Scylla. Before this patch, I saw a latency of 17-21 ms (the lower number
is when the request is retried and the requested items are already in
the cache). After this patch, the latency is 10-14 ms. The performance
improvement on multi-node clusters are expected to be even higher.

Unfortunately the patch is less trivial than I hoped it would be,
because some of the old code was organized under the assumption that
each read request only returned one item (and if it failed, it means
only one item failed), so this part of the code had to be reorganized
(and, for making the code more readable, coroutinized).

An unintended benefit of the code reorganization is that it also gave
me an opportunity to fail an attempt to ask BatchGetItem the same
item more than once (issue #10757).

The patch also adds a few more corner cases in the tests, to be even
more sure that the code reorganization doesn't introduce a regression
in BatchGetItem.

Fixes #10753
Fixes #10757

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-06-19 14:47:57 +03:00
Nadav Har'El
e20233dab1 alternator: improve error handling when trying to tag a GSI or LSI
In issue #10786, we raised the idea of maybe allowing to tag (with
TagResource) GSIs and LSIs, not just base tables. However, currently,
neither DynamoDB nor Syclla allows it. So in this patch we add a
test that confirms this. And while at it, we fix Alternator to
return the same error message as DynamoDB in this case.

Refs #10786.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-06-13 18:14:42 +03:00
Nadav Har'El
8866c326de alternator: forbid duplicate index (LSI and GSI) names
Adding an LSI and GSI with the same name to the same Alternator table
should be forbidden - because if both exists only one of them (the GSI)
would actually be usable. DynamoDB also forbids such duplicate name.

So in this patch we add a test for this issue, and fix it.

Since the patch involves a few more uses of the IndexName string,
we also clean up its handling a bit, to use std::string_view instead
of the old-style std::string&.

Fixes #10789

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-06-13 18:14:42 +03:00