Commit Graph

529 Commits

Author SHA1 Message Date
Piotr Wieczorek
4be0bdbc07 alternator: Don't emit a redundant REMOVE event in Alternator Streams for PutItem calls
Until now, every PutItem operation appeared in the Alternator Streams as
two events - a REMOVE and a MODIFY. DynamoDB Streams emits only INSERT
or MODIFY, depending on whether a row was replaced, or created anew. A
related issue scylladb#6918 concerns distinguishing the mutation type properly.

This was because each call to PutItem emitted the two CDC rows, returned
by GetRecords. Since this patch, we use a collection tombstone for the
`:attrs` column, and a separate tombstone for each regular column in the
table's schema. We don't expect that new tables would have any other
regular column, except for the `:attrs` and keys, but we may encounter
them in in upgraded tables which had old GSIs or LSIs.

Fixes: scylladb#6930.

Closes scylladb/scylladb#24991
2025-09-30 13:12:16 +03:00
Szymon Malewski
6ce7843774 alternator: use expression caching
Before this patch, every expression in Alternator's requests was parsed from string to adequate structure.
This patch enables caching - all calls to parse an expression (all types) are proxied through the cache.
New expression is added to the cache, the least recently used entry (above cache size) is removed.
For existing entries the copy of the template is returned - individual instances still need to be resolved (placeholders substituted with names and values).
The cache is per shard - shared for all operations, expression types, tables, users.
Default cache size is 2000 entries per shard and it has configuration option `alternator_max_expression_cache_entries_per_shard` (0 means cache disabled).
Added Python tests are based on metrics.
2025-09-28 04:27:44 +02:00
Szymon Malewski
65c95d3a93 alternator/expressions: error on parsing empty update expression
With this patch empty update expression is no longer accepted by the parser.
So far it was rejected only after resolving, however it could pollute the expression cache.
2025-09-28 04:06:00 +02:00
Nadav Har'El
f65998cd39 alternator: improve error handling of incorrect ARN
Before this patch, if an ARN that is passed to Alternator requests
like TagResource is well-formatted but points to non-existent table,
Alternator returns the unhelpful error:

  (AccessDeniedException) when calling the TagResource operation:
  Incorrect resource identifier

This patch modifies this error to be:

  (ResourceNotFoundException) when calling the TagResource operation:
  ResourceArn 'arn:scylla:alternator:alternator_alternator_Test_
  1758532308880:scylla:table/alternator_Test_1758532308880x' not found

This is the same error type (ResourceNotFoundException) that DynamoDB
returns in that case - and a more helpful error message.

This patch also includes a regression test that checks the error
type in this case. The new test fails on Alternator before this
patch, and passes afterwards (and also passes on DyanamoDB).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#26179
2025-09-25 11:51:17 +03:00
Botond Dénes
50038ef2cc Merge 'alternator: update references to alternator streams issue' from Michael Litvak
update all the references about the issue of tablets support for
alternator streams to issue https://github.com/scylladb/scylladb/issues/23838 instead of https://github.com/scylladb/scylladb/issues/16317.

The issue https://github.com/scylladb/scylladb/issues/16317 is about support of CDC with tablets, but it is now
closed and it didn't address alternator streams. the remaining issues
about alternator streams should be addressed as part of https://github.com/scylladb/scylladb/issues/23838, so fix
the references in order for them not to be missed.

backport is not needed

Closes scylladb/scylladb#26178

* github.com:scylladb/scylladb:
  test/cqlpy/test_permissions: unskip test for tablets
  alternator: update references to alternator streams issue
2025-09-25 11:05:52 +03:00
Ernest Zaslavsky
5ba5aec1f8 treewide: Move mutation related files to a mutation directory
As requested in #22104, moved the files and fixed other includes and build system.

Moved files:
 - combine.hh
 - collection_mutation.hh
 - collection_mutation.cc
 - converting_mutation_partition_applier.hh
 - converting_mutation_partition_applier.cc
 - counters.hh
 - counters.cc
 - timestamp.hh

Fixes: #22104

This is a cleanup, no need to backport

Closes scylladb/scylladb#25085
2025-09-24 13:23:38 +03:00
Michael Litvak
65351fda29 alternator: update references to alternator streams issue
update all the references about the issue of tablets support for
alternator streams to issue #23838 instead of #16317.

The issue #16317 is about support of CDC with tablets, but it is now
closed and it didn't address alternator streams. the remaining issues
about alternator streams should be addressed as part of #23838, so fix
the references in order for them not to be missed.
2025-09-22 09:56:23 +02:00
Nadav Har'El
7be5454db1 Merge 'alternator: Store LSI keys in :attrs for newly created tables' from Piotr Wieczorek
Previously, LSI keys were stored as separate, top-level columns in the base table. This patch changes this behavior for newly created tables, so that the key columns are stored inside the `:attrs` map. Then, we use top-level computed columns instead of regular ones.

This makes LSI storage consistent with GSIs and allows the use of a collection tombstone on `:attrs` to delete all attributes in a row except for keys in new tables.

Refs https://github.com/scylladb/scylladb/pull/24991
Refs https://github.com/scylladb/scylladb/issues/6930

Closes scylladb/scylladb#25796

* github.com:scylladb/scylladb:
  alternator: Store LSI keys in :attrs for newly created tables
  alternator/test: Add LSI tests based mostly on the existing GSI tests
2025-09-18 21:48:43 +03:00
Radosław Cybulski
c0db278c03 Don't report spurious keys in DescribeTable
Alternator, when creating gsi, adds artificially columns, that user
had not ask for. This patch prevents those columns from showing up in
DescribeTable's output.

Fixes #5320

Closes scylladb/scylladb#25978
2025-09-18 19:34:39 +03:00
Nadav Har'El
3c0032deb4 alternator: fix bug in combination of AttributeUpdates + ReturnValues
In test/alternator/test_returnvalues.py we had tests for the
ReturnValues feature on UpdateItem requests - but we only tested
UpdateItem requests with the "modern" UpdateExpression, and forgot to
test the combination of ReturnValues with the old AttributeUpdates API.

It turns out this combination is buggy: when both ReturnValues=ALL_OLD
and AttributeUpdates need the previous value of the item, we may wrongly
std::move() the value out, and the operation will fail with a strange
error:

    An error occurred (ValidationException) when calling the UpdateItem
    operation: JSON assert failed on condition 'IsObject()'

The fix in this patch is trivial - just move the std::move() to the
correct place, after both UpdateExpression and AttributeUpdates
handling is done.

This patch also includes a reproducing test, which fails before this
patch and passes with it - and of course passes on DynamoDB. This
test reproduces two cases where the bug happened, as well as one
case where it didn't (to make sure we don't regress in what already
worked).

Fixes #25894

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25900
2025-09-17 16:04:01 +03:00
Ernest Zaslavsky
d624413ddd treewide: Move query related files to a new query directory
As requested in #22120, moved the files and fixed other includes and build system.

Moved files:
- query.cc
- query-request.hh
- query-result.hh
- query-result-reader.hh
- query-result-set.cc
- query-result-set.hh
- query-result-writer.hh
- query_id.hh
- query_result_merger.hh

Fixes: #22120

This is a cleanup, no need to backport

Closes scylladb/scylladb#25105
2025-09-16 23:40:47 +03:00
Nadav Har'El
208d3986a7 alternator: add explanation of internal tags
Alternator needs to store a few pieces of information for each table
that it can't store in the existing CQL schema. We decided to store
this information in hidden tags - tags named with the prefix "system:" -
and we already have four of those: Provisioned RCU and WCU, table
creation time, and TTL's expiration-time attribute.

This patch moves the definition of all four tags to one place in
executor.cc, adds a short comment about the content of each tag,
and adds a longer comment explaining why we have these hidden tags
at all.

It is expected that more hidden tags will follow - e.g., to solve
issue #5320. So we expect more tags to be added later in the same
place in the code.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25980
2025-09-15 08:41:39 +03:00
Piotr Wieczorek
db8b7d1c89 alternator: Store LSI keys in :attrs for newly created tables
Before this patch, LSI keys were stored as separate, top-level columns
in the base table. This patch changes this behavior for newly created
tables, so that the key columns are stored inside the `:attrs` map.
Then, in the LSI's materialized view, we create a computed column for
each LSI's range key that is not a key in the base table.

This makes LSI storage consistent with GSIs and allows the use of a
collection tombstone on `:attrs` to delete all fields in a row, except
for keys, in new tables.

Refs https://github.com/scylladb/scylladb/pull/24991
Refs https://github.com/scylladb/scylladb/issues/6930
2025-09-04 15:02:37 +02:00
Michał Jadwiszczak
204f61ffe1 service/migration_manager: pass storage_proxy to prepare_keyspace_drop_announcement()
The reference is needed to get `view_building_state_machine`.
2025-08-27 08:55:47 +02:00
Nadav Har'El
87dd96f9a2 Merge ' Alternator: DynamoDB compatible WCU Calculation via Read-Before-Write Support' from Amnon Heiman
This series adds support for a DynamoDB-compatible Write Capacity Unit (WCU) calculation in Alternator by introducing an optional forced read-before-write mechanism.

Alternator's model differs from DynamoDB, and as a result, some write operations may report lower WCU usage compared to what DynamoDB would report. While this is acceptable in many cases, there are scenarios where users may require accurate WCU reporting that aligns more closely with DynamoDB's behavior.

To address this, a new configuration option, alternator_force_read_before_write, is introduced. When enabled, Alternator will perform a read before executing PutItem, UpdateItem, and DeleteItem operations. This allows it to take the existing item size into account when computing the WCU. BatchWriteItem support is also extended to use this mechanism. Because BatchWriteItem does not support returning old items directly, several internal changes were made to support reading previous item sizes with minimal overhead. Reads are performed at consistency level LOCAL_ONE for efficiency, and the WCU calculation is now done in multiple stages to accurately account for item size differences.

In addition to the implementation changes, test coverage was added to validate the new behavior. These tests confirm that WCU is calculated based on the larger of the old and new items when read-before-write is active, including for BatchWriteItem.

This feature comes with performance overhead and is therefore disabled by default. It can be enabled at runtime via the system.config table and should be used only when precise WCU tracking is necessary.
**New feature, no need to backport**

Closes scylladb/scylladb#24436

* github.com:scylladb/scylladb:
  alternator/test_returnconsumedcapacity.py: Test forced read before write
  alternator/executor.cc: DynamoDB WCU calculation in BatchWriteItem using read-before-write
  executor.cc: get_previous_item with consistency level
  executor: Extend API of put_or_delete_item
  alternator/executor.cc: Accurate WCU for put, update, delete
  config: add alternator_force_read_before_write
2025-08-24 11:38:24 +03:00
Amnon Heiman
ffc7171a5f alternator/executor.cc: DynamoDB WCU calculation in BatchWriteItem using read-before-write
Alternator's internal model differs from DynamoDB, which can result in
lower WCU (Write Capacity Unit) estimates for some operations. While this
is often acceptable, accurate WCU tracking is occasionally required.

This patch enables a DynamoDB like WCU calculation for `BatchWriteItem` when the
`alternator_force_read_before_write` configuration is enabled, by reading
existing items before applying changes.

WCU calculation in `BatchWriteItem` is now performed in three stages:
1. During the initial scan, no calculation is done, just the pointers
   are collected.
2. If read-before-write is enabled, each item is read again and its
   prior size is compared to the new value. WCU is based on the larger
   of the two. The updated size is stored in the mutation building
   array.
3. Regardless of read-before-write, metrics are updated and consumed
   capacity units are returned if requested. This is done in a loop
   before sending the mutations.

For performance, reads in `BatchWriteItem` use consistency level LOCAL_ONE.

These changes increase WCU DynamoDB compatibility in batch operations,
but add overhead and should be enabled only when needed.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-08-13 18:03:55 +03:00
Amnon Heiman
2c02cb394b executor.cc: get_previous_item with consistency level
This patch extends get_previous_item so it can be used to calculate the
size of a previous item. This will allow batch_get_item to obtain the
size of a previous item without needing the item itself.

The patch includes the following changes:

* Removes the unneeded forward declaration of get_previous_item.

* Extends get_previous_item to accept an explicit consistency level.

* Modifies the regular get_previous_item to maintain the same
  functionality while calling the base implementation.

* Adds a get_previous_item_size function that uses the base
  implementation to retrieve the size of a previous item when only the
  size is needed. For performance reasons, get_previous_item_size uses
  consistency level one.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-08-07 12:14:43 +03:00
Nadav Har'El
a896e2dbb9 alternator: add optional support for writing to system table
In commit 44a1daf we added the ability to read system tables through
the DynamoDB API (actually, the Scan and Query requests only).
This ability is useful for tests, and can also be useful to users who
want to read information that is only available through system tables.

This patch adds support also for *writing* into system tables. This will
be useful for Alternator tests, were we want to temporarily change
some live-updatable configuration option - and so far haven't been
able to do that like we did do in some cql-pytest tests.

For reasons explained in issue #23218, only superuser roles are allowed to
write to system tables - it is not enough for the role to be granted
MODIFY permissions on the system table or on ALL KEYSPACES. Moreover,
the ability to modify system tables carries special risks, so this
patch only allows writes to the system tables if a new configuration
option "alternator_allow_system_table_write" turned on. This option is
turned off by default.

This patch also includes a test for this new configuration-writing
capability. The test scripts test/alternator/run and test.py now
run Scylla with alternator_allow_system_table_write turned on, but
the new test can also run without this option, and will be skipped
in that case (to allow running the test suite against some manually-
run instance of Scylla).

Fixes: #12348

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-08-06 10:00:04 +03:00
Amnon Heiman
5abbfa1e52 executor: Extend API of put_or_delete_item
This patch adds two methods to put_or_delete_item that will be used in
WCU batch calculation:

A setter to set the item length.

A boolean getter that determines if this is a delete action.
2025-08-05 10:23:36 +03:00
Amnon Heiman
f75aa1b35e alternator/executor.cc: Accurate WCU for put, update, delete
To improve the accuracy of Write Capacity Unit (WCU) calculation,
this patch introduces the use of the `alternator_force_read_before_write`
configuration option. When enabled, it forces a read-before-write
operation on `PutItem`, `UpdateItem`, and `DeleteItem` requests.

This comes with performance overhead and should be used with caution,
especially in high-throughput environments.
2025-08-05 10:01:32 +03:00
Avi Kivity
ee138217ba alternator: simplify std::views::transform calls that extract a member from a class
Rather than calling std::views::transform with a lambda that extracts
a member from a class, call std::views::transform with a pointer-to-member
to do the same thing. This results in more concise code.

Closes scylladb/scylladb#25012
2025-07-22 12:39:01 +02:00
Botond Dénes
fd6877c654 Merge 'alternator: avoid oversized allocation in Query/Scan' from Nadav Har'El
This series fixes one cause of oversized allocations - and therefore potentially stalls and increased tail latencies - in Alternator.

The first patch in the series is the main fix - the later patches are cleanups requested by reviewers but also involved other pre-existing code, so I did those cleanups as separate patches.

Alternator's Scan or Query operation return a page of results. When the number of items is not limited by a "Limit" parameter, the default is to return a 1 MB page. If items are short, a large number of them can fit in that 1MB. The test test_query.py::test_query_large_page_small_rows has 30,000 items returned in a single page.

In the response JSON, all these items are returned in a single array "Items". Before this patch, we build the full response as a RapidJSON object before sending it. The problem is that unfortunately, RapidJSON stores arrays as contiguous allocations. This results in large contiguous allocations in workloads that scan many small items, and large contiguous allocations can also cause stalls and high tail latencies. For example, before this patch, running

    test/alternator/run --runveryslow \
        test_query.py::test_query_large_page_small_rows

reports in the log:

    oversized allocation: 573440 bytes.

After this patch, this warning no longer appears.
The patch solves the problem by collecting the scanned items not in a RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e, a chunked (non-contiguous) array of items (each a JSON value). After collecting this array separately from the response object, we need to print its content without actually inserting it into the object - we add a new function print_with_extra_array() to do that.

The new separate-chunked-vector technique is used when a large number (currently, >256) of items were scanned. When there is a smaller number of items in a page (this is typical when each item is longer), we just insert those items in the object and print it as before.

Beyond the original slow test that demonstrated the oversized allocation (which is now gone), this patch also includes a new test which exercises the new code with a scan of 700 (>256) items in a page - but this new test is fast enough to be permanently in our test suite and not a manual "veryslow" test as the other test.

Fixes #23535

The stalls caused by large allocations was seen by actual users, so it makes sense to backport this patch. On the other hand, the patch while not big is fairly intrusive (modifies the nomal Scan and Query path and also the later patches do some cleanup of additional code) so there is some small risk involved in the backport.

Closes scylladb/scylladb#24480

* github.com:scylladb/scylladb:
  alternator: clean up by co-routinizing
  alternator: avoid spamming the log when failing to write response
  alternator: clean up and simplify request_return_type
  alternator: avoid oversized allocation in Query/Scan
2025-07-17 11:30:40 +03:00
Botond Dénes
2d3965c76e Merge 'Reduce Alternator table name length limit to 192 and fix crash when adding stream to table with very long name' from Nadav Har'El
Before this series, it is possible to crash Scylla (due to an I/O error) by creating an Alternator table close to the maximum name length of 222, and then enabling Alternator Streams. This series fixes this bug in two ways:

1. On a pre-existing table whose name might be up to 222 characters, enabling Streams will check if the resulting name is too long, and if it is, fail with a clear error instead of crashing. This case will effect pre-existing tables whose name has between 207 and 222 characters (207 is `222 - strlen("_scylla_cdc_log")`) - for such tables enabling Streams will fail, but no longer crash.
2. For new tables, the table name length limit is lowered from 222 to 192. The new limit is still high enough, but ensures it will be possible to enable streams any new table. It will also always be possible to add a GSI for such a table with name up to 29 characters (if the table name is shorter, the GSI name can be longer - the sum can be up to 221 characters).

No need to backport, Alternator Streams is still an experimental feature and this patch just improves the unlikely situation of extremely long table names.

Fixes #24598

Closes scylladb/scylladb#24717

* github.com:scylladb/scylladb:
  alternator: lower maximum table name length to 192
  alternator: don't crash when adding Streams to long table name
  alternator: split length limit for regular and auxiliary tables
  alternator: avoid needlessly validating table name
2025-07-15 06:57:04 +03:00
Nadav Har'El
a248336e66 alternator: clean up by co-routinizing
Reviewers of the previous patch complained on some ugly pre-existing
code in alternator/executor.cc, where returning from an asynchronous
(future) function require lengthy verbose casts. So this patch cleans
up a few instances of these ugly casts by using co_return instead of
return.

For example, the long and verbose

    return make_ready_future<executor::request_return_type>(
        rjson::print(std::move(response)));

can be changed to the shorter and more readable

    co_return rjson::print(std::move(response));

This patch should not have any functional implications, and also not any
performance implications: I only coroutinized slow-path functions and
one function that was already "partially" coroutinized (and this was
expecially ugly and deserved being fixed).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-07-14 18:41:35 +03:00
Nadav Har'El
13ec94107a alternator: avoid spamming the log when failing to write response
Both make_streamed() and new make_streamed_with_extra_array() functions,
used when returning a long response in Alternator, would write an error-
level log message if it failed to write the response. This log message
is probably not helpful, and may spam the log if the application causes
repeated errors intentionally or accidentally.

So drop these log messages. The exception is still thrown as usual.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-07-14 18:41:34 +03:00
Nadav Har'El
d8fab2a01a alternator: clean up and simplify request_return_type
The previous patch introduced a function make_streamed_with_extra_array
which was a duplicate of the existing make_streamed. Reviewers
complained how baroque the new function is (just like the old function),
having to jump through hoops to return a copyable function working
on non-copyable objects, making strange-named copies and shared pointers
of everything.

We needed to return a copyable function (std::function) just because
Alternator used Seastar's json::json_return_type in the return type
from executor function (request_return_type). This json_return_type
contained either a sstring or an std::function, but neither was ever
really appropriate:

  1. We want to return noncopyable_function, not an std::function!
  2. We want to return an std::string (which rjson::print()) returns,
     not an sstring!

So in this patch we stop using seastar::json::json_return_type
entirely in Alternator.

Alternator's request_return_type is now an std::variant of *three* types:
  1. std::string for short responses,
  2. noncopyable_function for long streamed response
  3. api_error for errors.

The ugliest parts of make_streamed() where we made copies and shared
pointers to allow for a copyable function are all gone. Even nicer, a
lot of other ugly relics of using seastar::json_return_type are gone:

1. We no longer need obscure classes and functions like make_jsonable()
   and json_string() to convert strings to response bodies - an operation
   can simply return a string directly - usually returning
   rjson::print(value) or a fixed string like "" and it just works.

2. There is no more usage of seastar::json in Alternator (except one
   minor use of seastar::json::formatter::to_json in streams.cc that
   can be removed later). Alternator uses RapidJSON for its JSON
   needs, we don't need to use random pieces from a different JSON
   library.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-07-14 18:41:34 +03:00
Nadav Har'El
2385fba4b6 alternator: avoid oversized allocation in Query/Scan
This patch fixes one cause of oversized allocations - and therefore
potentially stalls and increased tail latencies - in Alternator.

Alternator's Scan or Query operation return a page of results. When the
number of items is not limited by a "Limit" parameter, the default is
to return a 1 MB page. If items are short, a large number of them can
fit in that 1MB. The test test_query.py::test_query_large_page_small_rows
has 30,000 items returned in a single page.

In the response JSON, all these items are returned in a single array
"Items". Before this patch, we build the full response as a RapidJSON
object before sending it. The problem is that unfortunately, RapidJSON
stores arrays as contiguous allocations. This results in large
contiguous allocations in workloads that scan many small items, and
large contiguous allocations can also cause stalls and high tail
latencies. For example, before this patch, running

    test/alternator/run --runveryslow \
        test_query.py::test_query_large_page_small_rows

reports in the log:

    oversized allocation: 573440 bytes.

After this patch, this warning no longer appears.
The patch solves the problem by collecting the scanned items not in a
RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e,
a chunked (non-contiguous) array of items (each a JSON value).
After collecting this array separately from the response object, we
need to print its content without actually inserting it into the object -
we add a new function print_with_extra_array() to do that.

The new separate-chunked-vector technique is used when a large number
(currently, >256) of items were scanned. When there is a smaller number
of items in a page (this is typical when each item is longer), we just
insert those items in the object and print it as before.

Beyond the original slow test that demonstrated the oversized allocation
(which is now gone), this patch also includes a new test which
exercises the new code with a scan of 700 (>256) items in a page -
but this new test is fast enough to be permanently in our test suite
and not a manual "veryslow" test as the other test.

Fixes #23535
2025-07-14 18:41:34 +03:00
Benny Halevy
3feb759943 everywhere: use utils::chunked_vector for list of mutations
Currently, we use std::vector<*mutation> to keep
a list of mutations for processing.
This can lead to large allocation, e.g. when the vector
size is a function of the number of tables.

Use a chunked vector instead to prevent oversized allocations.

`perf-simple-query --smp 1` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (read path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...

89055.97 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   18003 cycles/op,        0 errors)
103372.72 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39380 insns/op,   17300 cycles/op,        0 errors)
98942.27 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39413 insns/op,   17336 cycles/op,        0 errors)
103752.93 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39407 insns/op,   17252 cycles/op,        0 errors)
102516.77 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39403 insns/op,   17288 cycles/op,        0 errors)
throughput:
	mean=   99528.13 standard-deviation=6155.71
	median= 102516.77 median-absolute-deviation=3844.59
	maximum=103752.93 minimum=89055.97
instructions_per_op:
	mean=   39403.99 standard-deviation=14.25
	median= 39406.75 median-absolute-deviation=9.30
	maximum=39416.63 minimum=39380.39
cpu_cycles_per_op:
	mean=   17435.81 standard-deviation=318.24
	median= 17300.40 median-absolute-deviation=147.59
	maximum=18002.53 minimum=17251.75
```

After (read path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
59755.04 tps ( 66.2 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39466 insns/op,   22834 cycles/op,        0 errors)
71854.16 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   17883 cycles/op,        0 errors)
82149.45 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39411 insns/op,   17409 cycles/op,        0 errors)
49640.04 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   19975 cycles/op,        0 errors)
54963.22 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   18235 cycles/op,        0 errors)
throughput:
	mean=   63672.38 standard-deviation=13195.12
	median= 59755.04 median-absolute-deviation=8709.16
	maximum=82149.45 minimum=49640.04
instructions_per_op:
	mean=   39448.38 standard-deviation=31.60
	median= 39466.17 median-absolute-deviation=25.75
	maximum=39474.12 minimum=39411.42
cpu_cycles_per_op:
	mean=   19267.01 standard-deviation=2217.03
	median= 18234.80 median-absolute-deviation=1384.25
	maximum=22834.26 minimum=17408.67
```

`perf-simple-query --smp 1 --write` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (write path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
63736.96 tps ( 59.4 allocs/op,  16.4 logallocs/op,  14.3 tasks/op,   49667 insns/op,   19924 cycles/op,        0 errors)
64109.41 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   49992 insns/op,   20084 cycles/op,        0 errors)
56950.47 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50005 insns/op,   20501 cycles/op,        0 errors)
44858.42 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50014 insns/op,   21947 cycles/op,        0 errors)
28592.87 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50027 insns/op,   27659 cycles/op,        0 errors)
throughput:
	mean=   51649.63 standard-deviation=15059.74
	median= 56950.47 median-absolute-deviation=12087.33
	maximum=64109.41 minimum=28592.87
instructions_per_op:
	mean=   49941.18 standard-deviation=153.76
	median= 50005.24 median-absolute-deviation=73.01
	maximum=50027.07 minimum=49667.05
cpu_cycles_per_op:
	mean=   22023.01 standard-deviation=3249.92
	median= 20500.74 median-absolute-deviation=1938.76
	maximum=27658.75 minimum=19924.32
```

After (write path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
53395.93 tps ( 59.4 allocs/op,  16.5 logallocs/op,  14.3 tasks/op,   50326 insns/op,   21252 cycles/op,        0 errors)
46527.83 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50704 insns/op,   21555 cycles/op,        0 errors)
55846.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50731 insns/op,   21060 cycles/op,        0 errors)
55669.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50735 insns/op,   21521 cycles/op,        0 errors)
52130.17 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50757 insns/op,   21334 cycles/op,        0 errors)
throughput:
	mean=   52713.91 standard-deviation=3795.38
	median= 53395.93 median-absolute-deviation=2955.40
	maximum=55846.30 minimum=46527.83
instructions_per_op:
	mean=   50650.57 standard-deviation=182.46
	median= 50731.38 median-absolute-deviation=84.09
	maximum=50756.62 minimum=50325.87
cpu_cycles_per_op:
	mean=   21344.42 standard-deviation=202.86
	median= 21334.00 median-absolute-deviation=176.37
	maximum=21554.61 minimum=21060.24
```

Fixes #24815

Improvement for rare corner cases. No backport required

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#24919
2025-07-13 19:13:11 +03:00
Nadav Har'El
18b6c4d3c5 alternator: lower maximum table name length to 192
Currently, Alternator allows creating a table with a name up to 222
(max_table_name_length) characters in length. But if you do create
a table with such a long name, you can have some difficulties later:
You you will not be able to add Streams or GSI or LSI to that table,
because 222 is also the absolute maximum length Scylla tables can have
and the auxilliary tables we want to create (CDC log, materialized views)
will go over this absolute limit (max_auxiliary_table_name_length).

This is not nice. DynamoDB users assume that after successfully
creating a table, they can later - perhaps much later - decide to
add Streams or GSI to it, and today if they chose extremely long
names, they won't be able to do this.

So in this patch, we lower max_table_name_length from 222 to 192.
A user will not be able to create tables with longer names, but
the good news is that once successfully creating a table, it will
always be possible to enable Streams on it (the CDC log table has an
extra 15 bytes in its name, and 192 + 15 is less than 222), and it
will be possible to add GSIs with short enough names (if the GSI
name is 29 or less, 192 + 29 + 1 = 222).

This patch is a trivial one-line code change, but also includes the
corrected documentation of the limits, and a fix for one test that
previously checked that a table name with length 222 was allowed -
and now needs to check 192 because 222 is no longer allowed.

Note that if a user has existing tables and upgrades Scylla, it
is possible that some pre-existing Alternator tables might have
lengths over 192 (up to 222). This is fine - in the previous patches
we made sure that even in this case, all operations will still work
correctly on these old tables (by not not validating the name!), and
we also made sure that attempting to enable Streams may fail when
the name is too long (we do not remove those old checks in this patch,
and don't plan to remove them in the forseeable future).

Note that the limit we chose - 192 characters - is identical to the
table name limit we recently chose in CQL. It's nicer that we don't
need to memorize two different limits for Alternator and CQL.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-07-07 11:58:21 +03:00
Nadav Har'El
3ed8e269f9 alternator: don't crash when adding Streams to long table name
Currently, in Alternator it is possible to create a table whose name has
222 characters, and then trying to add Streams to that table results in
an attempt to create a CDC log table with the same name plus a
15-character suffix "_scylla_cdc_log", which resulted (Ref #24598) in
an IO-error and a Scylla shutdown.

This patch adds code to the Stream-adding operations (both CreateTable
and UpdateTable) that validates that the table's name, plus that 15
character suffix, doesn't exceed max_auxiliary_table_name_length, i.e.,
222.

After this patch, if you have a table whose name is between 207 and 222
characters, attempting to enable Streams on it will fail with:

 "Streams cannot be added if the table name is longer than 207 characters."

Note that in the future, if we lower max_table_name_length to below 207,
e.g., to 192, then it will always be possible to add a stream to any
legal table, and the new checks we had here will be mostly redundant.
But only "mostly" - not entirely: Checking in UpdateTable is still
important because of the possibility that an upgrading user might have
a pre-existing table whose name is longer than the new limit, and might
try to enable Streams.

After this patch, the crash reported in #24598 can no longer happen, so
in this sense the bug is solved. However, we still want to lower
max_table_name_length from 222 to 192, so that it will always be
possible to enable streams on any table with a legal name length.
We'll do this in the next patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-07-07 11:58:13 +03:00
Nadav Har'El
898665ca38 alternator: split length limit for regular and auxiliary tables
Alternator has a constant, max_table_name_length=222, which is currently
used for two different things:

1. Limiting the length of the name allowed for Alternator table.
2. Limiting the length of some auxiliary tables the user is not aware
   of, such as a materialized view (whose name is tablename:indexname)
   or (in the next patch) CDC log table.

In principle, there is no reason why these two limits need to be identical -
we could lower the table name limit to, say, 192, but still allow the
tablename:indexname to be even longer, up to 222 - i.e., allow creating
materialized views even on tables whose name has 192 characters.

So in this patch we split this variable into two, max_table_name_length
and max_auxiliary_table_name_length. At the moment, both are still set
to the same value - 222. In a following patch we plan to lower
max_table_name_length but leave max_auxiliary_table_name_length at 222.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-07-07 11:43:49 +03:00
Nadav Har'El
09aa062ab6 alternator: avoid needlessly validating table name
In commit d8c3b144cb we fixed #12538:
That issue noted that most requests which take a TableName don't need
to "validate" the table's name (check that it has allowed characters
and length) if the table is found in the schema. We only need to do
this validation on CreateTable, or when the table is *not* found
(because in that case, DynamoDB chose to print a validation error
instead of table-not-found error).

It turns out that the fix missed a couple of places where the name
validation was unnecessary, so this patch fixes those remaining places.

The original motivation for fixing was #12538 was performance, so
it focused just one cheap common requests. But now, we want to be sure
we fixed *all* requests, because of a new motivation:

We are considering, due to #24598, to lower the maximum allowed table
name length. However, when we'll do that, we'll want the new lower
length limit to not apply to already existing tables. For example,
it should be possible to delete a pre-existing table with DeleteTable,
if it exists, without the command complaining that the name of this table
is too long. So it's important to make sure that the table's name is
only validated in CreateTable or if the table does not exist.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-07-07 10:05:43 +03:00
Tomasz Grabiec
97679002ee Merge 'Co-locate tablets of different tables' from Michael Litvak
Add the option to co-locate tablets of different tables. For example, a base table and its CDC table, or a local index.

main changes and ideas:
* "table group" - a set of one or more tables that should be co-located. (Example: base table and CDC table). A group consists of one base table and zero or more children tables.
* new column `base_table` in `system.tablets`: when creating a new table, it can be set to point to a base table, which the new table's tablets will be co-located with. when it's set, the tablet map information should be retrieved from the base table map. the child map doesn't contain per-tablet information.
* co-located tables always have the same tablet count and the same tablet replicas. each tablet operation - migration, resize, repair - is applied on all tablets in a synchronized manner by the topology coordinator.
* resize decision for a group is made by combining the per-table hints and comparing the average tablet size (over all tablets in the group) with the target tablet size.
* the tablets load balancer works with the base table as a representative of the group. it represents a single migration unit with some `group_size` that is taken into account.
* view tablets are co-located with base tablets when the partition keys match.

Fixes https://github.com/scylladb/scylladb/issues/17043

backport is not needed. this is preliminary work for support of MVs and CDC with tablets.

Closes scylladb/scylladb#22906

* github.com:scylladb/scylladb:
  tablets: validate no clustering row mutations on co-located tables
  raft_group0_client: extend validate_change to mixed_change type
  docs: topology-over-raft: document co-located tables
  tablet-mon.py: visual indication for co-located tablets
  tablet-mon.py: handle co-located tablets
  test/boost/view_schema_test.cc: fix race in wait_until_built
  boost/tablets_test: test load balancing and resize of co-located tablets
  test/tablets: test tablets colocation
  tablets: co-locate view tablets with base when the partition keys match
  test/pylib/tablets: common get_tablet_count api
  test_mv_tablets: use get_tablet_replicas from common tablets api
  test/pylib/tablets: fix test api to read tablet replicas from base table
  tablets: allocator: create co-located tables in a single operation
  alternator: prepare all new tables in a single announcement
  migration_manager: add notification for creating multiple tables
  tablets: read_tablet_transition_stage: read from base table
  storage service: allow repair request only on base tables
  tablets: keyspace_rf_change: apply on base table
  storage service: generate tablet migration updates on base tables
  tablets: replace all_tables method
  tablets: split when all co-located tablets are ready
  tablets: load balancer: sizing plan for table groups
  tablets: load balancer: handle co-located tablets
  tablets: allocate co-located tablets
  tablets: handle migration of co-located tablets
  storage service: add repair colocated tablets rpc
  tablets: save and read tablet metadata of co-located tables
  tablets: represent co-located tables in tablet metadata
  tablets: add base_table column to system.tablets
  docs: update system.tablets schema
2025-07-01 16:02:30 +02:00
Michael Litvak
2d0ec1c20a alternator: prepare all new tables in a single announcement
When creating base and view tables in alternator, they are created in a
single operation, so use a single announcement for creating multiple
tables in a single operation instead of announcing each table
separately.

This is needed because when we create base tables and local indexes we
need to make them co-located, so we need to allocate tablets for them
together.
2025-07-01 13:20:18 +03:00
Nadav Har'El
e12ff4d3ab Merge 'LWT: use tablet_metadata_guard' from Petr Gusev
This PR is a step towards enabling LWT for tablet-based tables.

It pursues several goals:
* Make it explicit that the tablet can't migrate after the `cas_shard` check in `selec_statement/modification_statement`. Currently, `storage_proxy::cas` expects that the client calls it on a correct shard -- the one which owns the partition key the LWT is running on. There reasons for that are explained in [this commit](f16e3b0491 (diff-1073ea9ce4c5e00bb6eb614154f523ba7962403a4fe6c8cd877d1c8b73b3f649)) message. The statements check the current shard and invokes `bounce_to_shard` if it's not the right one. However , the erm strong pointer is only captured in `storage_proxy::cas` and until that moment there is no explicit structure in the code which would prevent the ongoing migrations. In this PR we introduce such stucture -- `erm_handle`. We create it before the `cas_check` and pass it down to `storage_proxy::cas` and `paxos_response_handler`.
* Another goal of this PR is an optimization -- we don't want to hold erm for the duration of entire LWT, unless it directly affects the current tablet. The is a `tablet_metadata_guard` class which is used for long running tablet operations. It automatically switches to a new erm if the topology change represented by the new erm doesn't affect the current tablet. We use this class in `erm_handle` if the table uses tablets. Otherwise, `erm_handle` just stores erm directly.
* Fixes [shard bouncing issue in alternator](https://github.com/scylladb/scylladb/issues/17399)

Backport: not needed (new feature).

Closes scylladb/scylladb#24495

* github.com:scylladb/scylladb:
  LWT: make cas_shard non-optional in sp::cas
  LWT: create cas_shard in select_statement
  LWT: create cas_shard in modification and batch statements
  LWT: create cas_shard in alternator
  LWT: use cas_shard in storage_proxy::cas
  do_query_with_paxos: remove redundant cas_shard check
  storage_proxy: add cas_shard class
  sp::cas_shard: rename to get_cas_shard
  token_metadata_guard: a topology guard for a token
  tablet_metadata_guard: mark as noncopyable and nonmoveable
2025-07-01 11:33:20 +03:00
Petr Gusev
7e64852bfd LWT: create cas_shard in alternator
We create cas_shard instance in shard_for_execute(). This implies that
the decision about the correct shard was made using the specific
token_metadata_guard, and it remains valid only as long as the guard
is held.

When forwarding a request to another shard, we keep the original
cas_shard alive. This ensures that the target shard
remains a valid owner for the given token.

Fixes scylladb/scylladb#17399
2025-06-30 10:37:33 +02:00
Petr Gusev
deb7afbc87 LWT: use cas_shard in storage_proxy::cas
Take cas_shard parameter in sp::cas and pass token_metadata_guard down to paxos_response_handler.

We make cas_shard parameter optional in storage_proxy methods
to make the refactoring easier. The sp::cas method constructs a new
token_metadata_guard if it's not set. All call sites pass null
in this commit, we will add the proper implementation in the next
commits.
2025-06-30 10:33:17 +02:00
Nadav Har'El
0ce0b2934f alternator: improve, document and test table/index name lengths
Whereas DynamoDB limits the names of tables, LSIs and GSIs to 255
characters each, Alternator currently has different (and lower)
limitations:
 1. A table name must be up to 222 characters.
 2. For a GSI, the sum of the table's and GSI's name length, plus 1,
    must be up to 222 characters.
 3. For an LSI, the sum of the table's and LSI's name length, plus 2,
    must be up to 222 characters.

These specific limitations were never documented, so in this patch we
add this information to docs/alternator/compatibility.md.

Moreover, these limitations where only partially tested, so in this patch
we add testing for more cases that we forgot to check - such as length
of LSI names (only GSI were checked before this patch), or adding a
GSI to an existing table. It is important to check all these corner
cases because there is a risk that if we attempt to create a table
without checking its length, we can end up with an I/O error that brings
down Scylla.

In one case - UpdateTable adding a GSI to an existing table - the new
test exposed a trivial bug: Because UpdateTable wants to verify the new
GSI doesn't have the same name as an existing LSI, it mistakenly applied
the LSI's length name limit instead of the GSI's name length limit,
which is one byte less than it should be. So this patch fixes this
trivial bug as well.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-06-29 11:40:55 +03:00
Petr Gusev
aa970bf2e4 sp::cas_shard: rename to get_cas_shard
We intend to introduce a separate cas_shard
class in the next commits. We rename the existing
function here to avoid conflicts.
2025-06-18 11:51:48 +02:00
Wojciech Mitros
5eb4466789 Return correct creation date time in describe table
Add system:table_creation_time tag with value - timestamp in milliseconds of creation table.
If the tag is present, it will used to fill creation timestamp value (when CreateTable or DescribeTable is called).
If the tag is missing, value 0 for timestamp will be substituted (in other words table was created on 1th january of 1970).
Update test to change how we make sure timestamp is actually used - we create two tables one after another and make sure their creation timestamp is in correct order.
Update tests, that work with tags to filter system tags out.

Fixes #5013

Closes scylladb/scylladb#24007
2025-06-10 15:25:57 +03:00
Amnon Heiman
3ad7a24eee alternator/executor.cc: Update per-table metrics
This patch adds support for updating per-table metrics. It introduces a
helper function that retrieves the stats object from a table schema.
The code uses a lw_shared_ptr for the stats object to ensure safe updates
even if the table holding it has been deleted.

There is some duplication in the updated code, as both per-shard and
per-table metrics are updated.

The rmw_operation::execute function now accepts two stats objects: one
for the global metrics and one for the per-table metrics.  The use of
execute was also modified—rather than modifying the WCU directly, a
parameter is used so both global and per-table stats can be updated.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-06-05 15:12:13 +03:00
Amnon Heiman
d6afd42342 alternator/stats: Add per-table metrics
This patch allows registering a stats object per table. The per-table
stats object needs its metrics registry to be part of the table's
lifecycle, but there could be a scenario in which a table is already
deleted while some Alternator operations are still in progress.  To
handle this, the patch separates the registry from the metrics holder.
It is safe to modify a parameter that is not registered.

Metrics registration is performed via functions instead of the
constructor.

The registration accepts a keyspace and table name as parameters.

The per-table metrics use an alternator_table prefix to distinguish them
from their per-shard equivalents.

The metrics are aggregated and reported once per node.  Metrics that do
not make sense to report per table (such as create and delete) are not
registered. All metrics are marked with skip_when_empty.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-06-05 14:44:03 +03:00
Nadav Har'El
6cbcabd100 alternator: hide internal tags from users
The "tags" mechanism in Alternator is a convenient way to attach metadata
to Alternator tables. Recently we have started using it more and more for
internal metadata storage:

  * UpdateTimeToLive stores the attribute in a tag system:ttl_attribute
  * CreateTable stores provisioned throughput in tags
    system:provisioned_rcu and system:provisioned_wcu
  * CreateTable stores the table's creation time in a tag called
    system:table_creation_time.

We do not want any of these internal tags to be visible to a
ListTagsOfResource request, because if they are visible (as before this
patch), systems such as Terraform can get confused when they suddenly
see a tag which they didn't set - and may even attempt to delete it
(as reported in issue #24098).

Moreover, we don't want any of these internal tags to be writable
with TagResource or UntagResource: If a user wants to change the TTL
setting they should do it via UpdateTimeToLive - not by writing
directly to tags.

So in this patch we forbid read or write to *any* tag that begins
with the "system:" prefix, except one: "system:write_isolation".
That tag is deliberately intended to be writable by the user, as
a configuration mechanism, and is never created internally by
Scylla. We should have perhaps chosen a different prefix for
configurable vs. internal tags, or chosen more unique prefixes -
but let's not change these historic names now.

This patch also adds regression tests for the internal tags features,
failing before this patch and passing after:
1. internal tags, specifically system:ttl_attribute, are not visible
   in ListTagsOfResource, and cannot be modified by TagResource or
   UntagResource.
2. system:write_isolation is not internal, and be written by either
   TagResource or UntagResource, and read with ListTagsOfResource.

This patch also fixes a bug in the test where we added more checks
for system:write_isolation - test_tag_resource_write_isolation_values.
This test forgot to remove the system:write_isolation tags from
test_table when it ended, which would lead to other tests that run
later to run with a non-default write isolation - something which we
never intended.

Fixes #24098.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#24299
2025-06-03 20:40:50 +03:00
Szymon Malewski
18d237a393 alternator/executor: Added checks in batch_write_item
This patch adds checks validating 'BatchWriteItem' requests mostly to avoid ugly fallback message.
It changes request's behaviour in case of an empty array of WriteRequests - previously such an array was ignored and whole request might succeed, now it raises ValidationException, following the documentation and behaviour of DynamoDB.
Patch includes tests in test_manual_requests (`test_batch_write_item_invalid_payload`, `test_batch_write_item_empty_request_list`) testing with several offending cases.

Fixes #23233

Closes scylladb/scylladb#23878
2025-05-29 20:33:57 +03:00
Nadav Har'El
252c5b5c9d Merge 'Alternator batch_write_item wcu' from Amnon Heiman
This series adds support for WCU tracking in batch_write_item and tests it.

The patches include:

Switch the metrics (RCU and WCU) to count units vs half-units as they were, to make the metrics clearer for users.

Adding a public static get_half_units function to wcu_consumed_capacity_counter for use by batch write item, which cannot directly use the counter object.

Adding WCU calculation support to batch_write_item, based on item size for puts and a fixed 1 WCU for deletes. WCU metrics are updated, and consumed capacity is returned per table when requested.

The return handling was refactored to be coroutine-like for easier management of the consumed capacity array.

Adding tests that validate WCU calculation for batch put requests on a single table and across multiple tables, ensuring delete operations are counted correctly.

Adding a test that validates that WCU metrics are updated correctly during batch write item operations, ensuring the WCU of each item is calculated independently.

**Need backport, WCU is partially supported, and is missing from batch_write_item**

Fixes #23940

Closes scylladb/scylladb#23941

* github.com:scylladb/scylladb:
  alternator/test_metrics.py: batch_write validate WCU
  alternator/test_returnconsumedcapacity.py: Add tests for batch write WCU
  alternator/executor: add WCU for batch_write_items
  alternator/consumed_capacity: make wcu get_units public
  Alternator: Change the WCU/RCU to use units
2025-05-06 13:31:53 +03:00
Amnon Heiman
68db77643f alternator/executor: add WCU for batch_write_items
This patch adds consumed capacity unit support to batch_write_item.

It calculates the WCU based on an item's length (for put) or a static 1
WCU (for delete), for each item on each table.

The WCU metrics are always updated. if the user requests consumed
capacity, a vector of consumed capacity is returned with an entry for
each of the tables.

For code simplicity, the return part of batch_write_item was updated to
be coroutine-like; this makes it easier to manage the life cycle of the
returned consumed_capacity array.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-05-05 13:20:14 +03:00
Amnon Heiman
5ae11746fa Alternator: Change the WCU/RCU to use units
This patch changes the RCU/WCU Alternator metrics to use whole units
instead of half units. The change includes the following:

Change the metrics documentation. Keep the RCU counter internally in
half units, but return the actual (whole unit) value.
Change the RCU name to be rcu_half_units_total to indicates that it
counts half units.
Change the WCU to count in whole units instead of half units.

Update the tests accordingly.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-05-05 13:18:09 +03:00
Nadav Har'El
3ce7e250cc alternator: fix schema "concurrent modification" errors
In ScyllaDB, schema modification operations use "optimistic locking":
A schema operation reads the current schema, decides what it wants to do
and prepares changes to the schema, and then attempts to commit those
changes - but only if the schema hasn't changed since the first read.
If the schema has already been changed by some other node - we need to
try again. In a loop.

In Alternator, there are six operations that perform schema modification:
CreateTable, DeleteTable, UpdateTable, TagResource, UntagResource and
UpdateTimeToLive. All of them were missing this loop. We knew about
this - and even had FIXME in all places. So all these operations,
when facing contention of concurrent schema modifications on different
nodes may fail one of these operations with an error like:

   Internal server error: service::group0_concurrent_modification
   (Failed to apply group 0 change due to concurrent modification).

This problem had very minor effect, if any, on real users because the
DynamoDB SDK automatically retries operations that fail with retryable
errors - like this "Internal server error" - and most likely the schema
operation will succeed upon retry. However, as shown in issue #13152
these failures were annoying in our CI, where tests - which disable
request retries - failed on these errors.

This patch fixes all six operations (the last three operations all
use one common function, db::modify_tags(), so are fixed by one
change) to add the missing loop.

The patch also includes reproducing tests for all these operations -
the new tests all fail before this patch, and pass with it.

These new tests are much more reliable reproducers than the dtests
we had that only sometimes - very rarely - reproduced the problem.
Moreover, the new tests reproduces the bug seperately for each of the
six operations, so if we forget to fix one of the six operations, one
of the tests would have continued to fail. Of course I checked this
during development.

The new tests are in the test/cluster framework, not test/alternator,
because this problem can only be reproduced in a multi-node cluster:
On a single node, it serializes its schema modifications on its own;
The collisions only happen when more than one node attempts schema
modifications at the same time.

Fixes #13152

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23827
2025-05-05 09:59:08 +03:00
Nadav Har'El
262530f27c Merge 'mv: make base_info in view schemas immutable' from Wojciech Mitros
Currently, the base_info may or may not be set in view schemas.
Even when it's set, it may be modified. This necessitates extra
checks when handling view schemas, as we'll as potentially causing
errors when we forget to set it at some point.

Instead, we want to make the base info an immutable member of view
schemas (inside view_info). To achieve this, in this series we remove
all base_info members that can change due to a base schema update,
and we calculate the remaining values during view update generation,
using the most up-to-date base schema version.

To calculate the values that depend on the base schema version, we
need to iterate over the view primary key and find the corresponding
columns, which adds extra overhead for each batch of view updates.
However, this overhead should be relatively small, as when creating
a view update, we need to prepare each of its columns anyway. And
if we need to read the old value of the base row, the relative
overhead is even lower.

After this change, the base info in view schemas stays the same
for all base schema updates, so we'll no longer get issues with
base_info being incompatible with a base schema version. Additionally,
it's a step towards making the schema objects immutable, which
we sometimes incorrectly assumed in the past (they're still not
completely immutable yet, as some other fields in view_info other
than base_info are initialized lazily and may depend on the base
schema version).

Fixes https://github.com/scylladb/scylladb/issues/9059
Fixes https://github.com/scylladb/scylladb/issues/21292
Fixes https://github.com/scylladb/scylladb/issues/22194
Fixes https://github.com/scylladb/scylladb/issues/22410

Closes scylladb/scylladb#23337

* github.com:scylladb/scylladb:
  test: remove flakiness from test_schema_is_recovered_after_dying
  mv: add a test for dropping an index while it's building
  base_info: remove the lw_shared_ptr variant
  view_info: don't re-set base_info after construction
  base_info: remove base_info snapshot semantics
  base_info: remove base schema from the base_info
  schema_registry: store base info instead of base schema for view entries
  base_info: make members non-const
  view_info: move the base info to a separate header
  view_info: move computation of view pk columns not in base pk to view_updates
  view_info: move base-dependent variables into base_info
  view_info: set base info on construction
2025-04-27 19:12:12 +03:00
Piotr Szymaniak
e588c8667f alternator: Limit attribute name lengths
Attribute names are now checked against DynamoDB-compatible length
limits. When exceeded, Alternator emits exception identical or similar
to the DDB one. It might be worth noting that DDB emits more than a
single kind of an exception string for some exceptions. The tests'
catch clauses handle all the observed kinds of messages from DynamoDB.
The validation differentiates between key and non-key attributes and
applies the limit accordingly.

AWS DDB raises exceptions with somewhat different contents when the
get request contains ProjectionExpression, so this case needed separate
treatment to emit the corresponding exception string. The
length-validating function was declared and defined in
expressions.hh/.cc respectively, because that's where the relevant
parsing happens.

** Tests

The following tests were validated when handling this issue:
test_limit_attribute_length_nonkey_good,
test_limit_attribute_length_nonkey_bad,
test_limit_attribute_length_key_good,
test_limit_attribute_length_key_bad,
test_limit_attribute_length_gsi_lsi_good,
test_limit_attribute_length_gsi_lsi_bad,
test_limit_attribute_length_gsi_lsi_projection_bad.

Some of the tests were expanded into being more granular. Namely, there
is a new test function
`test_limit_attribute_length_key_bad_incoherent_names`
which groups tests with too long attribute names in the case of
incorrect (incoherent) user requests.
Similarily, there is a new test function
`test_limit_attribute_length_gsi_lsi_bad_incoherent_names`
All the tests cover now each combination of the key/keys being too long.
Both the new fuctions contain tests that verify that ScyllaDB throws
length-related exceptions (instead of the coherency-related), similar
to what DynamoDB does.

The new test test_limit_gsiu_key_len_bad covers the case of too long
attribute name inside GlobalSecondaryIndexUpdates.
The new test test_limit_gsiu_key_len_bad_incoherent_names covers the
case of incorrect (incoherent) user requests containing too long
attribute names and GlobalSecondaryIndexUpdates.

test_limit_attribute_length_key_bad was found to have contaned an
illegal KeySchema structure.

Some of the tests were corrected their match clause.

All the tests are stripped of the xfail flag except
test_limit_attribute_length_key_bad, which has it changed since it
still fails due to Projection in GSI and LIS not implemented in Alternator.
The xfail now points to #5036.

Fixes scylladb/scylladb#9169

Closes scylladb/scylladb#23097
2025-04-27 18:39:20 +03:00