scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-20 08:30:35 +00:00

Author	SHA1	Message	Date
Botond Dénes	6fb37ae77f	Merge '[Backport 2025.4] alternator: fix batch writes during intranode tablet migrations' from Scylladb[bot] Scylla implements `LWT` in the` storage_proxy::cas` method. This method expects to be called on a specific shard, represented by the `cas_shard` parameter. Clients must create this object before calling `storage_proxy::cas`, check its `this_shard()` method, and jump to `cas_shard.shard()` if it returns false. The nuance is that by the time the request reaches the destination shard, the tablet may have already advanced in its migration state machine. For example, a client may acquire a `cas_shard` at the `streaming` tablet state, then submit a request to another shard via `smp::submit_to(cas_shard.shard())`. However, the new `cas_shard` created on that other shard might already be in the `write_both_read_new` state, and its `cas_shard.shard()` would not be equal to `this_shard_id()`. Such broken invariant results in an `on_internal_error` in `storage_proxy::cas`. Clients of `storage_proxy::cas` are expected to check` cas_shard.this_shard()` and recursively jump to another shard if it returns false. Most calls to `storage_proxy::cas` already implement this logic. The only exception is `executor::do_batch_write`, which currently checks `cas_shard.this_shard()` only once. This can break the invariant if the tablet state changes more than once during the operation. This PR fixes the issue by implementing recursive `cas_shard.this_shard()` checks in `executor::do_batch_write`. It also adds a test that reproduces the problem. Fixes: scylladb/scylladb#27353 backport: need to be backported to 2025.4 - (cherry picked from commit `e60bcd0011`) - (cherry picked from commit `74bf24a4a7`) - (cherry picked from commit `9bef142328`) - (cherry picked from commit `c6eec4eeef`) - (cherry picked from commit `3a865fe991`) - (cherry picked from commit `0bcc2977bb`) - (cherry picked from commit `608eee0357`) Parent PR: #27396 Closes scylladb/scylladb#27529 * github.com:scylladb/scylladb: alternator/executor.cc: eliminate redundant dk copy alternator/executor.cc: release cas_shard on the original shard alternator/executor.cc: move shard check into cas_write alternator/executor.cc: make cas_write a private method alternator/executor.cc: make do_batch_write a private method alternator/executor.cc: fix indent test_alternator: add test_alternator_invalid_shard_for_lwt alternator/executor.cc: avoid cross-shard free	2026-01-08 16:40:20 +02:00
Radosław Cybulski	b104c80c8e	Fix use-after-free in encode_paging_state in Alternator Fix unlikely use-after-free in `encode_paging_state`. The function incorrectly assumes that current position to encode will always have data for all clustering columns the schema defines. It's possible to encounter current position having less than all columns specified, for eample in case of range tombstone. Those don't happen in Alternator tables as DynamoDB doesn't allow range deletions and clustering key might be of size at most 1. Alternator api can be used to read scylla system tables and those do have range tombstones with more than single clustering column. The fix is to stop trying to encode columns, that don't have the value - they are not needed anyway, as there's no possible position with those values (range tombstone made sure of that). Fixes #27001 Fixes #27125 Closes scylladb/scylladb#26960 (cherry picked from commit `b54a9f4613`) Closes scylladb/scylladb#27347	2025-12-21 14:12:23 +02:00
Michael Litvak	4b26a86cb0	alternator: require rf_rack_valid_keyspaces when creating index When creating an alternator table with tablets, if it has an index, LSI or GSI, require the config option rf_rack_valid_keyspaces to be enabled. The option is required for materialized views in tablets keyspaces to function properly and avoid consistency issues that could happen due to cross-rack migrations and pairing switches when RF-rack validity is not enforced. Currently the option is validated when creating a materialized view via the CQL interface, but it's missing from the alternator interface. Since alternator indexes are based on materialized views, the same check should be added there as well. Fixes scylladb/scylladb#27612 Closes scylladb/scylladb#27622 (cherry picked from commit `b9ec1180f5`) Closes scylladb/scylladb#27671	2025-12-16 10:13:31 +02:00
Petr Gusev	14fea2c5ba	alternator/executor.cc: eliminate redundant dk copy A small refactoring/optimization. (cherry picked from commit `608eee0357`)	2025-12-10 11:49:04 +01:00
Petr Gusev	e5223415e4	alternator/executor.cc: release cas_shard on the original shard Before this series, we kept the cas_shard on the original shard to guard against tablet movements running in parallel with storage_proxy::cas. The bug addressed by this PR shows that this approach is flawed: keeping the cas_shard on the original shard does not guarantee that a new cas_shard acquired on the target shard won’t require another jump. We fixed this in the previous commit by checking cas_shard.this_shard() on the target shard and continuing to jump to another shard if necessary. Once cas_shard.this_shard() on the target shard returns true, the storage_proxy::cas invariants are satisfied, and no other cas_shard instances need to remain alive except the one passed into storage_proxy::cas. (cherry picked from commit `0bcc2977bb`)	2025-12-10 11:49:04 +01:00
Petr Gusev	8cf9c3eb8d	alternator/executor.cc: move shard check into cas_write This change ensures that if cas_shard points to a different shard, the executor will continue issuing shard jumps until cas_shard.this_shard() returns true. The commit simply moves the this_shard() check from the parallel_for_each lambda into cas_write, with minimal functional changes. We enable test_alternator_invalid_shard_for_lwt since now it should pass. Fixes scylladb/scylladb#27353 (cherry picked from commit `3a865fe991`)	2025-12-10 11:49:04 +01:00
Petr Gusev	5ceabe90fc	alternator/executor.cc: make cas_write a private method We will need to access executor::_stats field from cas_write. We could pass it as a paramter, but it seems simpler to just make cas_write and instance method too. (cherry picked from commit `c6eec4eeef`)	2025-12-10 11:49:04 +01:00
Petr Gusev	c90578a1fb	alternator/executor.cc: make do_batch_write a private method We will need to access executor::_stats field on other shards. (cherry picked from commit `9bef142328`)	2025-12-10 11:44:52 +01:00
Petr Gusev	bdc43c62b1	alternator/executor.cc: fix indent (cherry picked from commit `74bf24a4a7`)	2025-12-10 11:43:02 +01:00
Petr Gusev	37c575c104	test_alternator: add test_alternator_invalid_shard_for_lwt This test reproduces scylladb/scylladb#27353 using two injection points. First, the test triggers an intra-node tablet migration and suspends it at the streaming stage using the intranode_migration_streaming_wait injection. Next, it enables the alternator_executor_batch_write_wait injection, which suspends a batch write after its cas_shard has already been created. The test then issues several batch writes and waits until one of them hits this injection on the destination shard. At this point, the cas_shard.erm for that write is still in the streaming state, meaning the executor would need to jump back to the source shard. The test then resumes the suspended tablet migration, allowing it to update the ERM on the source shard to write_both_read_new. After that, the test releases the suspended batch write and expects it to perform two shard jumps: first from the destination to the source shard, and then again back to the source shard. This commit adds the alternator_executor_batch_write_wait injection to alternator/executor.cc. Coroutines are intentionally avoided in the parallel_for_each lambda to prevent unnecessary coroutine-frame allocations. (cherry picked from commit `e60bcd0011`)	2025-12-10 11:42:54 +01:00
Petr Gusev	b88ed6156b	alternator/executor.cc: avoid cross-shard free This commit is an optimization: avoiding destruction of foreign objects on the wrong shard. Releasing objects allocated on a different shard causes their ::free calls to be executed remotely, which adds unnecessary load to the SMP subsystem. Before this patch, a std::vector could be moved to another shard. When the vector was eventually destroyed, its ::free had to be marshalled back to the shard where the memory had originally been allocated. This change avoids that overhead by passing the vector by const reference instead. The referenced objects lifetime correctness reasoning: * the put_or_delete_item refs usages in put_or_delete_item_cas_request are bound to its lifetime * cas_request lifetime is bound to storage_proxy::cas future * we don't release put_or_delete_item-s untill all storage_proxy::cas calls are done. (cherry picked from commit `f00f7976c1`)	2025-12-10 11:41:55 +01:00
Nadav Har'El	433bc4c17f	Fix backport conflicts	2025-11-26 20:42:08 +02:00
Nadav Har'El	a5ba028d40	alternator: improve error in tablets_mode_for_new_keyspaces=enforced When in tablets_mode_for_new_keyspaces=enforced mode, Alternator is supposed to fail when CreateTable asks explicitly for vnodes. Before this patch, this error was an ugly "Internal Server Error" (an exception thrown from deep inside the implementation), this patch checks for this case in the right place, to generate a proper ValidationException with a proper error message. We also enable the test test_tablets_tag_vs_config which should have caught this error, but didn't because it was marked xfail because tablets_mode_for_new_keyspaces had not been live-updatable. Now that it is, we can enable the test. I also improved the test to be slightly faster (no need to change the configuration so many times) and also check the ordinary case - where the schema doesn't choose neither vnodes nor tablets explicitly and we should just use the default. Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `c03081eb12`)	2025-11-26 20:42:08 +02:00
Nadav Har'El	58b89b4d28	alternator: improve comment about non-hidden system tags The previous patches added a somewhat misleading comment in front of system:initial_tablets, which this patch improves. That tag is NOT where Alternator "stores" table properties like the existing comment claimed. In fact, the whole point is that it's the opposite - Alternator never writes to this tag - it's a user-writable tag which Alternator reads, to configure the new table. And this is why it obviously can't be hidden from the user. Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `b34f28dae2`)	2025-11-26 20:42:08 +02:00
Piotr Szymaniak	7489db0097	alternator: Fix tag name to request vnodes The tag was lately renamed from `experimental:initial_tablets` to `system::initial_tablets`. This commit fixes both the tests as well as the exceptions sent to the user instructing how to create table with vnodes. (cherry picked from commit `63897370cb`)	2025-11-26 20:42:07 +02:00
Piotr Szymaniak	65ceb83f42	alternator: Support `tablets_mode_for_new_keyspaces` config flag Until now, tablets in Alternator were experimental feature enabled only when a TAG "experimental:initial_tablets" was present when creating a table and associated with a numeric value. After this patch, Alternator honours the value of `tablets_mode_for_new_keyspaces` config flag. Each table can be overriden to use tablets or not by supplying a new TAG "system:initial_tablets". The rules stay the same as with the earlier, experimental tag: when supplied with a numeric value, the table will use tablets (as long as they are supported). When supplied with something else (like a string "none"), the table will use vnodes, provided that tablets are not `enforced` by the config flag. Fixes #22463 (cherry picked from commit `376a2f2109`)	2025-11-26 20:42:06 +02:00
Wojciech Mitros	33eef5122c	alternator: use storage_proxy from the correct shard in executor::delete_table When we delete a table in alternator, the schema change is performed on shard 0. However, we actually use the storage_proxy from the shard that is handling the delete_table command. This can lead to problems because some information is stored only on shard 0 and using storage_proxy from another shard may make us miss it. In this patch we fix this by using the storage_proxy from shard 0 instead. Fixes https://github.com/scylladb/scylladb/issues/27223 Closes scylladb/scylladb#27224 (cherry picked from commit `3c376d1b64`) Closes scylladb/scylladb#27260	2025-11-26 14:33:21 +02:00
Pavel Emelyanov	9d997458d7	Merge 'Alternator: allow warning on auth errors before enabling enforcement' from Nadav Har'El An Alternator user was recently "bit" when switching `alternator_enforce_authorization` from "false" to "true": ְְְAfter the configuration change, all application requests suddenly failed because unbeknownst to the user, their application used incorrect secret keys. This series introduces a solution for users who want to safely switch `alternator_enforce_authorization` from "false" to "true": Before switching from "false" to "true", the user can temporarily switch a new option, `alternator_warn_authorization`, to true. In this "warn" mode, authentication and authorization errors are counted in metrics (`scylla_alternator_authentication_failures` and `scylla_alternator_authorization_failures`) and logged as WARNings, but the user's application continues to work. The user can use these metrics or log messages to learn of errors in their application's setup, fix them, and only do the switch of `alternator_enforce_authorization` when the metrics or log messages show there are no more errors. The first patch is the implementation of the the feature - the new configuration option, the metrics and the log messages, the second patch is a test for the new feature, and the third patch is documentation recommending how to use the warn mode and the associated metrics or log messages to safely switch `alternaor_enforce_authorization` from false to true. Fixes #25308 This is a feature that users need, so it should probably be backported to live branches. Closes scylladb/scylladb#25457 * github.com:scylladb/scylladb: docs/alternator: explain alternator_warn_authorization test/alternator: tests for new auth failure metrics and log messages alternator: add alternator_warn_authorization config (cherry picked from commit `59019bc9a9`) Closes scylladb/scylladb#26894	2025-11-07 16:48:50 +02:00
Michał Jadwiszczak	6df48aacd7	alternator/executor: instantly mark view as built when creating it with base table `CreateTable` request creates GSI/LSI together with the base table, the base table is empty and we don't need to actually build the view. In tablet-based keyspaces we can just don't create view building tasks and mark the view build status as SUCCESS on all nodes. Then, the view building worker on each node will mark the view as built in `system.built_views` (`view_building_worker::update_built_views()`). Vnode-based keyspaces will use the "old" logic of view builder, which will process the view and mark it as built. Fixes scylladb/scylladb#26615 (cherry picked from commit `8fbf122277`)	2025-10-22 10:51:54 +00:00
Piotr Wieczorek	c191c31682	alternator: Correct RCU undercount in BatchGetItem The `describe_multi_item` function treated the last reference-captured argument as the number of used RCU half units. The caller `batch_get_item`, however, expected this parameter to hold an item size. This RCU value was then passed to `rcu_consumed_capacity_counter::get_half_units`, treating the already-calculated RCU integer as if it were a size in bytes. This caused a second conversion that undercounted the true RCU. During conversion, the number of bytes is divided by `RCU_BLOCK_SIZE_LENGTH` (=4KB), so the double conversion divided the number of bytes by 16 MB. The fix removes the second conversion in `describe_multi_item` and changes the API of `describe_multi_item`. Fixes: https://github.com/scylladb/scylladb/pull/25847 Closes scylladb/scylladb#25842 (cherry picked from commit `a55c5e9ec7`) Closes scylladb/scylladb#26539	2025-10-14 11:53:09 +03:00
Piotr Wieczorek	4be0bdbc07	alternator: Don't emit a redundant REMOVE event in Alternator Streams for PutItem calls Until now, every PutItem operation appeared in the Alternator Streams as two events - a REMOVE and a MODIFY. DynamoDB Streams emits only INSERT or MODIFY, depending on whether a row was replaced, or created anew. A related issue scylladb#6918 concerns distinguishing the mutation type properly. This was because each call to PutItem emitted the two CDC rows, returned by GetRecords. Since this patch, we use a collection tombstone for the `:attrs` column, and a separate tombstone for each regular column in the table's schema. We don't expect that new tables would have any other regular column, except for the `:attrs` and keys, but we may encounter them in in upgraded tables which had old GSIs or LSIs. Fixes: scylladb#6930. Closes scylladb/scylladb#24991	2025-09-30 13:12:16 +03:00
Szymon Malewski	6ce7843774	alternator: use expression caching Before this patch, every expression in Alternator's requests was parsed from string to adequate structure. This patch enables caching - all calls to parse an expression (all types) are proxied through the cache. New expression is added to the cache, the least recently used entry (above cache size) is removed. For existing entries the copy of the template is returned - individual instances still need to be resolved (placeholders substituted with names and values). The cache is per shard - shared for all operations, expression types, tables, users. Default cache size is 2000 entries per shard and it has configuration option `alternator_max_expression_cache_entries_per_shard` (0 means cache disabled). Added Python tests are based on metrics.	2025-09-28 04:27:44 +02:00
Szymon Malewski	65c95d3a93	alternator/expressions: error on parsing empty update expression With this patch empty update expression is no longer accepted by the parser. So far it was rejected only after resolving, however it could pollute the expression cache.	2025-09-28 04:06:00 +02:00
Nadav Har'El	f65998cd39	alternator: improve error handling of incorrect ARN Before this patch, if an ARN that is passed to Alternator requests like TagResource is well-formatted but points to non-existent table, Alternator returns the unhelpful error: (AccessDeniedException) when calling the TagResource operation: Incorrect resource identifier This patch modifies this error to be: (ResourceNotFoundException) when calling the TagResource operation: ResourceArn 'arn:scylla:alternator:alternator_alternator_Test_ 1758532308880:scylla:table/alternator_Test_1758532308880x' not found This is the same error type (ResourceNotFoundException) that DynamoDB returns in that case - and a more helpful error message. This patch also includes a regression test that checks the error type in this case. The new test fails on Alternator before this patch, and passes afterwards (and also passes on DyanamoDB). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26179	2025-09-25 11:51:17 +03:00
Botond Dénes	50038ef2cc	Merge 'alternator: update references to alternator streams issue' from Michael Litvak update all the references about the issue of tablets support for alternator streams to issue https://github.com/scylladb/scylladb/issues/23838 instead of https://github.com/scylladb/scylladb/issues/16317. The issue https://github.com/scylladb/scylladb/issues/16317 is about support of CDC with tablets, but it is now closed and it didn't address alternator streams. the remaining issues about alternator streams should be addressed as part of https://github.com/scylladb/scylladb/issues/23838, so fix the references in order for them not to be missed. backport is not needed Closes scylladb/scylladb#26178 * github.com:scylladb/scylladb: test/cqlpy/test_permissions: unskip test for tablets alternator: update references to alternator streams issue	2025-09-25 11:05:52 +03:00
Ernest Zaslavsky	5ba5aec1f8	treewide: Move mutation related files to a `mutation` directory As requested in #22104, moved the files and fixed other includes and build system. Moved files: - combine.hh - collection_mutation.hh - collection_mutation.cc - converting_mutation_partition_applier.hh - converting_mutation_partition_applier.cc - counters.hh - counters.cc - timestamp.hh Fixes: #22104 This is a cleanup, no need to backport Closes scylladb/scylladb#25085	2025-09-24 13:23:38 +03:00
Michael Litvak	65351fda29	alternator: update references to alternator streams issue update all the references about the issue of tablets support for alternator streams to issue #23838 instead of #16317. The issue #16317 is about support of CDC with tablets, but it is now closed and it didn't address alternator streams. the remaining issues about alternator streams should be addressed as part of #23838, so fix the references in order for them not to be missed.	2025-09-22 09:56:23 +02:00
Nadav Har'El	7be5454db1	Merge 'alternator: Store LSI keys in :attrs for newly created tables' from Piotr Wieczorek Previously, LSI keys were stored as separate, top-level columns in the base table. This patch changes this behavior for newly created tables, so that the key columns are stored inside the `:attrs` map. Then, we use top-level computed columns instead of regular ones. This makes LSI storage consistent with GSIs and allows the use of a collection tombstone on `:attrs` to delete all attributes in a row except for keys in new tables. Refs https://github.com/scylladb/scylladb/pull/24991 Refs https://github.com/scylladb/scylladb/issues/6930 Closes scylladb/scylladb#25796 * github.com:scylladb/scylladb: alternator: Store LSI keys in :attrs for newly created tables alternator/test: Add LSI tests based mostly on the existing GSI tests	2025-09-18 21:48:43 +03:00
Radosław Cybulski	c0db278c03	Don't report spurious keys in DescribeTable Alternator, when creating gsi, adds artificially columns, that user had not ask for. This patch prevents those columns from showing up in DescribeTable's output. Fixes #5320 Closes scylladb/scylladb#25978	2025-09-18 19:34:39 +03:00
Nadav Har'El	3c0032deb4	alternator: fix bug in combination of AttributeUpdates + ReturnValues In test/alternator/test_returnvalues.py we had tests for the ReturnValues feature on UpdateItem requests - but we only tested UpdateItem requests with the "modern" UpdateExpression, and forgot to test the combination of ReturnValues with the old AttributeUpdates API. It turns out this combination is buggy: when both ReturnValues=ALL_OLD and AttributeUpdates need the previous value of the item, we may wrongly std::move() the value out, and the operation will fail with a strange error: An error occurred (ValidationException) when calling the UpdateItem operation: JSON assert failed on condition 'IsObject()' The fix in this patch is trivial - just move the std::move() to the correct place, after both UpdateExpression and AttributeUpdates handling is done. This patch also includes a reproducing test, which fails before this patch and passes with it - and of course passes on DynamoDB. This test reproduces two cases where the bug happened, as well as one case where it didn't (to make sure we don't regress in what already worked). Fixes #25894 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25900	2025-09-17 16:04:01 +03:00
Ernest Zaslavsky	d624413ddd	treewide: Move query related files to a new `query` directory As requested in #22120, moved the files and fixed other includes and build system. Moved files: - query.cc - query-request.hh - query-result.hh - query-result-reader.hh - query-result-set.cc - query-result-set.hh - query-result-writer.hh - query_id.hh - query_result_merger.hh Fixes: #22120 This is a cleanup, no need to backport Closes scylladb/scylladb#25105	2025-09-16 23:40:47 +03:00
Nadav Har'El	208d3986a7	alternator: add explanation of internal tags Alternator needs to store a few pieces of information for each table that it can't store in the existing CQL schema. We decided to store this information in hidden tags - tags named with the prefix "system:" - and we already have four of those: Provisioned RCU and WCU, table creation time, and TTL's expiration-time attribute. This patch moves the definition of all four tags to one place in executor.cc, adds a short comment about the content of each tag, and adds a longer comment explaining why we have these hidden tags at all. It is expected that more hidden tags will follow - e.g., to solve issue #5320. So we expect more tags to be added later in the same place in the code. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25980	2025-09-15 08:41:39 +03:00
Piotr Wieczorek	db8b7d1c89	alternator: Store LSI keys in :attrs for newly created tables Before this patch, LSI keys were stored as separate, top-level columns in the base table. This patch changes this behavior for newly created tables, so that the key columns are stored inside the `:attrs` map. Then, in the LSI's materialized view, we create a computed column for each LSI's range key that is not a key in the base table. This makes LSI storage consistent with GSIs and allows the use of a collection tombstone on `:attrs` to delete all fields in a row, except for keys, in new tables. Refs https://github.com/scylladb/scylladb/pull/24991 Refs https://github.com/scylladb/scylladb/issues/6930	2025-09-04 15:02:37 +02:00
Michał Jadwiszczak	204f61ffe1	service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()` The reference is needed to get `view_building_state_machine`.	2025-08-27 08:55:47 +02:00
Nadav Har'El	87dd96f9a2	Merge ' Alternator: DynamoDB compatible WCU Calculation via Read-Before-Write Support' from Amnon Heiman This series adds support for a DynamoDB-compatible Write Capacity Unit (WCU) calculation in Alternator by introducing an optional forced read-before-write mechanism. Alternator's model differs from DynamoDB, and as a result, some write operations may report lower WCU usage compared to what DynamoDB would report. While this is acceptable in many cases, there are scenarios where users may require accurate WCU reporting that aligns more closely with DynamoDB's behavior. To address this, a new configuration option, alternator_force_read_before_write, is introduced. When enabled, Alternator will perform a read before executing PutItem, UpdateItem, and DeleteItem operations. This allows it to take the existing item size into account when computing the WCU. BatchWriteItem support is also extended to use this mechanism. Because BatchWriteItem does not support returning old items directly, several internal changes were made to support reading previous item sizes with minimal overhead. Reads are performed at consistency level LOCAL_ONE for efficiency, and the WCU calculation is now done in multiple stages to accurately account for item size differences. In addition to the implementation changes, test coverage was added to validate the new behavior. These tests confirm that WCU is calculated based on the larger of the old and new items when read-before-write is active, including for BatchWriteItem. This feature comes with performance overhead and is therefore disabled by default. It can be enabled at runtime via the system.config table and should be used only when precise WCU tracking is necessary. New feature, no need to backport Closes scylladb/scylladb#24436 * github.com:scylladb/scylladb: alternator/test_returnconsumedcapacity.py: Test forced read before write alternator/executor.cc: DynamoDB WCU calculation in BatchWriteItem using read-before-write executor.cc: get_previous_item with consistency level executor: Extend API of put_or_delete_item alternator/executor.cc: Accurate WCU for put, update, delete config: add alternator_force_read_before_write	2025-08-24 11:38:24 +03:00
Amnon Heiman	ffc7171a5f	alternator/executor.cc: DynamoDB WCU calculation in BatchWriteItem using read-before-write Alternator's internal model differs from DynamoDB, which can result in lower WCU (Write Capacity Unit) estimates for some operations. While this is often acceptable, accurate WCU tracking is occasionally required. This patch enables a DynamoDB like WCU calculation for `BatchWriteItem` when the `alternator_force_read_before_write` configuration is enabled, by reading existing items before applying changes. WCU calculation in `BatchWriteItem` is now performed in three stages: 1. During the initial scan, no calculation is done, just the pointers are collected. 2. If read-before-write is enabled, each item is read again and its prior size is compared to the new value. WCU is based on the larger of the two. The updated size is stored in the mutation building array. 3. Regardless of read-before-write, metrics are updated and consumed capacity units are returned if requested. This is done in a loop before sending the mutations. For performance, reads in `BatchWriteItem` use consistency level LOCAL_ONE. These changes increase WCU DynamoDB compatibility in batch operations, but add overhead and should be enabled only when needed. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-08-13 18:03:55 +03:00
Amnon Heiman	2c02cb394b	executor.cc: get_previous_item with consistency level This patch extends get_previous_item so it can be used to calculate the size of a previous item. This will allow batch_get_item to obtain the size of a previous item without needing the item itself. The patch includes the following changes: * Removes the unneeded forward declaration of get_previous_item. * Extends get_previous_item to accept an explicit consistency level. * Modifies the regular get_previous_item to maintain the same functionality while calling the base implementation. * Adds a get_previous_item_size function that uses the base implementation to retrieve the size of a previous item when only the size is needed. For performance reasons, get_previous_item_size uses consistency level one. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-08-07 12:14:43 +03:00
Nadav Har'El	a896e2dbb9	alternator: add optional support for writing to system table In commit `44a1daf` we added the ability to read system tables through the DynamoDB API (actually, the Scan and Query requests only). This ability is useful for tests, and can also be useful to users who want to read information that is only available through system tables. This patch adds support also for writing into system tables. This will be useful for Alternator tests, were we want to temporarily change some live-updatable configuration option - and so far haven't been able to do that like we did do in some cql-pytest tests. For reasons explained in issue #23218, only superuser roles are allowed to write to system tables - it is not enough for the role to be granted MODIFY permissions on the system table or on ALL KEYSPACES. Moreover, the ability to modify system tables carries special risks, so this patch only allows writes to the system tables if a new configuration option "alternator_allow_system_table_write" turned on. This option is turned off by default. This patch also includes a test for this new configuration-writing capability. The test scripts test/alternator/run and test.py now run Scylla with alternator_allow_system_table_write turned on, but the new test can also run without this option, and will be skipped in that case (to allow running the test suite against some manually- run instance of Scylla). Fixes: #12348 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-06 10:00:04 +03:00
Amnon Heiman	5abbfa1e52	executor: Extend API of put_or_delete_item This patch adds two methods to put_or_delete_item that will be used in WCU batch calculation: A setter to set the item length. A boolean getter that determines if this is a delete action.	2025-08-05 10:23:36 +03:00
Amnon Heiman	f75aa1b35e	alternator/executor.cc: Accurate WCU for put, update, delete To improve the accuracy of Write Capacity Unit (WCU) calculation, this patch introduces the use of the `alternator_force_read_before_write` configuration option. When enabled, it forces a read-before-write operation on `PutItem`, `UpdateItem`, and `DeleteItem` requests. This comes with performance overhead and should be used with caution, especially in high-throughput environments.	2025-08-05 10:01:32 +03:00
Avi Kivity	ee138217ba	alternator: simplify std::views::transform calls that extract a member from a class Rather than calling std::views::transform with a lambda that extracts a member from a class, call std::views::transform with a pointer-to-member to do the same thing. This results in more concise code. Closes scylladb/scylladb#25012	2025-07-22 12:39:01 +02:00
Botond Dénes	fd6877c654	Merge 'alternator: avoid oversized allocation in Query/Scan' from Nadav Har'El This series fixes one cause of oversized allocations - and therefore potentially stalls and increased tail latencies - in Alternator. The first patch in the series is the main fix - the later patches are cleanups requested by reviewers but also involved other pre-existing code, so I did those cleanups as separate patches. Alternator's Scan or Query operation return a page of results. When the number of items is not limited by a "Limit" parameter, the default is to return a 1 MB page. If items are short, a large number of them can fit in that 1MB. The test test_query.py::test_query_large_page_small_rows has 30,000 items returned in a single page. In the response JSON, all these items are returned in a single array "Items". Before this patch, we build the full response as a RapidJSON object before sending it. The problem is that unfortunately, RapidJSON stores arrays as contiguous allocations. This results in large contiguous allocations in workloads that scan many small items, and large contiguous allocations can also cause stalls and high tail latencies. For example, before this patch, running test/alternator/run --runveryslow \ test_query.py::test_query_large_page_small_rows reports in the log: oversized allocation: 573440 bytes. After this patch, this warning no longer appears. The patch solves the problem by collecting the scanned items not in a RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e, a chunked (non-contiguous) array of items (each a JSON value). After collecting this array separately from the response object, we need to print its content without actually inserting it into the object - we add a new function print_with_extra_array() to do that. The new separate-chunked-vector technique is used when a large number (currently, >256) of items were scanned. When there is a smaller number of items in a page (this is typical when each item is longer), we just insert those items in the object and print it as before. Beyond the original slow test that demonstrated the oversized allocation (which is now gone), this patch also includes a new test which exercises the new code with a scan of 700 (>256) items in a page - but this new test is fast enough to be permanently in our test suite and not a manual "veryslow" test as the other test. Fixes #23535 The stalls caused by large allocations was seen by actual users, so it makes sense to backport this patch. On the other hand, the patch while not big is fairly intrusive (modifies the nomal Scan and Query path and also the later patches do some cleanup of additional code) so there is some small risk involved in the backport. Closes scylladb/scylladb#24480 * github.com:scylladb/scylladb: alternator: clean up by co-routinizing alternator: avoid spamming the log when failing to write response alternator: clean up and simplify request_return_type alternator: avoid oversized allocation in Query/Scan	2025-07-17 11:30:40 +03:00
Botond Dénes	2d3965c76e	Merge 'Reduce Alternator table name length limit to 192 and fix crash when adding stream to table with very long name' from Nadav Har'El Before this series, it is possible to crash Scylla (due to an I/O error) by creating an Alternator table close to the maximum name length of 222, and then enabling Alternator Streams. This series fixes this bug in two ways: 1. On a pre-existing table whose name might be up to 222 characters, enabling Streams will check if the resulting name is too long, and if it is, fail with a clear error instead of crashing. This case will effect pre-existing tables whose name has between 207 and 222 characters (207 is `222 - strlen("_scylla_cdc_log")`) - for such tables enabling Streams will fail, but no longer crash. 2. For new tables, the table name length limit is lowered from 222 to 192. The new limit is still high enough, but ensures it will be possible to enable streams any new table. It will also always be possible to add a GSI for such a table with name up to 29 characters (if the table name is shorter, the GSI name can be longer - the sum can be up to 221 characters). No need to backport, Alternator Streams is still an experimental feature and this patch just improves the unlikely situation of extremely long table names. Fixes #24598 Closes scylladb/scylladb#24717 * github.com:scylladb/scylladb: alternator: lower maximum table name length to 192 alternator: don't crash when adding Streams to long table name alternator: split length limit for regular and auxiliary tables alternator: avoid needlessly validating table name	2025-07-15 06:57:04 +03:00
Nadav Har'El	a248336e66	alternator: clean up by co-routinizing Reviewers of the previous patch complained on some ugly pre-existing code in alternator/executor.cc, where returning from an asynchronous (future) function require lengthy verbose casts. So this patch cleans up a few instances of these ugly casts by using co_return instead of return. For example, the long and verbose return make_ready_future<executor::request_return_type>( rjson::print(std::move(response))); can be changed to the shorter and more readable co_return rjson::print(std::move(response)); This patch should not have any functional implications, and also not any performance implications: I only coroutinized slow-path functions and one function that was already "partially" coroutinized (and this was expecially ugly and deserved being fixed). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-07-14 18:41:35 +03:00
Nadav Har'El	13ec94107a	alternator: avoid spamming the log when failing to write response Both make_streamed() and new make_streamed_with_extra_array() functions, used when returning a long response in Alternator, would write an error- level log message if it failed to write the response. This log message is probably not helpful, and may spam the log if the application causes repeated errors intentionally or accidentally. So drop these log messages. The exception is still thrown as usual. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-07-14 18:41:34 +03:00
Nadav Har'El	d8fab2a01a	alternator: clean up and simplify request_return_type The previous patch introduced a function make_streamed_with_extra_array which was a duplicate of the existing make_streamed. Reviewers complained how baroque the new function is (just like the old function), having to jump through hoops to return a copyable function working on non-copyable objects, making strange-named copies and shared pointers of everything. We needed to return a copyable function (std::function) just because Alternator used Seastar's json::json_return_type in the return type from executor function (request_return_type). This json_return_type contained either a sstring or an std::function, but neither was ever really appropriate: 1. We want to return noncopyable_function, not an std::function! 2. We want to return an std::string (which rjson::print()) returns, not an sstring! So in this patch we stop using seastar::json::json_return_type entirely in Alternator. Alternator's request_return_type is now an std::variant of three types: 1. std::string for short responses, 2. noncopyable_function for long streamed response 3. api_error for errors. The ugliest parts of make_streamed() where we made copies and shared pointers to allow for a copyable function are all gone. Even nicer, a lot of other ugly relics of using seastar::json_return_type are gone: 1. We no longer need obscure classes and functions like make_jsonable() and json_string() to convert strings to response bodies - an operation can simply return a string directly - usually returning rjson::print(value) or a fixed string like "" and it just works. 2. There is no more usage of seastar::json in Alternator (except one minor use of seastar::json::formatter::to_json in streams.cc that can be removed later). Alternator uses RapidJSON for its JSON needs, we don't need to use random pieces from a different JSON library. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-07-14 18:41:34 +03:00
Nadav Har'El	2385fba4b6	alternator: avoid oversized allocation in Query/Scan This patch fixes one cause of oversized allocations - and therefore potentially stalls and increased tail latencies - in Alternator. Alternator's Scan or Query operation return a page of results. When the number of items is not limited by a "Limit" parameter, the default is to return a 1 MB page. If items are short, a large number of them can fit in that 1MB. The test test_query.py::test_query_large_page_small_rows has 30,000 items returned in a single page. In the response JSON, all these items are returned in a single array "Items". Before this patch, we build the full response as a RapidJSON object before sending it. The problem is that unfortunately, RapidJSON stores arrays as contiguous allocations. This results in large contiguous allocations in workloads that scan many small items, and large contiguous allocations can also cause stalls and high tail latencies. For example, before this patch, running test/alternator/run --runveryslow \ test_query.py::test_query_large_page_small_rows reports in the log: oversized allocation: 573440 bytes. After this patch, this warning no longer appears. The patch solves the problem by collecting the scanned items not in a RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e, a chunked (non-contiguous) array of items (each a JSON value). After collecting this array separately from the response object, we need to print its content without actually inserting it into the object - we add a new function print_with_extra_array() to do that. The new separate-chunked-vector technique is used when a large number (currently, >256) of items were scanned. When there is a smaller number of items in a page (this is typical when each item is longer), we just insert those items in the object and print it as before. Beyond the original slow test that demonstrated the oversized allocation (which is now gone), this patch also includes a new test which exercises the new code with a scan of 700 (>256) items in a page - but this new test is fast enough to be permanently in our test suite and not a manual "veryslow" test as the other test. Fixes #23535	2025-07-14 18:41:34 +03:00
Benny Halevy	3feb759943	everywhere: use utils::chunked_vector for list of mutations Currently, we use std::vector<*mutation> to keep a list of mutations for processing. This can lead to large allocation, e.g. when the vector size is a function of the number of tables. Use a chunked vector instead to prevent oversized allocations. `perf-simple-query --smp 1` results obtained for fixed 400MHz frequency and PGO disabled: Before (read path): ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 89055.97 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39417 insns/op, 18003 cycles/op, 0 errors) 103372.72 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39380 insns/op, 17300 cycles/op, 0 errors) 98942.27 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39413 insns/op, 17336 cycles/op, 0 errors) 103752.93 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39407 insns/op, 17252 cycles/op, 0 errors) 102516.77 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39403 insns/op, 17288 cycles/op, 0 errors) throughput: mean= 99528.13 standard-deviation=6155.71 median= 102516.77 median-absolute-deviation=3844.59 maximum=103752.93 minimum=89055.97 instructions_per_op: mean= 39403.99 standard-deviation=14.25 median= 39406.75 median-absolute-deviation=9.30 maximum=39416.63 minimum=39380.39 cpu_cycles_per_op: mean= 17435.81 standard-deviation=318.24 median= 17300.40 median-absolute-deviation=147.59 maximum=18002.53 minimum=17251.75 ``` After (read path) ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 59755.04 tps ( 66.2 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39466 insns/op, 22834 cycles/op, 0 errors) 71854.16 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39417 insns/op, 17883 cycles/op, 0 errors) 82149.45 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39411 insns/op, 17409 cycles/op, 0 errors) 49640.04 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.3 tasks/op, 39474 insns/op, 19975 cycles/op, 0 errors) 54963.22 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.3 tasks/op, 39474 insns/op, 18235 cycles/op, 0 errors) throughput: mean= 63672.38 standard-deviation=13195.12 median= 59755.04 median-absolute-deviation=8709.16 maximum=82149.45 minimum=49640.04 instructions_per_op: mean= 39448.38 standard-deviation=31.60 median= 39466.17 median-absolute-deviation=25.75 maximum=39474.12 minimum=39411.42 cpu_cycles_per_op: mean= 19267.01 standard-deviation=2217.03 median= 18234.80 median-absolute-deviation=1384.25 maximum=22834.26 minimum=17408.67 ``` `perf-simple-query --smp 1 --write` results obtained for fixed 400MHz frequency and PGO disabled: Before (write path): ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no} Disabling auto compaction 63736.96 tps ( 59.4 allocs/op, 16.4 logallocs/op, 14.3 tasks/op, 49667 insns/op, 19924 cycles/op, 0 errors) 64109.41 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 49992 insns/op, 20084 cycles/op, 0 errors) 56950.47 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50005 insns/op, 20501 cycles/op, 0 errors) 44858.42 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50014 insns/op, 21947 cycles/op, 0 errors) 28592.87 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50027 insns/op, 27659 cycles/op, 0 errors) throughput: mean= 51649.63 standard-deviation=15059.74 median= 56950.47 median-absolute-deviation=12087.33 maximum=64109.41 minimum=28592.87 instructions_per_op: mean= 49941.18 standard-deviation=153.76 median= 50005.24 median-absolute-deviation=73.01 maximum=50027.07 minimum=49667.05 cpu_cycles_per_op: mean= 22023.01 standard-deviation=3249.92 median= 20500.74 median-absolute-deviation=1938.76 maximum=27658.75 minimum=19924.32 ``` After (write path) ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no} Disabling auto compaction 53395.93 tps ( 59.4 allocs/op, 16.5 logallocs/op, 14.3 tasks/op, 50326 insns/op, 21252 cycles/op, 0 errors) 46527.83 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50704 insns/op, 21555 cycles/op, 0 errors) 55846.30 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50731 insns/op, 21060 cycles/op, 0 errors) 55669.30 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50735 insns/op, 21521 cycles/op, 0 errors) 52130.17 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50757 insns/op, 21334 cycles/op, 0 errors) throughput: mean= 52713.91 standard-deviation=3795.38 median= 53395.93 median-absolute-deviation=2955.40 maximum=55846.30 minimum=46527.83 instructions_per_op: mean= 50650.57 standard-deviation=182.46 median= 50731.38 median-absolute-deviation=84.09 maximum=50756.62 minimum=50325.87 cpu_cycles_per_op: mean= 21344.42 standard-deviation=202.86 median= 21334.00 median-absolute-deviation=176.37 maximum=21554.61 minimum=21060.24 ``` Fixes #24815 Improvement for rare corner cases. No backport required Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#24919	2025-07-13 19:13:11 +03:00
Nadav Har'El	18b6c4d3c5	alternator: lower maximum table name length to 192 Currently, Alternator allows creating a table with a name up to 222 (max_table_name_length) characters in length. But if you do create a table with such a long name, you can have some difficulties later: You you will not be able to add Streams or GSI or LSI to that table, because 222 is also the absolute maximum length Scylla tables can have and the auxilliary tables we want to create (CDC log, materialized views) will go over this absolute limit (max_auxiliary_table_name_length). This is not nice. DynamoDB users assume that after successfully creating a table, they can later - perhaps much later - decide to add Streams or GSI to it, and today if they chose extremely long names, they won't be able to do this. So in this patch, we lower max_table_name_length from 222 to 192. A user will not be able to create tables with longer names, but the good news is that once successfully creating a table, it will always be possible to enable Streams on it (the CDC log table has an extra 15 bytes in its name, and 192 + 15 is less than 222), and it will be possible to add GSIs with short enough names (if the GSI name is 29 or less, 192 + 29 + 1 = 222). This patch is a trivial one-line code change, but also includes the corrected documentation of the limits, and a fix for one test that previously checked that a table name with length 222 was allowed - and now needs to check 192 because 222 is no longer allowed. Note that if a user has existing tables and upgrades Scylla, it is possible that some pre-existing Alternator tables might have lengths over 192 (up to 222). This is fine - in the previous patches we made sure that even in this case, all operations will still work correctly on these old tables (by not not validating the name!), and we also made sure that attempting to enable Streams may fail when the name is too long (we do not remove those old checks in this patch, and don't plan to remove them in the forseeable future). Note that the limit we chose - 192 characters - is identical to the table name limit we recently chose in CQL. It's nicer that we don't need to memorize two different limits for Alternator and CQL. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-07-07 11:58:21 +03:00
Nadav Har'El	3ed8e269f9	alternator: don't crash when adding Streams to long table name Currently, in Alternator it is possible to create a table whose name has 222 characters, and then trying to add Streams to that table results in an attempt to create a CDC log table with the same name plus a 15-character suffix "_scylla_cdc_log", which resulted (Ref #24598) in an IO-error and a Scylla shutdown. This patch adds code to the Stream-adding operations (both CreateTable and UpdateTable) that validates that the table's name, plus that 15 character suffix, doesn't exceed max_auxiliary_table_name_length, i.e., 222. After this patch, if you have a table whose name is between 207 and 222 characters, attempting to enable Streams on it will fail with: "Streams cannot be added if the table name is longer than 207 characters." Note that in the future, if we lower max_table_name_length to below 207, e.g., to 192, then it will always be possible to add a stream to any legal table, and the new checks we had here will be mostly redundant. But only "mostly" - not entirely: Checking in UpdateTable is still important because of the possibility that an upgrading user might have a pre-existing table whose name is longer than the new limit, and might try to enable Streams. After this patch, the crash reported in #24598 can no longer happen, so in this sense the bug is solved. However, we still want to lower max_table_name_length from 222 to 192, so that it will always be possible to enable streams on any table with a legal name length. We'll do this in the next patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-07-07 11:58:13 +03:00

1 2 3 4 5 ...

549 Commits