Today in Alternator vector search, vectors are presented to the API as
lists of numbers. I.e., in JSON a vector is sent in requests and responses
as:
{"L": [{"N": "3.14159"}, {"N":" "6.7"}}
This format is verbose and inefficient for long vectors. Even worse,
because the "N" number format has precision guarantees in DynamoDB,
we cannot optimize the storage of such vectors by, for example, storing
the numbers as 32-bit floats. We actually store these vectors as JSON,
exactly as shown above.
So in this patch we introduce a new DynamoDB type, "FLOAT32VECTOR", for
vectors. The above vector will look like this in JSON:
{"FLOAT32VECTOR": [3.14159, 6.7]}
Note that each number is an unquoted JSON number, not a JSON string.
Importantly, the definition of the "FLOAT32VECTOR" type specifies that
components of the vector only have 32-bit precision. This means that
Scylla may store internally these vectors as lists of 32-bit floats -
not as a JSON. And indeed, this patch includes this optimization:
Top-level vector attributes are now encoded in an optimized way,
as a byte 5 (alternator_type::FLOAT32VECTOR) followed by the elements
of the vector, just 4 bytes each (the 4-byte big-endian IEEE 754
representation of each floating-point component).
This patch also includes documentation, and extensive tests that the
new "FLOAT32VECTOR" type works (which also serves as an example how to
use it in the boto3 SDK), that it is indeed encoded internally as 32-bit
floats and not wasteful JSON strings, and that vector search on such items
work. The last thing requires cooperation from the vector store, of
course - it needs to be able to understand the new optimized encoding
of vector attributes in addition to the old unoptimized one.
Note that the old unoptimized ("list of numbers") vectors are still
supported. Although not recommended for general use, some users might
still want to use the unoptimized type if they have pre-existing data
created on DynamoDB or Alternator without vector search in mind, and
the vectors already exist as lists of numbers.
Although this is less important, the new vector type "FLOAT32VECTOR"
is also allowed in a Query's QueryVector.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Before this patch, vector search always used the COSINE similarity
function. In this patch we add the ability to choose a different
similarity function when creating a new vector index (with CreateTable
or UpdateTable) by using the SimilarityFunction option. We still default
to "COSINE" if SimilarityFunction isn't specified.
Allowed similarity functions are COSINE, DOT_PRODUCT, and EUCLIDEAN.
DescribeTable can also retrieve a vector index's SimilarityFunction.
As usual, this patch also includes documentation for the new feature,
and tests. Some of the tests can run without a vector store - verifying
the API syntax and which similarity function is supported - but we
also add tests that require the vector store and check that the different
similarity functions actually sort the nearest items in the expected
order.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When an Alternator stream is disabled, the data should continue to be accessible so that consumers can finish reading. When the stream is later re-enabled, a new StreamArn is produced and only then the old data is purged.
On disable, the existing CDC options (including preimage and postimage) are preserved so that DescribeStream can still report StreamViewType. All stream APIs continue to work on the disabled stream, with all shards reported as closed (EndingSequenceNumber set). No new CDC records are written; existing data expires via TTL after 24 hours.
On re-enable, the old CDC log table is dropped as a separate Raft group0 schema change and a fresh one is created with a new UUID, giving a new StreamArn. This is Alternator-specific — CQL CDC keeps reusing the log table. Re-enabling is the only way to immediately purge old stream data.
Old stream data is removed immediately upon re-enable (a discrepancy with DynamoDB, which keeps it readable for 24 hours through the old StreamArn).
Tests updated to cover the new disable and re-enable behavior.
Fixes#7239
Fixes SCYLLADB-523
Closesscylladb/scylladb#29413
* github.com:scylladb/scylladb:
alternator/streams: remove dead next_iter in get_records
test/alternator: fix stream wait timeouts to use wall-clock time
docs/alternator: document stream disable/re-enable behavior
alternator/streams: keep disabled streams usable and purge on re-enable
Previously, disabling Alternator Streams would create a blank
cdc::options with only enabled=false, which meant losing access also
to stored Streams's data (including preimage and postimage).
Now, when a stream is disabled:
- The existing CDC options are preserved (only 'enabled' is flipped to
false), so StreamViewType remains available.
- DescribeStream enumerates all shards with EndingSequenceNumber set,
indicating they are closed.
- GetRecords omits NextShardIterator for disabled streams.
- DescribeTable (supplement_table_stream_info) reports the stream ARN
and StreamEnabled: false when the CDC log table still exists.
- ListStreams uses get_base_table instead of is_log_for_some_table so
that disabled streams whose log table still exists are listed.
When a stream is re-enabled on an Alternator table that has an existing
(disabled) CDC log table, the old log table is dropped and a fresh one
is created with a new UUID, producing a new StreamArn. This is
Alternator-specific behavior; CQL CDC tables continue to reuse the
existing log table.
The old stream data is lost immediately upon re-enable. DynamoDB keeps
it readable for 24 hours.
Tests:
- test_streams_closed_read, test_streams_disabled_stream: remove xfail
now that disabled streams are usable.
- test_streams_reenable: new test verifying that re-enabling produces
a new ARN and the old data is still readable via the old ARN (xfail
because Scylla currently purges old data on re-enable).
Fixesscylladb/scylladb#7239
CreateTable and UpdateTable call wait_for_schema_agreement() after
announcing the schema change, to ensure all live nodes have applied
the new schema before returning to the user. This wait has a hard-
coded 10 second timeout, and on some overloaded test machines we
saw it not completing in time, and causing tests to become flaky.
This patch increases this timeout from 10 seconds to 30 seconds.
It's still hard-coded and not configurable via alternator_timeout_in_ms
because it is unlikely any user will want to change it - it just needs
to be long.
The patch also improves the behavior of a schema-agreement timeout,
when it happens:
1. Provide an InternalServerError with more descriptive text.
2. This InternalServerError tells the user that the result of the
operation is unknown; So the user will repeat the CreateTable, and
will get a ResourceInUseException because the table exists. In that
case too, we need to wait for schema agreement. So we added this
missing wait.
Fixes SCYLLADB-1804
Refs #5052 (claiming CreateTable shouldn't wait at all)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
DynamoDB Streams API can only convey a single parent per stream shard.
Tablet merges produce 2 parents, which is incompatible. When streams
are requested on a tablet table, block tablet merges via
tablet_merge_blocked (the allocator suppresses new merge decisions and
revokes any active merge decision).
add_stream_options() sets tablet_merge_blocked=true alongside
enabled=true, so CreateTable needs no special handling — the flag
is inert on vnode tables and immediately effective on tablet tables.
For UpdateTable, CDC enablement is deferred: store the user's intent
via enable_requested, and let the topology coordinator finalize
enablement once no in-progress merges remain. A new helper,
defer_enabling_streams_block_tablet_merges(), amends the CDC options
to this deferred state.
Disabling streams clears all flags, immediately re-allowing merges.
The tablet allocator accesses the merge-blocked flag through a
schema::tablet_merges_forbidden() accessor rather than reaching into
CDC options directly.
Mark test_parent_children_merge as xfail and remove downward
(merge) steps from tablet_multipliers in test_parent_filtering and
test_get_records_with_alternating_tablets_count.
Remove `if` condition, that prevented tables with tablets
working with Streams.
Remove a test, that verifies, that Alternator will reject
tables with tablets underneath working with Streams feature enabled
on them.
Update few tests, that were expected to fail on tablets to enable their
normal execution.
Add a reference to `system_keyspace` object to `executor` object in
alternator. The reference is needed, because in future commit
we will add there (and use) helper functions that read `cdc_log` tables
for tablet based tables similarly to already existing siblings
for vnodes living in `system_distributed_keyspace`.
This patch continues the effort to split the huge executor.cc (5000
lines before this patch) even more.
In this patch we introduce a new source file, executor_util.cc, for
various utility functions that are used for many different operations
and therefore are useful to have in a header file. These utility
functions will now be in executor_util.cc and executor_util.hh -
instead of executor.cc and executor.hh.
Various source files, including executor.cc, the executor_read.cc
introduced in the previous patch, as well as older source files like
as streams.cc, ttl.cc and serialization.cc, use the new header file.
This patch removes over 700 lines of code from executor.cc, and
also removes a large amount of utility functions declerations from
executor.hh. Originally, executor.hh was meant to be about the
interface that the Alternator server needs to *execute* the different
DynamoDB API operations - and after this patch it returns closer to
this original goal.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Already six years ago, in #5783, we noticed that alternator/executor.cc
has grown too large. The previous patches added hundreds of more lines
to it to implement vector search, and it reached a whopping 7,000 lines
of code. This is too much.
This patch splits from executor.cc two major chunks:
1. The implementation of **read** requests - GetItem, BatchGetItem,
Query (base table, GSI/LSI, and vector-search), and Scan - was
moved to a new source file alternator/executor_read.cc.
The new file has 2,000 lines.
2. Moved 250 lines of template functions dealing with attribute paths
and maps of them to a new header file, attribute_path.hh.
These utilities are used for many different operations - various
read operations use them for ProjectionExpression, and UpdateItem
uses them for modifications to nested attributes, so we need the
new header file from both executor.cc and executor_read.cc
The remaining executor.cc is still pretty big, 5,000 lines, and
contains write operations (PutItem, UpdateItem, DeleteItem,
BatchWriteItem) as well as various table and other operations, and
also many utility functions used by many types of operations, so
we can later continue this refactoring effort.
Refs #5783
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When a table has a vector index, writes to the indexed attribute
(via PutItem, UpdateItem, or BatchWriteItem) must supply a value that
is a vector of the appropriate length: It must be a list of exactly the
declared number of elements, where each element is a numeric type ("N")
representable as a 32-bit float. Before this patch, invalid values were
silently accepted and the item was simply not indexed (it was skipped
by the vector store when it read this item). Now these writes are
rejected with a ValidationException.
This is analogous to the existing validation of GSI/LSI key attribute
values - in DynamoDB after a certain attribute becomes the key of a
GSI or LSI, the user is no longer allowed to write the same type.
The implementation we add here is also analogous to the implementation
of the GSI/LSI key validation. The GSI/LSI key validation is done
by validate_value_if_index_key / si_key_attributes, and in this
patch we add the vector-index parallels: vector_index_attributes()
collects the attribute name and declared dimensions for every vector
index in the schema, and validate_value_if_vector_index_attribute()
enforces the type limitations.
For efficiency in the common case where a table has no vector indexes
and no GSIs/LSIs, both validation functions are out-of-line and each
call site guards the call with an explicit empty() check, so no
function-call overhead is incurred when there is nothing to validate.
For UpdateItem, the map of vector index attributes is cached in
update_item_operation (alongside the existing _key_attributes cache)
to avoid recomputing it on every call to update_attribute().
Add to DescribeTable's output for VectorIndexes two fields - IndexStatus
and Backfilling - which are intended to exactly mirror these two fields
that exist for GlobalSecondaryIndexes:
When a vector index is added, IndexStatus is "CREATING" before the index
is usable, and "ACTIVE" when it is finally usable for a Query. During
"CREATING" phase, "Backfilling" may be set to true when the index is
currently being backfilled (the table is scaned and an index is built).
A user is expected to call DescribeTable in a loop after creating a
vector index (via either CreateTable and UpdateTable) and only call
Query on the index after the IndexStatus is finally ACTIVE. Calling
Query earlier, while IndexStatus is still CREATING, will result in an
error.
In the current implementation, Alternator does not track the state of the
vector index, so it needs to contact the vector store to inquire about
the state of the index - using a new function introduced in this patch
that uses an existing vector-store API. This makes DescribeTable slower
on tables that have vector indexes, because the vector store is contacted
on every DescribeTable call.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We introduce to the Query request a new "VectorSearch" parameter, which
take a mandatory "QueryVector" (a value which must be a numeric vector
of the right length) and "Limit".
The "Limit" of a vector search (Query with VectorSearch) determines the
number of nearest neighbors to return, and does not allow pagination
(ExclusiveKeyStart is not allowed). ConsistentRead=True is also not
allowed on a vector search query.
The "Select"/"ProjectionExpression"/"AttributesToGet" parameters are
also supported, requesting which attributes to fetch. Using Select=
ALL_PROJECTED_ATTRIBUTES means read only the attributes found in the
vector index - currently only the key columns - so it is significantly
faster than ALL_ATTRIBUTES because it doesn't require reading the items
from the base table.
The "FilterExpression" parameter is also supported. Like in DynamoDB's
traditional Query, this does post-filtering, i.e., removing some of the
results returned by the vector index that don't match the filter, and
as a result fewer than Limit results may be returned.
Pre-filtering (done on the vector store, and always returns Limit
results) is not yet implemented.
In commit a55c5e9ec7, the function
describe_multi_item() got a new item_callback parameter, that can
be used to calculate the size of the item. This new parameter
has a default, an empty noncopyable_function. But an empty
noncopyable_function shouldn't be called - exactly like std::function,
it throws std::bad_function_call if called when empty.
So describe_multi_item() should only call this item_callback if
it's not empty.
This became a problem in the next patch, implementing vector search
query, which called describe_multi_item with the default item_callback.
But in general, the function should be usable with the default parameter
(or we shouldn't have defined a default value for this parameter!).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
All the "indexes" we implement in Alternator - GSI, LSI and the new
vector index - share the same IndexName namespace, which we'll use in
Query to refer to the index. In the previous patch we already prevented
adding a vector index with the same name as an existing GSI or LSI.
In this patch we also prevent the reverse - adding a GSI with the name
of an existing vector index.
Additionally, one cannot add a GSI on a key that is already the key of
a vector index: The types conflict: The key of a vector index must be a
vector column, while the key of a GSI must have a standard key type
(string, binary or number).
We have tests for this later, this the big test patch.
After an earlier patch allowed CreateTable to create vector indexes
together with a table, in this patch we add to UpdateTable the ability
to add a new vector index to an existing table, as well as the ability
to delete a vector index from an existing table.
The implementation is inspired by DynamoDB's syntax for GSI - just like
GSI has GlobalSecondaryIndexUpdates with "Create" and "Delete" operations,
for vector indexes we have VectorIndexUpdates supporting Create and
Delete. "Update" is not yet supported - we didn't implement yet any
parameter that can be updated - but we can easily implement it in the
future.
ScyllaDB supports the "vector search" feature in CQL.
In this patch we start the path to adding vector search support also to
Alternator.
In this patch, we implement CreateTable support - allowing the user to enable
vector search in a new table. The following patches will enable additional
operations like UpdateTable (adding a vector index to an existing table or
deleting a vector index to an existing table) and DescribeTable.
Extensive tests for all these features will come at the end of the series.
Those tests were written in parallel with writing this implementation so cover
(hopefully) every nook and cranny of the imlementation.
Previously Alternator, when emit Amazon's ARN would not stick to the
standard. After our attempt to run KCL with scylla we discovered few
issues.
Amazon's ARN looks like this:
arn:partition:service:region:account-id:resource-type/resource-id
for example:
arn:aws:dynamodb:us-west-2:111122223333:table/TestTable/stream/2015-05-11T21:21:33.291
KCL checks for:
- ARN provided from Alternator calls must fit with basic Amazon's ARN
pattern shown above,
- region constisting only of lower letter alphabets and `-`, no
underscore character
- account-id being only digits (exactly 12)
- service being `dynamodb`
- partition starting with `aws`
The patch updates our code handling ARNs to match those findings.
1. Split `stream_arn` object into `stream_arn` - ARN for streams only and
`stream_shard_id` - id value for stream shards. The latter receives original
implementation. The former emits and parses ARN in a Amazon style.
for example:
2. Update new `stream_arn` class to encode keyspace and table together
separating them by `@`. New ARN looks like this:
arn:aws:dynamodb:us-east-1:000000000000:table/TestKeyspace@TestTable/stream/2015-05-11T21:21:33.291
3. hardcode `dynamodb` as service, `aws` as partition, `us-east-1` as
region and `000000000000` as account-id (must have 12 digits)
4. Update code handling ARNs for tags manipulation to be able to parse
Amazon's style ARNs. Emiting code is left intact - the parser is now
capable of parsing both styles.
5. Added unit tests.
Fixes#28350
Fixes: SCYLLADB-539
Fixes: #28142Closesscylladb/scylladb#28187
This series fixes a metrics visibility gap in Alternator and adds regression coverage.
Until now, BatchGetItem and BatchWriteItem updated global latency histograms but did not consistently update per-table latency histograms. As a result, table-level latency dashboards could miss batch traffic.
It updates the batch read/write paths to compute request duration once and record it in both global and per-table latency metrics.
Add the missing tests, including a metric-agnostic helper and a dedicated per-table latency test that verifies latency counters increase for item and batch operations.
This change is metrics-only (no API/behavior change for requests) and improves observability consistency between global and per-table views.
Fixes#28721
**We assume the alternator per-table metrics exist, but the batch ones are not updated**
Closesscylladb/scylladb#28732
* github.com:scylladb/scylladb:
test(alternator): add per-table latency coverage for item and batch ops
alternator: track per-table latency for batch get/write operations
assert(), and SCYLLA_ASSERT() are evil (Refs #7871) because they can cause the entire Scylla cluster to crash mysteriously instead of cleanly failing the specific request that encountered a serious problem of failed pre-requisite.
In this two-patch series, in the first patch we introduce a new macro throwing_assert(), a convenient drop-in replacement for SCYLLA_ASSERT() but which has all the benefits of on_internal_error() instead of the dangers of SCYLLA_ASSERT().
In the second patch we use the new function to replace every call to SCYLLA_ASSERT() in Alternator by the new throwing_assert().
Here is an example from the second patch to demonstrate the power of this approach: The Alternator code uses the attrs_column() function to retrieve the ":attrs" column of a schema. Since every Alternator table always has an ":attrs" column in its schema, we felt safe to SCYLLA_ASSERT() that this column exists. However, imagine that one day because of a bug, one Alternator table is missing this column. Or maybe not a bug - maybe a malicious user on a shared cluster found a way to deliberately delete this column (e.g, with a CQL command!) and this check fails. Before this patch, the entire Scylla node will crash. If the same request is sent to all nodes - the entire cluster will crash. The user might not even know which request caused this crash. In contrast, after this patch, the specific operation - e.g., PutItem - will get an exception. Only this operation, and nothing else, will be aborted, and the user who sent this request will even get an "Internal Server Error" with the assertion-failure message, alerting them that this specific query is causing problems, while other queries might work normally.
There's no need to backport this patch - unless it becomes annoying that other branches don't have the throwing_assert() function and we want it to ease other backports.
Fixes#28308.
Closesscylladb/scylladb#28445
* github.com:scylladb/scylladb:
alternator: replace SCYLLA_ASSERT with throwing_assert
utils: introduce throwing_assert(), a safe replacement for assert
Batch operations were updating only global latency histograms, which left
table-level latency metrics incomplete.
This change computes request duration once at the end of each operation and
reuses it to update both global and per-table latency stats:
Latencies are stored per table used,
This aligns batch read/write metric behavior with other operations and improves
per-table observability.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
TTL_TAG_KEY stores the name of the tag in which we store the name of the
table's expiration-time column, for Alternator's TTL feature.
We already need this name in two source files, and soon we'll need it
in more files - as we want to use the same implementation also for for
a new per-row TTL feature in CQL. So it's time to move the declaration
of this variable to a new header file - alternator/ttl_tag.hh.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Replace all calls to SCYLLA_ASSSERT() in Alternator by the better and
safer throwing_assert() introduced in the previous patch.
As a result of this patch, if one of the call sites for these asserts
is buggy and ever fails, only the involved operation will be killed
by an exception, instead of crashing the whole server - and often the
entire cluster (as the same buggy request reaches all nodes and
crashes them all).
Additionally, this patch replaces a few existing uses in Alternator
of on_internal_error() with a non-interesting message with a
more-or-less equivalent, but shorter, throwing_assert(). The idea is
to convert the verbose idiom:
if (!condition) {
on_internal_error(logger, "some error message")
}
With the shorter
throwing_assert(condition)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Copilot found these typos in comments and variable name in alternator/,
so might as well fix them.
There are no functional changes in this patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28447
In 8df61f6d99 we changed the requirements for creating materialized
views and MV-based indexes - instead of requiring the
rf_rack_valid_keyspaces flag to be set, we now require the keyspace to
be RF-rack-valid at the time of creation, and it is enforced to remain
RF-rack-valid while the MV exists. This validation is done in the cql
create view/index statements.
The same should be done also for alternator - when creating a table with
GSI or LSI, or when adding a GSI to an existing table, previously we
required the flag rf_rack_valid_keyspaces to be set. Now we change it to
instead check if the keyspace is RF-rack-valid, and if not the operation
fails with an appropriate error.
Following 954f2cbd2f, which added proxy protocol v2 listeners
for CQL, we do the same for alternator. We add two optional ports
for plain and TLS-wrapped HTTP.
We test each new port, that the old ports still work, and that
mixing up a port with no proxy protocol and a connection with proxy
protocol (or the opposite) fails. The latter serves to show
that the testing strategy is valid and doesn't just pass whatever
happens. We also verify that the correct addresses (and TLS mode)
show up in system.clients.
Closesscylladb/scylladb#27889
When creating an alternator table with tablets, if it has an index, LSI
or GSI, require the config option rf_rack_valid_keyspaces to be enabled.
The option is required for materialized views in tablets keyspaces to
function properly and avoid consistency issues that could happen due to
cross-rack migrations and pairing switches when RF-rack validity is not
enforced.
Currently the option is validated when creating a materialized view via
the CQL interface, but it's missing from the alternator interface. Since
alternator indexes are based on materialized views, the same check
should be added there as well.
Fixesscylladb/scylladb#27612Closesscylladb/scylladb#27622
This patch-set consolidates and corrects rjson string conversion handling.
It removes unnecessary string copies, ensures proper length usage and
replaces ad-hoc conversions with consistent helper functions.
Overall, the changes make rjson string handling safer, faster, and more uniform across the codebase.
Backport: no, it's a refactor
Closesscylladb/scylladb#27394
* github.com:scylladb/scylladb:
fix rjson::value to bytes conversion with missing GetStringLength call
alternator: change type from string to string_view in should_add_capacity
fix rjson::value to string_view conversion with missing GetStringLength call
use rjson::to_string_view when rjson::value gets converted using GetStringLength
use rjson::to_sstring and rjson::to_string for various string conversions
utils: use rjson document wrapper in instance_profile_credentials_provider::parse_creds
utils: move rjson::to_string_view func to string related place
utils: add to_sstring and to_string rjson helper
In some cases we unnecessarily convert to string which
causes a copy. In other we convert without calling
GetStringLength which causes iteration to dermine length
which is already known. In some cases we do even both.
This commit fixes that.
Before this series, we kept the cas_shard on the original shard to
guard against tablet movements running in parallel with
storage_proxy::cas.
The bug addressed by this PR shows that this approach is flawed:
keeping the cas_shard on the original shard does not guarantee that
a new cas_shard acquired on the target shard won’t require another
jump.
We fixed this in the previous commit by checking cas_shard.this_shard()
on the target shard and continuing to jump to another shard if
necessary. Once cas_shard.this_shard() on the target shard returns
true, the storage_proxy::cas invariants are satisfied, and no other
cas_shard instances need to remain alive except the one passed
into storage_proxy::cas.
This change ensures that if cas_shard points to a different shard,
the executor will continue issuing shard jumps until
cas_shard.this_shard() returns true. The commit simply moves the
this_shard() check from the parallel_for_each lambda into cas_write,
with minimal functional changes.
We enable test_alternator_invalid_shard_for_lwt since now it should
pass.
Fixesscylladb/scylladb#27353
We will need to access executor::_stats field from cas_write. We could
pass it as a paramter, but it seems simpler to just make cas_write
and instance method too.
This test reproduces scylladb/scylladb#27353 using two injection
points. First, the test triggers an intra-node tablet migration and
suspends it at the streaming stage using the
intranode_migration_streaming_wait injection. Next, it enables the
alternator_executor_batch_write_wait injection, which suspends a
batch write after its cas_shard has already been created.
The test then issues several batch writes and waits until one of them
hits this injection on the destination shard. At this point, the
cas_shard.erm for that write is still in the streaming state,
meaning the executor would need to jump back to the source shard.
The test then resumes the suspended tablet migration, allowing it to
update the ERM on the source shard to write_both_read_new. After that,
the test releases the suspended batch write and expects it to perform
two shard jumps: first from the destination to the source shard, and
then again back to the source shard.
This commit adds the alternator_executor_batch_write_wait injection to
alternator/executor.cc. Coroutines are intentionally avoided in the
parallel_for_each lambda to prevent unnecessary coroutine-frame
allocations.
This commit is an optimization: avoiding destruction of
foreign objects on the wrong shard. Releasing objects allocated on a
different shard causes their ::free calls to be executed remotely,
which adds unnecessary load to the SMP subsystem.
Before this patch, a std::vector could be moved
to another shard. When the vector was eventually destroyed,
its ::free had to be marshalled back to the shard where the memory had
originally been allocated. This change avoids that overhead by passing
the vector by const reference instead.
The referenced objects lifetime correctness reasoning:
* the put_or_delete_item refs usages in put_or_delete_item_cas_request
are bound to its lifetime
* cas_request lifetime is bound to storage_proxy::cas future
* we don't release put_or_delete_item-s untill all storage_proxy::cas
calls are done.
In the next commit we want to add an optimization that relies on
precise control over the lifetime of cas_request. In particular, we
want the implementation of this interface in Alternator to operate on
raw references that are guaranteed to remain valid only until the
cas() future is resolved. We already depend on the same lifetime
assumptions in cas_request when used by modification_statement.
However, these assumptions are not clearly expressed in the current
interface: cas_request is taken by shared_ptr, and nothing prevents
cas() from storing that pointer inside paxos_response_handler, which
may outlive the cas() future.
This commit fixes that by taking cas_request by raw reference. This
makes it explicit that cas() does not assume ownership of the object.
Callers must ensure that the referenced object remains valid until
the returned future is resolved.