Commit Graph

63 Commits

Author SHA1 Message Date
Nadav Har'El
84143c2ee5 alternator: implement Select option of Query and Scan
This patch implements the previously-unimplemented Select option of the
Query and Scan operators.

The most interesting use case of this option is Select=COUNT which means
we should only count the items, without returning their actual content.
But there are actually four different Select settings: COUNT,
ALL_ATTRIBUTES, SPECIFIC_ATTRIBUTES, and ALL_PROJECTED_ATTRIBUTES.

Five previously-failing tests now pass, and their xfail mark is removed:

 *  test_query.py::test_query_select
 *  test_scan.py::test_scan_select
 *  test_query_filter.py::test_query_filter_and_select_count
 *  test_filter_expression.py::test_filter_expression_and_select_count
 *  test_gsi.py::test_gsi_query_select_1

These tests cover many different cases of successes and errors, including
combination of Select and other options. E.g., combining Select=COUNT
with filtering requires us to get the parts of the items needed for the
filtering function - even if we don't need to return them to the user
at the end.

Because we do not yet support GSI/LSI projection (issue #5036), the
support for ALL_PROJECTED_ATTRIBUTES is a bit simpler than it will need
to be in the future, but we can only finish that after #5036 is done.

Fixes #5058.

The most intrusive part of this patch is a change from attrs_to_get -
a map of top-level attributes that a read needs to fetch - to an
optional<attrs_to_get>. This change is needed because we also need
to support the case that we want to read no attributes (Select=COUNT),
and attrs_to_get.empty() used to mean that we want to read *all*
attributes, not no attributes. After this patch, an unset
optional<attrs_to_get> means read *all* attributes, a set but empty
attrs_to_get means read *no* attributes, and a set and non-empty
attrs_to_get means read those specific attributes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220405113700.9768-2-nyh@scylladb.com>
2022-04-11 10:04:32 +02:00
Avi Kivity
e74f570eda alternator: streams: fix use-after-free of data_dictionary in describe_stream()
In 4aa9e86924 ("Merge 'alternator: move uses of replica module to
data_dictionary' from Avi Kivity"), we changed alternator to use
data_dictionary instead of replica::database. However,
data_dictionary::database objects are different from replica::database
objects in that they don't have a stable address and need to be
captured by value (they are pointer-like). One capture in
describe_stream() was capturing a data_dictionary::database
by reference and so caused a use-after-free when the previous
continuation was deallocated.

Fix by capturing by value.

Fixes #9952.

Closes #9954
2022-01-25 09:52:30 +02:00
Nadav Har'El
4aa9e86924 Merge 'alternator: move uses of replica module to data_dictionary' from Avi Kivity
Alternator is a coordinator-side service and so should not access
the replica module. In this series all but one of uses of the replica
module are replaced with data_dictionary.

One case remains - accessing the replication map which is not
available (and should not be available) via the data dictionary.

The data_dictionary module is expanded with missing accessors.

Closes #9945

* github.com:scylladb/scylla:
  alternator: switch to data_dictionary for table listing purposes
  data_dictionary: add get_tables()
  data_dictionary: introduce keyspace::is_internal()
2022-01-19 11:34:25 +02:00
Avi Kivity
7399f3fae7 alternator: switch to data_dictionary for table listing purposes
As a coordinator-side service, alternator shouldn't touch the
replica module, so it is migrated here to data_dictionary.

One use case still remains that uses replica::keyspace - accessing
the replication map. This really isn't a replica-side thing, but it's
also not logically part of the data dictionary, so it's left using
replica::keyspace (using the data_dictionary::database::real_database()
escape hatch). Figuring out how to expose the replication map to
coordinator-side services is left for later.
2022-01-19 11:03:36 +02:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Nadav Har'El
8bcd23fa02 Merge: move rest of internal ddl users to use raft from Gleb
The patch series moves the rest of internal ddl users to do schema
change over raft (if enabled). After that series only tests are left
using old API.

* 'gleb/raft-schema-rest-v6' of github.com:scylladb/scylla-dev: (33 commits)
  migration_manager: drop no longer used functions
  system_distributed_keyspace: move schema creation code to use raft
  auth: move table creation code to use raft
  auth: move keyspace creation code to use raft
  table_helper: move schema creation code to use raft
  cql3: make query_processor inherit from peering_sharded_service
  table_helper: make setup_table() static
  table_helper: co-routinize setup_keyspace()
  redis: move schema creation code to go through raft
  thrift: move system_update_column_family() to raft
  thrift: authenticate a statement before verifying in system_update_column_family()
  thrift: co-routinize system_update_column_family()
  thrift: move system_update_keyspace() to raft
  thrift: authenticate a statement before verifying in system_update_keyspace()
  thrift: co-routinize system_update_keyspace()
  thrift: move system_drop_keyspace() to raft
  thrift: authenticate a statement before verifying in system_drop_keyspace()
  thrift: co-routinize system_drop_keyspace()
  thrift: move system_add_keyspace() to raft
  thrift: co-routinize system_add_keyspace()
  ...
2022-01-12 18:09:08 +02:00
Gleb Natapov
0ac20b5494 alternator: make some functions static
Make add_stream_options, supplement_table_info, supplement_table_stream_info static. They only need a pointer
to storage_proxy, so pass it directly.
2022-01-12 16:33:15 +02:00
Calle Wilund
0c1ff5c2f5 alternator::streams: Use streamed result in get_records if large result
If we have a resonable result set to send back to client, use direct
streaming of the object.

Todo: determine threshold.
2022-01-12 13:34:49 +00:00
Avi Kivity
bbad8f4677 replica: move ::database, ::keyspace, and ::table to replica namespace
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.

References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.

scylla-gdb.py is adjusted to look for both the new and old names.
2022-01-07 12:04:38 +02:00
Avi Kivity
ae3a360725 database: Move database, keyspace, table classes to replica/ directory
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.

As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
2022-01-06 17:07:30 +02:00
Nadav Har'El
5e52858295 rjson, alternator: rename set() functions add()
The rjson::set() *sounds* like it can set any member of a JSON object
(i.e., map), but that's not true :-( It calls the RapidJson function
AddMember() so it can only add a member to an object which doesn't have
a member with the same name (i.e., key). If it is called with a key
that already has a value, the result may have two values for the same
key, which is ill-formed and can cause bugs like issue #9542.

So in this patch we begin by renaming rjson::set() and its variant to
rjson::add() - to suggest to its user that this function only adds
members, without checking if they already exist.

After this rename, I was left with dozens of calls to the set() functions
that need to changed to either add() - if we're sure that the object
cannot already have a member with the same name - or to replace() if
it might.

The vast majority of the set() calls were starting with an empty item
and adding members with fixed (string constant) names, so these can
be trivially changed to add().

It turns out that *all* other set() calls - except the one fixed in
issue #9542 - can also use add() because there are various "excuses"
why we know the member names will be unique. A typical example is
a map with column-name keys, where we know that the column names
are unique. I added comments in front of such non-obvious uses of
add() which are safe.

Almost all uses of rjson except a handful are in Alternator, so I
verified that all Alternator test cases continue to pass after this
patch.

Fixes #9583
Refs #9542

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211104152540.48900-1-nyh@scylladb.com>
2021-11-04 16:35:38 +01:00
Avi Kivity
2d25705db0 cql3: deinline non-trivial methods in selection.hh
This allows us to forward-declare raw_selector, which in turn reduces
indirect inclusions of expression.hh from 147 to 58, reducing rebuilds
when anything in that area changes.

Includes that were lost due to the change are restored in individual
translation units.

Closes #9434
2021-10-05 12:58:55 +02:00
Pavel Emelyanov
0fd00d7016 cdc: Add database argument to is_log_for_some_table
All callers has been patched already. This argument can now
be used to replace get_local_storage_proxy().get_db().local()
call.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-27 14:07:26 +03:00
Piotr Dulikowski
5a0942a0f8 utils,alternator: move base64 code from alternator to utils
The base64 encoding/decoding functions will be used for serialization of
hint sync point descriptions. Base64 format is not specific to
Alternator, so it can be moved to utils.
2021-08-09 09:24:36 +02:00
Pavel Emelyanov
c39f04fa6f code: Remove storage-service header from irrelevant places
Some .cc files over the code include the storage service
for no real need. Drop the header and include (in some)
what's really needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-22 18:50:19 +03:00
Avi Kivity
a55b434a2b treewide: extent copyright statements to present day 2021-06-06 19:18:49 +03:00
Pavel Solodovnikov
2187a59089 treewide: move service::cas_request out from storage_proxy.hh
And remove all remaining inclusions of `storage_proxy.hh` in the
headers.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-06 19:18:49 +03:00
Konstantin Osipov
c83cf1f965 uuid: switch the API to use std::chrono
A follow up for the patch for #7611. This change was requested
during review and moved out of #7611 to reduce its scope.

The patch switches UUID_gen API from using plain integers to
hold time units to units from std::chrono.

For one, we plan to switch the entire code base to std::chrono units,
to ensure type safety. Secondly, using std::chrono units allows to
increase code reuse with template metaprogramming and remove a few
of UUID_gen functions that beceme redundant as a result.

* switch  get_time_UUID(), unix_timestamp(), get_time_UUID_raw(), switch
  min_time_UUID(), max_time_UUID(), create_time_safe() to
  std::chrono
* remove unused variant of from_unix_timestamp()
* remove unused get_time_UUID_bytes(), create_time_unsafe(),
  redundant get_adjusted_timestamp()
* inline get_raw_UUID_bytes()
* collapse to similar implementations of get_time_UUID()
* switch internal constants to std::chrono
* remove unnecessary unique_ptr from UUID_gen::_instance
Message-Id: <20210406130152.3237914-2-kostja@scylladb.com>
2021-04-06 17:12:54 +03:00
Calle Wilund
8bbc976ff1 alternator::streams: Use better method for generation timestamp
Get timestamp via system_distributed, instead of local gen.
2021-03-03 15:46:38 +00:00
Kamil Braun
e2f03e4aba cdc: move (most of) CDC generation management code to the new service
Currently all management of CDC generations happens in storage_service,
which is a big ball of mud that does many unrelated things.

Previous commits have introduced a new service for managing CDC
generations. This code moves most of the relevant code to this new
service.

However, some part still remains in storage_service: the bootstrap
procedure, which happens inside storage_service, must also do some
initialization regarding CDC generations, for example: on restart it
must retrieve the latest known generation timestamp from disk; on
bootstrap it must create a new generation and announce it to other
nodes. The order of these operations w.r.t the rest of the startup
procedure is important, hence the startup procedure is the only right
place for them.

Still, what remains in storage_service is a small part of the entire
CDC generation management logic; most of it has been moved to the
new service. This includes listening for generation changes and
updating the data structures for performing CDC log writes (cdc::metadata).
Furthermore these functions now return futures (and are internally
coroutines), where previously they required a seastar::async context.
2021-02-26 12:06:12 +01:00
Kamil Braun
67d4e5576d sys_dist_ks: split CDC streams table partitions into clustered rows
Until now, the lists of streams in the `cdc_streams_descriptions` table
for a given generation were stored in a single collection. This solution
has multiple problems when dealing with large clusters (which produce
large lists of streams):
1. large allocations
2. reactor stalls
3. mutations too large to even fit in commitlog segments

This commit changes the schema of the table as described in issue #7993.
The streams are grouped according to token ranges, each token range
being represented by a separate clustering row. Rows are inserted in
reasonably large batches for efficiency.

The table is renamed to enable easy upgrade. On upgrade, the latest CDC
generation's list of streams will be (re-)inserted into the new table.

Yet another table is added: one that contains only the generation
timestamps clustered in a single partition. This makes it easy for CDC
clients to learn about new generations. It also enables an elegant
two-phase insertion procedure of the generation description: first we
insert the streams; only after ensuring that a quorum of replicas
contains them, we insert the timestamp. Thus, if any client observes a
timestamp in the timestamps table (even using a ONE query),
it means that a quorum of replicas must contain the list of streams.
2021-02-18 11:44:59 +01:00
Nadav Har'El
7c5db2da83 alternator: overhaul ProjectionExpression hierarchy implementation
For ProjectionExpression we implemented a hierarchical filter object which
can be used to hold a tree of attribute paths groups by a the top-level
attributes, and also detect overlapping and conflicting entries.

For UpdateExpression, we need almost exactly the same object: We need to
group update actions (e.g., SET a.b=3) by the top-level attribute, and
also detect and fail overlapping or conflicting paths.

So in this patch we rewrite the data structure we had for ProjectionExpression
in a more genric manner, using the template attribute_path_map<T> - which
holds data of type T for each attribute path. We also implement a template
function attribute_path_map_add() to add a path/value pair to this map,
and includes all the overlap and conflict detecting logic.

There shouldn't be functional changes in this patch. The ProjectionExpression
code uses the new generic code instead of the specific code, but should work
the same. In the next patch we can use the new generic code to implement
UpdateExpression as well.

The only somewhat functional change is better error messages for
conflicting or overlapping paths - which now include one of the
conflicting paths.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
6340619e69 alternator: overhaul attrs_to_get handling
In the existing code, the variable "attrs_to_get" is a list of top-level
attributes to fetch for an item. It is used to implement features like
ProjectionExpression or AttributesToGet in GetItem and other places.

However, to support attribute paths (e.g., a.b.c[2]) in ProjectionExpression,
i.e., issue #5024, we need more than that. We still need to know the top-
level attribute "a", because this is the granularity we have in the Scylla
table (all the content inside "a" is serialized as a single JSON); But we
also need to remember exactly which parts *inside* "a" we will need to
extract and return.

So in this patch we add a new type, "attrs_to_get", which is more than
just a list of top-level attributes. Instead, it is a *map*, whose keys
are the top-level attributes, and the value for each of them is a
"hierarchy_filter", an object which describes which part of the attribute
is needed.

This patch includes the code which converts the AttributesToGet and
ProjectionExpression into the new attrs_to_get structure. During this
conversion, we recognize two kinds of errors which DynamoDB complains
about: We recognize "overlapping" attributes (e.g., requesting both
a.b and a.b.c) and "conflicting" attributes (e.g, requesting both
a.b and a[1]). After this, two xfailing tests we had for detecting
these overlap and conflicts finally pass and their "xfail" label is
removed.

After this patch, we have the attrs_to_get object which can allow us
to filter only the requested pieces of the top-level attributes, but
we don't use it yet - so this patch is not enough for complete support
of attribute paths in ProjectionExpression. We will complete this
support in the next patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 14:16:40 +02:00
Piotr Jastrzebski
d2897d8f8b alternator: guard streams with an experimental flag
Add new alternator-streams experimental flag for
alternator streams control.

CDC becomes GA and won't be guarded by an experimental flag any more.
Alternator Streams stay experimental so now they need to be controlled
by their own experimental flag.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-12 12:36:16 +01:00
Benny Halevy
3fab0f8694 storage_proxy: convert to shared_token_metadata
get() the latest token_metadata_ptr from the
shared_token_metadata before each use.

expose get_token_metadata_ptr() rather than get_token_metadata()
so that caller can keep it across continuations.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Calle Wilund
1db9da2353 alternator::streams: Workaround fix for apparent code gen bug in seq_number
Fixes #7325

When building with clang on fedora32, calling the string_view constructor
of bignum generates broken ID:s (i.e. parsing borks). Creating a temp
std::string fixes it.

Closes #7542
2020-11-04 09:26:08 +02:00
Calle Wilund
7c8f457bab alternator::streams: Reduce the query limit depending on cdc opts
Avoid querying much more than needed.
Since we have exact row markers now, this is more safe to do.
2020-11-02 08:37:27 +00:00
Calle Wilund
c79108edbb alternator::streams: Use end-of-record info in get_records
Fixes #7496

Since cdc log now has an end-of-batch/record marker that tells
us explicitly that we've read the last row of a change, we
can use this instead of timestamp checks + limit extra to
ensure we have complete records.

Note that this does not try to fulfill user query limit
exact. To do this we would need to add a loop and potentially
re-query if quried rows are not enough. But that is a
separate exercise, and superbly suited for coroutines!
2020-11-02 08:35:36 +00:00
Calle Wilund
1bc96a5785 alternator::streams: Make describe_stream use actual log ttl as window
Allows QA to bypass the normal hardcoded 24h ttl of data and still
get "proper" behaviour w.r.t. available stream set/generations.
I.e. can manually change cdc ttl option for alternator table after
streams enabled. Should not be exposed, but perhaps useful for
testing.

Closes #7483
2020-10-26 12:16:36 +02:00
Calle Wilund
83339f4bac Alternator::streams: Make SequenceNumber monotinically growing
Fixes #7424

AWS sdk (kinesis) assumes SequenceNumbers are monotonically
growing bigints. Since we sort on and use timeuuids are these
a "raw" bit representation of this will _not_ fulfill the
requirement. However, we can "unwrap" the timestamp of uuid
msb and give the value as timestamp<<64|lsb, which will
ensure sort order == bigint order.
2020-10-14 16:45:21 +03:00
Calle Wilund
3f800d68c6 alternator::streams: Ensure shards are reported in string lexical order
Fixes #7409

AWS kinesis Java sdk requires/expects shards to be reported in
lexical order, and even worse, ignores lastevalshard. Thus not
upholding said order will break their stream intropection badly.

Added asserts to unit tests.

v2:
* Added more comments
* use unsigned_cmp
* unconditional check in streams_test
2020-10-14 16:45:21 +03:00
Calle Wilund
1ed864ce4c alternator::streams: Set dynamodb data TTL explicitly in cdc options
They should be the same by default, but setting it explicitly protects
us from any changing defaults.
2020-10-07 08:43:39 +00:00
Calle Wilund
04deacd7e7 alternator::streams: Improve paging and fix parent-child calculation
Fixes #7345
Fixes #7346

Do a more efficient collection skip when doing paging, instead of
iterating the full sets.

Ensure some semblance of sanity in the parent-child relationship
between shards by ensuring token order sorting and finding the
apparent previous ID coverting the approximate range of new gen.

Fix endsequencenumber generation by looking at whether we are
last gen or not, instead of the (not filled in) 'expired' column.
2020-10-07 08:43:39 +00:00
Calle Wilund
3cdd7fe191 alternator::streams: Remove table from shard_id
Fixes #7344

It is not data really needed, as shard_id:s are not required
to be unique across streams, and also because the length limit
on shard_id text representation.

As a side effect, shard iter instead carries the stream arn.
2020-10-07 08:43:39 +00:00
Calle Wilund
f1ad66218a alternator::streams: Filter our cdc streams older than data/table
Fixes #7347

If cdc stream id:s are older than either table creation or now - 24h
we can skip them in describe_stream, to minimize the amount of
shards being returned.
2020-10-07 06:13:28 +00:00
Nadav Har'El
4c2e026e04 alternator streams: fix NextShardIterator for closed shard
As the test test_streams_closed_read confirmed, when a stream shard is
closed, GetRecords should not return a NextShardIterator at all.
Before this patch we wrongly returned an empty string for it.

Before this patch, several Alternator Stream tests (in test_streams.py)
failed when running against a multi-node Scylla cluster. The reason is as
follows: As a multi-node cluster boots and more and more nodes enter the
cluster, the cluster changes its mind about the token ownership, and
therefore the list of stream shards changes. By the time we have the full
cluster, a bunch of shards were created and closed without any data yet.
All the tests will see these closed shards, and need to understand them.
The fetch_more() utility function correctly assumed that a closed shard
does not return a NextShardIterator, and got confused by the empty string
we used to return.

Now that closed shards can return responses without NextShardIterator,
we also needed to fix in this patch a couple of tests which wrongly assumed
this can't happen. These tests did not fail on DynamoDB because unlike in
Scylla, DynamoDB does not have any closed shards in normal tests which
do not specifically cause them (only test_streams_closed_read).

We also need to fix test_streams_closed_read to get rid of an unnecessary
assumption: It currently assumes that when we read the very last item in
a closed shard is read, the end-of-shard is immediately signaled (i.e.,
NextShardIterator is not returned). Although DynamoDB does in fact do this,
it is also perfectly legal for Alternator's implementation to return the
last item with a new NextShardIterator - and only when the client reads
from that iterator, we finally return the signal the end of the shard.

Fixes #7237.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200922082529.511199-1-nyh@scylladb.com>
2020-09-23 09:25:10 +02:00
Avi Kivity
cc3c9ba03a alternator/streams: don't use non-existent std::ostringstream::view()
We call ostringstream::view(), but that member doesn't exist. It
works because it is guarded by an #ifdef and the guard isn't satisified,
but if it is (as with clang) it doesn't compile. Remove it.
2020-09-21 16:32:10 +03:00
Calle Wilund
7224ae6d38 alternator: Set CDC delta to keys only for alternator streams
Fixes #7190

Since we don't use any delta value when translating cdc -> streams
it is wasteful to write these to the log table, esp. since we already
write big fat pre- and post images.
2020-09-07 14:27:54 +00:00
Calle Wilund
f7bb0baba7 alternator: Include stream spec in desc for create/update/describe
Fixes #7163

If enabled, the resulting table description should include a
StreamDescription object with the appropriate members describing
current stream settings.
2020-09-07 14:26:21 +00:00
Calle Wilund
e6266d5652 alternator: Include LatestStreamLabel in resulting desc for create/update table
Fixes #7162

Same value as 'StreamLabel' in the currently active stream (cdc log) if
enabled.
2020-09-07 14:24:48 +00:00
Calle Wilund
fa68493d64 alternator: Make "StreamLabel" an iso8601 timestamp
Fixes #7164

See https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_TableDescription.html
StreamLabel: A timestamp, in ISO 8601 format, for this stream

Scylla tables do not have a timestamp as such, but the UUID for a given schema
is a timeuuid, so we can misuse this to fake a creation timestamp.
2020-09-07 14:24:00 +00:00
Calle Wilund
a7d021ee57 alternator: Fix sequence number range using wrong format
Fixes #7158

A streams shard  descriptions has a sequence range describing start/end
(if available) of the shard. This is specified as being "numeric only".

Alternator incorrectly used UUID here, which breaks kinesis.

v2:
* Fix uint128_t parsing from string. bmp::number constructor accepted
  sstring, but did not interpret it as std::string/chars. Weird results.
2020-09-07 12:01:22 +00:00
Calle Wilund
f5c79d15a8 alternator: Include stream arn in table description if enabled
Fixes #7157

When creating/altering/describing a table, if streams are enabled, the
"latest active" stream arn should be included as LatestStreamArn.

Not doing so breaks java kinesis.
2020-09-07 08:16:11 +00:00
Nadav Har'El
52f92b886b alternator streams: fix bug returning the same change again
This patch fixes a bug which caused sporadic failures of the Alternator
test - test_streams.py::test_streams_last_result.

The GetRecords operation reads from an Alternator Streams shard and then
returns an "iterator" from where to continue reading next time. Because
we obviously don't want to read the same change again, we "incremented"
the current position, to start at the incremented position on the next read.

Unfortunately, the implementation of the increment() function wasn't quite
right. The position in the CDC log is a timeuuid, which has a really bizarre
comparison function (see compare_visitor in types.cc). In particular the
least-sigificant bytes of the UUID are compared as *signed* bytes. This
means that if the last byte of the UUID was 127, and increment() increased
it to 128, and this was wrong because the comparison function later deemed
that as a signed byte, where 128 is lower than 127, not higher! The result
was that with 1/256 probability (whenever the last byte of the position was
127) we would return an item twice. This was reproduced (with 1/256
probability) by the test test_streams_last_result, as reported in issue #7004.

The fix in this patch is to drop the increment() and replace it by a flag
whether an iterator is inclusive of the threshold (>=) or exclusive (>).
The internal representation of the iterator has a boolean flag "inclusive",
and the string representation uses the prefixes "I" or "i" to indicate an
inclusive or exclusive range, respectively - whereas before this patch we
always used the prefix "I".

Although increment() could have been fixed to work correctly, the result would
have been ugly because of the weirdness of the timeuuid comparison function.
increment() would also require extensive new unit-tests: we were lucky that
the high-level functional tests caught a 1 in 256 error, but they would not
have caught rarer errors (e.g., 1 in 2^32). Furthermore, I am looking at
Alternator as the first "user" of CDC, and seeing how complicated and
error-prone increment() is, we should not recommend to users to use this
technique - they should use exclusive (>) range queries instead.

Fixes #7004.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200901102718.435227-1-nyh@scylladb.com>
2020-09-01 12:28:39 +02:00
Calle Wilund
678ecc7469 alternator_streams: Include keys in OldImage/NewImage
Fixes #6935
Fixes #7107

DynamoDB streams for some reason duplicate the record keys
into both the "Keys" and "OldImage"/"NewImage" sub-objects
when doing GetRecords.

This patch appends the pk/ck parts into old/new image, and
also removes the previous restrictions on image generation
since cdc now generates more consistent pre/post image
data.
2020-08-26 18:14:09 +00:00
Benny Halevy
dfa5f8ff1e storage_proxy: get rid of mutable get_token_metadata getter
We'd like to strictly control who can modify token metadata
and nobody currently needs a mutable reference to storage_proxy::_token_metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 16:20:34 +03:00
Nadav Har'El
7e01ae089e cdc: avoid including cdc/cdc_options.hh everywhere
Before this patch, modifying cdc/cdc_options.hh required recompiling 264
source files. This is because this header file was included by a couple
other header files - most notably schema.hh, where a forward declaration
would have been enough. Only the handful of source files which really
need to access the CDC options should include "cdc/cdc_options.hh" directly.

After this patch, modifying cdc/cdc_options.hh requires only 6 source files
to be recompiled.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200813070631.180192-1-nyh@scylladb.com>
2020-08-16 14:41:47 +03:00
Calle Wilund
730c5ea283 alternator: Set "preimage" to "full" for streams
Fixes #7030

Dynamo/alternator streams old image data is supposed to
contain the full old value blob (all keys/values).

Setting preimage=full ensures we get even those properties
that have separate columns if they are not part of an actual
modification.
2020-08-12 16:05:00 +00:00
Calle Wilund
a6ad70d3da cdc:stream_id: Encode format version + vnode grouping/index in id
Fixes #6948

Changes the stream_id format from
 <token:64>:<rand:64>
to
 <token:64>:<rand:38><index:22><version:4>

The code will attempt to assert version match when
presented with a stored id (i.e. construct from bytes).
This means that ID:s created by previous (experimental)
versions will break.

Moves the ID encoding fully into the ID class, and makes
the code path private for the topology generation code
path.

Removes some superflous accessors but adds accessors for
token, version and index. (For alternator etc).
2020-08-11 12:48:04 +03:00
Calle Wilund
bf63b8f9f4 alternator::streams: Don't include empty new/old image
Fixes #6933

If old (or new) image for a change set is empty, dynamo will not
include this key at all. Alternator did return an empty object.
This changes it to be excluded on empty.
2020-08-04 07:39:09 +00:00