This patch implements the previously-unimplemented Select option of the
Query and Scan operators.
The most interesting use case of this option is Select=COUNT which means
we should only count the items, without returning their actual content.
But there are actually four different Select settings: COUNT,
ALL_ATTRIBUTES, SPECIFIC_ATTRIBUTES, and ALL_PROJECTED_ATTRIBUTES.
Five previously-failing tests now pass, and their xfail mark is removed:
* test_query.py::test_query_select
* test_scan.py::test_scan_select
* test_query_filter.py::test_query_filter_and_select_count
* test_filter_expression.py::test_filter_expression_and_select_count
* test_gsi.py::test_gsi_query_select_1
These tests cover many different cases of successes and errors, including
combination of Select and other options. E.g., combining Select=COUNT
with filtering requires us to get the parts of the items needed for the
filtering function - even if we don't need to return them to the user
at the end.
Because we do not yet support GSI/LSI projection (issue #5036), the
support for ALL_PROJECTED_ATTRIBUTES is a bit simpler than it will need
to be in the future, but we can only finish that after #5036 is done.
Fixes#5058.
The most intrusive part of this patch is a change from attrs_to_get -
a map of top-level attributes that a read needs to fetch - to an
optional<attrs_to_get>. This change is needed because we also need
to support the case that we want to read no attributes (Select=COUNT),
and attrs_to_get.empty() used to mean that we want to read *all*
attributes, not no attributes. After this patch, an unset
optional<attrs_to_get> means read *all* attributes, a set but empty
attrs_to_get means read *no* attributes, and a set and non-empty
attrs_to_get means read those specific attributes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220405113700.9768-2-nyh@scylladb.com>
In 4aa9e86924 ("Merge 'alternator: move uses of replica module to
data_dictionary' from Avi Kivity"), we changed alternator to use
data_dictionary instead of replica::database. However,
data_dictionary::database objects are different from replica::database
objects in that they don't have a stable address and need to be
captured by value (they are pointer-like). One capture in
describe_stream() was capturing a data_dictionary::database
by reference and so caused a use-after-free when the previous
continuation was deallocated.
Fix by capturing by value.
Fixes#9952.
Closes#9954
Alternator is a coordinator-side service and so should not access
the replica module. In this series all but one of uses of the replica
module are replaced with data_dictionary.
One case remains - accessing the replication map which is not
available (and should not be available) via the data dictionary.
The data_dictionary module is expanded with missing accessors.
Closes#9945
* github.com:scylladb/scylla:
alternator: switch to data_dictionary for table listing purposes
data_dictionary: add get_tables()
data_dictionary: introduce keyspace::is_internal()
As a coordinator-side service, alternator shouldn't touch the
replica module, so it is migrated here to data_dictionary.
One use case still remains that uses replica::keyspace - accessing
the replication map. This really isn't a replica-side thing, but it's
also not logically part of the data dictionary, so it's left using
replica::keyspace (using the data_dictionary::database::real_database()
escape hatch). Figuring out how to expose the replication map to
coordinator-side services is left for later.
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
The patch series moves the rest of internal ddl users to do schema
change over raft (if enabled). After that series only tests are left
using old API.
* 'gleb/raft-schema-rest-v6' of github.com:scylladb/scylla-dev: (33 commits)
migration_manager: drop no longer used functions
system_distributed_keyspace: move schema creation code to use raft
auth: move table creation code to use raft
auth: move keyspace creation code to use raft
table_helper: move schema creation code to use raft
cql3: make query_processor inherit from peering_sharded_service
table_helper: make setup_table() static
table_helper: co-routinize setup_keyspace()
redis: move schema creation code to go through raft
thrift: move system_update_column_family() to raft
thrift: authenticate a statement before verifying in system_update_column_family()
thrift: co-routinize system_update_column_family()
thrift: move system_update_keyspace() to raft
thrift: authenticate a statement before verifying in system_update_keyspace()
thrift: co-routinize system_update_keyspace()
thrift: move system_drop_keyspace() to raft
thrift: authenticate a statement before verifying in system_drop_keyspace()
thrift: co-routinize system_drop_keyspace()
thrift: move system_add_keyspace() to raft
thrift: co-routinize system_add_keyspace()
...
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.
References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.
scylla-gdb.py is adjusted to look for both the new and old names.
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.
As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
The rjson::set() *sounds* like it can set any member of a JSON object
(i.e., map), but that's not true :-( It calls the RapidJson function
AddMember() so it can only add a member to an object which doesn't have
a member with the same name (i.e., key). If it is called with a key
that already has a value, the result may have two values for the same
key, which is ill-formed and can cause bugs like issue #9542.
So in this patch we begin by renaming rjson::set() and its variant to
rjson::add() - to suggest to its user that this function only adds
members, without checking if they already exist.
After this rename, I was left with dozens of calls to the set() functions
that need to changed to either add() - if we're sure that the object
cannot already have a member with the same name - or to replace() if
it might.
The vast majority of the set() calls were starting with an empty item
and adding members with fixed (string constant) names, so these can
be trivially changed to add().
It turns out that *all* other set() calls - except the one fixed in
issue #9542 - can also use add() because there are various "excuses"
why we know the member names will be unique. A typical example is
a map with column-name keys, where we know that the column names
are unique. I added comments in front of such non-obvious uses of
add() which are safe.
Almost all uses of rjson except a handful are in Alternator, so I
verified that all Alternator test cases continue to pass after this
patch.
Fixes#9583
Refs #9542
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211104152540.48900-1-nyh@scylladb.com>
This allows us to forward-declare raw_selector, which in turn reduces
indirect inclusions of expression.hh from 147 to 58, reducing rebuilds
when anything in that area changes.
Includes that were lost due to the change are restored in individual
translation units.
Closes#9434
All callers has been patched already. This argument can now
be used to replace get_local_storage_proxy().get_db().local()
call.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The base64 encoding/decoding functions will be used for serialization of
hint sync point descriptions. Base64 format is not specific to
Alternator, so it can be moved to utils.
Some .cc files over the code include the storage service
for no real need. Drop the header and include (in some)
what's really needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A follow up for the patch for #7611. This change was requested
during review and moved out of #7611 to reduce its scope.
The patch switches UUID_gen API from using plain integers to
hold time units to units from std::chrono.
For one, we plan to switch the entire code base to std::chrono units,
to ensure type safety. Secondly, using std::chrono units allows to
increase code reuse with template metaprogramming and remove a few
of UUID_gen functions that beceme redundant as a result.
* switch get_time_UUID(), unix_timestamp(), get_time_UUID_raw(), switch
min_time_UUID(), max_time_UUID(), create_time_safe() to
std::chrono
* remove unused variant of from_unix_timestamp()
* remove unused get_time_UUID_bytes(), create_time_unsafe(),
redundant get_adjusted_timestamp()
* inline get_raw_UUID_bytes()
* collapse to similar implementations of get_time_UUID()
* switch internal constants to std::chrono
* remove unnecessary unique_ptr from UUID_gen::_instance
Message-Id: <20210406130152.3237914-2-kostja@scylladb.com>
Currently all management of CDC generations happens in storage_service,
which is a big ball of mud that does many unrelated things.
Previous commits have introduced a new service for managing CDC
generations. This code moves most of the relevant code to this new
service.
However, some part still remains in storage_service: the bootstrap
procedure, which happens inside storage_service, must also do some
initialization regarding CDC generations, for example: on restart it
must retrieve the latest known generation timestamp from disk; on
bootstrap it must create a new generation and announce it to other
nodes. The order of these operations w.r.t the rest of the startup
procedure is important, hence the startup procedure is the only right
place for them.
Still, what remains in storage_service is a small part of the entire
CDC generation management logic; most of it has been moved to the
new service. This includes listening for generation changes and
updating the data structures for performing CDC log writes (cdc::metadata).
Furthermore these functions now return futures (and are internally
coroutines), where previously they required a seastar::async context.
Until now, the lists of streams in the `cdc_streams_descriptions` table
for a given generation were stored in a single collection. This solution
has multiple problems when dealing with large clusters (which produce
large lists of streams):
1. large allocations
2. reactor stalls
3. mutations too large to even fit in commitlog segments
This commit changes the schema of the table as described in issue #7993.
The streams are grouped according to token ranges, each token range
being represented by a separate clustering row. Rows are inserted in
reasonably large batches for efficiency.
The table is renamed to enable easy upgrade. On upgrade, the latest CDC
generation's list of streams will be (re-)inserted into the new table.
Yet another table is added: one that contains only the generation
timestamps clustered in a single partition. This makes it easy for CDC
clients to learn about new generations. It also enables an elegant
two-phase insertion procedure of the generation description: first we
insert the streams; only after ensuring that a quorum of replicas
contains them, we insert the timestamp. Thus, if any client observes a
timestamp in the timestamps table (even using a ONE query),
it means that a quorum of replicas must contain the list of streams.
For ProjectionExpression we implemented a hierarchical filter object which
can be used to hold a tree of attribute paths groups by a the top-level
attributes, and also detect overlapping and conflicting entries.
For UpdateExpression, we need almost exactly the same object: We need to
group update actions (e.g., SET a.b=3) by the top-level attribute, and
also detect and fail overlapping or conflicting paths.
So in this patch we rewrite the data structure we had for ProjectionExpression
in a more genric manner, using the template attribute_path_map<T> - which
holds data of type T for each attribute path. We also implement a template
function attribute_path_map_add() to add a path/value pair to this map,
and includes all the overlap and conflict detecting logic.
There shouldn't be functional changes in this patch. The ProjectionExpression
code uses the new generic code instead of the specific code, but should work
the same. In the next patch we can use the new generic code to implement
UpdateExpression as well.
The only somewhat functional change is better error messages for
conflicting or overlapping paths - which now include one of the
conflicting paths.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In the existing code, the variable "attrs_to_get" is a list of top-level
attributes to fetch for an item. It is used to implement features like
ProjectionExpression or AttributesToGet in GetItem and other places.
However, to support attribute paths (e.g., a.b.c[2]) in ProjectionExpression,
i.e., issue #5024, we need more than that. We still need to know the top-
level attribute "a", because this is the granularity we have in the Scylla
table (all the content inside "a" is serialized as a single JSON); But we
also need to remember exactly which parts *inside* "a" we will need to
extract and return.
So in this patch we add a new type, "attrs_to_get", which is more than
just a list of top-level attributes. Instead, it is a *map*, whose keys
are the top-level attributes, and the value for each of them is a
"hierarchy_filter", an object which describes which part of the attribute
is needed.
This patch includes the code which converts the AttributesToGet and
ProjectionExpression into the new attrs_to_get structure. During this
conversion, we recognize two kinds of errors which DynamoDB complains
about: We recognize "overlapping" attributes (e.g., requesting both
a.b and a.b.c) and "conflicting" attributes (e.g, requesting both
a.b and a[1]). After this, two xfailing tests we had for detecting
these overlap and conflicts finally pass and their "xfail" label is
removed.
After this patch, we have the attrs_to_get object which can allow us
to filter only the requested pieces of the top-level attributes, but
we don't use it yet - so this patch is not enough for complete support
of attribute paths in ProjectionExpression. We will complete this
support in the next patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add new alternator-streams experimental flag for
alternator streams control.
CDC becomes GA and won't be guarded by an experimental flag any more.
Alternator Streams stay experimental so now they need to be controlled
by their own experimental flag.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
get() the latest token_metadata_ptr from the
shared_token_metadata before each use.
expose get_token_metadata_ptr() rather than get_token_metadata()
so that caller can keep it across continuations.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fixes#7325
When building with clang on fedora32, calling the string_view constructor
of bignum generates broken ID:s (i.e. parsing borks). Creating a temp
std::string fixes it.
Closes#7542
Fixes#7496
Since cdc log now has an end-of-batch/record marker that tells
us explicitly that we've read the last row of a change, we
can use this instead of timestamp checks + limit extra to
ensure we have complete records.
Note that this does not try to fulfill user query limit
exact. To do this we would need to add a loop and potentially
re-query if quried rows are not enough. But that is a
separate exercise, and superbly suited for coroutines!
Allows QA to bypass the normal hardcoded 24h ttl of data and still
get "proper" behaviour w.r.t. available stream set/generations.
I.e. can manually change cdc ttl option for alternator table after
streams enabled. Should not be exposed, but perhaps useful for
testing.
Closes#7483
Fixes#7424
AWS sdk (kinesis) assumes SequenceNumbers are monotonically
growing bigints. Since we sort on and use timeuuids are these
a "raw" bit representation of this will _not_ fulfill the
requirement. However, we can "unwrap" the timestamp of uuid
msb and give the value as timestamp<<64|lsb, which will
ensure sort order == bigint order.
Fixes#7409
AWS kinesis Java sdk requires/expects shards to be reported in
lexical order, and even worse, ignores lastevalshard. Thus not
upholding said order will break their stream intropection badly.
Added asserts to unit tests.
v2:
* Added more comments
* use unsigned_cmp
* unconditional check in streams_test
Fixes#7345Fixes#7346
Do a more efficient collection skip when doing paging, instead of
iterating the full sets.
Ensure some semblance of sanity in the parent-child relationship
between shards by ensuring token order sorting and finding the
apparent previous ID coverting the approximate range of new gen.
Fix endsequencenumber generation by looking at whether we are
last gen or not, instead of the (not filled in) 'expired' column.
Fixes#7344
It is not data really needed, as shard_id:s are not required
to be unique across streams, and also because the length limit
on shard_id text representation.
As a side effect, shard iter instead carries the stream arn.
Fixes#7347
If cdc stream id:s are older than either table creation or now - 24h
we can skip them in describe_stream, to minimize the amount of
shards being returned.
As the test test_streams_closed_read confirmed, when a stream shard is
closed, GetRecords should not return a NextShardIterator at all.
Before this patch we wrongly returned an empty string for it.
Before this patch, several Alternator Stream tests (in test_streams.py)
failed when running against a multi-node Scylla cluster. The reason is as
follows: As a multi-node cluster boots and more and more nodes enter the
cluster, the cluster changes its mind about the token ownership, and
therefore the list of stream shards changes. By the time we have the full
cluster, a bunch of shards were created and closed without any data yet.
All the tests will see these closed shards, and need to understand them.
The fetch_more() utility function correctly assumed that a closed shard
does not return a NextShardIterator, and got confused by the empty string
we used to return.
Now that closed shards can return responses without NextShardIterator,
we also needed to fix in this patch a couple of tests which wrongly assumed
this can't happen. These tests did not fail on DynamoDB because unlike in
Scylla, DynamoDB does not have any closed shards in normal tests which
do not specifically cause them (only test_streams_closed_read).
We also need to fix test_streams_closed_read to get rid of an unnecessary
assumption: It currently assumes that when we read the very last item in
a closed shard is read, the end-of-shard is immediately signaled (i.e.,
NextShardIterator is not returned). Although DynamoDB does in fact do this,
it is also perfectly legal for Alternator's implementation to return the
last item with a new NextShardIterator - and only when the client reads
from that iterator, we finally return the signal the end of the shard.
Fixes#7237.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200922082529.511199-1-nyh@scylladb.com>
We call ostringstream::view(), but that member doesn't exist. It
works because it is guarded by an #ifdef and the guard isn't satisified,
but if it is (as with clang) it doesn't compile. Remove it.
Fixes#7190
Since we don't use any delta value when translating cdc -> streams
it is wasteful to write these to the log table, esp. since we already
write big fat pre- and post images.
Fixes#7163
If enabled, the resulting table description should include a
StreamDescription object with the appropriate members describing
current stream settings.
Fixes#7158
A streams shard descriptions has a sequence range describing start/end
(if available) of the shard. This is specified as being "numeric only".
Alternator incorrectly used UUID here, which breaks kinesis.
v2:
* Fix uint128_t parsing from string. bmp::number constructor accepted
sstring, but did not interpret it as std::string/chars. Weird results.
Fixes#7157
When creating/altering/describing a table, if streams are enabled, the
"latest active" stream arn should be included as LatestStreamArn.
Not doing so breaks java kinesis.
This patch fixes a bug which caused sporadic failures of the Alternator
test - test_streams.py::test_streams_last_result.
The GetRecords operation reads from an Alternator Streams shard and then
returns an "iterator" from where to continue reading next time. Because
we obviously don't want to read the same change again, we "incremented"
the current position, to start at the incremented position on the next read.
Unfortunately, the implementation of the increment() function wasn't quite
right. The position in the CDC log is a timeuuid, which has a really bizarre
comparison function (see compare_visitor in types.cc). In particular the
least-sigificant bytes of the UUID are compared as *signed* bytes. This
means that if the last byte of the UUID was 127, and increment() increased
it to 128, and this was wrong because the comparison function later deemed
that as a signed byte, where 128 is lower than 127, not higher! The result
was that with 1/256 probability (whenever the last byte of the position was
127) we would return an item twice. This was reproduced (with 1/256
probability) by the test test_streams_last_result, as reported in issue #7004.
The fix in this patch is to drop the increment() and replace it by a flag
whether an iterator is inclusive of the threshold (>=) or exclusive (>).
The internal representation of the iterator has a boolean flag "inclusive",
and the string representation uses the prefixes "I" or "i" to indicate an
inclusive or exclusive range, respectively - whereas before this patch we
always used the prefix "I".
Although increment() could have been fixed to work correctly, the result would
have been ugly because of the weirdness of the timeuuid comparison function.
increment() would also require extensive new unit-tests: we were lucky that
the high-level functional tests caught a 1 in 256 error, but they would not
have caught rarer errors (e.g., 1 in 2^32). Furthermore, I am looking at
Alternator as the first "user" of CDC, and seeing how complicated and
error-prone increment() is, we should not recommend to users to use this
technique - they should use exclusive (>) range queries instead.
Fixes#7004.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200901102718.435227-1-nyh@scylladb.com>
Fixes#6935Fixes#7107
DynamoDB streams for some reason duplicate the record keys
into both the "Keys" and "OldImage"/"NewImage" sub-objects
when doing GetRecords.
This patch appends the pk/ck parts into old/new image, and
also removes the previous restrictions on image generation
since cdc now generates more consistent pre/post image
data.
We'd like to strictly control who can modify token metadata
and nobody currently needs a mutable reference to storage_proxy::_token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Before this patch, modifying cdc/cdc_options.hh required recompiling 264
source files. This is because this header file was included by a couple
other header files - most notably schema.hh, where a forward declaration
would have been enough. Only the handful of source files which really
need to access the CDC options should include "cdc/cdc_options.hh" directly.
After this patch, modifying cdc/cdc_options.hh requires only 6 source files
to be recompiled.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200813070631.180192-1-nyh@scylladb.com>
Fixes#7030
Dynamo/alternator streams old image data is supposed to
contain the full old value blob (all keys/values).
Setting preimage=full ensures we get even those properties
that have separate columns if they are not part of an actual
modification.
Fixes#6948
Changes the stream_id format from
<token:64>:<rand:64>
to
<token:64>:<rand:38><index:22><version:4>
The code will attempt to assert version match when
presented with a stored id (i.e. construct from bytes).
This means that ID:s created by previous (experimental)
versions will break.
Moves the ID encoding fully into the ID class, and makes
the code path private for the topology generation code
path.
Removes some superflous accessors but adds accessors for
token, version and index. (For alternator etc).
Fixes#6933
If old (or new) image for a change set is empty, dynamo will not
include this key at all. Alternator did return an empty object.
This changes it to be excluded on empty.