Commit Graph

587 Commits

Author SHA1 Message Date
Pavel Emelyanov
7a5c2cdbe6 alternator: Don't use global gossiper
There's proxy at hand which can provide local gossiper reference

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-03 10:33:07 +03:00
Avi Kivity
582802825a treewide: use system-#include (angle brackets) for seastar
Seastar is an external library from Scylla's point of view so
we should use the angle bracket #include style. Most of the source
follows this, this patch fixes a few stragglers.

Also fix cases of #include which reached out to seastar's directory
tree directly, via #include "seastar/include/sesatar/..." to
just refer to <seastar/...>.

Closes #10433
2022-04-26 14:46:42 +03:00
Avi Kivity
40beb48176 alternator: ttl: avoid specializing class templates in non-namespace scope
The C++ standard disallows class template specialization in non-namespace
scopes. Clang apparently allows it as an extension.

Fix by not using a template - there are just two specializations and
no generic implementation. Use regular classes and std::conditional_t
to choose between the two.
2022-04-18 12:27:18 +03:00
Avi Kivity
b5e8e32c01 alternator: executor: fix signed/unsigned comparison in is_big()
Signed/unsigned comparisons are subject to C promotion rules. In is_big()
in this case the comparison is safe, but gcc warns. Use a cast to silence
the warning.

The sign/unsigned mix and int/size_t size differences still look bad, it
would be good to revisit this code, but that is left for another patch.
2022-04-18 12:23:18 +03:00
Nadav Har'El
84143c2ee5 alternator: implement Select option of Query and Scan
This patch implements the previously-unimplemented Select option of the
Query and Scan operators.

The most interesting use case of this option is Select=COUNT which means
we should only count the items, without returning their actual content.
But there are actually four different Select settings: COUNT,
ALL_ATTRIBUTES, SPECIFIC_ATTRIBUTES, and ALL_PROJECTED_ATTRIBUTES.

Five previously-failing tests now pass, and their xfail mark is removed:

 *  test_query.py::test_query_select
 *  test_scan.py::test_scan_select
 *  test_query_filter.py::test_query_filter_and_select_count
 *  test_filter_expression.py::test_filter_expression_and_select_count
 *  test_gsi.py::test_gsi_query_select_1

These tests cover many different cases of successes and errors, including
combination of Select and other options. E.g., combining Select=COUNT
with filtering requires us to get the parts of the items needed for the
filtering function - even if we don't need to return them to the user
at the end.

Because we do not yet support GSI/LSI projection (issue #5036), the
support for ALL_PROJECTED_ATTRIBUTES is a bit simpler than it will need
to be in the future, but we can only finish that after #5036 is done.

Fixes #5058.

The most intrusive part of this patch is a change from attrs_to_get -
a map of top-level attributes that a read needs to fetch - to an
optional<attrs_to_get>. This change is needed because we also need
to support the case that we want to read no attributes (Select=COUNT),
and attrs_to_get.empty() used to mean that we want to read *all*
attributes, not no attributes. After this patch, an unset
optional<attrs_to_get> means read *all* attributes, a set but empty
attrs_to_get means read *no* attributes, and a set and non-empty
attrs_to_get means read those specific attributes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220405113700.9768-2-nyh@scylladb.com>
2022-04-11 10:04:32 +02:00
Nadav Har'El
9c1ebdceea alternator: forbid empty AttributesToGet
In DynamoDB one can retrieve only a subset of the attributes using the
AttributesToGet or ProjectionExpression paramters to read requests.
Neither allows an empty list of attributes - if you don't want any
attributes, you should use Select=COUNT instead.

Currently we correctly refuse an empty ProjectionExpression - and have
a test for it:
test_projection_expression.py::test_projection_expression_toplevel_syntax

However, Alternator is missing the same empty-forbidding logic for
AttributesToGet. An empty AttributesToGet is currently allowed, and
basically says "retrieve everything", which is sort of unexpected.

So this patch adds the missing logic, and the missing test (actually
two tests for the same thing - one using GetItem and the other Query).

Fixes #10332

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220405113700.9768-1-nyh@scylladb.com>
2022-04-11 10:21:02 +03:00
Nadav Har'El
67e0590bbc alternator: remove old TODO (with test verifying it)
We had an old TODO in the Alternator "Scan" operation code which
suggested that we may need to do something to limit the size of pages
when a row limit ("Limit") isn't given.

But we do already have a built-in limit on page sizes (1 MB),
so this TODO isn't needed and can be removed.

But I also wanted to make sure we have a test that this limit works:

We already had a test that this 1 MB limit works for a single-partition
Query (test_query.py::test_query_reverse_longish - tested both forward
and reversed queries). In this patch I add a similar test for a whole-
table Scan. It turns out that although page size is limited in this case
as well, it's not exactly 1 MB... For small tables can even reach 3 MB.
I consider this "good enough" and that we can drop the TODO, but also
opened issue #10327 to document this surprising (for me) finding.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220404145240.354198-1-nyh@scylladb.com>
2022-04-05 09:23:23 +03:00
Nadav Har'El
7f89c8b3e3 alternator: clean error shutdown in case of TLS misconfigration
The way our boot-time service "controllers" are written, if a
controller's start_server() finds an error and throws, it cannot
the caller (main.cc) to call stop_server(), and must clean up
resources already created (e.g., sharded services) before returning
or risk crashes on assertion failures.

This patch fixes such a mistake in Alternator's initialization.
As noted in issue #10025, if the Alternator TLS configuration is
broken - especially the certificate or key files are missing -
Scylla would crash on an assertion failure, instead of reporting
the error as expected. Before this patch such a misconfiguration
will result in the unintelligible:

<alternator::server>::~sharded() [Service = alternator::server]:
Assertion `_instances.empty()' failed. Aborting on shard 0.

After this patch we get the right error message:

ERROR 2022-03-21 15:25:07,553 [shard 0] init - Startup failed:
std::_Nested_exception<std::runtime_error> (Failed to set up Alternator
TLS credentials): std::_Nested_exception<std::runtime_error> (Could not
read certificate file conf/scylla.crt): std::filesystem::__cxx11::
filesystem_error (error system:2, filesystem error: open failed:
No such file or directory [conf/scylla.crt])

Arguably this error message is a bit ugly, so I opened
https://github.com/scylladb/seastar/issues/1029, but at least it says
exactly what the error is.

Fixes #10025

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220321133323.3150939-1-nyh@scylladb.com>
2022-03-28 15:26:42 +03:00
Nadav Har'El
653f2df28f alternator: fix JSON escaping of error responses
In the DynamoDB API, error responses are in JSON format with specific
fields ("__type" and "message" in the x-amz-json-1.0 format currently
used). Alternator tried to be clever and build the string representation
of this JSON itself, instead of using RapidJSON. But this optimization
was a mistake - if the error message contains characters that need
escaping (such as double quotes and newlines), they weren't escaped,
and the resulting JSON was malformed. When the client library boto3
read this malformed JSON it got confused, cosidered the entire error
response to be a string, which resulted in an ugly error message.

The fix is easy - just build the JSON output as usual with RapidJSON
instead of trying to optimize using string operation.

The patch also includes two tests reproducing this bug and checking its
fix. The first test uses boto3 and shows it got confused on the type
of error (not understanding that it is a ValidationException). The
second test bypasses boto3 and shows exactly where the bug happens -
the response is an unparsable JSON.

Fixes #10278

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220327132705.3707979-1-nyh@scylladb.com>
2022-03-27 16:32:36 +03:00
Mikołaj Sielużycki
6f1b6da68a compile: Fix headers so that *-headers targets compile cleanly.
Closes #10273
2022-03-25 16:19:26 +02:00
Pavel Solodovnikov
95c8d65949 treewide: fix compilation issues with fmtlib 8.1.0+
Due to fd62fba985
scoped enums are not automatically converted to integers anymore,
this is the intended behavior, according to the fmtlib devs.

A bit nicer solution would be to use `std::to_underlying`
instead of a direct `static_cast`, but it's not available until
C++23 and some compilers are still missing the support for it.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-03-16 12:31:50 +03:00
Nadav Har'El
c26230943b alternator ttl: add metrics
This patch adds metrics to the Alternator TTL feature (aka the "expiration
service").

I put these metrics deliberately in their own object in ttl.{hh,cc}, and
also with their own prefix ("expiration_*") - and *not* together with the
rest of the Alternator metrics (alternator/stats.{hh,cc}). This is
because later we may want to use the expiration service not only in
Alternator but also in CQL - to support per-item expiration with CDC
events also in CQL. So the implementation of this feature should not be
too tangled with that of Alternator.

The patch currently adds four metrics, and opens the path to easily add
more in the future. The metrics added now are:

1. scylla_expiration_scan_passes:  The number of scan passes over the
       entire table. We expect this to grow by 1 every
       alternator_ttl_period_in_seconds seconds.

2. scylla_expiration_scan_table: The number of table scans. In each scan
       pass, we scan all the tables that have the Alternator TTL feature
       enabled. Each scan of each table is counted by this counter.

3. scylla_expiration_items_deleted: Counts the number of items that
       the expiration service expired (deleted). Please remember that
       each item is considered for expiration - and then expired - on
       only one node, so each expired item is counted only once - not
       RF times.

4. scylla_expiration_secondary_ranges_scanned: If this counter is
       incremented, it means this node took over some other node's
       expiration scanning duties while the other node was down.

This patch also includes a couple of unrelated comment fixes.

I tested the new metrics manually - they aren't yet tested by the
Alternator test suite because I couldn't make up my mind if such
tests would belong in test_ttl.py or test_metrics.py :-)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220224092419.1132655-1-nyh@scylladb.com>
2022-02-25 07:26:11 +02:00
Nadav Har'El
db7b11cfc4 alternator: make TTL expiration scanner bypass cache
The background scan for expired Alternator items (the TTL feature)
should bypass the cache to avoid poluting it with the entire content
of the table being scanned.

I tested that the flag added in this patch really works by adding a printout
to the code in table.cc which creates the reader. Although we do have a
metric for uses of BYPASS CACHE, unfortunately this metric counts usage
of "BYPASS CACHE" in CQL statements - and not does not account the low-
level calls that we use in the ttl scanner.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-02-25 07:26:11 +02:00
Nadav Har'El
49a8164fb7 alternator: add configurable scan period to TTL expiration
Before this patch, the experimental TTL (expiration time) feature in
Alternator scans tables for expiration in a tight loop - starting the
next scan one second after the previous one completed.

In this patch we introduce a new configuration option,
alternator_ttl_period_in_seconds, which determines how frequently
to start the scan. The default is 24 hours - meaning that the next
scan is started 24 hours after the previous one started.

The tests (test/alternator/run) change this configuration back to one
second, so that expiration tests finish as quickly as possible.

Please note that the scan is *not* slowed down to fill this 24 hours -
if it finishes in one hour, it will then sleep for 23 hours. Additional
work would be needed to slow down the scan to not finish too quickly.
One idea not yet implemented is to move the expiration service from
the "maintenance" scheduling group which it uses today to a new
scheduling group, and modifying the number of shares that this group
gets.

Another thing worth noting about the configurable period (which defaults
to 24 hours) is that when TTL is enabled on an Alternator table, it can
take that amount of time until its scan starts and items start expiring
from it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-02-25 07:26:11 +02:00
Nadav Har'El
f292d3d679 alternator: make schema modifications in CreateTable atomic
The Alternator CreateTable operation currently performs several schema-
changing operations separately - one by one: It creates a keyspace,
a table in that keyspace and possibly also multiple views, and it sets
tags on the table. A consequence of this is that concurrent CreateTable
and DeleteTable operations (for example) can result in unexpected errors
or inconsistent states - for example CreateTable wants to create the
table in the keyspace it just created, but a concurrent DeleteTable
deleted it. We have two issues about this problem (#6391 and #9868)
and three tests (test_table.py::test_concurrent_create_and_delete_table)
reproducing it.

In this patch we fix these problems by switching to the modern Scylla
schema-changing API: Instead of doing several schema-changing
operations one by one, we create a vector of schema mutation performing
all these operations - and then perform all these mutations together.

When the experimental Raft-based schema modifications is enabled, this
completely solves the races, and the tests begin to pass. However, if
the experimental Raft mode is not enabled, these tests continue to fail
because there is still no locking while applying the different schema
mutations (not even on a single node). So I put a special fixture
"fails_without_raft" on these tests - which means that the tests
xfail if run without raft, and expected to pass when run on Raft.

Indeed, after this patch
test/alternator/run --raft test_table.py::test_concurrent_create_and_delete_table

shows three passing tests (they also pass if we drastically improve the
number of iterations), while
test/alternator/run test_table.py::test_concurrent_create_and_delete_table

shows three xfailing tests.

All other Alternator tests pass as before with this patch, verifying
that the handling of new tables, new views, tags, and CDC log tables,
all happen correctly even after this patch.

A note about the implementation: Before this patch, the CreateTable code
used high-level functions like prepare_new_column_family_announcement().
These high-level functions become unusable if we write multiple schema
operations to one list of mutations, because for example this function
validates that the keyspace had already been created - when it hasn't
and that's the whole point. So instead we had to use lower-level
function like add_table_or_view_to_schema_mutation() and
before_create_column_family(). However, despite being lower level,
these functions were public so I think it's reasonable to use them,
and we probably have no other alternative.

Fixes #6391
Fixes #9868

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-02-18 09:03:52 +02:00
Nadav Har'El
4e3038b57f alternator: add FIXME for schema changes requiring a loop
In commit a664ac7ba5, the Alternator
schema-modifying code (e.g., delete_table()) was reorganized to support
the new Raft-based schema modifications. Schema modifications now work
with an "optimistic locking" approach: We retrieve the current schema
version id ("group0_guard"), reads the current schema and verifies it
can do the changes it wants to do, and then does them with
mm.announce(group0_guard) - which will fail if the schema version is not
current because some other concurrent modification beat us in the race.

This means that we need to do this whole read-modify-write (group0_guard,
checking the schema, creating mutations, calling mm.announce()) in a
*retry loop*. We have such a loop in the CQL code but it's missing in the
Alternator code. In this patch we don't add the loop yet, but add FIXMEs
to remind us where it's missing.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220214154435.544125-1-nyh@scylladb.com>
2022-02-14 18:24:16 +02:00
Nadav Har'El
9982a28007 alternator: allow REMOVE of non-existent nested attribute
DynamoDB allows an UpdateItem operation "REMOVE x.y" when a map x
exists in the item, but x.y doesn't - the removal silently does
nothing. Alternator incorrectly generated an error in this case,
and unfortunately we didn't have a test for this case.

So in this patch we add the missing test (which fails on Alternator
before this patch - and passes on DynamoDB) and then fix the behavior.
After this patch, "REMOVE x.y" will remain an error if "x" doesn't
exist (saying "document paths not valid for this item"), but if "x"
exists and is a map, but "x.y" doesn't, the removal will silently
do nothing and will not be an error.

Fixes #10043.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220207133652.181994-1-nyh@scylladb.com>
2022-02-07 18:40:48 +02:00
Piotr Sarna
c613d1ce87 alternator: migrate expression parsers to string_view
Following the advice in the FIXME note, helper functions for parsing
expressions are now based on string views to avoid a few unnecessary
conversions to std::string.

Tests: unit(dev)

Closes #10013
2022-02-04 12:34:19 +02:00
Nadav Har'El
79776ff2ff alternator: fix error handling during Alternator startup
A recent restructuring of the startup of Alternator (and also other
protocol servers) led to incorrect error-handling behavior during
startup: If an error was detected on one of the shards of the sharded
service (in alternator/server.cc), the sharded service itself was never
stopped (in alternator/controller.cc), leading to an assertion failure
instead of the desired error message.

A common example of this problem is when the requested port for the
server was already taken (this was issue #9914).

So in this patch, exception handling is removed from server.cc - the
exception will propegate to the code in controller.cc, which will
properly stop the server (including the sharded services) before
returning.

Fixes #9914.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220130131709.1166716-1-nyh@scylladb.com>
2022-02-02 10:35:57 +01:00
Piotr Sarna
d50ed944f2 alternator: add error injection to BatchGetItem
When error injection is enabled at compile time, it's now possible
to inject an error into BatchGetItem in order to produce a partial
read, i.e. when only part of the items were retrieved successfully.
2022-01-31 12:56:00 +01:00
Piotr Sarna
31f4f062a2 alternator: fill UnprocessedKeys for failed batch reads
DynamoDB protocol specifies that when getting items in a batch
failed only partially, unprocessed keys can be returned so that
the user can perform a retry.
Alternator used to fail the whole request if any of the reads failed,
but right now it instead produces the list of unprocessed keys
and returns them to the user, as long as at least 1 read was
successful.

NOTE: tested manually by compiling Scylla with error injection,
which fails every nth request. It's rather hard to figure out
an automatic test case for this scenario.

Fixes #9984
2022-01-31 12:56:00 +01:00
Tomasz Grabiec
c89b1953f8 Merge "Enforce linearizability of group 0 operations using state IDs" from Kamil
We introduce a new table, `system.group0_history`.

This table will contain a history of all group 0 changes applied through
Raft. With each change is an associated unique ID, which also identifies
the state of all group 0 tables (including schema tables) after this
change is applied, assuming that all such changes are serialized through
Raft (they will be eventually).

Group 0 commands, additionally to mutations which modify group 0 tables,
contain a "previous state ID" and a "new state ID".

The group 0 state machine will only modify state during command
application if the provided "previous state ID" is equal to the
last state ID present in the history table. Otherwise, the command will
be a no-op.

To ensure linearizability of group 0 changes, the performer of the
change must first read the last state ID, only then read the state
and send a command for the state machine. If a concurrent change
races with this command and manages to modify the state, we will detect
that the last state ID does not match during `apply`; all calls to
`apply` are serialized, and `apply` adds the new entry to the history
table at the end, after modifying the group 0 state.

The details of this mechanism are abstracted away with `group0_guard`.
To perform a group 0 change, one needs to call `announce`, which
requires a `group0_guard` to be passed in. The only way to obtain a
`group0_guard` is by calling `start_group0_operation`, which underneath
performs a read barrier on group 0, obtains the last state ID from the
history table, and constructs a new state ID that the change will append
to the history table. The read barrier ensures that all previously
completed changes are visible to this operation. The caller can then
perform any necessary validation, construct mutations which modify group 0
state, and finally call `announce`.

The guard also provides a timestamp which is used by the caller
to construct the mutations. The timestamp is obtained from the new state ID.
We ensure that it is greater than the timestamp of the last state ID.
Thus, if the change is successful, the applied mutations will have greater
timestamps than the previously applied mutations.

We also add two locks. The more important one, used to ensure
correctness, is `read_apply_mutex`. It is held when modifying group 0
state (in `apply` and `transfer_snapshot`) and when reading it (it's
taken when obtaining a `group0_guard` and released before a command is
sent in `announce`). Its goal is to ensure that we don't read partial
state, which could happen without it because group 0 state consist of
many parts and `apply` (or `transfer_snapshot`) potentially modifies all
of them. Note: this doesn't give us 100% protection; if we crash in the
middle of `apply` (or `transfer_snapshot`), then after restart we may
read partial state. To remove this possibility we need to ensure that
commands which were being applied before restart but not finished are
re-applied after restart, before anyone can read the state. I left a
TODO in `apply`.

The second lock, `operation_mutex`, is used to improve liveness. It is
taken when obtaining a `group0_guard` and released after a command is
applied (compare to `read_apply_mutex` which is released before a
command is sent). It is not taken inside `apply` or `transfer_snapshot`.
This lock ensures that multiple fibers running on the same node do not
attempt to modify group0 concurrently - this would cause some of them to
fail (due to the concurrent modification protection described above).
This is mostly important during first boot of the first node, when
services start for the first time and try to create their internal
tables. This lock serializes these attempts, ensuring that all of them
succeed.

* kbr/schema-state-ids-v4:
  service: migration_manager: `announce`: take a description parameter
  service: raft: check and update state IDs during group 0 operations
  service: raft: group0_state_machine: introduce `group0_command`
  service: migration_manager: allow using MIGRATION_REQUEST verb to fetch group 0 history table
  service: migration_manager: convert migration request handler to coroutine
  db: system_keyspace: introduce `system.group0_history` table
  treewide: require `group0_guard` when performing schema changes
  service: migration_manager: introduce `group0_guard`
  service: raft: pass `storage_proxy&` to `group0_state_machine`
  service: raft: raft_state_machine: pass `snapshot_descriptor` to `transfer_snapshot`
  service: raft: rename `schema_raft_state_machine` to `group0_state_machine`
  service: migration_manager: rename `schema_read_barrier` to `start_group0_operation`
  service: migration_manager: `announce`: split raft and non-raft paths to separate functions
  treewide: pass mutation timestamp from call sites into `migration_manager::prepare_*` functions
  service: migration_manager: put notifier call inside `async`
  service: migration_manager: remove some unused and disabled code
  db: system_distributed_keyspace: use current time when creating mutations in `start()`
  redis: keyspace_utils: `create_keyspace_if_not_exists_impl`: call `announce` twice only
2022-01-25 09:52:30 +02:00
Avi Kivity
e74f570eda alternator: streams: fix use-after-free of data_dictionary in describe_stream()
In 4aa9e86924 ("Merge 'alternator: move uses of replica module to
data_dictionary' from Avi Kivity"), we changed alternator to use
data_dictionary instead of replica::database. However,
data_dictionary::database objects are different from replica::database
objects in that they don't have a stable address and need to be
captured by value (they are pointer-like). One capture in
describe_stream() was capturing a data_dictionary::database
by reference and so caused a use-after-free when the previous
continuation was deallocated.

Fix by capturing by value.

Fixes #9952.

Closes #9954
2022-01-25 09:52:30 +02:00
Kamil Braun
a664ac7ba5 treewide: require group0_guard when performing schema changes
`announce` now takes a `group0_guard` by value. `group0_guard` can only
be obtained through `migration_manager::start_group0_operation` and
moved, it cannot be constructed outside `migration_manager`.

The guard will be a method of ensuring linearizability for group 0
operations.
2022-01-24 15:20:35 +01:00
Kamil Braun
86762a1dd9 service: migration_manager: rename schema_read_barrier to start_group0_operation
1. Generalize the name so it mentions group 0, which schema will be a
   strict subset of.
2. Remove the fact that it performs a "read barrier" from the name. The
   function will be used in general to ensure linearizability of group0
   operations - both reads and writes. "Read barrier" is Raft-specific
   terminology, so it can be thought of as an implementation detail.
2022-01-24 15:12:50 +01:00
Kamil Braun
283ac7fefe treewide: pass mutation timestamp from call sites into migration_manager::prepare_* functions
The functions which prepare schema change mutations (such as
`prepare_new_column_family_announcement`) would use internally
generated timestamps for these mutations. When schema changes are
managed by group 0 we want to ensure that timestamps of mutations
applied through Raft are monotonic. We will generate these timestamps at
call sites and pass them into the `prepare_` functions. This commit
prepares the APIs.
2022-01-24 15:12:50 +01:00
Nadav Har'El
350c3d0f6a alternator: update comment about default timeout
The comment explaining where the default Alternator timeout is set
became out-of-date. So fix it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220120092631.401563-1-nyh@scylladb.com>
2022-01-20 14:05:58 +02:00
Nadav Har'El
4aa9e86924 Merge 'alternator: move uses of replica module to data_dictionary' from Avi Kivity
Alternator is a coordinator-side service and so should not access
the replica module. In this series all but one of uses of the replica
module are replaced with data_dictionary.

One case remains - accessing the replication map which is not
available (and should not be available) via the data dictionary.

The data_dictionary module is expanded with missing accessors.

Closes #9945

* github.com:scylladb/scylla:
  alternator: switch to data_dictionary for table listing purposes
  data_dictionary: add get_tables()
  data_dictionary: introduce keyspace::is_internal()
2022-01-19 11:34:25 +02:00
Avi Kivity
7399f3fae7 alternator: switch to data_dictionary for table listing purposes
As a coordinator-side service, alternator shouldn't touch the
replica module, so it is migrated here to data_dictionary.

One use case still remains that uses replica::keyspace - accessing
the replication map. This really isn't a replica-side thing, but it's
also not logically part of the data dictionary, so it's left using
replica::keyspace (using the data_dictionary::database::real_database()
escape hatch). Figuring out how to expose the replication map to
coordinator-side services is left for later.
2022-01-19 11:03:36 +02:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Nadav Har'El
343c521e28 alternator: avoid large contigous allocation in BatchGetItem
The BatchGetItem request can return a very large response - according to
DynamoDB documentation up to 16 MB, but presently in Alternator, we allow
even more (see #5944).

The problem is that the existing code prepares the entire response as
a large contiguous string, resulting in oversized allocation warnings -
and potentially allocation failures. So in this patch we estimate the size
of the BatchGetItem response, and if it is "big enough" (currently over
100 KB), we return it with the recently added streaming output support.

This streaming output doesn't avoid the extra memory copies unfortunately,
but it does avoid a *contiguous* allocation which is the goal of this
patch.

After this patch, one oversized allocation warning is gone from the test:

    test/alternator/run test_batch.py::test_batch_get_item_large

(a second oversized allocation is still present, but comes from the
unrelated BatchWriteItem issue #8183).

Fixes #8522

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220111170541.637176-1-nyh@scylladb.com>
2022-01-13 09:46:08 +01:00
Nadav Har'El
8bcd23fa02 Merge: move rest of internal ddl users to use raft from Gleb
The patch series moves the rest of internal ddl users to do schema
change over raft (if enabled). After that series only tests are left
using old API.

* 'gleb/raft-schema-rest-v6' of github.com:scylladb/scylla-dev: (33 commits)
  migration_manager: drop no longer used functions
  system_distributed_keyspace: move schema creation code to use raft
  auth: move table creation code to use raft
  auth: move keyspace creation code to use raft
  table_helper: move schema creation code to use raft
  cql3: make query_processor inherit from peering_sharded_service
  table_helper: make setup_table() static
  table_helper: co-routinize setup_keyspace()
  redis: move schema creation code to go through raft
  thrift: move system_update_column_family() to raft
  thrift: authenticate a statement before verifying in system_update_column_family()
  thrift: co-routinize system_update_column_family()
  thrift: move system_update_keyspace() to raft
  thrift: authenticate a statement before verifying in system_update_keyspace()
  thrift: co-routinize system_update_keyspace()
  thrift: move system_drop_keyspace() to raft
  thrift: authenticate a statement before verifying in system_drop_keyspace()
  thrift: co-routinize system_drop_keyspace()
  thrift: move system_add_keyspace() to raft
  thrift: co-routinize system_add_keyspace()
  ...
2022-01-12 18:09:08 +02:00
Gleb Natapov
1491cc2906 alternator: move create_table() to raft 2022-01-12 16:33:16 +02:00
Gleb Natapov
0cd6d283ad alternator: move update_table() to raft 2022-01-12 16:33:15 +02:00
Gleb Natapov
7ee39ff94b alternator: move validation in update_table() to the begining 2022-01-12 16:33:15 +02:00
Gleb Natapov
740b2181e1 alternator: move update_tags() to raft 2022-01-12 16:33:15 +02:00
Gleb Natapov
57be1b773e alternator: move delete_table() to raft 2022-01-12 16:33:15 +02:00
Gleb Natapov
0ac20b5494 alternator: make some functions static
Make add_stream_options, supplement_table_info, supplement_table_stream_info static. They only need a pointer
to storage_proxy, so pass it directly.
2022-01-12 16:33:15 +02:00
Gleb Natapov
2e4a8bdfaa alternator: co-routinize delete_table() 2022-01-12 16:33:15 +02:00
Gleb Natapov
459539e812 migration_manager: do not allow creating keyspace with arbitrary timestamp
This was needed to fix issue #2129 which was only manifest itself with
auto_bootstrap set to false. The option is ignored now and we always
wait for schema to synch during boot.
2022-01-12 16:33:15 +02:00
Nadav Har'El
23e93a26b3 Merge 'Alternator: stream results + chunk results to remove large allocations' from Calle Wilund
Refs: #9555

When running the "Kraken" dynamodb streams test to provoke the issued observed by QA, I noticed on my setup mainly two things: Large allocation stalls (+ warnings) and timeouts on read semaphores in DB.

This tries to address the first issue, partly by making query_result_view serialization using chunked vector instead of linear one, and by introducing a streaming option for json return objects, avoiding linearizing to string before wire.
Note that the latter has some overhead issues of its own, mainly data copying, since we essentially will be triple buffering (local, wrapped http stream, and final output stream). Still, normal string output will typically do a lot of realloc which is potential extra copies as well, so...

This is not really performance tested, but with these tweaks I no longer get large alloc stalls at least, so that is a plus. :-)

Closes #9713

* github.com:scylladb/scylla:
  alternator::executor: Use streamed result for scan etc if large result
  alternator::streams: Use streamed result in get_records if large result
  executor/server: Add routine to make stream object return
  rjson: Add print to stream of rjson::value
  query_idl: Make qr_partition::rows/query_result::partitions chunked
2022-01-12 15:53:31 +02:00
Calle Wilund
f73ca9659b alternator::executor: Use streamed result for scan etc if large result
Avoids large allocations for larger scans.
Todo: determine threshold
2022-01-12 13:34:49 +00:00
Calle Wilund
0c1ff5c2f5 alternator::streams: Use streamed result in get_records if large result
If we have a resonable result set to send back to client, use direct
streaming of the object.

Todo: determine threshold.
2022-01-12 13:34:49 +00:00
Calle Wilund
4a8a7ef8b4 executor/server: Add routine to make stream object return
Simply retains result object and sets json::json_return_type to
streaming callback.
2022-01-12 13:34:49 +00:00
Pavel Emelyanov
095d93eaf8 pager: Keep shared pointer to proxy onboard
Pagers are created by alternator and select statement, both
have the proxy reference at hands. Next, the pager's unique_ptr
is put on the lambda of its fetch_page() continuation and thus
it survives the fetch_page execution and then gets destroyed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-01-10 07:58:57 +03:00
Avi Kivity
bbad8f4677 replica: move ::database, ::keyspace, and ::table to replica namespace
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.

References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.

scylla-gdb.py is adjusted to look for both the new and old names.
2022-01-07 12:04:38 +02:00
Avi Kivity
ae3a360725 database: Move database, keyspace, table classes to replica/ directory
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.

As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
2022-01-06 17:07:30 +02:00
Nadav Har'El
e4b2dfb54d alternator ttl: when node is down, secondary node continues to expire
The current implementation of the Alternator expiration (TTL) feature
has each node scan for expired partitions in its own primary ranges.
This means that while a node is down, items in its primary ranges will
not get expired.

But we note that doesn't have to be this way: If only a single node is
down, and RF=3, the items that node owns are still readable with QUORUM -
so these items can still be safely read and checked for expiration - and
also deleted.

This patch implements a fairly simple solution: When a node completes
scanning its own primary ranges, also checks whether any of its *secondary*
ranges (ranges where it is the *second* replica) has its primary owner
down. For such ranges, this node will scan them as well. This secondary
scan stops if the remote node comes back up, but in that case it may
happen that both nodes will work on the same range at the same time.
The risks in that are minimal, though, and amount to wasted work and
duplicate deletion records in CDC. In the future we could avoid this by
using LWT to claim ownership on a range being scanned.

We have a new dtest (see a separate patch), alternator_ttl_tests.py::
TestAlternatorTTL::test_expiration_with_down_node, which reproduces this
and verifies this fix. The test starts a 5-node cluster, with 1000 items
with random tokens which are due to be expired immediately. The test
expects to see all items expiring ASAP, but when one of the five nodes
is brought down, this doesn't happen: Some of the items are not expired,
until this patch is used.

Fixes #9787

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211222131933.406148-1-nyh@scylladb.com>
2021-12-26 14:10:52 +02:00
Nadav Har'El
31eeb44d28 alternator: fix error on UpdateTable for non-existent table
When the UpdateTable operation is called for a non-existent table, the
appropriate error is ResourceNotFoundException, but before this patch
we ran into an exception, which resulted in an ugly "internal server
error".

In this patch we use the existing get_table() function which most other
operations use, and which does all the appropriate verifications and
generates the appropriate Alternator api_error instead of letting
internal Scylla exceptions escape to the user.

This patch also includes a test for UpdateTable on a non-existent table,
which used to fail before this patch and pass afterwards. We also add a
test for DeleteTable in the same scenario, and see it didn't have this
bug. As usual, both tests pass on DynamoDB, which confirms we generate
the right error codes.

Fixes #9747.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211206181605.1182431-1-nyh@scylladb.com>
2021-12-14 13:09:27 +01:00
Gleb Natapov
38e1f85959 migration_manager: drop view_ptr array from announce_column_family_update()
No users pass it any longer.
2021-12-11 12:31:07 +02:00