scylladb

Author	SHA1	Message	Date
Pavel Emelyanov	7a5c2cdbe6	alternator: Don't use global gossiper There's proxy at hand which can provide local gossiper reference Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-03 10:33:07 +03:00
Avi Kivity	582802825a	treewide: use system-#include (angle brackets) for seastar Seastar is an external library from Scylla's point of view so we should use the angle bracket #include style. Most of the source follows this; this patch fixes a few stragglers. Also fix cases of #include which reached out to seastar's directory tree directly, via #include "seastar/include/seastar/..." to just refer to <seastar/...>. Closes #10433	2022-04-26 14:46:42 +03:00
Avi Kivity	40beb48176	alternator: ttl: avoid specializing class templates in non-namespace scope The C++ standard disallows class template specialization in non-namespace scopes. Clang apparently allows it as an extension. Fix by not using a template - there are just two specializations and no generic implementation. Use regular classes and std::conditional_t to choose between the two.	2022-04-18 12:27:18 +03:00
Avi Kivity	b5e8e32c01	alternator: executor: fix signed/unsigned comparison in is_big() Signed/unsigned comparisons are subject to C promotion rules. In is_big() in this case the comparison is safe, but gcc warns. Use a cast to silence the warning. The signed/unsigned mix and int/size_t size differences still look bad; it would be good to revisit this code, but that is left for another patch.	2022-04-18 12:23:18 +03:00
Nadav Har'El	84143c2ee5	alternator: implement Select option of Query and Scan This patch implements the previously-unimplemented Select option of the Query and Scan operators. The most interesting use case of this option is Select=COUNT which means we should only count the items, without returning their actual content. But there are actually four different Select settings: COUNT, ALL_ATTRIBUTES, SPECIFIC_ATTRIBUTES, and ALL_PROJECTED_ATTRIBUTES. Five previously-failing tests now pass, and their xfail mark is removed: * test_query.py::test_query_select * test_scan.py::test_scan_select * test_query_filter.py::test_query_filter_and_select_count * test_filter_expression.py::test_filter_expression_and_select_count * test_gsi.py::test_gsi_query_select_1 These tests cover many different cases of successes and errors, including combination of Select and other options. E.g., combining Select=COUNT with filtering requires us to get the parts of the items needed for the filtering function - even if we don't need to return them to the user at the end. Because we do not yet support GSI/LSI projection (issue #5036), the support for ALL_PROJECTED_ATTRIBUTES is a bit simpler than it will need to be in the future, but we can only finish that after #5036 is done. Fixes #5058. The most intrusive part of this patch is a change from attrs_to_get - a map of top-level attributes that a read needs to fetch - to an optional<attrs_to_get>. This change is needed because we also need to support the case that we want to read no attributes (Select=COUNT), and attrs_to_get.empty() used to mean that we want to read all attributes, not no attributes. After this patch, an unset optional<attrs_to_get> means read all attributes, a set but empty attrs_to_get means read no attributes, and a set and non-empty attrs_to_get means read those specific attributes. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220405113700.9768-2-nyh@scylladb.com>	2022-04-11 10:04:32 +02:00
Nadav Har'El	9c1ebdceea	alternator: forbid empty AttributesToGet In DynamoDB one can retrieve only a subset of the attributes using the AttributesToGet or ProjectionExpression parameters to read requests. Neither allows an empty list of attributes - if you don't want any attributes, you should use Select=COUNT instead. Currently we correctly refuse an empty ProjectionExpression - and have a test for it: test_projection_expression.py::test_projection_expression_toplevel_syntax However, Alternator is missing the same empty-forbidding logic for AttributesToGet. An empty AttributesToGet is currently allowed, and basically says "retrieve everything", which is sort of unexpected. So this patch adds the missing logic, and the missing test (actually two tests for the same thing - one using GetItem and the other Query). Fixes #10332 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220405113700.9768-1-nyh@scylladb.com>	2022-04-11 10:21:02 +03:00
Nadav Har'El	67e0590bbc	alternator: remove old TODO (with test verifying it) We had an old TODO in the Alternator "Scan" operation code which suggested that we may need to do something to limit the size of pages when a row limit ("Limit") isn't given. But we do already have a built-in limit on page sizes (1 MB), so this TODO isn't needed and can be removed. But I also wanted to make sure we have a test that this limit works: We already had a test that this 1 MB limit works for a single-partition Query (test_query.py::test_query_reverse_longish - tested both forward and reversed queries). In this patch I add a similar test for a whole-table Scan. It turns out that although page size is limited in this case as well, it's not exactly 1 MB... For small tables it can even reach 3 MB. I consider this "good enough" and that we can drop the TODO, but also opened issue #10327 to document this surprising (for me) finding. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220404145240.354198-1-nyh@scylladb.com>	2022-04-05 09:23:23 +03:00
Nadav Har'El	7f89c8b3e3	alternator: clean error shutdown in case of TLS misconfiguration The way our boot-time service "controllers" are written, if a controller's start_server() finds an error and throws, it cannot rely on the caller (main.cc) to call stop_server(), and must clean up resources already created (e.g., sharded services) before returning or risk crashes on assertion failures. This patch fixes such a mistake in Alternator's initialization. As noted in issue #10025, if the Alternator TLS configuration is broken - especially the certificate or key files are missing - Scylla would crash on an assertion failure, instead of reporting the error as expected. Before this patch such a misconfiguration will result in the unintelligible: <alternator::server>::~sharded() [Service = alternator::server]: Assertion `_instances.empty()' failed. Aborting on shard 0. After this patch we get the right error message: ERROR 2022-03-21 15:25:07,553 [shard 0] init - Startup failed: std::_Nested_exception<std::runtime_error> (Failed to set up Alternator TLS credentials): std::_Nested_exception<std::runtime_error> (Could not read certificate file conf/scylla.crt): std::filesystem::__cxx11::filesystem_error (error system:2, filesystem error: open failed: No such file or directory [conf/scylla.crt]) Arguably this error message is a bit ugly, so I opened https://github.com/scylladb/seastar/issues/1029, but at least it says exactly what the error is. Fixes #10025 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220321133323.3150939-1-nyh@scylladb.com>	2022-03-28 15:26:42 +03:00
Nadav Har'El	653f2df28f	alternator: fix JSON escaping of error responses In the DynamoDB API, error responses are in JSON format with specific fields ("__type" and "message" in the x-amz-json-1.0 format currently used). Alternator tried to be clever and build the string representation of this JSON itself, instead of using RapidJSON. But this optimization was a mistake - if the error message contains characters that need escaping (such as double quotes and newlines), they weren't escaped, and the resulting JSON was malformed. When the client library boto3 read this malformed JSON it got confused, considered the entire error response to be a string, which resulted in an ugly error message. The fix is easy - just build the JSON output as usual with RapidJSON instead of trying to optimize using string operations. The patch also includes two tests reproducing this bug and checking its fix. The first test uses boto3 and shows it got confused on the type of error (not understanding that it is a ValidationException). The second test bypasses boto3 and shows exactly where the bug happens - the response is an unparsable JSON. Fixes #10278 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220327132705.3707979-1-nyh@scylladb.com>	2022-03-27 16:32:36 +03:00
Mikołaj Sielużycki	6f1b6da68a	compile: Fix headers so that *-headers targets compile cleanly. Closes #10273	2022-03-25 16:19:26 +02:00
Pavel Solodovnikov	95c8d65949	treewide: fix compilation issues with fmtlib 8.1.0+ Due to `fd62fba985` scoped enums are not automatically converted to integers anymore, this is the intended behavior, according to the fmtlib devs. A bit nicer solution would be to use `std::to_underlying` instead of a direct `static_cast`, but it's not available until C++23 and some compilers are still missing the support for it. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-03-16 12:31:50 +03:00
Nadav Har'El	c26230943b	alternator ttl: add metrics This patch adds metrics to the Alternator TTL feature (aka the "expiration service"). I put these metrics deliberately in their own object in ttl.{hh,cc}, and also with their own prefix ("expiration_") - and *not* together with the rest of the Alternator metrics (alternator/stats.{hh,cc}). This is because later we may want to use the expiration service not only in Alternator but also in CQL - to support per-item expiration with CDC events also in CQL. So the implementation of this feature should not be too tangled with that of Alternator. The patch currently adds four metrics, and opens the path to easily add more in the future. The metrics added now are: 1. scylla_expiration_scan_passes: The number of scan passes over the entire table. We expect this to grow by 1 every alternator_ttl_period_in_seconds seconds. 2. scylla_expiration_scan_table: The number of table scans. In each scan pass, we scan all the tables that have the Alternator TTL feature enabled. Each scan of each table is counted by this counter. 3. scylla_expiration_items_deleted: Counts the number of items that the expiration service expired (deleted). Please remember that each item is considered for expiration - and then expired - on only one node, so each expired item is counted only once - not RF times. 4. scylla_expiration_secondary_ranges_scanned: If this counter is incremented, it means this node took over some other node's expiration scanning duties while the other node was down. This patch also includes a couple of unrelated comment fixes. I tested the new metrics manually - they aren't yet tested by the Alternator test suite because I couldn't make up my mind if such tests would belong in test_ttl.py or test_metrics.py :-) Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220224092419.1132655-1-nyh@scylladb.com>	2022-02-25 07:26:11 +02:00
Nadav Har'El	db7b11cfc4	alternator: make TTL expiration scanner bypass cache The background scan for expired Alternator items (the TTL feature) should bypass the cache to avoid polluting it with the entire content of the table being scanned. I tested that the flag added in this patch really works by adding a printout to the code in table.cc which creates the reader. Although we do have a metric for uses of BYPASS CACHE, unfortunately this metric counts usage of "BYPASS CACHE" in CQL statements - and does not account for the low-level calls that we use in the ttl scanner. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-02-25 07:26:11 +02:00
Nadav Har'El	49a8164fb7	alternator: add configurable scan period to TTL expiration Before this patch, the experimental TTL (expiration time) feature in Alternator scans tables for expiration in a tight loop - starting the next scan one second after the previous one completed. In this patch we introduce a new configuration option, alternator_ttl_period_in_seconds, which determines how frequently to start the scan. The default is 24 hours - meaning that the next scan is started 24 hours after the previous one started. The tests (test/alternator/run) change this configuration back to one second, so that expiration tests finish as quickly as possible. Please note that the scan is not slowed down to fill this 24 hours - if it finishes in one hour, it will then sleep for 23 hours. Additional work would be needed to slow down the scan to not finish too quickly. One idea not yet implemented is to move the expiration service from the "maintenance" scheduling group which it uses today to a new scheduling group, and modifying the number of shares that this group gets. Another thing worth noting about the configurable period (which defaults to 24 hours) is that when TTL is enabled on an Alternator table, it can take that amount of time until its scan starts and items start expiring from it. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-02-25 07:26:11 +02:00
Nadav Har'El	f292d3d679	alternator: make schema modifications in CreateTable atomic The Alternator CreateTable operation currently performs several schema-changing operations separately - one by one: It creates a keyspace, a table in that keyspace and possibly also multiple views, and it sets tags on the table. A consequence of this is that concurrent CreateTable and DeleteTable operations (for example) can result in unexpected errors or inconsistent states - for example CreateTable wants to create the table in the keyspace it just created, but a concurrent DeleteTable deleted it. We have two issues about this problem (#6391 and #9868) and three tests (test_table.py::test_concurrent_create_and_delete_table) reproducing it. In this patch we fix these problems by switching to the modern Scylla schema-changing API: Instead of doing several schema-changing operations one by one, we create a vector of schema mutations performing all these operations - and then perform all these mutations together. When the experimental Raft-based schema modifications are enabled, this completely solves the races, and the tests begin to pass. However, if the experimental Raft mode is not enabled, these tests continue to fail because there is still no locking while applying the different schema mutations (not even on a single node). So I put a special fixture "fails_without_raft" on these tests - which means that the tests xfail if run without raft, and are expected to pass when run on Raft. Indeed, after this patch test/alternator/run --raft test_table.py::test_concurrent_create_and_delete_table shows three passing tests (they also pass if we drastically increase the number of iterations), while test/alternator/run test_table.py::test_concurrent_create_and_delete_table shows three xfailing tests. All other Alternator tests pass as before with this patch, verifying that the handling of new tables, new views, tags, and CDC log tables, all happen correctly even after this patch. 
A note about the implementation: Before this patch, the CreateTable code used high-level functions like prepare_new_column_family_announcement(). These high-level functions become unusable if we write multiple schema operations to one list of mutations, because for example this function validates that the keyspace had already been created - when it hasn't and that's the whole point. So instead we had to use lower-level functions like add_table_or_view_to_schema_mutation() and before_create_column_family(). However, despite being lower level, these functions were public so I think it's reasonable to use them, and we probably have no other alternative. Fixes #6391 Fixes #9868 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-02-18 09:03:52 +02:00
Nadav Har'El	4e3038b57f	alternator: add FIXME for schema changes requiring a loop In commit `a664ac7ba5`, the Alternator schema-modifying code (e.g., delete_table()) was reorganized to support the new Raft-based schema modifications. Schema modifications now work with an "optimistic locking" approach: We retrieve the current schema version id ("group0_guard"), read the current schema and verify that we can make the changes we want, and then apply them with mm.announce(group0_guard) - which will fail if the schema version is not current because some other concurrent modification beat us in the race. This means that we need to do this whole read-modify-write (group0_guard, checking the schema, creating mutations, calling mm.announce()) in a retry loop. We have such a loop in the CQL code but it's missing in the Alternator code. In this patch we don't add the loop yet, but add FIXMEs to remind us where it's missing. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220214154435.544125-1-nyh@scylladb.com>	2022-02-14 18:24:16 +02:00
Nadav Har'El	9982a28007	alternator: allow REMOVE of non-existent nested attribute DynamoDB allows an UpdateItem operation "REMOVE x.y" when a map x exists in the item, but x.y doesn't - the removal silently does nothing. Alternator incorrectly generated an error in this case, and unfortunately we didn't have a test for this case. So in this patch we add the missing test (which fails on Alternator before this patch - and passes on DynamoDB) and then fix the behavior. After this patch, "REMOVE x.y" will remain an error if "x" doesn't exist (saying "document paths not valid for this item"), but if "x" exists and is a map, but "x.y" doesn't, the removal will silently do nothing and will not be an error. Fixes #10043. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220207133652.181994-1-nyh@scylladb.com>	2022-02-07 18:40:48 +02:00
Piotr Sarna	c613d1ce87	alternator: migrate expression parsers to string_view Following the advice in the FIXME note, helper functions for parsing expressions are now based on string views to avoid a few unnecessary conversions to std::string. Tests: unit(dev) Closes #10013	2022-02-04 12:34:19 +02:00
Nadav Har'El	79776ff2ff	alternator: fix error handling during Alternator startup A recent restructuring of the startup of Alternator (and also other protocol servers) led to incorrect error-handling behavior during startup: If an error was detected on one of the shards of the sharded service (in alternator/server.cc), the sharded service itself was never stopped (in alternator/controller.cc), leading to an assertion failure instead of the desired error message. A common example of this problem is when the requested port for the server was already taken (this was issue #9914). So in this patch, exception handling is removed from server.cc - the exception will propagate to the code in controller.cc, which will properly stop the server (including the sharded services) before returning. Fixes #9914. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220130131709.1166716-1-nyh@scylladb.com>	2022-02-02 10:35:57 +01:00
Piotr Sarna	d50ed944f2	alternator: add error injection to BatchGetItem When error injection is enabled at compile time, it's now possible to inject an error into BatchGetItem in order to produce a partial read, i.e. when only part of the items were retrieved successfully.	2022-01-31 12:56:00 +01:00
Piotr Sarna	31f4f062a2	alternator: fill UnprocessedKeys for failed batch reads DynamoDB protocol specifies that when getting items in a batch failed only partially, unprocessed keys can be returned so that the user can perform a retry. Alternator used to fail the whole request if any of the reads failed, but right now it instead produces the list of unprocessed keys and returns them to the user, as long as at least 1 read was successful. NOTE: tested manually by compiling Scylla with error injection, which fails every nth request. It's rather hard to figure out an automatic test case for this scenario. Fixes #9984	2022-01-31 12:56:00 +01:00
Tomasz Grabiec	c89b1953f8	Merge "Enforce linearizability of group 0 operations using state IDs" from Kamil We introduce a new table, `system.group0_history`. This table will contain a history of all group 0 changes applied through Raft. With each change is an associated unique ID, which also identifies the state of all group 0 tables (including schema tables) after this change is applied, assuming that all such changes are serialized through Raft (they will be eventually). Group 0 commands, additionally to mutations which modify group 0 tables, contain a "previous state ID" and a "new state ID". The group 0 state machine will only modify state during command application if the provided "previous state ID" is equal to the last state ID present in the history table. Otherwise, the command will be a no-op. To ensure linearizability of group 0 changes, the performer of the change must first read the last state ID, only then read the state and send a command for the state machine. If a concurrent change races with this command and manages to modify the state, we will detect that the last state ID does not match during `apply`; all calls to `apply` are serialized, and `apply` adds the new entry to the history table at the end, after modifying the group 0 state. The details of this mechanism are abstracted away with `group0_guard`. To perform a group 0 change, one needs to call `announce`, which requires a `group0_guard` to be passed in. The only way to obtain a `group0_guard` is by calling `start_group0_operation`, which underneath performs a read barrier on group 0, obtains the last state ID from the history table, and constructs a new state ID that the change will append to the history table. The read barrier ensures that all previously completed changes are visible to this operation. The caller can then perform any necessary validation, construct mutations which modify group 0 state, and finally call `announce`. 
The guard also provides a timestamp which is used by the caller to construct the mutations. The timestamp is obtained from the new state ID. We ensure that it is greater than the timestamp of the last state ID. Thus, if the change is successful, the applied mutations will have greater timestamps than the previously applied mutations. We also add two locks. The more important one, used to ensure correctness, is `read_apply_mutex`. It is held when modifying group 0 state (in `apply` and `transfer_snapshot`) and when reading it (it's taken when obtaining a `group0_guard` and released before a command is sent in `announce`). Its goal is to ensure that we don't read partial state, which could happen without it because group 0 state consists of many parts and `apply` (or `transfer_snapshot`) potentially modifies all of them. Note: this doesn't give us 100% protection; if we crash in the middle of `apply` (or `transfer_snapshot`), then after restart we may read partial state. To remove this possibility we need to ensure that commands which were being applied before restart but not finished are re-applied after restart, before anyone can read the state. I left a TODO in `apply`. The second lock, `operation_mutex`, is used to improve liveness. It is taken when obtaining a `group0_guard` and released after a command is applied (compare to `read_apply_mutex` which is released before a command is sent). It is not taken inside `apply` or `transfer_snapshot`. This lock ensures that multiple fibers running on the same node do not attempt to modify group 0 concurrently - this would cause some of them to fail (due to the concurrent modification protection described above). This is mostly important during first boot of the first node, when services start for the first time and try to create their internal tables. This lock serializes these attempts, ensuring that all of them succeed. 
* kbr/schema-state-ids-v4: service: migration_manager: `announce`: take a description parameter service: raft: check and update state IDs during group 0 operations service: raft: group0_state_machine: introduce `group0_command` service: migration_manager: allow using MIGRATION_REQUEST verb to fetch group 0 history table service: migration_manager: convert migration request handler to coroutine db: system_keyspace: introduce `system.group0_history` table treewide: require `group0_guard` when performing schema changes service: migration_manager: introduce `group0_guard` service: raft: pass `storage_proxy&` to `group0_state_machine` service: raft: raft_state_machine: pass `snapshot_descriptor` to `transfer_snapshot` service: raft: rename `schema_raft_state_machine` to `group0_state_machine` service: migration_manager: rename `schema_read_barrier` to `start_group0_operation` service: migration_manager: `announce`: split raft and non-raft paths to separate functions treewide: pass mutation timestamp from call sites into `migration_manager::prepare_*` functions service: migration_manager: put notifier call inside `async` service: migration_manager: remove some unused and disabled code db: system_distributed_keyspace: use current time when creating mutations in `start()` redis: keyspace_utils: `create_keyspace_if_not_exists_impl`: call `announce` twice only	2022-01-25 09:52:30 +02:00
Avi Kivity	e74f570eda	alternator: streams: fix use-after-free of data_dictionary in describe_stream() In `4aa9e86924` ("Merge 'alternator: move uses of replica module to data_dictionary' from Avi Kivity"), we changed alternator to use data_dictionary instead of replica::database. However, data_dictionary::database objects are different from replica::database objects in that they don't have a stable address and need to be captured by value (they are pointer-like). One capture in describe_stream() was capturing a data_dictionary::database by reference and so caused a use-after-free when the previous continuation was deallocated. Fix by capturing by value. Fixes #9952. Closes #9954	2022-01-25 09:52:30 +02:00
Kamil Braun	a664ac7ba5	treewide: require `group0_guard` when performing schema changes `announce` now takes a `group0_guard` by value. `group0_guard` can only be obtained through `migration_manager::start_group0_operation` and moved, it cannot be constructed outside `migration_manager`. The guard will be a method of ensuring linearizability for group 0 operations.	2022-01-24 15:20:35 +01:00
Kamil Braun	86762a1dd9	service: migration_manager: rename `schema_read_barrier` to `start_group0_operation` 1. Generalize the name so it mentions group 0, which schema will be a strict subset of. 2. Remove the fact that it performs a "read barrier" from the name. The function will be used in general to ensure linearizability of group0 operations - both reads and writes. "Read barrier" is Raft-specific terminology, so it can be thought of as an implementation detail.	2022-01-24 15:12:50 +01:00
Kamil Braun	283ac7fefe	treewide: pass mutation timestamp from call sites into `migration_manager::prepare_*` functions The functions which prepare schema change mutations (such as `prepare_new_column_family_announcement`) would use internally generated timestamps for these mutations. When schema changes are managed by group 0 we want to ensure that timestamps of mutations applied through Raft are monotonic. We will generate these timestamps at call sites and pass them into the `prepare_` functions. This commit prepares the APIs.	2022-01-24 15:12:50 +01:00
Nadav Har'El	350c3d0f6a	alternator: update comment about default timeout The comment explaining where the default Alternator timeout is set became out-of-date. So fix it. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220120092631.401563-1-nyh@scylladb.com>	2022-01-20 14:05:58 +02:00
Nadav Har'El	4aa9e86924	Merge 'alternator: move uses of replica module to data_dictionary' from Avi Kivity Alternator is a coordinator-side service and so should not access the replica module. In this series all but one of uses of the replica module are replaced with data_dictionary. One case remains - accessing the replication map which is not available (and should not be available) via the data dictionary. The data_dictionary module is expanded with missing accessors. Closes #9945 * github.com:scylladb/scylla: alternator: switch to data_dictionary for table listing purposes data_dictionary: add get_tables() data_dictionary: introduce keyspace::is_internal()	2022-01-19 11:34:25 +02:00
Avi Kivity	7399f3fae7	alternator: switch to data_dictionary for table listing purposes As a coordinator-side service, alternator shouldn't touch the replica module, so it is migrated here to data_dictionary. One use case still remains that uses replica::keyspace - accessing the replication map. This really isn't a replica-side thing, but it's also not logically part of the data dictionary, so it's left using replica::keyspace (using the data_dictionary::database::real_database() escape hatch). Figuring out how to expose the replication map to coordinator-side services is left for later.	2022-01-19 11:03:36 +02:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Nadav Har'El	343c521e28	alternator: avoid large contiguous allocation in BatchGetItem The BatchGetItem request can return a very large response - according to DynamoDB documentation up to 16 MB, but presently in Alternator, we allow even more (see #5944). The problem is that the existing code prepares the entire response as a large contiguous string, resulting in oversized allocation warnings - and potentially allocation failures. So in this patch we estimate the size of the BatchGetItem response, and if it is "big enough" (currently over 100 KB), we return it with the recently added streaming output support. This streaming output doesn't avoid the extra memory copies unfortunately, but it does avoid a contiguous allocation which is the goal of this patch. After this patch, one oversized allocation warning is gone from the test: test/alternator/run test_batch.py::test_batch_get_item_large (a second oversized allocation is still present, but comes from the unrelated BatchWriteItem issue #8183). Fixes #8522 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220111170541.637176-1-nyh@scylladb.com>	2022-01-13 09:46:08 +01:00
Nadav Har'El	8bcd23fa02	Merge: move rest of internal ddl users to use raft from Gleb The patch series moves the rest of internal ddl users to do schema change over raft (if enabled). After that series only tests are left using old API. * 'gleb/raft-schema-rest-v6' of github.com:scylladb/scylla-dev: (33 commits) migration_manager: drop no longer used functions system_distributed_keyspace: move schema creation code to use raft auth: move table creation code to use raft auth: move keyspace creation code to use raft table_helper: move schema creation code to use raft cql3: make query_processor inherit from peering_sharded_service table_helper: make setup_table() static table_helper: co-routinize setup_keyspace() redis: move schema creation code to go through raft thrift: move system_update_column_family() to raft thrift: authenticate a statement before verifying in system_update_column_family() thrift: co-routinize system_update_column_family() thrift: move system_update_keyspace() to raft thrift: authenticate a statement before verifying in system_update_keyspace() thrift: co-routinize system_update_keyspace() thrift: move system_drop_keyspace() to raft thrift: authenticate a statement before verifying in system_drop_keyspace() thrift: co-routinize system_drop_keyspace() thrift: move system_add_keyspace() to raft thrift: co-routinize system_add_keyspace() ...	2022-01-12 18:09:08 +02:00
Gleb Natapov	1491cc2906	alternator: move create_table() to raft	2022-01-12 16:33:16 +02:00
Gleb Natapov	0cd6d283ad	alternator: move update_table() to raft	2022-01-12 16:33:15 +02:00
Gleb Natapov	7ee39ff94b	alternator: move validation in update_table() to the beginning	2022-01-12 16:33:15 +02:00
Gleb Natapov	740b2181e1	alternator: move update_tags() to raft	2022-01-12 16:33:15 +02:00
Gleb Natapov	57be1b773e	alternator: move delete_table() to raft	2022-01-12 16:33:15 +02:00
Gleb Natapov	0ac20b5494	alternator: make some functions static Make add_stream_options, supplement_table_info, supplement_table_stream_info static. They only need a pointer to storage_proxy, so pass it directly.	2022-01-12 16:33:15 +02:00
Gleb Natapov	2e4a8bdfaa	alternator: co-routinize delete_table()	2022-01-12 16:33:15 +02:00
Gleb Natapov	459539e812	migration_manager: do not allow creating keyspace with arbitrary timestamp This was needed to fix issue #2129, which only manifested itself with auto_bootstrap set to false. The option is ignored now and we always wait for the schema to sync during boot.	2022-01-12 16:33:15 +02:00
Nadav Har'El	23e93a26b3	Merge 'Alternator: stream results + chunk results to remove large allocations' from Calle Wilund Refs: #9555 When running the "Kraken" dynamodb streams test to provoke the issue observed by QA, I noticed on my setup mainly two things: Large allocation stalls (+ warnings) and timeouts on read semaphores in DB. This tries to address the first issue, partly by making query_result_view serialization use a chunked vector instead of a linear one, and by introducing a streaming option for json return objects, avoiding linearizing to string before wire. Note that the latter has some overhead issues of its own, mainly data copying, since we essentially will be triple buffering (local, wrapped http stream, and final output stream). Still, normal string output will typically do a lot of realloc which is potential extra copies as well, so... This is not really performance tested, but with these tweaks I no longer get large alloc stalls at least, so that is a plus. :-) Closes #9713 * github.com:scylladb/scylla: alternator::executor: Use streamed result for scan etc if large result alternator::streams: Use streamed result in get_records if large result executor/server: Add routine to make stream object return rjson: Add print to stream of rjson::value query_idl: Make qr_partition::rows/query_result::partitions chunked	2022-01-12 15:53:31 +02:00
Calle Wilund	f73ca9659b	alternator::executor: Use streamed result for scan etc if large result Avoids large allocations for larger scans. Todo: determine threshold	2022-01-12 13:34:49 +00:00
Calle Wilund	0c1ff5c2f5	alternator::streams: Use streamed result in get_records if large result If we have a reasonable result set to send back to the client, use direct streaming of the object. Todo: determine threshold.	2022-01-12 13:34:49 +00:00
Calle Wilund	4a8a7ef8b4	executor/server: Add routine to make stream object return Simply retains result object and sets json::json_return_type to streaming callback.	2022-01-12 13:34:49 +00:00
Pavel Emelyanov	095d93eaf8	pager: Keep shared pointer to proxy onboard Pagers are created by alternator and select statement, both of which have the proxy reference at hand. Next, the pager's unique_ptr is put on the lambda of its fetch_page() continuation and thus it survives the fetch_page execution and then gets destroyed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-01-10 07:58:57 +03:00
Avi Kivity	bbad8f4677	replica: move ::database, ::keyspace, and ::table to replica namespace Move replica-oriented classes to the replica namespace. The main classes moved are ::database, ::keyspace, and ::table, but a few ancillary classes are also moved. There are certainly classes that should be moved but aren't (like distributed_loader) but we have to start somewhere. References are adjusted treewide. In many cases, it is obvious that a call site should not access the replica (but the data_dictionary instead), but that is left for separate work. scylla-gdb.py is adjusted to look for both the new and old names.	2022-01-07 12:04:38 +02:00
Avi Kivity	ae3a360725	database: Move database, keyspace, table classes to replica/ directory The database, keyspace, and table classes represent the replica-only part of the objects after which they are named. Reading from a table doesn't give you the full data, just the replica's view, and it is not consistent since reconciliation is applied on the coordinator. As a first step in acknowledging this, move the related files to a replica/ subdirectory.	2022-01-06 17:07:30 +02:00
Nadav Har'El	e4b2dfb54d	alternator ttl: when node is down, secondary node continues to expire The current implementation of the Alternator expiration (TTL) feature has each node scan for expired partitions in its own primary ranges. This means that while a node is down, items in its primary ranges will not get expired. But we note that it doesn't have to be this way: If only a single node is down, and RF=3, the items that node owns are still readable with QUORUM - so these items can still be safely read and checked for expiration - and also deleted. This patch implements a fairly simple solution: When a node completes scanning its own primary ranges, it also checks whether any of its secondary ranges (ranges where it is the second replica) has its primary owner down. For such ranges, this node will scan them as well. This secondary scan stops if the remote node comes back up, but in that case it may happen that both nodes will work on the same range at the same time. The risks in that are minimal, though, and amount to wasted work and duplicate deletion records in CDC. In the future we could avoid this by using LWT to claim ownership on a range being scanned. We have a new dtest (see a separate patch), alternator_ttl_tests.py::TestAlternatorTTL::test_expiration_with_down_node, which reproduces this and verifies this fix. The test starts a 5-node cluster, with 1000 items with random tokens which are due to be expired immediately. The test expects to see all items expiring ASAP, but when one of the five nodes is brought down, this doesn't happen: Some of the items are not expired, until this patch is used. Fixes #9787 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211222131933.406148-1-nyh@scylladb.com>	2021-12-26 14:10:52 +02:00
Nadav Har'El	31eeb44d28	alternator: fix error on UpdateTable for non-existent table When the UpdateTable operation is called for a non-existent table, the appropriate error is ResourceNotFoundException, but before this patch we ran into an exception, which resulted in an ugly "internal server error". In this patch we use the existing get_table() function which most other operations use, and which does all the appropriate verifications and generates the appropriate Alternator api_error instead of letting internal Scylla exceptions escape to the user. This patch also includes a test for UpdateTable on a non-existent table, which used to fail before this patch and pass afterwards. We also add a test for DeleteTable in the same scenario, and see it didn't have this bug. As usual, both tests pass on DynamoDB, which confirms we generate the right error codes. Fixes #9747. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211206181605.1182431-1-nyh@scylladb.com>	2021-12-14 13:09:27 +01:00
Gleb Natapov	38e1f85959	migration_manager: drop view_ptr array from announce_column_family_update() No users pass it any longer.	2021-12-11 12:31:07 +02:00

587 Commits