scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-27 03:45:11 +00:00

Author	SHA1	Message	Date
Piotr Dulikowski	bc50163016	tests: add per_partition_rate_limit_test Adds the per_partition_rate_limit_test.cc file. Currently, it only contains a test which verifies that the feature correctly switches off rate limiting for internal queries (!allow_limit \|\| internal sg).	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	761a037afb	config: add add_per_partition_rate_limit_extension function for testing ...and use it in cql_test_env to enable the per_partition_rate_limit extension for all tests that use it.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	1a36029ab5	cf_prop_defs: guard per-partition rate limit with a feature The per-partition rate limit feature requires all nodes in the cluster to support it in order to work well. This commit adds a check which disallows creating/altering tables with per-partition rate limit until the node is sure that all nodes in the cluster support it.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	a7ad70600d	query-request: add allow_limit flag Adds allow_limit flag to the read_command. The flag decides whether rate limiting of this operation is allowed.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	c691e94190	storage_proxy: add allow rate limit flag to get_read_executor Adds a flag to get_read_executor which decides whether the read should be rate limited or not. The read executors were modified to choose the appropriate per partition rate limit info parameter and send it to the replicas.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	3357066387	storage_proxy: resultize return type of get_read_executor Now, get_read_executor is able to return coordinator exceptions without throwing them. In an upcoming commit, it will start returning rate limit exception in some cases and it is preferable to return them without throwing.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	d3d9add219	storage_proxy: add per partition rate limit info to read RPC Now, the read RPC accept the per partition rate limit info parameter. It is passed on to query_result_local(_digest) methods.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	e8e8ada4b4	storage_proxy: add per partition rate limit info to query_result_local(_digest) The query_result_local and query_result_local_digest methods were updated to accept db::per_partition_rate_limit::info structure and pass it on to database::accept.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	e6beab3106	storage_proxy: add allow rate limit flag to mutate/mutate_result Now, mutate/mutate_result accept a flag which decides whether the write should be rate limited or not. The new parameter is mandatory and all call sites were updated.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	1f65c4e001	storage_proxy: add allow rate limit flag to mutate_internal Now, mutate_internal accepts a flag which decides whether the write should be rate limited or not.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	1e4e92ed8b	storage_proxy: add allow rate limit flag to mutate_begin Now, mutate_begin accepts a flag which decides whether given write should be rate limited or not.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	76e95e7ae8	storage_proxy: choose the right per partition rate limit info in write handler Now, write response handler calculates the appropriate rate limit info parameter and passes it to the mutation holder.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	2a7ba76c3e	storage_proxy: resultize return types of write handler creation path The mutate_prepare and create_write_response_handler(_helper) functions are modified to be able to return exceptions without throwing them. In an upcoming commit, create_write_response_handler will sometimes return rate limit exception, and it is preferable to return them without throwing.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	3f88ecdea6	storage_proxy: add per partition rate limit to mutation_holders Now, `apply_locally` and `apply_remotely` accept the per partition rate limit info parameter.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	02469e0b15	storage_proxy: add per partition rate limit info to write RPC Adds db::per_partition_rate_limit::info parameter to the write RPC. The rate limit info controls the behavior of the rate limiter on the replica.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	c06376b383	storage_proxy: add per partition rate limit info to mutate_locally Now, mutate_locally accepts a parameter that controls the rate limiter behavior on the replica.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	cc9a2ad41f	database: apply per-partition rate limiting for reads/writes Adds the `db::rate_limiter` to the `database` class and modifies the `query` and `apply` methods so that they account the read/write operations in the rate limiter and optionally reject them.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	ec635ba170	database: move and rename: classify_query -> classify_request Moves the classify_query higher and renames it to classify_request. The function will be reused in further commits to protect non-user queries from accidentally being rate limited.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	dccb8a5729	schema: add per_partition_rate_limit schema extension Adds the new `per_partition_rate_limit` schema extension. It has two parameters: `max_writes_per_second` and `max_reads_per_second`. In the future commits they will control how many operations of given type are allowed for each partition in the given table.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	0fe8b55427	db: add rate_limiter Introduces the rate_limiter, a replica-side data structure meant for tracking the frequence with which each partition is being accessed (separately for reads and writes) and deciding whether the request should be accepted and processed further or rejected. The limiter is implemented as a statically allocated hashmap which keeps track of the frequency with which partitions are accessed. Its entries are incremented when an operation is admitted and are decayed exponentially over time. If a partition is detected to be accessed more than its limit allows, requests are rejected with a probability calculated in such a way that, on average, the number of accepted requests is kept at the limit. The structure currently weights a bit above 1MB and each shard is meant to keep a separate instance. All operations are O(1), including the periodic timer.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	2162bb9f3b	storage_proxy: propagate rate_limit_exception through read RPC This commit modifies the read RPC and the storage_proxy logic so that the coordinator knows whether a read operation failed due to rate limit being exceeded, and returns `exceptions::rate_limit_exception` if that happens.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	000f417d23	gms: add TYPED_ERRORS_IN_READ_RPC cluster feature We would like to extend the read RPC to return an optional, second value which indicates an exception - seastar type-erases exception on the RPC handler boundary and we need to differentiate rate_limit_exception from others. However, it may happen that a replica with an up-to-date version of Scylla tries to return an exception in this way to a coordinator with an old version and the coordinator will drop the error, thinking that the request succeeded. In order to protect from that, we introduce the `TYPED_ERROR_IN_READ_RPC` feature. Only after it is enabled replicas will start returning exceptions in the new way, and until then all exceptions will be reported using seastar's type-erasure mechanism.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	51546b0609	storage_proxy: pass rate_limit_exception through write RPC This commit modifies the storage_proxy logic so that the coordinator knows whether a write operation failed due to rate limit being exceeded, and returns `exceptions::rate_limit_exception` when that happens.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	621b7f35e2	replica: add rate_limit_exception and a simple serialization framework Introduces `replica::rate_limit_exception` - an exceptions that is supposed to be thrown/returned on the replica side when the request is rejected due to the exceeding the per-partition rate limit. Additionally, introduces the `exception_variant` type which allows to transport the new exception over RPC while preserving the type information. This will be useful in later commits, as the coordinator will have to know whether a replica has failed due to rate limit being exceeded or another kind of error. The `exception_variant` currently can only either hold "other exception" (std::monostate) or the aforementioned `rate_limit_exception`, but can be extended in a backwards-compatible way in the future to be able to hold more exceptions that need to be handled in a different way.	2022-06-22 20:07:58 +02:00
Piotr Dulikowski	a55d7ad46d	docs: design doc for per-partition rate limiting	2022-06-22 20:07:58 +02:00
Piotr Dulikowski	efc3953c0a	transport: add rate_limit_error Adds a CQL protocol extension which introduces the rate_limit_error. The new error code will be used to indicate that the operation failed due to it exceeding the allowed per-partition rate limit. The error code is supposed to be returned only if the corresponding CQL extension is enabled by the client - if it's not enabled, then Config_error will be returned in its stead.	2022-06-22 20:07:58 +02:00
Konstantin Osipov	c59a730c1e	test: re-enable but mark as flaky cdc_with_lwt_test Running flaky tests prevents regressions from sneaking which is possible if the test is disabled. Closes #10832	2022-06-21 16:36:49 +03:00
Avi Kivity	88f75f91ae	cql3: grammar: move semantic check for incompatible UPDATEs to prepare() code The grammar now checks that UPDATEs don't clash (for example, updates to the same column). The checks are good, but the grammar isn't the right place for them - better to concentrate all the checks in the prepare() code so it's easy to see all the checks. Move the checks to raw::update_statement::prepare_internal(). This exposes that the checks are quadratic, so add a comment. It could be fixed with a stable_sort() first, but that is left to later. Closes #10820	2022-06-21 11:42:11 +02:00
Botond Dénes	2b62f67593	tools/scylla-types: escape {} chars in description So fmt::format() doesn't interpret them as substitutions and doesn't error-out because there is no argument for them. Closes #10803	2022-06-21 11:58:13 +03:00
Michael Livshin	1e7360ef6d	checksum_utils_test: supply valid input to crc32_combine() If the len2 argument to crc32_combine() is zero, then the crc2 argument must also be zero. fast_crc32_combine() explicitly checks for len2==0, in which case it ignores crc2 (which is the same as if it were zero). zlib's crc32_combine() used to have that check prior to version 1.2.12, but then lost it, making its necessary for callers to be more careful. Also add the len2==0 check to the dummy fast_crc32_combine() implementation, because it delegates to zlib's. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Closes #10731	2022-06-21 11:58:13 +03:00
Botond Dénes	5b50725c45	Merge 'partition_snapshot_row_cursor: avoid unnecessary row cloning in row()' from Michał Chojnowski Due to implementation details, all `deletable_row`s used in `row()` are copied twice, even though the only need to be copied/applied once. This is unnecessary work. `perf_simple_query_g --enable-cache=1 --flush --smp 1 --duration 30` Before: median 158516.17 tps ( 64.1 allocs/op, 12.1 tasks/op, 45010 insns/op) After: median 164307.76 tps ( 62.1 allocs/op, 12.1 tasks/op, 43220 insns/op) Closes #10509 * github.com:scylladb/scylla: partition_snapshot_row_cursor: construct the clustering_row directly in row() mutation_fragment: add a "from deletable_row" constructor to clustering_row mutation_fragment: pass the applied row by reference in clustering_row::apply()	2022-06-21 11:58:13 +03:00
Botond Dénes	121900e377	Merge "Sanitize compaction manager construction and stopping" from Pavel Emelyanov " In order to wire-in the compaction_throughput_mb_per_sec the compaction creation and stopping will need to be patched. Right now both places are quite hairy, this set coroutinizes stop() for simpler adding of stopping bits, unifies all the compaction manager constructors and adds the compaction_manager::config for simpler future extending. As a side effect the backlog_controller class gets an "abstract" sched group it controlls which in turn will facilitate seastar sched groups unification some day. " * 'br-compaction-manager-start-stop-cleanup' of https://github.com/xemul/scylla: compaction_manager: Introduce compaction_manager::config backlog_controller: Generalize scheduling groups database: Keep compound flushing sched group compaction_manager: Swap groups and controller compaction_manager: Keep compaction_sg on board compaction_manager: Unify scheduling_group structures compaction_manager: Merge static/dynamic constructors compaction_manager: Coroutinuze really_do_stop() compaction_manager: Shuffle really_do_stop() compaction_manager: Remove try-catch around logger	2022-06-21 11:58:13 +03:00
Raphael S. Carvalho	aa667e590e	sstable_set: Fix partitioned_sstable_set constructor The sstable set param isn't being used anywhere, and it's also buggy as sstable run list isn't being updated accordingly. so it could happen that set contains sstables but run list is empty, introducing inconsistency. we're fortunate that the bug wasn't activated as it would've been a hard one to catch. found this while auditting the code. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220617203438.74336-1-raphaelsc@scylladb.com>	2022-06-21 11:58:13 +03:00
Benny Halevy	59acc58920	test: error_injection: test_inject_noop: do no rely on wall clock timing This unit test may fail in debug mode since .then may yield, even if inject returns a ready future, so the wall clock timing has no basis. As seen in https://jenkins.scylladb.com/job/releng/job/Scylla-CI/932/testReport/junit/boost.error_injection_test.debug/test_boost_error_injection_test/test_inject_noop/ ``` [Exception] - critical check wait_time.count() < sleep_msec.count() has failed [47 >= 10] == [File] - test/boost/error_injection_test.cc == [Line] -45 ``` Instead, just verify that inject returns a successful, ready future, when using a non-existing errr injection name. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10842	2022-06-21 11:58:13 +03:00
Takuya ASADA	13caac7ae6	install.sh: install files with correct permission in strict umask setting To avoid failing to run scripts in non-root user, we need to set permission explicitly on executables. Fixes #10752 Closes #10840	2022-06-20 17:52:03 +03:00
Avi Kivity	a8507a6d28	Merge 'docs/contribute/maintainer.md: expand with merging guidelines' from Botond Dénes The current maintainer.md lacks any guidelines on what patches to accept/reject. Instead maintainers are expected to observe the unwritten rules as exercised by more senior maintainers, as well as use their own judgement or ask when in doubt. This has worked well as maintainers are all people who either worked at the company for a long time and hence had time to observe how things work, and/or have previous experience maintaining open-source projects. Nevertheless, many times I have wished we had a guideline I could glance at to make sure I considered all the angles and to make sure I did not forget some important unwritten rule. This series attempts to concisely summarize these unwritten rules in the form of a checklist, without attempting to cover all exceptions and corner-cases. This should already be enough for a maintainer-in-doubt to be able to quickly go over the checklist and see if they forgot to check anything (especially when evaluating backports). /cc @scylladb/scylla-maint Closes #10806 * github.com:scylladb/scylla: docs/contribute/maintainer.md: add merging and backporting guidelines docs/contribute/CONTRIBUTING.md: add reference to review checklist: docs/contribute/review-checklist.md: add section about patch organization docs/contribute/maintainer.md: expand section on git submodule sync	2022-06-20 17:20:52 +03:00
Botond Dénes	c3e7c1cf59	tools/schema_loader: load_schemas(): add note about CDC table names Explaining how the code determines what tables are CDC tables when parsing schema statements. Closes #10788	2022-06-20 17:16:33 +03:00
Michał Chojnowski	5570354f44	partition_snapshot_row_cursor: construct the clustering_row directly in row() Currently row() creates an empty clustering_row, then applies deletable_rows from the cursor to the empty clustering_row. But the apply logic is unnecessary for the first apply(), and it's cheaper to simply copy the row.	2022-06-20 15:45:19 +02:00
Michał Chojnowski	52c963b331	mutation_fragment: add a "from deletable_row" constructor to clustering_row Currently, construction of clustering_row from deletable_row is done by applying the deletable_row to an empty clustering_row. Direct construction is a slightly cheaper alternative.	2022-06-20 15:45:19 +02:00
Michał Chojnowski	a061eb9e76	mutation_fragment: pass the applied row by reference in clustering_row::apply() Currently, clustering_row::apply() takes deletable_row by reference, but copies it before passing it to deletable_row::apply(). This is more expensive than passing the reference down (by about 1800 instructions for perf_simple_query rows).	2022-06-20 15:22:17 +02:00
Avi Kivity	f8d84e3aaf	Update tools/java submodule (sync to Cassandra 3.11.3, deps update) * tools/java d4133b54c9...de8289690e (1): > Merge 'Sync with Cassandra 3.11.13 and update a few dependencies (v2)' from Piotr Grabowski	2022-06-20 13:25:17 +03:00
Asias He	72797bf516	token_metadata: Shortcut zero leaving nodes case in calculate_pending_ranges_for_leaving If there are zero leaving nodes, no need to calculate anything. This saves time for calculating pending ranges in large clusters significantly to avoid unnecessary calculation. Refs #10337 Closes #10822	2022-06-20 13:19:58 +03:00
Piotr Sarna	bbbd1f4edd	Merge 'alternator: make BatchGetItem group reads by partition' from Nadav Har'El This small series improves Alternator's BatchGetItem performance by grouping requests to the same partition together (Fixes #10753) and also improves error checking when the same item is requested more than once (Fixes #10757). Closes #10834 * github.com:scylladb/scylla: alternator: make BatchGetItem group reads by partition test/alternator: additional test for BatchGetItem	2022-06-20 10:07:19 +02:00
Raphael S. Carvalho	f15a6ce41a	tests: Introduce optional RNG seed for boost suite Today, if you want to reproduce a rare condition using the same RNG seed reported, you cannot use test.py which provides useful infrastructure and will have to run the tests manually instead. So let's extend test.py to allow optional forwarding of RNG seed to boost tests only, as other suites don't support the seed option. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220615223657.142110-1-raphaelsc@scylladb.com>	2022-06-20 07:19:08 +03:00
Nadav Har'El	3aca1ca572	alternator: make BatchGetItem group reads by partition DynamoDB API's BatchGetItem invokes a number (up to 25) of read requests in parallel, returning when all results are available. Alternator naively implemented this by sending all read requests in parallel, no matter which requests these were. That implementation was inefficient when all the requests are to different items (clustering rows) of the same partition. In a multi-node setup this will end up sending 25 separate requests to the same remote node(s). Even on a single-node setup, this may result in reading from disk more than once, and even if the partition is cached - doing an O(logN) search in each multiple times. What we do in this patch, instead, is to group all the BatchGetItem requests that aimed at the same partition into a single read request asking for a (sorted) list of clustering keys. This is similar to an "IN" request in CQL. As an example of the performance benefit of this patch, I tried a BatchGetItem request asking for 20 random items from a 10-million item partition. I measured the latency of this request on a single-node Scylla. Before this patch, I saw a latency of 17-21 ms (the lower number is when the request is retried and the requested items are already in the cache). After this patch, the latency is 10-14 ms. The performance improvement on multi-node clusters are expected to be even higher. Unfortunately the patch is less trivial than I hoped it would be, because some of the old code was organized under the assumption that each read request only returned one item (and if it failed, it means only one item failed), so this part of the code had to be reorganized (and, for making the code more readable, coroutinized). An unintended benefit of the code reorganization is that it also gave me an opportunity to fail an attempt to ask BatchGetItem the same item more than once (issue #10757). The patch also adds a few more corner cases in the tests, to be even more sure that the code reorganization doesn't introduce a regression in BatchGetItem. Fixes #10753 Fixes #10757 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-06-19 14:47:57 +03:00
Pavel Emelyanov	85263b2d02	trace-state: Remove unused fields ... and one friendship declaration tests: compilation Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220616094224.30676-1-xemul@scylladb.com>	2022-06-17 15:02:51 +03:00
Geoffrey Beausire	ee9841b138	Ensure gossip is enabled on all shards before starting the failure_detector_loop Before it was possible for a race condition to happen where the failure_detector_loop is started before the gossiper._enabled is set to true on every shard. This change ensure that _enabled is set to true before moving forward Closes #10548	2022-06-17 14:10:45 +03:00
Avi Kivity	4d587e0c3d	cql3: raw_value: deduplicate view() and to_view() Commit `e739f2b779` ("cql3: expr: make evaluate() return a cql3::raw_value rather than an expr::constant") introduced raw_value::view() as a synonym to raw_value::to_view() to reduce churn. To fix this duplication, we now remove raw_value::to_view(). raw_value::to_view() was picked for removal because is has fewer call sites, reducing churn again. Closes #10819	2022-06-17 09:32:58 +02:00
Avi Kivity	19a6e69001	cql3: accept and type-check reused named bind variables A named bind-variable can be reused: SELECT * FROM tab WHERE a = :var AND b = :var Currently, the grammar just ignores the possibility and creates a new variable with the same name. The new variable cannot be referenced by name since the first one shadows it. Catch variable reuse by maintaining a map from bind variable names to indexed, and check that when reusing a bind variable the types match. A unit test is added. Fixes #10810 Closes #10813	2022-06-17 09:09:49 +02:00
Konstantin Osipov	670b2562a1	lwt: Cassandrda compatibility when incarnating a row for UPDATE When evaluating an LWT condition involving both static and non-static cells, and matching no regular row, the static row must be used UNLESS the IF condition is IF EXISTS/IF NOT EXISTS, in which case special rules apply. Before this fix, Scylla used to assume a row doesn't exist if there is no matching primary key. In Cassandra, if there is a non-empty static row in the partition, a regular row based on the static row' cell values is created in this case, and then this row is used to evaluate the condition. This problem was reported as gh-10081. The reason for Scylla behaviour before the patch was that when implementing LWT I tried to converge Cassandra data model (or lack of thereof) with a relational data model, and assumed a static row is a "shared" portion of a regular row, i.e. a storage level concept intended to save space, and doesn't have independent existence. This was an oversimplification. This patch fixes gh-10081, making Scylla semantics match the one of Cassandra. I will now list other known examples when a static row has an own independent existence as part of a table, for cataloguing purposes. SELECT * from a partition which has a partition key and a static cell set returns 1 row. If later a regular row is added to the partition, the SELECT would still return 1 row, i.e. the static row will disappear, and a regular row will appear instead. Another example showing a static row has an independent existence below: CREATE TABLE t (p int, c int, s int static, PRIMARY KEY(p, c)); INSERT INTO t (p, c) VALUES(1, 1); INSERT INTO t (p, s) VALUES(1, 1) IF NOT EXISTS; In Cassandra (and Scylla), IF NOT EXISTS evaluates to TRUE, even though both the regular row and the partition exist. But the static cells are not set, and the insert only provides a partition key, so the database assumes the insert is operating against a static row. It would be wrong to assume that a static row exists when the partition key exists: INSERT INTO t (p, c, s) VALUES(1, 1, 1) IF NOT EXISTS; [applied] \| p \| c \| s -----------+---+---+------ False \| 1 \| 1 \| null evaluates to False, i.e. the regular row does exist when p and c exist. Issue CREATE TABLE t (p INT, c INT, r INT, s INT static, PRIMARY KEY(p, c)) INSERT INTO t (p, s) VALUES (1, 1); UPDATE t SET s=2, r=1 WHERE p=1 AND c=1 IF s=1 and r=null; - in this case, even though the regular row doesn't exist, the static row does, and should be used for condition evaluation. In other words, IF EXISTS/IF NOT EXISTS have contextual semantics. They apply to the regular row if clustering key is used in the WHERE clause, otherwise they apply to static row. One analogy for static rows is that it is like a static member of C++ or Java class. It's an attribute of the class (assuming class = partition), which is accessible through every object of the class (object = regular row). It is also present if there are no objects of the class, but the class itself exists: i.e. a partition could have no regular rows, but some static cells set, in this case it has a static row. Unlike C++/Java static class members a static row is an optional attribute of the partition. A partition may exist, but the static row may be absent (e.g. no static cell is set). If the static row does exist, all regular rows share its contents, even if they do not exist. A regular row exists when its clustering key is present in the table. A static row exists when at least one static cell is set. Tests are updated because now when no matching row is found for the update we show the value of the static row as the previous value, instead of a non-matching clustering row. Changes in v2: - reworded the commit message - added select tests Closes #10711	2022-06-16 19:23:46 +03:00

1 2 3 4 5 ...

31615 Commits