scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-26 11:30:36 +00:00

Author	SHA1	Message	Date
Piotr Dulikowski	e5b2218ad4	hinted handoff: use bool instead of send_state_set After restart_segment was removed from send_state enum, send_state_set now has only one possible element: segment_replay_failed. This patch removes send_state_set and uses bool in its place instead.	2020-06-12 16:10:20 +02:00
Piotr Dulikowski	6b34bb1a43	hinted handoff: update replay position on commitlog failure Hints manager uses commitlog framework to store and replay hints. The commitlog::read_log_file function is used for replaying hints. It reads commitlog entries and passes them to a callback. In case of hints manager, the callback calls manager::send_one_hint function. In case something goes wrong during this process, sending of that file is attempted again later. If the error was caused by hints that failed to be sent (e.g. due to network error), then we also advance _last_not_complete_rp field to the position of the first hint that failed. In the next retry, we will start reading from the commitlog from that position. However, current logic does not account for the case when an error occurs in the commitlog::read_log_file function itself. If, coincidentally, all hints sent by send_one_hint succeed, then we won't advance the _last_not_complete_rp field and we may unnecessarily repeat sending some of the hints that succeeded. This patch adds the send_one_file_ctx::last_sent_rp field, which keeps track of the last commitlog position for which a hint was attempted to be sent. In case read_log_file throws an error but all send_one_hint calls succeed, then it will be used to update _last_not_complete_rp. This will reduce the amount of hints that are resent in this case to only one. Tests: - unit(dev) - dtest(hintedhandoff_additional_test, dev)	2020-06-12 16:10:20 +02:00
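The retry-position logic described in this commit can be sketched as follows. This is a minimal model under stated assumptions, not the Scylla implementation: `replay_position` is a simplified stand-in for `db::replay_position`, and the member functions `on_hint_attempt`, `on_read_error` and `retry_position` are hypothetical names introduced only for illustration. Only `first_failed_rp`, `last_sent_rp` and `send_one_file_ctx` are names taken from the commit messages.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <tuple>

// Simplified stand-in for db::replay_position (assumption): a segment id and an offset.
struct replay_position {
    uint64_t segment = 0;
    uint32_t offset = 0;

    friend bool operator<(const replay_position& a, const replay_position& b) {
        return std::tie(a.segment, a.offset) < std::tie(b.segment, b.offset);
    }
    friend bool operator==(const replay_position& a, const replay_position& b) {
        return a.segment == b.segment && a.offset == b.offset;
    }
};

// Hypothetical model of the per-file sending context.
struct send_one_file_ctx {
    std::optional<replay_position> first_failed_rp; // lowest position of a failed hint
    std::optional<replay_position> last_sent_rp;    // last position a send was attempted for
    bool segment_replay_failed = false;

    // Called for every hint before it is sent.
    void on_hint_attempt(replay_position rp) { last_sent_rp = rp; }

    // Called when sending a hint fails; keeps the lowest failed position.
    void on_hint_send_failure(replay_position rp) {
        segment_replay_failed = true;
        if (!first_failed_rp || rp < *first_failed_rp) {
            first_failed_rp = rp;
        }
    }

    // Called when reading the hints file itself fails.
    void on_read_error() { segment_replay_failed = true; }

    // Position to restart from on the next retry, if a retry is needed.
    std::optional<replay_position> retry_position() const {
        if (!segment_replay_failed) {
            return std::nullopt;        // whole file was sent
        }
        if (first_failed_rp) {
            return first_failed_rp;     // restart at the first failed hint
        }
        return last_sent_rp;            // read error only: resend at most one hint
    }
};
```

In this model, a read error that follows only successful sends restarts from the last attempted position, so a single hint is resent rather than the whole file.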
Piotr Dulikowski	d369b538f0	hinted handoff: remove rps_set, use first_failed_rp instead When sending hints from one file, rps_set is used to keep track of positions of hints that are currently being sent. If sending of a hint fails, its position is not removed from rps_set. If some hints fail to be sent while handling a hints file, the lowest position from rps_set is used to calculate the position from where to start when sending of the file is retried. Keeping track of commitlog positions this way isn't necessary to calculate this position. This patch removes rps_set and replaces it with first_failed_rp - which is just a single std::optional<db::replay_position>. This value is updated when a hint send failure is detected. This simplifies calculation of the starting position for the next retry, and allows removing some error handling logic related to an edge case when inserting to rps_set fails. Tests: - unit(dev) - dtest(hintedhandoff_additional_test, dev)	2020-06-12 16:10:19 +02:00
Rafael Ávila de Espíndola	555d8fe520	build: Be consistent about system versus regular headers We were not consistent about using '#include "foo.hh"' instead of '#include <foo.hh>' for scylla's own headers. This patch fixes that inconsistency and, to enforce it, changes the build to use -iquote instead of -I to find those headers. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200608214208.110216-1-espindola@scylladb.com>	2020-06-10 15:49:51 +03:00
Piotr Sarna	3458bd2e32	db,view: fix outdated comments Some comments still referred to variable names which are no longer up-to-date. Follow-up for #6560. Message-Id: <2b857ccc900dd64f0d9379f5d6c87fd3aaa5d902.1591594042.git.sarna@scylladb.com>	2020-06-08 09:02:10 +03:00
Nadav Har'El	d6626c217a	merge: add error injection to mv Merged pull request https://github.com/scylladb/scylla/pull/6516 from Piotr Sarna: This series adds error injection points to materialized view paths: view update generation from staging sstables; view building; generating view updates from user writes. This series comes with a corresponding dtest pull request which adds some test cases based on error injection. Fixes #6488	2020-06-07 19:23:23 +03:00
Piotr Sarna	b3a6a33487	db,view: ensure that local updates are applied locally In current mutate_MV() code it's possible for a local endpoint to become a target for a network operation. That's the source of occasional `broken promise` benign error messages appearing, since the mutation is actually applied locally, so there's no point in creating a write response handler - the node will not send a response to itself via network. While at it, the code is deduplicated a little bit - with the paths simplified, it's easier to ensure that a local endpoint is never listed as a target for remote network operations. Fixes #5459 Tests: unit(dev), dtest(materialized_views_test.TestMaterializedViews.add_dc_during_mv_insert_test)	2020-06-07 19:10:03 +03:00
Kamil Braun	d89b7a0548	cdc: rename CDC description tables Commit `968177da04` has changed the schema of cdc_topology_description and cdc_description tables in the system_distributed keyspace. Unfortunately this was a backwards-incompatible change: these tables would always be created, irrespective of whether or not "experimental" was enabled. They just wouldn't be populated with experimental=off. If the user now tries to upgrade Scylla from a version before this change to a version after this change, it will work as long as CDC is protected by the experimental flag and the flag is off. However, if we drop the flag, or if the user turns experimental on, weird things will happen, such as nodes refusing to start because they try to populate cdc_topology_description while assuming a different schema for this table. The simplest fix for this problem is to rename the tables. This fix must get merged in before CDC goes out of experimental. If the user upgrades his cluster from a pre-rename version, he will simply have two garbage tables that he is free to delete after upgrading. sstables and digests need to be regenerated for schema_digest_test since this commit effectively adds new tables to the system_distributed keyspace. This doesn't result in schema disagreement because the table is announced to all nodes through the migration manager.	2020-06-05 09:59:16 +02:00
Piotr Sarna	76e89efc1a	db,view: add error injection points to view building ... in order to be able to test scenarios with failures.	2020-06-05 09:39:58 +02:00
Piotr Sarna	9d524a7a7e	db,view: add error injection points to view update generator ... in order to be able to test scenarios with failures.	2020-06-05 09:39:58 +02:00
Avi Kivity	a4c44cab88	treewide: update concepts language from the Concepts TS to C++20 Seastar recently lost support for the experimental Concepts Technical Specification (TS) and gained support for C++20 concepts. Re-enable concepts in Scylla by updating our use of concepts to the C++20 standard. This change: - peels off uses of the GCC6_CONCEPT macro - removes inclusions of <seastar/gcc6-concepts.hh> - replaces function-style concepts (no longer supported) with equation-style concepts - semicolons added and removed as needed - deprecated std::is_pod replaced by recommended replacement - updates return type constraints to use concepts instead of type names (either std::same_as or std::convertible_to, with std::same_as chosen when possible) No attempt is made to improve the concepts; this is a specification update only. Message-Id: <20200531110254.2555854-1-avi@scylladb.com>	2020-06-02 09:12:21 +03:00
Pavel Emelyanov	67d5fad65f	storage_service: Remove some inclusions of its header GC pass over .cc files. Some really do not need it, some need for features/gossiper Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-06-01 09:08:40 +03:00
Pavel Emelyanov	ee31191e21	storage_service: Move get_generation_number to util/ This is purely utility helper routine. As a nice side effect the inclusion of storage_service.hh is removed from several unrelated places. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-06-01 09:08:40 +03:00
Avi Kivity	0c6bbc84cd	Merge "Classify queries based on their initiator, rather than their target" from Botond " Currently we classify queries as "system" or "user" based on the table they target. The class of a query determines how the query is treated, currently: timeout, limits for reverse queries and the concurrency semaphore. The catch is that users are also allowed to query system tables and when doing so they will bypass the limits intended for user queries. This has caused performance problems in the past, yet the reason we decided to finally address this is that we want to introduce a memory limit for unpaged queries. Internal (system) queries are all unpaged and we don't want to impose the same limit on them. This series uses scheduling groups to distinguish user and system workloads, based on the assumption that user workloads will run in the statement scheduling group, while system workloads will run in the main (or default) scheduling group, or perhaps something else, but in any case not in the statement one. Currently the scheduling group of reads and writes is lost when going through the messaging service, so to be able to use scheduling groups to distinguish user and system reads this series refactors the messaging service to retain this distinction across verb calls. Furthermore, we execute some system reads/writes as part of user reads/writes, such as auth and schema sync. These processes are tagged to run in the main group. This series also centralises query classification on the replica and moves it to a higher level. More specifically, queries are now classified -- the scheduling group they run in is translated to the appropriate query class specific configuration -- on the database level and the configuration is propagated down to the lower layers. Currently this query class specific configuration consists of the reader concurrency semaphore and the max memory limit for otherwise unlimited queries. 
A corollary of the semaphore being selected on the database level is that the read permit is now created before the read starts. A valid permit is now available during all stages of the read, enabling tracking the memory consumption of e.g. the memtable and cache readers. This change aligns nicely with the needs of more accurate reader memory tracking, which also wants a valid permit that is available in every layer. The series can be divided roughly into the following distinct patch groups: * 01-02: Give system read concurrency a boost during startup. * 03-06: Introduce user/system statement isolation to messaging service. * 07-13: Various infrastructure changes to prepare for using read permits in all stages of reads. * 14-19: Propagate the semaphore and the permit from database to the various table methods that currently create the permit. * 20-23: Migrate away from using the reader concurrency semaphore for waiting for admission, use the permit instead. * 24: Introduce `database::make_query_config()` and switch the database methods needing such a config to use it. * 25-31: Get rid of all uses of `no_reader_permit()`. * 32-33: Ban empty permits for good. * 34: querier_cache: use the queriers' permits to obtain the semaphore. Fixes: #5919 Tests: unit(dev, release, debug), dtest(bootstrap_test.py:TestBootstrap.start_stop_test_node), manual testing with a 2 node mixed cluster with extra logging. 
" * 'query-class/v6' of https://github.com/denesb/scylla: (34 commits) querier_cache: get semaphore from querier reader_permit: forbid empty permits reader_permit: fix reader_resources::operator bool treewide: remove all uses of no_reader_permit() database: make_multishard_streaming_reader: pass valid permit to multi range reader sstables: pass valid permits to all internal reads compaction: pass a valid permit to sstable reads database: add compaction read concurrency semaphore view: use valid permits for reads from the base table database: use valid permit for counter read-before-write database: introduce make_query_class_config() reader_concurrency_semaphore: remove wait_admission and consume_resources() test: move away from reader_concurrency_semaphore::wait_admission() reader_permit: resource_units: introduce add() mutation_reader: restricted_reader: work in terms of reader_permit row_cache: pass a valid permit to underlying read memtable: pass a valid permit to the delegate reader table: require a valid permit to be passed to most read methods multishard_mutation_query: pass a valid permit to shard mutation sources querier: add reader_permit parameter and forward it to the mutation_source ...	2020-05-29 10:11:44 +03:00
Piotr Sarna	77e943e9a3	db,views: unify time points used for update generation Until now, view updates were generated with a bunch of random time points, because the interface was not adjusted for passing a single time point. The time points were used to determine whether cells were alive (e.g. because of TTL), so it's better to unify the process: 1. when generating view updates from user writes, a single time point is used for the whole operation 2. when generating view updates via the view building process, a single time point is used for each build step NOTE: I don't see any reliable and deterministic way of writing test scenarios which trigger problems with the old code. After #6488 is resolved and error injection is integrated into view.cc, tests can be added. Fixes #6429 Tests: unit(dev) Message-Id: <f864e965eb2e27ffc13d50359ad1e228894f7121.1590070130.git.sarna@scylladb.com>	2020-05-28 12:56:09 +03:00
Botond Dénes	734e995639	database: add compaction read concurrency semaphore All reads will soon require a valid permit, including those done during compaction. To allow creating valid permits for these reads, create a compaction-specific semaphore. This semaphore is unlimited, as compaction concurrency is managed by a higher-level layer; we use it just for resource usage accounting.	2020-05-28 11:34:35 +03:00
Botond Dénes	992e697dd5	view: use valid permits for reads from the base table View update generation involves reading existing values from the base table, which will soon require a valid permit to be passed to it, so make sure we create and pass a valid permit to these reads. We use `database::make_query_class_config()` to obtain the semaphore for the read which selects the appropriate user/system semaphore based on the scheduling group the base table write is running in.	2020-05-28 11:34:35 +03:00
Botond Dénes	e4c591aa67	database: introduce make_query_class_config() And use it to obtain any query-class specific configuration that was obtained from `table::config` before, such as the read concurrency semaphore and the max memory limit for unlimited queries. As all users of these items get these from the query class config now, we can remove them from `table::config`.	2020-05-28 11:34:35 +03:00
Botond Dénes	cc5137ffe3	table: require a valid permit to be passed to most read methods Now that the most prevalent users (range scan and single partition reads) all pass valid permits we require all users to do so and propagate the permit down towards `make_sstable_reader()`. The plan is to use this permit for restricting the sstable readers, instead of the semaphore the table is configured with. The various `make_streaming_*reader()` overloads keep using the internal semaphores as before, but they also create the permit before the read starts and pass it to `make_sstable_reader()`.	2020-05-28 11:34:35 +03:00
Nadav Har'El	c3da9f2bd4	alternator: add mandatory configurable write isolation mode Alternator supports four ways in which write operations can use quorum writes or LWT or both, which we called "write isolation policies". Until this patch, Alternator defaulted to the most generally safe policy, "always_use_lwt". This default could have been overridden for each table separately, but there was no way to change this default for all tables. This patch adds a "--alternator-write-isolation" configuration option which allows changing the default. Moreover, @dorlaor asked that users must explicitly choose this default mode, and not get "always_use_lwt" without noticing. The previous default, "always_use_lwt", supports any workload correctly but because it uses LWT for all writes it may be disappointingly slow for users who run write-only workloads (including most benchmarks) - such users might find the slow writes so disappointing that they will drop Scylla. Conversely, a default of "forbid_rmw" will be faster and still correct, but will fail on workloads which need read-modify-write operations - and surprise users that need these operations. So Dor asked that none of the write modes be made the default, and users must make an informed choice between the different write modes, rather than being disappointed by a default choice they weren't aware of. So after this patch, Scylla refuses to boot if Alternator is enabled but a "--alternator-write-isolation" option is missing. The patch also modifies the relevant documentation, adds the same option to our docker image, and modifies the test-running script test/alternator/run to run Scylla with the old default mode (always_use_lwt), which we need because we want to test RMW operations as well. Fixes #6452 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20200524160338.108417-1-nyh@scylladb.com>	2020-05-27 08:40:05 +03:00
Tomasz Grabiec	1424543e11	Merge "Move sstables_format on sstable_manager" from Pavel Emelyanov The format is currently sitting in storage_service, but the previous set patched all the users not to call it, instead they use sstables_manager to get the highest supported format. So this set finalizes this effort and places the format on sstables_manager(s). The set introduces the db::sstables_format_selector, that - starts with the lowest format (ka) - reads one on start from system tables - subscribes on sstables-related features and bumps up the selection if the respective feature is enabled During its lifetime the selector holds a reference to the sharded<database> and updates the format on it, the database, in turn, propagates it further to sstables_managers. The managers start with the highest known format (mc) which is done for tests. * https://github.com/xemul/scylla br-move-sstables-format-4: storage_service: Get rid of one-line helpers system_keyspace: Cleanup setup() from storage_service format_selector: Log which format is being selected sstables_manager: Keep format on format_selector: Make it standalone format_selector: Move the code into db/ format_selector: Select format locally storage_service: Introduce format_selector storage_service: Split feature_enabled_listener::on_enabled storage_service: Tossing bits around features: Introduce and use masked features features: Get rid of per-features booleans	2020-05-27 08:40:05 +03:00
Avi Kivity	8d27e1b4a9	Merge 'Propagate tracing to materialized view update path' from Piotr S In order to improve materialized views' debuggability, tracing points are added to view update generation path. Example trace: ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+----------- Execute CQL3 query \| 2020-04-27 13:13:46.834000 \| 127.0.0.1 \| 0 \| 127.0.0.1 Parsing a statement [shard 0] \| 2020-04-27 13:13:46.834346 \| 127.0.0.1 \| 1 \| 127.0.0.1 Processing a statement [shard 0] \| 2020-04-27 13:13:46.834426 \| 127.0.0.1 \| 80 \| 127.0.0.1 Creating write handler for token: -3248873570005575792 natural: {127.0.0.1, 127.0.0.3} pending: {} [shard 0] \| 2020-04-27 13:13:46.834494 \| 127.0.0.1 \| 148 \| 127.0.0.1 Creating write handler with live: {127.0.0.3, 127.0.0.1} dead: {} [shard 0] \| 2020-04-27 13:13:46.834507 \| 127.0.0.1 \| 161 \| 127.0.0.1 Sending a mutation to /127.0.0.3 [shard 0] \| 2020-04-27 13:13:46.834519 \| 127.0.0.1 \| 173 \| 127.0.0.1 Executing a mutation locally [shard 0] \| 2020-04-27 13:13:46.834532 \| 127.0.0.1 \| 186 \| 127.0.0.1 View updates for ks.t require read-before-write - base table reader is created [shard 0] \| 2020-04-27 13:13:46.834570 \| 127.0.0.1 \| 224 \| 127.0.0.1 Reading key {{-3248873570005575792, pk{000400000002}}} from sstable /home/sarna/.ccm/scylla-1/node1/data/ks/t-162ef290887811eaa4bf000000000000/mc-1-big-Data.db [shard 0] \| 2020-04-27 13:13:46.834608 \| 127.0.0.1 \| 262 \| 127.0.0.1 /home/sarna/.ccm/scylla-1/node1/data/ks/t-162ef290887811eaa4bf000000000000/mc-1-big-Index.db: scheduling bulk DMA read of size 8 at offset 0 [shard 0] \| 2020-04-27 13:13:46.834635 \| 127.0.0.1 \| 289 \| 127.0.0.1 /home/sarna/.ccm/scylla-1/node1/data/ks/t-162ef290887811eaa4bf000000000000/mc-1-big-Index.db: finished bulk DMA read of size 8 at offset 0, 
successfully read 8 bytes [shard 0] \| 2020-04-27 13:13:46.834975 \| 127.0.0.1 \| 629 \| 127.0.0.1 Message received from /127.0.0.1 [shard 0] \| 2020-04-27 13:13:46.834988 \| 127.0.0.3 \| 11 \| 127.0.0.1 /home/sarna/.ccm/scylla-1/node1/data/ks/t-162ef290887811eaa4bf000000000000/mc-1-big-Data.db: scheduling bulk DMA read of size 41 at offset 0 [shard 0] \| 2020-04-27 13:13:46.835015 \| 127.0.0.1 \| 669 \| 127.0.0.1 View updates for ks.t require read-before-write - base table reader is created [shard 0] \| 2020-04-27 13:13:46.835020 \| 127.0.0.3 \| 44 \| 127.0.0.1 Generated 1 view update mutations [shard 0] \| 2020-04-27 13:13:46.835080 \| 127.0.0.3 \| 104 \| 127.0.0.1 Sending view update for ks.t_v2_idx_index to 127.0.0.2, with pending endpoints = {}; base token = -3248873570005575792; view token = 3728482343045213994 [shard 0] \| 2020-04-27 13:13:46.835095 \| 127.0.0.3 \| 119 \| 127.0.0.1 Sending a mutation to /127.0.0.2 [shard 0] \| 2020-04-27 13:13:46.835105 \| 127.0.0.3 \| 129 \| 127.0.0.1 View updates for ks.t were generated and propagated [shard 0] \| 2020-04-27 13:13:46.835117 \| 127.0.0.3 \| 141 \| 127.0.0.1 /home/sarna/.ccm/scylla-1/node1/data/ks/t-162ef290887811eaa4bf000000000000/mc-1-big-Data.db: finished bulk DMA read of size 41 at offset 0, successfully read 41 bytes [shard 0] \| 2020-04-27 13:13:46.835160 \| 127.0.0.1 \| 813 \| 127.0.0.1 Sending mutation_done to /127.0.0.1 [shard 0] \| 2020-04-27 13:13:46.835164 \| 127.0.0.3 \| 188 \| 127.0.0.1 Mutation handling is done [shard 0] \| 2020-04-27 13:13:46.835177 \| 127.0.0.3 \| 201 \| 127.0.0.1 Generated 1 view update mutations [shard 0] \| 2020-04-27 13:13:46.835215 \| 127.0.0.1 \| 869 \| 127.0.0.1 Locally applying view update for ks.t_v2_idx_index; base token = -3248873570005575792; view token = 3728482343045213994 [shard 0] \| 2020-04-27 13:13:46.835226 \| 127.0.0.1 \| 880 \| 127.0.0.1 Successfully applied local view update for 127.0.0.1 and 0 remote endpoints [shard 0] \| 2020-04-27 13:13:46.835253 \| 
127.0.0.1 \| 907 \| 127.0.0.1 View updates for ks.t were generated and propagated [shard 0] \| 2020-04-27 13:13:46.835256 \| 127.0.0.1 \| 910 \| 127.0.0.1 Got a response from /127.0.0.1 [shard 0] \| 2020-04-27 13:13:46.835274 \| 127.0.0.1 \| 928 \| 127.0.0.1 Delay decision due to throttling: do not delay, resuming now [shard 0] \| 2020-04-27 13:13:46.835276 \| 127.0.0.1 \| 930 \| 127.0.0.1 Mutation successfully completed [shard 0] \| 2020-04-27 13:13:46.835279 \| 127.0.0.1 \| 933 \| 127.0.0.1 Done processing - preparing a result [shard 0] \| 2020-04-27 13:13:46.835286 \| 127.0.0.1 \| 941 \| 127.0.0.1 Message received from /127.0.0.3 [shard 0] \| 2020-04-27 13:13:46.835331 \| 127.0.0.2 \| 14 \| 127.0.0.1 Sending mutation_done to /127.0.0.3 [shard 0] \| 2020-04-27 13:13:46.835399 \| 127.0.0.2 \| 82 \| 127.0.0.1 Mutation handling is done [shard 0] \| 2020-04-27 13:13:46.835413 \| 127.0.0.2 \| 96 \| 127.0.0.1 Got a response from /127.0.0.2 [shard 0] \| 2020-04-27 13:13:46.835639 \| 127.0.0.3 \| 662 \| 127.0.0.1 Delay decision due to throttling: do not delay, resuming now [shard 0] \| 2020-04-27 13:13:46.835640 \| 127.0.0.3 \| 664 \| 127.0.0.1 Successfully applied view update for 127.0.0.2 and 1 remote endpoints [shard 0] \| 2020-04-27 13:13:46.835649 \| 127.0.0.3 \| 673 \| 127.0.0.1 Got a response from /127.0.0.3 [shard 0] \| 2020-04-27 13:13:46.835841 \| 127.0.0.1 \| 1495 \| 127.0.0.1 Request complete \| 2020-04-27 13:13:46.834944 \| 127.0.0.1 \| 944 \| 127.0.0.1 ``` Fixes #6175 Tests: unit(dev), manual * psarna-propagate_tracing_to_more_write_paths: db,view: add tracing to view update generation path treewide: propagate trace state to write path	2020-05-27 08:40:05 +03:00
Pavel Emelyanov	ccdee822e1	storage_service: Get rid of one-line helpers Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-05-25 14:17:31 +03:00
Pavel Emelyanov	3c2066bd78	system_keyspace: Cleanup setup() from storage_service Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-05-25 14:17:31 +03:00
Pavel Emelyanov	0598b3a858	format_selector: Log which format is being selected Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-05-25 14:17:31 +03:00
Pavel Emelyanov	89a1b09214	sstables_manager: Keep format on Make the database be the format_selector target, so when the format is selected it's set on the database, which in turn just forwards the selection into sstables managers. All users of the format are already patched to read it from those managers. The initial value for the format is the highest, which is needed by tests. When scylla starts the format is updated by format_selector, first after reading from system tables, then by selecting it from features. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-05-25 14:17:28 +03:00
Pavel Emelyanov	a61f18ed64	format_selector: Make it standalone Remove the selector from storage_service and introduce an instance in main.cc that starts soon after the gossiper and feature_service, starts listening for features and sets the selected format on storage_service. This change includes - Removal of for_testing bit from format_selector constructor, now tests just do not use it - Adding a gate to selection routine to make sure on exit all the selection stuff is done. Although before the cluster join the selector waits for the feature listeners to finish (the .sync() method) this gate is still required to handle aborted start cases and wait for gossiper announcement from selector to complete. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-05-25 14:15:04 +03:00
Pavel Emelyanov	1692d94c9a	format_selector: Move the code into db/ This is just move, no changes in code logic. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-05-25 14:09:24 +03:00
Avi Kivity	4d15aba7c0	commitlog: capture "this" explicitly in lambda C++20 deprecates capturing this in default-copy lambdas ([=]), with good reason. Move to explicit captures to avoid any ambiguity and reduce warning spew. Message-Id: <20200517150834.753463-1-avi@scylladb.com>	2020-05-19 08:14:32 +03:00
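The difference between the deprecated implicit capture and the explicit forms can be shown with a small standalone example. This is a sketch, not code from commitlog: the `counter` type and its member functions are hypothetical.

```cpp
#include <cassert>
#include <functional>

// Hypothetical type used only to illustrate lambda captures of `this`.
struct counter {
    int value = 0;

    // Deprecated in C++20: [=] looks like a by-value capture, but `value`
    // is really accessed through an implicitly captured `this` pointer.
    //   auto f = [=] { return value + 1; };

    // Explicit pointer capture: the lambda observes later changes to the
    // object and must not outlive it.
    std::function<int()> next_by_pointer() {
        return [this] { return value + 1; };
    }

    // Explicit copy capture (C++17): the lambda owns a snapshot of the object.
    std::function<int()> next_by_copy() {
        return [*this] { return value + 1; };
    }
};
```

With `counter c;`, taking both lambdas and then setting `c.value = 10` makes the `[this]` lambda return 11, while the `[*this]` lambda still returns 1 from its snapshot. Spelling the capture out makes that choice visible at the call site.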
Piotr Sarna	18a37d0cb1	db,view: add tracing to view update generation path In order to improve materialized views' debuggability, tracing points are added to view update generation path. Sample info of an insert statement which resulted in producing local view updates which require read-before-write: activity \| timestamp \| source \| source_elapsed \| client ------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+----------- Execute CQL3 query \| 2020-04-19 12:02:48.420000 \| 127.0.0.1 \| 0 \| 127.0.0.1 Parsing a statement [shard 0] \| 2020-04-19 12:02:48.420674 \| 127.0.0.1 \| -- \| 127.0.0.1 Processing a statement [shard 0] \| 2020-04-19 12:02:48.420753 \| 127.0.0.1 \| 79 \| 127.0.0.1 Creating write handler for token: -6715243485458697746 natural: {127.0.0.1} pending: {} [shard 0] \| 2020-04-19 12:02:48.420815 \| 127.0.0.1 \| 141 \| 127.0.0.1 Creating write handler with live: {127.0.0.1} dead: {} [shard 0] \| 2020-04-19 12:02:48.420824 \| 127.0.0.1 \| 149 \| 127.0.0.1 Executing a mutation locally [shard 0] \| 2020-04-19 12:02:48.420830 \| 127.0.0.1 \| 155 \| 127.0.0.1 View updates for ks.t1 require read-before-write - base table reader is created [shard 0] \| 2020-04-19 12:02:48.420862 \| 127.0.0.1 \| 188 \| 127.0.0.1 Generated 2 view update mutations [shard 0] \| 2020-04-19 12:02:48.420910 \| 127.0.0.1 \| 235 \| 127.0.0.1 Locally applying view update for ks.t1_v_idx_index; base token = -6715243485458697746; view token = -4156302194539278891 [shard 0] \| 2020-04-19 12:02:48.420918 \| 127.0.0.1 \| 243 \| 127.0.0.1 Successfully applied local view update for 127.0.0.1 and 0 remote endpoints [shard 0] \| 2020-04-19 12:02:48.420971 \| 127.0.0.1 \| 297 \| 127.0.0.1 View updates for ks.t1 were generated and propagated [shard 0] \| 2020-04-19 12:02:48.420973 \| 127.0.0.1 \| 299 \| 127.0.0.1 Got a response from /127.0.0.1 [shard 0] \| 
2020-04-19 12:02:48.420988 \| 127.0.0.1 \| 314 \| 127.0.0.1 Delay decision due to throttling: do not delay, resuming now [shard 0] \| 2020-04-19 12:02:48.420990 \| 127.0.0.1 \| 315 \| 127.0.0.1 Mutation successfully completed [shard 0] \| 2020-04-19 12:02:48.420994 \| 127.0.0.1 \| 320 \| 127.0.0.1 Done processing - preparing a result [shard 0] \| 2020-04-19 12:02:48.421000 \| 127.0.0.1 \| 326 \| 127.0.0.1 Request complete \| 2020-04-19 12:02:48.420330 \| 127.0.0.1 \| 330 \| 127.0.0.1 Sample info for remote updates: activity \| timestamp \| source \| source_elapsed \| client --------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+----------- Execute CQL3 query \| 2020-04-26 16:19:47.691000 \| 127.0.0.1 \| 0 \| 127.0.0.1 Parsing a statement [shard 1] \| 2020-04-26 16:19:47.691590 \| 127.0.0.1 \| 6 \| 127.0.0.1 Processing a statement [shard 1] \| 2020-04-26 16:19:47.692368 \| 127.0.0.1 \| 783 \| 127.0.0.1 Creating write handler for token: -3248873570005575792 natural: {127.0.0.3, 127.0.0.2} pending: {} [shard 1] \| 2020-04-26 16:19:47.694186 \| 127.0.0.1 \| 2598 \| 127.0.0.1 Creating write handler with live: {127.0.0.2, 127.0.0.3} dead: {} [shard 1] \| 2020-04-26 16:19:47.694283 \| 127.0.0.1 \| 2699 \| 127.0.0.1 Sending a mutation to /127.0.0.2 [shard 1] \| 2020-04-26 16:19:47.694591 \| 127.0.0.1 \| 3006 \| 127.0.0.1 Sending a mutation to /127.0.0.3 [shard 1] \| 2020-04-26 16:19:47.694862 \| 127.0.0.1 \| 3277 \| 127.0.0.1 Message received from /127.0.0.1 [shard 1] \| 2020-04-26 16:19:47.696358 \| 127.0.0.3 \| 40 \| 127.0.0.1 Message received from /127.0.0.1 [shard 1] \| 2020-04-26 16:19:47.696442 \| 127.0.0.2 \| 32 \| 127.0.0.1 View updates for ks.t require read-before-write - base table reader is created [shard 1] \| 2020-04-26 16:19:47.697762 \| 127.0.0.3 \| 1444 \| 127.0.0.1 View updates for ks.t 
require read-before-write - base table reader is created [shard 1] \| 2020-04-26 16:19:47.698120 \| 127.0.0.2 \| 1710 \| 127.0.0.1 Generated 1 view update mutations [shard 1] \| 2020-04-26 16:19:47.699107 \| 127.0.0.3 \| 2789 \| 127.0.0.1 Sending view update for ks.t_v2_idx_index to 127.0.0.4, with pending endpoints = {}; base token = -3248873570005575792; view token = 1634052884888577606 [shard 1] \| 2020-04-26 16:19:47.699345 \| 127.0.0.3 \| 3027 \| 127.0.0.1 Sending a mutation to /127.0.0.4 [shard 1] \| 2020-04-26 16:19:47.699614 \| 127.0.0.3 \| 3296 \| 127.0.0.1 Generated 1 view update mutations [shard 1] \| 2020-04-26 16:19:47.699824 \| 127.0.0.2 \| 3414 \| 127.0.0.1 Locally applying view update for ks.t_v2_idx_index; base token = -3248873570005575792; view token = 1634052884888577606 [shard 1] \| 2020-04-26 16:19:47.700012 \| 127.0.0.2 \| 3603 \| 127.0.0.1 View updates for ks.t were generated and propagated [shard 1] \| 2020-04-26 16:19:47.700059 \| 127.0.0.3 \| 3741 \| 127.0.0.1 Message received from /127.0.0.3 [shard 1] \| 2020-04-26 16:19:47.700958 \| 127.0.0.4 \| 37 \| 127.0.0.1 Successfully applied local view update for 127.0.0.2 and 0 remote endpoints [shard 1] \| 2020-04-26 16:19:47.701522 \| 127.0.0.2 \| 5112 \| 127.0.0.1 View updates for ks.t were generated and propagated [shard 1] \| 2020-04-26 16:19:47.701615 \| 127.0.0.2 \| 5206 \| 127.0.0.1 Sending mutation_done to /127.0.0.1 [shard 1] \| 2020-04-26 16:19:47.701913 \| 127.0.0.3 \| 5595 \| 127.0.0.1 Mutation handling is done [shard 1] \| 2020-04-26 16:19:47.702489 \| 127.0.0.3 \| 6171 \| 127.0.0.1 Got a response from /127.0.0.3 [shard 1] \| 2020-04-26 16:19:47.702667 \| 127.0.0.1 \| 11082 \| 127.0.0.1 Delay decision due to throttling: do not delay, resuming now [shard 1] \| 2020-04-26 16:19:47.702689 \| 127.0.0.1 \| 11105 \| 127.0.0.1 Mutation successfully completed [shard 1] \| 2020-04-26 16:19:47.702784 \| 127.0.0.1 \| 11200 \| 127.0.0.1 Sending mutation_done to /127.0.0.1 [shard 1] \| 
2020-04-26 16:19:47.703016 \| 127.0.0.2 \| 6606 \| 127.0.0.1 Done processing - preparing a result [shard 1] \| 2020-04-26 16:19:47.703054 \| 127.0.0.1 \| 11470 \| 127.0.0.1 Sending mutation_done to /127.0.0.3 [shard 1] \| 2020-04-26 16:19:47.703720 \| 127.0.0.4 \| 2800 \| 127.0.0.1 Mutation handling is done [shard 1] \| 2020-04-26 16:19:47.704527 \| 127.0.0.4 \| 3607 \| 127.0.0.1 Got a response from /127.0.0.4 [shard 1] \| 2020-04-26 16:19:47.704580 \| 127.0.0.3 \| 8262 \| 127.0.0.1 Delay decision due to throttling: do not delay, resuming now [shard 1] \| 2020-04-26 16:19:47.704606 \| 127.0.0.3 \| 8288 \| 127.0.0.1 Successfully applied view update for 127.0.0.4 and 1 remote endpoints [shard 1] \| 2020-04-26 16:19:47.704853 \| 127.0.0.3 \| 8535 \| 127.0.0.1 Mutation handling is done [shard 1] \| 2020-04-26 16:19:47.706092 \| 127.0.0.2 \| 9682 \| 127.0.0.1 Got a response from /127.0.0.2 [shard 1] \| 2020-04-26 16:19:47.709933 \| 127.0.0.1 \| 18348 \| 127.0.0.1 Request complete \| 2020-04-26 16:19:47.702582 \| 127.0.0.1 \| 11582 \| 127.0.0.1 Tests: unit(dev, debug)	2020-05-18 16:05:23 +02:00
Piotr Sarna	92aadb94e5	treewide: propagate trace state to write path In order to add tracing to places where it can be useful, e.g. materialized view updates and hinted handoff, tracing state is propagated to all applicable call sites.	2020-05-18 16:05:23 +02:00
Benny Halevy	a96087165a	hints: get_device_id: use seastar file_stat This avoids potential use-after-move, since undefined c++ sequencing order may std::move(f) in the lambda capture before evaluating f.stat(). Also, this makes use of a more generic library function that doesn't require to open and hold on to the file in the application. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20200514152054.162168-1-bhalevy@scylladb.com>	2020-05-15 10:11:45 +02:00
Tomasz Grabiec	df4b698309	Merge "Add more defenses against empty keys" from Botond In theory we shouldn't have empty keys in the database, as we validate all keys that enter the database via CQL with `validation::validate_cql_keys()`, which will reject empty keys. In this context, empty means a single-component key, with its only component being empty. Yet recently we've seen empty keys appear in a cluster and wreak havoc on it, as they will cause the memtable flush to fail due to the sstable summary rejecting the empty key. This will cause an infinite loop, where Scylla keeps retrying to flush the memtable and failing. The intermediate consequence of this is that the node cannot be shut down gracefully. The indirect consequence is possible data loss, as commitlog files cannot be replayed as they just re-insert the empty key into the memtable and the infinite flush retry circle starts all over again. A workaround is to move problematic commitlog files away, allowing the node to start up. This can however lead to data loss, if multiple replicas had to move away commitlogs that contain the same data. To prevent the node getting into an unusable state and subsequent data loss, extend the existing defenses against invalid (empty) keys to the commitlog replay, which will now ignore them during replay. Fixes: #6106 * denesb/empty-keys/v5: commitlog_replayer: ignore entries with invalid keys test: lib/sstable_utils: add make_keys_for_shard validation: add is_cql_key_invalid() validation: validate_cql_key(): make key parameter a `partition_key_view` partition_key_view: add validate method	2020-05-12 20:36:40 +02:00
Piotr Dulikowski	0c5ac0da98	hinted handoff: remove discarded hint positions from rps_set Related commit: `85d5c3d` When attempting to send a hint, an exception might occur that results in that hint being discarded (e.g. keyspace or table of the hint was removed). When such an exception is thrown, position of the hint will already be stored in rps_set. We are only allowed to retain positions of hints that failed to be sent and needed to be retried later. Dropping a hint is not an error, therefore its position should be removed from rps_set - but current logic does not do that. Because of that bug, hint files with many discardable hints might cause rps_set to grow large when the file is replayed. Furthermore, leaving positions of such hints in rps_set might cause more hints than necessary to be re-sent if some non-discarded hints fail to be sent. This commit fixes the problem by removing positions of discarded hints from rps_set. Fixes #6433	2020-05-12 15:13:59 +02:00
Botond Dénes	6083ed668b	commitlog_replayer: ignore entries with invalid keys When replaying the commitlog, pass keys to `validation::validate_cql_key()`. Discard entries which fail validation and warn about it in the logs. This prevents invalid keys from getting into the system, possibly failing the commitlog replay and the successful boot of the node, preventing the node from recovering data.	2020-05-12 12:07:21 +03:00
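The replay-time filtering described in this commit can be sketched roughly as follows. This is a minimal illustration, not Scylla's replayer: `entry`, `is_key_invalid`, `replay_result`, and `replay` are hypothetical names, and "invalid" is reduced here to "empty string", whereas the real check goes through `validation::validate_cql_key()`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified commitlog entry: a key plus an opaque payload.
struct entry {
    std::string key;
    std::string value;
};

// Assumption: a key is invalid when it is empty. The real validation is richer.
bool is_key_invalid(const std::string& key) {
    return key.empty();
}

struct replay_result {
    std::vector<entry> applied;  // entries that passed validation
    std::size_t skipped = 0;     // entries dropped because of an invalid key
};

// Replay every entry, skipping (and counting) the ones with invalid keys
// instead of letting them reach the memtable.
replay_result replay(const std::vector<entry>& log) {
    replay_result res;
    for (const auto& e : log) {
        if (is_key_invalid(e.key)) {
            ++res.skipped;  // a real replayer would also log a warning here
            continue;
        }
        res.applied.push_back(e);
    }
    return res;
}
```

Counting the skipped entries rather than throwing keeps the replay going, which matches the stated goal of letting the node boot and recover the remaining data.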
Piotr Dulikowski	85d5c3d5ee	hinted handoff: don't keep positions of old hints in rps_set When sending hints from one file, rps_set field in send_one_file_ctx keeps track of commitlog positions of hints that are being currently sent, or have failed to be sent. At the end of the operation, if sending of some hints failed, we will choose position of the earliest hint that failed to be sent, and will retry sending that file later, starting from that position. This position is stored in _last_not_complete_rp. Usually, this set has a bounded size, because we impose a limit of at most 128 hints being sent concurrently. Because we do not attempt to send any more hints after a failure is detected, rps_set should not have more than 128 elements at a time. Due to a bug, commitlog positions of old hints (older than gc_grace_seconds of the destination table) were inserted into rps_set but not removed after checking their age. This could cause rps_set to grow very large when replaying a file with old hints. Moreover, if the file mixed expired and non-expired hints (which could happen if it had hints to two tables with different gc_grace_seconds), and sending of some non-expired hints failed, then positions of expired hints could influence the calculation of _last_not_complete_rp, and more hints than necessary would be resent on the next retry. This simple patch removes commitlog position of a hint from rps_set when it is detected to be too old. Fixes #6422	2020-05-11 11:33:31 +02:00
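The bookkeeping this commit describes can be sketched as below. It is a simplified model under stated assumptions: `replay_pos` is a plain integer rather than `db::replay_position`, hints are processed one at a time instead of concurrently, and `outcome`, `send_ctx`, `on_hint`, and `retry_from` are hypothetical names chosen for illustration.

```cpp
#include <cstdint>
#include <optional>
#include <set>

// Assumption: a replay position is modelled as a plain integer offset.
using replay_pos = std::uint64_t;

// Hypothetical outcomes of handling a single hint.
enum class outcome { sent, expired, failed };

struct send_ctx {
    // Positions of hints that are in flight or have failed to be sent.
    std::set<replay_pos> rps_set;

    void on_hint(replay_pos rp, outcome o) {
        rps_set.insert(rp);
        // Only failed hints may stay in the set. Sent hints are done, and
        // expired hints are dropped rather than retried, so both are erased.
        if (o != outcome::failed) {
            rps_set.erase(rp);
        }
    }

    // The lowest remaining position is where the next retry starts.
    std::optional<replay_pos> retry_from() const {
        if (rps_set.empty()) {
            return std::nullopt;
        }
        return *rps_set.begin();
    }
};
```

If the expired hint at a low position were left in the set, `retry_from()` would return that position and every later hint would be resent needlessly, which is the behaviour the patch removes.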
Asias He	71d0d58f8c	Revert "config: Do not enable repair based node operations by default" This reverts commit `b8ac10c451`. The repair based node operations will be enabled by default in 4.1. Revert the patch which disables it by default.	2020-05-07 13:17:35 +03:00
Nadav Har'El	0214f0ad60	main: really enable the "--start-native-transport" option In commit `da3bf20e71` we supposedly enabled support for Cassandra's "start_native_transport" option which can be set to 0 to run Scylla without listening on the CQL port. This can be useful, for example, if a user only wants the DynamoDB or Redis APIs but not CQL. Unfortunately, the option was still marked "Unused", so it wasn't really enabled as a valid command line option. This patch fixes that, and documents the start_native_transport option in docs/protocols.md, where we document the different protocols, ports, and options to configure them. Fixes #6387. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20200506174850.13616-1-nyh@scylladb.com>	2020-05-07 11:09:18 +03:00
Piotr Sarna	f48e414eab	db, view: remove duplicate entries from pending endpoints When generating view updates, an endpoint can appear both as a primary paired endpoint for the view update, and as a pending endpoint (due to range movements). In order not to generate the same update twice for the same endpoint, the paired endpoint is removed from the list of pending endpoints if present. Fixes #5459 Tests: unit(dev), dtest(TestMaterializedViews.add_dc_during_mv_insert_test)	2020-05-06 16:42:56 +03:00
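The deduplication step in this commit amounts to removing one element from a list. A minimal sketch, assuming endpoints are represented as strings (the real code uses its own address type) and using the hypothetical helper name `without_paired`:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Assumption: an endpoint is modelled as a string such as "127.0.0.4".
using endpoint = std::string;

// Return the pending endpoints with the paired endpoint removed, so the
// paired endpoint does not receive the same view update twice.
std::vector<endpoint> without_paired(std::vector<endpoint> pending,
                                     const endpoint& paired) {
    pending.erase(std::remove(pending.begin(), pending.end(), paired),
                  pending.end());
    return pending;
}
```

When the paired endpoint is not in the pending list, the erase-remove call is a no-op and the list is returned unchanged.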
Piotr Sarna	bf5f247bc5	db: set gc grace period to 0 for local system tables Local system tables from `system` namespace use LocalStrategy replication, so they do not need to be concerned about gc grace period. Some system tables already set gc grace period to 0, but other ones, including system.large_partitions, did not. That may result in millions of tombstones being needlessly kept for these tables, which can cause read timeouts. Fixes #6325 Tests: unit(dev), local(running cqlsh and playing with system tables)	2020-05-03 17:41:50 +03:00
Asias He	b8ac10c451	config: Do not enable repair based node operations by default Give it some more time to mature. Use the old stream plan based node operations by default. Fixes: #6305 Backports: 4.0	2020-04-30 12:37:24 +03:00
Tomasz Grabiec	c59ec8d97f	Merge "Avoid some memory copies in lwt" from Gleb * seastar-dev.git gleb/lwt-shared-proposal: lwt: pass paxos::proposal as a shared pointer everywhere lwt: do not copy proposal in paxos_state::accept lwt: make load_paxos_state to take partition_key_view instead of a deference	2020-04-22 13:43:03 +02:00
Gleb Natapov	97af6bb0bd	lwt: make load_paxos_state to take partition_key_view instead of a deference Some callers have a partition_key_view but not a partition_key, so they need to create a temporary copy just to pass a reference. Change it to accept a view.	2020-04-22 13:51:43 +03:00
Calle Wilund	525b283326	commitlog::read_log_file: Preserve subscription across reading Fixes #6265 Return type for read_log_file was previously changed from subscription to future<>, returning the previously returned subscriptions result of done(). But it did not preserve the subscription itself, which in turn will cause us to (in work::stream), call back into a deleted object. Message-Id: <20200422090856.5218-1-calle@scylladb.com>	2020-04-22 12:12:11 +03:00
Glauber Costa	1f9c37fb5e	view_updating_consumer: move reference to a pointer It is currently not possible to wrap the view_updating_consumer in an std::optional. I intend to do it to allow for compactions to optionally generate view updates. The reason for that is that view_updating_consumer has a reference as a member, which prevents the move assignment operator from being implicitly generated. This patch fixes it by keeping a pointer instead of a reference. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20200421123648.8328-1-glauber@scylladb.com>	2020-04-22 10:05:35 +03:00
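The language rule behind this change can be shown with two toy types. This is a sketch, not the actual `view_updating_consumer`: `table_stub`, `consumer_with_ref`, `consumer_with_ptr`, and `rebind` are hypothetical names introduced only for illustration.

```cpp
#include <optional>
#include <type_traits>

// Hypothetical stand-in for whatever object the consumer refers to.
struct table_stub {
    int id;
};

// A reference member makes the implicit assignment operators deleted.
struct consumer_with_ref {
    table_stub& t;
};

// A pointer member keeps the type fully assignable.
struct consumer_with_ptr {
    table_stub* t;
};

static_assert(!std::is_move_assignable_v<consumer_with_ref>);
static_assert(std::is_move_assignable_v<consumer_with_ptr>);

// Assigning into an engaged std::optional needs T to be move assignable,
// so this only compiles for the pointer-based variant.
void rebind(std::optional<consumer_with_ptr>& slot, table_stub& target) {
    slot = consumer_with_ptr{&target};
}
```

The trade-off is that a pointer can be null, so code holding the pointer-based variant has to guarantee that it is always set before use.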
Piotr Sarna	03f41b9d96	db: remove trailing whitespace Found when backporting a patch to 3.3. Message-Id: <fa406597deaacff56dbba99fa167715b041bbb52.1587375123.git.sarna@scylladb.com>	2020-04-20 12:58:55 +02:00
Tomasz Grabiec	e648e314e5	Merge "Drop only learnt value on PRUNE" from Gleb It is unsafe to remove the entire row, so only drop the learnt value from the system.paxos table. Fixes: #6154	2020-04-20 12:06:04 +02:00
Gleb Natapov	73391420fb	lwt: drop only most recently learnt value during prune. It turned out we cannot drop the information about the most recent commit entirely, since it is used to cut off already outdated accepted values. Otherwise the following scenario can happen: 1. cas1 prepares on A, B, C, gets one accept from A 2. cas2 prepares on B, C, gets 2 accepts on B and C, learns on B, C 3. cas3 initiates a prepare on A, learns about cas1's accept, 4. cas2 learns on A, prunes on A, B, C Now cas3 will replay cas1's value because it does not know that it is less than the already committed one (removed during step 4). The patch drops only the committed value and keeps the information about the latest committed ballot. Fixed #6154	2020-04-19 17:12:15 +03:00
Gleb Natapov	d3d31d66d4	lwt: treated accepted ballot as a promised PAXOS node is allowed to accept a proposal without promising it first as long as its ballot is greater than already promised one. Treat such accepted ballot as promised since 'learn' stage removes accepted ballot, but we still want to remember it as the latest promised one. The goal is to be closer to formal PAXOS specification.	2020-04-19 17:12:03 +03:00
Piotr Sarna	9c15604659	treewide: deprecate passing explicit order in schema building In order to avoid confusion with regard to whose responsibility it is to sort the key columns (see #5856), the interface which allows adding columns to the builder with explicit column id is moved to a private function. An internal with_column_ordered() overload is maintained to be used for internal operations, but it's encouraged to use simpler with_column() in new code. Fixes #6235 Tests: unit(dev)	2020-04-19 16:19:17 +03:00

1712 Commits