scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-03 13:37:04 +00:00

Author	SHA1	Message	Date
Calle Wilund	bf0a91b566	commitlog: Flush all segments if we only have one. Handle test cases with borked config so we don't deadlock in cases where we only have one segment in a commitlog	2021-05-25 12:43:12 +00:00
Calle Wilund	8ce836209b	commitlog: Always force flush if segment allocation is waiting Refs #8270 If segment allocation is blocked, we should bypass all thresholds and issue a flush of as much as possible.	2021-05-25 12:43:12 +00:00
Calle Wilund	e34ed30178	commitlog: Include segment wasted (slack) size in footprint check Refs #8270 Since segment allocation looks at actual disk footprint, not active, the threshold check in timer task should include slack space so we don't mistake sparse usage for space left.	2021-05-25 12:43:12 +00:00
Calle Wilund	ec40207e7f	commitlog: Adjust (lower) usage threshold Refs #8270 Try to ensure we issue a flush as soon as we are allocating in the last allowable segment, instead of "half through". This will make flushing a little more eager, but should reduce latencies created by waiting for segment delete/recycle on heavy usage.	2021-05-25 12:43:12 +00:00
Piotr Sarna	c8653d1321	cql3: enhance the fix for index paging type check The original fix stripped the reversed type only from the base table column, but it's better to be safe than sorry, so the reverse is also stripped from the view column. Refs #8667 Message-Id: <cb5dedb0b8b6b5eea3a69863ae50a0e906482665.1621330463.git.sarna@scylladb.com>	2021-05-18 12:47:35 +03:00
Takuya ASADA	60c0b37a4c	install.sh: apply correct file security context when copying files Currently, the unified installer does not apply the correct file security context while copying files, which causes a permission error on scylla-server.service. We should apply the default file security context while copying files, using the '-Z' option of /usr/bin/install. Also, because install -Z requires a normalized path to apply the correct security context, use 'realpath -m <PATH>' on the path variables in the script. Fixes #8589 Closes #8602	2021-05-18 12:09:51 +03:00
Takuya ASADA	6faa8b97ec	install.sh: fix 'no such file or directory' on nonroot Since we have added scylla-node-exporter, we need to do 'install -d' for the systemd directory and the sysconfig directory before copying files. Fixes #8663 Closes #8664	2021-05-18 12:03:45 +03:00
Avi Kivity	593ad4de1e	Merge 'Fix type checking in index paging' from Piotr Sarna When recreating the paging state from an indexed query, a bunch of panic checks were introduced to make sure that the code is correct. However, one of the checks is too eager - namely, it throws an error if the base column type is not equal to the view column type. It usually works correctly, unless the base column type is a clustering key with DESC clustering order, in which case the type is actually "reversed". From the point of view of the paging state generation it's not important, because both types deserialize in the same way, so the check should be less strict and allow the base type to be reversed. Tests: unit(release), along with the additional test case introduced in this series; the test also passes on Cassandra Fixes #8666 Closes #8667 * github.com:scylladb/scylla: test: add a test case for paging with desc clustering order cql3: relax a type check for index paging	2021-05-18 11:34:59 +03:00
Kamil Braun	03ad111beb	tree-wide: comments on deprecated functions to access global variables Closes #8665	2021-05-18 11:31:10 +03:00
Botond Dénes	ae366868fb	multishard_mutation_query: save_reader(): avoid round-trip for destroying rparts Force its destruction when saving the reader. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210514140844.119362-1-bdenes@scylladb.com>	2021-05-18 10:07:13 +03:00
Botond Dénes	c98b0d0de8	test: cql_test_env: add trace logs to execute_cql() In tests executing tons of these, it is useful to be able to enable a trace logging of each one, to see which is the last successful one. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210514140531.118390-1-bdenes@scylladb.com>	2021-05-18 10:06:22 +03:00
Piotr Sarna	c36f432423	test: add a test case for paging with desc clustering order Issue #8666 revealed an issue with validating types for paged indexed queries - namely, the type checking mechanism is too strict in comparing types and fails on mismatched clustering order - e.g. an `int` column type is different from `int` with DESC clustering order. As a result, users see a very confusing message (because reversed types are printed as their underlying type): > Mismatched types for base and view columns c: int and int This test case fails before the fix for #8666 and thus acts as a regression test.	2021-05-17 17:06:50 +02:00
Piotr Sarna	544ef2caf3	cql3: relax a type check for index paging When recreating the paging state from an indexed query, a bunch of panic checks were introduced to make sure that the code is correct. However, one of the checks is too eager - namely, it throws an error if the base column type is not equal to the view column type. It usually works correctly, unless the base column type is a clustering key with DESC clustering order, in which case the type is actually "reversed". From the point of view of the paging state generation it's not important, because both types deserialize in the same way, so the check should be less strict and allow the base type to be reversed. Tests: unit(release), along with the additional test case introduced in this series; the test also passes on Cassandra Fixes #8666	2021-05-17 17:06:50 +02:00
Botond Dénes	dca808dd51	perf/perf_simple_query: add --enable-cache option Allowing for testing performance with/out cache. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210517045402.16153-1-bdenes@scylladb.com>	2021-05-17 14:06:18 +02:00
Raphael S. Carvalho	10ae77966c	compaction_manager: Don't swallow exception in procedure used by reshape and resharding run_custom_job() was swallowing all exceptions, which is definitely wrong because failure in a resharding or reshape would be incorrectly interpreted as success, which means upper layer will continue as if everything is ok. For example, ignoring a failure in resharding could result in a shared sstable being left unresharded, so when that sstable reaches a table, scylla would abort as shared ssts are no longer accepted in the main sstable set. Let's allow the exception to be propagated, so failure will be communicated, and resharding and reshape will be all or nothing, as originally intended. Fixes #8657. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210515015721.384667-1-raphaelsc@scylladb.com>	2021-05-17 13:57:05 +02:00
Avi Kivity	8d6e575f59	perf_fast_forward: report instructions per fragment Use a hardware counter to report instructions per fragment. Results vary from ~4k insns/f when reading sequentially to more than 1M insns/f. Instructions per fragment can be a more stable metric than frags/sec. It would probably be even more stable with a fake file implementation that works in-memory to eliminate seastar polling instruction variation. Closes #8660	2021-05-17 11:33:24 +02:00
Tomasz Grabiec	8dddfab5db	Merge 'db/virtual tables: Add infrastructure + system.status example table' from Piotr Wojtczak This is the 1st PR in series with the goal to finish the hackathon project authored by @tgrabiec, @kostja, @amnonh and @mmatczuk (improved virtual tables + function call syntax in CQL). Virtual tables created within this framework are "materialized" in memtables, so current solution is for small tables only. As an example system.status was added. It was checked that DISTINCT and reverse ORDER BY do work. This PR was created by @jul-stas and @StarostaGit Fixes #8343 This is the same as #8364, but with a compilation fix (newly added `close()` method was not implemented by the reader) Closes #8634 * github.com:scylladb/scylla: boost/tests: Add virtual_table_test for basic infrastructure boost/tests: Test memtable_filling_virtual_table as mutation_source db/system_keyspace: Add system.status virtual table db/virtual_table: Add a way to specify a range of partitions for virtual table queries. db/virtual_table: Introduce memtable_filling_virtual_table db: Add virtual tables interface db: Introduce chained_delegating_reader	2021-05-17 11:29:37 +02:00
Botond Dénes	5e39cedbe3	evictable_reader: remove _reader_created flag This flag is not really needed, because we can just attempt a resume on first use which will fail with the default constructed inactive read handle and the reader will be created via the recreate-after-evicted path. This allows the same path to be used for all reader creation cases, simplifying the logic and more importantly making further patching easier without the special case. To make the recreate path (almost) as cheap for the first reader creation as it was with the special path, `_trim_range_tombstones` and `_validate_partition_key` are only set when really needed. Tests: unit(dev) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210514141511.127735-1-bdenes@scylladb.com>	2021-05-16 14:45:46 +03:00
Botond Dénes	3b57106627	evictable_reader: remove destructor We now have close() which is expected to clean up, no need for cleanup in the destructor and consequently a destructor at all. Message-Id: <20210514112349.75867-1-bdenes@scylladb.com>	2021-05-16 12:19:41 +03:00
Benny Halevy	f4cfa530cc	perf: enable instructions_retired_counter only once per executor::run Enabling it for each run_worker call will invoke ioctl PERF_EVENT_IOC_ENABLE in parallel to other workers running and this may skew the results. Test: perf_simple_query Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210514130542.301168-1-bhalevy@scylladb.com>	2021-05-16 12:13:27 +03:00
Tomasz Grabiec	28ac8d0f2b	Merge "raft: randomized_nemesis_test framework" from Kamil We introduce `PureStateMachine`, which is the most direct translation of the mathematical definition of a state machine to C++ that I could come up with. Represented by a C++ concept, it consists of: a set of inputs (represented by the `input_t` type), outputs (`output_t` type), states (`state_t`), an initial state (`init`) and a transition function (`delta`) which given a state and an input returns a new state and an output. The rest of the testing infrastructure is going to be generic w.r.t. `PureStateMachine`. This will allow easily implementing tests using both simple and complex state machines by substituting the proper definition for this concept. Next comes `logical_timer`: it is a wrapper around `raft::logical_clock` that allows scheduling events to happen after a certain number of logical clock ticks. For example, `logical_timer::sleep(20_t)` returns a future that resolves after 20 calls to `logical_timer::tick()`. It will be used to introduce timeouts in the tests, among other things. To replicate a state machine, our Raft implementation requires it to be represented with the `raft::state_machine` interface. `impure_state_machine` is an implementation of `raft::state_machine` that wraps a `PureStateMachine`. It keeps a variable of type `state_t` representing the current state. In `apply` it deserializes the given command into `input_t`, uses the transition (`delta`) function to produce the next state and output, replaces its current state with the obtained state and returns the output (more on that below); it does so sequentially for every given command. We can think of `PureStateMachine` as the actual state machine - the business logic, and `impure_state_machine` as the ``boilerplate'' that allows the pure machine to be replicated by Raft and communicate with the external world. The interface also requires maintenance of snapshots. 
We introduce the `snapshots_t` type representing a set of snapshots known by a state machine. `impure_state_machine` keeps a reference to `snapshots_t` because it will share it with an implementation of `persistence`. Returning outputs is a bit tricky because apply is ``write-only'' - it returns `future<>`. We use the following technique: 1. Before sending a command to a Raft leader through `server::add_entry`, one must first directly contact the instance of `impure_state_machine` replicated by the leader, asking it to allocate an ``output channel''. 2. On such a request, `impure_state_machine` creates a channel (represented by a promise-future pair) and a unique ID; it stores the input side of the channel (the promise) with this ID internally and returns the ID and the output side of the channel (the future) to the requester. 3. After obtaining the ID, one serializes the ID together with the input and sends it as a command to Raft. Thus commands are (ID, machine input) pairs. 4. When `impure_state_machine` applies a command, it looks for a promise with the given ID. If it finds one, it sends the output through this channel. 5. The command sender waits for the output on the obtained future. The allocation and deallocation of channels is done using the `impure_state_machine::with_output_channel` function. The `call` function is an implementation of the above technique. Note that only the leader will attempt to send the output - other replicas won't find the ID in their internal data structure. The set of IDs and channels is not a part of the replicated state. A failure may cause the output to never arrive (or even the command to never be applied) so `call` waits for a limited time. It may also mistakenly `call` a server which is not currently the leader, but it is prepared to handle this error. We implement the `raft::rpc` interface, allowing Raft servers to communicate with other Raft servers. The implementation is mostly boilerplate. 
It assumes that there exists a method of message passing, given by a `send_message_t` function passed in the constructor. It also handles the receival of messages in the `receive` function. It defines the message type (`message_t`) that will be used by the message-passing method. The actual message passing is implemented with `network` and `delivery_queue`. The only slightly complex thing in `rpc` is the implementation of `send_snapshot` which is the only function in the `raft::rpc` interface that actually expects a response. To implement this, before sending the snapshot message we allocate a promise-future pair and assign to it a unique ID; we store the promise and the ID in a data structure. We then send the snapshot together with the ID and wait on the future. The message receival function on the other side, when it receives the snapshot message, applies the snapshot and sends back a snapshot reply message that contains the same ID. When we receive a snapshot reply message we look up the ID in the data structure and if we find a promise, we push the reply through that promise. `rpc` also keeps a reference to `snapshots_t` - it will refer to the same set of snapshots as the `impure_state_machine` on the same server. It accesses the set when it receives or sends a snapshot message. `persistence` represents the data that does not get lost between server crashes and restarts. We store a log of commands in `_stored_entries`. It is invariably ``contiguous'', meaning that the index of each entry except the first is equal to the index of the previous entry plus one at all times (i.e. after each yield). We assume that the caller provides log entries in strictly increasing index order and without gaps. Additionally to storing log entries, `persistence` can be asked to store or load a snapshot. To implement this it takes a reference to a set of snapshots (`snapshots_t&`) which it will share with `impure_state_machine` and an implementation of `rpc`. 
We ensure that the stored log either ``touches'' the stored snapshot on the right side or intersects it. In order to simulate a production environment as closely as possible, we implement a failure detector which uses heartbeats for deciding whether to convict a server as failed. We convict a server if we don't receive a heartbeat for a long enough time. Similarly to `rpc`, `failure_detector` assumes a message passing method given by a `send_heartbeat_t` function through the constructor. `failure_detector` uses the knowledge about existing servers to decide who to send heartbeats to. Updating this knowledge happens through `add_server` and `remove_server` functions. `network` is a simple priority queue of "events", where an event is a message associated with delivery time. Each message contains a source, a destination, and payload. The queue uses a logical clock to decide when to deliver messages; it delivers all messages whose associated times are smaller than the current time. The exact delivery method is unknown to `network` but passed as a `deliver_t` function in the constructor. The type of payload is generic. The fact that `network` has delivered a message does not mean the message was processed by the receiver. In fact, `network` assumes that delivery is instantaneous, while processing a message may be a long, complex computation, or even require IO. Thus, after a message is delivered, something else must ensure that it is processed by the destination server. That something in our framework is `delivery_queue`. It will be the bridge between `network` and `rpc`. While `network` is shared by all servers - it represents the ``environment'' in which the servers live - each server has its own private `delivery_queue`. When `network` delivers an RPC message it will end up inside `delivery_queue`. 
A separate fiber, `delivery_queue::receive_fiber()`, will process those messages by calling `rpc::receive` (which is a potentially long operation, thus returns a `future<>`) on the `rpc` of the destination server. `raft_server` is a package that contains `raft::server` and other facilities needed for the server to communicate with its environment: the delivery queue, the set of snapshots (shared by `impure_state_machine`, `rpc` and `persistence`) and references to the `impure_state_machine` and `rpc` instances of this server. `environment` represents a set of `raft_server`s connected by a `network`. The `network` inside is initialized with a message delivery function which notifies the destination server's failure detector on each message and if the message contains an RPC payload, pushes it into the destination's `delivery_queue`. Needs to be periodically `tick()`ed which ticks the network and underlying servers. `ticker` calls the given function as fast as the Seastar reactor allows and yields between each call. It may be provided a limit for the number of calls; it crashes the test if the limit is reached before the ticker is `abort()`ed. Finally, we add a simple test that serves as an example of using the implemented framework. We introduce `ExRegister`, an implementation of `PureStateMachine` that stores an `int32_t` and handles ``exchange'' and ``read'' inputs; an exchange replaces the state with the given value and returns the previous state, a read does not modify the state and returns the current state. In order to pass the inputs to Raft we must serialize them into commands so we implement instances of `ser::serializer` for `ExReg`'s input types. 
* kbr/randomized-nemesis-test-v5: raft: randomized_nemesis_test: basic test raft: randomized_nemesis_test: ticker raft: randomized_nemesis_test: environment raft: randomized_nemesis_test: server raft: randomized_nemesis_test: delivery queue raft: randomized_nemesis_test: network raft: randomized_nemesis_test: heartbeat-based failure detector raft: randomized_nemesis_test: memory backed persistence raft: randomized_nemesis_test: rpc raft: randomized_nemesis_test: impure_state_machine raft: randomized_nemesis_test: introduce logical_timer raft: randomized_nemesis_test: `PureStateMachine` concept	2021-05-14 17:33:40 +02:00
Tomasz Grabiec	0fdd2f8217	Merge "raft: fsm cleanups" from Gleb * scylla-dev/raft-cleanup-v1: raft: drop _leader_progress tracking from the tracker raft: move current_leader into the follower state raft: add some precondition checks	2021-05-14 17:24:59 +02:00
Asias He	e4872a78b5	storage_service: Delay update pending ranges for replacing node In commit `c82250e0cf` (gossip: Allow deferring advertise of local node to be up), the replacing node was changed to postpone responding to the gossip echo message so that other nodes do not send read requests to it. It works as follows: 1) the replacing node does not respond to the echo message, so that other nodes do not mark it as alive 2) the replacing node advertises the hibernate state so other nodes know it is replacing 3) the replacing node responds to the echo message so other nodes can mark it as alive This is problematic because after step 2, the existing nodes in the cluster will start to send writes to the replacing node, but at this time it is possible that existing nodes haven't marked the replacing node as alive, thus failing the write request unnecessarily. For instance, we saw the following errors in issue #8013 (Cassandra stress fails to achieve consistency when only one of the nodes is down) ``` scylla: [shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2 required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1}, pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond gossip echo message) c-s: java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]: Error executing: (UnavailableException): Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive ``` To solve this problem for older releases without the patch "repair: Switch to use NODE_OPS_CMD for replace operation", a minimal fix is implemented in this patch. Once existing nodes learn that the replacing node is in HIBERNATE state, they record it as replacing, but add it to the pending list only after it is marked as alive. 
With this patch, when the existing nodes start to write to the replacing node, the replacing node is already alive. Tests: replace_address_test.py:TestReplaceAddress.replace_node_same_ip_test + manual test Fixes: #8013 Closes #8614	2021-05-14 17:24:28 +02:00
Tomasz Grabiec	102dcfc1fd	Merge "scylla-gdb.py: introduce scylla read-stats" from Botond Too many or too resource-hungry reads often lie at the heart of issues that require an investigation with gdb. Therefore it is very useful to have a way to summarize all reads found on a shard with their states and resource consumptions. This is exactly what this new command does. For this it uses the reader concurrency semaphores and their permits respectively, which are now arranged in an intrusive list and therefore are enumerable. Example output: (gdb) scylla read-stats Semaphore _read_concurrency_sem with: 1/100 count and 14334414/14302576 memory resources, queued: 0, inactive=1 permits count memory table/description/state 1 1 14279738 multishard_mutation_query_test.fuzzy_test/fuzzy-test/active 16 0 53532 multishard_mutation_query_test.fuzzy_test/shard-reader/active 1 0 1144 multishard_mutation_query_test.fuzzy_test/shard-reader/inactive 1 0 0 ./view_builder/active 1 0 0 multishard_mutation_query_test.fuzzy_test/multishard-mutation-query/active 20 1 14334414 Total * botond/scylla-gdb.py-scylla-reads/v5: scylla-gdb.py: introduce scylla read-stats scylla-gdb.py: add pretty printer for std::string_view scylla-gdb.py: std_map() add __len__() scylla-gdb.py: prevent infinite recursion in intrusive_list.__len__()	2021-05-14 16:07:14 +02:00
Takuya ASADA	838acb44d0	scylla-fstrim.timer: fix wrong description from 'daily' to 'weekly' It is scheduled weekly, not daily. Fixes #8633 Closes #8644	2021-05-14 16:02:12 +02:00
Asias He	b8749f51cb	repair: Consider memory bloat when calculate repair parallelism The repair parallelism is calculated from the amount of memory allocated to repair and the memory usage per repair instance. Currently, it does not consider memory bloat issues (e.g., issue #8640) which cause repair to use more memory and cause std::bad_alloc. Be more conservative when calculating the parallelism to avoid repair using too much memory. Fixes #8641 Closes #8652	2021-05-14 16:02:08 +02:00
Piotr Sarna	c1cb7d87e1	auth: remove the fixed 15s delay during auth setup The auth initialization path contains a fixed 15s delay, which used to work around a couple of issues (#3320, #3850), but is right now quite useless, because a retry mechanism is already in place anyway. This patch speeds up the boot process if authentication is enabled. In particular, for single-node clusters, common for test setups, auth initialization now takes a couple of milliseconds instead of the whole 15 seconds. Fixes #8648 Closes #8649	2021-05-14 16:01:59 +02:00
Kamil Braun	c21311ecca	raft: randomized_nemesis_test: basic test This is a simple test that serves as an example of using the framework implemented in the previous commits. We introduce `ExRegister`, an implementation of `PureStateMachine` that stores an `int32_t` and handles ``exchange'' and ``read'' inputs; an exchange replaces the state with the given value and returns the previous state, a read does not modify the state and returns the current state. In order to pass the inputs to Raft we must serialize them into commands so we implement instances of `ser::serializer` for `ExReg`'s input types.	2021-05-14 15:11:01 +02:00
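The exchange/read register described in this commit can be sketched in standalone C++17 as below. This is only an illustration, not the code from the commit: the name `ExRegisterSketch`, the initial value of 0, and the use of `std::variant` for the input type are assumptions, and serialization is left out entirely.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <variant>

// Hypothetical stand-in for the register machine; names and layout are assumed.
struct ExRegisterSketch {
    struct exchange { int32_t x; };  // replace the state with x
    struct read {};                  // leave the state untouched

    using input_t = std::variant<read, exchange>;
    using output_t = int32_t;
    using state_t = int32_t;

    static constexpr state_t init = 0;  // assumed initial value

    // Transition function: (state, input) -> (new state, output).
    static std::pair<state_t, output_t> delta(state_t s, const input_t& in) {
        if (const auto* ex = std::get_if<exchange>(&in)) {
            return {ex->x, s};  // exchange: store the new value, return the previous one
        }
        return {s, s};          // read: state unchanged, return the current value
    }
};
```

Because `delta` is a pure function of its arguments, it can be checked without any Raft machinery, which is the property the test framework relies on.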
Kamil Braun	66b9bc6fe1	raft: randomized_nemesis_test: ticker `ticker` calls the given function as fast as the Seastar reactor allows and yields between each call. It may be provided a limit for the number of calls; it crashes the test if the limit is reached before the ticker is `abort()`ed. The commit also introduces a `with_env_and_ticker` helper function which creates an `environment`, a `ticker`, and passes references to them to the given function. It destroys them after the function finishes by calling `abort()`.	2021-05-14 15:11:01 +02:00
Kamil Braun	c7cef58797	raft: randomized_nemesis_test: environment `environment` represents a set of `raft_server`s connected by a `network`. The `network` inside is initialized with a message delivery function which notifies the destination server's failure detector on each message and if the message contains an RPC payload, pushes it into the destination's `delivery_queue`. Needs to be periodically `tick()`ed which ticks the network and underlying servers. New servers can be created in the environment by calling `new_server`.	2021-05-14 15:11:01 +02:00
Kamil Braun	5095a4158e	raft: randomized_nemesis_test: server `raft_server` is a package that contains `raft::server` and other facilities needed for the server to communicate with its environment: the delivery queue, the set of snapshots (shared by `impure_state_machine`, `rpc` and `persistence`) and references to the `impure_state_machine` and `rpc` instances of this server.	2021-05-14 15:11:01 +02:00
Kamil Braun	f139fd4c28	raft: randomized_nemesis_test: delivery queue The fact that `network` has delivered a message does not mean the message was processed by the receiver. In fact, `network` assumes that delivery is instantaneous, while processing a message may be a long, complex computation, or even require IO. Thus, after a message is delivered, something else must ensure that it is processed by the destination server. That something in our framework is `delivery_queue`. It will be the bridge between `network` and `rpc`. While `network` is shared by all servers - it represents the ``environment'' in which the servers live - each server has its own private `delivery_queue`. When `network` delivers an RPC message it will end up inside `delivery_queue`. A separate fiber, `delivery_queue::receive_fiber()`, will process those messages by calling `rpc::receive` (which is a potentially long operation, thus returns a `future<>`) on the `rpc` of the destination server.	2021-05-14 15:11:01 +02:00
Kamil Braun	2956f5f76c	raft: randomized_nemesis_test: network `network` is a simple priority queue of "events", where an event is a message associated with delivery time. Each message contains a source, a destination, and payload. The queue uses a logical clock to decide when to deliver messages; it delivers all messages whose associated times are smaller than the current time. The exact delivery method is unknown to `network` but passed as a `deliver_t` function in the constructor. The type of payload is generic.	2021-05-14 15:11:01 +02:00
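A rough standalone sketch of such a time-ordered message queue follows. It is not the commit's implementation: `NetworkSketch`, `send`, the integer server IDs, and the plain `uint64_t` clock are all assumptions, and the strict `<` comparison simply follows the wording "smaller than the current time".

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a network that delivers messages by logical time.
template <typename Payload>
class NetworkSketch {
public:
    struct message { int src; int dst; Payload payload; };
    using deliver_t = std::function<void(const message&)>;

private:
    struct event { uint64_t time; message msg; };
    // Orders the priority queue so that the earliest delivery time is on top.
    struct later {
        bool operator()(const event& a, const event& b) const { return a.time > b.time; }
    };

    std::priority_queue<event, std::vector<event>, later> _events;
    uint64_t _now = 0;
    deliver_t _deliver;

public:
    explicit NetworkSketch(deliver_t d) : _deliver(std::move(d)) {}

    // Schedule a message for delivery `delay` ticks from now.
    void send(int src, int dst, Payload p, uint64_t delay) {
        _events.push(event{_now + delay, message{src, dst, std::move(p)}});
    }

    // Advance the clock by one tick and deliver every message whose time is
    // strictly smaller than the current time.
    void tick() {
        ++_now;
        while (!_events.empty() && _events.top().time < _now) {
            message m = _events.top().msg;
            _events.pop();
            _deliver(m);
        }
    }
};
```

The delivery callback is the only link to the rest of the system, which is why the queue itself can stay ignorant of how messages are processed.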
Kamil Braun	3068a0aa70	raft: randomized_nemesis_test: heartbeat-based failure detector In order to simulate a production environment as closely as possible, we implement a failure detector which uses heartbeats for deciding whether to convict a server as failed. We convict a server if we don't receive a heartbeat for a long enough time. Similarly to `rpc`, `failure_detector` assumes a message passing method given by a `send_heartbeat_t` function through the constructor. `failure_detector` uses the knowledge about existing servers to decide who to send heartbeats to. Updating this knowledge happens through `add_server` and `remove_server` functions.	2021-05-14 15:11:01 +02:00
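The conviction rule can be illustrated with a small standalone model. Everything here beyond `add_server` and `remove_server` is an assumption made for the sketch: the class name, `receive_heartbeat`, `is_alive`, the integer server IDs, and the tick-based threshold. Sending heartbeats is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical heartbeat bookkeeping: a server is convicted once more than
// `threshold` ticks have passed since its last heartbeat.
class FailureDetectorSketch {
    uint64_t _now = 0;
    uint64_t _threshold;
    std::unordered_map<int, uint64_t> _last_heard;  // server id -> tick of last heartbeat

public:
    explicit FailureDetectorSketch(uint64_t threshold) : _threshold(threshold) {}

    // A newly added server is treated as heard from at the current tick.
    void add_server(int id) { _last_heard.emplace(id, _now); }
    void remove_server(int id) { _last_heard.erase(id); }

    // Record a heartbeat; heartbeats from unknown servers are ignored.
    void receive_heartbeat(int from) {
        auto it = _last_heard.find(from);
        if (it != _last_heard.end()) {
            it->second = _now;
        }
    }

    void tick() { ++_now; }

    // Unknown servers are reported as not alive.
    bool is_alive(int id) const {
        auto it = _last_heard.find(id);
        return it != _last_heard.end() && _now - it->second <= _threshold;
    }
};
```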
Kamil Braun	51df600478	raft: randomized_nemesis_test: memory backed persistence `persistence` represents the data that does not get lost between server crashes and restarts. We store a log of commands in `_stored_entries`. It is invariably ``contiguous'', meaning that the index of each entry except the first is equal to the index of the previous entry plus one at all times (i.e. after each yield). We assume that the caller provides log entries in strictly increasing index order and without gaps. Additionally to storing log entries, `persistence` can be asked to store or load a snapshot. To implement this it takes a reference to a set of snapshots (`snapshots_t&`) which it will share with `impure_state_machine` and an implementation of `rpc` coming in a later commit. We ensure that the stored log either ``touches'' the stored snapshot on the right side or intersects it.	2021-05-14 15:11:01 +02:00
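The two log invariants mentioned here can be written down as simple predicates. This is a sketch under assumptions: the function names are hypothetical, entries are reduced to their indexes, and `log_meets_snapshot` encodes just one reading of "touches on the right side or intersects" (the first stored index is at most the snapshot index plus one), which the commit message does not spell out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// True when every index after the first equals the previous index plus one.
bool is_contiguous(const std::vector<uint64_t>& indexes) {
    for (std::size_t i = 1; i < indexes.size(); ++i) {
        if (indexes[i] != indexes[i - 1] + 1) {
            return false;
        }
    }
    return true;
}

// Assumed interpretation: the log may start right after the snapshot
// ("touches") or anywhere at or before it ("intersects"), but it must not
// leave a gap after the snapshot. An empty log is accepted here by assumption.
bool log_meets_snapshot(const std::vector<uint64_t>& indexes, uint64_t snapshot_idx) {
    return indexes.empty() || indexes.front() <= snapshot_idx + 1;
}
```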
Kamil Braun	7a1f6e6d7b	raft: randomized_nemesis_test: rpc We implement the `raft::rpc` interface, allowing Raft servers to communicate with other Raft servers. The implementation is mostly boilerplate. It assumes that there exists a method of message passing, given by a `send_message_t` function passed in the constructor. It also handles the receival of messages in the `receive` function. It defines the message type (`message_t`) that will be used by the message-passing method. The actual message passing is implemented with `network` and `delivery_queue` which are introduced in later commits. The only slightly complex thing in `rpc` is the implementation of `send_snapshot` which is the only function in the `raft::rpc` interface that actually expects a response. To implement this, before sending the snapshot message we allocate a promise-future pair and assign to it a unique ID; we store the promise and the ID in a data structure. We then send the snapshot together with the ID and wait on the future. The message receival function on the other side, when it receives the snapshot message, applies the snapshot and sends back a snapshot reply message that contains the same ID. When we receive a snapshot reply message we look up the ID in the data structure and if we find a promise, we push the reply through that promise. `rpc` also keeps a reference to `snapshots_t` - it will refer to the same set of snapshots as the `impure_state_machine` on the same server. It accesses the set when it receives or sends a snapshot message.	2021-05-14 15:11:01 +02:00
Kamil Braun	905126acc3	raft: randomized_nemesis_test: impure_state_machine To replicate a state machine, our Raft implementation requires it to be represented with the `raft::state_machine` interface. `impure_state_machine` is an implementation of `raft::state_machine` that wraps a `PureStateMachine`. It keeps a variable of type `state_t` representing the current state. In `apply` it deserializes the given command into `input_t`, uses the transition (`delta`) function to produce the next state and output, replaces its current state with the obtained state and returns the output (more on that below); it does so sequentially for every given command. We can think of `PureStateMachine` as the actual state machine - the business logic, and `impure_state_machine` as the ``boilerplate'' that allows the pure machine to be replicated by Raft and communicate with the external world. The interface also requires maintenance of snapshots. We introduce the `snapshots_t` type representing a set of snapshots known by a state machine. `impure_state_machine` keeps a reference to `snapshots_t` because it will share it with an implementation of `raft::persistence` coming with a later commit. Returning outputs is a bit tricky because apply is ``write-only'' - it returns `future<>`. We use the following technique: 1. Before sending a command to a Raft leader through `server::add_entry`, one must first directly contact the instance of `impure_state_machine` replicated by the leader, asking it to allocate an ``output channel''. 2. On such a request, `impure_state_machine` creates a channel (represented by a promise-future pair) and a unique ID; it stores the input side of the channel (the promise) with this ID internally and returns the ID and the output side of the channel (the future) to the requester. 3. After obtaining the ID, one serializes the ID together with the input and sends it as a command to Raft. Thus commands are (ID, machine input) pairs. 4. 
When `impure_state_machine` applies a command, it looks for a promise with the given ID. If it finds one, it sends the output through this channel. 5. The command sender waits for the output on the obtained future. The allocation and deallocation of channels is done using the `impure_state_machine::with_output_channel` function. The `call` function is an implementation of the above technique. Note that only the leader will attempt to send the output - other replicas won't find the ID in their internal data structure. The set of IDs and channels is not a part of the replicated state. A failure may cause the output to never arrive (or even the command to never be applied) so `call` waits for a limited time. It may also mistakenly `call` a server which is not currently the leader, but it is prepared to handle this error.	2021-05-14 15:11:01 +02:00
Kamil Braun	3e02befccd	raft: randomized_nemesis_test: introduce logical_timer This is a wrapper around `raft::logical_clock` that allows scheduling events to happen after a certain number of logical clock ticks. For example, `logical_timer::sleep(20_t)` returns a future that resolves after 20 calls to `logical_timer::tick()`.	2021-05-13 11:34:00 +02:00
Kamil Braun	15e3bd2620	raft: randomized_nemesis_test: `PureStateMachine` concept The commit introduces `PureStateMachine`, which is the most direct translation of the mathematical definition of a state machine to C++ that I could come up with. Represented by a C++ concept, it consists of: a set of inputs (represented by the `input_t` type), outputs (`output_t` type), states (`state_t`), an initial state (`init`) and a transition function (`delta`) which given a state and an input returns a new state and an output. The rest of the testing infrastructure is going to be generic w.r.t. `PureStateMachine`. This will allow easily implementing tests using both simple and complex state machines by substituting the proper definition for this concept. One possibility of modifying this definition would be to have `delta` return `future<pair<state_t, output_t>>` instead of `pair<state_t, output_t>`. This would lose some ``purity'' but allow long computations without reactor stalls in the tests. Such modification, if we decide to do it, is trivial.	2021-05-13 11:34:00 +02:00
Alejo Sanchez	68f69671b5	raft: style: test optionals directly Avoid using has_value() and test optional directly Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Message-Id: <20210512142018.297203-2-alejo.sanchez@scylladb.com>	2021-05-12 20:39:52 +02:00
Piotr Wojtczak	e6254acfd3	boost/tests: Add virtual_table_test for basic infrastructure	2021-05-12 17:05:35 +02:00
Piotr Wojtczak	8825ae128d	boost/tests: Test memtable_filling_virtual_table as mutation_source Uses the infrastructure for testing mutation_sources, but only a subset of it which does not do fast forwarding (since virtual_table does not support it).	2021-05-12 17:05:35 +02:00
Juliusz Stasiewicz	874f4de60c	db/system_keyspace: Add system.status virtual table This change uses the previously introduced memtable_filling_virtual_table to expose nodetool status as a virtual table.	2021-05-12 17:05:35 +02:00
Tomasz Grabiec	57ed93bf44	db/virtual_table: Add a way to specify a range of partitions for virtual table queries. This change introduces a query_restrictions object into the virtual table infrastructure, for now only holding a restriction on partition ranges. That partition range is then implemented into memtable_filling_virtual_table.	2021-05-12 17:05:35 +02:00
Piotr Wojtczak	38720847f2	db/virtual_table: Introduce memtable_filling_virtual_table This change adds a more specific implementation of the virtual table called memtable_filling_virtual_table. It produces results by filling a memtable on each read.	2021-05-12 17:05:34 +02:00
Juliusz Stasiewicz	61a0314952	db: Add virtual tables interface This change introduces the basic interface we expect each virtual table to implement. More specific implementations will then expand upon it if needed.	2021-05-12 17:05:34 +02:00
Juliusz Stasiewicz	8333d66d4e	db: Introduce chained_delegating_reader This change adds a new type of mutation reader whose purpose is to allow inserting operations before an invocation of the proper reader. It takes a future to wait on and only after it resolves will it forward the execution to the underlying flat_mutation_reader implementation.	2021-05-12 17:05:34 +02:00
Eliran Sinvani	5eb84f110e	gossiper: remove excess error logging from gossiper We remove a log of severity error that is later thrown as an exception, caught a few lines below, and then printed out as a warning. Fixes #8616 Closes #8617	2021-05-12 15:02:35 +02:00
Tomasz Grabiec	f8d7374400	Merge 'Add additional sstable stats' from Michael Livshin Refs #251. Closes #8630 * github.com:scylladb/scylla: statistics: add global bloom filter memory gauge statistics: add some sstable management metrics sstables: make the `_open` field more useful sstables: stats: noexcept all accessors	2021-05-12 14:35:13 +02:00
Avi Kivity	c3f17ea0a3	Merge "Fix query performance for range tombstone covering many rows" from Tomasz " Row cache reader can produce overlapping range tombstones in the mutation fragment stream even if there is only a single range tombstone in sstables, due to #2581. For every range between two rows, the row cache reader queries for tombstones relevant for that range. The result of the query is trimmed to the current position of the reader (=position of the previous row) to satisfy key monotonicity. The end position of range tombstones is left unchanged. So cache reader will split a single range tombstone around rows. Those range tombstones are transient, they will be only materialized in the reader's stream, they are not persisted anywhere. That is not a problem in itself, but it interacts badly with mutation compactor due to #8625. The range_tombstone_accumulator which is used to compact the mutation fragment stream needs to accumulate all tombstones which are relevant for the current clustering position in the stream. Adding a new range tombstone is O(N) in the number of currently active tombstones. This means that producing N rows will be O(N^2). In a unit test introduced in this series, I saw reading 137'248 rows which overlap with a range tombstone take 245 seconds. Almost all of the CPU time is in drop_unneeded_tombstones(). The solution is to make the cache reader trim range tombstone end to the currently emitted sub-range, so that it emits non-overlapping range tombstones. Fixes #8626. Tests: - row_cache_test (release) - perf_row_cache_reads (release) " * tag 'fix-perf-many-rows-covered-by-range-tombstone-v2' of github.com:tgrabiec/scylla: tests: perf_row_cache_reads: Add scenario for lots of rows covered by a range tombstone row_cache: Avoid generating overlapping range tombstones range_tombstone_accumulator: Avoid update_current_tombstone() when nothing changed	2021-05-12 14:07:48 +03:00
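
The `PureStateMachine` commit (15e3bd2620) above describes a state machine as three types (`input_t`, `output_t`, `state_t`), an initial state `init`, and a transition function `delta` returning a `pair<state_t, output_t>`. A minimal sketch of that shape follows. It is not the code from the commit: the commit uses a C++ concept, whereas this sketch checks the same requirements with a `static_assert` so that it also builds on pre-C++20 compilers. The names `adder`, `is_pure_state_machine` and `run` are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical example machine: the state is a running sum and every
// transition outputs the new sum.
struct adder {
    using state_t = int;
    using input_t = int;
    using output_t = int;
    static constexpr state_t init = 0;
    static std::pair<state_t, output_t> delta(state_t s, input_t x) {
        return {s + x, s + x};
    }
};

// Assumption: a trait standing in for the concept described in the commit.
// It checks that delta(state, input) yields pair<state_t, output_t> and
// that init is usable as a state_t.
template <typename M>
struct is_pure_state_machine {
    using result_t = decltype(M::delta(std::declval<typename M::state_t>(),
                                       std::declval<typename M::input_t>()));
    static constexpr bool value =
        std::is_same<result_t,
                     std::pair<typename M::state_t, typename M::output_t>>::value &&
        std::is_convertible<decltype(M::init), typename M::state_t>::value;
};

// Generic driver: feeds inputs to any conforming machine, one at a time,
// and collects the outputs. It knows nothing about the machine's logic.
template <typename M>
std::vector<typename M::output_t> run(const std::vector<typename M::input_t>& inputs) {
    static_assert(is_pure_state_machine<M>::value,
                  "M does not have the PureStateMachine shape");
    typename M::state_t state = M::init;
    std::vector<typename M::output_t> outputs;
    for (const auto& in : inputs) {
        auto step = M::delta(state, in);  // (next state, output)
        state = std::move(step.first);
        outputs.push_back(std::move(step.second));
    }
    return outputs;
}
```

Because `run` is written only against the member names, swapping `adder` for a more complex machine needs no change to the driver, which is the kind of genericity the commit message aims for in the test infrastructure.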

26498 Commits
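
The five-step output channel technique in the `impure_state_machine` commit (905126acc3) can be sketched in a few lines. This is a simplified, single-threaded illustration under stated assumptions, not the code from the commit: a plain shared `output_slot` stands in for the promise-future pair, replication is simulated by applying the same command to two objects, and the names `output_slot`, `command`, `replica`, `allocate_channel` and `call_demo` are hypothetical. Timeouts and leader-change handling, which the commit's `call` function deals with, are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Stand-in for the output (future) side of a channel: filled at most once.
struct output_slot {
    bool ready = false;
    std::string value;
};

// A replicated command is an (ID, machine input) pair (step 3).
struct command {
    std::uint64_t id;
    int input;
};

class replica {
    int _state = 0;                  // replicated state
    std::uint64_t _next_id = 0;      // NOT replicated
    std::map<std::uint64_t, std::shared_ptr<output_slot>> _channels;  // NOT replicated
public:
    // Steps 1-2: allocate a channel and hand back its ID and output side.
    std::pair<std::uint64_t, std::shared_ptr<output_slot>> allocate_channel() {
        auto slot = std::make_shared<output_slot>();
        std::uint64_t id = _next_id++;
        _channels.emplace(id, slot);
        return {id, slot};
    }

    // Step 4: apply the command; deliver the output only if this replica
    // holds a channel with the given ID.
    void apply(const command& cmd) {
        _state += cmd.input;
        auto it = _channels.find(cmd.id);
        if (it != _channels.end()) {
            it->second->ready = true;
            it->second->value = std::to_string(_state);
            _channels.erase(it);
        }
    }

    int state() const { return _state; }
};

// Steps 1-5 end to end, with "replication" simulated by applying the
// command on both replicas. Returns the output, or "" if none arrived.
std::string call_demo(replica& leader, replica& follower, int input) {
    auto ch = leader.allocate_channel();
    command cmd{ch.first, input};
    leader.apply(cmd);
    follower.apply(cmd);  // finds no channel for this ID, stays silent
    return ch.second->ready ? ch.second->value : "";
}
```

Both replicas end up with the same state, but only the one that allocated the channel produces an output, matching the note in the commit that the set of IDs and channels is not part of the replicated state.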