Commit Graph

1644 Commits

Botond Dénes
dbb6851d4d test/manual/sstable_scan_footprint: don't double close the semaphore
The semaphore `stats_collector` references is the one obtained from the
database object, which is already stopped by `database::stop()`, making
the stop in `~stats_collector()` redundant and, even worse, harmful, as
it triggers an assert failure. Remove it.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210518140913.276368-1-bdenes@scylladb.com>
2021-05-18 17:55:52 +03:00
Avi Kivity
16ff92745f Merge 'perf: add alternator frontend to perf_simple_query' from Piotr Sarna
The perf_simple_query tool is extended with another protocol
aside from CQL - alternator. The alternative (pun intended) benchmark
can be executed by using the `--alternator X` parameter, where X
specifies one of the alternator's mandatory write isolation options:
 - "forbid_rmw" - forbids RMW (read-modify-write) requests
 - "unsafe" - never uses LWT (lightweight transactions), even for RMW
 - "always_use_lwt" - uses LWT even for non-RMW requests
 - "only_rmw_uses_lwt" - that one's rather self-explanatory

Alternator cooperates with existing `--write` and `--delete` parameters.

Aside from being able to check for improvements/regressions
in the alternator module, it's also possible to check how different
isolation levels influence the number of allocations and overall
performance, or to compare alternator against CQL.

Example output showing the difference in isolation levels:

```bash
$ ./build/release/test/perf/perf_simple_query_g --smp 1 \
    --write --alternator only_rmw_uses_lwt --default-log-level error
random-seed=1235000092
Started alternator executor
10873.76 tps (202.9 allocs/op,  12.4 tasks/op,  369921 insns/op)
11096.09 tps (202.7 allocs/op,  12.1 tasks/op,  374792 insns/op)
11100.09 tps (203.0 allocs/op,  12.1 tasks/op,  376469 insns/op)
11068.98 tps (203.1 allocs/op,  12.1 tasks/op,  377132 insns/op)
11081.24 tps (203.2 allocs/op,  12.1 tasks/op,  377290 insns/op)

median 11081.24 tps (203.2 allocs/op,  12.1 tasks/op,  377290 insns/op)
median absolute deviation: 14.85
maximum: 11100.09
minimum: 10873.76

$ ./build/release/test/perf/perf_simple_query_g --smp 1 \
    --random-seed 1235000092 --write --alternator always_use_lwt \
    --default-log-level error
random-seed=1235000092
Started alternator executor
3605.35 tps (877.4 allocs/op, 174.6 tasks/op,  986666 insns/op)
3555.71 tps (890.0 allocs/op, 174.4 tasks/op, 1006945 insns/op)
3530.20 tps (899.7 allocs/op, 174.1 tasks/op, 1021908 insns/op)
3437.65 tps (908.2 allocs/op, 174.6 tasks/op, 1033992 insns/op)
3409.88 tps (913.2 allocs/op, 174.4 tasks/op, 1041240 insns/op)

median 3530.20 tps (899.7 allocs/op, 174.1 tasks/op, 1021908 insns/op)
median absolute deviation: 75.15
maximum: 3605.35
minimum: 3409.88
```

Closes #8656

* github.com:scylladb/scylla:
  perf: add alternator frontend to perf_simple_query
  cdc: make metadata.hh self-sufficient
  test: add minimal alternator_test_env
2021-05-18 16:17:54 +03:00
Piotr Sarna
6c6ccda8a0 perf: add alternator frontend to perf_simple_query
The perf_simple_query tool is extended with another protocol
aside from CQL - alternator. The alternative (pun intended) benchmark
can be executed by using the `--alternator X` parameter, where X
specifies one of the alternator's mandatory write isolation options:
 - "forbid_rmw" - forbids RMW (read-modify-write) requests
 - "unsafe" - never uses LWT (lightweight transactions), even for RMW
 - "always_use_lwt" - uses LWT even for non-RMW requests
 - "only_rmw_uses_lwt" - that one's rather self-explanatory

Alternator cooperates with existing --write and --delete parameters.

Aside from being able to check for improvements/regressions
in the alternator module, it's also possible to check how different
isolation levels influence the number of allocations and overall
performance, or to compare alternator against CQL.

$ ./build/release/test/perf/perf_simple_query_g --smp 1 \
    --write --alternator only_rmw_uses_lwt --default-log-level error
random-seed=1235000092
Started alternator executor
10873.76 tps (202.9 allocs/op,  12.4 tasks/op,  369921 insns/op)
11096.09 tps (202.7 allocs/op,  12.1 tasks/op,  374792 insns/op)
11100.09 tps (203.0 allocs/op,  12.1 tasks/op,  376469 insns/op)
11068.98 tps (203.1 allocs/op,  12.1 tasks/op,  377132 insns/op)
11081.24 tps (203.2 allocs/op,  12.1 tasks/op,  377290 insns/op)

median 11081.24 tps (203.2 allocs/op,  12.1 tasks/op,  377290 insns/op)
median absolute deviation: 14.85
maximum: 11100.09
minimum: 10873.76

$ ./build/release/test/perf/perf_simple_query_g --smp 1 \
    --random-seed 1235000092 --write --alternator always_use_lwt \
    --default-log-level error
random-seed=1235000092
Started alternator executor
3605.35 tps (877.4 allocs/op, 174.6 tasks/op,  986666 insns/op)
3555.71 tps (890.0 allocs/op, 174.4 tasks/op, 1006945 insns/op)
3530.20 tps (899.7 allocs/op, 174.1 tasks/op, 1021908 insns/op)
3437.65 tps (908.2 allocs/op, 174.6 tasks/op, 1033992 insns/op)
3409.88 tps (913.2 allocs/op, 174.4 tasks/op, 1041240 insns/op)

median 3530.20 tps (899.7 allocs/op, 174.1 tasks/op, 1021908 insns/op)
median absolute deviation: 75.15
maximum: 3605.35
minimum: 3409.88
2021-05-18 15:10:31 +02:00
Piotr Sarna
b6d6247a74 test: add minimal alternator_test_env
A minimal alternator test env, a younger cousin
of cql_test_env, is implemented. Note that using this environment
for unit tests is strongly discouraged in favor of the official
test/alternator pytest suite. Still, alternator_test_env has its uses
for microbenchmarks.
2021-05-18 15:10:31 +02:00
Botond Dénes
82bff1bcc6 test: cql_test_env: use proper scheduling groups
Currently `cql_test_env` runs its `func` in the default (main) group and
also leaves all scheduling groups in `dbcfg` default initialized to the
same scheduling group. This results in every part of the system,
normally isolated from each other, running in the same (default)
scheduling group. Not a big problem on its own, as we are talking about
tests, but this creates an artificial difference between the test and
the real environment, which is ever more pronounced since certain query
parameters are selected based on the current scheduling group.
To bring cql test env just that little bit closer to the real thing,
this patch creates all the scheduling groups main does (well almost) and
configures `dbcfg` with them.
Creating and destroying the scheduling groups on each setup-teardown of
cql test env breaks some internal seastar components which don't like
seeing the same scheduling group with the same name but a different id.
So create the scheduling groups once, on first access, and keep them
around for as long as the test executable is running.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514141614.128213-2-bdenes@scylladb.com>
2021-05-18 13:44:54 +03:00
Botond Dénes
300ee974f7 test: use with_cql_test_env_thread where needed
Currently `with_cql_test_env()` is equivalent to
`with_cql_test_env_thread()`, which resulted in many tests using the
former while really needing the latter and getting away with it. This
equivalence is incidental and will go away soon, so make sure all cql
test env using tests that expect to be run in a thread use the
appropriate variant.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514141614.128213-1-bdenes@scylladb.com>
2021-05-18 13:44:52 +03:00
Avi Kivity
6db826475d Merge "Introduce segregate scrub mode" from Botond
"
The current scrub compaction has a serious drawback: while it is
very effective at removing any corruption it recognizes, it is very
heavy-handed in its way of repairing such corruption: it simply drops
all data that is suspected to be corrupt. While this *is* the safest way
to cleanse data, it might not be the best way from the point of view of
a user who doesn't want to lose data, even at the risk of retaining
some business-logic-level corruption. Mind you, no database-level scrub
can ever fully repair data from the business-logic point of view; it
can only do so on the database level. So in certain cases it might be
desirable to have a less heavy-handed approach to cleansing the data,
one that tries as hard as it can not to lose any data.

This series introduces a new scrub mode, with the goal of addressing
this use case: when the user doesn't want to lose any data. The new
mode is called "segregate" and it works by segregating its input into
multiple outputs such that each output contains a valid stream. This
approach can fix any out-of-order data, be that on the partition or
fragment level. Out-of-order partitions are simply written into a
separate output. Out-of-order fragments are handled by injecting a
partition-end/partition-start pair right before them, so that they end
up in a separate (duplicate) partition, which will just be written into
a separate output, like a regular out-of-order partition.
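The segregation idea can be sketched minimally as follows (a hypothetical illustration, using plain integers as stand-ins for partition keys and ignoring fragment-level handling): each output stream remembers the last key it wrote, and a key that would be out of order for every existing stream opens a new one, so each stream stays sorted.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of segregate-by-partition: route each incoming
// key to the first output whose last key is smaller, opening a new
// output when none fits. Every output ends up strictly sorted.
std::vector<std::vector<int>> segregate(const std::vector<int>& keys) {
    std::vector<std::vector<int>> outputs;
    for (int key : keys) {
        bool placed = false;
        for (auto& out : outputs) {
            if (out.empty() || out.back() < key) {
                out.push_back(key);  // in order for this stream
                placed = true;
                break;
            }
        }
        if (!placed) {
            outputs.push_back({key});  // out of order: open a new stream
        }
    }
    return outputs;
}
```

For input `{1, 3, 2, 4}` this produces two valid streams, `{1, 3, 4}` and `{2}`, mirroring how an out-of-order partition lands in a separate output.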

The reason this series is posted as an RFC is that although I consider
the code stable and tested, there are some questions related to the UX.
* First and foremost, every scrub that does more than just discard data
  that is suspected to be corrupt (though even those do, to a certain
  degree) has to consider the possibility that it is rehabilitating
  corruptions, leaving them in the system without a warning, in the
  sense that the user won't see any more problems due to low-level
  corruptions and hence might think everything is alright, while the
  data is still corrupt from the business-logic point of view. It is
  very hard to draw a line between what scrub should and shouldn't do,
  yet there is demand from users for a scrub that can restore data
  without losing any of it. Note that anybody executing such a scrub is
  already in bad shape; even if they can read their data (they often
  can't) it is already corrupt, so scrub is not making anything worse
  here.
* This series converts the previous `skip_corrupted` boolean into an
  enum, which now selects the scrub mode. This means that
  `skip_corrupted` cannot be combined with segregate to throw out what
  the former can't fix. This was chosen for simplicity: a bunch of
  flags all interacting with each other is, in my opinion, very hard to
  see through, while a linear mode selector is much easier to follow.
* The new segregate mode goes all-in, trying to fix even
  fragment-level disorder. Maybe it should only do it on the partition
  level, or maybe this should be made configurable, allowing the user
  to select what happens to data that cannot be fixed.

Tests: unit(dev), unit(sstable_datafile_test:debug)
"

* 'sstable-scrub-segregate-by-partition/v1' of https://github.com/denesb/scylla:
  test: boost/sstable_datafile_test: add tests for segregate mode scrub
  api: storage_service/keyspace_scrub: expose new segregate mode
  sstables: compaction/scrub: add segregate mode
  mutation_fragment_stream_validator: add reset methods
  mutation_writer: add segregate_by_partition
  api: /storage_service/keyspace_scrub: add scrub mode param
  sstables: compaction/scrub: replace skip_corrupted with mode enum
  sstables: compaction/scrub: prevent infinite loop when last partition end is missing
  tests: boost/sstable_datafile_test: use the same permit for all fragments in scrub tests
2021-05-18 13:43:01 +03:00
Avi Kivity
593ad4de1e Merge 'Fix type checking in index paging' from Piotr Sarna
When recreating the paging state from an indexed query,
a bunch of panic checks were introduced to make sure that
the code is correct. However, one of the checks is too eager -
namely, it throws an error if the base column type is not equal
to the view column type. It usually works correctly, unless the
base column type is a clustering key with DESC clustering order,
in which case the type is actually "reversed". From the point of view
of the paging state generation it's not important, because both
types deserialize in the same way, so the check should be less
strict and allow the base type to be reversed.

Tests: unit(release), along with the additional test case
       introduced in this series; the test also passes
       on Cassandra

Fixes #8666

Closes #8667

* github.com:scylladb/scylla:
  test: add a test case for paging with desc clustering order
  cql3: relax a type check for index paging
2021-05-18 11:34:59 +03:00
Botond Dénes
c98b0d0de8 test: cql_test_env: add trace logs to execute_cql()
In tests executing tons of these, it is useful to be able to enable
trace logging of each statement, to see which is the last successful one.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514140531.118390-1-bdenes@scylladb.com>
2021-05-18 10:06:22 +03:00
Piotr Sarna
c36f432423 test: add a test case for paging with desc clustering order
Issue #8666 revealed a problem with validating types for paged
indexed queries - namely, the type checking mechanism is too strict
in comparing types and fails on mismatched clustering order -
e.g. an `int` column type is different from `int` with DESC
clustering order. As a result, users see a *very* confusing
message (because reversed types are printed as their underlying type):
 > Mismatched types for base and view columns c: int and int
This test case fails before the fix for #8666 and thus acts
as a regression test.
2021-05-17 17:06:50 +02:00
Botond Dénes
dca808dd51 perf/perf_simple_query: add --enable-cache option
Allows testing performance with and without the cache.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210517045402.16153-1-bdenes@scylladb.com>
2021-05-17 14:06:18 +02:00
Avi Kivity
8d6e575f59 perf_fast_forward: report instructions per fragment
Use a hardware counter to report instructions per fragment. Results
vary from ~4k insns/f when reading sequentially to more than 1M insns/f.

Instructions per fragment can be a more stable metric than frags/sec.
It would probably be even more stable with a fake file implementation
that works in-memory to eliminate seastar polling instruction variation.

Closes #8660
2021-05-17 11:33:24 +02:00
Tomasz Grabiec
8dddfab5db Merge 'db/virtual tables: Add infrastructure + system.status example table' from Piotr Wojtczak
This is the 1st PR in a series with the goal of finishing the hackathon project authored by @tgrabiec, @kostja, @amnonh and @mmatczuk (improved virtual tables + function call syntax in CQL). Virtual tables created within this framework are "materialized" in memtables, so the current solution is for small tables only. As an example, system.status was added. It was checked that DISTINCT and reverse ORDER BY do work.

This PR was created by @jul-stas and @StarostaGit
Fixes #8343

This is the same as #8364, but with a compilation fix (newly added `close()` method was not implemented by the reader)

Closes #8634

* github.com:scylladb/scylla:
  boost/tests: Add virtual_table_test for basic infrastructure
  boost/tests: Test memtable_filling_virtual_table as mutation_source
  db/system_keyspace: Add system.status virtual table
  db/virtual_table: Add a way to specify a range of partitions for virtual table queries.
  db/virtual_table: Introduce memtable_filling_virtual_table
  db: Add virtual tables interface
  db: Introduce chained_delegating_reader
2021-05-17 11:29:37 +02:00
Benny Halevy
f4cfa530cc perf: enable instructions_retired_counter only once per executor::run
Enabling it for each run_worker call would invoke ioctl
PERF_EVENT_IOC_ENABLE in parallel with other workers running,
which may skew the results.

Test: perf_simple_query
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210514130542.301168-1-bhalevy@scylladb.com>
2021-05-16 12:13:27 +03:00
Tomasz Grabiec
28ac8d0f2b Merge "raft: randomized_nemesis_test framework" from Kamil
We introduce `PureStateMachine`, which is the most direct translation
of the mathematical definition of a state machine to C++ that I could
come up with.  Represented by a C++ concept, it consists of: a set of
inputs (represented by the `input_t` type), outputs (`output_t` type),
states (`state_t`), an initial state (`init`) and a transition
function (`delta`) which given a state and an input returns a new
state and an output.

The rest of the testing infrastructure is going to be generic
w.r.t. `PureStateMachine`. This will allow easily implementing tests
using both simple and complex state machines by substituting the
proper definition for this concept.

Next comes `logical_timer`: it is a wrapper around
`raft::logical_clock` that allows scheduling events to happen after a
certain number of logical clock ticks.  For example,
`logical_timer::sleep(20_t)` returns a future that resolves after 20
calls to `logical_timer::tick()`. It will be used to introduce
timeouts in the tests, among other things.
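A minimal sketch of that timer idea, with plain callbacks standing in for the Seastar futures the real code uses (the class body here is a hypothetical reconstruction from the description above):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>

// Hypothetical sketch of a logical timer: sleep(n, f) schedules callback
// f to fire after n more calls to tick().
class logical_timer {
    uint64_t _now = 0;
    std::multimap<uint64_t, std::function<void()>> _scheduled;
public:
    void sleep(uint64_t ticks, std::function<void()> f) {
        _scheduled.emplace(_now + ticks, std::move(f));
    }
    void tick() {
        ++_now;
        // Fire every callback whose deadline has been reached.
        auto end = _scheduled.upper_bound(_now);
        for (auto it = _scheduled.begin(); it != end; ++it) {
            it->second();
        }
        _scheduled.erase(_scheduled.begin(), end);
    }
};
```

With this, `sleep(20, f)` invokes `f` exactly on the 20th subsequent `tick()`, matching the `logical_timer::sleep(20_t)` behavior described above.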

To replicate a state machine, our Raft implementation requires it to
be represented with the `raft::state_machine` interface.

`impure_state_machine` is an implementation of `raft::state_machine`
that wraps a `PureStateMachine`. It keeps a variable of type `state_t`
representing the current state. In `apply` it deserializes the given
command into `input_t`, uses the transition (`delta`) function to
produce the next state and output, replaces its current state with the
obtained state and returns the output (more on that below); it does so
sequentially for every given command. We can think of `PureStateMachine`
as the actual state machine - the business logic, and
`impure_state_machine` as the ``boilerplate'' that allows the pure machine
to be replicated by Raft and communicate with the external world.

The interface also requires maintenance of snapshots. We introduce the
`snapshots_t` type representing a set of snapshots known by a state
machine. `impure_state_machine` keeps a reference to `snapshots_t`
because it will share it with an implementation of `persistence`.

Returning outputs is a bit tricky because apply is ``write-only'' - it
returns `future<>`. We use the following technique:

1. Before sending a command to a Raft leader through `server::add_entry`,
   one must first directly contact the instance of `impure_state_machine`
   replicated by the leader, asking it to allocate an ``output channel''.
2. On such a request, `impure_state_machine` creates a channel
   (represented by a promise-future pair) and a unique ID; it stores the
   input side of the channel (the promise) with this ID internally and returns
   the ID and the output side of the channel (the future) to the requester.
3. After obtaining the ID, one serializes the ID together with the input
   and sends it as a command to Raft. Thus commands are (ID, machine input)
   pairs.
4. When `impure_state_machine` applies a command, it looks for a promise
   with the given ID. If it finds one, it sends the output through this
   channel.
5. The command sender waits for the output on the obtained future.

The allocation and deallocation of channels is done using the
`impure_state_machine::with_output_channel` function. The `call`
function is an implementation of the above technique.
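The channel bookkeeping behind steps 1-5 can be sketched like this (a hypothetical reconstruction; `std::optional` slots stand in for the promise-future pairs, and `allocate`/`deliver` are invented names for illustration):

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>
#include <unordered_map>
#include <utility>

// Hypothetical sketch of the output-channel technique: the machine hands
// out (ID, output-side) pairs; applying a command with a known ID pushes
// the output through the matching channel.
class output_channels {
    uint64_t _next_id = 0;
    std::unordered_map<uint64_t, std::shared_ptr<std::optional<int>>> _channels;
public:
    // Steps 1-2: allocate a channel; return its unique ID and output side.
    std::pair<uint64_t, std::shared_ptr<std::optional<int>>> allocate() {
        uint64_t id = _next_id++;
        auto slot = std::make_shared<std::optional<int>>();
        _channels.emplace(id, slot);
        return {id, slot};
    }
    // Step 4: on apply, push the output if we hold the input side of the
    // channel; replicas that don't find the ID simply skip this.
    void deliver(uint64_t id, int output) {
        auto it = _channels.find(id);
        if (it != _channels.end()) {
            *it->second = output;
            _channels.erase(it);
        }
    }
};
```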

Note that only the leader will attempt to send the output - other
replicas won't find the ID in their internal data structure. The set of
IDs and channels is not a part of the replicated state.

A failure may cause the output to never arrive (or even the command to
never be applied) so `call` waits for a limited time. It may also
mistakenly `call` a server which is not currently the leader, but it
is prepared to handle this error.

We implement the `raft::rpc` interface, allowing Raft servers to
communicate with other Raft servers.

The implementation is mostly boilerplate. It assumes that there exists a
method of message passing, given by a `send_message_t` function passed
in the constructor. It also handles the receipt of messages in the
`receive` function. It defines the message type (`message_t`) that will
be used by the message-passing method.

The actual message passing is implemented with `network` and `delivery_queue`.

The only slightly complex thing in `rpc` is the implementation of `send_snapshot`
which is the only function in the `raft::rpc` interface that actually
expects a response. To implement this, before sending the snapshot
message we allocate a promise-future pair and assign to it a unique ID;
we store the promise and the ID in a data structure. We then send the
snapshot together with the ID and wait on the future. The
message-receiving function on the other side, when it receives the
snapshot message, applies the snapshot and sends back a snapshot reply
message that contains the same ID. When we receive a snapshot reply
message we look up the ID in the
data structure and if we find a promise, we push the reply through that
promise.

`rpc` also keeps a reference to `snapshots_t` - it will refer to the
same set of snapshots as the `impure_state_machine` on the same server.
It accesses the set when it receives or sends a snapshot message.

`persistence` represents the data that does not get lost between server
crashes and restarts.

We store a log of commands in `_stored_entries`. It is invariably
``contiguous'', meaning that the index of each entry except the first is
equal to the index of the previous entry plus one at all times (i.e.
after each yield). We assume that the caller provides log entries
in strictly increasing index order and without gaps.

Additionally to storing log entries, `persistence` can be asked to store
or load a snapshot. To implement this it takes a reference to a set of snapshots
(`snapshots_t&`) which it will share with `impure_state_machine` and an
implementation of `rpc`.  We ensure that the stored log either ``touches''
the stored snapshot on the right side or intersects it.

In order to simulate a production environment as closely as possible, we
implement a failure detector which uses heartbeats for deciding whether
to convict a server as failed. We convict a server if we don't receive a
heartbeat for a long enough time.

Similarly to `rpc`, `failure_detector` assumes a message passing method
given by a `send_heartbeat_t` function through the constructor.

`failure_detector` uses the knowledge about existing servers to decide
who to send heartbeats to. Updating this knowledge happens through
`add_server` and `remove_server` functions.

`network` is a simple priority queue of "events", where an event is a
message associated with delivery time. Each message contains a source,
a destination, and a payload. The queue uses a logical clock to decide
when to deliver messages; it delivers all messages whose associated
times are smaller than the current time.

The exact delivery method is unknown to `network` but passed as a
`deliver_t` function in the constructor. The type of payload is generic.

The fact that `network` has delivered a message does not mean the
message was processed by the receiver. In fact, `network` assumes that
delivery is instantaneous, while processing a message may be a long,
complex computation, or even require IO. Thus, after a message is
delivered, something else must ensure that it is processed by the
destination server.

That something in our framework is `delivery_queue`. It will be the
bridge between `network` and `rpc`. While `network` is shared by all
servers - it represents the ``environment'' in which the servers live -
each server has its own private `delivery_queue`. When `network`
delivers an RPC message it will end up inside `delivery_queue`. A
separate fiber, `delivery_queue::receive_fiber()`, will process those
messages by calling `rpc::receive` (which is a potentially long
operation, thus returns a `future<>`) on the `rpc` of the destination
server.

`raft_server` is a package that contains `raft::server` and other
facilities needed for the server to communicate with its environment:
the delivery queue, the set of snapshots (shared by
`impure_state_machine`, `rpc` and `persistence`) and references to the
`impure_state_machine` and `rpc` instances of this server.

`environment` represents a set of `raft_server`s connected by a `network`.

The `network` inside is initialized with a message delivery function
which notifies the destination server's failure detector on each message
and if the message contains an RPC payload, pushes it into the destination's
`delivery_queue`.

The environment needs to be periodically `tick()`ed, which ticks the
network and the underlying servers.

`ticker` calls the given function as fast as the Seastar reactor
allows and yields between each call. It may be provided a limit
for the number of calls; it crashes the test if the limit is reached
before the ticker is `abort()`ed.

Finally, we add a simple test that serves as an example of using the
implemented framework. We introduce `ExRegister`, an implementation
of `PureStateMachine` that stores an `int32_t` and handles ``exchange''
and ``read'' inputs; an exchange replaces the state with the given value
and returns the previous state, a read does not modify the state and returns
the current state.  In order to pass the inputs to Raft we must
serialize them into commands so we implement instances of `ser::serializer`
for `ExReg`'s input types.

* kbr/randomized-nemesis-test-v5:
  raft: randomized_nemesis_test: basic test
  raft: randomized_nemesis_test: ticker
  raft: randomized_nemesis_test: environment
  raft: randomized_nemesis_test: server
  raft: randomized_nemesis_test: delivery queue
  raft: randomized_nemesis_test: network
  raft: randomized_nemesis_test: heartbeat-based failure detector
  raft: randomized_nemesis_test: memory backed persistence
  raft: randomized_nemesis_test: rpc
  raft: randomized_nemesis_test: impure_state_machine
  raft: randomized_nemesis_test: introduce logical_timer
  raft: randomized_nemesis_test: `PureStateMachine` concept
2021-05-14 17:33:40 +02:00
Tomasz Grabiec
0fdd2f8217 Merge "raft: fsm cleanups" from Gleb
* scylla-dev/raft-cleanup-v1:
  raft: drop _leader_progress tracking from the tracker
  raft: move current_leader into the follower state
  raft: add some precondition checks
2021-05-14 17:24:59 +02:00
Kamil Braun
c21311ecca raft: randomized_nemesis_test: basic test
This is a simple test that serves as an example of using the
framework implemented in the previous commits. We introduce
`ExRegister`, an implementation of `PureStateMachine` that stores
an `int32_t` and handles ``exchange'' and ``read'' inputs;
an exchange replaces the state with the given value and returns
the previous state, a read does not modify the state and returns
the current state.  In order to pass the inputs to Raft we must
serialize them into commands so we implement instances of `ser::serializer`
for `ExReg`'s input types.
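The machine's behavior can be sketched directly (a minimal hypothetical rendering of the description above, with plain member functions in place of the concept's `delta`):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the ExRegister machine: the state is an
// int32_t; exchange replaces it and returns the old value, read
// returns it unchanged.
struct ExReg {
    int32_t state = 0;
    int32_t exchange(int32_t v) {
        int32_t prev = state;
        state = v;
        return prev;  // the output is the state before the transition
    }
    int32_t read() const {
        return state;  // a read does not modify the state
    }
};
```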
2021-05-14 15:11:01 +02:00
Kamil Braun
66b9bc6fe1 raft: randomized_nemesis_test: ticker
`ticker` calls the given function as fast as the Seastar reactor
allows and yields between each call. It may be provided a limit
for the number of calls; it crashes the test if the limit is reached
before the ticker is `abort()`ed.

The commit also introduces a `with_env_and_ticker` helper function which
creates an `environment`, a `ticker`, and passes references to them to
the given function. It destroys them after the function finishes
by calling `abort()`.
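The ticker's loop-with-limit behavior might look roughly like this (a hypothetical sketch; the Seastar reactor yield is elided and `run` is an invented name, with an `assert` standing in for crashing the test at the limit):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical sketch of the ticker: call the given function repeatedly
// until abort()ed; reaching the call limit before abort() fails the test.
class ticker {
    bool _aborted = false;
public:
    void abort() { _aborted = true; }
    uint64_t run(std::function<void()> f, uint64_t limit) {
        uint64_t calls = 0;
        while (!_aborted) {
            assert(calls < limit && "ticker limit reached before abort()");
            f();
            ++calls;
        }
        return calls;
    }
};
```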
2021-05-14 15:11:01 +02:00
Kamil Braun
c7cef58797 raft: randomized_nemesis_test: environment
`environment` represents a set of `raft_server`s connected by a `network`.

The `network` inside is initialized with a message delivery function
which notifies the destination server's failure detector on each message
and if the message contains an RPC payload, pushes it into the destination's
`delivery_queue`.

The environment needs to be periodically `tick()`ed, which ticks the
network and the underlying servers.

New servers can be created in the environment by calling `new_server`.
2021-05-14 15:11:01 +02:00
Kamil Braun
5095a4158e raft: randomized_nemesis_test: server
`raft_server` is a package that contains `raft::server` and other
facilities needed for the server to communicate with its environment:
the delivery queue, the set of snapshots (shared by
`impure_state_machine`, `rpc` and `persistence`) and references to the
`impure_state_machine` and `rpc` instances of this server.
2021-05-14 15:11:01 +02:00
Kamil Braun
f139fd4c28 raft: randomized_nemesis_test: delivery queue
The fact that `network` has delivered a message does not mean the
message was processed by the receiver. In fact, `network` assumes that
delivery is instantaneous, while processing a message may be a long,
complex computation, or even require IO. Thus, after a message is
delivered, something else must ensure that it is processed by the
destination server.

That something in our framework is `delivery_queue`. It will be the
bridge between `network` and `rpc`. While `network` is shared by all
servers - it represents the ``environment'' in which the servers live -
each server has its own private `delivery_queue`. When `network`
delivers an RPC message it will end up inside `delivery_queue`. A
separate fiber, `delivery_queue::receive_fiber()`, will process those
messages by calling `rpc::receive` (which is a potentially long
operation, thus returns a `future<>`) on the `rpc` of the destination
server.
2021-05-14 15:11:01 +02:00
Kamil Braun
2956f5f76c raft: randomized_nemesis_test: network
`network` is a simple priority queue of "events", where an event is a
message associated with delivery time. Each message contains a source,
a destination, and a payload. The queue uses a logical clock to decide
when to deliver messages; it delivers all messages whose associated
times are smaller than the current time.

The exact delivery method is unknown to `network` but passed as a
`deliver_t` function in the constructor. The type of payload is generic.
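The event-queue mechanics can be sketched as follows (a hypothetical reconstruction of the description above; `send` and the fixed-delay parameter are invented for illustration):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical sketch of the network: a min-priority queue of
// (delivery time, payload) events driven by a logical clock. tick()
// advances the clock and delivers every event that has come due.
template <typename Payload>
class network {
    struct event {
        uint64_t time;
        Payload payload;
        bool operator>(const event& o) const { return time > o.time; }
    };
    uint64_t _now = 0;
    std::priority_queue<event, std::vector<event>, std::greater<event>> _events;
    std::function<void(Payload)> _deliver;  // deliver_t, passed in the constructor
public:
    explicit network(std::function<void(Payload)> deliver)
        : _deliver(std::move(deliver)) {}
    void send(uint64_t delay, Payload p) {
        _events.push({_now + delay, std::move(p)});
    }
    void tick() {
        ++_now;
        while (!_events.empty() && _events.top().time <= _now) {
            _deliver(_events.top().payload);
            _events.pop();
        }
    }
};
```

Note that `_deliver` only hands the message over; as described below for `delivery_queue`, delivery is instantaneous while processing may take longer.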
2021-05-14 15:11:01 +02:00
Kamil Braun
3068a0aa70 raft: randomized_nemesis_test: heartbeat-based failure detector
In order to simulate a production environment as closely as possible, we
implement a failure detector which uses heartbeats for deciding whether
to convict a server as failed. We convict a server if we don't receive a
heartbeat for a long enough time.

Similarly to `rpc`, `failure_detector` assumes a message passing method
given by a `send_heartbeat_t` function through the constructor.

`failure_detector` uses the knowledge about existing servers to decide
who to send heartbeats to. Updating this knowledge happens through
`add_server` and `remove_server` functions.
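The conviction rule described above amounts to tracking each server's last heartbeat time against a threshold; a minimal sketch (hypothetical, with logical timestamps passed in explicitly and `is_alive` an invented query name):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical sketch of the heartbeat-based failure detector: remember
// the logical time of each server's last heartbeat and convict a server
// once the silence exceeds a threshold.
class failure_detector {
    uint64_t _convict_threshold;
    std::unordered_map<uint64_t, uint64_t> _last_heard;  // server ID -> time
public:
    explicit failure_detector(uint64_t threshold) : _convict_threshold(threshold) {}
    void add_server(uint64_t id, uint64_t now) { _last_heard[id] = now; }
    void remove_server(uint64_t id) { _last_heard.erase(id); }
    void receive_heartbeat(uint64_t id, uint64_t now) { _last_heard[id] = now; }
    bool is_alive(uint64_t id, uint64_t now) const {
        auto it = _last_heard.find(id);
        return it != _last_heard.end() && now - it->second <= _convict_threshold;
    }
};
```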
2021-05-14 15:11:01 +02:00
Kamil Braun
51df600478 raft: randomized_nemesis_test: memory backed persistence
`persistence` represents the data that does not get lost between server
crashes and restarts.

We store a log of commands in `_stored_entries`. It is invariably
``contiguous'', meaning that the index of each entry except the first is
equal to the index of the previous entry plus one at all times (i.e.
after each yield). We assume that the caller provides log entries
in strictly increasing index order and without gaps.

Additionally to storing log entries, `persistence` can be asked to store
or load a snapshot. To implement this it takes a reference to a set of snapshots
(`snapshots_t&`) which it will share with `impure_state_machine` and an
implementation of `rpc` coming in a later commit.  We ensure that the stored
log either ``touches'' the stored snapshot on the right side or intersects it.
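The contiguity invariant on `_stored_entries` can be sketched like so (a hypothetical reconstruction; truncating a conflicting suffix before appending is one way to keep the invariant, and we assume, as above, that callers never leave gaps):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the in-memory log: entries carry an index, and
// store() keeps the log contiguous by dropping any suffix that would
// conflict with the new entry before appending it.
struct log_entry {
    uint64_t idx;
    int cmd;
};

class persistence {
    std::vector<log_entry> _stored_entries;
public:
    void store(log_entry e) {
        // Drop entries at or beyond the new entry's index...
        while (!_stored_entries.empty() && _stored_entries.back().idx >= e.idx) {
            _stored_entries.pop_back();
        }
        _stored_entries.push_back(e);
        // ...so each entry's index stays the previous index plus one.
    }
    const std::vector<log_entry>& entries() const { return _stored_entries; }
};
```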
2021-05-14 15:11:01 +02:00
Kamil Braun
7a1f6e6d7b raft: randomized_nemesis_test: rpc
We implement the `raft::rpc` interface, allowing Raft servers to
communicate with other Raft servers.

The implementation is mostly boilerplate. It assumes that there exists a
method of message passing, given by a `send_message_t` function passed
in the constructor. It also handles the receipt of messages in the
`receive` function. It defines the message type (`message_t`) that will
be used by the message-passing method.

The actual message passing is implemented with `network` and `delivery_queue`
which are introduced in later commits.

The only slightly complex thing in `rpc` is the implementation of `send_snapshot`
which is the only function in the `raft::rpc` interface that actually
expects a response. To implement this, before sending the snapshot
message we allocate a promise-future pair and assign to it a unique ID;
we store the promise and the ID in a data structure. We then send the
snapshot together with the ID and wait on the future. The receive
function on the other side, when it gets the snapshot message,
applies the snapshot and sends back a snapshot reply message that contains
the same ID. When we receive a snapshot reply message we look up the ID in the
data structure and if we find a promise, we push the reply through that
promise.

`rpc` also keeps a reference to `snapshots_t` - it will refer to the
same set of snapshots as the `impure_state_machine` on the same server.
It accesses the set when it receives or sends a snapshot message.
2021-05-14 15:11:01 +02:00
Kamil Braun
905126acc3 raft: randomized_nemesis_test: impure_state_machine
To replicate a state machine, our Raft implementation requires it to
be represented with the `raft::state_machine` interface.

`impure_state_machine` is an implementation of `raft::state_machine`
that wraps a `PureStateMachine`. It keeps a variable of type `state_t`
representing the current state. In `apply` it deserializes the given
command into `input_t`, uses the transition (`delta`) function to
produce the next state and output, replaces its current state with the
obtained state and returns the output (more on that below); it does so
sequentially for every given command. We can think of `PureStateMachine`
as the actual state machine - the business logic, and
`impure_state_machine` as the ``boilerplate'' that allows the pure machine
to be replicated by Raft and communicate with the external world.

The interface also requires maintenance of snapshots. We introduce the
`snapshots_t` type representing a set of snapshots known by a state
machine. `impure_state_machine` keeps a reference to `snapshots_t`
because it will share it with an implementation of `raft::persistence`
coming with a later commit.

Returning outputs is a bit tricky because apply is ``write-only'' - it
returns `future<>`. We use the following technique:

1. Before sending a command to a Raft leader through `server::add_entry`,
   one must first directly contact the instance of `impure_state_machine`
   replicated by the leader, asking it to allocate an ``output channel''.
2. On such a request, `impure_state_machine` creates a channel
   (represented by a promise-future pair) and a unique ID; it stores the
   input side of the channel (the promise) with this ID internally and returns
   the ID and the output side of the channel (the future) to the requester.
3. After obtaining the ID, one serializes the ID together with the input
   and sends it as a command to Raft. Thus commands are (ID, machine input)
   pairs.
4. When `impure_state_machine` applies a command, it looks for a promise
   with the given ID. If it finds one, it sends the output through this
   channel.
5. The command sender waits for the output on the obtained future.

The allocation and deallocation of channels is done using the
`impure_state_machine::with_output_channel` function. The `call`
function is an implementation of the above technique.

Note that only the leader will attempt to send the output - other
replicas won't find the ID in their internal data structure. The set of
IDs and channels is not a part of the replicated state.

A failure may cause the output to never arrive (or even the command to
never be applied), so `call` waits for a limited time. The caller may
also mistakenly `call` a server which is not currently the leader, but
`call` is prepared to handle this error.
2021-05-14 15:11:01 +02:00
Kamil Braun
3e02befccd raft: randomized_nemesis_test: introduce logical_timer
This is a wrapper around `raft::logical_clock` that allows scheduling
events to happen after a certain number of logical clock ticks.
For example, `logical_timer::sleep(20_t)` returns a future that resolves
after 20 calls to `logical_timer::tick()`.
2021-05-13 11:34:00 +02:00
Kamil Braun
15e3bd2620 raft: randomized_nemesis_test: PureStateMachine concept
The commit introduces `PureStateMachine`, which is the most direct translation
of the mathematical definition of a state machine to C++ that I could come up with.
Represented by a C++ concept, it consists of: a set of inputs
(represented by the `input_t` type), outputs (`output_t` type), states (`state_t`),
an initial state (`init`) and a transition function (`delta`) which
given a state and an input returns a new state and an output.

The rest of the testing infrastructure is going to be
generic w.r.t. `PureStateMachine`. This will allow easily implementing
tests using both simple and complex state machines by substituting the
proper definition for this concept.

One possibility of modifying this definition would be to have `delta`
return `future<pair<state_t, output_t>>` instead of
`pair<state_t, output_t>`. This would lose some ``purity'' but allow
long computations without reactor stalls in the tests. Such modification,
if we decide to do it, is trivial.
2021-05-13 11:34:00 +02:00
Alejo Sanchez
68f69671b5 raft: style: test optionals directly
Avoid using has_value() and test the optional directly

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210512142018.297203-2-alejo.sanchez@scylladb.com>
2021-05-12 20:39:52 +02:00
Piotr Wojtczak
e6254acfd3 boost/tests: Add virtual_table_test for basic infrastructure 2021-05-12 17:05:35 +02:00
Piotr Wojtczak
8825ae128d boost/tests: Test memtable_filling_virtual_table as mutation_source
Uses the infrastructure for testing mutation_sources, but only a
subset of it which does not do fast forwarding (since virtual_table
does not support it).
2021-05-12 17:05:35 +02:00
Tomasz Grabiec
a9dd7a295d tests: perf_row_cache_reads: Add scenario for lots of rows covered by a range tombstone
Reproduces #8626.

Output:

    test_scan_with_range_delete_over_rows
    Populating with rows
    Rows: 702710
    Scanning...
    read: 540.007324 [ms], preemption: {count: 2356, 99%: 1.131752 [ms], max: 1.148589 [ms]}, cache: 251/252 [MB]
    read: 651.942688 [ms], preemption: {count: 1176, 99%: 1.131752 [ms], max: 1.009652 [ms]}, cache: 251/252 [MB]
2021-05-12 11:58:36 +02:00
Nadav Har'El
cee4c075d2 Merge 'Fix index name conflicts with regular tables' from Piotr Sarna
When an index is created without an explicit name, a default name
is chosen. However, there was no check whether a table with a conflicting
name already exists. The check is now in place and if any conflicts
are found, a new index name is chosen instead.
When an index is created *with* an explicit name and a conflicting
regular table is found, index creation should simply fail.

This series comes with a test.

Fixes #8620
Tests: unit(release)

Closes #8632

* github.com:scylladb/scylla:
  cql-pytest: add regression tests for index creation
  cql3: fail to create an index if there is a name conflict
  database: check for conflicting table names for indexes
2021-05-11 18:40:15 +03:00
Benny Halevy
9ba960a388 utils: phased_barrier::operation do not leak gate entry when reassigned
utils::phased_barrier holds a `lw_shared_ptr<gate>` that is
typically `enter()`ed in `phased_barrier::start()`,
and left when the operation is destroyed in `~operation`.

Currently, the operation move-assign implementation is the
default one that just moves the lw_shared gate ptr from the
other operation into this one, without calling `_gate->leave()` first.

This change makes move-assignment first destroy *this (unless
self-assigning), calling _gate->leave() if engaged, before taking over
the other operation's _gate.

A unit test that reproduces the issue before this change
and passes with the fix was added to serialized_action_test.

Fixes #8613

Test: unit(dev), serialized_action_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210510120703.1520328-1-bhalevy@scylladb.com>
2021-05-11 18:39:10 +03:00
Avi Kivity
1d8234f52d Merge "reader_concurrency_semaphore: improve diagnostics printout" from Botond
"
The current printout has multiple problems:
* It is segregated by state, each state having its own sorting criteria;
* The number of permits and count resources are collapsed into a single
  column, making it unclear which one is printed;
* The number of available/initial units of the semaphore is not printed.

This series solves all these problems:
* It merges all states into a single table, sorted by memory
  consumption, in descending order.
* It separates number of permits and count resources into separate
  columns.
* Prints a summary of the semaphore units.
* Provides a cap on the maximum number of printable lines, to not blow
  up the logs.

The goal of all this is to make it easy to find the culprit of a semaphore
problem: easily spot the big memory consumers, then unpack the name
column to determine which table and code path is responsible.
This brings the printout close to the recently added `scylla reads`
scylla-gdb.py command, providing a uniform report format across the two
tools.
Example report:
INFO  2021-05-07 09:52:16,806 [shard 0] testlog - With max-lines=4: Semaphore reader_concurrency_semaphore_dump_reader_diganostics with 8/2147483647 count and 263599186/9223372036854775807 memory resources: user request, dumping permit diagnostics:
permits count   memory  table/description/state
7       2       77M     ks.tbl1/op1/active
6       3       59M     ks.tbl1/op0/active
4       0       36M     ks.tbl1/op2/active
3       1       36M     ks.tbl0/op2/active
11      2       43M     permits omitted for brevity

31      8       251M    total
"

* 'reader-concurrency-semaphore-dump-improvement/v1' of https://github.com/denesb/scylla:
  test: reader_concurrency_test: add reader_concurrency_semaphore_dump_reader_diganostics
  reader_concurrency_semaphore: dump_reader_diagnostics(): print more information in the header
  reader_concurrency_semaphore: dump_reader_diagnostics(): cap number of printed lines
  reader_concurrency_semaphore: dump_reader_diagnostics(): sort lines in descending order
  reader_concurrency_semaphore: dump_reader_diagnostics(): merge all states into a single table
  reader_concurrency_semaphore: dump_reader_diagnostics(): separate number of permits and count resources
2021-05-11 18:39:10 +03:00
Nadav Har'El
af485f5226 secondary index: fix index name in IndexInfo system table
In commit 3e39985c7a we added the Cassandra-compatible system table
system."IndexInfo" (note the capitalized table name) which lists built
indexes. Because we already had a table of built materialized views, and
indexes are implemented as materialized views, the index list was
implemented as a virtual table based on the view list.

However, the *name* of each materialized view listed in the list of
views looks like something_index, with the suffix "_index", while the
name of the index we need to print is "something". We forgot to do this
transformation in the virtual table - and this is what this patch does.

This bug can confuse applications which use this system table to wait for
an index to be built. Several tests translated from Cassandra's unit
tests, in cassandra_tests/validation/entities/secondary_index_test.py fail
in wait_for_index() because of this incompatibility, and pass after this
patch.

This patch also changes the unit test that enshrined the previous, wrong,
behavior, to test for the correct behavior. This problem is typical of
C++ unit tests which cannot be run against Cassandra.

Fixes #8600

Unfortunately, although this patch fixes "typical" applications (including
all tests which I tried) - applications which read from IndexInfo in a
"typical" method to look for a specific index being ready, the
implementation is technically NOT correct: The problem is that index
names are not sorted in the right order, because they are sorted with
the "_index" prefix.
To give an example, the index names "a" should be listed before "a1", but
the view names "a1_index" comes before "a_index" (because in ASCII, 1
comes before underscore). I can't think of any way to fix this bug
without completely reimplementing IndexInfo in a different way - probably
based on a temporary memtable (which is fine as this is not a
performance-critical operation). We'll need to do this rewrite eventually,
and I'll open a new issue.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210509140113.1084497-1-nyh@scylladb.com>
2021-05-11 18:39:10 +03:00
Avi Kivity
61c7f874cc Merge 'Add per-service-level timeouts' from Piotr Sarna
Ref: #7617

This series adds timeout parameters to service levels.

Per-service-level timeouts can be set up in the form of service level parameters, which can in turn be attached to roles. Setting up and modifying role-specific timeouts can be achieved like this:
```cql
CREATE SERVICE LEVEL sl2 WITH read_timeout = 500ms AND write_timeout = 200ms AND cas_timeout = 2s;
ATTACH SERVICE LEVEL sl2 TO cassandra;
ALTER SERVICE LEVEL sl2 WITH write_timeout = null;
```
Per-service-level timeouts take precedence over default timeout values from scylla.yaml, but can still be overridden for a specific query by per-query timeouts (e.g. `SELECT * from t USING TIMEOUT 50ms`).

Closes #7913

* github.com:scylladb/scylla:
  docs: add a paragraph describing service level timeouts
  test: add per-service-level timeout tests
  test: add refreshing client state
  transport: add updating per-service-level params
  client_state: allow updating per service level params
  qos: allow returning combined service level options
  qos: add a way of merging service level options
  cql3: add preserving default values for per-sl timeouts
  qos: make getting service level public
  qos: make finding service level public
  treewide: remove service level controller from query state
  treewide: propagate service level to client state
  sstables: disambiguate boost::find
  cql3: add a timeout column to LIST SERVICE LEVEL statement
  db: add extracting service level info via CQL
  types: add a missing translation for cql_duration
  cql3: allow unsetting service level timeouts
  cql3: add validating service level timeout values
  db: add setting service level params via system_distributed
  cql3: add fetching service level attrs in ALTER and CREATE
  cql3: add timeout to service level params
  qos: add timeout to service level info
  db,sys_dist_ks: add timeout to the service level table
  migration_manager: allow table updates with timestamp
  cql3: allow a null keyword for CQL properties
2021-05-11 18:39:10 +03:00
Michael Livshin
ff7d781988 test: enable scylla-gdb/run
It should pass now.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-11 18:39:10 +03:00
Michael Livshin
73f9f08df6 test: add a basic test for scylla-gdb.py
(And disable it initially, because it won't pass without subsequent
commits)

Runs only in release mode, to keep things more realistic.

Doesn't exercise Scylla much at present -- just stops it after several
compactions and tries (almost) all "scylla *" commands in order.

Refs #6952.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-11 18:39:10 +03:00
Michael Livshin
3bff94cd29 test.py: refine test mode control
* Add ability to skip tests in individual modes using "skip_in_<mode>".

* Add ability to allow tests in specific modes using "run_in_<mode>".

* Rename "skip_in_debug_mode" to "skip_in_debug_modes", because there
  is an actual mode named "debug" and this is confusing.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-11 18:39:10 +03:00
Avi Kivity
b1f9df279a Merge "Untie cdc, storage service and migration notifier knot" from Pavel E
"
Storage service needs migration notifier reference to pass it to cdc
service via get_local_storage_service(). This set removes

- get_local_storage_service from cdc
- migration notifier from storage service
- db_context::builder from cdc (released nuclear binding energy)

tests: unit(dev)
"

* 'br-cdc-no-storage-service' of https://github.com/xemul/scylla:
  storage_service: Remove migration notifier dependency
  cdc: Remove db_context::builder
  cdc: Provide migration notifier right at once
  cdc: Remove db_context::builder::with_migration_notifier
2021-05-11 18:39:10 +03:00
Piotr Sarna
1cb804f024 cql-pytest: add regression tests for index creation
This commit adds unit tests for an issue with index creation
after a table with a conflicting name has previously been created.
The cases cover both indexes with a default name and ones with
an explicit name set.
2021-05-11 17:34:37 +02:00
Botond Dénes
69d04d161e test: reader_concurrency_test: add reader_concurrency_semaphore_dump_reader_diganostics
Not really testing anything, at least not automatically. It just
provides coverage for the diagnostics dump code, as well as allows for
developers to inspect the printout visually when making changes.
2021-05-10 18:06:30 +03:00
Piotr Sarna
570c63d39b test: add per-service-level timeout tests
The test suite checks that per-service-level timeouts
work and that their input is validated.
2021-05-10 12:39:41 +02:00
Piotr Sarna
43f1f9e445 test: add refreshing client state
With a helper client state refresher, some attributes that are
usually only refreshed after a client disconnects and then
reconnects can be verified in the test suite.
2021-05-10 12:39:41 +02:00
Piotr Sarna
e257ec11c0 treewide: remove service level controller from query state
... since it's accessible through its member, client state.
2021-05-10 11:48:14 +02:00
Piotr Sarna
d1f2e8b469 treewide: propagate service level to client state
... since it's going to be used to set up per-service-level
timeouts.
2021-05-10 11:48:14 +02:00
Piotr Sarna
e8d271fea9 db: add extracting service level info via CQL 2021-05-10 11:45:09 +02:00
Piotr Sarna
7e6beabf27 migration_manager: allow table updates with timestamp
In order to avoid needless schema disagreements, a way of announcing
a schema change with a fixed timestamp is added.
That way, when nodes update schemas of their internal tables (e.g.
during updates), it's possible for all nodes to use an identical
timestamp for this operation, which in turn makes their digests
identical.
2021-05-10 10:10:38 +02:00
Gleb Natapov
78c5a72b32 raft: drop _leader_progress tracking from the tracker
The tracker maintains a separate pointer to current leader progress,
but all this complexity is not needed because the tracker already has a
find() function that can either find a leader's progress by id or return
null. Removing the tracking simplifies the code and makes going out of sync
(which is always a possibility when state is maintained in two different
places) impossible.
2021-05-09 13:55:55 +03:00