Commit Graph

1697 Commits

Author SHA1 Message Date
Avi Kivity
d3e5b37059 Revert "Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed' from Calle Wilund"
This reverts commit e9c940dbbc, reversing
changes made to 6144656b25. Since it was
merged commitlog_test consistently times out in debug mode.
2021-05-27 21:16:26 +03:00
Wojciech Mitros
725c6aac81 test/perf: close test_env to pass an assert in sstables_manager destructor
When destroying an perf_sstable_test_env, an assert in sstables_manager
destructor fails, because it hasn't been closed.
Fix by removing all references to sstables from perf_sstable_test_env,
and then closing the test_env(as well as the sstables_manager)

Fixes #8736

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #8737
2021-05-27 17:41:17 +03:00
Michał Chojnowski
5e9f741bb4 repair: remove range_split.hh
Dead code since 80ebedd242.

Closes #8698
2021-05-27 17:21:37 +03:00
Avi Kivity
5f8484897b Merge 'cdc: use a new internal table for exchanging generations' from Kamil Braun
Reopening #8286 since the token metadata fix that allows `Everywhere` strategy tables to work with RBO (#8536) has been merged.

---
Currently when a node wants to create and broadcast a new CDC generation
it performs the following steps:
1. choose the generation's stream IDs and mapping (how this is done is
   irrelevant for the current discussion)
2. choose the generation's timestamp by taking the current time
   (according to its local clock) and adding 2 * ring_delay
3. insert the generation's data (mapping and stream IDs) into
   system_distributed.cdc_generation_descriptions, using the
   generation's timestamp as the partition key (we call this table
   the "old internal table" below)
4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP"
   application state.

The timestamp spreads epidemically through the gossip protocol. When
nodes see the timestamp, they retrieve the generation data from the
old internal table.

Unfortunately, due to the schema of the old internal table, where
the entire generation data is stored in a single cell, step 3 may fail for
sufficiently large generations (there is a size threshold for which step
3 will always fail - retrying the operation won't help). Also the old
internal table lies in the system_distributed keyspace that uses
SimpleStrategy with replication factor 3, which is also problematic; for
example, when nodes restart, they must reach at least 2 out of these 3
specific replicas in order to retrieve the current generation (we write
and read the generation data with QUORUM, unless we're a single-node
cluster, where we use ONE). Until this happens, a restarting
node can't coordinate writes to CDC-enabled tables. It would be better
if the node could access the last known generation locally.

The commit introduces a new table for broadcasting generation data with
the following properties:
-  it uses a better schema that stores the data in multiple rows, each
   of manageable size
-  it resides in a new keyspace that uses EverywhereStrategy so the
   data will be written to every node in the cluster that has a token in
   the token ring
-  the data will be written using CL=ALL and read using CL=ONE; thanks
   to this, restarting node won't have to communicate with other nodes
   to retrieve the data of the last known generation. Note that writing
   with CL=ALL does not reduce availability: creating a new generation
   *requires* all nodes to be available anyway, because they must learn
   about the generation before their clocks go past the generation's
   timestamp; if they don't, partitions won't be mapped to stream IDs
   consistently across the cluster
-  the partition key is no longer the generation's timestamp. Because it
   was that way in the old internal table, it forced the algorithm to
   choose the timestamp *before* the generation data was inserted into
   the table. What if the inserting took a long time? It increased the
   chance that nodes would learn about the generation too late (after
   their clocks moved past its timestamp). With the new schema we will
   first insert the generation data using a randomly generated UUID as
   the partition key, *then* choose the timestamp, then gossip both the
   timestamp and the UUID.
   Observe that after a node learns about a generation broadcasted using
   this new method through gossip it will retrieve its data very quickly
   since it's one of the replicas and it can use CL=ONE as it was
   written using CL=ALL.

The generation's timestamp and the UUID mentioned in the last point form
a "generation identifier" for this new generation. For passing these new
identifiers around, we introduce the cdc::generation_id_v2 type.

Fixes #7961.

---

For optimal review experience it is best to first read the updated design notes (you can read them rendered here: https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md), specifically the ["Generation switching"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#generation-switching) section followed by the ["Internal generation descriptions table V1 and upgrade procedure"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#internal-generation-descriptions-table-v1-and-upgrade-procedure) section, then read the commits in topological order.

dtest gating run (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/1160/
unit tests (dev) passed locally

Closes #8643

* github.com:scylladb/scylla:
  docs: update cdc.md with info about the new internal table
  sys_dist_ks: don't create old CDC generations table on service initialization
  sys_dist_ks: rename all_tables() to ensured_tables()
  cdc: when creating new generations, use format v2 if possible
  main: pass feature_service to cdc::generation_service
  gms: introduce CDC_GENERATIONS_V2 feature
  cdc: introduce retrieve_generation_data
  test: cdc: include new generations table in permissions test
  sys_dist_ks: increase timeout for create_cdc_desc
  sys_dist_ks: new table for exchanging CDC generations
  tree-wide: introduce cdc::generation_id_v2
2021-05-27 17:13:44 +03:00
Avi Kivity
e8e4456ec7 Merge 'Introduce per-service-level workload types and their first use-case - shedding in interactive workloads' from Piotr Sarna
This draft extends and obsoletes #8123 by introducing a way of determining the workload type from service level parameters, and then using this context to qualify requests for shedding.

The rough idea is that when the admission queue in the CQL server is hit, it might make more sense to start shedding surplus requests instead of accumulating them on the semaphore. The assumption that interactive workloads are more interested in the success rate of as many requests as possible, and hanging on a semaphore reduces the chances for a request to succeed. Thus, it may make sense to shed some requests to reduce the load on this coordinator and let the existing requests to finish.

It's a draft, because I only performed local guided tests. #8123 was followed by some experiments on a multinode cluster which I want to rerun first.

Closes #8680

* github.com:scylladb/scylla:
  test: add a case for conflicting workload types
  cql-pytest: add basic tests for service level workload types
  docs: describe workload types for service levels
  sys_dist_ks: fix redundant parsing in get_service_level
  sys_dist_ks: make get_service_level exception-safe
  transport: start shedding requests during potential overload
  client_state: hook workload type from service levels
  cql3: add listing service level workload type
  cql3: add persisting service level workload type
  qos: add workload_type service level parameter
2021-05-27 17:01:56 +03:00
Konstantin Osipov
52f7ff4ee4 raft: (testing) update copyright
An incorrect copyright information was copy-pasted
from another test file.

Message-Id: <20210525183919.1395607-1-kostja@scylladb.com>
2021-05-27 15:47:49 +03:00
Piotr Sarna
99f356d764 test: add a case for conflicting workload types
The test case verifies that if several workload types are effective
for a single role, the conflict resolution is well defined.
2021-05-27 14:31:36 +02:00
Piotr Sarna
01b7e445f9 cql-pytest: add basic tests for service level workload types
The test cases check whether it's possible to declare workload
type for a service level and if its input is validated.
2021-05-27 14:31:36 +02:00
Pavel Emelyanov
d2442a1bb3 tests: Ditch storage_service_for_tests
The purpose of the class in question is to start sharded storage
service to make its global instance alive. I don't know when exactly
it happened but no code that instantiates this wrapper really needs
the global storage service.

Ref: #2795
tests: unit(dev), perf_sstable(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210526170454.15795-1-xemul@scylladb.com>
2021-05-27 14:39:13 +03:00
Piotr Sarna
762e2f48f2 cql3: add listing service level workload type
The workload type information is now presented in the output
of LIST SERVICE LEVEL and LIST ALL SERVICE LEVELS statements.
2021-05-27 13:02:22 +02:00
Nadav Har'El
97e827e3e1 secondary index: fix regression in CREATE INDEX IF NOT EXISTS
The recent commit 0ef0a4c78d added helpful
error messages in case an index cannot be created because the intended
name of its materialized view is already taken - but accidentally broke
the "CREATE INDEX IF NOT EXISTS" feature.

The checking code was correct, but in the wrong place: we need to first
check maybe the index already exists and "IF NOT EXISTS" was chosen -
and only do this new error checking if this is not the case.

This patch also includes a cql-pytest test for reproducing this bug.
The bug is also reproduced by the translated Cassandra unit tests
    cassandra_tests/validation/entities/secondary_index_test.py::
    testCreateAndDropIndex
and this is how I found this bug. After these patch, all these tests
pass.

Fixes #8717.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210526143635.624398-1-nyh@scylladb.com>
2021-05-27 09:10:41 +02:00
Avi Kivity
e2e723cc4c build: enable -Wrange-loop-construct warning
This warning triggers when a range for ("for (auto x : range)") causes
non-trivial copies, prompting the developer to replace with a capture
by reference. A few minor violations in the test suite are corrected.

Closes #8699
2021-05-26 10:32:56 +03:00
Avi Kivity
e9c940dbbc Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed' from Calle Wilund
Fixes #8270

If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below threshold, yet still get a disk _footprint_ that is over limit causing new segment allocation to stall.

We need to take a few things into account:
1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here.
2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task.
3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider start flushing once we've used up space enough to be in the last available segment, so a new one is hopefully available by the time we hit the limit.

Also fix edge case (for tests), when we have too few segment to have an active one (i.e. need flush everything).

Closes #8695

* github.com:scylladb/scylla:
  commitlog_test: Add test case for usage/disk size threshold mismatch
  commitlog: Flush all segments if we only have one.
  commitlog: Always force flush if segment allocation is waiting
  commitlog: Include segment wasted (slack) size in footprint check
  commitlog: Adjust (lower) usage threshold
2021-05-25 18:34:29 +03:00
Kamil Braun
c948573398 sys_dist_ks: don't create old CDC generations table on service initialization
The old table won't be created in clusters that are bootstrapped after
this commit. It will stay in clusters that were upgraded from a version
before this commit.

Note that a fully upgraded cluster doesn't automatically create a new
generation in the new format. Even if the last generation was created
before the upgrade, the cluster will keep using it.
A new generation will be created in the new format when either:
1. a new node bootstraps (in the new version),
2. or the user runs checkAndRepairCdcStreams, which has a new check: if
   the current generation uses the old format, the command will decide
   that repair is needed, even if the generation is completely fine
   otherwise (also in the new version).

During upgrade, while the CDC_GENERATIONS_V2 feature is still not
enabled, the user may still bootstrap a node in the old version of
Scylla or run checkAndRepairCdcStreams on a not-yet-upgraded node. In
that case a new generation will be created in the old format,
using the old table definitions.
2021-05-25 16:07:23 +02:00
Kamil Braun
4d3870b24b main: pass feature_service to cdc::generation_service 2021-05-25 16:07:23 +02:00
Kamil Braun
f25e77c202 test: cdc: include new generations table in permissions test 2021-05-25 16:07:23 +02:00
Calle Wilund
a96433c684 commitlog_test: Add test case for usage/disk size threshold mismatch
Refs #8270

Tries to simulate case where we mismatch segments usage with actual
disk footprint and fail to flush enough to allow segment recycling
2021-05-25 12:43:12 +00:00
Avi Kivity
e391e4a398 test: serialized_action_test: prevent false-positive timeout in test_phased_barrier_reassignment
test_phased_barrier_reassignment has a timeout to prevent the test from
hanging on failure, but it occastionally triggers in debug mode since
the timeout is quite low (1ms). Increase the timeout to prevent false
positives. Since the timeout only expires if the test fails, it will
have no impact on execution time.

Ref #8613

Closes #8692
2021-05-25 11:20:18 +02:00
Raphael S. Carvalho
ee39eb9042 sstables: Fix slow off-strategy compaction on STCS tables
Off-strategy compaction on a table using STCS is slow because of
the needless write amplification of 2. That's because STCS reshape
isn't taking advantage of the fact that sstables produced by
a repair-based operation are disjoint. So the ~256 input sstables
were compacted (in batches of 32) into larger sstables, which in
turn were compacted into even larger ones. That write amp is very
significant on large data sets, making the whole operation 2x
slower.

Fixes #8449.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210524213426.196407-1-raphaelsc@scylladb.com>
2021-05-25 11:24:42 +03:00
Piotr Sarna
95c6ec1528 Merge 'test/cql-pytest: clean up tests to run on Cassandra' from Nadav Har'El
To keep our cql-pytest tests "correct", we should strive for them to pass on
Cassandra - unless they are testing a Scylla-only feature or a deliberate
difference between Scylla and Cassandra - in which case they should be marked
"scylla-only" and cause such tests to be skipped when running on Cassandra.

The following few small patches fix a few cases where our tests we failing on
Cassandra. In one case this even found a bug in the test (a trivial Python
mistake, but still).

Closes #8694

* github.com:scylladb/scylla:
  test/cql-pytest: fix python mistake in an xfailing test
  test/cql-pytest: mark some tests with scylla-only
  test/cql-pytest: clean up test_create_large_static_cells_and_rows
2021-05-24 16:42:01 +02:00
Nadav Har'El
edc2c65552 Merge 'Fix service level negative timeouts' from Piotr Sarna
This series fixes a minor validation issue with service level timeouts - negative values were not checked. This bug is benign because negative timeouts act just like a 0s timeout, but the original series claimed to validate against negative values, so it's hereby fixed.
More importantly however, this series follows by enabling cql-pytest to run service level tests and provides a first batch of them, including a missing test case for negative timeouts.
The idea is similar to what we already have in alternator test suite - authentication is unconditionally enabled, which doesn't affect any existing tests, but at the same time allows writing test cases which rely on authentication - e.g. service levels.

Closes #8645

* github.com:scylladb/scylla:
  cql-pytest: introduce service level test suite
  cql-pytest: add enabling authentication by default
  qos: fix validating service level timeouts for negative values
2021-05-24 16:30:13 +03:00
Tomasz Grabiec
b1821c773f Merge "raft: basic RPC module testing" from Pavel Solodovnikov
Now RPC module has some basic testing coverage to
make sure RPC configuration is updated appropriately
on configuration changes (i.e. `add_server` and
`remove_server` are called when appropriate).

The test suite currenty consists of the following
test-cases:
 * Loading server instance with configuration from a snapshot.
 * Loading server instance with configuration from a log.
 * Configuration changes (remove + add node).
 * Leader elections don't lead to RPC configuration changes.
 * Voter <-> learner node transitions also don't change RPC
   configuration.
 * Reverting uncommitted configuration changes updates
   RPC configuration accordingly (two cases: revert to
   snapshot config or committed state from the log).

A few more refactorings are made along the way to be
able to reuse some existing functions from
`replication_test` in `rpc_test` implementation.

Please note, though, that there are still some functions
that are borrowed from `replication_test` but not yet
extracted to common helpers.

This is mostly because RPC tests doesn't need all
the complexity that `replication_test` has, thus,
some helpers are copied in a reduced form.

It would take some effort to refactor these bits to
fit both `replication_test` and `rpc_test` without
sacrificing convenience.
This will probably be addressed in another series later.

* manmanson/raft-rpc-tests-v9-alt3:
  raft: add tests for RPC module
  test: add CHECK_EVENTUALLY_EQUAL utility macro
  raft: replication_test: reset test rpc network between test runs
  raft: replication_test: extract tickers initialization into a separate func
  raft: replication_test: support passing custom `apply_fn` to `change_configuration()`
  raft: replication_test: introduce `test_server` aggregate struct
  raft: replication_test: support voter<->learner configuration changes
  raft: remove duplicate `create_command` function from `replication_test`
  raft: avoid 'using' statements in raft testing helpers header
2021-05-24 14:44:37 +02:00
Avi Kivity
50f3bbc359 Merge "treewide: various header cleanups" from Pavel S
"
The patch set is an assorted collection of header cleanups, e.g:
* Reduce number of boost includes in header files
* Switch to forward declarations in some places

A quick measurement was performed to see if these changes
provide any improvement in build times (ccache cleaned and
existing build products wiped out).

The results are posted below (`/usr/bin/time -v ninja dev-build`)
for 24 cores/48 threads CPU setup (AMD Threadripper 2970WX).

Before:

	Command being timed: "ninja dev-build"
	User time (seconds): 28262.47
	System time (seconds): 824.85
	Percent of CPU this job got: 3979%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 2129888
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1402838
	Minor (reclaiming a frame) page faults: 124265412
	Voluntary context switches: 1879279
	Involuntary context switches: 1159999
	Swaps: 0
	File system inputs: 0
	File system outputs: 11806272
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

After:

	Command being timed: "ninja dev-build"
	User time (seconds): 26270.81
	System time (seconds): 767.01
	Percent of CPU this job got: 3905%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 2117608
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1400189
	Minor (reclaiming a frame) page faults: 117570335
	Voluntary context switches: 1870631
	Involuntary context switches: 1154535
	Swaps: 0
	File system inputs: 0
	File system outputs: 11777280
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

The observed improvement is about 5% of total wall clock time
for `dev-build` target.

Also, all commits make sure that headers stay self-sufficient,
which would help to further improve the situation in the future.
"

* 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla:
  transport: remove extraneous `qos/service_level_controller` includes from headers
  treewide: remove evidently unneded storage_proxy includes from some places
  service_level_controller: remove extraneous `service/storage_service.hh` include
  sstables/writer: remove extraneous `service/storage_service.hh` include
  treewide: remove extraneous database.hh includes from headers
  treewide: reduce boost headers usage in scylla header files
  cql3: remove extraneous includes from some headers
  cql3: various forward declaration cleanups
  utils: add missing <limits> header in `extremum_tracking.hh`
2021-05-24 14:24:20 +03:00
Nadav Har'El
5206665b15 test/cql-pytest: fix python mistake in an xfailing test
The xfailing test cassandra_tests/validation/entities/collections_test.py::
testSelectionOfEmptyCollections had a Python mistake (using {} instead
of set() for an empty set), which resulted in its failure when run
against Cassandra. After this patch it passes on Cassandra and fails on
Scylla - as expected (this is why it is marked xfail).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-05-24 13:14:54 +03:00
Nadav Har'El
f26b31e950 test/cql-pytest: mark some tests with scylla-only
Tests which are known to test a Scylla-only feature (such as CDC)
or to rely on a known and difference between Scylla and Cassandra
should be marked "scylla-only", so they are skipped when running
the tests against Cassandra (test/cql-pytest/run-cassandra) instead
of reporting errors.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-05-24 13:03:48 +03:00
Nadav Har'El
c8117584e3 test/cql-pytest: clean up test_create_large_static_cells_and_rows
The test test_create_large_static_cells_and_rows had its own
implementation of "nodetool flush" using Scylla's REST API.
Now that we have a nodetool.flush() function for general use in
cql-pytest, let's use it and save a bit of duplication.

Another benefit is that now this test can be run (and pass) against
Cassandra.

To allow this test to run on Cassandra, I had to remove a
"USING TIMEOUT" which wasn't necessary for this test, and is
not a feature supported by Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-05-24 12:31:51 +03:00
Asias He
425e3b1182 gossip: Introduce direct failure detector
Currently, gossip uses the updates of the gossip heartbeat from gossip
messages to decide if a node is up or down. This means if a node is
actually down but the gossip messages are delayed in the network, the
marking of node down can be delayed.

For example, a node sends 20 gossip messages in 20 seconds before it
is dead. Each message is delayed 15 seconds by the network for some
reason. A node receives those delayed messages one after another.
Those delayed messages will prevent this node from being marked as down.
Because heartbeat update is received just before the threshold to mark a
node down is triggered which is around 20 seconds by default.

As a result, this node will not be marked as down in 20 * 15 seconds =
300 seconds, much longer than the ~20 seconds node down detection time
in normal cases.

In this patch, a new failure detector is implemented.

- Direct detection

The existing failure detector can get gossip heartbeat updates
indirectly.  For example:

Node A can talk to Node B
Node B can talk to Node C
Node A can not talk to Node C, due to network issues

Node A will not mark Node B to be down because Node A can get heart beat
of Node C from node B indirectly.

This indirect detection is not very useful because when Node A decides
if it should send requests to Node C, the requests from Node A to C will
fail while Node A thinks it can communicate with Node C.

This patch changes the failure detection to be direct. It uses the
existing gossip echo message to detect directly. Gossip echo messages
will be sent to peer nodes periodically. A peer node will be marked as
down if a timeout threshold has been meet.

Since the failure detection is peer to peer, it avoids the delayed
message issue mentioned above.

- Parallel detection

The old failure detector uses shard zero only. This new failure detector
utilizes all the shards to perform the failure detection, each shard
handling a subset of live nodes. For example, if the cluster has 32
nodes and each node has 16 shards, each shard will handle only 2 nodes.
With a 16 nodes cluster, each node has 16 shards, each shard will handle
only one peer node.

A gossip message will be sent to peer nodes every 2 seconds. The extra
echo messages traffic produced compared to the old failure detector is
negligible.

- Deterministic detection

Users can configure the failure_detector_timeout_in_ms to set the
threshold to mark a node down. It is the maximum time between two
successful echo message before gossip marks a node down. It is easier to
understand than the old phi_convict_threshold.

- Compatible

This patch only uses the existing gossip echo message. Nodes with or without
this patch can work together.

Fixes #8488

Closes #8036
2021-05-24 10:47:06 +03:00
Piotr Sarna
890ed201fd Merge 'Enable -Wunused-private-field warning' from Avi Kivity
The -Wunused-private-field was squelched when we switched to
clang to make the change easier. But it is a useful warning, so
re-enable it.

It found a serious bug (#8682) and a few minor instances of waste.

Closes #8683

* github.com:scylladb/scylla:
  build: enable -Wunused-private-field warning
  test: drop unused fields
  table: drop unused field database_sstable_write_monitor::_compaction_manager
  streaming: drop unused fields
  sstables: mx reader: drop unused _column_value_length field
  sstables: index_consumer: drop unused max_quantity field
  compaction: resharding_compaction: drop unused _shard field
  compaction: compaction_read_monitor: drop unused _compaction_manager field
  raft: raft_services: drop unused _gossiper field
  repair: drop unused _nr_peer_nodes field
  redis: drop unused fields _storage_proxy and _requests_blocked_memory
  mutation_rebuilder: drop unused field _remaining_limit
  db: data_listeners: remove unused field _db
  cql3: insert_json_statement: note bug with unused _if_not_exists
  cql3: authorized_prepared_statement_cache: drop unused field _logger
  auth: service_level_resource_view: drop unused field _resource
2021-05-24 09:21:10 +02:00
Gleb Natapov
b4d6bdb16e raft: test: check that a leader does not send probes to a follower in the snapshot mode
Message-Id: <YKTNN7vNGkQwTDX7@scylladb.com>
2021-05-23 01:06:12 +02:00
Avi Kivity
7e5a0b6fd0 test: drop unused fields
Drop unused fields in various tests and test libraries.
2021-05-21 21:04:49 +03:00
Nadav Har'El
a2379b96b1 alternator test: test for large BatchGetItem
This patch adds an Alternator test, test_batch_get_item_large,
which checks a BatchGetItem with a moderately large (1.5 MB) response.
The test passes - we do not have a bug in BatchGetItem - but it
does reproduce issue #8522 - the long response is stored in memory as
one long contiguous string and causes a warning about an over-sized
allocation:

  WARN ... seastar_memory - oversized allocation: 2281472 bytes.

Incidentally, this test also reproduces a second contiguous
allocation problem - issue #8183 (in BatchWriteItem which we use
in this test to set up the item to read).

Refs #8522
Refs #8183

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210520161619.110941-1-nyh@scylladb.com>
2021-05-21 08:38:53 +02:00
Avi Kivity
eac6fb8d79 gdb: bypass unit test on non-x86
The gdb self-tests fail on aarch64 due to a failure to use thread-local
variables. I filed [1] so it can get fixed.

Meanwhile, disable the test so the build passes. It is sad, but the aarch64
build is not impacted by these failures.

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=27886

Closes #8672
2021-05-20 20:14:15 +03:00
Avi Kivity
30034371e7 Merge "Remove most of global pointers from repair" from Pavel
"
There are many global stuff in repair -- a bunch of pointers to
sharded services, tracker, map of metas (maybe more). This set
removes the first group, all those services had become main-local
recently. Along the way a call to global storage proxy is dropped.

To get there the repair_service is turned into a "classical"
sharded<> service, gets all the needed dependencies by references
from main and spreads them internally where needed. Tracker and other
stuff is left global, but tracker is now the candidate for merging
with the now sharded repair_service, since it emulates the sharded
concept internally.

Overall the change is

- make repair_service sharded and put all dependencies on it at start
- have sharded<repair_service> in API and storage service
- carry the service reference down to repair_info and repair_meta
  constructions to give them the depedencies
- use needed services in _info and _meta methods

tests: unit(dev), dtest.repair(dev)
"

* 'br-repair-service' of https://github.com/xemul/scylla: (29 commits)
  repair: Drop most of globals from repair
  repair: Use local references in messaging handler checks
  repair: Use local references in create_writer()
  repair: Construct repair_meta with local references
  repair: Keep more stuff on repair_info
  repair: Kill bunch of global usages from insert_repair_meta
  repair: Pass repair service down to meta insertion
  repair: Keep local migration manager on repair_info
  repair: Move unused db captures
  repair: Remove unused ms captures
  repair: Construct repair_info with service
  repair: Loop over repair sharded container
  repair: Make sync_data_using_repair a method
  repair: Use repair from storage service
  repair: Keep repair on storage service
  repair: Make do_repair_start a method
  repair: Pass repair_service through the API until do_repair_start
  repair: Fix indentation after previous patch
  repair: Split sync_data_using_repair
  repair: Turn repair_range a repair_info method
  ...
2021-05-20 10:57:48 +03:00
Pavel Solodovnikov
238273d237 treewide: remove evidently unneded storage_proxy includes from some places
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 02:19:32 +03:00
Pavel Solodovnikov
0663aa6ca1 service_level_controller: remove extraneous service/storage_service.hh include
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 02:18:41 +03:00
Pavel Solodovnikov
fff7ef1fc2 treewide: reduce boost headers usage in scylla header files
`dev-headers` target is also ensured to build successfully.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 01:33:18 +03:00
Piotr Sarna
223a59c09c test: make rjson allocator test working in sanitize mode
Following Nadav's advice, instead of ignoring the test
in sanitize/debug modes, the allocator simply has a special path
of failing sufficiently large allocation requests.
With that, a problem with the address sanitizer is bypassed
and other debug mode sanitizers can inspect and check
if there are no more problems related to wrapping the original
rapidjson allocator.

Closes #8539
2021-05-20 00:42:47 +03:00
Pavel Solodovnikov
a66de8658b raft: add tests for RPC module
Now RPC module has some basic testing coverage to
make sure RPC configuration is updated appropriately
on configuration changes (i.e. `add_server` and
`remove_server` are called when appropriate).

The test suite currenty consists of the following
test-cases:
 * Loading server instance with configuration from a snapshot.
 * Loading server instance with configuration from a log.
 * Configuration changes (remove + add node).
 * Leader elections don't lead to RPC configuration changes.
 * Voter <-> learner node transitions also don't change RPC
   configuration.
 * Reverting uncommitted configuration changes updates
   RPC configuration accordingly (two cases: revert to
   snapshot config or committed state from the log).

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-19 23:14:04 +03:00
Pavel Solodovnikov
e030e291a8 test: add CHECK_EVENTUALLY_EQUAL utility macro
It would be good to have a `CHECK` variant in addition
to an existing `REQUIRE_EVENTUALLY_EQUAL` macro. Will be used
in raft RPC tests.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-19 23:12:55 +03:00
Pavel Solodovnikov
2067cc75c6 raft: replication_test: reset test rpc network between test runs
Currently, emulated rpc network is shared between all test cases
in `replication_test.cc` (see static `rpc::net` map).
Though, its value is not reset when executing a subsequent test
case, which opens a possibility for heap-use-after-free bugs.

Also, make all `send_*` functions in test rpc class to throw an
error if a node being contacted is not in the network instead of
past-the-end access. This allows to safely contact a non-existent
node, which will be used in RPC tests later.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-19 23:06:29 +03:00
Avi Kivity
d8121961fa Merge 'cql-pytest: add nodetool flush feature and use it in a test' from Nadav Har'El
The first patch adds a nodetool-like capability to the cql-pytest framework.
It is *not* meant to be used to test nodetool itself, but rather to give CQL
tests the ability to use nodetool operations - currently only one operation -
"nodetool flush".

We try to use Scylla's REST API, if possible, and only fall back to using an
external "nodetool" command when the REST API is not available - i.e., when
testing Cassandra. The benefit of using the REST API is that we don't need
to run the jmx server to test Scylla.

The second patch is an example of using the new nodetool flush feature
in a test that needs to flush data to reproduce a bug (which has already
been fixed).

Closes #8622

* github.com:scylladb/scylla:
  cql-pytest: reproducer for issue #8138
  cql-pytest: add nodetool flush feature
2021-05-19 14:40:18 +03:00
Nadav Har'El
fd8d15a1a6 cql-pytest: reproducer for issue #8138
We add a reproducing test for issue #8138, were if we write to an
TWCS table, scanning it would yield no rows - and worse - crash the
debug build.

This test requires "nodetool flush" to force the read to happen from
sstables, hence the nodetool feature was implemented in the previous
patch (on Scylla, it uses the REST API - not actually running nodetool
or requiring JMX).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-05-19 13:58:14 +03:00
Nadav Har'El
49580a4701 cql-pytest: add nodetool flush feature
This patch adds a nodetool-compatible capability to the cql-pytest
framework. It is *not* meant to be used to test nodetool itself, but
rather to give CQL tests the ability to use nodetool operations -
currently one operation - "nodetool flush".

Use it in a test as:

     import nodetool
     nodetool.flush(cql, table)

I chose a functional API with parameters ("cql") instead of a fixture
with an implied connection so that in the future we may allow multiple
multiple nodes and this API will allow sending nodetool requests to
different nodes. However, multi-node support is not implemented yet,
nor used in any of the existing tests.

The implementation uses Scylla's REST API if available, or if not, falls
back to using an external "nodetool" command (which can be overridden
using the NODETOOL environment variable). This way, both cql-pytest/run
(Scylla) and cql-pytest/run-cassandra (Cassandra) now correctly support
these nodetool operations, and we still don't need to run JMX to test
Scylla.

The reason We want to support nodetool.flush() is to reproduce bugs that
depend on data reaching disk. We already had such a reproducer in
test_large_cells_rows.py - it too did something similar - but it was
Scylla-only (using only the REST API). Instead of copying such code to
multiple places, we better have a common nodetool.flush() function, as
done in this patch. The test in test_large_cells_rows.py can later be
changed to use the new function.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-05-19 13:55:25 +03:00
Pavel Emelyanov
28f01aadc9 allocation_strategy, code: Simplify alloc()
Todays alloc() accepts migrate-fn, size and alignment. All the callers
don't really need to provide anything special for the migrate-fn and
are just happy with default alignof() for alignment. The simplification
is in providing alloc() that only accepts size arg and does the rest
itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-19 09:23:49 +03:00
Botond Dénes
dbb6851d4d test/manual/sstable_scan_footprint: don't double close the semaphore
The semaphore `stats_collector` references is the one obtained from the
database object, which is already stopped by `database::stop()`, making
the stop in `~stats_collector()` redundant, and even worse, as it
triggers an assert failure. Remove it.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210518140913.276368-1-bdenes@scylladb.com>
2021-05-18 17:55:52 +03:00
Avi Kivity
16ff92745f Merge 'perf: add alternator frontend to perf_simple_query' from Piotr Sarna
The perf_simple_query tool is extended with another protocol
aside from CQL - alternator. The alternative (pun intended) benchmark
can be executed by using the `--alternator X` parameter, where X
specifies one of the alternator's mandatory write isolation options:
 - "forbid_rmw" - forbids RMW (read-modify-write) requests
 - "unsafe" - never uses LWT (lightweight transactions), even for RMW
 - "always_use_lwt" - uses LWT even for non-RMW requests
 - "only_rmw_uses_lwt" - that one's rather self-explanatory

Alternator cooperates with existing `--write` and `--delete` parameters.

Aside from being able to check for improvements/regressions
in the alternator module, it's also possible to check how different
isolation levels influence the number of allocations and overall
performance, or to compare alternator against CQL.

Example output showing the difference in isolation levels:

```bash
$ ./build/release/test/perf/perf_simple_query_g --smp 1 \
    --write --alternator only_rmw_uses_lwt --default-log-level error
random-seed=1235000092
Started alternator executor
10873.76 tps (202.9 allocs/op,  12.4 tasks/op,  369921 insns/op)
11096.09 tps (202.7 allocs/op,  12.1 tasks/op,  374792 insns/op)
11100.09 tps (203.0 allocs/op,  12.1 tasks/op,  376469 insns/op)
11068.98 tps (203.1 allocs/op,  12.1 tasks/op,  377132 insns/op)
11081.24 tps (203.2 allocs/op,  12.1 tasks/op,  377290 insns/op)

median 11081.24 tps (203.2 allocs/op,  12.1 tasks/op,  377290 insns/op)
median absolute deviation: 14.85
maximum: 11100.09
minimum: 10873.76

$ ./build/release/test/perf/perf_simple_query_g --smp 1 \
    --random-seed 1235000092 --write --alternator always_use_lwt \
    --default-log-level error
random-seed=1235000092
Started alternator executor
3605.35 tps (877.4 allocs/op, 174.6 tasks/op,  986666 insns/op)
3555.71 tps (890.0 allocs/op, 174.4 tasks/op, 1006945 insns/op)
3530.20 tps (899.7 allocs/op, 174.1 tasks/op, 1021908 insns/op)
3437.65 tps (908.2 allocs/op, 174.6 tasks/op, 1033992 insns/op)
3409.88 tps (913.2 allocs/op, 174.4 tasks/op, 1041240 insns/op)

median 3530.20 tps (899.7 allocs/op, 174.1 tasks/op, 1021908 insns/op)
median absolute deviation: 75.15
maximum: 3605.35
minimum: 3409.88
```

Closes #8656

* github.com:scylladb/scylla:
  perf: add alternator frontend to perf_simple_query
  cdc: make metadata.hh self-sufficient
  test: add minimal alternator_test_env
2021-05-18 16:17:54 +03:00
Piotr Sarna
6c6ccda8a0 perf: add alternator frontend to perf_simple_query
The perf_simple_query tool is extended with another protocol
aside from CQL - alternator. The alternative (pun intended) benchmark
can be executed by using the `--alternator X` parameter, where X
specifies one of the alternator's mandatory write isolation options:
 - "forbid_rmw" - forbids RMW (read-modify-write) requests
 - "unsafe" - never uses LWT (lightweight transactions), even for RMW
 - "always_use_lwt" - uses LWT even for non-RMW requests
 - "only_rmw_uses_lwt" - that one's rather self-explanatory

Alternator cooperates with existing --write and --delete parameters.

Aside from being able to check for improvements/regressions
in the alternator module, it's also possible to check how different
isolation levels influence the number of allocations and overall
performance, or to compare alternator against CQL.

$ ./build/release/test/perf/perf_simple_query_g --smp 1 \
    --write --alternator only_rmw_uses_lwt --default-log-level error
random-seed=1235000092
Started alternator executor
10873.76 tps (202.9 allocs/op,  12.4 tasks/op,  369921 insns/op)
11096.09 tps (202.7 allocs/op,  12.1 tasks/op,  374792 insns/op)
11100.09 tps (203.0 allocs/op,  12.1 tasks/op,  376469 insns/op)
11068.98 tps (203.1 allocs/op,  12.1 tasks/op,  377132 insns/op)
11081.24 tps (203.2 allocs/op,  12.1 tasks/op,  377290 insns/op)

median 11081.24 tps (203.2 allocs/op,  12.1 tasks/op,  377290 insns/op)
median absolute deviation: 14.85
maximum: 11100.09
minimum: 10873.76

$ ./build/release/test/perf/perf_simple_query_g --smp 1 \
    --random-seed 1235000092 --write --alternator always_use_lwt \
    --default-log-level error
random-seed=1235000092
Started alternator executor
3605.35 tps (877.4 allocs/op, 174.6 tasks/op,  986666 insns/op)
3555.71 tps (890.0 allocs/op, 174.4 tasks/op, 1006945 insns/op)
3530.20 tps (899.7 allocs/op, 174.1 tasks/op, 1021908 insns/op)
3437.65 tps (908.2 allocs/op, 174.6 tasks/op, 1033992 insns/op)
3409.88 tps (913.2 allocs/op, 174.4 tasks/op, 1041240 insns/op)

median 3530.20 tps (899.7 allocs/op, 174.1 tasks/op, 1021908 insns/op)
median absolute deviation: 75.15
maximum: 3605.35
minimum: 3409.88
2021-05-18 15:10:31 +02:00
Piotr Sarna
b6d6247a74 test: add minimal alternator_test_env
A minimal implementation of alternator test env, a younger cousin
of cql_test_env, is implemented. Note that using this environment
for unit tests is strongly discouraged in favor of the official
test/alternator pytest suite. Still, alternator_test_env has its uses
for microbenchmarks.
2021-05-18 15:10:31 +02:00
Botond Dénes
82bff1bcc6 test: cql_test_env: use proper scheduling groups
Currently `cql_test_env` runs its `func` in the default (main) group and
also leaves all scheduling groups in `dbcfg` default initialized to the
same scheduling group. This results in every part of the system,
normally isolated from each other, running in the same (default)
scheduling group. Not a big problem on its own, as we are talking about
tests, but this creates an artificial difference between the test and
the real environment, which is ever more pronounced since certain query
parameters are selected based on the current scheduling group.
To bring cql test env just that little bit closer to the real thing,
this patch creates all the scheduling groups main does (well almost) and
configures `dbcfg` with them.
Creating and destroying the scheduling group on each setup-teardown of
cql test env breaks some internal seastar components which don't like
seeing the same scheduling group with the same name but different id. So
create the scheduling groups once on first access and keep them around
until the test executable is running.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514141614.128213-2-bdenes@scylladb.com>
2021-05-18 13:44:54 +03:00
Botond Dénes
300ee974f7 test: use with_cql_test_env_thread where needed
Currently `with_cql_test_env()` is equivalent to
`with_cql_test_env_thread()`, which resulted in many tests using the
former while really needing the latter and getting away with it. This
equivalence is incidental and will go away soon, so make sure all cql
test env using tests that expect to be run in a thread use the
appropriate variant.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514141614.128213-1-bdenes@scylladb.com>
2021-05-18 13:44:52 +03:00