Commit Graph

1703 Commits

Nadav Har'El
ff81072f64 cql-pytest: port Cassandra's unit test validation/entities/secondary_index_test
In this patch, we port validation/entities/secondary_index_test.java,
resulting in 41 tests for various aspects of secondary indexes.
Some of the original Java tests required direct access to the Cassandra
internals not available through CQL, so those tests were omitted.

In porting these tests, I uncovered 9 previously-unknown bugs in Scylla:

Refs #8600: IndexInfo system table lists MV name instead of index name
Refs #8627: Cleanly reject updates with indexed values where value > 64k
Refs #8708: Secondary index is missing partitions with only a static row
Refs #8711: Finding or filtering with an empty string with a secondary
            index seems to be broken
Refs #8714: Improve error message on unsupported restriction on partition
            key
Refs #8717: Recent fix accidentally broke CREATE INDEX IF NOT EXISTS
Refs #8724: Wrong error message when attempting index of UDT column with
            a duration
Refs #8744: Index-creation error message wrongly refers to "map" - it can
            be any collection
Refs #8745: Secondary index CREATE INDEX syntax is missing the "values"
            option

These tests also provide additional reproducers for already known issues:

Refs #2203: Add support for SASI
Refs #2962: Collection column indexing
Refs #2963: Static column indexing
Refs #4244: Add support for mixing token, multi- and single-column
            restrictions

Due to these bugs, 15 out of the 41 tests here currently xfail. We actually
had more failing tests, but we fixed a few of the above issues before this
patch went in, so their tests are passing at the time of this submission.

All 41 tests pass when running against Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210531112354.970028-1-nyh@scylladb.com>
2021-05-31 18:31:13 +03:00
Piotr Sarna
389a0a52c9 treewide: revamp workload type for service levels
This patch is not backward compatible with its original,
but it's considered fine, since the original workload types were not
yet part of any release.
The changes include:
 - instead of using 'unspecified' for declaring that there's no workload
   type for a particular service level, NULL is used for that purpose;
   NULL is the standard way of representing lack of data
 - introducing a delete marker, which accompanies NULL and makes it
   possible to distinguish between wanting to forcibly reset a workload
   type to unspecified and not wanting to change the previous value
 - updating the tests accordingly

These changes come in as a single patch, because they're intertwined
with each other and the tests for workload types are already in place;
an attempt to split them proved to be more complicated than it's worth.
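
The NULL-plus-delete-marker scheme described above can be sketched as a
tri-state update. This is an illustrative Python model, not the actual
Scylla types; the sentinel names are hypothetical:

```python
# Tri-state semantics for a service-level workload-type update:
#  - a concrete value sets the workload type,
#  - UNSET (no value supplied) keeps the previous value,
#  - a delete marker forcibly resets the type to unspecified (NULL).
UNSET = object()          # "don't change the previous value" sentinel
DELETE_MARKER = object()  # "forcibly reset to NULL" sentinel

def apply_update(current, update):
    """Return the new workload type after applying an ALTER."""
    if update is UNSET:
        return current    # no change requested: keep the old value
    if update is DELETE_MARKER:
        return None       # NULL is the standard "no workload type"
    return update         # set a new workload type

assert apply_update("interactive", UNSET) == "interactive"
assert apply_update("interactive", DELETE_MARKER) is None
assert apply_update(None, "batch") == "batch"
```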

Tests: unit(release)

Closes #8763
2021-05-31 18:18:33 +03:00
Raphael S. Carvalho
a7cdd846da compaction: Prevent tons of compaction of fully expired sstable from happening in parallel
Compaction manager can start tons of compaction of fully expired sstable in
parallel, which may consume a significant amount of resources.
This problem is caused by weight being released too early in compaction, after
data is all compacted but before table is called to update its state, like
replacing sstables and so on.
Fully expired sstables aren't actually compacted, so the following can happen:
- compaction 1 starts for expired sst A with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 2 starts for expired sst B with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 3 starts for expired sst C with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 1 is done updating table state, so it finally completes and
releases all the resources.
- compaction 2 is done updating table state, so it finally completes and
releases all the resources.
- compaction 3 is done updating table state, so it finally completes and
releases all the resources.

This happens because, with expired sstable, compaction will release weight
faster than it will update table state, as there's nothing to be compacted.

With my reproducer, it's very easy to reach 50 parallel compactions on a single
shard, but that number can easily be worse depending on the number of sstables
with fully expired data, across all tables. This high parallelism can happen
only with a couple of tables, if there are many time windows with expired data,
as they can be compacted in parallel.

Prior to 55a8b6e3c9, weight was released earlier in compaction, before
last sstable was sealed, but right now, there's no need to release weight
earlier. Weight can be released in a much simpler way, after the compaction is
actually done. So such compactions will be serialized from now on.
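
The effect of early versus late weight release can be seen in a toy model
(illustrative Python, not Scylla's compaction manager):

```python
def peak_parallelism(n, release_early):
    """Toy model of n compactions of fully expired sstables.

    Selecting sstables is instant (there is nothing to compact), while
    the "update table state" phase is slow. If the weight is released
    before that slow phase, every compaction can start immediately.
    """
    running = 0
    peak = 0
    weight_free = True
    for _ in range(n):
        if not weight_free:
            # Weight still held: the previous compaction must fully
            # complete (including the table-state update) first.
            running -= 1
            weight_free = True
        weight_free = False        # the new compaction takes the weight
        running += 1
        peak = max(peak, running)
        if release_early:
            # Old behaviour: weight released before the slow
            # table-state update, so the next compaction can start
            # while this one is still running.
            weight_free = True
    return peak

assert peak_parallelism(50, release_early=True) == 50   # all pile up
assert peak_parallelism(50, release_early=False) == 1   # serialized
```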

Fixes #8710.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210527165443.165198-1-raphaelsc@scylladb.com>

[avi: drop now unneeded storage_service_for_tests]
2021-05-30 23:22:51 +03:00
Avi Kivity
791412b046 test: user_defined_function_test: raise Lua timeout
user_defined_function_test fails sporadically in debug mode
due to a Lua timeout. Raise the timeout to avoid the failure, but
not so much that the test that expects a timeout becomes too slow.

Fixes #8746.

Closes #8747
2021-05-30 13:10:57 +03:00
Piotr Jastrzebski
76d7c761d1 schema: Stop using deprecated constructor
This is another boring patch.

One of schema constructors has been deprecated for many years now but
was used in several places anyway. Usage of this constructor could
lead to data corruption when using MX sstables because this constructor
does not set schema version. MX reading/writing code depends on schema
version.

This patch replaces all the places the deprecated constructor is used
with schema_builder equivalent. The schema_builder sets the schema
version correctly.

Fixes #8507

Test: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <4beabc8c942ebf2c1f9b09cfab7668777ce5b384.1622357125.git.piotr@scylladb.com>
2021-05-30 11:58:27 +03:00
Nadav Har'El
1507bbb35a cql-pytest: increase default server-side timeouts
Sometimes the cql-pytest tests run extremely slowly. This can be
a combination of running the debug build (which is naturally slow)
and a test machine which is overcommitted, or experiencing some
transient swap storm or some similar event. We don't want tests, which
we expect to be 100% reliable, to fail just because they run into
timeouts in Scylla when they run very slowly.

We already noticed this problem in the past, and increased the CQL client
timeout in conftest.py from the default of 10 seconds to 120 seconds -
the old default of 10 seconds was not enough for some long operations
(such as creating a table with multiple views) when the test ran very
slowly.

However, this only fixed the client-side timeout. We also have a bunch
of server-side timeouts, configured to all sorts of arbitrary (and
fairly small) numbers. For example, the server has a "write request
timeout" option, which defaults to just 2 seconds. We recently saw
this timeout exceeded in a slow run which tried to do a very large
write.

So this patch configures all the configurable server-side timeouts we
have to default to 300 seconds. This should be more than enough for even
the slowest runs (famous last words...). This default is not a good idea
on real multi-node clusters which are expected to deal with node loss,
but this is not the case in cql-pytest.
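
As an illustrative sketch, the overrides could be generated like this;
the option names follow scylla.yaml conventions, but the exact set used
by cql-pytest may differ:

```python
# Server-side timeout options to bump to 300 seconds for test runs
# (an assumed, non-exhaustive list of scylla.yaml option names).
TIMEOUT_OPTIONS = [
    "read_request_timeout_in_ms",
    "write_request_timeout_in_ms",   # defaults to just 2000 ms
    "range_request_timeout_in_ms",
    "counter_write_request_timeout_in_ms",
    "cas_contention_timeout_in_ms",
    "truncate_request_timeout_in_ms",
    "request_timeout_in_ms",
]

def timeout_overrides(seconds=300):
    """Build command-line overrides for a test server process."""
    ms = seconds * 1000
    return [f"--{opt.replace('_', '-')}={ms}" for opt in TIMEOUT_OPTIONS]

assert timeout_overrides()[1] == "--write-request-timeout-in-ms=300000"
```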

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210529213648.856503-1-nyh@scylladb.com>
2021-05-30 01:20:14 +03:00
Avi Kivity
d3e5b37059 Revert "Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed' from Calle Wilund"
This reverts commit e9c940dbbc, reversing
changes made to 6144656b25. Since it was
merged commitlog_test consistently times out in debug mode.
2021-05-27 21:16:26 +03:00
Wojciech Mitros
725c6aac81 test/perf: close test_env to pass an assert in sstables_manager destructor
When destroying a perf_sstable_test_env, an assert in the sstables_manager
destructor fails, because it hasn't been closed.
Fix by removing all references to sstables from perf_sstable_test_env,
and then closing the test_env (as well as the sstables_manager).

Fixes #8736

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #8737
2021-05-27 17:41:17 +03:00
Michał Chojnowski
5e9f741bb4 repair: remove range_split.hh
Dead code since 80ebedd242.

Closes #8698
2021-05-27 17:21:37 +03:00
Avi Kivity
5f8484897b Merge 'cdc: use a new internal table for exchanging generations' from Kamil Braun
Reopening #8286 since the token metadata fix that allows `Everywhere` strategy tables to work with RBO (#8536) has been merged.

---
Currently when a node wants to create and broadcast a new CDC generation
it performs the following steps:
1. choose the generation's stream IDs and mapping (how this is done is
   irrelevant for the current discussion)
2. choose the generation's timestamp by taking the current time
   (according to its local clock) and adding 2 * ring_delay
3. insert the generation's data (mapping and stream IDs) into
   system_distributed.cdc_generation_descriptions, using the
   generation's timestamp as the partition key (we call this table
   the "old internal table" below)
4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP"
   application state.

The timestamp spreads epidemically through the gossip protocol. When
nodes see the timestamp, they retrieve the generation data from the
old internal table.

Unfortunately, due to the schema of the old internal table, where
the entire generation data is stored in a single cell, step 3 may fail for
sufficiently large generations (there is a size threshold for which step
3 will always fail - retrying the operation won't help). Also the old
internal table lies in the system_distributed keyspace that uses
SimpleStrategy with replication factor 3, which is also problematic; for
example, when nodes restart, they must reach at least 2 out of these 3
specific replicas in order to retrieve the current generation (we write
and read the generation data with QUORUM, unless we're a single-node
cluster, where we use ONE). Until this happens, a restarting
node can't coordinate writes to CDC-enabled tables. It would be better
if the node could access the last known generation locally.

The commit introduces a new table for broadcasting generation data with
the following properties:
-  it uses a better schema that stores the data in multiple rows, each
   of manageable size
-  it resides in a new keyspace that uses EverywhereStrategy so the
   data will be written to every node in the cluster that has a token in
   the token ring
-  the data will be written using CL=ALL and read using CL=ONE; thanks
   to this, a restarting node won't have to communicate with other nodes
   to retrieve the data of the last known generation. Note that writing
   with CL=ALL does not reduce availability: creating a new generation
   *requires* all nodes to be available anyway, because they must learn
   about the generation before their clocks go past the generation's
   timestamp; if they don't, partitions won't be mapped to stream IDs
   consistently across the cluster
-  the partition key is no longer the generation's timestamp. Because it
   was that way in the old internal table, it forced the algorithm to
   choose the timestamp *before* the generation data was inserted into
   the table. What if the inserting took a long time? It increased the
   chance that nodes would learn about the generation too late (after
   their clocks moved past its timestamp). With the new schema we will
   first insert the generation data using a randomly generated UUID as
   the partition key, *then* choose the timestamp, then gossip both the
   timestamp and the UUID.
   Observe that after a node learns about a generation broadcasted using
   this new method through gossip it will retrieve its data very quickly
   since it's one of the replicas and it can use CL=ONE as it was
   written using CL=ALL.

The generation's timestamp and the UUID mentioned in the last point form
a "generation identifier" for this new generation. For passing these new
identifiers around, we introduce the cdc::generation_id_v2 type.
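
The key ordering change (insert the data under a random UUID first,
choose the timestamp only afterwards) can be sketched as follows.
This is an illustrative Python model; the real schema and types live in
Scylla's sys_dist_ks code, and RING_DELAY here is an assumed value:

```python
import time
import uuid

RING_DELAY = 30  # seconds; illustrative value

def publish_generation_v2(table, streams, rows_per_partition=100):
    """Write the generation data in many manageable rows under a random
    UUID first, and only then choose the timestamp, so a slow insert
    can no longer make the timestamp stale."""
    gen_uuid = uuid.uuid4()            # partition key is now a UUID...
    rows = table.setdefault(gen_uuid, [])
    for i in range(0, len(streams), rows_per_partition):
        rows.append(streams[i:i + rows_per_partition])
    # ...and the timestamp is chosen *after* the inserts finished.
    timestamp = time.time() + 2 * RING_DELAY
    return gen_uuid, timestamp         # the "generation identifier" pair

table = {}
gen_id = publish_generation_v2(table, streams=list(range(250)))
assert len(table[gen_id[0]]) == 3      # data split across multiple rows
assert gen_id[1] > time.time()         # timestamp still in the future
```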

Fixes #7961.

---

For optimal review experience it is best to first read the updated design notes (you can read them rendered here: https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md), specifically the ["Generation switching"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#generation-switching) section followed by the ["Internal generation descriptions table V1 and upgrade procedure"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#internal-generation-descriptions-table-v1-and-upgrade-procedure) section, then read the commits in topological order.

dtest gating run (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/1160/
unit tests (dev) passed locally

Closes #8643

* github.com:scylladb/scylla:
  docs: update cdc.md with info about the new internal table
  sys_dist_ks: don't create old CDC generations table on service initialization
  sys_dist_ks: rename all_tables() to ensured_tables()
  cdc: when creating new generations, use format v2 if possible
  main: pass feature_service to cdc::generation_service
  gms: introduce CDC_GENERATIONS_V2 feature
  cdc: introduce retrieve_generation_data
  test: cdc: include new generations table in permissions test
  sys_dist_ks: increase timeout for create_cdc_desc
  sys_dist_ks: new table for exchanging CDC generations
  tree-wide: introduce cdc::generation_id_v2
2021-05-27 17:13:44 +03:00
Avi Kivity
e8e4456ec7 Merge 'Introduce per-service-level workload types and their first use-case - shedding in interactive workloads' from Piotr Sarna
This draft extends and obsoletes #8123 by introducing a way of determining the workload type from service level parameters, and then using this context to qualify requests for shedding.

The rough idea is that when the admission queue in the CQL server is hit, it might make more sense to start shedding surplus requests instead of accumulating them on the semaphore. The assumption is that interactive workloads care most about the success rate of as many requests as possible, and hanging on a semaphore reduces the chances for a request to succeed. Thus, it may make sense to shed some requests to reduce the load on this coordinator and let the existing requests finish.
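
The shed-instead-of-queue idea can be sketched with a toy admission
queue (illustrative Python; the real logic sits in Scylla's CQL
transport layer):

```python
from collections import deque

class AdmissionQueue:
    """Toy admission control: interactive workloads shed surplus
    requests instead of letting them pile up on the semaphore."""
    def __init__(self, capacity, workload_type):
        self.capacity = capacity
        self.workload_type = workload_type
        self.queue = deque()
        self.shed = 0

    def admit(self, request):
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True
        if self.workload_type == "interactive":
            self.shed += 1           # reject early: a queued request
            return False             # would likely time out anyway
        self.queue.append(request)   # batch workload: queue and wait
        return True

q = AdmissionQueue(capacity=2, workload_type="interactive")
results = [q.admit(i) for i in range(5)]
assert results == [True, True, False, False, False]
assert q.shed == 3
```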

It's a draft, because I only performed local guided tests. #8123 was followed by some experiments on a multinode cluster which I want to rerun first.

Closes #8680

* github.com:scylladb/scylla:
  test: add a case for conflicting workload types
  cql-pytest: add basic tests for service level workload types
  docs: describe workload types for service levels
  sys_dist_ks: fix redundant parsing in get_service_level
  sys_dist_ks: make get_service_level exception-safe
  transport: start shedding requests during potential overload
  client_state: hook workload type from service levels
  cql3: add listing service level workload type
  cql3: add persisting service level workload type
  qos: add workload_type service level parameter
2021-05-27 17:01:56 +03:00
Konstantin Osipov
52f7ff4ee4 raft: (testing) update copyright
Incorrect copyright information was copy-pasted
from another test file.

Message-Id: <20210525183919.1395607-1-kostja@scylladb.com>
2021-05-27 15:47:49 +03:00
Piotr Sarna
99f356d764 test: add a case for conflicting workload types
The test case verifies that if several workload types are effective
for a single role, the conflict resolution is well defined.
2021-05-27 14:31:36 +02:00
Piotr Sarna
01b7e445f9 cql-pytest: add basic tests for service level workload types
The test cases check whether it's possible to declare workload
type for a service level and if its input is validated.
2021-05-27 14:31:36 +02:00
Pavel Emelyanov
d2442a1bb3 tests: Ditch storage_service_for_tests
The purpose of the class in question is to start the sharded storage
service to make its global instance alive. I don't know when exactly
it happened, but no code that instantiates this wrapper really needs
the global storage service.

Ref: #2795
tests: unit(dev), perf_sstable(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210526170454.15795-1-xemul@scylladb.com>
2021-05-27 14:39:13 +03:00
Piotr Sarna
762e2f48f2 cql3: add listing service level workload type
The workload type information is now presented in the output
of LIST SERVICE LEVEL and LIST ALL SERVICE LEVELS statements.
2021-05-27 13:02:22 +02:00
Nadav Har'El
97e827e3e1 secondary index: fix regression in CREATE INDEX IF NOT EXISTS
The recent commit 0ef0a4c78d added helpful
error messages in case an index cannot be created because the intended
name of its materialized view is already taken - but accidentally broke
the "CREATE INDEX IF NOT EXISTS" feature.

The checking code was correct, but in the wrong place: we need to first
check whether the index already exists and "IF NOT EXISTS" was chosen -
and only do this new error checking if this is not the case.
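
The corrected ordering of the two checks can be sketched like this
(illustrative Python, not the actual C++ statement code; the view
naming scheme shown is hypothetical):

```python
class AlreadyExists(Exception):
    pass

def create_index(indexes, view_names, name, if_not_exists):
    """First honor IF NOT EXISTS, *then* run the newer name-collision
    check; doing it the other way around broke CREATE INDEX IF NOT
    EXISTS (issue #8717)."""
    if name in indexes:
        if if_not_exists:
            return "no-op"           # index exists, and that's fine
        raise AlreadyExists(f"index {name} already exists")
    view_name = name + "_index"      # illustrative MV naming scheme
    if view_name in view_names:
        # The helpful error message this check was added for.
        raise AlreadyExists(f"view name {view_name} is already taken")
    indexes.add(name)
    view_names.add(view_name)
    return "created"

idx, views = {"myidx"}, {"myidx_index"}
assert create_index(idx, views, "myidx", if_not_exists=True) == "no-op"
assert create_index(idx, views, "other", if_not_exists=False) == "created"
```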

This patch also includes a cql-pytest test for reproducing this bug.
The bug is also reproduced by the translated Cassandra unit tests
    cassandra_tests/validation/entities/secondary_index_test.py::
    testCreateAndDropIndex
and this is how I found this bug. After this patch, all these tests
pass.

Fixes #8717.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210526143635.624398-1-nyh@scylladb.com>
2021-05-27 09:10:41 +02:00
Avi Kivity
e2e723cc4c build: enable -Wrange-loop-construct warning
This warning triggers when a range for ("for (auto x : range)") causes
non-trivial copies, prompting the developer to replace with a capture
by reference. A few minor violations in the test suite are corrected.

Closes #8699
2021-05-26 10:32:56 +03:00
Avi Kivity
e9c940dbbc Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed' from Calle Wilund
Fixes #8270

If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below the threshold, yet still get a disk _footprint_ that is over the limit, causing new segment allocation to stall.

We need to take a few things into account:
1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here.
2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task.
3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider start flushing once we've used up space enough to be in the last available segment, so a new one is hopefully available by the time we hit the limit.

Also fix an edge case (for tests), when we have too few segments to have an active one (i.e. we need to flush everything).
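
Points 1 and 3 above can be sketched numerically (illustrative Python;
the real accounting is in Scylla's commitlog code):

```python
def should_flush(segments, max_disk_size, segment_size):
    """Count wasted (slack) space as part of the footprint, and start
    flushing once we're into the last available segment, so a new one
    is hopefully ready before the hard limit is hit."""
    used = sum(s["used"] for s in segments)
    wasted = sum(s["size"] - s["used"] for s in segments)
    footprint = used + wasted          # total on-disk segment bytes
    return footprint >= max_disk_size - segment_size

# Three 32 MB segments, each only half full: usage alone (48 MB) looks
# fine against a 128 MB limit, but the footprint (96 MB) already
# reaches the last-segment margin.
segs = [{"size": 32, "used": 16} for _ in range(3)]
assert should_flush(segs, max_disk_size=128, segment_size=32) is True
```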

Closes #8695

* github.com:scylladb/scylla:
  commitlog_test: Add test case for usage/disk size threshold mismatch
  commitlog: Flush all segments if we only have one.
  commitlog: Always force flush if segment allocation is waiting
  commitlog: Include segment wasted (slack) size in footprint check
  commitlog: Adjust (lower) usage threshold
2021-05-25 18:34:29 +03:00
Kamil Braun
c948573398 sys_dist_ks: don't create old CDC generations table on service initialization
The old table won't be created in clusters that are bootstrapped after
this commit. It will stay in clusters that were upgraded from a version
before this commit.

Note that a fully upgraded cluster doesn't automatically create a new
generation in the new format. Even if the last generation was created
before the upgrade, the cluster will keep using it.
A new generation will be created in the new format when either:
1. a new node bootstraps (in the new version),
2. or the user runs checkAndRepairCdcStreams, which has a new check: if
   the current generation uses the old format, the command will decide
   that repair is needed, even if the generation is completely fine
   otherwise (also in the new version).

During upgrade, while the CDC_GENERATIONS_V2 feature is still not
enabled, the user may still bootstrap a node in the old version of
Scylla or run checkAndRepairCdcStreams on a not-yet-upgraded node. In
that case a new generation will be created in the old format,
using the old table definitions.
2021-05-25 16:07:23 +02:00
Kamil Braun
4d3870b24b main: pass feature_service to cdc::generation_service 2021-05-25 16:07:23 +02:00
Kamil Braun
f25e77c202 test: cdc: include new generations table in permissions test 2021-05-25 16:07:23 +02:00
Calle Wilund
a96433c684 commitlog_test: Add test case for usage/disk size threshold mismatch
Refs #8270

Tries to simulate a case where we mismatch segment usage with the actual
disk footprint and fail to flush enough to allow segment recycling
2021-05-25 12:43:12 +00:00
Avi Kivity
e391e4a398 test: serialized_action_test: prevent false-positive timeout in test_phased_barrier_reassignment
test_phased_barrier_reassignment has a timeout to prevent the test from
hanging on failure, but it occasionally triggers in debug mode since
the timeout is quite low (1ms). Increase the timeout to prevent false
positives. Since the timeout only expires if the test fails, it will
have no impact on execution time.

Ref #8613

Closes #8692
2021-05-25 11:20:18 +02:00
Raphael S. Carvalho
ee39eb9042 sstables: Fix slow off-strategy compaction on STCS tables
Off-strategy compaction on a table using STCS is slow because of
the needless write amplification of 2. That's because STCS reshape
isn't taking advantage of the fact that sstables produced by
a repair-based operation are disjoint. So the ~256 input sstables
were compacted (in batches of 32) into larger sstables, which in
turn were compacted into even larger ones. That write amp is very
significant on large data sets, making the whole operation 2x
slower.
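
The write-amplification difference can be sketched with a toy reshape
model (illustrative Python, not Scylla's STCS reshape logic):

```python
def reshape_write_amp(n_sstables, batch, disjoint_aware):
    """Count how many times each byte is rewritten when reshaping
    n_sstables down to a single run."""
    if disjoint_aware:
        # Disjoint sstables can all go into one output run in a
        # single pass, so each byte is written once.
        return 1
    passes = 0
    while n_sstables > 1:
        n_sstables = -(-n_sstables // batch)  # ceil division
        passes += 1
    return passes

# 256 input sstables in batches of 32: 256 -> 8 -> 1, so every byte
# is written twice (write amplification of 2).
assert reshape_write_amp(256, batch=32, disjoint_aware=False) == 2
assert reshape_write_amp(256, batch=32, disjoint_aware=True) == 1
```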

Fixes #8449.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210524213426.196407-1-raphaelsc@scylladb.com>
2021-05-25 11:24:42 +03:00
Piotr Sarna
95c6ec1528 Merge 'test/cql-pytest: clean up tests to run on Cassandra' from Nadav Har'El
To keep our cql-pytest tests "correct", we should strive for them to pass on
Cassandra - unless they are testing a Scylla-only feature or a deliberate
difference between Scylla and Cassandra - in which case they should be marked
"scylla-only" and cause such tests to be skipped when running on Cassandra.

The following few small patches fix a few cases where our tests were failing on
Cassandra. In one case this even found a bug in the test (a trivial Python
mistake, but still).

Closes #8694

* github.com:scylladb/scylla:
  test/cql-pytest: fix python mistake in an xfailing test
  test/cql-pytest: mark some tests with scylla-only
  test/cql-pytest: clean up test_create_large_static_cells_and_rows
2021-05-24 16:42:01 +02:00
Nadav Har'El
edc2c65552 Merge 'Fix service level negative timeouts' from Piotr Sarna
This series fixes a minor validation issue with service level timeouts - negative values were not checked. This bug is benign because negative timeouts act just like a 0s timeout, but the original series claimed to validate against negative values, so it's hereby fixed.
More importantly however, this series follows by enabling cql-pytest to run service level tests and provides a first batch of them, including a missing test case for negative timeouts.
The idea is similar to what we already have in alternator test suite - authentication is unconditionally enabled, which doesn't affect any existing tests, but at the same time allows writing test cases which rely on authentication - e.g. service levels.

Closes #8645

* github.com:scylladb/scylla:
  cql-pytest: introduce service level test suite
  cql-pytest: add enabling authentication by default
  qos: fix validating service level timeouts for negative values
2021-05-24 16:30:13 +03:00
Tomasz Grabiec
b1821c773f Merge "raft: basic RPC module testing" from Pavel Solodovnikov
Now RPC module has some basic testing coverage to
make sure RPC configuration is updated appropriately
on configuration changes (i.e. `add_server` and
`remove_server` are called when appropriate).

The test suite currently consists of the following
test-cases:
 * Loading server instance with configuration from a snapshot.
 * Loading server instance with configuration from a log.
 * Configuration changes (remove + add node).
 * Leader elections don't lead to RPC configuration changes.
 * Voter <-> learner node transitions also don't change RPC
   configuration.
 * Reverting uncommitted configuration changes updates
   RPC configuration accordingly (two cases: revert to
   snapshot config or committed state from the log).

A few more refactorings are made along the way to be
able to reuse some existing functions from
`replication_test` in `rpc_test` implementation.

Please note, though, that there are still some functions
that are borrowed from `replication_test` but not yet
extracted to common helpers.

This is mostly because the RPC tests don't need all
the complexity that `replication_test` has; thus,
some helpers are copied in a reduced form.

It would take some effort to refactor these bits to
fit both `replication_test` and `rpc_test` without
sacrificing convenience.
This will probably be addressed in another series later.

* manmanson/raft-rpc-tests-v9-alt3:
  raft: add tests for RPC module
  test: add CHECK_EVENTUALLY_EQUAL utility macro
  raft: replication_test: reset test rpc network between test runs
  raft: replication_test: extract tickers initialization into a separate func
  raft: replication_test: support passing custom `apply_fn` to `change_configuration()`
  raft: replication_test: introduce `test_server` aggregate struct
  raft: replication_test: support voter<->learner configuration changes
  raft: remove duplicate `create_command` function from `replication_test`
  raft: avoid 'using' statements in raft testing helpers header
2021-05-24 14:44:37 +02:00
Avi Kivity
50f3bbc359 Merge "treewide: various header cleanups" from Pavel S
"
The patch set is an assorted collection of header cleanups, e.g:
* Reduce number of boost includes in header files
* Switch to forward declarations in some places

A quick measurement was performed to see if these changes
provide any improvement in build times (ccache cleaned and
existing build products wiped out).

The results are posted below (`/usr/bin/time -v ninja dev-build`)
for 24 cores/48 threads CPU setup (AMD Threadripper 2970WX).

Before:

	Command being timed: "ninja dev-build"
	User time (seconds): 28262.47
	System time (seconds): 824.85
	Percent of CPU this job got: 3979%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 2129888
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1402838
	Minor (reclaiming a frame) page faults: 124265412
	Voluntary context switches: 1879279
	Involuntary context switches: 1159999
	Swaps: 0
	File system inputs: 0
	File system outputs: 11806272
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

After:

	Command being timed: "ninja dev-build"
	User time (seconds): 26270.81
	System time (seconds): 767.01
	Percent of CPU this job got: 3905%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 2117608
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1400189
	Minor (reclaiming a frame) page faults: 117570335
	Voluntary context switches: 1870631
	Involuntary context switches: 1154535
	Swaps: 0
	File system inputs: 0
	File system outputs: 11777280
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

The observed improvement is about 5% of total wall clock time
for `dev-build` target.

Also, all commits make sure that headers stay self-sufficient,
which would help to further improve the situation in the future.
"

* 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla:
  transport: remove extraneous `qos/service_level_controller` includes from headers
  treewide: remove evidently unneeded storage_proxy includes from some places
  service_level_controller: remove extraneous `service/storage_service.hh` include
  sstables/writer: remove extraneous `service/storage_service.hh` include
  treewide: remove extraneous database.hh includes from headers
  treewide: reduce boost headers usage in scylla header files
  cql3: remove extraneous includes from some headers
  cql3: various forward declaration cleanups
  utils: add missing <limits> header in `extremum_tracking.hh`
2021-05-24 14:24:20 +03:00
Nadav Har'El
5206665b15 test/cql-pytest: fix python mistake in an xfailing test
The xfailing test cassandra_tests/validation/entities/collections_test.py::
testSelectionOfEmptyCollections had a Python mistake (using {} instead
of set() for an empty set), which resulted in its failure when run
against Cassandra. After this patch it passes on Cassandra and fails on
Scylla - as expected (this is why it is marked xfail).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-05-24 13:14:54 +03:00
Nadav Har'El
f26b31e950 test/cql-pytest: mark some tests with scylla-only
Tests which are known to test a Scylla-only feature (such as CDC)
or to rely on a known and deliberate difference between Scylla and Cassandra
should be marked "scylla-only", so they are skipped when running
the tests against Cassandra (test/cql-pytest/run-cassandra) instead
of reporting errors.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-05-24 13:03:48 +03:00
Nadav Har'El
c8117584e3 test/cql-pytest: clean up test_create_large_static_cells_and_rows
The test test_create_large_static_cells_and_rows had its own
implementation of "nodetool flush" using Scylla's REST API.
Now that we have a nodetool.flush() function for general use in
cql-pytest, let's use it and save a bit of duplication.

Another benefit is that now this test can be run (and pass) against
Cassandra.

To allow this test to run on Cassandra, I had to remove a
"USING TIMEOUT" which wasn't necessary for this test, and is
not a feature supported by Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-05-24 12:31:51 +03:00
Asias He
425e3b1182 gossip: Introduce direct failure detector
Currently, gossip uses the updates of the gossip heartbeat from gossip
messages to decide if a node is up or down. This means if a node is
actually down but the gossip messages are delayed in the network, the
marking of the node as down can be delayed.

For example, a node sends 20 gossip messages in 20 seconds before it
is dead. Each message is delayed 15 seconds by the network for some
reason. A node receives those delayed messages one after another.
Those delayed messages will prevent this node from being marked as down,
because a heartbeat update is received just before the threshold to mark
a node down is triggered, which is around 20 seconds by default.

As a result, this node will not be marked as down in 20 * 15 seconds =
300 seconds, much longer than the ~20 seconds node-down detection time
in normal cases.

In this patch, a new failure detector is implemented.

- Direct detection

The existing failure detector can get gossip heartbeat updates
indirectly.  For example:

Node A can talk to Node B
Node B can talk to Node C
Node A can not talk to Node C, due to network issues

Node A will not mark Node C as down because Node A can get the
heartbeat of Node C from Node B indirectly.

This indirect detection is not very useful because when Node A decides
if it should send requests to Node C, the requests from Node A to C will
fail while Node A thinks it can communicate with Node C.

This patch changes the failure detection to be direct. It uses the
existing gossip echo message to detect failures directly: echo messages
are sent to peer nodes periodically, and a peer node is marked as
down once a timeout threshold has been met.

Since the failure detection is peer to peer, it avoids the delayed
message issue mentioned above.

- Parallel detection

The old failure detector ran on shard zero only. The new failure
detector utilizes all the shards to perform failure detection, each
shard handling a subset of live nodes. For example, if the cluster has
32 nodes and each node has 16 shards, each shard handles only 2 peers;
in a 16-node cluster, each shard handles only one peer node.

A gossip echo message will be sent to each peer node every 2 seconds.
The extra echo-message traffic, compared to the old failure detector,
is negligible.

- Deterministic detection

Users can configure failure_detector_timeout_in_ms to set the
threshold for marking a node down. It is the maximum time allowed
between two successful echo messages before gossip marks a node down,
and is easier to understand than the old phi_convict_threshold.

- Compatible

This patch only uses the existing gossip echo message. Nodes with or without
this patch can work together.
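The detection rule described above (periodic echoes, a peer marked down once failure_detector_timeout_in_ms elapses without a successful reply, peers partitioned across shards) could be sketched as follows. This is a minimal illustrative Python sketch, not Scylla's actual C++ implementation; all names here are made up:

```python
import itertools

class DirectFailureDetector:
    """Marks a peer down when the gap since its last successful echo
    reply exceeds failure_detector_timeout_in_ms."""

    def __init__(self, timeout_ms, clock):
        self.timeout_ms = timeout_ms
        self.clock = clock            # injected monotonic clock, in ms
        self.last_echo_ok = {}        # peer -> time of last good echo reply

    def record_echo_reply(self, peer):
        self.last_echo_ok[peer] = self.clock()

    def is_down(self, peer):
        last = self.last_echo_ok.get(peer)
        if last is None:
            return False              # never probed yet: not considered down
        return self.clock() - last > self.timeout_ms

def assign_peers_to_shards(peers, shard_count):
    """Round-robin partitioning of peers across shards, so e.g. 32 peers
    over 16 shards gives each shard 2 peers to probe."""
    assignment = {}
    shards = itertools.cycle(range(shard_count))
    for peer in peers:
        assignment.setdefault(next(shards), []).append(peer)
    return assignment
```

Because each peer's state depends only on its own echo replies, a delayed message for one peer cannot keep a different, unreachable peer looking alive.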

Fixes #8488

Closes #8036
2021-05-24 10:47:06 +03:00
Piotr Sarna
890ed201fd Merge 'Enable -Wunused-private-field warning' from Avi Kivity
The -Wunused-private-field was squelched when we switched to
clang to make the change easier. But it is a useful warning, so
re-enable it.

It found a serious bug (#8682) and a few minor instances of waste.

Closes #8683

* github.com:scylladb/scylla:
  build: enable -Wunused-private-field warning
  test: drop unused fields
  table: drop unused field database_sstable_write_monitor::_compaction_manager
  streaming: drop unused fields
  sstables: mx reader: drop unused _column_value_length field
  sstables: index_consumer: drop unused max_quantity field
  compaction: resharding_compaction: drop unused _shard field
  compaction: compaction_read_monitor: drop unused _compaction_manager field
  raft: raft_services: drop unused _gossiper field
  repair: drop unused _nr_peer_nodes field
  redis: drop unused fields _storage_proxy and _requests_blocked_memory
  mutation_rebuilder: drop unused field _remaining_limit
  db: data_listeners: remove unused field _db
  cql3: insert_json_statement: note bug with unused _if_not_exists
  cql3: authorized_prepared_statement_cache: drop unused field _logger
  auth: service_level_resource_view: drop unused field _resource
2021-05-24 09:21:10 +02:00
Gleb Natapov
b4d6bdb16e raft: test: check that a leader does not send probes to a follower in the snapshot mode
Message-Id: <YKTNN7vNGkQwTDX7@scylladb.com>
2021-05-23 01:06:12 +02:00
Avi Kivity
7e5a0b6fd0 test: drop unused fields
Drop unused fields in various tests and test libraries.
2021-05-21 21:04:49 +03:00
Nadav Har'El
a2379b96b1 alternator test: test for large BatchGetItem
This patch adds an Alternator test, test_batch_get_item_large,
which checks a BatchGetItem with a moderately large (1.5 MB) response.
The test passes - we do not have a bug in BatchGetItem - but it
does reproduce issue #8522 - the long response is stored in memory as
one long contiguous string and causes a warning about an over-sized
allocation:

  WARN ... seastar_memory - oversized allocation: 2281472 bytes.

Incidentally, this test also reproduces a second contiguous
allocation problem - issue #8183 (in BatchWriteItem which we use
in this test to set up the item to read).

Refs #8522
Refs #8183

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210520161619.110941-1-nyh@scylladb.com>
2021-05-21 08:38:53 +02:00
Avi Kivity
eac6fb8d79 gdb: bypass unit test on non-x86
The gdb self-tests fail on aarch64 due to a failure to use thread-local
variables. I filed [1] so it can get fixed.

Meanwhile, disable the test so the build passes. It is sad, but the aarch64
build is not impacted by these failures.

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=27886

Closes #8672
2021-05-20 20:14:15 +03:00
Avi Kivity
30034371e7 Merge "Remove most of global pointers from repair" from Pavel
"
There are many global stuff in repair -- a bunch of pointers to
sharded services, tracker, map of metas (maybe more). This set
removes the first group, all those services had become main-local
recently. Along the way a call to global storage proxy is dropped.

To get there the repair_service is turned into a "classical"
sharded<> service, gets all the needed dependencies by references
from main and spreads them internally where needed. Tracker and other
stuff is left global, but tracker is now the candidate for merging
with the now sharded repair_service, since it emulates the sharded
concept internally.

Overall the change is

- make repair_service sharded and put all dependencies on it at start
- have sharded<repair_service> in API and storage service
- carry the service reference down to repair_info and repair_meta
  constructions to give them the dependencies
- use needed services in _info and _meta methods

tests: unit(dev), dtest.repair(dev)
"

* 'br-repair-service' of https://github.com/xemul/scylla: (29 commits)
  repair: Drop most of globals from repair
  repair: Use local references in messaging handler checks
  repair: Use local references in create_writer()
  repair: Construct repair_meta with local references
  repair: Keep more stuff on repair_info
  repair: Kill bunch of global usages from insert_repair_meta
  repair: Pass repair service down to meta insertion
  repair: Keep local migration manager on repair_info
  repair: Move unused db captures
  repair: Remove unused ms captures
  repair: Construct repair_info with service
  repair: Loop over repair sharded container
  repair: Make sync_data_using_repair a method
  repair: Use repair from storage service
  repair: Keep repair on storage service
  repair: Make do_repair_start a method
  repair: Pass repair_service through the API until do_repair_start
  repair: Fix indentation after previous patch
  repair: Split sync_data_using_repair
  repair: Turn repair_range a repair_info method
  ...
2021-05-20 10:57:48 +03:00
Pavel Solodovnikov
238273d237 treewide: remove evidently unneeded storage_proxy includes from some places
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 02:19:32 +03:00
Pavel Solodovnikov
0663aa6ca1 service_level_controller: remove extraneous service/storage_service.hh include
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 02:18:41 +03:00
Pavel Solodovnikov
fff7ef1fc2 treewide: reduce boost headers usage in scylla header files
`dev-headers` target is also ensured to build successfully.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 01:33:18 +03:00
Piotr Sarna
223a59c09c test: make rjson allocator test work in sanitize mode
Following Nadav's advice, instead of ignoring the test
in sanitize/debug modes, the allocator simply gets a special path
that fails sufficiently large allocation requests.
With that, the problem with the address sanitizer is bypassed,
and the other debug-mode sanitizers can still inspect and check
that there are no more problems related to wrapping the original
rapidjson allocator.
Closes #8539
2021-05-20 00:42:47 +03:00
Pavel Solodovnikov
a66de8658b raft: add tests for RPC module
Now the RPC module has some basic testing coverage to
make sure the RPC configuration is updated appropriately
on configuration changes (i.e. `add_server` and
`remove_server` are called when appropriate).

The test suite currently consists of the following
test-cases:
 * Loading server instance with configuration from a snapshot.
 * Loading server instance with configuration from a log.
 * Configuration changes (remove + add node).
 * Leader elections don't lead to RPC configuration changes.
 * Voter <-> learner node transitions also don't change RPC
   configuration.
 * Reverting uncommitted configuration changes updates
   RPC configuration accordingly (two cases: revert to
   snapshot config or committed state from the log).

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-19 23:14:04 +03:00
Pavel Solodovnikov
e030e291a8 test: add CHECK_EVENTUALLY_EQUAL utility macro
It would be good to have a `CHECK` variant in addition
to the existing `REQUIRE_EVENTUALLY_EQUAL` macro. It will be used
in raft RPC tests.
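The behaviour of an "eventually equal" check can be sketched with a hedged Python analogue (the real macro is C++; the function name and defaults below are invented). As with CHECK versus REQUIRE in Boost.Test, a failure is reported rather than aborting the test:

```python
import time

def check_eventually_equal(actual_fn, expected, timeout_s=1.0, interval_s=0.01):
    """Poll actual_fn() until it equals `expected` or the deadline passes.
    Returns True on a match, False (rather than raising) on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if actual_fn() == expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```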

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-19 23:12:55 +03:00
Pavel Solodovnikov
2067cc75c6 raft: replication_test: reset test rpc network between test runs
Currently, the emulated rpc network is shared between all test cases
in `replication_test.cc` (see the static `rpc::net` map).
However, its value is not reset when executing a subsequent test
case, which opens the possibility of heap-use-after-free bugs.

Also, make all `send_*` functions in the test rpc class throw an
error if the node being contacted is not in the network, instead of
performing a past-the-end access. This allows safely contacting a
non-existent node, which will be used in RPC tests later.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-19 23:06:29 +03:00
Avi Kivity
d8121961fa Merge 'cql-pytest: add nodetool flush feature and use it in a test' from Nadav Har'El
The first patch adds a nodetool-like capability to the cql-pytest framework.
It is *not* meant to be used to test nodetool itself, but rather to give CQL
tests the ability to use nodetool operations - currently only one operation -
"nodetool flush".

We try to use Scylla's REST API, if possible, and only fall back to using an
external "nodetool" command when the REST API is not available - i.e., when
testing Cassandra. The benefit of using the REST API is that we don't need
to run the jmx server to test Scylla.

The second patch is an example of using the new nodetool flush feature
in a test that needs to flush data to reproduce a bug (which has already
been fixed).

Closes #8622

* github.com:scylladb/scylla:
  cql-pytest: reproducer for issue #8138
  cql-pytest: add nodetool flush feature
2021-05-19 14:40:18 +03:00
Nadav Har'El
fd8d15a1a6 cql-pytest: reproducer for issue #8138
We add a reproducing test for issue #8138, where if we write to a
TWCS table, scanning it would yield no rows - and worse - crash the
debug build.

This test requires "nodetool flush" to force the read to happen from
sstables, hence the nodetool feature was implemented in the previous
patch (on Scylla, it uses the REST API - not actually running nodetool
or requiring JMX).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-05-19 13:58:14 +03:00
Nadav Har'El
49580a4701 cql-pytest: add nodetool flush feature
This patch adds a nodetool-compatible capability to the cql-pytest
framework. It is *not* meant to be used to test nodetool itself, but
rather to give CQL tests the ability to use nodetool operations -
currently one operation - "nodetool flush".

Use it in a test as:

     import nodetool
     nodetool.flush(cql, table)

I chose a functional API with parameters ("cql") instead of a fixture
with an implied connection so that in the future we may allow multiple
nodes, and this API will allow sending nodetool requests to
different nodes. However, multi-node support is not implemented yet,
nor used in any of the existing tests.

The implementation uses Scylla's REST API if available, or if not, falls
back to using an external "nodetool" command (which can be overridden
using the NODETOOL environment variable). This way, both cql-pytest/run
(Scylla) and cql-pytest/run-cassandra (Cassandra) now correctly support
these nodetool operations, and we still don't need to run JMX to test
Scylla.
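The REST-first-with-fallback logic could be sketched like this. This is a hedged illustration, not the cql-pytest code: the REST endpoint path and the parameter names are assumptions, and the HTTP and subprocess callables are injected to keep the sketch self-contained:

```python
import shlex

def flush(keyspace, table, http_post=None, run_command=None,
          nodetool_cmd="nodetool"):
    """Flush a table, preferring the REST API over an external nodetool.

    http_post(path, params): callable for Scylla's REST API, or None.
    run_command(argv): callable to run an external command (Cassandra).
    """
    if http_post is not None:
        # Scylla: use the REST API directly, no JMX needed.
        # (the endpoint path here is an assumption for illustration)
        http_post(f"/storage_service/keyspace_flush/{keyspace}",
                  params={"cf": table})
    else:
        # Cassandra: fall back to the external nodetool command,
        # overridable like the NODETOOL environment variable.
        run_command(shlex.split(nodetool_cmd) + ["flush", keyspace, table])
```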

The reason we want to support nodetool.flush() is to reproduce bugs that
depend on data reaching disk. We already had such a reproducer in
test_large_cells_rows.py - it did something similar - but it was
Scylla-only (using only the REST API). Instead of copying such code to
multiple places, it is better to have a common nodetool.flush() function,
as done in this patch. The test in test_large_cells_rows.py can later be
changed to use the new function.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-05-19 13:55:25 +03:00
Pavel Emelyanov
28f01aadc9 allocation_strategy, code: Simplify alloc()
Today's alloc() accepts a migrate-fn, size and alignment. None of the
callers really need to provide anything special for the migrate-fn, and
all are happy with the default alignof() for the alignment. The
simplification is to provide an alloc() that accepts only a size
argument and does the rest itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-19 09:23:49 +03:00