Commit Graph

1739 Commits

Author SHA1 Message Date
Alejo Sanchez
52188016af raft: replication test: create_server in raft_cluster
Remove the global create_raft_server() and replace with a
create_server() helper in replication_test().

This way users of raft_cluster no longer need to create special
objects.

Note this does not move(apply) anymore as it's kept in raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:02 -04:00
Alejo Sanchez
1edcb6e647 raft: replication test: reset snapshots
When stopping a server also delete snapshots and persisted snapshots.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:46:11 -04:00
Alejo Sanchez
453f19cf0e raft: replication test: reset server helper
Add a helper to reset a server in raft_cluster.

Besides simplifying code and preventing errors, this will help move
create_raft_server logic to raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:20 -04:00
Alejo Sanchez
d3b7f21b88 raft: replication test: pause tickers before stopping
Pause tickers before stopping servers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:20 -04:00
Alejo Sanchez
30c9daafd2 raft: replication test: tick helper
Move test tick handling to raft_cluster as helper method.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:20 -04:00
Alejo Sanchez
2e61c507d2 raft: replication test: tickers on raft_cluster
Move tickers to raft_cluster helper class. Ticker initialization and
pause are done automatically at start_all() and stop_all().

Add temporary helpers to manage specific tickers. These might be removed
later once proper node abort and reset are implemented.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:20 -04:00
Alejo Sanchez
aea77871c4 raft: replication test: cluster tracking leader
Track current leader inside helper class.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:20 -04:00
Alejo Sanchez
ca8e55613e raft: replication test: elect first leader in raft_cluster
Run first leader election inside raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:20 -04:00
Alejo Sanchez
322802308c raft: replication test: use id 0 for rpc tests
raft_cluster at the moment only allows sequential 0 based ids.

The code was generating ids over this and causing problems for code
changes.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:20 -04:00
Alejo Sanchez
c1a6e81002 raft: replication test: fix partition wait log
When partitioning, don't wait_log on servers outside configuration.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:20 -04:00
Alejo Sanchez
6db730c500 raft: replication test: partition helper
Add a partition handling helper to raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
848c244932 raft: replication test: track in_configuration in raft_cluster
Keep track of servers in configuration inside raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
16728b8966 raft: replication test: use cluster saved apply function
Use apply function saved in cluster at creation time.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
3daed889b8 raft: replication test: change_configuration in raft_cluster
Move change_configuration to raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
102b8e71bb raft: replication test: free_election in raft_cluster
Move free_election to raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
60d4d06861 raft: replication test: wait_log_all in raft_cluster
Move wait_log_all to raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
d1ba0fe719 raft: replication test: wait_log in raft_cluster
Move wait_log to raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
3e4871b884 raft: replication test: elect_new_leader in raft_cluster
Move elect_new_leader to raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
59b9642be5 raft: replication test: elapse_election in raft_cluster
Move elapse_election to raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
b3e2b54913 raft: replication test: move add_entry up
Style.

Move definition of add_entry and add_remaining_entries with the rest of
raft_cluster definitions.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
8cd2abe72b raft: replication test: remove spurious check
Going forward the leader is always in configuration and up to date.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
2d51d1bbc5 raft: replication test: raft_cluster add_entries
Move add_entries() to raft_cluster and provide a helper to add remaining
entries.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
2a1e7a15a6 raft: replication test: calculate first value helper
Add a helper to calculate the first value to be added after the snapshot
and the leader's initial log.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
e2f425e210 raft: replication test: initial state helper
Move initial_state preparation to its own helper function.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
d2c0308a85 raft: replication test: move declarations up
Move declarations near the top of the file for following refactors to
raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
a3700a6d0a raft: replication test: move up set_config
Move set_config above raft_cluster for a subsequent commit.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
57da05c986 raft: replication test: use disconnect() helper
For rpc tests, use raft_cluster::disconnect() instead of the local
connected reference.

This removes connected object use outside raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
54c919b726 raft: replication test: add connectivity helpers
Add connectivity helpers disconnect(server, except) and connect_all()
so users of raft_cluster don't need to keep a connectivity object
pointer.
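
The idea behind these helpers can be sketched as a toy model in Python. This is not the test's C++ code: the class name, the constructor argument and the is_connected() query are hypothetical, and only the helper names disconnect(server, except) and connect_all() come from the message above.

```python
class Connectivity:
    """Toy model of cluster connectivity (illustrative, not the Scylla test code)."""

    def __init__(self, n_servers):
        self.n_servers = n_servers
        self._cut = set()  # unordered pairs of server ids that cannot talk

    def disconnect(self, server, except_=None):
        # Cut every link of `server`, optionally sparing the link to `except_`.
        for other in range(self.n_servers):
            if other != server and other != except_:
                self._cut.add(frozenset((server, other)))

    def connect_all(self):
        # Restore full connectivity.
        self._cut.clear()

    def is_connected(self, a, b):
        return frozenset((a, b)) not in self._cut


net = Connectivity(3)
net.disconnect(0, except_=2)
```

With the state hidden behind such helpers, a test only names the servers involved and never touches the underlying connectivity object.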

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
5e324f3438 raft: replication test: rpc with raft_cluster
Use raft_cluster for rpc tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
752d53a909 raft: replication test: use parallel start/stop
Start and stop servers in parallel.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
bcf5181697 raft: replication test: cluster class
Use raft_cluster class to handle servers.

First part of this change.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
5fc0a1251d raft: replication test: helper uuid to local id
Add a helper to convert from UUID to size_t id.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
7e93501d4c raft: replication test: use optional
Instead of tracking with a boolean use an optional for partition leader.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
ccb85bce02 raft: replication test: wait log on next leader only
When there's a defined next leader, only wait for log propagation for
this follower.

Splits wait_log() into waiting for one follower with wait_log() and
waiting for all followers with wait_log_all().

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
2aa1646e35 raft: replication test: remove wait after adding entries
Remove log wait after adding entries. It was added to handle some debug
hangs but it is not good for testing.

There are already wait logs at proper code locations.
(e.g. elect_new_leader, partition)

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
0216d0a7b0 raft: replication test: remove unused param
elect_new_leader doesn't need to know configuration anymore.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
effcb7c5f6 raft: tests: move conversion helpers to header
Move replication test helpers to header.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
7327cbd871 raft: replication test: use structs to avoid alias
Use structs for test commands.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Raphael S. Carvalho
a7cdd846da compaction: Prevent tons of compaction of fully expired sstable from happening in parallel
The compaction manager can start many compactions of fully expired sstables in
parallel, which may consume a significant amount of resources.
This problem is caused by weight being released too early in compaction: after
all data is compacted but before the table is called to update its state (for
example, to replace sstables).
Fully expired sstables aren't actually compacted, so the following can happen:
- compaction 1 starts for expired sst A with weight W, but there's nothing to
  be compacted, so weight W is released, then calls table to update state.
- compaction 2 starts for expired sst B with weight W, but there's nothing to
  be compacted, so weight W is released, then calls table to update state.
- compaction 3 starts for expired sst C with weight W, but there's nothing to
  be compacted, so weight W is released, then calls table to update state.
- compaction 1 is done updating table state, so it finally completes and
  releases all the resources.
- compaction 2 is done updating table state, so it finally completes and
  releases all the resources.
- compaction 3 is done updating table state, so it finally completes and
  releases all the resources.

This happens because, with expired sstables, compaction releases weight
faster than it updates table state, as there's nothing to be compacted.

With my reproducer, it's very easy to reach 50 parallel compactions on a single
shard, but that number can easily be worse depending on the number of sstables
with fully expired data, across all tables. This high parallelism can happen
even with only a couple of tables, if there are many time windows with expired
data, as they can be compacted in parallel.

Prior to 55a8b6e3c9, weight was released earlier in compaction, before
last sstable was sealed, but right now, there's no need to release weight
earlier. Weight can be released in a much simpler way, after the compaction is
actually done. So such compactions will be serialized from now on.
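
The effect of the release point can be illustrated with a small tick-based simulation. This is a toy model in Python, not Scylla's compaction manager: the function name, the single shared weight and the `update_ticks` parameter (how long the table-state update takes) are all assumptions made for illustration.

```python
def peak_parallelism(num_sstables, release_early, update_ticks=50):
    """Peak number of in-flight compactions of fully expired sstables.

    All compactions share one weight W. Nothing is compacted, so each job
    only spends `update_ticks` ticks updating table state.
    """
    weight_held = False
    pending = num_sstables
    in_flight = []  # remaining update ticks per running compaction
    peak = 0
    while pending or in_flight:
        # Start one compaction per tick if weight W is free.
        if pending and not weight_held:
            pending -= 1
            in_flight.append(update_ticks)
            # Early release frees W before the table update even begins.
            weight_held = not release_early
        peak = max(peak, len(in_flight))
        # Advance every table-state update by one tick.
        in_flight = [t - 1 for t in in_flight]
        finished = sum(1 for t in in_flight if t <= 0)
        in_flight = [t for t in in_flight if t > 0]
        # Late release frees W only once the compaction is fully done.
        if finished:
            weight_held = False
    return peak


early = peak_parallelism(50, release_early=True)
late = peak_parallelism(50, release_early=False)
```

In this model `early` is 50 and `late` is 1: releasing the weight only after the compaction is actually done serializes the jobs.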

Fixes #8710.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210527165443.165198-1-raphaelsc@scylladb.com>

[avi: drop now unneeded storage_service_for_tests]
2021-05-30 23:22:51 +03:00
Avi Kivity
791412b046 test: user_defined_function_test: raise Lua timeout
user_defined_function_test fails sporadically in debug mode
due to Lua timeout. Raise the timeout to avoid the failure, but
not so much that the test that expects timeout becomes too slow.

Fixes #8746.

Closes #8747
2021-05-30 13:10:57 +03:00
Piotr Jastrzebski
76d7c761d1 schema: Stop using deprecated constructor
This is another boring patch.

One of schema constructors has been deprecated for many years now but
was used in several places anyway. Usage of this constructor could
lead to data corruption when using MX sstables because this constructor
does not set schema version. MX reading/writing code depends on schema
version.

This patch replaces all the places the deprecated constructor is used
with schema_builder equivalent. The schema_builder sets the schema
version correctly.

Fixes #8507

Test: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <4beabc8c942ebf2c1f9b09cfab7668777ce5b384.1622357125.git.piotr@scylladb.com>
2021-05-30 11:58:27 +03:00
Nadav Har'El
1507bbb35a cql-pytest: increase default server-side timeouts
Sometimes the cql-pytest tests run extremely slowly. This can be
a combination of running the debug build (which is naturally slow)
and a test machine which is overcommitted, or experiencing some
transient swap storm or some similar event. We don't want tests, which
we run on a 100% reliable setups, to fail just because they run into
timeouts in Scylla when they run very slowly.

We already noticed this problem in the past, and increased the CQL client
timeout in conftest.py from the default of 10 seconds to 120 seconds -
the old default of 10 seconds was not enough for some long operations
(such as creating a table with multiple views) when the test ran very
slowly.

However, this only fixed the client-side timeout. We also have a bunch
of server-side timeouts, configured to all sorts of arbitrary (and
fairly small) numbers. For example, the server has a "write request
timeout" option, which defaults to just 2 seconds. We recently saw
this timeout exceeded in a slow run which tried to do a very large
write.

So this patch configures all the configurable server-side timeouts we
have to default to 300 seconds. This should be more than enough for even
the slowest runs (famous last words...). This default is not a good idea
on real multi-node clusters which are expected to deal with node loss,
but this is not the case in cql-pytest.
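
As a rough sketch of what such a configuration amounts to, the helper below builds command-line arguments for a list of millisecond timeout options. It is written in Python for illustration only: the helper is hypothetical, and the option name `write_request_timeout_in_ms` is an assumption based on the "write request timeout" option mentioned above. The full list of options set by the patch is not reproduced here.

```python
def timeout_args(option_names, seconds=300):
    """Build command-line arguments that set each millisecond timeout option."""
    ms = str(seconds * 1000)
    args = []
    for name in option_names:
        # Assumed convention: config names use underscores, flags use dashes.
        args += ["--" + name.replace("_", "-"), ms]
    return args


# Only the option named in the message is used; its spelling is an assumption.
args = timeout_args(["write_request_timeout_in_ms"])
```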

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210529213648.856503-1-nyh@scylladb.com>
2021-05-30 01:20:14 +03:00
Avi Kivity
d3e5b37059 Revert "Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed' from Calle Wilund"
This reverts commit e9c940dbbc, reversing
changes made to 6144656b25. Since it was
merged commitlog_test consistently times out in debug mode.
2021-05-27 21:16:26 +03:00
Wojciech Mitros
725c6aac81 test/perf: close test_env to pass an assert in sstables_manager destructor
When destroying a perf_sstable_test_env, an assert in the sstables_manager
destructor fails, because the manager hasn't been closed.
Fix by removing all references to sstables from perf_sstable_test_env,
and then closing the test_env (as well as the sstables_manager).

Fixes #8736

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #8737
2021-05-27 17:41:17 +03:00
Michał Chojnowski
5e9f741bb4 repair: remove range_split.hh
Dead code since 80ebedd242.

Closes #8698
2021-05-27 17:21:37 +03:00
Avi Kivity
5f8484897b Merge 'cdc: use a new internal table for exchanging generations' from Kamil Braun
Reopening #8286 since the token metadata fix that allows `Everywhere` strategy tables to work with RBO (#8536) has been merged.

---
Currently when a node wants to create and broadcast a new CDC generation
it performs the following steps:
1. choose the generation's stream IDs and mapping (how this is done is
   irrelevant for the current discussion)
2. choose the generation's timestamp by taking the current time
   (according to its local clock) and adding 2 * ring_delay
3. insert the generation's data (mapping and stream IDs) into
   system_distributed.cdc_generation_descriptions, using the
   generation's timestamp as the partition key (we call this table
   the "old internal table" below)
4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP"
   application state.

The timestamp spreads epidemically through the gossip protocol. When
nodes see the timestamp, they retrieve the generation data from the
old internal table.

Unfortunately, due to the schema of the old internal table, where
the entire generation data is stored in a single cell, step 3 may fail for
sufficiently large generations (there is a size threshold for which step
3 will always fail - retrying the operation won't help). Also the old
internal table lies in the system_distributed keyspace that uses
SimpleStrategy with replication factor 3, which is also problematic; for
example, when nodes restart, they must reach at least 2 out of these 3
specific replicas in order to retrieve the current generation (we write
and read the generation data with QUORUM, unless we're a single-node
cluster, where we use ONE). Until this happens, a restarting
node can't coordinate writes to CDC-enabled tables. It would be better
if the node could access the last known generation locally.

The commit introduces a new table for broadcasting generation data with
the following properties:
-  it uses a better schema that stores the data in multiple rows, each
   of manageable size
-  it resides in a new keyspace that uses EverywhereStrategy so the
   data will be written to every node in the cluster that has a token in
   the token ring
-  the data will be written using CL=ALL and read using CL=ONE; thanks
   to this, a restarting node won't have to communicate with other nodes
   to retrieve the data of the last known generation. Note that writing
   with CL=ALL does not reduce availability: creating a new generation
   *requires* all nodes to be available anyway, because they must learn
   about the generation before their clocks go past the generation's
   timestamp; if they don't, partitions won't be mapped to stream IDs
   consistently across the cluster
-  the partition key is no longer the generation's timestamp. Because it
   was that way in the old internal table, it forced the algorithm to
   choose the timestamp *before* the generation data was inserted into
   the table. What if the inserting took a long time? It increased the
   chance that nodes would learn about the generation too late (after
   their clocks moved past its timestamp). With the new schema we will
   first insert the generation data using a randomly generated UUID as
   the partition key, *then* choose the timestamp, then gossip both the
   timestamp and the UUID.
   Observe that after a node learns about a generation broadcasted using
   this new method through gossip it will retrieve its data very quickly
   since it's one of the replicas and it can use CL=ONE as it was
   written using CL=ALL.

The generation's timestamp and the UUID mentioned in the last point form
a "generation identifier" for this new generation. For passing these new
identifiers around, we introduce the cdc::generation_id_v2 type.
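
The change in ordering can be sketched as follows. This is a toy model in Python, not the C++ implementation: the dict standing in for the table, the fake clock, the function names and the 30-second `RING_DELAY` are illustrative assumptions, and `GenerationIdV2` only mirrors the idea of cdc::generation_id_v2 (a timestamp plus a UUID).

```python
import uuid
from dataclasses import dataclass

RING_DELAY = 30.0  # seconds; illustrative value only


@dataclass(frozen=True)
class GenerationIdV2:
    """Toy stand-in for cdc::generation_id_v2."""
    timestamp: float
    id: uuid.UUID


class FakeClock:
    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


def create_generation_old(table, clock, data, insert_latency):
    ts = clock.now + 2 * RING_DELAY  # timestamp chosen *before* the insert
    clock.advance(insert_latency)    # a slow insert eats into the lead time
    table[ts] = data                 # timestamp is the partition key
    return ts, ts - clock.now        # (gossiped id, lead time left)


def create_generation_v2(table, clock, data, insert_latency):
    key = uuid.uuid4()               # random UUID is the partition key
    clock.advance(insert_latency)    # insert first, however long it takes
    table[key] = data
    ts = clock.now + 2 * RING_DELAY  # timestamp chosen *after* the insert
    return GenerationIdV2(ts, key), ts - clock.now


old_id, old_lead = create_generation_old({}, FakeClock(), b"mapping", 20.0)
new_id, new_lead = create_generation_v2({}, FakeClock(), b"mapping", 20.0)
```

With a 20-second insert, the old ordering leaves nodes 40 seconds to learn about the generation, while the new ordering always leaves the full 2 * ring_delay (60 seconds in this model).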

Fixes #7961.

---

For optimal review experience it is best to first read the updated design notes (you can read them rendered here: https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md), specifically the ["Generation switching"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#generation-switching) section followed by the ["Internal generation descriptions table V1 and upgrade procedure"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#internal-generation-descriptions-table-v1-and-upgrade-procedure) section, then read the commits in topological order.

dtest gating run (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/1160/
unit tests (dev) passed locally

Closes #8643

* github.com:scylladb/scylla:
  docs: update cdc.md with info about the new internal table
  sys_dist_ks: don't create old CDC generations table on service initialization
  sys_dist_ks: rename all_tables() to ensured_tables()
  cdc: when creating new generations, use format v2 if possible
  main: pass feature_service to cdc::generation_service
  gms: introduce CDC_GENERATIONS_V2 feature
  cdc: introduce retrieve_generation_data
  test: cdc: include new generations table in permissions test
  sys_dist_ks: increase timeout for create_cdc_desc
  sys_dist_ks: new table for exchanging CDC generations
  tree-wide: introduce cdc::generation_id_v2
2021-05-27 17:13:44 +03:00
Avi Kivity
e8e4456ec7 Merge 'Introduce per-service-level workload types and their first use-case - shedding in interactive workloads' from Piotr Sarna
This draft extends and obsoletes #8123 by introducing a way of determining the workload type from service level parameters, and then using this context to qualify requests for shedding.

The rough idea is that when the admission queue in the CQL server is hit, it might make more sense to start shedding surplus requests instead of accumulating them on the semaphore. The assumption is that interactive workloads are more interested in the success rate of as many requests as possible, and hanging on a semaphore reduces the chances for a request to succeed. Thus, it may make sense to shed some requests to reduce the load on this coordinator and let the existing requests finish.
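
A minimal sketch of that policy, in Python rather than the server's C++: the class, its method names and the return strings are hypothetical, and any workload type other than "interactive" is simply treated as queueable here.

```python
from collections import deque


class AdmissionControl:
    """Toy admission queue: shed interactive requests instead of queueing them."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.waiting = deque()

    def admit(self, request, workload_type):
        if self.in_flight < self.max_in_flight:
            self.in_flight += 1
            return "admitted"
        if workload_type == "interactive":
            # Waiting would likely end in a timeout anyway, so fail fast.
            return "shed"
        self.waiting.append(request)
        return "queued"

    def finish(self):
        # A request completed; hand its slot to the oldest queued one, if any.
        if self.waiting:
            self.waiting.popleft()
        else:
            self.in_flight -= 1


ac = AdmissionControl(max_in_flight=2)
results = [
    ac.admit("r1", "interactive"),
    ac.admit("r2", "interactive"),
    ac.admit("r3", "interactive"),  # over capacity: shed
    ac.admit("r4", "other"),        # over capacity: queued
]
```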

It's a draft, because I only performed local guided tests. #8123 was followed by some experiments on a multinode cluster which I want to rerun first.

Closes #8680

* github.com:scylladb/scylla:
  test: add a case for conflicting workload types
  cql-pytest: add basic tests for service level workload types
  docs: describe workload types for service levels
  sys_dist_ks: fix redundant parsing in get_service_level
  sys_dist_ks: make get_service_level exception-safe
  transport: start shedding requests during potential overload
  client_state: hook workload type from service levels
  cql3: add listing service level workload type
  cql3: add persisting service level workload type
  qos: add workload_type service level parameter
2021-05-27 17:01:56 +03:00
Konstantin Osipov
52f7ff4ee4 raft: (testing) update copyright
Incorrect copyright information was copy-pasted
from another test file.

Message-Id: <20210525183919.1395607-1-kostja@scylladb.com>
2021-05-27 15:47:49 +03:00
Piotr Sarna
99f356d764 test: add a case for conflicting workload types
The test case verifies that if several workload types are effective
for a single role, the conflict resolution is well defined.
2021-05-27 14:31:36 +02:00
Piotr Sarna
01b7e445f9 cql-pytest: add basic tests for service level workload types
The test cases check whether it's possible to declare a workload
type for a service level and whether its input is validated.
2021-05-27 14:31:36 +02:00