Commit Graph

620 Commits

Author SHA1 Message Date
Kefu Chai
a5e696fab8 storage_service, test: drop unused storage_service_config
this setting was removed back in
dcdd207349, so despite that we are still
passing `storage_service_config` to the ctor of `storage_service`,
`storage_service::storage_service()` just drops it on the floor.

in this change, `storage_service_config` class is removed, and all
places referencing it are updated accordingly.

Signed-off-by: Kefu Chai <tchaikov@gmail.com>

Closes #11415
2022-08-31 19:49:13 +03:00
Tomasz Grabiec
1d0264e1a9 Merge 'Implement Raft upgrade procedure' from Kamil Braun
Start with a cluster with Raft disabled, end up with a cluster that performs
schema operations using group 0.

Design doc:
https://docs.google.com/document/d/1PvZ4NzK3S0ohMhyVNZZ-kCxjkK5URmz1VP65rrkTOCQ/
(TODO: replace this with .md file - we can do it as a follow-up)

The procedure, on a high level, works as follows:
- join group 0
- wait until every peer joined group 0 (peers are taken from `system.peers`
  table)
- enter `synchronize` upgrade state, in which group 0 operations are disabled
- wait until all members of group 0 entered `synchronize` state or some member
  entered the final state
- synchronize schema by comparing versions and pulling if necessary
- enter the final state (`use_new_procedures`), in which group 0 is used for
  schema operations.

With the procedure comes a recovery mode in case the upgrade procedure gets
stuck (and it may if we lose a node during recovery - the procedure, to
correctly establish a single group 0 cluster, requires contacting every node).

This recovery mode can also be used to recover clusters with group 0 already
established if they permanently lose a majority of nodes - killing two birds with
one stone. Details in the last commit message.

Read the design doc, then read the commits in topological order
for best reviewing experience.

---

I did some manual tests: upgrading a cluster, using the cluster to add nodes,
remove nodes (both with `decommission` and `removenode`), replacing nodes.
Performing recovery.

As a follow-up, we'll need to implement tests using the new framework (after
it's ready). It will be easy to test upgrades and recovery even with a single
Scylla version - we start with a cluster with the RAFT flag disabled, then
rolling-restart while enabling the flag (and recovery is done through simple
CQL statements).

Closes #10835

* github.com:scylladb/scylladb:
  service/raft: raft_group0: implement upgrade procedure
  service/raft: raft_group0: extract `tracker` from `persistent_discovery::run`
  service/raft: raft_group0: introduce local loggers for group 0 and upgrade
  service/raft: raft_group0: introduce GET_GROUP0_UPGRADE_STATE verb
  service/raft: raft_group0_client: prepare for upgrade procedure
  service/raft: introduce `group0_upgrade_state`
  db: system_keyspace: introduce `load_peers`
  idl-compiler: introduce cancellable verbs
  message: messaging_service: cancellable version of `send_schema_check`
2022-08-25 11:32:06 +03:00
Kamil Braun
e350e37605 service/raft: raft_group0: implement upgrade procedure
A listener is created inside `raft_group0` for acting when the
SUPPORTS_RAFT feature is enabled. The listener is established after the
node enters NORMAL status (in `raft_group0::finish_setup_after_join()`,
called at the end of `storage_service::join_cluster()`).

The listener starts the `upgrade_to_group0` procedure.

The procedure, on a high level, works as follows:
- join group 0
- wait until every peer joined group 0 (peers are taken from
  `system.peers` table)
- enter `synchronize` upgrade state, in which group 0 operations are
  disabled (see earlier commit which implemented this logic)
- wait until all members of group 0 entered `synchronize` state or some
  member entered the final state
- synchronize schema by comparing versions and pulling if necessary
- enter the final state (`use_new_procedures`), in which group 0 is used
  for schema operations (only those for now).

The devil lies in the details, and the implementation is ugly compared
to this nice description; for example there are many retry loops for
handling intermittent network failures. Read the code.

`leave_group0` and `remove_group0` were adjusted to handle the upgrade
procedure being run correctly; if necessary, they will wait for the
procedure to finish.

If the upgrade procedure gets stuck (and it may, since it requires all
nodes to be available to contact them to correctly establish a single
group 0 raft cluster); or if a running cluster permanently loses a
majority of nodes, causing group 0 unavailability; the cluster admin
is not left without help.

We introduce a recovery mode, which allows the admin to
completely get rid of traces of existing group 0 and restart the
upgrade procedure - which will establish a new group 0. This works even
in clusters that never upgraded but were bootstrapped using group 0 from
scratch.

To do that, the admin does the following on every node:
- writes 'recovery' under 'group0_upgrade_state' key
  in `system.scylla_local` table,
- truncates the `system.discovery` table,
- truncates the `system.group0_history` table,
- deletes group 0 ID and group 0 server ID from `system.scylla_local`
  (the keys are `raft_group0_id` and `raft_server_id`
then the admin performs a rolling restart of their cluster. The nodes
restart in a "group 0 recovery mode", which simply means that the nodes
won't try to perform any group 0 operations. Then the admin calls
`removenode` to remove the nodes that are down. Finally, the admin
removes the `group0_upgrade_state` key from `system.scylla_local`,
rolling-restarts the cluster, and the cluster should establish group 0
anew.

Note that this recovery procedure will have to be extended when new
stuff is added to group 0 - like topology change state. Indeed, observe
that a minority of nodes aren't able to receive committed entries from a
leader, so they may end up in inconsistent group 0 states. It wouldn't
be safe to simply create group 0 on those nodes without first ensuring
that they have the same state from which group 0 will start.
Right now the state only consist of schema tables, and the upgrade
procedure ensures to synchronize them, so even if the nodes started in
inconsistent schema states, group 0 will correctly be established.
(TODO: create a tracking issue? something needs to remind us of this
 whenever we extend group 0 with new stuff...)
2022-08-23 13:51:01 +02:00
Botond Dénes
331033adae Merge 'Fix frozen mutation consume ordering' from Benny Halevy
Currently, frozen_mutation is not consumed in position_in_partition
order as all range tombstones are consumed before all rows.

This violates the range_tombstone_generator invariants
as its lower_bound needs to be monotonically increasing.

Fix this by adding mutation_partition_view::accept_ordered
and rewriting do_accept_gently to do the same,
both making sure to consume the range tombstones
and clustering rows in position_in_partition order,
similar to the mutation consume_clustering_fragments function.

Add a unit test that verifies that.

Fixes #11198

Closes #11269

* github.com:scylladb/scylladb:
  mutation_partition_view: make mutation_partition_view_virtual_visitor stoppable
  frozen_mutation: consume and consume_gently in-order
  frozen_mutation: frozen_mutation_consumer_adaptor: rename rt to rtc
  frozen_mutation: frozen_mutation_consumer_adaptor: return early when flush returns stop_iteration::yes
  frozen_mutation: frozen_mutation_consumer_adaptor: consume static row unconditionally
  frozen_mutation: frozen_mutation_consumer_adaptor: flush current_row before rt_gen
2022-08-23 06:37:04 +03:00
Mikołaj Sielużycki
b5380baf8a frozen_mutation: consume and consume_gently in-order
Currently, frozen_mutation is not consumed in position_in_partition
order as all range tombstones are consumed before all rows.

This violates the range_tombstone_generator invariants
as its lower_bound needs to be monotonically increasing.

Fix this by adding mutation_partition_view::accept_ordered
and rewriting do_accept_gently to do the same,
both making sure to consume the range tombstones
and clustering rows in position_in_partition order,
similar to the mutation consume_clustering_fragments function.

Add a unit test that verifies that.

Fixes #11198

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-22 20:12:20 +03:00
Piotr Sarna
484004e766 Merge 'Fix mutation commutativity with shadowable tombstone'
from Tomasz Grabiec

This series fixes lack of mutation associativity which manifests as
sporadic failures in
row_cache_test.cc::test_concurrent_reads_and_eviction due to differences
in mutations applied and read.

No known production impact.

Refs https://github.com/scylladb/scylladb/issues/11307

Closes #11312

* github.com:scylladb/scylladb:
  test: mutation_test: Add explicit test for mutation commutativity
  test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones
  db: mutation_partition: Drop unnecessary maybe_shadow()
  db: mutation_partition: Maintain shadowable tombstone invariant when applying a hard tombstone
  mutation_partition: row: make row marker shadowing symmetric
2022-08-20 16:46:32 +02:00
Kamil Braun
43687be1f1 service/raft: raft_group0_client: prepare for upgrade procedure
Now, whether an 'group 0 operation' (today it means schema change) is
performed using the old or new methods, doesn't depend on the local RAFT
fature being enabled, but on the state of the upgrade procedure.

In this commit the state of the upgrade is always
`use_pre_raft_procedures` because the upgrade procedure is not
implemented yet. But stay tuned.

The upgrade procedure will need certain guarantees: at some point it
switches from `use_pre_raft_procedures` to `synchronize` state. During
`synchronize` schema changes must be disabled, so the procedure can
ensure that schema is in sync across the entire cluster before
establishing group 0. Thus, when the switch happens, no schema change
can be in progress.

To handle all this weirdness we introduce `_upgrade_lock` and
`get_group0_upgrade_state` which takes this lock whenever it returns
`use_pre_raft_procedures`. Creating a `group0_guard` - which happens at
the start of every group 0 operation - will take this lock, and the lock
holder shall be stored inside the guard (note: the holder only holds the
lock if `use_pre_raft_procedures` was returned, no need to hold it for
other cases). Because `group0_guard` is held for the entire duration of
a group 0 operation, and because the upgrade procedure will also have to
take this lock whenever it wants to change the upgrade state (it's an
rwlock), this ensures that no group 0 operation that uses the old ways
is happening when we change the state.

We also implement `wait_until_group0_upgraded` using a condition
variable. It will be used by certain methods during upgrade (later
commits; stay tuned).

Some additional comments were written.
2022-08-19 19:15:19 +02:00
Benny Halevy
7747b8fa33 sstables: define run_identifier as a strong tagged_uuid type
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11321
2022-08-18 19:03:10 +03:00
Tomasz Grabiec
3d9efee3bf test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones
Given 3 row mutations:

m1 = {
      marker: {row_marker: dead timestamp=-9223372036854775803},
      tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775807, deletion_time=0}, {tombstone: none}}
}

m2 = {
      marker: {row_marker: timestamp=-9223372036854775805}
}

m3 = {
      tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775806, deletion_time=2}, {tombstone: none}}
}

We get different shadowable tombstones depending on the order of merging:

(m1 + m2) + m3 = {
       marker: {row_marker: dead timestamp=-9223372036854775803},
       tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775806, deletion_time=2}, {tombstone: none}}

m1 + (m2 + m3) = {
       marker: {row_marker: dead timestamp=-9223372036854775803},
       tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775807, deletion_time=0}, {tombstone: none}}
}

The reason is that in the second case the shadowable tombstone in m3
is shadwed by the row marker in m2. In the first case, the marker in
m2 is cancelled by the dead marker in m1, so shadowable tombstone in
m3 is not cancelled (the marker in m1 does not cancel because it's
dead).

This wouldn't happen if the dead marker in m1 was accompanied by a
hard tombstone of the same timestamp, which would effectively make the
difference in shadowable tombstones irrelevant.

Found by row_cache_test.cc::test_concurrent_reads_and_eviction.

I'm not sure if this situation can be reached in practice (dead marker
in mv table but no row tombstone).

Work it around for tests by producing a row tombstone if there is a
dead marker.

Refs #11307
2022-08-17 17:39:54 +02:00
Botond Dénes
c8ef356859 test/lib: move convenience table config factory to sstable_test_env
All users of `column_family_test_config()`, get the semaphore parameter
for it from `sstable_test_env`. It is clear that the latter serves as
the storage space for stable objects required by the table config. This
patch just enshrines this fact by moving the config factory method to
`sstable_test_env`, so it can just get what it needs from members.
2022-08-15 11:23:59 +03:00
Botond Dénes
c0e017e0f7 test/lib/sstable_test_env: move members to impl struct
All present members of sstable_test_env are std::unique_ptr<>:s because
they require stable addresses. This makes their handling somewhat
awkward. Move all of them into an internal `struct impl` and make that
member a unique ptr.
2022-08-15 11:20:09 +03:00
Botond Dénes
a9f296ed47 test/lib/sstable_utils: use test_env::do_with_async()
Instead of manually instantiating test_env.
2022-08-15 11:19:27 +03:00
Benny Halevy
d295d8e280 everywhere: define locator::host_id as a strong tagged_uuid type
So it can be distinguished from other uuid-based
identifiers in the system.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11276
2022-08-12 06:01:44 +03:00
Tomasz Grabiec
8ee5b69f80 test: row_cache: Use more narrow key range to stress overlapping reads more
This makes catching issues related to concurrent access of same or
adjacent entries more likely. For example, catches #11239.

Closes #11260
2022-08-10 06:53:54 +03:00
Benny Halevy
257d74bb34 schema, everywhere: define and use table_id as a strong type
Define table_id as a distinct utils::tagged_uuid modeled after raft
tagged_id, so it can be differentiated from other uuid-class types,
in particular from table_schema_version.

Fixes #11207

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:09:41 +03:00
Benny Halevy
813cffc2b5 counters: counter_id: use base class create_random_id
Rather than defining generate_random,
and use respectively in unit tests.
(It was inherited from raft::internal::tagged_id.)

This allows us to shorten counter_id's definition
to just using utils::tagged_uuid<struct counter_id_tag>.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
e4e92d44ae main: start compaction_manager as a sharded service
And pass a reference to it to the database rather
than having the database construct its own compaction_manager.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-02 07:50:15 +03:00
Aleksandra Martyniuk
7d457cffb8 scrub compaction: count validation errors for specific scrub task
The number of validation errors per given compaction scrub on given
shard is passed up to perform_task() function.
2022-07-29 09:35:20 +02:00
Avi Kivity
e66809d051 Merge 'Memtable flush: wait for sstable count reduction if needed' from Benny Halevy
Called from try_flush_memtable_to_sstable,
maybe_wait_for_sstable_count_reduction will wait for
compaction to catch up with memtable flush if there
the bucket to compact is inflated, having too many
sstables.  In that case we don't want to add fuel
to the fire by creating yet another sstable.

Fixes #4116

Closes #10954

* github.com:scylladb/scylla:
  table: Add test where compaction doesn't keep up with flush rate.
  compaction_manager: add maybe_wait_for_sstable_count_reduction
  time_window_compaction_strategy: get_sstables_for_compaction: clean up code
  time_window_compaction_strategy: make get_sstables_for_compaction idempotent
  time_window_compaction_strategy: get_sstables_for_compaction: improve debug messages
  leveled_manifest: pass compaction_counter as const&
2022-07-28 19:11:04 +03:00
Mikołaj Sielużycki
e0c6e1ef3c table: Add test where compaction doesn't keep up with flush rate.
The test simulates a situation where 2 threads issue flushes to 2
tables. Both issue small flushes, but one has injected reactor stalls.
This can lead to a situation where lots of small sstables accumulate on
disk, and, if compaction never has a chance to keep up, resources can be
exhausted.

(cherry picked from commit b5684aa96d)
(cherry picked from commit 25407a7e41)
2022-07-28 14:43:33 +03:00
Mikołaj Sielużycki
9c43f1266a test: Move validating_consumer to test/lib/mutation_assertions.hh 2022-07-27 11:19:50 +02:00
Pavel Emelyanov
a246b6d3eb streaming: Pass db::config& to manager constructor
The stream_manager will bookkeep the streaming bandwidth option, to
subscribe on its changes it needs the config reference. It would be
better if it was stream_manager::config, but currently subscription on
db::config::<stuff> updates is not very shard-friendly, so we need to
carry the config reference itself around.

Similar trouble is there for compaction_manager. The option is passed
through its own config, but the config is created on each shard by
database code. Stream manager config would be created once by main code
on shard 0.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-19 12:18:08 +03:00
Raphael S. Carvalho
a176022272 compaction_manager: task: switch to table_state
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
b5417096e2 compaction_manager: make propagate_replacement() switch to table_state
propagate_replacement is used by incremental compaction to notify
ongoing compaction about sstable list updates, such that the
ongoing job won't hold reference to exhausted sstables.

So it needs to switch to table_state, too.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
f52ad722f3 compaction_manager: rename table_state's get_sstable_set to main_sstable_set
With compaction_manager switching to table_state, we'll need to
introduce a method in table_state to return maintenance set.
So better to have a descriptive name for main set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-13 11:12:33 -03:00
Nadav Har'El
cc69177dcc config: fix printing of experimental feature list
Recently we noticed a regression where with certain versions of the fmt
library,

   SELECT value FROM system.config WHERE name = 'experimental_features'

returns string numbers, like "5", instead of feature names like "raft".

It turns out that the fmt library keep changing their overload resolution
order when there are several ways to print something. For enum_option<T> we
happen to have to conflicting ways to print it:
  1. We have an explicit operator<<.
  2. We have an *implicit* convertor to the type held by T.

We were hoping that the operator<< always wins. But in fmt 8.1, there is
special logic that if the type is convertable to an int, this is used
before operator<<()! For experimental_features_t, the type held in it was
an old-style enum, so it is indeed convertible to int.

The solution I used in this patch is to replace the old-style enum
in experimental_features_t by the newer and more recommended "enum class",
which does not have an implicit conversion to int.

I could have fixed it in other ways, but it wouldn't have been much
prettier. For example, dropping the implicit convertor would require
us to change a bunch of switch() statements over enum_option (and
not just experimental_features_t, but other types of enum_option).

Going forward, all uses of enum_option should use "enum class", not
"enum". tri_mode_restriction_t was already using an enum class, and
now so does experimental_features_t. I changed the examples in the
comments to also use "enum class" instead of enum.

This patch also adds to the existing experimental_features test a
check that the feature names are words that are not numbers.

Fixes #11003.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11004
2022-07-11 09:17:30 +02:00
Tomasz Grabiec
c5ad05c819 db: Allow splitting initiatlization of system tables
We will need some system tables to be initialized earlier in the boot
so that system.scylla_local can be read before schema tables are
initialized.
2022-07-06 22:08:56 +02:00
Avi Kivity
419fe65259 Revert "Merge 'Block flush until compaction finishes if sstables accumulate' from Mikołaj Sielużycki"
This reverts commit aa8f135f64, reversing
changes made to 9a88bc260c. The patch
causes hangs during flush.

Also reverts parts of 411231da75 that impacted the unit test.

Fixes #10897.
2022-07-06 12:19:02 +03:00
Tomasz Grabiec
8f3349b407 test: lib: flat_mutation_reader_assertion: Add trace-level logging of read fragments
Message-Id: <20220629153926.137824-1-tgrabiec@scylladb.com>
2022-06-30 08:43:30 +03:00
Pavel Emelyanov
85033ea6ae Merge 'A bunch of refactors related to Raft group 0' from Kamil Braun
The commits here were extracted from PR https://github.com/scylladb/scylla/pull/10835 which implements upgrade procedure for Raft group 0.

They are mostly refactors which don't affect the behavior of the system, except one: the commit 4d439a16b3 causes all schema changes to be bounced to shard 0. Previously, they would only be bounced when the local Raft feature was enabled. I do that because:
1. eventually, we want this to be the default behavior
2. in the upgrade PR I remove the `is_raft_enabled()` function - the function was basically created with the mindset "Raft is either enabled or not" - which was right when we didn't support upgrade, but will be incorrect when we introduce intermediate states (when we upgrade from non-raft-based to raft-based operations); the upgrade PR introduces another mechanism to dispatch based on the upgrade state, but for the case of bouncing to shard 0, dispatching is simply not necessary.

Closes #10864

* github.com:scylladb/scylla:
  service/raft: raft_group_registry: add assertions when fetching servers for groups
  service/raft: raft_group_registry: remove `_raft_support_listener`
  service/raft: raft_group0: log adding/removing servers to/from group 0 RPC map
  service/raft: raft_group0: move group 0 RPC handlers from `storage_service`
  service/raft: messaging: extract raft_addr/inet_addr conversion functions
  service: storage_service: initialize `raft_group0` in `main` and pass a reference to `join_cluster`
  treewide: remove unnecessary `migration_manager::is_raft_enabled()` calls
  test/boost: memtable_test: perform schema operations on shard 0
  test/boost: cdc_test: remove test_cdc_across_shards
  message: rename `send_message_abortable` to `send_message_cancellable`
  message: change parameter order in `send_message_oneway_timeout`
2022-06-29 16:51:54 +03:00
Pavel Emelyanov
3a753068be Merge "Make permissions cache live updateable and add an API for resetting authorization cache" from Igor Ribeiro Barbosa Duarte
Currently, for users who have permissions_cache configs set to very high
values (and thus can't wait for the configured times to pass) having to restart
the service every time they make a change related to permissions or
prepared_statements cache (e.g. Adding a user and changing their permissions)
can become pretty annoying.
This patch series make permissions_validity_in_ms, permissions_update_interval_in_ms
and permissions_cache_max_entries live updateable so that restarting the
service is not necessary anymore for these cases.
It also adds an API for flushing the cache to make it easier for users who
don't want to modify their permissions_cache config.

branch: https://github.com/igorribeiroduarte/scylla/tree/make_permissions_cache_live_updateable
CI: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1005/
dtests: https://github.com/igorribeiroduarte/scylla-dtest/tree/test_permissions_cache

* https://github.com/igorribeiroduarte/scylla/make_permissions_cache_live_updateable:
  loading_cache_test: Test loading_cache::reset and loading_cache::update_config
  api: Add API for resetting authorization cache
  authorization_cache: Make permissions cache and authorized prepared statements cache live updateable
  auth_prep_statements_cache: Make aut_prep_statements_cache accept a config struct
  utils/loading_cache.hh: Add update_config method
  utils/loading_cache.hh: Rename permissions_cache_config to loading_cache_config and move it to loading_cache.hh
  utils/loading_cache.hh: Add reset method
2022-06-29 11:14:13 +03:00
Igor Ribeiro Barbosa Duarte
c8c48a98fa auth_prep_statements_cache: Make aut_prep_statements_cache accept a config struct
This patch makes authorized_prepared_statements_cache acccept a config struct,
similarly to permissions_cache. This will make it easier to make this cache
live updateable on the next patch.

Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
2022-06-28 19:57:52 -03:00
Igor Ribeiro Barbosa Duarte
667840a7eb utils/loading_cache.hh: Rename permissions_cache_config to loading_cache_config and move it to loading_cache.hh
This patch renames the permissions_cache_config struct to loading_cache_config
and moves it to utils/loading_cache.hh. This will make it easier to handle
config updates to the authorization caches on the next patches

Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
2022-06-28 19:46:22 -03:00
Botond Dénes
6c818f8625 Merge 'sstables: generation_type tidy-up' from Michael Livshin
- Use `sstables::generation_type` in more places
- Enforce conceptual separation of `sstables::generation_type` and `int64_t`
- Fix `extremum_tracker` so that `sstables::generation_type` can be non-default-constructible

Fixes #10796.

Closes #10844

* github.com:scylladb/scylla:
  sstables: make generation_type an actual separate type
  sstables: use generation_type more soundly
  extremum_tracker: do not require default-constructible value types
2022-06-28 08:50:12 +03:00
Botond Dénes
1f4f8ba773 Merge 'compaction_manager: track if off-startegy compaction was performed in run_offstrategy_compaction' from Benny Halevy
This series moves the logic to not perform off-strategy compaction if the maintenance set is empty from the table layer down to the compaction_manager layer since it is the one that needs to make the decision.

With that compaction_manager::perform_offstrategy will return a future<bool> which resolves to true
iff off-strategy compaction was required and performed.

The sstable_compaction_test was adjusted and a new compaction_manager_for_testing class was added
to make sure the compaction manager is enabled when constructed (it wasn't so test_offstrategy_sstable_compaction didn't perform any off-strategy compactions!) and stopped before destroyed.

Closes #10848

* github.com:scylladb/scylla:
  table: perform_offstrategy_compaction: move off-strategy logic to compaction_manager
  compaction_manager: offstrategy_compaction_task: refactor log printouts
  test: sstable_compaction: compaction_manager_for_testing
2022-06-24 08:04:02 +03:00
Kamil Braun
bb58ee0b2e service/raft: raft_group_registry: remove _raft_support_listener
It did nothing.
It will be readded in `raft_group0` and it will do something, stay
tuned.

With this we can remove the `feature_service` reference from
`raft_group_registry`.
2022-06-23 16:14:41 +02:00
Kamil Braun
5da163e0b8 service: storage_service: initialize raft_group0 in main and pass a reference to join_cluster
`raft_group0` was constructed at the beginning of `join_cluster`, which
required passing references to 3 additional services to `join_cluster`
used only for that purpose (group 0 client, raft group registry, and
query processor).

Now we initialize `raft_group0` in main - like all other services - and
pass a reference to `join_cluster` so `storage_service` can store a
pointer to group 0.

We initialize `raft_group0` before we start listening for RPCs in
`messaging_service`. In a later commit we'll move the initialization
of group 0 related verbs to the constructor of `raft_group0` from
`storage_service`, so they will be initialized before we start
listening for RPCs.
2022-06-23 16:14:41 +02:00
Benny Halevy
34e9391587 test: sstable_compaction: compaction_manager_for_testing
Make the compaction manager for testing using
this class.

Makes sure to enable the compaction manager
and to stop it before it's destroyed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-23 08:02:44 +03:00
Piotr Dulikowski
761a037afb config: add add_per_partition_rate_limit_extension function for testing
...and use it in cql_test_env to enable the per_partition_rate_limit
extension for all tests that use it.
2022-06-22 20:16:49 +02:00
Michael Livshin
ab13127761 sstables: use generation_type more soundly
`generation_type` is (supposed to be) conceptually different from
`int64_t` (even if physically they are the same), but at present
Scylla code still largely treats them interchangeably.

In addition to using `generation_type` in more places, we
provide (no-op) `generation_value()` and `generation_from_value()`
operations to make the smoke-and-mirrors more believable.

The churn is considerable, but all mechanical.  To avoid even
more (way, way more) churn, unit test code is left untreated for
now, except where it uses the affected core APIs directly.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-06-20 19:37:31 +03:00
Botond Dénes
4bd4aa2e88 Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec
When memtable receives a tombstone it can happen under some workloads
that it covers data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drop tombstoned data.

One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.

Fixes #652.

Closes #10807

* github.com:scylladb/scylla:
  memtable: Add counters for tombstone compaction
  memtable, cache: Eagerly compact data with tombstones
  memtable: Subtract from flushed memory when cleaning
  mvcc: Introduce apply_resume to hold state for partition version merging
  test: mutation: Compare against compacted mutations
  compacting_reader: Drop irrelevant tombstones
  mutation_partition: Extract deletable_row::compact_and_expire()
  mvcc: Apply mutations in memtable with preemption enabled
  test: memtable: Make failed_flush_prevents_writes() immune to background merging
2022-06-15 18:12:42 +03:00
Avi Kivity
aa8f135f64 Merge 'Block flush until compaction finishes if sstables accumulate' from Mikołaj Sielużycki
If we reach a situation where flush rate exceeds compaction rate, we may
end up with arbitrarily large number of sstables on disk. If a read is
executed in such case, the amount of memory required is proportional to
the number of sstables for the given shard, which in extreme cases can
lead to OOM.

In the wild, this was observed in 2 scenarios:
- A node with >10 shards creates a keyspace with thousands of tables,
  drops the keyspace and shuts down before compaction finishes. Dropping
  keyspace drops tables, and each dropped table is smp::count writes to
  system.local table with flush after write, which creates tens of
  thousands of sstables. Bootstrap read from system.local will run OOM.
- A failure to agree on table schema (due to a code bug) between nodes
  during repair resulted in excessive flushing of small sstables which
  compaction couldn't keep up with.

In the unit test introduced in this patch series it can be proved that
even hard setting maximum shares for compaction and minimum shares for
flushing doesn't tilt the balance towards compaction enough to prevent
the problem. Since it's a fast producer, slow consumer problem, the
remaining solution is to block producer until the consumer catches up.
If there are too many table runs originating from memtable, we block the
current flush until the number of sstables is reduced (via ongoing
compaction or a truncate operation).

Fixes https://github.com/scylladb/scylla/issues/4116

Changelog:
v5:
- added a nicer way of timing the stalls caused by waiting for flush
- added predicate on signal when waiting for reduction of the number of sstables to correctly handle spurious wake ups
- added comment why we trigger compaction before waiting for sstable count reduction
- removed unnecessary cv.signal from table::stop

v4:
- removed conversion of table::stop to coroutines. It's an orthogonal change and doesn't need to go into this patchset

v3:
- removed unnecessary change to scheduling groups from v2
- moved sstables_changed signalling to suggested place in table::stop
- added log how long the table flush was blocked for
- changed the threshold to max(schema()->max_compaction_threshold(), 32) and comparison to <=

v2:
- Reimplemented waiting algorithm based on reviewers' feedback. It's confined to the table class and it waits in a loop until the number of sstable runs goes below threshold. It uses condition variable which is signaled on sstable set refresh. It handles node shutdown as well.
- Converted table::stop to coroutines.
- Reordered commits so that test is committed after fix, so it doesn't trip up bisection.

Closes #10717

* github.com:scylladb/scylla:
  table: Add test where compaction doesn't keep up with flush rate.
  random_mutation_generator: Add option to specify ks_name and cf_name
  table: Prevent creating unbounded number of sstables
2022-06-15 14:51:08 +03:00
Tomasz Grabiec
02c92d5ea2 test: mutation: Compare against compacted mutations
Memtables and cache will compact eagerly, so tests should not expect
readers to produce exact mutations written, only those which are
equivalant after applying copmaction.
2022-06-15 11:30:01 +02:00
Tomasz Grabiec
570b76bc5b compacting_reader: Drop irrelevant tombstones
The compacting reader created using make_compacting_reader() was not
dropping range_tombstone_change fragments which were shadowed by the
partition tombstones. As a result the output fragment stream was not
minimal.

Lack of this change would cause problems in unit tests later in the
series after the change which makes memtables lazily compact partition
versions. In test_reverse_reader_reads_in_native_reverse_order we
compare output of two readers, and assume that compacted streams are
the same. If compacting reader doesn't produce minimal output, then
the streams could differ if one of them went through the compaction in
the memtable (which is minimal).
2022-06-15 11:30:01 +02:00
Mikołaj Sielużycki
b5684aa96d random_mutation_generator: Add option to specify ks_name and cf_name 2022-06-15 10:57:28 +02:00
Pavel Emelyanov
9a88bc260c Merge 'various group0 start/stop issues' from Gleb
The series fixes a couple of crashes that were found during starting and
stopping Scylla with raft while doing ddl operations. Most of them
related to shutdown order between different components.

Also in scylla-dev gleb/group0-fixes-v1

CI https://jenkins.scylladb.com/job/releng/job/Scylla-CI/749/

* origin-dev/gleb/group0-fixes-v1:
  migration manager: remove unused code
  db/system_distributed_keyspace: do not announce empty schema
  main: stop raft before the migration manager
  storage_service: do not pass the raft group manager to storage_service constructor
  main: destroy the group0_client after stopping the group0
2022-06-15 11:44:03 +03:00
Avi Kivity
5129280f45 Revert "Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec"
This reverts commit e0670f0bb5, reversing
changes made to 605ee74c39. It causes failures
in debug mode in
database_test.test_database_with_data_in_sstables_is_a_mutation_source_plain,
though with low probability.

Fixes #10780
Reopens #652.
2022-06-14 18:06:22 +03:00
Gleb Natapov
70b7b2b4d6 storage_service: do not pass the raft group manager to storage_service constructor
Reduce the storage_service's dependency on the raft group manager. The
group manager is needed only during bootstrap and in an rpc handler, so
pass it to those functions directly.
2022-06-09 09:40:55 +03:00
Tomasz Grabiec
374234cf76 test: mutation: Compare against compacted mutations
Memtables and cache will compact eagerly, so tests should not expect
readers to produce exact mutations written, only those which are
equivalant after applying copmaction.
2022-06-06 19:25:40 +02:00
Tomasz Grabiec
604e720706 compacting_reader: Drop irrelevant tombstones
The compacting reader created using make_compacting_reader() was not
dropping range_tombstone_change fragments which were shadowed by the
partition tombstones. As a result the output fragment stream was not
minimal.

Lack of this change would cause problems in unit tests later in the
series after the change which makes memtables lazily compact partition
versions. In test_reverse_reader_reads_in_native_reverse_order we
compare output of two readers, and assume that compacted streams are
the same. If compacting reader doesn't produce minimal output, then
the streams could differ if one of them went through the compaction in
the memtable (which is minimal).
2022-06-06 19:23:37 +02:00