Commit Graph

29908 Commits

Kamil Braun
4a52b802ac test: unit test for group 0 concurrent change protection and CQL DDL retries
Check that group 0 history grows iff a schema change does not throw
`group0_concurrent_modification`. Check that the CQL DDL statement retry
mechanism works as expected.
2022-01-27 11:26:15 +01:00
Kamil Braun
edd8344706 cql3: statements: schema_altering_statement: automatically retry in presence of concurrent changes
Schema changes on top of Raft do not allow concurrent changes.
If two changes are attempted concurrently, one of them gets
`group0_concurrent_modification` exception.

Catch the exception in CQL DDL statement execution function and retry.

In addition, the description of CQL DDL statements in group 0 history
table was improved.
2022-01-27 11:26:14 +01:00
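Sketched below is what such a retry loop can look like in plain C++. The exception type name matches the one mentioned above, but `try_execute_ddl` and the surrounding structure are stand-ins, not the actual Scylla code:

```cpp
#include <cstdio>
#include <stdexcept>

// Stand-in for the exception thrown when a concurrent group 0 change wins the race.
struct group0_concurrent_modification : std::runtime_error {
    group0_concurrent_modification() : std::runtime_error("concurrent group 0 modification") {}
};

// Hypothetical DDL execution step; in Scylla this would re-read state,
// re-validate, and announce schema mutations under a fresh group 0 guard.
bool try_execute_ddl(int attempt) {
    if (attempt < 2) {
        throw group0_concurrent_modification(); // simulate losing the race twice
    }
    return true;
}

int main() {
    // Retry the whole statement until it applies without a concurrent modification.
    for (int attempt = 0; ; ++attempt) {
        try {
            try_execute_ddl(attempt);
            std::printf("DDL applied after %d retries\n", attempt);
            break;
        } catch (const group0_concurrent_modification&) {
            std::printf("concurrent modification detected, retrying\n");
        }
    }
}
```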
Tomasz Grabiec
ba6c02b38a Merge "Clear old entries from group 0 history when performing schema changes" from Kamil
When performing a change through group 0 (which right now means schema
changes), clear entries from group 0 history table which are older
than one week.

This is done by including an appropriate range tombstone in the group 0
history table mutation.

* kbr/g0-history-gc-v2:
  idl: group0_state_machine: fix license blurb
  test: unit test for clearing old entries in group0 history
  service: migration_manager: clear old entries from group 0 history when announcing
2022-01-26 16:12:40 +01:00
Avi Kivity
df22396a34 Merge 'scylla_raid_setup: use mdmonitor only when RAID level > 0' from Takuya ASADA
We found that the monitor mode of mdadm does not work on RAID0, and that this is
not a bug but expected behavior, according to a RHEL developer.
Therefore, we should stop enabling mdmonitor when RAID0 is specified.

Fixes #9540

----

This reverts 0d8f932 and introduces the correct fix.

Closes #9970

* github.com:scylladb/scylla:
  scylla_raid_setup: use mdmonitor only when RAID level > 0
  Revert "scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8"
2022-01-26 15:34:47 +02:00
Takuya ASADA
32f2eb63ac scylla_raid_setup: use mdmonitor only when RAID level > 0
We found that the monitor mode of mdadm does not work on RAID0, and that this is
not a bug but expected behavior, according to a RHEL developer.
Therefore, we should stop enabling mdmonitor when RAID0 is specified.

Fixes #9540
2022-01-26 22:33:07 +09:00
Takuya ASADA
cd57815fff Revert "scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8"
This reverts commit 0d8f932f0b,
because a RHEL developer explains this is not a bug but expected behavior
(mdadm --monitor does not start when the RAID level is 0);
see: https://bugzilla.redhat.com/show_bug.cgi?id=2031936

So we should stop downgrading the mdadm package and instead modify our script not to
enable mdmonitor.service on RAID0, using it only for RAID5.
2022-01-26 22:33:06 +09:00
Gleb Natapov
579dcf187a raft: allow an option to persist commit index
Raft does not need to persist the commit index, since a restarted node will
either learn it from an append message from a leader or (if the entire cluster
is restarted and hence there is no leader) the new leader will figure it out
after contacting a quorum. But some users may want to be able to bring
their local state machine to a state as up-to-date as it was before the restart,
as soon as possible and without any external communication.

For them this patch introduces a new persistence API that allows saving
and restoring the last seen committed index.

Message-Id: <YfFD53oS2j1My0p/@scylladb.com>
2022-01-26 14:06:39 +01:00
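A rough, hypothetical sketch of what an optional commit-index persistence hook could look like; the interface and method names (`store_commit_idx`, `load_commit_idx`) are illustrative and are not the actual raft persistence API:

```cpp
#include <cstdint>
#include <cstdio>
#include <optional>

// Illustrative shape of an optional commit-index persistence hook.
struct persistence {
    virtual ~persistence() = default;
    virtual void store_commit_idx(uint64_t idx) = 0;        // may be a no-op
    virtual std::optional<uint64_t> load_commit_idx() = 0;  // nullopt if not persisted
};

// A stand-in that does persist the index, letting the local state machine be
// brought up to its pre-restart commit point without contacting other nodes.
struct stored_persistence : persistence {
    std::optional<uint64_t> saved;
    void store_commit_idx(uint64_t idx) override { saved = idx; }
    std::optional<uint64_t> load_commit_idx() override { return saved; }
};

int main() {
    stored_persistence p;
    p.store_commit_idx(1234);
    if (auto idx = p.load_commit_idx()) {
        std::printf("resume applying up to commit index %lu\n", (unsigned long)*idx);
    }
}
```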
Calle Wilund
43f51e9639 commitlog: Ensure we don't run continuation (task switch) with queues modified
Fixes #9955

In #9348 we handled the problem of failing to delete segment files on disk, and
the need to recompute disk footprint to keep data flow consistent across intermittent
failures. However, because _reserve_segments and _recycled_segments are queues, we
have to empty them to inspect the contents. One would think it is ok for these
queues to be empty for a while, whilst we do some recalculating, including
disk listing -> continuation switching. But then one (i.e. I) misses the fact
that these queues use the pop_eventually mechanism, which does _not_ handle
a scenario where we push something into an empty queue, thus triggering the
future that resumes a waiting task, but then pop the element immediately, before
the waiting task is run. In fact, _iff_ one does this, not only will things break,
they will in fact start creating undefined behaviour, because the underlying
std::queue<T, circular_buffer> will _not_ do any bounds checks on the pop/push
operations -> we will pop an empty queue, immediately making it non-empty, but
using undefined memory (with luck null/zeroes).

Strictly speaking, seastar::queue::pop_eventually should be fixed to handle
this scenario, but nonetheless we can fix the usage here as well, by simply copying
the objects and doing the calculation "in background" while we potentially start
popping the queue again.

Closes #9966
2022-01-26 13:51:01 +02:00
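A minimal std-only illustration of the safer pattern described above (not the actual commitlog or seastar::queue code): take a copy of the queue's contents and compute on the copy, so the live queue is never left empty across a potential task switch.

```cpp
#include <cstdio>
#include <numeric>
#include <queue>
#include <vector>

// Instead of popping every element out of the live queue just to inspect its
// contents, snapshot it and run the (possibly slow) computation on the copy.
// The live queue stays populated, so a concurrent push cannot wake a waiter
// that then races with our re-insertions.
long total_footprint(const std::queue<long>& live_sizes) {
    std::queue<long> copy = live_sizes;            // snapshot; live queue untouched
    std::vector<long> sizes;
    while (!copy.empty()) {
        sizes.push_back(copy.front());
        copy.pop();
    }
    return std::accumulate(sizes.begin(), sizes.end(), 0L);
}

int main() {
    std::queue<long> recycled_segment_sizes;
    for (long s : {32L << 20, 32L << 20, 16L << 20}) {
        recycled_segment_sizes.push(s);
    }
    std::printf("footprint: %ld bytes\n", total_footprint(recycled_segment_sizes));
}
```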
Avi Kivity
f5cd6ec419 Update tools/python3 submodule (relicensed to Apache License 2.0)
* tools/python3 8a77e76...f725ec7 (2):
  > Relicense to Apache 2.0
  > treewide: use Software Package Data Exchange (SPDX) license identifiers
2022-01-25 18:50:39 +02:00
Kamil Braun
f3c0c73d36 idl: group0_state_machine: fix license blurb 2022-01-25 17:48:46 +01:00
Kamil Braun
bf91dcd1e3 idl: group0_state_machine: fix license blurb 2022-01-25 13:14:47 +01:00
Kamil Braun
b863a63b08 test: unit test for clearing old entries in group0 history
We perform a bunch of schema changes with different values of
`migration_manager::_group0_history_gc_duration` and check if entries
are cleared according to this setting.
2022-01-25 13:13:35 +01:00
Kamil Braun
e9083433a8 service: migration_manager: clear old entries from group 0 history when announcing
When performing a change through group 0 (which right now only covers
schema changes), clear entries from group 0 history table which are older
than one week.

This is done by including an appropriate range tombstone in the group 0
history table mutation.
2022-01-25 13:11:14 +01:00
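A small sketch of the cutoff computation, using plain std::chrono and a toy map in place of the history table; the real implementation expresses the deletion as a range tombstone included in the group 0 history mutation rather than erasing rows directly.

```cpp
#include <chrono>
#include <cstdio>
#include <map>
#include <string>

int main() {
    using clock = std::chrono::system_clock;
    const auto gc_duration = std::chrono::hours(24 * 7);   // keep one week of history

    // Toy stand-in for the history table: state-ID timestamp -> description.
    std::map<clock::time_point, std::string> history{
        {clock::now() - std::chrono::hours(24 * 30), "old schema change"},
        {clock::now() - std::chrono::hours(1),       "recent schema change"},
    };

    // Everything strictly older than the cutoff would be covered by the tombstone.
    const auto cutoff = clock::now() - gc_duration;
    history.erase(history.begin(), history.lower_bound(cutoff));

    std::printf("entries kept: %zu\n", history.size());
}
```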
Botond Dénes
eb42213db4 compact_mutation: close active range tombstone on page end
The compactor recently acquired the ability to consume a v2 stream. The
v2 spec requires that all streams end with a null tombstone.
`range_tombstone_assembler`, the component the compactor uses for
converting the v2 input into its v1 output, enforces this with a check on
`consume_end_of_partition()`. Normally the producer of the stream the
compactor is consuming takes care of closing the active tombstone before
the stream ends. The compactor, however (or its consumer), can decide to
end consumption early, e.g. to cut the current page. When this happens
the compactor must take care of closing the tombstone itself.
Furthermore it has to keep this tombstone around to re-open it on the
next page.
This patch implements this mechanism which was left out of 134601a15e.
It also adds a unit test which reproduces the problems caused by the
missing mechanism.
The compactor now tracks the last clustering position emitted. When the
page ends, this position will be used as the position of the closing
range tombstone change. This ensures the range tombstone only covers the
actually emitted range.

Fixes: #9907

Tests: unit(dev), dtest(paging_test.py, paging_additional_test.py)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220114053215.481860-1-bdenes@scylladb.com>
2022-01-25 09:52:30 +02:00
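A conceptual sketch of the page-boundary handling; all types below are simplified stand-ins for the actual compactor classes, meant only to show the close-at-last-position / re-open-on-next-page idea.

```cpp
#include <cstdio>
#include <optional>
#include <string>

// Simplified stand-ins for clustering positions and tombstones.
using position = std::string;
struct tombstone { long timestamp; };

struct paging_compactor {
    std::optional<tombstone> active;   // tombstone currently open in the output
    position last_emitted;             // last clustering position we produced

    void on_page_end() {
        if (active) {
            // Close at the last emitted position so the tombstone only covers
            // the range that was actually emitted on this page.
            std::printf("emit close at %s (ts=%ld)\n", last_emitted.c_str(), active->timestamp);
        }
    }
    void on_page_start() {
        if (active) {
            // Re-open the remembered tombstone at the start of the next page.
            std::printf("emit re-open after %s (ts=%ld)\n", last_emitted.c_str(), active->timestamp);
        }
    }
};

int main() {
    paging_compactor c{tombstone{42}, "ck=5"};
    c.on_page_end();
    c.on_page_start();
}
```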
Gleb Natapov
e56e96ac5a raft: do not add new wait entries after abort
Abort signals stopped_error on all awaited entries, but if an entry is
added after this, it will be destroyed without signaling and will cause
a waiter to get broken_promise.

Fixes #9688

Message-Id: <Ye6xJjTDooKSuZ87@scylladb.com>
2022-01-25 09:52:30 +02:00
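A minimal std-only sketch of the shape of the fix (not the actual raft code): once the waiter registry has been aborted, new wait requests are rejected up front with the same stopped error rather than being registered and later destroyed unsignaled.

```cpp
#include <cstdint>
#include <cstdio>
#include <future>
#include <map>
#include <stdexcept>

struct stopped_error : std::runtime_error {
    stopped_error() : std::runtime_error("raft instance is stopped") {}
};

struct commit_waiters {
    bool aborted = false;
    std::map<uint64_t, std::promise<void>> waiters;

    std::future<void> wait_for(uint64_t index) {
        if (aborted) {
            // Reject immediately; otherwise the entry would be destroyed on
            // shutdown without being signaled (broken_promise for the waiter).
            throw stopped_error();
        }
        return waiters[index].get_future();
    }

    void abort() {
        aborted = true;
        for (auto& [idx, p] : waiters) {
            p.set_exception(std::make_exception_ptr(stopped_error()));
        }
        waiters.clear();
    }
};

int main() {
    commit_waiters w;
    w.abort();
    try {
        w.wait_for(7);
    } catch (const stopped_error&) {
        std::printf("wait rejected after abort\n");
    }
}
```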
Tomasz Grabiec
c89b1953f8 Merge "Enforce linearizability of group 0 operations using state IDs" from Kamil
We introduce a new table, `system.group0_history`.

This table will contain a history of all group 0 changes applied through
Raft. Each change has an associated unique ID, which also identifies
the state of all group 0 tables (including schema tables) after this
change is applied, assuming that all such changes are serialized through
Raft (they will be eventually).

Group 0 commands, in addition to the mutations which modify group 0 tables,
contain a "previous state ID" and a "new state ID".

The group 0 state machine will only modify state during command
application if the provided "previous state ID" is equal to the
last state ID present in the history table. Otherwise, the command will
be a no-op.

To ensure linearizability of group 0 changes, the performer of the
change must first read the last state ID, only then read the state
and send a command to the state machine. If a concurrent change
races with this command and manages to modify the state, we will detect
that the last state ID does not match during `apply`; all calls to
`apply` are serialized, and `apply` adds the new entry to the history
table at the end, after modifying the group 0 state.

The details of this mechanism are abstracted away with `group0_guard`.
To perform a group 0 change, one needs to call `announce`, which
requires a `group0_guard` to be passed in. The only way to obtain a
`group0_guard` is by calling `start_group0_operation`, which underneath
performs a read barrier on group 0, obtains the last state ID from the
history table, and constructs a new state ID that the change will append
to the history table. The read barrier ensures that all previously
completed changes are visible to this operation. The caller can then
perform any necessary validation, construct mutations which modify group 0
state, and finally call `announce`.

The guard also provides a timestamp which is used by the caller
to construct the mutations. The timestamp is obtained from the new state ID.
We ensure that it is greater than the timestamp of the last state ID.
Thus, if the change is successful, the applied mutations will have greater
timestamps than the previously applied mutations.

We also add two locks. The more important one, used to ensure
correctness, is `read_apply_mutex`. It is held when modifying group 0
state (in `apply` and `transfer_snapshot`) and when reading it (it's
taken when obtaining a `group0_guard` and released before a command is
sent in `announce`). Its goal is to ensure that we don't read partial
state, which could happen without it because group 0 state consists of
many parts and `apply` (or `transfer_snapshot`) potentially modifies all
of them. Note: this doesn't give us 100% protection; if we crash in the
middle of `apply` (or `transfer_snapshot`), then after restart we may
read partial state. To remove this possibility we need to ensure that
commands which were being applied before restart but not finished are
re-applied after restart, before anyone can read the state. I left a
TODO in `apply`.

The second lock, `operation_mutex`, is used to improve liveness. It is
taken when obtaining a `group0_guard` and released after a command is
applied (compare to `read_apply_mutex` which is released before a
command is sent). It is not taken inside `apply` or `transfer_snapshot`.
This lock ensures that multiple fibers running on the same node do not
attempt to modify group0 concurrently - this would cause some of them to
fail (due to the concurrent modification protection described above).
This is mostly important during first boot of the first node, when
services start for the first time and try to create their internal
tables. This lock serializes these attempts, ensuring that all of them
succeed.

* kbr/schema-state-ids-v4:
  service: migration_manager: `announce`: take a description parameter
  service: raft: check and update state IDs during group 0 operations
  service: raft: group0_state_machine: introduce `group0_command`
  service: migration_manager: allow using MIGRATION_REQUEST verb to fetch group 0 history table
  service: migration_manager: convert migration request handler to coroutine
  db: system_keyspace: introduce `system.group0_history` table
  treewide: require `group0_guard` when performing schema changes
  service: migration_manager: introduce `group0_guard`
  service: raft: pass `storage_proxy&` to `group0_state_machine`
  service: raft: raft_state_machine: pass `snapshot_descriptor` to `transfer_snapshot`
  service: raft: rename `schema_raft_state_machine` to `group0_state_machine`
  service: migration_manager: rename `schema_read_barrier` to `start_group0_operation`
  service: migration_manager: `announce`: split raft and non-raft paths to separate functions
  treewide: pass mutation timestamp from call sites into `migration_manager::prepare_*` functions
  service: migration_manager: put notifier call inside `async`
  service: migration_manager: remove some unused and disabled code
  db: system_distributed_keyspace: use current time when creating mutations in `start()`
  redis: keyspace_utils: `create_keyspace_if_not_exists_impl`: call `announce` twice only
2022-01-25 09:52:30 +02:00
Avi Kivity
a105b09475 build: prepare for Scylla 5.0
We decided to name the next version Scylla 5.0, in honor of
Raft based schema management.
2022-01-25 09:52:30 +02:00
Avi Kivity
277303a722 build_indexes_virtual_reader: convert to flat_mutation_reader_v2
Since it doesn't handle range tombstones in any way, the conversion
consists of just using the new type names.

Closes #9948
2022-01-25 09:52:30 +02:00
Avi Kivity
007145e033 validation: complete transition to data_dictionary module
The API was converted in 00de5f4876,
but some #includes remain. Remove them.

Closes #9947
2022-01-25 09:52:30 +02:00
Avi Kivity
e74f570eda alternator: streams: fix use-after-free of data_dictionary in describe_stream()
In 4aa9e86924 ("Merge 'alternator: move uses of replica module to
data_dictionary' from Avi Kivity"), we changed alternator to use
data_dictionary instead of replica::database. However,
data_dictionary::database objects are different from replica::database
objects in that they don't have a stable address and need to be
captured by value (they are pointer-like). One capture in
describe_stream() was capturing a data_dictionary::database
by reference and so caused a use-after-free when the previous
continuation was deallocated.

Fix by capturing by value.

Fixes #9952.

Closes #9954
2022-01-25 09:52:30 +02:00
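The bug pattern distilled to a std-only example (a generic continuation instead of the actual alternator code): a pointer-like handle must be captured by value, because the frame that owned the reference may be gone by the time the continuation runs.

```cpp
#include <cstdio>
#include <functional>
#include <memory>
#include <string>

// Pointer-like, cheap-to-copy handle (the role data_dictionary::database plays).
struct db_handle {
    std::shared_ptr<std::string> name;
};

std::function<void()> make_continuation(db_handle db) {
    // Capturing `db` by value keeps the handle alive for the continuation.
    // Capturing by reference ([&db]) would dangle once this frame returns,
    // which is exactly the kind of use-after-free fixed here.
    return [db] { std::printf("describing stream for %s\n", db.name->c_str()); };
}

int main() {
    auto cont = make_continuation(db_handle{std::make_shared<std::string>("ks1")});
    cont();   // runs after make_continuation's frame is gone; still safe
}
```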
Kamil Braun
044e05b0d9 service: migration_manager: announce: take a description parameter
The description parameter is used for the group 0 history mutation.
The default is empty, in which case the mutation will leave
the description column as `null`.
I filled the parameter in some easy places as an example and left the
rest for a follow-up.

This is how it looks now in a fresh cluster with a single statement
performed by the user:

cqlsh> select * from system.group0_history ;

 key     | state_id                             | description
---------+--------------------------------------+------------------------------------------------------
 history | 9ec29cac-7547-11ec-cfd6-77bb9e31c952 |                                    CQL DDL statement
 history | 9beb2526-7547-11ec-7b3e-3b198c757ef2 |                                                 null
 history | 9be937b6-7547-11ec-3b19-97e88bd1ca6f |                                                 null
 history | 9be784ca-7547-11ec-f297-f40f0073038e |                                                 null
 history | 9be52e14-7547-11ec-f7c5-af15a1a2de8c |                                                 null
 history | 9be335dc-7547-11ec-0b6d-f9798d005fb0 |                                                 null
 history | 9be160c2-7547-11ec-e0ea-29f4272345de |                                                 null
 history | 9bdf300e-7547-11ec-3d3f-e577a2e31ffd |                                                 null
 history | 9bdd2ea8-7547-11ec-c25d-8e297b77380e |                                                 null
 history | 9bdb925a-7547-11ec-d754-aa2cc394a22c |                                                 null
 history | 9bd8d830-7547-11ec-1550-5fd155e6cd86 |                                                 null
 history | 9bd36666-7547-11ec-230c-8702bc785cb9 | Add new columns to system_distributed.service_levels
 history | 9bd0a156-7547-11ec-a834-85eac94fd3b8 |        Create system_distributed(_everywhere) tables
 history | 9bcfef18-7547-11ec-76d9-c23dfa1b3e6a |        Create system_distributed_everywhere keyspace
 history | 9bcec89a-7547-11ec-e1b4-34e0010b4183 |                   Create system_distributed keyspace
2022-01-24 15:20:37 +01:00
Kamil Braun
6a00e790c7 service: raft: check and update state IDs during group 0 operations
The group 0 state machine will only modify state during command
application if the provided "previous state ID" is equal to the
last state ID present in the history table. Otherwise, the command will
be a no-op.

To ensure linearizability of group 0 changes, the performer of the
change must first read the last state ID, only then read the state
and send a command to the state machine. If a concurrent change
races with this command and manages to modify the state, we will detect
that the last state ID does not match during `apply`; all calls to
`apply` are serialized, and `apply` adds the new entry to the history
table at the end, after modifying the group 0 state.

The details of this mechanism are abstracted away with `group0_guard`.
To perform a group 0 change, one needs to call `announce`, which
requires a `group0_guard` to be passed in. The only way to obtain a
`group0_guard` is by calling `start_group0_operation`, which underneath
performs a read barrier on group 0, obtains the last state ID from the
history table, and constructs a new state ID that the change will append
to the history table. The read barrier ensures that all previously
completed changes are visible to this operation. The caller can then
perform any necessary validation, construct mutations which modify group
0 state, and finally call `announce`.

The guard also provides a timestamp which is used by the caller
to construct the mutations. The timestamp is obtained from the new state ID.
We ensure that it is greater than the timestamp of the last state ID.
Thus, if the change is successful, the applied mutations will have greater
timestamps than the previously applied mutations.

We also add two locks. The more important one, used to ensure
correctness, is `read_apply_mutex`. It is held when modifying group 0
state (in `apply` and `transfer_snapshot`) and when reading it (it's
taken when obtaining a `group0_guard` and released before a command is
sent in `announce`). Its goal is to ensure that we don't read partial
state, which could happen without it because group 0 state consists of
many parts and `apply` (or `transfer_snapshot`) potentially modifies all
of them. Note: this doesn't give us 100% protection; if we crash in the
middle of `apply` (or `transfer_snapshot`), then after restart we may
read partial state. To remove this possibility we need to ensure that
commands which were being applied before restart but not finished are
re-applied after restart, before anyone can read the state. I left a
TODO in `apply`.

The second lock, `operation_mutex`, is used to improve liveness. It is
taken when obtaining a `group0_guard` and released after a command is
applied (compare to `read_apply_mutex` which is released before a
command is sent). It is not taken inside `apply` or `transfer_snapshot`.
This lock ensures that multiple fibers running on the same node do not
attempt to modify group0 concurrently - this would cause some of them to
fail (due to the concurrent modification protection described above).
This is mostly important during first boot of the first node, when
services start for the first time and try to create their internal
tables. This lock serializes these attempts, ensuring that all of them
succeed.
2022-01-24 15:20:37 +01:00
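A toy model of the check-and-append step in `apply`, using ordinary strings and vectors in place of mutations and UUID state IDs: the command only takes effect if its previous state ID still matches the last history entry, and on success the new state ID is appended.

```cpp
#include <cstdio>
#include <string>
#include <vector>

struct group0_command {
    std::string prev_state_id;   // last state ID observed when the change was prepared
    std::string new_state_id;    // ID this change appends on success
    std::string change;          // stand-in for the schema mutations
};

struct state_machine {
    std::vector<std::string> history;   // state IDs, oldest first
    std::vector<std::string> state;     // stand-in for group 0 tables

    // All applies are serialized, so this read-check-modify-append is atomic.
    bool apply(const group0_command& cmd) {
        const std::string last = history.empty() ? "" : history.back();
        if (cmd.prev_state_id != last) {
            return false;                      // a concurrent change won; no-op
        }
        state.push_back(cmd.change);           // apply the mutations
        history.push_back(cmd.new_state_id);   // record the new state ID last
        return true;
    }
};

int main() {
    state_machine sm;
    sm.apply({"", "id-1", "create table t1"});
    bool ok = sm.apply({"", "id-2", "create table t2"});   // stale prev ID: rejected
    std::printf("second change applied: %s\n", ok ? "yes" : "no");
}
```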
Kamil Braun
509ac2130f service: raft: group0_state_machine: introduce group0_command
Objects of this type will be serialized and sent as commands to the
group 0 state machine. They contain a set of mutations which modify
group 0 tables (at this point: schema tables and group 0 history table),
the 'previous state ID' which is the last state ID present in the
history table when the operation described by this command has started,
and the 'new state ID' which will be appended to the history table if
this change is successful (successful = the previous state ID is still
equal to the last state ID in the history table at the moment of
application). It also contains the address of the node which constructed
this command.

The state ID mechanism will be described in more detail in a later
commit.
2022-01-24 15:20:37 +01:00
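An illustrative sketch of the command's field layout; the types below are plain strings standing in for the real mutations, UUID state IDs and raft server address.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Illustrative field layout only; the real command carries canonical mutations,
// UUID-typed state IDs and a server address rather than strings.
struct group0_command {
    std::vector<std::string> mutations;   // changes to group 0 tables (schema + history)
    std::string prev_state_id;            // last history entry when the operation started
    std::string new_state_id;             // appended to the history if still current
    std::string creator_addr;             // node that constructed this command
};

int main() {
    group0_command cmd{{"schema mutation", "history mutation"}, "id-41", "id-42", "127.0.0.1"};
    std::printf("command with %zu mutations, %s -> %s, from %s\n",
                cmd.mutations.size(), cmd.prev_state_id.c_str(),
                cmd.new_state_id.c_str(), cmd.creator_addr.c_str());
}
```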
Kamil Braun
cc0c54ea15 service: migration_manager: allow using MIGRATION_REQUEST verb to fetch group 0 history table
The MIGRATION_REQUEST verb is currently used to pull the contents of
schema tables (in the form of mutations) when nodes synchronize schemas.
We will (ab)use the verb to fetch additional data, such as the contents
of the group 0 history table, for purposes of group 0 snapshot transfer.

We extend `schema_pull_options` with a flag specifying that the puller
requests the additional data associated with group 0 snapshots. This
flag is `false` by default, so existing schema pulls will do what they
did before. If the flag is `true`, the migration request handler will
include the contents of group 0 history table.

Note that if a request is sent with the flag set to `true`, that means
the entire cluster must have enabled the Raft feature, which also means
that the handler knows of the flag.
2022-01-24 15:20:37 +01:00
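The backwards-compatible shape of such an option can be sketched with a standalone struct; the field name `group0_snapshot_transfer` below is made up for illustration, and the point is only why defaulting to `false` keeps existing schema pulls behaving exactly as before.

```cpp
#include <cstdio>

// Hypothetical field name; old callers never set the flag, so handlers that
// know about it see `false` and behave as before.
struct schema_pull_options {
    bool group0_snapshot_transfer = false;
};

int main() {
    schema_pull_options plain_pull;              // old-style pull: flag stays false
    schema_pull_options snapshot_pull{true};     // pull that also wants group 0 history
    std::printf("plain: %d, snapshot: %d\n",
                plain_pull.group0_snapshot_transfer, snapshot_pull.group0_snapshot_transfer);
}
```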
Kamil Braun
a944dd44ee service: migration_manager: convert migration request handler to coroutine 2022-01-24 15:20:37 +01:00
Kamil Braun
fad72daeb4 db: system_keyspace: introduce system.group0_history table
This table will contain a history of all group 0 changes applied through
Raft. Each change has an associated unique ID, which also identifies
the state of all group 0 tables (including schema tables) after this
change is applied, assuming that all such changes are serialized through
Raft (they will be eventually).

We will use these state IDs to check if a given change is still
valid at the moment it is applied (in `group0_state_machine::apply`),
i.e. that there wasn't a concurrent change that happened between
creating this change and applying it (which may invalidate it).
2022-01-24 15:20:37 +01:00
Kamil Braun
a664ac7ba5 treewide: require group0_guard when performing schema changes
`announce` now takes a `group0_guard` by value. `group0_guard` can only
be obtained through `migration_manager::start_group0_operation` and
moved; it cannot be constructed outside `migration_manager`.

The guard will be a method of ensuring linearizability for group 0
operations.
2022-01-24 15:20:35 +01:00
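A compact, std-only illustration of how the type system can enforce this (the real guard also carries state IDs and a timestamp): the guard is move-only, its constructor is private, and only the manager can mint one.

```cpp
#include <cstdio>
#include <utility>

class migration_manager;   // the only class allowed to construct guards

class group0_guard {
    friend class migration_manager;
    group0_guard() = default;                    // private: cannot be created outside the manager
public:
    group0_guard(group0_guard&&) = default;      // movable...
    group0_guard(const group0_guard&) = delete;  // ...but not copyable
};

class migration_manager {
public:
    group0_guard start_group0_operation() { return group0_guard{}; }  // the only way to obtain one
    void announce(group0_guard guard) {          // consumes the guard by value
        (void)guard;
        std::printf("announcing change under group 0 guard\n");
    }
};

int main() {
    migration_manager mm;
    group0_guard guard = mm.start_group0_operation();
    mm.announce(std::move(guard));               // the guard cannot be reused afterwards
}
```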
Kamil Braun
742f036261 service: migration_manager: introduce group0_guard
This object will be used to "guard" group 0 operations. Obtaining it
will be necessary to perform a group 0 change (such as modifying the
schema), which will be enforced by the type system.

The initial implementation is a stub and only provides a timestamp which
will be used by callers to create mutations for group 0 changes. The
next commit will change all call sites to use the guard as intended.

The final implementation, coming later, will ensure linearizability of
group 0 operations.
2022-01-24 15:12:50 +01:00
Kamil Braun
f908da919c service: raft: pass storage_proxy& to group0_state_machine
We'll use it to update the group 0 history table.
2022-01-24 15:12:50 +01:00
Kamil Braun
dce8ece4b6 service: raft: raft_state_machine: pass snapshot_descriptor to transfer_snapshot
Currently it takes just the snapshot ID. Extend it by taking the whole
snapshot descriptor.

In following commits I use this to perform additional logging.
2022-01-24 15:12:50 +01:00
Kamil Braun
538cc6ecb9 service: raft: rename schema_raft_state_machine to group0_state_machine
Generalize the name so it doesn't suggest that group 0 contains only
schema state.
2022-01-24 15:12:50 +01:00
Kamil Braun
86762a1dd9 service: migration_manager: rename schema_read_barrier to start_group0_operation
1. Generalize the name so it mentions group 0, which schema will be a
   strict subset of.
2. Remove the fact that it performs a "read barrier" from the name. The
   function will be used in general to ensure linearizability of group0
   operations - both reads and writes. "Read barrier" is Raft-specific
   terminology, so it can be thought of as an implementation detail.
2022-01-24 15:12:50 +01:00
Kamil Braun
0f24b907b7 service: migration_manager: announce: split raft and non-raft paths to separate functions 2022-01-24 15:12:50 +01:00
Kamil Braun
283ac7fefe treewide: pass mutation timestamp from call sites into migration_manager::prepare_* functions
The functions which prepare schema change mutations (such as
`prepare_new_column_family_announcement`) would use internally
generated timestamps for these mutations. When schema changes are
managed by group 0 we want to ensure that timestamps of mutations
applied through Raft are monotonic. We will generate these timestamps at
call sites and pass them into the `prepare_` functions. This commit
prepares the APIs.
2022-01-24 15:12:50 +01:00
Kamil Braun
f97edb1dbd service: migration_manager: put notifier call inside async
`get_notifier().before_update_column_family(...)` requires being inside
`async`. Fix this.
2022-01-24 15:12:50 +01:00
Kamil Braun
3bab5c564a service: migration_manager: remove some unused and disabled code
`include_keyspace_and_announce` was no longer used.
`do_announce_new_type` only had a declaration; it was not used and there
was no definition.
2022-01-24 15:12:49 +01:00
Kamil Braun
0af5f74871 db: system_distributed_keyspace: use current time when creating mutations in start()
When creating or updating internal distributed tables in
`system_distributed_keyspace::start()`, hardcoded timestamps were used.

There were two reasons for this:
- to protect against issue #2129, where nodes would start without
  synchronizing schema with the existing cluster, creating the tables
  again, which would override any manual user changes to these tables.
  The solution was to use small timestamps (like api::min_timestamp) - the
  user-created schema mutations would always 'win' (because when they were
  created, they used current time).
- to eliminate unnecessary schema sync. If two nodes created these
  tables concurrently with different timestamps, the schemas would
  formally be different and would need to merge. This could happen
  during upgrades when we upgraded from a version which doesn't have
  these tables or doesn't have some columns.

The #2129 workaround is no longer necessary: when nodes start they always
have to sync schema with existing nodes; we also don't allow
bootstrapping nodes in parallel.

The second problem would happen during parallel bootstrap, which we
don't allow, or during parallel upgrade. The procedure we recommend is
rolling upgrade - where nodes are upgraded one by one. In this case only
one node is going to create/update the tables; following upgraded nodes
will sync schema first and notice they don't need to do anything. So if
procedures are followed correctly, the workaround is not needed. If
someone doesn't follow the procedures and upgrades nodes in parallel,
these additional schema synchronizations are not a big cost, so the
workaround doesn't give us much in this case either.

When schema changes are performed by Raft group 0, certain constraints
are placed on the timestamps used for mutations. For this we'll need to
be able to use timestamps which are generated based on current time.
2022-01-24 15:12:49 +01:00
Kamil Braun
63d3449bc3 redis: keyspace_utils: create_keyspace_if_not_exists_impl: call announce twice only
The code would previously `announce` schema mutations once per keyspace and
once per table. This can be reduced to two calls of `announce`:
once to create all keyspaces, and once to create all tables.

This should be further reduced to a single `announce` in the future.
Left a FIXME.

Motivation: after migrating to Raft, each `announce` will require a
`read_barrier` to achieve linearizability of schema operations. This
introduces latency, as it requires contacting a leader which then must
contact a quorum. The fewer announce calls, the better. Also, if all
sub-operations are reduced to a single `announce`, we get atomicity -
either all of these sub-operations succeed or none do.
2022-01-24 15:12:46 +01:00
Benny Halevy
188cedd533 test: lister_test: test_lister_abort: generate at least one entry
Without this fix, generate_random_content could generate 0 entries
and the expected exception would never be injected.

With it, we generate at least 1 entry and the test passes
with the offending random-seed:

```
random-seed=1898914316
Generated 1 dir entries
Aborting lister after 1 dir entries
test/boost/lister_test.cc(96): info: check 'exception "expected_exception" raised as expected' has passed
```

Fixes #9953

Test: lister_test.test_lister_abort --random-seed=1898914316(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220123122921.14017-1-bhalevy@scylladb.com>
2022-01-23 17:52:44 +02:00
Gleb Natapov
d09864d61f redis: check for tables existence before creating
Do not create redis tables unconditionally on boot, since this requires
issuing a raft barrier and cannot be done without a quorum.

Message-Id: <YefV0CqEueRL7G00@scylladb.com>
2022-01-23 17:52:44 +02:00
Benny Halevy
f439edca35 test: sstable_compaction_test: twcs_reshape_with_disjoint_set_test: take min_threshold into consideration
Take into account that get_reshaping_job selects only
buckets that have more than min_threshold sstables in them.

Therefore, with 256 disjoint sstables in different windows,
allow the first or last windows to not be selected by get_reshaping_job,
which will return at least disjoint_sstable_count - min_threshold + 1
sstables, and not more than disjoint_sstable_count.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220123090044.38449-2-bhalevy@scylladb.com>
2022-01-23 17:52:44 +02:00
Avi Kivity
ae6fdf1599 Update seastar submodule
* seastar 5025cd44ea...5524f229bb (3):
  > Merge "Simplify io-queue configuration" from Pavel E
  > fix sstring.find(): make find("") compatible with std::string
  > test: file_utils: test_non_existing_TMPDIR: no need to setenv

Contains patch from Pavel Emelyanov <xemul@scylladb.com>:

scylla-gdb: Remove _shares_capacity from fair-group debug

This field is about to be removed in newer seastar, so it
shouldn't be checked in scylla-gdb

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220121115643.6966-1-xemul@scylladb.com>
2022-01-21 17:38:05 +02:00
Piotr Jastrzebski
09d4438a0d cdc: Handle compact storage correctly in preimage
Base tables that use compact storage may have a special artificial
column that has an empty type.

c010cefc4d fixed the main CDC path to
handle such columns correctly and to not include them in the CDC Log
schema.

This patch makes sure that generation of preimage ignores such empty
column as well.

Fixes #9876
Closes #9910

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2022-01-20 13:23:38 +01:00
Nadav Har'El
350c3d0f6a alternator: update comment about default timeout
The comment explaining where the default Alternator timeout is set
became out-of-date. So fix it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220120092631.401563-1-nyh@scylladb.com>
2022-01-20 14:05:58 +02:00
Raphael S. Carvalho
5d654a6b9a compaction: don't copy owned ranges in cleanup ctor
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220119142322.39791-1-raphaelsc@scylladb.com>
2022-01-20 14:05:58 +02:00
Botond Dénes
a65b38a9f7 reader_permit: release_base_resources(): also update _resources
If the permit was admitted, _base_resources was already accounted in
_resource and therefore has to be deducted from it, otherwise the permit
will think it leaked some resources on destruction.

Test:
dtest(repair_additional_test.py.test_repair_one_missing_row_diff_shard_count)

Refs: #9751
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220119132550.532073-1-bdenes@scylladb.com>
2022-01-20 14:05:58 +02:00
Nadav Har'El
7cb6250c40 Merge 'snapshot_ctl: true_snapshots_size: fix space accounting' from Benny Halevy
This pull request fixes two preexisting issues related to snapshot_ctl::true_snapshots_size

https://github.com/scylladb/scylla/issues/9897
https://github.com/scylladb/scylla/issues/9898

And adds a couple of unit tests to test the snapshot_ctl functionality.

Test: unit(dev), database_test.{test_snapshot_ctl_details,test_snapshot_ctl_true_snapshots_size}(debug)

Closes #9899

* github.com:scylladb/scylla:
  table: get_snapshot_details: count allocated_size
  snapshot_ctl: cleanup true_snapshots_size
  snpashot_ctl: true_snapshots_size: do not map_reduce across all shards
2022-01-19 11:57:15 +02:00
Nadav Har'El
4aa9e86924 Merge 'alternator: move uses of replica module to data_dictionary' from Avi Kivity
Alternator is a coordinator-side service and so should not access
the replica module. In this series all but one of uses of the replica
module are replaced with data_dictionary.

One case remains - accessing the replication map which is not
available (and should not be available) via the data dictionary.

The data_dictionary module is expanded with missing accessors.

Closes #9945

* github.com:scylladb/scylla:
  alternator: switch to data_dictionary for table listing purposes
  data_dictionary: add get_tables()
  data_dictionary: introduce keyspace::is_internal()
2022-01-19 11:34:25 +02:00
Avi Kivity
7399f3fae7 alternator: switch to data_dictionary for table listing purposes
As a coordinator-side service, alternator shouldn't touch the
replica module, so it is migrated here to data_dictionary.

One use case still remains that uses replica::keyspace - accessing
the replication map. This really isn't a replica-side thing, but it's
also not logically part of the data dictionary, so it's left using
replica::keyspace (using the data_dictionary::database::real_database()
escape hatch). Figuring out how to expose the replication map to
coordinator-side services is left for later.
2022-01-19 11:03:36 +02:00
Avi Kivity
f80d13c95c data_dictionary: add get_tables()
Unlike replica::database::get_column_families(), which it replaces,
it returns a vector of tables rather than a map. Map-like access
is provided by get_table(), so it's redundant to build a new
map container to expose the same functionality.
2022-01-19 09:36:22 +02:00