With the new context manager it's now easier to request an error
to be injected via REST API. Note that error injection is only
enabled in certain build modes (dev, debug, sanitize)
and the test case will be skipped if it's not possible to use
this mechanism.
When error injection is enabled at compile time, it's now possible
to inject an error into BatchGetItem in order to produce a partial
read, i.e. when only part of the items were retrieved successfully.
The DynamoDB protocol specifies that when a batch read fails
only partially, the unprocessed keys can be returned so that
the user can retry them.
Alternator used to fail the whole request if any of the reads failed,
but now it instead produces the list of unprocessed keys
and returns them to the user, as long as at least 1 read was
successful.
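
A minimal sketch of the behavior described above, written in Python rather
than Alternator's actual C++ code. The names `batch_get` and `read_item` are
hypothetical, and the response is simplified to flat lists (the real
DynamoDB response groups items and keys per table):

```python
def batch_get(keys, read_item):
    """Read every key; report failed reads as unprocessed keys.

    read_item is a hypothetical callback that returns an item or raises.
    """
    responses, unprocessed = [], []
    last_error = None
    for key in keys:
        try:
            responses.append(read_item(key))
        except Exception as e:
            # A single failed read no longer fails the whole batch.
            unprocessed.append(key)
            last_error = e
    if keys and not responses:
        # Not even one read succeeded: fail the whole request.
        raise last_error
    return {"Responses": responses, "UnprocessedKeys": unprocessed}
```

A client would then retry only the keys listed in "UnprocessedKeys".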
NOTE: tested manually by compiling Scylla with error injection,
which fails every nth request. It's rather hard to figure out
an automatic test case for this scenario.
Fixes #9984
Some of the tests in test/cql-pytest share the same table but use
different keys to ensure they don't collide. Before this patch we used a
random key, which was usually fine, but we recently noticed that the
pytest-randomly plugin may cause different tests to run through the *same*
sequence of random numbers and ruin our intent that different tests use
different keys.
So instead of using a *random* key, let's use a *unique* key. We can
achieve this uniqueness trivially - using a counter variable - because
anyway the uniqueness is only needed inside a single temporary table -
which is different in every run.
Another benefit is that it will now be clearer that the tests are
deterministic and not random - the intent of a random_string() key
was never to randomly walk the entire key space (random_string()
anyway had a pretty narrow idea of what a random string looks like) -
it was just to get a unique key.
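
A minimal sketch of the counter-based approach, assuming helper names
(`unique_key_int`, `unique_key_string`) that are illustrative and not
necessarily the ones used in the test suite:

```python
import itertools

# One process-wide counter: every call returns a value never returned before,
# regardless of how any random number generator is seeded.
_key_counter = itertools.count(1)

def unique_key_int():
    return next(_key_counter)

def unique_key_string():
    return "k" + str(unique_key_int())
```

Since the keys only need to be unique within one temporary table, it does
not matter that the counter restarts from 1 in every test run.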
Refs #9988 (fixes it for cql-pytest, but not for test/alternator)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
If memory reclamation is triggered inside _cache.emplace(), the _cache
btree can get corrupted. Reclaimers erase from it, and emplace()
assumes that the tree is not modified during its execution. It first
locates the target node and then does memory allocation.
Fix by running emplace() under an allocating section, which disables
memory reclamation.
The bug manifests with assert failures, e.g.:
./utils/bptree.hh:1699: void bplus::node<unsigned long, cached_file::cached_page, cached_file::page_idx_less_comparator, 12, bplus::key_search::linear, bplus::with_debug::no>::refill(Less) [Key = unsigned long, T = cached_file::cached_page, Less = cached_file::page_idx_less_comparator, NodeSize = 12, Search = bplus::key_search::linear, Debug = bplus::with_debug::no]: Assertion `p._kids[i].n == this' failed.
Fixes #9915
Message-Id: <20220130175639.15258-1-tgrabiec@scylladb.com>
Currently this is done only in
storage_service::get_mutable_token_metadata_ptr
but it needs to be done here as well for code paths
calling mutate_token_metadata directly.
Currently, it is only called from network_topology_strategy_test.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220130152157.2596086-1-bhalevy@scylladb.com>
It was observed (perhaps it depends on the Python implementation)
that an identical seed was used for multiple test cases,
which violated the assumption that generated values are in fact
unique. Using a global generator instead makes sure that it is
only seeded once.
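
A minimal sketch of the idea, assuming a helper called `random_string`
(an illustrative name). The generator is a module-level `random.Random`
instance, so reseeding the functions of the `random` module elsewhere does
not reset it:

```python
import random

# Created, and therefore seeded, exactly once: when this module is imported.
_rng = random.Random()

def random_string(length=10, chars="abcdefghijklmnopqrstuvwxyz"):
    # Draw from the private generator, not from the shared module-level one.
    return "".join(_rng.choice(chars) for _ in range(length))
```

Even if something calls `random.seed(...)` with the same value before each
test case, consecutive calls to `random_string()` keep advancing the same
private sequence.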
Tests: unit(dev) # alternator tests used to fail for me locally
before this patch was applied
Message-Id: <315d372b4363f449d04b57f7a7d701dcb9a6160a.1643365856.git.sarna@scylladb.com>
On start the storage_service sets up initial tokens. Some dangling
variables, checks and code duplication had accumulated over time.
* xemul/br-storage-service-bootstrap-leftovers:
dht: Use db::config to generate initial tokens
database, dht: Move get_initial_tokens()
storage_service: Factor out random/config tokens generation
storage_service: No extra get_replace_address checks
storage_service: Remove write-only local variable
This series moves the static thread_local repair_meta_map instances
into the repair_service shards.
Refs #9809
Test: unit(release) (including scylla-gdb)
Dtest: repair_additional_test.py::TestRepairAdditional::{test_repair_disjoint_row_2nodes,test_repair_joint_row_3nodes_2_diff_shard_count} replace_address_test.py::TestReplaceAddress::test_serve_writes_during_bootstrap[rbo_enabled](release)
* git@github.com:bhalevy/scylla.git deglobalize-repair_meta_map-v1
repair_service: deglobalize get_next_repair_meta_id
repair_service: deglobalize repair_meta_map
repair_service: pass reference to service to row_level_repair_gossip_helper
repair_meta: define repair_meta_ptr
repair_meta: move static repair_meta map functions out of line
repair_meta: make get_set_diff a free function
repair: repair_meta: no need to keep sharded<netw::messaging_service>
repair: repair_meta: derive subordinate services from repair_service
repair: pass repair_service to repair_meta
* seastar 5524f229b...0d250d15a (6):
> core: memory: Avoid current_backtrace() on alloc failure when logging suppressed
Fixes #9982
> Merge "Enhance io-tester and its rate-limited job" from Pavel E
> queue: pop: assert that the queue is not empty
> io_queue: properly declare io_queue_for_tests
> reactor: Fix off-by-end-of-line misprint in legacy configuration
> fair_queue: Fix move constructor
The replica::database is passed into the helper just to get the
config from. Better to use config directly without messing with
the database.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper in question has nothing to do with replica/database and
is only used by dht to convert config option to a set of tokens.
The helper deserves to live where it's needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a place in normal node start that parses the initial_token
option or generates num_tokens random tokens. This code has been used almost
unchanged since it was ported from its Java version. Later,
dht::get_bootstrap_token() appeared with the same internal logic.
This patch generalizes these two places. Logging messages are unified
too (dtests do not seem to check those).
The change improves a corner case. The normal node startup code doesn't
check whether initial_token is empty and num_tokens is 0, in which case it
generates an empty bootstrap_tokens set. It fails later with an obscure
'remove_endpoint should be used instead' message.
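
A minimal sketch of the unified logic in Python, not the actual C++ helper.
The function name, the comma-separated format of initial_token, the 64-bit
token range and the choice to reject the empty case early are all
assumptions made for illustration:

```python
import random

def get_bootstrap_tokens(initial_token, num_tokens, rng=random):
    """Return a set of tokens from config, or generate random ones.

    initial_token: comma-separated string, possibly empty (assumed format).
    num_tokens: how many random tokens to generate when none are configured.
    """
    if initial_token:
        # Tokens were configured explicitly: parse them.
        return {int(t) for t in initial_token.split(",") if t.strip()}
    if num_tokens < 1:
        # The corner case: nothing configured and nothing to generate.
        raise ValueError("num_tokens must be >= 1 when initial_token is empty")
    tokens = set()
    while len(tokens) < num_tokens:
        # Assumed token space: signed 64-bit integers.
        tokens.add(rng.randrange(-2**63, 2**63))
    return tokens
```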
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
get_replace_address() returns optional<inet_address>, but
in many cases it's used under an if (is_replacing()) branch.
is_replacing(), in turn, returns bool(get_replace_address()), so the
branch is only executed if the returned optional is engaged.
The extra checks can be removed, making the code a tiny bit shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Rather than using a static uint32_t next_id,
move the next_id variable into repair_service shard 0
and manage it there.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Note that we can't pass the repair_service container()
from its ctor since it's not populated until all shards start.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Keep repair_meta in repair_meta_map as shared_ptr<repair_meta>
rather than lw_shared_ptr<repair_meta> so it can be defined
in the header file and use only forward-declared
class repair_meta.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Define the static {get,insert,remove}_repair_meta functions out
of the repair_meta class definition, on the way of moving them,
along with the repair_meta_map itself, to repair_service.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
All repair_meta needs is the local instance.
It's a peering service, so container()
can be used if needed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use repair_service as the authoritative source for
the database, messaging_service, system_distributed_keyspace,
and view_update_generator, similar to repair_info.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When performing a change through group 0 (which right now means schema
changes), clear entries from group 0 history table which are older
than one week.
This is done by including an appropriate range tombstone in the group 0
history table mutation.
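
A minimal sketch of the effect, modelling the history table as a Python
list of (created_at, description) pairs. The function names are
hypothetical, and filtering the list stands in for what the range tombstone
achieves in the real mutation:

```python
from datetime import datetime, timedelta, timezone

GC_DURATION = timedelta(weeks=1)

def history_gc_cutoff(now, gc_duration=GC_DURATION):
    # Everything strictly older than this point is covered by the tombstone.
    return now - gc_duration

def apply_history_mutation(history, new_entry, now, gc_duration=GC_DURATION):
    """Return the history after one mutation: old entries dropped, one added."""
    cutoff = history_gc_cutoff(now, gc_duration)
    kept = [entry for entry in history if entry[0] >= cutoff]
    kept.append(new_entry)
    return kept
```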
* kbr/g0-history-gc-v2:
idl: group0_state_machine: fix license blurb
test: unit test for clearing old entries in group0 history
service: migration_manager: clear old entries from group 0 history when announcing
We found that the monitor mode of mdadm does not work on RAID0. According
to a RHEL developer, this is not a bug but expected behavior.
Therefore, we should stop enabling mdmonitor when RAID0 is specified.
Fixes #9540
----
This reverts 0d8f932 and introduces the correct fix.
Closes #9970
* github.com:scylladb/scylla:
scylla_raid_setup: use mdmonitor only when RAID level > 0
Revert "scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8"
This reverts commit 0d8f932f0b,
because a RHEL developer explained that this is not a bug but expected behavior
(mdadm --monitor does not start when the RAID level is 0).
See: https://bugzilla.redhat.com/show_bug.cgi?id=2031936
So we should stop downgrading the mdadm package and modify our script so that
it does not enable mdmonitor.service on RAID0 and uses it only for RAID5.
Raft does not need to persist the commit index, since a restarted node will
either learn it from an append message from a leader or (if the entire cluster
is restarted and hence there is no leader) a new leader will figure it out
after contacting a quorum. But some users may want to be able to bring
their local state machine to a state as up-to-date as it was before the restart
as soon as possible, without any external communication.
For them, this patch introduces a new persistence API that allows saving
and restoring the last seen committed index.
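
A minimal sketch of what such a persistence layer could look like, in
Python. The class, its method names and the JSON file format are
assumptions for illustration and do not describe the actual interface:

```python
import json
import os

class FilePersistence:
    """Hypothetical persistence backend that stores the commit index."""

    def __init__(self, path):
        self._path = path

    def store_commit_idx(self, idx):
        # Write to a temporary file and rename it, so that a crash never
        # leaves a half-written file behind.
        tmp = self._path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"commit_idx": idx}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self._path)

    def load_commit_idx(self):
        # Nothing stored yet: start from index 0.
        try:
            with open(self._path) as f:
                return json.load(f)["commit_idx"]
        except FileNotFoundError:
            return 0
```

After a restart, the state machine can be advanced up to the loaded index
before any message from a leader arrives.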
Message-Id: <YfFD53oS2j1My0p/@scylladb.com>
Fixes #9955
In #9348 we handled the problem of failing to delete segment files on disk, and
the need to recompute disk footprint to keep data flow consistent across intermittent
failures. However, because _reserve_segments and _recycled_segments are queues, we
have to empty them to inspect the contents. One would think it is ok for these
queues to be empty for a while, whilst we do some recalculating, including
disk listing -> continuation switching. But then one (i.e. I) misses the fact
that these queues use the pop_eventually mechanism, which does _not_ handle
a scenario where we push something into an empty queue, thus triggering the
future that resumes a waiting task, but then pop the element immediately, before
the waiting task is run. In fact, _iff_ one does this, not only will things break,
they will in fact start creating undefined behaviour, because the underlying
std::queue<T, circular_buffer> will _not_ do any bounds checks on the pop/push
operations -> we will pop an empty queue, immediately making it non-empty, but
using undefined memory (with luck null/zeroes).
Strictly speaking, seastar::queue::pop_eventually should be fixed to handle
the scenario, but nonetheless we can fix the usage here as well, by simply copying
the objects and doing the calculation "in background" while we potentially start
popping the queues again.
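
A minimal sketch of the "copy instead of drain" idea, using Python deques
as stand-ins for the two segment queues. `disk_footprint` and `size_of` are
hypothetical names:

```python
from collections import deque

def disk_footprint(reserve, recycled, size_of):
    # Take a snapshot of both queues; the queues themselves are never
    # emptied, so a waiter on them is never woken up for a vanished element.
    snapshot = list(reserve) + list(recycled)
    return sum(size_of(segment) for segment in snapshot)

# Example: segments are represented directly by their sizes in bytes.
reserve_segments = deque([10, 20])
recycled_segments = deque([5])
total = disk_footprint(reserve_segments, recycled_segments, lambda s: s)
```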
Closes #9966
We perform a bunch of schema changes with different values of
`migration_manager::_group0_history_gc_duration` and check if entries
are cleared according to this setting.
When performing a change through group 0 (which right now only covers
schema changes), clear entries from group 0 history table which are older
than one week.
This is done by including an appropriate range tombstone in the group 0
history table mutation.
The compactor recently acquired the ability to consume a v2 stream. The
v2 spec requires that all streams end with a null tombstone.
`range_tombstone_assembler`, the component the compactor uses for
converting the v2 input into its v1 output enforces this with a check on
`consume_end_of_partition()`. Normally the producer of the stream the
compactor is consuming takes care of closing the active tombstone before
the stream ends. The compactor, however (or its consumer), can decide to
end consumption early, e.g. to cut the current page. When this happens
the compactor must take care of closing the tombstone itself.
Furthermore it has to keep this tombstone around to re-open it on the
next page.
This patch implements this mechanism which was left out of 134601a15e.
It also adds a unit test which reproduces the problems caused by the
missing mechanism.
The compactor now tracks the last clustering position emitted. When the
page ends, this position will be used as the position of the closing
range tombstone change. This ensures the range tombstone only covers the
actually emitted range.
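
A minimal model of the mechanism in Python, not the compactor's actual
code. Fragments are tuples: ("rtc", position, tombstone_or_None) for a range
tombstone change and ("row", position) for a row; the function name and the
row-count page limit are assumptions:

```python
def paginate(fragments, rows_per_page):
    """Split a v2-like stream into pages, each ending with a null tombstone."""
    if rows_per_page < 1:
        raise ValueError("rows_per_page must be >= 1")
    pages, page = [], []
    active = None     # tombstone currently open in the input stream
    last_pos = None   # last clustering position emitted
    rows = 0
    for frag in fragments:
        if rows == rows_per_page:
            # The page is full: close the active tombstone at the last
            # emitted position, so it covers only the emitted range.
            if active is not None:
                page.append(("rtc", last_pos, None))
            pages.append(page)
            page, rows = [], 0
            # Re-open the remembered tombstone on the next page.
            if active is not None:
                page.append(("rtc", last_pos, active))
        page.append(frag)
        last_pos = frag[1]
        if frag[0] == "rtc":
            active = frag[2]
        else:
            rows += 1
    if page:
        pages.append(page)
    return pages
```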
Fixes: #9907
Tests: unit(dev), dtest(paging_test.py, paging_additional_test.py)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220114053215.481860-1-bdenes@scylladb.com>
Abort signals stopped_error on all awaited entries, but if an entry is
added after this it will be destroyed without signaling and will cause
a waiter to get broken_promise.
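
A minimal sketch of one possible way to handle this, in Python with
`concurrent.futures.Future` standing in for a promise. The class and method
names are hypothetical and this is not necessarily how the fix is
implemented:

```python
from concurrent.futures import Future

class StoppedError(Exception):
    pass

class AwaitedEntries:
    def __init__(self):
        self._waiters = {}
        self._aborted = False

    def wait_for(self, idx):
        fut = Future()
        if self._aborted:
            # Added after abort(): signal immediately instead of leaving
            # the future to be dropped without ever being resolved.
            fut.set_exception(StoppedError())
        else:
            self._waiters[idx] = fut
        return fut

    def abort(self):
        self._aborted = True
        for fut in self._waiters.values():
            fut.set_exception(StoppedError())
        self._waiters.clear()
```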
Fixes #9688
Message-Id: <Ye6xJjTDooKSuZ87@scylladb.com>
We introduce a new table, `system.group0_history`.
This table will contain a history of all group 0 changes applied through
Raft. Each change has an associated unique ID, which also identifies
the state of all group 0 tables (including schema tables) after this
change is applied, assuming that all such changes are serialized through
Raft (they will be eventually).
Group 0 commands, in addition to mutations which modify group 0 tables,
contain a "previous state ID" and a "new state ID".
The group 0 state machine will only modify state during command
application if the provided "previous state ID" is equal to the
last state ID present in the history table. Otherwise, the command will
be a no-op.
To ensure linearizability of group 0 changes, the performer of the
change must first read the last state ID, only then read the state
and send a command for the state machine. If a concurrent change
races with this command and manages to modify the state, we will detect
that the last state ID does not match during `apply`; all calls to
`apply` are serialized, and `apply` adds the new entry to the history
table at the end, after modifying the group 0 state.
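
A minimal model of the check in Python. The class name, the dictionary used
as "state" and the plain list used as the history table are simplifications
for illustration:

```python
class Group0StateMachine:
    def __init__(self):
        self.history = []   # state IDs, oldest first
        self.state = {}     # stand-in for all group 0 tables

    def last_state_id(self):
        return self.history[-1] if self.history else None

    def apply(self, prev_state_id, new_state_id, mutations):
        # Calls to apply() are assumed to be serialized by the caller.
        if prev_state_id != self.last_state_id():
            return False    # a concurrent change won the race: no-op
        self.state.update(mutations)
        # The history entry is added last, after the state was modified.
        self.history.append(new_state_id)
        return True
```

If two performers both read the same last state ID and then send commands,
only the first command to be applied takes effect; the second one sees a
mismatch and changes nothing.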
The details of this mechanism are abstracted away with `group0_guard`.
To perform a group 0 change, one needs to call `announce`, which
requires a `group0_guard` to be passed in. The only way to obtain a
`group0_guard` is by calling `start_group0_operation`, which underneath
performs a read barrier on group 0, obtains the last state ID from the
history table, and constructs a new state ID that the change will append
to the history table. The read barrier ensures that all previously
completed changes are visible to this operation. The caller can then
perform any necessary validation, construct mutations which modify group 0
state, and finally call `announce`.
The guard also provides a timestamp which is used by the caller
to construct the mutations. The timestamp is obtained from the new state ID.
We ensure that it is greater than the timestamp of the last state ID.
Thus, if the change is successful, the applied mutations will have greater
timestamps than the previously applied mutations.
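
How the ordering is guaranteed is not spelled out here; one simple way,
shown as a sketch with a hypothetical helper working on microsecond
timestamps, is to never go below the last timestamp plus one:

```python
import time

def new_state_timestamp(last_ts_micros, now_micros=None):
    """Pick a timestamp strictly greater than the last state ID's timestamp."""
    if now_micros is None:
        now_micros = time.time_ns() // 1000
    # If the local clock is behind the last state ID, step just past it.
    return max(now_micros, last_ts_micros + 1)
```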
We also add two locks. The more important one, used to ensure
correctness, is `read_apply_mutex`. It is held when modifying group 0
state (in `apply` and `transfer_snapshot`) and when reading it (it's
taken when obtaining a `group0_guard` and released before a command is
sent in `announce`). Its goal is to ensure that we don't read partial
state, which could happen without it because group 0 state consists of
many parts and `apply` (or `transfer_snapshot`) potentially modifies all
of them. Note: this doesn't give us 100% protection; if we crash in the
middle of `apply` (or `transfer_snapshot`), then after restart we may
read partial state. To remove this possibility we need to ensure that
commands which were being applied before restart but not finished are
re-applied after restart, before anyone can read the state. I left a
TODO in `apply`.
The second lock, `operation_mutex`, is used to improve liveness. It is
taken when obtaining a `group0_guard` and released after a command is
applied (compare to `read_apply_mutex` which is released before a
command is sent). It is not taken inside `apply` or `transfer_snapshot`.
This lock ensures that multiple fibers running on the same node do not
attempt to modify group0 concurrently - this would cause some of them to
fail (due to the concurrent modification protection described above).
This is mostly important during first boot of the first node, when
services start for the first time and try to create their internal
tables. This lock serializes these attempts, ensuring that all of them
succeed.
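
A minimal model of the two locks using asyncio, where tasks play the role of
fibers. The class and method names are hypothetical, `asyncio.sleep(0)`
stands in for sending the command through Raft, and the locks are created
inside a running event loop:

```python
import asyncio

class Group0Client:
    def __init__(self):
        self.read_apply_mutex = asyncio.Lock()
        self.operation_mutex = asyncio.Lock()
        self.history = []
        self.state = {}

    async def apply(self, prev_id, new_id, mutations):
        async with self.read_apply_mutex:
            if prev_id != (self.history[-1] if self.history else None):
                return False
            self.state.update(mutations)
            self.history.append(new_id)
            return True

    async def change(self, new_id, make_mutations):
        # operation_mutex is held until the command has been applied.
        async with self.operation_mutex:
            # read_apply_mutex is held only while reading the state.
            async with self.read_apply_mutex:
                prev_id = self.history[-1] if self.history else None
                mutations = make_mutations(dict(self.state))
            await asyncio.sleep(0)  # stand-in for the Raft round trip
            return await self.apply(prev_id, new_id, mutations)

async def demo():
    client = Group0Client()
    results = await asyncio.gather(
        client.change("A", lambda state: {"table1": 1}),
        client.change("B", lambda state: {"table2": 1}),
    )
    return results, client.history
```

In this model both concurrent changes succeed, because the second one only
reads the last state ID after the first one has been applied.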
* kbr/schema-state-ids-v4:
service: migration_manager: `announce`: take a description parameter
service: raft: check and update state IDs during group 0 operations
service: raft: group0_state_machine: introduce `group0_command`
service: migration_manager: allow using MIGRATION_REQUEST verb to fetch group 0 history table
service: migration_manager: convert migration request handler to coroutine
db: system_keyspace: introduce `system.group0_history` table
treewide: require `group0_guard` when performing schema changes
service: migration_manager: introduce `group0_guard`
service: raft: pass `storage_proxy&` to `group0_state_machine`
service: raft: raft_state_machine: pass `snapshot_descriptor` to `transfer_snapshot`
service: raft: rename `schema_raft_state_machine` to `group0_state_machine`
service: migration_manager: rename `schema_read_barrier` to `start_group0_operation`
service: migration_manager: `announce`: split raft and non-raft paths to separate functions
treewide: pass mutation timestamp from call sites into `migration_manager::prepare_*` functions
service: migration_manager: put notifier call inside `async`
service: migration_manager: remove some unused and disabled code
db: system_distributed_keyspace: use current time when creating mutations in `start()`
redis: keyspace_utils: `create_keyspace_if_not_exists_impl`: call `announce` twice only
In 4aa9e86924 ("Merge 'alternator: move uses of replica module to
data_dictionary' from Avi Kivity"), we changed alternator to use
data_dictionary instead of replica::database. However,
data_dictionary::database objects are different from replica::database
objects in that they don't have a stable address and need to be
captured by value (they are pointer-like). One capture in
describe_stream() was capturing a data_dictionary::database
by reference and so caused a use-after-free when the previous
continuation was deallocated.
Fix by capturing by value.
Fixes #9952.
Closes #9954
The description parameter is used for the group 0 history mutation.
The default is empty, in which case the mutation will leave
the description column as `null`.
I filled the parameter in some easy places as an example and left the
rest for a follow-up.
This is how it looks now in a fresh cluster with a single statement
performed by the user:
cqlsh> select * from system.group0_history ;
key | state_id | description
---------+--------------------------------------+------------------------------------------------------
history | 9ec29cac-7547-11ec-cfd6-77bb9e31c952 | CQL DDL statement
history | 9beb2526-7547-11ec-7b3e-3b198c757ef2 | null
history | 9be937b6-7547-11ec-3b19-97e88bd1ca6f | null
history | 9be784ca-7547-11ec-f297-f40f0073038e | null
history | 9be52e14-7547-11ec-f7c5-af15a1a2de8c | null
history | 9be335dc-7547-11ec-0b6d-f9798d005fb0 | null
history | 9be160c2-7547-11ec-e0ea-29f4272345de | null
history | 9bdf300e-7547-11ec-3d3f-e577a2e31ffd | null
history | 9bdd2ea8-7547-11ec-c25d-8e297b77380e | null
history | 9bdb925a-7547-11ec-d754-aa2cc394a22c | null
history | 9bd8d830-7547-11ec-1550-5fd155e6cd86 | null
history | 9bd36666-7547-11ec-230c-8702bc785cb9 | Add new columns to system_distributed.service_levels
history | 9bd0a156-7547-11ec-a834-85eac94fd3b8 | Create system_distributed(_everywhere) tables
history | 9bcfef18-7547-11ec-76d9-c23dfa1b3e6a | Create system_distributed_everywhere keyspace
history | 9bcec89a-7547-11ec-e1b4-34e0010b4183 | Create system_distributed keyspace
Objects of this type will be serialized and sent as commands to the
group 0 state machine. They contain a set of mutations which modify
group 0 tables (at this point: schema tables and group 0 history table),
the 'previous state ID' which is the last state ID present in the
history table when the operation described by this command has started,
and the 'new state ID' which will be appended to the history table if
this change is successful (successful = the previous state ID is still
equal to the last state ID in the history table at the moment of
application). It also contains the address of the node which constructed
this command.
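
As a rough illustration of the fields listed above, here is a Python
dataclass. The field names and types are assumptions and do not reflect the
actual serialized layout:

```python
from dataclasses import dataclass
from typing import Optional, Tuple
import uuid

@dataclass(frozen=True)
class Group0Command:
    # Mutations which modify group 0 tables (opaque bytes in this sketch).
    mutations: Tuple[bytes, ...]
    # Last state ID in the history table when the operation started;
    # None models an empty history table.
    prev_state_id: Optional[uuid.UUID]
    # State ID appended to the history table if the change is successful.
    new_state_id: uuid.UUID
    # Address of the node which constructed this command.
    creator_addr: str
```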
The state ID mechanism will be described in more detail in a later
commit.
The MIGRATION_REQUEST verb is currently used to pull the contents of
schema tables (in the form of mutations) when nodes synchronize schemas.
We will (ab)use the verb to fetch additional data, such as the contents
of the group 0 history table, for purposes of group 0 snapshot transfer.
We extend `schema_pull_options` with a flag specifying that the puller
requests the additional data associated with group 0 snapshots. This
flag is `false` by default, so existing schema pulls will do what they
did before. If the flag is `true`, the migration request handler will
include the contents of group 0 history table.
Note that if a request is sent with the flag set to `true`, that means
the entire cluster must have enabled the Raft feature, which also means
that the handler knows of the flag.
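
A minimal sketch of the flag's effect on the handler, in Python. The flag
name `group0_snapshot_transfer` and the function signature are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class SchemaPullOptions:
    # False by default, so existing schema pulls behave as before.
    group0_snapshot_transfer: bool = False

def handle_migration_request(options, schema_mutations, group0_history):
    reply = list(schema_mutations)
    if options.group0_snapshot_transfer:
        # The puller asked for the data needed for a group 0 snapshot.
        reply.extend(group0_history)
    return reply
```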
This table will contain a history of all group 0 changes applied through
Raft. Each change has an associated unique ID, which also identifies
the state of all group 0 tables (including schema tables) after this
change is applied, assuming that all such changes are serialized through
Raft (they will be eventually).
We will use these state IDs to check if a given change is still
valid at the moment it is applied (in `group0_state_machine::apply`),
i.e. that there wasn't a concurrent change that happened between
creating this change and applying it (which may invalidate it).
`announce` now takes a `group0_guard` by value. `group0_guard` can only
be obtained through `migration_manager::start_group0_operation` and
moved, it cannot be constructed outside `migration_manager`.
The guard will be a method of ensuring linearizability for group 0
operations.