When a memtable receives a tombstone, it can happen under some workloads
that the tombstone covers data which is still in the memtable: some
workloads insert and delete data within a short time frame. We could
reduce the rate of memtable flushes if we eagerly drop tombstoned data.
One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.
Fixes #652.
Closes #10807
* github.com:scylladb/scylla:
memtable: Add counters for tombstone compaction
memtable, cache: Eagerly compact data with tombstones
memtable: Subtract from flushed memory when cleaning
mvcc: Introduce apply_resume to hold state for partition version merging
test: mutation: Compare against compacted mutations
compacting_reader: Drop irrelevant tombstones
mutation_partition: Extract deletable_row::compact_and_expire()
mvcc: Apply mutations in memtable with preemption enabled
test: memtable: Make failed_flush_prevents_writes() immune to background merging
If we reach a situation where the flush rate exceeds the compaction rate,
we may end up with an arbitrarily large number of sstables on disk. If a
read is executed in such a case, the amount of memory required is
proportional to
the number of sstables for the given shard, which in extreme cases can
lead to OOM.
In the wild, this was observed in 2 scenarios:
- A node with >10 shards creates a keyspace with thousands of tables,
drops the keyspace and shuts down before compaction finishes. Dropping
keyspace drops tables, and each dropped table is smp::count writes to
system.local table with flush after write, which creates tens of
thousands of sstables. Bootstrap read from system.local will run OOM.
- A failure to agree on table schema (due to a code bug) between nodes
during repair resulted in excessive flushing of small sstables which
compaction couldn't keep up with.
The unit test introduced in this patch series proves that even
hard-setting maximum shares for compaction and minimum shares for
flushing doesn't tilt the balance towards compaction enough to prevent
the problem. Since it's a fast-producer, slow-consumer problem, the
remaining solution is to block the producer until the consumer catches up.
If there are too many sstable runs originating from memtables, we block
the current flush until the number of sstables is reduced (via ongoing
compaction or a truncate operation).
Fixes https://github.com/scylladb/scylla/issues/4116
Changelog:
v5:
- added a nicer way of timing the stalls caused by waiting for flush
- added predicate on signal when waiting for reduction of the number of sstables to correctly handle spurious wake ups
- added comment why we trigger compaction before waiting for sstable count reduction
- removed unnecessary cv.signal from table::stop
v4:
- removed conversion of table::stop to coroutines. It's an orthogonal change and doesn't need to go into this patchset
v3:
- removed unnecessary change to scheduling groups from v2
- moved sstables_changed signalling to suggested place in table::stop
- added log how long the table flush was blocked for
- changed the threshold to max(schema()->max_compaction_threshold(), 32) and comparison to <=
v2:
- Reimplemented waiting algorithm based on reviewers' feedback. It's confined to the table class and it waits in a loop until the number of sstable runs goes below threshold. It uses condition variable which is signaled on sstable set refresh. It handles node shutdown as well.
- Converted table::stop to coroutines.
- Reordered commits so that test is committed after fix, so it doesn't trip up bisection.
Closes #10717
* github.com:scylladb/scylla:
table: Add test where compaction doesn't keep up with flush rate.
random_mutation_generator: Add option to specify ks_name and cf_name
table: Prevent creating unbounded number of sstables
Memtables and cache will compact eagerly, so tests should not expect
readers to produce the exact mutations written, only those which are
equivalent after applying compaction.
The compacting reader created using make_compacting_reader() was not
dropping range_tombstone_change fragments which were shadowed by the
partition tombstones. As a result the output fragment stream was not
minimal.
Lack of this change would cause problems in unit tests later in the
series after the change which makes memtables lazily compact partition
versions. In test_reverse_reader_reads_in_native_reverse_order we
compare output of two readers, and assume that compacted streams are
the same. If compacting reader doesn't produce minimal output, then
the streams could differ if one of them went through the compaction in
the memtable (which is minimal).
The series fixes a couple of crashes that were found while starting and
stopping Scylla with raft while doing DDL operations. Most of them are
related to the shutdown order between different components.
Also in scylla-dev gleb/group0-fixes-v1
CI https://jenkins.scylladb.com/job/releng/job/Scylla-CI/749/
* origin-dev/gleb/group0-fixes-v1:
migration manager: remove unused code
db/system_distributed_keyspace: do not announce empty schema
main: stop raft before the migration manager
storage_service: do not pass the raft group manager to storage_service constructor
main: destroy the group0_client after stopping the group0
This reverts commit e0670f0bb5, reversing
changes made to 605ee74c39. It causes failures
in debug mode in
database_test.test_database_with_data_in_sstables_is_a_mutation_source_plain,
though with low probability.
Fixes #10780. Reopens #652.
Reduce the storage_service's dependency on the raft group manager. The
group manager is needed only during bootstrap and in an rpc handler, so
pass it to those functions directly.
Memtables and cache will compact eagerly, so tests should not expect
readers to produce the exact mutations written, only those which are
equivalent after applying compaction.
The compacting reader created using make_compacting_reader() was not
dropping range_tombstone_change fragments which were shadowed by the
partition tombstones. As a result the output fragment stream was not
minimal.
Lack of this change would cause problems in unit tests later in the
series after the change which makes memtables lazily compact partition
versions. In test_reverse_reader_reads_in_native_reverse_order we
compare output of two readers, and assume that compacted streams are
the same. If compacting reader doesn't produce minimal output, then
the streams could differ if one of them went through the compaction in
the memtable (which is minimal).
The projected limited replacement of downgraded v1 mutation reader
will not do its own buffering, so this test will be pointless.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
mutation_sources are going to be created only from v2 readers and the
::make_reader() method family is scheduled for removal.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
The manager reference is already available in the constructor and thus
can be copied to an on-table member.
The code that chooses the manager (user/system one) should be moved
from make_column_family_config() into add_column_family() method.
Once this happens, the get_sstables_manager() should be fixed to
return the reference from its new location. While at it -- mark the
method in question noexcept and add its mutable overload.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In core code there's only one place that constructs a table -- in
database.cc -- and this place currently has the sstables_manager pointer
sitting on the table config (despite being a pointer, it's always non-null).
All the tests use the manager from one of the _env's out there.
For now the new constructor arg is unused.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Storage service needs it to calculate schema version on join. The proxy
at this point can be passed as an argument to the joining helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The service in question is only needed at join_cluster time, so there is
no need to keep it in the dependencies list. This also solves the
dependency trouble -- the distributed keyspace is sharded::start-ed after
it's passed to storage_service initialization.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This service is only needed at join time, so it's better to pass it as an
argument to join_cluster(). This solves the current reversed dependency
issue -- the cdc_gen_svc is now started after it's passed to storage
service initialization.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now they always follow one another both in main and cql-test-env.
Also, despite the name, init-server does join the cluster when it's
just a normal node restarting, so join-cluster is called when the
cluster is already joined. This merge makes the function's name match
what it really does.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And make cql-test-env skip it, so as not to slow down tests in vain.
Another side effect is that cql-test-env now triggers feature enabling
at this point, but that's OK, the features are enabled anyway.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Writing into the group0 raft group on a client side involves locking
the state machine, choosing a state id and checking for its presence
after the operation completes. The code that does this currently resides
in the migration manager, since it is currently the only user of group0.
In the near future we will have more clients of group0 and they will all
have to have the same logic, so the patch moves it to a separate class,
raft_group0_client, that any future user of group0 can use to write
into it.
Message-Id: <YoYAJwdTdbX+iCUn@scylladb.com>
The compaction manager's empty constructor is supposed to be invoked
only in a testing environment; however, it is easy to invoke it by mistake
from production code.
Here we add a more verbose constructor and make the default constructor
private; to reach the latter, the caller must pass a for_testing_tag,
which ensures that this constructor is invoked only when intended.
The unit tests were changed according to this new paradigm.
Tests: unit (dev)
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
"Almost" because 2 uses of the v1 asserter remain (as they are deliberate).
Closes #10518
* github.com:scylladb/scylla:
tests: remove obsolete utility functions
tests: less trivial flat_reader_assertions{,_v2} conversions
tests: trivial flat_reader_assertions{,_v2} conversions
flat_mutation_reader_assertions_v2: improve range tombstone support
We introduce a new service that performs failure detection by periodically
pinging endpoints. The set of pinged endpoints can be dynamically extended
and shrunk. To learn about the liveness of endpoints, a user of the service
registers a listener and chooses a threshold -- a duration of time which
has to pass since the last successful ping in order to mark an endpoint
as dead. When an endpoint responds, it's immediately marked as alive.
Endpoints are identified using abstract integer identifiers.
The method of performing a ping is a dependency of the service provided
by the user through the `pinger` interface. The implementation of `pinger` is
responsible for translating the abstract endpoint IDs to 'real'
addresses. For example, a production implementation may map endpoint IDs
to IP addresses and use TCP/IP to perform the ping, while a test/simulation
implementation may use a simulated network that also operates on
abstract identifiers.
Similarly, the method of measuring time is a dependency provided by the
user using the `clock` interface. The service operates on abstract time
intervals and timepoints. So, for example, in a production
implementation time can be measured using a stopwatch, while in
test/simulation we can use a logical clock.
The service distributes work across different shards. When an endpoint
is added to the set of detected endpoints, the service will choose the
shard with the smallest number of workers and create a worker that is
responsible for periodically pinging this endpoint on that shard and
sending notifications to listeners.
We modify the randomized nemesis test to use the new service.
The service is sharded, but for simplicity of implementation in the test
we implement RPCs and sleeps by routing the requests to shard 0, where the
logical timers and simulated network live: RPCs use the existing simulated
network, and the clock uses the existing logical timers.
We also integrate the service with production code. There,
`pinger` is implemented using existing GOSSIP_ECHO verb. The gossip echo
message requires the node's gossip generation number. We handle this by
embedding the pinger implementation inside `gossiper`, and making
`gossiper` update the generation number (cached inside the pinger class)
periodically.
Production `clock` is a simple implementation which uses
`std::chrono::steady_clock` and `seastar::sleep_until` underneath.
Translating `steady_clock` durations to `direct_fd::clock` durations happens
by taking the number of ticks.
We connect the group 0 raft server rpc implementation to the new service,
so that when servers are added or removed from the group 0 configuration,
corresponding endpoints are added to the direct failure detector service.
Thus the set of detected endpoints will be equal to the group 0 configuration.
On each shard, we register a listener for the service.
The listener maintains a set of live addresses; on mark_alive it adds a
server to the set and on mark_dead it removes it. This set is then used
to implement the `raft::failure_detector` interface, consisting of
`is_alive()` function, which simply checks set membership.
---
v6:
- remove `_alive_start_index`. Instead, keep a map of `bool`s to track liveness of each endpoint. See the code for details (`listeners_liveness` struct and its usage in `ping_fiber()`, `notify_fiber()`, `add/remove_worker`, `add/remove_listener`). The diff is easy to read: f617aeca62..d4b225437c
v5:
- renamed `rpc` to `pinger`
- replaced `bool` with `enum class endpoint_update` (with values `added` and `removed`) in `_endpoint_updates`
- replaced `unsigned` with `shard_id`
- fixed definition of `threshold(size_t n)` (it didn't use `n`, but `_alive_start`; fortunately all uses passed `_alive_start` as `n` so the bug wouldn't affect the behavior)
- improve `_num_workers` assertions
- signal `_alive_start_changed` only when `_alive_start` indeed changed
- renamed `{_marked}_alive_start` to `{_marked}_alive_start_index`
v4:
- rearrange ping_fiber(). Remove the loop at the end of the big `while`
which was timing out listeners (after the sleep). Instead:
- rely on the loop before the sleep for timing out listeners
- before calling ping(), check if there is a timed out listener,
if so abandon the ping, immediately proceed to the timing-out-listeners
loop, and then immediately proceed to the next iteration of the big `while`
(without sleeping)
- inline send_mark_dead() and send_mark_alive(); each was used in
exactly one place after the rearrangement
- when marking alive, instead of repeatedly doing `--_alive_start` and
signalling the condition variable, just do `_alive_start = 0` and signal
the condition variable once
- fix the condition for stopping `endpoint_worker::notify_fiber()`: before, it was
`_as.abort_requested()`, now it is `_as.abort_requested() && _alive_start == _fd._listeners.size()`.
Indeed, we want to wait for the stopping code (`destroy_worker()`)
to set `_alive_start = _fd._listeners.size()` before `notify_fiber()`
finishes so `notify_fiber()` can send the final `mark_dead`
notifications for this endpoint. There was a race before where
`notify_fiber()` could finish before it sent those notifications
(because it finished as soon as it noticed `_as.abort_requested()`)
- fix some waits in the unit test; they depended on particular ordering
of tasks by the Scylla reactor, the test could sometimes hang in debug
mode which randomizes task order
- fix `rpc::ping()` in randomized_nemesis_test so it doesn't give an
exceptional discarded future in some cases
v3:
- fix a race in failure_detector::stop(): we must first wait for _destroy_subscriptions fiber to finish on all shards, only then we can set _impl to nullptr on any shard
- invoke_abortable_on was moved from randomized_nemesis_test to raft/helpers
- add a unit test (second patch)
v2:
- rename `direct_fd` namespace to `direct_failure_detector`
- move gms/direct_failure_detector.{cc,hh} to direct_failure_detector/failure_detector.{cc,hh}
- cleaned license comments
- removed _mark_queue for sending notifications from ping_fiber() to notify_fiber(). Instead:
- _listeners is now a boost::container::flat_multimap (previously it was std::multimap)
- _alive_start is no longer an iterator to _listeners, but an index (size_t)
- _mark_queue was replaced with a second index to _listeners, _marked_alive_start, together with a condition variable, _alive_start_changed
- ping_fiber() signals _alive_start_changed when it changes _alive_start
- notify_fiber() waits on _alive_start_changed. When it wakes up, it compares _marked_alive_start to _alive_start, sends notifications to listeners appropriately, and updates _marked_alive_start
- replacing _mark_queue with index + condition variable allowed some better exception specifications: send_mark_alive and send_mark_dead are now noexcept, ping_fiber() is specified to not return exceptional futures other than sleep_aborted which can only happen when we destroy the worker (previously, ping_fiber() could silently stop due to exception happening when we insert to _mark_queue - it could probably only be bad_alloc, but still)
- _shard_workers is now unordered_map<endpoint_id, endpoint_worker> instead of unordered_map<endpoint_id, unique_ptr<endpoint_worker>> (after learning how to construct map values in place - using either `emplace`+`forward_as_tuple` or `try_emplace`)
- `failure_detector::impl::add_endpoint` now gives strong exception guarantee: if an exception is thrown, no state changes
- same for `failure_detector::impl::remove_endpoint`
- `failure_detector::impl::create_worker` now uses `on_internal_error` when it detects that there is a worker for this endpoint already - thanks to the strong exception guarantees of `add_endpoint` and `remove_endpoint` this should never happen
- comment at _num_workers definition why we maintain this statistic (to pick a shard with smallest number of workers)
- remove unnecessary `if (_as.abort_requested())` in `ping_fiber()`
- in ping_fiber(), after a ping, we send notifications to listeners which we know will time-out before the next ping starts. Before, we would sleep until the threshold is actually passed by the clock. Now we send it immediately - we know ahead of time that the listener will time-out and we can notify it immediately.
- due to above, comment at `register_listener` was adjusted, with the following note added: "Note: the `mark_dead` notification may be sent earlier if we know ahead of time that `threshold` will be crossed before the next `ping()` can start."
- `register_listener` now takes a `listener&`, not `listener*`
- at `register_listener` comment why we allow different thresholds (second to last paragraph)
- at `register_listener` mention that listeners can be registered on any shard (last paragraph)
- add protected destructors to rpc, clock, listener, and mention that these objects are not owned/destroyed by `failure_detector`.
- replaced _endpoint_queue (seastar::queue<pair<endpoint_id, bool>>) with unordered_map<endpoint_id, bool> + condition variable. When user calls add/remove_endpoint, an entry is inserted to this map, or existing entry is updated, and the condition variable is signaled. update_endpoint_fiber() waits on the condition variable, performs the add/remove operation, and removes entries from this map. Compared to the previous solution:
- the new solution has at most one entry for a given endpoint, so the number of entries is bounded by the number of different endpoints (so in the main Scylla use case, by the number of different nodes that ever exist); the previous solution could in theory have a backlog of unprocessed events, with updates for a given endpoint appearing multiple times in the queue at once
- when the add/remove operation fails in update_endpoint_fiber(), we don't remove the entry from the map so the operation can be retried later. Previously we would always remove the entry from the queue so it doesn't grow too big in presence of failures.
- when the add/remove operation fails in update_endpoint_fiber(), we sleep for 10*ping_period before retrying. Note that this codepath should not be reached in practice, it can basically only happen on bad_alloc
- commented that `clock::sleep_until` should signalize aborts using `sleep_aborted`
- `clock::now()` is `noexcept`
- `add/remove_endpoint` can be called after `stop()`, they just won't do anything in that case. Reason: next item
- in randomized_nemesis_test, stop failure detector before raft server (it was the other way before), so it stops using server's RPC before server is aborted. Before, the log was spammed with errors from failure detector because failure detector was getting gate_closed_exceptions from the RPC when the server was stopped. A side effect is that the raft server may continue adding/removing endpoints when the failure detector is stopped, which is fine due to above item
- randomized_nemesis_test: direct_fd_clock::sleep_until translates abort_requested_exception to sleep_aborted (so sleep_until satisfies the interface specification)
- message/rpc_protocol_impl: send_message_abortable: if abort_source::subscribe returns null, immediately throw abort_requested_exception (before we would send the message out and not react to an abort if it happened before we were called)
- rebase
Closes #10437
* github.com:scylladb/scylla:
service: raft: remove `raft_gossip_failure_detector`
service: raft: raft_group_registry: use direct failure detector notifications for raft server liveness
service: raft: add/remove direct failure detector endpoints on group 0 configuration changes
main: start direct failure detector service
messaging_service: abortable version of `send_gossip_echo`
message: abortable version of `send_message`
test: raft: randomized_nemesis_test: remove old failure_detector
test: raft: randomized_nemesis_test: use `direct_failure_detector::failure_detector`
test: raft: randomized_nemesis_test: ping all shards on each tick
test: unit test for new failure detector service
direct_failure_detector: introduce new failure detector service
Dealing with the handful of tests that check range tombstones in
interesting ways and need more than search-and-replace.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
* Track the active range tombstone.
* Add `may_produce_tombstones()`.
* Flesh out `produces_row_with_key()`.
* Add more trace logs.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
propagate_replacement() is an internal function that shouldn't be in
the public interface. No one besides a unit test for incremental
compaction needs it. In the future, I want to revisit the incremental
compaction unit test to stop using it and rely only on public
interfaces.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220506171647.81063-1-raphaelsc@scylladb.com>
We connect the group 0 raft server rpc implementation to the new direct
failure detector service, so that when servers are added or removed from
the group 0 configuration, corresponding endpoints are added to the
direct failure detector service. Thus the set of detected endpoints will
be equal to the group 0 configuration.
This causes the failure detector service to start pinging endpoints,
but no listeners are registered yet. The following commit changes that.
We add the new direct failure detector to the list of services started
in the Scylla process.
To start the service, we need an implementation of `pinger` and `clock`.
`pinger` is implemented using existing GOSSIP_ECHO verb. The gossip echo
message requires the node's gossip generation number. We handle this by
embedding the pinger implementation inside `gossiper`, and making
`gossiper` update the generation number (cached inside the pinger class)
periodically.
`clock` is a simple implementation which uses `std::chrono::steady_clock`
and `seastar::sleep_until` underneath. Translating `steady_clock`
durations to `direct_failure_detector::clock` durations happens by taking
the number of ticks.
The service is currently not used, just initialized; no endpoints are
added and no listeners are registered yet, but the following commits
change that.
No code uses the global gossiper instance, so it can be removed. The main
and cql-test-env code now have their own real local instances.
This change also requires adding the debug:: pointer and fixing
scylla-gdb.py to find the correct global location.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The reference is already available at the env initialization, but it's
not kept on the env instance itself. Will be used by the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some places in the code have a function-local gossiper reference but
continue to use the global instance. Re-use the local reference (it's
going to become a sharded<> instance soon).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The reference is put on the snitch_ptr because it is the sharded<>
thing and because the gossiper reference is the same for different snitch
drivers. Also, getting the gossiper from snitch_ptr by a driver will look
simpler than getting it from any base class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Snitch depends on gossiper and system keyspace, so it needs to be
started after those two do.
Fixes #10402
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This series enforces a minimum size of the unprivileged section when
performing the `shrink()` operation.
When the cache is shrunk, we still drop entries first from unprivileged
section (as before this commit), however, if this section is already small
(smaller than `max_size / 2`), we will drop entries from the privileged
section.
This is necessary, as before this change the unprivileged section could
be starved. For example, if the cache could store at most 50 entries and
there are 49 entries in the privileged section, then after adding 5 entries
(which would go to the unprivileged section) 4 of them would get evicted
and only the 5th one would stay. This caused problems with BATCH statements
prepared statements in the batch have to stay in cache at the same time
for the batch to correctly execute.
To correctly check whether the unprivileged section might get too small
after dropping an entry, the `_current_size` variable, which tracked the
overall size of the cache, is split into two variables:
`_unprivileged_section_size` and `_privileged_section_size`, tracking the
section sizes separately.
New tests are added to check this new behavior and the bookkeeping of the
section sizes. A test is added that sets up a CQL environment with a very
small prepared statement cache, reproduces the issue in #10440, and
stresses the cache.
Fixes #10440.
Closes #10456
* github.com:scylladb/scylla:
loading_cache_test: test prepared stmts cache
loading_cache: force minimum size of unprivileged
loading_cache: extract dropping entries to lambdas
loading_cache: separately track size of sections
loading_cache: fix typo in 'privileged'
Add a new test that sets up a CQL environment with a very small prepared
statements cache. The test reproduces a scenario described in #10440,
where a privileged section of prepared statement cache gets large
and that could possibly starve the unprivileged section, making it
impossible to execute BATCH statements. Additionally, at the end of the
test, prepared statements/"simulated batches" with prepared statements
are executed a random number of times, stressing the cache.
To create a CQL environment with small prepared cache, cql_test_config
is extended to allow setting custom memory_config value.
We don't have any upgrade_to_v2() left in production code, so no need to
keep testing it. Removing it from this test paves the way for removing
it for good (not in this series).
After the previous patches, both create_snitch() and stop_snitch() now
look like the classical sharded service start/stop sequence. Finally, both
helpers can be removed and the rest of the users can just call start/stop
on locally obtained sharded references.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently snitch drivers register themselves in the class-registry with
all sorts of construction options. All those different constructors are
in fact "config options".
When the snitch later declares its dependencies (gossiper and system
keyspace), this will require patching all these registrations, which is
very inconvenient.
This patch introduces the snitch_config struct and replaces all the
snitch constructors with the snitch_driver(snitch_config cfg) one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>