If set, any remaining segment that has data older than this threshold will request flushing, regardless of data pressure. I.e. even a system where nothing happends will after X seconds flush data to free up the commit log.
Related to #15820
The functionality here is to prevent pathological/test cases where a silent system cannot fully process stuff like compaction, GC etc due to things like CL forcing smaller GC windows etc.
Closesscylladb/scylladb#15971
* github.com:scylladb/scylladb:
commitlog: Make max data lifetime runtime-configurable
db::config: Expose commitlog_max_data_lifetime_in_s parameter
commitlog: Add optional max lifetime parameter to cl instance
Add a new member that stores the abort_source. This can later be used by
the sstables to check if an abort has been requested. Also implement
sstables_manager::get_abort_source() that returns a const reference to
the abort source.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
This is in preparation for the following patch that adds abort_source
variable to the sstables_manager.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
rwlock was added to protect iterations against concurrent updates to the map.
the updates can happen when allocating a new tablet replica or removing an old one (tablet cleanup).
the rwlock is very problematic because it can result in topology changes blocked, as updating
token metadata takes the exclusive lock, which is serialized with table wide ops like
split / major / explicit flush (and those can take a long time).
to get rid of the lock, we can copy the storage group map and guard individual groups with a gate
(not a problem since map is expected to have a maximum of ~100 elements).
so cleanup can close that gate (carefully closed after stopping individual groups such that
migrations aren't blocked by long-running ops like major), and ongoing iterations (e.g. triggered
by nodetool flush) can skip a group that was closed, as such a group is being migrated out.
Check documentation added to compaction_group.hh to understand how
concurrent iterations and updates to the map work without the rwlock.
Yielding variants that iterate over groups are no longer returning group
id since id stability can no longer be guaranteed without serializing split
finalization and iteration.
Fixes#18821.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
It was added to make integration of storage groups easier, but it's
complicated since it's another source of truth and we could have
problems if it becomes inconsistent with the group map.
Fixes#18506.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We currently disable tombstone GC for compaction done on the read path of streaming and repair, because those expired tombstones can still prevent data resurrection. With time-based tombstone GC, missing a repair for long enough can cause data resurrection because a tombstone is potentially GC'd before it could be spread to every node by repair. So repair disseminating these expired tombstones helps clusters which missed repair for long enough. It is not a guarantee because compaction could have done the GC itself, but it is better than nothing.
This last resort is getting less important with repair-based tombstone GC. Furthermore, we have seen this cause huge repair amplification in a cluster, where expired tombstones triggered repair replicating otherwise identical rows.
This series makes tombstone GC on the streaming/repair compaction path configurable with a config item. This new config item defaults to `false` (current behaviour), setting it to `true`, will enable tombstone GC.
Fixes: https://github.com/scylladb/scylladb/issues/19015
Not a regression, no backport needed
Closesscylladb/scylladb#19016
* github.com:scylladb/scylladb:
test/topology_custom/test_repair: add test for enable_tombstone_gc_for_streaming_and_repair
replica/table: maybe_compact_for_streaming(): toggle tombstone GC based on the control flag
replica: propagate enable_tombstone_gc_for_streaming_and_repair to maybe_compact_for_streaming()
db/config: introduce enable_tombstone_gc_for_streaming_and_repair
The following command had been executed to get the
list of headers that did not contain '#pragma once':
'grep -rnw . -e "#pragma once" --include *.hh -L'
This change adds missing include guard to headers
that did not contain any guard.
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Closesscylladb/scylladb#19626
Before each function change was immediately visible as
during event notification logic yielded.
Now we first gather the modifications and then commit them.
Further work will broaden the scope of atomicity to the whole
schema and even across other subsystems.
This is done to ease code reuse in the following commit.
It'd also help should we ever want properly mount functions
class to schema object instead of static storage.
Currently, a pending replica that applies a write on a table that has
materialized views, will build all the view updates as a normal replica,
only to realize at a late point, in db::view::get_view_natural_endpoint(),
that it doesn't have a paired view replica to send the updates to. It will
then either drop the view updates, or send them to a pending view
replica, if such exists.
This work is unnecessary since it may be dropped, and even if there is a
pending view replica to send the updates to, the updates that are built
by the pending replica may be wrong since it may have incomplete
information.
This commit fixes the inefficiency by skipping the view update building
step when applying an update on a pending replica.
The metric total_view_updates_on_wrong_node is added to count the cases
that a view update is determined to be unnecessary.
The test reproduces the scenario of writing to a table and applying
the update on a pending replica, and verifies that the pending replica
doesn't try to build view updates.
Fixesscylladb/scylladb#19152Closesscylladb/scylladb#19488
The reader concurrency semaphore restricts the concurrency of reads that require CPU (intention: they read from the cache) to 1, meaning that if there is even a single active read which declares that it needs just CPU to proceed, no new read is admitted. This is meant to keep the concurrency of reads in the cache at 1. The idea is that concurrency in the cache is not useful: it just leads to the reactor rotating between these reads, all of the finishing later then they could if they were the only active read in the cache.
This was observed to backfire in the case where there reads from a single table are mostly very fast, but on some keys are very slow (hint: collection full of tombstones). In this case the slow read keeps up the fast reads in the queue, increasing the 99th percentile latencies significantly.
This series proposes to fix this, by making the CPU concurrency configurable. We don't like tunables like this and this is not a proper fix, but a workaround. The proper fix would be to allow to cut any page early, but we cannot cut a page in the middle of a row. We could maybe have a way of detecting slow reads and excluding them from the CPU concurrency. This would be a heuristic and it would be hard to get right. So in this series a robust and simple configurable is offered, which can be used on those few clusters which do suffer from the too strict concurrency limit. We have seen it in very few cases so far, so this doesn't seem to be wide-spread.
Fixes: https://github.com/scylladb/scylladb/issues/19017
This fixes a regression introduced in 5.0, so we have to backport to all currently supported releases
Closesscylladb/scylladb#19018
* github.com:scylladb/scylladb:
test/boost/reader_concurrency_semaphore_test: add test for live-configurable cpu concurrenc Please enter the commit message for your changes. Lines starting
test/boost/reader_concurrency_semaphore_test: hoist require_can_admit
reader_concurrency_semaphore: wire in the configurable cpu concurrency
reader_concurrency_semaphore: add cpu_concurrency constructor parameter
db/config: introduce reader_concurrency_semahore_cpu_concurrency
C++23 made std::unique_ptr constexpr. A side effect of this (presumably)
is that the compiler compiles it more eagerly, requiring the full definition
of the class in std::make_unique, while it previously was content with
finding the definition later.
One victim of this change is compaction_group::table_state; define
it earlier to build with C++23.
Define compaction_group::~compaction_group() out-of-line to prevent
problems instantiating compaction_group::_table_state, which is an
std::unique_ptr. In C++23, std::unique_ptr is constexpr, which means
its methods (in this case the destructor) require seeing the definition
of the class at the point of instantiation.
Now enable_tombstone_gc_for_streaming_and_repair is wired in all the way
to maybe_compact_for_streaming(), so we can implement the toggling of
tombstone GC based on it.
flat_mutation_reader_v2 was introduced in a pair of commits in 2021:
e3309322c3 "Clone flat_mutation_reader related classes into v2 variants"
08b5773c12 "Adapt flat_mutation_reader_v2 to the new version of the API"
as a replacement for flat_mutation_reader, using range_tombstone_change
instead of range_tombstone to represent represent range tombstones. See
those commits for more information.
The transition was incremental; the last use of the original
flat_mutation_reader was removed in 2022 in commit
026f8cc1e7 "db: Use mutation_partition_v2 in mvcc"
In turn, flat_mutation_reader was introduced in 2017 in commit
748205ca75 "Introduce flat_mutation_reader"
To transition from a mutation_reader that nested rows within
a partition in a separate stream, to a flat reader that streamed
partitions and rows in the same stream.
Here, we reclaim the original name and rename the awkward
flat_mutation_reader_v2 to mutation_reader.
Note that mutation_fragment_v2 remains since we still use the original
for compatibilty, sometimes.
Some notes about the transition:
- files were also renamed. In one case (flat_mutation_reader_test.cc), the
rename target already existed, so we rename to
mutation_reader_another_test.cc.
- a namespace 'mutation_reader' with two definitions existed (in
mutation_reader_fwd.hh). Its contents was folded into the mutation_reader
class. As a result, a few #includes had to be adjusted.
Closesscylladb/scylladb#19356
Normally, the space overhead for TWCS is 1/N, where is number of windows. But during off-strategy, the overhead is 100% because input sstables cannot be released earlier.
Reshaping a TWCS table that takes ~50% of available space can result in system running out of space.
That's fixed by restricting every TWCS off-strategy job to 10% of free space in disk. Tables that aren't big will not be penalized with increased write amplification, as all input (disjoint) sstables can still be compacted in a single round.
Fixes#16514.
Closesscylladb/scylladb#18137
* github.com:scylladb/scylladb:
compaction: Reduce twcs off-strategy space overhead to 10% of free space
compaction: wire storage free space into reshape procedure
sstables: Allow to get free space from underlying storage
replica: don't expose compaction_group to reshape task
Currently, when calculating the view update backlog for gossip,
we start with `db::view::update_backlog()` and compare it to backlogs
from all shards. However, this backlog can't be compared to other
backlogs - it has size 0 and we compare the fraction current/size
when comparing backlogs, causing us to compare with `NaN`.
This patch fixes it by starting the comparisons with an empty backlog.
The patch introducing this issue (f70f774e40) wasn't backported, so this one doesn't need to be either
Closesscylladb/scylladb#19247
* github.com:scylladb/scylladb:
mv: make the view update backlog unmofidiable
mv: fix value of the gossiped view update backlog
Currently, a view update backlog may reach an invalid state, when
its max is 0 and its relative_size() is NaN as a result. This can
be achieved either by constructing the backlog with a 0 max or by
modifying the max of an existing backlog. In particular, this
happens when creating the backlog using the default constructor.
In this patch the the default constructor is deleted and a check
is added to make sure that the max is different than 0 is added
to its constructor - if the check fails, we construct an empty
backlog instead, to handle the possibility of getting an invalid
backlog sent from a node with a version that's missing this check.
Additionally, we make the backlogs members private, exposing them
only through const getters.
Making the count resources on the maintenance (streaming) semaphore live update via config. This will allow us to improve repair speed on mixed-shard clusters, where we suspect that reader trashing -- due to the combination of high number of readers on each shard and very conservative reader count limit (10) -- is the main cause of the slowness.
Making this count limit confgurable allows us to start experimenting with this fix, without committing to a count limit increase (or removal), addressing the pain in the field.
Refs: #18269
No OSS backport needed.
Closesscylladb/scylladb#19248
* github.com:scylladb/scylladb:
replica/database: wire in maintenance_reader_concurrency_semaphore_count_limit
db/config: introduce maintenance_reader_concurrency_semaphore_count_limit
reader_concurrency_semaphore: make count parameter live-update
backport not needed, these are just cleanups.
Closesscylladb/scylladb#19260
* github.com:scylladb/scylladb:
replica: simplify perform_cleanup_compaction()
replica: return storage_group by reference on storage_group_for*()
replica: devirtualize storage_group_of()
with tablets, we're expected to have a worst of ~100 tablets in a given
table and shard, so let's avoid linear search when looking for the
memtable_list in a range scan. we're bounded by ~100 elements, so
shouldn't be a big problem, but it's an inefficiency we can easily
get rid of.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#19286
compaction_group sits in replica layer and compaction layer is
supposed to talk to it through compaction::table_state only.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Making the count resources on the maintenance (streaming) semaphore live
update via config. This will allow us to improve repair speed on
mixed-shard clusters, where we suspect that reader trashing -- due to
the combination of high number of readers on each shard and very
conservative reader count limit (10) -- is the main cause of the
slowness.
Making this count limit confgurable allows us to start experimenting
with this fix, without committing to a count limit increase (or
removal), addressing the pain in the field.
those functions cannot return nullptr, will throw when group is not
found, so better return ref instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This config item is propagated to the table object via table::config. Although the field in `table::config`, used to propagate the value, was `utils::updateable_value<T>`, it was assigned a constant and so the live-update chain was broken.
This series fixes this and adds a test which fails before the patch and passes after. The test needed new test infrastructure, around the failure injection api, namely the ability to exfiltrate the value of internal variable. This infrastructure is also added in this series.
Fixes: https://github.com/scylladb/scylladb/issues/18674
- [x] This patch has to be backported because it fixes broken functionality
Closesscylladb/scylladb#18705
* github.com:scylladb/scylladb:
test/topology_custom: add test for enable_compacting_data_for_streaming_and_repair live-update
test/pylib: rest_client: add get_injection()
api/error_injection: add getter for error_injection
utils/error_injection: add set_parameter()
replica/database: fix live-update enable_compacting_data_for_streaming_and_repair
This config item is propagated to the table object via table::config.
Although the field in table::config, used to propagate the value, was
utils::updateable_value<T>, it was assigned a constant and so the
live-update chain was broken.
This patch fixes this.
Consider the following:
1) table A has N tablets and views
2) migration starts for a tablet of A from node 1 to 2.
3) migration is at write_both_read_old stage
4) coordinator will push writes to both nodes (pending and leaving)
5) A has view, so writes to it will also result in reads (table::push_view_replica_updates())
6) tablet's update_effective_replication_map() is not refreshing tablet sstable set (for new tablet migrating in)
7) so read on step 5 is not being able to find sstable set for tablet migrating in
Causes the following error:
"tablets - SSTable set wasn't found for tablet 21 of table mview.users"
which means loss of write on pending replica.
The fix will refresh the table's sstable set (tablet_sstable_set) and cache's snapshot.
It's not a problem to refresh the cache snapshot as long as the logical
state of the data hasn't changed, which is true when allocating new
tablet replicas. That's also done in the context of compactions for example.
Fixes#19052.
Fixes#19033.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#19099
When a function is created with the CREATE FUNCTION statement, the
statement handler does all the necessary preparations on its own. The
very same code exists in schema_tables, when the function is loaded on
boot. This patch generalizes both and keeps function language-specific
context creation inside lang/ code.
The creation function returns context via argument reference. It would
have been nicer if it was returned via future<>, but it's not suitable
for future<T> type :(
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And, while at it, rename local variable to refer to it to as "manager"
not "wasm". Query processor and database also have getters named
"wasm()", these are not renamed yet to keep patch smaller (and those
getters are going to be reworked further anyway).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's unused, populator uses it to print debugging messages, but it can
as well use table->dir() for it, just as sstable_directory does. One
message looks useless and is removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19113
This vector of paths is only used to generate the same vector of paths
for table config, but the latter already has all the needed info.
It's the part of the plan to stop using paths/directories in keyspaces
and tables, because with storage-options tables no longer keep their
data in "files on disk", so this information goes to sstables storage
manager (refs #12707)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19119
tablet snapshot, used by migration, can race with compaction and
can find files deleted. That won't cause data loss because the
error is propagated back into the coordinator that decides to
retry streaming stage. So the consequence is delayed migration,
which might in turn reduce node operation throughput (e.g.
when decommissioning a node). It should be rare though, so
shouldn't have drastic consequences.
Fixes#18977.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#18979
Currently the loader is called via API, which inherits the maintenance scheduling group from API http server. The loader then can either do load_and_stream() or call (legacy) distributed_loader::upload_new_sstables(). The latter first switches into streaming scheduling group, but the former doesn't and continues running in the maintenance one.
All this is not really a problem, because streaming sched group and maintenance sched group is one group under two different variable names. However, it's messy and worth delegating the sched group switch (even if it's a no-op) to the sstables-loader. As a nice side effect, this patch removes one place that uses database as proxy object to get configuration parameters.
Closesscylladb/scylladb#18928
* github.com:scylladb/scylladb:
sstables-loader: Run loading in its scheduling group
sstables-loader: Add scheduling group to constructor
storage_proxy is responsible for intersecting the range of the read
with tablets, and calling replica with a single tablet range, therefore
it makes sense to avoid touching memtables of tablets that don't
intersect with a particular range.
Note this is a performance issue, not correctness one, as memtable
readers that don't intersect with current range won't produce any
data, but cpu is wasted until that's realized (they're added to list
of readers in mutation_reader_merger, more allocations, more data
sources to peek into, etc).
That's also important for streaming e.g. after decommission, that
will consume one tablet at a time through a reader, so we don't want
memtables of streamed tablets (that weren't cleaned up yet) to
be consumed.
Refs #18904.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#18907
Now the loading code has two different paths, and only one of them
switches sched group. It's cleaner and more natural to switch the sched
group in the loader itself, so that all code paths run in it and don't
care switching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Retrieval of tablet stats must be serialized with mutation to token metadata, as the former requires tablet id stability.
If tablet split is finalized while retrieving stats, the saved erm, used by all shards, can have a lower tablet count than the one in a particular shard, causing an abort as tablet map requires that any id feeded into it is lower than its current tablet count.
Fixes#18085.
Closesscylladb/scylladb#18287
* github.com:scylladb/scylladb:
test: Fix flakiness in topology_experimental_raft/test_tablets
service: Use tablet read selector to determine which replica to account table stats
storage_service: Fix race between tablet split and stats retrieval
The system-distributed-keyspace and view-update-generator often go in pair, because streaming, repair and sstables-loader (via distributed-loader) need them booth to check if sstable is staging and register it if it's such. The check is performed by messing directly with system_distributed.view_build_status table, and the registration happens via view-update-generator.
That's not nice, other services shouldn't know that view status is kept in system table. Also view-update-generator is a service to generae and push view updates, the fact that it keeps staging sstables list is the implementation detail.
This PR replaces dependencies on the mentioned pair of services with the single dependency on view-builder (repair, sstables-loader and stream-manager are enlightened) and hides the view building-vs-staging details inside the view_builder.
Along the way, some simplification of repair_writer_impl class is done.
Closesscylladb/scylladb#18706
* github.com:scylladb/scylladb:
stream_manager: Remove system_distributed_keyspace and view_update_generator
repair: Remove system_distributed_keyspace and view_update_generator
streaming: Remove system_distributed_keyspace and view_update_generator
sstables_loader: Remove system_distributed_keyspace and view_update_generator
distributed_loader: Remove system_distributed_keyspace and view_update_generator
view: Make register_staging_sstable() a method of view_builder
view: Make check_view_build_ongoing() helper a method of view_builder
streaming: Proparage view_builder& down to make_streaming_consumer()
repair: Keep view_builder& on repair_writer_impl
distributed_loader: Propagate view_builder& via process_upload_dir()
stream_manager: Add view builder dependency
repair_service: Add view builder dependency
sstables_loader: Add view_bulder dependency
main: Start sstables loader later
repair: Remove unwanted local references from repair_meta
Database used to be (and still is in many ways) an object used to get configuration from. Part of the configuration is the set of pre-configured scheduling groups. That's not nice, services should use each other for some real need, not as proxies to configuration. This patch patches the places that explicitly switch to statement group _not_ to use database to get the group itself.
fixes: #17643Closesscylladb/scylladb#18799
* github.com:scylladb/scylladb:
database: Don't export statement scheduling group
test: Use async attrs and cql-test-env scheduling groups
test: Use get_scheduling_groups() to get scheduling groups
api: Don't switch sched group to start/stop protocol servers
main: Don't switch sched group to start protocol servers
code: Switch to sched group in request_stop_server()
code: Switch to server sched group in start()
protocol_server: Keep scheduling group on board
code: Add scheduling group to controllers
redis: Coroutinize start() method
Separate keyspace which also behaves as system brings
little benefit while creating some compatibility problems
like schema digest mismatch during rollback. So we decided
to move auth tables into system keyspace.
Fixes https://github.com/scylladb/scylladb/issues/18098Closesscylladb/scylladb#18769