All compaction task executors, except for regular compaction one,
become task manager compaction tasks.
Creating and starting of major_compaction_task_executor is modified
to be consistent with other compaction task executors.
Closes#14505
* github.com:scylladb/scylladb:
test: extend test_compaction_task.py to cover compaction group tasks
compaction: turn custom_task_executor into compaction_task_impl
compaction: turn sstables_task_executor into sstables_compaction_task_impl
compaction: change sstables compaction tasks type
compaction: move table_upgrade_sstables_compaction_task_impl
compaction: pass task_info through sstables compaction
compaction: turn offstrategy_compaction_task_executor into offstrategy_compaction_task_impl
compaction: turn cleanup_compaction_task_executor into cleanup_compaction_task_impl
comapction: use optional task info in major compaction
compaction: use perform_compaction in compaction_manager::perform_major_compaction
Per-table metrics are very valuable for the users, it does come with a
high load on both the reporting and the collecting metrics systems.
This patch adds a small subset of per-metrics table that will be
reported on the node level.
The list of metrics is:
system_column_family_memtable_switch - Number of times flush has
resulted in the memtable being switched out
system_column_family_memtable_partition_writes - Number of write
operations performed on partitions in memtables
system_column_family_memtable_partition_hits - Number of times a write
operation was issued on an existing partition in memtables
system_column_family_memtable_row_writes - Number of row writes
performed in memtables
system_column_family_memtable_row_hits - Number of rows overwritten by
write operations in memtables
system_column_family_total_disk_space - Total disk space used
system_column_family_live_sstable - Live sstable count
system_column_family_read_latency_count - Number of reads
system_column_family_write_latency_count - Number of writes
The names of the read/write metrics is based on the histogram convention,
so when latencies histograms will be added, the names will not change.
The metrics are label with a specific label __per_table="node" so it
will be possible to easily manipulate it.
The metrics will be available when enable_metrics_reporting (the
per-table full metrics flag) is off and enable_node_table_metrics is
true.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
cleanup_compaction_task_executor inherits both from compaction_task_executor
and cleanup_compaction_task_impl.
Add a new version of compaction_manager::perform_task_on_all_files
which accepts only the tasks that are derived from compaction_task_impl.
After all task executors' conversions are done, the new version replaces
the original one.
To make it consistent with the upcoming methods, methods triggering
major compaction get std::optional<tasks::task_info> as an argument.
Thanks to that we can distinguish between a task that has no parent
and the task which won't be registered in task manager.
Now that all users have opted in unconditionally, there is no point in
keeping this optional. Make it mandatory to make sure there are no
opt-out by mistake.
The global override via enable_compacting_data_for_streaming_and_repair
config item still remains, allowing compaction to be force turned-off.
Opt-in is possible by passing an engaged `compaction_time`
(gc_clock::time_point) to the method. When this new parameter is
disengaged, no compaction happens.
Note that there is a global override, via the
enable_compacting_data_for_streaming_and_repair config item, which can
force-disable this compaction.
Compaction done on the output of the streaming reader does *not*
garbage-collect tombstones!
All call-sites are adjusted (the new parameter is not defaulted), but
none opt in yet. This will be done in separate commit per user.
The new test detected a stack-use-after-return when using table's
as_mutation_source_excluding_staging() for range reads.
This doesn't really affect view updates that generate single
key reads only. So the problem was only stressed in the recently
added test. Otherwise, we'd have seen it when running dtests
(in debug mode) that stress the view update path from staging.
The problem happens because the closure was feeded into
a noncopyable_function that was taken by reference. For range
reads, we defer before subsequent usage of the predicate.
For single key reads, we only defer after finished using
the predicate.
Fix is about using sstable_predicate type, so there won't
be a need to construct a temporary object on stack.
Fixes#14812.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14813
Allowing reading from each individual memtable which contains the given
token, without exposing the memtables themselves to the caller. Exposing
the memtables directly to any code outside of table is undesired because
they are mutable objects.
`db/query_context.hh` contains the declaration of class
`db::query_context`. but `replica/table.cc` does not use or need
`db::query_context`.
so, in this change, the `#include` is removed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14717
instead using the operator=(T&&) to assign an instance of `T` to a
shared_ptr, assign a new instance of shared_ptr to it.
unlike std::shared_ptr, seastar::shared_ptr allows us to move a value
into the existing value pointed by shared_ptr with operator=(). the
corresponding change in seastar is
319ae0b530.
but this is a little bit confusing, as the behavior of a shared_ptr
should look like a pointer instead the value pointed by it. and this
could be error-prune, because user could use something like
```c++
p = std::string();
```
by accident, and expect that the value pointed by `p` is cleared.
and all copies of this shared_ptr are updated accordingly. what
he/she really wants is:
```c++
*p = std::string();
```
and the code compiles, while the outcome of the statement is that
the pointee of `p` is destructed, and `p` now points to a new
instance of string with a new address. the copies of this
instance of shared_ptr still hold the old value.
this behavior is not expected. so before deprecating and removing
this operator. let's stop using it.
in this change, we update two caller sites of the
`lw_shared_ptr::operator=(T&&)`. instead of creating a new instance
pointee of the pointer in-place, a new instance of lw_shared_ptr is
created, and is assigned to the existing shared_ptr.
Closes#14470
* github.com:scylladb/scylladb:
sstables: use try_emplace() when appropriate
replica,sstable: do not assign a value to a shared_ptr
instead using the operator=(T&&) to assign an instance of `T` to a
shared_ptr, assign a new instance of shared_ptr to it.
unlike std::shared_ptr, seastar::shared_ptr allows us to move a value
into the existing value pointed by shared_ptr with operator=(). the
corresponding change in seastar is
319ae0b530.
but this is a little bit confusing, as the behavior of a shared_ptr
should look like a pointer instead the value pointed by it. and this
could be error-prune, because user could use something like
```c++
p = std::string();
```
by accident, and expect that the value pointed by `p` is cleared.
and all copies of this shared_ptr are updated accordingly. what
he/she really wants is:
```c++
*p = std::string();
```
and the code compiles, while the outcome of the statement is that
the pointee of `p` is destructed, and `p` now points to a new
instance of string with a new address. the copies of this
instance of shared_ptr still hold the old value.
this behavior is not expected. so before deprecating and removing
this operator. let's stop using it.
in this change, we update two caller sites of the
`lw_shared_ptr::operator=(T&&)`. instead of creating a new instance
pointee of the pointer in-place, a new instance of lw_shared_ptr is
created, and is assigned to the existing shared_ptr.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
View building from staging creates a reader from scratch (memtable
\+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.
perf shows that the reader creation is very expensive:
```
+ 12.15% 10.75% reactor-3 scylla [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+ 10.01% 9.99% reactor-3 scylla [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 8.95% 8.94% reactor-3 scylla [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+ 7.29% 7.28% reactor-3 scylla [.] dht::ring_position_tri_compare
+ 6.28% 6.27% reactor-3 scylla [.] dht::tri_compare
+ 4.11% 3.52% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 4.09% 4.07% reactor-3 scylla [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+ 3.46% 0.93% reactor-3 scylla [.] sstables::sstable_run::will_introduce_overlapping
+ 2.53% 2.53% reactor-3 libstdc++.so.6 [.] std::_Rb_tree_increment
+ 2.45% 2.45% reactor-3 scylla [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.14% 2.13% reactor-3 scylla [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.07% 2.07% reactor-3 scylla [.] logalloc::region_impl::free
+ 2.06% 1.91% reactor-3 scylla [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()https://github.com/scylladb/scylladb/issues/1}::operator()() const::{lambda()https://github.com/scylladb/scylladb/issues/1}::operator()
+ 2.04% 2.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 1.87% 0.00% reactor-3 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 1.86% 0.00% reactor-3 [kernel.kallsyms] [k] do_syscall_64
+ 1.39% 1.38% reactor-3 libc.so.6 [.] __memcmp_avx2_movbe
+ 1.37% 0.92% reactor-3 scylla [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+ 1.34% 1.33% reactor-3 scylla [.] logalloc::region_impl::alloc_small
+ 1.33% 1.33% reactor-3 scylla [.] seastar::memory::small_pool::add_more_objects
+ 1.30% 0.35% reactor-3 scylla [.] seastar::reactor::do_run
+ 1.29% 1.29% reactor-3 scylla [.] seastar::memory::allocate
+ 1.19% 0.05% reactor-3 libc.so.6 [.] syscall
+ 1.16% 1.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+ 1.07% 0.79% reactor-3 scylla [.] sstables::partitioned_sstable_set::insert
```
That shows some significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).
The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.
This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.
This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.
With this improvement, view building was measured to be 3x faster.
from
`INFO 2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s`
to
`INFO 2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s`
Refs https://github.com/scylladb/scylladb/issues/14089.
Fixes scylladb/scylladb#14244.
Closes#14364
* github.com:scylladb/scylladb:
table: Optimize creation of reader excluding staging for view building
view_update_generator: Dump throughput and duration for view update from staging
utils: Extract pretty printers into a header
View building from staging creates a reader from scratch (memtable
+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.
perf shows that the reader creation is very expensive:
+ 12.15% 10.75% reactor-3 scylla [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+ 10.01% 9.99% reactor-3 scylla [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 8.95% 8.94% reactor-3 scylla [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+ 7.29% 7.28% reactor-3 scylla [.] dht::ring_position_tri_compare
+ 6.28% 6.27% reactor-3 scylla [.] dht::tri_compare
+ 4.11% 3.52% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 4.09% 4.07% reactor-3 scylla [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+ 3.46% 0.93% reactor-3 scylla [.] sstables::sstable_run::will_introduce_overlapping
+ 2.53% 2.53% reactor-3 libstdc++.so.6 [.] std::_Rb_tree_increment
+ 2.45% 2.45% reactor-3 scylla [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.14% 2.13% reactor-3 scylla [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.07% 2.07% reactor-3 scylla [.] logalloc::region_impl::free
+ 2.06% 1.91% reactor-3 scylla [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()
+ 2.04% 2.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 1.87% 0.00% reactor-3 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 1.86% 0.00% reactor-3 [kernel.kallsyms] [k] do_syscall_64
+ 1.39% 1.38% reactor-3 libc.so.6 [.] __memcmp_avx2_movbe
+ 1.37% 0.92% reactor-3 scylla [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+ 1.34% 1.33% reactor-3 scylla [.] logalloc::region_impl::alloc_small
+ 1.33% 1.33% reactor-3 scylla [.] seastar::memory::small_pool::add_more_objects
+ 1.30% 0.35% reactor-3 scylla [.] seastar::reactor::do_run
+ 1.29% 1.29% reactor-3 scylla [.] seastar::memory::allocate
+ 1.19% 0.05% reactor-3 libc.so.6 [.] syscall
+ 1.16% 1.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+ 1.07% 0.79% reactor-3 scylla [.] sstables::partitioned_sstable_set::insert
That shows some significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).
The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.
This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.
This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.
With this improvement, view building was measured to be 3x faster.
from
INFO 2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s
to
INFO 2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s
Refs #14089.
Fixes#14244.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
when read from cache compact and expire row tombstones
remove expired empty rows from cache
do not expire range tombstones in this patch
Refs #2252, #6033Closes#12917
Task manager task covering compaction group major
compaction.
Uses multiple inheritance on already existing
major_compaction_task_executor to keep track of
the operation with task manager.
Closes#14271
* github.com:scylladb/scylladb:
test: extend test_compaction_task.py
test: use named variable for task tree depth
compaction: turn major_compaction_task_executor into major_compaction_task_impl
compaction: take gate holder out of task executor
compaction: extend signature of some methods
tasks: keep shared_ptr to impl in task
compaction: rename compaction_task_executor methods
This reverts commit 562087beff.
The regressions introduced by the reverted change have been fixed.
So let's revert this revert to resurrect the
uuid_sstable_identifier_enabled support.
Fixes#10459
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
We need to keep sharding metadata consistent with tablet mapping to
shards in order for node restart to detect that those sstables belong
to a single shard and that resharding is not necessary. Resharding of
sstables based on tablet metadata is not implemented yet and will
abort after this series.
Keeping sharding metadata accurate for tablets is only necessary until
compaction group integration is finished. After that, we can use the
sstable token range to determine the owning tablet and thus the owning
shard. Before that, we can't, because a single sstable may contain
keys from different tablets, and the whole key range may overlap with
keys which belong to other shards.
Extend a signature of table::compact_all_sstables and
compaction_manager::perform_major_compaction so that they get
the info of a covering task.
This allows to easily create child tasks that cover compaction group
compaction.
This reverts commit d1dc579062, reversing
changes made to 3a73048bc9.
Said commit caused regressions in dtests. We need to investigate and fix
those, but in the meanwhile let's revert this to reduce the disruption
to our workflows.
Refs: #14283
the invalid sstable id is the NULL of a sstable identifier. with
this concept, it would be a lot simpler to find/track the greatest
generation. the complexity is hidden in the generation_type, which
compares the a) integer-based identifiers b) uuid-based identifiers
c) invalid identitifer in different ways.
so, in this change
* the default constructor generation_type is
now public.
* we don't check for empty generation anymore when loading
SSTables or enumerating them.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we assume that generation is always integer based.
in order to enable the UUID-based generation identifier if the related
option is set, we should populate this option down to generation generator.
because we don't have access to the cluster features in some places where
a new generation is created, a new accessor exposing feature_service from
sstable manager is added.
Fixes#10459
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
sel is a local variable, and it is not shared with anybody else.
so make it a unique_ptr<> for better readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14189
The call is generic enough not to drop the sstable itself on return so
that callers can do whatever they need with it. The only today's caller
is API which will convert sstables to filenames on its own
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).
So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command
The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields
Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)
Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile
The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13963
Problem can be reproduced easily:
1) wrote some sstables with smp 1
2) shut down scylla
3) moved sstables to upload
4) restarted scylla with smp 2
5) ran refresh (resharding happens, adds sstable to cleanup
set and never removes it)
6) cleanup (tries to cleanup resharded sstables which were
leaked in the cleanup set)
Bumps into assert "Assertion `!sst->is_shared()' failed", as
cleanup picks a shared sstable that was leaked and already
processed by resharding.
Fix is about not inserting shared sstables into cleanup set,
as shared sstables are restricted to resharding and cannot
be processed later by cleanup (nor it should because
resharding itself cleaned up its input files).
Dtest: https://github.com/scylladb/scylla-dtest/pull/3206Fixes#14001.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14147
The metrics that are being deregistered (in this PR) cause Scylla to crash when a
table is dropped, but the corresponding table object in memory is not
yet deallocated, and a new table with the same name is created. This
caused a double-metrics-registration exception to be thrown. In order to
avoid it, we are deregistering table's metrics as soon as the table is
marked to be disposed from the database. Table's representation in memory can
still live, but shouldn't forbid other table with the same name to be
created.
Fixes#13548Closes#13971
cleanup_compaction should resolve only after all
sstables that require cleanup are cleaned up.
Since it is possible that some of them are in staging
and therefore cannot be cleaned up, retry once a second
until they become eligible.
Timeout if there is no progress within 5 minutes
to prevent hanging due to view building bug.
Fixes#9559Closes#13812
* github.com:scylladb/scylladb:
table: signal compaction_manager when staging sstables become eligible for cleanup
compaction_manager: perform_cleanup: wait until all candidates are cleaned up
compaction_manager: perform_cleanup: perform_offstrategy if needed
compaction_manager: perform_cleanup: update_sstables_cleanup_state in advance
sstable_set: add for_each_sstable_gently* helpers
Right now all users of global_table know it's a vector and reference its
elements with this_shard_id() index. Making the global_table_ptr a class
makes it possible to stop using operator[] and "index" this_shard_id()
in its -> and * operators.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
perform_cleanup may be waiting for those sstables
to become eligible for cleanup so signal it
when table::move_sstables_from_staging detects an
sstable that requires cleanup.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There are two layers of stables deletion -- delete-atomically and wipe. The former is in fact the "API" method, it's called by table code when the specific sstable(s) are no longer needed. It's called "atomically" because it's expected to fail in the middle in a safe manner so that subsequent boot would pick the dangling parts and proceed. The latter is a low-level removal function that can fail in the middle, but it's not of _its_ care.
Currently the atomic deletion is implemented with the help of sstable_directory::delete_atomically() method that commits sstables files names into deletion log, then calls wipe (indirectly), then drops the deletion log. On boot all found deletion logs are replayed. The described functionality is used regardless of the sstable storage type, even for S3, though deletion log is an overkill for S3, it's better be implemented with the help of ownership table. In fact, S3 storage already implements atomic deletion in its wipe method thus being overly careful.
So this PR
- makes atomic deletion be storage-specific
- makes S3 wipe non-atomic
fixes: #13016
note: Replaying sstables deletion from ownership table on boot is not here, see #13024Closes#13562
* github.com:scylladb/scylladb:
sstables: Implement atomic deleter for s3 storage
sstables: Get atomic deleter from underlying storage
sstables: Move delete_atomically to manager and rename
This is to let manager decide which storage driver to call for atomic
sstables deletion in the next patch. While at it -- rename the
sstable_directory's method into something more descriptive (to make
compiler catch all callers of it).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If tombstone GC was disabled, compaction will ensure that fully expired
sstables won't be bypassed and that no expired tombstones will be
purged. Changing the value takes immediate effect even on ongoing
compactions.
Not wired into an API yet.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
this change extracts the storage class and its derived classes
out into their own source files. for couple reasons:
- for better readability. the sstables.hh is over 1005 lines.
and sstables.cc 3602 lines. it's a little bit difficult to figure
out how the different parts in these sources interact with each
other. for instance, with this change, it's clear some of helper
functions are only used by file_system_storage.
- probably less inter-source dependency. by extracting the sources
files out, they can be compiled individually, so changing one .cc
file does not impact others. this could speed up the compilation
time.
Closes#13785
* github.com:scylladb/scylladb:
sstables: storage: coroutinize idempotent_link_file()
sstables: extract storage out
this change extracts the storage class and its derived classes
out into storage.cc and storage.hh. for couple reasons:
- for better readability. the sstables.hh is over 1005 lines.
and sstables.cc 3602 lines. it's a little bit difficult to figure
out how the different parts in these sources interact with each
other. for instance, with this change, it's clear some of helper
functions are only used by file_system_storage.
- probably less inter-source dependency. by extracting the sources
files out, they can be compiled individually, so changing one .cc
file does not impact others. this could speed up the compilation
time.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* replace generation_type::value() with generation_type::as_int()
* drop generation_value()
because we will switch over to UUID based generation identifier, the member
function or the free function generation_value() cannot fulfill the needs
anymore. so, in this change, they are consolidated and are replaced by
"as_int()", whose name is more specific, and will also work and won't be
misleading even after switching to UUID based generation identifier. as
`value()` would be confusing by then: it could be an integer or a UUID.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
In many cases we trigger offstrategy compaction opportunistically
also when there's nothing to do. In this case we still print
to the log lots of info-level message and call
`run_offstrategy_compaction` that wastes more cpu cycles
on learning that it has nothing to do.
This change bails out early if the maintenance set is empty
and prints a "Skipping off-strategy compaction" message in debug
level instead.
Fixes#13466
Also, add an group_id class and return it from compaction_group and table_state.
Use that to identify the compaction_group / table_state by "ks_name.cf_name compaction_group=idx/total" in log messages.
Fixes#13467Closes#13520
* github.com:scylladb/scylladb:
compaction_manager: print compaction_group id
compaction_group, table_state: add group_id member
compaction_manager: offstrategy compaction: skip compaction if no candidates are found
This series fixes a few issues caused by f1bbf705f9
(f1bbf705f9):
- table, compaction_manager: prevent cross shard access to owned_ranges_ptr
- Fixes#13631
- distributed_loader: distribute_reshard_jobs: pick one of the sstable shard owners
- compaction: make_partition_filter: do not assert shard ownership
- allow the filtering reader now used during resharding to process tokens owned by other shards
Closes#13635
* github.com:scylladb/scylladb:
compaction: make_partition_filter: do not assert shard ownership
distributed_loader: distribute_reshard_jobs: pick one of the sstable shard owners
table, compaction_manager: prevent cross shard access to owned_ranges_ptr
This PR introduces an experimental feature called "tablets". Tablets are
a way to distribute data in the cluster, which is an alternative to the
current vnode-based replication. Vnode-based replication strategy tries
to evenly distribute the global token space shared by all tables among
nodes and shards. With tablets, the aim is to start from a different
side. Divide resources of replica-shard into tablets, with a goal of
having a fixed target tablet size, and then assign those tablets to
serve fragments of tables (also called tablets). This will allow us to
balance the load in a more flexible manner, by moving individual tablets
around. Also, unlike with vnode ranges, tablet replicas live on a
particular shard on a given node, which will allow us to bind raft
groups to tablets. Those goals are not yet achieved with this PR, but it
lays the ground for this.
Things achieved in this PR:
- You can start a cluster and create a keyspace whose tables will use
tablet-based replication. This is done by setting `initial_tablets`
option:
```
CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy',
'replication_factor': 3,
'initial_tablets': 8};
```
All tables created in such a keyspace will be tablet-based.
Tablet-based replication is a trait, not a separate replication
strategy. Tablets don't change the spirit of replication strategy, it
just alters the way in which data ownership is managed. In theory, we
could use it for other strategies as well like
EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy
is augmented to support tablets.
- You can create and drop tablet-based tables (no DDL language changes)
- DML / DQL work with tablet-based tables
Replicas for tablet-based tables are chosen from tablet metadata
instead of token metadata
Things which are not yet implemented:
- handling of views, indexes, CDC created on tablet-based tables
- sharding is done using the old method, it ignores the shard allocated in tablet metadata
- node operations (topology changes, repair, rebuild) are not handling tablet-based tables
- not integrated with compaction groups
- tablet allocator piggy-backs on tokens to choose replicas.
Eventually we want to allocate based on current load, not statically
Closes#13387
* github.com:scylladb/scylladb:
test: topology: Introduce test_tablets.py
raft: Introduce 'raft_server_force_snapshot' error injection
locator: network_topology_strategy: Support tablet replication
service: Introduce tablet_allocator
locator: Introduce tablet_aware_replication_strategy
locator: Extract maybe_remove_node_being_replaced()
dht: token_metadata: Introduce get_my_id()
migration_manager: Send tablet metadata as part of schema pull
storage_service: Load tablet metadata when reloading topology state
storage_service: Load tablet metadata on boot and from group0 changes
db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()
migration_notifier: Introduce before_drop_keyspace()
migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
test: perf: Introduce perf-tablets
test: Introduce tablets_test
test: lib: Do not override table id in create_table()
utils, tablets: Introduce external_memory_usage()
db: tablets: Add printers
db: tablets: Add persistence layer
dht: Use last_token_of_compaction_group() in split_token_range_msb()
locator: Introduce tablet_metadata
dht: Introduce first_token()
dht: Introduce next_token()
storage_proxy: Improve trace-level logging
locator: token_metadata: Fix confusing comment on ring_range()
dht, storage_proxy: Abstract token space splitting
Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
db: Introduce get_non_local_vnode_based_strategy_keyspaces()
service: storage_proxy: Avoid copying keyspace name in write handler
locator: Introduce per-table replication strategy
treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type
locator: Introduce effective_replication_map
locator: Rename effective_replication_map to vnode_effective_replication_map
locator: effective_replication_map: Abstract get_pending_endpoints()
db: Propagate feature_service to abstract_replication_strategy::validate_options()
db: config: Introduce experimental "TABLETS" feature
db: Log replication strategy for debugging purposes
db: Log full exception on error in do_parse_schema_tables()
db: keyspace: Remove non-const replication strategy getter
config: Reformat
Will be used by tablet-based replication strategies, for which
effective replication map is different per table.
Also, this patch adapts existing users of effective replication map to
use the per-table effective replication map.
For simplicity, every table has an effective replication map, even if
the erm is per keyspace. This way the client code can be uniform and
doesn't have to check whether replication strategy is per table.
Not all users of per-keyspace get_effective_replication_map() are
adapted yet to work per-table. Those algorithms will throw an
exception when invoked on a keyspace which uses per-table replication
strategy.