The CDC generation data can be large and not fit in a single command.
This pr splits it into multiple mutations by smartly picking a
`mutation_size_threshold` and sending each mutation as a separate group
0 command.
Commands are sent sequentially to avoid concurrency problems.
Topology snapshots contain only mutation of current CDC generation data
but don't contain any previous or future generations. If a new
generation of data is being broadcasted but hasn't been entirely applied
yet, the applied part won't be sent in a snapshot. New or delayed nodes
can never get the applied part in this scenario.
Send the entire cdc_generations_v3 table in the snapshot to resolve this
problem.
A mechanism to remove old CDC generations will be introduced as a
follow-up.
Closes#13962
* github.com:scylladb/scylladb:
test: raft topology: test `prepare_and_broadcast_cdc_generation_data`
service: raft topology: print warning in case of `raft::commit_status_unknown` exception in topology coordinator loop
raft topology: introduce `prepare_and_broadcast_cdc_generation_data`
raft: add release_guard
raft: group0_state_machine::merger take state_id as the maximal value from all merged commands
raft topology: include entire cdc_generations_v3 table in cdc_generation_mutations snapshot
raft topology: make `mutation_size_threshold` depends on `max_command_size`
raft: reduce max batch size of raft commands and raft entries
raft: add description argument to add_entry_unguarded
raft: introduce `write_mutations` command
raft: refactor `topology_change` applying
For now, `raft_sys_table_storage::_max_mutation_size` equals `max_mutation_size`
(half of the commitlog segment size), so with some additional information, it
can exceed this threshold resulting in throwing an exception when writing
mutation to the commitlog.
A batch of raft commands has the size at most `group0_state_machine::merger::max_command_size`
(half of the commitlog segment size). It doesn't have additional metadata, but
it may have a size of exactly `max_mutation_size`. It shouldn't make any trouble,
but it is prefered to be careful.
Make `raft_sys_table_storage::_max_mutation_size` and
`group0_state_machine::merger::max_command_size` more strict to leave space
for metadata.
Fixed typo "1204" => "1024".
Currently, it is hard for injected code to wait for some events, for example, requests on some REST endpoint.
This PR adds the `inject_with_handler` method that executes injected function and passes `injection_handler` as its argument.
The `injection_handler` class is used to wait for events inside the injected code.
The `error_injection` class can notify the injection's handler or handlers associated with the injection on all shards about the received message.
Closes#14357.
Closes#14460
* github.com:scylladb/scylladb:
tests: introduce InjectionHandler class for communicating with injected code
api/error_injection: add message_injection endpoint
tests: utils: error injections: add test for inject_with_handler
utils: error injection: add inject_with_handler for interactions with injected code
utils: error injection: create structure for error injections data
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to convert the fragment stream into `mutation`s. This is done by piping the stream to mutation_rebuilder_v2.
To keep memory usage limited, the stream for a single partition might have to be split into multiple partial `mutation` objects. view_update_consumer does that, but in improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error.
This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next `mutation` object).
The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic.
Fixes https://github.com/scylladb/scylladb/issues/14503Closes#14502
* github.com:scylladb/scylladb:
test: view_build_test: add range tombstones to test_view_update_generator_buffering
test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
view_updating_consumer: make buffer limit a variable
view: fix range tombstone handling on flushes in view_updating_consumer
This patch adds a full-range tombstone to the compacted mutation.
This raises the coverage of the test. In particular, it reproduces
issue #14503, which should have been caught by this test, but wasn't.
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of
mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to
convert the fragment stream into `mutation`s. This is done by piping
the stream to mutation_rebuilder_v2.
To keep memory usage limited, the stream for a single partition might
have to be split into multiple partial `mutation` objects.
view_update_consumer does that, but in improper way -- when the
split/flush happens inside an active range tombstone, the range
tombstone isn't closed properly. This is illegal, and triggers an
internal error.
This patch fixes the problem by closing the active range tombstone
(and reopening in the same position in the next `mutation` object).
The tombstone is closed just after the last seen clustered position.
This is not necessary for correctness -- for example we could delay
all processing of the range tombstone until we see its end
bound -- but it seems like the most natural semantic.
Fixes#14503
SELECT clause components (selectors) are currently evaluated during query execution
using a stateful class hierarchy. This state is needed to hold intermediate state while
aggregating over multiple rows. Because the selectors are stateful, we must re-create
them each query using a selector_factory hierarchy.
We'd like to convert all of this to the unified expression evaluation machinery, so we can
have just one grammar for expressions, and just one way to evaluate expressions, but
the statefulness makes this complex.
In commit 59ab9aac44 "(Merge 'functions: reframe aggregate functions in terms
of scalar functions' from Avi Kivity)", we made aggregate functions stateless, moving
their state to aggregate_function_selector::_accumulator, and therefore into the
class hierarchy we're addressing now. Another reason for keeping state is that selectors
that aren't aggregated capture the first value they see in a GROUP BY group.
Since expressions can't contain state directly, we break apart expressions that contain
aggregate functions into two: an inner expression that processes incoming rows within
a group, and an outer expression that generates the group's output. The two expressions
communicate via a newly introduced expression element: a temporary.
The problem of non-aggregated columns requiring state is solved by encapsulating
those columns in an internal aggregate function, called the "first" function.
In terms of performance, this series has little effect, since the common case of selectors
that only contain direct column references without transformations is evaluated via a fast
path (`simple_selection`). This fast-path is preserved with almost no changes.
While the series makes it possible to start to extend the grammar and unify expression
syntaxes, it does not do so. The grammar is unchanged. There is just one breaking change:
the `SELECT JSON` statement generates json object field names based on the input selectors.
In one case the name of the field has changed, but it is an esoteric case (where a function call
is selected as part of `SELECT JSON`), and the new behavior is compatible with Cassandra.
Closes#14467
* github.com:scylladb/scylladb:
cql3: selection: drop selector_factories, selectables, and selectors
cql3: select_statement: stop using selector_factories in SELECT JSON
cql3: selection: don't create selector_factories any more
cql3: selection: collect column_definitions using expressions
cql3: selection: reimplement selection::is_aggregate()
cql3: selection: evaluate aggregation queries via expr::evaluate()
cql3: selection, select_statement: fine tune add_column_for_post_processing() usage
cql3: selection: evaluate non-aggregating complex selections using expr::evaluate()
cql3: selection: store primary key in result_set_builder
cql3: expression: fix field_selection::type interpretation by evaluate()
cql3: selection: make result_set_builder::current non-optional<>
cql3: selection: simplify row/group processing
cql3: selection: convert requires_thread to expressions
cql: selection: convert used_functions() to expressions
cql3: selection: convert is_reducible/get_reductions to expressions
cql3: selection: convert is_count() to expressions
cql3: selection convert contains_ttl/contains_writetime to work on expressions
cql3: selection: make simple_selectors stateless
cql3: expression: add helper to split expressions with aggregate functions
cql3: selection: short-circuit non-aggregations
cql3: selection: drop validate_selectors
cql3: select_statement: force aggregation if GROUP BY is used
cql3: select_statement: levellize aggregation depth
cql3: selection: skip first_function when collecting metadata
cql3: select_statement: explicitly disable automatic parallelization with no aggregates
cql3: expression: introduce temporaries
cql3: select_statement: use prepared selectors
cql3: selection: avoid selector_factories in collect_metadata()
cql3: expressions: add "metadata mode" formatter for expressions
cql3: selection: convert collect_metadata() to the prepared expression domain
cql3: selection: convert processes_selection to work on prepared expressions
cql3: selection: prepare selectors earlier
cql3: raw_selector: deinline
cql3: expression: reimplement verify_no_aggregate_functions()
cql3: expression: add helpers to manage an expression's aggregation depth
cql3: expression: improve printing of prepared function calls
cql3: functions: add "first" aggregate function
SELECT JSON uses selector_factories to obtain the names of the
fields to insert into the json object, and we want to drop
selector_factories entirely. Switch instead to the ":metadata" mode
of printing expressions, which does what we want.
Unfortunately, the switch changes how system functions are converted
into field names. A function such as unixtimestampof() is now rendered
as "system.unixtimestampof()"; before it did not have the keyspace
prefix.
This is a compatiblity problem, albeit an obscure one. Since the new
behavior matches Cassandra, and the odds of hitting this are very low,
I think we can allow the change.
field_selection::type refers to the type of the selection operation,
not the type of the structure being selected. This is what
prepare_expression() generates and how all other expression elements
work, but evaluate() for field_selection thinks it's the type
of the structure, and so fails when it gets an expression
from prepare_expression().
Fix that, and adjust the tests.
We define the "aggregation depth" of an expression by how many
nested aggregation functions are applied. In CQL/SQL, legal
values are 0 and 1, but for generality we deal with any aggregation depth.
The first helper measures the maximum aggregation depth along any path
in the expression graph. If it's 2 or greater, we have something like
max(max(x)) and we should reject it (though these helpers don't). If
we get 1 it's a simple aggregation. If it's zero then we're not aggregating
(though CQL may decide to aggregate anyway if GROUP BY is used).
The second helper edits an expression to make sure the aggregation depth
along any path that reaches a column is the same. Logically,
`SELECT x, max(y)` does not make sense, as one is a vector of values
and the other is a scalar. CQL resolves the problem by defining x as
"the first value seen". We apply this resolution by converting the
query to `SELECT first(x), max(y)` (where `first()` is an internal
aggregate function), so both selectors refer to scalars that consume
vectors.
When a scalar is consumed by an aggregate function (for example,
`SELECT max(x), min(17)` we don't have to bother, since a scalar
is implicity promoted to a vector by evaluating it every row. There
is some ambiguity if the scalar is a non-pure function (e.g.
`SELECT max(x), min(random())`, but it's not worth following.
A small unit test is added.
Split long running test
test_memtable_with_many_versions_conforms_to_mutation_source to 2 tests
for _plain and _reverse.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#14447
In mutation_reader_merger and clustering_order_reader_merger, the
operator()() is responsible for producing mutation fragments that will
be merged and pushed to the combined reader's buffer. Sometimes, it
might have to advance existing readers, open new and / or close some
existing ones, which requires calling a helper method and then calling
operator()() recursively.
In some unlucky circumstances, a stack overflow can occur:
- Readers have to be opened incrementally,
- Most or all readers must not produce any fragments and need to report
end of stream without preemption,
- There has to be enough readers opened within the lifetime of the
combined reader (~500),
- All of the above needs to happen within a single task quota.
In order to prevent such a situation, the code of both reader merger
classes were modified not to perform recursion at all. Most of the code
of the operator()() was moved to maybe_produce_batch which does not
recur if it is not possible for it to produce a fragment, instead it
returns std::nullopt and operator()() calls this method in a loop via
seastar::repeat_until_value.
A regression test is added.
Fixes: scylladb/scylladb#14415
Closes#14452
Reduce test string value size, parallelize inserts, and use a prepared statement,
The debug running time for this tests is reduced from 13:18 to 7:52.
Refs #13905Closes#14380
* github.com:scylladb/scylladb:
test/boost/index_with_paging_test: parallel insert
test/boost/index_with_paging_test: prepared statement
test/boost/index_with_paging_test: reduce running time
Since most group0 commands are just mutations it is easy to combine them
before passing them to a subsystem they destined to since it is more
efficient. The logic that handles those mutations in a subsystem will
run once for each batch of commands instead of for each individual
command. This is especially useful when a node catches up to a leader and
gets a lot of commands together.
The patch here does exactly that. It combines commands into a single
command if possible, but it preserves an order between commands, so each
time it encounters a command to a different subsystem it flushes already
combined batch and starts a new one. This extra safety assumes that
there are dependencies between subsystems managed by group0, so the order
matters. It may be not the case now, but we prefer to be on a safe side.
Broadcast table commands are not mutations, so they are never combined.
* 'raft-merge-cmds' of https://github.com/gleb-cloudius/scylla:
test: add test for group0 raft command merging
service: raft: respect max mutation size limit when persisting raft entries
group0_state_machine: merge commands before applying them whenever possible
Parallelize inserts for long-running test_index_with_paging.
Run time in debug mode reduced by 1 minute 48 seconds.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Reduce test string value size for test_index_with_paging from 4096 to
100. With 100 bytes it should make the base row significantly larger
than the key so the test will exercise both types of paging in the
scanning code.
The debug running time for this tests is reduced from 9 minutes to 6
minutes.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The evictable reader must ensure that each buffer fill makes forward
progress, i.e. the last fragment in the buffer has a position larger
than the last fragment from the previous buffer-fill. Otherwise, the
reader could get stuck in an infinite loop between buffer fills, if the
reader is evicted in-between.
The code guranteeing this forward progress had a bug: the comparison
between the position after the last buffer-fill and the current
last fragment position was done in the wrong direction.
So if the condition that we wanted to achieve was already true, we would
continue filling the buffer until partition end which may lead to OOMs
such as in #13491.
There was already a fix in this area to handle `partition_start`
fragments correctly - #13563 - but it missed that the position
comparison was done in the wrong order.
Fix the comparison and adjust one of the tests (added in #13563) to
detect this case.
Fixes#13491
test_range_tombstones_v2 is too strict for this reader -- it expects a
particular sequence of `range_tombstone_change`s, but
multishard_combining_reader, when tested with a small buffer, may
generate -- as expected -- additional (redundant) range tombstone change
pairs (end+start).
Currently we don't observe these redundant fragments due to a bug in
`evictable_reader_v2` but they start appearing once we fix the bug and
the test must be prepared first.
To prepare the test, modify `flat_reader_assertions_v2` so it squashes
redundant range tombstone change pairs. This happens only in non-exact
mode.
Enable exact mode in `test_sstable_reversing_reader_random_schema` for
comparing two readers -- the squashing of `r_t_c`s may introduce an
artificial difference.
Add a test that submits 3 large commands each one a little bit larger
than 1/3 of maximum mutation size. Check that in the end 2 command were
executed (first 2 were merged and third was executed separately).
View building from staging creates a reader from scratch (memtable
\+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.
perf shows that the reader creation is very expensive:
```
+ 12.15% 10.75% reactor-3 scylla [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+ 10.01% 9.99% reactor-3 scylla [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 8.95% 8.94% reactor-3 scylla [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+ 7.29% 7.28% reactor-3 scylla [.] dht::ring_position_tri_compare
+ 6.28% 6.27% reactor-3 scylla [.] dht::tri_compare
+ 4.11% 3.52% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 4.09% 4.07% reactor-3 scylla [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+ 3.46% 0.93% reactor-3 scylla [.] sstables::sstable_run::will_introduce_overlapping
+ 2.53% 2.53% reactor-3 libstdc++.so.6 [.] std::_Rb_tree_increment
+ 2.45% 2.45% reactor-3 scylla [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.14% 2.13% reactor-3 scylla [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.07% 2.07% reactor-3 scylla [.] logalloc::region_impl::free
+ 2.06% 1.91% reactor-3 scylla [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()https://github.com/scylladb/scylladb/issues/1}::operator()() const::{lambda()https://github.com/scylladb/scylladb/issues/1}::operator()
+ 2.04% 2.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 1.87% 0.00% reactor-3 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 1.86% 0.00% reactor-3 [kernel.kallsyms] [k] do_syscall_64
+ 1.39% 1.38% reactor-3 libc.so.6 [.] __memcmp_avx2_movbe
+ 1.37% 0.92% reactor-3 scylla [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+ 1.34% 1.33% reactor-3 scylla [.] logalloc::region_impl::alloc_small
+ 1.33% 1.33% reactor-3 scylla [.] seastar::memory::small_pool::add_more_objects
+ 1.30% 0.35% reactor-3 scylla [.] seastar::reactor::do_run
+ 1.29% 1.29% reactor-3 scylla [.] seastar::memory::allocate
+ 1.19% 0.05% reactor-3 libc.so.6 [.] syscall
+ 1.16% 1.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+ 1.07% 0.79% reactor-3 scylla [.] sstables::partitioned_sstable_set::insert
```
That shows some significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).
The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.
This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.
This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.
With this improvement, view building was measured to be 3x faster.
from
`INFO 2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s`
to
`INFO 2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s`
Refs https://github.com/scylladb/scylladb/issues/14089.
Fixes scylladb/scylladb#14244.
Closes#14364
* github.com:scylladb/scylladb:
table: Optimize creation of reader excluding staging for view building
view_update_generator: Dump throughput and duration for view update from staging
utils: Extract pretty printers into a header
View building from staging creates a reader from scratch (memtable
+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.
perf shows that the reader creation is very expensive:
+ 12.15% 10.75% reactor-3 scylla [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+ 10.01% 9.99% reactor-3 scylla [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 8.95% 8.94% reactor-3 scylla [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+ 7.29% 7.28% reactor-3 scylla [.] dht::ring_position_tri_compare
+ 6.28% 6.27% reactor-3 scylla [.] dht::tri_compare
+ 4.11% 3.52% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 4.09% 4.07% reactor-3 scylla [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+ 3.46% 0.93% reactor-3 scylla [.] sstables::sstable_run::will_introduce_overlapping
+ 2.53% 2.53% reactor-3 libstdc++.so.6 [.] std::_Rb_tree_increment
+ 2.45% 2.45% reactor-3 scylla [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.14% 2.13% reactor-3 scylla [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.07% 2.07% reactor-3 scylla [.] logalloc::region_impl::free
+ 2.06% 1.91% reactor-3 scylla [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()
+ 2.04% 2.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 1.87% 0.00% reactor-3 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 1.86% 0.00% reactor-3 [kernel.kallsyms] [k] do_syscall_64
+ 1.39% 1.38% reactor-3 libc.so.6 [.] __memcmp_avx2_movbe
+ 1.37% 0.92% reactor-3 scylla [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+ 1.34% 1.33% reactor-3 scylla [.] logalloc::region_impl::alloc_small
+ 1.33% 1.33% reactor-3 scylla [.] seastar::memory::small_pool::add_more_objects
+ 1.30% 0.35% reactor-3 scylla [.] seastar::reactor::do_run
+ 1.29% 1.29% reactor-3 scylla [.] seastar::memory::allocate
+ 1.19% 0.05% reactor-3 libc.so.6 [.] syscall
+ 1.16% 1.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+ 1.07% 0.79% reactor-3 scylla [.] sstables::partitioned_sstable_set::insert
That shows some significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).
The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.
This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.
This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.
With this improvement, view building was measured to be 3x faster.
from
INFO 2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s
to
INFO 2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s
Refs #14089.
Fixes#14244.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
when read from cache compact and expire row tombstones
remove expired empty rows from cache
do not expire range tombstones in this patch
Refs #2252, #6033Closes#12917
Split long running test_aggregate_functions to one case per type.
This allows test.py to run them in parallel.
Before this it would take 18 minutes to run in debug mode. Afterwards
each case takes 30-45 seconds.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#14368
Split long running test test_schema_changes in 3 parts, one for each
writable_sstable_versions so it can be run in parallel by test.py.
Add static checks to alert if the array of types changed.
Original test takes around 24 minutes in debug mode, and each new split
test takes around 8 minutes.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#14367
Remove previous configuration blocking parallel run.
Test cases run fine in local debug.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#14369
Split long running tests
test_database_with_data_in_sstables_is_a_mutation_source_plain and
test_database_with_data_in_sstables_is_a_mutation_source_reverse.
They run with x_log2_compaction_groups of 0 and 1, each one taking from
10 to 15 minutes each in debug mode, for a total of 28 and 22 minutes.
Split the test cases to run with 0 and 1, so test.py can run them in
parallel.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#14356
Adding a function declaration to expression.hh causes many
recompilations. Reduce that by:
- moving some restrictions-related definitions to
the existing expr/restrictions.hh
- moving evaluation related names to a new header
expr/evaluate.hh
- move utilities to a new header
expr/expr-utilities.hh
expression.hh contains only expression definitions and the most
basic and common helpers, like printing.
This reverts commit 562087beff.
The regressions introduced by the reverted change have been fixed.
So let's revert this revert to resurrect the
uuid_sstable_identifier_enabled support.
Fixes#10459
This PR changes the system to respect shard assignment to tablets in tablet metadata (system.tablets):
1. The tablet allocator is changed to distribute tablets evenly across shards taking into account currently allocated tablets in the system. Each tablet has equal weight. vnode load is ignored.
2. CDC subsystem was not adjusted (not supported yet)
3. sstable sharding metadata reflects tablet boundaries
5. resharding is NOT supported yet (the node will abort on boot if there is a need to reshard tablet-based tables)
6. The system is NOT prepared to handle tablet migration / topology changes in a safe way.
7. Sstable cleanup is not wired properly yet
After this PR, dht::shard_of() and schema::get_sharder() are deprecated. One should use table::shard_of() and effective_replication_map::get_sharder() instead.
To make the life easier, support was added to obtain table pointer from the schema pointer:
```
schema_ptr s;
s->table().shard_of(...)
```
Closes#13939
* github.com:scylladb/scylladb:
locator: network_topology_startegy: Allocate shards to tablets
locator: Store node shard count in topology
service: topology: Extract topology updating to a lambda
test: Move test_tablets under topology_experimental
sstables: Add trace-level logging related to shard calculation
schema: Catch incorrect uses of schema::get_sharder()
dht: Rename dht::shard_of() to dht::static_shard_of()
treewide: Replace dht::shard_of() uses with table::shard_of() / erm::shard_of()
storage_proxy: Avoid multishard reader for tablets
storage_proxy: Obtain shard from erm in the read path
db, storage_proxy: Drop mutation/frozen_mutation ::shard_of()
forward_service: Use table sharder
alternator: Use table sharder
db: multishard: Obtain sharder from erm
sstable_directory: Improve trace-level logging
db: table: Introduce shard_of() helper
db: Use table sharder in compaction
sstables: Compute sstable shards using sharder from erm when loading
sstables: Generate sharding metadata using sharder from erm when writing
test: partitioner: Test split_range_to_single_shard() on tablet-like sharder
dht: Make split_range_to_single_shard() prepared for tablet sharder
sstables: Move compute_shards_for_this_sstable() to load()
dht: Take sharder externally in splitting functions
locator: Make sharder accessible through effective_replication_map
dht: sharder: Document guarantees about mapping stability
tablets: Implement tablet sharder
tablets: Include pending replica in get_shard()
dht: sharder: Introduce next_shard()
db: token_ring_table: Filter out tablet-based keyspaces
db: schema: Attach table pointer to schema
schema_registry: Fix SIGSEGV in learn() when concurrent with get_or_load()
schema_registry: Make learn(schema_ptr) attach entry to the target schema
test: lib: cql_test_env: Expose feature_service
test: Extract throttle object to separate header
Fixes#11017
When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.
Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.
Closes#14294
Uses a simple algorihtm for allocating shards which chooses
least-loaded shard on a given node, encapsulated in load_sketch.
Takes load due to current tablet allocation into account.
Each tablet, new or allocated for other tables, is assumed to have an
equal load weight.
This is in order to prevent new incorrect uses of dht::shard_of() to
be accidentally added. Also, makes sure that all current uses are
caught by the compiler and require an explicit rename.
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
This is not strictly necessary, as the multishard reader will be later
avoided altogether for tablet-based tables, but it is a step towards
converting all code to use the erm->get_sharder() instead of
schema::get_sharder().
schema::get_sharder() does not use the correct sharder for
tablet-based tables. Code which is supposed to work with all kinds of
tables should obtain the sharder from erm::get_sharder().
The function currently assumes that shard assignment for subsequent
tokens is round robin, which will not be the case for tablets. This
can lead to incorrect split calculation or infinite loop.
Another assumption was that subsequent splits returned by the sharder
have distinct shards. This also doesn't hold for tablets, which may
return the same shard for subsequent tokens. This assumption was
embedded in the following line:
start_token = sharder.token_for_next_shard(end_token, shard);
If the range which starts with end_token is also owned by "shard",
token_for_next_shard() would skip over it.
We need those functions to work with tablet sharder, which is not
accessible through schema::get_sharder(). In order to propagate the
right sharder, those functions need to take it externally rather from
the schema object. The sharder will come from the
effective_replication_map attached to the table object.
Those splitting functions are used when generating sharding metadata
of an sstable. We need to keep this sharding metadata consistent with
tablet mapping to shards in order for node restart to detect that
those sstables belong to a single shard and that resharding is not
necessary. Resharding of sstables based on tablet metadata is not
implemented yet and will abort after this series.
Keeping sharding metadata accurate for tablets is only necessary until
compaction group integration is finished. After that, we can use the
sstable token range to determine the owning tablet and thus the owning
shard. Before that, we can't, because a single sstable may contain
keys from different tablets, and the whole key range may overlap with
keys which belong to other shards.
This will make it easier to access table proprties in places which
only have schema_ptr. This is in particular useful when replacing
dht::shard_of() uses with s->table().shard_of(), now that sharding is
no longer static, but table-specific.
Also, it allows us to install a guard which catches invalid uses of
schema::get_sharder() on tablet-based tables.
It will be helpful for other uses as well. For example, we can now get
rid of the static_props hack.