Commit Graph

5175 Commits

Author SHA1 Message Date
Petr Gusev
b69bc97673 repair_test: add test_reader_with_different_strategies 2023-07-05 13:02:17 +04:00
Nadav Har'El
ec77172b4b Merge 'cql3: convert the SELECT clause evaluation phase to expressions' from Avi Kivity
SELECT clause components (selectors) are currently evaluated during query execution
using a stateful class hierarchy. This state is needed to hold intermediate state while
aggregating over multiple rows. Because the selectors are stateful, we must re-create
them each query using a selector_factory hierarchy.

We'd like to convert all of this to the unified expression evaluation machinery, so we can
have just one grammar for expressions, and just one way to evaluate expressions, but
the statefulness makes this complex.

In commit 59ab9aac44 "(Merge 'functions: reframe aggregate functions in terms
of scalar functions' from Avi Kivity)", we made aggregate functions stateless, moving
their state to aggregate_function_selector::_accumulator, and therefore into the
class hierarchy we're addressing now. Another reason for keeping state is that selectors
that aren't aggregated capture the first value they see in a GROUP BY group.

Since expressions can't contain state directly, we break apart expressions that contain
aggregate functions into two: an inner expression that processes incoming rows within
a group, and an outer expression that generates the group's output. The two expressions
communicate via a newly introduced expression element: a temporary.

The problem of non-aggregated columns requiring state is solved by encapsulating
those columns in an internal aggregate function, called the "first" function.
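
The inner/outer split can be illustrated with a minimal Python sketch for a selector such as `max(x) + 1`. This is not ScyllaDB's implementation: `GroupState`, `inner`, `outer`, and `evaluate_group` are hypothetical names, and a plain list stands in for the temporaries.

```python
from dataclasses import dataclass, field


@dataclass
class GroupState:
    # One slot per temporary; None means "no row seen yet in this group".
    temporaries: list = field(default_factory=lambda: [None])


def inner(state, row):
    """Runs once per row: folds the row into temporary 0 (a running max of x)."""
    x = row["x"]
    current = state.temporaries[0]
    state.temporaries[0] = x if current is None else max(current, x)


def outer(state):
    """Runs once per group: computes max(x) + 1 from the temporary alone."""
    return state.temporaries[0] + 1


def evaluate_group(rows):
    """Feed every row of one group to the inner step, then emit the outer result."""
    state = GroupState()
    for row in rows:
        inner(state, row)
    return outer(state)


result = evaluate_group([{"x": 3}, {"x": 7}, {"x": 5}])
```

Neither function holds state of its own; everything that survives between rows lives in the temporary, which is the point of the split.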

In terms of performance, this series has little effect, since the common case of selectors
that only contain direct column references without transformations is evaluated via a fast
path (`simple_selection`). This fast-path is preserved with almost no changes.

While the series makes it possible to start to extend the grammar and unify expression
syntaxes, it does not do so. The grammar is unchanged. There is just one breaking change:
the `SELECT JSON` statement generates json object field names based on the input selectors.
In one case the name of the field has changed, but it is an esoteric case (where a function call
is selected as part of `SELECT JSON`), and the new behavior is compatible with Cassandra.

Closes #14467

* github.com:scylladb/scylladb:
  cql3: selection: drop selector_factories, selectables, and selectors
  cql3: select_statement: stop using selector_factories in SELECT JSON
  cql3: selection: don't create selector_factories any more
  cql3: selection: collect column_definitions using expressions
  cql3: selection: reimplement selection::is_aggregate()
  cql3: selection: evaluate aggregation queries via expr::evaluate()
  cql3: selection, select_statement: fine tune add_column_for_post_processing() usage
  cql3: selection: evaluate non-aggregating complex selections using expr::evaluate()
  cql3: selection: store primary key in result_set_builder
  cql3: expression: fix field_selection::type interpretation by evaluate()
  cql3: selection: make result_set_builder::current non-optional<>
  cql3: selection: simplify row/group processing
  cql3: selection: convert requires_thread to expressions
  cql: selection: convert used_functions() to expressions
  cql3: selection: convert is_reducible/get_reductions to expressions
  cql3: selection: convert is_count() to expressions
  cql3: selection convert contains_ttl/contains_writetime to work on expressions
  cql3: selection: make simple_selectors stateless
  cql3: expression: add helper to split expressions with aggregate functions
  cql3: selection: short-circuit non-aggregations
  cql3: selection: drop validate_selectors
  cql3: select_statement: force aggregation if GROUP BY is used
  cql3: select_statement: levellize aggregation depth
  cql3: selection: skip first_function when collecting metadata
  cql3: select_statement: explicitly disable automatic parallelization with no aggregates
  cql3: expression: introduce temporaries
  cql3: select_statement: use prepared selectors
  cql3: selection: avoid selector_factories in collect_metadata()
  cql3: expressions: add "metadata mode" formatter for expressions
  cql3: selection: convert collect_metadata() to the prepared expression domain
  cql3: selection: convert processes_selection to work on prepared expressions
  cql3: selection: prepare selectors earlier
  cql3: raw_selector: deinline
  cql3: expression: reimplement verify_no_aggregate_functions()
  cql3: expression: add helpers to manage an expression's aggregation depth
  cql3: expression: improve printing of prepared function calls
  cql3: functions: add "first" aggregate function
2023-07-03 23:21:33 +03:00
Avi Kivity
d9cf81f1a6 cql3: select_statement: stop using selector_factories in SELECT JSON
SELECT JSON uses selector_factories to obtain the names of the
fields to insert into the json object, and we want to drop
selector_factories entirely. Switch instead to the ":metadata" mode
of printing expressions, which does what we want.

Unfortunately, the switch changes how system functions are converted
into field names. A function such as unixtimestampof() is now rendered
as "system.unixtimestampof()"; before it did not have the keyspace
prefix.

This is a compatibility problem, albeit an obscure one. Since the new
behavior matches Cassandra, and the odds of hitting this are very low,
I think we can allow the change.
2023-07-03 19:45:17 +03:00
Avi Kivity
0021f77e30 cql3: expression: fix field_selection::type interpretation by evaluate()
field_selection::type refers to the type of the selection operation,
not the type of the structure being selected. This is what
prepare_expression() generates and how all other expression elements
work, but evaluate() for field_selection thinks it's the type
of the structure, and so fails when it gets an expression
from prepare_expression().

Fix that, and adjust the tests.
2023-07-03 19:45:17 +03:00
Avi Kivity
b1b4a18ad8 cql3: expression: add helpers to manage an expression's aggregation depth
We define the "aggregation depth" of an expression by how many
nested aggregation functions are applied. In CQL/SQL, legal
values are 0 and 1, but for generality we deal with any aggregation depth.

The first helper measures the maximum aggregation depth along any path
in the expression graph. If it's 2 or greater, we have something like
max(max(x)) and we should reject it (though these helpers don't). If
we get 1 it's a simple aggregation. If it's zero then we're not aggregating
(though CQL may decide to aggregate anyway if GROUP BY is used).

The second helper edits an expression to make sure the aggregation depth
along any path that reaches a column is the same. Logically,
`SELECT x, max(y)` does not make sense, as one is a vector of values
and the other is a scalar. CQL resolves the problem by defining x as
"the first value seen". We apply this resolution by converting the
query to `SELECT first(x), max(y)` (where `first()` is an internal
aggregate function), so both selectors refer to scalars that consume
vectors.

When a scalar is consumed by an aggregate function (for example,
`SELECT max(x), min(17)`), we don't have to bother, since a scalar
is implicitly promoted to a vector by evaluating it every row. There
is some ambiguity if the scalar is a non-pure function (e.g.
`SELECT max(x), min(random())`), but it's not worth following.
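
The two helpers can be sketched in Python over a toy expression tree. This is an illustration under assumptions, not the C++ code: `Column`, `Constant`, `Call`, `aggregation_depth`, and `levelize` are hypothetical names, and constants are simply left untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Constant:
    value: object


@dataclass(frozen=True)
class Call:
    function: str
    args: tuple
    is_aggregate: bool = False


def aggregation_depth(expr):
    """Maximum number of nested aggregate calls along any path in expr."""
    if isinstance(expr, Call):
        deepest_arg = max((aggregation_depth(a) for a in expr.args), default=0)
        return deepest_arg + (1 if expr.is_aggregate else 0)
    return 0


def levelize(expr, depth):
    """Wrap columns in first() so each path to a column crosses `depth` aggregates."""
    if isinstance(expr, Column):
        for _ in range(depth):
            expr = Call("first", (expr,), is_aggregate=True)
        return expr
    if isinstance(expr, Call):
        remaining = depth - 1 if expr.is_aggregate else depth
        new_args = tuple(levelize(a, remaining) for a in expr.args)
        return Call(expr.function, new_args, expr.is_aggregate)
    return expr  # constants are evaluated every row, so they need no wrapping


# SELECT x, max(y)  becomes  SELECT first(x), max(y)
selectors = [Column("x"), Call("max", (Column("y"),), is_aggregate=True)]
target = max(aggregation_depth(s) for s in selectors)
leveled = [levelize(s, target) for s in selectors]
```

A caller that wants to reject `max(max(x))` would check `aggregation_depth(...) >= 2` itself, since the helpers only measure and rewrite.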

A small unit test is added.
2023-07-03 19:45:16 +03:00
Alejo Sanchez
520bd90008 test/boost/memtable_test: split test plain/reverse
Split long running test
test_memtable_with_many_versions_conforms_to_mutation_source to 2 tests
for _plain and _reverse.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #14447
2023-07-03 15:20:12 +03:00
Piotr Dulikowski
ee9bfb583c combined: mergers: remove recursion in operator()()
In mutation_reader_merger and clustering_order_reader_merger, the
operator()() is responsible for producing mutation fragments that will
be merged and pushed to the combined reader's buffer. Sometimes, it
might have to advance existing readers, open new and / or close some
existing ones, which requires calling a helper method and then calling
operator()() recursively.

In some unlucky circumstances, a stack overflow can occur:

- Readers have to be opened incrementally,
- Most or all readers must not produce any fragments and need to report
  end of stream without preemption,
- There have to be enough readers opened within the lifetime of the
  combined reader (~500),
- All of the above needs to happen within a single task quota.

To prevent this, the code of both reader merger classes was modified
to not recurse at all. Most of the code of operator()() was moved to
maybe_produce_batch, which does not recurse when it cannot produce a
fragment; instead it returns std::nullopt, and operator()() calls this
method in a loop via seastar::repeat_until_value.
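
The shape of the change can be sketched in Python with a heavily simplified model. Only the name `maybe_produce_batch` comes from the patch; `next_batch` and the deque-based readers are hypothetical, `None` plays the role of std::nullopt, and a plain `while` loop stands in for seastar::repeat_until_value.

```python
from collections import deque


def maybe_produce_batch(readers):
    """One non-recursive step: return a batch, or None if the caller should retry."""
    if not readers:
        return []              # end of stream: an empty batch
    fragments = readers[0]
    if not fragments:
        readers.popleft()      # exhausted reader: close it, produce nothing this step
        return None
    return [fragments.popleft()]


def next_batch(readers):
    """Loop instead of recursing, so a long run of empty readers cannot grow the stack."""
    while True:
        batch = maybe_produce_batch(readers)
        if batch is not None:
            return batch


# 5000 readers that report end of stream immediately, followed by one real fragment.
readers = deque([deque() for _ in range(5000)] + [deque(["fragment"])])
first = next_batch(readers)
```

A recursive version of `next_batch` would need one stack frame per empty reader, which is the overflow scenario described above.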

A regression test is added.

Fixes: scylladb/scylladb#14415

Closes #14452
2023-06-30 12:07:13 +03:00
Kamil Braun
ff386e7a44 service: raft: force initial snapshot transfer in new cluster
When we upgrade a cluster to use Raft, or perform the manual Raft recovery
procedure (which also creates a fresh group 0 cluster, using the same
algorithm as during upgrade), we start with a non-empty group 0 state
machine; in particular, the schema tables are non-empty.

In this case we need to ensure that nodes which join group 0 receive the
group 0 state. Right now this is not the case. In previous releases,
where group 0 consisted only of schema, and schema pulls were also done
outside Raft, those nodes received schema through this outside
mechanism. In 91f609d065 we disabled
schema pulls outside Raft; we're also extending group 0 with other
things, like topology-specific state.

To solve this, we force snapshot transfers by setting the initial
snapshot index on the first group 0 server to `1` instead of `0`. During
replication, Raft will see that the joining servers are behind,
triggering snapshot transfer and forcing them to pull group 0 state.

It's unnecessary to do this for a cluster that bootstraps with Raft
enabled right away, but it also doesn't hurt, so we keep the logic simple
and don't introduce branches based on that.

Extend Raft upgrade tests with a node bootstrap step at the end to
prevent regressions (without this patch, the step would hang - node
would never join, waiting for schema).

Fixes: #14066

Closes #14336
2023-06-29 22:46:42 +02:00
Konstantin Osipov
3d81408a58 test.py: make experimental: raft the default for all tests
Make sure all tests use the new centralized topology
coordinator. This is a step forward towards maturing the
coordinator implementation.

Closes #14039
2023-06-29 14:44:00 +02:00
Botond Dénes
2a58b4a39a Merge 'Compaction resharding tasks' from Aleksandra Martyniuk
Task manager's tasks covering resharding compaction
on table and shard level.

Closes #14044

* github.com:scylladb/scylladb:
  test: extend test_compaction_task.py to test resharding compaction
  compaction: add shard_reshard_sstables_compaction_task_impl
  compaction: invoke resharding on sharded database
  compaction: move run_resharding_jobs into reshard_sstables_compaction_task_impl::run()
  replica: delete unused functions and struct
  compaction: add reshard_sstables_compaction_task_impl
  compaction: replica: copy struct and functions from distributed_loader.cc
  compaction: create resharding_compaction_task_impl
2023-06-29 12:10:54 +03:00
Nadav Har'El
dd63169077 Merge 'test/boost/index_with_paging_test: reduce running time' from Alecco
Reduce test string value size, parallelize inserts, and use a prepared statement.

The debug running time for this test is reduced from 13:18 to 7:52.

Refs #13905

Closes #14380

* github.com:scylladb/scylladb:
  test/boost/index_with_paging_test: parallel insert
  test/boost/index_with_paging_test: prepared statement
  test/boost/index_with_paging_test: reduce running time
2023-06-29 10:45:01 +03:00
Avi Kivity
f6f974cdeb cql3: selection: fix GROUP BY, empty groups, and aggregations
A GROUP BY combined with aggregation should produce a single
row per group, except for empty groups. This is in contrast
to an aggregation without GROUP BY, which produces a single
row no matter what.

The existing code only considered the case of no grouping
and forced a row into the result, but this caused an unwanted
row if grouping was used.

Fix by refining the check to also consider GROUP BY.
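
The rule can be sketched in Python for a `count(*)` query. This is a toy model, not the selection code: `aggregate_count` and its dictionary-based rows are hypothetical.

```python
def aggregate_count(rows, group_by=None):
    """Return the result rows of SELECT count(*) [GROUP BY group_by]."""
    if group_by is None:
        # No grouping: exactly one row, even when the input is empty.
        return [len(rows)]
    counts = {}
    for row in rows:
        key = row[group_by]
        counts[key] = counts.get(key, 0) + 1
    # Grouping: one row per non-empty group, so no rows for empty input.
    return list(counts.values())
```

With no input rows, `aggregate_count([])` yields one row holding zero, while `aggregate_count([], group_by="k")` yields no rows at all.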

XFAIL tests are relaxed.

Fixes #12477.

Note, forward_service requires that aggregation produce
exactly one row, but since it can't work with grouping,
it isn't affected.

Closes #14399
2023-06-28 18:56:22 +03:00
Kamil Braun
b912eeade5 Merge 'merge raft commands to group0 before applying them whenever possible' from Gleb
Since most group0 commands are just mutations, it is easy to combine them
before passing them to the subsystem they are destined for, which is more
efficient. The logic that handles those mutations in a subsystem will
run once for each batch of commands instead of for each individual
command. This is especially useful when a node catches up to a leader and
gets a lot of commands together.

The patch here does exactly that. It combines commands into a single
command if possible, but it preserves the order between commands, so each
time it encounters a command for a different subsystem it flushes the
already combined batch and starts a new one. This extra safety assumes that
there are dependencies between subsystems managed by group0, so the order
matters. That may not be the case now, but we prefer to be on the safe side.

Broadcast table commands are not mutations, so they are never combined.
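
A minimal Python sketch of the batching rule, under stated assumptions: `Command`, `merge_commands`, and the `mergeable` flag are hypothetical names, and the size cap `max_size` is counted in number of mutations purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Command:
    subsystem: str
    mutations: list
    mergeable: bool = True  # False models commands that are not mutations


def merge_commands(commands, max_size):
    """Combine neighbouring commands for the same subsystem, preserving order."""
    merged = []
    for cmd in commands:
        last = merged[-1] if merged else None
        can_merge = (
            last is not None
            and last.mergeable
            and cmd.mergeable
            and last.subsystem == cmd.subsystem
            and len(last.mutations) + len(cmd.mutations) <= max_size
        )
        if can_merge:
            last.mutations.extend(cmd.mutations)
        else:
            # Different subsystem, unmergeable command, or size cap hit: start a new batch.
            merged.append(Command(cmd.subsystem, list(cmd.mutations), cmd.mergeable))
    return merged


commands = [
    Command("schema", [1]),
    Command("schema", [2]),
    Command("topology", [3]),
    Command("schema", [4]),
]
merged = merge_commands(commands, max_size=10)
```

Only adjacent commands are merged, so the last `schema` command is not folded into the first batch; doing so would reorder it ahead of the `topology` command.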

* 'raft-merge-cmds' of https://github.com/gleb-cloudius/scylla:
  test: add test for group0 raft command merging
  service: raft: respect max mutation size limit when persisting raft entries
  group0_state_machine: merge commands before applying them whenever possible
2023-06-28 17:21:07 +02:00
Alejo Sanchez
d4697ed21e test/boost/index_with_paging_test: parallel insert
Parallelize inserts for long-running test_index_with_paging.

Run time in debug mode reduced by 1 minute 48 seconds.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-06-28 16:11:58 +02:00
Alejo Sanchez
70a3179888 test/boost/index_with_paging_test: prepared statement
Prepare statement for insert.

Run time in debug mode reduced by 9 seconds.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-06-28 14:49:21 +02:00
Michał Jadwiszczak
0a8fcead08 cql3: Specify arguments types in UDA creation errors
Display not only function name but also expected arguments
if `state_function` or `final_function` was not found.

Fixes: #12088

Closes #14278
2023-06-28 15:27:49 +03:00
Alejo Sanchez
48d24269f1 test/boost/index_with_paging_test: reduce running time
Reduce test string value size for test_index_with_paging from 4096 to
100. With 100 bytes it should make the base row significantly larger
than the key so the test will exercise both types of paging in the
scanning code.

The debug running time for this test is reduced from 9 minutes to 6
minutes.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-06-28 13:55:52 +02:00
Nadav Har'El
49c8c06b1b Merge 'cql: fix crash on empty clustering range in LWT' from Jan Ciołek
LWT queries with empty clustering range used to cause a crash.
For example in:
```cql
UPDATE tab SET r = 9000 WHERE p = 1  AND c = 2 AND c = 2000 IF r = 3
```
The range of `c` is empty - there are no valid values.

This caused a segfault when accessing the `first` range:
```c++
op.ranges.front()
```

Cassandra rejects such queries at the preparation stage. It doesn't allow two `EQ` restrictions on the same clustering column when an IF is involved.
We reject them at runtime, which is a worse solution. The user can prepare a query with `c = ? AND c = ?` and then run it, but it will unexpectedly throw an `invalid_request_exception` when the two bound variables are different.

We could ban such queries as well; we already ban the usage of `IN` in conditional statements. The problem is that this would be a breaking change.

A better solution would be to allow empty ranges in `LWT` statements: when an empty range is detected, we just wouldn't apply the change. This would be a larger change, so for now let's just fix the crash.

Fixes: https://github.com/scylladb/scylladb/issues/13129

Closes #14429

* github.com:scylladb/scylladb:
  modification_statement: reject conditional statements with empty clustering key
  statements/cas_request: fix crash on empty clustering range in LWT
2023-06-28 14:43:54 +03:00
Aleksandra Martyniuk
bf3e0744c1 test: extend test_compaction_task.py to test resharding compaction 2023-06-28 11:43:12 +02:00
Jan Ciolek
ccdb26bf9e statements/cas_request: fix crash on empty clustering range in LWT
LWT queries with empty clustering range used to cause a crash.
For example in:
```cql
UPDATE tab SET r = 9000 WHERE p = 1  AND c = 2 AND c = 2000 IF r = 3
```
The range of `c` is empty - there are no valid values.

This caused a segfault when accessing the `first` range:
```c++
op.ranges.front()
```

To fix it, let's throw an exception when the clustering range
is empty. Cassandra also rejects queries with `c = 1 AND c = 2`.

There's also a check for an empty partition range, as it used
to crash in the past; it can't really hurt to add it.
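
The guard can be sketched in Python with a toy model of the restriction. `InvalidRequest`, `clustering_ranges`, `first_clustering_range`, and the error text are all hypothetical stand-ins, not the names or messages used in the patch.

```python
class InvalidRequest(Exception):
    """Stand-in for invalid_request_exception."""


def clustering_ranges(eq_values):
    """Ranges for `c = v1 AND c = v2 ...`: one point if all values agree, else none."""
    distinct = set(eq_values)
    return [distinct.pop()] if len(distinct) == 1 else []


def first_clustering_range(eq_values):
    """Return the first range, rejecting the request instead of reading an empty list."""
    ranges = clustering_ranges(eq_values)
    if not ranges:
        # The check that must run before the equivalent of op.ranges.front().
        raise InvalidRequest("conditional statement has an empty clustering range")
    return ranges[0]
```

With this guard, `c = 2 AND c = 2000` raises an error that the client can handle, whereas `c = 2 AND c = 2` still resolves to the single point `2`.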

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-28 10:18:06 +02:00
Kamil Braun
96bc78905d readers: evictable_reader: don't accidentally consume the entire partition
The evictable reader must ensure that each buffer fill makes forward
progress, i.e. the last fragment in the buffer has a position larger
than the last fragment from the previous buffer-fill. Otherwise, the
reader could get stuck in an infinite loop between buffer fills, if the
reader is evicted in-between.

The code guaranteeing this forward progress had a bug: the comparison
between the position after the last buffer-fill and the current
last fragment position was done in the wrong direction.

So if the condition that we wanted to achieve was already true, we would
continue filling the buffer until partition end which may lead to OOMs
such as in #13491.

There was already a fix in this area to handle `partition_start`
fragments correctly - #13563 - but it missed that the position
comparison was done in the wrong order.

Fix the comparison and adjust one of the tests (added in #13563) to
detect this case.

Fixes #13491
2023-06-27 14:37:29 +02:00
Kamil Braun
5800ce8ddd test: flat_mutation_reader_assertions: squash r_t_cs with the same position
test_range_tombstones_v2 is too strict for this reader -- it expects a
particular sequence of `range_tombstone_change`s, but
multishard_combining_reader, when tested with a small buffer, may
generate -- as expected -- additional (redundant) range tombstone change
pairs (end+start).

Currently we don't observe these redundant fragments due to a bug in
`evictable_reader_v2` but they start appearing once we fix the bug and
the test must be prepared first.

To prepare the test, modify `flat_reader_assertions_v2` so it squashes
redundant range tombstone change pairs. This happens only in non-exact
mode.

Enable exact mode in `test_sstable_reversing_reader_random_schema` for
comparing two readers -- the squashing of `r_t_c`s may introduce an
artificial difference.
2023-06-27 14:37:25 +02:00
Gleb Natapov
945f476363 test: add test for group0 raft command merging
Add a test that submits 3 large commands, each one a little bit larger
than 1/3 of the maximum mutation size. Check that in the end 2 commands were
executed (the first 2 were merged and the third was executed separately).
2023-06-27 14:59:55 +03:00
Botond Dénes
f5e3b8df6d Merge 'Optimize creation of reader excluding staging for view building' from Raphael "Raph" Carvalho
View building from staging creates a reader from scratch (memtable
+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.

perf shows that the reader creation is very expensive:
```
+   12.15%    10.75%  reactor-3        scylla             [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+   10.01%     9.99%  reactor-3        scylla             [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    8.95%     8.94%  reactor-3        scylla             [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+    7.29%     7.28%  reactor-3        scylla             [.] dht::ring_position_tri_compare
+    6.28%     6.27%  reactor-3        scylla             [.] dht::tri_compare
+    4.11%     3.52%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    4.09%     4.07%  reactor-3        scylla             [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+    3.46%     0.93%  reactor-3        scylla             [.] sstables::sstable_run::will_introduce_overlapping
+    2.53%     2.53%  reactor-3        libstdc++.so.6     [.] std::_Rb_tree_increment
+    2.45%     2.45%  reactor-3        scylla             [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.14%     2.13%  reactor-3        scylla             [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.07%     2.07%  reactor-3        scylla             [.] logalloc::region_impl::free
+    2.06%     1.91%  reactor-3        scylla             [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()
+    2.04%     2.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    1.87%     0.00%  reactor-3        [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+    1.86%     0.00%  reactor-3        [kernel.kallsyms]  [k] do_syscall_64
+    1.39%     1.38%  reactor-3        libc.so.6          [.] __memcmp_avx2_movbe
+    1.37%     0.92%  reactor-3        scylla             [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+    1.34%     1.33%  reactor-3        scylla             [.] logalloc::region_impl::alloc_small
+    1.33%     1.33%  reactor-3        scylla             [.] seastar::memory::small_pool::add_more_objects
+    1.30%     0.35%  reactor-3        scylla             [.] seastar::reactor::do_run
+    1.29%     1.29%  reactor-3        scylla             [.] seastar::memory::allocate
+    1.19%     0.05%  reactor-3        libc.so.6          [.] syscall
+    1.16%     1.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    1.07%     0.79%  reactor-3        scylla             [.] sstables::partitioned_sstable_set::insert

```
This shows a significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).

The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.

This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.
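
The idea of filtering at read time instead of rebuilding the set can be sketched in Python. This is a toy model: `SSTable`, `make_reader`, and the `predicate` parameter are hypothetical names, and a list of sstable names stands in for a real reader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    name: str
    staging: bool
    keys: frozenset


def make_reader(sstable_set, key, predicate=lambda sst: True):
    """Return names of sstables that may hold `key` and that the predicate accepts."""
    return [sst.name for sst in sstable_set if key in sst.keys and predicate(sst)]


# The set is built once and shared; only the predicate differs per reader.
sstables = [
    SSTable("base-1", staging=False, keys=frozenset({"k1", "k2"})),
    SSTable("staging-1", staging=True, keys=frozenset({"k1"})),
    SSTable("base-2", staging=False, keys=frozenset({"k3"})),
]
without_staging = make_reader(sstables, "k1", predicate=lambda sst: not sst.staging)
```

Because the predicate inspects each sstable's own staging flag, the sketch does not depend on which sstables happened to be in the current batch.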

This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.

With this improvement, view building was measured to be 3x faster.

from
`INFO  2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s`

to
`INFO  2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s`

Refs https://github.com/scylladb/scylladb/issues/14089.
Fixes scylladb/scylladb#14244.

Closes #14364

* github.com:scylladb/scylladb:
  table: Optimize creation of reader excluding staging for view building
  view_update_generator: Dump throughput and duration for view update from staging
  utils: Extract pretty printers into a header
2023-06-27 07:25:30 +03:00
Raphael S. Carvalho
1d8cb32a5d table: Optimize creation of reader excluding staging for view building
View building from staging creates a reader from scratch (memtable
+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.

perf shows that the reader creation is very expensive:
+   12.15%    10.75%  reactor-3        scylla             [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+   10.01%     9.99%  reactor-3        scylla             [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    8.95%     8.94%  reactor-3        scylla             [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+    7.29%     7.28%  reactor-3        scylla             [.] dht::ring_position_tri_compare
+    6.28%     6.27%  reactor-3        scylla             [.] dht::tri_compare
+    4.11%     3.52%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    4.09%     4.07%  reactor-3        scylla             [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+    3.46%     0.93%  reactor-3        scylla             [.] sstables::sstable_run::will_introduce_overlapping
+    2.53%     2.53%  reactor-3        libstdc++.so.6     [.] std::_Rb_tree_increment
+    2.45%     2.45%  reactor-3        scylla             [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.14%     2.13%  reactor-3        scylla             [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.07%     2.07%  reactor-3        scylla             [.] logalloc::region_impl::free
+    2.06%     1.91%  reactor-3        scylla             [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()
+    2.04%     2.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    1.87%     0.00%  reactor-3        [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+    1.86%     0.00%  reactor-3        [kernel.kallsyms]  [k] do_syscall_64
+    1.39%     1.38%  reactor-3        libc.so.6          [.] __memcmp_avx2_movbe
+    1.37%     0.92%  reactor-3        scylla             [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+    1.34%     1.33%  reactor-3        scylla             [.] logalloc::region_impl::alloc_small
+    1.33%     1.33%  reactor-3        scylla             [.] seastar::memory::small_pool::add_more_objects
+    1.30%     0.35%  reactor-3        scylla             [.] seastar::reactor::do_run
+    1.29%     1.29%  reactor-3        scylla             [.] seastar::memory::allocate
+    1.19%     0.05%  reactor-3        libc.so.6          [.] syscall
+    1.16%     1.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    1.07%     0.79%  reactor-3        scylla             [.] sstables::partitioned_sstable_set::insert

This shows a significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).

The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.

This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.

This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.

With this improvement, view building was measured to be 3x faster.

from
INFO  2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s

to
INFO  2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s

Refs #14089.
Fixes #14244.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-06-26 22:30:39 -03:00
Raphael S. Carvalho
83c70ac04f utils: Extract pretty printers into a header
Can be easily reused elsewhere.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-06-26 21:58:20 -03:00
Benny Halevy
9231a6c480 cql-pytest: test_using_timestamp: increase ttl
It seems like the current 1-second TTL is too
small for debug build on aarch64 as seen in
https://jenkins.scylladb.com/job/scylla-master/job/build/1513/artifact/testlog/aarch64/debug/cql-pytest.test_using_timestamp.1.log
```
            k = unique_key_int()
            cql.execute(f"INSERT INTO {table} (k, v) VALUES ({k}, {v1}) USING TIMESTAMP {ts} and TTL 1")
            cql.execute(f"INSERT INTO {table} (k, v) VALUES ({k}, {v2}) USING TIMESTAMP {ts}")
>           assert_value(k, v1)

test/cql-pytest/test_using_timestamp.py:140:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

k = 10, expected = 2

    def assert_value(k, expected):
        select = f"SELECT k, v FROM {table} WHERE k = {k}"
        res = list(cql.execute(select))
>       assert len(res) == 1
E       assert 0 == 1
E        +  where 0 = len([])
```

Increase the TTL used to write data to de-flake the test
on slow machines running debug build.

Ref #14182

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14396
2023-06-26 21:35:31 +03:00
Alexey Novikov
ca4e7f91c6 compact and remove expired rows from cache on read
When reading from cache, compact and expire row tombstones.
Remove expired empty rows from cache.
Do not expire range tombstones in this patch.

Refs #2252, #6033

Closes #12917
2023-06-26 15:29:01 +02:00
Botond Dénes
b23361977b Merge 'Compaction reshape tasks' from Aleksandra Martyniuk
Task manager tasks covering reshaping compaction
at the top and shard levels.

Closes #14112

* github.com:scylladb/scylladb:
  test: extend test_compaction_task.py to test reshaping compaction
  compaction: move reshape function to shard_reshaping_table_compaction_task_impl::run()
  compaction: add shard_reshaping_compaction_task_impl
  replica: delete unused function
  compaction: add table_reshaping_compaction_task_impl
  compaction: copy reshape to task_manager_module.cc
  compaction: add reshaping_compaction_task_impl
2023-06-26 11:56:07 +03:00
Alejo Sanchez
4999cbc1cf test/boost/cql_functions_test: split long running tests
Split long running test_aggregate_functions to one case per type.

This allows test.py to run them in parallel.

Before this it would take 18 minutes to run in debug mode. Afterwards
each case takes 30-45 seconds.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #14368
2023-06-26 11:29:36 +03:00
Alejo Sanchez
8b1968cfbb test/boost/schema_changes_test: split long-running test
Split long-running test test_schema_changes into 3 parts, one for each
of the writable_sstable_versions, so they can be run in parallel by test.py.

Add static checks to alert if the array of types changed.

Original test takes around 24 minutes in debug mode, and each new split
test takes around 8 minutes.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #14367
2023-06-26 11:24:07 +03:00
Alejo Sanchez
633f026d63 test/boost/memtable_test: allow parallel run
Remove previous configuration blocking parallel run.

Test cases run fine in local debug.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #14369
2023-06-26 11:23:43 +03:00
Alejo Sanchez
3cbfd863eb test/boost/database_test: split long running tests
Split long running tests
test_database_with_data_in_sstables_is_a_mutation_source_plain and
test_database_with_data_in_sstables_is_a_mutation_source_reverse.

They run with x_log2_compaction_groups of 0 and 1, each run taking from
10 to 15 minutes in debug mode, for a total of 28 and 22 minutes.

Split the test cases to run with 0 and 1, so test.py can run them in
parallel.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #14356
2023-06-26 11:20:27 +03:00
Aleksandra Martyniuk
b02a5fd184 test: extend test_compaction_task.py to test reshaping compaction 2023-06-23 16:22:53 +02:00
Kamil Braun
be5b61b870 Merge 'cql3: expr: break up expression.hh header' from Avi Kivity
It's very annoying to add a declaration to expression.hh and watch
the whole world get recompiled. Improve that by moving less-common
functions to a new header expr-utils.hh. Move the evaluation machinery
to a new header evaluate.hh. The remaining definitions in expression.hh
should not change as often, and thus cause less frequent recompiles.

Closes #14346

* github.com:scylladb/scylladb:
  cql3: expr: break up expression.hh header
  cql3: expr: restrictions.hh: protect against double inclusions
  cql3: constants: deinline
  cql3: statement_restrictions: deinline
  cql3: deinline operation::fill_prepare_context()
2023-06-23 10:19:28 +02:00
Nadav Har'El
0a1283c813 Merge 'cql3:statements:describe_statement: check pointer after casting to UDF/UDA' from Michał Jadwiszczak
There was a bug in describe_statement. When executing `DESC FUNCTION <uda name>` or `DESC AGGREGATE <udf name>`, Scylla crashed: the function was found (`functions::find()` searches both UDFs and UDAs) but was of the wrong kind, and the pointer wasn't checked after the cast.

Added a test for this.

Fixes: #14360

Closes #14332

* github.com:scylladb/scylladb:
  cql-pytest:test_describe: add test for filtering UDF and UDA
  cql3:statements:describe_statement: check pointer to UDF/UDA
2023-06-22 20:54:25 +03:00
Michał Jadwiszczak
d3d9a15505 cql-pytest:test_describe: add test for filtering UDF and UDA 2023-06-22 18:08:45 +02:00
Avi Kivity
b858a4669d cql3: expr: break up expression.hh header
Adding a function declaration to expression.hh causes many
recompilations. Reduce that by:

 - moving some restrictions-related definitions to
   the existing expr/restrictions.hh
 - moving evaluation related names to a new header
   expr/evaluate.hh
 - moving utilities to a new header
   expr/expr-utilities.hh

expression.hh contains only expression definitions and the most
basic and common helpers, like printing.
2023-06-22 14:21:03 +03:00
Avi Kivity
32b27d6a08 cql3: expr: change evaluation_input vector components to take spans
Spans are slightly cleaner, slightly faster (as they avoid an indirection),
and allow for replacing some of the arguments with small_vector:s.

Closes #14313
2023-06-22 11:28:01 +02:00
Botond Dénes
e1c2de4fb8 Merge 'forward_service: fix forgetting case-sensitivity in aggregates ' from Jan Ciołek
There was a bug that caused aggregates to fail when used on case-sensitive columns.

For example:
```cql
SELECT SUM("SomeColumn") FROM ks.table;
```
would fail, with a message saying that there is no column "somecolumn".

This is because the case-sensitivity got lost on the way.

For non case-sensitive column names we convert them to lowercase, but for case sensitive names we have to preserve the name as originally written.

The problem was in `forward_service` - we took a column name and created a non case-sensitive `column_identifier` out of it.
This converted the name to lowercase, and later such column couldn't be found.

To fix it, let's make the `column_identifier` case-sensitive.
It will preserve the name, without converting it to lowercase.
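
A minimal sketch of the failure mode, in Python rather than the actual
C++ (`column_identifier` here is a hypothetical helper modelled on the
description above, and `schema_columns` is made-up data):

```python
def column_identifier(raw: str, keep_case: bool) -> str:
    """Return the internal column name for `raw`.

    keep_case=False models a non case-sensitive identifier, which is
    folded to lowercase; keep_case=True preserves the name as written.
    """
    return raw if keep_case else raw.lower()


# A table with one quoted (case-sensitive) column and one ordinary column.
schema_columns = {"SomeColumn", "other"}

buggy = column_identifier("SomeColumn", keep_case=False)  # name gets lowercased
fixed = column_identifier("SomeColumn", keep_case=True)   # name is preserved

lookup_before_fix = buggy in schema_columns
lookup_after_fix = fixed in schema_columns
```

With the lowercased name the lookup misses the column, which matches
the "no column somecolumn" error described above.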

Fixes: https://github.com/scylladb/scylladb/issues/14307

Closes #14340

* github.com:scylladb/scylladb:
  service/forward_service.cc: make case-sensitivity explicit
  cql-pytest/test_aggregate: test case-sensitive column name in aggregate
  forward_service: fix forgetting case-sensitivity in aggregates
2023-06-22 08:25:33 +03:00
Botond Dénes
320159c409 Merge 'Compaction group major compaction task' from Aleksandra Martyniuk
Task manager task covering compaction group major
compaction.

Uses multiple inheritance on already existing
major_compaction_task_executor to keep track of
the operation with task manager.

Closes #14271

* github.com:scylladb/scylladb:
  test: extend test_compaction_task.py
  test: use named variable for task tree depth
  compaction: turn major_compaction_task_executor into major_compaction_task_impl
  compaction: take gate holder out of task executor
  compaction: extend signature of some methods
  tasks: keep shared_ptr to impl in task
  compaction: rename compaction_task_executor methods
2023-06-22 08:15:17 +03:00
Avi Kivity
8576502c48 Merge 'raft topology: ban left nodes from the cluster' from Kamil Braun
Use the new Seastar functionality for storing references to connections to implement banning hosts that have left the cluster (either decommissioned or using removenode) in raft-topology mode. Any attempts at communication from those nodes will be rejected.

This works not only for nodes that restart, but also for nodes that were running behind a network partition when we removed them. Even when the partition resolves, the existing nodes will effectively firewall off that node.
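
The mechanism can be pictured with a small Python model. This is only a
sketch under assumptions: `MessagingService`, `accept` and `ban_host` are
hypothetical names, not the real messaging_service API.

```python
from typing import Dict, Set


class MessagingService:
    def __init__(self) -> None:
        self._banned: Set[str] = set()
        # host ID -> connection IDs currently open from that host
        self._connections: Dict[str, Set[int]] = {}

    def accept(self, host_id: str, conn_id: int) -> bool:
        # Reject any connection attempt from a banned host.
        if host_id in self._banned:
            return False
        self._connections.setdefault(host_id, set()).add(conn_id)
        return True

    def ban_host(self, host_id: str) -> Set[int]:
        # Remember the ban and hand back the connections to be dropped.
        self._banned.add(host_id)
        return self._connections.pop(host_id, set())


ms = MessagingService()
ms.accept("host-a", 1)
ms.accept("host-a", 2)
dropped = ms.ban_host("host-a")
reconnected = ms.accept("host-a", 3)
```

Keeping the host-to-connection map is what lets a ban close existing
connections in addition to refusing new ones.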

Some changes to the decommission algorithm had to be introduced for it to work with node banning. As a side effect a pre-existing problem with decommission was fixed. Read the "introduce `left_token_ring` state" and "prepare decommission path for node banning" commits for details.

Closes #13850

* github.com:scylladb/scylladb:
  test: pylib: increase checking period for `get_alive_endpoints`
  test: add node banning test
  test: pylib: manager_client: `get_cql()` helper
  test: pylib: ScyllaCluster: server pause/unpause API
  raft topology: ban left nodes
  raft topology: skip `left_token_ring` state during `removenode`
  raft topology: prepare decommission path for node banning
  raft topology: introduce `left_token_ring` state
  raft topology: `raft_topology_cmd` implicit constructor
  messaging_service: implement host banning
  messaging_service: exchange host IDs and map them to connections
  messaging_service: store the node's host ID
  messaging_service: don't use parameter defaults in constructor
  main: move messaging_service init after system_keyspace init
2023-06-21 20:16:45 +03:00
Jan Ciolek
854b0301be cql-pytest/test_aggregate: test case-sensitive column name in aggregate
There was a bug which made aggregates fail when used with case-sensitive
column names.
Add a test to make sure that this doesn't happen in the future.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-21 14:49:24 +02:00
Nadav Har'El
8a9de08510 sstable: limit compression chunk size to 128 KB
The chunk size used in sstable compression can be set when creating a
table, using the "chunk_length_in_kb" parameter. It can be any power-of-two
multiple of 1KB. Very large compression chunks are not useful - they
offer diminishing returns on compression ratio, and require very large
memory buffers and reading a very large amount of disk data just to
read a small row. In fact, small chunks are recommended - Scylla
defaults to 4 KB chunks, and Cassandra lowered their default from 64 KB
(in Cassandra 3) to 16 KB (in Cassandra 4).

Therefore, allowing arbitrarily large chunk sizes is just asking for
trouble. Today, a user can ask for a 1 GB chunk size, and crash or hang
Scylla when it runs out of memory. So in this patch we add a hard limit
of 128 KB for the chunk size - anything larger is refused.
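
The resulting rule can be sketched as follows (illustrative Python, not
the actual validation code; `validate_chunk_length_kb` and the error
messages are hypothetical):

```python
MAX_CHUNK_LENGTH_KB = 128  # hard limit described in this patch


def validate_chunk_length_kb(kb: int) -> int:
    """Return `kb` if it is an acceptable chunk_length_in_kb value."""
    # Must be a positive power of two (a power-of-two multiple of 1 KB).
    if kb <= 0 or (kb & (kb - 1)) != 0:
        raise ValueError(f"chunk_length_in_kb must be a power of two, got {kb}")
    if kb > MAX_CHUNK_LENGTH_KB:
        raise ValueError(
            f"chunk_length_in_kb must be at most {MAX_CHUNK_LENGTH_KB}, got {kb}"
        )
    return kb


accepted = [validate_chunk_length_kb(kb) for kb in (1, 4, 16, 64, 128)]
```

Under this rule a 1 GB request (1048576 KB) is refused up front
instead of being attempted.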

Fixes #9933

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14267
2023-06-21 14:26:02 +03:00
Kefu Chai
f014ccf369 Revert "Revert "Merge 'treewide: add uuid_sstable_identifier_enabled support' from Kefu Chai""
This reverts commit 562087beff.

The regressions introduced by the reverted change have been fixed.
So let's revert this revert to resurrect the
uuid_sstable_identifier_enabled support.

Fixes #10459
2023-06-21 13:02:40 +03:00
Avi Kivity
e233f471b8 Merge 'Respect tablet shard assignment' from Tomasz Grabiec
This PR changes the system to respect shard assignment to tablets in tablet metadata (system.tablets):
1. The tablet allocator is changed to distribute tablets evenly across shards taking into account currently allocated tablets in the system. Each tablet has equal weight. vnode load is ignored.
2. CDC subsystem was not adjusted (not supported yet)
3. sstable sharding metadata reflects tablet boundaries
5. resharding is NOT supported yet (the node will abort on boot if there is a need to reshard tablet-based tables)
6. The system is NOT prepared to handle tablet migration / topology changes in a safe way.
7. Sstable cleanup is not wired properly yet

After this PR, dht::shard_of() and schema::get_sharder() are deprecated. One should use table::shard_of() and effective_replication_map::get_sharder() instead.

To make life easier, support was added to obtain the table pointer from the schema pointer:

```
schema_ptr s;
s->table().shard_of(...)
```

Closes #13939

* github.com:scylladb/scylladb:
  locator: network_topology_startegy: Allocate shards to tablets
  locator: Store node shard count in topology
  service: topology: Extract topology updating to a lambda
  test: Move test_tablets under topology_experimental
  sstables: Add trace-level logging related to shard calculation
  schema: Catch incorrect uses of schema::get_sharder()
  dht: Rename dht::shard_of() to dht::static_shard_of()
  treewide: Replace dht::shard_of() uses with table::shard_of() / erm::shard_of()
  storage_proxy: Avoid multishard reader for tablets
  storage_proxy: Obtain shard from erm in the read path
  db, storage_proxy: Drop mutation/frozen_mutation ::shard_of()
  forward_service: Use table sharder
  alternator: Use table sharder
  db: multishard: Obtain sharder from erm
  sstable_directory: Improve trace-level logging
  db: table: Introduce shard_of() helper
  db: Use table sharder in compaction
  sstables: Compute sstable shards using sharder from erm when loading
  sstables: Generate sharding metadata using sharder from erm when writing
  test: partitioner: Test split_range_to_single_shard() on tablet-like sharder
  dht: Make split_range_to_single_shard() prepared for tablet sharder
  sstables: Move compute_shards_for_this_sstable() to load()
  dht: Take sharder externally in splitting functions
  locator: Make sharder accessible through effective_replication_map
  dht: sharder: Document guarantees about mapping stability
  tablets: Implement tablet sharder
  tablets: Include pending replica in get_shard()
  dht: sharder: Introduce next_shard()
  db: token_ring_table: Filter out tablet-based keyspaces
  db: schema: Attach table pointer to schema
  schema_registry: Fix SIGSEGV in learn() when concurrent with get_or_load()
  schema_registry: Make learn(schema_ptr) attach entry to the target schema
  test: lib: cql_test_env: Expose feature_service
  test: Extract throttle object to separate header
2023-06-21 10:20:41 +03:00
Calle Wilund
f18e967939 storage_proxy: Make split_stats resilient to being called from different scheduling group
Fixes #11017

When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write-inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics is executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
call get_ep_stat and register_metrics.
This code (before this patch) uses the _active_ scheduling group to eventually add
metrics, using a local dict as a guard against double registrations. If, as described
above, we're called in a different scheduling group than the original one, this
can cause double registrations.

Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.
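
A simplified Python model of the problem and the fix. All names here
(`MetricsRegistry`, `SplitStats`, the `use_active_group` switch and the
group names) are hypothetical and only mirror the description above.

```python
class MetricsRegistry:
    """Stand-in for a global registry that rejects duplicate metrics."""

    def __init__(self) -> None:
        self._keys = set()

    def register(self, group: str, name: str) -> None:
        key = (group, name)
        if key in self._keys:
            raise RuntimeError(f"double registration of {key}")
        self._keys.add(key)


class SplitStats:
    def __init__(self, registry, creating_group, use_active_group=False):
        self._registry = registry
        self._creating_group = creating_group
        self._use_active_group = use_active_group  # True models the old behaviour
        self._registered = set()  # local guard, only meaningful for one group

    def register_metrics(self, endpoint: str, active_group: str) -> None:
        if endpoint in self._registered:
            return
        self._registered.add(endpoint)
        group = active_group if self._use_active_group else self._creating_group
        self._registry.register(group, endpoint)


registry = MetricsRegistry()
gossip_stats = SplitStats(registry, "gossip")
gossip_stats.register_metrics("node1", active_group="gossip")

# Created in the "statement" group but invoked from the gossip group.
fixed = SplitStats(registry, "statement")
fixed.register_metrics("node1", active_group="gossip")
```

With `use_active_group=True` the second call would try to register
under the gossip group, which the local guard cannot know about, and
the registry would reject it as a duplicate.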

Closes #14294
2023-06-21 10:08:27 +03:00
Tomasz Grabiec
ebdebb982b locator: network_topology_startegy: Allocate shards to tablets
Uses a simple algorithm for allocating shards which chooses the
least-loaded shard on a given node, encapsulated in load_sketch.

Takes load due to current tablet allocation into account.

Each tablet, new or allocated for other tables, is assumed to have an
equal load weight.
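
The least-loaded choice can be sketched like this (a Python model for a
single node; this `LoadSketch` and its `next_shard` method are
hypothetical and not the interface of the real load_sketch):

```python
import heapq
from typing import Dict, Optional


class LoadSketch:
    def __init__(self, shard_count: int, existing: Optional[Dict[int, int]] = None):
        # Start from the tablets already allocated on each shard.
        load = dict.fromkeys(range(shard_count), 0)
        for shard, tablets in (existing or {}).items():
            load[shard] += tablets
        self._heap = [(tablets, shard) for shard, tablets in load.items()]
        heapq.heapify(self._heap)

    def next_shard(self) -> int:
        # Pick the least-loaded shard; every tablet has equal weight (1).
        tablets, shard = heapq.heappop(self._heap)
        heapq.heappush(self._heap, (tablets + 1, shard))
        return shard


# Shard 0 already holds 2 tablets, shard 1 holds 1, shard 2 is empty.
sketch = LoadSketch(3, {0: 2, 1: 1})
assigned = [sketch.next_shard() for _ in range(4)]
```

Ties are broken by the lower shard number here, which is an arbitrary
choice made for the sketch.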
2023-06-21 00:58:25 +02:00
Tomasz Grabiec
6defcb7bd5 test: Move test_tablets under topology_experimental
Tablets will rely on shard_count information in topology, which is set
only when using experimental raft-based topology.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
29cbdb812b dht: Rename dht::shard_of() to dht::static_shard_of()
This is in order to prevent new incorrect uses of dht::shard_of() from
being accidentally added. It also makes sure that all current uses are
caught by the compiler and require an explicit rename.
2023-06-21 00:58:24 +02:00