Commit Graph

11801 Commits

Author SHA1 Message Date
Kamil Braun
2032d7dbe4 test: scylla_cluster: return the new IP from change_ip API
Also simplify the API by getting rid of `ActionReturn` and returning
errors through exceptions (which are correctly forwarded to the client
for some time already).
2023-07-06 10:24:46 +02:00
Kamil Braun
00f51ea753 test: node replace with ignore_dead_nodes test
Regression test for #14487 on steroids. It performs 3 consecutive node
replace operations, starting with 3 dead nodes.

In order to have a Raft majority, we have to boot a 7-node cluster, so
we enable this test only in one mode; the choice was between `dev` and
`release`, I picked `dev` because it compiles faster and I develop on
it.
2023-07-06 10:24:46 +02:00
Kamil Braun
9b136ee574 test: scylla_cluster: accept ignore_dead_nodes in ReplaceConfig 2023-07-06 10:24:46 +02:00
Tomasz Grabiec
c25201c1a3 Merge 'view: fix range tombstone handling on flushes in view_updating_consumer' from Michał Chojnowski
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to convert the fragment stream into `mutation`s. This is done by piping the stream to mutation_rebuilder_v2.

To keep memory usage limited, the stream for a single partition might have to be split into multiple partial `mutation` objects. view_update_consumer does that, but in improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error.

This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next `mutation` object).

The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic.

Fixes https://github.com/scylladb/scylladb/issues/14503

Closes #14502

* github.com:scylladb/scylladb:
  test: view_build_test: add range tombstones to test_view_update_generator_buffering
  test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
  view_updating_consumer: make buffer limit a variable
  view: fix range tombstone handling on flushes in view_updating_consumer
2023-07-05 21:21:43 +02:00
Michał Chojnowski
f6203f2bd4 test: view_build_test: add range tombstones to test_view_update_generator_buffering
This patch adds a full-range tombstone to the compacted mutation.
This raises the coverage of the test. In particular, it reproduces
issue #14503, which should have been caught by this test, but wasn't.
2023-07-05 17:33:49 +02:00
Michał Chojnowski
aab10402ce test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
A random mutation test for view_updating_consumer's buffering logic.
Reproduces #14503.
2023-07-05 17:33:49 +02:00
Michał Chojnowski
ac29b6f198 view_updating_consumer: make buffer limit a variable
The limit doesn't change at runtime, but we this patch makes it variable for
unit testing purposes.
2023-07-05 17:33:47 +02:00
Raphael S. Carvalho
5d34db2532 test: Extend sstable partition skipping test to cover fast forward using token
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-07-05 11:38:58 -03:00
Pavel Emelyanov
e91f95a629 Merge 's3/test: restructure object_store test into a pytest based test suite' from Kefu Chai
in this series, test/object_storage is restructured into a pytest based test. this paves the road to a test suites covers more use cases. so we can some more lower-level tests for tiered/caching-store.

Closes #14165

* github.com:scylladb/scylladb:
  s3/test: do not return ip in managed_cluster()
  s3/test: verify the behavior with asserts
  s3/test: restructure object_store/run into a pytest
  s3/test: extract get_scylla_with_s3_cmd() out
  s3/test: s/restart_with_dir/kill_with_dir/
  s3/test: vendor run_with_dir() and friends
  s3/test: remove get_tempdir()
  s3/test: extract managed_cluster() out
2023-07-05 15:40:43 +03:00
Kefu Chai
9080f8842b s3/test: do not return ip in managed_cluster()
let's just use cluster.contact_points for retrieving the IP address
of the scylla node in this single-node cluster. so the name of
managed_cluster() is less weird.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-05 17:07:39 +08:00
Kefu Chai
ec6410653f s3/test: verify the behavior with asserts
instead of assigning to "success", let's use assert for this purpose.
simpler this way.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-05 17:07:21 +08:00
Kefu Chai
471d75c6c6 s3/test: restructure object_store/run into a pytest
instead of using a single run to perform the test, restructure
it into a pytest based test suite with a single test case.
this should allow us to add more tests exercising the object-storage
and cached/tierd storage in future.

* add fixtures so they can be reused by tests
* use tmpdir fixture for managing the tmpdir, see
  https://docs.pytest.org/en/6.2.x/tmpdir.html#the-tmpdir-fixture
* perform part of the teardown in the "test_tempdir()" fixture
* change the type of test from "Run" to "Python"
* rename "run" to "test_basic.py"
* optionally start the minio server if the settings are not
  found in command line or env variables, so that the tests are
  self-contained without the fixture setup by test.py.
* instead of sys.exit(), use assert statement, as this is
  what pytest uses.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-05 17:05:13 +08:00
Petr Gusev
b69bc97673 repair_test: add test_reader_with_different_strategies 2023-07-05 13:02:17 +04:00
Kefu Chai
bffaf84395 s3/test: extract get_scylla_with_s3_cmd() out
* define a dedicated S3_server class which duck types MinioServer.
  it will be used to represent S3 server in place of MinioServer if
  S3 is used for testing
* prepare object_storage.yaml in get_scylla_with_s3(), so it is more
  clear that we are using the same set of settings for launching
  scylla

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-05 16:49:04 +08:00
Kefu Chai
f74218f434 s3/test: s/restart_with_dir/kill_with_dir/
replace the restart_with_dir() with kill_with_dir(), so
that we can simplify the usage of managed_cluster() by enabling it
to start and stop the single-node cluster. with this change, the caller
does not need to run the scylla and pass its pid to this function
any more.

since the restart_with_dir() call is superseded by managed_cluster(),
which tears down the cluster, teardown() is now only responsible to
print out the log file.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-05 16:48:25 +08:00
Kefu Chai
a6bb5864ff s3/test: vendor run_with_dir() and friends
so we don't need to mess up with cql-pytest/run.py, which is
use by cql-pytest.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-05 16:48:04 +08:00
Kefu Chai
b45049c968 s3/test: remove get_tempdir()
to match with another call of managed_cluster(), so it's clear that
we are just reusing test_tempdir.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-05 16:45:14 +08:00
Kefu Chai
a5a87d81c6 s3/test: extract managed_cluster() out
for setting up the cluster and tearing down it.
this helps to indent the code so that it is visually explicit
the lifecycle of the cluster.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-05 16:45:14 +08:00
Kefu Chai
1faf50fc05 test/pylib: do not hardwire alias to "local"
define a variable for it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-05 15:58:41 +08:00
Kefu Chai
d55cfdc152 test/pylib: retry if minio_server is not ready
there is chance that minio_server is not ready to serve after
launching the server executable process. so we need to retry until
the first "mc" command is able to talk to it.

in this change, add method `mc()` is added to run minio client,
so we can retry the command before it timeouts. and it allows us to
ignore the failure or specify the timeout. this should ready the
minio server before tests start to connect to it.

Fixes #1719
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-05 15:57:59 +08:00
Michał Chojnowski
5ad0846bff view: fix range tombstone handling on flushes in view_updating_consumer
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of
mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to
convert the fragment stream into `mutation`s. This is done by piping
the stream to mutation_rebuilder_v2.

To keep memory usage limited, the stream for a single partition might
have to be split into multiple partial `mutation` objects.
view_update_consumer does that, but in improper way -- when the
split/flush happens inside an active range tombstone, the range
tombstone isn't closed properly. This is illegal, and triggers an
internal error.

This patch fixes the problem by closing the active range tombstone
(and reopening in the same position in the next `mutation` object).

The tombstone is closed just after the last seen clustered position.
This is not necessary for correctness -- for example we could delay
all processing of the range tombstone until we see its end
bound -- but it seems like the most natural semantic.

Fixes #14503
2023-07-04 20:33:21 +02:00
Marcin Maliszkiewicz
6424dd5ec4 alternator: close output_stream when exception is thrown during response streaming
When exception occurs and we omit closing output_stream then the whole process is brought down
by an assertion in ~output_stream.

Fixes https://github.com/scylladb/scylladb/issues/14453
Relates https://github.com/scylladb/scylladb/issues/14403

Closes #14454
2023-07-04 16:15:08 +03:00
Kefu Chai
c005b6dce0 test/pylib: chmod +x minio_server.py
add a shebang line. so we can just launch
a minio_server using

```console
test/pylib/minio_server.py --host 127.0.0.1
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-04 13:19:34 +08:00
Kefu Chai
2bae0b9aa8 test/pylib: allow run minio_server.py as a stand-alone tool
this would allow developer to run a minio server for testing, for
instance, s3_test, using something like:

```console
$ python3 test/pylib/minio_server.py --host 127.0.0.1
tempdir='/tmp/tmpfoobar-minio'
export S3_SERVER_ADDRESS_FOR_TEST=127.0.0.1
export S3_SERVER_PORT_FOR_TEST=900
export S3_PUBLIC_BUCKET_FOR_TEST=testbucket
```

and developer is supposed to copy-and-paste the `export` commands
to prepare the environmental variables for the test using the
minio server. the tempdir is used for the rundir of minio, and it
is also used for holding the log file of this tool. one might want
to check it when necessary.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-04 13:14:42 +08:00
Tomasz Grabiec
f2ed9fcd7e schema_mutations, migration_manager: Ignore empty partitions in per-table digest
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.

Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
schema_ptr of exact schema version request by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.

After ae8d2a550d, it is more liekly to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).

This change inroduces a cluster feature which when enabled will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.

A similar problem was fixed for per-node digest calculation in
18f484cc753d17d1e3658bcb5c73ed8f319d32e8. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.

Fixes #4485.
2023-07-03 23:06:55 +02:00
Nadav Har'El
ec77172b4b Merge 'cql3: convert the SELECT clause evaluation phase to expressions' from Avi Kivity
SELECT clause components (selectors) are currently evaluated during query execution
using a stateful class hierarchy. This state is needed to hold intermediate state while
aggregating over multiple rows. Because the selectors are stateful, we must re-create
them each query using a selector_factory hierarchy.

We'd like to convert all of this to the unified expression evaluation machinery, so we can
have just one grammar for expressions, and just one way to evaluate expressions, but
the statefulness makes this complex.

In commit 59ab9aac44 "(Merge 'functions: reframe aggregate functions in terms
of scalar functions' from Avi Kivity)", we made aggregate functions stateless, moving
their state to aggregate_function_selector::_accumulator, and therefore into the
class hierarchy we're addressing now. Another reason for keeping state is that selectors
that aren't aggregated capture the first value they see in a GROUP BY group.

Since expressions can't contain state directly, we break apart expressions that contain
aggregate functions into two: an inner expression that processes incoming rows within
a group, and an outer expression that generates the group's output. The two expressions
communicate via a newly introduced expression element: a temporary.

The problem of non-aggregated columns requiring state is solved by encapsulating
those columns in an internal aggregate function, called the "first" function.

In terms of performance, this series has little effect, since the common case of selectors
that only contain direct column references without transformations is evaluated via a fast
path (`simple_selection`). This fast-path is preserved with almost no changes.

While the series makes it possible to start to extend the grammar and unify expression
syntaxes, it does not do so. The grammar is unchanged. There is just one breaking change:
the `SELECT JSON` statement generates json object field names based on the input selectors.
In one case the name of the field has changed, but it is an esoteric case (where a function call
is selected as part of `SELECT JSON`), and the new behavior is compatible with Cassandra.

Closes #14467

* github.com:scylladb/scylladb:
  cql3: selection: drop selector_factories, selectables, and selectors
  cql3: select_statement: stop using selector_factories in SELECT JSON
  cql3: selection: don't create selector_factories any more
  cql3: selection: collect column_definitions using expressions
  cql3: selection: reimplement selection::is_aggregate()
  cql3: selection: evaluate aggregation queries via expr::evaluate()
  cql3: selection, select_statement: fine tune add_column_for_post_processing() usage
  cql3: selection: evaluate non-aggregating complex selections using expr::evaluate()
  cql3: selection: store primary key in result_set_builder
  cql3: expression: fix field_selection::type interpretation by evaluate()
  cql3: selection: make result_set_builder::current non-optional<>
  cql3: selection: simplify row/group processing
  cql3: selection: convert requires_thread to expressions
  cql: selection: convert used_functions() to expressions
  cql3: selection: convert is_reducible/get_reductions to expressions
  cql3: selection: convert is_count() to expressions
  cql3: selection convert contains_ttl/contains_writetime to work on expressions
  cql3: selection: make simple_selectors stateless
  cql3: expression: add helper to split expressions with aggregate functions
  cql3: selection: short-circuit non-aggregations
  cql3: selection: drop validate_selectors
  cql3: select_statement: force aggregation if GROUP BY is used
  cql3: select_statement: levellize aggregation depth
  cql3: selection: skip first_function when collecting metadata
  cql3: select_statement: explicitly disable automatic parallelization with no aggregates
  cql3: expression: introduce temporaries
  cql3: select_statement: use prepared selectors
  cql3: selection: avoid selector_factories in collect_metadata()
  cql3: expressions: add "metadata mode" formatter for expressions
  cql3: selection: convert collect_metadata() to the prepared expression domain
  cql3: selection: convert processes_selection to work on prepared expressions
  cql3: selection: prepare selectors earlier
  cql3: raw_selector: deinline
  cql3: expression: reimplement verify_no_aggregate_functions()
  cql3: expression: add helpers to manage an expression's aggregation depth
  cql3: expression: improve printing of prepared function calls
  cql3: functions: add "first" aggregate function
2023-07-03 23:21:33 +03:00
Avi Kivity
d9cf81f1a6 cql3: select_statement: stop using selector_factories in SELECT JSON
SELECT JSON uses selector_factories to obtain the names of the
fields to insert into the json object, and we want to drop
selector_factories entirely. Switch instead to the ":metadata" mode
of printing expressions, which does what we want.

Unfortunately, the switch changes how system functions are converted
into field names. A function such as unixtimestampof() is now rendered
as "system.unixtimestampof()"; before it did not have the keyspace
prefix.

This is a compatiblity problem, albeit an obscure one. Since the new
behavior matches Cassandra, and the odds of hitting this are very low,
I think we can allow the change.
2023-07-03 19:45:17 +03:00
Avi Kivity
0021f77e30 cql3: expression: fix field_selection::type interpretation by evaluate()
field_selection::type refers to the type of the selection operation,
not the type of the structure being selected. This is what
prepare_expression() generates and how all other expression elements
work, but evaluate() for field_selection thinks it's the type
of the structure, and so fails when it gets an expression
from prepare_expression().

Fix that, and adjust the tests.
2023-07-03 19:45:17 +03:00
Avi Kivity
b1b4a18ad8 cql3: expression: add helpers to manage an expression's aggregation depth
We define the "aggregation depth" of an expression by how many
nested aggregation functions are applied. In CQL/SQL, legal
values are 0 and 1, but for generality we deal with any aggregation depth.

The first helper measures the maximum aggregation depth along any path
in the expression graph. If it's 2 or greater, we have something like
max(max(x)) and we should reject it (though these helpers don't). If
we get 1 it's a simple aggregation. If it's zero then we're not aggregating
(though CQL may decide to aggregate anyway if GROUP BY is used).

The second helper edits an expression to make sure the aggregation depth
along any path that reaches a column is the same. Logically,
`SELECT x, max(y)` does not make sense, as one is a vector of values
and the other is a scalar. CQL resolves the problem by defining x as
"the first value seen". We apply this resolution by converting the
query to `SELECT first(x), max(y)` (where `first()` is an internal
aggregate function), so both selectors refer to scalars that consume
vectors.

When a scalar is consumed by an aggregate function (for example,
`SELECT max(x), min(17)` we don't have to bother, since a scalar
is implicity promoted to a vector by evaluating it every row. There
is some ambiguity if the scalar is a non-pure function (e.g.
`SELECT max(x), min(random())`, but it's not worth following.

A small unit test is added.
2023-07-03 19:45:16 +03:00
Alejo Sanchez
520bd90008 test/boost/memtable_test: split test plain/reverse
Split long running test
test_memtable_with_many_versions_conforms_to_mutation_source to 2 tests
for _plain and _reverse.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #14447
2023-07-03 15:20:12 +03:00
Michał Jadwiszczak
58eb7a45b7 cql-pytest:test_describe: add test for generic UDT/UDF/UDA desc 2023-06-30 14:50:08 +02:00
Piotr Dulikowski
ee9bfb583c combined: mergers: remove recursion in operator()()
In mutation_reader_merger and clustering_order_reader_merger, the
operator()() is responsible for producing mutation fragments that will
be merged and pushed to the combined reader's buffer. Sometimes, it
might have to advance existing readers, open new and / or close some
existing ones, which requires calling a helper method and then calling
operator()() recursively.

In some unlucky circumstances, a stack overflow can occur:

- Readers have to be opened incrementally,
- Most or all readers must not produce any fragments and need to report
  end of stream without preemption,
- There has to be enough readers opened within the lifetime of the
  combined reader (~500),
- All of the above needs to happen within a single task quota.

In order to prevent such a situation, the code of both reader merger
classes were modified not to perform recursion at all. Most of the code
of the operator()() was moved to maybe_produce_batch which does not
recur if it is not possible for it to produce a fragment, instead it
returns std::nullopt and operator()() calls this method in a loop via
seastar::repeat_until_value.

A regression test is added.

Fixes: scylladb/scylladb#14415

Closes #14452
2023-06-30 12:07:13 +03:00
Kamil Braun
ff386e7a44 service: raft: force initial snapshot transfer in new cluster
When we upgrade a cluster to use Raft, or perform manual Raft recovery
procedure (which also creates a fresh group 0 cluster, using the same
algorithm as during upgrade), we start with a non-empty group 0 state
machine; in particular, the schema tables are non-empty.

In this case we need to ensure that nodes which join group 0 receive the
group 0 state. Right now this is not the case. In previous releases,
where group 0 consisted only of schema, and schema pulls were also done
outside Raft, those nodes received schema through this outside
mechanism. In 91f609d065 we disabled
schema pulls outside Raft; we're also extending group 0 with other
things, like topology-specific state.

To solve this, we force snapshot transfers by setting the initial
snapshot index on the first group 0 server to `1` instead of `0`. During
replication, Raft will see that the joining servers are behind,
triggering snapshot transfer and forcing them to pull group 0 state.

It's unnecessary to do this for cluster which bootstraps with Raft
enabled right away but it also doesn't hurt, so we keep the logic simple
and don't introduce branches based on that.

Extend Raft upgrade tests with a node bootstrap step at the end to
prevent regressions (without this patch, the step would hang - node
would never join, waiting for schema).

Fixes: #14066

Closes #14336
2023-06-29 22:46:42 +02:00
Konstantin Osipov
3d81408a58 test.py: make experimental: raft the default for all tests
Make sure all tests use the new centralized topology
coordinator. This is a step forward towards maturing the
coordinator implementation.

Closes #14039
2023-06-29 14:44:00 +02:00
Botond Dénes
2a58b4a39a Merge 'Compaction resharding tasks' from Aleksandra Martyniuk
Task manager's tasks covering resharding compaction
on table and shard level.

Closes #14044

* github.com:scylladb/scylladb:
  test: extend test_compaction_task.py to test resharding compaction
  compaction: add shard_reshard_sstables_compaction_task_impl
  compaction: invoke resharding on sharded database
  compaction: move run_resharding_jobs into reshard_sstables_compaction_task_impl::run()
  replica: delete unused functions and struct
  compaction: add reshard_sstables_compaction_task_impl
  compaction: replica: copy struct and functions from distributed_loader.cc
  compaction: create resharding_compaction_task_impl
2023-06-29 12:10:54 +03:00
Nadav Har'El
dd63169077 Merge 'test/boost/index_with_paging_test: reduce running time' from Alecco
Reduce test string value size, parallelize inserts, and use a prepared statement,

The debug running time for this tests is reduced from 13:18 to 7:52.

Refs #13905

Closes #14380

* github.com:scylladb/scylladb:
  test/boost/index_with_paging_test: parallel insert
  test/boost/index_with_paging_test: prepared statement
  test/boost/index_with_paging_test: reduce running time
2023-06-29 10:45:01 +03:00
Avi Kivity
f6f974cdeb cql3: selection: fix GROUP BY, empty groups, and aggregations
A GROUP BY combined with aggregation should produce a single
row per group, except for empty groups. This is in contrast
to an aggregation without GROUP BY, which produces a single
row no matter what.

The existing code only considered the case of no grouping
and forced a row into the result, but this caused an unwanted
row if grouping was used.

Fix by refining the check to also consider GROUP BY.

XFAIL tests are relaxed.

Fixes #12477.

Note, forward_service requires that aggregation produce
exactly one row, but since it can't work with grouping,
it isn't affected.

Closes #14399
2023-06-28 18:56:22 +03:00
Kamil Braun
b912eeade5 Merge 'merge raft commands to group0 before applying them whenever possible' from Gleb
Since most group0 commands are just mutations it is easy to combine them
before passing them to a subsystem they destined to since it is more
efficient. The logic that handles those mutations in a subsystem will
run once for each batch of commands instead of for each individual
command. This is especially useful when a node catches up to a leader and
gets a lot of commands together.

The patch here does exactly that. It combines commands into a single
command if possible, but it preserves an order between commands, so each
time it encounters a command to a different subsystem it flushes already
combined batch and starts a new one. This extra safety assumes that
there are dependencies between subsystems managed by group0, so the order
matters. It may be not the case now, but we prefer to be on a safe side.

Broadcast table commands are not mutations, so they are never combined.

* 'raft-merge-cmds' of https://github.com/gleb-cloudius/scylla:
  test: add test for group0 raft command merging
  service: raft: respect max mutation size limit when persisting raft entries
  group0_state_machine: merge commands before applying them whenever possible
2023-06-28 17:21:07 +02:00
Alejo Sanchez
d4697ed21e test/boost/index_with_paging_test: parallel insert
Parallelize inserts for long-running test_index_with_paging.

Run time in debug mode reduced by 1 minute 48 seconds.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-06-28 16:11:58 +02:00
Alejo Sanchez
70a3179888 test/boost/index_with_paging_test: prepared statement
Prepare statement for insert.

Run time in debug mode reduced by 9 seconds.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-06-28 14:49:21 +02:00
Michał Jadwiszczak
0a8fcead08 cql3: Specify arguments types in UDA creation errors
Display not only function name but also expected arguments
if `state_function` or `final_function` was not found.

Fixes: #12088

Closes #14278
2023-06-28 15:27:49 +03:00
Alejo Sanchez
48d24269f1 test/boost/index_with_paging_test: reduce running time
Reduce test string value size for test_index_with_paging from 4096 to
100. With 100 bytes it should make the base row significantly larger
than the key so the test will exercise both types of paging in the
scanning code.

The debug running time for this tests is reduced from 9 minutes to 6
minutes.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-06-28 13:55:52 +02:00
Nadav Har'El
49c8c06b1b Merge 'cql: fix crash on empty clustering range in LWT' from Jan Ciołek
LWT queries with empty clustering range used to cause a crash.
For example in:
```cql
UPDATE tab SET r = 9000 WHERE p = 1  AND c = 2 AND c = 2000 IF r = 3
```
The range of `c` is empty - there are no valid values.

This caused a segfault when accessing the `first` range:
```c++
op.ranges.front()
```

Cassandra rejects such queries at the preparation stage. It doesn't allow two `EQ` restriction on the same clustering column when an IF is involved.
We reject them during runtime, which is a worse solution. The user can prepare a query with `c = ? AND c = ?`, and then run it, but unexpectedly it will throw an `invalid_request_exception` when the two bound variables are different.

We could ban such queries as well, we already ban the usage of `IN` in conditional statements. The problem is that this would be a breaking change.

A better solution would be to allow empty ranges in `LWT` statements. When an empty range is detected we just wouldn't apply the change. This would be a larger change, for now let's just fix the crash.

Fixes: https://github.com/scylladb/scylladb/issues/13129

Closes #14429

* github.com:scylladb/scylladb:
  modification_statement: reject conditional statements with empty clustering key
  statements/cas_request: fix crash on empty clustering range in LWT
2023-06-28 14:43:54 +03:00
Aleksandra Martyniuk
bf3e0744c1 test: extend test_compaction_task.py to test resharding compaction 2023-06-28 11:43:12 +02:00
Jan Ciolek
ccdb26bf9e statements/cas_request: fix crash on empty clustering range in LWT
LWT queries with empty clustering range used to cause a crash.
For example in:
```cql
UPDATE tab SET r = 9000 WHERE p = 1  AND c = 2 AND c = 2000 IF r = 3
```
The range of `c` is empty - there are no valid values.

This caused a segfault when accessing the `first` range:
```c++
op.ranges.front()
```

To fix it let's throw en exception when the clustering range
is empty. Cassandra also rejects queries with `c = 1 AND c = 2`.

There's also a check for empty partition range, as it used
to crash in the past, can't really hurt to add it.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-28 10:18:06 +02:00
Kamil Braun
96bc78905d readers: evictable_reader: don't accidentally consume the entire partition
The evictable reader must ensure that each buffer fill makes forward
progress, i.e. the last fragment in the buffer has a position larger
than the last fragment from the previous buffer-fill. Otherwise, the
reader could get stuck in an infinite loop between buffer fills, if the
reader is evicted in-between.

The code guranteeing this forward progress had a bug: the comparison
between the position after the last buffer-fill and the current
last fragment position was done in the wrong direction.

So if the condition that we wanted to achieve was already true, we would
continue filling the buffer until partition end which may lead to OOMs
such as in #13491.

There was already a fix in this area to handle `partition_start`
fragments correctly - #13563 - but it missed that the position
comparison was done in the wrong order.

Fix the comparison and adjust one of the tests (added in #13563) to
detect this case.

Fixes #13491
2023-06-27 14:37:29 +02:00
Kamil Braun
5800ce8ddd test: flat_mutation_reader_assertions: squash r_t_cs with the same position
test_range_tombstones_v2 is too strict for this reader -- it expects a
particular sequence of `range_tombstone_change`s, but
multishard_combining_reader, when tested with a small buffer, may
generate -- as expected -- additional (redundant) range tombstone change
pairs (end+start).

Currently we don't observe these redundant fragments due to a bug in
`evictable_reader_v2` but they start appearing once we fix the bug and
the test must be prepared first.

To prepare the test, modify `flat_reader_assertions_v2` so it squashes
redundant range tombstone change pairs. This happens only in non-exact
mode.

Enable exact mode in `test_sstable_reversing_reader_random_schema` for
comparing two readers -- the squashing of `r_t_c`s may introduce an
artificial difference.
2023-06-27 14:37:25 +02:00
Gleb Natapov
945f476363 test: add test for group0 raft command merging
Add a test that submits 3 large commands each one a little bit larger
than 1/3 of maximum mutation size. Check that in the end 2 command were
executed (first 2 were merged and third was executed separately).
2023-06-27 14:59:55 +03:00
Botond Dénes
f5e3b8df6d Merge 'Optimize creation of reader excluding staging for view building' from Raphael "Raph" Carvalho
View building from staging creates a reader from scratch (memtable
\+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.

perf shows that the reader creation is very expensive:
```
+   12.15%    10.75%  reactor-3        scylla             [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+   10.01%     9.99%  reactor-3        scylla             [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    8.95%     8.94%  reactor-3        scylla             [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+    7.29%     7.28%  reactor-3        scylla             [.] dht::ring_position_tri_compare
+    6.28%     6.27%  reactor-3        scylla             [.] dht::tri_compare
+    4.11%     3.52%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+    4.09%     4.07%  reactor-3        scylla             [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+    3.46%     0.93%  reactor-3        scylla             [.] sstables::sstable_run::will_introduce_overlapping
+    2.53%     2.53%  reactor-3        libstdc++.so.6     [.] std::_Rb_tree_increment
+    2.45%     2.45%  reactor-3        scylla             [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.14%     2.13%  reactor-3        scylla             [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.07%     2.07%  reactor-3        scylla             [.] logalloc::region_impl::free
+    2.06%     1.91%  reactor-3        scylla             [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()https://github.com/scylladb/scylladb/issues/1}::operator()() const::{lambda()https://github.com/scylladb/scylladb/issues/1}::operator()
+    2.04%     2.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+    1.87%     0.00%  reactor-3        [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+    1.86%     0.00%  reactor-3        [kernel.kallsyms]  [k] do_syscall_64
+    1.39%     1.38%  reactor-3        libc.so.6          [.] __memcmp_avx2_movbe
+    1.37%     0.92%  reactor-3        scylla             [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+    1.34%     1.33%  reactor-3        scylla             [.] logalloc::region_impl::alloc_small
+    1.33%     1.33%  reactor-3        scylla             [.] seastar::memory::small_pool::add_more_objects
+    1.30%     0.35%  reactor-3        scylla             [.] seastar::reactor::do_run
+    1.29%     1.29%  reactor-3        scylla             [.] seastar::memory::allocate
+    1.19%     0.05%  reactor-3        libc.so.6          [.] syscall
+    1.16%     1.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    1.07%     0.79%  reactor-3        scylla             [.] sstables::partitioned_sstable_set::insert

```
That shows some significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).

The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.

This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.

This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.

With this improvement, view building was measured to be 3x faster.

from
`INFO  2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s`

to
`INFO  2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s`

Refs https://github.com/scylladb/scylladb/issues/14089.
Fixes scylladb/scylladb#14244.

Closes #14364

* github.com:scylladb/scylladb:
  table: Optimize creation of reader excluding staging for view building
  view_update_generator: Dump throughput and duration for view update from staging
  utils: Extract pretty printers into a header
2023-06-27 07:25:30 +03:00
Raphael S. Carvalho
1d8cb32a5d table: Optimize creation of reader excluding staging for view building
View building from staging creates a reader from scratch (memtable
+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.

perf shows that the reader creation is very expensive:
+   12.15%    10.75%  reactor-3        scylla             [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+   10.01%     9.99%  reactor-3        scylla             [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    8.95%     8.94%  reactor-3        scylla             [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+    7.29%     7.28%  reactor-3        scylla             [.] dht::ring_position_tri_compare
+    6.28%     6.27%  reactor-3        scylla             [.] dht::tri_compare
+    4.11%     3.52%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+    4.09%     4.07%  reactor-3        scylla             [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+    3.46%     0.93%  reactor-3        scylla             [.] sstables::sstable_run::will_introduce_overlapping
+    2.53%     2.53%  reactor-3        libstdc++.so.6     [.] std::_Rb_tree_increment
+    2.45%     2.45%  reactor-3        scylla             [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.14%     2.13%  reactor-3        scylla             [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.07%     2.07%  reactor-3        scylla             [.] logalloc::region_impl::free
+    2.06%     1.91%  reactor-3        scylla             [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()
+    2.04%     2.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+    1.87%     0.00%  reactor-3        [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+    1.86%     0.00%  reactor-3        [kernel.kallsyms]  [k] do_syscall_64
+    1.39%     1.38%  reactor-3        libc.so.6          [.] __memcmp_avx2_movbe
+    1.37%     0.92%  reactor-3        scylla             [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+    1.34%     1.33%  reactor-3        scylla             [.] logalloc::region_impl::alloc_small
+    1.33%     1.33%  reactor-3        scylla             [.] seastar::memory::small_pool::add_more_objects
+    1.30%     0.35%  reactor-3        scylla             [.] seastar::reactor::do_run
+    1.29%     1.29%  reactor-3        scylla             [.] seastar::memory::allocate
+    1.19%     0.05%  reactor-3        libc.so.6          [.] syscall
+    1.16%     1.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    1.07%     0.79%  reactor-3        scylla             [.] sstables::partitioned_sstable_set::insert

That shows some significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).

The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.

This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.

This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.

With this improvement, view building was measured to be 3x faster.

from
INFO  2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s

to
INFO  2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s

Refs #14089.
Fixes #14244.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-06-26 22:30:39 -03:00