Commit Graph

2731 Commits

Author SHA1 Message Date
Tomasz Grabiec
4e9d95d78c Merge 'Compact data before streaming' from Botond Dénes
Currently, streaming and repair process and send data as-is. This is wasteful: streaming might be sending data which is expired or covered by tombstones, taking up valuable bandwidth and processing time. Repair additionally could be exposed to artificial differences, due to different nodes being in different states of compactness.
This PR adds opt-in compaction to `make_streaming_reader()`, then opts in all users. The main difference is in how these choose the current compaction time to use:
* Load'n'stream and streaming use the current time on the local node.
* Repair uses a centrally chosen compaction time, generated on the repair master and propagated to all repair followers. This is to ensure all repair participants work with the exact same state of compactness.

 Importantly, this compaction does *not* purge tombstones (tombstone GC is disabled completely).
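
To illustrate why repair wants a single compaction time, here is a minimal toy model. It is not the code from this series: `Cell`, `compact()` and the integer timestamps are hypothetical names made up for the sketch, and the real compactor works on mutation fragments. The model drops cells that are expired or covered by a tombstone and always keeps the tombstone, mirroring the "no tombstone GC" rule.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Cell:
    name: str
    write_ts: int                 # write timestamp of the cell
    expiry: Optional[int] = None  # absolute expiry time, None means no TTL


def compact(cells: List[Cell], tombstone_ts: Optional[int],
            compaction_time: int) -> Tuple[List[Cell], Optional[int]]:
    """Drop cells covered by the tombstone or expired at compaction_time.

    The tombstone itself is always kept (no tombstone GC in this model).
    """
    live = [
        c for c in cells
        if (tombstone_ts is None or c.write_ts > tombstone_ts)
        and (c.expiry is None or c.expiry > compaction_time)
    ]
    return live, tombstone_ts


cells = [Cell("a", write_ts=1), Cell("b", write_ts=10, expiry=100), Cell("c", write_ts=12)]

# Each node uses its own clock: two views of identical data differ.
node1 = compact(cells, tombstone_ts=5, compaction_time=99)
node2 = compact(cells, tombstone_ts=5, compaction_time=101)

# One compaction time chosen on the repair master: both views agree.
master_time = 101
follower1 = compact(cells, tombstone_ts=5, compaction_time=master_time)
follower2 = compact(cells, tombstone_ts=5, compaction_time=master_time)
```

In this model `node1` still contains cell `b` while `node2` does not, which is the kind of artificial difference a shared compaction time avoids.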

Fixes: https://github.com/scylladb/scylladb/issues/3561

Closes #14756

* github.com:scylladb/scylladb:
  replica: make_[multishard_]streaming_reader(): make compaction_time mandatory
  repair/row_level: opt in to compacting the stream
  streaming: opt-in to compacting the stream
  sstables_loader: opt-in for compacting the stream
  replica/table: add optional compacting to make_multishard_streaming_reader()
  replica/table: add optional compacting to make_streaming_reader()
  db/config: add config item for enabling compaction for streaming and repair
  repair: log the error which caused the repair to fail
  readers: compacting_reader: use compact_mutation_state::abandon_current_partition()
  mutation/mutation_compactor: allow user to abandon current partition
2023-07-28 16:42:13 +02:00
Kefu Chai
cc2bbde8f1 test: use BOOST_CHECK_EQUAL when appropriate in compaction_manager_basic_test
compaction_manager_basic_test checks the stats of compaction_manager to
verify that there are no ongoing or pending compactions after triggering
the compaction and waiting for its completion. but in #14865, there are
still active compaction(s) after the compaction_manager's stats show there
is at least one task completed.

to understand this issue better, let's use `BOOST_CHECK_EQUAL()` instead
of `BOOST_REQUIRE()`, so that the test does not error out when the check
fails, and we can have better understanding of the status when the test
fails.

Refs #14865
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14872
2023-07-28 15:45:07 +03:00
Avi Kivity
cf81eef370 Merge 'schema_mutations, migration_manager: Ignore empty partitions in per-table digest' from Tomasz Grabiec
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.

Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
schema_ptr of exact schema version requested by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.

After ae8d2a550d (5.2.0), it is more likely to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).

This change introduces a cluster feature which when enabled will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.
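
The effect of ignoring empty partitions can be sketched with a toy digest. This is only an illustration under assumptions: `table_digest()` and the byte-string partition contents are hypothetical, and SHA-256 is an arbitrary stand-in for whatever hash the schema code actually feeds.

```python
import hashlib


def table_digest(partitions, ignore_empty):
    """Hash (partition_key, rows) pairs as they look after compaction.

    With ignore_empty=False the key of an empty partition is still fed,
    so the digest depends on whether that partition was returned at all.
    """
    h = hashlib.sha256()
    for key, rows in partitions:
        if ignore_empty and not rows:
            continue  # expiry-insensitive mode: skip empty partitions
        h.update(key)
        for row in rows:
            h.update(row)
    return h.hexdigest()


# Before tombstone expiry: the query returns a partition that compacts to empty.
before_expiry = [(b"tables", [b"row1"]), (b"computed_columns", [])]
# After expiry and compaction: the empty partition is gone from the result.
after_expiry = [(b"tables", [b"row1"])]

old_changes = table_digest(before_expiry, False) != table_digest(after_expiry, False)
new_stable = table_digest(before_expiry, True) == table_digest(after_expiry, True)
```

Here `old_changes` and `new_stable` both come out true: the old calculation changes once the empty partition disappears, the expiry-insensitive one does not.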

A similar problem was fixed for per-node digest calculation in
c2ba94dc39e4add9db213751295fb17b95e6b962. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.

Fixes #4485.

Manually tested using ccm on cluster upgrade scenarios and node restarts.

Closes #14441

* github.com:scylladb/scylladb:
  test: schema_change_test: Verify digests also with TABLE_DIGEST_INSENSITIVE_TO_EXPIRY enabled
  schema_mutations, migration_manager: Ignore empty partitions in per-table digest
  migration_manager, schema_tables: Implement migration_manager::reload_schema()
  schema_tables: Avoid crashing when table selector has only one kind of tables
2023-07-28 00:01:33 +03:00
Alexey Novikov
ff721ec3e3 make timestamp string format cassandra compatible
when we convert timestamp into string it must look like: '2017-12-27T11:57:42.500Z'
it concerns any conversion except JSON timestamp format
JSON string has space as time separator and must look like: '2017-12-27 11:57:42.500Z'
both formats always contain milliseconds and timezone specification
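
A small sketch of the two formats described above, written in Python rather than taken from the patch; `format_timestamp()` and its `json_format` flag are hypothetical names used only for this illustration.

```python
from datetime import datetime, timezone


def format_timestamp(dt: datetime, json_format: bool = False) -> str:
    """Render a timezone-aware datetime with milliseconds and a 'Z' suffix.

    The plain format uses 'T' between date and time, the JSON format a space.
    """
    dt = dt.astimezone(timezone.utc)
    sep = " " if json_format else "T"
    millis = dt.microsecond // 1000
    return f"{dt:%Y-%m-%d}{sep}{dt:%H:%M:%S}.{millis:03d}Z"


ts = datetime(2017, 12, 27, 11, 57, 42, 500000, tzinfo=timezone.utc)
print(format_timestamp(ts))                    # 2017-12-27T11:57:42.500Z
print(format_timestamp(ts, json_format=True))  # 2017-12-27 11:57:42.500Z
```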

Fixes #14518
Fixes #7997

Closes #14726
2023-07-27 12:01:09 +03:00
Botond Dénes
fdaf908967 repair/row_level: opt in to compacting the stream
Using a centrally generated compaction-time, generated on the repair
master and propagated to all repair followers. For repair it is
imperative that all participants use the exact same compaction time,
otherwise there can be artificial differences between participants,
generating unnecessary repair activity.
If a repair follower doesn't get a compaction-time from the repair
master, it uses a locally generated one. This is no worse than the
previous state of each node being on some undefined state of compaction.
2023-07-27 04:57:50 -04:00
Botond Dénes
2f8d77e97b replica/table: add optional compacting to make_multishard_streaming_reader()
Doing to make_multishard_streaming_reader() what the previous commit did
to make_streaming_reader(). In fact, the new compaction_time parameter
is simply forwarded to the make_streaming_reader() on the shard readers.

Call sites are updated, but none opt in just yet.
2023-07-27 03:22:11 -04:00
Raphael S. Carvalho
050ce9ef1d cached_file: Evict unused pages that aren't linked to LRU yet
It was found that cached_file dtor can hit the following assert
after OOM

cached_file_test: utils/cached_file.hh:379: cached_file::~cached_file(): Assertion `_cache.empty()' failed.

cached_file's dtor iterates through all entries and evicts those
that are linked to LRU, under the assumption that all unused
entries were linked to LRU.

That's partially correct. get_page_ptr() may fetch more than 1
page due to read ahead, but it will only call cached_page::share()
on the first page, the one that will be consumed now.

share() is responsible for automatically placing the page into
LRU once refcount drops to zero.

If the read is aborted midway, before cached_file has a chance
to hit the 2nd page (read ahead) in cache, it will remain there
with refcount 0 and unlinked to LRU, in hope that a subsequent
read will bring it out of that state.

Our main user of cached_file is per-sstable index caching.
If the scenario above happens, and the sstable and its associated
cached_file is destroyed, before the 2nd page is hit, cached_file
will not be able to clear all the cache because some of the
pages are unused and not linked.

A page read ahead will now be linked into LRU so it doesn't sit in
memory indefinitely. This also allows the cached_file dtor to
clear the whole cache even if some of the pages brought in advance
aren't fetched later.

A reproducer was added.
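
The scenario can be modeled with a few lines of Python. This is a simplified sketch, not the cached_file implementation: the `Page` and `CachedFile` classes, the `link_read_ahead_pages` switch and `close()` (standing in for the destructor) are all hypothetical.

```python
from collections import OrderedDict


class Page:
    def __init__(self, idx):
        self.idx = idx
        self.refcount = 0


class CachedFile:
    def __init__(self, link_read_ahead_pages):
        self.cache = {}           # page index -> Page
        self.lru = OrderedDict()  # unused pages that may be evicted
        self.link_read_ahead_pages = link_read_ahead_pages

    def get_page(self, idx, read_ahead=1):
        """Populate idx plus read-ahead pages, but share only page idx."""
        for i in range(idx, idx + 1 + read_ahead):
            if i not in self.cache:
                self.cache[i] = Page(i)
                if i != idx and self.link_read_ahead_pages:
                    self.lru[i] = self.cache[i]  # the fix: link it right away
        page = self.cache[idx]
        self.lru.pop(idx, None)  # a page in use is not evictable
        page.refcount += 1
        return page

    def release(self, page):
        page.refcount -= 1
        if page.refcount == 0:
            self.lru[page.idx] = page

    def close(self):
        """Evict every LRU-linked page and return the indexes left behind."""
        for idx in list(self.lru):
            del self.lru[idx]
            del self.cache[idx]
        return sorted(self.cache)


def aborted_read(cached_file):
    page = cached_file.get_page(0)  # page 1 is brought in by read ahead
    cached_file.release(page)       # the read stops before touching page 1
    return cached_file.close()


leftover_before_fix = aborted_read(CachedFile(link_read_ahead_pages=False))
leftover_after_fix = aborted_read(CachedFile(link_read_ahead_pages=True))
```

Without the linking, page 1 is left in the cache at close time, which corresponds to the failed assertion above; with it, the cache ends up empty.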

Fixes #14814.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14818
2023-07-27 00:01:46 +02:00
Nadav Har'El
056d04954c Merge 'view_updating_consumer: account empty partitions memory usage' from Botond Dénes
The view updating consumer uses `_buffer_size` to decide when to flush the accumulated mutations, passing them to the actual view building code. This `_buffer_size` is incremented every time a mutation fragment is consumed. This is not exact, as e.g. range tombstones are represented differently in the mutation object than in the fragment, but it is good enough. There is one flaw however: `_buffer_size` is not incremented when consuming a partition-start fragment. This is when the mutation object is created in the mutation rebuilder. This is not a big problem when partitions have many rows, but if the partitions are tiny, the error in accounting quickly becomes significant. If the partitions are empty, `_buffer_size` is not bumped at all, and any number of these can accumulate in the buffer. We have recently seen this causing stalls and OOM as the buffer grew to an immense size while containing only empty and tiny partitions.
This PR fixes this by accounting the size of the freshly created `mutation` object in `_buffer_size`, after the partition-start fragment is consumed.
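
A toy model of the accounting problem follows. It is a sketch under assumptions: the `ViewUpdatingConsumer` class below only shares its name with the real one, and `PARTITION_OVERHEAD` is an invented constant standing in for the memory usage of a freshly created `mutation` object.

```python
class ViewUpdatingConsumer:
    PARTITION_OVERHEAD = 64  # assumed size of an empty mutation object

    def __init__(self, buffer_limit, account_partition_start):
        self.buffer_limit = buffer_limit
        self.account_partition_start = account_partition_start
        self.buffered_partitions = 0
        self.buffer_size = 0
        self.flushes = 0

    def consume_partition_start(self):
        self.buffered_partitions += 1
        if self.account_partition_start:
            # The fix: count the new mutation object itself.
            self.buffer_size += self.PARTITION_OVERHEAD
        self._maybe_flush()

    def consume_fragment(self, size):
        self.buffer_size += size
        self._maybe_flush()

    def _maybe_flush(self):
        if self.buffer_size >= self.buffer_limit:
            self.buffered_partitions = 0
            self.buffer_size = 0
            self.flushes += 1


old = ViewUpdatingConsumer(buffer_limit=1024, account_partition_start=False)
new = ViewUpdatingConsumer(buffer_limit=1024, account_partition_start=True)
for _ in range(10_000):  # a stream of empty partitions
    old.consume_partition_start()
    new.consume_partition_start()
```

In the model, `old` never flushes and keeps all 10,000 partitions buffered, while `new` flushes every 16 partitions.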

Fixes: #14819

Closes #14821

* github.com:scylladb/scylladb:
  test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations
  db/view/view_updating_consumer: account for the size of mutations
  mutation/mutation_rebuilder*: return const mutation& from consume_new_partition()
  mutation/mutation: add memory_usage()
2023-07-26 20:04:28 +03:00
Avi Kivity
ff1f461a42 Merge 'Introduce tablet load balancer' from Tomasz Grabiec
After this series, tablet replication can handle the scenario of bootstrapping new nodes. The ownership is distributed indirectly by the means of a load-balancer which moves tablets around in the background. See docs/dev/topology-over-raft.md for details.

The implementation is by no means meant to be perfect, especially in terms of performance, and will be improved incrementally.

The load balancer will be also kicked by schema changes, so that allocation/deallocation done during table creation/drop will be rebalanced.

Tablet data is streamed using existing `range_streamer`, which is the infrastructure for "the old streaming". This will be later replaced by sstable transfer once integration of tablets with compaction groups is finished. Also, cleanup is not wired yet, also blocked by compaction group integration.

Closes #14601

* github.com:scylladb/scylladb:
  tests: test_tablets: Add test for bootstraping a node
  storage_service: topology_coordinator: Implement tablet migration state machine
  tablets: Introduce tablet_mutation_builder
  service: tablet_allocator: Introduce tablet load balancer
  tablets: Introduce tablet_map::for_each_tablet()
  topology: Introduce get_node()
  token_metadata: Add non-const getter of tablet_metadata
  storage_service: Notify topology state machine after applying schema change
  storage_service: Implement stream_tablet RPC
  tablets: Introduce global_tablet_id
  stream_transfer_task, multishard_writer: Work with table sharder
  tablets: Turn tablet_id into a struct
  db: Do not create per-keyspace erm for tablet-based tables
  tablets: effective_replication_map: Take transition stage into account when computing replicas
  tablets: Store "stage" in transition info
  doc: Document tablet migration state machine and load balancer
  locator: erm: Make get_endpoints_for_reading() always return read replicas
  storage_service: topology_coordinator: Sleep on failure between retries
  storage_service: topology_coordinator: Simplify coordinator loop
  main: Require experimental raft to enable tablets
2023-07-26 12:30:29 +03:00
Botond Dénes
d0f725c1b9 test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations
A test reproducing #14819, that is, the view update builder not flushing
the buffer when only empty partitions are consumed (with only a
tombstone in them).
2023-07-26 03:09:53 -04:00
Botond Dénes
ad2ddffb22 Merge 'Remove qctx from system_keyspace::save_truncation_record()' from Pavel Emelyanov
The method is called by db::truncate_table_on_all_shards(), its call-chain, in turn, starts from

- proxy::remote::handle_truncate()
- schema_tables::merge_schema()
- legacy_schema_migrator
- tests

All of the above are easy to get system_keyspace reference from. This, in turn, allows making the method non-static and using the query_processor reference from the system_keyspace object instead of the global qctx.

Closes #14778

* github.com:scylladb/scylladb:
  system_keyspace: Make save_truncation_record() non-static
  code: Pass sharded<db::system_keyspace>& to database::truncate()
  db: Add sharded<system_keyspace>& to legacy_schema_migrator
2023-07-26 08:48:49 +03:00
Tomasz Grabiec
5c681a1d63 tablets: Introduce tablet_mutation_builder 2023-07-25 21:08:51 +02:00
Tomasz Grabiec
6f4a35f9ae service: tablet_allocator: Introduce tablet load balancer
Will be invoked by the topology coordinator later to decide
which tablets to migrate.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
f88220aeee stream_transfer_task, multishard_writer: Work with table sharder
So that we can use it on tablet-based tables.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
8cf92d4c86 tablets: Turn tablet_id into a struct
The IDL compiler cannot deal with enum classes like this.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
dc2ec3f81c tablets: Store "stage" in transition info
It's needed to implement tablet migration. It stores the current step
of tablet migration state machine. The state machine will be advanced
by the topology change coordinator.

See the "Tablet migration" section of topology-over-raft.md
2023-07-25 21:08:02 +02:00
Tomasz Grabiec
7851694eaa locator: erm: Make get_endpoints_for_reading() always return read replicas
Just a simplification.

Drop the test case from token_metadata which creates pending endpoints
without normal tokens. It fails after this change with exception:
"sorted_tokens is empty in first_token_index!" thrown from
token_metadata::first_token_index(), which is used when calculating
normal endpoints. This test case is not valid, first node inserts
its tokens as normal without going through bootstrap procedure.
2023-07-25 21:08:01 +02:00
Botond Dénes
3eec990e4e Merge 'test: use different table names in simple_backlog_controller_test ' from Kefu Chai
in this series, we use different table names in simple_backlog_controller_test. this test exercises sstable compaction strategies, and it creates and keeps multiple tables in a single test session. but we are going to add metrics on a per-table basis, and will use the table's ks and cf as the counter's labels. as the metrics subsystem does not allow multiple counters to share the same label, the test will fail when the metrics are added.

to address this problem, in this change

1. a new ctor is added for `simple_schema`, so we can create `simple_schema` with different names
2. use the new ctor in simple_backlog_controller_test

Fixes #14767

Closes #14783

* github.com:scylladb/scylladb:
  test: use different table names in simple_backlog_controller_test
  test/lib/simple_schema: add ctor for customizing ks.cf
  test/lib/simple_schema: do not hardwire ks.cf
2023-07-25 10:26:33 +03:00
Botond Dénes
a8feb7428d Merge 'semaphore mismatch: don't throw an error if both semaphores belong to user' from Michał Jadwiszczak
If semaphore mismatch occurs, check whether both semaphores belong
to user. If so, log a warning, log a `querier_cache_scheduling_group_mismatches` stat and drop cached reader instead of throwing an error.

Until now, semaphore mismatch was only checked in multi-partition queries. The PR pushes the check to `querier_cache` and performs it in all `lookup_*_querier` methods.

The mismatch can happen if user's scheduling group changed during
a query. We don't want to throw an error then, but drop and reset
cached reader.

This patch doesn't solve the problem of mismatched semaphores caused by changes in service levels/scheduling groups; it only mitigates it.
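
The lookup behavior described above can be sketched as follows. This is a simplified Python model, not the `querier_cache` code: the class layout, the string semaphore names and the `lookup()` signature are assumptions; only the stat name is taken from the description.

```python
import logging

log = logging.getLogger("querier_cache")


class QuerierCache:
    def __init__(self, user_semaphores):
        self.user_semaphores = set(user_semaphores)
        self.entries = {}  # key -> (querier, semaphore name)
        self.querier_cache_scheduling_group_mismatches = 0

    def insert(self, key, querier, semaphore):
        self.entries[key] = (querier, semaphore)

    def lookup(self, key, current_semaphore):
        entry = self.entries.pop(key, None)
        if entry is None:
            return None
        querier, cached_semaphore = entry
        if cached_semaphore == current_semaphore:
            return querier
        both_user = (cached_semaphore in self.user_semaphores
                     and current_semaphore in self.user_semaphores)
        if both_user:
            # Scheduling group changed mid-query: warn, count, drop the reader.
            log.warning("semaphore mismatch: %s vs %s, dropping cached reader",
                        cached_semaphore, current_semaphore)
            self.querier_cache_scheduling_group_mismatches += 1
            return None  # caller falls back to creating a fresh reader
        raise RuntimeError("semaphore mismatch involving a non-user semaphore")


cache = QuerierCache(user_semaphores={"sl:default", "sl:gold"})
cache.insert("q1", "reader-1", "sl:default")
dropped = cache.lookup("q1", "sl:gold")  # user to user mismatch: dropped
```

Returning `None` here means the caller treats the lookup as a cache miss, so the query continues with a new reader instead of failing.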

Refers: https://github.com/scylladb/scylla-enterprise/issues/3182
Refers: https://github.com/scylladb/scylla-enterprise/issues/3050
Closes: #14770

Closes #14736

* github.com:scylladb/scylladb:
  querier_cache: add stats of scheduling group mismatches
  querier_cache: check semaphore mismatch during querier lookup
  querier_cache: add reference to `replica::database::is_user_semaphore()`
  replica:database: add method to determine if semaphore is user one
2023-07-24 14:13:09 +03:00
Michał Jadwiszczak
a5fc53aa11 querier_cache: check semaphore mismatch during querier lookup
Previously semaphore mismatch was checked only in multi-partition
queries and if it happened, an internal error was thrown.

This commit pushes the check down to `querier_cache`, so each
`lookup_*_querier` method will check for the mismatch.

What's more, if semaphore mismatch occurs, check whether both semaphores belong
to user. If so, log a warning and drop cached reader instead of
throwing an error.

The mismatch can happen if user's scheduling group changed during
a query. We don't want to throw an error then, but drop and reset
cached reader.
2023-07-21 19:05:50 +02:00
Michał Jadwiszczak
e5c965b280 querier_cache: add reference to replica::database::is_user_semaphore() 2023-07-21 18:58:57 +02:00
Kefu Chai
d78c6d5f50 test: use different table names in simple_backlog_controller_test
in `simple_backlog_controller_test`, we need to have multiple tables
at the same time. but the default constructor of `simple_schema` always
creates schema with the table name of "ks.cf". we are going to have
per-table metrics. and the new metric group will use the table name
as its counter labels, so we need to either disable these per-table
metrics or use a different table name for each table.

as in real world, we don't have multiple tables with the same name at
the same time. it would be better to stop reusing the same table name
in a single test session. so, in this change, we use a random cf_name
for each of the created tables.

Fixes #14767
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-21 19:08:29 +08:00
Pavel Emelyanov
eaeffcdb81 code: Pass sharded<db::system_keyspace>& to database::truncate()
The argument goes via the db::(drop|truncate)_table_on_all_shards()
pair of calls that start from

- storage_proxy::remote: has its sys.ks reference already
- schema_tables::merge_schema: has sys.ks argument already
- legacy_schema_migrator: the reference was added by previous patch
- tests: run in cql_test_env with sys.ks on board

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-21 13:11:59 +03:00
Nadav Har'El
5860820934 Merge 'mutation/mutation_compactor: validate the input stream' from Botond Dénes
The mutation compactor has a validator which it uses to validate the stream of mutation fragments that passes through it. This validator is supposed to validate the stream as it enters the compactor, as opposed to its compacted form (output). This was true for most fragment kinds except range tombstones, as purged range tombstones were not visible to the validator for the most part.

This mistake was introduced by https://github.com/scylladb/scylladb/commit/e2c9cdb576, which itself was a flawed attempt at fixing an error seen because purged tombstones were not terminated by the compactor.

This patch corrects this mistake by fixing the above problem properly: on page-cut, if the validator has an active tombstone, a closing tombstone is generated for it, to avoid the false-positive error. With this, range tombstones can be validated again as they come in.

The existing unit test checking the validation in the compactor is greatly expanded to check all (I hope) different validation scenarios.

Closes #13817

* github.com:scylladb/scylladb:
  test/mutation_test: test_compactor_validator_sanity_test
  mutation/mutation_compactor: fix indentation
  mutation/mutation_compactor: validate the input stream
  mutation: mutation_fragment_stream_validating_filter: add accessor to underlying validator
  readers: reader-from-fragment: don't modify stream when created without range
2023-07-21 00:26:46 +03:00
Pavel Emelyanov
98609e2115 Merge 's3/test: close using deferred_close() or deferred()' from Kefu Chai
let's use RAII to tear down the client and the input file, so we can
always perform the cleanups even if the test throws.
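
For readers less familiar with the idiom, here is a Python analogue of deferred cleanup. It is not the test code from this series: `contextlib.ExitStack` plays the role that `deferred_close()` and `seastar::deferred()` play in the C++ test, and `run_test()` with its `events` list is a hypothetical stand-in for the test body.

```python
from contextlib import ExitStack


def run_test(events, fail):
    with ExitStack() as cleanup:
        events.append("open client")
        cleanup.callback(events.append, "close client")
        events.append("open file")
        cleanup.callback(events.append, "close file")
        if fail:
            raise RuntimeError("test failed")
        events.append("test body done")


events = []
try:
    run_test(events, fail=True)
except RuntimeError:
    pass
# Cleanups ran in reverse registration order even though the body threw.
```

The registered callbacks run when the scope is left for any reason, which is the property the test relies on.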

Closes #14765

* github.com:scylladb/scylladb:
  s3/test: use seastar::deferred() to perform cleanup
  s3/test: close using deferred_close()
2023-07-20 20:05:34 +03:00
Botond Dénes
53da97416a Merge 'Remove qctx from system.paxos table access methods' from Pavel Emelyanov
The "fix" is straightforward -- callers of system_keyspace::*paxos* methods need to get system keyspace from somewhere. This time the only caller is storage_proxy::remote that can have system keyspace via direct dependency reference.

Closes #14758

* github.com:scylladb/scylladb:
  db/system_keyspace: Move and use qctx::execute_cql_with_timeout()
  db/system_keyspace: Make paxos methods non-static
  service/paxos: Add db::system_keyspace& argument to some methods
  test: Optionally initialize proxy remote for cql_test_env
  proxy/remote: Keep sharded<db::system_keyspace>& dependency
2023-07-20 16:53:25 +03:00
Botond Dénes
e62325babc Merge 'Compaction reshard task' from Aleksandra Martyniuk
Task manager tasks covering reshard compaction.

Reattempt on https://github.com/scylladb/scylladb/pull/14044. Bugfix for https://github.com/scylladb/scylladb/issues/14618 is squashed with 95191f4.
Regression test added.

Closes #14739

* github.com:scylladb/scylladb:
  test: add test for resharding with non-empty owned_ranges_ptr
  test: extend test_compaction_task.py to test resharding compaction
  compaction: add shard_reshard_sstables_compaction_task_impl
  compaction: invoke resharding on sharded database
  compaction: move run_resharding_jobs into reshard_sstables_compaction_task_impl::run()
  compaction: add reshard_sstables_compaction_task_impl
  compaction: create resharding_compaction_task_impl
2023-07-20 16:43:22 +03:00
Botond Dénes
a35f4f6985 test/mutation_test: test_compactor_validator_sanity_test
Greatly expand this test to check that the compactor validates the input
stream properly.
The test is renamed (the _sanity_test suffix is removed) to reflect the
expanded scope.
2023-07-20 08:48:50 -04:00
Raphael S. Carvalho
3117f2f066 tests: Add test for table's mutation source excluding staging
Commit f5e3b8df6d introduced an optimization for
as_mutation_source_excluding_staging() and added a test that
verifies correctness of single key and range reads based
on supplied predicates. This new test aims to improve the
coverage by testing directly both table::as_mutation_source()
and as_mutation_source_excluding_staging(), therefore
guaranteeing that both supply the correct predicate to
sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14763
2023-07-20 07:14:36 +03:00
Kefu Chai
77faec4f38 s3/test: use seastar::deferred() to perform cleanup
let's use RAII to remove the object used as a fixture, so we don't
leave some object in the bucket for testing. this might interfere
with other tests which share the same minio server with the test
which fails to do its clean up if an exception is thrown.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-20 10:04:54 +08:00
Kefu Chai
7a9c802fc3 s3/test: close using deferred_close()
let's use RAII to tear down the client and the input file, so we can
always perform the cleanups even if the test throws.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-20 10:04:54 +08:00
Pavel Emelyanov
ea9db1b35c Merge 'cql3: expr: remove the default constructor' from Avi Kivity
`expression`'s default constructor is dangerous as it can leak
into computations and generate surprising results. Fix that by
removing the default constructor.

This is made somewhat difficult by the parser generator's reliance
on default construction, and we need to expand our workaround
(`uninitialized<>`) capabilities to do so.

We also remove some incidental uses of default-constructed expressions.

Closes #14706

* github.com:scylladb/scylladb:
  cql3: expr: make expression non-default-constructible
  cql3: grammar: don't default-construct expressions
  cql3: grammar: improve uninitialized<> flexibility
  cql3: grammar: adjust uninitialized<> wrapper
  test: expr_test: don't invoke expression's default constructor
  cql3: statement_restrictions: explicitly initialize expressions in index match code
  cql3: statement_restrictions: explicitly intitialize some expression fields
  cql3: statement_restrictions: avoid expression's default constructor when classifying restrictions
  cql3: expr: prepare_expression: avoid default-constructed expression
  cql3: broadcast_tables: prepare new_value without relying on expression default constructor
2023-07-19 21:46:03 +03:00
Pavel Emelyanov
b4fc1076e3 test: Optionally initialize proxy remote for cql_test_env
Some test cases that use cql_test_env involve paxos state updates. Since
this update now goes via proxy->remote->system_keyspace, those test
cases need cql_test_env to initialize the remote part of the proxy too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-19 19:32:10 +03:00
Aleksandra Martyniuk
bfb81b8cdd test: add test for resharding with non-empty owned_ranges_ptr 2023-07-19 17:19:10 +02:00
Kefu Chai
665135553d build: cmake: remove nonexistent test
the test of "type_json_test" was added locally, and has not landed
on master. but it somehow was spilled into 87170bf07a by accident.

so, let's drop it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14749
2023-07-19 11:58:34 +03:00
Avi Kivity
460b28d067 Merge 'Introduce SELECT MUTATION FRAGMENTS statement' from Botond Dénes
SELECT MUTATION FRAGMENTS is a new select statement sub-type, which allows dumping the underlying mutations making up the data of a given table. The output of this statement is mutation-fragments presented as CQL rows. Each row corresponds to a mutation-fragment. Consequently, the output of this statement has a schema that is different from that of the underlying table. The output schema is derived from the table's schema, as follows:
* The table's partition key is copied over as-is
* The clustering key is formed from the following columns:
    - mutation_source (text): the kind of the mutation source, one of: memtable, row-cache or sstable; and the identifier of the individual mutation source.
    - partition_region (int): represents the enum with the same name.
    - the copy of the table's clustering columns
    - position_weight (int): -1, 0 or 1, has the same meaning as that in position_in_partition, used to disambiguate range tombstone changes with the same clustering key, from rows and from each other.
* The following regular columns:
    - metadata (text): the JSON representation of the mutation-fragment's metadata.
    - value (text): the JSON representation of the mutation-fragment's value.
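
The derivation rules listed above can be written down as a small function. This is an illustrative sketch only: `mutation_fragments_output_schema()` and the `(name, type)` tuples are hypothetical and do not reflect how the schema is built in the code. It follows the list as written, so the `mutation_fragment_kind` column that appears in the example output further down is not modeled.

```python
def mutation_fragments_output_schema(partition_key, clustering_key):
    """Derive the output schema from the underlying table's key columns.

    partition_key / clustering_key: lists of (name, type) pairs.
    """
    return {
        # The partition key is copied over as-is.
        "partition_key": list(partition_key),
        # The table's clustering columns sit between the added columns.
        "clustering_key": [
            ("mutation_source", "text"),
            ("partition_region", "int"),
            *clustering_key,
            ("position_weight", "int"),
        ],
        "regular_columns": [("metadata", "text"), ("value", "text")],
    }


# For the table used in the example: PRIMARY KEY (pk, ck)
schema = mutation_fragments_output_schema([("pk", "int")], [("ck", "int")])
clustering_names = [name for name, _ in schema["clustering_key"]]
```

For `ks.tbl` this gives the clustering order `mutation_source, partition_region, ck, position_weight`, matching the column order in the example output.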

Data is always read from the local replica, on which the query is executed. Migrating queries between coordinators is forbidden.

More details in the documentation commit (last commit).

Example:
```cql
cqlsh> CREATE TABLE ks.tbl (pk int, ck int, v int, PRIMARY KEY (pk, ck));

cqlsh> DELETE FROM ks.tbl WHERE pk = 0;
cqlsh> DELETE FROM ks.tbl WHERE pk = 0 AND ck > 0 AND ck < 2;
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (0, 0, 0);
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (0, 1, 0);
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (0, 2, 0);
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (1, 0, 0);
cqlsh> SELECT * FROM ks.tbl;

 pk | ck | v
----+----+---
  1 |  0 | 0
  0 |  0 | 0
  0 |  1 | 0
  0 |  2 | 0

(4 rows)
cqlsh> SELECT * FROM MUTATION_FRAGMENTS(ks.tbl);

 pk | mutation_source | partition_region | ck | position_weight | metadata                                                                                                                 | mutation_fragment_kind | value
----+-----------------+------------------+----+-----------------+--------------------------------------------------------------------------------------------------------------------------+------------------------+-----------
  1 |      memtable:0 |                0 |    |                 |                                                                                                         {"tombstone":{}} |        partition start |      null
  1 |      memtable:0 |                2 |  0 |               0 | {"marker":{"timestamp":1688122873341627},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122873341627}}} |         clustering row | {"v":"0"}
  1 |      memtable:0 |                3 |    |                 |                                                                                                                     null |          partition end |      null
  0 |      memtable:0 |                0 |    |                 |                                      {"tombstone":{"timestamp":1688122848686316,"deletion_time":"2023-06-30 11:00:48z"}} |        partition start |      null
  0 |      memtable:0 |                2 |  0 |               0 | {"marker":{"timestamp":1688122860037077},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122860037077}}} |         clustering row | {"v":"0"}
  0 |      memtable:0 |                2 |  0 |               1 |                                      {"tombstone":{"timestamp":1688122853571709,"deletion_time":"2023-06-30 11:00:53z"}} | range tombstone change |      null
  0 |      memtable:0 |                2 |  1 |               0 | {"marker":{"timestamp":1688122864641920},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122864641920}}} |         clustering row | {"v":"0"}
  0 |      memtable:0 |                2 |  2 |              -1 |                                                                                                         {"tombstone":{}} | range tombstone change |      null
  0 |      memtable:0 |                2 |  2 |               0 | {"marker":{"timestamp":1688122868706989},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122868706989}}} |         clustering row | {"v":"0"}
  0 |      memtable:0 |                3 |    |                 |                                                                                                                     null |          partition end |      null

(10 rows)
```

Perf simple query:
```
/build/release/scylla perf-simple-query -c1 -m2G --duration=60
```

Before:
```
median 141596.39 tps ( 62.1 allocs/op,  13.1 tasks/op,   43688 insns/op,        0 errors)
median absolute deviation: 137.15
maximum: 142173.32
minimum: 140492.37
```
After:
```
median 141889.95 tps ( 62.1 allocs/op,  13.1 tasks/op,   43692 insns/op,        0 errors)
median absolute deviation: 167.04
maximum: 142380.26
minimum: 141025.51
```
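
As a quick sanity check on the numbers above, the relative difference can be computed directly. The values are copied from the two runs; the variable names are just for this sketch.

```python
# Medians and instruction counts from the "Before" and "After" runs.
before_tps, after_tps = 141596.39, 141889.95
before_insns, after_insns = 43688, 43692

tps_change_pct = (after_tps - before_tps) / before_tps * 100
insns_delta = after_insns - before_insns

print(f"tps change: {tps_change_pct:+.2f}%")      # tps change: +0.21%
print(f"insns/op delta: {insns_delta:+d}")        # insns/op delta: +4
```

Both differences are small, with 4 extra instructions per operation and a median throughput change of about a fifth of a percent.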

Fixes: https://github.com/scylladb/scylladb/issues/11130

Closes #14347

* github.com:scylladb/scylladb:
  docs/operating-scylla/admin-tools: add documentation for the SELECT * FROM MUTATION_FRAGMENTS() statement
  test/topology_custom: add test_select_from_mutation_fragments.py
  test/boost/database_test: add test for mutation_dump/generate_output_schema_from_underlying_schema
  test/cql-pytest: add test_select_mutation_fragments.py
  test/cql-pytest: move scylla_data_dir fixture to conftest.py
  cql3/statements: wire-in mutation_fragments_select_statement
  cql3/restrictions/statement_restrictions: fix indentation
  cql3/restrictions/statement_restrictions: add check_indexes flag
  cql3/statements/select_statement: add mutation_fragments_select_statement
  cql3: add SELECT MUTATION FRAGMENTS select statement sub-type
  service/pager: allow passing a query functor override
  service/storage_proxy: un-embed coordinator_query_options
  replica: add mutation_dump
  replica: extract query_state into own header
  replica/table: add make_nonpopulating_cache_reader()
  replica/table: add select_memtables_as_mutation_sources()
  tools,mutation: extract the low-level json utilities into mutation/json.hh
  tools/json_writer: fold SstableKey() overloads into callers
  tools/json_writer: allow writing metadata and value separately
  tools/json_writer: split mutation_fragment_json_writer in two classes
  tools/json_writer: allow passing custom std::ostream to json_writer
2023-07-19 11:54:11 +03:00
Asias He
c29e7e4644 Revert "Revert "view_update_generator: Increase the registration_queue_size""
This reverts commit 4cee8206f8.

The test is fixed.

Closes #14750
2023-07-19 11:46:28 +03:00
Avi Kivity
503d21b570 cql3: expr: avoid separating column_mutation_attribute from its column_value when levellizing aggregation depth
Since ec77172b4b ("Merge 'cql3: convert
the SELECT clause evaluation phase to expressions' from Avi Kivity"),
we rewrite non-aggregating selectors to include an aggregation, in order
to have the rest of the code either deal with no aggregation, or
all selectors aggregating, with nothing in between. This is done
by wrapping column selectors with "first" function calls: col ->
first(col).

This broke non-aggregating selectors that included the ttl() or
writetime() pseudo functions. This is because we rewrote them as
writetime(first(col)), and writetime() isn't a function that operates
on any values; it operates on mutations and so must have access to
a column, not an expression.

Fix by detecting this scenario and rewriting the expression as
first(writetime(col)).
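
The rewrite rule can be illustrated with a toy expression tree. This is only a sketch under stated assumptions: `expr`, `column()`, `call()`, `is_mutation_attribute()`, `wrap_in_first()` and `render()` are hypothetical names made up for this example and do not reflect the actual cql3 expression types. The idea is that a `writetime()`/`ttl()` call applied directly to a column is wrapped as a unit, instead of wrapping the column underneath it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy expression node: a column reference or a single-argument function call.
struct expr {
    enum class kind { column, call } k;
    std::string name;        // column name or function name
    std::vector<expr> args;  // empty for columns
};

expr column(std::string name) {
    return expr{expr::kind::column, std::move(name), {}};
}

expr call(std::string fn, expr arg) {
    assert(!fn.empty());
    expr e{expr::kind::call, std::move(fn), {}};
    e.args.push_back(std::move(arg));
    return e;
}

// writetime(col) and ttl(col) need direct access to the column, so they
// must stay attached to it.
bool is_mutation_attribute(const expr& e) {
    return e.k == expr::kind::call
        && (e.name == "writetime" || e.name == "ttl")
        && e.args.size() == 1
        && e.args[0].k == expr::kind::column;
}

// Wrap plain columns, and mutation attributes as a whole, in first().
expr wrap_in_first(expr e) {
    if (e.k == expr::kind::column || is_mutation_attribute(e)) {
        return call("first", std::move(e));
    }
    for (auto& a : e.args) {
        a = wrap_in_first(std::move(a));
    }
    return e;
}

// Render the tree back to text, e.g. first(writetime(v)).
std::string render(const expr& e) {
    if (e.k == expr::kind::column) {
        return e.name;
    }
    std::string out = e.name + "(";
    for (std::size_t i = 0; i < e.args.size(); ++i) {
        if (i != 0) {
            out += ", ";
        }
        out += render(e.args[i]);
    }
    return out + ")";
}
```

In this toy model, `render(wrap_in_first(call("writetime", column("v"))))` yields `first(writetime(v))` rather than `writetime(first(v))`.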

Unit and integration tests are added.

Fixes #14715.

Closes #14716
2023-07-19 11:35:01 +03:00
Botond Dénes
7540e62522 test/boost/database_test: add test for mutation_dump/generate_output_schema_from_underlying_schema
Checking that the generated schema has deterministic id and version.
2023-07-19 01:28:28 -04:00
Kamil Braun
6f22ed9145 Merge 'raft: move group0_state_machine::merger to its own header and add unit test for it' from Mikołaj Grzebieluch
Move `merger` to its own header file. Leave the logic of applying
commands to `group0_state_machine`. Remove `group0_state_machine`
dependencies from `merger` to make it an independent module.

Add a test that checks if `group0_state_machine_merger` preserves
timeuuid monotonicity. `last_id()` should be equal to the largest
timeuuid, based on its timestamps.

This test combines two commands in the reverse order of their timeuuids.
The timeuuids yield different results when compared in both timeuuid
order and uuid order. Consequently, the resulting command should have a
more recent timeuuid.
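
Why the two orders can disagree is easiest to see on the RFC 4122 version-1 layout, where the low 32 bits of the timestamp are stored in the most significant bits of the UUID. The sketch below is an illustration only: the helper names are hypothetical, only the most significant 64 bits are modelled, and "uuid order" is assumed here to mean a plain unsigned comparison of those raw bits, which may differ from the comparator the code base actually uses.

```cpp
#include <cstdint>

// Pack a 60-bit timestamp into the most significant 64 bits of a
// version-1 UUID: time_low (32 bits) | time_mid (16) | version (4) | time_hi (12).
constexpr std::uint64_t make_v1_msb(std::uint64_t ts) {
    const std::uint64_t time_low = ts & 0xFFFFFFFFull;
    const std::uint64_t time_mid = (ts >> 32) & 0xFFFFull;
    const std::uint64_t time_hi = (ts >> 48) & 0x0FFFull;
    return (time_low << 32) | (time_mid << 16) | (1ull << 12) | time_hi;
}

// Recover the 60-bit timestamp from the packed representation.
constexpr std::uint64_t v1_timestamp(std::uint64_t msb) {
    const std::uint64_t time_low = msb >> 32;
    const std::uint64_t time_mid = (msb >> 16) & 0xFFFFull;
    const std::uint64_t time_hi = msb & 0x0FFFull;
    return (time_hi << 48) | (time_mid << 32) | time_low;
}

// "timeuuid order": compare by the embedded timestamp.
constexpr bool less_by_timestamp(std::uint64_t a, std::uint64_t b) {
    return v1_timestamp(a) < v1_timestamp(b);
}

// Assumed "uuid order": compare the raw bits as an unsigned integer.
constexpr bool less_by_raw_bits(std::uint64_t a, std::uint64_t b) {
    return a < b;
}
```

With timestamps `0xFFFFFFFF` and `0x100000000`, the first is older by timestamp, yet its packed form is larger when the raw bits are compared, because `time_low` occupies the top of the word.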

Fixes #14568

Closes #14682

* github.com:scylladb/scylladb:
  raft: group0_state_machine_merger: add test for timeuuid ordering
  raft: group0_state_machine: extract merger to its own header
2023-07-18 17:43:50 +02:00
Raphael S. Carvalho
da18a9badf Fix test.py with compaction groups
test.py with --x-log2-compaction-groups option rotted a little bit.
Some boost tests added later didn't use the correct header which
parses the option or they didn't adjust suite.yaml.
Perhaps it's time to set up a weekly (or bi-weekly) job to verify
there are no regressions with it. It's important as it stresses
the data plane for tablets by reusing the existing tests.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14732
2023-07-18 16:57:11 +03:00
Botond Dénes
7d5cca1958 Merge 'Regular compaction task' from Aleksandra Martyniuk
Task manager's tasks covering regular compaction.

Uses multiple inheritance on already existing
regular_compaction_task_executor to keep track of
the operation with task manager.

Closes #14377

* github.com:scylladb/scylladb:
  test: add regular compaction task test
  compaction: turn regular_compaction_task_executor into regular_compaction_task_impl
  compaction: add compaction_manager::perform_compaction method
  test: modify sstable_compaction_test.cc
  compaction: add regular_compaction_task_impl
  compaction: switch state after compaction is done
2023-07-18 16:52:53 +03:00
Michał Jadwiszczak
62ced66702 schema: add scylla specific options to schema description
Add `paxos_grace_seconds`, `tombstone_gc`, `cdc` and `synchronous_updates`
options to schema description.

Fixes: #12389
Fixes: scylladb/scylla-enterprise#2979

Closes #14275
2023-07-18 11:16:19 +03:00
Botond Dénes
21ff6efd74 test/boost/view_build_test: improve test_view_update_generator_register_semaphore_unit_leak
By making it independent of the number of units the view update
generator's registration semaphore is created with. We want to increase
this number significantly, which would destabilize this test.
To prevent this, detach the test from the number of units
completely, while still preserving the original intent behind it, as best
as it could be determined.

Closes #14727
2023-07-18 09:18:28 +03:00
Botond Dénes
b3cb611be7 Merge 'treewide: enable -Wsign-compare and address the warnings from this option' from Kefu Chai
in order to identify the problems caused by integer type promotion when comparing unsigned and signed integers, in this series, we

- address the warnings raised by `-Wsign-compare` compiler option
- add `-Wsign-compare` compiler option to the build systems

Closes #14652

* github.com:scylladb/scylladb:
  treewide: use unsigned variable to compare with unsigned
  treewide: compare signed and unsigned using std::cmp_*()
2023-07-18 09:05:30 +03:00
Botond Dénes
f03efd7ea9 Merge 'build: cmake: fix the build of some tests' from Kefu Chai
this series addresses the FTBFS of tests with CMake, and also checks for the unknown parameters in `add_scylla_test()`

Closes #14650

* github.com:scylladb/scylladb:
  build: cmake: build SEASTAR tests as SEASTAR tests
  build: cmake: error out if found unknown keywords
  build: cmake: link tests against necessary libraries
2023-07-18 06:51:40 +03:00
Kefu Chai
fa3129fa29 treewide: use unsigned variable to compare with unsigned
sometimes we initialize a loop variable like

auto i = 0;

or

int i = 0;

but since the type of `0` is `int`, what we get is a variable of
`int` type, which we later compare with an unsigned number. if we
compile the source code with the `-Werror=sign-compare` option, the
compiler rejects this comparison. in general, this is a false
alarm, as we are not likely to get a wrong comparison result
here. but to prevent issues caused by integer promotion in
comparisons elsewhere, and to prepare for enabling
`-Werror=sign-compare`, let's use unsigned to silence this warning.
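
A minimal sketch of the pattern described above, not taken from the tree: the function name `sum` and its signature are made up for illustration. The loop index is declared with the same unsigned type that `size()` returns, so the comparison never mixes signedness.

```cpp
#include <array>
#include <cstddef>

// Sum the elements using an index whose type matches values.size().
template <std::size_t N>
constexpr int sum(const std::array<int, N>& values) {
    int total = 0;
    // `auto i = 0` would deduce int here, and `i < values.size()` would
    // then compare int with std::size_t, which -Wsign-compare reports.
    for (std::size_t i = 0; i < values.size(); ++i) {
        total += values[i];
    }
    return total;
}
```

For example, `sum(std::array<int, 3>{1, 2, 3})` evaluates to 6.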

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-18 10:27:18 +08:00
Kefu Chai
3129ae3c8c treewide: compare signed and unsigned using std::cmp_*()
when comparing signed and unsigned numbers, the compiler converts
the signed number to the common type, in this case the unsigned type,
so they can be compared. sometimes this matters: after the
conversion, the comparison yields the wrong result. a short
sample demonstrates this:

```
#include <fmt/core.h>

int main(int argc, char **argv) {
    int x = -1;
    unsigned y = 2;
    // x is converted to unsigned before the comparison, so this prints "false"
    fmt::print("{}\n", x < y);
    return 0;
}
```

`-Werror=sign-compare` can identify this error, but before
enabling this compiler option, let's use `std::cmp_*()` to compare
them.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-18 10:27:18 +08:00
Aleksandra Martyniuk
ab4ae6b84a test: modify sstable_compaction_test.cc
Modify sstable_compaction_test.cc so that it does not depend on
how quickly the compaction manager stats are updated after compaction
is triggered.

This is required because, in the following changes, the context may
switch before the stats are updated.
2023-07-17 15:54:33 +02:00
Mikołaj Grzebieluch
bdf3959ae6 raft: group0_state_machine_merger: add test for timeuuid ordering
This test checks if `group0_state_machine_merger` preserves timeuuid monotonicity.
`last_id()` should be equal to the largest timeuuid, based on its timestamps.

This test combines two commands in the reverse order of their timeuuids.
The timeuuids yield different results when compared in both timeuuid order and
uuid order. Consequently, the resulting command should have a more recent timeuuid.

Closes #14568
2023-07-17 15:51:20 +02:00