Commit Graph

837 Commits

Author SHA1 Message Date
Pavel Emelyanov
fa93ac9bfd database: Add wasm::manager& dependency
The dependency is needed by db::schema_tables to get wasm manager for
its needs. This patch prepares the ground. Now the wasm::manager is
shared between replica::database and cql3::query_processor

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-04 19:47:50 +03:00
Botond Dénes
00a62866ac Merge 'Make database::add_column_family exception safe.' from Aleksandra Martyniuk
If some state update in database::add_column_family throws,
info about a column family would be inconsistent.

Undo already performed operations in database::add_column_family
when one throws.

Fixes: #14666.

Closes #14672

* github.com:scylladb/scylladb:
  replica: undo the changes if something fails
  replica: start table earlier in database::add_column_family
2023-08-04 10:58:17 +03:00
Botond Dénes
4d538e1363 Merge 'Task manager tasks covering compaction group compaction' from Aleksandra Martyniuk
All compaction task executors, except for regular compaction one,
become task manager compaction tasks.

Creating and starting of major_compaction_task_executor is modified
to be consistent with other compaction task executors.

Closes #14505

* github.com:scylladb/scylladb:
  test: extend test_compaction_task.py to cover compaction group tasks
  compaction: turn custom_task_executor into compaction_task_impl
  compaction: turn sstables_task_executor into sstables_compaction_task_impl
  compaction: change sstables compaction tasks type
  compaction: move table_upgrade_sstables_compaction_task_impl
  compaction: pass task_info through sstables compaction
  compaction: turn offstrategy_compaction_task_executor into offstrategy_compaction_task_impl
  compaction: turn cleanup_compaction_task_executor into cleanup_compaction_task_impl
  comapction: use optional task info in major compaction
  compaction: use perform_compaction in compaction_manager::perform_major_compaction
2023-08-04 10:11:00 +03:00
Aleksandra Martyniuk
1e9b2972ea replica: undo the changes if something fails
If a step of adding a table fails, previous steps are undone.
2023-08-03 17:37:31 +02:00
Avi Kivity
cb3b808e3f Merge 'replica/table.cc: Add per-node-per-table metrics' from Amnon Heiman
Per-table metrics are very valuable for the users, it does come with a high load on both the reporting and the collecting metrics systems.

This patch adds a small subset of per-metrics table that will be reported on the node level.

The list of metrics is:
system_column_family_memtable_switch - Number of times flush has
  resulted in the memtable being switched out
system_column_family_memtable_partition_writes - Number of write
  operations performed on partitions in memtables
system_column_family_memtable_partition_hits - Number of times a write
  operation was issued on an existing partition in memtables
system_column_family_memtable_row_writes - Number of row writes
  performed in memtables
system_column_family_memtable_row_hits - Number of rows overwritten by write operations in memtables
system_column_family_total_disk_space - Total disk space used system_column_family_live_sstable - Live sstable count system_column_family_read_latency_count - Number of reads system_column_family_write_latency_count - Number of writes

The names of the read/write metrics is based on the histogram convention, so when latencies histograms will be added, the names will not change.

The metrics are label with a specific label __per_table="node" so it will be possible to easily manipulate it.

The metrics will be available when enable_metrics_reporting (the per-table full metrics flag) is off

Fixes #2198

Closes #13293

* github.com:scylladb/scylladb:
  replica/table.cc: Add node-per-table metrics
  config: add enable_node_table_metrics flag
2023-08-02 22:17:47 +03:00
Aleksandra Martyniuk
9f68566038 replica: start table earlier in database::add_column_family
In database::add_column_family table::start() is called before
a table is registered in different structures.
2023-08-02 16:35:34 +02:00
Pavel Emelyanov
c3b23fc03d Merge 'Skip mode validation for snapshots' from Benny Halevy
Skip over verification of owner and mode of the snapshots
sub-directory as this might race with scylla-manager
trying to delete old snapshots concurrently.

Fixes #12010

Closes #14892

* github.com:scylladb/scylladb:
  distributed_loader: process_sstable_dir: do not verify snapshots
  utils/directories: verify_owner_and_mode: add recursive flag
2023-08-02 13:05:47 +03:00
Amnon Heiman
c30d7ba5d7 replica/table.cc: Add node-per-table metrics
Per-table metrics are very valuable for the users, it does come with a
high load on both the reporting and the collecting metrics systems.

This patch adds a small subset of per-metrics table that will be
reported on the node level.

The list of metrics is:
system_column_family_memtable_switch - Number of times flush has
  resulted in the memtable being switched out
system_column_family_memtable_partition_writes - Number of write
  operations performed on partitions in memtables
system_column_family_memtable_partition_hits - Number of times a write
  operation was issued on an existing partition in memtables
system_column_family_memtable_row_writes - Number of row writes
  performed in memtables
system_column_family_memtable_row_hits - Number of rows overwritten by
write operations in memtables
system_column_family_total_disk_space - Total disk space used
system_column_family_live_sstable - Live sstable count
system_column_family_read_latency_count - Number of reads
system_column_family_write_latency_count - Number of writes

The names of the read/write metrics is based on the histogram convention,
so when latencies histograms will be added, the names will not change.

The metrics are label with a specific label __per_table="node" so it
will be possible to easily manipulate it.

The metrics will be available when enable_metrics_reporting (the
per-table full metrics flag) is off and enable_node_table_metrics is
true.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2023-08-02 10:20:18 +03:00
Amnon Heiman
d10a3dd19a config: add enable_node_table_metrics flag
By default, per-table-per-shard metrics reporting is turned off, and the
aggregated version of the metrics (per-table-per-node) will be turned
on.

There could be a situation where a user with an excessive number of
tables would suffer from performance issues, both from the network and
the metrics collection server.

This patch adds a config option, enable_node_table_metrics, which allows
users to turn off per-table metrics reporting altogether.

For example, when running Scylla with the command line argument
'--enable-node-aggregated-table_metrics 0' per-table metrics will not be reported.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2023-08-02 10:20:18 +03:00
Benny Halevy
845b6f901b distributed_loader: process_sstable_dir: do not verify snapshots
Skip over verification of owner and mode of the snapshots
sub-directory as this might race with scylla-manager
trying to delete old snapshots concurrently.

Fixes #12010

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 16:01:46 +03:00
Botond Dénes
4a02865ea1 Merge 'Prevent invalidation of iterators over database::_column_families' from Aleksandra Martyniuk
Maps related to column families in database are extracted
to a column_families_data class. Access to them is possible only
through methods. All methods which may preempt hold rwlock
in relevant mode, so that the iterators can't become invalid.

Fixes: #13290

Closes #13349

* github.com:scylladb/scylladb:
  replica: make tables_metadata's attributes private
  replica: add methods to get a filtered copy of tables map
  replica: add methods to check if given table exists
  replica: add methods to get table or table id
  replica: api: return table_id instead of const table_id&
  replica: iterate safely over tables related maps
  replica: pass tables_metadata to phased_barrier_top_10_counts
  replica: add methods to safely add and remove table
  replica: wrap column families related maps into tables_metadata
  replica: futurize database::add_column_family and database::remove
2023-07-31 15:31:59 +03:00
Botond Dénes
72043a6335 Merge 'Avoid using qctx in schema_tables' column-mapping queries' from Pavel Emelyanov
There are three methods in system_keyspace namespace that run queries over `system.scylla_table_schema_history` table. For that they use qctx which's not nice.

Fortunately, all the callers already have the system_keyspace& local variable or argument they can pass to those methods. Since the accessed table belongs to system keyspace, the latter declares the querying methods as "friends" to let them get private `query_processor& _qp` member

Closes #14876

* github.com:scylladb/scylladb:
  schema_tables: Extract query_processor from system_keyspace for querying
  schema_tables: Add system_keyspace& argument to ..._column_mapping() calls
  migration_manager: Add system_keyspace argument to get_schema_mapping()
2023-07-31 15:00:59 +03:00
Pavel Emelyanov
cf4d4d7e9b schema_tables: Add system_keyspace& argument to ..._column_mapping() calls
The callers all have local sys_ks argument:

- merge_tables_and_views()
- service::get_column_mapping()
- database::parse_system_tables()

And a test that can get it from cql_test_env.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-28 15:55:13 +03:00
Aleksandra Martyniuk
4e439ac957 compaction: turn offstrategy_compaction_task_executor into offstrategy_compaction_task_impl
offstrategy_compaction_task_executor inherits both from compaction_task_executor
and offstrategy_compaction_task_impl.
2023-07-28 10:51:55 +02:00
Aleksandra Martyniuk
92f2987217 compaction: turn cleanup_compaction_task_executor into cleanup_compaction_task_impl
cleanup_compaction_task_executor inherits both from compaction_task_executor
and cleanup_compaction_task_impl.

Add a new version of compaction_manager::perform_task_on_all_files
which accepts only the tasks that are derived from compaction_task_impl.
After all task executors' conversions are done, the new version replaces
the original one.
2023-07-28 10:48:58 +02:00
Aleksandra Martyniuk
8317e4dd7f comapction: use optional task info in major compaction
To make it consistent with the upcoming methods, methods triggering
major compaction get std::optional<tasks::task_info> as an argument.

Thanks to that we can distinguish between a task that has no parent
and the task which won't be registered in task manager.
2023-07-28 09:25:21 +02:00
Botond Dénes
b599f15b26 replica: make_[multishard_]streaming_reader(): make compaction_time mandatory
Now that all users have opted in unconditionally, there is no point in
keeping this optional. Make it mandatory to make sure there are no
opt-out by mistake.
The global override via enable_compacting_data_for_streaming_and_repair
config item still remains, allowing compaction to be force turned-off.
2023-07-27 04:57:52 -04:00
Botond Dénes
2f8d77e97b replica/table: add optional compacting to make_multishard_streaming_reader()
Doing to make_multishard_streaming_reader() what the previous commit did
to make_streaming_reader(). In fact, the new compaction_time parameter
is simply forwarded to the make_streaming_reader() on the shard readers.

Call sites are updated, but none opt in just yet.
2023-07-27 03:22:11 -04:00
Botond Dénes
42b0dd5558 replica/table: add optional compacting to make_streaming_reader()
Opt-in is possible by passing an engaged `compaction_time`
(gc_clock::time_point) to the method. When this new parameter is
disengaged, no compaction happens.
Note that there is a global override, via the
enable_compacting_data_for_streaming_and_repair config item, which can
force-disable this compaction.
Compaction done on the output of the streaming reader does *not*
garbage-collect tombstones!

All call-sites are adjusted (the new parameter is not defaulted), but
none opt in yet. This will be done in separate commit per user.
2023-07-27 03:22:11 -04:00
Avi Kivity
ff1f461a42 Merge 'Introduce tablet load balancer' from Tomasz Grabiec
After this series, tablet replication can handle the scenario of bootstrapping new nodes. The ownership is distributed indirectly by the means of a load-balancer which moves tablets around in the background. See docs/dev/topology-over-raft.md for details.

The implementation is by no means meant to be perfect, especially in terms of performance, and will be improved incrementally.

The load balancer will be also kicked by schema changes, so that allocation/deallocation done during table creation/drop will be rebalanced.

Tablet data is streamed using existing `range_streamer`, which is the infrastructure for "the old streaming". This will be later replaced by sstable transfer once integration of tablets with compaction groups is finished. Also, cleanup is not wired yet, also blocked by compaction group integration.

Closes #14601

* github.com:scylladb/scylladb:
  tests: test_tablets: Add test for bootstraping a node
  storage_service: topology_coordinator: Implement tablet migration state machine
  tablets: Introduce tablet_mutation_builder
  service: tablet_allocator: Introduce tablet load balancer
  tablets: Introduce tablet_map::for_each_tablet()
  topology: Introduce get_node()
  token_metadata: Add non-const getter of tablet_metadata
  storage_service: Notify topology state machine after applying schema change
  storage_service: Implement stream_tablet RPC
  tablets: Introduce global_tablet_id
  stream_transfer_task, multishard_writer: Work with table sharder
  tablets: Turn tablet_id into a struct
  db: Do not create per-keyspace erm for tablet-based tables
  tablets: effective_replication_map: Take transition stage into account when computing replicas
  tablets: Store "stage" in transition info
  doc: Document tablet migration state machine and load balancer
  locator: erm: Make get_endpoints_for_reading() always return read replicas
  storage_service: topology_coordinator: Sleep on failure between retries
  storage_service: topology_coordinator: Simplify coordinator loop
  main: Require experimental raft to enable tablets
2023-07-26 12:30:29 +03:00
Botond Dénes
ad2ddffb22 Merge 'Remove qctx from system_keyspace::save_truncation_record()' from Pavel Emelyanov
The method is called by db::truncate_table_on_all_shards(), its call-chain, in turn, starts from

- proxy::remote::handle_truncate()
- schema_tables::merge_schema()
- legacy_schema_migrator
- tests

All of the above are easy to get system_keyspace reference from. This, in turn, allows making the method non-static and use query_processor reference from system_keyspace object in stead of global qctx

Closes #14778

* github.com:scylladb/scylladb:
  system_keyspace: Make save_truncation_record() non-static
  code: Pass sharded<db::system_keyspace>& to database::truncate()
  db: Add sharded<system_keyspace>& to legacy_schema_migrator
2023-07-26 08:48:49 +03:00
Tomasz Grabiec
5c681a1d63 tablets: Introduce tablet_mutation_builder 2023-07-25 21:08:51 +02:00
Tomasz Grabiec
c2b18ae483 db: Do not create per-keyspace erm for tablet-based tables
This erm is not updated when replicating token metadata in
storage_service::replicate_to_all_cores() so will pin token metadata
version and prevent token metadata barrier from finishing.

It is not necessary to have per-keyspace erm for tablet-based tables,
so just don't create it.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
dc2ec3f81c tablets: Store "stage" in transition info
It's needed to implement tablet migration. It stores the current step
of tablet migration state machine. The state machine will be advanced
by the topology change coordinator.

See the "Tablet migration" section of topology-over-raft.md
2023-07-25 21:08:02 +02:00
Aleksandra Martyniuk
6e6ba7309e replica: make tables_metadata's attributes private
Make _column_families and _ks_cf_to_uuid private to prevent unsafe
access. The maps can be accessed only through method which use locks
if preemption is possible.
2023-07-25 17:13:24 +02:00
Aleksandra Martyniuk
c5cad803b3 replica: add methods to get a filtered copy of tables map 2023-07-25 17:13:24 +02:00
Aleksandra Martyniuk
ff26b2ba3f replica: add methods to check if given table exists 2023-07-25 17:13:24 +02:00
Aleksandra Martyniuk
6796721c3d replica: add methods to get table or table id 2023-07-25 17:13:24 +02:00
Aleksandra Martyniuk
e072a2341d replica: api: return table_id instead of const table_id&
Return table_id instead of const table_id& from database::find_uuid
as copying table_id does not cause much overhead and simplifies
methods signature.
2023-07-25 17:13:24 +02:00
Aleksandra Martyniuk
cdbfa0b2f5 replica: iterate safely over tables related maps
Loops over _column_families and _ks_cf_to_uuid which may preempt
are protected by reader mode of rwlock so that iterators won't
get invalid.
2023-07-25 17:13:04 +02:00
Aleksandra Martyniuk
a21d3357c3 replica: pass tables_metadata to phased_barrier_top_10_counts 2023-07-25 16:13:00 +02:00
Aleksandra Martyniuk
8842bd87c3 replica: add methods to safely add and remove table 2023-07-25 16:13:00 +02:00
Aleksandra Martyniuk
52afd9d42d replica: wrap column families related maps into tables_metadata
As a preparation for ensuring access safety for column families
related maps, add tables_metadata, access to members of which
would be protected by rwlock.
2023-07-25 16:13:00 +02:00
Aleksandra Martyniuk
395ce87eff replica: futurize database::add_column_family and database::remove
As a preparation for further changes, database::add_column_family
and database::remove return future<>.
2023-07-25 16:13:00 +02:00
Raphael S. Carvalho
0ac43ea877 Fix stack-use-after-return in mutation source excluding staging
The new test detected a stack-use-after-return when using table's
as_mutation_source_excluding_staging() for range reads.

This doesn't really affect view updates that generate single
key reads only. So the problem was only stressed in the recently
added test. Otherwise, we'd have seen it when running dtests
(in debug mode) that stress the view update path from staging.

The problem happens because the closure was feeded into
a noncopyable_function that was taken by reference. For range
reads, we defer before subsequent usage of the predicate.
For single key reads, we only defer after finished using
the predicate.

Fix is about using sstable_predicate type, so there won't
be a need to construct a temporary object on stack.

Fixes #14812.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14813
2023-07-25 10:38:20 +03:00
Botond Dénes
a8feb7428d Merge 'semaphore mismatch: don't throw an error if both semaphores belong to user' from Michał Jadwiszczak
If semaphore mismatch occurs, check whether both semaphores belong
to user. If so, log a warning, log a `querier_cache_scheduling_group_mismatches` stat and drop cached reader instead of throwing an error.

Until now, semaphore mismatch was only checked in multi-partition queries.  The PR pushes the check to `querier_cache` and perform it on all `lookup_*_querier` methods.

The mismatch can happen if user's scheduling group changed during
a query. We don't want to throw an error then, but drop and reset
cached reader.

This patch doesn't solve a problem with mismatched semaphores because of changes in service levels/scheduling groups but only mitigate it.

Refers: https://github.com/scylladb/scylla-enterprise/issues/3182
Refers: https://github.com/scylladb/scylla-enterprise/issues/3050
Closes: #14770

Closes #14736

* github.com:scylladb/scylladb:
  querier_cache: add stats of scheduling group mismatches
  querier_cache: check semaphore mismatch during querier lookup
  querier_cache: add reference to `replica::database::is_user_semaphore()`
  replica:database: add method to determine if semaphore is user one
2023-07-24 14:13:09 +03:00
Kamil Braun
e6099c4685 Merge 'config: set schema_commitlog_segment_size_in_mb to 128 ' from Patryk Jędrzejczak
Fixes #14668

In #14668, we have decided to introduce a new `scylla.yaml` variable for the schema commitlog segment size and set it to 128MB. The reason is that segment size puts a limit on the mutation size that can be written at once, and some schema mutation writes are much larger than average, as shown in #13864. This `schema_commitlog_segment_size_in_mb variable` variable is now added to `scylla.yaml` and `db/config`.

Additionally,  we do not derive the commitlog sync period for schema commitlog anymore because schema commitlog runs in batch mode, so it doesn't need this parameter. It has also been discussed in #14668.

Closes #14704

* github.com:scylladb/scylladb:
  replica: do not derive the commitlog sync period for schema commitlog
  config: set schema_commitlog_segment_size_in_mb to 128
  config: add schema_commitlog_segment_size_in_mb variable
2023-07-24 10:23:34 +02:00
Michał Jadwiszczak
246728cbbb querier_cache: add stats of scheduling group mismatches
Add stats to count dropped queriers because of scheduling group
mismatch.
2023-07-21 19:05:55 +02:00
Michał Jadwiszczak
a5fc53aa11 querier_cache: check semaphore mismatch during querier lookup
Previously semaphore mismatch was checked only in multi-partition
queries and if happened, an internal error was thrown.

This commit pushed the check down to `querier_cache`, so each
`lookup_*_querier` method will check for the mismatch.

What's more, if semaphore mismatch occurs, check whether both semaphores belong
to user. If so, log a warning and drop cached reader instead of
throwing an error.

The mismatch can happen if user's scheduling group changed during
a query. We don't want to throw an error then, but drop and reset
cached reader.
2023-07-21 19:05:50 +02:00
Michał Jadwiszczak
e5c965b280 querier_cache: add reference to replica::database::is_user_semaphore() 2023-07-21 18:58:57 +02:00
Pavel Emelyanov
db1c6e2255 system_keyspace: Make save_truncation_record() non-static
... and stop using qctx

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-21 13:12:50 +03:00
Pavel Emelyanov
eaeffcdb81 code: Pass sharded<db::system_keyspace>& to database::truncate()
The arguments goes via the db::(drop|truncate)_table_on_all_shards()
pair of calls that start from

- storage_proxy::remote: has its sys.ks reference already
- schema_tables::merge_schema: has sys.ks argument already
- legacy_schema_migrator: the reference was added by previous patch
- tests: run in cql_test_env with sys.ks on board

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-21 13:11:59 +03:00
Botond Dénes
e62325babc Merge 'Compaction reshard task' from Aleksandra Martyniuk
Task manager tasks covering reshard compaction.

Reattempt on https://github.com/scylladb/scylladb/pull/14044. Bugfix for https://github.com/scylladb/scylladb/issues/14618 is squashed with 95191f4.
Regression test added.

Closes #14739

* github.com:scylladb/scylladb:
  test: add test for resharding with non-empty owned_ranges_ptr
  test: extend test_compaction_task.py to test resharding compaction
  compaction: add shard_reshard_sstables_compaction_task_impl
  compaction: invoke resharding on sharded database
  compaction: move run_resharding_jobs into reshard_sstables_compaction_task_impl::run()
  compaction: add reshard_sstables_compaction_task_impl
  compaction: create resharding_compaction_task_impl
2023-07-20 16:43:22 +03:00
Michał Jadwiszczak
d7a3aa2698 replica:database: add method to determine if semaphore is user one
Add method to compare semaphore with system ones (streaming, compaction,
system read) to be able if the semaphore belongs to a user.
2023-07-20 10:24:21 +02:00
Aleksandra Martyniuk
7a7e287d8c compaction: add reshard_sstables_compaction_task_impl
Add task manager's task covering resharding compaction.

A struct and some functions are moved from replica/distributed_loader.cc
to compaction/task_manager_module.cc.
2023-07-19 17:15:40 +02:00
Patryk Jędrzejczak
ee1c240f2a replica: do not derive the commitlog sync period for schema commitlog
We don't want to apply the value of the commitlog_sync_period_in_ms
variable to schema commitlog. Schema commitlog runs in batch mode,
so it doesn't need this parameter.
2023-07-19 14:16:50 +02:00
Patryk Jędrzejczak
5b167a4ad7 config: add schema_commitlog_segment_size_in_mb variable
In #14668, we have decided to introduce a new scylla.yaml variable
for the schema commitlog segment size. The segment size puts a limit
on the mutation size that can be written at once, and some schema
mutation writes are much larger than average, as shown in #13864.
Therefore, increasing the schema commitlog segment size is sometimes
necessary.
2023-07-19 14:16:41 +02:00
Pavel Emelyanov
312184c0c7 keys: Move exploded_clustering_prefix's operator<< to keys.cc
Now it sits in replicate/database.cc, but the latter is overloaded with
code, worth keeping less, all the more so the ..._prefix itself lives in
the keys.hh header.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14748
2023-07-19 11:57:27 +03:00
Avi Kivity
460b28d067 Merge 'Introduce SELECT MUTATION FRAGMENTS statement' from Botond Dénes
SELECT MUTATION FRAGMENTS is a new select statement sub-type, which allows dumping the underling mutations making up the data of a given table. The output of this statement is mutation-fragments presented as CQL rows. Each row corresponds to a mutation-fragment. Subsequently, the output of this statement has a schema that is different than that of the underlying table.  The output schema is derived from the table's schema, as following:
* The table's partition key is copied over as-is
* The clustering key is formed from the following columns:
    - mutation_source (text): the kind of the mutation source, one of: memtable, row-cache or sstable; and the identifier of the individual mutation source.
    - partition_region (int): represents the enum with the same name.
    - the copy of the table's clustering columns
    - position_weight (int): -1, 0 or 1, has the same meaning as that in position_in_partition, used to disambiguate range tombstone changes with the same clustering key, from rows and from each other.
* The following regular columns:
    - metadata (text): the JSON representation of the mutation-fragment's metadata.
    - value (text): the JSON representation of the mutation-fragment's value.

Data is always read from the local replica, on which the query is executed. Migrating queries between coordinators is frobidden.

More details in the documentation commit (last commit).

Example:
```cql
cqlsh> CREATE TABLE ks.tbl (pk int, ck int, v int, PRIMARY KEY (pk, ck));

cqlsh> DELETE FROM ks.tbl WHERE pk = 0;
cqlsh> DELETE FROM ks.tbl WHERE pk = 0 AND ck > 0 AND ck < 2;
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (0, 0, 0);
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (0, 1, 0);
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (0, 2, 0);
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (1, 0, 0);
cqlsh> SELECT * FROM ks.tbl;

 pk | ck | v
----+----+---
  1 |  0 | 0
  0 |  0 | 0
  0 |  1 | 0
  0 |  2 | 0

(4 rows)
cqlsh> SELECT * FROM MUTATION_FRAGMENTS(ks.tbl);

 pk | mutation_source | partition_region | ck | position_weight | metadata                                                                                                                 | mutation_fragment_kind | value
----+-----------------+------------------+----+-----------------+--------------------------------------------------------------------------------------------------------------------------+------------------------+-----------
  1 |      memtable:0 |                0 |    |                 |                                                                                                         {"tombstone":{}} |        partition start |      null
  1 |      memtable:0 |                2 |  0 |               0 | {"marker":{"timestamp":1688122873341627},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122873341627}}} |         clustering row | {"v":"0"}
  1 |      memtable:0 |                3 |    |                 |                                                                                                                     null |          partition end |      null
  0 |      memtable:0 |                0 |    |                 |                                      {"tombstone":{"timestamp":1688122848686316,"deletion_time":"2023-06-30 11:00:48z"}} |        partition start |      null
  0 |      memtable:0 |                2 |  0 |               0 | {"marker":{"timestamp":1688122860037077},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122860037077}}} |         clustering row | {"v":"0"}
  0 |      memtable:0 |                2 |  0 |               1 |                                      {"tombstone":{"timestamp":1688122853571709,"deletion_time":"2023-06-30 11:00:53z"}} | range tombstone change |      null
  0 |      memtable:0 |                2 |  1 |               0 | {"marker":{"timestamp":1688122864641920},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122864641920}}} |         clustering row | {"v":"0"}
  0 |      memtable:0 |                2 |  2 |              -1 |                                                                                                         {"tombstone":{}} | range tombstone change |      null
  0 |      memtable:0 |                2 |  2 |               0 | {"marker":{"timestamp":1688122868706989},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122868706989}}} |         clustering row | {"v":"0"}
  0 |      memtable:0 |                3 |    |                 |                                                                                                                     null |          partition end |      null

(10 rows)
```

Perf simple query:
```
/build/release/scylla perf-simple-query -c1 -m2G --duration=60
```

Before:
```
median 141596.39 tps ( 62.1 allocs/op,  13.1 tasks/op,   43688 insns/op,        0 errors)
median absolute deviation: 137.15
maximum: 142173.32
minimum: 140492.37
```
After:
```
median 141889.95 tps ( 62.1 allocs/op,  13.1 tasks/op,   43692 insns/op,        0 errors)
median absolute deviation: 167.04
maximum: 142380.26
minimum: 141025.51
```

Fixes: https://github.com/scylladb/scylladb/issues/11130

Closes #14347

* github.com:scylladb/scylladb:
  docs/operating-scylla/admin-tools: add documentation for the SELECT * FROM MUTATION_FRAGMENTS() statement
  test/topology_custom: add test_select_from_mutation_fragments.py
  test/boost/database_test: add test for mutation_dump/generate_output_schema_from_underlying_schema
  test/cql-pytest: add test_select_mutation_fragments.py
  test/cql-pytest: move scylla_data_dir fixture to conftest.py
  cql3/statements: wire-in mutation_fragments_select_statement
  cql3/restrictions/statement_restrictions: fix indentation
  cql3/restrictions/statement_restrictions: add check_indexes flag
  cql3/statments/select_statement: add mutation_fragments_select_statement
  cql3: add SELECT MUTATION FRAGMENTS select statement sub-type
  service/pager: allow passing a query functor override
  service/storage_proxy: un-embed coordinator_query_options
  replica: add mutation_dump
  replica: extract query_state into own header
  replica/table: add make_nonpopulating_cache_reader()
  replica/table: add select_memtables_as_mutation_sources()
  tools,mutation: extract the low-level json utilities into mutation/json.hh
  tools/json_writer: fold SstableKey() overloads into callers
  tools/json_writer: allow writing metadata and value separately
  tools/json_writer: split mutation_fragment_json_writer in two classes
  tools/json_writer: allow passing custom std::ostream to json_writer
2023-07-19 11:54:11 +03:00
Botond Dénes
a507ff5d88 replica: add mutation_dump
This file contains facilities to dump the underlying mutations contained
in various mutation sources -- like memtable, cache and sstables -- and
return them as query results. This can be used with any table on the
system. The output represents the mutation fragments which make up said
mutations, and it will be generated according to a schema, which is a
transformation of the table's schema.
This file provides a method, which can be used to implement the backend
of a select-statement: it has a similar signature to regular query
methods.
2023-07-19 01:28:28 -04:00