Commit Graph

120 Commits

Author SHA1 Message Date
Kefu Chai
f5f5ff1d51 clustering_interval_set: add fmt::formatter for clustering_interval_set
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for `clustering_interval_set`

their operator<<:s are dropped

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17593
2024-03-08 15:13:14 +02:00
Kefu Chai
0fd85a98a9 mutation: add fmt::formatter for position_range
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for `position_range`, and the
helpers for printing related types are dropped.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-02-26 20:15:57 +08:00
Kefu Chai
2f532b9ebc mutation: add fmt::formatter for mutation_fragment and range_tombstone_stream
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for

* mutation_fragment
* range_tombstone_stream

their operator<<:s are dropped

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-02-26 20:15:57 +08:00
Kefu Chai
1fe7a467e7 mutation: add fmt::formatter for mutation_fragment_v2::printer
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated formatter.

in this change, we define formatters for mutation_fragment_v2::printer

Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-02-26 17:47:05 +08:00
Nadav Har'El
b4cef638ef Merge 'mutation: add fmt::formatter for mutation types' from Kefu Chai
before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter.

in this change, we define formatters for

* canonical_mutation
* atomic_cell_view
* atomic_cell
* atomic_cell_or_collection::printer

Refs #13245

Closes scylladb/scylladb#17506

* github.com:scylladb/scylladb:
  mutation: add fmt::formatter for canonical_mutation
  mutation: add fmt::formatter for atomic_cell_view and atomic_cell
  mutation: add fmt::formatter for atomic_cell_or_collection::printer
2024-02-25 09:48:56 +02:00
Kefu Chai
84ba624415 mutation: add fmt::formatter for canonical_mutation
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for canonical_mutation

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-02-25 12:48:13 +08:00
Kefu Chai
3625796222 mutation: add fmt::formatter for atomic_cell_view and atomic_cell
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for

* atomic_cell_view
* atomic_cell

and drop their operator<<:s.

Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-02-25 12:19:11 +08:00
Kefu Chai
b4fa32ec17 mutation: add fmt::formatter for atomic_cell_or_collection::printer
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for
`atomic_cell_or_collection::printer`, and drop its operator<<.

Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-02-25 12:18:41 +08:00
Nadav Har'El
0aaa6b1a08 fmt: add formatter for mutation_fragment_v2::kind
Unfortunately, fmt v10 dropped support for operator<< formatters,
forcing us to replace the huge number of operator<< implementations
in our code by uglier and templated fmt::formatter implementations
to get Scylla to compile on modern distros (such as Fedora 39) :-(

Kefu has already started doing this migration, here is my small
contribution - the formatter for mutation_fragment_v2::kind.
This patch is need to compile, for example,
build/dev/mutation/mutation_fragment_stream_validator.o.

I can't remove the old operator<< because it's still used by
the implementation of other operator<< functions. We can remove
all of them when we're done with this coversion. In the meantime,
I replaced the original implementation of operator<< by a trivial
implementation just passing the work to the new fmt::print support.

Refs #13245

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17432
2024-02-23 10:25:39 +02:00
Kefu Chai
37c6073fd5 mutation: add fmt::formatter for clustering_row and friends
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for

* clustering_row::printer
* static_row::printer
* partition_start
* partition_end
* mutation_fragment::printer

and drop their operator<<:s

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-02-22 17:53:34 +08:00
Kefu Chai
b61b5a8b5d mutation: add fmt::formatter for row_tombstone and friends
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for

* row_tombstone
* row_marker
* deletable_row::printer
* row::printer

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-02-22 12:44:33 +08:00
Kefu Chai
acefde0735 mutation: add fmt::formatter for mutation_partition::printer
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for `mutation_partition::printer`,
and drop its operator<<.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17419
2024-02-20 09:01:22 +02:00
Kefu Chai
b309e42195 collection_mutation: add formatter for collection_mutation_view::printer
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for
`collection_mutation_view::printer`, and drop its
operator<<.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17300
2024-02-13 17:42:25 +02:00
Botond Dénes
120442231f Merge 'row_cache: test cache consistency during multi-partition cache updates' from Michał Chojnowski
Adds a test reproducing https://github.com/scylladb/scylladb/issues/16759, and the instrumentation needed for it.

Closes scylladb/scylladb#17208

* github.com:scylladb/scylladb:
  row_cache_test: test cache consistency during memtable-to-cache merge
  row_cache: use preemption_source in update()
  utils: preempt: add preemption_source
2024-02-13 17:37:06 +02:00
Kefu Chai
54ed65bb50 mutation: s/statics/static content/
codespell reports that "statics" could be the misspelling of
"statistics". but "static" here means the static column(s). so
replace "static" with more specific wording.

Refs #589
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17216
2024-02-13 17:33:21 +02:00
Michał Chojnowski
bed20a2e37 row_cache: use preemption_source in update()
To facilitate testing the state of cache after the update is preempted
at various points, pass a preemption_source& to update() instead of
calling the reactor directly.

In release builds, the calls to preemption_source methods should compile
to the same direct reactor calls as today. Only in dev mode they should
add an extra branch. (However, the `preemption_source&` argument has
to be shoveled in any mode).
2024-02-07 18:31:36 +01:00
Kefu Chai
b931d93668 treewide: fix misspellings in code comments
these misspellings are identified by codespell.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17004
2024-01-31 09:16:10 +02:00
Patryk Wrobel
ba488b10ec mutation/mutation.hh: remove anonymous namespace from header
Anonymous namespace implies internal linkage for its members.
When it is defined in a header, then each translation unit,
which includes such header defines its own unique instance
of members of the unnamed namespace that are ODR-used within
that translation unit.

This can lead to unexpected results including code bloat
or undefined behavior due to ODR violations.

This change aligns the code with the following guidelines:
 - CppCoreGuidelines: "SF.21: Don’t use an unnamed (anonymous)
                       namespace in a header"
 - SEI CERT C++: "DCL59-CPP. Do not define an unnamed namespace
                  in a header file"

Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
2024-01-26 08:38:39 +01:00
Kefu Chai
d1dd71fbd7 mutation: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16889
2024-01-21 16:58:26 +02:00
Kefu Chai
dd496afff3 mutation: add formatter for {atomic_cell_view,atomic_cell}::printer
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define a formatter for `atomic_cell_view::printer`
and `atomic_cell::printer` respectively, and remove their operator<<().

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16602
2024-01-02 16:14:42 +02:00
Yaniv Kaul
ae2ab6000a Typos: fix typos in code
Fixes some more typos as found by codespell run on the code.
In this commit, there are more user-visible errors.

Refs: https://github.com/scylladb/scylladb/issues/16255
2023-12-05 15:18:11 +02:00
Yaniv Kaul
c658bdb150 Typos: fix typos in comments
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.

Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2023-12-02 22:37:22 +02:00
Botond Dénes
65e42e4166 Merge 'mutation_query: properly send range tombstones in reverse queries' from Michał Chojnowski
reconcilable_result_builder passes range tombstone changes to _rt_assembler
using table schema, not query schema.
This means that a tombstone with bounds (a; b), where a < b in query schema
but a > b in table schema, will not be emitted from mutation_query.

This is a very serious bug, because it means that such tombstones in reverse
queries are not reconciled with data from other replicas.
If *any* queried replica has a row, but not the range tombstone which deleted
the row, the reconciled result will contain the deleted row.

In particular, range deletes performed while a replica is down will not
later be visible to reverse queries which select this replica, regardless of the
consistency level.

As far as I can see, this doesn't result in any persistent data loss.
Only in that some data might appear resurrected to reverse queries,
until the relevant range tombstone is fully repaired.

This series fixes the bug and adds a minimal reproducer test.

Fixes #10598

Closes scylladb/scylladb#16003

* github.com:scylladb/scylladb:
  mutation_query_test: test that range tombstones are sent in reverse queries
  mutation_query: properly send range tombstones in reverse queries
2023-11-21 09:19:14 +02:00
Michał Chojnowski
9ccd4ea416 partition_version: fix violation of "older versions are evicted first" during schema upgrades
A schema upgrade appends a MVCC version B after an existing version A.

The last dummy in B is added to the front of LRU,
so it will be evicted after the entries in A.

This alone doesn't quite violate the "older versions are evicted first" rule,
because the new last dummy carries no information. But apply_monotonically
generally assumes that entries on the same position have the obvious
eviction order, even if they carry no information. Thus, after the merge,
the rule can become broken.

The proposed fix is as follows:

- In the case where A is merged into B, the merged last dummy
  inherits the link of A.
- The merging of B into anything is prevented until its merge with A is finished.

This is relatively hacky, because it still involves a state that
goes against some natural expectations granted by the "older versions..."
rule. A less hacky fix would be to ensure that the new dummy is inserted
into a proper place in the eviction order to begin with.

Or, better yet, we could eliminate the rule altogether.
Aside from being very hard to maintain, it also prevents the introduction
of any eviction algorithm other than LRU.
2023-11-16 19:01:18 +01:00
Kefu Chai
efd65aebb2 build: cmake: add check-header target
to have feature parity with `configure.py`. we won't need this
once we migrate to C++20 modules. but before that day comes, we
need to stick with C++ headers.

we generate a rule for each .hh files to create a corresponding
.cc and then compile it, in order to verify the self-containness of
that header. so the number of rule is quite large, to avoid the
unnecessary overhead. the check-header target is enabled only if
`Scylla_CHECK_HEADERS` option is enabled.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15913
2023-11-13 10:27:06 +02:00
Michał Chojnowski
002357e238 mutation_query: properly send range tombstones in reverse queries
reconcilable_result_builder passes range tombstone changes to _rt_assembler
using table schema, not query schema.
This means that a tombstone with bounds (a; b), where a < b in query schema
but a > b in table schema, are not be emitted from mutation_query.

This is a very serious bug, because it means that such tombstones in reverse
queries are not reconciled with data from other replicas.
If *any* queried replica has a row, but not the range tombstone which deleted
the row, the reconciled result will contain the deleted row.

In particular, range deletes performed while a replica is down, will not
later be visible to reverse queries which select this replica, regardless of the
consistency level.

As far as I can see, this doesn't result in any persistent data loss.
Only in that some data might appear resurrected to reverse queries,
until the relevant range tombstone is fully repaired.
2023-11-08 14:54:48 +01:00
Benny Halevy
a1acf6854b everywhere: reduce dependencies on i_partitioner.hh
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-11-05 20:47:44 +02:00
Michał Chojnowski
93ea3d41d8 position_in_partition: make operator= exception-safe
The copy assignment operator of _ck can throw
after _type and _bound_weight have already been changed.
This leaves position_in_partition in an inconsistent state,
potentially leading to various weird symptoms.

The problem was witnessed by test_exception_safety_of_reads.
Specifically: in cache_flat_mutation_reader::add_to_buffer,
which requires the assignment to _lower_bound to be exception-safe.

The easy fix is to perform the only potentially-throwing step first.

Fixes #15822

Closes scylladb/scylladb#15864
2023-10-29 18:30:32 +02:00
Botond Dénes
9231454acd mutation/json: extract generic streaming writer into utils/rjson.hh
This writer is generally useful, not just for writing mutations as json.
Make it generally available as well.
2023-10-20 10:04:56 -04:00
Kefu Chai
9bc0a9f95e mutation: do not include unused header
the `utils::UUID` class is not used by the implementation of
`canonical_mutation`, so let's remove the include from this source file.

the `#include` was originally added in
5a353486c6, but that commit did
add any code using UUID to this file.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15731
2023-10-17 20:38:07 +03:00
Tomasz Grabiec
8b7623f49e database: Fix accounting of small partitions in mutation query
The partition key size was ignored by the accounter, as well as the
partition tombstone. As a result, a sequence of partitions with just
tombstones would be accounted as taking no memory and page size
limitter to not kick in.

Fix by accounting the real size of accumulated frozen_mutation.

Also, break pages across partitions even if there are no live rows.
The coordinator can handle it now.

Refs #7933
2023-09-22 02:53:14 -04:00
Tomasz Grabiec
17c1cad4b4 database, storage_proxy: Reconcile pages with no live rows incrementally
Currently, mutation query on replica side will not respond with a result
which doesn't have at least one live row. This causes problems if there
is a lot of dead rows or partitions before we reach a live row, which
stems from the fact that resulting reconcilable_result will be large:

* Large allocations. Serialization of reconcilable_result causes large
  allocations for storing result rows in std::deque
* Reactor stalls. Serialization of reconcilable_result on the replica
  side and on the coordinator side causes reactor stalls. This impacts
  not only the query at hand. For 1M dead rows, freezing takes 130ms,
  unfreezing takes 500ms. Coordinator does multiple freezes and
  unfreezes. The reactor stall on the coordinator side is >5s.
* Large repair mutations. If reconciliation works on large pages, repair
  may fail due to too large mutation size. 1M dead rows is already too
  much: Refs #9111.

This patch fixes all of the above by making mutation reads respect the
memory accounter's limit for the page size, even for dead rows.

This patch also addresses the problem of client-side timeouts during
paging. Reconciling queries processing long strings of tombstones will
now properly page tombstones,like regular queries do.

My testing shows that this solution even increases efficiency. I tested
with a cluster of 2 nodes, and a table of RF=2. The data layout was as
follows (1 partition):

    Node1: 1 live row, 1M dead rows
    Node2: 1M dead rows, 1 live row

This was designed to trigger reconciliation right from the very start of
the query.

Before:

Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]

After:

Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]

Non-reconciling queries have almost identical duration (1 few ms changes
can be observed between runs). Note how in the after case, the
reconciling read also produces 100 pages, vs. just 2 pages in the before
case, leading to a much lower duration (less than 1/4 of the before).

Refs #7929
Refs #3672
Refs #7933
Fixes #9111
2023-09-22 02:53:14 -04:00
Botond Dénes
7e7101c180 Revert "Merge 'database, storage_proxy: Reconcile pages with dead rows and partitions incrementally' from Botond Dénes"
This reverts commit 628e6ffd33, reversing
changes made to 45ec76cfbf.

The test included with this PR is flaky and often breaks CI.
Revert while a fix is found.

Fixes: #15371
2023-09-13 10:45:37 +03:00
Tomasz Grabiec
0d773c9f9f database: Fix accounting of small partitions in mutation query
The partition key size was ignored by the accounter, as well as the
partition tombstone. As a result, a sequence of partitions with just
tombstones would be accounted as taking no memory and page size
limitter to not kick in.

Fix by accounting the real size of accumulated frozen_mutation.

Also, break pages across partitions even if there are no live rows.
The coordinator can handle it now.

Refs #7933
2023-09-11 06:56:13 -04:00
Tomasz Grabiec
2c8a0e4175 database, storage_proxy: Reconcile pages with no live rows incrementally
Currently, mutation query on replica side will not respond with a result
which doesn't have at least one live row. This causes problems if there
is a lot of dead rows or partitions before we reach a live row, which
stems from the fact that resulting reconcilable_result will be large:

* Large allocations. Serialization of reconcilable_result causes large
  allocations for storing result rows in std::deque
* Reactor stalls. Serialization of reconcilable_result on the replica
  side and on the coordinator side causes reactor stalls. This impacts
  not only the query at hand. For 1M dead rows, freezing takes 130ms,
  unfreezing takes 500ms. Coordinator does multiple freezes and
  unfreezes. The reactor stall on the coordinator side is >5s.
* Large repair mutations. If reconciliation works on large pages, repair
  may fail due to too large mutation size. 1M dead rows is already too
  much: Refs #9111.

This patch fixes all of the above by making mutation reads respect the
memory accounter's limit for the page size, even for dead rows.

This patch also addresses the problem of client-side timeouts during
paging. Reconciling queries processing long strings of tombstones will
now properly page tombstones,like regular queries do.

My testing shows that this solution even increases efficiency. I tested
with a cluster of 2 nodes, and a table of RF=2. The data layout was as
follows (1 partition):

    Node1: 1 live row, 1M dead rows
    Node2: 1M dead rows, 1 live row

This was designed to trigger reconciliation right from the very start of
the query.

Before:

Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]

After:

Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]

Non-reconciling queries have almost identical duration (1 few ms changes
can be observed between runs). Note how in the after case, the
reconciling read also produces 100 pages, vs. just 2 pages in the before
case, leading to a much lower duration (less than 1/4 of the before).

Refs #7929
Refs #3672
Refs #7933
Fixes #9111
2023-09-11 06:56:13 -04:00
Tomasz Grabiec
4e9d95d78c Merge 'Compact data before streaming' from Botond Dénes
Currently, streaming and repair processes and sends data as-is. This is wasteful: streaming might be sending data which is expired or covered by tombstones, taking up valuable bandwidth and processing time. Repair additionally could be exposed to artificial differences, due to different nodes being in different states of compactness.
This PR adds opt-in compaction to `make_streaming_reader()`, then opts in all users. The main difference being in how these choose the current compaction time to use:
* Load'n'stream and streaming uses the current time on the local node.
* Repair uses a centrally chosen compaction time, generated on the repair master and propagated to al repair followers. This is to ensure all repair participants work with the exact state of compactness.

 Importantly, this compaction does *not* purge tombstones (tombstone GC is disabled completely).

Fixes: https://github.com/scylladb/scylladb/issues/3561

Closes #14756

* github.com:scylladb/scylladb:
  replica: make_[multishard_]streaming_reader(): make compaction_time mandatory
  repair/row_level: opt in to compacting the stream
  streaming: opt-in to compacting the stream
  sstables_loader: opt-in for compacting the stream
  replica/table: add optional compacting to make_multishard_streaming_reader()
  replica/table: add optional compacting to make_streaming_reader()
  db/config: add config item for enabling compaction for streaming and repair
  repair: log the error which caused the repair to fail
  readers: compacting_reader: use compact_mutation_state::abandon_current_partition()
  mutation/mutation_compactor: allow user to abandon current partition
2023-07-28 16:42:13 +02:00
Botond Dénes
7351c8424d mutation/mutation_rebuilder: add comment about validity of returned mutation reference
Closes #14853
2023-07-27 12:13:46 +03:00
Botond Dénes
326c3b92e5 mutation/mutation_compactor: allow user to abandon current partition
Currently, the compactor requires a valid stream and thus abandoning a
partition in the middle was not possible. This causes some complications
for the compacting reader, which implements methods such as
`next_partition()` which is possibly called in the middle of a
partition. In this case the compacting reader attempts to close the
partition properly by inserting a synthetic partition-end fragment into
the stream. This is not enough however as it doesn't close any range
tombstone changes that might be active. Instead of piling on more
complexity, add an API to the compactor which allows abandoning the
current partition.
2023-07-27 02:50:44 -04:00
Botond Dénes
fda4168300 mutation/mutation_rebuilder*: return const mutation& from consume_new_partition()
To allow const access to the mutation under construction, e.g. so the
user can query its size.
2023-07-25 10:34:31 -04:00
Botond Dénes
e6fa21d1b3 mutation/mutation: add memory_usage() 2023-07-25 10:34:30 -04:00
Nadav Har'El
5860820934 Merge 'mutation/mutation_compactor: validate the input stream' from Botond Dénes
The mutation compactor has a validator which it uses to validate the stream of mutation fragments that passes through it. This validator is supposed to validate the stream as it enters the compactor, as opposed to its compacted form (output). This was true for most fragment kinds except range tombstones, as purged range tombstones were not visible to
the validator for the most part.

This mistake was introduced by https://github.com/scylladb/scylladb/commit e2c9cdb576, which itself was a flawed attempt at fixing an error seen because purged tombstones were not terminated by the compactor.

This patch corrects this mistake by fixing the above problem properly: on page-cut, if the validator has an active tombstone, a closing tombstone is generated for it, to avoid the false-positive error. With this, range tombstones can be validated again as they come in.

The existing unit test checking the validation in the compactor is greatly expanded to check all (I hope) different validation scenarios.

Closes #13817

* github.com:scylladb/scylladb:
  test/mutation_test: test_compactor_validator_sanity_test
  mutation/mutation_compactor: fix indentation
  mutation/mutation_compactor: validate the input stream
  mutation: mutation_fragment_stream_validating_filter: add accessor to underlying validator
  readers: reader-from-fragment: don't modify stream when created without range
2023-07-21 00:26:46 +03:00
Botond Dénes
18ed94e60b mutation/mutation_compactor: fix indentation
Left broken by the previous patch.
2023-07-20 08:48:50 -04:00
Botond Dénes
3d5b70e0d7 mutation/mutation_compactor: validate the input stream
The mutation compactor has a validator which it uses to validate the
stream of mutation fragments that passes through it. This validator is
supposed to validate the stream as it enters the compactor, as opposed
to its compacted form (output). This was true for most fragment kinds
except range tombstones, as purged range tombstones were not visible to
the validator for the most part.
This mistake was introduced by e2c9cdb576, which itself was a flawed
attempt at fixing an error seen because purged tombstones were not
terminated by the compactor.
This patch corrects this mistake by fixing the above problem properly:
on page-cut, if the validator has an active tombstone, a closing
tombstone is generated for it, to avoid the false-positive error. With
this, range tombstones can be validated again as they come in.
2023-07-20 08:48:50 -04:00
Botond Dénes
dbb2a6f03a mutation: mutation_fragment_stream_validating_filter: add accessor to underlying validator 2023-07-20 08:48:50 -04:00
Avi Kivity
460b28d067 Merge 'Introduce SELECT MUTATION FRAGMENTS statement' from Botond Dénes
SELECT MUTATION FRAGMENTS is a new select statement sub-type, which allows dumping the underling mutations making up the data of a given table. The output of this statement is mutation-fragments presented as CQL rows. Each row corresponds to a mutation-fragment. Subsequently, the output of this statement has a schema that is different than that of the underlying table.  The output schema is derived from the table's schema, as following:
* The table's partition key is copied over as-is
* The clustering key is formed from the following columns:
    - mutation_source (text): the kind of the mutation source, one of: memtable, row-cache or sstable; and the identifier of the individual mutation source.
    - partition_region (int): represents the enum with the same name.
    - the copy of the table's clustering columns
    - position_weight (int): -1, 0 or 1, has the same meaning as that in position_in_partition, used to disambiguate range tombstone changes with the same clustering key, from rows and from each other.
* The following regular columns:
    - metadata (text): the JSON representation of the mutation-fragment's metadata.
    - value (text): the JSON representation of the mutation-fragment's value.

Data is always read from the local replica, on which the query is executed. Migrating queries between coordinators is frobidden.

More details in the documentation commit (last commit).

Example:
```cql
cqlsh> CREATE TABLE ks.tbl (pk int, ck int, v int, PRIMARY KEY (pk, ck));

cqlsh> DELETE FROM ks.tbl WHERE pk = 0;
cqlsh> DELETE FROM ks.tbl WHERE pk = 0 AND ck > 0 AND ck < 2;
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (0, 0, 0);
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (0, 1, 0);
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (0, 2, 0);
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (1, 0, 0);
cqlsh> SELECT * FROM ks.tbl;

 pk | ck | v
----+----+---
  1 |  0 | 0
  0 |  0 | 0
  0 |  1 | 0
  0 |  2 | 0

(4 rows)
cqlsh> SELECT * FROM MUTATION_FRAGMENTS(ks.tbl);

 pk | mutation_source | partition_region | ck | position_weight | metadata                                                                                                                 | mutation_fragment_kind | value
----+-----------------+------------------+----+-----------------+--------------------------------------------------------------------------------------------------------------------------+------------------------+-----------
  1 |      memtable:0 |                0 |    |                 |                                                                                                         {"tombstone":{}} |        partition start |      null
  1 |      memtable:0 |                2 |  0 |               0 | {"marker":{"timestamp":1688122873341627},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122873341627}}} |         clustering row | {"v":"0"}
  1 |      memtable:0 |                3 |    |                 |                                                                                                                     null |          partition end |      null
  0 |      memtable:0 |                0 |    |                 |                                      {"tombstone":{"timestamp":1688122848686316,"deletion_time":"2023-06-30 11:00:48z"}} |        partition start |      null
  0 |      memtable:0 |                2 |  0 |               0 | {"marker":{"timestamp":1688122860037077},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122860037077}}} |         clustering row | {"v":"0"}
  0 |      memtable:0 |                2 |  0 |               1 |                                      {"tombstone":{"timestamp":1688122853571709,"deletion_time":"2023-06-30 11:00:53z"}} | range tombstone change |      null
  0 |      memtable:0 |                2 |  1 |               0 | {"marker":{"timestamp":1688122864641920},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122864641920}}} |         clustering row | {"v":"0"}
  0 |      memtable:0 |                2 |  2 |              -1 |                                                                                                         {"tombstone":{}} | range tombstone change |      null
  0 |      memtable:0 |                2 |  2 |               0 | {"marker":{"timestamp":1688122868706989},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122868706989}}} |         clustering row | {"v":"0"}
  0 |      memtable:0 |                3 |    |                 |                                                                                                                     null |          partition end |      null

(10 rows)
```

Perf simple query:
```
/build/release/scylla perf-simple-query -c1 -m2G --duration=60
```

Before:
```
median 141596.39 tps ( 62.1 allocs/op,  13.1 tasks/op,   43688 insns/op,        0 errors)
median absolute deviation: 137.15
maximum: 142173.32
minimum: 140492.37
```
After:
```
median 141889.95 tps ( 62.1 allocs/op,  13.1 tasks/op,   43692 insns/op,        0 errors)
median absolute deviation: 167.04
maximum: 142380.26
minimum: 141025.51
```

Fixes: https://github.com/scylladb/scylladb/issues/11130

Closes #14347

* github.com:scylladb/scylladb:
  docs/operating-scylla/admin-tools: add documentation for the SELECT * FROM MUTATION_FRAGMENTS() statement
  test/topology_custom: add test_select_from_mutation_fragments.py
  test/boost/database_test: add test for mutation_dump/generate_output_schema_from_underlying_schema
  test/cql-pytest: add test_select_mutation_fragments.py
  test/cql-pytest: move scylla_data_dir fixture to conftest.py
  cql3/statements: wire-in mutation_fragments_select_statement
  cql3/restrictions/statement_restrictions: fix indentation
  cql3/restrictions/statement_restrictions: add check_indexes flag
  cql3/statments/select_statement: add mutation_fragments_select_statement
  cql3: add SELECT MUTATION FRAGMENTS select statement sub-type
  service/pager: allow passing a query functor override
  service/storage_proxy: un-embed coordinator_query_options
  replica: add mutation_dump
  replica: extract query_state into own header
  replica/table: add make_nonpopulating_cache_reader()
  replica/table: add select_memtables_as_mutation_sources()
  tools,mutation: extract the low-level json utilities into mutation/json.hh
  tools/json_writer: fold SstableKey() overloads into callers
  tools/json_writer: allow writing metadata and value separately
  tools/json_writer: split mutation_fragment_json_writer in two classes
  tools/json_writer: allow passing custom std::ostream to json_writer
2023-07-19 11:54:11 +03:00
Botond Dénes
665f69b80d tools,mutation: extract the low-level json utilities into mutation/json.hh
Soon, we will want to convert mutation fragments into json inside the
scylla codebase, not just in tools. To avoid scylla-core code having to
include tools/ (and link against it), move the low-level json utilities
into mutation/.
2023-07-19 01:28:28 -04:00
Kefu Chai
bab16eb30e treewide: remove #includes not use directly
for faster build times and clear inter-module dependencies, we
should not #includes headers not directly used. instead, we should
only #include the headers directly used by a certain compilation
unit.

in this change, the source files under "/compaction" directories
are checked using clangd, which identifies the cases where we have
an #include which is not directly used. all the #includes identified
by clangd are removed. because some source files rely on the incorrectly
included header file, those ones are updated to #include the header
file they directly use.

if a forward declaration suffice, the declaration is added instead.

see also https://clangd.llvm.org/guides/include-cleaner#unused-include-warning

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-18 17:36:31 +08:00
Michał Chojnowski
5ad0846bff view: fix range tombstone handling on flushes in view_updating_consumer
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of
mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to
convert the fragment stream into `mutation`s. This is done by piping
the stream to mutation_rebuilder_v2.

To keep memory usage limited, the stream for a single partition might
have to be split into multiple partial `mutation` objects.
view_update_consumer does that, but in improper way -- when the
split/flush happens inside an active range tombstone, the range
tombstone isn't closed properly. This is illegal, and triggers an
internal error.

This patch fixes the problem by closing the active range tombstone
(and reopening in the same position in the next `mutation` object).

The tombstone is closed just after the last seen clustered position.
This is not necessary for correctness -- for example we could delay
all processing of the range tombstone until we see its end
bound -- but it seems like the most natural semantic.

Fixes #14503
2023-07-04 20:33:21 +02:00
Tomasz Grabiec
e48ec6fed3 db, storage_proxy: Drop mutation/frozen_mutation ::shard_of()
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
87b4606cd6 Merge 'atomic_cell: compare value last' from Benny Halevy
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
e225c88a65 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times.  If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.

Fixes #14182

Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182

Closes #14183

* github.com:scylladb/scylladb:
  docs: dml: add update ordering section
  cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
  mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
  atomic_cell: compare_atomic_cell_for_merge: update and add documentation
  compare_atomic_cell_for_merge: compare value last for live cells
  mutation_test: test_cell_ordering: improve debuggability
2023-06-20 12:11:48 +02:00