Commit Graph

30 Commits

Author SHA1 Message Date
Avi Kivity
2fbd78c769 feature: grandfather DIGEST_FOR_NULL_VALUES
The DIGEST_FOR_NULL_VALUES feature was added in 21a77612b3 (2020; 4.4)
and can now be assumed to be always present. The hasher which it invoked
is removed.
2024-05-18 00:24:00 +03:00
Benny Halevy
a126160d7e frozen_mutation: move unfreeze_gently to async_utils
Unfreeze_gently doesn't have to be a method of
frozen_mutation.  It might as well be implemented as
a free function reading from a frozen_mutation
and preparing a mutation gently.

The logic will be used in a later patch
to make a canonical mutation directly from
a frozen_mutation instead of unfreezing it
and then converting it to a canonical_mutation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 19:27:56 +03:00
Benny Halevy
15e8ecb670 mutation_partition: apply_monotonically: do not support schema upgrade
Currently, if the input mutation_partition requires
schema upgrade, apply_monotonically always silently reverts to
being non-preemptible, even if the caller passed is_preemptible::yes.

To prevent that from happening, put the burden of upgrading
the mutation_partition schem on the caller, which is
today the apply() methods, which are synchronous anyhow.

With that, we reduce the proliferation of the
`apply_monotonically` overloads and keep only the
low level one (which could potentially be private as well,
as it's called only from within the mutation/ source files
and from tests)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 18:42:41 +03:00
Pavel Emelyanov
d90db016bf treewide: Use partition_slice::is_reversed()
Continuation of cc56a971e8, more noisy places detected

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#17763
2024-03-13 08:52:46 +02:00
Kefu Chai
b4fa32ec17 mutation: add fmt::formatter for atomic_cell_or_collection::printer
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for
`atomic_cell_or_collection::printer`, and drop its operator<<.

Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-02-25 12:18:41 +08:00
Kefu Chai
acefde0735 mutation: add fmt::formatter for mutation_partition::printer
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for `mutation_partition::printer`,
and drop its operator<<.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17419
2024-02-20 09:01:22 +02:00
Kefu Chai
54ed65bb50 mutation: s/statics/static content/
codespell reports that "statics" could be the misspelling of
"statistics". but "static" here means the static column(s). so
replace "static" with more specific wording.

Refs #589
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17216
2024-02-13 17:33:21 +02:00
Kefu Chai
b931d93668 treewide: fix misspellings in code comments
these misspellings are identified by codespell.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17004
2024-01-31 09:16:10 +02:00
Kefu Chai
d1dd71fbd7 mutation: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16889
2024-01-21 16:58:26 +02:00
Michał Chojnowski
002357e238 mutation_query: properly send range tombstones in reverse queries
reconcilable_result_builder passes range tombstone changes to _rt_assembler
using table schema, not query schema.
This means that a tombstone with bounds (a; b), where a < b in query schema
but a > b in table schema, are not be emitted from mutation_query.

This is a very serious bug, because it means that such tombstones in reverse
queries are not reconciled with data from other replicas.
If *any* queried replica has a row, but not the range tombstone which deleted
the row, the reconciled result will contain the deleted row.

In particular, range deletes performed while a replica is down, will not
later be visible to reverse queries which select this replica, regardless of the
consistency level.

As far as I can see, this doesn't result in any persistent data loss.
Only in that some data might appear resurrected to reverse queries,
until the relevant range tombstone is fully repaired.
2023-11-08 14:54:48 +01:00
Tomasz Grabiec
8b7623f49e database: Fix accounting of small partitions in mutation query
The partition key size was ignored by the accounter, as well as the
partition tombstone. As a result, a sequence of partitions with just
tombstones would be accounted as taking no memory and page size
limitter to not kick in.

Fix by accounting the real size of accumulated frozen_mutation.

Also, break pages across partitions even if there are no live rows.
The coordinator can handle it now.

Refs #7933
2023-09-22 02:53:14 -04:00
Tomasz Grabiec
17c1cad4b4 database, storage_proxy: Reconcile pages with no live rows incrementally
Currently, mutation query on replica side will not respond with a result
which doesn't have at least one live row. This causes problems if there
is a lot of dead rows or partitions before we reach a live row, which
stems from the fact that resulting reconcilable_result will be large:

* Large allocations. Serialization of reconcilable_result causes large
  allocations for storing result rows in std::deque
* Reactor stalls. Serialization of reconcilable_result on the replica
  side and on the coordinator side causes reactor stalls. This impacts
  not only the query at hand. For 1M dead rows, freezing takes 130ms,
  unfreezing takes 500ms. Coordinator does multiple freezes and
  unfreezes. The reactor stall on the coordinator side is >5s.
* Large repair mutations. If reconciliation works on large pages, repair
  may fail due to too large mutation size. 1M dead rows is already too
  much: Refs #9111.

This patch fixes all of the above by making mutation reads respect the
memory accounter's limit for the page size, even for dead rows.

This patch also addresses the problem of client-side timeouts during
paging. Reconciling queries processing long strings of tombstones will
now properly page tombstones,like regular queries do.

My testing shows that this solution even increases efficiency. I tested
with a cluster of 2 nodes, and a table of RF=2. The data layout was as
follows (1 partition):

    Node1: 1 live row, 1M dead rows
    Node2: 1M dead rows, 1 live row

This was designed to trigger reconciliation right from the very start of
the query.

Before:

Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]

After:

Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]

Non-reconciling queries have almost identical duration (1 few ms changes
can be observed between runs). Note how in the after case, the
reconciling read also produces 100 pages, vs. just 2 pages in the before
case, leading to a much lower duration (less than 1/4 of the before).

Refs #7929
Refs #3672
Refs #7933
Fixes #9111
2023-09-22 02:53:14 -04:00
Botond Dénes
7e7101c180 Revert "Merge 'database, storage_proxy: Reconcile pages with dead rows and partitions incrementally' from Botond Dénes"
This reverts commit 628e6ffd33, reversing
changes made to 45ec76cfbf.

The test included with this PR is flaky and often breaks CI.
Revert while a fix is found.

Fixes: #15371
2023-09-13 10:45:37 +03:00
Tomasz Grabiec
0d773c9f9f database: Fix accounting of small partitions in mutation query
The partition key size was ignored by the accounter, as well as the
partition tombstone. As a result, a sequence of partitions with just
tombstones would be accounted as taking no memory and page size
limitter to not kick in.

Fix by accounting the real size of accumulated frozen_mutation.

Also, break pages across partitions even if there are no live rows.
The coordinator can handle it now.

Refs #7933
2023-09-11 06:56:13 -04:00
Tomasz Grabiec
2c8a0e4175 database, storage_proxy: Reconcile pages with no live rows incrementally
Currently, mutation query on replica side will not respond with a result
which doesn't have at least one live row. This causes problems if there
is a lot of dead rows or partitions before we reach a live row, which
stems from the fact that resulting reconcilable_result will be large:

* Large allocations. Serialization of reconcilable_result causes large
  allocations for storing result rows in std::deque
* Reactor stalls. Serialization of reconcilable_result on the replica
  side and on the coordinator side causes reactor stalls. This impacts
  not only the query at hand. For 1M dead rows, freezing takes 130ms,
  unfreezing takes 500ms. Coordinator does multiple freezes and
  unfreezes. The reactor stall on the coordinator side is >5s.
* Large repair mutations. If reconciliation works on large pages, repair
  may fail due to too large mutation size. 1M dead rows is already too
  much: Refs #9111.

This patch fixes all of the above by making mutation reads respect the
memory accounter's limit for the page size, even for dead rows.

This patch also addresses the problem of client-side timeouts during
paging. Reconciling queries processing long strings of tombstones will
now properly page tombstones,like regular queries do.

My testing shows that this solution even increases efficiency. I tested
with a cluster of 2 nodes, and a table of RF=2. The data layout was as
follows (1 partition):

    Node1: 1 live row, 1M dead rows
    Node2: 1M dead rows, 1 live row

This was designed to trigger reconciliation right from the very start of
the query.

Before:

Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]

After:

Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]

Non-reconciling queries have almost identical duration (1 few ms changes
can be observed between runs). Note how in the after case, the
reconciling read also produces 100 pages, vs. just 2 pages in the before
case, leading to a much lower duration (less than 1/4 of the before).

Refs #7929
Refs #3672
Refs #7933
Fixes #9111
2023-09-11 06:56:13 -04:00
Tomasz Grabiec
87b4606cd6 Merge 'atomic_cell: compare value last' from Benny Halevy
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
e225c88a65 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times.  If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.

Fixes #14182

Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182

Closes #14183

* github.com:scylladb/scylladb:
  docs: dml: add update ordering section
  cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
  mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
  atomic_cell: compare_atomic_cell_for_merge: update and add documentation
  compare_atomic_cell_for_merge: compare value last for live cells
  mutation_test: test_cell_ordering: improve debuggability
2023-06-20 12:11:48 +02:00
Benny Halevy
0aa13f70eb mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
As in compare_atomic_cell_for_merge, we want to consider
the row marker ttl for ordering, in case both are expiring
and have the same expiration time.

This was missed in a57c087c89
and a085ef74ff.

With that in mind, add documentation to compare_row_marker_for_merge
and a mutual note to both functions about their
equivalence.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-20 10:10:39 +03:00
Michał Chojnowski
4f73a28174 mutation: mutation_cleaner: add pause()
In unit tests, we would want to delay the merging of some MVCC
versions to test the transient scenarios with multiple versions present.

In many cases this can be done by holding snapshots to all versions.
But sometimes (i.e. during schema upgrades) versions are added and
scheduled for merge immediately, without a window for the test to
grab a snapshot to the new version.

This patch adds a pause() method to mutation_cleaner, which ensures
that no asynchronous/implicit MVCC version merges happen within
the scope of the call.

This functionality will be used by a test added in an upcoming patch.
2023-06-19 22:50:43 +02:00
Pavel Emelyanov
66e43912d6 code: Switch to seastar API level 7
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).

So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command

The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields

Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)

Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile

The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13963
2023-06-06 13:29:16 +03:00
Tomasz Grabiec
51e3b9321b Merge ' mvcc: make schema upgrades gentle' from Michał Chojnowski
After a schema change, memtable and cache have to be upgraded to the new schema. Currently, they are upgraded (on the first access after a schema change) atomically, i.e. all rows of the entry are upgraded with one non-preemptible call. This is a one of the last vestiges of the times when partition were treated atomically, and it is a well known source of numerous large stalls.

This series makes schema upgrades gentle (preemptible). This is done by co-opting the existing MVCC machinery.
Before the series, all partition_versions in the partition_entry chain have the same schema, and an entry upgrade replaces the entire chain with a single squashed and upgraded version.
After the series, each partition_version has its own schema. A partition entry upgrade happens simply by adding an empty version with the new schema to the head of the chain. Row entries are upgraded to the current schema on-the-fly by the cursor during reads, and by the MVCC version merge ongoing in the background after the upgrade.

The series:
1. Does some code cleanup in the mutation_partition area.
2. Adds a schema field to partition_version and removes it from its containers (partition_snapshot, cache_entry, memtable_entry).
3. Adds upgrading variants of constructors and apply() for `row` and its wrappers.
4. Prepares partition_snapshot_row_cursor, mutation_partition_v2::apply_monotonically and partition_snapshot::merge_partition_versions for dealing with heterogeneous version chains.
5. Modifies partition_entry::upgrade to perform upgrades by extending the version chain with a new schema instead of squashing it to a single upgraded version.

Fixes #2577

Closes #13761

* github.com:scylladb/scylladb:
  test: mvcc_test: add a test for gentle schema upgrades
  partition_version: make partition_entry::upgrade() gentle
  partition_version: handle multi-schema snapshots in merge_partition_versions
  mutation_partition_v2: handle schema upgrades in apply_monotonically()
  partition_version: remove the unused "from" argument in partition_entry::upgrade()
  row_cache_test: prepare test_eviction_after_schema_change for gentle schema upgrades
  partition_version: handle multi-schema entries in partition_entry::squashed
  partition_snapshot_row_cursor: handle multi-schema snapshots
  partiton_version: prepare partition_snapshot::squashed() for multi-schema snapshots
  partition_version: prepare partition_snapshot::static_row() for multi-schema snapshots
  partition_version: add a logalloc::region argument to partition_entry::upgrade()
  memtable: propagate the region to memtable_entry::upgrade_schema()
  mutation_partition: add an upgrading variant of lazy_row::apply()
  mutation_partition: add an upgrading variant of rows_entry::rows_entry
  mutation_partition: switch an apply() call to apply_monotonically()
  mutation_partition: add an upgrading variant of rows_entry::apply_monotonically()
  mutation_fragment: add an upgrading variant of clustering_row::apply()
  mutation_partition: add an upgrading variant of row::row
  partition_version: remove _schema from partition_entry::operator<<
  partition_version: remove the schema argument from partition_entry::read()
  memtable: remove _schema from memtable_entry
  row_cache: remove _schema from cache_entry
  partition_version: remove the _schema field from partition_snapshot
  partition_version: add a _schema field to partition_version
  mutation_partition: change schema_ptr to schema& in mutation_partition::difference
  mutation_partition: change schema_ptr to schema& in mutation_partition constructor
  mutation_partition_v2: change schema_ptr to schema& in mutation_partition_v2 constructor
  mutation_partition: add upgrading variants of row::apply()
  partition_version: update the comment to apply_to_incomplete()
  mutation_partition_v2: clean up variants of apply()
  mutation_partition: remove apply_weak()
  mutation_partition_v2: remove a misleading comment in apply_monotonically()
  row_cache_test: add schema changes to test_concurrent_reads_and_eviction
  mutation_partition: fix mixed-schema apply()
2023-05-24 22:58:43 +02:00
Botond Dénes
646396a879 mutation/mutation_partition: append_clustered_row(): use on_internal_error()
Instead of simply throwing an exception. With just the exception, it is
impossible to find out what went wrong, as this API is very generic and
is used in a variety of places. The backtrace printed by
`on_internal_error()` will help zero in on the problem.

Fixes: #13876

Closes #13883
2023-05-15 20:31:44 +03:00
Michał Chojnowski
b488e4d541 mutation_partition: add an upgrading variant of row::row
It will be used in upcoming commits.

A factory function is used, rather than an actual constructor,
because we want to delegate the (easy) case of equal schemas
to the existing single-schema constructor.
And that's impossible (at least without invoking a copy/move
constructor) to do with only constructors.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
bc6a07a16a mutation_partition: change schema_ptr to schema& in mutation_partition::difference
Cosmetic change. See the preceding commit for details.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
a70c5704df mutation_partition: change schema_ptr to schema& in mutation_partition constructor
Cosmetic change. See the preceding commit for details.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
021b345832 mutation_partition: add upgrading variants of row::apply()
They will be used in upcoming patches which introduce incremental schema upgrades.

Currently, these variants always copy cells during upgrade.
This could be optimized in the future by adding a way to move them instead.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
88a0871729 mutation_partition: remove apply_weak()
apply_weak is just an alias for apply(), and most of its variants
are dead code. Get rid of it.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
fb8ae3cca4 mutation_partition: fix mixed-schema apply()
In some mixed-schema apply helpers for tests, the source mutation
is accidentally copied with the target schema. Fix that.

Nothing seems to be currently affected by this bug; I found it when it
was triggered by a new test I was adding.
2023-05-04 02:37:29 +02:00
Kefu Chai
1cb95b8cff mutation: specialize fmt::formatter<tombstone> and fmt::formatter<shadowable_tombstone>
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `tombstone` and `shadowable_tombstone` without the
help of `operator<<`.

in this change, only `operator<<(ostream&, const shadowable_tombstone&)`
is dropped, and all its callers are now using fmtlib for formatting the
instances of `shadowable_tombstone` now.
`operator<<(ostream&, const tombstone&)` is preserved. as it is still
used by Boost::test for printing the operands in case the comparing tests
fail.

please note, before this change we were using a concrete string
for indent. after this change, some of the places are changed to
using fmtlib for indent.

Refs scylladb#13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-12 10:57:03 +08:00
Kefu Chai
c37f4e5252 treewide: use fmt::join() when appropriate
now that fmtlib provides fmt::join(). see
https://fmt.dev/latest/api.html#_CPPv4I0EN3fmt4joinE9join_viewIN6detail10iterator_tI5RangeEEN6detail10sentinel_tI5RangeEEERR5Range11string_view
there is not need to revent the wheel. so in this change, the homebrew
join() is replaced with fmt::join().

as fmt::join() returns an join_view(), this could improve the
performance under certain circumstances where the fully materialized
string is not needed.

please note, the goal of this change is to use fmt::join(), and this
change does not intend to improve the performance of existing
implementation based on "operator<<" unless the new implementation is
much more complicated. we will address the unnecessarily materialized
strings in a follow-up commit.

some noteworthy things related to this change:

* unlike the existing `join()`, `fmt::join()` returns a view. so we
  have to materialize the view if what we expect is a `sstring`
* `fmt::format()` does not accept a view, so we cannot pass the
  return value of `fmt::join()` to `fmt::format()`
* fmtlib does not format a typed pointer, i.e., it does not format,
  for instance, a `const std::string*`. but operator<<() always print
  a typed pointer. so if we want to format a typed pointer, we either
  need to cast the pointer to `void*` or use `fmt::ptr()`.
* fmtlib is not able to pick up the overload of
  `operator<<(std::ostream& os, const column_definition* cd)`, so we
  have to use a wrapper class of `maybe_column_definition` for printing
  a pointer to `column_definition`. since the overload is only used
  by the two overloads of
  `statement_restrictions::add_single_column_parition_key_restriction()`,
  the operator<< for `const column_definition*` is dropped.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-03-16 20:34:18 +08:00
Avi Kivity
c5e4bf51bd Introduce mutation/ module
Move mutation-related files to a new mutation/ directory. The names
are kept in the global namespace to reduce churn; the names are
unambiguous in any case.

mutation_reader remains in the readers/ module.

mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this
patch.

This is a step forward towards librarization or modularization of the
source base.

Closes #12788
2023-02-14 11:19:03 +02:00