The CQL binary protocol version 3 was introduced in 2014. All Scylla
version support it, and Cassandra versions 2.1 and newer.
Versions 1 and 2 have 16-bit collection sizes, while protocol 3 and newer
use 32-bit collection sizes.
Unfortunately, we implemented support for multiple serialization formats
very intrusively, by pushing the format everywhere. This avoids the need
to re-serialize (sometimes) but is quite obnoxious. It's also likely to be
broken, since it's almost untested and it's too easy to write
cql_serialization_format::internal() instead of propagating the client
specified value.
Since protocols 1 and 2 are obsolete for 9 years, just drop them. It's
easy to verify that they are no longer in use on a running system by
examining the `system.clients` table before upgrade.
Fixes#10607Closes#12432
* github.com:scylladb/scylladb:
treewide: drop cql_serialization_format
cql: modification_statement: drop protocol check for LWT
transport: drop cql protocol versions 1 and 2
This commit removes consume_in_reverse::legacy_half_reverse, an option
once used to indicate that the given key ranges are sorted descending,
based on the clustering key of the start of the range, and that the
range tombstones inside partition would be sorted (descending, as all
the mutation fragments would) according to their end (but range
tombstone would still be stored according to their start bound).
As it turns out, mutation::consume, when called with legacy_half_reverse
option produces invalid fragment stream, one where all the row
tombstone changes come after all the clustering rows. This was not an
issue, since when constructing results from the query, Scylla would not
pass the tombstones to the client, but instead compact data beforehand.
In this commit, the consume_in_reverse::legacy_half_reverse is removed,
along with all the uses.
As for the swap out in mutation_partition.cc in query_mutation and
to_data_query_result:
The downstream was not prepared to deal with legacy_half_reverse.
mutation::consume contains
```
if (reverse == consume_in_reverse::yes) {
while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::yes>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) {
co_await yield();
}
} else {
while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::no>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) {
co_await yield();
}
}
```
So why did it work at all? to_data_query_result deals with a single slice.
The used consumer (compact_for_query_v2) compacts-away the range tombstone
changes, and thus the only difference between the consume_in_reverse::no
and consume_in_reverse::yes was that one was ordered increasing wrt. ckeys
and the second one was ordered decreasing. This property is maintained if
we swap out for the consume_in_reverse::yes format.
Now that we don't accept cql protocol version 1 or 2, we can
drop cql_serialization format everywhere, except when in the IDL
(since it's part of the inter-node protocol).
A few functions had duplicate versions, one with and one without
a cql_serialization_format parameter. They are deduplicated.
Care is taken that `partition_slice`, which communicates
the cql_serialization_format across nodes, still presents
a valid cql_serialization_format to other nodes when
transmitting itself and rejects protocol 1 and 2 serialization\
format when receiving. The IDL is unchanged.
One test checking the 16-bit serialization format is removed.
The consume_in_reverse::legacy_half_reverse format is soon to be phased
out. This commit starts treating frozen_mutations from replicas for
reversed queries so that they are consumed with consume_in_reverse::yes.
This PR fixes several bugs related to handling of non-full
clustering keys.
One is in trim_clustering_row_ranges_to(), which is broken for non-full keys in reverse
mode. It will trim the range to position_in_partition_view::after_key(full_key) instead of
position_in_partition_view::before_key(key), hence it will include the
key in the resulting range rather than exclude it.
Fixes#12180
after_key() was creating a position which is after all keys prefixed
by a non-full key, rather than a position which is right after that
key.
This will issue will be caught by cql_query_test::test_compact_storage
in debug mode when mutation_partition_v2 merging starts inserting
sentinels at position after_key() on preemption.
It probably already causes problems for such keys as after_key() is used
in various parts in the read path.
Refs #1446Closes#12234
* github.com:scylladb/scylladb:
position_in_partition: Make after_key() work with non-full keys
position_in_partition: Introduce before_key(position_in_partition_view)
db: Fix trim_clustering_row_ranges_to() for non-full keys and reverse order
types: Fix comparison of frozen sets with empty values
This fixes a long standing bug related to handling of non-full
clustering keys, issue #1446.
after_key() was creating a position which is after all keys prefixed
by a non-full key, rather than a position which is right after that
key.
This will issue will be caught by cql_query_test::test_compact_storage
in debug mode when mutation_partition_v2 merging starts inserting
sentinels at position after_key() on preemption.
It probably already causes problems for such keys.
While deletable_row is used to hold regular columns of a clustering row,
its name or implementation doesn't suggest that it is a requirement. In
fact, some of its methods already take a column_kind parameter which is
used to interpret the kind of columns held in the row.
This commit removes the assumption about the column kind from the
`deletable_row::is_live` method.
Our LSA cache is implemented as an auto_unlink Boost intrusive list, meaning
that elements of the list unlink themselves from the list automatically on
destruction. Some parts of the code rely on that, and don't unlink them
manually.
However, this precludes accurate bookkeeping about the cache. Elements only have
access to themselves and their neighbours, not to any bookkeeping context.
Therefore, a destructor cannot update the relevant metadata.
In this patch, we fix this by adding explicit unlink calls to places where it
would be done by a destructor. In a following patch, we will add an assert to
the destructor to check that every element is unlinked before destruction.
and use it to access the repair history maps.
At this introductory patch, we use default-constructed
tombstone_gc_state to access the thread-local maps
temporarily and those use sites will be replaced
in following patches that will gradually pass
the tombstone_gc_state down from the compaction_manager
to where it's used.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Many tombstones in a partition is a problem that has been plaguing queries since the inception of Scylla (and even before that as they are a pain in Apache Cassandra too). Tombstones don't count towards the query's page limit, neither the size nor the row number one. Hence, large spans of tombstones (be that row- or range-tombstones) are problematic: the query can time out while processing this span of tombstones, as it waits for more live rows to fill the page. In the extreme case a partition becomes entirely unreadable, all read attempts timing out, until compaction manages to purge the tombstones.
The solution proposed in this PR is to pass down a tombstone limit to replicas: when this limit is reached, the replica cuts the page and marks it as short one, even if the page is empty currently. To make this work, we use the last-position infrastructure added recently by 3131cbea62, so that replicas can provide the position of the last processed item to continue the next page from. Without this no forward progress could be made in the case of an empty page: the query would continue from the same position on the next page, having to process the same span of tombstones.
The limit can be configured with the newly added `query_tombstone_limit` configuration item, defaulted to 10000. The coordinator will pass this to the newly added `tombstone_limit` field of `read_command`, if the `replica_empty_pages` cluster feature is set.
Upgrade sanity test was conducted as following:
* Created cluster of 3 nodes with RF=3 with master version
* Wrote small dataset of 1000 rows.
* Deleted prefix of 980 rows.
* Started read workload: `scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100`
* Also did some manual queries via `cqlsh` with smaller page size and tracing on.
* Stopped and upgraded each node one-by-one. New nodes were started by `--query-tombstone-page-limit=10`.
* Confirmed there are no errors or read-repairs.
Perf regression test:
```
build/release/test/perf/perf_simple_query_g -c1 -m2G --concurrency=1000 --task-quota-ms 10 --duration=60
```
Before:
```
median 133665.96 tps ( 62.0 allocs/op, 12.0 tasks/op, 43007 insns/op, 0 errors)
median absolute deviation: 973.40
maximum: 135511.63
minimum: 104978.74
```
After:
```
median 129984.90 tps ( 62.0 allocs/op, 12.0 tasks/op, 43181 insns/op, 0 errors)
median absolute deviation: 2979.13
maximum: 134538.13
minimum: 114688.07
```
Diff: +~200 instruction/op.
Fixes: https://github.com/scylladb/scylla/issues/7689
Fixes: https://github.com/scylladb/scylla/issues/3914
Fixes: https://github.com/scylladb/scylla/issues/7933
Refs: https://github.com/scylladb/scylla/issues/3672Closes#11053
* github.com:scylladb/scylladb:
test/cql-pytest: add test for query tombstone page limit
query-result-writer: stop when tombstone-limit is reached
service/pager: prepare for empty pages
service/storage_proxy: set smallest continue pos as query's continue pos
service/storage_proxy: propagate last position on digest reads
query: result_merger::get() don't reset last-pos on short-reads and last pages
query: add tombstone-limit to read-command
service/storage_proxy: add get_tombstone_limit()
query: add tombstone_limit type
db/config: add config item for query tombstone limit
gms: add cluster feature for empty replica pages
tree: don't use query::read_command's IDL constructor
The query result writer now counts tombstones and cuts the page (marking
it as a short one) when the tombstone limit is reached. This is to avoid
timing out on large span of tombstones, especially prefixes.
In the case of unpaged queries, we fail the read instead, similarly to
how we do with max result size.
If the limit is 0, the previous behaviour is used: tombstones are not
taken into consideration at all.
This reverts commit bcadd8229b, reversing
changes made to cf528d7df9. Since
4bd4aa2e88 ("Merge 'memtable, cache: Eagerly
compact data with tombstones' from Tomasz Grabiec"), memtable is
self-compacting and the extra compaction step only reduces throughput.
The unit test in memtable_test.cc is not reverted as proof that the
revert does not cause a regression.
Closes#11243
This reverts commit c3bad157e5, reversing
changes made to e66809d051. The checks it
adds are triggered by some dtests. While it's possible the check is
triggered due to an existing problem, better to investigate it out-of-tree.
Fixes#11169.
If the reader feeding the result builder is missing a partition-end
between two partition, or at end-of-stream, the result builder will
write a corrupt partition-entry into the result, ending up in an
"IDL Frame truncated" error.
It is trivial to add a check for this and this will result in a much
more clear error message, then the mysterious frame truncated error
mentioned above.
Now that we use emit_only_live_rows::no everywhere we can remove this
template parameters. Only the template parameter is removed, the
internal logic around it is left in place (will be removed in a next
patch), by hard-wiring `only_live()`.
emit_only_live_rows is a convenience so downstream consumers of the
mutation compactors don't have to check the `bool is_live` already
passed to them. This convenience however causes a template parameter and
additional logic for the compactor. As the most prominent of these
consumers (the query result builder) will soon have to switch to
emit_only_live_rows::no for other reasons anyway (it will want to count
tombstones), we take the opportunity to switch everybody to ::no. This
can be done with very little additional complexity to these consumer --
basically an additional if or two.
This prepares the ground for removing this template parameter and the
associate logic from the compactor.
Use the recently introduced query-result facility to have the replica
set the position where the query should continue from. For now this is
the same as what the implicit position would have been previously (last
row in result), but it opens up the possibility to stop the query at a
dead row.
Currently, clustering_row::apply() takes deletable_row by reference, but
copies it before passing it to deletable_row::apply(). This is more expensive
than passing the reference down (by about 1800 instructions for
perf_simple_query rows).
When memtable receives a tombstone it can happen under some workloads
that it covers data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drpo tombstoned data.
One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.
Fixes#652.
This patch prevents virtual dirty from going negative during memtable
flush in case partition version merging erases data previously
accounted by the flush reader. There is an assert in
~flush_memory_accounter which guards for this.
This will start happening after tombstones are compacted with rows on
partition version merging.
This problem is prevented by the patch by having the cleaner notify
the memtable layer via callback about the amount of dirty memory released
during merging, so that the memtable layer can adjust its accounting.
Partition version merging is preemptable. It may stop in the middle
and be resumed later. Currently, all state is kept inside the versions
themselves, in the form of elements in the source version which are
yet to be moved. This will change once we add compaction (tombstones
with rows) into the merging algorithm. There, state cannot be encoded
purley within versions. Consider applying a partition tombstone over
large number of rows.
This patch introduces apply_rows object to hold the necessary state to
make sure forward progress in case of preemption.
No change in behavior yet.
This reverts commit e0670f0bb5, reversing
changes made to 605ee74c39. It causes failures
in debug mode in
database_test.test_database_with_data_in_sstables_is_a_mutation_source_plain,
though with low probability.
Fixes#10780Reopens#652.
When memtable receives a tombstone it can happen under some workloads
that it covers data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drpo tombstoned data.
One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.
Fixes#652.
This patch prevents virtual dirty from going negative during memtable
flush in case partition version merging erases data previously
accounted by the flush reader. There is an assert in
~flush_memory_accounter which guards for this.
This will start happening after tombstones are compacted with rows on
partition version merging.
This problem is prevented by the patch by having the cleaner notify
the memtable layer via callback about the amount of dirty memory released
during merging, so that the memtable layer can adjust its accounting.
Partition version merging is preemptable. It may stop in the middle
and be resumed later. Currently, all state is kept inside the versions
themselves, in the form of elements in the source version which are
yet to be moved. This will change once we add compaction (tombstones
with rows) into the merging algorithm. There, state cannot be encoded
purley within versions. Consider applying a partition tombstone over
large number of rows.
This patch introduces apply_rows object to hold the necessary state to
make sure forward progress in case of preemption.
No change in behavior yet.
Reduce stalls by maybe yielding in-between partitions,
and by awaiting unfreeze_gently where possible.
Refs #10038
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Allowing to consume the frozen_mutation directly
to a stream rather than unfreezing it first
and then consuming the unfrozen mutation.
Streaming directly from the frozen_mutation
saves both cpu and memory, and will make it
easier to be made async as a follow, to allow
yielding, e.g. between rows.
This is used today only in to_data_query_result
which is invoked on the read-repair path.
Refs #10038Fixes#10021
Test: unit(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220405055807.1834494-1-bhalevy@scylladb.com>
Add a `consume()` overload for range tombstone changes and convert them
internally to range tombstones, as the underlying reconcilable result
is still v1.
Add a consume() overload which takes a range tombstone change and drops
it just like the existing range tombstone overload does: query results
don't care about range tombstones.
The downstream consumer (mutation_querier) already ignores range
tombstones, so no point forwarding them to it. This makes adding v2
support easier too as range tombstone changes can be similarly dropped.
The comment on the public methods calling said method promises to do so
but doesn't actually follows through. This patch fixes this for row
tombstones, to mirror the behaviour of the mutation compactor. This is
especially important for tests that compare mutations compacted with
different methods.
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
Said wrapper was conceived to make unmovable `compact_mutation` because
readers wanted movable consumers. But `compact_mutation` is movable for
years now, as all its unmovable bits were moved into an
`lw_shared_ptr<>` member. So drop this unnecessary wrapper and its
unnecessary usages.
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistency to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that is
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.
To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.
In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:
1) GC a tombstone after gc_grace_seconds
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;
This is the default mode. If no tombstone_gc option is specified by the
user. The old gc_grace_seconds based gc will be used.
2) Never GC a tombstone
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};
3) GC a tombstone immediately
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};
4) GC a tombstone after repair
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};
In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.
A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.
Tests: compaction_test.py, ninja test
Fixes#3560
[avi: resolve conflicts vs data_dictionary]
This means when page_size is sent together with read_command it will be
used for paged queries instead of the hard_limit.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The B-tree insertion methods accept smart pointers and
automatically release the ownership after exception-risky
part is passed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add flags if memtable contains tombstones. They can be used as a
heuristic to determine if a memtable should be compacted on
flush. It's an intermediate step until we can compact during applying
mutations on a memtable.
We shouldn't be using Seastar as a text formatting library; that's
not its focus. Use fmt directly instead. fmt::print() doesn't return
the output stream which is a minor inconvenience, but that's life.
Closes#9556