"This patch series ensures we don't count dead partitions (i.e.,
partitions with no live rows) towards the partition_limit. We also
enforce the partition limit at the storage_proxy level, so that
limits with smp > 1 work correctly."
(cherry picked from commit 5f11a727c9)
This patch adds support to send a cell's ttl as part of a query's
result. This is needed for thrift support.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Originally, streamed_mutations guaranteed that emitted tombstones are
disjoint. To achieve that, two separate objects were produced
for each range tombstone: range_tombstone_begin and range_tombstone_end.
Unfortunately, this forced the sstable writer to accumulate all clustering
rows between range_tombstone_begin and range_tombstone_end.
However, since there is no need to write disjoint tombstones to sstables
(see #1153 "Write range tombstones to sstables like Cassandra does") it
is also not necessary for streamed_mutations to produce disjoint range
tombstones.
This patch changes that by making streamed_mutation produce
range_tombstone objects directly.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Checking bloom filters of sstables to compute max purgeable timestamp
for compaction is expensive in terms of CPU time. We can avoid
calculating it if we're not about to GC any tombstone.
This patch changes compacting functions to accept a function instead
of a ready value for max_purgeable.
I verified that bloom filter operations no longer appear on flame
graphs during a compaction-heavy workload (without tombstones).
Refs #1322.
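The deferred computation described above can be sketched as follows. This is a minimal illustration, not the actual compaction code: lazy_max_purgeable and can_purge are hypothetical names, and the purge rule (a tombstone is purgeable only when its timestamp is below max_purgeable) is stated here as an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>

using timestamp_type = int64_t;

// Hypothetical wrapper: holds a function that computes max_purgeable
// (in the real code this is where bloom filters would be consulted)
// and runs it at most once, only when somebody asks for the value.
class lazy_max_purgeable {
    std::function<timestamp_type()> _compute;
    std::optional<timestamp_type> _cached;
public:
    explicit lazy_max_purgeable(std::function<timestamp_type()> compute)
        : _compute(std::move(compute)) {
        assert(_compute); // an empty function would make get() throw
    }
    timestamp_type get() {
        if (!_cached) {
            _cached = _compute(); // the expensive step, performed lazily
        }
        return *_cached;
    }
};

// Assumed purge rule for this sketch. The cheap gc_eligible check comes
// first, so the expensive computation is skipped when nothing can be GCed.
inline bool can_purge(timestamp_type tombstone_ts, bool gc_eligible,
                      lazy_max_purgeable& max_purgeable) {
    return gc_eligible && tombstone_ts < max_purgeable.get();
}
```

With this shape, a workload without GC-eligible tombstones never invokes the wrapped function at all.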
compact_mutation code is going to be shared among queries and sstable
compaction. There are some differences though. Queries don't provide
_max_purgeable and sstable compaction doesn't need any limits.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Since decorated keys are already computed, it is better to pass more
information than less. Consumers interested only in the partition key can
drop the token, and the ones requiring the full decorated key don't need to
recompute it.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch renames compact_query::_partition_limit to
_current_partition_limit for clarity, as the next patch adds
a partition limit that limits the number of partitions.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch renames compact_query::_limit to _row_limit for
clarity, as a subsequent patch introduces yet another limit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a per-partition row limit. It ensures both local
queries and the reconciliation logic abide by this limit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
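How the three limits from this series (total rows, rows per partition, number of partitions) might interact can be sketched as below. This is an illustrative model only: query_limits and limit_tracker are hypothetical names and do not reflect the real compact_query members. It also models the rule that partitions with no live rows do not count towards the partition limit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical bundle of the three limits discussed in this series.
struct query_limits {
    uint32_t row_limit;            // total rows across the whole query
    uint32_t partition_row_limit;  // rows taken from any single partition
    uint32_t partition_limit;      // number of live partitions returned
};

class limit_tracker {
    query_limits _limits;
    uint32_t _rows_left;
    uint32_t _partitions_left;
public:
    explicit limit_tracker(query_limits l)
        : _limits(l), _rows_left(l.row_limit), _partitions_left(l.partition_limit) {
        assert(l.partition_row_limit > 0); // a zero per-partition limit is not modelled
    }
    // Returns how many of the partition's live rows may be emitted.
    uint32_t consume_partition(uint32_t live_rows) {
        if (live_rows == 0 || done()) {
            return 0; // dead partitions are skipped without using up the partition limit
        }
        uint32_t take = std::min({live_rows, _limits.partition_row_limit, _rows_left});
        _rows_left -= take;
        --_partitions_left;
        return take;
    }
    bool done() const { return _rows_left == 0 || _partitions_left == 0; }
};
```

A coordinator merging results from several shards or replicas would need to apply the same accounting again to the merged stream, since each source enforces the limits only locally.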
compact_for_query is an intermediate stage used to compact data in a
flattened stream of mutations before they are consumed by query building
consumers.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch uses the composite_marker to add inclusiveness information
to the prefixes of a range tombstone.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Since Scylla now supports proper range tombstones, the code for
reading ranges from sstables and converting them to overlapping
tombstones is no longer necessary, and is, in fact, wasteful as
the internal representation converts overlapping tombstones back to
ranges.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch moves the computation of the difference between two
mutation_partitions' row_tombstones into the range_tombstone_list.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the type of the mutation partition's row_tombstones
to be a range_tombstone_list, so that they are now represented as a
set of disjoint ranges. All of its usages are updated accordingly.
Fixes #1155
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The patch calculates the row count during result building and while merging.
If one of the results being merged does not have a row count, the
merged result will not have one either.
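The merging rule can be sketched with std::optional. The function name merge_row_counts is hypothetical and the real result type is not shown here; the sketch only illustrates that a single missing count makes the merged count missing.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper: each partial result either carries a row count or not.
// The merged count exists only if every partial result has one.
inline std::optional<uint32_t>
merge_row_counts(const std::vector<std::optional<uint32_t>>& partial_counts) {
    uint32_t total = 0;
    for (const auto& count : partial_counts) {
        if (!count) {
            return std::nullopt; // one unknown count makes the sum unknown
        }
        assert(total + *count >= total); // sketch only: overflow is not handled
        total += *count;
    }
    return total;
}
```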
Broken by f15c380a4f.
This resulted in an empty collection being returned in the results
instead of no collection.
Fixes org.apache.cassandra.cql3.validation.entities.CollectionsTest
from cassandra-unit-tests.
Reproduced by dtest paging_test.py:TestPagingData.static_columns_paging_test.
Broken by f15c380a4f, where the
calculation of has_ck_selector got broken in such a way that present
clustering restrictions were treated as if not present, which resulted
in the static row being returned when it shouldn't be.
While at it, unify the check between query_compacted() and
do_compact() by extracting it to a function.
The first erase_and_dispose(), which removes rows between the last
position and the beginning of the next range, can invalidate the end()
iterator of the range. Fix by looking up the end after erasing.
mutation_partition::range() was split into lower_bound() and
upper_bound() to allow for that.
This affects for example queries with descending order where the
selected clustering range is empty and falls before all rows.
Exposed by f15c380a4f, which is now
calling do_compact() during query.
Reproduced by dtest paging_test.py:TestPagingData.static_columns_paging_test
"Currently data query digest includes cells and tombstones which may have
expired or be covered by higher-level tombstones. This causes digest
mismatch between replicas if some elements are compacted on one of the
nodes and not on others. This mismatch triggers read-repair, which doesn't
resolve it because the mutations received by mutation queries do not differ;
they are already compacted.
The fix adds a compacting step before writing and digesting query results by
reusing the algorithm used by mutation query. This is not the optimal
way to fix this: the compaction step could be folded into query writing, as
there is redundancy between the two steps. However, such a change carries more
risk, and thus was postponed.
perf_simple_query test (cassandra-stress-like partitions) shows regression
from 83k to 77k (7%) ops/s.
Fixes #1165."
We cannot leave a partially applied mutation behind when the write
fails. It may fail if memory allocation fails in the middle of
apply(). This would, for example, violate write atomicity: readers
should either see the whole write or none of it.
This fix makes apply() revert partially applied data upon failure, by
means of the ReversiblyMergeable concept. In a nutshell, the idea is
to store the old state in the source mutation as we apply it and swap
back in case of an exception. At the cell level this swapping is
inexpensive, just rewiring pointers. For this to work, the source
mutation needs to be brought into mutable form, so frozen mutations
need to be unfrozen. In practice this doesn't increase the number of
cell allocations in the memtable apply path, because incoming data will
usually be newer and we would have to copy it into LSA anyway. There
are extra allocations, though, for the data structures which hold cells.
I didn't see significant change in performance of:
build/release/tests/perf/perf_simple_query -c1 -m1G --write --duration 13
The score fluctuates around ~77k ops/s.
Fixes #283.
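The swap-and-revert idea can be illustrated with a simplified model. This is only a sketch under stated assumptions: the cell and cells types, the step_hook parameter (a stand-in for any operation that may throw in the middle of apply()), and the side vector that records swaps are all hypothetical and are not how ReversiblyMergeable is actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <new>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct cell {
    int64_t timestamp;
    std::string value;
};

// One slot per column; std::nullopt means the cell is missing.
using cells = std::vector<std::optional<cell>>;

// Merges src into dst. Whenever the source cell wins, the two slots are
// swapped, so src ends up holding the old state. If anything throws midway,
// the swaps already made are undone and both inputs regain their old contents.
inline void apply_reversibly(cells& dst, cells& src,
                             const std::function<void(std::size_t)>& step_hook = {}) {
    assert(dst.size() == src.size());
    std::vector<bool> swapped(dst.size(), false); // allocated before any change
    std::size_t i = 0;
    try {
        for (; i < dst.size(); ++i) {
            if (step_hook) {
                step_hook(i); // hypothetical failure point, e.g. an allocation
            }
            bool src_wins = src[i] && (!dst[i] || src[i]->timestamp > dst[i]->timestamp);
            if (src_wins) {
                std::swap(dst[i], src[i]); // cheap: moves, no copies
                swapped[i] = true;
            }
        }
    } catch (...) {
        for (std::size_t j = 0; j < i; ++j) {
            if (swapped[j]) {
                std::swap(dst[j], src[j]); // restore the old state
            }
        }
        throw;
    }
}
```

Because the source is modified during apply, it has to be a mutable object, which matches the note above about unfreezing frozen mutations.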
Currently only the "set" storage could store empty cells, but not the
"vector" one, because there an empty cell means a missing one.
To implement rollback, we need to be able to distinguish empty
cells from missing ones. Solve by making vector storage use a bitmap
for presence checking instead of emptiness. This adds 4 bytes to
vector storage.
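A presence bitmap of this kind can be sketched as follows. The class name vector_storage, its methods, and the 32-column cap (chosen here to match a 4-byte bitmap) are assumptions made for illustration; the real row storage differs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical vector-backed storage: a cell's presence is tracked in a
// 32-bit bitmap, so an empty value no longer doubles as "missing".
class vector_storage {
    static constexpr std::size_t max_columns = 32;
    uint32_t _present = 0;            // bit i set <=> column i holds a cell
    std::vector<std::string> _cells;  // cell bytes; may be empty when present
public:
    void set(std::size_t column, std::string value) {
        assert(column < max_columns);
        if (column >= _cells.size()) {
            _cells.resize(column + 1);
        }
        _cells[column] = std::move(value);
        _present |= (uint32_t(1) << column);
    }
    void erase(std::size_t column) {
        assert(column < max_columns);
        if (column < _cells.size()) {
            _cells[column].clear();
        }
        _present &= ~(uint32_t(1) << column);
    }
    bool has(std::size_t column) const {
        return column < max_columns && (_present & (uint32_t(1) << column)) != 0;
    }
    // Returns nullptr for a missing cell and a valid pointer for an empty one.
    const std::string* find(std::size_t column) const {
        return has(column) ? &_cells[column] : nullptr;
    }
};
```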
Query result digest is used to verify that all replicas have the same
data. Therefore, it needs to contain more information than the query
result itself in order to ensure proper detection of disagreements.
Generally, adding clustering keys to the digest regardless of whether
the client asked for them will guarantee correctness. However, adding
tombstones as well improves the chances of early detection of nodes
containing stale data.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch changes the way vectors of optionals are implemented.
Now a vector of optionals is handled like any other non-primitive
type, with a single add() method that returns a writer to the
optional.
The writer to the optional has skip and write methods, like a
simple optional field.
For basic types the write method takes the value as a parameter; for
composite types, it returns a writer to the type.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1456796143-3366-2-git-send-email-amnon@scylladb.com>
The query result footprint for cassandra-stress mutation as reported
by tests/memory-footprint increased by 18% from 285 B to 337 B.
perf_simple_query shows a slight regression in throughput (-8%):
build/release/tests/perf/perf_simple_query -c4 -m1G --partitions 100000
Before: ~433k tps
After: ~400k tps
From Avi:
This patchset introduces a linearization context for managed_bytes objects.
Within this context, any scattered managed_bytes (found only in lsa regions,
so limited to memtable and cache) are auto-linearized for the lifetime of
the context. This ensures that key and value lookups can use fast
contiguous iterators instead of using slow discontiguous iterators (or
crashing, as is the case now).