Commit Graph

126 Commits

Author SHA1 Message Date
Paweł Dziepak
e95f4eaee4 Merge "partition_limit: Don't count dead partitions" from Duarte
"This patch series ensures we don't count dead partitions (i.e.,
partitions with no live rows) towards the partition_limit. We also
enforce the partition limit at the storage_proxy level, so that
limits with smp > 1 works correctly."

(cherry picked from commit 5f11a727c9)
2016-08-03 12:44:32 +03:00
Duarte Nunes
21d0a2c764 query: Optionally send cell ttl
This patch adds support to send a cell's ttl as part of a query's
result. This is needed for thrift support.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-07-14 15:36:23 +02:00
Paweł Dziepak
93cc4454a6 streamed_mutation: emit range_tombstones directly
Originally, streamed_mutations guaranteed that emitted tombstones are
disjoint. In order to achieve that two separate objects were produced
for each range tombstone: range_tombstone_begin and range_tombstone_end.

Unfortunately, this forced sstable writer to accumulate all clustering
rows between range_tombstone_begin and range_tombstone_end.

However, since there is no need to write disjoint tombstones to sstables
(see #1153 "Write range tombstones to sstables like Cassandra does") it
is also not necessary for streamed_mutations to produce disjoint range
tombstones.

This patch changes that by making streamed_mutation produce
range_tombstone objects directly.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-13 09:51:18 +01:00
Tomasz Grabiec
8c4b5e4283 db: Avoiding checking bloom filters during compaction
Checking bloom filters of sstables to compute max purgeable timestamp
for compaction is expensive in terms of CPU time. We can avoid
calculating it if we're not about to GC any tombstone.

This patch changes compacting functions to accept a function instead
of ready value for max_purgeable.

I verified that bloom filter operations no longer appear on flame
graphs during compaction-heavy workload (without tombstones).

Refs #1322.
2016-07-10 09:54:20 +02:00
Paweł Dziepak
23d0bfd065 mutation_partition: add row::memory_usage()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-07 12:17:25 +01:00
Paweł Dziepak
7a95847014 mutation_compactor: prepare for sstable compaction
compact_mutation code is going to be shared among queries and sstable
compaction. There are some differences though. Queries don't provide
_max_purgeable and sstable compaction don't need any limits.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-30 11:39:01 +01:00
Paweł Dziepak
4133cc7a53 mutation_reader: make consume_flattened() produce decorated keys
Since decorated keys are already computed it is better to pass more
information than less. Consumers interested just in partition key can
just drop token and the ones requiring full decorated key don't need to
recompute it.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-30 11:39:00 +01:00
Paweł Dziepak
3e86f9ab73 mutation_partition: extract compact_for_query to a separate header
The compacting logic inside compact_for_query is going to be shared with
sstable compaction.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-30 11:37:54 +01:00
Paweł Dziepak
b70bf086b7 frozen_mutation: handle reversed streams properly
Freezing streamed_mutations assumed that mutation fragments are streamed
in the order they appear in the frozen mutation. That is not true for
reversed streams.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1467277069-18702-1-git-send-email-pdziepak@scylladb.com>
2016-06-30 11:26:45 +02:00
Duarte Nunes
69798df95e query: Limit number of partitions returned
This is required to implement a thrift verb.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-22 09:48:13 +02:00
Duarte Nunes
594e43a60a compact_query: Rename partition_limit
This patch renames compact_query::_partition_limit to
_current_partition_limit for clarity, as the next patch adds
a partition limit that limits the number of partitions.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-22 09:47:29 +02:00
Duarte Nunes
e9ebd87991 compact_query: Rename limit to row_limit
This patch renames compact_query::_limit to _row_limit for
clarity, as a subsequent patch introduces yet another limit.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-22 09:47:28 +02:00
Duarte Nunes
01b18063ea query: Add per-partition row limit
This patch as a per-partition row limit. It ensures both local
queries and the reconciliation logic abide by this limit.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-22 09:46:51 +02:00
Paweł Dziepak
ed12c164f8 mutation_query: make mutation queries streaming-friendly
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:31:28 +01:00
Paweł Dziepak
0828c88b25 mutation_partition: implement streaming-friendly data_query()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:31:19 +01:00
Paweł Dziepak
67ae9457e3 mutation_partition: introduce mutation_querier
mutation_querier is a streamed_mutation consumer that adds the mutation
content to query::result.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:53 +01:00
Paweł Dziepak
f54e604a16 mutation_partition: introduce compact_for_query
compact_for_query is an intermediate stage used to compact data in a
flattened stream of mutations before they are consumed by query building
consumers.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:53 +01:00
Paweł Dziepak
f95c5542dc mutation_partition: allow slicing moved mutation_partition
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:51 +01:00
Paweł Dziepak
5a60f6d1ec range_tombstone: extract is_single_clustering_row_tombstone()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:50 +01:00
Paweł Dziepak
847bf878ec mutation_partition: add more row::apply() overloads
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:48 +01:00
Duarte Nunes
70083efee2 sstables: Read and write range tombstone bounds
This patch uses the composite_marker to add inclusiveness information
to the prefixes of a range tombstone.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-02 16:21:59 +02:00
Duarte Nunes
7628e403a3 sstables: Drop code for tombstone merging
Since Scylla now supports proper range tombstones, the code for
reading ranges from sstables and converting them to overlapping
tombstones is no longer necessary, and is, in fact, wasteful as
the internal representation converts overlapping tombstones back to
ranges.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-02 16:21:59 +02:00
Duarte Nunes
95594b8171 mutations: Encapsulate row tombstones difference
This patch moves the difference between two mutation_partition's
row_tombstones inside the range_tombstone_list.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-02 16:21:59 +02:00
Duarte Nunes
91aac30f12 mutations: Row tombstones are now a set of ranges
This patch changes the type of the mutation partition's row_tombstones
to be a range_tombstone_list, so that they are now represented as a
set of disjoint ranges. All of its usages are updated accordingly.

Fixes #1155

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-02 16:21:59 +02:00
Gleb Natapov
5fef0717cc query: find latest modification timestamp while calculating result digest 2016-05-24 13:27:34 +03:00
Piotr Jastrzebski
23c23abe53 Make memtable mutation_reader slice using clustering ranges.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-05-16 11:46:41 +02:00
Gleb Natapov
b75475de80 query: fix result row counting for results with multiple partitions
Message-Id: <1462377579-2419-1-git-send-email-gleb@scylladb.com>
2016-05-04 18:18:15 +02:00
Gleb Natapov
db322d8f74 query: put live row count into query::result
The patch calculates row count during result building and while merging.
If one of results that are being merged does not have row count the
merged result will not have one either.
2016-05-02 15:10:15 +03:00
Tomasz Grabiec
c69d0a8e87 mutation_partition: Fix collection emptiness check
Broken by f15c380a4f.

This resulted in empty collection being returned in the results
instead of no collection.

Fixes org.apache.cassandra.cql3.validation.entities.CollectionsTest
from cassandra-unit-tests.
2016-04-15 18:14:05 +02:00
Tomasz Grabiec
c2b955d40b mutation_partition: Fix static row being returned when paginating
Reproduced by dtest paging_test.py:TestPagingData.static_columns_paging_test.

Broken by f15c380a4f, where the
calcualtion of has_ck_selector got broken, in such a way that present
clustering restrictions were treated as if not present, which resulted
in static row being returned when it shouldn't.

While at it, unify the check between query_compacted() and
do_compact() by extracting it to a function.
2016-04-08 20:53:33 +02:00
Tomasz Grabiec
a1539fed95 mutation_partition: Fix reversed trim_rows()
The first erase_and_dispose(), which removes rows between last
position and beginning of the next range, can invalidate end()
iterator of the range. Fix by looking up end after erasing.

mutation_partition::range() was split into lower_bound() and
upper_bound() to allow for that.

This affects for example queries with descending order where the
selected clustering range is empty and falls before all rows.

Exposed by f15c380a4f, which is now
calling do_compact() during query.

Reproduced by dtest paging_test.py:TestPagingData.static_columns_paging_test
2016-04-08 20:53:33 +02:00
Avi Kivity
db03295c8a Merge "Fix query digest mismatch" from Tomasz
"Currently data query digest includes cells and tombstones which may have
expired or be covered by higher-level tombstones. This causes digest
mismatch between replicas if some elements are compacted on one of the
nodes and not on others. This mismatch triggers read-repair which doesn't
resolve because mutations received by mutation queries are not differing,
they are compacted already.

The fix adds compacting step before writing and digesting query results by
reusing the algorithm used by mutation query. This is not the most optimal
way to fix this. The compaction step could be folded with the query writing,
there is redundancy in both steps. However such change carries more risk,
and thus was postponed.

perf_simple_query test (cassandra-stress-like partitions) shows regression
from 83k to 77k (7%) ops/s.

Fixes #1165."
2016-04-08 12:13:29 +03:00
Pekka Enberg
38a54df863 Fix pre-ScyllaDB copyright statements
People keep tripping over the old copyrights and copy-pasting them to
new files. Search and replace "Cloudius Systems" with "ScyllaDB".

Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>
2016-04-08 08:12:47 +03:00
Tomasz Grabiec
f15c380a4f database: Compact mutations when executing data queries
Currently data query digest includes cells and tombstones which may have
expired or be covered by higher-level tombstones. This causes digest
mismatch between replicas if some elements are compacted on one of the
nodes and not on others. This mismatch triggers read-repair which doesn't
resolve because mutations received by mutation queries are not differing,
they are compacted already.

The fix adds compacting step before writing and digesting query results by
reusing the algorithm used by mutation query. This is not the most optimal
way to fix this. The compaction step could be folded with the query writing,
there is redundancy in both steps. However such change carries more risk,
and thus was postponed.

perf_simple_query test (cassandra-stress-like partitions) shows regression
from 83k to 77k (7%) ops/s.

Fixes #1165.
2016-04-07 19:56:58 +02:00
Tomasz Grabiec
dc290f0af7 mutation_partition: Make apply() atomic even in case of exception
We cannot leave partially applied mutation behind when the write
fails. It may fail if memory allocation fails in the middle of
apply(). This for example would violate write atomicity, readers
should either see the whole write or none at all.

This fix makes apply() revert partially applied data upon failure, by
the means of ReversiblyMergeable concept. In a nut shell the idea is
to store old state in the source mutation as we apply it and swap back
in case of exception. At cell level this swapping is inexpensive, just
rewiring pointers. For this to work, the source mutation needs to be
brought into mutable form, so frozen mutations need to be unfrozen. In
practice this doesn't increase amount of cell allocations in the
memtable apply path because incoming data will usually be newer and we
will have to copy it into LSA anyway. There are extra allocations
though for the data structures which holds cells.

I didn't see significant change in performance of:

  build/release/tests/perf/perf_simple_query -c1 -m1G --write --duration 13

The score fluctuates around ~77k ops/s.

Fixes #283.
2016-03-21 21:49:52 +01:00
Tomasz Grabiec
e09d186c7c mutation_partition: Make intrusive sets ReversiblyMergeable 2016-03-21 21:49:52 +01:00
Tomasz Grabiec
e4a576a90f mutation_partition: Make rows_entry ReversiblyMergeable 2016-03-21 19:26:24 +01:00
Tomasz Grabiec
aadcd75d89 mutation_partition: Make row_marker ReversiblyMergeable 2016-03-21 19:26:24 +01:00
Tomasz Grabiec
ea7c2dd085 mutation_partition: Make row ReversiblyMergeable 2016-03-21 19:26:24 +01:00
Tomasz Grabiec
d5e66a5b0d mutation_partition: row: Allow storing empty cells internally
Currently only "set" storage could store empty cells, but not the
"vector" one because there empty cell has the meaning of being
missing. To implement rolback, we need to be able to distinguish empty
cells from missing ones. Solve by making vector storage use a bitmap
for presence checking instead of emptiness. This adds 4 bytes to
vector storage.
2016-03-21 18:41:27 +01:00
Tomasz Grabiec
ed1e6515db mutation_partition: Make row::merge() tolerate empty row
The row may be empty and still have a set storage, in which case
rbegin() dereference is undefined behavior.
2016-03-21 18:41:27 +01:00
Tomasz Grabiec
518e956736 mutation_partition: Make row::vector_to_set() exception-safe
Currently allocation failure can leave the old row in a
half-moved-from state and leak cell_entry objects.
2016-03-18 22:30:04 +01:00
Tomasz Grabiec
c91eefa183 mutation_partition: Unmark cell_entry's copy constructor as noexcept
It was a mistake, it certainly may throw because it copies cells.
2016-03-18 22:30:04 +01:00
Paweł Dziepak
21e2ebcf8c query: build only result, only digest or both
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-11 18:27:13 +00:00
Paweł Dziepak
46079f763b query: add keys and tombstones to result digest
Query result digest is used to verify that all replicas have the same
data. Therefore, it needs to contain more information than the query
result itself in order to ensure proper detection of disagreements.

Generally, adding clustering keys to the digest regardless of whether
the client asked for them will guarantee correctness. However, adding
tombstones as well improves the chances of early detection of nodes
containing stale data.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-11 18:27:13 +00:00
Paweł Dziepak
c1f7f11d54 mutation_partition: do not add ck to result when not asked to
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-11 18:27:13 +00:00
Amnon Heiman
1c7bc28d35 idl-compiler: change optional vector implementation
This patch change the way optional vector are implemented.

Now a vector of optional would be handle like any other non primitive
types, with a single method add() that would return a writer to the
optional.

The writer to the optional would have a skip and write method like
simple optional field.

For basic types the write method would get the value as a parameter, for
composite type, it would return a writer to the type.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1456796143-3366-2-git-send-email-amnon@scylladb.com>
2016-03-01 09:41:30 +02:00
Tomasz Grabiec
6cec131432 query: Switch to IDL-generated views and writers
The query result footprint for cassandra-stress mutation as reported
by tests/memory-footprint increased by 18% from 285 B to 337 B.

perf_simple_query shows slight regression in throughput (-8%):

  build/release/tests/perf/perf_simple_query -c4 -m1G --partitions 100000

Before: ~433k tps
After:  ~400k tps
2016-02-26 12:26:13 +01:00
Tomasz Grabiec
4284715ddf Relax includes 2016-02-26 12:26:13 +01:00
Tomasz Grabiec
a921479e71 Merge tag '807-v3' from https://github.com/avikivity/scylla
From Avi:

This patchset introduces a linearization context for managed_bytes objects.

Within this context, any scattered managed_bytes (found only in lsa regions,
so limited to memtable and cache) are auto-linearized for the lifetime of
the context.   This ensures that key and value lookups can use fast
contiguous iterators instead of using slow discontiguous iterators (or
crashing, as is the case now).
2016-02-16 14:29:48 +01:00