This reverts commit aa392810ff, reversing
changes made to a24ff47c637e6a5fd158099b8a65f1191fc2d023; it uses
boost::intrusive::detail directly, which it must not, and doesn't compile on
all boost versions as a consequence.
Renaming the function to external_memory_usage() makes it clear that
sizeof(T) is not included, something that was a source of confusion in
the past.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
If mutation is bigger than this limit
it won't be read and mutation_from_streamed_mutation
will return empty optional.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Originally, streamed_mutations guaranteed that emitted tombstones are
disjoint. In order to achieve that two separate objects were produced
for each range tombstone: range_tombstone_begin and range_tombstone_end.
Unfortunately, this forced sstable writer to accumulate all clustering
rows between range_tombstone_begin and range_tombstone_end.
However, since there is no need to write disjoint tombstones to sstables
(see #1153 "Write range tombstones to sstables like Cassandra does") it
is also not necessary for streamed_mutations to produce disjoint range
tombstones.
This patch changes that by making streamed_mutation produce
range_tombstone objects directly.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch as a per-partition row limit. It ensures both local
queries and the reconciliation logic abide by this limit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"Currently data query digest includes cells and tombstones which may have
expired or be covered by higher-level tombstones. This causes digest
mismatch between replicas if some elements are compacted on one of the
nodes and not on others. This mismatch triggers read-repair which doesn't
resolve because mutations received by mutation queries are not differing,
they are compacted already.
The fix adds compacting step before writing and digesting query results by
reusing the algorithm used by mutation query. This is not the most optimal
way to fix this. The compaction step could be folded with the query writing,
there is redundancy in both steps. However such change carries more risk,
and thus was postponed.
perf_simple_query test (cassandra-stress-like partitions) shows regression
from 83k to 77k (7%) ops/s.
Fixes #1165."
Currently data query digest includes cells and tombstones which may have
expired or be covered by higher-level tombstones. This causes digest
mismatch between replicas if some elements are compacted on one of the
nodes and not on others. This mismatch triggers read-repair which doesn't
resolve because mutations received by mutation queries are not differing,
they are compacted already.
The fix adds compacting step before writing and digesting query results by
reusing the algorithm used by mutation query. This is not the most optimal
way to fix this. The compaction step could be folded with the query writing,
there is redundancy in both steps. However such change carries more risk,
and thus was postponed.
perf_simple_query test (cassandra-stress-like partitions) shows regression
from 83k to 77k (7%) ops/s.
Fixes#1165.
Schema is tracked in memtable and cache per-entry. Entries are
upgraded lazily on access. Incoming mutations are upgraded to table's
current schema on given shard.
Mutating nodes need to keep schema_ptr alive in case schema version is
requested by target node.
Allows for having more than one clustering row range set, depending on
PK queried (although right now limited to one - which happens to be exactly
the number of mutiplexing paging needs... What a coincidence...)
Encapsulates the row_ranges member in a query function, and if needed holds
ranges outside the default one in an extra object.
Query result::builder::add_partition now fetches the correct row range for
the partition, and this is the range used in subsequent iteration.
We use boost::any to convert to and from database values (stored in
serlialized form) and native C++ values. boost::any captures information
about the data type (how to copy/move/delete etc.) and stores it inside
the boost::any instance. We later retrieve the real value using
boost::any_cast.
However, data_value (which has a boost::any member) already has type
information as a data_type instance. By teaching data_type intances about
the corresponding native type, we can elimiante the use of boost::any.
While boost::any is evil and eliminating it improves efficiency somewhat,
the real goal is growing native type support in data_type. We will use that
later to store native types in the cache, enabling O(log n) access to
collections, O(1) access to tuples, and more efficient large blob support.
By passing mutation_partition oither by const ref or rref instead of
by value one move can be avoided if copying is necessary.
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
When partition has no live regular rows, but has some data live in the
static row, then it should appear in the results, even though we
didn't select any static column.
To reproduce:
create table cf (k blob, c blob, v blob, s1 blob static, primary key (k, c));
update cf set s1 = 0x01 where k = 0x01;
update cf set s1 = 0x02 where k = 0x02;
select k from cf;
The "select" statement should return 2 rows, but was returning 0.
The following query worked fine, because static columns were included:
select * from cf;
The data query should contain only live data, so we shouldn't write a
partition entry if it's supposed to be absent from the results. We
can'r tell that though until we've processed all the data. To solve
this problem, query result writer is using an optimistic approach,
where the partition header will be retracted from the buffer
(cheaply), if it turns out there's no live data in it.
Reduces coupling. User's should not rely on the fact that it's an
std::map<>. It also allows us to extend row's interface with
domain-specific methods, which are a lot easier to discover than free
functions.
Origin does that, so should we. Both ttl and expiry time are stored in
sstables. The value of ttl seems to be used to calculate the read
digest (expiry is not used for that).
The API for creating atomic_cells changed a bit.
To create a non-expiring cell:
atomic_cell::make_live(timestamp, value);
To create an expiring cell:
atomic_cell::make_live(timestamp, value, expiry, ttl);
or:
// Expiry is calculated based on current clock reading
atomic_cell::make_live(timestamp, value, ttl_optional);
Ensure that read-side accessors are const. This is important in preparation
for multiple memtables (and later, sstables) since a read-side
mutation_partition may be a temporary object coming from multiple memtables
(and sstables) while a write-side mutation_partition is guaranteed to belong
to a single memtable (and thus, not be temporary).
Since writers will want non-const mutation_partitions to write to, they won't
be able to use the read-side accessors by accident.
Partitions should be ordered using Origin's ordering, which is first
by token, then by Origin's representation of the key. That is the
natural ordering of decorated_key.
This also changes mutation class to hold decorated_key, to avoid
decoration overhead at different layers.