Commit Graph

104 Commits

Author SHA1 Message Date
Pavel Emelyanov
92f58f62f2 headers:: Remove flat_mutation_reader.hh from several other headers
All they can live with forward declaration of the f._m._r. plus a
seastar header in commitlog code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-17 17:54:47 +03:00
Pavel Emelyanov
4d2f5f93a4 memtable: Switch onto B+ rails
The change is the same as with row-cache -- use B+ with int64_t token
as key and array of memtable_entry-s inside it.

The changes are:

Similar to those for row_cache:

- compare() goes away, new collection uses ring_position_comparator

- insertion and removal happens with the help of double_decker, most
  of the places are about slightly changed semantics of it

- flags are added to memtable_entry, this makes its size larger than
  it could be, but still smaller than it was before

Memtable-specific:

- when the new entry is inserted into tree iterators _might_ get
  invalidated by double-decker inner array. This is easy to check
  when it happens, so the invalidation is avoided when possible

- the size_in_allocator_without_rows() is now not very precise. This
  is because after the patch memtable_entries are not allocated
  individually as they used to. They can be squashed together with
  those having token conflict and asking allocator for the occupied
  memory slot is not possible. As the closest (lower) estimate the
  size of enclosing B+ data node is used

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-14 16:30:02 +03:00
Pavel Emelyanov
dff5eb6f25 memtable: Count partitions separately
The B+ will not have constant-time .size() call, so do it by hands

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-14 16:30:02 +03:00
Pavel Emelyanov
7b2754cf5f row-cache: Use ring_position_comparator in some places
The row cache (and memtable) code uses own comparators built on top
of the ring_position_comparator for collections of partitions. These
collections will be switched from the key less-compare to the pair
of token less-compare + key tri-compare.

Prepare for the switch by generalizing the ring_partition_comparator
and by patching all the non-collections usage of less-compare to use
one.

The memtable code doesn't use it outside of collections, but patch it
anyway as a part of preparations.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-14 16:30:02 +03:00
Botond Dénes
9ede82ebf8 memtable: pass a valid permit to the delegate reader
All reader are soon going to require a valid permit, so make sure we
have a valid permit which we can pass to the delegate reader when
creating it. This means `memtable::make_flat_reader()` now also requires
a permit to be passed to it.
Internally the permit is stored in `scanning_reader`, which is used both
for flushes and normal reads. In the former case a permit is not
required.
2020-05-28 11:34:35 +03:00
Pavel Emelyanov
4fa12f2fb8 header: De-bloat schema.hh
The header sits in many other headers, but there's a handy
schema_fwd.hh that's tiny and contains needed declarations
for other headers. So replace shema.hh with schema_fwd.hh
in most of the headers (and remove completely from some).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200303102050.18462-1-xemul@scylladb.com>
2020-03-03 11:34:00 +01:00
Piotr Dulikowski
59fbbb993f memtables: add partition/row hit/miss counters
Adds per-table metrics for counting partition and row reuse
in memtables. New metrics are as follows:
    - memtable_partition_writes - number of write operations performed
          on partitions in memtables,
    - memtable_partition_hits - number of write operations performed
          on partitions that previously existed in a memtable,
    - memtable_row_writes - number of row write operations performed
          in memtables,
    - memtable_row_hits - number of row write operations that ovewrote
          rows previously present in a memtable.

Tests: unit(release)
2019-11-12 13:35:41 +01:00
Tomasz Grabiec
ea461a3884 memtable: Extract memtable_entry::upgrade_schema() 2019-10-03 22:03:29 +02:00
Paweł Dziepak
341f186933 memtable: move encoding_stats_collector implementation out of header 2019-02-07 10:16:50 +00:00
Benny Halevy
bd6861989d sstables: mc: use proper gc_clock types for local_deletion_time and ttl
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-01-22 15:34:32 +02:00
Benny Halevy
e2c4d2d60a memtable: extract encoding_stats_collector base class to encoding_stats header file
To be used also by compaction.

Refs #3971

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-01-16 17:59:58 +02:00
Paweł Dziepak
635873639b Merge "Encoding stats enhancements" from Benny
"
Cleanup various cases related to updating of metatdata stats and encoding stats
updating in preparation for 64-bit gc_clock (#3353).

Fixes #4026
Fixes #4033
Fixes #4035
Fixes #4041

Refs #3353
"

* 'projects/encoding-stats-fixes/v6' of https://github.com/bhalevy/scylla:
  sstables: remove duplicated code in data_consume_rows_context CELL_VALUE_BYTES
  sstables: mc: use api::timestamp_type in write_liveness_info
  sstables: mc: sstable_write encoding_stats are const
  mp_row_consumer_k_l::consume_deleted_cell rename ttl param to local_deletion_time
  memtable: don't use encoding_stats epochs as default
  memtable: mc: udpate min_ttl encoding stats for dead row marker
  memtable: mc: add comment regarding updating encoding stats of collection tombstones
  sstables: metadata_collector: add update tombstone stats
  sstables: assert that delete_time is not live when updating stats
  sstables: move update_deletion_time_stats to metadata collector
  sstables: metadata_collector: introduce update_local_deletion_time_and_tombstone_histogram
  sstables: mc: write_liveness_info and write_collection should update tombstone_histogram
  sstables: update_local_deletion_time for row marker deletion_time and expiration
2019-01-15 16:53:36 +02:00
Benny Halevy
238866228f memtable: rename get_stats to get_encoding_stats
For symmetry reasons to similar sstable and compaction methods.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190113105155.29118-2-bhalevy@scylladb.com>
2019-01-14 14:58:43 +02:00
Benny Halevy
2c99eb28d8 memtable: don't use encoding_stats epochs as default
Why default to an artificial minimum when you can do better
with zero effort? Track the actual minima in the memtable instead.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-01-13 14:17:45 +02:00
Benny Halevy
9b78911379 memtable: mc: udpate min_ttl encoding stats for dead row marker
Update min ttl with expired_liveness_ttl (although it's value of max int32
is not expected to affect the minimum).

Fixes #4041

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-01-13 14:17:45 +02:00
Benny Halevy
47964d9ddc memtable: mc: add comment regarding updating encoding stats of collection tombstones
When the row flag has_complex_deletion is set, some collection columns may have
deletion tombstones and some may not. we don't strictly need to update stats
will not affect the encoding_stats anyway.

Fixes #4035

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-01-13 14:17:45 +02:00
Benny Halevy
60323b79d1 sstables: mc: sign-extend delta local_deletion_time and delta ttl
Follow Cassandra's encoding so that values that are less than the
baseline encoding_stats will wrap-around in 64-bits rather tham 32.

Fixes #4074
Refs #3353

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190109192703.18371-1-bhalevy@scylladb.com>
2019-01-09 21:43:30 +02:00
Vladimir Krivopalov
a95ba2f38a memtable: Track regular and shadowable tombstones separately in encoding_stats_collector.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-10-23 16:30:42 -07:00
Tomasz Grabiec
074be4d4e8 memtable, cache: Run mutation_cleaner worker in its own scheduling group
The worker is responsible for merging MVCC snapshots, which is similar
to merging sstables, but in memory. The new scheduling group will be
therefore called "memory compaction".

We should run it in a separate scheduling group instead of
main/memtables, so that it doesn't disrupt writes and other system
activities. It's also nice for monitoring how much CPU time we spend
on this.
2018-06-27 21:51:04 +02:00
Tomasz Grabiec
6c6ffaee71 mutation_cleaner: Make merge() redirect old instance to the new one
If memtable snapshot goes away after memtable started merging to
cache, it would enqueue the snapshots for cleaning on the memtable's
cleaner, which will have to clean without deferrring when the memtable
is destroyed. That may stall the reactor. To avoid this, make merge()
cause the old instance of the cleaner to redirect to the new instance
(owned by cache), like we do for regions. This way the snapshots
mentioned earlier can be cleaned after memtable is destroyed,
gracefully.
2018-06-27 21:51:04 +02:00
Tomasz Grabiec
450985dfee mvcc: Use RAII to ensure that partition versions are merged
Before this patch, maybe_merge_versions() had to be manually called
before partition snapshot goes away. That is error prone and makes
client code more complicated. Delegate that task to a new
partition_snapshot_ptr object, through which all snapshots are
published now.
2018-06-27 21:51:04 +02:00
Paweł Dziepak
aa25f0844f atomic_cell: introduce fragmented buffer value interface
As a prepratation for the switch to the new cell representation this
patch changes the type returned by atomic_cell_view::value() to one that
requires explicit linearisation of the cell value. Even though the value
is still implicitly linearised (and only when managed by the LSA) the
new interface is the same as the target one so that no more changes to
its users will be needed.
2018-05-31 15:51:11 +01:00
Paweł Dziepak
ec9d166a4f treewide: require type to compute cell memory usage 2018-05-31 15:51:11 +01:00
Paweł Dziepak
93130e80fb atomic_cell: require column_definition for creating atomic_cell views 2018-05-31 15:51:11 +01:00
Tomasz Grabiec
3f19f76c67 mvcc: Destroy memtable partition versions gently
Now all snapshots will have a mutation_cleaner which they will use to
gently destroy freed partition_version objects.

Destruction of memtable entries during cache update is also using the
gentle cleaner now. We need to have a separate cleaner for memtable
objects even though they're owned by cache's region, because memtable
versions must be cleared without a cache_tracker.

Each memtable will have its own cleaner, which will be merged with the
cache's cleaner when memtable is merged into cache.

Fixes some sources of reactor stalls on cache update when there are
large partition entries in memtables.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
40cc766cf2 database: Add API for incremental clearing of partition entries
Partitions can get very large. Destroying them all at once can stall
the reactor for significant amount of time. We want to avoid that by
doing destruction incrementally, deferring in between. A new API is
added for that at various levels:

  stop_iteration clear_gently() noexcept;

It returns stop_iteration::yes when the object is fully cleared and
can be now destroyed quickly. So a deferring destruction can look like
this:

  return repeat([this] { return clear_gently(); });

The reason why clear_gently() doesn't return a future<> itself is that some
contexts cannot defer, like memory reclamation.
2018-05-30 12:18:56 +02:00
Glauber Costa
68d1c64e7a memtable: also keep track of max timestamp
We are now keeping track of the minimum timestamp in a memtable. Also
keep track of the max timestamp so we can know what it is before we
finish flushing the entire memtable to an SSTable. Will be used by
partially written SSTables undergoing TWCS.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 12:55:58 -04:00
Paweł Dziepak
863a96db48 Merge "Fix partition tombstones for SSTables 3.x" from Vladimir
"Previously, partition tombstone was not written for partitions with no
rows causing corrupted data files.

This is now fixed and covered with tests.

In addition, we now track partition tombstones while collecting encoding
statistics."

* 'projects/sstables-30/fix-partition-tombstone/v3' of https://github.com/argenet/scylla:
  tests: Don't use deprecated schema constructor.
  tests: Add tests to cover partitions consisting only of partition keys.
  sstables: Make sure partition level tombstone is written for partitions with no rows.
  memtable: Collect statistics from partition-level tombstone.
2018-05-10 16:27:20 +01:00
Vladimir Krivopalov
ffc3a1ffeb memtable: Collect statistics from partition-level tombstone.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-10 07:28:50 -07:00
Paweł Dziepak
0b4c6b8938 types: make some collection_type_impl functions non-static
The switch to the new in-memory representation will require a larger
parts of the logic be aware of the type of the values they are dealing
with. In most cases it is not a significant burden for the users.
2018-05-09 16:52:26 +01:00
Vladimir Krivopalov
948c4d79d3 Collect encoding statistics for memtable updates.
We keep track of all updates and store the minimal values of timestamps,
TTLs and local deletion times across all the inserted data.
These values are written as a part of serialization_header for
Statistics.db and used for delta-encoding values when writing Data.db
file in SSTables 3.0 (mc) format.

For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-25 15:39:14 -07:00
Tomasz Grabiec
d85d651e0f memtable: Make printable
Useful when debugging test failures.
2018-02-06 14:24:19 +01:00
Paweł Dziepak
ea7248056f memtable: drop memtable_entry::read() 2018-01-30 18:33:26 +01:00
Paweł Dziepak
0420ca48a5 paratition_snapshot_reader: minimise amount of retryable code
Retryable code that has side effects is a recipe for bugs. This patch
reworkds the snapshot reader so that the amount of logic run with
reclamation disabled is minimal and has a very limited side effects.
2018-01-30 18:33:26 +01:00
Piotr Jastrzebski
17f2eb8ff7 Remove memtable::make_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Paweł Dziepak
32eb6437fd memtable: make make_flush_reader() return flat_mutation_reader 2017-11-27 20:07:22 +01:00
Glauber Costa
c2f49da609 partition: add method to calculate memory size of a partition
Once that is added, also add a method to a memtable entry to calculate
the entire size of a memtable entry. Right now we only have one method
to calculate the size minus rows.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 16:21:44 -05:00
Glauber Costa
b98a48657e memtable: add a method to export memtable's dirty memory manager
It will be used by the cache update process to gradually return real
dirty memory to the manager.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 16:21:44 -05:00
Glauber Costa
ec36b9eddc memtable: factor out calculation of memtable_entry memory size
The total size is the sum of two components. Add a method that
does that sum so this code gets easier to reuse.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 16:21:44 -05:00
Piotr Jastrzebski
68505a5065 Change memtable_entry::read to return flat_mutation_reader
This is the first step to move scanning_reader to
be flat_mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Piotr Jastrzebski
0a9ab7ff80 memtable: Introduce make_flat_reader
This method creates a flat_mutation_reader
instead of mutation_reader. All users will be gradually
converted to the new interface. make_reader is implemented
using make_flat_reader and will be removed once all users
are migrated.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Duarte Nunes
baeec0935f Replace query::full_slice with schema::full_slice()
query::full_slice doesn't select any regular or static columns, which
is at odds with the expectations of its users. This patch replaces it
with the schema::full_slice() version.

Refs #2885

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>
2017-10-17 11:25:53 +02:00
Tomasz Grabiec
2df6f356b1 mvcc: Store LSA region reference in partition_snapshot
Will be useful for improving encapsulation.
2017-09-13 17:38:08 +02:00
Tomasz Grabiec
673a22f8e1 memtable: Mark mark_flushed() as noexcept
Callers rely on that.
2017-09-04 10:04:29 +02:00
Duarte Nunes
1e7f0eab82 memtable: Created readers should be fast forwardable by default
mutation_reader::forwarding defaults to yes.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170816180304.2121-1-duarte@scylladb.com>
2017-08-17 10:21:01 +03:00
Nadav Har'El
3018df11b5 Allow reading exactly desired byte ranges and fast_forward_to
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.

As result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).

This patch has two levels:

1. In the lower level, sstable::data_consume_rows(), which reads all
   partitions in a given disk byte range, now gets another byte position,
   "last_end". That can be the range's end, the end of the file, or anything
   in between the two. It opens the disk stream until last_end, which means
   1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is
   not allowed beyond last_end.

2. In the upper level, we add to the various layers of sstable readers,
   mutation readers, etc., a boolean flag mutation_reader::forwarding, which
   says whether fast_forward_to() is allowed on the stream of mutations to
   move the stream to a different partition range.

   Note that this flag is separate from the existing boolean flag
   streamed_mutation::fowarding - that one talks about skipping inside a
   single partition, while the flag we are adding is about switching the
   partition range being read. Most of the functions that previously
   accepted streamed_mutation::forwarding now accept *also* the option
   mutation_reader::forwarding. The exception are functions which are known
   to read only a single partition, and not support fast_forward_to() a
   different partition range.

   We note that if mutation_reader::forwarding::no is requested, and
   fast_forward_to() is forbidden, there is no point in reading anything
   beyond the range's end, so data_consume_rows() is called with last_end as
   the range's end. But if forwarding::yes is requested, we use the end of the
   file as last_end, exactly like the code before this patch did.

Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.

In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve of anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170619152629.11703-1-nyh@scylladb.com>
2017-06-19 18:31:32 +03:00
Avi Kivity
6e2c9ef9fb Revert "Allow reading exactly desired byte ranges and fast_forward_to"
This reverts commit 317d7fc253 (and also the
related 2c57ab84b2).  It causes crashes
during range scans, reported by Gleb:

"To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s
dataset and 3 node cluster.

Backtrace:
    at /home/gleb/work/seastar/seastar/core/apply.hh:36
    rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57
    range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142
    at ./seastar/core/future.hh:890
    at /home/gleb/work/seastar/seastar/core/future-util.hh:119
    at /home/gleb/work/seastar/seastar/core/future-util.hh:142
2017-06-18 16:10:21 +03:00
Nadav Har'El
317d7fc253 Allow reading exactly desired byte ranges and fast_forward_to
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.

As result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).

This patch has two levels:

1. In the lower level, sstable::data_consume_rows(), which reads all
   partitions in a given disk byte range, now gets another byte position,
   "last_end". That can be the range's end, the end of the file, or anything
   in between the two. It opens the disk stream until last_end, which means
   1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is
   not allowed beyond last_end.

2. In the upper level, we add to the various layers of sstable readers,
   mutation readers, etc., a boolean flag mutation_reader::forwarding, which
   says whether fast_forward_to() is allowed on the stream of mutations to
   move the stream to a different partition range.

   Note that this flag is separate from the existing boolean flag
   streamed_mutation::fowarding - that one talks about skipping inside a
   single partition, while the flag we are adding is about switching the
   partition range being read. Most of the functions that previously
   accepted streamed_mutation::forwarding now accept *also* the option
   mutation_reader::forwarding. The exception are functions which are known
   to read only a single partition, and not support fast_forward_to() a
   different partition range.

   We note that if mutation_reader::forwarding::no is requested, and
   fast_forward_to() is forbidden, there is no point in reading anything
   beyond the range's end, so data_consume_rows() is called with last_end as
   the range's end. But if forwarding::yes is requested, we use the end of the
   file as last_end, exactly like the code before this patch did.

Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.

In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve of anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170614072122.13473-1-nyh@scylladb.com>
2017-06-15 13:22:46 +01:00
Calle Wilund
2913241df1 memtable/commitlog: Change bookkeep to track individul segments
Use per CF-id reference count instead, and use handles as result of 
add operations. These must either be explicitly released or stored
(rp_set), or they will release the corresponding replay_position
upon destruction. 

Note: this does _not_ remove the replay positioning ordering requirement
for mutations. It just removes it as a means to track segment liveness.
2017-06-07 12:07:01 +00:00
Tomasz Grabiec
de70d942a9 memtable: Decouple from sstable
We can make the dependency more abstract by using mutation_source
instead of an sstable.

Will be useful in some stress tests which want to avoid the disk, but
is also good for the sake of decoupling.
Message-Id: <1495729508-30081-2-git-send-email-tgrabiec@scylladb.com>
2017-05-25 19:30:21 +03:00