Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.
References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.
scylla-gdb.py is adjusted to look for both the new and old names.
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.
As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
Add flags indicating whether a memtable contains tombstones. They can
be used as a heuristic to determine whether a memtable should be
compacted on flush. This is an intermediate step until we can compact
while applying mutations to a memtable.
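As an illustration of the heuristic (all names here are hypothetical, not Scylla's actual API), the flags could be tracked and consulted like this:

```cpp
// Hypothetical sketch: a memtable remembers whether any applied mutation
// carried a tombstone, so the flush path can cheaply decide whether
// compacting on flush could actually drop any data.
struct mutation_summary {
    bool has_range_tombstones = false;
    bool has_row_tombstones = false;
};

class memtable_flags {
    bool _contains_range_tombstones = false;
    bool _contains_row_tombstones = false;
public:
    // Sticky flags: once a tombstone is seen, the memtable keeps the mark.
    void on_apply(const mutation_summary& m) {
        _contains_range_tombstones |= m.has_range_tombstones;
        _contains_row_tombstones |= m.has_row_tombstones;
    }
    // Heuristic: only pay for compaction-on-flush when it can drop data.
    bool should_compact_on_flush() const {
        return _contains_range_tombstones || _contains_row_tombstones;
    }
};
```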
This commit consists of changes that need to reside in a single
commit, so that the tests pass on each commit.
1. Remove do_make_flat_reader which disabled reverse reads by making the
slice a forward one. Remove call to get_ranges which would do
superfluous reversal of clustering ranges.
2. test: cql_query_test: remove expectation that the test_query_limit
fails for reversed queries, since reversed queries no longer require
linear memory wrt. the result size, when paginated.
Push down reversing to the mutation-sources proper, instead of doing it
on the querier level. This will allow us to test reverse reads on the
mutation source level.
The `max_size` parameter of `consume_page()` is now unused, but it is not
removed in this patch; it will be removed in a follow-up to reduce
churn.
Fixes #8733
If a memtable flush is still pending when we call table::clear(),
we can end up doing a "discard-all" call to commitlog, followed
by a per-segment-count discard (using rp_set) _later_. This will corrupt
our internal usage counts and quite probably cause assertion
failures.
Fixed by always doing an explicit per-memtable discard call. But to
ensure this works, since a memtable being flushed remains on the
memtable list for a while (why?), we must also ensure we clear
out the rp_set on discard.
v3:
* Fix table::clear to discard rp_sets before memtables
Closes #8894
Eliminate unused includes and replace some more includes
with forward declarations where appropriate.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Fixes #8733
If a memtable flush is still pending when we call table::clear(),
we can end up doing a "discard-all" call to commitlog, followed
by a per-segment-count discard (using rp_set) _later_. This will corrupt
our internal usage counts and quite probably cause assertion
failures.
Fixed by always doing an explicit per-memtable discard call. But to
ensure this works, since a memtable being flushed remains on the
memtable list for a while (why?), we must also ensure we clear
out the rp_set on discard.
Closes #8766
Tracking both the min and max timestamps will be required for memtable flush
to short-circuit the interposer consumer if needed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
All of them can live with a forward declaration of the f._m._r. plus a
seastar header in the commitlog code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The change is the same as with row-cache -- use a B+ tree with an int64_t
token as the key and an array of memtable_entry-s inside it.
The changes are:
Similar to those for row_cache:
- compare() goes away, new collection uses ring_position_comparator
- insertion and removal happens with the help of double_decker, most
of the places are about slightly changed semantics of it
- flags are added to memtable_entry, this makes its size larger than
it could be, but still smaller than it was before
Memtable-specific:
- when a new entry is inserted into the tree, iterators _might_ get
  invalidated by the double-decker inner array. It is easy to check
  when this happens, so the invalidation is avoided when possible
- the size_in_allocator_without_rows() is now not very precise. This
  is because after the patch memtable_entries are no longer allocated
  individually as they used to be. They can be squashed together with
  those having a token conflict, and asking the allocator for the occupied
  memory slot is not possible. As the closest (lower) estimate, the
  size of the enclosing B+ data node is used
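A rough sketch of the double-decker layout, with simplified standard-library types standing in for the intrusive B+ tree and the real entry types (illustrative only):

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Sketch: the outer collection is keyed by the int64_t token alone, and
// entries whose tokens collide live together in a small sorted inner array.
struct entry {
    std::string key;   // stands in for the partition key
    int value;
};

class double_decker {
    std::map<int64_t, std::vector<entry>> _tree;
public:
    void insert(int64_t token, entry e) {
        auto& bucket = _tree[token];
        auto it = std::lower_bound(bucket.begin(), bucket.end(), e.key,
            [](const entry& a, const std::string& k) { return a.key < k; });
        if (it != bucket.end() && it->key == e.key) {
            it->value = e.value;  // same token, same key: overwrite in place
        } else {
            // Growing the inner array may invalidate iterators into it,
            // which is the invalidation hazard mentioned above.
            bucket.insert(it, std::move(e));
        }
    }
    const entry* find(int64_t token, const std::string& key) const {
        auto it = _tree.find(token);
        if (it == _tree.end()) return nullptr;
        for (const auto& e : it->second) {
            if (e.key == key) return &e;
        }
        return nullptr;
    }
    size_t bucket_size(int64_t token) const {
        auto it = _tree.find(token);
        return it == _tree.end() ? 0 : it->second.size();
    }
};
```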
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The row cache (and memtable) code uses its own comparators, built on top
of the ring_position_comparator, for collections of partitions. These
collections will be switched from the key less-compare to the pair
of token less-compare + key tri-compare.
Prepare for the switch by generalizing the ring_position_comparator
and by patching all the non-collection usages of less-compare to use
it.
The memtable code doesn't use it outside of collections, but patch it
anyway as part of the preparations.
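The split into a token less-compare plus a key tri-compare can be sketched as follows (simplified illustrative types, not the actual ring_position or tri_compare implementation):

```cpp
#include <cstdint>
#include <string>

// Sketch: the outer ordering uses only the token with a plain
// less-compare; entries with equal tokens fall back to a tri-compare
// on the key.
struct ring_position {
    int64_t token;
    std::string key;
};

inline bool token_less(const ring_position& a, const ring_position& b) {
    return a.token < b.token;
}

// Tri-compare: negative / zero / positive result.
inline int key_tri_compare(const ring_position& a, const ring_position& b) {
    return a.key.compare(b.key);
}

inline int ring_position_tri_compare(const ring_position& a,
                                     const ring_position& b) {
    if (a.token != b.token) {
        return a.token < b.token ? -1 : 1;
    }
    return key_tri_compare(a, b);
}
```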
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All readers are soon going to require a valid permit, so make sure we
have a valid permit which we can pass to the delegate reader when
creating it. This means `memtable::make_flat_reader()` now also requires
a permit to be passed to it.
Internally the permit is stored in `scanning_reader`, which is used both
for flushes and normal reads. In the former case a permit is not
required.
The header sits in many other headers, but there's a handy
schema_fwd.hh that's tiny and contains needed declarations
for other headers. So replace schema.hh with schema_fwd.hh
in most of the headers (and remove completely from some).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200303102050.18462-1-xemul@scylladb.com>
Adds per-table metrics for counting partition and row reuse
in memtables. New metrics are as follows:
- memtable_partition_writes - number of write operations performed
on partitions in memtables,
- memtable_partition_hits - number of write operations performed
on partitions that previously existed in a memtable,
- memtable_row_writes - number of row write operations performed
in memtables,
- memtable_row_hits - number of row write operations that overwrote
rows previously present in a memtable.
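A toy model of how the four counters relate (hypothetical code, not the actual metrics implementation): every write bumps the `*_writes` counter, and the `*_hits` counter only when the target already existed:

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>

struct memtable_metrics {
    uint64_t partition_writes = 0;
    uint64_t partition_hits = 0;
    uint64_t row_writes = 0;
    uint64_t row_hits = 0;
};

class toy_memtable {
    std::map<std::string, std::set<int>> _partitions; // pk -> clustering rows
    memtable_metrics _metrics;
public:
    void write_row(const std::string& pk, int row) {
        ++_metrics.partition_writes;
        auto [pit, p_inserted] = _partitions.try_emplace(pk);
        if (!p_inserted) {
            ++_metrics.partition_hits;   // partition already existed
        }
        ++_metrics.row_writes;
        auto [rit, r_inserted] = pit->second.insert(row);
        if (!r_inserted) {
            ++_metrics.row_hits;         // overwrote an existing row
        }
    }
    const memtable_metrics& metrics() const { return _metrics; }
};
```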
Tests: unit(release)
"
Cleanup various cases related to updating of metadata stats and encoding stats
in preparation for 64-bit gc_clock (#3353).
Fixes #4026
Fixes #4033
Fixes #4035
Fixes #4041
Refs #3353
"
* 'projects/encoding-stats-fixes/v6' of https://github.com/bhalevy/scylla:
sstables: remove duplicated code in data_consume_rows_context CELL_VALUE_BYTES
sstables: mc: use api::timestamp_type in write_liveness_info
sstables: mc: sstable_write encoding_stats are const
mp_row_consumer_k_l::consume_deleted_cell rename ttl param to local_deletion_time
memtable: don't use encoding_stats epochs as default
memtable: mc: update min_ttl encoding stats for dead row marker
memtable: mc: add comment regarding updating encoding stats of collection tombstones
sstables: metadata_collector: add update tombstone stats
sstables: assert that delete_time is not live when updating stats
sstables: move update_deletion_time_stats to metadata collector
sstables: metadata_collector: introduce update_local_deletion_time_and_tombstone_histogram
sstables: mc: write_liveness_info and write_collection should update tombstone_histogram
sstables: update_local_deletion_time for row marker deletion_time and expiration
Why default to an artificial minimum when you can do better
with zero effort? Track the actual minima in the memtable instead.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Update min ttl with expired_liveness_ttl (although its value of max int32
is not expected to affect the minimum).
Fixes #4041
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When the row flag has_complex_deletion is set, some collection columns may have
deletion tombstones and some may not. We don't strictly need to update stats
for the columns that have no tombstones, as they will not affect the
encoding_stats anyway.
Fixes #4035
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The worker is responsible for merging MVCC snapshots, which is similar
to merging sstables, but in memory. The new scheduling group will be
therefore called "memory compaction".
We should run it in a separate scheduling group instead of
main/memtables, so that it doesn't disrupt writes and other system
activities. It's also nice for monitoring how much CPU time we spend
on this.
If a memtable snapshot goes away after the memtable started merging into
the cache, it would enqueue the snapshots for cleaning on the memtable's
cleaner, which will have to clean without deferring when the memtable
is destroyed. That may stall the reactor. To avoid this, make merge()
cause the old instance of the cleaner to redirect to the new instance
(owned by cache), like we do for regions. This way the snapshots
mentioned earlier can be cleaned after memtable is destroyed,
gracefully.
Before this patch, maybe_merge_versions() had to be manually called
before a partition snapshot goes away. That is error-prone and makes
client code more complicated. Delegate that task to a new
partition_snapshot_ptr object, through which all snapshots are
published now.
As a preparation for the switch to the new cell representation this
patch changes the type returned by atomic_cell_view::value() to one that
requires explicit linearisation of the cell value. Even though the value
is still implicitly linearised (and only when managed by the LSA) the
new interface is the same as the target one so that no more changes to
its users will be needed.
Now all snapshots will have a mutation_cleaner which they will use to
gently destroy freed partition_version objects.
Destruction of memtable entries during cache update is also using the
gentle cleaner now. We need to have a separate cleaner for memtable
objects even though they're owned by cache's region, because memtable
versions must be cleared without a cache_tracker.
Each memtable will have its own cleaner, which will be merged with the
cache's cleaner when memtable is merged into cache.
Fixes some sources of reactor stalls on cache update when there are
large partition entries in memtables.
Partitions can get very large. Destroying them all at once can stall
the reactor for a significant amount of time. We want to avoid that by
doing destruction incrementally, deferring in between. A new API is
added for that at various levels:
stop_iteration clear_gently() noexcept;
It returns stop_iteration::yes when the object is fully cleared and
can be now destroyed quickly. So a deferring destruction can look like
this:
return repeat([this] { return clear_gently(); });
The reason why clear_gently() doesn't return a future<> itself is that some
contexts cannot defer, like memory reclamation.
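A minimal sketch of the idiom, assuming an illustrative container type (the real API spans many levels of the storage layer):

```cpp
#include <algorithm>
#include <deque>
#include <vector>

// Sketch: each clear_gently() call destroys a bounded amount of data and
// reports whether the object is now empty, so a caller can defer between
// calls instead of stalling the reactor on one huge destruction.
enum class stop_iteration { no, yes };

class partition_list {
    std::deque<std::vector<int>> _partitions; // stand-in for real partitions
    static constexpr size_t max_rows_per_call = 1024;
public:
    void add_partition(std::vector<int> rows) {
        _partitions.push_back(std::move(rows));
    }
    stop_iteration clear_gently() noexcept {
        size_t budget = max_rows_per_call;
        while (!_partitions.empty() && budget > 0) {
            auto& p = _partitions.front();
            size_t n = std::min(budget, p.size());
            p.erase(p.end() - n, p.end()); // drop up to `budget` rows
            budget -= n;
            if (p.empty()) {
                _partitions.pop_front();
            }
        }
        return _partitions.empty() ? stop_iteration::yes : stop_iteration::no;
    }
    bool empty() const { return _partitions.empty(); }
};
```

In a context that can defer, the caller would loop exactly as the commit message shows: repeat the call, yielding between iterations, until it returns `stop_iteration::yes`.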
We are now keeping track of the minimum timestamp in a memtable. Also
keep track of the max timestamp so we can know what it is before we
finish flushing the entire memtable to an SSTable. Will be used by
partially written SSTables undergoing TWCS.
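The bookkeeping itself is simple; a hedged sketch (simplified and detached from the real memtable stats):

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Sketch: every applied cell updates both bounds, so the memtable's
// timestamp range is known before the flush to an SSTable finishes.
class timestamp_tracker {
    int64_t _min = std::numeric_limits<int64_t>::max();
    int64_t _max = std::numeric_limits<int64_t>::min();
public:
    void update(int64_t ts) {
        _min = std::min(_min, ts);
        _max = std::max(_max, ts);
    }
    int64_t min_timestamp() const { return _min; }
    int64_t max_timestamp() const { return _max; }
};
```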
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"Previously, partition tombstone was not written for partitions with no
rows causing corrupted data files.
This is now fixed and covered with tests.
In addition, we now track partition tombstones while collecting encoding
statistics."
* 'projects/sstables-30/fix-partition-tombstone/v3' of https://github.com/argenet/scylla:
tests: Don't use deprecated schema constructor.
tests: Add tests to cover partitions consisting only of partition keys.
sstables: Make sure partition level tombstone is written for partitions with no rows.
memtable: Collect statistics from partition-level tombstone.
The switch to the new in-memory representation will require larger
parts of the logic to be aware of the types of the values they are dealing
with. In most cases this is not a significant burden for the users.
We keep track of all updates and store the minimal values of timestamps,
TTLs and local deletion times across all the inserted data.
These values are written as a part of serialization_header for
Statistics.db and used for delta-encoding values when writing Data.db
file in SSTables 3.0 (mc) format.
For #1969.
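A sketch of why the minima matter for delta encoding (illustrative types; the real serialization_header and varint encoding are more involved):

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Sketch: the header stores the minima seen across all inserted data,
// and each value in Data.db is then written as an unsigned delta from
// the minimum, which varint-encodes into fewer bytes.
struct encoding_stats {
    int64_t min_timestamp = std::numeric_limits<int64_t>::max();
    int32_t min_ttl = std::numeric_limits<int32_t>::max();

    void update(int64_t ts, int32_t ttl) {
        min_timestamp = std::min(min_timestamp, ts);
        min_ttl = std::min(min_ttl, ttl);
    }
};

inline uint64_t delta_encode_timestamp(const encoding_stats& s, int64_t ts) {
    return static_cast<uint64_t>(ts - s.min_timestamp);
}
```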
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Retryable code that has side effects is a recipe for bugs. This patch
reworks the snapshot reader so that the amount of logic run with
reclamation disabled is minimal and has very limited side effects.
Once that is added, also add a method to a memtable entry to calculate
its entire size. Right now we only have a method that calculates the
size minus the rows.
Signed-off-by: Glauber Costa <glauber@scylladb.com>