Commit Graph

524 Commits

Author SHA1 Message Date
Tomasz Grabiec
1828e28bbb database: Invalidate cache atomically with attaching streaming sstables
Not doing so may cause reads to see partial writes, if another
update+read happens in between.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
bbfa52822e row_cache: Switch readers to use per-entry snapshots
Currently readers are always using the latest snapshot. This is fine
for respecting write atomicity if partitions are fully continuous in
cache (now), but will break write atomicity once partial population is
allowed.

Consider the following case:

  flush write(ck=1), write(ck=2) -> snapshot_1
  cache reader 1 reads and inserts ck=1 @snapshot_1
  flush write(ck=1), write(ck=2) -> snapshot_2
  cache reader 2 reads and inserts ck=2 @snapshot_2

Because cache update is not atomic, it can happen that reader 2 will
complete while the partition hasn't been updated yet for snapshot_2.
In such case, after read 2 the partition would contain ck=1 from
snapshot_1 and ck=2 from snapshot_2. It will match neither of the
snapshots, and this could violate write atomicity.

To solve this problem we conceptually assign each partition key in the
ring to its current snapshot which it reflects. The update process
gradually converts entries in ring order to the new snapshot. Reads
will not be using the latest snapshot, but rather the current snapshot
for the position in the ring they are at.

There is a race between the update process and populating reads. Since
after the update all entries must reflect the new snapshot, reads
using the old snapshot cannot be allowed to insert data which can no
longer be reached by the update process. Before this patch this race
was prevented by the use of a phased_barrier, where readers would keep
phased_barrier::operation alive between starting a read of a partition
and inserting it into cache. Cache update was waiting for all prior
operations before starting the update. Any later read which was not
waited for would use the latest snapshot for reads, so the update
process didn't have to fix anything up for such reads.

After this change, later reads cannot always use the latest snapshot,
they have to use the snapshot corresponding to given entry. So it's
not enough for update() to wait for prior reads in order to prevent
stale populations. The (simple) solution implemented in this patch is
to detect the conflict and abandon population of given sub-range. In
general, reads are allowed to populate given range only if it belongs
to a single snapshot.

Note that the range here is not the whole query range. For population
of continuity, it is the range starting after the previous key and
ending after the key being inserted. When populating a partition
entry, the range is a singular range containing only the partition
key. Readers switch to new snapshots automatically as they move across
the ring. It's possible that the insertion of the partition doesn't
conflict, but continuity does. In such case the entry will be inserted
but continuity will not be set.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
c82c6ec6ed database: Allow obtaining snapshot_source for sstables 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
77f944880c cache: Remove support for wide partitions
This will be handled by row cache now.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Nadav Har'El
3018df11b5 Allow reading exactly desired byte ranges and fast_forward_to
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.

As result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).

This patch has two levels:

1. In the lower level, sstable::data_consume_rows(), which reads all
   partitions in a given disk byte range, now gets another byte position,
   "last_end". That can be the range's end, the end of the file, or anything
   in between the two. It opens the disk stream until last_end, which means
   1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is
   not allowed beyond last_end.

2. In the upper level, we add to the various layers of sstable readers,
   mutation readers, etc., a boolean flag mutation_reader::forwarding, which
   says whether fast_forward_to() is allowed on the stream of mutations to
   move the stream to a different partition range.

   Note that this flag is separate from the existing boolean flag
   streamed_mutation::fowarding - that one talks about skipping inside a
   single partition, while the flag we are adding is about switching the
   partition range being read. Most of the functions that previously
   accepted streamed_mutation::forwarding now accept *also* the option
   mutation_reader::forwarding. The exception are functions which are known
   to read only a single partition, and not support fast_forward_to() a
   different partition range.

   We note that if mutation_reader::forwarding::no is requested, and
   fast_forward_to() is forbidden, there is no point in reading anything
   beyond the range's end, so data_consume_rows() is called with last_end as
   the range's end. But if forwarding::yes is requested, we use the end of the
   file as last_end, exactly like the code before this patch did.

Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.

In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve of anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170619152629.11703-1-nyh@scylladb.com>
2017-06-19 18:31:32 +03:00
Avi Kivity
6e2c9ef9fb Revert "Allow reading exactly desired byte ranges and fast_forward_to"
This reverts commit 317d7fc253 (and also the
related 2c57ab84b2).  It causes crashes
during range scans, reported by Gleb:

"To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s
dataset and 3 node cluster.

Backtrace:
    at /home/gleb/work/seastar/seastar/core/apply.hh:36
    rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57
    range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142
    at ./seastar/core/future.hh:890
    at /home/gleb/work/seastar/seastar/core/future-util.hh:119
    at /home/gleb/work/seastar/seastar/core/future-util.hh:142
2017-06-18 16:10:21 +03:00
Avi Kivity
9cf6db3de5 Merge 2017-06-15 19:11:07 +03:00
Nadav Har'El
317d7fc253 Allow reading exactly desired byte ranges and fast_forward_to
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.

As result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).

This patch has two levels:

1. In the lower level, sstable::data_consume_rows(), which reads all
   partitions in a given disk byte range, now gets another byte position,
   "last_end". That can be the range's end, the end of the file, or anything
   in between the two. It opens the disk stream until last_end, which means
   1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is
   not allowed beyond last_end.

2. In the upper level, we add to the various layers of sstable readers,
   mutation readers, etc., a boolean flag mutation_reader::forwarding, which
   says whether fast_forward_to() is allowed on the stream of mutations to
   move the stream to a different partition range.

   Note that this flag is separate from the existing boolean flag
   streamed_mutation::fowarding - that one talks about skipping inside a
   single partition, while the flag we are adding is about switching the
   partition range being read. Most of the functions that previously
   accepted streamed_mutation::forwarding now accept *also* the option
   mutation_reader::forwarding. The exception are functions which are known
   to read only a single partition, and not support fast_forward_to() a
   different partition range.

   We note that if mutation_reader::forwarding::no is requested, and
   fast_forward_to() is forbidden, there is no point in reading anything
   beyond the range's end, so data_consume_rows() is called with last_end as
   the range's end. But if forwarding::yes is requested, we use the end of the
   file as last_end, exactly like the code before this patch did.

Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.

In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve of anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170614072122.13473-1-nyh@scylladb.com>
2017-06-15 13:22:46 +01:00
Gleb Natapov
ca812a8ea0 database: reset node's hit rate information on connection drop
Node may go down, so after it restarts cache hit rate info will be
incorrect and it can be overwhelmed with traffic until new and
up-to-date cache hit rate arrives. Solve this by dropping node's
information on connection reset, it is more accurate than relying on
gossip which may be slow and miss reboot of a node.
2017-06-13 09:57:14 +03:00
Gleb Natapov
0e4d5bc2f3 Store cluster wide cache hit statistics in CF 2017-06-13 09:57:14 +03:00
Gleb Natapov
69c5526301 messaging_service: return cache hit ratio as part of data read 2017-06-13 09:57:14 +03:00
Gleb Natapov
991ec4a16c periodically calculate avg cache hit rate between all shards
This patch adds new class cache_hitrate_calculator whose responsibility
is to periodically calculate average cache hit rates between all shards
for each CF.
2017-06-13 09:57:14 +03:00
Gleb Natapov
fab18c0c5a database: introduce cache_temperature class
The class will represent cache hit rate for a column family and is
serializable for use with RPC.
2017-06-13 09:57:14 +03:00
Calle Wilund
18806989b6 database: remove hard rp ordering requirement, set low rp mark on
truncate

With commitlog keeping use-count per CF id, we can ease the ordering
restriction on replay positiontion. Previously we required that all
added mutations have a position > previously flushed. However, if
we accept that replay must now be all data, by keeping track instead
per CF of highest RP ever entered, we can instead just set a
low mark on truncation, since this is the only remaining hard
RP divider.
2017-06-07 12:07:01 +00:00
Calle Wilund
2913241df1 memtable/commitlog: Change bookkeep to track individul segments
Use per CF-id reference count instead, and use handles as result of 
add operations. These must either be explicitly released or stored
(rp_set), or they will release the corresponding replay_position
upon destruction. 

Note: this does _not_ remove the replay positioning ordering requirement
for mutations. It just removes it as a means to track segment liveness.
2017-06-07 12:07:01 +00:00
Gleb Natapov
f5679e0416 database: remove remnants of no longer existing db::serializer.
Message-Id: <20170604100552.GD8248@scylladb.com>
2017-06-04 13:07:17 +03:00
Raphael S. Carvalho
4b4a1883aa refresh: do not use default priority for loading new sstables
Metadata is read using default priority class, which can significantly
slow down the process under high load. Compaction class can be used,
and if it turns out to be a problem, we can switch to a special class
for it.

Fixes #1859.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170517184546.17497-1-raphaelsc@scylladb.com>
2017-05-22 19:03:17 +03:00
Raphael S. Carvalho
a4e414cb3b database: reduce memory requirement to load sstables
SSTable load temporarily uses more space than needed to store metadata,
due to:
1) All components are read using read_simple() which uses 128k buffer.
file::dma_read_bulk() will allocate 128k, and may potentially allocate
another big buffer (128k - read) for file::read_maybe_eof().
2) read_filter() may use double the space it needs to.

Due to the fact that sstable loading parallelism is unlimited, Scylla
may require much more memory to load all sstables, and that may lead to
OOM. Higher the number of sstables higher the memory overhead.

To confirm this problem, I wrote a test[1] which loads 30k sstables in
parallel and reports the memory usage peak in the end.
When loading 30k sstables, each of which metadata is ~300kb, memory
usage peak was ~18G. When loading completed, only ~9GB were needed to
store all the metadata.
[1]: https://gist.github.com/raphaelsc/2db37b4fb34301833ab9eeed3b1a524d

To fix this problem, we need to set a limit on load parallelism (let's
start with a small number like 3 and adjust later if needed) and rely
on readahead so that the requirement drops considerably without
increasing boot time. Actually, boot time is improved by it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
2017-05-22 11:52:51 -03:00
Duarte Nunes
983af595e9 database: Read existing base mutations
When generating updates for a materialized view we need to read the
existing base row, to be able to determine the primary key of the view
row the new base update will supplant, in case the view includes a
base non-primary key column in its own primary key. That old view row
will be tombstoned or updated, if it exists, depending on the difference
between the new base row and the existing one, if any.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Calle Wilund
3514123677 database: Extract "remove" from "drop_columnfamily" 2017-05-10 16:44:48 +00:00
Calle Wilund
48ddcbb77b database/main: encapsulate system CF dir touching 2017-05-10 16:44:47 +00:00
Avi Kivity
8c5c5d3004 Merge "CQL front-end for secondary indices" from Pekka
"This patch series adds CQL front-end support for secondary indices. You
can now execute CREATE INDEX and DROP INDEX statements, which will
update the newly added "Indexes" system table. However, the indexes are
not actually backed up by anything nor are they available for CQL
queries. The feature is hidden behind a new cluster feature flag and
enabled only with the "--experimental" flag."

* 'penberg/cql-2i/v2' of github.com:cloudius-systems/seastar-dev: (34 commits)
  schema: Kill index_type enum
  schema: Kill index_info class
  cql3/statements/create_index_statement: Use database::existing_index_names() in validation
  cql3/statements: Use secondary index manager in alter_table_statement class
  index: Add secondary_index_manager
  thrift/handler: Use index_metadata
  db/schema_tables: Index persistence
  schema: Add all_indices() to schema class
  schema: Remove add_default_index_names() from schema_builder class
  db/schema_tables: Add system table for indices
  cql3/Cgl.g: DROP INDEX
  cql3/statements: Add drop_index_statement class
  database: Add find_indexed_table() to database class
  cql3: Return change event from announce_migration()
  cql3/statements: Multiple index targets for CREATE INDEX
  cql3/statements: Use index_metadata in create_index_statement class
  cql3/statements: Use feature flag in create_index_statement class
  service/storage_service: Add feature flag for secondary indices
  database: Add get_available_index_name() to database class
  schema: Add get_default_index_name() to index_metadata class
  ...
2017-05-08 17:04:40 +03:00
Pekka Enberg
f26b8d7afb database: Add find_indexed_table() to database class 2017-05-04 14:59:12 +03:00
Pekka Enberg
930fa79aff database: Add get_available_index_name() to database class 2017-05-04 14:59:11 +03:00
Pekka Enberg
c6e7d4484a database: Make existing_index_names() per-keyspace operation 2017-05-04 14:59:11 +03:00
Paweł Dziepak
24f4dcf9e4 db: make virtual dirty soft limit configurable
Message-Id: <20170428150005.28454-1-pdziepak@scylladb.com>
2017-04-30 19:17:22 +03:00
Raphael S. Carvalho
662fe77c11 database: kill column_family::start_rewrite
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:33 -03:00
Raphael S. Carvalho
cf45333588 database: implement new sstable resharding algorithm
NOTE: it's not wired yet.

Currently, a shared sstable is rewritten at all shards it belongs
to and only after that, it's deleted. With this new algorithm, a
shared sstable will be read only once and N unshared sstables
will be created, each of them with 1/N of the data. After it's
done, each owner shard will receive its new unshared sstable
replacing its ancestors.

Another benefit is that we'll no longer have resharding resulting
in number of sstables growing considerably after resharding.
A full-sized leveled sstable is usually 160MB, so after resharding,
we could have N files of 160MB/N. Now, leveled strategy will help
resharding. N adjacent sstables of same level will be resharded
together, so we'll end up with N files of N*160MB/N.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:30 -03:00
Raphael S. Carvalho
6513252e91 database: introduce function to replace new sstables by their ancestors
When resharding, we're working with sstables from all shards. So let's say
we're done with resharding of sstable A that belongs to shard 0 and 1 and
sstable B that belongs to shard 1 and 2. SStables were generated for
shards 0, 1, and 2. So shards 0, 1, and 2 need to load the new sstables
and remove the ancestors. Shard 1 for example will remove sstables A and
B (ancestors) and add the new one. Then it comes this new function.
We'll forward new sstables to their target shards using foreign sstable
open info.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:27 -03:00
Raphael S. Carvalho
c44a2319e6 prevent regular compaction from choosing shared sstables
For new resharding, it's important to exclude resharding sstables
from the list of candidates for regular compaction. That's doesn't
affect current resharding because it marks the sstables as
compacting. That won't work with new resharding which will work
with sstables from multiple shards.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:26 -03:00
Raphael S. Carvalho
405e41e9a8 database: export column family dir
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:19 -03:00
Raphael S. Carvalho
2b774c5bc3 database: inform if column family has shared tables
That's gonna be useful to quickly determine if it's worth resharding
a column family.

Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:17 -03:00
Amnon Heiman
7572addfbf column_family: metrics should be register once
column_family constructor uses delegation, as such, only the actual
constructor implementation should contain a call to register the
metrics.
Current implementation ends up with re registration of the metrics.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170320140817.22214-1-amnon@scylladb.com>
2017-03-21 11:20:29 +02:00
Duarte Nunes
876a514743 database: Upgrade mutation to current schema to push view updates
This patch ensures we upgrade the mutation to the current schema when
generating and pushing view updates, so that the it matches the most
up to date views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 18:15:27 +01:00
Duarte Nunes
bfb8a3c172 materialized views: Replace db::view::view class
The write path uses a base schema at a particular version, and we
want it to use the materialized views at the corresponding version.

To achieve this, we need to map the state currently in db::view::view
to a particular schema version, which this patch does by introducing
the view_info class to hold the state previously in db::view::view,
and by having a view schema directly point to it.

The changes in the patch are thus:

1) Introduce view_info to hold the extra view state;
2) Point to the view_info from the schema;
3) Make the functions in the now stateless db::view::view non-member;
4) Remove the db::view::view class.

All changes are structural and don't affect current behavior.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 15:50:05 +01:00
Paweł Dziepak
38c1501f4d db: make apply an execution stage 2017-03-09 09:27:43 +00:00
Avi Kivity
439b38f5ab Merge "Improvements to counter implementation" from Paweł
"This series adds various optimisations to counter implementation
(nothing extreme, mostly just avoiding unnecessary operations) as well
as some missing features such as tracing and dropping timed out queries.

Performance was tested using:
perf-simple-query -c4 --counters --duration 60

The following results are medians.
          before       after      diff
write   18640.41    33156.81    +77.9%
read    58002.32    62733.93     +8.2%"

* tag 'pdziepak/optimise-counters/v3' of github.com:cloudius-systems/seastar-dev: (30 commits)
  cell_locker: add metrics for lock acquisition
  storage_proxy: count counter updates for which the node was a leader
  storage_proxy: use counter-specific timeout for writes
  storage_proxy: transform counter timeouts to mutation_write_timeout_exception
  db: avoid allocations in do_apply_counter_update()
  tests/counters: add test for apply reversability
  counters: attempt to apply in place
  atomic_cell: add COUNTER_IN_PLACE_REVERT flag
  counters: add equality operators
  counters: implement decrement operators for shard_iterator
  counters: allow using both views and mutable_views
  atomic_cell: introduce atomic_cell_mutable_view
  managed_bytes: add cast to mutable_view
  bytes: add bytes_mutable_view
  utils: introduce mutable_view
  db: add more tracing events for counter writes
  db: propagate tracing state for counter writes
  tests/cell_locker: add test for timing out lock acquisition
  counter_cell_locker: allow setting timeouts
  db: propagate timeout for counter writes
  ...
2017-03-07 11:48:13 +02:00
Avi Kivity
1af9e3a5cb Merge "database: fix the 'nodetool clearsnapshot'" from Vlad
"Work on this series started with fixing the  'nodetool clearsnapshot'.
The current master code  ignores the snapshots in deleted keyspaces (issue #2045).

I noticed that in many places our code has to build the path to some directory/file
it simply had the sstring(<path1>) + "/" + sstring(<path2>) constructs which may cause us issues
if somebody decides to complile/run scylla on not-Unix-based OS, like Microsoft Windows.

I understand that this is a long shot but if we can make it right now - why not to.
The answer is boost::filesystem::path class - its synchronous parts, of course.

I decided to take an initiative and fix the issues above and then use the fixed code for
fixing the issue #2045:
   - Fix some minor issues in the existing code.
   - Extend the lister class and move it into the separate files outside database.cc.

On the way I've found an issue in the existing code (issue #2071).
This series fixes this one too (PATCH2)."
2017-03-06 16:45:31 +02:00
Paweł Dziepak
04b80272f2 cell_locker: add metrics for lock acquisition 2017-03-02 09:05:12 +00:00
Paweł Dziepak
277501f42f db: propagate tracing state for counter writes 2017-03-02 09:05:10 +00:00
Paweł Dziepak
25173f8095 db: propagate timeout for counter writes 2017-03-02 09:05:10 +00:00
Paweł Dziepak
f25fa6566f db: avoid deserialization when applying counter mutation
In the later stages of counter write path a mutation is produced that
already has all cells transformed to counter shards and can be applied
to the memtable and written to the commitlog.
The current interface expectes a frozen mutation, which is suboptimal
for counters. The freeze itself is unaviodable -- it is required by
commitlog, but we can avoid later deserialization of frozen_mutation
when it is applied to the memtable if we pass the unfrozen mutation
along.
2017-03-01 16:33:37 +00:00
Paweł Dziepak
426345e1d4 storage_proxy: avoid excessive mutation freezes 2017-03-01 16:33:36 +00:00
Paweł Dziepak
0198d8e470 Merge "Introduce streamed_mutation::fast_forward_to()" from Tomasz
"This introduces an API which allows forward navigation in a stream of mutation
fragments. It allows one to consume only a subset of the stream by iteratively
specifying sub-ranges from which fragments should be returned.

API outline:

  When in forwarding mode, the stream does not return all fragments right away,
  but only those belonging to the current range. Initially current range only
  covers the static row. The stream can be forwarded, even before reaching end-
  of-stream for current range, to a later range with fast_forward_to().
  Forwarding doesn't change initial restrictions of the stream, it can only be
  used to skip over data.

  Monotonicity of positions is preserved by forwarding. That is fragments
  emitted after forwarding will have greater positions than any fragments
  emitted before forwarding.

  For any range, all range tombstones relevant for that range which are present
  in the original stream will be emitted. Range tombstones emitted before
  forwarding which overlap with the new range are not necessarily re-emitted.

  When not in forwarding mode, the stream acts as if the current range was equal
  to the full range. This implies that fast_forward_to() cannot be
  used.

  Whether stream is in forwarding mode or not is specified when the stream
  is created, typically via mutation_source interface.

What's left for later series:

  Optimization by providing specialized implementations. This series implements
  forwarding support in all mutation sources via generic wrapper which simply
  drops fragments."

* tag 'tgrabiec/clustering-fast-forward-to-v2' of github.com:scylladb/seastar-dev:
  tests: mutation_source_tests: Verify monotonicty of positions
  tests: random_mutation_generator: Spread the keys more
  tests: mutation_source_test: Make blobs more easily distinguishable
  tests: streamed_mutation: Test that merged stream passes mutation source tests
  tests: mutation_source_test: Add tests for forwarding of streamed_mutation
  tests: streamed_mutation_assertions: Add methods for navigating the stream
  tests: Add range generators to random_mutation_generator
  partition_slice_builder: Add with_ranges()
  query: Introduce full_clustering_range
  streamed_mutation: Add non-owning variant of mutation_from_streamed_mutation()
  db: Enable creating forwardable readers via mutation_source
  mutation_source: Document liveness requirements
  mutation_source: Cleanup
  db: Replace virtual_reader_type with mutation_source_opt
  partition_version: Refactor make_partition_snapshot_reader() overloads
  database: Fix mutation_source created by as_mutation_source() to not ignore trace_state_ptr
  memtable: Accept all mutation_source parameters
  streamed_mutation: Implement fast_forward_to() in stream merger
  streamed_mutation: Add generic implementation of forwardable streamed_mutation
  streamed_mutation: Add fast_forward_to() API
  position_in_partition: Introduce position_range
  position_in_partition: Introduce position constructor for right after the static row
  streamed_mutation: Make cast to view non-explicit
  streamed_mutation: Make schema() getter non-copying
2017-02-24 10:37:51 +00:00
Tomasz Grabiec
892d4a2165 db: Enable creating forwardable readers via mutation_source
Right now all mutation source implementations will use
make_forwardable() wrapper.
2017-02-23 18:50:44 +01:00
Tomasz Grabiec
586dbaa8d3 db: Replace virtual_reader_type with mutation_source_opt
Virtual reader is a mutation_source.
2017-02-23 18:23:52 +01:00
Tomasz Grabiec
f46ae8128d database: Fix mutation_source created by as_mutation_source() to not ignore trace_state_ptr
It was using the state passed via as_mutation_source() instead. Let's
respect mutation_source contract instead, and use the state passed via
mutation_source invocation.

Technically just a cleanup. Alse prerequisite for more cleanup.
2017-02-23 18:23:52 +01:00
Paweł Dziepak
359c617821 db: restore call to check_valid_rp()
5a0955e89d "db: add operations for
applying counter updates" merged two column_family::apply() overloads
into do_apply() in order to reduce code duplication. Unfortunately,
a call to check_valid_rp() didn't survive that change.
Message-Id: <20170221133800.30411-1-pdziepak@scylladb.com>
2017-02-21 15:26:04 +01:00
Vlad Zolotarov
978241d473 database: move lister class into separate files
Move lister class away from database.cc.
This is a preparation for moving it to the seastar library.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-02-17 17:50:40 -05:00
Vlad Zolotarov
34cafa71c3 database: make 'clearsnapshot' to delete the snapshots of deleted keyspaces if requested
The current implementation of 'nodetool clearsnapshot' command only deletes the snapshots
of the keyspaces that are alive at the time the command is issued (issue #2045).

This, besides not implementing the spec, prevents users from being able to clear
the disk space occupied by snapshots of deleted keyspaces that are no longer needed
(e.g. snapshots created when KS is deleted).

This patch fixes this issue by making the database::clear_snapshot() scan the data directories
looking for the snapshots to be deleted instead of relying on in-memory data structures.

This patch makes column_family::clear_snapshot() method not needed any more.

Fixes #2045

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-02-17 17:50:40 -05:00