Commit Graph

140 Commits

Author SHA1 Message Date
Glauber Costa
7e3093709a backlog: add level to write progress monitor
For SSTables being written, we don't know their level yet. Add that
information to the write monitor. New SSTables will always be at L0.
Compacted SSTables will have their level determined by the compaction
process.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-31 21:09:38 -04:00
Glauber Costa
b573a2ff61 backlog: keep track of maximum timestamp in write monitor
For sealed SSTables we can get the maximum timestamp from the statistics
component.  But for partially written SSTables, the metadata is not yet
available.

One way to solve this would be to make the SSTable statistics available
earlier. But we would end up with a maximum timestamp that potentially
changes all the time as we write more cells.

A better approach is to take note of what's the maximum timestamp in a
memtable before we start to flush, and when time comes for us to flush
we will use the progress manager to inform the consumers about the
maximum timestamp.

For SSTables being compacted, we can't know for sure what is the maximum
timestamp as some entries could be TTLd already. But the maximum of all
SSTables present in the compaction is a good enough estimation for this
purposes.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 12:55:58 -04:00
Piotr Sarna
fe02c3d0e2 database, sstables, tests: add large_partition_handler
This commit makes database, sstables and tests aware
of which large_partition_handler they use.
Proper large_partition_handler is retrievable from config information
and is based on existing compaction_large_partition_warning_threshold_mb
entry. Right now CQL TABLE variant of large_partition_handler is used
in the database.

Tests use a NOP version of large_partition_handler, which does not
depend on CQL queries at all.
2018-05-04 14:38:13 +02:00
Vladimir Krivopalov
15ef4ca73c Support for writing SSTables 3.0 ('mc') Data.db and Index.db files - rows only.
This fix adds functionality for writing data in 'mc' format to Data.db
file according to the SSTables 3.0 data format as described at https://github.com/scylladb/scylla/wiki/SSTables-3.0-Data-File-Format
and Index.db file according to the specification at https://github.com/scylladb/scylla/wiki/SSTables-3.0-Index-File-Format

The following cases are not supported yet:
  - writing counter cells
  - range tombstones

In Index.db, end open markers are not written since range tombstones are not
supported for data files yet.

For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 14:34:20 -07:00
Raphael S. Carvalho
11940ca39e sstables: Fix bloom filter size after resharding by properly estimating partition count
We were feeding the total estimation partition count of an input shared
sstable to the output unshared ones.

So sstable writer thinks, *from estimation*, that each sstable created
by resharding will have the same data amount as the shared sstable they
are being created from. That's a problem because estimation is feeded to
bloom filter creation which directly influences its size.
So if we're resharding all sstables that belong to all shards, the
disk usage taken by filter components will be multiplied by the number
of shards. That becomes more of a problem with #3302.

Partition count estimation for a shard S will now be done as follow:
    //
    // TE, the total estimated partition count for a shard S, is defined as
    // TE = Sum(i = 0...N) { Ei / Si }.
    //
    // where i is an input sstable that belongs to shard S,
    //       Ei is the estimated partition count for sstable i,
    //       Si is the total number of shards that own sstable i.

Fixes #2672.
Refs #3302.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180423151001.9995-1-raphaelsc@scylladb.com>
2018-04-23 18:11:20 +03:00
Avi Kivity
03c22ad524 Merge "Support for Cassandra 2.2 (LA) SSTable formats" from Daniel
"
These patches add support for C* 2.2 file(name) format.

Namely:
  * It forces Scylla to write files in la format.
  * Adds storage-service feature for them.
  * cf and ks are determined from directory, not from file-name (for 2.2 format).
  * Adds some other fixes to make dtest happy.
  * Unit tests work with la format or with both formats.
"

* 'danfiala/filename-format-2.2-v4' of https://github.com/hagrid-the-developer/scylla:
  tests/sstables: Tests use la format or iterate over both formats.
  tests/sstables: Helper functions support 2.2 format directory structure.
  stables: Use 2.2 (la) format as a default format to store sstables if it is enabled by feature-bits.
  storage_service: Support la sstable storage format as a feature.
  sstables: make_descriptor accepts sstable-directory, because it is necessary to determine cf and ks in 2.2 format.
  sstables: Throw more detail exception for unknown item in reverse_map.
  sstables/compaction: Suppress NaN in a report of a throughput.
2018-03-19 17:49:44 +02:00
Daniel Fiala
c5eca593fc sstables/compaction: Suppress NaN in a report of a throughput.
* It causes failures in dtest.

Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-18 05:46:32 +01:00
Raphael S. Carvalho
aa75684ee7 sstables: Warn when an extra-large partition is written
Based on https://issues.apache.org/jira/browse/CASSANDRA-9643

For compaction_large_partition_warning_threshold_mb option set to 1,
follow an example output:

WARN  2018-02-22 19:52:11,029 [shard 0] sstable - Writing large
row system/local:{key: pk{00056c6f63616c}, token:-7564491331177403445}
(1276758 bytes)

Fixes #2209.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180306175912.19259-1-raphaelsc@scylladb.com>
2018-03-07 15:49:46 +00:00
Glauber Costa
956af9f099 database, main: set up scheduling_groups for our main tasks
Set up scheduling groups for streaming, compaction, memtable flush, query,
and commitlog.

The background writer scheduling group is retired; it is split into
the memtable flush and compaction groups.

Comments from Glauber:

This patch is based in a patch from Avi with the same subject, but the
differences are signficant enough so that I reset authorship. In
particular:

1) A bug/regression is fixed with the boundary calculations for the
   memtable controller sampling function.
2) A leftover is removed, where after flushing a memtable we would
   go back to the main group before going to the cache group again
3) As per Tomek's suggestion, now the submission of compactions
   themselves are run in the compaction scheduling group. Having that
   working is what changes this patch the most: we now store the
   scheduling group in the compaction manager and let the compaction
   manager itself enforce the scheduling group.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Avi Kivity
641aaba12c database, sstables, compaction: convert use of thread_scheduling_group to seastar cpu scheduler
thread_scheduling_groups are converted to plain scheduling_group. Due to
differences in initialization (scheduling_group initializtion defers), we
create the scheduling_groups in main.cc and propagate them to users via
a new class database_config.

The sstable writer loses its thread_scheduling_group parameter and instead
inherits scheduling from its caller.

Since shares are in the 1-1000 range vs. 0-1 for thread scheduling quotas,
the flush controller was adjusted to return values within the higher ranges.
2018-02-07 17:19:29 -05:00
Duarte Nunes
cbbdfde979 sstables/compaction_backlog_tracker: Constify backlog()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180111004914.25796-1-duarte@scylladb.com>
2018-01-11 13:20:57 +02:00
Glauber Costa
ca284174d0 infrastructure for backlog estimator for compaction work.
This patch adds infrastucture in various points in the system to allow
us to determine the amount of work present as backlog from compactions.

What needs to be done can be explained in three major pieces:

1) Add hooks in the points where sstables are added or inserted to a
   column family (or more precisely, to a compaction_strategy object).

2) Add hooks in reads and write monitors that allows a compaction
   backlog estimator (tracker) to become aware of bytes that are
   partially written and compacted away.

3) Add a per-column family class (compaction_backlog_tracker) that
   can be used to track work that is done and relevant to compactions
   (like the two above), and a compaction manager to provide a
   system-wide backlog based on the response of the individual trackers.

The definition of how much backlog one has is strategy-specific. The
Null strategy is easy, as it never really has any backlog, and so is the
major strategy - since what it really matters is the backlog of the
underlying compaction strategy.

Although backlogs are strategy-specific, they should be "compatible", in
the sense that if a particular strategy has more work to do, it should
yield a higher number than its counterparts.

All the others are presented in this patch as unimplemented: they will
always advertise a mild backlog that should yield a constant
CPU-utilization if used alone.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
d4109ebb80 compaction: control destruction of readers
Compactions run from a seastar::thread, in run(). They will either fail
or succeed, and from the point of view of ordering of destruction
between the compaction object and its readers:

- if compaction succeed, we have no control over who gets destructed
  first since both objects will be going out of scope.
- if they fail, we will forceably destruct the compaction object, at
  which point the readers are still alive

From the point of view of lifetime management, it would be nice to make
sure that the compaction object outlives whichever other objects it
needs during compaction.

This nice to have will become paramount when we start adding
read_monitors to the compaction object, that have to, themselves outlive
the readers.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:06 -05:00
Raphael S. Carvalho
eff62bc61e compaction_manager: serialize compaction of same size tier for different cfs
Currently, compaction manager will serialize compaction of same size tier
(or weight) if they belong to the same column family. However, it fails to
do so if the compaction jobs belong to different column families.
That can lead to an ungodly amount of running compaction which gets worse
the higher the number of shards and active column families. The problem
is that it may affect overall system performance due to excessive resource
usage. It's easy to trigger it during bootstraping after loading node with
new sstables or repairing, or if lots of cfs are being actively written.

That being said, compaction jobs of same size tier are now serialized
on a given shard, such that maximum number of compaction (system wise)
is now:
(SHARDS) * (SIZE TIERS)
instead of:
(SHARDS) * (COLUMN FAMILIES) * (SIZE TIERS)

We'll work hard to release a size tier (weight) for a column family
waiting on it as fast as possible, given that we wouldn't like to
underutilize resources available for compaction. We want one starting
after the other. Compaction for a column family that cannot run now
because the size tier is taken, will be postponed. There's a worker
that will be sleeping on a condition variable that will be signalled
whenever a compaction completes. FIFO ordering is used on postponed
list for fairness.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-17 17:42:48 -02:00
Raphael S. Carvalho
49f3cfe746 sstables: improve compact_sstables() interface
Motivation is that a new field in the descriptor will be forwarded
to compaction procedure without extending parameter list even more.
Also beautifies the interface, making it concise and easier to
play with.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-17 17:22:19 -02:00
Paweł Dziepak
73b3d02cc0 db: make make_range_sstable_reader() return flat reader 2017-12-13 12:01:03 +00:00
Paweł Dziepak
e12959616c db: make column_family::make_sstable_reader() return a flat reader 2017-12-13 12:01:03 +00:00
Avi Kivity
d934ca55a7 Merge "SSTable resharding fixes" from Raphael
"Didn't affect any release. Regression introduced in 301358e.

Fixes #3041"

* 'resharding_fix_v4' of github.com:raphaelsc/scylla:
  tests: add sstable resharding test to test.py
  tests: fix sstable resharding test
  sstables: Fix resharding by not filtering out mutation that belongs to other shard
  db: introduce make_range_sstable_reader
  rename make_range_sstable_reader to make_local_shard_sstable_reader
  db: extract sstable reader creation from incremental_reader_selector
  db: reuse make_range_sstable_reader in make_sstable_reader
2017-12-07 16:42:48 +02:00
Raphael S. Carvalho
bad21ba444 sstables: Fix resharding by not filtering out mutation that belongs to other shard
After 301358e, sstable resharding stopped work because shared sstables would
use a filtering reader, which excludes mutation that belong to other shards.
That completely breaks which relies on compaction of mutations that belong
to different shards. The fix is about using recently introduced non local
shard reader.

Fixes #3041.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 03:15:26 -02:00
Raphael S. Carvalho
d1b146baa6 rename make_range_sstable_reader to make_local_shard_sstable_reader
Tomek says:
"I think that the least surprising behavior for a function named like this
is to read the sstables unfiltered (it just reads them), and the filtering
should be indicated specially in the name or by accepting a parameter."

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 03:15:25 -02:00
Raphael S. Carvalho
809b30c4a2 sstables/compaction: do not actually compact fully expired sstables
There's no need to actually compact a sstable which is fully expired
and which deletion of all its data will not ressurect older data.
For that, a sstable will only be considered fully expired if it
doesn't contain data newer than its overlapping counterparts.
That way, there could be a false negative, but never a false positive.
Currently, a fully expired sstable would unnecessarily waste read
bandwidth of disk. This will help a lot time series workloads in
which data for a given time window is all deleted at once using TTL.

Fixes #2620.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:33 -02:00
Raphael S. Carvalho
d2ab154f12 sstables: switch to const ref wherever possible
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:33 -02:00
Raphael S. Carvalho
d916c8cdad sstables: use gc_clock::time_point for gc_before
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:33 -02:00
Raphael S. Carvalho
45c11865fa sstables: change return value type of get_fully_expired_sstables
unordered_set will allow us to quickly extract fully expired tables
from a set of compacting sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 18:45:55 -02:00
Paweł Dziepak
b64dd21751 sstables: convert compaction to flat_mutation_reader 2017-11-23 18:14:31 +00:00
Duarte Nunes
baeec0935f Replace query::full_slice with schema::full_slice()
query::full_slice doesn't select any regular or static columns, which
is at odds with the expectations of its users. This patch replaces it
with the schema::full_slice() version.

Refs #2885

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>
2017-10-17 11:25:53 +02:00
Botond Dénes
47e07b787e restricted_mutation_reader: restrict based-on memory consumption
Restrict readers based on their memory consumption, instead of the count
of the top-level readers. To do this an interposer is installed at the
input_stream level which tracks buffers emmited by the stream. This way
we can have an accurate picture of the readers' actual memory
consumption.
New readers will consume 16k units from the semaphore up-front. This is
to account their own memory-consumption, apart from the buffers they
will allocate. Creating the reader will be deferred to when there are
enough resources to create it. As before only new readers will be
blocked on an exhausted semaphore, existing readers can continue to
work.
2017-10-03 12:44:12 +03:00
Avi Kivity
78eae8bf48 Revert "Merge "Make restricting_mutation_reader more accurate" from Botond"
This reverts commit c6e5dcc556, reversing
changes made to 19b21a0ab2. Failes to build,
plus author has more changes.
2017-10-03 11:58:59 +03:00
Avi Kivity
c6e5dcc556 Merge "Make restricting_mutation_reader more accurate" from Botond
"Currently restricting_mutation_reader restricts mutation_readears on a
count basis. This is inaccurate on multiple levels. The reader might be
a combined_mutation_reader, which might be composed of multiple
individual readers, whose number might change during the lifetime of the
reader. The memory consumption of the readers can vary and may change
during the lifetime of the reader as well.
To remedy this, make the restriction memory-consumption based. The
restricting semaphore is now configured with the amound of memory
(bytes) that its readers are allowed to consume in total. New readers
consume 128k units up-front to account for read-ahead buffers, and then
consume additional units for any buffer (returned
from input_stream<>::read()) they keep around.
Like before, readers already allowed to read will not be blocked,
instead new readers will be blocked on their first read if all the units
all consumed."

Fixes #2692.

* 'bdenes/restricting_mutation_reader-v4' of https://github.com/denesb/scylla:
  Update reader restriction related metrics
  Add restricted_reader_test unit test
  restricted_mutation_reader: restrict based-on memory consumption
  mutation_reader.hh: Move restricted_reader related code
2017-10-03 11:15:34 +03:00
Raphael S. Carvalho
e34c1db642 db: update compaction history outside the sstable write lock
The reason to do that is because compaction can deadlock if refresh
disables write which waits for compaction, and compaction in turn
waits for dirty memory[1] that would be released by memtable write.

Dirty memory manager for non-system cfs was being used for system cfs,
which was useful for exposing this problem.

[1]: when updating compaction history.

Fixes #2769.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170918215238.9810-2-raphaelsc@scylladb.com>
2017-09-26 19:51:12 +02:00
Botond Dénes
33e97e7457 restricted_mutation_reader: restrict based-on memory consumption
Restrict readers based on their memory consumption, instead of the count
of the top-level readers. To do this an interposer is installed at the
input_stream level which tracks buffers emmited by the stream. This way
we can have an accurate picture of the readers' actual memory
consumption.
New readers will consume 16k units from the semaphore up-front. This is
to account their own memory-consumption, apart from the buffers they
will allocate. Creating the reader will be deferred to when there are
enough resources to create it. As before only new readers will be
blocked on an exhausted semaphore, existing readers can continue to
work.
2017-09-20 11:14:35 +03:00
Avi Kivity
9b540eccb0 database: remove dependency on compaction.hh and compaction_manager.hh 2017-09-11 20:09:45 +03:00
Glauber Costa
db846326f8 compaction: remove dead code
This code has no more users. Bury it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170908005305.29925-1-glauber@scylladb.com>
2017-09-08 08:17:15 +02:00
Botond Dénes
611774b1d9 Use the incremental reader for compaction
As leveled compaction strategy stands to gain the most from
incrementally opening sstables.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <292648d3fa4ea97376c0b4360754a20132194f63.1502822066.git.bdenes@scylladb.com>
2017-08-15 21:38:04 +03:00
Duarte Nunes
7fb6a74302 combined_mutation_reader: Drop exhausted readers if not in FF mode
Exhausted readers can be fast forwarded, so we have to keep them
around. However, if the current reader is not fast forwardable, then
we can drop those readers and their buffers.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-14 14:37:27 +02:00
Botond Dénes
94fc550e68 sstable_set::incremental_selector: select() now returns a selection
A seletion contains - in addition to the list of sstables - a next_token
which is a hint as to what is the next best token to call select() with.
This should be the smallest token such that at the next call to
select() the least number of new sstables will be returned, without
skipping any.
2017-08-09 16:27:33 +03:00
Glauber Costa
4f01ec0910 restrict background writers to 50 % of CPU.
In scylla, we have foreground processes, which are latency sensitive and
need to be responded to as fast as possible in order to maintain good
latency profiles, and background process, which are less so.

The most important background processes we have during normal write
workload operations are memtable writes and sstable compactions. Those
processes are quite CPU-intensive, and left unchecked will easily
dominate the CPU. Lower values of task-quota usually help, as it will
force those processes to preempt more, but aren't enough to guarantee
good isolation. We have seen boxes with good NVMe storage having their
throughput reduced to less than half of the original baseline in a short
dive down for the duration of a compaction.

In the long run, our goal is to leverage the CPU scheduler to make sure
that those processes are balanced with respect to all the others.
However, the current state of affairs is causing grievances as this very
moment. Thankfully, those processes live in a seastar::thread, that
ships with its own rudimentary bandwidth control mechanism: the
scheduling group.

The goal of this patch is to wrap background processes together in a
scheduling group, and assign to such group 50 % of our CPU power; the
remainder being left to foreground processes.

While we pride ourselves in dynamically adjusting things to the
workload, we won't be able to do this properly before the CPU scheduler
lands - and let's face it, leaving background processes run wild is not
adaptative either. Every workload would benefit most from a different
value for such shares, but 50 % is as fair as it gets if we really need
static partitining in the mean time.

As a defense against unforeseen consequences, we'll leave the actual
value as an option, but will do our best to hide it - as this is not a
tunable that we want to be part of a normal Scylla setup. The most
convenient place for this tunable is still db::config, so we can easily
pass it down to the database layer - but we will not document it in the
yaml, and will clearly note in the help string that it is not supposed
to be tuned.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-07-18 23:35:33 -04:00
Raphael S. Carvalho
4351e0a996 compaction: introduce new compaction type for reshard
so now user can look at nodetool compactionstats and determine
whether or not resharding is running, for example:
$ ./bin/nodetool compactionstats
pending tasks: 3
       id   compaction type   keyspace                table   completed   total   unit   progress
   <none>           RESHARD     system   compaction_history          11     256   keys      4.30%
   <none>           RESHARD     system   compaction_history           2     256   keys      0.78%
   <none>           RESHARD     system   compaction_history          10     256   keys      3.91%
   <none>           RESHARD     system   compaction_history           8     256   keys      3.12%
   <none>           RESHARD     system   compaction_history           7     256   keys      2.73%

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170620175733.25882-1-raphaelsc@scylladb.com>
2017-06-22 14:48:38 +03:00
Raphael S. Carvalho
41137c7fb6 compaction: use sstable::bytes_on_disk for calculating start and end size
Currently, start and end size of compaction are calculated using the
uncompressed size of data component. bytes_on_disk() returns size
used by all components.

NOTE: start and end size are written to compaction history, so users
who monitor it should be aware of this change.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525212129.6758-1-raphaelsc@scylladb.com>
2017-05-28 11:33:24 +03:00
Avi Kivity
ebaeefa02b Merge seatar upstream (seastar namespace)
- introcduced "seastarx.hh" header, which does a "using namespace seastar";
 - 'net' namespace conflicts with seastar::net, renamed to 'netw'.
 - 'transport' namespace conflicts with seastar::transport, renamed to
   cql_transport.
 - "logger" global variables now conflict with logger global type, renamed
   to xlogger.
 - other minor changes
2017-05-21 12:26:15 +03:00
Raphael S. Carvalho
ddc1d80c28 compaction: remove dead function declaration
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170504013046.23522-2-raphaelsc@scylladb.com>
2017-05-04 11:48:51 +03:00
Raphael S. Carvalho
61229ab88c compaction: fix type for cleanup
After compaction revamp, compaction type set by cleanup at its ctor
is being overwritten at compaction::setup(). Consequently, cleanup
would not be stopped by 'nodetool stop cleanup' and cleanup would
be listed as regular compaction in 'nodetool compactionstats'.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170504013046.23522-1-raphaelsc@scylladb.com>
2017-05-04 11:48:50 +03:00
Raphael S. Carvalho
3071b9052a compaction: make cleanup_compaction inherit from regular_compaction
Some fields that belong to regular and cleanup aren't needed for
resharding_compaction, such as incremental selector (which is used
for determining max purgeable timestamp for a given decorated key)
Better move those fields to regular and make cleanup inherit from
regular compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170428195611.9196-1-raphaelsc@scylladb.com>
2017-04-30 19:37:09 +03:00
Raphael S. Carvalho
687a4bb0c2 dtcs: do not compact fully expired sstable which ancestor is not deleted yet
Currently, fully expired sstable[1] is unconditionally chosen for compaction
by DTCS, but that may lead to a compaction loop under certain conditions.

Let's consider that an almost expired sstable is compacted, and it's not
deleted yet, and that the new sstable becomes expired before its ancestor is
deleted.
Because this new sstable is expired, it will be chosen by DTCS, but it will
not be purged because 'compacted undeleted' sstables are taken into account
by calculation of max purgeable timestamp and prevents expired data from
being purged. The problem is that this sequence of events can keep happening
forever as reported by issue #2260.
NOTE: This problem was easier to reproduce before improvement on compaction
of expired cells, because fully expired sstable was being converted into a
sstable full of tombstones, which is also considered fully expired.

Fixes #2260.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170428233554.13744-1-raphaelsc@scylladb.com>
2017-04-30 19:35:46 +03:00
Duarte Nunes
4e693383f7 mutation_partion: Use row_tombstone
This patch replaces the current row tombstone representation by a
row_tombstone.

The intent of the patch is thus to reify the idea of shadowable
tombstones, that up until now we considered all materialized view row
tombstones to be.

We need to distinguish shadowable from non-shadowable row tombstones
to support scenarios such as, when inserting to a table with a
materialzied view:

1. insert into base (p, v1, v2) values (3, 1, 3) using timestamp 1
2. delete from base using timestamp 2 where p = 3
3. insert into base (p, v1) values (3, 1) using timestamp 3

These should yield a view row where v2 is definitely null, but with
the current implementation, v2 will pop back with its value v2=3@TS=1,
even though its dead in the base row. This is because the row
tombstone inserted at 2) is a shadowable one.

This patch only addresses the memory representation of such
row_tombstones.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:46:33 +02:00
Raphael S. Carvalho
0127309820 sstables: extend compaction for new resharding
Extends compaction for new resharding algorithm. Not wired yet.
New resharding will compact shared sstable(s) and create one
sstable for each owner. It's up to the caller to open these
new unshared sstables at their respective column families.

This new approach will save a lot of bandwidth because we'll
no longer read the entire shared sstable #smp::count times.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:08 -03:00
Raphael S. Carvalho
2a437ab427 compaction: rework compacting_sstable_writer to work with multiple writers
compacting_sstable_writer only allowed one writer so far, but we will need
multiple ones for resharding.
It's done by moving writer management to compaction.
finish_sstable_writer() is added for compaction impl to stop all writers,
whereas stop_sstable_writer() will only stop current writer (needed when
current sstable reaches max limit size for example).

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:05 -03:00
Raphael S. Carvalho
a35a3a9647 compaction: prepare compacting_sstable_writer to work with writers
No need for compacting_sstable_writer to store items that are available
in compaction class. Also, that's a step towards supporting multiple
writers for compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:03 -03:00
Raphael S. Carvalho
38ed83e2f7 sstables: rework compaction to make it easy to extend
compact_sstables() supported both regular and cleanup compaction,
but with lots of conditions that made it ugly and hard to extend.

In the future, we want to introduce a new type of compaction for
resharding that will create one sstable for every shard owning
the sstable(s) given as input. That will be easier now.

Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:02 -03:00
Tomasz Grabiec
124dde30db sstables: Extract writer parameters into config objects
Also enables users to change the default promoted index block size.
2017-03-10 14:42:22 +01:00