Commit Graph

1253 Commits

Author SHA1 Message Date
Avi Kivity
d973445a94 Merge "sstable/schema extensions" from Calle
"
Adds extension points to schema/sstables to enable hooking in
stuff, like, say, something that modifies how sstable disk io
works. (Cough, cough, *encryption*)

Extensions are processed as property keywords in CQL. To add
an extension, a "module" must register it into the extensions
object at boot time. To avoid globals (and yet don't),
extensions are reachable from config (and thus from db).

Table/view tables already contain an extension element, so
we utilize this to persist config.

schema_tables tables/views from mutations now require a "context"
object (currently only extensions, but abstracted for easier
further changes).

Because of how schemas currently operate, there is a super
lame workaround to allow "schema_registry" access to config
and by extension extensions. DB, upon instantiation, calls
a thread local global "init" in schema_registry and registers
the config. It, in turn, can then call table_from_mutations
as required.

Includes the (modified) patch to encapsulate compression
into objects, mainly because it is nice to encapsulate, and
isolate a little.
"

* 'calle/extensions-v5' of github.com:scylladb/seastar-dev:
  extensions: Small unit test
  sstables: Process extensions on file open
  sstables::types: Add optional extensions attribute to scylla metadata
  sstables::disk_types: Add hash and comparator(sstring) to disk_string
  schema_tables: Load/save extensions table
  cql: Add schema extensions processing to properties
  schema_tables: Require context object in schema load path
  schema_tables: Add opaque context object
  config_file_impl: Remove ostream operators
  main/init: Formalize configurables + add extensions to init call
  db::config: Add extensions as a config sub-object
  db::extensions: Configuration object to store various extensions
  cql3::statements::property_definitions: Use std::variant instead of any
  sstables: Add extension type for wrapping file io
  schema: Add opaque type to represent extensions
  sstables::compress/compress: Make compression a virtual object
2018-02-26 17:15:29 +02:00
Duarte Nunes
e75f7c41d9 Merge 'Proper clean-up on closing index_reader' from Vladimir
With the changes introduced in #2981 and #3189, the lifetime management
of the objects used by index_reader became more complicated.
This patchset addresses the immediate problems caused by lack of proper
handling.

The more holistic approach to this will take more time and is to be
implemented under #3220. The current fix, however, should be good
enough as a stop-gap solution.

* 'issues/3213/v3' of https://github.com/argenet/scylla:
  Close promoted index streams when closing index_readers.
  Support proper closing of prepended_input_stream.
2018-02-21 01:02:16 +00:00
Vladimir Krivopalov
c996191411 Close promoted index streams when closing index_readers.
Promoted index input streams must be explicitly closed when closing the
index_reader in order to ensure all the pending read-aheads are
completed.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-02-20 16:04:15 -08:00
Vladimir Krivopalov
8d52d809f7 Support proper closing of prepended_input_stream.
When the stream is being closed, the call is forwarded to the stored
data_source.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-02-20 16:04:05 -08:00
Vladimir Krivopalov
721bd3eef6 Added missing 'override' to skip() in buffer_input_stream and prepended_input_stream.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <4e91bead8de7f6fa9b3bfdab8bda73efdb22749d.1519152303.git.vladimir@scylladb.com>
2018-02-20 19:49:11 +00:00
Avi Kivity
87f10bc853 sstables: continuous_data_consumer: make _remain an unsigned type
All of the adjustments to _remain already ensure it is greater than 0,
and indeed a negative _remain doesn't make sense.

Switching to an unsigned type allows us to re-enable -Wsign-compare.

Tests: unit (release)
Message-Id: <20180212121636.10463-1-avi@scylladb.com>
2018-02-12 12:25:21 +00:00
Avi Kivity
55168592ad compaction_manager: fix use-after-free of column_family
Commit cce1a2bce8 ("Use the CPU scheduler")
placed some compaction manager code in a scheduling_group. Unfortunately,
downstream code relied on the callers not deferring, so it can rely
on the column_family's existence. That doesn't happen if the column_family
is removed quickly, as with_scheduling_group() always defers.

Fix by applying the scheduling group after we've taken the lock and guaranteed
the stability of the column_family object.

Fixes #3196.
Message-Id: <20180211165155.18179-1-avi@scylladb.com>
2018-02-11 17:53:35 +00:00
Vladimir Krivopalov
71495691aa Use separate shared_index_lists per sstable_mutation_reader instead of a single one per sstable.
With the changes introduced in #2981, it is no longer safe to share
index_entries among multiple sstable_mutation_readers.
The original intent behind sharing index_entries among index_readers was
to avoid re-reading same pages twice as we have two index readers -
lower and upper bound - for every sstable_mutation_reader. In fact, the
shared entries were held at the sstable object level so index_readers
from different sstable_mutation_readers could have accessed them.

Now, with calls to index_reader::advance_to(pos)/index_reader::advance_past(pos),
index_entry can be accessed in a way that modifies its state if we need
to read more promoted index blocks. It is safe to keep sharing them
between two index_readers within the same sstable_mutation_reader as the
invariant is maintained that readers can be only moved forward.
We cannot safely assume, however, that this invariant holds for multiple
sstable_mutation_readers as it may happen that one of them has read and
thrown away some promoted index blocks that another one needs. So we
restrict sharing to per-sstable_mutation_reader level.

Fixes #3189.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <83957d007621fe4c62af49aebf1838bb2f32ee55.1518226793.git.vladimir@scylladb.com>
2018-02-10 15:08:45 +02:00
Avi Kivity
432268f582 Merge "branch 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla" from Raphael
"The motivation is that it's no longer needed after new resharding
algorithm, which is solely responsible for working with shared
sstables; regular compaction will not work with those!
So resharding will schedule deletion of shared sstables once it's
certain that shards that own them have the new unshared sstables.
The manager was needed for orchestrating deletion of shared sstable
across shards. It brings extra complexity that's no longer needed,
and it was also overloading shard 0, but the latter could have
been fixed.

Tests:
- unit: release mode
- dtest: resharding_test.py"

* 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla:
  Remove SSTable's atomic deletion manager
  Stop using SSTable's atomic deletion manager
  database: split column_family::rebuild_sstable_list
2018-02-08 19:10:16 +02:00
Tomasz Grabiec
cce1a2bce8 Merge "Use the CPU scheduler" from Glauber & Avi
In this patchset I am resubmitting Avi's enablement of the CPU scheduler
on his behalf. I've done a ton of testing in the series and there are
some improvements / changes that I had previously sent as a separate series.

What you see here is the result of merging that work.

After this patchset is applied, workloads are smoother and we are able to
uphold the pre-defined shares among the various actors.

We also finally have everything we need to merge the CPU and I/O controllers.
After that is done the code is now much simpler. But also, as a bonus,
controllers that were previously available for I/O only (compactions) are
enabled for CPU as well.

* git@github.com:glommer/scylla.git cpusched-v7:

Avi Kivity (4):
  database, sstables, compaction: convert use of thread_scheduling_group
    to seastar cpu scheduler
  memtable, database: make memtable::clear_gently() inherit
    scheduling_group
  config: mark background_writer_scheduling_quota as Unused
  database: place data_query execution stage into scheduling_group

Glauber Costa (9):
  database, main: set up scheduling_groups for our main tasks
  row_cache: actually use the scheduling group for update_cache
  allow update_cache and clear_gently to use the entire task quota.
  database: remove cpu_flush_quota metric
  controllers: retire auto_adjust_flush_quota
  controllers: allow memtable I/O controller to have shares statically
    set
  controllers: update control points for memtable I/O controller
  controllers: allow a static priority to override the controller output
  controllers: unify the I/O and CPU controllers
2018-02-08 15:58:40 +01:00
Raphael S. Carvalho
312bd9ce25 Remove SSTable's atomic deletion manager
Not used anymore, can be deleted.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-02-07 22:38:45 -02:00
Raphael S. Carvalho
1472cfcc19 Stop using SSTable's atomic deletion manager
The motivation is that it's no longer needed after the new resharding
algorithm, which is solely responsible for working with shared
sstables; regular compaction will not work with those!
So resharding will schedule deletion of shared sstables once it's
certain that shards that own them have the new unshared sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-02-07 22:27:17 -02:00
Glauber Costa
956af9f099 database, main: set up scheduling_groups for our main tasks
Set up scheduling groups for streaming, compaction, memtable flush, query,
and commitlog.

The background writer scheduling group is retired; it is split into
the memtable flush and compaction groups.

Comments from Glauber:

This patch is based on a patch from Avi with the same subject, but the
differences are significant enough that I reset authorship. In
particular:

1) A bug/regression is fixed with the boundary calculations for the
   memtable controller sampling function.
2) A leftover is removed, where after flushing a memtable we would
   go back to the main group before going to the cache group again
3) As per Tomek's suggestion, now the submission of compactions
   itself is run in the compaction scheduling group. Having that
   working is what changes this patch the most: we now store the
   scheduling group in the compaction manager and let the compaction
   manager itself enforce the scheduling group.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Avi Kivity
641aaba12c database, sstables, compaction: convert use of thread_scheduling_group to seastar cpu scheduler
thread_scheduling_groups are converted to plain scheduling_group. Due to
differences in initialization (scheduling_group initialization defers), we
create the scheduling_groups in main.cc and propagate them to users via
a new class database_config.

The sstable writer loses its thread_scheduling_group parameter and instead
inherits scheduling from its caller.

Since shares are in the 1-1000 range vs. 0-1 for thread scheduling quotas,
the flush controller was adjusted to return values within the higher ranges.
2018-02-07 17:19:29 -05:00
Calle Wilund
264b9d2da0 sstables: Process extensions on file open
Allowing them to wrap/replace an opened file, and add to/read from
scylla metadata.
2018-02-07 10:11:46 +00:00
Calle Wilund
b0c0c3c0ad sstables::types: Add optional extensions attribute to scylla metadata
Allowing storing key:value pairs.
2018-02-07 10:11:46 +00:00
Calle Wilund
68fc076f80 sstables::disk_types: Add hash and comparator(sstring) to disk_string
2018-02-07 10:11:46 +00:00
Calle Wilund
0dcf287230 sstables: Add extension type for wrapping file io
2018-02-07 10:11:45 +00:00
Calle Wilund
74758c87cd sstables::compress/compress: Make compression a virtual object
Make a "compressor" an actual class, that can be implemented and
registered via class registry. 

For "common" compressors, the objects will be shared, but complex
implementors can be semi-stateful. 

sstable compression is split into two parts: The "static" config
which is shared across shards, and a "local" one, which holds 
a compressor pointer. The latter is encapsulated, along with 
actual compressed data writers, in sstables/compress.cc.

For compression (write), the compression writer is instantiated
with the settings active in table metadata.

For decompression (read), the compression reader is instantiated
with the settings stored in sstable metadata, which can
differ from the currently active table metadata.
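
The idea of a compressor as a virtual class looked up by name can be sketched roughly as follows. This is a minimal illustration and not the code from this patch: `compressor`, `rle_compressor` and `compressor_registry` are hypothetical names, and the toy run-length coder only stands in for a real algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical interface; the real class has a different, richer API.
class compressor {
public:
    virtual ~compressor() = default;
    virtual std::string name() const = 0;
    virtual std::string compress(const std::string& in) const = 0;
    virtual std::string uncompress(const std::string& in) const = 0;
};

// Toy run-length coder standing in for a real compression algorithm.
class rle_compressor final : public compressor {
public:
    std::string name() const override { return "rle"; }
    std::string compress(const std::string& in) const override {
        std::string out;
        for (std::size_t i = 0; i < in.size();) {
            std::size_t run = 1;
            while (i + run < in.size() && in[i + run] == in[i] && run < 255) {
                ++run;
            }
            out.push_back(static_cast<char>(run));  // run length (1..255)
            out.push_back(in[i]);                   // repeated byte
            i += run;
        }
        return out;
    }
    std::string uncompress(const std::string& in) const override {
        std::string out;
        for (std::size_t i = 0; i + 1 < in.size(); i += 2) {
            auto run = static_cast<std::size_t>(static_cast<unsigned char>(in[i]));
            out.append(run, in[i + 1]);
        }
        return out;
    }
};

// Hypothetical registry: maps a name found in metadata to a factory.
class compressor_registry {
    using factory = std::function<std::shared_ptr<compressor>()>;
    std::map<std::string, factory> _factories;
public:
    void add(const std::string& name, factory f) {
        _factories[name] = std::move(f);
    }
    std::shared_ptr<compressor> create(const std::string& name) const {
        auto it = _factories.find(name);
        if (it == _factories.end()) {
            throw std::invalid_argument("unknown compressor: " + name);
        }
        return it->second();
    }
};
```

A "common" compressor can be shared by registering a factory that always hands out the same instance. A writer would look up the name from table metadata, while a reader would use the name stored in sstable metadata.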

v2:
* Structured patch sets differently (dependencies)
* Added more comments/api descs
* Added patch to move all sstable compression into compress.cc,
  effectively separating top-level virtual compressor object
  from sstable io knowledge
v3:
* Rebased
v4: 
* Moved all sstable compression logic/knowledge into  
  compress.cc (local compression). Merged the two patches 
  (separation just confuses reader).
2018-02-07 10:11:45 +00:00
Raphael S. Carvalho
09f4ee808f sstables/compress: Fix race condition in segmented offset reading of shared sstable
Race condition was introduced by commit 028c7a0888, which introduces chunk offset
compression, because a reading state is kept in the compress structure which is
supposed to be immutable and can be shared among shards owning the same sstable.

So it may happen that shard A updates state while shard B relies on information
previously set which leads to incorrect decompression, which in turn leads to
read misbehaving.

We could serialize access to at() which would only lead to contention issues for
shared sstables, but that can be avoided by moving state out of compress structure
which is expected to be immutable after sstable is loaded and fed to shards that
own it. Sequential accessor (wraps state and reference to segmented_offset) is
added to prevent at() and push_back() interfaces from being polluted.

Tests: release mode.

Fixes #3148.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180205192432.23405-1-raphaelsc@scylladb.com>
2018-02-06 12:10:10 +02:00
Vladimir Krivopalov
b91c3fd47e Use advance_past for single partition upper bound.
Instead of advancing to the next partition, first try to find the more
precise position using promoted index blocks.
advance_past() only seeks within currently available PI blocks (or reads
the first batch, if never read before) and uses the position if found,
otherwise resorts to advance_to_next_partition()

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:57:45 -08:00
Vladimir Krivopalov
6f8c6a0933 Remove obsolete types and methods.
These types and methods are no longer in use since the index_reader is
now consuming promoted index incrementally.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:57:35 -08:00
Vladimir Krivopalov
0a7a56edd5 Simplify continuous_data_consumer::consume_input() interface.
Remove redundant input parameter as continuous_data_consumer derivatives
would only use themselves as a context. So take it internally and make
the function regular (non-template) and having no parameters.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:57:26 -08:00
Vladimir Krivopalov
7e15e436de Parse promoted index entries lazily upon request rather than immediately.
Now promoted index is converted into an input_stream and skipped over
instead of being consumed immediately and stored as a single buffer.
The only part that is read right away is the deletion time as it is
likely to be there in the already read buffer and reading it should both
be cheap and prevent from reading the whole promoted index if only
deletion time mark is needed.

When accessed, promoted index is parsed in chunks, buffer by buffer, to
limit memory consumption.

Fixes #2981

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:57:15 -08:00
Vladimir Krivopalov
9fdf4b24b5 Add helper input streams: buffer_input_stream and prepended_input_stream.
buffer_input_stream is a simple input_stream wrapping a single
temporary_buffer.

prepended_input_stream suits for the case when some data has been read
into a buffer and the rest is still in a stream. It accepts a buffer and
a data_source and first reads from the buffer and then, when it ends,
proceeds reading from the data_source.
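
The buffer-then-source behaviour described above can be sketched without Seastar as follows. This is an illustrative sketch only: `prepended_reader`, `data_source` (modelled here as a plain callable) and `make_source` are hypothetical stand-ins, and the real streams are asynchronous.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a data source: each call returns the next chunk, empty at EOF.
using data_source = std::function<std::string()>;

class prepended_reader {
    std::string _buf;     // bytes that were already read ahead
    data_source _source;  // where the rest of the data lives
public:
    prepended_reader(std::string buf, data_source source)
        : _buf(std::move(buf)), _source(std::move(source)) {}

    // Returns the prepended buffer first; once it is drained, reads the source.
    std::string read() {
        if (!_buf.empty()) {
            return std::exchange(_buf, std::string());
        }
        return _source();
    }
};

// Test helper: a source that yields the given chunks in order, then EOF.
inline data_source make_source(std::vector<std::string> chunks) {
    auto pos = std::make_shared<std::size_t>(0);
    return [chunks = std::move(chunks), pos]() -> std::string {
        return *pos < chunks.size() ? chunks[(*pos)++] : std::string();
    };
}
```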

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:57:04 -08:00
Vladimir Krivopalov
5dca3100ed Support skipping over bytes from input stream in parsers based on continuous_data_consumer
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:56:55 -08:00
Raphael S. Carvalho
2c181b69c9 sstables: fix wildly inaccurate sstable key estimation after dynamic index sampling
The reason sstable key estimation is inaccurate is that it doesn't account for
index sampling now being dynamic.

The estimation is done as follows:
    uint64_t get_estimated_key_count() const {
        return ((uint64_t)_components->summary.header.size_at_full_sampling + 1) *
                _components->summary.header.min_index_interval;
    }

The biggest problem is that _components->summary.header.min_index_interval isn't
actually the minimum interval, but instead the default interval value set in the
schema.
So the estimation gets worse the larger the average partition, because the larger
the average partition the lower the index sampling interval.
One of the problems is that estimation has a big influence on bloom filter size,
and so for large partitions we were generating bigger filters than we had to.

From now on, size at full sampling is calculated as if sampling were static
(which was the case until commit 8726ee937d which introduced size-based
sampling), using minimum index as a strict sampling interval.
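
The arithmetic can be illustrated with a small sketch. The first function mirrors the formula quoted above; `static_size_at_full_sampling` is a hypothetical helper showing one way to read "as if sampling were static", and the numbers are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Same shape as the estimation formula quoted in the commit message.
inline uint64_t estimated_key_count(uint64_t size_at_full_sampling,
                                    uint64_t min_index_interval) {
    return (size_at_full_sampling + 1) * min_index_interval;
}

// Hypothetical helper: size at full sampling as if one summary entry were
// taken strictly every min_index_interval keys.
inline uint64_t static_size_at_full_sampling(uint64_t partition_count,
                                             uint64_t min_index_interval) {
    return partition_count / min_index_interval;
}
```

For example, with 1,000,000 partitions and an interval of 128, static sampling yields 7812 entries and an estimate of 1,000,064 keys. If size-based sampling had produced 50,000 summary entries for the same data, feeding that count into the formula would give 6,400,128, more than six times the real key count.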

Tests: units (release)

Fixes #3113.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180122233612.11147-1-raphaelsc@scylladb.com>
2018-01-23 10:42:24 +02:00
Glauber Costa
5140aaea00 add a timeout to fast forward to
In the last patch, as part of enabling per-request timeouts, we enabled
timeouts in fill_buffer. There are many places, though, in which we
fast_forward_to before we fill_buffer, so in order to make that
effective we need to propagate the timeouts to fast_forward_to as well.

In the same way as fill_buffer, we make the argument optional wherever
possible in the high level callers, making them mandatory in the
implementations.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-12 07:43:19 -05:00
Glauber Costa
d965af42b0 add a timeout to fill_buffer
As part of the work to enable per-request timeouts, we enable timeouts
in fill_buffer.

The argument is made optional at the main classes, but mandatory in all
the ::impl versions. This way we'll make sure we didn't forget anything.
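
The "optional at the top, mandatory in ::impl" pattern can be sketched like this. It is a simplified, synchronous illustration: the `reader` and `recording_impl` classes and the `no_timeout` constant are assumptions made for the example, and the real fill_buffer is asynchronous.

```cpp
#include <cassert>
#include <chrono>
#include <memory>
#include <utility>

using clock_type = std::chrono::steady_clock;
using time_point = clock_type::time_point;
constexpr time_point no_timeout = time_point::max();

class reader {
public:
    class impl {
    public:
        virtual ~impl() = default;
        // No default here: every implementation must accept the timeout,
        // so forgetting to propagate it becomes a compile error.
        virtual bool fill_buffer(time_point timeout) = 0;
    };

    explicit reader(std::unique_ptr<impl> i) : _impl(std::move(i)) {}

    // High-level callers may omit the timeout.
    bool fill_buffer(time_point timeout = no_timeout) {
        return _impl->fill_buffer(timeout);
    }

private:
    std::unique_ptr<impl> _impl;
};

// Example implementation: records the timeout and reports whether it has
// not yet expired.
class recording_impl : public reader::impl {
public:
    time_point last = time_point::min();
    bool fill_buffer(time_point timeout) override {
        last = timeout;
        return clock_type::now() < timeout;
    }
};
```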

At this point we're still mostly passing that information around and
don't have any entity that will act on those timeouts. In the next patch
we will wire that up.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-11 12:07:41 -05:00
Duarte Nunes
cbbdfde979 sstables/compaction_backlog_tracker: Constify backlog()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180111004914.25796-1-duarte@scylladb.com>
2018-01-11 13:20:57 +02:00
Duarte Nunes
43ad5bd182 sstables/compaction_backlog_manager: Fix use-after-free
If the compaction_backlog_manager's lifetime ends before the linked
compaction_backlog_tracker's, the latter's _manager pointer is not
cleared, which can lead to a use-after-free error when running
~compaction_backlog_tracker(), as evidenced by failing unit tests.
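
The lifetime problem can be sketched with two small classes. This is an assumed, simplified model (`backlog_manager` and `backlog_tracker` are hypothetical names), not the code from this patch; it shows the manager clearing the back-pointers of any trackers that outlive it.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

class backlog_manager;

class backlog_tracker {
    backlog_manager* _manager = nullptr;
    friend class backlog_manager;
public:
    ~backlog_tracker();
    bool registered() const { return _manager != nullptr; }
};

class backlog_manager {
    std::unordered_set<backlog_tracker*> _trackers;
public:
    void register_tracker(backlog_tracker& t) {
        t._manager = this;
        _trackers.insert(&t);
    }
    void remove_tracker(backlog_tracker& t) {
        _trackers.erase(&t);
        t._manager = nullptr;
    }
    std::size_t size() const { return _trackers.size(); }
    ~backlog_manager() {
        // Clear back-pointers so trackers that outlive the manager do not
        // dereference freed memory in their own destructors.
        for (auto* t : _trackers) {
            t->_manager = nullptr;
        }
    }
};

inline backlog_tracker::~backlog_tracker() {
    if (_manager) {
        _manager->remove_tracker(*this);
    }
}
```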

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180111004914.25796-2-duarte@scylladb.com>
2018-01-11 13:20:55 +02:00
Raphael S. Carvalho
4610e994e1 sstables: cure our blindness on sstable read failure
After 611774b, we're blind again on which sstable caused a compaction
to fail, leaving us with a cryptic message as follows:
compaction_manager - compaction failed: std::runtime_error (compressed
chunk failed checksum)

After this change, a read failure in either compaction or a regular read
will report the guilty sstable, see:
compaction_manager - compaction failed: std::runtime_error (SSTable reader
found an exception when reading sstable ./data/.../keyspace1-standard1
ka-1-Data.db : std::runtime_error(compressed chunk failed checksum))

Fixes #3006.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180102230752.14701-1-raphaelsc@scylladb.com>
2018-01-08 13:43:13 +02:00
Avi Kivity
72c673fcc3 Merge "I/O Controller for memtables and compactions" from Glauber
"This patchset implements the compaction controller for I/O shares. The
goal is to automatically adjust compaction shares based on a
strategy-specific backlog. A higher backlog will translate into higher
shares.

As compaction progresses, that reduces the backlog. As new data is
flushed, that increases the backlog. The goal of the controller is to
keep the backlog constant at a certain rate, so that we go neither
too fast nor too slow.

Tracking reads and writes:
==========================

Tracking of reads and writes happens through the read_monitor and the
write_monitor. The write monitor is an existing interface that has the
purpose of releasing the write permit at particular points of the write
process. We enhance it so as to get a reference to an instance that tracks
the current offset inside the sstables::file_writer. This way the
backlog tracker can always know for sure what's the offset of the
current write.

A similar thing is done for reads. The data_consumer already tracks the
position of the current read, and we isolate that into a structure to
which we can get a reference. A read_monitor allows us to connect the
compaction to that reference.

Lifetime management:
====================

In general, tracking objects will be owned by their callers and passed
down as references. The compaction object will own the read monitors and
the compaction write monitors and the memtable flush write monitor will
be kept alive in a do_with block around the flush itself.

The backlog_{write,read}_progress_manager needs to be kept alive until
the SSTable is no longer in progress. For writes, that means until we
are able to add the SSTable charges in full, and for reads (compaction)
that means until we are able to remove the charges in full.

It is important to do that to avoid spikes in the graph. If we remove
the progress managers in a different operation than updating the SSTable
list we will be left in a temporary state where charges appear or
disappear abruptly, to be fixed when the final
add_sstable/remove_sstable happens. So we want those things to happen
together.

The compaction_backlog_tracker is kept alive until the strategy changes,
for example, through ALTER TABLE. Current charges are transferred to the
new strategy's compaction_backlog_tracker object when we do that. If the
type of strategy changes, the current read charges are forgotten. We can
do that because those running compaction will not really contribute to
decrease the backlog of the new compaction strategy.

Transfer of Charges
===================

When ALTER TABLE happens, we need to transfer ongoing writes to the new
backlog manager. Ongoing reads will still be tracked by the
backlog_manager that originated them.

The rationale for that is that reads still belong to the current
compaction, with the strategy that generated them. But new Tables being
written will add to the backlog of the new strategy.

Note that ALTER TABLE operations do not necessarily cause a change of
Strategy. We can be using the same strategy but just changing
properties. If that is the case, we expect no discontinuity in the
backlog graph (tested).

Resharding
==========

Resharding compactions are more complex than normal compactions because
the SSTables are created in one shard and later sent to another shard.
It is better, then, to track resharding compactions separately and let
them have their own backlog tracker, which will insert backlog in
proportion to the amount of data to be resharded.

Memtable Flush I/O Controller
=============================

With the current infrastructure it becomes trivial to add a new
controller, for either I/O or CPU. This patchset then adds an I/O
controller for memtable flushes, using the same backlog algorithm that
we already used for CPU."

* 'compaction-controller-io-v5' of github.com:glommer/scylla:
  database: add a controller for I/O on memtable flushes.
  document the compaction controller
  compaction: adjust shares for compactions
  backlog_controllers: implement generic I/O controller
  factor out some of the controller code
  io shares: multiply all shares by 10
  compaction_strategy: implement backlog manager for the SizeTiered strategy
  infrastructure for backlog estimator for compaction work.
  sstables: notify about end of data component write
  sstables: add read_monitor_generator
  sstables: add read_monitor
  sstables: enhance data consumer with a position tracker
  sstables: enhance the file_writer with an offset tracker
  sstables: pass references instead of pointers for write_monitor
  compaction: control destruction of readers
2018-01-07 15:00:10 +02:00
Avi Kivity
375ed938b4 Merge "Fix potential infinite recursion in leveled compaction" from Raphael
'"The issue is triggered by compaction of sstables of level higher than 0.

The problem happens when interval map of partitioned sstable set stores
intervals such as the following:
[-9223362900961284625 : -3695961740249769322 ]
(-3695961740249769322 : -3695961103022958562 ]

When selector is called for first interval above, the exclusive lower
bound of the second interval is returned as next token, but the
inclusiveness info is not returned.
So reader_selector was returning that there *were* new readers when
the current token was -3695961740249769322 because it was stored in
selector position field as inclusive, but it's actually exclusive.

This false positive was leading to infinite recursion in combined
reader because sstable set's incremental selector itself knew that
there were actually *no* new readers, and therefore *no* progress
could be made."

Fixes #2908.'

* 'high_level_compaction_infinite_recursion_fix_v4' of github.com:raphaelsc/scylla:
  tests: test for infinite recursion bug when doing high-level compaction
  Fix potential infinite recursion when combining mutations for leveled compaction
  dht: make it easier to create ring_position_view from token
  dht: introduce is_min/max for ring_position
2018-01-07 13:22:17 +02:00
Raphael S. Carvalho
818830715f Fix potential infinite recursion when combining mutations for leveled compaction
The issue is triggered by compaction of sstables of level higher than 0.

The problem happens when interval map of partitioned sstable set stores
intervals such as the following:
[-9223362900961284625 : -3695961740249769322 ]
(-3695961740249769322 : -3695961103022958562 ]

When selector is called for first interval above, the exclusive lower
bound of the second interval is returned as next token, but the
inclusiveness info is not returned.
So reader_selector was returning that there *were* new readers when
the current token was -3695961740249769322 because it was stored in
selector position field as inclusive, but it's actually exclusive.

This false positive was leading to infinite recursion in combined
reader because sstable set's incremental selector itself knew that
there were actually *no* new readers, and therefore *no* progress
could be made.

The fix is to use ring_position in reader_selector, so that
inclusiveness is respected.
So reader_selector::has_new_readers() won't return a false positive
under the conditions described above.
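
The difference between comparing bare tokens and comparing positions that carry inclusiveness can be sketched as follows. The `position` struct and both functions are hypothetical simplifications of ring_position and reader_selector::has_new_readers(), written only to show the false positive.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for a ring position.
struct position {
    int64_t token;
    bool after_token;  // true: lies just after every key with this token
};

// Token-only comparison: an exclusive bound is treated as inclusive, which
// produces the false positive described above.
inline bool has_new_readers_token_only(int64_t selector_token,
                                       int64_t current_token) {
    return selector_token <= current_token;
}

// Position-aware comparison: a bound placed after the token is not reached
// while the reader is still at that token.
inline bool has_new_readers(position selector_pos, int64_t current_token) {
    if (selector_pos.token != current_token) {
        return selector_pos.token < current_token;
    }
    return !selector_pos.after_token;
}
```

With the second interval from the example starting just after token -3695961740249769322, the token-only check reports new readers while the reader is still at that token, whereas the position-aware check does not.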

Fixes #2908.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-01-03 16:23:01 -02:00
Raphael S. Carvalho
e29b598c5f sstables: make compaction_descriptor's ctor explicit to avoid bad conversion
perf sstable used old sstables::compact_sstables() interface and still compiled
due to bad implicit conversion.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180103041900.21186-1-raphaelsc@scylladb.com>
2018-01-03 12:37:12 +02:00
Glauber Costa
074a13ecf1 compaction_strategy: implement backlog manager for the SizeTiered strategy
The SizeTiered backlog for a single SSTable is defined as:

   Bi = Ei * log4(T / Si)

Where:

  - Si is the size of this individual SSTable
  - T is the sum of sizes for all individual SSTables
  - Ei is the effective bytes in this SSTable.

The Effective size of an SSTable is:
 - The uncompacted size for an SSTable under compaction
 - The partially written size for an SSTable being written
 - The SSTable size for an SSTable that is not undergoing
   any of those processes.

The Aggregate Backlog for the entire Table is just the sum of
all individual SSTable backlogs, including the SSTables currently
being written.

Care is taken to avoid iterating over all SSTables, by separating
the aggregate backlog into a static component (sstables not changing) and
a component of SSTables that are undergoing change.
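
The formula above can be written out as a short sketch. This is a naive illustration that iterates over every SSTable, which is exactly what the patch says it avoids; the `sstable_info` struct and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct sstable_info {
    double size;             // Si: total size of the SSTable in bytes
    double effective_bytes;  // Ei: bytes that still count toward the backlog
};

inline double log_base4(double x) {
    return std::log(x) / std::log(4.0);
}

// Bi = Ei * log4(T / Si)
inline double sstable_backlog(const sstable_info& s, double total_size) {
    return s.effective_bytes * log_base4(total_size / s.size);
}

// Aggregate backlog: the sum of all individual SSTable backlogs.
inline double table_backlog(const std::vector<sstable_info>& sstables) {
    double total = 0;
    for (const auto& s : sstables) {
        total += s.size;
    }
    double backlog = 0;
    for (const auto& s : sstables) {
        backlog += sstable_backlog(s, total);
    }
    return backlog;
}
```

For four untouched SSTables of 100 bytes each, T / Si is 4, so each contributes 100 * log4(4) = 100 and the table backlog is 400. A table holding a single SSTable has T / Si = 1 and therefore zero backlog.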

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
ca284174d0 infrastructure for backlog estimator for compaction work.
This patch adds infrastructure in various points in the system to allow
us to determine the amount of work present as backlog from compactions.

What needs to be done can be explained in three major pieces:

1) Add hooks in the points where sstables are added or inserted to a
   column family (or more precisely, to a compaction_strategy object).

2) Add hooks in read and write monitors that allow a compaction
   backlog estimator (tracker) to become aware of bytes that are
   partially written and compacted away.

3) Add a per-column family class (compaction_backlog_tracker) that
   can be used to track work that is done and relevant to compactions
   (like the two above), and a compaction manager to provide a
   system-wide backlog based on the response of the individual trackers.

The definition of how much backlog one has is strategy-specific. The
Null strategy is easy, as it never really has any backlog, and so is the
major strategy - since what really matters is the backlog of the
underlying compaction strategy.

Although backlogs are strategy-specific, they should be "compatible", in
the sense that if a particular strategy has more work to do, it should
yield a higher number than its counterparts.

All the others are presented in this patch as unimplemented: they will
always advertise a mild backlog that should yield a constant
CPU-utilization if used alone.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
86d7c160fd sstables: notify about end of data component write
We need to notify the monitor that the offset tracker that we are using is
about to be destroyed and will no longer be valid.

While we could modify the file_writer interface so that we could capture
the offset_tracker and take ownership of it - guaranteeing it is alive
until we reach the existing on_write_completed(), this feels like a
layer violation.

It is also potentially useful in general to let the monitor's callers
know that writing the data portion is done.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
3bd6bceaf0 sstables: add read_monitor_generator
Passing the read monitor down to the sstable readers is tricky. The
points of interest - like compaction - are usually very far from the
interfaces that register the monitor, like read_rows. Between the two,
there is usually a mutation_reader, which is and ought to be totally
unaware of the read monitor: technically, a mutation_reader may not even
know it is backed by sstables.

The solution is to create a read_monitor_generator that can be passed
from the upper layers, like compaction, to the layers that actually
decide which sstables to create readers for.

Note that we don't need an equivalent piece of infrastructure for
writes, because writes don't happen through hidden layers and have all
the information they need to initialize their monitors.
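
A rough sketch of the generator idea, under stated assumptions: the
class shapes, the on_read_progress() and total_position() methods, the
read_selected() helper and the string sstable names are all hypothetical
and are not taken from this patch.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical monitor: remembers the last position a reader reported.
struct read_monitor {
    uint64_t position = 0;
    void on_read_progress(uint64_t pos) { position = pos; }
};

// Hypothetical generator: the upper layer (e.g. compaction) owns it and
// passes it down; whoever picks the sstables asks it for one monitor per
// sstable. std::map keeps references to its elements valid on insertion.
class read_monitor_generator {
    std::map<std::string, read_monitor> _monitors;  // keyed by sstable name
public:
    read_monitor& operator()(const std::string& sstable_name) {
        assert(!sstable_name.empty());
        return _monitors[sstable_name];
    }
    // Lets the owner observe overall progress across all monitors.
    uint64_t total_position() const {
        uint64_t sum = 0;
        for (const auto& kv : _monitors) {
            sum += kv.second.position;
        }
        return sum;
    }
};

// Stand-in for the lower layer that decides which sstables to read. It
// knows nothing about compaction; it only uses the generator it was given.
inline void read_selected(read_monitor_generator& gen) {
    gen("sst-1").on_read_progress(100);
    gen("sst-2").on_read_progress(40);
}
```

The layers in between only forward the generator, so they stay unaware
of monitors, as the message above requires of mutation_reader.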

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
9702a0935b sstables: add read_monitor
Similar to the write_monitor, it will track the progress of an sstable
being read. In the current interface, we will notify interested users
about the current position in the data file.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
f0391bf9a0 sstables: enhance data consumer with a position tracker
Callers, like compactions, will be able to know at any time the current
progress of a read.

As we do that, the currently unimplemented position() method of
data_consume_context becomes redundant and is removed.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
110b8531f4 sstables: enhance the file_writer with an offset tracker
Callers, like the memtable flusher or compactions, will be able to find
out at any time how many bytes have been written so far.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
00df0a5ad3 sstables: pass references instead of pointers for write_monitor
This came from Avi's review of the read_monitors. He suggested that we
not keep shared pointers and instead have the caller ensure the
lifetime. That makes sense, but having the write interface use
shared_ptr and the read interface use references would lead to an
inconsistent interface.

For the sake of consistency we will change the write monitor to take
references before the read monitors are introduced. From database.cc's
perspective, we could
now keep the monitors in a do_with() block, but we will keep the
shared_ptrs to manage their lifetime in anticipation of upcoming patches
in this series, where we'll have to pass them somewhere else.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:06 -05:00
Glauber Costa
d4109ebb80 compaction: control destruction of readers
Compactions run from a seastar::thread, in run(). They will either fail
or succeed, and from the point of view of ordering of destruction
between the compaction object and its readers:

- if compaction succeeds, we have no control over which gets destroyed
  first, since both objects will be going out of scope.
- if it fails, we forcibly destroy the compaction object, at
  which point the readers are still alive.

From the point of view of lifetime management, it would be nice to make
sure that the compaction object outlives whichever other objects it
needs during compaction.

This nice-to-have will become paramount once we start adding
read_monitors to the compaction object, since the monitors themselves
have to outlive the readers.
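
One way to get such an ordering guarantee in C++, shown as a sketch and
not necessarily what this patch does: when one object owns both the
monitors and the readers, non-static members are destroyed in reverse
declaration order. The tracer and compaction types and the
destruction_order() helper below are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Appends its name to a log when destroyed, so the order can be inspected.
struct tracer {
    std::vector<std::string>& log;
    std::string name;
    tracer(std::vector<std::string>& l, std::string n)
        : log(l), name(std::move(n)) {}
    ~tracer() { log.push_back(name); }
};

// Hypothetical compaction object that owns a monitor and a reader.
// Members are destroyed in reverse declaration order, so declaring the
// monitor first guarantees that it outlives the reader.
struct compaction {
    tracer monitor;
    tracer reader;
    explicit compaction(std::vector<std::string>& log)
        : monitor(log, "monitor"), reader(log, "reader") {}
};

// Builds and destroys a compaction object, returning the observed order.
inline std::vector<std::string> destruction_order() {
    std::vector<std::string> log;
    {
        compaction c(log);
    }
    assert(log.size() == 2);
    return log;
}
```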

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:06 -05:00
Avi Kivity
8795238869 Merge "Fix handling of range tombstones starting at same position" from Tomasz
"When we get two range tombstones with the same lower bound from
different data sources (e.g. two sstables), which need to be combined
into a single stream, they need to be de-overlapped, because each
mutation fragment in the stream must have a different position. If we
have range tombstones [1, 10) and [1, 20), the result of that
de-overlapping will be [1, 10) and [10, 20). The problem is that if
the stream corresponds to a clustering slice with upper bound greater
than 1, but lower than 10, the second range tombstone would appear as
being out of the query range. This is currently violating assumptions
made by some consumers, like cache populator.
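
The de-overlapping step and the resulting out-of-range tombstone can be
illustrated with a small sketch. It assumes integer positions and
half-open intervals; the range type and the deoverlap() and
starts_within() helpers are hypothetical and far simpler than real
clustering bounds.

```cpp
#include <cassert>
#include <utility>

// Half-open interval [start, end) over integer clustering positions.
struct range {
    int start;
    int end;
};

// De-overlaps two tombstones that share a lower bound: the shorter one is
// kept as is, the longer one is trimmed to begin where the shorter ends.
inline std::pair<range, range> deoverlap(range a, range b) {
    assert(a.start == b.start);
    if (a.end > b.end) {
        std::swap(a, b);
    }
    return {a, range{a.end, b.end}};
}

// True if the tombstone starts before the exclusive upper bound of a
// query slice, i.e. a consumer would consider it part of the query range.
inline bool starts_within(range r, int slice_upper_bound) {
    return r.start < slice_upper_bound;
}
```

With tombstones starting at 1 and ending at 10 and 20, and a slice whose
upper bound is 5, the first result still starts inside the slice while
the trimmed one starts at 10, outside of it.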

One effect of this may be that a reader misses rows in the range
(1, 10), that is, after the start of the first range tombstone and
before the start of the second. This can happen if the second range
tombstone was the last fragment read for a discontinuous range in
cache, reading stopped at that point because the buffer was full, and
the cache was evicted before reading resumed, so that we went back to
reading from the sstable reader. There could be more cases in which
this violation may resurface.

There is also a related bug in mutation_fragment_merger. If the reader
is in forwarding mode, and the current range is [1, 5], the reader
would still emit range_tombstone([10, 20]). If that reader is later
fast forwarded to another range, say [6, 8], it may produce fragments
with positions smaller than those already emitted, violating
monotonicity of fragment positions in the stream.

A similar bug was also present in partition_snapshot_flat_reader.

Possible solutions:

 1) relax the assumption (in cache) that streams contain only relevant
 range tombstones, and only require that they contain at least all
 relevant tombstones

 2) allow subsequent range tombstones in a stream to share the same
 starting position (position is weakly monotonic), then we don't need
 to de-overlap the tombstones in readers.

 3) teach combining readers about query restrictions so that they can
 drop fragments which fall outside the range

 4) force leaf readers to trim all range tombstones to query restrictions

This patch implements solution no 2. It simplifies combining readers,
which don't need to accumulate and trim range tombstones.

I don't like solution 3, because it makes combining readers more
complicated, slower, and harder to properly construct (currently
combining readers don't need to know restrictions of the leaf
streams).

Solution 4 is confined to implementations of leaf readers, but also
has the disadvantage of making those more complicated and slower.

There is only one consumer which needs the tombstones with monotonic positions, and
that is the sstable writer.

Fixes #3093."

* tag 'tgrabiec/fix-out-of-range-tombstones-v1' of github.com:scylladb/seastar-dev:
  tests: row_cache: Introduce test for concurrent read, population and eviction
  tests: sstables: Add test for writing combined stream with range tombstones at same position
  tests: memtable: Test that combined mutation source is a mutation source
  tests: memtable: Test that memtable with many versions is a mutation source
  tests: mutation_source: Add test for stream invariants with overlapping tombstones
  tests: mutation_reader: Test fast forwarding of combined reader with overlapping range tombstones
  tests: mutation_reader: Test combined reader slicing on random mutations
  tests: mutation_source_test: Extract random_mutation_generator::make_partition_keys()
  mutation_fragment: Introduce range()
  clustering_interval_set: Introduce overlaps()
  clustering_interval_set: Extract private make_interval()
  mutation_reader: Allow range tombstones with same position in the fragment stream
  sstables: Handle consecutive range_tombstone fragments with same position
  tests: streamed_mutation_assertions: Merge range_tombstones with the same position in produces_range_tombstone()
  streamed_mutation: Introduce peek()
  mutation_fragment: Extract mergeable_with()
  mutation_reader: Move definition of combining mutation reader to source file
  mutation_reader: Use make_combined_reader() to create combined reader
2018-01-02 18:32:09 +02:00
Raphael S. Carvalho
3dcf00ec67 sstables: feed new sstable with its owner shard
While working on 67c5c8dc67, we missed the opportunity to feed the shard
id to the sstable being written. Do it now, so that when the sstable is
reopened after being sealed, its shard doesn't need to be recomputed by
the open procedure.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171231024529.13664-1-raphaelsc@scylladb.com>
2018-01-01 10:17:07 +02:00
Raphael S. Carvalho
c76356fb39 sstables: make shard computation resilient to empty sharding metadata
Scylla metadata could be empty due to bugs like the one introduced by
115ff10. Let's make shard computation resilient to empty sharding
metadata by falling back to the approach that uses first and last
keys to compute shards.
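
A sketch of the fallback logic, under loud assumptions: tokens are plain
integers, the modulo-based shard_of() is a toy stand-in for the real
sharding function, and compute_shards() and shards_for_range() are
hypothetical names, not the functions touched by this patch.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

using token = int64_t;
using token_range = std::pair<token, token>;  // inclusive [first, last]

// Toy sharding function, an assumption made only for this illustration.
inline unsigned shard_of(token t, unsigned shard_count) {
    return static_cast<unsigned>(static_cast<uint64_t>(t) % shard_count);
}

// Shards owning at least one token in the range. Stops early once every
// shard has been seen.
inline std::set<unsigned> shards_for_range(token_range r, unsigned shard_count) {
    std::set<unsigned> shards;
    for (token t = r.first; t <= r.second && shards.size() < shard_count; ++t) {
        shards.insert(shard_of(t, shard_count));
    }
    return shards;
}

// Prefers the sharding metadata; if it yields no shards (for example
// because it is empty), falls back to the first and last keys.
inline std::set<unsigned> compute_shards(const std::vector<token_range>& sharding_metadata,
                                         token first_key, token last_key,
                                         unsigned shard_count) {
    assert(shard_count > 0 && first_key <= last_key);
    std::set<unsigned> shards;
    for (const auto& r : sharding_metadata) {
        auto s = shards_for_range(r, shard_count);
        shards.insert(s.begin(), s.end());
    }
    if (shards.empty()) {
        shards = shards_for_range({first_key, last_key}, shard_count);
    }
    return shards;
}
```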

Refs #2932.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171223120140.3642-2-raphaelsc@scylladb.com>
2017-12-28 14:07:06 +02:00
Raphael S. Carvalho
fa5a26f12d sstables: fail sstable write if unable to generate sharding metadata
An sstable can end up with empty sharding metadata after a bug like
the one introduced by 115ff10, which results in tokens for the view
table being generated using the base table. That leads to the
sstable being deleted on a subsequent boot, because all shards will
agree on its deletion given that it does not belong to anybody,
and also to compaction crashing, because it relies on the resulting
sstable belonging to at least one shard.

I wouldn't like to spend days debugging it again because sstable
write silently generated empty sharding metadata, so let's make
write fail when it happens (see issue #2932 for details).

Refs #2932.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171223120140.3642-1-raphaelsc@scylladb.com>
2017-12-28 14:07:05 +02:00
Duarte Nunes
2618209c2d Remove obsolete includes and fix build
move.hh was deleted, but files weren't updated to reflect that.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-12-28 12:03:44 +00:00