Commit Graph

1128 Commits

Author SHA1 Message Date
Duarte Nunes
2a371c2689 Merge 'Allow bypassing cache on a per-query basis' from Avi
"
Some queries are very unlikely to hit cache. Usually this includes
range queries on large tables, but other patterns are possible.

While the database should adapt to the query pattern, sometimes the
user has information the database does not have. By passing this
information along, the user helps the database manage its resources
more optimally.

To do this, this patch introduces a BYPASS CACHE clause to the
SELECT statement. A query thus marked will not attempt to read
from the cache, and instead will read from sstables and memtables
only. This reduces the CPU time spent querying and populating the
cache, and prevents the cache from being flooded with data that is
unlikely to be read again soon. The existing cache-disabled path is
engaged when the option is selected.

Tests: unit (release), manual metrics verification with ccm with and without the
    BYPASS CACHE clause.

Ref #3770.
"

* tag 'cache-bypass/v2' of https://github.com/avikivity/scylla:
  doc: document SELECT ... BYPASS CACHE
  tests: add test for SELECT ... BYPASS CACHE
  cql: add SELECT ... BYPASS CACHE clause
  db: add query option to bypass cache
2018-11-26 09:59:40 +00:00
Avi Kivity
b835b93ee6 db: add query option to bypass cache
With the option enabled, we bypass the cache unconditionally and only
read from memtables+sstables. This is useful for analytics queries.
2018-11-25 16:26:08 +02:00
Raphael S. Carvalho
2058001f94 sstables/compaction: propagate sstable replacement to all compaction of a CF
This is needed for parallel compaction to work with the sstable-run
based approach. Regular compaction clones a set containing all sstables
of its column family, so compaction A can potentially hold a reference
to a compacting sstable of compaction B, preventing compaction B from
releasing its exhausted sstable.

Therefore all replacements are propagated to all compactions of a given
column family, and each compaction, including the one that initiated
the propagation, applies the replacement.
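In sketch form (Python, invented names): each compaction holds its own cloned sstable set, and a replacement is applied to every ongoing compaction of the column family, initiator included.

```python
class CompactionState:
    """Per-compaction clone of the column family's sstable set."""
    def __init__(self, sstables):
        self.sstables = set(sstables)

    def apply_replacement(self, removed, added):
        # Drop references to replaced sstables so they can be released.
        self.sstables -= set(removed)
        self.sstables |= set(added)

def propagate_replacement(compactions, removed, added):
    # Propagate to ALL compactions of the CF, initiator included;
    # otherwise compaction A's clone could pin an exhausted sstable of B.
    for compaction in compactions:
        compaction.apply_replacement(removed, added)
```

After propagation, no compaction's clone still references the replaced sstable, so its space can actually be reclaimed.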

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-11-24 18:53:30 -02:00
Raphael S. Carvalho
953fdcc867 sstables: store cf pointer in compaction_info
The motivation is that we need a more efficient way to find the
compactions that belong to a given column family in the compaction list.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-11-24 18:53:28 -02:00
Raphael S. Carvalho
fc92fb955d sstables/compaction_manager: release reference to exhausted sstable through callback
This is important so that the reference to the sstable is not kept
throughout the compaction procedure, which would defeat the goal of
releasing space during compaction.

The manager passes a callback to compaction, which calls it whenever
an sstable replacement occurs.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-11-24 18:53:16 -02:00
Raphael S. Carvalho
3433de3dc0 database: do not keep reference to sstable in selector when done selecting
When compacting, we'll create all readers at once and will not select
again from the incremental selector, meaning the selector will keep all
of the respective sstables in current_sstables, preventing compaction
from releasing space as it goes on.

The change refreshes the sstable set's selector so that it never holds
a reference to an exhausted sstable.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-11-24 18:53:12 -02:00
Raphael S. Carvalho
e5a0b05c15 sstables/compaction: release space earlier of exhausted input sstables
Currently, compaction only replaces input sstables at the end of
compaction, meaning compaction must finish before any of the space held
by those sstables can be released.

What we can do instead is delete some input sstables earlier, under two
conditions:

1) The sstable's data has been committed to a new, sealed output
sstable, meaning it is exhausted.
2) The exhausted sstable must not overlap with a non-exhausted sstable,
because a tombstone in the exhausted sstable could have been purged,
and the shadowed data in the non-exhausted one could be resurrected if
the system crashes.
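The two conditions can be sketched as a predicate (Python, invented names; `overlaps` maps an sstable to the other input sstables it overlaps with):

```python
def can_release_early(sst, exhausted, overlaps):
    """Return True if an input sstable's disk space may be released
    before the whole compaction finishes (illustrative sketch)."""
    # 1) Its data must be committed to a sealed output sstable.
    if sst not in exhausted:
        return False
    # 2) It must not overlap a non-exhausted input: a purged tombstone
    #    in `sst` could otherwise resurrect shadowed data after a crash.
    return all(other in exhausted for other in overlaps[sst])
```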

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-11-24 18:53:07 -02:00
Raphael S. Carvalho
8d11b0bbb4 database: do not store reference to sstable in incremental selector
Use the sstable generation instead to keep track of read sstables.
The motivation is to stop keeping references to sstables, allowing
their space on disk to be released as soon as they are exhausted.
The generation is used because it guarantees the uniqueness of the
sstable.
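Sketch of the idea (Python, invented names): track generations, which are plain unique integers, instead of holding sstable references.

```python
from collections import namedtuple

# Minimal stand-in for an sstable: a unique generation plus payload.
SSTable = namedtuple("SSTable", ["generation", "data"])

class IncrementalSelector:
    def __init__(self):
        # Generations (plain integers) instead of sstable references:
        # nothing here pins an sstable's on-disk files.
        self._read_generations = set()

    def mark_read(self, sst):
        self._read_generations.add(sst.generation)

    def was_read(self, sst):
        return sst.generation in self._read_generations
```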

Reviewed-by: Botond Dénes <bdenes@scylladb.com>

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-11-24 18:53:04 -02:00
Avi Kivity
775b7e41f4 Update seastar submodule
* seastar d59fcef...b924495 (2):
  > build: Fix protobuf generation rules
  > Merge "Restructure files" from Jesse

Includes fixup patch from Jesse:

"
Update Seastar `#include`s to reflect restructure

All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
2018-11-21 00:01:44 +02:00
Glauber Costa
9f403334c8 remove monitor if sstable write failed
In (almost) all SSTable write paths, we need to inform the monitor that
the write has failed as well. The monitor will remove the SSTable from
controller's tracking at that point.

Except there is one place where we are not doing that: streaming of big
mutations. This is an interesting use case because it is done in two
parts: if writing the SSTable fails right away, we do the correct
thing.

But the SSTables are not committed at that point, and the monitors are
still kept around with the SSTables until a later time, when they are
finally committed. Between those two points in time, it is possible that
the streaming code will detect a failure and manually call
fail_streaming_mutations(), which marks the SSTable for deletion. At
that point we should propagate that information to the monitor as well,
but we don't.

Fixes #3732 (hopefully)
Tests: unit (release)

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181114213618.16789-1-glauber@scylladb.com>
2018-11-20 16:15:12 +00:00
Piotr Sarna
de43b4f41d database: add a check if loaded sstable is already staging
Staging sstables are loaded before regular ones. If the process
fails midway, an sstable can be linked both in the regular directory
and in the staging directory. In such cases, the sstable remains
in staging and will be moved to the regular directory by the
view update streamer service.
2018-11-13 15:04:43 +01:00
Piotr Sarna
c825a17b9d table: move push_view_replica_updates to table.cc 2018-11-13 14:52:22 +01:00
Piotr Sarna
a17fcb8d94 database: add populating tables with staging sstables
After populating tables with regular sstables, the same procedure
is performed for staging sstables.
2018-11-13 14:52:22 +01:00
Piotr Sarna
19bf94fa8f database: add creating /staging directory for sstables
The staging directory is now created on boot.
2018-11-13 14:52:22 +01:00
Piotr Sarna
e42d97060f database: provide nonfrozen version of push_view_replica_updates
Now it's also possible to pass a mutation to push to view replicas.
2018-11-13 11:45:30 +01:00
Piotr Sarna
642c3ae0e0 database: add subdir param to make_streaming_sstable_for_write
This function allows specifying a subfolder to put a newly created
sstable in - e.g. staging/ subfolder for streamed base table mutations.
2018-11-13 11:45:30 +01:00
Piotr Sarna
8e053f9efb database: add staging sstables to a map
SSTables that belong to staging/ directory are put in the
_sstables_staging map.
2018-11-13 11:45:30 +01:00
Piotr Sarna
3f34312aa6 database: skip staging sstables in compaction
Staging sstables are not part of the compaction process, to ensure
that each sstable can be easily excluded from the view generation
process that depends on it.
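In sketch form (Python, invented names): candidate selection simply filters the staging set out of the compaction input.

```python
def compaction_candidates(all_sstables, staging):
    # Staging sstables are excluded from compaction so each one can be
    # dropped or promoted independently once its view updates are done.
    return [sst for sst in all_sstables if sst not in staging]
```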
2018-11-13 11:45:30 +01:00
Avi Kivity
a71ab365e3 toplevel: convert sprint() to format()
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().

Mechanically converted with https://github.com/avikivity/unsprint.
2018-11-01 13:16:17 +00:00
Botond Dénes
23f3831aaf table::make_streaming_reader(): add forwarding parameter
The single-range overload, when used by
make_multishard_streaming_reader(), has to create a reader that is
forwardable. Otherwise the multishard streaming reader will not produce
any output as it cannot fast-forward its shard readers to the ranges
produced by the generator.

Also add a unit test based on the real-life purpose the multishard
streaming reader was designed for: serving partitions from a shard
according to a sharding configuration that differs from the local one.
This is also the scenario that found the bug in the first place.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <bf799961bfd535882ede6a54cd6c4b6f92e4e1c1.1539235034.git.bdenes@scylladb.com>
2018-10-11 10:59:18 +03:00
Botond Dénes
4bb0bbb9e2 database: add make_multishard_streaming_reader()
Creates a streaming reader that reads from all shards. Shard readers are
created with `table::make_streaming_reader()`.
This is needed for the new row-level repair.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4b74c710bed2ef98adf07555a4c841e5b690dd8c.1538470782.git.bdenes@scylladb.com>
2018-10-09 11:07:47 +03:00
Botond Dénes
3eeb6fbd23 table::make_streaming_reader(): add single-range overload
This will be used by the `make_multishard_streaming_reader()` in the
next patch. This method will create a multishard combining reader which
needs its shard readers to take a single range, not a vector of ranges
like the existing overload.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <cc6f2c9a8cf2c42696ff756ed6cb7949b95fe986.1538470782.git.bdenes@scylladb.com>
2018-10-09 11:07:46 +03:00
Nadav Har'El
bebe5b5df2 materialized views: add view_updates_pending statistic
We already maintain a statistic of the number of pending view updates
sent but not yet completed by view replicas, so let's expose it.
Like all per-table statistics, this one will only be exposed if the
"--enable-keyspace-column-family-metrics" option is on.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-10-02 20:44:58 +01:00
Glauber Costa
c3f27784de database: guarantee a minimum amount of shares when manual operations are requested.
We have found issues when a flush is requested outside the usual
memtable flush loop: because there is not a lot of data, the
controller will not grant a high number of shares.

To prevent this, this patch guarantees a minimum number of shares
when external operations (nodetool flush, commitlog-driven flush, etc.)
are requested.

Another option would be to add shares instead of guaranteeing a
minimum. But in my view the approach taken here has two main advantages:

1) It won't cause spikes when those operations are requested.
2) It is cumbersome to add shares in the current infrastructure, as just
adding backlog can cause shares to spike. Consider this example:

  Backlog is within the first range of very low backlog (~0.2). Shares
  for this would be around ~20. If we want to add 200 shares, that is
  equivalent to a backlog of 0.8. Once we add those two backlogs
  together, we end up with 1 (max backlog).
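The floor semantics can be sketched as follows (Python; the linear `100 * backlog` curve is an invented stand-in for the real controller's mapping):

```python
def flush_shares(backlog, manual_op_pending, min_shares=100):
    """Shares granted to the flush controller (illustrative sketch).

    `backlog` is in [0, 1]; the linear curve below is a stand-in for
    the real controller's backlog-to-shares mapping.
    """
    shares = 100 * backlog               # invented controller curve
    if manual_op_pending:
        # A floor rather than an addition: adding shares would be
        # equivalent to adding backlog and could spike to the maximum.
        shares = max(shares, min_shares)
    return shares
```

With a floor, a manual flush at low backlog gets a workable share count without ever overshooting what a high backlog would already grant.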

Fixes #3761

Tests: unit (release)

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180927131904.8826-1-glauber@scylladb.com>
2018-09-27 15:20:31 +02:00
Avi Kivity
337ee6153a Merge "Support SSTables 3.x in Scylla runtime" from Vladimir and Piotr
"
This patchset makes it possible to use SSTables 'mc' format, commonly
referred to as 'SSTables 3.x', when running Scylla instance.

Several bugs found on this way are fixed. Also, a configuration option
is introduced to allow running Scylla either with 'mc' or 'la' format
as default.

Tests: unit (release)

+ tested Scylla with both 'la' and 'mc' formats to work fine:

cqlsh> CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> USE test;
cqlsh:test> CREATE TABLE cfsst3 (pk int, ck int, rc int, PRIMARY KEY (pk, ck)) WITH compression = {'sstable_compression': ''};
cqlsh:test> INSERT INTO cfsst3 (pk, ck, rc) VALUES ( 4, 7, 8);
    <<flush>>
cqlsh:test> DELETE from cfsst3 WHERE pk = 4 and ck > 3 and ck < 8;
    <<flush>>
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 2, 3);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 4, 6);
cqlsh:test> SELECT * FROM cfsst3 ;

 pk | ck | rc
----+----+------
  2 |  3 | null
  4 |  6 | null

(2 rows)
    <<Scylla restart>>
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 5, 7);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 6, 8);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 7, 9);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 8, 10);
cqlsh:test> SELECT * from cfsst3 ;

 pk | ck | rc
----+----+------
  5 |  7 | null
  8 | 10 | null
  2 |  3 | null
  4 |  6 | null
  7 |  9 | null
  6 |  8 | null

(6 rows)
"

* 'projects/sstables-30/try-runtime/v8' of https://github.com/argenet/scylla:
  database: Honour enable_sstables_mc_format configuration option.
  sstables: Support SSTables 'mc' format as a feature.
  db: Add configuration option for enabling SSTables 'mc' format.
  tests: Add test for reading a complex column with zero subcolumns (SST3).
  sstables: Fix parsing of complex columns with zero subcolumns.
  sstables: Explicitly cast api::timestamp_type to uint64_t when delta-encoding.
  sstables: Use parser_type instead of abstract_type::parse_type in column_translation.
  bytes: Add helper for turning bytes_view into sstring_view.
  sstables: Only forward the call to fast_forwarding_to in mp_row_consumer_m if filter exists.
  sstables: Fix string formatting for exception messages in m_format_read_helpers.
  sstables: Don't validate timestamps against the max value on parsing.
  sstables: Always store only min bases in serialization_header.
  sstables: Support 'mc' version parsing from filename.
  SST3: Make sure we call consume_partition_end
2018-09-26 11:10:07 +01:00
Vladimir Krivopalov
cd80d6ff65 database: Honour enable_sstables_mc_format configuration option.
Only enable SSTables 'mc' format if the entire cluster supports it and
it is enabled in the configuration file.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Raphael S. Carvalho
745e35fa82 database: Fix sstable resharding for mc format
SSTable format 'mc' doesn't write ancestors to metadata, so resharding
will not work with this new format, because it relies on ancestors to
replace old shared sstables with new unshared ones.
The fix is to not rely on ancestor metadata for this operation.

Fixes #3777.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180922211933.1987-1-raphaelsc@scylladb.com>
2018-09-25 18:37:48 +03:00
Botond Dénes
eb357a385d flat_mutation_reader: make timeout opt-out rather than opt-in
Currently timeout is opt-in; that is, all methods that even have it
default it to `db::no_timeout`. This means that ensuring the timeout is
used where it should be is completely up to the author and the
reviewers of the code. As humans are notoriously prone to mistakes,
this has resulted in very inconsistent usage of timeouts, with many
clients of `flat_mutation_reader` passing the timeout only to some
members and only at certain call sites. This is small wonder,
considering that some core operations like `operator()()` only recently
received a timeout parameter, and others like `peek()` didn't even have
one until this patch. Both of these methods call `fill_buffer()`, which
potentially talks to the lower layers and is supposed to propagate the
timeout. All this makes the `flat_mutation_reader`'s timeout
effectively useless.

To bring order to this chaos, make the timeout parameter mandatory on
all `flat_mutation_reader` methods that need it. This ensures that
humans now get a reminder from the compiler when they forget to pass
the timeout. Clients can still opt out of passing a timeout by passing
`db::no_timeout` (the previous default value), but this is now
explicit, and developers should think before typing it.
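The shape of the change, sketched in Python (invented stand-ins; in the real API the opt-out value is `db::no_timeout` and the error comes from the C++ compiler rather than the runtime):

```python
NO_TIMEOUT = float("inf")   # stand-in for db::no_timeout

def fill_buffer(timeout):
    # No default value: forgetting the timeout is now an error at the
    # call site instead of silently meaning "no timeout".
    return "filled with timeout={}".format(timeout)
```

Calling `fill_buffer()` with no argument now fails; the old behaviour is still available, but only on explicit request.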

There were surprisingly few core call sites to fix up. Where a timeout
was available nearby, I propagated it so it could be passed to the
reader; where I couldn't, I passed `db::no_timeout`. Authors of the
latter kind of code (view, streaming and repair are some of the notable
examples) should consider propagating a timeout down if needed.
In the test code (the vast majority of the changes) I just used
`db::no_timeout` everywhere.

Tests: unit(release, debug)

Signed-off-by: Botond Dénes <bdenes@scylladb.com>

Message-Id: <1edc10802d5eb23de8af28c9f48b8d3be0f1a468.1536744563.git.bdenes@scylladb.com>
2018-09-20 11:31:24 +02:00
Raphael S. Carvalho
5bc028f78b database: fix 2x increase in disk usage during cleanup compaction
Don't hold references to cleaned-up sstables, so that the file
descriptors for their index and data files will be closed and,
consequently, disk space released.

Fixes #3735.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180914194047.26288-1-raphaelsc@scylladb.com>
2018-09-17 17:26:46 +03:00
Botond Dénes
253407bdc8 multishard_mutation_query: add badness counters
Add badness counters that allow tracking problems. The following
counters are added:
1) multishard_query_unpopped_fragments
2) multishard_query_unpopped_bytes
3) multishard_query_failed_reader_stops
4) multishard_query_failed_reader_saves

The first pair of counters observe the amount of work range scan queries
have to undo on each page. It is normal for these counters to be
non-zero, however sudden spikes in their values can indicate problems.
This undoing of work is needed for stateful range-scans to work.
When stateful queries are enabled the `multishard_combining_reader` is
dismantled and all unconsumed fragments in its and any of its
intermediate reader's buffers are pushed back into the originating shard
reader's buffer (via `unpop_mutation_fragment()`). This also includes
the `partition_start`, the `static_row` (if there is one) and all
extracted and active `range_tombstone` fragments. Together this can
amount to a substantial number of fragments.
(1) counts the number of fragments moved back, while (2) counts the
number of bytes. Monitoring count and size separately allows for
detecting edge cases like moving many small fragments or just a few
huge ones. The counters count the fragments/bytes moved back to readers
located on the shard they belong to.

The second pair of counters is added to detect any problems around
saving readers. Since a failure to save a reader will not fail the
read itself, it is necessary to make these failures visible by
other means.
(3) counts the number of times stopping a shard reader (waiting
on pending read-aheads and next-partitions) failed while (4)
counts the number of times inserting the reader into the `querier_cache`
failed.
Unlike the first two counters, which will almost certainly never be
zero, these latter two should always be zero. Any other value
indicates problems in the respective shards/nodes.
2018-09-03 10:31:44 +03:00
Botond Dénes
5f726e9a89 querier: move all to query namespace
To avoid name clashes.
2018-09-03 10:31:44 +03:00
Glauber Costa
8dea1b3c61 database: fix directory for information when loading new SSTables from upload dir
When we load new SSTables, we use the directory information from the
entry descriptor to build information about those SSTables. When the
descriptor is created by flush_upload_dir, the sstable directory used in
the descriptor contains the `upload` part. Therefore, we will try to
load SSTables that are in the upload directory when we have already
moved them out, and fail.

Since the generation also changes, we have been historically fixing the
generation manually, but not the SSTable directory. The reason for that
is that up until recently, the SSTable directory was passed statically
to open_sstables, ignoring whatever the entry descriptor said. Now that
the sstable directory is also derived from the entry descriptor, we
should fix that too.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180829165326.12183-1-glauber@scylladb.com>
2018-08-30 10:34:25 +03:00
Paweł Dziepak
6f1c3e6945 Merge "Convert more execution_stages to inherit scheduling_groups" from Avi
"
Previous work (71471bb322) converted the CQL layer to inheriting
execution stages, paving the way to multiple users sharing the front-end.

This patchset does the same thing to the back-end, converting more execution
stages to preserve the caller's scheduling_group. Since RPC now (8c993e0728)
assigns the correct scheduling group within the replica, we can extend that
work so a statement is executed with the same scheduling group all the way
to sstable parsing, even if we cross nodes in the process. This improves
performance isolation and paves the way to multi-user SLA guarantees.
"

* tag 'inherit-sched_group/v1' of https://github.com/avikivity/scylla:
  database: make database's mutation apply stage inherit its scheduling group from the caller
  database: make database::_mutation_query_stage inherit the scheduling group
  database: make database::_data_query_stage inheriting its caller's scheduling_group
  storage_proxy: make _mutate_stage inherit its caller's scheduling_group
2018-08-28 13:49:31 +01:00
Tomasz Grabiec
2afce13967 database: Avoid OOM when soft pressure but nothing to flush
There could be soft pressure, but soft-pressure flusher may not be
able to make progress (Refs #3716). It will keep trying to flush empty
memtables, which block until earlier flushes complete, and thus
allocate continuations in memory. Those continuations accumulate in
memory and can cause OOM.

Meanwhile, real flushes take longer to complete and, due to scheduling
group isolation, the soft-pressure flusher will keep getting the CPU.

This causes bad_alloc and crashes of dtest:
limits_test.py:TestLimits.max_cells_test

Fixes #3717

Message-Id: <1535102520-23039-1-git-send-email-tgrabiec@scylladb.com>
2018-08-26 11:03:58 +03:00
Tomasz Grabiec
1e50f85288 database: Make soft-pressure memtable flusher not consider already flushed memtables
The flusher picks the memtable list which contains the largest region
according to region_impl::evictable_occupancy().total_space(), which
follows region::occupancy().total_space(). But only the latest
memtable in the list can start flushing. It can happen that the
memtable corresponding to the largest region was already flushed to an
sstable (flush permit released), but not yet fsynced or moved to
cache, so it's still in the memtable list.

The latest memtable in the winning list may be small, or empty, in
which case the soft pressure flusher will not be able to make much
progress. There could be other memtable lists with non-empty
(flushable) latest memtables. This can lead to writes unnecessarily
blocking on dirty.

I observed this for the system memtable group, where it's easy for the
memtables to overshoot small soft pressure limits. The flusher kept
trying to flush empty memtables, while the previous non-empty memtable
was still in the group.

The CPU scheduler makes this worse, because it runs memtable_to_cache
in a separate scheduling group, so it further defers in time the
removal of the flushed memtable from the memtable list.

This patch fixes the problem by making regions corresponding to
memtables which started flushing report evictable_occupancy() as 0, so
that they're picked by the flusher last.
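Sketch of the fix (Python, invented names): a memtable list whose latest memtable already started flushing reports zero evictable occupancy, so the flusher prefers lists that can still make progress.

```python
from collections import namedtuple

# Minimal stand-in for a memtable list: name, occupied bytes, and
# whether its latest memtable has already started flushing.
MemtableList = namedtuple("MemtableList", ["name", "size", "flushing"])

def evictable_occupancy(ml):
    # Already-flushing memtables report 0 so they are picked last.
    return 0 if ml.flushing else ml.size

def pick_flush_candidate(memtable_lists):
    return max(memtable_lists, key=evictable_occupancy)
```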

Fixes #3716.
Message-Id: <1535040132-11153-2-git-send-email-tgrabiec@scylladb.com>
2018-08-26 11:02:34 +03:00
Avi Kivity
37f9a3c566 database: make database's mutation apply stage inherit its scheduling group from the caller
Like the two preceding patches, convert the mutation apply stage
to an inheriting_concrete_scheduling_group.  This change has two
added benefits: we get rid of a thread_local, and we drop a
with_scheduling_group() inside an execution stage which just creates a bunch
of continuations and somewhat undoes the benefit of the execution stage.
2018-08-24 19:04:49 +03:00
Avi Kivity
ebff1cfc37 database: make database::_mutation_query_stage inherit the scheduling group
Like the preceding patch, and for the same reasons, adjust
database::_mutation_query_stage to inherit the scheduling group from its
caller.
2018-08-24 19:04:49 +03:00
Avi Kivity
596fb6f2f7 database: make database::_data_query_stage inheriting its caller's scheduling_group
Now (8c993e0728) that replica-side operations run under the correct
scheduling group, we can inherit the scheduling_group for _data_query_stage
from the caller.  By itself this doesn't do much, but it will later allow us
to have multiple groups for statement executions.
2018-08-24 19:04:49 +03:00
Rafi Einstein
c7f41c988f Add a counter to count large partition warning in compaction
Fixes #3562

Tests: dtest(compaction_test.py)
Message-Id: <20180807190324.82014-1-rafie@scylladb.com>
2018-08-07 20:15:09 +01:00
Avi Kivity
2d311c26b3 database: tag dirty memory managers with scheduling groups
dirty memory managers run code on behalf of their callers
in a background fiber, so provide that background fiber with
the scheduling group appropriate to their caller.

 - system: main (we want to let system writes through quickly)
 - dirty: statement (normal user writes)
 - streaming: streaming (streaming writes)
2018-07-31 13:18:21 +03:00
Avi Kivity
ef9b36376c Merge "database: support multiple data directories" from Glauber
"
While Cassandra supports multiple data directories, we have been
historically supporting just one. The one-directory model suits us
better because of the I/O Scheduler, and so far we have seen very few
requests, if any, to support this.

Still, the infrastructure needed to support multiple directories can be
beneficial so I am trying to bring this in.

For simplicity, we will treat the first directory in the list as the
main directory. By being able to still associate one singular directory
with a table, most of the code doesn't have to change and we don't have
to worry about how to distribute data between the directories.

In this design:
- We scan all data directories for existing data.
- resharding only happens within a particular data directory.
- snapshot details are accumulated with data for all directories that
  host snapshots for the tables we are examining
- snapshots are created with files in their own directories, but the
  manifest file goes to the main directory. Note that the same thing
  happens in Cassandra, except that there is no "main" directory; the
  manifest file is still just in one of them.
- SSTables are flushed into the main directory.
- Compactions write data into the main directory

Despite the restrictions, one example use of this is recovery. If we
have network-attached devices, for instance, we can quickly attach a
device to an existing node and make its data immediately available
while it is compacted back to main storage.

Tests: unit (release)
"

* 'multi-data-file-v2' of github.com:glommer/scylla:
  database: change ident
  database: support multiple data directories
  database: allow resharing to specify a directory
  database: support multiple directories in get_snapshot_details
  database: move get_snapshot_info into a seastar::thread
  snapshots: always create the snapshot directory
  sstables: pass sstable dir with entry descriptor
  database: make nodetool listsnapshots print correct information
  sstables: correctly create descriptors for snapshots
2018-07-15 13:31:04 +03:00
Asias He
6540051f77 database: Add add_sstable_and_update_cache
Since we can write mutations directly to sstables during streaming, we
need to add those sstables to the system so they can be seen by
queries. We also need to update the cache so queries reflect the latest
data.
2018-07-13 08:36:45 +08:00
Asias He
dfc2739625 database: Add make_streaming_sstable_for_write
This will be used to create an sstable for the streaming receiver, to
write the mutations received from the network to an sstable file
instead of writing to a memtable.
2018-07-13 08:36:45 +08:00
Avi Kivity
2f8537b178 database: demote "Setting compaction strategy" log message to debug level
It's not very helpful in normal operation, and generates much noise,
especially when there are many tables.
Message-Id: <20180708070051.8508-1-avi@scylladb.com>
2018-07-08 10:27:03 +01:00
Glauber Costa
82f7f7b36d database: change ident
Previous patches have used reviewer-oriented indentation. Re-indent.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 17:11:01 -04:00
Glauber Costa
99c8a1917f database: support multiple data directories
While Cassandra supports multiple data directories, we have been
historically supporting just one. The one-directory model suits us
better because of the I/O Scheduler, and so far we have seen very few
requests, if any, to support this.

Still, the infrastructure needed to support multiple directories can be
beneficial so I am trying to bring this in.

For simplicity, we will treat the first directory in the list as the
main directory. By being able to still associate one singular directory
with a table, most of the code doesn't have to change and we don't have
to worry about how to distribute data between the directories.

In this design:
 - We scan all data directories for existing data.
 - resharding only happens within a particular data directory.
 - snapshot details are accumulated with data for all directories that
   host snapshots for the tables we are examining
 - snapshots are created with files in their own directories, but the
   manifest file goes to the main directory. Note that the same thing
   happens in Cassandra, except that there is no "main" directory; the
   manifest file is still just in one of them.
 - SSTables are flushed into the main directory.
 - Compactions write data into the main directory

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:58:39 -04:00
Glauber Costa
3b46984a1e database: allow resharing to specify a directory
resharding assumes that all SSTables will be in cf->dir(), but in
reality we will soon have sstables in other places. We can specify a
directory in get_all_shared_sstables and pass that directory from the
resharding process.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:58:08 -04:00
Glauber Costa
c8b2d441a8 database: support multiple directories in get_snapshot_details
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:58:08 -04:00
Glauber Costa
a8ccf4d1e6 database: move get_snapshot_info into a seastar::thread
I am about to add another level of indentation and this code already
shifts right too much. It is not performance critical, so let's use a
thread for it. seastar::threads did not exist when this was first
written.

Also remove one unused continuation from inside the inner scan,
simplifying its code.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:58:08 -04:00
Glauber Costa
919c7d6bb9 snapshots: always create the snapshot directory
We currently don't always create the snapshot directory as an
optimization. We have a test at sync time handling this use case.

This works well when all SSTables are created in the same directory,
but if we have more than one data directory then it may not work if we
don't have SSTables in all of them.

We can fix it by unconditionally creating the directory.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:58:08 -04:00