688 Commits

Author SHA1 Message Date
Piotr Jastrzebski
7666e81b51 Decouple database.hh from types/user.hh
This commit declares shared_ptr<user_types_metadata> in
database.hh, where user_types_metadata is an incomplete type, so
it requires
"Allow to use shared_ptr with incomplete type other than sstable"
to compile correctly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-01-24 09:55:04 +01:00
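The decoupling described in 7666e81b51 can be sketched in a single file. This is only an illustration under stated assumptions: it uses `std::shared_ptr`, which accepts incomplete types out of the box, whereas the commit depends on a separate change to the shared_ptr it uses. `database_like` and `make_database_like` are hypothetical names, not Scylla code.

```cpp
#include <memory>
#include <string>
#include <vector>

// What a header such as database.hh can do: forward-declare the type
// instead of including the header that defines it.
class user_types_metadata;

// Hypothetical stand-in for the class that holds the pointer.
struct database_like {
    // Declaring a shared_ptr member only needs the forward declaration.
    std::shared_ptr<user_types_metadata> user_types;
};

// Only code that creates or dereferences the object needs the full
// definition, which would normally come from a separate header.
class user_types_metadata {
public:
    std::vector<std::string> names;
};

// Hypothetical factory; in a real build this would live in a source file
// that includes the full definition.
inline database_like make_database_like() {
    database_like db;
    db.user_types = std::make_shared<user_types_metadata>();
    db.user_types->names.push_back("address");
    return db;
}
```

Translation units that only pass the pointer around then no longer pay for parsing the full type definition.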
Piotr Jastrzebski
e92b4c3dbc Move user_type_impl out of types.hh to types/user.hh
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-01-24 09:04:04 +01:00
Rafael Ávila de Espíndola
f7d1dc16d4 database: Use nop_large_partition_handler to avoid self-reporting
Currently nop_large_partition_handler is only used in tests, but it
can also be used to avoid self-reporting.

Tests: unit(Release)

I also tested starting scylla with
--compaction-large-partition-warning-threshold-mb=0.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190123205059.39573-1-espindola@scylladb.com>
2019-01-23 21:11:21 +00:00
Botond Dénes
4e89dea9ea database: don't allow access to global semaphores
Recently we had a bug (#4096) due to a component
(`multishard_mutation_query()`) assuming that all reads used the
semaphore obtainable via `database::user_read_concurrency_sem()`.
This problem revealed that it is plain wrong to allow access to the
shard-global semaphores residing in the database object. Instead all
code wishing to access the relevant semaphore for some read, should do
so via the relevant `table` object, thus guaranteeing that it will get
the correct semaphore, configured for that table.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4f3a6780eb3240822db34aba7c1ba0a675a96592.1547734212.git.bdenes@scylladb.com>
2019-01-21 16:29:02 +02:00
Avi Kivity
fae4c6c0b6 database: merge for_all_partitions and for_all_partitions_slow
for_all_partitions is only used in the implementation of for_all_partitions_slow,
so merge them and get rid of a template.
2019-01-20 15:55:20 +02:00
Avi Kivity
6e6372e8d2 Revert "Merge "Type-erase gratuitous templates with functions" from Avi"
This reverts commit 31c6a794e9, reversing
changes made to 4537ec7426. It causes bad_function_calls
in some situations:

INFO  2019-01-20 01:41:12,164 [shard 0] database - Keyspace system: Reading CF sstable_activity id=5a1ff267-ace0-3f12-8563-cfae6103c65e version=d69820df-9d03-3cd0-91b0-c078c030b708
INFO  2019-01-20 01:41:13,952 [shard 0] legacy_schema_migrator - Moving 0 keyspaces from legacy schema tables to the new schema keyspace (system_schema)
INFO  2019-01-20 01:41:13,958 [shard 0] legacy_schema_migrator - Dropping legacy schema tables
INFO  2019-01-20 01:41:14,702 [shard 0] legacy_schema_migrator - Completed migration of legacy schema tables
ERROR 2019-01-20 01:41:14,999 [shard 0] seastar - Exiting on unhandled exception: std::bad_function_call (bad_function_call)
2019-01-20 11:32:14 +02:00
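For readers unfamiliar with the exception in the log above: `std::bad_function_call` is what the standard library throws when an empty `std::function` is invoked. The sketch below only demonstrates that behaviour; the revert message does not say where the empty function came from, and `throws_bad_function_call` is a hypothetical helper.

```cpp
#include <functional>

// Returns true if invoking `f` throws std::bad_function_call,
// which happens when `f` holds no target.
bool throws_bad_function_call(const std::function<void()>& f) {
    try {
        f();
    } catch (const std::bad_function_call&) {
        return true;
    }
    return false;
}
```

A template parameter cannot be "empty" in this way, so this failure mode only appears once a callable is type-erased.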
Tomasz Grabiec
31c6a794e9 Merge "Type-erase gratuitous templates with functions" from Avi
Many areas of the code are splattered with unneeded templates. This patchset replaces
some of them, where the template parameter is a function object, with an std::function
or noncopyable_function (with a preference towards the latter; but it is not always
possible). As the template is compiled for each instantiation (if the function
object is a lambda) while a function is compiled only once, there are significant
savings in compile time and bloat.

    text    data     bss      dec     hex filename
85160690   42120  284910 85487720 5187068 scylla.before
84824762   42120  284910 85151792 5135030 scylla.after

* https://github.com/avikivity/scylla detemplate/v1:
  api/commitlog: de-template acquire_cl_metric()
  database: de-template do_parse_schema_tables
  database: merge for_all_partitions and for_all_partitions_slow
  hints: de-template scan_for_hints_dirs()
  schema_tables: partially de-template make_map_mutation()
  distributed_loader: de-template
  tests: commitlog_test: de-template
  tests: cql_auth_query_test: de-template
  test: de-template eventually() and eventually_true()
  tests: flush_queue_test: de-template
  hint_test: de-template
  tests: mutation_fragment_test: de-template
  test: mutation_test: de-template
2019-01-18 11:42:01 +01:00
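The trade-off described in 31c6a794e9 can be illustrated with a small sketch. It uses `std::function` only; `noncopyable_function` is a Seastar type and is not shown here. The function names `sum_if_template` and `sum_if_erased` are hypothetical and do not come from the patchset.

```cpp
#include <functional>
#include <vector>

// Template version: a separate copy is instantiated for every distinct
// callable type, and each lambda has its own type.
template <typename Func>
int sum_if_template(const std::vector<int>& values, Func pred) {
    int total = 0;
    for (int v : values) {
        if (pred(v)) {
            total += v;
        }
    }
    return total;
}

// Type-erased version: compiled once, whatever callable is passed in.
// The cost is an indirect call per invocation of `pred`.
int sum_if_erased(const std::vector<int>& values,
                  const std::function<bool(int)>& pred) {
    int total = 0;
    for (int v : values) {
        if (pred(v)) {
            total += v;
        }
    }
    return total;
}
```

Both versions return the same result for the same predicate; the difference is in how many times the body is compiled.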
Avi Kivity
f61dbc9855 database: merge for_all_partitions and for_all_partitions_slow
for_all_partitions is only used in the implementation of for_all_partitions_slow,
so merge them and get rid of a template.
2019-01-17 18:50:36 +02:00
Botond Dénes
4537ec7426 multishard_mutation_query(): use correct reader concurrency semaphore
The multishard mutation query used the semaphore obtained from
`database::user_read_concurrency_sem()` to pause-resume shard readers.
This presented a problem when `multishard_mutation_query()` was reading
from system tables. In this case the readers themselves would obtain
their permits from the system read concurrency semaphore. Since the
pausing of shard readers used the user read semaphore, pausing failed to
fulfill its objective of alleviating pressure on the semaphore the reads
obtained their permits from. In some cases this led to a deadlock
during system reads.
To ensure the correct semaphore is used for pausing-resuming readers,
obtain the semaphore from the `table` object. To avoid looking up the
table on every pause or resume call, cache the semaphores when readers
are created.

Fixes: #4096

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <c784a3cd525ce29642d7216fbe92638fa7884e88.1547729119.git.bdenes@scylladb.com>
2019-01-17 15:19:59 +02:00
Piotr Sarna
8a5aaf2839 database: add pending streams phaser
This phaser will be used later to wait for all existing stream sessions
to finish before proceeding with view building.
2019-01-15 09:36:55 +01:00
Avi Kivity
391d1e0fe0 table: const correctness for table::get_sstables() and related
Do not allow write access to the sstable list via this accessor. Luckily
there are no violations, and now we enforce it.
Message-Id: <20190111151049.16953-1-avi@scylladb.com>
2019-01-11 17:39:17 +01:00
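A read-only accessor of the kind described in 391d1e0fe0 might look like the sketch below. The names `table_like` and `sstable_list` are simplified stand-ins chosen for illustration, and `std::shared_ptr` is an assumption; the real signature of `table::get_sstables()` is not given in the commit message.

```cpp
#include <memory>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Simplified stand-in for the list of sstables owned by a table.
using sstable_list = std::vector<std::string>;

class table_like {
    std::shared_ptr<sstable_list> _sstables = std::make_shared<sstable_list>();
public:
    // Callers get a pointer to a const list: they can read and iterate,
    // but attempts to modify the list fail to compile.
    std::shared_ptr<const sstable_list> get_sstables() const { return _sstables; }

    // Mutation stays behind the class's own interface.
    void add_sstable(std::string name) { _sstables->push_back(std::move(name)); }
};

// The compiler enforces the contract: the accessor yields a const list.
static_assert(std::is_same_v<
    decltype(std::declval<const table_like&>().get_sstables()),
    std::shared_ptr<const sstable_list>>);
```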
Raphael S. Carvalho
1b7cad3531 database: Fix race condition in sstable snapshot
Race condition takes place when one of the sstables selected by snapshot
is deleted by compaction. Snapshot fails because it tries to link a
sstable that was previously unlinked by compaction's sstable deletion.

Fixes #4051.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190110194048.26051-1-raphaelsc@scylladb.com>
2019-01-11 07:53:14 +02:00
Duarte Nunes
fa2b0384d2 Replace std::experimental types with C++17 std version.
Replace stdx::optional and stdx::string_view with the C++ std
counterparts.

Some instances of boost::variant were also replaced with std::variant,
namely those that called seastar::visit.

Scylla now requires GCC 8 to compile.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190108111141.5369-1-duarte@scylladb.com>
2019-01-08 13:16:36 +02:00
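The C++17 types named in fa2b0384d2 can be seen together in a short sketch. It uses `std::visit` in place of `seastar::visit` so that it compiles with the standard library alone; `value_part`, `cell` and `describe` are hypothetical names invented for the example.

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <type_traits>
#include <variant>

// Returns the text after the first ':' or std::nullopt if there is none.
std::optional<std::string_view> value_part(std::string_view s) {
    auto pos = s.find(':');
    if (pos == std::string_view::npos) {
        return std::nullopt;
    }
    return s.substr(pos + 1);
}

// A value that is either an int or a string.
using cell = std::variant<int, std::string>;

// Dispatches on the alternative currently held by the variant.
std::string describe(const cell& c) {
    return std::visit([](const auto& v) -> std::string {
        using T = std::decay_t<decltype(v)>;
        if constexpr (std::is_same_v<T, int>) {
            return "int:" + std::to_string(v);
        } else {
            return "string:" + v;
        }
    }, c);
}
```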
Avi Kivity
c180a18dbb Distribute distributed_loader into its own header and source files
distributed_loader is a sizeable fraction of database.cc, so moving it
out reduces compile time and improves readability.
Message-Id: <20181230200926.15074-1-avi@scylladb.com>
2018-12-31 14:27:27 +02:00
Tomasz Grabiec
7747f2dde3 Merge "nodetool toppartitions" from Rafi & Avi
Implementation of the nodetool toppartitions query, which samples the most frequent PKs in read/write
operations over a period of time.

Content:
- data_listener classes: mechanism that interfaces with mutation readers in database and table classes,
- toppartition_query and toppartition_data_listener classes to implement toppartition-specific query (this
  interfaces with data_listeners and the REST api),
- REST api for toppartitions query.

Uses Top-k structure for handling stream summary statistics (based on implementation in C*, see #2811).

What's still missing:
- JMX interface to nodetool (interface customization may be required),
- Querying #rows and #bytes (currently, only #partitions is supported).

Fixes #2811

* https://github.com/avikivity/scylla rafie_toppartitions_v7.1:
  top_k: whitespace and minor fixes
  top_k: map template arguments
  top_k: std::list -> chunked_vector
  top_k: support for appending top_k results
  nodetool toppartitions: refactor table::config constructor
  nodetool toppartitions: data listeners
  nodetool toppartitions: add data_listeners to database/table
  nodetool toppartitions: fully_qualified_cf_name
  nodetool toppartitions: Toppartitions query implementation
  nodetool toppartitions: Toppartitions query REST API
  nodetool toppartitions: nodetool-toppartitions script
2018-12-28 16:31:24 +01:00
Rafi Einstein
0bffe5f83e nodetool toppartitions: add data_listeners to database/table
Add data_listeners member to database.
Adds data_listeners* to table::config, to be used by table methods to invoke listeners.
Install on_read() listener in table::make_reader().
Install on_write() listener in database::apply_in_memory().

Tests: Unit (release)
Signed-off-by: Rafi Einstein <rafie@scylladb.com>
2018-12-28 16:45:57 +02:00
Rafi Einstein
038f8c7988 nodetool toppartitions: refactor table::config constructor
Eliminate extra parameters to ctor and deduce them instead from db param.

Signed-off-by: Rafi Einstein <rafie@scylladb.com>
2018-12-28 16:45:57 +02:00
Avi Kivity
74c1afad29 database: provide accessor to db::extensions
Rather than forcing callers to go through get_config(), provide a
direct accessor. This reduces dependencies on config.hh, and will
allow separation of extensions from configuration.
2018-12-21 20:15:43 +00:00
Duarte Nunes
2174eed640 database: Expose current memory view update backlog
Expose the base replica's current memory view update backlog, which is
defined in terms of units consumed from the semaphore.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-19 22:38:30 +00:00
Duarte Nunes
fc9176e784 database: Wait on view update semaphore for view building
View building sends view updates synchronously, which has natural
backpressure. However, they

1) Contribute to the load on the view replicas, and;
2) Add memory pressure to the base replica.

They should thus count towards the current view update backlog, and
consume units from the view update concurrency semaphore.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-19 22:38:30 +00:00
Duarte Nunes
86198060e5 database: generate_and_propagate_view_updates no longer needs a timeout
We no longer wait on the semaphore and instead over-subscribe it, so
there's no reason to pass a timeout.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-19 22:38:29 +00:00
Duarte Nunes
39eda68094 database: Don't generate view updates when node is overloaded
We arrive at an overloaded state when we fail to acquire semaphore
units in the base replica. This can mean clients are working in
interactive mode, we fail to throttle them and consequently should
start shedding load. We want to avoid impacting base table
availability by running out of memory, so we could offload the memory
queue to disk by writing the view updates as hints without attempting
to send them. However, the disk is also a limited resource and in
extreme cases we won’t be able to write hints. A tension exists
between forgetting the view updates, thereby opening up a window for
inconsistencies between base and view, or failing the base replica
write. The latter can fail the whole user write, or if the
coordinator was able to achieve CL, can instead cause inconsistencies
between base tables (we wouldn't want to store a hint, because if the
base replica is still overloaded, we would redo the whole dance).

Between the devil and the deep blue sea, we chose to forget view
updates. As a further simplification, we don't even write hints,
assuming that if clients can’t be throttled (as we'll attempt to do in
future patches), it will only be a matter of time before view updates
can’t be offloaded. We also start acquiring the semaphore units using
consume(), which is non-blocking, but allows for underflow of the
available semaphore units. This is okay, and we expect not to underflow
by much, as we stop generating new view updates.

Refs #2538

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-19 22:38:29 +00:00
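One way to read the policy in 39eda68094 is sketched below: units are taken without waiting, the counter is allowed to go negative, and no new view updates are generated while the budget is exhausted. This is an interpretation of the commit message, not the Scylla implementation. `view_update_budget` and `maybe_generate_view_update` are hypothetical, and the class is a single-threaded stand-in rather than a real semaphore.

```cpp
#include <cstdint>

// Hypothetical, single-threaded stand-in for a semaphore whose consume()
// never waits and may drive the available units below zero.
class view_update_budget {
    std::int64_t _available;
public:
    explicit view_update_budget(std::int64_t units) : _available(units) {}

    // Non-blocking: always succeeds, even if the result is negative.
    void consume(std::int64_t units) { _available -= units; }

    // Returns units once a view update has been sent or dropped.
    void signal(std::int64_t units) { _available += units; }

    std::int64_t available() const { return _available; }

    // Assumption: the node counts as overloaded once no units are left.
    bool overloaded() const { return _available <= 0; }
};

// Returns true if the view update was generated, false if it was shed.
bool maybe_generate_view_update(view_update_budget& budget,
                                std::int64_t update_size) {
    if (budget.overloaded()) {
        return false;  // forget the view update instead of queueing it
    }
    budget.consume(update_size);  // may underflow, but only by one update
    return true;
}
```

Because generation stops as soon as the budget is exhausted, the underflow in this sketch is bounded by the size of the last accepted update.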
Duarte Nunes
1f1fc36b72 database: Make view update concurrency semaphore memory-based
The semaphore currently limiting the amount of view updates a given
base replica emits aims to control the load that is imposed on the
cluster, to protect view replicas from being overloaded when there
are bursts of traffic (especially for degenerate cases like an index
with low selectivity).

100 is, however, an arbitrary number. It might allow too much load on
the view replicas, and it might also allow too much memory from the
base shard to be consumed. Conversely, it might allow for too few
updates to be queued in case of a burst, or to absorb updates while a
view replica becomes partitioned.

To deal with the load that is inflicted on the cluster, future patches
will ensure that the rate of base writes obeys the rate at which the
slowest view replica can consume the corresponding view updates.

To protect the current shard from using too much memory for this
queue, we will limit it to 10% of the shard's memory. The goal is to
both protect the shard from being overloaded, but also to allow it to
absorb bursts of writes resulting in large view mutations.

Refs #2538

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-19 22:38:29 +00:00
Botond Dénes
76fe4ebc18 Move cache_temperature into its own header
Some headers need to include database.hh just because of
cache_temperature. Move it into its own header so these includes can be
removed.
2018-12-12 16:03:45 +02:00
Avi Kivity
414b14a6bd Merge "Make inactive shard readers evictable" from Botond
"
This series attempts to solve the regressions recently discovered in
performance of multi-partition range-scans. Namely that they:
* Flood the reader concurrency semaphore's queues, trampling other
  reads.
* Behave very badly when too many of them are running concurrently
  (thrashing).
* May deadlock if enough of them are running without a timeout.

The solution for these problems is to make inactive shard readers
evictable. This should address all three issues listed above, to varying
degrees:
* Shard readers will now not cling onto their permits for the entire
  duration of the scan, which might be a lot of time.
* Will be less affected by infinite concurrency (more than the node can
  handle) as each scan now can make progress by evicting inactive shard
  readers belonging to other scans.
* Will not deadlock at all.

In addition to the above fix, this series also bundles two further
improvements:
* Add a mechanism to `reader_concurrency_semaphore` to be notified of
  newly inserted evictables.
* General cleanups and fixes for `multishard_combining_reader` and
  `foreign_reader`.

I can unbundle these mini series and send them separately, if the
maintainers so prefer, although considering that this series will have to
be backported to 3.0, I think this present form is better.

Fixes: #3835
"

* 'evictable-inactive-shard-readers/v7' of https://github.com/denesb/scylla: (27 commits)
  tests/multishard_mutation_query_test: test stateless query too
  tests/querier_cache: fail resource-based eviction test gracefully
  tests/querier_cache: simplify resource-based eviction test
  tests/mutation_reader_test: add test_multishard_combining_reader_next_partition
  tests/mutation_reader_test: restore indentation
  tests/mutation_reader_test: enrich pause-related multishard reader test
  multishard_combining_reader: use pause-resume API
  query::partition_slice: add clear_ranges() method
  position_in_partition: add region() accessor
  foreign_reader: add pause-resume API
  tests/mutation_reader_test: implement the pause-resume API
  query_mutations_on_all_shards(): implement pause-resume API
  make_multishard_streaming_reader(): implement the pause-resume API
  database: add accessors for user and streaming concurrency semaphores
  reader_lifecycle_policy: extend with a pause-resume API
  query_mutations_on_all_shards(): restore indentation
  query_mutations_on_all_shards(): simplify the state-machine
  multishard_combining_reader: use the reader lifecycle policy
  multishard_combining_reader: add reader lifecycle policy
  multishard_combining_reader: drop unnecessary `reader_promise` member
  ...
2018-12-04 10:22:35 +02:00
Botond Dénes
bf0d1f4eea database: add accessors for user and streaming concurrency semaphores
These will soon be needed to register inactive user and streaming reads
with the respective semaphores.
2018-12-04 08:51:05 +02:00
Benny Halevy
857ff4f59a database: directly use std::experimental::filesystem::path for lister::path
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2018-12-02 22:02:10 +02:00
Piotr Sarna
1336b9ee31 database: add is_internal_keyspace
Similarly to is_system_keyspace, it will allow checking if a keyspace
is created for internal use.
2018-11-28 09:21:56 +01:00
Avi Kivity
775b7e41f4 Update seastar submodule
* seastar d59fcef...b924495 (2):
  > build: Fix protobuf generation rules
  > Merge "Restructure files" from Jesse

Includes fixup patch from Jesse:

"
Update Seastar `#include`s to reflect restructure

All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
2018-11-21 00:01:44 +02:00
Glauber Costa
9f403334c8 remove monitor if sstable write failed
In (almost) all SSTable write paths, we need to inform the monitor that
the write has failed as well. The monitor will remove the SSTable from
controller's tracking at that point.

Except there is one place where we are not doing that: streaming of big
mutations. Streaming of big mutations is an interesting use case, in
which it is done in 2 parts: if the writing of the SSTable fails right
away, then we do the correct thing.

But the SSTables are not committed at that point and the monitors are
still kept around with the SSTables until a later time, when they are
finally committed. Between those two points in time, it is possible that
the streaming code will detect a failure and manually call
fail_streaming_mutations(), which marks the SSTable for deletion. At
that point we should propagate that information to the monitor as well,
but we don't.

Fixes #3732 (hopefully)
Tests: unit (release)

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181114213618.16789-1-glauber@scylladb.com>
2018-11-20 16:15:12 +00:00
Piotr Sarna
d7849e6ea4 database: add get_staging_sstable method
This method can be used to check whether an sstable is staging,
i.e. it shouldn't be compacted and it will not be used
for generating view updates from other staging tables. If it is,
the method returns the proper shared_sstable pointer.
2018-11-13 15:04:43 +01:00
Piotr Sarna
348fa3b092 table: add stream_view_replica_updates
Generating view replica updates during streaming ignores
the staging sstable that is used to generate them.
2018-11-13 14:52:22 +01:00
Piotr Sarna
fed9c59eb8 table: split push_view_replica_updates
push_view_replica_updates is split in order to allow different
mutation source to be provided.
2018-11-13 14:52:22 +01:00
Piotr Sarna
466d780445 table: add as_mutation_source_excluding
A variant of table::as_mutation_source that allows excluding
a single sstable is added.
2018-11-13 14:52:22 +01:00
Piotr Sarna
e88b85134c database: add sstable-excluding reader
When generating view updates from a staging sstable, this sstable
should not be used in the process. Hence, a reader that skips a single
sstable is added.
2018-11-13 14:52:22 +01:00
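The idea behind the sstable-excluding reader in e88b85134c, reading from every sstable except one, reduces to a filter over the set of sources. The sketch below shows only that filtering step, with sstables represented by their names; `sstables_excluding` is a hypothetical helper and no reader machinery is modelled.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Returns every sstable in `all` except `excluded`, preserving order.
std::vector<std::string> sstables_excluding(const std::vector<std::string>& all,
                                            const std::string& excluded) {
    std::vector<std::string> out;
    std::copy_if(all.begin(), all.end(), std::back_inserter(out),
                 [&](const std::string& s) { return s != excluded; });
    return out;
}
```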
Piotr Sarna
160a6d58d2 table: add move_sstable_from_staging_in_thread function
After materialized view updates are generated, the sstable
should be moved from staging/ to a regular directory.
It's expected to be called from seastar::async thread context.
2018-11-13 11:45:30 +01:00
Piotr Sarna
e42d97060f database: provide nonfrozen version of push_view_replica_updates
Now it's also possible to pass a mutation to push to view replicas.
2018-11-13 11:45:30 +01:00
Piotr Sarna
642c3ae0e0 database: add subdir param to make_streaming_sstable_for_write
This function allows specifying a subfolder to put a newly created
sstable in - e.g. staging/ subfolder for streamed base table mutations.
2018-11-13 11:45:30 +01:00
Piotr Sarna
701d88e39f database: add staging sstables map
In order to keep track of staging sstables (used for mv updates),
a map of them is now kept in table class.
2018-11-13 11:45:30 +01:00
Botond Dénes
23f3831aaf table::make_streaming_reader(): add forwarding parameter
The single-range overload, when used by
make_multishard_streaming_reader(), has to create a reader that is
forwardable. Otherwise the multishard streaming reader will not produce
any output as it cannot fast-forward its shard readers to the ranges
produced by the generator.

Also add a unit test that is based on the real-life purpose the
multishard streaming reader was designed for: serving partitions
from a shard according to a sharding configuration that is different
from the local one. This is also the scenario that found the bug in the
first place.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <bf799961bfd535882ede6a54cd6c4b6f92e4e1c1.1539235034.git.bdenes@scylladb.com>
2018-10-11 10:59:18 +03:00
Botond Dénes
4bb0bbb9e2 database: add make_multishard_streaming_reader()
Creates a streaming reader that reads from all shards. Shard readers are
created with `table::make_streaming_reader()`.
This is needed for the new row-level repair.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4b74c710bed2ef98adf07555a4c841e5b690dd8c.1538470782.git.bdenes@scylladb.com>
2018-10-09 11:07:47 +03:00
Botond Dénes
3eeb6fbd23 table::make_streaming_reader(): add single-range overload
This will be used by the `make_multishard_streaming_reader()` in the
next patch. This method will create a multishard combining reader which
needs its shard readers to take a single range, not a vector of ranges
like the existing overload.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <cc6f2c9a8cf2c42696ff756ed6cb7949b95fe986.1538470782.git.bdenes@scylladb.com>
2018-10-09 11:07:46 +03:00
Raphael S. Carvalho
745e35fa82 database: Fix sstable resharding for mc format
SStable format mc doesn't write ancestors to metadata, so resharding
will not work with this new format because it relies on ancestors to
replace new unshared sstables with old shared ones.
Fix is about not relying on ancestors metadata for this operation.

Fixes #3777.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180922211933.1987-1-raphaelsc@scylladb.com>
2018-09-25 18:37:48 +03:00
Botond Dénes
253407bdc8 multishard_mutation_query: add badness counters
Add badness counters that allow tracking problems. The following
counters are added:
1) multishard_query_unpopped_fragments
2) multishard_query_unpopped_bytes
3) multishard_query_failed_reader_stops
4) multishard_query_failed_reader_saves

The first pair of counters observe the amount of work range scan queries
have to undo on each page. It is normal for these counters to be
non-zero, however sudden spikes in their values can indicate problems.
This undoing of work is needed for stateful range-scans to work.
When stateful queries are enabled the `multishard_combining_reader` is
dismantled and all unconsumed fragments in its and any of its
intermediate reader's buffers are pushed back into the originating shard
reader's buffer (via `unpop_mutation_fragment()`). This also includes
the `partition_start`, the `static_row` (if there is one) and all
extracted and active `range_tombstone` fragments. This together can
amount to a substantial amount of fragments.
(1) counts the amount of fragments moved back, while (2) counts the
number of bytes. Monitoring size and quantity separately allows for
detecting edge cases like moving many small fragments or just a few huge
ones. The counters count the fragments/bytes moved back to readers
located on the shard they belong to.

The second pair of counters are added to detect any problems around
saving readers. Since the failure to save a reader will not fail the
read itself, it is necessary to add visibility to these failures by
other means.
(3) counts the number of times stopping a shard reader (waiting
on pending read-aheads and next-partitions) failed while (4)
counts the number of times inserting the reader into the `querier_cache`
failed.
Contrary to the first two counters, which will almost certainly never be
zero, these latter two counters should always be zero. Any other value
indicates problems in the respective shards/nodes.
2018-09-03 10:31:44 +03:00
Botond Dénes
97364c7ad9 database: add query_mutations_on_all_shards()
This method allows for querying a range or ranges on all shards of the
node. Under the hood it uses the multishard_combining_reader for
executing the query.
It supports paging and stateful queries (saving and reusing the readers
between pages). All this is transparent to the client, who only needs to
supply the same query::read_command::query_uuid through the pages of the
query (and supply correct start positions on each page, that match the
stop position of the last page).
2018-09-03 10:31:44 +03:00
Botond Dénes
5f726e9a89 querier: move all to query namespace
To avoid name clashes.
2018-09-03 10:31:44 +03:00
Avi Kivity
37f9a3c566 database: make database's mutation apply stage inherit its scheduling group from the caller
Like the two preceding patches, convert the mutation apply stage
to an inheriting_concrete_scheduling_group.  This change has two
added benefits: we get rid of a thread_local, and we drop a
with_scheduling_group() inside an execution stage which just creates a bunch
of continuations and somewhat undoes the benefit of the execution stage.
2018-08-24 19:04:49 +03:00
Avi Kivity
596fb6f2f7 database: make database::_data_query_stage inheriting its caller's scheduling_group
Now (8c993e0728) that replica-side operations run under the correct
scheduling group, we can inherit the scheduling_group for _data_query_stage
from the caller.  By itself this doesn't do much, but it will later allow us
to have multiple groups for statement executions.
2018-08-24 19:04:49 +03:00
Avi Kivity
ef9b36376c Merge "database: support multiple data directories" from Glauber
"
While Cassandra supports multiple data directories, we have been
historically supporting just one. The one-directory model suits us
better because of the I/O Scheduler and so far we have seen very few
requests -- if any, to support this.

Still, the infrastructure needed to support multiple directories can be
beneficial so I am trying to bring this in.

For simplicity, we will treat the first directory in the list as the
main directory. By being able to still associate one singular directory
with a table, most of the code doesn't have to change and we don't have
to worry about how to distribute data between the directories.

In this design:
- We scan all data directories for existing data.
- resharding only happens within a particular data directory.
- snapshot details are accumulated with data for all directories that
  host snapshots for the tables we are examining
- snapshots are created with files in its own directories, but the
  manifest file goes to the main directory. For this one, note that in
  Cassandra the same thing happens, except that there is no "main"
  directory. Still, the manifest file is in just one of them.
- SSTables are flushed into the main directory.
- Compactions write data into the main directory

Despite the restrictions, one example of usage of this is recovery.  If
we have network attached devices for instance, we can quickly attach a
network device to an existing node and make the data immediately
available as it is compacted back to main storage.

Tests: unit (release)
"

* 'multi-data-file-v2' of github.com:glommer/scylla:
  database: change ident
  database: support multiple data directories
  database: allow resharding to specify a directory
  database: support multiple directories in get_snapshot_details
  database: move get_snapshot_info into a seastar::thread
  snapshots: always create the snapshot directory
  sstables: pass sstable dir with entry descriptor
  database: make nodetool listsnapshots print correct information
  sstables: correctly create descriptors for snapshots
2018-07-15 13:31:04 +03:00
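The "first directory is the main directory" rule from ef9b36376c can be sketched as a small helper. `main_data_directory` is a hypothetical name, and throwing on an empty list is an assumption made for the example; the merge message does not describe that case.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Returns the directory that receives flushes and compaction output:
// by convention, the first entry in the configured list.
const std::string& main_data_directory(const std::vector<std::string>& dirs) {
    if (dirs.empty()) {
        // Assumption: an empty configuration is treated as an error here.
        throw std::invalid_argument("no data directories configured");
    }
    return dirs.front();
}
```

All other directories in the list are still scanned for existing data, but new SSTables go only to the directory this helper returns.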
Asias He
6540051f77 database: Add add_sstable_and_update_cache
Since we can write mutations to sstable directly in streaming, we need
to add those sstables to the system so they can be seen by the query.
Also we need to update the cache so the query reflects the latest data.
2018-07-13 08:36:45 +08:00