Commit Graph

296 Commits

Author SHA1 Message Date
Avi Kivity
c96fc1d585 Merge "Introduce row level repair" from Asias
"
=== How the the partition level repair works

- The repair master decides which ranges to work on.
- The repair master splits the ranges to sub ranges which contains around 100
partitions.
- The repair master computes the checksum of the 100 partitions and asks the
related peers to compute the checksum of the 100 partitions.
- If the checksum matches, the data in this sub range is synced.
- If the checksum mismatches, repair master fetches the data from all the peers
and sends back the merged data to peers.

=== Major problems with partition level repair

- A mismatch of a single row in any of the 100 partitions causes 100
partitions to be transferred. A single partition can be very large. Not to
mention the size of 100 partitions.

- Checksum (find the mismatch) and streaming (fix the mismatch) will read the
same data twice

=== Row level repair

Row level checksum and synchronization: detect row level mismatch and transfer
only the mismatch

=== How the row level repair works

- To solve the problem of reading data twice

Read the data only once for both checksum and synchronization between nodes.

We work on a small range which contains only a few mega bytes of rows,
We read all the rows within the small range into memory. Find the
mismatch and send the mismatch rows between peers.

We need to find a sync boundary among the nodes which contains only N bytes of
rows.

- To solve the problem of sending unnecessary data.

We need to find the mismatched rows between nodes and only send the delta.
The problem is called set reconciliation problem which is a common problem in
distributed systems.

For example:
Node1 has set1 = {row1, row2, row3}
Node2 has set2 = {      row2, row3}
Node3 has set3 = {row1, row2, row4}

To repair:
Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3.
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2
Node1 sends row3 (set1 + set2 + set3 - set3) to Node3.

=== How to implement repair with set reconciliation

- Step A: Negotiate sync boundary

class repair_sync_boundary {
    dht::decorated_key pk;
    position_in_partition position
}

Reads rows from disk into row buffers until the size is larger than N
bytes. Return the repair_sync_boundary of the last mutation_fragment we
read from disk. The smallest repair_sync_boundary of all nodes is
set as the current_sync_boundary.

- Step B: Get missing rows from peer nodes so that repair master contains all the rows

Request combined hashes from all nodes between last_sync_boundary and
current_sync_boundary. If the combined hashes from all nodes are identical,
data is synced, goto Step A. If not, request the full hashes from peers.

At this point, the repair master knows exactly what rows are missing. Request the
missing rows from peer nodes.

Now, local node contains all the rows.

- Step C: Send missing rows to the peer nodes

Since local node also knows what peer nodes own, it sends the missing rows to
the peer nodes.

=== How the RPC API looks like

- repair_range_start()

Step A:
- request_sync_boundary()

Step B:
- request_combined_row_hashes()
- reqeust_full_row_hashes()
- request_row_diff()

Step C:
- send_row_diff()

- repair_range_stop()

=== Performance evaluation

We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instance. We
created a keyspace with a replication factor of 3 and inserted 1 billion
rows to each of the 3 nodes. Each node has 241 GiB of data.
We tested 3 cases below.

1) 0% synced: one of the node has zero data. The other two nodes have 1 billion identical rows.

Time to repair:
   old = 87 min
   new = 70 min (rebuild took 50 minutes)
   improvement = 19.54%

2) 100% synced: all of the 3 nodes have 1 billion identical rows.
Time to repair:
   old = 43 min
   new = 24 min
   improvement = 44.18%

3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows.

Time to repair:
   old: 211 min
   new: 44 min
   improvement: 79.15%

Bytes sent on wire for repair:
   old: tx= 162 GiB,  rx = 90 GiB
   new: tx= 1.15 GiB, tx = 0.57 GiB
   improvement: tx = 99.29%, rx = 99.36%

It is worth noting that row level repair sends and receives exactly the
number of rows needed in theory.

In this test case, repair master needs to receives 2 million rows and
sends 4 million rows. Here are the details: Each node has 1 billion *
0.1% distinct rows, that is 1 million rows. So repair master receives 1
million rows from repair slave 1 and 1 million rows from repair slave 2.
Repair master sends 1 million rows from repair master and 1 million rows
received from repair slave 1 to repair slave 2. Repair master sends
sends 1 million rows from repair master and 1 million rows received from
repair slave 2 to repair slave 1.

In the result, we saw the rows on wire were as expected.

tx_row_nr  = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000
rx_row_nr  =  500233 + 500235 +  499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000

Fixes: #3033

Tests: dtests/repair_additional_test.py
"

* 'asias/row_level_repair_v7' of github.com:cloudius-systems/seastar-dev: (51 commits)
  repair: Enable row level repair
  repair: Add row_level_repair
  repair: Add docs for row level repair
  repair: Add repair_init_messaging_service_handler
  repair: Add repair_meta
  repair: Add repair_writer
  repair: Add repair_reader
  repair: Add repair_row
  repair: Add fragment_hasher
  repair: Add decorated_key_with_hash
  repair: Add get_random_seed
  repair: Add get_common_diff_detect_algorithm
  repair: Add shard_config
  repair: Add suportted_diff_detect_algorithms
  repair: Add repair_stats to repair_info
  repair: Introduce repair_stats
  flat_mutation_reader:  Add make_generating_reader
  storage_service: Introduce ROW_LEVEL_REPAIR feature
  messaging_service: Add RPC verbs for row level repair
  repair: Export the repair logger
  ...
2018-12-25 13:13:00 +02:00
Duarte Nunes
6df32bfb0c main: Start and stop the view_update_backlog_broker
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-19 22:38:30 +00:00
Duarte Nunes
776fdd4d1a service/storage_proxy: Expose local view update backlog
The local view update backlog is the max backlog out of the relative
memory backlog size and the relative hints backlog size.

We leverage the db::view::node_update_backlog class so we can send the
max backlog out of the node's shards.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-19 22:38:30 +00:00
Asias He
b9e0db801d repair: Enable row level repair
Finally, enable new row level repair if the cluster supports it. If not,
fallback to the old partition level repair.

Fixes #3033
2018-12-12 16:49:01 +08:00
Avi Kivity
89be47e291 batchlog_manager: remove dependency on db::config
Extract configuration into a new struct batchlog_manager_config and have the
callers populate it using db::config. This reduces dependencies on global objects.
2018-12-09 20:11:38 +02:00
Tomasz Grabiec
6012a63660 Merge "Fix window during init where waiting for a feature can be ignored" from Avi
storage_service keeps a bunch of "feature" variables, indicating cluster-wide
supported features, and has the ability to wait until the entire cluster supports
a given feature.

The propagation of features depends on gossip, but gossip is initialized after
storage_service, so the current code late-initializes the features. However, that
means that whoever waits on a feature between storage_service initialization and
gossip initialization loses their wait entry. In #3952, we have proof that this
in fact happens.

Fix this by removing the circular dependency. We now store features in a new
service, feature_service, that is started before both gossip and storage_service.
Gossip updates feature_service while storage_service reads for it.

Fixes #3953.

* https://github.com/avikivity/3953/v4.1:
  storage_service: deinline enable_all_features()
  gossiper: keep features registered
  tests/gossip: switch to seastar::thread
  storage_service: deinline init/deinit functions
  gossiper: split feature storage into a new feature_service
  gossiper: maybe enable features after start_gossiping()
  storage_service: fix gap when feature::when_enabled() doesn't work
2018-12-06 15:42:26 +01:00
Avi Kivity
4e553b692e gossiper: split feature storage into a new feature_service
Feature lifetime is tied to storage_service lifetime, but features are now managed
by gossip. To avoid circular dependency, add a new feature_service service to manage
feature lifetime.

To work around the problem, the current code re-initializes features after
gossip is initialized. This patch does not fix this problem; it only makes it
possible to solve it by untyping features from gossip.
2018-12-06 16:31:04 +02:00
Glauber Costa
fee4d2eb9b compaction_manager: delay initialization of the compaction manager.
If the compaction manager is started, compactions may start (this is
regardless of whether or not we trigger them). The problem with that is
that they start at a time in which we are flushing the commitlog and the
initialization procedure waits for the commitlog to be fully flushed and
the resulting memtables flushed before we move on.

Because there are no incoming writes, the amount of shares in memtable
flushes decrease as memory used decreases and that can cause the startup
procedure to take a long time.

We have recently started to bump the shares manually for manual flushes.
While that guarantees that we will not drive the shares to zero, I will
make the argument that we can do better by making sure that those things
are, at this point, running alone: user experience is affected by
startup times and the bump we give to user-triggered operations will
only do so much. Even if we increase the shares a lot flushes will still
be fighting for resources with compactions and startup will take longer
than it could.

By making sure that flushes are this point running alone we improve the
user experience by making sure the startup is as fast as it can be.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-12-04 13:48:42 -05:00
Piotr Sarna
6ab8235369 main: fix deinitialization order for view update generator
View update generator should be stopped only after
drain_on_shutdown() is performed on storage service.
Message-Id: <4d2bda4c73422a2ebf46d6dcd06c95d960839889.1543230849.git.sarna@scylladb.com>
2018-11-26 11:21:37 +00:00
Avi Kivity
775b7e41f4 Update seastar submodule
* seastar d59fcef...b924495 (2):
  > build: Fix protobuf generation rules
  > Merge "Restructure files" from Jesse

Includes fixup patch from Jesse:

"
Update Seastar `#include`s to reflect restructure

All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
2018-11-21 00:01:44 +02:00
Piotr Sarna
16c042039c main: add registering staging sstables read from disk
Staging sstables read from disk are registered to the view update
generator right after initializing non system keyspaces.

Fixes #3275
2018-11-13 15:04:43 +01:00
Piotr Sarna
dc74887ff3 streaming: add system distributed keyspace ref to streaming
Streaming code needs system distributed keyspace to check if streamed
sstables should be staging, so a proper reference is added.
2018-11-13 15:01:53 +01:00
Piotr Sarna
7ef5e1b685 streaming: add view update generator reference to streaming
Streaming code may need view update generator service to generate
and send view updates, so a proper reference is added.
2018-11-13 15:01:53 +01:00
Piotr Sarna
eb0c507a45 main: add generating missed mv updates from staging sstables
If any sstables are found in the staging directory, it means that
they missed generating view updates, so it's performed now.
2018-11-13 15:01:53 +01:00
Avi Kivity
a71ab365e3 toplevel: convert sprint() to format()
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().

Mechanically converted with https://github.com/avikivity/unsprint.
2018-11-01 13:16:17 +00:00
Vlad Zolotarov
aca0882a3f hinted handoff: enable storing hints before starting messaging_service
When messaging_service is started we may immediately receive a mutation
from another node (e.g. in the MV update context). If hinted handoff is not
ready to store hints at that point we may fail some of MV updates.

We are going to resolve this by start()ing hints::managers before we
start messaging_service and blocking hints replaying until all relevant
objects are initialized.

Refs #3828

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-10-18 16:49:58 -04:00
Avi Kivity
e7ae4beef0 main: run prometheus and API servers under streaming group
Both the Prometheus and the API servers are used for maintenance
operations, similarly to streaming. Run them under the streaming
scheduling group to prevent them from impacting normal operations,
and rename the streaming scheduling group to reflect the more
generic role.

This helps to prevent spikes from Prometheus or API requests from
interfering with the normal workload. Using an existing group is
preferable to creating a new group because in the worst case, all
the non-main-workload groups compete with the main workload.
Consolidating them allows us to give them significant shares in
total without increasing competition in the worst case.

The group's label is unchanged to preserve compatibility with
dashboards.

A nice side effect is that repair, which is initiated by API calls,
gets placed into the maintenance group naturally. Compaction tasks
which are run by compaction manager are not changed.
Message-Id: <20180714160723.23655-1-avi@scylladb.com>
2018-07-30 15:07:33 +01:00
Avi Kivity
8c993e0728 messaging: tag RPC services with scheduling groups
Assign a scheduling_group for each RPC service. Assignement is
done by connection (get_rpc_client_idx()) - all verbs on the
same connection are assigned the same group. While this may seem
arbitrary, it avoids priority inversion; if two verbs on the same
connection have different scheduling groups, the verb with the low
shares may cause a backlog and stall the connection, including
following requests from verbs that ought to have higher shares.

The scheduling_group parameters are encapsulated in different
classes as they are passed around to avoid adding dependencies.
Message-Id: <20180708140433.6426-1-avi@scylladb.com>
2018-07-13 13:57:08 +02:00
Vlad Zolotarov
c65a110839 main: remove the "experimental" tag from the hinted handoff feature
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-07-06 19:19:40 -04:00
Vlad Zolotarov
83ba6d84a1 db::hints::manager: implement rebalance() method
Rebalance hints segments that need to be sent among all present shards.

Ensure that after rebalancing the difference between the number of segments
of any two shards is not greater than 1.

Try to minimize the amount of "file rename" operations in order to achieve the needed result.

Note: "Resharding" is a particular case of rebalancing.

Tests: dtest: hintedhandoff_additional_test.py:TestHintedHandoff.hintedhandoff_rebalance_test

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-07-06 19:18:46 -04:00
Avi Kivity
f55a2fe3a7 main: improve reporting of dns resolution errors
A report that C-Ares returned some errors tells the user nothing.

Improve the error message by including the name of the configuration
variable and its value.
Message-Id: <20180705084959.10872-1-avi@scylladb.com>
2018-07-05 10:24:41 +01:00
Tomasz Grabiec
074be4d4e8 memtable, cache: Run mutation_cleaner worker in its own scheduling group
The worker is responsible for merging MVCC snapshots, which is similar
to merging sstables, but in memory. The new scheduling group will be
therefore called "memory compaction".

We should run it in a separate scheduling group instead of
main/memtables, so that it doesn't disrupt writes and other system
activities. It's also nice for monitoring how much CPU time we spend
on this.
2018-06-27 21:51:04 +02:00
Avi Kivity
ea39e3e9d4 main: start client protocol servers under the statement scheduling group
This will isolate client protocol and coordinator-side processing from
the rest of the system.
2018-06-18 18:30:21 +03:00
Gleb Natapov
da20d86423 Configure authorized_prepared_statment_cache memory limit during object creation 2018-06-11 15:34:14 +03:00
Gleb Natapov
b38ced0fcd Configure logalloc memory size during initialization 2018-06-11 15:34:14 +03:00
Gleb Natapov
646e400918 Provide available memory size to messaging_service object during creation 2018-06-11 15:34:13 +03:00
Gleb Natapov
ac88935baa Provide available memory size to storage_proxy object during creation 2018-06-11 15:34:13 +03:00
Gleb Natapov
f41575a156 Provide available memory size to database object during creation 2018-06-11 15:34:13 +03:00
Gleb Natapov
461f20e7b1 Configure prepared_statements_cache memory limit from outside
Pass desirable memory limit during construction instead of querying
memory size explicitly.
2018-06-11 15:34:13 +03:00
Piotr Sarna
f12fdcffdb storage_proxy: restore optional hinted handoff
Since hinted handoff for materialized views is now a separate entity,
regular hinted handoff can go back to being optional.
2018-06-04 09:46:06 +02:00
Piotr Sarna
a791dce0ae db, config: add view_pending_updates directory
Hints for materialized view updates need to be kept somewhere,
because their dedicated hints manager has to have a root directory.
view_pending_updates directory resides in /data and is used
for that purpose.
2018-06-04 09:46:06 +02:00
Piotr Sarna
b7ac2da238 main: initialize hints manager unconditionally
This commit makes sure that hints manager is always initialized,
including creating hints directories and starting it.
It needs to be fixed because hints manager is internally used
to store failed materialized view replicas.

Fixes #3451
Message-Id: <44532fd3704e20cabeb9c4985dace5650fd22d2c.1527018865.git.sarna@scylladb.com>
2018-05-22 22:21:50 +01:00
Glauber Costa
3d2c4c1cf8 main: change I/O scheduler verification code
Before we accept running while not in developer mode, we verify that
the I/O Scheduler is properly configured. Up until now, that meant
verifying that --max-io-requests is properly set and that the number
of I/O Queues is enough to leave at least 4 requests per I/O Queue.

Systems that move to newer versions of Scylla may continue doing that,
so we need to be backwards compatible and keep testing for that.
However, newer systems will not set that option, but pass a YAML
property file (or string) instead. So we need to make sure that
either one of those is set.

If the property file is set, I am deciding here not to test for
number of I/O queues. scylla_io_setup will usually configure that
anyway, plus we plan on soon moving to all-shards-dispatch making
that less important.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180509163737.5907-1-glauber@scylladb.com>
2018-05-13 19:22:54 +03:00
Vlad Zolotarov
48c96d09d6 db::hints::manager: drain hints when the node is decommissioned/removed
When node is decommissioned/removed it will drain all its hints and all
remote nodes that have hints to it will drain their hints to this node.

What "drain" means? - The node that "drains" hints to a specific
destination will ignore failures and will continue sending hints till the end
of the current segment, erase it and move to the next one till there are
no more segments left.

After all hints are drained the corresponding hints directory is removed.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Duarte Nunes
bf5045c7eb db/view: Require configuration option to enable view building
View building, enabled by default, can contain or expose issues that
prevent the node from starting. In those cases, it is necessary to
disable view building such that the node can be submitted to
maintenance operations.

Fixes #3329

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-03 13:16:28 +01:00
Glauber Costa
a9ef72537f parse and ignore background writer controller
Unused options are not exposed as command line options and will prevent
Scylla from booting when present, although they can still be pased over
YAML, for Cassandra compatibility.

That has never been a problem, but we have been adding options to i3
(and others) that are now deprecated, but were previously marked as
Used. Systems with those options may have issues upgrading.

While this problem is common to all Unused options, the likelihood for
any other unused option to appear in the command line is near zero,
except for those two - since we put them there ourselves.

There are two ways to handle this issue:

1) Mark them as Used, and just ignore them.
2) Add them explicitly to boost program options, and then ignore them.

The second option is preferred here, because we can add them as hidden
options in program_options, meaning they won't show up in the help. We
can then just print a discrete message saying that those options are,
for now on ignored.

v2: mark set as const (Botond)
v3: rebase on top of master, identation suggested by Duarte.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180329145517.8462-1-glauber@scylladb.com>
2018-03-29 17:57:30 +03:00
Duarte Nunes
a21efeffa0 db/view/view_builder: React to schema changes
The view_builder now uses the migration_manager to subscribe to schema
change events, and update its bookkeeping accordingly. We prefer this
to having the database call into the view_builder, as that would
create a cyclic dependency.

We serialize changes to the views of a particular base table, such
that schema changes do not interfere with the upcoming view building
code.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
901faabaa2 db/view: Introduce view_builder
This patch introduces the view_builder class, a sharded service
responsible for building all defined materialized views. This process
entails walking over the existing data in a given base table, and using
it to calculate and insert the respective entries for one or more views.

This patch introduces only the bootstrap functionality, which is
responsible for loading the data stored in the system tables and
filling the in-memory data structures with the relevant information,
to be used in subsequent patches for the actual view building. The
interaction with the system tables is as follows.

Interaction with the tables in system_keyspace:
  - When we start building a view, we add an entry to the
    scylla_views_builds_in_progress system table. If the node restarts
    at this point, we'll consider these newly inserted views as having
    made no progress, and we'll treat them as new views;
  - When we finish a build step, we update the progress of the views
    that we built during this step by writing the next token to the
    scylla_views_builds_in_progress table. If the node restarts here,
    we'll start building the views at the token in the next_token
    column.
  - When we finish building a view, we mark it as completed in the
    built views system table, and remove it from the in-progress system
    table. Under failure, the following can happen:
        * When we fail to mark the view as built, we'll redo the last
          step upon node reboot;
        * When we fail to delete the in-progress record, upon reboot
          we'll remove this record.
    A view is marked as completed only when all shards have finished
    their share of the work, that is, if a view is not built, then all
    shards will still have an entry in the in-progress system table;
  - A view that a shard finished building, but not all other shards,
    remains in the in-progress system table, with first_token ==
    next_token.

Interaction with the distributed system table (view_build_status):
  - When we start building a view, we mark the view build as being
    in-progress;
  - When we finish building a view, we mark the view as being built.
    Upon failure, we ensure that if the view is in the in-progress
    system table, then it may not have been written to this table. We
    don't load the built views from this table when starting. When
    starting, the following happens:
         * If the view is in the system.built_views table and not the
           in-progress system table, then it will be in view_build_status;
         * If the view is in the system.built_views table and not in
           this one, it will still be in the in-progress system table -
           we detect this and mark it as built in this table too,
           keeping the invariant;
         * If the view is in this table but not in system.built_views,
           then it will also be in the in-progress system table - we
           don't detect this and will redo the missing step, for
           simplicity.

View building is necessarily a sharded process. That means that on
restart, if the number of shards has changed, we need to calculate
the most conservative token range that has been built, and build
the remainder.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
ff15068a41 service/storage_service: Allow querying the view build status
This patch adds support for the nodetool viewbuildstatus command,
which shows the progress of a materialized view build across the
cluster.

A view can be absent from the result, successfully built, or
currently being built.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Avi Kivity
16a7650873 Merge "More extensions: commitlog + system tables" from Calle
"
Additional extension points.

* Allows wrapping commitlog file io (including hinted handoff).
* Allows system schema modification on boot, allowing extensions
  to inject extensions into hardcoded schemas.

Note: to make commitlog file extensions work, we need to both
enforce we can be notified on segment delete, and thus need to
fix the old issue of hard ::unlink call in segment destructor.
Segment delete is therefore moved to a batch routine, run at
intervals/flush. Replay segments and hints are also deleted via
the commitlog object, ensuring an extension is notified (metadata).

Configurable listeneres are now allowed to inject configuration
object into the main config. I.e. a local object can, either
by becoming a "configurable" or manually, add references to
self-describing values that will be parsed from the scylla.yaml
file, effectively extending it.

All these wonderful abstractions courtesy of encryption of course.
But super generalized!
"

* 'calle/commitlog_ext' of github.com:scylladb/seastar-dev:
  db::extensions: Allow extensions to modify (system) schemas
  db::commitlog: Add commitlog/hints file io extension
  db::commitlog: Do segment delete async + force replay delete go via CL
  main/init: Change configurable callbacks and calls to allow adding opts
  util::config_file: Add "add" config item overload
2018-03-26 16:18:22 +03:00
Calle Wilund
2bc98aebaf db::commitlog: Do segment delete async + force replay delete go via CL
Refs #2858

Push segement files to be deleted to a pending list, and process at
intervals or flush-requests (or shutdown). Note that we do _not_
indescrimenately do deletes in non-anchored tasks, because we need
to guarantee that finshed segments are fully deleted and gone on CL
shutdown, not to be mistaken for replayables.

Also make sure we delete segments replayed via commitlog call,
so IFF we add metadata processing for CL, we can clear it out.
2018-03-26 11:58:27 +00:00
Glauber Costa
9188059427 database: group statements in their own scheduling group
When we introduced the CPU scheduler, we have also introduced a group
for commitlog - but never used it. There is also doubtful value in
separating reads from writes, since they are often part of the same
workload.

To accomodate for that, let's rename the query group to "statement"
(query is not incorrect, just confusing), and move the write path,
currently ungrouped, inside it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-20 16:58:36 -04:00
Calle Wilund
eb10d32ff9 main/init: Change configurable callbacks and calls to allow adding opts
Refs #2526

Allows sub-configs to dynamically add yaml/command line options to
the main config object, i.e. extend the scylla.yaml
2018-03-19 12:24:04 +00:00
Benoît Canet
1d0cc7cf20 messaging_service: Start messaging service earlier
The messaging service was completely started
after a bootstraping node finished to join hence
leading to #2034.

Fixes #2034
Message-Id: <20180313084500.27265-1-amnon@scylladb.com>
2018-03-13 10:59:53 +02:00
Avi Kivity
d973445a94 Merge "sstable/schema extensions" from Calle
"
Adds extension points to schema/sstables to enable hooking in
stuff, like, say, something that modifies how sstable disk io
works. (Cough, cough, *encryption*)

Extensions are processed as property keywords in CQL. To add
an extension, a "module" must register it into the extensions
object on boot time. To avoid globals (and yet don't),
extensions are reachable from config (and thus from db).

Table/view tables already contain an extension element, so
we utilize this to persist config.

schema_tables tables/views from mutations now require a "context"
object (currently only extensions, but abstracted for easier
further changes.

Because of how schemas currently operate, there is a super
lame workaround to allow "schema_registry" access to config
and by extension extensions. DB, upon instansiation, calls
a thread local global "init" in schema_registry and registers
the config. It, in turn, can then call table_from_mutations
as required.

Includes the (modified) patch to encapsulate compression
into objects, mainly because it is nice to encapsulate, and
isolate a little.
"

* 'calle/extensions-v5' of github.com:scylladb/seastar-dev:
  extensions: Small unit test
  sstables: Process extensions on file open
  sstables::types: Add optional extensions attribute to scylla metadata
  sstables::disk_types: Add hash and comparator(sstring) to disk_string
  schema_tables: Load/save extensions table
  cql: Add schema extensions processing to properties
  schema_tables: Require context object in schema load path
  schema_tables: Add opaque context object
  config_file_impl: Remove ostream operators
  main/init: Formalize configurables + add extensions to init call
  db::config: Add extensions as a config sub-object
  db::extensions: Configuration object to store various extensions
  cql3::statements::property_definitions: Use std::variant instead of any
  sstables: Add extension type for wrapping file io
  schema: Add opaque type to represent extensions
  sstables::compress/compress: Make compression a virtual object
2018-02-26 17:15:29 +02:00
Avi Kivity
432268f582 Merge "branch 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla" from Raphael
"The motivation is that it's no longer needed after new resharding
algorithm that is the sole responsible for working with shared
sstables and regular compaction will not work with those!
So resharding will schedule deletion of shared sstables once it's
certain that shards that own them have the new unshared sstables.
The manager was needed for orchestrating deletion of shared sstable
across shards. It brings extra complexity that's not longer needed,
and it was also overloading shard 0, but the latter could have
been fixed.

Tests:
- unit: release mode
- dtest: resharding_test.py"

* 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla:
  Remove SSTable's atomic deletion manager
  Stop using SSTable's atomic deletion manager
  database: split column_family::rebuild_sstable_list
2018-02-08 19:10:16 +02:00
Tomasz Grabiec
cce1a2bce8 Merge "Use the CPU scheduler" from Glauber & Avi
In this patchset I am resubmitting Avi's enablement of the CPU scheduler
in his behalf. I've done a ton of testing in the series and there are
some improvements / changes that I had previously sent as a separate series.

What you see here is the result of merging that work.

After this patchset is applied, workloads are smoother and we are able to
uphold the pre-defined shares among the various actors.

We also finally have everything we need to merge the CPU and I/O controllers.
After that is done the code is now much simpler. But also, as a bonus,
controllers that were previously available for I/O only (compactions) are
enabled for CPU as well.

* git@github.com:glommer/scylla.git cpusched-v7:

Avi Kivity (4):
  database, sstables, compaction: convert use of thread_scheduling_group
    to seastar cpu scheduler
  memtable, database: make memtable::clear_gently() inherit
    scheduling_group
  config: mark background_writer_scheduling_quota as Unused
  database: place data_query execution stage into scheduling_group

Glauber Costa (9):
  database, main: set up scheduling_groups for our main tasks
  row_cache: actually use the scheduling group for update_cache
  allow update_cache and clear_gently to use the entire task quota.
  database: remove cpu_flush_quota metric
  controllers: retire auto_adjust_flush_quota
  controllers: allow memtable I/O controller to have shares statically
    set
  controllers: update control points for memtable I/O controller
  controllers: allow a static priority to override the controller output
  controllers: unify the I/O and CPU controllers
2018-02-08 15:58:40 +01:00
Raphael S. Carvalho
1472cfcc19 Stop using SSTable's atomic deletion manager
The motivation is that it's no longer needed after new resharding
algorithm that is the sole responsible for working with shared
sstables and regular compaction will not work with those!
So resharding will schedule deletion of shared sstables once it's
certain that shards that own them have the new unshared sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-02-07 22:27:17 -02:00
Glauber Costa
956af9f099 database, main: set up scheduling_groups for our main tasks
Set up scheduling groups for streaming, compaction, memtable flush, query,
and commitlog.

The background writer scheduling group is retired; it is split into
the memtable flush and compaction groups.

Comments from Glauber:

This patch is based in a patch from Avi with the same subject, but the
differences are signficant enough so that I reset authorship. In
particular:

1) A bug/regression is fixed with the boundary calculations for the
   memtable controller sampling function.
2) A leftover is removed, where after flushing a memtable we would
   go back to the main group before going to the cache group again
3) As per Tomek's suggestion, now the submission of compactions
   themselves are run in the compaction scheduling group. Having that
   working is what changes this patch the most: we now store the
   scheduling group in the compaction manager and let the compaction
   manager itself enforce the scheduling group.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Avi Kivity
641aaba12c database, sstables, compaction: convert use of thread_scheduling_group to seastar cpu scheduler
thread_scheduling_groups are converted to plain scheduling_group. Due to
differences in initialization (scheduling_group initializtion defers), we
create the scheduling_groups in main.cc and propagate them to users via
a new class database_config.

The sstable writer loses its thread_scheduling_group parameter and instead
inherits scheduling from its caller.

Since shares are in the 1-1000 range vs. 0-1 for thread scheduling quotas,
the flush controller was adjusted to return values within the higher ranges.
2018-02-07 17:19:29 -05:00