1093 Commits

Author SHA1 Message Date
Piotr Sarna
8b43ac3a57 hints: reserve more space for dedicated storage
Reserving 10% of space for hints managers makes sense if the device
is shared with other components (like /data or /commitlog).
But, if hints directory is mounted on a dedicated storage, it makes
sense to reserve much more - 90% was chosen as a sane limit.
Whether storage is 'dedicated' or not is based on a simple check
if given hints directory is a mount point.

Fixes #3516

Signed-off-by: Piotr Sarna <sarna@scylladb.com>
2018-06-22 10:27:00 +02:00
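The rule described in this commit can be sketched as a small pure function. Only the 10% and 90% figures come from the commit message; the function and constant names are hypothetical and are not taken from the Scylla source.

```cpp
#include <cstdint>

// Hypothetical constants: only the 10% / 90% values come from the commit message.
constexpr std::uint64_t shared_reserve_percent = 10;
constexpr std::uint64_t dedicated_reserve_percent = 90;

// Returns how many bytes hints may occupy on a device of the given capacity.
// `is_dedicated` is expected to be the result of a mount-point check.
std::uint64_t hints_space_limit(std::uint64_t capacity_bytes, bool is_dedicated) {
    const std::uint64_t percent =
        is_dedicated ? dedicated_reserve_percent : shared_reserve_percent;
    // Divide first so the multiplication cannot overflow for large devices.
    return capacity_bytes / 100 * percent;
}
```

Dividing before multiplying trades a little precision (at most 99 bytes) for overflow safety.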
Piotr Sarna
32f86ca61e hints: add is_mountpoint function
A helper function that checks whether a path is also a mount point
is added.

Signed-off-by: Piotr Sarna <sarna@scylladb.com>
2018-06-22 10:26:52 +02:00
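The commit does not say how the check is implemented. One common POSIX heuristic, shown here purely as an assumption, compares the device of a directory with the device of its parent. It does not detect bind mounts within the same filesystem.

```cpp
#include <sys/stat.h>
#include <string>

// Sketch of a mount-point check (an assumed approach, not Scylla's code):
// a directory is treated as a mount point when it sits on a different device
// than its parent, or when it is its own parent (the filesystem root).
bool is_mountpoint(const std::string& path) {
    struct stat self{};
    struct stat parent{};
    const std::string parent_path = path + "/..";
    if (::stat(path.c_str(), &self) != 0 ||
        ::stat(parent_path.c_str(), &parent) != 0) {
        // Assumption: paths that cannot be inspected are reported as "not a mount point".
        return false;
    }
    return self.st_dev != parent.st_dev || self.st_ino == parent.st_ino;
}
```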
Piotr Sarna
b6c1b8c5ef hints: make space_watchdog device-aware
Instead of having one static space limit for all directories,
space_watchdog now keeps a per-device limit, shared among
hints managers residing on the same disks.

References #3516

Signed-off-by: Piotr Sarna <sarna@scylladb.com>
2018-06-22 10:26:45 +02:00
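A per-device limit shared by several managers can be pictured as a map keyed by device ID. The class below is a hypothetical stand-in for that bookkeeping; the names and the byte-based accounting are assumptions, and it is not the actual space_watchdog.

```cpp
#include <sys/types.h>
#include <cstdint>
#include <map>

// Hypothetical bookkeeping: one limit and one usage counter per device,
// shared by every hints manager whose directory lives on that device.
class per_device_limits {
    struct usage {
        std::uint64_t limit = 0;
        std::uint64_t used = 0;
    };
    std::map<dev_t, usage> _devices;
public:
    void set_limit(dev_t device, std::uint64_t limit_bytes) {
        _devices[device].limit = limit_bytes;
    }
    // Called once per manager; usage from managers on the same device adds up.
    void add_usage(dev_t device, std::uint64_t bytes) {
        _devices[device].used += bytes;
    }
    // Unknown devices are never reported as over their limit.
    bool over_limit(dev_t device) const {
        auto it = _devices.find(device);
        return it != _devices.end() && it->second.used > it->second.limit;
    }
};
```

Two managers that each use 60 bytes on a device limited to 100 bytes push that device over its limit together, even though neither does so alone.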
Piotr Sarna
d22668de04 hints: add device_id to manager
In order to make space_watchdog device-aware, device_id field
is added to hints manager. It's an equivalent of stat.st_dev
and it identifies the disk that contains manager's root directory.

Signed-off-by: Piotr Sarna <sarna@scylladb.com>
2018-06-22 10:26:37 +02:00
Piotr Sarna
91b5e33c6a hints: add get_device_id function
In order to distinguish which directories reside on which devices,
get_device_id function is added to resource manager.

Signed-off-by: Piotr Sarna <sarna@scylladb.com>
2018-06-22 10:25:47 +02:00
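Since the device ID is described as an equivalent of stat.st_dev, a lookup could look roughly like the sketch below. The function body and the error handling are assumptions; the real get_device_id may differ.

```cpp
#include <sys/stat.h>
#include <sys/types.h>
#include <stdexcept>
#include <string>

// Sketch: return the ID of the device that holds `path`, as reported by stat(2).
// Throwing on failure is an assumption made for this example.
dev_t get_device_id(const std::string& path) {
    struct stat st{};
    if (::stat(path.c_str(), &st) != 0) {
        throw std::runtime_error("stat failed for " + path);
    }
    return st.st_dev;
}
```

Two directories reside on the same device exactly when the returned values compare equal.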
Glauber Costa
290d553c3a compaction_strategy: allow the user to tell us if min_threshold has to be strict
Now that we have the controller, we would like to take min_threshold as
a hint. If there is nothing to compact, we can ignore that and start
compacting less than min_threshold SSTables so that the backlog keeps
reducing.

But there are cases in which we don't want min_threshold to be a hint
and we want to enforce it strictly. For instance, if write amplification
is more of a concern than space amplification.

This patch adds a YAML option that allows the user to tell us that. We will
default to false, meaning min_threshold is not strictly enforced.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-15 13:42:43 -04:00
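The difference between treating min_threshold as a hint and enforcing it strictly can be sketched as follows. The function name, its parameters, and the rule that at least two SSTables are needed for a merge are assumptions made for illustration.

```cpp
#include <cstddef>

// Hypothetical decision helper, not the actual compaction strategy code.
// `strict` mirrors the new option: false (the default) means min_threshold is a hint.
bool should_compact(std::size_t candidate_sstables,
                    std::size_t min_threshold,
                    bool strict) {
    if (candidate_sstables >= min_threshold) {
        return true;  // the threshold is met in either mode
    }
    // Below the threshold: only compact in hint mode, and only if there is
    // something to merge (assumed here to mean at least two SSTables).
    return !strict && candidate_sstables >= 2;
}
```

With strict enforcement, three SSTables and a threshold of four wait for a fourth. In hint mode they are compacted right away, which lowers space amplification at the cost of extra writes.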
Piotr Sarna
6b3a97e34a hints: fix max_shard_disk_space_size initialization
Previously max_shard_disk_space_size was unconditionally initialized
with the capacity of hints_directory. But, it's likely that
hints_directory doesn't exist at all if hinted handoff is not enabled,
which results in Scylla failing to boot.
So, max_shard_disk_space_size is now initialized with the capacity
of hints_for_views directory, which is always present.
This commit also moves max_shard_disk_space_size to the .cc file
where it belongs - resource_manager.cc.

Tests: unit (release)

Message-Id: <9f7b86b6452af328c05c5c6c55bfad3382e12445.1528977363.git.sarna@scylladb.com>
2018-06-14 14:24:01 +01:00
Gleb Natapov
cdf1289b43 Provide available memory size to hinted handoff resource manager during creation
2018-06-11 15:34:13 +03:00
Gleb Natapov
cc47f6c69d Provide available memory size to commitlog during creation
2018-06-11 15:34:13 +03:00
Nadav Har'El
41472e2618 legacy_schema_migrator: add comment
When I came across db/legacy_schema_migrator.cc, I had no idea what it
does and though I had obvious guesses (it somehow migrates old schemas,
right?) I didn't know what it really does. So after I figured this out,
I wrote this comment so the next person doesn't need to guess.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180605120225.25173-1-nyh@scylladb.com>
2018-06-10 19:39:06 +03:00
Avi Kivity
6f23403137 Merge "Virtualize IndexInfo system table" from Duarte
"
The IndexInfo table tracks the secondary indexes that have already
been populated. Since our secondary index implementation is backed by
materialized views, we can virtualize that table so queries are
actually answered by built_views.

Fixes #3483
"

* 'built-indexes-virtual-reader/v2' of github.com:duarten/scylla:
  tests/virtual_reader_test: Add test for built indexes virtual reader
  db/system_keysace: Add virtual reader for IndexInfo table
  db/system_keyspace: Explain that table_name is the keyspace in IndexInfo
  index/secondary_index_manager: Expose index_table_name()
  db/legacy_schema_migrator: Don't migrate indexes
2018-06-06 17:35:51 +03:00
Duarte Nunes
833d34e88a Merge 'Make rows in a secondary index ordered by token' from Piotr
"
As in #3423, ensuring token order on secondary index queries can be done
by adding an additional column to views that back secondary indexes.
This column is a first clustering column and contains token value,
computed on updates.
This series also updates tests and comments referring to issue 3423.

Tests: unit (release, debug)
"

* 'order_by_token_in_si_5' of https://github.com/psarna/scylla:
  cql3: update token order comments
  index, tests: add token column to secondary index schema
  view: add handling of a token column for secondary indexes
  view: add is_index method
2018-06-06 10:07:43 +01:00
Piotr Sarna
d5e7b5507b view: add handling of a token column for secondary indexes
In order to ensure token order on secondary index queries,
first clustering column for each view that backs a secondary index
is going to store a token computed from base's partition keys.
After this commit, if there exists a column that is not present
in base schema, it will be filled with computed token.
2018-06-05 18:59:25 +02:00
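Making the token the first clustering column means rows inside an index partition sort by token before anything else. The sketch below imitates that ordering with a plain comparison. The struct and function names are hypothetical, and the token values are given directly instead of being computed from the base partition key.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical view row: the token comes first, then the base partition key.
struct index_row {
    std::int64_t token;
    std::string base_key;
};

// Orders rows the way a (token, base_key) clustering key would.
void sort_like_clustering_order(std::vector<index_row>& rows) {
    std::sort(rows.begin(), rows.end(),
              [](const index_row& a, const index_row& b) {
                  return std::tie(a.token, a.base_key) <
                         std::tie(b.token, b.base_key);
              });
}
```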
Piotr Sarna
06eee0f525 view: add is_index method
is_index method returns true if view that owns it
is backing a secondary index.
2018-06-05 11:10:24 +02:00
Glauber Costa
bdce561ada system_keyspace: add sharding information to local table
We would like the clients to be able to route work directly to the right
shards. To do that, they need to know the sharding algorithm and its
parameters.

The algorithm can be copied into the client, but the parameters need to
be exported somewhere. Let's use the local table for that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
---
v2: force msb to zero on non-murmur
2018-06-04 11:25:58 -04:00
Duarte Nunes
3e39985c7a db/system_keysace: Add virtual reader for IndexInfo table
The IndexInfo table tracks the secondary indexes that have already
been populated. Since our secondary index implementation is backed by
materialized views, we can virtualize that table so queries are
actually answered by built_views.

Fixes #3483

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-06-04 11:14:17 +01:00
Duarte Nunes
65c4205334 db/system_keyspace: Explain that table_name is the keyspace in IndexInfo
This patch adds the same comment that exists in Apache Cassandra,
explaining that the table_name column in the IndexInfo system table
actually refers to the keyspace name. Don't be fooled.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-06-04 11:14:17 +01:00
Duarte Nunes
7187963bda db/legacy_schema_migrator: Don't migrate indexes
Previous versions contained no indexes, and Apache Cassandra indexes
cannot be migrated to Scylla.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-06-04 11:14:17 +01:00
Piotr Sarna
204bc17bd7 hints: decouple hints manager metrics from constructor
Now that more than one instance of hints manager can be present
at the same time, registering metrics is moved out of the constructor
to prevent 'registering metrics twice' errors.
2018-06-04 09:46:06 +02:00
Piotr Sarna
f345efc79a hints: move space_watchdog to resource manager
Space watchdog is decoupled from hints manager and moved to resource
manager, so it can be shared among different hints manager instances.
2018-06-04 09:46:01 +02:00
Piotr Sarna
ef40f7e628 hints: move send limiter to resource manager
Send limiting semaphore is moved from hints manager to resource manager.
In consequence, hints manager now keeps a reference to its resource
manager.
2018-06-04 09:35:58 +02:00
Piotr Sarna
2315937854 hints: move constants to resource_manager
Constants related to managing resources are moved to newly created
resource_manager class. Later, this class will be used to manage
(potentially shared) resources of hints managers.
2018-06-04 09:35:58 +02:00
Paweł Dziepak
0ea6d14cf5 atomic_cell: explicitly state when atomic_cell is a collection member
Collections are not going to be fully converted to the IMR just yet and
still use the old serialisation format. This means that they still don't
support fragmented values very well. This patch passes the information
when an atomic_cell is created as a member of a collection so that later
we can avoid fragmenting the value in such cases.
2018-05-31 15:51:11 +01:00
Paweł Dziepak
aa25f0844f atomic_cell: introduce fragmented buffer value interface
As a preparation for the switch to the new cell representation this
patch changes the type returned by atomic_cell_view::value() to one that
requires explicit linearisation of the cell value. Even though the value
is still implicitly linearised (and only when managed by the LSA) the
new interface is the same as the target one so that no more changes to
its users will be needed.
2018-05-31 15:51:11 +01:00
Paweł Dziepak
27014a23d7 treewide: require type info for copying atomic_cell_or_collection
2018-05-31 15:51:11 +01:00
Paweł Dziepak
e9d6fc48ac treewide: require type for creating atomic_cell
2018-05-31 15:51:11 +01:00
Paweł Dziepak
93130e80fb atomic_cell: require column_definition for creating atomic_cell views
2018-05-31 15:51:11 +01:00
Duarte Nunes
99d678d079 db/view: Remove ifdef'd Java code
It provides no useful information, so just get rid of it.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-05-28 11:51:23 +01:00
Duarte Nunes
ad18d535e9 db/view: Ignore scenario where base replica hasn't joined the ring
Apache Cassandra handles a case where the node hasn't joined the ring
and may consequently have an outdated view of it. Following the same
reasoning as with the previous patch, we ignore this scenario. It
happens when there are range movements, and this node is bootstrapping,
but there are already other mechanisms in the cluster, such as hinted
handoff and dual-writing to replicas during range movements, that
contribute to this update eventually making its way to the view.

This patch doesn't change any behavior, but it provides the reasoning
why we won't use the batchlog as Cassandra does, or the hinted handoff
log as we will, to later send the update when the node is joined (note
that Cassandra just sends the mutations "later", and doesn't check
again for any condition or change).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-05-28 11:51:23 +01:00
Duarte Nunes
be45e6a1b7 db/view: Handle case when base has no paired view replica
If no view replica is paired with the current base replica, it means
there's a range movement going on (decommission or move), such that
this base replica is gaining new token ranges. The current node is
thus a pending_endpoint from the POV of the coordinator that sent the
request.

Sending view updates to the view replica this base will eventually be
paired with only makes a difference when the base update didn't make
it to the node which is currently being decommissioned or moved-from.

The update will, however, make it to that node if HH is enabled at the
coordinator, before the range movement finishes, or later to this node
when it becomes a natural endpoint for the token.

We still ensure we send to any pending view endpoints though, at least
until we handle that case more optimally.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-05-28 11:51:18 +01:00
Duarte Nunes
4859b759b9 Merge 'Make all timeouts explicit' from Avi
"
This patchset makes all users of query_processor specify their timeouts
explicitly, in preparation for the removal of
cql_statement::execute_internal() (whose main function was to override
timeouts).
"

* tag 'cql-explicit-timeouts/v1' of https://github.com/avikivity/scylla:
  query_processor: require clients to specify timeout configuration
  query_processor: un-default consistency level in make_internal_options
2018-05-26 16:10:58 +02:00
Piotr Sarna
3792bed3ed view: adapt view_stats to act as write stats
This commit adapts view_stats structure so it can be passed
to storage_proxy as write stats. Thanks to that, mv replica updates
will not interfere with user write metrics. As a side effect it also
provides more stats to replica view updates.

Closes #3385
Closes #3416
2018-05-22 16:52:58 +02:00
Piotr Sarna
9246bb36bc db: add row locking metrics
This commit adds statistics to row_locker class. Metrics are
independently counted for all lock types: row<->partition and
exclusive<->shared.

Metrics gathered:
 - total acquisitions
 - operations that wait on the lock
 - histogram of the time spent on waiting on this type of lock

References #3385
References #3416
2018-05-22 16:52:58 +02:00
Piotr Sarna
49bebcfa25 view: add view metrics
This commit introduces view statistics:
 - updates pushed to local/remote replicas
 - updates failed to be pushed to local/remote replicas

Metrics are kept on per-table basis, i.e. updates_pushed_remote
shows the number of total updates (mutations) pushed to all paired
mv replicas that this particular table has.
Every single update is taken into consideration, so if view update
requires removing a row from one view and adding a row to another,
it will be counted as 2 updates.

References #3385
References #3416
2018-05-22 16:52:58 +02:00
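The counting rule can be sketched with four counters, one per combination of local or remote and pushed or failed. Only updates_pushed_remote is named in the commit message. The other names, and the choice to count a failed update only as failed, are assumptions.

```cpp
#include <cstdint>

// Hypothetical per-table counters; only updates_pushed_remote is named in the commit.
struct view_stats {
    std::uint64_t updates_pushed_local = 0;
    std::uint64_t updates_pushed_remote = 0;
    std::uint64_t updates_failed_local = 0;
    std::uint64_t updates_failed_remote = 0;
};

enum class replica { local, remote };

// Records one mutation sent to one paired view replica. A view update that
// removes a row from one view and adds a row to another calls this twice.
void record_update(view_stats& stats, replica target, bool succeeded) {
    std::uint64_t& counter = target == replica::local
        ? (succeeded ? stats.updates_pushed_local : stats.updates_failed_local)
        : (succeeded ? stats.updates_pushed_remote : stats.updates_failed_remote);
    ++counter;
}
```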
Calle Wilund
62c3b4c429 commitlog: Ensure file objects are closed before object free
Fixes #3446

Previously, only shutdown-synced objects were actually closed,
which is wrong.

This introduces yet another queue, processed together with the
deletion objects, which ensures we explicitly close all objects
that have been discarded.

Message-Id: <20180521140456.32100-1-calle@scylladb.com>
2018-05-22 14:52:06 +03:00
Glauber Costa
596a525950 commitlog: don't move pointer to segment
We are currently moving the pointer we acquired to the segment inside
the lambda in which we'll handle the cycle.

The problem is, we also use that same pointer inside the exception
handler. If an exception happens we'll access it and we'll crash.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180518125820.10726-1-glauber@scylladb.com>
2018-05-18 17:25:18 +02:00
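The class of bug described here can be reproduced in a few lines. The sketch uses std::shared_ptr as a stand-in for whatever pointer type the commitlog uses, and all names are hypothetical. A moved-from std::shared_ptr is guaranteed to be empty, so any later use of the original variable, such as in an exception handler, dereferences null.

```cpp
#include <memory>
#include <string>
#include <utility>

struct segment {
    std::string name;
};

// Buggy shape: the pointer is moved into the lambda, so code that runs
// afterwards (for example an exception handler) sees an empty pointer.
bool handler_sees_segment_after_move() {
    auto seg = std::make_shared<segment>(segment{"seg-1"});
    auto cycle = [s = std::move(seg)] { return s->name; };
    cycle();
    return seg != nullptr;  // false: seg was emptied by the move
}

// Fixed shape: the lambda takes a copy, and both owners stay valid.
bool handler_sees_segment_after_copy() {
    auto seg = std::make_shared<segment>(segment{"seg-1"});
    auto cycle = [s = seg] { return s->name; };
    cycle();
    return seg != nullptr;  // true: seg still owns the segment
}
```

Copying a reference-counted pointer costs one counter increment, which is a small price for keeping the handler safe.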
Avi Kivity
a99e820bb9 query_processor: require clients to specify timeout configuration
Remove implicit timeouts and replace with caller-specified timeouts.
This allows removing the ambiguity about what timeout a statement is
executed with, and allows removing cql_statement::execute_internal(),
which mostly overrode timeouts and consistency levels.

Timeout selection is now as follows:

  query_processor::*_internal: infinite timeout, CL=ONE
  query_processor::process(), execute(): user-specified consistency level and timeout

All callers were adjusted to specify an infinite timeout. This can be
further adjusted later to use the "other" timeout for DCL and the
read or write timeout (as needed) for authentication in the normal
query path.

Note that infinite timeouts don't mean that the query will hang; as
soon as the failure detector decides that the node is down, RPC
responses will terminate with a failure and the query will fail.
2018-05-14 09:41:06 +03:00
Duarte Nunes
a23bda3393 Merge 'Implement separate timeout for range queries' from Avi
"
This patchset implements separate timeouts for range queries, and lays
the foundations for separate timeouts for other query types.

While the feature in itself is worthy, the real motivation is to have
the timeouts decided by the caller, instead of storage_proxy. This in
turn is required to disentangle each layer behaving differently
depending on whether the query is internal or not; instead, the goal
is to have each caller declare its needs in terms of consistency level
and timeouts, and have the lower layers implement its requirements
instead of making their own decisions.

Fixes #3013.

Tests: unit (release)
"

* tag '3013/v1.1' of https://github.com/avikivity/scylla:
  storage_proxy: remove default_query_timeout()
  storage_proxy: don't use default timeouts
  query_options: augment with timeout_config
  thrift: configure thrift transport and handler with a timeout_config
  transport: configure native transport with a timeout_config
  cql3: define and populate timeout_config_selector
  timeout_config: introduce timeout configuration
2018-05-13 20:05:50 +02:00
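One way to let the caller pick a timeout per query type is a plain struct of durations plus a pointer-to-member selector. The names timeout_config and timeout_config_selector appear in the commit titles above. The fields, the millisecond unit, and the helper function are assumptions made for this sketch.

```cpp
#include <chrono>

using timeout_duration = std::chrono::milliseconds;

// Assumed layout: one timeout per query type.
struct timeout_config {
    timeout_duration read_timeout;
    timeout_duration write_timeout;
    timeout_duration range_read_timeout;
};

// A selector names one field of timeout_config without holding a value itself.
using timeout_config_selector = timeout_duration timeout_config::*;

// The caller states which timeout applies; lower layers only look it up.
timeout_duration select_timeout(const timeout_config& cfg,
                                timeout_config_selector selector) {
    return cfg.*selector;
}
```

A range query would pass &timeout_config::range_read_timeout, so the decision stays with the caller rather than with storage_proxy.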
Paweł Dziepak
75b8b521d9 db/view/build_progress: avoid copying mutation fragment
2018-05-09 16:52:26 +01:00
Paweł Dziepak
0b4c6b8938 types: make some collection_type_impl functions non-static
The switch to the new in-memory representation will require larger
parts of the logic to be aware of the type of the values they are dealing
with. In most cases it is not a significant burden for the users.
2018-05-09 16:52:26 +01:00
Vlad Zolotarov
48c96d09d6 db::hints::manager: drain hints when the node is decommissioned/removed
When node is decommissioned/removed it will drain all its hints and all
remote nodes that have hints to it will drain their hints to this node.

What does "drain" mean? The node that "drains" hints to a specific
destination ignores failures and keeps sending hints until the end
of the current segment, erases it, and moves on to the next one until
no segments are left.

After all hints are drained the corresponding hints directory is removed.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
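The drain loop described above can be sketched synchronously. The real code is asynchronous and works on files. Here a segment is just a list of strings and the send callback is supplied by the caller; every name is hypothetical.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <vector>

using segment = std::vector<std::string>;  // hypothetical: one string per hint

// Sends every hint of every segment, ignoring failures, and erases each
// segment once its end is reached. Returns how many sends failed.
std::size_t drain(std::deque<segment>& segments,
                  const std::function<bool(const std::string&)>& send) {
    std::size_t failed = 0;
    while (!segments.empty()) {
        for (const auto& hint : segments.front()) {
            if (!send(hint)) {
                ++failed;  // failures do not stop or retry a drain
            }
        }
        segments.pop_front();  // erase the finished segment, move to the next
    }
    return failed;
}
```

Once the function returns, no segments are left, which corresponds to the point where the hints directory can be removed.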
Vlad Zolotarov
ec76f8a27d db::hints::manager: add a few more trace messages
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
6ede32156f db::hints::manager::end_point_hints_manager::sender: add set_stopping()/stopping() methods
It's nicer to have access methods instead of working directly with enum_set methods and values.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
94da744f37 db::hints::manager::end_point_hints_manager::stop(): log the last exception instead of forwarding it
Returning a future with an exception from end_point_manager::stop()
is practically useless because the best the caller can do is to log
it and continue as if it didn't happen because it has other things
to shut down.

Therefore in order to simplify the caller we will log the exception
if it happens and will always return a non-exceptional future.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
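The same idea in plain synchronous C++, without Seastar futures: run every shutdown step, remember the last exception, log it, and never let it escape. The function name and the use of std::cerr for logging are assumptions.

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <vector>

// Runs all shutdown steps even if some throw. The last exception is logged
// rather than propagated. Returns true if anything was logged.
bool stop_all(const std::vector<std::function<void()>>& steps) noexcept {
    std::exception_ptr last;
    for (const auto& step : steps) {
        try {
            step();
        } catch (...) {
            last = std::current_exception();  // keep going with the next step
        }
    }
    if (!last) {
        return false;
    }
    try {
        std::rethrow_exception(last);
    } catch (const std::exception& e) {
        std::cerr << "exception while stopping: " << e.what() << '\n';
    } catch (...) {
        std::cerr << "unknown exception while stopping\n";
    }
    return true;
}
```

The caller no longer needs its own error path and can move on to shutting down other components.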
Vlad Zolotarov
8aedbf9d18 db::hints: manager.hh: cleanup: fix the comments
Fix the comments that went out of sync with the current implementation.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
5463b58faa db::hints::manager: rework end_point_hints_manager::stop() to use seastar::async()
This makes the code easier to read and extend.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Duarte Nunes
c053275a48 db/view/row_locking: Add timeout when waiting for the lock
This ensures we respect the write timeout set by the client when
applying base writes, in case a write takes too long to acquire the
row lock for the read-before-write phase of a materialized view
update.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180507132755.8751-1-duarte@scylladb.com>
2018-05-07 18:22:39 +01:00
Duarte Nunes
2be75bdfc9 db/timeout_clock: Properly scope type names
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180426134457.21290-1-duarte@scylladb.com>
2018-05-07 11:24:41 +03:00
Piotr Sarna
fe02c3d0e2 database, sstables, tests: add large_partition_handler
This commit makes database, sstables and tests aware
of which large_partition_handler they use.
Proper large_partition_handler is retrievable from config information
and is based on existing compaction_large_partition_warning_threshold_mb
entry. Right now CQL TABLE variant of large_partition_handler is used
in the database.

Tests use a NOP version of large_partition_handler, which does not
depend on CQL queries at all.
2018-05-04 14:38:13 +02:00
Piotr Sarna
14b3c7e7e7 db: add large_partition_handler interface with implementations
This commit introduces large_partition_handler class, which can be used
to take additional action when large partitions are written.

It comes with two implementations:
 * NOP, used in tests, which does nothing on large partition
   update/delete
 * CQL TABLE, which inserts/deletes information on particular sstable
   to system.large_partitions table, in order to be retrievable from
   cqlsh later.

References #3292
2018-05-04 12:46:31 +02:00
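An interface with a no-op implementation and a recording implementation might be shaped like the sketch below. The method names and the byte-based threshold are assumptions. The recording variant appends to an in-memory vector as a stand-in for the CQL TABLE variant, which writes to system.large_partitions instead.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical interface: invokes a hook when a written partition exceeds a threshold.
class large_partition_handler {
    std::uint64_t _threshold_bytes;
public:
    explicit large_partition_handler(std::uint64_t threshold_bytes)
        : _threshold_bytes(threshold_bytes) {}
    virtual ~large_partition_handler() = default;

    void maybe_update(const std::string& sstable, const std::string& key,
                      std::uint64_t size_bytes) {
        if (size_bytes > _threshold_bytes) {
            on_large_partition(sstable, key, size_bytes);
        }
    }
protected:
    virtual void on_large_partition(const std::string& sstable,
                                    const std::string& key,
                                    std::uint64_t size_bytes) = 0;
};

// NOP variant for tests: does nothing.
class nop_large_partition_handler : public large_partition_handler {
public:
    using large_partition_handler::large_partition_handler;
protected:
    void on_large_partition(const std::string&, const std::string&,
                            std::uint64_t) override {}
};

struct large_partition_entry {
    std::string sstable;
    std::string key;
    std::uint64_t size_bytes;
};

// Stand-in for the CQL TABLE variant: records entries in memory.
class recording_large_partition_handler : public large_partition_handler {
    std::vector<large_partition_entry> _rows;
public:
    using large_partition_handler::large_partition_handler;
    const std::vector<large_partition_entry>& rows() const { return _rows; }
protected:
    void on_large_partition(const std::string& sstable, const std::string& key,
                            std::uint64_t size_bytes) override {
        _rows.push_back({sstable, key, size_bytes});
    }
};
```

Keeping the threshold check in the base class means tests can swap in the NOP variant without touching the code that writes partitions.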