Commit Graph

999 Commits

Author SHA1 Message Date
Calle Wilund
2bc98aebaf db::commitlog: Do segment delete async + force replay delete go via CL
Refs #2858

Push segement files to be deleted to a pending list, and process at
intervals or flush-requests (or shutdown). Note that we do _not_
indescrimenately do deletes in non-anchored tasks, because we need
to guarantee that finshed segments are fully deleted and gone on CL
shutdown, not to be mistaken for replayables.

Also make sure we delete segments replayed via commitlog call,
so IFF we add metadata processing for CL, we can clear it out.
2018-03-26 11:58:27 +00:00
Nadav Har'El
e9702aa126 Materialized Views: don't lose updates while cluster is changing
When the cluster is changed (nodes added or removed), ranges of tokens
are moved between nodes. Scylla initiates a streaming process between an
old and a new owner of the range, which can take a long time. During
that streaming time, the new owner of the range is known as a "pending node"
for this range, and all updates must go to both the old owner (in case the
movement fails!) and the pending node (in case the movement succeeds).

For materialized views, because they are ordinary tables, streaming moves
all the view's data that existed before the streaming started. But we did
not send updates done to the view *during* the streaming. A dtest
demonstrates that the new node will miss some of the view update, and will
require a repair of the view tables immediately after the cluster change
ends, which is not good. To fix that, we need to send every new update
that happens during the streaming also to the "pending node". We already
did this properly for base-table updates, but not to the view updates:
Each base table replica wrote to only one paired view table replica,
and nobody wrote to the new pending node (in case where there is one,
for the particular view token involved).

In this patch, we make sure that all view updates go also to the "pending
nodes" when there are any. We do the same thing that Cassandra does, which
is - *all* base replicas write the update to the pending node(s).
Arguably, it is inefficient that all replicas send the update to the same
node. In most cases it is enough to send it from just one base replica -
the one who is slated to be the new node's pair.  I opened
https://issues.apache.org/jira/browse/CASSANDRA-14262 about this idea.
But that is an optimization. The patch as-is already fixes the bug.

Fixes #3211

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180313171853.17283-1-nyh@scylladb.com>
2018-03-16 12:00:29 +00:00
Botond Dénes
f488ae3917 Add buffer_size() to flat_mutation_reader
buffer_size() exposes the collective size of the external memory
consumed by the mutattion-fragments in the flat reader's buffer. This
provides a basis to build basic memory accounting on. Altought this is
not the entire memory consumption of any given reader it is the most
volatile component and usually by far the largest one too.
2018-03-13 10:34:34 +02:00
Botond Dénes
aaf67bcbaa Consider preferred replicas when choosing endpoints for query_singular()
Propagate the preferred_replicas to db::filter_for_query() and consider
them when selecting the endpoints. The algoritm for selecting the
endpoints is as follows:
* Compute the intersection of the endpoint candidates and the
preferred endpoints.
* If this yields a set of endpoints that already satisfies the CL
requirements use this set.
* Otherwise select the remaining endpoints according to the
load-balancing strategy, just like before.
2018-03-13 10:34:34 +02:00
Botond Dénes
eac597d726 Add preferred and last replicas to the signature of query()
preferred_replicas are added to the parameters and last_replicas are
added to the return type. The preferred replicas will be used as a hint
for the selection of the replicas to send the read requests to. The last
replicas (returned) are the replicas actually selected for the read.
This will allow queries to consistently hit the same replicas for each
page thus reusing readers created on these replicas.
For convenience a query() overload is provided that doesn't take or
return the preferred and last replicas.

This patch only adds the parameters and propagates them down to
query_singular() and query_partition_key_range(). The code to actually
use these preferred-replicas will be added in later patches.
This reason for separating this is to reduce noise and improve
reviewability for those functional changes later.
2018-03-13 10:34:34 +02:00
Avi Kivity
4f6b892aa1 cql3: remove #include of system_keyspace.hh
We include system_keyspace for just the string "system" (and a related
is_system_keyspace() function). Replace with a forward-declared functions.
2018-03-11 18:02:23 +02:00
Botond Dénes
1259031af3 Use the reader_concurrency_semaphore to limit reader concurrency 2018-03-08 14:12:12 +02:00
Raphael S. Carvalho
aa75684ee7 sstables: Warn when an extra-large partition is written
Based on https://issues.apache.org/jira/browse/CASSANDRA-9643

For compaction_large_partition_warning_threshold_mb option set to 1,
follow an example output:

WARN  2018-02-22 19:52:11,029 [shard 0] sstable - Writing large
row system/local:{key: pk{00056c6f63616c}, token:-7564491331177403445}
(1276758 bytes)

Fixes #2209.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180306175912.19259-1-raphaelsc@scylladb.com>
2018-03-07 15:49:46 +00:00
Duarte Nunes
9254a9a6fe db/system_keyspace: Move dependency on db/schema_tables to source file
And add missing dependencies to header file.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180307111304.2914-1-duarte@scylladb.com>
2018-03-07 14:45:36 +02:00
Avi Kivity
d973445a94 Merge "sstable/schema extensions" from Calle
"
Adds extension points to schema/sstables to enable hooking in
stuff, like, say, something that modifies how sstable disk io
works. (Cough, cough, *encryption*)

Extensions are processed as property keywords in CQL. To add
an extension, a "module" must register it into the extensions
object on boot time. To avoid globals (and yet don't),
extensions are reachable from config (and thus from db).

Table/view tables already contain an extension element, so
we utilize this to persist config.

schema_tables tables/views from mutations now require a "context"
object (currently only extensions, but abstracted for easier
further changes.

Because of how schemas currently operate, there is a super
lame workaround to allow "schema_registry" access to config
and by extension extensions. DB, upon instansiation, calls
a thread local global "init" in schema_registry and registers
the config. It, in turn, can then call table_from_mutations
as required.

Includes the (modified) patch to encapsulate compression
into objects, mainly because it is nice to encapsulate, and
isolate a little.
"

* 'calle/extensions-v5' of github.com:scylladb/seastar-dev:
  extensions: Small unit test
  sstables: Process extensions on file open
  sstables::types: Add optional extensions attribute to scylla metadata
  sstables::disk_types: Add hash and comparator(sstring) to disk_string
  schema_tables: Load/save extensions table
  cql: Add schema extensions processing to properties
  schema_tables: Require context object in schema load path
  schema_tables: Add opaque context object
  config_file_impl: Remove ostream operators
  main/init: Formalize configurables + add extensions to init call
  db::config: Add extensions as a config sub-object
  db::extensions: Configuration object to store various extensions
  cql3::statements::property_definitions: Use std::variant instead of any
  sstables: Add extension type for wrapping file io
  schema: Add opaque type to represent extensions
  sstables::compress/compress: Make compression a virtual object
2018-02-26 17:15:29 +02:00
Pekka Enberg
f1f691b555 Merge "Add the GoogleCloudSnitch" from Vlad
"This series adds the GoogleCloudSnitch.

 Fixes #1619"

* 'google-cloud-snitch-v4' of https://github.com/vladzcloudius/scylla:
  config: uncomment/add the supported snitches description
  tests: added gce_snitch_test
  locator::gce_snitch: implementation of the GoogleCloudSnitch
  locator::snitch_base: properly log the failure during the snitch startup
2018-02-19 15:58:56 +02:00
Duarte Nunes
f665f1ab97 db/commitlog: Close the segment file
Operations on a segment's underlying append_challenged_posix_file_impl,
such as truncate(), schedule asynchronous operations when they are
executed, which capture the file object. To synchronize with them and
prevent use-after-free, we need to call close() and only delete the
segment and file when the returned future resolves.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180216235754.24257-1-duarte@scylladb.com>
2018-02-19 13:09:41 +00:00
Duarte Nunes
7004f6c7ff db/commitlog: Actually prevent new requests during shutdown
When shutting down the commitlog we try to block all new requests by
acquiring all available resources. We were, however, letting go of the
semaphore permits too early, before closing the gate and shutting down
the active segments.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180216234826.24111-1-duarte@scylladb.com>
2018-02-19 13:09:26 +00:00
Glauber Costa
7b6f188e27 controllers: allow a static priority to override the controller output
We have merged the I/O controller without this, but we want to integrate
the CPU and I/O controllers into one. Currently, the quota can be
statically set for the CPU controller. For now, until we gain more
experience with it we should allow a static value to override the
controller's output as well.

That is particularly important since we don't yet control some
strategies like LCS and the time-based ones. Users in the field may be
using one of those strategies with a static value for background quota.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Glauber Costa
c099c98676 controllers: retire auto_adjust_flush_quota
It no longer makes sense now that we have the full scheduler +
controllers.  In its lieu, we will provide an option to statically set
the controller's shares as a safe guard against us getting this wrong.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Avi Kivity
2ee163d32b config: mark background_writer_scheduling_quota as Unused
Since the background writer flush quota config is no longer used, mark
it Unused.
2018-02-07 17:19:29 -05:00
Avi Kivity
641aaba12c database, sstables, compaction: convert use of thread_scheduling_group to seastar cpu scheduler
thread_scheduling_groups are converted to plain scheduling_group. Due to
differences in initialization (scheduling_group initializtion defers), we
create the scheduling_groups in main.cc and propagate them to users via
a new class database_config.

The sstable writer loses its thread_scheduling_group parameter and instead
inherits scheduling from its caller.

Since shares are in the 1-1000 range vs. 0-1 for thread scheduling quotas,
the flush controller was adjusted to return values within the higher ranges.
2018-02-07 17:19:29 -05:00
Calle Wilund
97f9f572f8 schema_tables: Load/save extensions table
Parses the extension map in tables/views using the registered extension.
If a schema row contains an unknown extension, we just preserve the data
in a placeholder.
2018-02-07 10:11:46 +00:00
Calle Wilund
2b56bbfa7d schema_tables: Require context object in schema load path
Requires "workaround" fix for schema_registry and frozen_mutation, since
the former is a free-float thread local, and the latter is a pure data
carrier. frozen_schema can take a parameter for unfreeze, but schema
registry requires being told which the system extensions are.
2018-02-07 10:11:46 +00:00
Calle Wilund
c2b49ec2e2 schema_tables: Add opaque context object
To allow carrying extensions and potentially more
2018-02-07 10:11:46 +00:00
Calle Wilund
c19d8dd602 db::config: Add extensions as a config sub-object
The idea being that we should have config be a global, immutable
singleton, set up by startup/test then owned/referenced by db etc. 

Extensions are read-only in this context, so init code should set it up
before handing to the config. Or keep a ref to the ext param.
2018-02-07 10:11:46 +00:00
Calle Wilund
78174c6c59 db::extensions: Configuration object to store various extensions
A singular, yet not static global, container for schema/sstable 
extensions.
2018-02-07 10:11:46 +00:00
Vlad Zolotarov
bc90aa79b3 config: uncomment/add the supported snitches description
Uncomment desscriptions of Ec2SnitchXXX which are supported for a long
time already.
Add the description of the new GoogleCloudSnitch.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-05 10:37:13 -05:00
Nadav Har'El
31d0a1dd0c Materialized views: implement row and partition locking mechanism
This patch adds a "row_locker" class providing locking (shard-locally) of
individual clustering rows or entire partitions, and both exclusive and
shared locks (a.k.a. reader/writer lock).

As we'll see in a following patch, we need this locking capability for
materialized views, to serialize the read-modify-update modifications
which involve the same rows or partitions.

The new row_locker is significantly different from the existing cell_locker.
The two main differences are that 1. row_locker also supports locking the
entire partition, not just individual rows (or cells in them), and that
2. row_locker supports also shared (reader) locks, not just exclusive locks.
For this reason we opted for a new implementation, instead of making large
modificiations to the existing cell_locker. And we put the source files
in the view/ directory, because row_locker's requirements are pretty
specific to the needs of materialized views.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-01-30 16:16:27 +02:00
Duarte Nunes
1e3fae5bef db/schema_tables: Only drop UDTs after merging tables
Dropping a user type requires that all tables using that type also be
dropped. However, a type may appear to be dropped at the same time as
a table, for instance due to the order in which a node receives schema
notifications, or when dropping a keyspace.

When dropping a table, if we build a schema in a shard through a
global_schema_pointer, then we'll check for the existence of any user
type the schema employs. We thus need to ensure types are only dropped
after tables, similarly to how it's done for keyspaces.

Fixes #3068

Tests: unit-tests (release)

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180129114137.85149-1-duarte@scylladb.com>
2018-01-30 12:07:04 +01:00
Piotr Jastrzebski
96c97ad1db Rename streamed_mutation* files to mutation_fragment*
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
José Guilherme Vanz
380bc0aa0d Swap arguments order of mutation constructor
Swap arguments in the mutation constructor keeping the same standard
from the constructor variants. Refs #3084

Signed-off-by: José Guilherme Vanz <guilherme.sft@gmail.com>
Message-Id: <20180120000154.3823-1-guilherme.sft@gmail.com>
2018-01-21 12:58:42 +02:00
Piotr Jastrzebski
4c74b8c7e7 Migrate materalized views to flat_mutation_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-18 07:32:35 +01:00
Duarte Nunes
b607662d2e collection_type_impl: Make for_each_cell static
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180115013532.67200-1-duarte@scylladb.com>
2018-01-15 11:16:33 +02:00
Glauber Costa
08a0c3714c allow request-specific read timeouts in storage proxy reads
Timeouts are a global property. However, for tables in keyspaces like
the system keyspace, we don't want to uphold that timeout--in fact, we
wan't no timeout there at all.

We already apply such configuration for requests waiting in the queued
sstable queue: system keyspace requests won't be removed. However, the
storage proxy will insert its own timeouts in those requests, causing
them to fail.

This patch changes the storage proxy read layer so that the timeout is
applied based on the column family configuration, which is in turn
inherited from the keyspace configuration. This matches our usual
way of passing db parameters down.

In terms of implementation, we can either move the timeout inside the
abstract read executor or keep it external. The former is a bit cleaner,
the the latter has the nice property that all executors generated will
share the exact same timeout point. In this patch, we chose the latter.

We are also careful to propagate the timeout information to the replica.
So even if we are talking about the local replica, when we add the
request to the concurrency queue, we will do it in accordance with the
timeout specified by the storage proxy layer.

After this patch, Scylla is able to start just fine with very low
timeouts--since read timeouts in the system keyspace are now ignored.

Fixes #2462

Implementation notes, and general comments about open discussion in 2462:

* Because we are not bypassing the timeout, just setting it high enough,
  I consider the concerns about the batchlog moot: if we fail for any
  other reason that will be propagated. Last case, because the timeout
  is per-CF, we could do what we do for the dirty memory manager and
  move the batchlog alone to use a different timeout setting.

* Storage proxy likes specifying its timeouts as a time_point, whereas
  when we get low enough as to deal with the read_concurrency_config,
  we are talking about deltas. So at some point we need to convert time_points
  to durations. We do that in the database query functions.

v2:
- use per-request instead of per-table timeouts.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-12 07:43:21 -05:00
Glauber Costa
5140aaea00 add a timeout to fast forward to
In the last patch, we enabled per-request timeouts, we enable timeouts
in fill_buffer. There are many places, though, in which we
fast_forward_to before we fill_buffer, so in order to make that
effective we need to propagate the timeouts to fast_forward_to as well.

In the same way as fill_buffer, we make the argument optional wherever
possible in the high level callers, making them mandatory in the
implementations.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-12 07:43:19 -05:00
Glauber Costa
d965af42b0 add a timeout to fill_buffer
As part of the work to enable per-request timeouts, we enable timeouts
in fill_buffer.

The argument is made optional at the main classes, but mandatory in all
the ::impl versions. This way we'll make sure we didn't forget anything.

At this point we're still mostly passing that information around and
don't have any entity that will act on those timeouts. In the next patch
we will wire that up.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-11 12:07:41 -05:00
Glauber Costa
80c4a211d8 consolidate timeout_clock
At the moment, various different subsystems use their different
ideas of what a timeout_clock is. This makes it a bit harder to pass
timeouts between them because although most are actually a lowres_clock,
that is not guaranteed to be the case. As a matter of fact, the timeout
for restricted reads is expressed as nanoseconds, which is not a valid
duration in the lowres_clock.

As a first step towards fixing this, we'll consolidate all of the
existing timeout_clocks in one, now called db::timeout_clock. Other
things that tend to be expressed in terms of that clock--like the fact
that the maximum time_point means no timeout and a semaphore that
wait()s with that resolution are also moved to the common header.

In the upcoming patch we will fix the restricted reader timeouts to
be expressed in terms of the new timeout_clock.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-11 12:07:41 -05:00
Paweł Dziepak
4dfddc97c7 db/schema_tables: do not use moved from shared pointer
Shared pointer view is captured by two continuations, one of which is
moving it away. Using do_with() solves the problem.

Fixes #3092.
Message-Id: <20171221111614.16208-1-pdziepak@scylladb.com>
2017-12-21 15:13:25 +01:00
George Tavares
ceecd542cd db/view: Consume updated rows regardless of static row
Using Materialized Views, if the base table has static columns,
and the update in base table mutates static and non static rows,
the streamed_mutation is stopped before process non static row.
The patch avoids stopping the stream_mutation and adds a test case.

Message-Id: <20171220173434.25091-1-tavares.george@gmail.com>
2017-12-21 00:49:15 +01:00
Raphael S. Carvalho
928beae242 Fix compilation of db/hints/manager.cc and row_cache.cc
compiler: gcc (GCC) 6.3.1 20161221 (Red Hat 6.3.1-1)

Problems introduced in f6a461c7a4
and 37b19ae6ba, respectively.

They both fail to compile due to use of method in lambda without
explicit mention of this. Some of failure is fixed by not using
auto in lambda parameter.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171218222144.12297-1-raphaelsc@scylladb.com>
2017-12-19 11:15:45 +01:00
Nadav Har'El
101cce3c79 Fix compilation of tests/commitlog_test.cc
In commit 878d58d23a, a new parameter was
added to commitlog::descriptor. The commit message says that "It's default
value is a descriptor::FILENAME_PREFIX." while in reality, it did not have
a default value and compilation of tests/commitlog_test.cc broke, because
it didn't specify a value.

So this patch adds a default value for this parameter, as was suggested
by the original commit message.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20171218131020.17883-1-nyh@scylladb.com>
2017-12-18 15:35:34 +02:00
Vlad Zolotarov
c2296c9575 config: add hints related options
- hints_directory:
      - This option allows defining of the directory where hints files are going
        to be stored if hinted handoff is enabled.

   - hinted_handoff_enabled:
      - May receive either a boolean value or a list of DCs. In the later case this
        will define the DCs to which Nodes hints are going to be generated.

   - max_hint_window_in_ms:
      - Maximum amount of milliseconds the hints are going to be generated to the Node that is DOWN.
        After this time period the hints are no longer going to be generated until the Node is seen UP.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:08:11 -05:00
Vlad Zolotarov
51bbf18c08 db::hints::manager: initial commit
Curently implemented:
   - Hints generation: db::hints::manager::store_hint(...).
   - Sending: db::hints::manager::on_timer().

TODO:
   - Resharding.
   - Node decommission.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:08:07 -05:00
Vlad Zolotarov
ec15d60a2d db::commitlog::replay_position: added std::hash<replay_position>
It's needed for hinted handoff.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:48 -05:00
Vlad Zolotarov
af70c0a709 db::commitlog: truncate segments to their actual sizes during shutdown
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:48 -05:00
Vlad Zolotarov
033af6c950 db::commitlog: allow defining a metrics category name
Add a new field to db::commitlog::config that would define the metrics category name.
If not given - metrics are not going to be registered.
Set it to "commitlog" in db::commitlog::config(const db::config&).

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:47 -05:00
Vlad Zolotarov
878d58d23a db/commitlog/commitlog::descriptor: add a filename_prefix parameter
This parameter is used when creating a new segment.
It's default value is a descriptor::FILENAME_PREFIX.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:47 -05:00
Vlad Zolotarov
719b1fb24f db::commitlog::descriptor::descriptor(filename): pass a filename as a const ref
Avoid not needed copy by passing a file name as a reference.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:47 -05:00
Michael Munday
5158b3f484 utils::crc: introduce process_le/be(T) methods
Replace the oblique process(T) overloads for integer types with
explicit process_le/be(T) methods that would interpret the given integer
as a stream of bytes using the corresponding endiannes.

For instance

process_le(0x11223344) would treat this integer as the following array of bytes:
{0x44, 0x33, 0x22, 0x11}.

process_be(0x11223344) on the other hand would treat this integer as if it's
{0x11, 0x22, 0x33, 0x44}.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-08 10:12:21 -05:00
Gleb Natapov
357c77a333 consistency_level: constify quorum_for() and local_quorum_for() 2017-12-05 13:01:20 +02:00
Paweł Dziepak
586b61d57d size_estimates: convert reader to flat mutation readers
Message-Id: <20171129105909.27084-1-pdziepak@scylladb.com>
2017-11-29 12:14:05 +00:00
Jesse Haber-Kucharsky
460f3c7065 auth: Add dormant role manager to service
The role manager still does not interact with the rest of the system,
but the role manager is now sharded on all cores and metadata is
created.

The following metadata tables are created:

- `system_auth.roles`
- `system_auth.role_members`

The default superuser, "cassandra", is also created, but has no function.
2017-11-27 12:14:24 -05:00
Pekka Enberg
0c192c835c cql3: Fix 'DROP INDEX' to also drop index view
This patch fixes 'DROP INDEX' CQL statement to also drop the underlying
index view automatically so that we don't leave unused materialized
views behind.
Message-Id: <1510303421-15945-1-git-send-email-penberg@scylladb.com>
2017-11-10 10:52:08 +01:00
Calle Wilund
959d729428 config: Resurrect command line aliases that where lost 2017-11-06 09:54:46 +00:00