Commit Graph

767 Commits

Author SHA1 Message Date
Tomasz Grabiec
ddfee57c97 Replace iostream include with iosfwd in headers
Message-Id: <1484656119-8386-4-git-send-email-tgrabiec@scylladb.com>
2017-01-17 14:52:44 +02:00
Tomasz Grabiec
50e3e3af08 db: Add missing include
Message-Id: <1484656119-8386-3-git-send-email-tgrabiec@scylladb.com>
2017-01-17 14:52:44 +02:00
Tomasz Grabiec
ea9ab36ad5 db: Move operator<<() definition to .cc
Message-Id: <1484656119-8386-2-git-send-email-tgrabiec@scylladb.com>
2017-01-17 14:52:43 +02:00
Avi Kivity
c314047b6c config: disable new sharding algorithm
It still has problems:
 - while resharding a very large leveled compaction strategy table, a huge
   amount of tiny sstables are generated, overwhelming the file descriptor
   limits
 - there is a large impact on read latency while resharding is going on

(cherry picked from commit cf27d44412)

(forward-ported from branch-1.6)
2017-01-15 10:48:53 +02:00
Vlad Zolotarov
dcdd98ccc1 db::commitlog::commitlog: move collectd counters registration to the metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Vlad Zolotarov
a9f6e5f8da db::batchlog_manager: move collectd registration to the metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Pekka Enberg
3d0217ec43 db/schema_tables: Fix system keyspace table list
Commit f0c28e1 ("db/schema_tables: Add schema_functions and
schema_aggregates tables") forgot to add the newly added tables to the
db::schema_tables::ALL list, which is used for authorization checks, for
example.

Fixes the following auth_test.py dtest failures:

  ('Unable to connect to any servers', {'127.0.0.1': Unauthorized('Error from server: code=2100 [Unauthorized] message="User cathy has no SELECT permission on <table system.schema_functions> or any of its parents"',)})
Message-Id: <1484045277-4997-1-git-send-email-penberg@scylladb.com>
2017-01-10 13:55:04 +01:00
Pekka Enberg
f0c28e1b2d db/schema_tables: Add schema_functions and schema_aggregates tables
The 3.0.3 Java driver, for example, search for the tables and fails when
we advertise Cassandra 2.2 version from Scylla.
2017-01-09 10:42:21 +02:00
Amnon Heiman
70b2a1bfd4 Set the prometheus prefix to scylla
This patch make the prometheus prefix configurable and set the default
value to scylla.

Fixes #1964

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1482671970-21487-1-git-send-email-amnon@scylladb.com>
2016-12-25 15:21:53 +02:00
Avi Kivity
3989e4ed15 Revert "config, dht: reduce default msb ignore bits to 4"
This reverts commit b81a57e8eb.

With exponential range scanning, we should now be able to survive
msb ignore bits of 12, which allows better sharding on large clusters.
2016-12-20 19:41:05 +02:00
Duarte Nunes
124802e196 cql3: Add function to build view's select statement
This patch adds an utility function that creates a raw select
statement from a set of columns and a where clause. It is intended to
be used to create the prepared select statement used by the view
class.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
5bd74abee8 create_view_statement: Implement check_access
This patch implements check_access according to Cassandra's
implementation.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
8ce21a9c01 schema_tables: Make drop view mutations
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
61a5a74ea2 schema_tables: Make update view mutations
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
2098c336d9 schema_tables: Make create view mutations
This patch builds the mutations to announce a new view. Aside from
including the view schema, we include the base table mutations so
that a node is resilient against receiving create view mutations
before the base table create mutations.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
19a76a82e8 frozen_schema: Support view schemas
This patch allows a view schema to be frozen. To unfreeze such a
schema, we add an is_view attribute to the schema idl.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
c11eb30225 schema_tables: Replace add_table_to_schema_mutation
This patch replaces the add_table_to_schema_mutation() function with
add_table_or_view_to_schema_mutation().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
04b93ba803 schema_tables: Make view mutations
This patch adds functions that translate a view schema to the
corresponding mutations.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
fe632e8ba5 schema_tables: Factor out duplicate code
This patch factors out duplicate code between
merge_tables() and merge_views().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
3fd79bb6d6 schema_tables: Merge views for schema merging
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
06ab61a570 schema_tables: Extract update_column_family
This patch extracts update_column_family from schema_tables into
database so it can be used when adding materialized views, in future
patches.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
40c684b5f5 database: Extract common create cf code
This patch moves some duplicate code into the
add_column_family_and_create_directory() function. It also saves some
superfluous keyspace lookups and readies the code to be used by
materialized views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
42242273f6 schema_tables: Create views from mutations
This patch enables views to be created from their low-level,
mutation-based representation.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
888a8923c7 read_table_mutations: Support other schemas
This patch changes read_table_mutations() so that it can now
read schemas from other tables besides the column families
schema table.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
b9cf25c4dd schema_tables: Add views schema table
This patch adds the views schema table, containing the definition of
views in a keyspace.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
7818339791 materialized views: Add view class
This patch adds the view class, which will contains functions related
to populating a view, either from the base table's write path or from
the view building mechanism which copies over already existing data in
the base table.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Asias He
937f28d2f1 Convert to use dht::partition_range_vector and dht::token_range_vector 2016-12-19 14:08:50 +08:00
Asias He
e5485f3ea6 Get rid of query::partition_range
Use dht::partition_range instead
2016-12-19 08:09:25 +08:00
Asias He
85034c1b57 Convert to use dht::partition_range 2016-12-19 08:04:30 +08:00
Asias He
d1178fa299 Convert to use dht::token_range 2016-12-19 08:04:29 +08:00
Avi Kivity
a61ff53150 Merge "rework flush criteria" from Glauber
"The current criteria for memtable flush is not being respected.  The
problem is demonstrated to happen when the dirty memory group is over
limit, and so is the system table extra allowance. In that situation,
both the normal region and the system table region will be under
pressure and try to flush.

More specifically, because the normal region inherits from the system
region, if the normal region is under pressure (over the soft limit
threshold), the system region will certainly be as well, even though it
has an extra allowance. This is because after virtual dirty, we start
blocking when we reach half the region, but memory itself can grow up to
100 % of the region. So the total amount of memory used will be
certainly bigger than the system pressure threshold, which is now 50 %
plus the allowance.

To fix that, this patch reworks the flush logic so that the regions are
not dependent on each other.

Fixes #1918"

* 'flush-criteria-v6' of github.com:glommer/scylla:
  config: get rid of memtable_total_space
  database: rework dirty memory hierarchy
  system keyspace: write batchlog mutation in user memory
  database: remove flush_token
  database: abstract pressure condition notification
  database: encapsulate semaphore_units into a flush_permit
  database: remove friendship declaration
  database: simplify flush_one
  database: make memtable_list aware in cases it can't flush
2016-12-14 11:24:10 +02:00
Glauber Costa
2aa6514667 config: get rid of memtable_total_space
Those values are now statically set.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 17:05:12 -05:00
Glauber Costa
db7cc3cba8 system keyspace: write batchlog mutation in user memory
Batchlog is a potentially memory-intensive table whose workload is
driven by user needs, not system's. Move it to the user dirty memory
manager.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:35 -05:00
Tomasz Grabiec
059a1a4f22 db: Fix commitlog replay to not drop cell mutations with older schema
column_mapping is not safe to access across shards, because data_type
is not safe to access. One of the manifestation of this is that
abstract_type::is_value_compatible_with() always fails if the two
types belong to different shards.

During replay, column_mapping lives on the replaying shard, and is
used by converting_mutation_partition_applier against the schema on
the target shard. Since types in the mapping will be considered
incompatible with types in the schema, all cells will be dropped.

Fix by using column_mapping in a safe way, by copying it to the target
shard if necessary. Each shard maintains its own cache of column
mappings.

Fixes #1924.
Message-Id: <1481310463-13868-1-git-send-email-tgrabiec@scylladb.com>
2016-12-13 12:19:32 +02:00
Glauber Costa
9b5e6d6bd8 commitlog: correctly report requests blocked
The semaphore future may be unavailable for many reasons. Specifically,
if the task quota is depleted right between sem.wait() and the .then()
clause in get_units() the resulting future won't be available.

That is particularly visible if we decrease the task quota, since those
events will be more frequent: we can in those cases clearly see this
counter going up, even though there aren't more requests pending than
usual.

This patch improves the situation by replacing that check. We now verify
whether or not there are waiters in the semaphore.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <113c0d6b43cd6653ce972541baf6920e5765546b.1481222621.git.glauber@scylladb.com>
2016-12-09 15:02:26 +02:00
Tomasz Grabiec
f7197dabf8 commitlog: Fix replay to not delete dirty segments
The problem is that replay will unlink any segments which were on disk
at the time the replay starts. However, some of those segments may
have been created by current node since the boot. If a segment is part
of reserve for example, it will be unlinked by replay, but we will
still use that segment to log mutations. Those mutations will not be
visible to replay after a crash though.

The fix is to record preexisting segents before any new segments will
have a chance to be created and use that as the replay list.

Introduced in abe7358767.

dtest failure:

 commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup

Message-Id: <1481117436-6243-1-git-send-email-tgrabiec@scylladb.com>
2016-12-07 15:54:47 +02:00
Asias He
00d7a35949 utils: Put crc32 under utils namespace
It conflicts with crc in zlib
Message-Id: <1480918984-4117-2-git-send-email-asias@scylladb.com>
2016-12-05 11:48:29 +02:00
Glauber Costa
99a5a77234 prevent commitlog replay position reordering during reserve refill
When requests hit the commitlog, each of them will be assigned a replay
position, which we expect to be ordered. If reorders happen, the request
will be discarded and re-applied. Although this is supposed to be rare,
it does increase our latencies, specially when big requests are
involved. Processing big requests is expensive and if we have to do it
twice that adds to the cost.

The commitlog is supposed to issue replay positions in order, and it
coudl be that the code that adds them to the memtables will reorder
them. However, there is one instance in which the commitlog will not
keep its side of the bargain.

That happens when the reserve is exhausted, and we are allocating a
segment directly at the same time the reserve is being replenished.  The
following sequence of events with its deferring points will ilustrate
it:

on_timer:

    return this->allocate_segment(false). // defer here // then([this](sseg_ptr s) {

At this point, the segment id is already allocated.

new_segment():

    if (_reserve_segments.empty()) {
	[ ... ]
        return allocate_segment(true).then ...

At this point, we have a new segment that has an id that is higher than
the previous id allocated.

Then we resume the execution from the deferring point in on_timer():

    i = _reserve_segments.emplace(i, std::move(s));

The next time we need to allocate a segment, we'll pick it from the
reserve. But the segment in the reserve has an id that is lower than the
id that we have already used.

Reorders are bad, but this one is particularly bad: because the reorder
happens with the segment id side of the replay position, that means that
every request that falls into that segment will have to be reinserted.

This bug can be a bit tricky to reproduce. To make it more common, we
can artificially add a sleep() fiber after the allocate_segment(false)
in on_timer(). If we do that, we'll see a sea of reinsertions going on
in the logs (if dblog is set to debug).

Applying this patch (keeping the sleep) will make them all disappear.
We do this by rewriting the reserve logic, so that the segments always
come from the reserve. If we draw from a single pool all the time, there
is no chance of reordering happening. To make that more amenable, we'll
have the reserve filler always running in the background and take it out
of the timer code.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <49eb7edfcafaef7f1fdceb270639a9a8b50cfce7.1480531446.git.glauber@scylladb.com>
2016-12-01 13:20:46 +01:00
Tomasz Grabiec
31645e2c4a commitlog: Allow allocations to be timed out 2016-11-29 16:40:58 +01:00
Glauber Costa
353a4cd2d4 commitlog: sync segments before acquiring semaphore on shutdown.
Sync all segments before acquiring the semaphore, otherwise waiting may
have to wait for the timer to kick in and push them down.
Note that we can't guarantee that no other requests were executed in the
mean time, so we have to sync again.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <aea019fe49820acce5d2b55dd5ec31e975b3436c.1480388674.git.glauber@scylladb.com>
2016-11-29 11:07:28 +02:00
Tomasz Grabiec
96c7764458 Revert "prevent commitlog replay position reordering during reserve refill"
This reverts commit 0e9b75d406.

commitlog_test fails with this:

Running 14 test cases...
ERROR 2016-11-28 20:48:00,565 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:00,578 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:10,591 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:20,601 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
tests/commitlog_test.cc(203): fatal error in "test_commitlog_discard_completed_segments": critical check dn <= nn failed
ERROR 2016-11-28 20:48:20,645 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:20,837 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
WARN  2016-11-28 20:48:20,838 [shard 0] commitlog - Exception in segment reservation: std::system_error (error system:2, No such file or directory)
ERROR 2016-11-28 20:48:20,952 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,064 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,083 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,098 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,111 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,113 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
WARN  2016-11-28 20:48:31,116 [shard 0] commitlog - Could not allocate 16388 k bytes output buffer (16388 k required)

*** 1 failure detected in test suite "tests/commitlog_test.cc"
WARN  2016-11-28 20:48:31,117 [shard 0] commitlog - Exception in segment reservation: std::system_error (error system:2, No such file or directory)
2016-11-28 20:52:13 +01:00
Glauber Costa
0e9b75d406 prevent commitlog replay position reordering during reserve refill
When requests hit the commitlog, each of them will be assigned a replay
position, which we expect to be ordered. If reorders happen, the request
will be discarded and re-applied. Although this is supposed to be rare,
it does increase our latencies, specially when big requests are
involved. Processing big requests is expensive and if we have to do it
twice that adds to the cost.

The commitlog is supposed to issue replay positions in order, and it
coudl be that the code that adds them to the memtables will reorder
them. However, there is one instance in which the commitlog will not
keep its side of the bargain.

That happens when the reserve is exhausted, and we are allocating a
segment directly at the same time the reserve is being replenished.  The
following sequence of events with its deferring points will ilustrate
it:

on_timer:

    return this->allocate_segment(false). // defer here // then([this](sseg_ptr s) {

At this point, the segment id is already allocated.

new_segment():

    if (_reserve_segments.empty()) {
	[ ... ]
        return allocate_segment(true).then ...

At this point, we have a new segment that has an id that is higher than
the previous id allocated.

Then we resume the execution from the deferring point in on_timer():

    i = _reserve_segments.emplace(i, std::move(s));

The next time we need to allocate a segment, we'll pick it from the
reserve. But the segment in the reserve has an id that is lower than the
id that we have already used.

Reorders are bad, but this one is particularly bad: because the reorder
happens with the segment id side of the replay position, that means that
every request that falls into that segment will have to be reinserted.

This bug can be a bit tricky to reproduce. To make it more common, we
can artificially add a sleep() fiber after the allocate_segment(false)
in on_timer(). If we do that, we'll see a sea of reinsertions going on
in the logs (if dblog is set to debug).

Applying this patch (keeping the sleep) will make them all disappear.
We do this by rewriting the reserve logic, so that the segments always
come from the reserve. If we draw from a single pool all the time, there
is no chance of reordering happening. To make that more amenable, we'll
have the reserve filler always running in the background and take it out
of the timer code.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2606b97df39997bcf3af84a23adf17e094ffb0b8.1480107174.git.glauber@scylladb.com>
2016-11-28 19:26:26 +01:00
Avi Kivity
28857e42e7 Merge " Virtualize size_estimates system table" from Duarte
"We currently write the size_estimates system table for every schema
on a periodic basis, currently set to 5 minutes, which can interfere
with an ongoing workload.

This patchset virtualizes it such that queries are intercepted and we
calculate the results on the fly, only for the ranges the caller is interested in.

Fixes #1616"

* 'virtual-estimates/v4' of github.com:duarten/scylla:
  size_estimates_virtual_reader: Add unit test
  db: Delete size_estimates_recorder
  size_estimates: Add virtual reader
  column_family: Add support for virtual readers
  storage_service: get_local_tokens() returns a future
  nonwrapping_range: Add slice() function
  range: Find a sequence's lower and upper bounds
  system_keyspace: Build mutations for size estimates
  size_estimates: Store the token range as bytes
  range_estimates: Add schema
  murmur3_partitioner: Convert maximum_token to sstring
2016-11-28 10:12:59 +02:00
Avi Kivity
b81a57e8eb config, dht: reduce default msb ignore bits to 4
With the default value of 12, a node's range is partitioned into
4096 * smp::count sub-ranges which are queried sequentually for a range
scan.  If the number of rows in the table is smaller than the required
result size, we will query all of them.  This can take so long that we
time out.

A better fix is to query multiple sub-ranges in parallel and merge them,
but for that we need to resurrect the non-sequential merger.
2016-11-23 21:25:37 +02:00
Paweł Dziepak
919825a2c7 Merge "Improve sharding in large clusters" from Avi
"Clusters with a large number of nodes, or a low number of vnodes, and a
high number of shards, or a combination, suffer from an aliasing problem:
both vnodes and intra-node sharding consider the most significant bits
to select the owning node and owning shard respectively.  Since the same
bits are used for both, a low number of vnodes leads to some shards
being overcommitted relative to others.

This series fixes the problem by sharding on bits 0:47 of the token
(murmur3 partitioner only), leaving the most significant 12 bits for
vnodes.  Simulation shows that this value provides reasonable sharding
for 100-node, 30-shard clusters.

In order to prevent re-sharding sstables on each boot, token ranges for
the range are stored in a new sub-component of the sstable Statistics
component. With the default 12 ignored bits we have 4096 token ranges
for non-Level-compacted SSTables, which takes some space but is still
reasonable.

Fixes #1277."
2016-11-23 11:25:53 +00:00
Avi Kivity
07d5a20bae Wire up sharding ignore msb parameter to configuration
We might have used a fancy map<sstring, any> to pass the parameters, but
that's overkill for now.
2016-11-22 22:40:47 +02:00
Glauber Costa
0b8b5abf16 commitlog: acquire semaphore earlier
Recently we have changed our shutdown strategy to wait for the
_request_controller semaphore to make sure no other allocations are
in-flight. That was done to fix an actual issue.

The problem is that this wasn't done early enough. We acquire the
semaphore after we have already marked ourselves as _shutdown and
released the timer.

That means that if there is an allocation in flight that needs to use a
new segment, it will never finish - and we'll therefore neve acquire
the semaphore.

Fix it by acquiring it first. At this point the allocations will all be
done and gone, and then we can shutdown everything else.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <5c2a2f20e3832b6ea37d6541897519a9307294ed.1479765782.git.glauber@scylladb.com>
2016-11-21 22:19:32 +00:00
Duarte Nunes
6a37d87c76 db: Delete size_estimates_recorder
Now that access to the size_estimates system is virtualized, we no
longer need the recorder.

Fixes #1616

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:05 +00:00
Duarte Nunes
225648780d size_estimates: Add virtual reader
This patch add a virtual mutation_reader so that queries
to the size_estimates system table are handled by the engine
without needing to perform any IO.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:05 +00:00
Duarte Nunes
636287fdf2 system_keyspace: Build mutations for size estimates
This patch adds a function to system_keyspace responsible for creating
a mutation to a partition of the size_estimates system table from a
set of range_estimates.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:04 +00:00