Compare commits

...

666 Commits

Author SHA1 Message Date
Pekka Enberg
1915521974 release: prepare for 1.0.4 2016-05-29 10:41:38 +03:00
Tomasz Grabiec
ef9974e723 tests: Add unit tests for schema_registry
(cherry picked from commit 90c31701e3)
2016-05-18 14:52:45 +03:00
Tomasz Grabiec
93ac6a584a schema_registry: Fix possible hang in maybe_sync() if syncer doesn't defer
Spotted during code review.

If it doesn't defer, we may execute the then_wrapped() body before we
change the state. Fix by moving the then_wrapped() body after the state changes.

(cherry picked from commit 443e5aef5a)
2016-05-18 13:53:14 +03:00
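The ordering bug above can be reproduced with a toy future whose continuation runs immediately when the future is already resolved: if the continuation is attached before the state change and the syncer didn't defer, the continuation observes stale state. A minimal sketch (names illustrative, not Scylla's actual schema_registry code):

```python
class ToyFuture:
    """A future whose continuation runs immediately if already resolved."""
    def __init__(self, resolved=False):
        self._resolved = resolved
        self._callbacks = []

    def then(self, cb):
        if self._resolved:
            cb()                      # syncer didn't defer: runs right now
        else:
            self._callbacks.append(cb)

    def resolve(self):
        self._resolved = True
        for cb in self._callbacks:
            cb()

def maybe_sync(state, syncer, fixed):
    """Return the phase the continuation observed."""
    observed = []
    f = syncer()
    if fixed:
        state['phase'] = 'syncing'    # the fix: change state first...
        f.then(lambda: observed.append(state['phase']))
    else:
        f.then(lambda: observed.append(state['phase']))
        state['phase'] = 'syncing'    # buggy: too late if f already resolved
    return observed[0]
```

With a synchronous syncer (`lambda: ToyFuture(resolved=True)`), the buggy path observes the stale phase while the fixed path observes the new one.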
Tomasz Grabiec
2457a16d23 migration_manager: Fix schema syncing with older version
The problem was that "s" would not be marked as synced-with if it came from
shard != 0.

As a result, mutation using that schema would fail to apply with an exception:

  "attempted to mutate using not synced schema of ..."

The problem could surface when altering schema without changing
columns and restarting one of the nodes so that it forgets past
versions.

Fixes #1258.

Will be covered by dtest:

  SchemaManagementTest.test_prepared_statements_work_after_node_restart_after_altering_schema_without_changing_columns

(cherry picked from commit 8703136a4f)
2016-05-18 13:52:24 +03:00
Tomasz Grabiec
daabc8777d migration_manager: Invalidate prepared statements on every schema change
Currently we only do that when the column set changes. When prepared
statements are executed, parameters like read repair chance are read
from the schema version stored in the statement. Not invalidating prepared
statements on changes to such parameters makes it appear as if the alter
had no effect.

Fixes #1255.
Message-Id: <1462985495-9767-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 13d8cd0ae9)
(cherry picked from commit 734cfa949a)
2016-05-15 13:36:39 +03:00
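The invalidation policy can be sketched with a toy cache (a hypothetical sketch, not Scylla's actual code): every schema change drops all cached statements, because a statement captures more from its pinned schema version than just the column set.

```python
class PreparedStatementCache:
    """Toy cache keyed by query string; each entry pins a schema version."""
    def __init__(self):
        self._cache = {}

    def prepare(self, query, schema_version):
        self._cache[query] = schema_version
        return schema_version

    def lookup(self, query):
        return self._cache.get(query)

    def on_schema_change(self):
        # Invalidate on *every* schema change, not only when the column
        # set changes: parameters such as read repair chance also live
        # in the pinned schema version.
        self._cache.clear()
```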
Raphael S. Carvalho
b259e1b0bc tests: test that leveled strategy was fixed
L1 wasn't being compacted into L2.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <1a357896a448eafa7da4d28bc56fa02b89d4193e.1460508373.git.raphaelsc@scylladb.com>
(cherry picked from commit beaacbda2e)
2016-05-09 08:17:59 +03:00
Raphael S. Carvalho
322f194032 sstables: Fix leveled compaction strategy
There is a problem in the implementation of leveled compaction strategy that
prevents level 1 from being compacted into level 2, and so forth. As a result,
all sstables will only belong to either level 0 or 1. One of the consequences
is level 1 being overwhelmed by a huge amount of sstables.

The root of the problem is a conditional statement in the code that prevents a
single sstable, with level > 0, from being compacted into a subsequent level
that is empty or has no overlapping sstables.

Fixes #1180.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <9a4bffdb0368dea77b49c23687015ff5832299ab.1460508373.git.raphaelsc@scylladb.com>
(cherry picked from commit c7b728e716)
2016-05-09 08:17:39 +03:00
Glauber Costa
c51b05efb3 throttle: always release at least one request if we are below the limit
Our current throttling code releases one request per 1MB of memory available
that we have. If we are below the memory limit, but not by 1MB or more, then
we will keep trying to unthrottle, but never actually release anything.

If another memtable is close to the flushing point, those requests may be
exactly the ones that would make it flush. Without them, we'll freeze the
database.

In general, we need to always release at least one request to make sure that
progress is always achieved.

This fixes #1144

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 9c87ae3496)
2016-05-09 08:14:37 +03:00
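The release rule amounts to rounding up to at least one request whenever any memory is free. A sketch of the arithmetic (the 1MB granularity is from the commit message; the function name is illustrative):

```python
RELEASE_GRANULARITY = 1 << 20  # 1MB of freed memory per released request

def requests_to_release(bytes_below_limit):
    """Number of throttled requests to wake up.

    bytes_below_limit // RELEASE_GRANULARITY alone rounds down to zero
    when we are under the limit by less than 1MB, stalling progress
    forever; always release at least one request in that case."""
    if bytes_below_limit <= 0:
        return 0                       # still at or over the limit
    return max(1, bytes_below_limit // RELEASE_GRANULARITY)
```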
Glauber Costa
2e41a09631 memtable_list: make sure at least two memtables are available
This is usually not a problem for the main memtable list - although it can be,
depending on settings - but it shows up easily for the streaming memtables list.

We would like to have at least two memtables, even if we have to cut them
short. If we don't do that, one memtable will use all available memory and
we'll force throttling until that memtable is completely flushed.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 2c5dfe08c1)
2016-05-09 08:14:37 +03:00
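The intent can be sketched as a cap on each memtable's size so that a second one always fits (a hypothetical sketch; the real sizing logic lives in memtable_list):

```python
def memtable_size_limit(region_group_memory, preferred_size):
    """Cap a memtable's size so at least two fit in the region group.

    Without the cap, a single memtable can consume all available memory,
    forcing writes to throttle until that memtable is fully flushed; a
    second memtable lets writes proceed while the first one flushes."""
    return min(preferred_size, region_group_memory // 2)
```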
Glauber Costa
44cfbc15d0 unnest throttle_state
throttle_state is currently a nested member of database, but there is no
particular reason for us to do so - aside from the fact that it is currently
only ever referenced by the database.

We'll soon want some interaction between this and the column family, to
allow us to flush during throttle. To make that easier, let's unnest it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 1daede7396)
2016-05-09 08:14:37 +03:00
Glauber Costa
c9bd954237 move information about memtables' region group inside memtable list
This is a preparation patch so we can move the throttling infrastructure inside
the memtable_list. To do that, the region group will have to be passed to the
throttler so let's just go ahead and store it.

As a consequence, all the CF has to tell us is the current schema - no
longer how to create a new memtable.

Also, with a new parameter passed to the memtable_list, the creation code
gets quite big and hard to follow, so let's move the creation functions
into a helper.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 39def369ce)
2016-05-09 08:14:37 +03:00
Calle Wilund
6c4d7223fe database.cc: Fix compilation error with boost 1.55
Message-Id: <1461067254-526-1-git-send-email-calle@scylladb.com>
(cherry picked from commit 9130b0de16)
2016-05-04 08:42:21 +03:00
Calle Wilund
c1a5488993 sstables: Fix compilation error on boost 1.55
Message-Id: <1461067254-526-2-git-send-email-calle@scylladb.com>
(cherry picked from commit 49d3d79dfe)
2016-05-04 08:42:15 +03:00
Pekka Enberg
9c9f62e30b release: prepare for 1.0.3 2016-05-02 14:29:15 +03:00
Pekka Enberg
c147676ccb dist/docker/redhat: Make sure image builds against latest Scylla
Use "yum clean expire-cache" to make sure we build against the latest
Scylla release.
Message-Id: <1460374418-27315-1-git-send-email-penberg@scylladb.com>

(cherry picked from commit 355c3ea331)
2016-04-27 15:07:38 +03:00
Raphael S. Carvalho
07adedf28a tests: fix use-after-free in sstable test
After commit a843aea547, a gate was introduced to make sure that
an asynchronous operation is finished before the column family is
destroyed. An sstable test case was not stopping the column family;
it just removed the column family from the compaction manager.
That could cause a use-after-free if the column family is destroyed
while the asynchronous operation is running. Let's fix it by
stopping the column family in the test.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <ed910ec459c1752148099e6dc503e7f3adee54da.1461177411.git.raphaelsc@scylladb.com>
(cherry picked from commit eb51c93a5a)
2016-04-26 10:37:40 +03:00
Pekka Enberg
8ca530b6d3 Merge "Backport atomic sstable deletion to 1.0" from Avi
"This patchset is a backport of the atomic sstable deletion patchset, which
 waits until all shards agree to delete an sstable set before deleting it,
 avoiding the resurrecting data problem.

 The first four patches are identical to master, the last patch is new.

 Fixes #1181"
2016-04-25 14:12:33 +03:00
Avi Kivity
e5a123ea80 sstables: avoid long-duration smp calls in delete_atomically()
Since seastar is limited to 128 cross-shard calls per shard-pair,
long-duration smp calls can lead to deadlocks.

Prevent such calls by returning immediately from shard 0 (which manages
the deletions), and calling back to the requesting shard when the deletion
completes.
2016-04-25 13:21:00 +03:00
Avi Kivity
9bfce3255a db: delete compacted sstables atomically
If sstables A, B are compacted, A and B must be deleted atomically.
Otherwise, if A has data that is covered by a tombstone in B, and that
tombstone is deleted, and if B is deleted while A is not, then the data
in A is resurrected.

Fixes #1181.

(cherry picked from commit a843aea547)
2016-04-25 11:41:50 +03:00
Avi Kivity
d2251199b2 sstables: convert sstable::mark_for_deletion() to atomic deletion infrastructure
All deletions must go through the same data structure, or some atomic
deletions will never be satisfied.

(cherry picked from commit 3798d04ae8)
2016-04-25 11:41:39 +03:00
Avi Kivity
bed6437b38 main: cancel pending atomic deletions on shutdown
A shared sstable must be compacted by all shards before it can be deleted.
Since we're stopping, that's not going to happen.  Cancel those pending
deletions so that anyone waiting on them can continue.

(cherry picked from commit e43dbac836)
2016-04-25 11:41:28 +03:00
Avi Kivity
70508734a5 sstables: add delete_atomically(), for atomically deleting multiple sstables
When we compact a set of sstables, we have to remove the set atomically,
otherwise we can resurrect data if the following happens:

 insert data to sstable A
 insert tombstone to sstable B
 compact A+B -> C (removing both data and tombstone)
 delete B only
 read data from A

Since an sstable may be shared by multiple shards, and each shard performs
compaction at a different time, we need to defer deletion of an sstable
set until all shards agree that the set can be deleted.

An additional atomicity issue exists because posix does not provide a way
to atomically delete multiple files.  This issue is not addressed by this
patch.

(cherry picked from commit 2ba584db8d)
2016-04-25 11:41:20 +03:00
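The "wait until all shards agree" scheme above can be sketched as per-set bookkeeping: each shard records its agreement when it has compacted the set away, and the files are only removed, together, once the last shard agrees (illustrative names, not the actual Scylla infrastructure):

```python
class AtomicDeletionTracker:
    """Sketch of deferred, all-shards-agree sstable set deletion."""
    def __init__(self, n_shards):
        self._n_shards = n_shards
        self._pending = {}          # sstable set -> shards that agreed
        self.deleted = []

    def request_deletion(self, shard, sstables):
        """Record one shard's agreement; delete the whole set only once
        every shard has compacted it away. Returns True when deleted."""
        key = frozenset(sstables)
        agreed = self._pending.setdefault(key, set())
        agreed.add(shard)
        if len(agreed) < self._n_shards:
            return False            # some shard still reads these files
        del self._pending[key]
        self.deleted.append(key)    # all files of the set go together
        return True
```

Deleting the whole set in one step is what prevents the resurrection scenario: the tombstone-holding sstable can never disappear while its data-holding peer remains.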
Pekka Enberg
60307f62fe release: prepare for 1.0.2 2016-04-20 22:10:57 +03:00
Gleb Natapov
8006a15e3b udt: fix error generation if accessed type is not udt
Fixes #1198
Message-Id: <1460884314-3717-2-git-send-email-gleb@scylladb.com>

(cherry picked from commit f3b515052b)
2016-04-19 11:28:53 +03:00
Duarte Nunes
1cfbc29f01 udt: Implement to_string() for selectable
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1460884314-3717-1-git-send-email-gleb@scylladb.com>
(cherry picked from commit ece89069dd)
2016-04-19 11:28:46 +03:00
Tomasz Grabiec
c665455b71 tests: Add test for query of collection with deleted item
(cherry picked from commit 89bc32b020)
2016-04-18 11:30:28 +03:00
Tomasz Grabiec
b09c91d1c8 mutation_partition: Fix collection emptiness check
Broken by f15c380a4f.

This resulted in an empty collection being returned in the results
instead of no collection.

Fixes org.apache.cassandra.cql3.validation.entities.CollectionsTest
from cassandra-unit-tests.

(cherry picked from commit c69d0a8e87)
2016-04-18 11:30:22 +03:00
Tomasz Grabiec
776ae831e6 types: Add default argument values to is_any_live()
(cherry picked from commit b0d4782016)
2016-04-18 11:30:16 +03:00
Pekka Enberg
2ad3c7532f Merge "Summary backport" from Glauber
This series contains 1.0 backports of the following series:

 * Commit 9b98278 ("Merge "Be able to boot without a Summary" from Glauber")
 * Commit 60352f8 ("Merge "Fixes for the reading of missing Summary" from Glauber")

The backport was done by Glauber because the original commits don't work
as-is due to I/O error handling differences in master and 1.0.

Fixes #1170
2016-04-13 22:02:40 +03:00
Glauber Costa
91c35c3e19 sstable_tests: make sure the generation of the Summary is sane
When we regenerate a missing Summary, we should make sure it is
generated sanely, and that it resembles the Summary that would
otherwise have been there.

In this test we take one of the Summary tests we've been running
and apply it to the non-existent Summary file, expecting the same
results. Plus, a new test is added with some sanity checking.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-13 14:18:03 -04:00
Glauber Costa
4f0cc195dc be robust against broken summary files
Now that we can boot without a Summary file, we can just as easily boot
with a broken one.

Suggested by Nadav, and it is actually very easy to do, so do it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-13 14:17:54 -04:00
Glauber Costa
c9f7986be4 review fixes for generate_summary
Spotted by Avi post-merge
1) Need to close the file
2) Should be using the parameter pc instead of the default_class

1.0 backport: general_disk_error is non-existent. Replace it with just
propagating the exception

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-13 14:17:15 -04:00
Glauber Costa
4feaf1372b clear components if reading toc fails
This shouldn't be a problem in practice, because if read_toc() fails,
the users will just tend to discard the sstable object altogether, and
not insist on using it.

However, if somebody does try to keep using it, a subsequent read_toc() could
theoretically find some components already filled in, leading the new reader
to believe the toc was populated successfully.

It is easier to just clear the _components set and never worry about it, than
trying to reason about whether or not that could happen.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-13 14:14:04 -04:00
Glauber Costa
3ebfecc88e index_reader: avoid misleading parent name
Also add comments about the expected signature of IndexConsumer

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-13 14:13:56 -04:00
Glauber Costa
c841d87fe3 summary: generate one if it is not present
There are cases in which a Summary file will not be present, and imported
SSTables will have just the Index and Data files. In earlier versions of
Cassandra, a Summary didn't exist, so one may not be generated when migrating.

In Issue #1170, we can see an example of tables generated by CQLSSTableWriter,
and they lack a Summary. Cassandra is robust against this and can cope
perfectly with the Summary not existing. I will argue that we should do the
same.

1.0 backport: open_checked_file_dma -> open_file_dma

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-13 14:13:11 -04:00
Glauber Costa
7a887ea2ea sstables: allow read_toc to be called more than once
We do that by bailing out immediately if we detect that the components
map is already populated. This allows us to call read_toc() earlier
if we need to - for instance, to inquire about the existence of the
Summary - without needing to re-read the components later.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-13 14:10:52 -04:00
Glauber Costa
bc4d63c802 sstables: avoid passing schema unnecessarily
For prepare_summary we can just pass the min interval as a parameter and
avoid having the schema do yet another hop. For sealing the summary, the
schema is completely unused and we can do away with it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-13 14:10:41 -04:00
Glauber Costa
616196b543 index reader: make index_consumer a template parameter
This is done so we can use other consumers. An example of that is the
regeneration of the Summary from an existing Index.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-13 14:10:32 -04:00
Glauber Costa
a04f462904 make get_sstable_key_range an instance method
Because just creating an SSTable object does not generate any I/O,
get_sstable_key_range should be an instance method. The main advantage
of doing that is that we won't have to read the summary twice. The way
we do it currently, if it happens to be a shard-relevant table we'll
call load() - which reads the summary again.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-13 14:10:18 -04:00
Glauber Costa
ebf8fb802e do not re-read the summary
There are times in which we read the Summary file twice. That actually happens
every time during normal boot (it doesn't during refresh). First during
get_sstable_key_range and then again during load().

Every summary will have at least one entry, so we can easily test for whether
or not this is properly initialized.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-13 14:10:00 -04:00
Avi Kivity
8d1374e911 sstables: filter sstables single-row read using first_key/last_key
Using leveled compaction strategy, only a few sstables will contain a
given key, so we need to filter out the rest.  Using the summary entries
to filter keys works if the key is before the first summary entry,
but does not work if it is after the last summary entry, because the last
summary entry does not represent the last key; so sstables that are
towards the beginning of the ring are read even if they do not contain
the key, greatly reducing read performance.

Fix by consulting the summary's first_key/last_key entries before consulting
the summary entry array.

(cherry picked from commit 715794cce6)
2016-04-13 09:25:07 +03:00
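The filter described above amounts to a simple range check per sstable before the (more expensive) summary lookup. A minimal sketch, with illustrative names and plain-string keys standing in for partition keys:

```python
def may_contain(key, first_key, last_key):
    """Cheap containment filter for a single-row read.

    An sstable can hold `key` only if first_key <= key <= last_key; the
    summary's last entry is not the sstable's last key, so it cannot
    serve as the upper bound."""
    return first_key <= key <= last_key

def sstables_to_read(key, sstables):
    """Filter a list of (name, first_key, last_key) descriptors."""
    return [name for name, lo, hi in sstables if may_contain(key, lo, hi)]
```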
Avi Kivity
bacc769328 Update seastar submodule (branch-1.0)
* seastar aa281bd...0225940 (10):
  > memory: avoid exercising the reclaimers for oversized requests
  > memory: fix live objects counter underflow due to cross-cpu free
  > core/reactor: Don't abort in allocate_aligned_buffer() on allocation failure
  > scripts/posix_net_conf.sh: added a support for bonding interfaces
  > scripts/posix_net_conf.sh: move the NIC configuration code into a separate function
  > scripts/posix_net_conf.sh: implement the logic for selecting default MQ mode
  > scripts/posix_net_conf.sh: forward the interface name as a parameter
  > http/routes: Remove request failure logging to stderr
  > lowres_clock: Initialize _now when the clock is created
  > apps/iotune: fix broken URL
2016-04-11 09:18:47 +03:00
Avi Kivity
241eb9e199 Update seastar submodule to point to scylla-seastar
This allows us to cherry-pick seastar fixes.
2016-04-10 18:25:31 +03:00
Pekka Enberg
58fdfe5bc9 release: prepare for 1.0.1 2016-04-09 19:21:21 +03:00
Tomasz Grabiec
f45cc1b229 tests: cql_query_test: Add test for slicing in reverse
(cherry picked from commit 3e0c24934b)
2016-04-09 18:42:53 +03:00
Tomasz Grabiec
14f9eeaafd mutation_partition: Fix static row being returned when paginating
Reproduced by dtest paging_test.py:TestPagingData.static_columns_paging_test.

Broken by f15c380a4f, where the
calculation of has_ck_selector got broken, in such a way that present
clustering restrictions were treated as if not present, which resulted
in the static row being returned when it shouldn't be.

While at it, unify the check between query_compacted() and
do_compact() by extracting it to a function.

(cherry picked from commit c2b955d40b)
2016-04-09 18:42:53 +03:00
Tomasz Grabiec
05df90ad4b mutation_partition: Fix reversed trim_rows()
The first erase_and_dispose(), which removes rows between the last
position and the beginning of the next range, can invalidate the end()
iterator of the range. Fix by looking up the end after erasing.

mutation_partition::range() was split into lower_bound() and
upper_bound() to allow for that.

This affects for example queries with descending order where the
selected clustering range is empty and falls before all rows.

Exposed by f15c380a4f, which is now
calling do_compact() during query.

Reproduced by dtest paging_test.py:TestPagingData.static_columns_paging_test

(cherry picked from commit a1539fed95)
2016-04-09 18:42:53 +03:00
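The same pitfall can be shown with plain list indices: a bound computed before an erase is stale afterwards, just as the range's end() iterator was. A sketch of the fixed ordering (look up the upper bound only after erasing), using a sorted list of rows:

```python
import bisect

def trim_rows(rows, range_start, range_end):
    """Keep only the rows inside [range_start, range_end] (rows sorted).

    Erase the rows before the range first; only then compute the upper
    bound. A bound captured before the erase would be shifted by the
    number of erased rows - the index analogue of the invalidated
    end() iterator."""
    lo = bisect.bisect_left(rows, range_start)
    del rows[:lo]                        # shifts every later position
    hi = bisect.bisect_right(rows, range_end)   # looked up after erasing
    del rows[hi:]
    return rows
```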
Tomasz Grabiec
5646faba18 tests: Add test for query digest calculation
(cherry picked from commit 474a35ba6b)
2016-04-09 18:42:52 +03:00
Tomasz Grabiec
814df06245 tests: mutation_source: Include random mutations in generate_mutation_sets() result
Probably increases coverage.

(cherry picked from commit 4418da77e6)
2016-04-09 18:42:52 +03:00
Tomasz Grabiec
5ac9e2501c tests: mutation_test: Move mutation generator to mutation_source_test.hh
So that it can be reused.

(cherry picked from commit 5d768d0681)
2016-04-09 18:42:52 +03:00
Tomasz Grabiec
34ddfb4498 tests: mutation_test: Add test case for querying of expired cells
(cherry picked from commit 30d25bc47a)
2016-04-09 18:42:52 +03:00
Tomasz Grabiec
e4d4d0b31c partition_slice_builder: Add new setters
(cherry picked from commit 58bbd4203f)
2016-04-09 18:42:52 +03:00
Tomasz Grabiec
4125f279c0 tests: result_set_assertions: Add and_only_that()
(cherry picked from commit 7cd8e61429)
2016-04-09 18:42:52 +03:00
Tomasz Grabiec
e276e7b1e3 database: Compact mutations when executing data queries
Currently the data query digest includes cells and tombstones which may have
expired or be covered by higher-level tombstones. This causes a digest
mismatch between replicas if some elements are compacted on one of the
nodes and not on others. This mismatch triggers read-repair, which doesn't
resolve because the mutations received by mutation queries do not differ -
they are already compacted.

The fix adds a compacting step before writing and digesting query results,
by reusing the algorithm used by the mutation query. This is not the optimal
way to fix this: the compaction step could be folded into the query writing,
as there is redundancy between the two steps. However, such a change carries
more risk, and was thus postponed.

perf_simple_query test (cassandra-stress-like partitions) shows regression
from 83k to 77k (7%) ops/s.

Fixes #1165.

(cherry picked from commit f15c380a4f)
2016-04-09 18:42:52 +03:00
Tomasz Grabiec
a516b24111 mutation_query: Extract main part of mutation_query() into more generic querying_reader
So that it can be reused in query()

(cherry picked from commit e4e8acc946)
2016-04-09 18:42:52 +03:00
Gleb Natapov
4642c706c1 commitlog, sstables: enlarge XFS extent allocation for large files
With big rows I see contention in XFS allocations which causes the reactor
thread to sleep. Commitlog is the main offender, so enlarge the extent to
the commitlog segment size for big files (commitlog and sstable Data files).

Message-Id: <20160404110952.GP20957@scylladb.com>
(cherry picked from commit 70575699e4)
2016-04-07 09:52:15 +03:00
Nadav Har'El
4666c095bc sstables: overhaul range tombstone reading
Until recently, we believed that range tombstones read from sstables would
always be for entire rows (or, more generally, clustering-key prefixes),
not for arbitrary ranges. But as we found out, because Cassandra insists
that range tombstones do not overlap, it may take two overlapping row
tombstones and convert them into three range tombstones which look like
general ranges (see the patch for a more detailed example).

Not only do we need to accept such "split" range tombstones, we also need
to convert them back to our internal representation which, in the above
example, involves two overlapping tombstones. This is what this patch does.

This patch also contains a test for this case: we created in Cassandra
an sstable with two overlapping deletions, and verify that when we read
it into Scylla, we get these two overlapping deletions - despite the
sstable file actually having contained three non-overlapping tombstones.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <b7c07466074bf0db6457323af8622bb5210bb86a.1459399004.git.glauber@scylladb.com>
(cherry picked from commit 99ecda3c96)
2016-03-31 12:58:07 +03:00
Nadav Har'El
507e6ec75a sstables: merge range tombstones if possible
This is a rewrite of Glauber's earlier patch to do the same thing, taking
into account Avi's comments (do not use a class, do not throw from the
constructor, etc.). I also verified that the actual use case which was
broken in #1136 was fixed by this patch.

Currently, we have no support for range tombstones because CQL will not
generate them as of version 2.x. Thrift will, but we can safely leave this for
the future.

However, we have seen cases during a real migration in which a pure-CQL
Cassandra would generate range tombstones in its SSTables.

Although we are not sure how and why, those range tombstones were of a special
kind: their end and next's start range were adjacent, which means that in
reality, they could very well have been written as a single range tombstone for
an entire clustering key - which we support just fine.

This code will attempt to fix this problem temporarily by merging such ranges
if possible. Care must be taken so that we don't end up accepting a true
generic range tombstone by accident.

Fixes #1136

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1459333972-20345-1-git-send-email-nyh@scylladb.com>
(cherry picked from commit 0fc9a5ee4d)
2016-03-31 12:57:56 +03:00
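The merge described above can be sketched over sorted (start, end) pairs. This is a simplified sketch treating adjacency as exact equality of bounds; the real check inspects clustering prefixes and end-of-component markers:

```python
def merge_adjacent_tombstones(ranges):
    """Merge range tombstones whose end equals the next range's start.

    Cassandra splits overlapping row tombstones into non-overlapping
    ranges; when one range's end is adjacent to the next range's start,
    the pair is really a single row tombstone and can be re-merged.
    Non-adjacent ranges are kept as-is, so a truly generic range
    tombstone is never merged by accident."""
    merged = []
    for start, end in ranges:
        if merged and merged[-1][1] == start:
            last_start, _ = merged[-1]
            merged[-1] = (last_start, end)   # extend the previous range
        else:
            merged.append((start, end))
    return merged
```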
Glauber Costa
29d6952ddd sstables: fix exception printouts in check_marker
As Nadav noticed in his bug report, check_marker is creating its error messages
using characters instead of numbers - which is what we intended here in the
first place.

That happens because sprint(), when faced with an 8-bit type, interprets it
as a character.  To avoid that we'll use uint16_t types, taking care not to
sign-extend them.

The bug also noted that one of the error messages is missing a parameter, and
that is also fixed.

Fixes #1122

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <74f825bbff8488ffeb1911e626db51eed88629b1.1459266115.git.glauber@scylladb.com>
(cherry picked from commit 23808ba184)
2016-03-31 12:56:02 +03:00
Pekka Enberg
9fae641099 release: prepare for 1.0.0 2016-03-30 12:19:12 +03:00
Pekka Enberg
ccd1fe4348 Revert "sanity check Seastar's I/O queue configuration"
This reverts commit 7b88ba8882, it's too
late for it.
2016-03-29 16:44:55 +03:00
Glauber Costa
7b88ba8882 sanity check Seastar's I/O queue configuration
While Seastar in general can accept any parameter for its I/O queues, Scylla
in particular shouldn't run with them disabled. That will be the case when
the max-io-requests parameter is not set.

On top of that, we would like to have enough depth per I/O queue to allow
for shard-local parallelism. Therefore, we will require a minimum per-queue
capacity of 4. On machines where the disk iodepth is not enough to allow 4
concurrent requests per shard, one should reduce the number of I/O queues.

For --max-io-requests, we will check the parameter itself. However, the
--num-io-queues parameter is not mandatory, and given enough concurrent
requests, Seastar's default configuration may well do the right thing. So
for that, we will check the final result for each I/O queue.

As is the case with other checks of this sort, this can be overridden by
the --developer-mode switch.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <63bf7e91ac10c95810351815bb8f5e94d75592a5.1458836000.git.glauber@scylladb.com>
(cherry picked from commit e750a94300)
2016-03-29 16:37:16 +03:00
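The check itself is simple arithmetic. A hypothetical sketch of the validation (function and constant names are illustrative, not Scylla's actual flags-handling code):

```python
MIN_PER_QUEUE_CAPACITY = 4  # minimum concurrent requests per I/O queue

def check_io_queues(max_io_requests, num_io_queues, developer_mode=False):
    """Return the per-queue capacity, or raise when it falls below the
    minimum and developer mode doesn't override the check."""
    per_queue = max_io_requests // num_io_queues
    if per_queue < MIN_PER_QUEUE_CAPACITY and not developer_mode:
        raise ValueError(
            "per-queue capacity %d < %d; reduce --num-io-queues"
            % (per_queue, MIN_PER_QUEUE_CAPACITY))
    return per_queue
```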
Pekka Enberg
46825a5e07 release: prepare for 1.0.rc3 2016-03-29 16:22:31 +03:00
Benoît Canet
740d98901f collectd: Write to the network to get rid of spurious log messages
Closes #1018

Suggested-by: Avi Kivity <avi@scylladb.com>
Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <1458759378-4935-1-git-send-email-benoit@scylladb.com>
(cherry picked from commit 4ac1126677)
2016-03-29 11:47:19 +03:00
Tomasz Grabiec
ceff8b9b41 schema_tables: Wait for notifications to be processed.
Listeners may defer since:

 93015bcc54 "migration_manager: Make the migration callbacks runs inside seastar thread"

Not all places were adjusted to wait for them. Fix that.

Message-Id: <1458837613-27616-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 53bbcf4a1e)
2016-03-29 11:18:32 +03:00
Gleb Natapov
1b2dbcc26e config: enable truncate_request_timeout_in_ms option
Option truncate_request_timeout_in_ms is used by truncate. Mark it as
used.

Message-Id: <20160323162649.GH2282@scylladb.com>
(cherry picked from commit 0afd1c6f0a)
2016-03-29 11:16:53 +03:00
Raphael Carvalho
75b2db7862 sstables: fix deletion of sstable with temporary TOC
After 4e52b41a4, remove_by_toc_name() became aware of temporary TOC
files; however, it doesn't consider that some components may be
missing if a temporary TOC is present.
When creating a new sstable, the first thing we do is write all
components into the temporary TOC, so the content of a temporary TOC
isn't reliable until it is renamed.

Solution is about implementing the following flow (described by Avi):
"Flow should be:

  - remove all components in parallel
  - forgive ENOENT, since the component may not have been written;
otherwise deletion error should be raised
  - fsync the directory
  - delete the temporary TOC
"

This problem can be reproduced by running compaction without disk
space, so compaction would fail and leave a partial sstable marked
for deletion. Afterwards, remove_by_toc_name() would try to delete
a component that doesn't exist because it looked at the content of
the temporary TOC.

Fixes #1095.

Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com>
Message-Id: <0cfcaacb43cc5bad3a8a7ea6c1fa6f325c5de97d.1459194263.git.raphaelsc@scylladb.com>
(cherry picked from commit d515a7fd85)
2016-03-29 10:56:49 +03:00
Tomasz Grabiec
789c1297dd storage_service: Fix typos
Message-Id: <1458837390-26634-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit d1db23e353)
2016-03-29 10:29:28 +03:00
Pekka Enberg
afeaaab034 Update scylla-ami submodule
* dist/ami/files/scylla-ami 89e7436...7019088 (1):
  > Re-enable clocksource=tsc on AMI
2016-03-29 09:59:34 +03:00
Takuya ASADA
80242ff443 dist: re-enable clocksource=tsc on AMI
The clocksource=tsc boot parameter was mistakenly dropped in b3c85aea89; it needs to be re-enabled.

[ penberg: Manual backport of commit 050fb911d5 to 1.0. ]
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1459180643-4389-1-git-send-email-syuu@scylladb.com>
2016-03-29 09:56:10 +03:00
Nadav Har'El
0b456578c0 sstable: fix read failure of certain sstables
We had a problem reading certain existing Cassandra sstables into
Scylla.

Our consume_range_tombstone() function assumes that the start and end
columns have certain "end of component" markers, and wants to verify
that assumption. But because of bugs in older versions of Cassandra,
see https://issues.apache.org/jira/browse/CASSANDRA-7593, sometimes the
"end of component" was missing (set to 0). CASSANDRA-7593 suggested
this problem might exist on the start column, so we allowed for that,
but we have now discovered a case where the end column is also set to
0 - causing the check in consume_range_tombstone() to fail, the sstable
read to fail, and Scylla to be unable to import that sstable from
Cassandra. Allowing for a 0 on the end column as well made it possible
to read that sstable, compact it, and so on.

Fixes #1125.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1459173964-23242-1-git-send-email-nyh@scylladb.com>
(cherry picked from commit a05577ca41)
2016-03-28 17:10:10 +03:00
Pekka Enberg
3b5a55c6fc release: prepare for 1.0.rc2 2016-03-27 10:19:53 +03:00
Raphael Carvalho
4f1d37c3c9 Fix corner-case in refresh
Problem found by a dtest which loads sstables with generations 1 and 2 into an
empty column family. The root of the problem is that the reshuffle procedure
changes new sstables to start from generation 2 at least, so reshuffle could
try to set generation 1 to 2 when generation 2 already exists.
This can be fixed by starting from generation 1 instead, so reshuffle
handles this case properly.

Fixes #1099.

Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com>
Message-Id: <88c51fbda9557a506ad99395aeb0a91cd550ede4.1458917237.git.raphaelsc@scylladb.com>
(cherry picked from commit e6e5999282)
2016-03-27 10:04:28 +03:00
Avi Kivity
8422a42381 dist: ami: fix AMI_OPT receiving no value
We assign AMI=0 and AMI_OPT=1, so in the true case, AMI_OPT has no value,
and a later compare fails.

(cherry picked from commit 077c0d1022)
2016-03-26 21:17:49 +03:00
Takuya ASADA
c0f31fac48 dist/ami: use tilde for release candidate builds
Sync with ubuntu package versioning rule

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1458882718-29317-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 2582dbe4a0)
2016-03-26 16:51:24 +02:00
Calle Wilund
6fe88a663f database: Use disk-marking delete function in discard_sstables
Fixes #797

To make sure an inopportune crash after truncate does not leave
sstables on disk to be considered live, and thus resurrect data,
after a truncate, use delete function that renames the TOC file to
make sure we've marked sstables as dead on disk when we finish
this discard call.
Message-Id: <1458575440-505-2-git-send-email-calle@scylladb.com>

Rebase to 1.0:
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-03-24 09:16:24 -04:00
Calle Wilund
5f76f3d445 sstables: Add delete func to rename TOC ensuring table is marked dead
Note: the "normal" remove_by_toc_name must now be prepared for, and check,
whether the TOC of the sstable has already been moved to a temp file when we
get to the juicy delete parts.
Message-Id: <1458575440-505-1-git-send-email-calle@scylladb.com>

For the rebase to 1.0:

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-03-24 09:05:03 -04:00
Asias He
6676d126aa streaming: Complete receive task after the flush
A STREAM_MUTATION_DONE message signals the receiver that the sender
has completed sending the stream mutations. When the receiver finds
it has zero tasks to send and zero tasks to receive, it finishes the
stream_session, and in turn finishes the stream_plan once all the
stream_sessions are finished. We should call receive_task_completed only
after the flush finishes, so that when the stream_plan is finished all the
data is on disk.

Fixes the repair_disjoint_data_test issue with Glauber's "[PATCH v4 0/9] Make
sure repairs do not cripple incoming load" series

======================================================================
FAIL: repair_disjoint_data_test
(repair_additional_test.RepairAdditionalTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "scylla-dtest/repair_additional_test.py",
line 102, in repair_disjoint_data_test
    self.check_rows_on_node(node1, 3000)
  File "scylla-dtest/repair_additional_test.py",
line 33, in check_rows_on_node
    self.assertEqual(len(result), rows, len(result))
AssertionError: 2461

(cherry picked from commit c2eff7e824)
2016-03-24 10:26:00 +02:00
Glauber Costa
38343ccbfe repair: rework repair code so we can limit parallelism
The repair code as it is right now is a bit convoluted: it resorts to detached
continuations + do_for_each when calling sync_ranges, and deals with the
problem of excessive parallelism by employing a semaphore inside that range.

Even so, we still generate a great number of checksum requests, because the
ranges themselves are processed in parallel.

It would be better to have a single semaphore to limit the overall parallelism
for all requests.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit f49e965d78)
2016-03-24 10:26:00 +02:00
Glauber Costa
f1272933fd database: keep streaming memtables in their own region group
Theoretically, because we can have a lot of pending streaming memtables, the
database can start throttling and incoming connections can slow down during
streaming.

It turns out this is actually a very easy condition to trigger, basically
because the other side of the wire in this case is quite efficient in sending
us work. This situation is alleviated a bit by reducing parallelism, but not
only does it not go away completely, once we have the tools to start increasing
parallelism again it will become commonplace.

The solution for this is to limit the streaming memtables to a fraction of the
total allowed dirty memory. Using the nesting capability built into the LSA
regions, we make the streaming region group a child of the main region
group.  With that, we can throttle streaming requests separately, while at the
same time being able to control the total amount of dirty memory as well.

Because of this property, it can still be the case that incoming requests will
throttle earlier due to streaming - unless we allow for more dirty memory to be
used during repairs - but at least that effect will be limited.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 34a9fc106f)
2016-03-24 10:26:00 +02:00
Glauber Costa
ccd623aa87 streaming memtables: coalesce incoming writes
The repair process will potentially send ranges containing few mutations,
definitely not enough to fill a memtable. It wants to know whether each of
those ranges individually succeeded or failed, so we need a future for each.

Small memtables being flushed are bad, and we would like to write bigger
memtables so we can better utilize our disks.

One way to fix that is to change the repair itself to send more mutations in
a single batch. But relying on that is a bad idea for two reasons:

First, the goals of the SSTable writer and the repair sender are at odds. The
SSTable writer wants to write as few SSTables as possible, while the repair
sender wants to break down the range into pieces as small as it can and
checksum them individually, so it doesn't have to send a lot of mutations for
no reason.

Second, even if the repair process wants to process larger ranges at once, some
ranges themselves may be small. So while most ranges would be large, we would
still have potentially some fairly small SSTables lying around.

The best course of action in this case is to coalesce the incoming streams on
the write side.  Repair can now choose whatever strategy it wants - small or
big ranges - resting assured that the incoming memtables will be coalesced
together.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 455d5a57d2)
2016-03-24 10:26:00 +02:00
Glauber Costa
8176fa8379 streaming: add incoming streaming mutations to a different sstable
Keeping the mutations coming from the streaming process as mutations like any
other has a number of advantages - and that's why we do it.

However, this makes it impossible for Seastar's I/O scheduler to differentiate
between incoming requests from clients and those arriving from peers in the
streaming process.

As a result, if the streaming mutations consume a significant fraction of the
total mutations, and we happen to be using the disk at its limits, we are in no
position to provide any guarantees - defeating the whole purpose of the
scheduler.

To implement that, we'll keep a separate set of memtables that will contain
only streaming mutations. We don't have to do it this way, but doing so
makes life a lot easier. In particular, to write an SSTable, our API requires
(because the filter requires it) that a good estimate of the number of
partitions be provided in advance. The partitions also need to be sorted.

We could write mutations directly to disk, but the above conditions couldn't be
met without significant effort. In particular, because mutations can be
arriving from multiple peer nodes, we can't really sort them without keeping a
staging area anyway.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 5fa866223d)
2016-03-24 10:26:00 +02:00
Glauber Costa
d03910f46d priority manager: separate streaming reads from writes
Streaming currently has one class, which contains the read operations
generated by the streaming process. Those reads come from two
places:

- checksums (if doing repair)
- reading mutations to be sent over the wire.

Depending on the amount of data we're dealing with, that can generate a
significant chunk of data, with seconds' worth of backlog, and if we need to
have the incoming writes intertwined with those reads, they can take a long
time.

Even if a node is only acting as a receiver, it may still read a lot - if
we're talking about repairs, those reads are coming from the checksums.

However, in more complicated failure scenarios, it is not hard to imagine a
node that will be both sending and receiving a lot of data.

The best way to guarantee progress on both fronts, is to put both kinds of
operations into different classes.

This patch introduces a new write class, and renames the old read class so it
can have a more meaningful name.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 10c8ca6ace)
2016-03-24 10:26:00 +02:00
Glauber Costa
0c75700d8c database: make seal_on_overflow a method of the memtable_list
Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 78189de57f)
2016-03-24 10:26:00 +02:00
Glauber Costa
478975b3fa database: move add_memtable as a method of the memtable_list
The column family still has to teach the memtable list how to allocate a new memtable,
since it uses CF parameters to do so.

After that, the memtable_list's constructor takes a seal and a create function, and is complete.
The copy constructor can now go, since there are no users left.
The behavior of keeping a reference to the underlying memtables can also go, since we can now
guarantee that nobody is keeping references to the list (it is not even a shared pointer anymore).
Individual memtables still are, and users may keep references to them individually.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 635bb942b2)
2016-03-24 10:26:00 +02:00
Glauber Costa
5ce76258c8 database: move active_memtable to memtable_list
Each list can have a different active memtable. The column family method keeps
existing, since the two separate sets of memtables are just an implementation
detail to deal with the problem of streaming QoS: *the* active memtable keeps
being the one from the main list.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 6ba95d450f)
2016-03-24 10:26:00 +02:00
Glauber Costa
4cf8791d56 database: create a class for memtable_list
memtable_list is currently just an alias for a vector of memtables.  Let's
move it to a class of its own, exporting the relevant methods to keep user
code unchanged as much as possible.

This will help us keep separate lists of memtables.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit af6c7a5192)
2016-03-24 10:26:00 +02:00
Pekka Enberg
ccd51010f1 Merge seastar upstream
* seastar 9f2b868...aa281bd (7):
  > shared_promise: Add move assignment operator
  > lowres_clock: Fix stretched time
  > scripts: Delete tap with ip instead of tunctl
  > vla: Actually be exception-safe
  > vla: Ensure memory is freed if ctor throws
  > vla: Ensure memory is correctly freed
  > net: Improve error message when parsing invalid ipv4 address
2016-03-24 10:25:42 +02:00
Shlomi Livne
8e78cbfc2d fix a collision between --ami command line param and env
The scylla-server sysconfig includes an AMI variable, and the script also used
an AMI variable; fix this by renaming the script variable.

6a18634f9f introduced this issue, since it
started importing the scylla-server sysconfig.

Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <0bc472bb885db2f43702907e3e40d871f1385972.1458767984.git.shlomi@scylladb.com>
(cherry picked from commit d3a91e737b)
2016-03-24 08:18:45 +02:00
Shlomi Livne
c6c176b1be scylla_io_setup import scylla-server env args
scylla_io_setup requires the scylla-server env to be set up to run
correctly. Previously, scylla_io_setup was encapsulated in
scylla-io.service, which assured this.

Extract CPUSET and SMP from SCYLLA_ARGS, as CPUSET is needed for invoking
io_tune.

Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <d49af9cb54ae327c38e451ff76fe0322e64a5f00.1458747527.git.shlomi@scylladb.com>
(cherry picked from commit 6a18634f9f)
2016-03-23 17:55:33 +02:00
Shlomi Livne
9795edbe04 dist/ami: Use the actual number of disks instead of AWS meta service
We have seen in some cases that, when using the boto api to start
instances, the aws metadata service
http://169.254.169.254/latest/meta-data/block-device-mapping/ returns an
incorrect number of disks - work around that by checking the actual
number of disks using lsblk.

Also add a validation at the end verifying that, after all computations, the
NR_IO_QUEUES will not be greater than the number of shards (we had an
issue with i2.8x).

Fixes: #1062

Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <54c51cd94dd30577a3fe23aef3ce916c01e05504.1458721659.git.shlomi@scylladb.com>
(cherry picked from commit 4ecc37111f)
2016-03-23 11:22:25 +02:00
Shlomi Livne
1539c8b136 fix centos local ami creation (revert some changes)
On CentOS we do not have a version file created - revert the changes
introduced when adding Ubuntu AMI creation.

Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <69c80dcfa7afe4f5db66dde2893d9253a86ac430.1458578004.git.shlomi@scylladb.com>
(cherry picked from commit b7e338275b)
2016-03-23 11:22:25 +02:00
Takuya ASADA
0396a94eaf dist: allow more requests for i2 instances
i2 instances have better performance than others, so allow more requests.
Fixes #921

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1458251067-1533-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 769204d41e)
2016-03-23 11:22:25 +02:00
Raphael Carvalho
3c40c1be71 service: fix refresh
Vlad and I were working on finding the root of the problems with
refresh. We found that refresh was deleting existing sstable files
because of a bug in a function that was supposed to return the maximum
generation of a column family.
The intention of this function is to get the generation from the last element
of column_family::_sstables, which is of type std::map.
However, we were incorrectly using std::map::end() to get the last element,
so garbage was being read instead of the maximum generation.
If the garbage value is lower than the minimum generation of a column
family, then reshuffle_sstables() would set the generation of all existing
sstables to a lower value. That would confuse our mechanism used to
delete sstables, because sstables loaded at boot stage were touched.
The solution to this problem is to use rbegin() instead of end() to
get the last element from column_family::_sstables.

The other problem is that refresh will only load generations that are
larger than or equal to X, so new sstables with a lower generation will
not be loaded. The solution is to create a set with the generations of
live SSTables from all shards, and use this set to determine whether
a generation is new or not.

The last change provides an unused generation to the reshuffle
procedure by adding one to the maximum generation. That's important to
prevent reshuffle from touching an existing SSTable.

Tested 'refresh' under the following scenarios:
1) Existing generations: 1, 2, 3, 4. New ones: 5, 6.
2) Existing generations: 3, 4, 5, 6. New ones: 1, 2.
3) Existing generations: 1, 2, 3, 4. New ones: 7, 8.
4) No existing generation. No new generation.
5) No existing generation. New ones: 1, 2.
I also had to adapt existing testcase for reshuffle procedure.

Fixes #1073.

Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com>
Message-Id: <1c7b8b7f94163d5cd00d90247598dd7d26442e70.1458694985.git.raphaelsc@scylladb.com>
(cherry picked from commit 370b1336fe)
2016-03-23 11:22:25 +02:00
Benoît Canet
de969a5d6f dist/ubuntu: Fix the init script variable sourcing
The variable sourcing was crashing the init script on ubuntu.
Fix it with the suggestion from Avi.

Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <1458685099-1160-1-git-send-email-benoit@scylladb.com>
(cherry picked from commit 1594bdd5bb)
2016-03-23 11:22:25 +02:00
Takuya ASADA
0ade2894f7 dist: stop using '-p' option on lsblk since Ubuntu doesn't support it
In scylla_setup's interactive mode we use lsblk to list candidate block
devices for RAID, and the -p option prints full device paths.

Since the Ubuntu 14.04 LTS version of lsblk doesn't support this option, we
need to use non-full path names and complete the paths before passing them to
scylla_raid_setup.

Fixes #1030

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1458325411-9870-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 6edd909b00)
2016-03-23 09:16:04 +02:00
Takuya ASADA
6b36315040 dist: allow running 'sudo scylla_ami_setup' for Ubuntu AMI
Allow running scylla_ami_setup from scylla-server.conf.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
(cherry picked from commit a6cd085c38)
2016-03-23 09:14:49 +02:00
Takuya ASADA
edc5f8f2f7 dist: launch scylla_ami_setup on Ubuntu AMI
Since upstart does not have the same behavior as systemd, we need to run scylla_io_setup and scylla_ami_setup in scylla-server.conf's pre-start stanza.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
(cherry picked from commit 7828023599)
2016-03-23 09:14:49 +02:00
Takuya ASADA
066149ad46 dist: fix broken scylla_install_pkg --local-pkg and --unstable on Ubuntu
The --local-pkg and --unstable arguments weren't handled on Ubuntu; support them.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
(cherry picked from commit 93bf7bff8e)
2016-03-23 09:14:49 +02:00
Takuya ASADA
1f07468195 dist: prevent apt-get from showing a dialog in scylla_raid_setup
"apt-get -y install mdadm" shows a dialog to select the install mode of postfix; this would block scylla-ami-setup.service forever, since it runs as a background task, so we need to prevent it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
(cherry picked from commit 0c83b34d0c)
2016-03-23 09:14:49 +02:00
Takuya ASADA
0577ae5a61 dist: Ubuntu based AMI support
This introduces an Ubuntu AMI.
Both the CentOS AMI and the Ubuntu AMI need to be built on the same distribution, so the build_ami.sh script automatically detects the current distribution and selects the base AMI image.

Fixes #998

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
(cherry picked from commit b097ed6d75)
2016-03-23 09:14:49 +02:00
Pekka Enberg
054cf13cd0 Update scylla-ami submodule
* dist/ami/files/scylla-ami 84bcd0d...89e7436 (3):
  > Merge "iotune packaging fix for scylla-ami" from Takuya
  > Ubuntu AMI support on scylla_install_ami
  > scylla_ami_setup is not POSIX sh compatible, change shebang to /bin/bash
2016-03-23 09:07:07 +02:00
Takuya ASADA
71446edc97 dist: on scylla_io_setup, SMP and CPUSET should be empty when the parameter is not present
Fixes #1060

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1458659928-2050-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit dac2bc3055)
2016-03-23 09:06:00 +02:00
Takuya ASADA
c1d8a62b5b dist: remove scylla-io-setup.service and make it standalone script
(cherry picked from commit 9889712d43)
2016-03-23 09:06:00 +02:00
Takuya ASADA
a3baef6b45 dist: on scylla_io_setup print out message both for stdout and syslog
(cherry picked from commit 2cedab07f2)
2016-03-23 09:06:00 +02:00
Takuya ASADA
feaba177e2 dist: introduce dev-mode.conf and scylla_dev_mode_setup
(cherry picked from commit 83112551bb)
2016-03-23 09:06:00 +02:00
Tomasz Grabiec
83a289bdcd cql3: batch_statement: Execute statements sequentially
Currently we execute all statements in parallel, but some statements
depend on order, in particular list append/prepend. Fix by executing
sequentially.

Fixes cql_additional_tests.py:TestCQL.batch_and_list_test dtest.

Fixes #1075.

Message-Id: <1458672874-4749-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 5f44afa311)
2016-03-22 21:06:21 +02:00
Tomasz Grabiec
382e7e63b3 Fix assertion in row_cache_alloc_stress
Fixes the following assertion failure:

  row_cache_alloc_stress: tests/row_cache_alloc_stress.cc:120: main(int, char**)::<lambda()>::<lambda()>: Assertion `mt->occupancy().used_space() < memory::stats().free_memory()' failed.

memory::stats().free_memory() may be much lower than the actual
amount of reclaimable memory in the system, since LSA zones will try to
keep a lot of free segments to themselves. Fix by using the actual amount
of reclaimable memory in the check.

(cherry picked from commit a4e3adfbec)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
deeed904f4 logalloc: Introduce tracker::occupancy()
Returns occupancy information for all memory allocated by LSA, including
segment pools / zones.

(cherry picked from commit a0cba3c86f)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
d927053b3b logalloc: Rename tracker::occupancy() to region_occupancy()
(cherry picked from commit 529c8b8858)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
8b8923b5af managed_bytes: Make operator[] work for large blobs as well
Fixes assertion in mutation_test:

mutation_test: ./utils/managed_bytes.hh:349: blob_storage::char_type* managed_bytes::data(): Assertion `!_u.ptr->next'

Introduced in ea7c2dd085

Message-Id: <1458648786-9127-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit ca08db504b)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
48ec129595 perf_simple_query: Make duration configurable
(cherry picked from commit 6e73c3f3dc)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
a4757a6737 mutation_test: Add allocation failure stress test for apply()
The test injects allocation failures at every allocation site during
apply(). Only allocations through allocation_strategy are instrumented,
but currently those should include all allocations in the apply() path.

The target and source mutations are randomized.

(cherry picked from commit 2fbb55929d)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
223b73849d mutation_test: Add more apply() tests
(cherry picked from commit 8ede27f9c6)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
ba4b1eac45 mutation_test: Hoist make_blob() to a function
(cherry picked from commit 36575d9f01)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
9cf5fabfdf mutation_test: Make make_blob() return different blob each time
random_bytes was constructed with the same seed each time.

(cherry picked from commit 4c85d06df7)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
5723c664ad mutation_test: Fix use-after-free
The problem was that verify_row() was returning a future which was not
waited on. Fix by running the code in a thread.

(cherry picked from commit 19b3df9f0f)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
9635a83edd mutation_partition: Fix friend declarations
Missing "class" confuses CLion IDE.

(cherry picked from commit a7966e9b71)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
24c68e48a5 mutation_partition: Make apply() atomic even in case of exception
We cannot leave a partially applied mutation behind when the write
fails. It may fail if memory allocation fails in the middle of
apply(). That, for example, would violate write atomicity: readers
should either see the whole write or none of it.

This fix makes apply() revert partially applied data upon failure, by
means of the ReversiblyMergeable concept. In a nutshell, the idea is
to store the old state in the source mutation as we apply it, and swap it
back in case of an exception. At the cell level this swapping is
inexpensive, just rewiring pointers. For this to work, the source mutation
needs to be brought into mutable form, so frozen mutations need to be
unfrozen. In practice this doesn't increase the amount of cell allocations
in the memtable apply path, because incoming data will usually be newer and
we will have to copy it into LSA anyway. There are extra allocations,
though, for the data structures which hold cells.

I didn't see significant change in performance of:

  build/release/tests/perf/perf_simple_query -c1 -m1G --write --duration 13

The score fluctuates around ~77k ops/s.

Fixes #283.

(cherry picked from commit dc290f0af7)
2016-03-22 19:59:16 +02:00
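The store-and-swap-back idea described in the commit above can be sketched with ordinary containers; `apply_reversibly`, the `row` alias, and the injected `fail_after` failure are hypothetical illustrations, not Scylla's actual types (which do the swap by rewiring pointers inside the LSA):

```cpp
#include <cassert>
#include <map>
#include <new>
#include <string>
#include <utility>
#include <vector>

// In this sketch a row is a map from column id to value, and an empty
// string stands in for "cell did not exist before the merge".
using row = std::map<int, std::string>;

// Merge src into dst. The displaced old values are stashed back into src,
// so an exception mid-way can swap them back and leave dst untouched.
void apply_reversibly(row& dst, row& src, int fail_after = -1) {
    std::vector<int> applied;        // keys merged so far, in order
    int count = 0;
    try {
        for (auto& [key, value] : src) {
            std::swap(dst[key], value);  // dst takes the new value, src keeps the old
            applied.push_back(key);
            if (++count == fail_after) {
                throw std::bad_alloc();  // injected allocation failure
            }
        }
    } catch (...) {
        // Revert in reverse order: put the old values back into dst.
        for (auto it = applied.rbegin(); it != applied.rend(); ++it) {
            std::swap(dst[*it], src[*it]);
            if (dst[*it].empty()) {
                dst.erase(*it);      // the key did not exist before the merge
            }
        }
        throw;
    }
}
```

On success the write is fully applied; on failure `dst` is exactly what it was before the call, which is the atomicity property the commit describes.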
Tomasz Grabiec
80cb0a28e1 mutation_partition: Make intrusive sets ReversiblyMergeable
(cherry picked from commit e09d186c7c)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
95a9f66b75 mutation_partition: Make row_tombstones_entry ReversiblyMergeable
(cherry picked from commit f1a4feb1fc)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
58448d4b05 mutation_partition: Make rows_entry ReversiblyMergeable
(cherry picked from commit e4a576a90f)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
0a4d0e95f2 mutation_partition: Make row_marker ReversiblyMergeable
(cherry picked from commit aadcd75d89)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
2c73e1c2e8 mutation_partition: Make row ReversiblyMergeable
(cherry picked from commit ea7c2dd085)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
0ebd1ae62a atomic_cell_or_collection: Introduce as_atomic_cell_ref()
Needed for setting the REVERT flag on an existing cell.

(cherry picked from commit c9d4f5a49c)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
14f616de3f atomic_cell_hash: Specialize appending_hash<> for atomic_cell and collection_mutation
(cherry picked from commit 1ffe06165d)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
827c0f68c3 atomic_cell: Add REVERT flag
Needed to make atomic cells ReversiblyMergeable.

(cherry picked from commit bfc6413414)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
e3607a4c16 tombstone: Make ReversiblyMergeable
(cherry picked from commit 7fcfa97916)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
59270c6d00 Introduce the concept of ReversiblyMergeable
(cherry picked from commit 1407173186)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
3be5d3a7c9 mutation_partition: row: Add empty()
(cherry picked from commit 9fc7f8a5ed)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
cd6697b506 mutation_partition: row: Allow storing empty cells internally
Currently only the "set" storage could store empty cells, but not the
"vector" one, because there an empty cell has the meaning of being
missing. To implement rollback, we need to be able to distinguish empty
cells from missing ones. Solve this by making the vector storage use a
bitmap for presence checking instead of emptiness. This adds 4 bytes to
the vector storage.

(cherry picked from commit d5e66a5b0d)
2016-03-22 19:59:16 +02:00
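The empty-versus-missing distinction described above can be sketched with a hypothetical `vector_row` using a 32-bit presence bitmap (matching the 4 extra bytes mentioned in the commit; names are illustrative, not Scylla's actual types):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Vector storage indexed by column id. Presence is tracked in a bitmap,
// so an explicitly-set empty cell is distinct from a missing one.
struct vector_row {
    std::vector<std::string> cells;
    uint32_t present = 0;            // 1 bit per column id, up to 32 columns here

    void set(size_t id, std::string value) {
        if (cells.size() <= id) {
            cells.resize(id + 1);
        }
        cells[id] = std::move(value);
        present |= uint32_t(1) << id;    // present even if the value is empty
    }
    bool has(size_t id) const { return (present >> id) & 1; }
    void erase(size_t id) { present &= ~(uint32_t(1) << id); }
};
```

With presence inferred from emptiness instead, `set(id, "")` would be indistinguishable from a cell that was never written, which is exactly what rollback needs to avoid.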
Tomasz Grabiec
acc9849e2b mutation_partition: Make row::merge() tolerate empty row
The row may be empty and still have a set storage, in which case
dereferencing rbegin() is undefined behavior.

(cherry picked from commit ed1e6515db)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
a445f6a7be managed_bytes: Mark move-assignment noexcept
(cherry picked from commit 184e2831e7)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
88ed9c53a6 managed_bytes: Make copy assignment exception-safe
(cherry picked from commit 92d4cfc3ab)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
50f98ff90a managed_bytes: Make linearization_context::forget() noexcept
It is needed for noexcept destruction, which we need for exception
safety in higher layers.

According to [1], erase() only throws if key comparison throws, and in
our case it doesn't.

[1] http://en.cppreference.com/w/cpp/container/unordered_map/erase

(cherry picked from commit 22d193ba9f)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
30ffb2917f mutation: Add copy assignment operator
We already have a copy constructor, so we can have copy assignment as
well.

(cherry picked from commit 87d7279267)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
6ef8b45bf4 mutation_partition: Add cell_entry constructor which makes an empty cell
(cherry picked from commit 8134992024)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
144829606a mutation_partition: Make row::vector_to_set() exception-safe
Currently allocation failure can leave the old row in a
half-moved-from state and leak cell_entry objects.

(cherry picked from commit 518e956736)
2016-03-22 19:59:16 +02:00
Tomasz Grabiec
2eb54bb068 mutation_partition: Unmark cell_entry's copy constructor as noexcept
It was a mistake; it certainly may throw, because it copies cells.

(cherry picked from commit c91eefa183)
2016-03-22 19:59:16 +02:00
Pekka Enberg
a133e48515 Merge seastar upstream
* seastar 6a207e1...9f2b868 (10):
  > memory: set free memory to non-zero value in debug mode
  > Merge "Increase IOTune's robustness by including a timeout" from Glauber
  > shared_future: add companion class, shared_promise
  > rpc: fix client connection stopping
  > semaphore: allow wait() and signal() after broken()
  > run reactor::stop() only once
  > sharded: fix start with reference parameter
  > core: add asserts to rwlock
  > util/defer: Fix cancel() not being respected
  > tcp: Do not return accept until the connection is connected
2016-03-22 15:49:51 +02:00
Asias He
5db0049d99 gossip: Sync gossip_digest.idl.hh and application_state.hh
We did the cleanup in idl/gossip_digest.idl.hh, but the patch to clean
up gms/application_state.hh was never merged.

To maintain compatibility with previous versions of scylla, we can not
change application_state.hh; instead, change the idl to be in sync with
application_state.hh.

Message-Id: <3a78b159d5cb60bc65b354d323d163ce8528b36d.1458557948.git.asias@scylladb.com>
(cherry picked from commit 39992dd559)
2016-03-22 15:22:12 +02:00
Takuya ASADA
ac80445bd9 dist: enable collectd on scylla_setup by default, to make scyllatop usable
Fixes #1037

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1458324769-9152-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 6b2a8a2f70)
2016-03-22 15:16:54 +02:00
Asias He
0c3ffba5c8 messaging_service: Take reference of ms in send_message_timeout_and_retry
Take a reference to the messaging_service object inside
send_message_timeout_and_retry to make sure it is not freed during the
lifetime of the send_message_timeout_and_retry operation.

(cherry picked from commit b8abd88841)
2016-03-22 13:20:47 +02:00
Gleb Natapov
7ca3d22c7d messaging: do not admit new requests during messaging service shutdown.
Sending a message may open a new client connection, which will never be
closed if the messaging service is already shutting down.

Fixes #1059

Message-Id: <1458639452-29388-3-git-send-email-gleb@scylladb.com>
(cherry picked from commit 1e6352e398)
2016-03-22 13:18:12 +02:00
Gleb Natapov
9b1d2dad89 messaging: do not delete client during messaging service shutdown
The messaging service's stop() method calls stop() on all clients. If
remove_rpc_client_one() is called while those stops are running,
client::stop() will be called twice, which is not supposed to happen. Fix it
by ignoring client remove requests during messaging service shutdown.

Fixes #1059

Message-Id: <1458639452-29388-2-git-send-email-gleb@scylladb.com>
(cherry picked from commit 357c91a076)
2016-03-22 13:18:05 +02:00
Pekka Enberg
7e6a7a6cb5 release: prepare for 1.0.rc1 2016-03-22 12:19:03 +02:00
Pekka Enberg
ec7f637384 dist/ubuntu: Use tilde for release candidate builds
The version number ordering rules are different for rpm and deb. Use
tilde ('~') for the latter to ensure a release candidate is ordered
_before_ a final version.

Message-Id: <1458627524-23030-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit ae33e9fe76)
2016-03-22 12:18:52 +02:00
Nadav Har'El
eecfb2e4ef sstable: fix use-after-free of temporary ioclass copy
Commit 6a3872b355 fixed some use-after-free
bugs but introduced a new one because of a typo:

Instead of capturing a reference to the long-living io-class object, as
all the code does, one place in the code accidentally captured a *copy*
of this object. This copy had a very temporary life, and when a reference
to that *copy* was passed to sstable reading code which assumed that it
lives at least as long as the read call, a use-after-free resulted.

Fixes #1072

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1458595629-9314-1-git-send-email-nyh@scylladb.com>
(cherry picked from commit 2eb0627665)
2016-03-22 08:08:49 +02:00
Pekka Enberg
1f6476351a build: Invoke Seastar build only once
Make sure we invoke the Seastar ninja build only once from our own build
process so that we don't have multiple ninjas racing with each other.

Refs #1061.

Message-Id: <1458563076-29502-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit 4892a6ded9)
2016-03-22 08:08:02 +02:00
Pekka Enberg
0d95dd310a Revert "build: prepare for 1.0 release series"
This reverts commit 80d2b72068. It breaks
the RPM build which does not allow the "-" character to appear in
version numbers.
2016-03-22 08:03:22 +02:00
Avi Kivity
80d2b72068 build: prepare for 1.0 release series 2016-03-21 18:44:05 +02:00
Asias He
ac95f04ff9 gossip: Handle unknown application_state when printing
In case an unknown application_state is received, we should be able to
handle it when printing.

Message-Id: <98d2307359292e90c8925f38f67a74b69e45bebe.1458553057.git.asias@scylladb.com>
(cherry picked from commit 7acc9816d2)
2016-03-21 11:59:35 +02:00
Pekka Enberg
08a8a4a1b4 main: Defer API server hooks until commitlog replay
Defer registering services to the API server until commitlog has been
replayed to ensure that nobody is able to trigger sstable operations via
'nodetool' before we are ready for them.
Message-Id: <1458116227-4671-1-git-send-email-penberg@scylladb.com>

(cherry picked from commit 972fc6e014)
2016-03-18 09:20:49 +02:00
Pekka Enberg
b7e9924299 main: Fix broadcast_address and listen_address validation errors
Fix the validation error message to look like this:

  Scylla version 666.development-20160316.49af399 starting ...
  WARN  2016-03-17 12:24:15,137 [shard 0] config - Option partitioner is not (yet) used.
  WARN  2016-03-17 12:24:15,138 [shard 0] init - NOFILE rlimit too low (recommended setting 200000, minimum setting 10000; you may run out of file descriptors.
  ERROR 2016-03-17 12:24:15,138 [shard 0] init - Bad configuration: invalid 'listen_address': eth0: boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::system::system_error> > (Invalid argument)
  Exiting on unhandled exception of type 'bad_configuration_error': std::exception

Instead of:

  Exiting on unhandled exception of type 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::system::system_error> >': Invalid argument

Fixes #1051.

Message-Id: <1458210329-4488-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit 69dacf9063)
2016-03-18 09:00:23 +02:00
Takuya ASADA
19ed269cc7 dist: follow sysconfig setting when counting number of cpus on scylla_io_setup
When NR_CPU >= 8, we disabled cpu0 for AMI on scylla_sysconfig_setup.
But scylla_io_setup doesn't know that and tries to assign NR_CPU queues, so scylla fails to start because queues > cpus.
With this fix, scylla_io_setup checks the sysconfig settings: if '--smp <n>' is specified in SCYLLA_ARGS, n is used to limit the queue count.
Also, when the instance type has no pre-configured parameters, we need to pass --cpuset to iotune. Otherwise iotune will run on a different set of CPUs, which may have different performance characteristics.

Fixes #996, #1043, #1046

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1458221762-10595-2-git-send-email-syuu@scylladb.com>
(cherry picked from commit 4cc589872d)
2016-03-18 08:58:00 +02:00
Takuya ASADA
a223450a56 dist: On scylla_sysconfig_setup, don't disable cpu0 on non-AMI environments
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1458221762-10595-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 6f71173827)
2016-03-18 08:57:56 +02:00
Paweł Dziepak
8f4800b30e lsa: update _closed_occupancy after freeing all segments
_closed_occupancy will be used when a region is removed from its region
group, so make sure that it is accurate.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
(cherry picked from commit 338fd34770)
2016-03-18 08:11:31 +02:00
Pekka Enberg
7d13d115c6 dist: Fix '--developer-mode' parsing in scylla_io_setup
We need to support the following variations:

   --developer-mode true
   --developer-mode 1
   --developer-mode=true
   --developer-mode=1

Fixes #1026.
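The accepted spellings can be sketched in Python; the real fix lives in the scylla_io_setup shell script, and parse_developer_mode is an illustrative name, not the script's actual code:

```python
# Hypothetical sketch of the flag parsing the commit describes: accept both
# "--developer-mode <val>" and "--developer-mode=<val>", where <val> may be
# "true" or "1". The real fix is implemented in shell, not Python.
def parse_developer_mode(args):
    for i, arg in enumerate(args):
        if arg == "--developer-mode" and i + 1 < len(args):
            value = args[i + 1]          # space-separated form
        elif arg.startswith("--developer-mode="):
            value = arg.split("=", 1)[1]  # equals-sign form
        else:
            continue
        return value in ("true", "1")
    return False  # flag absent
```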
Message-Id: <1458203393-26658-1-git-send-email-penberg@scylladb.com>

(cherry picked from commit 0434bc3d33)
2016-03-17 11:00:14 +02:00
Glauber Costa
c9c52235a1 stream_session: print debug message for STREAM_MUTATION
For this verb(), we don't call get_session - and it doesn't look like we will.
We currently have no debug message for this one, which makes it harder to debug
the stream of messages. Print it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit a3ebf640c6)
2016-03-17 08:18:54 +02:00
Glauber Costa
52eeab089c stream_session: remove duplicated debug message
Whenever we call get_session, that will print a debug message about the arrival
of this new verb. Because we also print that explicitly in PREPARE_DONE, that
message gets duplicated.

That confuses poor developers who are, for a while, left wondering why it is
that the sender is sending the message twice.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 0ab4275893)
2016-03-17 08:18:49 +02:00
Glauber Costa
49af399a2e sstables: do not assume mutation_reader will be kept alive
Our sstables::mutation_reader has a specialization in which start and end
ranges are passed as futures. That is needed because we may have to read the
index file for those.

This works well under the assumption that every time a mutation_reader will be
created it will be used, since whoever is using it will surely keep the state
of the reader alive.

However, that assumption has not been true for a while. We use a reader
interface for reading everything from mutations and sstables to cache entries,
and when we create an sstable mutation_reader, that does not mean we'll use it.
In fact we won't, if the read can be serviced first by a higher level entity.

If that happens to be the case, the reader will be destructed. However, since
it may take more time than that for the start and end futures to resolve, by
the time they are resolved the state of the mutation reader will no longer be
valid.

The proposed fix for that is to only resolve the future inside
mutation_reader's read() function. If that function is called,  we can have a
reasonable expectation that the caller object is being kept alive.

A second way to fix this would be to force the mutation reader to be kept alive
by transforming it into a shared pointer and acquiring a reference to itself.
However, because the reader may turn out not to be used, the delayed read
actually has the advantage of not even reading anything from the disk if there
is no need for it.

Also, because sstables can be compacted, we can't guarantee that the sst object
itself, used in the resolution of start and end, stays alive, and that has the
same problem. If we delay the calling of those, we will also solve a similar
problem. We assume here that the outer reader is keeping the SSTable object
alive.

I must note that I have not reproduced this problem. What goes above is the
result of the analysis we have made in #1036. That being the case, a thorough
review is appreciated.

Fixes #1036

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <a7e4e722f76774d0b1f263d86c973061fb7fe2f2.1458135770.git.glauber@scylladb.com>
(cherry picked from commit 6a3872b355)
2016-03-16 19:41:06 +02:00
Nadav Har'El
d915370e3f Allow uncompression at end of file
Asking to read from byte 100 when a file has 50 bytes is an obvious error.
But what if we ask to read from byte 50? What if we ask to read 0 bytes at
byte 50? :-)

Before this patch, code which asked to read from the EOF position would
get an exception. After this patch, it would simply read nothing, without
error. This allows, for example, reading 0 bytes from position 0 on a file
with 0 bytes, which apparently happened in issue #1039...

A read which starts at a position higher than the EOF position still
generates an exception.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1458137867-10998-1-git-send-email-nyh@scylladb.com>
(cherry picked from commit 02ba8ffbe8)
2016-03-16 19:40:59 +02:00
Nadav Har'El
a6d5e67923 Fix out-of-range exception when uncompressing 0 bytes
The uncompression code reads the compressed chunks containing the bytes
pos through pos + len - 1. This, however, is not correct when len==0,
and pos + len - 1 may even be -1, causing an out-of-range exception when
calling locate() to find the chunks containing this byte position.

So we need to treat len==0 specially, and in this case we don't read
anything, and don't need to locate() the chunks to read.

Refs #1039.
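A minimal Python sketch of the boundary rules described here and in the previous commit: reads starting past EOF fail, while zero-length reads (including at the EOF position) touch no chunks. chunks_for_range and its parameters are hypothetical, not Scylla's actual API:

```python
# Illustrative sketch: which compressed chunks cover bytes [pos, pos+length-1].
def chunks_for_range(pos, length, chunk_size, file_len):
    if pos > file_len:
        raise ValueError("read starts past EOF")
    if length == 0 or pos == file_len:
        # Special-case len == 0: computing pos + length - 1 would otherwise
        # ask locate() for byte -1, an out-of-range position.
        return []
    last = min(pos + length, file_len) - 1
    return list(range(pos // chunk_size, last // chunk_size + 1))
```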

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1458135987-10200-1-git-send-email-nyh@scylladb.com>
(cherry picked from commit 73297c7872)
2016-03-16 15:55:12 +02:00
Takuya ASADA
f885750f90 dist: do not auto-start scylla-server job on Ubuntu package install time
Fixes #1017

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1458122424-22889-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit f1d18e9980)
2016-03-16 13:55:30 +02:00
Pekka Enberg
36f55e409d tests/gossip_test: Fix messaging service stop
This fixes gossip test shutdown similar to what commit 13ce48e ("tests:
Fix stop of storage_service in cql_test_env") did for CQL tests:

  gossip_test: /home/penberg/scylla/seastar/core/sharded.hh:439: Service& seastar::sharded<Service>::local() [with Service = net::messaging_service]: Assertion `local_is_initialized()' failed.
  Running 1 test case...

  [snip]

  unknown location(0): fatal error in "test_boot_shutdown": signal: SIGABRT (application abort requested)
  seastar/tests/test-utils.cc(32): last checkpoint
Message-Id: <1458126520-20025-1-git-send-email-penberg@scylladb.com>

(cherry picked from commit 2f519b9b34)
2016-03-16 13:15:39 +02:00
Asias He
c436fb5892 streaming: Handle cf is deleted after the deletion check
The cf can be deleted after the cf deletion check. Handle this case as
well.

Use "warn" level to log if the cf is missing. Although we can handle the
case, it is good to distinguish whether the receiver of streaming
applied all the stream mutations or not. We believe that the cf is
missing because it was dropped, but it could be missing because of a bug
or something we didn't anticipate here.

Related patch: "streaming: Handle cf is deleted when sending
STREAM_MUTATION_DONE"

Fixes simple_add_new_node_while_schema_changes_test failure.
Message-Id: <c4497e0500f50e0a3422efb37e73130765c88c57.1458090598.git.asias@scylladb.com>

(cherry picked from commit 2d50c71ca3)
2016-03-16 11:47:03 +02:00
Asias He
950bcd3e38 tests: Fix stop of storage_service in cql_test_env
In stop() of storage_service, it unregisters the verb handler. In the
test, we stop messaging_service before storage_service. Fix it by
deferring stop of messaging_service.
Message-Id: <c71f7b5b46e475efe2fac4c1588460406f890176.1458086329.git.asias@scylladb.com>

(cherry picked from commit 13ce48e775)
2016-03-16 11:36:36 +02:00
Asias He
83ffae1568 storage_service: Drop block_until_update_pending_ranges_finished
It is a legacy API from c*. Since we can wait for update_pending_ranges
to complete directly, there is no need to call
block_until_update_pending_ranges_finished to do so.

Also, change do_update_pending_ranges to be private.

Message-Id: <ac79b2879ec08fdcd3b2278ff68962cc71492f12.1458040608.git.asias@scylladb.com>
2016-03-15 15:18:45 +02:00
Avi Kivity
cc3e49e16f Merge seastar upstream
* seastar 0739576...6a207e1 (3):
  > file: allow custom file_impl implementations
  > Dockerfile update
  > tcp: Fix a typo in input_handle_other_state
2016-03-15 15:06:35 +02:00
Gleb Natapov
c6157dd99e enable rpc_keepalive parameter
Fixes #1044

Message-Id: <20160315104609.GV6117@scylladb.com>
2016-03-15 12:51:12 +02:00
Paweł Dziepak
9f3893980a move SCHEMA_CHECK registration to migration_manager
The verb is just for reporting and debugging purposes, but it is better
not to register it until it can return a meaningful value. Besides, it
really belongs to the migration manager subsystem anyway.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1458037053-14836-1-git-send-email-pdziepak@scylladb.com>
2016-03-15 12:24:37 +02:00
Asias He
d79dbfd4e8 main: Defer initialization of streaming
Streaming is used by bootstrap and repair. Streaming uses the storage_proxy
class to apply the frozen_mutation and the db/column_family class to
invalidate the row cache. Defer the initialization to just before repair and
bootstrap init.
Message-Id: <8e99cf443239dd8e17e6b6284dab171f7a12365c.1458034320.git.asias@scylladb.com>
2016-03-15 11:56:34 +02:00
Pekka Enberg
eb13f65949 main: Defer REPAIR_CHECKSUM_RANGE RPC verb registration after commitlog replay
Register the REPAIR_CHECKSUM_RANGE messaging service verb handler after
we have replayed the commitlog to avoid responding with bogus checksums.
Message-Id: <1458027934-8546-1-git-send-email-penberg@scylladb.com>
2016-03-15 11:56:18 +02:00
Pekka Enberg
917ed4adbe Merge "verb init/handler for gossip and storage_service" from Asias
"- ignore ack2 msg if gossip is not enabled
 - move REPLICATION_FINISHED to where it belongs to
 - add comments for gossip runtime dependency"
2016-03-15 11:12:10 +02:00
Avi Kivity
ad26e81444 Merge "Update pending ranges when ks is changed" from Asias
"At the moment, the migration_listener callbacks return void, so it is
impossible to wait for the callbacks to complete. Make the callbacks run
inside a seastar thread, so if we need to wait for a callback, we can make it
call foo_operation().get() in the callback. That is easier than making the
callbacks return future<>.

Fixes #1000."
2016-03-15 10:50:07 +02:00
Asias He
883d8cb8fd storage_service: Move REPLICATION_FINISHED verb to storage_service
It belongs to storage_service not storage_proxy.
2016-03-15 16:13:22 +08:00
Asias He
fb4d292d5c storage_service: Drop unused debug code 2016-03-15 16:13:21 +08:00
Asias He
16af12ca47 gossip: Add comments on external runtime dependency needed by gossip 2016-03-15 16:13:13 +08:00
Asias He
1034dd0aff gossip: Ignore ack2 message if gossip is not enabled yet 2016-03-15 16:09:43 +08:00
Asias He
1bf0412e7a gossip: Introduce handle_shutdown_msg helper 2016-03-15 16:09:43 +08:00
Asias He
54d8ac16b5 gossip: Introduce handle_echo_msg helper 2016-03-15 16:09:42 +08:00
Asias He
1f64f4bfcb gossip: Introduce handle_ack2_msg helper 2016-03-15 16:09:42 +08:00
Asias He
d63281b256 storage_service: Update pending ranges when keyspace is changed
If a keyspace is created after we calculate the pending ranges during
bootstrap, we will ignore that keyspace in the pending ranges when handling
write requests for it, which will cause data loss if rf = 1.

Fixes #1000
2016-03-15 15:41:23 +08:00
Asias He
93015bcc54 migration_manager: Make the migration callbacks runs inside seastar thread
At the moment, the callbacks return void, so it is impossible to wait for
the callbacks to complete. Make the callbacks run inside a seastar
thread, so if we need to wait for a callback, we can make it call
foo_operation().get() in the callback. That is easier than making the
callbacks return future<>.
2016-03-15 15:41:23 +08:00
Gleb Natapov
5076f4878b main: Defer storage proxy RPC verb registration after commitlog replay
Message-Id: <20160315071229.GM6117@scylladb.com>
2016-03-15 09:18:12 +02:00
Gleb Natapov
e228ef1bd9 messaging: enable keepalive tcp option for inter-node communication
Some network equipment that does TCP session tracking tends to drop TCP
sessions after a period of inactivity. Use the keepalive mechanism to
prevent this from happening for our inter-node communication.

Message-Id: <20160314173344.GI31837@scylladb.com>
2016-03-14 19:39:39 +02:00
Avi Kivity
7ae2298081 Merge seastar upstream
* seastar 88cc232...0739576 (4):
  > rpc: allow configuring keepalive for rpc client
  > net: add keepalive configuration to socket interface
  > iotune: refuse to run if there is not enough space available
  > rpc: make client connection error more clear
2016-03-14 19:38:54 +02:00
Pekka Enberg
1429213b4c main: Defer migration manager RPC verb registration after commitlog replay
Defer registering migration manager RPC verbs after commitlog has
been replayed so that our own schema is fully loaded before other
nodes start querying it or sending schema updates.
Message-Id: <1457971028-7325-1-git-send-email-penberg@scylladb.com>
2016-03-14 18:03:16 +01:00
Pekka Enberg
16f947dcb3 message/messaging_service: Remove init_messaging_service() declaration
The function no longer exists so drop the function declaration.
Message-Id: <1457694134-25600-1-git-send-email-penberg@scylladb.com>
2016-03-14 13:54:53 +02:00
Vlad Zolotarov
ce47fcb1ba sstables: properly account removal requests
The same shard may more than once create an sstables::sstable object for the
same SStable that doesn't belong to it, and mark it
for deletion (e.g. in a 'nodetool refresh' flow).

In that case the destructor of sstables::sstable accounted
deletion requests from the same shard more than once, since it was a simple
counter incremented on each deletion request, while all requests from the
same shard should count as a single request. This matters because
the removal logic waited for all shards to agree on the removal of a specific
SStable by comparing the counter mentioned above to the total
number of shards; once they were equal, the SStable files were actually removed.

This patch fixes this by replacing the counter with an std::unordered_set<unsigned>
that stores the ids of the shards requesting the deletion
of the sstable object, and compares the size() of this set
to smp::count in order to decide whether to actually delete the corresponding
SStable files.

Fixes #1004
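The accounting change can be sketched in Python; the class and the smp_count parameter are illustrative stand-ins for the C++ code and seastar's smp::count:

```python
# Sketch: a per-sstable set of shard ids replaces a plain counter, so
# repeated deletion requests from the same shard count only once.
class SstableDeletionTracker:
    def __init__(self, smp_count):
        self._smp_count = smp_count
        self._requesting_shards = set()  # was: a simple counter

    def mark_for_deletion(self, shard_id):
        self._requesting_shards.add(shard_id)  # idempotent per shard

    def all_shards_agree(self):
        return len(self._requesting_shards) == self._smp_count
```

With the old counter, two requests from shard 0 on a two-shard system would wrongly look like full agreement and trigger file removal.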

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1457886812-32345-1-git-send-email-vladz@cloudius-systems.com>
2016-03-14 11:45:08 +02:00
Raphael S. Carvalho
1ff7d32272 sstables: make write_simple() safer by using exclusive flag
We should guarantee that write_simple() will not try to overwrite
an existing file.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <194bd055f1f2dc1bb9766a67225ec38c88e7b005.1457818073.git.raphaelsc@scylladb.com>
2016-03-14 11:45:00 +02:00
Raphael S. Carvalho
0af786f3ea sstables: fix race condition when writing to the same sstable in parallel
When we are about to write a new sstable, we check if the sstable exists
by checking if the respective TOC exists. That check was added to handle a
possible attempt to write a new sstable with a generation already in use.
Gleb was worried that a TOC could appear after the check, and that's indeed
possible if there is an ongoing sstable write that uses the same generation
(running in parallel).
If the TOC appears after the check, we would again clobber an existing sstable
with a temporary one, and the user wouldn't be able to boot scylla anymore
without manual intervention.

Then Nadav proposed the following solution:
"We could do this by the following variant of Raphael's idea:

   1. create .txt.tmp unconditionally, as before the commit 031bf57c1
(if we can't create it, fail).
   2. Now confirm that .txt does not exist. If it does, delete the .txt.tmp
we just created and fail.
   3. continue as usual
   4. and at the end, as before, rename .txt.tmp to .txt.

The key to solving the race is step 1: Since we created .txt.tmp in step 1
and know this creation succeeded, we know that we cannot be running in
parallel with another writer - because such a writer too would have tried to
create the same file, and kept it existing until the very last step of its
work (step 4)."

This patch implements the solution described above.
Let me also say that the race is theoretical and scylla wasn't affected by
it so far.
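A Python sketch of the four-step protocol above, under the assumption that exclusive file creation (O_EXCL) is the atomic step; the paths and the write_sstable helper are illustrative, not the actual sstable code:

```python
import os

def write_sstable(dirname, generation, write_components):
    tmp = os.path.join(dirname, f"{generation}-TOC.txt.tmp")
    toc = os.path.join(dirname, f"{generation}-TOC.txt")
    # Step 1: create the temporary TOC unconditionally and exclusively.
    # O_EXCL fails if the file exists, i.e. if another writer is mid-flight.
    fd = os.open(tmp, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.close(fd)
    # Step 2: only now check for an existing final TOC.
    if os.path.exists(toc):
        os.unlink(tmp)
        raise FileExistsError(f"generation {generation} already exists")
    # Step 3: write the remaining components as usual.
    write_components()
    # Step 4: publish by renaming the temporary TOC to the final name.
    os.rename(tmp, toc)
```

The temporary file exists from step 1 until step 4, so a racing writer's step 1 fails for the whole duration of the write.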

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <ef630f5ac1bd0d11632c343d9f77a5f6810d18c1.1457818331.git.raphaelsc@scylladb.com>
2016-03-14 11:44:51 +02:00
Avi Kivity
7278d0343b Merge seastar upstream
* seastar 906b562...88cc232 (2):
  > reactor: fix work item leak in syscall work queue
  > rpc_test: add missing header
2016-03-14 11:15:42 +02:00
Asias He
9f64c36a08 storage_service: Fix pending_range_calculator_service
Since calculate_pending_ranges will modify token_metadata, we need to
replicate it to other shards. With this patch, when we call
calculate_pending_ranges, token_metadata will be replicated to the other
non-zero shards.

In addition, it is not useful as a standalone class. We can merge it
into storage_service and kill one singleton class.

Fixes #1033
Refs #962
Message-Id: <fb5b26311cafa4d315eb9e72d823c5ade2ab4bda.1457943074.git.asias@scylladb.com>
2016-03-14 10:14:22 +02:00
Pekka Enberg
d4b4baad98 Merge "Add more information to query result digest" from Paweł
"This series adds more information (i.e. keys and tombstones) to the
query result digest in order to ensure correctness and increase the
chances of early detection of disagreement between replicas.

The digest is no longer computed by hashing query::result but build
using the query result builder. That is necessary since the query
result itself doesn't contain all information required to compute
the digest. Another consequence of this is that now replicas asked
for a result need to send both the result and the digest to
the coordinator as it won't be able to compute the digest itself.

Unfortunately, these patches change our on wire communication:
 1) hash computation is different
 2) format of query::result is changed (and it is made non-final)

Fixes #182."
2016-03-14 08:22:05 +02:00
Paweł Dziepak
72970c9c90 query: add query::result::_digest to pretty printer
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-11 18:27:17 +00:00
Paweł Dziepak
82d2a2dccb specify whether query::result, result_digest or both are needed
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-11 18:27:13 +00:00
Paweł Dziepak
21e2ebcf8c query: build only result, only digest or both
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-11 18:27:13 +00:00
Paweł Dziepak
46079f763b query: add keys and tombstones to result digest
Query result digest is used to verify that all replicas have the same
data. Therefore, it needs to contain more information than the query
result itself in order to ensure proper detection of disagreements.

Generally, adding clustering keys to the digest regardless of whether
the client asked for them will guarantee correctness. However, adding
tombstones as well improves the chances of early detection of nodes
containing stale data.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-11 18:27:13 +00:00
Paweł Dziepak
15fd3e96ff md5_hasher: add finalize_array()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-11 18:27:13 +00:00
Paweł Dziepak
3efb10bd08 result.idl: keep digest together with result
The result digest is going to be computed in the query result builder and
requires information not available in the query result. That's why the
digest now needs to be sent to the other nodes together with the result,
as they won't be able to compute it on their own.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-11 18:27:13 +00:00
Paweł Dziepak
86ba96622e atomic_cell: do not require type to hash collection cell
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-11 18:27:13 +00:00
Paweł Dziepak
23ee493d91 types: make collection_type_impl::deserialize_mutation_form static
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-11 18:27:13 +00:00
Paweł Dziepak
c1f7f11d54 mutation_partition: do not add ck to result when not asked to
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-11 18:27:13 +00:00
Paweł Dziepak
77dbe3c12f storage_proxy: fix reconciliation with limits
Currently, if there is a disagreement between replicas we get mutations
from all of them, merge these mutations and send the result to the
client; the difference between the result and the mutation sent by a
particular replica is sent back to repair it.
Unfortunately, that may not suffice to provide the user with correct results
in case of disagreements.

Consider the following scenario:

create table cf(p int, c int, r int, primary key(p, c));

node1:
p=0, c=1, r=1 (timestamp = 1)
p=0, c=2, r=2 (timestamp = 2)

node2:
p=0, c=1, r=tombstone (timestamp = 2)
p=0, c=2, r=1 (timestamp = 1)

query:
select r from cf limit 1;

Let's assume there are no row markers. node1 will send only outdated
cell (p=0, c=1, r=1) while node2 will send both tombstone for c=1 and
outdated cell (p=0, c=2, r=1). A disagreement will be detected, the
replies will be merged and the coordinator will respond to the client
with result r=1, while the correct answer is r=2.

The solution proposed in this patch is to attempt to detect cases when
the problem may occur and retry queries with a larger limit, which results
in replicas providing more information.

The detection logic is simple: the partition key and clustering key of
the last row in the reconciled result are compared with the partition
keys and clustering keys of the last rows of replies from replicas
(except short reads). If the (pk, ck) of the replica last row is smaller
than the (pk, ck) of the reconciled result the query is retried.
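The detection logic can be sketched in Python, representing the last row of each reply as a (pk, ck) tuple compared lexicographically; needs_retry and the reply representation are illustrative, not the storage_proxy code:

```python
# Sketch: retry if any non-short-read replica reply ends before the
# reconciled result does, since that replica may be hiding newer rows.
def needs_retry(reconciled_last, replica_replies):
    """replica_replies: list of (last_row, is_short_read) pairs."""
    for last_row, is_short_read in replica_replies:
        if is_short_read:
            continue  # short reads are excluded from the check
        if last_row is not None and last_row < reconciled_last:
            return True  # retry the query with a larger limit
    return False
```

In the scenario above, node1's reply ends at (p=0, c=1) while the reconciled result ends at (p=0, c=2), so the query is retried.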

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-11 18:26:33 +00:00
Asias He
f747df2aff streaming: Fix rethrow in stream_transfer_task
Fix bootstrap_test.py:TestBootstrap.failed_bootstap_wiped_node_can_join_test

Logs on node 1:
 INFO  2016-03-11 15:53:43,287 [shard 0] gossip - FatClient 127.0.0.2 has been silent for 30000ms, removing from gossip
 INFO  2016-03-11 15:53:43,287 [shard 0] stream_session - stream_manager: Close all stream_session with peer = 127.0.0.2 in on_remove
 WARN  2016-03-11 15:53:43,498 [shard 0] stream_session - [Stream #4e411ba0-e75e-11e5-81f8-000000000000] stream_transfer_task: Fail to send STREAM_MUTATION_DONE to 127.0.0.2:0: std::runtime_error ([Stream #4e411ba0-e75e-11e5-81f8-000000000000] GOT STREAM_MUTATION_DONE 127.0.0.1: Can not find stream_manager)

 terminate called without an active exception

Backtrace on node 1:

 #0  0x00007fb74723da98 in raise () from /lib64/libc.so.6
 #1  0x00007fb74723f69a in abort () from /lib64/libc.so.6
 #2  0x00007fb74ab84aed in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
 #3  0x00007fb74ab82936 in ?? () from /lib64/libstdc++.so.6
 #4  0x00007fb74ab82981 in std::terminate() () from /lib64/libstdc++.so.6
 #5  0x00007fb74ab82be9 in __cxa_rethrow () from /lib64/libstdc++.so.6
 #6  0x0000000000f3521e in streaming::stream_transfer_task::<lambda()>::<lambda(auto:44)>::operator()<std::__exception_ptr::exception_ptr> (ep=..., __closure=0x7ffce74d8630) at streaming/stream_transfer_task.cc:169
 #7  do_void_futurize_apply<const streaming::stream_transfer_task::start()::<lambda()>::<lambda(auto:44)>&, std::__exception_ptr::exception_ptr> (func=...) at /home/asias/src/cloudius-systems/scylla/seastar/core/future.hh:1142
 #8  futurize<void>::apply<const streaming::stream_transfer_task::start()::<lambda()>::<lambda(auto:44)>&, std::__exception_ptr::exception_ptr> (func=...) at /home/asias/src/cloudius-systems/scylla/seastar/core/future.hh:1190
 #9  future<>::<lambda(auto:7&&)>::operator()<future<> > ( fut=fut@entry=<unknown type in /home/asias/src/cloudius-systems/scylla/build/release/scylla, CU 0xec84d00, DIE 0xee2561d>, __closure=__closure@entry=0x7ffce74d8630) at /home/asias/src/cloudius-systems/scylla/seastar/core/future.hh:1014

Message-Id: <1457684884-4776-2-git-send-email-asias@scylladb.com>
2016-03-11 11:14:05 +02:00
Asias He
bcdd3dbb3e messaging_service: Add missed throw
It was missed somehow.
Message-Id: <1457684884-4776-1-git-send-email-asias@scylladb.com>
2016-03-11 11:01:24 +02:00
Raphael S. Carvalho
031bf57c19 sstables: bail out if toc exists for generation used by write_components
Currently, if sstable::write_components() is called to write a new sstable
using the same generation as an sstable that exists, a temporary TOC will
be unconditionally created. Afterwards, the same sstable::write_components()
will fail when it reaches sstable::create_data(), for the obvious reason
that the data component exists for that generation (in this scenario).
After that, the user will not be able to boot scylla anymore because there is
a generation with both a TOC and a temporary TOC. We cannot simply remove a
generation with a TOC and a temporary TOC because user data would be lost
(again, in this scenario). After all, the temporary TOC was only created
because sstable::write_components() was wrongly called with the generation
of an sstable that exists.

Solution proposed by this patch is to trigger exception if a TOC file
exists for the generation used.

Some SSTable unit tests were also changed to guarantee that we don't try
to overwrite components of an existing sstable.

Refs #1014.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <caffc4e19cdcf25e4c6b9dd277d115422f8246c4.1457643565.git.raphaelsc@scylladb.com>
2016-03-11 09:22:51 +02:00
Nadav Har'El
1b4f8842ee sstable: fix compressed data file overread
Since commit 2f56577 ("sstables: more efficient read of compressed data
file"), the compressed_file_input_stream uses a file_input_stream to
efficiently read the compressed data at chunks some desired size (128 KB
is our default) instead of at smaller compressed chunks.

However, I had a bug where I mis-calculated the desired length of the
read (giving the *end byte* instead of the length!) and as a result
file_input_stream did not know where the read was supposed to stop, and
always read 128 KB buffers. The results were not incorrect, because the
sstable reader stops when it needs to, even if given too much data. But
it was inefficient because too much data was read in the last buffer.

With this patch, the length is correctly given to the input stream, and
it can read a much smaller buffer at the end of the read, not the full
128 KB. I tested that this actually happens.
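A small Python sketch of the distinction between the end byte and the length; compressed_read_extent and the chunk-offset layout are illustrative, not the actual reader code:

```python
# Sketch: the stream that reads the compressed data must be given a *length*,
# not the end offset. chunk_offsets[i] is the file offset of chunk i.
def compressed_read_extent(chunk_offsets, file_end, first_chunk, last_chunk):
    """Return (start, length) of the read covering the requested chunks."""
    start = chunk_offsets[first_chunk]
    if last_chunk + 1 < len(chunk_offsets):
        end = chunk_offsets[last_chunk + 1]
    else:
        end = file_end
    # The bug was returning `end` as the second value: the stream then had no
    # idea where to stop and kept reading full 128 KB buffers.
    return start, end - start
```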

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1457633616-15193-1-git-send-email-nyh@scylladb.com>
2016-03-11 09:17:50 +02:00
Pekka Enberg
987e8579d7 Merge "general robustness improvements for SSTables code" from Glauber
"As described in issue #1014, we have found ourselves in a situation where
 SSTables can be written too early, and that causes problems for the existing
 SSTables. While this shouldn't happen - and Pekka's recent patch to move
 populate() a lot earlier in initialization should fix that - when it did
 happen, what we had was not enough to prevent it from overwriting existing
 tables.

 We should do a much better job protecting against that.

 Also, some of the exceptions that are generated are totally inconclusive. This
 series also aims at making some of the exceptions more descriptive."
2016-03-11 09:03:05 +02:00
Glauber Costa
a339296385 database: turn sstable generation number into an optional
This patch makes sure that every time we need to create a new generation number
(the very first step in the creation of a new SSTable), the respective CF is
already initialized and populated. Failure to do so can lead to data being
overwritten. Extensive details about why this is important can be found
in Scylla's Github Issue #1014.

Nothing should be writing to SSTables before we have had the chance to populate
the existing SSTables and calculate what the next generation number should be.

However, if that happens, we want to protect against it in a way that does not
involve overwriting existing tables. This is one of the ways to do it: every
column family starts in an unwriteable state, and when it can finally be written
to, we mark it as writeable.

Note that this *cannot* be a part of add_column_family. That adds a column family
to a db in memory only, and if anybody is about to write to a CF, that was most
likely already called. We need to call this explicitly when we are sure we're ready
to issue disk operations safely.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-03-10 21:06:05 -05:00
Glauber Costa
f2a8bcabc2 sstables: improve error messages
The standard C++ exception messages thrown when anything goes wrong writing
the file are suboptimal: they barely tell us the name of the failing file.

Use a specialized create function so that we can capture that better.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-03-10 21:06:05 -05:00
Glauber Costa
6c4e31bbdb main: when scanning SSTables, run shard 0 first
Deletion of previous stale, temporary SSTables is done by Shard0. Therefore,
let's run Shard0 first. Technically, we could just have all shards agree on the
deletion and just delete it later, but that is prone to races.

Those races are not supposed to happen during normal operation, but if we have
bugs, they can. Scylla's Github Issue #1014 is an example of a situation where
that can happen, making existing problems worse. So running a single shard
first and making sure that all temporary tables are deleted provides
extra protection against such situations.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-03-10 21:06:05 -05:00
Glauber Costa
8eb4e69053 database: remove unused parameter
We are no longer using the in_flight_seals gate, but forgot to remove it.
To guarantee that all seal operations will have finished when we're done,
we are using the memtable_flush_queue, which also guarantees order. But
that gate was never removed.

The FIXME code should also be removed, since such an interface now exists.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-03-10 21:05:54 -05:00
Glauber Costa
94e90d4a17 column_family: do not open code generation calculation
We already have a function that wraps this, re-use it.  This FIXME is still
relevant, so just move it there. Let's not lose it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-03-10 21:05:47 -05:00
Glauber Costa
46fdeec60a column_family: remove mutation_count
We use memory usage as a threshold these days, and nowhere is _mutation_count
checked. Get rid of it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-03-10 21:05:47 -05:00
Gleb Natapov
16135c2084 make initialization run in a thread
While looking at the initialization code I felt like my head was going to
explode. Moving initialization into a thread makes things a little bit
better. Only lightly tested.

Message-Id: <20160310163142.GE28529@scylladb.com>
2016-03-10 17:42:05 +01:00
Gleb Natapov
176aa25d35 fix developer-mode parameter application on SMP
I am almost sure we want to apply it once on each shard, and not multiple
times on a single shard.

Message-Id: <20160310155804.GB28529@scylladb.com>
2016-03-10 17:17:48 +01:00
Pekka Enberg
97bef4fb7c build: Fix http/http_response_parser.hh dependency
Make sure http_response.hh that is pulled by locator/ec2_snitch.hh is
built. The commit is similar to what commit 6ccf8f8 ("build: make sure
to ask seastar to build http/request_parser.hh, and depend on it") did
for request_parser.hh.

Fixes the following build error on CentOS:

  In file included from ./locator/ec2_multi_region_snitch.hh:41:0,
                   from locator/ec2_multi_region_snitch.cc:39:
  ./locator/ec2_snitch.hh:24:40: fatal error: http/http_response_parser.hh: No such file or directory

Spotted by Shlomi.
Message-Id: <1457612266-315-1-git-send-email-penberg@scylladb.com>
2016-03-10 14:46:41 +01:00
Gleb Natapov
51ca3122cf cleanup forward declaration for key types
Message-Id: <20160310075138.GC6117@scylladb.com>
2016-03-10 10:52:19 +01:00
Pekka Enberg
5dd1fda6cf main: Initialize system keyspace earlier
We start services like gossiper before system keyspace is initialized
which means we can start writing too early. Shuffle code so that system
keyspace is initialized earlier.

Refs #1014
Message-Id: <1457593758-9444-1-git-send-email-penberg@scylladb.com>
2016-03-10 10:39:27 +01:00
Pekka Enberg
f2f35a2f50 Merge "fix shutdown and improve logging" from Asias
"Fixes #1005 and probably fixes #1013."
2016-03-10 08:21:48 +02:00
Asias He
a9ec752939 streaming: Reduce STREAM_MUTATION error logging
There might be a large number of STREAM_MUTATION messages in flight. Log one
error per column_family per range to avoid spamming the log.
2016-03-10 10:56:48 +08:00
Asias He
134b814cde gossip: Log status info when stopping gossip 2016-03-10 10:56:48 +08:00
Asias He
7c4c99d7c7 streaming: Fix a log level in get_column_family_stores
It is supposed to be debug level instead of info level.
2016-03-10 10:56:48 +08:00
Asias He
cb90ff2709 storage_service: Make decommission log info instead of debug level
The log is just a few lines, and it is very useful for telling which step
fails when a decommission runs into an error.
2016-03-10 10:56:48 +08:00
Asias He
ed723665df gossip: Do not stop gossip more than once
If we do
   - Decommission a node
   - Stop a node
we will shut down gossip more than once in:
   - storage_service::decommission
   - storage_service::drain_on_shutdown

Fix by checking if it is already stopped and backing off if so.
2016-03-10 10:56:48 +08:00
Asias He
138c5f5834 storage_service: Do not stop messaging_service more than once
If we do
   - Decommission a node
   - Stop a node
we will shut down messaging_service more than once in:
   - storage_service::decommission
   - storage_service::drain_on_shutdown

Fixes #1005
Refs  #1013

This fixes a dtest failure in the debug build.

update_cluster_layout_tests.TestUpdateClusterLayout.simple_decommission_node_1_test/

/data/jenkins/workspace/urchin-dtest/label/monster/mode/debug/scylla/seastar/core/future.hh:802:35:
runtime error: member call on null pointer of type 'struct
future_state'
core/future.hh:334:49: runtime error: member access within null
pointer of type 'const struct future_state'
ASAN:SIGSEGV
=================================================================
==4557==ERROR: AddressSanitizer: SEGV on unknown address
0x000000000000 (pc 0x00000065923e bp 0x7fbf6ffac430 sp 0x7fbf6ffac420
T0)
    #0 0x65923d in future_state<>::available() const
/data/jenkins/workspace/urchin-dtest/label/monster/mode/debug/scylla/seastar/core/future.hh:334
    #1 0x41458f1 in future<>::available()
/data/jenkins/workspace/urchin-dtest/label/monster/mode/debug/scylla/seastar/core/future.hh:802
    #2 0x41458f1 in then_wrapped<parallel_for_each(Iterator, Iterator,
Func&&)::<lambda(parallel_for_each_state&)> [with Iterator =
std::__detail::_Node_iterator<std::pair<const net::msg_addr,
net::messaging_service::shard_info>, false, true>; Func =
net::messaging_service::stop()::<lambda(auto:39&)> [with auto:39 =
std::unordered_map<net::msg_addr, net::messaging_service::shard_info,
net::msg_addr::hash>]::<lambda(std::pair<const net::msg_addr,
net::messaging_service::shard_info>&)>]::<lambda(future<>)>, future<>
> /data/jenkins/workspace/urchin-dtest/label/monster/mode/debug/scylla/seastar/core/future.hh:878
2016-03-10 10:56:48 +08:00
Tomasz Grabiec
838a038cbd log: Fix operator<<(std::ostream&, const std::exception_ptr&)
Attempting to print a std::nested_exception currently results in an exception
leaking outside the printer. Fix by catching all exceptions in the final
catch block.

For a nested exception, the logger will now print just
"std::nested_exception". For nested exceptions specifically we should
log more, but that is a separate problem to solve.
Message-Id: <1457532215-7498-1-git-send-email-tgrabiec@scylladb.com>
2016-03-09 16:05:03 +02:00
Pekka Enberg
2566f8dc18 configure: Remove 'scylla_libs' variable
It's not actually used by anyone so drop it.
Message-Id: <1457531753-27891-2-git-send-email-penberg@scylladb.com>
2016-03-09 14:56:54 +01:00
Pekka Enberg
9bfb6a0c5b configure: Add boost date_time library as a dependency
It's needed to fix the debug build.
Message-Id: <1457531753-27891-1-git-send-email-penberg@scylladb.com>
2016-03-09 14:56:51 +01:00
Takuya ASADA
0ab3d0fd52 dist: use SEASTAR_IO instead of SCYLLA_IO
sync with iotune, fixes #1010

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1457530910-1273-1-git-send-email-syuu@scylladb.com>
2016-03-09 15:45:34 +02:00
Gleb Natapov
f242c6395c storage_proxy: add counter for retries reads
Message-Id: <20160309130453.GF2253@scylladb.com>
2016-03-09 14:09:42 +01:00
Pekka Enberg
ab502bcfa8 types: Implement to_string for timestamps and dates
The to_string() function is used for logging purposes, so use boost's
to_iso_extended_string() to format both timestamps and dates.

Fixes #968 (showstopper)
Message-Id: <1457528755-6164-1-git-send-email-penberg@scylladb.com>
2016-03-09 14:08:33 +01:00
Pekka Enberg
8eedaca948 Merge "streaming: handle cf is deleted" from Asias
"Fixes #979
 Fixes #976"
2016-03-09 14:52:27 +02:00
Asias He
3a4ea227d8 storage_service: Fix effective_ownership
Now, get_ranges_for_endpoint will unwrap the first range. With t0 t1 t2
t3, the first range (t3,t0] will be split into (min,t0] and (t3,max].
Skipping the range (t3,max], we get the correct ownership number, as
if the first range had not been split.

Fixes #928
Message-Id: <2e30ebd53f3dba3cc5e0cf36d5541c354b0e30ca.1457506704.git.asias@scylladb.com>
2016-03-09 13:26:01 +01:00
Asias He
d9ead889f3 streaming: Handle cf is deleted when sending STREAM_MUTATION_DONE
In the preparation phase of streaming, we check that the remote node has all
the cf_ids needed for the entire streaming process, including the
cf_ids which the local node will send to the remote node and vice versa.

So, at a later time, if a cf_id is missing, it must be that the cf was
deleted, and it is fine to ignore the no_such_column_family exception. In
this patch, we change the code to ignore it on the server side, to avoid
sending the exception back and having to handle the exception in an
IDL-compatible way.

One thing we could improve is that the sender might learn that the cf is
deleted later than the receiver does. In that case, the sender will send
some more mutations than if we sent the no_such_column_family back to the
sender. However, since we do not throw exceptions in the receiver's stream
mutation handler, this does not cause much overhead: the receiver will
just ignore the mutations received.

Fixes #979
2016-03-09 16:50:38 +08:00
Asias He
efa74dbae0 streaming: Do not send if the cf is deleted
It is possible that a cf is deleted after we create the cf reader. Avoid
sending its mutations, to spare the unnecessary overhead of putting them on
the wire only for the peer node to drop them.
2016-03-09 16:50:38 +08:00
Asias He
4abaacfc61 db: Introduce column_family_exists
It is cheaper than throwing a no_such_column_family exception when testing
whether a cf is gone, e.g., deleted.
2016-03-09 16:50:38 +08:00
Asias He
dca9e594cc streaming: Remove the unused test code
It was introduced in the early development of streaming. We have dtests
for streaming now, so drop it.
Message-Id: <1457499303-21163-1-git-send-email-asias@scylladb.com>
2016-03-09 10:31:42 +02:00
Pekka Enberg
4f3d6977f1 Merge "Abort stream_session if peer is removed or restarted" from Asias
"Hook streaming with the gossip callback so we can abort
the stream_session in the following cases:

- a node is restarted
- a node is removed from the cluster

Fixes #1001."
2016-03-09 10:18:42 +02:00
Nadav Har'El
2f56577794 sstables: more efficient read of compressed data file
Before this patch, reading large ranges from a compressed data file involved
two inefficiencies:

 1.  The compressed data file was read one compressed chunk at a time.
     Such a chunk is around 30 KB in size, well below our desired sstable
     read-ahead size (sstable_buffer_size = 128 KB).

 2.  Because the compressed chunks have variable length (the uncompressed
     chunk has a fixed length) they are not aligned to disk blocks, so
     consecutive chunks have overlapping blocks which were unnecessarily
     read twice.

The fix for both issues is to build the compressed_file_input_stream on
an existing file_input_stream, instead of using direct file IO to read the
individual chunks. file_input_stream takes care of doing the appropriate
amount of read-ahead, and the compressed_file_input_stream layer does the
decompression of the data read from the underlying layer.

Fixes #992.

Historical note: Implementing compressed_file_input_stream on top of
file_input_stream was already tried in the past, and rejected. The problem
at that time was that compressed_file_input_stream's constructor did not
specify the *end* of the range to read, so that when we wanted to read
only a small range we got too much read-ahead beyond the exactly one
compressed chunk that we needed to read.  Following the fix to issue #964,
we now know on every streaming read also the intended *end* of the stream,
so we can now use this to stop reading at the end of the last required
chunk, even when we use a read-ahead buffer much larger than a chunk.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1457304335-8507-1-git-send-email-nyh@scylladb.com>
2016-03-09 10:14:15 +02:00
Glauber Costa
8260b8fc6f touch CF directories during startup
We try to be robust against files disappearing (due to any kind of corruption)
inside the data directory.

But if the data directory itself goes missing, that's a situation that we don't
handle correctly.  We will keep accepting writes normally, but when we try to
flush the memtable to disk, we'll fail with a system error.

Having the CF directory disappear is not a common thing. But it is also one
that we can easily protect against, by touching all CF directories we know
about on startup.

Fixes #999

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <ed66373dccca11742150a6d08e21ece3980227d3.1457379853.git.glauber@scylladb.com>
2016-03-09 09:06:51 +02:00
Asias He
bf3507d093 messaging_service: Stop retrying if node is removed from gossip
- Start a node
- Inject data
- Start another node to bootstrap
- Before the second node finishes streaming, kill the second node
- After a while the node will be removed from the cluster because it does
  not manage to join the cluster.
- At this time, messaging_service might keep retrying the
  stream_mutations unnecessarily.

To fix, check if the peer node is still a known node in the gossip.
2016-03-09 07:35:20 +08:00
Asias He
1f3928c321 streaming: Hook streaming with gossip callback
If the peer node of a stream_session is restarted or removed, we should
abort the streaming. It is better to hook the gossip callback in the stream
manager than in each stream_session.
2016-03-09 07:35:20 +08:00
Glauber Costa
2cd756ae5e repair: replace a magic number with another magic number
In due time we will have to fix this, but as an interim step, let's use
a "better" magic number.

The problem with 100 is that as soon as partitions start to get bigger,
we're using too much memory. Since this is multiplied by the number of token
ranges, and happens on every shard, the final number can become really big,
and the amount of resources we use goes up proportionally.

This means that even if we are mistaken about the new number (we probably are),
in this case it is better to err on the side of a more conservative resource
usage.

Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <97158f3db5734916cee4ccf12eaa66e7402570bb.1457448855.git.glauber@scylladb.com>
2016-03-08 17:29:00 +02:00
Nadav Har'El
b7e29691c2 sstables: avoid index and data file over-reads
When we do a streaming read that knows the expected *end* position of the
read, we can use a large read-ahead buffer, and at the same time, stop
reading at exactly the intended end (or small rounding of it to the DMA
block size) and not waste resources blindly reading a large amount of data
after the end just to fill the read-ahead buffer.

The sstable reading code, both for reading the data file and the index file,
created a file input stream without specifying its end, thereby losing
this optimization - so when a large buffer was used, we would get a large
over-read. This patch fixes this, so the sstable data file and index file are
read using a file input stream which is aware of its end.

Fixes #964.

Note that this patch does not change the behavior when reading a
*compressed* data file. For compressed read, we did not have the problem
of over-read in the first place, because chunks are read one by one.
But we do have other sources of inefficiencies there (stemming, again,
from the fact that the compressed chunks are read one by one), and I
opened a separate issue #992 for that.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1457219304-12680-1-git-send-email-nyh@scylladb.com>
2016-03-08 17:26:10 +02:00
Calle Wilund
8575f1391f lists.cc: fix update insert of frozen list
Fixes #967

Frozen lists are just atomic cells. However, old code inserted the
frozen data directly as an atomic_cell_or_collection, which in turn
meant it lacked the header data of a cell. When in turn it was
handled by internal serialization (freeze), since the schema said
it was not a (non-frozen) collection, we tried to interpret the frozen
list data as a cell header -> most likely considered dead.
Message-Id: <1457432538-28836-1-git-send-email-calle@scylladb.com>
2016-03-08 13:48:45 +01:00
Pekka Enberg
81af486b69 Update scylla-ami submodule
* dist/ami/files/scylla-ami d4a0e18...84bcd0d (1):
  > Add --ami parameter
2016-03-08 13:49:31 +02:00
Takuya ASADA
254b0fa676 dist: show message to use XFS for scylla data directory and also notify about developer mode, when iotune fails
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1457426286-15925-1-git-send-email-syuu@scylladb.com>
2016-03-08 12:20:33 +02:00
Pekka Enberg
83d82ea901 Merge "Fix Ubuntu package issues on AMI" from Takuya
"This fixes bugs on Ubuntu package and AMI scripts, closes #991."
2016-03-08 11:51:30 +02:00
Takuya ASADA
18a27de3c8 dist: export all entries on /etc/default/scylla-server on Ubuntu
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-03-08 18:18:30 +09:00
Gleb Natapov
ce6d1a242a storage_proxy: fix background_reads counter
background_reads collectd counter was not always properly decremented.
Fix it and streamline background read repair error handling.

Message-Id: <20160307182255.GI4849@scylladb.com>
2016-03-07 19:41:09 +01:00
Yoav Kleinberger
1cd01cd2ab tools/scyllatop: defend against curses "out of screen bounds" error
Fixes issue #945 (hopefully)
This issue was probably the result of trying to write outside the
confines of the window. The views.Base class now defends against this.

Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <9735806b211567f3239e187d87437c484f532291.1457265435.git.yoav@scylladb.com>
2016-03-07 18:02:26 +01:00
Raphael S. Carvalho
0f4239d63a service: improve logging of storage_service::load_new_sstables
Closes #952.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <2402f387c32d2d1221e740edb67e56c1593c1936.1457366098.git.raphaelsc@scylladb.com>
2016-03-07 18:01:52 +01:00
Raphael S. Carvalho
e850c1406e sstables: update comment
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <8abc1c6c66ed8d3bb35ecfb6d8251de3f61a97ae.1457093016.git.raphaelsc@scylladb.com>
2016-03-07 17:36:34 +01:00
Raphael S. Carvalho
822759eee0 compaction_manager: update stat pending_tasks properly
The sizes of both _cfs_to_cleanup and _cfs_to_compact must be added when
calculating a new value for _stats.pending_tasks.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <b601e24d0631922798575f39d00fb54fe00d4971.1457093016.git.raphaelsc@scylladb.com>
2016-03-07 17:36:03 +01:00
Gleb Natapov
2d092bbd32 storage_proxy: send read requests with timeout
No need to wait for replies long after request is timed out.
Message-Id: <1457351304-28721-2-git-send-email-gleb@scylladb.com>
2016-03-07 14:00:11 +01:00
Gleb Natapov
4122422d19 storage_proxy: always wait for digest read resolver done future
Currently it is waited upon only if the background read repair check is
needed, which causes an unhandled exception warning to be printed if
it enters the failed state. Fix this by always waiting on it, but doing
anything beyond ignoring an exception only if the check is needed.
Message-Id: <1457351304-28721-1-git-send-email-gleb@scylladb.com>
2016-03-07 14:00:09 +01:00
Gleb Natapov
626c9d046b fix EACH_QUORUM handling during bootstrapping
Currently, write acknowledgement handling does not take the bootstrapping
node into account for CL=EACH_QUORUM. The patch fixes it.

Fixes #994

Message-Id: <20160307121620.GR2253@scylladb.com>
2016-03-07 13:56:34 +01:00
Raphael S. Carvalho
d65642cee8 fix storage_service::load_new_sstables() to not disable write permanently
Avi says:
"If an exception happens, then enable_sstable_writes won't be called."

The problem is fixed by catching a possible exception and enabling sstable
write for the relevant column family if it wasn't enabled already.

Closes #953.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <32c1bcb2c60c7b9e5514eb0a95062f40ca92093a.1457119308.git.raphaelsc@scylladb.com>
2016-03-07 13:56:02 +01:00
Gleb Natapov
f59415b3c6 Take pending endpoints into account while checking for sufficient live nodes
During bootstrapping, additional copies of data have to be made to ensure
that the CL level is met (see CASSANDRA-833 for details). Our code does
that, but it does not take into account that the bootstrapping node can be
dead, which may cause a request to proceed even though there are not
enough live nodes for it to be completed. In such a case the request neither
completes nor times out, so it appears to be stuck from the CQL layer's POV.
The patch fixes this by taking pending nodes into account while checking
that there are sufficient live nodes for the operation to proceed.

Fixes #965

Message-Id: <20160303165250.GG2253@scylladb.com>
2016-03-07 13:30:13 +01:00
Gleb Natapov
8dad399256 log: add space between log level and date in the output
It was dropped by 6dc51027a3

Message-Id: <20160306125313.GI2253@scylladb.com>
2016-03-07 13:06:06 +01:00
Tomasz Grabiec
9deb036e4e Merge branch 'dev/issue-845-set-incremental-backup-config-v1' from seastar-dev.git
From Vlad:

This series modifies the 'database' class to use the internal
_enable_incremental_backups value (initialized with
'incremental_backups' configuration value) instead of using the
'incremental_backups' configuration value directly.

Then we update this internal value at runtime from 'nodetool
enable/disablebackup' API callback so that newly created keyspaces and
column families use the newly configured incremental backup
configuration.
2016-03-07 10:47:20 +01:00
Tomasz Grabiec
b3e56549ca Merge branch 'dev/issue-909-synchronization-part-v2' from seastar-dev.git
From Vlad:

This series fixes the first part of issue #909 (the second part has a
separate github issue #965) which is a discrepancy between a
storage_service::token_metadata and a gossiper::endpoint_state_map
contents on non-zero shards.
2016-03-07 10:20:15 +01:00
Paweł Dziepak
99b61d3944 lsa: set _active to nullptr in region destructor
In the region destructor, after the active segment is freed, the pointer to
it is left unchanged. This confuses the remaining parts of the destructor
logic (namely, removal from the region group), which may rely on the
information in region_impl::_active.

In this particular case the problem was that the code removing the region
from the region group called region_impl::occupancy(), which was
dereferencing _active if not null.

Fixes #993.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1457341670-18266-1-git-send-email-pdziepak@scylladb.com>
2016-03-07 10:15:28 +01:00
Takuya ASADA
9ee14abf24 dist: export sysconfig for scylla-io-setup.service
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-03-07 18:13:30 +09:00
Takuya ASADA
3d9dc52f5f Revert "Revert "dist: align ami option with others (-a --> --ami)""
This reverts commit 66c5feb9e9.

Conflicts:
	dist/common/scripts/scylla_sysconfig_setup

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-03-07 18:13:30 +09:00
Takuya ASADA
c9882bc2c4 Revert "Revert "Revert "dist: remove AMI entry from sysconfig, since there is no script refering it"""
This reverts commit 643beefc8c.

Conflicts:
	dist/common/scripts/scylla_sysconfig_setup
	dist/common/sysconfig/scylla-server

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-03-07 17:15:42 +09:00
Takuya ASADA
c888eaac74 dist: add /etc/scylla.d/io.conf on Ubuntu
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-03-07 17:15:42 +09:00
Vlad Zolotarov
2cd836a02e api::set_storage_service(): fix the 'nodetool enablebackup' API
'nodetool enable/disablebackup' callback was modifying only the
existing keyspaces and column families configurations.
However new keyspaces/column families were using
the original 'incremental_backups' configuration value which could
be different from the value configured by 'nodetool enable/disablebackup'
user command.

This patch updates the database::_enable_incremental_backups per-shard
value in addition to updating the existing keyspaces and column families
configurations.

Fixes #845

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-03-06 17:26:31 +02:00
Vlad Zolotarov
a45ecaf336 database: store "incremental backup" configuration value in per-shard instance
Store the "incremental_backups" configuration value in the database
class (and use it when creating a keyspace::config) in order to be
able to modify it at runtime.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-03-06 17:22:48 +02:00
Vlad Zolotarov
87e6efcdab storage_service: distribute gossiper::endpoint_state_map together with token_metadata
If storage_service::token_metadata is not distributed together with
gossiper::endpoint_state_map there may be a situation when a non-zero
shard sees a new value in token_metadata (e.g. newly added node's
token ranges) while still seeing an old gossiper::endpoint_state_map
contents (e.g. a mentioned above newly added node may not be present,
thus causing gossiper::is_alive() to return FALSE for that node, while
the node is actually alive and kicking).

To avoid this discrepancy we will always update a token_metadata together
with an endpoint_state_map when we distribute new token_metadata data
among shards.

Fixes #909

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-03-06 13:15:19 +02:00
Vlad Zolotarov
3a72ef87f2 gossiper: make _shadow_endpoint_state_map public and rename
We will need to access it from the storage_service class when replicating
token_metadata.

Rename _shadow_endpoint_state_map -> shadow_endpoint_state_map
according to our coding convention.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-03-06 11:16:44 +02:00
Vlad Zolotarov
4a21d48cc5 gossiper: use a semaphore instead of a future<> for serializing a timer callback
Use a semaphore to allow serializing with a gossiper's timer callback.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-03-06 11:16:44 +02:00
Takuya ASADA
6dc51027a3 log: make log.cc able to compile with g++-4.9
std::put_time() is not implemented on g++-4.9, so replace it with strftime().
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1457024183-893-1-git-send-email-syuu@scylladb.com>
2016-03-04 12:48:43 +01:00
Avi Kivity
6c2e57b003 Merge seastar upstream
* seastar ba615c7...906b562 (1):
  > rpc: prepare some more for feature negotiation
2016-03-03 18:22:57 +02:00
Gleb Natapov
b89b6f442b storage_proxy: fix race between read cl completion and timeout in digest resolver
If the timeout happens after the cl promise is fulfilled, but before the
continuation runs, it removes all the data that the cl continuation needs
to calculate the result. Fix this by calculating the result immediately and
returning it in the cl promise instead of delaying this work until the
continuation runs. This has a nice side effect of simplifying digest
mismatch handling and making it exception free.

Fixes #977.

Message-Id: <1457015870-2106-3-git-send-email-gleb@scylladb.com>
2016-03-03 16:48:28 +02:00
Gleb Natapov
e4ac5157bc storage_proxy: store only one data reply in digest resolver.
The read executor may ask for more than one data reply during the digest
resolving stage, but only one result is actually needed to satisfy
a query, so there is no need to store all of them.

Message-Id: <1457015870-2106-2-git-send-email-gleb@scylladb.com>
2016-03-03 16:47:53 +02:00
Gleb Natapov
69b61b81ce storage_proxy: fix cl achieved condition in digest resolver timeout handler
In the digest resolver, for cl to be achieved it is not enough to get the
correct number of replies; there must also be a data reply among them. The
condition in the digest timeout does not check that; fortunately, we have a
variable that we set to true when cl is achieved, so use it instead.

Message-Id: <1457015870-2106-1-git-send-email-gleb@scylladb.com>
2016-03-03 16:47:11 +02:00
Tomasz Grabiec
2abd62b5cb bytes_ostream: Drop methods which serialize integers
This will make bytes_ostream completely agnostic to the serialization
format, which should be determined by the layer above it.

Message-Id: <1457004221-8345-2-git-send-email-tgrabiec@scylladb.com>
2016-03-03 13:27:27 +02:00
Tomasz Grabiec
aaac2a3cec serializer: Add missing include
Message-Id: <1457004221-8345-1-git-send-email-tgrabiec@scylladb.com>
2016-03-03 13:27:22 +02:00
Pekka Enberg
9c930d88a0 db/system_keyspace: Remove ifdef'd code
We have our own implementations of all three ifdef'd functions.

Message-Id: <1456926917-12594-1-git-send-email-penberg@scylladb.com>
2016-03-03 12:26:50 +02:00
Takuya ASADA
da56325f69 configure.py: add support --static-stdc++ for seastar binaries (iotune)
The Ubuntu 14.04 LTS package is currently broken because iotune is not
statically linked against libstdc++; this patch fixes that.
Requires a seastar patch to add --static-stdc++ to configure.py.

Fixes #982

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1456995050-22007-1-git-send-email-syuu@scylladb.com>
2016-03-03 12:18:47 +02:00
Avi Kivity
d4c92c7e27 Merge seastar upstream
* seastar b3fc7c5...ba615c7 (1):
  > configure.py: add --static-stdc++ to link libstdc++ statically
2016-03-03 12:18:23 +02:00
Asias He
01cb6b0d42 gossip: Send syn message in parallel and do not wait for it
1) As explained in commit 697b16414a (gossip: Make gossip message
handling async), in each gossip round we can talk to the 1-3
peer nodes in parallel to reduce the latency of the gossip round.

2) The gossip syn message uses a one-way rpc message, but the returned
future of the one-way message becomes ready only when the message is
dequeued for some reason (sent or dropped). If we wait for the one-way syn
message to return, it might block the gossip round for an unbounded time.
To fix, do not wait for it in the gossip round. The downside is there will
be no back pressure to bound the syn messages, but since the messages are
sent once per second, I think it is fine.
Message-Id: <ea4655f121213702b3f58185378bb8899e422dd1.1456991561.git.asias@scylladb.com>
2016-03-03 11:17:50 +02:00
Takuya ASADA
e545013e47 Revert "dist: downgrade g++ to 4.9 on Ubuntu"
This reverts commit 01bd4959ac.

Fixes #983

Conflicts:
	dist/ubuntu/build_deb.sh
	dist/ubuntu/control.in
	dist/ubuntu/rules.in

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1456996244-19889-1-git-send-email-syuu@scylladb.com>
2016-03-03 11:12:18 +02:00
Tomasz Grabiec
04f2482d74 schema_tables: Log results of schema merge
Currently schema changes are only logged at coordinator node which
initiates the change. It would be helpful in post morten analysis to
also see when and how schema changes are resolved when applied on
other nodes.
Message-Id: <1456953095-1982-1-git-send-email-tgrabiec@scylladb.com>
2016-03-03 11:12:15 +02:00
Nadav Har'El
2cf09147b5 Repair: don't use freeze() to calculate mutation checksums
Use the existing "feed_hash" mechanism to find a checksum of the
content of a mutation, instead of serializing the mutation (with freeze())
and then finding the checksum of that string.

The serialized form is more prone to future changes, and not really
guaranteed to provide equal hashes for mutations which are considered
"equal".

Fixes #971

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1456958676-27121-1-git-send-email-nyh@scylladb.com>
2016-03-03 09:58:24 +01:00
Avi Kivity
bec30ccf25 build: add order-only dependency between building antlr .o and IDL headers
This ensures that if an antlr generated .cpp file depends on an
IDL-generated .hh file, then that .hh is generated before the .o is
built.
2016-03-03 09:52:25 +02:00
Tomasz Grabiec
b42d3a90b3 cql3: create_table_statement: Sort _defined_names by text
Currently they are sorted by address in memory, which breaks the
check for column name duplicates, which assumes sorting by text.

Fixes #975.

Message-Id: <1456937400-20475-1-git-send-email-tgrabiec@scylladb.com>
2016-03-02 18:53:43 +02:00
Avi Kivity
dda77d14b9 Merge seastar upstream
* seastar 9964cbf...b3fc7c5 (2):
  > Introduce util/indirect.hh
  > reactor: new counters for the io queue
2016-03-02 18:52:36 +02:00
Calle Wilund
0c3322befd commitlog: Ensure segment survives whole flush call
Must keep the shared pointer alive.
Likewise, though, the shared pointer copy in the cycle main continuation
is not needed.

Message-Id: <1456931988-5876-3-git-send-email-calle@scylladb.com>
2016-03-02 18:22:13 +02:00
Calle Wilund
f1c4e3eb3d commitlog: Clear reserve segments in orphan_all
Otherwise they will keep the segment_manager alive (leak).
Fixes jenkins ASan errors.

Message-Id: <1456931988-5876-2-git-send-email-calle@scylladb.com>
2016-03-02 18:22:09 +02:00
Calle Wilund
a556f665c0 commitlog: Take segment_manager locks first in write/flush
While it is formally better to take a local lock first and only
then contend for the global one, in this case it is arguably
better to ensure we get a gate exception synchronously (early)
instead of potentially in a continuation. The old version might
cause us to do a gate::leave even though we never entered.

And since we should really only have one active (contending)
segment per shard anyway, it should not matter.

Message-Id: <1456931988-5876-1-git-send-email-calle@scylladb.com>
2016-03-02 18:22:05 +02:00
Calle Wilund
e79ca557ed managed_bytes: Change init of small object to silence error on gcc5
Fixes #865

(Some) gcc 5 (5.3.0 for me) on ubuntu will generate errors on
compilation of this code (compiling logalloc_test). The memcpy
to inline storage seems to confuse the compiler.
Simply change to std::copy, which quiets the compiler.
Any decent STL should convert a primitive std::copy to memcpy
anyway, and since this is the inline (small) storage,
it should not matter either way.

Message-Id: <1456931988-5876-4-git-send-email-calle@scylladb.com>
2016-03-02 18:21:51 +02:00
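The change described above can be sketched as follows; `small_blob` is a hypothetical stand-in, not managed_bytes' actual inline-storage layout:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical small-object layout; managed_bytes' real layout differs.
struct small_blob {
    uint8_t data[15];
    uint8_t size;
};

// std::copy over the byte range is equivalent to memcpy here (and a decent
// standard library lowers it to memcpy), but it avoids the diagnostics some
// gcc 5 builds emit for a raw memcpy into inline storage.
inline small_blob make_blob(const uint8_t* src, uint8_t n) {
    small_blob b{};
    b.size = n;
    std::copy(src, src + n, b.data);
    return b;
}
```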
Pekka Enberg
6d7e14a53a Merge "Implement describe_schema_versions" from Paweł
"This series implements describe_schema_versions so that nodetool
 describecluster can return proper schema information for the whole
 cluster. It involves adding a new verb, SCHEMA_CHECK, which is used to
 get the schema version for a given node, and a simple map-reduce that,
 using that verb, gets info from the whole cluster.

 This fixes #677, fixes #684, and fixes #472."
2016-03-02 16:02:53 +02:00
Paweł Dziepak
5396042f06 api: use proper describe_schema_versions implementation
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-02 12:49:55 +00:00
Paweł Dziepak
723b3ae7ed storage_service: implement describe_schema_versions
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-02 12:49:55 +00:00
Paweł Dziepak
b5eee2e5d4 gms: add inet_address::to_sstring()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-02 12:49:55 +00:00
Paweł Dziepak
ca68c36c8c storage_proxy: handle SCHEMA_CHECK verb
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-02 12:49:54 +00:00
Paweł Dziepak
b92f8a6d2b messaging_service: add SCHEMA_CHECK verb
SCHEMA_CHECK is used to get node schema version.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-02 12:49:54 +00:00
Tomasz Grabiec
9a5d7c6388 log: Prepend log lines with timestamp when printed to stdout
Useful for determining order of events in logs of different nodes, or
for estimating how much time passed between two events.

Fixes #941.

Example log:

INFO  2016-03-01 18:30:37,688 [shard 0] gossip - Waiting for gossip to settle before accepting client requests...
INFO  2016-03-01 18:30:45,689 [shard 0] gossip - No gossip backlog; proceeding
INFO  2016-03-01 18:30:45,689 [shard 0] storage_service - Starting listening for CQL clients on localhost:9042...

Message-Id: <1456853532-28800-1-git-send-email-tgrabiec@scylladb.com>
2016-03-02 13:49:39 +02:00
Avi Kivity
431e1fd379 Merge "Drop db::serializer<>s" from Paweł
"This series removes old-style db::serializer<>s which were replaced by
the IDL-based serialization."
2016-03-02 13:16:36 +02:00
Asias He
a41bcad585 storage_service: Fix run with api lock
Start with coarse control:

1) converting the run_with_write_api_lock operations:

join_ring, start_gossiping, stop_gossiping, start_rpc_server,
stop_rpc_server, start_native_transport, stop_native_transport,
decommission, remove_node, drain, move, rebuild

to use run_with_api_lock, which uses a flag to indicate the current
operation in progress.

If one of the above operations is in progress when the admin issues
another operation, we return a "try again" exception to avoid running
two operations in parallel.

2) converting the run_with_read_api_lock to use no lock.

Fixes #850.

Message-Id: <00782b601028ed87437e5decae382f72dff634f6.1456758391.git.asias@scylladb.com>
2016-03-02 11:32:02 +02:00
Paweł Dziepak
d50594351b db: remove old-style serializers
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-02 09:09:30 +00:00
Paweł Dziepak
bdc23ae5b5 remove db/serializer.hh includes
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-02 09:07:09 +00:00
Paweł Dziepak
53858ed9cd keys: remove old-style serializers
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-02 09:05:25 +00:00
Paweł Dziepak
e1a4b992c5 mutation_partition_serializer: remove read() and read_as_view()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-02 09:04:02 +00:00
Tomasz Grabiec
4a4d288bba query_pagers: Fix dereference of potentially disengaged _last_ckey optional
Message-Id: <1456855674-1984-3-git-send-email-tgrabiec@scylladb.com>
2016-03-02 10:49:15 +02:00
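The class of bug fixed above: dereferencing a disengaged optional is undefined behavior, so the access has to be guarded. A generic sketch of the guarded form, using std::optional (the tree at the time used std::experimental::optional; the function name is hypothetical):

```cpp
#include <optional>

// Guard the dereference: *o on a disengaged optional is undefined behavior,
// so check engagement first and fall back otherwise.
inline int value_or_fallback(const std::optional<int>& o, int fallback) {
    return o ? *o : fallback;
}
```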
Tomasz Grabiec
307c7676da to_string: Make std::experimental::optional printable
Message-Id: <1456855674-1984-2-git-send-email-tgrabiec@scylladb.com>
2016-03-02 10:49:14 +02:00
Takuya ASADA
6ae41a71c9 dist: fix initctl start scylla-server failed on Ubuntu
scylla_io_setup is executed via sudo, so we need to add it to sudoers

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1456906634-14504-1-git-send-email-syuu@scylladb.com>
2016-03-02 10:36:47 +02:00
Tomasz Grabiec
4279ab40c5 cql_serialization_format: Print version as integer instead of char
Currently prints ^C instead of 3.

Message-Id: <1456856287-3681-1-git-send-email-tgrabiec@scylladb.com>
2016-03-01 20:47:48 +02:00
Tomasz Grabiec
f4a86729f9 query: Move implementation of result_merger to .cc file
Message-Id: <1456855396-1563-1-git-send-email-tgrabiec@scylladb.com>
2016-03-01 20:06:42 +02:00
Tomasz Grabiec
d0ae2e3607 idl-compiler: Make sub-streams stored in views be properly bounded
Currently, when reading a view to an object, the stored stream has the
same bounds as the containing stream, not the bounds of the object
itself. The serializer of the view assumes that the stream has the
bounds of the object itself.

Fixes dtest failure in
paging_test.py:TestPagingSize.test_undefined_page_size_default

Fixes #963.

Message-Id: <1456854556-32088-1-git-send-email-tgrabiec@scylladb.com>
2016-03-01 19:50:42 +02:00
Calle Wilund
e667dcc3d0 commitlog: Make segment->segment_manager relation shared pointer
The segment->segment_manager pointer has, until now, been a raw pointer,
which in a way is sensible, since making circular shared pointer
relations is in general bad. However, since the code and life cycle
of segments has evolved quite a bit since that initial relation
was defined, becoming both more and then suddenly, in a sense,
less, asynchronous over time, the usage of the relation is in fact
more consistent with a shared pointer, in that a segment needs to
access its manager to properly do things like write and flush.

These two ops in particular depend on accessing the segment manager
in a way that might be fine even using raw pointers, if it was not
again for that little annoying thing of continuation reordering.

So, let's just make the relation a shared pointer, solving the issue
of whether the manager is alive when a segment accesses it. If it
has been "released" (shut down), the existing mechanisms (gate)
will then trigger and prevent any actual _actions_ from taking
place. And we don't have to complicate anything else even more.

Only "big" change is that we need to explicitly orphan all
segments in commitlog destructor (segment_manager is essentially
a p-impl).

This fixes some spurious crashes in nightly unit tests.

Fixes #966.

Message-Id: <1456838735-17108-1-git-send-email-calle@scylladb.com>
2016-03-01 16:48:28 +02:00
Pekka Enberg
3a6d43c784 cql3: Fix duplicate column definition check
We cannot use shared_ptr *instances* for checking duplicate column
definitions because they are never equal. Store column definition name
in the unordered_map instead.

Fixes cql_additional_tests.py:TestCQL.identifier_test.

Spotted by Shlomi.

Message-Id: <1456840506-13941-1-git-send-email-penberg@scylladb.com>
2016-03-01 16:46:33 +02:00
Asias He
50bf65db8d streaming: Fix keep alive timer progress checking
The first time the keep alive timer fires, _last_stream_bytes
will be zero since it is the first time we update it. The keep
alive timer will be rearmed and fired again. The second time, we find
there is no progress and close the session. The total idle time will
thus be 2 * the keep alive interval.

To make the idle time before closing the session more precise, we reduce
the interval used to check progress, and close the session based on the
last time progress was actually made.

Message-Id: <c959cffce0cc738a3d73caaf71d2adb709d46863.1456831616.git.asias@scylladb.com>
2016-03-01 16:46:08 +02:00
Paweł Dziepak
92f9c9428e cql3: don't insert row marker if schema is_cql3_table()
Checking schema::is_dense() is not enough to know whether a row marker
should be inserted or not, as there may be compact storage tables that
are not considered dense (namely, a table with no clustering key).

A row marker should only be inserted if schema::is_cql3_table() is true.

Fixes #931.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1456834937-1630-1-git-send-email-pdziepak@scylladb.com>
2016-03-01 13:29:53 +01:00
Paweł Dziepak
6a6c12f8c4 tests/commitlog: use unaligned_cast instead of reinterpret_cast
corrupt_segment() is meant to write some garbage at an arbitrary position
in the commitlog segment. That position is not necessarily properly
aligned for uint32_t.

Silences ubsan complaints about unaligned write.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1456827726-21288-1-git-send-email-pdziepak@scylladb.com>
2016-03-01 12:57:06 +02:00
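The underlying rule here: forming a `uint32_t*` to a misaligned address and writing through it is undefined behavior (what ubsan flagged), while copying the value's bytes with memcpy is the well-defined equivalent. A sketch of the pattern (scylla's `unaligned_cast` wraps this differently; these helper names are hypothetical):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Well-defined unaligned write: memcpy the value's bytes instead of
// dereferencing a possibly misaligned uint32_t*.
inline void write_u32_unaligned(uint8_t* buf, std::size_t off, uint32_t v) {
    std::memcpy(buf + off, &v, sizeof(v));
}

// The matching well-defined unaligned read.
inline uint32_t read_u32_unaligned(const uint8_t* buf, std::size_t off) {
    uint32_t v;
    std::memcpy(&v, buf + off, sizeof(v));
    return v;
}
```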
Takuya ASADA
fd7eb7d1e5 dist: add support scylla_io_setup for Ubuntu
Unlike CentOS/Fedora, scylla_io_setup is called from the pre-start section of the scylla-server upstart job, not from a separate job.
This is because Upstart does not provide the same behavior as the After / Requires directives of systemd.

Fixes #954.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1456825805-4195-1-git-send-email-syuu@scylladb.com>
2016-03-01 12:56:44 +02:00
Avi Kivity
e295b9b4e4 Merge 2016-03-01 09:52:04 +02:00
Amnon Heiman
1c7bc28d35 idl-compiler: change optional vector implementation
This patch changes the way optional vectors are implemented.

Now a vector of optionals is handled like any other non-primitive
type, with a single add() method that returns a writer to the
optional.

The writer to the optional has skip and write methods, like a
simple optional field.

For basic types the write method takes the value as a parameter; for
composite types, it returns a writer to the type.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1456796143-3366-2-git-send-email-amnon@scylladb.com>
2016-03-01 09:41:30 +02:00
Raphael S. Carvalho
34ed930aa4 sstables: fix lack of accuracy in disk usage report
To report disk usage, scylla was only taking into account the size of
the sstable data component. Other components such as index and filter
may be relatively big too. Therefore, 'nodetool status' would
report an inaccurate disk usage. That can be fixed by taking into
account the size of all sstable components.

Fixes #943.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <08453585223570006ac4d25fe5fb909ad6c140a5.1456762244.git.raphaelsc@scylladb.com>
2016-03-01 08:58:42 +02:00
Paweł Dziepak
e194835d8a tests/idl: add test for stdx::optional<> serialization
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1456761055-23916-1-git-send-email-pdziepak@scylladb.com>
2016-02-29 18:12:59 +02:00
Paweł Dziepak
dec63eac6e commitlog: add commitlog entry move constructor
Default move constructor and assignment didn't handle reference to
mutation (_mutation) properly.

Fixes #935.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1456760905-23478-1-git-send-email-pdziepak@scylladb.com>
2016-02-29 18:10:15 +02:00
Calle Wilund
0de8f6d24f cql_test_env: Shutdown auth on test stop
Ensures no spurious timer tasks try to touch stopped distributed
objects.

Message-Id: <1456753987-6914-4-git-send-email-calle@scylladb.com>
2016-02-29 16:06:33 +02:00
Calle Wilund
fafcb8cc1e storage_service: Explicitly shutdown "auth" on system drain
I.e. cancels auth object setup tasks if not already run.

Message-Id: <1456753987-6914-3-git-send-email-calle@scylladb.com>
2016-02-29 16:06:30 +02:00
Calle Wilund
2ba738b555 auth: make scheduled tasks explicitly cancellable
Adds a shutdown method which explicitly cancels all waiting tasks
(both of them!).

Message-Id: <1456753987-6914-2-git-send-email-calle@scylladb.com>
2016-02-29 16:06:25 +02:00
Calle Wilund
dc136a6a1c commitlog: Fix reserve counter overflow
Fixes #482

See code comment. The reserve segment allocation count sum can temporarily
overflow due to continuation delay/reordering, if we manage to reach the
on_timer code before the finally clauses from a previous reserve allocation
invocation have run. However, since these are benign overflows
(just indicating even more that we don't need to do anything right now),
simply capping the count should be fine.
Avoids an assert in boost irange.

Message-Id: <1456740679-4537-1-git-send-email-calle@scylladb.com>
2016-02-29 14:56:24 +02:00
Avi Kivity
5cc1b39cc9 Merge "Store gossip generation in system table" from Asias
"Kill one FIXME."
2016-02-29 14:53:06 +02:00
Avi Kivity
0bababedc3 Merge "Fix scylla-io-setup.service" from Takuya
"This patchset fixes #950, runs scylla-io-setup before scylla-server in any case, and installs an example /etc/scylla.d/io.conf by default to prevent an error on 'EnvironmentFile=/etc/scylla.d/*.conf'."
2016-02-29 14:14:42 +02:00
Takuya ASADA
6e55ed96d6 dist: add user-defined prefix for AMI name
With this change, you can define your own prefix of AMI name in variable.json.

example:
{
	"access_key": "xxx",
	"secret_key": "xxx",
	"subnet_id": "xxx",
	"security_group_id": "xxx",
	"region": "us-east-1",
	"associate_public_ip_address": "true",
	"instance_type": "c4.xlarge",
	"ami_prefix": "takuya-"
}

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1456329247-5109-1-git-send-email-syuu@scylladb.com>
2016-02-29 13:49:11 +02:00
Takuya ASADA
48d72c01d1 dist: don't run iotune in developer mode 2016-02-29 20:13:46 +09:00
Takuya ASADA
e8a107de43 dist: install example io.conf by default
Prevents an error on 'EnvironmentFile=/etc/scylla.d/*.conf'.
Parameters are commented out, and the file will be replaced when scylla starts by scylla-io-setup.service.
2016-02-29 20:12:54 +09:00
Avi Kivity
69fdbf6a6e Merge "Use IDL for query results" from Tomek
"The series includes Amnon's unmerged support for optional<> in idl-compiler.

Depends on seastar patch "[PATCH seastar] simple_input_stream: Introduce begin()".

The query result footprint for cassandra-stress mutation as reported
by tests/memory-footprint increased by 18% from 285 B to 337 B.

perf_simple_query shows slight regression in throughput (-8%):

  build/release/tests/perf/perf_simple_query -c4 -m1G --partitions 100000

Before: ~433k tps
After:  ~400k tps"
2016-02-29 12:52:44 +02:00
Takuya ASADA
a281b10210 dist: run scylla-io-setup.service before scylla-server.service in any case
Previously, "systemctl enable scylla-io-setup.service; systemctl start scylla-server.service" was needed to launch scylla-io-setup.service before scylla-server.service.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-02-29 18:59:10 +09:00
Takuya ASADA
11a616d4d6 dist: add scylla-io-setup.service to %systemd_post and %systemd_preun in the .rpm
This is needed to initialize the service correctly

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-02-29 18:59:10 +09:00
Pekka Enberg
3919878a32 service/storage_service: Use logger for CQL listening report
Message-Id: <1456739417-11909-1-git-send-email-penberg@scylladb.com>
2016-02-29 11:52:06 +02:00
Avi Kivity
a1ff21f6ea main: sanity check cpu support
We require SSE 4.2 (for commitlog CRC32), verify it exists early and bail
out if it does not.

We need to check early, because the compiler may use newer instructions
in the generated code; the earlier we check, the lower the probability
we hit an undefined opcode exception.

Message-Id: <1456665401-18252-1-git-send-email-avi@scylladb.com>
2016-02-29 11:41:54 +02:00
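One way to perform such an early feature check with gcc/clang is the `__builtin_cpu_supports` builtin; whether scylla's actual check uses this builtin or raw cpuid is not stated above, so treat this sketch as an assumption:

```cpp
// Early sanity check for required CPU features -- here SSE 4.2, which the
// commitlog CRC32 path needs. On non-x86 targets the check trivially passes.
inline bool cpu_supports_required_features() {
#if defined(__x86_64__) || defined(__i386__)
    return __builtin_cpu_supports("sse4.2");
#else
    return true;
#endif
}
```

Calling this as the first thing in main() minimizes the window in which compiler-generated newer instructions could trap with an undefined-opcode exception.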
Asias He
cc1e1a567c storage_service: Make replace-node error msg more friendly
Before:

ERROR [shard 0] storage_service - Format of host-id =
marshal_exception (marshalling error) is incorrect ???
Exiting on unhandled exception of type 'marshal_exception': marshalling error

After:

ERROR [shard 0] storage_service - Unable to parse 127.0.0.3 as host-id
Exiting on unhandled exception of type 'std::runtime_error': Unable to
parse 127.0.0.3 as host-id

Message-Id: <1456737987-32353-1-git-send-email-asias@scylladb.com>
2016-02-29 11:40:13 +02:00
Asias He
e36a99ef23 storage_service: Do not take api lock for get_load_map
It is used by

nodetool status

If an API operation inside storage_service takes a long time to finish
while holding the lock, it will block nodetool status for a long time.

I think it is safe to get the load map even if other operations are in-flight.

Refs: #850

Message-Id: <1456737987-32353-2-git-send-email-asias@scylladb.com>
2016-02-29 11:39:09 +02:00
Takuya ASADA
84447fd7b0 dist: fix permission error in scylla_io_setup
Can't run as the scylla user since we are writing to /etc

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1456330209-5828-1-git-send-email-syuu@scylladb.com>
2016-02-29 11:35:38 +02:00
Asias He
1061bf0854 storage_service: Use increment_and_get_generation to get generation
The gossip generation is now stored in system.local table.

Before:
cqlsh> SELECT gossip_generation from system.local;

 gossip_generation
-------------------
              null

(1 rows)

After:
cqlsh> SELECT gossip_generation from system.local;

 gossip_generation
-------------------
        1456733559

(1 rows)
2016-02-29 16:31:42 +08:00
Asias He
abafec99a5 system_keyspace: Implement increment_and_get_generation 2016-02-29 16:31:42 +08:00
Gleb Natapov
22d2b9a2dc Yield execution in mutation_result_merger
mutation_result_merger::get can run for a long time. Make it yield
execution from time to time.

Message-Id: <1456674046-14502-1-git-send-email-gleb@scylladb.com>
2016-02-28 17:55:33 +02:00
Avi Kivity
182e6eb89b Merge seastar upstream
* seastar fbb4b01...9964cbf (4):
  > Allow map_reduce reducer to return future
  > Workaround for gcc 4.9 optional bug
  > add convert() to future<> futurizer specification
  > tests: fix rpc_test build
2016-02-28 17:55:03 +02:00
Gleb Natapov
32e9f1ecd4 Fix read_timeouts storage_proxy counter
Read timeouts are not counted now. The patch fixes it.

Message-Id: <20160228133315.GN6705@scylladb.com>
2016-02-28 15:34:42 +02:00
Avi Kivity
31b42a2574 Merge seastar upstream
* seastar 769cb8b...fbb4b01 (5):
  > simple_input_stream: Introduce begin()
  > tests: add rpc unit testing
  > tests: add loopback sockets
  > packet: introduce release() and release_into()
  > temporary_buffer: add make_copy() named constructor
2016-02-28 15:32:38 +02:00
Yoav Kleinberger
651aa06c32 tools/scyllatop: fix mistake in help message
Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <ae5f11d7df954dc7561db94cda2d73bd8233f1e5.1456510513.git.yoav@scylladb.com>
2016-02-28 12:42:12 +02:00
Avi Kivity
ed365c2779 Merge "Fix row_cache::update()" from Tomasz
"Fixes recent regression in row_cache_test.cc:test_update_failure"
2016-02-28 11:17:32 +02:00
Pekka Enberg
81bc5dab77 Merge "streaming progress info fix" from Asias
"This series:

1) Log total bytes sent/received when a stream plan completes.
It is useful in test code.

2) Fix http://scylla_ip:10000/stream_manager API"
2016-02-27 16:13:04 +02:00
Tomasz Grabiec
3997421b2c row_cache: Let the cleanup guard do invalidation of unmerged partitions 2016-02-26 16:57:31 +01:00
Tomasz Grabiec
aa15268249 row_cache: Delete the entry even if invalidation failed
Otherwise we will leak it, and region destructor will fail:

row_cache_test: utils/logalloc.cc:1211: virtual logalloc::region_impl::~region_impl(): Assertion `seg->is_empty()' failed.

Fixes regression in row_cache_test.
2016-02-26 16:57:31 +01:00
Tomasz Grabiec
be24816c8a row_cache: Clear partitions with region locked
Since invalidate() may allocate, we need to take the region lock to
keep m.partitions references valid around the whole clear_and_dispose()
call, which relies on that.
2016-02-26 16:57:31 +01:00
Tomasz Grabiec
6cec131432 query: Switch to IDL-generated views and writers
The query result footprint for cassandra-stress mutation as reported
by tests/memory-footprint increased by 18% from 285 B to 337 B.

perf_simple_query shows slight regression in throughput (-8%):

  build/release/tests/perf/perf_simple_query -c4 -m1G --partitions 100000

Before: ~433k tps
After:  ~400k tps
2016-02-26 12:26:13 +01:00
Tomasz Grabiec
ee8509cf36 idl-compiler: Introduce add(*_view) on vector 2016-02-26 12:26:13 +01:00
Tomasz Grabiec
1ecf9a7427 query: result_view: Introduce do_with()
Encapsulates linearization. Abstracts away the fact that result_view
can't work with discontiguous storage yet.
2016-02-26 12:26:13 +01:00
Tomasz Grabiec
135c1fa306 tests: memory_footprint: Report size in query results 2016-02-26 12:26:13 +01:00
Tomasz Grabiec
6c89e3d2ea serializer: Fix wrong size_type being serialized into the placeholder 2016-02-26 12:26:13 +01:00
Tomasz Grabiec
4ab0ca07f1 idl-compiler: Catch un-closed frame errors sooner
By initializing them to 0 we can catch unclosed frames at
deserialization time. It's better than leaving the frame size undefined,
which may cause errors much later in the deserialization process and thus
would make it harder to identify the real cause.
2016-02-26 12:26:13 +01:00
Tomasz Grabiec
697d9bfa56 serializer: Introduce as_input_stream(bytes_view) 2016-02-26 12:26:13 +01:00
Tomasz Grabiec
85fb4eba32 Add missing includes 2016-02-26 12:26:13 +01:00
Tomasz Grabiec
4284715ddf Relax includes 2016-02-26 12:26:13 +01:00
Amnon Heiman
9ea3ffe527 idl-compiler: Add optional support
This patch adds optional writer support: an optional field can be either
skipped or set.

For a vector of optionals, a write_empty method will
add 1 to the vector count and mark the optional as false.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-02-26 12:25:08 +01:00
Asias He
fd5f3cff47 streaming: Fix stream_manager progress api
For each stream_session, we pretend we are sending/receiving one file,
to make it compatible with nodetool. For receiving_files, the file name
is "rxnofile". For sending_files, the file name is "txnofile".

stream_manager::update_all_progress_info is introduced to update the
progress info of all the stream_sessions in the node. We need this
because streaming mutations are received on all the cores, but the
stream_session object is only on one of the cores. It would add overhead
to update the progress info in the stream_session object whenever we
receive a streaming mutation. So instead, we update the progress info
in the stream_session object only when we really need it.

With http://127.0.0.$i:10000/stream_manager/, it looks like below when
decommission node 3 in a 3 nodes cluster.

=========== GET NODE 1
[{"plan_id": "935a2cc0-dc6b-11e5-bdbf-000000000000", "description":
"Unbootstrap", "sessions": [{"receiving_files": [{"value": {"direction":
"IN", "file_name": "rxnofile", "session_index": 0, "total_bytes":
16876296, "peer": "127.0.0.3", "current_bytes": 16876296}, "key":
"rxnofile"}], "receiving_summaries": [{"files": 1, "total_size": 0,
"cf_id": "869d8630-dc6b-11e5-bdbf-000000000000"}], "session_index": 0,
"state": "PREPARING", "connecting": "127.0.0.3", "peer": "127.0.0.3"}]}]

=========== GET NODE 2

[{"plan_id": "935a2cc0-dc6b-11e5-bdbf-000000000000", "description":
"Unbootstrap", "sessions": [{"receiving_files": [{"value": {"direction":
"IN", "file_name": "rxnofile", "session_index": 0, "total_bytes":
16755552, "peer": "127.0.0.3", "current_bytes": 16755552}, "key":
"rxnofile"}], "receiving_summaries": [{"files": 1, "total_size": 0,
"cf_id": "869d8630-dc6b-11e5-bdbf-000000000000"}], "session_index": 0,
"state": "PREPARING", "connecting": "127.0.0.3", "peer": "127.0.0.3"}]}]

=========== GET NODE 3
[{"plan_id": "935a2cc0-dc6b-11e5-bdbf-000000000000", "description":
"Unbootstrap", "sessions": [{"sending_files": [{"value": {"direction":
"OUT", "file_name": "txnofile", "session_index": 0, "total_bytes":
16876296, "peer": "127.0.0.1", "current_bytes": 16876296}, "key":
"txnofile"}], "sending_summaries": [{"files": 1, "total_size": 0,
"cf_id": "869d8630-dc6b-11e5-bdbf-000000000000"}], "session_index": 0,
"state": "PREPARING", "connecting": "127.0.0.1", "peer":
"127.0.0.1"},{"sending_files": [{"value": {"direction": "OUT",
"file_name": "txnofile", "session_index": 0, "total_bytes": 16755552,
"peer": "127.0.0.2", "current_bytes": 16755552}, "key": "txnofile"}],
"sending_summaries": [{"files": 1, "total_size": 0, "cf_id":
"869d8630-dc6b-11e5-bdbf-000000000000"}], "session_index": 0, "state":
"PREPARING", "connecting": "127.0.0.2", "peer": "127.0.0.2"}]}]
2016-02-26 17:38:37 +08:00
Asias He
37f52d632f streaming: Remove unused progress() function 2016-02-26 17:38:37 +08:00
Asias He
8060b97d67 streaming: Log number of bytes sent and received when stream_plan completes
It is useful for test code to verify the number of bytes sent/received.

It looks like below in the log.

/tmp/out1:INFO  [shard 0] stream_session - \
[Stream #1f3e23f0-db9e-11e5-9cfb-000000000000] bytes_sent = 0, bytes_received = 15760704

/tmp/out2:INFO  [shard 0] stream_session - \
[Stream #1f3e23f0-db9e-11e5-9cfb-000000000000] bytes_sent = 0, bytes_received = 18203964

/tmp/out3:INFO  [shard 0] stream_session - \
[Stream #1f3e23f0-db9e-11e5-9cfb-000000000000] bytes_sent = 33964668, bytes_received = 0
2016-02-26 17:38:37 +08:00
Asias He
9dede89e07 streaming: Add get_progress_on_all_shards for plan_id
Get stream_bytes for a specific plan_id.
2016-02-26 17:38:37 +08:00
Tomasz Grabiec
97558b2cfe idl-compiler: Put serializers inside template class specializations
The problem is that generic functions (e.g. skip()) which call
deserialize() overloads based on their template parameter only see
deserialize() overloads which were declared at the time skip() was
declared, and not those which are available at the time of
instantiation. This forces all serializers to be declared before
serialization_visitors.hh is first included. Serializers included
later will fail to compile. This becomes problematic to ensure when
serializers are included from headers.

Template class specialization lookup doesn't suffer from this
limitation. We can use that to solve the problem. The IDL compiler
will now generate template class specializations with read/write
static methods. In addition to that, default serializer() and
deserialize() implementations are delegating to serializer<>
specialization so that API and existing code doesn't have to change.

Message-Id: <1456423066-6979-1-git-send-email-tgrabiec@scylladb.com>
2016-02-25 20:00:49 +02:00
Takuya ASADA
aa3f6ad462 dist: add scyllatop on .rpm/.deb
Fixes #933

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1456420768-15921-1-git-send-email-syuu@scylladb.com>
2016-02-25 19:24:11 +02:00
Avi Kivity
a74f68eeb2 Merge "Properly tag readers" from Glauber
"Gleb has recently noted that our query reads are not even being registered
with the I/O queue.

Investigating what is happening, I found out that the priority that
make_reader receives was not being properly passed downwards to the SSTable
reader. The reader code is also used by the compaction class, and that one is fine.
But the CQL reads are not.

On top of that, there are also some other places where the tag was not properly
propagated, and those are patched."
2016-02-25 18:35:58 +02:00
Raphael S. Carvalho
fc4cbcde72 Revert "Revert "database: Fix use and assumptions about pending compactions""
This reverts commit a4d92750eb.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <8a405e7c1daf94c4d70d8084f59ce7205d56fe52.1456415398.git.raphaelsc@scylladb.com>
2016-02-25 18:02:01 +02:00
Raphael S. Carvalho
7f0371129c tests: sstable_test: submit compaction request through column family
That's needed for reverted commit 9586793c to work. It's also the
correct thing to do, i.e. column family submits itself to manager.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <2a1d141ad929c1957933f57412083dd52af0390b.1456415398.git.raphaelsc@scylladb.com>
2016-02-25 18:02:00 +02:00
Avi Kivity
c269527f42 Merge "Get rid of assert in gossip and storage_service" from Asias
"Make the error handling more robust."
2016-02-25 17:38:21 +02:00
Pekka Enberg
a4d92750eb Revert "database: Fix use and assumptions about pending compactions"
This reverts commit 9586793c70. It breaks
sstable_test as follows:

  [penberg@nero scylla]$ build/release/tests/sstable_test --smp 1
  Running 81 test cases...
  INFO  [shard 0] compaction_manager - Asked to stop
  INFO  [shard 0] compaction_manager - Stopped
  sstable_test: database.cc:878: future<> column_family::run_compaction(sstables::compaction_descriptor): Assertion `_stats.pending_compactions > 0' failed.
  unknown location(0): fatal error in "compaction_manager_test": signal: SIGABRT (application abort requested)
  tests/sstable_datafile_test.cc(1023): last checkpoint
2016-02-25 15:28:06 +02:00
Asias He
32eaaecf36 gossip: Get rid of assert
Log the error and throw the exception, instead of aborting the whole
process. This makes the code more robust.
2016-02-25 21:19:52 +08:00
Asias He
699fd25467 storage_service: Get rid of assert
We can recover from most of the errors. Log the error and throw the
exception, instead of aborting the whole process. This makes the code
more robust.
2016-02-25 21:19:52 +08:00
Asias He
59564591d5 storage_service: Use get_gossip_status to get status
The helper was introduced recently; use it instead of open-coding the logic.
2016-02-25 21:19:52 +08:00
Pekka Enberg
8e2c924de3 cql3: Fix quadratic behavior in update_statement::parsed_insert::prepare_internal()
This fixes a quadratic search for duplicate columns in prepare_internal().

Refs #822.

Message-Id: <1456405104-16482-1-git-send-email-penberg@scylladb.com>
2016-02-25 15:06:56 +02:00
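A generic sketch of the shape of such a fix (hypothetical names, not the actual prepare_internal() code): replace the per-column rescan of all previously seen columns with a hash-set membership test, turning the duplicate check from O(n²) into O(n):

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Linear-time duplicate detection: each name is checked against a hash set
// of names seen so far, instead of rescanning the whole prefix for every
// new name (the quadratic version).
inline bool find_duplicate(const std::vector<std::string>& names,
                           std::string& dup) {
    std::unordered_set<std::string> seen;
    for (const auto& n : names) {
        if (!seen.insert(n).second) {  // insert fails => already present
            dup = n;
            return true;
        }
    }
    return false;
}
```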
Yoav Kleinberger
872079d999 tools/scyllatop: correct mistake in help text
Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <01844d90f2d942a051d128b03ae12578ac0bb69c.1456324697.git.yoav@scylladb.com>
2016-02-25 12:49:48 +02:00
Asias He
94cb7f22d4 gossip: Make add_local_application_state safe to call on any cpu
add_local_application_state is used in various places. Before this
patch, it could only be called on cpu zero. To make it safer to use, use
invoke_on() to forward the code to run on cpu zero, so that the caller can
call it on any cpu.

Refs: #795
Message-Id: <d69b81c5561622078dbe887d87209c4ea2e3bf46.1456315043.git.asias@scylladb.com>
2016-02-25 12:45:54 +02:00
Asias He
4e931c2453 gossip: Log the error when fails to add local application state
Gleb saw once:

scylla: gms/gossiper.cc:1393:
gms::gossiper::add_local_application_state(gms::application_state,
gms::versioned_value):: mutable: Assertion
`endpoint_state_map.count(ep_addr)' failed.

The assert means we could not find the entry for the node itself in
endpoint_state_map. I cannot really find any place where we could call
add_local_application_state before we call gossiper::start_gossiping(),
which inserts the broadcast address into endpoint_state_map.

I cannot reproduce the issue, so let's log the error so we can narrow down
which application state triggered the assert.

Refs: #795
Message-Id: <f4433be0a0d4f23470a5e24e528afdb67b74c7ef.1456315043.git.asias@scylladb.com>
2016-02-25 12:45:17 +02:00
Takuya ASADA
b250a3b116 dist: Add collectd configuration support on .rpm/.deb
Depends on collectd; adds /etc/collectd.d/scylla.conf on scylla-server package installation.
Fixes #946

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1456336200-11876-1-git-send-email-syuu@scylladb.com>
2016-02-25 10:35:47 +02:00
Takuya ASADA
28dd202613 scyllatop: add --logfile argument to specify path to log file
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1456333116-7389-2-git-send-email-syuu@scylladb.com>
2016-02-25 10:33:41 +02:00
Takuya ASADA
af3a8ead21 scyllatop: output error message both on log file and stdout
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1456333116-7389-1-git-send-email-syuu@scylladb.com>
2016-02-25 10:33:40 +02:00
Calle Wilund
9586793c70 database: Fix use and assumptions about pending compactions
Fixes #934 - faulty assert in discard_sstables

run_with_compaction_disabled clears a CF out of the compaction
manager queue. discard_sstables wants to assert on this, but looks
at the wrong counters.

pending_compactions is an indicator of how much interested parties
want a CF compacted (again and again). It should not be considered
an indicator of compactions actually being done.

This modifies the usage slightly so that:
1.) The counter is always incremented, even if compaction is disallowed.
    The counter's value at the end of run_with_compaction_disabled is then
    used as an indicator of whether a compaction should be
    re-triggered. (If compactions finished, it will be zero.)
2.) Document the use and purpose of the pending counter, and add
    method to re-add CF to compaction for r_w_c_d above.
3.) discard_sstables now asserts on the right things.

Message-Id: <1456332824-23349-1-git-send-email-calle@scylladb.com>
2016-02-25 08:57:04 +02:00
Glauber Costa
6f1d0dce00 mutation_query: attach the query priority read when reading mutations
We call a mutation source during the query path without any consideration
for attaching a priority. This is incorrect, and queries called through this
facility will end up in the default class.

Fix this by attaching the query priority class here.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-02-24 18:00:34 -05:00
Glauber Costa
336babfcb8 database: add a priority class to a few SSTable readers
Not all SSTable readers end up getting the right tag for a priority
class. In particular, the range reader, also used for the memtables,
completely ignores any priority class.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-02-24 18:00:34 -05:00
Glauber Costa
2816bc6fed database: use a reference instead of a pointer to store the priority classes
We will always initialize it, so don't use a pointer.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-02-24 18:00:34 -05:00
Glauber Costa
80ab41a715 memtable reader: also include a priority class
There are situations when a memtable is already flushed but the memtable
reader will continue to be in place, relaying reads to the underlying
table.

For that reason, the "memtables don't need a priority class" argument
is clearly broken. We need to pass a priority class to their reader
as well.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-02-24 18:00:34 -05:00
Calle Wilund
590ec1674b truncate: Require timestamp join-function to ensure equal values
Fixes #937

In fixing #884 (truncation not truncating memtables properly),
timestamping in truncate was made shard-local. This, however,
breaks the snapshot logic, since for all shards in a truncate
the sstables should snapshot to the same location.

This patch adds a required function argument to truncate (and,
by extension, drop_column_family) that produces a timestamp in
a "join" fashion (i.e. the same on all shards), and uses the
joinpoint type in the caller to do so.

Message-Id: <1456332856-23395-2-git-send-email-calle@scylladb.com>
2016-02-24 18:59:31 +02:00
Calle Wilund
43ea1f5945 utils::jointpoint: Helper type to generate a singular value for all shards
Lets operations running on all shards "join" and acquire
the same value of something, with that value generated
when all shards reach the join.

Obvious use case: a timestamp taken after one set of per-shard ops,
but before the final ones.

The generation of the value is guaranteed to happen on the shards
that created the join point.

Based on the join-ops in CF::snapshot, but abstracted and made
caller responsibility. Primary use case is to help deal with
the join-problem of truncation.

Message-Id: <1456332856-23395-1-git-send-email-calle@scylladb.com>
2016-02-24 18:59:25 +02:00
Yoav Kleinberger
c3ce9e53cb tools/scyllatop: support glob patterns to specify metrics
Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <42f84cdeeb75c3719230028a13a1dd8499673d4c.1456319441.git.yoav@scylladb.com>
2016-02-24 15:35:45 +02:00
Raphael S. Carvalho
bb48f1b06c sstables: use system clock's epoch for timestamp in compaction history
As pointed out by Tomek, the type of column used is timestamp, therefore
system's clock epoch (db_clock) should be used instead.

Fixes #817.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <f80f9f411d673cf2d653e193ccb8ebaa36bc891b.1456317766.git.raphaelsc@scylladb.com>
2016-02-24 14:49:21 +02:00
Pekka Enberg
dfcc48d82a transport: Add result metadata to PREPARED message
The gocql driver assumes that there's a result metadata section in the
PREPARED message. Technically, Scylla is not at fault here as the CQL
specification explicitly states in Section 4.2.5.4. ("Prepared") that the
section may be empty:

   - <result_metadata> is defined exactly as <metadata> but correspond to the
      metadata for the resultSet that execute this query will yield. Note that
      <result_metadata> may be empty (have the No_metadata flag and 0 columns, See
      section 4.2.5.2) and will be for any query that is not a Select. There is
      in fact never a guarantee that this will non-empty so client should protect
      themselves accordingly. The presence of this information is an

However, Cassandra always populates the section, so let's do that as well.

Fixes #912.

Message-Id: <1456317082-31688-1-git-send-email-penberg@scylladb.com>
2016-02-24 14:43:24 +02:00
Avi Kivity
fedba9d6cd Merge "reduce gossip round latency" from Asias
"This series makes gossip message handling async to reduce gossip round
latency. The commit log of patch 3 explains the issue in detail.

Refs: #900"
2016-02-24 13:44:06 +02:00
Avi Kivity
b42a32efc7 Update scylla-ami submodule
* dist/ami/files/scylla-ami 398b1aa...d4a0e18 (3):
  > Sort service running order (scylla-ami-setup.service -> scylla-io-setup.service -> scylla-server.service)
  > Drop --ami and --disk-count parameters
  > dist: pass the number of disks to set io params
2016-02-24 13:38:05 +02:00
Avi Kivity
cda29c0324 Merge seastar upstream
* seastar 8c560f2...769cb8b (4):
  > temporary_buffer: make operator bool explicit (and const)
  > iotune: use SEASTAR_IO instead of SCYLLA_IO
  > iotune: add --format option, to use EnvironmentFile on systemd
  > sstring: add data() methods
2016-02-24 13:38:05 +02:00
Avi Kivity
efabb1a1d8 commitlog: fix buffer size calculation
We were adding bool(buffer), instead of buffer.size(); exposed by making
temporary_buffer::operator bool explicit.
2016-02-24 13:38:05 +02:00
Asias He
697b16414a gossip: Make gossip message handling async
In each gossip round, i.e., gossiper::run(), we do:

1) send syn message
2)                           peer node: receive syn message, send back ack message
3) process ack message in handle_ack_msg
   apply_state_locally
     mark_alive
       send_gossip_echo
     handle_major_state_change
       on_restart
       mark_alive
         send_gossip_echo
       mark_dead
         on_dead
       on_join
     apply_new_states
       do_on_change_notifications
          on_change
4) send back ack2 message
5)                            peer node: process ack2 message
   			      apply_state_locally

At the moment, syn is a "wait" message; it times out in 3 seconds. In step
3, all the registered gossip callbacks are called, which might take a
significant amount of time to complete.

In order to reduce the gossip round latency, we make syn "no-wait" and
do not run handle_ack_msg inside gossip::run(). As a result, we
will no longer get an ack message as the return value of a syn message,
so a GOSSIP_DIGEST_ACK message verb is introduced.

With this patch, the gossip message exchange is now async. This is useful
when some nodes in the cluster are down: we will no longer delay the gossip
round, which is supposed to run every second, by 3*n seconds (n = 1-3,
since we talk to 1-3 peer nodes in each gossip round) or even
longer (considering the time to run gossip callbacks).

Later, we can talk to the 1-3 peer nodes in parallel to reduce
latency even more.

Refs: #900
2016-02-24 19:33:39 +08:00
Asias He
63df54b368 messaging_service: Add GOSSIP_DIGEST_ACK
We will soon switch to use no-wait message for gossip. GOSSIP_DIGEST_SYN
will no longer return GOSSIP_DIGEST_ACK message. So we need a standalone
verb for GOSSIP_DIGEST_ACK.
2016-02-24 19:31:14 +08:00
Asias He
022c7e50a1 failure_detector: Fix false alarm of "Not marking nodes down due to local pause of"
The problem is that we initialize _last_interpret when the failure_detector
object is constructed. When interpret() runs for the first time, the
_last_interpret value is not the last time we ran interpret() but the
time we initialized the failure_detector object.

Fix by initializing _last_interpret inside interpret().

[Thu Feb 18 02:40:04 2016] INFO  [shard 0] storage_service - Node 127.0.0.1 state jump to normal
[Thu Feb 18 02:40:04 2016] INFO  [shard 0] storage_service - NORMAL: node is now in normal status
[Thu Feb 18 02:40:04 2016] INFO  [shard 0] gossip - Waiting for gossip to settle before accepting client requests...
[Thu Feb 18 02:40:12 2016] INFO  [shard 0] gossip - No gossip backlog; proceeding
Starting listening for CQL clients on 127.0.0.1:9042...
[Thu Feb 18 02:40:12 2016] INFO  [shard 0] gossip - Node 127.0.0.2 is now part of the cluster
[Thu Feb 18 02:40:12 2016] INFO  [shard 0] gossip - InetAddress 127.0.0.2 is now UP
[Thu Feb 18 02:40:13 2016] INFO  [shard 0] gossip - do_gossip_to_live_member: Favor newly added node 127.0.0.2
[Thu Feb 18 02:40:13 2016] WARN  [shard 0] failure_detector - Not marking nodes down due to local pause of 9091 > 5000 (milliseconds)
2016-02-24 19:31:14 +08:00
Avi Kivity
e993102cb5 Merge "introduce scylla-io-setup.service" from Takuya
"Add scylla-io-setup.service to configure max-io-requests and num-io-queues on first boot.
Moved the SCYLLA_IO configuration code from scylla_sysconfig_setup to scylla-io-setup.service, and reverted the commits related to it.
In scylla-io-setup.service, autodetect Amazon EC2 instead of using the AMI variable in sysconfig."
2016-02-24 10:13:23 +02:00
Takuya ASADA
c4035a0a13 dist: add comment about /etc/scylla.d/io.conf on sysconfig 2016-02-24 04:00:52 +09:00
Takuya ASADA
0f20abb365 Revert "dist: introduce SCYLLA_IO"
This reverts commit 5cae2560a3.

Conflicts:
	dist/common/sysconfig/scylla-server
2016-02-24 03:46:14 +09:00
Takuya ASADA
b79a1a77da Revert "dist: update SCYLLA_IO with params for AMI"
This reverts commit 5494135ddd.

Conflicts:
	dist/common/scripts/scylla_sysconfig_setup
2016-02-24 03:45:11 +09:00
Takuya ASADA
643beefc8c Revert "Revert "dist: remove AMI entry from sysconfig, since there is no script refering it""
This reverts commit 21e6720988.
2016-02-24 03:33:50 +09:00
Takuya ASADA
66c5feb9e9 Revert "dist: align ami option with others (-a --> --ami)"
This reverts commit 312f1c9d98.
2016-02-24 03:33:41 +09:00
Takuya ASADA
a9926f1cea dist: introduce scylla-io-setup.service to setup io parameters on first startup
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-02-24 03:33:03 +09:00
Tomasz Grabiec
79bcb5a616 tests: Fix build of memory_footprint 2016-02-23 19:12:54 +01:00
Amnon Heiman
f461ebc411 idl-compiler: Add pos and rollback to serialize vector
This adds the ability to store a position of a serialized vector and to
rollback to that stored position afterwards.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1456041750-1505-3-git-send-email-amnon@scylladb.com>
2016-02-23 17:49:51 +01:00
Amnon Heiman
ea97e07ed7 serialization_visitors: Adding vector_position struct
While serializing a vector it is sometimes necessary to roll back some of
the serialized elements.

vector_position is the equivalent of the bytes_ostream position struct.
It holds information about the current position in a serialized vector:
the position in the buffer and the number of elements serialized
so far.

It allows rolling back to a stored point.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1456041750-1505-2-git-send-email-amnon@scylladb.com>
2016-02-23 17:49:51 +01:00
Tomasz Grabiec
f72fd9eefd Merge branch 'pdziepak/canonical-mutation-idl/v1' from sesastar-dev.git 2016-02-23 17:02:43 +01:00
Tomasz Grabiec
995b638d96 mutation_partition_visitor: Fix crash for large blobs
Fixes #927.

The new visiting code builds cell instances using
atomic_cell::make_*() factory methods, which won't work in LSA context
because they depend on managed_bytes storage to be linearized. It may
not be since large blob support. This worked before because we created
cells from views before which works in all contexts.

Fix by constructing them in standard allocator context.

Message-Id: <1456234064-13608-2-git-send-email-tgrabiec@scylladb.com>
2016-02-23 16:41:39 +02:00
Tomasz Grabiec
33cf65c2aa mutation_partition_view: Fix use-after-move on visitor instance
The line:

  boost::apply_visitor(atomic_cell_or_collection_visitor(std::move(visitor), id, col), cell);

is executed in a loop, so the visitor could be used after being
moved from. This may not always be allowed for some visitors. Also,
visitors may keep state, which should be preserved for the whole
visitation.

This doesn't fix any issue right now.

Message-Id: <1456234064-13608-1-git-send-email-tgrabiec@scylladb.com>
2016-02-23 16:41:39 +02:00
Yoav Kleinberger
f822359d96 bugfix: fixed broken --print-config option
Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <57b452106cdcd9ceb09da4c63781650cefe48040.1456234464.git.yoav@scylladb.com>
2016-02-23 15:35:44 +02:00
Asias He
f7fccc6efb locator: Fix get token from a range<token>
With a range{t1, t2}, if t2 == {}, then range.end() will contain no
value. Fix getting t2 in this case.

Fixes #911.
Message-Id: <4462e499d706d275c03b116c4645e8aaee7821e1.1456128310.git.asias@scylladb.com>
2016-02-23 14:29:26 +01:00
Pekka Enberg
4a4074ad21 tools/scyllatop: Sort metrics by name
This makes the output much easier to read, especially if you have tons
of metrics specified.

Message-Id: <1456230377-3149-1-git-send-email-penberg@scylladb.com>
2016-02-23 14:35:57 +02:00
Takuya ASADA
0f87922aa6 main: notify service start completion earlier, to reduce systemd unit startup time
Fixes #910

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1455830245-11782-1-git-send-email-syuu@scylladb.com>
2016-02-23 14:33:16 +02:00
Pekka Enberg
1f6cac8839 tools/scyllatop: Use 'erase' to clear the screen
The 'clear' function explicitly clears the screen and repaints it which
causes really annoying flicker. Use 'erase' to make scyllatop more
pleasant on the eyes.

Message-Id: <1456229348-2194-1-git-send-email-penberg@scylladb.com>
2016-02-23 14:12:48 +02:00
Tomasz Grabiec
2b5253927f test.py: Print output on timeout as well
It is often the case that there is useful debugging information
printed by the test before it hangs. It is annoying to see just "TIMED
OUT" in jenkins. Always print the output when it is available.

In addition, we should not interpret all exceptions thrown
from communicate() as timeouts. For example, currently ^C sent to the
script misleadingly results in "TIMED OUT" being printed.
Message-Id: <1456174992-21909-1-git-send-email-tgrabiec@scylladb.com>
2016-02-23 13:41:11 +02:00
Pekka Enberg
78c6fdf429 cql3/functions: Fix is_pure() for native scalar functions
Every native scalar function is already tagged as pure or not, but
because we don't implement the is_pure() function, all functions
end up being advertised as pure. This means that functions like now(),
which are *not* pure, end up being evaluated only once.

Fixes #571.
Message-Id: <1456227171-461-1-git-send-email-penberg@scylladb.com>
2016-02-23 12:37:32 +01:00
Yoav Kleinberger
74fbc62129 ScyllaTop: top-like tool to see live scylla metrics
requires a local collectd configured with the unix-sock plugin;
use the --help option for more. Run it with:

        $ scyllatop.py --help

Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <bd3f8c7e120996fc464f41f60130c82e3fb55ac6.1456164703.git.yoav@scylladb.com>
2016-02-23 12:32:47 +02:00
Avi Kivity
8ba474f1c9 Merge "Drop empty partitions from mutation query results" from Tomasz 2016-02-23 11:18:47 +02:00
Tomasz Grabiec
c591157755 tests: mutation_query: Add test for dropping partitions with expired tombstones 2016-02-22 20:23:29 +01:00
Tomasz Grabiec
41d475d9c0 schema_builder: Fluentize property setters 2016-02-22 20:23:29 +01:00
Tomasz Grabiec
6fdaf110d6 mutation_query: Don't include empty partitions
In some cases we may have a lot of empty partitions whose tombstones
have expired, and there is no point in including them in the results.

This was found to cause performance issues for workloads using batch
updates. system.batchlog table would accumulate a lot of deletes over
time. It has gc_grace_seconds set to 0 so most of the tombstones would
be expired. Mutation queries done by the batchlog manager were still
returning all partitions present in memtables, which caused the
mutation query results to be inflated. This in turn was causing
mutation_result_merger to take a long time to process them.
2016-02-22 20:21:23 +01:00
Pekka Enberg
4ff1692248 cql3: Make 'CREATE TYPE' error message human readable
We don't support the 'CREATE TYPE' statement for now. The user-visible
error message, however, is unreadable because our CQL parser doesn't
even recognize the statement.

  cqlsh:ks1> CREATE TYPE config (url text);
  SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query] message=" : cannot match to any predicted input...

Implement just enough of 'CREATE TYPE' parsing to be able to report a
human readable error message if someone tries to execute such
statements:

  cqlsh:ks1> CREATE TYPE config (url text);
  ServerError: <ErrorMessage code=0000 [Server error] message="User-defined types are not supported yet">
Message-Id: <1456148719-9473-2-git-send-email-penberg@scylladb.com>
2016-02-22 14:50:25 +01:00
Pekka Enberg
d1bbd0271a cql3: Return const reference from ut_name::get_keyspace()
There's no need to copy the string but it does make it more difficult to
use get_keyspace() from other places that already return a const
reference.

Signed-off-by: Pekka Enberg <penberg@scylladb.com>
Message-Id: <1456148719-9473-1-git-send-email-penberg@scylladb.com>
2016-02-22 14:50:25 +01:00
Pekka Enberg
a15cbf0968 transport: Remove read_unsigned_short() variant
As explained in commit 0ff0c55 ("transport: server: 'short' should be
unsigned"), "short" type is always unsigned in the CQL binary protocol.
Therefore, drop the read_unsigned_short() variant altogether and just
use read_short() everywhere.

Message-Id: <1456133171-1433-1-git-send-email-penberg@scylladb.com>
2016-02-22 11:39:33 +02:00
Tomasz Grabiec
fb3344eba1 sstables: Do not write corrupted sstables when column names are too large
This may result in errors during reading like the following one:

  runtime error: Unexpected marker. Found k, expected \x01\n)'

The error above happened when executing limits.py:max_key_length_test dtest.

After this change the exception will happen during writing and will be clearer.

Refs #807.

This patch doesn't deal with the problem of ensuring that we will
never hit those errors, which is very desirable. We shouldn't ack a
write if we can't persist it to sstables.

Message-Id: <1456130045-2364-1-git-send-email-tgrabiec@scylladb.com>
2016-02-22 11:03:16 +02:00
Vlad Zolotarov
f2c6f16a50 locator: everywhere_replication_strategy: change the class_registrator name to "EverywhereStrategy"
Change the name used with class_registrator from "EverywhereReplicationStrategy"
(used in the initial patch from CASSANDRA-826 JIRA) to "EverywhereStrategy"
as it is in the current DCE code.

With this change one will be able to create an instance of
everywhere_replication_strategy class by giving either
an "org.apache.cassandra.locator.EverywhereStrategy" (full name) or
an "EverywhereStrategy" (short name) as a replication strategy name.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1456081258-937-1-git-send-email-vladz@cloudius-systems.com>
2016-02-22 09:18:47 +02:00
Vlad Zolotarov
cc30956c56 locator: added EverywhereReplicationStrategy
This strategy ignores the RF configuration and always
tries to replicate to all cluster nodes.

This means that its get_replication_factor() returns the
number of currently "known" nodes in the cluster, and
if a cluster is currently bootstrapping this value may obviously
change over time for the same key. Therefore this strategy
should be used with caution.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1456074333-15014-3-git-send-email-vladz@cloudius-systems.com>
2016-02-21 19:29:29 +02:00
Vlad Zolotarov
ec14fb2a70 locator: token_metadata: add get_all_endpoints_count()
Return the number of currently known endpoints when
it's needed in a fast-path flow.

Calling get_all_endpoints().size() for that
would not be fast enough because of the unordered_set->vector
transformation we don't need.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1456074333-15014-2-git-send-email-vladz@cloudius-systems.com>
2016-02-21 19:29:28 +02:00
Avi Kivity
63841b425d Merge seastar upstream
* seastar c829b69...8c560f2 (2):
  > iotune: add missing static variable definitions
  > prevent futures ignored by parallel_for_each from generating warnings
2016-02-21 18:39:28 +02:00
Shlomi Livne
312f1c9d98 dist: align ami option with others (-a --> --ami)
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <c159aac7f0478aba34d4398a2eb8ea71285ede21.1456052976.git.shlomi@scylladb.com>
2016-02-21 15:06:20 +02:00
Shlomi Livne
21e6720988 Revert "dist: remove AMI entry from sysconfig, since there is no script refering it"
This reverts commit 54f9e59006.

AMI is needed for setting up io params

Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <4154a000f019059f740319cfa2fbf875568770b7.1456052976.git.shlomi@scylladb.com>
2016-02-21 15:06:20 +02:00
Tomasz Grabiec
d376167bd4 cql: create_table_statement: Optimize duplicate column names detection
The current algorithm is O(N^2), where N is the column count. This causes
limits.py:TestLimits.max_columns_and_query_parameters_test to time out
because the CREATE TABLE statement takes too long.

This change replaces it with an algorithm of O(N)
complexity. _defined_names is already sorted, so if any duplicates
exist, they must be next to each other.

Message-Id: <1456058447-5080-1-git-send-email-tgrabiec@scylladb.com>
2016-02-21 14:55:03 +02:00
Tomasz Grabiec
d3b7e143dc db: Fix error handling in populate_keyspace()
When find_uuid() fails, Scylla would terminate with:

  Exiting on unhandled exception of type 'std::out_of_range': _Map_base::at

But we are supposed to ignore directories for unknown column
families. The try {} catch block is doing just that when
no_such_column_family is thrown from the find_column_family() call
which follows find_uuid(). Fix by converting std::out_of_range to
no_such_column_family.

Message-Id: <1456056280-3933-1-git-send-email-tgrabiec@scylladb.com>
2016-02-21 14:19:31 +02:00
Tomasz Grabiec
0c8db777b1 bytes_ostream: Avoid recursion when freeing chunks
When there are a lot of chunks we may get a stack overflow.

This seems to fix issue #906, a memory corruption during schema
merge. I suspect that what causes corruption there is overflowing of
the stack allocated for the seastar thread. Those stacks don't have
red zones which would catch overflow.

Message-Id: <1456056288-3983-1-git-send-email-tgrabiec@scylladb.com>
2016-02-21 14:18:49 +02:00
Raphael S. Carvalho
b1cc0490f5 sstables: make compaction manager shutdown less verbose
before:

^CINFO  [shard 0] compaction_manager - Asked to stop
INFO  [shard 0] compaction_manager - compaction task handler stopped due to shutdown
INFO  [shard 0] compaction_manager - compaction task handler stopped due to shutdown
INFO  [shard 1] compaction_manager - Asked to stop
INFO  [shard 2] compaction_manager - Asked to stop
INFO  [shard 1] compaction_manager - compaction task handler stopped due to shutdown
INFO  [shard 2] compaction_manager - compaction task handler stopped due to shutdown
INFO  [shard 3] compaction_manager - Asked to stop
INFO  [shard 1] compaction_manager - compaction task handler stopped due to shutdown
INFO  [shard 2] compaction_manager - compaction task handler stopped due to shutdown
INFO  [shard 3] compaction_manager - compaction task handler stopped due to shutdown
INFO  [shard 3] compaction_manager - compaction task handler stopped due to shutdown

after:

^CINFO  [shard 0] compaction_manager - Asked to stop
INFO  [shard 0] compaction_manager - Stopped
INFO  [shard 1] compaction_manager - Asked to stop
INFO  [shard 2] compaction_manager - Asked to stop
INFO  [shard 3] compaction_manager - Asked to stop
INFO  [shard 1] compaction_manager - Stopped
INFO  [shard 2] compaction_manager - Stopped
INFO  [shard 3] compaction_manager - Stopped

`compaction_manager - compaction task handler stopped due to shutdown` is still printed
at debug level

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <535d5ad40102571a3d5d36257342827989e8f0f4.1455835407.git.raphaelsc@scylladb.com>
2016-02-21 11:55:17 +02:00
Raphael S. Carvalho
55be1830ff database: make column_family::rebuild_sstable_list safer
If any of the allocations in rebuild_sstable_list fails, the system
may be left with an incorrect set of sstables.
It's probably safer to assign the new set of sstables as the last
step.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <52b188262dcc06730dc9220b54ff6810d7dca1ae.1455835030.git.raphaelsc@scylladb.com>
2016-02-21 11:55:15 +02:00
Raphael S. Carvalho
9cb8a43684 start using notation ks.cf everywhere
Some places were using the notation ks/cf to represent a keyspace
and column family pair. ks.cf is the notation used by C*, so we
should use it everywhere.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <939449af92565b79d1823890784dc4d1dc3cdc84.1455830989.git.raphaelsc@scylladb.com>
2016-02-21 11:15:09 +02:00
Avi Kivity
69ac1a3229 Merge seastar upstream
* seastar cf1716f...c829b69 (1):
  > iotune: limit generate() concurrency to 128

Fixes #922.
2016-02-21 11:12:10 +02:00
Takuya ASADA
5a213341a4 dist: restart scylla on abnormal termination (Ubuntu)
Fixes #907

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1455833659-12652-1-git-send-email-syuu@scylladb.com>
2016-02-21 10:33:28 +02:00
Takuya ASADA
7a6f9c9bb4 dist: use /usr/bin/python3.4 for idl-compiler.py on CentOS
Fixes #923

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1455826201-11333-1-git-send-email-syuu@scylladb.com>
2016-02-20 19:15:21 +02:00
Avi Kivity
bba2034957 Merge "IDL-ize mutations" from Paweł
"This series switches mutation_partition_serializer, mutation_partition_view
and frozen_mutation to the IDL-based serialization format.
canonical_mutations and frozen_schemas are still not converted.

Quick test with 4 node ccm cluster and cassandra-stress doesn't show any
problem, unsurprisingly, as frozen_mutation_test obviously still passes."
2016-02-20 18:49:23 +02:00
Avi Kivity
889bf1aef2 dist: build only scylla and iotune binaries, not all the tests 2016-02-20 18:45:42 +02:00
Takuya ASADA
ca9f7bf72e dist: add iotune on .rpm/.deb packages
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1455929805-28987-1-git-send-email-syuu@scylladb.com>
2016-02-20 18:44:47 +02:00
Paweł Dziepak
8e7a2fa557 schema_mutations: drop old serializers
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 23:12:00 +00:00
Paweł Dziepak
061dd111b5 canonical_mutation: drop old serializers
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 23:12:00 +00:00
Paweł Dziepak
351c69b476 frozen_schema: use IDL-based serialization
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 23:12:00 +00:00
Paweł Dziepak
81f42415d4 schema_mutations: prepare for auto-generated serializers
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 23:12:00 +00:00
Paweł Dziepak
1b52264dfd batchlog_manager: use new canonical_mutation serializers
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 23:12:00 +00:00
Paweł Dziepak
6c8b298ccd canonical_mutation: prepare for auto-generated serializers
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 23:12:00 +00:00
Paweł Dziepak
7dda3977c6 column_mapping: drop old-style serializers
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 23:11:59 +00:00
Paweł Dziepak
89b75a02d4 commitlog: use IDL-based serialization for entries
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 23:11:59 +00:00
Paweł Dziepak
d5c794d5e4 data_output: add reserve()
Allows mixing data_output with other output streams, like
seastar::simple_output_stream, which is useful when switching to the new
IDL-based serializers.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 23:11:59 +00:00
Paweł Dziepak
f548c75200 commitlog: move implementation to *.cc file
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 23:11:59 +00:00
Paweł Dziepak
5a353486c6 canonical_mutation: switch to IDL-based serialization
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 23:11:31 +00:00
Paweł Dziepak
c55fa9e4c2 schema: make column_mapping serializer-friendly
- unnested column_mapping::column
- more accessors

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 23:11:16 +00:00
Paweł Dziepak
4f3ee7abbc frozen_mutation: use IDL-based serialization
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 21:51:17 +00:00
Paweł Dziepak
28fa2a6493 idl-compiler: add serialization callback interface
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 21:50:29 +00:00
Paweł Dziepak
186061adef mutation_partition: switch serialization to IDL-based one
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 21:49:08 +00:00
Paweł Dziepak
ccd29bf7a7 frozen_mutation: switch to bytes_ostream
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 21:47:54 +00:00
Paweł Dziepak
5127321866 column_mapping: add column_at()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 21:47:54 +00:00
Paweł Dziepak
e332f95960 types: make serialize_mutation_form() static
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 21:47:42 +00:00
Paweł Dziepak
7586e47004 idl-compiler: avoid copy of basic types
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 21:47:42 +00:00
Paweł Dziepak
2ea735f5ed idl-compiler: accept both bytes and bytes_view
bytes can always be trivially converted to bytes_view. Conversion in the
other direction requires a copy.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 21:47:42 +00:00
Paweł Dziepak
18b1c66287 idl-compiler: allow auto-generated serializers in writers
This patch allows using either auto-generated serializers or writer-based
serialization for non-stub [[writable]] types.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 21:47:21 +00:00
Paweł Dziepak
af2241686f idl-compiler: add reindent() function
This helps keep C++ code properly indented both in the compiler source
code and in the auto-generated files.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 21:20:09 +00:00
Paweł Dziepak
597ed15dfd tests: add idl unit test
Test auto-generated and writer-based serialization, as well as
deserialization of a simple compound type, vectors, and variants.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 21:19:30 +00:00
Paweł Dziepak
7d1a66d3a0 idl-compiler: move writers and view to *.impl.hh
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 21:17:00 +00:00
Paweł Dziepak
f1f14631f4 add set_size() overload for bytes_ostream()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 21:16:55 +00:00
Paweł Dziepak
340d0cccbc serializer: fix duration deserializer
The deserializer is supposed to update the input stream.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-02-19 21:11:57 +00:00
Avi Kivity
6330743775 Merge seastar upstream
* seastar 8679033...cf1716f (1):
  > iotune: qualify filesystems for aio
2016-02-18 17:16:46 +02:00
Nadav Har'El
f9ee74f56f repair: options for repairing only a subrange
To implement nodetool's "--start-token"/"--end-token" feature, we need
to be able to repair only *part* of the ranges held by this node.
Our REST API already had a "ranges" option where the tool can list the
specific ranges to repair, but using this interface in the JMX
implementation is inconvenient, because it requires the *Java* code
to be able to intersect the given start/end token range with the actual
ranges held by the repaired node.

A more reasonable approach, which this patch uses, is to add new
"startToken"/"endToken" options to the repair's REST API. What these
options do is find the node's token ranges as usual, and only
then *intersect* them with the user-specified token range. The JMX
implementation becomes much simpler (in a separate patch for scylla-jmx)
and the real work is done in the C++ code, where it belongs, not in
Java code.

With the additional scylla-jmx patch to use the new REST API options
provided here, this fixes #917.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1455807739-25581-1-git-send-email-nyh@scylladb.com>
2016-02-18 17:13:56 +02:00
Raphael S. Carvalho
a53cfc8127 compaction manager: add support to wait for termination of cleanup
'nodetool cleanup' must wait for termination of cleanup; however,
cleanup is handled asynchronously. To solve that, a mechanism is
added here to wait for termination of a cleanup. This mechanism
uses a promise to notify the waiter of cleanup completion.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <6dc0a39170f3f51487fb8858eb443573548d8bce.1455655016.git.raphaelsc@scylladb.com>
2016-02-18 17:01:18 +02:00
Paweł Dziepak
763f6e1dc0 storage_service: don't drain twice
If drain was explicitly requested by the user, there is no need to do it
again during shutdown.

Fixes a segmentation fault when shutting down an already drained Scylla.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1455794465-16670-1-git-send-email-pdziepak@scylladb.com>
2016-02-18 14:58:02 +02:00
Avi Kivity
1b49c0ce19 dist: Restart scylla on abnormal termination
Restarting the service can recover from a transient failure.

Fixes #904 (on systemd systems only)
Message-Id: <1455441103-11963-1-git-send-email-avi@scylladb.com>
2016-02-17 21:30:40 +01:00
Avi Kivity
62e96de48a Merge "Adding writers and visitor to IDL" from Amnon
"Writers are used to stream objects programmatically rather than from existing objects.
Views (visitors) are used to retrieve information from serialized objects without
deserializing them entirely, skipping to the position in the buffer with the
relevant information and deserializing only that."
2016-02-17 22:06:46 +02:00
Amnon Heiman
fbc6770837 idl-compiler: Verify member type
This patch adds a static assert to the generated code that verifies that a
declared type in the IDL matches the parameter type.

The type comparison ignores references and const qualifiers.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-02-17 21:44:53 +02:00
Amnon Heiman
38cd55e9cf Adding the mutation idl
This adds the mutation definition IDL and adds it to the compilation.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-02-17 18:42:09 +02:00
Amnon Heiman
714d4927d6 idl-compiler: add writers and view to classes.
This patch adds a writer object to classes in the IDL.

It adds attribute support to classes. A writer will be created for
classes that are marked as writable.

For the writers, the code generator creates two kinds of structs:
states, which hold the write state (mainly the placeholders for all
current objects and vectors), and nodes, which represent the current
position in the writing state machine.

To write an object, create a writer. For example, to create a writer
for a mutation, where out is a bytes_ostream:

writer_of_mutation w(out);

Views are used to read from a buffer without deserializing an entire
object.
This patch adds view creation to the idl-compiler. For each view a
read_size function is created that will be used when skipping through
buffers.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-02-17 18:42:09 +02:00
Amnon Heiman
33d5c95b90 serialization_visitors: Add skip template
The skip template function is used when skipping data types.
By default it uses a deserializer to calculate the size.

A specific implementation saves unneeded deserialization. For fixed-size
objects the skip function becomes a constant expression, allowing the
compiler to drop the function altogether.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-02-17 18:42:09 +02:00
Amnon Heiman
64c097422d Adding the serialization_visitors.hh file
serialization_visitors.hh contains helper classes for the reader and
writer visitor classes.

place_holder is a wrapper around a bytes_ostream place holder.
frame is used to store a size in bytes.
empty_frame is used with final objects (which do not store their size);
from the code that uses it, it looks the same, but in practice it does
not store any data.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-02-17 18:42:08 +02:00
Amnon Heiman
ca72d637f9 bytes_ostream: Allow place holder to return a stream
Reader and writer can use the bytes_ostream as a raw bytes stream,
handling the bytes encoding and streaming on their own.

To fully support this functionality, place holder should support it as
well.

This patch adds a get_stream method that returns a simple_output_stream;
writers can use it with their own serialization functions.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-02-17 18:42:04 +02:00
Tomasz Grabiec
eedc1548e7 transport: server: Fix typo
Spotted-by: Vitaly Davidovich <vitalyd@gmail.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>
2016-02-17 15:55:10 +01:00
Avi Kivity
0363a0efe6 Merge "Fix limits handling in CQL server" from Tomasz
"Fixes the following issues:
 #807 Wrong maximum key length
 #809 Scylla assert on returning result when max column size is overflow"
2016-02-17 15:06:51 +02:00
Tomasz Grabiec
6e7bac14b3 transport: server: Throw instead of abort on bounds check failures
Instead of crashing the server we will respond with a "Server Error"
to the requestor.

Fixes #809.
2016-02-17 13:12:11 +01:00
Tomasz Grabiec
0ff0c5555a transport: server: 'short' should be unsigned
According to CQL binary protocol v3 [1], "short" fields are unsigned:

   [short]        A 2 bytes unsigned integer

[1] https://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=doc/native_protocol_v3.spec

C* code agrees as well.

Fixes #807.
2016-02-17 13:12:11 +01:00
Tomasz Grabiec
9375b7df7b transport: server: Size check should allow max value 2016-02-17 13:12:11 +01:00
Tomasz Grabiec
48e9d67525 compound: Extract size_type alias 2016-02-17 13:12:11 +01:00
Amnon Heiman
719b8e1e4d serializer: Add boost::variant, chrono::time_point and unknown variant
This patch adds stub support for boost::variant. Currently variants are
not serialized; this is added just so non-stub classes will be able to
compile.

It also adds deserialization for chrono::time_point and a deserializer
for chrono::duration.

Unknown variant:
planning for situations where the variant could be expanded, there may
be situations where a variant returns an unknown value.

In those cases the data and index will be passed to the reader, which
can decide what to do with it.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-02-17 11:43:50 +02:00
Avi Kivity
69183be6f0 Update scylla-ami submodule
* dist/ami/files/scylla-ami b3b85be...398b1aa (3):
  > Import AMI initialization code from scylla-server repo
  > Use long options on scylla_raid_setup and scylla_sysconfig_setup
  > Wait more longer to finishing AMI setup
2016-02-17 10:45:50 +02:00
Takuya ASADA
e640b42081 dist: On Ubuntu, log coredump before creating the core
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1455664279-32157-1-git-send-email-syuu@scylladb.com>
2016-02-17 10:43:56 +02:00
Avi Kivity
00f40881d6 Merge "Interactive scylla_setup and refactoring setup scripts" from Takuya 2016-02-17 10:42:26 +02:00
Takuya ASADA
553c2ca523 dist: call scylla_install_ami directly from scylla.json
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-02-17 07:34:08 +09:00
Takuya ASADA
2d429e602a dist: interactive scylla_setup
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-02-17 07:34:08 +09:00
Takuya ASADA
b9ae4ff272 dist: delete build_ami_local.sh, merge it to build_ami.sh
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-02-17 07:34:08 +09:00
Takuya ASADA
6b93505952 dist: long options for build_rpm.sh
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-02-17 07:34:08 +09:00
Takuya ASADA
d754bbe122 dist: add --unstable on build_ami.sh
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-02-17 07:34:08 +09:00
Takuya ASADA
70f397911b dist: long options for build_ami.sh
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-02-17 07:34:08 +09:00
Takuya ASADA
a860e4baae dist: remove AMI initialization code from scylla_setup, move to scylla-ami
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-02-17 07:34:08 +09:00
Takuya ASADA
08038e3f42 dist: long options for scylla_install_pkg
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-02-17 07:34:08 +09:00
Takuya ASADA
d7a03676e3 dist: don't ignore posix_net_conf.sh error
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-02-17 07:34:08 +09:00
Takuya ASADA
684447d3ab dist: long options for scylla_sysconfig_setup
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-02-17 07:34:08 +09:00
Takuya ASADA
7861871c1b dist: long options for scylla_raid_setup
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-02-17 07:34:08 +09:00
Takuya ASADA
889c4706fc dist: long options for scylla_ntp_setup
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-02-17 07:34:08 +09:00
Takuya ASADA
2804e394a6 dist: long options for scylla_coredump_setup
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-02-17 07:34:08 +09:00
Takuya ASADA
c12d95afe0 dist: add options to skip running setups on scylla_setup, support long options
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-02-17 07:34:05 +09:00
Takuya ASADA
dc9012d5a4 dist: move selinux setup code to scylla_selinux_setup
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-02-17 07:33:01 +09:00
Takuya ASADA
54f9e59006 dist: remove AMI entry from sysconfig, since there is no script referring to it
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-02-17 07:33:01 +09:00
Takuya ASADA
9b8f45d5b7 dist: don't use -a option for scylla_bootparam_setup since it was removed
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-02-17 07:33:01 +09:00
Takuya ASADA
5b742ff447 dist: generalize scylla_ntp_setup, drop '-a' option (means AMI) from it
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-02-17 07:28:38 +09:00
Takuya ASADA
51c497527c dist: support unstable repository on scylla_install_pkg 2016-02-17 07:28:36 +09:00
Amnon Heiman
1e4d227b20 managed_bytes: don't return auto from non-member function
gcc 4.9 does not allow a non-static data member to be declared auto.

This patch replaces the auto declaration with std::result_of_t.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1455652166-16860-1-git-send-email-amnon@scylladb.com>
2016-02-16 21:50:55 +02:00
Tomasz Grabiec
7af65e45b2 compound: Throw exception when key is too large rather than abort
Abort is too big of a hammer.

Refs #809.

Message-Id: <1455650129-9202-1-git-send-email-tgrabiec@scylladb.com>
2016-02-16 21:36:25 +02:00
Avi Kivity
bd3a08fd19 Merge seastar upstream
* seastar 1bbb02f...8679033 (1):
  > net: fix compilation problem introduced after e5cbee3
2016-02-16 19:46:32 +02:00
Avi Kivity
e1828b82b5 Merge seastar upstream
* seastar b25a958...1bbb02f (6):
  > native-stack: fix arp request missing under loopback connection
  > apps: iotune: fix compilation with g++ 4.9
  > simple-stream: Add copy constructor
  > tcp: don't need to choose another core since only one core
  > Merge "Fix undefined behaviors related to reactor shutdown" from Tomasz
  > rpc: do not wait for data to be send before reporting timeout
2016-02-16 18:06:50 +02:00
Tomasz Grabiec
a921479e71 Merge tag '807-v3' from https://github.com/avikivity/scylla
From Avi:

This patchset introduces a linearization context for managed_bytes objects.

Within this context, any scattered managed_bytes (found only in lsa regions,
so limited to memtable and cache) are auto-linearized for the lifetime of
the context.   This ensures that key and value lookups can use fast
contiguous iterators instead of using slow discontiguous iterators (or
crashing, as is the case now).
2016-02-16 14:29:48 +01:00
Avi Kivity
13144ea9eb managed_bytes: get rid of explicit linearize/scatter
Now that everything is in a linearization context, we don't need to explicitly
gather data.
2016-02-16 14:37:46 +02:00
Avi Kivity
d415167496 memtable: use managed_bytes linearization context when applying mutations
Ensures that we don't access scattered keys when looking up stuff.
2016-02-16 14:37:46 +02:00
Avi Kivity
fbe6961827 row_cache: run partition-touching operations of row_cache::update in a linearization context
To avoid scattered keys (and values, though those are already protected)
from being accessed, run the update procedure in a managed_bytes linearization
context.

Fixes #807.
2016-02-16 14:37:44 +02:00
Avi Kivity
47ea1237ed build: build seastar's iotune
Target name is build/{mode}/iotune.
2016-02-16 12:13:29 +02:00
Avi Kivity
84ede4c14c Merge seastar upstream
* seastar 0f759f0...b25a958 (1):
  > Merge "IOTune: a tool to tune Seastar's I/O parameters" from Glauber
2016-02-16 12:12:34 +02:00
Asias He
d146045bc5 Revert "Revert "streaming: Send mutations on all shards""
This brings back streaming on all shards. The bug in
locator/abstract_replication_strategy is now fixed.

This reverts commit 9f3061ade8.

Message-Id: <a79ce9cdd6f4af1c6088b89e1911b4b2ed1c10ae.1455589460.git.asias@scylladb.com>
2016-02-16 11:16:51 +02:00
Avi Kivity
ce74718950 Merge "Preparation for specifying query result format in IDL" from Tomasz 2016-02-15 19:41:18 +02:00
Raphael S. Carvalho
59bbe98c21 sstables: keep track of compacting sstables in compaction manager itself
Avi says:
"Something like unordered_set<unsigned long> is error prone, because ints
tend to mix up (also, need to use a sized type, unsigned long varies among
machines)."

With that in mind, it's better if we keep track of compacting sstables in
an unordered_set<shared_sstable>.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <249f0fd4cfcf786cf3c37a79978f7743d07f48ad.1455120811.git.raphaelsc@scylladb.com>
2016-02-15 18:35:43 +02:00
Nadav Har'El
3a2885e1e3 repair: use seastar::gate
Switch to use seastar::gate (and its new gate::check() method) instead
of a similar implementation in repair.cc.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1455553063-13488-1-git-send-email-nyh@scylladb.com>
2016-02-15 18:22:36 +02:00
Avi Kivity
54f145b666 Merge seastar upstream
* seastar 353b1a1...0f759f0 (11):
  > tutorial: add a link to future API documentation
  > sleep: document
  > tutorial: fix typos
  > gate: add check() method
  > tutorial: introduce seastar::gate
  > doc: explain how to test the native stack without dpdk
  > doc: separate the mini-tutorial into its own file
  > doc: move DPDK build instructions to its own file
  > doc: split building instructions into separate files
  > doc: fix/modernize git commands in contributing.md
  > doc: how-to on contributing & guidelines
2016-02-15 18:22:02 +02:00
Tomasz Grabiec
09dc79f245 cql3: select_statement: Set desired serialization format 2016-02-15 17:05:55 +01:00
Tomasz Grabiec
63006e5dd2 query: Serialize collection cells using CQL format
We want the format of query results to be eventually defined in the
IDL and be independent of the format we use in memory to represent
collections. This change is a step in this direction.

The change decouples the format of collection cells in query results from
our in-memory representation. We currently use collection_mutation_view;
after the change we will use the CQL binary protocol format. We use that
because it requires fewer transformations on the coordinator side.

One complication is that some list operations need to retrieve keys
used in list cells, not only values. To satisfy this need, a new query
option called "collections_as_maps" was added, which causes lists
and sets to be reinterpreted as maps matching their underlying
representation. This allows the coordinator to generate mutations
referencing existing items in lists.
2016-02-15 17:05:55 +01:00
Tomasz Grabiec
383296c05b cql3: Fix handling of lists with static columns
List operations and prefetching were not handling static columns
correctly. One issue was that prefetching was attaching static column
data to row data using ids which might overlap with clustered columns.

Another problem was that list operations were always constructing
clustering key even if they worked on a static column. For static
columns the key would be always empty and lookup would fail.

The effect was that list operations which depend on the current state had
no effect. A similar problem could be observed on C* 2.1.9, but not on 2.2.3.

Fixes #903.
2016-02-15 17:05:55 +01:00
Tomasz Grabiec
e65fddc14b types: Introduce data_value::serialize() 2016-02-15 17:05:55 +01:00
Tomasz Grabiec
5f756fcbe5 query: Add cql_format property to partition_slice
It will specify in which format CQL values should be serialized. Will
allow for rolling out new CQL binary protocol versions without
stalling reads.
2016-02-15 17:05:55 +01:00
Tomasz Grabiec
6709c0ac15 cql_serialization_format: Make it CQL protocol version aware
We want to serialize it as a single number, the CQL binary protocol
version to which it corresponds, so it needs to be aware of the
version number.
2016-02-15 17:05:55 +01:00
Tomasz Grabiec
81fdd12f07 cql_serialization_version: Abstract away collection format changes
This puts knowledge about which cql_serialization_formats have the
same collection format into one place,
cql_serialization_format::collection_format_unchanged().
2016-02-15 17:03:53 +01:00
Tomasz Grabiec
9d11968ad8 Rename serialization_format to cql_serialization_format 2016-02-15 16:53:56 +01:00
Tomasz Grabiec
916a91c913 query: Split send_timestamp_and_expiry into two separate options
It's cleaner that way. They don't need to come together.
2016-02-15 16:53:56 +01:00
Tomasz Grabiec
100b540a53 validation: Fix validation of empty partition key
The validation was wrongly assuming that an empty thrift key, which
the original C* code guards against, can only correspond to an empty
representation of our partition_key. This no longer holds after:

   commit 095efd01d6
   "keys: Make from_exploded() and components() work without schema"

This was responsible for dtest failure:
cql_additional_tests.TestCQL:column_name_validation_test
2016-02-15 16:53:56 +01:00
Tomasz Grabiec
f4e3bd0c00 keys: Introduce partition_key::validate()
So that the user doesn't have to play with low-level representations.
2016-02-15 16:53:56 +01:00
Tomasz Grabiec
df5f8e4bfc keys: Avoid unnecessary construction of temporary 'bytes' object
We're now using managed_bytes as the main storage, so the conversion from
bytes_view to bytes is redundant; we need to convert to managed_bytes
eventually.
2016-02-15 16:53:56 +01:00
Tomasz Grabiec
6d00e473ac keys: Make constructor from bytes private 2016-02-15 16:53:55 +01:00
Tomasz Grabiec
e061eb02df cql3: Avoid using partition_key::from_bytes()
serialize() and from_bytes() is a low level interface, which in this
case can be replaced with a partition_key static factory method
resulting in cleaner code.
2016-02-15 16:53:55 +01:00
Paweł Dziepak
dbb878d16e Revert "do not use boost::multiprecision::msb()"
This reverts commit dadd097f9c.

That commit caused serialized forms of varint and decimal to have some
excess leading zeros. They didn't affect deserialization in any way but
caused computed tokens to differ from the Cassandra ones.

Fixes #898.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1455537278-20106-1-git-send-email-pdziepak@scylladb.com>
2016-02-15 14:24:37 +02:00
Avi Kivity
1f752446d2 Merge "Truncation format & fixes" from Calle
"Fixes #884
Fixes #895

Also at seastar-dev: calle/truncate_more

1.) Change truncation records to be stored with IDL serialization
2.) Fix db::serializers encoding of replay_position
3.) Detect attempted reading of Origin truncation records, and instead
    of crashing, ignore and warn.
4.) Change truncation time stamps to be generated per-shard, _after_
    CF flush is done, otherwise data in memtables at flush would be
    retained/replayed on next start. Retain the highest time stamp
    generated.

Note for (3): This patch set does _not_ clear out origin records
automatically. This is because I feel that is a somewhat drastic and
irreversible thing to do. If we want to avail the user of a means
to get rid of the (3) warning, we should probably tell him to either
use cqlsh, or add an API call for this, so he can do it explicitly.
"
2016-02-15 11:39:56 +02:00
Takuya ASADA
fb3f4cc148 dist: add posix_net_conf.sh on Ubuntu package
Fixes #881

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1455522990-32044-1-git-send-email-syuu@scylladb.com>
2016-02-15 11:37:30 +02:00
Nadav Har'El
7dc843fc1c repair: stop ongoing repairs during shutdown
When shutting down a node gracefully, this patch asks all ongoing repairs
started on this node to stop as soon as possible (without completing
their work), and then waits for these repairs to finish (with failure,
usually, because they didn't complete).

We need to do this, because if the repair loop continues to run while we
start destructing the various services it relies on, it can crash (as
reported in #699, although the specific crash reported there no longer
occurs after some changes in the streaming code). Additionally, it is
important to stop the ongoing repair, and not wait for it to complete
its normal operation, because that can take a very long time, and shutdown
is supposed to not take more than a few seconds.

Fixes #699.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1455218873-6201-1-git-send-email-nyh@scylladb.com>
2016-02-14 16:52:41 +02:00
Raphael S. Carvalho
a487ef1ff3 sstables: improve log message when a sstable is sealed
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <e391243212d83347b1b50c728bee24f6a2ecc950.1455230788.git.raphaelsc@scylladb.com>
2016-02-14 12:05:16 +02:00
Tomasz Grabiec
456275e06a storage_proxy: Simplify condition
Message-Id: <1455288472-30538-1-git-send-email-tgrabiec@scylladb.com>
2016-02-14 11:22:15 +02:00
Tomasz Grabiec
321287dd7c cql3: Fix crash when parsing collection condition
Happened when parsing a statement like this:

 DELETE FROM tmap WHERE k=0 IF m[null] = 'foo'

Message-Id: <1455294896-15184-1-git-send-email-tgrabiec@scylladb.com>
2016-02-14 11:21:10 +02:00
Takuya ASADA
3697cee76d dist: switch AMI base image to 'CentOS7-Base2', uses CentOS official kernel
The previous CentOS base image accidentally used a non-standard kernel from elrepo.
This replaces the base image with a new one that contains the CentOS default kernel.

Fixes #890

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1455398903-2865-1-git-send-email-syuu@scylladb.com>
2016-02-14 10:15:27 +02:00
Tomasz Grabiec
efdbc3d6d7 abstract_replication_strategy: Fix generation of token ranges
We can't move-from in the loop because the subject will be empty in
all but the first iteration.

Fixes a crash during node startup:

  "Exiting on unhandled exception of type 'runtime_exception': runtime error: Invalid token. Should have size 8, has size 0"

Fixes update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_node_1_test (and probably others)

Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>
2016-02-12 19:38:36 +01:00
Shlomi Livne
f938e1d303 dist: start scylla with SCYLLA_IO
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <d93a7b41a285fcde796c5681479a328f1efac0c3.1455188901.git.shlomi@scylladb.com>
2016-02-11 17:01:03 +02:00
Shlomi Livne
5494135ddd dist: update SCYLLA_IO with params for AMI
Add setting of --num-io-queues, --max-io-requests for AMI

Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <b94a63154a91c8568e194d7221b9ffc7d7813ebc.1455188901.git.shlomi@scylladb.com>
2016-02-11 17:01:02 +02:00
Shlomi Livne
5cae2560a3 dist: introduce SCYLLA_IO
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <6490d049fd23a335bb0a95cac3e8a4c08c61166e.1455188901.git.shlomi@scylladb.com>
2016-02-11 17:01:02 +02:00
Shlomi Livne
d8cdf76e70 dist: change setting of scylla home from "-d" to "-r"
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <53dcd9d1daa0194de3f889b67788d9c21d1e474d.1455188901.git.shlomi@scylladb.com>
2016-02-11 17:00:37 +02:00
Avi Kivity
3c4f67f3e6 build: require boost > 1.55
See #898.

Add checks both for boost being installed, and for the correct version.
Message-Id: <1455193574-24959-1-git-send-email-avi@scylladb.com>
2016-02-11 15:15:49 +02:00
Avi Kivity
9249d45ae1 Update scylla-ami submodule
* dist/ami/files/scylla-ami b2724be...b3b85be (1):
  > adding --stop-services
2016-02-11 12:24:17 +02:00
Avi Kivity
5834815ed9 Merge seastar upstream
* seastar 14c9991...353b1a1 (2):
  > scripts: posix_net_conf.sh: Change the way we learn NIC's IRQ numbers
  > gate: protect against calling close() more than once
2016-02-11 12:23:51 +02:00
Takuya ASADA
09b1ec6103 dist: attach ephemeral disks on AMI by default
To attach the maximum number of ephemeral disks available on the instance, specify 8.
On AMI creation, it will be reduced to the available number.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1454439628-2882-1-git-send-email-syuu@scylladb.com>
2016-02-11 12:21:09 +02:00
Takuya ASADA
16e6db42e1 dist: don't start scylla-server when it's disabled from AMI userdata
Support the AMI's --stop-services option, preventing startup of scylla-server (and scylla-jmx, since it depends on scylla-server)

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1454492729-11876-1-git-send-email-syuu@scylladb.com>
2016-02-11 12:21:08 +02:00
Takuya ASADA
f227b3faac dist: On AMI, mark root disk with delete_on_termination
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1454513308-12384-1-git-send-email-syuu@scylladb.com>
2016-02-11 12:19:28 +02:00
Takuya ASADA
33309f667e dist: enable enhanced networking on AMI
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1454971289-21369-1-git-send-email-syuu@scylladb.com>
2016-02-11 12:18:48 +02:00
Raphael S. Carvalho
ed61fe5831 sstables: make compaction stop report user-friendly
When scylla stopped an ongoing compaction, the event was reported
as an error. This patch introduces a specialized exception for
compaction stop so that the event can be handled appropriately.

Before:
ERROR [shard 0] compaction_manager - compaction failed: read exception:
std::runtime_error (Compaction for keyspace1/standard1 was deliberately
stopped.)

After:
INFO  [shard 0] compaction_manager - compaction info: Compaction for
keyspace1/standard1 was stopped due to shutdown.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <1f85d4e5c24d23a1b4e7e0370a2cffc97cbc6d44.1455034236.git.raphaelsc@scylladb.com>
2016-02-11 12:16:53 +02:00
Takuya ASADA
8d8130f9c9 dist: fix typo on build_ami.sh
We should always run scylla_setup, not just for locally built rpm

Fixes #897

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1455103519-13780-1-git-send-email-syuu@scylladb.com>
2016-02-11 11:56:11 +02:00
Shlomi Livne
64f8d5a50e dist: update packer location
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <3c33ea073f702e00b789930fce9befef03ad9e88.1455178900.git.shlomi@scylladb.com>
2016-02-11 11:52:56 +02:00
Avi Kivity
bfbf89ee31 Merge "Serialize keys in a form independent of in-memory representation" from Tomasz
"This series changes the on-wire definitions of keys to be of the following form:

  class partition_key {
     std::vector<bytes> exploded();
  };

Keys are therefore collections of components. The components are serialized according
to the format specified in the CQL binary protocol. No bit now depends on how we store keys in memory.

Constructing keys from components currently requires a schema reference,
which makes it not possible to deserialize or serialize the keys automatically
by RPC. To avoid those complications, compound_type was changed so that
it can be constructed and components can be iterated over without schema.
Because of this, partition_key size increased by 2 bytes."
2016-02-10 17:54:42 +02:00
Tomasz Grabiec
b74301302c tests: Add test for key serialization 2016-02-10 15:22:56 +01:00
Tomasz Grabiec
3e2c1840d8 idl: Make key definitions independent of in-memory representation 2016-02-10 15:22:56 +01:00
Tomasz Grabiec
428fce3828 compound: Optimize serialize_single() 2016-02-10 15:22:56 +01:00
Tomasz Grabiec
0cc2832a76 keys: Allow constructing from a range 2016-02-10 15:22:56 +01:00
Tomasz Grabiec
3ffcb998fb keys: Enable serialization from a range not just a vector 2016-02-10 14:35:14 +01:00
Tomasz Grabiec
095efd01d6 keys: Make from_exploded() and components() work without schema
For simplicity, we want to have keys serializable and deserializable
without a schema for now. We will serialize keys in a generic form of a
vector of components, where the format of components is specified by
the CQL binary protocol. So conversion between keys and a vector of
components needs to be possible without a schema.

We may want to make keys schema-dependent back in the future to apply
space optimizations specific to column types. Existing code should
still pass schema& to construct and access the key when possible.

One optimization had to be reverted in this change - avoidance of
storing key length (2 bytes) for single-component partition keys. One
consequence of this, in addition to slightly larger keys, is that we can
no longer avoid a copy when constructing single-component partition keys
from a ready "bytes" object.

I haven't noticed any significant performance difference in:

  tests/perf/perf_simple_query -c1 --write

It does ~130K tps on my machine.
2016-02-10 14:35:13 +01:00
Tomasz Grabiec
31312722d1 compound: Reduce duplication 2016-02-10 14:35:13 +01:00
Tomasz Grabiec
085d148d6f compound: Remove unused methods 2016-02-10 14:35:13 +01:00
Tomasz Grabiec
b777cc9565 tests: Fix tests to not rely on key representation 2016-02-10 14:35:13 +01:00
Asias He
6d0407503b locator: Do not generate wrap-around ranges
Like we did in commit d54c77d5d0,
make the remaining functions in abstract_replication_strategy return
non-wrap-around ranges.

This fixes:

ERROR [shard 0] stream_session - [Stream #f0b7fda0-cf3e-11e5-b6c4-000000000000]
stream_transfer_task: Fail to send to 127.0.0.4:0: std::runtime_error (Not implemented: WRAP_AROUND)

in streaming.
Message-Id: <514d2a9a1d3b868d213464c8858ac5162c0338d8.1455093643.git.asias@scylladb.com>
2016-02-10 10:03:31 +01:00
Avi Kivity
fc6159e2b9 key: tighten partition_key::representation() to return a const managed_bytes&
The conversion to bytes_view can fail if the key is scattered, so defer that
conversion until later.  In a later patch we will intervene before the
conversion to ensure the data is linearized.
2016-02-09 19:55:13 +02:00
Avi Kivity
3c60310e38 key: relax some APIs to accept partition_key_view instead of const partition_key&
Using a partition_key_view can save an allocation in some cases.  We will
make use of it when we linearize a partition_key; during the process we
are given a simple byte pointer, and constructing a partition_key from that
requires an allocation.
2016-02-09 19:55:13 +02:00
Avi Kivity
af8ef54d5a managed_bytes: introduce with_linearized_managed_bytes()
A large managed_bytes blob can be scattered in LSA memory.  Usually this is
fine, but sometimes we want to examine it in place, without copying it out,
while using contiguous iterators for efficiency.

For this use case, introduce with_linearized_managed_bytes(Func),
which runs a function in a "linearization context".  Within the linearization
context, reads of managed_bytes objects will see temporarily linearized copies
instead of scattered data.
2016-02-09 19:55:13 +02:00
Avi Kivity
9f3061ade8 Revert "streaming: Send mutations on all shards"
This reverts commit 31d439213c.

Fixes #894.

Conflicts:
    streaming/stream_manager.cc

(may have undone part of 63a5aa6122)
2016-02-09 18:26:14 +02:00
Calle Wilund
18203a4244 database::truncate/drop: Move time stamp generation to shard
Fixes #884

Time stamps for truncation must be generated after flush, either by
splitting the truncate into two (or more) for-each-shard operations,
or simply by doing time stamping per shard (this solution).

We generate TS on each shard after flushing, and then rely on the
actual stored value to be the highest time point generated.

This should, however, be functionally equivalent from a batch-replay
point of view, and not a problem.
2016-02-09 15:45:37 +00:00
Calle Wilund
ce66acc771 system_keyspace: Always retain highest truncation time stamp
Since the table is written from all shards, and we possibly might
have conflicting time stamps, we define the truncated_at time
as the highest time point, i.e. conservative.
2016-02-09 15:45:37 +00:00
Calle Wilund
22a38f0025 db/serializer: Fix db::serializer<replay_position> format
Should match struct/"official" serial format. (64+32)
This serializer is, however, not really used any more and could
be removed.
2016-02-09 15:45:37 +00:00
Calle Wilund
1c213e1f38 system_keyspace: Use IDL types + better verification of truncation record
Truncation records are not portable between us and Origin.
We need to detect this and ensure we neither try to use Origin
records from a migrated system nor, more to the point, crash
because of a data format error when loading them.

This problem was seen by Tzach when doing a migration from an origin
setup.

Updated record storage to use IDL-serialized types + added versioning
and magic marking + odd-size-checking to ensure we load only correct
data. The code will also deal with records from an older version of
scylla.
2016-02-09 15:45:37 +00:00
Calle Wilund
4d7289b275 serializer_impl: Add convenience wrapper for one-obj deserialization
Akin to serialize_to_buffer.
2016-02-09 13:55:33 +00:00
Calle Wilund
dff89fffcd IDL: Add idl definitions for replay_position and truncation_record 2016-02-09 13:55:33 +00:00
Calle Wilund
873f87430d database: Check sstable dir name UUID part when populating CF
Fixes #870
Only load sstables from CF directories that match the current
CF uuid.
Message-Id: <1454938450-4338-1-git-send-email-calle@scylladb.com>
2016-02-08 14:48:19 +01:00
Avi Kivity
e5b72aedf1 managed_bytes: don't copy data during hashing 2016-02-08 12:43:05 +02:00
Avi Kivity
5d958db869 managed_bytes: fix operator== for fragmented blobs
Must compare fragment by fragment.
2016-02-08 12:43:05 +02:00
Calle Wilund
2ffd7d7b99 stream_manager: Change construction to make gcc 4.9 happy
gcc 4.9 complains about the type{ val, val } construction of a
type with an implicit default constructor, i.e. member = initializer
declarations. gcc 5 does not (and possibly rightly so).
However, we still (implicitly) claim to support gcc 4.9, so
why not just change this particular instance.

Message-Id: <1454921328-1106-1-git-send-email-calle@scylladb.com>
2016-02-08 10:54:48 +02:00
Paweł Dziepak
c90ec731c8 transport: do not close gate at connection shutdown
connection::_pending_requests_gate is responsible for keeping connection
objects alive as long as there are outstanding requests and is closed
in connection::process() when needed. Closing it in connection::shutdown()
as well may cause the gate to be closed twice, which is a bug.

Fixes #690.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1454596390-23239-1-git-send-email-pdziepak@scylladb.com>
2016-02-07 20:07:23 +02:00
Avi Kivity
8b0a26f06d build: support for alternative versions of libsystemd pkgconfig
While pkgconfig is supposed to be a distribution and version neutral way
of detecting packages, it doesn't always work this way.  The sd_notify()
manual page documents that sd_notify is available via the libsystemd
package, but on centos 7.0 it is only available via the libsystemd-daemon
package (on centos 7.1+ it works as expected).

Fix by allowing for alternate versions of package names, testing each one
until a match is found.

Fixes #879.

Message-Id: <1454858862-5239-1-git-send-email-avi@scylladb.com>
2016-02-07 17:36:57 +02:00
Avi Kivity
ad58663c96 row_cache: reindent 2016-02-07 13:25:29 +02:00
Asias He
31d439213c streaming: Send mutations on all shards
Currently, only the shard the stream_plan is created on will send
streaming mutations. To utilize all the available cores, we can make each
shard send the mutations it is responsible for. On the receiver side,
we do not forward the mutations to the shard where the stream_session was
created, so we avoid unnecessary forwarding.

Note: the downside is that it is now harder:

1) to track the number of bytes sent and received
2) to update the keep-alive timer upon receipt of STREAM_MUTATION

To fix, we now store the sent/received bytes info on all shards. When
the keep alive timer expires, we check if any progress has been made.

Hopefully, this patch will make streaming much faster and in turn
make repair, decommission, and adding a node faster.

Refs: https://github.com/scylladb/scylla/issues/849

Tested with decommission/repair dtest.

Message-Id: <96b419ab11b736a297edd54a0b455ffdc2511ac5.1454645370.git.asias@scylladb.com>
2016-02-07 10:57:51 +02:00
Gleb Natapov
63a5aa6122 prevent superfluous frozen_mutation copying
Sometimes frozen_mutation is copied while it can be moved instead. Fix
those cases.

Message-Id: <20160204165708.GI6705@scylladb.com>
2016-02-07 10:54:16 +02:00
Erich Keane
4197ceeedb raw_statement::is_reversed rewrite to avoid VLA
The is_reversed function uses a variable length array, which isn't
spec-abiding C++.  Additionally, the Clang compiler doesn't allow them
with non-POD types, so this function wouldn't compile.

After reading through the function it seems that the array wasn't
necessary, as the check could be calculated inline rather than
separately.  This version should be more performant (since it no longer
requires the VLA lookup) while taking up less memory in
all but the smallest of edge cases (when clustering_key_size *
sizeof(optional<bool>) < sizeof(size_type) - sizeof(uint32_t) +
sizeof(bool)).

This patch uses relation_order_unsupported to ensure that the exception
order is consistent with the previous version.  The throw would
otherwise be moved into the initial for-loop.

There are two deviations in behavior:
The first is the initial assert.  It however should not change the apparent
behavior besides causing orderings() to be looked up 2x in debug
situations.

The second is the conversion of is_reversed_ from an optional to a bool.
The result is that the final return value is now well-defined to be
false in the release-condition where orderings().size() == 0, rather
than be the ill-defined *is_reversed_ that was there previously.

Signed-off-by: Erich Keane <erich.keane@verizon.net>
Message-Id: <1454546285-16076-4-git-send-email-erich.keane@verizon.net>
2016-02-07 10:38:17 +02:00
Erich Keane
49842aacd9 managed_vector: maybe_constructed ctor to non-constexpr
Clang enforces that a union's constexpr CTOR must initialize
one of the members.  The spec is seemingly silent as to what
the rule is here; however, making this non-constexpr results in clang
accepting the constructor.

Signed-off-by: Erich Keane <erich.keane@verizon.net>
Message-Id: <1454604300-1673-1-git-send-email-erich.keane@verizon.net>
2016-02-07 10:30:45 +02:00
Erich Keane
e87019843f Fix PHI_FACTOR definition to be spec compliant
PHI_FACTOR is a constexpr variable that is defined using std::log.
Though G++ has a constexpr version of std::log, this itself is not spec
compliant (in fact, Clang enforces this).  See C++ Spec 26.8 for the
definition of std::log and 17.6.5.6 for the rule regarding adding
constexpr where it isn't specified.

This patch replaces the std::log statement with a version from math.h
that contains the exact value (M_LOG10El).

Signed-off-by: Erich Keane <erich.keane@verizon.net>
Message-Id: <1454603285-32677-1-git-send-email-erich.keane@verizon.net>
2016-02-04 18:33:44 +02:00
Avi Kivity
c85f6c4df1 Merge seastar upstream
* seastar 661ccd9...14c9991 (1):
  > reactor: use correct open_flags when opening a file without DMA support

Fixes #871.
2016-02-04 18:17:04 +02:00
Gleb Natapov
77d47c0c4b optimize serialization of array/vector of integral types
Array of integral types on little endian machine can be memcpyed into/out
of a buffer instead of serialized/deserialized element by element.

Message-Id: <20160204155425.GC6705@scylladb.com>
2016-02-04 18:01:14 +02:00
Avi Kivity
91fbb81477 Merge seastar upstream
* seastar f8beab9...661ccd9 (1):
  > Merge "Use swapcontext() with AddressSanitizer" from Paweł
2016-02-04 17:30:15 +02:00
Paweł Dziepak
ababdfc9e2 tests/batchlog: use proper batchlog version
Since 42e3999a00 "Check batchlog version
before replaying" there is a version check in batchlog replay.
However, the test wasn't updated and still used some arbitrary version
number which caused it to fail.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1454595368-21670-1-git-send-email-pdziepak@scylladb.com>
2016-02-04 16:50:45 +02:00
Gleb Natapov
049ae37d08 storage_proxy: change collectd to show foreground mutation instead of overall mutation count
It is much easier to see what is going on this way; otherwise, the graphs
for background mutations and overall mutations are very close at the
usual scaling for many workloads.

Message-Id: <20160204083452.GH6705@scylladb.com>
2016-02-04 14:58:56 +02:00
Gleb Natapov
a9e4afd8d2 Drop query-result.hh from database.hh
It is not needed there but causes a lot of recompilation when changed.

Message-Id: <1454496142-14537-3-git-send-email-gleb@scylladb.com>
2016-02-04 13:22:27 +02:00
Gleb Natapov
2ae1ae2d18 Cleanup messaging_service.hh includes a bit.
Forward declare some classes instead.

Message-Id: <1454496142-14537-2-git-send-email-gleb@scylladb.com>
2016-02-04 13:22:24 +02:00
Avi Kivity
f3ca597a01 Merge "Sstable cleanup fixes" from Tomasz
"  - Added waiting for async cleanup on clean shutdown

  - Crash in the middle of sstable removal doesn't leave system in a non-bootable state"
2016-02-04 12:36:13 +02:00
Tomasz Grabiec
c7ef3703cc sstable: Make sstable deletion never leave sstable set in a non-bootable state
Refs #860
Refs #802

An sstable file set with any component missing is interpreted as a
critical error during boot. Currently the sstable removal procedure could
leave the files in a non-bootable state if the process crashed after
TOC was removed but before all components were removed as well.

To solve this problem, start the removal by renaming the TOC file to a
so called "temporary TOC". Upon boot such kind of TOC file is
interpreted as an sstable which is safe to remove. This kind of TOC
was added before to deal with a similar scenario but in the opposite
direction - when writing a new sstable.
2016-02-03 17:36:17 +01:00
Tomasz Grabiec
c8a98b487c sstables: Remove coupling-hiding duplication 2016-02-03 17:36:17 +01:00
Tomasz Grabiec
355874281a sstables: Do not register exit hooks from static initializer
Fixes #868.

Registering exit hooks while the reactor is already iterating over exit
hooks is not allowed and currently leads to undefined behavior
observed in #868. While we should make the failure more user friendly,
registering exit hooks concurrently with shutdown will not be allowed.

We don't expect exit hooks to be registered after exit starts because
this would violate the guarantee which says that exit hooks are
executed in reverse order of registration. Starting exit sequence in
the middle of initialization sequence would result in use after free
errors. By the way, I'm not sure if currently there's anything which
prevents this.

To solve this problem, move the exit hook to the initialization
sequence. In case of tests, the cleanup has to be called explicitly.
2016-02-03 17:35:50 +01:00
Tomasz Grabiec
136c9d9247 sstables: Improve error message in case of generation duplication
Refs #870.
2016-02-03 17:35:50 +01:00
Calle Wilund
a00ff015f4 transport::server: read cqlv2 batch options correctly
Fixes #563.
Refs #584

CQLv2 encodes batch query_options in v1 format, not v2+.
CQLv1, on the other hand, has no batch support at all.
Make read_options use explicit version format if needed.

v2: Ensure we preserve cql protocol version in query_opts
Message-Id: <1454514510-21706-1-git-send-email-calle@scylladb.com>
2016-02-03 16:55:07 +01:00
Gleb Natapov
b4b560e0fc change result_digest to hold std::array instead of a std::vector
Digest size is fixed, so there is no need to use std::vector to hold it.

Message-Id: <20160203102530.GU6705@scylladb.com>
2016-02-03 12:27:39 +02:00
Raphael S. Carvalho
4041f8cffc compaction: stop all ongoing compaction during shutdown
Currently, we wait for ongoing compaction during shutdown, but
that may take 'forever' if compacting huge sstables with a slow
disk. Compaction of huge sstables will take a considerable amount
of time even with fast disks. Therefore, all ongoing compaction
should be stopped during shutdown.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <3370f17ce4274df417ea60651f33fc5d4de91199.1454441286.git.raphaelsc@scylladb.com>
2016-02-03 10:18:51 +02:00
Raphael S. Carvalho
cf22c827f9 compaction_manager: fix assertion when stopping task
Task is stopped by closing gate and forcing it to exit via gate
exception. The problem is that task->compacting_cf may be set to
the column family being compacted, and compaction_manager::remove
would see it and try to stop the same task again, which would
lead to problems. The fix is to clean task->compacting_cf when
stopping task.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <3473e93c1a107a619322769d65fa020529b5501b.1454441286.git.raphaelsc@scylladb.com>
2016-02-03 10:18:15 +02:00
Asias He
c67538009c streaming: Fix assert in update_progress
The problem is that on the follower side, we set up _session_info too
late, after receiving the PREPARE_DONE_MESSAGE message. The initiator can
send STREAM_MUTATION before sending the PREPARE_DONE_MESSAGE message.

To fix, we set up _session_info after we received the prepare_message on
both initiator and follower.

Fixes #869

scylla: streaming/session_info.cc:44: void
streaming::session_info::update_progress(streaming::progress_info):
Assertion `peer == new_progress.peer' failed.
Message-Id: <6d945ba1e8c4fc0949c3f0a72800c9448ba27761.1454476876.git.asias@scylladb.com>
2016-02-03 10:15:45 +02:00
Asias He
46c392eb17 messaging_service: Stop retrying if messaging_service is being shutdown
If we are shutting down the messaging_service, we should not retry the
message again.

Refs #862

Message-Id: <7c3afb646ba8254eca69096d80dd5ea007e416a7.1454418053.git.asias@scylladb.com>
2016-02-02 19:50:54 +02:00
Gleb Natapov
c509e48674 Parallelize batchlog replay
Current code is serialized by get_truncated_at(). Use map_reduce to make
it run in parallel.
Message-Id: <1454421603-13080-4-git-send-email-gleb@scylladb.com>
2016-02-02 17:08:54 +01:00
Gleb Natapov
42e3999a00 Check batchlog version before replaying
In case the batchlog serialization format changes, check the version
before trying to interpret raw data.
Message-Id: <1454421603-13080-3-git-send-email-gleb@scylladb.com>
2016-02-02 17:08:54 +01:00
Gleb Natapov
116ad5a603 Use net::messaging_service::current_version for serialization format versioning
Message-Id: <1454421603-13080-2-git-send-email-gleb@scylladb.com>
2016-02-02 17:08:53 +01:00
Avi Kivity
b14d39bfb1 Merge "Move last bits to IDL serializer and get rid of old one" from Gleb 2016-02-02 12:33:18 +02:00
Gleb Natapov
19067db642 remove old serializer 2016-02-02 12:15:50 +02:00
Gleb Natapov
4e440ebf8e Remove old inet_address and uuid serializers 2016-02-02 12:15:50 +02:00
Gleb Natapov
31bb194c21 Remove old result_digest serializer 2016-02-02 12:15:50 +02:00
Gleb Natapov
10cd4d948c Move result_digest to idl 2016-02-02 12:15:50 +02:00
Gleb Natapov
775cc93880 remove unused range and token serializers 2016-02-02 12:15:49 +02:00
Gleb Natapov
e3a40254e6 Remove old partition_checksum serializer 2016-02-02 12:15:49 +02:00
Gleb Natapov
e6f7b12b51 Move partition_checksum to use idl 2016-02-02 12:15:49 +02:00
Gleb Natapov
8cc1d1a445 Add std:array serializer 2016-02-02 12:15:49 +02:00
Gleb Natapov
a8902ccb4a Remove old frozen_schema serializer 2016-02-02 12:15:49 +02:00
Gleb Natapov
60e3637efc Move frozen_schema to idl 2016-02-02 12:15:49 +02:00
Nadav Har'El
b95c15f040 repair: change checksum structure to be better suited for serializer
Change the partition_checksum structure to be better suited for the
new serializers:

 1. Use std::array<> instead of a C array, as the latter is not
    supported by the new serializers.

 2. Use an array of 32 bytes, instead of 4 8-byte integers. This will
    guarantee that no byte-swapping monkey-business will be done on
    these checksums.
    The checksum XOR and equality-checking methods still temporarily
    cast the bytes to 8-byte chunks, for (hopefully) better performance.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1454364900-3076-1-git-send-email-nyh@scylladb.com>
2016-02-02 11:58:25 +02:00
Calle Wilund
c67e7e4ce4 cql3::sets: Make insert/update frozen set handle null/empty correctly
Fixes #578

Message-Id: <1454345878-1977-1-git-send-email-calle@scylladb.com>
2016-02-01 19:15:28 +02:00
Takuya ASADA
5fe82ce555 dist: fix build error on Ubuntu 15.10
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1454345982-5899-1-git-send-email-syuu@scylladb.com>
2016-02-01 19:14:49 +02:00
Avi Kivity
1f245e3bcb mutation_partition: fix use of boost::intrusive::set<>::comp()
Seems like boost::intrusive::set<>::comp() is not accessible on some
versions of boost.  Replace by the equivalent
boost::intrusive::set<>::key_comp().

Fixes #858.
Message-Id: <1454326483-29780-1-git-send-email-avi@scylladb.com>
2016-02-01 13:54:52 +01:00
Calle Wilund
159dbe3a64 sstable_datafile_tests: Replace '---' with auto
Fixes compilation issues on some g++.
Message-Id: <1454323749-21933-1-git-send-email-calle@scylladb.com>
2016-02-01 12:58:33 +02:00
Avi Kivity
2b84bd3b75 Merge "standalone tcp connection for streaming" from Asias
"Make streaming use a standalone TCP connection and send more mutations in
parallel.

It is supposed to help: "Decommission not fully utilizing hardware #849""
2016-02-01 09:54:11 +02:00
Asias He
c618c699b3 streaming: Increase mutation_send_limiter
The idea behind the current limit of 10 stream_mutations per core is
to avoid streaming overwhelming the TCP connection and starving normal CQL
verbs when the streamed mutations are big and take a long time to
complete.

Now that we use a standalone connection for streaming verbs, we can
increase the limit.

Hopefully, this will fix #849.
2016-02-01 11:01:56 +08:00
Asias He
fbf796b812 messaging_service: Use standalone connection for stream verbs
In streaming, the amount of data that needs to be streamed to peer nodes
might be large.

To avoid streaming overwhelming the TCP connection used by
user CQL verbs and starving the user CQL queries, we use a standalone TCP
connection for streaming verbs.
2016-02-01 11:01:56 +08:00
Avi Kivity
1146e3796d Merge "streaming refactor" from Asias
"- Wire up session progress
- Refactor stream_coordinator::host_streaming_data
- Introduce get_session helper to simplify verb handling
- Remove unused code

Tested with streaming in update_cluster_layout_tests.py"
2016-01-31 20:17:53 +02:00
Tomasz Grabiec
945ae5d1ea Move std::hash<range<T>> definition to range.hh
Message-Id: <1454008052-5152-1-git-send-email-tgrabiec@scylladb.com>
2016-01-31 20:11:30 +02:00
Avi Kivity
f6e7dbf080 Merge seastar upstream
* seastar 6623379...f8beab9 (2):
  > json_base_element: do not assign the element name
  > io_queue: change visibility of internal function
2016-01-31 16:34:39 +02:00
Raphael S. Carvalho
a46aa47ab1 make sstables::compact_sstables return list of created sstables
Now, sstables::compact_sstables() receives as input a list of sstables
to be compacted, and outputs a list of sstables generated by compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <0d8397f0395ce560a7c83cccf6e897a7f464d030.1454110234.git.raphaelsc@scylladb.com>
2016-01-31 12:39:20 +02:00
Raphael S. Carvalho
ee84f310d9 move deletion of sstables generated by interrupted compaction
This deletion should be handled by sstables::compact_sstables, which
is responsible for the creation of new sstables.
This also simplifies the code.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <541206be2e910ab4edb1500b098eb5ebf29c6509.1454110234.git.raphaelsc@scylladb.com>
2016-01-31 12:39:20 +02:00
Glauber Costa
7214649b8a sstables: const where const is due
Some SSTable methods are not marked const, but they should be.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <72cd3ef0157eb38e7fd48d0c989f2342cbc42f3c.1454103008.git.glauber@scylladb.com>
2016-01-31 12:36:36 +02:00
Avi Kivity
3434b8e7c6 Merge seastar upstream
* seastar fbd9b30...6623379 (1):
  > fstream: improve make_file_input_stream() for a subrange of a file
2016-01-31 12:01:46 +02:00
Avi Kivity
a3fa123070 Update scylla-ami submodule
* dist/ami/files/scylla-ami e284bcd...b2724be (2):
  > Revert "Run scylla.yaml construction only once"
  > Move AMI dependent part of scylla_prepare to scylla-ami-setup.service
2016-01-31 12:01:15 +02:00
Avi Kivity
f08f5858a8 Merge "Introduce scylla-ami-setup.service, fix bugs" from Takuya
"This moves the AMI-dependent part of scylla_prepare to the scylla-ami repo, making it scylla-ami-setup.service, an independent systemd unit.
Also, it stops calling scylla_sysconfig_setup from scylla_setup (called at AMI creation time), calling it from scylla-ami-setup instead."
2016-01-31 12:00:32 +02:00
Takuya ASADA
111dc19942 dist: construct scylla.yaml on first startup of AMI instance, not AMI image creation time
Install scylla-ami-setup.service, stop calling scylla_sysconfig_setup on AMI.
scylla-ami-setup.service will call it instead.
Only works with scylla-ami fix.
Fixes #857

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-01-30 15:48:45 -05:00
Takuya ASADA
71a26e1412 dist: don't need AMI_KEEP_VERSION anymore, since we fixed the issue that 'yum update' mistakenly replaces scylla development version with release version
It actually isn't called now (since $LOCAL_PKG is always empty), so we can safely remove this.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-01-30 15:47:05 -05:00
Takuya ASADA
f9d32346ef dist: scylla_sysconfig_setup uses current sysconfig values as default value
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-01-30 15:46:21 -05:00
Takuya ASADA
4d5baef3e3 dist: keep original SCYLLA_ARGS when updating sysconfig
Since we dropped scylla_run, the default SCYLLA_ARGS parameter is no
longer empty, so we need to support it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-01-30 15:44:03 -05:00
Asias He
f07cd30c81 streaming: Remove unused create_message_for_retry 2016-01-29 16:31:07 +08:00
Asias He
cb92fe75e6 streaming: Introduce get_session helper
To simplify streaming verb handler.

- Use get_session instead of open coded logic to get get_coordinator and
  stream_session in all the verb handlers

- Use throw instead of assert for error handling

- init_receiving_side now returns a shared_ptr<stream_result_future>
2016-01-29 16:31:07 +08:00
Asias He
360df6089c streaming: Remove unused stream_session::retry 2016-01-29 16:31:07 +08:00
Asias He
2f48d402e2 streaming: Remove unused commented code 2016-01-29 16:31:07 +08:00
Asias He
ed3da7b04c streaming: Drop flush_tables option for add_transfer_ranges
We do not stream sstable files, so there is no need to flush them.
2016-01-29 16:31:07 +08:00
Asias He
aa69d5ffb2 streaming: Drop update_progress in stream_coordinator
Since we have session_info inside stream_session now, we can call
update_progress directly in stream_session.
2016-01-29 16:31:07 +08:00
Asias He
30c745f11a streaming: Get rid of stream_coordinator::host_streaming_data
Now that host_streaming_data only holds a shared_ptr<stream_session>, we can
get rid of it and put the shared_ptr<stream_session> inside _peer_sessions.
2016-01-29 16:31:07 +08:00
Asias He
46bec5980b streaming: Put session_info inside stream_session
There is a 1:1 mapping between session_info and stream_session. By putting
session_info inside stream_session, we can get rid of the
stream_coordinator::host_streaming_data class.
2016-01-29 16:31:07 +08:00
Asias He
91e245edac streaming: Initialize total_size in stream_transfer_task
Also rename the private members to _total_size and _files.
2016-01-29 16:31:07 +08:00
Asias He
c4bdb6f782 streaming: Wire up session progress
The progress info is needed by JMX api.
2016-01-29 16:31:07 +08:00
Avi Kivity
3e4ce609ee Merge seastar upstream
* seastar ec468ba...fbd9b30 (9):
  > Add implementation of count_leading_zeros<LL>
  > Fix htonl usage for clang
  > Fix gnutls_error_category ctor for clang
  > Add header files required for libc++
  > Add clang warning suppressions
  > Switch to correct usage of std::abs
  > Fix the do_marshall(sic) function to init_list
  > Corrected sockaddr_in initialization
  > Remove unused const char misc_strings
2016-01-28 18:24:46 +02:00
322 changed files with 11741 additions and 6299 deletions

.gitignore

@@ -8,3 +8,4 @@ cscope.*
dist/ami/files/*.rpm
dist/ami/variables.json
dist/ami/scylla_deploy.sh
*.pyc

.gitmodules

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui


@@ -1,6 +1,6 @@
#!/bin/sh
VERSION=666.development
VERSION=1.0.4
if test -f version
then


@@ -836,6 +836,22 @@
"type":"string",
"paramType":"query"
},
{
"name":"startToken",
"description":"Token on which to begin repair",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"endToken",
"description":"Token on which to end repair",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"columnFamilies",
"description":"Which column families to repair in the given keyspace. Multiple columns families can be named separated by commas. If this option is missing, all column families in the keyspace are repaired.",


@@ -214,16 +214,16 @@ void set_storage_proxy(http_context& ctx, routes& r) {
});
sp::get_schema_versions.set(r, [](std::unique_ptr<request> req) {
//TBD
// FIXME
// describe_schema_versions is not implemented yet
// this is a work around
std::vector<sp::mapper_list> res;
sp::mapper_list entry;
entry.key = boost::lexical_cast<std::string>(utils::fb_utilities::get_broadcast_address());
entry.value.push(service::get_local_storage_service().get_schema_version());
res.push_back(entry);
return make_ready_future<json::json_return_type>(res);
return service::get_local_storage_service().describe_schema_versions().then([] (auto result) {
std::vector<sp::mapper_list> res;
for (auto e : result) {
sp::mapper_list entry;
entry.key = std::move(e.first);
entry.value = std::move(e.second);
res.emplace_back(std::move(entry));
}
return make_ready_future<json::json_return_type>(std::move(res));
});
});
sp::get_cas_read_timeouts.set(r, [](std::unique_ptr<request> req) {


@@ -280,10 +280,12 @@ void set_storage_service(http_context& ctx, routes& r) {
return ctx.db.invoke_on_all([keyspace, column_families] (database& db) {
std::vector<column_family*> column_families_vec;
auto& cm = db.get_compaction_manager();
for (auto entry : column_families) {
column_family* cf = &db.find_column_family(keyspace, entry);
cm.submit_cleanup_job(cf);
for (auto cf : column_families) {
column_families_vec.push_back(&db.find_column_family(keyspace, cf));
}
return parallel_for_each(column_families_vec, [&cm] (column_family* cf) {
return cm.perform_cleanup(cf);
});
}).then([]{
return make_ready_future<json::json_return_type>(0);
});
@@ -326,7 +328,8 @@ void set_storage_service(http_context& ctx, routes& r) {
ss::repair_async.set(r, [&ctx](std::unique_ptr<request> req) {
static std::vector<sstring> options = {"primaryRange", "parallelism", "incremental",
"jobThreads", "ranges", "columnFamilies", "dataCenters", "hosts", "trace"};
"jobThreads", "ranges", "columnFamilies", "dataCenters", "hosts", "trace",
"startToken", "endToken" };
std::unordered_map<sstring, sstring> options_map;
for (auto o : options) {
auto s = req->get_query_param(o);
@@ -585,6 +588,8 @@ void set_storage_service(http_context& ctx, routes& r) {
auto val_str = req->get_query_param("value");
bool value = (val_str == "True") || (val_str == "true") || (val_str == "1");
return service::get_local_storage_service().db().invoke_on_all([value] (database& db) {
db.set_enable_incremental_backups(value);
// Change both KS and CF, so they are in sync
for (auto& pair: db.get_keyspaces()) {
auto& ks = pair.second;


@@ -32,11 +32,16 @@ namespace hs = httpd::stream_manager_json;
static void set_summaries(const std::vector<streaming::stream_summary>& from,
json::json_list<hs::stream_summary>& to) {
for (auto sum : from) {
if (!from.empty()) {
hs::stream_summary res;
res.cf_id = boost::lexical_cast<std::string>(sum.cf_id);
res.files = sum.files;
res.total_size = sum.total_size;
res.cf_id = boost::lexical_cast<std::string>(from.front().cf_id);
// For each stream_session, we pretend we are sending/receiving one
// file, to make it compatible with nodetool.
res.files = 1;
// We cannot estimate the total number of bytes the stream_session will
// send or receive since we don't know the size of the frozen_mutation
// until we read it.
res.total_size = 0;
to.push(res);
}
}
@@ -85,18 +90,22 @@ static hs::stream_state get_state(
void set_stream_manager(http_context& ctx, routes& r) {
hs::get_current_streams.set(r,
[] (std::unique_ptr<request> req) {
return streaming::get_stream_manager().map_reduce0([](streaming::stream_manager& stream) {
std::vector<hs::stream_state> res;
for (auto i : stream.get_initiated_streams()) {
res.push_back(get_state(*i.second.get()));
}
for (auto i : stream.get_receiving_streams()) {
res.push_back(get_state(*i.second.get()));
}
return res;
}, std::vector<hs::stream_state>(),concat<hs::stream_state>).
then([](const std::vector<hs::stream_state>& res) {
return make_ready_future<json::json_return_type>(res);
return streaming::get_stream_manager().invoke_on_all([] (auto& sm) {
return sm.update_all_progress_info();
}).then([] {
return streaming::get_stream_manager().map_reduce0([](streaming::stream_manager& stream) {
std::vector<hs::stream_state> res;
for (auto i : stream.get_initiated_streams()) {
res.push_back(get_state(*i.second.get()));
}
for (auto i : stream.get_receiving_streams()) {
res.push_back(get_state(*i.second.get()));
}
return res;
}, std::vector<hs::stream_state>(),concat<hs::stream_state>).
then([](const std::vector<hs::stream_state>& res) {
return make_ready_future<json::json_return_type>(res);
});
});
});
@@ -111,17 +120,9 @@ void set_stream_manager(http_context& ctx, routes& r) {
hs::get_total_incoming_bytes.set(r, [](std::unique_ptr<request> req) {
gms::inet_address peer(req->param["peer"]);
return streaming::get_stream_manager().map_reduce0([peer](streaming::stream_manager& sm) {
int64_t res = 0;
for (auto sr : sm.get_all_streams()) {
if (sr) {
for (auto session : sr->get_coordinator()->get_all_stream_sessions()) {
if (session->peer == peer) {
res += session->get_bytes_received();
}
}
}
}
return res;
return sm.get_progress_on_all_shards(peer).then([] (auto sbytes) {
return sbytes.bytes_received;
});
}, 0, std::plus<int64_t>()).then([](int64_t res) {
return make_ready_future<json::json_return_type>(res);
});
@@ -129,15 +130,9 @@ void set_stream_manager(http_context& ctx, routes& r) {
hs::get_all_total_incoming_bytes.set(r, [](std::unique_ptr<request> req) {
return streaming::get_stream_manager().map_reduce0([](streaming::stream_manager& sm) {
int64_t res = 0;
for (auto sr : sm.get_all_streams()) {
if (sr) {
for (auto session : sr->get_coordinator()->get_all_stream_sessions()) {
res += session->get_bytes_received();
}
}
}
return res;
return sm.get_progress_on_all_shards().then([] (auto sbytes) {
return sbytes.bytes_received;
});
}, 0, std::plus<int64_t>()).then([](int64_t res) {
return make_ready_future<json::json_return_type>(res);
});
@@ -145,18 +140,10 @@ void set_stream_manager(http_context& ctx, routes& r) {
hs::get_total_outgoing_bytes.set(r, [](std::unique_ptr<request> req) {
gms::inet_address peer(req->param["peer"]);
return streaming::get_stream_manager().map_reduce0([peer](streaming::stream_manager& sm) {
int64_t res = 0;
for (auto sr : sm.get_all_streams()) {
if (sr) {
for (auto session : sr->get_coordinator()->get_all_stream_sessions()) {
if (session->peer == peer) {
res += session->get_bytes_sent();
}
}
}
}
return res;
return streaming::get_stream_manager().map_reduce0([peer] (streaming::stream_manager& sm) {
return sm.get_progress_on_all_shards(peer).then([] (auto sbytes) {
return sbytes.bytes_sent;
});
}, 0, std::plus<int64_t>()).then([](int64_t res) {
return make_ready_future<json::json_return_type>(res);
});
@@ -164,15 +151,9 @@ void set_stream_manager(http_context& ctx, routes& r) {
hs::get_all_total_outgoing_bytes.set(r, [](std::unique_ptr<request> req) {
return streaming::get_stream_manager().map_reduce0([](streaming::stream_manager& sm) {
int64_t res = 0;
for (auto sr : sm.get_all_streams()) {
if (sr) {
for (auto session : sr->get_coordinator()->get_all_stream_sessions()) {
res += session->get_bytes_sent();
}
}
}
return res;
return sm.get_progress_on_all_shards().then([] (auto sbytes) {
return sbytes.bytes_sent;
});
}, 0, std::plus<int64_t>()).then([](int64_t res) {
return make_ready_future<json::json_return_type>(res);
});

View File

@@ -54,9 +54,9 @@ class atomic_cell_or_collection;
*/
class atomic_cell_type final {
private:
static constexpr int8_t DEAD_FLAGS = 0;
static constexpr int8_t LIVE_FLAG = 0x01;
static constexpr int8_t EXPIRY_FLAG = 0x02; // When set, the expiry field is present. Set only for live cells
static constexpr int8_t REVERT_FLAG = 0x04; // transient flag used to efficiently implement ReversiblyMergeable for atomic cells.
static constexpr unsigned flags_size = 1;
static constexpr unsigned timestamp_offset = flags_size;
static constexpr unsigned timestamp_size = 8;
@@ -67,14 +67,21 @@ private:
static constexpr unsigned ttl_offset = expiry_offset + expiry_size;
static constexpr unsigned ttl_size = 4;
private:
static bool is_revert_set(bytes_view cell) {
return cell[0] & REVERT_FLAG;
}
template<typename BytesContainer>
static void set_revert(BytesContainer& cell, bool revert) {
cell[0] = (cell[0] & ~REVERT_FLAG) | (revert * REVERT_FLAG);
}
static bool is_live(const bytes_view& cell) {
return cell[0] != DEAD_FLAGS;
return cell[0] & LIVE_FLAG;
}
static bool is_live_and_has_ttl(const bytes_view& cell) {
return cell[0] & EXPIRY_FLAG;
}
static bool is_dead(const bytes_view& cell) {
return cell[0] == DEAD_FLAGS;
return !is_live(cell);
}
// Can be called on live and dead cells
static api::timestamp_type timestamp(const bytes_view& cell) {
@@ -106,7 +113,7 @@ private:
}
static managed_bytes make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
managed_bytes b(managed_bytes::initialized_later(), flags_size + timestamp_size + deletion_time_size);
b[0] = DEAD_FLAGS;
b[0] = 0;
set_field(b, timestamp_offset, timestamp);
set_field(b, deletion_time_offset, deletion_time.time_since_epoch().count());
return b;
@@ -140,8 +147,11 @@ protected:
ByteContainer _data;
protected:
atomic_cell_base(ByteContainer&& data) : _data(std::forward<ByteContainer>(data)) { }
atomic_cell_base(const ByteContainer& data) : _data(data) { }
friend class atomic_cell_or_collection;
public:
bool is_revert_set() const {
return atomic_cell_type::is_revert_set(_data);
}
bool is_live() const {
return atomic_cell_type::is_live(_data);
}
@@ -187,10 +197,13 @@ public:
bytes_view serialize() const {
return _data;
}
void set_revert(bool revert) {
atomic_cell_type::set_revert(_data, revert);
}
};
class atomic_cell_view final : public atomic_cell_base<bytes_view> {
atomic_cell_view(bytes_view data) : atomic_cell_base(data) {}
atomic_cell_view(bytes_view data) : atomic_cell_base(std::move(data)) {}
public:
static atomic_cell_view from_bytes(bytes_view data) { return atomic_cell_view(data); }
@@ -198,6 +211,11 @@ public:
friend std::ostream& operator<<(std::ostream& os, const atomic_cell_view& acv);
};
class atomic_cell_ref final : public atomic_cell_base<managed_bytes&> {
public:
atomic_cell_ref(managed_bytes& buf) : atomic_cell_base(buf) {}
};
class atomic_cell final : public atomic_cell_base<managed_bytes> {
atomic_cell(managed_bytes b) : atomic_cell_base(std::move(b)) {}
public:

View File

@@ -27,16 +27,18 @@
#include "atomic_cell.hh"
#include "hashing.hh"
template<typename Hasher>
void feed_hash(collection_mutation_view cell, Hasher& h, const data_type& type) {
auto&& ctype = static_pointer_cast<const collection_type_impl>(type);
auto m_view = ctype->deserialize_mutation_form(cell);
::feed_hash(h, m_view.tomb);
for (auto&& key_and_value : m_view.cells) {
::feed_hash(h, key_and_value.first);
::feed_hash(h, key_and_value.second);
template<>
struct appending_hash<collection_mutation_view> {
template<typename Hasher>
void operator()(Hasher& h, collection_mutation_view cell) const {
auto m_view = collection_type_impl::deserialize_mutation_form(cell);
::feed_hash(h, m_view.tomb);
for (auto&& key_and_value : m_view.cells) {
::feed_hash(h, key_and_value.first);
::feed_hash(h, key_and_value.second);
}
}
}
};
template<>
struct appending_hash<atomic_cell_view> {
@@ -55,3 +57,19 @@ struct appending_hash<atomic_cell_view> {
}
}
};
template<>
struct appending_hash<atomic_cell> {
template<typename Hasher>
void operator()(Hasher& h, const atomic_cell& cell) const {
feed_hash(h, static_cast<atomic_cell_view>(cell));
}
};
template<>
struct appending_hash<collection_mutation> {
template<typename Hasher>
void operator()(Hasher& h, const collection_mutation& cm) const {
feed_hash(h, static_cast<collection_mutation_view>(cm));
}
};

View File

@@ -27,11 +27,10 @@
// A variant type that can hold either an atomic_cell, or a serialized collection.
// Which type is stored is determined by the schema.
// Has an "empty" state.
// Objects moved-from are left in an empty state.
class atomic_cell_or_collection final {
managed_bytes _data;
template<typename T>
friend class db::serializer;
private:
atomic_cell_or_collection(managed_bytes&& data) : _data(std::move(data)) {}
public:
@@ -39,6 +38,7 @@ public:
atomic_cell_or_collection(atomic_cell ac) : _data(std::move(ac._data)) {}
static atomic_cell_or_collection from_atomic_cell(atomic_cell data) { return { std::move(data._data) }; }
atomic_cell_view as_atomic_cell() const { return atomic_cell_view::from_bytes(_data); }
atomic_cell_ref as_atomic_cell_ref() { return { _data }; }
atomic_cell_or_collection(collection_mutation cm) : _data(std::move(cm.data)) {}
explicit operator bool() const {
return !_data.empty();
@@ -63,11 +63,5 @@ public:
::feed_hash(as_collection_mutation(), h, def.type);
}
}
void linearize() {
_data.linearize();
}
void unlinearize() {
_data.scatter();
}
friend std::ostream& operator<<(std::ostream&, const atomic_cell_or_collection&);
};

View File

@@ -103,35 +103,41 @@ static auth_migration_listener auth_migration;
* Should be abstracted to some sort of global server function
* probably.
*/
struct waiter {
promise<> done;
timer<> tmr;
waiter() : tmr([this] {done.set_value();})
{
tmr.arm(auth::auth::SUPERUSER_SETUP_DELAY);
}
~waiter() {
if (tmr.armed()) {
tmr.cancel();
done.set_exception(std::runtime_error("shutting down"));
}
logger.trace("Deleting scheduled task");
}
void kill() {
}
};
typedef std::unique_ptr<waiter> waiter_ptr;
static std::vector<waiter_ptr> & thread_waiters() {
static thread_local std::vector<waiter_ptr> the_waiters;
return the_waiters;
}
void auth::auth::schedule_when_up(scheduled_func f) {
struct waiter {
promise<> done;
timer<> tmr;
waiter() : tmr([this] {done.set_value();})
{
tmr.arm(SUPERUSER_SETUP_DELAY);
}
~waiter() {
if (tmr.armed()) {
tmr.cancel();
done.set_exception(std::runtime_error("shutting down"));
}
logger.trace("Deleting scheduled task");
}
void kill() {
}
};
typedef std::unique_ptr<waiter> waiter_ptr;
static thread_local std::vector<waiter_ptr> waiters;
logger.trace("Adding scheduled task");
auto & waiters = thread_waiters();
waiters.emplace_back(std::make_unique<waiter>());
auto* w = waiters.back().get();
w->done.get_future().finally([w] {
auto & waiters = thread_waiters();
auto i = std::find_if(waiters.begin(), waiters.end(), [w](const waiter_ptr& p) {
return p.get() == w;
});
@@ -146,7 +152,6 @@ void auth::auth::schedule_when_up(scheduled_func f) {
});
}
bool auth::auth::is_class_type(const sstring& type, const sstring& classname) {
if (type == classname) {
return true;
@@ -205,6 +210,15 @@ future<> auth::auth::setup() {
});
}
future<> auth::auth::shutdown() {
// just make sure we don't have pending tasks.
// this is mostly relevant for test cases where
// db-env-shutdown != process shutdown
return smp::invoke_on_all([] {
thread_waiters().clear();
});
}
static db::consistency_level consistency_for_user(const sstring& username) {
if (username == auth::auth::DEFAULT_SUPERUSER_NAME) {
return db::consistency_level::QUORUM;

View File

@@ -102,6 +102,7 @@ public:
* Sets up Authenticator and Authorizer.
*/
static future<> setup();
static future<> shutdown();
/**
* Set up table from given CREATE TABLE statement under system_auth keyspace, if not already done so.

View File

@@ -21,11 +21,12 @@
#pragma once
#include "types.hh"
#include "net/byteorder.hh"
#include <boost/range/iterator_range.hpp>
#include "bytes.hh"
#include "core/unaligned.hh"
#include "hashing.hh"
#include "seastar/core/simple-stream.hh"
/**
* Utility for writing data into a buffer when its final size is not known up front.
*
@@ -42,6 +43,14 @@ private:
struct chunk {
// FIXME: group fragment pointers to reduce pointer chasing when packetizing
std::unique_ptr<chunk> next;
~chunk() {
auto p = std::move(next);
while (p) {
// Avoid recursion when freeing chunks
auto p_next = std::move(p->next);
p = std::move(p_next);
}
}
size_type offset; // Also means "size" after chunk is closed
size_type size;
value_type data[0];
@@ -163,16 +172,12 @@ public:
template <typename T>
struct place_holder {
value_type* ptr;
// makes the place_holder look like a stream
seastar::simple_output_stream get_stream() {
return seastar::simple_output_stream{reinterpret_cast<char*>(ptr)};
}
};
// Writes given values in big-endian format
template <typename T>
inline
std::enable_if_t<std::is_fundamental<T>::value, void>
write(T val) {
*reinterpret_cast<unaligned<T>*>(alloc(sizeof(T))) = net::hton(val);
}
// Returns a place holder for a value to be written later.
template <typename T>
inline
@@ -210,19 +215,6 @@ public:
write(bytes_view(reinterpret_cast<const signed char*>(ptr), size));
}
// Writes given sequence of bytes with a preceding length component encoded in big-endian format
inline void write_blob(bytes_view v) {
assert((size_type)v.size() == v.size());
write<size_type>(v.size());
write(v);
}
// Writes given value into the place holder in big-endian format
template <typename T>
inline void set(place_holder<T> ph, T val) {
*reinterpret_cast<unaligned<T>*>(ph.ptr) = net::hton(val);
}
bool is_linearized() const {
return !_begin || !_begin->next;
}

View File

@@ -24,80 +24,66 @@
#include "mutation_partition_serializer.hh"
#include "converting_mutation_partition_applier.hh"
#include "hashing_partition_visitor.hh"
template class db::serializer<canonical_mutation>;
//
// Representation layout:
//
// <canonical_mutation> ::= <column_family_id> <table_schema_version> <partition_key> <column-mapping> <partition>
//
// For <partition> see mutation_partition_serializer.cc
// For <column-mapping> see db::serializer<column_mapping>
//
#include "utils/UUID.hh"
#include "serializer.hh"
#include "idl/uuid.dist.hh"
#include "idl/keys.dist.hh"
#include "idl/mutation.dist.hh"
#include "serializer_impl.hh"
#include "serialization_visitors.hh"
#include "idl/uuid.dist.impl.hh"
#include "idl/keys.dist.impl.hh"
#include "idl/mutation.dist.impl.hh"
canonical_mutation::canonical_mutation(bytes data)
: _data(std::move(data))
{ }
canonical_mutation::canonical_mutation(const mutation& m)
: _data([&m] {
bytes_ostream out;
db::serializer<utils::UUID>(m.column_family_id()).write(out);
db::serializer<table_schema_version>(m.schema()->version()).write(out);
db::serializer<partition_key_view>(m.key()).write(out);
db::serializer<column_mapping>(m.schema()->get_column_mapping()).write(out);
mutation_partition_serializer ser(*m.schema(), m.partition());
ser.write(out);
return to_bytes(out.linearize());
}())
{ }
{
mutation_partition_serializer part_ser(*m.schema(), m.partition());
bytes_ostream out;
ser::writer_of_canonical_mutation wr(out);
std::move(wr).write_table_id(m.schema()->id())
.write_schema_version(m.schema()->version())
.write_key(m.key())
.write_mapping(m.schema()->get_column_mapping())
.partition([&] (auto wr) {
part_ser.write(std::move(wr));
}).end_canonical_mutation();
_data = to_bytes(out.linearize());
}
utils::UUID canonical_mutation::column_family_id() const {
data_input in(_data);
return db::serializer<utils::UUID>::read(in);
auto in = ser::as_input_stream(_data);
auto mv = ser::deserialize(in, boost::type<ser::canonical_mutation_view>());
return mv.table_id();
}
mutation canonical_mutation::to_mutation(schema_ptr s) const {
data_input in(_data);
auto in = ser::as_input_stream(_data);
auto mv = ser::deserialize(in, boost::type<ser::canonical_mutation_view>());
auto cf_id = db::serializer<utils::UUID>::read(in);
auto cf_id = mv.table_id();
if (s->id() != cf_id) {
throw std::runtime_error(sprint("Attempted to deserialize canonical_mutation of table %s with schema of table %s (%s.%s)",
cf_id, s->id(), s->ks_name(), s->cf_name()));
}
auto version = db::serializer<table_schema_version>::read(in);
auto pk = partition_key(db::serializer<partition_key_view>::read(in));
auto version = mv.schema_version();
auto pk = mv.key();
mutation m(std::move(pk), std::move(s));
if (version == m.schema()->version()) {
db::serializer<column_mapping>::skip(in);
auto partition_view = mutation_partition_serializer::read_as_view(in);
auto partition_view = mutation_partition_view::from_view(mv.partition());
m.partition().apply(*m.schema(), partition_view, *m.schema());
} else {
column_mapping cm = db::serializer<column_mapping>::read(in);
column_mapping cm = mv.mapping();
converting_mutation_partition_applier v(cm, *m.schema(), m.partition());
auto partition_view = mutation_partition_serializer::read_as_view(in);
auto partition_view = mutation_partition_view::from_view(mv.partition());
partition_view.accept(cm, v);
}
return m;
}
template<>
db::serializer<canonical_mutation>::serializer(const canonical_mutation& v)
: _item(v)
, _size(db::serializer<bytes>(v._data).size())
{ }
template<>
void
db::serializer<canonical_mutation>::write(output& out, const canonical_mutation& v) {
db::serializer<bytes>(v._data).write(out);
}
template<>
canonical_mutation db::serializer<canonical_mutation>::read(input& in) {
return canonical_mutation(db::serializer<bytes>::read(in));
}

View File

@@ -24,7 +24,6 @@
#include "bytes.hh"
#include "schema.hh"
#include "database_fwd.hh"
#include "db/serializer.hh"
#include "mutation_partition_visitor.hh"
#include "mutation_partition_serializer.hh"
@@ -33,8 +32,8 @@
// Safe to pass serialized across nodes.
class canonical_mutation {
bytes _data;
canonical_mutation(bytes);
public:
explicit canonical_mutation(bytes);
explicit canonical_mutation(const mutation&);
canonical_mutation(canonical_mutation&&) = default;
@@ -51,15 +50,6 @@ public:
utils::UUID column_family_id() const;
friend class db::serializer<canonical_mutation>;
const bytes& representation() const { return _data; }
};
namespace db {
template<> serializer<canonical_mutation>::serializer(const canonical_mutation&);
template<> void serializer<canonical_mutation>::write(output&, const canonical_mutation&);
template<> canonical_mutation serializer<canonical_mutation>::read(input&);
extern template class serializer<canonical_mutation>;
}

View File

@@ -26,29 +26,10 @@
#include <algorithm>
#include <vector>
#include <boost/range/iterator_range.hpp>
#include <boost/range/adaptor/transformed.hpp>
#include "utils/serialization.hh"
#include "unimplemented.hh"
// value_traits is meant to abstract away whether we are working on 'bytes'
// elements or 'bytes_opt' elements. We don't support optional values, but
// there are some generic layers which use this code which provide us with
// data in that format. In order to avoid allocation and rewriting that data
// into a new vector just to throw it away soon after that, we accept that
// format too.
template <typename T>
struct value_traits {
static const T& unwrap(const T& t) { return t; }
};
template<>
struct value_traits<bytes_opt> {
static const bytes& unwrap(const bytes_opt& t) {
assert(t);
return *t;
}
};
enum class allow_prefixes { no, yes };
template<allow_prefixes AllowPrefixes = allow_prefixes::no>
@@ -62,13 +43,14 @@ public:
static constexpr bool is_prefixable = AllowPrefixes == allow_prefixes::yes;
using prefix_type = compound_type<allow_prefixes::yes>;
using value_type = std::vector<bytes>;
using size_type = uint16_t;
compound_type(std::vector<data_type> types)
: _types(std::move(types))
, _byte_order_equal(std::all_of(_types.begin(), _types.end(), [] (auto t) {
return t->is_byte_order_equal();
}))
, _byte_order_comparable(!is_prefixable && _types.size() == 1 && _types[0]->is_byte_order_comparable())
, _byte_order_comparable(false)
, _is_reversed(_types.size() == 1 && _types[0]->is_reversed())
{ }
@@ -85,79 +67,54 @@ public:
prefix_type as_prefix() {
return prefix_type(_types);
}
private:
/*
* Format:
* <len(value1)><value1><len(value2)><value2>...<len(value_n-1)><value_n-1>(len(value_n))?<value_n>
* <len(value1)><value1><len(value2)><value2>...<len(value_n)><value_n>
*
* For non-prefixable compounds, the value corresponding to the last component of types doesn't
* have its length encoded, its length is deduced from the input range.
*
* serialize_value() and serialize_optionals() for single element rely on the fact that for a single-element
* compounds their serialized form is equal to the serialized form of the component.
*/
template<typename Wrapped>
void serialize_value(const std::vector<Wrapped>& values, bytes::iterator& out) {
if (AllowPrefixes == allow_prefixes::yes) {
assert(values.size() <= _types.size());
} else {
assert(values.size() == _types.size());
}
size_t n_left = _types.size();
for (auto&& wrapped : values) {
auto&& val = value_traits<Wrapped>::unwrap(wrapped);
assert(val.size() <= std::numeric_limits<uint16_t>::max());
if (--n_left || AllowPrefixes == allow_prefixes::yes) {
write<uint16_t>(out, uint16_t(val.size()));
}
template<typename RangeOfSerializedComponents>
static void serialize_value(RangeOfSerializedComponents&& values, bytes::iterator& out) {
for (auto&& val : values) {
assert(val.size() <= std::numeric_limits<size_type>::max());
write<size_type>(out, size_type(val.size()));
out = std::copy(val.begin(), val.end(), out);
}
}
template <typename Wrapped>
size_t serialized_size(const std::vector<Wrapped>& values) {
template <typename RangeOfSerializedComponents>
static size_t serialized_size(RangeOfSerializedComponents&& values) {
size_t len = 0;
size_t n_left = _types.size();
for (auto&& wrapped : values) {
auto&& val = value_traits<Wrapped>::unwrap(wrapped);
assert(val.size() <= std::numeric_limits<uint16_t>::max());
if (--n_left || AllowPrefixes == allow_prefixes::yes) {
len += sizeof(uint16_t);
}
len += val.size();
for (auto&& val : values) {
len += sizeof(size_type) + val.size();
}
return len;
}
public:
bytes serialize_single(bytes&& v) {
if (AllowPrefixes == allow_prefixes::no) {
assert(_types.size() == 1);
return std::move(v);
} else {
// FIXME: Optimize
std::vector<bytes> vec;
vec.reserve(1);
vec.emplace_back(std::move(v));
return ::serialize_value(*this, vec);
}
return serialize_value({std::move(v)});
}
bytes serialize_value(const std::vector<bytes>& values) {
return ::serialize_value(*this, values);
}
bytes serialize_value(std::vector<bytes>&& values) {
if (AllowPrefixes == allow_prefixes::no && _types.size() == 1 && values.size() == 1) {
return std::move(values[0]);
template<typename RangeOfSerializedComponents>
static bytes serialize_value(RangeOfSerializedComponents&& values) {
auto size = serialized_size(values);
if (size > std::numeric_limits<size_type>::max()) {
throw std::runtime_error(sprint("Key size too large: %d > %d", size, std::numeric_limits<size_type>::max()));
}
return ::serialize_value(*this, values);
bytes b(bytes::initialized_later(), size);
auto i = b.begin();
serialize_value(values, i);
return b;
}
template<typename T>
static bytes serialize_value(std::initializer_list<T> values) {
return serialize_value(boost::make_iterator_range(values.begin(), values.end()));
}
bytes serialize_optionals(const std::vector<bytes_opt>& values) {
return ::serialize_value(*this, values);
}
bytes serialize_optionals(std::vector<bytes_opt>&& values) {
if (AllowPrefixes == allow_prefixes::no && _types.size() == 1 && values.size() == 1) {
assert(values[0]);
return std::move(*values[0]);
}
return ::serialize_value(*this, values);
return serialize_value(values | boost::adaptors::transformed([] (const bytes_opt& bo) -> bytes_view {
if (!bo) {
throw std::logic_error("attempted to create key component from empty optional");
}
return *bo;
}));
}
bytes serialize_value_deep(const std::vector<data_value>& values) {
// TODO: Optimize
@@ -171,37 +128,21 @@ public:
return serialize_value(partial);
}
bytes decompose_value(const value_type& values) {
return ::serialize_value(*this, values);
return serialize_value(values);
}
class iterator : public std::iterator<std::input_iterator_tag, bytes_view> {
private:
ssize_t _types_left;
bytes_view _v;
value_type _current;
private:
void read_current() {
if (_types_left == 0) {
if (!_v.empty()) {
throw marshal_exception();
}
_v = bytes_view(nullptr, 0);
return;
}
--_types_left;
uint16_t len;
if (_types_left == 0 && AllowPrefixes == allow_prefixes::no) {
len = _v.size();
} else {
size_type len;
{
if (_v.empty()) {
if (AllowPrefixes == allow_prefixes::yes) {
_types_left = 0;
_v = bytes_view(nullptr, 0);
return;
} else {
throw marshal_exception();
}
_v = bytes_view(nullptr, 0);
return;
}
len = read_simple<uint16_t>(_v);
len = read_simple<size_type>(_v);
if (_v.size() < len) {
throw marshal_exception();
}
@@ -211,10 +152,10 @@ public:
}
public:
struct end_iterator_tag {};
iterator(const compound_type& t, const bytes_view& v) : _types_left(t._types.size()), _v(v) {
iterator(const bytes_view& v) : _v(v) {
read_current();
}
iterator(end_iterator_tag, const bytes_view& v) : _types_left(0), _v(nullptr, 0) {}
iterator(end_iterator_tag, const bytes_view& v) : _v(nullptr, 0) {}
iterator& operator++() {
read_current();
return *this;
@@ -226,21 +167,18 @@ public:
}
const value_type& operator*() const { return _current; }
const value_type* operator->() const { return &_current; }
bool operator!=(const iterator& i) const { return _v.begin() != i._v.begin() || _types_left != i._types_left; }
bool operator==(const iterator& i) const { return _v.begin() == i._v.begin() && _types_left == i._types_left; }
bool operator!=(const iterator& i) const { return _v.begin() != i._v.begin(); }
bool operator==(const iterator& i) const { return _v.begin() == i._v.begin(); }
};
iterator begin(const bytes_view& v) const {
return iterator(*this, v);
static iterator begin(const bytes_view& v) {
return iterator(v);
}
iterator end(const bytes_view& v) const {
static iterator end(const bytes_view& v) {
return iterator(typename iterator::end_iterator_tag(), v);
}
boost::iterator_range<iterator> components(const bytes_view& v) const {
static boost::iterator_range<iterator> components(const bytes_view& v) {
return { begin(v), end(v) };
}
auto iter_items(const bytes_view& v) {
return boost::iterator_range<iterator>(begin(v), end(v));
}
value_type deserialize_value(bytes_view v) {
std::vector<bytes> result;
result.reserve(_types.size());
@@ -258,7 +196,7 @@ public:
}
auto t = _types.begin();
size_t h = 0;
for (auto&& value : iter_items(v)) {
for (auto&& value : components(v)) {
h ^= (*t)->hash(value);
++t;
}
@@ -277,12 +215,6 @@ public:
return type->compare(v1, v2);
});
}
bytes from_string(sstring_view s) {
throw std::runtime_error(sprint("%s not implemented", __PRETTY_FUNCTION__));
}
sstring to_string(const bytes& b) {
throw std::runtime_error(sprint("%s not implemented", __PRETTY_FUNCTION__));
}
// Returns true iff the given prefix has no missing components
bool is_full(bytes_view v) const {
assert(AllowPrefixes == allow_prefixes::yes);

View File

@@ -25,6 +25,31 @@ from distutils.spawn import find_executable
configure_args = str.join(' ', [shlex.quote(x) for x in sys.argv[1:]])
for line in open('/etc/os-release'):
key, _, value = line.partition('=')
value = value.strip().strip('"')
if key == 'ID':
os_ids = [value]
if key == 'ID_LIKE':
os_ids += value.split(' ')
# distribution "internationalization", converting package names.
# The Fedora name is the key; the value is a distro -> package-name dict.
i18n_xlat = {
'boost-devel': {
'debian': 'libboost-dev',
'ubuntu': 'libboost-dev (libboost1.55-dev on 14.04)',
},
}
def pkgname(name):
if name in i18n_xlat:
dict = i18n_xlat[name]
for id in os_ids:
if id in dict:
return dict[id]
return name
def get_flags():
with open('/proc/cpuinfo') as f:
for line in f:
@@ -137,6 +162,7 @@ modes = {
scylla_tests = [
'tests/mutation_test',
'tests/schema_registry_test',
'tests/canonical_mutation_test',
'tests/range_test',
'tests/types_test',
@@ -167,7 +193,6 @@ scylla_tests = [
'tests/commitlog_test',
'tests/cartesian_product_test',
'tests/hash_test',
'tests/serializer_test',
'tests/map_difference_test',
'tests/message',
'tests/gossip',
@@ -190,6 +215,7 @@ scylla_tests = [
'tests/flush_queue_test',
'tests/dynamic_bitset_test',
'tests/auth_test',
'tests/idl_test',
]
apps = [
@@ -198,7 +224,11 @@ apps = [
tests = scylla_tests
all_artifacts = apps + tests
other = [
'iotune',
]
all_artifacts = apps + tests + other
arg_parser = argparse.ArgumentParser('Configure scylla')
arg_parser.add_argument('--static', dest = 'static', action = 'store_const', default = '',
@@ -235,7 +265,6 @@ add_tristate(arg_parser, name = 'xen', dest = 'xen', help = 'Xen support')
args = arg_parser.parse_args()
defines = []
scylla_libs = '-llz4 -lsnappy -lz -lboost_thread -lcryptopp -lrt -lyaml-cpp -lboost_date_time'
extra_cxxflags = {}
@@ -289,6 +318,7 @@ scylla_core = (['database.cc',
'cql3/statements/cf_statement.cc',
'cql3/statements/create_keyspace_statement.cc',
'cql3/statements/create_table_statement.cc',
'cql3/statements/create_type_statement.cc',
'cql3/statements/drop_keyspace_statement.cc',
'cql3/statements/drop_table_statement.cc',
'cql3/statements/schema_altering_statement.cc',
@@ -343,7 +373,7 @@ scylla_core = (['database.cc',
'db/schema_tables.cc',
'db/commitlog/commitlog.cc',
'db/commitlog/commitlog_replayer.cc',
'db/serializer.cc',
'db/commitlog/commitlog_entry.cc',
'db/config.cc',
'db/index/secondary_index.cc',
'db/marshal/type_parser.cc',
@@ -357,6 +387,7 @@ scylla_core = (['database.cc',
'utils/rate_limiter.cc',
'utils/file_lock.cc',
'utils/dynamic_bitset.cc',
'utils/managed_bytes.cc',
'gms/version_generator.cc',
'gms/versioned_value.cc',
'gms/gossiper.cc',
@@ -378,6 +409,7 @@ scylla_core = (['database.cc',
'locator/simple_strategy.cc',
'locator/local_strategy.cc',
'locator/network_topology_strategy.cc',
'locator/everywhere_replication_strategy.cc',
'locator/token_metadata.cc',
'locator/locator.cc',
'locator/snitch_base.cc',
@@ -391,7 +423,6 @@ scylla_core = (['database.cc',
'service/client_state.cc',
'service/migration_task.cc',
'service/storage_service.cc',
'service/pending_range_calculator_service.cc',
'service/load_broadcaster.cc',
'service/pager/paging_state.cc',
'service/pager/query_pagers.cc',
@@ -471,6 +502,14 @@ idls = ['idl/gossip_digest.idl.hh',
'idl/reconcilable_result.idl.hh',
'idl/streaming.idl.hh',
'idl/paging_state.idl.hh',
'idl/frozen_schema.idl.hh',
'idl/partition_checksum.idl.hh',
'idl/replay_position.idl.hh',
'idl/truncation_record.idl.hh',
'idl/mutation.idl.hh',
'idl/query.idl.hh',
'idl/idl_test.idl.hh',
'idl/commitlog.idl.hh',
]
scylla_tests_dependencies = scylla_core + api + idls + [
@@ -514,6 +553,7 @@ tests_not_using_seastar_test_framework = set([
'tests/perf/perf_sstable',
'tests/managed_vector_test',
'tests/dynamic_bitset_test',
'tests/idl_test',
])
for t in tests_not_using_seastar_test_framework:
@@ -556,16 +596,44 @@ else:
args.pie = ''
args.fpie = ''
optional_packages = ['libsystemd']
# a list element means a list of alternative packages to consider
# the first element becomes the HAVE_pkg define
# a string element is a package name with no alternatives
optional_packages = [['libsystemd', 'libsystemd-daemon']]
pkgs = []
for pkg in optional_packages:
if have_pkg(pkg):
pkgs.append(pkg)
upkg = pkg.upper().replace('-', '_')
defines.append('HAVE_{}=1'.format(upkg))
else:
print('Missing optional package {pkg}'.format(**locals()))
def setup_first_pkg_of_list(pkglist):
# The HAVE_pkg symbol is taken from the first alternative
upkg = pkglist[0].upper().replace('-', '_')
for pkg in pkglist:
if have_pkg(pkg):
pkgs.append(pkg)
defines.append('HAVE_{}=1'.format(upkg))
return True
return False
for pkglist in optional_packages:
if isinstance(pkglist, str):
pkglist = [pkglist]
if not setup_first_pkg_of_list(pkglist):
if len(pkglist) == 1:
print('Missing optional package {pkglist[0]}'.format(**locals()))
else:
alternatives = ':'.join(pkglist[1:])
print('Missing optional package {pkglist[0]} (or alternatives {alternatives})'.format(**locals()))
if not try_compile(compiler=args.cxx, source='#include <boost/version.hpp>'):
print('Boost not installed. Please install {}.'.format(pkgname("boost-devel")))
sys.exit(1)
if not try_compile(compiler=args.cxx, source='''\
#include <boost/version.hpp>
#if BOOST_VERSION < 105500
#error Boost version too low
#endif
'''):
print('Installed boost version too old. Please update {}.'.format(pkgname("boost-devel")))
sys.exit(1)
defines = ' '.join(['-D' + d for d in defines])
@@ -595,6 +663,8 @@ if args.dpdk:
seastar_flags += ['--enable-dpdk']
elif args.dpdk_target:
seastar_flags += ['--dpdk-target', args.dpdk_target]
if args.staticcxx:
seastar_flags += ['--static-stdc++']
seastar_cflags = args.user_cflags + " -march=nehalem"
seastar_flags += ['--compiler', args.cxx, '--cflags=%s' % (seastar_cflags)]
@@ -628,7 +698,7 @@ for mode in build_modes:
seastar_deps = 'practically_anything_can_change_so_lets_run_it_every_time_and_restat.'
args.user_cflags += " " + pkg_config("--cflags", "jsoncpp")
libs = "-lyaml-cpp -llz4 -lz -lsnappy " + pkg_config("--libs", "jsoncpp") + ' -lboost_filesystem' + ' -lcrypt'
libs = "-lyaml-cpp -llz4 -lz -lsnappy " + pkg_config("--libs", "jsoncpp") + ' -lboost_filesystem' + ' -lcrypt' + ' -lboost_date_time'
for pkg in pkgs:
args.user_cflags += ' ' + pkg_config('--cflags', pkg)
libs += ' ' + pkg_config('--libs', pkg)
@@ -664,12 +734,15 @@ with open(buildfile, 'w') as f:
command = seastar/json/json2code.py -f $in -o $out
description = SWAGGER $out
rule serializer
command = ./idl-compiler.py --ns ser -f $in -o $out
command = {python} ./idl-compiler.py --ns ser -f $in -o $out
description = IDL compiler $out
rule ninja
command = {ninja} -C $subdir $target
restat = 1
description = NINJA $out
rule copy
command = cp $in $out
description = COPY $out
''').format(**globals()))
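The rule text above is emitted by formatting a dedented template, so `{python}` and `{ninja}` are substituted at generation time while ninja's own `$in`/`$out`/`$subdir` variables pass through untouched. A minimal sketch (the `ninja` value here is a hypothetical placeholder):

```python
import textwrap

ninja = 'ninja-build'  # hypothetical tool name for illustration
rule = textwrap.dedent('''\
    rule ninja
      command = {ninja} -C $subdir $target
      restat = 1
''').format(ninja=ninja)
```

Only brace-wrapped names are touched by `.format()`, which is why the `$`-prefixed ninja variables survive verbatim.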
for mode in build_modes:
modeval = modes[mode]
@@ -706,6 +779,8 @@ with open(buildfile, 'w') as f:
thrifts = set()
antlr3_grammars = set()
for binary in build_artifacts:
if binary in other:
continue
srcs = deps[binary]
objs = ['$builddir/' + mode + '/' + src.replace('.cc', '.o')
for src in srcs
@@ -771,7 +846,8 @@ with open(buildfile, 'w') as f:
for obj in compiles:
src = compiles[obj]
gen_headers = list(ragels.keys())
gen_headers += ['seastar/build/{}/http/request_parser.hh'.format(mode)]
gen_headers += ['seastar/build/{}/gen/http/request_parser.hh'.format(mode)]
gen_headers += ['seastar/build/{}/gen/http/http_response_parser.hh'.format(mode)]
for th in thrifts:
gen_headers += th.headers('$builddir/{}/gen'.format(mode))
for g in antlr3_grammars:
@@ -802,10 +878,14 @@ with open(buildfile, 'w') as f:
grammar.source.rsplit('.', 1)[0]))
for cc in grammar.sources('$builddir/{}/gen'.format(mode)):
obj = cc.replace('.cpp', '.o')
f.write('build {}: cxx.{} {}\n'.format(obj, mode, cc))
f.write('build seastar/build/{}/libseastar.a: ninja {}\n'.format(mode, seastar_deps))
f.write('build {}: cxx.{} {} || {}\n'.format(obj, mode, cc, ' '.join(serializers)))
f.write('build seastar/build/{mode}/libseastar.a seastar/build/{mode}/apps/iotune/iotune seastar/build/{mode}/gen/http/request_parser.hh seastar/build/{mode}/gen/http/http_response_parser.hh: ninja {seastar_deps}\n'
.format(**locals()))
f.write(' subdir = seastar\n')
f.write(' target = build/{}/libseastar.a\n'.format(mode))
f.write(' target = build/{mode}/libseastar.a build/{mode}/apps/iotune/iotune build/{mode}/gen/http/request_parser.hh build/{mode}/gen/http/http_response_parser.hh\n'.format(**locals()))
f.write(textwrap.dedent('''\
build build/{mode}/iotune: copy seastar/build/{mode}/apps/iotune/iotune
''').format(**locals()))
f.write('build {}: phony\n'.format(seastar_deps))
f.write(textwrap.dedent('''\
rule configure
@@ -816,10 +896,6 @@ with open(buildfile, 'w') as f:
command = find -name '*.[chS]' -o -name "*.cc" -o -name "*.hh" | cscope -bq -i-
description = CSCOPE
build cscope: cscope
rule request_parser_hh
command = {ninja} -C seastar build/release/gen/http/request_parser.hh build/debug/gen/http/request_parser.hh
description = GEN seastar/http/request_parser.hh
build seastar/build/release/http/request_parser.hh seastar/build/debug/http/request_parser.hh: request_parser_hh
rule clean
command = rm -rf build
description = CLEAN


@@ -75,7 +75,7 @@ public:
}
virtual void accept_static_cell(column_id id, atomic_cell_view cell) override {
const column_mapping::column& col = _visited_column_mapping.static_column_at(id);
const column_mapping_entry& col = _visited_column_mapping.static_column_at(id);
const column_definition* def = _p_schema.get_column_definition(col.name());
if (def) {
accept_cell(_p._static_row, column_kind::static_column, *def, col.type(), cell);
@@ -83,7 +83,7 @@ public:
}
virtual void accept_static_cell(column_id id, collection_mutation_view collection) override {
const column_mapping::column& col = _visited_column_mapping.static_column_at(id);
const column_mapping_entry& col = _visited_column_mapping.static_column_at(id);
const column_definition* def = _p_schema.get_column_definition(col.name());
if (def) {
accept_cell(_p._static_row, column_kind::static_column, *def, col.type(), collection);
@@ -102,7 +102,7 @@ public:
}
virtual void accept_row_cell(column_id id, atomic_cell_view cell) override {
const column_mapping::column& col = _visited_column_mapping.regular_column_at(id);
const column_mapping_entry& col = _visited_column_mapping.regular_column_at(id);
const column_definition* def = _p_schema.get_column_definition(col.name());
if (def) {
accept_cell(_current_row->cells(), column_kind::regular_column, *def, col.type(), cell);
@@ -110,7 +110,7 @@ public:
}
virtual void accept_row_cell(column_id id, collection_mutation_view collection) override {
const column_mapping::column& col = _visited_column_mapping.regular_column_at(id);
const column_mapping_entry& col = _visited_column_mapping.regular_column_at(id);
const column_definition* def = _p_schema.get_column_definition(col.name());
if (def) {
accept_cell(_current_row->cells(), column_kind::regular_column, *def, col.type(), collection);


@@ -36,6 +36,7 @@ options {
#include "cql3/statements/drop_keyspace_statement.hh"
#include "cql3/statements/create_index_statement.hh"
#include "cql3/statements/create_table_statement.hh"
#include "cql3/statements/create_type_statement.hh"
#include "cql3/statements/property_definitions.hh"
#include "cql3/statements/drop_table_statement.hh"
#include "cql3/statements/truncate_statement.hh"
@@ -283,7 +284,9 @@ cqlStatement returns [shared_ptr<parsed_statement> stmt]
| st22=listUsersStatement { $stmt = st22; }
| st23=createTriggerStatement { $stmt = st23; }
| st24=dropTriggerStatement { $stmt = st24; }
#endif
| st25=createTypeStatement { $stmt = st25; }
#if 0
| st26=alterTypeStatement { $stmt = st26; }
| st27=dropTypeStatement { $stmt = st27; }
| st28=createFunctionStatement { $stmt = st28; }
@@ -695,7 +698,6 @@ cfamOrdering[shared_ptr<cql3::statements::create_table_statement::raw_statement>
;
#if 0
/**
* CREATE TYPE foo (
* <name1> <type1>,
@@ -703,17 +705,16 @@ cfamOrdering[shared_ptr<cql3::statements::create_table_statement::raw_statement>
* ....
* )
*/
createTypeStatement returns [CreateTypeStatement expr]
@init { boolean ifNotExists = false; }
: K_CREATE K_TYPE (K_IF K_NOT K_EXISTS { ifNotExists = true; } )?
tn=userTypeName { $expr = new CreateTypeStatement(tn, ifNotExists); }
createTypeStatement returns [::shared_ptr<create_type_statement> expr]
@init { bool if_not_exists = false; }
: K_CREATE K_TYPE (K_IF K_NOT K_EXISTS { if_not_exists = true; } )?
tn=userTypeName { $expr = ::make_shared<create_type_statement>(tn, if_not_exists); }
'(' typeColumns[expr] ( ',' typeColumns[expr]? )* ')'
;
typeColumns[CreateTypeStatement expr]
: k=ident v=comparatorType { $expr.addDefinition(k, v); }
typeColumns[::shared_ptr<create_type_statement> expr]
: k=ident v=comparatorType { $expr->add_definition(k, v); }
;
#endif
/**


@@ -737,7 +737,7 @@ public:
/** A condition on a collection element. For example: "IF col['key'] = 'foo'" */
static ::shared_ptr<raw> collection_condition(::shared_ptr<term::raw> value, ::shared_ptr<term::raw> collection_element,
const operator_type& op) {
return ::make_shared<raw>(std::move(value), std::vector<::shared_ptr<term::raw>>{}, ::shared_ptr<abstract_marker::in_raw>{}, std::move(collection_element), operator_type::IN);
return ::make_shared<raw>(std::move(value), std::vector<::shared_ptr<term::raw>>{}, ::shared_ptr<abstract_marker::in_raw>{}, std::move(collection_element), op);
}
/** An IN condition on a collection element. For example: "IF col['key'] IN ('foo', 'bar', ...)" */


@@ -121,3 +121,7 @@ column_identifier::new_selector_factory(database& db, schema_ptr schema, std::ve
}
}
bool cql3::column_identifier::text_comparator::operator()(const cql3::column_identifier& c1, const cql3::column_identifier& c2) const {
return c1.text() < c2.text();
}


@@ -61,6 +61,11 @@ public:
private:
sstring _text;
public:
// less-than comparator sorting by text
struct text_comparator {
bool operator()(const column_identifier& c1, const column_identifier& c2) const;
};
column_identifier(sstring raw_text, bool keep_case);
column_identifier(bytes bytes_, data_type type);


@@ -58,10 +58,10 @@ public:
virtual void reset() override {
_count = 0;
}
virtual opt_bytes compute(serialization_format sf) override {
virtual opt_bytes compute(cql_serialization_format sf) override {
return long_type->decompose(_count);
}
virtual void add_input(serialization_format sf, const std::vector<opt_bytes>& values) override {
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) override {
++_count;
}
};
@@ -83,10 +83,10 @@ public:
virtual void reset() override {
_sum = {};
}
virtual opt_bytes compute(serialization_format sf) override {
virtual opt_bytes compute(cql_serialization_format sf) override {
return data_type_for<Type>()->decompose(_sum);
}
virtual void add_input(serialization_format sf, const std::vector<opt_bytes>& values) override {
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) override {
if (!values[0]) {
return;
}
@@ -120,14 +120,14 @@ public:
_sum = {};
_count = 0;
}
virtual opt_bytes compute(serialization_format sf) override {
virtual opt_bytes compute(cql_serialization_format sf) override {
Type ret = 0;
if (_count) {
ret = _sum / _count;
}
return data_type_for<Type>()->decompose(ret);
}
virtual void add_input(serialization_format sf, const std::vector<opt_bytes>& values) override {
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) override {
if (!values[0]) {
return;
}
@@ -159,13 +159,13 @@ public:
virtual void reset() override {
_max = {};
}
virtual opt_bytes compute(serialization_format sf) override {
virtual opt_bytes compute(cql_serialization_format sf) override {
if (!_max) {
return {};
}
return data_type_for<Type>()->decompose(*_max);
}
virtual void add_input(serialization_format sf, const std::vector<opt_bytes>& values) override {
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) override {
if (!values[0]) {
return;
}
@@ -206,13 +206,13 @@ public:
virtual void reset() override {
_min = {};
}
virtual opt_bytes compute(serialization_format sf) override {
virtual opt_bytes compute(cql_serialization_format sf) override {
if (!_min) {
return {};
}
return data_type_for<Type>()->decompose(*_min);
}
virtual void add_input(serialization_format sf, const std::vector<opt_bytes>& values) override {
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) override {
if (!values[0]) {
return;
}
@@ -255,10 +255,10 @@ public:
virtual void reset() override {
_count = 0;
}
virtual opt_bytes compute(serialization_format sf) override {
virtual opt_bytes compute(cql_serialization_format sf) override {
return long_type->decompose(_count);
}
virtual void add_input(serialization_format sf, const std::vector<opt_bytes>& values) override {
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) override {
if (!values[0]) {
return;
}
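The aggregates above share one contract: `reset()` clears state, `add_input()` ignores null inputs, and `compute()` serializes the current value (with avg dividing only when the count is non-zero). A minimal sketch of that contract, using plain Python values in place of serialized bytes:

```python
class AvgAggregate:
    # Sketch of the avg aggregate contract: null inputs are skipped,
    # and compute() returns 0 when nothing was accumulated.
    def __init__(self):
        self.reset()

    def reset(self):
        self._sum = 0
        self._count = 0

    def add_input(self, values):
        if values[0] is None:
            return
        self._sum += values[0]
        self._count += 1

    def compute(self):
        return self._sum // self._count if self._count else 0

agg = AvgAggregate()
for v in [4, None, 8]:
    agg.add_input([v])
```

The null-skipping matters: a column with many nulls should average only the present values, not treat nulls as zeros.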


@@ -77,7 +77,7 @@ public:
* @param protocol_version native protocol version
* @param values the values to add to the aggregate.
*/
virtual void add_input(serialization_format sf, const std::vector<opt_bytes>& values) = 0;
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) = 0;
/**
* Computes and returns the aggregate current value.
@@ -85,7 +85,7 @@ public:
* @param protocol_version native protocol version
* @return the aggregate current value.
*/
virtual opt_bytes compute(serialization_format sf) = 0;
virtual opt_bytes compute(cql_serialization_format sf) = 0;
/**
* Reset this aggregate.


@@ -58,7 +58,7 @@ shared_ptr<function>
make_to_blob_function(data_type from_type) {
auto name = from_type->as_cql3_type()->to_string() + "asblob";
return make_native_scalar_function<true>(name, bytes_type, { from_type },
[] (serialization_format sf, const std::vector<bytes_opt>& parameters) {
[] (cql_serialization_format sf, const std::vector<bytes_opt>& parameters) {
return parameters[0];
});
}
@@ -68,7 +68,7 @@ shared_ptr<function>
make_from_blob_function(data_type to_type) {
sstring name = sstring("blobas") + to_type->as_cql3_type()->to_string();
return make_native_scalar_function<true>(name, to_type, { bytes_type },
[name, to_type] (serialization_format sf, const std::vector<bytes_opt>& parameters) -> bytes_opt {
[name, to_type] (cql_serialization_format sf, const std::vector<bytes_opt>& parameters) -> bytes_opt {
auto&& val = parameters[0];
if (!val) {
return val;
@@ -89,7 +89,7 @@ inline
shared_ptr<function>
make_varchar_as_blob_fct() {
return make_native_scalar_function<true>("varcharasblob", bytes_type, { utf8_type },
[] (serialization_format sf, const std::vector<bytes_opt>& parameters) -> bytes_opt {
[] (cql_serialization_format sf, const std::vector<bytes_opt>& parameters) -> bytes_opt {
return parameters[0];
});
}
@@ -98,7 +98,7 @@ inline
shared_ptr<function>
make_blob_as_varchar_fct() {
return make_native_scalar_function<true>("blobasvarchar", utf8_type, { bytes_type },
[] (serialization_format sf, const std::vector<bytes_opt>& parameters) -> bytes_opt {
[] (cql_serialization_format sf, const std::vector<bytes_opt>& parameters) -> bytes_opt {
return parameters[0];
});
}


@@ -61,11 +61,11 @@ public:
virtual shared_ptr<terminal> bind(const query_options& options) override;
virtual bytes_view_opt bind_and_get(const query_options& options) override;
private:
static bytes_opt execute_internal(serialization_format sf, scalar_function& fun, std::vector<bytes_opt> params);
static bytes_opt execute_internal(cql_serialization_format sf, scalar_function& fun, std::vector<bytes_opt> params);
public:
virtual bool contains_bind_marker() const override;
private:
static shared_ptr<terminal> make_terminal(shared_ptr<function> fun, bytes_opt result, serialization_format sf);
static shared_ptr<terminal> make_terminal(shared_ptr<function> fun, bytes_opt result, cql_serialization_format sf);
public:
class raw : public term::raw {
function_name _name;


@@ -299,7 +299,7 @@ function_call::collect_marker_specification(shared_ptr<variable_specifications>
shared_ptr<terminal>
function_call::bind(const query_options& options) {
return make_terminal(_fun, to_bytes_opt(bind_and_get(options)), options.get_serialization_format());
return make_terminal(_fun, to_bytes_opt(bind_and_get(options)), options.get_cql_serialization_format());
}
bytes_view_opt
@@ -315,12 +315,12 @@ function_call::bind_and_get(const query_options& options) {
}
buffers.push_back(std::move(to_bytes_opt(val)));
}
auto result = execute_internal(options.get_serialization_format(), *_fun, std::move(buffers));
auto result = execute_internal(options.get_cql_serialization_format(), *_fun, std::move(buffers));
return options.make_temporary(result);
}
bytes_opt
function_call::execute_internal(serialization_format sf, scalar_function& fun, std::vector<bytes_opt> params) {
function_call::execute_internal(cql_serialization_format sf, scalar_function& fun, std::vector<bytes_opt> params) {
bytes_opt result = fun.execute(sf, params);
try {
// Check the method didn't lie about its declared return type
@@ -347,7 +347,7 @@ function_call::contains_bind_marker() const {
}
shared_ptr<terminal>
function_call::make_terminal(shared_ptr<function> fun, bytes_opt result, serialization_format sf) {
function_call::make_terminal(shared_ptr<function> fun, bytes_opt result, cql_serialization_format sf) {
if (!dynamic_pointer_cast<const collection_type_impl>(fun->return_type())) {
return ::make_shared<constants::value>(std::move(result));
}
@@ -413,7 +413,7 @@ function_call::raw::prepare(database& db, const sstring& keyspace, ::shared_ptr<
// If all parameters are terminal and the function is pure, we can
// evaluate it now, otherwise we'd have to wait execution time
if (all_terminal && scalar_fun->is_pure()) {
return make_terminal(scalar_fun, execute(*scalar_fun, parameters), query_options::DEFAULT.get_serialization_format());
return make_terminal(scalar_fun, execute(*scalar_fun, parameters), query_options::DEFAULT.get_cql_serialization_format());
} else {
return ::make_shared<function_call>(scalar_fun, parameters);
}
@@ -429,7 +429,7 @@ function_call::raw::execute(scalar_function& fun, std::vector<shared_ptr<term>>
buffers.push_back(std::move(param));
}
return execute_internal(serialization_format::internal(), fun, buffers);
return execute_internal(cql_serialization_format::internal(), fun, buffers);
}
assignment_testable::test_result


@@ -74,7 +74,10 @@ public:
: native_scalar_function(std::move(name), std::move(return_type), std::move(arg_types))
, _func(std::forward<Func>(func)) {
}
virtual bytes_opt execute(serialization_format sf, const std::vector<bytes_opt>& parameters) override {
virtual bool is_pure() override {
return Pure;
}
virtual bytes_opt execute(cql_serialization_format sf, const std::vector<bytes_opt>& parameters) override {
return _func(sf, parameters);
}
};


@@ -58,7 +58,7 @@ public:
* @return the result of applying this function to the parameter
* @throws InvalidRequestException if this function cannot not be applied to the parameter
*/
virtual bytes_opt execute(serialization_format sf, const std::vector<bytes_opt>& parameters) = 0;
virtual bytes_opt execute(cql_serialization_format sf, const std::vector<bytes_opt>& parameters) = 0;
};


@@ -56,7 +56,7 @@ inline
shared_ptr<function>
make_now_fct() {
return make_native_scalar_function<false>("now", timeuuid_type, {},
[] (serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
return {to_bytes(utils::UUID_gen::get_time_UUID())};
});
}
@@ -65,7 +65,7 @@ inline
shared_ptr<function>
make_min_timeuuid_fct() {
return make_native_scalar_function<true>("mintimeuuid", timeuuid_type, { timestamp_type },
[] (serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
auto& bb = values[0];
if (!bb) {
return {};
@@ -84,7 +84,7 @@ inline
shared_ptr<function>
make_max_timeuuid_fct() {
return make_native_scalar_function<true>("maxtimeuuid", timeuuid_type, { timestamp_type },
[] (serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
// FIXME: should values be a vector<optional<bytes>>?
auto& bb = values[0];
if (!bb) {
@@ -104,7 +104,7 @@ inline
shared_ptr<function>
make_date_of_fct() {
return make_native_scalar_function<true>("dateof", timestamp_type, { timeuuid_type },
[] (serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
@@ -119,7 +119,7 @@ inline
shared_ptr<function>
make_unix_timestamp_of_fcf() {
return make_native_scalar_function<true>("unixtimestampof", long_type, { timeuuid_type },
[] (serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {


@@ -61,10 +61,9 @@ public:
, _schema(s) {
}
bytes_opt execute(serialization_format sf, const std::vector<bytes_opt>& parameters) override {
auto buf = _schema->partition_key_type()->serialize_optionals(parameters);
auto view = partition_key_view::from_bytes(std::move(buf));
auto tok = dht::global_partitioner().get_token(*_schema, view);
bytes_opt execute(cql_serialization_format sf, const std::vector<bytes_opt>& parameters) override {
auto key = partition_key::from_optional_exploded(*_schema, parameters);
auto tok = dht::global_partitioner().get_token(*_schema, key);
warn(unimplemented::cause::VALIDATION);
return dht::global_partitioner().token_to_bytes(tok);
}


@@ -53,7 +53,7 @@ inline
shared_ptr<function>
make_uuid_fct() {
return make_native_scalar_function<false>("uuid", uuid_type, {},
[] (serialization_format sf, const std::vector<bytes_opt>& parameters) -> bytes_opt {
[] (cql_serialization_format sf, const std::vector<bytes_opt>& parameters) -> bytes_opt {
return {uuid_type->decompose(utils::make_random_uuid())};
});
}


@@ -108,7 +108,7 @@ lists::literal::to_string() const {
}
lists::value
lists::value::from_serialized(bytes_view v, list_type type, serialization_format sf) {
lists::value::from_serialized(bytes_view v, list_type type, cql_serialization_format sf) {
try {
// Collections have this small hack that validate cannot be called on a serialized object,
// but compose does the validation (so we're fine).
@@ -128,11 +128,11 @@ lists::value::from_serialized(bytes_view v, list_type type, serialization_format
bytes_opt
lists::value::get(const query_options& options) {
return get_with_protocol_version(options.get_serialization_format());
return get_with_protocol_version(options.get_cql_serialization_format());
}
bytes
lists::value::get_with_protocol_version(serialization_format sf) {
lists::value::get_with_protocol_version(cql_serialization_format sf) {
// Can't use boost::indirect_iterator, because optional is not an iterator
auto deref = [] (bytes_opt& x) { return *x; };
return collection_type_impl::pack(
@@ -212,7 +212,7 @@ lists::marker::bind(const query_options& options) {
if (!value) {
return nullptr;
} else {
return make_shared(value::from_serialized(*value, std::move(ltype), options.get_serialization_format()));
return make_shared(value::from_serialized(*value, std::move(ltype), options.get_cql_serialization_format()));
}
}
@@ -259,7 +259,10 @@ lists::setter_by_index::execute(mutation& m, const exploded_clustering_prefix& p
// we should not get here for frozen lists
assert(column.type->is_multi_cell()); // "Attempted to set an individual element on a frozen list";
auto row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
std::experimental::optional<clustering_key> row_key;
if (!column.is_static()) {
row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
}
auto index = _idx->bind_and_get(params._options);
auto value = _t->bind_and_get(params._options);
@@ -269,32 +272,30 @@ lists::setter_by_index::execute(mutation& m, const exploded_clustering_prefix& p
}
auto idx = net::ntoh(int32_t(*unaligned_cast<int32_t>(index->begin())));
auto existing_list_opt = params.get_prefetched_list(m.key(), row_key, column);
auto&& existing_list_opt = params.get_prefetched_list(m.key(), std::move(row_key), column);
if (!existing_list_opt) {
throw exceptions::invalid_request_exception("Attempted to set an element on a list which is null");
}
collection_mutation_view existing_list_ser = *existing_list_opt;
auto ltype = dynamic_pointer_cast<const list_type_impl>(column.type);
collection_type_impl::mutation_view existing_list = ltype->deserialize_mutation_form(existing_list_ser);
auto&& existing_list = *existing_list_opt;
// we verified that index is an int32_type
if (idx < 0 || size_t(idx) >= existing_list.cells.size()) {
if (idx < 0 || size_t(idx) >= existing_list.size()) {
throw exceptions::invalid_request_exception(sprint("List index %d out of bound, list has size %d",
idx, existing_list.cells.size()));
idx, existing_list.size()));
}
bytes_view eidx = existing_list.cells[idx].first;
const bytes& eidx = existing_list[idx].key;
list_type_impl::mutation mut;
mut.cells.reserve(1);
if (!value) {
mut.cells.emplace_back(to_bytes(eidx), params.make_dead_cell());
mut.cells.emplace_back(eidx, params.make_dead_cell());
} else {
if (value->size() > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(
sprint("List value is too long. List values are limited to %d bytes but %d bytes value provided",
std::numeric_limits<uint16_t>::max(), value->size()));
}
mut.cells.emplace_back(to_bytes(eidx), params.make_cell(*value));
mut.cells.emplace_back(eidx, params.make_cell(*value));
}
auto smut = ltype->serialize_mutation_form(mut);
m.set_cell(prefix, column, atomic_cell_or_collection::from_collection_mutation(std::move(smut)));
@@ -337,13 +338,8 @@ lists::do_append(shared_ptr<term> t,
if (!value) {
m.set_cell(prefix, column, params.make_dead_cell());
} else {
auto&& to_add = list_value->_elements;
auto deref = [] (const bytes_opt& v) { return *v; };
auto&& newv = collection_mutation{list_type_impl::pack(
boost::make_transform_iterator(to_add.begin(), deref),
boost::make_transform_iterator(to_add.end(), deref),
to_add.size(), serialization_format::internal())};
m.set_cell(prefix, column, atomic_cell_or_collection::from_collection_mutation(std::move(newv)));
auto newv = list_value->get_with_protocol_version(cql_serialization_format::internal());
m.set_cell(prefix, column, params.make_cell(std::move(newv)));
}
}
}
@@ -383,8 +379,13 @@ lists::discarder::requires_read() {
void
lists::discarder::execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) {
assert(column.type->is_multi_cell()); // "Attempted to delete from a frozen list";
auto&& row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
auto&& existing_list = params.get_prefetched_list(m.key(), row_key, column);
std::experimental::optional<clustering_key> row_key;
if (!column.is_static()) {
row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
}
auto&& existing_list = params.get_prefetched_list(m.key(), std::move(row_key), column);
// We want to call bind before possibly returning to reject queries where the value provided is not a list.
auto&& value = _t->bind(params._options);
@@ -394,9 +395,9 @@ lists::discarder::execute(mutation& m, const exploded_clustering_prefix& prefix,
return;
}
auto&& elist = ltype->deserialize_mutation_form(*existing_list);
auto&& elist = *existing_list;
if (elist.cells.empty()) {
if (elist.empty()) {
return;
}
@@ -413,14 +414,14 @@ lists::discarder::execute(mutation& m, const exploded_clustering_prefix& prefix,
// toDiscard will be small and keeping a list will be more efficient.
auto&& to_discard = lvalue->_elements;
collection_type_impl::mutation mnew;
for (auto&& cell : elist.cells) {
for (auto&& cell : elist) {
auto have_value = [&] (bytes_view value) {
return std::find_if(to_discard.begin(), to_discard.end(),
[ltype, value] (auto&& v) { return ltype->get_elements_type()->equal(*v, value); })
!= to_discard.end();
};
if (cell.second.is_live() && have_value(cell.second.value())) {
mnew.cells.emplace_back(bytes(cell.first.begin(), cell.first.end()), params.make_dead_cell());
if (have_value(cell.value)) {
mnew.cells.emplace_back(cell.key, params.make_dead_cell());
}
}
auto mnew_ser = ltype->serialize_mutation_form(mnew);
@@ -444,18 +445,21 @@ lists::discarder_by_index::execute(mutation& m, const exploded_clustering_prefix
auto cvalue = dynamic_pointer_cast<constants::value>(index);
assert(cvalue);
auto row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
auto&& existing_list = params.get_prefetched_list(m.key(), row_key, column);
std::experimental::optional<clustering_key> row_key;
if (!column.is_static()) {
row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
}
auto&& existing_list_opt = params.get_prefetched_list(m.key(), std::move(row_key), column);
int32_t idx = read_simple_exactly<int32_t>(*cvalue->_bytes);
if (!existing_list) {
if (!existing_list_opt) {
throw exceptions::invalid_request_exception("Attempted to delete an element from a list which is null");
}
auto&& deserialized = ltype->deserialize_mutation_form(*existing_list);
if (idx < 0 || size_t(idx) >= deserialized.cells.size()) {
throw exceptions::invalid_request_exception(sprint("List index %d out of bound, list has size %d", idx, deserialized.cells.size()));
auto&& existing_list = *existing_list_opt;
if (idx < 0 || size_t(idx) >= existing_list.size()) {
throw exceptions::invalid_request_exception(sprint("List index %d out of bound, list has size %d", idx, existing_list.size()));
}
collection_type_impl::mutation mut;
mut.cells.emplace_back(to_bytes(deserialized.cells[idx].first), params.make_dead_cell());
mut.cells.emplace_back(existing_list[idx].key, params.make_dead_cell());
m.set_cell(prefix, column, ltype->serialize_mutation_form(mut));
}
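The index-based delete above hinges on one detail: the tombstone is keyed by the storage key of the cell currently at position `idx`, not by the index itself. A sketch of that shape, modeling the list as ordered `(key, value)` cells:

```python
def discard_by_index(existing_list, idx, dead_cell='DEAD'):
    # Sketch of discarder_by_index: the dead cell reuses the key of the
    # cell at position idx; the index is never a storage key itself.
    if existing_list is None:
        raise ValueError(
            'Attempted to delete an element from a list which is null')
    if idx < 0 or idx >= len(existing_list):
        raise IndexError('List index {} out of bound, list has size {}'
                         .format(idx, len(existing_list)))
    key, _old_value = existing_list[idx]
    return [(key, dead_cell)]

mut = discard_by_index([('k0', 'a'), ('k1', 'b'), ('k2', 'c')], 1)
```

This is also why the delete requires a prior read (`requires_read`): the key at `idx` is only known once the existing list has been prefetched.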


@@ -78,9 +78,9 @@ public:
explicit value(std::vector<bytes_opt> elements)
: _elements(std::move(elements)) {
}
static value from_serialized(bytes_view v, list_type type, serialization_format sf);
static value from_serialized(bytes_view v, list_type type, cql_serialization_format sf);
virtual bytes_opt get(const query_options& options) override;
virtual bytes get_with_protocol_version(serialization_format sf) override;
virtual bytes get_with_protocol_version(cql_serialization_format sf) override;
bool equals(shared_ptr<list_type_impl> lt, const value& v);
virtual std::vector<bytes_opt> get_elements() override;
virtual sstring to_string() const;


@@ -152,7 +152,7 @@ maps::literal::to_string() const {
}
maps::value
maps::value::from_serialized(bytes_view value, map_type type, serialization_format sf) {
maps::value::from_serialized(bytes_view value, map_type type, cql_serialization_format sf) {
try {
// Collections have this small hack that validate cannot be called on a serialized object,
// but compose does the validation (so we're fine).
@@ -171,11 +171,11 @@ maps::value::from_serialized(bytes_view value, map_type type, serialization_form
bytes_opt
maps::value::get(const query_options& options) {
return get_with_protocol_version(options.get_serialization_format());
return get_with_protocol_version(options.get_cql_serialization_format());
}
bytes
maps::value::get_with_protocol_version(serialization_format sf) {
maps::value::get_with_protocol_version(cql_serialization_format sf) {
//FIXME: share code with serialize_partially_deserialized_form
size_t len = collection_value_len(sf) * map.size() * 2 + collection_size_len(sf);
for (auto&& e : map) {
@@ -257,7 +257,7 @@ maps::marker::bind(const query_options& options) {
maps::value::from_serialized(*val,
static_pointer_cast<const map_type_impl>(
_receiver->type),
options.get_serialization_format())) :
options.get_cql_serialization_format())) :
nullptr;
}
@@ -333,7 +333,7 @@ maps::do_put(mutation& m, const exploded_clustering_prefix& prefix, const update
m.set_cell(prefix, column, params.make_dead_cell());
} else {
auto v = map_type_impl::serialize_partially_deserialized_form({map_value->map.begin(), map_value->map.end()},
serialization_format::internal());
cql_serialization_format::internal());
m.set_cell(prefix, column, params.make_cell(std::move(v)));
}
}


@@ -81,9 +81,9 @@ public:
value(std::map<bytes, bytes, serialized_compare> map)
: map(std::move(map)) {
}
static value from_serialized(bytes_view value, map_type type, serialization_format sf);
static value from_serialized(bytes_view value, map_type type, cql_serialization_format sf);
virtual bytes_opt get(const query_options& options) override;
virtual bytes get_with_protocol_version(serialization_format sf);
virtual bytes get_with_protocol_version(cql_serialization_format sf);
bool equals(map_type mt, const value& v);
virtual sstring to_string() const;
};


@@ -47,7 +47,7 @@ namespace cql3 {
thread_local const query_options::specific_options query_options::specific_options::DEFAULT{-1, {}, {}, api::missing_timestamp};
thread_local query_options query_options::DEFAULT{db::consistency_level::ONE, std::experimental::nullopt,
{}, false, query_options::specific_options::DEFAULT, version::native_protocol(), serialization_format::use_32_bit()};
{}, false, query_options::specific_options::DEFAULT, cql_serialization_format::latest()};
query_options::query_options(db::consistency_level consistency,
std::experimental::optional<std::vector<sstring_view>> names,
@@ -55,16 +55,14 @@ query_options::query_options(db::consistency_level consistency,
std::vector<bytes_view_opt> value_views,
bool skip_metadata,
specific_options options,
int32_t protocol_version,
serialization_format sf)
cql_serialization_format sf)
: _consistency(consistency)
, _names(std::move(names))
, _values(std::move(values))
, _value_views(std::move(value_views))
, _skip_metadata(skip_metadata)
, _options(std::move(options))
, _protocol_version(protocol_version)
, _serialization_format(sf)
, _cql_serialization_format(sf)
{
}
@@ -73,8 +71,7 @@ query_options::query_options(db::consistency_level consistency,
std::vector<bytes_view_opt> value_views,
bool skip_metadata,
specific_options options,
int32_t protocol_version,
serialization_format sf)
cql_serialization_format sf)
: query_options(
consistency,
std::move(names),
@@ -82,7 +79,6 @@ query_options::query_options(db::consistency_level consistency,
std::move(value_views),
skip_metadata,
std::move(options),
protocol_version,
sf
)
{
@@ -94,7 +90,7 @@ query_options::query_options(query_options&& o, std::vector<std::vector<bytes_vi
std::vector<query_options> tmp;
tmp.reserve(value_views.size());
std::transform(value_views.begin(), value_views.end(), std::back_inserter(tmp), [this](auto& vals) {
return query_options(_consistency, {}, vals, _skip_metadata, _options, _protocol_version, _serialization_format);
return query_options(_consistency, {}, vals, _skip_metadata, _options, _cql_serialization_format);
});
_batch_options = std::move(tmp);
}
@@ -107,8 +103,7 @@ query_options::query_options(db::consistency_level cl, std::vector<bytes_opt> va
{},
false,
query_options::specific_options::DEFAULT,
version::native_protocol(),
serialization_format::use_32_bit()
cql_serialization_format::latest()
)
{
for (auto&& value : _values) {
@@ -178,12 +173,12 @@ api::timestamp_type query_options::get_timestamp(service::query_state& state) co
int query_options::get_protocol_version() const
{
return _protocol_version;
return _cql_serialization_format.protocol_version();
}
serialization_format query_options::get_serialization_format() const
cql_serialization_format query_options::get_cql_serialization_format() const
{
return _serialization_format;
return _cql_serialization_format;
}
const query_options::specific_options& query_options::get_specific_options() const

View File
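The hunk above drops the separate `_protocol_version` member: a CQL serialization format is tied to the protocol version it was created for, so the format object can answer `get_protocol_version()` itself and one field replaces two. A minimal illustration of that refactoring, with hypothetical names:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: the format remembers the protocol version it was
// built for, instead of query_options carrying both redundantly.
class cql_serialization_format_sketch {
    uint8_t _version;
public:
    explicit cql_serialization_format_sketch(uint8_t v) : _version(v) {}
    static cql_serialization_format_sketch latest() {
        return cql_serialization_format_sketch(4);
    }
    uint8_t protocol_version() const { return _version; }
};

class query_options_sketch {
    cql_serialization_format_sketch _sf;
public:
    explicit query_options_sketch(cql_serialization_format_sketch sf) : _sf(sf) {}
    // No dedicated _protocol_version member any more: derive it.
    int get_protocol_version() const { return _sf.protocol_version(); }
    cql_serialization_format_sketch get_cql_serialization_format() const { return _sf; }
};
```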

@@ -48,7 +48,7 @@
#include "service/pager/paging_state.hh"
#include "cql3/column_specification.hh"
#include "cql3/column_identifier.hh"
#include "serialization_format.hh"
#include "cql_serialization_format.hh"
namespace cql3 {
@@ -74,8 +74,7 @@ private:
mutable std::vector<std::vector<int8_t>> _temporaries;
const bool _skip_metadata;
const specific_options _options;
const int32_t _protocol_version; // transient
serialization_format _serialization_format;
cql_serialization_format _cql_serialization_format;
std::experimental::optional<std::vector<query_options>> _batch_options;
public:
query_options(query_options&&) = default;
@@ -87,22 +86,19 @@ public:
std::vector<bytes_view_opt> value_views,
bool skip_metadata,
specific_options options,
int32_t protocol_version,
serialization_format sf);
cql_serialization_format sf);
explicit query_options(db::consistency_level consistency,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<bytes_view_opt> value_views,
bool skip_metadata,
specific_options options,
int32_t protocol_version,
serialization_format sf);
cql_serialization_format sf);
explicit query_options(db::consistency_level consistency,
std::vector<std::vector<bytes_view_opt>> value_views,
bool skip_metadata,
specific_options options,
int32_t protocol_version,
serialization_format sf);
cql_serialization_format sf);
// Batch query_options constructor
explicit query_options(query_options&&, std::vector<std::vector<bytes_view_opt>> value_views);
@@ -131,7 +127,7 @@ public:
* a native protocol request (i.e. it's been allocated locally or by CQL-over-thrift).
*/
int get_protocol_version() const;
serialization_format get_serialization_format() const;
cql_serialization_format get_cql_serialization_format() const;
// Mainly for the sake of BatchQueryOptions
const specific_options& get_specific_options() const;
const query_options& for_statement(size_t i) const;

View File

@@ -423,10 +423,9 @@ void query_processor::migration_subscriber::on_update_keyspace(const sstring& ks
void query_processor::migration_subscriber::on_update_column_family(const sstring& ks_name, const sstring& cf_name, bool columns_changed)
{
if (columns_changed) {
log.info("Column definitions for {}.{} changed, invalidating related prepared statements", ks_name, cf_name);
remove_invalid_prepared_statements(ks_name, cf_name);
}
// #1255: Ignoring columns_changed deliberately.
log.info("Column definitions for {}.{} changed, invalidating related prepared statements", ks_name, cf_name);
remove_invalid_prepared_statements(ks_name, cf_name);
}
void query_processor::migration_subscriber::on_update_user_type(const sstring& ks_name, const sstring& type_name)

View File
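The change above (for #1255) deliberately ignores `columns_changed` and always invalidates: a prepared statement caches the schema version it was prepared against, so altering non-column parameters (for example, read repair chance) would otherwise appear to take no effect. A toy sketch of that eviction policy, using hypothetical structures rather than the Scylla implementation:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical prepared statement: holds the table it touches plus a
// schema snapshot taken at prepare time.
struct prepared_sketch {
    std::string table;   // "ks.cf"
    int schema_version;
};

struct statement_cache_sketch {
    std::unordered_map<std::string, prepared_sketch> cache;

    void on_update_column_family(const std::string& ks, const std::string& cf,
                                 bool /*columns_changed*/) {
        // #1255: ignore columns_changed and evict unconditionally, so that
        // parameter-only ALTERs are picked up on the next prepare.
        for (auto it = cache.begin(); it != cache.end(); ) {
            if (it->second.table == ks + "." + cf) {
                it = cache.erase(it);
            } else {
                ++it;
            }
        }
    }
};
```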

@@ -287,6 +287,13 @@ public:
};
inline ::shared_ptr<cql3::metadata> make_empty_metadata()
{
auto result = ::make_shared<cql3::metadata>(std::vector<::shared_ptr<cql3::column_specification>>{});
result->set_skip_metadata();
return result;
}
class result_set {
#if 0
private static final ColumnIdentifier COUNT_COLUMN = new ColumnIdentifier("count", false);

View File

@@ -53,7 +53,7 @@ public:
return true;
}
virtual void add_input(serialization_format sf, result_set_builder& rs) override {
virtual void add_input(cql_serialization_format sf, result_set_builder& rs) override {
// Aggregation of aggregation is not supported
size_t m = _arg_selectors.size();
for (size_t i = 0; i < m; ++i) {
@@ -65,7 +65,7 @@ public:
_aggregate->add_input(sf, _args);
}
virtual bytes_opt get_output(serialization_format sf) override {
virtual bytes_opt get_output(cql_serialization_format sf) override {
return _aggregate->compute(sf);
}

View File

@@ -87,11 +87,11 @@ public:
return false;
}
virtual void add_input(serialization_format sf, result_set_builder& rs) override {
virtual void add_input(cql_serialization_format sf, result_set_builder& rs) override {
_selected->add_input(sf, rs);
}
virtual bytes_opt get_output(serialization_format sf) override {
virtual bytes_opt get_output(cql_serialization_format sf) override {
auto&& value = _selected->get_output(sf);
if (!value) {
return std::experimental::nullopt;

View File

@@ -57,7 +57,7 @@ public:
return _arg_selectors[0]->is_aggregate();
}
virtual void add_input(serialization_format sf, result_set_builder& rs) override {
virtual void add_input(cql_serialization_format sf, result_set_builder& rs) override {
size_t m = _arg_selectors.size();
for (size_t i = 0; i < m; ++i) {
auto&& s = _arg_selectors[i];
@@ -68,7 +68,7 @@ public:
virtual void reset() override {
}
virtual bytes_opt get_output(serialization_format sf) override {
virtual bytes_opt get_output(cql_serialization_format sf) override {
size_t m = _arg_selectors.size();
for (size_t i = 0; i < m; ++i) {
auto&& s = _arg_selectors[i];

View File

@@ -52,6 +52,11 @@ selectable::writetime_or_ttl::new_selector_factory(database& db, schema_ptr s, s
return writetime_or_ttl_selector::new_factory(def->name_as_text(), add_and_get_index(*def, defs), _is_writetime);
}
sstring
selectable::writetime_or_ttl::to_string() const {
return sprint("%s(%s)", _is_writetime ? "writetime" : "ttl", _id->to_string());
}
shared_ptr<selectable>
selectable::writetime_or_ttl::raw::prepare(schema_ptr s) {
return make_shared<writetime_or_ttl>(_id->prepare_column_identifier(s), _is_writetime);
@@ -78,6 +83,11 @@ selectable::with_function::new_selector_factory(database& db, schema_ptr s, std:
return abstract_function_selector::new_factory(std::move(fun), std::move(factories));
}
sstring
selectable::with_function::to_string() const {
return sprint("%s(%s)", _function_name.name, join(", ", _args));
}
shared_ptr<selectable>
selectable::with_function::raw::prepare(schema_ptr s) {
std::vector<shared_ptr<selectable>> prepared_args;
@@ -101,7 +111,7 @@ selectable::with_field_selection::new_selector_factory(database& db, schema_ptr
if (!ut) {
throw exceptions::invalid_request_exception(
sprint("Invalid field selection: %s of type %s is not a user type",
"FIXME: selectable" /* FIMXME: _selected */, ut->as_cql3_type()));
_selected->to_string(), factory->new_instance()->get_type()->as_cql3_type()));
}
for (size_t i = 0; i < ut->size(); ++i) {
if (ut->field_name(i) != _field->bytes_) {
@@ -110,7 +120,12 @@ selectable::with_field_selection::new_selector_factory(database& db, schema_ptr
return field_selector::new_factory(std::move(ut), i, std::move(factory));
}
throw exceptions::invalid_request_exception(sprint("%s of type %s has no field %s",
"FIXME: selectable" /* FIXME: _selected */, ut->as_cql3_type(), _field));
_selected->to_string(), ut->as_cql3_type(), _field));
}
sstring
selectable::with_field_selection::to_string() const {
return sprint("%s.%s", _selected->to_string(), _field->to_string());
}
shared_ptr<selectable>
@@ -126,6 +141,10 @@ selectable::with_field_selection::raw::processes_selection() const {
return true;
}
std::ostream & operator<<(std::ostream &os, const selectable& s) {
return os << s.to_string();
}
}
}

View File

@@ -55,6 +55,7 @@ class selectable {
public:
virtual ~selectable() {}
virtual ::shared_ptr<selector::factory> new_selector_factory(database& db, schema_ptr schema, std::vector<const column_definition*>& defs) = 0;
virtual sstring to_string() const = 0;
protected:
static size_t add_and_get_index(const column_definition& def, std::vector<const column_definition*>& defs) {
auto i = std::find(defs.begin(), defs.end(), &def);
@@ -84,6 +85,8 @@ public:
class with_field_selection;
};
std::ostream & operator<<(std::ostream &os, const selectable& s);
class selectable::with_function : public selectable {
functions::function_name _function_name;
std::vector<shared_ptr<selectable>> _args;
@@ -92,17 +95,7 @@ public:
: _function_name(std::move(fname)), _args(std::move(args)) {
}
#if 0
@Override
public String toString()
{
return new StrBuilder().append(functionName)
.append("(")
.appendWithSeparators(args, ", ")
.append(")")
.toString();
}
#endif
virtual sstring to_string() const override;
virtual shared_ptr<selector::factory> new_selector_factory(database& db, schema_ptr s, std::vector<const column_definition*>& defs) override;
class raw : public selectable::raw {

View File

@@ -59,13 +59,7 @@ public:
: _selected(std::move(selected)), _field(std::move(field)) {
}
#if 0
@Override
public String toString()
{
return String.format("%s.%s", selected, field);
}
#endif
virtual sstring to_string() const override;
virtual shared_ptr<selector::factory> new_selector_factory(database& db, schema_ptr s, std::vector<const column_definition*>& defs) override;

View File

@@ -63,7 +63,8 @@ selection::selection(schema_ptr schema,
query::partition_slice::option_set selection::get_query_options() {
query::partition_slice::option_set opts;
opts.set_if<query::partition_slice::option::send_timestamp_and_expiry>(_collect_timestamps || _collect_TTLs);
opts.set_if<query::partition_slice::option::send_timestamp>(_collect_timestamps);
opts.set_if<query::partition_slice::option::send_expiry>(_collect_TTLs);
opts.set_if<query::partition_slice::option::send_partition_key>(
std::any_of(_columns.begin(), _columns.end(),
@@ -112,11 +113,11 @@ protected:
_current.clear();
}
virtual std::vector<bytes_opt> get_output_row(serialization_format sf) override {
virtual std::vector<bytes_opt> get_output_row(cql_serialization_format sf) override {
return std::move(_current);
}
virtual void add_input_row(serialization_format sf, result_set_builder& rs) override {
virtual void add_input_row(cql_serialization_format sf, result_set_builder& rs) override {
_current = std::move(*rs.current);
}
@@ -180,7 +181,7 @@ protected:
return _factories->contains_only_aggregate_functions();
}
virtual std::vector<bytes_opt> get_output_row(serialization_format sf) override {
virtual std::vector<bytes_opt> get_output_row(cql_serialization_format sf) override {
std::vector<bytes_opt> output_row;
output_row.reserve(_selectors.size());
for (auto&& s : _selectors) {
@@ -189,7 +190,7 @@ protected:
return output_row;
}
virtual void add_input_row(serialization_format sf, result_set_builder& rs) {
virtual void add_input_row(cql_serialization_format sf, result_set_builder& rs) {
for (auto&& s : _selectors) {
s->add_input(sf, rs);
}
@@ -252,11 +253,11 @@ selection::collect_metadata(schema_ptr schema, const std::vector<::shared_ptr<ra
return r;
}
result_set_builder::result_set_builder(const selection& s, db_clock::time_point now, serialization_format sf)
result_set_builder::result_set_builder(const selection& s, db_clock::time_point now, cql_serialization_format sf)
: _result_set(std::make_unique<result_set>(::make_shared<metadata>(*(s.get_result_metadata()))))
, _selectors(s.new_selectors())
, _now(now)
, _serialization_format(sf)
, _cql_serialization_format(sf)
{
if (s._collect_timestamps) {
_timestamps.resize(s._columns.size(), 0);
@@ -295,17 +296,16 @@ void result_set_builder::add(const column_definition& def, const query::result_a
}
}
void result_set_builder::add(const column_definition& def, collection_mutation_view c) {
auto&& ctype = static_cast<const collection_type_impl*>(def.type.get());
current->emplace_back(ctype->to_value(c, _serialization_format));
void result_set_builder::add_collection(const column_definition& def, bytes_view c) {
current->emplace_back(to_bytes(c));
// timestamps, ttls meaningless for collections
}
void result_set_builder::new_row() {
if (current) {
_selectors->add_input_row(_serialization_format, *this);
_selectors->add_input_row(_cql_serialization_format, *this);
if (!_selectors->is_aggregate()) {
_result_set->add_row(_selectors->get_output_row(_serialization_format));
_result_set->add_row(_selectors->get_output_row(_cql_serialization_format));
_selectors->reset();
}
current->clear();
@@ -319,13 +319,13 @@ void result_set_builder::new_row() {
std::unique_ptr<result_set> result_set_builder::build() {
if (current) {
_selectors->add_input_row(_serialization_format, *this);
_result_set->add_row(_selectors->get_output_row(_serialization_format));
_selectors->add_input_row(_cql_serialization_format, *this);
_result_set->add_row(_selectors->get_output_row(_cql_serialization_format));
_selectors->reset();
current = std::experimental::nullopt;
}
if (_result_set->empty() && _selectors->is_aggregate()) {
_result_set->add_row(_selectors->get_output_row(_serialization_format));
_result_set->add_row(_selectors->get_output_row(_cql_serialization_format));
}
return std::move(_result_set);
}
@@ -344,7 +344,7 @@ void result_set_builder::visitor::add_value(const column_definition& def,
_builder.add_empty();
return;
}
_builder.add(def, *cell);
_builder.add_collection(def, *cell);
} else {
auto cell = i.next_atomic_cell();
if (!cell) {

View File
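The first hunk above splits the combined `send_timestamp_and_expiry` flag into independent `send_timestamp` and `send_expiry` options, so a query selecting only `writetime()` no longer forces expiry data to be fetched (and vice versa). A simplified model of that option set, with illustrative names:

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Simplified stand-in for the partition_slice option set: two independent
// flags instead of one combined send_timestamp_and_expiry bit.
enum option : size_t { send_timestamp = 0, send_expiry = 1 };

std::bitset<2> query_options_for(bool collect_timestamps, bool collect_ttls) {
    std::bitset<2> opts;
    // Each attribute is requested only when the selection actually needs it.
    opts.set(send_timestamp, collect_timestamps);
    opts.set(send_expiry, collect_ttls);
    return opts;
}
```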

@@ -69,9 +69,9 @@ public:
* @param rs the <code>result_set_builder</code>
* @throws InvalidRequestException
*/
virtual void add_input_row(serialization_format sf, result_set_builder& rs) = 0;
virtual void add_input_row(cql_serialization_format sf, result_set_builder& rs) = 0;
virtual std::vector<bytes_opt> get_output_row(serialization_format sf) = 0;
virtual std::vector<bytes_opt> get_output_row(cql_serialization_format sf) = 0;
virtual void reset() = 0;
};
@@ -236,13 +236,13 @@ private:
std::vector<api::timestamp_type> _timestamps;
std::vector<int32_t> _ttls;
const db_clock::time_point _now;
serialization_format _serialization_format;
cql_serialization_format _cql_serialization_format;
public:
result_set_builder(const selection& s, db_clock::time_point now, serialization_format sf);
result_set_builder(const selection& s, db_clock::time_point now, cql_serialization_format sf);
void add_empty();
void add(bytes_opt value);
void add(const column_definition& def, const query::result_atomic_cell_view& c);
void add(const column_definition& def, collection_mutation_view c);
void add_collection(const column_definition& def, bytes_view c);
void new_row();
std::unique_ptr<result_set> build();
api::timestamp_type timestamp_of(size_t idx);

View File

@@ -71,7 +71,7 @@ public:
* @param rs the <code>result_set_builder</code>
* @throws InvalidRequestException if a problem occurs while adding the input value
*/
virtual void add_input(serialization_format sf, result_set_builder& rs) = 0;
virtual void add_input(cql_serialization_format sf, result_set_builder& rs) = 0;
/**
* Returns the selector output.
@@ -80,7 +80,7 @@ public:
* @return the selector output
* @throws InvalidRequestException if a problem occurs while computing the output value
*/
virtual bytes_opt get_output(serialization_format sf) = 0;
virtual bytes_opt get_output(cql_serialization_format sf) = 0;
/**
* Returns the <code>selector</code> output type.

View File

@@ -88,12 +88,12 @@ public:
, _type(type)
{ }
virtual void add_input(serialization_format sf, result_set_builder& rs) override {
virtual void add_input(cql_serialization_format sf, result_set_builder& rs) override {
// TODO: can we steal it?
_current = (*rs.current)[_idx];
}
virtual bytes_opt get_output(serialization_format sf) override {
virtual bytes_opt get_output(cql_serialization_format sf) override {
return std::move(_current);
}

View File

@@ -58,13 +58,7 @@ public:
: _id(std::move(id)), _is_writetime(is_writetime) {
}
#if 0
@Override
public String toString()
{
return (isWritetime ? "writetime" : "ttl") + "(" + id + ")";
}
#endif
virtual sstring to_string() const override;
virtual shared_ptr<selector::factory> new_selector_factory(database& db, schema_ptr s, std::vector<const column_definition*>& defs) override;

View File

@@ -86,7 +86,7 @@ public:
return make_shared<wtots_factory>(std::move(column_name), idx, is_writetime);
}
virtual void add_input(serialization_format sf, result_set_builder& rs) override {
virtual void add_input(cql_serialization_format sf, result_set_builder& rs) override {
if (_is_writetime) {
int64_t ts = rs.timestamp_of(_idx);
if (ts != api::missing_timestamp) {
@@ -108,7 +108,7 @@ public:
}
}
virtual bytes_opt get_output(serialization_format sf) override {
virtual bytes_opt get_output(cql_serialization_format sf) override {
return _current;
}

View File

@@ -120,7 +120,7 @@ sets::literal::to_string() const {
}
sets::value
sets::value::from_serialized(bytes_view v, set_type type, serialization_format sf) {
sets::value::from_serialized(bytes_view v, set_type type, cql_serialization_format sf) {
try {
// Collections have this small hack that validate cannot be called on a serialized object,
// but compose does the validation (so we're fine).
@@ -138,11 +138,11 @@ sets::value::from_serialized(bytes_view v, set_type type, serialization_format s
bytes_opt
sets::value::get(const query_options& options) {
return get_with_protocol_version(options.get_serialization_format());
return get_with_protocol_version(options.get_cql_serialization_format());
}
bytes
sets::value::get_with_protocol_version(serialization_format sf) {
sets::value::get_with_protocol_version(cql_serialization_format sf) {
return collection_type_impl::pack(_elements.begin(), _elements.end(),
_elements.size(), sf);
}
@@ -215,7 +215,7 @@ sets::marker::bind(const query_options& options) {
return nullptr;
} else {
auto as_set_type = static_pointer_cast<const set_type_impl>(_receiver->type);
return make_shared(value::from_serialized(*value, as_set_type, options.get_serialization_format()));
return make_shared(value::from_serialized(*value, as_set_type, options.get_cql_serialization_format()));
}
}
@@ -258,16 +258,14 @@ sets::adder::do_add(mutation& m, const exploded_clustering_prefix& row_key, cons
auto smut = set_type->serialize_mutation_form(mut);
m.set_cell(row_key, column, std::move(smut));
} else {
} else if (set_value != nullptr) {
// for frozen sets, we're overwriting the whole cell
auto v = set_type->serialize_partially_deserialized_form(
{set_value->_elements.begin(), set_value->_elements.end()},
serialization_format::internal());
if (set_value->_elements.empty()) {
m.set_cell(row_key, column, params.make_dead_cell());
} else {
m.set_cell(row_key, column, params.make_cell(std::move(v)));
}
cql_serialization_format::internal());
m.set_cell(row_key, column, params.make_cell(std::move(v)));
} else {
m.set_cell(row_key, column, params.make_dead_cell());
}
}

View File
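The `do_add` hunk above corrects the frozen-set branch: only a null bound value now produces a dead cell, while an empty set is serialized as a live, empty collection value. A synchronous toy model of that decision, with illustrative types:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <set>
#include <string>

// Hypothetical cell: live flag plus element count of the stored value.
struct cell_sketch {
    bool live;
    size_t elements;
};

// Mirrors the fixed branch order for frozen sets: null deletes the cell,
// anything else (including an empty set) overwrites it with a live value.
cell_sketch frozen_set_cell(const std::optional<std::set<std::string>>& v) {
    if (!v) {
        return {false, 0};      // null -> tombstone (dead cell)
    }
    return {true, v->size()};   // empty or not, write the serialized set
}
```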

@@ -78,9 +78,9 @@ public:
value(std::set<bytes, serialized_compare> elements)
: _elements(std::move(elements)) {
}
static value from_serialized(bytes_view v, set_type type, serialization_format sf);
static value from_serialized(bytes_view v, set_type type, cql_serialization_format sf);
virtual bytes_opt get(const query_options& options) override;
virtual bytes get_with_protocol_version(serialization_format sf) override;
virtual bytes get_with_protocol_version(cql_serialization_format sf) override;
bool equals(set_type st, const value& v);
virtual sstring to_string() const override;
};

View File

@@ -169,26 +169,21 @@ public:
}
private:
future<std::vector<mutation>> get_mutations(distributed<service::storage_proxy>& storage, const query_options& options, bool local, api::timestamp_type now) {
struct collector {
std::vector<mutation> _result;
std::vector<mutation> get() && { return std::move(_result); }
void operator()(std::vector<mutation> more) {
std::move(more.begin(), more.end(), std::back_inserter(_result));
}
};
auto get_mutations_for_statement = [this, &storage, &options, now, local] (size_t i) {
auto&& statement = _statements[i];
auto&& statement_options = options.for_statement(i);
auto timestamp = _attrs->get_timestamp(now, statement_options);
return statement->get_mutations(storage, statement_options, local, timestamp);
};
// FIXME: origin tries hard to merge mutations to same keyspace, for
// some reason.
return map_reduce(
boost::make_counting_iterator<size_t>(0),
boost::make_counting_iterator<size_t>(_statements.size()),
get_mutations_for_statement,
collector());
// Do not process in parallel because operations like list append/prepend depend on execution order.
return do_with(std::vector<mutation>(), [this, &storage, &options, now, local] (auto&& result) {
return do_for_each(boost::make_counting_iterator<size_t>(0),
boost::make_counting_iterator<size_t>(_statements.size()),
[this, &storage, &options, now, local, &result] (size_t i) {
auto&& statement = _statements[i];
auto&& statement_options = options.for_statement(i);
auto timestamp = _attrs->get_timestamp(now, statement_options);
return statement->get_mutations(storage, statement_options, local, timestamp).then([&result] (auto&& more) {
std::move(more.begin(), more.end(), std::back_inserter(result));
});
}).then([&result] {
return std::move(result);
});
});
}
public:

View File
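The rewrite above replaces a parallel `map_reduce` over the batch's statements with a sequential `do_for_each`, because operations like list append/prepend depend on execution order. A plain synchronous analogue of the sequential collection loop (the real code threads this through seastar futures):

```cpp
#include <cassert>
#include <functional>
#include <iterator>
#include <string>
#include <vector>

// Collect each statement's mutations strictly in statement order and
// append them to one result vector -- no reordering, unlike map_reduce.
std::vector<std::string> collect_mutations(
        const std::vector<std::function<std::vector<std::string>()>>& statements) {
    std::vector<std::string> result;
    for (auto& stmt : statements) {
        auto more = stmt();
        result.insert(result.end(),
                      std::make_move_iterator(more.begin()),
                      std::make_move_iterator(more.end()));
    }
    return result;
}
```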

@@ -44,6 +44,7 @@
#include <regex>
#include <boost/range/adaptor/map.hpp>
#include <boost/range/algorithm/adjacent_find.hpp>
#include "cql3/statements/create_table_statement.hh"
@@ -173,13 +174,12 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
throw exceptions::invalid_request_exception(sprint("Table names shouldn't be more than %d characters long (got \"%s\")", schema::NAME_LENGTH, cf_name.c_str()));
}
for (auto&& entry : _defined_names) {
auto c = std::count_if(_defined_names.begin(), _defined_names.end(), [&entry] (auto e) {
return entry->text() == e->text();
});
if (c > 1) {
throw exceptions::invalid_request_exception(sprint("Multiple definition of identifier %s", entry->text().c_str()));
}
// Check for duplicate column names
auto i = boost::range::adjacent_find(_defined_names, [] (auto&& e1, auto&& e2) {
return e1->text() == e2->text();
});
if (i != _defined_names.end()) {
throw exceptions::invalid_request_exception(sprint("Multiple definition of identifier %s", (*i)->text()));
}
properties->validate();

View File

@@ -51,6 +51,7 @@
#include "core/shared_ptr.hh"
#include <seastar/util/indirect.hh>
#include <unordered_map>
#include <utility>
#include <vector>
@@ -139,7 +140,8 @@ private:
create_table_statement::column_set_type _static_columns;
bool _use_compact_storage = false;
std::multiset<::shared_ptr<column_identifier>> _defined_names;
std::multiset<::shared_ptr<column_identifier>,
indirect_less<::shared_ptr<column_identifier>, column_identifier::text_comparator>> _defined_names;
bool _if_not_exists;
public:
raw_statement(::shared_ptr<cf_name> name, bool if_not_exists);

View File
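The two hunks above work together: `_defined_names` becomes an ordered multiset compared by the identifier's text, so duplicate names sort next to each other and a single `adjacent_find` pass replaces the quadratic `count_if`-per-entry check. A minimal sketch of the same idea:

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>

// In an ordered multiset, equal elements are adjacent, so one linear
// adjacent_find scan detects any duplicate definition.
bool has_duplicate(const std::multiset<std::string>& names) {
    return std::adjacent_find(names.begin(), names.end()) != names.end();
}
```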

@@ -0,0 +1,156 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright 2016 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "cql3/statements/create_type_statement.hh"
namespace cql3 {
namespace statements {
create_type_statement::create_type_statement(const ut_name& name, bool if_not_exists)
: _name{name}
, _if_not_exists{if_not_exists}
{
}
void create_type_statement::prepare_keyspace(const service::client_state& state)
{
if (!_name.has_keyspace()) {
_name.set_keyspace(state.get_keyspace());
}
}
void create_type_statement::add_definition(::shared_ptr<column_identifier> name, ::shared_ptr<cql3_type::raw> type)
{
_column_names.emplace_back(name);
_column_types.emplace_back(type);
}
void create_type_statement::check_access(const service::client_state& state)
{
warn(unimplemented::cause::PERMISSIONS);
#if 0
state.hasKeyspaceAccess(keyspace(), Permission.CREATE);
#endif
}
void create_type_statement::validate(distributed<service::storage_proxy>&, const service::client_state& state)
{
#if 0
KSMetaData ksm = Schema.instance.getKSMetaData(name.getKeyspace());
if (ksm == null)
throw new InvalidRequestException(String.format("Cannot add type in unknown keyspace %s", name.getKeyspace()));
if (ksm.userTypes.getType(name.getUserTypeName()) != null && !ifNotExists)
throw new InvalidRequestException(String.format("A user type of name %s already exists", name));
for (CQL3Type.Raw type : columnTypes)
if (type.isCounter())
throw new InvalidRequestException("A user type cannot contain counters");
#endif
}
#if 0
public static void checkForDuplicateNames(UserType type) throws InvalidRequestException
{
for (int i = 0; i < type.size() - 1; i++)
{
ByteBuffer fieldName = type.fieldName(i);
for (int j = i+1; j < type.size(); j++)
{
if (fieldName.equals(type.fieldName(j)))
throw new InvalidRequestException(String.format("Duplicate field name %s in type %s",
UTF8Type.instance.getString(fieldName),
UTF8Type.instance.getString(type.name)));
}
}
}
#endif
shared_ptr<transport::event::schema_change> create_type_statement::change_event()
{
using namespace transport;
return make_shared<transport::event::schema_change>(event::schema_change::change_type::CREATED,
event::schema_change::target_type::TYPE,
keyspace(),
_name.get_string_type_name());
}
const sstring& create_type_statement::keyspace() const
{
return _name.get_keyspace();
}
#if 0
private UserType createType() throws InvalidRequestException
{
List<ByteBuffer> names = new ArrayList<>(columnNames.size());
for (ColumnIdentifier name : columnNames)
names.add(name.bytes);
List<AbstractType<?>> types = new ArrayList<>(columnTypes.size());
for (CQL3Type.Raw type : columnTypes)
types.add(type.prepare(keyspace()).getType());
return new UserType(name.getKeyspace(), name.getUserTypeName(), names, types);
}
#endif
future<bool> create_type_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only)
{
throw std::runtime_error("User-defined types are not supported yet");
#if 0
KSMetaData ksm = Schema.instance.getKSMetaData(name.getKeyspace());
assert ksm != null; // should not have passed validation otherwise
// Can happen with ifNotExists
if (ksm.userTypes.getType(name.getUserTypeName()) != null)
return false;
UserType type = createType();
checkForDuplicateNames(type);
MigrationManager.announceNewType(type, isLocalOnly);
return true;
#endif
}
}
}

View File

@@ -0,0 +1,75 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright 2016 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "cql3/statements/schema_altering_statement.hh"
#include "cql3/cql3_type.hh"
#include "cql3/ut_name.hh"
namespace cql3 {
namespace statements {
class create_type_statement : public schema_altering_statement {
ut_name _name;
std::vector<::shared_ptr<column_identifier>> _column_names;
std::vector<::shared_ptr<cql3_type::raw>> _column_types;
bool _if_not_exists;
public:
create_type_statement(const ut_name& name, bool if_not_exists);
virtual void prepare_keyspace(const service::client_state& state) override;
void add_definition(::shared_ptr<column_identifier> name, ::shared_ptr<cql3_type::raw> type);
virtual void check_access(const service::client_state& state) override;
virtual void validate(distributed<service::storage_proxy>&, const service::client_state& state) override;
virtual shared_ptr<transport::event::schema_change> change_event() override;
virtual const sstring& keyspace() const override;
virtual future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
};
}
}


@@ -186,11 +186,30 @@ modification_statement::make_update_parameters(
class prefetch_data_builder {
update_parameters::prefetch_data& _data;
const query::partition_slice& _ps;
schema_ptr _schema;
std::experimental::optional<partition_key> _pkey;
private:
void add_cell(update_parameters::prefetch_data::row& cells, const column_definition& def, const std::experimental::optional<bytes_view>& cell) {
if (cell) {
auto ctype = static_pointer_cast<const collection_type_impl>(def.type);
if (!ctype->is_multi_cell()) {
throw std::logic_error(sprint("cannot prefetch frozen collection: %s", def.name_as_text()));
}
auto map_type = map_type_impl::get_instance(ctype->name_comparator(), ctype->value_comparator(), true);
update_parameters::prefetch_data::cell_list list;
// FIXME: Iterate over a range instead of fully exploded collection
auto dv = map_type->deserialize(*cell);
for (auto&& el : value_cast<map_type_impl::native_type>(dv)) {
list.emplace_back(update_parameters::prefetch_data::cell{el.first.serialize(), el.second.serialize()});
}
cells.emplace(def.id, std::move(list));
}
};
public:
prefetch_data_builder(update_parameters::prefetch_data& data, const query::partition_slice& ps)
prefetch_data_builder(schema_ptr s, update_parameters::prefetch_data& data, const query::partition_slice& ps)
: _data(data)
, _ps(ps)
, _schema(std::move(s))
{ }
void accept_new_partition(const partition_key& key, uint32_t row_count) {
@@ -205,20 +224,9 @@ public:
const query::result_row_view& row) {
update_parameters::prefetch_data::row cells;
auto add_cell = [&cells] (column_id id, std::experimental::optional<collection_mutation_view>&& cell) {
if (cell) {
cells.emplace(id, collection_mutation{to_bytes(cell->data)});
}
};
auto static_row_iterator = static_row.iterator();
for (auto&& id : _ps.static_columns) {
add_cell(id, static_row_iterator.next_collection_cell());
}
auto row_iterator = row.iterator();
for (auto&& id : _ps.regular_columns) {
add_cell(id, row_iterator.next_collection_cell());
add_cell(cells, _schema->regular_column_at(id), row_iterator.next_collection_cell());
}
_data.rows.emplace(std::make_pair(*_pkey, key), std::move(cells));
@@ -228,7 +236,16 @@ public:
assert(0);
}
void accept_partition_end(const query::result_row_view& static_row) {}
void accept_partition_end(const query::result_row_view& static_row) {
update_parameters::prefetch_data::row cells;
auto static_row_iterator = static_row.iterator();
for (auto&& id : _ps.static_columns) {
add_cell(cells, _schema->static_column_at(id), static_row_iterator.next_collection_cell());
}
_data.rows.emplace(std::make_pair(*_pkey, std::experimental::nullopt), std::move(cells));
}
};
future<update_parameters::prefetched_rows_type>
@@ -265,7 +282,8 @@ modification_statement::read_required_rows(
std::move(regular_cols),
query::partition_slice::option_set::of<
query::partition_slice::option::send_partition_key,
query::partition_slice::option::send_clustering_key>());
query::partition_slice::option::send_clustering_key,
query::partition_slice::option::collections_as_maps>());
std::vector<query::partition_range> pr;
for (auto&& pk : *keys) {
pr.emplace_back(dht::global_partitioner().decorate_key(*s, pk));
@@ -278,7 +296,7 @@ modification_statement::read_required_rows(
bytes_ostream buf(result->buf());
query::result_view v(buf.linearize());
auto prefetched_rows = update_parameters::prefetched_rows_type({update_parameters::prefetch_data(s)});
v.consume(ps, prefetch_data_builder(prefetched_rows.value(), ps));
v.consume(ps, prefetch_data_builder(s, prefetched_rows.value(), ps));
return prefetched_rows;
});
}


@@ -117,6 +117,11 @@ select_statement::for_selection(schema_ptr schema, ::shared_ptr<selection::selec
::shared_ptr<term>{});
}
::shared_ptr<cql3::metadata> select_statement::get_result_metadata() const {
// FIXME: COUNT needs special result metadata handling.
return _selection->get_result_metadata();
}
uint32_t select_statement::get_bound_terms() {
return _bound_terms;
}
@@ -170,7 +175,7 @@ select_statement::make_partition_slice(const query_options& options) {
if (_parameters->is_distinct()) {
_opts.set(query::partition_slice::option::distinct);
return query::partition_slice({ query::clustering_range::make_open_ended_both_sides() },
std::move(static_columns), {}, _opts);
std::move(static_columns), {}, _opts, nullptr, options.get_cql_serialization_format());
}
auto bounds = _restrictions->get_clustering_bounds(options);
@@ -179,7 +184,7 @@ select_statement::make_partition_slice(const query_options& options) {
std::reverse(bounds.begin(), bounds.end());
}
return query::partition_slice(std::move(bounds),
std::move(static_columns), std::move(regular_columns), _opts);
std::move(static_columns), std::move(regular_columns), _opts, nullptr, options.get_cql_serialization_format());
}
int32_t select_statement::get_limit(const query_options& options) const {
@@ -246,7 +251,7 @@ select_statement::execute(distributed<service::storage_proxy>& proxy, service::q
if (aggregate) {
return do_with(
cql3::selection::result_set_builder(*_selection, now,
options.get_serialization_format()),
options.get_cql_serialization_format()),
[p, page_size, now](auto& builder) {
return do_until([p] {return p->is_exhausted();},
[p, &builder, page_size, now] {
@@ -338,8 +343,8 @@ shared_ptr<transport::messages::result_message> select_statement::process_result
db_clock::time_point now) {
cql3::selection::result_set_builder builder(*_selection, now,
options.get_serialization_format());
query::result_view::consume(results->buf(), cmd->slice,
options.get_cql_serialization_format());
query::result_view::consume(*results, cmd->slice,
cql3::selection::result_set_builder::visitor(builder, *_schema,
*_selection));
auto rs = builder.build();
@@ -529,9 +534,12 @@ select_statement::raw_statement::get_ordering_comparator(schema_ptr schema,
}
bool select_statement::raw_statement::is_reversed(schema_ptr schema) {
std::experimental::optional<bool> reversed_map[schema->clustering_key_size()];
uint32_t i = 0;
assert(_parameters->orderings().size() > 0);
parameters::orderings_type::size_type i = 0;
bool is_reversed_ = false;
bool relation_order_unsupported = false;
for (auto&& e : _parameters->orderings()) {
::shared_ptr<column_identifier> column = e.first->prepare_column_identifier(schema);
bool reversed = e.second;
@@ -551,32 +559,23 @@ bool select_statement::raw_statement::is_reversed(schema_ptr schema) {
"Order by currently only support the ordering of columns following their declared order in the PRIMARY KEY");
}
reversed_map[i] = std::experimental::make_optional(reversed != def->type->is_reversed());
bool current_reverse_status = (reversed != def->type->is_reversed());
if (i == 0) {
is_reversed_ = current_reverse_status;
}
if (is_reversed_ != current_reverse_status) {
relation_order_unsupported = true;
}
++i;
}
// GCC incorrectly complains about "*is_reversed_" below
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wmaybe-uninitialized"
// Check that all bools in reversed_map, if set, agree
std::experimental::optional<bool> is_reversed_{};
for (auto&& b : reversed_map) {
if (b) {
if (!is_reversed_) {
is_reversed_ = b;
} else {
if ((*is_reversed_) != *b) {
throw exceptions::invalid_request_exception("Unsupported order by relation");
}
}
}
if (relation_order_unsupported) {
throw exceptions::invalid_request_exception("Unsupported order by relation");
}
assert(is_reversed_);
return *is_reversed_;
#pragma GCC diagnostic pop
return is_reversed_;
}
/** If ALLOW FILTERING was not specified, this verifies that it is not needed */

View File

@@ -121,6 +121,7 @@ public:
static ::shared_ptr<select_statement> for_selection(
schema_ptr schema, ::shared_ptr<selection::selection> selection);
::shared_ptr<cql3::metadata> get_result_metadata() const;
virtual uint32_t get_bound_terms() override;
virtual void check_access(const service::client_state& state) override;
virtual void validate(distributed<service::storage_proxy>&, const service::client_state& state) override;


@@ -78,7 +78,7 @@ void update_statement::add_update_for_key(mutation& m, const exploded_clustering
// If there are static columns, there also must be clustering columns, in which
// case empty prefix can only refer to the static row.
bool is_static_prefix = s->has_static_columns() && !prefix;
if (type == statement_type::INSERT && !is_static_prefix) {
if (type == statement_type::INSERT && !is_static_prefix && s->is_cql3_table()) {
auto& row = m.partition().clustered_row(clustering_key::from_clustering_prefix(*s, prefix));
row.apply(row_marker(params.timestamp(), params.ttl(), params.expiry()));
}
@@ -137,19 +137,17 @@ update_statement::parsed_insert::prepare_internal(database& db, schema_ptr schem
throw exceptions::invalid_request_exception("No columns provided to INSERT");
}
std::unordered_set<bytes> column_ids;
for (size_t i = 0; i < _column_names.size(); i++) {
auto id = _column_names[i]->prepare_column_identifier(schema);
auto def = get_column_definition(schema, *id);
if (!def) {
throw exceptions::invalid_request_exception(sprint("Unknown identifier %s", *id));
}
for (size_t j = 0; j < i; j++) {
auto other_id = _column_names[j]->prepare_column_identifier(schema);
if (*id == *other_id) {
throw exceptions::invalid_request_exception(sprint("Multiple definitions found for column %s", *id));
}
if (column_ids.count(id->name())) {
throw exceptions::invalid_request_exception(sprint("Multiple definitions found for column %s", *id));
}
column_ids.emplace(id->name());
auto&& value = _column_values[i];


@@ -205,7 +205,7 @@ class collection_terminal {
public:
virtual ~collection_terminal() {}
/** Gets the value of the collection when serialized with the given protocol version format */
virtual bytes get_with_protocol_version(serialization_format sf) = 0;
virtual bytes get_with_protocol_version(cql_serialization_format sf) = 0;
};
/**


@@ -202,12 +202,12 @@ public:
buffers[i] = to_bytes_opt(_elements[i]->bind_and_get(options));
// Inside tuples, we must force the serialization of collections to v3 whatever protocol
// version is in use since we're going to store directly that serialized value.
if (options.get_serialization_format() != serialization_format::internal()
if (options.get_cql_serialization_format() != cql_serialization_format::internal()
&& _type->type(i)->is_collection()) {
if (buffers[i]) {
buffers[i] = static_pointer_cast<const collection_type_impl>(_type->type(i))->reserialize(
options.get_serialization_format(),
serialization_format::internal(),
options.get_cql_serialization_format(),
cql_serialization_format::internal(),
bytes_view(*buffers[i]));
}
}
@@ -251,7 +251,7 @@ public:
try {
// Collections have this small hack that validate cannot be called on a serialized object,
// but the deserialization does the validation (so we're fine).
auto l = value_cast<list_type_impl::native_type>(type->deserialize(value, options.get_serialization_format()));
auto l = value_cast<list_type_impl::native_type>(type->deserialize(value, options.get_cql_serialization_format()));
auto ttype = dynamic_pointer_cast<const tuple_type_impl>(type->get_elements_type());
assert(ttype);


@@ -43,17 +43,17 @@
namespace cql3 {
std::experimental::optional<collection_mutation_view>
const update_parameters::prefetch_data::cell_list*
update_parameters::get_prefetched_list(
const partition_key& pkey,
const clustering_key& row_key,
partition_key pkey,
std::experimental::optional<clustering_key> ckey,
const column_definition& column) const
{
if (!_prefetched) {
return {};
}
auto i = _prefetched->rows.find(std::make_pair(pkey, row_key));
auto i = _prefetched->rows.find(std::make_pair(std::move(pkey), std::move(ckey)));
if (i == _prefetched->rows.end()) {
return {};
}
@@ -63,7 +63,7 @@ update_parameters::get_prefetched_list(
if (j == row.end()) {
return {};
}
return {j->second};
return &j->second;
}
update_parameters::prefetch_data::prefetch_data(schema_ptr schema)


@@ -58,8 +58,9 @@ namespace cql3 {
*/
class update_parameters final {
public:
// Holder for data needed by CQL list updates which depend on current state of the list.
struct prefetch_data {
using key = std::pair<partition_key, clustering_key>;
using key = std::pair<partition_key, std::experimental::optional<clustering_key>>;
struct key_hashing {
partition_key::hashing pk_hash;
clustering_key::hashing ck_hash;
@@ -70,7 +71,7 @@ public:
{ }
size_t operator()(const key& k) const {
return pk_hash(k.first) ^ ck_hash(k.second);
return pk_hash(k.first) ^ (k.second ? ck_hash(*k.second) : 0);
}
};
struct key_equality {
@@ -83,10 +84,16 @@ public:
{ }
bool operator()(const key& k1, const key& k2) const {
return pk_eq(k1.first, k2.first) && ck_eq(k1.second, k2.second);
return pk_eq(k1.first, k2.first)
&& bool(k1.second) == bool(k2.second) && (!k1.second || ck_eq(*k1.second, *k2.second));
}
};
using row = std::unordered_map<column_id, collection_mutation>;
struct cell {
bytes key;
bytes value;
};
using cell_list = std::vector<cell>;
using row = std::unordered_map<column_id, cell_list>;
public:
std::unordered_map<key, row, key_hashing, key_equality> rows;
schema_ptr schema;
@@ -183,8 +190,11 @@ public:
return _timestamp;
}
std::experimental::optional<collection_mutation_view> get_prefetched_list(
const partition_key& pkey, const clustering_key& row_key, const column_definition& column) const;
const prefetch_data::cell_list*
get_prefetched_list(
partition_key pkey,
std::experimental::optional<clustering_key> ckey,
const column_definition& column) const;
};
}


@@ -161,15 +161,15 @@ void user_types::delayed_value::collect_marker_specification(shared_ptr<variable
}
std::vector<bytes_opt> user_types::delayed_value::bind_internal(const query_options& options) {
auto sf = options.get_serialization_format();
auto sf = options.get_cql_serialization_format();
std::vector<bytes_opt> buffers;
for (size_t i = 0; i < _type->size(); ++i) {
buffers.push_back(to_bytes_opt(_values[i]->bind_and_get(options)));
// Inside UDT values, we must force the serialization of collections to v3 whatever protocol
// version is in use since we're going to store directly that serialized value.
if (sf != serialization_format::use_32_bit() && _type->field_type(i)->is_collection() && buffers.back()) {
if (!sf.collection_format_unchanged() && _type->field_type(i)->is_collection() && buffers.back()) {
auto&& ctype = static_pointer_cast<const collection_type_impl>(_type->field_type(i));
buffers.back() = ctype->reserialize(sf, serialization_format::use_32_bit(), bytes_view(*buffers.back()));
buffers.back() = ctype->reserialize(sf, cql_serialization_format::latest(), bytes_view(*buffers.back()));
}
}
return buffers;


@@ -56,7 +56,7 @@ void ut_name::set_keyspace(sstring keyspace) {
_ks_name = std::experimental::optional<sstring>{keyspace};
}
sstring ut_name::get_keyspace() const {
const sstring& ut_name::get_keyspace() const {
return _ks_name.value();
}


@@ -58,7 +58,7 @@ public:
void set_keyspace(sstring keyspace);
sstring get_keyspace() const;
const sstring& get_keyspace() const;
bytes get_user_type_name() const;


@@ -0,0 +1,52 @@
/*
* Copyright (C) 2015 Cloudius Systems, Ltd.
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <iostream>
using cql_protocol_version_type = uint8_t;
// Abstraction of transport protocol-dependent serialization format
// Protocols v1, v2 used 16 bits for collection sizes, while v3 and
// above use 32 bits. But letting every bit of the code know what
// transport protocol we're using (and in some cases, we aren't using
// any transport -- it's for internal storage) is bad, so abstract it
// away here.
class cql_serialization_format {
cql_protocol_version_type _version;
public:
static constexpr cql_protocol_version_type latest_version = 3;
explicit cql_serialization_format(cql_protocol_version_type version) : _version(version) {}
static cql_serialization_format latest() { return cql_serialization_format{latest_version}; }
static cql_serialization_format internal() { return latest(); }
bool using_32_bits_for_collections() const { return _version >= 3; }
bool operator==(cql_serialization_format x) const { return _version == x._version; }
bool operator!=(cql_serialization_format x) const { return !operator==(x); }
cql_protocol_version_type protocol_version() const { return _version; }
friend std::ostream& operator<<(std::ostream& out, const cql_serialization_format& sf) {
return out << static_cast<int>(sf._version);
}
bool collection_format_unchanged(cql_serialization_format other = cql_serialization_format::latest()) const {
return using_32_bits_for_collections() == other.using_32_bits_for_collections();
}
};

File diff suppressed because it is too large


@@ -41,6 +41,7 @@
#include <set>
#include <iostream>
#include <boost/functional/hash.hpp>
#include <boost/range/algorithm/find.hpp>
#include <experimental/optional>
#include <string.h>
#include "types.hh"
@@ -56,7 +57,6 @@
#include "tombstone.hh"
#include "atomic_cell.hh"
#include "query-request.hh"
#include "query-result.hh"
#include "keys.hh"
#include "mutation.hh"
#include "memtable.hh"
@@ -71,6 +71,7 @@
#include "sstables/compaction.hh"
#include "key_reader.hh"
#include <seastar/core/rwlock.hh>
#include <seastar/core/shared_future.hh>
class frozen_mutation;
class reconcilable_result;
@@ -97,9 +98,132 @@ void make(database& db, bool durable, bool volatile_testing_only);
}
}
class throttle_state {
size_t _max_space;
logalloc::region_group& _region_group;
throttle_state* _parent;
circular_buffer<promise<>> _throttled_requests;
timer<> _throttling_timer{[this] { unthrottle(); }};
void unthrottle();
bool should_throttle() const {
if (_region_group.memory_used() > _max_space) {
return true;
}
if (_parent) {
return _parent->should_throttle();
}
return false;
}
public:
throttle_state(size_t max_space, logalloc::region_group& region, throttle_state* parent = nullptr)
: _max_space(max_space)
, _region_group(region)
, _parent(parent)
{}
future<> throttle();
};
class replay_position_reordered_exception : public std::exception {};
using memtable_list = std::vector<lw_shared_ptr<memtable>>;
// We could just add all memtables, regardless of types, to a single list, and
// then filter them out when we read them. Here's why I have chosen not to do
// it:
//
// First, some of the methods in which a memtable is involved (like seal)
// assume a commitlog, and take great care to update the replay
// position, flush the log, etc. We want to bypass those, and that has to
// be done either by sprinkling the seal code with conditionals, or having a
// separate method for each seal.
//
// Also, if we ever want to put some of the memtables in a separate allocator
// region group to provide for extra QoS, having the classes properly wrapped
// will make that trivial: just pass a version of new_memtable() that puts it
// in a different region, while the list approach would require a lot of
// conditionals as well.
//
// If we are going to have different methods, better have different instances
// of a common class.
class memtable_list {
using shared_memtable = lw_shared_ptr<memtable>;
std::vector<shared_memtable> _memtables;
std::function<future<> ()> _seal_fn;
std::function<schema_ptr()> _current_schema;
size_t _max_memtable_size;
logalloc::region_group* _dirty_memory_region_group;
public:
memtable_list(std::function<future<> ()> seal_fn, std::function<schema_ptr()> cs, size_t max_memtable_size, logalloc::region_group* region_group)
: _memtables({})
, _seal_fn(seal_fn)
, _current_schema(cs)
, _max_memtable_size(max_memtable_size)
, _dirty_memory_region_group(region_group) {
add_memtable();
}
shared_memtable back() {
return _memtables.back();
}
// The caller has to make sure the element exists before calling this.
void erase(const shared_memtable& element) {
_memtables.erase(boost::range::find(_memtables, element));
}
void clear() {
_memtables.clear();
}
size_t size() const {
return _memtables.size();
}
future<> seal_active_memtable() {
return _seal_fn();
}
auto begin() noexcept {
return _memtables.begin();
}
auto begin() const noexcept {
return _memtables.begin();
}
auto end() noexcept {
return _memtables.end();
}
auto end() const noexcept {
return _memtables.end();
}
memtable& active_memtable() {
return *_memtables.back();
}
void add_memtable() {
_memtables.emplace_back(new_memtable());
}
bool should_flush() {
return active_memtable().occupancy().total_space() >= _max_memtable_size;
}
void seal_on_overflow() {
if (should_flush()) {
// FIXME: if sparse, do some in-memory compaction first
// FIXME: maybe merge with other in-memory memtables
_seal_fn();
}
}
private:
lw_shared_ptr<memtable> new_memtable() {
return make_lw_shared<memtable>(_current_schema(), _dirty_memory_region_group);
}
};
using sstable_list = sstables::sstable_list;
// The CF has a "stats" structure. But we don't want all fields here,
@@ -122,7 +246,9 @@ public:
bool enable_commitlog = true;
bool enable_incremental_backups = false;
size_t max_memtable_size = 5'000'000;
size_t max_streaming_memtable_size = 5'000'000;
logalloc::region_group* dirty_memory_region_group = nullptr;
logalloc::region_group* streaming_dirty_memory_region_group = nullptr;
::cf_stats* cf_stats = nullptr;
};
struct no_commitlog {};
@@ -154,14 +280,43 @@ private:
config _config;
stats _stats;
lw_shared_ptr<memtable_list> _memtables;
// In older incarnations, we simply committed the mutations to memtables.
// However, doing that makes it harder for us to provide QoS within the
// disk subsystem. Keeping them in separate memtables allows us to properly
// classify those streams into their own I/O class
//
// We could write those directly to disk, but we still want the mutations
// coming through the wire to go to a memtable staging area. This has two
// major advantages:
//
// first, it will allow us to properly order the partitions. They are
// hopefully sent in order but we can't really guarantee that without
// sacrificing sender-side parallelism.
//
// second, we will be able to coalesce writes from multiple plan_id's and
// even multiple senders, as well as automatically tapping into the dirty
// memory throttling mechanism, guaranteeing we will not overload the
// server.
lw_shared_ptr<memtable_list> _streaming_memtables;
lw_shared_ptr<memtable_list> make_memtable_list();
lw_shared_ptr<memtable_list> make_streaming_memtable_list();
// generation -> sstable. Ordered by key so we can easily get the most recent.
lw_shared_ptr<sstable_list> _sstables;
// sstables that have been compacted (so don't look up in query) but
// have not been deleted yet, so must not GC any tombstones in other sstables
// that may delete data in these sstables:
std::vector<sstables::shared_sstable> _sstables_compacted_but_not_deleted;
// Control background fibers waiting for sstables to be deleted
seastar::gate _sstable_deletion_gate;
// There are situations in which we need to stop writing sstables. Flushers will take
// the read lock, and the ones that wish to stop that process will take the write lock.
rwlock _sstables_lock;
mutable row_cache _cache; // Cache covers only sstables.
int64_t _sstable_generation = 1;
unsigned _mutation_count = 0;
std::experimental::optional<int64_t> _sstable_generation = {};
db::replay_position _highest_flushed_rp;
// Provided by the database that owns this commitlog
db::commitlog* _commitlog;
@@ -172,30 +327,43 @@ private:
int _compaction_disabled = 0;
class memtable_flush_queue;
std::unique_ptr<memtable_flush_queue> _flush_queue;
// Store generation of sstables being compacted at the moment. That's needed to prevent a
// sstable from being compacted twice.
std::unordered_set<unsigned long> _compacting_generations;
// Because streaming mutations bypass the commitlog, there is
// no need for the complications of the flush queue. Besides, it
// is easier to just use a common gate than it is to modify the flush_queue
// to work both with and without a replay position.
//
// Last but not least, we seldom need to guarantee any ordering here: as long
// as all data is waited for, we're good.
seastar::gate _streaming_flush_gate;
private:
void update_stats_for_new_sstable(uint64_t new_sstable_data_size);
void update_stats_for_new_sstable(uint64_t disk_space_used_by_sstable);
void add_sstable(sstables::sstable&& sstable);
void add_sstable(lw_shared_ptr<sstables::sstable> sstable);
void add_memtable();
lw_shared_ptr<memtable> new_memtable();
lw_shared_ptr<memtable> new_streaming_memtable();
future<stop_iteration> try_flush_memtable_to_sstable(lw_shared_ptr<memtable> memt);
future<> update_cache(memtable&, lw_shared_ptr<sstable_list> old_sstables);
struct merge_comparator;
// update the sstable generation, making sure that new sstables don't overwrite this one.
void update_sstables_known_generation(unsigned generation) {
_sstable_generation = std::max<uint64_t>(_sstable_generation, generation / smp::count + 1);
if (!_sstable_generation) {
_sstable_generation = 1;
}
_sstable_generation = std::max<uint64_t>(*_sstable_generation, generation / smp::count + 1);
}
uint64_t calculate_generation_for_new_table() {
return _sstable_generation++ * smp::count + engine().cpu_id();
assert(_sstable_generation);
// FIXME: better way of ensuring we don't attempt to
// overwrite an existing table.
return (*_sstable_generation)++ * smp::count + engine().cpu_id();
}
// Rebuild existing _sstables with new_sstables added to it and sstables_to_remove removed from it.
void rebuild_sstable_list(const std::vector<sstables::shared_sstable>& new_sstables,
const std::vector<sstables::shared_sstable>& sstables_to_remove);
void rebuild_statistics();
private:
// Creates a mutation reader which covers sstables.
// Caller needs to ensure that column_family remains live (FIXME: relax this).
@@ -207,7 +375,29 @@ private:
key_source sstables_as_key_source() const;
partition_presence_checker make_partition_presence_checker(lw_shared_ptr<sstable_list> old_sstables);
std::chrono::steady_clock::time_point _sstable_writes_disabled_at;
void do_trigger_compaction();
public:
// This function should be called when this column family is ready for writes, IOW,
// to produce SSTables. Extensive details about why this is important can be found
// in Scylla's Github Issue #1014
//
// Nothing should be writing to SSTables before we have the chance to populate the
// existing SSTables and calculate what the next generation number should be.
//
// However, if that happens, we want to protect against it in a way that does not
// involve overwriting existing tables. This is one of the ways to do it: every
// column family starts in an unwriteable state, and when it can finally be written
// to, we mark it as writeable.
//
// Note that this *cannot* be a part of add_column_family. That adds a column family
// to a db in memory only, and if anybody is about to write to a CF, that was most
// likely already called. We need to call this explicitly when we are sure we're ready
// to issue disk operations safely.
void mark_ready_for_writes() {
update_sstables_known_generation(0);
}
// Creates a mutation reader which covers all data sources for this column family.
// Caller needs to ensure that column_family remains live (FIXME: relax this).
// Note: for data queries use query() instead.
@@ -227,7 +417,7 @@ public:
// FIXME: in case a query is satisfied from a single memtable, avoid a copy
using const_mutation_partition_ptr = std::unique_ptr<const mutation_partition>;
using const_row_ptr = std::unique_ptr<const row>;
memtable& active_memtable() { return *_memtables->back(); }
memtable& active_memtable() { return _memtables->active_memtable(); }
const row_cache& get_row_cache() const {
return _cache;
}
@@ -252,10 +442,11 @@ public:
// The mutation is always upgraded to current schema.
void apply(const frozen_mutation& m, const schema_ptr& m_schema, const db::replay_position& = db::replay_position());
void apply(const mutation& m, const db::replay_position& = db::replay_position());
void apply_streaming_mutation(schema_ptr, const frozen_mutation&);
// Returns at most "cmd.limit" rows
future<lw_shared_ptr<query::result>> query(schema_ptr,
const query::read_command& cmd,
const query::read_command& cmd, query::result_request request,
const std::vector<query::partition_range>& ranges);
future<> populate(sstring datadir);
@@ -264,6 +455,7 @@ public:
future<> stop();
future<> flush();
future<> flush(const db::replay_position&);
future<> flush_streaming_mutations(std::vector<query::partition_range> ranges = std::vector<query::partition_range>{});
void clear(); // discards memtable(s) without flushing them to disk.
future<db::replay_position> discard_sstables(db_clock::time_point);
@@ -274,14 +466,19 @@ public:
future<int64_t> disable_sstable_write() {
_sstable_writes_disabled_at = std::chrono::steady_clock::now();
return _sstables_lock.write_lock().then([this] {
return make_ready_future<int64_t>((*_sstables->end()).first);
if (_sstables->empty()) {
return make_ready_future<int64_t>(0);
}
return make_ready_future<int64_t>((*_sstables->rbegin()).first);
});
}
// SSTable writes are now allowed again, and generation is updated to new_generation
// SSTable writes are now allowed again, and generation is updated to new_generation if != -1
// returns the amount of microseconds elapsed since we disabled writes.
std::chrono::steady_clock::duration enable_sstable_write(int64_t new_generation) {
update_sstables_known_generation(new_generation);
if (new_generation != -1) {
update_sstables_known_generation(new_generation);
}
_sstables_lock.write_unlock();
return std::chrono::steady_clock::now() - _sstable_writes_disabled_at;
}
@@ -295,9 +492,11 @@ public:
// very dangerous to do that with live SSTables. This is meant to be used with SSTables
// that are not yet managed by the system.
//
// Parameter all_generations stores the generation of all SSTables in the system, so it
// will be easy to determine which SSTable is new.
// An example usage would be to query all shards for the highest SSTable number known
// to them, and then pass that + 1 as "start".
future<std::vector<sstables::entry_descriptor>> reshuffle_sstables(int64_t start);
future<std::vector<sstables::entry_descriptor>> reshuffle_sstables(std::set<int64_t> all_generations, int64_t start);
// FIXME: this is just an example, should be changed to something more
// general. compact_all_sstables() starts a compaction of all sstables.
@@ -331,6 +530,7 @@ public:
}
lw_shared_ptr<sstable_list> get_sstables();
lw_shared_ptr<sstable_list> get_sstables_including_compacted_undeleted();
size_t sstables_count();
int64_t get_unleveled_sstables() const;
@@ -362,15 +562,15 @@ public:
Result run_with_compaction_disabled(Func && func) {
++_compaction_disabled;
return _compaction_manager.remove(this).then(std::forward<Func>(func)).finally([this] {
if (--_compaction_disabled == 0) {
trigger_compaction();
// #934. The pending counter is actually a great indicator into whether we
// actually need to trigger a compaction again.
if (--_compaction_disabled == 0 && _stats.pending_compactions > 0) {
// we're turning it on again; use the function that does not increment
// the counter further.
do_trigger_compaction();
}
});
}
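The #934 gating above can be modeled in isolation (field and method names are hypothetical; the counter stands in for `_compaction_disabled` and the trigger for `do_trigger_compaction()`):

```cpp
#include <cassert>

// Illustrative model of the gating above: re-trigger compaction only when
// the *last* disabler re-enables AND compactions are actually pending.
struct compaction_gate {
    int disabled = 0;   // nesting count of run_with_compaction_disabled
    int pending = 0;    // stands in for _stats.pending_compactions
    int triggered = 0;  // counts calls to do_trigger_compaction()
    void disable() { ++disabled; }
    void enable() {
        if (--disabled == 0 && pending > 0) {
            ++triggered;
        }
    }
};
```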
std::unordered_set<unsigned long>& compacting_generations() {
return _compacting_generations;
}
private:
// One does not need to wait on this future if all we are interested in, is
// initiating the write. The writes initiated here will eventually
@@ -380,23 +580,42 @@ private:
// But it is possible to synchronously wait for the seal to complete by
// waiting on this future. This is useful in situations where we want to
// synchronously flush data to disk.
//
// FIXME: A better interface would guarantee that all writes before this
// one are also complete
future<> seal_active_memtable();
// I am assuming here that the repair process will potentially send ranges containing
// few mutations, definitely not enough to fill a memtable. It wants to know whether or
// not each of those ranges individually succeeded or failed, so we need a future for
// each.
//
// One of the ways to fix that, is changing the repair itself to send more mutations at
// a single batch. But relying on that is a bad idea for two reasons:
//
// First, the goals of the SSTable writer and the repair sender are at odds. The SSTable
// writer wants to write as few SSTables as possible, while the repair sender wants to
// break down the range in pieces as small as it can and checksum them individually, so
// it doesn't have to send a lot of mutations for no reason.
//
// Second, even if the repair process wants to process larger ranges at once, some ranges
// themselves may be small. So while most ranges would be large, we would still have
// potentially some fairly small SSTables lying around.
//
// The best course of action in this case is to coalesce the incoming streams write-side.
// repair can now choose whatever strategy - small or big ranges - it wants, rest assured
// that the incoming memtables will be coalesced together.
shared_promise<> _waiting_streaming_flushes;
timer<> _delayed_streaming_flush{[this] { seal_active_streaming_memtable(); }};
future<> seal_active_streaming_memtable();
future<> seal_active_streaming_memtable_delayed();
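The write-side coalescing argued for in the comment above can be sketched without Seastar (purely illustrative; `streaming_coalescer` and its members are hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of write-side coalescing: small incoming ranges
// accumulate in one buffer and are sealed together as a single unit,
// instead of producing one tiny sstable per range.
struct streaming_coalescer {
    std::vector<std::string> pending;   // small incoming ranges
    std::vector<std::size_t> sealed;    // ranges per sealed "sstable"
    void add_range(std::string r) { pending.push_back(std::move(r)); }
    void seal() {   // models seal_active_streaming_memtable()
        if (!pending.empty()) {
            sealed.push_back(pending.size());
            pending.clear();
        }
    }
};
```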
// filter manifest.json files out
static bool manifest_json_filter(const sstring& fname);
seastar::gate _in_flight_seals;
// Iterate over all partitions. Protocol is the same as std::all_of(),
// so that iteration can be stopped by returning false.
// Func signature: bool (const decorated_key& dk, const mutation_partition& mp)
template <typename Func>
future<bool> for_all_partitions(schema_ptr, Func&& func) const;
future<sstables::entry_descriptor> probe_file(sstring sstdir, sstring fname);
void seal_on_overflow();
void check_valid_rp(const db::replay_position&) const;
public:
// Iterate over all partitions. Protocol is the same as std::all_of(),
@@ -499,7 +718,9 @@ public:
bool enable_cache = true;
bool enable_incremental_backups = false;
size_t max_memtable_size = 5'000'000;
size_t max_streaming_memtable_size = 5'000'000;
logalloc::region_group* dirty_memory_region_group = nullptr;
logalloc::region_group* streaming_dirty_memory_region_group = nullptr;
::cf_stats* cf_stats = nullptr;
};
private:
@@ -561,18 +782,19 @@ public:
class database {
::cf_stats _cf_stats;
logalloc::region_group _dirty_memory_region_group;
logalloc::region_group _streaming_dirty_memory_region_group;
std::unordered_map<sstring, keyspace> _keyspaces;
std::unordered_map<utils::UUID, lw_shared_ptr<column_family>> _column_families;
std::unordered_map<std::pair<sstring, sstring>, utils::UUID, utils::tuple_hash> _ks_cf_to_uuid;
std::unique_ptr<db::commitlog> _commitlog;
std::unique_ptr<db::config> _cfg;
size_t _memtable_total_space = 500 << 20;
size_t _streaming_memtable_total_space = 500 << 20;
utils::UUID _version;
// compaction_manager object is referenced by all column families of a database.
compaction_manager _compaction_manager;
std::vector<scollectd::registration> _collectd;
timer<> _throttling_timer{[this] { unthrottle(); }};
circular_buffer<promise<>> _throttled_requests;
bool _enable_incremental_backups = false;
future<> init_commitlog();
future<> apply_in_memory(const frozen_mutation& m, const schema_ptr& m_schema, const db::replay_position&);
@@ -586,12 +808,16 @@ private:
void create_in_memory_keyspace(const lw_shared_ptr<keyspace_metadata>& ksm);
friend void db::system_keyspace::make(database& db, bool durable, bool volatile_testing_only);
void setup_collectd();
future<> throttle();
throttle_state _memtables_throttler;
throttle_state _streaming_throttler;
future<> do_apply(schema_ptr, const frozen_mutation&);
void unthrottle();
public:
static utils::UUID empty_version;
void set_enable_incremental_backups(bool val) { _enable_incremental_backups = val; }
future<> parse_system_tables(distributed<service::storage_proxy>&);
database();
database(const db::config&);
@@ -618,8 +844,6 @@ public:
void add_column_family(schema_ptr schema, column_family::config cfg);
future<> drop_column_family(db_clock::time_point changed_at, const sstring& ks_name, const sstring& cf_name);
/* throws std::out_of_range if missing */
const utils::UUID& find_uuid(const sstring& ks, const sstring& cf) const throw (std::out_of_range);
const utils::UUID& find_uuid(const schema_ptr&) const throw (std::out_of_range);
@@ -644,6 +868,7 @@ public:
const column_family& find_column_family(const utils::UUID&) const throw (no_such_column_family);
column_family& find_column_family(const schema_ptr&) throw (no_such_column_family);
const column_family& find_column_family(const schema_ptr&) const throw (no_such_column_family);
bool column_family_exists(const utils::UUID& uuid) const;
schema_ptr find_schema(const sstring& ks_name, const sstring& cf_name) const throw (no_such_column_family);
schema_ptr find_schema(const utils::UUID&) const throw (no_such_column_family);
bool has_schema(const sstring& ks_name, const sstring& cf_name) const;
@@ -652,9 +877,10 @@ public:
unsigned shard_of(const dht::token& t);
unsigned shard_of(const mutation& m);
unsigned shard_of(const frozen_mutation& m);
future<lw_shared_ptr<query::result>> query(schema_ptr, const query::read_command& cmd, const std::vector<query::partition_range>& ranges);
future<lw_shared_ptr<query::result>> query(schema_ptr, const query::read_command& cmd, query::result_request request, const std::vector<query::partition_range>& ranges);
future<reconcilable_result> query_mutations(schema_ptr, const query::read_command& cmd, const query::partition_range& range);
future<> apply(schema_ptr, const frozen_mutation&);
future<> apply_streaming_mutation(schema_ptr, const frozen_mutation&);
keyspace::config make_keyspace_config(const keyspace_metadata& ksm);
const sstring& get_snitch_name() const;
future<> clear_snapshot(sstring tag, std::vector<sstring> keyspace_names);
@@ -686,9 +912,16 @@ public:
}
future<> flush_all_memtables();
// See #937. Truncation now requires a callback to get a time stamp
// that must be guaranteed to be the same for all shards.
typedef std::function<future<db_clock::time_point>()> timestamp_func;
/** Truncates the given column family */
future<> truncate(db_clock::time_point truncated_at, sstring ksname, sstring cfname);
future<> truncate(db_clock::time_point truncated_at, const keyspace& ks, column_family& cf);
future<> truncate(sstring ksname, sstring cfname, timestamp_func);
future<> truncate(const keyspace& ks, column_family& cf, timestamp_func);
future<> drop_column_family(const sstring& ks_name, const sstring& cf_name, timestamp_func);
const logalloc::region_group& dirty_memory_region_group() const {
return _dirty_memory_region_group;


@@ -59,6 +59,12 @@
#include "gms/failure_detector.hh"
#include "service/storage_service.hh"
#include "schema_registry.hh"
#include "idl/uuid.dist.hh"
#include "idl/frozen_schema.dist.hh"
#include "serializer_impl.hh"
#include "serialization_visitors.hh"
#include "idl/uuid.dist.impl.hh"
#include "idl/frozen_schema.dist.impl.hh"
static logging::logger logger("batchlog_manager");
@@ -119,15 +125,11 @@ mutation db::batchlog_manager::get_batch_log_mutation_for(const std::vector<muta
auto timestamp = api::new_timestamp();
auto data = [this, &mutations] {
std::vector<canonical_mutation> fm(mutations.begin(), mutations.end());
const auto size = std::accumulate(fm.begin(), fm.end(), size_t(0), [](size_t s, auto& m) {
return s + serializer<canonical_mutation>{m}.size();
});
bytes buf(bytes::initialized_later(), size);
data_output out(buf);
bytes_ostream out;
for (auto& m : fm) {
serializer<canonical_mutation>{m}(out);
ser::serialize(out, m);
}
return buf;
return to_bytes(out.linearize());
}();
mutation m(key, schema);
@@ -155,47 +157,58 @@ future<> db::batchlog_manager::replay_all_failed_batches() {
auto written_at = row.get_as<db_clock::time_point>("written_at");
auto id = row.get_as<utils::UUID>("id");
// enough time for the actual write + batchlog entry mutation delivery (two separate requests).
auto timeout = get_batch_log_timeout();
if (db_clock::now() < written_at + timeout) {
logger.debug("Skipping replay of {}, too fresh", id);
return make_ready_future<>();
}
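The freshness check above reduces to a simple time comparison; a minimal sketch (the function name is hypothetical):

```cpp
#include <cassert>
#include <chrono>

// Sketch of the check above: a batch is only replayed once enough time
// has passed for the original write plus the batchlog entry delivery
// (two separate requests) to have completed.
template <typename Clock>
bool should_replay(typename Clock::time_point written_at,
                   typename Clock::time_point now,
                   typename Clock::duration timeout) {
    return now >= written_at + timeout;
}
```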
// not used currently. ever?
//auto version = row.has("version") ? row.get_as<uint32_t>("version") : /*MessagingService.VERSION_12*/6u;
// check version of serialization format
if (!row.has("version")) {
logger.warn("Skipping logged batch because of unknown version");
return make_ready_future<>();
}
auto version = row.get_as<int32_t>("version");
if (version != net::messaging_service::current_version) {
logger.warn("Skipping logged batch because of incorrect version");
return make_ready_future<>();
}
auto data = row.get_blob("data");
logger.debug("Replaying batch {}", id);
auto fms = make_lw_shared<std::deque<canonical_mutation>>();
data_input in(data);
while (in.has_next()) {
fms->emplace_back(serializer<canonical_mutation>::read(in));
auto in = ser::as_input_stream(data);
while (in.size()) {
fms->emplace_back(ser::deserialize(in, boost::type<canonical_mutation>()));
}
auto mutations = make_lw_shared<std::vector<mutation>>();
auto size = data.size();
return repeat([this, fms = std::move(fms), written_at, mutations]() mutable {
if (fms->empty()) {
return make_ready_future<stop_iteration>(stop_iteration::yes);
}
auto& fm = fms->front();
auto mid = fm.column_family_id();
return system_keyspace::get_truncated_at(mid).then([this, mid, &fm, written_at, mutations](db_clock::time_point t) {
schema_ptr s = _qp.db().local().find_schema(mid);
return map_reduce(*fms, [this, written_at] (canonical_mutation& fm) {
return system_keyspace::get_truncated_at(fm.column_family_id()).then([written_at, &fm] (db_clock::time_point t) ->
std::experimental::optional<std::reference_wrapper<canonical_mutation>> {
if (written_at > t) {
mutations->emplace_back(fm.to_mutation(s));
return { std::ref(fm) };
} else {
return {};
}
}).then([fms] {
fms->pop_front();
return make_ready_future<stop_iteration>(stop_iteration::no);
});
}).then([this, id, mutations, limiter, written_at, size] {
if (mutations->empty()) {
},
std::vector<mutation>(),
[this] (std::vector<mutation> mutations, std::experimental::optional<std::reference_wrapper<canonical_mutation>> fm) {
if (fm) {
schema_ptr s = _qp.db().local().find_schema(fm.value().get().column_family_id());
mutations.emplace_back(fm.value().get().to_mutation(s));
}
return mutations;
}).then([this, id, limiter, written_at, size, fms] (std::vector<mutation> mutations) {
if (mutations.empty()) {
return make_ready_future<>();
}
const auto ttl = [this, mutations, written_at]() -> clock_type {
const auto ttl = [this, &mutations, written_at]() -> clock_type {
/*
* Calculate ttl for the mutations' hints (and reduce ttl by the time the mutations spent in the batchlog).
* This ensures that deletes aren't "undone" by an old batch replay.
@@ -217,8 +230,8 @@ future<> db::batchlog_manager::replay_all_failed_batches() {
// Our normal write path does not add much redundancy to the dispatch, and rate is handled after send
// in both cases.
// FIXME: verify that the above is reasonably true.
return limiter->reserve(size).then([this, mutations, id] {
return _qp.proxy().local().mutate(std::move(*mutations), db::consistency_level::ANY);
return limiter->reserve(size).then([this, mutations = std::move(mutations), id] {
return _qp.proxy().local().mutate(mutations, db::consistency_level::ANY);
});
}).then([this, id] {
// delete batch


@@ -67,6 +67,9 @@
#include "commitlog_entry.hh"
#include "service/priority_manager.hh"
#include <boost/range/numeric.hpp>
#include <boost/range/adaptor/transformed.hpp>
static logging::logger logger("commitlog");
class crc32_nbo {
@@ -145,7 +148,7 @@ const std::string db::commitlog::descriptor::FILENAME_PREFIX(
"CommitLog" + SEPARATOR);
const std::string db::commitlog::descriptor::FILENAME_EXTENSION(".log");
class db::commitlog::segment_manager {
class db::commitlog::segment_manager : public ::enable_shared_from_this<segment_manager> {
public:
config cfg;
const uint64_t max_size;
@@ -275,6 +278,8 @@ public:
scollectd::registrations create_counters();
void orphan_all();
void discard_unused_segments();
void discard_completed_segments(const cf_id_type& id,
const replay_position& pos);
@@ -372,7 +377,7 @@ private:
*/
class db::commitlog::segment: public enable_lw_shared_from_this<segment> {
segment_manager* _segment_manager;
::shared_ptr<segment_manager> _segment_manager;
descriptor _desc;
file _file;
@@ -404,7 +409,7 @@ class db::commitlog::segment: public enable_lw_shared_from_this<segment> {
// This is maintaining the semantics of only using the write-lock
// as a gate for flushing, i.e. once we've begun a flush for position X
// we are ok with writes to positions > X
return _dwrite.write_lock().then(std::bind(&segment_manager::begin_flush, _segment_manager)).finally([this] {
return _segment_manager->begin_flush().then(std::bind(&rwlock::write_lock, &_dwrite)).finally([this] {
_dwrite.write_unlock();
});
}
@@ -417,12 +422,12 @@ class db::commitlog::segment: public enable_lw_shared_from_this<segment> {
// This is maintaining the semantics of only using the write-lock
// as a gate for flushing, i.e. once we've begun a flush for position X
// we are ok with writes to positions > X
return _dwrite.read_lock().then(std::bind(&segment_manager::begin_write, _segment_manager));
return _segment_manager->begin_write().then(std::bind(&rwlock::read_lock, &_dwrite));
}
void end_write() {
_segment_manager->end_write();
_dwrite.read_unlock();
_segment_manager->end_write();
}
public:
@@ -444,8 +449,8 @@ public:
// TODO : tune initial / default size
static constexpr size_t default_size = align_up<size_t>(128 * 1024, alignment);
segment(segment_manager* m, const descriptor& d, file && f, bool active)
: _segment_manager(m), _desc(std::move(d)), _file(std::move(f)), _sync_time(
segment(::shared_ptr<segment_manager> m, const descriptor& d, file && f, bool active)
: _segment_manager(std::move(m)), _desc(std::move(d)), _file(std::move(f)), _sync_time(
clock_type::now()), _queue(0)
{
++_segment_manager->totals.segments_created;
@@ -553,7 +558,7 @@ public:
throw;
}
});
}).finally([this] {
}).finally([this, me] {
end_flush();
});
}
@@ -642,7 +647,7 @@ public:
forget_schema_versions();
// acquire read lock
return begin_write().then([this, size, off, buf = std::move(buf), me]() mutable {
return begin_write().then([this, size, off, buf = std::move(buf)]() mutable {
auto written = make_lw_shared<size_t>(0);
auto p = buf.get();
return repeat([this, size, off, written, p]() mutable {
@@ -1038,10 +1043,12 @@ void db::commitlog::segment_manager::flush_segments(bool force) {
future<db::commitlog::segment_manager::sseg_ptr> db::commitlog::segment_manager::allocate_segment(bool active) {
descriptor d(next_id());
return open_file_dma(cfg.commit_log_location + "/" + d.filename(), open_flags::wo | open_flags::create).then([this, d, active](file f) {
file_open_options opt;
opt.extent_allocation_size_hint = max_size;
return open_file_dma(cfg.commit_log_location + "/" + d.filename(), open_flags::wo | open_flags::create, opt).then([this, d, active](file f) {
// xfs doesn't like files extended beyond eof, so enlarge the file
return f.truncate(max_size).then([this, d, active, f] () mutable {
auto s = make_lw_shared<segment>(this, d, std::move(f), active);
auto s = make_lw_shared<segment>(this->shared_from_this(), d, std::move(f), active);
return make_ready_future<sseg_ptr>(s);
});
});
@@ -1155,6 +1162,10 @@ future<> db::commitlog::segment_manager::shutdown() {
return make_ready_future<>();
}
void db::commitlog::segment_manager::orphan_all() {
_segments.clear();
_reserve_segments.clear();
}
/*
* Sync all segments, then clear them out. To ensure all ops are done.
@@ -1168,7 +1179,7 @@ future<> db::commitlog::segment_manager::clear() {
for (auto& s : _segments) {
s->mark_clean();
}
_segments.clear();
orphan_all();
});
}
/**
@@ -1202,7 +1213,15 @@ void db::commitlog::segment_manager::on_timer() {
// take outstanding allocations into regard. This is paranoid,
// but if for some reason the file::open takes longer than timer period,
// we could flood the reserve list with new segments
auto n = _reserve_segments.size() + _reserve_allocating;
//
// #482 - _reserve_allocating is decremented in the finally clause below.
// This is needed because if either allocate_segment _or_ emplacing into
// _reserve_segments should throw, we still need the counter reset
// However, because of this, it might be that emplace was done, but not decrement,
// when we get here again. So occasionally we might get a sum of the two that is
// not consistent. It should however always just potentially be _too much_, i.e.
// just an indicator that we don't need to do anything. So let's do that.
auto n = std::min(_reserve_segments.size() + _reserve_allocating, _num_reserve_segments);
return parallel_for_each(boost::irange(n, _num_reserve_segments), [this, n](auto i) {
++_reserve_allocating;
return this->allocate_segment(false).then([this](sseg_ptr s) {
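The #482 clamp above can be isolated into a pure function (name and decomposition are hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Sketch of the clamp above: queued plus in-flight reserve segments can
// transiently overshoot the target, so clamp the sum before deciding how
// many more segments to allocate.
std::size_t segments_to_allocate(std::size_t reserved, std::size_t allocating,
                                 std::size_t target) {
    std::size_t n = std::min(reserved + allocating, target);
    return target - n;   // never underflows, thanks to the clamp
}
```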
@@ -1283,8 +1302,9 @@ void db::commitlog::segment_manager::release_buffer(buffer_type&& b) {
logger.trace("Deleting {} buffers", _temp_buffers.size() - max_temp_buffers);
_temp_buffers.erase(_temp_buffers.begin() + max_temp_buffers, _temp_buffers.end());
}
totals.buffer_list_bytes = std::accumulate(_temp_buffers.begin(),
_temp_buffers.end(), size_t(0), std::plus<size_t>());
totals.buffer_list_bytes = boost::accumulate(
_temp_buffers | boost::adaptors::transformed(std::mem_fn(&buffer_type::size)),
size_t(0), std::plus<size_t>());
}
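The hunk above switches from summing the buffer elements themselves to summing their sizes; a standard-library equivalent of the Boost expression (with `std::string` standing in for the real `buffer_type`):

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Sum the *sizes* of the retained buffers, as the refactored
// buffer_list_bytes accounting above does.
std::size_t buffer_list_bytes(const std::vector<std::string>& bufs) {
    return std::accumulate(bufs.begin(), bufs.end(), std::size_t(0),
        [](std::size_t s, const std::string& b) { return s + b.size(); });
}
```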
/**
@@ -1334,7 +1354,7 @@ future<db::replay_position> db::commitlog::add_entry(const cf_id_type& id, const
}
db::commitlog::commitlog(config cfg)
: _segment_manager(new segment_manager(std::move(cfg))) {
: _segment_manager(::make_shared<segment_manager>(std::move(cfg))) {
}
db::commitlog::commitlog(commitlog&& v) noexcept
@@ -1342,6 +1362,9 @@ db::commitlog::commitlog(commitlog&& v) noexcept
}
db::commitlog::~commitlog() {
if (_segment_manager != nullptr) {
_segment_manager->orphan_all();
}
}
future<db::commitlog> db::commitlog::create_commitlog(config cfg) {


@@ -98,7 +98,7 @@ public:
class segment;
private:
std::unique_ptr<segment_manager> _segment_manager;
::shared_ptr<segment_manager> _segment_manager;
public:
enum class sync_mode {
PERIODIC, BATCH


@@ -0,0 +1,86 @@
/*
* Copyright 2016 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "commitlog_entry.hh"
#include "idl/uuid.dist.hh"
#include "idl/keys.dist.hh"
#include "idl/frozen_mutation.dist.hh"
#include "idl/mutation.dist.hh"
#include "idl/commitlog.dist.hh"
#include "serializer_impl.hh"
#include "serialization_visitors.hh"
#include "idl/uuid.dist.impl.hh"
#include "idl/keys.dist.impl.hh"
#include "idl/frozen_mutation.dist.impl.hh"
#include "idl/mutation.dist.impl.hh"
#include "idl/commitlog.dist.impl.hh"
commitlog_entry::commitlog_entry(stdx::optional<column_mapping> mapping, frozen_mutation&& mutation)
: _mapping(std::move(mapping))
, _mutation_storage(std::move(mutation))
, _mutation(*_mutation_storage)
{ }
commitlog_entry::commitlog_entry(stdx::optional<column_mapping> mapping, const frozen_mutation& mutation)
: _mapping(std::move(mapping))
, _mutation(mutation)
{ }
commitlog_entry::commitlog_entry(commitlog_entry&& ce)
: _mapping(std::move(ce._mapping))
, _mutation_storage(std::move(ce._mutation_storage))
, _mutation(_mutation_storage ? *_mutation_storage : ce._mutation)
{
}
commitlog_entry& commitlog_entry::operator=(commitlog_entry&& ce)
{
if (this != &ce) {
this->~commitlog_entry();
new (this) commitlog_entry(std::move(ce));
}
return *this;
}
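The move-assignment above uses the destroy-and-placement-new idiom because `commitlog_entry` holds a reference member (`_mutation`) that ordinary assignment cannot reseat; a standalone sketch of the same idiom (`ref_holder` is hypothetical):

```cpp
#include <cassert>
#include <new>
#include <string>
#include <utility>

// Sketch of the destroy-and-placement-new move-assignment idiom: the
// reference member can only be bound at construction, so assignment
// tears the object down and rebuilds it via the move constructor.
struct ref_holder {
    std::string storage;
    const std::string& ref;
    explicit ref_holder(std::string s) : storage(std::move(s)), ref(storage) {}
    ref_holder(ref_holder&& o) : storage(std::move(o.storage)), ref(storage) {}
    ref_holder& operator=(ref_holder&& o) {
        if (this != &o) {
            this->~ref_holder();                 // destroy current state
            new (this) ref_holder(std::move(o)); // rebuild via move ctor
        }
        return *this;
    }
};
```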
commitlog_entry commitlog_entry_writer::get_entry() const {
if (_with_schema) {
return commitlog_entry(_schema->get_column_mapping(), _mutation);
} else {
return commitlog_entry({}, _mutation);
}
}
void commitlog_entry_writer::compute_size() {
_size = ser::get_sizeof(get_entry());
}
void commitlog_entry_writer::write(data_output& out) const {
seastar::simple_output_stream str(out.reserve(size()));
ser::serialize(str, get_entry());
}
commitlog_entry_reader::commitlog_entry_reader(const temporary_buffer<char>& buffer)
: _ce([&] {
seastar::simple_input_stream in(buffer.get(), buffer.size());
return ser::deserialize(in, boost::type<commitlog_entry>());
}())
{
}


@@ -25,21 +25,43 @@
#include "frozen_mutation.hh"
#include "schema.hh"
#include "utils/data_output.hh"
namespace stdx = std::experimental;
class commitlog_entry {
stdx::optional<column_mapping> _mapping;
stdx::optional<frozen_mutation> _mutation_storage;
const frozen_mutation& _mutation;
public:
commitlog_entry(stdx::optional<column_mapping> mapping, frozen_mutation&& mutation);
commitlog_entry(stdx::optional<column_mapping> mapping, const frozen_mutation& mutation);
commitlog_entry(commitlog_entry&&);
commitlog_entry(const commitlog_entry&) = delete;
commitlog_entry& operator=(commitlog_entry&&);
commitlog_entry& operator=(const commitlog_entry&) = delete;
const stdx::optional<column_mapping>& mapping() const { return _mapping; }
const frozen_mutation& mutation() const { return _mutation; }
};
class commitlog_entry_writer {
schema_ptr _schema;
db::serializer<column_mapping> _column_mapping_serializer;
const frozen_mutation& _mutation;
bool _with_schema = true;
size_t _size;
private:
void compute_size();
commitlog_entry get_entry() const;
public:
commitlog_entry_writer(schema_ptr s, const frozen_mutation& fm)
: _schema(std::move(s)), _column_mapping_serializer(_schema->get_column_mapping()), _mutation(fm)
{ }
: _schema(std::move(s)), _mutation(fm)
{
compute_size();
}
void set_with_schema(bool value) {
_with_schema = value;
compute_size();
}
bool with_schema() {
return _with_schema;
@@ -49,40 +71,17 @@ public:
}
size_t size() const {
size_t size = data_output::serialized_size<bool>();
if (_with_schema) {
size += _column_mapping_serializer.size();
}
size += _mutation.representation().size();
return size;
return _size;
}
void write(data_output& out) const {
out.write(_with_schema);
if (_with_schema) {
_column_mapping_serializer.write(out);
}
auto bv = _mutation.representation();
out.write(bv.begin(), bv.end());
}
void write(data_output& out) const;
};
class commitlog_entry_reader {
frozen_mutation _mutation;
stdx::optional<column_mapping> _column_mapping;
commitlog_entry _ce;
public:
commitlog_entry_reader(const temporary_buffer<char>& buffer)
: _mutation(bytes())
{
data_input in(buffer);
bool has_column_mapping = in.read<bool>();
if (has_column_mapping) {
_column_mapping = db::serializer<::column_mapping>::read(in);
}
auto bv = in.read_view(in.avail());
_mutation = frozen_mutation(bytes(bv.begin(), bv.end()));
}
commitlog_entry_reader(const temporary_buffer<char>& buffer);
const stdx::optional<column_mapping>& get_column_mapping() const { return _column_mapping; }
const frozen_mutation& mutation() const { return _mutation; }
const stdx::optional<column_mapping>& get_column_mapping() const { return _ce.mapping(); }
const frozen_mutation& mutation() const { return _ce.mutation(); }
};


@@ -53,7 +53,6 @@
#include "database.hh"
#include "sstables/sstables.hh"
#include "db/system_keyspace.hh"
#include "db/serializer.hh"
#include "cql3/query_processor.hh"
#include "log.hh"
#include "converting_mutation_partition_applier.hh"


@@ -487,7 +487,7 @@ public:
val(cas_contention_timeout_in_ms, uint32_t, 5000, Unused, \
"The time that the coordinator continues to retry a CAS (compare and set) operation that contends with other proposals for the same row." \
) \
val(truncate_request_timeout_in_ms, uint32_t, 10000, Unused, \
val(truncate_request_timeout_in_ms, uint32_t, 10000, Used, \
"The time that the coordinator waits for truncates (remove all data from a table) to complete. The long default value allows for a snapshot to be taken before removing the data. If auto_snapshot is disabled (not recommended), you can reduce this time." \
) \
val(write_request_timeout_in_ms, uint32_t, 2000, Used, \
@@ -556,7 +556,7 @@ public:
val(start_rpc, bool, false, Used, \
"Starts the Thrift RPC server" \
) \
val(rpc_keepalive, bool, true, Unused, \
val(rpc_keepalive, bool, true, Used, \
"Enable or disable keepalive on client connections (RPC or native)." \
) \
val(rpc_max_threads, uint32_t, 0, Invalid, \


@@ -241,7 +241,7 @@ is_sufficient_live_nodes(consistency_level cl,
if (rs.get_type() == replication_strategy_type::network_topology) {
for (auto& entry : count_per_dc_endpoints(ks, live_endpoints)) {
if (entry.second < local_quorum_for(ks, entry.first)) {
if (entry.second.live < local_quorum_for(ks, entry.first)) {
return false;
}
}


@@ -88,10 +88,16 @@ filter_for_query(consistency_level cl,
std::vector<gms::inet_address> filter_for_query(consistency_level cl, keyspace& ks, std::vector<gms::inet_address>& live_endpoints);
template <typename Range>
inline std::unordered_map<sstring, size_t> count_per_dc_endpoints(
struct dc_node_count {
size_t live = 0;
size_t pending = 0;
};
template <typename Range, typename PendingRange = std::array<gms::inet_address, 0>>
inline std::unordered_map<sstring, dc_node_count> count_per_dc_endpoints(
keyspace& ks,
Range& live_endpoints) {
Range& live_endpoints,
const PendingRange& pending_endpoints = std::array<gms::inet_address, 0>()) {
using namespace locator;
auto& rs = ks.get_replication_strategy();
@@ -100,9 +106,9 @@ inline std::unordered_map<sstring, size_t> count_per_dc_endpoints(
network_topology_strategy* nrs =
static_cast<network_topology_strategy*>(&rs);
std::unordered_map<sstring, size_t> dc_endpoints;
std::unordered_map<sstring, dc_node_count> dc_endpoints;
for (auto& dc : nrs->get_datacenters()) {
dc_endpoints.emplace(dc, 0);
dc_endpoints.emplace(dc, dc_node_count());
}
//
@@ -111,7 +117,11 @@ inline std::unordered_map<sstring, size_t> count_per_dc_endpoints(
// nrs->get_datacenters().
//
for (auto& endpoint : live_endpoints) {
++(dc_endpoints[snitch_ptr->get_datacenter(endpoint)]);
++(dc_endpoints[snitch_ptr->get_datacenter(endpoint)].live);
}
for (auto& endpoint : pending_endpoints) {
++(dc_endpoints[snitch_ptr->get_datacenter(endpoint)].pending);
}
return dc_endpoints;
@@ -122,21 +132,23 @@ is_sufficient_live_nodes(consistency_level cl,
keyspace& ks,
const std::vector<gms::inet_address>& live_endpoints);
template<typename Range>
template<typename Range, typename PendingRange>
inline bool assure_sufficient_live_nodes_each_quorum(
consistency_level cl,
keyspace& ks,
Range& live_endpoints) {
Range& live_endpoints,
const PendingRange& pending_endpoints) {
using namespace locator;
auto& rs = ks.get_replication_strategy();
if (rs.get_type() == replication_strategy_type::network_topology) {
for (auto& entry : count_per_dc_endpoints(ks, live_endpoints)) {
for (auto& entry : count_per_dc_endpoints(ks, live_endpoints, pending_endpoints)) {
auto dc_block_for = local_quorum_for(ks, entry.first);
auto dc_live = entry.second;
auto dc_live = entry.second.live;
auto dc_pending = entry.second.pending;
if (dc_live < dc_block_for) {
if (dc_live < dc_block_for + dc_pending) {
throw exceptions::unavailable_exception(cl, dc_block_for, dc_live);
}
}
@@ -147,11 +159,12 @@ inline bool assure_sufficient_live_nodes_each_quorum(
return false;
}
template<typename Range>
template<typename Range, typename PendingRange = std::array<gms::inet_address, 0>>
inline void assure_sufficient_live_nodes(
consistency_level cl,
keyspace& ks,
Range& live_endpoints) {
Range& live_endpoints,
const PendingRange& pending_endpoints = std::array<gms::inet_address, 0>()) {
size_t need = block_for(ks, cl);
switch (cl) {
@@ -159,13 +172,13 @@ inline void assure_sufficient_live_nodes(
// local hint is acceptable, and local node is always live
break;
case consistency_level::LOCAL_ONE:
if (count_local_endpoints(live_endpoints) == 0) {
if (count_local_endpoints(live_endpoints) < count_local_endpoints(pending_endpoints) + 1) {
throw exceptions::unavailable_exception(cl, 1, 0);
}
break;
case consistency_level::LOCAL_QUORUM: {
size_t local_live = count_local_endpoints(live_endpoints);
if (local_live < need) {
if (local_live < need + count_local_endpoints(pending_endpoints)) {
#if 0
if (logger.isDebugEnabled())
{
@@ -184,14 +197,15 @@ inline void assure_sufficient_live_nodes(
break;
}
case consistency_level::EACH_QUORUM:
if (assure_sufficient_live_nodes_each_quorum(cl, ks, live_endpoints)) {
if (assure_sufficient_live_nodes_each_quorum(cl, ks, live_endpoints, pending_endpoints)) {
break;
}
// Fallthrough on purpose for SimpleStrategy
default:
size_t live = live_endpoints.size();
if (live < need) {
cl_logger.debug("Live nodes {} do not satisfy ConsistencyLevel ({} required)", live, need);
size_t pending = pending_endpoints.size();
if (live < need + pending) {
cl_logger.debug("Live nodes {} do not satisfy ConsistencyLevel ({} required, {} pending)", live, need, pending);
throw exceptions::unavailable_exception(cl, need, live);
}
break;
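The availability rule in the hunk above reduces to one comparison; a hedged sketch (the free function is hypothetical — the real check also raises `unavailable_exception`):

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the rule above: pending endpoints raise the effective
// requirement, because a pending (joining) node also receives the write
// yet cannot count toward the consistency level.
bool sufficient_live_nodes(std::size_t live, std::size_t need,
                           std::size_t pending) {
    return live >= need + pending;
}
```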


@@ -65,6 +65,7 @@
#include <boost/range/adaptor/map.hpp>
#include "compaction_strategy.hh"
#include "utils/joinpoint.hh"
using namespace db::system_keyspace;
@@ -415,16 +416,16 @@ future<std::vector<frozen_mutation>> convert_schema_to_mutations(distributed<ser
if (partition_key == system_keyspace::NAME) {
continue;
}
results.emplace_back(p.mut());
results.emplace_back(std::move(p.mut()));
}
return results;
});
};
auto reduce = [] (auto&& result, auto&& mutations) {
std::copy(mutations.begin(), mutations.end(), std::back_inserter(result));
std::move(mutations.begin(), mutations.end(), std::back_inserter(result));
return std::move(result);
};
return map_reduce(ALL.begin(), ALL.end(), map, std::move(std::vector<frozen_mutation>{}), reduce);
return map_reduce(ALL.begin(), ALL.end(), map, std::vector<frozen_mutation>{}, reduce);
}
future<schema_result>
@@ -606,10 +607,10 @@ future<> do_merge_schema(distributed<service::storage_proxy>& proxy, std::vector
#endif
proxy.local().get_db().invoke_on_all([keyspaces_to_drop = std::move(keyspaces_to_drop)] (database& db) {
// it is safe to drop a keyspace only when all nested ColumnFamilies where deleted
for (auto&& keyspace_to_drop : keyspaces_to_drop) {
return do_for_each(keyspaces_to_drop, [&db] (auto keyspace_to_drop) {
db.drop_keyspace(keyspace_to_drop);
service::get_local_migration_manager().notify_drop_keyspace(keyspace_to_drop);
}
return service::get_local_migration_manager().notify_drop_keyspace(keyspace_to_drop);
});
}).get0();
});
}
@@ -649,7 +650,7 @@ future<std::set<sstring>> merge_keyspaces(distributed<service::storage_proxy>& p
return do_for_each(created, [&db](auto&& val) {
auto ksm = create_keyspace_from_schema_partition(val);
return db.create_keyspace(ksm).then([ksm] {
service::get_local_migration_manager().notify_create_keyspace(ksm);
return service::get_local_migration_manager().notify_create_keyspace(ksm);
});
}).then([&altered, &db] () mutable {
for (auto&& name : altered) {
@@ -662,7 +663,7 @@ future<std::set<sstring>> merge_keyspaces(distributed<service::storage_proxy>& p
});
}
static void update_column_family(database& db, schema_ptr new_schema) {
static future<> update_column_family(database& db, schema_ptr new_schema) {
column_family& cfm = db.find_column_family(new_schema->id());
bool columns_changed = !cfm.schema()->equal_columns(*new_schema);
@@ -671,7 +672,7 @@ static void update_column_family(database& db, schema_ptr new_schema) {
s->registry_entry()->mark_synced();
cfm.set_schema(std::move(s));
service::get_local_migration_manager().notify_update_column_family(cfm.schema(), columns_changed);
return service::get_local_migration_manager().notify_update_column_family(cfm.schema(), columns_changed);
}
// see the comments for merge_keyspaces()
@@ -679,7 +680,6 @@ static void merge_tables(distributed<service::storage_proxy>& proxy,
std::map<qualified_name, schema_mutations>&& before,
std::map<qualified_name, schema_mutations>&& after)
{
auto changed_at = db_clock::now();
std::vector<global_schema_ptr> created;
std::vector<global_schema_ptr> altered;
std::vector<global_schema_ptr> dropped;
@@ -687,34 +687,44 @@ static void merge_tables(distributed<service::storage_proxy>& proxy,
auto diff = difference(before, after);
for (auto&& key : diff.entries_only_on_left) {
auto&& s = proxy.local().get_db().local().find_schema(key.keyspace_name, key.table_name);
logger.info("Dropping {}.{} id={} version={}", s->ks_name(), s->cf_name(), s->id(), s->version());
dropped.emplace_back(s);
}
for (auto&& key : diff.entries_only_on_right) {
created.emplace_back(create_table_from_mutations(after.at(key)));
auto s = create_table_from_mutations(after.at(key));
logger.info("Creating {}.{} id={} version={}", s->ks_name(), s->cf_name(), s->id(), s->version());
created.emplace_back(s);
}
for (auto&& key : diff.entries_differing) {
altered.emplace_back(create_table_from_mutations(after.at(key)));
auto s = create_table_from_mutations(after.at(key));
logger.info("Altering {}.{} id={} version={}", s->ks_name(), s->cf_name(), s->id(), s->version());
altered.emplace_back(s);
}
proxy.local().get_db().invoke_on_all([&created, &dropped, &altered, changed_at] (database& db) {
return seastar::async([&] {
for (auto&& gs : created) {
schema_ptr s = gs.get();
auto& ks = db.find_keyspace(s->ks_name());
auto cfg = ks.make_column_family_config(*s);
db.add_column_family(s, cfg);
ks.make_directory_for_column_family(s->cf_name(), s->id()).get();
service::get_local_migration_manager().notify_create_column_family(s);
}
for (auto&& gs : altered) {
update_column_family(db, gs.get());
}
parallel_for_each(dropped.begin(), dropped.end(), [changed_at, &db](auto&& gs) {
schema_ptr s = gs.get();
return db.drop_column_family(changed_at, s->ks_name(), s->cf_name()).then([s] {
service::get_local_migration_manager().notify_drop_column_family(s);
});
}).get();
do_with(utils::make_joinpoint([] { return db_clock::now();})
, [&created, &dropped, &altered, &proxy](auto& tsf) {
return proxy.local().get_db().invoke_on_all([&created, &dropped, &altered, &tsf] (database& db) {
return seastar::async([&] {
for (auto&& gs : created) {
schema_ptr s = gs.get();
auto& ks = db.find_keyspace(s->ks_name());
auto cfg = ks.make_column_family_config(*s);
db.add_column_family(s, cfg);
auto& cf = db.find_column_family(s);
cf.mark_ready_for_writes();
ks.make_directory_for_column_family(s->cf_name(), s->id()).get();
service::get_local_migration_manager().notify_create_column_family(s).get();
}
for (auto&& gs : altered) {
update_column_family(db, gs.get()).get();
}
parallel_for_each(dropped.begin(), dropped.end(), [&db, &tsf](auto&& gs) {
schema_ptr s = gs.get();
return db.drop_column_family(s->ks_name(), s->cf_name(), [&tsf] { return tsf.value(); }).then([s] {
return service::get_local_migration_manager().notify_drop_column_family(s);
});
}).get();
});
});
}).get();
}


@@ -1,194 +0,0 @@
/*
* Copyright 2015 Cloudius Systems
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "serializer.hh"
#include "database.hh"
#include "types.hh"
#include "utils/serialization.hh"
typedef uint32_t count_type; // Me thinks 32-bits are enough for "normal" count purposes.
template<>
db::serializer<utils::UUID>::serializer(const utils::UUID& uuid)
: _item(uuid), _size(2 * sizeof(uint64_t)) {
}
template<>
void db::serializer<utils::UUID>::write(output& out,
const type& t) {
out.write(t.get_most_significant_bits());
out.write(t.get_least_significant_bits());
}
template<>
void db::serializer<utils::UUID>::read(utils::UUID& uuid, input& in) {
uuid = read(in);
}
template<>
void db::serializer<utils::UUID>::skip(input& in) {
in.skip(2 * sizeof(uint64_t));
}
template<> utils::UUID db::serializer<utils::UUID>::read(input& in) {
auto msb = in.read<uint64_t>();
auto lsb = in.read<uint64_t>();
return utils::UUID(msb, lsb);
}
template<>
db::serializer<bytes>::serializer(const bytes& b)
: _item(b), _size(output::serialized_size(b)) {
}
template<>
void db::serializer<bytes>::write(output& out, const type& t) {
out.write(t);
}
template<>
void db::serializer<bytes>::read(bytes& b, input& in) {
b = in.read<bytes>();
}
template<>
void db::serializer<bytes>::skip(input& in) {
in.read<bytes>(); // FIXME: Avoid reading
}
template<>
db::serializer<bytes_view>::serializer(const bytes_view& v)
: _item(v), _size(output::serialized_size(v)) {
}
template<>
void db::serializer<bytes_view>::write(output& out, const type& t) {
out.write(t);
}
template<>
void db::serializer<bytes_view>::read(bytes_view& v, input& in) {
v = in.read<bytes_view>();
}
template<>
bytes_view db::serializer<bytes_view>::read(input& in) {
return in.read<bytes_view>();
}
template<>
db::serializer<sstring>::serializer(const sstring& s)
: _item(s), _size(output::serialized_size(s)) {
}
template<>
void db::serializer<sstring>::write(output& out, const type& t) {
out.write(t);
}
template<>
void db::serializer<sstring>::read(sstring& s, input& in) {
s = in.read<sstring>();
}
template<>
void db::serializer<sstring>::skip(input& in) {
in.read<sstring>(); // FIXME: avoid reading
}
template<>
db::serializer<tombstone>::serializer(const tombstone& t)
: _item(t), _size(sizeof(t.timestamp) + sizeof(decltype(t.deletion_time.time_since_epoch().count()))) {
}
template<>
void db::serializer<tombstone>::write(output& out, const type& t) {
out.write(t.timestamp);
out.write(t.deletion_time.time_since_epoch().count());
}
template<>
void db::serializer<tombstone>::read(tombstone& t, input& in) {
t.timestamp = in.read<decltype(t.timestamp)>();
auto deletion_time = in.read<decltype(t.deletion_time.time_since_epoch().count())>();
t.deletion_time = gc_clock::time_point(gc_clock::duration(deletion_time));
}
template<>
db::serializer<atomic_cell_view>::serializer(const atomic_cell_view& c)
: _item(c), _size(bytes_view_serializer(c.serialize()).size()) {
}
template<>
void db::serializer<atomic_cell_view>::write(output& out, const atomic_cell_view& t) {
bytes_view_serializer::write(out, t.serialize());
}
template<>
void db::serializer<atomic_cell_view>::read(atomic_cell_view& c, input& in) {
c = atomic_cell_view::from_bytes(bytes_view_serializer::read(in));
}
template<>
atomic_cell_view db::serializer<atomic_cell_view>::read(input& in) {
return atomic_cell_view::from_bytes(bytes_view_serializer::read(in));
}
template<>
db::serializer<collection_mutation_view>::serializer(const collection_mutation_view& c)
: _item(c), _size(bytes_view_serializer(c.serialize()).size()) {
}
template<>
void db::serializer<collection_mutation_view>::write(output& out, const collection_mutation_view& t) {
bytes_view_serializer::write(out, t.serialize());
}
template<>
void db::serializer<collection_mutation_view>::read(collection_mutation_view& c, input& in) {
c = collection_mutation_view::from_bytes(bytes_view_serializer::read(in));
}
template<>
db::serializer<db::replay_position>::serializer(const db::replay_position& rp)
: _item(rp), _size(sizeof(uint64_t) * 2) {
}
template<>
void db::serializer<db::replay_position>::write(output& out, const db::replay_position& rp) {
out.write<uint64_t>(rp.id);
out.write<uint64_t>(rp.pos);
}
template<>
void db::serializer<db::replay_position>::read(db::replay_position& rp, input& in) {
rp.id = in.read<uint64_t>();
rp.pos = in.read<uint64_t>();
}
template class db::serializer<tombstone> ;
template class db::serializer<bytes> ;
template class db::serializer<bytes_view> ;
template class db::serializer<sstring> ;
template class db::serializer<atomic_cell_view> ;
template class db::serializer<collection_mutation_view> ;
template class db::serializer<utils::UUID> ;
template class db::serializer<db::replay_position> ;


@@ -1,235 +0,0 @@
/*
* Copyright 2015 Cloudius Systems
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef DB_SERIALIZER_HH_
#define DB_SERIALIZER_HH_
#include <experimental/optional>
#include "utils/data_input.hh"
#include "utils/data_output.hh"
#include "bytes_ostream.hh"
#include "bytes.hh"
#include "database_fwd.hh"
#include "db/commitlog/replay_position.hh"
namespace db {
/**
* Serialization objects for various types and using "internal" format. (Not CQL, origin whatnot).
* The design rationale is that a "serializer" can be instantiated for an object, and will contain
* the obj + size, and is usable as a functor.
*
* Serialization can also be done "explicitly" through the static method "write"
* (Not using "serialize", because writing "serializer<apa>::serialize" all the time is tiring and redundant)
* though care should be taken that data will fit, of course.
*/
template<typename T>
class serializer {
public:
typedef T type;
typedef data_output output;
typedef data_input input;
typedef serializer<T> _MyType;
serializer(const type&);
// apply to memory, must be at least size() large.
const _MyType& operator()(output& out) const {
write(out, _item);
return *this;
}
static void write(output&, const type&);
static void read(type&, input&);
static type read(input&);
static void skip(input& in);
size_t size() const {
return _size;
}
void write(bytes_ostream& out) const {
auto buf = out.write_place_holder(_size);
data_output data_out((char*)buf, _size);
write(data_out, _item);
}
void write(data_output& out) const {
write(out, _item);
}
bytes to_bytes() const {
bytes b(bytes::initialized_later(), _size);
data_output out(b);
write(out);
return b;
}
static type from_bytes(bytes_view v) {
data_input in(v);
return read(in);
}
private:
const type& _item;
size_t _size;
};
template<typename T>
class serializer<std::experimental::optional<T>> {
public:
typedef std::experimental::optional<T> type;
typedef data_output output;
typedef data_input input;
typedef serializer<T> _MyType;
serializer(const type& t)
: _item(t)
, _size(output::serialized_size<bool>() + (t ? serializer<T>(*t).size() : 0))
{}
// apply to memory, must be at least size() large.
const _MyType& operator()(output& out) const {
write(out, _item);
return *this;
}
static void write(output& out, const type& v) {
bool en = v;
out.write<bool>(en);
if (en) {
serializer<T>::write(out, *v);
}
}
static void read(type& dst, input& in) {
auto en = in.read<bool>();
if (en) {
dst = serializer<T>::read(in);
} else {
dst = {};
}
}
static type read(input& in) {
type t;
read(t, in);
return t;
}
static void skip(input& in) {
auto en = in.read<bool>();
if (en) {
serializer<T>::skip(in);
}
}
size_t size() const {
return _size;
}
void write(bytes_ostream& out) const {
auto buf = out.write_place_holder(_size);
data_output data_out((char*)buf, _size);
write(data_out, _item);
}
void write(data_output& out) const {
write(out, _item);
}
bytes to_bytes() const {
bytes b(bytes::initialized_later(), _size);
data_output out(b);
write(out);
return b;
}
static type from_bytes(bytes_view v) {
data_input in(v);
return read(in);
}
private:
const std::experimental::optional<T> _item;
size_t _size;
};
template<> serializer<utils::UUID>::serializer(const utils::UUID &);
template<> void serializer<utils::UUID>::write(output&, const type&);
template<> void serializer<utils::UUID>::read(utils::UUID&, input&);
template<> void serializer<utils::UUID>::skip(input&);
template<> utils::UUID serializer<utils::UUID>::read(input&);
template<> serializer<bytes>::serializer(const bytes &);
template<> void serializer<bytes>::write(output&, const type&);
template<> void serializer<bytes>::read(bytes&, input&);
template<> void serializer<bytes>::skip(input&);
template<> serializer<bytes_view>::serializer(const bytes_view&);
template<> void serializer<bytes_view>::write(output&, const type&);
template<> void serializer<bytes_view>::read(bytes_view&, input&);
template<> bytes_view serializer<bytes_view>::read(input&);
template<> serializer<sstring>::serializer(const sstring&);
template<> void serializer<sstring>::write(output&, const type&);
template<> void serializer<sstring>::read(sstring&, input&);
template<> void serializer<sstring>::skip(input&);
template<> serializer<tombstone>::serializer(const tombstone &);
template<> void serializer<tombstone>::write(output&, const type&);
template<> void serializer<tombstone>::read(tombstone&, input&);
template<> serializer<atomic_cell_view>::serializer(const atomic_cell_view &);
template<> void serializer<atomic_cell_view>::write(output&, const type&);
template<> void serializer<atomic_cell_view>::read(atomic_cell_view&, input&);
template<> atomic_cell_view serializer<atomic_cell_view>::read(input&);
template<> serializer<collection_mutation_view>::serializer(const collection_mutation_view &);
template<> void serializer<collection_mutation_view>::write(output&, const type&);
template<> void serializer<collection_mutation_view>::read(collection_mutation_view&, input&);
template<> serializer<db::replay_position>::serializer(const db::replay_position&);
template<> void serializer<db::replay_position>::write(output&, const db::replay_position&);
template<> void serializer<db::replay_position>::read(db::replay_position&, input&);
template<typename T>
T serializer<T>::read(input& in) {
type t;
read(t, in);
return t;
}
extern template class serializer<tombstone>;
extern template class serializer<bytes>;
extern template class serializer<bytes_view>;
extern template class serializer<sstring>;
extern template class serializer<utils::UUID>;
extern template class serializer<db::replay_position>;
typedef serializer<tombstone> tombstone_serializer;
typedef serializer<bytes> bytes_serializer; // Compatible with bytes_view_serializer
typedef serializer<bytes_view> bytes_view_serializer; // Compatible with bytes_serializer
typedef serializer<sstring> sstring_serializer;
typedef serializer<atomic_cell_view> atomic_cell_view_serializer;
typedef serializer<collection_mutation_view> collection_mutation_view_serializer;
typedef serializer<utils::UUID> uuid_serializer;
typedef serializer<db::replay_position> replay_position_serializer;
}
#endif /* DB_SERIALIZER_HH_ */


@@ -58,14 +58,16 @@
#include "thrift/server.hh"
#include "exceptions/exceptions.hh"
#include "cql3/query_processor.hh"
#include "db/serializer.hh"
#include "query_context.hh"
#include "partition_slice_builder.hh"
#include "db/config.hh"
#include "schema_builder.hh"
#include "md5_hasher.hh"
#include "release.hh"
#include "log.hh"
#include "serializer.hh"
#include <core/enum.hh>
#include "service/storage_proxy.hh"
using days = std::chrono::duration<int, std::ratio<24 * 3600>>;
@@ -75,6 +77,7 @@ std::unique_ptr<query_context> qctx = {};
namespace system_keyspace {
static logging::logger logger("system_keyspace");
static const api::timestamp_type creation_timestamp = api::new_timestamp();
api::timestamp_type schema_creation_timestamp() {
@@ -438,7 +441,7 @@ static future<> setup_version() {
version::release(),
cql3::query_processor::CQL_VERSION,
org::apache::cassandra::thrift_version,
to_sstring(version::native_protocol()),
to_sstring(cql_serialization_format::latest_version),
snitch->get_datacenter(utils::fb_utilities::get_broadcast_address()),
snitch->get_rack(utils::fb_utilities::get_broadcast_address()),
sstring(dht::global_partitioner().name()),
@@ -546,31 +549,44 @@ future<> setup(distributed<database>& db, distributed<cql3::query_processor>& qp
});
}
typedef std::pair<replay_positions, db_clock::time_point> truncation_entry;
typedef utils::UUID truncation_key;
typedef std::unordered_map<truncation_key, truncation_entry> truncation_map;
struct truncation_record {
static constexpr uint32_t current_magic = 0x53435452; // 'S' 'C' 'T' 'R'
uint32_t magic;
std::vector<db::replay_position> positions;
db_clock::time_point time_stamp;
};
}
}
#include "idl/replay_position.dist.hh"
#include "idl/truncation_record.dist.hh"
#include "serializer_impl.hh"
#include "idl/replay_position.dist.impl.hh"
#include "idl/truncation_record.dist.impl.hh"
namespace db {
namespace system_keyspace {
typedef utils::UUID truncation_key;
typedef std::unordered_map<truncation_key, truncation_record> truncation_map;
static constexpr uint8_t current_version = 1;
static thread_local std::experimental::optional<truncation_map> truncation_records;
future<> save_truncation_records(const column_family& cf, db_clock::time_point truncated_at, replay_positions positions) {
auto size =
sizeof(db_clock::rep)
+ positions.size()
* db::serializer<replay_position>(
db::replay_position()).size();
bytes buf(bytes::initialized_later(), size);
data_output out(buf);
truncation_record r;
// Old version would write a single RP. We write N. Resulting blob size
// will determine how many.
// An external entity reading this blob would get a "correct" RP
// and a garbled time stamp. But an external entity has no business
// reading this data anyway, since it is meaningless outside this
// machine instance.
for (auto& rp : positions) {
db::serializer<replay_position>::write(out, rp);
}
out.write<db_clock::rep>(truncated_at.time_since_epoch().count());
r.magic = truncation_record::current_magic;
r.time_stamp = truncated_at;
r.positions = std::move(positions);
auto buf = ser::serialize_to_buffer<bytes>(r, sizeof(current_version));
buf[0] = current_version;
static_assert(sizeof(current_version) == 1, "using this as mark");
assert(buf.size() & 1); // verify we've created an odd-sized buffer
map_type_impl::native_type tmp;
tmp.emplace_back(cf.schema()->id(), data_value(buf));
@@ -594,7 +610,7 @@ future<> remove_truncation_record(utils::UUID id) {
});
}
static future<truncation_entry> get_truncation_record(utils::UUID cf_id) {
static future<truncation_record> get_truncation_record(utils::UUID cf_id) {
if (!truncation_records) {
sstring req = sprint("SELECT truncated_at FROM system.%s WHERE key = '%s'", LOCAL, LOCAL);
return qctx->qp().execute_internal(req).then([cf_id](::shared_ptr<cql3::untyped_result_set> rs) {
@@ -605,22 +621,56 @@ static future<truncation_entry> get_truncation_record(utils::UUID cf_id) {
auto uuid = p.first;
auto buf = p.second;
truncation_entry e;
try {
truncation_record e;
data_input in(buf);
if (buf.size() & 1) {
// new record.
if (buf[0] != current_version) {
logger.warn("Found truncation record of unknown version {}. Ignoring.", int(buf[0]));
continue;
}
e = ser::deserialize_from_buffer(buf, boost::type<truncation_record>(), 1);
if (e.magic == truncation_record::current_magic) {
tmp[uuid] = e;
continue;
}
} else {
// old scylla records. (We hope)
// Read 64+64-bit RPs, even though the
// struct (and official serial size) is 64+32.
data_input in(buf);
while (in.avail() > sizeof(db_clock::rep)) {
e.first.emplace_back(db::serializer<replay_position>::read(in));
logger.debug("Reading old type record");
while (in.avail() > sizeof(db_clock::rep)) {
auto id = in.read<uint64_t>();
auto pos = in.read<uint64_t>();
e.positions.emplace_back(id, position_type(pos));
}
if (in.avail() == sizeof(db_clock::rep)) {
e.time_stamp = db_clock::time_point(db_clock::duration(in.read<db_clock::rep>()));
tmp[uuid] = e;
continue;
}
}
} catch (std::out_of_range &) {
}
e.second = db_clock::time_point(db_clock::duration(in.read<db_clock::rep>()));
tmp[uuid] = e;
// Trying to load an origin table.
// This is useless to us, because the only usage for this
// data is commit log and batch replay, and we cannot replay
// either from origin anyway.
logger.warn("Error reading truncation record for {}. "
"Most likely this is data from a cassandra instance. "
"Make sure you have cleared commit and batch logs before upgrading.",
uuid
);
}
}
truncation_records = std::move(tmp);
return get_truncation_record(cf_id);
});
}
return make_ready_future<truncation_entry>((*truncation_records)[cf_id]);
return make_ready_future<truncation_record>((*truncation_records)[cf_id]);
}
future<> save_truncation_record(const column_family& cf, db_clock::time_point truncated_at, db::replay_position rp) {
@@ -628,16 +678,16 @@ future<> save_truncation_record(const column_family& cf, db_clock::time_point tr
// once, for each core (calling us). But right now, redesigning so that calling here (or, rather,
// save_truncation_records), is done from "somewhere higher, once per machine, not shard" is tricky.
// Mainly because drop_tables also uses truncate. And is run per-core as well. Gah.
return get_truncated_position(cf.schema()->id()).then([&cf, truncated_at, rp](replay_positions positions) {
auto i = std::find_if(positions.begin(), positions.end(), [rp](auto& p) {
return get_truncation_record(cf.schema()->id()).then([&cf, truncated_at, rp](truncation_record e) {
auto i = std::find_if(e.positions.begin(), e.positions.end(), [rp](replay_position& p) {
return p.shard_id() == rp.shard_id();
});
if (i == positions.end()) {
positions.emplace_back(rp);
if (i == e.positions.end()) {
e.positions.emplace_back(rp);
} else {
*i = rp;
}
return save_truncation_records(cf, truncated_at, positions);
return save_truncation_records(cf, std::max(truncated_at, e.time_stamp), e.positions);
});
}
@@ -653,14 +703,14 @@ future<db::replay_position> get_truncated_position(utils::UUID cf_id, uint32_t s
}
future<replay_positions> get_truncated_position(utils::UUID cf_id) {
return get_truncation_record(cf_id).then([](truncation_entry e) {
return make_ready_future<replay_positions>(e.first);
return get_truncation_record(cf_id).then([](truncation_record e) {
return make_ready_future<replay_positions>(e.positions);
});
}
future<db_clock::time_point> get_truncated_at(utils::UUID cf_id) {
return get_truncation_record(cf_id).then([](truncation_entry e) {
return make_ready_future<db_clock::time_point>(e.second);
return get_truncation_record(cf_id).then([](truncation_record e) {
return make_ready_future<db_clock::time_point>(e.time_stamp);
});
}
@@ -1096,5 +1146,36 @@ future<std::vector<compaction_history_entry>> get_compaction_history()
});
}
future<int> increment_and_get_generation() {
auto req = sprint("SELECT gossip_generation FROM system.%s WHERE key='%s'", LOCAL, LOCAL);
return qctx->qp().execute_internal(req).then([] (auto rs) {
int generation;
if (rs->empty() || !rs->one().has("gossip_generation")) {
// seconds-since-epoch isn't a foolproof new generation
// (where foolproof is "guaranteed to be larger than the last one seen at this ip address"),
// but it's as close as sanely possible
generation = service::get_generation_number();
} else {
// Other nodes will ignore gossip messages about a node that has a lower generation than previously seen.
int stored_generation = rs->one().template get_as<int>("gossip_generation") + 1;
int now = service::get_generation_number();
if (stored_generation >= now) {
logger.warn("Using stored Gossip Generation {} as it is greater than current system time {}. "
"See CASSANDRA-3654 if you experience problems", stored_generation, now);
generation = stored_generation;
} else {
generation = now;
}
}
auto req = sprint("INSERT INTO system.%s (key, gossip_generation) VALUES ('%s', ?)", LOCAL, LOCAL);
return qctx->qp().execute_internal(req, {generation}).then([generation] (auto rs) {
return force_blocking_flush(LOCAL);
}).then([generation] {
return make_ready_future<int>(generation);
});
});
}
} // namespace system_keyspace
} // namespace db


@@ -401,127 +401,9 @@ enum class bootstrap_state {
*/
future<std::unordered_map<gms::inet_address, utils::UUID>> load_host_ids();
#if 0
/**
* Get preferred IP for given endpoint if it is known. Otherwise this returns given endpoint itself.
*
* @param ep endpoint address to check
* @return Preferred IP for given endpoint if present, otherwise returns given ep
*/
public static InetAddress getPreferredIP(InetAddress ep)
{
String req = "SELECT preferred_ip FROM system.%s WHERE peer=?";
UntypedResultSet result = executeInternal(String.format(req, PEERS), ep);
if (!result.isEmpty() && result.one().has("preferred_ip"))
return result.one().getInetAddress("preferred_ip");
return ep;
}
/**
* Return a map of IP addresses containing a map of dc and rack info
*/
public static Map<InetAddress, Map<String,String>> loadDcRackInfo()
{
Map<InetAddress, Map<String, String>> result = new HashMap<>();
for (UntypedResultSet.Row row : executeInternal("SELECT peer, data_center, rack from system." + PEERS))
{
InetAddress peer = row.getInetAddress("peer");
if (row.has("data_center") && row.has("rack"))
{
Map<String, String> dcRack = new HashMap<>();
dcRack.put("data_center", row.getString("data_center"));
dcRack.put("rack", row.getString("rack"));
result.put(peer, dcRack);
}
}
return result;
}
/**
* One of three things will happen if you try to read the system keyspace:
* 1. files are present and you can read them: great
* 2. no files are there: great (new node is assumed)
* 3. files are present but you can't read them: bad
* @throws ConfigurationException
*/
public static void checkHealth() throws ConfigurationException
{
Keyspace keyspace;
try
{
keyspace = Keyspace.open(NAME);
}
catch (AssertionError err)
{
// this happens when a user switches from OPP to RP.
ConfigurationException ex = new ConfigurationException("Could not read system keyspace!");
ex.initCause(err);
throw ex;
}
ColumnFamilyStore cfs = keyspace.getColumnFamilyStore(LOCAL);
String req = "SELECT cluster_name FROM system.%s WHERE key='%s'";
UntypedResultSet result = executeInternal(String.format(req, LOCAL, LOCAL));
if (result.isEmpty() || !result.one().has("cluster_name"))
{
// this is a brand new node
if (!cfs.getSSTables().isEmpty())
throw new ConfigurationException("Found system keyspace files, but they couldn't be loaded!");
// no system files. this is a new node.
req = "INSERT INTO system.%s (key, cluster_name) VALUES ('%s', ?)";
executeInternal(String.format(req, LOCAL, LOCAL), DatabaseDescriptor.getClusterName());
return;
}
String savedClusterName = result.one().getString("cluster_name");
if (!DatabaseDescriptor.getClusterName().equals(savedClusterName))
throw new ConfigurationException("Saved cluster name " + savedClusterName + " != configured name " + DatabaseDescriptor.getClusterName());
}
#endif
future<std::unordered_set<dht::token>> get_saved_tokens();
#if 0
public static int incrementAndGetGeneration()
{
String req = "SELECT gossip_generation FROM system.%s WHERE key='%s'";
UntypedResultSet result = executeInternal(String.format(req, LOCAL, LOCAL));
int generation;
if (result.isEmpty() || !result.one().has("gossip_generation"))
{
// seconds-since-epoch isn't a foolproof new generation
// (where foolproof is "guaranteed to be larger than the last one seen at this ip address"),
// but it's as close as sanely possible
generation = (int) (System.currentTimeMillis() / 1000);
}
else
{
// Other nodes will ignore gossip messages about a node that have a lower generation than previously seen.
final int storedGeneration = result.one().getInt("gossip_generation") + 1;
final int now = (int) (System.currentTimeMillis() / 1000);
if (storedGeneration >= now)
{
logger.warn("Using stored Gossip Generation {} as it is greater than current system time {}. See CASSANDRA-3654 if you experience problems",
storedGeneration, now);
generation = storedGeneration;
}
else
{
generation = now;
}
}
req = "INSERT INTO system.%s (key, gossip_generation) VALUES ('%s', ?)";
executeInternal(String.format(req, LOCAL, LOCAL), generation);
forceBlockingFlush(LOCAL);
return generation;
}
#endif
future<int> increment_and_get_generation();
bool bootstrap_complete();
bool bootstrap_in_progress();
bootstrap_state get_bootstrap_state();


@@ -263,29 +263,6 @@ int token_comparator::operator()(const token& t1, const token& t2) const {
return tri_compare(t1, t2);
}
void token::serialize(bytes::iterator& out) const {
uint8_t kind = _kind == dht::token::kind::before_all_keys ? 0 :
_kind == dht::token::kind::key ? 1 : 2;
serialize_int8(out, kind);
serialize_int16(out, _data.size());
out = std::copy(_data.begin(), _data.end(), out);
}
token token::deserialize(bytes_view& in) {
uint8_t kind = read_simple<uint8_t>(in);
size_t size = read_simple<uint16_t>(in);
return token(kind == 0 ? dht::token::kind::before_all_keys :
kind == 1 ? dht::token::kind::key :
dht::token::kind::after_all_keys,
to_bytes(read_simple_bytes(in, size)));
}
size_t token::serialized_size() const {
return serialize_int8_size // token::kind;
+ serialize_int16_size // token size
+ _data.size();
}
bool ring_position::equal(const schema& s, const ring_position& other) const {
return tri_compare(s, other) == 0;
}


@@ -97,11 +97,6 @@ public:
bool is_maximum() const {
return _kind == kind::after_all_keys;
}
void serialize(bytes::iterator& out) const;
static token deserialize(bytes_view& in);
size_t serialized_size() const;
};
token midpoint_unsigned(const token& t1, const token& t2);

dist/ami/build_ami.sh

@@ -6,22 +6,100 @@ if [ ! -e dist/ami/build_ami.sh ]; then
fi
print_usage() {
echo "build_ami.sh -l"
echo " -l deploy locally built rpms"
echo "build_ami.sh --localrpm --unstable"
echo " --localrpm deploy locally built rpms"
echo " --unstable use unstable branch"
exit 1
}
LOCALRPM=0
while getopts lh OPT; do
case "$OPT" in
"l")
while [ $# -gt 0 ]; do
case "$1" in
"--localrpm")
LOCALRPM=1
INSTALL_ARGS="$INSTALL_ARGS --localrpm"
shift 1
;;
"h")
"--unstable")
INSTALL_ARGS="$INSTALL_ARGS --unstable"
shift 1
;;
*)
print_usage
;;
esac
done
. /etc/os-release
case "$ID" in
"centos")
AMI=ami-f3102499
REGION=us-east-1
SSH_USERNAME=centos
;;
"ubuntu")
AMI=ami-ff427095
REGION=us-east-1
SSH_USERNAME=ubuntu
;;
*)
echo "build_ami.sh does not support this distribution."
exit 1
;;
esac
if [ $LOCALRPM -eq 1 ]; then
if [ "$ID" = "centos" ]; then
rm -rf build/*
sudo yum -y install git
if [ ! -f dist/ami/files/scylla-server.x86_64.rpm ]; then
dist/redhat/build_rpm.sh
cp build/rpmbuild/RPMS/x86_64/scylla-server-`cat build/SCYLLA-VERSION-FILE`-`cat build/SCYLLA-RELEASE-FILE`.*.x86_64.rpm dist/ami/files/scylla-server.x86_64.rpm
fi
if [ ! -f dist/ami/files/scylla-jmx.noarch.rpm ]; then
cd build
git clone --depth 1 https://github.com/scylladb/scylla-jmx.git
cd scylla-jmx
sh -x -e dist/redhat/build_rpm.sh $*
cd ../..
cp build/scylla-jmx/build/rpmbuild/RPMS/noarch/scylla-jmx-`cat build/scylla-jmx/build/SCYLLA-VERSION-FILE`-`cat build/scylla-jmx/build/SCYLLA-RELEASE-FILE`.*.noarch.rpm dist/ami/files/scylla-jmx.noarch.rpm
fi
if [ ! -f dist/ami/files/scylla-tools.noarch.rpm ]; then
cd build
git clone --depth 1 https://github.com/scylladb/scylla-tools-java.git
cd scylla-tools-java
sh -x -e dist/redhat/build_rpm.sh
cd ../..
cp build/scylla-tools-java/build/rpmbuild/RPMS/noarch/scylla-tools-`cat build/scylla-tools-java/build/SCYLLA-VERSION-FILE`-`cat build/scylla-tools-java/build/SCYLLA-RELEASE-FILE`.*.noarch.rpm dist/ami/files/scylla-tools.noarch.rpm
fi
else
sudo apt-get install -y git
if [ ! -f dist/ami/files/scylla-server_amd64.deb ]; then
if [ ! -f ../scylla-server_`cat build/SCYLLA-VERSION-FILE | sed 's/\.rc/~rc/'`-`cat build/SCYLLA-RELEASE-FILE`-ubuntu1_amd64.deb ]; then
echo "Build .deb before running build_ami.sh"
exit 1
fi
cp ../scylla-server_`cat build/SCYLLA-VERSION-FILE | sed 's/\.rc/~rc/'`-`cat build/SCYLLA-RELEASE-FILE`-ubuntu1_amd64.deb dist/ami/files/scylla-server_amd64.deb
fi
if [ ! -f dist/ami/files/scylla-jmx_all.deb ]; then
cd build
git clone --depth 1 https://github.com/scylladb/scylla-jmx.git
cd scylla-jmx
sh -x -e dist/ubuntu/build_deb.sh $*
cd ../..
cp build/scylla-jmx_`cat build/scylla-jmx/build/SCYLLA-VERSION-FILE | sed 's/\.rc/~rc/'`-`cat build/scylla-jmx/build/SCYLLA-RELEASE-FILE`-ubuntu1_all.deb dist/ami/files/scylla-jmx_all.deb
fi
if [ ! -f dist/ami/files/scylla-tools_all.deb ]; then
cd build
git clone --depth 1 https://github.com/scylladb/scylla-tools-java.git
cd scylla-tools-java
sh -x -e dist/ubuntu/build_deb.sh $*
cd ../..
cp build/scylla-tools_`cat build/scylla-tools-java/build/SCYLLA-VERSION-FILE | sed 's/\.rc/~rc/'`-`cat build/scylla-tools-java/build/SCYLLA-RELEASE-FILE`-ubuntu1_all.deb dist/ami/files/scylla-tools_all.deb
fi
fi
fi
cd dist/ami
if [ ! -f variables.json ]; then
@@ -30,19 +108,11 @@ if [ ! -f variables.json ]; then
fi
if [ ! -d packer ]; then
wget https://dl.bintray.com/mitchellh/packer/packer_0.8.6_linux_amd64.zip
wget https://releases.hashicorp.com/packer/0.8.6/packer_0.8.6_linux_amd64.zip
mkdir packer
cd packer
unzip -x ../packer_0.8.6_linux_amd64.zip
cd -
fi
if [ $LOCALRPM = 0 ]; then
echo "sudo yum remove -y abrt; sudo sh -x -e /home/centos/scylla_install_pkg; sudo sh -x -e /usr/lib/scylla/scylla_setup -a" > scylla_deploy.sh
else
echo "sudo yum remove -y abrt; sudo sh -x -e /home/centos/scylla_install_pkg -l /home/centos; sudo sh -x -e /usr/lib/scylla/scylla_setup -a" > scylla_deploy.sh
fi
chmod a+rx scylla_deploy.sh
packer/packer build -var-file=variables.json scylla.json
packer/packer build -var-file=variables.json -var install_args="$INSTALL_ARGS" -var region="$REGION" -var source_ami="$AMI" -var ssh_username="$SSH_USERNAME" scylla.json


@@ -1,31 +0,0 @@
#!/bin/sh -e
if [ ! -e dist/ami/build_ami_local.sh ]; then
echo "run build_ami_local.sh in top of scylla dir"
exit 1
fi
rm -rf build/*
sudo yum -y install git
if [ ! -f dist/ami/files/scylla-server.x86_64.rpm ]; then
dist/redhat/build_rpm.sh
cp build/rpmbuild/RPMS/x86_64/scylla-server-`cat build/SCYLLA-VERSION-FILE`-`cat build/SCYLLA-RELEASE-FILE`.*.x86_64.rpm dist/ami/files/scylla-server.x86_64.rpm
fi
if [ ! -f dist/ami/files/scylla-jmx.noarch.rpm ]; then
cd build
git clone --depth 1 https://github.com/scylladb/scylla-jmx.git
cd scylla-jmx
sh -x -e dist/redhat/build_rpm.sh $*
cd ../..
cp build/scylla-jmx/build/rpmbuild/RPMS/noarch/scylla-jmx-`cat build/scylla-jmx/build/SCYLLA-VERSION-FILE`-`cat build/scylla-jmx/build/SCYLLA-RELEASE-FILE`.*.noarch.rpm dist/ami/files/scylla-jmx.noarch.rpm
fi
if [ ! -f dist/ami/files/scylla-tools.noarch.rpm ]; then
cd build
git clone --depth 1 https://github.com/scylladb/scylla-tools-java.git
cd scylla-tools-java
sh -x -e dist/redhat/build_rpm.sh
cd ../..
cp build/scylla-tools-java/build/rpmbuild/RPMS/noarch/scylla-tools-`cat build/scylla-tools-java/build/SCYLLA-VERSION-FILE`-`cat build/scylla-tools-java/build/SCYLLA-RELEASE-FILE`.*.noarch.rpm dist/ami/files/scylla-tools.noarch.rpm
fi
exec dist/ami/build_ami.sh -l


@@ -30,7 +30,21 @@ echo 'More documentation available at: '
echo ' http://www.scylladb.com/doc/'
echo
if [ "`systemctl is-active scylla-server`" = "active" ]; then
. /etc/os-release
if [ "$ID" = "ubuntu" ]; then
if [ "`initctl status ssh|grep "running, process"`" != "" ]; then
STARTED=1
else
STARTED=0
fi
else
if [ "`systemctl is-active scylla-server`" = "active" ]; then
STARTED=1
else
STARTED=0
fi
fi
if [ $STARTED -eq 1 ]; then
tput setaf 4
tput bold
echo " ScyllaDB is active."
@@ -42,6 +56,13 @@ else
echo " ScyllaDB is not started!"
tput sgr0
echo "Please wait for startup. To see status of ScyllaDB, run "
echo " 'systemctl status scylla-server'"
echo
if [ "$ID" = "ubuntu" ]; then
echo " 'initctl status scylla-server'"
echo "and"
echo " 'cat /var/log/upstart/scylla-server.log'"
echo
else
echo " 'systemctl status scylla-server'"
echo
fi
fi

dist/ami/scylla.json

@@ -8,16 +8,52 @@
"security_group_id": "{{user `security_group_id`}}",
"region": "{{user `region`}}",
"associate_public_ip_address": "{{user `associate_public_ip_address`}}",
"source_ami": "ami-8ef1d6e4",
"source_ami": "{{user `source_ami`}}",
"user_data_file": "user_data.txt",
"instance_type": "{{user `instance_type`}}",
"ssh_username": "centos",
"ssh_username": "{{user `ssh_username`}}",
"ssh_timeout": "5m",
"ami_name": "scylla_{{isotime | clean_ami_name}}",
"ami_name": "{{user `ami_prefix`}}scylla_{{isotime | clean_ami_name}}",
"enhanced_networking": true,
"launch_block_device_mappings": [
{
"device_name": "/dev/sda1",
"volume_size": 10
"volume_size": 10,
"delete_on_termination": true
}
],
"ami_block_device_mappings": [
{
"device_name": "/dev/sdb",
"virtual_name": "ephemeral0"
},
{
"device_name": "/dev/sdc",
"virtual_name": "ephemeral1"
},
{
"device_name": "/dev/sdd",
"virtual_name": "ephemeral2"
},
{
"device_name": "/dev/sde",
"virtual_name": "ephemeral3"
},
{
"device_name": "/dev/sdf",
"virtual_name": "ephemeral4"
},
{
"device_name": "/dev/sdg",
"virtual_name": "ephemeral5"
},
{
"device_name": "/dev/sdh",
"virtual_name": "ephemeral6"
},
{
"device_name": "/dev/sdi",
"virtual_name": "ephemeral7"
}
]
}
@@ -26,16 +62,18 @@
{
"type": "file",
"source": "files/",
"destination": "/home/centos/"
"destination": "/home/{{user `ssh_username`}}/"
},
{
"type": "file",
"source": "../../scripts/scylla_install_pkg",
"destination": "/home/centos/scylla_install_pkg"
"destination": "/home/{{user `ssh_username`}}/scylla_install_pkg"
},
{
"type": "shell",
"script": "scylla_deploy.sh"
"inline": [
"sudo /home/{{user `ssh_username`}}/scylla-ami/scylla_install_ami {{ user `install_args` }}"
]
}
],
"variables": {
@@ -45,6 +83,10 @@
"security_group_id": "",
"region": "",
"associate_public_ip_address": "",
"instance_type": ""
"instance_type": "",
"install_args": "",
"ami_prefix": "",
"source_ami": "",
"ssh_username": ""
}
}

dist/common/bin/scyllatop (new executable file)

@@ -0,0 +1,5 @@
#!/bin/sh -e
#
# Copyright (C) 2016 ScyllaDB
exec python /usr/lib/scylla/scyllatop/scyllatop.py "$@"

dist/common/collectd.d/scylla.conf (new file)

@@ -0,0 +1,16 @@
LoadPlugin network
LoadPlugin unixsock
# dummy network write target to silence a noisy warning
LoadPlugin network
<Plugin "network">
Server "127.0.0.1 65534"
</Plugin>
<Plugin network>
Listen "127.0.0.1" "25826"
</Plugin>
<Plugin unixsock>
SocketFile "/var/run/collectd-unixsock"
SocketPerms "0666"
</Plugin>


@@ -2,6 +2,25 @@
#
# Copyright (C) 2015 ScyllaDB
print_usage() {
echo "scylla_bootparam_setup --ami"
echo " --ami setup AMI instance"
exit 1
}
AMI_OPT=0
while [ $# -gt 0 ]; do
case "$1" in
"--ami")
AMI_OPT=1
shift 1
;;
*)
print_usage
;;
esac
done
. /etc/os-release
if [ ! -f /etc/default/grub ]; then
@@ -14,7 +33,11 @@ if [ "`grep hugepagesz /etc/default/grub`" != "" ] || [ "`grep hugepages /etc/de
sed -e "s#hugepages=[0-9]* ##" /etc/default/grub > /tmp/grub
mv /tmp/grub /etc/default/grub
fi
sed -e "s#^GRUB_CMDLINE_LINUX=\"#GRUB_CMDLINE_LINUX=\"hugepagesz=2M hugepages=$NR_HUGEPAGES #" /etc/default/grub > /tmp/grub
if [ $AMI_OPT -eq 1 ]; then
sed -e "s#^GRUB_CMDLINE_LINUX=\"#GRUB_CMDLINE_LINUX=\"clocksource=tsc tsc=reliable hugepagesz=2M hugepages=$NR_HUGEPAGES #" /etc/default/grub > /tmp/grub
else
sed -e "s#^GRUB_CMDLINE_LINUX=\"#GRUB_CMDLINE_LINUX=\"hugepagesz=2M hugepages=$NR_HUGEPAGES #" /etc/default/grub > /tmp/grub
fi
mv /tmp/grub /etc/default/grub
if [ "$ID" = "ubuntu" ]; then
grub-mkconfig -o /boot/grub/grub.cfg
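The `--ami` branch above prepends `clocksource=tsc tsc=reliable` ahead of the hugepage settings on the kernel command line. A Python sketch of the same substitution, with an illustrative hugepage count:

```python
import re

line = 'GRUB_CMDLINE_LINUX="quiet splash"'
nr_hugepages = 1024  # illustrative value; the script computes NR_HUGEPAGES

# Prepend TSC clocksource and hugepage settings, as the --ami branch does
new = re.sub(r'^GRUB_CMDLINE_LINUX="',
             f'GRUB_CMDLINE_LINUX="clocksource=tsc tsc=reliable '
             f'hugepagesz=2M hugepages={nr_hugepages} ',
             line)
assert new == ('GRUB_CMDLINE_LINUX="clocksource=tsc tsc=reliable '
               'hugepagesz=2M hugepages=1024 quiet splash"')
```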


@@ -3,18 +3,19 @@
# Copyright (C) 2015 ScyllaDB
print_usage() {
echo "scylla_coredump_setup -s"
echo " -s store coredump to /var/lib/scylla"
echo "scylla_coredump_setup --dump-to-raiddir"
echo " --dump-to-raiddir store coredump to /var/lib/scylla"
exit 1
}
SYMLINK=0
while getopts sh OPT; do
case "$OPT" in
"s")
while [ $# -gt 0 ]; do
case "$1" in
"--dump-to-raiddir")
SYMLINK=1
shift 1
;;
"h")
*)
print_usage
;;
esac

dist/common/scripts/scylla_dev_mode_setup (new executable file)

@@ -0,0 +1,31 @@
#!/bin/sh -e
#
# Copyright (C) 2015 ScyllaDB
print_usage() {
echo "scylla_dev_mode_setup --developer-mode=[0|1]"
echo " --developer-mode enable/disable developer mode"
exit 1
}
DEV_MODE=
while [ $# -gt 0 ]; do
case "$1" in
"--developer-mode")
DEV_MODE=$2
shift 2
;;
*)
print_usage
;;
esac
done
if [ "$DEV_MODE" = "" ]; then
print_usage
fi
if [ "$DEV_MODE" != "0" ] && [ "$DEV_MODE" != "1" ]; then
print_usage
fi
echo "DEV_MODE=--developer-mode=$DEV_MODE" > /etc/scylla.d/dev-mode.conf

dist/common/scripts/scylla_io_setup (new executable file)

@@ -0,0 +1,80 @@
#!/bin/sh
print_usage() {
echo "scylla_io_setup --ami"
echo " --ami setup AMI instance"
exit 1
}
AMI_OPT=0
while [ $# -gt 0 ]; do
case "$1" in
"--ami")
AMI_OPT=1
shift 1
;;
*)
print_usage
;;
esac
done
is_developer_mode() {
cat /etc/scylla.d/dev-mode.conf|egrep -c "\-\-developer-mode(\s+|=)(1|true)"
}
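`is_developer_mode()` counts lines in `dev-mode.conf` matching an egrep pattern, so only `--developer-mode` followed by `1` or `true` (separated by `=` or whitespace) enables it. A Python transcription of that match, for illustration:

```python
import re

# egrep pattern from is_developer_mode(), transcribed to Python syntax
pat = re.compile(r'--developer-mode(\s+|=)(1|true)')

assert pat.search('DEV_MODE=--developer-mode=1')      # matched: dev mode on
assert pat.search('--developer-mode true')            # whitespace form also matches
assert not pat.search('--developer-mode=0')           # 0 does not enable dev mode
```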
output_to_user()
{
echo "$1"
logger -p user.err "$1"
}
. /etc/os-release
if [ "$NAME" = "Ubuntu" ]; then
. /etc/default/scylla-server
else
. /etc/sysconfig/scylla-server
fi
if [ `is_developer_mode` -eq 0 ]; then
SMP=`echo $SCYLLA_ARGS|grep smp|sed -e "s/^.*smp\(\s\+\|=\)\([0-9]*\).*$/\2/"`
CPUSET=`echo $SCYLLA_ARGS|grep cpuset|sed -e "s/^.*\(--cpuset\(\s\+\|=\)[0-9\-]*\).*$/\1/"`
if [ $AMI_OPT -eq 1 ]; then
NR_CPU=`cat /proc/cpuinfo |grep processor|wc -l`
NR_DISKS=`lsblk --list --nodeps --noheadings | grep -v xvda | grep xvd | wc -l`
TYPE=`curl http://169.254.169.254/latest/meta-data/instance-type|cut -d . -f 1`
if [ "$SMP" != "" ]; then
NR_CPU=$SMP
fi
NR_SHARDS=$NR_CPU
if [ $NR_CPU -ge 8 ] && [ "$SET_NIC" = "no" ]; then
NR_SHARDS=$((NR_CPU - 1))
fi
if [ $NR_DISKS -lt 2 ]; then NR_DISKS=2; fi
NR_REQS=$((32 * $NR_DISKS / 2))
NR_IO_QUEUES=$NR_SHARDS
if [ $(($NR_REQS/$NR_IO_QUEUES)) -lt 4 ]; then
NR_IO_QUEUES=$(($NR_REQS / 4))
fi
NR_IO_QUEUES=$((NR_IO_QUEUES>NR_SHARDS?NR_SHARDS:NR_IO_QUEUES))
NR_REQS=$(($(($NR_REQS / $NR_IO_QUEUES)) * $NR_IO_QUEUES))
if [ "$TYPE" = "i2" ]; then
NR_REQS=$(($NR_REQS * 2))
fi
echo "SEASTAR_IO=\"--num-io-queues $NR_IO_QUEUES --max-io-requests $NR_REQS\"" > /etc/scylla.d/io.conf
else
iotune --evaluation-directory /var/lib/scylla --format envfile --options-file /etc/scylla.d/io.conf $CPUSET
if [ $? -ne 0 ]; then
output_to_user "/var/lib/scylla did not pass validation tests, it may not be on XFS and/or has limited disk space."
output_to_user "This is a non-supported setup, and performance is expected to be very bad."
output_to_user "For better performance, placing your data on XFS-formatted directories is required."
output_to_user " To override this error, see the developer_mode configuration option."
fi
fi
fi
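On the AMI path above, the io-queue settings are derived arithmetically from the CPU and disk counts rather than measured by iotune. A Python sketch of that computation (hypothetical function name, written to follow the shell arithmetic above):

```python
def io_settings(nr_cpu, nr_disks, instance_family, set_nic='yes'):
    # Reserve one shard for the NIC on >=8-CPU instances when SET_NIC=no
    nr_shards = nr_cpu - 1 if nr_cpu >= 8 and set_nic == 'no' else nr_cpu
    nr_disks = max(nr_disks, 2)               # floor the disk count at 2
    nr_reqs = 32 * nr_disks // 2
    nr_io_queues = nr_shards
    if nr_reqs // nr_io_queues < 4:           # keep at least 4 requests per queue
        nr_io_queues = nr_reqs // 4
    nr_io_queues = min(nr_io_queues, nr_shards)
    nr_reqs = (nr_reqs // nr_io_queues) * nr_io_queues  # round to a multiple of queues
    if instance_family == 'i2':               # i2 instances get double the requests
        nr_reqs *= 2
    return nr_io_queues, nr_reqs

assert io_settings(4, 2, 'c3') == (4, 32)
assert io_settings(8, 2, 'i2', set_nic='no') == (7, 56)
```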

Some files were not shown because too many files have changed in this diff.