Compare commits

2470 Commits

Author SHA1 Message Date
Botond Dénes
a0b9fcc041 cache_flat_mutation_reader: read_from_underlying(): propagate timeout
Propagate the timeout to `consume_mutation_fragments_until()` and hence
to the underlying reader, to ensure queued sstable reads that belong
to timed-out requests are dropped from the queue, instead of
pointlessly serving them.

consume_mutation_fragments_until() received a `timeout` parameter as it
didn't have one.

Fixes: #1068
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190906135629.67342-1-bdenes@scylladb.com>
2019-09-06 16:05:42 +02:00
Paweł Dziepak
35c9b675c1 mutation_partition: verify row::append_cell() precondition
row::append_cell() has a precondition that the new cell column id needs
to be larger than that of any other already existing cell. If this
precondition is violated the row will end up in an invalid state. This
patch adds an assertion to make sure we fail early in such cases.

(cherry picked from commit 060e3f8ac2)
2019-08-23 15:06:35 +02:00
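
The precondition described above can be illustrated with a minimal sketch. `toy_row`, `can_append()` and the string cell values are hypothetical stand-ins and are not taken from the Scylla sources; the sketch only shows the idea of asserting that column ids are strictly increasing before a cell is appended.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a row that stores cells ordered by column id.
struct toy_row {
    std::vector<std::pair<uint32_t, std::string>> cells;

    // True if appending `id` keeps the cells strictly ordered by column id.
    bool can_append(uint32_t id) const {
        return cells.empty() || cells.back().first < id;
    }

    // Precondition: `id` is larger than the id of every existing cell.
    void append_cell(uint32_t id, std::string value) {
        // Fail early instead of leaving the row in an invalid state.
        assert(can_append(id) && "append_cell(): column id must be increasing");
        cells.emplace_back(id, std::move(value));
    }
};
```

Because the cells are kept ordered, comparing against the last cell is enough to check the new id against all existing ones.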
Jenkins
d71836fef7 release: prepare for 2.3.6 by hagitsegev 2019-08-17 13:12:40 +03:00
Tomasz Grabiec
f8e150e97c Merge "Fix the system.size_estimates table" from Kamil
Fixes a segfault when querying for an empty keyspace.

Also, fixes an infinite loop on smp > 1. Queries to
system.size_estimates table which are not single-partition queries
caused Scylla to go into an infinite loop inside
multishard_combining_reader::fill_buffer. This happened because
multishard_combining_reader assumes that shards return rows belonging
to separate partitions, which was not the case for
size_estimates_mutation_reader.

Fixes #4689
2019-08-14 15:33:33 +02:00
Kamil Braun
10c300f894 Fix infinite looping when performing a range query on system.size_estimates.
Queries to system.size_estimates table which are not single partition queries
caused Scylla to go into an infinite loop inside multishard_combining_reader::fill_buffer.
This happened because multishard_combining_reader assumes that shards return rows belonging
to separate partitions, which was not the case for size_estimates_mutation_reader.
This commit fixes the issue and closes #4689.
2019-08-14 13:11:56 +02:00
Kamil Braun
de1d3e5c6b Fix segmentation fault when querying system.size_estimates for an empty keyspace. 2019-08-14 13:11:56 +02:00
Kamil Braun
69810c13ca Refactor size_estimates_virtual_reader
Move the implementation of size_estimates_mutation_reader
to a separate compilation unit to speed up compilation times
and increase readability.

Refactor tests to use seastar::thread.
2019-08-14 13:11:54 +02:00
Raphael S. Carvalho
9b025a5742 table: do not rely on undefined behavior in cleanup_sstables
It shouldn't rely on argument evaluation order, which is ub.

Fixes #4718.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry-picked from commit 0e732ed1cf)
2019-08-07 22:01:11 +03:00
Piotr Jastrzębski
74eebc4cab sstables: ka/la: reader: Make sure push_ready_fragments() does not miss to emit partition_end (#4790)
Currently, if there is a fragment in _ready and _out_of_range was set
after row end was consumed, push_ready_fragments() would return
without emitting partition_end.

This is problematic once we make consume_row_start() emit
partition_start directly, because we will want to assume that all
fragments for the previous partition are emitted by then. If they're
not, then we'd emit partition_start before partition_end for the
previous partition. The fix is to make sure that
push_ready_fragments() emits everything.

Fixes #4786

(cherry picked from commit 9b8ac5ecbc)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-08-05 10:17:42 +03:00
Piotr Sarna
9b2ca4ee44 main: stop view builder conditionally
The view builder is started only if it's enabled in config,
via the view_building=true variable. Unfortunately, stopping
the builder was unconditional, which may result in failed
assertions during shutdown. To remedy this, view building
is stopped only if it was previously started.

Fixes #4589

(cherry picked from commit efa7951ea5)
2019-06-26 11:06:02 +03:00
Benny Halevy
773bf45774 time_window_backlog_tracker: fix use after free
Fixes #4465

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190430094209.13958-1-bhalevy@scylladb.com>
(cherry picked from commit 3a2fa82d6e)
2019-05-06 09:38:49 +03:00
Asias He
c6705b4335 streaming: Get rid of the keep alive timer in streaming
There is no guarantee that rpc streaming makes progress in some time
period. Remove the keep alive timer in streaming to avoid killing the
session when the rpc streaming is just slow.

The keep alive timer is used to close the session in the following case:

n2 (the rpc streaming sender) streams to n1 (the rpc streaming receiver)
kill -9 n2

We need this because we do not kill the session when gossip thinks a node
is down, because we think the node being down might only be temporary
and it is a waste to drop the previous work that has been done, especially
when the stream session takes a long time.

Since in range_streamer, we do not stream all data in a single stream
session, we stream 10% of the data per time, and we have retry logic.
I think it is fine to kill a stream session when gossip thinks a node is
down. This patch changes to close all stream sessions with a node that
gossip thinks is down.
Message-Id: <bdbb9486a533eee25fcaf4a23a946629ba946537.1551773823.git.asias@scylladb.com>

(cherry picked from commit b8158dd65d)
Message-Id: <4ebc544c85261873591fd5ac30043e693d74434a.1555466551.git.asias@scylladb.com>
2019-04-17 17:40:08 +03:00
Tomasz Grabiec
3997871b4d lsa: Cover more bad_alloc cases with abort
When --abort-on-lsa-bad-alloc is enabled we want to abort whenever
we think we can be out of memory.

We covered failures due to bad_alloc thrown from inside of the
allocation section, but did not cover failures from reservations done
at the beginning of with_reserve(). Fix by moving the trap into
reserve().

Message-Id: <1553258915-27929-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 3356a085d2)
2019-04-17 15:21:35 +02:00
Tomasz Grabiec
4ff1d731bd lsa: Fix spurious abort with --enable-abort-on-lsa-bad-alloc
allocate_segment() can fail even though we're not out of memory, when
it's invoked inside an allocating section with the cache region
locked. That section may later succeed when retried after memory
reclamation.

We should ignore bad_alloc thrown inside allocating section body and
fail only when the whole section fails.

Fixes #2924

Message-Id: <1550597493-22500-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit dafe22dd83)
2019-04-17 15:20:21 +02:00
Jenkins
0e0f9143c9 release: prepare for 2.3.5 by hagitsegev 2019-04-17 11:47:53 +03:00
Raphael S. Carvalho
9d809d6ea4 database: fix 2x increase in disk usage during cleanup compaction
Don't hold reference to sstables cleaned up, so that file descriptors
for their index and data files will be closed and consequently disk
space released.

Fixes #3735.

Backport note:
To reduce risk considerably, we'll not backport a mechanism to release
sstable introduced in incremental compaction work.
Instead, only one sstable is passed to table::cleanup_sstables() at a
time (it won't affect performance because the operation is serialized
anyway), to make it easy to release reference to cleaned sstable held
by compaction manager.

tests: release mode.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180914194047.26288-1-raphaelsc@scylladb.com>
(cherry picked from commit 5bc028f78b)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190416025801.15048-1-raphaelsc@scylladb.com>
2019-04-16 17:41:30 +03:00
Tomasz Grabiec
630d599c34 schema_tables: Serialize schema merges fairly
All schema changes made to the node locally are serialized on a
semaphore which lives on shard 0. For historical reasons, they don't
queue but rather try to take the lock without blocking and retry on
failure with a random delay from the range [0, 100 us]. Contenders
which do not originate on shard 0 will have an extra disadvantage as
each lock attempt will be longer by the across-shard round trip
latency. If there is constant contention on shard 0, contenders
originating from other shards may keep losing the race to take the lock.

Schema merge executed on behalf of a DDL statement may originate on
any shard. Same for the schema merge which is coming from a push
notification. Schema merge executed as part of the background schema
pull will originate on shard 0 only, where the application state
change listeners run. So if there are constant schema pulls, DDL
statements may take a long time to get through.

The fix is to serialize merge requests fairly, by using the blocking
semaphore::wait(), which is fair.

We don't have to back-off any more, since submit_to() no longer has a
global concurrency limit.

Fixes #4436.

Message-Id: <1555349915-27703-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 3fd82021b1)
2019-04-16 10:21:10 +03:00
Takuya ASADA
0933c1a00a dist/docker/redhat: switch to python36
Since EPEL switched python3 default version to 3.6, we need to follow
the change.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190409094139.4797-2-syuu@scylladb.com>
(cherry picked from commit d527ef19f7)
2019-04-14 21:17:57 +03:00
Takuya ASADA
7a7099fcfb dist/ami: switch to python36
Since EPEL switched python3 default version to 3.6, we need to follow
the change.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190409094139.4797-1-syuu@scylladb.com>
2019-04-14 21:17:03 +03:00
Avi Kivity
50235aacb4 compaction: fix use-after-free when calculating backlog after schema change
The problem happens after a schema change because we fail to properly
remove an ongoing compaction, which stopped being tracked, from the list
that is used to calculate backlog, so it may happen that a compaction read
monitor (which ceases to exist after compaction ends) is used after being freed.

Fixes #4410.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190409024936.23775-1-raphaelsc@scylladb.com>
(cherry-picked from commit 8a117c338a)
2019-04-14 14:00:36 +03:00
Takuya ASADA
e888009f12 dist/redhat: switch to python36
Since EPEL switched python3 default version to 3.6, we need to follow
the change.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190408160934.32701-1-syuu@scylladb.com>
(cherry picked from commit 6e51a95668)
2019-04-09 09:08:28 +03:00
Piotr Sarna
a19615ee9b types: fix varint and decimal serialization
Varint and decimal type serialization did not update the output
iterator after generating a value, which may lead to corrupted
sstables - variable-length integers were properly serialized,
but if anything followed them directly in the buffer (e.g. in a tuple),
their value would be overwritten.

Fixes #4348

Tests: unit (dev)
dtest: json_test.FromJsonUpdateTests.complex_data_types_test
       json_test.FromJsonInsertTests.complex_data_types_test
       json_test.ToJsonSelectTests.complex_data_types_test

Note that dtests still do not succeed 100% due to formatting differences
in compared results (e.g. 1.0e+07 vs 1.0E7), but it's no longer a query
correctness issue.

(cherry picked from commit 287a02dc05)
2019-03-26 16:38:58 +02:00
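
The class of bug described above can be sketched in isolation. The example below uses a fixed-width big-endian encoding rather than a real varint or decimal encoding, and `write_be32()` and `write_pair()` are hypothetical names, not Scylla functions. It only shows why a serializer has to advance the caller's output iterator: the second value must start where the first one ended.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using out_iterator = std::vector<uint8_t>::iterator;

// Writes `v` as 4 big-endian bytes and advances `out` past them.
// Taking the iterator by reference lets the next writer continue
// after this value instead of overwriting it.
void write_be32(out_iterator& out, uint32_t v) {
    for (int shift = 24; shift >= 0; shift -= 8) {
        *out++ = static_cast<uint8_t>(v >> shift);
    }
}

// Serializes two values back to back, as the elements of a tuple would be.
std::vector<uint8_t> write_pair(uint32_t a, uint32_t b) {
    std::vector<uint8_t> buf(8);
    auto out = buf.begin();
    write_be32(out, a);
    write_be32(out, b);
    assert(out == buf.end());  // both values consumed exactly the buffer
    return buf;
}
```

If `write_be32()` took the iterator by value, both calls would write at the start of the buffer and the first value would be lost.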
Tomasz Grabiec
357ca67fda row_cache: Fix abort in cache populating read concurrent with memtable flush
When we're populating a partition range and the population range ends
with a partition key (not a token) which is present in sstables and
there was a concurrent memtable flush, we would abort on the following
assert in cache::autoupdating_underlying_reader:

     utils::phased_barrier::phase_type creation_phase() const {
         assert(_reader);
         return _reader_creation_phase;
     }

That's because autoupdating_underlying_reader::move_to_next_partition()
clears the _reader field when it tries to recreate a reader but it finds
the new range to be empty:

         if (!_reader || _reader_creation_phase != phase) {
            if (_last_key) {
                auto cmp = dht::ring_position_comparator(*_cache._schema);
                auto&& new_range = _range.split_after(*_last_key, cmp);
                if (!new_range) {
                    _reader = {};
                    return make_ready_future<mutation_fragment_opt>();
                }

Fix by not asserting on _reader. creation_phase() will now be
meaningful even after we clear the _reader. The meaning of
creation_phase() is now "the phase in which the reader was last
created or 0", which makes it valid in more cases than before.

If the reader was never created we will return 0, which is smaller
than any phase returned by cache::phase_of(), since cache starts from
phase 1. This shouldn't affect current behavior, since we'd abort() if
called for this case, it just makes the value more appropriate for the
new semantics.

Tests:

  - unit.row_cache_test (debug)

Fixes #4236
Message-Id: <1553107389-16214-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 69775c5721)
2019-03-22 09:34:15 -03:00
Jenkins
7818c63eb1 release: prepare for 2.3.4 by hagitsegev 2019-03-18 12:44:10 +02:00
Eliran Sinvani
da10eae18c cql3 : fix a crash upon preparing select with an IN restriction due to memory violation
When preparing a select query with a multi-column IN restriction, the
node crashed due to using a parameter after it had been moved from.

Tests:
1. UnitTests (release)
2. Preparing a select statement that crashed the system before,
and verify it is not crashing.

Fixes #3204
Fixes #3692

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <7ebd210cd714a460ee5557ac612da970cee03270.1537947897.git.eliransin@scylladb.com>
(cherry picked from commit 22ad5434d1)
2019-03-07 10:10:03 +00:00
Tomasz Grabiec
d5292cd3ec sstable/compaction: Use correct schema in the writing consumer
Introduced in 2a437ab427.

regular_compaction::select_sstable_writer() creates the sstable writer
when the first partition is consumed from the combined mutation
fragment stream. It gets the schema directly from the table
object. That may be a different schema than the one used by the
readers if there was a concurrent schema alter during that small time
window. As a result, the writing consumer attached to readers will
interpret fragments using the wrong version of the schema.

One effect of this is storing values of some columns under a different
column.

This patch replaces all column_family::schema() accesses with accesses
to the _schema member which is obtained once per compaction and is
the same schema which readers use.

Fixes #4304.

Tests:

  - manual tests with hard-coded schema change injection to reproduce the bug
  - build/dev/scylla boot
  - tests/sstable_mutation_test

Message-Id: <1551698056-23386-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 58e7ad20eb)
2019-03-05 15:06:28 +02:00
Avi Kivity
9cb35361d9 Update seastar submodule
* seastar 10ac122...efda428 (1):
  > net: fix tcp load balancer accounting leak while moving socket to other shard

Fixes #4269.
2019-03-05 15:06:07 +02:00
Jenkins
3e285248be release: prepare for 2.3.3 by hagitsegev 2019-02-19 14:02:37 +02:00
Raphael S. Carvalho
6f10ccb441 database: Fix race condition in sstable snapshot
Race condition takes place when one of the sstables selected by snapshot
is deleted by compaction. Snapshot fails because it tries to link a
sstable that was previously unlinked by compaction's sstable deletion.

Refs #4051.

(master commit 1b7cad3531)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190110194048.26051-1-raphaelsc@scylladb.com>
2019-02-19 10:13:56 +02:00
Botond Dénes
df420499bc service/storage_service: fix pre-bootstrap wait for schema agreement
When bootstrapping, a node should wait to have schema agreement
with its peers, before it can join the ring. This is to ensure it can
immediately accept writes. Failing to reach schema agreement before
joining is not fatal, as the node can pull unknown schemas on writes
on-demand. However, if such a schema contains references to UDFs, the
node will reject writes using it, due to #3760.

To ensure that schema agreement is reached before joining the ring,
`storage_service::join_token_ring()` has two checks. First it checks that
at least one peer was connected previously. For this it compares
`database::get_version()` with `database::empty_version`. The (implied)
assumption is that this will become something other than
`database::empty_version` only after having connected (and pulled
schemas from) at least one peer. This assumption doesn't hold anymore,
as we now set the version earlier in the boot process.
The second check verifies that we have the same schema version as all
known, live peers. This check assumes (since 3e415e2) that we have
already "met" all (or at least some) of our peers and if there is just
one known node (us) it concludes that this is a single-node cluster,
which automatically has schema agreement.
It's easy to see how these two checks will fail. The first fails to
ensure that we have met our peers, and the second wrongfully concludes
that we are a one-node cluster, and hence have schema agreement.

To fix this, modify the first check. Instead of relying on the presence
of a non-empty database version, supposedly implying that we already
talked to our peers, explicitly make sure that we have really talked to
*at least* one other node, before proceeding to the second check, which
will now do the correct thing, actually checking the schema versions.

Fixes: #4196

Branches: 3.0, 2.3

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <40b95b18e09c787e31ba6c5519fb64d68b4ca32e.1550228389.git.bdenes@scylladb.com>
(cherry picked from commit 2125e99531)
2019-02-16 19:04:38 +02:00
Avi Kivity
d29527b4e1 auth: password_authenticator: protect against NULL salted_hash
In case salted_hash was NULL, we'd access uninitialized memory when dereferencing
the optional in get_as<>().

Protect against that by using get_opt() and failing authentication if we see a NULL.

Fixes #4168.

Tests: unit (release)
Branches: 3.0, 2.3
Message-Id: <20190211173820.8053-1-avi@scylladb.com>
(cherry picked from commit da9628c6dc)
2019-02-11 23:55:06 +02:00
Duarte Nunes
8a90e242e4 Merge 'Fix misdetection of remote counter shards' from Paweł
"
The code reading counter cells from sstables verifies that there are no
unsupported local or remote shards. The latter are detected by checking
if all shards are present in the counter cell header (only remote shards
do not have entries there). However, the logic responsible for doing
that was incorrectly computing the total number of counter shards in a
cell if the header was larger than a single counter shard. This resulted
in incorrect complaints that remote shards are present.

Fixes #4206

Tests: unit(release)
"

* tag 'counter-header-fix/v1' of https://github.com/pdziepak/scylla:
  tests/sstables: test counter cell header with large number of shards
  sstables/counters: fix remote counter shard detection

(cherry picked from commit d2d885fb93)
2019-02-11 14:18:54 +02:00
Calle Wilund
8a78c0aba9 commitlog_replayer: Bugfix: finding truncation positions uses local var ref
"uuid" was ref:ed in a continuation. Works 99.9% of the time because
the continuation is not actually delayed (and assuming we begin the
checks with non-truncated (system) cf:s it works).
But if we do delay continuation, the resulting cf map will be
borked.

Fixes #4187.

Message-Id: <20190204141831.3387-1-calle@scylladb.com>
(cherry picked from commit 9cadbaa96f)
2019-02-04 18:02:43 +02:00
Botond Dénes
8a2bbcf138 auth/service: unregister migration listener on stop()
Otherwise any event that triggers notification to this listener would
trigger a heap-use-after-free.

Refs: #4107

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b6bbd609371a2312aed7571b05119d59c7d103d7.1548067626.git.bdenes@scylladb.com>
(cherry picked from commit f229dff210)
2019-01-22 17:55:18 +02:00
Pekka Enberg
22c891e6df Update scylla-ami submodule
* dist/ami/files/scylla-ami a425887...fe156a5 (1):
  > scylla_install_ami: update NIC drivers

See scylladb/scylla-ami#44
2019-01-17 08:45:22 +02:00
Duarte Nunes
1841d0c2d9 tests/gossip_test: Use RAII for orderly destruction
Change the test so that services are correctly torn down, in the
correct order (e.g., storage_service accesses the messaging_service when
stopping).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814112111.8521-2-duarte@scylladb.com>
(cherry picked from commit 495a92c5b6)
2019-01-08 19:44:58 +02:00
Duarte Nunes
e10107fe5a tests/gossip_test: Don't bind address to avoid conflicts
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814112111.8521-1-duarte@scylladb.com>
(cherry picked from commit 3956a77235)
2019-01-08 19:44:52 +02:00
Jenkins
0b3a4679db release: prepare for 2.3.2 2019-01-08 14:40:33 +02:00
Avi Kivity
ba60d666a9 Update seastar submodule
* seastar db30251...10ac122 (1):
  > iotune: Initialize io_rates member variables

Fixes #4064.
2019-01-08 11:41:00 +02:00
Avi Kivity
6ea4d0b75c Update seastar submodule
* seastar b846dfe...db30251 (1):
  > reactor: disable nowait aio due to a kernel bug

Fixes #3996.
2018-12-17 15:56:47 +02:00
Vladimir Krivopalov
8c5911f312 database: Capture io_priority_class by reference to avoid dangling ref.
The original reference points to a thread-local storage object that is
guaranteed to outlive the continuation, but copying it makes the
subsequent calls point to a local object and introduces a use-after-free
bug.

Fixes #3948

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
(cherry picked from commit 68458148e7)
2018-12-02 13:32:45 +02:00
Tomasz Grabiec
de00d7f5a1 utils: phased_barrier: Make advance_and_await() have strong exception guarantees
Currently, when advance_and_await() fails to allocate the new gate
object, it will throw bad_alloc and leave the phased_barrier object in
an invalid state. Calling advance_and_await() again on it will result
in undefined behavior (typically SIGSEGV) because _gate will be
disengaged.

One place affected by this is table::seal_active_memtable(), which
calls _flush_barrier.advance_and_await(). If this throws, subsequent
flush attempts will SIGSEGV.

This patch rearranges the code so that advance_and_await() has strong
exception guarantees.
Message-Id: <1542645562-20932-1-git-send-email-tgrabiec@scylladb.com>

Fixes #3931.

(cherry picked from commit 57e25fa0f8)
2018-11-21 12:18:08 +02:00
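
The strong exception guarantee mentioned above usually comes down to doing everything that can throw before touching existing state. The sketch below shows that ordering on a hypothetical `toy_barrier`; the names, the `fail_next_alloc` test hook and the omission of the asynchronous "await" part are all simplifying assumptions and do not reproduce the actual `phased_barrier` code.

```cpp
#include <cassert>
#include <memory>
#include <new>
#include <utility>

struct toy_gate { int phase; };

// Hypothetical barrier that holds the gate of the current phase.
class toy_barrier {
    std::unique_ptr<toy_gate> _gate = std::make_unique<toy_gate>(toy_gate{0});
public:
    bool fail_next_alloc = false;  // test hook that simulates std::bad_alloc

    int phase() const {
        assert(_gate);  // the invariant that the strong guarantee preserves
        return _gate->phase;
    }

    void advance() {
        // Step 1: everything that can throw happens before any state changes.
        if (fail_next_alloc) {
            fail_next_alloc = false;
            throw std::bad_alloc();
        }
        auto next = std::make_unique<toy_gate>(toy_gate{_gate->phase + 1});
        // Step 2: commit with an operation that cannot throw.
        _gate = std::move(next);
    }
};
```

If the allocation fails, `_gate` still points at the old gate, so a later `advance()` call starts from a valid object.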
Glauber Costa
e5f9dae4bb remove monitor if sstable write failed
In (almost) all SSTable write paths, we need to inform the monitor that
the write has failed as well. The monitor will remove the SSTable from
controller's tracking at that point.

Except there is one place where we are not doing that: streaming of big
mutations. Streaming of big mutations is an interesting use case, in
which it is done in 2 parts: if the writing of the SSTable fails right
away, then we do the correct thing.

But the SSTables are not committed at that point and the monitors are
still kept around with the SSTables until a later time, when they are
finally committed. Between those two points in time, it is possible that
the streaming code will detect a failure and manually call
fail_streaming_mutations(), which marks the SSTable for deletion. At
that point we should propagate that information to the monitor as well,
but we don't.

Fixes #3732 (hopefully)
Tests: unit (release)

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181114213618.16789-1-glauber@scylladb.com>
(cherry picked from commit 9f403334c8)
2018-11-20 20:40:44 +02:00
Glauber Costa
e13e796290 sstables: correctly parse estimated histograms
In commit a33f0d6, we changed the way we handle arrays during the write
and parse code to avoid reactor stalls. Some potentially big loops were
transformed into futurized loops, and also some calls to vector resizes
were replaced by a reserve + push_back idiom.

The latter broke parsing of the estimated histogram. The reason being
that the vectors that are used here are already initialized internally
by the estimated_histogram object. Therefore, when we push_back, we
don't fill the array all the way from index 0, but end up with a zeroed
beginning and only push back some of the elements we need.

We could revert this array to a resize() call. After all, the reason we
are using reserve + push_back is to avoid calling the constructor member
for each element, but we don't really expect the integer specialization
to do any of that.

However, to avoid confusion for future developers who may feel tempted
to convert this as well for the sake of consistency, it is safer to
just make sure these arrays are zeroed.

Fixes #3918

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181116130853.10473-1-glauber@scylladb.com>
(cherry picked from commit c6811bd877)
2018-11-17 17:20:38 +02:00
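
The interaction described above between a pre-sized vector and the reserve + push_back idiom can be reproduced with plain `std::vector`. In this sketch `make_buckets()` is a hypothetical stand-in for a constructor that already sizes its bucket vector, and `parse_in_place()` is just one possible remedy, not necessarily the change made by the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Models a container whose constructor already sized its bucket vector.
std::vector<uint64_t> make_buckets(std::size_t n) {
    return std::vector<uint64_t>(n, 0);
}

// Problematic: push_back() appends after the n zeros that already exist.
std::vector<uint64_t> parse_appending(const std::vector<uint64_t>& input) {
    auto buckets = make_buckets(input.size());
    buckets.reserve(input.size());  // no effect on the existing elements
    for (auto v : input) {
        buckets.push_back(v);
    }
    return buckets;
}

// One possible remedy: assign by index into the existing elements.
std::vector<uint64_t> parse_in_place(const std::vector<uint64_t>& input) {
    auto buckets = make_buckets(input.size());
    assert(buckets.size() == input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        buckets[i] = input[i];
    }
    return buckets;
}
```

With three input values, `parse_appending()` returns six elements: three zeros followed by the parsed values, which matches the "zeroed beginning" described in the message.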
Avi Kivity
336c771663 release: prepare for 2.3.1 2018-10-19 20:53:17 +03:00
Avi Kivity
82968afc25 locator: fix abstract_replication_strategy::get_ranges() and friends violating sort order
get_ranges() is supposed to return ranges in sorted order. However, a35136533d
broke this and returned the range that was supposed to be last in the second
position (e.g. [0, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9]). This broke cleanup, which
relied on the sort order to perform a binary search. Other users of the
get_ranges() family did not rely on the sort order.

Fixes #3872.
Message-Id: <20181019113613.1895-1-avi@scylladb.com>

(cherry picked from commit 1ce52d5432)
2018-10-19 20:52:31 +03:00
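
Why the sort order matters for cleanup can be shown with integers standing in for token ranges. `contains_sorted()` and `sorted_copy()` are hypothetical helpers; the sketch only demonstrates that the example sequence from the message violates the ordering that a binary search depends on.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Binary search is only meaningful on sorted input, so check that first.
bool contains_sorted(const std::vector<int>& ranges, int value) {
    assert(std::is_sorted(ranges.begin(), ranges.end()));
    return std::binary_search(ranges.begin(), ranges.end(), value);
}

// Restores the order that a binary search relies on.
std::vector<int> sorted_copy(std::vector<int> ranges) {
    std::sort(ranges.begin(), ranges.end());
    return ranges;
}
```

Passing `{0, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9}` directly to `contains_sorted()` would trip the assertion, while `sorted_copy()` of the same sequence can be searched safely.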
Duarte Nunes
383dcffb53 Merge 'Fix issues with endpoint state replication to other shards' from Tomasz
Fixes #3798
Fixes #3694

Tests:

  unit(release), dtest([new] cql_tests.py:TruncateTester.truncate_after_restart_test)

* tag 'fix-gossip-shard-replication-v1' of github.com:tgrabiec/scylla:
  gms/gossiper: Replicate endpoint states in add_saved_endpoint()
  gms/gossiper: Make reset_endpoint_state_map() have effect on all shards
  gms/gossiper: Replicate STATUS change from mark_as_shutdown() to other shards
  gms/gossiper: Always override states from older generations

(cherry picked from commit 48ebe6552c)
2018-10-17 10:09:07 +02:00
Glauber Costa
0c2abc007c api: use longs instead of ints for snapshot sizes
Int types in json will be serialized to int types in C++. They will then
only be able to handle 4GB, and we tend to store more data than that.

Without this patch, listsnapshots is broken in all versions.

Fixes: #3845

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181012155902.7573-1-glauber@scylladb.com>
(cherry picked from commit 98332de268)
2018-10-12 22:02:25 +03:00
Avi Kivity
1498c4f150 Update seastar submodule
* seastar ebf4812...b846dfe (1):
  > prometheus: Fix histogram text representation

Fixes #3827.
2018-10-09 16:38:04 +03:00
Eliran Sinvani
f388992a94 cql3 : add workaround to antlr3 null dereference bug
The Antlr3 exception class has a null dereference bug that crashes
the system when trying to extract the exception message using
ANTLR_Exception<...>::displayRecognitionError(...) function. When
a parsing error occurs the CqlParser throws an exception which is in
turn processed for some special cases in scylla to generate a custom
message. The default case however, creates the message using
displayRecognitionError, causing the system to crash.
The fix is a simple workaround, making sure the pointer is not null
before the call to the function. A "proper" fix can't be implemented
because the exception class itself is implemented outside scylla
in antlr headers that reside on the host machine's OS.

Tested manually 2 test cases: a typo that caused scylla to crash and
a cql comment without a newline at the end that also caused scylla to crash.
Ran unit tests (release).

Fixes #3740
Fixes #3764

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <cfc7e0d758d7a855d113bb7c8191b0fd7d2e8921.1538566542.git.eliransin@scylladb.com>
(cherry picked from commit 20f49566a2)
2018-10-04 14:09:41 +03:00
Avi Kivity
310540c11f utils: crc32: mark power crc32 assembly as not requiring an executable stack
The linker uses an opt-in system for non-executable stack: if all object files
opt into a non-executable stack, the binary will have a non-executable stack,
which is very desirable for security. The compiler cooperates by opting into
a non-executable stack whenever possible (always for our code).

However, we also have an assembly file (for fast power crc32 computations).
Since it doesn't opt into a non-executable stack, we get a binary with
executable stack, which Gentoo's build system rightly complains about.

Fix by adding the correct incantation to the file.

Fixes #3799.

Reported-by: Alexys Jacob <ultrabug@gmail.com>
Message-Id: <20181002151251.26383-1-avi@scylladb.com>
(cherry picked from commit aaab8a3f46)
2018-10-02 23:23:23 +03:00
Calle Wilund
7d833023cc storage_proxy: Add missing re-throw in truncate_blocking
Iff truncation times out, we want to log it, but the exception should
not be swallowed, but re-thrown.

Fixes #3796.

Message-Id: <20181001112325.17809-1-calle@scylladb.com>
(cherry picked from commit 2996b8154f)
2018-10-01 21:48:57 +02:00
Avi Kivity
d94ac196e0 Update scylla-ami submodule
* dist/ami/files/scylla-ami e7aa504...a425887 (1):
  > scylla_install_ami: enable ssh_deletekeys

See scylladb/scylla-ami#31
2018-09-30 16:32:40 +03:00
Duarte Nunes
1d7430995e tests/aggregate_fcts_test: Add test case for wrapped types
Provide a test case which checks a type being wrapped in a
reverse_type plays no role in assignment.

Refs #3789

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180927223201.28152-2-duarte@scylladb.com>
(cherry picked from commit 17578c3579)
2018-09-28 14:34:19 +03:00
Duarte Nunes
b662a7f8a4 cql3/selection/selector: Unwrap types when validating assignment
When validating assignment between two types, it's possible one of
them is wrapped in a reverse_type, if it comes, for example, from the
type associated with a clustering column. When checking for weak
assignment the types are correctly unwrapped, but not when checking
for an exact match, which this patch fixes.

Technically, the receiver is never a reversed_type for the current
callers, but this is the morally correct implementation, as the type
being reversed or not plays no role in assignment.

Tests: unit(release)

Fixes #3789

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180927223201.28152-1-duarte@scylladb.com>
(cherry picked from commit 5e7bb20c8a)
2018-09-28 14:34:08 +03:00
Paweł Dziepak
447ad72882 transport: fix use-after-free in read_name_and_value_list()
(cherry picked from commit 1eeef4383c)
2018-09-27 14:05:45 +01:00
Duarte Nunes
b8485d3bce cql3/query_processor: Validate presence of statement values timeously
We need to validate before calling query_options::prepare() whether
the set of prepared statement values sent in the query matches the
amount of names we need to bind, otherwise we risk an out-of-bounds
access if the client also specified names together with the values.

Refs #3688

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814225607.14215-1-duarte@scylladb.com>
(cherry picked from commit 805ce6e019)
2018-09-27 14:05:37 +01:00
Takuya ASADA
034b0f50db dist/redhat: specify correct repo file path on scylla-housekeeping services
Currently, both scylla-housekeeping-daily/-restart services mistakenly
specify repo file path as "@@REPOFILES@@", which is copied from the .in
template and needs to be replaced with the actual path.

Fixes #3776

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180921031605.9330-1-syuu@scylladb.com>
(cherry picked from commit 21a12aa458)
2018-09-25 12:30:03 +03:00
Avi Kivity
12ec0becf3 messaging: fix unbounded allocation in TLS RPC server
The non-TLS RPC server has an rpc::resource_limits configuration that limits
its memory consumption, but the TLS server does not. That means a many-node
TLS configuration can OOM if all nodes gang up on a single replica.

Fix by passing the limits to the TLS server too.

Fixes #3757.
Message-Id: <20180907192607.19802-1-avi@scylladb.com>

(cherry picked from commit 4553238653)
2018-09-17 20:25:49 +03:00
Piotr Sarna
666b19552d cql3, 2.3: refuse serving multi-restriction indexed queries
Secondary index queries do not work correctly when multiple
restrictions are present - the rest of the restrictions is simply
ignored, which results in too many rows returned to the client.
This 2.3 fix makes these unsafe queries return an error instead.

Refs #3754

Message-Id: <7e470052d8ffc5bd8dc12e0d7f2705f0754afdbb.1536243391.git.sarna@scylladb.com>
2018-09-17 20:16:01 +03:00
Takuya ASADA
178f870a03 dist/ami/files/.scylla_ami_login: fix python error message on unsupported instance type
We changed the usage of colorprint() in f8cec2f891, so we
need to pass format parameters to the function.

Fixes #3680

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Tested-by: Amos Kong <amos@scylladb.com>
Message-Id: <20180913182450.13308-1-syuu@scylladb.com>
2018-09-17 14:53:17 +03:00
Pekka Enberg
1b18f16dc1 release: prepare for 2.3.0 2018-09-14 13:52:02 +03:00
Pekka Enberg
28934575e4 docker: Update RPM repository to 2.3 2018-09-12 15:54:17 +02:00
Gleb Natapov
182cbeefb0 mutation_query_test: add test for result size calculation
Check that digest only and digest+data query calculate result size to be
the same.

Message-Id: <20180906153800.GK2326@scylladb.com>
(cherry picked from commit 9e438933a2)
2018-09-12 15:54:17 +02:00
Gleb Natapov
b70fc41a90 mutation_partition: accurately account for result size in digest only queries
When measuring_output_stream is used to calculate a result element's size,
it incorrectly takes into account not only the serialized element size but
also a placeholder that the ser::qr_partition__rows/qr_partition__static_row__cells
constructors put at the beginning. Fix it by taking the starting point in the
stream before element serialization and subtracting it afterwards.
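
The "sample the position, serialize, subtract" idea can be sketched as
follows. counting_stream and measure_element are hypothetical stand-ins
for illustration only, not the real measuring_output_stream interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for a measuring output stream: it only counts bytes.
class counting_stream {
    std::size_t _size = 0;
public:
    void write(std::string_view bytes) { _size += bytes.size(); }
    std::size_t size() const { return _size; }
};

// Returns the serialized size of `element` alone: the stream position is
// sampled before the write and subtracted afterwards, so bytes already in
// the stream (such as a placeholder) are not counted.
inline std::size_t measure_element(counting_stream& out, std::string_view element) {
    const std::size_t start = out.size();
    out.write(element);
    assert(out.size() >= start);
    return out.size() - start;
}
```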

Fixes #3755

Message-Id: <20180906153609.GJ2326@scylladb.com>
(cherry picked from commit d7674288a9)
2018-09-12 15:54:12 +02:00
Tomasz Grabiec
debfc795b2 tests: flat_mutation_reader: Use fluent assertions for better error messages
Message-Id: <1531908313-29810-2-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit dc453d4f5d)
2018-09-12 15:54:09 +02:00
Tomasz Grabiec
0d094575ec tests: flat_mutation_reader_assertions: Introduce produces(mutation_fragment)
Message-Id: <1531908313-29810-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 604c8baed8)
2018-09-12 15:54:06 +02:00
Tomasz Grabiec
20baef69a9 mutation_fragment: Fix clustering_row::equal() using incorrect column kind
Incorrect column_kind was passed, which may cause wrong type to be
used for comparison if schema contains static columns. Affects only
tests.

Spotted during code review.
Message-Id: <1531144991-2658-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 1336744a05)
2018-09-07 13:34:26 +02:00
Pekka Enberg
1bac88601d release: prepare for 2.3.rc3 2018-09-07 07:41:46 +03:00
Vlad Zolotarov
e581fd1463 loading_cache: make size() return the size of lru_list instead of loading_shared_values
reloading flow may hold the items in the underlying loading_shared_values
after they have been removed (e.g. via remove(key) API) thereby loading_shared_values.size()
doesn't represent the correct value for the loading_cache. lru_list.size() on the other hand - does.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
(cherry picked from commit 1e56c7dd58)
2018-09-06 16:57:22 +03:00
Vlad Zolotarov
b366bff998 loading_cache: make iterator work on top of lru_list iterators instead of loading_shared_values'
Reloading may hold value in the underlying loading_shared_values while
the corresponding cache values have already been deleted.

This may create weird situations like this:

<populate cache with 10 entries>
cache.remove(key1);
for (auto& e : cache) {
    std::cout << e << std::endl;
}

<all 10 entries are printed, including the one for "key1">

In order to avoid such situations we are going to make the loading_cache::iterator
to be a transform_iterator of lru_list::iterator instead of loading_shared_values::iterator
because lru_list contains entries only for cached items.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
(cherry picked from commit 945d26e4ee)
2018-09-06 16:57:12 +03:00
Gleb Natapov
38e6984ba5 mutation_partition: correctly measure static row size when doing digest calculation
The code uses an incorrect output stream when only a digest is requested
and thus gets an incorrect data size. Failing to correctly account
for static row size while calculating the digest may cause a digest mismatch
between digest and data queries.

Fixes #3753.

Message-Id: <20180905131219.GD2326@scylladb.com>
(cherry picked from commit 98092353df)
2018-09-06 16:50:58 +03:00
Vlad Zolotarov
332f76579e tests: loading_cache_test: configure a validity timeout in test_loading_cache_loading_different_keys to a greater value
Change the validity timeout from 1s to 1h in order to avoid false alarms
on busy systems: for a short value there is a chance that
(loading_cache.size() == num_loaders) check is going to run after some elements
of the cache have already been evicted.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20180904193026.7304-1-vladz@scylladb.com>
(cherry picked from commit dae70e1166)
2018-09-06 16:05:42 +03:00
Jesse Haber-Kucharsky
315a03cf6c auth: Use finite time-out for all QUORUM reads
Commit e664f9b0c6 transitioned internal
CQL queries in the auth. sub-system to be executed with finite time-outs
instead of infinite ones.

It should have also modified the functions in `auth/roles-metadata.cc`
to have finite time-outs.

This change fixes some previously failing dtests, particularly around
repair. Without this change, the QUORUM query fails to terminate when
the necessary consistency level cannot be achieved.

Fixes #3736.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <e244dc3e731b4019f3be72c52a91f23ee4bb68d1.1536163859.git.jhaberku@scylladb.com>
(cherry picked from commit 682805b22c)
2018-09-05 22:54:32 +03:00
Asias He
1847dc7a6a storage_service: Wait for range setup before announcing join status
When a joining node announces its join status through gossip, other
existing nodes will send writes to the joining node. At this time, it
is possible the joining node hasn't learnt the tokens of other nodes,
which causes errors like the ones below:

   token_metadata - sorted_tokens is empty in first_token_index!
   storage_proxy - Failed to apply mutation from 127.0.4.1#0:
   std::runtime_error (sorted_tokens is empty in first_token_index!)

To fix, wait for the token range setup before announcing the join
status.

Fixes: #3382
Tests: 60 run of materialized_views_test.py:TestMaterializedViews.add_dc_during_mv_update_test

Message-Id: <01abb21ae3315ae275297e507c5956e5774557ef.1536128531.git.asias@scylladb.com>
(cherry picked from commit 89b769a073)
2018-09-05 15:32:29 +03:00
Eliran Sinvani
dd11b5987e cql3: backport test of multicolumn IN with repetitions.
The test failed after backport of the containing commit (d734d31), the
reason is that the query was missing ALLOW FILTERING which is required.
In newer versions the allow filtering enforcement "misses" some
cases that need the filtering annotation due to a caveat in testing multi
column restrictions for the ALLOW FILTERING requirement. This issue was
introduced as part of refactoring the multicolumn restrictions classes
and already has an open issue: #3574

Tests: unit tests (release)

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <ad88b7218fa55466be7bc4303dc50326a3d59733.1534322238.git.eliransin@scylladb.com>
(cherry picked from commit d734d316a6)
Message-Id: <928f1fbecffa43c4700541ee6603bb4607871510.1536146137.git.eliransin@scylladb.com>
2018-09-05 14:30:07 +03:00
Paweł Dziepak
a134e8699a test.py: do not disable human-readable format with --jenkins flag
When test.py is run with --jenkins flag Boost UTF is asked to generate
an XML file with the test results. This automatically disables the
human-readable output printed to stdout. There is no real reason to do
so and it is actually less confusing when the Boost UTF messages are in
the test output together with Scylla logger messages.

Message-Id: <20180704172913.23462-1-pdziepak@scylladb.com>
(cherry picked from commit 07a429e837)
2018-09-04 14:26:00 +02:00
Takuya ASADA
bd7dcbb8d2 dist/common/scripts/scylla_raid_setup: create scylla-server.service.d when it doesn't exist
When /etc/systemd/system/scylla-server.service.d/capabilities.conf is
not installed, we don't have /etc/systemd/system/scylla-server.service.d/,
so we need to create it.

Fixes #3738

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180904015841.18433-1-syuu@scylladb.com>
(cherry picked from commit bd8a5664b8)
2018-09-04 14:42:55 +03:00
Tomasz Grabiec
74e61528a6 managed_vector: Make external_memory_usage() ignore reserved space
This ensures that row::external_memory_usage() is invariant to
insertion order of cells.

It should be so, so that accounting of a clustering_row, merged from
multiple MVCC versions by the partition_snapshot_flat_reader on behalf
of a memtable flush, doesn't give a greater result than what is used
by the memtable region. Overaccounting leads to assertion failure in
~flush_memory_accounter.

Fixes #3625 (hopefully).

Message-Id: <1535982513-19922-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 4fb3f7e8eb)
2018-09-03 19:58:10 +03:00
Duarte Nunes
5eb4fde2d5 Merge 'utils::loading_cache: improve reload() robustness' from Vlad
"This series introduces a few improvements related to a reload flow.

From now on the callback may assume that the "key" parameter value
is kept alive till the end of its execution in the reloading flow.

It may also safely evict as many items from the cache as needed."

Fixes #3606

* 'loading_cache_improve_reload-v1' of https://github.com/vladzcloudius/scylla:
  utils::loading_cache: hold a shared_value_ptr to the value when we reload
  utils::loading_cache::on_timer(): remove not needed capture of "this"
  utils::loading_cache::on_timer(): use chunked_vector for storing elements we want to reload

(cherry picked from commit f6aadd8077)
2018-08-29 10:12:32 +01:00
Duarte Nunes
cc0703f8ca utils/loading_cache: Avoid using invalidated iterators
When periodically reloading the values in the loading_cache, we would
iterate over the list of entries and call the load() function for
those which need to be reloaded.

For some concrete caches, load() can remove the entry from the LRU set,
and can be executed inline from the parallel_for_each(). This means we
could potentially keep iterating using an invalidated iterator.

Fix this by using a temporary container to hold those entries to be
reloaded.
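
A minimal sketch of the temporary-container approach, using std::list as
a stand-in for the LRU set. The entry struct, reload_all and the shape of
the load callback are assumptions for illustration, not the loading_cache
code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <vector>

struct entry {
    int key;
    bool needs_reload;
};

// Hypothetical sketch: `load` may erase entries from `lru`, so it is never
// called while an iterator into `lru` is live. Keys are copied out first.
inline std::size_t reload_all(
        std::list<entry>& lru,
        const std::function<void(std::list<entry>&, int)>& load) {
    std::vector<int> to_reload;
    for (const entry& e : lru) {
        if (e.needs_reload) {
            to_reload.push_back(e.key);
        }
    }
    for (int key : to_reload) {
        load(lru, key);  // safe: iteration is over the copy, not over `lru`
    }
    return to_reload.size();
}
```

Copying the keys costs an extra allocation per reload pass, but it makes
the loop indifferent to whatever the callback does to the list.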

Spotted when reading the code.

Also use if constexpr and fix the comment in the function containing
the changes.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180712124143.13638-1-duarte@scylladb.com>
(cherry picked from commit 63b63b0461)
2018-08-29 10:12:11 +01:00
Botond Dénes
678283a5bb loading_cache::reload(): obtain key before calling _load()
The continuation attached to _load() needs the key of the loaded entry
to check whether it was disposed during the load. However if _load()
invalidates the entry the continuation's capture line will access
invalid memory while trying to obtain the key.
To avoid this save a copy of the key before calling _load() and pass it
to both _load() and the continuation.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b571b73076ca863690f907fbd3fb4ff54e597b28.1531393608.git.bdenes@scylladb.com>
(cherry picked from commit 2e7bf9c6f9)
2018-08-29 10:12:03 +01:00
Avi Kivity
552c0d7641 Update scylla-ami submodule
* dist/ami/files/scylla-ami b7db861...e7aa504 (1):
  > scylla_create_devices: fix mouting RAID volume after reboot

Fixes #3640.
2018-08-28 15:45:36 +03:00
Piotr Sarna
860c06660b tests: add multi-column pk test to INSERT JSON case
Refs #3687
Message-Id: <6ba1328549ed701691ca7cbdacc7d6fa72f2c3de.1534171422.git.sarna@scylladb.com>

(cherry picked from commit aa2bfc0a71)
2018-08-28 14:39:43 +03:00
Piotr Sarna
db733ba075 cql3: fix handling multi-column partition key in INSERT JSON
Multiple column partition keys were previously handled incorrectly,
now the implementation is based on from_exploded instead of
from_singular.

Fixes #3687
Message-Id: <09e0bdb0f1c18d49b9e67c21777d93ba1545a13c.1534171422.git.sarna@scylladb.com>

(cherry picked from commit fa72422baa)
2018-08-28 14:39:41 +03:00
Avi Kivity
88677d39c8 Update seastar submodule
* seastar ed62fbd...ebf4812 (4):
  > correctly configure I/O Scheduler for usage with the YAML file
  > iotune: adjust num-io-queues recommendation
  > reactor: switch indentation
  > properly configure I/O Scheduler when --max-io-requests is passed

Fixes #3722.
Fixes #3721.
Fixes #3718.
2018-08-28 14:37:31 +03:00
Avi Kivity
d767dee5ec migration_manager: downgrade frightening "Can't send migration request" ERROR
This error is transient, since as soon as the node is up we will be able
to send the migration request.  Downgrade it to a warning to reduce anxiety
among people who actually read the logs (like QA).

The message is also badly worded as no one can guess what a migration
request is, but that is left to another patch.

Fixes #3706.
Message-Id: <20180821070200.18691-1-avi@scylladb.com>

(cherry picked from commit 5792a59c96)
2018-08-28 09:11:11 +03:00
Tomasz Grabiec
702f6ee1b7 database: Run system table flushes in the main scheduling group
memtable flushes for system and regular region groups run under the
memtable_scheduling_group, but the controller adjusts shares based on
the occupancy of the regular region group.

It can happen that regular is not under pressure, but system is. In
this case the controller will incorrectly assign low shares to the
memtable flush of system. This may result in high latency and low
throughput for writes in the system group.

I observed writes to the system keyspace timing out (on scylla-2.3-rc2)
in the dtest: limits_test.py:TestLimits.max_cells_test, which went
away after this.

Fixes #3717.

Message-Id: <1535016026-28006-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 10f6b125c8)
2018-08-27 16:41:35 +02:00
Piotr Sarna
473b9aec65 cql3: throw proper request exception for INSERT JSON
JSON code is amended in order to return proper
"Missing mandatory PRIMARY KEY part" message instead of generic
"Attempt to access value of a disengaged optional object".

Fixes #3665
Message-Id: <69157d659d51ce5a2d408614ce3ba7bf8e3a5d88.1534161127.git.sarna@scylladb.com>

(cherry picked from commit 310d0a74b9)
2018-08-27 12:36:33 +03:00
Tomasz Grabiec
b548061257 database: Avoid OOM when soft pressure but nothing to flush
There could be soft pressure, but soft-pressure flusher may not be
able to make progress (Refs #3716). It will keep trying to flush empty
memtables, which block on earlier flushes to complete, and thus
allocate continuations in memory. Those continuations accumulate in
memory and can cause OOM.

flush will take longer to complete. Due to scheduling group isolation,
the soft-pressure flusher will keep getting the CPU.

This causes bad_alloc and crashes of dtest:
limits_test.py:TestLimits.max_cells_test

Fixes #3717

Message-Id: <1535102520-23039-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 2afce13967)
2018-08-26 18:05:35 +03:00
Tomasz Grabiec
01165a9ae7 database: Make soft-pressure memtable flusher not consider already flushed memtables
The flusher picks the memtable list which contains the largest region
according to region_impl::evictable_occupancy().total_space(), which
follows region::occupancy().total_space(). But only the latest
memtable in the list can start flushing. It can happen that the
memtable corresponding to the largest region was already flushed to an
sstable (flush permit released), but not yet fsynced or moved to
cache, so it's still in the memtable list.

The latest memtable in the winning list may be small, or empty, in
which case the soft pressure flusher will not be able to make much
progress. There could be other memtable lists with non-empty
(flushable) latest memtables. This can lead to writes unnecessarily
blocking on dirty.

I observed this for the system memtable group, where it's easy for the
memtables to overshoot small soft pressure limits. The flusher kept
trying to flush empty memtables, while the previous non-empty memtable
was still in the group.

The CPU scheduler makes this worse, because it runs memtable_to_cache
in a separate scheduling group, so it further defers in time the
removal of the flushed memtable from the memtable list.

This patch fixes the problem by making regions corresponding to
memtables which started flushing report evictable_occupancy() as 0, so
that they're picked by the flusher last.
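
The selection rule can be sketched as below. The region struct and
pick_region_to_flush are hypothetical simplifications, assuming the
flusher simply picks the region with the largest evictable occupancy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct region {
    std::size_t total_space;
    bool flushing;  // true once the memtable has started flushing

    // A region that has already started flushing reports nothing evictable.
    std::size_t evictable_occupancy() const { return flushing ? 0 : total_space; }
};

// Returns the index of the region with the largest evictable occupancy,
// or -1 when `regions` is empty. Flushing regions compare as 0, so they
// win only when nothing else has any evictable space.
inline int pick_region_to_flush(const std::vector<region>& regions) {
    int best = -1;
    for (std::size_t i = 0; i < regions.size(); ++i) {
        if (best < 0 ||
            regions[i].evictable_occupancy() > regions[best].evictable_occupancy()) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```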

Fixes #3716.
Message-Id: <1535040132-11153-2-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 1e50f85288)
2018-08-26 18:05:35 +03:00
Tomasz Grabiec
5cdb963768 logalloc: Make evictable_occupancy() indicate no free space
Doesn't fix any bug, but it's closer to the truth that all segments
are used rather than none is used.

Message-Id: <1535040132-11153-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 364418b5c5)
2018-08-26 18:05:35 +03:00
Eliran Sinvani
7c9b9a4e24 cql3: ensure repeated values in IN clauses don't return repeated rows
When the list of values in the IN list of a single column contains
duplicates, multiple executors are activated since the assumption
is that each value in the IN list corresponds to a different partition.
This results in the same row appearing in the result as many times
as the partition value is duplicated.
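
One way to avoid the duplicate executors is to collapse repeated IN
values before dispatching. The sketch below is only an assumption about
the general technique; the actual patch may deduplicate at a different
layer (for example on partition ranges), and unique_in_values is a
hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sketch: collapse repeated IN values so that each distinct
// value is queried once. Note that sorting changes the order of the values.
inline std::vector<int> unique_in_values(std::vector<int> values) {
    std::sort(values.begin(), values.end());
    values.erase(std::unique(values.begin(), values.end()), values.end());
    return values;
}
```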

Added queries to the IN restriction unit test and fixed a bad result check.

Fixes #2837
Tests: Queries as in the use case from the GitHub issue in both forms,
prepared and plain (using python driver), unit tests.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <ad88b7218fa55466be7bc4303dc50326a3d59733.1534322238.git.eliransin@scylladb.com>
(cherry picked from commit d734d316a6)
2018-08-26 18:05:33 +03:00
Avi Kivity
f475c65ae6 Update scylla-ami submodule
* dist/ami/files/scylla-ami c7e5a70...b7db861 (2):
  > scylla-ami-setup.service: run only on first startup
  > Use fstab to mount RAID volume on every reboot

(cherry picked from commit 54ac334f4b)
2018-08-26 12:40:58 +03:00
Takuya ASADA
687372bc48 dist/common/scripts/scylla_raid_setup: refuse start scylla-server.service when RAID volume is not mounted
Since a Linux system aborts booting when it fails to mount fstab entries,
the user may not be able to see an error message when we use fstab to mount
/var/lib/scylla on AMI.

Instead of aborting the boot, we can just refuse to start scylla-server.service
when the RAID volume is not mounted, using the RequiresMountsFor directive of
the systemd unit file.

See #3640

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180824185511.17557-1-syuu@scylladb.com>
(cherry picked from commit ff55e3c247)
2018-08-26 12:40:50 +03:00
Piotr Sarna
65c140121c tests: add parsing varint from JSON string test
Refs #3666
Message-Id: <f4205e9484f5385796fade7986e3e38dcbc65bac.1534845398.git.sarna@scylladb.com>

(cherry picked from commit 4a274ee7e2)
2018-08-26 11:11:18 +03:00
Piotr Sarna
ed68ad220f types: enable deserializing varint from JSON string
Previously deserialization failed because the JSON string
representing a number was unnecessarily quoted.

Fixes #3666
Message-Id: <a0a100dbac7c151d627522174303657d1da05c27.1534845398.git.sarna@scylladb.com>

(cherry picked from commit 37a5c38471)
2018-08-26 11:11:18 +03:00
Piotr Sarna
35f4b8fbbe cql3: add proper setting of empty collections in INSERT JSON
Previously empty collections were incorrectly added as dead cells,
which resulted in serialization errors later.

Fixes #3664
Message-Id: <a9c90d66c6737641cafe40edb779df490ada0309.1534848313.git.sarna@scylladb.com>

(cherry picked from commit 465045368f)
2018-08-26 11:11:18 +03:00
Duarte Nunes
48012fe418 Merge seastar upstream
* seastar 22437af...ed62fbd (1):
  > core: fix __libc_free return type signature

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-22 13:49:39 +01:00
Tomasz Grabiec
c862ccda91 Merge 'Fix multi-cell static list updates in the presence of ckeys' from Duarte
Fixes a regression introduced in
9e88b60ef5, which broke the lookup for
prefetched values of lists when a clustering key is specified.

This is the code that was removed from some list operations:

 std::experimental::optional<clustering_key> row_key;
 if (!column.is_static()) {
   row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
 }
 ...
 auto&& existing_list = params.get_prefetched_list(m.key().view(), row_key, column);

Put it back, in the form of common code in the update_parameters class.

Fixes #3703

* https://github.com/duarten/scylla cql-list-fixes/v1:
  tests/cql_query_test: Test multi-cell static list updates with ckeys
  cql3/lists: Fix multi-cell static list updates in the presence of ckeys
  keys: Add factory for an empty clustering_key_prefix_view

(cherry picked from commit 6937cc2d1c)
2018-08-21 17:35:14 +01:00
Duarte Nunes
83b1057c4b cql3/query_options: Use _value_views in prepare()
_value_views is the authoritative data structure for the
client-specified values. Indeed, the ctor called
transport::request::read_options() leaves _values completely empty.

In query_options::prepare() we were, however, using _values to
associate values to the client-specified column names, and not
_value_views. Fix this by using _value_views instead.

As for the reasons we didn't see this bug earlier, I assume it's
because very few drivers set the 0x40 query options flag, which means
column names are omitted. This is the right thing to do since most
drivers have enough information to correctly position the values.

Fixes #3688

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814234605.14775-1-duarte@scylladb.com>
(cherry picked from commit a4355fe7e7)
2018-08-21 18:23:10 +03:00
Gleb Natapov
c1cb779dd2 storage_proxy: do not fail read without speculation on connection error
After ac27d1c93b, if a read executor has just enough targets to
achieve the request's CL and a connection to one of them is dropped
during execution, a ReadFailed error is returned immediately and the
client does not have a chance to issue a speculative read (retry). The
patch changes the code to not return the ReadFailed error immediately, but
to wait for the timeout instead, giving the client a chance to issue a
speculative read in case the read executor does not have additional targets
to send speculative reads to by itself.

Fixes #3699.
Message-Id: <20180819131646.GK2326@scylladb.com>

(cherry picked from commit 7277ee2939)
2018-08-20 13:06:51 +03:00
Hagit Segev
b47d18f9fd support 2.3 RC2 2018-08-19 20:17:24 +03:00
Tomasz Grabiec
f8713b019e mutation_partition: Fix exception safety of row::apply_monotonically()
When emplace_back() fails, value is already moved-from into a
temporary, which breaks monotonicity expected from
apply_monotonically(). As a result, writes to that cell will be lost.

The fix is to avoid the temporary by in-place construction of
cell_and_hash. To do that, appropriate cell_and_hash constructor was
added.

Found by mutation_test.cc::test_apply_monotonically_is_monotonic with
some modifications to the random mutation generator.

Introduced in 99a3e3a.

Fixes #3678.

Message-Id: <1533816965-27328-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 024b3c9fd9)
2018-08-13 10:44:27 +02:00
Takuya ASADA
cd5e4eace5 dist/common/scripts/scylla_setup: don't proceed RAID setup until user type 'done'
Need to wait for user confirmation before running RAID setup.

See #3659
Fixes #3681

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180810194507.1115-1-syuu@scylladb.com>
(cherry picked from commit 2ef1b094d7)
2018-08-12 15:08:47 +03:00
Takuya ASADA
4fb5403670 dist/common/scripts/scylla_setup: don't mention about interactive mode prompt when running on non-interactive mode
Skip showing message when it's non-interactive mode.

Fixes #3674

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180810191945.32693-1-syuu@scylladb.com>
(cherry picked from commit b7cf3d7472)
2018-08-12 15:08:37 +03:00
Takuya ASADA
e9df6c42ce dist/common/scripts/scylla_setup: check existence of housekeeping.cfg before asking to run version check
Skip asking to run the version check when housekeeping.cfg already
exists.
Fixes #3657

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180807232313.15525-1-syuu@scylladb.com>
(cherry picked from commit ef9475dd3c)
2018-08-12 15:08:20 +03:00
Takuya ASADA
5fdf492ccc dist/debian: fix install scylla-server.service
In a previous commit we moved debian/scylla-server.service to
debian/scylla-server.scylla-server.service to explicitly specify the
subpackage name, but it doesn't work for dh_installinit without the '--name'
option.

As a result, the current scylla-server .deb package is missing
scylla-server.service, so we need to rename the service back to the original
file name.

Fixes #3675

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180810221944.24837-1-syuu@scylladb.com>
(cherry picked from commit f30b701872)
2018-08-12 15:07:28 +03:00
Duarte Nunes
fd2b02a12c Merge 'JSON support fixes' from Piotr
"
This series addresses SELECT/INSERT JSON support issues, namely
handling null values properly and parsing decimals from strings.
It also comes with updated cql tests.

Tests: unit (release)
"

Fixes #3666
Fixes #3664
Fixes #3667

* 'json_fixes_3' of https://github.com/psarna/scylla:
  cql3: remove superfluous null conversions in to_json_string
  tests: update JSON cql tests
  cql3: enable parsing decimal JSON values from string
  cql3: add missing return for dead cells
  cql3: simplify parsing optional JSON values
  cql3: add handling null value in to_json
  cql3: provide to_json_string for optional bytes argument

(cherry picked from commit 95677877c2)
2018-08-12 15:05:43 +03:00
Takuya ASADA
f8cec2f891 dist/common/scripts: pass format variables to colorprint()
When we use str.format() to pass variables into the message, it will always
cause an exception like "KeyError: 'red'", since the message contains color
variables that are not passed to str.format().
To avoid the error we need to pass all format variables to colorprint()
and run str.format() inside the function.

Fixes #3649

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180803015216.14328-1-syuu@scylladb.com>
(cherry picked from commit ad7bc313f7)
2018-08-09 10:46:19 +03:00
Avi Kivity
e4d6577ef2 Update seastar submodule
* seastar 814a055...22437af (1):
  > tls.cc: Make "close" timeout delay exception proof

Fixes #3461.
2018-08-08 13:35:10 +03:00
Takuya ASADA
346027248d dist/common/scripts/scylla_setup: print message when EC2 instance is optimized for Scylla
Currently scylla_ec2_check exits silently when the EC2 instance is optimized
for Scylla, so the result of the check is not clear; we need to output a
message.

Note that this change affects the AMI login prompt too.

Fixes #3655

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180808024256.9601-1-syuu@scylladb.com>
(cherry picked from commit 15825d8bf1)
2018-08-08 13:26:52 +03:00
Takuya ASADA
2cf6191353 dist/common/scripts/scylla_setup: fix typo on interactive setup
Scylls -> Scylla

Fixes #3656

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180808002443.1374-1-syuu@scylladb.com>
(cherry picked from commit 652eb5ae0e)
2018-08-08 13:26:52 +03:00
Avi Kivity
b52d647de2 docker: adjust for script conversion to Python
Since our scripts were converted to Python, we can no longer
source them from a shell. Execute them directly instead. Also,
we now need to import configuration variables ourselves, since
scylla_prepare, being an independent process, won't do it for
us.

Fixes #3647
Message-Id: <20180802153017.11112-1-avi@scylladb.com>

(cherry picked from commit c9caaa8e6e)
2018-08-07 18:58:44 +03:00
Takuya ASADA
f7c96a37f1 dist/common/scripts/scylla_setup: use specified NIC ifname correctly
The interactive NIC selection prompt mistakenly always returns 'eth0' as the
selected NIC name; fix it.

Fixes #3651

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180803020724.15155-1-syuu@scylladb.com>
(cherry picked from commit a300926495)
2018-08-07 09:51:25 +03:00
Jesse Haber-Kucharsky
ae71ffdcfd auth: Don't use unsupported hashing algorithms
In previous versions of Fedora, the `crypt_r` function returned
`nullptr` when a requested hashing algorithm was not supported.

This is consistent with the documentation of the function in its man
page.

As of Fedora 28, the function's behavior changes so that the encrypted
text is not `nullptr` on error, but instead the string "*0".

The info pages for `crypt_r` clarify somewhat (and contradict the man
pages):

    Some implementations return `NULL` on failure, and others return an
    _invalid_ hashed passphrase, which will begin with a `*` and will
    not be the same as SALT.

Because of this change of behavior, users running Scylla on a Fedora 28
machine which was upgraded from a previous release would not be able to
authenticate: an unsupported hashing algorithm would be selected,
producing encrypted text that did not match the entry in the table.

With this change, unsupported algorithms are correctly detected and
users should be able to continue to authenticate themselves.
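
A minimal sketch of a check that covers both failure conventions quoted
above. crypt_result_is_failure is a hypothetical helper for illustration,
not the function used in the auth code; it only inspects a result string
and does not call crypt_r itself.

```cpp
#include <cassert>

// Hypothetical helper: treats both documented failure conventions as
// failure, i.e. a null pointer or an invalid hash beginning with '*'.
inline bool crypt_result_is_failure(const char* hashed) {
    return hashed == nullptr || hashed[0] == '*';
}
```

A caller could probe each candidate hashing scheme with a fixed salt and
skip any scheme for which this check reports failure.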

Fixes #3637.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <bcd708f3ec195870fa2b0d147c8910fb63db7e0e.1533322594.git.jhaberku@scylladb.com>
(cherry picked from commit fce10f2c6e)
2018-08-05 10:30:16 +03:00
Avi Kivity
a235900388 Merge "Fix exception safety in imr::utils::object" from Paweł
"

There is an exception safety problem in imr::utils::object. If multiple
memory allocations are needed and one of them fails the main object is
going to be freed (as expected). However, at this stage it is not
constructed yet, so when LSA asks its migrator for the size it may get
a meaningless value. The solution is to remember the size until object
is fully created and use sized deallocation in case of failures.

Fixes #3618.

Tests: unit(release, debug/imr_test)
"

(cherry picked from commit 3b42fcfeb2)
2018-08-03 11:54:53 +03:00
Takuya ASADA
be9f150341 dist/debian: install *.service on correct subpackage
We were mistakenly installing scylla-housekeeping-*.service to the scylla-conf
package; all *.service files should explicitly specify the subpackage name.

Fixes #3642

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180801233042.307-1-syuu@scylladb.com>
(cherry picked from commit 1bb463f7e5)
2018-08-03 11:53:22 +03:00
Gleb Natapov
2478fa1f6e storage_proxy: fix rpc connection failure handling by read operation
Currently rpc::closed_error is not counted towards replica failure
during read and thus read operation waits for timeout even if one
of the nodes dies. Fix this by counting rpc::closed_error towards
failed attempts.

Fixes #3590.

Message-Id: <20180708123522.GC28899@scylladb.com>
(cherry picked from commit ac27d1c93b)
2018-08-02 11:41:58 +03:00
Gleb Natapov
d95ac1826e cache_hitrate_calculator: fix race when new table is added during calculations
The calculation consists of several parts with preemption point between
them, so a table can be added while calculation is ongoing. Do not
assume that table exists in intermediate data structure.

Fixes #3636

Message-Id: <20180801093147.GD23569@scylladb.com>
(cherry picked from commit 44a6afad8c)
2018-08-01 14:30:45 +03:00
Avi Kivity
6fc17345e9 Merge "No infinite time-outs for internal distributed queries" from Jesse
"
This series replaces infinite time-outs in internal distributed
(non-local) CQL queries with finite ones.

The implementation of tracing, which also performs internal queries,
already has finite time-outs, so it is unchanged.

Fixes #3603.
"

* 'jhk/finite_time_outs/v2' of https://github.com/hakuch/scylla:
  Use finite time-outs for internal auth. queries
  Use finite query time-outs for `system_distributed`

(cherry picked from commit 620e950fc8)
2018-08-01 14:23:49 +03:00
Takuya ASADA
4bfa0ae247 dist/common/scripts/scylla_ntp_setup: fix typo
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1533070539-2147-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 2cd99d800b)
2018-08-01 10:58:36 +03:00
Avi Kivity
174b7870e6 Update ami submodule
* dist/ami/files/scylla-ami d53834f...c7e5a70 (1):
  > ds2_configure.py: uncomment 'cluster_name' when it's commented out
2018-07-31 09:33:46 +03:00
Takuya ASADA
e95b4ee825 dist/common/scripts/scylla_ntp_setup: fix typo
Comments in Python start with "#", not "//".

Fixes #3629

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180730091022.4512-1-syuu@scylladb.com>
(cherry picked from commit 032b26deeb)
2018-07-30 13:53:14 +03:00
Takuya ASADA
464305de1c dist/common/scripts/scylla_ntp_setup: ignore ntpdate error
Even if ntpdate fails to adjust the clock, ntpd may be able to recover it
later, so ignore the ntpdate error and keep running the script.

Fixes #3629

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180726080206.28891-1-syuu@scylladb.com>
(cherry picked from commit 8e4d1350c9)
2018-07-30 13:53:08 +03:00
Avi Kivity
3a1a9e1a11 dist: redhat: fix up bad file ownership of rpms/srpms
mock outputs files owned by root. This causes attempts
by scripts that want to junk the working directory (typically
continuous integration) to fail on permission errors.

Fixup those permissions after the fact.
Message-Id: <20180719163553.5186-1-avi@scylladb.com>

(cherry picked from commit b167647bf6)
2018-07-26 14:22:52 +03:00
Avi Kivity
90dac5d944 Merge "Fix JSON string quoting" from Piotr
"

This mini-series covers a regression caused by the newest versions
of the jsoncpp library, which changed the way UTF-8 strings are quoted.

Tests: unit (release)
"

* 'add_json_quoting_3' of https://github.com/psarna/scylla:
  tests: add JSON unit test
  types: use value_to_quoted_string in JSON quoting
  json: add value_to_quoted_string helper function

Ref #3622.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>

(cherry picked from commit d6ef74fe36)
2018-07-26 12:03:35 +03:00
Piotr Sarna
e5a83d105c cql3: fix INSERT JSON grammar
Previously CQL grammar wrongfully required INSERT JSON queries
to provide a list of columns, even though they are already
present in JSON itself.
Unfortunately, tests were written with this false assumption as well,
so they are updated.
Message-Id: <33b496cba523f0f27b6cbf5539a90b6feb20269e.1532514111.git.sarna@scylladb.com>

Fixes #3631.

(cherry picked from commit f66aace685)
2018-07-25 14:53:35 +01:00
Takuya ASADA
9b4a0a2879 dist/debian: fix ImportError on pystache
Seems like pystache is not provided as a dependency; we need to install
it in build_deb.sh.

Fixes #3627

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180724164852.16094-1-syuu@scylladb.com>
(cherry picked from commit 58f094e06d)
2018-07-25 11:39:22 +03:00
Pekka Enberg
adad12ddc3 release: prepare for 2.3.rc1 2018-07-24 09:21:38 +03:00
Avi Kivity
a77bb1fe34 Merge "row_cache: Fix violation of continuity on concurrent eviction and population" from Tomasz
"
The problem happens under the following circumstances:

  - we have a partially populated partition in cache, with a gap in the middle

  - a read with no clustering restrictions trying to populate that gap

  - eviction of the entry for the lower bound of the gap concurrent with population

The population may incorrectly mark the range before the gap as continuous.
This may result in temporary loss of writes in that clustering range. The
problem heals by clearing cache.

Caught by row_cache_test::test_concurrent_reads_and_eviction, which has been
failing sporadically.

The problem is in ensure_population_lower_bound(), which returns true if
current clustering range covers all rows, which means that the populator has a
right to set continuity flag to true on the row it inserts. This is correct
only if the current population range actually starts before all
clustering rows. Otherwise, we're populating from _last_row and should
consult it.

Fixes #3608.
"

* 'tgrabiec/fix-violation-of-continuity-on-concurrent-read-and-eviction' of github.com:tgrabiec/scylla:
  row_cache: Fix violation of continuity on concurrent eviction and population
  position_in_partition: Introduce is_before_all_clustered_rows()

(cherry picked from commit 31151cadd4)
2018-07-18 12:05:51 +02:00
Tomasz Grabiec
3c7e6dfdb9 mutation_partition: Fix exception-safety of row copy constructor
In case population of the vector throws, the vector object would not
be destroyed. It's a managed object, so in addition to causing a leak,
it would corrupt memory if later moved by the LSA, because it would
try to fixup forward references to itself.

Caused sporadic failures and crashes of row_cache_test, especially
with allocation failure injector enabled.

Introduced in 27014a23d7.
Message-Id: <1531757764-7638-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 3f509ee3a2)
2018-07-17 18:25:12 +02:00
Amos Kong
fab136ae1d scylla_setup: nic setup dialog is only for interactive mode
The current code raises the dialog even in non-interactive mode, when we pass
options while executing scylla_setup. This blocked the automated artifact-test.

Fixes #3549

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <58f90e1e2837f31d9333d7e9fb68ce05208323da.1531824972.git.amos@scylladb.com>
(cherry picked from commit 0fcdab8538)
2018-07-17 18:24:23 +03:00
Botond Dénes
a4218f536b storage_proxy: use the original row limits for the final results merging
`query_partition_key_range()` does the final result merging and trimming
(if necessary) to make sure we don't send more rows to the client than
requested. This merging and trimming is done by a continuation attached
to the `query_partition_key_range_concurrent()` which does the actual
querying. The continuation captures by value the `row_limit` and
`partition_limit` fields of the `query::read_command` object of the
query. This has an unexpected consequence. The lambda object is
constructed after the call to `query_partition_key_range_concurrent()`
returns. If this call doesn't defer, any modifications made to the read
command object by `query_partition_key_range_concurrent()` will be
visible to the lambda. This is undesirable because
`query_partition_key_range_concurrent()` updates the read command object
directly as the vnodes are traversed, which in turn will result in the
lambda doing the final trimming according to a decremented `row_limit`,
which will cause the paging logic to declare the query as exhausted
prematurely because the page will not be full.
To avoid all this make a copy of the relevant limit fields before
`query_partition_key_range_concurrent()` is called and pass these copies
to the continuation, thus ensuring that the final trimming will be done
according to the original page limits.
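
The effect can be reproduced outside of Seastar. The following is a
minimal Python sketch, not the storage_proxy code: `ReadCommand`,
`query_ranges()` and `query_and_trim()` are hypothetical stand-ins, and
the only point is that the limit has to be copied before the callee
starts decrementing it.

```python
class ReadCommand:
    """Hypothetical stand-in for query::read_command; only the row limit is modeled."""

    def __init__(self, row_limit):
        self.row_limit = row_limit


def query_ranges(cmd, rows_per_range):
    """Traverse the ranges, decrementing cmd.row_limit in place as rows are taken."""
    merged = []
    for rows in rows_per_range:
        taken = rows[:cmd.row_limit]
        merged.extend(taken)
        cmd.row_limit -= len(taken)
    return merged


def query_and_trim(cmd, rows_per_range):
    # Copy the limit before the query gets a chance to modify cmd.
    original_limit = cmd.row_limit
    merged = query_ranges(cmd, rows_per_range)
    # Trimming with cmd.row_limit here would use the decremented value.
    return merged[:original_limit]
```

Trimming with `cmd.row_limit` after `query_ranges()` has returned would
cut the page short, which is the symptom described above.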

Spotted while investigating a dtest failure on my 1865/range-scans/v2
branch. On that branch the way range scans are executed on replicas is
completely refactored. These changes apparently reduce the number of
continuations in the read path to the point where an entire page can be
filled without deferring and thus causing the problem to surface.

Fixes #3605.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f11e80a6bf8089d49ba3c112b25a69edf1a92231.1531743940.git.bdenes@scylladb.com>
(cherry picked from commit cc4acb6e26)
2018-07-16 16:55:12 +03:00
Takuya ASADA
9f4431ef04 dist/common/scripts/scylla_prepare: fix error when /etc/scylla/ami_disabled exists
In this part a shell command wasn't converted to python3; it needs to be fixed.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180715075015.13071-1-syuu@scylladb.com>
(cherry picked from commit 9479ff6b1e)
2018-07-16 09:56:57 +03:00
Takuya ASADA
66250bf8cc dist/redhat: drop scylla_lib.sh from .rpm
Since we dropped scylla_lib.sh at 58e6ad22b2,
we need to remove it from the RPM spec file too.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180712155129.17056-1-syuu@scylladb.com>
(cherry picked from commit 1511d92473)
2018-07-16 09:44:48 +03:00
Takuya ASADA
88fe3c2694 dist/common/scripts/scylla_ec2_check: support custom NIC ifname on EC2
Since some AMIs use consistent network device naming, the primary NIC
ifname is not always 'eth0'.
But we hardcoded the NIC name as 'eth0' in scylla_ec2_check, so we need to
add a --nic option to specify a custom NIC ifname.

Fixes #3584

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180712142446.15909-1-syuu@scylladb.com>
(cherry picked from commit ee61660b76)
2018-07-16 09:44:26 +03:00
Takuya ASADA
db4c3d3e52 dist/common/scripts/scylla_util.py: fix typo
Fix typo, and rename get_mode_cpu_set() to get_mode_cpuset(), since the
term 'cpuset' is written without '_' in other places.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180711141923.12675-1-syuu@scylladb.com>
(cherry picked from commit 8f80d23b07)
2018-07-16 09:43:47 +03:00
Takuya ASADA
ca22a1cd1a dist/common/scripts: drop scylla_lib.sh
Drop scylla_lib.sh, since all bash scripts that depend on the library have
already been converted to python3, and all scylla_lib.sh features are
implemented in scylla_util.py.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180711114756.21823-1-syuu@scylladb.com>
(cherry picked from commit 58e6ad22b2)
2018-07-16 09:43:25 +03:00
Avi Kivity
f9b702764e Update scylla-ami submodule
* dist/ami/files/scylla-ami 5200f3f...d53834f (1):
  > Merge "AMI scripts python3 conversion" from Takuya

(cherry picked from commit 83d72f3755)
2018-07-16 09:43:15 +03:00
Avi Kivity
54701bd95c Merge "more conversion from bash to python3" from Takuya
"Converted more scripts to python3."

* 'script_python_conversion2_v2' of https://github.com/syuu1228/scylla:
  dist/common/scripts/scylla_util.py: make run()/out() functions shorter
  dist/ami: install python34 to run scylla_install_ami
  dist/common/scripts/scylla_ec2_check: move ec2 related code to class aws_instance
  dist/common/scripts: drop class concolor, use colorprint()
  dist/ami/files/.bash_profile: convert almost all lines to python3
  dist/common/scripts: convert node_exporter_install to python3
  dist/common/scripts: convert scylla_stop to python3
  dist/common/scripts: convert scylla_prepare to python3

(cherry picked from commit 693cf77022)
2018-07-16 09:41:50 +03:00
Asias He
30eca5f534 storage_service: Limit number of REPLICATION_FINISHED verb can retry
In the removenode operation, if the message servicing is stopped, e.g., due
to disk io error isolation, the node can keep retrying the
REPLICATION_FINISHED verb infinitely.

A Scylla log full of such messages was observed:

[shard 0] storage_service - Fail to send REPLICATION_FINISHED to $IP:0:
seastar::rpc::closed_error (connection is closed)

To fix, limit the number of retries.
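
As an illustration only, a bounded retry loop can be sketched in Python
as follows. `send_with_retries()` and its parameters are hypothetical
names, not the storage_service code, and Python's `ConnectionError`
merely stands in for seastar::rpc::closed_error.

```python
def send_with_retries(send, max_retries):
    """Call send() until it succeeds or max_retries attempts have failed."""
    last_error = None
    for _ in range(max_retries):
        try:
            return send()
        except ConnectionError as e:
            # Remember the failure and try again, up to the limit.
            last_error = e
    # Give up instead of looping forever and flooding the log.
    raise RuntimeError("giving up after {} attempts".format(max_retries)) from last_error
```

With a hard cap, a peer whose messaging service is stopped produces at
most `max_retries` log lines rather than an endless stream.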

Tests: update_cluster_layout_tests.py

Fixes #3542

Message-Id: <638d392d6b39cc2dd2b175d7f000e7fb1d474f87.1529927816.git.asias@scylladb.com>
(cherry picked from commit bb4d361cf6)
2018-07-16 09:33:56 +03:00
Piotr Sarna
cd057d3882 database: make drop_column_family wait on reads in progress
drop_column_family now waits for both writes and reads in progress.
It solves possible liveness issues with row cache, when column_family
could be dropped prematurely, before the read request was finished.

Phaser operation is passed inside database::query() call.
There are other places where reading logic is applied (e.g. view
replicas), but these are guarded with different synchronization
mechanisms, while _pending_reads_phaser applies to regular reads only.

Fixes #3357

Reported-by: Duarte Nunes <duarte@scylladb.com>
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
Message-Id: <d58a5ee10596d0d62c765ee2114ac171b6f087d2.1529928323.git.sarna@scylladb.com>
(cherry picked from commit 03753cc431)
2018-07-16 09:32:15 +03:00
Piotr Sarna
c5a5a2265e database: add phaser for reads
Currently drop_column_family waits on write_in_progress phaser,
but there's no such mechanism for reads. This commit adds
a corresponding reads phaser.

Refs #3357

Reported-by: Duarte Nunes <duarte@scylladb.com>
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
Message-Id: <70b5fdd44efbc24df61585baef024b809cabe527.1529928323.git.sarna@scylladb.com>
(cherry picked from commit e1a867cbe3)
2018-07-16 09:32:06 +03:00
Takuya ASADA
3e482c6c9d dist/common/scripts/scylla_util.py: use os.open(O_EXCL) to verify disk is unused
To simplify is_unused_disk(), just try to open the disk instead of
checking multiple block subsystems.
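
A minimal sketch of the idea, not the actual scylla_util.py code: on
Linux, opening a block device with O_EXCL (without O_CREAT) fails with
EBUSY when the device is in use by the system, e.g. mounted. The function
body below is an assumption about how such a check could look.

```python
import os


def is_unused_disk(dev):
    """Return True if dev can be opened exclusively, False otherwise."""
    try:
        # On Linux, O_EXCL on a block device fails with EBUSY if it is in use.
        fd = os.open(dev, os.O_RDONLY | os.O_EXCL)
    except OSError:
        # Covers EBUSY as well as missing paths and permission errors.
        return False
    os.close(fd)
    return True
```

Note that this sketch also reports False for paths that do not exist or
cannot be read, so the caller may want to validate the path first.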

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180709102729.30066-1-syuu@scylladb.com>
(cherry picked from commit 1a5a40e5f6)
2018-07-11 12:51:17 +03:00
Avi Kivity
5b6cadb890 Update scylla-ami submodule
* dist/ami/files/scylla-ami 67293ba...5200f3f (1):
  > Add custom script options to AMI user-data

(cherry picked from commit 7d0df2a06d)
2018-07-11 12:51:08 +03:00
Takuya ASADA
9cf8cd6c02 dist/common/scripts/scylla_util.py: strip double quote from sysconfig parameter
Currently sysconfig_parser.get() returns the parameter including its double
quotes, which causes problems when appending text using sysconfig_parser.set().
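
To make the failure mode concrete, here is a small self-contained sketch.
`get_value()` and `set_value()` are hypothetical helpers operating on the
file text, not the real sysconfig_parser methods; they only illustrate
why the quotes have to be stripped on read and re-added on write.

```python
import re


def get_value(text, key):
    """Return the value of KEY=... with surrounding double quotes stripped."""
    m = re.search(r'^{}=(.*)$'.format(re.escape(key)), text, flags=re.MULTILINE)
    if m is None:
        raise KeyError(key)
    return m.group(1).strip().strip('"')


def set_value(text, key, value):
    """Rewrite the KEY=... line, quoting the new value exactly once."""
    new_line = '{}="{}"'.format(key, value)
    # A function is used as the replacement so backslashes in value stay literal.
    return re.sub(r'^{}=.*$'.format(re.escape(key)), lambda m: new_line,
                  text, flags=re.MULTILINE)
```

Without the stripping step, appending to the returned value and writing
it back would leave a stray quote in the middle of the parameter.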

Fixes #3587

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180706172219.16859-1-syuu@scylladb.com>
(cherry picked from commit 929ba016ed)
2018-07-11 12:51:01 +03:00
Vlad Zolotarov
b34567b69b dist: scylla_lib.sh: get_mode_cpu_set: split the declaration and assignment to the local variable
In bash local variable declaration is a separate operation with its own exit status
(always 0) therefore constructs like

local var=`cmd`

will always result in the 0 exit status ($? value) regardless of the actual
result of "cmd" invocation.

To overcome this we should split the declaration and the assignment to be like this:

local var
var=`cmd`

Fixes #3508

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1529702903-24909-3-git-send-email-vladz@scylladb.com>
(cherry picked from commit 7495c8e56d)
2018-07-11 12:50:51 +03:00
Vlad Zolotarov
02b763ed97 dist: scylla_lib.sh: get_mode_cpu_set: don't let the error messages out
References #3508

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1529702903-24909-2-git-send-email-vladz@scylladb.com>
(cherry picked from commit f3ca17b1a1)
2018-07-11 12:50:43 +03:00
Takuya ASADA
05500a52d7 dist/common/scripts/scylla_sysconfig_setup: fix typo
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180705133313.16934-1-syuu@scylladb.com>
(cherry picked from commit 4df982fe07)
2018-07-11 12:50:32 +03:00
Avi Kivity
4afa558e97 Update scylla-ami submodule
* dist/ami/files/scylla-ami 0fd9d23...67293ba (1):
  > scylla_install_ami: fix broken argument parser

Fixes #3578.

(cherry picked from commit dd083122f9)
2018-07-11 12:50:24 +03:00
Takuya ASADA
f3956421f7 dist/ami: hardcode target for scylla_current_repo since we don't have --target option anymore
We broke build_ami.sh when we dropped Ubuntu support: the scylla_current_repo
command does not finish because an argument is missing ('--target' with no
distribution name, since $TARGET is always blank now).
It needs to be hardcoded as centos.

Fixes #3577

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180705035251.29160-1-syuu@scylladb.com>
(cherry picked from commit 3bcc123000)
2018-07-11 12:49:52 +03:00
Takuya ASADA
a17a6ce8f5 dist/debian/build_deb.sh: make build_deb.sh more simplified
Use is_debian()/is_ubuntu() to detect target distribution, also install
pystache by path since package name is different between Fedora and
CentOS.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180703193224.4773-1-syuu@scylladb.com>
(cherry picked from commit 3cb7ddaf68)
2018-07-11 12:49:40 +03:00
Takuya ASADA
58a362c1f2 dist/ami/files/.bash_profile: drop Ubuntu support
Drop Ubuntu support on login prompt, too.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180703192813.4589-1-syuu@scylladb.com>
(cherry picked from commit ed1d0b6839)
2018-07-11 12:49:30 +03:00
Alexys Jacob
361b2dd7a5 Support Gentoo Linux on node_health_check script.
Gentoo Linux was not supported by the node_health_check script
which resulted in the following error message displayed:

"This s a Non-Supported OS, Please Review the Support Matrix"

This patch adds support for Gentoo Linux while adding a TODO note
to add support for authenticated clusters which the script does
not support yet.

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180703124458.3788-1-ultrabug@gentoo.org>
(cherry picked from commit 8c03c1e2ce)
2018-07-11 12:49:22 +03:00
Duarte Nunes
f6a2bafae2 Merge 'Expose sharding information to connections' from Avi
"
In the same way that drivers can route requests to a coordinator that
is also a replica of the data used by the request, we can allow
drivers to route requests directly to the shard. This patchset
adds and documents a way for drivers to know which shard a connection
is connected to, and how to perform this routing.
"

* tag 'shard-info-alt/v1' of https://github.com/avikivity/scylla:
  doc: documented protocol extension for exposing sharding
  transport: expose more information about sharding via the OPTIONS/SUPPORTED messages
  dht: add i_partitioner::sharding_ignore_msb()

(cherry picked from commit 33d7de0805)
2018-07-09 17:06:30 +03:00
Avi Kivity
2ec25a55cd Update seastar submodule
* seastar d7f35d7...814a055 (1):
  > reactor: pollable_fd: limit fragment count to IOV_MAX
2018-07-09 17:05:26 +03:00
Avi Kivity
d3fb7c5515 .gitmodules: branch seastar
This allows us to backport individual patches to seastar for
branch-2.3.
2018-07-09 17:03:50 +03:00
Botond Dénes
b1ac6a36f2 tests/cql_query_test: add unit test for querying empty ranges test
A bug was found recently (#3564) in the paging logic, where the code
assumed the queried ranges list is non-empty. This assumption is
incorrect as there can be valid (if rare) queries that can result in the
ranges list being empty. Add a unit test that executes such a query with
paging enabled to detect any future bugs related to assumptions about
the ranges list being non-empty.

Refs: #3564
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f5ba308c4014c24bb392060a7e72e7521ff021fa.1530618836.git.bdenes@scylladb.com>
(cherry picked from commit c236a96d7d)
Message-Id: <af315aef64d381a7f486ba190c9a1b5bdd6f800b.1530698046.git.bdenes@scylladb.com>
2018-07-04 12:13:33 +02:00
Botond Dénes
8cba125bce query_pager: use query::is_single_partition() to check for singular range
Use query::is_single_partition() to check whether the queried ranges are
singular or not. The current method of using
`dht::partition_range::is_singular()` is incorrect, as it is possible to
build a singular range that doesn't represent a single partition.
`query::is_single_partition()` correctly checks for this so use it
instead.

Found during code-review.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f671f107e8069910a2f84b14c8d22638333d571c.1530675889.git.bdenes@scylladb.com>
(cherry picked from commit 8084ce3a8e)
2018-07-04 12:03:18 +02:00
Tomasz Grabiec
f46f9f7533 Merge "Fix atomic_cell_or_collection::external_memory_usage()" from Paweł
After the transition to the new in-memory representation in
aab6b0ee27 'Merge "Introduce new in-memory
representation for cells" from Paweł'
atomic_cell_or_collection::external_memory_usage() stopped accounting
for the externally stored data. Since it wasn't covered by the unit
tests, the bug remained unnoticed until now.

This series fixes the memory usage calculation and adds proper unit
tests.

* https://github.com/pdziepak/scylla.git fix-external-memory-usage/v1:
  tests/mutation: properly mark atomic_cells that are collection members
  imr::utils::object: expose size overhead
  data::cell: expose size overhead of external chunks
  atomic_cell: add external chunks and overheads to
    external_memory_usage()
  tests/mutation: test external_memory_usage()

(cherry picked from commit 2ffb621271)
2018-07-04 11:45:06 +02:00
Botond Dénes
090d991f8e query_pager: be prepared to _ranges being empty
do_fetch_page() checks in the beginning whether there is a saved query
state already, meaning this is not the first page. If there is not, it
checks whether the query is for a singular partition or a range scan
to decide whether to enable the stateful queries or not. This check
assumed that there is at least one range in _ranges which will not hold
under some circumstances. Add a check for _ranges being empty.

Fixes: #3564
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <cbe64473f8013967a93ef7b2104c7ca0507afac9.1530610709.git.bdenes@scylladb.com>
(cherry picked from commit 59a30f0684)
2018-07-03 18:33:25 +03:00
Avi Kivity
ae15a80d01 Merge "more scylla_setup fixes" from Takuya
"
Added NIC / Disk existence check, --force-raid mode on
scylla_raid_setup.
"

* 'scylla_setup_fix4' of https://github.com/syuu1228/scylla:
  dist/common/scripts/scylla_raid_setup: verify specified disks are unused
  dist/common/scripts/scylla_raid_setup: add --force-raid to construct raid even only one disk is specified
  dist/common/scripts/scylla_setup: don't accept disk path if it's not block device
  dist/common/scripts/scylla_raid_setup: verify specified disk paths are block device
  dist/common/scripts/scylla_sysconfig_setup: verify NIC existence

(cherry picked from commit a36b1f1967)
2018-07-03 18:33:04 +03:00
Takuya ASADA
6cf902343a scripts: merge scylla_install_pkg to scylla-ami
scylla_install_pkg was initially written for the one-liner installer, but now
it is only used for creating the AMI, and it is just a few lines of code, so it
should be merged into the scylla_install_ami script.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180612150106.26573-2-syuu@scylladb.com>
(cherry picked from commit 084c824d12)
2018-07-03 18:32:58 +03:00
Takuya ASADA
d5e59f671c dist/ami: drop Ubuntu AMI support
Drop Ubuntu AMI since it has not been maintained for a long time, and we have
no plan to officially provide it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180612150106.26573-1-syuu@scylladb.com>
(cherry picked from commit fafcacc31c)
2018-07-03 18:32:53 +03:00
Avi Kivity
38944655c5 Update scylla-ami submodule
* dist/ami/files/scylla-ami 36e8511...0fd9d23 (2):
  > scylla_install_ami: merge scylla_install_pkg
  > scylla_install_ami: drop Ubuntu AMI

(cherry picked from commit 677991f353)
2018-07-03 18:32:45 +03:00
Avi Kivity
06e274ff34 Merge "scylla_setup fixes" from Takuya
"
I found problems in the previously submitted patchsets 'scylla_setup fixes'
and 'more fixes for scylla_setup', so I fixed them and merged them into one
patchset.

Also added few more patches.
"

* 'scylla_setup_fix3' of https://github.com/syuu1228/scylla:
  dist/common/scripts/scylla_setup: allow input multiple disk paths on RAID disk prompt
  dist/common/scripts/scylla_raid_setup: skip constructing RAID0 when only one disk specified
  dist/common/scripts/scylla_raid_setup: fix module import
  dist/common/scripts/scylla_setup: check disk is used in MDRAID
  dist/common/scripts/scylla_setup: move unmasking scylla-fstrim.timer on scylla_fstrim_setup
  dist/common/scripts/scylla_setup: use print() instead of logging.error()
  dist/common/scripts/scylla_setup: implement do_verify_package() for Gentoo Linux
  dist/common/scripts/scylla_coredump_setup: run os.remove() when deleting directory is symlink
  dist/common/scripts/scylla_setup: don't include the disk on unused list when it contains partitions
  dist/common/scripts/scylla_setup: skip running rest of the check when the disk detected as used
  dist/common/scripts/scylla_setup: add a disk to selected list correctly
  dist/common/scripts/scylla_setup: fix wrong indent
  dist/common/scripts: sync instance type list for detect NIC type to latest one
  dist/common/scripts: verify systemd unit existence using 'systemctl cat'

(cherry picked from commit 0b148d0070)
2018-07-03 18:32:35 +03:00
Avi Kivity
c24d4a8acb Merge "Fix handling of stale write replies in storage_proxy" from Gleb
"
If a coordinator sends write requests with ID=X and restarts it may get a reply to
the request after it restarts and sends another request with the same ID (but to
different replicas). This condition will trigger an assert in a coordinator. Drop
the assertion in favor of a warning and initialize handler id in a way to make
this situation less likely.

Fixes: #3153
"

* 'gleb/write-handler-id' of github.com:scylladb/seastar-dev:
  storage_proxy: initialize write response id counter from wall clock value
  storage_proxy: drop virtual from signal(gms::inet_address)
  storage_proxy: do not assert on getting an unexpected write reply

(cherry picked from commit a45c3aa8c7)
2018-07-02 11:56:52 +03:00
Nadav Har'El
5f95b76c65 repair: fix combination of "-pr" and "-local" repair options
When nodetool repair is used with the combination of the "-pr" (primary
range) and "-local" (only repair with nodes in the same DC) options,
Scylla needs to define the "primary ranges" differently: Rather than
assign one node in the entire cluster to be the primary owner of every
token, we need one node in each data-center - so that a "-local"
repair will cover all the tokens.
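
The difference between the two definitions can be sketched as below. This
is a hedged illustration, not the repair code: `primary_owner()` is a
hypothetical helper, and picking the first replica in ring order as the
primary owner is an assumption made for the example.

```python
def primary_owner(replicas, local_dc=None):
    """Pick the primary owner of one token range.

    replicas: list of (node, dc) tuples in ring order for that range.
    local_dc: if given, only nodes in this data center are considered.
    """
    for node, dc in replicas:
        if local_dc is None or dc == local_dc:
            return node
    # No replica of this range lives in the requested data center.
    return None
```

With the cluster-wide definition, a range whose first replica lives in
another DC has no primary owner among the local nodes, so a "-pr -local"
repair run on every local node would skip it. The per-DC variant assigns
each such range to exactly one local node.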

Fixes #3557.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180701132445.21685-1-nyh@scylladb.com>
(cherry picked from commit 3194ce16b3)
2018-07-02 11:56:41 +03:00
Tomasz Grabiec
0bdb7e1e7c row_cache: Fix memtable reads concurrent with cache update missing writes
Introduced in 5b59df3761.

It is incorrect to erase entries from the memtable being moved to
cache if partition update can be preempted because a later memtable
read may create a snapshot in the memtable before memtable writes for
that partition are made visible through cache. As a result the read
may miss some of the writes which were in the memtable. The code was
checking for presence of snapshots when entering the partition, but
this condition may change if update is preempted. The fix is to not
allow erasing if update is preemptible.

This also caused SIGSEGVs because we were assuming that no such
snapshots will be created and hence were not invalidating iterators on
removal of the entries, which results in undefined behavior when such
snapshots are actually created.

Fixes SIGSEGV in dtest: limits_test.py:TestLimits.max_cells_test

Fixes #3532

Message-Id: <1530129009-13716-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit b464b66e90)
2018-07-01 15:36:21 +03:00
Avi Kivity
56ea4f3154 Merge "Disable sstable filtering based on min/max clustering key components" from Tomasz
"
With DateTiered and TimeWindow, there is a read optimization enabled
which excludes sstables based on overlap with recorded min/max values
of clustering key components. The problem is that it doesn't take into
account partition tombstones and static rows, which should still be
returned by the reader even if there is no overlap in the query's
clustering range. A read which returns no clustering rows can
mispopulate cache, which will appear as partition deletion or writes
to the static row being lost, until node restart or eviction of the
partition entry.

There is also a bad interaction between cache population on read and
that optimization. When the clustering range of the query doesn't
overlap with any sstable, the reader will return no partition markers
for the read, which leads cache populator to assume there is no
partition in sstables and it will cache an empty partition. This will
cause later reads of that partition to miss prior writes to that
partition until it is evicted from cache or node is restarted.

Disable until a more elaborate fix is implemented.

Fixes #3552
Fixes #3553
"

* tag 'tgrabiec/disable-min-max-sstable-filtering-v1' of github.com:tgrabiec/scylla:
  tests: Add test for slicing a mutation source with date tiered compaction strategy
  tests: Check that database conforms to mutation source
  database: Disable sstable filtering based on min/max clustering key components

(cherry picked from commit e1efda8b0c)
2018-06-27 17:01:28 +03:00
Calle Wilund
d9c178063c sstables::compress: Ensure unqualified compressor name if possible
Fixes #3546

Both older origin and scylla write "known" compressor names (i.e. those
in origin namespace) unqualified (i.e. LZ4Compressor).

This behaviour was not preserved in the virtualization change. But
probably should be.

Message-Id: <20180627110930.1619-1-calle@scylladb.com>
(cherry picked from commit 054514a47a)
2018-06-27 17:01:22 +03:00
Avi Kivity
b21b7f73b9 version: prepare for scylla 2.3-rc0 2018-06-27 14:14:19 +03:00
Avi Kivity
9a7ecdb3b9 Merge "Deglobalise cache_tracker" from Paweł
"
Cache tracker is a thread-local global object that indirectly depends on
the lifetimes of other objects. In particular, a member of
cache_tracker: mutation_cleaner may extend the lifetime of a
mutation_partition until the cleaner is destroyed. The
mutation_partition itself depends on LSA migrators which are
thread-local objects. Since there is no direct dependency between
LSA migrators and cache_tracker, it is not guaranteed that the former
won't be destroyed before the latter. The easiest (barring some unit
tests that repeat the same code several billion times) solution is to
stop using globals.

This series also improves the part of LSA sanitiser that deals with
migrators.

Fixes #3526.

Tests: unit(release)
"

* tag 'deglobalise-cache-tracker/v1-rebased' of https://github.com/pdziepak/scylla:
  mutation_cleaner: add disclaimer about mutation_partition lifetime
  lsa: enhance sanitizer for migrators
  lsa: formalise migrator id requirements
  row_cache: deglobalise row cache tracker
2018-06-26 16:38:12 +01:00
Asias He
c3b5a2ecd5 gossip: Fix tokens assignment in assassinate_endpoint
The tokens vector is defined a few lines above and is needed outside the
if block.

Do not redefine it again in the if block, otherwise the tokens will be empty.

Found by code inspection.

Fixes #3551.

Message-Id: <c7a06375c65c950e94236571127f533e5a60cbfd.1530002177.git.asias@scylladb.com>
2018-06-26 16:38:12 +01:00
Tomasz Grabiec
6d6b93d1e7 flat_mutation_reader: Move field initialization to initializer list
This works around a problem of std::terminate() being called in debug
mode build if initialization of _current throws.

Backtrace:

Thread 2 "row_cache_test_" received signal SIGABRT, Aborted.
0x00007ffff17ce9fb in raise () from /lib64/libc.so.6
(gdb) bt
  #0  0x00007ffff17ce9fb in raise () from /lib64/libc.so.6
  #1  0x00007ffff17d077d in abort () from /lib64/libc.so.6
  #2  0x00007ffff5773025 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
  #3  0x00007ffff5770c16 in ?? () from /lib64/libstdc++.so.6
  #4  0x00007ffff576fb19 in ?? () from /lib64/libstdc++.so.6
  #5  0x00007ffff5770508 in __gxx_personality_v0 () from /lib64/libstdc++.so.6
  #6  0x00007ffff3ce4ee3 in ?? () from /lib64/libgcc_s.so.1
  #7  0x00007ffff3ce570e in _Unwind_Resume () from /lib64/libgcc_s.so.1
  #8  0x0000000003633602 in reader::reader (this=0x60e0001160c0, r=...) at flat_mutation_reader.cc:214
  #9  0x0000000003655864 in std::make_unique<make_forwardable(flat_mutation_reader)::reader, flat_mutation_reader>(flat_mutation_reader &&) (__args#0=...)
    at /usr/include/c++/7/bits/unique_ptr.h:825
  #10 0x0000000003649a63 in make_flat_mutation_reader<make_forwardable(flat_mutation_reader)::reader, flat_mutation_reader>(flat_mutation_reader &&) (args#0=...)
    at flat_mutation_reader.hh:440
  #11 0x000000000363565d in make_forwardable (m=...) at flat_mutation_reader.cc:270
  #12 0x000000000303f962 in memtable::make_flat_reader (this=0x61300001d540, s=..., range=..., slice=..., pc=..., trace_state_ptr=..., fwd=..., fwd_mr=...)
    at memtable.cc:592

Message-Id: <1528792447-13336-1-git-send-email-tgrabiec@scylladb.com>
2018-06-25 20:03:23 +03:00
Avi Kivity
31eeae0126 Merge "Avoid buffer linearisation in read path" from Paweł
"
The read path on coordinator involves a lot of passing around buffers
and some occasional processing. We start with query::result obtained
from the storage_proxy which is then transformed into a
cql3::result_set, which is then used to write a response. Buffers are
copied and linearised quite excessively.

This series attempts to remedy that by using view of fragmented buffers
as much as possible. The first part deals with reading from
query::result. ser::buffer_view is introduced which enables the IDL
infrastructure to read a buffer without copying or linearising it.
The second part is switching native protocol layer to use bytes_ostream
instead of std::vector<char> to hold the generated response to the
client. The last part introduces cql3::result_generator which is an
alternative to cql3::result_set that passes buffer views without copying
or linearising anything from query::result to the native protocol layer
(or Thrift). It is only used in simple cases, when no processing at the
CQL layer is required, except for paged queries which require some
simple interpretation of the results and are supported by the result
generator.

Tests: unit(release), dtests(paging_test.py paging_additional_test.py
  cql_additional_tests.py cql_tracing_test.py cql_prepared_test.py
  cql_cast_test.py cql_tests.py)
"

* tag 'buffer-views-query-result/v2' of https://github.com/pdziepak/scylla: (34 commits)
  cql3: select_statement: use fetch_page_generator() if possible
  pager: add fetch_page_generator()
  pager: make the visitor handle_result() accepts a template parameter
  pager: make query_result_visitor base class a template parameter
  pager: make myvistor a member class of query_pager
  pager: make shared pointers to selection constant
  pager: merge query_pager and query_pagers::impl
  cql3: select_statement: use result_generator if possible
  cql3: selection: add is_trivial()
  cql3: result: support result_generator
  cql3: add lazy result_generator
  cql3: add result class
  cql3::result_set: fix encapsulation
  thrift: use cql3::result_set visiting interface
  transport: use cql3::result_set visiting interface
  cql3::result_set: add visit()
  transport: response: add write_int_placeholder()
  transport: steal response buffers and make send zero-copy
  transport: use reusable_buffer for compression
  transport: response: use bytes_ostream
  ...
2018-06-25 17:37:50 +03:00
Paweł Dziepak
bdc299cc38 mutation_cleaner: add disclaimer about mutation_partition lifetime
mutation_cleaner has already caused problems by extending the lifetime of
mutation_partition past the lifetime of the LSA migrators that it uses (due
to the fact that both the cleaner and the migrators were thread-local
globals). Since the long-term goal is to make the mutation_partition
internal representation depend more and more on the schema, that lifetime
extension may cause problems again in the future, so let's add a
disclaimer that will hopefully help avoid them.
2018-06-25 09:37:43 +01:00
Paweł Dziepak
55bf9d78a6 lsa: enhance sanitizer for migrators
The current LSA sanitizer performs only basic checks on the use of migrators,
without doing any additional reporting when an error is detected. This
patch enhances it so that when a problem is detected the relevant stack
traces get printed.
2018-06-25 09:37:43 +01:00
Paweł Dziepak
fcd9b1f821 lsa: formalise migrator id requirements
object_descriptor uses special encoding for migrator ids which assumes
that the valid ones are in a range smaller than uint32_t. Let's add some
static asserts that make this fact more visible.
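
As an illustration of the kind of compile-time check meant here, the sketch below assumes a hypothetical encoding in which the id is shifted left by one bit and the low bit carries a flag. The actual encoding used by object_descriptor is not spelled out in this message, so `max_migrator_id`, `encode_migrator_id()` and `decode_migrator_id()` are illustrative names only.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Assumed encoding (hypothetical): the low bit is a flag, the id sits above it,
// so only half of the uint32_t range is available for ids.
constexpr uint32_t max_migrator_id = std::numeric_limits<uint32_t>::max() / 2;

static_assert(uint64_t(max_migrator_id) * 2 + 1 <= std::numeric_limits<uint32_t>::max(),
              "every valid migrator id must survive encoding into 32 bits");

constexpr uint32_t encode_migrator_id(uint32_t id) {
    return id * 2 + 1;
}

constexpr uint32_t decode_migrator_id(uint32_t encoded) {
    return encoded / 2;
}

static_assert(decode_migrator_id(encode_migrator_id(max_migrator_id)) == max_migrator_id,
              "round trip must be lossless at the upper bound");
```

Stating the bound as a static_assert documents the assumption next to the code that relies on it and breaks the build if the id type or the encoding ever changes.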
2018-06-25 09:37:43 +01:00
Paweł Dziepak
96b0577343 row_cache: deglobalise row cache tracker
Row cache tracker has numerous implicit dependencies on other objects
(e.g. LSA migrators for data held by mutation_cleaner). The fact that
both cache tracker and some of those dependencies are thread local
objects makes it hard to guarantee correct destruction order.

Let's deglobalise the cache tracker and put it in the database class.
2018-06-25 09:37:43 +01:00
Paweł Dziepak
2b1fcfe019 cql3: select_statement: use fetch_page_generator() if possible 2018-06-25 09:21:47 +01:00
Paweł Dziepak
1cf3cb285f pager: add fetch_page_generator()
fetch_page_generator() is an equivalent of fetch_page(), but instead of
building a cql3::result_set it returns a cql3::result_generator().
2018-06-25 09:21:47 +01:00
Paweł Dziepak
f6fe831d49 pager: make the visitor handle_result() accepts a template parameter 2018-06-25 09:21:47 +01:00
Paweł Dziepak
fc87ca5926 pager: make query_result_visitor base class a template parameter
So far query_result_visitor was tied to result_set_builder. The goal is
to enable result_generator to work with paged queries as well so we need
to decouple them.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
dc9a65ea76 pager: make myvistor a member class of query_pager
It is going to become a class template.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
319b2cde7e pager: make shared pointers to selection constant
Shared pointers make code harder to reason about. It is not easy to get
rid of them in this piece of the code, but we can restore at least a bit
of sanity by adding consts.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
327d3de51e pager: merge query_pager and query_pagers::impl
There is just a single implementation of query_pager and there is no
reason to make anything virtual. Devirtualising this code will allow
higher layers to pass visitors via templates.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
fa5dea91e7 cql3: select_statement: use result_generator if possible 2018-06-25 09:21:47 +01:00
Paweł Dziepak
3f1184d16d cql3: selection: add is_trivial()
cql3::result_generator supports only trivial selections.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
adad31ba6b cql3: result: support result_generator
cql3::result can now hold either a result_set or a result_generator.
Some code that is not performance critical expects to get result_set so
a way of converting the result_generator to a result_set is added.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
02443d10db cql3: add lazy result_generator
result_generator is a restricted alternative to result_set. It supports
only the simplest cases, but is much cheaper as it passes data almost
directly from query::result to its visitor, bypassing much of the CQL
layer.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
dca68afce6 cql3: add result class
So far the only way of returning the result of a CQL query was to build a
result_set. An alternative lazy result generator is going to be
introduced for the simple cases when no transformations at the CQL layer are
needed. To do that we need to hide from the users the fact that there are
going to be multiple representations of CQL results.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
29cc4a4c0b cql3::result_set: fix encapsulation 2018-06-25 09:21:47 +01:00
Paweł Dziepak
8f26d9c03f thrift: use cql3::result_set visiting interface 2018-06-25 09:21:47 +01:00
Paweł Dziepak
54d5dc414d transport: use cql3::result_set visiting interface 2018-06-25 09:21:47 +01:00
Paweł Dziepak
2e4234ab63 cql3::result_set: add visit()
This visiting interface for result_set satisfies most of its users (at
least all of those which are in the hot path). It will allow having an
alternative to result_set (i.e. a lazy result generator) which would
provide exactly the same interface.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
c0e7160625 transport: response: add write_int_placeholder()
This allows the response writer to defer writing integers until later
time. It will be used by lazy response generator which will know the
number of rows in the response only after they are all written.
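
A minimal sketch of the placeholder idea follows. It is not the transport code: `response_sketch`, `patch_int()` and `build_rows_response()` are hypothetical names, a contiguous std::string stands in for the real response buffer, and integers are written big-endian on the assumption that this matches the wire format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Minimal response writer; a std::string stands in for the real buffer type.
class response_sketch {
    std::string _body;
public:
    // Handle to four reserved bytes that are filled in later.
    class int_placeholder {
        std::size_t _offset;
        friend class response_sketch;
        explicit int_placeholder(std::size_t offset) : _offset(offset) {}
    };

    // Reserve space for a 32-bit integer whose value is not known yet.
    int_placeholder write_int_placeholder() {
        int_placeholder p(_body.size());
        _body.append(4, '\0');
        return p;
    }

    // Fill a previously reserved slot, most significant byte first.
    void patch_int(int_placeholder p, int32_t value) {
        auto v = static_cast<uint32_t>(value);
        for (std::size_t i = 0; i < 4; ++i) {
            _body[p._offset + i] = static_cast<char>((v >> (24 - 8 * i)) & 0xff);
        }
    }

    void write_row(const std::string& row) { _body += row; }
    const std::string& body() const { return _body; }
};

// Writes the row count slot first, then the rows, then patches the count.
inline std::string build_rows_response(const std::vector<std::string>& rows) {
    response_sketch r;
    auto count = r.write_int_placeholder();
    int32_t written = 0;
    for (const auto& row : rows) {
        r.write_row(row);
        ++written;
    }
    r.patch_int(count, written);
    return r.body();
}
```

The point of the design is that the producer can stream rows without counting them up front; only the four reserved bytes are revisited once the total is known.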
2018-06-25 09:21:47 +01:00
Paweł Dziepak
88aff8eda8 transport: steal response buffers and make send zero-copy
Each response is sent only once, so we can safely steal its buffers and
pass them to the output_stream using the zero-copy interface.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
821e6683e3 transport: use reusable_buffer for compression
Compression algorithms require us to linearise bytes_ostream. This may
cause an excessive number of large allocations. Using reusable_buffers
can avoid that.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
a7c4d407ce transport: response: use bytes_ostream
std::vector<char> is not a very good container for incrementally
building a response. It may cause excessive copies and allocations. If
the response is large it will put more pressure on the memory allocator
by requiring the buffer to be contiguous.

We already have bytes_ostream which avoids all of these problems, so
let's use it.
2018-06-25 09:22:43 +01:00
Paweł Dziepak
c04d38b76b transport: drop response::make_message() 2018-06-25 09:22:35 +01:00
Paweł Dziepak
444acf49af transport: use std::unique_ptr for the response
So far cql_server::response was passed around using shared pointers.
They come at the considerable cost of making the code hard to reason
about. None of that is necessary and we can easily switch to the much
more sensible std::unique_ptr.
2018-06-25 09:22:24 +01:00
Paweł Dziepak
12f89299b2 transport: move response to a separate header
There are some other translation units which right now are satisfied
with the response being an incomplete type. This means that
std::unique_ptr can't be used for it. Let's move the class declaration
to a header that can be included where needed.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
3b9ba30497 tests: add test for reusable buffers 2018-06-25 09:21:47 +01:00
Paweł Dziepak
b4c5e1a6d4 utils: add reusable_buffer
This commit adds a helper class reusable_buffer which can be used to
avoid excessive memory allocations of large buffers when bytes_ostream
needs to be linearised. The idea is that reusable_buffer in most cases
is going to be thread local so that multiple continuation chains can
reuse the same large buffer.
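
The idea can be sketched as follows. This is an illustration under simplifying assumptions, not the utils class itself: `reusable_buffer_sketch`, `linearise()` and `local_buffer()` are hypothetical names, and a vector of strings stands in for the fragments of a bytes_ostream.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Keeps one growable buffer alive across calls so that linearising a
// fragmented stream does not allocate a fresh large buffer every time.
class reusable_buffer_sketch {
    std::vector<char> _storage;
public:
    // Copies all fragments into the retained storage and returns a
    // (pointer, length) view of it. The view is valid until the next call.
    std::pair<const char*, std::size_t> linearise(const std::vector<std::string>& fragments) {
        std::size_t total = 0;
        for (const auto& f : fragments) {
            total += f.size();
        }
        if (_storage.size() < total) {
            _storage.resize(total); // grows only when a larger input arrives
        }
        std::size_t pos = 0;
        for (const auto& f : fragments) {
            std::copy(f.begin(), f.end(), _storage.begin() + pos);
            pos += f.size();
        }
        return {_storage.data(), total};
    }

    std::size_t capacity() const { return _storage.size(); }
};

// One instance per thread, shared by everything running on that thread.
inline reusable_buffer_sketch& local_buffer() {
    thread_local reusable_buffer_sketch buf;
    return buf;
}
```

Because the returned view aliases the shared storage, a caller has to finish with it before anything else on the same thread linearises again.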
2018-06-25 09:21:47 +01:00
Paweł Dziepak
8feab33cf4 query::result: use std::optional instead of experimental version 2018-06-25 09:21:47 +01:00
Paweł Dziepak
9d140488bd tests/perf: add performance test for IDL 2018-06-25 09:21:47 +01:00
Paweł Dziepak
4704c4efab query::result: avoid copying and linearising cell value
query::result_view already operates on views of a serialised
query::result. However, until now the value of a cell was always
linearised and copied. This patch makes use of ser::buffer_view to avoid
that.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
982f71a804 query::result_view: add concept 2018-06-25 09:21:47 +01:00
Paweł Dziepak
2914f64b2d serializer: use buffer_view in bytes deserialiser 2018-06-25 09:21:47 +01:00
Paweł Dziepak
19caf709e1 serializer: add view of a fragmented stream
ser::buffer_view is a view of a fragmented buffer in a stream of
IDL-serialised data. It can be used to deserialise IDL objects without
needless copying and linearisation of large blobs.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
fe8dc1fa5c bytes_ostream: add remove_suffix() 2018-06-25 09:21:47 +01:00
Paweł Dziepak
969219d5bc tests/random-utils: add missing include 2018-06-25 09:21:47 +01:00
Paweł Dziepak
a85197a7b5 bytes_ostream: make fragment_iterator default constructible 2018-06-25 09:21:47 +01:00
Piotr Sarna
828497ad19 hints: amend a comment in device limits
To make the comment less confusing, 'group of managers'
is used instead of 'device'.

Refs #3516

Reported-by: Vlad Zolotarov <vladz@scylladb.com>
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
Message-Id: <60c9ab6b47195570f7ce7dff9556e3739b7ae00f.1529862547.git.sarna@scylladb.com>
2018-06-24 19:14:59 +01:00
Avi Kivity
48dc875e49 Merge "convert setup scripts to python3" from Takuya
"
Converted all setup scripts from bash to python3.
"

* 'scripts_python_conversion_v1' of https://github.com/syuu1228/scylla:
  dist/common/scripts: convert scylla_kernel_check to python3
  dist/common/scripts: convert scylla_ec2_check to python3
  dist/common/scripts: convert scylla_sysconfig_setup to python3
  dist/common/scripts: convert scylla_setup to python3
  dist/common/scripts: convert scylla_selinux_setup to python3
  dist/common/scripts: convert scylla_raid_setup to python3
  dist/common/scripts: convert scylla_ntp_setup to python3
  dist/common/scripts: convert scylla_fstrim_setup to python3
  dist/common/scripts: convert scylla_dev_mode_setup to python3
  dist/common/scripts: convert scylla_cpuset_setup to python3
  dist/common/scripts: convert scylla_cpuscaling_setup to python3
  dist/common/scripts: convert scylla_coredump_setup to python3
  dist/common/scripts: convert scylla_bootparam_setup to python3
  dist/common/scripts: extend scylla_util.py to convert setup scripts to python3
  dist/common/scripts: convert scylla_io_setup and scylla_util.py to python3
2018-06-24 15:02:08 +03:00
Avi Kivity
40dbdae24e Update seastar submodule
> Merge "Allow creating views from simple streams" from Paweł
  > IOTune: allow duration to be configurable and change its defaults
2018-06-24 14:54:46 +03:00
Avi Kivity
cb549c767a database: rename column_family to table
The name "column_family" is both awkward and obsolete. Rename to
the modern and accurate "table".

An alias is kept to avoid huge code churn.

To prevent a One Definition Rule violation, a preexisting "table"
type is moved to a new namespace row_cache_stress_test.

Tests: unit (release)
Message-Id: <20180624065238.26481-1-avi@scylladb.com>
2018-06-24 14:54:46 +03:00
Tomasz Grabiec
2d4177355a Merge "Support for writing range tombstones to SSTables 3.x" from Vladimir
This patchset brings support for writing range tombstones to SSTables
3.x. ('mc' format).

In SSTables 3.x, range tombstones are represented by so-called range
tombstone markers (hereafter RT markers) that denote range tombstone
start and end bounds. So each range tombstone is represented in data
file by two ordered RT markers.
There are also markers that both close the previous range tombstone and
open the new one when two range tombstones are adjacent. This is
done to consume less disk space on such occasions.
Range tombstones written as RT markers are naturally non-overlapping.
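
The mapping from range tombstones to RT markers described above can be sketched like this. The model is deliberately simplified and is not the sstable writer: clustering positions are plain ints, timestamps are omitted, and all type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified model: a bound is a clustering position plus inclusiveness.
struct bound { int key; bool inclusive; };
struct range_tombstone_sketch { bound start; bound end; };

enum class marker_kind { start, end, boundary };
struct rt_marker { marker_kind kind; int key; };

// Two tombstones are adjacent when one ends exactly where the next begins
// and exactly one of the two bounds is inclusive, so that no position is
// covered twice and none is skipped.
inline bool adjacent(const range_tombstone_sketch& a, const range_tombstone_sketch& b) {
    return a.end.key == b.start.key && a.end.inclusive != b.start.inclusive;
}

// Input must be sorted and non-overlapping.
inline std::vector<rt_marker> to_markers(const std::vector<range_tombstone_sketch>& rts) {
    std::vector<rt_marker> out;
    for (std::size_t i = 0; i < rts.size(); ++i) {
        // The start bound was already emitted as a boundary marker if the
        // previous tombstone is adjacent to this one.
        bool joined_to_prev = i > 0 && adjacent(rts[i - 1], rts[i]);
        if (!joined_to_prev) {
            out.push_back({marker_kind::start, rts[i].start.key});
        }
        bool joined_to_next = i + 1 < rts.size() && adjacent(rts[i], rts[i + 1]);
        out.push_back({joined_to_next ? marker_kind::boundary : marker_kind::end,
                       rts[i].end.key});
    }
    return out;
}
```

Two adjacent tombstones therefore produce three markers instead of four, which is where the disk space saving comes from.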

* github.com:argenet/scylla projects/sstables-30/write-range-tombstones/v6
range_tombstone_stream: Remove an unused boolean flag.
Revert "Add missing enum values to bound_kind."
sstables: Move to_deletion_time helper up and make it static.
sstables: Write end-of-partition byte before flushing the last index
block.
sstables: Add support for writing range tombstones in SSTables 3.x
format.
tests: Add unit test covering simple range tombstone.
tests: Add unit test covering adjacent range tombstones.
tests: Add test to cover non-adjacent RTs.
tests: Add test covering mixed rows and range tombstones.
tests: Add test covering SSTables 3.x with many RTs.
tests: Add unit test covering overlapping RTs and rows.
tests: Add tests writing a range tombstone and a row overlapping with
its start.
tests: Add tests writing a range tombstone and a row overlapping with
its end.
tests: Add function that writes from multiple memtable into SSTables.
tests: Add test where 2nd range tombstone covers the remainder of the
1st one.
tests: Add test writing two non-adjacent range tombstones with same
clustering key prefix at their bounds.
tests: Add test covering overlapped range tombstones.
2018-06-22 15:47:18 +02:00
Tomasz Grabiec
f09fff090a Merge 'Enhance space watchdog' from Piotr Sarna
"
This series addresses issue #3516 and enhances space watchdog to make it
device-aware. This is needed because, since the last MV-related changes, space
watchdog can be responsible for multiple hints managers, which means
multiple directories, which may mean multiple devices.
Hence, having a single static space size limit is no longer enough,
and the watchdog should take into account that different managers
may work on different disks, while other managers may share
the same device.

Tests: unit (release)
"

* 'enhance_space_watchdog_4' of https://github.com/psarna/scylla:
hints: reserve more space for dedicated storage
hints: add is_mountpoint function
hints: make space_watchdog device-aware
hints: add device_id to manager
hints: add get_device_id function
2018-06-22 15:45:47 +02:00
Piotr Sarna
8b43ac3a57 hints: reserve more space for dedicated storage
Reserving 10% of space for hints managers makes sense if the device
is shared with other components (like /data or /commitlog).
But, if hints directory is mounted on a dedicated storage, it makes
sense to reserve much more - 90% was chosen as a sane limit.
Whether storage is 'dedicated' or not is based on a simple check
if given hints directory is a mount point.
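
A minimal sketch of the policy described above, for POSIX systems. How the real check is implemented is not stated in this message, so the device-and-inode comparison in `is_mountpoint_sketch()` is an assumption, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <sys/stat.h>

// Assumption: a directory is treated as a mount point if it lives on a
// different device than its parent, or if it is the filesystem root
// (where the directory and its parent are the same inode).
inline bool is_mountpoint_sketch(const std::string& path) {
    struct stat self{};
    struct stat parent{};
    if (::stat(path.c_str(), &self) != 0 ||
        ::stat((path + "/..").c_str(), &parent) != 0) {
        return false; // treat unreadable paths as not dedicated
    }
    return self.st_dev != parent.st_dev || self.st_ino == parent.st_ino;
}

// 90% of the device for dedicated storage, 10% when the device is shared.
inline uint64_t max_hints_bytes(uint64_t device_capacity, bool dedicated) {
    return device_capacity / 10 * (dedicated ? 9 : 1);
}
```

A caller would combine the two as `max_hints_bytes(capacity, is_mountpoint_sketch(hints_dir))`, where `capacity` and `hints_dir` are whatever the deployment provides.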

Fixes #3516

Signed-off-by: Piotr Sarna <sarna@scylladb.com>
2018-06-22 10:27:00 +02:00
Piotr Sarna
32f86ca61e hints: add is_mountpoint function
A helper function that checks whether a path is also a mount point
is added.

Signed-off-by: Piotr Sarna <sarna@scylladb.com>
2018-06-22 10:26:52 +02:00
Piotr Sarna
b6c1b8c5ef hints: make space_watchdog device-aware
Instead of having one static space limit for all directories,
space_watchdog now keeps a per-device limit, shared among
hints managers residing on the same disks.
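
The per-device accounting can be sketched as below. This is an illustration rather than the space_watchdog code: `manager_usage` and `devices_over_limit()` are hypothetical names, and the device id is assumed to be whatever value identifies the disk for each manager.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// One entry per hints manager: the device it lives on and the space it uses.
struct manager_usage { uint64_t device_id; uint64_t hints_bytes; };

// Returns the ids of devices whose managers together exceed that device's
// limit. Managers on the same device share a single budget.
inline std::set<uint64_t> devices_over_limit(const std::vector<manager_usage>& managers,
                                             const std::map<uint64_t, uint64_t>& limit_per_device) {
    std::map<uint64_t, uint64_t> used;
    for (const auto& m : managers) {
        used[m.device_id] += m.hints_bytes;
    }
    std::set<uint64_t> over;
    for (const auto& kv : used) {
        auto it = limit_per_device.find(kv.first);
        if (it != limit_per_device.end() && kv.second > it->second) {
            over.insert(kv.first);
        }
    }
    return over;
}
```

With a single static limit, two managers that each stay under the limit could still fill a shared disk; summing per device catches that case.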

References #3516

Signed-off-by: Piotr Sarna <sarna@scylladb.com>
2018-06-22 10:26:45 +02:00
Piotr Sarna
d22668de04 hints: add device_id to manager
In order to make space_watchdog device-aware, device_id field
is added to hints manager. It's an equivalent of stat.st_dev
and it identifies the disk that contains manager's root directory.

Signed-off-by: Piotr Sarna <sarna@scylladb.com>
2018-06-22 10:26:37 +02:00
Piotr Sarna
91b5e33c6a hints: add get_device_id function
In order to distinguish which directories reside on which devices,
get_device_id function is added to resource manager.

Signed-off-by: Piotr Sarna <sarna@scylladb.com>
2018-06-22 10:25:47 +02:00
Takuya ASADA
ca52407fd6 dist/common/scripts: convert scylla_kernel_check to python3
Convert bash script to python3.
2018-06-22 12:31:12 +09:00
Takuya ASADA
5efbb714ff dist/common/scripts: convert scylla_ec2_check to python3
Convert bash script to python3.
2018-06-22 12:30:59 +09:00
Takuya ASADA
d0b9464dc7 dist/common/scripts: convert scylla_sysconfig_setup to python3
Convert bash script to python3.
2018-06-22 12:30:49 +09:00
Takuya ASADA
d3a3d0f8de dist/common/scripts: convert scylla_setup to python3
Convert bash script to python3.
2018-06-22 12:30:37 +09:00
Takuya ASADA
8030e89725 dist/common/scripts: convert scylla_selinux_setup to python3
Convert bash script to python3.
2018-06-22 12:30:29 +09:00
Takuya ASADA
8cfc4f1c3d dist/common/scripts: convert scylla_raid_setup to python3
Convert bash script to python3.
2018-06-22 12:30:19 +09:00
Takuya ASADA
63a287b7d4 dist/common/scripts: convert scylla_ntp_setup to python3
Convert bash script to python3.
2018-06-22 12:30:10 +09:00
Takuya ASADA
01eea76a4e dist/common/scripts: convert scylla_fstrim_setup to python3
Convert bash script to python3.
2018-06-22 12:29:56 +09:00
Takuya ASADA
5e07567c60 dist/common/scripts: convert scylla_dev_mode_setup to python3
Convert bash script to python3.
2018-06-22 12:29:44 +09:00
Takuya ASADA
ccc6dbf6c7 dist/common/scripts: convert scylla_cpuset_setup to python3
Convert bash script to python3.
2018-06-22 12:29:25 +09:00
Takuya ASADA
7fd81510a4 dist/common/scripts: convert scylla_cpuscaling_setup to python3
Convert bash script to python3.
2018-06-22 12:29:04 +09:00
Takuya ASADA
e858674a79 dist/common/scripts: convert scylla_coredump_setup to python3
Convert bash script to python3.
2018-06-22 12:28:50 +09:00
Takuya ASADA
b3ee02dd1e dist/common/scripts: convert scylla_bootparam_setup to python3
Convert bash script to python3.
2018-06-22 12:27:56 +09:00
Takuya ASADA
2a4ba883c8 dist/common/scripts: extend scylla_util.py to convert setup scripts to python3
To port the setup scripts to python3, the following utility functions/classes
are introduced:
 - run(): execute command line, returns return code
 - out(): execute command line, returns stdout as string
 - is_debian_variant() / is_redhat_variant() / is_gentoo_variant()
 / is_ec2() / is_systemd(): detect specific environment
 - hex2list(): implement hex2list.py code as a function
 - makedirs(): same as os.makedirs() but does nothing when the dir exists
 - dist_name() / dist_ver(): alias of platform.dist()
 - class systemd_unit: a utility to control a systemd unit using systemctl
 - class sysconfig_parser: reader/writer of /etc/sysconfig files
 - class concolor: ANSI color escape sequences list
2018-06-22 12:21:37 +09:00
Takuya ASADA
b7c980ac56 dist/common/scripts: convert scylla_io_setup and scylla_util.py to python3
To share scylla_util.py with python3 converted setup scripts, these
scripts need to be python3 too.
2018-06-22 12:11:27 +09:00
Glauber Costa
7f6b6fa129 github: direct users asking questions to our mailing list.
Very often people use the issue tracker just to ask questions. We have
been telling them to close the bug and move the discussion somewhere
else, but it would be better if people were directed to the right
place before they even get it wrong.

This would be easier for everybody.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180621135051.3254-1-glauber@scylladb.com>
2018-06-21 17:43:23 +03:00
Tomasz Grabiec
0f380f24c3 Update seastar submodule
* seastar 3c60b82...7aca670 (2):
  > Merge "Log stack trace during exception" from Gleb
  > shared_ptr: Introduce lw_shared_ptr::dispose() for convenience
2018-06-21 12:19:33 +02:00
Vladimir Krivopalov
ea09cf732d tests: Add test covering overlapped range tombstones.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
5df3cd1787 tests: Add test writing two non-adjacent range tombstones with same clustering key prefix at their bounds.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
35b90b2d1e tests: Add test where 2nd range tombstone covers the remainder of the 1st one.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
2f277c29e8 tests: Add function that writes from multiple memtable into SSTables.
This comes in handy when we want to test overlapping range tombstones
because memtable would otherwise de-overlap them internally.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
41d283fe83 tests: Add tests writing a range tombstone and a row overlapping with its end.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
ff53f601e4 tests: Add tests writing a range tombstone and a row overlapping with its start.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
f552f30d57 tests: Add unit test covering overlapping RTs and rows.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
27e053f933 tests: Add test covering SSTables 3.x with many RTs.
This test checks the validity of the promoted index generated for an
RT-only data file.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
aa4a011eb3 tests: Add test covering mixed rows and range tombstones.
Tests three cases:
 - a row lying inside a range tombstone
 - a row that has the same clustering key as range tombstone start
 - a row that has the same clustering key as range tombstone end

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
492a401855 tests: Add test to cover non-adjacent RTs.
These are two RTs where one's RT end clustering is the same as another
one's RT start bound but they are both exclusive.

In this case those bounds should not (and cannot) be merged into a
single RT boundary when writing RT markers.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
3a96226492 tests: Add unit test covering adjacent range tombstones.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
b3e7982fec tests: Add unit test covering simple range tombstone.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
5559fc2121 sstables: Add support for writing range tombstones in SSTables 3.x format.
For SSTables 3.x. ('mc' format), range tombstones are represented by
their bounds that are written to the data file as so-called RT markers.
For adjacent range tombstones, an RT marker can be of a 'boundary' type
which means it closes the previous range tombstone and opens the new
one.

Internally, sstable_writer_m relies on range_tombstone_stream to both
de-overlap incoming range tombstones and order them so that when they
are drained they can be easily thought of as just pairs of their bounds.
2018-06-20 18:08:36 -07:00
Noam Hasson
6572917fda docker: added support for authenticator & authorizer command arguments
By default Scylla docker runs without the security features.
This patch lets the user supply different parameter values for the
authenticator and authorizer classes, making it possible to set up a secure
Scylla in Docker.
For example if you want to run a secure Scylla with password and authorization:
docker run --name some-scylla -d scylladb/scylla --authenticator
PasswordAuthenticator --authorizer CassandraAuthorizer

Update the Docker documentation with the new command line options.

Signed-off-by: Noam Hasson <noam@scylladb.com>
Message-Id: <20180620122340.30394-1-noam@scylladb.com>
2018-06-20 20:33:59 +03:00
Gleb Natapov
f53ae2d07f storage_service: avoid "ignored future" message during schema check failure
Message-Id: <20180620134402.GQ1918@scylladb.com>
2018-06-20 18:53:47 +03:00
Takuya ASADA
6acb2add4a dist/ami: show unsupported instance type message even scylla_ami_setup is still running
The current .bash_profile prints "Constructing RAID volume..." while
scylla_ami_setup is still running, even when it is running on an
unsupported instance type.

To avoid that, run the instance type check first, then run the rest of
the script.

Fixes #2739

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180613111539.30517-1-syuu@scylladb.com>
2018-06-20 16:49:15 +03:00
Takuya ASADA
4151120752 dist/debian: change owner of build/debs/ to current user
Currently build/debs/ is owned by the root user, since pbuilder must
run as root.
So chown it to the current user after the build finishes.

Fixes #3447

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180613093213.28827-1-syuu@scylladb.com>
2018-06-20 16:48:17 +03:00
Tomasz Grabiec
4523706312 gdb: Adjust for removal of the 'fsu' field from cpu_mem
Message-Id: <1529497459-15287-1-git-send-email-tgrabiec@scylladb.com>
2018-06-20 15:27:35 +03:00
Avi Kivity
b97e1aeff5 Merge "Consume row marker correctly" from Piotr
"
Make sure we properly handle row marker and row tombstone
when reading a row.

Tests: unit {release}
"

* 'haaawk/sstables3/read-liveness-info-v4' of ssh://github.com/scylladb/seastar-dev:
  sstable: consume row marker in data_consume_rows_context_m
  sstable: Add consumer_m::consume_row_marker_and_tombstone
  sstable: add is_set and to_row_marker to liveness_info
2018-06-20 14:44:03 +03:00
Piotr Jastrzebski
75edaff7b6 sstable: consume row marker in data_consume_rows_context_m
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-20 13:13:29 +02:00
Piotr Jastrzebski
cbfc741d70 sstable: Add consumer_m::consume_row_marker_and_tombstone
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-20 13:13:16 +02:00
Tomasz Grabiec
5548eb96f7 Merge "store prepared statements parameters values" from Vlad
* https://github.com/vladzcloudius/scylla.git tracing_prepared_parameters-v6:
  cql3::query_options: add get_names() method
  tracing::trace_state: hide the internals of params_values
  tracing: store queries statements for BATCH
  tracing: store the prepared statements parameters values
2018-06-19 19:12:26 +02:00
Avi Kivity
f912eefbe2 Merge "Fix numerous issues in AMI related scriptology" from Vlad
"
A few fixes in scripts that were found when debugging #3508.
This series fixed this issue.

"

Fixes #3508

* 'ami_scripts_fixes-v1' of https://github.com/vladzcloudius/scylla:
  scylla_io_setup: properly define the disk_properties YAML hierarchy
  scylla_io_setup: fix a typo: s/write_bandwdith/write_bandwidth/
  scylla_io_setup: hardcode the "mountpoint" YAML node to "/var/lib/scylla" for AMIs
  scylla_io_setup: print the io_properties.yaml file name and not its handle info
  scylla_lib.sh: tolerate perftune.py errors
2018-06-19 19:31:23 +03:00
Avi Kivity
e0eb66af6b Merge "Do not allow compaction controller shares to grow indefinitely" from Glauber
"
We are seeing some workloads with large datasets where the compaction
controller ends up with a lot of shares. Regardless of whether or not
we'll change the algorithm, this patchset handles a more basic issue,
which is the fact that the current controller doesn't set a maximum
explicitly, so if the input is larger than the maximum it will keep
growing without bounds.

It also pushes the maximum input point of the compaction controller from
10 to 30, allowing us to err on the side of caution for the 2.2 release.
"

* 'tame-controller' of github.com:glommer/scylla:
  controller: do not increase shares of controllers for inputs higher than the maximum
  controller: adjust constants for compaction controller
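
The clamping this series introduces can be illustrated with a piecewise-linear controller. This is a sketch under assumptions, not the controller code: `control_point` and `shares_for()` are hypothetical names, and the share values used with it are made up; only the maximum input of 30 comes from the cover letter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A point on the controller curve: input value and the shares it maps to.
struct control_point { float input; float output; };

// Points must be non-empty and sorted by input. Inputs outside the
// configured range are clamped to the nearest end of the curve instead of
// being extrapolated, so shares cannot grow without bound.
inline float shares_for(const std::vector<control_point>& points, float input) {
    if (input <= points.front().input) {
        return points.front().output;
    }
    if (input >= points.back().input) {
        return points.back().output; // the explicit maximum
    }
    for (std::size_t i = 1; i < points.size(); ++i) {
        if (input <= points[i].input) {
            const auto& lo = points[i - 1];
            const auto& hi = points[i];
            float t = (input - lo.input) / (hi.input - lo.input);
            return lo.output + t * (hi.output - lo.output);
        }
    }
    return points.back().output; // not reached for sorted points
}
```

Without the second check, an input beyond the last control point would continue along the slope of the final segment, which is the unbounded growth the cover letter describes.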
2018-06-19 18:49:02 +03:00
Avi Kivity
b6b5647836 Merge "Fix querier-cache related issues" from Botond
"
This mini series fixes some querier-cache related issues discovered
while working on stateful range-scans.
1) A problem in the memory based cache eviction test that is as yet
   unexposed (#3529).
2) Possible usage of invalidated iterators in querier_cache (#3424).
3) lookup() possibly returning a querier with the wrong read range
   (#3530).

Tests: unit(release)
"

* 'fix-querier-cache-invalid-iterators-master' of https://github.com/denesb/scylla:
  querier: find_querier(): return end() when no querier matches the range
  querier_cache: restructure entries storage
  tests/querier_cache: fix memory based eviction test
2018-06-19 16:29:03 +03:00
Paweł Dziepak
e55034a33e cql3: batch_statement: use external_memory_usage() to get mutation size
batch_statement::verify_batch_size() verifies that the total size of
mutations generated by the batch statement is smaller than certain
configurable thresholds. This is done by a custom mutation_partition
visitor, which violates atomic_cell_view::value() preconditions by
calling it even for dead cells.

The simplest solution is to use
mutation_partition::external_memory_usage() instead.

Message-Id: <20180619131405.12601-1-pdziepak@scylladb.com>
2018-06-19 16:26:52 +03:00
Duarte Nunes
d3e24076b0 tests/cell_locker_test: Prevent timeout underflow
Timeout underflow causes the test to hang, due to a seastar bug
with negative time_points.


Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180619091635.34228-1-duarte@scylladb.com>
2018-06-19 16:26:52 +03:00
Duarte Nunes
ee4b3c4c2d database: Await pending writes before truncating CF on drop
When dropping a table, wait for the column family to quiesce so that
no pending writes compete with the truncate operation, possibly
allowing data to be left on disk.

Fixes #2562

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180618193134.31971-1-duarte@scylladb.com>
2018-06-19 16:26:52 +03:00
Botond Dénes
9490b8935c .gitignore: add resources directory
This directory is necessary when running dtests against a scylla
repository.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <8d9ad8dae2b9d2ec3cc6c9c4d6527fba8ce91272.1529387008.git.bdenes@scylladb.com>
2018-06-19 16:26:51 +03:00
Piotr Sarna
61e3ee6c3c cql3: fix supernumerary column on view update
Patch f39891a999 fixed #3443,
but also introduced a regression in dtest - new column
was unconditionally added to view during ALTER TABLE ADD,
while it should only be the case for "include all columns" views.
This patch fixes the regression (spotted by query_new_column_test).

References #3443
Message-Id: <7410d965255a514d78cf0ce941a3236b9d8ddbbd.1529399135.git.sarna@scylladb.com>
2018-06-19 16:26:51 +03:00
Botond Dénes
2609a17a23 querier: find_querier(): return end() when no querier matches the range
When none of the queriers found for the lookup key match the lookup
range `_entries.end()` should be returned as the search failed. Instead
the iterator returned from the failed `std::find_if()` is returned
which, if the find failed, will be the end iterator returned by the
previous call to `_entries.equal_range()`. This is incorrect because as
long as `equal_range()`'s end iterator is not also `_entries.end()` the
search will always return an iterator to a querier regardless of whether
any of them actually matches the read range.
Fix by returning `_entries.end()` when it is detected that no queriers
match the range.
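
The fix described above can be sketched as follows. This is only an
illustration: `index`, `range` and `find_entry()` are hypothetical
stand-ins (a `std::multimap` of integer ranges), not the querier_cache
types, and the sketch assumes C++17.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-ins: the real cache indexes querier entries.
using range = std::pair<int, int>;
using index = std::multimap<std::string, range>;

// Returns an iterator to the entry stored under `key` whose range equals
// `wanted`, or entries.end() if no such entry exists.
index::const_iterator find_entry(const index& entries, const std::string& key,
                                 const range& wanted) {
    auto [first, last] = entries.equal_range(key);
    auto it = std::find_if(first, last, [&](const index::value_type& e) {
        return e.second == wanted;
    });
    // A failed find_if() returns `last`, which equals entries.end() only
    // when no larger key exists. Map the failure to end() explicitly.
    return it == last ? entries.end() : it;
}
```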

Fixes: #3530
2018-06-19 13:20:43 +03:00
Botond Dénes
7ce7f3f0cc querier_cache: restructure entries storage
Currently querier_cache uses a `std::unordered_map<utils::UUID, querier>`
to store cache entries and an `std::list<meta_entry>` to store meta
information about the querier entries, like insertion order, expiry
time, etc.

All cache eviction algorithms use the meta-entry list to evict entries
in reverse insertion order (LRU order). To make this possible
meta-entries keep an iterator into the entry map so that given a
meta-entry one can easily erase the querier entry. This however poses a
problem as std::unordered_map can possibly invalidate all its iterators
when new items are inserted. This is use-after-free waiting to happen.

Another disadvantage of the current solution is that it requires the
meta-entry to use a weak pointer to the querier entry so that in case
that is removed (as a result of a successful lookup) it doesn't try to
access it. This has an impact on all cache eviction algorithms as they
have to be prepared to deal with stale meta-entries. Stale meta-entries
also unnecessarily consume memory.

To solve these problems redesign how querier_cache stores entries
completely. Instead of storing the entries in an `std::unordered_map`
and storing the meta-entries in an `std::list`, store the entries in an
`std::list` and an intrusive-map (index) for lookups. This new design
has several advantages over the old one:
* The entries will now be in insert order, so eviction strategies can
  work on the entry list itself, no need to involve additional data
  structures for this.
* All data related to an entry is stored in one place, no data
  duplication.
* Removing an entry automatically removes it from the index as intrusive
  containers support auto unlink. This means there is no need to store
  iterators for long terms, risking use-after-free when the container
  invalidates its iterators.
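
A rough sketch of the list-plus-index layout, under simplifying
assumptions: `entry_cache` and its members are hypothetical, keys are
unique, and a `std::unordered_map` of list iterators stands in for the
intrusive index. Unlike an intrusive container with auto unlink, this
stand-in has to erase from the index by hand, but it is safe because
`std::list` iterators stay valid until their own element is erased.

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

class entry_cache {
    struct entry {
        int key;
        std::string value;
    };
    // Entries live in insertion order; the oldest one is at the front.
    std::list<entry> _entries;
    // Lookup index: key -> position in the entry list.
    std::unordered_map<int, std::list<entry>::iterator> _index;

    // Removes an entry from both the index and the list.
    void erase(std::list<entry>::iterator it) {
        _index.erase(it->key);
        _entries.erase(it);
    }
public:
    void insert(int key, std::string value) {
        if (auto found = _index.find(key); found != _index.end()) {
            erase(found->second);
        }
        _entries.push_back(entry{key, std::move(value)});
        _index[key] = std::prev(_entries.end());
    }
    // A successful lookup hands the value out and removes the entry.
    bool lookup(int key, std::string& out) {
        auto found = _index.find(key);
        if (found == _index.end()) {
            return false;
        }
        out = std::move(found->second->value);
        erase(found->second);
        return true;
    }
    // Eviction works on the entry list itself, oldest entry first.
    bool evict_oldest() {
        if (_entries.empty()) {
            return false;
        }
        erase(_entries.begin());
        return true;
    }
    std::size_t size() const { return _entries.size(); }
};
```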

Additional changes:
* Modify eviction strategies so that they work with the `entry`
  interface rather than the stored value directly.

Ref #3424
2018-06-19 13:20:40 +03:00
Botond Dénes
b9d51b4c08 tests/querier_cache: fix memory based eviction test
Do increment the key counter after inserting the first querier into the
cache. Otherwise two queriers with the same key will be inserted and
will fail the test. This problem is exposed by the changes the next
patches make to the querier-cache but is fixed first to maintain
bisectability of the code.

Fixes: #3529
2018-06-19 13:20:13 +03:00
Vladimir Krivopalov
100eb03f29 sstables: Write end-of-partition byte before flushing the last index block.
This is to stay compliant with the Origin for SSTables 3.x.
It differs from SSTables 2.x (ka/la) as for those the last promoted
index block is pushed first and the end-of-partition byte is written
after.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-18 14:28:25 -07:00
Vladimir Krivopalov
ad0b911b03 sstables: Move to_deletion_time helper up and make it static.
It is used for writing end_open_marker for promoted index.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-18 14:25:13 -07:00
Vladimir Krivopalov
03cf20676c Revert "Add missing enum values to bound_kind."
This reverts commit 3ecc9e9ce4.

It also adds another enum to be used instead.
2018-06-18 14:22:12 -07:00
Vladimir Krivopalov
0cf42e7fd2 range_tombstone_stream: Remove an unused boolean flag.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-18 14:22:12 -07:00
Glauber Costa
e0b209b271 controller: do not increase shares of controllers for inputs higher than the maximum
Right now there is no limit to how much the shares of the controllers
can grow. That is not a big problem for the memtable flush controller,
since it has a natural maximum in the dirty limit.

But the compaction controller, the way it's written today, can grow
forever and end up with a very large value for shares. We'll cap that at
adjust() time by not allowing shares to grow indefinitely.
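
One way to picture the cap, as a minimal sketch rather than the actual
controller code: `control_point` and `shares_for()` are hypothetical
names, and the sketch assumes shares are interpolated linearly between
control points sorted by strictly increasing input.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical control point: an input level and the shares it maps to.
struct control_point {
    float input;
    float output;
};

// Interpolates shares for `input`. Inputs beyond the last control point
// are clamped, so shares never exceed the last point's output.
float shares_for(const std::vector<control_point>& points, float input) {
    assert(!points.empty());
    input = std::clamp(input, points.front().input, points.back().input);
    for (std::size_t i = 1; i < points.size(); ++i) {
        if (input <= points[i].input) {
            const control_point& lo = points[i - 1];
            const control_point& hi = points[i];
            float t = (input - lo.input) / (hi.input - lo.input);
            return lo.output + t * (hi.output - lo.output);
        }
    }
    return points.back().output;
}
```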

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-18 15:16:39 -04:00
Glauber Costa
70c47eb045 controller: adjust constants for compaction controller
Right now the controller adjusts its shares based on how big the backlog
is in comparison to shard memory. We have seen in some tests that if the
dataset becomes too big, this may cause compactions to dominate.

While we may change the input altogether in future versions, I'd like to
propose a quick change for the time being: move the high point from 10x
memory size to 30x memory size. This will cause compactions to increase
in shares more slowly.

While this is as magic as the 10 before, it will allow us to err on
the side of caution, with compactions not becoming aggressive enough to
overly disrupt workloads.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-18 15:16:38 -04:00
Piotr Jastrzebski
4c261d2e51 sstable: add is_set and to_row_marker to liveness_info
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-18 20:26:39 +02:00
Paweł Dziepak
71471bb322 Merge "Make front-end processing scheduling aware" from Avi
"
This patchset runs the protocol servers under the "statement" scheduling
group, and makes all execution_stages in that path scheduling aware.

I used inheriting_concrete_execution_stage instead of passing the
scheduling group to concrete_execution_stage's constructor for two
reasons:

 1. For cql statements, there is no easily accessible object that
    can host the concrete_execution_stage and be reached from both
    main.cc and the statements,
 2. In the future, we will want to assign users to different
    scheduling_groups, thus providing performance isolation for
    service-level agreements (SLAs). Using an inheriting
    execution_stage allows us to make the scheduling_group decision
    in one place.

Depends on two unmerged patches in seastar, one fixing
inheriting_concrete_execution_stage compilation with reference parameters,
and one making smp::submit_to() scheduling aware.
"

* tag 'cql-sched/v1' of https://github.com/avikivity/scylla:
  cql: make modification_statement execution_stage scheduling aware
  cql: make batch_statement execution_stage scheduling aware
  cql: make select_statement execution_stage scheduling aware
  transport: make native protocol request processing execution_stage scheduling aware
  main: start client protocol servers under the statement scheduling group
2018-06-18 16:38:30 +01:00
Avi Kivity
0cf4cf5981 cql: make modification_statement execution_stage scheduling aware
Inherit scheduling from the caller, preventing a fall back into the main group.
2018-06-18 18:30:21 +03:00
Avi Kivity
9479d3f345 cql: make batch_statement execution_stage scheduling aware
Inherit scheduling from the caller, preventing a fall back into the main group.
2018-06-18 18:30:21 +03:00
Avi Kivity
fdfc347595 cql: make select_statement execution_stage scheduling aware
Inherit scheduling from the caller, preventing a fall back into the main group.
2018-06-18 18:30:21 +03:00
Avi Kivity
ec788d2a7a transport: make native protocol request processing execution_stage scheduling aware
Inherit scheduling from the caller, preventing a fall back into the main group.
2018-06-18 18:30:21 +03:00
Avi Kivity
ea39e3e9d4 main: start client protocol servers under the statement scheduling group
This will isolate client protocol and coordinator-side processing from
the rest of the system.
2018-06-18 18:30:21 +03:00
Paweł Dziepak
79fae49689 Merge seastar upstream
* seastar 6422ece...3c60b82 (5):
  > reactor: inherit scheduling_group in smp::submit_to()
  > execution_stage: fix inheriting_concrete_execution_stage with reference arguments
  > tests: shared_ptr: Add typename keyword to fix compilation
  > configure: Fix --static-stdc++ flag
  > scheduling: Move friends' definitions outside the class scope
2018-06-18 12:16:57 +01:00
Avi Kivity
782827cc1b Update seastar submodule
* seastar e7275e4...6422ece (7):
  > build: enable concepts whenever they are supported by compiler
  > shared_ptr: Enable releasing ownership of the object stored in lw_shared_ptr
  > reactor: change way of calculating task quota violations
  > Merge "Add metrics for steal time and task quota violations" from Glauber
  > bitops.hh/log2ceil(): add special case for n == 1
  > circular_buffer: add clear()
  > build: add core/execution_stage.{cc,hh} to core_files
2018-06-17 21:53:50 +03:00
Avi Kivity
f0fc888381 Merge "Try harder to move STCS towards zero-backlog" from Glauber
"
Tests: unit (release)

Before merging the LCS controller, we merged patches that would
guarantee that LCS would move towards zero backlog - otherwise the
backlog could get too high.

We didn't do the same for STCS, our first controlled strategy. So we may
end up with a situation where there are many SSTables inducing a large
backlog, but they are not yet meeting the minimum criteria for
compaction. The backlog, then, never goes down.

This patch changes the SSTable selection criteria so that if there is
nothing to do, we'll keep pushing towards reaching a state of zero
backlog. Very similar to what we did for LCS.
"

* 'stcs-min-threshold-v4' of github.com:glommer/scylla:
  STCS: bypass min_threshold unless configure to enforce strictly
  compaction_strategy: allow the user to tell us if min_threshold has to be strict
2018-06-17 18:07:23 +03:00
Glauber Costa
fd51ff3d9e STCS: bypass min_threshold unless configure to enforce strictly
If we fail to produce a SizeTiered compaction with the configured
min_threshold, we can try again to compact any two - unless there is a
global bypass telling us not to.

This will still privilege doing larger compactions in size buckets where
that is possible, but if we are idle we will try to compact any two.
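
The two-pass selection can be sketched roughly like this. The function
`pick_bucket()` and its parameters are hypothetical, and buckets are
reduced to their SSTable counts; the real strategy works on SSTable
objects and may choose candidates differently.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Returns the index of the bucket to compact, if any.
// `bucket_sizes[i]` is the number of SSTables in size bucket i.
std::optional<std::size_t> pick_bucket(const std::vector<std::size_t>& bucket_sizes,
                                       std::size_t min_threshold,
                                       bool enforce_min_threshold) {
    // Picks the fullest bucket holding at least `threshold` SSTables.
    auto pick = [&](std::size_t threshold) -> std::optional<std::size_t> {
        std::optional<std::size_t> best;
        for (std::size_t i = 0; i < bucket_sizes.size(); ++i) {
            if (bucket_sizes[i] >= threshold
                    && (!best || bucket_sizes[i] > bucket_sizes[*best])) {
                best = i;
            }
        }
        return best;
    };
    // First pass: honour min_threshold, favouring larger compactions.
    if (auto bucket = pick(min_threshold)) {
        return bucket;
    }
    // Strict mode: nothing to do until a bucket reaches min_threshold.
    if (enforce_min_threshold) {
        return std::nullopt;
    }
    // Otherwise fall back to any bucket with at least two SSTables.
    return pick(2);
}
```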

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-15 14:27:22 -04:00
Glauber Costa
290d553c3a compaction_strategy: allow the user to tell us if min_threshold has to be strict
Now that we have the controller, we would like to take min_threshold as
a hint. If there is nothing to compact, we can ignore that and start
compacting less than min_threshold SSTables so that the backlog keeps
reducing.

But there are cases in which we don't want min_threshold to be a hint
and we want to enforce it strictly. For instance, if write amplification
is more of a concern than space amplification.

This patch adds a YAML option that allows the user to tell us that. We will
default to false, meaning min_threshold is not strictly enforced.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-15 13:42:43 -04:00
Avi Kivity
75b53c4170 Merge "sstables 3.x read counters v2 00/10] Support reading counters" from Piotr
"
Implement and test support for reading counters in SSTables 3.
"

* 'haaawk/sstables3/read-counters-v2' of ssh://github.com/scylladb/seastar-dev:
  sstable_3_x_test: add test for counters
  data_consume_rows_context_m: support reading counters
  Add consumer_m::consume_counter_column
  Extract make_counter_cell
  row.hh & mp_row_consumer.hh: Add required includes
  Use serialization_header::adjust in read_statistics
  sstables 3: add serialization_header::adjust
  data_consume_rows_context_m: add is_column_counter
  data_consume_rows_context_m: Remove unused CELL_PATH_SIZE state
  column_translation: add is_counter
2018-06-15 17:33:40 +03:00
Takuya ASADA
3f8719d67e dist/common/scripts/scylla_coredump_setup: fix typo
Correct function name is "is_debian_variant", not "is_debian_variants"

Fixed #3507

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180612155353.28229-1-syuu@scylladb.com>
2018-06-15 12:11:52 +01:00
Glauber Costa
e1246a3a3a sstable_test: write to temporary directory
Currently the SSTable test is failing (at least for me and Raphael),
complaining about the file it tries to write already existing. We have
helpers now to generate temporary directories, so we should use them.

The test passes after that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180614210036.16662-1-glauber@scylladb.com>
2018-06-15 11:00:08 +02:00
Piotr Sarna
d7eb6e6c7f tests: fix a typo in idl_test.cc
Fixes #3520
Message-Id: <831ead669a30d1b136d9ae50c4a1ac7057cf3340.1529047397.git.sarna@scylladb.com>
2018-06-15 09:56:45 +01:00
Piotr Jastrzebski
346e559c1b sstable_3_x_test: add test for counters
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:11:09 +02:00
Piotr Jastrzebski
2942f6eecc data_consume_rows_context_m: support reading counters
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:11:09 +02:00
Piotr Jastrzebski
785e14dfb9 Add consumer_m::consume_counter_column
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:11:09 +02:00
Piotr Jastrzebski
6f559445d0 Extract make_counter_cell
It will be used by both consumers.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:11:09 +02:00
Piotr Jastrzebski
88b66189b7 row.hh & mp_row_consumer.hh: Add required includes
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:11:09 +02:00
Piotr Jastrzebski
369e4a4987 Use serialization_header::adjust in read_statistics
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:11:09 +02:00
Piotr Jastrzebski
a3683d6e0f sstables 3: add serialization_header::adjust
In SSTables 3, min timestamp and min deletion time in serialization
header are not stored normally but instead the difference between
their value and the cassandra "epoch" is stored.
This is supposed to make SSTables smaller. As a consequence, we have
to add the "epoch" after reading the values to obtain the actual
values of min timestamp and min deletion time.
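
As an illustration only, the adjustment amounts to adding the epoch back
to each delta-encoded field. The struct and function names below are
hypothetical, and the epoch values are passed in as parameters rather
than hardcoded, since this message does not state them.

```cpp
#include <cstdint>

// Hypothetical view of the two delta-encoded header fields.
struct header_stats {
    std::uint64_t min_timestamp;
    std::uint32_t min_deletion_time;
};

// Hypothetical holder for the "epoch" values the deltas are relative to.
struct stats_epoch {
    std::uint64_t timestamp;
    std::uint32_t deletion_time;
};

// Turns the on-disk deltas into the actual values by adding the epoch.
header_stats adjust(header_stats on_disk, const stats_epoch& epoch) {
    on_disk.min_timestamp += epoch.timestamp;
    on_disk.min_deletion_time += epoch.deletion_time;
    return on_disk;
}
```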

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:10:48 +02:00
Tomasz Grabiec
78274276f5 row_cache: Use the memtable cleaner to create memtable snapshot during update
Memtable entries should be cleaned using memtable cleaner, which
unlike the cache's cleaner is not associated with the cache
tracker. It's an error to clean a snapshot using a tracker which doesn't
own the entries. This will corrupt cache tracker's row counter.

Fixes failure of test_exception_safety_of_update_from_memtable from
row_cache.cc in debug mode and with allocation failure injection
enabled.

Introduced in "cache: Defer during partition merging"
(70c72773be).
Message-Id: <1528988256-20578-1-git-send-email-tgrabiec@scylladb.com>
2018-06-14 18:03:02 +03:00
Piotr Sarna
6b3a97e34a hints: fix max_shard_disk_space_size initialization
Previously max_shard_disk_space_size was unconditionally initialized
with the capacity of hints_directory. But, it's likely that
hints_directory doesn't exist at all if hinted handoff is not enabled,
which results in Scylla failing to boot.
So, max_shard_disk_space_size is now initialized with the capacity
of hints_for_views directory, which is always present.
This commit also moves max_shard_disk_space_size to the .cc file
where it belongs - resource_manager.cc.

Tests: unit (release)

Message-Id: <9f7b86b6452af328c05c5c6c55bfad3382e12445.1528977363.git.sarna@scylladb.com>
2018-06-14 14:24:01 +01:00
Duarte Nunes
5a8b8afe19 Merge "Add support for datetime functions" from Piotr
"
This series adds the following datetime functions to CQL:
 - currentTimestamp
 - currentDate
 - currentTime
 - currentTimeUUID
 - timeUUIDToDate
 - timestampToDate
 - timeUUIDToTimestamp
 - dateToTimestamp
 - timeUUIDToUnixTimestamp
 - timestampToUnixTimestamp
 - dateToUnixTimestamp

It also comes with datetime conversions test added to cql_query_test.

Note: issue #2949 also mentioned queries like:
 $ SELECT * FROM myTable WHERE date >= currentDate() - 2d;
but it's a broader topic of supporting arithmetic operations in general,
so it's moved to #3499.

Tests: unit (release)
"

* 'support_datetime_functions_3' of https://github.com/psarna/scylla:
  tests: add datetime conversions to cql_query_tests
  cql3: add time conversion functions
  cql3: add current* time functions
  types: add time_native_type
2018-06-14 12:31:39 +01:00
Piotr Sarna
5900e7f55f tests: add datetime conversions to cql_query_tests
Test case related to datetime converting functions
is added to cql_query_tests suite.
2018-06-14 11:49:11 +02:00
Piotr Sarna
695015a27e cql3: add time conversion functions
Following functions are added:
 - timeuuidtodate
 - timestamptodate
 - timeuuidtotimestamp
 - datetotimestamp
 - timeuuidtounixtimestamp
 - timestamptounixtimestamp
 - datetounixtimestamp

Fixes #2949
2018-06-14 11:49:11 +02:00
Piotr Sarna
087998b768 cql3: add current* time functions
Following date/time-related functions are added:
 - currentTimestamp
 - currentDate
 - currentTime
 - currentTimeUUID
2018-06-14 11:49:08 +02:00
Piotr Sarna
90d323a522 types: add time_native_type
CQL3's time_type didn't have any suitable native type,
so time_native_type is introduced to serve that purpose.
2018-06-14 11:11:41 +02:00
Takuya ASADA
9971576ecb dist: drop collectd support from package
Since scyllatop no longer needs collectd, now we are able to drop collectd.

resolves #3490

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1528961612-8528-1-git-send-email-syuu@scylladb.com>
2018-06-14 10:40:23 +03:00
Takuya ASADA
fcc1a9f6bb dist/redhat: Disables ambient capabilities when systemd/kernel doesn't support it
CentOS 7.4 supports ambient capabilities in systemd unit files, but
some other RHEL7-compatible environments do not, which causes a
Scylla startup failure.

To avoid the issue, move the AmbientCapabilities line to
/etc/systemd/system/scylla.server.service.d/, and install the .conf only
when both systemd and the kernel support the feature.

Fixes #3486

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180613232327.7839-1-syuu@scylladb.com>
2018-06-14 10:32:56 +03:00
Avi Kivity
aeffbb6732 database: stop using incremental selectors
There is a bug in incremental_selector for partitioned_sstable_set, so
until it is found, stop using it.

This degrades scan performance of Leveled Compaction Strategy tables.

Fixes #3513. (as a workaround)
Introduced: 2.1
Message-Id: <20180613131547.19084-1-avi@scylladb.com>
2018-06-13 17:57:57 +02:00
Paweł Dziepak
d5982569bc Merge "Fix fragmented serialization" from Piotr
"
After issue 3501 it turned out that IDL generates incorrect
serialization code for fragmented buffers. This series addresses
the problem by:
 * providing serialization code for FragmentRange
 * changing IDL generation rules for fragmented buffers, so they
   expect a lower layer to iterate over fragments
 * adding a test to cql_query_test suite that covers #3501
 * adding a test to idl_tests suite that covers fragmented serialization
"

* 'fix_fragmented_serialization_3' of https://github.com/psarna/scylla:
  tests: add fragmented serialization test to idl_tests
  tests: add long text value test
  idl: remove for_each from fragmented serialization
  serializer: add FragmentRange serialization
2018-06-13 14:11:16 +01:00
Gleb Natapov
98b7f6148b fix regression in perf_row_cache_update test
logalloc should be initialized explicitly by every test that uses it
now.

Message-Id: <20180613093657.GY11809@scylladb.com>
2018-06-13 15:21:20 +03:00
Avi Kivity
29976600b4 Update scylla-ami submodule
* dist/ami/files/scylla-ami 1f5329f...36e8511 (1):
  > don't try to add busy devices to the RAID.
2018-06-13 15:19:57 +03:00
Piotr Sarna
551e8f5d8c tests: add fragmented serialization test to idl_tests
IDL tests now have an additional test that checks whether serializing
and deserializing of fragmented buffers is working properly.

References #3501
2018-06-13 13:54:12 +02:00
Piotr Sarna
cdd87af408 tests: add long text value test
A test that adds a long (>8192) text/varchar value is added to the cql suite.

References #3501
2018-06-13 13:54:12 +02:00
Piotr Sarna
450e014558 idl: remove for_each from fragmented serialization
Previously fragmented buffers of bytes were serialized
with a for_each loop. Since serializing bytes involves writing
size first and then data, only the first fragment (and its size)
would be taken into account.
This commit changes fragmented code generation so it expects
that serialized range has a serialize(output, T) specification
and expects it to iterate over fragments on its own (just like
serializer for basic_value_view does).

Fixes #3501
2018-06-13 13:54:09 +02:00
Piotr Sarna
e525a0d51b serializer: add FragmentRange serialization
Serialization for FragmentRange classes is added to serialization
suite. It first serializes total length to a 32bit field and then
writes each fragment to output.
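
A minimal sketch of that layout, not the serializer's real interface:
`serialize_fragments()` is a hypothetical helper, the fragments are
plain `std::string_view`s, and the little-endian byte order of the
length field is an assumption made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Writes the total length of all fragments once, as a 32-bit field,
// followed by the bytes of every fragment in order.
std::vector<std::uint8_t> serialize_fragments(
        const std::vector<std::string_view>& fragments) {
    std::size_t total = 0;
    for (std::string_view f : fragments) {
        total += f.size();
    }
    std::vector<std::uint8_t> out;
    out.reserve(sizeof(std::uint32_t) + total);
    const auto length = static_cast<std::uint32_t>(total);
    // Assumed byte order: little-endian.
    for (int i = 0; i < 4; ++i) {
        out.push_back(static_cast<std::uint8_t>((length >> (8 * i)) & 0xff));
    }
    for (std::string_view f : fragments) {
        out.insert(out.end(), f.begin(), f.end());
    }
    return out;
}
```

Writing the size once for the whole range is what distinguishes this
from a per-fragment loop, where each fragment would carry its own size
prefix and a reader would stop after the first one.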

References #3501
2018-06-13 13:44:08 +02:00
Piotr Jastrzebski
42d2a162dd data_consume_rows_context_m: add is_column_counter
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-13 09:27:58 +02:00
Piotr Jastrzebski
d4d3e6f8eb data_consume_rows_context_m: Remove unused CELL_PATH_SIZE state
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-13 09:27:58 +02:00
Piotr Jastrzebski
ca7ede7eaf column_translation: add is_counter
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-13 09:27:58 +02:00
Vlad Zolotarov
0004c29aba scylla_io_setup: properly define the disk_properties YAML hierarchy
disk_properties map should be an entry in the 'disk' list hierarchy.
Currently this list is going to contain a single element.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 19:14:22 -04:00
Vlad Zolotarov
038b2f3be2 scylla_io_setup: fix a typo: s/write_bandwdith/write_bandwidth/
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 18:54:34 -04:00
Vlad Zolotarov
26277e5973 scylla_io_setup: hardcode the "mountpoint" YAML node to "/var/lib/scylla" for AMIs
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 15:51:10 -04:00
Vlad Zolotarov
77463ddc3b scylla_io_setup: print the io_properties.yaml file name and not its handle info
In order to get a file name from the given file() handle one should use
a file_handle.name property.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 15:25:55 -04:00
Vlad Zolotarov
aa3d9c38b5 scylla_lib.sh: tolerate perftune.py errors
When we check the currently configured tuning mode perftune.py is allowed
to return an error. get_tune_mode() has to be able to tolerate them.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 14:21:26 -04:00
Vlad Zolotarov
818b5b75ba tracing: store the prepared statements parameters values
Store the prepared statement positional parameters values in the
corresponding system_traces.sessions entry in the 'parameters' column
(which has a map<text,text> type).

Parameters are stored as a pair of "param[X]" : "value", where X is
the index of the parameter starting from 0 and the "value" is the first
64 characters of the parameter's value string representation.

If parameters were given with their names attached (see the description
on bit 0x40 of QUERY flags in the CQL binary protocol specification) then
parameters are going to be stored in the "param[X](<bound variable name>)" : "value"
form.

If the value's string representation is longer than 64 characters then the "value" will
contain only first 64 characters of it and will have the "..." at
the end.

For a BATCH of prepared statements the parameter "name" will have a form of
param[Y][X] where Y is the index of the corresponding prepared statement
in the BATCH and X is the index of the parameter. Both X and Y start from
0.
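
The naming and truncation rules above can be sketched as below. The
helper names are hypothetical and the code is only an illustration of
the format, not the tracing implementation.

```cpp
#include <cstddef>
#include <string>

// Key for a positional parameter, optionally with its bound variable name:
// "param[X]" or "param[X](<bound variable name>)".
std::string param_key(std::size_t index, const std::string& name = "") {
    std::string key = "param[" + std::to_string(index) + "]";
    if (!name.empty()) {
        key += "(" + name + ")";
    }
    return key;
}

// Key for a parameter of statement Y inside a BATCH: "param[Y][X]".
std::string batch_param_key(std::size_t statement, std::size_t index) {
    return "param[" + std::to_string(statement) + "]["
            + std::to_string(index) + "]";
}

// Keeps at most `max_len` characters of the value's string representation
// and marks a truncated value with a trailing "...".
std::string param_value(const std::string& repr, std::size_t max_len = 64) {
    if (repr.size() <= max_len) {
        return repr;
    }
    return repr.substr(0, max_len) + "...";
}
```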

Note:
Had to switch to boost::range::find() in sstables::big_sstable_set in order to
address the "ambiguous overload" compilation error.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 10:57:05 -04:00
Vlad Zolotarov
a1da285f9e tracing: store queries statements for BATCH
Similarly to the regular QUERY or EXECUTE we want to see the actual
query statements that were part of the BATCH.

If a traced query has only a single statement to execute then its statement will be stored in a form 'query':'<statement>'.

If there are two or more queries (BATCH) then statements of each query in the BATCH will be stored in a form 'query[X]':'<statement>', where X is the index of the query in the
BATCH starting from 0.
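
As a small illustration of the key format (the helper `query_key()` is
hypothetical, not part of the tracing code):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the key under which statement number `index` is stored:
// "query" for a single statement, "query[X]" for statements of a BATCH.
std::string query_key(std::size_t statement_count, std::size_t index) {
    assert(index < statement_count);
    if (statement_count == 1) {
        return "query";
    }
    return "query[" + std::to_string(index) + "]";
}
```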

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 10:57:05 -04:00
Vlad Zolotarov
c0e51c4521 tracing::trace_state: hide the internals of params_values
Hide it inside the trace_state.cc in order to avoid future circular
dependencies with other .hh files.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 10:57:05 -04:00
Vlad Zolotarov
a469567605 cql3::query_options: add get_names() method
This method returns names of named prepared statement parameters.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 10:57:05 -04:00
Avi Kivity
d91891b6f0 Restore scylla-ami submodule
Commit b38ced0fcd ("Configure logalloc
memory size during initialization") updated the scylla-ami submodule
inadvertently.
2018-06-12 10:56:59 +03:00
Avi Kivity
74a3ab36e3 Restore seastar submodule
Commit b38ced0fcd ("Configure logalloc
memory size during initialization") updated the seastar submodule
inadvertently.
2018-06-12 10:37:35 +03:00
Avi Kivity
24a9a3c679 Merge "Push memory limits configuration up to main" from Gleb
"
Many components limit their internal memory pools/caches/queues depending
on the amount of memory present in the system. Each of them uses seastar
memory interface to get the information about memory availability
which makes it harder to 1: test the components with various memory
configurations and 2: to see which components reserve memory and how
much each one reserves.

The patch changes all the components that rely on memory size to get this
information through configuration parameter during creation instead of
checking it directly with seastar, so only main interacts with seastar
allocator.
"

* 'gleb/memory-config-v2' of github.com:scylladb/seastar-dev:
  Provide available memory size to compaction_manager object during creation
  Configure authorized_prepared_statment_cache memory limit during object creation
  Configure logalloc memory size during initialization
  Provide cql max request limit to cql server object during creation
  Configure query result memory limiter size limit during object creation
  Configure querier_cache size limit during object creation
  Provide available memory size to messaging_service object during creation
  Provide available memory size to hinted handoff resource manager during creation
  Provide available memory size to storage_proxy object during creation
  Provide available memory size to commitlog during creation
  Provide available memory size to database object during creation
  Configure prepared_statements_cache memory limit from outside
2018-06-11 15:34:14 +03:00
Gleb Natapov
59da525e0d Provide available memory size to compaction_manager object during creation
2018-06-11 15:34:14 +03:00
Gleb Natapov
da20d86423 Configure authorized_prepared_statment_cache memory limit during object creation
2018-06-11 15:34:14 +03:00
Gleb Natapov
b38ced0fcd Configure logalloc memory size during initialization
2018-06-11 15:34:14 +03:00
Gleb Natapov
894673ac14 Provide cql max request limit to cql server object during creation
2018-06-11 15:34:14 +03:00
Gleb Natapov
7832266cd7 Configure query result memory limiter size limit during object creation
2018-06-11 15:34:13 +03:00
Gleb Natapov
04727acee9 Configure querier_cache size limit during object creation
2018-06-11 15:34:13 +03:00
Gleb Natapov
646e400918 Provide available memory size to messaging_service object during creation
2018-06-11 15:34:13 +03:00
Gleb Natapov
cdf1289b43 Provide available memory size to hinted handoff resource manager during creation
2018-06-11 15:34:13 +03:00
Gleb Natapov
ac88935baa Provide available memory size to storage_proxy object during creation
2018-06-11 15:34:13 +03:00
Gleb Natapov
cc47f6c69d Provide available memory size to commitlog during creation
2018-06-11 15:34:13 +03:00
Gleb Natapov
f41575a156 Provide available memory size to database object during creation
2018-06-11 15:34:13 +03:00
Gleb Natapov
461f20e7b1 Configure prepared_statements_cache memory limit from outside
Pass desirable memory limit during construction instead of querying
memory size explicitly.
2018-06-11 15:34:13 +03:00
Tomasz Grabiec
a91974af7a tests: row_cache: Reduce concurrency limit to avoid bad_alloc
The test uses random mutations. We saw it failing with bad_alloc from time to time.
Reduce concurrency to reduce memory footprint.

Message-Id: <20180611090304.16681-1-tgrabiec@scylladb.com>
2018-06-11 10:06:56 +01:00
Tomasz Grabiec
cd7c7ac40f mutation_partition: Make do_compact() respect range tombstone merging rules
It compares only timestamps, but it should use intrinsic ordering of
the tombstone, which takes deletion time into consideration as well.
If we have two range tombstones with the same timestamp but different
deletion time (odd case, but still), then the one with the higher
deletion time should win. That's what all other parts of the system
use to resolve merges, in particular range_tombstone_list and
compact_mutation_state (the fragment stream compactor).
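
The ordering described above can be sketched with a simplified,
hypothetical `tombstone` struct (the real type has more to it): compare
timestamps first and break ties on deletion time.

```cpp
#include <cstdint>
#include <tuple>

// Simplified stand-in for a tombstone.
struct tombstone {
    std::int64_t timestamp;
    std::int32_t deletion_time;
};

// Intrinsic ordering: timestamp first, then deletion time.
bool operator<(const tombstone& a, const tombstone& b) {
    return std::tie(a.timestamp, a.deletion_time)
            < std::tie(b.timestamp, b.deletion_time);
}

// The tombstone that survives a merge is the greater of the two.
tombstone winner(const tombstone& a, const tombstone& b) {
    return a < b ? b : a;
}
```

With a timestamp-only comparison the two tombstones in a tie would be
treated as equal, so the result would depend on merge order.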

Not respecting this ordering violates the following equality:

  do_compact(do_compact(m1) + m2) == do_compact(m1 + m2)

which may result in some clustered rows being missing in the
right-hand side, but not in the left-hand side, due to differences in
range tombstones.

This impacts only tests currently.
Message-Id: <1528705602-7218-1-git-send-email-tgrabiec@scylladb.com>
2018-06-11 10:05:52 +01:00
Nadav Har'El
41472e2618 legacy_schema_migrator: add comment
When I came across db/legacy_schema_migrator.cc, I had no idea what it
does and though I had obvious guesses (it somehow migrates old schemas,
right?) I didn't know what it really does. So after I figured this out,
I wrote this comment so the next person doesn't need to guess.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180605120225.25173-1-nyh@scylladb.com>
2018-06-10 19:39:06 +03:00
Avi Kivity
4b81feb344 Merge "switch to systemd-coredump on Debian 9" from Takuya
* 'systemd-coredump-debian9' of https://github.com/syuu1228/scylla:
  dist/debian: fix pystache package name on Debian / Ubuntu
  dist/debian: switch to systemd-coredump on Debian 9
  dist/debian: rename 99-scylla.conf to 99-scylla-coredump.conf
2018-06-10 19:38:25 +03:00
Asias He
059ec89ad1 gms: Add is_normal helper to endpoint_state
It is faster than gossiper::is_normal because it avoids searching in
the std::map<application_state, versioned_value>. It is useful for the
code in the fast path which needs to query if a node is in NORMAL
status.

Fixes #3500

Message-Id: <42db91fa4108f9f4fcf94fed3ec403ccf35d15e9.1528354644.git.asias@scylladb.com>
2018-06-10 19:21:03 +03:00
Vladimir Krivopalov
9c9c85cde5 tests: Add test writing UDT data to SSTables 3.x.
Original data and index files are generated using Cassandra 3.11.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <d0ea8146d6f2a76a5f661271500b35390962a9d4.1528420647.git.vladimir@scylladb.com>
2018-06-10 19:20:42 +03:00
Avi Kivity
74469ecc09 Merge "Support reading collections" from Piotr
"
Implement and test support for reading collections in SSTables 3.

Tests: unit {release}
"

* 'haaawk/sstables3/read-collections-v1' of ssh://github.com/scylladb/seastar-dev:
  sstables 3: Add tests for reading collections
  flat_mutation_reader_assertions: add more flexible asserts
  data_consume_rows_context_m: add support for collections
  mp_row_consumer_m: Add support for collections
  data_consume_rows_context_m: introduce cell_path
  Use column_translation::*_is_collection in reading
  column_translation: add *_column_is_collection()
  column_flags_m: add HAS_COMPLEX_DELETION
  Use read_unsigned_vint_length_bytes for COLUMN_VALUE
  Use read_unsigned_vint_length_bytes for CK_BLOCKS
  Implement read_unsigned_vint_length_bytes
2018-06-10 17:10:52 +03:00
Avi Kivity
2582f53b44 Merge "database and API: Add column_family::get_sstables_by_key" from Amnon
"
This series is for nodetool getsstables.

This patch is based on:
8daaf9833a

With some minor adjustments because of the code change in sstables.

The idea is to allow searching for all the sstables that contain a
given key.

After this patch, if there is a table t1 in keyspace k1 and it has a key
called aa, then

curl -X GET "http://localhost:10000/column_family/sstables/by_key/k1%3At1?key=aa"

will return the list of sstable file names that contain that key.
"

* 'amnon/sstable_for_key_v4' of github.com:scylladb/seastar-dev:
  Add the API implementation to get_sstables_by_key
  api: column_family.json make the get_sstables_for_key doc clearer
  column_family: Add the get_sstables_by_partition_key method
  sstable test: add has_partition_key test
  sstable: Add has_partition_key method
  keys_test: add a test for nodetool_style string
  keys: Add from_nodetool_style_string factory method
2018-06-10 16:53:56 +03:00
Amnon Heiman
8fbc6a22fb Add the API implementation to get_sstables_by_key
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-06-10 16:13:01 +03:00
Amnon Heiman
cc5601d000 api: column_family.json make the get_sstables_for_key doc clearer
This patch makes it clearer that the key that get_sstables_for_key
refers to, is a partition key.
2018-06-10 16:13:01 +03:00
Amnon Heiman
acb0a738eb column_family: Add the get_sstables_by_partition_key method
The get_sstables_by_partition_key method is used by the API to return the set of
sstable names that hold a given partition key.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-06-10 16:13:01 +03:00
Amnon Heiman
b8e5029991 sstable test: add has_partition_key test
This patch adds a test for the has_partition_key method. It creates an
sstable with a partition key and then uses that key in the
has_partition_key method to verify that it is there.

It then creates a different key and uses it to verify that a non-existent key
returns false.
2018-06-10 16:12:12 +03:00
Avi Kivity
ba5d8717c8 tests: disable reactor stall notifier
In case it is interacting badly with ASAN and causing spurious test
failures.
2018-06-10 15:55:00 +03:00
Avi Kivity
95b00aae33 Revert scylla-ami update in "scylla_setup: fix conditional statement of silent mode"
This reverts part of commit 364c2551c8. I mistakenly
changed the scylla-ami submodule in addition to applying the patch. The revert
keeps the intended part of the patch and undoes the scylla-ami change.
2018-06-10 14:53:40 +03:00
Asias He
d23dafa7ac dht: Remove column_families parameter in add_rx_ranges and add_tx_ranges
In 4b1034b (storage_service: Remove the stream_hints), we removed the
only user of the api with the column_families parameter.

std::vector column_families = { db::system_keyspace::HINTS };
streamer->add_tx_ranges(keyspace, std::move(ranges_per_endpoint),
column_families);

We can simplify the code range_streamer a bit by removing it.

Fixes #3476

Tests: dtest update_cluster_layout_tests.py
Message-Id: <c81d79c5e6dbc8dd78c1242837de892e39d6abd2.1528356342.git.asias@scylladb.com>
2018-06-10 14:53:40 +03:00
Piotr Jastrzebski
7d3abb0668 sstables 3: Add tests for reading collections
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 23:40:10 +02:00
Piotr Jastrzebski
176305c2f2 flat_mutation_reader_assertions: add more flexible asserts
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 23:39:51 +02:00
Piotr Jastrzebski
f9c62b8188 data_consume_rows_context_m: add support for collections
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 23:39:07 +02:00
Piotr Jastrzebski
fd89f42b09 mp_row_consumer_m: Add support for collections
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 23:35:12 +02:00
Piotr Jastrzebski
ffb6b9ed24 data_consume_rows_context_m: introduce cell_path
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 23:30:40 +02:00
Piotr Jastrzebski
5e1dd89d4d Use column_translation::*_is_collection in reading
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 22:50:23 +02:00
Piotr Jastrzebski
7bb25a2dd9 column_translation: add *_column_is_collection()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 22:48:43 +02:00
Piotr Jastrzebski
2b8ff15f9f column_flags_m: add HAS_COMPLEX_DELETION
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 22:47:19 +02:00
Avi Kivity
f9d66f88bb transport: advertise the shard serving a connection
It is useful for the client driver to know which shard is serving a
particular connection, so that it can send through that connection only
the requests that will be served by the same shard, eliminating a hop.

Support that by advertising a "SCYLLA_SHARD" option, with a value
corresponding to the shard number.

Acked-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180606203437.1198-1-avi@scylladb.com>
2018-06-07 10:43:16 +03:00
Avi Kivity
4a90eeb326 Update seastar submodule
* seastar 12cffef...e7275e4 (9):
  > tests: execution_stage_test: capture sg by value
  > Merge "Add in-path parameter suport to the code generation" from Amnon
  > Merge "Add scheduling_group inheritance to execution_stage" from Avi
  > tutorial: explain how to find origin of exception
  > tls: Ensure handshake always drains output before return/throw
  > build: cmake: correct stdc++fs library name once more
  > perftune.py: make sure config file existing before write
  > Update travis-ci integration
  > build: fix compilation issues on cmake. missing stdc++-fs
2018-06-06 19:07:16 +03:00
Avi Kivity
6f23403137 Merge "Virtualize IndexInfo system table" from Duarte
"
The IndexInfo table tracks the secondary indexes that have already
been populated. Since our secondary index implementation is backed by
materialized views, we can virtualize that table so queries are
actually answered by built_views.

Fixes #3483
"

* 'built-indexes-virtual-reader/v2' of github.com:duarten/scylla:
  tests/virtual_reader_test: Add test for built indexes virtual reader
  db/system_keysace: Add virtual reader for IndexInfo table
  db/system_keyspace: Explain that table_name is the keyspace in IndexInfo
  index/secondary_index_manager: Expose index_table_name()
  db/legacy_schema_migrator: Don't migrate indexes
2018-06-06 17:35:51 +03:00
Piotr Jastrzebski
f7a1d5a437 Use read_unsigned_vint_length_bytes for COLUMN_VALUE
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-06 15:54:17 +02:00
Piotr Jastrzebski
3b8b165053 Use read_unsigned_vint_length_bytes for CK_BLOCKS
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-06 15:44:53 +02:00
Piotr Jastrzebski
21a0e95a06 Implement read_unsigned_vint_length_bytes
It's a common operation that's used in multiple
places so it's best to have it implemented once.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-06 15:44:06 +02:00
Piotr Sarna
0818eb42ae cql3: remove additional IN relation check
Commit 80fc1b1408 introduced additional checks to ensure
that IN relation in WHERE clause can only occur on last restricted
column. This check is not present in current Cassandra code,
the restriction isn't mentioned anywhere in 'IN relation' documentation
and removing it fixes issue 2865.
Running cql_tests dtest suite doesn't show any regression after removing
this check.

Also at: https://github.com/psarna/scylla/tree/remove_additional_in_relation_check

Tests: dtest (cql_tests), unit (release)

Fixes #2865

Message-Id: <aa8c0b33618dd58cd153e83589ac016bc63f4343.1528288388.git.sarna@scylladb.com>
2018-06-06 16:01:54 +03:00
Tomasz Grabiec
9975135110 row_cache: Make sure reader makes forward progress after each fill_buffer()
If reader's buffer is small enough, or preemption happens often
enough, fill_buffer() may not make enough progress to advance
_lower_bound. If, in addition, iterators are constantly invalidated across
fill_buffer() calls, the reader will not be able to make progress.

See row_cache_test.cc::test_reading_progress_with_small_buffer_and_invalidation()
for an example scenario.

Also reproduced in debug-mode row_cache_test.cc::test_concurrent_reads_and_eviction

Message-Id: <1528283957-16696-1-git-send-email-tgrabiec@scylladb.com>
2018-06-06 16:01:52 +03:00
Vlad Zolotarov
12e3e4fb2a service::client_state::has_access(): make readable_system_resources an std::unordered_set
There is no reason to use an std::set for it since we don't care about
the ordering - only about the existence of a particular entry.
A hash table will be more efficient for this use case.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1528220892-5784-2-git-send-email-vladz@scylladb.com>
2018-06-06 15:29:29 +03:00
Duarte Nunes
833d34e88a Merge 'Make rows in a secondary index ordered by token' from Piotr
"
As in #3423, ensuring token order on secondary index queries can be done
by adding an additional column to views that back secondary indexes.
This column is the first clustering column and contains the token value,
computed on updates.
This series also updates tests and comments referring to issue 3423.

Tests: unit (release, debug)
"

* 'order_by_token_in_si_5' of https://github.com/psarna/scylla:
  cql3: update token order comments
  index, tests: add token column to secondary index schema
  view: add handling of a token column for secondary indexes
  view: add is_index method
2018-06-06 10:07:43 +01:00
Vlad Zolotarov
2dde372ae6 locator::ec2_multi_region_snitch: don't call for ec2_snitch::gossiper_starting()
ec2_snitch::gossiper_starting() calls the base class (default) method
that sets _gossip_started to TRUE and thereby prevents the subsequent
reconnectable_snitch_helper registration.

Fixes #3454

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1528208520-28046-1-git-send-email-vladz@scylladb.com>
2018-06-06 12:00:17 +03:00
Asias He
6496cdf0fb db: Get rid of the streaming memtable delayed flush
In 455d5a5 (streaming memtables: coalesce incoming writes), we
introduced the delayed flush to coalesce incoming streaming mutations
from different stream_plan.

However, most of the time there will be one stream plan at a time, the
next stream plan won't start until the previous one is finished. So, the
current coalescing does not really work.

The delayed flush adds 2s of delay for each stream session. If we have lots
of tables to stream, we will waste a lot of time.

We stream a keyspace in around 10 stream plans, i.e., 10% of ranges at a
time. If we have 5000 tables, even if the tables are almost empty, the
delay will waste 5000 * 10 * 2 s = 100,000 s, i.e. more than 27 hours.

Streaming a keyspace with 4 tables, each with 1000 rows:

Before:

 [shard 0] stream_session - [Stream #944373d0-5d9c-11e8-9cdb-000000000000] Executing streaming plan for Bootstrap-ks-index-0 with peers={127.0.0.1}, master
 [shard 0] stream_session - [Stream #944373d0-5d9c-11e8-9cdb-000000000000] Streaming plan for Bootstrap-ks-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1030 KiB, 125.21 KiB/s
 [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks succeeded, took 8.233 seconds

After:

 [shard 0] stream_session - [Stream #e00bf6a0-5d99-11e8-a7b8-000000000000] Executing streaming plan for Bootstrap-ks-index-0 with peers={127.0.0.1}, master
 [shard 0] stream_session - [Stream #e00bf6a0-5d99-11e8-a7b8-000000000000] Streaming plan for Bootstrap-ks-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1030 KiB, 4772.32 KiB/s
 [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks succeeded, took 0.216 seconds

Fixes #3436

Message-Id: <cb2dde263782d2a2915ddfe678c74f9637ffd65b.1526979175.git.asias@scylladb.com>
2018-06-06 10:16:02 +03:00
Piotr Sarna
70ba8c8317 cql3: update token order comments
Comments about token order were outdated with token column patches
and they are now up to date.

Fixes #3423
2018-06-06 09:02:37 +02:00
Piotr Sarna
4a9bf7ed5b index, tests: add token column to secondary index schema
Additional token column is now present in every view schema
that backs a secondary index. This column is always a first part
of the clustering key, so it forces token order on queries.
Column's name is ideally idx_token, but can be postfixed
with a number to ensure its uniqueness.
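
A minimal sketch of how such a unique name could be chosen. The helper name and the exact suffix format ("_1", "_2", ...) are assumptions made for illustration; the message only says that a number may be appended.

```cpp
#include <set>
#include <string>

// Hypothetical helper: prefer "idx_token", otherwise append the first
// number that makes the name unique among the existing column names.
// The "_<n>" suffix format is an assumption, not taken from the patch.
inline std::string unique_token_column_name(const std::set<std::string>& existing) {
    const std::string base = "idx_token";
    if (existing.count(base) == 0) {
        return base;
    }
    for (unsigned i = 1;; ++i) {
        std::string candidate = base + "_" + std::to_string(i);
        if (existing.count(candidate) == 0) {
            return candidate;
        }
    }
}
```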

It also updates tests to make them acknowledge the new token order.

Fixes #3423
2018-06-06 09:02:33 +02:00
Takuya ASADA
899f7641b6 dist/debian: fix pystache package name on Debian / Ubuntu
It's python-pystache, not python2-pystache.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2018-06-06 15:55:52 +09:00
Takuya ASADA
db9074707a dist/debian: switch to systemd-coredump on Debian 9
Debian 9 has newer systemd that supports systemd-coredump, so enable it.
2018-06-06 15:04:31 +09:00
Takuya ASADA
30386ed215 dist/debian: rename 99-scylla.conf to 99-scylla-coredump.conf
Since 99-scylla.conf is only used for setting coredump handler, rename
it to 99-scylla-coredump.conf.
2018-06-06 14:59:32 +09:00
Piotr Sarna
d5e7b5507b view: add handling of a token column for secondary indexes
In order to ensure token order on secondary index queries,
first clustering column for each view that backs a secondary index
is going to store a token computed from base's partition keys.
After this commit, if there exists a column that is not present
in base schema, it will be filled with computed token.
2018-06-05 18:59:25 +02:00
Tomasz Grabiec
f775fc2e4c mvcc: Fix partition_entry::open_version()
After 70c72773be it's possible that
open_version() is called with a phase which is smaller than the phase
of the latest version, because latest version belongs to the
in-progress cache update. In such case we must return the existing
non-latest snapshot and not create a new version on top of the
in-progress update. Not doing this violates several invariants, and
may lead to inconsistencies, including violation of write atomicity or
temporary loss of writes.

partition_entry::read() was already adjusted by the aforementioned
commit. Do a similar adjustment for open_version().

Fixes sporadic failures of row_cache_test.cc::test_concurrent_reads_and_eviction
Message-Id: <1528211847-22825-1-git-send-email-tgrabiec@scylladb.com>
2018-06-05 18:22:38 +03:00
Takuya ASADA
60844ae67b dist/common/scripts/scylla_coredump_setup: don't run sysctl on Ubuntu 18.04
Since 99-scylla.conf is not included on Ubuntu 18.04, skip running it.

Fixes #3494

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180605093619.9197-1-syuu@scylladb.com>
2018-06-05 12:47:46 +03:00
Takuya ASADA
222b8588ee dist/common/systemd/scylla-server.service.in: add local-fs.target as dependency
We mistakenly added only network-online.target, which doesn't guarantee
waiting for the /var/lib/scylla mount.
To do this we need local-fs.target.

Fixes #3441

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180521083349.8970-1-syuu@scylladb.com>
2018-06-05 12:26:21 +03:00
Piotr Sarna
06eee0f525 view: add is_index method
is_index method returns true if view that owns it
is backing a secondary index.
2018-06-05 11:10:24 +02:00
Piotr Sarna
6130a00597 dist: add scylla/hints directory to scripts
/var/lib/scylla/hints directory was missing from dist-specific
scripts, which may cause package installations to fail.
Package building scripts and descriptions are updated.

Fixes #3495

Message-Id: <0f5596cb49500416820ece023b7f76a4e2427799.1528184949.git.sarna@scylladb.com>
2018-06-05 11:33:29 +03:00
Avi Kivity
4aaf7bbc1d Merge "Add test for compression" from Piotr
"
It turns out that compression just works for SSTables 3.x,
thanks to the previous work done on the write path.
This series cleans up tests a bit and introduces test for compression
on the read path.
"

* 'haaawk/sstables3/read-compression-v1' of ssh://github.com/scylladb/seastar-dev:
  Add test for compression in sstables 3.x
  Extract test_partition_key_with_values_of_different_types_read
  sstable_3_x_test: use SEASTAR_THREAD_TEST_CASE
  Drop UNCOMPRESSD_ when code will be used for compressed too
2018-06-04 20:33:50 +03:00
Piotr Jastrzebski
25a7f03f7f Add test for compression in sstables 3.x
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-04 18:41:10 +02:00
Piotr Jastrzebski
be9c7391aa Extract test_partition_key_with_values_of_different_types_read
It will be used also for testing compression.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-04 18:41:10 +02:00
Piotr Jastrzebski
1f324b7fc8 sstable_3_x_test: use SEASTAR_THREAD_TEST_CASE
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-04 18:40:52 +02:00
Piotr Jastrzebski
3e3ccdb323 Drop UNCOMPRESSD_ when code will be used for compressed too
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-04 18:29:02 +02:00
Avi Kivity
6d6c355dc0 Merge "augment system.local with sharding information" from Glauber
"
This patch adds nr_shards, msb_ignore, and the actual sharding algorithm to the
system.local table. Drivers and other tools can then make use of this
information to talk to scylla in an optimal way
"

* 'system_tables-v3' of github.com:glommer/scylla:
  system_keyspace: add sharding information to local table
  partitioner: export the name of the algorithm used to do intra-node sharding
2018-06-04 18:50:28 +03:00
Glauber Costa
bdce561ada system_keyspace: add sharding information to local table
We would like the clients to be able to route work directly to the right
shards. To do that, they need to know the sharding algorithm and its
parameters.

The algorithm can be copied into the client, but the parameters need to
be exported somewhere. Let's use the local table for that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
---
v2: force msb to zero on non-murmur
2018-06-04 11:25:58 -04:00
Glauber Costa
250d9332dc partitioner: export the name of the algorithm used to do intra-node sharding
We will export this on system tables. To avoid hard-coding it in the system
table level, keep it at least in the dht layer where it belongs.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-04 11:25:58 -04:00
Takuya ASADA
ad4ca1e166 dist: simplified build script templates
Currently, build_deb.sh looks very complicated because each distribution
requires different parameters, and we apply them with sed commands one by one.

This patch replaces them with Mustache, a template language with a simple
and easy syntax.
Both .rpm distributions and .deb distributions have pystache (a Python
implementation of Mustache), so we will use it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180604104026.22765-1-syuu@scylladb.com>
2018-06-04 14:38:52 +03:00
Paweł Dziepak
24764712b6 sstable: fix capture by reference of stack variable in continuation
Message-Id: <20180604102542.21799-1-pdziepak@scylladb.com>
2018-06-04 14:35:49 +03:00
Duarte Nunes
dfa779ebe7 Merge 'Separate hinted handoff manager for materialized views' from Piotr
"
This series introduces a separate hinted handoff manager for materialized views.

Steps:
 * decouple resource limits from hinted handoff, so multiple instances can share space
   and throughput limits in order to avoid internal fragmentation for every instance's
   reservations
 * add a subdirectory to data/, responsible for storing materialized view hints
 * decouple registering global metrics from hinted handoff constructor, now that there
   can be more than one instance - otherwise 'registering metrics twice' errors are going to occur
 * add a hints_for_views_manager to storage proxy and route failed view updates to use it
   instead of the original hints_manager
 * restore previous semantics for enabling/disabling hinted handoff - regular hinted handoff
   can be disabled or enabled just for specific datacenters without influencing materialized
   views flow
"

* 'separate_hh_for_mv_4' of https://github.com/psarna/scylla:
  storage_proxy: restore optional hinted handoff
  storage_proxy: add hints manager for views
  hints: decouple hints manager metrics from constructor
  db, config: add view_pending_updates directory
  hints: move space_watchdog to resource manager
  hints: move send limiter to resource manager
  hints: move constants to resource_manager
2018-06-04 12:03:59 +01:00
Duarte Nunes
01676a2cda tests/virtual_reader_test: Add test for built indexes virtual reader
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-06-04 11:31:29 +01:00
Duarte Nunes
3e39985c7a db/system_keysace: Add virtual reader for IndexInfo table
The IndexInfo table tracks the secondary indexes that have already
been populated. Since our secondary index implementation is backed by
materialized views, we can virtualize that table so queries are
actually answered by built_views.

Fixes #3483

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-06-04 11:14:17 +01:00
Duarte Nunes
65c4205334 db/system_keyspace: Explain that table_name is the keyspace in IndexInfo
This patch adds the same comment that exists in Apache Cassandra,
explaining that the table_name column in the IndexInfo system table
actually refers to the keyspace name. Don't be fooled.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-06-04 11:14:17 +01:00
Duarte Nunes
bc4db67524 index/secondary_index_manager: Expose index_table_name()
Expose secondary_index::index_table_name() so knowledge on how to
build an index name can remain centralized.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-06-04 11:14:17 +01:00
Duarte Nunes
7187963bda db/legacy_schema_migrator: Don't migrate indexes
Previous versions contained no indexes, and Apache Cassandra indexes
cannot be migrated to Scylla.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-06-04 11:14:17 +01:00
Vlad Zolotarov
e759803f48 cql3::authorized_prepared_statements_cache: properly set the expiration timeout
Because authorized_prepared_statements_cache caches the information that comes from
the permissions cache and from the prepared statements cache, its entries'
expiration period should be set to the minimum of the expiration periods of these caches.

The same goes for the entry refresh period, but since the prepared statements cache doesn't have a
refresh period, authorized_prepared_statements_cache's entries refresh period
is simply equal to the one of the permissions cache.
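
The expiration rule above amounts to taking a minimum. A minimal sketch, assuming durations are expressed with std::chrono; `derived_expiry` and its parameter names are hypothetical and not taken from the patch:

```cpp
#include <algorithm>
#include <chrono>

using cache_duration = std::chrono::milliseconds;

// Hypothetical helper: an entry derived from two source caches should not
// outlive either of them, so it expires after the shorter of the two periods.
inline cache_duration derived_expiry(cache_duration permissions_expiry,
                                     cache_duration prepared_statements_expiry) {
    return std::min(permissions_expiry, prepared_statements_expiry);
}
```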

Fixes #3473

Tests: dtest{release} auth_test.py

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1527789716-6206-1-git-send-email-vladz@scylladb.com>
2018-06-04 10:34:54 +02:00
Piotr Jastrzebski
0b72594c1f data_consume_rows_context_m: Use find_first and find_next
Those methods of boost::dynamic_bitset allow much more
efficient implementation of skip_absent_columns and
move_to_next_column.

Also fix some indentation and variable naming.

Test: unit {release}

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <8a4dea51060c5a02bb774eac43e9eb67d316049a.1528100153.git.piotr@scylladb.com>
2018-06-04 11:18:03 +03:00
Piotr Sarna
f12fdcffdb storage_proxy: restore optional hinted handoff
Since hinted handoff for materialized views is now a separate entity,
regular hinted handoff can go back to being optional.
2018-06-04 09:46:06 +02:00
Piotr Sarna
a6aae369da storage_proxy: add hints manager for views
This commit adds a separate hints manager that serves
only failed materialized view updates.
2018-06-04 09:46:06 +02:00
Piotr Sarna
204bc17bd7 hints: decouple hints manager metrics from constructor
Now that more than one instance of hints manager can be present
at the same time, registering metrics is moved out of the constructor
to prevent 'registering metrics twice' errors.
2018-06-04 09:46:06 +02:00
Piotr Sarna
a791dce0ae db, config: add view_pending_updates directory
Hints for materialized view updates need to be kept somewhere,
because their dedicated hints manager has to have a root directory.
view_pending_updates directory resides in /data and is used
for that purpose.
2018-06-04 09:46:06 +02:00
Piotr Sarna
f345efc79a hints: move space_watchdog to resource manager
Space watchdog is decoupled from hints manager and moved to resource
manager, so it can be shared among different hints manager instances.
2018-06-04 09:46:01 +02:00
Piotr Sarna
ef40f7e628 hints: move send limiter to resource manager
Send limiting semaphore is moved from hints manager to resource manager.
In consequence, hints manager now keeps a reference to its resource
manager.
2018-06-04 09:35:58 +02:00
Piotr Sarna
2315937854 hints: move constants to resource_manager
Constants related to managing resources are moved to newly created
resource_manager class. Later, this class will be used to manage
(potentially shared) resources of hints managers.
2018-06-04 09:35:58 +02:00
Avi Kivity
9b21fbc055 Merge "LCS: enable compaction controller" from Glauber
"

In preparation, we change LCS so that it tries harder to push data
to the last level, where the backlog is supposed to be zero.

The backlog is defined as:

backlog_of_stcs_in_l0 + Sum(L in levels) sizeof(L) * (max_levels - L) * fan_out

where:
 * the fan_out is the amount of SSTables we usually compact with the
   next level (usually 10).
 * max_levels is the number of levels currently populated
 * sizeof(L) is the total amount of data in a particular level.

Tests: unit (release)
"

* 'lcs-backlog-v2' of github.com:glommer/scylla:
  LCS: implement backlog tracker for compaction controller
  LCS: don't construct property in the body of constructor
  LCS: try harder to move SSTables to highest levels.
  leveled manifest: turn 10 into a constant
  backlog: add level to write progress monitor
2018-06-04 10:29:56 +03:00
Amos Kong
364c2551c8 scylla_setup: fix conditional statement of silent mode
Commit 300af65555 introduced a problem in the
conditional statement: the script always aborts in silent mode, regardless
of the return value.

Fixes #3485

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <1c12ab04651352964a176368f8ee28f19ae43c68.1528077114.git.amos@scylladb.com>
2018-06-04 10:14:06 +03:00
Glauber Costa
6317bd45d7 LCS: implement backlog tracker for compaction controller
This is the last missing tracker among the major strategies. After
this, only DTCS is left.

To calculate the backlog, we will define the point of zero-backlog
as having all data in the last level. The backlog is then:

Sum(L in levels) sizeof(L) * (max_levels - L) * fan_out,

where:
 * the fan_out is the amount of SSTables we usually compact with the
   next level (usually 10).
 * max_levels is the number of levels currently populated
 * sizeof(L) is the total amount of data in a particular level.

Care is taken for the backlog not to jump when a new level has been just
recently created.

Aside from that, SSTables that accumulate in L0 can be subject to STCS.
We will then add a STCS backlog in those SSTables to represent that.
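
A rough sketch of the sum above, under stated assumptions: max_levels is taken to be the index of the highest level that currently holds data, so that data already in that level contributes nothing, consistent with the zero-backlog definition. The STCS term for L0 and the smoothing for newly created levels are left out. `lcs_backlog_sketch` is a hypothetical name, not the tracker added by this patch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// bytes_per_level[L] is the total amount of data in level L.
// Assumption: max_level is the index of the highest populated level.
inline uint64_t lcs_backlog_sketch(const std::vector<uint64_t>& bytes_per_level,
                                   uint64_t fan_out = 10) {
    // Find the highest level that currently holds data.
    std::size_t max_level = 0;
    for (std::size_t level = 0; level < bytes_per_level.size(); ++level) {
        if (bytes_per_level[level] > 0) {
            max_level = level;
        }
    }
    // Each byte still has (max_level - level) levels to travel, and each
    // promotion is weighted by the fan-out.
    uint64_t backlog = 0;
    for (std::size_t level = 0; level < bytes_per_level.size() && level <= max_level; ++level) {
        backlog += bytes_per_level[level] * (max_level - level) * fan_out;
    }
    return backlog;
}
```

With 100 bytes in L0, 50 in L1 and 500 in L2, the sketch gives 100 * 2 * 10 + 50 * 1 * 10 = 2500; with everything in the last level it gives zero.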

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-03 18:14:09 -04:00
Glauber Costa
04546df55c LCS: don't construct property in the body of constructor
Right now we are constructing the _max_sstable_size_in_mb property in
the body of the constructor, which makes it hard for us to use from
other properties.

We are doing that because we'd like to test for bounds of that value. So
a cleaner way is to have a helper function for that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-03 18:14:09 -04:00
Glauber Costa
28382cb25c LCS: try harder to move SSTables to highest levels.
Our current implementation of LCS can end up with situations in which
just a bit of data is in the highest levels, with the majority in the
lowest levels. That happens because we will only promote things to
highest levels if the amount of data in the current level is higher than
the maximum.

This is a pre-existing problem in itself, but became even clearer when
we started trying to define what is the backlog for LCS.

We have discussed ways to fix this by redefining the criteria on when
to move data to the next levels. That would require us to change the way
things are today considerably, allowing parallel compactions, etc. There
is significant risk that we'll increase write amplification and we would
need to carefully validate that.

For now I will propose a simpler change, that essentially solves the
"inverted pyramid" problem of current LCS without major disruption:
keep selecting compaction candidates with the same criteria that we do
today, which should help make sure we are not compacting high levels for no
reason; but if there is nothing to do, use the idle time to push data to
higher levels. As an added benefit, old data that is in the higher level
can also be compacted away faster.

With this patch we see that in an idle, post-load system all data is
eventually pushed to the last level. Systems under constant writes keep
behaving the same way they did before.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-03 18:12:19 -04:00
Glauber Costa
e64b471e3d leveled manifest: turn 10 into a constant
We increase levels in powers of 10 but that is a parameter
of the algorithm. At least make it into a constant so that we can
reuse it somewhere else.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-03 16:55:58 -04:00
Avi Kivity
6f2d3b7f9f Merge "Fix previous row size calculation for SSTables 3.x" from Vladimir
"
SSTables 3.x format ('m') stores the size of previous row or RT marker
inside each row/marker. That potentially allows traversing rows/markers
in reverse order.

The previous code calculating those sizes appeared to produce invalid
values for all rows except the first one. The problem with detecting
this bug was that neither Cassandra itself nor the sstabledump tool use
those values; they are simply skipped on reading.
From UnfilteredSerializer.deserializeRowBody() method,
https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java#L562
:

            if (header.isForSSTable())
            {
                in.readUnsignedVInt(); // Skip row size
                in.readUnsignedVInt(); // previous unfiltered size
            }

So while the previous test files were technically correct in that they
contained valid data readable by Cassandra/sstabledump, they didn't
follow the format specification.

This patchset fixes the code to produce correct values and replaces
incorrect data files with correct ones. The newly generated data files
have been validated to be identical to files generated with Cassandra
using same data and timestamps as unit tests.

Tests: Unit {release}
"

* 'projects/sstables-30/fix-prev-row_size/v1' of https://github.com/argenet/scylla:
  tests: Fix test files to use correct previous row sizes.
  sstables: Fix calculation of previous row size for SSTables 3.x
  sstables: Factor out code building promoted index blocks into separate helpers.
2018-06-03 11:38:22 +03:00
Avi Kivity
a43b3e22fc Merge "Fix clustering blocks serialization for SSTables 3.x" from Vladimir
"
This patchset contains two fixes to the clustering key prefixes
serialization logic for SSTables 3.x.

First, it fixes a vexing typo: a bitwise-and (&) has been used instead
of a remainder operator (%) for truncating the shift value.
This did not show up in existing tests because they all had non-empty
clustering columns values.
Added tests to cover empty clustering columns values.

Second, it fixes the logic of serialization to write values up to the
prefix length, not the length of the clustering key as defined by
schema. This matches the way it is done by the Origin.

There is, however, a special case where the prefix size is smaller than
that of a clustering key but we still need to serialize up to the full
size. This is the case when a compact table is being used and some
rows in it are added using incomplete clustering keys (containing null
for trailing columns).
In Cassandra, these prefixes still have a full length and missing
columns are just set to 'null'. In our code those prefixes have their
real length, but since we need to serialize beyond it, we pass a flag to
indicate this.
"

* 'projects/sstables-30/fix-clustering-blocks/v1' of https://github.com/argenet/scylla:
  tests: Add test covering compact table with non-full clustering key.
  sstables: Improve clustering blocks writing, use logical clustering prefix size.
  tests: Add test covering large clustering keys (>32 columns) for SSTables 3.x
  tests: Add unit test covering empty values in clustering key.
  sstables: Fix typo in clustering blocks write helper.
2018-06-03 11:35:49 +03:00
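Why the `&` versus `%` mix-up in the clustering blocks merge above stayed hidden can be illustrated with a minimal sketch. It assumes the layout used by Cassandra 3.x clustering blocks: each block covers up to 32 columns and is preceded by a 64-bit header holding two bits per column (one for "empty", one for "null"). The names `column_value`, `shift_correct`, `shift_buggy`, and `make_block_header` are hypothetical and are not taken from the Scylla sources.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// A clustering column value: disengaged means null, "" means empty.
using column_value = std::optional<std::string>;

constexpr std::size_t columns_per_block = 32;

// Intended behaviour: position of the column inside its 32-column block,
// times two because every column owns two header bits.
inline unsigned shift_correct(std::size_t column_idx) {
    return static_cast<unsigned>(column_idx % columns_per_block) * 2;
}

// The typo: a bitwise-and with 32 only ever yields 0 or 32, so the result
// is 0 or 64 and never the in-block position. Shown for comparison only.
inline unsigned shift_buggy(std::size_t column_idx) {
    return static_cast<unsigned>(column_idx & columns_per_block) * 2;
}

// Builds the header for the block that starts at column `offset`.
// Bit (2 * pos) marks an empty value, bit (2 * pos + 1) marks a null value.
inline std::uint64_t make_block_header(const std::vector<column_value>& values,
                                       std::size_t offset) {
    assert(offset % columns_per_block == 0);
    const std::size_t limit = std::min(values.size(), offset + columns_per_block);
    std::uint64_t header = 0;
    for (std::size_t i = offset; i < limit; ++i) {
        if (!values[i]) {
            header |= std::uint64_t{1} << (shift_correct(i) + 1);
        } else if (values[i]->empty()) {
            header |= std::uint64_t{1} << shift_correct(i);
        }
    }
    return header;
}
```

When every clustering value is non-empty, no header bit is set and the shift is never used, which is consistent with the commit's remark that the existing tests could not catch the typo.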
Avi Kivity
1071e481ed Merge "Implement support for missing columns in SSTable 3.0" from Piotr
"
Add handling for missing columns and tests for it.

There are 3 cases:
1. Number of columns in a table is smaller than 64
2. Number of columns in a table is 64 or greater
2a. and less than half of all possible columns are present in sstable
2b. and at least half of all possible columns are present in sstable

Case 1 is implemented using a bit mask; a column is present if (mask & (1 << <column number>)) == 0
Case 2a is implemented by storing the list of column numbers of the present columns
Case 2b is implemented by storing the list of column numbers of the absent columns
"

* 'haaawk/sstables3/read-missing-columns-v3' of ssh://github.com/scylladb/seastar-dev:
  sstables 3: add test for reading big dense subset of columns
  sstables 3: support reading big dense subsets of columns
  sstables 3: add test for reading big sparse subset of columns
  sstables 3: support reading big sparse subsets of columns
  sstables 3: add test for reading small subset of columns
  sstables 3: support reading small subsets of columns
2018-06-03 10:42:00 +03:00
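The three cases listed in the missing-columns merge above can be sketched as follows. This is an illustration under stated assumptions, not the reader code from the patches: the names `subset_encoding`, `choose_encoding`, `make_missing_mask`, `is_present`, and `list_columns` are hypothetical, and treating a table with exactly 64 columns as the "large" case is an assumption based on the later note that a small subset holds no more than 63 elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class subset_encoding { bitmask, present_list, absent_list };

// Picks the encoding from the total number of columns in the table and the
// number of columns actually present.
inline subset_encoding choose_encoding(std::size_t total_columns,
                                       std::size_t present_columns) {
    assert(present_columns <= total_columns);
    if (total_columns < 64) {
        return subset_encoding::bitmask;       // case 1
    }
    if (present_columns * 2 < total_columns) {
        return subset_encoding::present_list;  // fewer than half present
    }
    return subset_encoding::absent_list;       // at least half present
}

// Case 1: a set bit marks a MISSING column, so a zero bit means "present".
inline std::uint64_t make_missing_mask(const std::vector<bool>& present) {
    assert(present.size() < 64);
    std::uint64_t mask = 0;
    for (std::size_t i = 0; i < present.size(); ++i) {
        if (!present[i]) {
            mask |= std::uint64_t{1} << i;
        }
    }
    return mask;
}

inline bool is_present(std::uint64_t mask, unsigned column) {
    assert(column < 64);
    return (mask & (std::uint64_t{1} << column)) == 0;
}

// Large cases: list the column numbers that are present (want_present = true)
// or absent (want_present = false), whichever list the encoding calls for.
inline std::vector<std::size_t> list_columns(const std::vector<bool>& present,
                                             bool want_present) {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < present.size(); ++i) {
        if (present[i] == want_present) {
            out.push_back(i);
        }
    }
    return out;
}
```

Storing whichever list is shorter keeps the encoded subset small both for sparse and for dense rows.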
Avi Kivity
78182a704b partition_snapshot_row_cursor: initialize _dummy and _continuous
Debug mode view_schema_test sometimes complains that a bool member
doesn't contain in-range values, apparently in the move constructor.

Initialize them for its benefit to avoid false-positive test
failures.
Message-Id: <20180602184934.31258-1-avi@scylladb.com>
2018-06-02 19:51:36 +01:00
Avi Kivity
187ebdbe46 auth: fix possible use of disengaged optional in has_salted_hash()
untyped_result_set_row's cell data type is bytes_opt, and the
get_blob() accessor accesses the value assuming it's engaged
(relying on the caller to call has()).

has_salted_hash() calls get_blob() without calling has() beforehand,
potentially triggering undefined behavior.

Fix by using get_or() instead, which also simplifies the caller.

I observed failures in Jenkins in this area. It's hard to be sure
this is the root cause, since the failures triggered an internal
consistency assertion in asan rather than an asan report. However,
the error is hard to reproduce and the fix makes sense even if it
doesn't prevent the error.

See #3480 for the asan error.

Fixes #3480 (hopefully).
Message-Id: <20180602181919.29204-1-avi@scylladb.com>
2018-06-02 19:46:32 +01:00
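The hazard fixed by the auth commit above, and the shape of the fix, can be shown with a small stand-in. `row_sketch` is a hypothetical simplification of a result-set row that uses `std::optional<std::string>` in place of `bytes_opt`; its `has()`, `get_blob()`, and `get_or()` members mirror the accessor names mentioned in the commit message, but their signatures here are assumptions.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for a result-set row with nullable cells.
struct row_sketch {
    std::map<std::string, std::optional<std::string>> cells;

    // True only if the cell exists and holds a value.
    bool has(const std::string& name) const {
        auto it = cells.find(name);
        return it != cells.end() && it->second.has_value();
    }

    // Precondition: has(name). Dereferencing a disengaged optional would be
    // undefined behaviour, which is the bug class described in the commit.
    const std::string& get_blob(const std::string& name) const {
        assert(has(name));
        return *cells.at(name);
    }

    // Safe accessor: falls back to `def` when the cell is missing or null.
    std::string get_or(const std::string& name, std::string def) const {
        return has(name) ? *cells.at(name) : def;
    }
};

// With get_or() the caller needs no separate has() check.
inline bool has_salted_hash(const row_sketch& row) {
    return !row.get_or("salted_hash", "").empty();
}
```

Routing the access through `get_or()` removes the precondition from the call site, so a null cell simply reads as "no hash".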
Piotr Jastrzebski
2fd0566eb7 sstables 3: add test for reading big dense subset of columns
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-02 10:41:18 +02:00
Piotr Jastrzebski
829f0c5f80 sstables 3: support reading big dense subsets of columns
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-02 10:41:18 +02:00
Piotr Jastrzebski
4e4972ffea sstables 3: add test for reading big sparse subset of columns
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-02 10:40:56 +02:00
Piotr Jastrzebski
e5fb499736 sstables 3: support reading big sparse subsets of columns
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-01 21:35:28 +02:00
Piotr Jastrzebski
24e9ab4ab6 sstables 3: add test for reading small subset of columns
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-01 21:34:03 +02:00
Piotr Jastrzebski
63d45c4f24 sstables 3: support reading small subsets of columns
A small subset contains no more than 63 elements.
Support for large subsets will come in the following
patches.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-01 21:33:50 +02:00
Glauber Costa
7e3093709a backlog: add level to write progress monitor
For SSTables being written, we don't know their level yet. Add that
information to the write monitor. New SSTables will always be at L0.
Compacted SSTables will have their level determined by the compaction
process.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-31 21:09:38 -04:00
Vladimir Krivopalov
b6511d1b07 tests: Add test covering compact table with non-full clustering key.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-31 17:30:36 -07:00
Vladimir Krivopalov
47a7e78bc8 sstables: Improve clustering blocks writing, use logical clustering prefix size.
In the Origin, the size of the clustering key prefix used during
serialization is the actual length of the prefix and not the full size
as defined in schema. So the code is fixed to align with that logic.
This, in particular, is needed to write clustering blocks for RT
markers.

There is, however, a special case where the prefix size is smaller than
that of a clustering key but we still need to serialize up to the full
size. This is the case when a compact table is being used and some
rows in it are added using incomplete clustering keys (containing null
for trailing columns).
In Cassandra, these prefixes still have a full length and missing
columns are just set to 'null'. In our code those prefixes have their
real length, but since we need to serialize beyond it, we pass a flag to
indicate this.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-31 17:30:36 -07:00
Vladimir Krivopalov
3f404f19dc tests: Add test covering large clustering keys (>32 columns) for SSTables 3.x
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-31 17:30:36 -07:00
Vladimir Krivopalov
487796de85 tests: Add unit test covering empty values in clustering key.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-31 17:30:36 -07:00
Vladimir Krivopalov
0dadd4fdf3 sstables: Fix typo in clustering blocks write helper.
What was supposed to be a remainder operation turned out to be a
bitwise 'and'. This didn't show up in existing tests only because they
all had non-empty clustering values.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-31 15:12:40 -07:00
Avi Kivity
aab6b0ee27 Merge "Introduce new in-memory representation for cells" from Paweł
"
This is the first part of the first step of switching Scylla to the new
in-memory representation (IMR). It covers converting cells to the new
serialisation format. The actual structure of the cells doesn't differ much
from the original one, with the notable exception that large values are now
fragmented and linearisation needs to be explicit. Counters and collections
still partially rely on their old, custom serialisation code and their
handling is not optimal (although not significantly worse than it used
to be).

The new in-memory representation allows objects to be of varying size
and makes it possible to provide deserialisation context so that we
don't need to keep in each instance of an IMR type all the information
needed to interpret it. The structure of IMR types is described in C++
using some metaprogramming with the hopes of making it much easier to
modify the serialisation format than it would be in case of open-coded
serialisation functions.

Moreover, IMR types can own memory thanks to a limited support for
destructors and movers (the latter are not exactly the same thing as C++
move constructors hence a different name). This makes it (relatively)
easy to ensure that there is an upper bound on the size of all allocations.

For now the only thing that is converted to the IMR are atomic_cells
and collections which means that the reduction in the memory footprint
is not as big as it can be, but introducing the IMR is a big step on its
own and also paves the way towards complete elimination of unbounded
memory allocations.

The first part of this patchset contains miscellaneous preparatory
changes to various parts of the Scylla codebase. They are followed by
introduction of the IMR infrastructure. Then structure of cells is
defined and all helper functions are implemented. Next are several
treewide patches that mostly deal with propagating type information to
the cell-related operations. Finally, atomic_cell and collections are
switched to use the new IMR-based cell implementation.

The IMR is described in much more detail in imr/IMR.md added in "imr:
add IMR documentation".

Refs #2031.
Refs #2409.

perf_simple_query -c4, medians of 30 results:

        ./perf_base  ./perf_imr   diff
 read     308790.08   309775.35   0.3%
 write    402127.32   417729.18   3.9%

The same with 1 byte values:
        ./perf_base1  ./perf_imr1   diff
 read      314107.26    314648.96   0.2%
 write     463801.40    433255.96  -6.6%

The memory footprint is reduced, but that is partially due to removal of
small buffer optimisation (whether it will be restored depends on the
exact measurements of the performance impact). Generally, this series was
not expected to make a huge difference as this would require converting
whole rows to the IMR.

Memory footprint:
Before:
mutation footprint:
 - in cache: 1264
 - in memtable: 986

After:
mutation footprint:
 - in cache: 1104
 - in memtable: 866

Tests: unit (release, debug)
"

* tag 'imr-cells/v3' of https://github.com/pdziepak/scylla: (37 commits)
  tests/mutation: add test for changing column type
  atomic_cell: switch to new IMR-based cell representation
  atomic_cell: explicitly state when atomic_cell is a collection member
  treewide: require type for creating collection_mutation_view
  treewide: require type for comparing cells
  atomic_cell: introduce fragmented buffer value interface
  treewide: require type to compute cell memory usage
  treewide: require type to copy atomic_cell
  treewide: require type info for copying atomic_cell_or_collection
  treewide: require type for creating atomic_cell
  atomic_cell: require column_definition for creating atomic_cell views
  tests: test imr representation of cells
  types: provide information for IMR
  data: introduce cell
  data: introduce type_info
  imr/utils: add imr object holder
  imr: introduce concepts
  imr: add helper for allocating objects
  imr: allow creating lsa migrators for IMR objects
  imr: introduce placeholders
  ...
2018-05-31 19:21:15 +03:00
Amnon Heiman
bc7503feee Scyllatop to use prometheus by default
Scylla now exposes the Prometheus API by default. This patch changes
scyllatop to use the Prometheus API; the collectd API is still available.

The main changes in the patch:
* Move collectd-specific logic inside collectd.
* Add support for help information.
* Add command-line options to configure the Prometheus endpoint and to
  enable collectd.
* Add a prometheus class that collects information from Prometheus.

Fixes: #1541
Message-Id: <20180531124156.26336-1-amnon@scylladb.com>
2018-05-31 18:00:22 +03:00
Tomasz Grabiec
b5e42bc6a0 tests: row_cache: Do not hang when only one of the readers throws
Message-Id: <20180531122729.3314-1-tgrabiec@scylladb.com>
2018-05-31 18:00:22 +03:00
Piotr Sarna
360326fdc5 cql3: add compatibility with libjsoncpp < 1.6.0
Only libjsoncpp >= 1.6.0 offers a safe name() method for value
iterators. For older versions, deprecated memberName() is used
instead. Note that memberName() was deprecated because of its
inability to deal with embedded null characters.

Fixes #3471

Message-Id: <e64a62bfc24ef06daee238d79d557fe6ec8979d3.1527758708.git.sarna@scylladb.com>
2018-05-31 18:00:22 +03:00
Paweł Dziepak
131a47dea3 tests/mutation: add test for changing column type
With the introduction of the new in-memory representation changing
column type has become a more complex operation since it needs to handle
switch from fixed-size to variable-size types. This commit adds an
explicit test for such cases.
2018-05-31 15:51:11 +01:00
Paweł Dziepak
a040d37cd5 atomic_cell: switch to new IMR-based cell representation
This patch changes the implementation of atomic_cell and
atomic_cell_or_collection to use the data::cell implementation which is
based on the new in-memory representation infrastructure.
2018-05-31 15:51:11 +01:00
Paweł Dziepak
0ea6d14cf5 atomic_cell: explicitly state when atomic_cell is a collection member
Collections are not going to be fully converted to the IMR just yet and
still use the old serialisation format. This means that they still don't
support fragmented values very well. This patch passes the information
when an atomic_cell is created as a member of a collection so that later
we can avoid fragmenting the value in such cases.
2018-05-31 15:51:11 +01:00
Paweł Dziepak
e34ff8b4bf treewide: require type for creating collection_mutation_view 2018-05-31 15:51:11 +01:00
Paweł Dziepak
9bb1f10bb6 treewide: require type for comparing cells 2018-05-31 15:51:11 +01:00
Paweł Dziepak
aa25f0844f atomic_cell: introduce fragmented buffer value interface
As a preparation for the switch to the new cell representation this
patch changes the type returned by atomic_cell_view::value() to one that
requires explicit linearisation of the cell value. Even though the value
is still implicitly linearised (and only when managed by the LSA) the
new interface is the same as the target one so that no more changes to
its users will be needed.
2018-05-31 15:51:11 +01:00
Paweł Dziepak
ec9d166a4f treewide: require type to compute cell memory usage 2018-05-31 15:51:11 +01:00
Paweł Dziepak
418c159057 treewide: require type to copy atomic_cell 2018-05-31 15:51:11 +01:00
Paweł Dziepak
27014a23d7 treewide: require type info for copying atomic_cell_or_collection 2018-05-31 15:51:11 +01:00
Paweł Dziepak
e9d6fc48ac treewide: require type for creating atomic_cell 2018-05-31 15:51:11 +01:00
Paweł Dziepak
93130e80fb atomic_cell: require column_definition for creating atomic_cell views 2018-05-31 15:51:11 +01:00
Paweł Dziepak
b25cc61a13 tests: test imr representation of cells 2018-05-31 15:51:11 +01:00
Paweł Dziepak
43b216b43d types: provide information for IMR 2018-05-31 15:51:11 +01:00
Paweł Dziepak
eec33fda14 data: introduce cell
This commit introduces cell serializers and views based on the in-memory
representation infrastructure. The code doesn't assume anything about
how the cells are stored, they can be either a part of another IMR
object (once the rows are converted to the IMR) or separate objects
(just like current atomic_cell).
2018-05-31 15:51:11 +01:00
Duarte Nunes
f8626c7c93 tests/view_schema_test: Test view correctness under base schema changes
Reproducer for #3443.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180530194536.51202-2-duarte@scylladb.com>
2018-05-31 12:10:50 +03:00
Duarte Nunes
c4f267bdfe database: Refresh view dependent fields when altering base
A view schema's view_info contains the id of the base regular column
that view includes in its primary key. Since the column id of a
particular column can potentially change with a new schema version, we
need to refresh the stored column id. We weren't doing that when
unselected base columns are added, and this patch fixes it by
triggering an update of the view schema when base columns are added
and the view contains a base regular column in its PK.

Fixes #3443

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180530194536.51202-1-duarte@scylladb.com>
2018-05-31 12:10:49 +03:00
Paweł Dziepak
544b3c9a34 data: introduce type_info
This patch introduces type_info class which contains all type
information needed by IMR deserialisation contexts.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
4929c1f39a imr/utils: add imr object holder
imr::object<> is an owning pointer to an IMR object. It is LSA-aware.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
fd47858755 imr: introduce concepts
This commit adds type traits and concepts for sizers, serializers and
writers that help explicitly specify requirements of various interfaces.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
28ea36a686 imr: add helper for allocating objects
IMR objects may own memory. object_allocator takes care of allocating
memory for all owned objects during the serialisation of their owner.

In practice a writer of the parent object would accept a helper object
created by object_allocator. That helper object would be either
responsible for computing the size of buffers that have to be allocated
or perform the actual serialisation in the same two phase manner as it
is done for the parent IMR object.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
79941f2fc7 imr: allow creating lsa migrators for IMR objects
This patch introduces helpers for creating LSA migrators from IMR
deserialisation contexts and context factories.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
5ddb118c78 imr: introduce placeholders
In some cases the actual value of an IMR object is not known at
serialisation time. If the type is fixed-size we can use a placeholder
to defer writing it to a more convenient moment.
2018-05-31 10:09:01 +01:00
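The placeholder idea from the commit above can be sketched independently of the IMR code: reserve space for a fixed-size value, keep serialising, and fill the slot once the value is known. `placeholder_sketch` and `serialize_non_empty` are hypothetical names, the byte layout (a native-endian 32-bit count followed by NUL-terminated strings) is invented for the example, and none of this reflects the actual IMR interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <type_traits>
#include <vector>

// Reserves sizeof(T) bytes in the output buffer and remembers their offset
// so that the value can be written later.
template <typename T>
class placeholder_sketch {
    static_assert(std::is_trivially_copyable<T>::value,
                  "placeholders only work for fixed-size, trivially copyable types");
    std::size_t _offset;
public:
    explicit placeholder_sketch(std::vector<std::uint8_t>& out)
        : _offset(out.size()) {
        out.resize(out.size() + sizeof(T));
    }
    // Writes the deferred value into the reserved slot.
    void fill(std::vector<std::uint8_t>& out, const T& value) const {
        assert(_offset + sizeof(T) <= out.size());
        std::memcpy(out.data() + _offset, &value, sizeof(T));
    }
};

// Serialises only the non-empty strings, prefixed by their count. The count
// is not known until the loop finishes, hence the placeholder.
inline std::vector<std::uint8_t> serialize_non_empty(const std::vector<std::string>& items) {
    std::vector<std::uint8_t> out;
    placeholder_sketch<std::uint32_t> count_slot(out);
    std::uint32_t count = 0;
    for (const auto& s : items) {
        if (s.empty()) {
            continue;
        }
        out.insert(out.end(), s.begin(), s.end());
        out.push_back(0);  // NUL terminator
        ++count;
    }
    count_slot.fill(out, count);
    return out;
}
```

The sketch stores an offset rather than a pointer because the buffer may reallocate while the remaining fields are appended.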
Paweł Dziepak
8c38f09fbc tests/imr: add tests for destructor and mover methods 2018-05-31 10:09:01 +01:00
Paweł Dziepak
fa7b080443 imr: introduce destructor and mover methods
This patch introduces destructors and movers for IMR objects which
enables them to own memory. Custom destructors and movers can be
defined by specialising appropriate classes.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
c02bfb942d imr/compound: introduce tagged_type<Tag, T> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
a29a88c9d9 tests/imr/compound: add tests for structure<...> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
4f51901dfe imr/compound: introduce structure<...> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
466d91f652 tests/imr/compound: add tests for variant<Ts...> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
8e4c8ce2c4 imr/compound: introduce variant<Ts...> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
7c28c9eda8 tests/imr: add test for optional<T> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
6d7b205d1a imr: introduce optional<T> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
eb2479fa9a tests: add test for new in memory representation 2018-05-31 10:09:01 +01:00
Paweł Dziepak
a995fb337c imr: introduce fundamental types
This patch introduces fundamental IMR types: a set of flags, a POD type
and a buffer.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
5f960beca1 imr: add IMR documentation 2018-05-31 10:09:01 +01:00
Paweł Dziepak
0092076167 tests: add helpers for generating random data 2018-05-31 10:09:01 +01:00
Paweł Dziepak
cc76480174 tests: introduce tests for metaprogramming helpers 2018-05-31 10:09:01 +01:00
Paweł Dziepak
ba5e64383a utils: add metaprogramming helper functions 2018-05-31 10:09:01 +01:00
Paweł Dziepak
5845d52632 idl: allow fragmented bytes_view in serialisation
This patch adds new way of serialising bytes and sstring objects in the
IDL. Using write_fragmented_<field-name>() the caller can pass a range
of fragments that would be serialised without linearising the buffer.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
c41b9fc7ec utils: add fragment range
This patch introduces a FragmentRange concept which is the minimal interface all
classes representing a fragmented buffer should satisfy.
2018-05-31 10:09:01 +01:00
Vladimir Krivopalov
0886c189bf tests: Fix test files to use correct previous row sizes.
Since sstabledump and Cassandra do not use row size values, the new
files have been validated to be identical to files generated by
Cassandra with the same data inserted at same timestamps.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-30 18:18:35 -07:00
Vladimir Krivopalov
2d86fcc8ab sstables: Fix calculation of previous row size for SSTables 3.x
The previous code incorrectly calculated sizes of previous rows while
writing SSTables in 3.x ('m') format.
The problem with detecting this issue was that neither sstabledump nor
Cassandra 3.x itself use those values, as of today, they are simply
ignored when data is read from files.

Still, we want to be compatible and write correct values as they may be
of use in the future.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-30 18:14:12 -07:00
Vladimir Krivopalov
71f7f45d64 sstables: Factor out code building promoted index blocks into separate helpers.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-30 12:40:18 -07:00
Nadav Har'El
a1cbeeffcd tests/view_complex_test.cc: fix and enable buggy test
tests/view_complex_test.cc contained a #ifdef'ed-out test claiming to
be a reproducer for issue #3362. Unfortunately, it is not - after
earlier commits the only reason this test still fails is a mistake in
the test, which expects 0 rows in a case where the real result is 1 row.
Issue #3362 does *not* have to be fixed to fix this test.

So this patch fixes the broken test, and enables it. It also adds comments
explaining what this test is supposed to do, and why it works the way it
does.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180530142214.29398-1-nyh@scylladb.com>
2018-05-30 15:39:25 +01:00
Avi Kivity
9999e0e6bc Merge "Implement support for static rows in SSTable 3.0" from Piotr
"
Add handling for static rows and tests for it.
"

* 'haaawk/sstables3/read-static-v1' of ssh://github.com/scylladb/seastar-dev:
  sstable_3_x_test: Add test_uncompressed_compound_static_row_read
  sstable_3_x_test: add test_uncompressed_static_row_read
  flat_mutation_reader_assertions: improve static row assertions
  data_consume_rows_context_m: Implement support for static rows
  mp_row_consumer_m: Implement support for static rows
  mp_row_consumer_m: Extract fill_cells
2018-05-30 17:17:17 +03:00
Paweł Dziepak
62d0639fe9 Merge "Avoid reactor stalls in cache with large partitions" from Tomasz
"
We currently suffer from reactor stalls caused by non-preemptible processing
of large partitions in the following places:

  (1) dropping partition entries from cache or memtables does not defer

  (2) dropping partition versions abandoned by detached snapshots does not defer

  (3) merging of partition versions when snapshots go away does not defer

  (4) cache update from memtable processes partition entries without deferring (#2578)

  (5) partition entries are upgraded to new schema atomically

This series fixes problems (1), (2) and (4), but not (3) and (5).

(1) and (2) are fixed by introducing mutation_cleaner objects, which are
containers for garbage partition versions that delay the actual freeing.
Freeing happens from memory reclaimers and is incremental.

(3) and (5) are not solved yet.

(4) is solved by having partition merging process partitions with row
granularity and defer in the middle of a partition. In order to preserve update
atomicity on partition level as perceived by reads, when the update starts we
create a snapshot of the current version of the partition and process the memtable
entry by inserting data into a separate partition version. This way, if the update
defers in the middle of a partition, reads can still go to the old version and
not see partial writes. Snapshots are marked with phase numbers, and reads
will use the previous phase until the whole partition is updated. When the partition
is finally merged, the snapshots go away and the new version will eventually
be merged into the old version. Due to (3), however, this merging may still add
latency to the update path.

Remaining work:

  - Solving problem (3). I think the approach to take here would be to
    move the task of merging versions to the background, maybe into mutation_cleaner.

  - Merging range tombstones incrementally.

Performance
===========

Performance improvements were evaluated using tests/perf_row_cache_update -c1 -m1G,
which measures time it takes to update cache from memtable for various workloads
and schemas.

For large partition with lots of small rows we see a significant reduction of
scheduling latency from ~550ms to ~23ms. The cause of remaining latency is
problem (3) stated above. The run time is reduced by 70%.

For small partition case without clustering columns we see no degradation.

For small partition case with clustering key, but only 3 small rows per partition,
we see a 30% degradation in run time.

For large partition with lots of range tombstones we see degradation of 15% in
run time and scheduling latency.

Below you can see full statistics for cache update run time:

=== Small partitions, no overwrites:

Before:

  avg = 433.965155
  stdev = 35.958024
  min = 340.093201
  max = 468.564514

After:

  avg = 436.929447 (+1%)
  stdev = 37.130237
  min = 349.410339
  max = 489.953400

=== Small partition with a few rows:

Before:

  avg = 315.379316
  stdev = 30.059120
  min = 240.340561
  max = 342.408295

After:

  avg = 407.232691 (+30%)
  stdev = 53.918717
  min = 269.514648
  max = 444.846649

=== Large partition, lots of small rows:

Before:

  avg = 412.870689
  stdev = 227.411317
  min = 286.990631
  max = 1263.417847

After:

  avg = 124.351705 (-70%)
  stdev = 4.705762
  min = 110.063255
  max = 129.643387

=== Large partition, lots of range tombstones:

Before:

  avg = 601.172644
  stdev = 121.376866
  min = 223.502136
  max = 874.111572

After:

  avg = 695.627588 (+15%)
  stdev = 135.057004
  min = 337.173950
  max = 784.838745
"

* tag 'tgrabiec/clear-gently-all-partitions-v3' of github.com:tgrabiec/scylla:
  mvcc: Use small_vector<> in partition_snapshot_row_cursor
  utils: Extract small_vector.hh
  mvcc: Erase rows gradually in apply_to_incomplete()
  mvcc: partition_snapshot_row_cursor: Avoid row copying in consume() when possible
  cache: real_dirty_memory_accounter: Move unpinning out of the hot path
  mvcc: partition_snapshot_row_cursor: Reduce lookups in ensure_entry_if_complete()
  mutation_partition: Reduce row lookups in apply_monotonically()
  cache: Release dirty memory with row granularity
  cache: Defer during partition merging
  mvcc: partition_snapshot_row_cursor: Introduce consume_row()
  mvcc: partition_snapshot_row_cursor: Introduce maybe_refresh_static()
  mvcc: Make apply_to_incomplete() work with attached versions
  cache: Propagate phase to apply_to_incomplete()
  cache: Prepare for incremental apply_to_incomplete()
  Introduce a coroutine wrapper
  tests: mvcc: Encapsulate memory management details
  tests: cache: Take into account that update() may defer
  cache: real_dirty_memory_accounter: Allow construction without memtable
  cache: Extract real_dirty_memory_accounter
  mvcc: Destroy memtable partition versions gently
  memtable: Destroy partitions incrementally from clear_gently()
  mvcc: Remove rows from tracker gently
  cache: Destroy partition versions incrementally
  Introduce mutation_cleaner
  mvcc: Introduce partition_version_list
  mvcc: Fix move constructor of partition_version_ref() not preserving _unique_owner
  database: Add API for incremental clearing of partition entries
  cache: Define trivial methods inline
  tests: Improve perf_row_cache_update
  mutation_reader: Make empty mutation source advertize no partitions
2018-05-30 14:12:29 +01:00
Tomasz Grabiec
4561e97efe mvcc: Use small_vector<> in partition_snapshot_row_cursor
I measured 8% improvement in cache update throughput for small
partitions.
2018-05-30 14:41:41 +02:00
Tomasz Grabiec
db36ff0643 utils: Extract small_vector.hh 2018-05-30 14:41:41 +02:00
Tomasz Grabiec
5b59df3761 mvcc: Erase rows gradually in apply_to_incomplete()
So that we avoid double-buffering partitions.
2018-05-30 14:41:41 +02:00
Tomasz Grabiec
b7fdf4309f mvcc: partition_snapshot_row_cursor: Avoid row copying in consume() when possible 2018-05-30 14:41:41 +02:00
Tomasz Grabiec
8d66f6da58 cache: real_dirty_memory_accounter: Move unpinning out of the hot path
Instead of calling into real dirty memory manager per row, call it per
deferring point.
2018-05-30 14:41:41 +02:00
Tomasz Grabiec
60000b98a4 mvcc: partition_snapshot_row_cursor: Reduce lookups in ensure_entry_if_complete()
Leverage the fact that it is called with monotonically increasing
positions, and avoid lookups in case the current target entry is the
successor of desired position. Reduces cache update latency by 40%
for large partition in a time-series workload.
2018-05-30 14:41:41 +02:00
Tomasz Grabiec
82e8217ba0 mutation_partition: Reduce row lookups in apply_monotonically()
This change speeds up merging of partition versions with many rows in
case the merged version has many rows which fall between existing rows
in the target version. This is often the case for time-series
workloads, which insert rows at the front. Lookup can be avoided for
all but the first row in the stride because we already have a
reference to the successor in the target tree, we only need to check
that the current entry in the target tree is still the successor.

This change greatly reduces amount of lookups per row during version
merging of large partitions in time-series workloads.
2018-05-30 14:41:41 +02:00
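The lookup-avoidance described in the apply_monotonically() commit above can be illustrated with a standard ordered set. This is only a sketch of the idea (reuse the known successor as an insertion hint while it remains the successor), assuming `std::set<int>` as a stand-in for the row tree; `merge_with_successor_hint` is a hypothetical name, and the function returns the number of full lookups purely to make the saving visible.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

// Merges the sorted keys in `source` into `target` and returns how many
// full tree lookups (lower_bound calls) were needed.
inline std::size_t merge_with_successor_hint(std::set<int>& target,
                                             const std::vector<int>& source) {
    assert(std::is_sorted(source.begin(), source.end()));
    std::size_t lookups = 0;
    auto successor = target.end();
    bool have_successor = false;
    for (int key : source) {
        // The cached iterator is still the successor of `key` if `key` sorts
        // strictly between the entry before it and the entry it points at.
        const bool hint_ok = have_successor
            && (successor == target.end() || key < *successor)
            && (successor == target.begin() || *std::prev(successor) < key);
        if (!hint_ok) {
            successor = target.lower_bound(key);
            have_successor = true;
            ++lookups;
        }
        if (successor != target.end() && *successor == key) {
            continue;  // already present; real code would merge the rows here
        }
        // Hinted insert: amortised constant time when the hint is correct.
        target.insert(successor, key);
    }
    return lookups;
}
```

For a stride of new keys that all fall before the same existing entry, only the first key pays for a lookup, which matches the time-series case the commit describes.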
Tomasz Grabiec
5bc201df10 cache: Release dirty memory with row granularity 2018-05-30 14:41:41 +02:00
Tomasz Grabiec
70c72773be cache: Defer during partition merging 2018-05-30 14:41:41 +02:00
Tomasz Grabiec
051bb74583 mvcc: partition_snapshot_row_cursor: Introduce consume_row() 2018-05-30 14:41:41 +02:00
Tomasz Grabiec
518fd7083f mvcc: partition_snapshot_row_cursor: Introduce maybe_refresh_static()
A version of maybe_refresh() optimized for snapshots which are
no longer populated. Will be used to implement cache update from
memtable.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
c653137b2b mvcc: Make apply_to_incomplete() work with attached versions
Needed before making it preemptible. We cannot steal the entry since
we may need to resume merging later.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
1792be3697 cache: Propagate phase to apply_to_incomplete()
It will be needed to create snapshots with appropriate phase markers.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
494cb3f3da cache: Prepare for incremental apply_to_incomplete()
Incremental merging will be implemented by the means of resumable
functions, which return stop_iteration::no when not yet
finished. We're not using futures, so that the caller can do work
around preemption points as well.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
a19c5cbc16 Introduce a coroutine wrapper
Represents a deferring operation which defers cooperatively with the caller.

The operation is started and resumed by calling run(), which returns
with stop_iteration::no whenever the operation defers and is not
completed yet. When the operation is finally complete, run() returns
with stop_iteration::yes.

This allows the caller to:

 1) execute some post-defer and pre-resume actions atomically

 2) have control over when the operation is resumed and in which context,
    in particular the caller can cancel the operation at deferring points.

It will be used to implement deferring partition_version::apply_to_incomplete().
2018-05-30 14:41:40 +02:00
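The run()/stop_iteration protocol described in the coroutine wrapper commit above can be sketched as follows. `stop_iteration` here is a plain enum standing in for the Seastar type, and `coroutine_sketch` and `make_gentle_clear` are hypothetical names chosen for the illustration; the real wrapper may differ in both interface and implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Stand-in for the Seastar type of the same name.
enum class stop_iteration { no, yes };

// A resumable operation: each run() call performs one bounded chunk of work
// and reports whether the whole operation has finished.
class coroutine_sketch {
    std::function<stop_iteration()> _step;
public:
    explicit coroutine_sketch(std::function<stop_iteration()> step)
        : _step(std::move(step)) {}
    stop_iteration run() { return _step(); }
};

// Example operation: drop at most `batch` elements per run() call. The caller
// must keep `items` alive for as long as the returned object is used.
inline coroutine_sketch make_gentle_clear(std::vector<int>& items, std::size_t batch) {
    assert(batch > 0);
    return coroutine_sketch([&items, batch] {
        const std::size_t n = std::min(batch, items.size());
        items.resize(items.size() - n);
        return items.empty() ? stop_iteration::yes : stop_iteration::no;
    });
}
```

Because control returns to the caller after every step, the caller decides when to resume and can drop the operation at any deferring point, which is the property the commit message highlights.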
Tomasz Grabiec
6bd1a04c10 tests: mvcc: Encapsulate memory management details
Currently tests have a single LSA region lock around construction of
managed objects, their manipulation, and access. This way we avoid the
complexity of dealing with allocating sections. That will not be
possible once apply_to_incomplete() is changed to enter an allocating
section itself because this requires the region to be unlocked at
entry. The tests will have to take more fine-grained locks. That is
somewhat tricky and would add a lot of noise to tests. This patch will
make things easier by abstracting LSA management, among other things,
inside mvcc_container and mvcc_partition classes.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
f6e21accc7 tests: cache: Take into account that update() may defer
The test incorrectly assumed that once update() is started the
cache will return only versions from last_generation. This will not
hold once we start to defer during partition merging.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
c10d9e1607 cache: real_dirty_memory_accounter: Allow construction without memtable 2018-05-30 14:41:40 +02:00
Tomasz Grabiec
6ecda1ccd7 cache: Extract real_dirty_memory_accounter 2018-05-30 14:41:40 +02:00
Tomasz Grabiec
3f19f76c67 mvcc: Destroy memtable partition versions gently
Now all snapshots will have a mutation_cleaner which they will use to
gently destroy freed partition_version objects.

Destruction of memtable entries during cache update is also using the
gentle cleaner now. We need to have a separate cleaner for memtable
objects even though they're owned by cache's region, because memtable
versions must be cleared without a cache_tracker.

Each memtable will have its own cleaner, which will be merged with the
cache's cleaner when memtable is merged into cache.

Fixes some sources of reactor stalls on cache update when there are
large partition entries in memtables.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
c2d702622e memtable: Destroy partitions incrementally from clear_gently()
Destroying large partitions may stall the reactor for a long
time. Avoid this by clearing incrementally.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
81d231f35b mvcc: Remove rows from tracker gently
Some partitions may have a lot of rows. Better to iterate over them
incrementally as part of clear_gently() to avoid stalls.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
f0c1edd672 cache: Destroy partition versions incrementally
Instead of destroying whole partition_versions at once, we will do that
gently using mutation_cleaner to avoid reactor stalls.

Large deletions could happen when a large partition gets invalidated,
upgraded to a new schema, or when it's abandoned by a detached snapshot.

Refs #3289.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
e0803ff71e Introduce mutation_cleaner
Used for collecting unused partition_version objects and freeing them
incrementally. Will be used for both cache and memtables.
2018-05-30 14:41:39 +02:00
Tomasz Grabiec
e5aa02efeb mvcc: Introduce partition_version_list 2018-05-30 12:18:56 +02:00
Tomasz Grabiec
ca1ee93577 mvcc: Fix move constructor of partition_version_ref() not preserving _unique_owner
We didn't rely on that yet, it seems, but will.

(cherry picked from commit 21a744337de01f699d5c5c340483ad23cabab2ee)
2018-05-30 12:18:56 +02:00
Tomasz Grabiec
40cc766cf2 database: Add API for incremental clearing of partition entries
Partitions can get very large. Destroying them all at once can stall
the reactor for a significant amount of time. We want to avoid that by
doing destruction incrementally, deferring in between. A new API is
added for that at various levels:

  stop_iteration clear_gently() noexcept;

It returns stop_iteration::yes when the object is fully cleared and
can be now destroyed quickly. So a deferring destruction can look like
this:

  return repeat([this] { return clear_gently(); });

The reason why clear_gently() doesn't return a future<> itself is that some
contexts cannot defer, like memory reclamation.
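
A rough sketch of what an implementation of this API could look like, under stated assumptions: big_partition and its batch size are hypothetical, and stop_iteration is a plain enum standing in for the Seastar type. Each call frees a bounded amount of state and reports whether anything is left.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for seastar::stop_iteration (assumption: simplified to a plain enum).
enum class stop_iteration { no, yes };

// Hypothetical object holding many rows; not a real Scylla class.
class big_partition {
    std::vector<int> _rows;
    std::size_t _batch; // maximum rows released per call; must be > 0
public:
    big_partition(std::size_t rows, std::size_t batch) : _rows(rows), _batch(batch) {}

    std::size_t size() const noexcept { return _rows.size(); }

    // Releases at most _batch rows. Returns stop_iteration::yes once the
    // object is empty and can be destroyed quickly.
    stop_iteration clear_gently() noexcept {
        std::size_t n = _rows.size() < _batch ? _rows.size() : _batch;
        _rows.resize(_rows.size() - n); // shrinking never allocates
        return _rows.empty() ? stop_iteration::yes : stop_iteration::no;
    }
};
```

A caller that cannot defer, such as memory reclamation, can simply loop until stop_iteration::yes is returned; a caller that can defer would yield between calls, as in the repeat() example above.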
2018-05-30 12:18:56 +02:00
Tomasz Grabiec
2f75212ca4 cache: Define trivial methods inline
They have users in a different compilation unit, in partition_version.cc
2018-05-30 12:18:56 +02:00
Tomasz Grabiec
25b3641d9e tests: Improve perf_row_cache_update
We now test more kinds of workloads:
 - small partitions with no clustering key
 - large partition with lots of small rows
 - large partition with lots of range tombstones

We also collect statistics about scheduling latency induced by cache
update.

Example output:

Small partitions, no overwrites:
update: 356.809113 [ms], stall: {ticks: 396, min: 0.006867 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.358102 [ms]}, cache: 257/257 [MB] LSA: 257/257 [MB] std free: 83 [MB]
update: 337.542999 [ms], stall: {ticks: 373, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.358102 [ms]}, cache: 514/514 [MB] LSA: 514/514 [MB] std free: 83 [MB]
update: 383.485291 [ms], stall: {ticks: 425, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 771/788 [MB] LSA: 771/788 [MB] std free: 83 [MB]
update: 574.968811 [ms], stall: {ticks: 634, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.629722 [ms], max: 1.955666 [ms]}, cache: 879/917 [MB] LSA: 879/917 [MB] std free: 83 [MB]
update: 411.541138 [ms], stall: {ticks: 455, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.358102 [ms]}, cache: 787/835 [MB] LSA: 787/835 [MB] std free: 83 [MB]
update: 368.491211 [ms], stall: {ticks: 408, min: 0.001332 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 750/790 [MB] LSA: 750/790 [MB] std free: 83 [MB]
update: 343.671967 [ms], stall: {ticks: 380, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 734/769 [MB] LSA: 734/769 [MB] std free: 83 [MB]
update: 320.277283 [ms], stall: {ticks: 357, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 724/753 [MB] LSA: 724/753 [MB] std free: 83 [MB]
update: 310.583282 [ms], stall: {ticks: 344, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 714/740 [MB] LSA: 714/740 [MB] std free: 83 [MB]
update: 303.627106 [ms], stall: {ticks: 338, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.955666 [ms]}, cache: 707/731 [MB] LSA: 707/731 [MB] std free: 83 [MB]
update: 296.742523 [ms], stall: {ticks: 330, min: 0.001332 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 701/724 [MB] LSA: 701/724 [MB] std free: 83 [MB]
update: 286.598541 [ms], stall: {ticks: 319, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 697/719 [MB] LSA: 697/719 [MB] std free: 83 [MB]
update: 288.649323 [ms], stall: {ticks: 321, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 694/715 [MB] LSA: 694/715 [MB] std free: 83 [MB]
update: 282.069916 [ms], stall: {ticks: 314, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 692/712 [MB] LSA: 692/712 [MB] std free: 83 [MB]
update: 292.462036 [ms], stall: {ticks: 325, min: 0.001917 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 689/708 [MB] LSA: 689/708 [MB] std free: 83 [MB]
update: 274.390442 [ms], stall: {ticks: 305, min: 0.001332 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 687/705 [MB] LSA: 687/705 [MB] std free: 83 [MB]
invalidation: 172.617508 [ms]
Large partition, lots of small rows:
update: 262.132721 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.005722 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 268.650944 [ms]}, cache: 187/188 [MB] LSA: 187/188 [MB] std free: 82 [MB]
update: 281.359467 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.017084 [ms], 99%: 0.017084 [ms], max: 322.381152 [ms]}, cache: 375/376 [MB] LSA: 375/376 [MB] std free: 82 [MB]
update: 287.229065 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.017084 [ms], 99%: 0.017084 [ms], max: 322.381152 [ms]}, cache: 563/564 [MB] LSA: 563/564 [MB] std free: 82 [MB]
update: 1294.816284 [ms], stall: {ticks: 4, min: 0.001917 [ms], 50%: 0.005722 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 1386.179840 [ms]}, cache: 586/625 [MB] LSA: 586/625 [MB] std free: 82 [MB]
update: 845.022461 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.005722 [ms], 90%: 0.017084 [ms], 99%: 0.017084 [ms], max: 962.624896 [ms]}, cache: 439/475 [MB] LSA: 439/475 [MB] std free: 82 [MB]
update: 380.335938 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 386.857376 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 477.234680 [ms], stall: {ticks: 4, min: 0.002760 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 525.955017 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 548.003784 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.006866 [ms], 90%: 0.017084 [ms], 99%: 0.017084 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 528.697937 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 609.292603 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.005722 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 575.762451 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.017084 [ms], 99%: 0.017084 [ms], max: 668.489536 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 530.801392 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 535.948364 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.017084 [ms], 99%: 0.017084 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 527.143555 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.020501 [ms], 99%: 0.020501 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 521.869202 [ms], stall: {ticks: 4, min: 0.002760 [ms], 50%: 0.004768 [ms], 90%: 0.017084 [ms], 99%: 0.017084 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
invalidation: 173.069733 [ms]
Large partition, lots of range tombstones:
update: 224.003220 [ms], stall: {ticks: 4, min: 0.001917 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 268.650944 [ms]}, cache: 52/52 [MB] LSA: 52/52 [MB] std free: 82 [MB]
update: 570.882874 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 105/105 [MB] LSA: 105/105 [MB] std free: 82 [MB]
update: 577.249878 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 158/158 [MB] LSA: 158/158 [MB] std free: 82 [MB]
update: 580.239624 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 211/211 [MB] LSA: 211/211 [MB] std free: 82 [MB]
update: 614.187134 [ms], stall: {ticks: 4, min: 0.001917 [ms], 50%: 0.004768 [ms], 90%: 0.011864 [ms], 99%: 0.011864 [ms], max: 668.489536 [ms]}, cache: 264/264 [MB] LSA: 264/264 [MB] std free: 82 [MB]
update: 618.709229 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.003973 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 317/317 [MB] LSA: 317/317 [MB] std free: 82 [MB]
update: 626.943359 [ms], stall: {ticks: 4, min: 0.001598 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 369/370 [MB] LSA: 369/370 [MB] std free: 82 [MB]
update: 602.873474 [ms], stall: {ticks: 4, min: 0.001917 [ms], 50%: 0.003973 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 422/423 [MB] LSA: 422/423 [MB] std free: 82 [MB]
update: 617.522583 [ms], stall: {ticks: 4, min: 0.001598 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 475/475 [MB] LSA: 475/475 [MB] std free: 82 [MB]
update: 627.291138 [ms], stall: {ticks: 4, min: 0.001598 [ms], 50%: 0.004768 [ms], 90%: 0.011864 [ms], 99%: 0.011864 [ms], max: 668.489536 [ms]}, cache: 528/528 [MB] LSA: 528/528 [MB] std free: 82 [MB]
update: 623.720886 [ms], stall: {ticks: 4, min: 0.001598 [ms], 50%: 0.003973 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 581/581 [MB] LSA: 581/581 [MB] std free: 82 [MB]
update: 630.735596 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 634/634 [MB] LSA: 634/634 [MB] std free: 82 [MB]
update: 2776.525635 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 2874.382592 [ms]}, cache: 687/687 [MB] LSA: 687/687 [MB] std free: 82 [MB]
2018-05-30 12:18:56 +02:00
Tomasz Grabiec
bb96518cc5 mutation_reader: Make empty mutation source advertise no partitions
So that perf_row_cache_update will always populate cache.
2018-05-30 12:18:56 +02:00
Avi Kivity
dd26cf1490 Merge "db/view: Clarifications to range movement scenarios" from Duarte
"
This series provides reasoning and clarification for the current
structure of mutate_MV(), and how we handle some scenarios related to
range movements.
"

* 'materialized-views/clarifications/v3' of github.com:duarten/scylla:
  db/view: Remove ifdef'd Java code
  db/view: Ignore scenario where base replica hasn't joined the ring
  db/view: Handle case when base has no paired view replica
2018-05-29 18:51:06 +03:00
Avi Kivity
928af7701c Merge "Implement reading clustering columns from SSTables 3.x" from Piotr
"
Add handling for clustering columns and tests for it.
"

* 'haaawk/sstables3/read-ck-v3' of ssh://github.com/scylladb/seastar-dev:
  Add test_uncompressed_compound_ck_read for SSTables 3.x
  Add test_uncompressed_simple_read for SSTables 3.x
  Implement reading clustering key from SSTables 3.x
  column_translation: cache fixed value lengths for ck
  data_consume_rows_context_m: use cached fixed column value lengths
  column_translation: store fixed lengths of column values
  consume_row_start: change type of clustering key
  Rename ROW_BODY state to CLUSTERING_ROW
2018-05-29 18:49:26 +03:00
Piotr Jastrzebski
d2300bc5a9 sstable_3_x_test: Add test_uncompressed_compound_static_row_read
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-29 14:55:36 +02:00
Piotr Jastrzebski
6639ef8769 sstable_3_x_test: add test_uncompressed_static_row_read
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-29 14:55:11 +02:00
Piotr Jastrzebski
18cced2edc flat_mutation_reader_assertions: improve static row assertions
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-29 14:52:55 +02:00
Piotr Jastrzebski
6ab660880d data_consume_rows_context_m: Implement support for static rows
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-29 14:52:14 +02:00
Piotr Jastrzebski
c9c2fc8e4b mp_row_consumer_m: Implement support for static rows
Add consumer_m::consume_static_row_start

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-29 14:50:15 +02:00
Piotr Jastrzebski
f018e5dfed mp_row_consumer_m: Extract fill_cells
This lambda will be used not only for regular columns
but also for static columns.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-29 14:46:02 +02:00
Laura Novich
e053da6f51 scylla_setup: adjust language
Edited the text for the scylla setup, improving readability for the prompts
with regards to grammar and usage.

Signed-off-by: Laura Novich <laura@scylladb.com>
Message-Id: <CAGcEH3Xa6TFy=_rdz_=NP0b23vEDZmfRQzAdxV-f04C1p+AzTw@mail.gmail.com>
2018-05-29 09:56:41 +03:00
Piotr Sarna
ffe52681ea storage_proxy: add mv stats to write handler
Previous patch for issue 3416 did not cover passing write stats
to write response handler, which results in some write stats
being incorrectly counted as user write stats, while they belong
to materialized views.
This one fixes that by passing correct write stats reference
to write response handler constructor.

Also at: https://github.com/psarna/scylla/commits/fix_3416_again

Closes #3416
Message-Id: <53ef3cc96ccadfdad8992d92ed6a41473419eb0a.1527510473.git.sarna@scylladb.com>
2018-05-28 17:50:49 +01:00
Piotr Jastrzebski
a7a152b27f Add test_uncompressed_compound_ck_read for SSTables 3.x
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-28 18:28:14 +02:00
Piotr Jastrzebski
5c0f9f17ba Add test_uncompressed_simple_read for SSTables 3.x
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-28 18:28:14 +02:00
Piotr Jastrzebski
c89b485871 Implement reading clustering key from SSTables 3.x
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-28 18:28:14 +02:00
Piotr Jastrzebski
101e38f19b column_translation: cache fixed value lengths for ck
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-28 18:28:14 +02:00
Piotr Jastrzebski
b7149d349c data_consume_rows_context_m: use cached fixed column value lengths
Take them from column_translation instead of parsing the type every
time.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-28 18:28:14 +02:00
Piotr Jastrzebski
9d41c2299d column_translation: store fixed lengths of column values
We don't need to parse the type every time.
It's better to cache the fixed lengths of column values
for the sstable.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-28 18:28:14 +02:00
Piotr Jastrzebski
351c9e5d65 consume_row_start: change type of clustering key
Clustering key in 3.x format is stored differently
so it's easier to create a vector of temporary buffers
instead of a single block of concatenated bytes.

Each temporary buffer stores a value of a single
clustering column.

This is because the way clustering key is stored on disk
in SSTables 3.x is not the same as the way we store it
internally.

This means that we have to first read a value of every
clustering column into temporary_buffer and only then
we can create clustering key using a vector of those
temporary buffers.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-28 18:27:56 +02:00
Amnon Heiman
1f28e97458 sstable: Add has_partition_key method
This patch adds a helper function to sstable to check if it has a given
partition key.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-05-28 18:12:17 +03:00
Amnon Heiman
cd1f4ccb89 keys_test: add a test for nodetool_style string
This patch adds a test for single and compound partition keys that are
created from a nodetool-style string.
2018-05-28 18:12:12 +03:00
Amnon Heiman
c517ee8353 keys: Add from_nodetool_style_string factory method
Based on:
8daaf9833a

This patch adds a from_nodetool_style_string factory method to partition_key.
The string format follows the nodetool format, in which the columns of the
partition key are separated by ':'.
For example, if a partition key has two columns, col1 and col2, the
partition key that has col1 = val1 and col2 = val2 is written as:

val1:val2
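
The splitting step can be sketched as below. This is an illustration only: split_nodetool_style is a hypothetical helper, not the factory method added by this patch, and it assumes that ':' never appears inside a column value and that no escaping is involved.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: splits "val1:val2" into {"val1", "val2"}.
// Assumption: values contain no ':' and there is no escaping.
inline std::vector<std::string> split_nodetool_style(const std::string& s) {
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    for (;;) {
        auto pos = s.find(':', start);
        if (pos == std::string::npos) {
            // Last (or only) component: everything after the final ':'.
            parts.push_back(s.substr(start));
            return parts;
        }
        parts.push_back(s.substr(start, pos - start));
        start = pos + 1;
    }
}
```

Each resulting string would then still have to be parsed according to the type of the corresponding partition key column, which this sketch leaves out.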
2018-05-28 18:09:51 +03:00
Tomasz Grabiec
aefb5e0fbd Merge "Get rid of cql_statement::execute_internal" from Avi
execute_internal() duplicates several code paths, especially in
the select path, for no good reason.  It boils down to timeout and
consistency level selection which can be done based on
client_state::is_internal().

This patchset eliminates the duplication and execute_internal(),
simplifying the code.

* github.com:avikivity/scylla cql-no-execute_internal/v2:
  cql: schema_altering_statement: make execute() and execute_internal()
    equivalent
  cql: select_statement: make execute() and execute_internal()
    equivalent
  cql: query_processor: don't call cql_statement::execute_internal() any
    more
  cql: cql_statement: remove execute_internal()
2018-05-28 13:01:43 +02:00
Avi Kivity
8033785b36 Update scylla-ami submodule
* dist/ami/files/scylla-ami 025644d...1f5329f (1):
  > scylla_install_ami: Update CentOS to latest version
2018-05-28 13:59:57 +03:00
Avi Kivity
ff3e86888a tests: report tests as they are completed
As each test completes, report it. This prevents a long-running
test in the beginning of the list from stalling output.
Message-Id: <20180526173517.23078-1-avi@scylladb.com>
2018-05-28 13:58:01 +03:00
Avi Kivity
3a4d11d374 Merge "Introduce frozen_mutation_fragment" from Paweł
"
This series introduces frozen_mutation_fragment which can be used to
send mutation_fragments over the wire to a remote node. The main
intended user is going to be the new streaming implementation.

The first part of the series fixes some IDL issues related to empty
structures and variant being the first member of a structure. Both these
problems make the generated code fail to build and they do not, in any
way, affect the existing on-wire protocol.

Logic responsible for freezing and unfreezing of mutation_fragments is
heavily based on the existing code for freezing mutations and shares the
same drawbacks (for example, unnecessary copy during unfreezing). These
preexisting performance problems can be fixed incrementally.

Another performance problem (which affects frozen_mutations as well, but
to a lesser extent) is that since the batching is done at a different
layer each frozen mutation fragment is a separate bytes_ostream object
owning at least one memory buffer. If the mutation fragments are small
this will cause an excessive number of allocations. This could be solved
either by freezing fragments in batches (though it goes against the RPC
layer doing its own batching) or using bytes_ostream or an equivalent
object with a buffer allocation policy more suitable for such use cases.
This also is something that probably could be an incremental fix.

Tests: unit (release)
"

* tag 'frozen_mutation_fragment/v1-rebased' of https://github.com/pdziepak/scylla:
  idl: add idl description of frozen_mutation_fragments
  tests: add test for frozen_mutation_fragments
  frozen_mutation: introduce frozen_mutation_fragment
  tests/idl: test variant being the first member of a structure
  idl: create variant state in root node
  tests/idl: test serialising and deserialising empty structures
  idl-compiler: avoid unused variable in empty struct deserialisers
  tests/mutation_reader: disambiguate freeze() overload
2018-05-28 13:54:01 +03:00
Takuya ASADA
55d6be9254 Revert "dist/ami: update CentOS base image to latest version"
This reverts commit 69d226625a.
Since ami-4bf3d731 is a Marketplace AMI, it is not possible to publish a public AMI based on it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180523112414.27307-1-syuu@scylladb.com>
2018-05-28 13:52:34 +03:00
Duarte Nunes
99d678d079 db/view: Remove ifdef'd Java code
It provides no useful information, so just get rid of it.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-05-28 11:51:23 +01:00
Duarte Nunes
ad18d535e9 db/view: Ignore scenario where base replica hasn't joined the ring
Apache Cassandra handles a case where the node hasn't joined the ring
and may consequently have an outdated view of it. Following the same
reasoning as with the previous patch, we ignore this scenario. It
happens when there are range movements, and this node is bootstrapping,
but there are already other mechanisms in the cluster, such as hinted
handoff and dual-writing to replicas during range movements, that
contribute to this update eventually making its way to the view.

This patch doesn't change any behavior, but it provides the reasoning
why we won't use the batchlog as Cassandra does, or the hinted handoff
log as we will, to later send the update when the node is joined (note
that Cassandra just sends the mutations "later", and doesn't check
again for any condition or change).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-05-28 11:51:23 +01:00
Duarte Nunes
be45e6a1b7 db/view: Handle case when base has no paired view replica
If no view replica is paired with the current base replica, it means
there's a range movement going on (decommission or move), such that
this base replica is gaining new token ranges. The current node is
thus a pending_endpoint from the POV of the coordinator that sent the
request.

Sending view updates to the view replica this base will eventually be
paired with only makes a difference when the base update didn't make
it to the node which is currently being decommissioned or moved-from.

The update will, however, make it to that node if HH is enabled at the
coordinator, before the range movement finishes, or later to this node
when it becomes a natural endpoint for the token.

We still ensure we send to any pending view endpoints though, at least
until we handle that case more optimally.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-05-28 11:51:18 +01:00
Avi Kivity
b70febe246 cql: cql_statement: remove execute_internal()
With no callers, it can be safely removed.
2018-05-27 12:40:27 +03:00
Avi Kivity
c8a66efb6a cql: query_processor: don't call cql_statement::execute_internal() any more
All cql_statement::execute_internal() overrides now either throw or
call execute().  Since we shouldn't be calling the throwing overrides
internally, we can safely call execute() instead.  This allows us to
get rid of execute_internal().
2018-05-27 12:37:37 +03:00
Avi Kivity
eb19798f99 cql: select_statement: make execute() and execute_internal() equivalent
execute_internal(), for some code paths, differs from execute by the
following:
 1. it uses CL_ONE unconditionally
 2. it has no query timeout
 3. it doesn't use execution stages

for other code paths, it just calls execute.

As preparation for getting rid of execute_internal(), unify the two
code paths.

Commit 4859b759b9 caused the consistency level and timeouts
to be provided by the caller, so using the caller provided parameters
instead of overriding them does not change behavior.
2018-05-27 12:36:02 +03:00
Avi Kivity
d998f06633 cql: schema_altering_statement: make execute() and execute_internal() equivalent
To get rid of execute_internal(), make the normal execute() equivalent and call
it instead of having two different paths.
2018-05-27 11:08:55 +03:00
Duarte Nunes
4859b759b9 Merge 'Make all timeouts explicit' from Avi
"
This patchset makes all users of query_processor specify their timeouts
explicitly, in preparation for the removal of
cql_statement::execute_internal() (whose main function was to override
timeouts).
"

* tag 'cql-explicit-timeouts/v1' of https://github.com/avikivity/scylla:
  query_processor: require clients to specify timeout configuration
  query_processor: un-default consistency level in make_internal_options
2018-05-26 16:10:58 +02:00
Avi Kivity
6e97609049 Merge "Improve support for data types handling in SSTables 3.x" from Vladimir
"
Firstly, this patchset removes the is_fixed_length() function of
abstract_type in favour of value_length_if_fixed().

Secondly, it fixes the byte_type to be compatible with Cassandra which
erroneously treats it as a variable-length data type.

Lastly, it adds a unit test covering all non-composite CQL data types
for writing.

Tests: unit {release}
"

* 'projects/sstables-30/different-data-types/v1' of https://github.com/argenet/scylla:
  tests: Add a unit test for writing different data types to SSTables 3.x format.
  types: Treat byte_type as a variable-length type for compatibility reasons.
  types: Remove is_value_fixed() and use value_length_if_fixed() instead.
2018-05-26 10:24:35 +03:00
Vladimir Krivopalov
0951153292 tests: Add a unit test for writing different data types to SSTables 3.x format.
This test covers all non-composite CQL data types.
The resulting files are dumped using sstabledump as follows:

[
  {
    "partition" : {
      "key" : [ "key" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 174,
        "liveness_info" : { "tstamp" : "1525385507816568" },
        "cells" : [
          { "name" : "asciival", "value" : "hello" },
          { "name" : "bigintval", "value" : 9223372036854775807 },
          { "name" : "blobval", "value" : "0x6772656174" },
          { "name" : "boolval", "value" : true },
          { "name" : "dateval", "value" : "2017-05-05" },
          { "name" : "decimalval", "value" : 5.45 },
          { "name" : "doubleval", "value" : 36.6 },
          { "name" : "durationval", "value" : 1h4m48s20ms },
          { "name" : "floatval", "value" : 7.62 },
          { "name" : "inetval", "value" : "192.168.0.110" },
          { "name" : "intval", "value" : -2147483648 },
          { "name" : "smallintval", "value" : 32767 },
          { "name" : "timeuuidval", "value" : "50554d6e-29bb-11e5-b345-feff819cdc9f" },
          { "name" : "timeval", "value" : "19:45:05.090000000" },
          { "name" : "tinyintval", "value" : 127 },
          { "name" : "tsval", "value" : "2015-05-01 09:30:54.234Z" },
          { "name" : "uuidval", "value" : "01234567-0123-0123-0123-0123456789ab" },
          { "name" : "varcharval", "value" : "привет" },
          { "name" : "varintval", "value" : 123 }
        ]
      }
    ]
  }
]

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-25 21:41:23 -07:00
Vladimir Krivopalov
3981dd6dd6 types: Treat byte_type as a variable-length type for compatibility reasons.
Although values of the byte_type that corresponds to CQL TINYINT type
always occupy only a single byte, Cassandra treats it as a
variable-length type for SSTables 3.0 reading and writing.

While it is clearly a mistake on Cassandra's side, we have to stay
compatible.
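
As a rough illustration of the idea, a value_length_if_fixed()-style query could look like the sketch below. The cql_kind enum, the free-function form, and the std::optional<uint32_t> return type are assumptions made for this example; only the treatment of TINYINT as variable-length comes from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical subset of CQL types, for illustration only.
enum class cql_kind { tinyint, int32, bigint, text };

// Returns the serialized length when the type is treated as fixed-length,
// and std::nullopt when it is treated as variable-length.
inline std::optional<uint32_t> value_length_if_fixed(cql_kind k) {
    switch (k) {
    case cql_kind::int32:
        return 4;
    case cql_kind::bigint:
        return 8;
    case cql_kind::tinyint: // always one byte, but treated as variable-length
    case cql_kind::text:
        return std::nullopt;
    }
    return std::nullopt;
}
```

A single optional-returning query makes a separate boolean "is fixed" check redundant, since the caller can test whether a length is present.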

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-25 21:41:23 -07:00
Vladimir Krivopalov
24cb062834 types: Remove is_value_fixed() and use value_length_if_fixed() instead.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-25 21:41:23 -07:00
Paweł Dziepak
ed12555192 idl: add idl description of frozen_mutation_fragments 2018-05-25 10:15:10 +01:00
Paweł Dziepak
0bac487426 tests: add test for frozen_mutation_fragments 2018-05-25 10:15:10 +01:00
Paweł Dziepak
aa4e589ace frozen_mutation: introduce frozen_mutation_fragment
This patch introduces IDL definition as well as serialisers and
deserialisers for freezing mutation_fragment so that they can be
transferred between nodes in a cluster.
2018-05-25 10:15:10 +01:00
Paweł Dziepak
b2e9491728 tests/idl: test variant being the first member of a structure 2018-05-25 10:15:10 +01:00
Paweł Dziepak
a5731ded98 idl: create variant state in root node
Each non-final IDL object is preceded by a frame containing its size.
In case of boost::variant there is a frame for the variant itself, an
integer determining the active alternative of the variant and a frame of
that active alternative.

However, if a variant was the first member of a writable stub object the
IDL would generate code that would not write the frame for the variant.
This is not a very severe issue since there are no such cases right now,
as the C++ type system would not allow such generated code to compile.
2018-05-25 10:15:10 +01:00
Paweł Dziepak
d731cf427d tests/idl: test serialising and deserialising empty structures 2018-05-25 10:15:10 +01:00
Paweł Dziepak
f719516be8 idl-compiler: avoid unused variable in empty struct deserialisers
Deserialisers generated by IDL compiler first create a substream
covering the deserialised structure and then skip and read appropriate
members. If there are no members the substream will be unused and prompt
the compiler to emit a warning.
2018-05-25 10:15:10 +01:00
Paweł Dziepak
fde9e1d55f tests/mutation_reader: disambiguate freeze() overload
freeze() is about to get overloaded so make sure we don't get any
ambiguities.
2018-05-25 10:15:10 +01:00
Duarte Nunes
4db0b4af58 Merge 'secondary index: Fixes for tables with multiple clustering columns' from Nadav
"
This patch series fixes #3405: secondary-index search only provided
correct results in certain cases, where entire partitions or contiguous
partition slices matched the query. When this was not the case, and
individual clustering rows match or do not match the query, the wrong
results were returned.

To fix this bug, we need to fix the two stages of secondary-index search:

1. In the first stage, we read from the index MV a list of row keys
   (i.e., primary keys) matching the query. We can no longer remember
   just the partition keys, and need to keep the list of full primary keys.

2. In the second stage, we have a list of rows (not partitions) and need
   to read their selected contents to return to the user. Since CQL queries
   do not have a syntax to select an arbitrary list of rows, we have to
   add new code to do such a selection.

Because we provide an ad-hoc, inefficient, implementation for the row
selection described in stage 2, these patches leave two paths in the code:
The old path, efficiently selecting entire partitions, and the new path,
selecting individual rows. The old path is still used when it is applicable,
which is when a partition key column or the first clustering key column
is searched.
"

* 'si-fix-v4' of http://github.com/nyh/scylla:
  secondary index: test multiple clustering column
  secondary index: fix wrong results returned in certain cases
  secondary index: method for fetching list of rows from base table
  secondary index: method for fetching list of rows from index
  select_statement.cc: refactor find_index_partition_ranges()
  select_statement.cc: fix variable lifetime errors
2018-05-24 21:36:18 +01:00
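The difference between the two stages of this series at partition granularity and at row granularity can be shown with a toy model. The sketch below is not Scylla code: the `row` struct, the integer keys and the three function names are hypothetical, and the "index" is simulated by scanning the base rows.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy base-table row: partition key, clustering key, one indexed column.
struct row { int pk; int ck; std::string v; };
using primary_key = std::pair<int, int>;  // (partition key, clustering key)

// Stage 1: full primary keys of the rows whose indexed column equals `value`.
std::vector<primary_key> read_posting_list(const std::vector<row>& base,
                                           const std::string& value) {
    std::vector<primary_key> keys;
    for (const auto& r : base) {
        if (r.v == value) {
            keys.emplace_back(r.pk, r.ck);
        }
    }
    return keys;
}

// Stage 2, partition granularity: remembers only partition keys, so it
// returns every row of each matching partition.
std::vector<row> select_whole_partitions(const std::vector<row>& base,
                                         const std::vector<primary_key>& keys) {
    std::set<int> partitions;
    for (const auto& k : keys) {
        partitions.insert(k.first);
    }
    std::vector<row> out;
    for (const auto& r : base) {
        if (partitions.count(r.pk) != 0) {
            out.push_back(r);
        }
    }
    return out;
}

// Stage 2, row granularity: returns exactly the rows named by the keys.
std::vector<row> select_exact_rows(const std::vector<row>& base,
                                   const std::vector<primary_key>& keys) {
    const std::set<primary_key> wanted(keys.begin(), keys.end());
    std::vector<row> out;
    for (const auto& r : base) {
        if (wanted.count({r.pk, r.ck}) != 0) {
            out.push_back(r);
        }
    }
    return out;
}
```

For a base table holding (1,1,"a"), (1,2,"b") and (2,1,"a"), a search for "a" yields three rows at partition granularity, including the non-matching (1,2,"b"), and two rows at row granularity.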
Nadav Har'El
a6d9ea2fb5 secondary index: test multiple clustering column
This patch adds a test for secondary indexes on a table which has many
columns - two partition key columns, two clustering key columns, and two
regular columns. We add a bunch of data in various rows and partitions,
index all columns and search on this data and verify the results.

This test exposed various bugs in secondary index search, including
issue #3405. After we fixed those bugs, the test now passes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-05-24 15:56:57 +03:00
Nadav Har'El
1b29dd44f7 secondary index: fix wrong results returned in certain cases
The current secondary-index search code, in
indexed_table_select_statement::do_execute(), begins by fetching a list
of partitions, and then the content of these partitions from the base
table. However, in some cases, when the table has clustering columns and
the search is not on the first one of them, doing this work in partition
granularity is wrong, and yields wrong results as demonstrated in
issue #3405.

So in this patch, we recognize the cases where we need to work in
clustering row granularity, and in those cases use the new functions
introduced in the previous patches - find_index_clustering_rows() and
the execute() variant taking a list of primary-keys of rows.

Fixes #3405.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-05-24 15:56:03 +03:00
Nadav Har'El
adf6d742be secondary index: method for fetching list of rows from base table
We add a new variant of select_statement::execute() which allows selecting
an arbitrary list of clustering rows. The existing execute() variant can't
do that - it can only take a list of *partitions*, and read the same
clustering rows from all of them.

The new select variant is not needed for regular CQL queries (which do
not have a syntax allowing reading a list of rows with arbitrary primary
keys), but we will need it for secondary index search, for solving
issue #3405.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-05-24 15:54:36 +03:00
Nadav Har'El
a096a82adc secondary index: method for fetching list of rows from index
We already have a method find_index_partition_ranges(), to fetch a list
of partition keys from the secondary index. However, as we shall see in
the following patches (and see also issue #3405), getting a list of entire
partitions is not always enough - the secondary index actually holds a list
of primary keys, which includes clustering keys, and in some queries we
can't just ignore them.

So this patch provides a new method find_index_clustering_rows(), to
query the secondary index and get a list of matching clustering keys.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-05-24 15:53:29 +03:00
Nadav Har'El
083b2ae573 select_statement.cc: refactor find_index_partition_ranges()
The function find_index_partition_ranges() is used in secondary index
searches for fetching a list of matching partitions. In a following patch,
we want to add a similar function for getting a list of *rows*. To avoid
duplicate code, in this patch we split parts of find_index_partition_ranges()
into two new functions:

1. get_index_schema() returns a pointer to the index view's schema.

2. read_posting_list() reads from this view the posting list (i.e., list
   of keys) for the current searched value.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-05-24 15:50:45 +03:00
Nadav Har'El
7dc9b77682 select_statement.cc: fix variable lifetime errors
do_with() provides code a *reference* to an object which will be kept
alive. It is a mistake to make a copy of this object or of parts of it,
because then the lifetime of this copy will have to be maintained as well.

In particular, it is a mistake to do do_with(..., [] (auto x) { ... }) -
note how "auto x" appears instead of the correct "auto& x". This causes
the object to be copied, and its lifetime not maintained.

This patch fixes several cases where this rule was broken in
select_statement.cc. I could not reproduce actual crashes caused by
these mistakes, but in theory they could have happened.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-05-24 15:46:12 +03:00
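The `auto x` versus `auto& x` point in commit 7dc9b77682 above can be demonstrated without Seastar. In the sketch below, `keep_alive_and_call()` is a hypothetical, synchronous stand-in for `do_with()` (it involves no futures), and `tracked` is a helper type that only counts copies.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Helper type that counts how many times it is copy-constructed.
struct tracked {
    static inline int copies = 0;
    tracked() = default;
    tracked(const tracked&) { ++copies; }
    tracked(tracked&&) noexcept = default;
};

// Hypothetical stand-in for do_with(): moves the value to the heap, keeps it
// alive for the duration of the call, and hands the callback a reference.
template <typename T, typename Func>
void keep_alive_and_call(T value, Func&& func) {
    auto holder = std::make_unique<T>(std::move(value));
    func(*holder);
}

// The lambda takes its parameter by value, so it works on a copy whose
// lifetime is NOT the one managed by the helper.
int copies_when_taken_by_value() {
    tracked::copies = 0;
    keep_alive_and_call(tracked{}, [](auto x) { (void)x; });
    return tracked::copies;
}

// The lambda takes a reference, so it works on the kept-alive object itself.
int copies_when_taken_by_reference() {
    tracked::copies = 0;
    keep_alive_and_call(tracked{}, [](auto& x) { (void)x; });
    return tracked::copies;
}
```

In this synchronous sketch the extra copy is merely wasteful. With asynchronous continuations, the copy is destroyed when the lambda returns, while later continuations may still refer to it, which is the lifetime hazard the commit message describes.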
Piotr Jastrzebski
3b6e80a180 Rename ROW_BODY state to CLUSTERING_ROW
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-24 12:48:33 +02:00
Avi Kivity
0b8d06ebf9 Merge seastar upstream
* seastar a48fe69...12cffef (5):
  > variant_utils: don't pass variant by rref to boost::apply_visitor
  > Revert "build: fix compilation issues on cmake. missing stdc++-fs"
  > reactor: prevent expected overflow from triggering ubsan warning
  > cmake: Add cmake option to disable testing altogether
  > build: fix compilation issues on cmake. missing stdc++-fs
2018-05-24 12:17:56 +03:00
Avi Kivity
f893dc61f0 Merge "Implement reading columns from SSTable 3 format" from Piotr
"
This patchset implements reading row columns from SSTable 3 format data file.

Tests: units (release)
"

* 'haaawk/sstables3/read-columns-v4' of ssh://github.com/scylladb/seastar-dev: (21 commits)
  Add test for reading column values of different types.
  Support all fixed size column types from SSTable 3.x
  Add abstract_type::value_length_if_fixed
  Add test for simple table with value
  flat_reader_assertions: Add produces_row taking column values
  Implement reading rows and columns in data_consume_rows_context_m
  Introduce column_flags_m
  Add column_translation to data_consume_rows_context_m
  Pass schema to data_consume_context
  Add column_translation.hh
  consumer_m: Add consume methods for consuming rows and columns
  Extract make_atomic_cell from mp_row_consumer_k_l
  Rename NON_STATIC_ROW_* states to ROW_BODY_*
  Add liveness_info and use it in reading sstables
  Add helper methods for parsing simple types.
  Add unfiltered_flags_m::has_all_columns
  data_consume_context: use make_unique instead of new
  Pass serialization_header to data_consume_rows_context*
  Use disk_string_vint_size for bytes_array_vint_size
  Introduce disk_string_vint_size type
  ...
2018-05-24 10:11:25 +03:00
Takuya ASADA
e0d49aae37 dist/debian: fix missing --configfile parameter on pdebuild
We need to specify --configfile on pdebuild too, otherwise we will
always fail to build .deb on a newly created build environment.
The only reason we are still able to build .deb is that we already copied
.pbuilderrc to the home directory on the existing build environment.

Fixes #3456

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180523204112.24669-1-syuu@scylladb.com>
2018-05-24 10:10:27 +03:00
Piotr Jastrzebski
7869bd98b1 Add test for reading column values of different types.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
a572d126e4 Support all fixed size column types from SSTable 3.x
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
7a25819e5a Add abstract_type::value_length_if_fixed
This info is used by SSTable 3.x format to read column values
without reading their lengths.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
f58f10d708 Add test for simple table with value
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
0a5d06b2f3 flat_reader_assertions: Add produces_row taking column values
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
9348006092 Implement reading rows and columns in data_consume_rows_context_m
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
f6e1c38486 Introduce column_flags_m
This will be used for reading columns from data file.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
609854e21a Add column_translation to data_consume_rows_context_m
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
7fd222e639 Pass schema to data_consume_context
It will be needed to obtain column_translation that will
be added to data_consume_context in the next patch.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
d3f3cd36dd Add column_translation.hh
It contains a class that manages mapping between sstable
columns and schema column definitions.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
25b8cf9e4c consumer_m: Add consume methods for consuming rows and columns
Also implement them in mp_row_consumer_m.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:53:29 +02:00
Piotr Jastrzebski
94e3138dc5 Extract make_atomic_cell from mp_row_consumer_k_l
It will be used in both mp_row_consumer_k_l and
mp_row_consumer_m.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
c6d5ebc274 Rename NON_STATIC_ROW_* states to ROW_BODY_*
New name describes the states in a better way as those states
will be used both for static and non-static rows.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
10c669d2b5 Add liveness_info and use it in reading sstables
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
b2f9841dd4 Add helper methods for parsing simple types.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
d8cd8e04ed Add unfiltered_flags_m::has_all_columns
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
51d079e17c data_consume_context: use make_unique instead of new
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
54ef775501 Pass serialization_header to data_consume_rows_context*
This header is needed to parse data for SSTable 3.0 format

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
b849eefc8c Use disk_string_vint_size for bytes_array_vint_size
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
76f0f2693d Introduce disk_string_vint_size type
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:30:03 +02:00
Piotr Jastrzebski
5ca4bfd69a disk_array_vint_size: Remove unused Size template parameter
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:15:44 +02:00
Duarte Nunes
4eb47d136b Merge 'Introduce authorized_prepared_statements_cache' from Vlad
"
This series introduces a cache of already authenticated prepared statements which
is meant to optimize the prepared statement lookup when authentication is enabled.

This cache allows performing a single cache lookup per EXECUTE operation as opposed
to at least 2 lookups: one in the prepared statements cache and one in the authentication
cache.

Tests:
   - cql_query_test {debug, release}.
   - cassandra-stress with authentication enabled and with short eviction timeout.
   - Manual (with printouts) checks:
      - Tested the eviction due to eviction in the prepared_statements_cache:
         - Artificially decreased the prepared_statements_cache size and ran c-s with different keyspaces.
         - Verified that the corresponding authorized_prepared_statements_cache entry is evicted and re-populated.
      - Tested the BATCH of prepared statements (with dtest infrastructure):
         - Verified that for each prepared statement authorized_prepared_statements_cache is updated only once:
            - The batch contained a few entries of the same prepared statement.
"

* 'authorized_prepared_statements_cache-v3' of https://github.com/vladzcloudius/scylla:
  cql3: use authorized_prepared_statements_cache in the BATCH processing
  cql3::statements::batch_statement: introduce a single_statement class
  cql3: introduce the authorized_prepared_statements_cache class
  loading_shared_values: introduce the templated find() overload
  tests: loading_cache_test: add a tests for a loading_cache::remove(key)/remove(iterator)
  utils::loading_cache: add remove(key)/remove(iterator) methods
  cql3::query_processor: properly stop() prepared_statements_cache object
2018-05-23 14:40:09 +01:00
Avi Kivity
3dd2f68712 dist: drop libunwind dependency
Since Seastar no longer (1f005fb434) requires libunwind, we can
drop it from our dependency list.  This helps the power build, for
which no libunwind is available.

Fixes #3453.
Message-Id: <20180523114750.10753-1-avi@scylladb.com>
2018-05-23 13:53:29 +02:00
Avi Kivity
1f005fb434 Merge seastar upstream
* seastar 5da5d4e...a48fe69 (1):
  > backtrace: drop libwind in favor of libc backtrace()
2018-05-23 14:42:14 +03:00
Duarte Nunes
eed09dfdf9 mutation_partition: Throw std::out_of_range with backtrace on cell_at
Makes it easier to investigate bugs.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180521133753.16375-1-duarte@scylladb.com>
2018-05-23 13:51:54 +03:00
Avi Kivity
701e6f2cff Merge "Implement backlog controller for TWCS" from Glauber
"
This series implements the backlog tracker for TWCS, allowing it to
be controlled. The backlog for a TWCS column family is just the sum of
the SizeTiered backlogs for all the windows that we know about.

A possible optimization for this is to stop tracking windows after
they become old enough and revert to zero backlog. I reverted that
last minute, though, since this will probably cause the backlog to
completely misrepresent reality if we import SSTables into old buckets
with things like repairs or nodetool refresh.
"

* 'twcs-backlog-v4.1' of github.com:glommer/scylla:
  backlog: implement backlog tracker for the TWCS
  STCS_backlog: allow users to query for the total bytes managed
  backlog: keep track of maximum timestamp in write monitor
  memtable: also keep track of max timestamp
2018-05-23 13:37:49 +03:00
Glauber Costa
44a89d654b backlog: implement backlog tracker for the TWCS
The TWCS backlog is relatively simple: we just need to keep track of
which SSTables belong to which time window (and actually, as usual,
just their sizes). That is an easy thing to do since we can statically
calculate the time bound from the timestamp.

Once we do that we can just sum the backlogs for each individual window.
Time windows that are well enough into the past can be at some point
discarded when their backlogs become zero.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-23 06:20:21 -04:00
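A rough sketch of the bookkeeping described in commit 44a89d654b above: SSTable sizes are grouped by time window, the total backlog is the sum over windows, and empty windows are dropped. The class name, its methods and the use of raw byte counts as the per-window backlog are all assumptions; the actual SizeTiered backlog formula is not reproduced here, and timestamps are assumed to be non-negative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

class twcs_backlog_sketch {
    std::int64_t _window_size;
    // Lower bound of each time window -> bytes tracked in that window.
    std::map<std::int64_t, std::uint64_t> _bytes_per_window;
public:
    explicit twcs_backlog_sketch(std::int64_t window_size)
        : _window_size(window_size) {}

    // The window bound can be computed statically from the timestamp.
    std::int64_t window_for(std::int64_t max_timestamp) const {
        return max_timestamp / _window_size * _window_size;
    }

    void add_sstable(std::int64_t max_timestamp, std::uint64_t size) {
        _bytes_per_window[window_for(max_timestamp)] += size;
    }

    void remove_sstable(std::int64_t max_timestamp, std::uint64_t size) {
        auto it = _bytes_per_window.find(window_for(max_timestamp));
        if (it == _bytes_per_window.end()) {
            return;
        }
        it->second -= (size < it->second ? size : it->second);
        if (it->second == 0) {
            _bytes_per_window.erase(it);  // stop tracking an empty window
        }
    }

    std::size_t window_count() const { return _bytes_per_window.size(); }

    // Stand-in: each window's backlog is just its byte count.
    double backlog() const {
        double total = 0;
        for (const auto& entry : _bytes_per_window) {
            total += static_cast<double>(entry.second);
        }
        return total;
    }
};
```

Because the per-window value is summed independently, swapping in a real SizeTiered backlog computation would only change the body of the loop in `backlog()`.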
Nadav Har'El
433fc6c36e keys.hh: simplify empty clustering-key check
The exploded_clustering_prefix type has a convenient is_empty() method
and an even more convenient "operator bool" shortcut. Unfortunately,
the other clustering prefix types (clustering_key_prefix,
clustering_key_prefix_view) have, for historic reasons, an is_empty
method which takes a schema parameter. That also means they can't
have an "operator bool" shortcut.

But checking if a prefix is empty doesn't really need the schema - all we need to
check is whether the byte representation is empty. The result is simpler
and more efficient code, and easier to use. It is also more consistent -
all clustering-key-related types will have an "operator bool" instead of
just some of them.

To avoid massive code changes, we leave an is_empty(schema) variant, which
simply calls is_empty(). There's already precedent for that - various
methods which have a variant taking schema (and ignoring it) and one
taking nothing.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180521174220.13262-1-nyh@scylladb.com>
2018-05-23 11:46:23 +02:00
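The shape of the change in commit 433fc6c36e above might look like the sketch below. The class and the placeholder `schema` type are hypothetical, and it is an assumption that the boolean conversion yields true for a non-empty prefix.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

struct schema {};  // placeholder; the real schema type is far richer

class clustering_prefix_sketch {
    std::vector<std::uint8_t> _bytes;  // serialized representation
public:
    clustering_prefix_sketch() = default;
    explicit clustering_prefix_sketch(std::vector<std::uint8_t> b)
        : _bytes(std::move(b)) {}

    // Emptiness depends only on the byte representation.
    bool is_empty() const { return _bytes.empty(); }

    // Kept so existing callers compile; the schema argument is ignored.
    bool is_empty(const schema&) const { return is_empty(); }

    // Assumption: true when the prefix is non-empty.
    explicit operator bool() const { return !is_empty(); }
};
```

Keeping the schema-taking overload as a thin forwarder lets call sites migrate gradually instead of in one large patch.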
Takuya ASADA
300af65555 dist/common/scripts/scylla_setup: abort running script when one of setup failed in silent mode
The current script silently continues even if one of the setup steps fails;
it needs to abort.

Fixes #3433

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180522180355.1648-1-syuu@scylladb.com>
2018-05-23 11:05:33 +03:00
Vlad Zolotarov
82f7d1d006 cql3: use authorized_prepared_statements_cache in the BATCH processing
Like with the EXECUTE command avoid authorizing the same prepared
statement twice - this time in the context of processing the BATCH
command.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-22 20:15:03 -04:00
Vlad Zolotarov
9723988926 cql3::statements::batch_statement: introduce a single_statement class
This is a helper class needed to control the handling process of a single
statement in the current batch. In particular it has the boolean defining
if the authorization is needed for this statement.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-22 20:15:03 -04:00
Vlad Zolotarov
a138c59991 cql3: introduce the authorized_prepared_statements_cache class
Add a cache that would store the checked weak pointer to already authorized prepared statements
and whose key is a tuple of an authenticated_user and key of the prepared_statements_cache.

The entries will be held as long as the corresponding prepared statement is valid (cached)
and will be discarded with the period equal to the refresh period of the permissions cache.

Entries are also going to be discarded after 60 minutes if not used.

The purpose of this new cache is to save the lookup in the permissions cache for already authenticated
resource (whatever is needed to be authenticated for the particular prepared statement).

This is meant to improve the cache coherency as well (since we are going to look in a single cache
instead of two).

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-22 20:15:03 -04:00
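The single-lookup idea in commit a138c59991 above can be sketched with standard containers. Everything here is hypothetical: users and statement keys are plain strings, `std::weak_ptr` stands in for the "checked weak pointer", and the timer-based eviction mentioned in the commit message is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct prepared_statement { std::string query; };

class authorized_cache_sketch {
    // Key: (authenticated user, key in the prepared statements cache).
    using key_type = std::pair<std::string, std::string>;
    std::map<key_type, std::weak_ptr<prepared_statement>> _entries;
public:
    // Records that `user` has already been authorized for this statement.
    void insert(const std::string& user, const std::string& stmt_key,
                const std::shared_ptr<prepared_statement>& stmt) {
        _entries[key_type{user, stmt_key}] = stmt;
    }

    // One lookup answers both "is it still prepared?" and "is it authorized?".
    // Returns nullptr on a miss or when the statement is no longer cached.
    std::shared_ptr<prepared_statement> find(const std::string& user,
                                             const std::string& stmt_key) {
        auto it = _entries.find(key_type{user, stmt_key});
        if (it == _entries.end()) {
            return nullptr;
        }
        auto stmt = it->second.lock();
        if (!stmt) {
            _entries.erase(it);  // underlying statement was evicted
        }
        return stmt;
    }

    std::size_t size() const { return _entries.size(); }
};
```

Holding only a weak reference means an entry cannot outlive the prepared statement it refers to, which matches the statement that entries are held only while the prepared statement is valid.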
Vlad Zolotarov
3114cef42c loading_shared_values: introduce the templated find() overload
This overload allows searching the elements by an arbitrary key as long as it is "hashable"
to the same values as the default key and if there is a comparator for
this new key.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-22 20:15:00 -04:00
Vlad Zolotarov
ab251a1fc3 tests: loading_cache_test: add a tests for a loading_cache::remove(key)/remove(iterator)
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-22 20:05:01 -04:00
Vlad Zolotarov
34620deee4 utils::loading_cache: add remove(key)/remove(iterator) methods
remove(key): removes the entry with the given key if exists, otherwise does nothing.
remove(iterator): removes an entry by a given iterator (returned from loading_cache::find()).

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-22 20:05:00 -04:00
Piotr Sarna
b7ac2da238 main: initialize hints manager unconditionally
This commit makes sure that hints manager is always initialized,
including creating hints directories and starting it.
It needs to be fixed because hints manager is internally used
to store failed materialized view replicas.

Fixes #3451
Message-Id: <44532fd3704e20cabeb9c4985dace5650fd22d2c.1527018865.git.sarna@scylladb.com>
2018-05-22 22:21:50 +01:00
Duarte Nunes
ed2a1518f8 Merge 'Allow dropping tables with active secondary indexes' from Piotr
"
This series addresses issue #3202 about dropping a table with secondary
indexes present. Previously dropping such tables was impossible due to
materialized view restrictions (which is an implementation detail
of Scylla's secondary indexes).

Implemented:
 * fixing 'DROP KEYSPACE' with active materialized views
 * adapting schema_builder to make it easy to drop indexes
 * dropping all dependent SI before dropping a table
 * a test case for dropping a table with secondary indexes
"

* 'drop_si_before_drop_table_3' of https://github.com/psarna/scylla:
  tests: add test for dropping a table with secondary indexes
  migration_manager: allow dropping table with secondary indexes
  schema: add clearing indexes to schema builder
  database: do not truncate already removed views
2018-05-22 22:20:35 +01:00
Vlad Zolotarov
5bde36f29e cql3::query_processor: properly stop() prepared_statements_cache object
prepared_statements_cache has a timer that evicts old entries - it needs to be properly stopped.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-22 16:33:52 -04:00
Piotr Sarna
76848fb577 tests: add test for dropping a table with secondary indexes
This commit adds a test case for dropping a table with dependent
secondary indexes. Dependent materialized views prohibit the table
from being dropped, but dropping a table with dependent SI is legal.

References #3202
2018-05-22 21:10:51 +02:00
Piotr Sarna
7e4813a466 migration_manager: allow dropping table with secondary indexes
Previously dropping a table with secondary indexes failed, because
SI are internally backed by materialized views.
This commit triggers dropping dependent secondary indexes before
dropping a table.

Fixes #3202
2018-05-22 21:10:51 +02:00
Piotr Sarna
0513dc17a1 schema: add clearing indexes to schema builder
This commit adds 'without_indexes()' method to builder,
used to clear all previous index declarations from schema definition.
2018-05-22 21:10:51 +02:00
Piotr Sarna
f8237dd664 database: do not truncate already removed views
This commit clears table's views before truncating it
in drop_column_family function. The only case when
views are not empty during drop is when they're backing secondary
indexes of a base table and they are all atomically dropped
in the same go as the base table itself.
This change will prevent trying to truncate views that were
already dropped, which used to result in no_such_column_family error.

References #3202
2018-05-22 21:10:51 +02:00
Duarte Nunes
a3bbd52e2e Merge 'Add materialized view metrics' from Piotr
"
This series introduces materialized view statistics, as stated in issue #3385:
 - updates pushed
 - updates failed
 - row lock stats

It also addresses issue #3416 by decoupling user write stats from view
update stats.
"

* 'materialized_view_metrics_9' of https://github.com/psarna/scylla:
  view: adapt view_stats to act as write stats
  storage_proxy: decouple write_stats from stats
  db: add row locking metrics
  view: add view metrics
2018-05-22 18:41:51 +01:00
Glauber Costa
be39736293 STCS_backlog: allow users to query for the total bytes managed
We would like to know whether there is still backlog at rest in a
particular STCS object. This is useful, for instance, in the TWCS
backlog, that uses STCS so it can delete old windows that are no longer
used.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 13:40:15 -04:00
Glauber Costa
b573a2ff61 backlog: keep track of maximum timestamp in write monitor
For sealed SSTables we can get the maximum timestamp from the statistics
component.  But for partially written SSTables, the metadata is not yet
available.

One way to solve this would be to make the SSTable statistics available
earlier. But we would end up with a maximum timestamp that potentially
changes all the time as we write more cells.

A better approach is to take note of what's the maximum timestamp in a
memtable before we start to flush, and when time comes for us to flush
we will use the progress manager to inform the consumers about the
maximum timestamp.

For SSTables being compacted, we can't know for sure what is the maximum
timestamp as some entries could be TTLd already. But the maximum of all
SSTables present in the compaction is a good enough estimation for this
purpose.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 12:55:58 -04:00
Glauber Costa
68d1c64e7a memtable: also keep track of max timestamp
We are now keeping track of the minimum timestamp in a memtable. Also
keep track of the max timestamp so we can know what it is before we
finish flushing the entire memtable to an SSTable. Will be used by
partially written SSTables undergoing TWCS.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 12:55:58 -04:00
Avi Kivity
49892a06b9 Merge "exception safety and minimum work for compaction controller" from Glauber
"
This was sent before as two separate patchsets. It is now unified
because it has a lot of common infrastructure.

In this patchset I am aiming at two goals:

1) Provide a minimum amount of shares for user-initiated operations like
nodetool compact and nodetool cleanup

2) Be more robust with exceptions in the backlog tracker

For the first, the main difference is that I now made the compaction
controller a part of the compaction manager. It then becomes easy to
consult with the compaction controller for the correct amount of shares
those operations should have.

In compaction_strategy.cc, the major_compaction_strategy object was
actually already unused before. So instead of making use of it, which
would require some form of information flow downwards about the backlog
we need to export, I am creating a user-initiated backlog type inside
the compaction manager.

With the two changes described above everything is very well
self-contained within the compaction manager and the implementation
becomes trivial.

For the second, I am now handling exceptions in two places:

1) the backlog computation. Those are const functions so if we just have
a transient exception when computing the backlog, all we need to do is
return some fixed amount of shares and try again in the next adjustment
window.

2) the process of adding / removing SSTables. Those are harder, since if
we fail to manipulate the list we'll be left in an inconsistent state.
The best approach is then to disable the backlog tracker and return a
fixed amount of shares globally.

Tests: unit (release)
"

* 'backlog-improvements-v3' of github.com:glommer/scylla:
  compaction_manager: disable backlog tracker if we see an exception
  backlog tracker: protect against exceptions in backlog calculation.
  STCS_backlog: protect against negative backlog
  STCS_backlog: remove unused attribute
  compaction strategy: move size tiered backlog to a header
  compaction_strategy: delete major_compaction_strategy class
  compaction: make sure that user-initiated compactions always have a minimum priority
  backlog_controller: add constants to represent a globally disabled controller
  backlog_controller: move compaction controller to the compaction manager
  backlog_controller: allow users to compute inverse function of shares
2018-05-22 18:35:42 +03:00
Piotr Sarna
3792bed3ed view: adapt view_stats to act as write stats
This commit adapts view_stats structure so it can be passed
to storage_proxy as write stats. Thanks to that, mv replica updates
will not interfere with user write metrics. As a side effect it also
provides more stats to replica view updates.

Closes #3385
Closes #3416
2018-05-22 16:52:58 +02:00
Piotr Sarna
1d590b3ca4 storage_proxy: decouple write_stats from stats
This commit extracts metrics related to writes from stats structure,
so it can be easily replaced later, e.g. for materialized view metrics.

References #3385
References #3416
2018-05-22 16:52:58 +02:00
Piotr Sarna
9246bb36bc db: add row locking metrics
This commit adds statistics to row_locker class. Metrics are
independently counted for all lock types: row<->partition and
exclusive<->shared.

Metrics gathered:
 - total acquisitions
 - operations that wait on the lock
 - histogram of the time spent on waiting on this type of lock

References #3385
References #3416
2018-05-22 16:52:58 +02:00
Piotr Sarna
49bebcfa25 view: add view metrics
This commit introduces view statistics:
 - updates pushed to local/remote replicas
 - updates failed to be pushed to local/remote replicas

Metrics are kept on per-table basis, i.e. updates_pushed_remote
shows the number of total updates (mutations) pushed to all paired
mv replicas that this particular table has.
Every single update is taken into consideration, so if view update
requires removing a row from one view and adding a row to another,
it will be counted as 2 updates.

References #3385
References #3416
2018-05-22 16:52:58 +02:00
Tomasz Grabiec
e554a39fbb tests: memtable_snapshot_source: Fix compact()
Compactor collects all currently active memtables and later replaces
them with the merged result. The problem is that active memtable
belongs to the input set during compaction and as a result mutations
applied concurrently with compaction could be lost once compaction
replaces the memtables. The fix is to open a new active memtable when
compaction starts.

Caused sporadic failures of row_cache_test.cc:test_continuity_is_populated_when_read_overlaps_with_older_version()
Message-Id: <1526997724-13037-1-git-send-email-tgrabiec@scylladb.com>
2018-05-22 15:08:07 +01:00
Glauber Costa
d4e7783188 compaction_manager: disable backlog tracker if we see an exception
If we see an exception when adding or removing SSTables from the backlog
tracker, the backlog tracker can be inconsistent forever. It would be
best if we act before that happens and disable the backlog tracker. Once
the backlog tracker is disabled it will default to returning a fixed
number of shares.

We can either disable the backlog tracker or remove it. But if we remove
it we can end up with a backlog of zero if that's the only tracker with
a backlog. We then keep it registered but mark it as disabled. This also
leaves room for recovery in some situations: we can recover the backlog
by doing a schema change in the column family that had the backlog
disabled, for instance.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:36:32 -04:00
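One way to picture the approach in commit d4e7783188 above: a tracker that stays registered but flips to a disabled state when an update throws, after which the controller falls back to a fixed number of shares. The class, the `fixed_shares` constant and the linear shares formula are all hypothetical.

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>

class guarded_tracker_sketch {
    bool _disabled = false;
    double _bytes = 0;
public:
    // Applies an update (adding or removing an SSTable). If the update
    // throws, the state may be inconsistent, so the tracker is disabled
    // rather than removed.
    template <typename Func>
    void update(Func&& mutate) {
        if (_disabled) {
            return;
        }
        try {
            std::forward<Func>(mutate)(_bytes);
        } catch (...) {
            _disabled = true;
        }
    }

    bool disabled() const { return _disabled; }
    double backlog() const { return _bytes; }
};

constexpr double fixed_shares = 100.0;  // hypothetical fallback value

// Hypothetical controller rule: fixed shares when disabled, otherwise a
// simple linear function of the backlog.
double shares_for(const guarded_tracker_sketch& t) {
    if (t.disabled()) {
        return fixed_shares;
    }
    return 1.0 + t.backlog() / 10.0;
}
```

Keeping the disabled tracker registered, instead of erasing it, avoids reporting a zero backlog when it was the only tracker with work pending.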
Glauber Costa
fde26ec633 backlog tracker: protect against exceptions in backlog calculation.
Backlog calculations should be exception free, but there are cases in
which I can see them happening. One example is if some backlog tracker
that uses temporary objects fails an allocation.

Memory shortages can be especially pernicious: if we leave the
responsibility of catching those to the individual backlog tracker, we
will keep trying to make more allocations in the other backlog trackers
if we have many column families. By handling it here we can stop that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:36:22 -04:00
Glauber Costa
3e08bd17f0 STCS_backlog: protect against negative backlog
A negative backlog can be interpreted as a very large backlog.
Part of that is because we keep the total_size as an unsigned type,
which is what we expect. But if there is an issue, such as an
exception that causes some SSTable not to be tracked, this size
can become negative. Returning a zero backlog is better than allowing
it to be interpreted as a giant number.
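
The failure mode can be sketched as follows. This is a minimal
illustration, not the actual STCS backlog tracker code: the names
naive_backlog and clamped_backlog are hypothetical. Subtracting past
zero in an unsigned type wraps around to a huge value, so the clamped
variant returns zero instead.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration, not the actual tracker code.
// total_size is unsigned, so removing more bytes than were ever added
// wraps around instead of going negative.
inline uint64_t naive_backlog(uint64_t total_size, uint64_t removed) {
    return total_size - removed;  // wraps modulo 2^64 if removed > total_size
}

// Return zero when the accounting has gone wrong, rather than a value
// that would be read as a giant backlog.
inline uint64_t clamped_backlog(uint64_t total_size, uint64_t removed) {
    return removed > total_size ? 0 : total_size - removed;
}
```

With total_size = 10 and removed = 11, the naive version yields
UINT64_MAX while the clamped version yields 0.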

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:36:22 -04:00
Glauber Costa
4b4e9f6c8c STCS_backlog: remove unused attribute
This attribute ended up being unused in the final version.
Spotted now while reading the code for other purposes.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:36:22 -04:00
Glauber Costa
10046593be compaction strategy: move size tiered backlog to a header
It's very common for other strategies to include a SizeTiered
step somehow inside their algorithms: LCS will do SizeTiered on
L0, TWCS will do SizeTiered within a window, etc.

To make it easier for those strategies to consume the SizeTiered
backlog tracker, we will move that to its own file.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:36:22 -04:00
Glauber Costa
36ccb1dd7c compaction_strategy: delete major_compaction_strategy class
It was already unused before this series. In an earlier version I
used it to provide an ad-hoc backlog for major compactions. But now that
this is done by the compaction manager, this class really isn't being
used.

And it is likely it won't be: major compaction is not a compaction
strategy a user can choose, unlike the others that need to be built
through make_compaction_strategy.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:33:59 -04:00
Glauber Costa
9320d6f17f compaction: make sure that user-initiated compactions always have a minimum priority
We have observed the following behavior with user initiated compactions,
like major compactions:

- if there are no writes, the backlog doesn't increase.
- as compaction progresses the backlog decreases.
- at some point, the backlog is so low that compaction barely makes any
  progress.

Going forward, we should allow one to read from the generated partial
SSTables, in which case this doesn't matter that much. But for
user-initiated compactions we would like to guarantee a minimum baseline.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:33:25 -04:00
Glauber Costa
c55ab93178 backlog_controller: add constants to represent a globally disabled controller
There are situations in which we want the controllers to stop working
altogether. Usually that's when we have an unimplemented controller or
some exception.

We want to return fixed shares in this case, but this is a very
different situation from when we want fixed shares for *one* backlog
tracker: we want to return fixed shares, yes, but if we disable 200
backlog trackers (because they all failed, for instance), we don't want
that fixed number x 200 to be our backlog.

So the mechanism to globally disable the controller is still warranted,
and infinity is a good way to represent that. It's a float that the
controller can easily test against. But actually using infinity in the
code is confusing. People reading it may interpret it as the opposite
of what it means, namely as "a very large backlog".

Let's turn that into a constant instead. It will help us convey meaning.
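
A minimal sketch of the idea, under stated assumptions: the names
disable_backlog, fixed_shares, total_backlog and shares_for, and the
value 100, are made up for illustration and are not taken from the
patch. A named constant wraps infinity, and the controller tests
against it instead of summing per-tracker fixed values.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical names and values; the real controller interface may differ.
constexpr float disable_backlog = std::numeric_limits<float>::infinity();
constexpr float fixed_shares = 100.0f;  // assumed value, for illustration only

// Sum per-tracker backlogs. A single tracker reporting disable_backlog
// disables the whole controller, no matter how many trackers failed.
inline float total_backlog(const std::vector<float>& tracker_backlogs) {
    float total = 0;
    for (float b : tracker_backlogs) {
        if (b == disable_backlog) {
            return disable_backlog;
        }
        total += b;
    }
    return total;
}

// Toy share function: fixed shares when disabled, otherwise a
// placeholder mapping from backlog to shares.
inline float shares_for(float backlog) {
    if (backlog == disable_backlog) {
        return fixed_shares;
    }
    return backlog;  // placeholder; the real mapping is not shown here
}
```

In this sketch, 200 disabled trackers still produce fixed_shares, not
200 times that number.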

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:25:23 -04:00
Glauber Costa
d758a416f8 backlog_controller: move compaction controller to the compaction manager
There was recently an attempt to add minimum shares to major compactions
which ended up being harder than it should be due to all the plumbing
necessary to call the compaction controller from inside the compaction
manager, since it is currently a database object. We had this problem
again when trying to return fixed shares in case of an exception.

Taking a step back, all of those problems stem from the fact that the
compaction controller really shouldn't be a part of the database: as it
deals with compactions and their consequences, it is a lot more natural
to have it inside the compaction manager to begin with.

Once we do that, all the aforementioned problems go away. So let's move
it there, where it belongs.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:24:19 -04:00
Calle Wilund
62c3b4c429 commitlog: Ensure file objects are closed before object free
Fixes #3446

Previously, only shutdown-synced objects were actually closed,
which is wrong.

This introduces yet another queue, processed together with the
deletion objects, which ensures we explicitly close all objects
that have been discarded.

Message-Id: <20180521140456.32100-1-calle@scylladb.com>
2018-05-22 14:52:06 +03:00
Duarte Nunes
4b2fd8d6f2 Merge 'Use hinted handoff to replay missed updates from base to view' from Piotr
"This series leverages hinted handoff for failed view replica
updates."

* 'materialized_view_updates_with_hh_5' of https://github.com/psarna/scylla:
  storage_proxy: enable hinted handoff for materialized views
  storage_proxy: make view updates use consistency_level::ANY
2018-05-22 11:24:37 +01:00
Paweł Dziepak
05c94bc98d mutation_partition: do not dereference null in find_cell()
row::find_cell() may be called for cells that do not exist in that row.
In such case nullptr shall be returned, this patch makes sure that
it is not dereferenced.
Message-Id: <20180522091726.24396-1-pdziepak@scylladb.com>
2018-05-22 10:31:09 +01:00
Glauber Costa
d3f985ef46 backlog_controller: allow users to compute inverse function of shares
There are some situations in which we want to force a specific amount of
shares and don't have a backlog. We can provide a function to get that
from the controller.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-21 19:35:07 -04:00
Avi Kivity
51f5599c75 Merge seastar upstream
* seastar a6cb005...5da5d4e (6):
  > append_challenged_posix_file_impl: Ensure continuation uses non-stale object
  > utils: make make_visitor() public
  > tcp: Adjust receive window
  > tcp: Fix allowed sending size calculation in can_send
  > tcp: Fix assert in tcp::tcb::output_one
  > be more descriptive with failed syscalls for filesystem operations

Contains alternative fix for #3446 (will also be fixed directly).
2018-05-21 20:35:30 +03:00
Piotr Sarna
f5d6326ced storage_proxy: enable hinted handoff for materialized views
This commit initializes and enables hinted handoff for materialized
views, even if HH is not explicitly turned on in config.

User writes still use hinted handoff only if it is explicitly enabled,
while materialized views are allowed to use it unconditionally
in order to store failed replica updates somewhere.

Fixes #3383
2018-05-21 17:09:27 +02:00
Piotr Sarna
da0d458f5f storage_proxy: make view updates use consistency_level::ANY
This commit makes view replica updates internally use consistency
level ANY, so in case an update fails it will fall back to hinted
handoff.

References #3383
2018-05-21 17:09:27 +02:00
Piotr Sarna
ba9e8a4f2c tests: initialize hints directory for cql env
This commit initializes hints_directory config value for cql_test_env.
It's needed now because materialized views support force-enables
hinted handoff.

Message-Id: <2aadf35eee329c1f89977c4a55660f330bd9d591.1526914827.git.sarna@scylladb.com>
2018-05-21 18:06:01 +03:00
Botond Dénes
204f6fd478 test.py: print test args when listing failed tests
This can be very helpful when a test only fails when run with some
particular arguments.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <dac1f7e23afa904156e65c3bb3c8fd52b7e999ff.1526906955.git.bdenes@scylladb.com>
2018-05-21 17:28:18 +03:00
Avi Kivity
f9c2ff1f9c install: prepare /etc directory
install(1) creates missing directories on recent Fedora, but not
on CentOS 7. This causes the RPM build (which installs to a pristine
tree, without an existing /etc) to fail.

Fix by setting up /etc.

Tests: rpm (Fedora, CentOS)
Message-Id: <20180520124937.20466-1-avi@scylladb.com>
2018-05-21 09:51:46 +02:00
Asias He
db8c3a7059 streaming: Do not use dht::split_ranges_to_shards
There is no need to call dht::split_ranges_to_shards to split the token
range into <shard> : <a lot of small ranges> mapping and create a flat
mutation reader with a lot of small ranges.

Because:

1) The flat mutation reader on each shard only returns data that belongs
to this local shard, so there is no correctness issue if we do not split
the range and feed it only the sub-ranges that belong to this local shard.

2) With murmur3_partitioner_ignore_msb_bits = 12, it is almost certain
that given a token range, all the shards will have data for the range
anyway. Even if we ask all the shards to work on the token range and
some of the shards have no data for it, it is fine. We simply send no
data from this shard.

Tests: update_cluster_layout_tests.py

Message-Id: <ac00cd21d6156c47b74451dd415d627481e48212.1526864222.git.asias@scylladb.com>
2018-05-21 10:42:45 +03:00
Takuya ASADA
5407c34c73 dist/debian: depends to coreutils instead of realpath on Ubuntu 18.04
On Ubuntu 18.04 the realpath package is dropped; it becomes part of coreutils.

Fixes #3445

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180521031954.30815-1-syuu@scylladb.com>
2018-05-21 10:42:05 +03:00
Asias He
0c54c6e16f storage_service: Add node has left the cluster log
Removing a node from the cluster is a major operation and deserves a log
message. Add a log message when a node is removed from the cluster by
`nodetool decommission` or `nodetool removenode`.

Message-Id: <b6adf34492c8138296911f2b37b39e9dd8ed10a2.1523347916.git.asias@scylladb.com>
2018-05-19 21:47:05 +03:00
Asias He
e20038eb84 streaming: Handle stream_mutation rpc handler on all shards
In streaming, the sender sends the mutations on all the local shards in
parallel, so it is possible that the receiver handles more than one such
connection on the same shard. This is determined by where the TCP
connection goes. The current rpc ignores the dest shard id when sending
the rpc message.

For instance, say node1 has 2 shards, node2 has 2 shards. Currently, we
can end up with this:

   Node 1 shard 0 -> Node 2 shard 1
   Node 1 shard 1 -> Node 2 shard 1

It is better if we do:

   Node 1 shard 0 -> Node 2 shard 0
   Node 1 shard 1 -> Node 2 shard 1

This patch solves this problem by letting the handler always run on
shard = src_cpu_id % smp::count.

If sender and receiver have the same shard config, the work is
distributed completely evenly.

If sender and receiver do not have the same shard config, it is
unavoidable that some of the shards will do more work than the others.
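
The mapping rule can be sketched as below. handler_shard is a
hypothetical helper, and smp_count stands in for smp::count on the
receiving node; this is an illustration of the formula, not the code
from the patch.

```cpp
#include <cassert>

// Hypothetical helper mirroring "shard = src_cpu_id % smp::count".
// smp_count is the number of shards on the receiving node (must be > 0).
inline unsigned handler_shard(unsigned src_cpu_id, unsigned smp_count) {
    return src_cpu_id % smp_count;
}
```

With 2 shards on both sides, sender shards 0 and 1 map to receiver
shards 0 and 1. With 4 sender shards and 3 receiver shards, sender
shards 0 and 3 both land on receiver shard 0, which is the uneven case
mentioned above.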

Tests: dtest update_cluster_layout_tests.py

Message-Id: <911827bcf67459a07ec92623a9ed4c4fbba195ca.1524622375.git.asias@scylladb.com>
2018-05-19 21:08:25 +03:00
Calle Wilund
f69a52c475 storage_service: Add more error info to "isolate_on_error" shutdown
Fixes #2793

Prints error handle class (commitlog or "other/disk") + exception
type and message. While not exhaustive, at least gives a correlation
point to (hopefully) other log printouts.

Message-Id: <20180509081040.7676-1-calle@scylladb.com>
2018-05-19 21:06:03 +03:00
Piotr Jastrzebski
1520ffe7f5 sstables: check buffer size when reading vints
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <6ecbedae818fbef1f67a4472aba4ce443b9df0ee.1525888830.git.piotr@scylladb.com>
2018-05-19 21:01:45 +03:00
Avi Kivity
46a0109608 Merge "Support compression when writing SSTables 3.x." from Vladimir
"
For compression, SSTables 3.x format uses CRC32 for checksumming
compressed chunks as well as for calculating the full file checksum.
Also, while for older formats "full checksum" of a compressed data file
means a combination of checksums of its compressed chunks, in SSTables
3.x this is now taken literally and means the checksum of all bytes
written, including per-chunk digests.

Tests: unit {debug, release}
"

* 'projects/sstables-30/write-compression/v3' of https://github.com/argenet/scylla:
  tests: Add unit tests for writing compressed SSTables 3.x.
  tests: Validate Digest32.crc for SSTables 3.x write tests.
  tests: Fix invalid Digest file for write_counter_table test.
  sstables: Support writing compressed SSTables 3.0.
  sstables: Make compressed streams customizable on checksumming.
  sstables: Move checksum calculation logic to compressed_output_stream.
2018-05-19 20:52:08 +03:00
Vladimir Krivopalov
d588a7e743 tests: Add unit tests for writing compressed SSTables 3.x.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-19 20:52:08 +03:00
Vladimir Krivopalov
e5ab271863 tests: Validate Digest32.crc for SSTables 3.x write tests.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-19 20:52:08 +03:00
Vladimir Krivopalov
fcc7bad777 tests: Fix invalid Digest file for write_counter_table test.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-19 20:52:07 +03:00
Vladimir Krivopalov
dd00d90a05 sstables: Support writing compressed SSTables 3.0.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-19 20:52:07 +03:00
Vladimir Krivopalov
cc62ad3b69 sstables: Make compressed streams customizable on checksumming.
Use either Adler32 or CRC32 while writing to or reading from a
compressed stream.
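
One way to make the checksum algorithm a compile-time choice is a
template parameter. The sketch below uses textbook Adler-32 and bitwise
CRC-32 implementations; the names adler32_sketch, crc32_sketch and
checksum_chunk are hypothetical and this is not the code added by the
patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Textbook Adler-32 (illustration only).
struct adler32_sketch {
    static uint32_t checksum(const char* data, std::size_t len) {
        uint32_t a = 1, b = 0;
        for (std::size_t i = 0; i < len; ++i) {
            a = (a + static_cast<unsigned char>(data[i])) % 65521;
            b = (b + a) % 65521;
        }
        return (b << 16) | a;
    }
};

// Textbook bitwise CRC-32 with the reflected polynomial 0xEDB88320.
struct crc32_sketch {
    static uint32_t checksum(const char* data, std::size_t len) {
        uint32_t crc = 0xFFFFFFFFu;
        for (std::size_t i = 0; i < len; ++i) {
            crc ^= static_cast<unsigned char>(data[i]);
            for (int bit = 0; bit < 8; ++bit) {
                // Shift right, applying the polynomial if the low bit was set.
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
            }
        }
        return ~crc;
    }
};

// Hypothetical helper: the checksum policy is chosen by the caller.
template <typename Checksum>
uint32_t checksum_chunk(const std::string& chunk) {
    return Checksum::checksum(chunk.data(), chunk.size());
}
```

A caller would pick the policy per format, for example
checksum_chunk<crc32_sketch>(chunk) for one format and
checksum_chunk<adler32_sketch>(chunk) for another.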

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-19 20:52:07 +03:00
Vladimir Krivopalov
5183294676 sstables: Move checksum calculation logic to compressed_output_stream.
Previously, compressed_output_stream used to calculate checksum of the
supplied chunk and pass it to the 'compression' object to combine with
the full checksum calculated on prior writes.
Now, all the checksum calculation happens inside
compressed_output_stream and 'compression' only stores the result.

This is done to loosen ties between two classes and simplify
compressed_output_stream customisation with various checksum algorithms.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-19 20:52:07 +03:00
Glauber Costa
596a525950 commitlog: don't move pointer to segment
We are currently moving the pointer we acquired to the segment inside
the lambda in which we'll handle the cycle.

The problem is, we also use that same pointer inside the exception
handler. If an exception happens we'll access it and we'll crash.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180518125820.10726-1-glauber@scylladb.com>
2018-05-18 17:25:18 +02:00
Avi Kivity
684bb2042d Merge "Fixes and improvements for gdb LSA commands" from Tomasz
* tag 'tgrabiec/fixes-and-improvements-for-gdb-scripts-v1' of github.com:tgrabiec/scylla:
  gdb: Print live object size from 'scylla lsa-segment'
  gdb: Extend 'scylla segment-descs' output with full occupancy info
  gdb: Print allocated object's type name instead of full LSA migrator
  gdb: Fix LSA migrator discovery
  gdb: Drop code related to LSA zones
  gdb: Fix uses of removed segment_desctriptor::_lsa_managed
  lsa: Add use for debug::static_migrators
2018-05-17 15:54:21 +03:00
Tomasz Grabiec
d4a2d22812 gdb: Print live object size from 'scylla lsa-segment' 2018-05-17 14:22:20 +02:00
Tomasz Grabiec
08026a64c5 gdb: Extend 'scylla segment-descs' output with full occupancy info
After:

 0x600007220000: lsa free=24800  used=106272  81.08% region=0x600000403210
 0x600007240000: lsa free=13     used=131059  99.99% region=0x600000403210
 0x600007260000: lsa free=23072  used=108000  82.40% region=0x600000403210
 0x600007280000: lsa free=16772  used=114300  87.20% region=0x600000403210
 0x6000072a0000: lsa free=23996  used=107076  81.69% region=0x600000401410
 0x6000072c0000: lsa free=15552  used=115520  88.13% region=0x600000403210
2018-05-17 14:22:20 +02:00
Tomasz Grabiec
abd667d924 gdb: Print allocated object's type name instead of full LSA migrator
Before:

  0x6000302604e0: live {_vptr.migrate_fn_type = 0x3797a00 <vtable for standard_migrator<cache_entry>+16>, _migrators = std::any containing seastar::lw_shared_ptr<(anonymous namespace)::migrators> = {[contained value] = {_p = 0x600000080a80}}, _align = 8, _index = 0} @ 0x6000302604e8

After:

  0x6000302604e0: live cache_entry @ 0x6000302604e8
2018-05-17 14:22:14 +02:00
Tomasz Grabiec
653fcc10bb gdb: Fix LSA migrator discovery
Fixes 'scylla lsa-segment' which broke after recent changes, probably
commit b3699f286d.
2018-05-17 14:22:14 +02:00
Tomasz Grabiec
bb8f82f43f gdb: Drop code related to LSA zones
LSA zones have been removed.
2018-05-17 14:22:14 +02:00
Tomasz Grabiec
84a7961c23 gdb: Fix uses of removed segment_desctriptor::_lsa_managed 2018-05-17 14:22:14 +02:00
Tomasz Grabiec
498a4132c5 lsa: Add use for debug::static_migrators
Otherwise GDB complains about it being optimized out, breaking our
debug scripts.
2018-05-17 14:22:14 +02:00
Avi Kivity
d9c80cac26 dist: move Red Hat installation from .spec %install to new install.sh
Move code to a traditional install.sh script (more traditional would be
a "make install", but this is close enough).

This allows testing installation independently of packaging. In addition,
non-Red Hat-packaging can share much of the code in install.sh.

Ref #3243.

Tests: build+install rpm
Message-Id: <20180517114147.30863-1-avi@scylladb.com>
2018-05-17 13:46:27 +02:00
Avi Kivity
98967da94f Merge seastar upstream
* seastar 0a1a327...a6cb005 (1):
  > Merge " misc fixes for iotune" from Glauber
2018-05-17 12:42:46 +03:00
Avi Kivity
3b8118d4e5 dist: redhat: get rid of raid0.devices_discard_performance
This parameter is not available on recent Red Hat kernels or on
non-Red Hat kernels (it was removed on 3.10.0-772.el7,
RHBZ 1455932). The presence of the parameter on kernels that don't
support it causes the module load to fail, with the result that the
storage is not available.

Fix by removing the parameter. For someone running an older Red Hat
kernel the effect will be that discard is disabled, but they can fix
that by updating the kernel. For someone running a newer kernel, the
effect will be that they can access their data.

Fixes #3437.
Message-Id: <20180516134913.6540-1-avi@scylladb.com>
2018-05-16 15:38:29 +01:00
Avi Kivity
20271b3890 Update scylla-ami submodule
* dist/ami/files/scylla-ami e0b35dc...025644d (1):
  > Merge "AMI build fix" from Takuya
2018-05-16 12:33:45 +03:00
Avi Kivity
05cec4a265 Merge "Reduce LSA memory reclamation overhead" from Tomasz
"
Main optimization is in the patch titled "lsa: Reduce amount of segment compactions".

I measured 50% reduction of cache update run time in a steady state for an
append-only workload with large partition, in perf_row_cache_update version from:

  c3f9e6ce1f/tests/perf_row_cache_update.cc

Other workloads, and other allocation sites probably also could see the
improvement.
"

* tag 'tgrabiec/reduce-lsa-segment-compactions-v1' of github.com:tgrabiec/scylla:
  lsa: Expose counters for allocation and compaction throughput
  lsa: Reduce amount of segment compactions
  lsa: Avoid the call to segment_pool::descriptor() in compact()
  lsa: Make reclamation on reserve refill more efficient
2018-05-16 10:24:20 +03:00
Tomasz Grabiec
534068a0f7 Update seastar submodule
Fixes #3339

* seastar 840002c...0a1a327 (7):
  > Merge "fix perftune.py issues with cpu-masks on big machines" from Vlad
  > Merge 'Handle Intel's NICs in a special way'  from Vlad
  > reactor: fix calculation of idle ticks
  > log: streamline logging internals a little
  > Merge "CMake imrovements and compatibility" from Jesse
  > iotune: fix typo in property name
  > cmake: do not find_package(Boost ...) if Boost is a target
2018-05-16 09:11:22 +02:00
Avi Kivity
832e8fb1e0 Merge "Support writing counters in SSTables 3.x format." from Vladimir
"
This patchset adds support for writing counter cells in SSTables 3.x
format ('m'). The logic of writing counters is almost identical to that
used for the old 2.x format ('k'/'l') with the only difference that the
data length preceding serialised shards is written as a vint.

Tests: unit {release}.

Generated SSTables are verified to be processed fine by sstabledump
(note that sstabledump only outputs the binary data for counters, not
their actual values, same as sstable2json).

Verified with Cassandra 3.11 to get the expected values from the
counters table:
cqlsh> SELECT * from sst3.counter_table;

 pk  | ck  | rc1 | rc2
-----+-----+-----+-----
 key | ck1 |  10 |   1

(1 rows)

Verified that the deleted counter can no longer be updated:
cqlsh> use sst3 ;
cqlsh:sst3> UPDATE counter_table SET rc1 = rc1 + 2 WHERE pk = 'key' AND ck = 'ck2';
cqlsh:sst3> SELECT * from sst3.counter_table;

 pk  | ck  | rc1 | rc2
-----+-----+-----+-----
 key | ck1 |  10 |   1

(1 rows)
"

* 'projects/sstables-30/write_counters/v1' of https://github.com/argenet/scylla:
  tests: Unit tests to cover writing counters in SSTables 3.x format.
  sstables: Support writing counters for SSTables 3.x.
  sstables: Move code writing counter value into a separate helper.
2018-05-16 08:46:15 +03:00
Raphael S. Carvalho
59c57861ae tests/sstable_test: switch to dynamic temporary dir creation
sstable test fails when running concurrently (for example, release and debug
mode) because it uses a static temporary dir in lots of tests.
Let's fix it by switching to dynamic temporary dir, which is created using
mkdtemp(). Also the sstable tests will now run in /tmp, and so it's made
much faster.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180516042044.15336-1-raphaelsc@scylladb.com>
2018-05-16 08:00:29 +03:00
Tomasz Grabiec
4fdd61f1b0 lsa: Expose counters for allocation and compaction throughput
Allow observing amplification induced by segment compaction.
2018-05-15 21:49:01 +02:00
Tomasz Grabiec
3775a9ecec lsa: Reduce amount of segment compactions
Reclaiming memory through segment compaction is expensive. For
occupancy of 85%, in order to reclaim one free segment, we need to
compact 7 segments, by migrating 6 segments worth of data. This results
in significant amplification. Compaction involves moving objects,
which in some cases is expensive in itself as well
(See https://github.com/scylladb/scylla/issues/3247).

This patch reduces the amount of segment compactions in favor of doing
more eviction. It especially helps workloads in which LRU order
matches allocation order, in which case there will be no segment
compaction, and just eviction.

In perf_row_cache_update test case for large partition with lots of
rows, which simulates appending workload, I measured that for each new
object allocated, 2 need to be migrated, before the patch. After the
patch, only 0.003 objects are migrated. This reduces run time of
cache update part by 50%.
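
The 85% figure above can be checked with a back-of-the-envelope model.
This is only arithmetic, not LSA code, and the function names are
hypothetical: each compacted segment contributes (100 - occupancy)% of
a segment of free space, so we need enough segments for that to add up
to one whole segment.

```cpp
#include <cassert>

// Number of segments that must be compacted to reclaim one free segment.
// occupancy_pct must be below 100.
inline unsigned segments_to_compact(unsigned occupancy_pct) {
    unsigned free_pct = 100 - occupancy_pct;   // free space per segment
    return (100 + free_pct - 1) / free_pct;    // ceil(100 / free_pct)
}

// Live data that has to be migrated, in hundredths of a segment.
inline unsigned migrated_centisegments(unsigned occupancy_pct) {
    return segments_to_compact(occupancy_pct) * occupancy_pct;
}
```

For 85% occupancy this gives 7 segments compacted and 595 hundredths,
that is about 6 segments worth of data migrated.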
2018-05-15 21:49:01 +02:00
Vladimir Krivopalov
a16b8d5d77 tests: Unit tests to cover writing counters in SSTables 3.x format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-15 11:44:44 -07:00
Vladimir Krivopalov
ffd8886da9 sstables: Support writing counters for SSTables 3.x.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-15 11:44:44 -07:00
Vladimir Krivopalov
28c3c21c73 sstables: Move code writing counter value into a separate helper.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-15 11:44:44 -07:00
Avi Kivity
5f3a5c436e Merge "chunked vector memory estimation" from Glauber
"
The memory estimations we have when using the chunked vector
are usually slightly wrong. We can make them more accurate by
exporting the memory usage directly as a chunked_vector API.
"

* 'chunked_memory-v2' of github.com:glommer/scylla:
  large_bitset: be more accurate with memory usage
  chunked_vector: exports its current memory usage
2018-05-15 19:00:36 +03:00
Glauber Costa
2ba08178ca large_bitset: be more accurate with memory usage
We are slightly underestimating the amount of memory we use. Now that
the chunked vector can export its internal memory usage we can use that
directly.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-15 11:22:21 -04:00
Glauber Costa
7190bb4f95 chunked_vector: exports its current memory usage
There are times in which we would like to estimate how much memory
a chunked_vector is using. We have two strategies to do it:

1) multiply the size by the size of the elements. That is wrong, because
the chunked_vector can allocate larger chunks in anticipation of more
elements to come.

2) multiply the number of chunks by 128kB. That is also wrong, because
the chunked_vector will not always allocate the entire chunk if there are
only a few elements in it.

The best way to deal with it is to allow the chunked_vector to export
its current memory usage.
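
The difference between the two estimates and the real usage can be
sketched with a simplified model. chunk_layout and the three functions
are hypothetical and are not the chunked_vector API; the model only
records how many element slots each chunk has allocated.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Simplified, hypothetical model of a chunked container's layout.
struct chunk_layout {
    std::size_t element_size;                   // bytes per element
    std::size_t size;                           // elements stored
    std::vector<std::size_t> chunk_capacities;  // slots allocated per chunk
};

constexpr std::size_t max_chunk_bytes = 128 * 1024;

// Strategy 1: ignores slots allocated in anticipation of growth.
inline std::size_t estimate_by_size(const chunk_layout& l) {
    return l.size * l.element_size;
}

// Strategy 2: assumes every chunk is a full 128 kB.
inline std::size_t estimate_by_chunks(const chunk_layout& l) {
    return l.chunk_capacities.size() * max_chunk_bytes;
}

// What the container itself can report: the bytes it really allocated.
inline std::size_t actual_usage(const chunk_layout& l) {
    std::size_t slots = std::accumulate(l.chunk_capacities.begin(),
                                        l.chunk_capacities.end(),
                                        std::size_t{0});
    return slots * l.element_size;
}
```

For 100 eight-byte elements in a single chunk with 128 slots, the first
estimate is 800 bytes, the second is 131072 bytes, and the allocation is
actually 1024 bytes.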

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-15 11:22:21 -04:00
Raphael S. Carvalho
83e64192d3 tests/perf: fix compaction and write mode of perf_sstable
storage_service_for_tests must be instantiated only once at a global
scope.

Fixes #3369.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180510042200.2548-1-raphaelsc@scylladb.com>
2018-05-15 18:00:18 +03:00
Avi Kivity
e0ef39705f dist: redhat: properly package scylla_blocktune.py
Commit 9eb8ea8b11 installed
scylla_blocktune.py as part of preparing the rpm, but forgot
to add it to the installed file list, breaking the rpm build.

Fix by listing the file in the %files section.
Message-Id: <20180506202807.5719-1-avi@scylladb.com>
2018-05-15 18:00:05 +03:00
Piotr Sarna
40bf5d671b cql: add secondary index metrics
This commit adds basic secondary index metrics to cql_stats:
 * total number of indexes created
 * total number of indexes dropped
 * total number of reads from a secondary index
 * total number of rows read from a secondary index

References #3384
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <d5eda7a343cee547c921dd4d289ecb1ac1c2bf24.1526374243.git.sarna@scylladb.com>
2018-05-15 17:59:53 +03:00
Avi Kivity
4f81e1f55a Merge "Use CRC32 to calculate checksums for SSTables 3.0." from Vladimir
"
SSTables 3.x (format 'm') use CRC32 instead of Adler32 for calculating
checksums. This patchset introduces support for CRC32 along with Adler32
in checksummed_file_writer to be used for SSTables written in 'mc'
format.

Structures and helpers introduced for CRC32 will be later used for
calculating checksums for compressed files as well (not a part of this
patchset).

Tests: unit {release}
"

* 'projects/sstables-30/write-digest-crc/v3' of https://github.com/argenet/scylla:
  tests: Add test covering checksumming SSTables 3.0 with CRC32.
  sstables: Support CRC32 checksum for SSTables 3.x.
  sstables: Move adler32 routines under the scope of a class.
  sstables: Move checksum utils into separate header.
  sstables: Remove unused 'checksum_file' flag from checksummed_file_writer.
2018-05-15 10:18:14 +03:00
Duarte Nunes
3a7d655d01 Merge 'transport: reduce unneeded continuations' from Avi
"
The native protocol server generates many reactor tasks that
can be easily eliminated. I measured a read workload with 100%
cache hit rate, seeing the number of tasks per request drop
from ~31 to ~27, and an increase of 3% in throughput.
"

* tag 'transport-optimize-1/v1' of https://github.com/avikivity/scylla:
  transport: remove unused capture of flags variable
  transport: merge response write and error handling continuations
  transport: make write_repsonse() return void
  transport: de-template a lambda
  transport: merge memory-management and logging continuations
  transport: remove gate continuation
  transport: merge two response processing continuations
  transport: simplify response processing continuation
  transport: remove gratuitous continuation from process_request_one()
2018-05-14 10:12:07 +01:00
Avi Kivity
a99e820bb9 query_processor: require clients to specify timeout configuration
Remove implicit timeouts and replace with caller-specified timeouts.
This allows removing the ambiguity about what timeout a statement is
executed with, and allows removing cql_statement::execute_internal(),
which mostly overrode timeouts and consistency levels.

Timeout selection is now as follows:

  query_processor::*_internal: infinite timeout, CL=ONE
  query_processor::process(), execute(): user-specified consistency level and timeout

All callers were adjusted to specify an infinite timeout. This can be
further adjusted later to use the "other" timeout for DCL and the
read or write timeout (as needed) for authentication in the normal
query path.

Note that infinite timeouts don't mean that the query will hang; as
soon as the failure detector decides that the node is down, RPC
responses will terminate with a failure and the query will fail.
2018-05-14 09:41:06 +03:00
Avi Kivity
4500baaaf4 transport: remove unused capture of flags variable 2018-05-14 09:41:06 +03:00
Avi Kivity
2a1f231f82 query_processor: un-default consistency level in make_internal_options
Make the consistency level explicit in the caller in order to clarify
what is going on.

An "internal" query used to mean that it was accessing local tables,
so infinite timeouts and a consistency level of ONE were indicated,
but authentication accesses non-local tables so explicit consistency
level and timeouts are needed.
2018-05-14 09:41:06 +03:00
Avi Kivity
88f8fe3168 transport: merge response write and error handling continuations
The response write continuation does not defer, so traditional try/catch
works well and saves a continuation.
2018-05-14 09:41:06 +03:00
Avi Kivity
3e8d1c8fd7 transport: make write_repsonse() return void
It just schedules the response, and returns immediately.

(I thought about calling it schedule_response(), but usually it will
write the response immediately, since waiting for network writes is
rare in a local network).
2018-05-14 09:41:06 +03:00
Avi Kivity
b26f36c2ec transport: de-template a lambda
Generic templates = annoying.
2018-05-14 09:41:06 +03:00
Avi Kivity
7a9b73f166 transport: merge memory-management and logging continuations
Merge a continuation that just keeps things alive with another that
just logs things.
2018-05-14 09:41:06 +03:00
Avi Kivity
f0887a55e4 transport: remove gate continuation
with_gate() generates a continuation if the protected function defers.
Avoid that by merging a gate::leave() call with another, preexisting,
continuation.
2018-05-14 09:41:06 +03:00
Avi Kivity
876837a5da transport: merge two response processing continuations
We have one continuation transforming the result, and another shutting
down tracing. Since the first cannot defer, we can merge the two, reducing
the number of tasks processed by the reactor.
2018-05-14 09:41:06 +03:00
Avi Kivity
38619138be transport: simplify response processing continuation
A continuation in the response processing path is only doing
transformation on the output. Make that clear by returning a value,
not a future.
2018-05-14 09:41:06 +03:00
Avi Kivity
f0a1478b6c transport: remove gratuitous continuation from process_request_one()
No need to call then() just to convert exceptions to futures,
futurize_apply() does this with less ado.
2018-05-14 09:41:06 +03:00
Vladimir Krivopalov
1da6144f90 tests: Add test covering checksumming SSTables 3.0 with CRC32.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-13 12:38:25 -07:00
Vladimir Krivopalov
e6dfa008d8 sstables: Support CRC32 checksum for SSTables 3.x.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-13 12:38:25 -07:00
Vladimir Krivopalov
adb43959d1 sstables: Move adler32 routines under the scope of a class.
This is a step towards making digest algorithm customizable at compile
time.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-13 12:38:25 -07:00
Vladimir Krivopalov
4e4030676f sstables: Move checksum utils into separate header.
Checksummed writer doesn't need to include all compression stuff.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-13 12:38:25 -07:00
Nadav Har'El
f5536d607e secondary index: fix multiple appearance of rows
This patch fixes a bug where queries using a secondary index would, in
some cases, produce the same rows multiple times.

The problem was that the code begins by finding a list of primary keys
that match the search, and then works on the partitions containing them.
If multiple rows matched in the same partition, the partition was considered
multiple times, and the same rows were output multiple times.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180510203141.17157-1-nyh@scylladb.com>
2018-05-13 20:08:14 +02:00
Avi Kivity
7d29addb1f mutation_reader: optimize make_combined_reader for the single-reader case
If we're given a single reader (can be common in a low-write-rate table,
where most of the data will be in a single large sstable, or in leveled
tables) then we can avoid the overhead of the combining reader by returning
the single input.

Tests: unit (release)
Message-Id: <20180513130333.15424-1-avi@scylladb.com>
2018-05-13 20:07:10 +02:00
Duarte Nunes
a23bda3393 Merge 'Implement separate timeout for range queries' from Avi
"
This patchset implements separate timeouts for range queries, and lays
the foundations for separate timeouts for other query types.

While the feature in itself is worthy, the real motivation is to have
the timeouts decided by the caller, instead of storage_proxy. This in
turn is required to disentangle each layer behaving differently
depending on whether the query is internal or not; instead, the goal
is to have each caller declare its needs in terms of consistency level
and timeouts, and have the lower layers implement its requirements
instead of making their own decisions.

Fixes #3013.

Tests: unit (release)
"

* tag '3013/v1.1' of https://github.com/avikivity/scylla:
  storage_proxy: remove default_query_timeout()
  storage_proxy: don't use default timeouts
  query_options: augment with timeout_config
  thrift: configure thrift transport and handler with a timeout_config
  transport: configure native transport with a timeout_config
  cql3: define and populate timeout_config_selector
  timeout_config: introduce timeout configuration
2018-05-13 20:05:50 +02:00
Glauber Costa
3d2c4c1cf8 main: change I/O scheduler verification code
Before we accept running while not in developer mode, we verify that
the I/O Scheduler is properly configured. Up until now, that meant
verifying that --max-io-requests is properly set and that the number
of I/O Queues is enough to leave at least 4 requests per I/O Queue.

Systems that move to newer versions of Scylla may continue doing that,
so we need to be backwards compatible and keep testing for that.
However, newer systems will not set that option, but pass a YAML
property file (or string) instead. So we need to make sure that
either one of those is set.

If the property file is set, I am deciding here not to test for
number of I/O queues. scylla_io_setup will usually configure that
anyway, plus we plan on soon moving to all-shards-dispatch making
that less important.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180509163737.5907-1-glauber@scylladb.com>
2018-05-13 19:22:54 +03:00
Glauber Costa
2e0c673432 database: release flush permits earlier
There is an ongoing discussion in issue 2678 about the right time to
release permits. Right now we are releasing the permit after we flush
all data for the memtable plus the SSTables accompanying components -
plus flushing them, closing them, etc.

During all that time, we are increasing virtual dirty by adding more
data to the buffers but we are not able to decrease it-- until we
release the permit we can't start flushing the next memtable. This is
much more of a concern than I/O overlapping as described in the issue.

We have a hook in the SSTable write process that is (should be) called
as soon as data is written. We should move the permit release there.

We aren't, though, calling that as early as we could. The call to the
data written hook is made after the Index is closed, summary is
sealed, etc.

This patch fixes that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180508182746.28310-2-glauber@scylladb.com>
2018-05-13 19:22:54 +03:00
Tomasz Grabiec
8faafdaae5 lsa: Avoid the call to segment_pool::descriptor() in compact() 2018-05-11 19:07:23 +02:00
Tomasz Grabiec
19edf3970e lsa: Make reclamation on reserve refill more efficient
Currently reserve refill allocates segments repeatedly until the
reserve threshold is met. If single segment allocation needs to
reclaim memory, it will ask the reclaimer for one segment. The
reclaimer could make better decisions if it knew the total number of
segments we try to allocate. In particular, it would not attempt to
compact any segment until it evicts total amount of memory first,
which may reduce the total amount of segment compactions during
refill.

This patch changes refill to increase reclamation step used by
allocate_segment() so that it matches the total amount of memory we
refill.
2018-05-11 19:07:23 +02:00
Takuya ASADA
6fa3c4dcad dist/redhat: replace scylla-libgcc72/scylla-libstdc++72 with scylla-2.2 metapackage
We have a conflict between scylla-libgcc72/scylla-libstdc++72 and
scylla-libgcc73/scylla-libstdc++73, so we need to replace the *72 packages
with the scylla-2.2 metapackage to prevent it.

Fixes #3373

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180510081246.17928-1-syuu@scylladb.com>
2018-05-11 09:41:57 +03:00
Vladimir Krivopalov
f443e85476 sstables: Remove unused 'checksum_file' flag from checksummed_file_writer.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-10 11:11:06 -07:00
Paweł Dziepak
863a96db48 Merge "Fix partition tombstones for SSTables 3.x" from Vladimir
"Previously, partition tombstone was not written for partitions with no
rows causing corrupted data files.

This is now fixed and covered with tests.

In addition, we now track partition tombstones while collecting encoding
statistics."

* 'projects/sstables-30/fix-partition-tombstone/v3' of https://github.com/argenet/scylla:
  tests: Don't use deprecated schema constructor.
  tests: Add tests to cover partitions consisting only of partition keys.
  sstables: Make sure partition level tombstone is written for partitions with no rows.
  memtable: Collect statistics from partition-level tombstone.
2018-05-10 16:27:20 +01:00
Vladimir Krivopalov
d7177d9013 tests: Don't use deprecated schema constructor.
Rely entirely on schema_builder facilities while preparing schema for
unit tests.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-10 08:13:29 -07:00
Vladimir Krivopalov
64cdb30379 tests: Add tests to cover partitions consisting only of partition keys.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-10 08:12:58 -07:00
Vladimir Krivopalov
97079208db sstables: Make sure partition level tombstone is written for partitions with no rows.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-10 07:28:54 -07:00
Vladimir Krivopalov
ffc3a1ffeb memtable: Collect statistics from partition-level tombstone.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-10 07:28:50 -07:00
Duarte Nunes
21ccf173a1 Merge 'Preparatory cleanup for stateful range-scans' from Botond
"
This is preparatory cleanup series with fixes/cleanup of miscellaneous
issues that I discovered while working on the stateful range-scans.
Since the stateful range-scans series, even without these patches, is a
20+ patches strong series I'd like to fast-track this, to ease reviewing
the former.
Most of the changes here are related to code-hygiene and effectiveness
and there is a patch that is correctness-related ("querier: check only
the end bound of ranges when matching them") and one that is related to
ease-of-use ("range: clean the deduced transformed type").
Note that although these changes were made in the context of working on
the stateful range-scans they make sense on their own as well.

Tests: unit(release, debug)
"

* '1865/pre-range-scans-cleanup/v1' of https://github.com/denesb/scylla:
  multishard_combining_reader: use optimized optional for the shard reader
  Use dht::token_range alias for last/preferred replicas
  storage_proxy::coordinator_query_result: merge constructors into one w/ default params
  querier: check only the end bound of ranges when matching them
  querier: take range and slice by value
  querier: remove const params from make_compaction_state()
  querier: make _range and _slice const
  flat_multi_range_mutation_reader: optimize for non-plural range vectors
  range: clean the deduced transformed type
2018-05-10 11:09:44 +01:00
Botond Dénes
7a3eab90c8 multishard_combining_reader: use optimized optional for the shard reader
Use flat_mutation_reader_opt instead of
std::optional<flat_mutation_reader>.
2018-05-10 13:06:47 +03:00
Duarte Nunes
d49348b0e1 Merge 'Include OPTIONS with LIST ROLES' from Jesse
"
Fixes #3420.

Tests: dtest (`auth_test.py`), unit (release)
"

* 'jhk/fix_3420/v2' of https://github.com/hakuch/scylla:
  cql3: Include custom options in LIST ROLES
  auth: Query custom options from the `authenticator`
  auth: Add type alias for custom auth. options
2018-05-10 11:03:29 +01:00
Vladimir Krivopalov
e5477c6c6c utils: Use dedicated enum for Bloom filter format instead of a boolean.
It better reflects the purpose of the parameter and provides better type-safety.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <10a4fc16dafa0fb3234969041f68f9e7bfc61312.1525899669.git.vladimir@scylladb.com>
2018-05-10 09:47:41 +03:00
Avi Kivity
76c64e1f26 Merge "Prepare for the new in-memory representation" from Paweł
"
These patches were extracted from much larger series that introduces new
in-memory representation of cells. They contain various enhancements and
fixes that to a varying degree make sense on their own. Sending them
separately will hopefully ease the review and merging process of the whole
IMR effort.

Tests: unit(release).
"

* tag 'pre-imr/v1' of https://github.com/pdziepak/scylla:
  tests/perf: add microbenchmarks for basic row operations
  tests: simple_schema: add make_row_from_serialized_value()
  row: add clear_hash()
  types: move compare_unsigned() to bytes.hh
  lsa: provide migrator with the object size
  lsa: add free() that does not require object size
  db/view/build_progress: avoid copying mutation fragment
  mutation_partition: enable ADL for cell swap
  types: make some collection_type_impl functions non-static
  counters: drop revertability of apply()
  mutable_view: add default constructor and const_iterator
  tests/mutation_reader: do not apply mutations created on another shard
  sstables: do not call atomic_cell::value() for dead cells
  lsa: sanitize use of migrators
  lsa: reuse registered migrator ids
  lsa: make migrators table thread-local
2018-05-10 09:41:49 +03:00
Botond Dénes
ddd70dc113 Use dht::token_range alias for last/preferred replicas
Use the pre-existing type alias instead of fully spelling out the type
everywhere.
2018-05-10 06:22:39 +03:00
Botond Dénes
52affa2a61 storage_proxy::coordinator_query_result: merge constructors into one w/ default params 2018-05-10 06:22:39 +03:00
Botond Dénes
3b6f4e4901 querier: check only the end bound of ranges when matching them
The querier provides a `matches(const nonwrapping_range&)` member to
allow for checking whether a range matches that with which the querier
was originally created. The check for match is more lax than a strict
equality check as ranges are shrunk as the query progresses.
Because of this the above member only checked that one of the bounds of
the examined ranges matches. This is adequate for this purpose
because, in the context of a single query, it is guaranteed that no
two read requests to the same replica will have overlapping range.
However Avi pointed out in a recent, related review, that this check can
be made a little more strict by requiring that the end-bounds of the
two ranges *always* match, instead of allowing any of the bounds to
match.
2018-05-10 06:22:39 +03:00
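The difference between the two checks described in 3b6f4e4901 can be sketched as follows. This is a minimal illustration, not the querier code: `token_range`, `matches_any_bound()` and `matches_end_bound()` are hypothetical names, and bounds are plain integers.

```cpp
#include <cassert>

// Hypothetical stand-in for a range; the real code works on
// nonwrapping_range bounds, not plain integers.
struct token_range {
    int start;
    int end;
};

// Lax check: accept if either bound is unchanged.
inline bool matches_any_bound(const token_range& original, const token_range& seen) {
    return original.start == seen.start || original.end == seen.end;
}

// Stricter check: the end bound must always be unchanged. A range that was
// shrunk from the start as the query progressed still passes.
inline bool matches_end_bound(const token_range& original, const token_range& seen) {
    return original.end == seen.end;
}
```

A range whose start moved forward still passes both checks, while a range that only shares the start bound passes the lax check alone.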
Botond Dénes
eba90d0208 querier: take range and slice by value
It needs to copy these anyway so give callers the opportunity to move
these in.
2018-05-10 06:22:39 +03:00
Botond Dénes
546a0e292e querier: remove const params from make_compaction_state() 2018-05-10 06:22:39 +03:00
Botond Dénes
bc01833cad querier: make _range and _slice const
Since we are storing them on the heap we can make them const and still
be movable. We get the cake and can eat it too.
2018-05-10 06:22:39 +03:00
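The "const and still movable" point in bc01833cad can be illustrated with standard types. This is a sketch under assumptions: `inline_const_holder` and `heap_const_holder` are hypothetical types, and `std::string` stands in for the range and slice.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <type_traits>
#include <utility>

// A const member stored inline deletes the implicit move assignment,
// because a const member cannot be reassigned.
struct inline_const_holder {
    const std::string range;
    explicit inline_const_holder(std::string r) : range(std::move(r)) {}
};

// Storing the value on the heap behind a pointer-to-const keeps the value
// immutable while the holder stays movable: a move only transfers the pointer.
struct heap_const_holder {
    std::unique_ptr<const std::string> range;
    // Taken by value so that callers can move their argument in.
    explicit heap_const_holder(std::string r)
        : range(std::make_unique<const std::string>(std::move(r))) {}
};

static_assert(!std::is_move_assignable<inline_const_holder>::value,
              "inline const member blocks move assignment");
static_assert(std::is_move_assignable<heap_const_holder>::value,
              "heap-allocated const value keeps the holder movable");
```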
Botond Dénes
f5b012c952 flat_multi_range_mutation_reader: optimize for non-plural range vectors
Don't create a flat_multi_range_mutation_reader when the range vector
has 0 or 1 element. In the former case create an empty reader and in the
latter just create a reader with the mutation-source with the only range
in the vector.
2018-05-10 06:22:39 +03:00
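The shape of the optimization in f5b012c952 is a factory that picks the cheapest reader for the number of ranges it is given. The following is only a sketch of that dispatch; `reader`, `key_range` and `make_range_reader()` are hypothetical and far simpler than the real flat mutation readers.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct key_range { int start; int end; };

struct reader {
    virtual ~reader() = default;
    virtual std::size_t range_count() const = 0;
    virtual bool is_multi_range() const { return false; }
};

struct empty_reader final : reader {
    std::size_t range_count() const override { return 0; }
};

struct single_range_reader final : reader {
    key_range range;
    explicit single_range_reader(key_range r) : range(r) {}
    std::size_t range_count() const override { return 1; }
};

struct multi_range_reader final : reader {
    std::vector<key_range> ranges;
    explicit multi_range_reader(std::vector<key_range> rs) : ranges(std::move(rs)) {
        assert(ranges.size() > 1); // smaller inputs are handled by the factory
    }
    std::size_t range_count() const override { return ranges.size(); }
    bool is_multi_range() const override { return true; }
};

// Only pay for the multi-range machinery when there are at least two ranges.
std::unique_ptr<reader> make_range_reader(std::vector<key_range> ranges) {
    if (ranges.empty()) {
        return std::make_unique<empty_reader>();
    }
    if (ranges.size() == 1) {
        return std::make_unique<single_range_reader>(ranges.front());
    }
    return std::make_unique<multi_range_reader>(std::move(ranges));
}
```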
Botond Dénes
16319c2036 range: clean the deduced transformed type
wrapping_range and nonwrapping_range offer a transform() member function
which allows creating a new range by applying a transformer function to
the bounds of the current range. The type of bounds of the new range is
deduced from the return type for this transformer function. However the
return type is used as-is, with any CV or reference attached to it.
Since it doesn't make sense to create a range of references or a type
with CV qualifiers strip these off the deduced type.
2018-05-10 06:22:39 +03:00
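The type cleanup described in 16319c2036 can be sketched with standard type traits. `simple_range` below is a hypothetical, much reduced stand-in for wrapping_range/nonwrapping_range; only the deduction of the new bound type is of interest.

```cpp
#include <cassert>
#include <type_traits>

template <typename T>
struct simple_range {
    T start;
    T end;

    // Build a new range by applying `f` to both bounds. The bound type of
    // the result is the transformer's return type with references and CV
    // qualifiers stripped, so a transformer returning `const int&` yields
    // simple_range<int> rather than simple_range<const int&>.
    template <typename F>
    auto transform(F&& f) const {
        using raw = std::invoke_result_t<F&, const T&>;
        using bound = std::remove_cv_t<std::remove_reference_t<raw>>;
        return simple_range<bound>{f(start), f(end)};
    }
};
```

This sketch needs C++17 for `std::invoke_result_t`.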
Jesse Haber-Kucharsky
4ffb4c6788 cql3: Include custom options in LIST ROLES
An implementation of `authenticator` can support custom options for
each role.

If, to make up an example, the authenticator supported the `region` key,
then a role would be created as follows:

CREATE ROLE jsmith WITH OPTIONS = { 'region': 'north_america' }
                    AND PASSWORD = 'super_secure';

LIST ROLES will now print this custom option map as an additional column
with the heading "options".

However, none of the implementations of `authenticator` in Scylla
currently support OPTIONS, so LIST ROLES will in practice, for now,
print the empty set:

 role      | super | login | options
-----------+-------+-------+---------
 cassandra |  True |  True |        {}
2018-05-09 21:17:14 -04:00
Jesse Haber-Kucharsky
cd0553ca6a auth: Query custom options from the authenticator
None of the `authenticator` implementations we have support custom
options, but we should support this operation to support the relevant
CQL statements.
2018-05-09 21:12:50 -04:00
Jesse Haber-Kucharsky
e149e48609 auth: Add type alias for custom auth. options 2018-05-09 21:12:47 -04:00
Paweł Dziepak
0b8a85b15f tests/perf: add microbenchmarks for basic row operations 2018-05-09 16:52:26 +01:00
Paweł Dziepak
e949061126 tests: simple_schema: add make_row_from_serialized_value()
simple_schema::make_row() is not very well suited for performance tests
of row and cell creation since it serialises the value. This patch
introduces a new function that performs only minimal actions.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
33dffd5fb6 row: add clear_hash()
Needed to measure the performance of hashing a cell.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
f9940f620a types: move compare_unsigned() to bytes.hh
compare_unsigned() is a general utility function that compares two
bytes_view byte-by-byte. There is no need to include whole type.hh in
order to make it available.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
c6c5accd19 lsa: provide migrator with the object size
While the migration function should have enough information to obtain
the object size itself, the LSA logic needs to compute it as well.
IMR is going to make calculating object sizes more expensive, so by
providing the information to the migrator we can avoid some needless
operations.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
884888dc11 lsa: add free() that does not require object size
It is non-trivial to get the size of an IMR object. However, the
standard allocator doesn't really need it and LSA can compute it itself
by asking the migrator.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
75b8b521d9 db/view/build_progress: avoid copying mutation fragment 2018-05-09 16:52:26 +01:00
Paweł Dziepak
00509913fc mutation_partition: enable ADL for cell swap
Calling fully qualified std::swap() prohibits the cell objects from
using their own swap implementations. This patch invokes std::swap in
the usual ADL-friendly way.
2018-05-09 16:52:26 +01:00
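The "usual ADL-friendly way" mentioned in 00509913fc is the `using std::swap;` idiom. A minimal sketch, assuming a hypothetical `demo::cell` type whose own `swap()` counts its calls so that the chosen overload is observable:

```cpp
#include <cassert>
#include <utility>

namespace demo {

int custom_swaps = 0; // only here to show which overload ran

struct cell {
    int value = 0;
};

// Type-specific swap, found through argument-dependent lookup.
inline void swap(cell& a, cell& b) noexcept {
    ++custom_swaps;
    std::swap(a.value, b.value);
}

} // namespace demo

template <typename T>
void swap_qualified(T& a, T& b) {
    std::swap(a, b); // fully qualified: always the generic std::swap
}

template <typename T>
void swap_adl(T& a, T& b) {
    using std::swap; // fallback for types without their own swap
    swap(a, b);      // unqualified: ADL can pick demo::swap
}
```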
Paweł Dziepak
0b4c6b8938 types: make some collection_type_impl functions non-static
The switch to the new in-memory representation will require larger
parts of the logic to be aware of the type of the values they are dealing
with. In most cases it is not a significant burden for the users.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
a2b5779714 counters: drop revertability of apply()
Since 4cfcd8055e 'Merge "Drop reversible
apply() from mutation_partition" from Tomasz' it is no longer required
for apply() to be revertable.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
f7438a8b96 mutable_view: add default constructor and const_iterator
Makes the interface more consistent with bytes_view.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
7c5c77369a tests/mutation_reader: do not apply mutations created on another shard
Scylla uses shared-nothing architecture and communication between the
shards is supposed to be very restricted. Applying to a memtable
mutations created on another shard is way too complex an operation to be
allowed. Using frozen mutations is a much safer option.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
55d1d7adfb sstables: do not call atomic_cell::value() for dead cells
The preconditions of value() require the cell to be live.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
b1bec336b3 lsa: sanitize use of migrators
Having migrators dynamically registered and deregistered opens a new
class of bugs. This patch adds some additional checks in the debug mode
with the hopes of catching any misuse early.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
cca9f8c944 lsa: reuse registered migrator ids
With the introduction of the new in-memory representation we will get
type- and schema-dependent migrators. Since there is no bound on how many
times they can be created and destroyed it is better to be safe and
reuse registered migrator ids.
2018-05-09 16:52:20 +01:00
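Reusing ids, as cca9f8c944 describes, is commonly done with a free list. The following sketch is not the LSA code: `id_registry` and its members are hypothetical, and the assertion only hints at the kind of misuse check a debug build could make.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class id_registry {
    std::vector<bool> _in_use;             // _in_use[id] is true while id is registered
    std::vector<std::uint32_t> _free_ids;  // ids released by deregister_id()
public:
    std::uint32_t register_id() {
        if (!_free_ids.empty()) {
            // Recycle a released id instead of growing the table.
            std::uint32_t id = _free_ids.back();
            _free_ids.pop_back();
            _in_use[id] = true;
            return id;
        }
        _in_use.push_back(true);
        return static_cast<std::uint32_t>(_in_use.size() - 1);
    }

    void deregister_id(std::uint32_t id) {
        // Catch unknown ids and double deregistration early.
        assert(id < _in_use.size() && _in_use[id]);
        _in_use[id] = false;
        _free_ids.push_back(id);
    }

    // Number of distinct ids ever handed out.
    std::size_t capacity() const { return _in_use.size(); }
};
```

With this scheme the table size is bounded by the peak number of simultaneously registered entries, not by the total number of registrations.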
Paweł Dziepak
b3699f286d lsa: make migrators table thread-local
Migrators can be registered and deregistered at any time. If the table
is not thread-local we risk race conditions.
2018-05-09 16:10:46 +01:00
Avi Kivity
8d09820472 Merge "Load serialization header for SSTables in 3.0 format" from Piotr
"
SSTable 3.0 format introduces serialization header which is used in reading SSTables in that format.
This patchset implements loading of this new component of Statistics.db.

Tests: units (release)
"

* 'haaawk/sstables3/load_serialization_header_v2' of ssh://github.com/scylladb/seastar-dev:
  Load serialization_header from statistics
  Add parse for disk_array_vint_size
  Add helpers to read/parse vints
  Add signed_vint::serialized_size_from_first_byte
  Add sstable::get_serialization_header
  Move random_access_reader to separate header
2018-05-09 17:48:48 +03:00
Glauber Costa
94f686f946 memtable controller: reduce adjustment period to 50ms
250ms is too high of a period for memtable controller. Since memtable
flushes are relatively efficient, especially in comparison to
compactions, if the shares are high we can flush a lot of data down with
the high shares - so in the next adjustment period our shares will be
minuscule and we won't flush much at all.

This leads to oscillating behavior that is mitigated by adjusting
faster.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180508182746.28310-3-glauber@scylladb.com>
2018-05-09 17:40:46 +03:00
Paweł Dziepak
920131b2f7 Merge "mvcc: Fix partition_snapshot::merge_partition_versions() to not leave latest versions unmerged" from Tomasz
"Fixes a bug in partition_snapshot::merge_partition_versions(), which would not
attempt merging if the snapshot is attached to the latest version (in which
case _version is nullptr and _entry is != nullptr). This would cause
partition_version objects to accumulate if there was an older snapshot and it
went away before the latest snapshot. Versions will be removed when the whole
entry goes away (flush or eviction).

May cause performance problems.

Fixes #3402."

* 'tgrabiec/fix-merge_partition_versions' of github.com:tgrabiec/scylla:
  mvcc: Test version merging when snapshots go away
  anchorless_list: Make ranges conform to SinglePassRange
  anchorless_list: Drop deprecated use of std::iterator
  mvcc: Fix partition_snapshot::merge_partition_versions() to not leave latest versions unmerged
2018-05-09 15:10:56 +01:00
Piotr Jastrzebski
70a204cdd0 Load serialization_header from statistics
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-09 15:46:59 +02:00
Piotr Jastrzebski
3e4bc923a8 Add parse for disk_array_vint_size
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-09 15:46:59 +02:00
Piotr Jastrzebski
6b4df2d424 Add helpers to read/parse vints
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-09 15:46:46 +02:00
Glauber Costa
aadc709068 scylla_io_setup: run new iotune.
The newer version of iotune, recently merged to Seastar, accepts
a new parameter that tells us where we should store the properties
about the disk.

We are already generating that properties file for the AMI case.
Let's also pass that parameter when calling iotune.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180507175757.9144-1-glauber@scylladb.com>
2018-05-09 16:32:43 +03:00
Amnon Heiman
6bf759128b scylla-housekeeping: support new 2018.1 path variation
Starting from 2018.1 and 2.2 there was a change in the repository path.
It was made to support multiple products (such as manager) and to place
the enterprise version in a different path.

As a result, the regular expression that looks for the repository fails.

This patch changes the way the path is searched: both rpm and debian
variations are combined and both options of the repository path are
unified.

See scylladb/scylla-enterprise#527

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180429151926.20431-1-amnon@scylladb.com>
2018-05-09 15:22:30 +03:00
Botond Dénes
777f3c7dc2 mutation_reader_test: don't lock up with smp=1
test_foreign_reader_destroyed_with_pending_read_ahead locks up completely
when run with SMP=1. As a solution skip the test-case when SMP < 2.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <815585c40a65a66f3b03e6393b46fbd6849c8ef5.1525866777.git.bdenes@scylladb.com>
2018-05-09 15:10:18 +03:00
Piotr Jastrzebski
b602dea726 Add signed_vint::serialized_size_from_first_byte
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-09 11:41:00 +02:00
Piotr Jastrzebski
589463165c Add sstable::get_serialization_header
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-09 11:40:59 +02:00
Piotr Jastrzebski
aa126639c0 Move random_access_reader to separate header
It will be used not only in sstables.cc but also
in helpers for reading sstables in M format.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-09 11:40:59 +02:00
Avi Kivity
911c2e7953 Merge "Support Bloom filter format for SSTables 3.x." from Vladimir
"
In SSTables 3.0, the base and increment fields have been swapped in
Bloom filters to reduce collisions (see CASSANDRA-8413). This affects
the resulting values written to Filter.db.

This patchset adds support for reading/writing Filter.db in the format
corresponding to the version of SSTables.

Tests: unit {release}

Filter.db files have been generated using Cassandra 3.11 with same data
as in unit tests and are validated to match those generated by Scylla.
"

* 'projects/sstables-30/write-filter/v1-2' of https://github.com/argenet/scylla:
  Fix mistakes and typos in comments (minor clean-up)
  Check Filter.db in SSTables 3.x write tests.
  Support Bloom filter format used in SSTables 3.0.
  Remove unused overload of i_filter::get_filter().
2018-05-09 11:16:09 +03:00
Vladimir Krivopalov
51c8ea74d6 sstables: generate non-empty summaries for m format
Add summary entries as needed. Also removes the duplicate line that
assigned summary byte cost.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <0d387c68523bae0c121cb15ad1e651ee9a8e4b4a.1525732404.git.vladimir@scylladb.com>
2018-05-09 11:15:02 +03:00
Vladimir Krivopalov
b59549cd16 Fix mistakes and typos in comments (minor clean-up)
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-08 15:28:43 -07:00
Vladimir Krivopalov
e739bb3280 Check Filter.db in SSTables 3.x write tests.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-08 15:28:35 -07:00
Vladimir Krivopalov
0f37c0e684 Support Bloom filter format used in SSTables 3.0.
The two hash values, base and increment, used to produce indices for
setting bits in the filter, have been swapped in SSTables 3.0.
See CASSANDRA-8413 for details.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-08 15:28:27 -07:00
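The effect of swapping base and increment in 0f37c0e684 can be sketched with a generic double-hashing scheme. Everything here is an assumption made for illustration: the `filter_format` enumerator names, the `bit_indices()` helper, the index formula (base plus i times increment, modulo the number of bits), and which of the two hashes acts as the base in which format. See CASSANDRA-8413 for the actual definition.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical enumerators; a dedicated enum instead of a bool keeps call
// sites readable and type-safe.
enum class filter_format { legacy, sstables_3x };

// Produce `count` bit positions from two hash values.
std::vector<std::uint64_t> bit_indices(std::uint64_t h0, std::uint64_t h1,
                                       unsigned count, std::uint64_t num_bits,
                                       filter_format fmt) {
    assert(num_bits > 0);
    std::uint64_t base = h0;
    std::uint64_t inc = h1;
    if (fmt == filter_format::sstables_3x) {
        std::swap(base, inc); // the two roles are exchanged in the newer format
    }
    std::vector<std::uint64_t> out;
    out.reserve(count);
    for (unsigned i = 0; i < count; ++i) {
        out.push_back(base % num_bits);
        base += inc; // unsigned wrap-around is well defined
    }
    return out;
}
```

The same pair of hashes sets different bits depending on the format, which is why a Filter.db file has to be read and written according to the SSTable version.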
Vladimir Krivopalov
fe2358e8bd Remove unused overload of i_filter::get_filter().
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-08 15:28:18 -07:00
Calle Wilund
b2b1a1f7e1 database: Fix assert in truncate
Fixes crash in cql_tests.StorageProxyCQLTester.table_test
"avoid race condition when deleting sstable on behalf..." changed
discard_sstables behaviour to only return rp:s for sstables owned
and submitted for deletion (not all matching time stamp),
which can in some cases cause zero rp returned.
Message-Id: <20180508070003.1110-1-calle@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
48c96d09d6 db::hints::manager: drain hints when the node is decommissioned/removed
When a node is decommissioned/removed it will drain all its hints and all
remote nodes that have hints to it will drain their hints to this node.

What does "drain" mean? The node that "drains" hints to a specific
destination will ignore failures and will continue sending hints till the end
of the current segment, erase it and move to the next one till there are
no more segments left.

After all hints are drained the corresponding hints directory is removed.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
ec76f8a27d db::hints::manager: add a few more trace messages
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
6ede32156f db::hints::manager::end_point_hints_manager::sender: add set_stopping()/stopping() methods
It's nicer to have access methods instead of working directly with enum_set methods and values.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
94da744f37 db::hints::manager::end_point_hints_manager::stop(): log the last exception instead of forwarding it
Returning a future with an exception from end_point_manager::stop()
is practically useless because the best the caller can do is to log
it and continue as if it didn't happen because it has other things
to shut down.

Therefore in order to simplify the caller we will log the exception
if it happens and will always return a non-exceptional future.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
8aedbf9d18 db::hints: manager.hh: cleanup: fix the comments
Fix the comments that went out of sync with the current implementation.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
5463b58faa db::hints::manager: rework end_point_hints_manager::stop() to use seastar::async()
This makes the code easier to read and extend.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Botond Dénes
6f7d919470 database: when dropping a table evict all relevant queriers
Queriers shouldn't outlive the table they read from as that could lead
to use-after-free problems when they are destroyed.

Fixes: #3414

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <3d7172cef79bb52b7097596e1d4ebba3a6ff757e.1525716986.git.bdenes@scylladb.com>
2018-05-07 21:20:25 +03:00
Duarte Nunes
c053275a48 db/view/row_locking: Add timeout when waiting for the lock
This ensures we respect the write timeout set by the client when
applying base writes, in case a write takes too long to acquire the
row lock for the read-before-write phase of a materialized view
update.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180507132755.8751-1-duarte@scylladb.com>
2018-05-07 18:22:39 +01:00
Duarte Nunes
113294074d Merge seastar upstream
* seastar ac02df7...840002c (20):
  > dpdk: protect against missing statistics
  > alien: make visible in documentation
  > Merge "rewrite iotune to conform to the new ioscheduler" from Glauber
  > app_template: Correct outdated comment
  > apps, tests: Catch polymorphic exceptions by reference
  > configure.py: Enhance detection for gcc -fvisibility=hidden bug
  > reactor: add rudimentary task histogram reporting
  > Revert "Merge "rewrite iotune to conform to the new ioscheduler" from Glauber"
  > Merge "rewrite iotune to conform to the new ioscheduler" from Glauber
  > build: Use the same warning name for Clang and GCC
  > core/rwlock: Add support for timeouts
  > fs qualification: protect against EINTR
  > Docker: Fix failing build due to missing GNU make
  > reactor: move optional to experimental so we compile with c++14
  > future: remove allocation from future::get() thread context switch
  > Merge "rpc streaming" from Gleb
  > reactor: put mountpoint_params in seastar namespace
  > Tutorial: in PDF version of tutorial, better backtick typesetting
  > tutorial: support, and start using, links to other sections
  > tutorial: improve second half of semaphores section

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-05-07 18:22:39 +01:00
Tomasz Grabiec
58fe331c7e mvcc: Test version merging when snapshots go away 2018-05-07 13:54:30 +02:00
Avi Kivity
368e15a8e2 Update scylla-ami submodule
* dist/ami/files/scylla-ami 8a6e4dd...e0b35dc (1):
  > change default roles for EBS / ephemeral
2018-05-07 12:34:04 +03:00
Duarte Nunes
4b3562c3f5 db/view: Limit number of pending view updates
This patch adds a simple and naive mechanism to ensure a base replica
doesn't overwhelm a potentially overloaded view replica by sending too
many concurrent view updates. We add a semaphore to limit to 100 the
number of outstanding view updates. We limit globally per shard, and
not per destination view replica. We also limit statically.

Refs #2538

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180426134457.21290-2-duarte@scylladb.com>
2018-05-07 11:25:27 +03:00
Duarte Nunes
2be75bdfc9 db/timeout_clock: Properly scope type names
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180426134457.21290-1-duarte@scylladb.com>
2018-05-07 11:24:41 +03:00
Nadav Har'El
c93b56034d tests: improve usability of cql_assertions.hh error messages
The functions in cql_assertions.hh are very convenient, but have one
frustrating drawback: When you have many of those assertions in one
test, it's very hard to know *which* of the similar assertions failed.

The problem is that an error often looks like this:

unknown location(0): fatal error: in "test_many_columns":
std::runtime_error: Expected 2 row(s) but got 0
tests/cql_assertions.cc(131): last checkpoint

Which of the many similar checks in "test_many_columns" failed? Note the
unhelpful "unknown location" and also the "last checkpoint" points to code
in cql_assertions.cc, not in the actual test, so it is useless.

The root cause of these problems is that the Boost macros use the C
preprocessor __FILE__ and __LINE__, which in actual C++ functions like
is_rows() record the function's own location, instead of the caller's. Fixing this will
not be simple. But this patch has a much simpler solution - fixing the
"last checkpoint". What ruins the last checkpoint is the use of BOOST_REQUIRE
inside the cql_assertions.cc is_rows() - when that succeeds, it records
the location inside cql_assertions.cc (!) as the last success.

If we just replace BOOST_REQUIRE by our own test (just like in the rest of
the cql_assertions.cc code), this code will not override the last checkpoint.
The user can see the last real successful BOOST_REQUIRE, or use
BOOST_TEST_PASSPOINT() to set his own checkpoints between different parts of
the same test.

After this patch, and with adding BOOST_TEST_PASSPOINT() calls between
different parts of my test, the failure above now looks like:

unknown location(0): fatal error: in "test_many_columns":
std::runtime_error: Expected 2 row(s) but got 0
tests/secondary_index_test.cc(299): last checkpoint

The "last checkpoint" now shows me exactly where my failing check was.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501152638.26238-1-nyh@scylladb.com>
2018-05-07 09:19:45 +01:00
Duarte Nunes
eabe471ce8 tests/secondary_index_test: Don't catch polymorphic exceptions by value
Don't slice exceptions by catching them by value. Instead of catching
by reference, use assert_that_failed().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180506153745.4512-1-duarte@scylladb.com>
2018-05-06 18:53:40 +03:00
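The slicing problem mentioned in the commit above can be reproduced in a few lines. This is a generic C++ sketch with made-up exception types (`base_error`, `derived_error`); it does not use the assert_that_failed() helper that the commit switches to, and only shows why catching a polymorphic exception by value loses the derived type.

```cpp
#include <cassert>
#include <exception>
#include <string>

// Hypothetical exception hierarchy, for illustration only.
struct base_error : std::exception {
    const char* what() const noexcept override { return "base_error"; }
};
struct derived_error : base_error {
    const char* what() const noexcept override { return "derived_error"; }
};

// Catching by value copy-constructs a base_error from the thrown object,
// so the derived part is sliced off. Compilers may warn about this
// (for example GCC's -Wcatch-value), which is exactly the point.
std::string caught_by_value() {
    try {
        throw derived_error();
    } catch (base_error e) {
        return e.what();
    }
}

// Catching by reference keeps the dynamic type of the thrown object.
std::string caught_by_reference() {
    try {
        throw derived_error();
    } catch (const base_error& e) {
        return e.what();
    }
}

// Small self-check showing the difference between the two handlers.
void slicing_example() {
    assert(caught_by_value() == "base_error");
    assert(caught_by_reference() == "derived_error");
}
```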
Duarte Nunes
ab5a45b00c Merge 'Improve debuggability of result_message' from Avi
"This patchset adds ostream operators to result_message and uses them
in cql_assertions."

* tag 'result_message-print/v1.1' of https://github.com/avikivity/scylla:
  tests: cql_assersions: improve error message when a row is not found
  transport: add ostream support to result_message
  transport: const correctness for result_message::accept()
2018-05-06 14:52:56 +01:00
Avi Kivity
6d3fb69827 tests: cql_assersions: improve error message when a row is not found
Display the row and the result set.
2018-05-06 16:28:37 +03:00
Avi Kivity
07d69ebce2 transport: add ostream support to result_message
Allow printing result_message:s for debugging.
2018-05-06 16:28:35 +03:00
Avi Kivity
50d4d01cb7 tests: fix view_schema_test cql_assertion types
Use utf8_type where warranted.

Fixes view_schema_test failure where the rows did not match. I don't
understand exactly why the failure happened (using the wrong type
should not cause a failure here), but the change fixes the problem.

Tests: view_schema_test (release)
Message-Id: <20180506130015.7450-1-avi@scylladb.com>
2018-05-06 14:25:22 +01:00
Avi Kivity
31f2b3ce15 transport: const correctness for result_message::accept()
The visitor does not alter the result_message it is visiting (and
its signature indicates that) so accept() should be const-qualified
to indicate that and to allow visiting const result_message:s.
2018-05-06 15:51:48 +03:00
Avi Kivity
cc900c23a6 Merge "Write Statistics.db in SSTables 3.x format." from Vladimir
"
This patchset adds support for writing Statistics.db in the SSTables
'mc' (3.x) format. This file is essential for reading data stored in
Data.db as it contains base values used for delta encoding and types of
columns.

This patchset also fixes several bugs found in writing data and index
files as well as bugs in a statistics-related structure definition.

Tests: unit {debug, release}

All SSTables files for write unit tests are validated to be processed by
sstabledump and output is verified to show the expected data.
"

* 'projects/sstables-30/write-statistics/v1' of https://github.com/argenet/scylla:
  Add test covering the composite partition key case.
  Add Statistics.db files to write tests for SSTables 3.0.
  Do not check rows and cells for expiration when writing them to the data file.
  Fix promoted index serialization.
  Fix the order of items in stats_metadata.
  Fix timestamp_epoch value which was truncated on exceeding int32_t type limit.
  Write serialization header to Statistics.db for SSTables 3.x.
  Do not pass schema to metadata_collector::update(column_stats)
  Collect metadata statistics when writing SSTables 3.0.
  Call get_metadata_collector() instead of referencing sstable::_collector directly.
  Fix logic of writing TTLed cells in SSTable 3.0 format.
  Separate statistics for count of cells, columns and rows in column_stats.
  Deserialize collection in a way that doesn't incur shared_ptr counter increment and is generally shorter.
  Track both min & max values for timestamp, TTL and local deletion time in metadata_collector.
  Add class for tracking both extremum values (min and max) on updates.
2018-05-05 16:53:08 +03:00
Vladimir Krivopalov
4ecb3a5e2a Add test covering the composite partition key case.
Mainly to check that the composite type is properly serialized when
writing serialization header to Statistics.db.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:45:11 -07:00
Vladimir Krivopalov
1b3989adcd Add Statistics.db files to write tests for SSTables 3.0.
For these tests to work, all time-related values are now fixed as these
are stored in Statistics.db files.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:45:11 -07:00
Vladimir Krivopalov
293ee6ae3f Do not check rows and cells for expiration when writing them to the data file.
Although this logic may be seen as a useful optimization, it hinders
unit tests writing SSTables 3.0 as those need to have fixed time-related
values to produce Statistics.db files with the same content on each run.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:45:11 -07:00
Vladimir Krivopalov
44bc0f1493 Fix promoted index serialization.
There is a new field introduced in the SSTables 3.0 index file format
named 'partition_header_length' that can be used to skip over to the
first clustering row in a wide partition. This one has not been
previously written and caused malformed indices.

Updated the corresponding test to include a static row and write
multiple wide partitions.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:45:10 -07:00
Vladimir Krivopalov
56ac941a2e Fix the order of items in stats_metadata.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:45:10 -07:00
Vladimir Krivopalov
926cdc6d70 Fix timestamp_epoch value which was truncated on exceeding int32_t type limit.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:45:10 -07:00
Vladimir Krivopalov
5db6002720 Write serialization header to Statistics.db for SSTables 3.x.
Serialization header is a new component in Statistics.db introduced in
SSTables 3.0 ('ma') format. It is essential for reading data file as it
contains the base values used for delta-encoded values (timestamps,
TTLs, local deletion times) and description of column types.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:43:17 -07:00
Vladimir Krivopalov
6e4601d177 Do not pass schema to metadata_collector::update(column_stats)
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:22:32 -07:00
Vladimir Krivopalov
a10ad6b623 Collect metadata statistics when writing SSTables 3.0.
Track min/max timestamps, TTLs, local deletion times and count of cells,
columns and rows.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:22:30 -07:00
Raphael S. Carvalho
abcfc19fe9 db: make compaction slightly faster by not using filtering reader on unshared sstable
After reboot, all existing sstables are considered shared. That's a safe default.
Reader used by compaction decides to use filtering reader (filters out data that
doesn't belong to this shard) if sstable is considered shared even though it may
actually be unshared.
By avoiding filtering reader we're avoiding an extra check for each key, and that
may be meaningful for compaction of tons of small partitions and even range
reads of such. We do so by fixing sstable::_shared, which is now set properly for
existing sstables at start.

quick check using microbenchmark which extends perf_sstable with compaction mode:
before: 69407.61 +- 37.03 partitions / sec (30 runs, 1 concurrent ops)
after: 70161.09 +- 40.35 partitions / sec (30 runs, 1 concurrent ops)

Fixes #3042.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180504182158.21130-1-raphaelsc@scylladb.com>
2018-05-04 19:34:09 +01:00
Raphael S. Carvalho
b65bc511fe sstables/compaction_manager: log user initiated compaction
Sometimes it's hard to figure out from the log whether the user ran a major
compaction.

Fixes #1303.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180504181047.20277-1-raphaelsc@scylladb.com>
2018-05-04 19:15:58 +01:00
Duarte Nunes
7916368df8 Merge "Introduce system.large_partitions table" from Piotr
"
This series introduces a system.large_partitions table,
used to gather information on largest partitions in the cluster.

Schema below allows easy extraction of most offending keys and removal
by sstable name, which happens when a table is compacted away.

Schema: (
  keyspace_name text,
  table_name text,
  sstable_name text,
  partition_size bigint,
  key text,
  compaction_time timestamp,
  PRIMARY KEY((keyspace_name, table_name), sstable_name, partition_size, key)
) WITH CLUSTERING ORDER BY (partition_size DESC);
"

Closes #3292.

* 'large_partition_table_3' of https://github.com/psarna/scylla:
  database, sstables, tests: add large_partition_handler
  db: add large_partition_handler interface with implementations
  docs: init system_keyspace entry with system.large_partitions
  db: add system.large_partitions table
2018-05-04 18:18:50 +01:00
Piotr Sarna
bc019205b3 schema: fix typos in a comment
Message-Id: <2b2a169e8a511fa9e0e1556ac7559ce9bef896e1.1525431353.git.sarna@scylladb.com>
2018-05-04 15:26:51 +01:00
Piotr Sarna
fe02c3d0e2 database, sstables, tests: add large_partition_handler
This commit makes database, sstables and tests aware
of which large_partition_handler they use.
Proper large_partition_handler is retrievable from config information
and is based on existing compaction_large_partition_warning_threshold_mb
entry. Right now CQL TABLE variant of large_partition_handler is used
in the database.

Tests use a NOP version of large_partition_handler, which does not
depend on CQL queries at all.
2018-05-04 14:38:13 +02:00
Piotr Sarna
14b3c7e7e7 db: add large_partition_handler interface with implementations
This commit introduces large_partition_handler class, which can be used
to take additional action when large partitions are written.

It comes with two implementations:
 * NOP, used in tests, which does nothing on large partition
   update/delete
 * CQL TABLE, which inserts/deletes information on particular sstable
   to system.large_partitions table, in order to be retrievable from
   cqlsh later.

References #3292
2018-05-04 12:46:31 +02:00
Piotr Sarna
3c82a8a2ff docs: init system_keyspace entry with system.large_partitions
This commit is a first step towards documenting system.* tables.
It contains information about system.large_partitions table.

References #3292
2018-05-04 12:45:40 +02:00
Piotr Sarna
02822efbc8 db: add system.large_partitions table
This commit adds a system.large_partitions table, which can be used
to trace largest partitions of a cluster.
Schema: (
  keyspace_name text,
  table_name text,
  sstable_name text,
  partition_size bigint,
  key text,
  compaction_time timestamp,
  PRIMARY KEY((keyspace_name, table_name), sstable_name, partition_size, key)
) WITH CLUSTERING ORDER BY (partition_size DESC);

References #3292
2018-05-04 12:45:40 +02:00
Raphael S. Carvalho
ce689a0807 database: avoid race condition when deleting sstable on behalf of cf truncate
After removal of deletion manager, caller is now responsible for properly
submitting the deletion of a shared sstable. That's because deletion manager
was responsible for holding deletion until all owners agreed on it.
Resharding for example was changed to delete the shared sstables at the end,
but truncate wasn't changed and so race condition could happen when deleting
same sstable at more than one shard in parallel. Change the operation to
submit a shared sstable for deletion from only one owner.

Fixes dtest migration_test.TestMigration.migrate_sstable_with_schema_change_test

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180503193427.24049-1-raphaelsc@scylladb.com>
2018-05-04 11:42:56 +01:00
Vladimir Krivopalov
8342073758 Call get_metadata_collector() instead of referencing sstable::_collector directly.
A step to untie classes sstable_writer_m and sstable so that eventually
we could stop them being friends.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-03 17:05:06 -07:00
Vladimir Krivopalov
f1816d77cc Fix logic of writing TTLed cells in SSTable 3.0 format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-03 17:05:06 -07:00
Vladimir Krivopalov
3e471116b4 Separate statistics for count of cells, columns and rows in column_stats.
SSTables 3.0 format makes a distinction between count of cells and count
of columns. In that sense, a column of a collection type counts as one
column but every atomic cell in it counts as a separate cell.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-03 17:05:06 -07:00
Vladimir Krivopalov
fdfe79e899 Deserialize collection in a way that doesn't incur shared_ptr counter
increment and is generally shorter.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-03 17:05:06 -07:00
Vladimir Krivopalov
7039dee12b Track both min & max values for timestamp, TTL and local deletion time
in metadata_collector.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-03 17:05:06 -07:00
Vladimir Krivopalov
8b8c9a5d10 Add class for tracking both extremum values (min and max) on updates.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-03 17:05:06 -07:00
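A class that tracks both extrema on every update, as named in the commit above, could look roughly like the following. This is a guess at the general shape, not the class added by the patch: the name `min_max_tracker_sketch`, its interface and its choice of initial sentinels are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>

// Hypothetical tracker; the real class in the patch may differ in name,
// interface and initial values.
template <typename T>
class min_max_tracker_sketch {
    // Start from the opposite ends of the range so that the first
    // update() always replaces both values.
    T _min = std::numeric_limits<T>::max();
    T _max = std::numeric_limits<T>::lowest();
    bool _updated = false;
public:
    void update(T value) {
        _min = std::min(_min, value);
        _max = std::max(_max, value);
        _updated = true;
    }
    // Reading before any update would return the sentinels, so guard it.
    T min() const { assert(_updated); return _min; }
    T max() const { assert(_updated); return _max; }
};
```

One instance per tracked quantity (timestamp, TTL, local deletion time) would then be updated as cells are written, and both ends read back when the metadata is serialized.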
Tomasz Grabiec
5e985192b2 db: Log table id and schema version on boot
Message-Id: <1524585689-12458-1-git-send-email-tgrabiec@scylladb.com>
2018-05-03 10:50:31 +03:00
Botond Dénes
5d5bc0e1ab mutation_reader_test: fix multishard-reader test with smp > 3
test_multishard_combining_reader_destroyed_with_pending_create_reader
was failing because it relied on smp == 3 and thus the shard on which
the reader creation is blocked being shard-2. Since the test requires to
be run with smp >= 3 we can hardcode this shard to be 2 because if the
test runs at all we are guaranteed to have at least smp >= 3.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <38883a1f4c18ca0cd065aa13826a4f1858353289.1525328233.git.bdenes@scylladb.com>
2018-05-03 10:30:21 +03:00
Botond Dénes
efa08f623a mutation_reader_test: add description to multishard-tests
These tests are quite complicated and require intimate knowledge of how
foreign_reader and multishard_combining_reader operate. Knowing these
two objects is still required to understand the tests but make it that
much easier by explaining how they were designed to test what they test.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <8de580131a8652924de920c2bc68a98e579398ee.1525328226.git.bdenes@scylladb.com>
2018-05-03 10:30:20 +03:00
Paweł Dziepak
bfc017daa8 tests/mutation_reader: do not capture on-stack variable by reference
'shard' is a short-lived on-stack variable that gets captured by
reference by continuation that gets executed on another shard.

Fixes a race condition that leads to a heap-use-after-free.

Message-Id: <20180502150507.2776-1-pdziepak@scylladb.com>
2018-05-02 18:07:37 +03:00
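The class of bug fixed above can be shown without Seastar. In this hedged sketch, deferred tasks stand in for continuations that run later on another shard; `make_shard_tasks` is a hypothetical helper. Capturing the loop variable by value gives each task its own copy, whereas capturing it by reference would leave every task pointing at a variable that no longer exists when the task runs.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Builds one deferred task per shard. The tasks are executed after this
// function has returned, i.e. after `shard` has gone out of scope.
std::vector<std::function<unsigned()>> make_shard_tasks(unsigned shard_count) {
    std::vector<std::function<unsigned()>> tasks;
    for (unsigned shard = 0; shard < shard_count; ++shard) {
        // Writing [&shard] here would capture a reference to a short-lived
        // on-stack variable and dangle by the time the task runs.
        tasks.push_back([shard] { return shard; });
    }
    return tasks;
}

// Small self-check: every task still sees the shard it was created for.
void capture_example() {
    auto tasks = make_shard_tasks(3);
    for (unsigned i = 0; i < tasks.size(); ++i) {
        assert(tasks[i]() == i);
    }
}
```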
Botond Dénes
d80e586ccb mutation_reader_test: remove leftover comments
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <580dcf664fc4fc84f3a29137fba5c982f57d7601.1525269726.git.bdenes@scylladb.com>
2018-05-02 17:03:50 +03:00
Botond Dénes
e14b0ca13e mutation_reader_test: fix possible use-after-free
The test_foreign_reader_destroyed_with_pending_read_ahead test currently
doesn't ensure that the objects in its scope are destroyed in the
correct order. This is necessary as there are several foreign pointers
to objects that live on remote shards and use each other. Since
foreign pointers destroy their managed object in the background we
cannot rely on them to reliably destroy objects in order, nor can we be
sure when the object they manage is actually destroyed.
So to work around that ensure that the puppet_reader is destroyed before
the remote_control it references even has a chance of being destroyed.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <232eaa899878b03fb2a765c2916e4f05841472a3.1525269726.git.bdenes@scylladb.com>
2018-05-02 17:03:49 +03:00
Nadav Har'El
68b5eafcc6 secondary index: test index naming
Test for Scylla's default choice of secondary index name (we found one
small problem, see issue #3403, and left it commented out). Also test
the ability to give indices non-default names.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501153439.26619-1-nyh@scylladb.com>
2018-05-02 08:12:14 +03:00
Nadav Har'El
311b25948c secondary index: test indexing of partition-key column
Add a test that adding a secondary-index for an only partition key column
is not allowed (it would be redundant), but indexing one of several partition
key columns *is* allowed. This reproduced issue #3404, and verifies that
it was fixed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501121544.22869-2-nyh@scylladb.com>
2018-05-02 08:11:04 +03:00
Nadav Har'El
79c6bb642f secondary index: fix indexing of partition-key column
Indexing an only partition key component is not allowed (because it would
be redundant), but it should be allowed to index one of several partition
key components. We had a bug in that case: the underlying materialized view
we created had the same column as both a partition key and a clustering
key, which resulted in an assertion failure. This patch fixes that.

Fixes #3404.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501121544.22869-1-nyh@scylladb.com>
2018-05-02 08:06:38 +03:00
Nadav Har'El
21d7507b74 secondary index: move stuff out of db/index directory
The db/index directory contains just a few lines of code that exists
there for historical reasons. It's confusing that we have both db/index
and index/ directory related to secondary-indexing.

This patch moves what little is still in db/index/ to index/. In the
future we should probably get rid of the "secondary_index" class we had
there, but for now, let's at least not have a whole new directory for it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501101246.21143-1-nyh@scylladb.com>
2018-05-01 13:21:24 +03:00
Tomasz Grabiec
0455a19ce0 anchorless_list: Make ranges conform to SinglePassRange
They were missing const version of iterators as well as iterator and
const_iterator member type aliases.
2018-04-30 18:45:32 +02:00
Tomasz Grabiec
9b7e49ef35 anchorless_list: Drop deprecated use of std::iterator 2018-04-30 18:45:32 +02:00
Tomasz Grabiec
aa1458377c mvcc: Fix partition_snapshot::merge_partition_versions() to not leave latest versions unmerged
Fixes a bug in partition_snapshot::merge_partition_versions(), which
would not attempt merging if the snapshot is attached to the latest
version (in which case _version is nullptr and _entry is !=
nullptr). This would cause partition_version objects to accumulate if
there was an older snapshot and it went away before the latest
snapshot. Versions will be removed when the whole entry goes away
(flush or eviction).

May have caused performance problems.

Fixes #3402.
2018-04-30 18:45:32 +02:00
Avi Kivity
25545590a4 Merge "Read-ahead related fixes for multishard readers" from Botond
"
Both multishard_combining_reader and foreign_reader use read-ahead in the
background to avoid blocking consumers. These read-aheads can be still
pending when the reader is destroyed and hence extra attention is needed
to avoid memory errors. Recent manual testing, done in the context of
testing code that is using the multishard reader, proved that these
cases were not handled correctly in the initial series introducing it
(2d126a79b).
This series introduces fixes and comprehensive tests for all problematic
scenarios:
1) multishard_combining_reader is destroyed with pending reader creation
on a remote shard.
2) foreign_reader is destroyed with pending read-ahead.
3) multishard_combining_reader is destroyed with pending read-ahead.
"

* 'multishard-reader-read-ahead-fixes/v2' of https://github.com/denesb/scylla:
  test.py: add custom seastar flags for mutation_reader_test
  test.py: move custom seastar flags for tests declarative
  mutation_reader_test: add read-ahead related multishard reader tests
  tests/mutation_reader_test: change recommented smp to 3
  mutation_reader_test: fix name of existing multishard reader tests
  simple_schema: add global_simple_schema
  simple_schema.hh: remove unused include
  multishard_combining_reader: prepare for read-ahead otliving the reader
  foreign_reader: prepare for read-ahead outliving the reader
  multishard_combining_reader: avoid creating the shard reader twice
  multishard_combining_reader: read_ahead: don't assume reader is created
  multishard_combining_reader: move read-ahead related methods
  multishard_combining_reader: avoid looking up the shard reader twice
  multishard_combining_reader: use optional for maybe created reader
2018-04-30 17:41:50 +03:00
Botond Dénes
f96084d38e test.py: add custom seastar flags for mutation_reader_test
Use -c3 if possible (if the machine has at least 3 cores).
2018-04-30 17:17:45 +03:00
Botond Dénes
52f0bb0481 test.py: move custom seastar flags for tests declarative 2018-04-30 17:17:45 +03:00
Botond Dénes
79684eff8e mutation_reader_test: add read-ahead related multishard reader tests
Add tests for foreign_reader and multishard_combining_reader that check
that readers destroyed while there is pending read-ahead will not result
in use-after-free.
Specifically check that:
* multishard_combining_reader destroyed with pending reader creation
* foreign_reader destroyed with pending read-ahead
* multishard_combining_reader destroyed with pending read-ahead
does not result in use-after-free or SEGFAULT.

These tests try to do their best to check for correct behaviour with
various BOOST_REQUIRE* checks but they still heavily rely on ASAN to
detect any use-after-free, SEGFAULT or similar errors.
2018-04-30 17:17:45 +03:00
Botond Dénes
cb25afa8bf tests/mutation_reader_test: change recommented smp to 3
Of the test_multishard_combining_reader_reading_empty_table test.
Running this test with smp=3 instead of smp=2 helps detecting additional
read-ahead related memory problems.
2018-04-30 17:17:45 +03:00
Botond Dénes
78266f11c4 mutation_reader_test: fix name of existing multishard reader tests
s/multishard_combined_reader/multishard_combining_reader/
2018-04-30 17:17:44 +03:00
Botond Dénes
783f0f09bf simple_schema: add global_simple_schema
Which allows a simple_schema instance to be transferred to another
shard. In fact a new simple_schema instance will be created on the
remote shard but it will use the same schema instance as the original
one.
2018-04-30 17:17:44 +03:00
Botond Dénes
ed7bde99bc simple_schema.hh: remove unused include 2018-04-30 17:17:44 +03:00
Botond Dénes
04643fb223 multishard_combining_reader: prepare for read-ahead otliving the reader
When the multishard reader is destroyed there might be several pending
read-aheads running in the background. These read-aheads need their
associated reader to stay alive until after the read-ahead completes.
To solve this move the flat_mutation_reader into a struct and manage
this struct's lifetime through a shared pointer. Fibers associated with
read-aheads that might outlive the multishard reader will hold on to a
copy of the shared pointer, keeping the underlying reader alive until they
complete. To avoid doing any extra work a flag is added to this state
which is set when the multishard reader is destroyed. When this flag is
set, pending continuations will return early. All this is encapsulated
in multishard_combining_reader::shard_reader; the multishard reader code
itself need not be changed.
2018-04-30 17:16:21 +03:00
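The lifetime scheme described in the commit above (shared state kept alive by background work, plus a flag that makes late continuations return early) can be sketched in plain C++. This is not the Seastar-based implementation: `reader_state_sketch`, `shard_reader_sketch` and the use of std::function as a stand-in for a pending continuation are all assumptions made for illustration.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// State shared between the reader and its background read-aheads.
struct reader_state_sketch {
    int buffered = 0;        // stand-in for the data produced by read-ahead
    bool abandoned = false;  // set when the owning reader is destroyed
};

class shard_reader_sketch {
    std::shared_ptr<reader_state_sketch> _state =
        std::make_shared<reader_state_sketch>();
public:
    // The destructor does not wait for pending work; it only marks the
    // state so that late continuations can bail out.
    ~shard_reader_sketch() { _state->abandoned = true; }

    // Returns a deferred continuation. It holds its own copy of the shared
    // pointer, so the state outlives the reader if necessary.
    std::function<bool()> start_read_ahead() {
        return [state = _state] {
            if (state->abandoned) {
                return false;  // reader is gone: return early, do no work
            }
            ++state->buffered;
            return true;
        };
    }
};

// Small self-check: a continuation that runs after the reader was
// destroyed touches only the still-alive shared state and does nothing.
void read_ahead_lifetime_example() {
    std::function<bool()> late;
    {
        shard_reader_sketch reader;
        late = reader.start_read_ahead();
    }
    assert(!late());
}
```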
Botond Dénes
a05d398be7 foreign_reader: prepare for read-ahead outliving the reader
The foreign reader keeps track of ongoing read-aheads via a
foreign_ptr to the read-ahead's future on the remote shard. This pointer
is overwritten after each "remote call" to the remote reader with a
pointer to the new read-ahead's future.
There are several problems with the current implementation:
1) There is a new read-ahead launched after each "remote call"
  unconditionally, even if the remote reader is at EOS. This will start
  unnecessary read-ahead when the reader is already finished and may be
  soon destroyed (legally) by the client.
2) The pointer to the remote read-ahead future is not set to nullptr
  when a remote call is issued. Thus in the destructor, where we
  attach a continuation to the read-ahead's future to extend the
  reader's lifetime until after the read-ahead finishes, we might attach
  a continuation to a future that already has one and run into a failed
  assert().

To fix these issues reset the read-ahead pointer to nullptr each time a
remote call is issued and don't start a new read-ahead if the remote
reader is at EOS. This way we can ensure that when the reader is
destroyed we either have a valid and non-stale read-ahead future or none
at all and can reliably make a decision about whether we need to extend
the lifetime of the remote reader or not.
2018-04-30 14:34:43 +03:00
Botond Dénes
704d3d8421 multishard_combining_reader: avoid creating the shard reader twice
The multishard reader creates its shard readers on demand when they are
first attempted to be used. However at this time the reader might already
be in the process of being created, initiated by a previous read-ahead.
To avoid creating the shard reader twice, before creating the reader
check whether there are any read-aheads in progress. If there is one, it
has already created (or is creating, or will create) the reader, so we
synchronise with the read-ahead. Synchronisation happens via a promise:
the read ahead creates a promise which will be fulfilled when the reader
is created. A concurrent create_reader() call will wait on this promise
instead of attempting to create a new reader.
2018-04-30 14:34:43 +03:00
Botond Dénes
f9464cfcd7 multishard_combining_reader: read_ahead: don't assume reader is created
Currently it is assumed that when read_ahead is called the reader is
already created. Under most circumstances this will not be true. It was
blind (bad) luck that we didn't hit this before (during testing).
2018-04-30 14:34:43 +03:00
Botond Dénes
d9fceb398a multishard_combining_reader: move read-ahead related methods
To the group of methods that do not assume the reader is already
created. A patch will follow that will update read_ahead() to not assume
that the reader is created.
2018-04-30 14:34:43 +03:00
Botond Dénes
5dcfaa68f6 multishard_combining_reader: avoid looking up the shard reader twice 2018-04-30 14:34:43 +03:00
Botond Dénes
79504a7d28 multishard_combining_reader: use optional for maybe created reader
After a little "research" [1] it turns out my initial fears were
completely without ground, std::optional::operator->() and
std::optional::operator*() don't involve an unnecessary branch and
thus there is no need to hand-roll an optional with a separate bool.

[1] http://en.cppreference.com/w/cpp/utility/optional/operator*
2018-04-30 14:34:37 +03:00
Avi Kivity
c8a6fe3044 storage_proxy: remove default_query_timeout()
No longer used.
2018-04-30 13:19:53 +03:00
Avi Kivity
d8dd7e05a7 storage_proxy: don't use default timeouts
Require all callers to supply timeouts instead of relying on defaults.

Since all callers now have the timeouts set up, they can easily supply
them.
2018-04-30 13:19:53 +03:00
Avi Kivity
7b5db486a0 query_options: augment with timeout_config
Add a timeout_config member to query_options. This lets the query
processor know what timeouts the user of this query wants to apply.
2018-04-30 13:19:53 +03:00
Avi Kivity
fcea3ed722 thrift: configure thrift transport and handler with a timeout_config
Let the thrift transport server and request handler know about the
per-request-type timeouts, in preparation for actually using them.
2018-04-30 13:19:53 +03:00
Avi Kivity
f9370ab7e6 transport: configure native transport with a timeout_config
Let the native transport server know about the per-request-type
timeouts, in preparation for actually using them.
2018-04-30 13:19:53 +03:00
Avi Kivity
49fdf01b5d cql3: define and populate timeout_config_selector
Determine which timeout we need to apply at prepare time. We
don't know the numerical value (since it depends on whoever is
executing the query, not just the statement type), but we know
which member of timeout_config we need, so determine and remember
that.
2018-04-30 13:19:49 +03:00
Tomasz Grabiec
423712f1fe storage_proxy: Request schema from the coordinator in the original DC
The mutation forwarding intermediary (src_addr) may not always know
about the schema which was used by the original coordinator. I think
this may be the cause of the "Schema version ... not found" error seen
in one of the clusters which entered some pathological state:

  storage_proxy - Failed to apply mutation from 1.1.1.1#5: std::_Nested_exception<schema_version_loading_failed> (Failed to load schema version 32893223-a911-3a01-ad70-df1eb2a15db1): std::runtime_error (Schema version 32893223-a911-3a01-ad70-df1eb2a15db1 not found)


Fixes #3393.

Message-Id: <1524639030-1696-1-git-send-email-tgrabiec@scylladb.com>
2018-04-30 12:51:09 +03:00
Nadav Har'El
1bbf7ba78c secondary index: add tests for IF NOT EXISTS, IF EXISTS
Confirm that issue #2991 is indeed fixed - creating a secondary index
with IF NOT EXISTS ignores an already existing index, and dropping with
IF EXISTS ignores a non-existent index.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180430071714.10154-1-nyh@scylladb.com>
2018-04-30 10:36:50 +02:00
Nadav Har'El
6e3a53fab0 secondary index: improve testing of case-sensitive column names
The existing test_secondary_index_case_sensitive only tested the
case-sensitive case of the column being indexed, and only in some
scenarios. Further testing exposed more bugs - issue #3388, issue #3391,
issue #3401. This patch adds tests which reproduced those bugs, and now
verifies their fix.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-9-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
a556b2b367 materialized views: fix test_case_sensitivity test
test_case_sensitivity from tests/view_schema_test.cc was well-intentioned,
aiming to test from different angles the issue of non-lowercase (quoted)
column names and their interaction with materialized views.

But unfortunately, it didn't test anything! This is because the quotation
marks were forgotten, so all the identifiers in this test were folded to
lowercase, and the test didn't test non-lowercase identifiers like it
intended.

So this patch adds the missing quotes, to make this test great again.

After the patches for issues #3388 and #3391 which I sent earlier, the
test *passes* (before those patches, the fixed test did not pass -
the unfixed test trivially passed).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-8-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
46d4f6f352 secondary index: fix yet another case sensitivity bug
When the secondary index code builds a "%s IS NOT NULL" clause for a
CQL statement, it needs to quote the column name if the name contains
anything other than lowercase letters, digits and _.

Fixes #3401.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-7-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
8012f231ca materialized views: fix another case-sensitivity bug
We had another case-sensitivity bug in materialized views, where if
a case-sensitive (quoted) column name was listed explicitly on "SELECT"
(instead of implicitly, e.g., in "SELECT *") the column name was
incorrectly folded to lower-case and inserts would fail.

This patch fixes the code, where a "SELECT" statement was built using
the desired column names, but column names that needed quoting were
not being quoted. The bug was in a helper function build_select_statement()
which took column name strings and failed to quote them. We clean up this
function to take column definitions instead of strings - and take care
of the quoting itself. It also needs to quote the table's name in the
select statement being built.

Fixes #3391.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-6-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
e2b2506cb1 materialized views - fix case-sensitive IS NOT NULL
Before this patch, if a materialized view is defined with the restriction
IS NOT NULL on a case-sensitive (quoted) column name, inserts fail with
a "restriction 'foobar IS NOT null' unknown column foobar" error, where
foobar is the lowercased version of the case-sensitive column name.

The problem is that the code uses single_column_relation::to_string()
to convert the relation into a CQL where clause. And indeed, this method
generates a CQL expression; But it calls column_identifier::raw::to_string()
to print identifiers. This is the wrong function - it doesn't quote
identifiers that need quoting because they are not lowercase.

So this patch uses column_identifier::raw::to_cql_string() (a method we
added in the previous patch) to generate the properly quoted CQL relation.

Fixes #3388

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-5-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
b8ee50e6b9 Implement column_identifier::raw::to_cql_string()
Implement a method column_identifier::raw::to_cql_string(). Exactly like
the one without "raw", this method quotes the identifier name as needed
for CQL. We'll need this method in a later patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-4-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
993c4441e5 column_identifier::to_cql_string() using maybe_quote()
There is no reason for to_cql_string() and maybe_quote() to both
implement the same quoting algorithm. Use the latter to implement the
former.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-3-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
f4178f9582 Fix cql3::util::maybe_quote()
The utility function maybe_quote() is supposed to quote identifier names
(name of keyspace, table, or column) according to CQL rules, e.g., if the
name has any uppercase or non-alphanumeric characters, it needs to be
quoted. Unfortunately, it didn't quite do the right thing, so this patch
fixes that. This patch also adds a comment explaining what maybe_quote()
is supposed to do (until now, users could only guess).
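
The quoting rule described above can be sketched as follows. This is a minimal Python illustration, not Scylla's C++ implementation: the regular expression and the doubling of embedded quotes are assumptions based on standard CQL identifier syntax, and reserved keywords (which would also need quoting) are not handled.

```python
import re

# Assumption: an identifier may stay unquoted only if it starts with a
# lowercase letter and otherwise contains just lowercase letters, digits
# and '_'. Anything else would be folded or rejected by the CQL parser.
_UNQUOTED = re.compile(r"[a-z][a-z0-9_]*")


def maybe_quote(name: str) -> str:
    """Return `name` as it should appear inside a CQL statement."""
    if _UNQUOTED.fullmatch(name):
        return name
    # Embedded double quotes are escaped by doubling them.
    return '"' + name.replace('"', '""') + '"'
```

With this sketch, `maybe_quote("foo_1")` is returned unchanged, while `maybe_quote("FooBar")` is wrapped in double quotes so that its case survives parsing.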

Fixes #3400.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-2-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
ecc85297a4 secondary index: clean up dead unquoting code
In commit d674b6f672, I fixed a case-sensitive column name bug by
avoiding CQL quoting of a column name
in create_index_statement.cc when building a "targets" option string.
However, there is also matching code in target_parser.hh to unquote
that option string. So this unquoting code is no longer necessary, and
should be dropped.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-1-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Avi Kivity
b6d74b1c19 timeout_config: introduce timeout configuration
Different request types have different timeouts (for example,
read requests have shorter timeouts than truncate requests), and
also different request sources have different timeouts (for example,
an internal local query wants infinite timeout while a user query
has a user-defined timeout).

To allow for this, define two types: timeout_config represents the
timeout configuration for a source (e.g. user), while
timeout_config_selector represents the request type, and is used
to select a timeout within a timeout configuration. The latter is
implemented as a pointer-to-member.

Also introduce an infinite timeout configuration for internal
queries.
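
The two-type design above can be illustrated with a short sketch. It is written in Python rather than C++, so `operator.attrgetter` stands in for the pointer-to-member; the field names and timeout values are hypothetical and are not taken from the patch.

```python
from dataclasses import dataclass
from operator import attrgetter

INFINITE = float("inf")


@dataclass(frozen=True)
class TimeoutConfig:
    """Timeouts (in seconds) for one request source, e.g. a user."""
    # Hypothetical request types, for illustration only.
    read_timeout: float
    write_timeout: float
    truncate_timeout: float


# A selector names the request type. In C++ this would be a
# pointer-to-member; attrgetter plays that role here.
select_read = attrgetter("read_timeout")
select_truncate = attrgetter("truncate_timeout")

# One configuration per source: user queries and internal queries.
user_timeouts = TimeoutConfig(5.0, 2.0, 60.0)
infinite_timeouts = TimeoutConfig(INFINITE, INFINITE, INFINITE)


def timeout_for(config: TimeoutConfig, selector) -> float:
    """Pick the timeout for a request type within a source's config."""
    return selector(config)
```

The caller supplies the source's configuration and the statement supplies the selector, so neither needs to know about the other.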
2018-04-29 19:52:40 +03:00
Nadav Har'El
a0bc0d2d11 secondary index: fix support for compound partition key
In the current code, if the base table has a compound partition key (i.e.,
multiple partition-key columns) searching its secondary indexes didn't work.
There is no real reason for this; it was just a bug in preparing the
second query:

Every SI query is converted to two queries. The first queries the associated
materialized view, to find a list of primary keys. Those we need to use in a
second query, of the base table. The second query needs to list, as
restrictions, the keys found above. When a partition key is compound, its
components build one key and one restriction. But in the buggy code, we
incorrectly used each component as a separate (improperly formatted) key
and restriction, and obviously this didn't work.
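
The difference between the buggy and the fixed key construction can be sketched like this. The function and column names are hypothetical, and the rows are plain dictionaries rather than Scylla's internal types.

```python
def base_table_keys(view_rows, partition_key_columns):
    """Fixed behaviour: one compound key per row found in the view."""
    return [tuple(row[c] for c in partition_key_columns) for row in view_rows]


def buggy_keys(view_rows, partition_key_columns):
    """The mistake described above: every component becomes its own key."""
    return [(row[c],) for row in view_rows for c in partition_key_columns]


# Two rows returned by the first (materialized view) query.
rows = [
    {"p1": 1, "p2": "a", "v": 10},
    {"p1": 2, "p2": "b", "v": 10},
]
```

For a base table with partition key `(p1, p2)`, the fixed version yields two keys, one per row, whereas the buggy version yields four single-component keys that match nothing in the base table.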

This patch also adds a test that reproduces this problem and confirms its fix.

In the fixed code I also found another incorrect use of to_cql_string() (which
could break case-sensitive primary key column names) and changed it to
to_string().

Fixes #3210.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429124138.24406-1-nyh@scylladb.com>
2018-04-29 14:40:13 +01:00
Duarte Nunes
b1dd1876e5 gms/gossiper: Prevent duplicate processing of EchoMessage reply
We make multiple attempts to mark a node as alive. We do that by
sending an EchoMessage, and marking the node as alive upon receiving a
successful answer. In case there's a network partition and the nodes
can't reach each other, multiple messages may be delivered and
processed.

We can avoid processing duplicate EchoMessage replies by checking
whether we had already marked the node as alive.
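
A minimal sketch of that check, assuming a hypothetical `Gossiper` class that tracks live endpoints in a set (the real gossiper is C++ and considerably more involved):

```python
class Gossiper:
    """Toy model of the liveness bookkeeping; all names are hypothetical."""

    def __init__(self):
        self.live_endpoints = set()
        self.alive_notifications = 0  # how many times listeners were notified

    def on_echo_reply(self, endpoint: str) -> bool:
        """Handle a successful echo reply; return True if the node was marked alive."""
        if endpoint in self.live_endpoints:
            # Duplicate reply from an earlier attempt: nothing left to do.
            return False
        self.live_endpoints.add(endpoint)
        self.alive_notifications += 1
        return True
```

However many replies arrive for the same endpoint, only the first one triggers the expensive "node is alive" processing.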

Fixes #1184

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180428191942.31990-1-duarte@scylladb.com>
2018-04-29 14:20:01 +03:00
Avi Kivity
51b235aa7e compress: adjust HAVE_LZ4_COMPRESS_DEFAULT macro for new name
Seastar changed the name of this macro.
2018-04-29 12:57:27 +03:00
Avi Kivity
0530653da9 Merge "adapt scylla_io_setup to recent I/O Scheduler changes" from Glauber
"
Recently many changes have landed in seastar for the I/O Scheduler. We
can now describe the I/O storage of a machine by its visible properties
like throughput and bandwidth instead of relying on an indirect
calculation.

For the instances we support, we can just measure that and start using
them right away.

A version of iotune that computes those properties is not yet ready, but
in its making I have noticed that we aren't really setting the nomerges
and scheduler properties of the disks under testing. We definitely
should, since that can influence the results. So this patchset also
starts doing that.

The commandline for iotunev2 shouldn't change much. When it is ready we
will just adjust this script once more.
"

* 'scylla_io_setup' of github.com:glommer/scylla:
  scylla_io_setup: preconfigure i3 and i2 instances with new I/O scheduler properties
  scylla_lib: drop support for m3 and c3 AWS instance types
  io_setup: call blocktune before tuning I/O
  blocktune: allow it to be called as a library.
  scripts: move scylla-blocktune to scripts location
2018-04-29 11:44:06 +03:00
Avi Kivity
7161244130 Merge seastar upstream
* seastar 70aecca...ac02df7 (5):
  > Merge "Prefix preprocessor definitions" from Jesse
  > cmake: Do not enable warnings transitively
  > posix: prevent unused variable warning
  > build: Adjust DPDK options to fix compilation
  > io_scheduler: adjust property names

DEBUG, DEFAULT_ALLOCATOR, and HAVE_LZ4_COMPRESS_DEFAULT macro
references prefixed with SEASTAR_. Some may need to become
Scylla macros.
2018-04-29 11:03:21 +03:00
Raphael S. Carvalho
043fadb15b sstables/twcs: fix setting of timestamp resolution
iterator incorrectly dereferenced when timestamp resolution not
explicitly specified.

following dtests are fixed:
compaction_additional_test.CompactionAdditionalStrategyTests_with_TimeWindowCompactionStrategy.compaction_is_started_on_boot_test
compaction_additional_test.CompactionAdditionalTest.compact_data_by_time_window_test
compaction_additional_test.CompactionAdditionalTest.compaction_removes_ttld_data_by_time_windows_test
compaction_test.TestCompaction_with_DateTieredCompactionStrategy.compaction_strategy_switching_test

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180427192545.17440-1-raphaelsc@scylladb.com>
2018-04-29 10:44:44 +03:00
Glauber Costa
0c29289c22 scylla_io_setup: preconfigure i3 and i2 instances with new I/O scheduler properties
We can use iotunev2 (or any other I/O generator) to test for the limits
of the disks for the i2 and i3 instance classes. The values I got here
are the values I got from ~5 invocations of the (yet to be upstreamed)
iotune v2, with the IOPS numbers rounded for convenience of reading.

During the execution, I verified that the disks were saturated so we
can trust these numbers even if iotunev2 is merged in a different form.
The numbers are very consistent, unlike what we usually saw with the
first version of iotune.

Previously, we were just multiplying the concurrency number by the
number of disks. Now that we have better infrastructure, we will
manually test i3.large and i3.xlarge, since their disks are smaller
and slower.

For the other i3, and all instances in the i2 family storage scales up
by adding more disks. So we can keep multiplying the characteristics of
one known disk by the number of disks and assuming perfect scaling.

Example for i3, obtained with i3.2xlarge:

read_iops = 411k
read_bandwidth = 1.9GB/s

So for i3.16xlarge, we would have read_iops = 3.28M and 15GB/s - very
close to the numbers advertised by AWS.
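
The perfect-scaling arithmetic above can be reproduced with a few lines. This is only a sketch: the class and function names are hypothetical, 1 GB is taken as 10^9 bytes, and the disk counts (one NVMe disk on i3.2xlarge, eight on i3.16xlarge) come from the public AWS instance specifications rather than from this patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskProperties:
    read_iops: int
    read_bandwidth: int  # bytes per second


# Figures quoted above, measured on the single disk of an i3.2xlarge.
I3_DISK = DiskProperties(read_iops=411_000, read_bandwidth=1_900_000_000)


def scale(disk: DiskProperties, ndisks: int) -> DiskProperties:
    """Assume perfect scaling: multiply one disk's properties by the disk count."""
    return DiskProperties(disk.read_iops * ndisks, disk.read_bandwidth * ndisks)


# i3.16xlarge has 8 disks of the same kind.
i3_16xlarge = scale(I3_DISK, 8)
```

Here `i3_16xlarge.read_iops` comes out at 3,288,000 and `read_bandwidth` at 15.2 GB/s, matching the rounded 3.28M and 15GB/s quoted in the message.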

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-04-28 09:50:07 -04:00
Glauber Costa
c85fbd16cb scylla_lib: drop support for m3 and c3 AWS instance types
m3 has 80GB SSDs in its largest form and I doubt anybody has ever
used it with Scylla.

I am also not aware of any c3 deployments. Since it is past generation,
it doesn't even show up in the default instance selector anymore.

I propose we drop AMI support for it. In practice, what that means is
that we won't auto-tune its I/O properties and people that want to use
it will have to run scylla_io_setup - like they do today with the EBS
instances.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-04-28 09:50:07 -04:00
Glauber Costa
685a7c9ae6 io_setup: call blocktune before tuning I/O
We are not configuring the disks the way we want them with respect to
scheduler and nomerges. This is an oversight that became clear now that
I started rewriting iotune, since I will explicitly test for that. But
since this can affect the results, it should have been here all along.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-04-28 09:50:07 -04:00
Glauber Costa
9eb8ea8b11 blocktune: allow it to be called as a library.
This patch makes the functions in scylla-blocktune available as a
library for other scripts - namely scylla_io_setup.

The filename, scylla-blocktune, is not the most convenient thing to call
from python so instead of just wrapping it in the usual test for
__main__ I am just splitting the file into two.

Another option would be to patch all callers to call
scylla_blocktune.py, but because we are usually not using extensions in
scripts that are meant to be called directly I decided for the split.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-04-28 09:50:07 -04:00
Glauber Costa
f837d5b1f1 scripts: move scylla-blocktune to scripts location
scylla-blocktune currently lives in the top level but this is mostly
historical. When time comes for us to install it, the packaging systems
will copy it to /usr/lib/scylla with the others.

So for consistency let's make sure that it also lives in the scripts
directory.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-04-28 09:50:07 -04:00
Vladimir Krivopalov
b3572acd6e A few improvements to encoding_stats structure.
- Use the same default epoch as Origin
- Use default value for the encoding_stats parameter in sstable::write_components()

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <846c6d2cbb97d2dd25968cb00b8557c86ff5e35c.1524854727.git.vladimir@scylladb.com>
2018-04-27 22:03:38 +03:00
Avi Kivity
2fb1bcfd13 Update scylla-ami submodule
* dist/ami/files/scylla-ami 02b1853...8a6e4dd (1):
  > ds2_configure.py: always use Ec2Snitch for single region case

Fixes #1800.
2018-04-27 21:02:27 +03:00
Vladimir Krivopalov
36fe06fd3e Make abstract_type::is_fixed_length() non-virtual.
This method is called aggressively through SSTable 3.0 read/write, we
want to reasonably optimise it so as not to incur extra indirect calls.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <2d00ddecd112af867a30d3d6930c10165dd5af34.1524851530.git.vladimir@scylladb.com>
2018-04-27 20:57:46 +03:00
Tomasz Grabiec
b1465291cf db: schema_tables: Treat drop of scylla_tables.version as an alter
After upgrade from 1.7 to 2.0, nodes will record a per-table schema
version which matches that on 1.7 to support the rolling upgrade. Any
later schema change (after the upgrade is done) will drop this record
from affected tables so that the per-table schema version is
recalculated. If nodes perform a schema pull (they detect schema
mismatch), then the merge will affect all tables and will wipe the
per-table schema version record from all tables, even if their schema
did not change. If then only some nodes get restarted, the restarted
nodes will load tables with the new (recalculated) per-table schema
version, while not restarted nodes will still use the 1.7 per-table
schema version. Until all nodes are restarted, writes or reads between
nodes from different groups will involve a needless exchange of schema
definition.

This will manifest in logs with repeated messages indicating schema
merge with no effect, triggered by writes:

  database - Schema version changed to 85ab46cd-771d-36c9-bc37-db6d61bfa31f
  database - Schema version changed to 85ab46cd-771d-36c9-bc37-db6d61bfa31f
  database - Schema version changed to 85ab46cd-771d-36c9-bc37-db6d61bfa31f

The sync will be performed if the receiving shard forgets the foreign
version, which happens if it doesn't process any request referencing
it for more than 1 second.

This may impact latency of writes and reads.

The fix is to treat schema changes which drop the 1.7 per-table schema
version marker as an alter, which will switch in-memory data
structures to use the new per-table schema version immediately,
without the need for a restart.

Fixes #3394

Tests:
    - dtest: schema_test.py, schema_management_test.py
    - reproduced and validated the fix with run_upgrade_tests.sh from git@github.com:tgrabiec/scylla-dtest.git
    - unit (release)

Message-Id: <1524764211-12868-1-git-send-email-tgrabiec@scylladb.com>
2018-04-27 17:12:33 +03:00
Avi Kivity
6154ea734d Merge "Support for writing SSTables 3.0 - rows only" from Vladimir
"
This patch series introduces initial support for writing SSTables in
'mc' format (aka SSTables 3.0).

Currently, the following components are written in 3.0 format:
  - Data.db
  - Index.db
  - Summary.db
(there were no changes to summary files format compared to ka/la)
Other SSTables components are written in the old format for now as they
still need to exist to satisfy post-flush processing.

For now, only rows are written to the data file and indexed. Range
tombstones are not supported.

Writing rows is supported in full with the only exception being counter
cells. All the other features (TTLed data, row/cell level tombstones,
collections, etc) are supported.

Unit tests rely on producing files and binary-comparing them with
'golden' copies that are produced using Cassandra 3.11. This is done to
not block until reading SSTables 3.0 format is implemented.

=======================================
Implementation notes
=======================================

Internally, sstable_writer has been refactored to support multiple
implementations that are instantiated in its constructor based on the
sstable version. Little to no code is shared among sstable_writer_v2 and
sstable_writer_v3 as we only intend to support sstable_writer_v2
alongside sstable_writer_v3 for a single release (to be able to do
rollback on rolling upgrade failure) and then plan to get rid of it
entirely and switch to always writing SSTables in the new format.

The design of sstable_writer_v3 mostly follows that of its precursors
sstable_writer(_v2) and components_writer. Some refactoring and further
code rearrangements are expected in the future but the main code is
there.
"

* 'projects/sstables-30/write-rows/v2' of https://github.com/argenet/scylla:
  Add tests for writing data and index files in SSTables 3.0 ('mc') format.
  Support for writing SSTables 3.0 ('mc') Data.db and Index.db files - rows only.
  Add missing enum values to bound_kind.
  Add building blocks for writing data in SSTables 3.0 format.
  Refactor sstable_writer to support various internal implementations.
  Add is_fixed_length() to data types.
  Add mutation_partition::apply_insert() overload that accepts TTL and expiry for row marker.
2018-04-27 17:10:31 +03:00
Piotr Jastrzebski
d839a945b4 Use goto instead of break in data_consume_rows_context_m::process_state
This way the branches will be better predicted.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <271333caa723e8f3ed1db4fbe6b014ebde2b5d3a.1524818584.git.piotr@scylladb.com>
2018-04-27 11:56:13 +03:00
Vladimir Krivopalov
77fdfa3e7a Add tests for writing data and index files in SSTables 3.0 ('mc') format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 14:34:20 -07:00
Vladimir Krivopalov
15ef4ca73c Support for writing SSTables 3.0 ('mc') Data.db and Index.db files - rows only.
This fix adds functionality for writing data in 'mc' format to Data.db
file according to the SSTables 3.0 data format as described at https://github.com/scylladb/scylla/wiki/SSTables-3.0-Data-File-Format
and Index.db file according to the specification at https://github.com/scylladb/scylla/wiki/SSTables-3.0-Index-File-Format

The following cases are not supported yet:
  - writing counter cells
  - range tombstones

In Index.db, end open markers are not written since range tombstones are not
supported for data files yet.

For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 14:34:20 -07:00
Vladimir Krivopalov
3ecc9e9ce4 Add missing enum values to bound_kind.
bound_kind::clustering, bound_kind::excl_end_incl_start and
bound_kind::incl_end_excl_start are used during SSTables 3.0 writing.

bound_kind::static_clustering is not used yet but added for completeness
and parity with the Origin.

For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 14:34:20 -07:00
Vladimir Krivopalov
a95664be08 Add building blocks for writing data in SSTables 3.0 format.
For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 14:34:20 -07:00
Vladimir Krivopalov
bb2bea928a Refactor sstable_writer to support various internal implementations.
This is preparatory work for supporting writing SSTables in multiple
formats.

For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 14:34:20 -07:00
Vladimir Krivopalov
54bd74fda0 Add is_fixed_length() to data types.
For any given CQL data type, this member returns whether its values are
of fixed or variable length. This is used by SSTables 3.0 format to only
store the length value for variable-length cells.
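
How a writer might use this property is sketched below. The type table is a hypothetical subset, and the 4-byte big-endian length prefix is a simplification chosen for readability; the actual length encoding is defined by the SSTables 3.0 data file format and is not reproduced here.

```python
import struct

# Hypothetical subset of CQL types mapped to their fixed value size in
# bytes; None marks a variable-length type.
FIXED_SIZES = {"int": 4, "bigint": 8, "boolean": 1, "text": None, "blob": None}


def is_fixed_length(cql_type: str) -> bool:
    return FIXED_SIZES[cql_type] is not None


def write_cell_value(cql_type: str, value: bytes) -> bytes:
    """Serialize a cell value, storing its length only when it can vary."""
    if is_fixed_length(cql_type):
        if len(value) != FIXED_SIZES[cql_type]:
            raise ValueError("wrong size for fixed-length type " + cql_type)
        return value  # the reader already knows the size from the type
    # Simplified length prefix (assumption, see the note above).
    return struct.pack(">I", len(value)) + value
```

A 4-byte `int` is written as exactly 4 bytes, while a 3-byte `text` value costs 3 bytes plus the prefix.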

For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 14:34:20 -07:00
Vladimir Krivopalov
ed62b9a667 Add mutation_partition::apply_insert() overload that accepts TTL and expiry for row marker.
For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 13:27:42 -07:00
Piotr Jastrzebski
a8154e2825 Fix use-after-free in summary parsing
Buffer received from read_exactly is referenced by
a pointer used in do_until loop but is not kept around
and is destroyed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <5edd6d08ec4466fe6abd0e83b4bfb24f1f5c80fa.1524747108.git.piotr@scylladb.com>
2018-04-26 15:54:41 +03:00
Avi Kivity
5119c1e9c1 Merge "Implement reading simple table from sstable 3.x" from Piotr
"
This patchset prepares everything for supporting both the 2.x and 3.x formats, and implements reading
a very simple table, with just partition keys, from an sstable 3.x.

Tests: units (release)
"

* 'haaawk/sstables3/read_only_partitions_v4' of ssh://github.com/scylladb/seastar-dev: (22 commits)
  Test for reading sstable in MC format with no columns
  Use new mp_row_consumer_m and data_consume_rows_context_m
  Introduce mp_row_consumer_m
  Rename mp_row_consumer to mp_row_consumer_k_l
  Introduce consumer_m and data_consume_rows_context_m
  Use read_short_length_bytes in RANGE_TOMBSTONE
  Use read_short_length_bytes in ATOM_START
  Use read_short_length_bytes in ROW_START
  Add continuous_data_consumer::read_short_length_bytes
  Reduce duplication with continuous_data_consumer::read_partial_int
  Add test for a simple table with just partition key
  Add test for reading index
  Extract mp_row_consumer to separate header
  Make sstable_mutation_reader independent from mp_row_consumer
  Make sstable_mutation_reader a template
  Make data_consume_context a template
  Move data_consume_rows_context from row.cc to row.hh
  Decouple sstable.hh and row.hh
  Reduce visibility of sstable::data_consume_*
  Move data_consume_context to separate header
  ...
2018-04-26 14:35:42 +03:00
Botond Dénes
b2d71ed872 install_dependencies.sh: centos: add systemd-devel
This optional dependency is needed to properly integrate with systemd.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <bacd07958531e6541d5b1a4ea885f01491002a7b.1524740540.git.bdenes@scylladb.com>
2018-04-26 14:32:36 +03:00
Piotr Jastrzebski
5c223c13d6 Test for reading sstable in MC format with no columns
Just a simple table with only partition key.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:38 +02:00
Piotr Jastrzebski
6dd7ce2582 Use new mp_row_consumer_m and data_consume_rows_context_m
When an SSTable is in MC format, use these new classes
to be able to read it.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:38 +02:00
Piotr Jastrzebski
9ba64f65e1 Introduce mp_row_consumer_m
This is a version of mp_row_consumer that can
handle SSTables in MC format.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:38 +02:00
Piotr Jastrzebski
4aec023927 Rename mp_row_consumer to mp_row_consumer_k_l
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:38 +02:00
Piotr Jastrzebski
2ee3d8b87b Introduce consumer_m and data_consume_rows_context_m
Those classes can handle SSTables in MC format.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:38 +02:00
Piotr Jastrzebski
b343212073 Use read_short_length_bytes in RANGE_TOMBSTONE
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
90bb7802cc Use read_short_length_bytes in ATOM_START
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
6a81a755ee Use read_short_length_bytes in ROW_START
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
06ceea9c3e Add continuous_data_consumer::read_short_length_bytes
This is a common operation so it's better to have it
implemented in a single place.
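
The operation being factored out can be sketched as follows. This is a synchronous Python simplification of what is an incremental, buffer-driven consumer in the real code, and it assumes the prefix is a 16-bit big-endian length, as the word "short" suggests.

```python
import io
import struct


def read_short_length_bytes(stream) -> bytes:
    """Read a 2-byte length prefix, then that many bytes of payload."""
    header = stream.read(2)
    if len(header) != 2:
        raise EOFError("truncated length prefix")
    (length,) = struct.unpack(">H", header)  # assumption: big-endian uint16
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("truncated payload")
    return data
```

For example, feeding it `io.BytesIO(b"\x00\x03keyrest")` returns `b"key"` and leaves `b"rest"` unread in the stream.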

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
e664360730 Reduce duplication with continuous_data_consumer::read_partial_int
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
9a3f93a42b Add test for a simple table with just partition key
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
c6d4f49abb Add test for reading index
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
63f0b57365 Extract mp_row_consumer to separate header
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
e5145b87b0 Make sstable_mutation_reader independent from mp_row_consumer
Take consumer as template parameter instead.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
9c93f9f5f4 Make sstable_mutation_reader a template
Take DataConsumeRowsContext type as parameter.
This will allow us to implement different context
for reading 3.x files.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
9fad5831df Make data_consume_context a template
Parametrize it with the type of data consume rows context.

There will be different implementations used for different
sstable file formats.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
e2b393df13 Move data_consume_rows_context from row.cc to row.hh
It will be used as a template parameter for sstable_mutation_reader
once it's turned into a template. This means the definition has
to be accessible.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
0e405719e8 Decouple sstable.hh and row.hh
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
bcf5717753 Reduce visibility of sstable::data_consume_*
They are used just in partition.cc, row.cc and sstables_test.cc
so it is useful to cut their scope by moving them
to data_consume_context.hh.

This will make it much easier to turn data_consume_context into
a template.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
578aa6826f Move data_consume_context to separate header
It's used only in row.cc, partition.cc and sstables_test.cc
so it's better to reduce the dependency just to those files.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
a55cec544e mp_row_consumer: stop depending on sstable_mutation_reader
Introduce mp_row_consumer_reader to cut
a cyclic dependency between them.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
0efcc6b33f Fix use-after-free in estimated_histogram parsing
A pointer to buf was used in do_until but buf wasn't
kept around and was destroyed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:48:02 +02:00
Takuya ASADA
782ebcece4 dist/debian: add --jobs <njobs> option just like build_rpm.sh
On some build environments we may want to limit the number of parallel jobs:
ninja-build runs ncpus jobs by default, which may be too many since g++ uses
a lot of memory.
So support --jobs <njobs> just like the rpm build script does.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180425205439.30053-1-syuu@scylladb.com>
2018-04-26 12:44:06 +03:00
Duarte Nunes
6f9bc28edf Merge 'Collect statistics on updates to memtables' from Vladimir
"
This patchset brings in a statistics collector that tracks minimal
values for timestamps, TTLs and local deletion times for all the updates
made to a given memtable.

These statistics are later used when flushing memtables into SSTables
using 3.x ('mc') format to delta-encode corresponding values using
collected minimums as bases (that is why it is called encoding
statistics).

This patchset is sent out apart from other changes that introduce
writing SSTables 3.x to facilitate read path implementation that also
needs the encoding_stats structure.

The tests for write path implicitly cover this functionality as any rows
written to a SSTable 3.0 file make use of delta-encoding.
"

* 'projects/sstables-30/collect-encoding-statistics-v4' of https://github.com/argenet/scylla:
  Collect encoding statistics for memtable updates.
  Factor out min_tracker and max_tracker as common helpers.
  Always pass mutation_partitions to partition_entry::apply()
2018-04-26 00:39:15 +01:00
Vladimir Krivopalov
948c4d79d3 Collect encoding statistics for memtable updates.
We keep track of all updates and store the minimal values of timestamps,
TTLs and local deletion times across all the inserted data.
These values are written as a part of serialization_header for
Statistics.db and used for delta-encoding values when writing Data.db
file in SSTables 3.0 (mc) format.
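
A rough sketch of the idea, with hypothetical class names and a placeholder default of 0 for values that were never observed (the patch series uses its own defaults):

```python
class MinTracker:
    """Remember the smallest value seen, or a default if none was seen."""

    def __init__(self, default):
        self._default = default
        self._value = None

    def update(self, value):
        if self._value is None or value < self._value:
            self._value = value

    def get(self):
        return self._default if self._value is None else self._value


class EncodingStatsCollector:
    """Collect the minimums across all updates applied to one memtable."""

    def __init__(self):
        self.min_timestamp = MinTracker(0)
        self.min_ttl = MinTracker(0)
        self.min_local_deletion_time = MinTracker(0)

    def update(self, timestamp, ttl=None, local_deletion_time=None):
        self.min_timestamp.update(timestamp)
        if ttl is not None:
            self.min_ttl.update(ttl)
        if local_deletion_time is not None:
            self.min_local_deletion_time.update(local_deletion_time)


def delta_encode(value, base):
    """Store a value as its distance from the collected minimum."""
    return value - base
```

Because every stored value is at least as large as the collected minimum, the deltas are small non-negative numbers, which is what makes them cheap to encode.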

For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-25 15:39:14 -07:00
Vladimir Krivopalov
f6f99919da Factor out min_tracker and max_tracker as common helpers.
They will be re-used for collecting encoding statistics which is needed
to write SSTables 3.0.

Part of #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-25 14:58:47 -07:00
Vladimir Krivopalov
e1ee833861 Always pass mutation_partitions to partition_entry::apply()
Previously it was also possible to pass a frozen_mutation to it.
Now we de-serialize frozen mutations at the calling side.

This is a pre-requisite for collecting memtable statistics needed for
writing into the SSTables 3.0 format.

For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-25 14:58:47 -07:00
Moreno Garcia
8dde91d03c docker: Create data_dir if it does not exist
When provisioning a Scylla docker image with --developer-mode 0 (disabled)
scylla_raid_setup is not invoked. As a consequence the "data" directory is not
created and scylla_io_setup fails (steps to reproduce and error message provided
at the end).

This patch adds the same verifications present in scylla_io_setup to docker's
scyllasetup.py and creates the data directory in the case it is not present.
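
The directory-creation step could look roughly like this. The helper name is hypothetical, and the additional verifications mentioned above are not shown.

```python
import os


def ensure_data_dir(data_dir: str) -> bool:
    """Create data_dir if it is missing; return True if it had to be created."""
    if os.path.isdir(data_dir):
        return False
    os.makedirs(data_dir)
    return True
```

Calling such a helper on the data directory before running scylla_io_setup would avoid the "No such file or directory" failure shown in the log below.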

--

Steps to reproduce on AWS i3.2xlarge with Ubuntu 16.04:

sudo -s
apt update && apt upgrade -y && apt-get install docker.io -y

mdadm --create --verbose --force --run /dev/md0 --level=0 -c1024 --raid-devices=1 /dev/nvme0n1
mkfs.xfs /dev/md0 -f -K
mkdir /var/lib/scylla
mount -t xfs /dev/md0 /var/lib/scylla

docker run --name some-scylla \
  --volume /var/lib/scylla:/var/lib/scylla \
  -p 9042:9042 -p 7000:7000 -p 7001:7001 -p 7199:7199 \
  -p 9160:9160 -p 9180:9180 -p 10000:10000 \
  -d scylladb/scylla --overprovisioned 1 --developer-mode 0

docker logs some-scylla
  running: (['/usr/lib/scylla/scylla_dev_mode_setup', '--developer-mode', '0'],)
  running: (['/usr/lib/scylla/scylla_io_setup'],)
  terminate called after throwing an instance of 'std::system_error'
    what():  open: No such file or directory
  ERROR:root:/var/lib/scylla/data did not pass validation tests, it may not be on XFS and/or has limited disk space.
  This is a non-supported setup, and performance is expected to be very bad.
  For better performance, placing your data on XFS-formatted directories is required.
  To override this error, enable developer mode as follow:
  sudo /usr/lib/scylla/scylla_dev_mode_setup --developer-mode 1
  failed!
  Traceback (most recent call last):
    File "/docker-entrypoint.py", line 15, in <module>
      setup.io()
    File "/scyllasetup.py", line 34, in io
      self._run(['/usr/lib/scylla/scylla_io_setup'])
    File "/scyllasetup.py", line 23, in _run
      subprocess.check_call(*args, **kwargs)
    File "/usr/lib64/python3.4/subprocess.py", line 558, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['/usr/lib/scylla/scylla_io_setup']' returned non-zero exit status 1

ls -latr /var/lib/scylla
  total 4
  drwxr-xr-x 44 root root 4096 Abr 24 13:02 ..
  drwxr-xr-x  2 root root    6 Abr 24 13:10 .

Signed-off-by: Moreno Garcia <moreno@scylladb.com>
Message-Id: <20180424173729.22151-1-moreno@scylladb.com>
2018-04-25 17:48:34 +03:00
Calle Wilund
b1edf75c8b types: Make seastar::inet_address the "native" type for CQL inet.
Fixes #3187

Requires seastar "inet_address: Add constructor and conversion function
from/to IPv4"

Implements IPv6 support for CQL inet data. The actual data stored will
now be either 4 or 16 bytes long. gms::inet_address has been augmented
to interop with seastar::inet_address, though of course actually trying
to use an IPv6 address there or in any of its tables will throw badly.

Tests assuming ipv4 changed. Storing a ipv4_address should be
transparent, as it now "widens". However, since all ipv4 is
inet_address, but not vice versa, there is no implicit overloading on
the read paths. I.e. tests and system_keyspace (where we read ip
addresses from tables explicitly) are modified to use the proper type.
Message-Id: <20180424161817.26316-1-calle@scylladb.com>
2018-04-24 23:12:07 +01:00
Duarte Nunes
9111c6e49a Merge seastar upstream
* seastar 1bb44ac...70aecca (12):
  > Experimental CMake-based build system
  > inet_address: Add constructor and conversion function from/to IPv4
  > tls: Add missing includes and forward declarations to header
  > install_dependencies.sh: fix remaining centos issues
  > rpc: Add missing return when closing client socket
  > install-dependencies.sh: install g++7.3 for centos, instead of g++7.2
  > reactor: fix race beween alien queue construction and start
  > Merge "enhance the I/O Scheduler with bandwidth and throughput limits" from Glauber
  > reactor: gracefully exit if exception happens during initialization
  > build: really add alien_test
  > Merge "reactor: add alien::submit_to()" from Kefu
  > queue: do not consume from aborted queue

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-24 23:07:13 +01:00
Duarte Nunes
f5eeafe1bf tests/secondary_index_test: Add test for dropping index-backing MV
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180424140745.7144-2-duarte@scylladb.com>
2018-04-24 17:02:59 +01:00
Duarte Nunes
9146de3118 service/migration_manager: Don't drop index-backing MV
Unless dropped by the index itself, forbid dropping an index-backing
MV using `drop materialized view`.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180424140745.7144-1-duarte@scylladb.com>
2018-04-24 17:01:59 +01:00
Nadav Har'El
d674b6f672 secondary index: fix bug in indexing case-sensitive column names
CQL normally folds identifiers such as column names to lowercase. However,
if the column name is quoted, case-sensitive column names and other strange
characters can be used. We had a bug where such columns could be indexed,
but then, when trying to use the index in a SELECT statement, it was not
found.

The existing code remembered the index's column after converting it to CQL
format (adding quotes). But such conversion was unnecessary, and wrong,
because the rest of the code works with bare strings and does not involve
actual CQL statements. So the fix avoids this mistaken conversion.
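
The mismatch can be illustrated with a small sketch. The names `to_cql_identifier` and `columns` are assumptions made for this example and do not correspond to Scylla's actual code; the quoting rule is simplified and ignores reserved keywords.

```python
import re

# Identifiers matching this pattern need no quoting in CQL text
# (simplified: reserved keywords are not considered here).
_UNQUOTED = re.compile(r"^[a-z][a-z0-9_]*$")


def to_cql_identifier(name):
    """Render a bare column name the way it would be written in CQL text."""
    if _UNQUOTED.match(name):
        return name
    return '"' + name.replace('"', '""') + '"'


# Internal lookups are keyed by bare names, not by CQL-formatted ones.
columns = {"MyCol": "int", "id": "int"}

bare_target = "MyCol"
cql_target = to_cql_identifier(bare_target)  # the quoted form: "MyCol"

found_with_bare = bare_target in columns  # lookup by bare name succeeds
found_with_cql = cql_target in columns    # lookup by quoted form fails
```

Remembering the quoted form as the index target is what makes the later lookup miss, even though the column exists.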

This patch also includes a test to reproduce this problem.

Fixes #3154.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180424154920.15924-1-nyh@scylladb.com>
2018-04-24 16:57:17 +01:00
Piotr Sarna
d323b5cddc tests: add missing case-sensitive JSON tests
This commit complements cql_query_test with case-sensitivity cases
for both SELECT JSON and INSERT JSON statements.
Message-Id: <20bc7df2ec644618727183e09f2352ca5546a9b9.1524576066.git.sarna@scylladb.com>
2018-04-24 16:30:56 +03:00
Piotr Sarna
000ce24306 cql3: solve JSON case-sensitivity issues
This commit fixes two closely related issues with handling
case-sensitive column names in JSON:
 * according to the documentation, case-sensitive names should be wrapped
   in an additional pair of double quotes during SELECT JSON
 * a logic error in parse_json() prevented INSERT JSON from working
   properly on case-sensitive column names

This commit is followed by updated cql_query_test, which checks
case-sensitive cases as well.
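
A rough sketch of the key-handling rule, assuming the Cassandra-documented convention that a JSON key is folded to lowercase unless it is wrapped in an extra pair of double quotes. The function name `json_key_to_column_name` is hypothetical and is not the parse_json() implementation.

```python
import json


def json_key_to_column_name(key):
    """Map a JSON map key to a column name.

    Keys wrapped in an extra pair of double quotes keep their case;
    all other keys are folded to lowercase.
    """
    if len(key) >= 2 and key.startswith('"') and key.endswith('"'):
        return key[1:-1]
    return key.lower()


# The first key carries the extra quotes, so its case is preserved.
row = json.loads('{"\\"Name\\"": "Jones", "ID": 77}')
normalized = {json_key_to_column_name(k): v for k, v in row.items()}
```

Here `normalized` maps the case-sensitive column `Name` and the case-insensitive column `id`.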
Message-Id: <82d9d5e193a656e99bc86b297c00662a6fb808a0.1524576066.git.sarna@scylladb.com>
2018-04-24 16:30:55 +03:00
Avi Kivity
13ea1a89b5 Merge "Implement loading sstables in 3.x format" from Piotr
"
Pass sstable version to parse, write and describe_type methods to make it possible to handle different versions.
For now serialization header from 3.x format is ignored.

Tests: units (release)
"

* 'haaawk/sstables3/loading_v4' of ssh://github.com/scylladb/seastar-dev:
  Add test for loading the whole sstable
  Add test for loading statistics
  Add support for 3_x stats metadata
  Pass sstable version to describe_type
  Pass sstable version to write methods
  metadata_type: add Serialization type
  Pass sstable_version_types to parse methods
  Add test for reading filter
  Add test for read_summary
  sstables 3.x: Add test for reading TOC
  sstable: Make component_map version dependent
  sstable::component_type: add operator<<
  Extract sstable::component_type to separete header
  Remove unused sstable::get_shared_components
  sstable_version_types: add mc version
2018-04-24 12:49:41 +03:00
Piotr Jastrzebski
6310fc5f1c Add test for loading the whole sstable
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
9e78b6d4c6 Add test for loading statistics
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
df457166b0 Add support for 3_x stats metadata
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
e1e23ec555 Pass sstable version to describe_type
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
1cc1f9af5f Pass sstable version to write methods
This will allow writing different versions differently

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
08da518dae metadata_type: add Serialization type
Ignore it while reading sstable 3_x and throw
if it's present when reading 2_x.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
cb84ca8abb Pass sstable_version_types to parse methods
Parsing will depend on the sstable version when
we have support for both 2_x and 3_x formats.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
444b468d46 Add test for reading filter
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
ff06d2153c Add test for read_summary
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
10f9b06145 sstables 3.x: Add test for reading TOC
Make sure DigestCRC32 is handled correctly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
561ca34ec2 sstable: Make component_map version dependent
Introduce sstable_version_constants that will be a proxy
serving correct constants depending on the format version.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
7aef74c55f sstable::component_type: add operator<<
Make it possible to print out component_type.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
d492e92b15 Extract sstable::component_type to separete header
It will be used in other places which won't depend on
sstable.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:29:57 +02:00
Nadav Har'El
4af2604e76 secondary index: update test.py
I forgot that I also need to update test.py for the new test.

It's unfortunate that this script doesn't pick up the list of
tests automatically (perhaps with a black-list of tests we don't
want to run). I wonder if there are additional tests we are
forgetting to run.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180424085911.29732-1-nyh@scylladb.com>
2018-04-24 12:11:38 +03:00
Nadav Har'El
9605059a2b secondary index: move tests to separate source file
Move the two tests we have for the secondary indexing feature from the
huge tests/cql_query_test.cc to a new file, secondary_index_test.cc.

Having these tests in a separate file will make it easier and faster to
write more tests for this feature, and to run these tests together.

This patch doesn't change anything in the tests' code - it's just a code
move.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180424084700.28816-1-nyh@scylladb.com>
2018-04-24 11:49:57 +03:00
Takuya ASADA
4a8ed4cc6f dist/common/scripts/scylla_raid_setup: prevent 'device or resource busy' on creating mdraid device
According to the following page, mdraid creation can race with udev:
http://dev.bizo.com/2012/07/mdadm-device-or-resource-busy.html
It looks like this can happen on our AMI, too (see #2784).

To initialize RAID safely, we should wait for pending udev events to finish
both before and after mdadm is executed.
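
A sketch of the intended ordering, assuming `udevadm settle` is used to wait for pending udev events. The helper names are hypothetical, and the mdadm flags are simply copied from the reproduction steps of an earlier commit in this log rather than from scylla_raid_setup itself.

```python
import subprocess


def raid_creation_commands(md_device, disks):
    """Return the command sequence: settle, create the array, settle again."""
    settle = ["udevadm", "settle"]
    mdadm = [
        "mdadm", "--create", "--verbose", "--force", "--run", md_device,
        "--level=0", "-c1024",
        "--raid-devices={}".format(len(disks)),
    ] + list(disks)
    return [settle, mdadm, settle]


def run_all(commands, runner=subprocess.check_call):
    """Run each command in order; `runner` is injectable for dry runs."""
    for cmd in commands:
        runner(cmd)
```

Passing `runner=print` gives a dry run that only shows the commands in the order they would be executed.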

Fixes #2784

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1505898196-28389-1-git-send-email-syuu@scylladb.com>
2018-04-24 11:48:40 +03:00
Vladimir Krivopalov
fc644a8778 Fix Scylla to compile with older versions of JsonCpp (<= 1.7.0).
Old versions of JsonCpp declare the following typedefs for internally
used aliases:
    typedef long long int Int64;
    typedef unsigned long long int UInt64;

In newer versions (1.8.x), those are declared as:
    typedef int64_t Int64;
    typedef uint64_t UInt64;

Those base types are not identical so in cases when a type has
constructors overloaded only for specific integral types (such as
Json::Value in JsonCpp or data_value in Scylla), an attempt to
pack/unpack an integer from/to a JSON object causes ambiguous calls.

Fixes #3208

Tests: unit {release}.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <e9fff9f41e0f34b15afc90b5439be03e4295623e.1524556258.git.vladimir@scylladb.com>
2018-04-24 10:58:38 +03:00
Piotr Jastrzebski
279b426ee8 Remove unused sstable::get_shared_components
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 09:45:55 +02:00
Piotr Jastrzebski
7248752698 sstable_version_types: add mc version
This is the latest version of 3.x SSTable format.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 09:45:55 +02:00
Raphael S. Carvalho
11940ca39e sstables: Fix bloom filter size after resharding by properly estimating partition count
We were feeding the total estimated partition count of an input shared
sstable to each of the output unshared ones.

So the sstable writer thinks, *from the estimate*, that each sstable created
by resharding will hold the same amount of data as the shared sstable it
is being created from. That's a problem because the estimate is fed to
bloom filter creation, where it directly determines the filter's size.
So if we're resharding all sstables that belong to all shards, the
disk space used by filter components will be multiplied by the number
of shards. That becomes more of a problem with #3302.

Partition count estimation for a shard S will now be done as follows:
    //
    // TE, the total estimated partition count for a shard S, is defined as
    // TE = Sum(i = 0...N) { Ei / Si }.
    //
    // where i is an input sstable that belongs to shard S,
    //       Ei is the estimated partition count for sstable i,
    //       Si is the total number of shards that own sstable i.
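
The formula above can be sketched as follows. The function name and the input shape (a list of (Ei, Si) pairs) are assumptions for illustration; the real code may round or use integer arithmetic.

```python
def estimated_partitions_for_shard(inputs):
    """Compute TE = Sum(Ei / Si) over the input sstables owned by one shard.

    `inputs` is an iterable of (Ei, Si) pairs, where Ei is the estimated
    partition count of sstable i and Si is the number of shards owning it.
    """
    return sum(e / s for e, s in inputs)


# One sstable shared by 4 shards and one shared by 2 shards.
te = estimated_partitions_for_shard([(1000, 4), (600, 2)])
```

With these inputs the shard is credited with 250 + 300 partitions instead of the full 1600 that the old behavior would have fed to the bloom filter.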

Fixes #2672.
Refs #3302.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180423151001.9995-1-raphaelsc@scylladb.com>
2018-04-23 18:11:20 +03:00
Avi Kivity
8a8f688dbf Merge "Materialized views: Fixes to update generation" from Duarte
"
Fixes to several issues around view update generation, pertaining to
timestamp and TTL management.

Fixes #3361
Fixes #3360
Fixes #3140
Refs #3362

Tests: unit(release, debug), dtest(materialized_views.py)
"

Reviewed-by: Nadav Har'El <nyh@scylladb.com>

* 'materialized-views/fixes-galore/v2' of http://github.com/duarten/scylla:
  mutation_partition: Clarify comment about emptiness
  tests: Add view_complex_test
  tests/view_schema_test: Complete test
  db/view: Move cells instead of copying in add_cells_to_view()
  db/view: Handle unselected base columns and corner cases
  mutation_partition: Regular base column in view determines row liveness
  db/view: Don't avoid read-before-write when view PK matches base
  db/view: Process base updates to column unselected by its views
  db/view: Consider partition tombstone when generating updates
  tests/view_schema_test: Remove unneeded test
  mutation_fragment: Allow querying if row is live
  view_info: Add view_column() overload
  view_info: Explicitly initialize base-dependent fields
  cql3/alter_table_statement: Forbid dropping columns of MV base tables
2018-04-23 16:49:29 +03:00
Nadav Har'El
1ec5688b0b Materialized Views: fix incorrect limitations on row filtering
This patch fixes several cases where it was disallowed to create
a materialized view with a filter ("where ..."), for no good reason.
After this patch, these cases will be allowed. Fixes #2367.

In ordinary SELECT queries, certain types of filtering which are known to
be deceptively inefficient are not allowed. For example, trying to query
a range of partition keys cannot be done without reading the entire
database (because the murmur3 tokenizer randomizes the order of partitions).
Restricting two partition key components also cannot be done without
reading excessive amount of the entire partition. So Scylla, following
Cassandra, chooses to disallow such SELECT queries, and give an error
message.

However, the same SELECT statements *should* be allowed when defining a
materialized view. In this case, the filter is just used to check an
individual row - not to search for one - so there is no performance
concern.

Unfortunately the existing code did these validations while building the
SELECT statement's "restrictions", in code shared by both uses of SELECT
(query and MV definition). It was easy to move one of the validations
to later code which runs after the restriction has already been built (and
knows if it is working for query or MV), but because of the way the
"restrictions" objects (translated from Cassandra 2's code) hide what they
contain, many of the checks are harder to perform after having built the
restrictions object. So instead, we add in strategic places in the
restriction-handling code a new "allow_filtering" flag. If restrictions
are built with allow_filtering=true, the extra performance-oriented tests
on the filtering restrictions are not done. Materialized views set
allow_filtering=true.
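
A deliberately simplified sketch of how such a flag can gate a performance-oriented check. The names below are hypothetical and model only one of the checks (a non-equality restriction on the partition key); they are not the actual restrictions code.

```python
class InvalidRequestError(Exception):
    """Raised when a restriction is rejected (hypothetical name)."""


def validate_partition_key_restriction(operator, allow_filtering=False):
    """Reject range restrictions on the partition key unless filtering is allowed.

    An ordinary SELECT calls this with allow_filtering=False; a materialized
    view definition would pass allow_filtering=True, because the filter is
    only applied to one row at a time.
    """
    if operator not in ("=", "IN") and not allow_filtering:
        raise InvalidRequestError(
            "range restriction on partition key is not supported"
        )
```

The same restriction-building path then serves both callers, and only the flag decides whether the check is enforced.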

The allow_filtering flag will also be useful later when we want to support
the "ALLOW FILTERING" query option which is currently not supported properly
(we have several open issues on that). However note that this patch doesn't
complete that support: I left a FIXME in the spot where we set
allow_filtering in the Materialized Views case, but in the future we also need
to set it if the user specified "ALLOW FILTERING" in the query.

This patch also enables several unit tests written by Duarte which used to
fail because of this bug, and now pass. These tests verify that the
restrictions are now allowed and filter the view as desired; but I also
added test code to verify that the same restrictions are still forbidden,
as before, when used in ordinary SELECT queries.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Message-Id: <20180423124343.17591-1-nyh@scylladb.com>
2018-04-23 14:08:04 +01:00
Avi Kivity
ff055a291a Merge "Improve "out-of-the-box" build experience on centos" from Botond
"
Make sure install_dependencies.sh installs all the right dependencies
and that the example `configure.py` invocation can just be copy-pasted
into the terminal and will "just work".

Ref: #3208
"

* 'fix_centos_compile/v2' of https://github.com/denesb/scylla:
  install_dependencies.sh: update centos package list and example
  configure.py: add --with-ragel option
  configure.py: add --with-antlr3
  configure.py: check compiler version first
2018-04-23 15:49:27 +03:00
Botond Dénes
bfe741c03d install_dependencies.sh: update centos package list and example
Add missing packages to `yum install` list:
* scylla-boost163-static
* scylla-python34-pyparsing20

Update the configure.py example so that it just works:
* Change g++ to 7.3
* Add --with-antlr3 pointing to antlr3 installed from scylla 3rdparty
2018-04-23 15:46:43 +03:00
Botond Dénes
1efcf215b6 configure.py: add --with-ragel option
To allow the user to select the exact ragel executable they wish to
use.
2018-04-23 15:46:43 +03:00
Botond Dénes
784be9cc43 configure.py: add --with-antlr3
To allow the user to select the exact antlr3 executable they wish to
use.
2018-04-23 15:46:43 +03:00
Botond Dénes
ea8d8f9fbf configure.py: check compiler version first
Before checking anything else (presence of boost, its version, etc.)
check that the compiler is present and can compile and link a simple c++
program.
Previously, if the compiler was not set up correctly, configure.py would fail
at one of the other try_compile checks, whichever came first (usually
the one checking for boost). This led the user into chasing a
misleading error when in fact the compiler itself wasn't working.
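
A minimal sketch of such an up-front compiler check. The function name `compiler_works` and the test program are assumptions for illustration and are not the actual configure.py code.

```python
import os
import subprocess
import tempfile

# The simplest program that still exercises both compiling and linking.
SIMPLE_PROGRAM = "int main() { return 0; }\n"


def compiler_works(compiler, source=SIMPLE_PROGRAM):
    """Return True if `compiler` can compile and link `source`."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "check.cc")
        out = os.path.join(tmp, "check")
        with open(src, "w") as f:
            f.write(source)
        try:
            result = subprocess.run(
                [compiler, "-o", out, src],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except OSError:
            # The compiler executable is missing or cannot be started.
            return False
        return result.returncode == 0
```

Running this check first lets the script report "compiler not usable" directly instead of failing later in an unrelated check.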
2018-04-23 15:46:43 +03:00
Takuya ASADA
7b92c3fd3f dist: Drop AmbientCapabilities from scylla-server.service for Debian 8
Debian 8 fails with "Invalid argument" when AmbientCapabilities is used in a
systemd unit file, so drop the line when we build the .deb package for Debian 8.
For other distributions, keep using the feature.

Fixes #3344

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180423102041.2138-1-syuu@scylladb.com>
2018-04-23 13:27:14 +03:00
Avi Kivity
269207fdf6 Merge "Introducing INSERT JSON and fromJson to CQL3" from Piotr
"
This series complements JSON support with INSERT JSON and fromJson
cql function.

INSERT JSON implementation tries hard to interfere as little as possible
with regular INSERT path. So, after being parsed, insertJsonStatement
exists as a separate statement and is handled in a special way.
Overridden add_update_for_key extracts values from JSON map and applies
them to columns.

Converting from insert_json_statement to insert_statement uses auxiliary
from_json_object methods to convert JSON-encoded types to bytes.
Then, terms are matched to appropriate column names and cells are
updated.

fromJson CQL function uses the same from_json_object helper methods,
but applies them to single arguments, not whole rows.

Existing json handling functions from json.hh and libjsoncpp were used
where possible.

Things implemented:
 * expanding CQL grammar to accept INSERT JSON
 * converting JSON representation of cql values to cql terms
 * serving 'INSERT INTO xxx JSON yyy' clause
 * tests for INSERT JSON and fromJson()
"

* 'json_ops_2' of https://github.com/psarna/scylla:
  tests: add cql unit tests for INSERT JSON
  cql3: add fromJson() function
  cql3: add INSERT JSON parsing to CQL grammar
  cql3: add support for INSERT JSON clause
  cql3: decouple execute from term binding in setters
  cql3: change operation::make_* functions to static
  cql3: add from_json_object function to types
  cql3: Make literals::NULL_VALUE public
2018-04-23 13:19:54 +03:00
Piotr Sarna
97e89f2efb tests: add cql unit tests for INSERT JSON
This commit adds tests for INSERT JSON clause, which is expected
to accept JSON strings and insert appropriate values to columns
defined there.
The tests also cover fromJson function calls and inserting prepared
batch statements with INSERT JSON inside.

References #2058
2018-04-23 12:00:57 +02:00
Piotr Sarna
cd76a01747 cql3: add fromJson() function
This commit extends JSON support with the fromJson() function,
which can be used in an UPDATE clause to transform a JSON value
into a value with the proper CQL type.

fromJson() accepts strings and may return any type, so its instances,
like toJson(), are generated during calls.

This commit also extends functions::get() with an additional
'receiver' parameter. This parameter is used to extract the receiver type
information needed to generate the proper fromJson instance.
The receiver is known only during insert/update, so functions::get() also
accepts a nullptr if the receiver is not known (e.g. during selection).

References #2058
2018-04-23 12:00:57 +02:00
Piotr Sarna
9dd34bf34d cql3: add INSERT JSON parsing to CQL grammar
This commit makes it possible to parse INSERT JSON statement
in CQL grammar, so it's available via cqlsh.

References #2058
2018-04-23 12:00:57 +02:00
Piotr Sarna
cdcbf654a8 cql3: add support for INSERT JSON clause
This commit adds the implementation of INSERT JSON clause
which accepts JSON object as parameter and inserts appropriate
values into appropriate columns, as defined in given JSON.

Example:
INSERT INTO testme JSON '{
  "id" : 77,
  "name" : "Jones",
  "ranking" : 8.5
}'

References #2058
2018-04-23 12:00:57 +02:00
Piotr Sarna
bfe3c20035 cql3: decouple execute from term binding in setters
This commit makes it possible to pass values to setters,
instead of having to pass cql3::term instances.
Thanks to that previously prepared terminals can be directly
used in a setter execution.

References #2058
2018-04-23 12:00:56 +02:00
Piotr Sarna
2b729a10bc cql3: change operation::make_* functions to static
This commit makes operation::make* functions static, because they
don't access any instance-specific data anyway. It is later needed
to decouple setter execution from binding a cql3::term.
2018-04-23 12:00:56 +02:00
Piotr Sarna
1d40d2186e cql3: add from_json_object function to types
This commit adds a 'from_json_object' method which will be used
for converting JSON representation of a value to raw bytes representing
the same value. This functionality will be needed by 'INSERT JSON'
clause implementation, which can turn these raw bytes into cql3::term.
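
A sketch of the idea for a few simple types, assuming the standard CQL serialization (big-endian fixed-width numbers, UTF-8 text, one-byte booleans). The Python function below only borrows the name for illustration; it is not the C++ method and covers a small subset of types.

```python
import json
import struct


def from_json_object(cql_type, json_text):
    """Convert the JSON representation of a value to serialized bytes."""
    value = json.loads(json_text)
    if cql_type == "int":
        return struct.pack(">i", value)   # 4-byte big-endian signed
    if cql_type == "bigint":
        return struct.pack(">q", value)   # 8-byte big-endian signed
    if cql_type == "double":
        return struct.pack(">d", value)   # 8-byte IEEE 754, big-endian
    if cql_type == "boolean":
        return b"\x01" if value else b"\x00"
    if cql_type == "text":
        return value.encode("utf-8")
    raise ValueError("type not covered by this sketch: " + cql_type)
```

For example, the JSON number 77 for an int column becomes the four bytes 00 00 00 4d.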

References #2058
2018-04-23 12:00:56 +02:00
Piotr Sarna
e3dfa2193b cql3: Make literals::NULL_VALUE public
This commit makes NULL_VALUE public for future use in JSON translation.

References #2058
2018-04-23 12:00:56 +02:00
Botond Dénes
c34b69f4b2 Add PULL_REQUEST_TEMPLATE.md
Hopefully it will guide people wanting to contribute to the mailing
list.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <73c5d9c9884d8595b466412486494d6aa45d1d55.1524476490.git.bdenes@scylladb.com>
2018-04-23 10:45:25 +01:00
Avi Kivity
1a6b891ce2 Update scylla-ami submodule
* dist/ami/files/scylla-ami 9b4be70...02b1853 (1):
  > scylla_install_ami: remove the host id file after scylla_setup
2018-04-23 12:43:56 +03:00
Avi Kivity
b7b3d2bfec tests: continuous_data_consumer_test: increase coverage
Cover also values in the ranges 0 to 1 and 2^63 to 2^64 - 1.
Message-Id: <20180422150938.29143-2-avi@scylladb.com>
2018-04-23 11:39:06 +03:00
Avi Kivity
732177d2b0 tests: continuous_data_consumer_test: reduce runtime
continuous_data_consumer_test takes an unreasonable amount of
time to run, especially in debug mode.  Reduce the run time by
reducing the number of loops.
Message-Id: <20180422150938.29143-1-avi@scylladb.com>
2018-04-23 11:39:06 +03:00
Duarte Nunes
c8baba4e3a mutation_partition: Clarify comment about emptiness
empty() doesn't distinguish between live and dead data, so clarify
that in its comment.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:03 +01:00
Duarte Nunes
cc6c96bc92 tests: Add view_complex_test
This patch introduces view_complex_test and adds more test coverage
for materialized views.

A new file was introduced to avoid making view_schema_test slower.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:03 +01:00
Duarte Nunes
7ba1291731 tests/view_schema_test: Complete test
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:03 +01:00
Duarte Nunes
844e0b41d1 db/view: Move cells instead of copying in add_cells_to_view()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:03 +01:00
Duarte Nunes
4b4d1dbd1f db/view: Handle unselected base columns and corner cases
When a view's PK only contains the columns that form the base's PK,
then the liveness of a particular view row is determined not only by
the base row's marker, but also by the selected and, more importantly,
unselected columns.

This patch ensures that unselected columns are considered as much as
possible, even though some limitations will still exist. In
particular, we need to represent multiple timestamps (from all the
unselected columns), but have only mechanisms to record a single
timestamp.

We also have some issues when dealing with selected columns, and the
way we currently delete them. Consider the following:

create table cf (p int, c int, a int, b int, primary key (p, c))
create materialized view vcf as select a, b
from cf where p is not null and c is not null
primary key (p, c)

1) update cf using timestamp 10 set a = 1 where p = 1 and c = 1
2) delete a from cf using timestamp 11 where p = 1 and c = 1
3) update cf using timestamp 1 set a = 2 where p = 1 and c = 1

After 1), the MV should include a row with row marker @ ts10,
p = 1, c = 1, a = 1. After 2), this row should be removed.

At 3), we should add a row with row marker @ ts1, p = 1, c = 1, a = 1,
with a lower timestamp. This means that the delete should not
insert a row tombstone with timestamp @ 11, as we do now but it should
just delete the view's row marker (which exists) with ts1.

Refs #3362
Fixes #3140
Fixes #3361

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
67dac67c46 mutation_partition: Regular base column in view determines row liveness
When views contain a primary key column that is not part of the base
table primary key, that column determines whether the row is live or
not. We need to ensure that when that cell is dead, and thus the
derived row marker, either by normal deletion or by TTL, so is the
rest of the row.

This patch introduces the idea of a shadowing row marker. We map the
status of the regular base column in the view's PK to the view row's
marker. If this marker is dead, so is that cell in the base table, and
so should the view row become. To enforce that, a view row's dead
marker shadows the whole row if that view includes a base regular
column in its PK.

Fixes #3360

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
4dfce4d369 db/view: Don't avoid read-before-write when view PK matches base
When a view's PK only contains the columns that form the base's PK,
then the liveness of a particular view row is determined not only by
the base row's marker, but also by the selected and, more importantly,
unselected columns. When calculating the view's row marker we need
to access those unselected columns, so we can't avoid the
read-before-write as we were doing.

Refs #3362

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
bd3cedd240 db/view: Process base updates to column unselected by its views
When a view's PK only contains the columns that form the base's PK,
then the liveness of a particular view row is determined not only by
the base row's marker, but also by the selected and, more importantly,
unselected columns. So, process base updates to columns unselected by
any of its views.

Refs #3362

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
ac9b93eb89 db/view: Consider partition tombstone when generating updates
Not adding the partition tombstone to the current list of tombstones
may cause updates to be incorrectly generated.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
e6467f46b7 tests/view_schema_test: Remove unneeded test
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
b0cb5480d5 mutation_fragment: Allow querying if row is live
For clustering_row and static_row, allow querying whether they are
live or not.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
164f043768 view_info: Add view_column() overload
For when we already have the base's column_definition.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
31370fd7b1 view_info: Explicitly initialize base-dependent fields
Instead of lazily-initializing the regular base column in the view's
PK field, explicitly initialize it. This will be used by future
patches that don't have access to the schema when wanting to obtain
that column.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
b77b71436d cql3/alter_table_statement: Forbid dropping columns of MV base tables
When a view's PK only contains the columns that form the base's PK,
then the liveness of a particular view row is determined not only by
the base row's marker, but also by the selected and, more importantly,
unselected columns.

The fact that unselected columns can keep a view row alive also
requires that users cannot drop columns of base tables with
materialized views, which this patch implements.

Refs #3362

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Avi Kivity
28be4ff5da Revert "Merge "Implement loading sstables in 3.x format" from Piotr"
This reverts commit 513479f624, reversing
changes made to 01c36556bf. It breaks
booting.

Fixes #3376.
2018-04-23 06:47:00 +03:00
Avi Kivity
513479f624 Merge "Implement loading sstables in 3.x format" from Piotr
"
Pass sstable version to parse, write and describe_type methods to make it possible to handle different versions.
For now serialization header from 3.x format is ignored.

Tests: units (release)
"

* 'haaawk/sstables3/loading_v3' of ssh://github.com/scylladb/seastar-dev:
  Add test for loading the whole sstable
  Add test for loading statistics
  Add support for 3_x stats metadata
  Pass sstable version to describe_type
  Pass sstable version to write methods
  metadata_type: add Serialization type
  Pass sstable_version_types to parse methods
  Add test for reading filter
  Add test for read_summary
  sstables 3.x: Add test for reading TOC
  sstable: Make component_map version dependent
  sstable::component_type: add operator<<
  Extract sstable::component_type to separete header
  Remove unused sstable::get_shared_components
  sstable_version_types: add mc version
2018-04-22 16:18:39 +03:00
Piotr Jastrzebski
0288121c0a Add test for loading the whole sstable
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 15:07:03 +02:00
Piotr Jastrzebski
fbe9ee72d6 Add test for loading statistics
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 15:07:03 +02:00
Piotr Jastrzebski
b683870644 Add support for 3_x stats metadata
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 15:06:51 +02:00
Takuya ASADA
01c36556bf dist/debian: use --configfile to specify pbuilderrc
Use --configfile to specify pbuilderrc, instead of copying it to home directory.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180420024624.9661-1-syuu@scylladb.com>
2018-04-22 16:06:42 +03:00
Piotr Jastrzebski
26ab3056ae Pass sstable version to describe_type
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 14:41:11 +02:00
Piotr Jastrzebski
0022c309ee Pass sstable version to write methods
This will allow writing different versions differently

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 14:41:10 +02:00
Piotr Jastrzebski
65fe564cd2 metadata_type: add Serialization type
Ignore it while reading sstable 3_x and throw
if it's present when reading 2_x.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 14:40:04 +02:00
Piotr Jastrzebski
d68f3b328f Pass sstable_version_types to parse methods
Parsing will depend on the sstable version when
we have support for both 2_x and 3_x formats.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 13:46:12 +02:00
Piotr Jastrzebski
9b448b9082 Add test for reading filter
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 13:46:12 +02:00
Piotr Jastrzebski
6bb5468ba0 Add test for read_summary
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 13:46:12 +02:00
Piotr Jastrzebski
6c2cf40ce8 sstables 3.x: Add test for reading TOC
Make sure DigestCRC32 is handled correctly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 13:46:12 +02:00
Piotr Jastrzebski
00756582ca sstable: Make component_map version dependent
Introduce sstable_version_constants that will be a proxy
serving correct constants depending on the format version.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 13:46:12 +02:00
Piotr Jastrzebski
94fbec788e sstable::component_type: add operator<<
Make it possible to print out component_type.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 13:46:12 +02:00
Piotr Jastrzebski
82d483a1d3 Extract sstable::component_type to separete header
It will be used in other places which won't depend on
sstable.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 13:45:29 +02:00
Avi Kivity
70220d8f85 tests: sstable_datafile_test: peel off redundant parentheses around compression_parameters initializer
The compression_parameter constructor is called with an extra level of
parentheses. Presumably this caused a temporary object to be constructed
and then moved into the argument being initialized, but gcc 8 complains
about ambiguity.

Make it happy by stripping off the redundant parentheses.
Message-Id: <20180421121854.12314-1-avi@scylladb.com>
2018-04-21 13:53:29 +01:00
Avi Kivity
7a141c0240 tests: network_topology_strategy_test: peel off redundant parentheses around token initializer
The token constructor is called with an extra level of parentheses. Presumably
this caused a temporary object to be constructed and then moved into the
variable being initialized, but gcc 8 complains about ambiguity.

Make it happy by stripping off the redundant parentheses.
Message-Id: <20180421121736.12136-1-avi@scylladb.com>
2018-04-21 13:53:29 +01:00
Avi Kivity
7c54e8559c mutation_fragment: fix concept for mutation_fragment::consume()
The parameters to the MutationFragmentConsumer concept must be concrete
types, not decltype(auto).

Reported by gcc 8.
Message-Id: <20180421110738.7574-1-avi@scylladb.com>
2018-04-21 13:53:29 +01:00
Duarte Nunes
6eeb6514f1 Merge 'Introduce "scylla active-sstables" command' from Tomasz
"
Prints info about sstables used by readers

Example:

  (gdb) scylla active-sstables
  sstable "keyspace1"."standard1"#5, readers=3 data_file_size=39393952
  sstable "keyspace1"."standard1"#6, readers=3 data_file_size=127513304
  sstable_count=2, total_index_lists_size=0
"

* 'tgrabiec/gdb-scylla-active-sstables' of github.com:tgrabiec/scylla:
  gdb: Introduce "scylla active-sstables" command
  gdb: Make list_unordered_map() more general
  gdb: Improve compatibility with python2.7
2018-04-19 19:04:59 +01:00
Tomasz Grabiec
fb126abdc5 gdb: Introduce "scylla active-sstables" command
Prints info about sstables used by readers

Example:

  (gdb) scylla active-sstables
  sstable "keyspace1"."standard1"#5, readers=3 data_file_size=39393952
  sstable "keyspace1"."standard1"#6, readers=3 data_file_size=127513304
  sstable_count=2, total_index_lists_size=0
2018-04-19 19:45:52 +02:00
Tomasz Grabiec
68dd61a0e7 gdb: Make list_unordered_map() more general
1) vt.name returns None for some types, use str() instead
2) some unordered_maps use 'false' as the second Hash_node template parameter
3) some consumers will prefer a reference to the value instead of its address
2018-04-19 19:06:00 +02:00
Tomasz Grabiec
309257ddda gdb: Improve compatibility with python2.7
Which is still used in some builds of GDB
2018-04-19 19:04:26 +02:00
Piotr Jastrzebski
0c96573807 Remove unused sstable::get_shared_components
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-18 10:24:57 +02:00
Piotr Jastrzebski
4f1528192f sstable_version_types: add mc version
This is the latest version of 3.x SSTable format.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-18 10:24:57 +02:00
Duarte Nunes
1db6d7d6e2 cql3/functions: Add some missing functions
Fixes #3368

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180417170638.12625-1-duarte@scylladb.com>
2018-04-17 21:15:14 +03:00
Duarte Nunes
17917e12ce db/view: Wait for schema agreement in background upon view building
Waiting for schema agreement in the foreground may prevent the node
from booting in a reasonable time.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180417125915.11262-1-duarte@scylladb.com>
2018-04-17 18:03:43 +03:00
Duarte Nunes
b5e7d5fa2c column_family: Make reader without going through mutation source
When doing the read before write for a materialized view update, call
make_reader directly.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180417091918.10043-1-duarte@scylladb.com>
2018-04-17 12:22:36 +03:00
Takuya ASADA
e99f43ef43 dist/debian: call lsb_release after command existance check
Fixes #3364

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1523907917-13188-1-git-send-email-syuu@scylladb.com>
2018-04-17 10:54:39 +03:00
Avi Kivity
2c2175ab34 Merge "Add support for reading variant integers from SSTables" from Piotr
"
Enhance continuous_data_consumer to use existing vint serialization for reading
variable-length integers from SSTables.

Also available at:
https://github.com/scylladb/seastar-dev/commits/haaawk/sstables3/unsigned-vint-v6

Tests: units (release)
"

* 'haaawk/sstables3/unsigned-vint-v6' of ssh://github.com/scylladb/seastar-dev:
  sstables: add test for continuous_data_consumer::read_unsigned_vint
  buffer_input_stream: make it possible to specify chunk size
  Add tests for make_limiting_data_source
  Introduce make_limiting_data_source
  sstables: add continuous_data_consumer::read_unsigned_vint
  Cover serialized_size_from_first_byte in tests
  core: add unsigned_vint::serialized_size_from_first_byte
  sstables: add all dependant headers to consumer.hh
  sstables: add all dependant headers to exceptions.hh
  core: add #pragma once to vint-serialization.hh
2018-04-17 10:09:38 +03:00
Takuya ASADA
ace44784e8 dist/debian: use ~root as HOME to place .pbuilderrc
When 'always_set_home' is specified in /etc/sudoers, pbuilder won't read
.pbuilderrc from the current user's home directory, and we don't have a way to
change this behavior through a sudo command parameter.

So let's use ~root/.pbuilderrc and switch to HOME=/root when sudo is executed.
This works both in environments that specify always_set_home and in those that
don't.

Fixes #3366

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1523926024-3937-1-git-send-email-syuu@scylladb.com>
2018-04-17 09:37:16 +03:00
Takuya ASADA
5a71d4f814 dist/debian: use apt-get instead of apt
To suppress following warning, use apt-get instead of apt:
"WARNING: apt does not have a stable CLI interface. Use with caution in scripts."

Fixes #3365

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1523909727-13343-1-git-send-email-syuu@scylladb.com>
2018-04-17 09:29:16 +03:00
Piotr Jastrzebski
c5dda1c0c9 sstables: add test for continuous_data_consumer::read_unsigned_vint
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 21:14:34 +02:00
Piotr Jastrzebski
fdad8eba97 buffer_input_stream: make it possible to specify chunk size
This makes it possible to force the input stream to return its data
in chunks of a specified size.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 21:11:13 +02:00
Piotr Jastrzebski
4406d11095 Add tests for make_limiting_data_source
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 21:00:35 +02:00
Piotr Jastrzebski
cc6e619aa9 Introduce make_limiting_data_source
This method takes a data_source and returns another data_source
that returns data from the input source but in chunks of limited
size.
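
A minimal sketch of the chunk-limiting idea described above, not the actual
Seastar-based implementation. The `source` alias, the function name
`make_limiting_source` and the "empty string means end of stream" convention
are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for a data source: each call returns the next
// buffer, and an empty string signals end of stream.
using source = std::function<std::string()>;

// Wraps `src` so that no returned buffer is longer than `limit` bytes.
// Larger buffers from `src` are handed out piece by piece.
inline source make_limiting_source(source src, std::size_t limit) {
    assert(limit > 0);  // a zero limit would look like end of stream
    return [src = std::move(src), limit, pending = std::string()]() mutable {
        if (pending.empty()) {
            pending = src();  // refill only when the previous buffer is used up
        }
        const std::size_t n = std::min(limit, pending.size());
        std::string out = pending.substr(0, n);
        pending.erase(0, n);
        return out;
    };
}
```

Wrapping a test input this way forces a consumer to cope with data that
arrives split at arbitrary positions.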

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 20:56:30 +02:00
Piotr Jastrzebski
b68d1fa5bd sstables: add continuous_data_consumer::read_unsigned_vint
This allows reading unsigned variable-length integers from
SSTable format 3.x.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 20:30:10 +02:00
Piotr Jastrzebski
4431c1bbe7 Cover serialized_size_from_first_byte in tests
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 20:26:44 +02:00
Piotr Jastrzebski
e423529077 core: add unsigned_vint::serialized_size_from_first_byte
This method takes the first byte and determines how many bytes
are used to represent an unsigned variable-length integer.
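
A minimal sketch of how such a size can be derived, assuming the encoding
matches the Cassandra 3.x unsigned vint layout, in which the number of leading
1-bits in the first byte equals the number of extra bytes that follow. The
function name `vint_size_from_first_byte` is hypothetical and is not the
signature added by this commit.

```cpp
#include <cassert>
#include <cstdint>

// Returns the total encoded size in bytes (1 to 9), assuming the count of
// leading 1-bits in the first byte gives the number of extra bytes.
inline unsigned vint_size_from_first_byte(uint8_t first_byte) {
    unsigned extra_bytes = 0;
    // Walk from the most significant bit down while the bits are set.
    while (extra_bytes < 8 && (first_byte & (0x80u >> extra_bytes))) {
        ++extra_bytes;
    }
    const unsigned size = 1 + extra_bytes;
    assert(size >= 1 && size <= 9);
    return size;
}
```

Knowing the size up front lets a consumer decide whether the whole integer is
already in its current buffer or whether it has to wait for more data.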

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 20:12:03 +02:00
Botond Dénes
07fb2e9c4d make_foreign_reader: don't wrap local readers
If the to-be-wrapped reader is local (lives on the same shard where
make_foreign_reader() is called) there is no need to wrap it with
foreign_reader. Just return it as is.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <886ed883b707f163603a40b56b8823f2bb6c47c6.1523873224.git.bdenes@scylladb.com>
2018-04-16 15:11:20 +03:00
Piotr Jastrzebski
20705c4536 sstables: add all dependant headers to consumer.hh
Previously it depended on byteorder.hh, which just happened
to be included in all compilation units that used consumer.hh.
This change makes the header compile when used in new compilation units.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 11:02:49 +02:00
Piotr Jastrzebski
9288074d02 sstables: add all dependant headers to exceptions.hh
Previously it depended on print.hh, which just happened
to be included in all compilation units that used
exceptions.hh. This change makes the header compile
when used in new compilation units.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 11:02:33 +02:00
Avi Kivity
7c01e66d53 cql3: query_processor: store and use just local shard reference of storage_proxy
Since storage_proxy provides access to the entire cluster, a local shard
reference is sufficient.  Adjust query_processor to store a reference to
just the local shard, rather than a seastar::sharded<storage_proxy> and
adjust callers.

This simplifies the code a little.
Message-Id: <20180415142656.25370-3-avi@scylladb.com>
2018-04-16 10:20:50 +02:00
Avi Kivity
f7b102238a cql3: change cql_statement methods to accept a local storage_proxy
The storage_proxy represents the entire cluster, so there's never a need
to access it on a remote shard; the local shard instance will contact
remote shard or remote nodes as needed.

Simplify the API by passing storage_proxy references instead of
seastar::sharded<storage_proxy> references. query_processor and
other callers are adjusted to call seastar::sharded::local() first.
Message-Id: <20180415142656.25370-2-avi@scylladb.com>
2018-04-16 10:18:28 +02:00
Avi Kivity
52882d1bd9 dist: debian: try harder to set the target distribution
build_deb.sh relies on pbuilder picking up a ~/.pbuilderrc which we
copy from the script. According to the pbuilder manual, "~" will refer
to the root directory (since pbuilder is run via sudo). In practice
we've observed this working with "~" referring to the current user's
home directory, but also sometimes failing, while complaining
about /root/.pbuilderrc failing. When it fails, it fails to set
the correct distribution.

To be extra sure, also copy .pbuilderrc to root's home directory. This
way, whatever behavior pbuilder chooses to follow, it will have a
configuration file to read.
Message-Id: <20180410134508.9415-1-avi@scylladb.com>
2018-04-16 10:10:47 +02:00
Avi Kivity
e0545cd2ad Merge seastar upstream
* seastar 2da7d46...1bb44ac (7):
  > doc: exclude non-API paths and symbols
  > docs: move detailed descriptions to top of page
  > doc: add default layout file
  > Merge "Misc fixes for io_tester" from Glauber
  > Merge RPC template cleanup from Gleb
  > Revert "Merge rpc template cleanup from Gleb"
  > Merge rpc template cleanup from Gleb
2018-04-15 15:48:50 +03:00
Daniel Fiala
a3533a62ba Allow /upload to be at the end of a path for sstable file
The patch fixes a bug introduced by commit
089b54f2d2.

When sstable files are stored in the .../upload directory
and a refresh is initiated with `nodetool`, it fails
because Scylla doesn't expect .../upload to be a part of the path.

Fixes #3334.

Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20180413132019.17779-1-daniel@scylladb.com>
2018-04-14 15:25:55 +03:00
Piotr Sarna
5a6fcebed6 cql3: add toJson function
This commit extends JSON support with toJson() function,
which can be used in SELECT clause to transform a single argument
to JSON form.

toJson() accepts any type including nested collection types,
so instead of being declared with concrete types,
proper toJson() instances are generated during calls.

This commit also supplements JSON CQL query tests with toJson calls.

Finally, it refactors JSON tests so they use do_with_cql_env_thread.

References #2058

Message-Id: <a7833650428e9ef590765a14e91c4d42532588f4.1523528698.git.sarna@scylladb.com>
2018-04-14 15:23:47 +03:00
Gleb Natapov
1a9aaece3e cql_server: fix a race between closing of a connection and notifier registration
There is a race between cql connection closure and notifier
registration. If a connection is closed before notification registration
is complete, a stale pointer to the connection will remain in the
notification list, since the attempt to unregister the connection will
happen too early. The fix is to move notifier unregistration after the
connection's gate is closed, which ensures that there is no outstanding
registration request. But this means that a connection with a closed gate
can now be in the notifier list, so with_gate() may throw and abort a
notifier loop. Fix that by replacing with_gate() with a call to is_closed().

Fixes: #3355
Tests: unit(release)

Message-Id: <20180412134744.GB22593@scylladb.com>
2018-04-12 16:56:50 +03:00
Raphael S. Carvalho
0c72781939 sstables/twcs: add support to millisecond timestamp resolution
That's blocking KairosDB users because it uses TWCS with millisecond
timestamp resolution.

Also older drivers use millisecond instead of the default microsecond.

Fixes #3152.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180411171244.19958-1-raphaelsc@scylladb.com>
2018-04-12 12:46:52 +03:00
Glauber Costa
98d784aba7 sstables: correctly calculate number of bits in filter
In my well intentioned attempt to use fewer magic numbers in the loading
code I replaced "64" with something calculated automatically from the
type being used.

Except I did it wrong, because sizeof(uint64_t) is 8, not 64.
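
A minimal sketch of the distinction, not the code from this patch: `sizeof`
yields a size in bytes, so the number of bits per storage word has to be
derived separately. The helper names `bits_per_word` and `filter_bits` are
hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Number of value bits in one storage word. For uint64_t this is 64,
// whereas sizeof(uint64_t) is a byte count.
template <typename Word>
constexpr std::size_t bits_per_word() {
    return std::numeric_limits<Word>::digits;
}

static_assert(bits_per_word<uint64_t>() == 64, "uint64_t holds 64 bits");

// Total number of bits stored in `word_count` 64-bit words.
inline std::size_t filter_bits(std::size_t word_count) {
    constexpr std::size_t per_word = bits_per_word<uint64_t>();
    assert(word_count <= std::numeric_limits<std::size_t>::max() / per_word);
    return word_count * per_word;
}
```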

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180411155903.27665-1-glauber@scylladb.com>
2018-04-11 19:03:30 +03:00
Avi Kivity
dc0c458c12 Merge "First series on JSON support in CQL" from Piotr
"
This series introduces 'SELECT JSON' clause support for CQL.
Things implemented:
 * expanding CQL grammar with JSON keyword
 * converting values to JSON format
 * serving 'SELECT JSON *' clauses
 * tests for 'SELECT JSON'
"

* 'json_ops' of https://github.com/psarna/scylla:
  tests: add cql unit tests for SELECT JSON
  cql3: Add JSON token to CQL grammar
  cql3: add support for SELECT JSON clause
  cql3: add to_json_string function to types
2018-04-11 18:26:53 +03:00
Piotr Sarna
fa66e64c24 tests: add cql unit tests for SELECT JSON
This commit adds tests for SELECT JSON clause,
which is expected to return rows in JSON format.

References #2058
2018-04-11 17:12:21 +02:00
Piotr Sarna
1b6e3ccd2b cql3: Add JSON token to CQL grammar
This commit adds JSON keyword to CQL grammar and allows parsing
'SELECT JSON' command in CQL. Additionally, it will be useful
in implementing 'INSERT JSON(...)'.

References #2058
2018-04-11 17:12:21 +02:00
Piotr Sarna
15545da572 cql3: add support for SELECT JSON clause
This commit adds the implementation of SELECT JSON clause
which returns rows in JSON format. Each returned row has a single
'[json]' column.

References #2058
2018-04-11 17:12:02 +02:00
Avi Kivity
2d126a79b5 Merge "Multishard combined reader" from Botond
"
The multishard combined reader provides a convenient
flat_mutation_reader implementation that takes care of efficiently
reading a range from all shards that own data belonging to the range.
All this happens transparently, the user of the reader need only pass a
factory function to the multishard reader which it uses to create
remote readers when needed. These remote readers will then be managed
through foreign reader which abstracts away the fact that the reader is
located on a remote shard.
Sub readers are created for the entire read range, meaning they are free
to cross shard-range limits to fill their buffer. The output of these
sub readers is merged in a round-robin manner, the same way data is
distributed among shards. The multishard reader will move to the next
shard's reader whenever it encounters a partition whose token is after
the delimiter token.
To improve throughput and latency two levels of read-ahead are employed.
One in foreign_reader, which will try to fill the remote shard reader's
buffer in the background, in parallel to processing the results on the
local shard. And one in the multishard reader itself which will
exponentially increase concurrency whenever a sub-reader's buffer
becomes empty. But only if this happened after crossing a shard
boundary. This is important because there is no point in increasing
concurrency if a single sub reader can fill the multishard reader's
buffer.
"

* 'multishard-reader/v3' of https://github.com/denesb/scylla:
  Add unit tests for multishard_combined_reader
  Add multishard_combined_reader
  flat_mutation_reader: add peek_buffer()
  Add unit tests for foreign_reader
  forwardable reader: implement fast_forward_to(position_in_partition)
  Add foreign_reader
  flat_mutation_reader: add detach_buffer()
2018-04-11 18:03:35 +03:00
Glauber Costa
c93bc6b853 sstables: don't rely on parameter evaluation order
Asias reported in issue #3351 that a floating point exception was seen
while loading SSTables. Looking at the trace, that seems to be because
we tried to issue a modulo operation with something that was likely 0.

That field comes from the nr_bits attribute in the large bitset, and our
current code should set it to whatever we read from the Filter file -
something that has been working for ages.

The difference is that after the patch that Asias identified as culprit,
we are moving the array from which we compute the size in the same
parameter list where we are computing the size.

This works for me and passed all my tests - likely because my compiler
was doing left-to-right evaluation as I would expect it to do. But the
standard doesn't guarantee that at all, and it reads:

"Order of evaluation of the operands of almost all C++ operators
(including the order of evaluation of function arguments in a
function-call expression and the order of evaluation of the
subexpressions within any expression) is unspecified. The compiler can
evaluate operands in any order, and may choose another order when the
same expression is evaluated again."

This likely fixes the bug, but even if it doesn't we should patch it,
since what we currently have relies on an unspecified evaluation order.
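
A minimal sketch of the hazard and of one way to avoid it, not the code from
this patch. The `bitset_like` type and the `make_bitset` helper are
hypothetical stand-ins; the point is only that the size is read in its own
statement before the container is moved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a bitset that takes ownership of its storage.
struct bitset_like {
    std::size_t nr_bits;
    std::vector<uint64_t> storage;

    bitset_like(std::size_t bits, std::vector<uint64_t> s)
        : nr_bits(bits), storage(std::move(s)) {}
};

// Fragile form (do not use): the by-value parameter may be move-constructed
// from `words` before `words.size()` is evaluated, so the size may be read
// from a moved-from vector (typically 0).
//   bitset_like(words.size() * 64, std::move(words));

// Safer form: the size is computed in a separate statement, which is
// sequenced before the move.
inline bitset_like make_bitset(std::vector<uint64_t> words) {
    const std::size_t nr_bits = words.size() * 64;
    bitset_like result(nr_bits, std::move(words));
    assert(result.nr_bits == result.storage.size() * 64);
    return result;
}
```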

Fixes #3351.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180411144036.24748-1-glauber@scylladb.com>
2018-04-11 18:01:06 +03:00
Daniel Fiala
202bff0b18 database: Remember versions and formats of all temporary TOC files.
The patch fixes a bug introduced by commit 089b54f2d2.
The bug showed up when master was deployed in an attempt to populate
materialised views. The nodes restarted in the middle and were not able
to come back.

The fix is to remember formats and versions of sstables for every generation.

Fixes: #3324.

Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20180410083114.17315-1-daniel@scylladb.com>
2018-04-11 16:47:33 +03:00
Piotr Sarna
399ab1d455 cql3: add to_json_string function to types
This commit adds a 'to_json_string' method which will be used
for converting values to JSON strings. In several cases it's not
sufficient to use 'to_string', e.g. actual strings need to be
surrounded with double quotes.

References #2058
2018-04-11 13:27:56 +02:00
Avi Kivity
4c588de70f tests: apply overprovisioned flag to all tests
Some tests escaped the --overprovisioned flag, causing them to
compete over cpu 0. Add the flag to all tests.
Message-Id: <20180410181606.8341-1-avi@scylladb.com>
2018-04-11 10:48:52 +02:00
Botond Dénes
f931b45dfa test_resources_based_cache_eviction: s/assert/BOOST_REQUIRE_*/
After moving this test into a SEASTAR_THREAD_TEST_CASE we can use the
BOOST_REQUIRE_* macros which have much better diagnostics than simple
assert()s.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <d2faa5db2bc352e6a2dcf09287faed42284c3248.1523432699.git.bdenes@scylladb.com>
2018-04-11 10:55:21 +03:00
Botond Dénes
49128d12cf Move querier_cache_resource_based_eviction test into querier_cache.cc
Turns out do_with_cql_env can be used from within SEASTAR test cases so
no reason to have a separate file for a single test case.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <028a28b7d90a3bc5ed4719ce273da05880133c0e.1523432699.git.bdenes@scylladb.com>
2018-04-11 10:55:19 +03:00
Botond Dénes
ff3982a817 Add unit tests for multishard_combined_reader 2018-04-11 10:03:50 +03:00
Botond Dénes
3a6f397fd0 Add multishard_combined_reader
Takes care of reading a range from all shards that own a subrange in the
range. The read happens sequentially, reading from one shard at a time.
Behind the scenes it uses combined_mutation_reader and foreign_reader,
the former providing the merging logic and the latter taking care of
transferring the output of the remote readers to the local shard.
Readers are created on-demand by a reader-selector implementation that
creates readers for yet unvisited shards as the read progresses.
The read starts with a concurrency of one, that is the reader reads from
a single shard at a time. The concurrency is exponentially increased (to
a maximum of the number of shards) when a reader's buffer is empty after
moving to the next shard. This condition is important as we only want to
increase concurrency for sparse tables that have little data and the
reader has to move between shards often. When concurrency is > 1, the
reader issues background read-aheads to the next shards so that by the
time it needs to move to them they have the data ready.
For dense tables (where we rarely cross shards) we rely on the
foreign_reader to issue sufficient read-aheads on its own to avoid
blocking.
2018-04-11 10:03:47 +03:00
Botond Dénes
94140258d0 flat_mutation_reader: add peek_buffer()
Allows peeking at the next mutation fragment in the buffer. As opposed
to the existing `peek()` it assumes there's at least one fragment in the
buffer. Useful for code that already ensured that the buffer is not
empty and doesn't want to introduce a continuation (via `peek()`).
2018-04-11 09:22:49 +03:00
Botond Dénes
de4a3c8bdb Add unit tests for foreign_reader 2018-04-11 09:22:49 +03:00
Botond Dénes
50b67232e5 forwardable reader: implement fast_forward_to(position_in_partition)
Instead of throwing std::bad_function_call. Needed by the foreign_reader
unit test. Not sure how other tests didn't hit this before as the test
is using `run_mutation_source_tests()`.
2018-04-11 09:22:49 +03:00
Botond Dénes
2c0f8d0586 Add foreign_reader
Local representative of a reader located on a remote shard. Manages the
lifecycle and takes care of seamlessly transferring fragments produced
by the remote reader. Fragments are *copied* between the shards in
batches, a bufferful at a time.
To maximize throughput read-ahead is used. After each fill_buffer() or
fast_forward_to() a read-ahead (a fill_buffer() on the remote reader) is
issued. This read-ahead runs in the background and is brought back to the
foreground on the next fill_buffer() or fast_forward_to() call.
2018-04-11 09:22:45 +03:00
Botond Dénes
334efb4d70 flat_mutation_reader: add detach_buffer()
Allows for detaching the internal buffer of the reader. Enables
convenient transferring of buffered fragments in a single batch but
will force the reader to reallocate its buffer on the next
fill_buffer() call.
Introduced for foreign_reader which favours quick transferring of the
fragments between shards in a single batch, over minimizing allocations,
which can be amortized by background read-aheads.
2018-04-11 09:08:51 +03:00
Piotr Jastrzebski
190cdd27f0 core: add #pragma once to vint-serialization.hh
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-10 20:09:40 +02:00
Raphael S. Carvalho
638a647b7d sstables/compaction_manager: do not break lcs invariant by not allowing parallel compaction for it
After the change to serialize compaction on compaction weight (eff62bc61e),
the LCS invariant may break because parallel compaction can start, and it's
not currently supported for LCS.

The condition is that the weight is deregistered right before the last sstable
of a leveled compaction is sealed, so a new compaction may start for the same
column family in the meantime and promote an sstable to an overlapping token
range.

That leads to the strategy restoring the invariant when it finds the overlap,
and that means wasted resources.
The fix removes a fast path check which is incorrect now that we release the
weight early, and also fixes a check for ongoing compaction which prevented
compaction from starting for LCS whenever the weight tracker was not empty.

Fixes #3279.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180410034538.30486-1-raphaelsc@scylladb.com>
2018-04-10 20:02:08 +03:00
Avi Kivity
fc488adc72 logalloc: remove segment_descriptor::_lsa_managed
_lsa_managed is always 1:1 with _region, so we can remove it, saving
some space in the segment descriptor vector.

Tests: unit (release), logalloc_test (debug)
Message-Id: <20180410122606.10671-1-avi@scylladb.com>
2018-04-10 13:54:38 +01:00
Asias He
d71a94a08b gossip: Add tokens and host_id in add_saved_endpoint
Problem:

   Start node 1 2 3
   Shutdown node2
   Shutdown node1 node3
   Start node1 node3
   Try to replace_address for node 2
   The replace operation fails with the error:
     seastar - Exiting on unhandled exception: std::runtime_error
     (Cannot replace_address node2 because it doesn't exist in gossip)

This is because after all nodes shut down, the other nodes do not have the
tokens and host_id info of node2 until node2 boots up and talks to the cluster.

If node2 cannot boot up for whatever reason, currently the only way to
recover node2 is to `nodetool removenode` and bootstrap node2 again. This will
change tokens in the cluster and cause more data movement than just replacing
node2.

To fix, we add the tokens and host_id gossip application state in add_saved_endpoint
during boot up.

This is pretty safe because the generation for application state added by
add_saved_endpoint is zero; if node2 actually boots, other nodes will update
with node2's version.

Before:
$ curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool

    {
        "addrs": "127.0.0.2",
        "generation": 0,
        "is_alive": false,
        "update_time": 1523344828953,
        "version": 0
    }

Node 2 can not be replaced.

After:
$ curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool

    {
        "addrs": "127.0.0.2",
        "application_state": [
            {
                "application_state": 12,
                "value": "31284090-2557-4036-9367-7bb4ef49c35a",
                "version": 2
            },
            {
                "application_state": 13,
                "value": "... a lot of tokens ...",
                "version": 1
            }
        ],
        "generation": 0,
        "is_alive": false,
        "update_time": 1523344828953,
        "version": 0
    }

Node 2 can be replaced.

Tests: dtest/replace_address_test.py
Fixes: #3347
Message-Id: <117fd6649939e0505847335791be8d7a96e7d273.1523346805.git.asias@scylladb.com>
2018-04-10 13:14:31 +02:00
Piotr Jastrzebski
5cd48407ad test: logalloc_test: Fix build for boost 1.63
Due to https://svn.boost.org/trac10/ticket/12778?replyto=3
BOOST_REQUIRE_NE does not work with nullptr.

Tests: units (release)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <e7158e8a235356fad99560f6fcbecb57615cefe6.1523298193.git.piotr@scylladb.com>
2018-04-10 12:50:22 +03:00
Piotr Jastrzebski
3565820526 sstables: Remove unused mp_row_consumer::skip_partition
The method is never called so we can remove it and
mp_row_consumer::_skip_partition which is set only
by mp_row_consumer::skip_partition

Tests: units (release)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <cae8b09032c58361b7cfb9d02a792cb31b186f5c.1523298605.git.piotr@scylladb.com>
2018-04-10 12:50:05 +03:00
Glauber Costa
b2f9958071 large_bitset: use a chunked_vector internally and simplify API
save and load functions for the large_bitset were introduced by Avi with
d590e327c0.

In that commit, Avi says:

"... providing iterator-based load() and save() methods.  The methods
support partial load/save so that access to very large bitmaps can be
split over multiple tasks."

The only user of this interface is SSTables. And turns out we don't really
split the access like that. What we do instead is to create a chunked vector
and then pass its begin() method with position = 0 and let it write everything.

The problem here is that this requires the chunked vector to be fully
initialized, not just reserved. If the bitmap is large enough, that in itself
can take a long time without yielding (up to 16ms seen in my setup).

We can simplify things considerably by moving the large_bitset to use a
chunked vector internally: it already uses a poor man's version of it
by allocating chunks internally (it predates the chunked_vector).

By doing that, we can turn save() into a simple copy operation, and do
away with load altogether by adding a new constructor that will just
copy an existing chunked_vector.

Fixes #3341
Tests: unit (release)

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180409234726.28219-1-glauber@scylladb.com>
2018-04-10 10:25:06 +03:00
Paweł Dziepak
252c5dfa52 Merge "logalloc: replace zones with segment-at-a-time alloc/free" from Avi
"This patchset removes zones and replaces them with a simpler system. LSA tries
to allocate segments at higher addresses, so that we'll end up with the standard
allocator using lower addresses and LSA using higher addresses, allowing for easier
allocation from std."

* tag 'lsa-no-zones/v6' of https://github.com/avikivity/scylla:
  tests: add logalloc_test for large contiguous allocations in a challenging environemnt
  logalloc: limit std segment allocations in debug mode
  logalloc: introduce prime_segment_pool()
  logalloc: limit non-contiguous reclaims
  logalloc: pre-allocate all memory as lsa on startup
  tests: add random test for dynamic_bitset
  dynamic_bitset: optimize for large sets
  dynamic_bitset: get rid of resize()
  dynamic_bitset: remove find_*_clear() variants
  logalloc: reduce segment size to 128k
  logalloc: get rid of the emergency reserve stack
  logalloc: replace zones with segment-at-a-time alloc/free
2018-04-09 10:30:11 +02:00
Avi Kivity
80651e6dcc database: reduce idle memtable flush cpu shares to 1%
Commit 1671d9c433 (not on any release branch)
accidentally bumped the idle memtable flush cpu shares to 100 (representing
10%), causing flushes to be too when they don't consume too much cpu.

Fixes #3243.
Message-Id: <20180408104601.9607-1-avi@scylladb.com>
2018-04-08 17:12:14 +01:00
Avi Kivity
53d97b1da3 Merge seastar upstream
* seastar 33d8f74...2da7d46 (4):
  > http routes: Add parameters to path when adding alias
  > future: compile-time optimize futurize<void>::apply()
  > memory: remove unneeded union 'pla'
  > queue: not_empty()/not_full() should throw when called after abort
2018-04-08 16:36:45 +03:00
Piotr Jastrzebski
9ad00b8207 data_consume_rows_context: Mark RANGE_TOMBSTONE_5 as nonconsuming
This state does not read any data and is used only to perform
action when finishing to read a primitive type.

According to comment on continuous_data_consumer::non_consuming
such states should be marked as non_consuming.

Tests: units (release)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <55a5c9b76268b50312ecd044291f28dcd8179a22.1523005293.git.piotr@scylladb.com>
2018-04-08 15:16:13 +03:00
Alexys Jacob
d3d736cd87 dist: gentoo: rename prometheus node exporter package
net-analyzer/prometheus-node_exporter got moved to app-metrics/node_exporter
and the service name changed on Gentoo Linux

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180405135605.26146-1-ultrabug@gentoo.org>
2018-04-08 14:11:38 +03:00
Avi Kivity
c3a2471c9e tests: add logalloc_test for large contiguous allocations in a challenging environment
Test large std allocations in an environment that has seen many persistent
std allocations interspersed with lsa allocations, causing memory fragmentation.
2018-04-07 21:04:10 +03:00
Avi Kivity
2c670f6161 logalloc: limit std segment allocations in debug mode
Address Sanitizer has a global limit on the number of allocations
(note: not number of allocations less number of frees, but cumulative
number of allocations). Running some tests in debug mode on a machine
with sufficient memory can break that limit.

Work around that limit by restricting the amount of memory the
debug mode segment_pool can allocate. It's also nicer for running
the test on a workstation.
2018-04-07 21:04:10 +03:00
Avi Kivity
2baa16b371 logalloc: introduce prime_segment_pool()
To segregate std and lsa allocations, we prime the segment pool
during initialization so that lsa will release lower-addressed
memory to std, rather than lsa and std competing for memory at
random addresses.

However, tests often evict all of lsa memory for their own
purposes, which defeats this priming.

Extract the functionality into a new prime_segment_pool()
function for use in tests that rely on allocation segregation.
2018-04-07 14:52:58 +03:00
Avi Kivity
ff6325ee7e logalloc: limit non-contiguous reclaims
We may fail to reclaim because a region has reclaim disabled (usually because
it is in an allocating_section). Failed reclaims can cause high CPU usage
if all of the lower addresses happen to be in a reclaim-disabled region (this
is somewhat mitigated by the fact that checking for reclaim disabled is very
cheap), but worse, failing a segment reclaim can lead to reclaimed memory
being fragmented.  This results in the original allocation continuing to fail.

To combat that, we limit the number of failed reclaims. If we reach the limit,
we fail the reclaim.  The surrounding allocating_section will release the
reclaim_lock, and increase reserves, which will result in reclaim being
retried with all regions being reclaimable, and succeed in allocating
contiguous memory.
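
A minimal sketch of the bounded-failure idea described above, not the logalloc
code itself. The function name try_reclaim, its parameters, and the
representation of segments as a vector of "reclaimable" flags are all
assumptions made for illustration; the real implementation works on regions
and segment descriptors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration: walk segments from the lowest address up and
// try to reclaim `wanted` of them. A segment whose region has reclaim
// disabled counts as a failed attempt. Once `max_failed` failures have been
// seen, give up so that the caller can release its lock and retry later.
bool try_reclaim(const std::vector<bool>& segment_reclaimable,
                 std::size_t wanted, std::size_t max_failed) {
    std::size_t reclaimed = 0;
    std::size_t failed = 0;
    for (bool reclaimable : segment_reclaimable) {
        if (reclaimed == wanted) {
            break;
        }
        if (reclaimable) {
            ++reclaimed;
        } else if (++failed >= max_failed) {
            return false; // too many failed reclaims: fail the whole reclaim
        }
    }
    return reclaimed == wanted;
}
```

Failing early bounds the CPU spent on reclaim-disabled segments and avoids
handing back scattered segments that would not satisfy a contiguous request.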
2018-04-07 14:52:58 +03:00
Avi Kivity
c6c659ce7a logalloc: pre-allocate all memory as lsa on startup
Since lsa tries to keep some non-lsa memory as reserve, we end up
with three blocks of memory: at low addresses, non-lsa memory that was
allocated during startup or subsequently freed by lsa; at middle addresses,
lsa; and at the top addresses, memory that lsa left alone during initial
cache population due to the reserve.

After time passes, both std and lsa will allocate from the top section,
causing a mix of lsa and non-lsa memory. Since lsa tries to free from
lower addresses, this mix will stay there forever, increasing fragmentation.

Fix that by disabling the reserve during startup and allocating all of memory
for lsa. Any further allocation will then have to be satisfied by lsa first
freeing memory from the low addresses, so we will now have just two sections
of memory: low addresses for std, and top addresses for lsa.

Note that this startup allocation does not page in lsa segments, since the
segment constructor does not touch memory.
2018-04-07 14:52:58 +03:00
Avi Kivity
413bf34fbd tests: add random test for dynamic_bitset
Compare against vector<bool> as a reference.
2018-04-07 14:52:58 +03:00
Avi Kivity
ff52767ec9 dynamic_bitset: optimize for large sets
Add 1:64 summary bitmaps so that searching for set bits is O(log n)
instead of O(n).
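
The summary-bitmap idea can be sketched as follows. This is an illustrative
class, not Scylla's dynamic_bitset: the name summarized_bitset, its interface,
and the portable lowest_set() helper are assumptions. Each bit of a summary
level records whether the corresponding 64-bit word of the level below is
non-zero, so a search descends one word per level.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration of 1:64 summary bitmaps.
class summarized_bitset {
    // _levels[0] holds the actual bits. Bit i of _levels[k + 1] is set
    // iff word i of _levels[k] is non-zero. The top level is one word.
    std::vector<std::vector<uint64_t>> _levels;

    static unsigned lowest_set(uint64_t w) { // precondition: w != 0
        unsigned n = 0;
        while (!(w & 1)) {
            w >>= 1;
            ++n;
        }
        return n;
    }
public:
    static constexpr std::size_t npos = static_cast<std::size_t>(-1);

    explicit summarized_bitset(std::size_t nbits) {
        std::size_t words = (nbits + 63) / 64;
        if (words == 0) {
            words = 1;
        }
        while (true) {
            _levels.emplace_back(words, 0);
            if (words == 1) {
                break;
            }
            words = (words + 63) / 64;
        }
    }

    void set(std::size_t bit) {
        for (auto& level : _levels) {
            level[bit / 64] |= uint64_t(1) << (bit % 64);
            bit /= 64; // index of the word just touched = bit in next level
        }
    }

    void clear(std::size_t bit) {
        for (auto& level : _levels) {
            uint64_t& word = level[bit / 64];
            word &= ~(uint64_t(1) << (bit % 64));
            if (word != 0) {
                break; // word still non-zero: summaries above are unchanged
            }
            bit /= 64;
        }
    }

    // Descends from the top summary word to the data level.
    std::size_t find_first_set() const {
        std::size_t idx = 0;
        for (std::size_t k = _levels.size(); k-- > 0;) {
            uint64_t word = _levels[k][idx];
            if (word == 0) {
                return npos; // can only happen at the top level
            }
            idx = idx * 64 + lowest_set(word);
        }
        return idx;
    }
};
```

With 100000 bits this sketch uses three levels (1563, 25 and 1 words), so
find_first_set() inspects three words instead of scanning up to 1563.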
2018-04-07 14:52:58 +03:00
Avi Kivity
14510ae986 dynamic_bitset: get rid of resize()
Makes it easier to modify later on. Maybe "dynamic" is not so justified now.
2018-04-07 14:52:58 +03:00
Avi Kivity
f219ae1275 dynamic_bitset: remove find_*_clear() variants
They are no longer used, and cannot be efficiently implemented
for large bitsets using a summary vector approach without slowing
down the find_*_set() variants, which are used.

Also remove find_previous_set() for the same reason.
2018-04-07 14:52:58 +03:00
Avi Kivity
54db0f3d30 logalloc: reduce segment size to 128k
Reducing the segment size reduces the time needed to compact segments,
and increases the number of segments that can be compacted (and so
the probability of finding low-occupancy segments).

128k is the size of I/O buffers and of thread stacks, so we can't
go lower than that without more significant changes.
2018-04-07 14:52:58 +03:00
Avi Kivity
3f17dbfcbc logalloc: get rid of the emergency reserve stack
Instead of keeping specific segments in the emergency reserve,
just keep the number of segments in the reserve. This simplifies the
code considerably.
2018-04-07 14:52:55 +03:00
Avi Kivity
fa73d844e9 logalloc: replace zones with segment-at-a-time alloc/free
This patch replaces the zones mechanism with something simpler: a
single segment is moved from the standard allocator to lsa and vice
versa, at a time. Fragmentation resistance is (hopefully) achieved
by having lsa prefer high addresses for lsa data, and return segments
at low address to the standard allocator. Over time, the two will move
apart.

Moving just one segment at a time reduces the latency costs of
transferring memory between free and std.
2018-04-07 13:48:40 +03:00
Piotr Sarna
a5b6047ffa cql3: add row-wise read statistics
Database read metrics are now extended with the total number of rows read,
exported through cql_rows_read field.

Closes #3146
Message-Id: <02f0816c509f3d7fea06da22869eea61548284e2.1522919708.git.sarna@scylladb.com>
2018-04-05 13:39:08 +03:00
Paweł Dziepak
67aaaefde7 Merge "api: type-erase more of the column_family API" from Avi
"Together with the already merged patch, we reduce the object file
from 114MB to 81MB."

* tag 'api-diet-1/v1' of https://github.com/avikivity/scylla:
  api: type-erase all-column_family map_reduce variant
  api: simplify 6-argument map_reduce_cf() variant
2018-04-05 11:07:17 +02:00
Botond Dénes
3c078d2554 forwardable reader: pass down timeout in fast_forward_to()
The `const dht::partition_range&` overload to be more precise. The
timeout wasn't passed to the underlying reader. Spotted during test
debugging.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <39c02a55196d923bd0af8e6be6f0baa578cba070.1522915463.git.bdenes@scylladb.com>
2018-04-05 11:43:21 +03:00
Avi Kivity
1fa8682412 Merge seastar upstream
* seastar 7328d17...33d8f74 (3):
  > memory: switch to buddy allocation
  > tls: Ensure we always pass through semaphores on shutdown
  > memory: replace placement-new in unions with member construction

See scylladb/seastar#426.
2018-04-05 11:12:30 +03:00
Raphael S. Carvalho
30b6c9b4cd database: make sure sstable is also forwarded to shard responsible for its generation
After f59f423f3c, sstable is loaded only at shards
that own it so as to reduce the sstable load overhead.

The problem is that a sstable may no longer be forwarded to a shard that needs to
be aware of its existence which would result in that sstable generation being
reallocated for a write request.
That would result in a failure as follows:
"SSTable write failed due to existence of TOC file for generation..."

This can be fixed by forwarding any sstable at load to all its owner shards
*and* the shard responsible for its generation, which is determined as follows:
s = generation % smp::count
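
As a rough sketch of the rule above (not the actual database code), the set of
shards to forward an sstable to is its owner shards plus generation modulo the
shard count. The function name shards_to_notify and its parameters are
hypothetical; shard_count stands in for smp::count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical illustration: return the owner shards plus the shard
// responsible for the sstable's generation, sorted and without duplicates.
std::vector<unsigned> shards_to_notify(std::vector<unsigned> owner_shards,
                                       int64_t generation,
                                       unsigned shard_count) {
    // s = generation % smp::count, as stated in the commit message.
    auto generation_shard = static_cast<unsigned>(generation % shard_count);
    owner_shards.push_back(generation_shard);
    std::sort(owner_shards.begin(), owner_shards.end());
    owner_shards.erase(std::unique(owner_shards.begin(), owner_shards.end()),
                       owner_shards.end());
    return owner_shards;
}
```

For example, with 4 shards an sstable of generation 13 owned by shards 0 and 2
would also be forwarded to shard 1.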

Fixes #3273.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180405035245.30194-1-raphaelsc@scylladb.com>
2018-04-05 10:58:05 +03:00
Tzach Livyatan
58e47fa0b3 docs/docker: Fix and add links to Scylla docs
- Fix link for reporting a Scylla problem
- Add a link to Best Practices for Running Scylla on Docker

Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20180404065129.16776-1-tzach@scylladb.com>
2018-04-04 10:52:04 +03:00
Piotr Sarna
ae3265f905 cql_server: use handle_exception for failed accepts
Follows up "cql_server: replace recursion in do_accepts with repeat".
Failed accepts are now handled with handle_exception routine
instead of generic then_wrapped.
Message-Id: <db820a674100ae57f3acc7b49ebae57d0c2bdbb8.1522785444.git.sarna@scylladb.com>
2018-04-03 21:34:46 +01:00
Piotr Sarna
b298bb2f7a cql_server: replace recursion in do_accepts with repeat
Recursion in do_accepts function is now replaced with
repeat utility.

Fixes #2467

Message-Id: <07d6da60726fc3ecc06139309b9716180e8accf7.1522777060.git.sarna@scylladb.com>
2018-04-03 21:23:11 +03:00
Avi Kivity
9cef37e643 Merge "db/view: View building fixes" from Duarte
"
Fixes to the view building process, discovered from field experience.

Tests: dtest(materialized_view_tests.py, smp=2)
"

* 'views/view-build-fixes/v1' of https://github.com/duarten/scylla:
  db/view: Start view building after schema agreement
  db/system_keyspace: scylla_views_builds_in_progress writes are user mem
  db/view: Require configuration option to enable view building
2018-04-03 17:42:21 +03:00
Duarte Nunes
b84bbfc51d tests/view_schema_test: Test empty partition key entries are rejected
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180403122244.10626-2-duarte@scylladb.com>
2018-04-03 15:25:53 +03:00
Duarte Nunes
ec8960df45 db/view: Reject view entries with non-composite, empty partition key
Empty partition keys are not supported on normal tables - they cannot
be inserted or queried (surprisingly, the rules for composite
partition keys are different: all components are then allowed to be
empty). However, the (non-composite) partition key of a view could end
up being empty if that column is: a base table regular column, a
base table clustering key column, or a base table partition key column,
part of a composite key.

Fixes #3262
Refs CASSANDRA-14345

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180403122244.10626-1-duarte@scylladb.com>
2018-04-03 15:25:52 +03:00
Duarte Nunes
d4db043f03 db/view: Start view building after schema agreement
If a base table or view has been dropped in one node, but another
one hasn't yet learned about it, it starts the view build process
immediately on boot, possibly calculating unneeded view updates and
causing errors at the view replica, if that replica has already
processed the schema changes. We should thus wait for schema
agreement, even if the node is a seed.

Fixes #3328

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-03 13:16:28 +01:00
Duarte Nunes
75bb66a50d db/system_keyspace: scylla_views_builds_in_progress writes are user mem
Treat writes to scylla_views_builds_in_progress as user memory, as the
number of writes is dependent on the amount of user data on views
(times the number of views, divided by the view building batch size).

Fixes #3325

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-03 13:16:28 +01:00
Duarte Nunes
bf5045c7eb db/view: Require configuration option to enable view building
View building, enabled by default, can contain or expose issues that
prevent the node from starting. In those cases, it is necessary to
disable view building such that the node can be submitted to
maintenance operations.

Fixes #3329

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-03 13:16:28 +01:00
Avi Kivity
6c35db2c44 api: type-erase all-column_family map_reduce variant
Encapsulate the map_reduce parameters in type-erased
std::function, as well as the iterator-on-all-column-families
logic. Reduces binary size by 18%.
2018-04-03 13:08:22 +03:00
Avi Kivity
0ade558999 api: simplify 6-argument map_reduce_cf() variant
The 6-argument map_reduce_cf function is identical to the 5-argument
version, except that it performs an extra cast (by calling
the 6th argument's operator=()).

Simplify the code by calling the 5-argument version from the 6-argument
version.

Reduces binary size by ~10%.
2018-04-03 12:22:14 +03:00
Duarte Nunes
11ece46f14 db/view: Remove leftover debug statement
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180402175238.5528-1-duarte@scylladb.com>
2018-04-03 09:41:33 +01:00
Avi Kivity
cadd983856 api: type-erase map_reduce_cf()
map_reduce_cf() is called with varying template parameters which each
have to be compiled separately. Unifying the internals to use types based
on std::any reduced the object size by 15% (115MB->99MB) with presumably
a commensurate decrease in compile time.

A version that used "I" instead of "std::any" (and thus merged the
internals only for callers that used the same result type) delivered
a 10% decrease in object size.  While std::any is less safe, in this
case it is completely encapsulated.
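
The technique can be sketched in isolation as below. This is not the api code:
map_reduce_erased, map_reduce, and the use of a vector of int as the input are
assumptions for illustration. The loop is compiled once on std::any, and only
a thin typed wrapper is instantiated per caller.

```cpp
#include <any>
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical illustration. Non-template core: compiled exactly once,
// regardless of how many result types the callers use.
std::any map_reduce_erased(
        const std::vector<int>& items, std::any init,
        const std::function<std::any(int)>& mapper,
        const std::function<std::any(std::any, std::any)>& reducer) {
    std::any acc = std::move(init);
    for (int item : items) {
        acc = reducer(std::move(acc), mapper(item));
    }
    return acc;
}

// Thin typed wrapper: the only part instantiated per call site. The
// std::any values never escape it, so the casts cannot observe a wrong type.
template <typename Result, typename Mapper, typename Reducer>
Result map_reduce(const std::vector<int>& items, Result init,
                  Mapper mapper, Reducer reducer) {
    std::any out = map_reduce_erased(
        items, std::any(std::move(init)),
        [&](int item) { return std::any(Result(mapper(item))); },
        [&](std::any a, std::any b) {
            return std::any(Result(reducer(
                std::any_cast<Result>(std::move(a)),
                std::any_cast<Result>(std::move(b)))));
        });
    return std::any_cast<Result>(std::move(out));
}
```

The trade-off is an indirect call and a possible allocation per element in
exchange for less generated code.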
Message-Id: <20180402213732.432-1-avi@scylladb.com>
2018-04-03 09:31:04 +01:00
Avi Kivity
ffcdcd6d16 tests: logalloc_test: relax test_large_allocation
test_large_allocation attempts to allocate almost half of memory.
With a buddy allocator, even if more than half of memory is free,
and even if it is contiguous, it is unlikely to be available as a
single allocation because the allocator inserts boundaries at powers-
of-two addresses.

Relax the test by allocating smaller chunks (but still the same amount,
and still with challenging sizes); allocating half of memory contiguously
is not a goal.

Also use a vector instead of a deque, and reserve it, so we don't get
intervening non-lsa allocations. I'm not sure there's a problem there
but let's not depend on the allocation patterns.
Message-Id: <20180401150828.13921-1-avi@scylladb.com>
2018-04-02 19:23:06 +01:00
Avi Kivity
7ab52947dc conf: define named_value<log_level> externally
While building with -O1, I saw that the linker could not find
the vtable for named_value<log_level>. Rather than fixing up the
includes (and likely lengthening build time), fix by defining
the class as an extern template, preventing it from being
instantiated at the call site.
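
A single-file sketch of the extern template mechanism, with a made-up class
named_value_sketch and int standing in for log_level. In a real code base the
declaration would sit in the header and the explicit instantiation definition
in exactly one .cc file.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for a config value template with a vtable.
template <typename T>
struct named_value_sketch {
    T value;
    explicit named_value_sketch(T v) : value(std::move(v)) {}
    virtual ~named_value_sketch() = default;
    virtual const T& get() const { return value; }
};

// Header side: an explicit instantiation declaration. Translation units
// that see it do not implicitly instantiate the members for int.
extern template struct named_value_sketch<int>;

// Source side (one translation unit only): the explicit instantiation
// definition, which emits the members and the vtable in one place.
template struct named_value_sketch<int>;
```

Users then write named_value_sketch<int> v(3); as usual, and the linker
resolves the members against the single instantiated copy.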
Message-Id: <20180401150235.13451-1-avi@scylladb.com>
2018-04-02 19:23:06 +01:00
Avi Kivity
3964fd0be2 client_state: initialize _remote_addr for internal queries
-O1 complains that client_state::_remote_addr is not initialized
(and it is right). The call site is tracing, which likely won't be
invoked for internal queries, but still.
Message-Id: <20180401150410.13651-1-avi@scylladb.com>
2018-04-02 19:23:06 +01:00
Avi Kivity
2edf36f863 bytes: don't allocate NUL terminator
Since bytes is used to encapsulate blobs, not strings, there's no
need for a NUL terminator. It will never be passed to a function
that expects a C string.
Message-Id: <20180401151009.14108-1-avi@scylladb.com>
2018-04-02 19:23:06 +01:00
Duarte Nunes
abe8bbe7b5 Merge seastar upstream
* seastar a66cc34...7328d17 (5):
  > sstring: add support for non-nul-terminated sstrings
  > core/sharded: Make async_sharded_service dtor virtual
  > reactor: pass naked pointer to submit_io
  > Merge http: "Add alias support to the API" from Amnon
  > systemwide_memory_barrier: use madvise(MADV_DONTNEED) instead of mprotect()

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-02 19:23:06 +01:00
Glauber Costa
ef84780c27 docker: default docker to overprovisioned mode.
By default, overprovisioned is not enabled on docker unless it is
explicitly set. I have come to believe that this is a mistake.

If the user is running alone in the machine, and there are no other
processes pinned anywhere - including interrupts - not running
overprovisioned is the best choice.

But everywhere else, it is not: even if a user runs 2 docker containers
in the same machine and statically partitions CPUs with --smp (but
without cpuset) the docker containers will pin themselves to the same
sets of CPU, as they are totally unaware of each other.

It is also very common, especially in some virtualized environments, for
interrupts not to be properly distributed - being particularly keen on
being delivered on CPU0, a CPU which Scylla will pin by default.

Lastly, environments like Kubernetes simply don't support pinning at the
moment.

This patch enables the overprovisioned flag if it is explicitly set -
like we did before - but also by default unless --cpuset is set.
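
The resulting rule can be written as a one-line predicate. This is only a
sketch of the decision described above; the function and parameter names are
hypothetical and do not correspond to the docker entrypoint code.

```cpp
#include <cassert>

// Hypothetical illustration: overprovisioned mode is used when it was
// requested explicitly, or by default whenever no --cpuset was given.
bool use_overprovisioned(bool overprovisioned_requested, bool cpuset_given) {
    return overprovisioned_requested || !cpuset_given;
}
```

Only a container started with --cpuset and without an explicit overprovisioned
request ends up in pinned, non-overprovisioned mode.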

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180331142131.842-1-glauber@scylladb.com>
2018-04-01 09:17:20 +03:00
Takuya ASADA
95129c4b12 dist/ami: point wiki page when variables.json
Since there's no document for build_ami.sh on this repo, point to wiki page.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521710239-9687-1-git-send-email-syuu@scylladb.com>
2018-03-29 18:54:42 +03:00
Glauber Costa
a9ef72537f parse and ignore background writer controller
Unused options are not exposed as command line options and will prevent
Scylla from booting when present, although they can still be passed over
YAML, for Cassandra compatibility.

That has never been a problem, but we have been adding options to i3
(and others) that are now deprecated, but were previously marked as
Used. Systems with those options may have issues upgrading.

While this problem is common to all Unused options, the likelihood for
any other unused option to appear in the command line is near zero,
except for those two - since we put them there ourselves.

There are two ways to handle this issue:

1) Mark them as Used, and just ignore them.
2) Add them explicitly to boost program options, and then ignore them.

The second option is preferred here, because we can add them as hidden
options in program_options, meaning they won't show up in the help. We
can then just print a discreet message saying that those options are,
from now on, ignored.

v2: mark set as const (Botond)
v3: rebase on top of master, indentation suggested by Duarte.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180329145517.8462-1-glauber@scylladb.com>
2018-03-29 17:57:30 +03:00
Avi Kivity
c9aa9f0d86 Revert "logalloc: capture current scheduling group for deferring function"
This reverts commit 3b53f922a3. It's broken
in two ways:

 1. concrete_allocating_function::allocate()'s caller,
    region_group::start_releaser() loop, will delete the object
    as soon as it returns; however we scheduled some work depending
    on `this` in a separate continuation (via with_scheduling_group())
 2. the calling loop's termination condition depends on the work being
    done immediately, not later.
2018-03-29 16:08:12 +03:00
Vladimir Krivopalov
3a9cb54c76 Merge the pair of index_readers into just one tracking a range.
Historically, we had two index_readers per sstable_mutation_reader,
one for the lower bound and one for the upper bound. Most of public
members of the index_reader class were only called on either of those.
With the changes introduced in #2981, two readers are even more tied
together as they now have a shared-per-pair list of index pages that
needs proper cleanup and was protruding woefully into the caller code.

This fix re-structures index_reader so that it now keeps track of both
lower and upper bounds. The shared_index_lists structure is encapsulated
within index_reader and becomes an internal detail rather than a
liability.

Fixes #3220.

Tests: unit (debug, release)
+
Tested using cassandra-stress commands from #3189.

perf_fast_forward results indicate there is no performance degradation
caused by this fix.

=========================== Baseline ===================================
running: large-partition-skips
Testing scanning large partition with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1       0         0.494458   1000000    2022418   1018     126960      27       0        0        0        0        0        0        0  97.6%
1       1         1.754717    500000     284946    997     127064       6       0        0        3        3        0        0        0  99.9%
1       8         0.551664    111112     201413    997     127064       6       0        0        3        3        0        0        0  99.7%
1       16        0.383888     58824     153232   1001     127080      10       0        0        5        5        0        0        0  99.5%
1       32        0.289073     30304     104832    997     127064      28       0        0        3        3        0        0        0  99.3%
1       64        0.236963     15385      64926    997     127064     122       0        0        3        3        0        0        0  99.2%
1       256       0.172901      3892      22510    997     127064     217       0        0        3        3        0        0        0  95.5%
1       1024      0.117570       976       8301    997     127064     235       0        0        3        3        0        0        0  49.0%
1       4096      0.085811       245       2855    664      27172     375     274        0        3        3        0        0        0  21.4%
64      1         0.512781    984616    1920149   1142     127064     139       0        0        3        3        0        0        0  98.7%
64      8         0.479232    888896    1854833   1001     127080      10       0        0        5        5        0        0        0  99.6%
64      16        0.451193    800000    1773078    997     127064       6       0        0        3        3        0        0        0  99.6%
64      32        0.408684    666688    1631305    997     127064       6       0        0        3        3        0        0        0  99.5%
64      64        0.351906    500032    1420924    997     127064      14       0        0        3        3        0        0        0  99.5%
64      256       0.227008    200000     881026    997     127064     211       0        0        3        3        0        0        0  99.1%
64      1024      0.125803     58880     468032    997     127064     290       0        0        3        3        0        0        0  65.1%
64      4096      0.098155     15424     157139    703      27856     401     267        0        3        3        0        0        0  25.8%

running: large-partition-slicing
Testing slicing of large partition:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000701         1       1427      9        296       6       4        0        3        3        0        0        0  12.4%
0       32        0.000698        32      45827      9        296       6       3        0        3        3        0        0        0  13.9%
0       256       0.000808       256     316920     10        328       6       3        0        3        3        0        0        0  24.9%
0       4096      0.004368      4096     937697     25        808      14       3        0        3        3        0        0        0  45.9%
500000  1         0.001196         1        836     13        412       9       4        0        3        3        0        0        0  22.7%
500000  32        0.001200        32      26664     13        412       9       4        0        3        3        0        0        0  22.2%
500000  256       0.001503       256     170338     14        444      10       4        0        3        3        0        0        0  25.3%
500000  4096      0.004351      4096     941465     30        956      20       4        0        3        3        0        0        0  50.7%

running: large-partition-slicing-clustering-keys
Testing slicing of large partition using clustering keys:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000625         1       1601      7        176       6       0        0        3        3        0        0        0  23.2%
0       32        0.000604        32      53016      7        176       6       0        0        3        3        0        0        0  24.7%
0       256       0.000695       256     368498      8        180       6       0        0        3        3        0        0        0  36.4%
0       4096      0.004083      4096    1003106     20        692      12       1        0        3        3        0        0        0  47.0%
500000  1         0.001198         1        835     12        516       9       3        0        3        3        0        0        0  22.8%
500000  32        0.000981        32      32631     12        388       9       3        0        3        3        0        0        0  29.2%
500000  256       0.001320       256     194011     13        384      10       3        0        3        3        0        0        0  29.0%
500000  4096      0.003944      4096    1038567     25        840      17       2        0        3        3        0        0        0  52.2%

running: large-partition-slicing-single-key-reader
Testing slicing of large partition, single-partition reader:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000849         1       1178      9        488       6       0        0        3        3        0        0        0  16.5%
0       32        0.000661        32      48415      9        296       6       0        0        3        3        0        0        0  22.2%
0       256       0.000756       256     338648     10        328       6       0        0        3        3        0        0        0  33.3%
0       4096      0.004147      4096     987610     22        840      12       1        0        3        3        0        0        0  47.9%
500000  1         0.001041         1        960     13        476       9       3        0        3        3        0        0        0  25.9%
500000  32        0.001020        32      31375     13        412       9       3        0        3        3        0        0        0  29.1%
500000  256       0.001265       256     202373     14        444      10       3        0        3        3        0        0        0  32.0%
500000  4096      0.004121      4096     994014     30        988      18       3        0        3        3        0        0        0  52.7%

running: large-partition-select-few-rows
Testing selecting few rows from a large partition:
stride  rows      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1000000 1         0.000668         1       1498      9        296       6       4        0        3        3        0        0        0  19.8%
500000  2         0.000976         2       2048     13        412       9       4        0        3        3        0        0        0  29.0%
250000  4         0.001408         4       2842     18        572      12       6        0        3        3        0        0        0  28.8%
125000  8         0.002004         8       3993     29        912      19      10        0        3        3        0        0        0  34.0%
62500   16        0.002883        16       5551     50       1584      32      18        0        3        3        0        0        0  41.9%
2       500000    1.053215    500000     474737   1138     127080     120       0        0        5        5        0        0        0  99.7%

running: large-partition-forwarding
Testing forwarding with clustering restriction in a large partition:
pk-scan   time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
yes       0.002717         2        736     24       2684       8      16        0        3        3        0        0        0  19.7%
no        0.001004         2       1992     13        412       8       2        0        3        3        0        0        0  30.2%

running: small-partition-skips
Testing scanning small partitions with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
   read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
-> 1       0         1.466523   1000000     681885   1369     139732      33       1        0        0        0        0        0        0  99.7%
-> 1       1        12.792183    500000      39086   6235     177736    5155       0        0     5123     7663        0        0        0  96.4%
-> 1       8         3.451431    111112      32193   6235     177736    5155       0        0     5123     9673        0        0        0  84.8%
-> 1       16        2.223815     58824      26452   6234     177704    5154       0        0     5122     9965        0        0        0  75.0%
-> 1       32        1.512511     30304      20036   6233     177680    5155       1        0     5123    10090        0        0        0  61.8%
-> 1       64        1.129465     15385      13621   6227     177464    5154       0        0     5122    10159        0        0        0  49.5%
-> 1       256       0.733282      3892       5308   6211     175464    5178      24        0     5122    10220        0        0        0  33.8%
-> 1       1024      0.397302       976       2457   5946     142152    5369     217        0     5120    10235        0        0        0  32.1%
-> 1       4096      0.187746       245       1305   5499      81992    5296     142        0     5122    10240        0        0        0  46.8%
-> 64      1         2.428488    984616     405444   7332     177736    5155      25        0     5123     5208        0        0        0  79.9%
-> 64      8         2.262876    888896     392817   6235     177736    5155       0        0     5123     5654        0        0        0  78.1%
-> 64      16        2.137544    800000     374261   6234     177732    5154       0        0     5122     6110        0        0        0  77.1%
-> 64      32        1.862466    666688     357960   6235     177736    5155       0        0     5123     6844        0        0        0  73.7%
-> 64      64        1.547757    500032     323069   6234     177728    5155       0        0     5123     7651        0        0        0  68.7%
-> 64      256       0.914612    200000     218672   6233     177704    5154       0        0     5122     9202        0        0        0  55.5%
-> 64      1024      0.475472     58880     123835   6229     177492    5154       5        0     5122     9930        0        0        0  45.4%
-> 64      4096      0.271239     15424      56865   6158     169480    5257     114        0     5115    10142        0        0        0  44.1%

running: small-partition-slicing
Testing slicing small partitions:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.003209         1        312      3        260       2       7        0        1        1        0        0        0  15.5%
0       32        0.004205        32       7610     16       1428      10       0        0        5        5        0        0        0  15.7%
0       256       0.009830       256      26042     97       8572      62       0        0       31       31        0        0        0  18.7%
0       4096      0.015471      4096     264748    100       8704      64       0        0       32       32        0        0        0  48.4%
500000  1         0.003654         1        274     34        492      33       0        0       32       64        0        0        0  28.7%
500000  32        0.004287        32       7464     40       1260      36       0        0       32       64        0        0        0  26.0%
500000  256       0.009598       256      26673    100       8748      64       4        0       32       64        0        0        0  20.6%
500000  4096      0.014151      4096     289449    119       7892      85       0        0       53       64        0        0        0  54.1%

========================  With the patch ================================
running: large-partition-skips
Testing scanning large partition with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1       0         0.468887   1000000    2132711   1018     126960      29       0        0        0        0        0        0        0  98.4%
1       1         1.735113    500000     288166   1001     127080      10       0        0        5        5        0        0        0  99.9%
1       8         0.535616    111112     207447    997     127064       6       0        0        3        3        0        0        0  99.6%
1       16        0.365487     58824     160947   1001     127080      15       0        0        5        5        0        0        0  99.5%
1       32        0.272208     30304     111326    997     127064      21       0        0        3        3        0        0        0  99.3%
1       64        0.224049     15385      68668    997     127064     208       0        0        3        3        0        0        0  99.1%
1       256       0.159247      3892      24440    997     127064     250       0        0        3        3        0        0        0  94.7%
1       1024      0.102107       976       9559    997     127064     292       0        0        3        3        0        0        0  53.6%
1       4096      0.084310       245       2906    664      27172     371     273        0        3        3        0        0        0  20.2%
64      1         0.508340    984616    1936923   1142     127064     129       0        0        3        3        0        0        0  98.1%
64      8         0.470369    888896    1889786    997     127064       6       0        0        3        3        0        0        0  99.6%
64      16        0.439917    800000    1818526   1001     127080      10       0        0        5        5        0        0        0  99.6%
64      32        0.397938    666688    1675358    997     127064       6       0        0        3        3        0        0        0  99.5%
64      64        0.344144    500032    1452972    997     127064      18       0        0        3        3        0        0        0  99.4%
64      256       0.219996    200000     909107    997     127064     251       0        0        3        3        0        0        0  99.1%
64      1024      0.124294     58880     473715    997     127064     284       1        0        3        3        0        0        0  62.2%
64      4096      0.097580     15424     158065    703      27856     400     267        0        3        3        0        0        0  25.3%

running: large-partition-slicing
Testing slicing of large partition:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000733         1       1365      9        296       6       4        0        3        3        0        0        0  19.3%
0       32        0.000705        32      45417      9        296       6       3        0        3        3        0        0        0  15.3%
0       256       0.000830       256     308364     10        328       6       3        0        3        3        0        0        0  26.7%
0       4096      0.004631      4096     884529     25        808      14       3        0        3        3        0        0        0  48.1%
500000  1         0.001184         1        845     13        412       9       4        0        3        3        0        0        0  23.7%
500000  32        0.001199        32      26690     13        412       9       4        0        3        3        0        0        0  21.9%
500000  256       0.001530       256     167296     14        444      10       4        0        3        3        0        0        0  26.8%
500000  4096      0.004379      4096     935474     30        956      19       4        0        3        3        0        0        0  51.5%

running: large-partition-slicing-clustering-keys
Testing slicing of large partition using clustering keys:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000620         1       1614      7        176       6       0        0        3        3        0        0        0  27.4%
0       32        0.000625        32      51218      7        176       6       0        0        3        3        0        0        0  27.0%
0       256       0.000701       256     365148      8        180       6       0        0        3        3        0        0        0  35.2%
0       4096      0.004063      4096    1008130     20        692      12       1        0        3        3        0        0        0  47.6%
500000  1         0.001208         1        827     12        516       9       3        0        3        3        0        0        0  24.3%
500000  32        0.000973        32      32876     12        388       9       3        0        3        3        0        0        0  28.7%
500000  256       0.001315       256     194612     13        384      10       3        0        3        3        0        0        0  29.0%
500000  4096      0.003950      4096    1037068     25        840      17       2        0        3        3        0        0        0  52.7%

running: large-partition-slicing-single-key-reader
Testing slicing of large partition, single-partition reader:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000844         1       1185      9        488       6       0        0        3        3        0        0        0  16.5%
0       32        0.000656        32      48753      9        296       6       0        0        3        3        0        0        0  23.1%
0       256       0.000751       256     341011     10        328       6       0        0        3        3        0        0        0  34.0%
0       4096      0.004173      4096     981632     22        840      12       1        0        3        3        0        0        0  47.0%
500000  1         0.001036         1        966     13        476       9       3        0        3        3        0        0        0  25.4%
500000  32        0.001014        32      31573     13        412       9       3        0        3        3        0        0        0  27.4%
500000  256       0.001280       256     200044     14        444      10       3        0        3        3        0        0        0  31.8%
500000  4096      0.004081      4096    1003746     30        988      18       3        0        3        3        0        0        0  51.6%

running: large-partition-select-few-rows
Testing selecting few rows from a large partition:
stride  rows      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1000000 1         0.000668         1       1498      9        296       6       3        0        3        3        0        0        0  21.7%
500000  2         0.000958         2       2088     13        412       9       4        0        3        3        0        0        0  27.7%
250000  4         0.001495         4       2676     18        572      12       6        0        3        3        0        0        0  25.8%
125000  8         0.002069         8       3866     29        912      19      10        0        3        3        0        0        0  30.8%
62500   16        0.002856        16       5603     50       1584      32      18        0        3        3        0        0        0  41.7%
2       500000    1.063129    500000     470310   1138     127080     120       0        0        5        5        0        0        0  99.7%

running: large-partition-forwarding
Testing forwarding with clustering restriction in a large partition:
pk-scan   time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
yes       0.002567         2        779     24       2684       8      16        0        3        3        0        0        0  21.5%
no        0.001013         2       1975     13        412       8       2        0        3        3        0        0        0  28.9%

running: small-partition-skips
Testing scanning small partitions with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
   read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
-> 1       0         1.349959   1000000     740763   1369     139732      33       1        0        0        0        0        0        0  99.7%
-> 1       1        12.640751    500000      39555   8144     191168    7064       0        0     7032    11481        0        0        0  96.2%
-> 1       8         3.404269    111112      32639   6651     180660    5571       0        0     5539    10505        0        0        0  84.5%
-> 1       16        2.175424     58824      27040   6434     179116    5354       0        0     5322    10365        0        0        0  74.3%
-> 1       32        1.493365     30304      20292   6335     178404    5257       0        0     5225    10294        0        0        0  61.1%
-> 1       64        1.112168     15385      13833   6256     177672    5183       0        0     5151    10217        0        0        0  48.7%
-> 1       256       0.719282      3892       5411   6211     175464    5178      24        0     5122    10220        0        0        0  33.3%
-> 1       1024      0.393236       976       2482   5946     142152    5369     217        0     5120    10235        0        0        0  30.7%
-> 1       4096      0.185284       245       1322   5499      81992    5296     142        0     5122    10240        0        0        0  44.7%
-> 64      1         2.356711    984616     417792   7361     177944    5184      21        0     5152     5266        0        0        0  79.1%
-> 64      8         2.192331    888896     405457   6253     177868    5173       0        0     5141     5690        0        0        0  77.2%
-> 64      16        2.029835    800000     394121   6245     177812    5165       0        0     5133     6132        0        0        0  75.7%
-> 64      32        1.806448    666688     369060   6245     177808    5165       0        0     5133     6864        0        0        0  72.6%
-> 64      64        1.508492    500032     331478   6242     177788    5163       0        0     5131     7667        0        0        0  67.7%
-> 64      256       0.892881    200000     223994   6233     177704    5154       0        0     5122     9202        0        0        0  54.2%
-> 64      1024      0.465715     58880     126429   6229     177492    5154       0        0     5122     9930        0        0        0  44.0%
-> 64      4096      0.266582     15424      57858   6158     169480    5257     114        0     5115    10142        0        0        0  42.3%

running: small-partition-slicing
Testing slicing small partitions:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.003113         1        321      3        260       2       0        0        1        1        0        0        0  13.4%
0       32        0.004166        32       7682     16       1428      10       0        0        5        5        0        0        0  14.9%
0       256       0.009813       256      26088     97       8572      62       0        0       31       31        0        0        0  18.4%
0       4096      0.014798      4096     276794    100       8704      64       0        0       32       32        0        0        0  46.3%
500000  1         0.003700         1        270     34        492      33       0        0       32       64        0        0        0  28.4%
500000  32        0.004030        32       7940     40       1260      36       0        0       32       64        0        0        0  27.8%
500000  256       0.009514       256      26908    100       8748      64       0        0       32       64        0        0        0  20.2%
500000  4096      0.013368      4096     306413    119       7892      85       0        0       53       64        0        0        0  53.6%

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <a72818f79ca4081a606424545b0053fa581d49e7.1522173144.git.vladimir@scylladb.com>
2018-03-29 15:23:31 +03:00
Asias He
f539e993d3 gossip: Relax generation max difference check
start node 1 2 3
shutdown node2
shutdown node1 and node3
start node1 and node3
nodetool removenode node2
clean up all scylla data on node2
bootstrap node2 as a new node

I saw that node2 could not bootstrap; it was stuck forever waiting for schema information to complete:

On node1, node3

    [shard 0] gossip - received an invalid gossip generation for peer 127.0.0.2; local generation = 2, received generation = 1521779704

On node2

    [shard 0] storage_service - JOINING: waiting for schema information to complete

This is because, in the nodetool removenode operation, the generation of node2 was increased from 0 to 2.

   gossiper::advertise_removing() calls eps.get_heart_beat_state().force_newer_generation_unsafe();
   gossiper::advertise_token_removed() calls eps.get_heart_beat_state().force_newer_generation_unsafe();

Each force_newer_generation_unsafe increases the generation by 1.

Here is an example,

Before nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
   {
   "addrs": "127.0.0.2",
   "generation": 0,
   "is_alive": false,
   "update_time": 1521778757334,
   "version": 0
   },
```

After nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
 {
     "addrs": "127.0.0.2",
     "application_state": [
         {
             "application_state": 0,
             "value": "removed,146b52d5-dc94-4e35-b7d4-4f64be0d2672,1522038476246",
             "version": 214
         },
         {
             "application_state": 6,
             "value": "REMOVER,14ecc9b0-4b88-4ff3-9c96-38505fb4968a",
             "version": 153
         }
     ],
     "generation": 2,
     "is_alive": false,
     "update_time": 1521779276246,
     "version": 0
 },
```

In gossiper::apply_state_locally, we have this check:

```
if (local_generation != 0 && remote_generation > local_generation + MAX_GENERATION_DIFFERENCE) {
    // assume some peer has corrupted memory and is broadcasting an unbelievable generation about another peer (or itself)
    logger.warn("received an invalid gossip generation for peer {}; local generation = {}, received generation = {}", ep, local_generation, remote_generation);
}
```
to skip the gossip update.

To fix, we relax generation max difference check to allow the generation
of a removed node.
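
The failure can be reproduced with a minimal standalone sketch of the check.
The value of max_generation_difference below is an assumption (one year in
seconds, as in Cassandra's gossiper); this message does not state it. The
relaxed variant is only one hypothetical way to tolerate the small generations
produced by force_newer_generation_unsafe(), not necessarily what this patch
does.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value: one year in seconds. Not taken from this commit message.
constexpr int64_t max_generation_difference = 86400LL * 365;

// The check quoted above: reject a remote generation when the local one is
// non-zero and the remote one is too far ahead of it.
inline bool rejected_by_original_check(int64_t local_generation,
                                       int64_t remote_generation) {
    return local_generation != 0
        && remote_generation > local_generation + max_generation_difference;
}

// Hypothetical relaxation: skip the check while the local generation is small
// enough to have come only from forced bumps on a removed node (two bumps in
// the scenario described above). The threshold name is illustrative.
inline bool rejected_by_relaxed_check(int64_t local_generation,
                                      int64_t remote_generation,
                                      int64_t max_forced_bumps = 2) {
    return local_generation > max_forced_bumps
        && remote_generation > local_generation + max_generation_difference;
}
```

With local generation 2 and received generation 1521779704, the original
check rejects the update, while the relaxed variant accepts it.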

After this patch, the removed node bootstraps successfully.

Tests: dtest:update_cluster_layout_tests.py
Fixes #3331

Message-Id: <678fb60f6b370d3ca050c768f705a8f2fd4b1287.1522289822.git.asias@scylladb.com>
2018-03-29 12:09:49 +03:00
Glauber Costa
b092234f2b sstables: print informative message earlier
Just saw this today during a crash when creating Materialized Views.
It is still unclear why this happened. But the message says:

Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: scylla: sstables/sstables.cc:2973: sstables::sstable::remove_sstable_with_temp_toc(seastar::sstring, seastar::sstring, seastar::sstring, int64_t, sstables::sstable::version_types, sstables::sstable::format_types)::<lambda()>: Assertion `tmptoc == true' failed.
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: Aborting on shard 0.
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: Backtrace:
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000005b4b4c
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000005b4df5
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000005b4ea3
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libpthread.so.0+0x000000000000f0ff
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x00000000000355f6
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x0000000000036ce7
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x000000000002e565
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x000000000002e611
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000015969d0
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x0000000001596f7a
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x000000000051ca8d

I can't even guess which table caused the problem, let alone which SSTable.
That's because those asserts are the very first thing we do. We can discuss
whether or not assert is the right behaviour (usually we can't guarantee the
state is sane if that is missing, so I don't see a problem)

But it would be nice to see which SSTable we are processing before we assert.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180328160856.10717-1-glauber@scylladb.com>
2018-03-28 19:55:04 +03:00
Avi Kivity
4419e60207 Merge "Add a configuration API" from Amnon
"
The configuration API is part of scylla v2 configuration.
It uses the new definition capabilities of the API to dynamically create
the swagger definition for the configuration.
This means that the swagger will contain an entry with a description and
type for each config value.

To get the v2 swagger file:
http://localhost:10000/v2

If using it with the swagger UI, change http://localhost:10000/api-doc to http://localhost:10000/v2
It takes longer to load because the file is much bigger now.
"

* 'amnon/config_api_v5' of github.com:scylladb/seastar-dev:
  Explanation about the API V2
  API: add the config API as part of the v2 API.
  Defining the config api
2018-03-28 12:45:17 +03:00
Amnon Heiman
71a04b5d26 Explanation about the API V2
Currently it holds a general explanation about the V2 and a specific entry
about the config.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-03-28 12:42:04 +03:00
Amnon Heiman
94c2d82942 API: add the config API as part of the v2 API.
After this patch, the API v2 will contain a config section with all the
configuration parameters.

get http://localhost:10000/v2

Will contain the config section.

An example for getting a configuration parameter:
curl http://localhost:10000/v2/config/listen_address

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-03-28 12:42:04 +03:00
Amnon Heiman
6d907e43e0 Defining the config api
The config API is created dynamically from the config. This means that
the swagger definition file will contain the description and types based on the
configuration.

The config.json file is used by the code generator to define a path that is
used to register the handler function.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-03-28 12:41:55 +03:00
Vladimir Krivopalov
b268ea951a tests: perf_fast_forward: Sanitize JSON files names.
Substitute various brackets and parentheses with alnum strings, remove
whitespace, strip the curly braces from single-range values.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <206adea8d05a1e64ce2627df1e4da3a845454906.1522171869.git.vladimir@scylladb.com>
2018-03-28 12:29:07 +03:00
Tomasz Grabiec
52c61df930 Relax includes
To avoid unnecessary recompilations.
Message-Id: <1522168295-994-1-git-send-email-tgrabiec@scylladb.com>
2018-03-28 10:49:07 +03:00
Avi Kivity
4c3e82bd67 Merge "db/view: Populate views with existing base table data" from Duarte
"
This series introduces the view_builder class, a sharded service
responsible for building all defined materialized views. This process
entails walking over the existing data in a given base table, and using
it to calculate and insert the respective entries for one or more views.

The view_builder uses the migration_manager to subscribe to schema
change events, and update its bookkeeping accordingly. We prefer this
to having the database call into the view_builder, as that would
create a cyclic dependency.

We serialize changes to the views of a particular base table, such
that schema changes do not interfere with the view building process.

We employ a flat_mutation_reader for each base table for which we're
building views.

We consume from the reader associated with each base table until all
its views are built. If the reader reaches the end and there are
incomplete views, then a view was added while others were being built.
In such cases, we restart the reader to the beginning of the current
token, but not to the beginning of the token range, when the view is
added. Then, when we exhaust the reader, we simply create a new one
for the whole token range, and resume building the pending views.

We aim to be resource-conscious. On a given shard, at any given moment,
we consume at most from one reader. We also strive for fairness, in that
each build step inserts entries for the views of a different base. Each
build step reads and generates updates for batch_size rows. We lack a
controller, which could potentially allow us to go faster (to execute
multiple steps at the same time, or consume more rows per batch), and
also which would apply backpressure, so we could, for example, delay
executing a build step.
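
The fairness rule above can be sketched as a round-robin queue of base
tables. This is a simplified model, not the view_builder code: the struct,
the function name, and the use of a plain row counter in place of a
flat_mutation_reader are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical model of one base table whose views are being built.
struct base_build_state {
    std::string base_name;
    std::size_t rows_remaining;  // rows of the base table not yet consumed
};

// Runs one build step: serves the base at the front of the queue, consumes at
// most batch_size rows from it, and requeues it at the back if rows remain.
// Returns the name of the base that was served, or an empty string if idle.
inline std::string run_build_step(std::deque<base_build_state>& queue,
                                  std::size_t batch_size) {
    if (queue.empty()) {
        return {};
    }
    base_build_state current = std::move(queue.front());
    queue.pop_front();
    const std::size_t consumed = std::min(batch_size, current.rows_remaining);
    current.rows_remaining -= consumed;
    std::string served = current.base_name;
    if (current.rows_remaining > 0) {
        queue.push_back(std::move(current));
    }
    return served;
}
```

Only one base is consumed per step, so consecutive steps alternate between
bases whenever more than one is pending.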

Interaction with the system tables:
  - When we start building a view, we add an entry to the
    scylla_views_builds_in_progress system table. If the node restarts
    at this point, we'll consider these newly inserted views as having
    made no progress, and we'll treat them as new views;
  - When we finish a build step, we update the progress of the views
    that we built during this step by writing the next token to the
    scylla_views_builds_in_progress table. If the node restarts here,
    we'll start building the views at the token in the next_token
    column.
  - When we finish building a view, we mark it as completed in the
    built views system table, and remove it from the in-progress system
    table. Under failure, the following can happen:
        * When we fail to mark the view as built, we'll redo the last
          step upon node reboot;
        * When we fail to delete the in-progress record, upon reboot
          we'll remove this record.
    A view is marked as completed only when all shards have finished
    their share of the work, that is, if a view is not built, then all
    shards will still have an entry in the in-progress system table;
  - A view that a shard finished building, but not all other shards,
    remains in the in-progress system table, with first_token ==
    next_token.
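
The bookkeeping above can be sketched for a single shard with in-memory
stand-ins for the two local system tables. All names here are hypothetical,
tokens are assumed to be 64-bit integers, and the all-shards condition for
marking a view as built is left out.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <set>
#include <string>

// Hypothetical stand-ins for the in-progress and built views system tables.
struct view_build_tables {
    std::map<std::string, int64_t> in_progress;  // view name -> next_token
    std::set<std::string> built;                 // views recorded as built
};

// Starting a build records the view with no progress made yet.
inline void start_build(view_build_tables& t, const std::string& view,
                        int64_t first_token) {
    t.in_progress[view] = first_token;
}

// Finishing a build step records the token at which to resume.
inline void finish_step(view_build_tables& t, const std::string& view,
                        int64_t next_token) {
    t.in_progress[view] = next_token;
}

// Finishing a build marks the view as built, then drops the progress record.
inline void finish_build(view_build_tables& t, const std::string& view) {
    t.built.insert(view);
    t.in_progress.erase(view);
}

// On restart: returns the token to resume from, or nullopt when the view is
// already built or has no progress record. A progress record left behind by a
// failed delete is removed here.
inline std::optional<int64_t> recover(view_build_tables& t,
                                      const std::string& view) {
    if (t.built.count(view) != 0) {
        t.in_progress.erase(view);
        return std::nullopt;
    }
    const auto it = t.in_progress.find(view);
    if (it == t.in_progress.end()) {
        return std::nullopt;
    }
    return it->second;
}
```

A crash between the two writes in finish_build() leaves the view in both
tables, which recover() treats as built.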

Interaction with the distributed system tables:
  - When we start building a view, we mark the view build as being
    in-progress;
  - When we finish building a view, we mark the view as being built.
    Upon failure, we ensure that if the view is in the in-progress
    system table, then it may not have been written to this table. We
    don't load the built views from this table when starting. When
    starting, the following happens:
         * If the view is in the system.built_views table and not the
           in-progress system table, then it will be in this one;
         * If the view is in the system.built_views table and not in
           this one, it will still be in the in-progress system table -
           we detect this and mark it as built in this table too,
           keeping the invariant;
         * If the view is in this table but not in system.built_views,
           then it will also be in the in-progress system table - we
           don't detect this and will redo the missing step, for
           simplicity.

View building is necessarily a sharded process. That means that on
restart, if the number of shards has changed, we need to calculate
the most conservative token range that has been built, and build
the remainder.

When building view updates, we consider that everything is new and
nothing pre-existing is there (which means no tombstones will be sent
out to the paired view replicas).

Tests:
  unit (debug)
  dtest (materialized_view_test.py(smp=1, smp=2))
"

* 'view-building/v4' of https://github.com/duarten/scylla: (22 commits)
  tests/view_build_test: Add tests for view building
  tests/cql_test_env: Move eventually() to this file
  tests/cql_assertions: Assert result set is not empty
  tests/cql_test_env: Start the view_builder
  db/view/view_builder: Allow synchronizing with the end of a build
  db/view/view_builder: Actually build views
  flat_mutation_reader: Make reader from mutation fragments
  db/view/view_builder: React to schema changes
  service/migration_listener: Add class for view notifications
  db/view: Introduce view_builder
  column_family: Add function to populate views
  column_family: Allow synchronizing with in-progress writes
  database: Compare view id instead of name in find_views()
  database: Add get_views() function
  db/view: Return a future when sending view updates
  service/storage_service: Allow querying the view build status
  db: Introduce system_distributed_keyspace
  tests: Add unit test for build_progress_virtual_reader
  db/system_keyspace: Add API for MV-related system tables
  db/system_keyspace: Add virtual reader for MV in-progress build status
  ...
2018-03-27 15:41:28 +03:00
Daniel Fiala
051ed12ad2 cql3/functions: Print function declaration with cql3 types, not with internal types.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20180327084953.20313-3-daniel@scylladb.com>
2018-03-27 13:33:29 +03:00
Duarte Nunes
9f5cfa76f7 tests/view_build_test: Add tests for view building
This is a separate file from view_schema_test because that one is
already becoming too long to run; also, having multiple test files
means they can be executed in parallel.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
e5031f70ef tests/cql_test_env: Move eventually() to this file
Move eventually() from view_schema_test to cql_test_env.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
8528584056 tests/cql_assertions: Assert result set is not empty
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
a2c94e7925 tests/cql_test_env: Start the view_builder
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
a45fa8eaa2 db/view/view_builder: Allow synchronizing with the end of a build
Intended for use by unit tests, this patch allows synchronizing with
the end of a build for a particular view.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
5f822e3928 db/view/view_builder: Actually build views
This patch adds the missing view building code to the eponymous class.

We consume from the reader associated with each base table until all
its views are built. If the reader reaches the end and there are
incomplete views, then a view was added while others were being built.
In such cases, we restart the reader to the beginning of the current
token, but not to the beginning of the token range, when the view is
added. Then, when we exhaust the reader, we simply create a new one
for the whole token range, and resume building the pending views.

We aim to be resource-conscious. On a given shard, at any given moment,
we consume at most from one reader. We also strive for fairness, in that
each build step inserts entries for the views of a different base. Each
build step reads and generates updates for batch_size rows. We lack a
controller, which could potentially allow us to go faster (to execute
multiple steps at the same time, or consume more rows per batch), and
also which would apply backpressure, so we could, for example, delay
executing a build step.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
1f3e3d3813 flat_mutation_reader: Make reader from mutation fragments
Builds a reader from a set of ordered mutation fragments. This is
useful for building a reader out of a subset of segments returned by a
different reader. It is equivalent to building a mutation out of the
set of mutation fragments, and calling
make_flat_mutation_reader_from_mutations, except that it does not yet
support fast-forwarding.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
a21efeffa0 db/view/view_builder: React to schema changes
The view_builder now uses the migration_manager to subscribe to schema
change events, and update its bookkeeping accordingly. We prefer this
to having the database call into the view_builder, as that would
create a cyclic dependency.

We serialize changes to the views of a particular base table, such
that schema changes do not interfere with the upcoming view building
code.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
3ffa3b6b54 service/migration_listener: Add class for view notifications
Add a convenience base class for view notifications, which provides
a default implementation for all other types of notifications.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
901faabaa2 db/view: Introduce view_builder
This patch introduces the view_builder class, a sharded service
responsible for building all defined materialized views. This process
entails walking over the existing data in a given base table, and using
it to calculate and insert the respective entries for one or more views.

This patch introduces only the bootstrap functionality, which is
responsible for loading the data stored in the system tables and
filling the in-memory data structures with the relevant information,
to be used in subsequent patches for the actual view building. The
interaction with the system tables is as follows.

Interaction with the tables in system_keyspace:
  - When we start building a view, we add an entry to the
    scylla_views_builds_in_progress system table. If the node restarts
    at this point, we'll consider these newly inserted views as having
    made no progress, and we'll treat them as new views;
  - When we finish a build step, we update the progress of the views
    that we built during this step by writing the next token to the
    scylla_views_builds_in_progress table. If the node restarts here,
    we'll start building the views at the token in the next_token
    column.
  - When we finish building a view, we mark it as completed in the
    built views system table, and remove it from the in-progress system
    table. Under failure, the following can happen:
        * When we fail to mark the view as built, we'll redo the last
          step upon node reboot;
        * When we fail to delete the in-progress record, upon reboot
          we'll remove this record.
    A view is marked as completed only when all shards have finished
    their share of the work, that is, if a view is not built, then all
    shards will still have an entry in the in-progress system table;
  - A view that a shard finished building, but not all other shards,
    remains in the in-progress system table, with first_token ==
    next_token.

Interaction with the distributed system table (view_build_status):
  - When we start building a view, we mark the view build as being
    in-progress;
  - When we finish building a view, we mark the view as being built.
    Upon failure, we ensure that if the view is in the in-progress
    system table, then it may not have been written to this table. We
    don't load the built views from this table when starting. When
    starting, the following happens:
         * If the view is in the system.built_views table and not the
           in-progress system table, then it will be in view_build_status;
         * If the view is in the system.built_views table and not in
           this one, it will still be in the in-progress system table -
           we detect this and mark it as built in this table too,
           keeping the invariant;
         * If the view is in this table but not in system.built_views,
           then it will also be in the in-progress system table - we
           don't detect this and will redo the missing step, for
           simplicity.
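
The startup rules above can be sketched roughly as follows. This is an
illustrative Python model, not the view_builder code: the function name,
the action names, and the use of plain sets to stand in for the three
tables are all assumptions made for the example, and per-shard progress
is not modelled.

```python
def reconcile_on_startup(built_views, in_progress, status_built):
    """Return the actions a node would take at startup (hypothetical model).

    built_views  -- views recorded in system.built_views
    in_progress  -- views with a record in the in-progress system table
    status_built -- views marked as built in view_build_status
    """
    actions = []
    # Built locally, but the in-progress record was never deleted: remove it.
    for view in sorted(built_views & in_progress):
        actions.append(("remove_in_progress", view))
    # Built locally, but the distributed table was never updated: fix it up.
    for view in sorted(built_views - status_built):
        actions.append(("mark_built_in_status", view))
    # Still in progress and not built locally: redo the missing step.
    for view in sorted(in_progress - built_views):
        actions.append(("resume_build", view))
    return actions
```

For instance, a view found in both system.built_views and the in-progress
table yields a "remove_in_progress" action in this model, matching the
second failure case listed for system_keyspace.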

View building is necessarily a sharded process. That means that on
restart, if the number of shards has changed, we need to calculate
the most conservative token range that has been built, and build
the remainder.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
f298f57137 column_family: Add function to populate views
The populate_views() function takes a set of views to update, a
token to select base table partitions, and the set of sstables to
query. This lays the foundation for a view building mechanism to exist,
which walks over a given base table, reads data token-by-token,
calculates view updates (in a simplified way, compared to the existing
functions that push view updates), and sends them to the paired view
replicas.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
67dd3e6e5d column_family: Allow synchronizing with in-progress writes
This patch adds a mechanism to class column_family through which we
can synchronize with in-progress writes. This is useful for code that,
after some modification, needs to ensure that new writes will see it
before it can proceed.

In particular, this will be used by the view building code, which needs
to wait until the in-progress writes, which may have missed that there
is now a view, are observable to the view building code.
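
One way to picture such a mechanism is a phased barrier: each write
registers itself in the current phase, and the waiter opens a new phase
and waits for the old one to drain. The sketch below is a Python model
under that assumption. The class and method names are hypothetical, and
asyncio stands in for the real asynchronous runtime; the commit does not
say how column_family implements this.

```python
import asyncio


class PhasedBarrier:
    """Hypothetical model: wait for operations started before a given point."""

    def __init__(self):
        self._phase = 0
        self._in_flight = {0: 0}
        self._waiters = {}

    def start(self):
        # A write joins whichever phase is current when it begins.
        phase = self._phase
        self._in_flight[phase] += 1
        return phase

    def finish(self, phase):
        self._in_flight[phase] -= 1
        if self._in_flight[phase] == 0 and phase in self._waiters:
            self._waiters.pop(phase).set()

    async def advance_and_await(self):
        # New writes land in the next phase; wait only for the old one.
        old = self._phase
        self._phase += 1
        self._in_flight[self._phase] = 0
        if self._in_flight[old] > 0:
            drained = asyncio.Event()
            self._waiters[old] = drained
            await drained.wait()
        del self._in_flight[old]


async def demo():
    barrier = PhasedBarrier()
    log = []

    async def write(name, delay):
        phase = barrier.start()
        try:
            await asyncio.sleep(delay)
            log.append(f"write {name} done")
        finally:
            barrier.finish(phase)

    task = asyncio.create_task(write("old", 0.01))
    await asyncio.sleep(0)  # let the old write start before the view exists
    log.append("view registered")
    await barrier.advance_and_await()
    log.append("builder proceeds")
    await task
    return log
```

In the demo, the builder proceeds only after the write that started
before the view was registered has completed.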

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
9640205f11 database: Compare view id instead of name in find_views()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
9b9ba525f7 database: Add get_views() function
Returns all the schemas that are views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
dc44a08370 db/view: Return a future when sending view updates
While we now send view mutations asynchronously in the normal view
write path, other processes interested in sending view updates, such
as streaming or view building, may wish to do it synchronously.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
ff15068a41 service/storage_service: Allow querying the view build status
This patch adds support for the nodetool viewbuildstatus command,
which shows the progress of a materialized view build across the
cluster.

A view can be absent from the result, successfully built, or
currently being built.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
78b232d98f db: Introduce system_distributed_keyspace
This patch introduces a distributed system keyspace, used to hold
system tables that need to be replicated across a set of replicas
(that is, can't use the LocalStrategy).

In following patches, we will use this keyspace to hold a table
containing view building status updates for each node, used to support
range movements and a new nodetool command.

Fixes #3237

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
412f081db9 tests: Add unit test for build_progress_virtual_reader
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
4227641a3d db/system_keyspace: Add API for MV-related system tables
This patch implements an API to access the MV-related system tables,
which pertain to the view building process.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
b2cae7ea09 db/system_keyspace: Add virtual reader for MV in-progress build status
Provide a virtual reader so users can query the in-progress view table
in a way compatible with Apache Cassandra.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
7811474697 db/system_keyspace: Add Scylla-specific MV system table
When building a materialized view, we divide our work by shard, so we
need to register which shard did what work in the in-progress system
table. We also add the token we started at, which will enable some
optimizations in the view building code.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
38831888d2 db/system_keyspace: Include MV system tables in all_tables()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Avi Kivity
16a7650873 Merge "More extensions: commitlog + system tables" from Calle
"
Additional extension points.

* Allows wrapping commitlog file io (including hinted handoff).
* Allows system schema modification on boot, allowing extensions
  to inject extensions into hardcoded schemas.

Note: to make commitlog file extensions work, we need to both
enforce we can be notified on segment delete, and thus need to
fix the old issue of hard ::unlink call in segment destructor.
Segment delete is therefore moved to a batch routine, run at
intervals/flush. Replay segments and hints are also deleted via
the commitlog object, ensuring an extension is notified (metadata).

Configurable listeners are now allowed to inject configuration
objects into the main config. I.e. a local object can, either
by becoming a "configurable" or manually, add references to
self-describing values that will be parsed from the scylla.yaml
file, effectively extending it.

All these wonderful abstractions courtesy of encryption of course.
But super generalized!
"

* 'calle/commitlog_ext' of github.com:scylladb/seastar-dev:
  db::extensions: Allow extensions to modify (system) schemas
  db::commitlog: Add commitlog/hints file io extension
  db::commitlog: Do segment delete async + force replay delete go via CL
  main/init: Change configurable callbacks and calls to allow adding opts
  util::config_file: Add "add" config item overload
2018-03-26 16:18:22 +03:00
Calle Wilund
ff41f47a08 db::extensions: Allow extensions to modify (system) schemas
Allows extensions/config listeners to potentially augment
(system) schemas at boot time. This is only useful for schemas
that do not pass through system_schema tables.
2018-03-26 11:58:28 +00:00
Calle Wilund
bb1a2c6c2e db::commitlog: Add commitlog/hints file io extension
To allow on-disk data to be augmented.
2018-03-26 11:58:27 +00:00
Calle Wilund
2bc98aebaf db::commitlog: Do segment delete async + force replay delete go via CL
Refs #2858

Push segment files to be deleted to a pending list, and process at
intervals or flush-requests (or shutdown). Note that we do _not_
indiscriminately do deletes in non-anchored tasks, because we need
to guarantee that finished segments are fully deleted and gone on CL
shutdown, not to be mistaken for replayables.
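
The pending-list approach can be modelled as below. This is a minimal
Python sketch, not the commitlog code: the class, its method names, and
the on_delete callback (standing in for an extension being notified) are
assumptions made for illustration.

```python
import os
import tempfile


class SegmentDeleter:
    """Hypothetical model of deferred, batched segment deletion."""

    def __init__(self, on_delete=None):
        self._pending = []
        self._on_delete = on_delete  # e.g. lets an extension drop metadata

    def mark_for_delete(self, path):
        # Instead of unlinking in a destructor, just queue the file.
        self._pending.append(path)

    def process_pending(self):
        # Called at intervals and on flush requests.
        pending, self._pending = self._pending, []
        for path in pending:
            os.unlink(path)
            if self._on_delete is not None:
                self._on_delete(path)
        return len(pending)

    def shutdown(self):
        # Nothing may be left behind to be mistaken for a replayable segment.
        return self.process_pending()


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    notified = []
    deleter = SegmentDeleter(on_delete=notified.append)
    deleter.mark_for_delete(path)
    existed_before = os.path.exists(path)
    deleted = deleter.shutdown()
    return existed_before, deleted, os.path.exists(path), notified == [path]
```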

Also make sure we delete segments replayed via commitlog call,
so IFF we add metadata processing for CL, we can clear it out.
2018-03-26 11:58:27 +00:00
Duarte Nunes
a985ea0fcb column_family: Don't retry flushing memtable if shutdown is requested
Since we just keep retrying, this can cause Scylla to not shut down for
a while.

The data will be safe in the commit log.

Note that this patch doesn't fix the issue when shutdown goes through
storage_service::drain_on_shutdown - more work is required to handle
that case.

Ref #3318.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180324140822.3743-3-duarte@scylladb.com>
2018-03-26 14:36:40 +03:00
Duarte Nunes
50ad37d39b column_family: Increase scope of exception handling when flushing a memtable
In column_family::try_flush_memtable_to_sstable, the handle_exception()
block is on the inside of the continuations to
write_memtable_to_sstable(), which, if it fails, will leave the
sstable in the compaction_backlog_tracker::_ongoing_writes map, which
will waste disk space, and that sstable will map to a dangling pointer
to a destroyed database_sstable_write_monitor, which causes a seg
fault when accessed (for example, through the backlog_controller,
which accounts the _ongoing_writes when calculating the backlog).

Fix this by increasing the scope of handle_exception().

Fixes #3315

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180324140822.3743-2-duarte@scylladb.com>
2018-03-26 14:36:16 +03:00
Duarte Nunes
b7bd9b8058 backlog_controller: Stop update timer
On database shutdown, this timer can cause use-after-free errors if
not stopped.

Refs #3315

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180324140822.3743-1-duarte@scylladb.com>
2018-03-26 14:36:16 +03:00
Botond Dénes
0e6aa91269 Fix test.py output and error handling
* Don't dump output of failed tests immediately, print the output
for failed tests in the end instead.
* Fix exception printing in run_test(): don't assume passed in error
object is a `bytes` (or bytes-like) object, call the object's str
operator instead and let callers encode bytes objects instead.
* Don't assume Exception object has an `out` member, use operator str
instead to convert it to string.
* Don't print progress in run_test() directly because it results in
incomprehensible output as the executors race to print to stdout. Leave
progress report to the caller who can serialize progress prints.
* Automatically detect non-tty stdout and don't try to edit already
printed text.
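
The last point can be illustrated with a small sketch. The helper names
below are hypothetical and are not taken from test.py; only isatty() is
a real stream method. On a terminal the progress line overwrites itself,
while on a pipe or file each update is written as a plain line.

```python
import sys


def progress_line(done, total, name, is_tty):
    """Format one progress update (hypothetical helper)."""
    text = f"[{done}/{total}] {name}"
    if is_tty:
        # Return to column 0 and erase the line, so the next update
        # replaces this one in place.
        return f"\r\033[K{text}"
    # Not a terminal: never try to edit text that was already printed.
    return f"{text}\n"


def report_progress(done, total, name, stream=sys.stdout):
    # Only the caller prints, so concurrent executors cannot interleave.
    stream.write(progress_line(done, total, name, stream.isatty()))
    stream.flush()
```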

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <7bb7e0003ded9b28710250bff851ea849bb99f7d.1522062795.git.bdenes@scylladb.com>
2018-03-26 14:26:45 +03:00
Avi Kivity
999df41a49 Merge "Bug fixes for access-control, and finalizing roles" from Jesse
"
This series does not add or change any features of access-control and
roles, but addresses some bugs and finalizes the switch to roles.

"auth: Wait for schema agreement" and the patch prior help avoid false
negatives for integration tests and error messages in logs.

"auth: Remove ordering dependence" fixes an important bug in `auth` that
could leave the default superuser in a corrupted state when it is first
created.

Since roles are feature-complete (to the best of the author's knowledge
as of this writing), the final patch in the series removes any warnings
about them being unimplemented.

Tests: unit (release), dtest (PENDING)
"

* 'jhk/auth_fixes/v1' of https://github.com/hakuch/scylla:
  Roles are implemented
  auth: Increase delay before background tasks start
  auth: Remove ordering dependence
  auth: Don't warn on rescheduled task
  auth: Wait for schema agreement
  Single-node clusters can agree on schema
2018-03-26 09:29:41 +03:00
Jesse Haber-Kucharsky
849cf49b8d Roles are implemented
Fixes #1941.
2018-03-26 00:52:59 -04:00
Jesse Haber-Kucharsky
af24637565 auth: Increase delay before background tasks start
I've observed failures due to "missing" the peer nodes by about 1
second. Adding 5 seconds to the existing delay should cover most false
negative test results.

Fixes #3320.
2018-03-26 00:52:55 -04:00
Jesse Haber-Kucharsky
00f7bc676d auth: Remove ordering dependence
If `auth::password_authenticator` also creates `system_auth.roles` and
we fix the existence check for the default superuser in
`auth::standard_role_manager` to only search for the columns that it
owns (instead of the column itself), then both modules' initialization
are independent of one another.

Fixes #3319.
2018-03-25 22:38:11 -04:00
Jesse Haber-Kucharsky
968c61c296 auth: Don't warn on rescheduled task
Apache Cassandra also prints at the `info` level. This change prevents
tasks which we expect to be rescheduled from failing tests and scaring
users.

A good example of the importance of this change is when queries with a
quorum consistency level (for the default superuser) fail because a
quorum is not available. We will try again in this case, and this should
not cause integration tests to fail.
2018-03-25 22:38:11 -04:00
Jesse Haber-Kucharsky
881656cea4 auth: Wait for schema agreement
Some modules of `auth` create a default superuser if it does not already
exist.

The existence check is through a SELECT query with quorum consistency
level. If the schema for the applicable tables has not yet propagated to
a peer node at the time that it processes this query, then the
`storage_proxy` will print an error message to the log and the query
will be retried.

Eventually, the schema will propagate and the default superuser will be
created. However, the error message in the log causes integration tests
to fail (and is somewhat annoying).

Now, prior to querying for existing data, we wait for all gossip peers
to have the same schema version as we do.

Fixes #2852.
2018-03-25 22:38:08 -04:00
Jesse Haber-Kucharsky
3e415e28bc Single-node clusters can agree on schema
At some points while bootstrapping [1], new non-seed Scylla nodes wait
for schema agreement among all known endpoints in the cluster.

The check for schema agreement was in
`service::migration_manager::is_ready_for_bootstrap`. This function
would return `true` if, at the time of its invocation, the node was
aware of at least one `UP` peer (not itself) and that all `UP` peers had
the same schema version as the node.

We wish to re-use this check in the `auth` sub-system to ensure that
the schema for internal system tables used for access-control has
propagated to the entire cluster.

Unlike in `service/storage_service.cc`, where `is_ready_for_bootstrap`
was only invoked for non-seed nodes, we wish to wait for schema agreement
for all nodes regardless of whether or not they are seeds.

For a single-node cluster with itself as a seed,
`is_ready_for_bootstrap` would always return `false`.

We therefore change the conditions for schema agreement. Schema
agreement is now reached when there are no known peers (so the endpoint
map of the gossiper consists only of ourselves), or when there is at
least one `UP` peer and all `UP` peers have the same schema version as
us.
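
The new condition can be written down as a small predicate. The sketch
below is a Python model of the rule as described here; the Peer type and
the function name are assumptions for illustration and do not mirror the
migration_manager interface.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    """Hypothetical view of a gossip peer (excluding ourselves)."""
    schema_version: str
    is_up: bool = True


def have_schema_agreement(our_version, peers):
    # No known peers: a single-node cluster trivially agrees with itself.
    if not peers:
        return True
    up_versions = [p.schema_version for p in peers if p.is_up]
    # Otherwise we need at least one UP peer, and all UP peers must match us.
    return bool(up_versions) and all(v == our_version for v in up_versions)
```

Under the old rule the first branch did not exist, which is why a
single-node cluster never reported agreement.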

This change should not impact any bootstrap behavior in
`storage_service` because seed nodes do not invoke the function and
non-seed nodes wait for peer visibility before checking for schema
agreement.

Since this function is no longer checking for schema agreement only in
the context of bootstrapping non-seed nodes, we rename it to reflect its
generality.

[1] http://thelastpickle.com/blog/2017/05/23/auto-bootstrapping-part1.html
2018-03-25 22:08:42 -04:00
Duarte Nunes
aed28c667c db/view: Pass pending endpoints to storage_proxy::send_to_endpoint
This minimizes the number of mutation copies by just doing a single
call to send_to_endpoint().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180325121412.76844-2-duarte@scylladb.com>
2018-03-25 15:45:22 +03:00
Duarte Nunes
fb54c09e0b service/storage_proxy: Pass pending endpoints to send_to_endpoint()
This will allow us to minimize the number of mutation copies in
mutate_MV().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180325121412.76844-1-duarte@scylladb.com>
2018-03-25 15:45:21 +03:00
Avi Kivity
389fb54a42 tests: sstable_test: fix for_each_sstable_version concept (again)
I see the following error:

seastar/core/future-util.hh:597:10: note:   constraints not satisfied
seastar/core/future-util.hh:597:10: note:     with ‘sstables::sstable_version_types* c’
seastar/core/future-util.hh:597:10: note:     with ‘sub_partitions_read::run_test_case()::<lambda(sstables::sstable::version_types)> aa’
seastar/core/future-util.hh:597:10: note: the required expression ‘seastar::futurize_apply(aa, (* c.begin()))’ would be ill-formed
seastar/core/future-util.hh:597:10: note: ‘seastar::futurize_apply(aa, (* c.begin()))’ is not implicitly convertible to ‘seastar::future<>’

The C array all_sstable_versions decayed to a pointer (see second gcc note)
and of course doesn't support std::begin().

Fix by replacing the C array with an std::array<>, which supports std::begin().

Not clear what made this break again, or why it worked before.
Message-Id: <20180325095239.12407-1-avi@scylladb.com>
2018-03-25 13:02:57 +01:00
Duarte Nunes
44996fa6ae Merge 'Reduce link dependencies in tests' from Avi
"
This patchset removes unneeded object files from the test link,
reducing unnecessary links and reducing link time and executable
size.

Tests: build (release)
"

* tag 'test-link/v1' of https://github.com/avikivity/scylla:
  build: link release.o into scylla and perf_fast_forward binaries only
  build: don't link api/ into tests
2018-03-24 20:54:49 +00:00
Avi Kivity
09453ca0db build: link release.o into scylla and perf_fast_forward binaries only
release.o depends on the release date and git hash, and therefore changes
every time ./configure.py is executed.  In turn, this causes all tests to
relink.

Improve the situation by only linking release.o into binaries that require
it.

This helps continuous integration scripts, which call configure.py
unconditionally. Developers usually won't, so they will not see significant
savings.

Tests: build (release)
2018-03-24 22:55:03 +03:00
Avi Kivity
e78cea4121 build: don't link api/ into tests
They don't need it.
2018-03-24 22:55:02 +03:00
Duarte Nunes
f298e3e6f8 database: Log exception which caused flush to fail
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180322204419.12961-1-duarte@scylladb.com>
2018-03-23 10:57:35 +00:00
Takuya ASADA
81fbcbf6bc dist/redhat: don't redefine __debug_install_post on Fedora27 or later
Redefining __debug_install_post does not work on Fedora27 or later,
apparently because the debuginfo generation process has changed:
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/ITJHJTUO2WFEAYIHANSM6AMAB5SIFASI/

To prevent the build error, move scylla-gdb.py to scylla-server package on
Fedora 27 or later.

Fixes #3313

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521735371-29408-1-git-send-email-syuu@scylladb.com>
2018-03-22 19:39:14 +02:00
Takuya ASADA
879c9f1bf8 dist/redhat: don't use yaml-cpp-static on Fedora
Since Fedora still does not have a separate yaml-cpp-static package, don't
depend on it.

Fixes #3183

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521735663-29516-1-git-send-email-syuu@scylladb.com>
2018-03-22 18:24:28 +02:00
Avi Kivity
054854839a Merge "Fix abort during counter table read-on-delete" from Tomasz
"
This fixes an abort in an sstable reader when querying a partition with no
clustering ranges (happens on counter table mutation with no live rows) which
also doesn't have any static columns. In such case, the
sstable_mutation_reader will set up the data_consume_context such that it only
covers the static row of the partition, knowing that there is no need to read
any clustered rows. See partition.cc::advance_to_upper_bound(). Later when
the reader is done with the range for the static row, it will try to skip to
the first clustering range (missing in this case). If clustering_ranges_walker
tells us to skip to after_all_clustering_rows(), we will hit an assert inside
continuous_data_consumer::fast_forward_to() due to attempt to skip past the
original data file range. If clustering_ranges_walker returns
before_all_clustering_rows() instead, all is fine because we're still at the
same data file position.

Fixes #3304.
"

* 'tgrabiec/fix-counter-read-no-static-columns' of github.com:scylladb/seastar-dev:
  tests: mutation_source_test: Test reads with no clustering ranges and no static columns
  tests: simple_schema: Allow creating schema with no static column
  clustering_ranges_walker: Stop after static row in case no clustering ranges
2018-03-22 17:36:20 +02:00
Tomasz Grabiec
604166143c tests: mutation_source_test: Test reads with no clustering ranges and no static columns
Reproduces issue #3304.
2018-03-22 15:00:48 +01:00
Tomasz Grabiec
3a974d1776 tests: simple_schema: Allow creating schema with no static column 2018-03-22 14:44:54 +01:00
Tomasz Grabiec
d1cb6bbf95 clustering_ranges_walker: Stop after static row in case no clustering ranges
When there are no clustering ranges, stop at position which is right
after the static row instead of position which is after all clustered
rows.

This fixes an abort in sstable reader when querying a partition with
no clustering ranges (happens with counter tables) which also doesn't
have any static columns. In such case, the sstable_mutation_reader
will set up the data_consume_context such that it only covers the
static row of the partition, knowing that there is no need to read
any clustering row. See partition.cc::advance_to_upper_bound().  Later
when we're done with reading the static row (which is absent), we will
try to skip to the first clustering range, which in this case is
missing.  If clustering_ranges_walker tells us to skip to
after_all_clustering_rows(), we will hit an assert inside
continuous_data_consumer::fast_forward_to() due to attempt to skip
past the original data file range. If clustering_ranges_walker returns
before_all_clustering_rows() instead, all is fine, because we end up
at the same data file position.

Fixes #3304.
2018-03-22 14:44:48 +01:00
Botond Dénes
a65b063ab2 incremental_reader_selector: remove unused members
Since 3d725d6823 the incremental_reader_selector creates readers via
a factory function so these members, used previously for creating the
readers, are not needed anymore.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <64b5cef93c1f9a2e544ccfd89e293627e99dd4cd.1521724155.git.bdenes@scylladb.com>
2018-03-22 13:14:03 +00:00
Takuya ASADA
bef08087e1 scripts/scylla_install_pkg: follow redirection of specified repo URL
We should follow redirection on curl, just like a normal web browser does.
Fixes #3312

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521712056-301-1-git-send-email-syuu@scylladb.com>
2018-03-22 12:55:43 +02:00
Vladimir Krivopalov
3010b637c9 perf_fast_forward: fix error in date formatting
Instead of 'month', 'minutes' has been used.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <1e005ecaa992d8205ca44ea4eebbca4621ad9886.1521659341.git.vladimir@scylladb.com>
2018-03-22 09:57:15 +00:00
Avi Kivity
a7d86410b5 Merge "Split more tasks out of the generic scheduling group" from Glauber
"
There are a lot of things that we should be grouping in scheduling
groups that we aren't yet. The write path is not tagged at all,
mutation_query isn't either. Some, like streaming, are used - but not in
all places where they are needed.

Tests: unit (release)
"

* 'split-scheduling-groups-v2' of github.com:glommer/scylla:
  database: group statements in their own scheduling group
  database: apply streaming mutations with streaming priority
  logalloc: capture current scheduling group for deferring function
2018-03-21 15:02:50 +02:00
Nadav Har'El
e5de66d0c4 Materialized Views: unit test for missing view key columns
Add a unit test for reproducing issue #2720 (and verifying its fix)
If a user tries to create a view whose primary key is missing any of the
base table's primary key columns, the creation should fail.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180320161121.13392-3-nyh@scylladb.com>
2018-03-21 09:47:41 +00:00
Nadav Har'El
c809dd2e66 Materialized Views: change order of view creation verification
Changed the order to check a couple of error conditions *after* checking
for too many or missing primary key columns. This order (showing the
too many or missing key columns first) is more useful, and is the order
in Cassandra's code.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180320161121.13392-2-nyh@scylladb.com>
2018-03-21 09:47:41 +00:00
Nadav Har'El
871cecfd3b Materialized Views: fix checking that view key includes base key
A view's primary key must include all the columns of the base's primary
key. If we don't check this and fail the table's creation, we can discover
problems later on when using the table, as demonstrated in issue #2720.

We had such checking code (translated from the same code in Java) but it
had an extra "else" which caused nothing to be put in "missing_pk_columns"
so the error was never recognized.

Also, when the error does happen, we should print the column's name_as_text(),
not name() which is (surprisingly) just a number.

Fixes #2720.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180320161121.13392-1-nyh@scylladb.com>
2018-03-21 09:47:41 +00:00
Nadav Har'El
06aaace5a4 Materialized View: fix one of the unit tests
One of the tests created a base table with 5 primary key columns, but
put only 4 of them in the view. This is not allowed, but prior to fixing
issue #2720 this error was silently ignored. Let's fix the error instead
of relying on this silence.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180321094352.22329-1-nyh@scylladb.com>
2018-03-21 09:46:55 +00:00
Duarte Nunes
0d74442252 tests/sstable_test: Fix concept for for_each_sstable_version
Un-break the build.

Fixes #3307

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180320182011.11068-1-duarte@scylladb.com>
2018-03-20 22:26:06 +00:00
Glauber Costa
9188059427 database: group statements in their own scheduling group
When we introduced the CPU scheduler, we have also introduced a group
for commitlog - but never used it. There is also doubtful value in
separating reads from writes, since they are often part of the same
workload.

To accommodate that, let's rename the query group to "statement"
(query is not incorrect, just confusing), and move the write path,
currently ungrouped, inside it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-20 16:58:36 -04:00
Glauber Costa
c8e169f6d8 database: apply streaming mutations with streaming priority
We are flushing the streaming memtables with streaming priority, but
applying the mutations themselves is still done with normal priorities.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-20 16:58:35 -04:00
Glauber Costa
3b53f922a3 logalloc: capture current scheduling group for deferring function
When we call run_when_memory_available, it is entirely possible that
the caller is doing that inside a scheduling_group. If we don't defer
we will execute correctly. But if we do defer, the current code will
execute - in the future - with the default scheduling group.

This patch fixes that by capturing the caller scheduling group and
making sure the function is executed later using it.
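
The idea can be modelled in a few lines. In this Python sketch a context
variable stands in for the current scheduling group; the class, the
method names, and the "streaming" label are hypothetical and are not the
logalloc or Seastar interfaces.

```python
import contextvars

# Stand-in for "the scheduling group the current code runs in".
current_group = contextvars.ContextVar("current_group", default="default")


class DeferredRunner:
    """Hypothetical model of run_when_memory_available-style deferral."""

    def __init__(self):
        self._queue = []

    def run_when_memory_available(self, fn, memory_available):
        if memory_available:
            # Not deferred: runs in the caller's group automatically.
            return fn()
        # Deferred: remember the caller's group alongside the function.
        self._queue.append((current_group.get(), fn))
        return None

    def release_memory(self):
        results = []
        for group, fn in self._queue:
            # Re-enter the captured group instead of running in the default.
            token = current_group.set(group)
            try:
                results.append(fn())
            finally:
                current_group.reset(token)
        self._queue.clear()
        return results
```

Without the captured group, the deferred function in this model would
observe "default" when it finally runs.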

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-20 16:58:35 -04:00
Duarte Nunes
237184324e Merge 'Make the read repair decision per-query instead of per-page' from Botond
"
Since f8613a8415 we have reader-caching
on replicas for single-partition queries. This caching works best when
all pages of a query are sent to the same replicas consistently and thus
they can reuse the cached readers there.
The probability-based nature of read-repair works against this as on any
given page a read-repair will be attempted or not based on probability.
This will cause high drop-rates on the replicas used for read-repair as
the cached reader will not be reusable if the replica was skipped for
one or more pages.
To fix this make the repair-decision once, on the first page of the
query and store the decision in the paging-state. On all remaining
pages of the query use this stored decision.
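
A rough model of the change, in Python: the decision is rolled once and
then carried in the paging state. The PagingState type, the decision
labels, and the injectable rng parameter are assumptions for the sketch
and do not reflect the actual paging-state layout.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class PagingState:
    """Hypothetical paging state carrying the stored decision."""
    read_repair_decision: Optional[str] = None


def decide_read_repair(paging_state, chance, rng=random.random):
    # Later pages: reuse whatever was decided on the first page.
    if paging_state is not None and paging_state.read_repair_decision is not None:
        return paging_state.read_repair_decision
    # First page: roll the dice exactly once for the whole query.
    return "global" if rng() < chance else "none"
```

Every page of a query then targets the same set of replicas, so cached
readers on those replicas stay reusable.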

Tests: unit-tests(release, debug), dtest(paging_advanced_tests.py)

Refs: #1865
"

* 'per_query_repair_decision/v2' of https://github.com/denesb/scylla:
  Make the read-repair decision only once
  storage_proxy: add coordinator_query_options and coordinator_query_result
  Add query_read_repair_decision to paging-state
2018-03-20 11:59:41 +00:00
Takuya ASADA
2045891cc2 dist/debian: use rebuilt libyaml-cpp on Debian9
On Debian9, the distribution-provided libyaml-cpp cannot be linked against
scylla, so use the rebuilt one from our 3rdparty repo.

fixes #3221

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521496288-12856-1-git-send-email-syuu@scylladb.com>
2018-03-20 12:30:47 +02:00
Nadav Har'El
07f88aef51 Materialized Views: test verification of only one new key column
For several reasons that I cannot fit in the margin, when a view is
created, at most ONE regular column from the base table may be added
to the view's key.
This small new test verifies that if we try to add two columns, the
view creation fails.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180319235453.1613-1-nyh@scylladb.com>
2018-03-20 00:30:18 +00:00
Nadav Har'El
1d4ceaa237 Materialized Views: Fix IS NOT NULL unit test
We had a unit test, test_primary_key_is_not_null, for testing that
we correctly complain - or don't complain - on missing "IS NOT NULL"
restrictions, as expected.

However, this test missed the actual bug we had regarding IS NOT NULL
checking - see issue #2628 - because it thought a silly syntax error,
which caused an exception, was the exception we expected to see :-)

So in this patch, I rewrote this test. It fixes the test's bug and
demonstrates issue #2628 (and verifies its fix), and also tests a few
more corner cases.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180319235000.1399-1-nyh@scylladb.com>
2018-03-20 00:30:18 +00:00
Nadav Har'El
da110d612e Materialized Views: Fix "IS NOT NULL" checking
When creating a materialized view, the user must provide an "IS NOT NULL"
restriction for each of the created view's primary key columns. If such a
restriction is missing, the view creation should fail. In #2628 we noticed
that sometimes it wasn't failing, but later updates to such table would fail,
which is a bug.

There is actually one special case where "IS NOT NULL" is optional:
It is optional on the base's partition key column (when there is just
one of these) because it is already assumed that the partition key in
its entirety can never be null.

Our "IS NOT NULL" test, validate_primary_key(), had two logic errors
which caused it to miss some cases of missing "IS NOT NULL":

1. Instead of checking whether a certain column is the base's only
   partition-key column, and avoid testing IS NOT NULL just for that
   specific column, the code tested whether the schema *has* such a
   column, and if it did, the test was skipped for all columns.

2. When the code found the one new column in the view's primary key, it
   was so happy to find it that it immediately returned, and forgot to
   test the IS NOT NULL on that column :-)

Both errors are fixed by this patch.
See the next patch for a unit test.
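
As a rough illustration of the corrected logic, here is a minimal Python
sketch. It is not Scylla's validate_primary_key(); the function name, the
string column names and the `restricted` set are all hypothetical.

```python
def missing_is_not_null(view_key_columns, base_partition_key, restricted):
    """Return view primary-key columns lacking an IS NOT NULL restriction.

    Hypothetical model: columns are plain strings and `restricted` is the
    set of columns that carry an IS NOT NULL restriction.
    """
    # The exemption applies only to the column that is the base's sole
    # partition-key column, not to every column of such a schema (error 1).
    exempt = base_partition_key[0] if len(base_partition_key) == 1 else None
    missing = []
    for column in view_key_columns:
        if column == exempt:
            continue
        # Every other column is checked, including the one regular base
        # column that is new in the view's key (error 2).
        if column not in restricted:
            missing.append(column)
    return missing
```

With a base table keyed by a single partition column `p`, a view key of
(`v`, `p`, `c`) restricted only on `c` would be reported as missing `v`.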

Fixes #2628.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180319233657.522-1-nyh@scylladb.com>
2018-03-20 00:30:18 +00:00
Glauber Costa
f80d4a28d7 flat_mutation_reader: explicitly yield at every partition
Right now we have yield points between partition processing guaranteed
by the fact that there are .get()s around the code, and those include
a yield point.

We have been discussing removing the implicit yield point from get and
pushing that to the caller. In that spirit, let's yield explicitly here
if needed.

It should be the responsibility of the loop that it doesn't hurt
latency, either by the fact that it is bounded by a small number of
iterations or yields. In other words, that loop should have a yield
point on every iteration (like the non-thread variant does).

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180319173051.8918-1-glauber@scylladb.com>
2018-03-19 19:39:01 +02:00
Avi Kivity
03c22ad524 Merge "Support for Cassandra 2.2 (LA) SSTable formats" from Daniel
"
These patches add support for C* 2.2 file(name) format.

Namely:
  * It forces Scylla to write files in la format.
  * Adds storage-service feature for them.
  * cf and ks are determined from directory, not from file-name (for 2.2 format).
  * Adds some other fixes to make dtest happy.
  * Unit tests work with la format or with both formats.
"

* 'danfiala/filename-format-2.2-v4' of https://github.com/hagrid-the-developer/scylla:
  tests/sstables: Tests use la format or iterate over both formats.
  tests/sstables: Helper functions support 2.2 format directory structure.
  sstables: Use 2.2 (la) format as a default format to store sstables if it is enabled by feature-bits.
  storage_service: Support la sstable storage format as a feature.
  sstables: make_descriptor accepts sstable-directory, because it is necessary to determine cf and ks in 2.2 format.
  sstables: Throw more detail exception for unknown item in reverse_map.
  sstables/compaction: Suppress NaN in a report of a throughput.
2018-03-19 17:49:44 +02:00
Botond Dénes
eee9bda85b Make the read-repair decision only once
Make the read-repair decision on the first page of a paged-query and use
it for all the remaining pages. This helps querier-cache hit-rates as
reads to nodes will be sent consistently throughout the query.
2018-03-19 16:29:43 +02:00
Avi Kivity
fe4049f074 Merge "gossip: Fixes to shadow round" from Duarte
"
Fixes to gossip pertaining to the shadow round.

In particular, an issue preventing a node from being marked as alive is
fixed: After the shadow round and the feature checking, we remove any
endpoints from the state - namely, those that contacted us -, before
re-adding them again. This is because those nodes that replied would
have been marked as alive in the endpoint state map (but not fully,
they'd be absent from the live endpoints list), and re-adding them marks
them as dead.

If the shadow round failed, after doing the feature checking against the
system tables, we were not clearing the state map and re-adding the
endpoints. This leaves the alive marker set, and prevents
real_mark_alive() from eventually being called.

Fixes #3301
"

* 'gossip/shadow-round-fixes/v3' of https://github.com/duarten/scylla:
  gms/gossiper: Remove superfluous check
  service/storage_service: Always re-add loaded endpoints
  gms/gossiper: Check for shadow round completion before throwing
2018-03-19 15:22:35 +02:00
Botond Dénes
2e2abf6edb storage_proxy: add coordinator_query_options and coordinator_query_result
As yet more parameters and return-values are about to be added to all
storage_proxy::query_* methods we need a way that scales better than
changing the signatures every time. To this end we aggregate all
non-mandatory query parameters into `coordinator_query_options` and all
return values into `coordinator_query_result`.
This way new fields can be simply added to the respective structs while
the signatures of the methods themselves and their client code can
remain unchanged.
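
The pattern can be sketched in Python as follows. Only the two struct names
come from this commit; the fields (`timeout_ms`, `trace`,
`replicas_contacted`) and the `query` function are hypothetical and exist
only to show that fields can be added without touching the signature.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class CoordinatorQueryOptions:
    # Hypothetical optional inputs; each has a default so callers may omit it.
    timeout_ms: int = 10_000
    trace: bool = False


@dataclass
class CoordinatorQueryResult:
    rows: List[Any] = field(default_factory=list)
    # Hypothetical extra output; adding it does not change query()'s signature.
    replicas_contacted: List[str] = field(default_factory=list)


def query(command: str,
          options: Optional[CoordinatorQueryOptions] = None) -> CoordinatorQueryResult:
    """Stand-in for a storage_proxy::query_* method; returns canned data."""
    options = options or CoordinatorQueryOptions()
    return CoordinatorQueryResult(rows=[(command, options.timeout_ms)])
```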
2018-03-19 15:17:35 +02:00
Botond Dénes
b55dcc2ce5 Add query_read_repair_decision to paging-state
This new field will store the repair-decision made on the first page of
the query. This decision will be sticky to all pages of the query.
In mixed clusters the decision might not happen on the first page and it
might even change during the query as old coordinators will not store
nor respect the decision.
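
A minimal Python sketch of the "sticky" behaviour, assuming a simplified
paging state that carries only the new field. The class, the helper and the
decision values are illustrative, not the actual Scylla types.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PagingState:
    # Hypothetical stand-in for the real paging state; only the new field
    # is modelled. None means no decision has been stored yet.
    query_read_repair_decision: Optional[str] = None


def read_repair_decision(state: PagingState, decide: Callable[[], str]) -> str:
    """Reuse the stored decision if present, otherwise make and store one."""
    if state.query_read_repair_decision is None:
        state.query_read_repair_decision = decide()
    return state.query_read_repair_decision
```

A paging state produced by an old coordinator would arrive without the
field, which is why the decision may be made on a later page in a mixed
cluster.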
2018-03-19 15:17:31 +02:00
Daniel Fiala
4d703f9c6a tests/sstables: Tests use la format or iterate over both formats.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-19 14:12:10 +01:00
Daniel Fiala
386cae4ad2 tests/sstables: Helper functions support 2.2 format directory structure.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-19 14:12:09 +01:00
Daniel Fiala
089b54f2d2 sstables: Use 2.2 (la) format as a default format to store sstables if it is enabled by feature-bits.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-19 14:12:01 +01:00
Daniel Fiala
802be72ca6 storage_service: Support la sstable storage format as a feature.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-19 14:10:31 +01:00
Duarte Nunes
9cadfb27f1 gms/gossiper: Remove superfluous check
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-19 13:08:53 +00:00
Duarte Nunes
2c7b77b6d2 service/storage_service: Always re-add loaded endpoints
After the shadow round and the feature checking, we remove any
endpoints from the state - namely, those that contacted us -, before
re-adding them again. This is because those nodes that replied would
have been marked as alive in the endpoint state map (but not fully,
they'd be absent from the live endpoints list), and re-adding them
marks them as dead.

If the shadow round failed, after doing the feature checking against
the system tables, we were not clearing the state map and re-adding
the endpoints. This left the alive marker set, and prevented
real_mark_alive() from eventually being called.

Fixes #3301

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-19 13:08:53 +00:00
Duarte Nunes
69b28a4f2b gms/gossiper: Check for shadow round completion before throwing
For values of `shadow_round_ms` lower than 1 second, this was assuming
failure without checking.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-19 13:08:53 +00:00
Avi Kivity
601d8f7cff test: switch boost.test from --log_sink to --logger
Upstream fix works only for --logger according to

  https://github.com/boostorg/test/pull/124
Message-Id: <20180319121520.11110-1-avi@scylladb.com>
2018-03-19 13:26:28 +01:00
Calle Wilund
eb10d32ff9 main/init: Change configurable callbacks and calls to allow adding opts
Refs #2526

Allows sub-configs to dynamically add yaml/command line options to
the main config object, i.e. extend the scylla.yaml
2018-03-19 12:24:04 +00:00
Calle Wilund
fc97e39782 util::config_file: Add "add" config item overload 2018-03-19 12:24:04 +00:00
Duarte Nunes
71fddad376 Merge 'Reduce unit test runtime' from Avi
This patchset reduces the time required to run the tests, mostly by
running them in parallel.

I measured a reduction of 3.5X on a 1s4c4t desktop (release mode).

Tests: unit (release)

* tag 'faster-tests/v2' of https://github.com/avikivity/scylla:
  tests: run tests in parallel
  tests: simplify timeout handling
  tests: don't require crash integrity
  tests: allow sharing the machine with other tests
  tests: extract seastar options to a separate variable
  tests: reduce memory for tests
  tests: add "--" unconditionally for boost tests
  tests: start cql_test_env without binding to messaging port
  storage_service: allow starting gossiper without binding to messaging port
  gms: allow gossiper to start_gossiping() without binding to the port
  tests: close file correctly in loading_file_test
2018-03-19 10:24:55 +00:00
Avi Kivity
31b86a46a0 tests: run tests in parallel
Launch tests in a concurrent executor with worker count determined
by available memory.
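
A minimal sketch of the idea, assuming the caller supplies the available
memory and a per-test memory budget. The function names and the use of a
thread pool are assumptions, not the actual test runner code.

```python
import concurrent.futures


def worker_count(available_bytes, per_test_bytes):
    """One worker per `per_test_bytes` of available memory, at least one."""
    return max(1, available_bytes // per_test_bytes)


def run_all(tests, run_one, workers):
    """Run `run_one` over `tests` concurrently, keeping the input order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, tests))
```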
2018-03-19 12:17:10 +02:00
Avi Kivity
638611a350 tests: simplify timeout handling
The subprocess module can handle timeouts itself, so use this
to simplify the module code.
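
For reference, a small sketch of what relying on the subprocess module's
own timeout looks like. The wrapper name and return shape are hypothetical;
`subprocess.run(..., timeout=...)` kills the child and raises
`subprocess.TimeoutExpired` when the timeout elapses.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_s):
    """Return (finished, returncode); a timed-out child is killed by run()."""
    try:
        completed = subprocess.run(argv, timeout=timeout_s,
                                   stdout=subprocess.DEVNULL,
                                   stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        return False, None
    return True, completed.returncode


if __name__ == "__main__":
    # Example: a trivial child process that exits immediately.
    print(run_with_timeout([sys.executable, "-c", "pass"], 30))
```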
2018-03-19 12:16:58 +02:00
Avi Kivity
95abed020b tests: don't require crash integrity
We don't resume tests after crashes, so no need to spend time waiting
for the disk to fsync.
2018-03-19 12:16:58 +02:00
Avi Kivity
b3d8dadf0c tests: allow sharing the machine with other tests
By using the overprovisioned flag, we reduce polling and pinning, so
less CPU time is wasted and the scheduler has more options to schedule
reactor threads.
2018-03-19 12:16:58 +02:00
Avi Kivity
3d84c8945d tests: extract seastar options to a separate variable 2018-03-19 12:16:58 +02:00
Avi Kivity
8b1cff90ce tests: reduce memory for tests
If we reduce memory for an individual test, we can run more
in parallel.
2018-03-19 12:16:58 +02:00
Avi Kivity
c3750176d8 tests: add "--" unconditionally for boost tests
Now that we have a minimum boost version, we don't need to check whether
boost requires "--" before test-specific command line arguments. Removing
the check speeds up the test a little.
2018-03-19 12:16:58 +02:00
Avi Kivity
9a04def202 tests: start cql_test_env without binding to messaging port
Allows running tests in parallel.
2018-03-19 12:16:52 +02:00
Avi Kivity
ee68bfa49d storage_service: allow starting gossiper without binding to messaging port 2018-03-19 12:16:11 +02:00
Avi Kivity
02ce0c4cde gms: allow gossiper to start_gossiping() without binding to the port
This is useful in tests, which don't communicate. Binding to a port can
fail if the system is running something else.

It would be better to prevent even more of the gossiper from starting up,
but that is more difficult.
2018-03-19 12:16:11 +02:00
Avi Kivity
f2dd31ee76 tests: close file correctly in loading_file_test
Otherwise, we crash with --overprovisioned on a use-after-free.
2018-03-19 12:16:11 +02:00
Duarte Nunes
810db425a5 gms/gossiper: Synchronize endpoint state destruction
In gossiper::handle_major_state_change() we set the endpoint_state for
a particular endpoint and replicate the changes to other cores.

This is totally unsynchronized with the execution of
gossiper::evict_from_membership(), which can happen concurrently, and
can remove the very same endpoint from the map (in all cores).

Replicating the changes to other cores in handle_major_state_change()
can interleave with replicating the changes to other cores in
evict_from_membership(), and result in an undefined final state.

Another issue happened in debug mode dtests, where a fiber executes
handle_major_state_change(), calls into the subscribers, of which
storage_service is one, and ultimately lands on
storage_service::update_peer_info(), which iterates over the
endpoint's application state with deferring points in between (to
update a system table). gossiper::evict_from_membership() was executed
concurrently by another fiber, which freed the state the first one is
iterating over.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180318123211.3366-1-duarte@scylladb.com>
2018-03-18 14:38:04 +02:00
Avi Kivity
38e1eb5e42 Update scylla-ami submodule
* dist/ami/files/scylla-ami 5170011...9b4be70 (1):
  > do not special case i3 for controller code
2018-03-18 11:37:00 +02:00
Takuya ASADA
378bf7cec0 dist/debian: switch Debian9 to boost-1.65
We switched Debian8/Ubuntu14/Ubuntu16 to boost-1.65 to fix #3090, but Debian9
still uses the distribution-provided boost-1.62, which causes the same build error.
So switch it to our boost-1.65, too.

See c636f552e0

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521250590-16510-1-git-send-email-syuu@scylladb.com>
2018-03-18 10:23:43 +02:00
Daniel Fiala
10db711259 sstables: make_descriptor accepts sstable-directory, because it is necessary to determine cf and ks in 2.2 format.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-18 06:09:47 +01:00
Daniel Fiala
abdf22f5cd sstables: Throw more detail exception for unknown item in reverse_map.
* This can help with debugging.

Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-18 05:54:15 +01:00
Daniel Fiala
c5eca593fc sstables/compaction: Suppress NaN in a report of a throughput.
* It causes failures in dtest.

Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-18 05:46:32 +01:00
Glauber Costa
f5c32423b8 summary: don't go through all entries when computing memory size.
Summary has a function, memory_size(), that estimates the amount of
memory the summary takes. It is my understanding that this is called
to serve information to tooling.

First, this function is inaccurate because it doesn't take into account
the tokens per each entry, just the keys. But more importantly, it has
to iterate over all keys which can be pretty expensive if the entries
list is long. We are now keeping that in a memory area, with just
pointers in the entry. So instead of iterating through the entries, we
can iterate through the memory areas, which is much cheaper.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180316120915.16809-1-glauber@scylladb.com>
2018-03-16 12:57:19 +00:00
Duarte Nunes
fef9d4fa72 service/storage_service: Avoid superfluous seastar::thread
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180315212202.12176-1-duarte@scylladb.com>
2018-03-16 12:52:15 +00:00
Nadav Har'El
e9702aa126 Materialized Views: don't lose updates while cluster is changing
When the cluster is changed (nodes added or removed), ranges of tokens
are moved between nodes. Scylla initiates a streaming process between an
old and a new owner of the range, which can take a long time. During
that streaming time, the new owner of the range is known as a "pending node"
for this range, and all updates must go to both the old owner (in case the
movement fails!) and the pending node (in case the movement succeeds).

For materialized views, because they are ordinary tables, streaming moves
all the view's data that existed before the streaming started. But we did
not send updates done to the view *during* the streaming. A dtest
demonstrates that the new node will miss some of the view updates, and will
require a repair of the view tables immediately after the cluster change
ends, which is not good. To fix that, we need to send every new update
that happens during the streaming also to the "pending node". We already
did this properly for base-table updates, but not to the view updates:
Each base table replica wrote to only one paired view table replica,
and nobody wrote to the new pending node (in case where there is one,
for the particular view token involved).

In this patch, we make sure that all view updates go also to the "pending
nodes" when there are any. We do the same thing that Cassandra does, which
is - *all* base replicas write the update to the pending node(s).
Arguably, it is inefficient that all replicas send the update to the same
node. In most cases it is enough to send it from just one base replica -
the one who is slated to be the new node's pair.  I opened
https://issues.apache.org/jira/browse/CASSANDRA-14262 about this idea.
But that is an optimization. The patch as-is already fixes the bug.
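
The resulting write targets for one base replica can be sketched as below.
This is a hypothetical helper with node names as plain strings, not the
code in this patch.

```python
def view_update_targets(paired_view_replica, pending_nodes):
    """Nodes one base replica sends a view update to (hypothetical helper)."""
    targets = [paired_view_replica]
    # Every base replica also writes to each pending node for the view token.
    for node in pending_nodes:
        if node not in targets:
            targets.append(node)
    return targets
```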

Fixes #3211

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180313171853.17283-1-nyh@scylladb.com>
2018-03-16 12:00:29 +00:00
Duarte Nunes
934d805b4b Merge 'Grant default permissions' from Jesse
The functional change in this series is in the last patch
("auth: Grant all permissions to object creator").

The first patch addresses `const` correctness in `auth`. This change
allowed the new code added in the last patch to be written with the
correct `const` specifiers, and also some code to be removed.

The second-to-last patch addresses error-handling in the authorizer for
unsupported operations and is a prerequisite for the last patch (since
we now always grant permissions for new database objects).

Tests: unit (release)

* 'jhk/default_permissions/v3' of https://github.com/hakuch/scylla:
  auth: Grant all permissions to object creator
  auth: Unify handling for unsupported errors
  auth: Fix life-time issue with parameter
  auth: Fix `const` correctness
2018-03-16 09:43:36 +01:00
Avi Kivity
9eb7c0c65b Merge "Remove (some) reactor stalls in the SSTable code" from Glauber
"
This is an improvement on my latest series. Instead of just
dealing with the problem of destroying the Summary that I have
identified in a previous test, I have tried to find other sources
of stalls.

Some of them are on readers and would affect early processes and
operations like nodetool refresh.

Others are on writers, which can affect any SSTable being written.

Two of those stalls (on large filter, on summary read), I saw in a
synthetic benchmark where I used very small values + nodetool compact
to generate one SSTable with many keys. They were 80ms and 20ms
respectively, and now they are totally gone.

For others, I just tried to be safe (for instance, if we know
reading/writing large vectors can be costly, just always insert
preemption points in them).

With all of these patches applied, I no longer see stalls coming from
the SSTable code in those tests (although given enough time, I am sure I
can find more).

Tests: unit (release)
Fixes: #3282, Fixes #3281, Fixes #3269
"

* 'sstables-stalls-v3-updated' of github.com:glommer/scylla:
  large_bitset/bloom filter: add preemption points in loops
  sstables: read filter in a thread
  abstract summary entry version of the token with a token view
  add a token_view
  sstables: rework summary entries reading
  sstables: avoid calls to resize for vectors
  sstables: replace potentially large for loop with do_until
  summary_entry: do not store key bytes in each summary entry
  tests: change tests to make summary non-copyable
  chunked_vector: do not iterate to destruct trivially destructible types
2018-03-16 09:43:36 +01:00
Glauber Costa
7fd31088f2 large_bitset/bloom filter: add preemption points in loops
SSTables that contain many keys - a common case with small partitions in
long lived nodes - can generate filters that are quite large.

I have seen stalls over 80ms when reading a filter that was the result
of a 6h write load of very small keys after nodetool compact (filter was
in the 100s of MB)

Similar care should be taken when creating the filter, as if the
estimated number of partitions is big, the resulting large_bitset can be
quite big as well.

If we treat the i_filter.hh and large_bitset.hh interfaces as truly
generic, then maybe we should have an in_thread version along with a
common version. But the bloom filter is the only user for both and even
if that changes in the future, it is still a good idea to run something
with a massive loop in a thread.

So for simplicity, I am just asserting that we are on a thread to avoid
surprises, and inserting preemption points in the loops.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-15 12:24:15 -04:00
Glauber Costa
c424ba01df sstables: read filter in a thread
Constructing filter objects can be quite expensive. We will insert some
yield points around, and that is made a lot easier if we are calling
things from a thread.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-15 12:24:15 -04:00
Glauber Costa
e680c7c8cc abstract summary entry version of the token with a token view
dht::token doesn't have a trivial destructor, so destroying an array
full of those can be quite expensive. If we use the same trick as we
used for the summary - storing the token data in a stable memory
location - we can leave the entries with a trivial destructor and destroy
the chunks themselves. Those being larger, they will be more efficient
to delete.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-15 12:24:15 -04:00
Glauber Costa
dddc7e1676 add a token_view
Ideally we would like tokens to be trivially destructible, so that we
can easily dispose of giant vectors holding them. While that is hard to
do with our current infrastructure, we can introduce a token_view, which
holds bytes_view elements instead of the real data - making it
trivially destructible.

The comparators are then changed to take a token_view, and an implicit
conversion function is provided from tokens so they get compared.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-15 12:24:09 -04:00
Duarte Nunes
9da2b66cff cql3/untyped_result_set: Conform to boost::range concept
Enable some of that boost::copy_range goodness.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180315121801.2808-1-duarte@scylladb.com>
2018-03-15 13:34:44 +01:00
Takuya ASADA
69d226625a dist/ami: update CentOS base image to latest version
Since we require an updated version of systemd, we need to update the CentOS
base image.

Fixes #3184

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1518118694-23770-1-git-send-email-syuu@scylladb.com>
2018-03-15 10:47:37 +02:00
Takuya ASADA
945e6ec4f6 dist/debian: use 3rdparty ppa on Ubuntu 18.04
Currently Ubuntu 18.04 uses the distribution-provided g++ and boost, but the
Scylla package is easier to maintain when it is built with the same
toolchain/library versions everywhere, so switch to the 3rdparty ppa.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521075576-12064-1-git-send-email-syuu@scylladb.com>
2018-03-15 10:41:05 +02:00
Takuya ASADA
1bb3531b90 dist/redhat: build only scylla, iotune
Since we don't package tests, we don't need to build them.
It reduces package building time.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521066363-4859-1-git-send-email-syuu@scylladb.com>
2018-03-15 10:40:35 +02:00
Takuya ASADA
856dc0a636 dist/redhat: switch to gcc-7.3
We have hit the following bug in the debug-mode binary:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82560
Since it's fixed in gcc-7.3, we need to upgrade our gcc package.

See: https://groups.google.com/d/topic/scylladb-dev/RIdIpqMeTog/discussion
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521064473-17906-1-git-send-email-syuu@scylladb.com>
2018-03-15 10:39:25 +02:00
Vladimir Krivopalov
5c3b32a9bf Remove to_boost_visitor helper.
The minimal Boost version required for Scylla now is 1.58 and this
helper is no longer needed.
Replaced it with more generic visitation utils from Seastar.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <e589ace7ac411d3d55dead475a8a2271f51642f1.1520976010.git.vladimir@scylladb.com>
2018-03-14 23:49:07 +00:00
Avi Kivity
bb4b1f0e91 Merge "Ubuntu/Debian build error fixes" from Takuya
* 'debian-ubuntu-build-fixes-v2' of https://github.com/syuu1228/scylla:
  dist/debian: build only scylla, iotune
  dist/debian: switch to boost-1.65
  dist/debian: switch to gcc-7.3
2018-03-14 22:50:40 +02:00
Takuya ASADA
7f891e7a48 dist/debian: build only scylla, iotune
Since we don't package tests, we don't need to build them.
It reduces package building time.
2018-03-15 04:33:11 +09:00
Glauber Costa
89b28a4bea sstables: rework summary entries reading
Like we did for generic arrays, let's move away from resize() in trying
to read summary entries and move to a reserve/push pattern.

I have tested this patch reading a summary file that is a couple of MB big.
Stalls up to 20ms were seen. After applying this patch, no such stalls
are present.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-14 13:35:15 -04:00
Glauber Costa
a33f0d6f92 sstables: avoid calls to resize for vectors
resize is considered harmful, since it will attempt to allocate memory
and initialize each element of the vector. This can cause reactor stalls
that correlate with latency peaks.

A better idiom is reserve first - so we know we will have enough memory
and won't have to move contents - and push_back/emplace_back each
individual member.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-14 13:32:36 -04:00
Glauber Costa
0d9488eae6 sstables: replace potentially large for loop with do_until
We are pushing ints here, so it shouldn't be that bad in practice.
But a potentially gigantic for loop is just asking for a stall since we won't
need_preempt() it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-14 11:58:03 -04:00
Glauber Costa
091b0f9d41 summary_entry: do not store key bytes in each summary entry
If we store a bytes_view instead of bytes, that has a trivial destructor
and then we don't need to destroy each element individually. To do that,
we allocate the data in a couple of large arrays which can be disposed of
easily and point to it.

We still can't destroy trivially because of the token.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-14 10:46:20 -04:00
Glauber Costa
d15bfbe548 tests: change tests to make summary non-copyable
Right now the summary can be copied, but in real life there is no reason
for this to be a requirement. Tests want it, so we can destroy a summary,
load another, and compare the two. We can achieve this by allowing the first
summary to be moved, and then we can still have a reference to the second.

I am about to make a change that will make the summary not copyable as a
requirement, so we need to do this first.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-14 10:46:20 -04:00
Glauber Costa
00d04b49a0 chunked_vector: do not iterate to destruct trivially destructible types
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-14 09:16:54 -04:00
Takuya ASADA
c636f552e0 dist/debian: switch to boost-1.65
We get the following compile error on Debian/Ubuntu with boost-1.63:

/opt/scylladb/include/boost/intrusive/pointer_plus_bits.hpp:76:48: error: '*((void*)& __tmp +136)' is used uninitialized in this function [-Werror=uninitialized]
       n = pointer(uintptr_t(p) | (uintptr_t(n) & Mask));
                                         ~~~~~~~~~~~~~~^~~~~~~

This is a known issue (https://github.com/boostorg/intrusive/issues/29), fixed
in boost-1.65.

Switch to boost-1.65 to fix the issue.

Fixes #3090
2018-03-14 22:13:24 +09:00
Avi Kivity
a0bc126ae2 Merge seastar upstream
* seastar bcfbe0c...a66cc34 (3):
  > reactor: fix sleep mode
  > cpu scheduler: don't penalize first group to run
  > Simple shellscript to find out which logical CPU's shards are mapped to
2018-03-14 14:14:21 +02:00
Asias He
9b5585ebd5 range_streamer: Stream 10% of ranges instead of 10 ranges per time
If there are a lot of ranges, e.g., num_tokens=2048, 10 ranges per
stream plan will cause tons of stream plans to be created to stream data,
each carrying very little data. As a result, each stream plan has low
transfer bandwidth, so the total time to complete the streaming increases.

It makes more sense to send a percentage of the total ranges per stream
plan than a fixed number of ranges.
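
A minimal sketch of percentage-based batching, assuming integer truncation
and a floor of one range per plan (the actual rounding in range_streamer is
not stated here). The function name is hypothetical.

```python
def stream_plan_batches(ranges, percent=10):
    """Split `ranges` into batches of about `percent`% of the total each."""
    # Assumed rounding: integer truncation, with at least one range per plan.
    per_plan = max(1, len(ranges) * percent // 100)
    return [ranges[i:i + per_plan] for i in range(0, len(ranges), per_plan)]
```

Under these assumptions, 513 ranges give 51 ranges per plan: ten full plans
covering 510 ranges, plus a final plan with the remaining 3.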

Here is an example to stream a keyspace with 513 ranges in
total, 10 ranges vs. 10% of ranges:

Before:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 51
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 107 seconds

After:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 10
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 22 seconds

Message-Id: <a890b84fbac0f3c3cc4021e30dbf4cdf135b93ea.1520992228.git.asias@scylladb.com>
2018-03-14 10:12:12 +02:00
Asias He
ad7b132188 Revert "streaming: Do not abort session too early in idle detection"
This reverts commit f792c78c96.

With the "Use range_streamer everywhere" (7217b7ab36) series,
all users of streaming now stream relatively small ranges
and can retry streaming at a higher level.

Reduce the time-to-recover from 5 hours to 10 minutes per stream session.

Even if the 10-minute idle detection might cause more false positives,
it is fine, since we can retry the "small" stream session anyway. In the
long term, we should replace the whole idle detection logic with a rule that
whenever the stream initiator goes away, the stream slave goes away too.

Message-Id: <75f308baf25a520d42d884c7ef36f1aecb8a64b0.1520992219.git.asias@scylladb.com>
2018-03-14 10:11:00 +02:00
Jesse Haber-Kucharsky
6a360c2d17 auth: Grant all permissions to object creator
When a table, keyspace, or role is created, the creator now is
automatically granted all applicable permissions on the object.

This behavior is consistent with Apache Cassandra.

Fixes #3216.
2018-03-14 01:54:31 -04:00
Jesse Haber-Kucharsky
c502fe24ce auth: Unify handling for unsupported errors
Instead of some functions in `allow_all_authorizer` throwing exceptions
and others being silently pass-through, we consistently return exception
futures with `auth::unsupported_authorization_operation`. These errors
are converted to `invalid_request_exception` in the CQL error and
ignored where appropriate in the auth subsystem.
2018-03-14 01:54:28 -04:00
Jesse Haber-Kucharsky
97235445d3 auth: Fix life-time issue with parameter 2018-03-14 01:32:53 -04:00
Jesse Haber-Kucharsky
9117a689cf auth: Fix const correctness
This patch came about because of an important (and obvious, in
hindsight) realization: instances of the authorizer, role manager, and
authenticator are clients for access-control state and not the state
itself. This is reflected directly in Scylla: `auth::service` is
sharded across cores and this is possible because each instance queries
and modifies the same global state.

To give more examples, the value of an instance of `std::vector<int>` is
the structure of the container and its contents. The value of `int
file_descriptor` is an identifier for state maintained elsewhere.

Having watched an excellent talk by Herb Sutter [1] and having read an
informative blog post [2], it's clear that a member function marked
`const` communicates that the observable state of the instance is not
modified.

Thus, the member functions of the role-manager, authenticator, and
authorizer clients should not be marked `const` only if the state of the
client itself is observably changed. By this principle, member functions
which do not change the state of the client, but which mutate the global
state the client is associated with (for example, by creating a role)
are marked `const`.

The `start` (and `stop`) functions of the client have the dual role of
initializing (finalizing) both the local client state and the
external state; they are not marked `const`.

[1] https://herbsutter.com/2013/01/01/video-you-dont-know-const-and-mutable/

[2] http://talesofcpp.fusionfenix.com/post-2/episode-one-to-be-or-not-to-be-const
2018-03-14 01:32:43 -04:00
Avi Kivity
f8613a8415 Merge "Save and recall queriers for paged singular-mutation queries" from Botond
"
Terms
-----

querier: A class encapsulating all the logic and state needed to fill a
page. This includes the reader, the compact_mutation object and all
associated state.

Preamble
--------

Currently for paged-queries we throw away all readers, compactors and
all associated state that contributed to filling the page and on the
next page we create them from scratch again. Thus on each page we throw
away a considerable amount of work, only to redo it again on the next
page. This has been one of the major contributors to latencies as from
the point of view of a replica each page is as much work as a fresh
query.

Solution
--------

The solution presented in this patch-series is to save queriers after
filling a page and reuse them on the next pages, thus doing the
considerable amount of work involved with creating them only once.
On each page the coordinator will generate a UUID that identifies this
page. This UUID is used as the key, under which the contributing
queriers will be saved in the cache. On the next page the UUID from the
previous page will be used to lookup saved queriers, and the one from
the current one to save them afterwards (if the query isn't finished).
These UUIDs (reader_recall_uuid and reader_save_uuid) are attached to
the page-state. Also attached to the page state is the list of replicas
hit on the last page. On the next page this list will be consulted to
hit the same replicas again, thus reusing the queriers saved on them.
Cached queriers will be evicted after a certain period of time to avoid
unnecessary resource consumption by abandoned reads.
Cached queriers may also be evicted when the shard faces
resource-pressure, to free up resources.

Splitting up the work
---------------------

This series only fixes the singular-mutation query path, that is queries
that either fetch a single partition, or several single partitions (IN
queries). The fix for the scanning query path will be done in a
follow-up series, however much of the infrastructure needed for the
general querier reuse is already introduced by this series.

Ref #1865

Tests: unit-tests(debug, release), dtests(paging_test, paging_additional_test)

Benchmarking summary (read-from-disk)
-------------------------------------

1) Latency

BEFORE
latency mean              : 58.0
latency median            : 57.4
latency 95th percentile   : 68.8
latency 99th percentile   : 79.9
latency 99.9th percentile : 93.6
latency max               : 93.6

AFTER
latency mean              : 41.3
latency median            : 40.5
latency 95th percentile   : 50.8
latency 99th percentile   : 68.9
latency 99.9th percentile : 89.2
latency max               : 89.2

2) Throughput (single partition query)

sum(scylla_cql_reads):
BEFORE: 173'567
AFTER:  427'774

+146% (2.46x)

3) Throughput (IN query, 2 partitions)

sum(scylla_cql_reads):
BEFORE: 85'637
AFTER: 127'431

+49% (1.49x)
"

* '1865/singular-mutations/v8.2' of https://github.com/denesb/scylla: (23 commits)
  Add unit test for resource based cache eviction
  Add unit tests for querier_cache
  Add counters to monitor querier-cache efficiency
  Memory based cache eviction
  Add buffer_size() to flat_mutation_reader
  Resource-based cache eviction
  Time-based cache eviction
  Save and restore queriers in mutation_query() and data_query()
  Add the querier_cache_context helper
  Add querier_cache
  Add querier
  Add are_limits_reached() compact_mutation_state
  Add start_new_page() to compact_mutation_state
  Save last key of the page and method to query it
  Make compact_mutation reusable
  Add the CompactedFragmentsConsumer
  Use the last_replicas stored in the page_state
  query_singular(): return the used replicas
  Consider preferred replicas when choosing endpoints for query_singular()
  Add preferred and last replicas to the signature of query()
  ...
2018-03-13 18:38:59 +02:00
Botond Dénes
c0009750c3 Add unit test for resource based cache eviction
Specifically for the reader-permit based eviction. This test lives in a
separate executable as it uses with_cql_test_env() and thus needs a
main() of its own.
2018-03-13 16:20:50 +02:00
Botond Dénes
c53b6f75c8 Add unit tests for querier_cache 2018-03-13 12:59:45 +02:00
Avi Kivity
636760c282 Merge "Introduce JSON output format to perf_fast_forward tests." from Vladimir
"
This patchset is a part of a bigger effort for bringing our
microbenchmarking tests from the source tree to be used for regression
testing purposes with CI.

Now, it is possible to export results of tests run into JSON format that
can be stored in ElasticSearch and compared among runs to detect
performance degradation should it happen.

Example of JSON output (formatted for readability):
{
	"results" :
	{
		"parameters" :
		{
			"read" : "64",
			"read,skip,test_run_count" : "64,256,1",
			"skip" : "256",
			"test_run_count" : 1
		},
		"stats" :
		{
			"(KiB)" : 126960,
			"aio" : 993,
			"blocked" : 208,
			"c blk" : 1,
			"c hit" : 0,
			"c miss" : 1,
			"cpu" : 99.779365539550781,
			"dropped" : 0,
			"frag/s" : 311939.61559016741,
			"frags" : 200000,
			"idx blk" : 0,
			"idx hit" : 0,
			"idx miss" : 0,
			"time (s)" : 0.641149729
		}
	},
	"test_group_properties" :
	{
		"message" : "Testing scanning large partition with skips.\nReads whole range interleaving reads with skips according to read-skip pattern",
		"name" : "large-partition-skips",
		"needs_cache" : false,
		"partition_type" : "large"
	},
	"versions" :
	{
		"scylla-server" :
		{
			"commit_id" : "4acfa17f4",
			"date" : "20180306",
			"run_date_time" : "2018-16-06 12:16:41",
			"version" : "666.development"
		}
	}
}
"

* 'issues/2947/v6' of https://github.com/argenet/scylla:
  Add support for JSON output format for perf_fast_forward results.
  Wrap output for customization. Move all output handling to a single managing class.
2018-03-13 12:37:34 +02:00
Benoît Canet
1d0cc7cf20 messaging_service: Start messaging service earlier
The messaging service was completely started only
after a bootstrapping node finished joining, hence
leading to #2034.

Fixes #2034
Message-Id: <20180313084500.27265-1-amnon@scylladb.com>
2018-03-13 10:59:53 +02:00
Botond Dénes
b2f75a6c53 Add counters to monitor querier-cache efficiency
Add the following counters:
(1) querier_cache_lookups
(2) querier_cache_misses
(3) querier_cache_drops
(4) querier_cache_time_based_evictions
(5) querier_cache_resource_based_evictions
(6) querier_cache_memory_based_evictions
(7) querier_cache_population

(1) counts the total number of querier cache lookups. Not all
page-fetches will result in a querier lookup. For example the first page
of a query will not do a lookup as there was no previous page to reuse
the querier from. The second, and all subsequent pages however should
attempt to reuse the querier from the previous page.
(2) counts the subset of (1) where the read missed the querier
cache (failed to find a matching saved querier).
(3) counts the subset of (1) where the querier was recalled and dropped
immediately. This can happen for example if the querier was at the wrong
position.
(4) counts the cached queriers that were evicted due to their TTL
expiring.
(5) counts the cached queriers that were evicted due to reader-resource
(those limited by reader-concurrency limits) shortage.
(6) counts the cached queriers that were evicted due to reaching the
cache's memory limits (currently set to 4% of the shard's memory).
(7) is the current number of entries in the cache.

Note:
* The count of cache hits can be derived from these counters as
(1) - (2).
* cache_drop (3) also implies a cache hit (see above). This means that
the number of actually reused queriers is:
(1) - (2) - (3)
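
The derivations above can be written out as a small sketch. The
`querier_cache_stats` struct is a hypothetical snapshot of the counters,
introduced only for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of counters (1), (2) and (3) from the list above.
struct querier_cache_stats {
    uint64_t lookups = 0;  // (1)
    uint64_t misses = 0;   // (2)
    uint64_t drops = 0;    // (3)
};

// Hits are the lookups that found a saved querier: (1) - (2).
inline uint64_t cache_hits(const querier_cache_stats& s) {
    assert(s.misses <= s.lookups);
    return s.lookups - s.misses;
}

// Actually reused queriers are hits that were not dropped: (1) - (2) - (3).
inline uint64_t reused_queriers(const querier_cache_stats& s) {
    assert(s.drops <= cache_hits(s));
    return cache_hits(s) - s.drops;
}
```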
2018-03-13 10:34:34 +02:00
Botond Dénes
8513549b55 Memory based cache eviction
To bound the memory consumption of the querier-cache the total memory
consumption of the cached queriers is limited to 4% of the shard's total
memory.
When inserting a new querier it is first checked whether its insertion
would cause the limit to be crossed. If this is the case existing
entries are evicted until the memory consumption is sufficiently reduced
so that after inserting the querier it stays below the limit.
Cached queriers are evicted in LRU order as the oldest queriers are the
most likely to be evicted based on their TTL anyway.
To calculate the memory consumption of the cached queriers
flat_mutation_reader::buffer_size() is used. While this is not very
precise as it doesn't include object sizes and member containers it
gives a good picture of the memory consumption of the queriers.

Memory based cache eviction overlaps with resource-based cache eviction
but only to some degree as that only accounts the memory consumption of
sstable readers.
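
A minimal sketch of the evict-until-it-fits policy, under simplifying
assumptions: `cached_querier` and `memory_bounded_cache` are hypothetical
names, the `memory` field stands in for what buffer_size() would report, and
an entry larger than the whole limit is still inserted after everything else
has been evicted.

```cpp
#include <cassert>
#include <cstddef>
#include <list>

// Hypothetical cache entry: only the accounted size matters here.
struct cached_querier {
    int id;
    std::size_t memory;  // stand-in for the reader's buffer_size()
};

class memory_bounded_cache {
    std::list<cached_querier> _lru;  // front = oldest entry
    std::size_t _used = 0;
    std::size_t _limit;
public:
    explicit memory_bounded_cache(std::size_t limit) : _limit(limit) {}

    // Evicts the oldest entries until the new one fits under the limit,
    // then inserts it. Returns the number of evicted entries.
    std::size_t insert(cached_querier q) {
        std::size_t evicted = 0;
        while (!_lru.empty() && _used + q.memory > _limit) {
            _used -= _lru.front().memory;
            _lru.pop_front();
            ++evicted;
        }
        _used += q.memory;
        _lru.push_back(q);
        return evicted;
    }

    std::size_t used() const { return _used; }
    std::size_t size() const { return _lru.size(); }
};
```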
2018-03-13 10:34:34 +02:00
Botond Dénes
f488ae3917 Add buffer_size() to flat_mutation_reader
buffer_size() exposes the collective size of the external memory
consumed by the mutation-fragments in the flat reader's buffer. This
provides a basis to build basic memory accounting on. Although this is
not the entire memory consumption of any given reader it is the most
volatile component and usually by far the largest one too.
2018-03-13 10:34:34 +02:00
Botond Dénes
212b2dabc4 Resource-based cache eviction
Readers serving user-reads need to obtain a permit to start reading.
There exists a restriction on how many active readers can be admitted
based on their count and their memory consumption.
Since the saved readers of cached queriers are technically active (they
hold a permit) they can block new readers from obtaining a permit.
New readers have a higher priority because a cached reader might be
abandoned or used later at best so in the face of memory pressure we
evict cached readers to free up permits for new readers.
Cached queriers are evicted in LRU order as the oldest queriers are the
most likely to be evicted based on their TTL anyway.
2018-03-13 10:34:34 +02:00
Botond Dénes
d5bcadcfda Time-based cache eviction
Cached queriers should not sit in the cache indefinitely otherwise
abandoned reads would cause excess and unnecessary resource-usage. Attach
an expiry timer to each cache-entry which evicts it after the TTL
passes.
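
A minimal sketch of TTL-based expiry. Note that this is a simplification: the
patch attaches a timer to each entry, whereas this example polls with an
explicit `now` argument so that it stays self-contained. `ttl_cache` and its
members are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <iterator>
#include <map>

using clock_type = std::chrono::steady_clock;

// Hypothetical cache that indexes entry ids by their expiry time.
class ttl_cache {
    std::multimap<clock_type::time_point, int> _by_expiry;
    clock_type::duration _ttl;
public:
    explicit ttl_cache(clock_type::duration ttl) : _ttl(ttl) {}

    void insert(int id, clock_type::time_point now) {
        _by_expiry.emplace(now + _ttl, id);
    }

    // Removes every entry whose expiry time is at or before `now`.
    // Returns the number of evicted entries.
    std::size_t evict_expired(clock_type::time_point now) {
        auto end = _by_expiry.upper_bound(now);
        auto n = static_cast<std::size_t>(std::distance(_by_expiry.begin(), end));
        _by_expiry.erase(_by_expiry.begin(), end);
        return n;
    }

    std::size_t size() const { return _by_expiry.size(); }
};
```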
2018-03-13 10:34:34 +02:00
Botond Dénes
ff808d9ce6 Save and restore queriers in mutation_query() and data_query()
Use the querier_cache (represented by the passed-in
querier_cache_context) object to lookup saved queriers at the start of
the page and save them at the end of it if it is likely that there will
be more page requests.
2018-03-13 10:34:34 +02:00
Botond Dénes
cab38c9f81 Add the querier_cache_context helper
querier_cache_context is supposed to make propagating the cache and the
key down the layers easier. It comes bundled with some of the required
parameters (the lookup and save state) and also hides all of the
boiler-plate of dealing with the cache (checking whether the key is
non-empty, etc.). It also makes it possible to not use the cache and
hide this from the lower layers.
2018-03-13 10:34:34 +02:00
Botond Dénes
bbfe17437e Add querier_cache
This is the cache where suspended queriers are going to be saved between
pages. This is not a general purpose cache. It caters to the specific
needs of the querier recall mechanism. More specifically:
(1) Cache entries are single-use: they are inserted once and the first
lookup removes them. Multiple items may be stored under a single key.
Identifying the correct one happens based on additional information like
the query range. Lookup knows to drop queriers when they cannot be used
to serve the next page.
(2) Cache entries are evicted after a certain time to avoid the
depletion of resources due to abandoned reads.
(3) Cache entries are evicted when facing reader-permit shortage, until
either enough permits are freed up or all entries are evicted.
(4) A memory limiter is set up which keeps the total memory consumption
of the cache under a limit (4% of memory) by evicting the oldest entries
when inserting a new one would cause the total memory consumption to go
above the limit.
(5) It updates the relevant counters of the db_stats.

This patch only implements (1), the other features will be implemented
in their own patches.
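
The single-use semantics of (1) can be sketched as follows (C++17). This is
an illustration only: `saved_querier`, `single_use_cache`, the integer key
(standing in for a UUID) and the string range are all assumptions made for
the example.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical saved querier; `range` stands in for the additional
// information used to pick the right entry under a key.
struct saved_querier {
    std::string range;
    int position;
};

class single_use_cache {
    // A multimap, since several entries may be stored under one key.
    std::multimap<unsigned, saved_querier> _entries;
public:
    void insert(unsigned key, saved_querier q) {
        _entries.emplace(key, std::move(q));
    }

    // Removes and returns the entry under `key` that matches `range`,
    // or returns nothing if there is no such entry.
    std::optional<saved_querier> lookup(unsigned key, const std::string& range) {
        auto [first, last] = _entries.equal_range(key);
        for (auto it = first; it != last; ++it) {
            if (it->second.range == range) {
                saved_querier q = std::move(it->second);
                _entries.erase(it);
                return q;
            }
        }
        return std::nullopt;
    }
};
```

A second lookup with the same key and range finds nothing, because the first
lookup removed the entry.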
2018-03-13 10:34:34 +02:00
Botond Dénes
7a5143a670 Add querier
The querier encapsulates all objects needed to serve queries, except
result builders. It is designed to be suspendable, savable and
resumable. It contains all logic needed to suspend, resume and determine
whether the querier can be resumed or not.
It is the foundation upon which the "reader-reuse" mechanism is built.
2018-03-13 10:34:34 +02:00
Botond Dénes
84d872babf Add are_limits_reached() compact_mutation_state
are_limits_reached() allows querying whether the compactor reached
the page's limits. This is needed to determine whether there will be
more pages and thus whether the compact_mutation_state has to be kept
around.
2018-03-13 10:34:34 +02:00
Botond Dénes
2c1081b0e9 Add start_new_page() to compact_mutation_state
start_new_page() resets the limits to the current page's ones and
sets the _empty_partition flag so that the partition header (if the last
page finished inside a partition) will be reemitted.
2018-03-13 10:34:34 +02:00
Botond Dénes
3fca8aaefb Save last key of the page and method to query it
Make a copy of the current decorated-key in consume_end_of_stream() so
that it persists while the compaction state is suspended.
Also add current_partition() to allow client code to query the partition
the compaction is positioned in. This is needed to determine whether
the start position of the next page matches that of the
compact_mutation_state.
2018-03-13 10:34:34 +02:00
Botond Dénes
2fcc99fe43 Make compact_mutation reusable
Currently compact_mutation is used as a use-once-then-throw-away object.
After it satisfies its consumer it's destroyed together with the
consumer. This conflicts with the effort to save and reuse readers and
associated infrastructure between pages of a query.

To resolve this conflict compact_mutation is split into two classes:
(1) compact_mutation_state
(2) compact_mutation

compact_mutation_state encapsulates all the compaction logic and state,
while compact_mutation continues to provide the same API using
compact_mutation_state behind the scenes.
compact_mutation_state doesn't store the consumer, instead its
consume_* methods are templated on the consumer and take it as an
argument. This allows compact_mutation_state to be independent of the
consumer's type.
Additionally compact_mutation can now be constructed from a shared
pointer to compact_mutation_state. This allows client code to
pre-construct a compaction state and retain it after the
compact_mutation object is destroyed.
These changes allow the state of a compaction to be saved and restored
later while code that is only interested in storing the saved state
can stay independent of the consumer's type.

This patch only contains the splitting of compact_mutation into
compact_mutation and compact_mutation_state. The next patches will add
the missing functionality that is needed to make compact_mutation_state
truly reusable across pages.
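
The shape of the split can be sketched as below. This is a heavily simplified
illustration, not the real classes: `compaction_state` and `compactor` are
hypothetical names, and the "compaction" merely counts fragments.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Holds the state; independent of any consumer type.
class compaction_state {
    int _consumed = 0;
public:
    // The consumer is a template parameter of the method, not of the
    // class, so the state can be stored without knowing the consumer.
    template <typename Consumer>
    void consume(int fragment, Consumer& consumer) {
        ++_consumed;
        consumer(fragment);
    }

    int consumed() const { return _consumed; }
};

// Keeps the consumer-bound API and delegates to the shared state.
template <typename Consumer>
class compactor {
    std::shared_ptr<compaction_state> _state;
    Consumer _consumer;
public:
    compactor(std::shared_ptr<compaction_state> state, Consumer consumer)
        : _state(std::move(state)), _consumer(std::move(consumer)) {}

    void consume(int fragment) { _state->consume(fragment, _consumer); }
};
```

Because the state is held through a shared pointer, it outlives the
`compactor` object and can be handed to a new one later.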
2018-03-13 10:34:34 +02:00
Botond Dénes
7bd500049d Add the CompactedFragmentsConsumer
Dust off the commented-out CompactMutationConsumer concept, make it usable and
rename it to CompactedFragmentsConsumer (as we now have flat readers).
2018-03-13 10:34:34 +02:00
Botond Dénes
f1171803b5 Use the last_replicas stored in the page_state
Pass the last_replicas from the page_state as the preferred_replicas
for query() and save the returned last_replicas as the last_replicas
field of the next page_state. The circle is now complete. The first page
of any query will pass an empty list as the preferred replicas (having
no previous paging_state) so the replicas will be selected according to
the load-balancing strategy. Any subsequent page will use the last
replicas from the last page as the preferred ones for the current one.
Thus if all goes well all pages of a query will hit the same replicas.
2018-03-13 10:34:34 +02:00
Botond Dénes
536a32bb5e query_singular(): return the used replicas
This patch implements the last_replicas returning part of the query()
signature changes for singular queries. It allows for client code to
save the last returned replicas and pass it to query() on the next page
as the preferred-replicas parameter, thus facilitating the read requests
for the next page hitting the same replicas.
2018-03-13 10:34:34 +02:00
Botond Dénes
aaf67bcbaa Consider preferred replicas when choosing endpoints for query_singular()
Propagate the preferred_replicas to db::filter_for_query() and consider
them when selecting the endpoints. The algorithm for selecting the
endpoints is as follows:
* Compute the intersection of the endpoint candidates and the
preferred endpoints.
* If this yields a set of endpoints that already satisfies the CL
requirements use this set.
* Otherwise select the remaining endpoints according to the
load-balancing strategy, just like before.
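
The steps above can be sketched as follows. This is not the code of
db::filter_for_query(); `select_endpoints` is a hypothetical function that
assumes `candidates` is already ordered by the load-balancing strategy and
that the CL requirement is expressed as a plain replica count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using endpoint = std::string;

static bool contains(const std::vector<endpoint>& v, const endpoint& e) {
    return std::find(v.begin(), v.end(), e) != v.end();
}

std::vector<endpoint> select_endpoints(const std::vector<endpoint>& candidates,
                                       const std::vector<endpoint>& preferred,
                                       std::size_t required) {
    // Step 1: intersection of the candidates and the preferred endpoints.
    std::vector<endpoint> selected;
    for (const auto& c : candidates) {
        if (contains(preferred, c)) {
            selected.push_back(c);
        }
    }
    // Step 2: if the intersection already satisfies the requirement, use it.
    if (selected.size() >= required) {
        return selected;
    }
    // Step 3: otherwise fill up with the remaining candidates, in the
    // order given by the (assumed) load-balancing strategy.
    for (const auto& c : candidates) {
        if (selected.size() == required) {
            break;
        }
        if (!contains(selected, c)) {
            selected.push_back(c);
        }
    }
    return selected;
}
```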
2018-03-13 10:34:34 +02:00
Botond Dénes
eac597d726 Add preferred and last replicas to the signature of query()
preferred_replicas are added to the parameters and last_replicas are
added to the return type. The preferred replicas will be used as a hint
for the selection of the replicas to send the read requests to. The last
replicas (returned) are the replicas actually selected for the read.
This will allow queries to consistently hit the same replicas for each
page thus reusing readers created on these replicas.
For convenience a query() overload is provided that doesn't take or
return the preferred and last replicas.

This patch only adds the parameters and propagates them down to
query_singular() and query_partition_key_range(). The code to actually
use these preferred-replicas will be added in later patches.
The reason for separating this is to reduce noise and improve
reviewability for those functional changes later.
2018-03-13 10:34:34 +02:00
Botond Dénes
f281b3e923 Add last_replicas to paging_state
Helps paged queries consistently hit the same replicas for each
subsequent page. Replicas that already served a page will keep the
readers used for filling it around in a cache. Subsequent page requests
hitting the same replicas can reuse these readers to fill the pages
avoiding the work of creating these readers from scratch on every page.
In a mixed cluster older coordinators will ignore this value.
The value of last_replicas may change between pages as nodes may become
available/unavailable or the coordinator may decide to send the read
requests to different replicas at its discretion.
Replicas are identified by an opaque uuid which should only make sense
to the storage-proxy.
2018-03-13 10:34:34 +02:00
Nadav Har'El
fa284f6307 Add query UUID to read command
This patch adds the parameter to read_command which is needed for
caching of readers during multiple pages of a paged query, which
we will introduce in the next patches.

The query_uuid is a UUID of a previously saved reader, which
the replica is now asked to recall and resume (if this saved reader is
no longer in the cache, it is fine, a new reader will be started).

Additionally a helper flag is_first_page is added so that the replica
can avoid doing any cache lookups (and incrementing miss counters) for
the first page.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-03-13 10:34:34 +02:00
Nadav Har'El
ec7c56d18a Add query UUID to paging state
This patch adds to the "paging_state", the opaque cookie that clients are
supposed to provide when asking for the next page on a paged query, a
unique id field. This new field will be used to tell that a new request
for a page really continues the previous page, and doesn't just by chance
start at the same position the previous page stopped.

We need to support setups with mixed versions - a client may get a paging
state from a coordinator running a new version of Scylla and send it to
a different coordinator running an old version - or vice versa. So the new
uuid field is set up to have a default uuid of UUID() (a recognizable
invalid uuid 0), so new versions receiving no uuid from an old version will
set this invalid uuid, and old versions receiving a uuid from a new version
will simply ignore it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-03-13 10:34:34 +02:00
Avi Kivity
78a9ab827e Merge seastar upstream
* seastar 42159d4...bcfbe0c (1):
  > core: fix directory scanning by returning actual entry type

Fixes #3274 (hopefully).
2018-03-12 20:58:44 +02:00
Duarte Nunes
36b8c1043d Merge 'Reduce dependencies on messaging_service.hh' from Avi
Refactor some includes to reduce dependencies on messaging_service.hh,
which can change quite a lot as it includes many unrelated items itself.

Tests: build

* tag 'includes/messaging_service.hh/v1' of https://github.com/avikivity/scylla:
  tests: reduce dependencies in test_services.hh
  migration_manager: remove dependency on messaging_service.hh in header
  messaging_service: move msg_addr into its own header file
2018-03-12 18:49:13 +00:00
Avi Kivity
bd7881066a tests: reduce dependencies in test_services.hh
Convert storage_service_for_test to a pimpl implementation to
reduce dependencies.  Tests that depended on those includes were
fixed to include their dependencies directly.
2018-03-12 20:05:23 +02:00
Avi Kivity
5f2600a71d migration_manager: remove dependency on messaging_service.hh in header
Use the new msg_addr.hh header to remove a dependency on
messaging_service.hh.
2018-03-12 20:05:23 +02:00
Avi Kivity
dd12214628 messaging_service: move msg_addr into its own header file
Make it possible to use msg_addr without depending on messaging_service.hh.
2018-03-12 20:05:23 +02:00
Avi Kivity
af383228fb locator: remove empty file locator.cc
Empty but for compiler-time-consuming includes.
Message-Id: <20180312073018.21646-1-avi@scylladb.com>
2018-03-12 10:32:26 +01:00
Avi Kivity
29d0a46220 locator: add copyright and license statements to production_snitch_base.cc
Message-Id: <20180312073104.21840-1-avi@scylladb.com>
2018-03-12 10:30:48 +01:00
Asias He
8624467e26 utils: Remove utils/utils.cc
It was used to make sure the header compiled in the early days.
Message-Id: <531fc6570805bd163afedd53f5d71e1b79a477d1.1520840644.git.asias@scylladb.com>
2018-03-12 09:47:40 +02:00
Duarte Nunes
0ccf1c581a Merge 'Reduce gratuitous inclusions of system_keyspace.hh' from Avi
Try to avoid recompilations by reducing inclusions of system_keyspace.hh
in other header files.

Tests: unit (release)

* tag 'system_keyspace.hh/v1' of https://github.com/avikivity/scylla:
  storage_service: remove system_keyspace.hh include
  locator: de-inline reconnectable_snitch_helper
  locator: de-inline production_snitch_base
  cql3: remove #include of system_keyspace.hh
2018-03-11 22:56:20 +00:00
Avi Kivity
cd668061fc storage_service: remove system_keyspace.hh include
Re-distribute include among the files that really need it.
2018-03-11 18:53:49 +02:00
Avi Kivity
b946f8b308 locator: de-inline reconnectable_snitch_helper
Reduce dependencies by de-inlining reconnectable_snitch_helper. A
new home is found in production_snitch_base.cc, which is somewhat
related.
2018-03-11 18:31:05 +02:00
Avi Kivity
84004a2574 locator: de-inline production_snitch_base
De-inlining allows us to remove some dependencies, and those functions
are too complex to inline anyway.

A few always-throwing functions get the [[noreturn]] attribute to
avoid damaging code generation.
2018-03-11 18:22:49 +02:00
Avi Kivity
4f6b892aa1 cql3: remove #include of system_keyspace.hh
We include system_keyspace for just the string "system" (and a related
is_system_keyspace() function). Replace with forward-declared functions.
2018-03-11 18:02:23 +02:00
Avi Kivity
7441c7153f Merge seastar upstream
* seastar 08e02dc...42159d4 (9):
  > memory: avoid unconditional calls to __tls_init
  > io_tester: bring back information about think time
  > Merge "Avoid continuations in I/O Scheduler path" from Glauber
  > Merge "Extend io_tester to support CPU loads" from Glauber
  > tutorial: fix undue complication in semaphore get_units() example
  > Tutorial: in HTML target, inline code snippets shouldn't be gray
  > tutorial: add build target for split HTML file
  > tutorial: mention seastar::thread as option for object lifetime management
  > tutorial: document new seastar::future::wait()
2018-03-11 15:45:42 +02:00
Avi Kivity
9569ba5e38 Update scylla-ami submodule
* dist/ami/files/scylla-ami 3aa87a7...5170011 (3):
  > scylla_install_ami: install enhanced networking NIC drivers
  > scylla_install_ami: set kernel-ml as default kernel
  > scylla_install_ami: fix NIC down with enhanced networking on new base AMI
2018-03-11 15:45:05 +02:00
Raphael S. Carvalho
fb8ce14a36 sstables: don't set clustering components twice when loading sstable
already called in update_info_for_opened_data() which is called by
open_data(); no need for clustering components to be set early
either.

found it when auditing the code.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180310225213.26017-1-raphaelsc@scylladb.com>
2018-03-11 10:10:35 +02:00
Tomasz Grabiec
3937352a9a doc: Fix row_cache.md
Dropped unfinished sentence and added missing "after".
Message-Id: <1520615404-18458-1-git-send-email-tgrabiec@scylladb.com>
2018-03-10 16:27:04 +02:00
Raphael S. Carvalho
87035bd8d1 sstables: fix min and max timestamp when negative timestamp is specified
unsigned type was incorrectly used for keeping track of min and max
timestamp, so a negative number would be treated as a very high
number that would *incorrectly* end up as max timestamp in sstable
metadata.
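
The effect can be reproduced in isolation with the sketch below. Both helper
functions are hypothetical and exist only to contrast unsigned and signed
tracking; they are not the sstable metadata code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <limits>

// Buggy variant: tracking the maximum in an unsigned type makes a
// negative timestamp wrap around to a very large value.
inline uint64_t max_timestamp_unsigned(std::initializer_list<int64_t> ts) {
    uint64_t max = 0;
    for (int64_t t : ts) {
        max = std::max(max, static_cast<uint64_t>(t));
    }
    return max;
}

// Fixed variant: a signed type orders negative timestamps correctly.
inline int64_t max_timestamp_signed(std::initializer_list<int64_t> ts) {
    int64_t max = std::numeric_limits<int64_t>::min();
    for (int64_t t : ts) {
        max = std::max(max, t);
    }
    return max;
}
```

For the timestamps -5, 10 and 3 the signed variant yields 10, while the
unsigned variant yields the wrapped-around value of -5.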

Fixes #3000.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180308162217.18963-1-raphaelsc@scylladb.com>
2018-03-08 18:31:30 +02:00
Avi Kivity
596a9d0fb3 Merge "Make reader concurrency dual-restricted by count and memory" from Botond
"
Refs #2692
Fixes #3246

The current restricting algorithm [1] restricts the active-reader queue
based on the memory consumption of the existing active readers. When
this memory consumption is above the limit new readers are not admitted.
The inactive reader queue on the other hand has a fixed length.
This caused performance regressions on two workloads:
* read-only: since the inactive-reader queue length is severely limited
  (compared to the previous situation) reads will timeout at loads
  comfortably handled before.
* mixed: since the memory consumption happens only at admission time
  (already created active readers are not limited) memory consumption
  grew significantly causing problems when compactions kicked in.

The solution is to reintroduce the old limit of 100 active concurrent
user-reads while still keeping the memory-based limit as well. For
workloads that don't consume a lot of memory or on large boxes with lots
of memory the count-based limit will be reached first, which reverts to the
old well-known behaviour. For memory-hungry workloads or on small boxes
with little memory the memory-based limit will kick in sooner, avoiding
memory overconsumption.

[1] introduced by bdbbfe9390
"

* 'restricted-reader-dual-limit/v3' of https://github.com/denesb/scylla:
  Modify unit tests so that they test the dual-limits
  Use the reader_concurrency_semaphore to limit reader concurrency
  Add reader_concurrency_semaphore
  Add reader_resource_tracker param to mutation_source
  mv reader_resource_tracker.hh -> reader_concurrency_semaphore.hh
2018-03-08 14:36:05 +02:00
Botond Dénes
341ddd096a Modify unit tests so that they test the dual-limits 2018-03-08 14:12:12 +02:00
Botond Dénes
1259031af3 Use the reader_concurrency_semaphore to limit reader concurrency 2018-03-08 14:12:12 +02:00
Botond Dénes
dfa04c3fea Add reader_concurrency_semaphore
This semaphore implements the new dual, count and memory based active
reader limiting. As purely memory-based limiting proved to cause
problems on big boxes admitting a large number of readers (more than any
disk could handle) the previous count-based limit is reintroduced in
addition to the existing memory-based limit.
When creating new readers first the count-based limit is checked. If
that clears the memory limit is checked before admitting the reader.
reader_concurrency_semaphore wraps the two semaphores that implement
these limits and enforces the correct order of limit checking.
This class also completely replaces the restricted_reader_config struct,
it encapsulates all data and related functionality of the latter, making
client code simpler.
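
The order of the checks can be sketched as below. This is a non-blocking
simplification: the real class wraps two semaphores and lets readers wait,
while the hypothetical `dual_limit_admission` here simply refuses admission.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, non-blocking stand-in for the dual-limit admission logic.
class dual_limit_admission {
    std::size_t _count_left;
    std::size_t _memory_left;
public:
    dual_limit_admission(std::size_t max_count, std::size_t max_memory)
        : _count_left(max_count), _memory_left(max_memory) {}

    // Checks the count-based limit first, then the memory-based one.
    bool try_admit(std::size_t memory) {
        if (_count_left == 0) {
            return false;  // count limit reached
        }
        if (memory > _memory_left) {
            return false;  // memory limit reached
        }
        --_count_left;
        _memory_left -= memory;
        return true;
    }

    // Returns the resources of a previously admitted reader.
    void release(std::size_t memory) {
        ++_count_left;
        _memory_left += memory;
    }
};
```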
2018-03-08 14:12:12 +02:00
Botond Dénes
872fd369ba Add reader_resource_tracker param to mutation_source
Soon, reader_resource_tracker will only be constructible after the
reader has been admitted. This means that the resource tracker cannot be
preconstructed and just captured by the lambda stored in the mutation
source and instead has to be passed in along the other parameters.
2018-03-08 14:12:09 +02:00
Botond Dénes
d5bb8a47fc mv reader_resource_tracker.hh -> reader_concurrency_semaphore.hh
In preparation to reader_concurrency_semaphore being added to the file.
The reader_resource_tracker is really only a helper class for
reader_concurrency_semaphore so the latter is better suited to provide
the name of the file.
2018-03-08 10:29:16 +02:00
Avi Kivity
0ebfe448e3 Merge "Row-level eviction" from Tomasz
"
This series switches granularity of memory-pressure-induced eviction in cache
from a partition to a row.

Since 9b21a9b cache can store partial partitions with row granularity but they
were still evicted as a unit. This is problematic for the following reasons:

 - more is evicted than necessary, which decreases cache efficiency. In the
   worst case, whole cache gets evicted at once

 - evicting large amounts of memory (large partitions) at once may impact
   latency badly

Fixes #2576.

See the documentation added in patch titled "doc: Document row cache eviction"
for details on how eviction works.

Open issues to be fixed incrementally:

  - range tombstones are not evictable

  - cache update still has partition granularity, which
    causes bad latency on memtable flush with large partitions
"

* tag 'tgrabiec/row-level-eviction-v3' of github.com:scylladb/seastar-dev: (43 commits)
  doc: Document row cache eviction
  tests: cache: Add tests for row-level eviction
  tests: cache: Check that data is evictable after schema change
  tests: cache: Move definitions to the top
  tests: perf_cache_eviction: Switch eviction counter to row granularity
  tests: row_cache_alloc_stress: Avoid quadratic behavior
  cache: Introduce unlink_from_lru()
  cache: Add row-level stats about cache update from memtable
  mvcc: Propagate information if insertion happened from ensure_entry_if_complete()
  cache: Track number of rows and row invalidations
  cache: Evict with row granularity
  cache: Track static row insertions separately from regular rows
  tests: mvcc: Use apply_to_incomplete() to create versions
  tests: mvcc: Fix test_apply_to_incomplete()
  tests: cache: Do not depend on particular granularity of eviction
  tests: cache: Make sure readers touch rows in test_eviction()
  mvcc: Store complete rows in each version in evictable entries
  mvcc: Introduce partition_snapshot_row_cursor::ensure_entry_in_latest()
  tests: cache: Invoke partial eviction in test_concurrent_reads_and_eviction
  cache: Ensure all evictable partition_versions have a dummy after all rows
  ...
2018-03-07 17:57:07 +02:00
Tomasz Grabiec
4caeed7e40 doc: Document row cache eviction 2018-03-07 16:52:59 +01:00
Tomasz Grabiec
180a877db3 tests: cache: Add tests for row-level eviction 2018-03-07 16:52:59 +01:00
Tomasz Grabiec
9fab5068c6 tests: cache: Check that data is evictable after schema change 2018-03-07 16:52:59 +01:00
Tomasz Grabiec
f0e0c79a70 tests: cache: Move definitions to the top 2018-03-07 16:52:59 +01:00
Tomasz Grabiec
1e4f9eb2c1 tests: perf_cache_eviction: Switch eviction counter to row granularity 2018-03-07 16:52:59 +01:00
Tomasz Grabiec
48f91b4605 tests: row_cache_alloc_stress: Avoid quadratic behavior
Partitions corresponding to keys have 40k rows. With row-level
eviction touching them inside the loop became a serious performance
issue, because touch() now needs to walk over all rows.
2018-03-07 16:52:59 +01:00
Tomasz Grabiec
641bcd0b35 cache: Introduce unlink_from_lru()
Will be used in row_cache_alloc_stress to unlink partitions which we
don't want to get evicted, instead of repeatedly calling touch() on
them after each subsequent population. After switching to row-level
LRU, doing so greatly increases run time of the test due to quadratic
behavior.
2018-03-07 16:52:59 +01:00
Tomasz Grabiec
b9d22584bb cache: Add row-level stats about cache update from memtable 2018-03-07 16:52:58 +01:00
Tomasz Grabiec
7c34cd04e2 mvcc: Propagate information if insertion happened from ensure_entry_if_complete()
It's needed by users to update statistics, different ones depending on
if the row already existed or not.
2018-03-07 16:50:55 +01:00
Raphael S. Carvalho
aa75684ee7 sstables: Warn when an extra-large partition is written
Based on https://issues.apache.org/jira/browse/CASSANDRA-9643

With the compaction_large_partition_warning_threshold_mb option set to 1,
example output follows:

WARN  2018-02-22 19:52:11,029 [shard 0] sstable - Writing large
row system/local:{key: pk{00056c6f63616c}, token:-7564491331177403445}
(1276758 bytes)

Fixes #2209.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180306175912.19259-1-raphaelsc@scylladb.com>
2018-03-07 15:49:46 +00:00
Takuya ASADA
c3b2e2580a dist/debian: switch to gcc-7.3
We have hit the following bug in the debug-mode binary:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82560
Since it's fixed in gcc-7.3, we need to upgrade our gcc package.

See: https://groups.google.com/d/topic/scylladb-dev/RIdIpqMeTog/discussion
2018-03-08 00:06:32 +09:00
Duarte Nunes
9254a9a6fe db/system_keyspace: Move dependency on db/schema_tables to source file
And add missing dependencies to header file.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180307111304.2914-1-duarte@scylladb.com>
2018-03-07 14:45:36 +02:00
Asias He
73d8e2743f dht: Fix log in range_streamer
The address and keyspace should be swapped.

Before:
  range_streamer - Bootstrap with ks3 for keyspace=127.0.0.1 succeeded,
  took 56 seconds

After:
  range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded,
  took 56 seconds

Message-Id: <5c49646f1fbe45e3a1e7545b8470e04b166922c4.1520416042.git.asias@scylladb.com>
2018-03-07 11:49:58 +02:00
Tomasz Grabiec
6ba272a610 debug: scylla_row_cache_report: Remove duplicated phrase from printout
Message-Id: <1520412164-10746-1-git-send-email-tgrabiec@scylladb.com>
2018-03-07 11:15:57 +02:00
Tomasz Grabiec
ad7e2f7460 cache: Add back partition count argument to row_cache_update_one_batch_end probe
debug/scylla_row_cache_report.stp expects it.

Removed in c4974392b7.
Message-Id: <1520412152-10680-1-git-send-email-tgrabiec@scylladb.com>
2018-03-07 11:15:56 +02:00
Vladimir Krivopalov
8028f90460 Add support for JSON output format for perf_fast_forward results.
The JSON output is arranged in a way that makes it easier to upload
results to ElasticSearch.
All the test results are placed under the perf_forward_data_output/ directory.
For test groups, we create separate subdirectories where we save results
from runs of tests in those groups.
For each test run, we store results in a separate file named:
    <dash-separated-param-list>.<run-number>.json
where
    <dash-separated-param-list> is a dash-separated list of parameters of the current
    test, e.g., 1-64 (for read-skip pattern).

    <run-number> is the run number of this test with the specified
    parameters. This is needed as the same list of parameters can be
    used more than once (for instance, when cache is enabled).
    Those numbers start with 1, i.e., 1, 2, 3.

So, the path to a resulting JSON file may look like:
    perf_fast_forward_output/large-partition-skips/64-4096.1.json

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-03-06 12:09:00 -08:00
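The file naming scheme described in 8028f90460 can be sketched as a small path builder. This is illustrative C++ only: result_path() and its parameters are hypothetical names, not taken from the patch, and the output directory is passed in by the caller because the message itself mentions two different directory names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not taken from the patch. Builds
//   <output_dir>/<group>/<dash-separated-param-list>.<run-number>.json
std::string result_path(const std::string& output_dir,
                        const std::string& group,
                        const std::vector<std::string>& params,
                        unsigned run_number) {
    assert(run_number >= 1);  // run numbers start with 1
    std::string name;
    for (const auto& p : params) {
        if (!name.empty()) {
            name += '-';  // parameters are joined with dashes
        }
        name += p;
    }
    return output_dir + "/" + group + "/" + name + "." +
           std::to_string(run_number) + ".json";
}
```

Called with "perf_fast_forward_output", "large-partition-skips", the parameters 64 and 4096, and run number 1, this yields the example path quoted in the message.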
Vladimir Krivopalov
e810fc4e09 Wrap output for customization. Move all output handling to a single managing class.
Instead of passing the output parameters to std::cout straight away, use
helper wrappers. This will allow us to add more formats for gathered
test results.

Introduce a hierarchy of helper writer classes that can be extended to
support different output formats (JSON, XML, etc).

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-03-06 09:49:05 -08:00
Tomasz Grabiec
da901b93fc cache: Track number of rows and row invalidations 2018-03-06 11:50:29 +01:00
Tomasz Grabiec
381bf02f55 cache: Evict with row granularity
Instead of evicting whole partitions, evicts whole rows.

As part of this, invalidation of partition entries was changed to not
evict from snapshots right away, but unlink them and let them be
evicted by the reclaimer.
2018-03-06 11:50:29 +01:00
Tomasz Grabiec
dce9185fc9 cache: Track static row insertions separately from regular rows
So that row eviction counter, which doesn't look at the static row,
can be in sync with row insertion counter.
2018-03-06 11:50:28 +01:00
Tomasz Grabiec
19951ede7d tests: mvcc: Use apply_to_incomplete() to create versions
So that the test doesn't depend on internal invariants.
2018-03-06 11:50:28 +01:00
Tomasz Grabiec
ed6271fc87 tests: mvcc: Fix test_apply_to_incomplete()
It should use evictable entries instead of non-evictable ones,
because they are required by apply_to_incomplete().
2018-03-06 11:50:28 +01:00
Tomasz Grabiec
f2bdac2874 tests: cache: Do not depend on particular granularity of eviction 2018-03-06 11:50:28 +01:00
Tomasz Grabiec
c306c1050e tests: cache: Make sure readers touch rows in test_eviction()
With row-level eviction just creating a reader won't necessarily
update the LRU.
2018-03-06 11:50:28 +01:00
Tomasz Grabiec
ab407d99cc mvcc: Store complete rows in each version in evictable entries
For row-level eviction we need to ensure that each version has
complete rows so that eviction from older versions doesn't affect the
value of the row in newer snapshots.

This is achieved by copying the row from an older version before
applying the increment in the new version.

Only affects evictable entries, memtables are not affected.
2018-03-06 11:50:28 +01:00
Tomasz Grabiec
29d167bf01 mvcc: Introduce partition_snapshot_row_cursor::ensure_entry_in_latest()
To avoid duplication of logic between cache reader and
ensure_entry_if_complete().
2018-03-06 11:50:28 +01:00
Tomasz Grabiec
fb2107416b tests: cache: Invoke partial eviction in test_concurrent_reads_and_eviction
In hope of catching more issues.
2018-03-06 11:50:27 +01:00
Tomasz Grabiec
bee875fa7d cache: Ensure all evictable partition_versions have a dummy after all rows
Every evictable version will have a dummy entry at the end so that it can be
tracked in the LRU.

It is also needed to allow old versions to stay around (with
tombstones and static rows) after all rows are evicted. Such versions
must be fully discontinuous, and we need some entry to mark that.
2018-03-06 11:50:27 +01:00
Tomasz Grabiec
5320705300 cache: Propagate cache_tracker to places manipulating evictable entries
cache_tracker reference will be needed to link/unlink row entries.

No change of behavior in this patch.
2018-03-06 11:50:27 +01:00
Tomasz Grabiec
30df3ddd7d cache: Do not evict from cache_entry destructor
We will need to propagate a cache_tracker reference to evict(). Instead
of evicting from destructor, do so before cache_entry gets unlinked
from the tree. Entries which are not linked, don't need to be
explicitly evicted.
2018-03-06 11:50:27 +01:00
Tomasz Grabiec
4efab6f6a6 cache: Use on_evicted() in cache_tracker::clear()
In preparation for switching LRU to row level.
2018-03-06 11:50:27 +01:00
Tomasz Grabiec
2118bdce01 cache: Extract cache_entry::on_evicted() 2018-03-06 11:50:27 +01:00
Tomasz Grabiec
24c5949518 cache: cache_tracker: Rename on_merge() to on_partition_merge() 2018-03-06 11:50:27 +01:00
Tomasz Grabiec
d66e864310 cache: cache_tracker: Rename on_erase() to on_partition_erase() 2018-03-06 11:50:27 +01:00
Tomasz Grabiec
3dc9000c51 mutation_partition: Introduce rows_entry::is_last_dummy()
Will be needed by row evictor, which needs to treat last dummies
specially (not evict them).
2018-03-06 11:50:26 +01:00
Tomasz Grabiec
e571bd5a2e mvcc: Add partition_entry::versions_from_oldest() 2018-03-06 11:50:26 +01:00
Tomasz Grabiec
654d4b76c0 anchorless_list: Introduce all_elements_reversed() 2018-03-06 11:50:26 +01:00
Tomasz Grabiec
d9a38c1c85 mutation_partition: Add API to walk from rows_entry to cache_entry
Will be needed on row eviction, to unlink containers when they become
fully evicted.
2018-03-06 11:50:26 +01:00
Tomasz Grabiec
0ccae80332 intrusive_set_external_comparator: Introduce container_of_only_member() 2018-03-06 11:50:26 +01:00
Tomasz Grabiec
758dfd404b intrusive_set_external_comparator: Use auto_unlink on nodes
Needed for row-level eviction, which doesn't have a reference to the
container.
2018-03-06 11:50:26 +01:00
Tomasz Grabiec
1a85c6d556 intrusive_set_external_comparator: Introduce iterator_to() 2018-03-06 11:50:26 +01:00
Tomasz Grabiec
bbe771e28f tests: Add more tests for continuity merging 2018-03-06 11:50:26 +01:00
Tomasz Grabiec
9893e8e5f7 mvcc: Make each version have independent continuity
This change is a preparation for introducing row-level eviction, such that entries
can be evicted from older versions without having to touch other versions.

Currently continuity flags on entries are interpreted relative to the
combined view merged from all entries. For example:

 v2:                  <key=2, cont=1>
 v1: <key=1, cont=1>

In v2, the flag on entry key=2 marks the range (1, 2) as
continuous. This is problematic because if the old version is evicted, continuity
will change in an incorrect way:

   v2:                  <key=2, cont=1>

Here, the range (-inf, 1) would be marked as continuous, which is not true.

To solve this problem, we change the rules for continuity
interpretation in MVCC. Each version will have its own continuity,
fully specified in that version, independent of continuity of other
versions. Continuity of the snapshot will be a union of continuous
ranges in each version.

It is assumed that continuous intervals in different versions are
non-overlapping, except for points corresponding to complete rows, in
which case a later version may overlap with an older version
(overwrite). This assumption makes calculating the union of intervals
on merging easier, and mutation_partition::apply_monotonically()
relies on it.

MVCC population of incomplete entries already almost maintains the
non-overlapping invariant, because population intervals correspond to
intervals which are incomplete in the old snapshot. The only change
needed is to ensure that both population bounds will have entries in
the latest version. Population from memtables doesn't mark any
intervals as continuous, so also conforms. The only change needed
there is to not inherit continuity flags from the old snapshot,
effectively making the new version internally discontinuous except for
row points.

The example from the beginning will become:

 v2: <key=1, cont=0>  <key=2, cont=1>
 v1: <key=1, cont=1>

When marking a range as continuous with some rows present only in
older versions, we need to insert entries in the latest version, so
that we can mark the range as continuous. The easiest solution is to
copy the entry from the old version. Another option would be to add
support for incomplete rows and insert such instead. This way we would
avoid duplicating row contents. This optimization is deferred.
2018-03-06 11:50:25 +01:00
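The continuity rule introduced in 9893e8e5f7 can be modelled with a toy data structure. The sketch below is not Scylla code: the entry and version types and both functions are made up for illustration, keys are plain integers, and only positions that fall strictly between keys are queried. A flag on an entry is read relative to the previous entry in the same version, and snapshot continuity is the union over versions.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Illustrative model only; these are not Scylla's types.
struct entry {
    int key;
    bool continuous;  // true: the range (previous entry in this version, key) is continuous
};

using version = std::vector<entry>;  // entries sorted by key

// Is the non-key position `pos` inside a continuous range of this version alone?
bool continuous_in_version(const version& v, int pos) {
    assert(std::is_sorted(v.begin(), v.end(),
                          [](const entry& a, const entry& b) { return a.key < b.key; }));
    for (const auto& e : v) {
        if (e.key > pos) {
            // The first entry after pos owns the range that contains pos.
            return e.continuous;
        }
    }
    return false;  // no entry after pos: nothing marks it as continuous
}

// Continuity of the snapshot is the union of the continuity of each version.
bool continuous_in_snapshot(const std::vector<version>& versions, int pos) {
    for (const auto& v : versions) {
        if (continuous_in_version(v, pos)) {
            return true;
        }
    }
    return false;
}
```

With keys 10 and 20 standing in for key=1 and key=2, v1 = {<10, cont=1>} and v2 = {<10, cont=0>, <20, cont=1>} give a snapshot that is continuous at positions 5 and 15. Dropping v1 leaves only position 15 continuous, which is the intended outcome. Under the old layout, v2 = {<20, cont=1>} on its own would report position 5 as continuous.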
Tomasz Grabiec
bd1e730053 tests: cache: Add test for merging and reading randomly populated versions 2018-03-06 11:32:09 +01:00
Tomasz Grabiec
1b959cb6e9 tests: cache: Take parameters by const& 2018-03-06 11:32:09 +01:00
Tomasz Grabiec
d2744b6ad8 tests: mvcc: Don't set mutations in versions directly
Simply copying mutations which are not fully continuous may violate
MVCC invariants, like the one about non-overlapping continuity which
will be added later. Use apply_to_incomplete() instead.

This unfortunately reduces strength of the test, since the continuity
of the entry is now completely determined by the first version. We should
use populate() instead, but it doesn't exist yet. It could be extracted
from cache_streamed_mutation, but that's not an easy change.

This is alleviated by adding a similar test to row_cache_test_g, in a
later patch.
2018-03-06 11:32:09 +01:00
Tomasz Grabiec
2a0ece5205 mvcc: Allow dereferencing partition_snapshot_row_weakref 2018-03-06 11:32:09 +01:00
Tomasz Grabiec
d0e1a3c63e mvcc: partition_snapshot_row_weakref: Introduce is_in_latest_version() 2018-03-06 11:32:09 +01:00
Tomasz Grabiec
2f956499a7 mvcc: Drop unused _evictable flag from partition_version_ref 2018-03-06 11:32:09 +01:00
Tomasz Grabiec
313f2c2bb0 cache: Document intent of maybe_update_continuity() 2018-03-06 11:32:09 +01:00
Tomasz Grabiec
3214883a25 cache: Extract cache_streamed_mutation::ensure_population_lower_bound() 2018-03-06 11:32:09 +01:00
Tomasz Grabiec
d9f0c1f097 tests: cache: Fix invalidate() not being waited for
Probably responsible for occasional failures of subsequent assertion.
Didn't manage to reproduce.

Message-Id: <1520330967-584-1-git-send-email-tgrabiec@scylladb.com>
2018-03-06 12:14:04 +02:00
Asias He
25aa59f2f1 gossip: Fix force_after in wait_for_gossip
In commit 8af0b501a2 (gossip: wait for stabilized gossip on bootstrap)

The force_after variable was changed from int32_t to stdx::optional<int32_t>

-            if (force_after > 0 && total_polls > force_after) {
+            if (force_after && total_polls > *force_after) {

Checking force_after > 0 was dropped which is wrong because force_after
is set to -1 by default. So the if branch will always be executed after
1 poll.

We always see:

   [shard 0] gossip - Gossip not settled but startup forced by
   skip_wait_for_gossip_to_settle. Gossp total polls: 1

even if skip_wait_for_gossip_to_settle is not set at all.

Fixes #3257
Message-Id: <845d219cea6101a7c507c13879c850a5c882e510.1520297548.git.asias@scylladb.com>
2018-03-06 10:11:02 +02:00
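The condition quoted in 25aa59f2f1 can be reproduced in isolation. The sketch below is illustrative C++, not the gossiper code: it uses std::optional in place of stdx::optional, the function names are made up, and the corrected condition is only one plausible form, since the actual patch is not shown in this log.

```cpp
#include <cstdint>
#include <optional>

// Condition as quoted in the commit message (after 8af0b501a2).
// An engaged optional holding -1 passes the first test, and any
// positive poll count is greater than -1.
constexpr bool forced_buggy(std::optional<int32_t> force_after, int32_t total_polls) {
    return force_after && total_polls > *force_after;
}

// One possible correction (an assumption, not the actual patch):
// keep the positivity check next to the "is engaged" check.
constexpr bool forced_fixed(std::optional<int32_t> force_after, int32_t total_polls) {
    return force_after && *force_after > 0 && total_polls > *force_after;
}
```

With the default of -1, forced_buggy() is already true after one poll, which matches the log line shown in the message, while forced_fixed() stays false.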
Vladimir Krivopalov
2cbdb91070 Remove unused io/ directory
Commit 9309a2ee6f ("Remove obselete
files") removed all of the callers but forgot to remove the directory.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <dcdd6ac66e88fac29cc2b0a12936688e71c1d267.1520314939.git.vladimir@scylladb.com>
2018-03-06 08:08:02 +02:00
Asias He
8900e830a3 storage_service: Add missing return in pieces empty check
If pieces is empty, it is bogus to access pieces[0]:

   sstring move_name = pieces[0];

Fix by adding the missing return.

Spotted by Vlad Zolotarov <vladz@scylladb.com>

Fixes #3258
Message-Id: <bcb446f34f953bc51c3704d06630b53fda82e8d2.1520297558.git.asias@scylladb.com>
2018-03-06 08:04:39 +02:00
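The shape of the fix in 8900e830a3 can be shown with a stand-alone sketch. Everything here is hypothetical (std::string instead of sstring, the function names, the bool result); only the pattern of returning before indexing an empty container comes from the message.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Indexing an empty vector is undefined behaviour, so callers must check first.
const std::string& first_piece(const std::vector<std::string>& pieces) {
    assert(!pieces.empty());
    return pieces[0];
}

// Hypothetical handler: returns false and leaves `move_name` untouched
// when there is nothing to parse.
bool handle_move(const std::vector<std::string>& pieces, std::string& move_name) {
    if (pieces.empty()) {
        return false;  // the early return that was missing
    }
    move_name = first_piece(pieces);
    return true;
}
```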
Vladimir Krivopalov
acdce55572 Inject CryptoPP namespace where Crypto++ byte typedef is used.
In Crypto++ v6, the `byte` typedef has been moved from the global
namespace to the CryptoPP:: namespace.
To make Scylla code compile with both old and new versions, bring the
namespace in so that the code works regardless of the scope of `byte`
definition.

Fixes #3252

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <60e7bfe868b778b1c9bbe15d7247db64b61bd406.1520272198.git.vladimir@scylladb.com>
2018-03-05 20:43:07 +02:00
Avi Kivity
eb598876e5 build: remove broken and unneeded xxhash include path
"-I$full_builddir/{mode}/xxhash" doesn't resolve to a valid path, because
full_builddir is a Python variable, not a Ninja variable.  In build.ninja
it appears as "-I/release/xxhash".

Since the build nevertheless works, we can remove the broken flag instead
of fixing it.
Message-Id: <20180305135919.13634-1-avi@scylladb.com>
2018-03-05 15:34:30 +01:00
Duarte Nunes
0c05fc0bff tests/flush_queue_test: Don't assume continuations run immediately
This patch fixes an issue with test_propagation(), where the test
assumed that after the future returned from wait_for_pending(0)
resolved, the continuations set for the post operation had already
run, which is not true.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180305131908.7667-1-duarte@scylladb.com>
2018-03-05 15:22:33 +02:00
Avi Kivity
1dae29b48d test: mutation_reader_test: fix no-timeout case in reader_wrapper
reader_wrapper's _timeout defaults to now(), which means to time
out immediately rather than no timeout.

Fix by switching to a time_point, defaulting to no_timeout, and
provide a compatible constructor (with a duration parameter) for
callers that do want a duration-based timeout.

Tests: mutation_reader_test (debug, release)
Message-Id: <20180305111739.31972-1-avi@scylladb.com>
2018-03-05 12:40:07 +01:00
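The difference between a now() default and a no-timeout default, as described in 1dae29b48d, can be sketched as follows. This is not the test's reader_wrapper: the class name, the clock choice, and modelling "no timeout" as time_point::max() are assumptions made for the example.

```cpp
#include <cassert>
#include <chrono>

using clock_type = std::chrono::steady_clock;

// Assumption: "no timeout" is modelled as the largest representable time point.
inline constexpr clock_type::time_point no_timeout = clock_type::time_point::max();

class reader_wrapper_sketch {
    clock_type::time_point _timeout;
public:
    // Default: never time out. A default of clock_type::now() would
    // instead mean "already expired".
    reader_wrapper_sketch() : _timeout(no_timeout) {}

    // Compatibility constructor for callers that want a duration-based timeout.
    explicit reader_wrapper_sketch(clock_type::duration d)
        : _timeout(clock_type::now() + d) {
        assert(d >= clock_type::duration::zero());
    }

    clock_type::time_point timeout() const { return _timeout; }
    bool timed_out(clock_type::time_point now) const { return now >= _timeout; }
};
```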
Avi Kivity
a9942bd84a Merge seastar upstream
* seastar f841d2d...08e02dc (3):
  > future: make future::wait() a supported function
  > scripts: perftune.py: don't allow cpu-mask that does't include any IRQ CPU
  > Tutorial: show nice dashes in HTML
2018-03-05 12:58:15 +02:00
Vlad Zolotarov
e3ca390333 tests: gce_snitch_test: drop the property file related message
The message in question is printed with printf() which is bad by itself.
And most importantly this test uses a single .property file so this message
doesn't add any interesting information to begin with. Therefore it makes
more sense to drop it than to fix it.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1519661059-13325-1-git-send-email-vladz@scylladb.com>
2018-03-04 16:16:37 +02:00
Takuya ASADA
3229a87fee dist/debian: Drop scylla-fstrim cron job from Debian 8/9
Since we install scylla-fstrim systemd unit files on Debian 8/9, there is
no need to install the cron job, so drop it.

Fixes #3249

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1519950212-16231-2-git-send-email-syuu@scylladb.com>
2018-03-04 16:13:06 +02:00
Takuya ASADA
759b4de7a5 dist/debian: drop systemd unit files on Ubuntu 14.04
Ubuntu 14.04 uses upstart as its init program and doesn't need systemd
unit files, so drop them.

Fixes #3245

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1519950212-16231-1-git-send-email-syuu@scylladb.com>
2018-03-04 16:13:05 +02:00
Vladimir Krivopalov
e9e9ec2d16 Guidelines for preparing patches in HACKING.md
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <93bf4d5c04848daf2157d1343748410995b224db.1520045191.git.vladimir@scylladb.com>
2018-03-04 16:12:00 +02:00
Piotr Jastrzebski
29eb9f30bc Fix memtable::clear_gently to work in debug mode.
It was getting into an infinite loop because
need_preempt was always returning true.

Tests: units (release,debug)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <a324e7f576b247124080830455c920bdad1f617b.1520025213.git.piotr@scylladb.com>
2018-03-04 14:11:54 +02:00
Vladimir Krivopalov
99bd5180ba Fix Scylla compilation with Crypto++ v6.
In Crypto++ v6, the `byte` typedef has been moved from the global
namespace to the `CryptoPP::` namespace.

This fix brings in the CryptoPP namespace so that the `byte` typedef is
seen with both old and new versions of Crypto++.

Fixes #3252.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <799d055be710231884d101a52c0be8ed8b0a9806.1520125889.git.vladimir@scylladb.com>
2018-03-04 10:23:00 +02:00
Duarte Nunes
45d762703c Merge 'CQL syntax refinements for access-control' from Jesse
This patch series ties up some loose ends around CQL syntax for access-control statements.

The USER-based syntax statements are all backwards compatible. ROLE-specific statements have a new syntax which is described in "cql: Make role syntax more consistent". Other statements (like GRANT) have been updated to accept role names (instead of the more restrictive `username` rule).

Fixes #3217.

Tests: unit (debug)

* 'jhk/roles_syntax/v2' of https://github.com/hakuch/scylla:
  tests: Rename test for consistency
  cql: Eliminate uses of legacy `username` rule
  cql: Elaborate error for quoted user names
  cql: Allow role names to be string literals
  cql: Make role syntax more consistent
  tests: Add CQL syntax tests for access-control
2018-03-02 15:11:14 +00:00
Raphael S. Carvalho
954efcd209 storage_service: log sstable integrity checker status
INFO  2018-02-27 16:02:36,246 [shard 0] storage_service - SSTable data integrity checker is enabled.

Fixes #3071.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180228174253.9190-1-raphaelsc@scylladb.com>
2018-03-01 20:57:06 +01:00
Jesse Haber-Kucharsky
90af3d889a tests: Rename test for consistency
Now we have `cql_auth_query_test` and `cql_auth_syntax_test`.
2018-03-01 12:06:59 -05:00
Jesse Haber-Kucharsky
464f41d2bb cql: Eliminate uses of legacy username rule
All users of `username` are replaced with `userOrRoleName`, except in
USER-specific (legacy) statements: CREATE USER, ALTER USER, DROP USER.
2018-03-01 12:06:59 -05:00
Jesse Haber-Kucharsky
b84e22acdd cql: Elaborate error for quoted user names
Since quoted names are allowed for role names, we add a more descriptive
error message when a quoted name is (erroneously) used for a user name.

This behavior is consistent with Apache Cassandra.
2018-03-01 12:06:59 -05:00
Jesse Haber-Kucharsky
b5264d8bf7 cql: Allow role names to be string literals
This behavior matches that of Apache Cassandra. When a role name is
specified as a string literal (single quotes), the case is preserved.
2018-03-01 12:06:59 -05:00
Jesse Haber-Kucharsky
d7f2035dea cql: Make role syntax more consistent
This patch changes the syntax for CQL statements related to roles to
favor a form like

    CREATE ROLE sam WITH PASSWORD = 'shire' AND LOGIN = false;

instead of

    CREATE ROLE sam WITH PASSWORD 'shire' NOLOGIN;

This new syntax has the benefit of not imposing any ordering constraints
on the modifiers for roles and being consistent with other parts of the
CQL grammar. It is also consistent with syntax in Apache Cassandra.

The old USER-based statements (CREATE USER and ALTER USER) still have
the old forms for backwards compatibility.

A previous change modified the USER-related statements to allow for the
OPTIONS option. However, this was a mistake; only the PASSWORD option
should have been allowed. This patch also corrects this mistake.
2018-03-01 12:04:40 -05:00
Jesse Haber-Kucharsky
62bfc3939c tests: Add CQL syntax tests for access-control
These are quick-running tests for verifying the accepted forms of CQL
statements (and fragments) related to access-control: users, roles, and
permissions.

Establishing the allowed forms of statements is helpful for reference,
but also makes syntax changes (like those expected in later patches)
clearer and more safe.
2018-03-01 11:46:37 -05:00
Tomasz Grabiec
91ccf82ce4 mvcc: Improve printout of partition_snapshot_row_cursor
Multiline output is easier to read by humans.
Also, print continuity.

Message-Id: <1519909484-24531-1-git-send-email-tgrabiec@scylladb.com>
2018-03-01 13:44:00 +00:00
Takuya ASADA
101e909483 dist/debian: install scylla-housekeeping upstart script correctly on Ubuntu 14.04
Since we split the scylla-housekeeping service into two different services for systemd, we no longer share the same service name between systemd and upstart.
So handle it independently for each distribution and try to install
/etc/init/scylla-housekeeping.conf on Ubuntu 14.04.

Fixes #3239

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1519852659-10688-1-git-send-email-syuu@scylladb.com>
2018-03-01 10:36:11 +02:00
Takuya ASADA
69e3760920 dist/redhat: support CentOS/ppc64le
Support POWER architecture on Scylla.
Since DPDK is not fully supported on POWER (no PMD supported on it yet),
disabled it for now.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180228203048.21593-1-syuu@scylladb.com>
2018-03-01 09:59:39 +02:00
Tomasz Grabiec
30635510a2 intrusive_set_external_comparator: Fix _header having undefined color on move
swap_tree() doesn't change the color of the header, and because the header
was not initialized, it is undefined (can be either red or black). One
problem this causes is that algo::is_header() expects the header to be
always red. It is used by unlink(), which for trees which have a black
header would infinite-loop.

The fix is to initialize the header.

Fixes #3242.

Message-Id: <1519815091-13111-1-git-send-email-tgrabiec@scylladb.com>
2018-02-28 13:56:58 +02:00
Botond Dénes
ee307751e6 token_metadata: make get_host_id() and get_endpoint_for_host() const
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <febcb558848f8e06661bba592263e55e3192ed47.1519741336.git.bdenes@scylladb.com>
2018-02-27 16:29:13 +02:00
Duarte Nunes
76e6423910 database: Truncate views when truncating the base table
Fixes #3200

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180211124218.41373-1-duarte@scylladb.com>
2018-02-27 15:54:43 +02:00
Amnon Heiman
57d46c6959 scylla-housekeeping: need to support both debian/ubuntu variations
Debian and ubuntu list files come in two variations.
The housekeeping should support both.

This patch changes the regexp that matches the OS in the repository file.
After the introduction of the second list variation, the OS name can be in the middle of the path, not only at the end.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180227092543.19538-1-amnon@scylladb.com>
2018-02-27 11:40:47 +02:00
Botond Dénes
d088c7724e Make serialization-deserialization of range symmetric
Currently serializing and deserializing singular ranges is asymmetric.
When serializing a range we use the start() and end() functions to
obtain _start and _end respectively. However for singular ranges end()
will return _start and therefore the serialized range will have two
engaged optionals for bounds whereas the in-memory version will have only
one. The immediate consequence of this is that after serializing and
deserializing a range it will not compare equal to the original
serialized range. Needless to say this is *very* surprising behaviour.

To remedy the issue we fix the wrapping_range's constructor to not set
_end to the passed in value when the range is singular.
This way the on-wire format can stay compatible to how the range is
perceived by client code (when is_singular(): start() == end()) but
constructing the range from the wire-format will yield a range that will
always compare equal to the original one.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <e5f20b7b45f65ca1f7b347dcccd2ac462869e7ff.1519652739.git.bdenes@scylladb.com>
2018-02-26 20:24:55 +02:00
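The round trip described in d088c7724e can be sketched with a simplified stand-in for the range type. This is not wrapping_range itself: bounds are plain ints, the "wire format" is a tuple, and all names are made up. The constructor ignores the passed-in end bound for singular ranges, which is the fix the message describes.

```cpp
#include <cassert>
#include <optional>
#include <tuple>

// Simplified stand-in for the range type; bounds are plain ints here.
class range_sketch {
    std::optional<int> _start;
    std::optional<int> _end;
    bool _singular;
public:
    // For singular ranges the `end` argument is dropped, so _end stays
    // disengaged just like in a range that was built in memory.
    range_sketch(std::optional<int> start, std::optional<int> end, bool singular = false)
        : _start(start)
        , _end(singular ? std::optional<int>{} : end)
        , _singular(singular) {}

    const std::optional<int>& start() const { return _start; }
    // For singular ranges end() reports the start bound.
    const std::optional<int>& end() const { return _singular ? _start : _end; }
    bool is_singular() const { return _singular; }

    bool operator==(const range_sketch& o) const {
        return std::tie(_start, _end, _singular) == std::tie(o._start, o._end, o._singular);
    }
};

// "Wire format": what a serializer built on start()/end() would emit.
using wire = std::tuple<std::optional<int>, std::optional<int>, bool>;

wire serialize(const range_sketch& r) {
    return wire(r.start(), r.end(), r.is_singular());
}

range_sketch deserialize(const wire& w) {
    return range_sketch(std::get<0>(w), std::get<1>(w), std::get<2>(w));
}
```

A singular range still goes over the wire with two engaged bounds, so the format is unchanged, but the deserialized object now compares equal to the original.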
Avi Kivity
d973445a94 Merge "sstable/schema extensions" from Calle
"
Adds extension points to schema/sstables to enable hooking in
stuff, like, say, something that modifies how sstable disk io
works. (Cough, cough, *encryption*)

Extensions are processed as property keywords in CQL. To add
an extension, a "module" must register it into the extensions
object on boot time. To avoid globals (and yet don't),
extensions are reachable from config (and thus from db).

Table/view tables already contain an extension element, so
we utilize this to persist config.

schema_tables tables/views from mutations now require a "context"
object (currently only extensions, but abstracted for easier
further changes).

Because of how schemas currently operate, there is a super
lame workaround to allow "schema_registry" access to config
and by extension extensions. DB, upon instantiation, calls
a thread local global "init" in schema_registry and registers
the config. It, in turn, can then call table_from_mutations
as required.

Includes the (modified) patch to encapsulate compression
into objects, mainly because it is nice to encapsulate, and
isolate a little.
"

* 'calle/extensions-v5' of github.com:scylladb/seastar-dev:
  extensions: Small unit test
  sstables: Process extensions on file open
  sstables::types: Add optional extensions attribute to scylla metadata
  sstables::disk_types: Add hash and comparator(sstring) to disk_string
  schema_tables: Load/save extensions table
  cql: Add schema extensions processing to properties
  schema_tables: Require context object in schema load path
  schema_tables: Add opaque context object
  config_file_impl: Remove ostream operators
  main/init: Formalize configurables + add extensions to init call
  db::config: Add extensions as a config sub-object
  db::extensions: Configuration object to store various extensions
  cql3::statements::property_definitions: Use std::variant instead of any
  sstables: Add extension type for wrapping file io
  schema: Add opaque type to represent extensions
  sstables::compress/compress: Make compression a virtual object
2018-02-26 17:15:29 +02:00
Paweł Dziepak
5dfa36c526 lsa: add basic sanitizer
LSA being an allocator built on top of the standard may hide some
erroneous usage from AddressSanitizer. Moreover, it has its own classes
of bugs that could be caused by incorrect user behaviour (e.g. migrator
returning wrong object size).

This patch adds basic sanitizer for the LSA that is active in the debug
mode and verifies if the allocator is used correctly and if a problem is
found prints information about the affected object that it has collected
earlier. That includes the address and size of an object as well as
backtrace of the allocation site. At the moment the following errors are
being checked for:
 * leaks, objects not freed at region destructor
 * attempts to free objects at invalid address
 * mismatch between object size at allocation and free
 * mismatch between object size at allocation and as reported by the
   migrator
 * internal LSA error: attempt to allocate object at already used
   address
 * internal LSA error: attempt to merge regions containing allocated
   objects at conflicting addresses

Message-Id: <20180226122314.32049-1-pdziepak@scylladb.com>
2018-02-26 14:35:13 +02:00
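A few of the checks listed in 5dfa36c526 can be illustrated with a toy bookkeeping class. This is not the LSA sanitizer: the class and method names are hypothetical, errors are collected as strings instead of being reported, and the backtrace capture, migrator checks, and region merging mentioned in the message are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Toy bookkeeping in the spirit of the checks listed above; not the LSA code.
class region_sanitizer_sketch {
    std::unordered_map<const void*, std::size_t> _live;  // address -> size at allocation
    std::vector<std::string> _errors;
public:
    void on_allocation(const void* p, std::size_t size) {
        assert(p != nullptr);
        if (!_live.emplace(p, size).second) {
            _errors.push_back("allocation at already used address");
        }
    }

    void on_free(const void* p, std::size_t size) {
        auto it = _live.find(p);
        if (it == _live.end()) {
            _errors.push_back("free of invalid address");
            return;
        }
        if (it->second != size) {
            _errors.push_back("size mismatch between allocation and free");
        }
        _live.erase(it);
    }

    // Called where the region would be destroyed: anything still live is a leak.
    void on_region_destruction() {
        if (!_live.empty()) {
            _errors.push_back("leaked objects at region destruction");
        }
    }

    const std::vector<std::string>& errors() const { return _errors; }
};
```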
Botond Dénes
c4b5249a46 backlog_controller::adjust(): fix heap-overflow
Make sure idx will not be equal to _control_points.size() (and thus
overflow the vector) when looking for the first control-point with
a backlog not smaller than the current one, by stopping when it's equal
to _control_points.size() - 1.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <47841592792573d820650d570fa1ab7e58bdac2c.1518700405.git.bdenes@scylladb.com>
2018-02-26 13:47:38 +02:00
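The bounded search described in c4b5249a46 can be sketched like this. The control_point layout and the function name are assumptions for the example; only the stopping condition at size() - 1 comes from the message.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical shape of a control point: a backlog value and its output.
struct control_point {
    float input;
    float output;
};

// Index of the first control point whose backlog is not smaller than
// `current`, clamped to the last valid index so that the result can
// always be used to index `points`.
std::size_t find_control_point(const std::vector<control_point>& points, float current) {
    assert(!points.empty());
    std::size_t idx = 0;
    while (idx < points.size() - 1 && points[idx].input < current) {
        ++idx;
    }
    return idx;
}
```

When the current backlog is larger than every control point, the loop stops at the last element instead of returning size().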
Avi Kivity
8fe2414b11 Merge seastar upstream
* seastar 383ccd6...f841d2d (8):
  > Merge "Randomize task queue in debug mode" from Duarte
  > tutorial: document seastar::thread
  > tutorial: add missing seastar namespace
  > tutorial: note about asynchronous functions throwing exceptions
  > thread: stop backtraces on aarch64 from underflowing the stack
  > Revert "core::thread: ARM64 version of annotating the frame"
  > core::thread: ARM64 version of annotating the frame
  > core/future-util: Release exception in repeater
2018-02-26 12:54:35 +02:00
Calle Wilund
e75d3dc997 extensions: Small unit test
Test basic operation of schema and sstable extensions
2018-02-26 10:43:37 +00:00
Paweł Dziepak
b103139e4f configure.py: do not ignore optimisation flags
Release mode flags are properly propagated through seastar --optflags
flag, but debug mode flags aren't. This is problematic since they are
used to enable additional debugging features.

After this patch we will end up with some duplicate flags, but that's
not really a problem.

Message-Id: <20180223173617.15199-1-pdziepak@scylladb.com>
2018-02-25 17:09:07 +02:00
Botond Dénes
206e7d40d4 restricted_mutation_reader: switch to std::variant
Tests: unit-tests(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <a8930b764171db131d9d8d5fe4035014ecb452f4.1519391304.git.bdenes@scylladb.com>
2018-02-25 14:35:57 +02:00
Paweł Dziepak
6b66e4833b mvcc: avoid ubsan warning about uninitialised boolean
Message-Id: <20180223160133.21383-1-pdziepak@scylladb.com>
2018-02-23 16:54:23 +00:00
Jesse Haber-Kucharsky
82c8104c72 cql_test_env: Ignore error if user already exists
When a `cql_test_env` points to a data directory that was previously
populated with `cql_test_env`, then the "tester" user will already
exist. This is not an error, so we can just ignore the exception.

Fixes #3224.

Tests: unit (debug)
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <7729e5a98d8020a7ed1b6d12d8726559f0850f9d.1519315698.git.jhaberku@scylladb.com>
2018-02-22 19:30:50 +01:00
Raphael S. Carvalho
f59f423f3c Make sstable loading faster by not invoking all shards for each sstable
Before 312bd9ce25, boot had to call all shards for each sstable
such that they would agree/disagree on their deletion, an atomic
deletion manager requirement.

After its removal, we can afford to call only the shards that own
a given sstable.

This reduces the number of operations from (SSTABLES) * (SHARD_COUNT)
to usually (SSTABLES). It may be the same as before after resharding,
but resharding is a one-off operation.

Boot time should be significantly reduced for nodes with a high smp
count and column family using leveled strategy (which can end up with
thousands of sstables).

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180220032554.17776-1-raphaelsc@scylladb.com>
2018-02-22 09:39:56 +00:00
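The arithmetic behind f59f423f3c can be sketched as follows. This is only an illustration: the function names and the `owners_per_sstable` parameter are hypothetical, and the figures in the usage note are made-up inputs, not measurements.

```cpp
#include <cassert>
#include <cstdint>

// Boot-time calls when every shard is consulted for every sstable.
constexpr std::uint64_t calls_all_shards(std::uint64_t sstables, std::uint64_t shard_count) {
    return sstables * shard_count;
}

// Boot-time calls when only the owning shards are consulted. The second
// argument is 1 in the usual case and can reach shard_count for sstables
// that still need resharding.
constexpr std::uint64_t calls_owner_shards(std::uint64_t sstables, std::uint64_t owners_per_sstable) {
    return sstables * owners_per_sstable;
}
```

For example, with 5000 sstables on a 64-shard node, `calls_all_shards(5000, 64)` is 320000, while `calls_owner_shards(5000, 1)` is 5000.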
Amnon Heiman
edcfab3262 dist/docker: Add support for housekeeping
This patch takes a modified version of the Ubuntu 14.04 housekeeping
service script and uses it in Docker to validate the current version.

To disable the version validation, pass the --disable-version-check flag
when running the container.

Message-Id: <20180220161231.1630-1-amnon@scylladb.com>
2018-02-21 09:26:02 +02:00
Duarte Nunes
e75f7c41d9 Merge 'Proper clean-up on closing index_reader' from Vladimir
With the changes introduced in #2981 and #3189, the lifetime management
of the objects used by index_reader became more complicated.
This patchset addresses the immediate problems caused by lack of proper
handling.

The more holistic approach to this will take more time and is to be
implemented under #3220. The current fix, however, should be good
enough as a stop-gap solution.

* 'issues/3213/v3' of https://github.com/argenet/scylla:
  Close promoted index streams when closing index_readers.
  Support proper closing of prepended_input_stream.
2018-02-21 01:02:16 +00:00
Vladimir Krivopalov
c996191411 Close promoted index streams when closing index_readers.
Promoted index input streams must be explicitly closed when closing the
index_reader in order to ensure all the pending read-aheads are
completed.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-02-20 16:04:15 -08:00
Vladimir Krivopalov
8d52d809f7 Support proper closing of prepended_input_stream.
When the stream is being closed, the call is forwarded to the stored
data_source.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-02-20 16:04:05 -08:00
Vladimir Krivopalov
721bd3eef6 Added missing 'override' to skip() in buffer_input_stream and prepended_input_stream.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <4e91bead8de7f6fa9b3bfdab8bda73efdb22749d.1519152303.git.vladimir@scylladb.com>
2018-02-20 19:49:11 +00:00
Pekka Enberg
f1f691b555 Merge "Add the GoogleCloudSnitch" from Vlad
"This series adds the GoogleCloudSnitch.

 Fixes #1619"

* 'google-cloud-snitch-v4' of https://github.com/vladzcloudius/scylla:
  config: uncomment/add the supported snitches description
  tests: added gce_snitch_test
  locator::gce_snitch: implementation of the GoogleCloudSnitch
  locator::snitch_base: properly log the failure during the snitch startup
2018-02-19 15:58:56 +02:00
Paweł Dziepak
d97eebe82d tests/cql3: increase TTL to avoid spurious failures
The test inserts some values with a TTL of 1 second and then
reads them back expecting them not to be expired yet. That may not
always be the case if the machine is slow and we are running in the
debug mode. Increasing the TTLs by a factor of 100 should help avoid these
false positives.

Message-Id: <20180219133816.17452-1-pdziepak@scylladb.com>
2018-02-19 15:40:19 +02:00
Pekka Enberg
bd365a10d3 Merge "Add an API to get all active repairs" from Amnon
"This series adds an API to return the active repairs by their IDs.

 After this series a call to:

   curl -X GET --header "Accept: application/json" "http://localhost:10000/storage_service/active_repair/"

 Will return an array with the ids of the active repairs.

 Fixes #3193"

* 'amnon/get_active_repairs_v3' of github.com:scylladb/seastar-dev:
  API: Add get active repair api
  repair: Add a get_active_repairs function to return the active repair
2018-02-19 15:32:17 +02:00
Amnon Heiman
4a8f67aa01 conf: Remove unsupported 'stream_throughput_outbound_megabits_per_sec' option
stream_throughput_outbound_megabits_per_sec is not supported and is
found in the unsupported part of scylla.yaml.

This patch removes it from the supported part of the file.

Fixes #2876

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180219111421.30687-1-amnon@scylladb.com>
2018-02-19 15:16:23 +02:00
Duarte Nunes
d394b30882 tests/flush_queue_test: Ensure queue is closed before being destroyed
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180217172008.27551-1-duarte@scylladb.com>
2018-02-19 13:10:28 +00:00
Duarte Nunes
294326b5b1 tests/commitlog_test: Close file
Operations on an append_challenged_posix_file_impl schedule asynchronous
operations when they are executed, which capture the file object. To
synchronize with them and prevent use-after-free, we need to call
close() before destroying the file.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180217170556.27330-1-duarte@scylladb.com>
2018-02-19 13:10:14 +00:00
Duarte Nunes
ac55210677 tests/logalloc_test: Ensure regions are reclaimed in order
This test relied on task execution order to work correctly. Namely, it
relied on parent regions being reclaimed before child regions
(reclaiming is an asynchronous process started by a call to
start_reclaiming()). This order is necessary because child regions
don't know about parent regions when calculating the biggest region
that should be reclaimed.

We fix this by forcing the reclaim order.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180217121655.26057-1-duarte@scylladb.com>
2018-02-19 13:09:59 +00:00
Duarte Nunes
f665f1ab97 db/commitlog: Close the segment file
Operations on a segment's underlying append_challenged_posix_file_impl,
such as truncate(), schedule asynchronous operations when they are
executed, which capture the file object. To synchronize with them and
prevent use-after-free, we need to call close() and only delete the
segment and file when the returned future resolves.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180216235754.24257-1-duarte@scylladb.com>
2018-02-19 13:09:41 +00:00
Duarte Nunes
7004f6c7ff db/commitlog: Actually prevent new requests during shutdown
When shutting down the commitlog we try to block all new requests by
acquiring all available resources. We were, however, letting go of the
semaphore permits too early, before closing the gate and shutting down
the active segments.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180216234826.24111-1-duarte@scylladb.com>
2018-02-19 13:09:26 +00:00
Duarte Nunes
9ce0be60d4 utils/flush_queue: Remove unused function
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180216234502.23931-1-duarte@scylladb.com>
2018-02-19 13:09:11 +00:00
Duarte Nunes
4fdcd6c92f tests/serialized_action_test: Don't rely on task execution order
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180216191050.21902-1-duarte@scylladb.com>
2018-02-19 13:08:58 +00:00
Duarte Nunes
03608d269e Merge 'On the road to roles' from Jesse
This series takes Scylla most of the way to supporting roles, and
eliminates old user-based code. All the old user-based CQL statements
and functionality should exist as they did before, except now they are
backed internally by roles.

While all the functionality for supporting roles should be present,
role-specific features like granting a role to another role still warn
as "unimplemented". This will continue until the next series addresses
the final touches. These remaining items are:

- A slightly revised CQL syntax consistent with Apache Cassandra's
  revised role syntax.

- A user is automatically granted permissions on resources they create.

Users running a previous version of Scylla should be able to seamlessly
upgrade to a version of Scylla with this series merged. When a newly
upgraded node starts, it detects the presence of old metadata and copies
it to the new metadata tables if no nondefault new metadata yet exists.
A new gossiper feature flag, ROLES, also ensures that access-control
data is not modified while a cluster is in a partially-upgraded state.
If, when the cluster is in a partially upgraded state, a client connects
to an un-upgraded node then likely the change will not be propagated to
the new metadata table. We will document that changes to access-control
are not supported while upgrading in order to account for both cases
(a client connecting to an upgraded and a non-upgraded node).

All unit tests pass (except those which also fail on `master`).

I've run auth-related dtests and they all pass, except for tests which
depend on the old security model and which are therefore invalid.
Upstream dtests have been updated to account for this new security model,
and I will open an appropriate pull request to similarly update our
own version.

I have also done a test-run cluster upgrade procedure with ccm
consisting of a 3 node cluster. I began by creating the cluster from
`master` and increasing the replication factor of the `system_auth`
keyspace to 3 and repairing the nodes. I then created several users and
granted them permissions on some resources. I then stopped a node,
updated its hardlinked executable to Scylla built from this patch series,
and restarted the node. I observed the migration of legacy data
starting and finishing. Connecting to the node, I observed all the new
roles functionality was working correctly. I verified that attempting to
change access-control information failed with a message about an
upgrading cluster. I repeated the process, node by node, with the
remaining two nodes and finally observed that the entire cluster had
upgraded and that I could modify access-control information freely.  I
will encapsulate this test into a dtest if possible.

Fixes #1941.

* 'jhk/switch_to_roles/v6' of https://github.com/hakuch/scylla: (83 commits)
  cql3: Remove some unimplemented warnings
  cql3: Prevent unhandled exception for anonymous user
  auth: Add alias for set of role names
  auth: Revoke permissions on dropped role resources
  auth: Move definition to corresponding .cc file
  cql3: Fix life-time of `user` from `client_state`
  auth: Migrate legacy data on boot
  auth: Check protected resources of the role-manager
  auth: Protect authenticator resources
  service/client_state: Correct erroneous comment
  client_state: Fix error message
  cql3: Fix error handling for GRANT and REVOKE
  auth: Remove unnecessary `sstring` allocation
  cql3: Rename variables to reflect roles
  auth: Decouple authorization and role management
  auth: Add code to expand a resource family
  cql: Also add `username` col. for LIST PERMISSIONS
  cql3: Fix error handling in LIST PERMISSIONS
  auth: Change error messages to pass dtests
  cql3: Handle errors more precisely for roles
  ...
2018-02-16 13:57:29 +00:00
Tomasz Grabiec
9c3e56fb16 tests: row_cache: Improve test for snapshot consistency on eviction
Reproduces https://github.com/scylladb/scylla/issues/3215.
Message-Id: <1518710592-21925-1-git-send-email-tgrabiec@scylladb.com>
2018-02-15 16:48:23 +00:00
Tomasz Grabiec
b0b57b8143 mvcc: Do not move unevictable snapshots to cache
Commit 6ccd317 introduced a bug in partition_entry::evict() where a
partition entry may be partially evicted if there are non-evictable
snapshots in it. Partially evicting some of the versions may violate
consistency of a snapshot which includes evicted versions. For one,
continuity flags are interpreted relative to the merged view, not
within a version, so evicting from some of the versions may mark
ranges as continuous when before they were discontinuous. Also, range
tombstones of the snapshot are taken from all versions, so we can't
partially evict some of them without marking all affected ranges as
discontinuous.

The fix is to revert to full eviction and avoid moving
non-evictable snapshots to cache. When moving whole partition entry to
cache, we first create a neutral empty partition entry and then merge
the memtable entry into it just like we would if the entry already
existed.

Fixes #3215.

Tests: unit (release)
Message-Id: <1518710592-21925-2-git-send-email-tgrabiec@scylladb.com>
2018-02-15 16:48:07 +00:00
Paweł Dziepak
1e218e2b80 Merge "Fixes for exception safety in cache and LSA" from Tomasz
"Fixes two issues:
  - update may abort if allocation of an empty partition_version fails
  - LSA region construction is not exception safe, it may leave the misconstructed
    region registered if allocation inside region_group::add() fails."

* tag 'tgrabiec/exception-safety-cache-update-v2' of github.com:scylladb/seastar-dev:
  tests: row_cache: Add test for exception safety of updates from memtable
  tests: flat_reader_assertions: Improve failure message
  cache: Handle exceptions from make_evictable()
  tests: Disable failure injection around background compactor
  lsa: Disable allocation failure injection inside merge()
  lsa: Make region deregistration robust against duplicates
  lsa: Make region allocation exception safe
2018-02-15 10:32:08 +00:00
Tomasz Grabiec
b3415880b2 tests: row_cache: Add test for exception safety of updates from memtable 2018-02-15 10:13:02 +01:00
Jesse Haber-Kucharsky
2348c303df cql3: Remove some unimplemented warnings
While there are some small remaining features for roles, all the old
user-based statements still exist as they did before (except now they're
backed by roles) and should not log warnings.
2018-02-14 14:16:00 -05:00
Jesse Haber-Kucharsky
114cfd4e5a cql3: Prevent unhandled exception for anonymous user
Since `validate` is called after `check_access`, an anonymous user would
not get the expected error message about restrictions on anonymous
users.
2018-02-14 14:16:00 -05:00
Jesse Haber-Kucharsky
a83af20311 auth: Add alias for set of role names
This shortens some type names considerably.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
39a44e3494 auth: Revoke permissions on dropped role resources
Previously, when a table or keyspace was dropped, the
authorizer (through a `migration_listener`) automatically dropped all
permissions granted on that resource.

Likewise, when a role is granted permissions and the role is dropped,
all permissions granted to the role are dropped.

In this change, we now treat role resources just like table and keyspace
resources: if a permission is granted on a role (like "GRANT AUTHORIZE
ON ROLE qa TO phil") and the "qa" role is dropped, then all permissions
on the "qa" role resource are also dropped.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
e6d9d53eca auth: Move definition to corresponding .cc file 2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
89b5bf2d7a cql3: Fix life-time of user from client_state 2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
fbc97626c4 auth: Migrate legacy data on boot
This change allows for seamless migration of the legacy users metadata
to the new role-based metadata tables. This process is summarized in
`docs/migrating-from-users-to-roles.md`.

In general, if any nondefault metadata exists in the new tables, then
no migration happens. If, in this case, legacy metadata still exists
then a warning is written to the log.

If no nondefault metadata exists in the new tables and the legacy tables
exist, then each node will copy the data from the legacy tables to the
new tables, performing transformations as necessary. An informational
message is written to the log when the migration process starts, and
when the process ends. During the process of copying, data is
overwritten so that multiple nodes racing to migrate data do not
conflict.

Since Apache Cassandra's auth. schema uses the same table for managing
roles and authentication information, some useful functions in
`roles-metadata.hh` have been added to avoid code duplication.

Because a superuser should be able to drop the legacy users tables from
`system_auth` once the cluster has migrated to roles and is functioning
correctly, we remove the restriction on altering anything in the
"system_auth" keyspace. Individual tables in `system_auth` are still
protected later in the function.

When a cluster is upgrading from one that does not support roles to one
that does, some nodes will be running old code which accesses old
metadata and some will be running new code which accesses new metadata.

With the help of the gossiper `feature` mechanism, clients connecting to
upgraded nodes will be notified (through code in the relevant CQL
statements) that modifications are not allowed until the entire cluster
has upgraded.
2018-02-14 14:15:59 -05:00
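The boot-time decision described in fbc97626c4 can be summarised as a small truth table. The sketch below is an illustration only: the enumeration and function names are hypothetical and do not come from the Scylla source.

```cpp
#include <cassert>

// Possible outcomes at boot (names are illustrative).
enum class migration_action { nothing, warn_legacy_remains, copy_legacy };

// Nondefault data in the new tables always suppresses migration; leftover
// legacy tables then only produce a warning. Otherwise legacy data, if any,
// is copied to the new tables.
migration_action decide_migration(bool new_tables_have_nondefault_data, bool legacy_tables_exist) {
    if (new_tables_have_nondefault_data) {
        return legacy_tables_exist ? migration_action::warn_legacy_remains
                                   : migration_action::nothing;
    }
    return legacy_tables_exist ? migration_action::copy_legacy
                               : migration_action::nothing;
}
```

Because the copy overwrites rows rather than appending them, several nodes reaching `copy_legacy` at the same time should converge on the same result, as the commit message notes.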
Jesse Haber-Kucharsky
8be0165713 auth: Check protected resources of the role-manager
A new function `auth::service::is_protected` checks the
protected-resource set of all access-control modules (including the
role-manager).
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
8440140465 auth: Protect authenticator resources
A typo meant that only the authorizer resources were protected.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
617e432540 service/client_state: Correct erroneous comment 2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
e27cfd4dda client_state: Fix error message
Now that resources are not just keyspaces and tables, the word "schema"
doesn't make sense.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
f9f03bc2e1 cql3: Fix error handling for GRANT and REVOKE
This change gets rid of duplicated code for checking if the grantee or
revokee exist by moving this functionality to the auth. service.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
e18adbcb3e auth: Remove unnecessary sstring allocation
The authorizer now accepts parameters by `string_view`.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
c1a03dbf54 cql3: Rename variables to reflect roles 2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
5be16247cc auth: Decouple authorization and role management
Access control in Scylla consists of three main modules: authentication,
authorization, and role-management.

Each of these modules is intended to be interchangeable with alternative
implementations. The `auth::service` class composes these modules
together to perform all access-control functionality, including caching.

This architecture implies two main properties of the individual
access-control modules:

- Independence of modules. An implementation of authentication should
  have no dependence or knowledge of authorization or role-management,
  for example.

- Simplicity of implementing the interface. Functionality that is common
  to all implementations should not have to be duplicated in each
  implementation. The abstract interface for a module should capture
  only the differences between particular implementations.

Previously, the authorization interface depended on an instance of
`auth::service` for certain operations, since it required aggregation
over all the roles granted to a particular role or required checking if
a given role had superuser.

This change decouples authorization entirely from role-management: the
authorizer now manages only permissions granted directly to a role, and
not those inherited through other roles.

When a query needs to be authorized, `auth::service::get_permissions`
first uses the role manager to check if the role has superuser. Then, it
aggregates calls to `auth::authorizer::authorize` for each role granted
to the role (again, from the role-manager) to determine the sum-total
permission set. This information is cached for future queries.

This structure allows for easier error handling and
management (something I hope to improve in the future for both the
authorizer and authenticator interfaces), easier system testing, easier
implementation of the abstract interfaces, and clearer system
boundaries (so the code is easier to grok).

Some authorizers, like the "TransitionalAuthorizer", grant permissions
to anonymous users. Therefore, we could not unconditionally authorize an
empty permission set in `auth::service` for anonymous users. To account
for this, the interface of the authorizer has changed to accept an
optional name in `authorize`.

One additional notable change to the authorizer is the
`auth::authorizer::list`: previously, the filtering happened at the CQL
query layer and depended on the roles granted to the role in question.
I've changed the function to simply query for all roles and I do the
filtering in `auth::system` in-memory with the STL. This was necessary
to allow the authorizer to be decoupled from role-management. This
function is only called for LIST PERMISSIONS (so performance is not a
concern), and it significantly reduces demand on the implementation.

Finally, we unconditionally create a user in `cql_test_env` since
authorization requires its existence.
2018-02-14 14:15:59 -05:00
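The aggregation that 5be16247cc describes for `auth::service::get_permissions` can be sketched roughly as below. This is a simplified, synchronous stand-in, not the real interface: the `role_manager` and `authorizer` structs, the bitmask representation, and the choice to check superuser status only on the queried role are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <set>
#include <string>

using permission_mask = std::uint32_t;
constexpr permission_mask all_permissions = 0xF;

// Stand-in role manager: grants between roles, plus superuser flags.
struct role_manager {
    std::map<std::string, std::set<std::string>> granted;  // role -> roles granted to it
    std::set<std::string> superusers;

    // Returns the role itself and every role reachable through grants.
    std::set<std::string> all_granted(const std::string& role) const {
        std::set<std::string> seen;
        collect(role, seen);
        return seen;
    }
private:
    void collect(const std::string& role, std::set<std::string>& seen) const {
        if (!seen.insert(role).second) return;  // already visited
        auto it = granted.find(role);
        if (it == granted.end()) return;
        for (const auto& r : it->second) collect(r, seen);
    }
};

// Stand-in authorizer: knows only permissions granted directly to a role.
struct authorizer {
    std::map<std::string, permission_mask> direct;
    permission_mask authorize(const std::optional<std::string>& role) const {
        if (!role) return 0;  // this sketch grants nothing to anonymous users
        auto it = direct.find(*role);
        return it == direct.end() ? 0 : it->second;
    }
};

// The service composes both modules; neither module knows about the other.
permission_mask get_permissions(const role_manager& rm, const authorizer& az,
                                const std::string& role) {
    if (rm.superusers.count(role)) return all_permissions;
    permission_mask result = 0;
    for (const auto& r : rm.all_granted(role)) result |= az.authorize(r);
    return result;
}
```

With "qa" granted to "phil" and "readers" granted to "qa", a query for "phil" combines the direct permissions of all three roles, while the authorizer itself never looks at the grant graph.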
Jesse Haber-Kucharsky
0ac7d9922d auth: Add code to expand a resource family
This will be useful for the next change, where it is used for
refactoring LIST PERMISSIONS.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
d0ddb354d0 cql: Also add username col. for LIST PERMISSIONS
The value for the `role` column is equal to the value for the `username`
column.

This change makes LIST PERMISSIONS backwards compatible with clients
that expect the `username` column to exist. This functionality also
exists in Apache Cassandra.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
cccfe269cf cql3: Fix error handling in LIST PERMISSIONS
This patch replaces duplicated code for checking the existence of a user
with the same mechanism for doing so as elsewhere: by checking for
`auth::nonexistent_role` being thrown during the course of checking
access-control.

This patch also ensures that exceptions thrown while querying the list
of permissions on a resource get handled correctly.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
13ba128967 auth: Change error messages to pass dtests
The fixed dtests which only failed due to differences in wording and
grammar for error messages are:

- altering_nonexistent_user_throws_exception_test
- cant_create_existing_user_test
- dropping_nonexistent_user_throws_exception_test
- users_cant_alter_their_superuser_status_test
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
f372bbb4bc cql3: Handle errors more precisely for roles
This patch ensures that all the CQL statements for managing roles
correctly catch exceptions in the underlying `role_manager` and re-throw
them as top-level exceptions (like "invalid request").

This patch also refines exception handling so that only the applicable
errors are explicitly caught. This should allow easier auditing in the
future and help to reveal faulty assumptions.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
ce3be07556 auth: Move resource existence checks
Previously, a "data" auth. resource knew how to check its own existence by
accessing a global variable.

This patch accomplishes two things: it adds existence checking to all
kinds of resources, and moves these checks outside of `auth::resource`
itself and into `auth::service` (so that global variables are no longer
accessed).
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
cf5f6aa4c5 auth: Fix fragile variable life-times
According to the Seastar convention, a parameter passed to a function
taking a reference parameter must live for the duration of the execution
of the returned future.

When possible, variables are statically allocated. When this is not
possible, we use `do_with`.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
5f323a3530 cql3: Check only filtered permissions
When a user executes GRANT or REVOKE, Scylla ensures that they
themselves are granted the permissions they are changing.

The code previously checked a static list of permissions, which we could
have replaced with `auth::permissions::ALL`. Even better, we now expand
the set of filtered permissions into an iterable container.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
f4fc12fbf0 enum_set: Add iterator
Sometimes it is useful to be able to query for all the members of an
`enum_set`, rather than just add, remove, and query for membership. (The
patch following this one makes use of this in the auth. sub-system).

We use the bitset iterator in Seastar to help with the implementation.
2018-02-14 14:15:59 -05:00
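To illustrate what f4fc12fbf0 means by querying for all members of a set, here is a minimal sketch of enumerating the set bits of a mask. It does not use the Seastar bitset iterator mentioned in the commit; `permission`, `permission_set`, and `members()` are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical enumeration; each value is a bit position in the mask.
enum class permission : unsigned { create = 0, alter = 1, drop = 2, select = 3 };

// Minimal stand-in for an enum_set: a bitmask plus a way to list members.
class permission_set {
    std::uint32_t _mask = 0;
public:
    void add(permission p) { _mask |= (1u << static_cast<unsigned>(p)); }
    bool contains(permission p) const {
        return (_mask & (1u << static_cast<unsigned>(p))) != 0;
    }
    // Walks the mask from the lowest bit upward and returns the enumerator
    // for every set bit, so members come out in declaration order.
    std::vector<permission> members() const {
        std::vector<permission> out;
        for (unsigned i = 0; i < 32; ++i) {
            if (_mask & (1u << i)) out.push_back(static_cast<permission>(i));
        }
        return out;
    }
};
```

A real iterator would yield members lazily instead of building a vector; the vector keeps the sketch short.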
Jesse Haber-Kucharsky
bbe09a4793 enum_set: Throw on bad mask
`super_enum::valid_is_valid_sequence` determines if the numeric index
corresponding to an enumeration value is valid. This is important,
because it is undefined behavior to cast an invalid index into an
enumeration value.

This function is used to check the validity of the `enum_set` mask when
an `enum_set` is constructed in `enum_set::from_mask`. If the mask has
set bits that correspond to invalid enumeration indices, then we throw
`bad_enum_set_mask`.
2018-02-14 14:15:59 -05:00
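The mask check described in bbe09a4793 might look roughly like the following. This is a sketch under assumptions: `checked_mask`, `bad_mask_error`, and `permission_count` are hypothetical names, and the real `enum_set::from_mask` and `bad_enum_set_mask` may differ in detail.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical enumeration with four valid indices (0 through 3).
enum class permission : unsigned { create = 0, alter = 1, drop = 2, select = 3 };
constexpr unsigned permission_count = 4;

// Illustrative exception type, standing in for bad_enum_set_mask.
struct bad_mask_error : std::invalid_argument {
    using std::invalid_argument::invalid_argument;
};

// Accepts only masks whose set bits all map to declared enumerators, so a
// later cast from bit index to enumeration value is always well defined.
std::uint32_t checked_mask(std::uint32_t mask) {
    constexpr std::uint32_t valid_bits = (1u << permission_count) - 1;
    if (mask & ~valid_bits) {
        throw bad_mask_error("mask has bits outside the enumeration");
    }
    return mask;
}
```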
Jesse Haber-Kucharsky
1cf6dd85fb tests: Add basic tests for enum_set
This is motivated by a small addition to `enum_set` and `super_enum`
that follows this patch.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
7db675b298 cql3: Remove std::move on return value
Wrapping a return value in `std::move` prevents guaranteed return-value optimization (RVO).
2018-02-14 14:15:59 -05:00
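The effect that 7db675b298 removes can be observed with a type that counts its own moves. The sketch below is not taken from the cql3 code; `tracked`, `make_plain`, and `make_moved` are hypothetical names, and a named local is used because it makes the difference easy to count.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Global counters so the cost of each return style is observable.
int copy_count = 0;
int move_count = 0;

struct tracked {
    std::string payload;
    explicit tracked(std::string p) : payload(std::move(p)) {}
    tracked(const tracked& o) : payload(o.payload) { ++copy_count; }
    tracked(tracked&& o) noexcept : payload(std::move(o.payload)) { ++move_count; }
};

// Returning the local by name lets the compiler elide the move entirely;
// if it does not, the local is still moved rather than copied.
tracked make_plain() {
    tracked t("role");
    return t;
}

// std::move turns the operand into an xvalue, which rules out elision and
// forces exactly one move construction of the return object.
tracked make_moved() {
    tracked t("role");
    return std::move(t);
}
```

Calling `make_moved()` always increments `move_count` once, whereas `make_plain()` increments it at most once and typically not at all. Many compilers also flag the second form with a pessimizing-move warning.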
Jesse Haber-Kucharsky
357f3afb60 auth: Remove outdated "TODO"
Authorization never happens at this level of the stack, though it
formerly did.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
b1d9d0e4ff auth: Reorder authorizer args for consistency 2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
c1504cd4ff auth: Pass resource by const ref.
This has the dual benefit of not enforcing copying on implementations of
the abstract interface and also limiting unnecessary copies.

As usual with Seastar, we follow the convention that a reference
parameter to a function is assumed valid for the duration of the
`future` that is returned. `do_with` helps here.

By adding some constants for root resources, we can avoid using
`seastar::do_with` at some call-sites involving `resource` instances.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
45631604b0 auth: Use string_view for parameters 2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
c4f686c10f auth: Put definitions inside namespace 2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
3de8b4c898 auth/resource: Don't store exn. argument 2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
7fd3539d94 cql3: Avoid redundant return when throwing 2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
81f38edc61 auth/service: Rename function for consistency 2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
ac3c68b0ac auth/role_manager.hh: Unify doc. style 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
0c6bd791c2 auth/role_manager: Remove unnecessary exn. info
We can add it back on an as-needed basis. The other exceptions in the
module do not make similar information available.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
0590dcf6cd auth/authorizer: Add missing const 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
a3eaf9e697 auth: Remove unused "performer" argument
This argument used to be used for access-control checks, but this has
all moved to the CQL layer.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
5fe464d999 auth/default_authorizer: Move access-checks to CQL
All authorization checking lives in the CQL layer. The individual
authenticator, authorizer, and role-manager enforce no access-checks.

It may be a good idea to move these checks a level downward in the
future for ease of testing, but for now we aim for consistency.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
4d2c4177df cql3/list_permissions_statement: Fix formatting
Something strange must have happened with somebody's editor.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
45c6d13812 auth: Remove useless try-catch block
This looks to have been a typo in the original porting work.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
2dc9f00fe3 cql3: Use authenticated_user-specific overload
This prevents us from accidentally accessing a non-existent value.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
68ba6a481b auth: Add has_role helper 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
f8bbbfd8f9 auth: Check role existence when querying perms 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
a0f0e07554 auth: Check for unsupported authentication options
While it's undefined behavior to pass an unsupported option to a
specific authenticator directly, the `auth::service` layer will check
options and throw this exception. It is turned into a
`invalid_request_exception` by the CQL layer.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
e6363e15de auth/resource: Construct from ctor
The motivation behind this change is the idea that constructing a new
instance of an object is the job of the constructor.

One big benefit of this structure (with the addition of helpers for
convenience) is that calls for emplacing instances (like
`std::make_shared`, or `std::vector::emplace_back`) work without any
difficulty. This would not be true for static construction functions.
2018-02-14 14:15:58 -05:00
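The emplacement benefit mentioned in e6363e15de can be shown with a small example. The `resource` struct below is a hypothetical simplification, not the real `auth::resource`, and `make_data_resource` only stands in for the kind of convenience helper the commit mentions.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical resource type whose construction goes through a public ctor.
struct resource {
    std::string keyspace;
    std::string table;
    resource(std::string ks, std::string t)
        : keyspace(std::move(ks)), table(std::move(t)) {}
};

// Convenience helper layered on top of the constructor.
inline resource make_data_resource(std::string ks, std::string t) {
    return resource(std::move(ks), std::move(t));
}

std::vector<resource> build_resources() {
    std::vector<resource> v;
    v.emplace_back("ks1", "users");                     // constructed in place via the ctor
    v.push_back(make_data_resource("ks2", "events"));   // helper still available
    return v;
}

std::shared_ptr<resource> shared_resource() {
    // make_shared forwards its arguments straight to the constructor.
    return std::make_shared<resource>("ks1", "users");
}
```

If construction were only possible through a static factory function, `emplace_back` and `make_shared` could not forward arguments this way; callers would have to build a temporary first and then move it.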
Jesse Haber-Kucharsky
12d6f5817d auth: Switch to std::optional
Now that Scylla is a C++17 application, we should no longer use
`std::experimental::optional` (which is a distinct type from
`std::optional`).
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
a633777378 auth/authorizer.hh: Use default keyword 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
739f0e2dbd auth: Move static member function decl. up 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
2e1c3823d0 auth/authorizer: Delete unused member function 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
59c100b37f auth: Use virtual and override
According to previous discussions on the mailing-list with Avi, using
both has the benefits of making virtual functions stand out and also
warning about functions which unintentionally do not override.
2018-02-14 14:15:58 -05:00
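A short illustration of the convention adopted in 59c100b37f, using hypothetical class and function names rather than the real `auth` interfaces:

```cpp
#include <cassert>
#include <string>

struct authorizer {
    virtual ~authorizer() = default;
    virtual std::string display_name() const { return "base"; }
};

struct default_authorizer final : authorizer {
    // `virtual` makes the function stand out as part of the polymorphic
    // interface; `override` makes the compiler reject the declaration if the
    // signature stops matching the base (for example, if `const` is dropped).
    virtual std::string display_name() const override { return "default"; }
};

// Dispatches through the base-class interface.
std::string name_through_base(const authorizer& a) {
    return a.display_name();
}
```

Without `override`, a mismatched signature would silently declare a new, unrelated function, and calls through the base reference would keep returning "base".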
Jesse Haber-Kucharsky
4d9f957dc2 auth/authenticator.hh - Use default keyword 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
f78d89968e auth/authorizer.hh: Replace documentation 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
a66896dd8f auth/authenticator.hh: Replace documentation 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
053b6b4d04 auth: Unify formatting
The goal is for all files in `auth/` to conform to the Seastar/Scylla
`coding-style.md` document.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
a4c7aee238 auth: Fix includes 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
de33124c39 Don't store authenticated_user in shared_ptr
All we require are value semantics.

`client_state` still stores `authenticated_user` in a `shared_ptr`, but
the behavior of that class is complex enough to warrant its own
discussion/design/refactor.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
f7b4f62dab auth/authenticated_user: Add some documentation 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
e11de26d50 auth: Simplify authenticated_user interface
The most important change is replacing `auth::authenticated_user::name`
with a public `std::optional<sstring>` member. Anonymous users have no
name. This replaces the insecure and bug-prone special-string of
"anonymous" for anonymous users, which does unfortunate things with the
authorizer.

The new `auth::is_anonymous` function exists for convenience since
checking the absence of a `std::optional` value can be tedious.

When a caller really wants a name unconditionally, a new stream output
function is also available.
2018-02-14 14:15:58 -05:00
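The shape of the interface described in e11de26d50 might look roughly like this. It is a simplified sketch: `std::string` stands in for `sstring`, `display_name` is a hypothetical helper added for the example, and the text printed for anonymous users is an assumption.

```cpp
#include <cassert>
#include <optional>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Simplified stand-in for auth::authenticated_user.
struct authenticated_user {
    std::optional<std::string> name;  // empty for anonymous users
    authenticated_user() = default;
    explicit authenticated_user(std::string n) : name(std::move(n)) {}
};

// Convenience check, so callers need not inspect the optional directly.
inline bool is_anonymous(const authenticated_user& u) noexcept {
    return !u.name.has_value();
}

// Produces a printable name unconditionally. The fallback text is used for
// display only and never takes part in authorization decisions.
inline std::ostream& operator<<(std::ostream& os, const authenticated_user& u) {
    return os << (u.name ? *u.name : std::string("anonymous"));
}

// Hypothetical helper that renders a user through the stream operator.
inline std::string display_name(const authenticated_user& u) {
    std::ostringstream os;
    os << u;
    return os.str();
}
```

Because anonymity is encoded by the absence of a value, a real role that happens to be called "anonymous" is not mistaken for an anonymous user.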
Jesse Haber-Kucharsky
308a0be5c2 auth/authenticated_user: Make ctor explicit 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
9ac6035f5d auth/authenticated_user: Use std::optional 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
0d1ea0a357 auth/authenticated_user: Mark functions noexcept 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
6cb3b06112 auth/authenticated_user: Remove outdated comment 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
64f844b870 auth/authenticated_user: Hide internal constant 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
15a2b93970 auth/authenticated_user: Use default ctors 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
fa94ee5a3a auth/authenticated_user: Move defns into namespace 2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
4fad30ef42 auth/authenticated_user: Remove whitespace 2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
2dd632f6e8 auth/authenticated_user: Use string_view in ctor 2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
fa159c0ac4 auth: Mark authenticated_user final 2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
f18dd25e7e cql3: Fix DROP ROLE IF EXISTS
Checking if the role to be dropped has superuser requires that the role
exists, which means `auth::nonexistent_role` was thrown even when IF
EXISTS was specified.
2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
b69c27d210 auth/standard_role_manager: Avoid string copies 2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
bcc1fbad3a auth/service.hh: Fix documentation for errors
There is a distinct difference between throwing an exception
immediately and returning an exceptional future.
2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
741d215516 auth: Switch to roles from users
This is a large change, but it's a necessary evil.

This change brings us to a minimally-functional implementation of roles.
There are many additional changes that are necessary, including refined
grammar, bug fixes, code hygiene, and internal code structure changes.
In the interest of keeping this patch somewhat readable, those changes
will come in subsequent patches. Until that time, roles are still marked
"unimplemented".

IMPORTANT: This code does not include any mechanism for transitioning a
cluster from user-based access-control to role-based access control. All
existing access-control metadata will be ignored (though not deleted).

Specific changes:

- All user-specific CQL statements now delegate to their roles
  equivalent. The statements are effectively the same, but CREATE USER
  will include LOGIN automatically. Also, LIST USERS only lists roles
  with LOGIN.

- A call to LIST PERMISSIONS will now also list permissions of roles
  that have been granted to the caller, in addition to permissions which
  have been granted directly.

- Much of the logic of creating, altering, and deleting roles has been
  moved to `auth::service`, since these operations require cooperation
  between the authenticator, authorizer, and role-manager.

- LIST USERS actually works as expected now (fixes #2968).
2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
41f893d676 Don't use "experimental" optional
We're in C++17 country now.
2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
903ea32f30 auth/standard_role_manager: Fix life-time bug
It worked most of the time, but changes in other areas of the code must
have triggered the conditions necessary to make it fail.
2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
8878ce456c cql3/statements: Use convenient type alias 2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
36b283f7ea auth: Allow empty role updates 2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
34280c18bb tests: Rename helper function for clarity 2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
635dc3d5ed auth: Include missing header 2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
f2b78499fe auth: Fix logic in service::role_has_superuser
The previous code has an off-by-one error since the iterator is
incremented unconditionally prior to being compared to the end of the
collection.

This new version is also shorter thanks to `seastar::do_until`.
2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
28a840db72 auth: Add error handling for incompatible modules
The components of access-control (authentication, authorization, and
role-management) are designed as abstract interfaces, but due to
decisions of Apache Cassandra, certain implementations are dependent on
other particular implementations.

This change throws a new exception,
`auth::incompatible_module_combination`, when a dependency is not
satisfied.
2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
b3dc90d5d2 auth: Refactor authentication options
The set of allowed options is quite small, so we benefit from a static
representation (member variables) over a dynamic map.

We also logically move the "OPTIONS" option to the domain of the
authenticator (from user management), since this is where it is applied.

This refactor also aims to reduce compilation time by moving
`authentication_options` into its own header file.

While changes to `user_options` were necessary to accommodate the new
structure, that class will be deprecated shortly in the switch to roles.
Therefore, the changes are strictly temporary.
2018-02-14 14:15:57 -05:00
Tomasz Grabiec
1039850515 tests: flat_reader_assertions: Improve failure message 2018-02-14 16:42:49 +01:00
Tomasz Grabiec
27b114fe45 cache: Handle exceptions from make_evictable()
cache_entry constructor was marked noexcept, yet make_evictable() may
fail in rare cases due to allocation in add_version(). Lift the
annotation and make sure that construction has strong exception
guarantees for the moved-in state so that it can be retried without
data loss inside allocating section.
2018-02-14 16:42:49 +01:00
Tomasz Grabiec
74986f31e8 tests: Disable failure injection around background compactor
Failure could be injected into the compactor if the main code under
test defers before reaching allocation failure point, and compactor
gets hit. This is not what the test is supposed to stress, and it
causes abort when memtable_snapshot_source is destroyed, so disable
failure injection there.
2018-02-14 16:42:49 +01:00
Tomasz Grabiec
7e0ff8a920 lsa: Disable allocation failure injection inside merge()
Fixes termination in tests due to a throw from merge(), which is noexcept.
2018-02-14 16:42:49 +01:00
Tomasz Grabiec
66701c1671 lsa: Make region deregistration robust against duplicates 2018-02-14 16:42:49 +01:00
Tomasz Grabiec
cf876bbe2d lsa: Make region allocation exception safe
We were not unregistering in case add() fails.
2018-02-14 16:42:49 +01:00
Paweł Dziepak
6c1503241d Merge seastar upstream
* seastar 2b0a81d...383ccd6 (9):
  > future-util: relax concept requirements for do_for_each()
  > seastar-addr2line: improve UX for bactraces read from stdin
  > noncopyable_function: Lift the noexcept guarantee
  > queue: doxygen documentation
  > queue: documentation
  > build: reinstate -Wsign-compare
  > iotune: don't compare sign and unsigned types
  > future-util: Remove unused local in with_scheduling_group()
  > tests/test-utils: Add macro for running tests within a seastar thread
2018-02-14 14:37:42 +00:00
Amnon Heiman
827723cec8 API: Add get active repair api
This patch adds an API to return an array of the ids of current active repairs.

After this patch a call to:
curl http://localhost:10000/storage_service/active_repair/

will return the ids of the active repairs.

Fixes #3193

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-02-14 11:43:41 +02:00
Amnon Heiman
3f2eae35fd repair: Add a get_active_repairs function to return the active repair
This patch adds a function that returns an array with the ids of the
active repairs by filtering the RUNNING ones in the repair tracker status.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-02-14 11:43:37 +02:00
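The filtering step described in the commit above could look roughly like the following sketch. The enum values, the `std::map` container, and the name `get_active_repairs_sketch` are assumptions for illustration; the real repair tracker is not shown in this log.

```cpp
#include <map>
#include <vector>

// Hypothetical status values; only RUNNING matters for this sketch.
enum class repair_status { RUNNING, SUCCESSFUL, FAILED };

// Collect the ids whose status is RUNNING. std::map keeps the ids sorted,
// which makes the order of the result deterministic.
inline std::vector<int> get_active_repairs_sketch(
        const std::map<int, repair_status>& status_by_id) {
    std::vector<int> active;
    for (const auto& [id, status] : status_by_id) {
        if (status == repair_status::RUNNING) {
            active.push_back(id);
        }
    }
    return active;
}
```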
Duarte Nunes
6f7233fbaf cql3/statements/truncate_statement: Prevent MV from being truncated
To truncate an MV, one must truncate the base table.

Fixes #3188

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180209162720.32757-1-duarte@scylladb.com>
2018-02-13 11:37:27 +00:00
Duarte Nunes
771852e731 Merge 'Fix possible stall in calculate_pending_ranges' from Asias
When the cluster is large or num_tokens is big, calculate_pending_ranges
can take a long time to complete. It currently runs in the gossip thread, so
it can block gossip processing. Another problem is that it runs in a plain
for loop and can cause a reactor stall.

Users see this stall with decommission operations.

I can reproduce a stall of up to 4 seconds in a two-node cluster, each node
with `--num-tokens 3072`, during decommission.

Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout

Fixes #3203

* tag 'asias/issue_3203_v2.1' of github.com:scylladb/seastar-dev:
  storage_service: Do not wait for update_pending_ranges in handle_state_leaving
  token_metadata: Handle affected_ranges with do_for_each
  token_metadata: Split token_metadata::calculate_pending_ranges
  token_metadata: Futurize calculate_pending_ranges
  storage_service: Futurize storage_service::do_update_pending_ranges
  token_metadata: Speed up token_metadata::get_endpoint
2018-02-13 11:12:22 +00:00
Asias He
74b4035611 storage_service: Do not wait for update_pending_ranges in handle_state_leaving
The call chain is:

storage_service::on_change() -> storage_service::handle_state_leaving()
-> storage_service::update_pending_ranges()

Listeners run as part of gossip message processing, which is
serialized. This means we won't be processing any gossip messages until
update_pending_ranges completes. update_pending_ranges takes time to
complete.

Since we do not wait for update_pending_ranges to complete any more,
multiple update_pending_ranges operations can run at the same time, so use
serialized_action to serialize them.

Tested with update_cluster_layout_tests.py
2018-02-13 19:00:43 +08:00
Asias He
c17ce79977 token_metadata: Handle affected_ranges with do_for_each
affected_ranges can be very large in a large cluster or on a node with a
big num_tokens value. calculate_natural_endpoints takes more time to
process in this case as well.

Futurize calculate_pending_ranges_for_leaving and handle the loop with
do_for_each to give the reactor some time to breathe, so it does not
block.
2018-02-13 19:00:43 +08:00
Asias He
60143a7517 token_metadata: Split token_metadata::calculate_pending_ranges
token_metadata::calculate_pending_ranges is a complicated function.
Split it into 3 parts: one each for the leaving, moving, and
bootstrap operations.
2018-02-13 19:00:43 +08:00
Asias He
1834dd023f token_metadata: Futurize calculate_pending_ranges
Now, do_update_pending_ranges is futurized. We can finally futurize
token_metadata::calculate_pending_ranges in order to convert the loops
inside it to do_for_each instead of plain for loops to avoid reactor
stalls.
2018-02-13 19:00:43 +08:00
Asias He
33c43b78c7 storage_service: Futurize storage_service::do_update_pending_ranges
Preparation work for the futurizing of the time consuming
token_metadata::calculate_pending_ranges.

In addition, we use do_for_each for the loop. It is better than the
plain for loop because the reactor can yield to avoid stalls in cases
there are tons of keyspaces.
2018-02-13 19:00:43 +08:00
Asias He
96266fc76a token_metadata: Speed up token_metadata::get_endpoint
token_metadata::calculate_pending_ranges ->
abstract_replication_strategy::calculate_natural_endpoints
-> token_metadata::get_endpoint()

With std::map

   INFO  2018-02-09 14:58:32,960 [shard 0] token_metadata - In
   calculate_pending_ranges: affected_ranges.size=6145 stars
   Reactor stalled for 4000 ms on shard 0.
   Backtrace:
     0x00000000004b12cb
     0x00000000004b1561
     /lib64/libpthread.so.0+0x00000000000123af
     0x0000000001159e25
     0x00000000011581eb
     0x000000000114f122
     0x000000000119f8c7
     0x00000000011985a4
     0x00000000011a7e16
     0x0000000001364741
     0x00000000013fe9fd
     0x00000000013ff792
     0x00000000014024b2
     0x000000000141a66f
     0x000000000141d7be
     0x00000000010ed234
     0x000000000112fdaa
     0x00000000011301f4
     0x000000000043543d
   INFO  2018-02-09 14:58:35,993 [shard 0] token_metadata - In
   calculate_pending_ranges: affected_ranges.size=6145 ends

With std::unordered_map

    INFO  2018-02-09 14:47:50,251 [shard 0] token_metadata - In
    calculate_pending_ranges: affected_ranges.size=6145 stars
    INFO  2018-02-09 14:47:51,585 [shard 0] token_metadata - In
    calculate_pending_ranges: affected_ranges.size=6145 ends
2018-02-13 19:00:42 +08:00
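The change measured above swaps an ordered map for a hash map on the token lookup path. A minimal sketch of such a lookup follows; the `token` and `endpoint` aliases and the class name are placeholders, not the real token_metadata types.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Placeholder types for this sketch only.
using token = std::int64_t;
using endpoint = std::string;

class token_to_endpoint_sketch {
    // Hash map: average O(1) lookup, versus O(log n) for std::map.
    std::unordered_map<token, endpoint> _map;
public:
    void update(token t, endpoint ep) {
        _map.insert_or_assign(t, std::move(ep));
    }

    // Returns the endpoint owning `t`, or std::nullopt if the token is unknown.
    std::optional<endpoint> get_endpoint(token t) const {
        auto it = _map.find(t);
        if (it == _map.end()) {
            return std::nullopt;
        }
        return it->second;
    }
};
```

The trade-off is that a hash map gives up sorted iteration, so this layout only fits code paths that need point lookups.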
Duarte Nunes
ac6abf8021 Merge 'CQL clustering column secondary indexing support' from Pekka
"This patch series adds support for clustering column secondary indexing.

Fixes #2961

Tests: unit-tests (release)"

* 'penberg/cql-2i-clustering-key-indexing/v2' of github.com:penberg/scylla:
  tests/cql_query_test: Add indexed clustering key query test
  cql3: Fix clustering column secondary indexing
  cql3/statements: Add values() helper to restrictions
  cql3/restrictions: Fix multi_column_restriction::values()
  cql3/restrictions: Fix single_column_primary_key_restrictions::values()
2018-02-12 18:49:34 +00:00
Amnon Heiman
d88c27614e scylla-housekeeping: add configuration for api-address
This patch makes the api address and port configurable.

Fixes #2332

Message-Id: <20180204095628.1210-1-amnon@scylladb.com>
2018-02-12 15:26:46 +02:00
Amnon Heiman
449f9af0db API: Use stream_range_as_array to return token endpoints
The token_to_endpoint map can get so big that trying to convert it to a
vector will cause a large allocation warning.

This patch replaces the implementation, so the returned json array will be
created directly from the map by using the stream_range_as_array helper
function.

Fixes #3185

Message-Id: <20180207153306.30921-1-amnon@scylladb.com>
2018-02-12 15:24:07 +02:00
Avi Kivity
e77ecda1da tests: avoid signed/unsigned compares
Container indices are size_t, and in other places we gratuitously
declare a limit as unsigned and the loop index as signed.

Tests: unit (release)
Message-Id: <20180212121642.10525-1-avi@scylladb.com>
2018-02-12 12:25:21 +00:00
Avi Kivity
87f10bc853 sstables: continuous_data_consumer: make _remain an unsigned type
All of the adjustments to _remain already ensure it is greater than 0,
and indeed a negative _remain doesn't make sense.

Switching to an unsigned type allows us to re-enable -Wsign-compare.

Tests: unit (release)
Message-Id: <20180212121636.10463-1-avi@scylladb.com>
2018-02-12 12:25:21 +00:00
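As an illustration of the point in the commit above, a remaining-bytes counter can be kept unsigned as long as every adjustment is clamped first. The class below is a hypothetical sketch, not the continuous_data_consumer code.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical byte budget with an unsigned remaining count.
class byte_budget_sketch {
    std::size_t _remain;
public:
    explicit byte_budget_sketch(std::size_t total) : _remain(total) {}

    // Consume at most `available` bytes. Clamping before subtracting means
    // _remain can never wrap around, so a signed type is not needed.
    std::size_t consume(std::size_t available) {
        const std::size_t n = std::min(available, _remain);
        _remain -= n;
        return n;
    }

    std::size_t remaining() const { return _remain; }
};
```

Because both operands of every comparison are `std::size_t`, code like this compiles cleanly with -Wsign-compare enabled.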
Avi Kivity
55168592ad compaction_manager: fix use-after-free of column_family
Commit cce1a2bce8 ("Use the CPU scheduler")
placed some compaction manager code in a scheduling_group. Unfortunately,
downstream code relied on the callers not deferring, so it can rely
on the column_family's existence. That doesn't happen if the column_family
is removed quickly, as with_scheduling_group() always defers.

Fix by applying the scheduling group after we've taken the lock and guaranteed
the stability of the column_family object.

Fixes #3196.
Message-Id: <20180211165155.18179-1-avi@scylladb.com>
2018-02-11 17:53:35 +00:00
Avi Kivity
3f5a8229ac tests: fix for sstable::get_index_reader() removal
71495691aa removed sstable::get_index_reader(),
but forgot to update its callers in tests/.  Update the callers to construct
a temporary shared_index_list and create the index_reader directly.

This is none too clean, but shared_index_lists needs to be retired, and then
the changes in this patch can go away too.

Tests: unit (release)
Message-Id: <20180211164739.17862-1-avi@scylladb.com>
2018-02-11 17:53:08 +00:00
Vladimir Krivopalov
71495691aa Use separate shared_index_lists per sstable_mutation_reader instead of a single one per sstable.
With the changes introduced in #2981, it is no longer safe to share
index_entries among multiple sstable_mutation_readers.
The original intent behind sharing index_entries among index_readers was
to avoid re-reading same pages twice as we have two index readers -
lower and upper bound - for every sstable_mutation_reader. In fact, the
shared entries were held at the sstable object level so index_readers
from different sstable_mutation_readers could have accessed them.

Now, with calls to index_reader::advance_to(pos)/index_reader::advance_past(pos),
index_entry can be accessed in a way that modifies its state if we need
to read more promoted index blocks. It is safe to keep sharing them
between two index_readers within the same sstable_mutation_reader as the
invariant is maintained that readers can be only moved forward.
We cannot safely assume, however, that this invariant holds for multiple
sstable_mutation_readers as it may happen that one of them has read and
thrown away some promoted index blocks that another one needs. So we
restrict sharing to per-sstable_mutation_reader level.

Fixes #3189.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <83957d007621fe4c62af49aebf1838bb2f32ee55.1518226793.git.vladimir@scylladb.com>
2018-02-10 15:08:45 +02:00
Duarte Nunes
d757c87107 cql3/query_processor: Remove prepared statements upon dropping a view
Fixes #3198

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180209143652.31852-1-duarte@scylladb.com>
2018-02-09 16:30:28 +00:00
Avi Kivity
432268f582 Merge "branch 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla" from Raphael
"The motivation is that it's no longer needed after the new resharding
algorithm, which is solely responsible for working with shared
sstables; regular compaction will not work with those!
So resharding will schedule deletion of shared sstables once it's
certain that the shards that own them have the new unshared sstables.
The manager was needed for orchestrating deletion of shared sstables
across shards. It brings extra complexity that's no longer needed,
and it was also overloading shard 0, but the latter could have
been fixed.

Tests:
- unit: release mode
- dtest: resharding_test.py"

* 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla:
  Remove SSTable's atomic deletion manager
  Stop using SSTable's atomic deletion manager
  database: split column_family::rebuild_sstable_list
2018-02-08 19:10:16 +02:00
Duarte Nunes
456b678e0b database.hh: Fix data query stage argument type
Fixes a merge gone wrong.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180208163338.25238-1-duarte@scylladb.com>
2018-02-08 16:35:10 +00:00
Avi Kivity
404172652e Merge "Use xxHash for digest instead of MD5" from Duarte
"This series changes digest calculation to use a faster algorithm
(xxHash) and to also cache calculated cell hashes that can be kept in
memory to speed up subsequent digest requests.

The MD5 hash function has proved to be slow for large cell values:

size = 256; elapsed = 4us
size = 512; elapsed = 8us
size = 1024; elapsed = 14us
size = 2048; elapsed = 21us
size = 4096; elapsed = 33us
size = 8192; elapsed = 51us
size = 16384; elapsed = 86us
size = 32768; elapsed = 150us
size = 65536; elapsed = 278us
size = 131072; elapsed = 531us
size = 262144; elapsed = 1032us
size = 524288; elapsed = 2026us
size = 1048576; elapsed = 4004us
size = 2097152; elapsed = 7943us
size = 4194304; elapsed = 15800us
size = 8388608; elapsed = 31731us
size = 16777216; elapsed = 64681us
size = 33554432; elapsed = 130752us
size = 67108864; elapsed = 263154us

The xxHash is a non-cryptographic, 64bit (there's work in progress on
the 128 version) hash that can be used to replace MD5. It performs much
better:

size = 256; elapsed = 2us
size = 512; elapsed = 1us
size = 1024; elapsed = 1us
size = 2048; elapsed = 2us
size = 4096; elapsed = 2us
size = 8192; elapsed = 3us
size = 16384; elapsed = 5us
size = 32768; elapsed = 8us
size = 65536; elapsed = 14us
size = 131072; elapsed = 28us
size = 262144; elapsed = 59us
size = 524288; elapsed = 116us
size = 1048576; elapsed = 226us
size = 2097152; elapsed = 456us
size = 4194304; elapsed = 935us
size = 8388608; elapsed = 1848us
size = 16777216; elapsed = 4723us
size = 33554432; elapsed = 10507us
size = 67108864; elapsed = 21622us

Performance was tested using a 3 node cluster with 1 cpu and 8GB,
and with the following cassandra-stress loaders. Measurements are for
the read workload.

sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=5000000 -schema 'replication(factor=3)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100
sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..5000000,5000000,500000)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100

xxhash + caching:

Results:
op rate                   : 32699 [READ:32699]
partition rate            : 32699 [READ:32699]
row rate                  : 32699 [READ:32699]
latency mean              : 3.0 [READ:3.0]
latency median            : 3.0 [READ:3.0]
latency 95th percentile   : 3.9 [READ:3.9]
latency 99th percentile   : 4.5 [READ:4.5]
latency 99.9th percentile : 6.6 [READ:6.6]
latency max               : 24.0 [READ:24.0]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:05:05
END

md5:

Results:
op rate                   : 25241 [READ:25241]
partition rate            : 25241 [READ:25241]
row rate                  : 25241 [READ:25241]
latency mean              : 3.9 [READ:3.9]
latency median            : 3.9 [READ:3.9]
latency 95th percentile   : 5.1 [READ:5.1]
latency 99th percentile   : 5.8 [READ:5.8]
latency 99.9th percentile : 8.0 [READ:8.0]
latency max               : 24.8 [READ:24.8]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:06:36
END

This translates into a 21% improvement for this workload.

Bigger cell values were also tested:

sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=1000000 -schema 'replication(factor=3)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100
sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..1000000,500000,100000)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100

xxhash + caching:

Results:
op rate                   : 19964 [READ:19964]
partition rate            : 19964 [READ:19964]
row rate                  : 19964 [READ:19964]
latency mean              : 4.9 [READ:4.9]
latency median            : 4.6 [READ:4.6]
latency 95th percentile   : 7.2 [READ:7.2]
latency 99th percentile   : 11.5 [READ:11.5]
latency 99.9th percentile : 13.6 [READ:13.6]
latency max               : 29.2 [READ:29.2]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:08:20
END

md5:

Results:
op rate                   : 12773 [READ:12773]
partition rate            : 12773 [READ:12773]
row rate                  : 12773 [READ:12773]
latency mean              : 7.7 [READ:7.7]
latency median            : 7.3 [READ:7.3]
latency 95th percentile   : 10.2 [READ:10.2]
latency 99th percentile   : 16.8 [READ:16.8]
latency 99.9th percentile : 19.2 [READ:19.2]
latency max               : 71.5 [READ:71.5]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:13:02
END

This translates into a 37% improvement for this workload.

Fixes #2884

Tests: unit-tests (release), dtests (smp=2)

Note: dtests are kinda broken in master (> 30 failures), so take the
tests tag with a grain of himalayan salt."

* 'xxhash/v5' of https://github.com/duarten/scylla: (29 commits)
  tests/row_cache_test: Test hash caching
  tests/memtable_test: Test hash caching
  tests/mutation_test: Use xxHash instead of MD5 for some tests
  tests/mutation_test: Test xx_hasher alongside md5_hasher
  schema: Remove unneeded include
  service/storage_proxy: Enable hash caching
  service/storage_service: Add and use xxhash feature
  message/messaging_service: Specify algorithm when requesting digest
  storage_proxy: Extract decision about digest algorithm to use
  cache_flat_mutation_reader: Pre-calculate cell hash
  partition_snapshot_reader: Pre-calculate cell hash
  query::partition_slice: Add option to specify when digest is requested
  row: Use cached hash for hash calculation
  mutation_partition: Replace hash_row_slice with appending_hash
  mutation_partition: Allow caching cell hashes
  mutation_partition: Force vector_storage internal storage size
  test.py: Increase memory for row_cache_stress_test
  atomic_cell_hash: Add specialization for atomic_cell_or_collection
  query-result: Use digester instead of md5_hasher
  range_tombstone: Replace feed_hash() member function with appending_hash
  ...
2018-02-08 18:24:58 +02:00
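The idea of caching calculated cell hashes, mentioned in the series description above, can be sketched as follows. This is only an illustration: `cell_sketch` is hypothetical, and `std::hash` stands in for xxHash so that the example needs no external library.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical cell that remembers its hash after the first digest request.
class cell_sketch {
    std::string _value;
    mutable std::optional<std::uint64_t> _cached_hash;
    mutable unsigned _computations = 0;  // counts real hash calculations
public:
    explicit cell_sketch(std::string v) : _value(std::move(v)) {}

    // Compute the hash on first use, then serve it from the cache.
    std::uint64_t hash() const {
        if (!_cached_hash) {
            _cached_hash = std::hash<std::string>{}(_value);  // stand-in for xxHash
            ++_computations;
        }
        return *_cached_hash;
    }

    // Changing the value invalidates the cached hash.
    void set_value(std::string v) {
        _value = std::move(v);
        _cached_hash.reset();
    }

    unsigned computations() const { return _computations; }
};
```

Repeated digest requests for an unchanged cell then cost one optional check instead of a pass over the cell value.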
Avi Kivity
6298655178 Merge "Inline and optimise more aggressively" from Paweł
"We have noticed in the past that the compiler is too conservative when it comes
to deciding which functions to inline. Since inlining functions enables further
optimisations such as const folding in some cases the difference in performance
was significant enough to force us to add [[gnu::always_inline]] attribute in
numerous places. However, this is neither a practical nor an elegant solution.

A better way to deal with the problem is to adjust the compiler tunables that
control the heuristics used for making inlining decisions. In particular,
inline-unit-growth seems to affect the performance of the emitted code most.

Apart from making the compiler more eager to inline functions bumping the
optimisation level to -O3 also seems to have a positive impact on the
performance.

Fixes #1644.

Tests: unit-test (release)

Performance tested with gcc 7.3.

Macrobenchmark
perf_simple_query
Flags: -c4 --duration 60
All results are medians.

         ./before    ./after   diff
 read   338662.12  405377.80  19.7%
 write  387378.89  466744.15  20.5%

Microbenchmarks
single run duration:      1.000s
number of runs:           5

BEFORE
test                                      iterations      median         mad         min         max
combined.one_row                              858933   536.389ns     0.819ns   534.823ns   537.208ns
combined.single_active                          8469    77.131us    11.000ns    77.118us    77.145us
combined.many_overlapping                       1199   664.105us   160.807ns   663.818us   668.527us
combined.disjoint_interleaved                   8100    75.522us    22.254ns    75.500us    75.732us
combined.disjoint_ranges                        8288    72.580us    10.571ns    72.568us    72.599us
memtable.one_partition_one_row               1216233   825.581ns     0.446ns   821.450ns   826.027ns
memtable.one_partition_many_rows              127336     7.855us     2.153ns     7.853us     7.898us
memtable.many_partitions_one_row               57919    17.356us     6.028ns    17.259us    17.362us
memtable.many_partitions_many_rows              4751   210.496us   102.339ns   210.393us   211.188us

AFTER
test                                      iterations      median         mad         min         max
combined.one_row                             1002321   450.292ns     0.313ns   447.202ns   450.605ns
combined.single_active                          9605    67.086us     8.620ns    67.073us    67.115us
combined.many_overlapping                       1476   519.554us     5.334ns   519.549us   519.953us
combined.disjoint_interleaved                   9280    64.363us     5.328ns    64.335us    64.369us
combined.disjoint_ranges                        9481    61.893us     3.620ns    61.885us    61.903us
memtable.one_partition_one_row               1432668   699.775ns     0.106ns   696.023ns   699.918ns
memtable.one_partition_many_rows              153692     6.536us     6.885ns     6.501us     6.543us
memtable.many_partitions_one_row               63319    15.879us     5.080ns    15.793us    15.884us
memtable.many_partitions_many_rows              5659   176.717us    66.770ns   176.650us   177.778us"

* tag 'optimise-and-inline/v2' of https://github.com/pdziepak/scylla:
  configure.py: set optimisation level to -O3
  configure.py: set inline-unit-growth to 300
  configure.py: flag_supported: support flags with spaces
  configure.py: rename warning_supported to flag_supported
  configure.py: pass optimisation flags to seastar/configure.py
  cql3/select_statement: do not capture stack variables by reference
2018-02-08 17:45:41 +02:00
Tomasz Grabiec
cce1a2bce8 Merge "Use the CPU scheduler" from Glauber & Avi
In this patchset I am resubmitting Avi's enablement of the CPU scheduler
on his behalf. I've done a ton of testing on the series and there are
some improvements / changes that I had previously sent as a separate series.

What you see here is the result of merging that work.

After this patchset is applied, workloads are smoother and we are able to
uphold the pre-defined shares among the various actors.

We also finally have everything we need to merge the CPU and I/O controllers.
After that is done the code is now much simpler. But also, as a bonus,
controllers that were previously available for I/O only (compactions) are
enabled for CPU as well.

* git@github.com:glommer/scylla.git cpusched-v7:

Avi Kivity (4):
  database, sstables, compaction: convert use of thread_scheduling_group
    to seastar cpu scheduler
  memtable, database: make memtable::clear_gently() inherit
    scheduling_group
  config: mark background_writer_scheduling_quota as Unused
  database: place data_query execution stage into scheduling_group

Glauber Costa (9):
  database, main: set up scheduling_groups for our main tasks
  row_cache: actually use the scheduling group for update_cache
  allow update_cache and clear_gently to use the entire task quota.
  database: remove cpu_flush_quota metric
  controllers: retire auto_adjust_flush_quota
  controllers: allow memtable I/O controller to have shares statically
    set
  controllers: update control points for memtable I/O controller
  controllers: allow a static priority to override the controller output
  controllers: unify the I/O and CPU controllers
2018-02-08 15:58:40 +01:00
Paweł Dziepak
eb5b76ea50 configure.py: set optimisation level to -O3 2018-02-08 14:46:11 +00:00
Paweł Dziepak
bc65659a46 configure.py: set inline-unit-growth to 300
It has been discovered that the compiler is too conservative when
deciding which functions to inline. In particular, the limiting tunable
turned out to be inline-unit-growth which limits inlining in large
translation units.
2018-02-08 14:46:11 +00:00
Paweł Dziepak
89063a9cc0 configure.py: flag_supported: support flags with spaces 2018-02-08 14:46:11 +00:00
Paweł Dziepak
8f4b30b572 configure.py: rename warning_supported to flag_supported
warning_supported() can be used to detect support of any compiler flag,
not just warnings.
2018-02-08 14:46:11 +00:00
Paweł Dziepak
a8372b87eb configure.py: pass optimisation flags to seastar/configure.py 2018-02-08 14:46:11 +00:00
Paweł Dziepak
b635fec9bf cql3/select_statement: do not capture stack variables by reference
Default capture by reference considered harmful in async code.
2018-02-08 14:46:10 +00:00
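The hazard named in the commit above is that a continuation may run after the enclosing function has returned, when its stack variables no longer exist. The sketch below uses a plain vector of `std::function` objects as a stand-in for deferred execution; `task_queue`, `schedule_greeting`, and `run_all` are hypothetical names, and no Seastar API is used.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Stand-in for deferred work: tasks run later, after the scheduler returns.
using task_queue = std::vector<std::function<std::string()>>;

inline void schedule_greeting(task_queue& q, std::string name) {
    // `name` is destroyed when this function returns. A [&] capture would
    // leave the task with a dangling reference, so move the string into the
    // closure and let the task own it.
    q.push_back([name = std::move(name)] { return "hello " + name; });
}

// Run every queued task and collect the results.
inline std::vector<std::string> run_all(task_queue& q) {
    std::vector<std::string> out;
    for (auto& task : q) {
        out.push_back(task());
    }
    q.clear();
    return out;
}
```

Listing captures explicitly, rather than using a default capture, makes it visible at the call site which state the deferred task owns.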
Avi Kivity
ee763d889a Merge seastar upstream
* seastar 6d02263...2b0a81d (7):
  > configure.py: add -Wno-stringop-overflow
  > configure.py: add --optflags for specifying optimisation flags
  > build: add protobuf-compiler to docker dev image
  > build: update docker builder to newer Fedora
  > json_element: stream_object to get its parameter by value
  > json_element: stream range object
  > build: add yaml-cpp-devel installation to Dockerfile
2018-02-08 16:45:01 +02:00
Raphael S. Carvalho
312bd9ce25 Remove SSTable's atomic deletion manager
Not used anymore, can be deleted.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-02-07 22:38:45 -02:00
Raphael S. Carvalho
1472cfcc19 Stop using SSTable's atomic deletion manager
The motivation is that it's no longer needed after the new resharding
algorithm, which is solely responsible for working with shared
sstables; regular compaction will not work with those!
So resharding will schedule deletion of shared sstables once it's
certain that shards that own them have the new unshared sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-02-07 22:27:17 -02:00
Raphael S. Carvalho
b78881c0e9 database: split column_family::rebuild_sstable_list
The motivation is that resharding will not want the code that is
specific to regular compaction after atomic deletion is removed.
Resharding will eventually only need to replace old tables with
new ones, and it will be in charge of deletion of old tables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-02-07 22:18:18 -02:00
Glauber Costa
4272279bbb controllers: unify the I/O and CPU controllers
We have had so far an I/O controller, for compactions and memtables, and
a CPU controller, for memtables only -- since the scheduling was still
quota-based.

Now that the CPU scheduler is fully functional, it is time to do away
with the differences and integrate them both into one.  We now have a
memtable controller and a compaction controller, and they control both
CPU and I/O.

In the future, we may want to control processes that don't do one of
them, like cache updates. If that ever happens, we'll try to make
controlling one of them optional. But for now, since the I/O and CPU
controllers for our main two processes would look exactly the same we
should integrate them.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:30 -05:00
Glauber Costa
7b6f188e27 controllers: allow a static priority to override the controller output
We have merged the I/O controller without this, but we want to integrate
the CPU and I/O controllers into one. Currently, the quota can be
statically set for the CPU controller. For now, until we gain more
experience with it we should allow a static value to override the
controller's output as well.

That is particularly important since we don't yet control some
strategies like LCS and the time-based ones. Users in the field may be
using one of those strategies with a static value for background quota.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Glauber Costa
6f295a2a8a controllers: update control points for memtable I/O controller
Right now CPU and I/O controllers have slightly different control points
for no good reason. Let's use the CPU controller ones as the standard, as
we have been using it in the field for longer and trust it more.

The end goal is to fully integrate them.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Glauber Costa
b895d495cc controllers: allow memtable I/O controller to have shares statically set
This is so it looks more like the CPU controller. The end goal is to integrate them.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Glauber Costa
c099c98676 controllers: retire auto_adjust_flush_quota
It no longer makes sense now that we have the full scheduler +
controllers.  In its lieu, we will provide an option to statically set
the controller's shares as a safeguard against us getting this wrong.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Glauber Costa
2c1d5cf966 database: remove cpu_flush_quota metric
We can now grab that from the CPU scheduler, that exports both runtime
and shares.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Glauber Costa
c4974392b7 allow update_cache and clear_gently to use the entire task quota.
We have had a quota of partitions to process in clear_gently /
update_cache, so that we don't overwork. However, with those things now
being in their own task group there is no harm in allowing it to run
until we reach a natural preemption point.

While we are at it, clear_gently did not check for need_preempt()
before, so this patch fixes it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Glauber Costa
a3a4d0a17a row_cache: actually use the scheduling group for update_cache
We have moved clear_gently from using a seastar::thread's scheduling_group to
using the CPU scheduler's. However, update_cache was forgotten.

This patch fixes that and gets rid of the old group just in case.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Avi Kivity
ce94e6deb7 database: place data_query execution stage into scheduling_group
Because execution stages defer and batch processing of the function
they run, they escape their fiber's context and therefore the
scheduling group.

Fix (for data_query) by initializing the execution_stage with the
query scheduling_group. To do that we have to move the execution
stage into the database object, so it has access to the scheduling
group during initialization.
2018-02-07 17:19:29 -05:00
Avi Kivity
2ee163d32b config: mark background_writer_scheduling_quota as Unused
Since the background writer flush quota config is no longer used, mark
it Unused.
2018-02-07 17:19:29 -05:00
Avi Kivity
ac525c9124 memtable, database: make memtable::clear_gently() inherit scheduling_group
Instead of using a private thread_scheduling_group, make clear_gently use
its caller's scheduling_group to control resource usage.
2018-02-07 17:19:29 -05:00
Glauber Costa
956af9f099 database, main: set up scheduling_groups for our main tasks
Set up scheduling groups for streaming, compaction, memtable flush, query,
and commitlog.

The background writer scheduling group is retired; it is split into
the memtable flush and compaction groups.

Comments from Glauber:

This patch is based on a patch from Avi with the same subject, but the
differences are significant enough so that I reset authorship. In
particular:

1) A bug/regression is fixed with the boundary calculations for the
   memtable controller sampling function.
2) A leftover is removed, where after flushing a memtable we would
   go back to the main group before going to the cache group again
3) As per Tomek's suggestion, the submission of compactions
   itself is now run in the compaction scheduling group. Having that
   working is what changes this patch the most: we now store the
   scheduling group in the compaction manager and let the compaction
   manager itself enforce the scheduling group.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Avi Kivity
641aaba12c database, sstables, compaction: convert use of thread_scheduling_group to seastar cpu scheduler
thread_scheduling_groups are converted to plain scheduling_group. Due to
differences in initialization (scheduling_group initialization defers), we
create the scheduling_groups in main.cc and propagate them to users via
a new class database_config.

The sstable writer loses its thread_scheduling_group parameter and instead
inherits scheduling from its caller.

Since shares are in the 1-1000 range vs. 0-1 for thread scheduling quotas,
the flush controller was adjusted to return values within the higher ranges.
2018-02-07 17:19:29 -05:00
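Note on 641aaba12c: the range change from 0-1 quotas to 1-1000 shares can be pictured as a rescaling. The sketch below assumes a simple linear mapping with clamping; the function name `quota_to_shares` and the linear shape are assumptions for illustration, and the actual flush controller may compute its output differently.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical linear mapping from a 0-1 quota to the 1-1000 share range.
inline float quota_to_shares(float quota) {
    constexpr float min_shares = 1.0f;
    constexpr float max_shares = 1000.0f;
    // Clamp out-of-range input so the result always stays within 1-1000.
    float q = std::clamp(quota, 0.0f, 1.0f);
    return min_shares + q * (max_shares - min_shares);
}
```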
Glauber Costa
98549775fa sstable_tests: make sure min_threshold is set explicitly
The SSTable tests are a bit fragile now because they rely on min_threshold
having a particular value. That is the default value, but if I change that
default - which I am planning to do - the test breaks.

Right now the test is not broken, but if we are planning on relying on a
property having a particular value in tests, we should explicitly set it.

So I am proactively changing min_threshold in the tests to have the value
of 4 explicitly, so we can change that in the future without breaking anything.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180207155513.12498-1-glauber@scylladb.com>
2018-02-07 18:45:52 +01:00
Tomasz Grabiec
d398aa913e cache: Fix calculation of active_reads()
Message-Id: <1518023341-27855-1-git-send-email-tgrabiec@scylladb.com>
2018-02-07 17:20:00 +00:00
Takuya ASADA
2c2173917c dist/common/scripts/scylla_raid_setup: skip blkdiscard when disk is not supported TRIM
Since we unconditionally run blkdiscard on disks, we may get an ioctl error
message on disks that do not support TRIM.

This can be ignored, but it's bad UX, so let's skip running blkdiscard when TRIM
is not supported on the disk.

Fixes #2774

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1517992904-13838-1-git-send-email-syuu@scylladb.com>
2018-02-07 13:30:05 +02:00
Calle Wilund
264b9d2da0 sstables: Process extensions on file open
Allowing them to wrap/replace an opened file, and add to/read from
scylla metadata.
2018-02-07 10:11:46 +00:00
Calle Wilund
b0c0c3c0ad sstables::types: Add optional extensions attribute to scylla metadata
Allowing storing key:value pairs.
2018-02-07 10:11:46 +00:00
Calle Wilund
68fc076f80 sstables::disk_types: Add hash and comparator(sstring) to disk_string 2018-02-07 10:11:46 +00:00
Calle Wilund
97f9f572f8 schema_tables: Load/save extensions table
Parses the extension map in tables/views using the registered extension.
If a schema row contains an unknown extension, we just preserve the data
in a placeholder.
2018-02-07 10:11:46 +00:00
Calle Wilund
dcc75263c6 cql: Add schema extensions processing to properties
Automatically accept registered schema extensions into the properties
set, and when building, generate the corresponding extension object into
the resulting schema.
2018-02-07 10:11:46 +00:00
Calle Wilund
2b56bbfa7d schema_tables: Require context object in schema load path
Requires "workaround" fix for schema_registry and frozen_mutation, since
the former is a free-float thread local, and the latter is a pure data
carrier. frozen_schema can take a parameter for unfreeze, but schema
registry requires being told which the system extensions are.
2018-02-07 10:11:46 +00:00
Calle Wilund
c2b49ec2e2 schema_tables: Add opaque context object
To allow carrying extensions and potentially more
2018-02-07 10:11:46 +00:00
Calle Wilund
2ee68ce0d4 config_file_impl: Remove ostream operators
We don't generate default strings for command line, so these are not
needed as such, and conflict with other operators in to_string.hh
2018-02-07 10:11:46 +00:00
Calle Wilund
6e31842049 main/init: Formalize configurables + add extensions to init call
Move the configurables to init so tests can link this as well. 
Add extensions object to db config in main and provide to 
configurables. These can then add extensions at this phase.
2018-02-07 10:11:46 +00:00
Calle Wilund
c19d8dd602 db::config: Add extensions as a config sub-object
The idea being that we should have config be a global, immutable
singleton, set up by startup/test then owned/referenced by db etc. 

Extensions are read-only in this context, so init code should set it up
before handing to the config. Or keep a ref to the ext param.
2018-02-07 10:11:46 +00:00
Calle Wilund
78174c6c59 db::extensions: Configuration object to store various extensions
A singular, yet not static global, container for schema/sstable 
extensions.
2018-02-07 10:11:46 +00:00
Calle Wilund
3e8cfbf2a0 cql3::statements::property_definitions: Use std::variant instead of any
Formalizing what stuff we actually keep in the props. And c++17.
2018-02-07 10:11:46 +00:00
Calle Wilund
0dcf287230 sstables: Add extension type for wrapping file io 2018-02-07 10:11:45 +00:00
Calle Wilund
3ab760b375 schema: Add opaque type to represent extensions
A virtual opaque object meant to represent the "extensions" mapping
in schema_tables::tables/views
2018-02-07 10:11:45 +00:00
Calle Wilund
74758c87cd sstables::compress/compress: Make compression a virtual object
Make a "compressor" an actual class, that can be implemented and
registered via class registry. 

For "common" compressors, the objects will be shared, but complex
implementors can be semi-stateful. 

sstable compression is split into two parts: The "static" config
which is shared across shards, and a "local" one, which holds 
a compressor pointer. The latter is encapsulated, along with 
actual compressed data writers, in sstables/compress.cc.

For compression (write), the compression writer is instantiated
with the settings active in table metadata.

For decompression (read), the compression reader is instantiated
with the settings stored in sstable metadata, which can
differ from the currently active table metadata.

v2:
* Structured patch sets differently (dependencies)
* Added more comments/api descs
* Added patch to move all sstable compression into compress.cc,
  effectively separating top-level virtual compressor object
  from sstable io knowledge
v3:
* Rebased
v4: 
* Moved all sstable compression logic/knowledge into  
  compress.cc (local compression). Merged the two patches 
  (separation just confuses reader).
2018-02-07 10:11:45 +00:00
Pekka Enberg
3e4c6cc4da tests/cql_query_test: Add indexed clustering key query test 2018-02-06 16:57:27 +02:00
Pekka Enberg
0128f802ed cql3: Fix clustering column secondary indexing
Fix clustering column indexing by lifting the limitation of only
considering non-primary key restrictions in
select_statement::find_index_partition_ranges().
2018-02-06 16:57:27 +02:00
Pekka Enberg
1fdc13d230 cql3/statements: Add values() helper to restrictions
Add values() helper to restrictions class so that we can easily obtain
restriction values for all indexed restrictions.
2018-02-06 16:57:27 +02:00
Paweł Dziepak
6ccd317c38 Merge "Do not evict from memtable snapshots" from Tomasz
"When moving whole partition entries from memtable to cache, we move
snapshots as well. It is incorrect to evict from such snapshots
though, because associated readers would miss data.

The solution is to record evictability of partition version references (snapshots)
and avoid eviction from non-evictable snapshots.

Could affect scanning reads, if the reader uses partition entry from
memtable, and the partition is too large to fit in reader's buffer,
and that entry gets moved to cache (was absent in cache), and then
gets evicted (memory pressure). The reader will not see the remainder
of that entry. Found during code review.

Introduced in ca8e3c4, so affects 2.1+

Fixes #3186.

Tests: unit (release)"

* 'tgrabiec/do-not-evict-memtable-snapshots' of github.com:tgrabiec/scylla:
  tests: mvcc: Add test for eviction with non-evictable snapshots
  mutation_partition: Define + operator on tombstones
  tests: mvcc: Check that partition is fully discontinuous after eviction
  tests: row_cache: Add test for memtable readers surviving flush and eviction
  memtable: Make printable
  mvcc: Take partition_entry by const ref in operator<<()
  mvcc: Do not evict from non-evictable snapshots
  mvcc: Drop unnecessary assignment to partition_snapshot::_version
  tests: Use partition_entry::make_evictable() where appropriate
  mvcc: Encapsulate construction of evictable entries
2018-02-06 14:46:24 +00:00
Tomasz Grabiec
3c51cc79d5 tests: mvcc: Add test for eviction with non-evictable snapshots 2018-02-06 14:24:19 +01:00
Tomasz Grabiec
d37131d320 mutation_partition: Define + operator on tombstones 2018-02-06 14:24:19 +01:00
Tomasz Grabiec
ec5fe5b207 tests: mvcc: Check that partition is fully discontinuous after eviction
evict() should remove everything, including range tombstones, so whole
clustering range should be marked as discontinuous.
2018-02-06 14:24:19 +01:00
Tomasz Grabiec
c1b82e60e3 tests: row_cache: Add test for memtable readers surviving flush and eviction
Reproduces https://github.com/scylladb/scylla/issues/3186
2018-02-06 14:24:19 +01:00
Tomasz Grabiec
d85d651e0f memtable: Make printable
Useful when debugging test failures.
2018-02-06 14:24:19 +01:00
Tomasz Grabiec
06b7b54c3d mvcc: Take partition_entry by const ref in operator<<()
Some users will only have const&.
2018-02-06 14:24:19 +01:00
Tomasz Grabiec
50f5bee12e mvcc: Do not evict from non-evictable snapshots
When moving whole partition entries from memtable to cache, we move
snapshots as well. It is incorrect to evict from such snapshots
though, because associated readers would miss data.

The solution is to record evictability of partition version references (snapshots)
and avoid eviction from non-evictable snapshots.

Could affect scanning reads, if the reader uses partition entry from
memtable, and the partition is too large to fit in reader's buffer,
and that entry gets moved to cache (was absent in cache), and then
gets evicted (memory pressure). The reader will not see the remainder
of that entry.

Introduced in ca8e3c4, so affects 2.1+

Fixes #3186.
2018-02-06 14:24:19 +01:00
Tomasz Grabiec
c391bff1d2 mvcc: Drop unnecessary assignment to partition_snapshot::_version
merge_partition_versions() is responsible for merging versions
unpinned by the current snapshot. If that fails, we don't need to set
_version back since versions must be still referenced by someone else,
this snapshot is not a unique owner.

This change makes it easier to add tracking of evictability.
2018-02-06 14:24:18 +01:00
Tomasz Grabiec
439cbada2c tests: Use partition_entry::make_evictable() where appropriate 2018-02-06 14:24:18 +01:00
Raphael S. Carvalho
09f4ee808f sstables/compress: Fix race condition in segmented offset reading of shared sstable
Race condition was introduced by commit 028c7a0888, which introduces chunk offset
compression, because a reading state is kept in the compress structure which is
supposed to be immutable and can be shared among shards owning the same sstable.

So it may happen that shard A updates state while shard B relies on information
previously set which leads to incorrect decompression, which in turn leads to
read misbehaving.

We could serialize access to at() which would only lead to contention issues for
shared sstables, but that can be avoided by moving state out of compress structure
which is expected to be immutable after sstable is loaded and fed to shards that
own it. Sequential accessor (wraps state and reference to segmented_offset) is
added to prevent at() and push_back() interfaces from being polluted.

Tests: release mode.

Fixes #3148.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180205192432.23405-1-raphaelsc@scylladb.com>
2018-02-06 12:10:10 +02:00
Tomasz Grabiec
d899ae0f02 mvcc: Encapsulate construction of evictable entries
Internal invariants of MVCC are better preserved by partition_entry
methods, so move construction of partition entries out of cache_entry
constructors.
2018-02-05 17:54:03 +01:00
Vlad Zolotarov
bc90aa79b3 config: uncomment/add the supported snitches description
Uncomment descriptions of Ec2SnitchXXX, which have been supported for a long
time already.
Add the description of the new GoogleCloudSnitch.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-05 10:37:13 -05:00
Vlad Zolotarov
d312aeebf3 tests: added gce_snitch_test
Tests the GoogleCloudSnitch.
Uses the dummy GCE meta server that would be listening on 127.0.0.1:80 by default.
To change the IP of the dummy server one can use the DUMMY_META_SERVER_IP
environment macro.
To use the real GCE meta server (from inside the GCE VM) one should define
the USE_GCE_META_SERVER environment macro.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-05 10:37:08 -05:00
Vlad Zolotarov
8ae2996bf8 locator::gce_snitch: implementation of the GoogleCloudSnitch
This is a snitch that should be used when Scylla runs in GCE VMs in both
single and multi data center (DC) configurations.

This snitch interacts with the GCE (instance metadata) API as
described here: https://cloud.google.com/compute/docs/storing-retrieving-metadata
similarly to how ec2_snitchXXX interacts with the AWS API.

However unlike ec2_multi_region_snitch the GCE snitch only gets the instance's zone and sets
the DC and the RACK based on it, e.g. for us-central1-a the DC is set to 'us-central'
and the RACK - to 'a'.

GCE snitch doesn't have to learn the internal and external IPs of the instance because in
GCE, instances from different regions can interact using internal IPs (in AWS they can't).

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-05 09:57:03 -05:00
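Note on 8ae2996bf8: the zone-to-DC/RACK rule can be sketched as below. The helper `zone_to_dc_rack` is hypothetical and its rule (rack is the text after the last dash, DC is the rest with the trailing region number dropped) is inferred only from the single `us-central1-a` example in the commit message; the real snitch code may differ.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <utility>

// Hypothetical helper, not the snitch's actual code.
// Returns {dc, rack} for a zone name such as "us-central1-a".
std::pair<std::string, std::string> zone_to_dc_rack(const std::string& zone) {
    auto dash = zone.rfind('-');
    if (dash == std::string::npos) {
        // Assumption: with no rack suffix, keep the whole name as the DC.
        return {zone, std::string()};
    }
    std::string rack = zone.substr(dash + 1);
    std::string dc = zone.substr(0, dash);
    // Assumption: drop the trailing region number ("us-central1" -> "us-central").
    while (!dc.empty() && std::isdigit(static_cast<unsigned char>(dc.back()))) {
        dc.pop_back();
    }
    return {dc, rack};
}
```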
Vlad Zolotarov
0a8549abf1 locator::snitch_base: properly log the failure during the snitch startup
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-05 09:49:54 -05:00
Avi Kivity
a94564a637 Merge seastar upstream
* seastar 21badbd...6d02263 (4):
  > build: detect name of ninja executable
  > queue: pop_eventually/push_eventually should throw when called after abort
  > build: compile libfmt out-of-line
  > core/gate: Ensure with_gate leaves gate on exception
2018-02-05 14:42:07 +02:00
Tomasz Grabiec
d21fbc26c7 tests: range_tombstone_list: Do not depend on argument evaluation order
next_pos() calls could be reordered resulting in invalid tombstones being
generated.
Message-Id: <1517833688-20022-1-git-send-email-tgrabiec@scylladb.com>
2018-02-05 12:31:37 +00:00
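Note on d21fbc26c7: C++ does not specify the order in which function arguments are evaluated, so two stateful calls in one argument list can run in either order. The sketch below shows the usual remedy of sequencing the calls into named locals. `position_source` and `make_range` are hypothetical stand-ins, not the test's real types.

```cpp
#include <cassert>
#include <utility>

struct position_source {  // hypothetical stand-in for the test's position generator
    int next = 0;
    int next_pos() { return next++; }
};

// Order-dependent form (shown only as a comment): the two next_pos() calls may
// be evaluated in either order, so the first element may end up larger.
//   auto r = std::make_pair(src.next_pos(), src.next_pos());

// Order-independent form: each call is its own statement, so start < end.
std::pair<int, int> make_range(position_source& src) {
    int start = src.next_pos();
    int end = src.next_pos();
    return {start, end};
}
```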
Tomasz Grabiec
d2baa49313 tests: Do not produce invalid range tombstones
Upper bound should not be smaller than lower bound. Found by
asserting on valid bounds.
Message-Id: <1517833602-19732-1-git-send-email-tgrabiec@scylladb.com>
2018-02-05 12:29:03 +00:00
Takuya ASADA
6d134c0c2b dist/redhat: block installing Scylla on older kernel
We use the AmbientCapabilities directive in the systemd unit, but it does not work
on older kernels and causes the following error:
"systemd[5370]: Failed at step CAPABILITIES spawning /usr/bin/scylla: Invalid argument"

It only works on kernel-3.10.0-514 == CentOS7.3 or later, so block installing the rpm
on older kernels to prevent the error.

Fixes #3176

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1517822764-2684-1-git-send-email-syuu@scylladb.com>
2018-02-05 12:57:17 +02:00
Duarte Nunes
46099e4f58 tests/role_manager_test: Stop role_manager
Not stopping them may cause the tests to fail due to an asynchronous
process being scheduled and accessing freed data.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180202221640.28609-1-duarte@scylladb.com>
2018-02-05 09:39:59 +00:00
Avi Kivity
6919c7434e Merge seastar upstream
* seastar 19efbd9...21badbd (4):
  > reactor: change adjustment method for tasks becoming active
  > Merge 'Update ARM port' from Avi
  > http: Do not wait for close connection on stop if listen did not completed
  > core/future-util: Don't allow rvalues in do_for_each()
2018-02-04 14:28:28 +02:00
Avi Kivity
2173e74212 tests: de-template cql_query_test
cql_query_test contains many continuations that are generic lambdas:

  foo().then([] (auto x) { ... })

These templates prevent Eclipse's indexer from inferring the type of x,
and so everything below that point is one big error as far as Eclipse is
concerned.

De-template these lambdas by specifying the real types.

Unfortunately, compile time decrease was not observed.

Tests: cql_query_test (release)
Message-Id: <20180204113503.23297-1-avi@scylladb.com>
2018-02-04 11:48:52 +00:00
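Note on 2173e74212: a small sketch of the difference between a generic lambda and its de-templated form. The `then` function here is a hypothetical stand-in that just calls the continuation with a value; it is not Seastar's `future::then`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for a future's then(): calls the continuation with a value.
template <typename Func>
auto then(std::string value, Func&& func) {
    return std::forward<Func>(func)(std::move(value));
}

// Generic lambda: the parameter type is deduced when the lambda is called.
inline std::size_t length_generic(std::string s) {
    return then(std::move(s), [] (auto x) { return x.size(); });
}

// De-templated lambda: the parameter type is spelled out, so tools that do not
// instantiate templates can still see that x is a std::string.
inline std::size_t length_explicit(std::string s) {
    return then(std::move(s), [] (std::string x) { return x.size(); });
}
```

Both functions behave the same at run time; only the information available to an indexer differs.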
Takuya ASADA
dc2b17b3da dist/redhat: link yaml-cpp statically
To avoid incompatibilities with the distribution-provided libyaml-cpp, link it
statically.

Fixes #3173

Message-Id: <1517546935-15858-2-git-send-email-syuu@scylladb.com>
2018-02-03 16:34:36 +02:00
Takuya ASADA
82f217d62a configure.py: make --static-yaml-cpp works properly for Scylla
We already link libyaml-cpp statically for libseastar, but
mistakenly not for Scylla; fix that.

Message-Id: <1517546935-15858-1-git-send-email-syuu@scylladb.com>
2018-02-03 16:34:32 +02:00
Amnon Heiman
836876d81a main: stop prometheus server when shutting down
This patch adds an engine().on_exit cleanup for the prometheus server,
similar to other components in the system.

It will stop the server when shutting down.

Fixes #2520
Message-Id: <20180201132647.17638-1-amnon@scylladb.com>
2018-02-02 11:03:51 +01:00
Tomasz Grabiec
582dd36303 Merge 'Fixes for exception safety in memtable range reads' from Paweł
These patches deal with the remaining exception safety issues in the
memtable partition range readers. That includes moving the assignment
to iterator_reader::_last outside of allocating section to avoid
problems caused by exception-unsafe assignment operator. Memory
accounting code is also moved out of the retryable context to improve
the code robustness and avoid potential problems in the future.

Fixes #3172.

Tests: unit-test (release)

* https://github.com/pdziepak/scylla.git memtable-range-read-exception-safety/v1:
  memtable: do not update iterator_reader::_last in alloc section
  memtable: do not change accounting state in alloc section
  tests/memtable: add more reader exception safety tests
2018-02-02 11:00:58 +01:00
Paweł Dziepak
c2a5fd520f cql3/role-management: avoid static local shared_ptr
Even if shared_ptr is const it doesn't mean that its internal state is
immutable and it still cannot be freely shared across shards.

Fixes assertion failure in build/debug/tests/cql_roles_query_test.

Message-Id: <20180201125221.30531-1-pdziepak@scylladb.com>
2018-02-01 16:28:36 +02:00
Paweł Dziepak
ea50806172 tests/mutation_reader: avoid static local lw_shared_ptr
Shared pointers don't like being shared across shards.

Fixes assertion failure in build/debug/tests/mutation_reader_test.
Message-Id: <20180201125017.30259-1-pdziepak@scylladb.com>
2018-02-01 13:53:55 +01:00
Duarte Nunes
992de302a2 tests/row_cache_test: Test hash caching
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
d28bdb25c5 tests/memtable_test: Test hash caching
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
78508e8e43 tests/mutation_test: Use xxHash instead of MD5 for some tests
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
6cb0bbd978 tests/mutation_test: Test xx_hasher alongside md5_hasher
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
20132fe1b5 schema: Remove unneeded include
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
d7af8ff0e0 service/storage_proxy: Enable hash caching
Set the option that enables the underlying memtable and cache readers
to request caching of a cell's hash, for requests that require a
digest.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
0bab3e59c2 service/storage_service: Add and use xxhash feature
We add a cluster feature that informs whether the xxHash algorithm is
supported, and allow nodes to switch to it. We use a cluster feature
because older versions are not ready to receive a different digest
algorithm than MD5 when answering a data request.

If we ever should add a new hash algorithm, we would also need to
add a new cluster feature for that algorithm. The alternative would be
to add code so a coordinator could negotiate what digest algorithm to
use with the set of replicas it is contacting.

Fixes #2884

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
440ea56010 message/messaging_service: Specify algorithm when requesting digest
While not strictly needed, specify which algorithm to use when requesting
a digest from a remote node. This is more flexible than relying on a
cluster wide feature, although that's what we'll do in subsequent
patches. It also makes the verb more consistent with the data request.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
1ee7413b6e storage_proxy: Extract decision about digest algorithm to use
Introduce the digest_algorithm() function, which encapsulates the
decision of which digest algorithm to use. Right now it is set to MD5,
but future patches will change this.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
712c051de6 cache_flat_mutation_reader: Pre-calculate cell hash
When digest is requested, pre-calculate the cell's hash. We consider
the case when the cell is already in the cache, and the case when it
is added by the underlying reader.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
ec5b7fb553 partition_snapshot_reader: Pre-calculate cell hash
When digest is requested, pre-calculate the cell's hash. A downside of
this approach is that more work will be done when there are multiple
versions of a row that contain values for the same cell, but we expect
these cases to be rare and the upside of caching a cell's hash to
compensate for the extra work.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
4ea2f52ddb query::partition_slice: Add option to specify when digest is requested
Having this option enables us to communicate from the upper to the
lower layers whether a digest was requested, so that we can pre-calculate
and cache a cell's hash in the readers that have access to the actual
in-memory cells (within the memtable and the row cache).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
42f407ad9e row: Use cached hash for hash calculation
This entails doing the cell hash calculation slightly differently,
where the cell is hashed individually, the resulting hash being added
to the running one.

Instead of propagating a flag all through the call chain, we detect
whether we are in the new mode by the employed hash algorithm.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:49 +00:00
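Note on 42f407ad9e: a minimal sketch of hashing each cell on its own, caching that hash in the row, and feeding the per-cell hash into the running row hash. The `cell` and `row` types are simplified, hypothetical stand-ins, and FNV-1a is used only to keep the example self-contained; the series itself uses xxHash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// FNV-1a is a small stand-in hash for this sketch only.
inline std::uint64_t fnv1a(const std::string& data,
                           std::uint64_t h = 14695981039346656037ull) {
    for (unsigned char c : data) {
        h ^= c;
        h *= 1099511628211ull;
    }
    return h;
}

struct cell {  // hypothetical, simplified cell
    std::string value;
};

struct row {   // hypothetical, simplified row
    std::vector<cell> cells;
    // Cached hashes are kept beside the cells, not inside them.
    std::vector<std::optional<std::uint64_t>> cached_hashes;

    std::uint64_t cell_hash(std::size_t i) {
        if (cached_hashes.size() < cells.size()) {
            cached_hashes.resize(cells.size());
        }
        if (!cached_hashes[i]) {
            cached_hashes[i] = fnv1a(cells[i].value);  // hash the cell once
        }
        return *cached_hashes[i];
    }

    // Feed each per-cell hash, byte by byte, into the running row hash.
    std::uint64_t hash() {
        std::uint64_t running = 14695981039346656037ull;
        for (std::size_t i = 0; i < cells.size(); ++i) {
            std::uint64_t h = cell_hash(i);
            for (int shift = 0; shift < 64; shift += 8) {
                running ^= (h >> shift) & 0xff;
                running *= 1099511628211ull;
            }
        }
        return running;
    }
};
```

A second call to `hash()` reuses the cached per-cell values instead of rehashing the cell contents.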
Duarte Nunes
d773e4b9d4 mutation_partition: Replace hash_row_slice with appending_hash
This enables us to only branch once per row on the actual hash
algorithm, instead of once per row data item.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:49 +00:00
Duarte Nunes
99a3e3aa76 mutation_partition: Allow caching cell hashes
We add storage to a row to hold the cached hashes of each individual
cell. We don't store the hash in each cell because that would a)
change the cell equality function, and b) require us to change a cell
in a potentially fragmented buffer.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:47 +00:00
Duarte Nunes
71ba99d53e mutation_partition: Force vector_storage internal storage size
This patch forces the size of vector_storage's internal storage to 5,
meaning that the underlying managed_vector will ensure it doesn't need
to externally allocate a buffer to hold the row, if only its first 5
cells are set.

We define this size explicitly so we can change the vector's value
type in upcoming patches without affecting the optimization.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:51 +00:00
Duarte Nunes
996e47a6f9 test.py: Increase memory for row_cache_stress_test
Cells and rows will require more memory when we start caching the cell
hash.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:51 +00:00
Duarte Nunes
7ba63b1521 atomic_cell_hash: Add specialization for atomic_cell_or_collection
Replace the atomic_cell_or_collection::feed_hash() member function
with the specialization of appending_hash, and use that instead.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:51 +00:00
Duarte Nunes
b2e1a91f4d query-result: Use digester instead of md5_hasher
Use the digester class instead of md5_hasher to encapsulate the
decision of which hash algorithm to use.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Duarte Nunes
a0d748c71c range_tombstone: Replace feed_hash() member function with appending_hash
Replace range_tombstone::feed_hash() with the specialization of
appending_hash, so that we can use the general feed_hash() function.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Duarte Nunes
12507fb9ce keys: Replace feed_hash() member function with appending_hash
Replace the feed_hash() member function of partition_key and
clustering_key_prefix with the specialization of appending_hash,
so that we can use the general feed_hash() function.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Duarte Nunes
6b4b429883 query-result: Introduce class result_options
Introduce class result_options to carry result options through the
request pipeline, which at this point mean the result type and the
digest algorithm. This class allows us to encapsulate the concrete
digest algorithm to use.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Duarte Nunes
041acb7aea query: Add class to encapsulate digest algorithm
This patch paves the way for us to encapsulate the actual digest
algorithm used for a query. The digester class dispatches to a
concrete implementation based on the digest algorithm being used. It
wraps the xxHash algorithm to provide a 128 bit hash, which is the
size of digest expected by the inter-node protocol.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
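The dispatch idea in the message above can be sketched roughly as follows. This is not Scylla's code: `fnv_hasher` is a dependency-free stand-in (FNV-1a, not xxHash), and the `digest_algorithm` values, the `digester` interface and the zero-padding used to widen the 64-bit result to 16 bytes are all assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-in 64-bit hasher (FNV-1a). It is NOT xxHash; it only keeps the
// sketch free of external dependencies.
class fnv_hasher {
    std::uint64_t _state = 14695981039346656037ull;
public:
    void update(const std::string& data) {
        for (unsigned char c : data) {
            _state ^= c;
            _state *= 1099511628211ull;
        }
    }
    std::uint64_t finalize() const { return _state; }
};

// Hypothetical set of algorithms; the real enum is not shown in the log.
enum class digest_algorithm { none, fnv64 };

using digest128 = std::array<std::uint8_t, 16>;

// Callers talk to digester only; the concrete algorithm is chosen once.
class digester {
    digest_algorithm _algo;
    fnv_hasher _hasher;
public:
    explicit digester(digest_algorithm algo) : _algo(algo) {}

    void update(const std::string& data) {
        if (_algo == digest_algorithm::fnv64) {
            _hasher.update(data);
        }
    }

    // Widen the 64-bit hash to 16 bytes by zero-padding the upper half
    // (an assumption; the log does not say how the widening is done).
    digest128 finalize() const {
        digest128 out{};
        if (_algo == digest_algorithm::fnv64) {
            const std::uint64_t h = _hasher.finalize();
            for (std::size_t i = 0; i < 8; ++i) {
                out[i] = static_cast<std::uint8_t>(h >> (8 * i));
            }
        }
        return out;
    }
};
```

Feeding the same bytes in one call or in several yields the same digest, which is the property a streaming hasher needs.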
Duarte Nunes
839ed4e3a4 md5_hasher: Extract hash size
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Duarte Nunes
5f6aab832b digest_algorithm: Add xxHash option
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Duarte Nunes
c803ae24fc digest: Introduce xxHash hash algorithm
This patch introduces xx_hasher, a class conforming to the Hasher
concept, which will be used to calculate the data digest in subsequent
patches. It is expected to be an order of magnitude faster than md5.

We use the 64 bit variant of the algorithm, the 128 bit one still
being under development.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Duarte Nunes
4f0295a35c CMakeLists: Add xxhash directory
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Duarte Nunes
edb9193c9c configure.py: Configure xxhash
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Duarte Nunes
102cf40bb7 Add xxhash (fast non-cryptographic hash) as submodule
Signed-off-by: Duarte Nunes <duarte@scylladb.com>

Note:
  xxhash repo should be cloned to Scylla organization, and that
  git url should be used instead.
2018-02-01 00:22:50 +00:00
Paweł Dziepak
20c460d8f0 tests/memtable: add more reader exception safety tests 2018-01-31 16:05:35 +00:00
Paweł Dziepak
c945bdc7f6 memtable: do not change accounting state in alloc section
Allocating sections can be retried so code that has side effects (like
updating flushed bytes accounting) has no place there.
2018-01-31 16:04:31 +00:00
Paweł Dziepak
d803370868 memtable: do not update iterator_reader::_last in alloc section
iterator_reader::_last is a part of the state that survives allocating
section retries, therefore, it should not be modified in the retryable
context.
2018-01-31 16:03:16 +00:00
Avi Kivity
4463e9071a Merge "Adding the API V2 Swagger definition file" from Amnon
"This series adds the base for the V2 Swagger definition file.
After the series, the definition file will be at:
http://localhost:10000/v2

It can be used with the swagger ui, by replacing the url in the search
path."

* 'amnon/swagger_20' of github.com:scylladb/seastar-dev:
  Register the API V2 swagger file
  Adding the header part of the swagger2.0 API
2018-01-31 14:47:50 +02:00
Duarte Nunes
cf6110d840 tests/cell_locker_test: Ensure timeout test finishes in useful time
Use saturating_subtract to prevent a really long timeout and having
the test hang.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180130221336.1773-1-duarte@scylladb.com>
2018-01-31 11:34:08 +01:00
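The helper itself is not shown in the log. One plausible shape for a saturating subtraction on time points is sketched below; the signature is an assumption, and the duration is assumed to be non-negative.

```cpp
#include <chrono>

// Hypothetical helper: subtract `d` from `tp`, clamping at the earliest
// representable time point instead of wrapping around.
// Assumes d is non-negative.
template <typename Clock, typename Duration, typename Rep, typename Period>
std::chrono::time_point<Clock, Duration>
saturating_subtract(std::chrono::time_point<Clock, Duration> tp,
                    std::chrono::duration<Rep, Period> d) {
    using time_point = std::chrono::time_point<Clock, Duration>;
    const Duration delta = std::chrono::duration_cast<Duration>(d);
    // min() + delta cannot overflow for a non-negative delta, so this
    // comparison is safe even when tp is close to min().
    if (tp < time_point::min() + delta) {
        return time_point::min();
    }
    return tp - delta;
}
```

Without the clamp, subtracting from a time point near the minimum would wrap to a value far in the future, which is the kind of very long timeout the message describes.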
Duarte Nunes
01a8e5abb9 Merge 'Materialized views: add local locking' from Nadav
"Before this patch set, our Materialized Views implementation can produce
incorrect results when given concurrent updates of the same base-table
row. Such concurrent updates may result, in certain cases, in two
different rows in the view table, instead of just one with the latest
data. In this series we add locking which serializes the two conflicting
updates, and solves this problem.

I explain in more detail why such locking is needed, and what kinds of
locks are needed, in the third patch."

* 'master' of https://github.com/nyh/scylla:
  Materialized views: serialize read-modify-update of base table
  Materialized views: test row_locker class
  Materialized views: implement row and partition locking mechanism
2018-01-30 17:40:12 +00:00
Tomasz Grabiec
cdd31918d0 Merge 'Make memtable reads exception safe' from Paweł
These patches change the memtable reader implementation (in particular
partition_snapshot_reader) so that the existing exception safety
problems are fixed, but also in a way that, hopefully, would make it
easier to reason about the error handling and avoid future bugs in that
area.

The main difficulty related to exception safety is that when an
exception is thrown out of an allocating section that code is run again
with increased memory reserved. If the retryable code has side effects
it is very easy to get incorrect behaviour.

In addition to that, entering an allocating section is not exactly cheap
which encourages doing so rarely and having large sections.

The approach taken by this series is to, first, make entering allocating
sections cheaper and then reduce the amount of logic that runs inside
of them to a minimum.

This means that instead of entering a section once per call to
flat_mutation_reader::fill_buffer() the allocation section is entered
once for each emitted row. The only state modified from within the
section are cached iterators to the current row, which are dropped on
retry. Hopefully, this would make the reader code easier to reason
about.

The optimisations to the allocating sections and managed_bytes
linearised context have successfully eliminated any penalty caused by
much more fine grained allocating sections.

Fixes #3123.
Fixes #3133.

Tests: unit-tests (release)

BEFORE
test                                      iterations      median         mad         min         max
memtable.one_partition_one_row               1155362   869.139ns     0.282ns   868.465ns   873.253ns
memtable.one_partition_many_rows              127252     7.871us    15.252ns     7.851us     7.886us
memtable.many_partitions_one_row               58715    17.109us     2.765ns    17.013us    17.112us
memtable.many_partitions_many_rows              4839   206.717us   212.385ns   206.505us   207.448us

AFTER
test                                      iterations      median         mad         min         max
memtable.one_partition_one_row               1194453   839.223ns     0.503ns   834.952ns   842.841ns
memtable.one_partition_many_rows              133785     7.477us     4.492ns     7.473us     7.507us
memtable.many_partitions_one_row               60267    16.680us    18.027ns    16.592us    16.700us
memtable.many_partitions_many_rows              4975   201.048us   144.929ns   200.822us   201.699us

        ./before_sq  ./after_sq  diff
 read     337373.86   353694.24  4.8%
 write    388759.99   394135.78  1.4%

* https://github.com/pdziepak/scylla.git memtable-exception-safety/v2:
  tests/perf: add microbenchmarks for memtable reader
  flat_mutation_reader: add allocation point in push_mutation_fragment
  linearization_context: remove non-trivial operations from fast path
  lsa: split alloc section into reserving and reclamation-disabled parts
  lsa: optimise disabling reclamation and invalidation counter
  mutation_fragment: allow creating clustering row in place
  partition_snapshot_reader: minimise amount of retryable code
  memtable: drop memtable_entry::read()
  tests/memtable: add test for reader exception safety
2018-01-30 18:33:27 +01:00
Paweł Dziepak
1406ac5088 tests/memtable: add test for reader exception safety 2018-01-30 18:33:26 +01:00
Paweł Dziepak
ea7248056f memtable: drop memtable_entry::read() 2018-01-30 18:33:26 +01:00
Paweł Dziepak
0420ca48a5 partition_snapshot_reader: minimise amount of retryable code
Retryable code that has side effects is a recipe for bugs. This patch
reworks the snapshot reader so that the amount of logic run with
reclamation disabled is minimal and has very limited side effects.
2018-01-30 18:33:26 +01:00
Paweł Dziepak
b1cb7d214e mutation_fragment: allow creating clustering row in place
Moving clustering_row is expensive due to the amount of data stored
internally. Adding a mutation_fragment constructor that builds a
clustering_row in-place saves some of that moving.
2018-01-30 18:33:26 +01:00
Paweł Dziepak
dcd79af8ed lsa: optimise disabling reclamation and invalidation counter
Most of the lsa gory details are hidden in utils/logalloc.cc. That
includes the actual implementation of a lsa region: region_impl.

However, there is code in the hot path that often accesses the
_reclaiming_enabled member as well as its base class
allocation_strategy.

In order to optimise those accesses another class is introduced:
basic_region_impl that inherits from allocation_strategy and is a base
of region_impl. It is defined in utils/logalloc.hh so that it is
publicly visible and its member functions are inlineable from anywhere
in the code. This class is supposed to be as small as possible, but
contain all members and functions that are accessed from the fast path
and should be inlined.
2018-01-30 18:33:26 +01:00
Paweł Dziepak
d825ae37bf lsa: split alloc section into reserving and reclamation-disabled parts
An allocating section reserves a certain amount of memory, then disables
reclamation and attempts to perform the given operation. If that fails due
to std::bad_alloc the reserve is increased and the operation is retried.

Reserving memory is expensive while just disabling reclamation isn't.
Moreover, the code that runs inside the section needs to be safely
retryable. This means that we want the amount of logic running with
reclamation disabled as small as possible, even if it means entering and
leaving the section multiple times.

In order to reduce the performance penalty of such a solution the memory
reserving and reclamation disabling parts of the allocating sections are
separated.
2018-01-30 18:33:26 +01:00
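The split described above can be illustrated with a much simplified model. Nothing here is the real LSA interface: `region`, `reclaim_lock`, `allocating_section` and their members are hypothetical stand-ins, and "reserving" just records a number.

```cpp
#include <cstddef>
#include <new>

// Hypothetical stand-in for an LSA region.
struct region {
    std::size_t reserved = 0;
    bool reclaim_enabled = true;
    void reserve(std::size_t bytes) { reserved = bytes; } // expensive in real code
};

// RAII guard for the cheap part: disable reclamation, restore on exit.
class reclaim_lock {
    region& _r;
    bool _prev;
public:
    explicit reclaim_lock(region& r) : _r(r), _prev(r.reclaim_enabled) {
        _r.reclaim_enabled = false;
    }
    ~reclaim_lock() { _r.reclaim_enabled = _prev; }
    reclaim_lock(const reclaim_lock&) = delete;
    reclaim_lock& operator=(const reclaim_lock&) = delete;
};

class allocating_section {
    std::size_t _reserve = 1024;
public:
    // Reserving part: on std::bad_alloc, grow the reserve and retry fn.
    // fn must be safe to run more than once. This sketch retries forever;
    // real code would need an upper bound.
    template <typename Func>
    decltype(auto) with_reserve(region& r, Func&& fn) {
        for (;;) {
            r.reserve(_reserve);
            try {
                return fn();
            } catch (const std::bad_alloc&) {
                _reserve *= 2;
            }
        }
    }

    // Reclamation-disabled part: cheap, so it can be entered once per row
    // under a single reserve.
    template <typename Func>
    decltype(auto) with_reclaiming_disabled(region& r, Func&& fn) {
        reclaim_lock guard(r);
        return fn();
    }

    std::size_t reserve_size() const { return _reserve; }
};
```

A caller would wrap a whole batch in `with_reserve()` and each small step in `with_reclaiming_disabled()`, so only the small step has to be written with retries in mind.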
Paweł Dziepak
eb2e88e925 linearization_context: remove non-trivial operations from fast path
Since linearization_context is thread_local every time it is accessed
the compiler needs to emit code that checks if it was already
constructed and does so if it wasn't. Moreover, upon leaving the context
from the outermost scope the map needs to be cleared.

All these operations impose some performance overhead and aren't really
necessary if no buffers were linearised (the expected case). This patch
rearranges the code so that linearization_context is trivially
constructible and the map is cleared only if it was modified.
2018-01-30 18:33:25 +01:00
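A rough sketch of the technique, under stated assumptions: `linearization_ctx`, `linearization_scope` and `remember_linearized` are hypothetical names, and the map is held through a raw pointer so that the thread-local object needs no constructor or destructor.

```cpp
#include <string>
#include <type_traits>
#include <unordered_map>
#include <utility>

// Hypothetical context. With only trivial members, the thread_local object
// is zero-initialized statically and needs no "already constructed?" check.
struct linearization_ctx {
    unsigned nesting;
    std::unordered_map<const void*, std::string>* buffers; // created on first use
};
static_assert(std::is_trivially_default_constructible<linearization_ctx>::value,
              "context must not need dynamic initialization");
static_assert(std::is_trivially_destructible<linearization_ctx>::value,
              "context must not need a thread-exit destructor");

static thread_local linearization_ctx ctx;

class linearization_scope {
public:
    linearization_scope() { ++ctx.nesting; }
    ~linearization_scope() {
        // Fast path: nothing was linearized, so there is nothing to clear.
        if (--ctx.nesting == 0 && ctx.buffers) {
            delete ctx.buffers;
            ctx.buffers = nullptr;
        }
    }
    linearization_scope(const linearization_scope&) = delete;
    linearization_scope& operator=(const linearization_scope&) = delete;
};

// Slow path: the map is only allocated when a buffer is actually stored.
inline const std::string& remember_linearized(const void* key, std::string data) {
    if (!ctx.buffers) {
        ctx.buffers = new std::unordered_map<const void*, std::string>();
    }
    return (*ctx.buffers)[key] = std::move(data);
}
```

Entering and leaving a scope then costs a counter update and one branch when no buffer was linearized, which is the expected case according to the message.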
Paweł Dziepak
a1278b4d6a flat_mutation_reader: add allocation point in push_mutation_fragment
Exception safety tests inject a failure at every allocation and verify
whether the error is handled properly.

push_mutation_fragment() adds a mutation fragment to a circular_buffer.
In theory any call to that function can result in a memory allocation,
but in practice that depends on the implementation details. In order to
improve the effectiveness of the exception safety tests this patch adds
an explicit allocation point in push_mutation_fragment().
2018-01-30 18:33:25 +01:00
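The testing technique can be sketched as below. The injector, the buffer class and the driver are hypothetical stand-ins written for this illustration, not the facilities the tests actually use.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <limits>
#include <new>

// Hypothetical failure injector: throws at the N-th allocation point.
class alloc_failure_injector {
    static constexpr std::uint64_t never = std::numeric_limits<std::uint64_t>::max();
    std::uint64_t _count = 0;
    std::uint64_t _fail_at = never;
public:
    void fail_after(std::uint64_t n) { _count = 0; _fail_at = n; }
    void cancel() { _fail_at = never; }
    void on_alloc_point() {
        if (_count++ == _fail_at) {
            _fail_at = never;
            throw std::bad_alloc();
        }
    }
};

static alloc_failure_injector injector;

class reader_buffer {
    std::deque<int> _buf;
public:
    void push_fragment(int fragment) {
        // Explicit allocation point: the push counts as a possible failure
        // even when the container happens not to allocate.
        injector.on_alloc_point();
        _buf.push_back(fragment);
    }
    std::size_t size() const { return _buf.size(); }
};

// Driver: fail at the 0th, 1st, 2nd... allocation point until fn completes.
// Returns how many failures were injected before the clean run.
template <typename Func>
std::uint64_t run_with_injected_failures(Func fn) {
    for (std::uint64_t n = 0;; ++n) {
        injector.fail_after(n);
        try {
            fn();
            injector.cancel();
            return n;
        } catch (const std::bad_alloc&) {
            // Expected; fn is responsible for leaving its state consistent.
        }
    }
}
```

With the explicit allocation point, a function that pushes three fragments is exercised with a failure at each of the three pushes, regardless of the container's growth policy.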
Paweł Dziepak
486e0d8740 tests/perf: add microbenchmarks for memtable reader 2018-01-30 18:33:25 +01:00
Avi Kivity
00d70080af Merge "Consume promoted index incrementally" from Vladimir
"This patchset makes index_reader consume promoted index incrementally
on demand as the reader advances through the current partition instead
of storing the entire promoted index which can be huge.

When the current page is parsed, data for promoted indices are turned
into input streams that are only read and parsed if a particular
position within a partition is sought. This avoids potentially large
allocations for big partitions."

* 'issues/2981/v10' of https://github.com/argenet/scylla:
  Use advance_past for single partition upper bound.
  Remove obsolete types and methods.
  Simplify continuous_data_consumer::consume_input() interface.
  Parse promoted index entries lazily upon request rather than immediately.
  Add helper input streams: buffer_input_stream and prepended_input_stream.
  Support skipping over bytes from input stream in parsers based on continuous_data_consumer
  Add performance tests for large partition slicing using clustering keys.
2018-01-30 18:22:28 +02:00
Nadav Har'El
2ea1922a4d Materialized views: serialize read-modify-update of base table
Before this patch, our Materialized Views implementation can produce
incorrect results when given concurrent updates of the same base-table
row. Such concurrent updates may result, in certain cases, in two
different rows added to the view table, instead of just one with the latest
data. In this patch we add locking which serializes the two conflicting
updates, and solves this problem. The locking for a single base-table
column_family is implemented by the row_locker class introduced in a
previous patch.

A long comment in the code of this patch explains in more detail why
this locking is needed, when, and what types of locks are needed: We
sometimes need to lock a single clustering row, sometimes an entire
partition, sometimes an exclusive lock and sometimes a shared lock.

Fixes #3168

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-01-30 16:21:43 +02:00
Nadav Har'El
52e91623ce Materialized views: test row_locker class
This is a unit test for the row_locker facility. It tests various
combinations of shared and exclusive locks on rows and on partitions;
some should succeed immediately and some should block.

This tests the row_locker's API only; it does not use or test anything
in Materialized Views.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-01-30 16:19:43 +02:00
Nadav Har'El
31d0a1dd0c Materialized views: implement row and partition locking mechanism
This patch adds a "row_locker" class providing locking (shard-locally) of
individual clustering rows or entire partitions, and both exclusive and
shared locks (a.k.a. reader/writer lock).

As we'll see in a following patch, we need this locking capability for
materialized views, to serialize the read-modify-update modifications
which involve the same rows or partitions.

The new row_locker is significantly different from the existing cell_locker.
The two main differences are that 1. row_locker also supports locking the
entire partition, not just individual rows (or cells in them), and that
2. row_locker supports also shared (reader) locks, not just exclusive locks.
For this reason we opted for a new implementation, instead of making large
modifications to the existing cell_locker. And we put the source files
in the view/ directory, because row_locker's requirements are pretty
specific to the needs of materialized views.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-01-30 16:16:27 +02:00
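The locking modes described above can be modelled with a small synchronous sketch. It is not the row_locker API: the real class is asynchronous and waits for a lock, whereas this stand-in only offers try-lock calls. The rule that a row lock first takes its partition's lock in shared mode is an assumption of this sketch, as are all names.

```cpp
#include <map>
#include <string>

// Reader/writer state for one lockable object.
struct rw_state {
    unsigned readers = 0;
    bool writer = false;
    bool try_lock(bool exclusive) {
        if (writer || (exclusive && readers > 0)) {
            return false;
        }
        if (exclusive) { writer = true; } else { ++readers; }
        return true;
    }
    void unlock(bool exclusive) {
        if (exclusive) { writer = false; } else { --readers; }
    }
};

class row_locker_sketch {
    struct partition_entry {
        rw_state lock;
        std::map<std::string, rw_state> rows;
    };
    std::map<std::string, partition_entry> _partitions; // entries are never erased here
public:
    bool try_lock_partition(const std::string& pk, bool exclusive) {
        return _partitions[pk].lock.try_lock(exclusive);
    }
    void unlock_partition(const std::string& pk, bool exclusive) {
        _partitions[pk].lock.unlock(exclusive);
    }
    // A row lock holds the partition in shared mode (assumption), so an
    // exclusive partition lock excludes every row lock in that partition.
    bool try_lock_row(const std::string& pk, const std::string& ck, bool exclusive) {
        partition_entry& p = _partitions[pk];
        if (!p.lock.try_lock(false)) {
            return false;
        }
        if (!p.rows[ck].try_lock(exclusive)) {
            p.lock.unlock(false);
            return false;
        }
        return true;
    }
    void unlock_row(const std::string& pk, const std::string& ck, bool exclusive) {
        partition_entry& p = _partitions[pk];
        p.rows[ck].unlock(exclusive);
        p.lock.unlock(false);
    }
};
```

In this model two writers to different rows of one partition proceed in parallel, while an exclusive partition lock has to wait until all row locks in that partition are released.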
Takuya ASADA
bec2b015e3 dist/debian: link yaml-cpp statically
To avoid incompatibilities with the distribution-provided libyaml-cpp, link it
statically.

Fixes #3164

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1517313320-10712-1-git-send-email-syuu@scylladb.com>
2018-01-30 14:22:02 +02:00
Botond Dénes
b7d902a9e9 database: remove unused concurrency config members
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b257c7e9d403c55aaec34fc48863c18f9c9ae11a.1517314398.git.bdenes@scylladb.com>
2018-01-30 14:21:25 +02:00
Botond Dénes
71be2e1d0d test.py: don't fail if test's exit code is not 0 on --help
test.py invokes all test executables once with --help to determine
whether it needs a -- to separate scylla args or not. For this check it
doesn't matter what exit code the test exits with, so don't fail if it's
not 0.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <d05be7c3819349e3b22b6249bb83fbf9269d14cb.1517314408.git.bdenes@scylladb.com>
2018-01-30 14:21:01 +02:00
Piotr Jastrzebski
d9415e8ed0 Remove unused consume_streamed_mutation
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Tests: units (release)

Message-Id: <fec7f2d01d42921270c90198a7b77b76960ff705.1517310923.git.piotr@scylladb.com>
2018-01-30 13:24:55 +02:00
Duarte Nunes
1e3fae5bef db/schema_tables: Only drop UDTs after merging tables
Dropping a user type requires that all tables using that type also be
dropped. However, a type may appear to be dropped at the same time as
a table, for instance due to the order in which a node receives schema
notifications, or when dropping a keyspace.

When dropping a table, if we build a schema in a shard through a
global_schema_pointer, then we'll check for the existence of any user
type the schema employs. We thus need to ensure types are only dropped
after tables, similarly to how it's done for keyspaces.

Fixes #3068

Tests: unit-tests (release)

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180129114137.85149-1-duarte@scylladb.com>
2018-01-30 12:07:04 +01:00
Avi Kivity
e1f4b06295 Merge seastar upstream
* seastar 770c450...19efbd9 (3):
  > configure.py: add --static-yaml-cpp option to link libyaml-cpp statically
  > Merge 'Avoid kernel stalls due to fsync' from Avi
  > rwlock: add exception-safe lock/unlock alternative
2018-01-30 11:44:00 +02:00
Pekka Enberg
da06339b13 scripts/find-maintainer: Find subsystem maintainer
This patch adds a scripts/find-maintainer script, similar to
scripts/get_maintainer.pl in Linux, which looks up maintainers and
reviewers for a specific file from a MAINTAINERS file.

Example usage looks as follows:

$ ./scripts/find-maintainer cql3/statements/create_view_statement.cc
CQL QUERY LANGUAGE
  Tomasz Grabiec <tgrabiec@scylladb.com>   [maintainer]
  Pekka Enberg <penberg@scylladb.com>      [maintainer]
MATERIALIZED VIEWS
  Duarte Nunes <duarte@scylladb.com>       [maintainer]
  Pekka Enberg <penberg@scylladb.com>      [maintainer]
  Nadav Har'El <nyh@scylladb.com>          [reviewer]
  Duarte Nunes <duarte@scylladb.com>       [reviewer]

The main objective of this script is to make it easier for people to
find reviewers and maintainers for their patches.
Message-Id: <20180119075556.31441-1-penberg@scylladb.com>
2018-01-30 09:42:35 +00:00
Vladimir Krivopalov
b91c3fd47e Use advance_past for single partition upper bound.
Instead of advancing to the next partition, first try to find a more
precise position using promoted index blocks.
advance_past() only seeks within currently available PI blocks (or reads
the first batch, if never read before) and uses the position if found;
otherwise it resorts to advance_to_next_partition().

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:57:45 -08:00
Vladimir Krivopalov
6f8c6a0933 Remove obsolete types and methods.
These types and methods are no longer in use since the index_reader is
now consuming promoted index incrementally.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:57:35 -08:00
Vladimir Krivopalov
0a7a56edd5 Simplify continuous_data_consumer::consume_input() interface.
Remove the redundant input parameter, as continuous_data_consumer derivatives
would only use themselves as a context. Take the context internally and make
the function a regular (non-template) one with no parameters.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:57:26 -08:00
Vladimir Krivopalov
7e15e436de Parse promoted index entries lazily upon request rather than immediately.
The promoted index is now converted into an input_stream and skipped over
instead of being consumed immediately and stored as a single buffer.
The only part that is read right away is the deletion time, as it is
likely to be in the already-read buffer; reading it should both
be cheap and avoid reading the whole promoted index if only the
deletion time mark is needed.

When accessed, promoted index is parsed in chunks, buffer by buffer, to
limit memory consumption.

Fixes #2981

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:57:15 -08:00
Vladimir Krivopalov
9fdf4b24b5 Add helper input streams: buffer_input_stream and prepended_input_stream.
buffer_input_stream is a simple input_stream wrapping a single
temporary_buffer.

prepended_input_stream suits the case when some data has been read
into a buffer and the rest is still in a stream. It accepts a buffer and
a data_source and first reads from the buffer and then, when it ends,
proceeds reading from the data_source.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:57:04 -08:00
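The "buffer first, then the rest of the stream" behaviour can be sketched with plain standard-library types. `data_source_fn` and `prepended_source` are hypothetical; the real helpers are built on Seastar's input_stream and data_source, which this sketch does not model.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical pull-based source: returns the next chunk, or an empty
// string at end of stream.
using data_source_fn = std::function<std::string()>;

class prepended_source {
    std::string _prefix;     // data that was already read into a buffer
    data_source_fn _rest;    // where the remaining data comes from
    bool _prefix_consumed = false;
public:
    prepended_source(std::string prefix, data_source_fn rest)
        : _prefix(std::move(prefix)), _rest(std::move(rest)) {}

    // Serve the buffered prefix first; touch the underlying source only
    // after the prefix has been handed out.
    std::string get() {
        if (!_prefix_consumed) {
            _prefix_consumed = true;
            if (!_prefix.empty()) {
                return std::move(_prefix);
            }
        }
        return _rest();
    }
};
```

The underlying source is not called until the prefix is exhausted, so a consumer that stops early never triggers further reads.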
Vladimir Krivopalov
5dca3100ed Support skipping over bytes from input stream in parsers based on continuous_data_consumer
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:56:55 -08:00
Vladimir Krivopalov
ebdcffab1a Add performance tests for large partition slicing using clustering keys.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:56:35 -08:00
Takuya ASADA
5f835be3aa dist/common/scripts/scylla_io_setup: check data_file_directories existence before running iotune
Currently we don't check that data_file_directories exists before running iotune,
so it shows an unclear error message.
To make the message better, check that the directory exists in scylla_io_setup.

Fixes #3137

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1517200647-6347-1-git-send-email-syuu@scylladb.com>
2018-01-29 18:11:12 +02:00
Avi Kivity
3ce5ad3c7c Merge seastar upstream
* seastar d03896d...770c450 (10):
  > tls_test: Fix echo test not setting server trust store
  > tls: Do not restrict re-handshake to client
  > tls: Actually verify client certificate if requested
  > rwlock: add method for determining if an rwlock is locked
  > metrics: Add missing `break` to metric_value::operator+()
  > memory: fix error injector throwing from noexcept memory allocator functions
  > systemwide_memory_barrier: don't use mprotect() on ARM
  > sharded: Add const version of sharded::local()
  > Add const overloads of front() and back() to the circular_buffer.
  > Remove unused lambda captures

Fixes #3072
2018-01-29 15:28:44 +02:00
Botond Dénes
12b1520415 exponential_backoff_retry::do_until_value(): restore indentation
Deferred from previous patch.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <a10053f6c0ed8a24a74e51f1df4e9a5acf59922d.1517222195.git.bdenes@scylladb.com>
2018-01-29 10:50:01 +00:00
Botond Dénes
e0c082616a exponential_backoff_retry::do_until_value(): fix use-after-move
The exponential_backoff_retry instance is captured by move and is then
indirectly moved again as repeat_until_value() moves the lambda it is
passed into its internal state. This caused problems as internal
lambdas store references to the instance and these references go stale
after the move.
To fix this, keep hold of the exponential_backoff_retry instance in an
enclosing do_with() to make it safe for internal lambdas to reference
it.

Indentation will be fixed by the next patch.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <adc49d25a6176756d60e092f3713c0c897732382.1517222195.git.bdenes@scylladb.com>
2018-01-29 10:50:01 +00:00
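The shape of the fix can be shown without Seastar. In the sketch below a shared_ptr plays the role that do_with() plays in the patch: it gives the retry state a stable address that survives moves of the closure. `backoff_retry` and `make_retry_step` are hypothetical names.

```cpp
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-in for the retry state.
struct backoff_retry {
    int sleep_ms = 1;
    void next() { sleep_ms *= 2; }
};

// The state is moved to the heap once. Moving the returned closure only
// moves the pointer, so references into the state never go stale.
inline std::function<int()> make_retry_step(backoff_retry r) {
    auto stable = std::make_shared<backoff_retry>(std::move(r));
    return [stable] {
        stable->next();
        return stable->sleep_ms;
    };
}
```

Had the closure owned the state by value and handed out references to it, moving the closure would have left those references pointing at the moved-from object, which is the bug the message describes.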
Duarte Nunes
bfe5a8e96f utils/managed_vector: Return reference to emplaced element
We are in 2018, after all.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180126105417.54285-1-duarte@scylladb.com>
2018-01-26 13:49:56 +01:00
Duarte Nunes
269a4aec23 test.py: Rename streamed_mutation_test
96c97ad1db changed the name of the test,
but didn't update the test.py file.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-01-26 01:04:23 +01:00
Tomasz Grabiec
1219120c00 Merge cleanup of non-flat mutation readers from Piotr
Removes uses of obsolete mutation_reader and streamed_mutation.
Superseded by flat_mutation_reader.

* seastar-dev.git haaawk/cleanup:
  Rename streamed_mutation* files to mutation_fragment*
  Delete unused streamed_mutation
  Delete unused consume_all(streamed_mutation&)
  Delete unused fill_buffer_from<streamed_mutation>
  Delete unused do_consume_streamed_mutation_flattened
  streamed_mutation: delete operator<<
  streamed_mutation: delete unused make_forwardable
  Delete unused streamed_mutation_opt
  Delete unused check_order_of_fragments
  Delete unused streamed_mutation_from_mutation
  Move test_abandoned_flat_mutation_reader_from_mutation to
  Change test_abandoned_streamed_mutation_from_mutation
  test_mutation_merger_conforms_to_mutation_source: use flat reader
  Delete unused consume(streamed_mutation&)
  Delete unused mutation_from_streamed_mutation(streamed_mutation_opt)
  Delete unused mutation_from_streamed_mutation(streamed_mutation&)
  Delete test_mutation_from_streamed_mutation_from_mutation
  Delete unused freeze(streamed_mutation)
  Delete test_freezing_streamed_mutations
  streamed_mutation: delete unused transform
  test_schema_upgrader_is_equivalent_with_mutation_upgrade: use flat reader
  streamed_mutation: delete unused consume_mutation_fragments_until
  Delete unused merge_mutations
  Delete test_mutation_merger
  Delete unused make_empty_streamed_mutation
  Delete unused streamed_mutation_from_forwarding_streamed_mutation
  Delete unused streamed_mutation_assertions
  Turn test_streamed_mutation_fragments_have_monotonic_positions
  Delete run_conversion_to_mutation_reader_tests
  Delete unused assert_that(streamed_mutation_opt)
  Delete unused assert_that(streamed_mutation)
  Delete unused mutation_reader
  perf_fast_forward: delete unused consume_all
  Delete unused consume(mutation_reader&, Consumer)
  Remove unused mutation_reader_assertions
  Remove unused query_state::reader
  Delete unused make_reader_returning
  Delete unused make_reader_returning_many
  Delete unused make_empty_reader
  Delete unused mutation_reader_from_flat_mutation_reader
  Delete unused flat_mutation_reader_from_mutation_reader
  Delete tests for mutation readers converters
  dummy_incremental_selector: use flat reader
  Delete unused streamed_mutation_from_flat_mutation_reader
  perf_fast_forward: use flat reader in test_forwarding_with_restriction
  perf_fast_forward: use flat reader in slice_partitions
  perf_fast_forward: use flat reader in slice_rows_single_key
  perf_fast_forward: use flat reader in test_reading_all
  perf_fast_forward: use flat reader in slice_rows
  perf_fast_forward: add consume_all_with_next_partition
  perf_fast_forward: use flat reader in scan_with_stride_partitions
  perf_fast_forward: use flat reader in scan_rows_with_stride
  perf_fast_forward: add assert_partition_start
  perf_fast_forward: add consume_all(flat_mutation_reader&)
  partition_checksum::compute_legacy: use only flat reader
  row_cache: rename make_flat_reader to make_reader
  row_cache: Delete unused make_reader
  test_mvcc: use flat reader
  test_cache_population_and_clear_race: use flat reader
  test_cache_population_and_update_race: use flat reader
  test_continuity_flag_and_invalidate_race: use flat reader
  test_update_failure: use flat reader
  row_cache_test: use flat reader in verify_has
  row_cache_test: use flat reader in has_key
  test_sliced_read_row_presence: use flat reader
  test_lru: use flat reader
  test_update_invalidating: use flat reader
  test_scan_with_partial_partitions: use flat reader
  test_cache_populates_partition_tombstone: use flat reader
  test_tombstone_merging_in_partial_partition: use flat reader
  consume_all,populate_range: use flat reader
  test_readers_get_all_data_after_eviction: use flat reader
  test_tombstones_are_not_missed_when_range_is_invalidated: use flat reader
  test_exception_safety_of_reads: use flat reader
  test_exception_safety_of_transitioning_from_underlying_read_to_read_from_cache: use flat reader
  test_exception_safety_of_partition_scan: use flat reader
  test_concurrent_population_before_latest_version_iterator: use flat reader
  test_concurrent_populating_partition_range_reads: use flat reader
  test_random_row_population: use flat reader
  test_continuity_is_populated_when_read_overlaps_with_older_version: use flat reader
  test_continuity_population_with_multicolumn_clustering_key: use flat reader
  test_continuity_is_populated_for_single_row_reads: use flat reader
  flat_mutation_reader_assertions: add produces_compacted
  test_concurrent_setting_of_continuity_on_read_upper_bound: use flat reader
  test_reading_from_random_partial_partition: use flat reader
  test_tombstone_merging_of_overlapping_tombstones_in_many_versions: use flat reader
  test_concurrent_reads_and_eviction: use flat reader
  test_eviction: use flat reader
  test_random_partition_population: use flat reader
  test_single_key_queries_after_population_in_reverse_order: use flat reader
  test_query_of_incomplete_range_goes_to_underlying: use flat reader
  test_cache_delegates_to_underlying_only_once_with_single_partition: use flat reader
  test_cache_uses_continuity_info_for_single_partition_query: use flat reader
  test_cache_delegates_to_underlying_only_once_empty_single_partition_query: use flat reader
  test_cache_delegates_to_underlying_only_once_empty_full_range: use flat reader
  test_cache_works_after_clearing: use flat reader
  test_cache_delegates_to_underlying: use flat reader
  cache_flat_mutation_reader_test: use flat reader
  row_cache_alloc_stress: use flat reader
2018-01-24 21:54:08 +01:00
Piotr Jastrzebski
1f9df7aade Fix master
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 21:00:51 +01:00
Piotr Jastrzebski
96c97ad1db Rename streamed_mutation* files to mutation_fragment*
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
d590a063c6 Delete unused streamed_mutation
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
6f468802f4 Delete unused consume_all(streamed_mutation&)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
970a863950 Delete unused fill_buffer_from<streamed_mutation>
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
28c36d8884 Delete unused do_consume_streamed_mutation_flattened
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
6c6068f1da streamed_mutation: delete operator<<
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
f907073bde streamed_mutation: delete unused make_forwardable
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
a346b32584 Delete unused streamed_mutation_opt
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
7161781586 Delete unused check_order_of_fragments
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
41b23a619e Delete unused streamed_mutation_from_mutation
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
795102a0f8 Move test_abandoned_flat_mutation_reader_from_mutation to
flat_mutation_reader_test.cc

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
6b78956563 Change test_abandoned_streamed_mutation_from_mutation
to test_abandoned_flat_mutation_reader_from_mutation

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
9e06711805 test_mutation_merger_conforms_to_mutation_source: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
916a9c339c Delete unused consume(streamed_mutation&)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
d9cbb9fedc Delete unused mutation_from_streamed_mutation(streamed_mutation_opt)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
759271f866 Delete unused mutation_from_streamed_mutation(streamed_mutation&)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
a39ddc8cf6 Delete test_mutation_from_streamed_mutation_from_mutation
It tests mutation_from_streamed_mutation that is no longer
used and will be removed in the next patch.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
a1cf4b4cae Delete unused freeze(streamed_mutation)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
0f78e9c24a Delete test_freezing_streamed_mutations
It tests freeze(streamed_mutation) which is no longer used
and will be removed in the next patch.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
05ae4f5d15 streamed_mutation: delete unused transform
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
1c12884fba test_schema_upgrader_is_equivalent_with_mutation_upgrade: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
eec6c2efb5 streamed_mutation: delete unused consume_mutation_fragments_until
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
ca905d38b1 Delete unused merge_mutations
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
8abbabef30 Delete test_mutation_merger
merge_mutations is no longer used and will be removed
by the next patch.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
b82f00fafb Delete unused make_empty_streamed_mutation
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
fb42022f03 Delete unused streamed_mutation_from_forwarding_streamed_mutation
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
5959337234 Delete unused streamed_mutation_assertions
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
8bdc74c9e2 Turn test_streamed_mutation_fragments_have_monotonic_positions
into test_mutation_reader_fragments_have_monotonic_positions

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
a546cfd0d5 Delete run_conversion_to_mutation_reader_tests
It's no longer needed because converters are no longer used.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
05ed42c08d Delete unused assert_that(streamed_mutation_opt)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
912a38d60b Delete unused assert_that(streamed_mutation)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
61f0ac257f Delete unused mutation_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
a944a1f7f1 perf_fast_forward: delete unused consume_all
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
9ce48bc5fc Delete unused consume(mutation_reader&, Consumer)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
7729bc5e7b Remove unused mutation_reader_assertions
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
5636a97c81 Remove unused query_state::reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
37285ad7fa Delete unused make_reader_returning
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
864db78fcf Delete unused make_reader_returning_many
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
ff4ffc1c64 Delete unused make_empty_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
0b8aedcc59 Delete unused mutation_reader_from_flat_mutation_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
c9575078a1 Delete unused flat_mutation_reader_from_mutation_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
f20c19b0e6 Delete tests for mutation readers converters
The converters are not used anywhere any longer and
will be deleted in the next patches.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
88ca42fa69 dummy_incremental_selector: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
8aaf5dc900 Delete unused streamed_mutation_from_flat_mutation_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
93355372a0 perf_fast_forward: use flat reader in test_forwarding_with_restriction
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
252909c8ab perf_fast_forward: use flat reader in slice_partitions
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
7d082e6ea7 perf_fast_forward: use flat reader in slice_rows_single_key
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
177aa88dc1 perf_fast_forward: use flat reader in test_reading_all
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
899e471222 perf_fast_forward: use flat reader in slice_rows
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
e66c73839e perf_fast_forward: add consume_all_with_next_partition
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
b9bfa49088 perf_fast_forward: use flat reader in scan_with_stride_partitions
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
f75c58915d perf_fast_forward: use flat reader in scan_rows_with_stride
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
52021dc605 perf_fast_forward: add assert_partition_start
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
5c213b9cbc perf_fast_forward: add consume_all(flat_mutation_reader&)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
ee6f2ca554 partition_checksum::compute_legacy: use only flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
39ec13133f row_cache: rename make_flat_reader to make_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
0f45df96ca row_cache: Delete unused make_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
0d76091a28 test_mvcc: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
425c1624cd test_cache_population_and_clear_race: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
dc97acb778 test_cache_population_and_update_race: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
1bead9747a test_continuity_flag_and_invalidate_race: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
4266b9759e test_update_failure: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
d5366026b1 row_cache_test: use flat reader in verify_has
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
56b0157831 row_cache_test: use flat reader in has_key
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
06bca9f4d5 test_sliced_read_row_presence: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
6c3d9cdb9f test_lru: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
a979869a15 test_update_invalidating: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
781d9a324d test_scan_with_partial_partitions: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
f199aab1ad test_cache_populates_partition_tombstone: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
9755f7677c test_tombstone_merging_in_partial_partition: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
2e1b12b6ce consume_all,populate_range: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
d08f4a40b2 test_readers_get_all_data_after_eviction: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
f99992261f test_tombstones_are_not_missed_when_range_is_invalidated: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
50fb2a57b6 test_exception_safety_of_reads: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
f0af5a1321 test_exception_safety_of_transitioning_from_underlying_read_to_read_from_cache: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
98b97be19a test_exception_safety_of_partition_scan: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
5010c082f6 test_concurrent_population_before_latest_version_iterator: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
f8964f3aff test_concurrent_populating_partition_range_reads: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
3e1da7525e test_random_row_population: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
e6cf785829 test_continuity_is_populated_when_read_overlaps_with_older_version: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
2b61411c7b test_continuity_population_with_multicolumn_clustering_key: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
561f5fbb5a test_continuity_is_populated_for_single_row_reads: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
b4cfe4dde2 flat_mutation_reader_assertions: add produces_compacted
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
a1b6557877 test_concurrent_setting_of_continuity_on_read_upper_bound: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
6bbd0c7301 test_reading_from_random_partial_partition: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
327eb8fbbd test_tombstone_merging_of_overlapping_tombstones_in_many_versions: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
07df1a6f87 test_concurrent_reads_and_eviction: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
63f45d522e test_eviction: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
57d19a390a test_random_partition_population: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
e9e8121ffe test_single_key_queries_after_population_in_reverse_order: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
9acbb1e0f4 test_query_of_incomplete_range_goes_to_underlying: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
7456c31e10 test_cache_delegates_to_underlying_only_once_with_single_partition: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
4a3f5249ce test_cache_uses_continuity_info_for_single_partition_query: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
869443e11f test_cache_delegates_to_underlying_only_once_empty_single_partition_query: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
4cc9a0d852 test_cache_delegates_to_underlying_only_once_empty_full_range: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
5cdc77b66e test_cache_works_after_clearing: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
5091474f14 test_cache_delegates_to_underlying: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
e290f46d2d cache_flat_mutation_reader_test: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
4119a61155 row_cache_alloc_stress: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
c0c88b3d4e Fix master
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:53:11 +01:00
Amnon Heiman
a0a1961b6d database: correct the label creation for database reads
The labels in the database active_reads metrics were not defined correctly.

Labels should be created so that it is possible to select based on their
value.

The current implementation defines a label "class" with three instances:
user, streaming, system.

Fixes: #2770

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180123125206.23660-1-amnon@scylladb.com>
2018-01-24 20:09:40 +01:00
Piotr Jastrzebski
c394dd9288 row_cache_test: add tests for small_buffer
When the buffer of a flat reader is small, the reader can't
handle range_tombstones correctly.

This is not a problem in production, where the buffer is large.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:09:11 +01:00
Piotr Jastrzebski
19e1f7c285 cache_flat_mutation_reader: fix tombstones handling with small buffer
Previously, when the buffer was so small that it could fit only a single
range_tombstone, cache_flat_mutation_reader would keep returning
the same tombstone over and over again.

The fix is to set _lower_bound to the next fragment we want to return.

Fixes #3139

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:09:11 +01:00
Tomasz Grabiec
6654fa6df7 row_cache: Drop unnecessary assignment to _lower_bound on exception
We no longer drain cached tombstones since commit
41ede08a1d, so this adjustment of
lower_bound is not needed.

Message-Id: <1516796248-11290-1-git-send-email-tgrabiec@scylladb.com>
2018-01-24 16:39:34 +02:00
Tomasz Grabiec
bf4a90fa51 flat_mutation_reader: Fix use-after-scope on timeout
timeout parameter was captured by reference, and could be accessed out
of scope in case the repeat loop deferred.

Fixes debug-mode failure of flat_mutation_reader_test.
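
A minimal sketch of the failure mode, for illustration only and not taken
from the patch: a callback that runs after its creator has returned must own
a copy of the timeout. The names deferred, schedule_read and run_first are
hypothetical, and a plain std::function queue stands in for a deferred
continuation.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

using time_point = std::chrono::steady_clock::time_point;

// Hypothetical stand-in for deferred work: callbacks run after the caller returned.
std::vector<std::function<time_point()>> deferred;

// The lambda captures `timeout` by value, so it owns a copy that outlives
// this function. Capturing [&timeout] would leave a dangling reference to
// the parameter once schedule_read() returns.
void schedule_read(time_point timeout) {
    deferred.push_back([timeout] { return timeout; });
}

// Runs the oldest deferred callback and returns the timeout it observed.
time_point run_first() {
    assert(!deferred.empty());
    return deferred.front()();
}
```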

Message-Id: <1516699230-19545-1-git-send-email-tgrabiec@scylladb.com>
2018-01-23 11:39:44 +02:00
Raphael S. Carvalho
2c181b69c9 sstables: fix wildly inaccurate sstable key estimation after dynamic index sampling
The reason sstable key estimation is inaccurate is that it doesn't account for
index sampling now being dynamic.

The estimation is done as follows:
    uint64_t get_estimated_key_count() const {
        return ((uint64_t)_components->summary.header.size_at_full_sampling + 1) *
                _components->summary.header.min_index_interval;
    }

The biggest problem is that _components->summary.header.min_index_interval isn't
actually the minimum interval, but instead the default interval value set in the
schema.
So the estimation gets worse the larger the average partition, because the larger
the average partition the lower the index sampling interval.
One of the problems is that estimation has a big influence on bloom filter size,
and so for large partitions we were generating bigger filters than we had to.

From now on, size at full sampling is calculated as if sampling were static
(which was the case until commit 8726ee937d which introduced size-based
sampling), using minimum index as a strict sampling interval.
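
A rough sketch of the arithmetic, for illustration only. summary_header is a
simplified, hypothetical stand-in, and the integer division in
size_at_full_sampling_static is an assumption about what "static sampling"
means here, not the code from the patch. With 1000 keys and an interval of
128, the static size is 7 and the estimate is 1024; if dynamic sampling had
produced 100 summary entries instead, the same formula would report 12928.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified summary header; field names follow the commit message.
struct summary_header {
    uint64_t size_at_full_sampling;
    uint32_t min_index_interval;
};

// The estimate quoted above.
uint64_t estimated_key_count(const summary_header& h) {
    return (h.size_at_full_sampling + 1) * h.min_index_interval;
}

// Assumption: under static sampling one summary entry is taken every
// min_index_interval keys, so the size follows from the key count alone.
uint64_t size_at_full_sampling_static(uint64_t keys_written, uint32_t min_index_interval) {
    assert(min_index_interval > 0);
    return keys_written / min_index_interval;
}
```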

Tests: units (release)

Fixes #3113.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180122233612.11147-1-raphaelsc@scylladb.com>
2018-01-23 10:42:24 +02:00
Asias He
5bae9b4e22 gossip: Check get_application_state_ptr in get_host_id
Check the pointer returned from get_application_state_ptr before using it.
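
A minimal sketch of the pattern, for illustration only. endpoint_state here
is a simplified, hypothetical stand-in, the "HOST_ID" key is made up, and
throwing on a missing state is an assumption; the patch may handle that case
differently.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical, simplified endpoint state: application state values keyed by name.
struct endpoint_state {
    std::map<std::string, std::string> states;

    // Returns nullptr when the requested state is absent.
    const std::string* get_application_state_ptr(const std::string& key) const {
        auto it = states.find(key);
        return it == states.end() ? nullptr : &it->second;
    }
};

// Checks the pointer before dereferencing it. Throwing is an assumption
// made for this sketch.
std::string get_host_id(const endpoint_state& eps) {
    const std::string* host_id = eps.get_application_state_ptr("HOST_ID");
    if (!host_id) {
        throw std::runtime_error("host id is not available");
    }
    return *host_id;
}

// Usage: a present state is returned as-is.
void example_get_host_id() {
    endpoint_state eps;
    eps.states["HOST_ID"] = "abc";
    assert(get_host_id(eps) == "abc");
}
```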

Refs #2136

Message-Id: <e2ea32993754a79837dd97a7c5c601461dc5e1d1.1516581663.git.asias@scylladb.com>
2018-01-22 12:56:20 +02:00
Avi Kivity
1193e7d2e2 Merge "CAST from integers to decimal" from Daniel
"It turned out that decimal numbers that were obtained as cast from integers
should always contain exactly one decimal place, 0.

This is especially noticeable when calculating avg(.) over such numbers,
because the result contains just one decimal place.

Fixes #3111."

* 'danfiala/integers-to-decimal' of github.com:hagrid-the-developer/scylla:
  tests: Add test that decimal obtained as CAST from integer always contain one decimal place.
  types: Decimal that is obtained from integer always contain one decimal place.
2018-01-21 20:21:00 +02:00
Daniel Fiala
4b31348463 tests: Add test that decimal obtained as CAST from integer always contain one decimal place.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-01-21 19:09:03 +01:00
Daniel Fiala
39a08cac6b types: Decimal that is obtained from integer always contain one decimal place.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-01-21 17:37:24 +01:00
Alexys Jacob
bd3517efd8 scyllatop: PEP8 python coding style compliance
this patch fixes the following remarks:
./defaults.py:2:9: E126 continuation line over-indented for hanging indent
./fake.py:15:1: E305 expected 2 blank lines after class or function definition, found 1
./livedata.py:49:17: F402 import 'metric' from line 5 shadowed by loop variable
./scyllatop.py:44:1: E305 expected 2 blank lines after class or function definition, found 1

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180119162939.17866-1-ultrabug@gentoo.org>
2018-01-21 17:15:29 +02:00
Alexys Jacob
604bc40d8a dist: migrate gentoo variant setup scripts from /sbin/service to /sbin/rc-service
the 'service' binary has been removed from gentoo as per news 2017-10-13:
https://gitweb.gentoo.org/data/gentoo-news.git/plain/2017-10-13-openrc-service-binary-removal/2017-10-13-openrc-service-binary-removal.en.txt

this patch updates the scylla setup related scripts where it was used and
makes use of the 'rc-service' binary instead

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180119161310.15435-1-ultrabug@gentoo.org>
2018-01-21 17:15:26 +02:00
Glauber Costa
0c00667206 streaming big: keep write_monitor alive until the end of flush
After the new compaction controller code, the monitor has to be kept
alive until the sstable is added to the SSTable set.

This is correctly handled for all the writers, except the streaming big.
That flusher is a bit confusing, as it builds an sstable list first and
only later adds the elements in the list to the sstable set. The
monitors are destroyed at the end of phase 1, so we will SIGSEGV later
when calling add_sstable().

The fix for this is to make sure the lifetime of the monitors is tied
to the lifetime of the sstables being handled by the big streaming
flush process.
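
A minimal sketch of tying the two lifetimes together, for illustration only.
All types and function names below are hypothetical stand-ins, not the ones
used in the patch: each monitor is stored next to its sstable, so it is
destroyed only after phase 2 has finished.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Counts live monitors so the sketch can observe their lifetime.
int live_monitors = 0;

// Hypothetical stand-ins for the real types.
struct write_monitor {
    write_monitor() { ++live_monitors; }
    ~write_monitor() { --live_monitors; }
    write_monitor(const write_monitor&) = delete;
    write_monitor& operator=(const write_monitor&) = delete;
};

struct sstable {
    int generation;
};

// Pairs each sstable with its monitor so both share one lifetime.
struct monitored_sstable {
    std::unique_ptr<write_monitor> monitor;
    sstable sst;
};

// Phase 1: write the sstables, keeping each monitor next to its sstable.
std::vector<monitored_sstable> write_sstables(int count) {
    std::vector<monitored_sstable> result;
    for (int gen = 0; gen < count; ++gen) {
        result.push_back(monitored_sstable{std::make_unique<write_monitor>(), sstable{gen}});
    }
    return result;
}

// Phase 2: add the sstables to the set. The monitors are still alive here
// and are destroyed only when `written` goes away. Returns the number of
// monitors that were alive while the sstables were being added.
int add_to_sstable_set(std::vector<monitored_sstable> written, std::vector<sstable>& set) {
    assert(live_monitors == static_cast<int>(written.size()));
    for (auto& m : written) {
        set.push_back(m.sst);
    }
    return live_monitors;
}
```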

Caught by dtests, update_cluster_layout_tests.py:TestUpdateClusterLayout.add_node_with_large_partition3_test

Fixes #3131
Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout.add_node_with_large_partition3_test now passes.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180118202230.17107-1-glauber@scylladb.com>
2018-01-21 14:09:43 +02:00
Amnon Heiman
1715ccf978 Register the API V2 swagger file
This adds a registration of the V2 swagger file.
V2 uses the Swagger 2.0 format. The initial definition is empty and can
be reached at:

http://localhost:10000/v2

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-01-21 14:00:27 +02:00
Amnon Heiman
4ccf76c62b Adding the header part of the swagger2.0 API
In Swagger 2.0 all the API is exported as a single file.
The header part of the file contains general information. It is stored
as an external file so it will be easy to modify when needed.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-01-21 14:00:27 +02:00
Avi Kivity
c743d1258d Merge "Reverse order of version merging in MVCC" from Tomasz
"Changes merging in MVCC to apply newer version to older instead of older to
newer.

Before (v0 = oldest):

  (((v3 + v2) + v1) + v0)

After:

  (v0 + (v1 + (v2 + v3)))

or:

  (((v0 + v1) + v2) + v3)

There are several reasons to do this:

  1) When continuity merging will change semantics to support eviction
     from older versions, it will be easier to implement apply() if we
     can assume that we merge newer to older instead of older to
     newer, since newer version may have entries falling into a
     continuous interval in older, but not the other way around. If we
     didn't revert the order, apply() would have to keep track of
     lower bound of a continuous interval in the right-hand side
     argument (older version) as it is applied and update continuity
     flags in the left hand side by scanning all entries overlapping
     with it. If order is reversed, merging only needs to deal with
     the current entry. Also, if we were to keep the old order, we
     cannot simply move entries from the left hand side as we merge
     because we need to keep track of the lower bound of a continuous
     interval, and we need to provide monotonic exception
     guarantees. So merging would be both more complicated and slower.

  2) With large partitions older versions are typically larger than
     newer versions, and since merging is O(N_right*(1 + log(N_left))),
     it's better to merge newer into older.
     This fixes latency spikes seen in perf_cache_eviction.

Fixes #2715."

* tag 'tgrabiec/reverse-order-of-mvcc-version-merging-v1' of github.com:scylladb/seastar-dev:
  mvcc: Reverse order of version merging
  anchorless_list: Introduce last()
  mvcc: Implement partition_entry::upgrade() using squashed()
  mvcc: Extract version merging functions
  mutation_partition: Add rows_entry::set_dummy()
  position_in_partition: Introduce after_key()
2018-01-21 13:56:57 +02:00
José Guilherme Vanz
380bc0aa0d Swap arguments order of mutation constructor
Swap the arguments of the mutation constructor to follow the same order
as the other constructor variants. Refs #3084

Signed-off-by: José Guilherme Vanz <guilherme.sft@gmail.com>
Message-Id: <20180120000154.3823-1-guilherme.sft@gmail.com>
2018-01-21 12:58:42 +02:00
Raphael S. Carvalho
20179c415b service/storage_proxy: dont copy schema to primary_key::less_compare_clustering ctor
schema is expensive to copy, and it's done in a possible hot path.
bumped into it when reading code.
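
A minimal sketch of the idea, for illustration only. The schema and
comparator below are simplified, hypothetical stand-ins (the real comparator
does not compare ints): the comparator keeps a reference, so constructing it
does not copy the schema. The schema must outlive the comparator.

```cpp
#include <cassert>

// Counts copies so the cost of passing by value is observable.
int schema_copies = 0;

// Hypothetical stand-in for the real, expensive-to-copy schema class.
struct schema {
    schema() = default;
    schema(const schema&) { ++schema_copies; }
};

// Comparator that keeps a reference to the schema instead of its own copy.
struct less_compare_clustering {
    const schema& _schema;
    explicit less_compare_clustering(const schema& s) : _schema(s) {}
    bool operator()(int a, int b) const { return a < b; }
};

// Usage: constructing the comparator leaves the copy counter unchanged.
void example_no_copy() {
    schema s;
    int before = schema_copies;
    less_compare_clustering cmp(s);
    assert(schema_copies == before);
    assert(cmp(1, 2));
}
```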

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180120211217.7273-1-raphaelsc@scylladb.com>
2018-01-20 23:16:15 +02:00
Duarte Nunes
a66c8d7973 row_cache: Don't require external_updater to be copyable
No good reason to copy it around, and even less reason to impose that
constraint on callers.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180118181142.15408-1-duarte@scylladb.com>
2018-01-19 13:00:49 +01:00
Tomasz Grabiec
16e06b5b46 Merge "remove ability to create a non-flat mutation reader" from Piotr
* seastar-dev.git haaawk/flat_reader_clean_up_mutation_source_v3:
  test_range_queries: create flat reader from source
  run_sstable_resharding_test: create flat reader from source
  make_sstable_containing: create flat reader from source
  test_cache_delegates_to_underlying_only_once_multiple_mutation: use
    flat reader
  Migrate materalized views to flat_mutation_reader
  test_can_write_and_read_non_compound_range_tombstone_as_compound: use
    flat reader
  test_writing_combined_stream_with_tombstones_at_the_same_position: use
    flat reader
  Add flat_mutation_reader::peek()
  Add flat_mutation_reader_assertions::produces_range_tombstone
  Accept clustering_row_ranges in
    flat_mutation_reader_assertions::produces
  Add flat_mutation_reader_assertions::produces_eos_or_empty_mutation
  Add flat_mutation_reader_assertions::fast_forward_to overload
  test_query_only_static_row: use flat reader
  Move mutation_rebuilder to header
  test_streamed_mutation_forwarding_is_consistent_with_slicing: use flat
    reader
  test_clustering_slices: use flat reader
  test_streamed_mutation_forwarding_guarantees: use flat reader
  test_streamed_mutation_forwarding_across_range_tombstones: use flat
    reader
  test_streamed_mutation_slicing_returns_only_relevant_tombstones: use
    flat reader
  Add flat_mutation_reader_assertions::is_buffer_full
  test_fast_forwarding_across_partitions_to_empty_range: use flat reader
  Remove unused mutation_source::operator()
  mutation_source: rename make_flat_mutation_reader to make_reader
  Clean up imports in tests
2018-01-19 12:43:50 +01:00
Piotr Jastrzebski
eeef0e0f07 Clean up imports in tests
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 09:30:57 +01:00
Piotr Jastrzebski
d266eaa01e mutation_source: rename make_flat_mutation_reader to make_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 09:30:12 +01:00
Piotr Jastrzebski
380d5c3402 Remove unused mutation_source::operator()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
bf06c78415 test_fast_forwarding_across_partitions_to_empty_range: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
872b1c9122 Add flat_mutation_reader_assertions::is_buffer_full
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
7ad640a64b test_streamed_mutation_slicing_returns_only_relevant_tombstones: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
6bdfe2a870 test_streamed_mutation_forwarding_across_range_tombstones: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
94480d3e05 test_streamed_mutation_forwarding_guarantees: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
873e3014fb test_clustering_slices: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
494fabc925 test_streamed_mutation_forwarding_is_consistent_with_slicing: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
667ce36981 Move mutation_rebuilder to header
It will be used in tests.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
c7ce24be06 test_query_only_static_row: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
5a5a5149e3 Add flat_mutation_reader_assertions::fast_forward_to overload
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
82bdc54588 Add flat_mutation_reader_assertions::produces_eos_or_empty_mutation
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
f0716d34df Accept clustering_row_ranges in flat_mutation_reader_assertions::produces
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
16e2bc8741 Add flat_mutation_reader_assertions::produces_range_tombstone
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
36771c5c2a Add flat_mutation_reader::peek()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:55:48 +01:00
Raphael S. Carvalho
f779877f43 tests/sstable_test: fix tests by not triggering compiler bug with c++17
$ gcc --version
gcc (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2)

The following code

struct S
{
    S(int i = 42);
};

void f()
{
    S( {} );
}

produces this assembly with g++ --std=c++14

  lea rax, [rbp-1]
  mov esi, 0
  mov rdi, rax
  call S::S(int)

and this one with g++ --std=c++17

  lea rax, [rbp-1]
  mov esi, 42
  mov rdi, rax
  call S::S(int)

For more details about the compiler bug, check:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83937

NOTE: clang isn't affected by it.

Test relied on braced initialization of compressor (an enum class)
working properly when used as argument to compression_parameters's
ctor. Braced-initialization of an integer-based type should yield zero,
but the default argument (lz4) was used instead, which means compression
was enabled when it shouldn't have been.

The course of action is to work around the bug by explicitly setting
the compressor type to none.
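
A minimal sketch of the workaround, for illustration only. The two types
below are simplified, hypothetical versions of the ones named above, and
make_uncompressed is a made-up helper: naming the enumerator explicitly
avoids relying on how the compiler treats a braced empty argument.

```cpp
#include <cassert>

// Hypothetical, simplified versions of the types named in the commit message.
enum class compressor { none, lz4 };

struct compression_parameters {
    compressor c;
    // lz4 is the default argument, as described above.
    compression_parameters(compressor c = compressor::lz4) : c(c) {}
};

// Workaround: name the enumerator instead of writing
// compression_parameters({}), which the affected gcc in C++17 mode
// handles as if no argument had been passed.
compression_parameters make_uncompressed() {
    return compression_parameters(compressor::none);
}

// Usage: the result does not depend on how the compiler handles `{}`.
void check_workaround() {
    assert(make_uncompressed().c == compressor::none);
    assert(compression_parameters().c == compressor::lz4);
}
```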

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180119013655.32564-1-raphaelsc@scylladb.com>
2018-01-19 09:27:39 +02:00
Tomasz Grabiec
60d3c25c02 mvcc: Reverse order of version merging
Change merging to apply newer version to older instead of older to
newer.

Before:

  (((v3 + v2) + v1) + v0)

After:

  (v0 + (v1 + (v2 + v3)))

or equivalent:

  (((v0 + v1) + v2) + v3)

There are several reasons to do this:

  1) When continuity merging will change semantics to support eviction
     from older versions, it will be easier to implement apply() if we
     can assume that we merge newer to older instead of older to
     newer, since newer version may have entries falling into a
     continuous interval in older, but not the other way around. If we
     didn't revert the order, apply() would have to keep track of
     lower bound of a continuous interval in the right-hand side
     argument (older version) as it is applied and update continuity
     flags in the left hand side by scanning all entries overlapping
     with it. If order is reversed, merging only needs to deal with
     the current entry. Also, if we were to keep the old order, we
     cannot simply move entries from the left hand side as we merge
     because we need to keep track of the lower bound of a continuous
     interval, and we need to provide monotonic exception
     guarantees. So merging would be both more complicated and slower.

  2) With large partitions older versions are typically larger than
     newer versions, and since merging is O(N_right*(1 + log(N_left))),
     it's better to merge newer into older.
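
A rough sketch of the cost argument in 2), for illustration only. The
function names are hypothetical and the base-2 logarithm is an assumption;
the base only scales the log term. With 1,000,000 entries in the older
version and 1,000 in the newer one, merging newer into older costs about
2.1e4 steps under this model, while the old order costs about 1.1e7.

```cpp
#include <cassert>
#include <cmath>

// Cost model quoted in the commit message: N_right * (1 + log(N_left)).
// Base 2 is an assumption made for this sketch.
double merge_cost(double n_left, double n_right) {
    assert(n_left >= 1.0 && n_right >= 0.0);
    return n_right * (1.0 + std::log2(n_left));
}

// New order: the small newer version (right) is applied to the large older one (left).
double cost_newer_into_older(double n_older, double n_newer) {
    return merge_cost(n_older, n_newer);
}

// Old order: the large older version (right) is applied to the small newer one (left).
double cost_older_into_newer(double n_older, double n_newer) {
    return merge_cost(n_newer, n_older);
}
```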

Fixes #2715.
2018-01-18 13:52:08 +01:00
Pekka Enberg
fab73dbdc3 cql3/restrictions: Fix multi_column_restriction::values()
Fix multi_column_restriction::values() similar to
single_column_primary_key_restrictions::values().
2018-01-18 14:38:06 +02:00
Tomasz Grabiec
1292315579 anchorless_list: Introduce last() 2018-01-18 11:32:49 +01:00
Tomasz Grabiec
5331b7b8e2 mvcc: Implement partition_entry::upgrade() using squashed()
To reduce duplication of version merging logic.
2018-01-18 11:32:49 +01:00
Tomasz Grabiec
88aff526df mvcc: Extract version merging functions 2018-01-18 11:32:49 +01:00
Tomasz Grabiec
da0c48a987 mutation_partition: Add rows_entry::set_dummy() 2018-01-18 11:32:49 +01:00
Tomasz Grabiec
bbd9ef6b59 position_in_partition: Introduce after_key() 2018-01-18 11:32:48 +01:00
Pekka Enberg
8b0b9b43b8 cql3/restrictions: Fix single_column_primary_key_restrictions::values()
This patch changes single_column_primary_key_restrictions::values() to
return values obtained via components() instead of the serialized form
that's returned by representation(). We need this to turn clustering key
restriction keys into partition keys for clustering key indexed queries.
2018-01-18 12:14:44 +02:00
Piotr Jastrzebski
0d382e89d7 test_writing_combined_stream_with_tombstones_at_the_same_position: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-18 07:32:59 +01:00
Piotr Jastrzebski
d6aede88d3 test_can_write_and_read_non_compound_range_tombstone_as_compound: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-18 07:32:59 +01:00
Piotr Jastrzebski
4c74b8c7e7 Migrate materialized views to flat_mutation_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-18 07:32:35 +01:00
Piotr Jastrzebski
b99dd17dcd test_cache_delegates_to_underlying_only_once_multiple_mutation: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-17 19:51:03 +01:00
Glauber Costa
378f2ba8e4 mutation_reader_test: adjust sleep time to timeout clock and duration
Raphael recently caught this test failing. I can't really reproduce it,
but it seems to me that it is a timing issue: we execute two different
statements, each one should timeout after 10ms. After 20ms, we make sure
that they both timed out.

They don't (in his system), which is explained by the fact that we are
no longer using high resolution clocks for the timeouts. Expirations for
lowres clocks will only happen at every 10ms, and in the worst case we
will miss two.

So the fix I am proposing here is to just account for potential
inaccuracies in the clocks and calculations by waiting a bit longer.

Ideally, we would use the manual clock for this. But in this case, this
would mean adding template parameters to pretty much all of the
mutation_reader path.

Currently, not only does the test fail, it also hits a use-after-free
SIGSEGV. That happens because we give up on the reader while the
timeout is still pending.

It is the caller's responsibility to ensure the lifetime of the reader is
correct. Dealing with that cleanly would require a cancelation mechanism
that we don't have, so we'll just add an assertion that will fail more
gracefully than the SIGSEGV.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-17 17:17:40 +01:00
Glauber Costa
01274774c3 mutation_reader_test: propagate timeouts to fast_forward_to
We are not propagating timeouts to fast_forward_to in the
mutation_reader_test. This is not currently causing any issue, but I
noticed it while chasing one - so let's fix it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-17 17:17:40 +01:00
Tomasz Grabiec
ab6ec571cb test.py: set BOOST_TEST_CATCH_SYSTEM_ERRORS=no
This makes the Boost unit test framework (UTF) abort execution on SIGABRT
rather than trying to continue running other test cases. Continuing doesn't
work well with the seastar integration: the suite hangs.
Message-Id: <1516205469-16378-1-git-send-email-tgrabiec@scylladb.com>
2018-01-17 16:15:27 +00:00
Vladimir Krivopalov
73b6e9fbb1 main: Fix warnings when running "scylla --version"
Print Scylla version, if requested, before running Seastar application.

Fixes #3124

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <bbd0f303f612327446ce1f10ebd17ebed8d76048.1516144651.git.vladimir@scylladb.com>
2018-01-17 16:56:10 +02:00
Takuya ASADA
f3c8574135 dist/debian: follow gcc-7.2 package naming changes on 3rdparty repo for Debian 9
Switch to renamed gcc-7.2 package on Debian 9, too.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1516191853-2562-1-git-send-email-syuu@scylladb.com>
2018-01-17 14:38:41 +02:00
Takuya ASADA
15e266eea4 dist/debian: fix package name typo on Debian 8
Correct package name is scylla-gcc72-g++-7, not scylla-g++-7.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1516189354-5880-1-git-send-email-syuu@scylladb.com>
2018-01-17 13:45:24 +02:00
Duarte Nunes
dc74ba21ab tests/sstable_utils: Inline make_local_key()
Or the compiler complains about it not being used in some units where
the header is included.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180116235557.96046-1-duarte@scylladb.com>
2018-01-17 12:17:17 +01:00
Avi Kivity
6d7d02315e dist/redhat: support nowait aio even on old distributions
Since we sometimes recommend that the user update to a newer kernel,
it's good to compile support for features that the new kernel supports.
Rather than play games with build-time dependencies, just #define
those features in. It's ugly, but better than depending on third-party
repositories and handling package conflicts.
Message-Id: <20180115143129.22190-1-avi@scylladb.com>
2018-01-17 12:13:44 +01:00
Paweł Dziepak
5efa713344 Merge "revive the round-robin load balancing #2" from Vlad
"The previous series handled a passing of the copy of the client_state from process_request(...)
to the process_request_one(...). However the modified copy of the client_state is returned by the
process_request_one(...) back to the process_request(...) and handling of this direction was missing
in the previous series.

This series completes the #2351 fix."

* 'fix-round-robin-cont-v2' of https://github.com/vladzcloudius/scylla:
  transport::cql_server::process_request_one: return only the required information instead of the whole client_state object
  service::client_state: move auth_state from cql_server::connection to service::client_state
  transport::cql_server: don't cache sasl_challenge object in the cql_server::connection
  service::client_state::merge(): remove not needed timestamp merge
2018-01-16 16:56:05 +00:00
Avi Kivity
4ad212dc01 Merge "Fix memory leak on zone reclaim" from Tomek
"_free_segments_in_zones is not adjusted by
segment_pool::reclaim_segments() for empty zones on reclaim under some
conditions. For instance when some zone becomes empty due to regular
free() and then reclaiming is called from the std allocator, and it is
satisfied from a zone after the one which is empty. This would cause
free memory in such a zone to appear leaked due to a corrupted
free segment count, which may cause a later reclaim to fail. This
could result in bad_allocs.

The fix is to always collect such zones.

Fixes #3129
Refs #3119
Refs #3120"

* 'tgrabiec/fix-free_segments_in_zones-leak' of github.com:scylladb/seastar-dev:
  tests: lsa: Test _free_segments_in_zones is kept correct on reclaim
  lsa: Expose max_zone_segments for tests
  lsa: Expose tracker::non_lsa_used_space()
  lsa: Fix memory leak on zone reclaim
2018-01-16 15:54:03 +02:00
Duarte Nunes
176fefdebc tests/sstable_utils: Don't assume seastar test context
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180116131722.86230-1-duarte@scylladb.com>
2018-01-16 15:42:33 +02:00
Tomasz Grabiec
f20958ae3d tests: lsa: Test _free_segments_in_zones is kept correct on reclaim
Reproducer for https://github.com/scylladb/scylla/issues/3129
2018-01-16 13:17:20 +01:00
Tomasz Grabiec
5c85e9c2db lsa: Expose max_zone_segments for tests 2018-01-16 13:17:20 +01:00
Tomasz Grabiec
99708cc498 lsa: Expose tracker::non_lsa_used_space()
So that it can be used in unit tests.
2018-01-16 13:17:20 +01:00
Tomasz Grabiec
e5f8176c32 lsa: Fix memory leak on zone reclaim
_free_segments_in_zones is not adjusted by
segment_pool::reclaim_segments() for empty zones on reclaim under some
conditions. For instance when some zone becomes empty due to regular
free() and then reclaiming is called from the std allocator, and it is
satisfied from a zone after the one which is empty. This would cause
free memory in such a zone to appear leaked due to a corrupted
free segment count, which may cause a later reclaim to fail. This
could result in bad_allocs.

The fix is to always collect such zones.

Fixes #3129
Refs #3119
Refs #3120
2018-01-16 13:17:11 +01:00
Takuya ASADA
912a14eb9b dist/debian: follow renaming of gcc-7.2 packages on Ubuntu 14.04/16.04
Now we applied our scylla-$(pkg)$(ver) style package naming on gcc-7.2,
so switch to it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1516103292-26942-1-git-send-email-syuu@scylladb.com>
2018-01-16 13:52:05 +02:00
Tomasz Grabiec
b5d5bf5bc4 database: Invalidate only affected ranges from flush_streaming_mutations()
Invalidating whole range causes larger latency spikes.

Regression from 2.0 introduced in d22fdf4261.

Refs #3119

Tests: units (release)

Message-Id: <1516046938-26855-1-git-send-email-tgrabiec@scylladb.com>
2018-01-16 11:17:57 +02:00
Asias He
5107b6ad16 storage_service: Do not wait for restore_replica_count in handle_state_removing
The call chain is:

storage_service::on_change() -> storage_service::handle_state_removing()
-> storage_service::restore_replica_count() -> streamer->stream_async()

Listeners run as part of gossip message processing, which is serialized.
This means we won't be processing any gossip messages until streaming
completes.

In fact, there is no need to wait for restore_replica_count to complete
which can take a long time, since when it completes, this node will send
notification to tell the removal_coordinator that the restore process is
finished on this node. This node will be removed from _replicating_nodes
on the removal_coordinator.

Tested with update_cluster_layout_tests.py

Fixes #2886

Message-Id: <8b4fe637dfea6c56167ddde3ca86fefb8438ce96.1516088237.git.asias@scylladb.com>
2018-01-16 11:01:31 +02:00
Avi Kivity
0cd656ec68 Revert "Advertise compatibility with CQL Version 3.3.2, since CAST functions are supported."
This reverts commit ef3324129a. It breaks cqlsh, and
further was sneaked into mainline in an unrelated patchset rather than merged
on its own.
2018-01-16 10:58:08 +02:00
Piotr Jastrzebski
767e105b24 make_sstable_containing: create flat reader from source
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-16 09:33:05 +01:00
Piotr Jastrzebski
a64aa3fae3 run_sstable_resharding_test: create flat reader from source
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-16 09:32:35 +01:00
Piotr Jastrzebski
2e9f03099c test_range_queries: create flat reader from source
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-16 09:32:35 +01:00
Asias He
3c8ed255ac storage_service: Set NORMAL status after token_metadata is replicated
Commit 2d5fb9d109 (gms/gossiper: Replicate changes incrementally to
other shards) changes the way we replicate _token_metadata and
endpoint_state_map. Before that commit they were replicated at the same
time; after it they no longer are. As a result, a shard in NORMAL status
can still have an empty _token_metadata.

We saw errors:

   [shard 12] token_metadata - sorted_tokens is empty in first_token_index!

during CorruptThenRepairNemesis.

Fix by setting the gossip status to NORMAL after replication of
_token_metadata, so that once a node is in NORMAL, we can do repair. The
commit 69c81bcc87 (repair: Do not allow repair until node is in NORMAL
status) prevents the early repair operation by checking if a node is in
NORMAL status.

Fixes #3121

Message-Id: <af6a223733d2e11351f1fa35f59eacfa7d65dd30.1516065564.git.asias@scylladb.com>
2018-01-16 09:41:22 +02:00
Raphael S. Carvalho
2b0b703615 tests: sstable_mutation_test: fix sstable write in tests due to use of non-local keys
That's required from fa5a26f12d on, because sstable write fails when sharding
metadata is empty due to lack of keys that belong to current shard.

make_local_key* were moved to header to avoid compiling sstable_utils.cc into
all those tests that rely on simple_schema.hh, which is a lot.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180116052052.7819-1-raphaelsc@scylladb.com>
2018-01-16 09:28:12 +02:00
Vlad Zolotarov
d06b577b86 transport::cql_server::process_request_one: return only the required information instead of the whole client_state object
client_state used in the process_request_one(...) contains all sorts of information irrelevant
to the caller (process_request(...)), e.g. Tracing state. Therefore instead of returning
the whole client_state object (which becomes even a bigger problem if process_one(...) and process_request_one(...)
are executed on different shards) we will return only the pieces of information we really need.

To do that we introduce a new class - processing_result, which is cross-shard-access-ready to begin with.
We are going to return an instance of this new class from the process_request_one(...).

Fixes #2351

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-01-15 13:09:57 -05:00
Vlad Zolotarov
6cba14c272 service::client_state: move auth_state from cql_server::connection to service::client_state
Move the requests-handling-related state into the client_state. This is needed to properly
define the interface between the process_request(...) and process_request_one(...).

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-01-15 13:09:56 -05:00
Vlad Zolotarov
c2509d290a transport::cql_server: don't cache sasl_challenge object in the cql_server::connection
The benefit of such caching is rather limited because it's likely to be used exactly once
and then destroyed anyway (in case of a successful authentication).
If the authentication has failed no harm is going to be done if we create this object again when
needed.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-01-15 13:09:49 -05:00
Vlad Zolotarov
88932cbcf0 service::client_state::merge(): remove not needed timestamp merge
Since the connection::_client_state is the only generator of new timestamps
now there is no need for this merge.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-01-15 12:54:20 -05:00
Avi Kivity
93076d25b6 Merge "mutation_source: remove support for creation with mutation_reader" from Piotr
"After this patchset it's only possible to create a mutation_source with a function that produces flat_mutation_reader."

* 'haaawk/mutation_source_v1' of ssh://github.com/scylladb/seastar-dev:
  Merge flat_mutation_reader_mutation_source into mutation_source
  Remove unused mutation_reader_mutation_source
  Remove unused mutation_source constructor.
  Migrate make_source to flat reader
  Migrate run_conversion_to_mutation_reader_tests to flat reader
  flat_mutation_reader_from_mutations: add support for slicing
  Remove unused mutation_source constructor.
  Migrate partition_counting_reader to flat reader
  Migrate throttled_mutation_source to flat reader
  Extract delegating_reader from make_delegating_reader
  row_cache_test: call row_cache::make_flat_reader in mutation_sources
  Remove unused friend declaration in flat_mutation_reader::impl
  Migrate make_source_with to flat reader
  Migrate make_empty_mutation_source to flat reader
  Remove unused mutation_source constructor
  Migrate test_multi_range_reader to flat reader
  Remove unused mutation_source constructors
2018-01-15 18:15:53 +02:00
Paweł Dziepak
f6434c9941 tests/perf: add microbenchmarks for the combined reader
Message-Id: <20180111120153.3911-1-pdziepak@scylladb.com>
2018-01-15 17:49:47 +02:00
Avi Kivity
3e0e4a9b56 Merge seastar upstream
* seastar a7a3e6f...d03896d (11):
  > Update dpdk submodule
  > Merge "C++17 aligned allocations" from Avi
  > Prometheus should check that the iterator is valid before using it
  > future-util: failure to allocate internal state is unrecoverable
  > Merge "Introduce simple microbenchmarking framework" from Paweł
  > tutorial: document debuging ignored exceptions
  > Revert "Merge "Introduce simple microbenchmarking framework" from Paweł"
  > Merge "Introduce simple microbenchmarking framework" from Paweł
  > tests/futures: add more tests for parallel_for_each()
  > Add a prometheus.md file
  > prometheus: Support metric family name parameter
2018-01-15 16:16:08 +02:00
Duarte Nunes
83e983d4d0 mutation_partition: Remove unused operator==()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180115013546.67260-1-duarte@scylladb.com>
2018-01-15 11:16:35 +02:00
Duarte Nunes
9d1d9883ff mutation_partition: Remove unused for_each_cell() overload
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180115013618.67351-1-duarte@scylladb.com>
2018-01-15 11:16:34 +02:00
Duarte Nunes
b607662d2e collection_type_impl: Make for_each_cell static
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180115013532.67200-1-duarte@scylladb.com>
2018-01-15 11:16:33 +02:00
Avi Kivity
fe788e0a5d mutation_reader: adjust FragmentProducer concept for timeout
forward_to() now accepts a timeout parameter, and the concept should
reflect it, or it breaks the build when concepts are enabled.
2018-01-14 18:09:37 +02:00
Avi Kivity
90dc409c83 Merge "Support for MIN/MAX aggregation functions over date-types" from Dan
"Added support for min/max functions over date/timestamp/timeuuid.

There was one issue with Scylla's type system internals: no C++ type
was mapped to these types. So special "native_types" were added for them.
It required some changes to native functions because these types don't support
the same operations as their real native counterparts.

Fixes #3104."

* 'danfiala/3104-v1' of https://github.com/hagrid-the-developer/scylla:
  tests: Tests for min/max aggregate functions over date/timestamp and timeuuid.
  functions: Added min/max functions for date/timestamp/timeuuid.
  types: Added native types for timestamp and timeuuid.
  Advertise compatibility with CQL Version 3.3.2, since CAST functions are supported.
2018-01-14 17:26:27 +02:00
Daniel Fiala
1d0d419693 tests: Tests for min/max aggregate functions over date/timestamp and timeuuid.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-01-14 13:17:09 +01:00
Daniel Fiala
5bad03b5a6 functions: Added min/max functions for date/timestamp/timeuuid.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-01-14 13:13:36 +01:00
Daniel Fiala
0d71194da6 types: Added native types for timestamp and timeuuid.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-01-14 13:11:36 +01:00
Mika Eloranta' via ScyllaDB development
bc1248e62a build: rpm build script --xtrace option
Enables bash "set -o xtrace" printing of full executed command lines for
debugging purposes.

Signed-off-by: Mika Eloranta <mel@aiven.io>
Message-Id: <20180113212944.86008-1-mel@aiven.io>
2018-01-14 12:32:32 +02:00
Mika Eloranta' via ScyllaDB development
7266446227 build: fix rpm build script --jobs N handling
Fixes argument misquoting at $SRPM_OPTS expansion for the mock commands
and makes the --jobs argument work as intended.

Signed-off-by: Mika Eloranta <mel@aiven.io>
Message-Id: <20180113212904.85907-1-mel@aiven.io>
2018-01-14 12:30:19 +02:00
Raphael S. Carvalho
fd2b4a7eb3 mutation_reader_test: remove schema left over from dummy selector
it now lives in base class, and this one is useless.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180114032943.28228-1-raphaelsc@scylladb.com>
2018-01-14 10:59:48 +02:00
Raphael S. Carvalho
16f8150916 tests: mutation_reader_test: Fix test_combined_reader_slicing_with_overlapping_range_tombstones
Test fails after fa5a26f12d because generated sstable doesn't contain data for the
shard it was created at, so sharding metadata is empty, resulting in exception
added in the aforementioned commit. That's fixed by using the new make_local_key()
to generate data that belongs to current shard.

make_local_keys(), on top of which make_local_key() is built, will be useful
to make sstable tests work again with any smp count.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180114032025.26739-1-raphaelsc@scylladb.com>
2018-01-14 10:59:29 +02:00
Tomasz Grabiec
9c391970b8 Merge 'per-request timeouts' from Glauber
Timeouts are a global property. However, for tables in keyspaces like
the system keyspace, we don't want to uphold that timeout--in fact, we
want no timeout there at all.

We already apply such configuration for requests waiting in the queued
sstable queue: system keyspace requests won't be removed. However, the
storage proxy will insert its own timeouts in those requests, causing
them to fail.

This patch changes the storage proxy read layer so that the timeout is
applied based on the column family configuration, which is in turn
inherited from the keyspace configuration. This matches our usual
way of passing db parameters down.

In terms of implementation, we can either move the timeout inside the
abstract read executor or keep it external. The former is a bit cleaner,
the latter has the nice property that all executors generated will
share the exact same timeout point. In this patch, we chose the latter.

We are also careful to propagate the timeout information to the replica.
So even if we are talking about the local replica, when we add the
request to the concurrency queue, we will do it in accordance with the
timeout specified by the storage proxy layer.

After this patch, Scylla is able to start just fine with very low
timeouts--since read timeouts in the system keyspace are now ignored.

Fixes #2462

* git@github.com:glommer/scylla.git timeouts-v8.1:
  database: delete unused function
  consolidate timeout_clock
  mutation_query: add a timeout to the mutation query path
  flat_mutation_reader: pass timeout down to consume()
  add a timeout to fill_buffer
  add a timeout to fast forward to
  restricted_mutation_reader: don't pass timeouts through the config
    structure
  allow request-specific read timeouts in storage proxy reads
2018-01-12 17:06:27 +01:00
Glauber Costa
08a0c3714c allow request-specific read timeouts in storage proxy reads
Timeouts are a global property. However, for tables in keyspaces like
the system keyspace, we don't want to uphold that timeout--in fact, we
want no timeout there at all.

We already apply such configuration for requests waiting in the queued
sstable queue: system keyspace requests won't be removed. However, the
storage proxy will insert its own timeouts in those requests, causing
them to fail.

This patch changes the storage proxy read layer so that the timeout is
applied based on the column family configuration, which is in turn
inherited from the keyspace configuration. This matches our usual
way of passing db parameters down.

In terms of implementation, we can either move the timeout inside the
abstract read executor or keep it external. The former is a bit cleaner,
the latter has the nice property that all executors generated will
share the exact same timeout point. In this patch, we chose the latter.

We are also careful to propagate the timeout information to the replica.
So even if we are talking about the local replica, when we add the
request to the concurrency queue, we will do it in accordance with the
timeout specified by the storage proxy layer.

After this patch, Scylla is able to start just fine with very low
timeouts--since read timeouts in the system keyspace are now ignored.

Fixes #2462

Implementation notes, and general comments about open discussion in 2462:

* Because we are not bypassing the timeout, just setting it high enough,
  I consider the concerns about the batchlog moot: if we fail for any
  other reason that will be propagated. Last case, because the timeout
  is per-CF, we could do what we do for the dirty memory manager and
  move the batchlog alone to use a different timeout setting.

* Storage proxy likes specifying its timeouts as a time_point, whereas
  when we get low enough as to deal with the read_concurrency_config,
  we are talking about deltas. So at some point we need to convert time_points
  to durations. We do that in the database query functions.
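
The time_point-to-duration conversion mentioned in the last note can be sketched as follows. This is a minimal illustration, not the code in the database query functions: `std::chrono::steady_clock` stands in for the real timeout clock, and `remaining_time()` is a hypothetical helper name. It assumes, as the series does, that the maximum time_point means "no timeout".

```cpp
#include <cassert>
#include <chrono>

// Assumption: steady_clock stands in for the real timeout clock.
using timeout_clock = std::chrono::steady_clock;

// The maximum time_point is taken to mean "no timeout".
const timeout_clock::time_point no_timeout = timeout_clock::time_point::max();

// Hypothetical helper: turn an absolute deadline (what the storage proxy
// passes down) into the delta that a lower layer expects.
timeout_clock::duration remaining_time(timeout_clock::time_point deadline,
                                       timeout_clock::time_point now) {
    if (deadline == no_timeout) {
        return timeout_clock::duration::max();   // still "no timeout"
    }
    if (deadline <= now) {
        return timeout_clock::duration::zero();  // already expired
    }
    return deadline - now;
}
```

Handling the sentinel before subtracting keeps "no timeout" from turning into a large but finite delay.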

v2:
- use per-request instead of per-table timeouts.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-12 07:43:21 -05:00
Glauber Costa
3c9eeea4cf restricted_mutation_reader: don't pass timeouts through the config structure
This patch enables passing a timeout to the restricted_mutation_reader
through the read path interface -- using fill_buffer and friends. This
will serve as a basis for having per-request timeouts.

The config structure still has a timeout, but that is so far only used
to actually pass the value to the query interface. Once that starts
coming from the storage proxy layer (next patch) we will remove it.

The query callers are patched so that we pass the timeout down. We patch
the callers in database.cc, but leave the streaming ones alone. That can
be safely done because the default for the query path is now no_timeout,
and that is what the streaming code wants. So there is no need to
complicate the interface to allow for passing a timeout that we intend
to disable.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-12 07:43:21 -05:00
Glauber Costa
5140aaea00 add a timeout to fast forward to
In the last patch, as part of the work to enable per-request timeouts,
we enabled timeouts in fill_buffer. There are many places, though, in which we
fast_forward_to before we fill_buffer, so in order to make that
effective we need to propagate the timeouts to fast_forward_to as well.

In the same way as fill_buffer, we make the argument optional wherever
possible in the high level callers, making them mandatory in the
implementations.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-12 07:43:19 -05:00
Glauber Costa
d965af42b0 add a timeout to fill_buffer
As part of the work to enable per-request timeouts, we enable timeouts
in fill_buffer.

The argument is made optional at the main classes, but mandatory in all
the ::impl versions. This way we'll make sure we didn't forget anything.
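
The "optional in the main class, mandatory in ::impl" pattern can be sketched like this. It is a simplified illustration: the real fill_buffer() is asynchronous, and `reader`, `recording_impl` and the use of `std::chrono::steady_clock` are assumptions made for the example.

```cpp
#include <cassert>
#include <chrono>
#include <memory>
#include <utility>

// Assumption: steady_clock stands in for the real timeout clock.
using timeout_clock = std::chrono::steady_clock;
const timeout_clock::time_point no_timeout = timeout_clock::time_point::max();

class reader {
public:
    class impl {
    public:
        virtual ~impl() = default;
        // No default argument here: every implementation has to accept
        // the timeout explicitly, so none can silently drop it.
        virtual void fill_buffer(timeout_clock::time_point timeout) = 0;
    };

    explicit reader(std::unique_ptr<impl> i) : _impl(std::move(i)) {
        assert(_impl);
    }

    // Optional at the top level: callers that don't care get no_timeout.
    void fill_buffer(timeout_clock::time_point timeout = no_timeout) {
        _impl->fill_buffer(timeout);
    }

private:
    std::unique_ptr<impl> _impl;
};

// Example implementation that only records the timeout it was given.
class recording_impl : public reader::impl {
    timeout_clock::time_point& _last;
public:
    explicit recording_impl(timeout_clock::time_point& last) : _last(last) {}
    void fill_buffer(timeout_clock::time_point timeout) override {
        _last = timeout;
    }
};
```

Because the pure virtual has no default, an implementation that forgets the parameter fails to compile instead of quietly ignoring the timeout.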

At this point we're still mostly passing that information around and
don't have any entity that will act on those timeouts. In the next patch
we will wire that up.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-11 12:07:41 -05:00
Glauber Costa
54d3ebde4e flat_mutation_reader: pass timeout down to consume()
We pass the timeout that we received from data_query/mutation_query
down to consume, which is responsible for actually reading the data.

To make those timeouts actionable, though, we'll have to patch
fill_buffer(). This will happen in the next patch.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-11 12:07:41 -05:00
Glauber Costa
8433702c90 mutation_query: add a timeout to the mutation query path
data_query and mutation_query are patched so that they start accepting a
per-query timeout. We will default to no timeout, and then no callers
will be changed yet.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-11 12:07:41 -05:00
Glauber Costa
80c4a211d8 consolidate timeout_clock
At the moment, various different subsystems use their different
ideas of what a timeout_clock is. This makes it a bit harder to pass
timeouts between them because although most are actually a lowres_clock,
that is not guaranteed to be the case. As a matter of fact, the timeout
for restricted reads is expressed as nanoseconds, which is not a valid
duration in the lowres_clock.
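
One way to read the nanoseconds remark, sketched with plain std::chrono. The assumption here is a coarse clock whose duration counts whole milliseconds; `lowres_duration` and `to_lowres()` are hypothetical names, not Seastar's.

```cpp
#include <cassert>
#include <chrono>
#include <type_traits>

// Assumption: a coarse clock whose duration counts whole milliseconds,
// used here as a stand-in for a low-resolution clock.
using lowres_duration = std::chrono::milliseconds;

// An integer nanosecond count does not convert implicitly to a coarser
// integer duration, because the conversion could truncate.
static_assert(!std::is_convertible<std::chrono::nanoseconds, lowres_duration>::value,
              "nanoseconds must not convert implicitly to a coarser duration");

// The opposite direction is lossless, so it is implicit.
static_assert(std::is_convertible<lowres_duration, std::chrono::nanoseconds>::value,
              "milliseconds convert implicitly to nanoseconds");

// Hypothetical helper: make the truncation explicit at the boundary.
lowres_duration to_lowres(std::chrono::nanoseconds d) {
    return std::chrono::duration_cast<lowres_duration>(d);
}
```

Expressing every timeout in one clock's own duration type avoids scattering casts like this across subsystems.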

As a first step towards fixing this, we'll consolidate all of the
existing timeout_clocks in one, now called db::timeout_clock. Other
things that tend to be expressed in terms of that clock--like the fact
that the maximum time_point means no timeout and a semaphore that
wait()s with that resolution are also moved to the common header.

In the upcoming patch we will fix the restricted reader timeouts to
be expressed in terms of the new timeout_clock.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-11 12:07:41 -05:00
Glauber Costa
40c428dc19 database: delete unused function
no in-tree users.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-11 12:07:41 -05:00
Paweł Dziepak
bd6fa8b331 configure.py: add dependency seastar/configure.py
Scylla's configure.py calls seastar/configure.py and uses seastar.pc
that it produces to generate Scylla's build.ninja. However, there is no
appropriate dependency in build.ninja and changes to
seastar/configure.py alone do not trigger regeneration of Scylla's
build.ninja. This patch remedies that problem.

Message-Id: <20180111144237.5259-1-pdziepak@scylladb.com>
2018-01-11 16:48:06 +02:00
Takuya ASADA
b68ee98310 dist/debian: make pbuilder work on Debian 9
On Debian 9, 'pbuilder create' fails because of lack of GPG key for
3rdparty repo, so we need --allow-untrusted on 'pbuilder create' and
'pbuilder update'.

Also, apt-key adv --fetch-keys does not work correctly on it, but we can use
"curl <URL> | apt-key add -" as workaround.

Fixes #3088

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1513797714-18067-1-git-send-email-syuu@scylladb.com>
2018-01-11 15:02:05 +02:00
Takuya ASADA
420b61b466 dist/debian: follow renaming of gcc-7.2 packages on Debian 8
Now we applied our scylla-$(pkg)$(ver) style package naming on gcc-7.2,
so switch to it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1515522920-8266-1-git-send-email-syuu@scylladb.com>
2018-01-11 15:02:04 +02:00
Duarte Nunes
cbbdfde979 sstables/compaction_backlog_tracker: Constify backlog()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180111004914.25796-1-duarte@scylladb.com>
2018-01-11 13:20:57 +02:00
Duarte Nunes
43ad5bd182 sstables/compaction_backlog_manager: Fix use-after-free
If the compaction_backlog_manager's lifetime ends before the linked
compaction_backlog_tracker's, the latter's _manager pointer is not
cleared, which can lead to a use-after-free error when running
~compaction_backlog_tracker(), as evidenced by failing unit tests.
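
The lifetime problem can be sketched with two toy classes. This is an illustration of the general pattern only, not the actual patch: `manager` and `tracker` are simplified stand-ins, and clearing the back-pointers in the manager's destructor is one assumed way to avoid the dangling pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

class manager;

// Simplified stand-in for a tracker that keeps a back-pointer to its manager.
class tracker {
    manager* _manager = nullptr;
    friend class manager;
public:
    tracker() = default;
    tracker(const tracker&) = delete;
    tracker& operator=(const tracker&) = delete;
    ~tracker();
    bool registered() const { return _manager != nullptr; }
};

// Simplified stand-in for the manager that owns the registration list.
class manager {
    std::unordered_set<tracker*> _trackers;
public:
    manager() = default;
    manager(const manager&) = delete;
    manager& operator=(const manager&) = delete;

    // Clear every back-pointer so that a tracker outliving the manager
    // does not dereference a dangling pointer in its destructor.
    ~manager() {
        for (tracker* t : _trackers) {
            t->_manager = nullptr;
        }
    }

    void register_tracker(tracker& t) {
        assert(t._manager == nullptr);
        t._manager = this;
        _trackers.insert(&t);
    }

    void remove_tracker(tracker& t) {
        _trackers.erase(&t);
        t._manager = nullptr;
    }

    std::size_t tracker_count() const { return _trackers.size(); }
};

// Without the loop in ~manager(), this would touch freed memory whenever
// the manager is destroyed first.
inline tracker::~tracker() {
    if (_manager) {
        _manager->remove_tracker(*this);
    }
}
```

Either destruction order is then safe: the side that dies first detaches itself from the other.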

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180111004914.25796-2-duarte@scylladb.com>
2018-01-11 13:20:55 +02:00
Amnon Heiman
372b02676a register the cache API before gossip settle
cache service API does not need to wait for the gossip to settle.

Fixes: #2075

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180103094757.13270-1-amnon@scylladb.com>
2018-01-11 10:27:52 +01:00
Paweł Dziepak
b4a4c04bab combined_reader: optimise for disjoint partition streams
The legacy mutation_reader/streamed_mutation design made it very easy
to skip the partition merging logic if only one underlying reader had
emitted a given partition.

That optimisation was lost after conversion to flat mutation readers
which has impacted the performance. This patch mostly recovers it by
bypassing most of mutation_reader_merger logic if there is only a single
active reader for a given partition.
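
The idea behind the bypass can be shown on a toy model. The sketch below is not the mutation_reader_merger code: `partition`, `stream`, `combine()` and `combine_stats` are hypothetical, everything is held in memory, and partitions are grouped up front rather than streamed. It only shows that a partition produced by a single reader can be forwarded untouched.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model: a partition is a key plus its rows, and a stream is a list
// of partitions sorted by key.
using partition = std::pair<int, std::vector<std::string>>;
using stream = std::vector<partition>;

struct combine_stats {
    int forwarded = 0;  // partitions seen by exactly one reader
    int merged = 0;     // partitions that needed real merging
};

std::vector<partition> combine(const std::vector<stream>& readers,
                               combine_stats& stats) {
    // Group the partitions of all readers by key.
    std::map<int, std::vector<const partition*>> by_key;
    for (const stream& r : readers) {
        for (const partition& p : r) {
            by_key[p.first].push_back(&p);
        }
    }

    std::vector<partition> out;
    for (const auto& kv : by_key) {
        const std::vector<const partition*>& parts = kv.second;
        if (parts.size() == 1) {
            // Fast path: only one reader has this partition, so its rows
            // are forwarded as they are and the merge step is skipped.
            ++stats.forwarded;
            out.push_back(*parts.front());
            continue;
        }
        // Slow path: several readers contribute rows to the partition.
        ++stats.merged;
        partition merged(kv.first, std::vector<std::string>());
        for (const partition* p : parts) {
            merged.second.insert(merged.second.end(),
                                 p->second.begin(), p->second.end());
        }
        std::sort(merged.second.begin(), merged.second.end());
        out.push_back(std::move(merged));
    }
    return out;
}
```

When the partition streams are mostly disjoint, nearly every partition takes the fast path, which is the case the commit optimises for.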

The performance regression was introduced in
8731c1bc66 "Flatten the implementation of
combined_mutation_reader".

perf_simple_query -c4 read results (medians of 60):

original regression
             before 8731c1     after 8731c1   diff
 read            326241.02        300244.09  -8.0%

this patch
                    before            after  diff
 read            313882.59        325148.05  3.6%
Message-Id: <20180103121019.764-1-pdziepak@scylladb.com>
2018-01-11 10:21:17 +01:00
Duarte Nunes
891c22904b partition_snapshot_reader: Don't push empty static rows
This patch fixes a regression introduced in 259f6759b4, which pushed
static row fragments regardless of them being empty.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180110222936.23085-1-duarte@scylladb.com>
2018-01-11 10:05:51 +01:00
Pekka Enberg
92b2e56211 Merge "Revive round-robin coordinator load balancing" from Vlad
"This series revives the round-robin load balancing added by Pekka back in 2015.

 If somebody tries to enable it with the current master it would quite quickly
 lead to a crash due to a few unresolved issues in the corresponding code.

 Fixes #2351
 Fixes #3118"

* 'fix-round-robin-balancing-v2' of github.com:vladzcloudius/scylla:
  transport::server::process_request(): avoid extra copy of the client_state
  service::cql_server::connection::process_request: use client_state "request copy" constructor
  service::client_state: introduce "request copy" copy-constructor
  service::storage_service: add the get_local_auth_service() accessor
  service::client_state: remove the unused _tracing_session_id field
2018-01-11 09:02:13 +02:00
Daniel Fiala
ef3324129a Advertise compatibility with CQL Version 3.3.2, since CAST functions are supported.
Fixes #3103.

Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-01-10 15:01:22 +01:00
Avi Kivity
56801d1b8c Update scylla-ami submodule
* dist/ami/files/scylla-ami 3366c93...3aa87a7 (1):
  > Move to kernel-ml kernel stream
2018-01-10 11:58:27 +02:00
Vlad Zolotarov
26a9aa5157 transport::server::process_request(): avoid extra copy of the client_state
Don't use submit_to(...) when we are going to handle the request on the local
shard. Otherwise, the submit_to(...) lambda capture list makes an unneeded
copy of the _client_state.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-01-09 14:00:04 -05:00
Vlad Zolotarov
0b88c52639 service::cql_server::connection::process_request: use client_state "request copy" constructor
Create a cross-shard copy of the client_state object, pass it to the function
that handles the single request, and give it a timestamp generated by the
original client_state instance (which is guaranteed to be monotonic).

Fixes #3118

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-01-09 14:00:04 -05:00
Vlad Zolotarov
430d172040 service::client_state: introduce "request copy" copy-constructor
A new constructor creates a copy of the current client_state to be
used in the context of handling a single request.

The copy may take place at a shard different from the one where the
request has been received.

In order to ensure the monotonicity of the timestamps used by requests handled
on the same connection, the created copy of the client_state uses the timestamp
provided by the caller instead of generating its own.

It's the caller's responsibility to ensure the monotonicity of given timestamps.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-01-09 14:00:03 -05:00
Duarte Nunes
c142b6d0ee atomic_cell: Remove revert flag
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180109184420.7556-1-duarte@scylladb.com>
2018-01-09 19:54:51 +01:00
Duarte Nunes
259f6759b4 partition_snapshot_reader: Use static_row() to read static_row
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180109162815.5811-2-duarte@scylladb.com>
2018-01-09 19:17:02 +01:00
Duarte Nunes
16c975edcc partition_version: Return static_row fragment from static_row()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180109162815.5811-1-duarte@scylladb.com>
2018-01-09 19:17:02 +01:00
Avi Kivity
7e898d2745 Merge seastar upstream
* seastar 6972a1e...a7a3e6f (1):
  > Update dpdk submodule
2018-01-09 18:17:33 +02:00
Tomasz Grabiec
5a32cf9008 tests: Make bad_alloc from test_concurrent_reads_and_eviction less likely
With -m1G, the test failed sporadically, because too many large
mutations were accumulated in memory. Avoid by limiting backlog.

Message-Id: <1515486430-4778-1-git-send-email-tgrabiec@scylladb.com>
2018-01-09 13:52:38 +02:00
Tomasz Grabiec
40ea74a934 tests: Drop unconditional mutation printing from assertions
sprint() may need to allocate a significant amount of memory if the
mutation is large, causing bad_alloc in
row_cache_test::test_concurrent_reads_and_eviction.

Message-Id: <1515486454-4913-1-git-send-email-tgrabiec@scylladb.com>
2018-01-09 13:52:19 +02:00
Avi Kivity
d340a03e81 Merge seastar upstream
* seastar b0f5591...6972a1e (8):
  > Merge NOWAIT AIO from Avi
  > configure: Allow overriding protoc compiler path
  > Tutorial: fix default of --reserve-memory
  > future-util: optimise parallel_for_each()
  > future-utils: avoid defining a template with its default template parameter
  > fix socket_address output stream operator
  > test: fix spelling of "abort_source_test"
  > Make dependencies and doc more arch-friendly
2018-01-09 12:33:40 +02:00
Piotr Jastrzebski
3bddf3415f flat_mutation_reader: Add test for make_forwardable
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <71c8b195e25c3c5c5b97f12e2d7b2f011c0d3162.1515490058.git.piotr@scylladb.com>
2018-01-09 10:46:04 +01:00
Piotr Jastrzebski
945f45f490 Fix fast_forward_to(partition_range&) in forwardable flat reader.
Making sure fast_forward_to(const partition_range&) sets _current
correctly.

Fixes #3089

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <6c29cf273f191da0e21035bcbe1592042ecffc70.1515490058.git.piotr@scylladb.com>
2018-01-09 10:46:04 +01:00
Asias He
774307b3a7 streaming: Do not send failed message for uninitialized session
The uninitialized session has no peer associated with it yet. There is
no point in sending the failed message when aborting the session. Sending
the failed message in this case would send it to a peer with an
uninitialized dst_cpu_id, which would cause the receiver to pass a bogus
shard id to smp::submit_to, which causes a segfault.

In addition, to be safe, initialize dst_cpu_id to zero, so that an
uninitialized session sends messages to shard zero instead of a random
bogus shard id.

Fixes the segfault issue found by
repair_additional_test.py:RepairAdditionalTest.repair_abort_test

Fixes #3115
Message-Id: <9f0f7b44c7d6d8f5c60d6293ab2435dadc3496a9.1515380325.git.asias@scylladb.com>
2018-01-08 15:04:06 +02:00
Raphael S. Carvalho
4610e994e1 sstables: cure our blindness on sstable read failure
After 611774b, we're blind again as to which sstable caused a compaction
to fail, leaving us with a cryptic message such as the following:
compaction_manager - compaction failed: std::runtime_error (compressed
chunk failed checksum)

After this change, a read failure in either compaction or a regular read
will report the guilty sstable:
compaction_manager - compaction failed: std::runtime_error (SSTable reader
found an exception when reading sstable ./data/.../keyspace1-standard1
ka-1-Data.db : std::runtime_error(compressed chunk failed checksum))

Fixes #3006.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180102230752.14701-1-raphaelsc@scylladb.com>
2018-01-08 13:43:13 +02:00
Avi Kivity
72c673fcc3 Merge "I/O Controller for memtables and compactions" from Glauber
"This patchset implements the compaction controller for I/O shares. The
goal is to automatically adjust compaction shares based on a
strategy-specific backlog. A higher backlog will translate into higher
shares.

As compaction progresses, that reduces the backlog. As new data is
flushed, that increases the backlog. The goal of the controller is to
keep the backlog constant at a certain rate, so that we go neither
too fast nor too slow.
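
The mapping from backlog to shares described above can be pictured with a
minimal sketch. It assumes a piecewise-linear controller driven by a short
list of control points; the names control_point and shares_for_backlog, and
the interpolation itself, are illustrative assumptions and are not taken from
the patchset.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical control point: at this backlog, hand out this many shares.
struct control_point {
    double backlog;
    double shares;
};

// Maps a backlog to shares by linear interpolation between control points
// sorted by backlog. Values outside the covered range are clamped, so a
// higher backlog never yields fewer shares as long as the points are
// themselves non-decreasing.
inline double shares_for_backlog(const std::vector<control_point>& pts, double backlog) {
    assert(!pts.empty());
    if (backlog <= pts.front().backlog) {
        return pts.front().shares;
    }
    for (std::size_t i = 1; i < pts.size(); ++i) {
        if (backlog <= pts[i].backlog) {
            const control_point& lo = pts[i - 1];
            const control_point& hi = pts[i];
            const double t = (backlog - lo.backlog) / (hi.backlog - lo.backlog);
            return lo.shares + t * (hi.shares - lo.shares);
        }
    }
    return pts.back().shares;
}
```

With hypothetical control points (0, 10) and (1, 1000), a backlog halfway
between them yields a share value halfway between 10 and 1000.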

Tracking reads and writes:
==========================

Tracking of reads and writes happens through the read_monitor and the
write_monitor. The write monitor is an existing interface whose
purpose is to release the write permit at particular points of the write
process. We enhance it to get a reference to an instance that tracks
the current offset inside the sstables::file_writer. This way the
backlog tracker always knows the offset of the
current write.

A similar thing is done for reads. The data_consumer already tracks the
position of the current read, and we isolate that into a structure to
which we can get a reference. A read_monitor allows us to connect the
compaction to that reference.

Lifetime management:
====================

In general, tracking objects will be owned by their callers and passed
down as references. The compaction object will own the read monitors and
the compaction write monitors, and the memtable flush write monitor will
be kept alive in a do_with block around the flush itself.

The backlog_{write,read}_progress_manager needs to be kept alive until
the SSTable is no longer in progress. For writes, that means until we
are able to add the SSTable charges in full, and for reads (compaction)
that means until we are able to remove the charges in full.

It is important to do that to avoid spikes in the graph. If we remove
the progress managers in a different operation than updating the SSTable
list we will be left in a temporary state where charges appear or
disappear abruptly, to be fixed when the final
add_sstable/remove_sstable happens. So we want those things to happen
together.

The compaction_backlog_tracker is kept alive until the strategy changes,
for example, through ALTER TABLE. Current charges are transferred to the
new strategy's compaction_backlog_tracker object when we do that. If the
type of strategy changes, the current read charges are forgotten. We can
do that because the compactions still running will not really contribute to
decreasing the backlog of the new compaction strategy.

Transfer of Charges
===================

When ALTER TABLE happens, we need to transfer ongoing writes to the new
backlog manager. Ongoing reads will still be tracked by the
backlog_manager that originated them.

The rationale for that is that reads still belong to the current
compaction, with the strategy that generated them. But new SSTables being
written will add to the backlog of the new strategy.

Note that ALTER TABLE operations do not necessarily cause a change of
strategy. We can be using the same strategy but just changing
properties. If that is the case, we expect no discontinuity in the
backlog graph (tested).

Resharding
==========

Resharding compactions are more complex than normal compactions because
the SSTables are created in one shard and later sent to another shard.
It is better, then, to track resharding compactions separately and let
them have their own backlog tracker, which will insert backlog in
proportion to the amount of data to be resharded.

Memtable Flush I/O Controller
=============================

With the current infrastructure it becomes trivial to add a new
controller, for either I/O or CPU. This patchset then adds an I/O
controller for memtable flushes, using the same backlog algorithm that
we already used for CPU."

* 'compaction-controller-io-v5' of github.com:glommer/scylla:
  database: add a controller for I/O on memtable flushes.
  document the compaction controller
  compaction: adjust shares for compactions
  backlog_controllers: implement generic I/O controller
  factor out some of the controller code
  io shares: multiply all shares by 10
  compaction_strategy: implement backlog manager for the SizeTiered strategy
  infrastructure for backlog estimator for compaction work.
  sstables: notify about end of data component write
  sstables: add read_monitor_generator
  sstables: add read_monitor
  sstables: enhance data consumer with a position tracker
  sstables: enhance the file_writer with an offset tracker
  sstables: pass references instead of pointers for write_monitor
  compaction: control destruction of readers
2018-01-07 15:00:10 +02:00
Avi Kivity
375ed938b4 Merge "Fix potential infinite recursion in leveled compaction" from Raphael
'"The issue is triggered by compaction of sstables of level higher than 0.

The problem happens when the interval map of the partitioned sstable set
stores intervals such as the following:
[-9223362900961284625 : -3695961740249769322 ]
(-3695961740249769322 : -3695961103022958562 ]

When the selector is called for the first interval above, the exclusive
lower bound of the second interval is returned as the next token, but the
inclusiveness info is not returned.
So reader_selector was returning that there *were* new readers when
the current token was -3695961740249769322 because it was stored in
selector position field as inclusive, but it's actually exclusive.

This false positive was leading to infinite recursion in combined
reader because sstable set's incremental selector itself knew that
there were actually *no* new readers, and therefore *no* progress
could be made."

Fixes #2908.'

* 'high_level_compaction_infinite_recursion_fix_v4' of github.com:raphaelsc/scylla:
  tests: test for infinite recursion bug when doing high-level compaction
  Fix potential infinite recursion when combining mutations for leveled compaction
  dht: make it easier to create ring_position_view from token
  dht: introduce is_min/max for ring_position
2018-01-07 13:22:17 +02:00
Vlad Zolotarov
f0d5619634 service::storage_service: add the get_local_auth_service() accessor
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-01-05 18:00:11 -05:00
Vlad Zolotarov
1d978b9caa service::client_state: remove the unused _tracing_session_id field
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-01-05 18:00:11 -05:00
Asias He
34f6218dc5 gossip: Show correct nodetool status against the shutdown node itself
If a node shuts itself down due to an I/O error (such as ENOSPC), then
nodetool status will show the cluster status at the time the shutdown
occurred.

In fact the node will be in shutdown status (nodetool gossipinfo shows
the correct status); however, `nodetool status` does not interpret the
shutdown status. Instead, it uses the output of:

curl -X GET --header "Accept: application/json"
"http://127.0.0.1:10000/gossiper/endpoint/live"

to decide if a node is in UN status.

To fix, do not include the node itself in the output of get_live_members.

Without this patch, when a node is shut down due to an I/O error:
UN  127.0.0.1  296.2 MB   256          ?  056ff68e-615c-4412-8d35-a4626569b9fd  rack1

With this patch, when a node is shut down due to an I/O error:
?N  127.0.0.1  296.2 MB   256          ?  056ff68e-615c-4412-8d35-a4626569b9fd  rack1

Fixes #1629
Message-Id: <039196a478b5b1a8749b3fdaf7e16cfe2eb73a2f.1498528642.git.asias@scylladb.com>
2018-01-04 08:31:01 +02:00
Glauber Costa
4f1b875784 database: add a controller for I/O on memtable flushes.
The algorithm and principle of operation are the same as for the CPU
controller. It is, however, always enabled, and it operates on
I/O shares.

I/O-bound workloads are expected to hit the maximum once virtual
dirty fills up and stay there while the load is steady.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-03 19:58:57 -05:00
Glauber Costa
da792641c6 document the compaction controller
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-03 19:58:57 -05:00
Glauber Costa
244c564aac compaction: adjust shares for compactions
Compactions can be a heavy disk user and the I/O scheduler can always
guarantee that it uses its fair share of disk.

Such a fair share can, however, be a lot more than what compaction actually
needs. This patch draws on the controllers infrastructure to adjust the
I/O shares that the compaction class will get so that compaction
bandwidth is dynamically adjusted.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-03 19:58:57 -05:00
Glauber Costa
4b44a22236 backlog_controllers: implement generic I/O controller
Like the CPU controller, but will act on I/O priorities.
Shares can go from 0 to 1000.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-03 19:56:54 -05:00
Glauber Costa
1671d9c433 factor out some of the controller code
The control algorithm we are using for memtables has proven itself
quite successful. We will very likely use the same for other processes,
like compactions.

Make the code a bit more generic, so that a new controller only has to
set the desired parameters.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-03 19:56:54 -05:00
Raphael S. Carvalho
e641c0d333 tests: test for infinite recursion bug when doing high-level compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-01-03 16:23:02 -02:00
Raphael S. Carvalho
818830715f Fix potential infinite recursion when combining mutations for leveled compaction
The issue is triggered by compaction of sstables of level higher than 0.

The problem happens when the interval map of the partitioned sstable set
stores intervals such as the following:
[-9223362900961284625 : -3695961740249769322 ]
(-3695961740249769322 : -3695961103022958562 ]

When the selector is called for the first interval above, the exclusive
lower bound of the second interval is returned as the next token, but the
inclusiveness info is not returned.
So reader_selector was returning that there *were* new readers when
the current token was -3695961740249769322 because it was stored in
selector position field as inclusive, but it's actually exclusive.

This false positive was leading to infinite recursion in combined
reader because sstable set's incremental selector itself knew that
there were actually *no* new readers, and therefore *no* progress
could be made.

Fix is to use ring_position in reader_selector, such that
inclusiveness would be respected.
So reader_selector::has_new_readers() won't return false positive
under the conditions described above.
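
The difference between a token-only position and a position that carries
inclusiveness can be illustrated with a small sketch. The ring_pos struct,
its weight field and both has_new_readers variants are simplified,
hypothetical stand-ins and not the actual dht::ring_position or
reader_selector code.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Hypothetical simplified ring position: a token plus a weight that says
// whether the position sits before (-1), at (0), or after (+1) all keys
// that share the token.
struct ring_pos {
    std::int64_t token;
    int weight;
};

inline bool operator<(const ring_pos& a, const ring_pos& b) {
    return std::tie(a.token, a.weight) < std::tie(b.token, b.weight);
}

// Token taken from the intervals quoted in the commit message.
constexpr std::int64_t example_token = -3695961740249769322LL;

// Token-only check: the exclusive lower bound "(token" is indistinguishable
// from an inclusive one, so a reader positioned at that token is reported
// as having new readers even though none can be created.
inline bool has_new_readers_token_only(std::int64_t current, std::int64_t next_token) {
    return current >= next_token;
}

// Position-aware check: the selector's next position is "after all keys
// with this token", so a key at the same token does not trigger it.
inline bool has_new_readers(const ring_pos& current, const ring_pos& next_pos) {
    assert(current.weight == 0); // in this sketch a reader always sits at a key
    return !(current < next_pos);
}
```

In this sketch the exclusive lower bound of the second interval is
represented as ring_pos{example_token, +1}, which is what makes the
comparison against a key at the same token come out as "no new readers".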

Fixes #2908.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-01-03 16:23:01 -02:00
Raphael S. Carvalho
19d994cfff dht: make it easier to create ring_position_view from token
that's done by adding a separate explicit constructor

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-01-03 15:26:26 -02:00
Raphael S. Carvalho
68ac0832b7 dht: introduce is_min/max for ring_position
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-01-03 15:26:25 -02:00
Vlad Zolotarov
976f444813 tests: commitlog_test: fix the compilation and test errors introduced by the hinted_handoff series
Use the default commitlog configuration with the hinted_handoff disabled
in the tests.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1514942938-3844-1-git-send-email-vladz@scylladb.com>
2018-01-03 12:20:34 +00:00
Raphael S. Carvalho
e29b598c5f sstables: make compaction_descriptor's ctor explicit to avoid bad conversion
perf sstable used the old sstables::compact_sstables() interface and still
compiled due to a bad implicit conversion.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180103041900.21186-1-raphaelsc@scylladb.com>
2018-01-03 12:37:12 +02:00
Calle Wilund
35b9ec868a auth: Fix transitional auth for non-valid credentials
Fixes #3096

The credentials processing for transitional auth was broken
in ba6a41d, "auth: Switch to sharded service", which effectively removed
the "virtualization" of underlying auth in the SASL challenge.

As a quick workaround, add the permissive exception handling to
the sasl object as well.

Message-Id: <20180103102724.1083-1-calle@scylladb.com>
2018-01-03 12:33:04 +02:00
Amnon Heiman
3ec84a0b1d API tokens_endpoint: use streams
Returning token_endpoints when there are many tokens and end points can
take a long time.

This patch uses output stream to return the result.

Instead of returning a vector, it uses the streaming functionality in
json layer.

Fixes #2476

Message-Id: <20180103081907.5175-1-amnon@scylladb.com>
2018-01-03 11:11:49 +02:00
Glauber Costa
bb29d082d2 io shares: multiply all shares by 10
Technically all that matters is the proportion among the shares so this
change is functionally a noop. However, the CPU scheduler being proposed
has shares that go all the way up to 1000. In the hopes of being able to
unify I/O and CPU controllers one day, this patch brings the I/O shares
more in line with what Avi is doing for the CPU scheduler.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
074a13ecf1 compaction_strategy: implement backlog manager for the SizeTiered strategy
The SizeTiered backlog for a single SSTable is defined as:

   Bi = Ei * log4(T / Si)

Where:

  - Si is the size of this individual SSTable
  - T is the sum of sizes for all individual SSTables
  - Ei is the effective bytes in this SSTable.

The Effective size of an SSTable is:
 - The uncompacted size for an SSTable under compaction
 - The partially written size for an SSTable being written
 - The SSTable size for an SSTable that is not undergoing
   any of those processes.

The Aggregate Backlog for the entire Table is just the sum of
all individual SSTable backlogs, including the SSTables currently
being written.

Care is taken to avoid iterating over all SSTables, by separating
the aggregate backlog into a static component (sstables not changing) and
a component of SSTables that are undergoing change.
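
As a worked illustration of the formula above, the following sketch computes
Bi and the aggregate backlog directly. The sstable_info struct and function
names are hypothetical, and the sketch deliberately iterates over every
SSTable, which is exactly what the patch avoids by splitting the aggregate
into static and changing components.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical per-SSTable record used only for this illustration.
struct sstable_info {
    double size;            // Si: size of this individual SSTable
    double effective_bytes; // Ei: effective bytes in this SSTable
};

// Bi = Ei * log4(T / Si), using log4(x) = ln(x) / ln(4).
inline double sstable_backlog(const sstable_info& s, double total_size) {
    assert(s.size > 0 && total_size >= s.size);
    return s.effective_bytes * (std::log(total_size / s.size) / std::log(4.0));
}

// Aggregate backlog: T is the sum of all sizes, and the result is the sum
// of the individual backlogs. This is the naive O(n) form.
inline double table_backlog(const std::vector<sstable_info>& sstables) {
    double total_size = 0;
    for (const auto& s : sstables) {
        total_size += s.size;
    }
    double backlog = 0;
    for (const auto& s : sstables) {
        backlog += sstable_backlog(s, total_size);
    }
    return backlog;
}
```

For four equally sized SSTables that are not undergoing any process,
T / Si is 4, so each one contributes exactly its own size, while a single
SSTable on its own contributes no backlog because log4(1) is zero.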

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
ca284174d0 infrastructure for backlog estimator for compaction work.
This patch adds infrastructure in various points in the system to allow
us to determine the amount of work present as backlog from compactions.

What needs to be done can be explained in three major pieces:

1) Add hooks in the points where sstables are added to or removed from a
   column family (or more precisely, a compaction_strategy object).

2) Add hooks in read and write monitors that allow a compaction
   backlog estimator (tracker) to become aware of bytes that are
   partially written and compacted away.

3) Add a per-column family class (compaction_backlog_tracker) that
   can be used to track work that is done and relevant to compactions
   (like the two above), and a compaction manager to provide a
   system-wide backlog based on the response of the individual trackers.

The definition of how much backlog one has is strategy-specific. The
Null strategy is easy, as it never really has any backlog, and so is the
major strategy, since what really matters is the backlog of the
underlying compaction strategy.

Although backlogs are strategy-specific, they should be "compatible", in
the sense that if a particular strategy has more work to do, it should
yield a higher number than its counterparts.

All the others are presented in this patch as unimplemented: they will
always advertise a mild backlog that should yield a constant
CPU-utilization if used alone.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
86d7c160fd sstables: notify about end of data component write
We need to notify the monitor that the offset tracker that we are using is
about to be destroyed and will no longer be valid.

While we could modify the file_writer interface so that we could capture
the offset_tracker and take ownership of it - guaranteeing it is alive
until we reach the existing on_write_completed(), this feels like a
layer violation.

It is also potentially useful in general to let the monitor's callers
know that writing the data portion is done.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
3bd6bceaf0 sstables: add read_monitor_generator
Passing the read monitor down to the sstable readers is tricky. The
points of interest, like compaction, are usually very far from the
interfaces that register the monitor, like read_rows. Between the two,
there is usually a mutation_reader, which is and ought to be totally
unaware of the read monitor: technically, a mutation_reader may not even
know it is backed by sstables.

The solution is to create a read_monitor_generator, that can be passed
from the upper layers, like compaction, to the layers that are actually
making the decision of which sstables to create readers for.

Note that we don't need an equivalent piece of infrastructure for
writes, because writes don't happen through hidden layers and have all
the information they need to initialize their monitors.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
9702a0935b sstables: add read_monitor
Similar to the write_monitor, it will track progress of an sstable
being read. In the current interface, we will notify interested users
about what is the current position in the data file.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
f0391bf9a0 sstables: enhance data consumer with a position tracker
Callers, like compactions, will be able to know at any time the current
progress of a read.

As we do that, the currently unimplemented position() method of
data_consume_context becomes redundant and is removed.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
110b8531f4 sstables: enhance the file_writer with an offset tracker
Callers, like the memtable flusher or compactions will be able to find
out the current amount of bytes written at any time.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
00df0a5ad3 sstables: pass references instead of pointers for write_monitor
This came from Avi's review on the read_monitors. He suggested that we
not keep shared pointers, and instead have the caller
ensure lifetime. That makes sense, but having the writer interface
using shared_ptr and the read interface using references would lead to
an inconsistent interface.

For the sake of consistency we will change the write monitor to take
references before we do that. From database.cc's perspective, we could
now keep the monitors in a do_with() block, but we will keep the
shared_ptrs to manage their lifetime in anticipation of upcoming patches
in this series, where we'll have to pass them somewhere else.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:06 -05:00
Glauber Costa
d4109ebb80 compaction: control destruction of readers
Compactions run from a seastar::thread, in run(). They will either fail
or succeed, and from the point of view of ordering of destruction
between the compaction object and its readers:

- if compaction succeeds, we have no control over which gets destructed
  first, since both objects will be going out of scope.
- if it fails, we will forcibly destruct the compaction object, at
  which point the readers are still alive

From the point of view of lifetime management, it would be nice to make
sure that the compaction object outlives whichever other objects it
needs during compaction.

This nice-to-have will become paramount when we start adding
read_monitors to the compaction object, which themselves have to outlive
the readers.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:06 -05:00
Avi Kivity
8795238869 Merge "Fix handling of range tombstones starting at same position" from Tomasz
"When we get two range tombstones with the same lower bound from
different data sources (e.g. two sstables), which need to be combined
into a single stream, they need to be de-overlapped, because each
mutation fragment in the stream must have a different position. If we
have range tombstones [1, 10) and [1, 20), the result of that
de-overlapping will be [1, 10) and [10, 20]. The problem is that if
the stream corresponds to a clustering slice with upper bound greater
than 1, but lower than 10, the second range tombstone would appear as
being out of the query range. This is currently violating assumptions
made by some consumers, like cache populator.
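
The de-overlapping step and the resulting out-of-range start can be sketched
as follows. This is a simplified illustration with integer positions and
half-open ranges; it ignores tombstone timestamps and bound inclusiveness,
and the type and function names are hypothetical rather than taken from the
readers.

```cpp
#include <cassert>
#include <utility>

// Hypothetical simplified range tombstone covering [start, end).
struct range_tombstone {
    int start;
    int end;
};

// De-overlaps two tombstones that share a start position: the shorter one
// is kept as is, and the longer one is trimmed to begin where the shorter
// one ends, so the two no longer share a position in the stream.
inline std::pair<range_tombstone, range_tombstone> deoverlap(range_tombstone a, range_tombstone b) {
    assert(a.start == b.start);
    if (b.end < a.end) {
        std::swap(a, b);
    }
    b.start = a.end;
    return {a, b};
}

// True when the tombstone starts inside the clustering slice [lo, hi).
inline bool starts_within(const range_tombstone& rt, int lo, int hi) {
    return rt.start >= lo && rt.start < hi;
}
```

Applied to the example above, the trimmed second tombstone starts at 10, so
for a slice whose upper bound is 5 it appears to begin outside the query
range even though the original tombstone started inside it.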

One effect of this may be that a reader will miss rows which are in
the range (1, 10) (after the start of the first range tombstone, and
before the start of the second range tombstone), if the second range
tombstone happens to be the last fragment which was read for a
discontinuous range in cache and we stopped reading at that point
because of a full buffer and cache was evicted before we resumed
reading, so we went to reading from the sstable reader again. There
could be more cases in which this violation may resurface.

There is also a related bug in mutation_fragment_merger. If the reader
is in forwarding mode, and the current range is [1, 5], the reader
would still emit range_tombstone([10, 20]). If that reader is later
fast forwarded to another range, say [6, 8], it may produce fragments
with positions smaller than those emitted before, violating
monotonicity of fragment positions in the stream.

A similar bug was also present in partition_snapshot_flat_reader.

Possible solutions:

 1) relax the assumption (in cache) that streams contain only relevant
 range tombstones, and only require that they contain at least all
 relevant tombstones

 2) allow subsequent range tombstones in a stream to share the same
 starting position (position is weakly monotonic), then we don't need
 to de-overlap the tombstones in readers.

 3) teach combining readers about query restrictions so that they can drop
 fragments which fall outside the range

 4) force leaf readers to trim all range tombstones to query restrictions

This patch implements solution 2. It simplifies combining readers,
which don't need to accumulate and trim range tombstones.

I don't like solution 3, because it makes combining readers more
complicated, slower, and harder to properly construct (currently
combining readers don't need to know restrictions of the leaf
streams).

Solution 4 is confined to implementations of leaf readers, but also
has the disadvantage of making those more complicated and slower.

There is only one consumer which needs the tombstones with monotonic positions, and
that is the sstable writer.

Fixes #3093."

* tag 'tgrabiec/fix-out-of-range-tombstones-v1' of github.com:scylladb/seastar-dev:
  tests: row_cache: Introduce test for concurrent read, population and eviction
  tests: sstables: Add test for writing combined stream with range tombstones at same position
  tests: memtable: Test that combined mutation source is a mutation source
  tests: memtable: Test that memtable with many versions is a mutation source
  tests: mutation_source: Add test for stream invariants with overlapping tombstones
  tests: mutation_reader: Test fast forwarding of combined reader with overlapping range tombstones
  tests: mutation_reader: Test combined reader slicing on random mutations
  tests: mutation_source_test: Extract random_mutation_generator::make_partition_keys()
  mutation_fragment: Introduce range()
  clustering_interval_set: Introduce overlaps()
  clustering_interval_set: Extract private make_interval()
  mutation_reader: Allow range tombstones with same position in the fragment stream
  sstables: Handle consecutive range_tombstone fragments with same position
  tests: streamed_mutation_assertions: Merge range_tombstones with the same position in produces_range_tombstone()
  streamed_mutation: Introduce peek()
  mutation_fragment: Extract mergeable_with()
  mutation_reader: Move definition of combining mutation reader to source file
  mutation_reader: Use make_combined_reader() to create combined reader
2018-01-02 18:32:09 +02:00
Raphael S. Carvalho
2a7eaa4933 tests:perf: add compaction mode to perf_sstable
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171209175759.7769-1-raphaelsc@scylladb.com>
2018-01-02 10:16:13 +01:00
Duarte Nunes
39c1987ad7 CMakeLists: Require C++17
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180101165631.2182-1-duarte@scylladb.com>
2018-01-01 19:01:24 +02:00
Avi Kivity
73814db0f1 Merge "auth: Replace delayed_tasks with sleep_abortable" from Duarte
"delayed_tasks has a bug that if the object is destroyed while a timer
callback is queued, the callback will then try to access freed memory.
This series replaces the whole thing with sleep_abortable()."

* 'auth-delayed-tasks/v2' of https://github.com/duarten/scylla:
  auth: Replace delayed_tasks with sleep_abortable
  utils/exponential_backoff_retry: Add helper to automate retries
  utils/exponential_backoff_retry: Add abort_source-based retry
2018-01-01 13:44:01 +02:00
Raphael S. Carvalho
3dcf00ec67 sstables: feed new sstable with its owner shard
We missed the opportunity to feed the shard id to the sstable being written
when working on 67c5c8dc67. With this change, when an sstable is reopened
after being sealed, its shard doesn't need to be recomputed by the open
procedure.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171231024529.13664-1-raphaelsc@scylladb.com>
2018-01-01 10:17:07 +02:00
Avi Kivity
d7a91f5b84 build: require C++17 unconditionally
We can now use C++17 in Scylla.
Message-Id: <20171228112934.28659-1-avi@scylladb.com>
2017-12-28 16:44:59 +00:00
Duarte Nunes
81b1455b22 auth: Replace delayed_tasks with sleep_abortable
delayed_tasks has a bug that if the object is destroyed while a timer
callback is queued, the callback will then try to access freed memory.
This could be fixed by providing a stop() function that waits for
pending callbacks, but we can just replace the whole thing by leveraging
the abort_source-enabled exponential_backoff_retry.
2017-12-28 13:00:28 +00:00
Duarte Nunes
40ad65666f utils/exponential_backoff_retry: Add helper to automate retries
This patch adds the do_until_value static member function to
exponential_backoff_retry, which retries the specified function until
it returns an engaged optional.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-12-28 13:00:28 +00:00
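The retry helper described in 40ad65666f above can be illustrated with a minimal, synchronous sketch. It is not the Seastar-based implementation: `retry_until_value`, its parameters, and the blocking sleep are all assumptions made for illustration, and abort handling is not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <optional>
#include <thread>
#include <utility>

// Hypothetical stand-in for a "retry until engaged optional" helper.
// Calls `attempt` until it returns an engaged optional, sleeping between
// attempts and doubling the delay each time, capped at `max`.
template <typename T>
T retry_until_value(const std::function<std::optional<T>()>& attempt,
                    std::chrono::milliseconds base,
                    std::chrono::milliseconds max) {
    assert(base.count() > 0 && base <= max);
    std::chrono::milliseconds delay = base;
    for (;;) {
        if (std::optional<T> value = attempt()) {
            return std::move(*value);  // engaged: stop retrying
        }
        std::this_thread::sleep_for(delay);
        delay = std::min<std::chrono::milliseconds>(delay * 2, max);
    }
}
```

The result type has to be spelled out at the call site, for example `retry_until_value<int>(attempt, 1ms, 4ms)`, because the `std::function` parameter prevents deduction from a lambda.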
Duarte Nunes
9a602c7796 utils/exponential_backoff_retry: Add abort_source-based retry
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-12-28 13:00:28 +00:00
Duarte Nunes
89b353cd95 Delete unused nway_merger.hh
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1514463536-7732-1-git-send-email-duarte@scylladb.com>
2017-12-28 14:21:40 +02:00
Raphael S. Carvalho
c76356fb39 sstables: make shard computation resilient to empty sharding metadata
Scylla metadata could be empty due to bugs like the one introduced by
115ff10. Let's make shard computation resilient to empty sharding
metadata by falling back to the approach that uses first and last
keys to compute shards.

Refs #2932.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171223120140.3642-2-raphaelsc@scylladb.com>
2017-12-28 14:07:06 +02:00
Raphael S. Carvalho
fa5a26f12d sstables: fail sstable write if unable to generate sharding metadata
SSTable can generate an empty sharding metadata after a bug like
the one introduced here 115ff10, that results in tokens being
generated using base table for the view table. That leads to
sstable being deleted in subsequent boot because all shards will
agree on its deletion given that it will not belong to anybody,
and also compaction to crash because this relies on resulting
sstable belonging to one shard at least.

I wouldn't like to spend days debugging it again because sstable
write silently generated empty sharding metadata, so let's make
write fail when it happens (see issue #2932 for details).

Refs #2932.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171223120140.3642-1-raphaelsc@scylladb.com>
2017-12-28 14:07:05 +02:00
Duarte Nunes
2618209c2d Remove obsolete includes and fix build
move.hh was deleted, but files weren't updated to reflect that.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-12-28 12:03:44 +00:00
Avi Kivity
fc03ba1c08 streamed_mutation: remove non-missing include
"move.hh" should have been missing, but wasn't.
2017-12-28 14:00:34 +02:00
Duarte Nunes
1374f898b9 Merge seastar upstream
Class optimized_optional was moved into seastar, and its usage
simplified so move_and_disengage() is replaced in favour of
std::exchange(_, { }).

* seastar adaca37...b0f5591 (9):
  > Merge "core: Introduce cancellation mechanism" from Duarte
  > Fix Seastar build that no longer builds with --enable-dpdk after the recent commit fd87ea2
  > noncopyable_function: support function objects whose move constructors throw
  > Adding new hardware options to new config format, using new config format for dpdk device
  > Fix check for Boost version during pre-build configuration.
  > variant_utils: add variant_visitor constructor for C++17 mode
  > Merge "Allows json object to be stream to an" from Amnon
  > Merge 'Default to C++17' from Avi
  > Add const version of subscript operator to circular_buffer

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171228112126.18142-1-duarte@scylladb.com>
2017-12-28 13:24:18 +02:00
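The `std::exchange(_, { })` idiom mentioned in 1374f898b9 above can be sketched with `std::optional` standing in for `optimized_optional`. The helper name `take` is hypothetical.

```cpp
#include <optional>
#include <string>
#include <utility>

// Takes the value out of `slot` and leaves `slot` disengaged, in a single
// expression. `{}` value-initializes a fresh (empty) optional as the new value.
std::optional<std::string> take(std::optional<std::string>& slot) {
    return std::exchange(slot, {});
}
```

After `take(slot)`, the returned optional holds whatever `slot` held before, and `slot.has_value()` is false.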
Nadav Har'El
58f2b6c285 Drop "VIEWS" as unimplemented reason
After materialized views have been implemented (although not enabled by
default), unimplemented::cause::VIEWS is no longer used. I think we can
drop it.

By the way, there are other no longer used unimplemented reasons, we
should probably drop them too.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20171224131318.4893-1-nyh@scylladb.com>
2017-12-27 15:08:41 +02:00
Amos Kong
68a3d1e9b2 auth: delete auth/authorizer.cc
This file hasn't been used since commit ba6a41d397.
Jesse wanted to delete this file, but the deletion got lost.

Signed-off-by: Amos Kong <amos@scylladb.com>
Cc: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <9af5aee2b8d492b865b9b15c9fb16941880600d8.1514305358.git.amos@scylladb.com>
2017-12-26 18:29:38 +02:00
Takuya ASADA
51013f561d dist/debian: rename boost1.63 to scylla-boost163 on Debian 8
We provided the "boost1.63" package for Debian 8 since we couldn't build the
"scylla-boost163" package which is available on Ubuntu14/16, but I fixed the
problem and now we have it for Debian 8 too, so switch to it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1514220163-25985-1-git-send-email-syuu@scylladb.com>
2017-12-25 18:51:36 +02:00
Piotr Jastrzebski
0430968426 Merge flat_mutation_reader_mutation_source into mutation_source
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-22 22:32:38 +01:00
Piotr Jastrzebski
3817519844 Remove unused mutation_reader_mutation_source
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-22 21:42:50 +01:00
Piotr Jastrzebski
e0e2fcc013 Remove unused mutation_source constructor.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-22 21:27:43 +01:00
Piotr Jastrzebski
66f603fc0a Migrate make_source to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-22 21:27:30 +01:00
Piotr Jastrzebski
d39f8cfb37 Migrate run_conversion_to_mutation_reader_tests to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-22 21:26:44 +01:00
Piotr Jastrzebski
ab8918c9c3 flat_mutation_reader_from_mutations: add support for slicing
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-22 21:25:37 +01:00
Piotr Jastrzebski
093d6f06f0 Remove unused mutation_source constructor.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-22 16:10:41 +01:00
Piotr Jastrzebski
da39ee5ba0 Migrate partition_counting_reader to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-22 16:10:29 +01:00
Piotr Jastrzebski
0b34906da3 Migrate throttled_mutation_source to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-22 16:06:23 +01:00
Piotr Jastrzebski
fa938aafdd Extract delegating_reader from make_delegating_reader
and make it a template to enable using it both with reference_wrapper
and flat_mutation_reader directly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-22 15:36:53 +01:00
Tomasz Grabiec
37ddc8bcfd tests: row_cache: Introduce test for concurrent read, population and eviction 2017-12-22 11:58:17 +01:00
Tomasz Grabiec
42ec01661c tests: sstables: Add test for writing combined stream with range tombstones at same position 2017-12-22 11:06:34 +01:00
Tomasz Grabiec
cb34420e1c tests: memtable: Test that combined mutation source is a mutation source 2017-12-22 11:06:34 +01:00
Tomasz Grabiec
7ce02bc22e tests: memtable: Test that memtable with many versions is a mutation source 2017-12-22 11:06:34 +01:00
Tomasz Grabiec
9cd35f4b90 tests: mutation_source: Add test for stream invariants with overlapping tombstones 2017-12-22 11:06:33 +01:00
Tomasz Grabiec
7ce52df88b tests: mutation_reader: Test fast forwarding of combined reader with overlapping range tombstones 2017-12-22 11:06:33 +01:00
Tomasz Grabiec
ca6de9e78c tests: mutation_reader: Test combined reader slicing on random mutations 2017-12-22 11:06:33 +01:00
Tomasz Grabiec
73a79372a4 tests: mutation_source_test: Extract random_mutation_generator::make_partition_keys() 2017-12-22 11:06:33 +01:00
Tomasz Grabiec
2be3cbbb81 mutation_fragment: Introduce range() 2017-12-22 11:06:33 +01:00
Tomasz Grabiec
0c0d52a933 clustering_interval_set: Introduce overlaps() 2017-12-22 11:06:33 +01:00
Tomasz Grabiec
c1d96bda88 clustering_interval_set: Extract private make_interval() 2017-12-22 11:06:33 +01:00
Tomasz Grabiec
41ede08a1d mutation_reader: Allow range tombstones with same position in the fragment stream
When we get two range tombstones with the same lower bound from
different data sources (e.g. two sstables), which need to be combined
into a single stream, they need to be de-overlapped, because each
mutation fragment in the stream must have a different position. If we
have range tombstones [1, 10) and [1, 20), the result of that
de-overlapping will be [1, 10) and [10, 20). The problem is that if
the stream corresponds to a clustering slice with upper bound greater
than 1, but lower than 10, the second range tombstone would appear as
being out of the query range. This is currently violating assumptions
made by some consumers, like cache populator.

One effect of this may be that a reader will miss rows which are in
the range (1, 10) (after the start of the first range tombstone, and
before the start of the second range tombstone), if the second range
tombstone happens to be the last fragment which was read for a
discontinuous range in cache and we stopped reading at that point
because of a full buffer and cache was evicted before we resumed
reading, so we went to reading from the sstable reader again. There
could be more cases in which this violation may resurface.

There is also a related bug in mutation_fragment_merger. If the reader
is in forwarding mode, and the current range is [1, 5], the reader
would still emit range_tombstone([10, 20]). If that reader is later
fast forwarded to another range, say [6, 8], it may produce fragments
with smaller positions which were emitted before, violating
monotonicity of fragment positions in the stream.

A similar bug was also present in partition_snapshot_flat_reader.

Possible solutions:

 1) relax the assumption (in cache) that streams contain only relevant
 range tombstones, and only require that they contain at least all
 relevant tombstones

 2) allow subsequent range tombstones in a stream to share the same
 starting position (position is weakly monotonic), then we don't need
 to de-overlap the tombstones in readers.

 3) teach combining readers about query restrictions so that they can drop
fragments which fall outside the range

 4) force leaf readers to trim all range tombstones to query restrictions

This patch implements solution no 2. It simplifies combining readers,
which don't need to accumulate and trim range tombstones.

I don't like solution 3, because it makes combining readers more
complicated, slower, and harder to properly construct (currently
combining readers don't need to know restrictions of the leaf
streams).

Solution 4 is confined to implementations of leaf readers, but also
has disadvantage of making those more complicated and slower.

Fixes #3093.
2017-12-22 11:06:20 +01:00
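The de-overlapping problem described in 41ede08a1d above can be sketched as follows. This is a simplified model and not the reader code: positions are plain integers, ranges are half-open, timestamps are ignored, and the names `deoverlap` and `starts_within` are hypothetical.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Half-open clustering range [start, end) covered by a tombstone.
struct range_tombstone {
    int start;
    int end;
};

// De-overlaps two tombstones that share a start, so that every emitted
// fragment has a distinct position (the old, strictly monotonic rule).
std::vector<range_tombstone> deoverlap(range_tombstone a, range_tombstone b) {
    assert(a.start == b.start);
    if (a.end > b.end) {
        std::swap(a, b);  // make `a` the shorter one
    }
    std::vector<range_tombstone> out{a};
    if (b.end > a.end) {
        out.push_back({a.end, b.end});  // remainder of the longer tombstone
    }
    return out;
}

// True if the fragment's position lies inside the query slice [lo, hi).
bool starts_within(const range_tombstone& rt, int lo, int hi) {
    return rt.start >= lo && rt.start < hi;
}
```

With [1, 10) and [1, 20) and a slice ending at 5, the de-overlapped second fragment [10, 20) starts outside the slice, which is the situation the commit describes. If both tombstones may keep position 1 (solution 2), both start inside the slice and no trimming is needed.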
Tomasz Grabiec
f9038d5d78 sstables: Handle consecutive range_tombstone fragments with same position
In preparation for allowing fragment streams to produce range_tombstones
with the same position.
2017-12-22 11:04:02 +01:00
Tomasz Grabiec
92b89d576d tests: streamed_mutation_assertions: Merge range_tombstones with the same position in produces_range_tombstone()
In preparation for allowing fragment stream to produce consecutive
range tombstones with the same position.
2017-12-21 22:45:35 +01:00
Tomasz Grabiec
815cd254e2 streamed_mutation: Introduce peek()
Will be used in assertions to merge consecutive range tombstones.
2017-12-21 22:45:35 +01:00
Piotr Jastrzebski
963b128a87 row_cache_test: call row_cache::make_flat_reader in mutation_sources
instead of calling row_cache::make_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 22:22:11 +01:00
Piotr Jastrzebski
fd1b27c89d Remove unused friend declaration in flat_mutation_reader::impl
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 21:41:15 +01:00
Tomasz Grabiec
c5f82aa5bd mutation_fragment: Extract mergeable_with() 2017-12-21 21:24:11 +01:00
Tomasz Grabiec
60ed5d29c0 mutation_reader: Move definition of combining mutation reader to source file
So that the whole world doesn't recompile when it changes.
2017-12-21 21:24:11 +01:00
Tomasz Grabiec
52285a9e73 mutation_reader: Use make_combined_reader() to create combined reader
So that we can hide the definition of combined_mutation_reader. It's
also less verbose.
2017-12-21 21:24:11 +01:00
Piotr Jastrzebski
a02434120a Migrate make_source_with to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 21:18:35 +01:00
Piotr Jastrzebski
2c1f0250c2 Migrate make_empty_mutation_source to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 21:17:46 +01:00
Piotr Jastrzebski
b5ad96c9ca Remove unused mutation_source constructor
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 21:01:35 +01:00
Piotr Jastrzebski
5eb702a405 Migrate test_multi_range_reader to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 20:48:51 +01:00
Piotr Jastrzebski
b583ef7c8b Remove unused mutation_source constructors
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 20:34:20 +01:00
Paweł Dziepak
dfb6296d08 Merge "Migrate all clients of make_combined_reader to flat reader" from Piotr
"Remove old overloads that use mutation_reader."

* 'haaawk/combined_reader_clients_migration_v1_after_comments_2' of github.com:scylladb/seastar-dev:
  Remove unused make_combined_reader overload.
  Migrate test_fast_forwarding_combining_reader to flat reader
  flat_mutation_reader_from_mutations: support partition_range
  Don't pass fwd to flat_mutation_reader_from_mutations if it's no
  Remove unused make_combined_reader overload.
  Migrate test_combining_two_partially_overlapping_readers to flat reader
  Migrate test_combining_two_non_overlapping_readers to flat reader
  Migrate combined_mutation_reader_test to flat reader
  Migrate test_sm_fast_forwarding_combining_reader to flat reader
  Migrate test_combining_one_empty_reader to flat reader
  Migrate test_combining_two_empty_readers to flat reader
  Migrate test_combining_two_readers_with_one_reader_empty to flat reader
  Migrate test_combining_one_reader_with_many_partitions to flat reader
  Migrate test_combining_two_readers_with_the_same_row to flat reader
  Migrate make_combined_mutation_source to flat reader
  mutation_source: Add constructors for sources that ignore forwarding
2017-12-21 16:04:49 +00:00
Piotr Jastrzebski
04ce7dfb84 Remove unused make_combined_reader overload.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
759baa3a11 Migrate test_fast_forwarding_combining_reader to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
83e55283f7 flat_mutation_reader_from_mutations: support partition_range
This is needed to make it possible for
flat_mutation_reader_from_mutations to replace
make_reader_returning_many.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
9e3da50ed1 Don't pass fwd to flat_mutation_reader_from_mutations if it's no
Default value for fwd is no so there's no need to pass it explicitly.
This is important because we will add additional parameter to
flat_mutation_reader_from_mutations in next patch.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
b3b6db4f50 Remove unused make_combined_reader overload.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
202c562f68 Migrate test_combining_two_partially_overlapping_readers to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
6c62454076 Migrate test_combining_two_non_overlapping_readers to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
bef2cf8ed9 Migrate combined_mutation_reader_test to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
19d4bce624 Migrate test_sm_fast_forwarding_combining_reader to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
17e6f6b089 Migrate test_combining_one_empty_reader to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
1f77370d9e Migrate test_combining_two_empty_readers to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
a702d0ec3f Migrate test_combining_two_readers_with_one_reader_empty to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
9a5d6bd8af Migrate test_combining_one_reader_with_many_partitions to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
13551e6f50 Migrate test_combining_two_readers_with_the_same_row to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
b1c1709127 Migrate make_combined_mutation_source to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:42 +01:00
Piotr Jastrzebski
024e01ad9e mutation_source: Add constructors for sources that ignore forwarding
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 16:59:57 +01:00
Paweł Dziepak
4dfddc97c7 db/schema_tables: do not use moved from shared pointer
Shared pointer view is captured by two continuations, one of which is
moving it away. Using do_with() solves the problem.

Fixes #3092.
Message-Id: <20171221111614.16208-1-pdziepak@scylladb.com>
2017-12-21 15:13:25 +01:00
Tomasz Grabiec
b0a56a91c2 Merge "Remove memtable::make_reader" from Piotr
Migrate all the places that used memtable::make_reader to use
memtable::make_flat_reader and remove memtable::make_reader.

* seastar-dev.git haaawk/remove_memtable_make_reader_v2_rebased:
  Remove memtable::make_reader
  Stop using memtable::make_reader in row_cache_stress_test
  Stop using memtable::make_reader in row_cache_test
  Stop using memtable::make_reader in mutation_test
  Stop using memtable::make_reader in streamed_mutation_test
  Stop using memtable::make_reader in memtable_snapshot_source.hh
  Stop using memtable::make_reader in memtable::apply
  Add consume_partitions(flat_mutation_reader& reader, Consumer consumer)
  Add default parameter values in make_combined_reader
  Migrate test_virtual_dirty_accounting_on_flush to flat reader
  Migrate test_adding_a_column_during_reading_doesnt_affect_read_result
  Simplify flat_reader_assertions& produces(const mutation& m)
  Migrate test_partition_version_consistency_after_lsa_compaction_happens
  flat_mutation_reader: Allow setting buffer capacity
  Add next_mutation() to flat_mutation_reader_assertions
  cf::for_all_partitions::iteration_state: don't store schema_ptr
  read_mutation_from_flat_mutation_reader: don't take schema_ptr
  Migrate test_fast_forward_to_after_memtable_is_flushed to flat reader
2017-12-21 14:02:56 +01:00
Piotr Jastrzebski
17f2eb8ff7 Remove memtable::make_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
85d2b24415 Stop using memtable::make_reader in row_cache_stress_test
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
129a282cbf Stop using memtable::make_reader in row_cache_test
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
dc75df6353 Stop using memtable::make_reader in mutation_test
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
660086f2d6 Stop using memtable::make_reader in streamed_mutation_test
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
2a9cd5bffe Stop using memtable::make_reader in memtable_snapshot_source.hh
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
6bcee5976b Stop using memtable::make_reader in memtable::apply
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
a67d6bef29 Add consume_partitions(flat_mutation_reader& reader, Consumer consumer)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
ff718d6573 Add default parameter values in make_combined_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
b1676db658 Migrate test_virtual_dirty_accounting_on_flush to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
b90677272f Migrate test_adding_a_column_during_reading_doesnt_affect_read_result
to flat reader

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
20e31e462e Simplify flat_reader_assertions& produces(const mutation& m)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
ddecd385c1 Migrate test_partition_version_consistency_after_lsa_compaction_happens
to flat reader

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
5f8fba8a61 flat_mutation_reader: Allow setting buffer capacity
Needed in tests to limit amount of prefetching done by readers, so
that it's easier to test interleaving of various events.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
b18c075470 Add next_mutation() to flat_mutation_reader_assertions
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
308ec43ea5 cf::for_all_partitions::iteration_state: don't store schema_ptr
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
570703a169 read_mutation_from_flat_mutation_reader: don't take schema_ptr
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
681dc26dd1 Migrate test_fast_forward_to_after_memtable_is_flushed to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Tomasz Grabiec
71cc63dfa6 Merge "Fixes for multi_range_reader" from Paweł
The following patches contain fixes for skipping to the next partition
in multi_range_reader and completely disable support for fast
forwarding inside a single partition, which is not needed and would only
add unnecessary complexity.

* https://github.com/pdziepak/scylla.git fix-multi_range_reader/v1:
  flat_multi_range_mutation_reader: disallow
    streamed_mutation::forwarding
  flat_multi_range_mutation_reader: clear buffer on next_partition()
  tests/flat_multi_range_mutation_reader: test skipping to next
    partition
2017-12-21 11:06:57 +01:00
George Tavares
ceecd542cd db/view: Consume updated rows regardless of static row
With materialized views, if the base table has static columns
and an update to the base table mutates both static and non-static rows,
the streamed_mutation is stopped before the non-static rows are processed.
This patch avoids stopping the streamed_mutation and adds a test case.

Message-Id: <20171220173434.25091-1-tavares.george@gmail.com>
2017-12-21 00:49:15 +01:00
Paweł Dziepak
da0655ab3c tests/flat_multi_range_mutation_reader: test skipping to next partition 2017-12-20 16:08:09 +00:00
Paweł Dziepak
5d72acac0c flat_multi_range_mutation_reader: clear buffer on next_partition() 2017-12-20 16:08:09 +00:00
Paweł Dziepak
3cf46a31a6 flat_multi_range_mutation_reader: disallow streamed_mutation::forwarding
Properly implementing streamed_mutation::forwarding::yes in multi range
reader would noticeably increase its complexity and is not needed.
2017-12-20 14:50:11 +00:00
Tomasz Grabiec
dfe48bbbc7 range_tombstone_list: Fix insert_from()
end_bound was not updated in one of the cases in which end and
end_kind were changed; as a result, later merging decisions using
end_bound were incorrect. end_bound was using the new key, but the old
end_kind.

Fixes #3083.
Message-Id: <1513772083-5257-1-git-send-email-tgrabiec@scylladb.com>
2017-12-20 12:20:20 +00:00
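The kind of staleness described in dfe48bbbc7 above can be reproduced in a small sketch. It assumes a view type that refers to the key but copies the kind; the integer keys and all names here are hypothetical and not taken from range_tombstone_list.

```cpp
// Whether the end bound itself is included in the range.
enum class bound_kind { excl_end, incl_end };

// Non-owning view of a bound: refers to the key, but copies the kind.
struct bound_view {
    const int& key;
    bound_kind kind;
};

// True if `pos` lies before (or at, for inclusive bounds) the end bound.
bool before_end(int pos, const bound_view& b) {
    return b.kind == bound_kind::incl_end ? pos <= b.key : pos < b.key;
}

// Updates key and kind but forgets the view: the view sees the new key
// through its reference, yet still carries the old kind.
bool covered_with_stale_view(int pos) {
    int end = 5;
    bound_kind end_kind = bound_kind::excl_end;
    bound_view end_bound{end, end_kind};
    end = 10;
    end_kind = bound_kind::incl_end;
    return before_end(pos, end_bound);  // evaluates against (10, excl_end)
}

// Same update, but a fresh view is built so key and kind agree again.
bool covered_with_refreshed_view(int pos) {
    int end = 5;
    bound_kind end_kind = bound_kind::excl_end;
    end = 10;
    end_kind = bound_kind::incl_end;
    bound_view end_bound{end, end_kind};
    return before_end(pos, end_bound);  // evaluates against (10, incl_end)
}
```

The two functions disagree only at the boundary position 10, which is why a bug of this shape can go unnoticed until a merge decision lands exactly on the bound.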
Raphael S. Carvalho
daaadfd515 compaction_manager: remove dead sstable rewrite submission function
this rewrite submission was used by old resharding, but it's no longer
needed, so let's remove it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171219191052.13689-1-raphaelsc@scylladb.com>
2017-12-20 09:29:43 +02:00
Avi Kivity
772d1f47d7 Merge "Fix read amplification in sstable reads" from Paweł
"4b9a34a85425d1279b471b2ff0b0f2462328929c "Merge sstable_data_source
into sstable_mutation_reader" has introduced unintentional changes, some
of them causing excessive read amplification during empty range reads.
The following patches restore the previous behaviour."

* tag 'fix-read-amplification/v1' of https://github.com/pdziepak/scylla:
  sstables: set _read_enabled to false if possible
  sstables: set _single_partition_read for single partition reads
2017-12-19 18:17:14 +02:00
Tomasz Grabiec
6a6bf58b98 flat_mutation_reader: Fix make_nonforwardable()
It emitted end-of-stream prematurely if buffer was full.
Message-Id: <1513697716-32634-1-git-send-email-tgrabiec@scylladb.com>
2017-12-19 15:56:49 +00:00
Avi Kivity
2137d753b3 Merge "Serialize compaction of same size tier for different cfs" from Raphael
"Currently, compaction manager will serialize compaction of same size tier
(or weight) if they belong to the same column family. However, it fails to
do so if the compaction jobs belong to different column families.
That can lead to an ungodly amount of running compaction which gets worse
the higher the number of shards and active column families. The problem
is that it may affect overall system performance due to excessive resource
usage. It's easy to trigger it during bootstrapping after loading a node with
new sstables or repairing, or if lots of cfs are being actively written."

Fixes #1295.

* 'similar_sized_compaction_serialization_v4' of github.com:raphaelsc/scylla:
  sstables: remove column_family from compaction_weight_registration
  compaction_manager: serialize compaction of same size tier for different cfs
  sstables: introduces deregister() and weight() to compaction_weight_registration
  sstables: move compaction_weight_registration to its own header
  sstables: improve compact_sstables() interface
2017-12-19 16:32:27 +02:00
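The serialization described in 2137d753b3 above can be sketched as a weight registry shared by all column families. This is an illustration only: the power-of-4 tier function and the names `weight_of`, `weight_registry`, and `try_register` are assumptions and not the compaction manager's code.

```cpp
#include <cstdint>
#include <unordered_set>

// Hypothetical size-tier function: jobs whose total size falls in the same
// power-of-4 bucket get the same weight.
int weight_of(std::uint64_t total_bytes) {
    int weight = 0;
    while (total_bytes >= 4) {
        total_bytes /= 4;
        ++weight;
    }
    return weight;
}

// A single registry shared by all column families: a weight can be held by
// at most one running job, regardless of which table the job belongs to.
class weight_registry {
    std::unordered_set<int> _running;
public:
    // Returns true if the weight was free and is now held by the caller.
    bool try_register(int weight) { return _running.insert(weight).second; }
    // Releases the weight so that a postponed job of the same tier can run.
    void deregister(int weight) { _running.erase(weight); }
};
```

Because the registry is keyed only by weight, a job from a second column family in the same tier has to wait until the first job deregisters, which bounds the number of concurrent compactions by the number of tiers.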
Tomasz Grabiec
7b36c8423c row_cache: Fix single_partition_populating_reader not waiting on create_underlying() to resolve
Results in undefined behavior.
Message-Id: <1513691679-27081-1-git-send-email-tgrabiec@scylladb.com>
2017-12-19 16:12:11 +02:00
Paweł Dziepak
574c6006f6 sstables: set _read_enabled to false if possible 2017-12-19 13:59:13 +00:00
Paweł Dziepak
1beb3552fc sstables: set _single_partition_read for single partition reads 2017-12-19 13:59:13 +00:00
Piotr Jastrzebski
570fc5afed Use row_cache::make_flat_reader in column_family::make_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <ba1659ceed8676f45942ce6e7506158026947345.1513687259.git.piotr@scylladb.com>
2017-12-19 14:42:32 +02:00
Raphael S. Carvalho
928beae242 Fix compilation of db/hints/manager.cc and row_cache.cc
compiler: gcc (GCC) 6.3.1 20161221 (Red Hat 6.3.1-1)

Problems introduced in f6a461c7a4
and 37b19ae6ba, respectively.

Both fail to compile because a method is called inside a lambda without
explicitly mentioning this. Some of the failures are fixed by not using
auto in the lambda parameter.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171218222144.12297-1-raphaelsc@scylladb.com>
2017-12-19 11:15:45 +01:00
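The workaround described in 928beae242 above can be shown in a minimal sketch. The class and method names are hypothetical; the point is only the explicit `this->` inside a generic lambda, which is the form the commit says the older compiler needed.

```cpp
#include <vector>

class counter {
    int _total = 0;
    void add(int x) { _total += x; }
public:
    int sum(const std::vector<int>& values) {
        // `this->add(v)` rather than plain `add(v)`: the member call names
        // `this` explicitly inside a lambda with an `auto` parameter.
        auto visit = [this](auto v) { this->add(v); };
        for (int v : values) {
            visit(v);
        }
        return _total;
    }
};
```

The alternative mentioned in the commit, replacing `auto v` with a concrete parameter type such as `int v`, makes the lambda non-generic and avoids the issue in the same way.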
Avi Kivity
d97ea6b0f4 Merge seastar upstream
* seastar 2b23547...adaca37 (7):
  > Merge "Support for skipping over bytes from input stream in input_stream::consume" from Vladimir
  > build: enforce Boost >= 1.58 during configuration.
  > Tutorial: beginning of documentation of CPU scheduling et al.
  > circular_buffer: make move-constructor noexcept
  > circular_buffer: convert existing documentation to doxygen format
  > build: fix detection of membarrier syscall support
  > Merge "Improve systemwide_memory_barrier() on newer Linuces" from Avi
2017-12-19 11:21:35 +02:00
Avi Kivity
8dbc6bbcdc Update scylla-ami submodule
* dist/ami/files/scylla-ami be90a3f...3366c93 (1):
  > scylla_install_ami: skip ec2_check while building AMI
2017-12-19 10:10:22 +02:00
Takuya ASADA
77fbdd487c dist/ami: Switch to official CentOS base image
We had switched to our own CentOS base image because we couldn't make the built
AMI public due to base image settings, probably because the image was provided
via the AWS Marketplace.
However, I've found an official image outside of the Marketplace, and I succeeded
in making an AMI built from it public.
URL: https://wiki.centos.org/Cloud/AWS

Now that we are able to use the official image, we probably should.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1513659164-28029-1-git-send-email-syuu@scylladb.com>
2017-12-19 10:07:14 +02:00
Tomasz Grabiec
37b19ae6ba Merge "Migrate cache to use flat_mutation_reader" from Piotr 2017-12-18 17:53:20 +01:00
Piotr Jastrzebski
d756c49baf Rename cache_streamed_mutation_test to cache_flat_mutation_reader_test
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:57 +01:00
Piotr Jastrzebski
14d98aaa0b Rename row_cache::create_underlying_flat_reader to
create_underlying_reader

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:57 +01:00
Piotr Jastrzebski
49993e56a9 Remove unused row_cache::create_underlying_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:57 +01:00
Piotr Jastrzebski
b976872c1a Rename all *_underlying_flat methods in read_context to *_underlying.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:57 +01:00
Piotr Jastrzebski
1457a3d771 Rename cache_entry::*read_flat to cache_entry::*read
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:57 +01:00
Piotr Jastrzebski
8b796a884f Rename read_context::enter_flat_partition to enter_partition
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:57 +01:00
Piotr Jastrzebski
8d37b71843 Rename autoupdating_underlying_flat_reader to autoupdating_underlying_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:57 +01:00
Piotr Jastrzebski
9789c37e9d Remove autoupdating_underlying_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
893e434207 Stop using autoupdating_underlying_reader in read_context
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
6e9b54cc77 Remove unused cache_streamed_mutation
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
df17bad13b Remove unused cache_entry::read and do_read
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
003670c3cd Remove unused read_directly_from_underlying overload
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
9fab29be82 Rename _sm to _reader in scanning_and_populating_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
610fa7a2c2 Stop using streamed_mutation in scanning_and_populating_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
3153d5d2c2 Rename _sm to _reader in single_partition_populating_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
556edfab29 Stop using streamed mutation in single_partition_populating_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
fec4468669 Add read_directly_from_underlying that returns flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
7012dc1049 Add make_delegating_reader
It creates a flat_mutation_reader from a reference to another reader.

This makes it easier to compose code in more elegant way.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
4088dcba5a Add make_nonforwardable for flat_mutation_reader.
It turns a reader that allows fast forwarding into
a reader that does not allow it.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:53 +01:00
Piotr Jastrzebski
47eb609aeb Change fill_buffer_from_streamed_mutation to fill_buffer_from
that can handle both streamed_mutation and flat_mutation_reader
as source.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:24:16 +01:00
Nadav Har'El
ba3cb057f5 Fix compilation of tests/hint_test.cc
Starting with commit fb0866ca20, tests
do not have to, and MUST NOT, define the disk error handlers. If they
do, we get a re-definition of variables already defined in
disk-error-handler.cc.

tests/hint_test.cc was apparently written before that commit, so we
need to remove the duplicate variables to get it to link.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20171218133635.20500-1-nyh@scylladb.com>
2017-12-18 15:37:19 +02:00
Nadav Har'El
101cce3c79 Fix compilation of tests/commitlog_test.cc
In commit 878d58d23a, a new parameter was
added to commitlog::descriptor. The commit message says that "It's default
value is a descriptor::FILENAME_PREFIX." while in reality, it did not have
a default value and compilation of tests/commitlog_test.cc broke, because
it didn't specify a value.

So this patch adds a default value for this parameter, as was suggested
by the original commit message.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20171218131020.17883-1-nyh@scylladb.com>
2017-12-18 15:35:34 +02:00
Nadav Har'El
73aad5736f Fix compilation of tests/cql_test_env.cc
In commit 1f4f71e619, an
stdx::optional<std::vector<sstring>> parameter was added to storage_proxy's
constructor. However, this parameter was not made optional, and
tests/cql_test_env.cc failed to compile because it didn't provide this
parameter.

This patch makes this parameter optional (if missing, it's like an empty
stdx::optional) so cql_test_env.cc compiles.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20171218132121.18782-1-nyh@scylladb.com>
2017-12-18 15:32:54 +02:00
Piotr Jastrzebski
527b48564d Fix fast_forward_to in make_forwardable
It wasn't setting _end_of_stream to false which
is necessary.
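
A minimal sketch of why the flag must be reset, using a toy reader rather than the real flat_mutation_reader. The class `toy_forwardable_reader`, its integer "positions", and its methods are hypothetical and exist only for illustration:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Toy stand-in for a forwardable reader over sorted integer positions.
class toy_forwardable_reader {
    std::vector<int> _data;      // sorted positions
    std::size_t _pos = 0;
    int _range_end;              // exclusive end of the current range
    bool _end_of_stream = false;
public:
    toy_forwardable_reader(std::vector<int> data, int range_end)
        : _data(std::move(data)), _range_end(range_end) {}

    bool is_end_of_stream() const { return _end_of_stream; }

    // Stores the next position of the current range in `out` and returns true,
    // or marks the end of the stream and returns false.
    bool next(int& out) {
        if (_end_of_stream) {
            return false;
        }
        if (_pos < _data.size() && _data[_pos] < _range_end) {
            out = _data[_pos++];
            return true;
        }
        _end_of_stream = true;
        return false;
    }

    // Moves the reader to the range [start, end).
    void fast_forward_to(int start, int end) {
        while (_pos < _data.size() && _data[_pos] < start) {
            ++_pos;
        }
        _range_end = end;
        // Without this reset the reader would stay "finished" forever
        // and never emit anything from the new range.
        _end_of_stream = false;
    }
};
```

With the reset in place, a reader that has exhausted its first range starts producing positions again after `fast_forward_to()`.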

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
880623e2e9 Use cache_entry::read_flat in make_flat_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
a9b6551584 Add cache_entry::read_flat
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
8a275dfaeb Create transform for flat_mutation_reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
a322268416 Turn cache_flat_mutation_reader into a flat reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
f4e048f6ff Add consume_mutation_fragments_until to flat_mutation_reader.hh
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
82c603069b Make cache_flat_mutation_reader a friend of row_cache and cache_tracker
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
f467e84424 Rename cache_streamed_mutation to cache_flat_mutation_reader
in cache_flat_mutation_reader.hh

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
3075780097 Make copy of cache_streamed_mutation.hh
and call it cache_flat_mutation_reader.hh.
It will be turned into a flat mutation reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
072fc2a309 Move lsa_manager to row_cache.hh
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
d525a306a0 Add reserve_one to flat_mutation_reader::impl
This will be used in cache.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
714868db2d Use autoupdating_underlying_flat_reader in read_context
and add read_context::enter_flat_partition. This will
temporarily coexist with read_context::enter_partition
but after everything in cache is migrated to flat reader
the new method will replace old one.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:09 +01:00
Piotr Jastrzebski
3e980cac3d Make autoupdating_underlying_flat_reader use flat reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:09 +01:00
Piotr Jastrzebski
77b6f7c599 read_context: create a copy of autoupdating_underlying_reader
called autoupdating_underlying_flat_reader. It will be modified
in the next patch to use flat reader to underlying.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:09 +01:00
Piotr Jastrzebski
bf4e1c0c54 Add row_cache::create_underlying_flat_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:09 +01:00
Piotr Jastrzebski
16a0d306fd Turn scanning_and_populating_reader into flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:09 +01:00
Piotr Jastrzebski
656e8622e1 Turn single_partition_populating_reader into flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:09 +01:00
Piotr Jastrzebski
1a7011b6b5 Extract fill_buffer_from_streamed_mutation
it will be reused in other readers.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:26:44 +01:00
Raphael S. Carvalho
38318c753a sstables: remove column_family from compaction_weight_registration
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-17 17:42:52 -02:00
Raphael S. Carvalho
eff62bc61e compaction_manager: serialize compaction of same size tier for different cfs
Currently, compaction manager will serialize compaction of same size tier
(or weight) if they belong to the same column family. However, it fails to
do so if the compaction jobs belong to different column families.
That can lead to an ungodly amount of running compaction which gets worse
the higher the number of shards and active column families. The problem
is that it may affect overall system performance due to excessive resource
usage. It's easy to trigger it during bootstrapping after loading node with
new sstables or repairing, or if lots of cfs are being actively written.

That being said, compaction jobs of same size tier are now serialized
on a given shard, such that maximum number of compaction (system wise)
is now:
(SHARDS) * (SIZE TIERS)
instead of:
(SHARDS) * (COLUMN FAMILIES) * (SIZE TIERS)

We'll work hard to release a size tier (weight) for a column family
waiting on it as fast as possible, given that we wouldn't like to
underutilize resources available for compaction. We want one starting
after the other. Compaction for a column family that cannot run now
because the size tier is taken, will be postponed. There's a worker
that will be sleeping on a condition variable that will be signalled
whenever a compaction completes. FIFO ordering is used on postponed
list for fairness.
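
A minimal sketch of the scheme described above, not the code from this patch. It assumes one registry per shard, integer column family ids, and integer weights; the names `weight_registry`, `try_register`, `deregister`, and `max_compactions` are hypothetical. The real implementation wakes a worker through a condition variable, whereas here the caller asks for the next runnable job directly:

```cpp
#include <cstdint>
#include <deque>
#include <optional>
#include <set>
#include <utility>

// Hypothetical per-shard registry: at most one running compaction per weight.
class weight_registry {
    std::set<int> _taken;                        // weights currently in use
    std::deque<std::pair<int, int>> _postponed;  // (cf_id, weight), FIFO order
public:
    // Returns true if the job may start now; otherwise the job is postponed.
    bool try_register(int cf_id, int weight) {
        if (_taken.insert(weight).second) {
            return true;
        }
        _postponed.emplace_back(cf_id, weight);
        return false;
    }

    // Releases a weight and hands it to the oldest postponed job that can
    // now run, if there is one. The returned job already owns its weight.
    std::optional<std::pair<int, int>> deregister(int weight) {
        _taken.erase(weight);
        for (auto it = _postponed.begin(); it != _postponed.end(); ++it) {
            if (_taken.count(it->second) == 0) {
                auto job = *it;
                _postponed.erase(it);
                _taken.insert(job.second);
                return job;
            }
        }
        return std::nullopt;
    }
};

// Upper bound on concurrent compactions system-wide under this scheme.
constexpr std::uint64_t max_compactions(std::uint64_t shards, std::uint64_t size_tiers) {
    return shards * size_tiers;
}
```

For example, with 8 shards and 4 size tiers the bound is 32 running compactions regardless of how many column families are active.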

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-17 17:42:48 -02:00
Raphael S. Carvalho
fa0e53f626 sstables: introduces deregister() and weight() to compaction_weight_registration
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-17 17:34:08 -02:00
Raphael S. Carvalho
20d8a2c045 sstables: move compaction_weight_registration to its own header
That will be needed for using it in compaction.hh. We can't declare
compaction_weight_registration in compaction_manager.hh, because
compaction.hh can't include the former due to cyclic dependency,
so compaction_weight_registration will be declared in its own
header.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-17 17:26:51 -02:00
Raphael S. Carvalho
49f3cfe746 sstables: improve compact_sstables() interface
Motivation is that a new field in the descriptor will be forwarded
to compaction procedure without extending parameter list even more.
Also beautifies the interface, making it concise and easier to
play with.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-17 17:22:19 -02:00
Michael Munday
0a67a505a5 sstables: write summary in little-endian byte order on big-endian systems
The summary positions are defined to be in 'native' byte order.
Unfortunately this makes sharing files between big- and little-endian
machines much more difficult. For example, test files need to be
generated for both potential byte orders.

This change sets the byte order of the affected data to little-endian.
Ideally there would still be a way to deal with files generated on
big-endian systems using the 'native' byte order (see #3056).
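
A minimal sketch of writing a 64-bit value in a fixed little-endian layout regardless of host byte order. The helper names `to_little_endian` and `from_little_endian` are assumptions for illustration; the patch itself may rely on different byte-order helpers:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Encodes `value` as 8 bytes, least significant byte first, on any host.
inline std::array<unsigned char, 8> to_little_endian(std::uint64_t value) {
    std::array<unsigned char, 8> out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = static_cast<unsigned char>((value >> (8 * i)) & 0xff);
    }
    return out;
}

// Decodes 8 little-endian bytes back into a host integer.
inline std::uint64_t from_little_endian(const std::array<unsigned char, 8>& in) {
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        value |= static_cast<std::uint64_t>(in[i]) << (8 * i);
    }
    return value;
}
```

Because the encoding uses shifts rather than a memory copy of the integer, the bytes on disk are the same on big- and little-endian machines.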

Message-Id: <20171212183652.87881-1-mike.munday@ibm.com>
2017-12-17 11:10:49 +02:00
Glauber Costa
b8f49fcc14 conf: document listen_on_broadcast_address
That's a supported feature that is listed in our help message, but it
is not present in the yaml file.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171215011240.16027-1-glauber@scylladb.com>
2017-12-17 10:55:09 +02:00
Vlad Zolotarov
be6f8be9cb messaging_service: fix multi-NIC support
Don't enforce the outgoing connections from the 'listen_address'
interface only.

If 'local_address' is given to connect() it will enforce it to use a
particular interface to connect from, even if the destination address
should be accessed from a different interface. If we don't specify the
'local_address' the source interface will be chosen according to the
routing configuration.

Fixes #3066

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1513372688-21595-1-git-send-email-vladz@scylladb.com>
2017-12-17 10:51:20 +02:00
Avi Kivity
11de20fc33 Merge "SSTable summary regeneration fixes" from Raphael
"Fixes #3057."

* 'summary_recreation_fixes_v2' of github.com:raphaelsc/scylla:
  tests: sstable summary recreation sanity test
  sstables: make loading of sstable without summary to work again
  sstables: fix summary generation with dynamic index sampling
2017-12-17 09:17:36 +02:00
Takuya ASADA
c2e87f4677 dist/common/systemd: specify correct repo file path for housekeeping service on Ubuntu/Debian
Currently scylla-housekeeping-daily.service/-restart.service hardcoded
"--repo-files '/etc/yum.repos.d/scylla*.repo'" to specify CentOS .repo file,
but we use same .service for Ubuntu/Debian.
It doesn't work correctly, we need to specify .list file for Debian variants.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1513385159-15736-1-git-send-email-syuu@scylladb.com>
2017-12-16 22:03:25 +02:00
Duarte Nunes
f6a461c7a4 Merge 'hinted handoff' from Vlad
"This series is the first part of hinted handoff implementation.
It includes:
   - Minor adjustment of commitlog layer.
   - Generation of hints when storage_proxy calls hint_to_dead_endpoints(...).
   - Sending the hints to the Node that becomes UP.

It doesn't include:
   - Node decommissioning.
   - Resharding."

* 'hinted_handoff-v7-1' of github.com:vladzcloudius/scylla:
  main + storage_service: wire up hints generation
  config: add hints related options
  db::hints::manager: initial commit
  tracing: make the session state modifying methods and tracing::trace(...) noexcept
  utils::fb_utilities: add is_me(addr) method
  tests: hint_test: initial commit
  db::commitlog::replay_position: added std::hash<replay_position>
  db::commitlog: truncate segments to their actual sizes during shutdown
  db::commitlog: allow defining a metrics category name
  db/commitlog/commitlog::descriptor: add a filename_prefix parameter
  db::commitlog::descriptor::descriptor(filename): pass a filename as a const ref
  docs: hinted_handoff_design.md: high level design of a Hinted Handoff feature
2017-12-14 21:16:40 +01:00
Vlad Zolotarov
1f4f71e619 main + storage_service: wire up hints generation
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:08:11 -05:00
Vlad Zolotarov
c2296c9575 config: add hints related options
- hints_directory:
      - This option allows defining of the directory where hints files are going
        to be stored if hinted handoff is enabled.

   - hinted_handoff_enabled:
      - May receive either a boolean value or a list of DCs. In the latter case this
        will define the DCs to which Nodes hints are going to be generated.

   - max_hint_window_in_ms:
      - Maximum amount of milliseconds the hints are going to be generated to the Node that is DOWN.
        After this time period the hints are no longer going to be generated until the Node is seen UP.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:08:11 -05:00
Vlad Zolotarov
51bbf18c08 db::hints::manager: initial commit
Currently implemented:
   - Hints generation: db::hints::manager::store_hint(...).
   - Sending: db::hints::manager::on_timer().

TODO:
   - Resharding.
   - Node decommission.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:08:07 -05:00
Vlad Zolotarov
fcff872089 tracing: make the session state modifying methods and tracing::trace(...) noexcept
Make state session creation, stop_forground() and tracing::trace(...) methods
noexcept.
Most of them have already been implemented the way that they won't throw
but this patch makes it official...

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:48 -05:00
Vlad Zolotarov
6c037899b5 utils::fb_utilities: add is_me(addr) method
Add a widely used method that returns TRUE if a given address is a broadcast
address of the local node.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:48 -05:00
Vlad Zolotarov
b20dbe16d8 tests: hint_test: initial commit
Test the regular commitlog with the custom file name prefix.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:48 -05:00
Vlad Zolotarov
ec15d60a2d db::commitlog::replay_position: added std::hash<replay_position>
It's needed for hinted handoff.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:48 -05:00
Vlad Zolotarov
af70c0a709 db::commitlog: truncate segments to their actual sizes during shutdown
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:48 -05:00
Vlad Zolotarov
033af6c950 db::commitlog: allow defining a metrics category name
Add a new field to db::commitlog::config that would define the metrics category name.
If not given - metrics are not going to be registered.
Set it to "commitlog" in db::commitlog::config(const db::config&).

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:47 -05:00
Vlad Zolotarov
878d58d23a db/commitlog/commitlog::descriptor: add a filename_prefix parameter
This parameter is used when creating a new segment.
Its default value is descriptor::FILENAME_PREFIX.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:47 -05:00
Vlad Zolotarov
719b1fb24f db::commitlog::descriptor::descriptor(filename): pass a filename as a const ref
Avoid an unneeded copy by passing the file name as a reference.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:47 -05:00
Vlad Zolotarov
1ddb6e6509 docs: hinted_handoff_design.md: high level design of a Hinted Handoff feature
Hinted Handoff is a feature that allows replaying failed writes.
The mutation and the destination replica are saved in a log and replayed
later according to the feature configuration.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:47 -05:00
Raphael S. Carvalho
b5ace682a4 tests: sstable summary recreation sanity test
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-14 16:59:36 -02:00
Raphael S. Carvalho
cdfa4d5c0d sstables: make loading of sstable without summary to work again
Boot failed when loading an sstable with a missing summary because an
internal procedure failed to take into account that an sstable
can have its summary recreated from index. Make it work again
by making that procedure aware of that.

Fixes #3057.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-14 16:59:22 -02:00
Raphael S. Carvalho
7c6a19fcc8 sstables: fix summary generation with dynamic index sampling
When recreating summary, data length was passed as data offset to
procedure that decides whether to sample or not. The problem is
that the procedure decides to sample index entry if data offset
is beyond a threshold. So the resulting summary will contain
only N sequential indexes entries starting from the first one,
which makes it quite inefficient. What should be done instead
is to pass position of current index entry, so summary content
will be as if it was created by a regular sstable write.
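
A minimal sketch of the failure mode, assuming a simplified sampler that takes an index entry whenever the offset it is given reaches the next threshold. The names `sampling_state`, `should_sample`, and `build_summary`, and the fixed sampling interval, are hypothetical and not taken from the sstable code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sampler state: sample once every `interval` bytes of data.
struct sampling_state {
    std::uint64_t interval;
    std::uint64_t next_threshold = 0;
};

// Samples the current entry if its data offset reached the threshold.
inline bool should_sample(sampling_state& s, std::uint64_t data_offset) {
    if (data_offset >= s.next_threshold) {
        s.next_threshold += s.interval;
        return true;
    }
    return false;
}

// Returns the indexes of the entries chosen for the summary.
// use_entry_position == false reproduces the bug: the total data length
// is passed for every entry instead of that entry's own position.
inline std::vector<std::size_t> build_summary(const std::vector<std::uint64_t>& entry_positions,
                                              std::uint64_t data_length,
                                              std::uint64_t interval,
                                              bool use_entry_position) {
    sampling_state s{interval};
    std::vector<std::size_t> sampled;
    for (std::size_t i = 0; i < entry_positions.size(); ++i) {
        const std::uint64_t offset = use_entry_position ? entry_positions[i] : data_length;
        if (should_sample(s, offset)) {
            sampled.push_back(i);
        }
    }
    return sampled;
}
```

With ten entries at positions 0, 100, ..., 900, a data length of 1000, and an interval of 250, passing each entry's position spreads the samples across the file (entries 0, 3, 5, 8), while passing the data length samples only the first five entries in a row.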

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-14 16:34:00 -02:00
Piotr Jastrzebski
ceaf0dee99 Introduce row_cache::make_flat_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-14 12:49:39 +01:00
Piotr Jastrzebski
ac1d2f98e4 Fix build by removing semicolon after concept
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <4504cf47be0a451c58052476bc8cc4f9cba59472.1513248094.git.piotr@scylladb.com>
2017-12-14 10:46:13 +00:00
Raphael S. Carvalho
95d1995876 fix compilation of stream_session.cc
stream_session.cc:417:62: error: cannot call member function ‘utils::UUID streaming::stream_session::plan_id()’ without object
         sslog.warn("[Stream #{}] Failed to send: {}", plan_id(), ep);

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171214022621.19442-1-raphaelsc@scylladb.com>
2017-12-14 10:57:33 +01:00
Amos Kong
b07de93636 Reset default cluster_name back to 'Test Cluster' for compatibility
Some users rely on the original default cluster_name, 'Test Cluster'.
Their nodes will fail to start because of the cluster_name change if they
use the new scylla.yaml.

'ScyllaDB Cluster' is no better than 'Test Cluster', so reset to the
original default to avoid problems for users.

Fixes #3060

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <8c9dab8a64d0f4ab3a5d6910b87af696c60e5076.1513072453.git.amos@scylladb.com>
2017-12-13 16:57:43 +02:00
Avi Kivity
6cb3b29168 Merge "Convert sstable readers to flat streams" from Paweł
"While aa8c2cbc16 'Merge "Migrate sstables
to flat_mutation_reader" from Piotr' has converted the low-level sstable
reader to the new flat_mutation_reader interface there were still
multiple readers related to sstables that required converting,
including:
 - restricted reader
 - filtering reader
 - single partition sstable reader
This series completes their conversion to the flat stream interface."

* tag 'flat_mutation_reader-sstable-readers/v2' of https://github.com/pdziepak/scylla:
  db: convert single_key_sstable_reader to flat streams
  db: fully convert incremental_reader_selector to flat readers
  db: make make_range_sstable_reader() return flat reader
  db: make column_family::make_reader() return flat reader
  db: make column_family::make_sstable_reader() return a flat reader
  filtering_reader: switch to flat mutation fragment streams
  filtering_reader: pass a const dht::decorated_key& to the callback
  mutation_reader: drop make_restricted_reader()
  db: use make_restricted_flat_reader
  mutation_reader: convert restricted reader to flat streams
2017-12-13 15:37:26 +02:00
Paweł Dziepak
8e0da776ab db: convert single_key_sstable_reader to flat streams
Before flat mutation readers sstable::read_row() returned a
future<streamed_mutation>. That required a helper reader that would wait
for the streamed_mutations from all relevant sstables to be created and
then construct a mutation merger.
With flat mutation readers sstable::read_row_flat() returns a
flat_mutation_reader (no futures) so that the code can be simplified by
collecting all the relevant readers and creating a combined reader
without suspension points.
The unfortunate disadvantage of the flat_mutation_reader-based approach
is the fact that combined reader now needlessly compares the partition
keys even though we know that we read only a single partition, but
optimising that is out of scope of this patch.
2017-12-13 12:01:03 +00:00
Paweł Dziepak
24026a0c7d db: fully convert incremental_reader_selector to flat readers
2017-12-13 12:01:03 +00:00
Paweł Dziepak
73b3d02cc0 db: make make_range_sstable_reader() return flat reader
2017-12-13 12:01:03 +00:00
Paweł Dziepak
8b3c3fc832 db: make column_family::make_reader() return flat reader
2017-12-13 12:01:03 +00:00
Paweł Dziepak
e12959616c db: make column_family::make_sstable_reader() return a flat reader
2017-12-13 12:01:03 +00:00
Paweł Dziepak
a0a13ceb46 filtering_reader: switch to flat mutation fragment streams
2017-12-13 12:01:03 +00:00
Paweł Dziepak
3bbb3b300d filtering_reader: pass a const dht::decorated_key& to the callback
All users of the filtering reader need only the decorated key of a
partition, but currently the predicate is given a reference to
streamed_mutations which are obsolete now.
2017-12-13 11:57:27 +00:00
Paweł Dziepak
d8dad04564 mutation_reader: drop make_restricted_reader()
make_restricted_reader() has been replaced by
make_restricted_flat_reader().
2017-12-13 11:57:22 +00:00
Paweł Dziepak
f3901eb154 db: use make_restricted_flat_reader
2017-12-13 10:46:41 +00:00
Paweł Dziepak
3839bc5d60 mutation_reader: convert restricted reader to flat streams
2017-12-13 10:46:41 +00:00
Asias He
a9dab60b6c streaming: One cf per time on sender
When there is a large number of column families, the sender sends all
of them in parallel. We allow 20% of shard memory for streaming on the
receiver, so each column family gets 1/N of that memory for its memtable,
where N is the number of in-flight column families. A large N causes a lot
of small sstables to be generated.

It is still possible for multiple senders to stream to a single receiver,
e.g., when a new node joins the cluster, in which case the maximum number
of in-flight column families is the number of peer nodes. The column
families are sent in the order of cf_id. Peers are not guaranteed to run
at the same speed, so they will not necessarily send the same cf_id at the
same time, but there is still a chance that some of them do.

Fixes #3065

Message-Id: <46961463c2a5e4f1faff232294dc485ac4f1a04e.1513159678.git.asias@scylladb.com>
2017-12-13 12:32:41 +02:00
Glauber Costa
1aabbc75ab database: delete created SSTables if streaming writes fail
We have had an issue recently where failed SSTable writes left the
generated SSTables dangling in a potentially invalid state. If the write
had, for instance, started and generated tmp TOCs but not finished,
those files would be left for dead.

We had fixed this in commit b7e1575ad4,
but streaming memtables still have the same issue.

Note that we can't fix this in the common function
write_memtable_to_sstable because different flushers have different
retry policies.

Fixes #3062

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171213011741.8156-1-glauber@scylladb.com>
2017-12-13 10:09:20 +02:00
Avi Kivity
73428c96bd Merge "Refined security model for roles" from Jesse
"This patch series refines the security model for the upcoming switch to
roles-based access control. Roles still do not have any function,
but CQL statements related to roles manipulate metadata. The next major
patch series after this one will switch the system to roles.

Previously, most operations around roles required superuser, but this
violates an important idea in security called the "principle of least
privilege": that a user should have only the minimum access possible to
resources in order to achieve their objective.

To that end, this patch series introduces permissions on role resources.
For example, to grant a role to a user, the performing user must have
been granted AUTHORIZE on the role being granted.

In the table below, a user (role) that has been granted the permission
in the left-most column can perform the CQL query in the right columns
depending on if the permission has been granted to the root role
resource (all roles), or a particular role resource.

Perm.           All roles               Specific role (r)
---------------------------------------------------------
CREATE          CREATE ROLE

ALTER           ALTER ROLE *            ALTER ROLE r

DROP            DROP ROLE *             DROP ROLE r

AUTHORIZE       GRANT ROLE */REVOKE     GRANT ROLE r/
                ROLE *                  REVOKE ROLE r

DESCRIBE        LIST ROLES

The following restrictions around superuser exist:

- CREATE ROLE: Only a superuser can create a superuser role.

- ALTER ROLE: Only a superuser can alter the superuser status of a role.

- ALTER ROLE: You cannot alter the superuser status of yourself or of a
  role granted to you.

- DROP ROLE: Only a superuser can drop a role that has superuser.

The following additional "escape hatches" apply:

- ALTER ROLE: You can alter yourself (except to give yourself
  superuser).

- LIST ROLES: You can list your own roles and list the roles of any role
  granted to you.

Finally, a note on terminology: I like to say a role (or user) "is"
superuser if the role (user) has directly been marked as a superuser. A
role (user) "has" superuser if they have been granted a role that is a
superuser. The second statement encompasses the first, since a role can
always be said to have been granted to itself.

Fixes #2988."

* 'jhk/role_permissions/v2' of https://github.com/hakuch/scylla: (24 commits)
  auth: Move permissions cache instance to service
  auth: Add roles query function to service
  cql3: Update access checks for `revoke_role_statement`
  cql3: Update access checks in `grant_role_statement`
  cql3: Update access checks in `list_roles_statement`
  cql3: Update access checks in `drop_role_statement`
  cql3: Update access checks in `alter_role_statement`
  cql3: Update access checks in `create_role_statement`
  tests: Switch to dedicated testing superuser
  auth: Publicize enforcing check for service
  tests: Expose client state from test env
  Allow checking permissions from `client_state`
  auth: Support querying for granted superuser
  auth/service.hh: Document the class
  cql3: Change `create_role_statement` base
  cql3/Cql.g: Add role resources to grammar
  cql3/Cql.g: Avoid extra copy of `auth::resource`
  auth:resource.cc: Use `string_view` in reverse map
  auth: Add `role` resource kind
  auth: Add the DESCRIBE permission
  ...
2017-12-12 19:52:10 +02:00
Jesse Haber-Kucharsky
092f2e659c auth: Move permissions cache instance to service
Instead of a single sharded service shared by all instances of
`auth::service`, it makes more sense for each instance of
`auth::service` to own its own instance of the permissions cache.
2017-12-12 12:22:46 -05:00
Jesse Haber-Kucharsky
59911411ed auth: Add roles query function to service
While it just calls into the underlying role manager, this level of
indirection allows us to add a roles cache in the future (which is
consistent with the behavior of Apache Cassandra).
2017-12-12 12:22:42 -05:00
Jesse Haber-Kucharsky
fff120a2be cql3: Update access checks for revoke_role_statement
A role can be revoked from another role if the user has AUTHORIZE
permission on the role being revoked.
2017-12-12 12:07:11 -05:00
Jesse Haber-Kucharsky
de05baafd2 cql3: Update access checks in grant_role_statement
A role can be granted to another role if the user has AUTHORIZE on the
role being granted.
2017-12-12 12:07:11 -05:00
Jesse Haber-Kucharsky
749575bbf6 cql3: Update access checks in list_roles_statement
A user with DESCRIBE on the root role resource can list the roles granted
to any role, and also all the roles in the system.

Otherwise, a user can list all the roles it has been granted and can
list all roles granted to those roles.
2017-12-12 12:07:11 -05:00
Jesse Haber-Kucharsky
4618766431 cql3: Update access checks in drop_role_statement
A role can be dropped if the performer has DROP permission on the role.
A role that has superuser (either directly or through another role
it has been granted) cannot be dropped except by a superuser.
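
A minimal sketch of the check described above, over a hypothetical in-memory role model. The types `role_info` and `role_map` and the functions `has_superuser` and `can_drop_role` are assumptions for illustration, not the `auth::service` API. The sketch also assumes that the caller has already resolved whether the performer holds DROP on the target, and it treats a performer that "has" superuser as a superuser:

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical role model: a flag plus the roles granted directly to it.
struct role_info {
    bool is_superuser = false;
    std::vector<std::string> granted;
};
using role_map = std::map<std::string, role_info>;

// A role "has" superuser if it is a superuser itself or if any role
// granted to it, directly or transitively, is a superuser.
inline bool has_superuser(const role_map& roles, const std::string& name) {
    std::set<std::string> seen;
    std::vector<std::string> pending{name};
    while (!pending.empty()) {
        const std::string current = pending.back();
        pending.pop_back();
        if (!seen.insert(current).second) {
            continue;  // already visited; also guards against cycles
        }
        const auto it = roles.find(current);
        if (it == roles.end()) {
            continue;
        }
        if (it->second.is_superuser) {
            return true;
        }
        for (const auto& g : it->second.granted) {
            pending.push_back(g);
        }
    }
    return false;
}

// DROP ROLE is allowed with DROP permission on the target, except that a
// target that has superuser can only be dropped by a superuser.
inline bool can_drop_role(const role_map& roles, const std::string& performer,
                          bool performer_has_drop_permission, const std::string& target) {
    if (!performer_has_drop_permission) {
        return false;
    }
    if (has_superuser(roles, target) && !has_superuser(roles, performer)) {
        return false;
    }
    return true;
}
```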
2017-12-12 12:07:11 -05:00
Jesse Haber-Kucharsky
9f4281cc77 cql3: Update access checks in alter_role_statement
Only superusers can alter superuser status, and only for roles not
granted to them. You can always alter your own role. You can alter
another role if you have ALTER permission on the role.
2017-12-12 12:07:11 -05:00
Jesse Haber-Kucharsky
fe6e9fe923 cql3: Update access checks in create_role_statement
CREATE ROLE requires CREATE on <ALL ROLES>. Creating a superuser role
requires that the performer is a superuser.

This change also forms the beginning of a test suite for the CQL
interface to roles. We start with verifying access-control properties of
CREATE ROLE as written in this patch.
2017-12-12 12:07:11 -05:00
Jesse Haber-Kucharsky
10d3dab9ac tests: Switch to dedicated testing superuser
The auth service will eventually add the default
superuser ("cassandra"), but the current code does so after a delay.
Using a dedicated superuser for unit tests side-steps the issue and
allows the user to be created immediately.
2017-12-12 12:07:11 -05:00
Jesse Haber-Kucharsky
56d84d4e26 auth: Publicize enforcing check for service
2017-12-12 12:06:49 -05:00
Jesse Haber-Kucharsky
af670328e1 tests: Expose client state from test env
This is useful for manipulating and querying the current user.
2017-12-12 12:03:01 -05:00
Jesse Haber-Kucharsky
6f9df19eb8 Allow checking permissions from client_state
Previously, this function was private and only `ensure_has_permission`
was public. `ensure_has_permission` throws in the absence of a
permission, but it can also be useful to query a permission without it
being an error.
2017-12-12 12:03:01 -05:00
Jesse Haber-Kucharsky
7339295969 auth: Support querying for granted superuser
This functionality is useful for implementing CQL statements and will
replace `auth::is_super_user` once roles have replaced users in Scylla.

Since eventually the auth service will have a roles cache, this function
is here rather than a part of `role_manager`.
2017-12-12 12:02:38 -05:00
Jesse Haber-Kucharsky
56e1f2e30f auth/service.hh: Document the class
2017-12-12 11:24:44 -05:00
Jesse Haber-Kucharsky
daea70abe3 cql3: Change create_role_statement base
It is an `authentication_statement`, not an
`authorization_statement` (really, it's neither, but we're being
consistent with Apache Cassandra).
2017-12-12 11:06:49 -05:00
Jesse Haber-Kucharsky
a0cffead69 cql3/Cql.g: Add role resources to grammar
2017-12-12 11:06:49 -05:00
Jesse Haber-Kucharsky
4ae6b02572 cql3/Cql.g: Avoid extra copy of auth::resource
2017-12-12 11:06:49 -05:00
Jesse Haber-Kucharsky
77da3c4496 auth:resource.cc: Use string_view in reverse map
This avoids unnecessary copies.
2017-12-12 11:06:49 -05:00
Jesse Haber-Kucharsky
0546007fb5 auth: Add role resource kind
2017-12-12 11:06:35 -05:00
Jesse Haber-Kucharsky
9452533230 auth: Add the DESCRIBE permission
When a user is granted DESCRIBE on all roles (a resource kind that
doesn't exist yet in the code, but will exist soon), they gain the
ability to execute LIST ROLES queries.
2017-12-12 10:59:26 -05:00
Jesse Haber-Kucharsky
d29463beba auth: Support resource-specific permission sets
Different kinds of resources support different permissions. For example,
a keyspace supports the CREATE permission, which allows a user to
create tables in that keyspace. However, a table does not have an
applicable CREATE permission.

If a non-applicable permission is requested, an
`invalid_request_exception` is thrown.
2017-12-12 10:59:26 -05:00
Avi Kivity
e6940d8d4a Merge "Gossip propagation and stabilization" from Calle
"Fixes #2866
Fixes #2894

Changes gossip propagation to allow "atomic" grouping of values to ensure
their respective order.
Modifies gossip bootstrap startup to potentially wait longer in cases
where stabilization (messages done) takes time, to avoid data loss
in repair."

* 'calle/gossip' of github.com:scylladb/seastar-dev:
  gossip: wait for stabilized gossip on bootstrap
  gossiper: Prevent race condition in  propagation
  utils::to_string: Add printers for pairs+maps
  utils::in: Add helper type for perfect forwarding initializer lists
2017-12-12 17:59:00 +02:00
Jesse Haber-Kucharsky
eb0de39c98 auth/resource.hh: Use Doxygen-style formatting
Though we're still selective about its application.
2017-12-12 10:45:26 -05:00
Jesse Haber-Kucharsky
b986f48960 auth: Remove ALL_DATA permission set
This set is equal to `permissions::ALL`. When we switch over to
resource-specific permission sets, we will filter the set of all
permissions to only those that are applicable for the resource in
question.
2017-12-12 10:30:19 -05:00
Jesse Haber-Kucharsky
b14dc07f14 auth: Move particular permission set to caller
Applicable permission sets will soon be specific to each kind of
resource. This change prepares us for dynamic querying of permission
sets by resource.
2017-12-12 10:30:19 -05:00
Avi Kivity
eda35d2a57 Merge seastar upstream
* seastar ac78eec...2b23547 (10):
  > Merge "update shares for I/O classes" from Glauber
  > Merge "Resumable tasks" from Avi
  > input_stream: un-unroll input_stream::consume()
  > net: adding yaml-based parser for network configuration supporting multiple interfaces
  > scripts: perftune.py: don't attempt to set IRQs' affinity when IRQs list is empty
  > tutorial: fix example code
  > http: api_docs add swagger 2.0 support
  > Support custom function for reading of config-files.
  > Revert "provide an interface for updating the shares of an I/O class"
  > provide an interface for updating the shares of an I/O class
2017-12-12 11:00:38 +02:00
Michael Munday
b68b82dc8d tests: loading_cache_test: align DMA buffers
DMA reads and writes require that data be correctly aligned.

Message-Id: <20171211130202.77608-1-mike.munday@ibm.com>
2017-12-11 15:04:26 +02:00
Michael Munday
aea5f3bd1c sstables: fix compression on big endian systems
The encoding logic was incorrect for big endian systems (shift needed
to be in the opposite direction). Rather than fix that issue I have
re-written the relevant code to restrict the storage format to little
endian byte order on all systems. My hope is that this will be a bit
easier to maintain.

Message-Id: <20171211124454.77488-1-mike.munday@ibm.com>
2017-12-11 14:54:22 +02:00
Michael Munday
9e99105aa2 configure.py: use default system linker if gold is not available
Most distros on s390x don't currently have gold installed by default.
Rather than disable gold on the platform add a check to see if gold
is installed and switch back to using the default system linker if it
isn't. The try_compile_and_link functionality is copied from the
seastar project.

Message-Id: <20171211122156.77385-1-mike.munday@ibm.com>
2017-12-11 14:29:43 +02:00
Paweł Dziepak
d10b74b9cf Merge "Preparatory changes before changing semantics of continuity merging" from Tomasz
"The changes in this series fall into one of the following:
  1) improve unit tests
  2) improve code reuse in mvcc so that later changes will be easier
  3) fix minor issues which were exposed by the above"

* tag 'tgrabiec/improve-and-fix-mvcc-tests-v4' of github.com:scylladb/seastar-dev:
  tests: mvcc: Add more tests for consistency of continuity merging
  tests: mvcc: Fix test_apply_is_atomic()
  tests: mvcc: Do not assume that continuity of current row is updated on partition_snapshot_row_cursor::maybe_refresh()
  mvcc: Reuse partition_snapshot_row_cursor in apply_to_incomplete()
  mvcc: Propagate region reference to partition_entry::apply_to_incomplete()
  mvcc: Introduce partition_snapshot_row_cursor::ensure_entry_if_complete()
  mvcc: partition_snapshot_row_cursor: Extract prepare_heap()
  mvcc: Add const-qualified partition_version_ref::operator*()
  tests: mvcc: Use mutation_partition_assertions
  tests: Introduce mutation_partition_assertions
  tests: Randomize static row continuity in random_mutation_generator
  tests: mutation_assertion: Introduce is_continuous()
  mvcc: Introduce partition_snapshot_row_cursor::read_partition()
  mutation_partition: Introduce deletable_row::apply() from a clustering_row fragment
  mutation_partition: Extract sliced() from mutation into mutation_partition
  mvcc: Introduce partition_snapshot::static_row_continuous()
  mvcc: Introduce partition_snapshot::range_tombstones() for full range
  mvcc: Don't require external schema in parition_snapshot::range_tombstones()
  mutation_partition: Define equal_continuity() using get_continuity()
  mutation_partition: Make check_continuity() const-qualified
  mutation_partition: Make check_continuity() public
  mutation_partition: Introduce mutation_partition::get_continuity()
  Introduce clustering_interval_set
  mutation_partition: Leave moved-from row in an empty state
  mutation_partition: Fix upgrade() not preserving static row continuity
2017-12-11 09:31:00 +00:00
Amnon Heiman
bc356a3c15 scylla_setup support private repo on debian during setup
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170917145248.19677-1-amnon@scylladb.com>
2017-12-11 10:36:30 +02:00
Jesse Haber-Kucharsky
7e3a344460 cql3: Add missing return
Since `return` is missing, the "else" branch is also taken and this
results in a user being created from scratch.

Fixes #3058.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <bf3ca5907b046586d9bfe00f3b61b3ac695ba9c5.1512951084.git.jhaberku@scylladb.com>
2017-12-11 09:55:05 +02:00
Avi Kivity
b29b091f4e Merge "Power8 porting" from Vlad
"This series includes a few patches from Michael Munday <mike.munday@ibm.com> (Z-project)
and a few from me. The most significant is PATCH10 that introduces a vectorized version
of CRC32 calculation (based on Anton Blanchard's work)."

* 'scylla-power64-port-v2-1' of https://github.com/vladzcloudius/scylla:
  test.py: limit the tests to run on 2 shards with 4GB of memory
  tests: sstable_datafile_test: fix the compilation error on Power
  tests: compound_test: fix the 'narrowing' compilation error on Power
  cql3::constants::literal: fix the empty string parser
  utils::crc32: add power64 crc32 HW accelerated implementation
  repair: use seastar::cache_line_size for aligning to the cache line size
  build: add -lcryptopp to libs
  utils/allocation_strategy: force alignment to be at least sizeof(void*)
  utils::crc: introduce process_le/be(T) methods
  utils/crc: use zlib for crc32 on non-x86 platforms
  main: only perform SSE 4.2 check on x86-family CPUs
  configure.py: don't use 'gold' linker on Power
  configure.pu: add --target flag to override -march value
2017-12-08 20:48:41 +02:00
Vlad Zolotarov
57a6ed5aaa test.py: limit the tests to run on 2 shards with 4GB of memory
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-08 13:38:13 -05:00
Vlad Zolotarov
22ca5d2596 tests: sstable_datafile_test: fix the compilation error on Power
'char' and int8_t ('signed char') are different types. The base type of 'bytes'
is int8_t - use the correct type for casting.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-08 13:38:13 -05:00
Vlad Zolotarov
6a51e6fe33 tests: compound_test: fix the 'narrowing' compilation error on Power
'bytes' has int8_t as a base type and 0xff value is out of this type's range.
Use the corresponding signed value instead.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-08 13:38:13 -05:00
Vlad Zolotarov
3ebaf86ebc cql3::constants::literal: fix the empty string parser
Don't assume that 'char' is signed - this is implementation dependent.
Compare to '\xFF' value which is the actual intent.
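
A minimal sketch of the portability issue, not the actual cql3 parser code.
The helper name `is_ff_marker` is hypothetical; the point is only that
comparing a `char` against a `char` literal works regardless of whether
plain `char` is signed on the target platform.

```cpp
#include <cassert>

// Hypothetical helper (an assumption for illustration): detect a 0xFF
// marker byte without relying on the signedness of plain 'char'.
inline bool is_ff_marker(char c) {
    // Both operands have type char, so the result is the same everywhere.
    // `c == 0xFF` compares against int 255 and is never true where char is
    // signed; `c == -1` is never true where char is unsigned.
    return c == '\xFF';
}

// Usage example: the check behaves the same on x86 (signed char) and on
// platforms such as Power where char is unsigned.
inline void ff_marker_example() {
    const char raw[] = {'a', '\xFF'};
    assert(!is_ff_marker(raw[0]));
    assert(is_ff_marker(raw[1]));
}
```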

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-08 13:38:13 -05:00
Vlad Zolotarov
0145ae2b4b utils::crc32: add power64 crc32 HW accelerated implementation
Based on the work of Anton Blanchard <anton@au.ibm.com> (IBM), which may be found
here: https://github.com/antonblanchard/crc32-vpmsum

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-08 13:38:13 -05:00
Vlad Zolotarov
97506f39b2 repair: use seastar::cache_line_size for aligning to the cache line size
Use seastar::cache_line_size for cache line alignment instead of a hard coded value (64) - this value is
not always correct, e.g. on the PPC64 platform, where the cache line size is 128B.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-08 13:38:13 -05:00
Tomasz Grabiec
e81a4476c8 tests: mvcc: Add more tests for consistency of continuity merging 2017-12-08 17:50:48 +01:00
Tomasz Grabiec
3b6167b4c4 tests: mvcc: Fix test_apply_is_atomic()
partition_entry::apply() requires that mutations are fully continuous.
2017-12-08 17:50:48 +01:00
Tomasz Grabiec
33c1f33c90 tests: mvcc: Do not assume that continuity of current row is updated on partition_snapshot_row_cursor::maybe_refresh()
It currently is updated only when iterators are invalidated. Better
to not assume that, because it's not really needed, and
maintaining this would complicate maybe_refresh() after continuity
merging rules change later.
2017-12-08 17:50:48 +01:00
Tomasz Grabiec
4094c66979 mvcc: Reuse partition_snapshot_row_cursor in apply_to_incomplete()
Reduces duplication of knowledge about how logical mutation_partition
view is obtained for multiple versions.
2017-12-08 17:50:48 +01:00
Tomasz Grabiec
12704fd679 mvcc: Propagate region reference to partition_entry::apply_to_incomplete() 2017-12-08 17:50:48 +01:00
Tomasz Grabiec
376033af13 mvcc: Introduce partition_snapshot_row_cursor::ensure_entry_if_complete() 2017-12-08 17:50:48 +01:00
Tomasz Grabiec
8e9f8d93ef mvcc: partition_snapshot_row_cursor: Extract prepare_heap() 2017-12-08 17:50:48 +01:00
Tomasz Grabiec
a6e083ef6f mvcc: Add const-qualified partition_version_ref::operator*() 2017-12-08 17:50:48 +01:00
Tomasz Grabiec
230ca7d01b tests: mvcc: Use mutation_partition_assertions 2017-12-08 17:50:48 +01:00
Tomasz Grabiec
c7539f2ed0 tests: Introduce mutation_partition_assertions
mutation_assertions are now delegating to mutation_partition_assertions.
2017-12-08 17:50:47 +01:00
Tomasz Grabiec
0ddb419eca tests: Randomize static row continuity in random_mutation_generator 2017-12-08 17:50:47 +01:00
Tomasz Grabiec
a3f9799d70 tests: mutation_assertion: Introduce is_continuous() 2017-12-08 17:50:47 +01:00
Tomasz Grabiec
05a19737e4 mvcc: Introduce partition_snapshot_row_cursor::read_partition()
Useful in tests.
2017-12-08 17:50:47 +01:00
Tomasz Grabiec
8e8ece5dec mutation_partition: Introduce deletable_row::apply() from a clustering_row fragment 2017-12-08 17:50:47 +01:00
Tomasz Grabiec
b3709047b0 mutation_partition: Extract sliced() from mutation into mutation_partition
So that we can call it on mutation_partition.
2017-12-08 17:50:47 +01:00
Tomasz Grabiec
b26ce36d4b mvcc: Introduce partition_snapshot::static_row_continuous() 2017-12-08 17:50:47 +01:00
Tomasz Grabiec
c283744fcb mvcc: Introduce partition_snapshot::range_tombstones() for full range 2017-12-08 17:50:47 +01:00
Tomasz Grabiec
df964c70f8 mvcc: Don't require external schema in parition_snapshot::range_tombstones() 2017-12-08 17:50:47 +01:00
Michael Munday
8df2afc255 build: add -lcryptopp to libs
Not sure why this is necessary on s390x but not x86.
2017-12-08 10:12:41 -05:00
Michael Munday
18c0ab539e utils/allocation_strategy: force alignment to be at least sizeof(void*)
The alignment of packed structs can be 1. The system¹ posix_memalign
function will return EINVAL when passed this alignment. This fix
forces the alignment to be at least sizeof(void*).

¹ The seastar implementation of posix_memalign does not appear to
  have this limitation currently.
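
A minimal sketch of the clamping described above, assuming a POSIX system.
`alloc_aligned` is a hypothetical wrapper, not the function in
utils/allocation_strategy, and it assumes the requested alignment is a power
of two (as `alignof` values are).

```cpp
#include <stdlib.h>   // posix_memalign, free (POSIX)
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical wrapper (an assumption for illustration).
inline void* alloc_aligned(std::size_t alignment, std::size_t size) {
    // Precondition assumed by this sketch: alignment is a power of two.
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // posix_memalign wants a power-of-two multiple of sizeof(void*), so an
    // alignment of 1 (typical for packed structs) is raised to sizeof(void*).
    const std::size_t effective = std::max(alignment, sizeof(void*));
    void* p = nullptr;
    if (posix_memalign(&p, effective, size) != 0) {
        return nullptr;
    }
    return p;
}
```

Raising the alignment is always safe here: memory aligned to sizeof(void*)
is also aligned to any smaller power of two.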
2017-12-08 10:12:41 -05:00
Michael Munday
5158b3f484 utils::crc: introduce process_le/be(T) methods
Replace the oblique process(T) overloads for integer types with
explicit process_le/be(T) methods that would interpret the given integer
as a stream of bytes using the corresponding endianness.

For instance

process_le(0x11223344) would treat this integer as the following array of bytes:
{0x44, 0x33, 0x22, 0x11}.

process_be(0x11223344) on the other hand would treat this integer as if it's
{0x11, 0x22, 0x33, 0x44}.
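
The two byte orders above can be sketched as follows. `bytes_le` and
`bytes_be` are hypothetical helpers, not the utils::crc interface; they only
show which byte sequence each method would feed to the checksum, independent
of the host's endianness.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using byte_array4 = std::array<std::uint8_t, 4>;

// Hypothetical helper: least significant byte first.
inline byte_array4 bytes_le(std::uint32_t v) {
    return byte_array4{{
        static_cast<std::uint8_t>(v),
        static_cast<std::uint8_t>(v >> 8),
        static_cast<std::uint8_t>(v >> 16),
        static_cast<std::uint8_t>(v >> 24)}};
}

// Hypothetical helper: most significant byte first.
inline byte_array4 bytes_be(std::uint32_t v) {
    return byte_array4{{
        static_cast<std::uint8_t>(v >> 24),
        static_cast<std::uint8_t>(v >> 16),
        static_cast<std::uint8_t>(v >> 8),
        static_cast<std::uint8_t>(v)}};
}

// Usage example mirroring the values in the commit message.
inline void byte_order_example() {
    assert((bytes_le(0x11223344u) == byte_array4{{0x44, 0x33, 0x22, 0x11}}));
    assert((bytes_be(0x11223344u) == byte_array4{{0x11, 0x22, 0x33, 0x44}}));
}
```

Shifts operate on the integer's value rather than its memory layout, so the
result does not depend on the machine the code runs on.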

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-08 10:12:21 -05:00
Michael Munday
26b7c2622e utils/crc: use zlib for crc32 on non-x86 platforms
Ideally we should use the Castagnoli polynomial to match the SSE 4.2
crc32 instructions, but this works for now.
2017-12-08 09:47:50 -05:00
Michael Munday
f2be7d3e9e main: only perform SSE 4.2 check on x86-family CPUs
The check doesn't make sense on other architectures (e.g. s390x).
2017-12-08 09:47:50 -05:00
Vlad Zolotarov
03693de803 configure.py: don't use 'gold' linker on Power
'gold' linker is not a part of binutils on Power yet.
Let's not use it on Power.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-08 09:47:50 -05:00
Michael Munday
92d6a2b76c configure.pu: add --target flag to override -march value
This is probably the simplest way to make the build work on other
architectures. --target can be set to an empty string to allow
the compiler's default to be used.

If --target is not set then the default is going to be 'nehalem' on
x86 machines and the compiler's default on all other platforms.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Signed-off-by: Michael Munday <mike.munday@ibm.com>
2017-12-08 09:47:50 -05:00
Tomasz Grabiec
5541c9fd63 mutation_partition: Define equal_continuity() using get_continuity()
This fixes the problem of equal_continuity() being prone to false
positives due to redundant information (extra dummy rows) present in
one of the partitions. get_continuity() is minified, so is not prone
to this.
2017-12-08 12:01:27 +01:00
Tomasz Grabiec
bde050835f mutation_partition: Make check_continuity() const-qualified 2017-12-08 12:01:27 +01:00
Tomasz Grabiec
f9257886cb mutation_partition: Make check_continuity() public 2017-12-08 12:01:27 +01:00
Tomasz Grabiec
865bd8a594 mutation_partition: Introduce mutation_partition::get_continuity()
Intended to be used in tests.
2017-12-08 12:01:27 +01:00
Tomasz Grabiec
7e5d243a95 Introduce clustering_interval_set
Will make it easy to represent and manipulate continuity in tests.

Could also replace clustering_row_ranges in the future, which is
currently a naked vector<> with no semantic methods.
2017-12-08 12:01:27 +01:00
Tomasz Grabiec
22138554e6 mutation_partition: Leave moved-from row in an empty state
Needed by apply_monotonically(). Fixes SIGSEGV in mutation_test_g.
2017-12-08 12:01:27 +01:00
Tomasz Grabiec
a305a28574 mutation_partition: Fix upgrade() not preserving static row continuity
We do not rely on this yet, but will.
2017-12-08 12:01:27 +01:00
Paweł Dziepak
051cbbc9af Merge "Fix range tombstone emitting which led to skipping over data" from Tomasz
"Fixes cache reader to not skip over data in some cases involving overlapping
range tombstones in different partition versions and discontinuous cache.

Introduced in 2.0

Fixes #3053."

* tag 'tgrabiec/fix-range-tombstone-slicing-v2' of github.com:scylladb/seastar-dev:
  tests: row_cache: Add reproducer for issue #3053
  tests: mvcc: Add test for partition_snapshot::range_tombstones()
  mvcc: Optimize partition_snapshot::range_tombstones() for single version case
  mvcc: Fix partition_snapshot::range_tombstones()
  tests: random_mutation_generator: Do not emit dummy entries at clustering row positions
2017-12-08 10:27:17 +00:00
Tomasz Grabiec
4cc4c661f3 tests: row_cache: Add reproducer for issue #3053
The issue is that partition_snapshot::range_tombstones() is
deoverlapping tombstones coming from different versions, and it may
happen that due to range tombstone splitting that function will return
a tombstone which starts after the requested range. This breaks
assumptions made by the cache reader. It keeps track of the maximum
fragment position, and if the cache reader then needs to read from
sstables due to a miss, it would do so starting from the position
marked by that out of range tombstone, possibly skipping over some
rows.
2017-12-08 10:15:58 +01:00
Tomasz Grabiec
b6f4637aec tests: mvcc: Add test for partition_snapshot::range_tombstones() 2017-12-08 10:15:58 +01:00
Tomasz Grabiec
183554cbc4 mvcc: Optimize partition_snapshot::range_tombstones() for single version case 2017-12-08 10:15:58 +01:00
Tomasz Grabiec
1303320377 mvcc: Fix partition_snapshot::range_tombstones()
partition_snapshot::range_tombstones() is deoverlapping tombstones
coming from different versions and it may happen that due to range
tombstone splitting the method will return a tombstone which starts
after the requested range. This would cause it to return a tombstone
which doesn't overlap with the requested range.

This breaks assumptions made by the cache reader. It keeps track of the
maximum fragment position, and if the cache reader then needs to read
from sstables due to a miss, it would do so starting from the position
marked by that out of range tombstone, possibly skipping over some
rows.

Exposed by a change in row_cache_test.cc::test_mvcc() which fills the
buffer of sm5 reader after it is created.

Fixes #3053.
2017-12-08 10:15:58 +01:00
Tomasz Grabiec
89e3b734ed tests: random_mutation_generator: Do not emit dummy entries at clustering row positions
It is assumed that dummy entries are only at !is_clustering_row() positions.
Causes cache_streamed_mutation to assert when trying to trim a range tombstone.
2017-12-07 20:20:37 +01:00
Avi Kivity
d934ca55a7 Merge "SSTable resharding fixes" from Raphael
"Didn't affect any release. Regression introduced in 301358e.

Fixes #3041"

* 'resharding_fix_v4' of github.com:raphaelsc/scylla:
  tests: add sstable resharding test to test.py
  tests: fix sstable resharding test
  sstables: Fix resharding by not filtering out mutation that belongs to other shard
  db: introduce make_range_sstable_reader
  rename make_range_sstable_reader to make_local_shard_sstable_reader
  db: extract sstable reader creation from incremental_reader_selector
  db: reuse make_range_sstable_reader in make_sstable_reader
2017-12-07 16:42:48 +02:00
Amos Kong
8fd5d27508 dist/debian: add scylla-tools-core to depends list
Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <db39cbda0e08e501633556ab238d816e357ad327.1512646123.git.amos@scylladb.com>
2017-12-07 13:40:10 +02:00
Amos Kong
eb3b138ee2 dist/redhat: add scylla-tools-core to requires list
Fixes #3051

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <f7013a4fbc241bb4429d855671fee4b845b255cd.1512646123.git.amos@scylladb.com>
2017-12-07 13:40:08 +02:00
Gleb Natapov
8f104bab5d storage_proxy: send negative write replies only when entire cluster supports the feature
Message-Id: <20171207102934.GM1885@scylladb.com>
2017-12-07 12:31:35 +02:00
Botond Dénes
1ff65f41fd mutation_reader_merger: don't query the kind of moved-from fragment
Call mutation_fragment_kind() on the fragment *before* it's moved as
there are no guarantees for the state of a moved-from object (apart
from that it's in a valid one).
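
A minimal sketch of the pattern, with hypothetical types (`fragment`,
`merger`) standing in for the real mutation_fragment and merger classes:
anything needed from the object is read before it is handed to std::move.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only.
enum class fragment_kind { partition_start, clustering_row, partition_end };

struct fragment {
    fragment_kind kind;
    std::string payload;
};

struct merger {
    std::vector<fragment> out;
    fragment_kind last_kind = fragment_kind::partition_end;

    void push(fragment&& f) {
        // Read the kind first: after the move below, `f` is valid but its
        // contents are unspecified, so querying it would be a latent bug.
        const fragment_kind kind = f.kind;
        out.push_back(std::move(f));
        last_kind = kind;
    }
};

// Usage example.
inline void merger_example() {
    merger m;
    m.push(fragment{fragment_kind::clustering_row, "row"});
    assert(m.last_kind == fragment_kind::clustering_row);
}
```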

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <c47b1e22877bb9499f1fbb9d513093c29ef1901b.1512635422.git.bdenes@scylladb.com>
2017-12-07 10:40:31 +02:00
Avi Kivity
060e5d3354 Merge "Improve time-series performance by not actually compacting fully expired tables" from Raphael
"In time-series, it's common for tables in a given time window to be eventually
fully expired. The deletion of such tables is done by compaction, but there's
*no* need to *actually* compact such fully expired sstables *iff* their full
deletion will not cause older data to be resurrected. In other words, a fully
expired table can be actually skipped (but deleted in the end) by compaction
*iff* it doesn't contain newer data than its overlapping counterparts. So there
may be false negatives, but never false positives.
All that said, the goal behind this patchset is to save read bandwidth of disk
in such scenarios. Given that fully expired sstables will not be read by
compaction process anymore, read amplification will be greatly reduced too.

Fixes #2620."

* 'time_series_performance_improvement_v2_2' of github.com:raphaelsc/scylla:
  tests: check sstable auto correct bad max deletion time
  tests: add test for compaction with fully expired table
  sstables/compaction: do not actually compact fully expired sstables
  sstables: make sstable auto correct max_local_deletion_time
  sstables: switch to const ref wherever possible
  sstables: use gc_clock::time_point for gc_before
  gc_clock: introduce operator<<(ostream&, gc_clock::time_point)
  sstables: introduce sstable::get_max_local_deletion_time
  sstables: remove unnecessary copy in time series strategies
  sstables: change return value type of get_fully_expired_sstables
  dtcs: make code to extract non expired tables faster
  sstables: add has_correct_max_deletion_time to sstable
2017-12-07 10:29:31 +02:00
Avi Kivity
908daa67bd Merge "Generalize data_resource" from Jesse
"Soon we will have resources beyond just keyspaces and table names. There
will be resources for roles, for user-defined functions (UDFs), and
possibly resources for REST end-points. This change generalizes the
implementation of a `data_resource` to many different kinds of
resources, though there is still only one kind (`data`).

The most important patch is 2/5 ("auth/resource: Generalize to different
kinds"), which re-writes `auth::data_resource`. The patch message should
sufficiently explain the design decisions involved.

The other patches rename files and identifiers based on the expanded
role of this class, except for 5/5 ("auth/resource.hh: Rename
`resource_ids`"): this patch gives a more appropriate name to a type
alias.

Fixes #3027."

* 'jhk/generalize_resource/v3' of https://github.com/hakuch/scylla:
  auth/resource.hh: Rename `resource_ids`
  auth: Rename `data_resource` files
  cql3/authorization_statement: Fix typo
  auth/resource: Generalize to different kinds
  auth: Rename `data_resource` to `resource`
2017-12-07 10:25:58 +02:00
Botond Dénes
9fce51f8a0 Add streamed mutation fast-forwarding unit test for the flat combined-reader
Test for the bug fixed by 9661769.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <fc917bae8e9c99f026bf7b366e6e9d39faf466af.1512630741.git.bdenes@scylladb.com>
2017-12-07 09:45:12 +02:00
Raphael S. Carvalho
39f7404436 tests: add sstable resharding test to test.py
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 03:15:27 -02:00
Raphael S. Carvalho
fc193c29cf tests: fix sstable resharding test
wrong sstable was used when checking for content, and storage service
for test was missing.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 03:15:27 -02:00
Raphael S. Carvalho
bad21ba444 sstables: Fix resharding by not filtering out mutation that belongs to other shard
After 301358e, sstable resharding stopped working because shared sstables would
use a filtering reader, which excludes mutations that belong to other shards.
That completely breaks resharding, which relies on compaction of mutations that
belong to different shards. The fix is to use the recently introduced non-local
shard reader.

Fixes #3041.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 03:15:26 -02:00
Raphael S. Carvalho
f1b65a115a db: introduce make_range_sstable_reader
introduce reader variant that will allow its caller to read a range
in a given table without any filter applied.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 03:15:26 -02:00
Raphael S. Carvalho
d1b146baa6 rename make_range_sstable_reader to make_local_shard_sstable_reader
Tomek says:
"I think that the least surprising behavior for a function named like this
is to read the sstables unfiltered (it just reads them), and the filtering
should be indicated specially in the name or by accepting a parameter."

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 03:15:25 -02:00
Raphael S. Carvalho
3d725d6823 db: extract sstable reader creation from incremental_reader_selector
step closer to divorcing incremental_selector from sstables

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 01:53:16 -02:00
Raphael S. Carvalho
ab82bacddd db: reuse make_range_sstable_reader in make_sstable_reader
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 01:53:14 -02:00
Raphael S. Carvalho
5eef7371b3 tests: check sstable auto correct bad max deletion time
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:33 -02:00
Raphael S. Carvalho
a86ee38638 tests: add test for compaction with fully expired table
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:33 -02:00
Raphael S. Carvalho
809b30c4a2 sstables/compaction: do not actually compact fully expired sstables
There's no need to actually compact an sstable which is fully expired
and whose full deletion will not resurrect older data.
For that, an sstable will only be considered fully expired if it
doesn't contain data newer than its overlapping counterparts.
That way, there could be a false negative, but never a false positive.
Currently, a fully expired sstable unnecessarily wastes disk read
bandwidth. This will greatly help time-series workloads in
which data for a given time window is all deleted at once using TTL.
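
The rule above can be sketched roughly as below. This is a simplified
illustration, not the code in sstables/compaction: the struct, its field
names, and `can_skip_fully_expired` are all assumptions, and timestamps are
plain integers rather than gc_clock values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical per-sstable summary (names are assumptions).
struct sstable_stats {
    std::int64_t min_timestamp;            // oldest write it contains
    std::int64_t max_timestamp;            // newest write it contains
    std::int64_t max_local_deletion_time;  // when its last datum expires
};

// True if the candidate can be dropped without being read by compaction.
inline bool can_skip_fully_expired(const sstable_stats& candidate,
                                   const std::vector<sstable_stats>& overlapping,
                                   std::int64_t gc_before) {
    // Some data in the candidate has not expired yet.
    if (candidate.max_local_deletion_time >= gc_before) {
        return false;
    }
    // Oldest write found in any overlapping sstable.
    std::int64_t min_overlapping = std::numeric_limits<std::int64_t>::max();
    for (const auto& s : overlapping) {
        min_overlapping = std::min(min_overlapping, s.min_timestamp);
    }
    // If the candidate holds data newer than something in an overlapping
    // sstable, its tombstones may still shadow that data, so keep it.
    return candidate.max_timestamp < min_overlapping;
}

// Usage example: the same candidate is skippable or not depending on
// what it overlaps with.
inline void fully_expired_example() {
    const sstable_stats candidate{10, 20, 100};
    assert(can_skip_fully_expired(candidate, {sstable_stats{30, 40, 500}}, 200));
    assert(!can_skip_fully_expired(candidate, {sstable_stats{15, 40, 500}}, 200));
}
```

The check is conservative by design: it may keep an sstable that could have
been dropped (a false negative), but it never drops one whose removal could
resurrect older data.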

Fixes #2620.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:33 -02:00
Raphael S. Carvalho
810e2ec3d9 sstables: make sstable auto correct max_local_deletion_time
sstables created prior to cc6c383 can contain bad max deletion time stat,
which would make get_fully_expired_sstables return sstables that aren't
actually fully expired. Let's make sstable invalidate the stat if it
is potentially incorrect.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:33 -02:00
Raphael S. Carvalho
d2ab154f12 sstables: switch to const ref wherever possible
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:33 -02:00
Raphael S. Carvalho
d916c8cdad sstables: use gc_clock::time_point for gc_before
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:33 -02:00
Raphael S. Carvalho
1d0e6496ec gc_clock: introduce operator<<(ostream&, gc_clock::time_point)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:32 -02:00
Raphael S. Carvalho
fcdce38e7f sstables: introduce sstable::get_max_local_deletion_time
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 18:47:05 -02:00
Raphael S. Carvalho
18bdf496fe sstables: remove unnecessary copy in time series strategies
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 18:46:46 -02:00
Raphael S. Carvalho
45c11865fa sstables: change return value type of get_fully_expired_sstables
unordered_set will allow us to quickly extract fully expired tables
from a set of compacting sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 18:45:55 -02:00
Raphael S. Carvalho
4fe6fea758 dtcs: make code to extract non expired tables faster
since it's O(n) and not O(n log n).

change also needed for change in interface of function to retrieve
fully expired tables, or sort lambda would need to be parametrized.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 18:40:16 -02:00
Raphael S. Carvalho
11176324bd sstables: add has_correct_max_deletion_time to sstable
Commit cc6c38324 fixes the stat. Prior to the fix, it was only updated
for range tombstones, so an sstable that had a regular cell with
no expiration time could be considered fully expired, which can
lead to bad decisions in compaction for time series workloads.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 18:40:05 -02:00
Jesse Haber-Kucharsky
aea262cdc4 auth/resource.hh: Rename resource_ids 2017-12-06 14:39:40 -05:00
Jesse Haber-Kucharsky
3cad18631d auth: Rename data_resource files
Now that there can be many kinds of resources, the old name doesn't fit.
2017-12-06 14:39:40 -05:00
Jesse Haber-Kucharsky
3665261a90 cql3/authorization_statement: Fix typo 2017-12-06 14:39:40 -05:00
Jesse Haber-Kucharsky
1bb22bb190 auth/resource: Generalize to different kinds
This change generalizes the implementation of a `resource` to many
different kinds of resources, though there is still only one
kind (`data`). In the future, we also expect resource kinds for roles,
user-defined functions (UDFs), and possibly particular REST
end-points.

I considered several approaches to generalizing to different kinds of
resources.

One approach is to have a base class that is inherited from by different
resource kinds. The common functionality would be accessed through
virtual member functions and kind-specific functions would exist in
sub-classes. I rejected this approach because dealing with different
kinds of resources uniformly requires storage and life-time management
through something like `std::unique_ptr<auth::resource>`, which means
that we lose value semantics (including comparison) and must deal with
complications around ownership.

Another option was to use `boost::variant` (or, in future,
`std::variant`). This is closer to what we want, since there is a static
set of resource kinds that we support. I rejected this approach for two
reasons. The first is that all resource kinds share the same data (a
list of segments and a root identifier), which would be duplicated in
each type that composed the variant. The second is that the complexity
and source-code overhead of `boost::variant` didn't seem warranted.

The solution I ended up with is a home-grown variant. All resources are
described in the same `final` class: `auth::resource`. This class has
value semantics, supports equality comparison, and has a strict
ordering. All resources have in common a tag ("kind") and a list of
parts. Most operations on resources don't care about the kind of
resource (like getting its name, parsing a name, querying for the
parent, etc). These are just member functions of the class.

When we care about a kind-specific interpretation of a resource, we can
produce a "view" of the resource. For example, `data_resource_view`
allows for accessing the (optional) keyspace and table names.
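
A rough sketch of this design, under stated assumptions: the member names,
the `role` kind, and the accessor signatures below are illustrative and are
not taken from auth/resource.hh. It shows one `final` value type carrying a
kind tag plus a list of parts, and a view that gives the parts a
kind-specific meaning.

```cpp
#include <cassert>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

enum class resource_kind { data, role };  // `role` is anticipated, not yet real

class resource final {
    resource_kind _kind;
    std::vector<std::string> _parts;  // e.g. {"ks", "tbl"} for a table
public:
    resource(resource_kind kind, std::vector<std::string> parts)
        : _kind(kind), _parts(std::move(parts)) {}

    resource_kind kind() const { return _kind; }
    const std::vector<std::string>& parts() const { return _parts; }

    // Kind-agnostic operation: the parent drops the last part.
    bool has_parent() const { return !_parts.empty(); }
    resource parent() const {
        assert(has_parent());
        return resource(_kind, std::vector<std::string>(_parts.begin(), _parts.end() - 1));
    }

    // Value semantics: equality and a strict ordering.
    friend bool operator==(const resource& a, const resource& b) {
        return a._kind == b._kind && a._parts == b._parts;
    }
    friend bool operator<(const resource& a, const resource& b) {
        return std::tie(a._kind, a._parts) < std::tie(b._kind, b._parts);
    }
};

// Kind-specific interpretation of a data resource.
class data_resource_view {
    const resource& _r;
public:
    explicit data_resource_view(const resource& r) : _r(r) {
        assert(r.kind() == resource_kind::data);
    }
    bool has_keyspace() const { return _r.parts().size() >= 1; }
    bool has_table() const { return _r.parts().size() >= 2; }
    const std::string& keyspace() const { assert(has_keyspace()); return _r.parts()[0]; }
    const std::string& table() const { assert(has_table()); return _r.parts()[1]; }
};
```

Because every kind shares the same representation, resources can be stored
in ordinary containers and compared directly, with no heap-allocated base
class pointers involved.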

In the future, I anticipate adding functions for creating role
resources (`auth::resource::role`) and also `role_resource_view`.

The functional behaviour of the system should be unchanged with this
patch.

I've added new unit tests in `auth_resource_test.cc` and removed the old
test from `auth_test.cc`.

Fixes #3027.
2017-12-06 14:37:56 -05:00
Jesse Haber-Kucharsky
8fe53ecf78 auth: Rename data_resource to resource
The implementation and interface of `auth::resource` will change soon to
support different kinds of resources beyond just data (keyspaces and
tables).
2017-12-06 10:18:05 -05:00
Gleb Natapov
ddf117535a storage_proxy: add counters for speculative reads
Fixes #3030

Message-Id: <20171206143611.8756-1-gleb@scylladb.com>
2017-12-06 16:38:16 +02:00
Avi Kivity
ccc315bcfe Merge "storage_proxy: allow fail request earlier if CL cannot be reached due to errors" from Gleb
"This is CASSANDRA-7886 and CASSANDRA-8592. The patch series detects
that CL of a request can no longer be reached due to errors and fails
the request earlier. New types of errors are reported: read/write failures,
which were introduced in the CQL v4 protocol. For compatibility, if an older
protocol is used the error is translated to a timeout error."

* 'gleb/request-failure_v2' of github.com:scylladb/seastar-dev:
  storage_proxy: fail read/write requests early if it cannot be completed due to errors
  storage_service: add WRITE_FAILURE_REPLY_FEATURE feature
  gossiper: add node_has_feature() function
  cql: add read/write failure exceptions
  storage_proxy: fix data presence reporting in read timeout error during
  storage_proxy: remove inheritance from enable_shared_from_this for abstract_write_response_handler
  storage_proxy: remove unneeded field in abstract_write_response_handler
  storage_proxy: fix pending endpoint accounting for EACH_QUORUM
  consistency_level: constify quorum_for() and local_quorum_for()
2017-12-06 16:17:19 +02:00
Botond Dénes
9661769313 combined_mutation_reader: fix fast-forwarding related row-skipping bug
When fast forwarding is enabled and all readers positioned inside the
current partition return EOS, return EOS from the combined-reader
too, instead of skipping to the next partition when idle readers
(positioned at some later partition) are available. Skipping ahead in
this way causes rows to be skipped in some cases.

The fix is to distinguish EOS'd readers that are only halted (waiting
for a fast-forward) from those really out of data. To achieve this we
track the last fragment-kind the reader emitted. If that was a
partition-end then the reader is out of data, otherwise it might emit
more fragments after a fast-forward. Without this additional information
it is impossible to determine why a reader reached EOS and the code
later may make the wrong decision about whether the combined-reader as
a whole is at EOS or not.
Also when fast-forwarding between partition-ranges or calling
next_partition() we set the last fragment-kind of forwarded readers
because they should emit a partition-start, otherwise they are out of
data.
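
The bookkeeping described above can be sketched roughly as follows. This
is a simplified model, not the combined reader's real code: the type
`reader_state` and its member names are hypothetical, and only three
fragment kinds are modelled.

```cpp
// Simplified model of the per-reader state; all names are hypothetical.
enum class fragment_kind { partition_start, clustering_row, partition_end };

struct reader_state {
    // Starts as partition_end: a reader that reaches EOS without emitting
    // anything is out of data.
    fragment_kind last_kind = fragment_kind::partition_end;
    bool at_eos = false;

    void on_fragment(fragment_kind k) { last_kind = k; at_eos = false; }
    void on_eos() { at_eos = true; }

    // Forwarding to a new partition range (or calling next_partition())
    // resets the last kind: the reader must emit a partition-start next,
    // otherwise it is out of data.
    void on_forward_to_new_range() {
        last_kind = fragment_kind::partition_end;
        at_eos = false;
    }

    // EOS right after a partition-end: nothing more will ever come.
    bool out_of_data() const {
        return at_eos && last_kind == fragment_kind::partition_end;
    }
    // EOS in the middle of a partition: waiting for a fast-forward.
    bool halted() const {
        return at_eos && last_kind != fragment_kind::partition_end;
    }
};
```

With this distinction, the combined reader can report EOS while any
reader is merely halted, rather than moving on to the next partition.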

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <6f0b21b1ec62e1197de6b46510d5508cdb4a6977.1512569218.git.bdenes@scylladb.com>
2017-12-06 16:09:05 +02:00
Takuya ASADA
aeb6ebce5a dist/debian: need apt-get update after installing GPG key for 3rdparty repo
We need to run apt-get update after installing the GPG key; otherwise we
still get an unauthenticated package error on Debian package build.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1512556948-29398-1-git-send-email-syuu@scylladb.com>
2017-12-06 12:43:17 +02:00
Jesse Haber-Kucharsky
772b432345 auth: Copying role exceptions cannot throw
This is a small correctness change.

According to cppreference.com [1], derived classes of `std::exception`
are not permitted to throw exceptions when they are copied.

To satisfy this requirement for `auth::roles_argument_exception`, we
store exception members as `std::shared_ptr` which has a `noexcept` copy
ctor. Since exceptions can cross shards, we cannot use a
`seastar::shared_ptr`.
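
A minimal sketch of the technique, assuming a hypothetical exception
type `role_error_sketch` (this is not the actual
`auth::roles_argument_exception`): the message lives behind a
`std::shared_ptr`, so copying the exception only bumps a reference count
and cannot throw.

```cpp
#include <exception>
#include <memory>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical exception type used only for illustration.
class role_error_sketch : public std::exception {
    // Shared, immutable message: copying the exception never allocates.
    std::shared_ptr<const std::string> _msg;
public:
    // Construction may still throw (allocation); only copying must not.
    explicit role_error_sketch(std::string msg)
        : _msg(std::make_shared<const std::string>(std::move(msg))) {}

    const char* what() const noexcept override { return _msg->c_str(); }
};

// The implicitly generated copy constructor is noexcept because both the
// std::exception base and std::shared_ptr have noexcept copy constructors.
static_assert(std::is_nothrow_copy_constructible<role_error_sketch>::value,
              "copying the exception must not throw");
```

Storing a plain `std::string` member instead would make the copy
constructor potentially throwing, which is what the change avoids.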

This change is motivated by #3021.

[1] http://en.cppreference.com/w/cpp/error/exception/exception

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <7706df0c701b90e7cb309c84a86d9f813461e801.1512501024.git.jhaberku@scylladb.com>
2017-12-06 09:42:45 +01:00
Vladimir Krivopalov
1fc0c60fdc Support "CREATE TABLE WITH id" command.
Fixes #2059

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <92874a2bf1b4e79ef9f05875b3fa42804d17833c.1512508924.git.vladimir@scylladb.com>
2017-12-06 09:39:56 +01:00
Takuya ASADA
8f02967a3b dist/debian: install CA certificates before install repo GPG key
Since the pbuilder chroot environment does not install CA certificates by
default, accessing https://download.opensuse.org will cause a certificate
verification error.
So we need to install them before installing the 3rdparty repo GPG key.

Also, checking for the existence of gpgkeys_curl is not needed: it is never
installed, since we run the script in a clean chroot environment.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1512517001-27524-1-git-send-email-syuu@scylladb.com>
2017-12-06 08:57:01 +01:00
Avi Kivity
3501c147b7 Merge "Use new recommended classes from JsonCpp instead of deprecated ones" from Vladimir
"This fix for the issue #2989 first adds unit tests for caching_options which
is the only class that uses the helpers from json.hh. This is done to
have regression tests in place for the main change.
The second commit adds conditional use of new recommended JsonCpp API
where available. For older versions of the library, it uses the old
code."

* 'issues/2989/v1' of https://github.com/argenet/scylla:
  Use CharReaderBuilder/CharReader and StreamWriterBuilder from JsonCpp.
  tests: Add unit tests for caching_options.
2017-12-06 09:11:40 +02:00
Avi Kivity
601a03dda7 Merge "Make sstable tests use flat_mutation_reader" from Paweł
"This series makes sstable tests use flat stream interface. The main
motivation is to allow eventual removal of mutation_reader and
streamed_mutation and ensuring that the conversion between the
interfaces doesn't hide any bugs that would be otherwise found."

* tag 'flat_mutation_reader-sstable-tests/v1' of https://github.com/pdziepak/scylla:
  sstables: drop read_range_rows()
  tests/mutation_reader: stop using read_range_rows()
  incremental_reader_selector: do not use read_range_rows()
  tests/sstable: stop using read_range_rows()
  sstables: drop read_row()
  tests/sstables: use read_row_flat() instead of read_row()
  database: use read_row_flat() instead of read_row()
  tests/sstable_mutation_test: get flat_mutation_readers from mutation sources
  tests/sstables: make sstable_reader return flat_mutation_reader
  sstable: drop read_row() overload accepting sstable::key
  tests/sstable: stop using read_row() with sstable::key
  tests/flat_mutation_reader_assertions: add has_monotonic_positions()
  tests/flat_mutation_reader_assertions: add produces(Range)
  tests/flat_mutation_reader_assertions: add produces(mutation)
  tests/flat_mutation_reader_assertions: add produces(dht::decorated_key)
  tests/flat_mutation_reader_assertions: add produces(mutation_fragment::kind)
  tests/flat_mutation_reader_assertions: fix fast forwarding
2017-12-05 18:10:43 +02:00
Vladimir Krivopalov
b35c2fe177 Attach backtrace to marshal_exception-s thrown from generic functions.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <06ad18c3563855771dd3ea8d0ec99533642e1919.1511931828.git.vladimir@scylladb.com>
2017-12-05 16:14:55 +01:00
Paweł Dziepak
0d8f964a79 sstables: drop read_range_rows()
It has been deprecated by read_range_rows_flat().
2017-12-05 14:53:14 +00:00
Paweł Dziepak
0c50f113c8 tests/mutation_reader: stop using read_range_rows() 2017-12-05 14:53:14 +00:00
Paweł Dziepak
ce9a890940 incremental_reader_selector: do not use read_range_rows() 2017-12-05 14:53:14 +00:00
Paweł Dziepak
15ad148604 tests/sstable: stop using read_range_rows()
read_range_rows() is deprecated by read_range_rows_flat().
2017-12-05 14:53:14 +00:00
Paweł Dziepak
e739ad98e5 sstables: drop read_row() 2017-12-05 14:53:14 +00:00
Paweł Dziepak
de8ebd6752 tests/sstables: use read_row_flat() instead of read_row() 2017-12-05 14:53:14 +00:00
Paweł Dziepak
bccca90207 database: use read_row_flat() instead of read_row() 2017-12-05 14:52:57 +00:00
Paweł Dziepak
582bacbd81 tests/sstable_mutation_test: get flat_mutation_readers from mutation sources 2017-12-05 14:52:32 +00:00
Paweł Dziepak
74e1c38f80 tests/sstables: make sstable_reader return flat_mutation_reader 2017-12-05 14:52:32 +00:00
Paweł Dziepak
7fce7a9e3a sstable: drop read_row() overload accepting sstable::key
sstable::key needs to be converted to a dht::decorated_key which needs
to be kept alive until the returned reader dies.
2017-12-05 14:49:25 +00:00
Paweł Dziepak
77a4231147 tests/sstable: stop using read_row() with sstable::key 2017-12-05 14:47:46 +00:00
Paweł Dziepak
52c1e9fcf4 tests/flat_mutation_reader_assertions: add has_monotonic_positions()
has_monotonic_positions() verifies that the stream is monotonic.
Based on streamed_mutation_assertions::has_monotonic_positions().
2017-12-05 14:47:46 +00:00
Paweł Dziepak
5b6f680b45 tests/flat_mutation_reader_assertions: add produces(Range)
The assertions already have produces(mutation) and
produces(dht::decorated_key) overloads. An additional overload that accepts
a range of elements allows checking whether a range of mutations or
decorated keys is produced.
The same interface is exposed by mutation_reader_assertions.
2017-12-05 14:47:46 +00:00
Paweł Dziepak
ef4fa1a8c1 tests/flat_mutation_reader_assertions: add produces(mutation) 2017-12-05 14:47:31 +00:00
Gleb Natapov
16964de1f3 storage_proxy: fail read/write requests early if it cannot be completed due to errors
If errors make reaching CL impossible a request can be aborted earlier
without waiting for timeout.
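
The underlying arithmetic can be sketched as below. This is an
assumption about the general idea, not the storage_proxy code; the
function name `cl_unreachable` and its parameters are hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper. `targets` is the number of replicas the request was
// sent to, `block_for` the number of successful replies the consistency
// level requires, and `failures` the number of error replies so far.
inline bool cl_unreachable(std::size_t targets, std::size_t block_for,
                           std::size_t failures) {
    // Even if every replica that has not failed yet replied successfully,
    // we could no longer collect block_for replies.
    return failures > targets || targets - failures < block_for;
}
```

For example, with three targets and a requirement of two replies, one
failure still leaves the request viable, while a second failure means it
can be failed immediately.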
2017-12-05 16:46:25 +02:00
Gleb Natapov
0be3bd383b storage_service: add WRITE_FAILURE_REPLY_FEATURE feature
Presence of the flag indicates that the node is ready to process
negative mutation write replies.
2017-12-05 16:46:25 +02:00
Calle Wilund
8af0b501a2 gossip: wait for stabilized gossip on bootstrap
Fixes #2866

Instead of a raw 30s sleep waiting for gossip to stabilize/set up
ranges on bootstrap, use logic similar to 'wait_for_gossip_to_settle'
and loop for said 30s or more, until the endpoint set neither grows nor
shrinks and we are not processing ACKs.
2017-12-05 14:28:34 +00:00
Calle Wilund
1c8302e692 gossiper: Prevent race condition in propagation
Fixes #2894

Allow applying certain application states as monotonic sets,
i.e. allow a set of states as input, and ensure the values are
re-versioned and all applied together.
Then do so for certain states that are coupled by design
(status/tokens).

Similar solution to origin's, as the issue is a copy of the same.
2017-12-05 14:28:34 +00:00
Calle Wilund
2095cb82a5 utils::to_string: Add printers for pairs+maps 2017-12-05 14:28:34 +00:00
Calle Wilund
f4362a5289 utils::in: Add helper type for perfect forwarding initializer lists
wrapper type (courtesy of
http://cpptruths.blogspot.se/2013/09/21-ways-of-passing-parameters-plus-one.html#inTidiom)
to enable move semantics in initializer lists. Useful as an engineering
overkill to retain nice call sites.
2017-12-05 14:28:34 +00:00
Paweł Dziepak
d2dfca458f tests/flat_mutation_reader_assertions: add produces(dht::decorated_key)
There is an equivalent member function in mutation_reader assertions.
2017-12-05 13:11:55 +00:00
Paweł Dziepak
28caa76c8c tests/flat_mutation_reader_assertions: add produces(mutation_fragment::kind)
produces(mutation_fragment::kind) is provided by
streamed_mutation_assertions and is going to be needed in order to
fully convert tests to the flat mutation readers.
2017-12-05 13:04:16 +00:00
Paweł Dziepak
21886b7a3f tests/flat_mutation_reader_assertions: fix fast forwarding
Both fast_forward_to() overloads return a future which should be waited
for. Additionally, fast_forward_to(const dht::partition_range&) expects
the range to remain valid at least until the next call to
fast_forward_to(). The original mutation_reader_assertions guaranteed
that and so should flat_mutation_reader_assertions.
2017-12-05 13:04:16 +00:00
Gleb Natapov
fb8a626813 gossiper: add node_has_feature() function
The function allows checking whether an endpoint supports a certain feature.
2017-12-05 15:02:17 +02:00
Gleb Natapov
6ef26a4a4a cql: add read/write failure exceptions
Those errors were added by cql protocol v4 and are translated to
timeout exception if earlier protocol is negotiated.
2017-12-05 15:02:17 +02:00
Gleb Natapov
6a85cae707 storage_proxy: fix data presence reporting in read timeout error during
_responses variable is never updated, so remove it. response_count() was
meant to be used.
2017-12-05 15:02:17 +02:00
Gleb Natapov
f392bd6db7 storage_proxy: remove inheritance from enable_shared_from_this for abstract_write_response_handler
No code uses shared_from_this() on abstract_write_response_handler
object, so remove the inheritance.
2017-12-05 15:02:17 +02:00
Gleb Natapov
d974c26eeb storage_proxy: remove unneeded field in abstract_write_response_handler 2017-12-05 15:02:17 +02:00
Gleb Natapov
e7cfe2dd1b storage_proxy: fix pending endpoint accounting for EACH_QUORUM
_total_block_for should account for pending endpoints, but for EACH_QUORUM
it did not.
2017-12-05 15:01:37 +02:00
Takuya ASADA
b492a1e1b1 dist/redhat: fix typo on build_rpm.sh
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1512466884-18383-2-git-send-email-syuu@scylladb.com>
2017-12-05 13:40:03 +02:00
José Guilherme Vanz
5261eb7225 build_rpm.sh: command line argument not used
The command line argument `--configure-user` of the build_rpm.sh script
is used nowhere. Thus, this commit removes all code related to
this flag.

Signed-off-by: José Guilherme Vanz <guilherme.sft@gmail.com>
Message-Id: <20171205025920.401-1-guilherme.sft@gmail.com>
2017-12-05 13:24:17 +02:00
Gleb Natapov
357c77a333 consistency_level: constify quorum_for() and local_quorum_for() 2017-12-05 13:01:20 +02:00
Avi Kivity
eea768180b Merge seastar upstream
* seastar dc44656...ac78eec (3):
  > json formatter: Add unsigned support to the json formatter
  > Add missing usual smart-pointer methods to foreign_ptr
  > future-util: remove use of forward references in some primitives
2017-12-05 11:12:12 +02:00
Raphael S. Carvalho
de19e7d942 tests:perf: make perf_sstable write mode work again
Recently, memtable flush in tests started requiring storage service for
tests; without it, the flush fails with "Assertion `local_is_initialized()' failed".
storage_service_for_tests needs to run in a thread, which is why
flush_memtable was flattened.
Last but not least, we need to revert the flushed memory accounting because
the same memtable is used for all sstables in the perf test, so as not
to trigger `_mt._flushed_memory <= _mt.occupancy().used_space()'

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171205012853.21559-1-raphaelsc@scylladb.com>
2017-12-05 10:18:53 +02:00
Vladimir Krivopalov
76775ddf26 Use CharReaderBuilder/CharReader and StreamWriterBuilder from JsonCpp.
In version 1.8.3 of JsonCpp shipped with Fedora 27, old FastWriter and
Reader classes from JsonCpp have been deprecated in favour of
newer/better ones: CharReaderBuilder/CharReader and
StreamWriterBuilder/StreamWriter.
This fix uses the new classes where available or resorts to old ones for
older versions of the library.

Fixes #2989

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2017-12-04 21:03:05 -08:00
Vladimir Krivopalov
114c71dcd8 tests: Add unit tests for caching_options.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2017-12-04 17:42:23 -08:00
Paweł Dziepak
046991b0b7 Merge "Flatten combined_mutation_reader" from Botond
"Convert combined_mutation_reader into a flat_mutation_reader impl. For
now - in the name of incremental progress - all consumers are updated to
use the combined reader through the
mutation_reader_from_flat_mutation_reader adaptor. The combined reader also
uses all its sub mutation_readers through the
flat_mutation_reader_from_mutation_reader adaptor."

* 'bdenes/flatten-combined-reader-v8' of https://github.com/denesb/scylla:
  Add unit tests for the combined reader - selector interactions
  Add flat_mutation_reader overload of make_combined_reader
  Flatten the implementation of combined_mutation_reader
  Add mutation_fragment_merger
  mutation_fragment::apply(): handle partition start and end too
  Add non-const overload of partition_start::partition_tombstone()
  Make combined_mutation_reader a flat_mutation_reader
  Move the mutation merging logic to combined_mutation_reader
  Remove the unnecessary indirection of mutation_reader_merger::next()
  Move the implementation of combined_mutation_reader into mutation_reader_merger
  Remove unused mutation_and_reader::less_compare and operator<
2017-12-04 13:19:05 +00:00
Avi Kivity
a25b5e30f8 Merge "enable secure-apt for Ubuntu/Debian pbuilder" from Takuya
* 'debian-secure-apt-3rdparty-v3' of https://github.com/syuu1228/scylla:
  dist/debian: support Ubuntu 18.04LTS
  dist/debian: disable ALLOWUNTRUSTED
  dist/debian: enable secure-apt for Debian
  dist/debian: enable secure-apt for Ubuntu
2017-12-04 14:46:42 +02:00
Takuya ASADA
4ea3daede9 dist/debian: support Ubuntu 18.04LTS
Ubuntu 18.04LTS is not released yet, but it's already usable so we can prepare
for it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2017-12-04 19:49:19 +09:00
Takuya ASADA
b4695611ed dist/debian: disable ALLOWUNTRUSTED
We have enabled secure-apt for 3rdparty repos, so we don't need ALLOWUNTRUSTED
anymore.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2017-12-04 19:49:19 +09:00
Takuya ASADA
92f8743f97 dist/debian: enable secure-apt for Debian
Enable secure-apt for Debian as well.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2017-12-04 19:49:19 +09:00
Takuya ASADA
a9fef02f9c dist/debian: enable secure-apt for Ubuntu
Our external repos are already signed, so let's enable secure-apt.
It seems that more recent versions of Ubuntu (tested on 18.04) do not accept
skipping the GPG check, so we will need it anyway in the near future.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2017-12-04 19:49:19 +09:00
Takuya ASADA
531c2e4e89 dist/ami: support AMI cross build
Now we can cross build our .rpm/.deb packages, so let's extend AMI build script
to support cross build, too.

Also, Ubuntu 16.04 support was added, since it is the latest Ubuntu LTS release.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1510247204-2899-1-git-send-email-syuu@scylladb.com>
2017-12-04 12:33:24 +02:00
Botond Dénes
956b3519dd Add unit tests for the combined reader - selector interactions
There are a few edge cases that were untested and as this patch-series
reworks completely how the combined-reader works these should be tested
as well to ensure they keep working.
2017-12-04 07:57:43 +02:00
Botond Dénes
e7535f5e88 Add flat_mutation_reader overload of make_combined_reader 2017-12-04 07:57:43 +02:00
Botond Dénes
8731c1bc66 Flatten the implementation of combined_mutation_reader
In fact flatten mutation_reader_merger and adjust combined_mutation_reader
accordingly.
2017-12-04 07:57:43 +02:00
Botond Dénes
217740c608 Add mutation_fragment_merger
This is the mutation fragment level equivalent of mutation_merger.
It merges fragments produced by different sources. Mutation
fragments are not as self-contained as streamed mutations, they have
external context, e.g. the partition they belong to. To support this
mutation_fragment_merger operates on a producer instead of a vector of
fragments. Producer can have internal state and can do side-actions as
fragments are consumed.
2017-12-04 07:57:43 +02:00
Botond Dénes
f6d11a3cfc mutation_fragment::apply(): handle partition start and end too 2017-12-04 07:57:43 +02:00
Botond Dénes
e47791810b Add non-const overload of partition_start::partition_tombstone()
And make the const version return a const reference so that code
mutating the returned value won't compile if the partition_start object
is const.
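
A small sketch of the overload pair, using hypothetical types
`tombstone_sketch` and `partition_start_sketch` in place of the real
ones:

```cpp
// Hypothetical stand-ins for tombstone and partition_start.
struct tombstone_sketch {
    long timestamp = 0;
};

class partition_start_sketch {
    tombstone_sketch _partition_tombstone;
public:
    // Const overload: returns a const reference, so callers holding a const
    // object cannot modify the tombstone through it.
    const tombstone_sketch& partition_tombstone() const {
        return _partition_tombstone;
    }
    // Non-const overload: allows in-place mutation on a non-const object.
    tombstone_sketch& partition_tombstone() {
        return _partition_tombstone;
    }
};
```

With this pair, `ps.partition_tombstone().timestamp = 42;` compiles for
a non-const `ps`, while the same statement on a const object is rejected
by the compiler instead of silently modifying a copy.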
2017-12-04 07:57:43 +02:00
Botond Dénes
3f8110b5b6 Make combined_mutation_reader a flat_mutation_reader
For now only the interface is converted, behind the scenes the previous
implementation remains; its output is simply converted by
flat_mutation_reader_from_mutation_reader. The implementation will be
converted in the following patches.
2017-12-04 07:57:43 +02:00
Botond Dénes
c011747c30 Move the mutation merging logic to combined_mutation_reader
This is the second step in splitting the combined reader's logic into
two parts as outlined in the previous patch.
2017-12-04 07:57:43 +02:00
Botond Dénes
3681e17555 Remove the unnecessary indirection of mutation_reader_merger::next() 2017-12-04 07:57:43 +02:00
Botond Dénes
c5e57e0961 Move the implementation of combined_mutation_reader into mutation_reader_merger
This simple code-movement patch lays the groundwork for splitting
the logic in combined_mutation_reader into two blocks:
* one that takes care of moving the readers in lockstep and emits their
    output as a non-decreasing stream of streamed_mutations and
* one that takes care of merging the above stream into
    strictly-increasing stream of streamed_mutations.

This in turn is preparation-work to the transformation of
combined_mutation_reader into a flat_mutation_reader::impl.
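
The second block listed above can be illustrated with a toy model, in
which integer keys stand in for partitions and string concatenation
stands in for applying one mutation to another. The names `keyed` and
`merge_equal_keys` are hypothetical and the real code operates on
streamed_mutations, not vectors.

```cpp
#include <string>
#include <utility>
#include <vector>

// Toy model: (partition key, payload). Hypothetical, for illustration only.
using keyed = std::pair<int, std::string>;

// Turns a non-decreasing stream (equal keys may repeat, one per source
// reader) into a strictly increasing one by merging runs of equal keys.
inline std::vector<keyed> merge_equal_keys(const std::vector<keyed>& in) {
    std::vector<keyed> out;
    for (const auto& m : in) {
        if (!out.empty() && out.back().first == m.first) {
            // Stand-in for applying one mutation to another.
            out.back().second += m.second;
        } else {
            out.push_back(m);
        }
    }
    return out;
}
```

Because the first block already guarantees a non-decreasing order, the
merging step only ever needs to look at the most recent output element.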
2017-12-04 07:57:43 +02:00
Botond Dénes
85b5ded670 Remove unused mutation_and_reader::less_compare and operator< 2017-12-04 07:57:43 +02:00
Avi Kivity
f3d5674108 Merge "auth: Retry delayed task in case of error" from Duarte
"A delayed task can fail to execute, for example if the consistency
level the task required can't be achieved, so we should ensure it is
retried.

Fixes #3038"

* 'auth-retry/v2' of https://github.com/duarten/scylla:
  auth/standard_role_manager: Extend exception handling
  auth/common: Add exception handling and retry to task scheduling
  auth/standard_role_manager: Lift async block to caller
2017-12-03 12:08:03 +02:00
Vladimir Krivopalov
41eb278899 Only allow DISTINCT SELECT queries with partition key restrictions.
Fixes #2049

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <75e69626d797e63fb1e93a9120f135d4959fad1c.1512162540.git.vladimir@scylladb.com>
2017-12-03 11:59:11 +02:00
Duarte Nunes
7434d21023 auth/standard_role_manager: Extend exception handling
Also handle exceptions thrown by has_existing_roles(), and print a
similar message to Apache Cassandra in case of error.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-12-02 22:40:13 +00:00
Duarte Nunes
01e2c7b614 auth/common: Add exception handling and retry to task scheduling
This follows the implementation in Apache Cassandra. The auth tasks
executed by delay_until_system_ready() usually perform a query with
QUORUM consistency level, which can fail if some nodes are
unavailable. So, we provide both exception handling and a retry
mechanism.
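
The general shape of such a mechanism is sketched below. It is a
synchronous simplification under stated assumptions: the helper
`run_with_retry` and its parameters are hypothetical, there is no delay
between attempts, and the actual code schedules the task asynchronously.

```cpp
#include <exception>
#include <functional>
#include <stdexcept> // std::runtime_error, a typical error thrown by tasks

// Hypothetical helper: runs `task` until it succeeds or `max_attempts` is
// exhausted. Each failure is reported through `on_error` (e.g. for logging)
// instead of escaping to the caller. Returns true on success.
inline bool run_with_retry(
        const std::function<void()>& task,
        int max_attempts,
        const std::function<void(const std::exception&)>& on_error) {
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        try {
            task();
            return true;
        } catch (const std::exception& e) {
            on_error(e);
        }
    }
    return false;
}
```

A task that fails while some nodes are unavailable is thus simply run
again on a later attempt, rather than being dropped after the first
error.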

Fixes #3038

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-12-02 22:40:06 +00:00
Duarte Nunes
82206f966d auth/standard_role_manager: Lift async block to caller
has_existing_roles() creates a seastar thread, but that can be
lifted to the caller for prettier code.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-12-02 20:15:09 +00:00
Takuya ASADA
8c403ea4e0 dist/debian: disable entire pybuild actions
Even after 25bc18b was committed, we still see a build error similar to #3036
in some environments; not in dh_auto_install, but in dh_auto_test (see #3039).

So we need to disable entire pybuild actions, not just dh_auto_install.

Fixes #3039

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1512185097-23828-1-git-send-email-syuu@scylladb.com>
2017-12-02 19:36:43 +02:00
Vladimir Krivopalov
7f7bf8f23a test.py: Fix a typo in role_manager_test name.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <e80ef188c024f178c1c94fe3739b77a2c2448bd4.1512162655.git.vladimir@scylladb.com>
2017-12-01 21:25:08 +00:00
Takuya ASADA
25bc18b8ff dist/debian: skip running dh_auto_install on pybuild
We are getting package build error on dh_auto_install which is invoked by
pybuild.
But since we handle all installation on debian/scylla-server.install, we can
simply skip running dh_auto_install.

Fixes #3036

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1512065117-15708-1-git-send-email-syuu@scylladb.com>
2017-12-01 16:06:25 +02:00
Duarte Nunes
9694bee0d4 Merge 'Improvements to mutation printout' from Tomasz
"This series makes it easier to comprehend assertion failures which
involve printing mutation contents."

* 'tgrabiec/mutation-printout' of github.com:scylladb/seastar-dev:
  tests: Introduce mutation_diff script
  mutation: Make printout more concise
  mutation_partition: Don't print absent elements
  mutation_partition: Make row_marker printout similar to other partition elements
  database: Move operator<<() overloads to appropriate source files
  mutation_partition: Use multi-line printout
  position_in_partition: Improve printout
2017-12-01 11:02:02 +00:00
Tomasz Grabiec
c3276451af tests: Introduce mutation_diff script
Converts assertion failure messages which spit out mutation contents
into a human-readable diff.
2017-12-01 10:52:37 +01:00
Tomasz Grabiec
66990867b8 mutation: Make printout more concise
Before:

{ks.cf key {key: pk{000c706b30303030303030303030}, token:-2018791535786252460} data {mutation_partition:

After:

{ks.cf {key: pk{000c706b30303030303030303030}, token:-2018791535786252460} {mutation_partition:
2017-12-01 10:52:37 +01:00
Tomasz Grabiec
05a6c67804 mutation_partition: Don't print absent elements
Makes printout shorter and thus easier to parse.
2017-12-01 10:52:37 +01:00
Tomasz Grabiec
d8b54a57aa mutation_partition: Make row_marker printout similar to other partition elements 2017-12-01 10:52:37 +01:00
Tomasz Grabiec
fd7ab5fe99 database: Move operator<<() overloads to appropriate source files 2017-12-01 10:52:37 +01:00
Tomasz Grabiec
7bde3090b4 mutation_partition: Use multi-line printout
Convert to a multi line output, which is easier to read for a human.

After:

{ks.cf key {key: pk{000c706b30303030303030303030}, token:-2018791535786252460} data {mutation_partition: {tombstone: none},
 range_tombstones: {},
 static: cont=1 {row: },
 clustered: {
    {rows_entry: cont=true dummy=false {position: clustered,ckp{000c636b30303030303030303030},0} {deletable_row: {row: }}},
    {rows_entry: cont=true dummy=true {position: clustered,ckp{000c636b30303030303030303031},0} {deletable_row: {row: }}}}}}
2017-12-01 10:52:37 +01:00
Tomasz Grabiec
36caf0f9db position_in_partition: Improve printout
Before:

 {position: type clustered, bound_weight -1, key ckp{000c636b30303030303030303033}}

After:

 {position: clustered,ckp{000c636b30303030303030303033},-1}

Benefits:

  - most significant parts appear first.
    bound_weight, which is least significant, was in the middle before.

  - shorter, so a bit easier to parse assertion failures.
2017-12-01 10:52:37 +01:00
Jesse Haber-Kucharsky
cc19545f20 auth/standard_role_manager: Fix initialization
Checking for existing roles requires that the system is "settled" first.
This is consistent with the existing code for user-management, but not
with the initial introduction of the role manager.

Fixes #3028.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <57157a0df92dba6bf9a95960b9c8261a45acb1ad.1512093477.git.jhaberku@scylladb.com>
2017-12-01 10:20:16 +01:00
Duarte Nunes
1b4ca6aadf auth/standard_role_manager: Add exception handling for background task
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171130233851.32827-1-duarte@scylladb.com>
2017-12-01 10:20:16 +01:00
Duarte Nunes
ab6f0de6e7 auth/service: Stop role manager instead of starting
Fixes #3028

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171130232032.31924-1-duarte@scylladb.com>
2017-12-01 10:20:16 +01:00
Avi Kivity
f56c8415d8 Merge seastar upstream
* seastar b2a3ea3...dc44656 (1):
  > Update dpdk submodule
2017-11-30 10:37:23 +02:00
Avi Kivity
ca4abb1bbf Merge seastar upstream
* seastar 3b09bad...b2a3ea3 (5):
  > dependency: use new gcc c++ boost
  > test.py: remove unused black_hole
  > util: Add throw_with_backtrace helper to add backtraces to exceptions.
  > tests: add vruntime to scheduling_group_demo
  > Fix Clang build for recently added io_tester app.
2017-11-30 10:31:48 +02:00
Vladimir Krivopalov
6d76ac8043 Lift checks on list and map values to allow values of length > 64K.
Fixes #3007

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <7b232a655b5531d4bfa2be3d9611f8b1ba0349b0.1512021011.git.vladimir@scylladb.com>
2017-11-30 10:31:19 +02:00
Amos Kong
bfc055fedc install different dependence for fedora and centos
The packages installed by install-dependencies.sh don't satisfy the
configuration requirements on CentOS. This patch switches to using
newer packages from the scylla-3rdparty repo.

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <9bca7b08704f68c604560e5ec7ce0c0358d328da.1511965492.git.amos@scylladb.com>
2017-11-29 17:05:47 +02:00
Duarte Nunes
cda3ddd146 compound_compact: Change universal reference to const reference
The universal reference was introduced so we could bind an rvalue to
the argument, but it would have sufficed to make the argument a const
reference. This is also more consistent with the function's other
overload.
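
The language rule behind this change can be shown with a short sketch.
The function `size_of` is hypothetical and unrelated to the real
`size()` in compound_compact; it only demonstrates that a const lvalue
reference binds to both lvalues and rvalues.

```cpp
#include <cstddef>
#include <string>

// Hypothetical function: a const reference parameter accepts lvalues and
// temporaries alike, so no template with a forwarding reference is needed
// when the argument is only read.
inline std::size_t size_of(const std::string& s) {
    return s.size();
}
```

Both `size_of(name)` for a named string and
`size_of(std::string("temporary"))` compile against this single
signature.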

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171129132758.19654-1-duarte@scylladb.com>
2017-11-29 14:41:35 +01:00
Tomasz Grabiec
e9cce59b85 Merge "compact_storage serialization fixes" from Duarte
Fix two issues with serializing non-compound range tombstones as
compound: convert a non-compound clustering element to compound and
actually advertise the issue to other nodes.

* git@github.com:duarten/scylla.git  rt-compact-fixes/v1:
  compound_compact: Allow rvalues in size()
  sstables/sstables: Convert non-compound clustering element to compound
  tests/sstable_mutation_test: Verify we can write/read non-correct RTs
  service/storage_service: Export non-compound RT feature
2017-11-29 14:17:50 +01:00
Duarte Nunes
2f513514cc service/storage_service: Export non-compound RT feature
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-29 14:17:50 +01:00
Duarte Nunes
13fc26214e tests/sstable_mutation_test: Verify we can write/read non-correct RTs
Add test to verify we can write and read non-compound tombstones and
compound ones for backward compatibility.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-29 14:17:50 +01:00
Duarte Nunes
013659506b sstables/sstables: Convert non-compound clustering element to compound
576ea421dc introduced a regression
as it didn't change the assumption that all clustering elements were
compound when writing a range tombstone, compound or non-compound, as
compound. Thus, we serialized a non-compound element while we should
have serialized a compound one.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-29 14:17:50 +01:00
Duarte Nunes
ec8ce3388e compound_compact: Allow rvalues in size()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-29 14:17:49 +01:00
Paweł Dziepak
586b61d57d size_estimates: convert reader to flat mutation readers
Message-Id: <20171129105909.27084-1-pdziepak@scylladb.com>
2017-11-29 12:14:05 +00:00
Amos Kong
c2bdb3bdbc test.py: remove unused black_hole
Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <2e79a58906e8f3ba512586fe4ea4a662fa1a3d35.1511944232.git.amos@scylladb.com>
2017-11-29 11:07:24 +02:00
Amos Kong
fd71405465 auth/transitional: use defined package name prefix
Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <f3337b00a9209a9af4918a25145d661488387fa8.1511945338.git.amos@scylladb.com>
2017-11-29 09:59:33 +01:00
Amos Kong
46541d400e test.py: fix test runner description
Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <9b6febecc18376e774611322119a6300dc7363e2.1511945338.git.amos@scylladb.com>
2017-11-29 09:59:22 +01:00
Amos Kong
edfaeb40d9 storage_service: fix trace msg in get_ring_delay()
The trace log in get_ring_delay() is wrong.

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <2556583ec160d0417ed669fe3322a16ffda37ce7.1511945338.git.amos@scylladb.com>
2017-11-29 09:59:12 +01:00
Amos Kong
d5caaee0cc main: move messaging service notify to right position
Commit eb13f65949 adjusted the start time
of messaging service, but the notify message wasn't moved along with it.

Signed-off-by: Amos Kong <amos@scylladb.com>
Cc: Pekka Enberg <penberg@scylladb.com>
Message-Id: <1073f285189686619bb4870ef1be20f0f24e8532.1511945338.git.amos@scylladb.com>
2017-11-29 09:59:01 +01:00
Amos Kong
4be66f8498 main: remove repeat register of storage service API
We register the storage service API twice. The first registration happens
before starting storage service, so let's remove it.

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <8bb09c2acfed57bf74a81d189fa08ba34a594294.1511945338.git.amos@scylladb.com>
2017-11-29 09:58:50 +01:00
Raphael S. Carvalho
f699cf17ae sstables: fix data_consume_context's move operator and ctor
After 7f8b62bc0b, its move operator and ctor were broken. That can lead
to errors because the data_consume_context dtor moves the sstable ref into
a continuation while waiting for in-flight reads from the input stream.
Without that, the sstable can be destroyed in the meantime and the file
descriptor becomes invalid, leading to EBADF.

Fixes #3020.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171129014917.11841-1-raphaelsc@scylladb.com>
2017-11-29 09:53:47 +01:00
Avi Kivity
4cfcd8055e Merge "Drop reversible apply() from mutation_partition" from Tomasz
"This simplifies implementation of mutation_partition merging by relaxing
exception guarantees it needs to provide. This allows reverters to be dropped.

Direct motivation for this is to make it easier to implement new semantics
for merging of clustering range continuity.

Implementation details:

We only need strong exception guarantees when applying to the memtable, which is
using MVCC. Instead of calling apply() with strong exception guarantees on the latest
version, we will move the incoming mutation to a new partition_version and then
use monotonic apply() to merge them. If that merging fails, we attach the version with
the remainder, which cannot fail. This way apply() always succeeds if the allocation
of partition_version object succeeds.

Results of `perf_simple_query_g -c1 -m1G --write` (high overwrite rate):

Before:

 101011.13 tps
 102498.07 tps
 103174.68 tps
 102879.55 tps
 103524.48 tps
 102794.56 tps
 103565.11 tps
 103018.51 tps
 103494.37 tps
 102375.81 tps
 103361.65 tps

After:

 101785.37 tps
 101366.19 tps
 103532.26 tps
 100834.83 tps
 100552.11 tps
 100891.31 tps
 101752.06 tps
 101532.00 tps
 100612.06 tps
 102750.62 tps
 100889.16 tps

Fixes #2012."

* tag 'tgrabiec/drop-reversible-apply-v1' of github.com:scylladb/seastar-dev:
  mutation_partition: Drop apply_reversibly()
  mutation_partition: Relax exception guarantees of apply()
  mutation_partition: Introduce apply_weak()
  tests: mvcc: Add test for atomicity of partition_entry::apply()
  tests: Move failure_injecting_allocation_strategy to a header
  tests: mutation_partition: Test exception guarantees of apply_monotonically()
  mvcc: Use apply_monotonically() where sufficient
  mvcc: partition_version: Use apply_monotonically() to provide atomicity
  mvcc: Extract partition_entry::add_version()
  mutation_partition: Introduce apply_monotonically()
  mutation_partition: Introduce row::consume_with()
2017-11-28 16:35:06 +02:00
Tomasz Grabiec
70e14f78a7 mutation_partition: Drop apply_reversibly() 2017-11-28 13:03:06 +01:00
Tomasz Grabiec
091e10fc70 mutation_partition: Relax exception guarantees of apply()
The uses which needed strong or weak exception guarantees were
switched to a solution involving apply_monotonically(). All remaining
uses don't need any exception guarantees.
2017-11-28 13:03:06 +01:00
Tomasz Grabiec
988d3c67b4 mutation_partition: Introduce apply_weak()
Intended to be used by code which doesn't need any exception
guarantees.  Currently just delegates to apply_monotonically().
2017-11-28 13:03:03 +01:00
Tomasz Grabiec
ad37826fcb tests: mvcc: Add test for atomicity of partition_entry::apply() 2017-11-28 12:38:28 +01:00
Tomasz Grabiec
e5532bd644 tests: Move failure_injecting_allocation_strategy to a header 2017-11-28 12:38:28 +01:00
Tomasz Grabiec
1b5f2b0473 tests: mutation_partition: Test exception guarantees of apply_monotonically() 2017-11-28 12:38:28 +01:00
Tomasz Grabiec
376cddb212 mvcc: Use apply_monotonically() where sufficient 2017-11-28 12:38:28 +01:00
Tomasz Grabiec
49c0705409 mvcc: partition_version: Use apply_monotonically() to provide atomicity
This patch drops the use of apply_reversibly(). We move the mutation
to be applied into a new version and then use apply_monotonically() to
merge it (if no snapshot) with the current version. This guarantees
that apply() is atomic even if apply_monotonically() throws.

Fixes #2012.
2017-11-28 12:38:28 +01:00
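The atomicity argument in 49c0705409 can be illustrated with a small Python sketch. This is not Scylla's C++ code: `PartitionEntry`, `MergeInterrupted`, and the `fail_after` knob are hypothetical names, versions are modelled as plain dicts mapping a key to a timestamp, and "merge" just keeps the newer timestamp. The point it tries to show is that a merge which only moves data is safe to interrupt, because attaching the remainder as a new version keeps the logical contents complete.

```python
class MergeInterrupted(Exception):
    """Stands in for an allocation failure in the middle of a merge."""


def apply_monotonically(target, source, fail_after=None):
    # Move entries from source into target one at a time. If this raises,
    # every entry still lives in target or in source, so nothing is lost.
    moved = 0
    for key in list(source):
        if fail_after is not None and moved == fail_after:
            raise MergeInterrupted()
        ts = source[key]
        target[key] = max(target.get(key, ts), ts)  # newer timestamp wins
        del source[key]
        moved += 1


class PartitionEntry:
    def __init__(self):
        self.versions = [{}]  # newest version first

    def apply(self, mutation, fail_after=None):
        incoming = dict(mutation)  # the new version holding the mutation
        try:
            apply_monotonically(self.versions[0], incoming, fail_after)
        except MergeInterrupted:
            # Attach the version holding the remainder instead of rolling back.
            self.versions.insert(0, incoming)

    def read(self):
        # The logical value is the merge of all versions.
        result = {}
        for version in self.versions:
            for key, ts in version.items():
                result[key] = max(result.get(key, ts), ts)
        return result


entry = PartitionEntry()
entry.apply({"a": 1, "b": 1})
entry.apply({"b": 2, "c": 2}, fail_after=1)  # merge fails half way
```

Even though the second merge was interrupted, `entry.read()` still reflects the whole mutation; the unmerged part simply sits in its own version.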
Tomasz Grabiec
52cabe343c mvcc: Extract partition_entry::add_version() 2017-11-28 12:38:27 +01:00
Tomasz Grabiec
97ebf51d3a mutation_partition: Introduce apply_monotonically()
Has weaker exception guarantees than apply(), which allows for simpler
implementation. Intended to replace the apply() with strong exception
guarantees.
2017-11-28 12:28:51 +01:00
Paweł Dziepak
c0253d683b remove partition_snapshot_reader
All uses of partition_snapshot_reader have already been replaced by
partition_snapshot_flat_reader.

Message-Id: <20171128103929.16614-1-pdziepak@scylladb.com>
2017-11-28 12:49:38 +02:00
Tomasz Grabiec
978b874065 mutation_partition: Introduce row::consume_with() 2017-11-28 11:20:03 +01:00
Duarte Nunes
1fbe9dc851 message/messaging_service: Close all server sockets
We were stopping the loop prematurely.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171127181417.8167-1-duarte@scylladb.com>
2017-11-28 11:08:08 +02:00
Jesse Haber-Kucharsky
fb0866ca20 Move thread_local declarations out of main.cc
Since `disk-error-handler.hh` defines these global variables `extern`,
it makes sense to declare them in the `disk-error-handler.cc` instead of
`main.cc`.

This means that test files don't have to declare them.

Fixes #2735.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <1eed120bfd9bb3647e03fe05b60c871de2df2a86.1511810004.git.jhaberku@scylladb.com>
2017-11-27 20:27:42 +01:00
Tomasz Grabiec
04106b4c96 Merge "Convert memtable flush reader to flat streams" from Paweł
This series converts memtable flush reader to the new flat mutation
readers. Just like the scanning reader, flush reader concatenates
multiple partition snapshot readers in order to provide a stream
of all partitions in the memtable.

* https://github.com/pdziepak/scylla.git flat_mutation_reader-memtable-flush/v1
   tests/flat_mutation_reader_assertion: add produces_partition()
   memtable: make make_flush_reader() return flat_mutation_reader
   flat_mutation_reader: add optimised flat_mutation_reader_opt
   memtable: switch flush reader implementation to flat streams
   tests/memtable: add test for flush reader
2017-11-27 20:07:23 +01:00
Paweł Dziepak
87b600cad8 tests/memtable: add test for flush reader 2017-11-27 20:07:23 +01:00
Paweł Dziepak
9dc566c64b memtable: switch flush reader implementation to flat streams 2017-11-27 20:07:22 +01:00
Paweł Dziepak
9c5acaa823 flat_mutation_reader: add optimised flat_mutation_reader_opt 2017-11-27 20:07:22 +01:00
Paweł Dziepak
32eb6437fd memtable: make make_flush_reader() return flat_mutation_reader 2017-11-27 20:07:22 +01:00
Paweł Dziepak
15099a0e8c tests/flat_mutation_reader_assertion: add produces_partition() 2017-11-27 20:07:22 +01:00
Avi Kivity
b7c96b8bd3 Merge "Dormant role-management and CQL" from Jesse
"This series adds the role-management interface, the primary implementation, and the corresponding CQL.

Importantly, this series does not integrate the system with roles, nor does it remove user-based access control. Several new CQL statements are available and should function, but these modify metadata only and have no functional impact on the actual system.

The new statements are:

- CREATE ROLE
- ALTER ROLE
- DROP ROLE
- GRANT ROLE
- REVOKE ROLE
- LIST ROLES

The security model of the role manager is simple at this point: only superusers can create and drop roles. The next patch series will introduce fine-grained role permissions and also slightly change the CQL syntax to be more consistent with the rest of the grammar. This patch series is a starting point for evolving the roles feature and integrating it.

Fixes #2987."

* 'jhk/role_management/v5' of https://github.com/hakuch/scylla:
  auth: Add `alter_role_statement`
  auth: Add `create_role_statement`
  auth: Add `drop_role_statement`
  auth: Add 'revoke_role_statement'
  auth: Add `grant_role_statement`
  auth: Add `list_roles_statement`
  auth: Add dormant role manager to `service`
  auth/service.cc: Remove redundant declarations
  cql3: Add `role_name` and parser rules
  auth: Add role manager
  auth: Unconditionally create the `system_auth` keyspace
  unimplemented.hh: Use [[noreturn]] instead of GCC attribute
  New `unimplemented` feature: roles
2017-11-27 20:01:34 +02:00
Jesse Haber-Kucharsky
9638c2b822 auth: Add alter_role_statement
As with `create_role_statement`, until roles are integrated with the
rest of the system, authentication-related options are ignored.
2017-11-27 12:14:24 -05:00
Jesse Haber-Kucharsky
ba5dfe0a76 auth: Add create_role_statement
Unlike Apache Cassandra, the role manager does not write data related to
password authentication in the metadata tables, and the rest of the
system does not yet integrate with the role manager.

Therefore, executing `CREATE ROLE` currently ignores all
authentication-related options (`PASSWORD` and `OPTIONS`).
2017-11-27 12:14:24 -05:00
Jesse Haber-Kucharsky
7524607f26 auth: Add drop_role_statement
Dropping a role removes all references to it from other roles.

As with the other role-management statements, executing this statement updates
metadata but has no functional impact yet.
2017-11-27 12:14:24 -05:00
Jesse Haber-Kucharsky
2594eb0a11 auth: Add 'revoke_role_statement'
As with `grant_role_statement`, executing this statement updates
metadata but has no functional effect.
2017-11-27 12:14:24 -05:00
Jesse Haber-Kucharsky
7679d45fb6 auth: Add grant_role_statement
While granting a role updates the necessary metadata, since roles do not
interact with the rest of the system yet, there is no functional impact
of doing so.
2017-11-27 12:14:24 -05:00
Jesse Haber-Kucharsky
b024d40d51 auth: Add list_roles_statement 2017-11-27 12:14:24 -05:00
Jesse Haber-Kucharsky
460f3c7065 auth: Add dormant role manager to service
The role manager still does not interact with the rest of the system,
but the role manager is now sharded on all cores and metadata is
created.

The following metadata tables are created:

- `system_auth.roles`
- `system_auth.role_members`

The default superuser, "cassandra", is also created, but has no function.
2017-11-27 12:14:24 -05:00
Jesse Haber-Kucharsky
27420fa189 auth/service.cc: Remove redundant declarations 2017-11-27 12:14:24 -05:00
Jesse Haber-Kucharsky
7e78e1ebdc cql3: Add role_name and parser rules
The `userOrRoleName` parser rule is important for future CQL
role-related statements.

`cql3::role_name` is a small utility for role-related CQL statements
that enforces an important property of role names: that they are always
lower-case unless quoted appropriately.
2017-11-27 12:14:24 -05:00
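The lower-case-unless-quoted property described in 7e78e1ebdc can be sketched in a few lines of Python. This is an illustration only: `normalize_role_name` is a hypothetical helper, not the interface of `cql3::role_name`, and it assumes quoting means surrounding double quotes. Escaped quotes inside a quoted name are deliberately not handled.

```python
def normalize_role_name(raw: str) -> str:
    """Return the role name as it would be stored (illustrative rule only)."""
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        # Quoted: strip the quotes and preserve the case exactly.
        return raw[1:-1]
    # Unquoted: always folded to lower case.
    return raw.lower()
```

Under this rule `Admin` and `admin` name the same role, while `"Admin"` names a different one.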
Jesse Haber-Kucharsky
b266b4b687 auth: Add role manager
The role manager is responsible for creating, removing, querying for,
granting, and revoking roles.

The role manager does not yet run in production, and is not connected to
the rest of the system.

Included in this patch is the definition of the abstract role management
interface, and also the implementation of the standard role manager.

The standard role manager is tested fully in the `role_manager_test`.
2017-11-27 12:14:20 -05:00
Jesse Haber-Kucharsky
8b23d32bb1 auth: Unconditionally create the system_auth keyspace
The `system_auth` keyspace is used to store tables for authentication
and authorization metadata.

Previously, this keyspace would only be created if a non-default
authenticator or authorizer was activated in configuration.

The upcoming role-management system is enabled unconditionally and also
uses the `system_auth` keyspace for its metadata.
2017-11-27 10:01:52 -05:00
Jesse Haber-Kucharsky
832072d1d9 unimplemented.hh: Use [[noreturn]] instead of GCC attribute 2017-11-27 10:01:52 -05:00
Jesse Haber-Kucharsky
b58914feb8 New unimplemented feature: roles 2017-11-27 10:01:52 -05:00
Duarte Nunes
922f095f22 tests: Initialize storage service for some tests
These tests now require the storage service to be initialized, which
is needed to decide whether correct non-compound range tombstones
should be emitted or not.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171126152921.5199-1-duarte@scylladb.com>
2017-11-26 17:41:06 +02:00
Duarte Nunes
15fbb8e1ca cql3/delete_statement: Allow non-range deletions on non-compound schemas
This patch fixes a regression introduced in
1c872e2ddc.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171126102333.3736-1-duarte@scylladb.com>
2017-11-26 12:29:09 +02:00
Takuya ASADA
7380a6088b dist/debian: link libgcc dynamically
As we discussed on the thread (https://github.com/scylladb/scylla/issues/2941),
since we override symbols on libgcc, we need to link libgcc dynamically for
Ubuntu/Debian too (CentOS already does it).

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1511542866-21486-2-git-send-email-syuu@scylladb.com>
2017-11-25 20:09:51 +02:00
Takuya ASADA
df6546d151 dist/debian: switch to our PPA versions of gcc-72
Now we have gcc-7.2 on our PPA for Ubuntu 16.04/14.04, let's switch to it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1511542866-21486-1-git-send-email-syuu@scylladb.com>
2017-11-25 20:09:51 +02:00
Avi Kivity
757d0243a0 Merge seastar upstream
* seastar 7f87529...3b09bad (7):
  > Extend Travis CI to cover Clang 5.0 builds.
  > fair_queue: disallow zeroed shares.
  > Multiple fixes to io_tester to make it compile with GCC 5:
  > transformers: Create tuple explicitely for older compiler support
  > core/sstring: Add construction from `string_view`
  > io_tester: enhanced fair queue tester
  > fstream: do not ignore dma_write return value
2017-11-25 19:50:42 +02:00
Duarte Nunes
4a6ffa3f5c tests/sstable_mutation_test: Change make_reader to make_flat_reader
A merge conflict between 596ebaed1f and
bd1efbc25c caused the test to fail to
build.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-25 15:23:36 +00:00
Tomasz Grabiec
596ebaed1f Merge "Convert sstable writers to flat mutation readers" from Paweł
The following patches convert sstable writers to use flat mutation
readers instead of the legacy mutation_reader interface.
Writers were already using flat consumer interface and used
consume_flattened_in_thread(), so most of the work was limited to
providing an appropriate equivalent for flat mutation readers.

* https://github.com/pdziepak/scylla.git flat_mutation_reader-sstable-write/v1:
  flat_mutation_reader: move consumer_adapter out of consume()
  flat_mutation_reader: introduce consume_in_thread()
  tests/flat_mutation_reader: test consume_in_thread()
  sstables: switch write_components() to flat_mutation_reader
  streamed_mutation: drop streamed_mutation_returning()
  sstables: convert compaction to flat_mutation_reader
  mutation_reader: drop consume_flattened_in_thread()
2017-11-24 16:05:21 +01:00
Tomasz Grabiec
bd1efbc25c Merge "Fixes to sstable files for non-compound schemas" from Duarte
This series mainly fixes issues with the serialization of promoted
index entries for non-compound schemas and with the serialization of
range tombstones, also for non-compound schemas.

We lift the correct cell name writing code into its own function,
and direct all users to it. We also ensure backward compatibility with
incorrectly generated promoted indexes and range tombstones.

Fixes #2995
Fixes #2986
Fixes #2979
Fixes #2992
Fixes #2993

* git@github.com:duarten/scylla.git  promoted-index-serialization/v3:
  sstables/sstables: Unify column name writers
  sstables/sstables: Don't write index entry for a missing row marker
  sstables/sstables: Reuse write_range_tombstone() for row tombstones
  sstables/sstables: Lift index writing for row tombstones
  sstables/sstables: Leverage index code upon range tombstone consume
  sstables/sstables: Move out tombstone check in write_range_tombstone()
  sstables/sstables: A schema with static columns is always compound
  sstables/sstables: Lift column name writing logic
  sstables/sstables: Use schema-aware write_column_name() for
    collections
  sstables/sstables: Use schema-aware write_column_name() for row marker
  sstables/sstables: Use schema-aware write_column_name() for static row
  sstables/sstables: Writing promoted index entry leverages
    column_name_writer
  sstables/sstables: Add supported feature list to sstables
  sstables/sstables: Don't use incorrectly serialized promoted index
  cql3/single_column_primary_key_restrictions: Implement is_inclusive()
  cql3/delete_statement: Constrain range deletions for non-compound
    schemas
  tests/cql_query_test: Verify range deletion constraints
  sstables/sstables: Correctly deserialize range tombstones
  service/storage_service: Add feature for correct non-compound RTs
  tests/sstable_*: Start the storage service for some cases
  sstables/sstable_writer: Prepare to control range tombstone
    serialization
  sstables/sstables: Correctly serialize range tombstones
  tests/sstable_assertions: Fix monotonicity check for promoted indexes
  tests/sstable_assertions: Assert a promoted index is empty
  tests/sstable_mutation_test: Verify promoted index serializes
    correctly
  tests/sstable_mutation_test: Verify promoted index repeats tombstones
  tests/sstable_mutation_test: Ensure range tombstone serializes
    correctly
  tests/sstable_datafile_test: Add test for incorrect promoted index
  tests/sstable_datafile_test: Verify reading of incorrect range
    tombstones
  sstables/sstable: Rename schema-oblivious write_column_name() function
  sstables/sstables: No promoted index without clustering keys
  tests/sstable_mutation_test: Verify promoted index is not generated
  sstables/sstables: Optimize column name writing and indexing
  compound_compat: Don't assume compoundness
2017-11-24 16:03:49 +01:00
Tomasz Grabiec
35e404b1a2 tests: sstable: Make tombstone_purge_test more reliable
A TTL of 1 second may cause the cell to expire right after we write it,
if the seconds component of the current time changes right after the write.
Use a larger TTL to avoid spurious failures due to this.
Message-Id: <1511463392-1451-1-git-send-email-tgrabiec@scylladb.com>
2017-11-24 10:52:26 +00:00
Vladimir Krivopalov
fb7d46fc2e Allow COUNT(*) and COUNT(1) to be queried with other aggregations or columns
Fixes #2218

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <c387d34969d5bcfb8b2bf42806e6e05a9b8a067c.1511487356.git.vladimir@scylladb.com>
2017-11-24 10:01:21 +00:00
Duarte Nunes
576ea421dc compound_compat: Don't assume compoundness
This patch changes some factory functions so that they don't assume
the schema is compound.

This enables some code simplification in
sstables::write_column_name().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 19:14:15 +00:00
Duarte Nunes
8597e1c3f9 sstables/sstables: Optimize column name writing and indexing
Instead of serializing the column name twice, serialize it once into a
buffer which gets used for index bookkeeping and to write to disk.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 19:14:08 +00:00
Paweł Dziepak
6a1fe70a72 mutation_reader: drop consume_flattened_in_thread() 2017-11-23 18:14:31 +00:00
Paweł Dziepak
b64dd21751 sstables: convert compaction to flat_mutation_reader 2017-11-23 18:14:31 +00:00
Paweł Dziepak
9b39d3b023 streamed_mutation: drop streamed_mutation_returning() 2017-11-23 18:14:31 +00:00
Paweł Dziepak
11b32276e6 sstables: switch write_components() to flat_mutation_reader 2017-11-23 18:14:31 +00:00
Paweł Dziepak
2660a43290 tests/flat_mutation_reader: test consume_in_thread() 2017-11-23 18:14:31 +00:00
Paweł Dziepak
cea5778fee flat_mutation_reader: introduce consume_in_thread()
flat_mutation_reader provides a replacement for the old
consume_flattened*() interface and therefore an 'in-thread' variant is
also necessary. It expects to be executed in a seastar::thread context
and guarantees that the consumer member functions will be invoked inside
that thread as well (which is why it cannot be easily replaced by
non-thread version).

In addition, just like the old consume_flattened_in_thread(), its
replacement allows specifying a filter function that causes selected
partitions to be skipped entirely and never reach the consumer.
2017-11-23 18:14:31 +00:00
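The partition filter mentioned in cea5778fee can be pictured with a short Python sketch. It models only the filtering behaviour, not the seastar::thread requirement, and none of these names come from Scylla: `consume_with_filter`, the tuple-based fragment encoding, and the `keep_partition` callback are all assumptions made for illustration.

```python
def consume_with_filter(fragments, consumer, keep_partition):
    """Feed a flat fragment stream to consumer, dropping filtered partitions."""
    skipping = False
    for kind, payload in fragments:
        if kind == "partition_start":
            # The filter is consulted once per partition, on its key.
            skipping = not keep_partition(payload)
        if not skipping:
            consumer(kind, payload)


# A flat stream: each partition is a start marker, its rows, and an end marker.
stream = [
    ("partition_start", "pk1"), ("row", "r1"), ("partition_end", None),
    ("partition_start", "pk2"), ("row", "r2"), ("partition_end", None),
]
seen = []
consume_with_filter(
    stream,
    lambda kind, payload: seen.append((kind, payload)),
    lambda key: key != "pk1",  # skip partition pk1 entirely
)
```

The consumer never observes any fragment of `pk1`, including its start and end markers.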
Duarte Nunes
5aa5780701 tests/sstable_mutation_test: Verify promoted index is not generated
Verify we don't generate a promoted index if the schema lacks
clustering keys.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
10dea07ab7 sstables/sstables: No promoted index without clustering keys
We don't need to generate a promoted index if the schema lacks
clustering keys.

Fixes #2995

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
66df2e41fc sstables/sstable: Rename schema-oblivious write_column_name() function
This function is now called write_compound_non_dense_column_name() so
callers are aware of the cases where it can be called.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
338f038e7a tests/sstable_datafile_test: Verify reading of incorrect range tombstones
Add a test to verify that we can still read range tombstones for
non-compound schemas that previous Scylla versions wrote incorrectly.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
766ca8dff4 tests/sstable_datafile_test: Add test for incorrect promoted index
Ensure we don't load incorrectly generated promoted indexes.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
f9a76084e9 tests/sstable_mutation_test: Ensure range tombstone serializes correctly
This patch ensures range tombstones are correctly serialized for dense
non-compound schemas.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
e612f71ed6 tests/sstable_mutation_test: Verify promoted index repeats tombstones
Both for compact and non-compact storage schemas.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
d8af9ffe5a tests/sstable_mutation_test: Verify promoted index serializes correctly
For different types of schemas.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
32cb8b6dc0 tests/sstable_assertions: Assert a promoted index is empty
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
ffaa3341c3 tests/sstable_assertions: Fix monotonicity check for promoted indexes
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
24b867adda sstables/sstables: Correctly serialize range tombstones
This patch ensures we correctly serialize range tombstones for dense
non-compound schemas, which until now assumed the bounds were compound
composite. We also fix the reading function, which assumed the same
thing. This affected Apache Cassandra compatibility.

Fixes #2986

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
3368411e03 sstables/sstable_writer: Prepare to control range tombstone serialization
This patch adds support to sstable_writer to be able to control
correct range tombstone serialization.

When range tombstone serialization is fixed in subsequent
patches, it will only be enabled when the whole cluster supports the
feature, to allow for rollbacks.

The feature needs to be enabled for an sstable as a whole, to prevent
problems with it being enabled during an sstable write.

Thus, the sstable writer will pass on this information to the sstable
methods that carry out the actual file writing.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
19cd65a681 tests/sstable_*: Start the storage service for some cases
We will need to check the cluster's enabled features when writing
range tombstones.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
ae3a58d7ec service/storage_service: Add feature for correct non-compound RTs
This patch adds a cluster feature to enable correct serialization of
non-compound range tombstones. We thus support rollbacks during an
upgrade, as we will only change range tombstone serialization when the
cluster is fully upgraded and all nodes are capable of reading the new
format.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
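The gating rule in ae3a58d7ec (switch formats only once every node can read the new one) can be sketched as follows. This is a Python illustration, not Scylla's feature service: the feature string, the node names, and both function names are hypothetical.

```python
# Hypothetical feature name, used only for this illustration.
CORRECT_RTS = "CORRECT_NON_COMPOUND_RANGE_TOMBSTONES"


def cluster_supports(feature, nodes):
    """True only if every node in the cluster advertises the feature."""
    return all(feature in features for features in nodes.values())


def choose_range_tombstone_format(nodes):
    # Keep writing the legacy format until the whole cluster is upgraded,
    # so a node rolled back to the old version can still read new sstables.
    return "correct" if cluster_supports(CORRECT_RTS, nodes) else "legacy"


mid_upgrade = {"n1": {CORRECT_RTS}, "n2": set(), "n3": {CORRECT_RTS}}
fully_upgraded = {"n1": {CORRECT_RTS}, "n2": {CORRECT_RTS}, "n3": {CORRECT_RTS}}
```

With one node still on the old version the writer stays on the legacy format; only the fully upgraded cluster switches.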
Duarte Nunes
eeacef3089 sstables/sstables: Correctly deserialize range tombstones
This patch changes the range tombstone read path to deal with
correctly written non-compound range tombstones, while also
maintaining backward compatibility and reading old Scylla-generated
range tombstones.

The fix for the write path will activate an sstable feature which will
connect with this patch.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
e51fc2096b tests/cql_query_test: Verify range deletion constraints
Test that unsupported range deletions against non-compound schemas are
rejected.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
1c872e2ddc cql3/delete_statement: Constrain range deletions for non-compound schemas
We cannot represent ranged deletions with non-inclusive bounds on our
current storage format for schemas that are non-compound, since the
clustering key won't include the EOC byte.

Refs #2986

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
eea4e349ea cql3/single_column_primary_key_restrictions: Implement is_inclusive()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
f217dcc0ce sstables/sstables: Don't use incorrectly serialized promoted index
Promoted indexes generated before this patch by Scylla are considered
incorrect if they belong to a non-compound schema, due to #2993.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
8cdd8e2431 sstables/sstables: Add supported feature list to sstables
This patch adds additional metadata to the scylla sstable component.
Namely, it adds a list of features that the current sstable supports.
The upcoming usages of the feature list are meant for backward
compatibility, but the implementation makes no such assumptions.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
e81a3d487d sstables/sstables: Writing promoted index entry leverages column_name_writer
This patch refactors writing a promoted index entry to leverage the
column_name_writer. It not only reduces code duplication, but also
solves two important bugs:

1) Column names for schema types other than compound non-dense were
   not correctly serialized, as the wrong overload of
   write_column_name() was being called, which assumed the specified
   composite to be compound.

2) Before, for some schema types we were passing an empty
   clustering_key to maybe_flush_pi_block(), which caused it to bypass
   appending open range tombstones to the data file, causing wrong
   query results to be returned.

Fixes #2979
Fixes #2992
Fixes #2993

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
51eed140d2 sstables/sstables: Use schema-aware write_column_name() for static row
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
b7624afca6 sstables/sstables: Use schema-aware write_column_name() for row marker
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
42f125c1ef sstables/sstables: Use schema-aware write_column_name() for collections
Eventually all current callers of write_column_name() will move to the
schema-aware one.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
96daf17f8c sstables/sstables: Lift column name writing logic
This patch lifts the logic to write a column name depending on the
schema's denseness and compoundness into a function, so that it may
later be reused in other places. We still duplicate the same logic
when writing a clustered row because the index writer requires it for
now.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
594ed2d02a sstables/sstables: A schema with static columns is always compound
A schema can only have static columns if it has at least one
clustering column. A schema with a clustering column is always
compound, unless it is created with compact storage. A schema created
with compact storage cannot have static columns, so we can remove dead
code from the sstable write path.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
8011f3393e sstables/sstables: Move out tombstone check in write_range_tombstone()
We were incurring superfluous checks, as they were already performed
in some of the callers.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
1e0f155447 sstables/sstables: Leverage index code upon range tombstone consume
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
989dc1d8c0 sstables/sstables: Lift index writing for row tombstones
This will allow code reuse in the following patches.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
5dfbdbaa04 sstables/sstables: Reuse write_range_tombstone() for row tombstones
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
d1e1fc928e sstables/sstables: Don't write index entry for a missing row marker
Encapsulate the decision to write the row_marker and to write a
corresponding entry in the promoted index. We now avoid writing the
index entry if there is no row marker, and just start indexing the row
at the first cell.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
8907c1dfb2 sstables/sstables: Unify column name writers
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Paweł Dziepak
7936a55836 flat_mutation_reader: move consumer_adapter out of consume()
Making consumer_adapter a member of flat_mutation_reader::impl instead
of being a local class in consume() will make it possible to reuse that
helper in other functions.
2017-11-23 14:25:31 +00:00
Glauber Costa
881a859b21 transport: enhance reporting of requests blocked in the transport layer
It's hard to make sense of the metric transport.requests_blocked_memory
because it shows a queue size. Especially in production setups that scrape
every 15 seconds, that doesn't tell us much.

We solve that in other layers that record blocking by providing both a
requests_blocked_memory and a requests_blocked_memory_current metric.

Fixes #3010

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171123033329.32596-1-glauber@scylladb.com>
2017-11-23 12:37:16 +02:00
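Why a queue size alone tells little at a 15-second scrape interval can be shown with a small Python sketch. It assumes, which the commit message does not spell out, that one metric is a running total of blocked requests and the `_current` one is the queue length at scrape time. The class and attribute names are hypothetical.

```python
class BlockedRequestsMetrics:
    def __init__(self):
        self.blocked_total = 0    # assumed counter: only ever grows
        self.blocked_current = 0  # assumed gauge: queue length right now

    def on_block(self):
        self.blocked_total += 1
        self.blocked_current += 1

    def on_unblock(self):
        self.blocked_current -= 1


metrics = BlockedRequestsMetrics()
# A short burst that starts and finishes between two scrapes.
metrics.on_block()
metrics.on_block()
metrics.on_unblock()
metrics.on_unblock()
```

At the next scrape the gauge reads 0, as if nothing happened, while the counter still shows that two requests were blocked in the interval.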
Amnon Heiman
3f8d9a87ee estimated_histogram: update the sum and count when merging
When merging histograms the count and the sum should be updated.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20171122154822.23855-1-amnon@scylladb.com>
2017-11-22 16:55:55 +01:00
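The rule stated in 3f8d9a87ee can be sketched in Python. This is an illustration, not Scylla's estimated_histogram: the `Histogram` class and its field names are hypothetical, and both histograms are assumed to share the same bucket layout.

```python
from dataclasses import dataclass, field


@dataclass
class Histogram:
    buckets: list = field(default_factory=lambda: [0, 0, 0])
    count: int = 0
    total: int = 0  # sum of all recorded values

    def merge(self, other):
        # Buckets are added element-wise; count and sum must follow,
        # otherwise averages computed from the merged result are wrong.
        self.buckets = [a + b for a, b in zip(self.buckets, other.buckets)]
        self.count += other.count
        self.total += other.total
        return self


a = Histogram(buckets=[1, 0, 2], count=3, total=30)
b = Histogram(buckets=[0, 1, 1], count=2, total=25)
a.merge(b)
```

After the merge, `a.total / a.count` gives the average over all five samples rather than over the first three.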
Glauber Costa
6c4e8049a0 estimated_histogram: also fill up sum metric
Prometheus histograms have 3 embedded metrics: count, buckets, and sum.
Currently we fill up count and buckets but sum is left at 0. This is
particularly bad, since according to the prometheus documentation, the
best way to calculate histogram averages is to write:

  rate(metric_sum[5m]) / rate(metric_count[5m])

One way of keeping track of the sum is adding the value we sampled,
every time we sample. However, the interface for the estimated histogram
has a method that adds a sample while also adjusting the count for
samples that were missed (add_nano()).

That makes accumulating a sum inaccurate, as we will have no values for
the points that were added. To overcome that, when we call add_nano(),
we pretend we are introducing new_count - _count samples, all with the
same value.

Long term, doing away with sampling may help us provide more accurate
results.

After this patch, we are able to correctly calculate latency averages
through the data exported in prometheus.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171122144558.7575-1-glauber@scylladb.com>
2017-11-22 16:10:12 +01:00
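The sum bookkeeping described in 6c4e8049a0 (and the merge fix in 3f8d9a87ee) can be illustrated with a minimal sketch. The type `sampled_histogram` and its members are hypothetical stand-ins, not the actual `estimated_histogram` interface; buckets are left out so that only the count/sum arithmetic is visible.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for an estimated histogram.
// It tracks only the running count and sum, not the buckets.
struct sampled_histogram {
    uint64_t count = 0;
    uint64_t sum = 0;

    // Record one sampled value while raising the total count to new_count.
    // The samples that were skipped are assumed to have had the same value,
    // so the sum grows by value * (new_count - count).
    void add_sampled(uint64_t value, uint64_t new_count) {
        assert(new_count >= count);
        sum += value * (new_count - count);
        count = new_count;
    }

    // Merging two histograms must add up both the count and the sum.
    void merge(const sampled_histogram& other) {
        count += other.count;
        sum += other.sum;
    }

    // Lifetime average; Prometheus computes the same ratio over a window
    // with rate(metric_sum[5m]) / rate(metric_count[5m]).
    double average() const {
        return count ? static_cast<double>(sum) / static_cast<double>(count) : 0.0;
    }
};
```

With this scheme, a single sample of 10 that brings the count from 0 to 4 contributes 40 to the sum, so the sum and the count stay consistent with each other even though only one value was actually observed.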
Tomasz Grabiec
e9ffe36d65 Merge "Remove sstable::read_rows" from Piotr
* seastar-dev.git haaawk/flat_reader_remove_read_rows:
  sstable_mutation_test: use read_rows_flat instead of read_rows
  perf_sstable: use read_rows_flat instead of read_rows
  Remove sstable::read_rows
2017-11-22 15:50:59 +01:00
Piotr Jastrzebski
0fdfd2c5bc Remove sstable::read_rows
It's no longer used. read_rows_flat is used everywhere instead.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-22 15:48:57 +01:00
Piotr Jastrzebski
571bac7336 perf_sstable: use read_rows_flat instead of read_rows
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-22 15:48:57 +01:00
Piotr Jastrzebski
da2f2164e9 sstable_mutation_test: use read_rows_flat instead of read_rows
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-22 15:48:57 +01:00
Tomasz Grabiec
aa8c2cbc16 Merge "Migrate sstables to flat_mutation_reader" from Piotr
Introduce sstable::read_row_flat and sstable::read_range_rows_flat methods
and use them in sstable::as_mutation_source.

* https://github.com/scylladb/seastar-dev/tree/haaawk/flat_reader_sstables_v3:
  Introduce conversion from flat_mutation_reader to streamed_mutation
  Add sstables::read_rows_flat and sstables::read_range_rows_flat
  Turn sstable_mutation_reader into a flat_mutation_reader
  sstable: add getter for filter_tracker
  Move mp_row_consumer methods implementations to the bottom
  Remove unused sstable_mutation_reader constructor
  Replace "sm" with "partition" in get_next_sm and on_sm_finished
  Move advance_to_upper_bound above sstable_mutation_reader
  Store sstable_mutation_reader pointer in mp_row_consumer
  Stop using streamed_mutation in consumer and reader
  Stop using streamed_mutation in sstable_data_source
  Delete sstable_streamed_mutation
  Introduce sstable::read_row_flat
  Migrate sstable::as_mutation_source to flat_mutation_reader
  Remove single_partition_reader_adaptor
  Merge data_consume_context::impl into data_consume_context
  Create data_consume_context_opt.
  Merge on_partition_finished into mark_partition_finished
  Check _partition_finished instead of _current_partition_key
  Merge sstable_data_source into sstable_mutation_reader
  Remove sstable_data_source
  Remove get_next_partition and partition_header
2017-11-22 15:45:21 +01:00
Calle Wilund
912d29e79b storage_service: don't use potentially stale iterator in log
Message-Id: <20171121115119.29642-2-calle@scylladb.com>
2017-11-22 15:26:56 +01:00
Piotr Jastrzebski
df110e8b4d Remove get_next_partition and partition_header
Handle next_partition in on_next_partition instead.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-22 15:24:49 +01:00
Piotr Jastrzebski
a3b69235e3 Remove sstable_data_source
It's not used any more and can be safely removed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-22 15:24:49 +01:00
Piotr Jastrzebski
4b9a34a854 Merge sstable_data_source into sstable_mutation_reader
There's no need for sstable_data_source to be separated any more.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-22 15:24:49 +01:00
Piotr Jastrzebski
f2191e0984 Check _partition_finished instead of _current_partition_key
to check whether the partition is finished. In the next patch
_current_partition_key will be merged with sstable_data_source::_key
and won't be cleared any more.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-22 15:24:49 +01:00
Piotr Jastrzebski
be0c9040a6 Merge on_partition_finished into mark_partition_finished
This simplifies code quite a bit.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-22 15:24:49 +01:00
Piotr Jastrzebski
8afbe0ead0 Create data_consume_context_opt.
This will be used in sstable_mutation_reader before
first fill_buffer is called and a proper data_consume_context
is created.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-22 15:24:22 +01:00
Duarte Nunes
3d24eed39e service/storage_service: Remove outdated FIXME
Thrift server is now a bit more graceful on shutdown.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171121214341.31165-1-duarte@scylladb.com>
2017-11-22 10:48:46 +02:00
Piotr Jastrzebski
7f8b62bc0b Merge data_consume_context::impl into data_consume_context
There's no reason to use pimpl in data_consume_context

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-21 20:22:38 +01:00
Takuya ASADA
c1b97d11ea dist/redhat: avoid hardcoding GPG key file path on scylla-epel-7-x86_64.cfg
Since we want to support cross building, we shouldn't hardcode the GPG file path,
even though these files are provided by recent versions of mock.

This fixes a build error on some older build environments such as CentOS-7.2.

Fixes #3002

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1511277722-22917-1-git-send-email-syuu@scylladb.com>
2017-11-21 17:25:39 +02:00
Vladimir Krivopalov
61b1988aa1 Use meaningful error messages when throwing a marshal_exception
Fixes #2977

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <20171121005108.23074-1-vladimir@scylladb.com>
2017-11-21 16:05:43 +02:00
Daniel Fiala
21ea05ada1 utils/big_decimal: Fix compilation issue with conversion of cpp_int to uint64_t.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20171121134854.16278-1-daniel@scylladb.com>
2017-11-21 15:51:29 +02:00
Tomasz Grabiec
6969a235f3 Merge "Convert queries to flat mutation readers" from Paweł
These patches convert queries (data, mutation and counter) to flat
mutation readers. All of them already use consume_flattened() to
consume a flat stream of data, so the only major missing thing
was adding support for reversed partitions to
flat_mutation_reader::consume().

* pdziepak flat_mutation_reader-queries/v3-rebased:
  flat_mutation_reader: keep reference to decorated key valid
  flat_muation_reader: support consuming reversed partitions
  tests/flat_mutation_reader: add test for
    flat_mutation_reader::consume()
  mutation_partition: convert queries to flat_mutation_readers
  tests/row_cache_stress_test: do not use consume_flattened()
  mutation_reader: drop consume_flattened()
  streamed_mutation: drop reverse_streamed_mutation()
2017-11-21 12:55:57 +01:00
Paweł Dziepak
8baf682216 streamed_mutation: drop reverse_streamed_mutation() 2017-11-21 11:37:04 +00:00
Paweł Dziepak
5753e85c6b mutation_reader: drop consume_flattened()
consume_flattened() has been fully replaced by
flat_mutation_reader::consume()
2017-11-21 11:37:04 +00:00
Paweł Dziepak
5851b86369 tests/row_cache_stress_test: do not use consume_flattened() 2017-11-21 11:37:04 +00:00
Paweł Dziepak
48c3db54c9 mutation_partition: convert queries to flat_mutation_readers 2017-11-21 11:37:04 +00:00
Paweł Dziepak
00c8b38a88 tests/flat_mutation_reader: add test for flat_mutation_reader::consume() 2017-11-21 11:37:04 +00:00
Paweł Dziepak
cdb30f74a8 flat_muation_reader: support consuming reversed partitions
Some queries may need the fragments that belong to a partition to be
emitted in reverse order. Current support for that is very limited
(see #1413), but should work reasonably well for small partitions.
2017-11-21 11:37:04 +00:00
Paweł Dziepak
c817adc809 flat_mutation_reader: keep reference to decorated key valid
consume_flattened() guarantees that partition key (passed by reference)
will be valid until the end of partition.
flat_mutation_reader::consume() provides the same interface for consumer
so it also should make sure that the key remains valid.
2017-11-21 11:37:04 +00:00
Paweł Dziepak
1b936876b7 streamed_mutation: make emit_range_tombstone() exception safe
For a time, a range tombstone that was already removed from a tree
is owned by a raw pointer. This doesn't end well if creation of
a mutation fragment or a call to push_mutation_fragment() throws.
Message-Id: <20171121105749.16559-1-pdziepak@scylladb.com>
2017-11-21 12:28:20 +01:00
Avi Kivity
c6fa727af0 tracing: add missing include
The IDE doesn't understand what lw_shared_ptr<> means without it,
though it does compile.
2017-11-21 13:24:07 +02:00
Piotr Jastrzebski
ae3259c9be position_in_partition: support _type in operator<<
It is useful to print position_in_partition::_type together
with the other fields to get a full view of what the position
represents.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <d2d25155a656aa6c2cefcd4964abccfa31cc4c45.1511252093.git.piotr@scylladb.com>
2017-11-21 12:35:32 +02:00
Vlad Zolotarov
941aa20252 cql_transport::cql_server: fix the distributed prepared statements cache population
Don't std::move() the "query" string inside the parallel_for_each() lambda.
parallel_for_each is going to invoke the given callback object for each element of the range
and as a result the first call of lambda that std::move()s the "query" is going to destroy it for
all other calls.

Fixes #2998

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1511225744-1159-1-git-send-email-vladz@scylladb.com>
2017-11-21 10:37:49 +02:00
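The pitfall fixed in 941aa20252 can be reproduced without Seastar. The sketch below is an assumption-laden illustration: `shard_cache`, `for_each_shard`, `populate_moving` and `populate_copying` are hypothetical names, and a plain loop stands in for `parallel_for_each`. What it shows is that a callback invoked once per element must not move out of state it shares across invocations.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-shard prepared statements cache.
using shard_cache = std::vector<std::string>;

// Stand-in for parallel_for_each: invokes func once per element.
template <typename Func>
void for_each_shard(std::vector<shard_cache>& shards, Func func) {
    for (auto& shard : shards) {
        func(shard);
    }
}

// Buggy variant: the first invocation moves out of `query`, leaving it in a
// valid but unspecified (typically empty) state for all later invocations.
inline void populate_moving(std::vector<shard_cache>& shards, std::string query) {
    for_each_shard(shards, [&query](shard_cache& shard) {
        shard.push_back(std::move(query));
    });
}

// Fixed variant: every invocation copies, so each shard sees the full query.
inline void populate_copying(std::vector<shard_cache>& shards, const std::string& query) {
    for_each_shard(shards, [&query](shard_cache& shard) {
        shard.push_back(query);
    });
}
```

In the copying variant every shard ends up with the same query text; in the moving variant only the first shard is guaranteed to.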
Piotr Jastrzebski
644f9d9883 Remove single_partition_reader_adaptor
It is no longer used anywhere.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-20 16:26:54 +01:00
Piotr Jastrzebski
eb31ec00a2 Migrate sstable::as_mutation_source to flat_mutation_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-20 16:26:54 +01:00
Piotr Jastrzebski
11a354b144 Introduce sstable::read_row_flat
This will be used together with sstables::read_range_rows
to migrate sstables::as_mutation_source().

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-20 16:26:54 +01:00
Piotr Jastrzebski
65c6f339d6 Delete sstable_streamed_mutation
It's no longer used so can be safely removed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-20 16:26:54 +01:00
Piotr Jastrzebski
e241b0c2de Stop using streamed_mutation in sstable_data_source
Use a partition_header instead.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-20 16:26:54 +01:00
Piotr Jastrzebski
375c321e9d Stop using streamed_mutation in consumer and reader
Don't use streamed_mutation in mp_row_consumer
and sstable_mutation_reader.

Also use sstable_mutation_reader in sstable::read_row.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-20 16:22:57 +01:00
Avi Kivity
ff19cdc092 Merge seastar upstream
* seastar 78cd87f...7f87529 (3):
  > exception: use phdr hash on reactor threads only
  > tests: httpd use noncopyable_function
  > Merge "fixes of issues found by seastar's unit tests" (ppc) from Vlad

Fixes #2967.
2017-11-20 17:16:52 +02:00
Botond Dénes
f059e71056 Add fast-forwarding with no data test to mutation_source_test
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <9cb630bf9441e178b2040709f92767d4a740a875.1511180262.git.bdenes@scylladb.com>
2017-11-20 13:36:14 +01:00
Botond Dénes
a1a0d445d6 flat_mutation_reader_assertions: add fast_forward_to(position_range)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <7b530909cf188887377aec3985f9f8c0e3b9b1e8.1511180262.git.bdenes@scylladb.com>
2017-11-20 13:35:57 +01:00
Botond Dénes
8065dca4a1 flat_mutation_reader_from_mutation_reader(): make ff more resilient
Currently flat_mutation_reader_from_mutation_reader()'s
converting_reader will throw std::runtime_error if fast_forward_to() is
called when its internal streamed_mutation_opt is disengaged. This can
create problems if this reader is a sub-reader of a combined reader as the
latter has no way to determine the source of a sub-reader EOS. A reader
can be in EOS either because it reached the end of the current
position_range or because it doesn't have any more data.
To avoid this, instead of throwing we just silently ignore the fact that
the streamed_mutation_opt is disengaged and set _end_of_stream to true
which is still correct.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <83d309b225950bdbbd931f1c5e7fb91c9929ba1c.1511180262.git.bdenes@scylladb.com>
2017-11-20 13:35:42 +01:00
Duarte Nunes
34a0b85982 thrift/server: Handle exception within gate
The exception handling code inspects server state, which could be
destroyed before the handle_exception() task runs since it runs after
exiting the gate. Move the exception handling inside the gate and
avoid scheduling another accept if the server has been stopped.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171116122921.21273-1-duarte@scylladb.com>
2017-11-20 13:55:14 +02:00
Takuya ASADA
f26cde582f configure.py: suppress 'nonnull-compare' warning on antlr3
We get the following warning from the antlr3 header when we compile Scylla with gcc-7.2:
/opt/scylladb/include/antlr3bitset.inl: In member function 'antlr3::BitsetList<AllocatorType>::BitsetType* antlr3::BitsetList<AllocatorType>::bitsetLoad() [with ImplTraits = antlr3::TraitsBase<antlr3::CustomTraitsBase>]':
/opt/scylladb/include/antlr3bitset.inl:54:2: error: nonnull argument 'this' compared to NULL [-Werror=nonnull-compare]

To make it compilable we need to specify '-Wno-nonnull-compare' on cflags.

Message-Id: <1510952411-20722-2-git-send-email-syuu@scylladb.com>
2017-11-20 13:07:09 +02:00
Takuya ASADA
ab9d7cdc65 dist/debian: switch Debian 3rdparty packages to external build service
Switch Debian 3rdparty packages to our OBS repo
(https://build.opensuse.org/project/subprojects/home:scylladb).

We no longer use the 3rdparty packages in dist/debian/dep, so they are dropped.
We also switch Debian to gcc-7.2/boost-1.63 at the same time.

Due to packaging issues, the following packages do not follow our 3rdparty
package naming rule for now:
 - gcc-7: renamed as 'xxx-scylla72' instead of 'scylla-xxx-72'.
 - boost1.63: not renamed, and its prefix is not changed to /opt/scylladb

Message-Id: <1510952411-20722-1-git-send-email-syuu@scylladb.com>
2017-11-20 13:07:04 +02:00
Tomasz Grabiec
cec5b0a5b8 Merge "Fix reversed queries with range tombstones" from Paweł
This series reworks handling of range tombstones in reversed queries
so that they are applied to correct rows. Additionally, the concept
of flipped range tombstones is removed, since it only made it harder
to reason about the code.

Fixes #2982.

* https://github.com/pdziepak/scylla fix-reverse-query-range-tombstone/v2:
  streamed_mutation: fix reversing range tombstones
  range_tombstone: drop flip()
  tests/cql_query_test: test range tombstones and reverse queries
  tests/range_tombstone_list: add test for range_tombstone_accumulator
2017-11-17 16:31:34 +01:00
Piotr Jastrzebski
f7bf782a41 Store sstable_mutation_reader pointer in mp_row_consumer
The reader will be used by mp_row_consumer instead of streamed_mutation
in next patches.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-16 22:06:28 +01:00
Piotr Jastrzebski
145fcf846e Move advance_to_upper_bound above sstable_mutation_reader
It will be used in sstable_mutation_reader when the reader
will be used to implement sstable::read_row.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-16 22:06:28 +01:00
Piotr Jastrzebski
1c7938c44d Replace "sm" with "partition" in get_next_sm and on_sm_finished
Streamed mutation won't be used any more so get_next_partition
and on_partition_finished are more suitable names.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-16 22:06:28 +01:00
Piotr Jastrzebski
4943f52ad7 Remove unused sstable_mutation_reader constructor
The constructor is never used so it can be safely removed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-16 22:06:28 +01:00
Piotr Jastrzebski
c7971eb8e3 Move mp_row_consumer methods implementations to the bottom
Those methods have to be below sstable_mutation_reader because
they will be using the reader instead of streamed_mutation.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-16 22:06:28 +01:00
Piotr Jastrzebski
19fcf8accf sstable: add getter for filter_tracker
This will be needed to use sstable_mutation_reader for
sstable::read_row.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-16 22:06:28 +01:00
Piotr Jastrzebski
537b42e153 Turn sstable_mutation_reader into a flat_mutation_reader
This is the first step which still uses streamed_mutation.
Next step will be to get rid of streamed_mutation.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-16 22:06:00 +01:00
Paweł Dziepak
81b9595dcc tests/range_tombstone_list: add test for range_tombstone_accumulator 2017-11-16 17:15:36 +00:00
Paweł Dziepak
774fcc8c66 tests/cql_query_test: test range tombstones and reverse queries
Reproducer for #2982.
2017-11-16 17:15:36 +00:00
Paweł Dziepak
bb54af66a9 range_tombstone: drop flip()
Flipped range tombstones violated the assumption that position() <
end_position() and therefore could only be used in some specific cases.
2017-11-16 17:15:36 +00:00
Paweł Dziepak
5f08831192 streamed_mutation: fix reversing range tombstones
Right now a reversed streamed mutation emits range tombstones after the
mutation fragments affected by them. This breaks the queries.

This patch reworks the way range tombstones are handled in reversed
streams:
 - range tombstones are no longer flipped -- invariant that start bound
   is smaller than the end bound always holds
 - in reversed streams they are ordered by their end_position()

Fixes #2982.
2017-11-16 17:15:36 +00:00
Piotr Jastrzebski
74f0c01865 Add sstables::read_rows_flat and sstables::read_range_rows_flat
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-15 15:33:23 +01:00
Piotr Jastrzebski
3f70dfc939 Introduce conversion from flat_mutation_reader to streamed_mutation
Allows splitting migration into small steps.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-15 15:33:23 +01:00
1238 changed files with 69839 additions and 29482 deletions

View File

@@ -1,3 +1,9 @@
This is Scylla's bug tracker, to be used for reporting bugs only.
If you have a question about Scylla, and not a bug, please ask it in
our mailing-list at scylladb-dev@googlegroups.com or in our slack channel.
- [] I have read the disclaimer above, and I am reporting a suspected malfunction in Scylla.
*Installation details*
Scylla version (or git commit hash):
Cluster size:

4
.github/PULL_REQUEST_TEMPLATE.md vendored Normal file
View File

@@ -0,0 +1,4 @@
Scylla doesn't use pull-requests, please send a patch to the [mailing list](mailto:scylladb-dev@googlegroups.com) instead.
See our [contributing guidelines](../CONTRIBUTING.md) and our [Scylla development guidelines](../HACKING.md) for more information.
If you have any questions please don't hesitate to send a mail to the [dev list](mailto:scylladb-dev@googlegroups.com).

1
.gitignore vendored
View File

@@ -18,3 +18,4 @@ CMakeLists.txt.user
*.egg-info
__pycache__CMakeLists.txt.user
.gdbinit
resources

5
.gitmodules vendored
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui
@@ -9,3 +9,6 @@
[submodule "dist/ami/files/scylla-ami"]
path = dist/ami/files/scylla-ami
url = ../scylla-ami
[submodule "xxHash"]
path = xxHash
url = ../xxHash

View File

@@ -125,7 +125,7 @@ list(REMOVE_ITEM SEASTAR_CFLAGS "-DHAVE_GCC6_CONCEPTS")
#
# For ease of browsing the source code, we always pretend that DPDK is enabled.
target_compile_options(scylla PUBLIC
-std=gnu++14
-std=gnu++1z
-DHAVE_DPDK
-DHAVE_HWLOC
"${SEASTAR_CFLAGS}")
@@ -137,4 +137,5 @@ target_include_directories(scylla PUBLIC
${SEASTAR_DPDK_INCLUDE_DIRS}
${SEASTAR_INCLUDE_DIRS}
${Boost_INCLUDE_DIRS}
xxhash
build/release/gen)

View File

@@ -85,7 +85,53 @@ The `-c1 -m1G` arguments limit this Seastar-based test to a single system thread
All changes to Scylla are submitted as patches to the public mailing list. Once a patch is approved by one of the maintainers of the project, it is committed to the maintainers' copy of the repository at https://github.com/scylladb/scylla.
Detailed instructions for formatting patches for the mailing list and advice on preparing good patches are available at the [ScyllaDB website](http://docs.scylladb.com/contribute/).
Detailed instructions for formatting patches for the mailing list and advice on preparing good patches are available at the [ScyllaDB website](http://docs.scylladb.com/contribute/). There are also some guidelines that can help you make the patch review process smoother:
1. Before generating patches, make sure your Git configuration points to `.gitorderfile`. You can do it by running
```bash
$ git config diff.orderfile .gitorderfile
```
2. If you are sending more than a single patch, push your changes into a new branch of your fork of Scylla on GitHub and add a URL pointing to this branch to your cover letter.
3. If you are sending a new revision of an earlier patchset, add a brief summary of changes in this version, for example:
```
In v3:
- declared move constructor and move assignment operator as noexcept
- used std::variant instead of a union
...
```
4. Add information about the tests run with this fix. It can look like
```
"Tests: unit ({mode}), dtest ({smp})"
```
The usual is "Tests: unit (release)", although running debug tests is encouraged.
5. When answering review comments, prefer inline quotes as they make it easier to track the conversation across multiple e-mails.
### Finding a person to review and merge your patches
You can use the `scripts/find-maintainer` script to find a subsystem maintainer and/or reviewer for your patches. The script accepts a filename in the git source tree as an argument and outputs a list of subsystems the file belongs to and their respective maintainers and reviewers. For example, if you changed the `cql3/statements/create_view_statement.hh` file, run the script as follows:
```bash
$ ./scripts/find-maintainer cql3/statements/create_view_statement.hh
```
and you will get output like this:
```
CQL QUERY LANGUAGE
Tomasz Grabiec <tgrabiec@scylladb.com> [maintainer]
Pekka Enberg <penberg@scylladb.com> [maintainer]
MATERIALIZED VIEWS
Pekka Enberg <penberg@scylladb.com> [maintainer]
Duarte Nunes <duarte@scylladb.com> [maintainer]
Nadav Har'El <nyh@scylladb.com> [reviewer]
Duarte Nunes <duarte@scylladb.com> [reviewer]
```
### Running Scylla

131
MAINTAINERS Normal file
View File

@@ -0,0 +1,131 @@
M: Maintainer with commit access
R: Reviewer with subsystem expertise
F: Filename, directory, or pattern for the subsystem
---
AUTH
M: Paweł Dziepak <pdziepak@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
R: Calle Wilund <calle@scylladb.com>
R: Vlad Zolotarov <vladz@scylladb.com>
R: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
F: auth/*
CACHE
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Paweł Dziepak <pdziepak@scylladb.com>
R: Piotr Jastrzebski <piotr@scylladb.com>
F: row_cache*
F: *mutation*
F: tests/mvcc*
COMMITLOG / BATCHLOGa
M: Paweł Dziepak <pdziepak@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
R: Calle Wilund <calle@scylladb.com>
F: db/commitlog/*
F: db/batch*
COORDINATOR
M: Paweł Dziepak <pdziepak@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
R: Gleb Natapov <gleb@scylladb.com>
F: service/storage_proxy*
COMPACTION
R: Raphael S. Carvalho <raphaelsc@scylladb.com>
R: Glauber Costa <glauber@scylladb.com>
R: Nadav Har'El <nyh@scylladb.com>
F: sstables/compaction*
CQL TRANSPORT LAYER
M: Pekka Enberg <penberg@scylladb.com>
F: transport/*
CQL QUERY LANGUAGE
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Pekka Enberg <penberg@scylladb.com>
F: cql3/*
COUNTERS
M: Paweł Dziepak <pdziepak@scylladb.com>
F: counters*
F: tests/counter_test*
GOSSIP
M: Duarte Nunes <duarte@scylladb.com>
M: Tomasz Grabiec <tgrabiec@scylladb.com>
R: Asias He <asias@scylladb.com>
F: gms/*
DOCKER
M: Pekka Enberg <penberg@scylladb.com>
F: dist/docker/*
LSA
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Paweł Dziepak <pdziepak@scylladb.com>
F: utils/logalloc*
MATERIALIZED VIEWS
M: Duarte Nunes <duarte@scylladb.com>
M: Pekka Enberg <penberg@scylladb.com>
R: Nadav Har'El <nyh@scylladb.com>
R: Duarte Nunes <duarte@scylladb.com>
F: db/view/*
F: cql3/statements/*view*
PACKAGING
R: Takuya ASADA <syuu@scylladb.com>
F: dist/*
REPAIR
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
R: Asias He <asias@scylladb.com>
R: Nadav Har'El <nyh@scylladb.com>
F: repair/*
SCHEMA MANAGEMENT
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
M: Pekka Enberg <penberg@scylladb.com>
F: db/schema_tables*
F: db/legacy_schema_migrator*
F: service/migration*
F: schema*
SECONDARY INDEXES
M: Pekka Enberg <penberg@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
R: Nadav Har'El <nyh@scylladb.com>
R: Pekka Enberg <penberg@scylladb.com>
F: db/index/*
F: cql3/statements/*index*
SSTABLES
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
R: Raphael S. Carvalho <raphaelsc@scylladb.com>
R: Glauber Costa <glauber@scylladb.com>
R: Nadav Har'El <nyh@scylladb.com>
F: sstables/*
STREAMING
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
R: Asias He <asias@scylladb.com>
F: streaming/*
F: service/storage_service.*
THRIFT TRANSPORT LAYER
M: Duarte Nunes <duarte@scylladb.com>
F: thrift/*
THE REST
M: Avi Kivity <avi@scylladb.com>
M: Paweł Dziepak <pdziepak@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
M: Tomasz Grabiec <tgrabiec@scylladb.com>
F: *

View File

@@ -1,2 +1,5 @@
This project includes code developed by the Apache Software Foundation (http://www.apache.org/),
especially Apache Cassandra.
It also includes files from https://github.com/antonblanchard/crc32-vpmsum (author Anton Blanchard <anton@au.ibm.com>, IBM).
These files are located in utils/arch/powerpc/crc32-vpmsum. Their license may be found in licenses/LICENSE-crc32-vpmsum.TXT.

View File

@@ -1,6 +1,6 @@
#!/bin/sh
VERSION=666.development
VERSION=2.3.6
if test -f version
then

View File

@@ -455,7 +455,7 @@
"operations":[
{
"method":"GET",
"summary":"Returns a list of filenames that contain the given key on this node",
"summary":"Returns a list of sstable filenames that contain the given partition key on this node",
"type":"array",
"items":{
"type":"string"
@@ -475,7 +475,7 @@
},
{
"name":"key",
"description":"The key",
"description":"The partition key. In a composite-key scenario, use ':' to separate the columns in the key.",
"required":true,
"allowMultiple":false,
"type":"string",

30
api/api-doc/config.json Normal file
View File

@@ -0,0 +1,30 @@
"/v2/config/{id}": {
"get": {
"description": "Return a config value",
"operationId": "find_config_id",
"produces": [
"application/json"
],
"tags": ["config"],
"parameters": [
{
"name": "id",
"in": "path",
"description": "ID of config to return",
"required": true,
"type": "string"
}
],
"responses": {
"200": {
"description": "Config value"
},
"default": {
"description": "unexpected error",
"schema": {
"$ref": "#/definitions/ErrorModel"
}
}
}
}
}

View File

@@ -792,6 +792,24 @@
}
]
},
{
"path":"/storage_service/active_repair/",
"operations":[
{
"method":"GET",
"summary":"Return an array with the ids of the currently active repairs",
"type":"array",
"items":{
"type":"int"
},
"nickname":"get_active_repair_async",
"produces":[
"application/json"
],
"parameters":[]
}
]
},
{
"path":"/storage_service/repair_async/{keyspace}",
"operations":[
@@ -2111,6 +2129,41 @@
]
}
]
},
{
"path":"/storage_service/view_build_statuses/{keyspace}/{view}",
"operations":[
{
"method":"GET",
"summary":"Gets the progress of a materialized view build",
"type":"array",
"items":{
"type":"mapper"
},
"nickname":"view_build_statuses",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"The keyspace",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"view",
"description":"View name",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
}
],
"models":{
@@ -2175,11 +2228,11 @@
"description":"The column family"
},
"total":{
"type":"int",
"type":"long",
"description":"The total snapshot size"
},
"live":{
"type":"int",
"type":"long",
"description":"The live snapshot size"
}
}

View File

@@ -0,0 +1,29 @@
{
"swagger": "2.0",
"info": {
"version": "1.0.0",
"title": "Scylla API",
"description": "The scylla API version 2.0",
"termsOfService": "http://www.scylladb.com/tos/",
"contact": {
"name": "Scylla Team",
"email": "info@scylladb.com",
"url": "http://scylladb.com"
},
"license": {
"name": "AGPL",
"url": "https://github.com/scylladb/scylla/blob/master/LICENSE.AGPL"
}
},
"host": "{{Host}}",
"basePath": "/v2",
"schemes": [
"http"
],
"consumes": [
"application/json"
],
"produces": [
"application/json"
],
"paths": {

View File

@@ -39,6 +39,7 @@
#include "http/exception.hh"
#include "stream_manager.hh"
#include "system.hh"
#include "api/config.hh"
namespace api {
@@ -54,14 +55,18 @@ static std::unique_ptr<reply> exception_reply(std::exception_ptr eptr) {
future<> set_server_init(http_context& ctx) {
auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);
auto rb02 = std::make_shared < api_registry_builder20 > (ctx.api_doc, "/v2");
return ctx.http_server.set_routes([rb, &ctx](routes& r) {
return ctx.http_server.set_routes([rb, &ctx, rb02](routes& r) {
r.register_exeption_handler(exception_reply);
r.put(GET, "/ui", new httpd::file_handler(ctx.api_dir + "/index.html",
new content_replace("html")));
r.add(GET, url("/ui").remainder("path"), new httpd::directory_handler(ctx.api_dir,
new content_replace("html")));
rb->set_api_doc(r);
rb02->set_api_doc(r);
rb02->register_api_file(r, "swagger20_header");
set_config(rb02, ctx, r);
rb->register_function(r, "system",
"The system related API");
set_system(ctx, r);
@@ -112,6 +117,11 @@ future<> set_server_stream_manager(http_context& ctx) {
"The stream manager API", set_stream_manager);
}
future<> set_server_cache(http_context& ctx) {
return register_api(ctx, "cache_service",
"The cache service API", set_cache_service);
}
future<> set_server_gossip_settle(http_context& ctx) {
auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);
@@ -119,9 +129,6 @@ future<> set_server_gossip_settle(http_context& ctx) {
rb->register_function(r, "failure_detector",
"The failure detector API");
set_failure_detector(ctx,r);
rb->register_function(r, "cache_service",
"The cache service API");
set_cache_service(ctx,r);
});
}

View File

@@ -46,7 +46,7 @@ future<> set_server_messaging_service(http_context& ctx);
future<> set_server_storage_proxy(http_context& ctx);
future<> set_server_stream_manager(http_context& ctx);
future<> set_server_gossip_settle(http_context& ctx);
future<> set_server_cache(http_context& ctx);
future<> set_server_done(http_context& ctx);
}

View File

@@ -429,7 +429,7 @@ void set_column_family(http_context& ctx, routes& r) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
utils::estimated_histogram res(0);
for (auto i: *cf.get_sstables() ) {
res.merge(i->get_stats_metadata().estimated_column_count);
res.merge(i->get_stats_metadata().estimated_cells_count);
}
return res;
},
@@ -905,5 +905,20 @@ void set_column_family(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(res);
});
});
cf::get_sstables_for_key.set(r, [&ctx](std::unique_ptr<request> req) {
auto key = req->get_query_param("key");
auto uuid = get_uuid(req->param["name"], ctx.db.local());
return ctx.db.map_reduce0([key, uuid] (database& db) {
return db.find_column_family(uuid).get_sstables_by_partition_key(key);
}, std::unordered_set<sstring>(),
[](std::unordered_set<sstring> a, std::unordered_set<sstring>&& b) mutable {
a.insert(b.begin(),b.end());
return a;
}).then([](const std::unordered_set<sstring>& res) {
return make_ready_future<json::json_return_type>(container_to_vec(res));
});
});
}
}


@@ -24,6 +24,7 @@
#include "api.hh"
#include "api/api-doc/column_family.json.hh"
#include "database.hh"
#include <any>
namespace api {
@@ -37,9 +38,15 @@ template<class Mapper, class I, class Reducer>
future<I> map_reduce_cf_raw(http_context& ctx, const sstring& name, I init,
Mapper mapper, Reducer reducer) {
auto uuid = get_uuid(name, ctx.db.local());
return ctx.db.map_reduce0([mapper, uuid](database& db) {
return mapper(db.find_column_family(uuid));
}, init, reducer);
using mapper_type = std::function<std::any (database&)>;
using reducer_type = std::function<std::any (std::any, std::any)>;
return ctx.db.map_reduce0(mapper_type([mapper, uuid](database& db) {
return I(mapper(db.find_column_family(uuid)));
}), std::any(std::move(init)), reducer_type([reducer = std::move(reducer)] (std::any a, std::any b) mutable {
return I(reducer(std::any_cast<I>(std::move(a)), std::any_cast<I>(std::move(b))));
})).then([] (std::any r) {
return std::any_cast<I>(std::move(r));
});
}
@@ -51,35 +58,42 @@ future<json::json_return_type> map_reduce_cf(http_context& ctx, const sstring& n
});
}
template<class Mapper, class I, class Reducer, class Result>
future<I> map_reduce_cf_raw(http_context& ctx, const sstring& name, I init,
Mapper mapper, Reducer reducer, Result result) {
auto uuid = get_uuid(name, ctx.db.local());
return ctx.db.map_reduce0([mapper, uuid](database& db) {
return mapper(db.find_column_family(uuid));
}, init, reducer);
}
template<class Mapper, class I, class Reducer, class Result>
future<json::json_return_type> map_reduce_cf(http_context& ctx, const sstring& name, I init,
Mapper mapper, Reducer reducer, Result result) {
return map_reduce_cf_raw(ctx, name, init, mapper, reducer, result).then([result](const I& res) mutable {
return map_reduce_cf_raw(ctx, name, init, mapper, reducer).then([result](const I& res) mutable {
result = res;
return make_ready_future<json::json_return_type>(result);
});
}
template<class Mapper, class I, class Reducer>
future<I> map_reduce_cf_raw(http_context& ctx, I init,
Mapper mapper, Reducer reducer) {
return ctx.db.map_reduce0([mapper, init, reducer](database& db) {
struct map_reduce_column_families_locally {
std::any init;
std::function<std::any (column_family&)> mapper;
std::function<std::any (std::any, std::any)> reducer;
std::any operator()(database& db) const {
auto res = init;
for (auto i : db.get_column_families()) {
res = reducer(res, mapper(*i.second.get()));
}
return res;
}, init, reducer);
}
};
template<class Mapper, class I, class Reducer>
future<I> map_reduce_cf_raw(http_context& ctx, I init,
Mapper mapper, Reducer reducer) {
using mapper_type = std::function<std::any (column_family&)>;
using reducer_type = std::function<std::any (std::any, std::any)>;
auto wrapped_mapper = mapper_type([mapper = std::move(mapper)] (column_family& cf) mutable {
return I(mapper(cf));
});
auto wrapped_reducer = reducer_type([reducer = std::move(reducer)] (std::any a, std::any b) mutable {
return I(reducer(std::any_cast<I>(std::move(a)), std::any_cast<I>(std::move(b))));
});
return ctx.db.map_reduce0(map_reduce_column_families_locally{init, std::move(wrapped_mapper), wrapped_reducer}, std::any(init), wrapped_reducer).then([] (std::any res) {
return std::any_cast<I>(std::move(res));
});
}
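The hunk above wraps the caller's mapper and reducer in `std::function` objects that traffic in `std::any`, so the heavy `map_reduce0` call is instantiated for one set of types rather than once per result type. A minimal sketch of that type-erasure idea follows, assuming a plain vector of shards instead of Seastar's sharded service; `shard`, `map_reduce_erased` and `map_reduce_typed` are hypothetical names used only for illustration and are not part of Scylla.

```cpp
#include <any>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-shard database object.
struct shard {
    std::vector<std::string> tables;
};

using erased_mapper = std::function<std::any(const shard&)>;
using erased_reducer = std::function<std::any(std::any, std::any)>;

// Non-template core: compiled once, whatever the caller's result type is.
inline std::any map_reduce_erased(const std::vector<shard>& shards, std::any init,
                                  const erased_mapper& mapper,
                                  const erased_reducer& reducer) {
    assert(mapper && reducer); // both callables must be set
    std::any acc = std::move(init);
    for (const auto& s : shards) {
        acc = reducer(std::move(acc), mapper(s));
    }
    return acc;
}

// Thin typed wrapper: converts to and from std::any at the boundary.
template <class I, class Mapper, class Reducer>
I map_reduce_typed(const std::vector<shard>& shards, I init, Mapper mapper, Reducer reducer) {
    erased_mapper m = [mapper = std::move(mapper)](const shard& s) mutable -> std::any {
        return I(mapper(s));
    };
    erased_reducer r = [reducer = std::move(reducer)](std::any a, std::any b) mutable -> std::any {
        return I(reducer(std::any_cast<I>(std::move(a)), std::any_cast<I>(std::move(b))));
    };
    return std::any_cast<I>(map_reduce_erased(shards, std::any(std::move(init)), m, r));
}
```

The trade-off in this sketch is an extra allocation and a runtime type check per step in exchange for fewer template instantiations; whether that was the motivation for the real change is an assumption, since the diff does not state it.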

112
api/config.cc Normal file

@@ -0,0 +1,112 @@
/*
* Copyright 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "api/config.hh"
#include "api/api-doc/config.json.hh"
#include "db/config.hh"
#include <sstream>
#include <boost/algorithm/string/replace.hpp>
namespace api {
template<class T>
json::json_return_type get_json_return_type(const T& val) {
return json::json_return_type(val);
}
/*
* As noted in the comments on db::seed_provider_type, it is not used
* and probably never will be.
*
* Just in case, we return its class name.
*/
template<>
json::json_return_type get_json_return_type(const db::seed_provider_type& val) {
return json::json_return_type(val.class_name);
}
std::string format_type(const std::string& type) {
if (type == "int") {
return "integer";
}
return type;
}
future<> get_config_swagger_entry(const std::string& name, const std::string& description, const std::string& type, bool& first, output_stream<char>& os) {
std::stringstream ss;
if (first) {
first=false;
} else {
ss <<',';
};
ss << "\"/config/" << name <<"\": {"
"\"get\": {"
"\"description\": \"" << boost::replace_all_copy(boost::replace_all_copy(boost::replace_all_copy(description,"\n","\\n"),"\"", "''"), "\t", " ") <<"\","
"\"operationId\": \"find_config_"<< name <<"\","
"\"produces\": ["
"\"application/json\""
"],"
"\"tags\": [\"config\"],"
"\"parameters\": ["
"],"
"\"responses\": {"
"\"200\": {"
"\"description\": \"Config value\","
"\"schema\": {"
"\"type\": \"" << format_type(type) << "\""
"}"
"},"
"\"default\": {"
"\"description\": \"unexpected error\","
"\"schema\": {"
"\"$ref\": \"#/definitions/ErrorModel\""
"}"
"}"
"}"
"}"
"}";
return os.write(ss.str());
}
namespace cs = httpd::config_json;
#define _get_config_value(name, type, deflt, status, desc, ...) if (id == #name) {return get_json_return_type(ctx.db.local().get_config().name());}
#define _get_config_description(name, type, deflt, status, desc, ...) f = f.then([&os, &first] {return get_config_swagger_entry(#name, desc, #type, first, os);});
void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx, routes& r) {
rb->register_function(r, [] (output_stream<char>& os) {
return do_with(true, [&os] (bool& first) {
auto f = make_ready_future();
_make_config_values(_get_config_description)
return f;
});
});
cs::find_config_id.set(r, [&ctx] (const_req r) {
auto id = r.param["id"];
_make_config_values(_get_config_value)
throw bad_param_exception(sstring("No such config entry: ") + id);
});
}
}
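The new api/config.cc relies on a list macro (`_make_config_values`) that is expanded twice with different per-entry macros: once to emit the Swagger entries and once to look up a value by name. A reduced sketch of that X-macro technique follows, assuming a three-entry list; `DEMO_CONFIG_VALUES`, `demo_config`, `to_text` and `find_config` are hypothetical names and do not exist in Scylla.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical config list in X-macro form: name, type, default value.
#define DEMO_CONFIG_VALUES(X)            \
    X(listen_port, int, 9042)            \
    X(cluster_name, std::string, "demo") \
    X(enable_cache, bool, true)

// First expansion: declare one member per entry.
struct demo_config {
#define DECLARE_FIELD(name, type, deflt) type name = deflt;
    DEMO_CONFIG_VALUES(DECLARE_FIELD)
#undef DECLARE_FIELD
};

// Per-type formatting, playing the role of get_json_return_type() in the diff.
inline std::string to_text(int v) { return std::to_string(v); }
inline std::string to_text(bool v) { return v ? "true" : "false"; }
inline std::string to_text(const std::string& v) { return v; }

// Second expansion: a chain of "if (id == name) return value;" checks.
inline std::string find_config(const demo_config& cfg, const std::string& id) {
#define RETURN_IF_MATCH(name, type, deflt) \
    if (id == #name) { return to_text(cfg.name); }
    DEMO_CONFIG_VALUES(RETURN_IF_MATCH)
#undef RETURN_IF_MATCH
    throw std::invalid_argument("No such config entry: " + id);
}
```

Adding an entry to the list updates every expansion at once, which is the property the handler above depends on.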


@@ -1,5 +1,5 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2018 ScyllaDB
*/
/*
@@ -19,7 +19,12 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
// Used to ensure that all .hh files build, as well as a place to put
// out-of-line implementations.
#pragma once
#include "fb_utilities.hh"
#include "api.hh"
#include <seastar/http/api_docs.hh>
namespace api {
void set_config(std::shared_ptr<api_registry_builder20> rb, http_context& ctx, routes& r);
}


@@ -93,10 +93,13 @@ void set_storage_service(http_context& ctx, routes& r) {
return ctx.db.local().commitlog()->active_config().commit_log_location;
});
ss::get_token_endpoint.set(r, [] (const_req req) {
auto token_to_ep = service::get_local_storage_service().get_token_to_endpoint_map();
std::vector<storage_service_json::mapper> res;
return map_to_key_value(token_to_ep, res);
ss::get_token_endpoint.set(r, [] (std::unique_ptr<request> req) {
return make_ready_future<json::json_return_type>(stream_range_as_array(service::get_local_storage_service().get_token_to_endpoint_map(), [](const auto& i) {
storage_service_json::mapper val;
val.key = boost::lexical_cast<std::string>(i.first);
val.value = boost::lexical_cast<std::string>(i.second);
return val;
}));
});
ss::get_leaving_nodes.set(r, [](const_req req) {
@@ -355,6 +358,12 @@ void set_storage_service(http_context& ctx, routes& r) {
});
});
ss::get_active_repair_async.set(r, [&ctx](std::unique_ptr<request> req) {
return get_active_repairs(ctx.db).then([] (std::vector<int> res){
return make_ready_future<json::json_return_type>(res);
});
});
ss::repair_async_status.set(r, [&ctx](std::unique_ptr<request> req) {
return repair_get_status(ctx.db, boost::lexical_cast<int>( req->get_query_param("id")))
.then_wrapped([] (future<repair_status>&& fut) {
@@ -843,6 +852,15 @@ void set_storage_service(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(map_to_key_value(ownership, res));
});
});
ss::view_build_statuses.set(r, [&ctx] (std::unique_ptr<request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto view = req->param["view"];
return service::get_local_storage_service().view_build_statuses(std::move(keyspace), std::move(view)).then([] (std::unordered_map<sstring, sstring> status) {
std::vector<storage_service_json::mapper> res;
return make_ready_future<json::json_return_type>(map_to_key_value(std::move(status), res));
});
});
}
}

222
atomic_cell.cc Normal file

@@ -0,0 +1,222 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "atomic_cell.hh"
#include "atomic_cell_or_collection.hh"
#include "types.hh"
/// LSA migrator for cells with irrelevant type
///
///
const data::type_imr_descriptor& no_type_imr_descriptor() {
static thread_local data::type_imr_descriptor state(data::type_info::make_variable_size());
return state;
}
atomic_cell atomic_cell::make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
auto& imr_data = no_type_imr_descriptor();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_dead(timestamp, deletion_time), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value, atomic_cell::collection_member cm) {
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
auto& imr_data = no_type_imr_descriptor();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live_counter_update(timestamp, value), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live_uninitialized(const abstract_type& type, api::timestamp_type timestamp, size_t size) {
auto& imr_data = no_type_imr_descriptor();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live_uninitialized(imr_data.type_info(), timestamp, size), &imr_data.lsa_migrator())
);
}
static imr::utils::object<data::cell::structure> copy_cell(const data::type_imr_descriptor& imr_data, const uint8_t* ptr)
{
using imr_object_type = imr::utils::object<data::cell::structure>;
// If the cell doesn't own any memory it is trivial and can be copied with
// memcpy.
auto f = data::cell::structure::get_member<data::cell::tags::flags>(ptr);
if (!f.template get<data::cell::tags::external_data>()) {
data::cell::context ctx(f, imr_data.type_info());
// XXX: We may be better off storing the total cell size in memory. Measure!
auto size = data::cell::structure::serialized_object_size(ptr, ctx);
return imr_object_type::make_raw(size, [&] (uint8_t* dst) noexcept {
std::copy_n(ptr, size, dst);
}, &imr_data.lsa_migrator());
}
return imr_object_type::make(data::cell::copy_fn(imr_data.type_info(), ptr), &imr_data.lsa_migrator());
}
atomic_cell::atomic_cell(const abstract_type& type, atomic_cell_view other)
: atomic_cell(type.imr_state().type_info(),
copy_cell(type.imr_state(), other._view.raw_pointer()))
{ }
atomic_cell_or_collection atomic_cell_or_collection::copy(const abstract_type& type) const {
if (!_data.get()) {
return atomic_cell_or_collection();
}
auto& imr_data = type.imr_state();
return atomic_cell_or_collection(
copy_cell(imr_data, _data.get())
);
}
atomic_cell_or_collection::atomic_cell_or_collection(const abstract_type& type, atomic_cell_view acv)
: _data(copy_cell(type.imr_state(), acv._view.raw_pointer()))
{
}
static collection_mutation_view get_collection_mutation_view(const uint8_t* ptr)
{
auto f = data::cell::structure::get_member<data::cell::tags::flags>(ptr);
auto ti = data::type_info::make_collection();
data::cell::context ctx(f, ti);
auto view = data::cell::structure::get_member<data::cell::tags::cell>(ptr).as<data::cell::tags::collection>(ctx);
auto dv = data::cell::variable_value::make_view(view, f.get<data::cell::tags::external_data>());
return collection_mutation_view { dv };
}
collection_mutation_view atomic_cell_or_collection::as_collection_mutation() const {
return get_collection_mutation_view(_data.get());
}
collection_mutation::collection_mutation(const collection_type_impl& type, collection_mutation_view v)
: _data(imr_object_type::make(data::cell::make_collection(v.data), &type.imr_state().lsa_migrator()))
{
}
collection_mutation::collection_mutation(const collection_type_impl& type, bytes_view v)
: _data(imr_object_type::make(data::cell::make_collection(v), &type.imr_state().lsa_migrator()))
{
}
collection_mutation::operator collection_mutation_view() const
{
return get_collection_mutation_view(_data.get());
}
bool atomic_cell_or_collection::equals(const abstract_type& type, const atomic_cell_or_collection& other) const
{
auto ptr_a = _data.get();
auto ptr_b = other._data.get();
if (!ptr_a || !ptr_b) {
return !ptr_a && !ptr_b;
}
if (type.is_atomic()) {
auto a = atomic_cell_view::from_bytes(type.imr_state().type_info(), _data);
auto b = atomic_cell_view::from_bytes(type.imr_state().type_info(), other._data);
if (a.timestamp() != b.timestamp()) {
return false;
}
if (a.is_live()) {
if (!b.is_live()) {
return false;
}
if (a.is_counter_update()) {
if (!b.is_counter_update()) {
return false;
}
return a.counter_update_value() == b.counter_update_value();
}
if (a.is_live_and_has_ttl()) {
if (!b.is_live_and_has_ttl()) {
return false;
}
if (a.ttl() != b.ttl() || a.expiry() != b.expiry()) {
return false;
}
}
return a.value() == b.value();
}
return a.deletion_time() == b.deletion_time();
} else {
return as_collection_mutation().data == other.as_collection_mutation().data;
}
}
size_t atomic_cell_or_collection::external_memory_usage(const abstract_type& t) const
{
if (!_data.get()) {
return 0;
}
auto ctx = data::cell::context(_data.get(), t.imr_state().type_info());
auto view = data::cell::structure::make_view(_data.get(), ctx);
auto flags = view.get<data::cell::tags::flags>();
size_t external_value_size = 0;
if (flags.get<data::cell::tags::external_data>()) {
if (flags.get<data::cell::tags::collection>()) {
external_value_size = get_collection_mutation_view(_data.get()).data.size_bytes();
} else {
auto cell_view = data::cell::atomic_cell_view(t.imr_state().type_info(), view);
external_value_size = cell_view.value_size();
}
// Add overhead of chunk headers. The last one is a special case.
external_value_size += (external_value_size - 1) / data::cell::maximum_external_chunk_length * data::cell::external_chunk_overhead;
external_value_size += data::cell::external_last_chunk_overhead;
}
return data::cell::structure::serialized_object_size(_data.get(), ctx)
+ imr_object_type::size_overhead + external_value_size;
}
std::ostream& operator<<(std::ostream& os, const atomic_cell_or_collection& c) {
if (!c._data.get()) {
return os << "{ null atomic_cell_or_collection }";
}
using dc = data::cell;
os << "{ ";
if (dc::structure::get_member<dc::tags::flags>(c._data.get()).get<dc::tags::collection>()) {
os << "collection";
} else {
os << "atomic cell";
}
return os << " @" << static_cast<const void*>(c._data.get()) << " }";
}
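`external_memory_usage()` above charges one header per external chunk, with the last chunk using a different overhead. The arithmetic can be sketched on its own as follows. The three constants are hypothetical placeholders, since the real values of `maximum_external_chunk_length`, `external_chunk_overhead` and `external_last_chunk_overhead` live in `data::cell` and are not shown in this diff; the sketch also leaves out the in-line serialized size and `size_overhead` terms.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical constants, chosen only to make the arithmetic easy to follow.
constexpr std::size_t demo_max_chunk_length = 4096;
constexpr std::size_t demo_chunk_overhead = 16;
constexpr std::size_t demo_last_chunk_overhead = 8;

// Bytes used by an externally stored value of value_size bytes:
// the payload, one header per chunk before the last, and the last chunk's header.
inline std::size_t demo_external_size(std::size_t value_size) {
    // (value_size - 1) would wrap around for 0, so an external value must be non-empty.
    assert(value_size > 0);
    std::size_t chunks_before_last = (value_size - 1) / demo_max_chunk_length;
    return value_size
        + chunks_before_last * demo_chunk_overhead
        + demo_last_chunk_overhead;
}
```

Subtracting one before dividing means a value that exactly fills a chunk is counted as a single (last) chunk rather than as one full chunk followed by an empty one.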


@@ -30,200 +30,48 @@
#include <cstdint>
#include <iosfwd>
#include <seastar/util/gcc6-concepts.hh>
#include "data/cell.hh"
#include "data/schema_info.hh"
#include "imr/utils.hh"
template<typename T, typename Input>
static inline
void set_field(Input& v, unsigned offset, T val) {
reinterpret_cast<net::packed<T>*>(v.begin() + offset)->raw = net::hton(val);
}
class abstract_type;
class collection_type_impl;
template<typename T>
static inline
T get_field(const bytes_view& v, unsigned offset) {
return net::ntoh(*reinterpret_cast<const net::packed<T>*>(v.begin() + offset));
}
using atomic_cell_value_view = data::value_view;
using atomic_cell_value_mutable_view = data::value_mutable_view;
class atomic_cell_or_collection;
/*
* Represents atomic cell layout. Works on serialized form.
*
* Layout:
*
* <live> := <int8_t:flags><int64_t:timestamp>(<int32_t:expiry><int32_t:ttl>)?<value>
* <dead> := <int8_t: 0><int64_t:timestamp><int32_t:deletion_time>
*/
class atomic_cell_type final {
private:
static constexpr int8_t LIVE_FLAG = 0x01;
static constexpr int8_t EXPIRY_FLAG = 0x02; // When present, expiry field is present. Set only for live cells
static constexpr int8_t REVERT_FLAG = 0x04; // transient flag used to efficiently implement ReversiblyMergeable for atomic cells.
static constexpr int8_t COUNTER_UPDATE_FLAG = 0x08; // Cell is a counter update.
static constexpr int8_t COUNTER_IN_PLACE_REVERT = 0x10;
static constexpr unsigned flags_size = 1;
static constexpr unsigned timestamp_offset = flags_size;
static constexpr unsigned timestamp_size = 8;
static constexpr unsigned expiry_offset = timestamp_offset + timestamp_size;
static constexpr unsigned expiry_size = 4;
static constexpr unsigned deletion_time_offset = timestamp_offset + timestamp_size;
static constexpr unsigned deletion_time_size = 4;
static constexpr unsigned ttl_offset = expiry_offset + expiry_size;
static constexpr unsigned ttl_size = 4;
friend class counter_cell_builder;
private:
static bool is_counter_update(bytes_view cell) {
return cell[0] & COUNTER_UPDATE_FLAG;
}
static bool is_revert_set(bytes_view cell) {
return cell[0] & REVERT_FLAG;
}
static bool is_counter_in_place_revert_set(bytes_view cell) {
return cell[0] & COUNTER_IN_PLACE_REVERT;
}
template<typename BytesContainer>
static void set_revert(BytesContainer& cell, bool revert) {
cell[0] = (cell[0] & ~REVERT_FLAG) | (revert * REVERT_FLAG);
}
template<typename BytesContainer>
static void set_counter_in_place_revert(BytesContainer& cell, bool flag) {
cell[0] = (cell[0] & ~COUNTER_IN_PLACE_REVERT) | (flag * COUNTER_IN_PLACE_REVERT);
}
static bool is_live(const bytes_view& cell) {
return cell[0] & LIVE_FLAG;
}
static bool is_live_and_has_ttl(const bytes_view& cell) {
return cell[0] & EXPIRY_FLAG;
}
static bool is_dead(const bytes_view& cell) {
return !is_live(cell);
}
// Can be called on live and dead cells
static api::timestamp_type timestamp(const bytes_view& cell) {
return get_field<api::timestamp_type>(cell, timestamp_offset);
}
template<typename BytesContainer>
static void set_timestamp(BytesContainer& cell, api::timestamp_type ts) {
set_field(cell, timestamp_offset, ts);
}
// Can be called on live cells only
private:
template<typename BytesView>
static BytesView do_get_value(BytesView cell) {
auto expiry_field_size = bool(cell[0] & EXPIRY_FLAG) * (expiry_size + ttl_size);
auto value_offset = flags_size + timestamp_size + expiry_field_size;
cell.remove_prefix(value_offset);
return cell;
}
public:
static bytes_view value(bytes_view cell) {
return do_get_value(cell);
}
static bytes_mutable_view value(bytes_mutable_view cell) {
return do_get_value(cell);
}
// Can be called on live counter update cells only
static int64_t counter_update_value(bytes_view cell) {
return get_field<int64_t>(cell, flags_size + timestamp_size);
}
// Can be called only when is_dead() is true.
static gc_clock::time_point deletion_time(const bytes_view& cell) {
assert(is_dead(cell));
return gc_clock::time_point(gc_clock::duration(
get_field<int32_t>(cell, deletion_time_offset)));
}
// Can be called only when is_live_and_has_ttl() is true.
static gc_clock::time_point expiry(const bytes_view& cell) {
assert(is_live_and_has_ttl(cell));
auto expiry = get_field<int32_t>(cell, expiry_offset);
return gc_clock::time_point(gc_clock::duration(expiry));
}
// Can be called only when is_live_and_has_ttl() is true.
static gc_clock::duration ttl(const bytes_view& cell) {
assert(is_live_and_has_ttl(cell));
return gc_clock::duration(get_field<int32_t>(cell, ttl_offset));
}
static managed_bytes make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
managed_bytes b(managed_bytes::initialized_later(), flags_size + timestamp_size + deletion_time_size);
b[0] = 0;
set_field(b, timestamp_offset, timestamp);
set_field(b, deletion_time_offset, deletion_time.time_since_epoch().count());
return b;
}
static managed_bytes make_live(api::timestamp_type timestamp, bytes_view value) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size());
b[0] = LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
std::copy_n(value.begin(), value.size(), b.begin() + value_offset);
return b;
}
static managed_bytes make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + sizeof(value));
b[0] = LIVE_FLAG | COUNTER_UPDATE_FLAG;
set_field(b, timestamp_offset, timestamp);
set_field(b, value_offset, value);
return b;
}
static managed_bytes make_live(api::timestamp_type timestamp, bytes_view value, gc_clock::time_point expiry, gc_clock::duration ttl) {
auto value_offset = flags_size + timestamp_size + expiry_size + ttl_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size());
b[0] = EXPIRY_FLAG | LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
set_field(b, expiry_offset, expiry.time_since_epoch().count());
set_field(b, ttl_offset, ttl.count());
std::copy_n(value.begin(), value.size(), b.begin() + value_offset);
return b;
}
// make_live_from_serializer() is intended for users that need to serialise
// some object or objects to the format used in atomic_cell::value().
// With just make_live() the pattern would look as follows:
// 1. allocate a buffer and write to it serialised objects
// 2. pass that buffer to make_live()
// 3. make_live() needs to prepend some metadata to the cell value so it
// allocates a new buffer and copies the content of the original one
//
// The allocation and copy of a buffer can be avoided.
// make_live_from_serializer() allows the user code to specify the timestamp
// and size of the cell value as well as provide the serialiser function
// object, which would write the serialised value of the cell to the buffer
// given to it by make_live_from_serializer().
template<typename Serializer>
GCC6_CONCEPT(requires requires(Serializer serializer, bytes::iterator it) {
serializer(it);
})
static managed_bytes make_live_from_serializer(api::timestamp_type timestamp, size_t size, Serializer&& serializer) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + size);
b[0] = LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
serializer(b.begin() + value_offset);
return b;
}
template<typename ByteContainer>
friend class atomic_cell_base;
/// View of an atomic cell
template<mutable_view is_mutable>
class basic_atomic_cell_view {
protected:
data::cell::basic_atomic_cell_view<is_mutable> _view;
friend class atomic_cell;
};
public:
using pointer_type = std::conditional_t<is_mutable == mutable_view::no, const uint8_t*, uint8_t*>;
protected:
explicit basic_atomic_cell_view(data::cell::basic_atomic_cell_view<is_mutable> v)
: _view(std::move(v)) { }
basic_atomic_cell_view(const data::type_info& ti, pointer_type ptr)
: _view(data::cell::make_atomic_cell_view(ti, ptr))
{ }
template<typename ByteContainer>
class atomic_cell_base {
protected:
ByteContainer _data;
protected:
atomic_cell_base(ByteContainer&& data) : _data(std::forward<ByteContainer>(data)) { }
friend class atomic_cell_or_collection;
public:
operator basic_atomic_cell_view<mutable_view::no>() const noexcept {
return basic_atomic_cell_view<mutable_view::no>(_view);
}
void swap(basic_atomic_cell_view& other) noexcept {
using std::swap;
swap(_view, other._view);
}
bool is_counter_update() const {
return atomic_cell_type::is_counter_update(_data);
}
bool is_revert_set() const {
return atomic_cell_type::is_revert_set(_data);
}
bool is_counter_in_place_revert_set() const {
return atomic_cell_type::is_counter_in_place_revert_set(_data);
return _view.is_counter_update();
}
bool is_live() const {
return atomic_cell_type::is_live(_data);
return _view.is_live();
}
bool is_live(tombstone t, bool is_counter) const {
return is_live() && !is_covered_by(t, is_counter);
@@ -232,125 +80,132 @@ public:
return is_live() && !is_covered_by(t, is_counter) && !has_expired(now);
}
bool is_live_and_has_ttl() const {
return atomic_cell_type::is_live_and_has_ttl(_data);
return _view.is_expiring();
}
bool is_dead(gc_clock::time_point now) const {
return atomic_cell_type::is_dead(_data) || has_expired(now);
return !is_live() || has_expired(now);
}
bool is_covered_by(tombstone t, bool is_counter) const {
return timestamp() <= t.timestamp || (is_counter && t.timestamp != api::missing_timestamp);
}
// Can be called on live and dead cells
api::timestamp_type timestamp() const {
return atomic_cell_type::timestamp(_data);
return _view.timestamp();
}
void set_timestamp(api::timestamp_type ts) {
atomic_cell_type::set_timestamp(_data, ts);
_view.set_timestamp(ts);
}
// Can be called on live cells only
auto value() const {
return atomic_cell_type::value(_data);
data::basic_value_view<is_mutable> value() const {
return _view.value();
}
// Can be called on live cells only
size_t value_size() const {
return _view.value_size();
}
bool is_value_fragmented() const {
return _view.is_value_fragmented();
}
// Can be called on live counter update cells only
int64_t counter_update_value() const {
return atomic_cell_type::counter_update_value(_data);
return _view.counter_update_value();
}
// Can be called only when is_dead(gc_clock::time_point)
gc_clock::time_point deletion_time() const {
return !is_live() ? atomic_cell_type::deletion_time(_data) : expiry() - ttl();
return !is_live() ? _view.deletion_time() : expiry() - ttl();
}
// Can be called only when is_live_and_has_ttl()
gc_clock::time_point expiry() const {
return atomic_cell_type::expiry(_data);
return _view.expiry();
}
// Can be called only when is_live_and_has_ttl()
gc_clock::duration ttl() const {
return atomic_cell_type::ttl(_data);
return _view.ttl();
}
// Can be called on live and dead cells
bool has_expired(gc_clock::time_point now) const {
return is_live_and_has_ttl() && expiry() <= now;
}
bytes_view serialize() const {
return _data;
}
void set_revert(bool revert) {
atomic_cell_type::set_revert(_data, revert);
}
void set_counter_in_place_revert(bool flag) {
atomic_cell_type::set_counter_in_place_revert(_data, flag);
return _view.serialize();
}
};
class atomic_cell_view final : public atomic_cell_base<bytes_view> {
atomic_cell_view(bytes_view data) : atomic_cell_base(std::move(data)) {}
public:
static atomic_cell_view from_bytes(bytes_view data) { return atomic_cell_view(data); }
class atomic_cell_view final : public basic_atomic_cell_view<mutable_view::no> {
atomic_cell_view(const data::type_info& ti, const uint8_t* data)
: basic_atomic_cell_view<mutable_view::no>(ti, data) {}
template<mutable_view is_mutable>
atomic_cell_view(data::cell::basic_atomic_cell_view<is_mutable> view)
: basic_atomic_cell_view<mutable_view::no>(view) { }
friend class atomic_cell;
public:
static atomic_cell_view from_bytes(const data::type_info& ti, const imr::utils::object<data::cell::structure>& data) {
return atomic_cell_view(ti, data.get());
}
static atomic_cell_view from_bytes(const data::type_info& ti, bytes_view bv) {
return atomic_cell_view(ti, reinterpret_cast<const uint8_t*>(bv.begin()));
}
friend std::ostream& operator<<(std::ostream& os, const atomic_cell_view& acv);
};
class atomic_cell_mutable_view final : public atomic_cell_base<bytes_mutable_view> {
atomic_cell_mutable_view(bytes_mutable_view data) : atomic_cell_base(std::move(data)) {}
class atomic_cell_mutable_view final : public basic_atomic_cell_view<mutable_view::yes> {
atomic_cell_mutable_view(const data::type_info& ti, uint8_t* data)
: basic_atomic_cell_view<mutable_view::yes>(ti, data) {}
public:
static atomic_cell_mutable_view from_bytes(bytes_mutable_view data) { return atomic_cell_mutable_view(data); }
static atomic_cell_mutable_view from_bytes(const data::type_info& ti, imr::utils::object<data::cell::structure>& data) {
return atomic_cell_mutable_view(ti, data.get());
}
friend class atomic_cell;
};
class atomic_cell_ref final : public atomic_cell_base<managed_bytes&> {
public:
atomic_cell_ref(managed_bytes& buf) : atomic_cell_base(buf) {}
};
using atomic_cell_ref = atomic_cell_mutable_view;
class atomic_cell final : public atomic_cell_base<managed_bytes> {
atomic_cell(managed_bytes b) : atomic_cell_base(std::move(b)) {}
class atomic_cell final : public basic_atomic_cell_view<mutable_view::yes> {
using imr_object_type = imr::utils::object<data::cell::structure>;
imr_object_type _data;
atomic_cell(const data::type_info& ti, imr::utils::object<data::cell::structure>&& data)
: basic_atomic_cell_view<mutable_view::yes>(ti, data.get()), _data(std::move(data)) {}
public:
atomic_cell(const atomic_cell&) = default;
class collection_member_tag;
using collection_member = bool_class<collection_member_tag>;
atomic_cell(atomic_cell&&) = default;
atomic_cell& operator=(const atomic_cell&) = default;
atomic_cell& operator=(const atomic_cell&) = delete;
atomic_cell& operator=(atomic_cell&&) = default;
static atomic_cell from_bytes(managed_bytes b) {
return atomic_cell(std::move(b));
void swap(atomic_cell& other) noexcept {
basic_atomic_cell_view<mutable_view::yes>::swap(other);
_data.swap(other._data);
}
atomic_cell(atomic_cell_view other) : atomic_cell_base(managed_bytes{other._data}) {}
operator atomic_cell_view() const {
return atomic_cell_view(_data);
operator atomic_cell_view() const { return atomic_cell_view(_view); }
atomic_cell(const abstract_type& t, atomic_cell_view other);
static atomic_cell make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value,
collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, const bytes& value,
collection_member cm = collection_member::no) {
return make_live(type, timestamp, bytes_view(value), cm);
}
static atomic_cell make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
return atomic_cell_type::make_dead(timestamp, deletion_time);
}
static atomic_cell make_live(api::timestamp_type timestamp, bytes_view value) {
return atomic_cell_type::make_live(timestamp, value);
}
static atomic_cell make_live(api::timestamp_type timestamp, const bytes& value) {
return make_live(timestamp, bytes_view(value));
}
static atomic_cell make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
return atomic_cell_type::make_live_counter_update(timestamp, value);
}
static atomic_cell make_live(api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl)
static atomic_cell make_live_counter_update(api::timestamp_type timestamp, int64_t value);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, const bytes& value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member cm = collection_member::no)
{
return atomic_cell_type::make_live(timestamp, value, expiry, ttl);
return make_live(type, timestamp, bytes_view(value), expiry, ttl, cm);
}
static atomic_cell make_live(api::timestamp_type timestamp, const bytes& value,
gc_clock::time_point expiry, gc_clock::duration ttl)
{
return make_live(timestamp, bytes_view(value), expiry, ttl);
}
static atomic_cell make_live(api::timestamp_type timestamp, bytes_view value, ttl_opt ttl) {
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value, ttl_opt ttl, collection_member cm = collection_member::no) {
if (!ttl) {
return atomic_cell_type::make_live(timestamp, value);
return make_live(type, timestamp, value, cm);
} else {
return atomic_cell_type::make_live(timestamp, value, gc_clock::now() + *ttl, *ttl);
return make_live(type, timestamp, value, gc_clock::now() + *ttl, *ttl, cm);
}
}
template<typename Serializer>
static atomic_cell make_live_from_serializer(api::timestamp_type timestamp, size_t size, Serializer&& serializer) {
return atomic_cell_type::make_live_from_serializer(timestamp, size, std::forward<Serializer>(serializer));
}
static atomic_cell make_live_uninitialized(const abstract_type& type, api::timestamp_type timestamp, size_t size);
friend class atomic_cell_or_collection;
friend std::ostream& operator<<(std::ostream& os, const atomic_cell& ac);
};
@@ -364,33 +219,24 @@ class collection_mutation_view;
// list: tbd, probably ugly
class collection_mutation {
public:
managed_bytes data;
using imr_object_type = imr::utils::object<data::cell::structure>;
imr_object_type _data;
collection_mutation() {}
collection_mutation(managed_bytes b) : data(std::move(b)) {}
collection_mutation(collection_mutation_view v);
collection_mutation(const collection_type_impl&, collection_mutation_view v);
collection_mutation(const collection_type_impl&, bytes_view bv);
operator collection_mutation_view() const;
};
class collection_mutation_view {
public:
bytes_view data;
bytes_view serialize() const { return data; }
static collection_mutation_view from_bytes(bytes_view v) { return { v }; }
atomic_cell_value_view data;
};
inline
collection_mutation::collection_mutation(collection_mutation_view v)
: data(v.data) {
}
inline
collection_mutation::operator collection_mutation_view() const {
return { data };
}
class column_definition;
int compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right);
void merge_column(const column_definition& def,
void merge_column(const abstract_type& def,
atomic_cell_or_collection& old,
const atomic_cell_or_collection& neww);

@@ -25,6 +25,7 @@
#include "types.hh"
#include "atomic_cell.hh"
#include "atomic_cell_or_collection.hh"
#include "hashing.hh"
#include "counters.hh"
@@ -32,12 +33,15 @@ template<>
struct appending_hash<collection_mutation_view> {
template<typename Hasher>
void operator()(Hasher& h, collection_mutation_view cell, const column_definition& cdef) const {
auto m_view = collection_type_impl::deserialize_mutation_form(cell);
cell.data.with_linearized([&] (bytes_view cell_bv) {
auto ctype = static_pointer_cast<const collection_type_impl>(cdef.type);
auto m_view = ctype->deserialize_mutation_form(cell_bv);
::feed_hash(h, m_view.tomb);
for (auto&& key_and_value : m_view.cells) {
::feed_hash(h, key_and_value.first);
::feed_hash(h, key_and_value.second, cdef);
}
});
}
};
@@ -49,7 +53,9 @@ struct appending_hash<atomic_cell_view> {
feed_hash(h, cell.timestamp());
if (cell.is_live()) {
if (cdef.is_counter()) {
::feed_hash(h, counter_cell_view(cell));
counter_cell_view::with_linearized(cell, [&] (counter_cell_view ccv) {
::feed_hash(h, ccv);
});
return;
}
if (cell.is_live_and_has_ttl()) {
@@ -78,3 +84,15 @@ struct appending_hash<collection_mutation> {
feed_hash(h, static_cast<collection_mutation_view>(cm), cdef);
}
};
template<>
struct appending_hash<atomic_cell_or_collection> {
template<typename Hasher>
void operator()(Hasher& h, const atomic_cell_or_collection& c, const column_definition& cdef) const {
if (cdef.is_atomic()) {
feed_hash(h, c.as_atomic_cell(cdef), cdef);
} else {
feed_hash(h, c.as_collection_mutation(), cdef);
}
}
};

@@ -25,50 +25,56 @@
#include "schema.hh"
#include "hashing.hh"
#include "imr/utils.hh"
// A variant type that can hold either an atomic_cell, or a serialized collection.
// Which type is stored is determined by the schema.
// Has an "empty" state.
// Objects moved-from are left in an empty state.
class atomic_cell_or_collection final {
managed_bytes _data;
// FIXME: This has made us lose small-buffer optimisation. Unfortunately,
// due to the changed cell format it would be less effective now, anyway.
// Measure the actual impact because any attempts to fix this will become
// irrelevant once rows are converted to the IMR as well, so maybe we can
// live with this like that.
using imr_object_type = imr::utils::object<data::cell::structure>;
imr_object_type _data;
private:
atomic_cell_or_collection(managed_bytes&& data) : _data(std::move(data)) {}
atomic_cell_or_collection(imr::utils::object<data::cell::structure>&& data) : _data(std::move(data)) {}
public:
atomic_cell_or_collection() = default;
atomic_cell_or_collection(atomic_cell_or_collection&&) = default;
atomic_cell_or_collection(const atomic_cell_or_collection&) = delete;
atomic_cell_or_collection& operator=(atomic_cell_or_collection&&) = default;
atomic_cell_or_collection& operator=(const atomic_cell_or_collection&) = delete;
atomic_cell_or_collection(atomic_cell ac) : _data(std::move(ac._data)) {}
atomic_cell_or_collection(const abstract_type& at, atomic_cell_view acv);
static atomic_cell_or_collection from_atomic_cell(atomic_cell data) { return { std::move(data._data) }; }
atomic_cell_view as_atomic_cell() const { return atomic_cell_view::from_bytes(_data); }
atomic_cell_ref as_atomic_cell_ref() { return { _data }; }
atomic_cell_mutable_view as_mutable_atomic_cell() { return atomic_cell_mutable_view::from_bytes(_data); }
atomic_cell_or_collection(collection_mutation cm) : _data(std::move(cm.data)) {}
atomic_cell_view as_atomic_cell(const column_definition& cdef) const { return atomic_cell_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
atomic_cell_ref as_atomic_cell_ref(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
atomic_cell_mutable_view as_mutable_atomic_cell(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
atomic_cell_or_collection(collection_mutation cm) : _data(std::move(cm._data)) { }
atomic_cell_or_collection copy(const abstract_type&) const;
explicit operator bool() const {
return !_data.empty();
return bool(_data);
}
bool can_use_mutable_view() const {
return !_data.is_fragmented();
static constexpr bool can_use_mutable_view() {
return true;
}
static atomic_cell_or_collection from_collection_mutation(collection_mutation data) {
return std::move(data.data);
}
collection_mutation_view as_collection_mutation() const {
return collection_mutation_view{_data};
}
bytes_view serialize() const {
return _data;
}
bool operator==(const atomic_cell_or_collection& other) const {
return _data == other._data;
}
template<typename Hasher>
void feed_hash(Hasher& h, const column_definition& def) const {
if (def.is_atomic()) {
::feed_hash(h, as_atomic_cell(), def);
} else {
::feed_hash(h, as_collection_mutation(), def);
}
}
size_t external_memory_usage() const {
return _data.external_memory_usage();
void swap(atomic_cell_or_collection& other) noexcept {
_data.swap(other._data);
}
static atomic_cell_or_collection from_collection_mutation(collection_mutation data) { return std::move(data._data); }
collection_mutation_view as_collection_mutation() const;
bytes_view serialize() const;
bool equals(const abstract_type& type, const atomic_cell_or_collection& other) const;
size_t external_memory_usage(const abstract_type&) const;
friend std::ostream& operator<<(std::ostream&, const atomic_cell_or_collection&);
};
namespace std {
inline void swap(atomic_cell_or_collection& a, atomic_cell_or_collection& b) noexcept
{
a.swap(b);
}
}
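
The header comment above describes a variant whose stored kind is decided by the schema rather than by a tag in the object, and whose moved-from objects are left empty. A minimal standalone sketch of that idea follows. It is not Scylla code: `column_kind` and `cell_storage` are hypothetical stand-ins for the column definition and the IMR-backed storage, and a `std::string` replaces the real serialized representation.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for the schema information the caller must supply.
enum class column_kind { atomic, collection };

// Hypothetical stand-in for the untagged storage: it holds opaque bytes
// and relies on the caller to say how they should be interpreted.
class cell_storage {
    std::string _data;  // empty means "no cell"
public:
    cell_storage() = default;
    explicit cell_storage(std::string bytes) : _data(std::move(bytes)) {}

    // Move-only; the source is explicitly reset to the empty state.
    cell_storage(cell_storage&& o) noexcept : _data(std::move(o._data)) { o._data.clear(); }
    cell_storage& operator=(cell_storage&& o) noexcept {
        if (this != &o) {
            _data = std::move(o._data);
            o._data.clear();
        }
        return *this;
    }
    cell_storage(const cell_storage&) = delete;
    cell_storage& operator=(const cell_storage&) = delete;

    // True when a cell is stored.
    explicit operator bool() const { return !_data.empty(); }

    // No tag is stored, so the kind comes from the caller (the "schema").
    std::string describe(column_kind kind) const {
        if (!*this) {
            return "empty";
        }
        return (kind == column_kind::atomic ? "atomic:" : "collection:") + _data;
    }
};
```

The design choice this illustrates is the same one visible in the diff: accessors such as `as_atomic_cell` now take a `column_definition`, because the bytes alone no longer say how they should be read.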

@@ -23,8 +23,8 @@
#include <stdexcept>
#include "auth/authenticator.hh"
#include "auth/authenticated_user.hh"
#include "auth/authenticator.hh"
#include "auth/common.hh"
namespace cql3 {
@@ -44,52 +44,56 @@ public:
allow_all_authenticator(cql3::query_processor&, ::service::migration_manager&) {
}
future<> start() override {
virtual future<> start() override {
return make_ready_future<>();
}
future<> stop() override {
virtual future<> stop() override {
return make_ready_future<>();
}
const sstring& qualified_java_name() const override {
virtual const sstring& qualified_java_name() const override {
return allow_all_authenticator_name();
}
bool require_authentication() const override {
virtual bool require_authentication() const override {
return false;
}
option_set supported_options() const override {
return option_set();
virtual authentication_option_set supported_options() const override {
return authentication_option_set();
}
option_set alterable_options() const override {
return option_set();
virtual authentication_option_set alterable_options() const override {
return authentication_option_set();
}
future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const override {
return make_ready_future<::shared_ptr<authenticated_user>>(::make_shared<authenticated_user>());
future<authenticated_user> authenticate(const credentials_map& credentials) const override {
return make_ready_future<authenticated_user>(anonymous_user());
}
future<> create(sstring username, const option_map& options) override {
virtual future<> create(stdx::string_view, const authentication_options& options) const override {
return make_ready_future();
}
future<> alter(sstring username, const option_map& options) override {
virtual future<> alter(stdx::string_view, const authentication_options& options) const override {
return make_ready_future();
}
future<> drop(sstring username) override {
virtual future<> drop(stdx::string_view) const override {
return make_ready_future();
}
const resource_ids& protected_resources() const override {
static const resource_ids ids;
return ids;
virtual future<custom_options> query_custom_options(stdx::string_view role_name) const override {
return make_ready_future<custom_options>();
}
::shared_ptr<sasl_challenge> new_sasl_challenge() const override {
virtual const resource_set& protected_resources() const override {
static const resource_set resources;
return resources;
}
virtual ::shared_ptr<sasl_challenge> new_sasl_challenge() const override {
throw std::runtime_error("Should not reach");
}
};

@@ -21,7 +21,7 @@
#pragma once
#include "authorizer.hh"
#include "auth/authorizer.hh"
#include "exceptions/exceptions.hh"
#include "stdx.hh"
@@ -35,8 +35,6 @@ class migration_manager;
namespace auth {
class service;
const sstring& allow_all_authorizer_name();
class allow_all_authorizer final : public authorizer {
@@ -44,54 +42,51 @@ public:
allow_all_authorizer(cql3::query_processor&, ::service::migration_manager&) {
}
future<> start() override {
virtual future<> start() override {
return make_ready_future<>();
}
future<> stop() override {
virtual future<> stop() override {
return make_ready_future<>();
}
const sstring& qualified_java_name() const override {
virtual const sstring& qualified_java_name() const override {
return allow_all_authorizer_name();
}
future<permission_set> authorize(service&, ::shared_ptr<authenticated_user>, data_resource) const override {
virtual future<permission_set> authorize(const role_or_anonymous&, const resource&) const override {
return make_ready_future<permission_set>(permissions::ALL);
}
future<> grant(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring) override {
throw exceptions::invalid_request_exception("GRANT operation is not supported by AllowAllAuthorizer");
virtual future<> grant(stdx::string_view, permission_set, const resource&) const override {
return make_exception_future<>(
unsupported_authorization_operation("GRANT operation is not supported by AllowAllAuthorizer"));
}
future<> revoke(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring) override {
throw exceptions::invalid_request_exception("REVOKE operation is not supported by AllowAllAuthorizer");
virtual future<> revoke(stdx::string_view, permission_set, const resource&) const override {
return make_exception_future<>(
unsupported_authorization_operation("REVOKE operation is not supported by AllowAllAuthorizer"));
}
future<std::vector<permission_details>> list(
service&,
::shared_ptr<authenticated_user> performer,
permission_set,
stdx::optional<data_resource>,
stdx::optional<sstring>) const override {
throw exceptions::invalid_request_exception("LIST PERMISSIONS operation is not supported by AllowAllAuthorizer");
virtual future<std::vector<permission_details>> list_all() const override {
return make_exception_future<std::vector<permission_details>>(
unsupported_authorization_operation(
"LIST PERMISSIONS operation is not supported by AllowAllAuthorizer"));
}
future<> revoke_all(sstring dropped_user) override {
return make_ready_future();
virtual future<> revoke_all(stdx::string_view) const override {
return make_exception_future(
unsupported_authorization_operation("REVOKE operation is not supported by AllowAllAuthorizer"));
}
future<> revoke_all(data_resource) override {
return make_ready_future();
virtual future<> revoke_all(const resource&) const override {
return make_exception_future(
unsupported_authorization_operation("REVOKE operation is not supported by AllowAllAuthorizer"));
}
const resource_ids& protected_resources() override {
static const resource_ids ids;
return ids;
}
future<> validate_configuration() const override {
return make_ready_future();
virtual const resource_set& protected_resources() const override {
static const resource_set resources;
return resources;
}
};

@@ -39,26 +39,30 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/authenticated_user.hh"
#include "authenticated_user.hh"
#include <iostream>
const sstring auth::authenticated_user::ANONYMOUS_USERNAME("anonymous");
namespace auth {
auth::authenticated_user::authenticated_user()
: _anon(true)
{}
auth::authenticated_user::authenticated_user(sstring name)
: _name(name), _anon(false)
{}
auth::authenticated_user::authenticated_user(authenticated_user&&) = default;
auth::authenticated_user::authenticated_user(const authenticated_user&) = default;
const sstring& auth::authenticated_user::name() const {
return _anon ? ANONYMOUS_USERNAME : _name;
authenticated_user::authenticated_user(stdx::string_view name)
: name(sstring(name)) {
}
std::ostream& operator<<(std::ostream& os, const authenticated_user& u) {
if (!u.name) {
os << "anonymous";
} else {
os << *u.name;
}
return os;
}
static const authenticated_user the_anonymous_user{};
const authenticated_user& anonymous_user() noexcept {
return the_anonymous_user;
}
bool auth::authenticated_user::operator==(const authenticated_user& v) const {
return _anon ? v._anon : _name == v._name;
}

@@ -41,35 +41,63 @@
#pragma once
#include <experimental/string_view>
#include <functional>
#include <iosfwd>
#include <optional>
#include <seastar/core/sstring.hh>
#include <seastar/core/future.hh>
#include "seastarx.hh"
#include "stdx.hh"
namespace auth {
class authenticated_user {
///
/// A type-safe wrapper for the name of a logged-in user, or a nameless (anonymous) user.
///
class authenticated_user final {
public:
static const sstring ANONYMOUS_USERNAME;
///
/// An anonymous user has no name.
///
std::optional<sstring> name{};
authenticated_user();
authenticated_user(sstring name);
authenticated_user(authenticated_user&&);
authenticated_user(const authenticated_user&);
const sstring& name() const;
/**
* If IAuthenticator doesn't require authentication, this method may return true.
*/
bool is_anonymous() const {
return _anon;
}
bool operator==(const authenticated_user&) const;
private:
sstring _name;
bool _anon;
///
/// An anonymous user.
///
authenticated_user() = default;
explicit authenticated_user(stdx::string_view name);
};
///
/// The user name, or "anonymous".
///
std::ostream& operator<<(std::ostream&, const authenticated_user&);
inline bool operator==(const authenticated_user& u1, const authenticated_user& u2) noexcept {
return u1.name == u2.name;
}
inline bool operator!=(const authenticated_user& u1, const authenticated_user& u2) noexcept {
return !(u1 == u2);
}
const authenticated_user& anonymous_user() noexcept;
inline bool is_anonymous(const authenticated_user& u) noexcept {
return u == anonymous_user();
}
}
namespace std {
template <>
struct hash<auth::authenticated_user> final {
size_t operator()(const auth::authenticated_user &u) const {
return std::hash<std::optional<sstring>>()(u.name);
}
};
}
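
The reworked `authenticated_user` models anonymity as the absence of a name instead of a separate `_anon` flag. The sketch below restates that design in standalone C++17 so it can be compiled without Seastar. It is an illustration only: `user`, `anonymous()` and `user_hash` are hypothetical names, and `std::string` stands in for `sstring`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical simplification of the class in the diff.
struct user {
    std::optional<std::string> name;  // disengaged means anonymous

    user() = default;                                     // anonymous user
    explicit user(std::string n) : name(std::move(n)) {}  // named user
};

// Equality falls out of comparing the optional names.
inline bool operator==(const user& a, const user& b) noexcept { return a.name == b.name; }
inline bool operator!=(const user& a, const user& b) noexcept { return !(a == b); }

// A single shared anonymous instance, created on first use.
inline const user& anonymous() noexcept {
    static const user the_anonymous_user{};
    return the_anonymous_user;
}

inline bool is_anonymous(const user& u) noexcept { return u == anonymous(); }

// Hashing delegates to the standard hash for optional<string> (C++17).
struct user_hash {
    std::size_t operator()(const user& u) const noexcept {
        return std::hash<std::optional<std::string>>()(u.name);
    }
};
```

One consequence of this representation is that a user constructed with an empty string is still a named user, not an anonymous one.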

@@ -1,5 +1,5 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2018 ScyllaDB
*/
/*
@@ -19,15 +19,19 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "auth/authentication_options.hh"
#include <experimental/optional>
#include <iostream>
namespace auth {
std::ostream& operator<<(std::ostream& os, authentication_option a) {
switch (a) {
case authentication_option::password: os << "PASSWORD"; break;
case authentication_option::options: os << "OPTIONS"; break;
}
return os;
}
template<typename T>
inline
std::experimental::optional<T>
move_and_disengage(std::experimental::optional<T>& opt) {
auto t = std::move(opt);
opt = std::experimental::nullopt;
return t;
}

@@ -0,0 +1,64 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <iosfwd>
#include <optional>
#include <stdexcept>
#include <unordered_map>
#include <unordered_set>
#include <seastar/core/print.hh>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
namespace auth {
enum class authentication_option {
password,
options
};
std::ostream& operator<<(std::ostream&, authentication_option);
using authentication_option_set = std::unordered_set<authentication_option>;
using custom_options = std::unordered_map<sstring, sstring>;
struct authentication_options final {
std::optional<sstring> password;
std::optional<custom_options> options;
};
inline bool any_authentication_options(const authentication_options& aos) noexcept {
return aos.password || aos.options;
}
class unsupported_authentication_option : public std::invalid_argument {
public:
explicit unsupported_authentication_option(authentication_option k)
: std::invalid_argument(sprint("The %s option is not supported.", k)) {
}
};
}
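
The new header replaces the `boost::any`-based option map with a plain struct of optionals plus a set of supported option kinds. How a caller might check a request against that set is not shown in this diff, so the following is only a sketch under assumptions: `validate` is a hypothetical helper, the type names are shortened, and `std::string` replaces `sstring`.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <unordered_set>

enum class auth_option { password, options };
using auth_option_set = std::unordered_set<auth_option>;
using custom_options = std::unordered_map<std::string, std::string>;

// Mirrors the struct in the diff: each option is either set or absent.
struct auth_options {
    std::optional<std::string> password;
    std::optional<custom_options> options;
};

inline bool any_options(const auth_options& o) noexcept {
    return o.password.has_value() || o.options.has_value();
}

// Hypothetical helper (not in the diff): reject any option that was
// supplied but is missing from the supported set.
inline void validate(const auth_options& given, const auth_option_set& supported) {
    if (given.password && supported.count(auth_option::password) == 0) {
        throw std::invalid_argument("The PASSWORD option is not supported.");
    }
    if (given.options && supported.count(auth_option::options) == 0) {
        throw std::invalid_argument("The OPTIONS option is not supported.");
    }
}
```

An authenticator that supports nothing, like the allow-all one, would pass an empty set, so any supplied option is rejected.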

@@ -39,29 +39,14 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "authenticator.hh"
#include "authenticated_user.hh"
#include "common.hh"
#include "password_authenticator.hh"
#include "auth/authenticator.hh"
#include "auth/authenticated_user.hh"
#include "auth/common.hh"
#include "auth/password_authenticator.hh"
#include "cql3/query_processor.hh"
#include "db/config.hh"
#include "utils/class_registrator.hh"
const sstring auth::authenticator::USERNAME_KEY("username");
const sstring auth::authenticator::PASSWORD_KEY("password");
auth::authenticator::option auth::authenticator::string_to_option(const sstring& name) {
if (strcasecmp(name.c_str(), "password") == 0) {
return option::PASSWORD;
}
throw std::invalid_argument(name);
}
sstring auth::authenticator::option_to_string(option opt) {
switch (opt) {
case option::PASSWORD:
return "PASSWORD";
default:
throw std::invalid_argument(sprint("Unknown option {}", opt));
}
}

@@ -41,21 +41,24 @@
#pragma once
#include <experimental/string_view>
#include <memory>
#include <unordered_map>
#include <set>
#include <stdexcept>
#include <unordered_map>
#include <boost/any.hpp>
#include <seastar/core/sstring.hh>
#include <seastar/core/future.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/enum.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sstring.hh>
#include <seastar/core/shared_ptr.hh>
#include "auth/authentication_options.hh"
#include "auth/resource.hh"
#include "bytes.hh"
#include "data_resource.hh"
#include "enum_set.hh"
#include "exceptions/exceptions.hh"
#include "stdx.hh"
namespace db {
class config;
@@ -65,126 +68,104 @@ namespace auth {
class authenticated_user;
///
/// Abstract client for authenticating role identity.
///
/// All state necessary to authorize a role is stored externally to the client instance.
///
class authenticator {
public:
///
/// The name of the key to be used for the user-name part of password authentication with \ref authenticate.
///
static const sstring USERNAME_KEY;
///
/// The name of the key to be used for the password part of password authentication with \ref authenticate.
///
static const sstring PASSWORD_KEY;
/**
* Supported CREATE USER/ALTER USER options.
* Currently only PASSWORD is available.
*/
enum class option {
PASSWORD
};
static option string_to_option(const sstring&);
static sstring option_to_string(option);
using option_set = enum_set<super_enum<option, option::PASSWORD>>;
using option_map = std::unordered_map<option, boost::any, enum_hash<option>>;
using credentials_map = std::unordered_map<sstring, sstring>;
virtual ~authenticator()
{}
virtual ~authenticator() = default;
virtual future<> start() = 0;
virtual future<> stop() = 0;
///
/// A fully-qualified (class with package) Java-like name for this implementation.
///
virtual const sstring& qualified_java_name() const = 0;
/**
* Whether or not the authenticator requires explicit login.
* If false will instantiate user with AuthenticatedUser.ANONYMOUS_USER.
*/
virtual bool require_authentication() const = 0;
/**
* Set of options supported by CREATE USER and ALTER USER queries.
* Should never return null - always return an empty set instead.
*/
virtual option_set supported_options() const = 0;
virtual authentication_option_set supported_options() const = 0;
/**
* Subset of supportedOptions that users are allowed to alter when performing ALTER USER [themselves].
* Should never return null - always return an empty set instead.
*/
virtual option_set alterable_options() const = 0;
///
/// A subset of `supported_options()` that users are permitted to alter for themselves.
///
virtual authentication_option_set alterable_options() const = 0;
/**
* Authenticates a user given a Map<String, String> of credentials.
* Should never return null - always throw AuthenticationException instead.
* Returning AuthenticatedUser.ANONYMOUS_USER is an option as well if authentication is not required.
*
* @throws authentication_exception if credentials don't match any known user.
*/
virtual future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const = 0;
///
/// Authenticate a user given implementation-specific credentials.
///
/// If this implementation does not require authentication (\ref require_authentication), an anonymous user may
/// result.
///
/// \returns an exceptional future with \ref exceptions::authentication_exception if given invalid credentials.
///
virtual future<authenticated_user> authenticate(const credentials_map& credentials) const = 0;
/**
* Called during execution of CREATE USER query (also may be called on startup, see seedSuperuserOptions method).
* If authenticator is static then the body of the method should be left blank, but don't throw an exception.
* options are guaranteed to be a subset of supportedOptions().
*
* @param username Username of the user to create.
* @param options Options the user will be created with.
* @throws exceptions::request_validation_exception
* @throws exceptions::request_execution_exception
*/
virtual future<> create(sstring username, const option_map& options) = 0;
///
/// Create an authentication record for a new user. This is required before the user can log-in.
///
/// The options provided must be a subset of `supported_options()`.
///
virtual future<> create(stdx::string_view role_name, const authentication_options& options) const = 0;
/**
* Called during execution of ALTER USER query.
* options are always guaranteed to be a subset of supportedOptions(). Furthermore, if the user performing the query
* is not a superuser and is altering himself, then options are guaranteed to be a subset of alterableOptions().
* Keep the body of the method blank if your implementation doesn't support any options.
*
* @param username Username of the user that will be altered.
* @param options Options to alter.
* @throws exceptions::request_validation_exception
* @throws exceptions::request_execution_exception
*/
virtual future<> alter(sstring username, const option_map& options) = 0;
///
/// Alter the authentication record of an existing user.
///
/// The options provided must be a subset of `supported_options()`.
///
/// Callers must ensure that the specification of `alterable_options()` is adhered to.
///
virtual future<> alter(stdx::string_view role_name, const authentication_options& options) const = 0;
///
/// Delete the authentication record for a user. This will disallow the user from logging in.
///
virtual future<> drop(stdx::string_view role_name) const = 0;
/**
* Called during execution of DROP USER query.
*
* @param username Username of the user that will be dropped.
* @throws exceptions::request_validation_exception
* @throws exceptions::request_execution_exception
*/
virtual future<> drop(sstring username) = 0;
///
/// Query for custom options (those corresponding to \ref authentication_options::options).
///
/// If no options are set the result is an empty container.
///
virtual future<custom_options> query_custom_options(stdx::string_view role_name) const = 0;
/**
* Set of resources that should be made inaccessible to users and only accessible internally.
*
* @return Keyspaces, column families that will be unmodifiable by users; other resources.
* @see resource_ids
*/
virtual const resource_ids& protected_resources() const = 0;
///
/// System resources used internally as part of the implementation. These are made inaccessible to users.
///
virtual const resource_set& protected_resources() const = 0;
///
/// A stateful SASL challenge which supports many authentication schemes (depending on the implementation).
///
class sasl_challenge {
public:
virtual ~sasl_challenge() {}
virtual ~sasl_challenge() = default;
virtual bytes evaluate_response(bytes_view client_response) = 0;
virtual bool is_complete() const = 0;
virtual future<::shared_ptr<authenticated_user>> get_authenticated_user() const = 0;
virtual future<authenticated_user> get_authenticated_user() const = 0;
};
/**
* Provide a sasl_challenge to be used by the CQL binary protocol server. If
* the configured authenticator requires authentication but does not implement this
* interface we refuse to start the binary protocol server as it will have no way
* of authenticating clients.
* @return sasl_challenge implementation
*/
virtual ::shared_ptr<sasl_challenge> new_sasl_challenge() const = 0;
};
inline std::ostream& operator<<(std::ostream& os, authenticator::option opt) {
return os << authenticator::option_to_string(opt);
}
}

@@ -1,118 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "authorizer.hh"
#include "authenticated_user.hh"
#include "common.hh"
#include "default_authorizer.hh"
#include "auth.hh"
#include "cql3/query_processor.hh"
#include "db/config.hh"
#include "utils/class_registrator.hh"
const sstring& auth::allow_all_authorizer_name() {
static const sstring name = meta::AUTH_PACKAGE_NAME + "AllowAllAuthorizer";
return name;
}
/**
* Authenticator is assumed to be a fully state-less immutable object (note all the const).
* We thus store a single instance globally, since it should be safe/ok.
*/
static std::unique_ptr<auth::authorizer> global_authorizer;
using authorizer_registry = class_registry<auth::authorizer, cql3::query_processor&>;
future<>
auth::authorizer::setup(const sstring& type) {
if (type == allow_all_authorizer_name()) {
class allow_all_authorizer : public authorizer {
public:
future<> start() override {
return make_ready_future<>();
}
future<> stop() override {
return make_ready_future<>();
}
const sstring& qualified_java_name() const override {
return allow_all_authorizer_name();
}
future<permission_set> authorize(::shared_ptr<authenticated_user>, data_resource) const override {
return make_ready_future<permission_set>(permissions::ALL);
}
future<> grant(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring) override {
throw exceptions::invalid_request_exception("GRANT operation is not supported by AllowAllAuthorizer");
}
future<> revoke(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring) override {
throw exceptions::invalid_request_exception("REVOKE operation is not supported by AllowAllAuthorizer");
}
future<std::vector<permission_details>> list(::shared_ptr<authenticated_user> performer, permission_set, optional<data_resource>, optional<sstring>) const override {
throw exceptions::invalid_request_exception("LIST PERMISSIONS operation is not supported by AllowAllAuthorizer");
}
future<> revoke_all(sstring dropped_user) override {
return make_ready_future();
}
future<> revoke_all(data_resource) override {
return make_ready_future();
}
const resource_ids& protected_resources() override {
static const resource_ids ids;
return ids;
}
future<> validate_configuration() const override {
return make_ready_future();
}
};
global_authorizer = std::make_unique<allow_all_authorizer>();
return make_ready_future();
} else {
auto a = authorizer_registry::create(type, cql3::get_local_query_processor());
auto f = a->start();
return f.then([a = std::move(a)]() mutable {
global_authorizer = std::move(a);
});
}
}
auth::authorizer& auth::authorizer::get() {
assert(global_authorizer);
return *global_authorizer;
}


@@ -41,127 +41,116 @@
#pragma once
#include <vector>
#include <experimental/string_view>
#include <functional>
#include <optional>
#include <stdexcept>
#include <tuple>
#include <vector>
#include <experimental/optional>
#include <seastar/core/future.hh>
#include <seastar/core/shared_ptr.hh>
#include "permission.hh"
#include "data_resource.hh"
#include "auth/permission.hh"
#include "auth/resource.hh"
#include "seastarx.hh"
#include "stdx.hh"
namespace auth {
class service;
class authenticated_user;
class role_or_anonymous;
struct permission_details {
sstring user;
data_resource resource;
sstring role_name;
::auth::resource resource;
permission_set permissions;
bool operator<(const permission_details& v) const {
return std::tie(user, resource, permissions) < std::tie(v.user, v.resource, v.permissions);
}
};
using std::experimental::optional;
inline bool operator==(const permission_details& pd1, const permission_details& pd2) {
return std::forward_as_tuple(pd1.role_name, pd1.resource, pd1.permissions.mask())
== std::forward_as_tuple(pd2.role_name, pd2.resource, pd2.permissions.mask());
}
inline bool operator!=(const permission_details& pd1, const permission_details& pd2) {
return !(pd1 == pd2);
}
inline bool operator<(const permission_details& pd1, const permission_details& pd2) {
return std::forward_as_tuple(pd1.role_name, pd1.resource, pd1.permissions)
< std::forward_as_tuple(pd2.role_name, pd2.resource, pd2.permissions);
}
class unsupported_authorization_operation : public std::invalid_argument {
public:
using std::invalid_argument::invalid_argument;
};
///
/// Abstract client for authorizing roles to access resources.
///
/// All state necessary to authorize a role is stored externally to the client instance.
///
class authorizer {
public:
virtual ~authorizer() {}
virtual ~authorizer() = default;
virtual future<> start() = 0;
virtual future<> stop() = 0;
///
/// A fully-qualified (class with package) Java-like name for this implementation.
///
virtual const sstring& qualified_java_name() const = 0;
/**
* The primary Authorizer method. Returns a set of permissions of a user on a resource.
*
* @param user Authenticated user requesting authorization.
* @param resource Resource for which the authorization is being requested. @see DataResource.
* @return Set of permissions of the user on the resource. Should never return empty. Use permission.NONE instead.
*/
virtual future<permission_set> authorize(service&, ::shared_ptr<authenticated_user>, data_resource) const = 0;
///
/// Query for the permissions granted directly to a role for a particular \ref resource (and not any of its
/// parents).
///
/// The optional role name is empty when an anonymous user is authorized. Some implementations may still wish to
/// grant default permissions in this case.
///
virtual future<permission_set> authorize(const role_or_anonymous&, const resource&) const = 0;
/**
* Grants a set of permissions on a resource to a user.
* The opposite of revoke().
*
* @param performer User who grants the permissions.
* @param permissions Set of permissions to grant.
* @param to Grantee of the permissions.
* @param resource Resource on which to grant the permissions.
*
* @throws RequestValidationException
* @throws RequestExecutionException
*/
virtual future<> grant(::shared_ptr<authenticated_user> performer, permission_set, data_resource, sstring to) = 0;
///
/// Grant a set of permissions to a role for a particular \ref resource.
///
/// \throws \ref unsupported_authorization_operation if granting permissions is not supported.
///
virtual future<> grant(stdx::string_view role_name, permission_set, const resource&) const = 0;
/**
* Revokes a set of permissions on a resource from a user.
* The opposite of grant().
*
* @param performer User who revokes the permissions.
* @param permissions Set of permissions to revoke.
* @param from Revokee of the permissions.
* @param resource Resource on which to revoke the permissions.
*
* @throws RequestValidationException
* @throws RequestExecutionException
*/
virtual future<> revoke(::shared_ptr<authenticated_user> performer, permission_set, data_resource, sstring from) = 0;
///
/// Revoke a set of permissions from a role for a particular \ref resource.
///
/// \throws \ref unsupported_authorization_operation if revoking permissions is not supported.
///
virtual future<> revoke(stdx::string_view role_name, permission_set, const resource&) const = 0;
/**
* Returns a list of permissions on a resource of a user.
*
* @param performer User who wants to see the permissions.
* @param permissions Set of Permission values the user is interested in. The result should only include the matching ones.
* @param resource The resource on which permissions are requested. Can be null, in which case permissions on all resources
* should be returned.
* @param of The user whose permissions are requested. Can be null, in which case permissions of every user should be returned.
*
* @return All of the matching permissions that the requesting user is authorized to know about.
*
* @throws RequestValidationException
* @throws RequestExecutionException
*/
virtual future<std::vector<permission_details>> list(service&, ::shared_ptr<authenticated_user> performer, permission_set, optional<data_resource>, optional<sstring>) const = 0;
///
/// Query for all directly granted permissions.
///
/// \throws \ref unsupported_authorization_operation if listing permissions is not supported.
///
virtual future<std::vector<permission_details>> list_all() const = 0;
/**
* This method is called before deleting a user with DROP USER query so that a new user with the same
* name wouldn't inherit permissions of the deleted user in the future.
*
* @param droppedUser The user to revoke all permissions from.
*/
virtual future<> revoke_all(sstring dropped_user) = 0;
///
/// Revoke all permissions granted directly to a particular role.
///
/// \throws \ref unsupported_authorization_operation if revoking permissions is not supported.
///
virtual future<> revoke_all(stdx::string_view role_name) const = 0;
/**
* This method is called after a resource is removed (i.e. keyspace or a table is dropped).
*
* @param droppedResource The resource to revoke all permissions on.
*/
virtual future<> revoke_all(data_resource) = 0;
///
/// Revoke all permissions granted to any role for a particular resource.
///
/// \throws \ref unsupported_authorization_operation if revoking permissions is not supported.
///
virtual future<> revoke_all(const resource&) const = 0;
/**
* Set of resources that should be made inaccessible to users and only accessible internally.
*
* @return Keyspaces, column families that will be unmodifiable by users; other resources.
*/
virtual const resource_ids& protected_resources() = 0;
/**
* Validates configuration of IAuthorizer implementation (if configurable).
*
* @throws ConfigurationException when there is a configuration error.
*/
virtual future<> validate_configuration() const = 0;
///
/// System resources used internally as part of the implementation. These are made inaccessible to users.
///
virtual const resource_set& protected_resources() const = 0;
};
}
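The comparison operators on `permission_details` above order values lexicographically by building tuples of references. A minimal sketch of that technique follows, assuming a simplified stand-in type: `permission_details_sketch`, its `std::string` fields, and the `unsigned` mask are hypothetical and replace `sstring`, `auth::resource`, and `permission_set`.

```cpp
#include <string>
#include <tuple>

// Hypothetical stand-in for auth::permission_details: plain strings and a
// bit mask replace sstring, auth::resource and permission_set.
struct permission_details_sketch {
    std::string role_name;
    std::string resource;
    unsigned permissions;
};

// Equality compares all three fields, member by member.
inline bool operator==(const permission_details_sketch& a, const permission_details_sketch& b) {
    return std::tie(a.role_name, a.resource, a.permissions)
            == std::tie(b.role_name, b.resource, b.permissions);
}

// Ordering is lexicographic: role name first, then resource, then permissions.
inline bool operator<(const permission_details_sketch& a, const permission_details_sketch& b) {
    return std::tie(a.role_name, a.resource, a.permissions)
            < std::tie(b.role_name, b.resource, b.permissions);
}
```

Because the tuples hold references, no field is copied during a comparison, and adding a field later only requires extending both tuple expressions in the same position.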


@@ -25,8 +25,10 @@
#include "cql3/query_processor.hh"
#include "cql3/statements/create_table_statement.hh"
#include "database.hh"
#include "schema_builder.hh"
#include "service/migration_manager.hh"
#include "timeout_config.hh"
namespace auth {
@@ -39,14 +41,32 @@ const sstring AUTH_PACKAGE_NAME("org.apache.cassandra.auth.");
}
static logging::logger auth_log("auth");
// Func must support being invoked more than once.
future<> do_after_system_ready(seastar::abort_source& as, seastar::noncopyable_function<future<>()> func) {
struct empty_state { };
return delay_until_system_ready(as).then([&as, func = std::move(func)] () mutable {
return exponential_backoff_retry::do_until_value(1s, 1min, as, [func = std::move(func)] {
return func().then_wrapped([] (auto&& f) -> stdx::optional<empty_state> {
if (f.failed()) {
auth_log.info("Auth task failed with error, rescheduling: {}", f.get_exception());
return { };
}
return { empty_state() };
});
});
}).discard_result();
}
future<> create_metadata_table_if_missing(
const sstring& table_name,
stdx::string_view table_name,
cql3::query_processor& qp,
const sstring& cql,
stdx::string_view cql,
::service::migration_manager& mm) {
auto& db = qp.db().local();
if (db.has_schema(meta::AUTH_KS, table_name)) {
if (db.has_schema(meta::AUTH_KS, sstring(table_name))) {
return make_ready_future<>();
}
@@ -58,7 +78,7 @@ future<> create_metadata_table_if_missing(
auto statement = static_pointer_cast<cql3::statements::create_table_statement>(
parsed_statement->prepare(db, qp.get_cql_stats())->statement);
const auto schema = statement->get_cf_meta_data();
const auto schema = statement->get_cf_meta_data(qp.db().local());
const auto uuid = generate_legacy_id(schema->ks_name(), schema->cf_name());
schema_builder b(schema);
@@ -67,4 +87,18 @@ future<> create_metadata_table_if_missing(
return mm.announce_new_column_family(b.build(), false);
}
future<> wait_for_schema_agreement(::service::migration_manager& mm, const database& db) {
static const auto pause = [] { return sleep(std::chrono::milliseconds(500)); };
return do_until([&db] { return db.get_version() != database::empty_version; }, pause).then([&mm] {
return do_until([&mm] { return mm.have_schema_agreement(); }, pause);
});
}
const timeout_config& internal_distributed_timeout_config() noexcept {
static const auto t = 5s;
static const timeout_config tc{t, t, t, t, t, t, t};
return tc;
}
}
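`do_after_system_ready()` above retries a task through `exponential_backoff_retry::do_until_value(1s, 1min, ...)` until it succeeds. The retry schedule can be sketched without Seastar as below. This is a synchronous illustration only: `backoff_delays_until_success` is a hypothetical helper, nothing actually sleeps, and the doubling-with-cap policy is an assumption about how the backoff grows rather than something this diff shows.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical helper: calls `attempt` until it returns true and records the
// delay that would be waited after each failure. The delay starts at
// `initial`, doubles after every failure, and is capped at `max` (assumed policy).
inline std::vector<std::chrono::seconds> backoff_delays_until_success(
        std::chrono::seconds initial,
        std::chrono::seconds max,
        const std::function<bool()>& attempt) {
    std::vector<std::chrono::seconds> delays;
    auto delay = initial;
    while (!attempt()) {
        delays.push_back(delay);
        delay = std::min(delay * 2, max);
    }
    return delays;
}
```

As in the original, the task must tolerate being invoked more than once, since every failure leads to another call.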


@@ -22,14 +22,23 @@
#pragma once
#include <chrono>
#include <experimental/string_view>
#include <seastar/core/future.hh>
#include <seastar/core/abort_source.hh>
#include <seastar/util/noncopyable_function.hh>
#include <seastar/core/reactor.hh>
#include <seastar/core/resource.hh>
#include <seastar/core/sstring.hh>
#include "delayed_tasks.hh"
#include "log.hh"
#include "seastarx.hh"
#include "utils/exponential_backoff_retry.hh"
using namespace std::chrono_literals;
class database;
class timeout_config;
namespace service {
class migration_manager;
@@ -59,16 +68,24 @@ future<> once_among_shards(Task&& f) {
return make_ready_future<>();
}
template <class Task, class Clock>
void delay_until_system_ready(delayed_tasks<Clock>& ts, Task&& f) {
static const typename std::chrono::milliseconds delay_duration(10000);
ts.schedule_after(delay_duration, std::forward<Task>(f));
inline future<> delay_until_system_ready(seastar::abort_source& as) {
return sleep_abortable(15s, as);
}
// Func must support being invoked more than once.
future<> do_after_system_ready(seastar::abort_source& as, seastar::noncopyable_function<future<>()> func);
future<> create_metadata_table_if_missing(
const sstring& table_name,
stdx::string_view table_name,
cql3::query_processor&,
const sstring& cql,
stdx::string_view cql,
::service::migration_manager&);
future<> wait_for_schema_agreement(::service::migration_manager&, const database&);
///
/// Time-outs for internal, non-local CQL queries.
///
const timeout_config& internal_distributed_timeout_config() noexcept;
}


@@ -1,171 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "data_resource.hh"
#include <regex>
#include "service/storage_proxy.hh"
const sstring auth::data_resource::ROOT_NAME("data");
auth::data_resource::data_resource(level l, const sstring& ks, const sstring& cf)
: _level(l), _ks(ks), _cf(cf)
{
}
auth::data_resource::data_resource()
: data_resource(level::ROOT)
{}
auth::data_resource::data_resource(const sstring& ks)
: data_resource(level::KEYSPACE, ks)
{}
auth::data_resource::data_resource(const sstring& ks, const sstring& cf)
: data_resource(level::COLUMN_FAMILY, ks, cf)
{}
auth::data_resource::level auth::data_resource::get_level() const {
return _level;
}
auth::data_resource auth::data_resource::from_name(
const sstring& s) {
static std::regex slash_regex("/");
auto i = std::regex_token_iterator<sstring::const_iterator>(s.begin(),
s.end(), slash_regex, -1);
auto e = std::regex_token_iterator<sstring::const_iterator>();
auto n = std::distance(i, e);
if (n > 3 || ROOT_NAME != sstring(*i++)) {
throw std::invalid_argument(sprint("%s is not a valid data resource name", s));
}
if (n == 1) {
return data_resource();
}
auto ks = *i++;
if (n == 2) {
return data_resource(ks.str());
}
auto cf = *i++;
return data_resource(ks.str(), cf.str());
}
sstring auth::data_resource::name() const {
switch (get_level()) {
case level::ROOT:
return ROOT_NAME;
case level::KEYSPACE:
return sprint("%s/%s", ROOT_NAME, _ks);
case level::COLUMN_FAMILY:
default:
return sprint("%s/%s/%s", ROOT_NAME, _ks, _cf);
}
}
auth::data_resource auth::data_resource::get_parent() const {
switch (get_level()) {
case level::KEYSPACE:
return data_resource();
case level::COLUMN_FAMILY:
return data_resource(_ks);
default:
throw std::invalid_argument("Root-level resource can't have a parent");
}
}
const sstring& auth::data_resource::keyspace() const {
if (is_root_level()) {
throw std::invalid_argument("ROOT data resource has no keyspace");
}
return _ks;
}
const sstring& auth::data_resource::column_family() const {
if (!is_column_family_level()) {
throw std::invalid_argument(sprint("%s data resource has no column family", name()));
}
return _cf;
}
bool auth::data_resource::has_parent() const {
return !is_root_level();
}
bool auth::data_resource::exists() const {
switch (get_level()) {
case level::ROOT:
return true;
case level::KEYSPACE:
return service::get_local_storage_proxy().get_db().local().has_keyspace(_ks);
case level::COLUMN_FAMILY:
default:
return service::get_local_storage_proxy().get_db().local().has_schema(_ks, _cf);
}
}
sstring auth::data_resource::to_string() const {
switch (get_level()) {
case level::ROOT:
return "<all keyspaces>";
case level::KEYSPACE:
return sprint("<keyspace %s>", _ks);
case level::COLUMN_FAMILY:
default:
return sprint("<table %s.%s>", _ks, _cf);
}
}
bool auth::data_resource::operator==(const data_resource& v) const {
return _ks == v._ks && _cf == v._cf;
}
bool auth::data_resource::operator<(const data_resource& v) const {
return _ks < v._ks ? true : (v._ks < _ks ? false : _cf < v._cf);
}
std::ostream& auth::operator<<(std::ostream& os, const data_resource& r) {
return os << r.to_string();
}
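The removed `data_resource::from_name()` above maps names of the form `data`, `data/<keyspace>`, and `data/<keyspace>/<table>` to the three resource levels. A minimal standalone sketch of that parsing follows, assuming a plain `std::getline` split instead of the regex token iterator used in the original; `parsed_resource_sketch` and `parse_data_resource_name` are hypothetical names introduced for illustration.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical result type: both fields empty means the root resource,
// only `keyspace` set means a keyspace, both set means a table.
struct parsed_resource_sketch {
    std::string keyspace;
    std::string column_family;
};

inline parsed_resource_sketch parse_data_resource_name(const std::string& name) {
    // Split the name on '/' into at most three meaningful parts.
    std::vector<std::string> parts;
    std::istringstream in(name);
    for (std::string part; std::getline(in, part, '/');) {
        parts.push_back(part);
    }
    // The first part must be the root name "data"; deeper nesting is invalid.
    if (parts.empty() || parts.size() > 3 || parts[0] != "data") {
        throw std::invalid_argument(name + " is not a valid data resource name");
    }
    parsed_resource_sketch r;
    if (parts.size() > 1) {
        r.keyspace = parts[1];
    }
    if (parts.size() > 2) {
        r.column_family = parts[2];
    }
    return r;
}
```

Edge cases such as a trailing slash may be handled differently from the original; the sketch only covers the three well-formed shapes and the rejection of foreign roots.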


@@ -1,159 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "utils/hash.hh"
#include <iosfwd>
#include <set>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
namespace auth {
class data_resource {
private:
enum class level {
ROOT, KEYSPACE, COLUMN_FAMILY
};
static const sstring ROOT_NAME;
level _level;
sstring _ks;
sstring _cf;
data_resource(level, const sstring& ks = {}, const sstring& cf = {});
level get_level() const;
public:
/**
* Creates a DataResource representing the root-level resource.
* @return the root-level resource.
*/
data_resource();
/**
* Creates a DataResource representing a keyspace.
*
* @param keyspace Name of the keyspace.
*/
data_resource(const sstring& ks);
/**
* Creates a DataResource instance representing a column family.
*
* @param keyspace Name of the keyspace.
* @param columnFamily Name of the column family.
*/
data_resource(const sstring& ks, const sstring& cf);
/**
* Parses a data resource name into a DataResource instance.
*
* @param name Name of the data resource.
* @return DataResource instance matching the name.
*/
static data_resource from_name(const sstring&);
/**
* @return Printable name of the resource.
*/
sstring name() const;
/**
* @return Parent of the resource, if any. Throws std::invalid_argument if it is the root-level resource.
*/
data_resource get_parent() const;
bool is_root_level() const {
return get_level() == level::ROOT;
}
bool is_keyspace_level() const {
return get_level() == level::KEYSPACE;
}
bool is_column_family_level() const {
return get_level() == level::COLUMN_FAMILY;
}
/**
* @return keyspace of the resource.
* @throws std::invalid_argument if it's the root-level resource.
*/
const sstring& keyspace() const;
/**
* @return column family of the resource.
* @throws std::invalid_argument if it's not a cf-level resource.
*/
const sstring& column_family() const;
/**
* @return Whether or not the resource has a parent in the hierarchy.
*/
bool has_parent() const;
/**
* @return Whether or not the resource exists in Scylla.
*/
bool exists() const;
sstring to_string() const;
bool operator==(const data_resource&) const;
bool operator<(const data_resource&) const;
size_t hash_value() const {
return utils::tuple_hash()(_ks, _cf);
}
};
/**
* Resource id mappings, i.e. keyspace and/or column families.
*/
using resource_ids = std::set<data_resource>;
std::ostream& operator<<(std::ostream&, const data_resource&);
}


@@ -39,198 +39,291 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <unistd.h>
#include <crypt.h>
#include <random>
#include <chrono>
#include "auth/default_authorizer.hh"
extern "C" {
#include <crypt.h>
#include <unistd.h>
}
#include <chrono>
#include <random>
#include <boost/algorithm/string/join.hpp>
#include <boost/range.hpp>
#include <seastar/core/reactor.hh>
#include "common.hh"
#include "default_authorizer.hh"
#include "authenticated_user.hh"
#include "permission.hh"
#include "auth/authenticated_user.hh"
#include "auth/common.hh"
#include "auth/permission.hh"
#include "auth/role_or_anonymous.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
#include "exceptions/exceptions.hh"
#include "log.hh"
const sstring& auth::default_authorizer_name() {
namespace auth {
const sstring& default_authorizer_name() {
static const sstring name = meta::AUTH_PACKAGE_NAME + "CassandraAuthorizer";
return name;
}
static const sstring USER_NAME = "username";
static const sstring ROLE_NAME = "role";
static const sstring RESOURCE_NAME = "resource";
static const sstring PERMISSIONS_NAME = "permissions";
static const sstring PERMISSIONS_CF = "permissions";
static const sstring PERMISSIONS_CF = "role_permissions";
static logging::logger alogger("default_authorizer");
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<
auth::authorizer,
auth::default_authorizer,
authorizer,
default_authorizer,
cql3::query_processor&,
::service::migration_manager&> password_auth_reg("org.apache.cassandra.auth.CassandraAuthorizer");
auth::default_authorizer::default_authorizer(cql3::query_processor& qp, ::service::migration_manager& mm)
default_authorizer::default_authorizer(cql3::query_processor& qp, ::service::migration_manager& mm)
: _qp(qp)
, _migration_manager(mm) {
}
auth::default_authorizer::~default_authorizer() {
default_authorizer::~default_authorizer() {
}
future<> auth::default_authorizer::start() {
static const sstring create_table = sprint("CREATE TABLE %s.%s ("
"%s text,"
"%s text,"
"%s set<text>,"
"PRIMARY KEY(%s, %s)"
") WITH gc_grace_seconds=%d", meta::AUTH_KS,
PERMISSIONS_CF, USER_NAME, RESOURCE_NAME, PERMISSIONS_NAME,
USER_NAME, RESOURCE_NAME, 90 * 24 * 60 * 60); // 3 months.
static const sstring legacy_table_name{"permissions"};
return auth::once_among_shards([this] {
return auth::create_metadata_table_if_missing(
bool default_authorizer::legacy_metadata_exists() const {
return _qp.db().local().has_schema(meta::AUTH_KS, legacy_table_name);
}
future<bool> default_authorizer::any_granted() const {
static const sstring query = sprint("SELECT * FROM %s.%s LIMIT 1", meta::AUTH_KS, PERMISSIONS_CF);
return _qp.process(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{},
true).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return !results->empty();
});
}
future<> default_authorizer::migrate_legacy_metadata() const {
alogger.info("Starting migration of legacy permissions metadata.");
static const sstring query = sprint("SELECT * FROM %s.%s", meta::AUTH_KS, legacy_table_name);
return _qp.process(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
return do_with(
row.get_as<sstring>("username"),
parse_resource(row.get_as<sstring>(RESOURCE_NAME)),
[this, &row](const auto& username, const auto& r) {
const permission_set perms = permissions::from_strings(row.get_set<sstring>(PERMISSIONS_NAME));
return grant(username, perms, r);
});
}).finally([results] {});
}).then([] {
alogger.info("Finished migrating legacy permissions metadata.");
}).handle_exception([](std::exception_ptr ep) {
alogger.error("Encountered an error during migration!");
std::rethrow_exception(ep);
});
}
future<> default_authorizer::start() {
static const sstring create_table = sprint(
"CREATE TABLE %s.%s ("
"%s text,"
"%s text,"
"%s set<text>,"
"PRIMARY KEY(%s, %s)"
") WITH gc_grace_seconds=%d",
meta::AUTH_KS,
PERMISSIONS_CF,
ROLE_NAME,
RESOURCE_NAME,
PERMISSIONS_NAME,
ROLE_NAME,
RESOURCE_NAME,
90 * 24 * 60 * 60); // 3 months.
return once_among_shards([this] {
return create_metadata_table_if_missing(
PERMISSIONS_CF,
_qp,
create_table,
_migration_manager);
});
}
_migration_manager).then([this] {
_finished = do_after_system_ready(_as, [this] {
return async([this] {
wait_for_schema_agreement(_migration_manager, _qp.db().local()).get0();
future<> auth::default_authorizer::stop() {
return make_ready_future<>();
}
future<auth::permission_set> auth::default_authorizer::authorize(
service& ser, ::shared_ptr<authenticated_user> user, data_resource resource) const {
return auth::is_super_user(ser, *user).then([this, user, resource = std::move(resource)](bool is_super) {
if (is_super) {
return make_ready_future<permission_set>(permissions::ALL);
}
/**
* TODO: could create an actual data type for permission (translating string<->perm),
* but this seems overkill right now. We still must store strings so...
*/
auto query = sprint("SELECT %s FROM %s.%s WHERE %s = ? AND %s = ?"
, PERMISSIONS_NAME, meta::AUTH_KS, PERMISSIONS_CF, USER_NAME, RESOURCE_NAME);
return _qp.process(query, db::consistency_level::LOCAL_ONE, {user->name(), resource.name() })
.then_wrapped([=](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
if (res->empty() || !res->one().has(PERMISSIONS_NAME)) {
return make_ready_future<permission_set>(permissions::NONE);
}
return make_ready_future<permission_set>(permissions::from_strings(res->one().get_set<sstring>(PERMISSIONS_NAME)));
} catch (exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to authorize {} for {}", user->name(), resource);
return make_ready_future<permission_set>(permissions::NONE);
}
});
});
}
#include <boost/range.hpp>
future<> auth::default_authorizer::modify(
::shared_ptr<authenticated_user> performer, permission_set set,
data_resource resource, sstring user, sstring op) {
// TODO: why does this not check super user?
auto query = sprint("UPDATE %s.%s SET %s = %s %s ? WHERE %s = ? AND %s = ?",
meta::AUTH_KS, PERMISSIONS_CF, PERMISSIONS_NAME,
PERMISSIONS_NAME, op, USER_NAME, RESOURCE_NAME);
return _qp.process(query, db::consistency_level::ONE, {
permissions::to_strings(set), user, resource.name() }).discard_result();
}
future<> auth::default_authorizer::grant(
::shared_ptr<authenticated_user> performer, permission_set set,
data_resource resource, sstring to) {
return modify(std::move(performer), std::move(set), std::move(resource), std::move(to), "+");
}
future<> auth::default_authorizer::revoke(
::shared_ptr<authenticated_user> performer, permission_set set,
data_resource resource, sstring from) {
return modify(std::move(performer), std::move(set), std::move(resource), std::move(from), "-");
}
future<std::vector<auth::permission_details>> auth::default_authorizer::list(
service& ser, ::shared_ptr<authenticated_user> performer, permission_set set,
optional<data_resource> resource, optional<sstring> user) const {
return auth::is_super_user(ser, *performer).then([this, performer, set = std::move(set), resource = std::move(resource), user = std::move(user)](bool is_super) {
if (!is_super && (!user || performer->name() != *user)) {
throw exceptions::unauthorized_exception(sprint("You are not authorized to view %s's permissions", user ? *user : "everyone"));
}
auto query = sprint("SELECT %s, %s, %s FROM %s.%s", USER_NAME, RESOURCE_NAME, PERMISSIONS_NAME, meta::AUTH_KS, PERMISSIONS_CF);
// Oh, look, it is a case where it does not pay off to have
// parameters to process in an initializer list.
future<::shared_ptr<cql3::untyped_result_set>> f = make_ready_future<::shared_ptr<cql3::untyped_result_set>>();
if (resource && user) {
query += sprint(" WHERE %s = ? AND %s = ?", USER_NAME, RESOURCE_NAME);
f = _qp.process(query, db::consistency_level::ONE, {*user, resource->name()});
} else if (resource) {
query += sprint(" WHERE %s = ? ALLOW FILTERING", RESOURCE_NAME);
f = _qp.process(query, db::consistency_level::ONE, {resource->name()});
} else if (user) {
query += sprint(" WHERE %s = ?", USER_NAME);
f = _qp.process(query, db::consistency_level::ONE, {*user});
} else {
f = _qp.process(query, db::consistency_level::ONE, {});
}
return f.then([set](::shared_ptr<cql3::untyped_result_set> res) {
std::vector<permission_details> result;
for (auto& row : *res) {
if (row.has(PERMISSIONS_NAME)) {
auto username = row.get_as<sstring>(USER_NAME);
auto resource = data_resource::from_name(row.get_as<sstring>(RESOURCE_NAME));
auto ps = permissions::from_strings(row.get_set<sstring>(PERMISSIONS_NAME));
ps = permission_set::from_mask(ps.mask() & set.mask());
result.emplace_back(permission_details {username, resource, ps});
}
}
return make_ready_future<std::vector<permission_details>>(std::move(result));
});
});
}
future<> auth::default_authorizer::revoke_all(sstring dropped_user) {
auto query = sprint("DELETE FROM %s.%s WHERE %s = ?", meta::AUTH_KS,
PERMISSIONS_CF, USER_NAME);
return _qp.process(query, db::consistency_level::ONE, { dropped_user }).discard_result().handle_exception(
[dropped_user](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions of {}: {}", dropped_user, e);
if (legacy_metadata_exists()) {
if (!any_granted().get0()) {
migrate_legacy_metadata().get0();
return;
}
});
alogger.warn("Ignoring legacy permissions metadata since role permissions exist.");
}
});
});
});
});
}
future<> auth::default_authorizer::revoke_all(data_resource resource) {
auto query = sprint("SELECT %s FROM %s.%s WHERE %s = ? ALLOW FILTERING",
USER_NAME, meta::AUTH_KS, PERMISSIONS_CF, RESOURCE_NAME);
return _qp.process(query, db::consistency_level::LOCAL_ONE, { resource.name() })
.then_wrapped([this, resource](future<::shared_ptr<cql3::untyped_result_set>> f) {
future<> default_authorizer::stop() {
_as.request_abort();
return _finished.handle_exception_type([](const sleep_aborted&) {});
}
future<permission_set>
default_authorizer::authorize(const role_or_anonymous& maybe_role, const resource& r) const {
if (is_anonymous(maybe_role)) {
return make_ready_future<permission_set>(permissions::NONE);
}
static const sstring query = sprint(
"SELECT %s FROM %s.%s WHERE %s = ? AND %s = ?",
PERMISSIONS_NAME,
meta::AUTH_KS,
PERMISSIONS_CF,
ROLE_NAME,
RESOURCE_NAME);
return _qp.process(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{*maybe_role.name, r.name()}).then([](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return permissions::NONE;
}
return permissions::from_strings(results->one().get_set<sstring>(PERMISSIONS_NAME));
});
}
future<>
default_authorizer::modify(
stdx::string_view role_name,
permission_set set,
const resource& resource,
stdx::string_view op) const {
return do_with(
sprint(
"UPDATE %s.%s SET %s = %s %s ? WHERE %s = ? AND %s = ?",
meta::AUTH_KS,
PERMISSIONS_CF,
PERMISSIONS_NAME,
PERMISSIONS_NAME,
op,
ROLE_NAME,
RESOURCE_NAME),
[this, &role_name, set, &resource](const auto& query) {
return _qp.process(
query,
db::consistency_level::ONE,
internal_distributed_timeout_config(),
{permissions::to_strings(set), sstring(role_name), resource.name()}).discard_result();
});
}
future<> default_authorizer::grant(stdx::string_view role_name, permission_set set, const resource& resource) const {
return modify(role_name, std::move(set), resource, "+");
}
future<> default_authorizer::revoke(stdx::string_view role_name, permission_set set, const resource& resource) const {
return modify(role_name, std::move(set), resource, "-");
}
future<std::vector<permission_details>> default_authorizer::list_all() const {
static const sstring query = sprint(
"SELECT %s, %s, %s FROM %s.%s",
ROLE_NAME,
RESOURCE_NAME,
PERMISSIONS_NAME,
meta::AUTH_KS,
PERMISSIONS_CF);
return _qp.process(
query,
db::consistency_level::ONE,
internal_distributed_timeout_config(),
{},
true).then([](::shared_ptr<cql3::untyped_result_set> results) {
std::vector<permission_details> all_details;
for (const auto& row : *results) {
if (row.has(PERMISSIONS_NAME)) {
auto role_name = row.get_as<sstring>(ROLE_NAME);
auto resource = parse_resource(row.get_as<sstring>(RESOURCE_NAME));
auto perms = permissions::from_strings(row.get_set<sstring>(PERMISSIONS_NAME));
all_details.push_back(permission_details{std::move(role_name), std::move(resource), std::move(perms)});
}
}
return all_details;
});
}
future<> default_authorizer::revoke_all(stdx::string_view role_name) const {
static const sstring query = sprint(
"DELETE FROM %s.%s WHERE %s = ?",
meta::AUTH_KS,
PERMISSIONS_CF,
ROLE_NAME);
return _qp.process(
query,
db::consistency_level::ONE,
internal_distributed_timeout_config(),
{sstring(role_name)}).discard_result().handle_exception([role_name](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions of {}: {}", role_name, e);
}
});
}
future<> default_authorizer::revoke_all(const resource& resource) const {
static const sstring query = sprint(
"SELECT %s FROM %s.%s WHERE %s = ? ALLOW FILTERING",
ROLE_NAME,
meta::AUTH_KS,
PERMISSIONS_CF,
RESOURCE_NAME);
return _qp.process(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{resource.name()}).then_wrapped([this, resource](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
return parallel_for_each(res->begin(), res->end(), [this, res, resource](const cql3::untyped_result_set::row& r) {
auto query = sprint("DELETE FROM %s.%s WHERE %s = ? AND %s = ?"
, meta::AUTH_KS, PERMISSIONS_CF, USER_NAME, RESOURCE_NAME);
return _qp.process(query, db::consistency_level::LOCAL_ONE, { r.get_as<sstring>(USER_NAME), resource.name() })
.discard_result().handle_exception([resource](auto ep) {
return parallel_for_each(
res->begin(),
res->end(),
[this, res, resource](const cql3::untyped_result_set::row& r) {
static const sstring query = sprint(
"DELETE FROM %s.%s WHERE %s = ? AND %s = ?",
meta::AUTH_KS,
PERMISSIONS_CF,
ROLE_NAME,
RESOURCE_NAME);
return _qp.process(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{r.get_as<sstring>(ROLE_NAME), resource.name()}).discard_result().handle_exception(
[resource](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::request_execution_exception& e) {
@@ -246,12 +339,9 @@ future<> auth::default_authorizer::revoke_all(data_resource resource) {
});
}
const auth::resource_ids& auth::default_authorizer::protected_resources() {
static const resource_ids ids({ data_resource(meta::AUTH_KS, PERMISSIONS_CF) });
return ids;
const resource_set& default_authorizer::protected_resources() const {
static const resource_set resources({ make_data_resource(meta::AUTH_KS, PERMISSIONS_CF) });
return resources;
}
future<> auth::default_authorizer::validate_configuration() const {
return make_ready_future();
}
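
The new `modify()` above builds one CQL `UPDATE` for both operations: `grant()` passes `"+"` and `revoke()` passes `"-"` as the set operator. A minimal standalone sketch of that string construction follows. The keyspace, table, and column values are assumptions (the constants `meta::AUTH_KS`, `PERMISSIONS_CF`, `PERMISSIONS_NAME`, `ROLE_NAME` and `RESOURCE_NAME` are defined outside this diff), and `build_modify_query` is a hypothetical helper, not a function in the codebase.

```cpp
#include <cassert>
#include <string>

// Assumed stand-ins for meta::AUTH_KS, PERMISSIONS_CF, PERMISSIONS_NAME,
// ROLE_NAME and RESOURCE_NAME; the real values are not shown in this diff.
static const std::string auth_ks = "system_auth";
static const std::string permissions_cf = "role_permissions";
static const std::string permissions_col = "permissions";
static const std::string role_col = "role";
static const std::string resource_col = "resource";

// Hypothetical helper mirroring the format string
// "UPDATE %s.%s SET %s = %s %s ? WHERE %s = ? AND %s = ?".
// `op` is "+" to add permissions to the set and "-" to remove them.
std::string build_modify_query(const std::string& op) {
    return "UPDATE " + auth_ks + "." + permissions_cf +
           " SET " + permissions_col + " = " + permissions_col + " " + op + " ?" +
           " WHERE " + role_col + " = ? AND " + resource_col + " = ?";
}
```

Because only the operator differs, grant and revoke share the same bind-value order: the permission strings, then the role name, then the resource name.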


@@ -43,7 +43,9 @@
#include <functional>
#include "authorizer.hh"
#include <seastar/core/abort_source.hh>
#include "auth/authorizer.hh"
#include "cql3/query_processor.hh"
#include "service/migration_manager.hh"
@@ -56,36 +58,45 @@ class default_authorizer : public authorizer {
::service::migration_manager& _migration_manager;
abort_source _as{};
future<> _finished{make_ready_future<>()};
public:
default_authorizer(cql3::query_processor&, ::service::migration_manager&);
~default_authorizer();
future<> start() override;
virtual future<> start() override;
future<> stop() override;
virtual future<> stop() override;
const sstring& qualified_java_name() const override {
virtual const sstring& qualified_java_name() const override {
return default_authorizer_name();
}
future<permission_set> authorize(service&, ::shared_ptr<authenticated_user>, data_resource) const override;
virtual future<permission_set> authorize(const role_or_anonymous&, const resource&) const override;
future<> grant(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring) override;
virtual future<> grant(stdx::string_view, permission_set, const resource&) const override;
future<> revoke(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring) override;
virtual future<> revoke( stdx::string_view, permission_set, const resource&) const override;
future<std::vector<permission_details>> list(service&, ::shared_ptr<authenticated_user>, permission_set, optional<data_resource>, optional<sstring>) const override;
virtual future<std::vector<permission_details>> list_all() const override;
future<> revoke_all(sstring) override;
virtual future<> revoke_all(stdx::string_view) const override;
future<> revoke_all(data_resource) override;
virtual future<> revoke_all(const resource&) const override;
const resource_ids& protected_resources() override;
future<> validate_configuration() const override;
virtual const resource_set& protected_resources() const override;
private:
future<> modify(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring, sstring);
bool legacy_metadata_exists() const;
future<bool> any_granted() const;
future<> migrate_legacy_metadata() const;
future<> modify(stdx::string_view, permission_set, const resource&, stdx::string_view) const;
};
} /* namespace auth */


@@ -39,48 +39,56 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <unistd.h>
#include <crypt.h>
#include <random>
#include <chrono>
#include "auth/password_authenticator.hh"
extern "C" {
#include <crypt.h>
#include <unistd.h>
}
#include <algorithm>
#include <chrono>
#include <random>
#include <boost/algorithm/cxx11/all_of.hpp>
#include <seastar/core/reactor.hh>
#include "common.hh"
#include "password_authenticator.hh"
#include "authenticated_user.hh"
#include "auth/authenticated_user.hh"
#include "auth/common.hh"
#include "auth/roles-metadata.hh"
#include "cql3/untyped_result_set.hh"
#include "log.hh"
#include "service/migration_manager.hh"
#include "utils/class_registrator.hh"
const sstring& auth::password_authenticator_name() {
namespace auth {
const sstring& password_authenticator_name() {
static const sstring name = meta::AUTH_PACKAGE_NAME + "PasswordAuthenticator";
return name;
}
// name of the hash column.
static const sstring SALTED_HASH = "salted_hash";
static const sstring USER_NAME = "username";
static const sstring DEFAULT_USER_NAME = auth::meta::DEFAULT_SUPERUSER_NAME;
static const sstring DEFAULT_USER_PASSWORD = auth::meta::DEFAULT_SUPERUSER_NAME;
static const sstring CREDENTIALS_CF = "credentials";
static const sstring DEFAULT_USER_NAME = meta::DEFAULT_SUPERUSER_NAME;
static const sstring DEFAULT_USER_PASSWORD = meta::DEFAULT_SUPERUSER_NAME;
static logging::logger plogger("password_authenticator");
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<
auth::authenticator,
auth::password_authenticator,
authenticator,
password_authenticator,
cql3::query_processor&,
::service::migration_manager&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");
auth::password_authenticator::~password_authenticator()
{}
password_authenticator::~password_authenticator() {
}
auth::password_authenticator::password_authenticator(cql3::query_processor& qp, ::service::migration_manager& mm)
password_authenticator::password_authenticator(cql3::query_processor& qp, ::service::migration_manager& mm)
: _qp(qp)
, _migration_manager(mm) {
, _migration_manager(mm)
, _stopped(make_ready_future<>()) {
}
// TODO: blowfish
@@ -141,7 +149,9 @@ static sstring gensalt() {
// blowfish 2011 fix, blowfish, sha512, sha256, md5
for (sstring pfx : { "$2y$", "$2a$", "$6$", "$5$", "$1$" }) {
salt = pfx + input;
if (crypt_r("fisk", salt.c_str(), &tlcrypt)) {
const char* e = crypt_r("fisk", salt.c_str(), &tlcrypt);
if (e && (e[0] != '*')) {
prefix = pfx;
return salt;
}
@@ -153,76 +163,128 @@ static sstring hashpw(const sstring& pass) {
return hashpw(pass, gensalt());
}
future<> auth::password_authenticator::start() {
return auth::once_among_shards([this] {
gensalt(); // do this once to determine usable hashing
static bool has_salted_hash(const cql3::untyped_result_set_row& row) {
return !row.get_or<sstring>(SALTED_HASH, "").empty();
}
static const sstring create_table = sprint(
"CREATE TABLE %s.%s ("
"%s text,"
"%s text," // salt + hash + number of rounds
"options map<text,text>,"// for future extensions
"PRIMARY KEY(%s)"
") WITH gc_grace_seconds=%d",
meta::AUTH_KS,
CREDENTIALS_CF, USER_NAME, SALTED_HASH, USER_NAME,
90 * 24 * 60 * 60); // 3 months.
static const sstring update_row_query = sprint(
"UPDATE %s SET %s = ? WHERE %s = ?",
meta::roles_table::qualified_name(),
SALTED_HASH,
meta::roles_table::role_col_name);
return auth::create_metadata_table_if_missing(
CREDENTIALS_CF,
_qp,
create_table,
_migration_manager).then([this] {
auth::delay_until_system_ready(_delayed, [this] {
return has_existing_users().then([this](bool existing) {
if (!existing) {
return _qp.process(
sprint(
"INSERT INTO %s.%s (%s, %s) VALUES (?, ?) USING TIMESTAMP 0",
meta::AUTH_KS,
CREDENTIALS_CF,
USER_NAME, SALTED_HASH),
db::consistency_level::ONE,
{ DEFAULT_USER_NAME, hashpw(DEFAULT_USER_PASSWORD) }).then([](auto) {
plogger.info("Created default user '{}'", DEFAULT_USER_NAME);
});
}
static const sstring legacy_table_name{"credentials"};
return make_ready_future<>();
});
});
});
bool password_authenticator::legacy_metadata_exists() const {
return _qp.db().local().has_schema(meta::AUTH_KS, legacy_table_name);
}
future<> password_authenticator::migrate_legacy_metadata() const {
plogger.info("Starting migration of legacy authentication metadata.");
static const sstring query = sprint("SELECT * FROM %s.%s", meta::AUTH_KS, legacy_table_name);
return _qp.process(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
auto username = row.get_as<sstring>("username");
auto salted_hash = row.get_as<sstring>(SALTED_HASH);
return _qp.process(
update_row_query,
consistency_for_user(username),
internal_distributed_timeout_config(),
{std::move(salted_hash), username}).discard_result();
}).finally([results] {});
}).then([] {
plogger.info("Finished migrating legacy authentication metadata.");
}).handle_exception([](std::exception_ptr ep) {
plogger.error("Encountered an error during migration!");
std::rethrow_exception(ep);
});
}
future<> auth::password_authenticator::stop() {
return make_ready_future<>();
future<> password_authenticator::create_default_if_missing() const {
return default_role_row_satisfies(_qp, &has_salted_hash).then([this](bool exists) {
if (!exists) {
return _qp.process(
update_row_query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config(),
{hashpw(DEFAULT_USER_PASSWORD), DEFAULT_USER_NAME}).then([](auto&&) {
plogger.info("Created default superuser authentication record.");
});
}
return make_ready_future<>();
});
}
db::consistency_level auth::password_authenticator::consistency_for_user(const sstring& username) {
if (username == DEFAULT_USER_NAME) {
future<> password_authenticator::start() {
return once_among_shards([this] {
gensalt(); // do this once to determine usable hashing
auto f = create_metadata_table_if_missing(
meta::roles_table::name,
_qp,
meta::roles_table::creation_query(),
_migration_manager);
_stopped = do_after_system_ready(_as, [this] {
return async([this] {
wait_for_schema_agreement(_migration_manager, _qp.db().local()).get0();
if (any_nondefault_role_row_satisfies(_qp, &has_salted_hash).get0()) {
if (legacy_metadata_exists()) {
plogger.warn("Ignoring legacy authentication metadata since nondefault data already exist.");
}
return;
}
if (legacy_metadata_exists()) {
migrate_legacy_metadata().get0();
return;
}
create_default_if_missing().get0();
});
});
return f;
});
}
future<> password_authenticator::stop() {
_as.request_abort();
return _stopped.handle_exception_type([] (const sleep_aborted&) { });
}
db::consistency_level password_authenticator::consistency_for_user(stdx::string_view role_name) {
if (role_name == DEFAULT_USER_NAME) {
return db::consistency_level::QUORUM;
}
return db::consistency_level::LOCAL_ONE;
}
const sstring& auth::password_authenticator::qualified_java_name() const {
const sstring& password_authenticator::qualified_java_name() const {
return password_authenticator_name();
}
bool auth::password_authenticator::require_authentication() const {
bool password_authenticator::require_authentication() const {
return true;
}
auth::authenticator::option_set auth::password_authenticator::supported_options() const {
return option_set::of<option::PASSWORD>();
authentication_option_set password_authenticator::supported_options() const {
return authentication_option_set{authentication_option::password};
}
auth::authenticator::option_set auth::password_authenticator::alterable_options() const {
return option_set::of<option::PASSWORD>();
authentication_option_set password_authenticator::alterable_options() const {
return authentication_option_set{authentication_option::password};
}
future<::shared_ptr<auth::authenticated_user> > auth::password_authenticator::authenticate(
future<authenticated_user> password_authenticator::authenticate(
const credentials_map& credentials) const {
if (!credentials.count(USERNAME_KEY)) {
throw exceptions::authentication_exception(sprint("Required key '%s' is missing", USERNAME_KEY));
@@ -240,16 +302,29 @@ future<::shared_ptr<auth::authenticated_user> > auth::password_authenticator::au
// Rely on query processing caching statements instead, and lets assume
// that a map lookup string->statement is not gonna kill us much.
return futurize_apply([this, username, password] {
return _qp.process(sprint("SELECT %s FROM %s.%s WHERE %s = ?", SALTED_HASH,
meta::AUTH_KS, CREDENTIALS_CF, USER_NAME),
consistency_for_user(username), {username}, true);
static const sstring query = sprint(
"SELECT %s FROM %s WHERE %s = ?",
SALTED_HASH,
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return _qp.process(
query,
consistency_for_user(username),
internal_distributed_timeout_config(),
{username},
true);
}).then_wrapped([=](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
if (res->empty() || !checkpw(password, res->one().get_as<sstring>(SALTED_HASH))) {
auto salted_hash = std::experimental::optional<sstring>();
if (!res->empty()) {
salted_hash = res->one().get_opt<sstring>(SALTED_HASH);
}
if (!salted_hash || !checkpw(password, *salted_hash)) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
return make_ready_future<::shared_ptr<authenticated_user>>(::make_shared<authenticated_user>(username));
return make_ready_future<authenticated_user>(username);
} catch (std::system_error &) {
std::throw_with_nested(exceptions::authentication_exception("Could not verify password"));
} catch (exceptions::request_execution_exception& e) {
@@ -260,52 +335,65 @@ future<::shared_ptr<auth::authenticated_user> > auth::password_authenticator::au
});
}
future<> auth::password_authenticator::create(sstring username,
const option_map& options) {
try {
auto password = boost::any_cast<sstring>(options.at(option::PASSWORD));
auto query = sprint("INSERT INTO %s.%s (%s, %s) VALUES (?, ?)",
meta::AUTH_KS, CREDENTIALS_CF, USER_NAME, SALTED_HASH);
return _qp.process(query, consistency_for_user(username), { username, hashpw(password) }).discard_result();
} catch (std::out_of_range&) {
throw exceptions::invalid_request_exception("PasswordAuthenticator requires PASSWORD option");
future<> password_authenticator::create(stdx::string_view role_name, const authentication_options& options) const {
if (!options.password) {
return make_ready_future<>();
}
return _qp.process(
update_row_query,
consistency_for_user(role_name),
internal_distributed_timeout_config(),
{hashpw(*options.password), sstring(role_name)}).discard_result();
}
future<> auth::password_authenticator::alter(sstring username,
const option_map& options) {
try {
auto password = boost::any_cast<sstring>(options.at(option::PASSWORD));
auto query = sprint("UPDATE %s.%s SET %s = ? WHERE %s = ?",
meta::AUTH_KS, CREDENTIALS_CF, SALTED_HASH, USER_NAME);
return _qp.process(query, consistency_for_user(username), { hashpw(password), username }).discard_result();
} catch (std::out_of_range&) {
throw exceptions::invalid_request_exception("PasswordAuthenticator requires PASSWORD option");
future<> password_authenticator::alter(stdx::string_view role_name, const authentication_options& options) const {
if (!options.password) {
return make_ready_future<>();
}
static const sstring query = sprint(
"UPDATE %s SET %s = ? WHERE %s = ?",
meta::roles_table::qualified_name(),
SALTED_HASH,
meta::roles_table::role_col_name);
return _qp.process(
query,
consistency_for_user(role_name),
internal_distributed_timeout_config(),
{hashpw(*options.password), sstring(role_name)}).discard_result();
}
future<> auth::password_authenticator::drop(sstring username) {
try {
auto query = sprint("DELETE FROM %s.%s WHERE %s = ?",
meta::AUTH_KS, CREDENTIALS_CF, USER_NAME);
return _qp.process(query, consistency_for_user(username), { username }).discard_result();
} catch (std::out_of_range&) {
throw exceptions::invalid_request_exception("PasswordAuthenticator requires PASSWORD option");
}
future<> password_authenticator::drop(stdx::string_view name) const {
static const sstring query = sprint(
"DELETE %s FROM %s WHERE %s = ?",
SALTED_HASH,
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return _qp.process(
query, consistency_for_user(name),
internal_distributed_timeout_config(),
{sstring(name)}).discard_result();
}
const auth::resource_ids& auth::password_authenticator::protected_resources() const {
static const resource_ids ids({ data_resource(meta::AUTH_KS, CREDENTIALS_CF) });
return ids;
future<custom_options> password_authenticator::query_custom_options(stdx::string_view role_name) const {
return make_ready_future<custom_options>();
}
::shared_ptr<auth::authenticator::sasl_challenge> auth::password_authenticator::new_sasl_challenge() const {
class plain_text_password_challenge: public sasl_challenge {
const resource_set& password_authenticator::protected_resources() const {
static const resource_set resources({make_data_resource(meta::AUTH_KS, meta::roles_table::name)});
return resources;
}
::shared_ptr<authenticator::sasl_challenge> password_authenticator::new_sasl_challenge() const {
class plain_text_password_challenge : public sasl_challenge {
const password_authenticator& _self;
public:
plain_text_password_challenge(const password_authenticator& self) : _self(self)
{}
plain_text_password_challenge(const password_authenticator& self) : _self(self) {
}
/**
* SASL PLAIN mechanism specifies that credentials are encoded in a
@@ -355,10 +443,12 @@ const auth::resource_ids& auth::password_authenticator::protected_resources() co
_complete = true;
return {};
}
bool is_complete() const override {
return _complete;
}
future<::shared_ptr<authenticated_user>> get_authenticated_user() const override {
future<authenticated_user> get_authenticated_user() const override {
return _self.authenticate(_credentials);
}
private:
@@ -368,49 +458,4 @@ const auth::resource_ids& auth::password_authenticator::protected_resources() co
return ::make_shared<plain_text_password_challenge>(*this);
}
//
// Similar in structure to `auth::service::has_existing_users()`, but trying to generalize the pattern breaks all kinds
// of module boundaries and leaks implementation details.
//
future<bool> auth::password_authenticator::has_existing_users() const {
static const sstring default_user_query = sprint(
"SELECT * FROM %s.%s WHERE %s = ?",
meta::AUTH_KS,
CREDENTIALS_CF,
USER_NAME);
static const sstring all_users_query = sprint(
"SELECT * FROM %s.%s LIMIT 1",
meta::AUTH_KS,
CREDENTIALS_CF);
// This logic is borrowed directly from Apache Cassandra. By first checking for the presence of the default user, we
// can potentially avoid doing a range query with a high consistency level.
return _qp.process(
default_user_query,
db::consistency_level::ONE,
{ meta::DEFAULT_SUPERUSER_NAME },
true).then([this](auto results) {
if (!results->empty()) {
return make_ready_future<bool>(true);
}
return _qp.process(
default_user_query,
db::consistency_level::QUORUM,
{ meta::DEFAULT_SUPERUSER_NAME },
true).then([this](auto results) {
if (!results->empty()) {
return make_ready_future<bool>(true);
}
return _qp.process(
all_users_query,
db::consistency_level::QUORUM).then([](auto results) {
return make_ready_future<bool>(!results->empty());
});
});
});
}
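
The `gensalt()` hunk in this file changes the success check on `crypt_r()`: a non-null return is no longer enough, and a result beginning with `*` is also treated as failure. Some `crypt` implementations report an unsupported or invalid setting by returning a short invalid-hash string that starts with `*` rather than a null pointer. The check can be sketched in isolation as below; `crypt_result_is_usable` is a hypothetical name, and the sketch does not call `crypt_r()` itself.

```cpp
#include <cassert>

// Hypothetical helper: returns true only when a crypt()/crypt_r() style
// result denotes success. A null pointer and a string starting with '*'
// are both treated as failure, matching the condition `e && (e[0] != '*')`.
bool crypt_result_is_usable(const char* result) {
    return result != nullptr && result[0] != '*';
}
```

With this condition the prefix loop keeps trying the next hashing scheme until one yields a real hash, instead of accepting the first non-null result.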


@@ -41,9 +41,10 @@
#pragma once
#include "authenticator.hh"
#include <seastar/core/abort_source.hh>
#include "auth/authenticator.hh"
#include "cql3/query_processor.hh"
#include "delayed_tasks.hh"
namespace service {
class migration_manager;
@@ -55,35 +56,49 @@ const sstring& password_authenticator_name();
class password_authenticator : public authenticator {
cql3::query_processor& _qp;
::service::migration_manager& _migration_manager;
delayed_tasks<> _delayed{};
future<> _stopped;
seastar::abort_source _as;
public:
static db::consistency_level consistency_for_user(stdx::string_view role_name);
password_authenticator(cql3::query_processor&, ::service::migration_manager&);
~password_authenticator();
future<> start() override;
virtual future<> start() override;
future<> stop() override;
virtual future<> stop() override;
const sstring& qualified_java_name() const override;
bool require_authentication() const override;
option_set supported_options() const override;
option_set alterable_options() const override;
future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const override;
future<> create(sstring username, const option_map& options) override;
future<> alter(sstring username, const option_map& options) override;
future<> drop(sstring username) override;
const resource_ids& protected_resources() const override;
::shared_ptr<sasl_challenge> new_sasl_challenge() const override;
virtual const sstring& qualified_java_name() const override;
virtual bool require_authentication() const override;
static db::consistency_level consistency_for_user(const sstring& username);
virtual authentication_option_set supported_options() const override;
virtual authentication_option_set alterable_options() const override;
virtual future<authenticated_user> authenticate(const credentials_map& credentials) const override;
virtual future<> create(stdx::string_view role_name, const authentication_options& options) const override;
virtual future<> alter(stdx::string_view role_name, const authentication_options& options) const override;
virtual future<> drop(stdx::string_view role_name) const override;
virtual future<custom_options> query_custom_options(stdx::string_view role_name) const override;
virtual const resource_set& protected_resources() const override;
virtual ::shared_ptr<sasl_challenge> new_sasl_challenge() const override;
private:
future<bool> has_existing_users() const;
bool legacy_metadata_exists() const;
future<> migrate_legacy_metadata() const;
future<> create_default_if_missing() const;
};
}
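
`consistency_for_user()`, declared above, picks the consistency level per role: the default superuser is read and written at `QUORUM`, every other role at `LOCAL_ONE`. A minimal standalone sketch of that rule follows. The enum is a local stand-in for `db::consistency_level`, and the value `"cassandra"` for `meta::DEFAULT_SUPERUSER_NAME` is an assumption, since the constant is defined outside this diff.

```cpp
#include <cassert>
#include <string>

// Local stand-in for db::consistency_level (only the two values used here).
enum class consistency_level { LOCAL_ONE, QUORUM };

// Assumed value of meta::DEFAULT_SUPERUSER_NAME; not shown in this diff.
static const std::string default_user_name = "cassandra";

// The default superuser gets the stronger level; all other roles use
// the cheaper local read/write.
consistency_level consistency_for_user(const std::string& role_name) {
    if (role_name == default_user_name) {
        return consistency_level::QUORUM;
    }
    return consistency_level::LOCAL_ONE;
}
```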


@@ -39,32 +39,33 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <unordered_map>
#include <boost/algorithm/string.hpp>
#include "permission.hh"
#include "auth/permission.hh"
#include <boost/algorithm/string.hpp>
#include <unordered_map>
const auth::permission_set auth::permissions::ALL = auth::permission_set::of<
auth::permission::CREATE,
auth::permission::ALTER,
auth::permission::DROP,
auth::permission::SELECT,
auth::permission::MODIFY,
auth::permission::AUTHORIZE,
auth::permission::DESCRIBE>();
const auth::permission_set auth::permissions::ALL_DATA =
auth::permission_set::of<auth::permission::CREATE,
auth::permission::ALTER, auth::permission::DROP,
auth::permission::SELECT,
auth::permission::MODIFY,
auth::permission::AUTHORIZE>();
const auth::permission_set auth::permissions::ALL = auth::permissions::ALL_DATA;
const auth::permission_set auth::permissions::NONE;
const auth::permission_set auth::permissions::ALTERATIONS =
auth::permission_set::of<auth::permission::CREATE,
auth::permission::ALTER, auth::permission::DROP>();
static const std::unordered_map<sstring, auth::permission> permission_names({
{ "READ", auth::permission::READ },
{ "WRITE", auth::permission::WRITE },
{ "CREATE", auth::permission::CREATE },
{ "ALTER", auth::permission::ALTER },
{ "DROP", auth::permission::DROP },
{ "SELECT", auth::permission::SELECT },
{ "MODIFY", auth::permission::MODIFY },
{ "AUTHORIZE", auth::permission::AUTHORIZE },
});
{"READ", auth::permission::READ},
{"WRITE", auth::permission::WRITE},
{"CREATE", auth::permission::CREATE},
{"ALTER", auth::permission::ALTER},
{"DROP", auth::permission::DROP},
{"SELECT", auth::permission::SELECT},
{"MODIFY", auth::permission::MODIFY},
{"AUTHORIZE", auth::permission::AUTHORIZE},
{"DESCRIBE", auth::permission::DESCRIBE}});
const sstring& auth::permissions::to_string(permission p) {
for (auto& v : permission_names) {


@@ -42,10 +42,11 @@
#pragma once
#include <unordered_set>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
#include "enum_set.hh"
#include "seastarx.hh"
namespace auth {
@@ -66,9 +67,13 @@ enum class permission {
// permission management
AUTHORIZE, // required for GRANT and REVOKE.
DESCRIBE, // required on the root-level role resource to list all roles.
};
typedef enum_set<super_enum<permission,
typedef enum_set<
super_enum<
permission,
permission::READ,
permission::WRITE,
permission::CREATE,
@@ -76,16 +81,15 @@ typedef enum_set<super_enum<permission,
permission::DROP,
permission::SELECT,
permission::MODIFY,
permission::AUTHORIZE>> permission_set;
permission::AUTHORIZE,
permission::DESCRIBE>> permission_set;
bool operator<(const permission_set&, const permission_set&);
namespace permissions {
extern const permission_set ALL_DATA;
extern const permission_set ALL;
extern const permission_set NONE;
extern const permission_set ALTERATIONS;
const sstring& to_string(permission);
permission from_string(const sstring&);
@@ -93,7 +97,6 @@ permission from_string(const sstring&);
std::unordered_set<sstring> to_strings(const permission_set&);
permission_set from_strings(const std::unordered_set<sstring>&);
}
}
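
This change adds `DESCRIBE` to both the `permission` enum and the name table used by `to_string()` and `from_string()`. The round trip between names and enumerators can be sketched as below. This is a simplified standalone version using `std::string` instead of `sstring`; the linear scan follows the loop visible in the diff, while the use of `at()` in `from_string()` and the exception type are assumptions about code not shown here.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <unordered_map>

enum class permission { READ, WRITE, CREATE, ALTER, DROP, SELECT, MODIFY, AUTHORIZE, DESCRIBE };

// Name table, including the newly added DESCRIBE entry.
static const std::unordered_map<std::string, permission> permission_names{
    {"READ", permission::READ},
    {"WRITE", permission::WRITE},
    {"CREATE", permission::CREATE},
    {"ALTER", permission::ALTER},
    {"DROP", permission::DROP},
    {"SELECT", permission::SELECT},
    {"MODIFY", permission::MODIFY},
    {"AUTHORIZE", permission::AUTHORIZE},
    {"DESCRIBE", permission::DESCRIBE}};

// Reverse lookup by scanning the table; throws if the value has no name.
const std::string& to_string(permission p) {
    for (const auto& v : permission_names) {
        if (v.second == p) {
            return v.first;
        }
    }
    throw std::out_of_range("unknown permission");
}

// Forward lookup; at() throws std::out_of_range for an unknown name.
permission from_string(const std::string& s) {
    return permission_names.at(s);
}
```

Keeping the table and the enum in sync matters here: an enumerator without a table entry cannot be serialized by `to_string()`.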


@@ -39,13 +39,15 @@ permissions_cache_config permissions_cache_config::from_db_config(const db::conf
permissions_cache::permissions_cache(const permissions_cache_config& c, service& ser, logging::logger& log)
: _cache(c.max_entries, c.validity_period, c.update_period, log, [&ser, &log](const key_type& k) {
log.debug("Refreshing permissions for {}", k.first.name());
return ser.underlying_authorizer().authorize(ser, ::make_shared<authenticated_user>(k.first), k.second);
log.debug("Refreshing permissions for {}", k.first);
return ser.get_uncached_permissions(k.first, k.second);
}) {
}
future<permission_set> permissions_cache::get(::shared_ptr<authenticated_user> user, data_resource r) {
return _cache.get(key_type(*user, r));
future<permission_set> permissions_cache::get(const role_or_anonymous& maybe_role, const resource& r) {
return do_with(key_type(maybe_role, r), [this](const auto& k) {
return _cache.get(k);
});
}
}


@@ -22,37 +22,29 @@
#pragma once
#include <chrono>
#include <experimental/string_view>
#include <functional>
#include <iostream>
#include <optional>
#include <utility>
#include <seastar/core/future.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/sstring.hh>
#include "auth/authenticated_user.hh"
#include "auth/data_resource.hh"
#include "auth/permission.hh"
#include "auth/resource.hh"
#include "auth/role_or_anonymous.hh"
#include "log.hh"
#include "stdx.hh"
#include "utils/hash.hh"
#include "utils/loading_cache.hh"
namespace std {
template <>
struct hash<auth::data_resource> final {
size_t operator()(const auth::data_resource & v) const {
return v.hash_value();
}
};
template <>
struct hash<auth::authenticated_user> final {
size_t operator()(const auth::authenticated_user & v) const {
return utils::tuple_hash()(v.name(), v.is_anonymous());
}
};
inline std::ostream& operator<<(std::ostream& os, const std::pair<auth::authenticated_user, auth::data_resource>& p) {
os << "{user: " << p.first.name() << ", data_resource: " << p.second << "}";
inline std::ostream& operator<<(std::ostream& os, const pair<auth::role_or_anonymous, auth::resource>& p) {
os << "{role: " << p.first << ", resource: " << p.second << "}";
return os;
}
@@ -76,7 +68,7 @@ struct permissions_cache_config final {
class permissions_cache final {
using cache_type = utils::loading_cache<
std::pair<authenticated_user, data_resource>,
std::pair<role_or_anonymous, resource>,
permission_set,
utils::loading_cache_reload_enabled::yes,
utils::simple_entry_size<permission_set>,
@@ -89,15 +81,11 @@ class permissions_cache final {
public:
explicit permissions_cache(const permissions_cache_config&, service&, logging::logger&);
future<> start() {
return make_ready_future<>();
}
future <> stop() {
return _cache.stop();
}
future<permission_set> get(::shared_ptr<authenticated_user>, data_resource);
future<permission_set> get(const role_or_anonymous&, const resource&);
};
}

auth/resource.cc (new file, 296 lines)

@@ -0,0 +1,296 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/resource.hh"
#include <algorithm>
#include <iterator>
#include <unordered_map>
#include <boost/algorithm/string/join.hpp>
#include <boost/algorithm/string/split.hpp>
#include "service/storage_proxy.hh"
namespace auth {
std::ostream& operator<<(std::ostream& os, resource_kind kind) {
switch (kind) {
case resource_kind::data: os << "data"; break;
case resource_kind::role: os << "role"; break;
}
return os;
}
static const std::unordered_map<resource_kind, stdx::string_view> roots{
{resource_kind::data, "data"},
{resource_kind::role, "roles"}};
static const std::unordered_map<resource_kind, std::size_t> max_parts{
{resource_kind::data, 2},
{resource_kind::role, 1}};
static permission_set applicable_permissions(const data_resource_view& dv) {
if (dv.table()) {
return permission_set::of<
permission::ALTER,
permission::DROP,
permission::SELECT,
permission::MODIFY,
permission::AUTHORIZE>();
}
return permission_set::of<
permission::CREATE,
permission::ALTER,
permission::DROP,
permission::SELECT,
permission::MODIFY,
permission::AUTHORIZE>();
}
static permission_set applicable_permissions(const role_resource_view& rv) {
if (rv.role()) {
return permission_set::of<permission::ALTER, permission::DROP, permission::AUTHORIZE>();
}
return permission_set::of<
permission::CREATE,
permission::ALTER,
permission::DROP,
permission::AUTHORIZE,
permission::DESCRIBE>();
}
resource::resource(resource_kind kind) : _kind(kind), _parts{sstring(roots.at(kind))} {
}
resource::resource(resource_kind kind, std::vector<sstring> parts) : resource(kind) {
_parts.reserve(parts.size() + 1);
_parts.insert(_parts.end(), std::make_move_iterator(parts.begin()), std::make_move_iterator(parts.end()));
}
resource::resource(data_resource_t, stdx::string_view keyspace)
: resource(resource_kind::data, std::vector<sstring>{sstring(keyspace)}) {
}
resource::resource(data_resource_t, stdx::string_view keyspace, stdx::string_view table)
: resource(resource_kind::data, std::vector<sstring>{sstring(keyspace), sstring(table)}) {
}
resource::resource(role_resource_t, stdx::string_view role)
: resource(resource_kind::role, std::vector<sstring>{sstring(role)}) {
}
sstring resource::name() const {
return boost::algorithm::join(_parts, "/");
}
std::optional<resource> resource::parent() const {
if (_parts.size() == 1) {
return {};
}
resource copy = *this;
copy._parts.pop_back();
return copy;
}
permission_set resource::applicable_permissions() const {
permission_set ps;
switch (_kind) {
case resource_kind::data: ps = ::auth::applicable_permissions(data_resource_view(*this)); break;
case resource_kind::role: ps = ::auth::applicable_permissions(role_resource_view(*this)); break;
}
return ps;
}
bool operator<(const resource& r1, const resource& r2) {
if (r1._kind != r2._kind) {
return r1._kind < r2._kind;
}
return std::lexicographical_compare(
r1._parts.cbegin() + 1,
r1._parts.cend(),
r2._parts.cbegin() + 1,
r2._parts.cend());
}
std::ostream& operator<<(std::ostream& os, const resource& r) {
switch (r.kind()) {
case resource_kind::data: return os << data_resource_view(r);
case resource_kind::role: return os << role_resource_view(r);
}
return os;
}
data_resource_view::data_resource_view(const resource& r) : _resource(r) {
if (r._kind != resource_kind::data) {
throw resource_kind_mismatch(resource_kind::data, r._kind);
}
}
std::optional<stdx::string_view> data_resource_view::keyspace() const {
if (_resource._parts.size() == 1) {
return {};
}
return _resource._parts[1];
}
std::optional<stdx::string_view> data_resource_view::table() const {
if (_resource._parts.size() <= 2) {
return {};
}
return _resource._parts[2];
}
std::ostream& operator<<(std::ostream& os, const data_resource_view& v) {
const auto keyspace = v.keyspace();
const auto table = v.table();
if (!keyspace) {
os << "<all keyspaces>";
} else if (!table) {
os << "<keyspace " << *keyspace << '>';
} else {
os << "<table " << *keyspace << '.' << *table << '>';
}
return os;
}
role_resource_view::role_resource_view(const resource& r) : _resource(r) {
if (r._kind != resource_kind::role) {
throw resource_kind_mismatch(resource_kind::role, r._kind);
}
}
std::optional<stdx::string_view> role_resource_view::role() const {
if (_resource._parts.size() == 1) {
return {};
}
return _resource._parts[1];
}
std::ostream& operator<<(std::ostream& os, const role_resource_view& v) {
const auto role = v.role();
if (!role) {
os << "<all roles>";
} else {
os << "<role " << *role << '>';
}
return os;
}
resource parse_resource(stdx::string_view name) {
static const std::unordered_map<stdx::string_view, resource_kind> reverse_roots = [] {
std::unordered_map<stdx::string_view, resource_kind> result;
for (const auto& pair : roots) {
result.emplace(pair.second, pair.first);
}
return result;
}();
std::vector<sstring> parts;
boost::split(parts, name, [](char ch) { return ch == '/'; });
if (parts.empty()) {
throw invalid_resource_name(name);
}
const auto iter = reverse_roots.find(parts[0]);
if (iter == reverse_roots.end()) {
throw invalid_resource_name(name);
}
const auto kind = iter->second;
parts.erase(parts.begin());
if (parts.size() > max_parts.at(kind)) {
throw invalid_resource_name(name);
}
return resource(kind, std::move(parts));
}
static const resource the_root_data_resource{resource_kind::data};
const resource& root_data_resource() {
return the_root_data_resource;
}
static const resource the_root_role_resource{resource_kind::role};
const resource& root_role_resource() {
return the_root_role_resource;
}
resource_set expand_resource_family(const resource& rr) {
resource r = rr;
resource_set rs;
while (true) {
const auto pr = r.parent();
rs.insert(std::move(r));
if (!pr) {
break;
}
r = std::move(*pr);
}
return rs;
}
}
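
The `root/part_0/.../part_n` naming used by `resource::name()`, `resource::parent()` and `expand_resource_family()` above can be sketched in isolation. This is a minimal stand-alone illustration, not the Scylla implementation: `simple_resource`, `name_of`, `parent_of` and `family_names` are hypothetical names, and plain `std::string` stands in for `sstring`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for auth::resource: a root followed by zero or more parts.
struct simple_resource {
    std::vector<std::string> parts; // parts[0] is the root, e.g. "data" or "roles"
};

// Join the parts with '/', mirroring what resource::name() does via boost::algorithm::join.
std::string name_of(const simple_resource& r) {
    std::string out;
    for (std::size_t i = 0; i < r.parts.size(); ++i) {
        if (i != 0) {
            out += '/';
        }
        out += r.parts[i];
    }
    return out;
}

// A root resource (a single part) has no parent; otherwise the parent drops the last part.
std::optional<simple_resource> parent_of(const simple_resource& r) {
    if (r.parts.size() == 1) {
        return std::nullopt;
    }
    simple_resource copy = r;
    copy.parts.pop_back();
    return copy;
}

// Names of a resource and all of its ancestors, most specific first.
std::vector<std::string> family_names(simple_resource r) {
    std::vector<std::string> names;
    while (true) {
        names.push_back(name_of(r));
        auto p = parent_of(r);
        if (!p) {
            break;
        }
        r = std::move(*p);
    }
    return names;
}
```

Under these assumptions, a table resource built from the parts `data`, `ks`, `tbl` yields the family `data/ks/tbl`, `data/ks`, `data`. The real `expand_resource_family()` returns an unordered `resource_set`, so the ordering here belongs to the sketch only.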

254
auth/resource.hh Normal file

@@ -0,0 +1,254 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <experimental/string_view>
#include <iostream>
#include <optional>
#include <stdexcept>
#include <tuple>
#include <vector>
#include <unordered_set>
#include <seastar/core/print.hh>
#include <seastar/core/sstring.hh>
#include "auth/permission.hh"
#include "seastarx.hh"
#include "stdx.hh"
#include "utils/hash.hh"
namespace auth {
class invalid_resource_name : public std::invalid_argument {
public:
explicit invalid_resource_name(stdx::string_view name)
: std::invalid_argument(sprint("The resource name '%s' is invalid.", name)) {
}
};
enum class resource_kind {
data, role
};
std::ostream& operator<<(std::ostream&, resource_kind);
///
/// Type tag for constructing data resources.
///
struct data_resource_t final {};
///
/// Type tag for constructing role resources.
///
struct role_resource_t final {};
///
/// Resources are entities that users can be granted permissions on.
///
/// There are data (keyspaces and tables) and role resources. There may be other kinds of resources in the future.
///
/// When they are stored as system metadata, resources have the form `root/part_0/part_1/.../part_n`. Each kind of
/// resource has a specific root prefix, followed by a maximum of `n` parts (where `n` is distinct for each kind of
/// resource as well). In this code, this form is called the "name".
///
/// Since all resources have this same structure, all the different kinds are stored in instances of the same class:
/// \ref resource. When we wish to query a resource for kind-specific data (like the table of a "data" resource), we
/// create a kind-specific "view" of the resource.
///
class resource final {
resource_kind _kind;
std::vector<sstring> _parts;
public:
///
/// A root resource of a particular kind.
///
explicit resource(resource_kind);
resource(data_resource_t, stdx::string_view keyspace);
resource(data_resource_t, stdx::string_view keyspace, stdx::string_view table);
resource(role_resource_t, stdx::string_view role);
resource_kind kind() const noexcept {
return _kind;
}
///
/// A machine-friendly identifier unique to each resource.
///
sstring name() const;
std::optional<resource> parent() const;
permission_set applicable_permissions() const;
private:
resource(resource_kind, std::vector<sstring> parts);
friend class std::hash<resource>;
friend class data_resource_view;
friend class role_resource_view;
friend bool operator<(const resource&, const resource&);
friend bool operator==(const resource&, const resource&);
friend resource parse_resource(stdx::string_view);
};
bool operator<(const resource&, const resource&);
inline bool operator==(const resource& r1, const resource& r2) {
return (r1._kind == r2._kind) && (r1._parts == r2._parts);
}
inline bool operator!=(const resource& r1, const resource& r2) {
return !(r1 == r2);
}
std::ostream& operator<<(std::ostream&, const resource&);
class resource_kind_mismatch : public std::invalid_argument {
public:
explicit resource_kind_mismatch(resource_kind expected, resource_kind actual)
: std::invalid_argument(
sprint("This resource has kind '%s', but was expected to have kind '%s'.", actual, expected)) {
}
};
/// A "data" view of \ref resource.
///
/// If neither `keyspace` nor `table` is present, this is the root resource.
class data_resource_view final {
const resource& _resource;
public:
///
/// \throws \ref resource_kind_mismatch if the argument is not a "data" resource.
///
explicit data_resource_view(const resource& r);
std::optional<stdx::string_view> keyspace() const;
std::optional<stdx::string_view> table() const;
};
std::ostream& operator<<(std::ostream&, const data_resource_view&);
///
/// A "role" view of \ref resource.
///
/// If `role` is not present, this is the root resource.
///
class role_resource_view final {
const resource& _resource;
public:
///
/// \throws \ref resource_kind_mismatch if the argument is not a "role" resource.
///
explicit role_resource_view(const resource&);
std::optional<stdx::string_view> role() const;
};
std::ostream& operator<<(std::ostream&, const role_resource_view&);
///
/// Parse a resource from its name.
///
/// \throws \ref invalid_resource_name when the name is malformed.
///
resource parse_resource(stdx::string_view name);
const resource& root_data_resource();
inline resource make_data_resource(stdx::string_view keyspace) {
return resource(data_resource_t{}, keyspace);
}
inline resource make_data_resource(stdx::string_view keyspace, stdx::string_view table) {
return resource(data_resource_t{}, keyspace, table);
}
const resource& root_role_resource();
inline resource make_role_resource(stdx::string_view role) {
return resource(role_resource_t{}, role);
}
}
namespace std {
template <>
struct hash<auth::resource> {
static size_t hash_data(const auth::data_resource_view& dv) {
return utils::tuple_hash()(std::make_tuple(auth::resource_kind::data, dv.keyspace(), dv.table()));
}
static size_t hash_role(const auth::role_resource_view& rv) {
return utils::tuple_hash()(std::make_tuple(auth::resource_kind::role, rv.role()));
}
size_t operator()(const auth::resource& r) const {
std::size_t value;
switch (r._kind) {
case auth::resource_kind::data: value = hash_data(auth::data_resource_view(r)); break;
case auth::resource_kind::role: value = hash_role(auth::role_resource_view(r)); break;
}
return value;
}
};
}
namespace auth {
using resource_set = std::unordered_set<resource>;
///
/// A resource and all of its parents.
///
resource_set expand_resource_family(const resource&);
}
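
The header above combines two ideas: empty tag types (`data_resource_t`, `role_resource_t`) that pick a constructor overload, and kind-specific views that refuse to wrap a resource of the wrong kind. A reduced sketch of both follows, assuming hypothetical names (`res`, `data_tag`, `role_tag`, `role_view`) and `std::invalid_argument` in place of `resource_kind_mismatch`. Unlike the real class, it does not store the root part.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

enum class kind { data, role };

// Empty tag types select a constructor overload at compile time.
struct data_tag {};
struct role_tag {};

class res {
    kind _kind;
    std::vector<std::string> _parts; // kind-specific parts only (no root in this sketch)
public:
    res(data_tag, std::string_view keyspace, std::string_view table)
        : _kind(kind::data), _parts{std::string(keyspace), std::string(table)} {}
    res(role_tag, std::string_view role)
        : _kind(kind::role), _parts{std::string(role)} {}
    kind get_kind() const noexcept { return _kind; }
    const std::vector<std::string>& parts() const noexcept { return _parts; }
};

// A view checks the kind once, at construction, so its accessors can assume it.
class role_view {
    const res& _r;
public:
    explicit role_view(const res& r) : _r(r) {
        if (r.get_kind() != kind::role) {
            throw std::invalid_argument("not a role resource");
        }
    }
    std::string_view role() const { return _r.parts().front(); }
};
```

Like the views in the header, `role_view` holds a reference, so the viewed resource has to outlive it.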

169
auth/role_manager.hh Normal file

@@ -0,0 +1,169 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <experimental/string_view>
#include <memory>
#include <optional>
#include <stdexcept>
#include <unordered_set>
#include <seastar/core/future.hh>
#include <seastar/core/print.hh>
#include <seastar/core/sstring.hh>
#include "auth/resource.hh"
#include "seastarx.hh"
#include "stdx.hh"
namespace auth {
struct role_config final {
bool is_superuser{false};
bool can_login{false};
};
///
/// Differential update for altering existing roles.
///
struct role_config_update final {
std::optional<bool> is_superuser{};
std::optional<bool> can_login{};
};
///
/// A logical argument error for a role-management operation.
///
class roles_argument_exception : public std::invalid_argument {
public:
using std::invalid_argument::invalid_argument;
};
class role_already_exists : public roles_argument_exception {
public:
explicit role_already_exists(stdx::string_view role_name)
: roles_argument_exception(sprint("Role %s already exists.", role_name)) {
}
};
class nonexistant_role : public roles_argument_exception {
public:
explicit nonexistant_role(stdx::string_view role_name)
: roles_argument_exception(sprint("Role %s doesn't exist.", role_name)) {
}
};
class role_already_included : public roles_argument_exception {
public:
role_already_included(stdx::string_view grantee_name, stdx::string_view role_name)
: roles_argument_exception(
sprint("%s already includes role %s.", grantee_name, role_name)) {
}
};
class revoke_ungranted_role : public roles_argument_exception {
public:
revoke_ungranted_role(stdx::string_view revokee_name, stdx::string_view role_name)
: roles_argument_exception(
sprint("%s was not granted role %s, so it cannot be revoked.", revokee_name, role_name)) {
}
};
using role_set = std::unordered_set<sstring>;
enum class recursive_role_query { yes, no };
///
/// Abstract client for managing roles.
///
/// All state necessary for managing roles is stored externally to the client instance.
///
/// All implementations should throw role-related exceptions as documented. Authorization is not addressed here, and
/// access-control should never be enforced in implementations.
///
class role_manager {
public:
virtual ~role_manager() = default;
virtual stdx::string_view qualified_java_name() const noexcept = 0;
virtual const resource_set& protected_resources() const = 0;
virtual future<> start() = 0;
virtual future<> stop() = 0;
///
/// \returns an exceptional future with \ref role_already_exists for a role that has previously been created.
///
virtual future<> create(stdx::string_view role_name, const role_config&) const = 0;
///
/// \returns an exceptional future with \ref nonexistant_role if the role does not exist.
///
virtual future<> drop(stdx::string_view role_name) const = 0;
///
/// \returns an exceptional future with \ref nonexistant_role if the role does not exist.
///
virtual future<> alter(stdx::string_view role_name, const role_config_update&) const = 0;
///
/// Grant `role_name` to `grantee_name`.
///
/// \returns an exceptional future with \ref nonexistant_role if either the role or the grantee does not exist.
///
/// \returns an exceptional future with \ref role_already_included if granting the role would be redundant, or
/// create a cycle.
///
virtual future<> grant(stdx::string_view grantee_name, stdx::string_view role_name) const = 0;
///
/// Revoke `role_name` from `revokee_name`.
///
/// \returns an exceptional future with \ref nonexistant_role if either the role or the revokee does not exist.
///
/// \returns an exceptional future with \ref revoke_ungranted_role if the role was not granted.
///
virtual future<> revoke(stdx::string_view revokee_name, stdx::string_view role_name) const = 0;
///
/// \returns an exceptional future with \ref nonexistant_role if the role does not exist.
///
virtual future<role_set> query_granted(stdx::string_view grantee, recursive_role_query) const = 0;
virtual future<role_set> query_all() const = 0;
virtual future<bool> exists(stdx::string_view role_name) const = 0;
///
/// \returns an exceptional future with \ref nonexistant_role if the role does not exist.
///
virtual future<bool> is_superuser(stdx::string_view role_name) const = 0;
///
/// \returns an exceptional future with \ref nonexistant_role if the role does not exist.
///
virtual future<bool> can_login(stdx::string_view role_name) const = 0;
};
}
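
`role_config_update` is described above as a differential update: each field is a `std::optional<bool>`, so an unset field means "leave unchanged". One way an implementation of `alter()` might apply such an update is sketched below. `apply_update` is a hypothetical helper, not part of the header, and the two structs are repeated only to keep the sketch self-contained.

```cpp
#include <cassert>
#include <optional>

struct role_config {
    bool is_superuser{false};
    bool can_login{false};
};

struct role_config_update {
    std::optional<bool> is_superuser{};
    std::optional<bool> can_login{};
};

// Hypothetical helper: fields left unset in the update keep their current value.
role_config apply_update(role_config current, const role_config_update& u) {
    if (u.is_superuser) {
        current.is_superuser = *u.is_superuser;
    }
    if (u.can_login) {
        current.can_login = *u.can_login;
    }
    return current;
}
```

Using `std::optional<bool>` rather than `bool` lets a caller distinguish "set to false" from "not mentioned".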

41
auth/role_or_anonymous.cc Normal file

@@ -0,0 +1,41 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/role_or_anonymous.hh"
#include <iostream>
namespace auth {
std::ostream& operator<<(std::ostream& os, const role_or_anonymous& mr) {
os << mr.name.value_or("<anonymous>");
return os;
}
bool operator==(const role_or_anonymous& mr1, const role_or_anonymous& mr2) noexcept {
return mr1.name == mr2.name;
}
bool is_anonymous(const role_or_anonymous& mr) noexcept {
return !mr.name.has_value();
}
}

66
auth/role_or_anonymous.hh Normal file

@@ -0,0 +1,66 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <experimental/string_view>
#include <functional>
#include <iosfwd>
#include <optional>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
#include "stdx.hh"
namespace auth {
class role_or_anonymous final {
public:
std::optional<sstring> name{};
role_or_anonymous() = default;
role_or_anonymous(stdx::string_view name) : name(name) {
}
};
std::ostream& operator<<(std::ostream&, const role_or_anonymous&);
bool operator==(const role_or_anonymous&, const role_or_anonymous&) noexcept;
inline bool operator!=(const role_or_anonymous& mr1, const role_or_anonymous& mr2) noexcept {
return !(mr1 == mr2);
}
bool is_anonymous(const role_or_anonymous&) noexcept;
}
namespace std {
template <>
struct hash<auth::role_or_anonymous> {
size_t operator()(const auth::role_or_anonymous& mr) const {
return hash<std::optional<sstring>>()(mr.name);
}
};
}

122
auth/roles-metadata.cc Normal file

@@ -0,0 +1,122 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/roles-metadata.hh"
#include <boost/algorithm/cxx11/any_of.hpp>
#include <seastar/core/print.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/sstring.hh>
#include "auth/common.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
namespace auth {
namespace meta {
namespace roles_table {
stdx::string_view creation_query() {
static const sstring instance = sprint(
"CREATE TABLE %s ("
" %s text PRIMARY KEY,"
" can_login boolean,"
" is_superuser boolean,"
" member_of set<text>,"
" salted_hash text"
")",
qualified_name(),
role_col_name);
return instance;
}
stdx::string_view qualified_name() noexcept {
static const sstring instance = AUTH_KS + "." + sstring(name);
return instance;
}
}
}
future<bool> default_role_row_satisfies(
cql3::query_processor& qp,
std::function<bool(const cql3::untyped_result_set_row&)> p) {
static const sstring query = sprint(
"SELECT * FROM %s WHERE %s = ?",
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return do_with(std::move(p), [&qp](const auto& p) {
return qp.process(
query,
db::consistency_level::ONE,
infinite_timeout_config,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([&qp, &p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return qp.process(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config(),
{meta::DEFAULT_SUPERUSER_NAME},
true).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return make_ready_future<bool>(false);
}
return make_ready_future<bool>(p(results->one()));
});
}
return make_ready_future<bool>(p(results->one()));
});
});
}
future<bool> any_nondefault_role_row_satisfies(
cql3::query_processor& qp,
std::function<bool(const cql3::untyped_result_set_row&)> p) {
static const sstring query = sprint("SELECT * FROM %s", meta::roles_table::qualified_name());
return do_with(std::move(p), [&qp](const auto& p) {
return qp.process(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return false;
}
static const sstring col_name = sstring(meta::roles_table::role_col_name);
return boost::algorithm::any_of(*results, [&p](const cql3::untyped_result_set_row& row) {
const bool is_nondefault = row.get_as<sstring>(col_name) != meta::DEFAULT_SUPERUSER_NAME;
return is_nondefault && p(row);
});
});
});
}
}
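
`default_role_row_satisfies()` above first queries at consistency level `ONE` and, only if no row comes back, retries at `QUORUM` before concluding that the default role is absent. The same control flow is sketched below without Seastar futures or the query processor. `lookup_fn`, `row` and `consistency` are hypothetical stand-ins, and the lookup is synchronous purely for illustration.

```cpp
#include <cassert>
#include <functional>
#include <optional>

enum class consistency { one, quorum };

struct row {
    bool is_superuser;
};

// Hypothetical lookup: returns the default role's row at a given consistency level, if found.
using lookup_fn = std::function<std::optional<row>(consistency)>;

bool default_row_satisfies(const lookup_fn& lookup, const std::function<bool(const row&)>& pred) {
    // Cheap attempt first.
    if (auto r = lookup(consistency::one)) {
        return pred(*r);
    }
    // Nothing found at ONE: retry at QUORUM before treating the row as missing.
    if (auto r = lookup(consistency::quorum)) {
        return pred(*r);
    }
    return false;
}
```

The predicate is evaluated at most once, on whichever lookup first returns a row, and a row missing at both levels gives `false`.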

69
auth/roles-metadata.hh Normal file

@@ -0,0 +1,69 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <experimental/string_view>
#include <functional>
#include <seastar/core/future.hh>
#include "seastarx.hh"
#include "stdx.hh"
namespace cql3 {
class query_processor;
class untyped_result_set_row;
}
namespace auth {
namespace meta {
namespace roles_table {
stdx::string_view creation_query();
constexpr stdx::string_view name{"roles", 5};
stdx::string_view qualified_name() noexcept;
constexpr stdx::string_view role_col_name{"role", 4};
}
}
///
/// Check whether the default role satisfies a predicate. The result is `false` if the default role does not exist.
///
future<bool> default_role_row_satisfies(
cql3::query_processor&,
std::function<bool(const cql3::untyped_result_set_row&)>);
///
/// Check whether any nondefault role satisfies a predicate. The result is `false` if no nondefault roles exist.
///
future<bool> any_nondefault_role_row_satisfies(
cql3::query_processor&,
std::function<bool(const cql3::untyped_result_set_row&)>);
}


@@ -21,6 +21,7 @@
#include "auth/service.hh"
#include <algorithm>
#include <map>
#include <seastar/core/future-util.hh>
@@ -30,10 +31,13 @@
#include "auth/allow_all_authenticator.hh"
#include "auth/allow_all_authorizer.hh"
#include "auth/common.hh"
#include "auth/password_authenticator.hh"
#include "auth/role_or_anonymous.hh"
#include "auth/standard_role_manager.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
#include "db/config.hh"
#include "db/consistency_level.hh"
#include "db/consistency_level_type.hh"
#include "exceptions/exceptions.hh"
#include "log.hh"
#include "service/migration_listener.hh"
@@ -73,11 +77,18 @@ private:
void on_update_view(const sstring& ks_name, const sstring& view_name, bool columns_changed) override {}
void on_drop_keyspace(const sstring& ks_name) override {
_authorizer.revoke_all(auth::data_resource(ks_name));
_authorizer.revoke_all(
auth::make_data_resource(ks_name)).handle_exception_type([](const unsupported_authorization_operation&) {
// Nothing.
});
}
void on_drop_column_family(const sstring& ks_name, const sstring& cf_name) override {
_authorizer.revoke_all(auth::data_resource(ks_name, cf_name));
_authorizer.revoke_all(
auth::make_data_resource(
ks_name, cf_name)).handle_exception_type([](const unsupported_authorization_operation&) {
// Nothing.
});
}
void on_drop_user_type(const sstring& ks_name, const sstring& type_name) override {}
@@ -86,40 +97,23 @@ private:
void on_drop_view(const sstring& ks_name, const sstring& view_name) override {}
};
static sharded<permissions_cache> sharded_permissions_cache{};
static db::consistency_level consistency_for_user(const sstring& name) {
if (name == meta::DEFAULT_SUPERUSER_NAME) {
return db::consistency_level::QUORUM;
} else {
return db::consistency_level::LOCAL_ONE;
}
}
static future<::shared_ptr<cql3::untyped_result_set>> select_user(cql3::query_processor& qp, const sstring& name) {
// There used to be an explicit thread-local cache of prepared statements here. In normal execution that is
// fine, but since testing sets up and tears down the system over and over, we'd start using
// obsolete prepared statements pretty quickly.
// Rely on the query processor caching statements instead, and let's assume
// that a string->statement map lookup is not going to cost us much.
return qp.process(
sprint(
"SELECT * FROM %s.%s WHERE %s = ?",
meta::AUTH_KS,
meta::USERS_CF,
meta::user_name_col_name),
consistency_for_user(name),
{ name },
true);
static future<> validate_role_exists(const service& ser, stdx::string_view role_name) {
return ser.underlying_role_manager().exists(role_name).then([role_name](bool exists) {
if (!exists) {
throw nonexistant_role(role_name);
}
});
}
service_config service_config::from_db_config(const db::config& dc) {
const qualified_name qualified_authorizer_name(meta::AUTH_PACKAGE_NAME, dc.authorizer());
const qualified_name qualified_authenticator_name(meta::AUTH_PACKAGE_NAME, dc.authenticator());
const qualified_name qualified_role_manager_name(meta::AUTH_PACKAGE_NAME, dc.role_manager());
service_config c;
c.authorizer_java_name = qualified_authorizer_name;
c.authenticator_java_name = qualified_authenticator_name;
c.role_manager_java_name = qualified_role_manager_name;
return c;
}
@@ -128,40 +122,47 @@ service::service(
permissions_cache_config c,
cql3::query_processor& qp,
::service::migration_manager& mm,
std::unique_ptr<authorizer> a,
std::unique_ptr<authenticator> b)
: _cache_config(std::move(c))
std::unique_ptr<authorizer> z,
std::unique_ptr<authenticator> a,
std::unique_ptr<role_manager> r)
: _permissions_cache_config(std::move(c))
, _permissions_cache(nullptr)
, _qp(qp)
, _migration_manager(mm)
, _authorizer(std::move(a))
, _authenticator(std::move(b))
, _authorizer(std::move(z))
, _authenticator(std::move(a))
, _role_manager(std::move(r))
, _migration_listener(std::make_unique<auth_migration_listener>(*_authorizer)) {
// The password authenticator requires that the `standard_role_manager` is running so that the roles metadata table
// it manages is created and updated. This cross-module dependency is rather gross, but we have to maintain it for
// the sake of compatibility with Apache Cassandra and its choice of auth. schema.
if ((_authenticator->qualified_java_name() == password_authenticator_name())
&& (_role_manager->qualified_java_name() != standard_role_manager_name())) {
throw incompatible_module_combination(
sprint(
"The %s authenticator must be loaded alongside the %s role-manager.",
password_authenticator_name(),
standard_role_manager_name()));
}
}
service::service(
permissions_cache_config cache_config,
permissions_cache_config c,
cql3::query_processor& qp,
::service::migration_manager& mm,
const service_config& sc)
: service(
std::move(cache_config),
std::move(c),
qp,
mm,
create_object<authorizer>(sc.authorizer_java_name, qp, mm),
create_object<authenticator>(sc.authenticator_java_name, qp, mm)) {
create_object<authenticator>(sc.authenticator_java_name, qp, mm),
create_object<role_manager>(sc.role_manager_java_name, qp, mm)) {
}
bool service::should_create_metadata() const {
const bool null_authorizer = _authorizer->qualified_java_name() == allow_all_authorizer_name();
const bool null_authenticator = _authenticator->qualified_java_name() == allow_all_authenticator_name();
return !null_authorizer || !null_authenticator;
}
future<> service::create_metadata_if_missing() {
future<> service::create_keyspace_if_missing() const {
auto& db = _qp.db().local();
auto f = make_ready_future<>();
if (!db.has_keyspace(meta::AUTH_KS)) {
std::map<sstring, sstring> opts{{"replication_factor", "1"}};
@@ -173,91 +174,42 @@ future<> service::create_metadata_if_missing() {
// We use min_timestamp so that default keyspace metadata will lose to any manual adjustments.
// See issue #2129.
f = _migration_manager.announce_new_keyspace(ksm, api::min_timestamp, false);
return _migration_manager.announce_new_keyspace(ksm, api::min_timestamp, false);
}
return f.then([this] {
// 3 months.
static const auto gc_grace_seconds = 90 * 24 * 60 * 60;
static const sstring users_table_query = sprint(
"CREATE TABLE %s.%s (%s text, %s boolean, PRIMARY KEY (%s)) WITH gc_grace_seconds=%s",
meta::AUTH_KS,
meta::USERS_CF,
meta::user_name_col_name,
meta::superuser_col_name,
meta::user_name_col_name,
gc_grace_seconds);
return create_metadata_table_if_missing(
meta::USERS_CF,
_qp,
users_table_query,
_migration_manager);
}).then([this] {
delay_until_system_ready(_delayed, [this] {
return has_existing_users().then([this](bool existing) {
if (!existing) {
//
// Create default superuser.
//
static const sstring query = sprint(
"INSERT INTO %s.%s (%s, %s) VALUES (?, ?) USING TIMESTAMP 0",
meta::AUTH_KS,
meta::USERS_CF,
meta::user_name_col_name,
meta::superuser_col_name);
return _qp.process(
query,
db::consistency_level::ONE,
{ meta::DEFAULT_SUPERUSER_NAME, true }).then([](auto&&) {
log.info("Created default superuser '{}'", meta::DEFAULT_SUPERUSER_NAME);
}).handle_exception([](auto exn) {
try {
std::rethrow_exception(exn);
} catch (const exceptions::request_execution_exception&) {
log.warn("Skipped default superuser setup: some nodes were not ready");
}
}).discard_result();
}
return make_ready_future<>();
});
});
return make_ready_future<>();
});
return make_ready_future<>();
}
future<> service::start() {
return once_among_shards([this] {
if (should_create_metadata()) {
return create_metadata_if_missing();
}
return make_ready_future<>();
return create_keyspace_if_missing();
}).then([this] {
return when_all_succeed(_authorizer->start(), _authenticator->start());
return when_all_succeed(_role_manager->start(), _authorizer->start(), _authenticator->start());
}).then([this] {
_permissions_cache = std::make_unique<permissions_cache>(_permissions_cache_config, *this, log);
}).then([this] {
return once_among_shards([this] {
_migration_manager.register_listener(_migration_listener.get());
return sharded_permissions_cache.start(std::ref(_cache_config), std::ref(*this), std::ref(log));
return make_ready_future<>();
});
});
}
future<> service::stop() {
return once_among_shards([this] {
_delayed.cancel_all();
return sharded_permissions_cache.stop();
}).then([this] {
return when_all_succeed(_authorizer->stop(), _authenticator->stop());
// Only one of the shards has the listener registered, but let's try to
// unregister on each one just to make sure.
_migration_manager.unregister_listener(_migration_listener.get());
return _permissions_cache->stop().then([this] {
return when_all_succeed(_role_manager->stop(), _authorizer->stop(), _authenticator->stop());
});
}
future<bool> service::has_existing_users() const {
future<bool> service::has_existing_legacy_users() const {
if (!_qp.db().local().has_schema(meta::AUTH_KS, meta::USERS_CF)) {
return make_ready_future<bool>(false);
}
static const sstring default_user_query = sprint(
"SELECT * FROM %s.%s WHERE %s = ?",
meta::AUTH_KS,
@@ -275,7 +227,8 @@ future<bool> service::has_existing_users() const {
return _qp.process(
default_user_query,
db::consistency_level::ONE,
{ meta::DEFAULT_SUPERUSER_NAME },
infinite_timeout_config,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([this](auto results) {
if (!results->empty()) {
return make_ready_future<bool>(true);
@@ -284,7 +237,8 @@ future<bool> service::has_existing_users() const {
return _qp.process(
default_user_query,
db::consistency_level::QUORUM,
{ meta::DEFAULT_SUPERUSER_NAME },
infinite_timeout_config,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([this](auto results) {
if (!results->empty()) {
return make_ready_future<bool>(true);
@@ -292,62 +246,342 @@ future<bool> service::has_existing_users() const {
return _qp.process(
all_users_query,
db::consistency_level::QUORUM).then([](auto results) {
db::consistency_level::QUORUM,
infinite_timeout_config).then([](auto results) {
return make_ready_future<bool>(!results->empty());
});
});
});
}
future<bool> service::is_existing_user(const sstring& name) const {
return select_user(_qp, name).then([](auto results) {
return !results->empty();
future<permission_set>
service::get_uncached_permissions(const role_or_anonymous& maybe_role, const resource& r) const {
if (is_anonymous(maybe_role)) {
return _authorizer->authorize(maybe_role, r);
}
const stdx::string_view role_name = *maybe_role.name;
return has_superuser(role_name).then([this, role_name, &r](bool superuser) {
if (superuser) {
return make_ready_future<permission_set>(r.applicable_permissions());
}
//
// Aggregate the permissions from all granted roles.
//
return do_with(permission_set(), [this, role_name, &r](auto& all_perms) {
return get_roles(role_name).then([this, &r, &all_perms](role_set all_roles) {
return do_with(std::move(all_roles), [this, &r, &all_perms](const auto& all_roles) {
return parallel_for_each(all_roles, [this, &r, &all_perms](stdx::string_view role_name) {
return _authorizer->authorize(role_name, r).then([&all_perms](permission_set perms) {
all_perms = permission_set::from_mask(all_perms.mask() | perms.mask());
});
});
});
}).then([&all_perms] {
return all_perms;
});
});
});
}
future<bool> service::is_super_user(const sstring& name) const {
return select_user(_qp, name).then([](auto results) {
return !results->empty() && results->one().template get_as<bool>(meta::superuser_col_name);
future<permission_set> service::get_permissions(const role_or_anonymous& maybe_role, const resource& r) const {
return _permissions_cache->get(maybe_role, r);
}
future<bool> service::has_superuser(stdx::string_view role_name) const {
return this->get_roles(std::move(role_name)).then([this](role_set roles) {
return do_with(std::move(roles), [this](const role_set& roles) {
return do_with(false, roles.begin(), [this, &roles](bool& any_super, auto& iter) {
return do_until(
[&roles, &any_super, &iter] { return any_super || (iter == roles.end()); },
[this, &any_super, &iter] {
return _role_manager->is_superuser(*iter++).then([&any_super](bool super) {
any_super = super;
});
}).then([&any_super] {
return any_super;
});
});
});
});
}
future<> service::insert_user(const sstring& name, bool is_superuser) {
return _qp.process(
sprint(
"INSERT INTO %s.%s (%s, %s) VALUES (?, ?)",
meta::AUTH_KS,
meta::USERS_CF,
meta::user_name_col_name,
meta::superuser_col_name),
consistency_for_user(name),
{ name, is_superuser }).discard_result();
future<role_set> service::get_roles(stdx::string_view role_name) const {
//
// We may wish to cache this information in the future (as Apache Cassandra does).
//
return _role_manager->query_granted(role_name, recursive_role_query::yes);
}
future<> service::delete_user(const sstring& name) {
return _qp.process(
sprint(
"DELETE FROM %s.%s WHERE %s = ?",
meta::AUTH_KS,
meta::USERS_CF,
meta::user_name_col_name),
consistency_for_user(name),
{ name }).discard_result();
}
future<bool> service::exists(const resource& r) const {
switch (r.kind()) {
case resource_kind::data: {
const auto& db = _qp.db().local();
future<permission_set> service::get_permissions(::shared_ptr<authenticated_user> u, data_resource r) const {
return sharded_permissions_cache.local().get(std::move(u), std::move(r));
data_resource_view v(r);
const auto keyspace = v.keyspace();
const auto table = v.table();
if (table) {
return make_ready_future<bool>(db.has_schema(sstring(*keyspace), sstring(*table)));
}
if (keyspace) {
return make_ready_future<bool>(db.has_keyspace(sstring(*keyspace)));
}
return make_ready_future<bool>(true);
}
case resource_kind::role: {
role_resource_view v(r);
const auto role = v.role();
if (role) {
return _role_manager->exists(*role);
}
return make_ready_future<bool>(true);
}
}
return make_ready_future<bool>(false);
}
//
// Free functions.
//
future<bool> is_super_user(const service& ser, const authenticated_user& u) {
if (u.is_anonymous()) {
future<bool> has_superuser(const service& ser, const authenticated_user& u) {
if (is_anonymous(u)) {
return make_ready_future<bool>(false);
}
return ser.is_super_user(u.name());
return ser.has_superuser(*u.name);
}
future<role_set> get_roles(const service& ser, const authenticated_user& u) {
if (is_anonymous(u)) {
return make_ready_future<role_set>();
}
return ser.get_roles(*u.name);
}
future<permission_set> get_permissions(const service& ser, const authenticated_user& u, const resource& r) {
return do_with(role_or_anonymous(), [&ser, &u, &r](auto& maybe_role) {
maybe_role.name = u.name;
return ser.get_permissions(maybe_role, r);
});
}
bool is_enforcing(const service& ser) {
const bool enforcing_authorizer = ser.underlying_authorizer().qualified_java_name() != allow_all_authorizer_name();
const bool enforcing_authenticator = ser.underlying_authenticator().qualified_java_name()
!= allow_all_authenticator_name();
return enforcing_authorizer || enforcing_authenticator;
}
bool is_protected(const service& ser, const resource& r) noexcept {
return ser.underlying_role_manager().protected_resources().count(r)
|| ser.underlying_authenticator().protected_resources().count(r)
|| ser.underlying_authorizer().protected_resources().count(r);
}
static void validate_authentication_options_are_supported(
const authentication_options& options,
const authentication_option_set& supported) {
const auto check = [&supported](authentication_option k) {
if (supported.count(k) == 0) {
throw unsupported_authentication_option(k);
}
};
if (options.password) {
check(authentication_option::password);
}
if (options.options) {
check(authentication_option::options);
}
}
future<> create_role(
const service& ser,
stdx::string_view name,
const role_config& config,
const authentication_options& options) {
return ser.underlying_role_manager().create(name, config).then([&ser, name, &options] {
if (!auth::any_authentication_options(options)) {
return make_ready_future<>();
}
return futurize_apply(
&validate_authentication_options_are_supported,
options,
ser.underlying_authenticator().supported_options()).then([&ser, name, &options] {
return ser.underlying_authenticator().create(name, options);
}).handle_exception([&ser, name](std::exception_ptr ep) {
// Roll-back.
return ser.underlying_role_manager().drop(name).then([ep = std::move(ep)] {
std::rethrow_exception(ep);
});
});
});
}
future<> alter_role(
const service& ser,
stdx::string_view name,
const role_config_update& config_update,
const authentication_options& options) {
return ser.underlying_role_manager().alter(name, config_update).then([&ser, name, &options] {
if (!any_authentication_options(options)) {
return make_ready_future<>();
}
return futurize_apply(
&validate_authentication_options_are_supported,
options,
ser.underlying_authenticator().supported_options()).then([&ser, name, &options] {
return ser.underlying_authenticator().alter(name, options);
});
});
}
future<> drop_role(const service& ser, stdx::string_view name) {
return do_with(make_role_resource(name), [&ser, name](const resource& r) {
auto& a = ser.underlying_authorizer();
return when_all_succeed(
a.revoke_all(name),
a.revoke_all(r)).handle_exception_type([](const unsupported_authorization_operation&) {
// Nothing.
});
}).then([&ser, name] {
return ser.underlying_authenticator().drop(name);
}).then([&ser, name] {
return ser.underlying_role_manager().drop(name);
});
}
future<bool> has_role(const service& ser, stdx::string_view grantee, stdx::string_view name) {
return when_all_succeed(
validate_role_exists(ser, name),
ser.get_roles(grantee)).then([name](role_set all_roles) {
return make_ready_future<bool>(all_roles.count(sstring(name)) != 0);
});
}
future<bool> has_role(const service& ser, const authenticated_user& u, stdx::string_view name) {
if (is_anonymous(u)) {
return make_ready_future<bool>(false);
}
return has_role(ser, *u.name, name);
}
future<> grant_permissions(
const service& ser,
stdx::string_view role_name,
permission_set perms,
const resource& r) {
return validate_role_exists(ser, role_name).then([&ser, role_name, perms, &r] {
return ser.underlying_authorizer().grant(role_name, perms, r);
});
}
future<> grant_applicable_permissions(const service& ser, stdx::string_view role_name, const resource& r) {
return grant_permissions(ser, role_name, r.applicable_permissions(), r);
}
future<> grant_applicable_permissions(const service& ser, const authenticated_user& u, const resource& r) {
if (is_anonymous(u)) {
return make_ready_future<>();
}
return grant_applicable_permissions(ser, *u.name, r);
}
future<> revoke_permissions(
const service& ser,
stdx::string_view role_name,
permission_set perms,
const resource& r) {
return validate_role_exists(ser, role_name).then([&ser, role_name, perms, &r] {
return ser.underlying_authorizer().revoke(role_name, perms, r);
});
}
future<std::vector<permission_details>> list_filtered_permissions(
const service& ser,
permission_set perms,
std::optional<stdx::string_view> role_name,
const std::optional<std::pair<resource, recursive_permissions>>& resource_filter) {
return ser.underlying_authorizer().list_all().then([&ser, perms, role_name, &resource_filter](
std::vector<permission_details> all_details) {
if (resource_filter) {
const resource r = resource_filter->first;
const auto resources = resource_filter->second
? auth::expand_resource_family(r)
: auth::resource_set{r};
all_details.erase(
std::remove_if(
all_details.begin(),
all_details.end(),
[&resources](const permission_details& pd) {
return resources.count(pd.resource) == 0;
}),
all_details.end());
}
std::transform(
std::make_move_iterator(all_details.begin()),
std::make_move_iterator(all_details.end()),
all_details.begin(),
[perms](permission_details pd) {
pd.permissions = permission_set::from_mask(pd.permissions.mask() & perms.mask());
return pd;
});
// Eliminate rows with an empty permission set.
all_details.erase(
std::remove_if(all_details.begin(), all_details.end(), [](const permission_details& pd) {
return pd.permissions.mask() == 0;
}),
all_details.end());
if (!role_name) {
return make_ready_future<std::vector<permission_details>>(std::move(all_details));
}
//
// Filter out rows based on whether permissions have been granted to this role (directly or indirectly).
//
return do_with(std::move(all_details), [&ser, role_name](auto& all_details) {
return ser.get_roles(*role_name).then([&all_details](role_set all_roles) {
all_details.erase(
std::remove_if(
all_details.begin(),
all_details.end(),
[&all_roles](const permission_details& pd) {
return all_roles.count(pd.role_name) == 0;
}),
all_details.end());
return make_ready_future<std::vector<permission_details>>(std::move(all_details));
});
});
});
}
}
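
The permission handling in the file above reduces to two bitmask operations: `get_uncached_permissions` ORs together the masks granted to every role in the grantee's role set, and `list_filtered_permissions` ANDs each row with the requested mask and drops the rows left empty. What follows is a minimal standalone sketch of that arithmetic only, not Scylla's implementation. The names `mask_t`, `permission_details_sketch`, `aggregate_permissions`, and `filter_by_permissions` are hypothetical, plain strings stand in for `resource`, and the futures and authorizer calls are omitted.

```cpp
#include <algorithm>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the mask behind auth::permission_set.
using mask_t = std::uint32_t;

// Hypothetical stand-in for auth::permission_details.
struct permission_details_sketch {
    std::string role_name;
    std::string resource;
    mask_t permissions;
};

// Union of the masks granted on `resource` to any role in `roles`.
mask_t aggregate_permissions(
        const std::vector<permission_details_sketch>& grants,
        const std::set<std::string>& roles,
        const std::string& resource) {
    mask_t all = 0;
    for (const auto& g : grants) {
        if (g.resource == resource && roles.count(g.role_name) != 0) {
            all |= g.permissions;
        }
    }
    return all;
}

// Keep only the requested permission bits, then drop rows left with none.
std::vector<permission_details_sketch> filter_by_permissions(
        std::vector<permission_details_sketch> rows,
        mask_t wanted) {
    for (auto& r : rows) {
        r.permissions &= wanted;
    }
    rows.erase(
            std::remove_if(rows.begin(), rows.end(), [](const permission_details_sketch& r) {
                return r.permissions == 0;
            }),
            rows.end());
    return rows;
}
```

With grants of `0x1` to `alice` and `0x6` to `admins` on the same resource, aggregating over the role set containing both names yields `0x7`.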

@@ -21,18 +21,21 @@
#pragma once
#include <experimental/string_view>
#include <memory>
#include <optional>
#include <seastar/core/future.hh>
#include <seastar/core/sstring.hh>
#include <seastar/util/bool_class.hh>
#include "auth/authenticator.hh"
#include "auth/authorizer.hh"
#include "auth/authenticated_user.hh"
#include "auth/permission.hh"
#include "auth/permissions_cache.hh"
#include "delayed_tasks.hh"
#include "auth/role_manager.hh"
#include "seastarx.hh"
#include "stdx.hh"
namespace cql3 {
class query_processor;
@@ -49,18 +52,40 @@ class migration_listener;
namespace auth {
class authenticator;
class authorizer;
class role_or_anonymous;
struct service_config final {
static service_config from_db_config(const db::config&);
sstring authorizer_java_name;
sstring authenticator_java_name;
sstring role_manager_java_name;
};
///
/// Due to poor (in this author's opinion) decisions of Apache Cassandra, certain choices of one role-manager,
/// authenticator, or authorizer imply restrictions on the rest.
///
/// This exception is thrown when an invalid combination of modules is selected, with a message explaining the
/// incompatibility.
///
class incompatible_module_combination : public std::invalid_argument {
public:
using std::invalid_argument::invalid_argument;
};
///
/// Client for access-control in the system.
///
/// Access control encompasses user/role management, authentication, and authorization. This client provides access to
/// the dynamically-loaded implementations of these modules (through the `underlying_*` member functions), but also
/// builds on their functionality with caching and abstractions for common operations.
///
/// All state associated with access-control is stored externally to any particular instance of this class.
///
class service final {
permissions_cache_config _cache_config;
permissions_cache_config _permissions_cache_config;
std::unique_ptr<permissions_cache> _permissions_cache;
cql3::query_processor& _qp;
@@ -70,19 +95,25 @@ class service final {
std::unique_ptr<authenticator> _authenticator;
std::unique_ptr<role_manager> _role_manager;
// Only one of these should be registered, so we end up with some unused instances. Not the end of the world.
std::unique_ptr<::service::migration_listener> _migration_listener;
delayed_tasks<> _delayed{};
public:
service(
permissions_cache_config,
cql3::query_processor&,
::service::migration_manager&,
std::unique_ptr<authorizer>,
std::unique_ptr<authenticator>);
std::unique_ptr<authenticator>,
std::unique_ptr<role_manager>);
///
/// This constructor is intended to be used when the class is sharded via \ref seastar::sharded. In that case, the
/// arguments must be copyable, which is why we delay construction with instance-construction instructions instead
/// of the instances themselves.
///
service(
permissions_cache_config,
cql3::query_processor&,
@@ -93,40 +124,173 @@ public:
future<> stop();
future<bool> is_existing_user(const sstring& name) const;
///
/// \returns an exceptional future with \ref nonexistant_role if the named role does not exist.
///
future<permission_set> get_permissions(const role_or_anonymous&, const resource&) const;
future<bool> is_super_user(const sstring& name) const;
///
/// Like \ref get_permissions, but never returns cached permissions.
///
future<permission_set> get_uncached_permissions(const role_or_anonymous&, const resource&) const;
future<> insert_user(const sstring& name, bool is_superuser);
///
/// Query whether the named role has been granted a role that is a superuser.
///
/// A role is always granted to itself. Therefore, a role that "is" a superuser also "has" superuser.
///
/// \returns an exceptional future with \ref nonexistant_role if the role does not exist.
///
future<bool> has_superuser(stdx::string_view role_name) const;
future<> delete_user(const sstring& name);
///
/// Return the set of all roles granted to the given role, including itself and roles granted through other roles.
///
/// \returns an exceptional future with \ref nonexistant_role if the role does not exist.
///
future<role_set> get_roles(stdx::string_view role_name) const;
future<permission_set> get_permissions(::shared_ptr<authenticated_user>, data_resource) const;
authenticator& underlying_authenticator() {
return *_authenticator;
}
future<bool> exists(const resource&) const;
const authenticator& underlying_authenticator() const {
return *_authenticator;
}
authorizer& underlying_authorizer() {
return *_authorizer;
}
const authorizer& underlying_authorizer() const {
return *_authorizer;
}
const role_manager& underlying_role_manager() const {
return *_role_manager;
}
private:
future<bool> has_existing_users() const;
future<bool> has_existing_legacy_users() const;
bool should_create_metadata() const;
future<> create_metadata_if_missing();
future<> create_keyspace_if_missing() const;
};
future<bool> is_super_user(const service&, const authenticated_user&);
future<bool> has_superuser(const service&, const authenticated_user&);
future<role_set> get_roles(const service&, const authenticated_user&);
future<permission_set> get_permissions(const service&, const authenticated_user&, const resource&);
///
/// Access-control is "enforcing" when either the authenticator or the authorizer is not its "allow-all" variant.
///
/// Put differently, when access control is not enforcing, all operations on resources will be allowed and users do not
/// need to authenticate themselves.
///
bool is_enforcing(const service&);
///
/// Protected resources cannot be modified even if the performer has permissions to do so.
///
bool is_protected(const service&, const resource&) noexcept;
///
/// Create a role with optional authentication information.
///
/// \returns an exceptional future with \ref role_already_exists if the user or role exists.
///
/// \returns an exceptional future with \ref unsupported_authentication_option if an unsupported option is included.
///
future<> create_role(
const service&,
stdx::string_view name,
const role_config&,
const authentication_options&);
///
/// Alter an existing role and its authentication information.
///
/// \returns an exceptional future with \ref nonexistant_role if the named role does not exist.
///
/// \returns an exceptional future with \ref unsupported_authentication_option if an unsupported option is included.
///
future<> alter_role(
const service&,
stdx::string_view name,
const role_config_update&,
const authentication_options&);
///
/// Drop a role from the system, including all permissions and authentication information.
///
/// \returns an exceptional future with \ref nonexistant_role if the named role does not exist.
///
future<> drop_role(const service&, stdx::string_view name);
///
/// Check if `grantee` has been granted the named role.
///
/// \returns an exceptional future with \ref nonexistant_role if `grantee` or `name` does not exist.
///
future<bool> has_role(const service&, stdx::string_view grantee, stdx::string_view name);
///
/// Check if the authenticated user has been granted the named role.
///
/// \returns an exceptional future with \ref nonexistant_role if the user or `name` does not exist.
///
future<bool> has_role(const service&, const authenticated_user&, stdx::string_view name);
///
/// \returns an exceptional future with \ref nonexistant_role if the named role does not exist.
///
/// \returns an exceptional future with \ref unsupported_authorization_operation if granting permissions is not
/// supported.
///
future<> grant_permissions(
const service&,
stdx::string_view role_name,
permission_set,
const resource&);
///
/// Like \ref grant_permissions, but grants all applicable permissions on the resource.
///
/// \returns an exceptional future with \ref nonexistant_role if the named role does not exist.
///
/// \returns an exceptional future with \ref unsupported_authorization_operation if granting permissions is not
/// supported.
///
future<> grant_applicable_permissions(const service&, stdx::string_view role_name, const resource&);
future<> grant_applicable_permissions(const service&, const authenticated_user&, const resource&);
///
/// \returns an exceptional future with \ref nonexistant_role if the named role does not exist.
///
/// \returns an exceptional future with \ref unsupported_authorization_operation if revoking permissions is not
/// supported.
///
future<> revoke_permissions(
const service&,
stdx::string_view role_name,
permission_set,
const resource&);
using recursive_permissions = bool_class<struct recursive_permissions_tag>;
///
/// Query for all granted permissions according to filtering criteria.
///
/// Only permissions included in the provided set are included.
///
/// If a role name is provided, only permissions granted (directly or recursively) to the role are included.
///
/// If a resource filter is provided, only permissions granted on the resource are included. When \ref
/// recursive_permissions is `true`, permissions on a parent resource are included.
///
/// \returns an exceptional future with \ref nonexistant_role if a role name is included which refers to a role that
/// does not exist.
///
/// \returns an exceptional future with \ref unsupported_authorization_operation if listing permissions is not
/// supported.
///
future<std::vector<permission_details>> list_filtered_permissions(
const service&,
permission_set,
std::optional<stdx::string_view> role_name,
const std::optional<std::pair<resource, recursive_permissions>>& resource_filter);
}
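
The comments on `get_roles` and `has_superuser` in the header above describe a closure over role grants in which a role is always granted to itself. The block below is a minimal standalone sketch of that semantics, assuming an in-memory map in place of the role manager's tables. The names `role_graph`, `granted_roles`, and `has_superuser_sketch` are hypothetical, and unlike the real API the sketch does not report roles that do not exist.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory model: role name -> roles granted directly to it.
using role_graph = std::map<std::string, std::set<std::string>>;

// All roles granted to `role`, directly or through other roles, including `role` itself.
std::set<std::string> granted_roles(const role_graph& direct, const std::string& role) {
    std::set<std::string> seen{role};
    std::vector<std::string> pending{role};
    while (!pending.empty()) {
        const std::string current = pending.back();
        pending.pop_back();
        const auto it = direct.find(current);
        if (it == direct.end()) {
            continue;
        }
        for (const auto& granted : it->second) {
            // insert().second is true only for roles not visited yet, so cycles terminate.
            if (seen.insert(granted).second) {
                pending.push_back(granted);
            }
        }
    }
    return seen;
}

// True if `role` or any role granted to it (directly or indirectly) is a superuser.
bool has_superuser_sketch(
        const role_graph& direct,
        const std::set<std::string>& superusers,
        const std::string& role) {
    for (const auto& r : granted_roles(direct, role)) {
        if (superusers.count(r) != 0) {
            return true;
        }
    }
    return false;
}
```

Because the closure always contains the starting role, a role that is itself a superuser also "has" superuser, which matches the wording of the header comment.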

@@ -0,0 +1,555 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/standard_role_manager.hh"
#include <experimental/optional>
#include <unordered_set>
#include <vector>
#include <boost/algorithm/string/join.hpp>
#include <seastar/core/future-util.hh>
#include <seastar/core/print.hh>
#include <seastar/core/sleep.hh>
#include <seastar/core/sstring.hh>
#include <seastar/core/thread.hh>
#include "auth/common.hh"
#include "auth/roles-metadata.hh"
#include "cql3/query_processor.hh"
#include "db/consistency_level_type.hh"
#include "exceptions/exceptions.hh"
#include "log.hh"
#include "utils/class_registrator.hh"
namespace auth {
namespace meta {
namespace role_members_table {
constexpr stdx::string_view name{"role_members", 12};
static stdx::string_view qualified_name() noexcept {
static const sstring instance = AUTH_KS + "." + sstring(name);
return instance;
}
}
}
static logging::logger log("standard_role_manager");
static const class_registrator<
role_manager,
standard_role_manager,
cql3::query_processor&,
::service::migration_manager&> registration("org.apache.cassandra.auth.CassandraRoleManager");
struct record final {
sstring name;
bool is_superuser;
bool can_login;
role_set member_of;
};
static db::consistency_level consistency_for_role(stdx::string_view role_name) noexcept {
if (role_name == meta::DEFAULT_SUPERUSER_NAME) {
return db::consistency_level::QUORUM;
}
return db::consistency_level::LOCAL_ONE;
}
static future<stdx::optional<record>> find_record(cql3::query_processor& qp, stdx::string_view role_name) {
static const sstring query = sprint(
"SELECT * FROM %s WHERE %s = ?",
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return qp.process(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
{sstring(role_name)},
true).then([](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return stdx::optional<record>();
}
const cql3::untyped_result_set_row& row = results->one();
return stdx::make_optional(
record{
row.get_as<sstring>(sstring(meta::roles_table::role_col_name)),
row.get_as<bool>("is_superuser"),
row.get_as<bool>("can_login"),
(row.has("member_of")
? row.get_set<sstring>("member_of")
: role_set())});
});
}
static future<record> require_record(cql3::query_processor& qp, stdx::string_view role_name) {
return find_record(qp, role_name).then([role_name](stdx::optional<record> mr) {
if (!mr) {
throw nonexistant_role(role_name);
}
return make_ready_future<record>(*mr);
});
}
static bool has_can_login(const cql3::untyped_result_set_row& row) {
return row.has("can_login") && !(boolean_type->deserialize(row.get_blob("can_login")).is_null());
}
stdx::string_view standard_role_manager_name() noexcept {
static const sstring instance = meta::AUTH_PACKAGE_NAME + "CassandraRoleManager";
return instance;
}
stdx::string_view standard_role_manager::qualified_java_name() const noexcept {
return standard_role_manager_name();
}
const resource_set& standard_role_manager::protected_resources() const {
static const resource_set resources({
make_data_resource(meta::AUTH_KS, meta::roles_table::name),
make_data_resource(meta::AUTH_KS, meta::role_members_table::name)});
return resources;
}
future<> standard_role_manager::create_metadata_tables_if_missing() const {
static const sstring create_role_members_query = sprint(
"CREATE TABLE %s ("
" role text,"
" member text,"
" PRIMARY KEY (role, member)"
")",
meta::role_members_table::qualified_name());
return when_all_succeed(
create_metadata_table_if_missing(
meta::roles_table::name,
_qp,
meta::roles_table::creation_query(),
_migration_manager),
create_metadata_table_if_missing(
meta::role_members_table::name,
_qp,
create_role_members_query,
_migration_manager));
}
future<> standard_role_manager::create_default_role_if_missing() const {
return default_role_row_satisfies(_qp, &has_can_login).then([this](bool exists) {
if (!exists) {
static const sstring query = sprint(
"INSERT INTO %s (%s, is_superuser, can_login) VALUES (?, true, true)",
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return _qp.process(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config(),
{meta::DEFAULT_SUPERUSER_NAME}).then([](auto&&) {
log.info("Created default superuser role '{}'.", meta::DEFAULT_SUPERUSER_NAME);
return make_ready_future<>();
});
}
return make_ready_future<>();
}).handle_exception_type([](const exceptions::unavailable_exception& e) {
log.warn("Skipped default role setup: some nodes were not ready; will retry");
return make_exception_future<>(e);
});
}
static const sstring legacy_table_name{"users"};
bool standard_role_manager::legacy_metadata_exists() const {
return _qp.db().local().has_schema(meta::AUTH_KS, legacy_table_name);
}
future<> standard_role_manager::migrate_legacy_metadata() const {
log.info("Starting migration of legacy user metadata.");
static const sstring query = sprint("SELECT * FROM %s.%s", meta::AUTH_KS, legacy_table_name);
return _qp.process(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
role_config config;
config.is_superuser = row.get_as<bool>("super");
config.can_login = true;
return do_with(
row.get_as<sstring>("name"),
std::move(config),
[this](const auto& name, const auto& config) {
return this->create_or_replace(name, config);
});
}).finally([results] {});
}).then([] {
log.info("Finished migrating legacy user metadata.");
}).handle_exception([](std::exception_ptr ep) {
log.error("Encountered an error during migration!");
std::rethrow_exception(ep);
});
}
future<> standard_role_manager::start() {
return once_among_shards([this] {
return this->create_metadata_tables_if_missing().then([this] {
_stopped = auth::do_after_system_ready(_as, [this] {
return seastar::async([this] {
wait_for_schema_agreement(_migration_manager, _qp.db().local()).get0();
if (any_nondefault_role_row_satisfies(_qp, &has_can_login).get0()) {
if (this->legacy_metadata_exists()) {
log.warn("Ignoring legacy user metadata since nondefault roles already exist.");
}
return;
}
if (this->legacy_metadata_exists()) {
this->migrate_legacy_metadata().get0();
return;
}
create_default_role_if_missing().get0();
});
});
});
});
}
future<> standard_role_manager::stop() {
_as.request_abort();
return _stopped.handle_exception_type([] (const sleep_aborted&) { });
}
future<> standard_role_manager::create_or_replace(stdx::string_view role_name, const role_config& c) const {
static const sstring query = sprint(
"INSERT INTO %s (%s, is_superuser, can_login) VALUES (?, ?, ?)",
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return _qp.process(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
{sstring(role_name), c.is_superuser, c.can_login},
true).discard_result();
}
future<>
standard_role_manager::create(stdx::string_view role_name, const role_config& c) const {
return this->exists(role_name).then([this, role_name, &c](bool role_exists) {
if (role_exists) {
throw role_already_exists(role_name);
}
return this->create_or_replace(role_name, c);
});
}
future<>
standard_role_manager::alter(stdx::string_view role_name, const role_config_update& u) const {
static const auto build_column_assignments = [](const role_config_update& u) -> sstring {
std::vector<sstring> assignments;
if (u.is_superuser) {
assignments.push_back(sstring("is_superuser = ") + (*u.is_superuser ? "true" : "false"));
}
if (u.can_login) {
assignments.push_back(sstring("can_login = ") + (*u.can_login ? "true" : "false"));
}
return boost::algorithm::join(assignments, ", ");
};
return require_record(_qp, role_name).then([this, role_name, &u](record) {
if (!u.is_superuser && !u.can_login) {
return make_ready_future<>();
}
return _qp.process(
sprint(
"UPDATE %s SET %s WHERE %s = ?",
meta::roles_table::qualified_name(),
build_column_assignments(u),
meta::roles_table::role_col_name),
consistency_for_role(role_name),
internal_distributed_timeout_config(),
{sstring(role_name)}).discard_result();
});
}
future<> standard_role_manager::drop(stdx::string_view role_name) const {
return this->exists(role_name).then([this, role_name](bool role_exists) {
if (!role_exists) {
throw nonexistant_role(role_name);
}
// First, revoke this role from all roles that are members of it.
const auto revoke_from_members = [this, role_name] {
static const sstring query = sprint(
"SELECT member FROM %s WHERE role = ?",
meta::role_members_table::qualified_name());
return _qp.process(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
{sstring(role_name)}).then([this, role_name](::shared_ptr<cql3::untyped_result_set> members) {
return parallel_for_each(
members->begin(),
members->end(),
[this, role_name](const cql3::untyped_result_set_row& member_row) {
const sstring member = member_row.template get_as<sstring>("member");
return this->modify_membership(member, role_name, membership_change::remove);
}).finally([members] {});
});
};
// In parallel, revoke all roles that this role is a member of.
const auto revoke_members_of = [this, grantee = role_name] {
return this->query_granted(
grantee,
recursive_role_query::no).then([this, grantee](role_set granted_roles) {
return do_with(
std::move(granted_roles),
[this, grantee](const role_set& granted_roles) {
return parallel_for_each(
granted_roles.begin(),
granted_roles.end(),
[this, grantee](const sstring& role_name) {
return this->modify_membership(grantee, role_name, membership_change::remove);
});
});
});
};
// Finally, delete the role itself.
auto delete_role = [this, role_name] {
static const sstring query = sprint(
"DELETE FROM %s WHERE %s = ?",
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return _qp.process(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
{sstring(role_name)}).discard_result();
};
return when_all_succeed(revoke_from_members(), revoke_members_of()).then([delete_role = std::move(delete_role)] {
return delete_role();
});
});
}
future<>
standard_role_manager::modify_membership(
stdx::string_view grantee_name,
stdx::string_view role_name,
membership_change ch) const {
const auto modify_roles = [this, role_name, grantee_name, ch] {
const auto query = sprint(
"UPDATE %s SET member_of = member_of %s ? WHERE %s = ?",
meta::roles_table::qualified_name(),
(ch == membership_change::add ? '+' : '-'),
meta::roles_table::role_col_name);
return _qp.process(
query,
consistency_for_role(grantee_name),
internal_distributed_timeout_config(),
{role_set{sstring(role_name)}, sstring(grantee_name)}).discard_result();
};
const auto modify_role_members = [this, role_name, grantee_name, ch] {
switch (ch) {
case membership_change::add:
return _qp.process(
sprint(
"INSERT INTO %s (role, member) VALUES (?, ?)",
meta::role_members_table::qualified_name()),
consistency_for_role(role_name),
internal_distributed_timeout_config(),
{sstring(role_name), sstring(grantee_name)}).discard_result();
case membership_change::remove:
return _qp.process(
sprint(
"DELETE FROM %s WHERE role = ? AND member = ?",
meta::role_members_table::qualified_name()),
consistency_for_role(role_name),
internal_distributed_timeout_config(),
{sstring(role_name), sstring(grantee_name)}).discard_result();
}
return make_ready_future<>();
};
return when_all_succeed(modify_roles(), modify_role_members());
}
future<>
standard_role_manager::grant(stdx::string_view grantee_name, stdx::string_view role_name) const {
const auto check_redundant = [this, role_name, grantee_name] {
return this->query_granted(
grantee_name,
recursive_role_query::yes).then([role_name, grantee_name](role_set roles) {
if (roles.count(sstring(role_name)) != 0) {
throw role_already_included(grantee_name, role_name);
}
return make_ready_future<>();
});
};
const auto check_cycle = [this, role_name, grantee_name] {
return this->query_granted(
role_name,
recursive_role_query::yes).then([role_name, grantee_name](role_set roles) {
if (roles.count(sstring(grantee_name)) != 0) {
throw role_already_included(role_name, grantee_name);
}
return make_ready_future<>();
});
};
return when_all_succeed(check_redundant(), check_cycle()).then([this, role_name, grantee_name] {
return this->modify_membership(grantee_name, role_name, membership_change::add);
});
}
future<>
standard_role_manager::revoke(stdx::string_view revokee_name, stdx::string_view role_name) const {
return this->exists(role_name).then([this, revokee_name, role_name](bool role_exists) {
if (!role_exists) {
throw nonexistant_role(sstring(role_name));
}
}).then([this, revokee_name, role_name] {
return this->query_granted(
revokee_name,
recursive_role_query::no).then([revokee_name, role_name](role_set roles) {
if (roles.count(sstring(role_name)) == 0) {
throw revoke_ungranted_role(revokee_name, role_name);
}
return make_ready_future<>();
}).then([this, revokee_name, role_name] {
return this->modify_membership(revokee_name, role_name, membership_change::remove);
});
});
}
static future<> collect_roles(
cql3::query_processor& qp,
stdx::string_view grantee_name,
bool recurse,
role_set& roles) {
return require_record(qp, grantee_name).then([&qp, &roles, recurse](record r) {
return do_with(std::move(r.member_of), [&qp, &roles, recurse](const role_set& memberships) {
return do_for_each(memberships.begin(), memberships.end(), [&qp, &roles, recurse](const sstring& role_name) {
roles.insert(role_name);
if (recurse) {
return collect_roles(qp, role_name, true, roles);
}
return make_ready_future<>();
});
});
});
}
future<role_set> standard_role_manager::query_granted(stdx::string_view grantee_name, recursive_role_query m) const {
const bool recurse = (m == recursive_role_query::yes);
return do_with(
role_set{sstring(grantee_name)},
[this, grantee_name, recurse](role_set& roles) {
return collect_roles(_qp, grantee_name, recurse, roles).then([&roles] { return roles; });
});
}
future<role_set> standard_role_manager::query_all() const {
static const sstring query = sprint(
"SELECT %s FROM %s",
meta::roles_table::role_col_name,
meta::roles_table::qualified_name());
// To avoid many copies of a view.
static const auto role_col_name_string = sstring(meta::roles_table::role_col_name);
return _qp.process(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([](::shared_ptr<cql3::untyped_result_set> results) {
role_set roles;
std::transform(
results->begin(),
results->end(),
std::inserter(roles, roles.begin()),
[](const cql3::untyped_result_set_row& row) {
return row.get_as<sstring>(role_col_name_string);
});
return roles;
});
}
future<bool> standard_role_manager::exists(stdx::string_view role_name) const {
return find_record(_qp, role_name).then([](stdx::optional<record> mr) {
return static_cast<bool>(mr);
});
}
future<bool> standard_role_manager::is_superuser(stdx::string_view role_name) const {
return require_record(_qp, role_name).then([](record r) {
return r.is_superuser;
});
}
future<bool> standard_role_manager::can_login(stdx::string_view role_name) const {
return require_record(_qp, role_name).then([](record r) {
return r.can_login;
});
}
}
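
The two checks that grant() runs before modify_membership() can be sketched synchronously over an in-memory map. This is only an illustration, not the Scylla implementation: membership_map, the free functions, and the use of std::invalid_argument are assumptions made for the sketch, and unknown roles are silently treated as having no memberships, whereas require_record() in the code above fails for them. The sketch also stops recursing into roles it has already seen, which the collect_roles() above does not do.

```cpp
#include <map>
#include <set>
#include <stdexcept>
#include <string>

using role_set = std::set<std::string>;
// Hypothetical stand-in for the roles table: role -> roles granted directly to it.
using membership_map = std::map<std::string, role_set>;

// Adds every role granted to `grantee` to `out`, following grants transitively
// when `recurse` is true. Roles already present in `out` are not revisited.
void collect_roles(const membership_map& member_of, const std::string& grantee,
                   bool recurse, role_set& out) {
    const auto it = member_of.find(grantee);
    if (it == member_of.end()) {
        return;
    }
    for (const auto& role : it->second) {
        const bool is_new = out.insert(role).second;
        if (recurse && is_new) {
            collect_roles(member_of, role, true, out);
        }
    }
}

// As in query_granted() above, the result always contains the grantee itself.
role_set query_granted(const membership_map& member_of, const std::string& grantee,
                       bool recurse) {
    role_set roles{grantee};
    collect_roles(member_of, grantee, recurse, roles);
    return roles;
}

void grant(membership_map& member_of, const std::string& grantee, const std::string& role) {
    // Redundancy check: the grantee already holds the role, directly or transitively.
    if (query_granted(member_of, grantee, true).count(role) != 0) {
        throw std::invalid_argument(grantee + " already includes " + role);
    }
    // Cycle check: the role already holds the grantee, so granting would close a loop.
    if (query_granted(member_of, role, true).count(grantee) != 0) {
        throw std::invalid_argument(role + " already includes " + grantee);
    }
    member_of[grantee].insert(role);
}
```

Because the result set is seeded with the grantee, granting a role to itself is rejected by the first check without any special case.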


@@ -0,0 +1,105 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "auth/role_manager.hh"
#include <experimental/string_view>
#include <unordered_set>
#include <seastar/core/abort_source.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sstring.hh>
#include "stdx.hh"
#include "seastarx.hh"
namespace cql3 {
class query_processor;
}
namespace service {
class migration_manager;
}
namespace auth {
stdx::string_view standard_role_manager_name() noexcept;
class standard_role_manager final : public role_manager {
cql3::query_processor& _qp;
::service::migration_manager& _migration_manager;
future<> _stopped;
seastar::abort_source _as;
public:
standard_role_manager(cql3::query_processor& qp, ::service::migration_manager& mm)
: _qp(qp)
, _migration_manager(mm)
, _stopped(make_ready_future<>()) {
}
virtual stdx::string_view qualified_java_name() const noexcept override;
virtual const resource_set& protected_resources() const override;
virtual future<> start() override;
virtual future<> stop() override;
virtual future<> create(stdx::string_view role_name, const role_config&) const override;
virtual future<> drop(stdx::string_view role_name) const override;
virtual future<> alter(stdx::string_view role_name, const role_config_update&) const override;
virtual future<> grant(stdx::string_view grantee_name, stdx::string_view role_name) const override;
virtual future<> revoke(stdx::string_view revokee_name, stdx::string_view role_name) const override;
virtual future<role_set> query_granted(stdx::string_view grantee_name, recursive_role_query) const override;
virtual future<role_set> query_all() const override;
virtual future<bool> exists(stdx::string_view role_name) const override;
virtual future<bool> is_superuser(stdx::string_view role_name) const override;
virtual future<bool> can_login(stdx::string_view role_name) const override;
private:
enum class membership_change { add, remove };
future<> create_metadata_tables_if_missing() const;
bool legacy_metadata_exists() const;
future<> migrate_legacy_metadata() const;
future<> create_default_role_if_missing() const;
future<> create_or_replace(stdx::string_view role_name, const role_config&) const;
future<> modify_membership(stdx::string_view grantee_name, stdx::string_view role_name, membership_change) const;
};
}


@@ -39,20 +39,17 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "authenticator.hh"
#include "authenticated_user.hh"
#include "authenticator.hh"
#include "authorizer.hh"
#include "password_authenticator.hh"
#include "default_authorizer.hh"
#include "permission.hh"
#include "auth/authenticated_user.hh"
#include "auth/authenticator.hh"
#include "auth/authorizer.hh"
#include "auth/default_authorizer.hh"
#include "auth/password_authenticator.hh"
#include "auth/permission.hh"
#include "db/config.hh"
#include "utils/class_registrator.hh"
namespace auth {
class service;
static const sstring PACKAGE_NAME("com.scylladb.auth.");
static const sstring& transitional_authenticator_name() {
@@ -67,38 +64,47 @@ static const sstring& transitional_authorizer_name() {
class transitional_authenticator : public authenticator {
std::unique_ptr<authenticator> _authenticator;
public:
static const sstring PASSWORD_AUTHENTICATOR_NAME;
transitional_authenticator(cql3::query_processor& qp, ::service::migration_manager& mm)
: transitional_authenticator(std::make_unique<password_authenticator>(qp, mm))
{}
: transitional_authenticator(std::make_unique<password_authenticator>(qp, mm)) {
}
transitional_authenticator(std::unique_ptr<authenticator> a)
: _authenticator(std::move(a))
{}
future<> start() override {
: _authenticator(std::move(a)) {
}
virtual future<> start() override {
return _authenticator->start();
}
future<> stop() override {
virtual future<> stop() override {
return _authenticator->stop();
}
const sstring& qualified_java_name() const override {
virtual const sstring& qualified_java_name() const override {
return transitional_authenticator_name();
}
bool require_authentication() const override {
virtual bool require_authentication() const override {
return true;
}
option_set supported_options() const override {
virtual authentication_option_set supported_options() const override {
return _authenticator->supported_options();
}
option_set alterable_options() const override {
virtual authentication_option_set alterable_options() const override {
return _authenticator->alterable_options();
}
future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const override {
virtual future<authenticated_user> authenticate(const credentials_map& credentials) const override {
auto i = credentials.find(authenticator::USERNAME_KEY);
if ((i == credentials.end() || i->second.empty()) && (!credentials.count(PASSWORD_KEY) || credentials.at(PASSWORD_KEY).empty())) {
if ((i == credentials.end() || i->second.empty())
&& (!credentials.count(PASSWORD_KEY) || credentials.at(PASSWORD_KEY).empty())) {
// return anon user
return make_ready_future<::shared_ptr<authenticated_user>>(::make_shared<authenticated_user>());
return make_ready_future<authenticated_user>(anonymous_user());
}
return make_ready_future().then([this, &credentials] {
return _authenticator->authenticate(credentials);
@@ -107,29 +113,39 @@ public:
std::rethrow_exception(ep);
} catch (exceptions::authentication_exception&) {
// return anon user
return make_ready_future<::shared_ptr<authenticated_user>>(::make_shared<authenticated_user>());
return make_ready_future<authenticated_user>(anonymous_user());
}
});
}
future<> create(sstring username, const option_map& options) override {
return _authenticator->create(username, options);
virtual future<> create(stdx::string_view role_name, const authentication_options& options) const override {
return _authenticator->create(role_name, options);
}
future<> alter(sstring username, const option_map& options) override {
return _authenticator->alter(username, options);
virtual future<> alter(stdx::string_view role_name, const authentication_options& options) const override {
return _authenticator->alter(role_name, options);
}
future<> drop(sstring username) override {
return _authenticator->drop(username);
virtual future<> drop(stdx::string_view role_name) const override {
return _authenticator->drop(role_name);
}
const resource_ids& protected_resources() const override {
virtual future<custom_options> query_custom_options(stdx::string_view role_name) const override {
return _authenticator->query_custom_options(role_name);
}
virtual const resource_set& protected_resources() const override {
return _authenticator->protected_resources();
}
::shared_ptr<sasl_challenge> new_sasl_challenge() const override {
virtual ::shared_ptr<sasl_challenge> new_sasl_challenge() const override {
class sasl_wrapper : public sasl_challenge {
public:
sasl_wrapper(::shared_ptr<sasl_challenge> sasl)
: _sasl(std::move(sasl))
{}
bytes evaluate_response(bytes_view client_response) override {
: _sasl(std::move(sasl)) {
}
virtual bytes evaluate_response(bytes_view client_response) override {
try {
return _sasl->evaluate_response(client_response);
} catch (exceptions::authentication_exception&) {
@@ -137,14 +153,27 @@ public:
return {};
}
}
bool is_complete() const {
virtual bool is_complete() const override {
return _complete || _sasl->is_complete();
}
future<::shared_ptr<authenticated_user>> get_authenticated_user() const {
return _sasl->get_authenticated_user();
virtual future<authenticated_user> get_authenticated_user() const {
return futurize_apply([this] {
return _sasl->get_authenticated_user().handle_exception([](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::authentication_exception&) {
// return anon user
return make_ready_future<authenticated_user>(anonymous_user());
}
});
});
}
private:
::shared_ptr<sasl_challenge> _sasl;
bool _complete = false;
};
return ::make_shared<sasl_wrapper>(_authenticator->new_sasl_challenge());
@@ -153,55 +182,65 @@ public:
class transitional_authorizer : public authorizer {
std::unique_ptr<authorizer> _authorizer;
public:
transitional_authorizer(cql3::query_processor& qp, ::service::migration_manager& mm)
: transitional_authorizer(std::make_unique<default_authorizer>(qp, mm))
{}
: transitional_authorizer(std::make_unique<default_authorizer>(qp, mm)) {
}
transitional_authorizer(std::unique_ptr<authorizer> a)
: _authorizer(std::move(a))
{}
~transitional_authorizer()
{}
future<> start() override {
: _authorizer(std::move(a)) {
}
~transitional_authorizer() {
}
virtual future<> start() override {
return _authorizer->start();
}
future<> stop() override {
virtual future<> stop() override {
return _authorizer->stop();
}
const sstring& qualified_java_name() const override {
virtual const sstring& qualified_java_name() const override {
return transitional_authorizer_name();
}
future<permission_set> authorize(service& ser, ::shared_ptr<authenticated_user> user, data_resource resource) const override {
return is_super_user(ser, *user).then([](bool s) {
static const permission_set transitional_permissions =
permission_set::of<permission::CREATE,
permission::ALTER, permission::DROP,
permission::SELECT, permission::MODIFY>();
return make_ready_future<permission_set>(s ? permissions::ALL : transitional_permissions);
});
virtual future<permission_set> authorize(const role_or_anonymous&, const resource&) const override {
static const permission_set transitional_permissions =
permission_set::of<
permission::CREATE,
permission::ALTER,
permission::DROP,
permission::SELECT,
permission::MODIFY>();
return make_ready_future<permission_set>(transitional_permissions);
}
future<> grant(::shared_ptr<authenticated_user> user, permission_set ps, data_resource r, sstring s) override {
return _authorizer->grant(std::move(user), std::move(ps), std::move(r), std::move(s));
virtual future<> grant(stdx::string_view s, permission_set ps, const resource& r) const override {
return _authorizer->grant(s, std::move(ps), r);
}
future<> revoke(::shared_ptr<authenticated_user> user, permission_set ps, data_resource r, sstring s) override {
return _authorizer->revoke(std::move(user), std::move(ps), std::move(r), std::move(s));
virtual future<> revoke(stdx::string_view s, permission_set ps, const resource& r) const override {
return _authorizer->revoke(s, std::move(ps), r);
}
future<std::vector<permission_details>> list(service& ser, ::shared_ptr<authenticated_user> user, permission_set ps, optional<data_resource> r, optional<sstring> s) const override {
return _authorizer->list(ser, std::move(user), std::move(ps), std::move(r), std::move(s));
virtual future<std::vector<permission_details>> list_all() const override {
return _authorizer->list_all();
}
future<> revoke_all(sstring s) override {
return _authorizer->revoke_all(std::move(s));
virtual future<> revoke_all(stdx::string_view s) const override {
return _authorizer->revoke_all(s);
}
future<> revoke_all(data_resource r) override {
return _authorizer->revoke_all(std::move(r));
virtual future<> revoke_all(const resource& r) const override {
return _authorizer->revoke_all(r);
}
const resource_ids& protected_resources() override {
virtual const resource_set& protected_resources() const override {
return _authorizer->protected_resources();
}
future<> validate_configuration() const override {
return _authorizer->validate_configuration();
}
};
}
@@ -214,10 +253,10 @@ static const class_registrator<
auth::authenticator,
auth::transitional_authenticator,
cql3::query_processor&,
::service::migration_manager&> transitional_authenticator_reg("com.scylladb.auth.TransitionalAuthenticator");
::service::migration_manager&> transitional_authenticator_reg(auth::PACKAGE_NAME + "TransitionalAuthenticator");
static const class_registrator<
auth::authorizer,
auth::transitional_authorizer,
cql3::query_processor&,
::service::migration_manager&> transitional_authorizer_reg("com.scylladb.auth.TransitionalAuthorizer");
::service::migration_manager&> transitional_authorizer_reg(auth::PACKAGE_NAME + "TransitionalAuthorizer");
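
The fallback behaviour of transitional_authenticator::authenticate() can be illustrated without futures. The sketch below is a simplified, synchronous model under stated assumptions: user, credentials, authenticate_fn, authentication_error and the "username"/"password" keys are hypothetical stand-ins, not the auth:: types or key constants used above.

```cpp
#include <functional>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for exceptions::authentication_exception.
struct authentication_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for authenticated_user: an empty name means "anonymous".
struct user {
    std::optional<std::string> name;
};

using credentials = std::map<std::string, std::string>;
using authenticate_fn = std::function<user(const credentials&)>;

// Returns true when `key` is absent from `c` or maps to an empty string.
bool missing_or_empty(const credentials& c, const std::string& key) {
    const auto it = c.find(key);
    return it == c.end() || it->second.empty();
}

user transitional_authenticate(const authenticate_fn& delegate, const credentials& c) {
    // No user name and no password: treat the client as anonymous up front.
    if (missing_or_empty(c, "username") && missing_or_empty(c, "password")) {
        return user{};
    }
    try {
        return delegate(c);
    } catch (const authentication_error&) {
        // Rejected credentials degrade to the anonymous user instead of failing.
        return user{};
    }
}
```

Only the authentication error is swallowed; any other exception thrown by the delegate still propagates, which matches the catch clause in the handler above.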

backlog_controller.hh Normal file

@@ -0,0 +1,146 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <seastar/core/scheduling.hh>
#include <seastar/core/timer.hh>
#include <seastar/core/gate.hh>
#include <chrono>
#include <cmath>
#include <functional>
#include <limits>
#include <vector>
// Simple proportional controller to adjust shares for processes for which a backlog can be clearly
// defined.
//
// Goal is to consume the backlog as fast as we can, but not so fast that we steal all the CPU from
// incoming requests, and at the same time minimize user-visible fluctuations in the quota.
//
// What that translates to is we'll try to keep the backlog's first derivative at 0 (IOW, we keep
// backlog constant). As the backlog grows we increase CPU usage, decreasing CPU usage as the
// backlog diminishes.
//
// The exact point at which the controller stops determines the desired CPU usage. As the backlog
// grows and approaches the desired maximum, we need to be more aggressive. We will therefore define two
// thresholds, and increase the constant as we cross them.
//
// Doing that divides the range in three (before the first, between first and second, and after
// second threshold), and we'll be slow to grow in the first region, grow normally in the second
// region, and aggressively in the third region.
//
// The constants q1 and q2 are used to determine the proportional factor at each stage.
class backlog_controller {
public:
future<> shutdown() {
_update_timer.cancel();
return std::move(_inflight_update);
}
protected:
struct control_point {
float input;
float output;
};
seastar::scheduling_group _scheduling_group;
const ::io_priority_class& _io_priority;
std::chrono::milliseconds _interval;
timer<> _update_timer;
std::vector<control_point> _control_points;
std::function<float()> _current_backlog;
// updating shares for an I/O class may contact another shard and returns a future.
future<> _inflight_update;
virtual void update_controller(float quota);
void adjust();
backlog_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, std::chrono::milliseconds interval,
std::vector<control_point> control_points, std::function<float()> backlog)
: _scheduling_group(sg)
, _io_priority(iop)
, _interval(interval)
, _update_timer([this] { adjust(); })
, _control_points({{0,0}})
, _current_backlog(std::move(backlog))
, _inflight_update(make_ready_future<>())
{
_control_points.insert(_control_points.end(), control_points.begin(), control_points.end());
_update_timer.arm_periodic(_interval);
}
// Used when the controllers are disabled and a static share is used
// When that option is deprecated we should remove this.
backlog_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, float static_shares)
: _scheduling_group(sg)
, _io_priority(iop)
, _inflight_update(make_ready_future<>())
{
update_controller(static_shares);
}
virtual ~backlog_controller() {}
public:
backlog_controller(backlog_controller&&) = default;
float backlog_of_shares(float shares) const;
seastar::scheduling_group sg() {
return _scheduling_group;
}
};
// memtable flush CPU controller.
//
// - First threshold is the soft limit line,
// - Maximum is the point at which we'd stop consuming requests,
// - Second threshold is halfway between them.
//
// Below the soft limit, we are in no particular hurry to flush, since it means we're set to
// complete flushing before a new memtable is ready. The quota is dirty * q1, and q1 is set to a
// low number.
//
// The first half of the virtual dirty region is where we expect to be usually, so we have a low
// slope corresponding to a sluggish response between q1 * soft_limit and q2.
//
// In the second half, we're getting close to the hard dirty limit so we increase the slope and
// become more responsive, up to a maximum quota of qmax.
class flush_controller : public backlog_controller {
static constexpr float hard_dirty_limit = 1.0f;
public:
flush_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, float static_shares) : backlog_controller(sg, iop, static_shares) {}
flush_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, std::chrono::milliseconds interval, float soft_limit, std::function<float()> current_dirty)
: backlog_controller(sg, iop, std::move(interval),
std::vector<backlog_controller::control_point>({{soft_limit, 10}, {soft_limit + (hard_dirty_limit - soft_limit) / 2, 200} , {hard_dirty_limit, 1000}}),
std::move(current_dirty)
)
{}
};
class compaction_controller : public backlog_controller {
public:
static constexpr unsigned normalization_factor = 30;
static constexpr float disable_backlog = std::numeric_limits<double>::infinity();
static constexpr float backlog_disabled(float backlog) { return std::isinf(backlog); }
compaction_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, float static_shares) : backlog_controller(sg, iop, static_shares) {}
compaction_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, std::chrono::milliseconds interval, std::function<float()> current_backlog)
: backlog_controller(sg, iop, std::move(interval),
std::vector<backlog_controller::control_point>({{0.5, 10}, {1.5, 100} , {normalization_factor, 1000}}),
std::move(current_backlog)
)
{}
};
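
The header above declares the control points but does not show how adjust() turns a backlog into shares. One plausible reading of the comments is piecewise-linear interpolation between adjacent control points; the sketch below shows that reading only. interpolate_shares is a hypothetical name, and clamping outside the first and last point is an assumption, not something the header confirms.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct control_point {
    float input;   // backlog value
    float output;  // shares assigned at that backlog
};

// Maps a backlog to shares by linear interpolation between neighbouring
// control points. `points` must be non-empty and sorted by strictly
// increasing input. Values outside the covered range are clamped.
float interpolate_shares(const std::vector<control_point>& points, float backlog) {
    assert(!points.empty());
    if (backlog <= points.front().input) {
        return points.front().output;
    }
    for (std::size_t i = 1; i < points.size(); ++i) {
        const control_point& lo = points[i - 1];
        const control_point& hi = points[i];
        if (backlog <= hi.input) {
            // Here lo.input < backlog <= hi.input, so the divisor is non-zero.
            const float t = (backlog - lo.input) / (hi.input - lo.input);
            return lo.output + t * (hi.output - lo.output);
        }
    }
    return points.back().output;
}
```

With the flush_controller points for a soft limit of 0.5, that is {0, 0}, {0.5, 10}, {0.75, 200}, {1.0, 1000}, the slope is gentle below the soft limit and steep in the last quarter, which is the three-region shape the comments describe.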


@@ -29,7 +29,7 @@
#include <functional>
#include "utils/mutable_view.hh"
using bytes = basic_sstring<int8_t, uint32_t, 31>;
using bytes = basic_sstring<int8_t, uint32_t, 31, false>;
using bytes_view = std::experimental::basic_string_view<int8_t>;
using bytes_mutable_view = basic_mutable_view<bytes_view::value_type>;
using bytes_opt = std::experimental::optional<bytes>;
@@ -78,3 +78,11 @@ struct appending_hash<bytes_view> {
h.update(reinterpret_cast<const char*>(v.begin()), v.size() * sizeof(bytes_view::value_type));
}
};
inline int32_t compare_unsigned(bytes_view v1, bytes_view v2) {
auto n = memcmp(v1.begin(), v2.begin(), std::min(v1.size(), v2.size()));
if (n) {
return n;
}
return (int32_t) (v1.size() - v2.size());
}
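
compare_unsigned() relies on memcmp treating each byte as unsigned char, so a stored int8_t of -1 orders after 1. The sketch below demonstrates that ordering on plain vectors. It is a variant, not the function above: compare_unsigned_sketch is a hypothetical name, and it normalises the result to -1, 0 or 1 instead of returning the raw memcmp value or a size difference.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Lexicographic comparison of two byte sequences, with bytes compared as
// unsigned values. A shorter sequence that is a prefix of the other sorts first.
int compare_unsigned_sketch(const std::vector<std::int8_t>& a,
                            const std::vector<std::int8_t>& b) {
    const std::size_t common = std::min(a.size(), b.size());
    if (common != 0) {
        // memcmp interprets each byte as unsigned char.
        const int n = std::memcmp(a.data(), b.data(), common);
        if (n != 0) {
            return n < 0 ? -1 : 1;
        }
    }
    if (a.size() == b.size()) {
        return 0;
    }
    return a.size() < b.size() ? -1 : 1;
}
```

Comparing the sizes explicitly avoids narrowing an unsigned size difference to a signed 32-bit value, at the cost of losing the magnitude of the difference.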


@@ -65,8 +65,9 @@ private:
size_type _size;
public:
class fragment_iterator : public std::iterator<std::input_iterator_tag, bytes_view> {
chunk* _current;
chunk* _current = nullptr;
public:
fragment_iterator() = default;
fragment_iterator(chunk* current) : _current(current) {}
fragment_iterator(const fragment_iterator&) = default;
fragment_iterator& operator=(const fragment_iterator&) = default;
@@ -289,6 +290,24 @@ public:
}
}
// Removes n bytes from the end of the bytes_ostream.
// Beware: this walks the chunk list, so it is linear in the number of chunks.
void remove_suffix(size_t n) {
_size -= n;
auto left = _size;
auto current = _begin.get();
while (current) {
if (current->offset >= left) {
current->offset = left;
_current = current;
current->next.reset();
return;
}
left -= current->offset;
current = current->next.get();
}
}
// begin() and end() form an input range to bytes_view representing fragments.
// Any modification of this instance invalidates iterators.
fragment_iterator begin() const { return { _begin.get() }; }
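
The walk performed by remove_suffix() can be tried in isolation on a toy chunk list. The sketch below is a simplified model that tracks only per-chunk byte counts: fragmented_buffer, append_chunk and chunk_count are hypothetical names, and the precondition n <= size is an assumption that the hunk above does not state.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

struct chunk {
    std::size_t used = 0;         // bytes stored in this chunk
    std::unique_ptr<chunk> next;  // owning link to the following chunk
};

struct fragmented_buffer {
    std::unique_ptr<chunk> head;
    chunk* tail = nullptr;
    std::size_t size = 0;

    void append_chunk(std::size_t used) {
        auto c = std::make_unique<chunk>();
        c->used = used;
        chunk* raw = c.get();
        if (tail) {
            tail->next = std::move(c);
        } else {
            head = std::move(c);
        }
        tail = raw;
        size += used;
    }

    // Drops the last n bytes: finds the chunk that holds the new end,
    // truncates it, and frees every chunk after it.
    void remove_suffix(std::size_t n) {
        assert(n <= size);
        size -= n;
        std::size_t left = size;  // bytes that must remain
        for (chunk* c = head.get(); c; c = c->next.get()) {
            if (c->used >= left) {
                c->used = left;
                tail = c;
                c->next.reset();
                return;
            }
            left -= c->used;
        }
    }

    std::size_t chunk_count() const {
        std::size_t n = 0;
        for (const chunk* c = head.get(); c; c = c->next.get()) {
            ++n;
        }
        return n;
    }
};
```

Resetting the owning link of the truncated chunk releases the rest of the list in one step, which is why the loop can return immediately after finding the new last chunk.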


@@ -24,53 +24,20 @@
#include <vector>
#include "row_cache.hh"
#include "mutation_reader.hh"
#include "streamed_mutation.hh"
#include "mutation_fragment.hh"
#include "partition_version.hh"
#include "utils/logalloc.hh"
#include "query-request.hh"
#include "partition_snapshot_reader.hh"
#include "partition_snapshot_row_cursor.hh"
#include "read_context.hh"
#include "flat_mutation_reader.hh"
namespace cache {
extern logging::logger clogger;
class lsa_manager {
row_cache& _cache;
public:
lsa_manager(row_cache& cache) : _cache(cache) { }
template<typename Func>
decltype(auto) run_in_read_section(const Func& func) {
return _cache._read_section(_cache._tracker.region(), [&func] () {
return with_linearized_managed_bytes([&func] () {
return func();
});
});
}
template<typename Func>
decltype(auto) run_in_update_section(const Func& func) {
return _cache._update_section(_cache._tracker.region(), [&func] () {
return with_linearized_managed_bytes([&func] () {
return func();
});
});
}
template<typename Func>
void run_in_update_section_with_allocator(Func&& func) {
return _cache._update_section(_cache._tracker.region(), [this, &func] () {
return with_linearized_managed_bytes([this, &func] () {
return with_allocator(_cache._tracker.region().allocator(), [this, &func] () mutable {
return func();
});
});
});
}
logalloc::region& region() { return _cache._tracker.region(); }
logalloc::allocating_section& read_section() { return _cache._read_section; }
};
class cache_streamed_mutation final : public streamed_mutation::impl {
class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
enum class state {
before_static_row,
@@ -93,6 +60,7 @@ class cache_streamed_mutation final : public streamed_mutation::impl {
// - _next_row_in_range = _next.position() < _upper_bound
// - _last_row points at a direct predecessor of the next row which is going to be read.
// Used for populating continuity.
// - _population_range_starts_before_all_rows is set accordingly
reading_from_underlying,
end_of_stream
@@ -108,19 +76,9 @@ class cache_streamed_mutation final : public streamed_mutation::impl {
partition_snapshot_row_weakref _last_row;
// We need to be prepared that we may get overlapping and out of order
// range tombstones. We must emit fragments with strictly monotonic positions,
// so we can't just trim such tombstones to the position of the last fragment.
// To solve that, range tombstones are accumulated first in a range_tombstone_stream
// and emitted once we have a fragment with a larger position.
range_tombstone_stream _tombstones;
// Holds the lower bound of a position range which hasn't been processed yet.
// Only fragments with positions < _lower_bound have been emitted.
//
// It is assumed that !_lower_bound.is_clustering_row(). We depend on this when
// calling range_tombstone::trim_front() and when inserting dummy entries. Dummy
// entries are assumed to be only at !is_clustering_row() positions.
// Only rows with positions < _lower_bound have been emitted, and only
// range_tombstones with positions <= _lower_bound.
position_in_partition _lower_bound;
position_in_partition_view _upper_bound;
@@ -129,75 +87,109 @@ class cache_streamed_mutation final : public streamed_mutation::impl {
partition_snapshot_row_cursor _next_row;
bool _next_row_in_range = false;
future<> do_fill_buffer();
// True iff current population interval, since the previous clustering row, starts before all clustered rows.
// We cannot just look at _lower_bound, because emission of range tombstones changes _lower_bound and
// because we mark clustering intervals as continuous when consuming a clustering_row, it would prevent
// us from marking the interval as continuous.
// Valid when _state == reading_from_underlying.
bool _population_range_starts_before_all_rows;
// Whether _lower_bound was changed within current fill_buffer().
// If it did not then we cannot break out of it (e.g. on preemption) because
// forward progress is not guaranteed in case iterators are getting constantly invalidated.
bool _lower_bound_changed = false;
future<> do_fill_buffer(db::timeout_clock::time_point);
void copy_from_cache_to_buffer();
future<> process_static_row();
future<> process_static_row(db::timeout_clock::time_point);
void move_to_end();
void move_to_next_range();
void move_to_range(query::clustering_row_ranges::const_iterator);
void move_to_next_entry();
// Emits all delayed range tombstones with positions smaller than upper_bound.
void drain_tombstones(position_in_partition_view upper_bound);
// Emits all delayed range tombstones.
void drain_tombstones();
void add_to_buffer(const partition_snapshot_row_cursor&);
void add_clustering_row_to_buffer(mutation_fragment&&);
void add_to_buffer(range_tombstone&&);
void add_to_buffer(mutation_fragment&&);
future<> read_from_underlying();
future<> read_from_underlying(db::timeout_clock::time_point);
void start_reading_from_underlying();
bool after_current_range(position_in_partition_view position);
bool can_populate() const;
// Marks the range between _last_row (exclusive) and _next_row (exclusive) as continuous,
// provided that the underlying reader still matches the latest version of the partition.
void maybe_update_continuity();
// Tries to ensure that the lower bound of the current population range exists.
// Returns false if it failed and range cannot be populated.
// Assumes can_populate().
bool ensure_population_lower_bound();
void maybe_add_to_cache(const mutation_fragment& mf);
void maybe_add_to_cache(const clustering_row& cr);
void maybe_add_to_cache(const range_tombstone& rt);
void maybe_add_to_cache(const static_row& sr);
void maybe_set_static_row_continuous();
void finish_reader() {
push_mutation_fragment(partition_end());
_end_of_stream = true;
_state = state::end_of_stream;
}
void touch_partition();
public:
cache_streamed_mutation(schema_ptr s,
dht::decorated_key dk,
query::clustering_key_filter_ranges&& crr,
lw_shared_ptr<read_context> ctx,
lw_shared_ptr<partition_snapshot> snp,
row_cache& cache)
: streamed_mutation::impl(std::move(s), std::move(dk), snp->partition_tombstone())
cache_flat_mutation_reader(schema_ptr s,
dht::decorated_key dk,
query::clustering_key_filter_ranges&& crr,
lw_shared_ptr<read_context> ctx,
lw_shared_ptr<partition_snapshot> snp,
row_cache& cache)
: flat_mutation_reader::impl(std::move(s))
, _snp(std::move(snp))
, _position_cmp(*_schema)
, _ck_ranges(std::move(crr))
, _ck_ranges_curr(_ck_ranges.begin())
, _ck_ranges_end(_ck_ranges.end())
, _lsa_manager(cache)
, _tombstones(*_schema)
, _lower_bound(position_in_partition::before_all_clustered_rows())
, _upper_bound(position_in_partition_view::before_all_clustered_rows())
, _read_context(std::move(ctx))
, _next_row(*_schema, *_snp)
{
clogger.trace("csm {}: table={}.{}", this, _schema->ks_name(), _schema->cf_name());
push_mutation_fragment(partition_start(std::move(dk), _snp->partition_tombstone()));
}
cache_streamed_mutation(const cache_streamed_mutation&) = delete;
cache_streamed_mutation(cache_streamed_mutation&&) = delete;
virtual future<> fill_buffer() override;
virtual ~cache_streamed_mutation() {
cache_flat_mutation_reader(const cache_flat_mutation_reader&) = delete;
cache_flat_mutation_reader(cache_flat_mutation_reader&&) = delete;
virtual future<> fill_buffer(db::timeout_clock::time_point timeout) override;
virtual ~cache_flat_mutation_reader() {
maybe_merge_versions(_snp, _lsa_manager.region(), _lsa_manager.read_section());
}
virtual void next_partition() override {
clear_buffer_to_next_partition();
if (is_buffer_empty()) {
_end_of_stream = true;
}
}
virtual future<> fast_forward_to(const dht::partition_range&, db::timeout_clock::time_point timeout) override {
clear_buffer();
_end_of_stream = true;
return make_ready_future<>();
}
virtual future<> fast_forward_to(position_range pr, db::timeout_clock::time_point timeout) override {
throw std::bad_function_call();
}
};
inline
future<> cache_streamed_mutation::process_static_row() {
if (_snp->version()->partition().static_row_continuous()) {
future<> cache_flat_mutation_reader::process_static_row(db::timeout_clock::time_point timeout) {
if (_snp->static_row_continuous()) {
_read_context->cache().on_row_hit();
row sr = _lsa_manager.run_in_read_section([this] {
return _snp->static_row();
static_row sr = _lsa_manager.run_in_read_section([this] {
return _snp->static_row(_read_context->digest_requested());
});
if (!sr.empty()) {
push_mutation_fragment(mutation_fragment(static_row(std::move(sr))));
push_mutation_fragment(mutation_fragment(std::move(sr)));
}
return make_ready_future<>();
} else {
_read_context->cache().on_row_miss();
return _read_context->get_next_fragment().then([this] (mutation_fragment_opt&& sr) {
return _read_context->get_next_fragment(timeout).then([this] (mutation_fragment_opt&& sr) {
if (sr) {
assert(sr->is_static_row());
maybe_add_to_cache(sr->as_static_row());
@@ -209,44 +201,53 @@ future<> cache_streamed_mutation::process_static_row() {
}
inline
future<> cache_streamed_mutation::fill_buffer() {
void cache_flat_mutation_reader::touch_partition() {
if (_snp->at_latest_version()) {
rows_entry& last_dummy = *_snp->version()->partition().clustered_rows().rbegin();
_snp->tracker()->touch(last_dummy);
}
}
inline
future<> cache_flat_mutation_reader::fill_buffer(db::timeout_clock::time_point timeout) {
if (_state == state::before_static_row) {
auto after_static_row = [this] {
auto after_static_row = [this, timeout] {
if (_ck_ranges_curr == _ck_ranges_end) {
_end_of_stream = true;
_state = state::end_of_stream;
touch_partition();
finish_reader();
return make_ready_future<>();
}
_state = state::reading_from_cache;
_lsa_manager.run_in_read_section([this] {
move_to_range(_ck_ranges_curr);
});
return fill_buffer();
return fill_buffer(timeout);
};
if (_schema->has_static_columns()) {
return process_static_row().then(std::move(after_static_row));
return process_static_row(timeout).then(std::move(after_static_row));
} else {
return after_static_row();
}
}
clogger.trace("csm {}: fill_buffer(), range={}, lb={}", this, *_ck_ranges_curr, _lower_bound);
return do_until([this] { return _end_of_stream || is_buffer_full(); }, [this] {
return do_fill_buffer();
return do_until([this] { return _end_of_stream || is_buffer_full(); }, [this, timeout] {
return do_fill_buffer(timeout);
});
}
inline
future<> cache_streamed_mutation::do_fill_buffer() {
future<> cache_flat_mutation_reader::do_fill_buffer(db::timeout_clock::time_point timeout) {
if (_state == state::move_to_underlying) {
_state = state::reading_from_underlying;
_population_range_starts_before_all_rows = _lower_bound.is_before_all_clustered_rows(*_schema);
auto end = _next_row_in_range ? position_in_partition(_next_row.position())
: position_in_partition(_upper_bound);
return _read_context->fast_forward_to(position_range{_lower_bound, std::move(end)}).then([this] {
return read_from_underlying();
return _read_context->fast_forward_to(position_range{_lower_bound, std::move(end)}, timeout).then([this, timeout] {
return read_from_underlying(timeout);
});
}
if (_state == state::reading_from_underlying) {
return read_from_underlying();
return read_from_underlying(timeout);
}
// assert(_state == state::reading_from_cache)
return _lsa_manager.run_in_read_section([this] {
@@ -266,9 +267,13 @@ future<> cache_streamed_mutation::do_fill_buffer() {
}
_next_row.maybe_refresh();
clogger.trace("csm {}: next={}, cont={}", this, _next_row.position(), _next_row.continuous());
while (!is_buffer_full() && _state == state::reading_from_cache) {
_lower_bound_changed = false;
while (_state == state::reading_from_cache) {
copy_from_cache_to_buffer();
if (need_preempt()) {
// We need to check _lower_bound_changed even if is_buffer_full() because
// we may have emitted only a range tombstone which overlapped with _lower_bound
// and thus didn't cause _lower_bound to change.
if ((need_preempt() || is_buffer_full()) && _lower_bound_changed) {
break;
}
}
@@ -277,8 +282,8 @@ future<> cache_streamed_mutation::do_fill_buffer() {
}
inline
future<> cache_streamed_mutation::read_from_underlying() {
return consume_mutation_fragments_until(_read_context->get_streamed_mutation(),
future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::time_point timeout) {
return consume_mutation_fragments_until(_read_context->underlying().underlying(),
[this] { return _state != state::reading_from_underlying || is_buffer_full(); },
[this] (mutation_fragment mf) {
_read_context->cache().on_row_miss();
@@ -323,13 +328,14 @@ future<> cache_streamed_mutation::read_from_underlying() {
auto inserted = insert_result.second;
auto it = insert_result.first;
if (inserted) {
_snp->tracker()->insert(*e);
e.release();
auto next = std::next(it);
it->set_continuous(next->continuous());
clogger.trace("csm {}: inserted dummy at {}, cont={}", this, it->position(), it->continuous());
}
});
} else if (!_ck_ranges_curr->start() || _last_row.refresh(*_snp)) {
} else if (ensure_population_lower_bound()) {
with_allocator(_snp->region().allocator(), [&] {
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(*_schema, _upper_bound, is_dummy::yes, is_continuous::yes));
@@ -338,6 +344,7 @@ future<> cache_streamed_mutation::read_from_underlying() {
auto inserted = insert_result.second;
if (inserted) {
clogger.trace("csm {}: inserted dummy at {}", this, _upper_bound);
_snp->tracker()->insert(*e);
e.release();
} else {
clogger.trace("csm {}: mark {} as continuous", this, insert_result.first->position());
@@ -357,37 +364,53 @@ future<> cache_streamed_mutation::read_from_underlying() {
}
});
return make_ready_future<>();
});
}, timeout);
}
inline
void cache_streamed_mutation::maybe_update_continuity() {
if (can_populate() && (!_ck_ranges_curr->start() || _last_row.refresh(*_snp))) {
if (_next_row.is_in_latest_version()) {
clogger.trace("csm {}: mark {} continuous", this, _next_row.get_iterator_in_latest_version()->position());
_next_row.get_iterator_in_latest_version()->set_continuous(true);
} else {
// Cover entry from older version
with_allocator(_snp->region().allocator(), [&] {
auto& rows = _snp->version()->partition().clustered_rows();
rows_entry::compare less(*_schema);
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(*_schema, _next_row.position(), is_dummy(_next_row.dummy()), is_continuous::yes));
auto insert_result = rows.insert_check(_next_row.get_iterator_in_latest_version(), *e, less);
auto inserted = insert_result.second;
if (inserted) {
clogger.trace("csm {}: inserted dummy at {}", this, e->position());
e.release();
}
});
bool cache_flat_mutation_reader::ensure_population_lower_bound() {
if (_population_range_starts_before_all_rows) {
return true;
}
if (!_last_row.refresh(*_snp)) {
return false;
}
// Continuity flag we will later set for the upper bound extends to the previous row in the same version,
// so we need to ensure we have an entry in the latest version.
if (!_last_row.is_in_latest_version()) {
with_allocator(_snp->region().allocator(), [&] {
auto& rows = _snp->version()->partition().clustered_rows();
rows_entry::compare less(*_schema);
// FIXME: Avoid the copy by inserting an incomplete clustering row
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(*_schema, *_last_row));
e->set_continuous(false);
auto insert_result = rows.insert_check(rows.end(), *e, less);
auto inserted = insert_result.second;
if (inserted) {
clogger.trace("csm {}: inserted lower bound dummy at {}", this, e->position());
_snp->tracker()->insert(*e);
e.release();
}
});
}
return true;
}
inline
void cache_flat_mutation_reader::maybe_update_continuity() {
if (can_populate() && ensure_population_lower_bound()) {
with_allocator(_snp->region().allocator(), [&] {
rows_entry& e = _next_row.ensure_entry_in_latest().row;
e.set_continuous(true);
});
} else {
_read_context->cache().on_mispopulate();
}
}
inline
void cache_streamed_mutation::maybe_add_to_cache(const mutation_fragment& mf) {
void cache_flat_mutation_reader::maybe_add_to_cache(const mutation_fragment& mf) {
if (mf.is_range_tombstone()) {
maybe_add_to_cache(mf.as_range_tombstone());
} else {
@@ -398,9 +421,10 @@ void cache_streamed_mutation::maybe_add_to_cache(const mutation_fragment& mf) {
}
inline
void cache_streamed_mutation::maybe_add_to_cache(const clustering_row& cr) {
void cache_flat_mutation_reader::maybe_add_to_cache(const clustering_row& cr) {
if (!can_populate()) {
_last_row = nullptr;
_population_range_starts_before_all_rows = false;
_read_context->cache().on_mispopulate();
return;
}
@@ -409,52 +433,69 @@ void cache_streamed_mutation::maybe_add_to_cache(const clustering_row& cr) {
mutation_partition& mp = _snp->version()->partition();
rows_entry::compare less(*_schema);
if (_read_context->digest_requested()) {
cr.cells().prepare_hash(*_schema, column_kind::regular_column);
}
auto new_entry = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(cr.key(), cr.tomb(), cr.marker(), cr.cells()));
current_allocator().construct<rows_entry>(*_schema, cr.key(), cr.tomb(), cr.marker(), cr.cells()));
new_entry->set_continuous(false);
auto it = _next_row.iterators_valid() ? _next_row.get_iterator_in_latest_version()
: mp.clustered_rows().lower_bound(cr.key(), less);
auto insert_result = mp.clustered_rows().insert_check(it, *new_entry, less);
if (insert_result.second) {
_read_context->cache().on_row_insert();
_snp->tracker()->insert(*new_entry);
new_entry.release();
}
it = insert_result.first;
rows_entry& e = *it;
if (!_ck_ranges_curr->start() || _last_row.refresh(*_snp)) {
if (ensure_population_lower_bound()) {
clogger.trace("csm {}: set_continuous({})", this, e.position());
e.set_continuous(true);
} else {
_read_context->cache().on_mispopulate();
}
with_allocator(standard_allocator(), [&] {
_last_row = partition_snapshot_row_weakref(*_snp, it);
_last_row = partition_snapshot_row_weakref(*_snp, it, true);
});
_population_range_starts_before_all_rows = false;
});
}
inline
bool cache_streamed_mutation::after_current_range(position_in_partition_view p) {
bool cache_flat_mutation_reader::after_current_range(position_in_partition_view p) {
return _position_cmp(p, _upper_bound) >= 0;
}
inline
void cache_streamed_mutation::start_reading_from_underlying() {
void cache_flat_mutation_reader::start_reading_from_underlying() {
clogger.trace("csm {}: start_reading_from_underlying(), range=[{}, {})", this, _lower_bound, _next_row_in_range ? _next_row.position() : _upper_bound);
_state = state::move_to_underlying;
_next_row.touch();
}
inline
void cache_streamed_mutation::copy_from_cache_to_buffer() {
void cache_flat_mutation_reader::copy_from_cache_to_buffer() {
clogger.trace("csm {}: copy_from_cache, next={}, next_row_in_range={}", this, _next_row.position(), _next_row_in_range);
_next_row.touch();
position_in_partition_view next_lower_bound = _next_row.dummy() ? _next_row.position() : position_in_partition_view::after_key(_next_row.key());
for (auto&& rts : _snp->range_tombstones(*_schema, _lower_bound, _next_row_in_range ? next_lower_bound : _upper_bound)) {
add_to_buffer(std::move(rts));
if (is_buffer_full()) {
return;
for (auto &&rts : _snp->range_tombstones(_lower_bound, _next_row_in_range ? next_lower_bound : _upper_bound)) {
position_in_partition::less_compare less(*_schema);
// This guarantees that rts starts after any emitted clustering_row
// and not before any emitted range tombstone.
if (!less(_lower_bound, rts.position())) {
rts.set_start(*_schema, _lower_bound);
} else {
_lower_bound = position_in_partition(rts.position());
_lower_bound_changed = true;
if (is_buffer_full()) {
return;
}
}
push_mutation_fragment(std::move(rts));
}
// We add the row to the buffer even when it's full.
// This simplifies the code. For more info see #3139.
if (_next_row_in_range) {
_last_row = _next_row;
add_to_buffer(_next_row);
@@ -465,15 +506,13 @@ void cache_streamed_mutation::copy_from_cache_to_buffer() {
}
inline
void cache_streamed_mutation::move_to_end() {
drain_tombstones();
_end_of_stream = true;
_state = state::end_of_stream;
void cache_flat_mutation_reader::move_to_end() {
finish_reader();
clogger.trace("csm {}: eos", this);
}
inline
void cache_streamed_mutation::move_to_next_range() {
void cache_flat_mutation_reader::move_to_next_range() {
auto next_it = std::next(_ck_ranges_curr);
if (next_it == _ck_ranges_end) {
move_to_end();
@@ -484,12 +523,13 @@ void cache_streamed_mutation::move_to_next_range() {
}
inline
void cache_streamed_mutation::move_to_range(query::clustering_row_ranges::const_iterator next_it) {
void cache_flat_mutation_reader::move_to_range(query::clustering_row_ranges::const_iterator next_it) {
auto lb = position_in_partition::for_range_start(*next_it);
auto ub = position_in_partition_view::for_range_end(*next_it);
_last_row = nullptr;
_lower_bound = std::move(lb);
_upper_bound = std::move(ub);
_lower_bound_changed = true;
_ck_ranges_curr = next_it;
auto adjacent = _next_row.advance_to(_lower_bound);
_next_row_in_range = !after_current_range(_next_row.position());
@@ -509,7 +549,8 @@ void cache_streamed_mutation::move_to_range(query::clustering_row_ranges::const_
auto new_entry = current_allocator().construct<rows_entry>(*_schema, _lower_bound, is_dummy::yes, is_continuous::no);
return rows.insert_before(_next_row.get_iterator_in_latest_version(), *new_entry);
});
_last_row = partition_snapshot_row_weakref(*_snp, it);
_snp->tracker()->insert(*it);
_last_row = partition_snapshot_row_weakref(*_snp, it, true);
} else {
_read_context->cache().on_mispopulate();
}
@@ -520,7 +561,7 @@ void cache_streamed_mutation::move_to_range(query::clustering_row_ranges::const_
// _next_row must be inside the range.
inline
void cache_streamed_mutation::move_to_next_entry() {
void cache_flat_mutation_reader::move_to_next_entry() {
clogger.trace("csm {}: move_to_next_entry(), curr={}", this, _next_row.position());
if (no_clustering_row_between(*_schema, _next_row.position(), _upper_bound)) {
move_to_next_range();
@@ -538,31 +579,7 @@ void cache_streamed_mutation::move_to_next_entry() {
}
inline
void cache_streamed_mutation::drain_tombstones(position_in_partition_view pos) {
while (true) {
reserve_one();
auto mfo = _tombstones.get_next(pos);
if (!mfo) {
break;
}
push_mutation_fragment(std::move(*mfo));
}
}
inline
void cache_streamed_mutation::drain_tombstones() {
while (true) {
reserve_one();
auto mfo = _tombstones.get_next();
if (!mfo) {
break;
}
push_mutation_fragment(std::move(*mfo));
}
}
inline
void cache_streamed_mutation::add_to_buffer(mutation_fragment&& mf) {
void cache_flat_mutation_reader::add_to_buffer(mutation_fragment&& mf) {
clogger.trace("csm {}: add_to_buffer({})", this, mf);
if (mf.is_clustering_row()) {
add_clustering_row_to_buffer(std::move(mf));
@@ -573,10 +590,10 @@ void cache_streamed_mutation::add_to_buffer(mutation_fragment&& mf) {
}
inline
void cache_streamed_mutation::add_to_buffer(const partition_snapshot_row_cursor& row) {
void cache_flat_mutation_reader::add_to_buffer(const partition_snapshot_row_cursor& row) {
if (!row.dummy()) {
_read_context->cache().on_row_hit();
add_clustering_row_to_buffer(row.row());
add_clustering_row_to_buffer(row.row(_read_context->digest_requested()));
}
}
@@ -584,35 +601,35 @@ void cache_streamed_mutation::add_to_buffer(const partition_snapshot_row_cursor&
// (1) no fragment with position >= _lower_bound was pushed yet
// (2) If _lower_bound > mf.position(), mf was emitted
inline
void cache_streamed_mutation::add_clustering_row_to_buffer(mutation_fragment&& mf) {
void cache_flat_mutation_reader::add_clustering_row_to_buffer(mutation_fragment&& mf) {
clogger.trace("csm {}: add_clustering_row_to_buffer({})", this, mf);
auto& row = mf.as_clustering_row();
auto key = row.key();
try {
drain_tombstones(row.position());
push_mutation_fragment(std::move(mf));
_lower_bound = position_in_partition::after_key(std::move(key));
} catch (...) {
// We may have emitted some of the range tombstones which start after the old _lower_bound
_lower_bound = position_in_partition::for_key(std::move(key));
throw;
}
auto new_lower_bound = position_in_partition::after_key(row.key());
push_mutation_fragment(std::move(mf));
_lower_bound = std::move(new_lower_bound);
_lower_bound_changed = true;
}
inline
void cache_streamed_mutation::add_to_buffer(range_tombstone&& rt) {
void cache_flat_mutation_reader::add_to_buffer(range_tombstone&& rt) {
clogger.trace("csm {}: add_to_buffer({})", this, rt);
// This guarantees that rt starts after any emitted clustering_row
if (!rt.trim_front(*_schema, _lower_bound)) {
// and not before any emitted range tombstone.
position_in_partition::less_compare less(*_schema);
if (!less(_lower_bound, rt.end_position())) {
return;
}
_lower_bound = position_in_partition(rt.position());
_tombstones.apply(std::move(rt));
drain_tombstones(_lower_bound);
if (!less(_lower_bound, rt.position())) {
rt.set_start(*_schema, _lower_bound);
} else {
_lower_bound = position_in_partition(rt.position());
_lower_bound_changed = true;
}
push_mutation_fragment(std::move(rt));
}
inline
void cache_streamed_mutation::maybe_add_to_cache(const range_tombstone& rt) {
void cache_flat_mutation_reader::maybe_add_to_cache(const range_tombstone& rt) {
if (can_populate()) {
clogger.trace("csm {}: maybe_add_to_cache({})", this, rt);
_lsa_manager.run_in_update_section_with_allocator([&] {
@@ -624,11 +641,14 @@ void cache_streamed_mutation::maybe_add_to_cache(const range_tombstone& rt) {
}
inline
void cache_streamed_mutation::maybe_add_to_cache(const static_row& sr) {
void cache_flat_mutation_reader::maybe_add_to_cache(const static_row& sr) {
if (can_populate()) {
clogger.trace("csm {}: populate({})", this, sr);
_read_context->cache().on_row_insert();
_read_context->cache().on_static_row_insert();
_lsa_manager.run_in_update_section_with_allocator([&] {
if (_read_context->digest_requested()) {
sr.cells().prepare_hash(*_schema, column_kind::static_column);
}
_snp->version()->partition().static_row().apply(*_schema, column_kind::static_column, sr.cells());
});
} else {
@@ -637,7 +657,7 @@ void cache_streamed_mutation::maybe_add_to_cache(const static_row& sr) {
}
inline
void cache_streamed_mutation::maybe_set_static_row_continuous() {
void cache_flat_mutation_reader::maybe_set_static_row_continuous() {
if (can_populate()) {
clogger.trace("csm {}: set static row continuous", this);
_snp->version()->partition().set_static_row_continuous(true);
@@ -647,19 +667,19 @@ void cache_streamed_mutation::maybe_set_static_row_continuous() {
}
inline
bool cache_streamed_mutation::can_populate() const {
bool cache_flat_mutation_reader::can_populate() const {
return _snp->at_latest_version() && _read_context->cache().phase_of(_read_context->key()) == _read_context->phase();
}
} // namespace cache
inline streamed_mutation make_cache_streamed_mutation(schema_ptr s,
dht::decorated_key dk,
query::clustering_key_filter_ranges crr,
row_cache& cache,
lw_shared_ptr<cache::read_context> ctx,
lw_shared_ptr<partition_snapshot> snp)
inline flat_mutation_reader make_cache_flat_mutation_reader(schema_ptr s,
dht::decorated_key dk,
query::clustering_key_filter_ranges crr,
row_cache& cache,
lw_shared_ptr<cache::read_context> ctx,
lw_shared_ptr<partition_snapshot> snp)
{
return make_streamed_mutation<cache::cache_streamed_mutation>(
return make_flat_mutation_reader<cache::cache_flat_mutation_reader>(
std::move(s), std::move(dk), std::move(crr), std::move(ctx), std::move(snp), cache);
}
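
Review note: the trimming rule that `add_to_buffer(range_tombstone&&)` applies against `_lower_bound` can be modeled in isolation. The following is a minimal sketch under stated assumptions, not the Scylla implementation: positions are plain integers, a tombstone covers the half-open interval `[start, end)`, and `tombstone_range`, `emit_state` and `trim_for_emission` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>

// Hypothetical simplified model: integer positions, tombstone covers [start, end).
struct tombstone_range {
    int start;
    int end;
};

// Stand-in for the reader members _lower_bound and _lower_bound_changed.
struct emit_state {
    int lower_bound = 0;
    bool lower_bound_changed = false;
};

// Returns true if `rt` should be emitted (possibly after trimming its start),
// false if it lies entirely at or before the lower bound and must be dropped.
inline bool trim_for_emission(emit_state& st, tombstone_range& rt) {
    assert(rt.start < rt.end);
    if (rt.end <= st.lower_bound) {
        // Everything this tombstone covers was already emitted.
        return false;
    }
    if (rt.start <= st.lower_bound) {
        // Overlaps the emitted prefix: cut the front so that it starts at the
        // lower bound. The lower bound itself does not move in this case.
        rt.start = st.lower_bound;
    } else {
        // Starts strictly after the lower bound: progress was made.
        st.lower_bound = rt.start;
        st.lower_bound_changed = true;
    }
    return true;
}
```

In this model, a tombstone that only overlaps the lower bound is emitted without setting `lower_bound_changed`, which is the situation the comment in `do_fill_buffer()` describes when it checks `_lower_bound_changed` even if the buffer is full.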


@@ -75,7 +75,7 @@ mutation canonical_mutation::to_mutation(schema_ptr s) const {
auto version = mv.schema_version();
auto pk = mv.key();
mutation m(std::move(pk), std::move(s));
mutation m(std::move(s), std::move(pk));
if (version == m.schema()->version()) {
auto partition_view = mutation_partition_view::from_view(mv.partition());


@@ -23,27 +23,15 @@
#include <boost/intrusive/unordered_set.hpp>
#if __has_include(<boost/container/small_vector.hpp>)
#include <boost/container/small_vector.hpp>
template <typename T, size_t N>
using small_vector = boost::container::small_vector<T, N>;
#else
#include <vector>
template <typename T, size_t N>
using small_vector = std::vector<T>;
#endif
#include "utils/small_vector.hh"
#include "fnv1a_hasher.hh"
#include "streamed_mutation.hh"
#include "mutation_fragment.hh"
#include "mutation_partition.hh"
#include "db/timeout_clock.hh"
class cells_range {
using ids_vector_type = small_vector<column_id, 5>;
using ids_vector_type = utils::small_vector<column_id, 5>;
position_in_partition_view _position;
ids_vector_type _ids;
@@ -142,11 +130,7 @@ struct cell_locker_stats {
};
class cell_locker {
public:
using timeout_clock = lowres_clock;
private:
using semaphore_type = basic_semaphore<default_timeout_exception_factory, timeout_clock>;
class partition_entry;
struct cell_address {
@@ -158,7 +142,7 @@ private:
public enable_lw_shared_from_this<cell_entry> {
partition_entry& _parent;
cell_address _address;
semaphore_type _semaphore { 0 };
db::timeout_semaphore _semaphore { 0 };
friend class cell_locker;
public:
@@ -187,7 +171,7 @@ private:
return _address.position;
}
future<> lock(timeout_clock::time_point _timeout) {
future<> lock(db::timeout_clock::time_point _timeout) {
return _semaphore.wait(_timeout);
}
void unlock() {
@@ -387,7 +371,7 @@ public:
// partition_cells_range is required to be in cell_locker::schema()
future<std::vector<locked_cell>> lock_cells(const dht::decorated_key& dk, partition_cells_range&& range,
timeout_clock::time_point timeout);
db::timeout_clock::time_point timeout);
};
@@ -416,7 +400,7 @@ struct cell_locker::locker {
partition_cells_range::iterator _current_ck;
cells_range::const_iterator _current_cell;
timeout_clock::time_point _timeout;
db::timeout_clock::time_point _timeout;
std::vector<locked_cell> _locks;
cell_locker_stats& _stats;
private:
@@ -430,7 +414,7 @@ private:
bool is_done() const { return _current_ck == _range.end(); }
public:
explicit locker(const ::schema& s, cell_locker_stats& st, partition_entry& pe, partition_cells_range&& range, timeout_clock::time_point timeout)
explicit locker(const ::schema& s, cell_locker_stats& st, partition_entry& pe, partition_cells_range&& range, db::timeout_clock::time_point timeout)
: _hasher(s)
, _eq_cmp(s)
, _partition_entry(pe)
@@ -458,7 +442,7 @@ public:
};
inline
future<std::vector<locked_cell>> cell_locker::lock_cells(const dht::decorated_key& dk, partition_cells_range&& range, timeout_clock::time_point timeout) {
future<std::vector<locked_cell>> cell_locker::lock_cells(const dht::decorated_key& dk, partition_cells_range&& range, db::timeout_clock::time_point timeout) {
partition_entry::hasher pe_hash;
partition_entry::equal_compare pe_eq(*_schema);
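
Review note: the hunks above replace the locker's private `timeout_clock` with `db::timeout_clock`, so the same deadline type flows from `lock_cells()` through `locker` down to `cell_entry::lock()`. A minimal sketch of that pattern follows, assuming `std::chrono::steady_clock` as a stand-in for `db::timeout_clock` and a non-blocking flag instead of a Seastar semaphore. `cell_entry_sketch`, `lock_cells_sketch` and `timed_out` are hypothetical names, and the rollback on failure is a choice of this sketch rather than a description of the real code.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Assumption: stand-in for db::timeout_clock, shared by every layer.
using timeout_clock = std::chrono::steady_clock;

struct timed_out : std::runtime_error {
    timed_out() : std::runtime_error("timed out") {}
};

// Innermost layer. A real implementation would wait on a semaphore until the
// deadline; this sketch fails immediately if the cell is taken or the deadline
// has already passed.
class cell_entry_sketch {
    bool _locked = false;
public:
    void lock(timeout_clock::time_point deadline) {
        if (_locked || timeout_clock::now() >= deadline) {
            throw timed_out();
        }
        _locked = true;
    }
    void unlock() {
        assert(_locked);
        _locked = false;
    }
    bool locked() const { return _locked; }
};

// Outer layer: every cell is locked against the same absolute deadline, so
// the time budget is not restarted for each cell.
inline void lock_cells_sketch(std::vector<cell_entry_sketch*>& cells,
                              timeout_clock::time_point deadline) {
    std::size_t acquired = 0;
    try {
        for (auto* c : cells) {
            c->lock(deadline);
            ++acquired;
        }
    } catch (...) {
        // Release what was taken before propagating the failure.
        for (std::size_t i = 0; i < acquired; ++i) {
            cells[i]->unlock();
        }
        throw;
    }
}
```

Passing an absolute `time_point` rather than a duration is what lets each layer forward the argument unchanged.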


@@ -42,17 +42,6 @@ std::ostream& operator<<(std::ostream& out, const bound_kind k);
bound_kind invert_kind(bound_kind k);
int32_t weight(bound_kind k);
static inline bound_kind flip_bound_kind(bound_kind bk)
{
switch (bk) {
case bound_kind::excl_end: return bound_kind::excl_start;
case bound_kind::incl_end: return bound_kind::incl_start;
case bound_kind::excl_start: return bound_kind::excl_end;
case bound_kind::incl_start: return bound_kind::incl_end;
}
abort();
}
class bound_view {
public:
const static thread_local clustering_key empty_prefix;


@@ -25,7 +25,7 @@
#include "schema.hh"
#include "query-request.hh"
#include "streamed_mutation.hh"
#include "mutation_fragment.hh"
// Utility for in-order checking of overlap with position ranges.
class clustering_ranges_walker {
@@ -70,7 +70,7 @@ public:
{
if (!with_static_row) {
if (_current == _end) {
_current_start = _current_end = position_in_partition_view::after_all_clustered_rows();
_current_start = position_in_partition_view::before_all_clustered_rows();
} else {
_current_start = position_in_partition_view::for_range_start(*_current);
_current_end = position_in_partition_view::for_range_end(*_current);


@@ -23,8 +23,10 @@
#include "sstables/shared_sstable.hh"
#include "exceptions/exceptions.hh"
#include "sstables/compaction_backlog_manager.hh"
class column_family;
class table;
using column_family = table;
class schema;
using schema_ptr = lw_shared_ptr<const schema>;
@@ -120,6 +122,8 @@ public:
}
sstable_set make_sstable_set(schema_ptr schema) const;
compaction_backlog_tracker& get_backlog_tracker();
};
// Creates a compaction_strategy object from one of the strategies available.


@@ -28,6 +28,7 @@
#include <boost/range/iterator_range.hpp>
#include <boost/range/adaptor/transformed.hpp>
#include "utils/serialization.hh"
#include "util/backtrace.hh"
#include "unimplemented.hh"
enum class allow_prefixes { no, yes };
@@ -144,7 +145,7 @@ public:
}
len = read_simple<size_type>(_v);
if (_v.size() < len) {
throw marshal_exception();
throw_with_backtrace<marshal_exception>(sprint("compound_type iterator - not enough bytes, expected %d, got %d", len, _v.size()));
}
}
_current = bytes_view(_v.begin(), len);
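
Review note: the hunk above replaces a bare `marshal_exception` with one that reports the expected and actual byte counts. The sketch below shows the same length check on a length-prefixed component, assuming a 2-byte big-endian length prefix and using `std::runtime_error` in place of `marshal_exception`. `read_component` is a hypothetical helper, not part of `compound.hh`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Reads one component starting at `pos`: a 2-byte big-endian length followed
// by that many bytes. Advances `pos` only on success.
inline std::string read_component(const std::string& buf, std::size_t& pos) {
    assert(pos <= buf.size());
    std::size_t remaining = buf.size() - pos;
    if (remaining < 2) {
        throw std::runtime_error(
            "not enough bytes to read component length, expected 2, got "
            + std::to_string(remaining));
    }
    std::size_t len = (std::size_t(std::uint8_t(buf[pos])) << 8)
                    | std::size_t(std::uint8_t(buf[pos + 1]));
    remaining -= 2;
    if (remaining < len) {
        // Report both numbers so a corrupted buffer can be diagnosed from the
        // message alone.
        throw std::runtime_error(
            "not enough bytes, expected " + std::to_string(len)
            + ", got " + std::to_string(remaining));
    }
    std::string out = buf.substr(pos + 2, len);
    pos += 2 + len;
    return out;
}
```

Leaving `pos` untouched on failure keeps the caller's view of the buffer consistent if it decides to catch the exception and report the offset.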


@@ -25,6 +25,7 @@
#include <boost/range/adaptor/transformed.hpp>
#include "compound.hh"
#include "schema.hh"
#include "sstables/version.hh"
//
// This header provides adaptors between the representation used by our compound_type<>
@@ -241,7 +242,7 @@ public:
using component_view = std::pair<bytes_view, eoc>;
private:
template<typename Value, typename = std::enable_if_t<!std::is_same<const data_value, std::decay_t<Value>>::value>>
static size_t size(Value& val) {
static size_t size(const Value& val) {
return val.size();
}
static size_t size(const data_value& val) {
@@ -302,7 +303,7 @@ private:
}
public:
template <typename Describer>
auto describe_type(Describer f) const {
auto describe_type(sstables::sstable_version_types v, Describer f) const {
return f(const_cast<bytes&>(_bytes));
}
@@ -345,7 +346,7 @@ public:
}
len = read_simple<size_type>(_v);
if (_v.size() < len) {
throw marshal_exception();
throw_with_backtrace<marshal_exception>(sprint("composite iterator - not enough bytes, expected %d, got %d", len, _v.size()));
}
}
auto value = bytes_view(_v.begin(), len);
@@ -445,17 +446,16 @@ public:
return _is_compound;
}
// The following factory functions assume this composite is a compound value.
template <typename ClusteringElement>
static composite from_clustering_element(const schema& s, const ClusteringElement& ce) {
return serialize_value(ce.components(s));
return serialize_value(ce.components(s), s.is_compound());
}
static composite from_exploded(const std::vector<bytes_view>& v, eoc marker = eoc::none) {
static composite from_exploded(const std::vector<bytes_view>& v, bool is_compound, eoc marker = eoc::none) {
if (v.size() == 0) {
return composite(bytes(size_t(1), bytes::value_type(marker)));
return composite(bytes(size_t(1), bytes::value_type(marker)), is_compound);
}
return serialize_value(v, true, marker);
return serialize_value(v, is_compound, marker);
}
static composite static_prefix(const schema& s) {

compress.cc Normal file
View File

@@ -0,0 +1,345 @@
/*
* Copyright (C) 2016 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <lz4.h>
#include <zlib.h>
#include <snappy-c.h>
#include "compress.hh"
#include "utils/class_registrator.hh"
const sstring compressor::namespace_prefix = "org.apache.cassandra.io.compress.";
class lz4_processor: public compressor {
public:
using compressor::compressor;
size_t uncompress(const char* input, size_t input_len, char* output,
size_t output_len) const override;
size_t compress(const char* input, size_t input_len, char* output,
size_t output_len) const override;
size_t compress_max_size(size_t input_len) const override;
};
class snappy_processor: public compressor {
public:
using compressor::compressor;
size_t uncompress(const char* input, size_t input_len, char* output,
size_t output_len) const override;
size_t compress(const char* input, size_t input_len, char* output,
size_t output_len) const override;
size_t compress_max_size(size_t input_len) const override;
};
class deflate_processor: public compressor {
public:
using compressor::compressor;
size_t uncompress(const char* input, size_t input_len, char* output,
size_t output_len) const override;
size_t compress(const char* input, size_t input_len, char* output,
size_t output_len) const override;
size_t compress_max_size(size_t input_len) const override;
};
compressor::compressor(sstring name)
: _name(std::move(name))
{}
std::set<sstring> compressor::option_names() const {
return {};
}
std::map<sstring, sstring> compressor::options() const {
return {};
}
shared_ptr<compressor> compressor::create(const sstring& name, const opt_getter& opts) {
if (name.empty()) {
return {};
}
qualified_name qn(namespace_prefix, name);
for (auto& c : { lz4, snappy, deflate }) {
if (c->name() == qn) {
return c;
}
}
return compressor_registry::create(qn, opts);
}
shared_ptr<compressor> compressor::create(const std::map<sstring, sstring>& options) {
auto i = options.find(compression_parameters::SSTABLE_COMPRESSION);
if (i != options.end() && !i->second.empty()) {
return create(i->second, [&options](const sstring& key) -> opt_string {
auto i = options.find(key);
if (i == options.end()) {
return std::experimental::nullopt;
}
return { i->second };
});
}
return {};
}
thread_local const shared_ptr<compressor> compressor::lz4 = make_shared<lz4_processor>(namespace_prefix + "LZ4Compressor");
thread_local const shared_ptr<compressor> compressor::snappy = make_shared<snappy_processor>(namespace_prefix + "SnappyCompressor");
thread_local const shared_ptr<compressor> compressor::deflate = make_shared<deflate_processor>(namespace_prefix + "DeflateCompressor");
const sstring compression_parameters::SSTABLE_COMPRESSION = "sstable_compression";
const sstring compression_parameters::CHUNK_LENGTH_KB = "chunk_length_kb";
const sstring compression_parameters::CRC_CHECK_CHANCE = "crc_check_chance";
compression_parameters::compression_parameters()
: compression_parameters(nullptr)
{}
compression_parameters::~compression_parameters()
{}
compression_parameters::compression_parameters(compressor_ptr c)
: _compressor(std::move(c))
{}
compression_parameters::compression_parameters(const std::map<sstring, sstring>& options) {
_compressor = compressor::create(options);
validate_options(options);
auto chunk_length = options.find(CHUNK_LENGTH_KB);
if (chunk_length != options.end()) {
try {
_chunk_length = std::stoi(chunk_length->second) * 1024;
} catch (const std::exception& e) {
throw exceptions::syntax_exception(sstring("Invalid integer value ") + chunk_length->second + " for " + CHUNK_LENGTH_KB);
}
}
auto crc_chance = options.find(CRC_CHECK_CHANCE);
if (crc_chance != options.end()) {
try {
_crc_check_chance = std::stod(crc_chance->second);
} catch (const std::exception& e) {
throw exceptions::syntax_exception(sstring("Invalid double value ") + crc_chance->second + " for " + CRC_CHECK_CHANCE);
}
}
}
void compression_parameters::validate() {
if (_chunk_length) {
auto chunk_length = _chunk_length.value();
if (chunk_length <= 0) {
throw exceptions::configuration_exception(sstring("Invalid negative or null ") + CHUNK_LENGTH_KB);
}
// _chunk_length must be a power of two
if (chunk_length & (chunk_length - 1)) {
throw exceptions::configuration_exception(sstring(CHUNK_LENGTH_KB) + " must be a power of 2.");
}
}
if (_crc_check_chance && (_crc_check_chance.value() < 0.0 || _crc_check_chance.value() > 1.0)) {
throw exceptions::configuration_exception(sstring(CRC_CHECK_CHANCE) + " must be between 0.0 and 1.0.");
}
}
std::map<sstring, sstring> compression_parameters::get_options() const {
if (!_compressor) {
return std::map<sstring, sstring>();
}
auto opts = _compressor->options();
opts.emplace(compression_parameters::SSTABLE_COMPRESSION, _compressor->name());
if (_chunk_length) {
opts.emplace(sstring(CHUNK_LENGTH_KB), std::to_string(_chunk_length.value() / 1024));
}
if (_crc_check_chance) {
opts.emplace(sstring(CRC_CHECK_CHANCE), std::to_string(_crc_check_chance.value()));
}
return opts;
}
bool compression_parameters::operator==(const compression_parameters& other) const {
return _compressor == other._compressor
&& _chunk_length == other._chunk_length
&& _crc_check_chance == other._crc_check_chance;
}
bool compression_parameters::operator!=(const compression_parameters& other) const {
return !(*this == other);
}
void compression_parameters::validate_options(const std::map<sstring, sstring>& options) {
// currently, there are no options specific to a particular compressor
static std::set<sstring> keywords({
sstring(SSTABLE_COMPRESSION),
sstring(CHUNK_LENGTH_KB),
sstring(CRC_CHECK_CHANCE),
});
std::set<sstring> ckw;
if (_compressor) {
ckw = _compressor->option_names();
}
for (auto&& opt : options) {
if (!keywords.count(opt.first) && !ckw.count(opt.first)) {
throw exceptions::configuration_exception(sprint("Unknown compression option '%s'.", opt.first));
}
}
}
size_t lz4_processor::uncompress(const char* input, size_t input_len,
char* output, size_t output_len) const {
// We use LZ4_decompress_safe(). According to the documentation, the
// function LZ4_decompress_fast() is slightly faster, but maliciously
// crafted compressed data can cause it to overflow the output buffer.
// Theoretically, our compressed data is created by us so is not malicious
// (and accidental corruption is avoided by the compressed-data checksum),
// but let's not take that chance for now, until we've actually measured
// the performance benefit that LZ4_decompress_fast() would bring.
// Cassandra's LZ4Compressor prepends to the chunk its uncompressed length
// in 4 bytes little-endian (!) order. We don't need this information -
// we already know the uncompressed data is at most the given chunk size
// (and usually is exactly that, except in the last chunk). The advance
// knowledge of the uncompressed size could be useful if we used
// LZ4_decompress_fast(), but we prefer LZ4_decompress_safe() anyway...
input += 4;
input_len -= 4;
auto ret = LZ4_decompress_safe(input, output, input_len, output_len);
if (ret < 0) {
throw std::runtime_error("LZ4 uncompression failure");
}
return ret;
}
size_t lz4_processor::compress(const char* input, size_t input_len,
char* output, size_t output_len) const {
if (output_len < LZ4_COMPRESSBOUND(input_len) + 4) {
throw std::runtime_error("LZ4 compression failure: length of output is too small");
}
// Write input_len (32-bit data) to beginning of output in little-endian representation.
output[0] = input_len & 0xFF;
output[1] = (input_len >> 8) & 0xFF;
output[2] = (input_len >> 16) & 0xFF;
output[3] = (input_len >> 24) & 0xFF;
#ifdef SEASTAR_HAVE_LZ4_COMPRESS_DEFAULT
auto ret = LZ4_compress_default(input, output + 4, input_len, LZ4_compressBound(input_len));
#else
auto ret = LZ4_compress(input, output + 4, input_len);
#endif
if (ret == 0) {
throw std::runtime_error("LZ4 compression failure: LZ4_compress() failed");
}
return ret + 4;
}
size_t lz4_processor::compress_max_size(size_t input_len) const {
return LZ4_COMPRESSBOUND(input_len) + 4;
}
size_t deflate_processor::uncompress(const char* input,
size_t input_len, char* output, size_t output_len) const {
z_stream zs;
zs.zalloc = Z_NULL;
zs.zfree = Z_NULL;
zs.opaque = Z_NULL;
zs.avail_in = 0;
zs.next_in = Z_NULL;
if (inflateInit(&zs) != Z_OK) {
throw std::runtime_error("deflate uncompression init failure");
}
// yuck, zlib is not const-correct, and also uses unsigned char while we use char :-(
zs.next_in = reinterpret_cast<unsigned char*>(const_cast<char*>(input));
zs.avail_in = input_len;
zs.next_out = reinterpret_cast<unsigned char*>(output);
zs.avail_out = output_len;
auto res = inflate(&zs, Z_FINISH);
inflateEnd(&zs);
if (res == Z_STREAM_END) {
return output_len - zs.avail_out;
} else {
throw std::runtime_error("deflate uncompression failure");
}
}
size_t deflate_processor::compress(const char* input,
size_t input_len, char* output, size_t output_len) const {
z_stream zs;
zs.zalloc = Z_NULL;
zs.zfree = Z_NULL;
zs.opaque = Z_NULL;
zs.avail_in = 0;
zs.next_in = Z_NULL;
if (deflateInit(&zs, Z_DEFAULT_COMPRESSION) != Z_OK) {
throw std::runtime_error("deflate compression init failure");
}
zs.next_in = reinterpret_cast<unsigned char*>(const_cast<char*>(input));
zs.avail_in = input_len;
zs.next_out = reinterpret_cast<unsigned char*>(output);
zs.avail_out = output_len;
auto res = ::deflate(&zs, Z_FINISH);
deflateEnd(&zs);
if (res == Z_STREAM_END) {
return output_len - zs.avail_out;
} else {
throw std::runtime_error("deflate compression failure");
}
}
size_t deflate_processor::compress_max_size(size_t input_len) const {
z_stream zs;
zs.zalloc = Z_NULL;
zs.zfree = Z_NULL;
zs.opaque = Z_NULL;
zs.avail_in = 0;
zs.next_in = Z_NULL;
if (deflateInit(&zs, Z_DEFAULT_COMPRESSION) != Z_OK) {
throw std::runtime_error("deflate compression init failure");
}
auto res = deflateBound(&zs, input_len);
deflateEnd(&zs);
return res;
}
size_t snappy_processor::uncompress(const char* input, size_t input_len,
char* output, size_t output_len) const {
if (snappy_uncompress(input, input_len, output, &output_len)
== SNAPPY_OK) {
return output_len;
} else {
throw std::runtime_error("snappy uncompression failure");
}
}
size_t snappy_processor::compress(const char* input, size_t input_len,
char* output, size_t output_len) const {
auto ret = snappy_compress(input, input_len, output, &output_len);
if (ret != SNAPPY_OK) {
throw std::runtime_error("snappy compression failure: snappy_compress() failed");
}
return output_len;
}
size_t snappy_processor::compress_max_size(size_t input_len) const {
return snappy_max_compressed_length(input_len);
}
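For readers following the LZ4 path in compress.cc, the chunk layout can be sketched in a few lines of Python. This is only an illustration of the framing described in the comments above (a 4-byte little-endian uncompressed length followed by the compressed payload). The helper names `add_length_prefix` and `split_length_prefix` are hypothetical, and the payload is treated as opaque bytes rather than real LZ4 output.

```python
import struct
from typing import Tuple

PREFIX_LEN = 4  # bytes reserved for the uncompressed length, as in lz4_processor


def add_length_prefix(uncompressed_len: int, payload: bytes) -> bytes:
    """Prepend the uncompressed length as 32-bit little-endian, byte by byte."""
    if not 0 <= uncompressed_len <= 0xFFFFFFFF:
        raise ValueError("length does not fit in 32 bits")
    # Same shifts and masks as the output[0..3] assignments in the C++ code.
    prefix = bytes([
        uncompressed_len & 0xFF,
        (uncompressed_len >> 8) & 0xFF,
        (uncompressed_len >> 16) & 0xFF,
        (uncompressed_len >> 24) & 0xFF,
    ])
    return prefix + payload


def split_length_prefix(chunk: bytes) -> Tuple[int, bytes]:
    """Return (uncompressed_len, payload); reject chunks shorter than the prefix."""
    if len(chunk) < PREFIX_LEN:
        raise ValueError("chunk too short to hold the length prefix")
    (uncompressed_len,) = struct.unpack_from("<I", chunk)
    return uncompressed_len, chunk[PREFIX_LEN:]
```

The C++ reader simply skips the prefix (`input += 4`) because the chunk size already bounds the output; the sketch parses it only to make the byte order visible.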


@@ -21,135 +21,103 @@
#pragma once
#include "exceptions/exceptions.hh"
#include <map>
#include <set>
enum class compressor {
none,
lz4,
snappy,
deflate,
#include <seastar/core/future.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/sstring.hh>
#include "exceptions/exceptions.hh"
#include "stdx.hh"
class compressor {
sstring _name;
public:
compressor(sstring);
virtual ~compressor() {}
/**
* Unpacks data in "input" to "output". If output_len is too small, an
* exception is thrown, so callers should keep track of the uncompressed size.
*/
virtual size_t uncompress(const char* input, size_t input_len, char* output,
size_t output_len) const = 0;
/**
* Packs data in "input" to "output". If output_len is too small, an
* exception is thrown. The maximum required size is given by "compress_max_size".
*/
virtual size_t compress(const char* input, size_t input_len, char* output,
size_t output_len) const = 0;
/**
* Returns the maximum output size needed to compress "input_len" bytes of input.
*/
virtual size_t compress_max_size(size_t input_len) const = 0;
/**
* Returns accepted option names for this compressor
*/
virtual std::set<sstring> option_names() const;
/**
* Returns original options used in instantiating this compressor
*/
virtual std::map<sstring, sstring> options() const;
/**
* Compressor class name.
*/
const sstring& name() const {
return _name;
}
// to cheaply bridge sstable compression options / maps
using opt_string = stdx::optional<sstring>;
using opt_getter = std::function<opt_string(const sstring&)>;
static shared_ptr<compressor> create(const sstring& name, const opt_getter&);
static shared_ptr<compressor> create(const std::map<sstring, sstring>&);
static thread_local const shared_ptr<compressor> lz4;
static thread_local const shared_ptr<compressor> snappy;
static thread_local const shared_ptr<compressor> deflate;
static const sstring namespace_prefix;
};
template<typename BaseType, typename... Args>
class class_registry;
using compressor_ptr = shared_ptr<compressor>;
using compressor_registry = class_registry<compressor_ptr, const typename compressor::opt_getter&>;
class compression_parameters {
public:
static constexpr int32_t DEFAULT_CHUNK_LENGTH = 4 * 1024;
static constexpr double DEFAULT_CRC_CHECK_CHANCE = 1.0;
static constexpr auto SSTABLE_COMPRESSION = "sstable_compression";
static constexpr auto CHUNK_LENGTH_KB = "chunk_length_kb";
static constexpr auto CRC_CHECK_CHANCE = "crc_check_chance";
static const sstring SSTABLE_COMPRESSION;
static const sstring CHUNK_LENGTH_KB;
static const sstring CRC_CHECK_CHANCE;
private:
compressor _compressor;
compressor_ptr _compressor;
std::experimental::optional<int> _chunk_length;
std::experimental::optional<double> _crc_check_chance;
public:
compression_parameters(compressor c = compressor::lz4) : _compressor(c) { }
compression_parameters(const std::map<sstring, sstring>& options) {
validate_options(options);
compression_parameters();
compression_parameters(compressor_ptr);
compression_parameters(const std::map<sstring, sstring>& options);
~compression_parameters();
auto it = options.find(SSTABLE_COMPRESSION);
if (it == options.end() || it->second.empty()) {
_compressor = compressor::none;
return;
}
const auto& compressor_class = it->second;
if (is_compressor_class(compressor_class, "LZ4Compressor")) {
_compressor = compressor::lz4;
} else if (is_compressor_class(compressor_class, "SnappyCompressor")) {
_compressor = compressor::snappy;
} else if (is_compressor_class(compressor_class, "DeflateCompressor")) {
_compressor = compressor::deflate;
} else {
throw exceptions::configuration_exception(sstring("Unsupported compression class '") + compressor_class + "'.");
}
auto chunk_length = options.find(CHUNK_LENGTH_KB);
if (chunk_length != options.end()) {
try {
_chunk_length = std::stoi(chunk_length->second) * 1024;
} catch (const std::exception& e) {
throw exceptions::syntax_exception(sstring("Invalid integer value ") + chunk_length->second + " for " + CHUNK_LENGTH_KB);
}
}
auto crc_chance = options.find(CRC_CHECK_CHANCE);
if (crc_chance != options.end()) {
try {
_crc_check_chance = std::stod(crc_chance->second);
} catch (const std::exception& e) {
throw exceptions::syntax_exception(sstring("Invalid double value ") + crc_chance->second + "for " + CRC_CHECK_CHANCE);
}
}
}
compressor get_compressor() const { return _compressor; }
compressor_ptr get_compressor() const { return _compressor; }
int32_t chunk_length() const { return _chunk_length.value_or(int(DEFAULT_CHUNK_LENGTH)); }
double crc_check_chance() const { return _crc_check_chance.value_or(double(DEFAULT_CRC_CHECK_CHANCE)); }
void validate() {
if (_chunk_length) {
auto chunk_length = _chunk_length.value();
if (chunk_length <= 0) {
throw exceptions::configuration_exception(sstring("Invalid negative or null ") + CHUNK_LENGTH_KB);
}
// _chunk_length must be a power of two
if (chunk_length & (chunk_length - 1)) {
throw exceptions::configuration_exception(sstring(CHUNK_LENGTH_KB) + " must be a power of 2.");
}
}
if (_crc_check_chance && (_crc_check_chance.value() < 0.0 || _crc_check_chance.value() > 1.0)) {
throw exceptions::configuration_exception(sstring(CRC_CHECK_CHANCE) + " must be between 0.0 and 1.0.");
}
}
std::map<sstring, sstring> get_options() const {
if (_compressor == compressor::none) {
return std::map<sstring, sstring>();
}
std::map<sstring, sstring> opts;
opts.emplace(sstring(SSTABLE_COMPRESSION), compressor_name());
if (_chunk_length) {
opts.emplace(sstring(CHUNK_LENGTH_KB), std::to_string(_chunk_length.value() / 1024));
}
if (_crc_check_chance) {
opts.emplace(sstring(CRC_CHECK_CHANCE), std::to_string(_crc_check_chance.value()));
}
return opts;
}
bool operator==(const compression_parameters& other) const {
return _compressor == other._compressor
&& _chunk_length == other._chunk_length
&& _crc_check_chance == other._crc_check_chance;
}
bool operator!=(const compression_parameters& other) const {
return !(*this == other);
}
void validate();
std::map<sstring, sstring> get_options() const;
bool operator==(const compression_parameters& other) const;
bool operator!=(const compression_parameters& other) const;
private:
void validate_options(const std::map<sstring, sstring>& options) {
// currently, there are no options specific to a particular compressor
static std::set<sstring> keywords({
sstring(SSTABLE_COMPRESSION),
sstring(CHUNK_LENGTH_KB),
sstring(CRC_CHECK_CHANCE),
});
for (auto&& opt : options) {
if (!keywords.count(opt.first)) {
throw exceptions::configuration_exception(sprint("Unknown compression option '%s'.", opt.first));
}
}
}
bool is_compressor_class(const sstring& value, const sstring& class_name) {
static const sstring namespace_prefix = "org.apache.cassandra.io.compress.";
return value == class_name || value == namespace_prefix + class_name;
}
sstring compressor_name() const {
switch (_compressor) {
case compressor::lz4:
return "org.apache.cassandra.io.compress.LZ4Compressor";
case compressor::snappy:
return "org.apache.cassandra.io.compress.SnappyCompressor";
case compressor::deflate:
return "org.apache.cassandra.io.compress.DeflateCompressor";
default:
abort();
}
}
void validate_options(const std::map<sstring, sstring>&);
};
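The option handling declared in this header can be summarized with a small Python sketch. It mirrors the checks visible in this diff: `chunk_length_kb` is converted to bytes and must be a positive power of two, and `crc_check_chance` must lie between 0.0 and 1.0. The function name `parse_compression_options` and the use of `ValueError` are assumptions made for illustration, not Scylla's API. The sketch also merges parsing and validation into one step and ignores compressor-specific option names.

```python
from typing import Dict, Optional, Tuple

KNOWN_OPTIONS = {"sstable_compression", "chunk_length_kb", "crc_check_chance"}


def parse_compression_options(
        options: Dict[str, str]) -> Tuple[Optional[int], Optional[float]]:
    """Return (chunk_length_bytes, crc_check_chance); None means 'not set'."""
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"Unknown compression option '{sorted(unknown)[0]}'.")

    chunk_length = None
    if "chunk_length_kb" in options:
        chunk_length = int(options["chunk_length_kb"]) * 1024
        if chunk_length <= 0:
            raise ValueError("Invalid negative or null chunk_length_kb")
        # A power of two has exactly one bit set, so n & (n - 1) is zero.
        if chunk_length & (chunk_length - 1):
            raise ValueError("chunk_length_kb must be a power of 2.")

    crc_chance = None
    if "crc_check_chance" in options:
        crc_chance = float(options["crc_check_chance"])
        if not 0.0 <= crc_chance <= 1.0:
            raise ValueError("crc_check_chance must be between 0.0 and 1.0.")

    return chunk_length, crc_chance
```

When an option is absent, the C++ accessors fall back to `DEFAULT_CHUNK_LENGTH` and `DEFAULT_CRC_CHECK_CHANCE`; the sketch leaves that to the caller by returning `None`.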


@@ -14,7 +14,7 @@
# one logical cluster from joining another.
# It is recommended to change the default value when creating a new cluster.
# You can NOT modify this value for an existing cluster
#cluster_name: 'ScyllaDB Cluster'
#cluster_name: 'Test Cluster'
# This defines the number of tokens randomly assigned to this node on the ring
# The more tokens, relative to other nodes, the larger the proportion of data
@@ -87,6 +87,13 @@ listen_address: localhost
# Leaving this blank will set it to the same value as listen_address
# broadcast_address: 1.2.3.4
# When using multiple physical network interfaces, set this to true to listen on broadcast_address
# in addition to the listen_address, allowing nodes to communicate on both interfaces.
# Ignore this property if the network configuration automatically routes between the public and private networks, as on EC2.
#
# listen_on_broadcast_address: false
# port for the CQL native transport to listen for clients on
# For security reasons, you should not expose this port to the internet. Firewall it if needed.
native_transport_port: 9042
@@ -100,13 +107,6 @@ native_transport_port: 9042
# keeping native_transport_port unencrypted.
#native_transport_port_ssl: 9142
# Throttles all outbound streaming file transfers on this node to the
# given total throughput in Mbps. This is necessary because Scylla does
# mostly sequential IO when streaming data during bootstrap or repair, which
# can lead to saturating the network connection and degrading rpc performance.
# When unset, the default is 200 Mbps or 25 MB/s.
# stream_throughput_outbound_megabits_per_sec: 200
# How long the coordinator should wait for read operations to complete
read_request_timeout_in_ms: 5000
@@ -240,9 +240,8 @@ batch_size_fail_threshold_in_kb: 50
# Uncomment to enable experimental features
# experimental: true
###################################################
## Not currently supported, reserved for future use
###################################################
# The directory where hints files are stored if hinted handoff is enabled.
# hints_directory: /var/lib/scylla/hints
# See http://wiki.apache.org/cassandra/HintedHandoff
# May either be "true" or "false" to enable globally, or contain a list
@@ -266,6 +265,10 @@ batch_size_fail_threshold_in_kb: 50
# cross-dc handoff tends to be slower
# max_hints_delivery_threads: 2
###################################################
## Not currently supported, reserved for future use
###################################################
# Maximum throttle in KBs per second, total. This will be
# reduced proportionally to the number of nodes in the cluster.
# batchlog_replay_throttle_in_kb: 1024


@@ -20,9 +20,11 @@
# along with Scylla. If not, see <http://www.gnu.org/licenses/>.
#
import os, os.path, textwrap, argparse, sys, shlex, subprocess, tempfile, re
import os, os.path, textwrap, argparse, sys, shlex, subprocess, tempfile, re, platform
from distutils.spawn import find_executable
tempfile.tempdir = "./build/tmp"
configure_args = str.join(' ', [shlex.quote(x) for x in sys.argv[1:]])
for line in open('/etc/os-release'):
@@ -83,17 +85,33 @@ def pkg_config(option, package):
return output.decode('utf-8').strip()
def try_compile(compiler, source = '', flags = []):
with tempfile.NamedTemporaryFile() as sfile:
sfile.file.write(bytes(source, 'utf-8'))
sfile.file.flush()
return subprocess.call([compiler, '-x', 'c++', '-o', '/dev/null', '-c', sfile.name] + args.user_cflags.split() + flags,
stdout = subprocess.DEVNULL,
stderr = subprocess.DEVNULL) == 0
return try_compile_and_link(compiler, source, flags = flags + ['-c'])
def warning_supported(warning, compiler):
def ensure_tmp_dir_exists():
if not os.path.exists(tempfile.tempdir):
os.makedirs(tempfile.tempdir)
def try_compile_and_link(compiler, source = '', flags = []):
ensure_tmp_dir_exists()
with tempfile.NamedTemporaryFile() as sfile:
ofile = tempfile.mktemp()
try:
sfile.file.write(bytes(source, 'utf-8'))
sfile.file.flush()
# We can't write to /dev/null, since in some cases (-ftest-coverage) gcc will create an auxiliary
# output file based on the name of the output file, and "/dev/null.gcsa" is not a good name
return subprocess.call([compiler, '-x', 'c++', '-o', ofile, sfile.name] + args.user_cflags.split() + flags,
stdout = subprocess.DEVNULL,
stderr = subprocess.DEVNULL) == 0
finally:
if os.path.exists(ofile):
os.unlink(ofile)
def flag_supported(flag, compiler):
# gcc ignores -Wno-x even if it is not supported
adjusted = re.sub('^-Wno-', '-W', warning)
return try_compile(flags = ['-Werror', adjusted], compiler = compiler)
adjusted = re.sub('^-Wno-', '-W', flag)
split = adjusted.split(' ')
return try_compile(flags = ['-Werror'] + split, compiler = compiler)
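The flag probing in flag_supported() relies on a small normalization step that is easy to miss, sketched here in isolation. gcc silently accepts an unsupported `-Wno-x` option, so the probe turns `-Wno-x` into `-Wx` and compiles with `-Werror`. The helper name `probe_flags` is hypothetical, and no compiler is invoked in this sketch.

```python
import re


def probe_flags(flag: str) -> list:
    """Build the flag list a compile probe would use for `flag`."""
    # Test the positive form: gcc does not reject an unsupported -Wno-x.
    adjusted = re.sub(r'^-Wno-', '-W', flag)
    # A single argument may carry several space-separated flags.
    return ['-Werror'] + adjusted.split(' ')
```

Splitting on spaces is what lets one logical option made of several words reach the compiler as separate arguments.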
def debug_flag(compiler):
src_with_auto = textwrap.dedent('''\
@@ -108,6 +126,14 @@ def debug_flag(compiler):
print('Note: debug information disabled; upgrade your compiler')
return ''
def gold_supported(compiler):
src_main = 'int main(int argc, char **argv) { return 0; }'
if try_compile_and_link(source = src_main, flags = ['-fuse-ld=gold'], compiler = compiler):
return '-fuse-ld=gold'
else:
print('Note: gold not found; using default system linker')
return ''
def maybe_static(flag, libs):
if flag and not args.static:
libs = '-Wl,-Bstatic {} -Wl,-Bdynamic'.format(libs)
@@ -133,6 +159,13 @@ class Thrift(object):
def endswith(self, end):
return self.source.endswith(end)
def default_target_arch():
mach = platform.machine()
if mach in ['i386', 'i686', 'x86_64']:
return 'nehalem'
else:
return ''
class Antlr3Grammar(object):
def __init__(self, source):
self.source = source
@@ -154,13 +187,13 @@ modes = {
'debug': {
'sanitize': '-fsanitize=address -fsanitize=leak -fsanitize=undefined',
'sanitize_libs': '-lasan -lubsan',
'opt': '-O0 -DDEBUG -DDEBUG_SHARED_PTR -DDEFAULT_ALLOCATOR',
'opt': '-O0 -DDEBUG -DDEBUG_SHARED_PTR -DDEFAULT_ALLOCATOR -DDEBUG_LSA_SANITIZER',
'libs': '',
},
'release': {
'sanitize': '',
'sanitize_libs': '',
'opt': '-O2',
'opt': '-O3',
'libs': '',
},
}
@@ -168,7 +201,7 @@ modes = {
scylla_tests = [
'tests/mutation_test',
'tests/mvcc_test',
'tests/streamed_mutation_test',
'tests/mutation_fragment_test',
'tests/flat_mutation_reader_test',
'tests/schema_registry_test',
'tests/canonical_mutation_test',
@@ -178,6 +211,7 @@ scylla_tests = [
'tests/partitioner_test',
'tests/frozen_mutation_test',
'tests/serialized_action_test',
'tests/hint_test',
'tests/clustering_ranges_walker_test',
'tests/perf/perf_mutation',
'tests/lsa_async_eviction_test',
@@ -189,11 +223,12 @@ scylla_tests = [
'tests/perf/perf_simple_query',
'tests/perf/perf_fast_forward',
'tests/perf/perf_cache_eviction',
'tests/cache_streamed_mutation_test',
'tests/cache_flat_mutation_reader_test',
'tests/row_cache_stress_test',
'tests/memory_footprint',
'tests/perf/perf_sstable',
'tests/cql_query_test',
'tests/secondary_index_test',
'tests/storage_proxy_test',
'tests/schema_change_test',
'tests/mutation_reader_test',
@@ -201,6 +236,7 @@ scylla_tests = [
'tests/row_cache_test',
'tests/test-serialization',
'tests/sstable_test',
'tests/sstable_3_x_test',
'tests/sstable_mutation_test',
'tests/sstable_resharding_test',
'tests/memtable_test',
@@ -215,6 +251,7 @@ scylla_tests = [
'tests/config_test',
'tests/gossiping_property_file_snitch_test',
'tests/ec2_snitch_test',
'tests/gce_snitch_test',
'tests/snitch_reset_test',
'tests/network_topology_strategy_test',
'tests/query_processor_test',
@@ -236,27 +273,50 @@ scylla_tests = [
'tests/database_test',
'tests/nonwrapping_range_test',
'tests/input_stream_test',
'tests/sstable_atomic_deletion_test',
'tests/virtual_reader_test',
'tests/view_schema_test',
'tests/view_build_test',
'tests/view_complex_test',
'tests/counter_test',
'tests/cell_locker_test',
'tests/row_locker_test',
'tests/streaming_histogram_test',
'tests/duration_test',
'tests/vint_serialization_test',
'tests/continuous_data_consumer_test',
'tests/compress_test',
'tests/chunked_vector_test',
'tests/loading_cache_test',
'tests/castas_fcts_test',
'tests/big_decimal_test',
'tests/aggregate_fcts_test',
'tests/role_manager_test',
'tests/caching_options_test',
'tests/auth_resource_test',
'tests/cql_auth_query_test',
'tests/enum_set_test',
'tests/extensions_test',
'tests/cql_auth_syntax_test',
'tests/querier_cache',
'tests/limiting_data_source_test',
'tests/meta_test',
'tests/imr_test',
'tests/partition_data_test',
'tests/reusable_buffer_test',
'tests/json_test'
]
perf_tests = [
'tests/perf/perf_mutation_readers',
'tests/perf/perf_mutation_fragment',
'tests/perf/perf_idl',
]
apps = [
'scylla',
]
tests = scylla_tests
tests = scylla_tests + perf_tests
other = [
'iotune',
@@ -278,6 +338,8 @@ arg_parser.add_argument('--cflags', action = 'store', dest = 'user_cflags', defa
help = 'Extra flags for the C++ compiler')
arg_parser.add_argument('--ldflags', action = 'store', dest = 'user_ldflags', default = '',
help = 'Extra flags for the linker')
arg_parser.add_argument('--target', action = 'store', dest = 'target', default = default_target_arch(),
help = 'Target architecture (-march)')
arg_parser.add_argument('--compiler', action = 'store', dest = 'cxx', default = 'g++',
help = 'C++ compiler path')
arg_parser.add_argument('--c-compiler', action='store', dest='cc', default='gcc',
@@ -296,6 +358,8 @@ arg_parser.add_argument('--static-thrift', dest = 'staticthrift', action = 'stor
help = 'Link libthrift statically')
arg_parser.add_argument('--static-boost', dest = 'staticboost', action = 'store_true',
help = 'Link boost statically')
arg_parser.add_argument('--static-yaml-cpp', dest = 'staticyamlcpp', action = 'store_true',
help = 'Link libyaml-cpp statically')
arg_parser.add_argument('--tests-debuginfo', action = 'store', dest = 'tests_debuginfo', type = int, default = 0,
help = 'Enable(1)/disable(0) compiler debug information generation for tests')
arg_parser.add_argument('--python', action = 'store', dest = 'python', default = 'python3',
@@ -306,6 +370,10 @@ arg_parser.add_argument('--enable-gcc6-concepts', dest='gcc6_concepts', action='
help='enable experimental support for C++ Concepts as implemented in GCC 6')
arg_parser.add_argument('--enable-alloc-failure-injector', dest='alloc_failure_injector', action='store_true', default=False,
help='enable allocation failure injection')
arg_parser.add_argument('--with-antlr3', dest='antlr3_exec', action='store', default=None,
help='path to antlr3 executable')
arg_parser.add_argument('--with-ragel', dest='ragel_exec', action='store', default=None,
help='path to ragel executable')
args = arg_parser.parse_args()
defines = []
@@ -315,42 +383,51 @@ extra_cxxflags = {}
cassandra_interface = Thrift(source = 'interface/cassandra.thrift', service = 'Cassandra')
scylla_core = (['database.cc',
'atomic_cell.cc',
'schema.cc',
'frozen_schema.cc',
'schema_registry.cc',
'bytes.cc',
'mutation.cc',
'streamed_mutation.cc',
'mutation_fragment.cc',
'partition_version.cc',
'row_cache.cc',
'canonical_mutation.cc',
'frozen_mutation.cc',
'memtable.cc',
'schema_mutations.cc',
'release.cc',
'supervisor.cc',
'utils/logalloc.cc',
'utils/large_bitset.cc',
'utils/buffer_input_stream.cc',
'utils/limiting_data_source.cc',
'mutation_partition.cc',
'mutation_partition_view.cc',
'mutation_partition_serializer.cc',
'mutation_reader.cc',
'flat_mutation_reader.cc',
'mutation_query.cc',
'json.cc',
'keys.cc',
'counters.cc',
'counters.cc',
'compress.cc',
'sstables/mp_row_consumer.cc',
'sstables/sstables.cc',
'sstables/sstable_version.cc',
'sstables/compress.cc',
'sstables/row.cc',
'sstables/partition.cc',
'sstables/compaction.cc',
'sstables/compaction_strategy.cc',
'sstables/compaction_manager.cc',
'sstables/atomic_deletion.cc',
'sstables/integrity_checked_file_impl.cc',
'sstables/prepended_input_stream.cc',
'sstables/m_format_write_helpers.cc',
'sstables/m_format_read_helpers.cc',
'transport/event.cc',
'transport/event_notifier.cc',
'transport/server.cc',
'transport/messages/result_message.cc',
'cql3/abstract_marker.cc',
'cql3/attributes.cc',
'cql3/cf_name.cc',
@@ -370,7 +447,6 @@ scylla_core = (['database.cc',
'cql3/statements/create_table_statement.cc',
'cql3/statements/create_view_statement.cc',
'cql3/statements/create_type_statement.cc',
'cql3/statements/create_user_statement.cc',
'cql3/statements/drop_index_statement.cc',
'cql3/statements/drop_keyspace_statement.cc',
'cql3/statements/drop_table_statement.cc',
@@ -392,8 +468,6 @@ scylla_core = (['database.cc',
'cql3/statements/truncate_statement.cc',
'cql3/statements/alter_table_statement.cc',
'cql3/statements/alter_view_statement.cc',
'cql3/statements/alter_user_statement.cc',
'cql3/statements/drop_user_statement.cc',
'cql3/statements/list_users_statement.cc',
'cql3/statements/authorization_statement.cc',
'cql3/statements/permission_altering_statement.cc',
@@ -402,9 +476,10 @@ scylla_core = (['database.cc',
'cql3/statements/revoke_statement.cc',
'cql3/statements/alter_type_statement.cc',
'cql3/statements/alter_keyspace_statement.cc',
'cql3/statements/role-management-statements.cc',
'cql3/update_parameters.cc',
'cql3/ut_name.cc',
'cql3/user_options.cc',
'cql3/role_name.cc',
'thrift/handler.cc',
'thrift/server.cc',
'thrift/thrift_validation.cc',
@@ -440,21 +515,26 @@ scylla_core = (['database.cc',
'cql3/variable_specifications.cc',
'db/consistency_level.cc',
'db/system_keyspace.cc',
'db/system_distributed_keyspace.cc',
'db/size_estimates_virtual_reader.cc',
'db/schema_tables.cc',
'db/cql_type_parser.cc',
'db/legacy_schema_migrator.cc',
'db/commitlog/commitlog.cc',
'db/commitlog/commitlog_replayer.cc',
'db/commitlog/commitlog_entry.cc',
'db/hints/manager.cc',
'db/hints/resource_manager.cc',
'db/config.cc',
'db/extensions.cc',
'db/heat_load_balance.cc',
'db/index/secondary_index.cc',
'db/large_partition_handler.cc',
'db/marshal/type_parser.cc',
'db/batchlog_manager.cc',
'db/view/view.cc',
'db/view/row_locking.cc',
'index/secondary_index_manager.cc',
'io/io.cc',
'utils/utils.cc',
'index/secondary_index.cc',
'utils/UUID_gen.cc',
'utils/i_filter.cc',
'utils/bloom_filter.cc',
@@ -490,7 +570,6 @@ scylla_core = (['database.cc',
'locator/network_topology_strategy.cc',
'locator/everywhere_replication_strategy.cc',
'locator/token_metadata.cc',
'locator/locator.cc',
'locator/snitch_base.cc',
'locator/simple_snitch.cc',
'locator/rack_inferring_snitch.cc',
@@ -498,6 +577,7 @@ scylla_core = (['database.cc',
'locator/production_snitch_base.cc',
'locator/ec2_snitch.cc',
'locator/ec2_multi_region_snitch.cc',
'locator/gce_snitch.cc',
'message/messaging_service.cc',
'service/client_state.cc',
'service/migration_task.cc',
@@ -530,12 +610,16 @@ scylla_core = (['database.cc',
'auth/authenticator.cc',
'auth/common.cc',
'auth/default_authorizer.cc',
'auth/data_resource.cc',
'auth/resource.cc',
'auth/roles-metadata.cc',
'auth/password_authenticator.cc',
'auth/permission.cc',
'auth/permissions_cache.cc',
'auth/service.cc',
'auth/standard_role_manager.cc',
'auth/transitional.cc',
'auth/authentication_options.cc',
'auth/role_or_anonymous.cc',
'tracing/tracing.cc',
'tracing/trace_keyspace_helper.cc',
'tracing/trace_state.cc',
@@ -545,6 +629,9 @@ scylla_core = (['database.cc',
'disk-error-handler.cc',
'duration.cc',
'vint-serialization.cc',
'utils/arch/powerpc/crc32-vpmsum/crc32_wrapper.cc',
'querier.cc',
'data/cell.cc',
]
+ [Antlr3Grammar('cql3/Cql.g')]
+ [Thrift('interface/cassandra.thrift', 'Cassandra')]
@@ -581,7 +668,9 @@ api = ['api/api.cc',
'api/api-doc/stream_manager.json',
'api/stream_manager.cc',
'api/api-doc/system.json',
'api/system.cc'
'api/system.cc',
'api/config.cc',
'api/api-doc/config.json',
]
idls = ['idl/gossip_digest.idl.hh',
@@ -609,7 +698,7 @@ idls = ['idl/gossip_digest.idl.hh',
'idl/cache_temperature.idl.hh',
]
scylla_tests_dependencies = scylla_core + api + idls + [
scylla_tests_dependencies = scylla_core + idls + [
'tests/cql_test_env.cc',
'tests/cql_assertions.cc',
'tests/result_set_assertions.cc',
@@ -622,7 +711,7 @@ scylla_tests_seastar_deps = [
]
deps = {
'scylla': idls + ['main.cc'] + scylla_core + api,
'scylla': idls + ['main.cc', 'release.cc'] + scylla_core + api,
}
pure_boost_tests = set([
@@ -646,6 +735,15 @@ pure_boost_tests = set([
'tests/compress_test',
'tests/chunked_vector_test',
'tests/big_decimal_test',
'tests/caching_options_test',
'tests/auth_resource_test',
'tests/enum_set_test',
'tests/cql_auth_syntax_test',
'tests/meta_test',
'tests/imr_test',
'tests/partition_data_test',
'tests/reusable_buffer_test',
'tests/json_test',
])
tests_not_using_seastar_test_framework = set([
@@ -676,7 +774,14 @@ for t in scylla_tests:
deps[t] += scylla_tests_dependencies
deps[t] += scylla_tests_seastar_deps
else:
deps[t] += scylla_core + api + idls + ['tests/cql_test_env.cc']
deps[t] += scylla_core + idls + ['tests/cql_test_env.cc']
perf_tests_seastar_deps = [
'seastar/tests/perf/perf_tests.cc'
]
for t in perf_tests:
deps[t] = [t + '.cc'] + scylla_tests_dependencies + perf_tests_seastar_deps
deps['tests/sstable_test'] += ['tests/sstable_datafile_test.cc', 'tests/sstable_utils.cc']
deps['tests/mutation_reader_test'] += ['tests/sstable_utils.cc']
@@ -688,6 +793,10 @@ deps['tests/murmur_hash_test'] = ['bytes.cc', 'utils/murmur_hash.cc', 'tests/mur
deps['tests/allocation_strategy_test'] = ['tests/allocation_strategy_test.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['tests/log_heap_test'] = ['tests/log_heap_test.cc']
deps['tests/anchorless_list_test'] = ['tests/anchorless_list_test.cc']
deps['tests/perf/perf_fast_forward'] += ['release.cc']
deps['tests/meta_test'] = ['tests/meta_test.cc']
deps['tests/imr_test'] = ['tests/imr_test.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['tests/reusable_buffer_test'] = ['tests/reusable_buffer_test.cc']
warnings = [
'-Wno-mismatched-tags', # clang-only
@@ -703,14 +812,25 @@ warnings = [
'-Wno-misleading-indentation',
'-Wno-overflow',
'-Wno-noexcept-type',
'-Wno-nonnull-compare'
]
warnings = [w
for w in warnings
if warning_supported(warning = w, compiler = args.cxx)]
if flag_supported(flag = w, compiler = args.cxx)]
warnings = ' '.join(warnings + ['-Wno-error=deprecated-declarations'])
optimization_flags = [
'--param inline-unit-growth=300',
]
optimization_flags = [o
for o in optimization_flags
if flag_supported(flag = o, compiler = args.cxx)]
modes['release']['opt'] += ' ' + ' '.join(optimization_flags)
gold_linker_flag = gold_supported(compiler = args.cxx)
dbgflag = debug_flag(args.cxx) if args.debuginfo else ''
tests_link_rule = 'link' if args.tests_debuginfo else 'link_stripped'
@@ -750,6 +870,22 @@ for pkglist in optional_packages:
alternatives = ':'.join(pkglist[1:])
print('Missing optional package {pkglist[0]} (or alternatives {alternatives})'.format(**locals()))
compiler_test_src = '''
#if __GNUC__ < 7
#error "MAJOR"
#elif __GNUC__ == 7
#if __GNUC_MINOR__ < 3
#error "MINOR"
#endif
#endif
int main() { return 0; }
'''
if not try_compile_and_link(compiler=args.cxx, source=compiler_test_src):
print('Wrong GCC version. Scylla needs GCC >= 7.3 to compile.')
sys.exit(1)
if not try_compile(compiler=args.cxx, source='#include <boost/version.hpp>'):
print('Boost not installed. Please install {}.'.format(pkgname("boost-devel")))
sys.exit(1)
@@ -798,14 +934,20 @@ if args.staticcxx:
seastar_flags += ['--static-stdc++']
if args.staticboost:
seastar_flags += ['--static-boost']
if args.staticyamlcpp:
seastar_flags += ['--static-yaml-cpp']
if args.gcc6_concepts:
seastar_flags += ['--enable-gcc6-concepts']
if args.alloc_failure_injector:
seastar_flags += ['--enable-alloc-failure-injector']
seastar_cflags = args.user_cflags + " -march=nehalem"
seastar_cflags = args.user_cflags
if args.target != '':
seastar_cflags += ' -march=' + args.target
seastar_ldflags = args.user_ldflags
seastar_flags += ['--compiler', args.cxx, '--c-compiler', args.cc, '--cflags=%s' % (seastar_cflags), '--ldflags=%s' %(seastar_ldflags)]
seastar_flags += ['--compiler', args.cxx, '--c-compiler', args.cc, '--cflags=%s' % (seastar_cflags), '--ldflags=%s' %(seastar_ldflags),
'--c++-dialect=gnu++1z', '--optflags=%s' % (modes['release']['opt']),
]
status = subprocess.call([python, './configure.py'] + seastar_flags, cwd = 'seastar')
@@ -836,11 +978,16 @@ for mode in build_modes:
seastar_deps = 'practically_anything_can_change_so_lets_run_it_every_time_and_restat.'
args.user_cflags += " " + pkg_config("--cflags", "jsoncpp")
libs = ' '.join(['-lyaml-cpp', '-llz4', '-lz', '-lsnappy', pkg_config("--libs", "jsoncpp"),
maybe_static(args.staticboost, '-lboost_filesystem'), ' -lcrypt',
libs = ' '.join([maybe_static(args.staticyamlcpp, '-lyaml-cpp'), '-llz4', '-lz', '-lsnappy', pkg_config("--libs", "jsoncpp"),
maybe_static(args.staticboost, '-lboost_filesystem'), ' -lcrypt', ' -lcryptopp',
maybe_static(args.staticboost, '-lboost_date_time'),
])
xxhash_dir = 'xxHash'
if not os.path.exists(xxhash_dir) or not os.listdir(xxhash_dir):
raise Exception(xxhash_dir + ' is empty. Run "git submodule update --init".')
if not args.staticboost:
args.user_cflags += ' -DBOOST_TEST_DYN_LINK'
@@ -863,20 +1010,31 @@ os.makedirs(outdir, exist_ok = True)
do_sanitize = True
if args.static:
do_sanitize = False
if args.antlr3_exec:
antlr3_exec = args.antlr3_exec
else:
antlr3_exec = "antlr3"
if args.ragel_exec:
ragel_exec = args.ragel_exec
else:
ragel_exec = "ragel"
with open(buildfile, 'w') as f:
f.write(textwrap.dedent('''\
configure_args = {configure_args}
builddir = {outdir}
cxx = {cxx}
cxxflags = {user_cflags} {warnings} {defines}
ldflags = -fuse-ld=gold {user_ldflags}
ldflags = {gold_linker_flag} {user_ldflags}
libs = {libs}
pool link_pool
depth = {link_pool_depth}
pool seastar_pool
depth = 1
rule ragel
command = ragel -G2 -o $out $in
command = {ragel_exec} -G2 -o $out $in
description = RAGEL $out
rule gen
command = echo -e $text > $out
@@ -898,7 +1056,7 @@ with open(buildfile, 'w') as f:
for mode in build_modes:
modeval = modes[mode]
f.write(textwrap.dedent('''\
cxxflags_{mode} = -I. -I $builddir/{mode}/gen -I seastar -I seastar/build/{mode}/gen
cxxflags_{mode} = {opt} -DXXH_PRIVATE_API -I. -I $builddir/{mode}/gen -I seastar -I seastar/build/{mode}/gen
rule cxx.{mode}
command = $cxx -MD -MT $out -MF $out.d {seastar_cflags} $cxxflags $cxxflags_{mode} $obj_cxxflags -c -o $out $in
description = CXX $out
@@ -922,7 +1080,7 @@ with open(buildfile, 'w') as f:
# Because we add such a variable to every function, and because `ExceptionBaseType` is not a global
# name, we also add a global typedef to avoid compilation errors.
command = sed -e '/^#if 0/,/^#endif/d' $in > $builddir/{mode}/gen/$in $
&& antlr3 $builddir/{mode}/gen/$in $
&& {antlr3_exec} $builddir/{mode}/gen/$in $
&& sed -i -e 's/^\\( *\)\\(ImplTraits::CommonTokenType\\* [a-zA-Z0-9_]* = NULL;\\)$$/\\1const \\2/' $
-e '1i using ExceptionBaseType = int;' $
-e 's/^{{/{{ ExceptionBaseType\* ex = nullptr;/; $
@@ -930,7 +1088,7 @@ with open(buildfile, 'w') as f:
s/exceptions::syntax_exception e/exceptions::syntax_exception\& e/' $
build/{mode}/gen/${{stem}}Parser.cpp
description = ANTLR3 $in
''').format(mode = mode, **modeval))
''').format(mode = mode, antlr3_exec = antlr3_exec, **modeval))
f.write('build {mode}: phony {artifacts}\n'.format(mode = mode,
artifacts = str.join(' ', ('$builddir/' + mode + '/' + x for x in build_artifacts))))
compiles = {}
@@ -946,6 +1104,7 @@ with open(buildfile, 'w') as f:
objs = ['$builddir/' + mode + '/' + src.replace('.cc', '.o')
for src in srcs
if src.endswith('.cc')]
objs.append('$builddir/../utils/arch/powerpc/crc32-vpmsum/crc32.S')
has_thrift = False
for dep in deps[binary]:
if isinstance(dep, Thrift):
@@ -1049,7 +1208,7 @@ with open(buildfile, 'w') as f:
rule configure
command = {python} configure.py $configure_args
generator = 1
build build.ninja: configure | configure.py
build build.ninja: configure | configure.py seastar/configure.py
rule cscope
command = find -name '*.[chS]' -o -name "*.cc" -o -name "*.hh" | cscope -bq -i-
description = CSCOPE
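Note on the compiler check added in the `compiler_test_src` hunk above: the preprocessor ladder rejects any GCC older than 7.3 by failing compilation. A minimal sketch of the same predicate in plain Python follows; `gcc_version_ok` and its parameters are hypothetical names used only for illustration and are not part of configure.py.

```python
def gcc_version_ok(major, minor, min_major=7, min_minor=3):
    """Mirror the #if ladder in compiler_test_src (hypothetical helper).

    major and minor stand in for __GNUC__ and __GNUC_MINOR__.
    """
    if major < min_major:
        return False  # corresponds to: #error "MAJOR"
    if major == min_major and minor < min_minor:
        return False  # corresponds to: #error "MINOR"
    return True       # the test program compiles and links


if __name__ == "__main__":
    for version in [(6, 5), (7, 2), (7, 3), (8, 1)]:
        print(version, gcc_version_ok(*version))
```

Doing the check in the preprocessor, as the diff does, has the advantage that it tests the compiler actually selected by `args.cxx` rather than parsing version strings.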


@@ -39,16 +39,32 @@ private:
return ::is_compatible(new_def.kind, kind) && new_def.type->is_value_compatible_with(*old_type);
}
static void accept_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, atomic_cell_view cell) {
if (is_compatible(new_def, old_type, kind) && cell.timestamp() > new_def.dropped_at()) {
dst.apply(new_def, atomic_cell_or_collection(cell));
if (!is_compatible(new_def, old_type, kind) || cell.timestamp() <= new_def.dropped_at()) {
return;
}
auto new_cell = [&] {
if (cell.is_live() && !old_type->is_counter()) {
if (cell.is_live_and_has_ttl()) {
return atomic_cell_or_collection(
atomic_cell::make_live(*new_def.type, cell.timestamp(), cell.value().linearize(), cell.expiry(), cell.ttl())
);
}
return atomic_cell_or_collection(
atomic_cell::make_live(*new_def.type, cell.timestamp(), cell.value().linearize())
);
} else {
return atomic_cell_or_collection(*new_def.type, cell);
}
}();
dst.apply(new_def, std::move(new_cell));
}
static void accept_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, collection_mutation_view cell) {
if (!is_compatible(new_def, old_type, kind)) {
return;
}
cell.data.with_linearized([&] (bytes_view cell_bv) {
auto&& ctype = static_pointer_cast<const collection_type_impl>(old_type);
auto old_view = ctype->deserialize_mutation_form(cell);
auto old_view = ctype->deserialize_mutation_form(cell_bv);
collection_type_impl::mutation_view new_view;
if (old_view.tomb.timestamp > new_def.dropped_at()) {
@@ -60,6 +76,7 @@ private:
}
}
dst.apply(new_def, ctype->serialize_mutation_form(std::move(new_view)));
});
}
public:
converting_mutation_partition_applier(
@@ -120,11 +137,11 @@ public:
// Appends the cell to dst upgrading it to the new schema.
// Cells must have monotonic names.
static void append_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, const atomic_cell_or_collection& cell) {
static void append_cell(row& dst, column_kind kind, const column_definition& new_def, const column_definition& old_def, const atomic_cell_or_collection& cell) {
if (new_def.is_atomic()) {
accept_cell(dst, kind, new_def, old_type, cell.as_atomic_cell());
accept_cell(dst, kind, new_def, old_def.type, cell.as_atomic_cell(old_def));
} else {
accept_cell(dst, kind, new_def, old_type, cell.as_collection_mutation());
accept_cell(dst, kind, new_def, old_def.type, cell.as_collection_mutation());
}
}
};
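Note on the rewritten atomic `accept_cell()` above: a cell is carried over to the new schema only if the column types are compatible and the cell's timestamp is strictly newer than the column's `dropped_at`. The sketch below restates that guard in Python under simplifying assumptions: `ColumnDef` and `survives_upgrade` are hypothetical names, and type compatibility is reduced to a name comparison, whereas the real code also checks the column kind and calls `is_value_compatible_with()`.

```python
from dataclasses import dataclass


@dataclass
class ColumnDef:
    """Hypothetical stand-in for column_definition."""
    type_name: str
    dropped_at: int  # timestamp of the most recent drop of this column


def survives_upgrade(new_def, old_type_name, cell_timestamp):
    # Simplified compatibility test (assumption: equal names mean compatible).
    compatible = new_def.type_name == old_type_name
    # Cells written at or before the drop must not reappear after re-adding.
    return compatible and cell_timestamp > new_def.dropped_at


if __name__ == "__main__":
    col = ColumnDef(type_name="int", dropped_at=100)
    print(survives_upgrade(col, "int", 101))
    print(survives_upgrade(col, "int", 100))
    print(survives_upgrade(col, "text", 500))
```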


@@ -78,10 +78,10 @@ std::vector<counter_shard> counter_cell_view::shards_compatible_with_1_7_4() con
return sorted_shards;
}
static bool apply_in_place(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
static bool apply_in_place(const column_definition& cdef, atomic_cell_mutable_view dst, atomic_cell_mutable_view src)
{
auto dst_ccmv = counter_cell_mutable_view(dst.as_mutable_atomic_cell());
auto src_ccmv = counter_cell_mutable_view(src.as_mutable_atomic_cell());
auto dst_ccmv = counter_cell_mutable_view(dst);
auto src_ccmv = counter_cell_mutable_view(src);
auto dst_shards = dst_ccmv.shards();
auto src_shards = src_ccmv.shards();
@@ -118,48 +118,19 @@ static bool apply_in_place(atomic_cell_or_collection& dst, atomic_cell_or_collec
auto src_ts = src_ccmv.timestamp();
dst_ccmv.set_timestamp(std::max(dst_ts, src_ts));
src_ccmv.set_timestamp(dst_ts);
src.as_mutable_atomic_cell().set_counter_in_place_revert(true);
return true;
}
static void revert_in_place_apply(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
void counter_cell_view::apply(const column_definition& cdef, atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
{
assert(dst.can_use_mutable_view() && src.can_use_mutable_view());
auto dst_ccmv = counter_cell_mutable_view(dst.as_mutable_atomic_cell());
auto src_ccmv = counter_cell_mutable_view(src.as_mutable_atomic_cell());
auto dst_shards = dst_ccmv.shards();
auto src_shards = src_ccmv.shards();
auto dst_it = dst_shards.begin();
auto src_it = src_shards.begin();
while (src_it != src_shards.end()) {
while (dst_it != dst_shards.end() && dst_it->id() < src_it->id()) {
++dst_it;
}
assert(dst_it != dst_shards.end() && dst_it->id() == src_it->id());
dst_it->swap_value_and_clock(*src_it);
++src_it;
}
auto dst_ts = dst_ccmv.timestamp();
auto src_ts = src_ccmv.timestamp();
dst_ccmv.set_timestamp(src_ts);
src_ccmv.set_timestamp(dst_ts);
src.as_mutable_atomic_cell().set_counter_in_place_revert(false);
}
bool counter_cell_view::apply_reversibly(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
{
auto dst_ac = dst.as_atomic_cell();
auto src_ac = src.as_atomic_cell();
auto dst_ac = dst.as_atomic_cell(cdef);
auto src_ac = src.as_atomic_cell(cdef);
if (!dst_ac.is_live() || !src_ac.is_live()) {
if (dst_ac.is_live() || (!src_ac.is_live() && compare_atomic_cell_for_merge(dst_ac, src_ac) < 0)) {
std::swap(dst, src);
return true;
}
return false;
return;
}
if (dst_ac.is_counter_update() && src_ac.is_counter_update()) {
@@ -167,22 +138,26 @@ bool counter_cell_view::apply_reversibly(atomic_cell_or_collection& dst, atomic_
auto dst_v = dst_ac.counter_update_value();
dst = atomic_cell::make_live_counter_update(std::max(dst_ac.timestamp(), src_ac.timestamp()),
src_v + dst_v);
return true;
return;
}
assert(!dst_ac.is_counter_update());
assert(!src_ac.is_counter_update());
with_linearized(dst_ac, [&] (counter_cell_view dst_ccv) {
with_linearized(src_ac, [&] (counter_cell_view src_ccv) {
if (counter_cell_view(dst_ac).shard_count() >= counter_cell_view(src_ac).shard_count()
&& dst.can_use_mutable_view() && src.can_use_mutable_view()) {
if (apply_in_place(dst, src)) {
return true;
if (dst_ccv.shard_count() >= src_ccv.shard_count()) {
auto dst_amc = dst.as_mutable_atomic_cell(cdef);
auto src_amc = src.as_mutable_atomic_cell(cdef);
if (!dst_amc.is_value_fragmented() && !src_amc.is_value_fragmented()) {
if (apply_in_place(cdef, dst_amc, src_amc)) {
return;
}
}
}
src.as_mutable_atomic_cell().set_counter_in_place_revert(false);
auto dst_shards = counter_cell_view(dst_ac).shards();
auto src_shards = counter_cell_view(src_ac).shards();
auto dst_shards = dst_ccv.shards();
auto src_shards = src_ccv.shards();
counter_cell_builder result;
combine(dst_shards.begin(), dst_shards.end(), src_shards.begin(), src_shards.end(),
@@ -191,22 +166,9 @@ bool counter_cell_view::apply_reversibly(atomic_cell_or_collection& dst, atomic_
});
auto cell = result.build(std::max(dst_ac.timestamp(), src_ac.timestamp()));
src = std::exchange(dst, atomic_cell_or_collection(cell));
return true;
}
void counter_cell_view::revert_apply(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
{
if (dst.as_atomic_cell().is_counter_update()) {
auto src_v = src.as_atomic_cell().counter_update_value();
auto dst_v = dst.as_atomic_cell().counter_update_value();
dst = atomic_cell::make_live(dst.as_atomic_cell().timestamp(),
long_type->decompose(dst_v - src_v));
} else if (src.as_atomic_cell().is_counter_in_place_revert_set()) {
revert_in_place_apply(dst, src);
} else {
std::swap(dst, src);
}
src = std::exchange(dst, atomic_cell_or_collection(std::move(cell)));
});
});
}
stdx::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, atomic_cell_view b)
@@ -216,13 +178,15 @@ stdx::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, at
if (!b.is_live() || !a.is_live()) {
if (b.is_live() || (!a.is_live() && compare_atomic_cell_for_merge(b, a) < 0)) {
return atomic_cell(a);
return atomic_cell(*counter_type, a);
}
return { };
}
auto a_shards = counter_cell_view(a).shards();
auto b_shards = counter_cell_view(b).shards();
return with_linearized(a, [&] (counter_cell_view a_ccv) {
return with_linearized(b, [&] (counter_cell_view b_ccv) {
auto a_shards = a_ccv.shards();
auto b_shards = b_ccv.shards();
auto a_it = a_shards.begin();
auto a_end = a_shards.end();
@@ -244,18 +208,21 @@ stdx::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, at
if (!result.empty()) {
diff = result.build(std::max(a.timestamp(), b.timestamp()));
} else if (a.timestamp() > b.timestamp()) {
diff = atomic_cell::make_live(a.timestamp(), bytes_view());
diff = atomic_cell::make_live(*counter_type, a.timestamp(), bytes_view());
}
return diff;
});
});
}
void transform_counter_updates_to_shards(mutation& m, const mutation* current_state, uint64_t clock_offset) {
// FIXME: allow current_state to be frozen_mutation
auto transform_new_row_to_shards = [clock_offset] (auto& cells) {
cells.for_each_cell([clock_offset] (auto, atomic_cell_or_collection& ac_o_c) {
auto acv = ac_o_c.as_atomic_cell();
auto transform_new_row_to_shards = [&s = *m.schema(), clock_offset] (column_kind kind, auto& cells) {
cells.for_each_cell([&] (column_id id, atomic_cell_or_collection& ac_o_c) {
auto& cdef = s.column_at(kind, id);
auto acv = ac_o_c.as_atomic_cell(cdef);
if (!acv.is_live()) {
return; // continue -- we are in lambda
}
@@ -266,32 +233,35 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st
};
if (!current_state) {
transform_new_row_to_shards(m.partition().static_row());
transform_new_row_to_shards(column_kind::static_column, m.partition().static_row());
for (auto& cr : m.partition().clustered_rows()) {
transform_new_row_to_shards(cr.row().cells());
transform_new_row_to_shards(column_kind::regular_column, cr.row().cells());
}
return;
}
clustering_key::less_compare cmp(*m.schema());
auto transform_row_to_shards = [clock_offset] (auto& transformee, auto& state) {
auto transform_row_to_shards = [&s = *m.schema(), clock_offset] (column_kind kind, auto& transformee, auto& state) {
std::deque<std::pair<column_id, counter_shard>> shards;
state.for_each_cell([&] (column_id id, const atomic_cell_or_collection& ac_o_c) {
auto acv = ac_o_c.as_atomic_cell();
auto& cdef = s.column_at(kind, id);
auto acv = ac_o_c.as_atomic_cell(cdef);
if (!acv.is_live()) {
return; // continue -- we are in lambda
}
counter_cell_view ccv(acv);
counter_cell_view::with_linearized(acv, [&] (counter_cell_view ccv) {
auto cs = ccv.local_shard();
if (!cs) {
return; // continue
}
shards.emplace_back(std::make_pair(id, counter_shard(*cs)));
});
});
transformee.for_each_cell([&] (column_id id, atomic_cell_or_collection& ac_o_c) {
auto acv = ac_o_c.as_atomic_cell();
auto& cdef = s.column_at(kind, id);
auto acv = ac_o_c.as_atomic_cell(cdef);
if (!acv.is_live()) {
return; // continue -- we are in lambda
}
@@ -313,7 +283,7 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st
});
};
transform_row_to_shards(m.partition().static_row(), current_state->partition().static_row());
transform_row_to_shards(column_kind::static_column, m.partition().static_row(), current_state->partition().static_row());
auto& cstate = current_state->partition();
auto it = cstate.clustered_rows().begin();
@@ -323,10 +293,10 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st
++it;
}
if (it == end || cmp(cr.key(), it->key())) {
transform_new_row_to_shards(cr.row().cells());
transform_new_row_to_shards(column_kind::regular_column, cr.row().cells());
continue;
}
transform_row_to_shards(cr.row().cells(), it->row().cells());
transform_row_to_shards(column_kind::regular_column, cr.row().cells(), it->row().cells());
}
}
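Note on the slow path of `counter_cell_view::apply()` above: when the in-place update is not possible, the two shard ranges are combined into a new cell. The lambda passed to `combine()` is elided from the hunk, so the tie-breaking rule below is an assumption: for shards with equal ids, the sketch keeps the one with the higher logical clock. `merge_shards` and the tuple layout `(id, value, logical_clock)` are hypothetical.

```python
def merge_shards(dst, src):
    """Merge two shard lists that are each sorted by id.

    Each shard is a tuple (id, value, logical_clock). For equal ids the
    shard with the higher logical clock wins (assumed rule, see lead-in).
    """
    out = []
    i = j = 0
    while i < len(dst) and j < len(src):
        if dst[i][0] < src[j][0]:
            out.append(dst[i])
            i += 1
        elif src[j][0] < dst[i][0]:
            out.append(src[j])
            j += 1
        else:
            out.append(dst[i] if dst[i][2] >= src[j][2] else src[j])
            i += 1
            j += 1
    out.extend(dst[i:])  # leftovers from either side are copied as-is
    out.extend(src[j:])
    return out


if __name__ == "__main__":
    print(merge_shards([(1, 10, 1), (3, 5, 2)], [(1, 12, 2), (2, 7, 1)]))
```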


@@ -79,7 +79,7 @@ static_assert(std::is_pod<counter_id>::value, "counter_id should be a POD type")
std::ostream& operator<<(std::ostream& os, const counter_id& id);
template<typename View>
template<mutable_view is_mutable>
class basic_counter_shard_view {
enum class offset : unsigned {
id = 0u,
@@ -88,7 +88,8 @@ class basic_counter_shard_view {
total_size = unsigned(logical_clock) + sizeof(int64_t),
};
private:
typename View::pointer _base;
using pointer_type = std::conditional_t<is_mutable == mutable_view::no, const signed char*, signed char*>;
pointer_type _base;
private:
template<typename T>
T read(offset off) const {
@@ -100,7 +101,7 @@ public:
static constexpr auto size = size_t(offset::total_size);
public:
basic_counter_shard_view() = default;
explicit basic_counter_shard_view(typename View::pointer ptr) noexcept
explicit basic_counter_shard_view(pointer_type ptr) noexcept
: _base(ptr) { }
counter_id id() const { return read<counter_id>(offset::id); }
@@ -111,7 +112,7 @@ public:
static constexpr size_t off = size_t(offset::value);
static constexpr size_t size = size_t(offset::total_size) - off;
typename View::value_type tmp[size];
signed char tmp[size];
std::copy_n(_base + off, size, tmp);
std::copy_n(other._base + off, size, _base + off);
std::copy_n(tmp, size, other._base + off);
@@ -138,7 +139,7 @@ public:
};
};
using counter_shard_view = basic_counter_shard_view<bytes_view>;
using counter_shard_view = basic_counter_shard_view<mutable_view::no>;
std::ostream& operator<<(std::ostream& os, counter_shard_view csv);
@@ -198,7 +199,7 @@ public:
return do_apply(other);
}
static size_t serialized_size() {
static constexpr size_t serialized_size() {
return counter_shard_view::size;
}
void serialize(bytes::iterator& out) const {
@@ -252,15 +253,33 @@ public:
}
atomic_cell build(api::timestamp_type timestamp) const {
return atomic_cell::make_live_from_serializer(timestamp, serialized_size(), [this] (bytes::iterator out) {
serialize(out);
});
// If we can assume that the counter shards never cross fragment boundaries
// the serialisation code gets much simpler.
static_assert(data::cell::maximum_external_chunk_length % counter_shard::serialized_size() == 0);
auto ac = atomic_cell::make_live_uninitialized(*counter_type, timestamp, serialized_size());
auto dst_it = ac.value().begin();
auto dst_current = *dst_it++;
for (auto&& cs : _shards) {
if (dst_current.empty()) {
dst_current = *dst_it++;
}
assert(!dst_current.empty());
auto value_dst = dst_current.data();
cs.serialize(value_dst);
dst_current.remove_prefix(counter_shard::serialized_size());
}
return ac;
}
static atomic_cell from_single_shard(api::timestamp_type timestamp, const counter_shard& cs) {
return atomic_cell::make_live_from_serializer(timestamp, counter_shard::serialized_size(), [&cs] (bytes::iterator out) {
cs.serialize(out);
});
// We don't really need to bother with fragmentation here.
static_assert(data::cell::maximum_external_chunk_length >= counter_shard::serialized_size());
auto ac = atomic_cell::make_live_uninitialized(*counter_type, timestamp, counter_shard::serialized_size());
auto dst = ac.value().first_fragment().begin();
cs.serialize(dst);
return ac;
}
class inserter_iterator : public std::iterator<std::output_iterator_tag, counter_shard> {
@@ -287,28 +306,32 @@ public:
// <counter_id> := <int64_t><int64_t>
// <shard> := <counter_id><int64_t:value><int64_t:logical_clock>
// <counter_cell> := <shard>*
template<typename View>
template<mutable_view is_mutable>
class basic_counter_cell_view {
protected:
atomic_cell_base<View> _cell;
using linearized_value_view = std::conditional_t<is_mutable == mutable_view::no,
bytes_view, bytes_mutable_view>;
using pointer_type = typename linearized_value_view::pointer;
basic_atomic_cell_view<is_mutable> _cell;
linearized_value_view _value;
private:
class shard_iterator : public std::iterator<std::input_iterator_tag, basic_counter_shard_view<View>> {
typename View::pointer _current;
basic_counter_shard_view<View> _current_view;
class shard_iterator : public std::iterator<std::input_iterator_tag, basic_counter_shard_view<is_mutable>> {
pointer_type _current;
basic_counter_shard_view<is_mutable> _current_view;
public:
shard_iterator() = default;
shard_iterator(typename View::pointer ptr) noexcept
shard_iterator(pointer_type ptr) noexcept
: _current(ptr), _current_view(ptr) { }
basic_counter_shard_view<View>& operator*() noexcept {
basic_counter_shard_view<is_mutable>& operator*() noexcept {
return _current_view;
}
basic_counter_shard_view<View>* operator->() noexcept {
basic_counter_shard_view<is_mutable>* operator->() noexcept {
return &_current_view;
}
shard_iterator& operator++() noexcept {
_current += counter_shard_view::size;
_current_view = basic_counter_shard_view<View>(_current);
_current_view = basic_counter_shard_view<is_mutable>(_current);
return *this;
}
shard_iterator operator++(int) noexcept {
@@ -318,7 +341,7 @@ private:
}
shard_iterator& operator--() noexcept {
_current -= counter_shard_view::size;
_current_view = basic_counter_shard_view<View>(_current);
_current_view = basic_counter_shard_view<is_mutable>(_current);
return *this;
}
shard_iterator operator--(int) noexcept {
@@ -335,22 +358,23 @@ private:
};
public:
boost::iterator_range<shard_iterator> shards() const {
auto bv = _cell.value();
auto begin = shard_iterator(bv.data());
auto end = shard_iterator(bv.data() + bv.size());
auto begin = shard_iterator(_value.data());
auto end = shard_iterator(_value.data() + _value.size());
return boost::make_iterator_range(begin, end);
}
size_t shard_count() const {
return _cell.value().size() / counter_shard_view::size;
return _cell.value().size_bytes() / counter_shard_view::size;
}
public:
protected:
// ac must be a live counter cell
explicit basic_counter_cell_view(atomic_cell_base<View> ac) noexcept : _cell(ac) {
explicit basic_counter_cell_view(basic_atomic_cell_view<is_mutable> ac, linearized_value_view vv) noexcept
: _cell(ac), _value(vv)
{
assert(_cell.is_live());
assert(!_cell.is_counter_update());
}
public:
api::timestamp_type timestamp() const { return _cell.timestamp(); }
static data_type total_value_type() { return long_type; }
@@ -381,18 +405,22 @@ public:
}
};
struct counter_cell_view : basic_counter_cell_view<bytes_view> {
struct counter_cell_view : basic_counter_cell_view<mutable_view::no> {
using basic_counter_cell_view::basic_counter_cell_view;
template<typename Function>
static decltype(auto) with_linearized(basic_atomic_cell_view<mutable_view::no> ac, Function&& fn) {
return ac.value().with_linearized([&] (bytes_view value_view) {
counter_cell_view ccv(ac, value_view);
return fn(ccv);
});
}
// Returns counter shards in an order that is compatible with Scylla 1.7.4.
std::vector<counter_shard> shards_compatible_with_1_7_4() const;
// Reversibly applies two counter cells, at least one of them must be live.
// Returns true iff dst was modified.
static bool apply_reversibly(atomic_cell_or_collection& dst, atomic_cell_or_collection& src);
// Reverts apply performed by apply_reversible().
static void revert_apply(atomic_cell_or_collection& dst, atomic_cell_or_collection& src);
static void apply(const column_definition& cdef, atomic_cell_or_collection& dst, atomic_cell_or_collection& src);
// Computes a counter cell containing minimal amount of data which, when
// applied to 'b' returns the same cell as 'a' and 'b' applied together.
@@ -401,9 +429,15 @@ struct counter_cell_view : basic_counter_cell_view<bytes_view> {
friend std::ostream& operator<<(std::ostream& os, counter_cell_view ccv);
};
struct counter_cell_mutable_view : basic_counter_cell_view<bytes_mutable_view> {
struct counter_cell_mutable_view : basic_counter_cell_view<mutable_view::yes> {
using basic_counter_cell_view::basic_counter_cell_view;
explicit counter_cell_mutable_view(atomic_cell_mutable_view ac) noexcept
: basic_counter_cell_view<mutable_view::yes>(ac, ac.value().first_fragment())
{
assert(!ac.value().is_fragmented());
}
void set_timestamp(api::timestamp_type ts) { _cell.set_timestamp(ts); }
};
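Note on the counter cell layout documented above (`<shard> := <counter_id><int64_t:value><int64_t:logical_clock>`, with `<counter_id>` being two `int64_t`): every shard occupies four 64-bit integers, i.e. 32 bytes, which is why `shard_count()` is a plain division and why a chunk length that is a multiple of the shard size guarantees that no shard straddles a fragment boundary. A minimal sketch of that layout follows; the helper names are hypothetical and the little-endian byte order is an assumption made for illustration only.

```python
import struct

# id (two int64), value, logical_clock. Byte order "<" is an assumption.
SHARD = struct.Struct("<qqqq")


def pack_cell(shards):
    """Serialize shards given as (id_hi, id_lo, value, logical_clock)."""
    return b"".join(SHARD.pack(*shard) for shard in shards)


def shard_count(cell_bytes):
    # Mirrors value().size_bytes() / counter_shard_view::size.
    if len(cell_bytes) % SHARD.size != 0:
        raise ValueError("cell length is not a multiple of the shard size")
    return len(cell_bytes) // SHARD.size


def unpack_cell(cell_bytes):
    return [SHARD.unpack_from(cell_bytes, off)
            for off in range(0, len(cell_bytes), SHARD.size)]


if __name__ == "__main__":
    cell = pack_cell([(1, 2, 10, 1), (3, 4, -5, 7)])
    print(SHARD.size, shard_count(cell), unpack_cell(cell))
```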


@@ -1,89 +0,0 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <seastar/core/thread.hh>
#include <seastar/core/timer.hh>
#include <chrono>
// Simple proportional controller to adjust shares of memtable/streaming flushes.
//
// Goal is to flush as fast as we can, but not so fast that we steal all the CPU from incoming
// requests, and at the same time minimize user-visible fluctuations in the flush quota.
//
// What that translates to is we'll try to keep virtual dirty's first derivative at 0 (IOW, we keep
// virtual dirty constant), which means that the rate of incoming writes is equal to the rate of
// flushed bytes.
//
// The exact point at which the controller stops determines the desired flush CPU usage. As we
// approach the hard dirty limit, we need to be more aggressive. We will therefore define two
// thresholds, and increase the constant as we cross them.
//
// 1) the soft limit line
// 2) halfway between soft limit and dirty limit
//
// The constants q1 and q2 are used to determine the proportional factor at each stage.
//
// Below the soft limit, we are in no particular hurry to flush, since it means we're set to
// complete flushing before a new memtable is ready. The quota is dirty * q1, and q1 is set to a
// low number.
//
// The first half of the virtual dirty region is where we expect to be usually, so we have a low
// slope corresponding to a sluggish response between q1 * soft_limit and q2.
//
// In the second half, we're getting close to the hard dirty limit so we increase the slope and
// become more responsive, up to a maximum quota of qmax.
//
// For now we'll just set them in the structure not to complicate the constructor. But q1, q2 and
// qmax can easily become parameters if we find another user.
class flush_cpu_controller {
static constexpr float hard_dirty_limit = 0.50;
static constexpr float q1 = 0.01;
static constexpr float q2 = 0.2;
static constexpr float qmax = 1;
float _current_quota = 0.0f;
float _goal;
std::function<float()> _current_dirty;
std::chrono::milliseconds _interval;
timer<> _update_timer;
seastar::thread_scheduling_group _scheduling_group;
seastar::thread_scheduling_group *_current_scheduling_group = nullptr;
void adjust();
public:
seastar::thread_scheduling_group* scheduling_group() {
return _current_scheduling_group;
}
float current_quota() const {
return _current_quota;
}
struct disabled {
seastar::thread_scheduling_group *backup;
};
flush_cpu_controller(disabled d) : _scheduling_group(std::chrono::nanoseconds(0), 0), _current_scheduling_group(d.backup) {}
flush_cpu_controller(std::chrono::milliseconds interval, float soft_limit, std::function<float()> current_dirty);
flush_cpu_controller(flush_cpu_controller&&) = default;
};
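Note on the removed controller: the header comment describes a piecewise-linear mapping from the virtual dirty ratio to a flush quota, but `adjust()` itself is not part of this hunk. The sketch below is one reading of that comment, not the removed implementation: `flush_quota` is a hypothetical name, the constants are copied from the class, and the segment endpoints follow the comment literally (`dirty * q1` below the soft limit, then a gentle slope up to `q2` at the midpoint, then a steep slope up to `qmax` at the hard limit).

```python
def flush_quota(dirty, soft_limit, hard_limit=0.50, q1=0.01, q2=0.2, qmax=1.0):
    """Piecewise-linear quota as described in the header comment (sketch)."""
    mid = soft_limit + (hard_limit - soft_limit) / 2
    if dirty < soft_limit:
        # No hurry: quota grows slowly with the amount of dirty memory.
        return dirty * q1
    if dirty < mid:
        # Low slope between q1 * soft_limit and q2.
        start = q1 * soft_limit
        return start + (dirty - soft_limit) * (q2 - start) / (mid - soft_limit)
    if dirty < hard_limit:
        # Close to the hard limit: steeper slope, up to qmax.
        return q2 + (dirty - mid) * (qmax - q2) / (hard_limit - mid)
    return qmax


if __name__ == "__main__":
    for d in (0.10, 0.25, 0.30, 0.375, 0.45, 0.50):
        print(d, round(flush_quota(d, soft_limit=0.25), 4))
```

The mapping is continuous at both thresholds, which matches the stated goal of minimizing user-visible fluctuations in the flush quota.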


@@ -56,13 +56,16 @@ options {
#include "cql3/statements/index_prop_defs.hh"
#include "cql3/statements/raw/use_statement.hh"
#include "cql3/statements/raw/batch_statement.hh"
#include "cql3/statements/create_user_statement.hh"
#include "cql3/statements/alter_user_statement.hh"
#include "cql3/statements/drop_user_statement.hh"
#include "cql3/statements/list_users_statement.hh"
#include "cql3/statements/grant_statement.hh"
#include "cql3/statements/revoke_statement.hh"
#include "cql3/statements/list_permissions_statement.hh"
#include "cql3/statements/alter_role_statement.hh"
#include "cql3/statements/list_roles_statement.hh"
#include "cql3/statements/grant_role_statement.hh"
#include "cql3/statements/revoke_role_statement.hh"
#include "cql3/statements/drop_role_statement.hh"
#include "cql3/statements/create_role_statement.hh"
#include "cql3/statements/index_target.hh"
#include "cql3/statements/ks_prop_defs.hh"
#include "cql3/selection/raw_selector.hh"
@@ -80,6 +83,8 @@ options {
#include "cql3/maps.hh"
#include "cql3/sets.hh"
#include "cql3/lists.hh"
#include "cql3/role_name.hh"
#include "cql3/role_options.hh"
#include "cql3/type_cast.hh"
#include "cql3/tuples.hh"
#include "cql3/user_types.hh"
@@ -89,6 +94,7 @@ options {
#include "core/sstring.hh"
#include "CqlLexer.hpp"
#include <algorithm>
#include <unordered_map>
#include <map>
}
@@ -236,6 +242,12 @@ struct uninitialized {
return res;
}
bool convert_boolean_literal(stdx::string_view s) {
std::string lower_s(s.size(), '\0');
std::transform(s.cbegin(), s.cend(), lower_s.begin(), &::tolower);
return lower_s == "true";
}
void add_raw_update(std::vector<std::pair<::shared_ptr<cql3::column_identifier::raw>,::shared_ptr<cql3::operation::raw_update>>>& operations,
::shared_ptr<cql3::column_identifier::raw> key, ::shared_ptr<cql3::operation::raw_update> update)
{
@@ -345,6 +357,12 @@ cqlStatement returns [shared_ptr<raw::parsed_statement> stmt]
| st32=createViewStatement { $stmt = st32; }
| st33=alterViewStatement { $stmt = st33; }
| st34=dropViewStatement { $stmt = st34; }
| st35=listRolesStatement { $stmt = st35; }
| st36=grantRoleStatement { $stmt = st36; }
| st37=revokeRoleStatement { $stmt = st37; }
| st38=dropRoleStatement { $stmt = st38; }
| st39=createRoleStatement { $stmt = st39; }
| st40=alterRoleStatement { $stmt = st40; }
;
/*
@@ -355,7 +373,7 @@ useStatement returns [::shared_ptr<raw::use_statement> stmt]
;
/**
* SELECT <expression>
* SELECT [JSON] <expression>
* FROM <CF>
* WHERE KEY = "key1" AND COL > 1 AND COL < 100
* LIMIT <NUMBER>;
@@ -366,10 +384,12 @@ selectStatement returns [shared_ptr<raw::select_statement> expr]
::shared_ptr<cql3::term::raw> limit;
raw::select_statement::parameters::orderings_type orderings;
bool allow_filtering = false;
bool is_json = false;
}
: K_SELECT ( ( K_DISTINCT { is_distinct = true; } )?
sclause=selectClause
| sclause=selectCountClause
: K_SELECT (
( K_JSON { is_json = true; } )?
( K_DISTINCT { is_distinct = true; } )?
sclause=selectClause
)
K_FROM cf=columnFamilyName
( K_WHERE wclause=whereClause )?
@@ -377,7 +397,7 @@ selectStatement returns [shared_ptr<raw::select_statement> expr]
( K_LIMIT rows=intValue { limit = rows; } )?
( K_ALLOW K_FILTERING { allow_filtering = true; } )?
{
auto params = ::make_shared<raw::select_statement::parameters>(std::move(orderings), is_distinct, allow_filtering);
auto params = ::make_shared<raw::select_statement::parameters>(std::move(orderings), is_distinct, allow_filtering, is_json);
$expr = ::make_shared<raw::select_statement>(std::move(cf), std::move(params),
std::move(sclause), std::move(wclause), std::move(limit));
}
@@ -396,6 +416,7 @@ selector returns [shared_ptr<raw_selector> s]
unaliasedSelector returns [shared_ptr<selectable::raw> s]
@init { shared_ptr<selectable::raw> tmp; }
: ( c=cident { tmp = c; }
| K_COUNT '(' countArgument ')' { tmp = selectable::with_function::raw::make_count_rows_function(); }
| K_WRITETIME '(' c=cident ')' { tmp = make_shared<selectable::writetime_or_ttl::raw>(c, true); }
| K_TTL '(' c=cident ')' { tmp = make_shared<selectable::writetime_or_ttl::raw>(c, false); }
| f=functionName args=selectionFunctionArgs { tmp = ::make_shared<selectable::with_function::raw>(std::move(f), std::move(args)); }
@@ -412,16 +433,6 @@ selectionFunctionArgs returns [std::vector<shared_ptr<selectable::raw>> a]
')'
;
selectCountClause returns [std::vector<shared_ptr<raw_selector>> expr]
@init{ auto alias = make_shared<cql3::column_identifier>("count", false); }
: K_COUNT '(' countArgument ')' (K_AS c=ident { alias = c; })? {
auto&& with_fn = ::make_shared<cql3::selection::selectable::with_function::raw>(
cql3::functions::function_name::native_function("countRows"),
std::vector<shared_ptr<cql3::selection::selectable::raw>>());
$expr.push_back(make_shared<cql3::selection::raw_selector>(with_fn, alias));
}
;
countArgument
: '*'
| i=INTEGER { if (i->getText() != "1") {
@@ -440,33 +451,51 @@ orderByClause[raw::select_statement::parameters::orderings_type& orderings]
: c=cident (K_ASC | K_DESC { reversed = true; })? { orderings.emplace_back(c, reversed); }
;
jsonValue returns [::shared_ptr<cql3::term::raw> value]
:
| s=STRING_LITERAL { $value = cql3::constants::literal::string(sstring{$s.text}); }
| ':' id=ident { $value = new_bind_variables(id); }
| QMARK { $value = new_bind_variables(shared_ptr<cql3::column_identifier>{}); }
;
/**
* INSERT INTO <CF> (<column>, <column>, <column>, ...)
* VALUES (<value>, <value>, <value>, ...)
* USING TIMESTAMP <long>;
*
*/
insertStatement returns [::shared_ptr<raw::insert_statement> expr]
insertStatement returns [::shared_ptr<raw::modification_statement> expr]
@init {
auto attrs = ::make_shared<cql3::attributes::raw>();
std::vector<::shared_ptr<cql3::column_identifier::raw>> column_names;
std::vector<::shared_ptr<cql3::term::raw>> values;
bool if_not_exists = false;
::shared_ptr<cql3::term::raw> json_value;
}
: K_INSERT K_INTO cf=columnFamilyName
'(' c1=cident { column_names.push_back(c1); } ( ',' cn=cident { column_names.push_back(cn); } )* ')'
K_VALUES
'(' v1=term { values.push_back(v1); } ( ',' vn=term { values.push_back(vn); } )* ')'
( K_IF K_NOT K_EXISTS { if_not_exists = true; } )?
( usingClause[attrs] )?
{
$expr = ::make_shared<raw::insert_statement>(std::move(cf),
std::move(attrs),
std::move(column_names),
std::move(values),
if_not_exists);
}
('(' c1=cident { column_names.push_back(c1); } ( ',' cn=cident { column_names.push_back(cn); } )* ')'
K_VALUES
'(' v1=term { values.push_back(v1); } ( ',' vn=term { values.push_back(vn); } )* ')'
( K_IF K_NOT K_EXISTS { if_not_exists = true; } )?
( usingClause[attrs] )?
{
$expr = ::make_shared<raw::insert_statement>(std::move(cf),
std::move(attrs),
std::move(column_names),
std::move(values),
if_not_exists);
}
| K_JSON
json_token=jsonValue { json_value = $json_token.value; }
( K_IF K_NOT K_EXISTS { if_not_exists = true; } )?
( usingClause[attrs] )?
{
$expr = ::make_shared<raw::insert_json_statement>(std::move(cf),
std::move(attrs),
std::move(json_value),
if_not_exists);
}
)
;
usingClause[::shared_ptr<cql3::attributes::raw> attrs]
@@ -975,7 +1004,7 @@ truncateStatement returns [::shared_ptr<truncate_statement> stmt]
;
/**
* GRANT <permission> ON <resource> TO <username>
* GRANT <permission> ON <resource> TO <grantee>
*/
grantStatement returns [::shared_ptr<grant_statement> stmt]
: K_GRANT
@@ -983,12 +1012,12 @@ grantStatement returns [::shared_ptr<grant_statement> stmt]
K_ON
resource
K_TO
username
{ $stmt = ::make_shared<grant_statement>($permissionOrAll.perms, $resource.res, $username.text); }
grantee=userOrRoleName
{ $stmt = ::make_shared<grant_statement>($permissionOrAll.perms, $resource.res, std::move(grantee)); }
;
/**
* REVOKE <permission> ON <resource> FROM <username>
* REVOKE <permission> ON <resource> FROM <revokee>
*/
revokeStatement returns [::shared_ptr<revoke_statement> stmt]
: K_REVOKE
@@ -996,80 +1025,104 @@ revokeStatement returns [::shared_ptr<revoke_statement> stmt]
K_ON
resource
K_FROM
username
{ $stmt = ::make_shared<revoke_statement>($permissionOrAll.perms, $resource.res, $username.text); }
revokee=userOrRoleName
{ $stmt = ::make_shared<revoke_statement>($permissionOrAll.perms, $resource.res, std::move(revokee)); }
;
/**
* GRANT <rolename> TO <grantee>
*/
grantRoleStatement returns [::shared_ptr<grant_role_statement> stmt]
: K_GRANT role=userOrRoleName K_TO grantee=userOrRoleName
{ $stmt = ::make_shared<grant_role_statement>(std::move(role), std::move(grantee)); }
;
/**
* REVOKE <rolename> FROM <revokee>
*/
revokeRoleStatement returns [::shared_ptr<revoke_role_statement> stmt]
: K_REVOKE role=userOrRoleName K_FROM revokee=userOrRoleName
{ $stmt = ::make_shared<revoke_role_statement>(std::move(role), std::move(revokee)); }
;
listPermissionsStatement returns [::shared_ptr<list_permissions_statement> stmt]
@init {
std::experimental::optional<auth::data_resource> r;
std::experimental::optional<sstring> u;
std::optional<auth::resource> r;
std::optional<sstring> role;
bool recursive = true;
}
: K_LIST
permissionOrAll
( K_ON resource { r = $resource.res; } )?
( K_OF username { u = sstring($username.text); } )?
( K_OF rn=userOrRoleName { role = sstring(static_cast<cql3::role_name>(rn).to_string()); } )?
( K_NORECURSIVE { recursive = false; } )?
{ $stmt = ::make_shared<list_permissions_statement>($permissionOrAll.perms, std::move(r), std::move(u), recursive); }
{ $stmt = ::make_shared<list_permissions_statement>($permissionOrAll.perms, std::move(r), std::move(role), recursive); }
;
permission returns [auth::permission perm]
: p=(K_CREATE | K_ALTER | K_DROP | K_SELECT | K_MODIFY | K_AUTHORIZE)
: p=(K_CREATE | K_ALTER | K_DROP | K_SELECT | K_MODIFY | K_AUTHORIZE | K_DESCRIBE)
{ $perm = auth::permissions::from_string($p.text); }
;
permissionOrAll returns [auth::permission_set perms]
: K_ALL ( K_PERMISSIONS )? { $perms = auth::permissions::ALL_DATA; }
: K_ALL ( K_PERMISSIONS )? { $perms = auth::permissions::ALL; }
| p=permission ( K_PERMISSION )? { $perms = auth::permission_set::from_mask(auth::permission_set::mask_for($p.perm)); }
;
resource returns [auth::data_resource res]
: r=dataResource { $res = $r.res; }
resource returns [uninitialized<auth::resource> res]
: d=dataResource { $res = std::move(d); }
| r=roleResource { $res = std::move(r); }
;
dataResource returns [auth::data_resource res]
: K_ALL K_KEYSPACES { $res = auth::data_resource(); }
| K_KEYSPACE ks = keyspaceName { $res = auth::data_resource($ks.id); }
dataResource returns [uninitialized<auth::resource> res]
: K_ALL K_KEYSPACES { $res = auth::resource(auth::resource_kind::data); }
| K_KEYSPACE ks = keyspaceName { $res = auth::make_data_resource($ks.id); }
| ( K_COLUMNFAMILY )? cf = columnFamilyName
{ $res = auth::data_resource($cf.name->get_keyspace(), $cf.name->get_column_family()); }
{ $res = auth::make_data_resource($cf.name->get_keyspace(), $cf.name->get_column_family()); }
;
roleResource returns [uninitialized<auth::resource> res]
: K_ALL K_ROLES { $res = auth::resource(auth::resource_kind::role); }
| K_ROLE role = userOrRoleName { $res = auth::make_role_resource(static_cast<const cql3::role_name&>(role).to_string()); }
;
/**
* CREATE USER [IF NOT EXISTS] <username> [WITH PASSWORD <password>] [SUPERUSER|NOSUPERUSER]
*/
createUserStatement returns [::shared_ptr<create_user_statement> stmt]
createUserStatement returns [::shared_ptr<create_role_statement> stmt]
@init {
auto opts = ::make_shared<cql3::user_options>();
bool superuser = false;
cql3::role_options opts;
opts.is_superuser = false;
opts.can_login = true;
bool ifNotExists = false;
}
: K_CREATE K_USER (K_IF K_NOT K_EXISTS { ifNotExists = true; })? username
( K_WITH userOptions[opts] )?
( K_SUPERUSER { superuser = true; } | K_NOSUPERUSER { superuser = false; } )?
{ $stmt = ::make_shared<create_user_statement>($username.text, std::move(opts), superuser, ifNotExists); }
( K_WITH K_PASSWORD v=STRING_LITERAL { opts.password = $v.text; })?
( K_SUPERUSER { opts.is_superuser = true; } | K_NOSUPERUSER { opts.is_superuser = false; } )?
{ $stmt = ::make_shared<create_role_statement>(cql3::role_name($username.text, cql3::preserve_role_case::yes), std::move(opts), ifNotExists); }
;
/**
* ALTER USER <username> [WITH PASSWORD <password>] [SUPERUSER|NOSUPERUSER]
*/
alterUserStatement returns [::shared_ptr<alter_user_statement> stmt]
alterUserStatement returns [::shared_ptr<alter_role_statement> stmt]
@init {
auto opts = ::make_shared<cql3::user_options>();
std::experimental::optional<bool> superuser;
cql3::role_options opts;
}
: K_ALTER K_USER username
( K_WITH userOptions[opts] )?
( K_SUPERUSER { superuser = true; } | K_NOSUPERUSER { superuser = false; } )?
{ $stmt = ::make_shared<alter_user_statement>($username.text, std::move(opts), std::move(superuser)); }
( K_WITH K_PASSWORD v=STRING_LITERAL { opts.password = $v.text; })?
( K_SUPERUSER { opts.is_superuser = true; } | K_NOSUPERUSER { opts.is_superuser = false; } )?
{ $stmt = ::make_shared<alter_role_statement>(cql3::role_name($username.text, cql3::preserve_role_case::yes), std::move(opts)); }
;
/**
* DROP USER [IF EXISTS] <username>
*/
dropUserStatement returns [::shared_ptr<drop_user_statement> stmt]
dropUserStatement returns [::shared_ptr<drop_role_statement> stmt]
@init { bool ifExists = false; }
: K_DROP K_USER (K_IF K_EXISTS { ifExists = true; })? username { $stmt = ::make_shared<drop_user_statement>($username.text, ifExists); }
: K_DROP K_USER (K_IF K_EXISTS { ifExists = true; })? username
{ $stmt = ::make_shared<drop_role_statement>(cql3::role_name($username.text, cql3::preserve_role_case::yes), ifExists); }
;
/**
@@ -1079,12 +1132,67 @@ listUsersStatement returns [::shared_ptr<list_users_statement> stmt]
: K_LIST K_USERS { $stmt = ::make_shared<list_users_statement>(); }
;
userOptions[::shared_ptr<cql3::user_options> opts]
: userOption[opts]
/**
* CREATE ROLE [IF NOT EXISTS] <role_name> [WITH <roleOption> [AND <roleOption>]*]
*/
createRoleStatement returns [::shared_ptr<create_role_statement> stmt]
@init {
cql3::role_options opts;
opts.is_superuser = false;
opts.can_login = false;
bool if_not_exists = false;
}
: K_CREATE K_ROLE (K_IF K_NOT K_EXISTS { if_not_exists = true; })? name=userOrRoleName
(K_WITH roleOptions[opts])?
{ $stmt = ::make_shared<create_role_statement>(name, std::move(opts), if_not_exists); }
;
userOption[::shared_ptr<cql3::user_options> opts]
: k=K_PASSWORD v=STRING_LITERAL { opts->put($k.text, $v.text); }
/**
* ALTER ROLE <rolename> [WITH <roleOption> [AND <roleOption>]*]
*/
alterRoleStatement returns [::shared_ptr<alter_role_statement> stmt]
@init {
cql3::role_options opts;
}
: K_ALTER K_ROLE name=userOrRoleName
(K_WITH roleOptions[opts])?
{ $stmt = ::make_shared<alter_role_statement>(name, std::move(opts)); }
;
/**
* DROP ROLE [IF EXISTS] <rolename>
*/
dropRoleStatement returns [::shared_ptr<drop_role_statement> stmt]
@init {
bool if_exists = false;
}
: K_DROP K_ROLE (K_IF K_EXISTS { if_exists = true; })? name=userOrRoleName
{ $stmt = ::make_shared<drop_role_statement>(name, if_exists); }
;
/**
* LIST ROLES [OF <rolename>] [NORECURSIVE]
*/
listRolesStatement returns [::shared_ptr<list_roles_statement> stmt]
@init {
bool recursive = true;
std::optional<cql3::role_name> grantee;
}
: K_LIST K_ROLES
(K_OF g=userOrRoleName { grantee = std::move(g); })?
(K_NORECURSIVE { recursive = false; })?
{ $stmt = ::make_shared<list_roles_statement>(grantee, recursive); }
;
roleOptions[cql3::role_options& opts]
: roleOption[opts] (K_AND roleOption[opts])*
;
roleOption[cql3::role_options& opts]
: K_PASSWORD '=' v=STRING_LITERAL { opts.password = $v.text; }
| K_OPTIONS '=' m=mapLiteral { opts.options = convert_property_map(m); }
| K_SUPERUSER '=' b=BOOLEAN { opts.is_superuser = convert_boolean_literal($b.text); }
| K_LOGIN '=' b=BOOLEAN { opts.can_login = convert_boolean_literal($b.text); }
;
/** DEFINITIONS **/
@@ -1125,12 +1233,13 @@ userTypeName returns [uninitialized<cql3::ut_name> name]
: (ks=ident '.')? ut=non_type_ident { $name = cql3::ut_name(ks, ut); }
;
#if 0
userOrRoleName returns [RoleName name]
@init { $name = new RoleName(); }
: roleName[name] {return $name;}
userOrRoleName returns [uninitialized<cql3::role_name> name]
: t=IDENT { $name = cql3::role_name($t.text, cql3::preserve_role_case::no); }
| t=STRING_LITERAL { $name = cql3::role_name($t.text, cql3::preserve_role_case::yes); }
| t=QUOTED_NAME { $name = cql3::role_name($t.text, cql3::preserve_role_case::yes); }
| k=unreserved_keyword { $name = cql3::role_name(k, cql3::preserve_role_case::no); }
| QMARK {add_recognition_error("Bind variables cannot be used for role names");}
;
#endif
ksName[::shared_ptr<cql3::keyspace_element_name> name]
: t=IDENT { $name->set_keyspace($t.text, false);}
@@ -1153,15 +1262,6 @@ idxName[::shared_ptr<cql3::index_name> name]
| QMARK {add_recognition_error("Bind variables cannot be used for index names");}
;
#if 0
roleName[RoleName name]
: t=IDENT { $name.setName($t.text, false); }
| t=QUOTED_NAME { $name.setName($t.text, true); }
| k=unreserved_keyword { $name.setName(k, false); }
| QMARK {addRecognitionError("Bind variables cannot be used for role names");}
;
#endif
constant returns [shared_ptr<cql3::constants::literal> constant]
@init{std::string sign;}
: t=STRING_LITERAL { $constant = cql3::constants::literal::string(sstring{$t.text}); }
@@ -1506,6 +1606,7 @@ tuple_type returns [shared_ptr<cql3::cql3_type::raw> t]
username
: IDENT
| STRING_LITERAL
| QUOTED_NAME { add_recognition_error("Quoted strings are not supported for user names"); }
;
// Basically the same as cident, but we need to exclude existing CQL3 types
@@ -1544,8 +1645,13 @@ basic_unreserved_keyword returns [sstring str]
| K_ALL
| K_USER
| K_USERS
| K_ROLE
| K_ROLES
| K_SUPERUSER
| K_NOSUPERUSER
| K_LOGIN
| K_NOLOGIN
| K_OPTIONS
| K_PASSWORD
| K_EXISTS
| K_CUSTOM
@@ -1565,6 +1671,7 @@ basic_unreserved_keyword returns [sstring str]
| K_LANGUAGE
| K_NON
| K_DETERMINISTIC
| K_JSON
) { $str = $k.text; }
;
@@ -1637,13 +1744,19 @@ K_OF: O F;
K_REVOKE: R E V O K E;
K_MODIFY: M O D I F Y;
K_AUTHORIZE: A U T H O R I Z E;
K_DESCRIBE: D E S C R I B E;
K_NORECURSIVE: N O R E C U R S I V E;
K_USER: U S E R;
K_USERS: U S E R S;
K_ROLE: R O L E;
K_ROLES: R O L E S;
K_SUPERUSER: S U P E R U S E R;
K_NOSUPERUSER: N O S U P E R U S E R;
K_PASSWORD: P A S S W O R D;
K_LOGIN: L O G I N;
K_NOLOGIN: N O L O G I N;
K_OPTIONS: O P T I O N S;
K_CLUSTERING: C L U S T E R I N G;
K_ASCII: A S C I I;
@@ -1695,6 +1808,7 @@ K_NON: N O N;
K_OR: O R;
K_REPLACE: R E P L A C E;
K_DETERMINISTIC: D E T E R M I N I S T I C;
K_JSON: J S O N;
K_SCYLLA_TIMEUUID_LIST_INDEX: S C Y L L A '_' T I M E U U I D '_' L I S T '_' I N D E X;
K_SCYLLA_COUNTER_SHARD_LIST: S C Y L L A '_' C O U N T E R '_' S H A R D '_' L I S T;
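
A note on the `convert_boolean_literal` helper that this grammar change adds for the `SUPERUSER = <bool>` and `LOGIN = <bool>` role options: it lowercases the literal and compares it with "true". The following is a minimal standalone sketch of the same idea, not the code from the patch. The name `convert_boolean_literal_sketch` is hypothetical, `std::string_view` stands in for `stdx::string_view`, and the cast through `unsigned char` is an addition of this sketch, since passing a negative `char` to `std::tolower` is undefined behaviour.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <string_view>

// Hypothetical standalone variant of the grammar helper: returns true only
// when the literal spells "true" in any letter case.
bool convert_boolean_literal_sketch(std::string_view s) {
    std::string lower(s.size(), '\0');
    // Go through unsigned char so that std::tolower never sees a negative value.
    std::transform(s.cbegin(), s.cend(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return lower == "true";
}
```

Anything that is not some spelling of "true" maps to false, so the function relies on the lexer having already restricted the input to a BOOLEAN token.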


@@ -0,0 +1,187 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "cql3/prepared_statements_cache.hh"
namespace cql3 {
struct authorized_prepared_statements_cache_size {
size_t operator()(const statements::prepared_statement::checked_weak_ptr& val) {
// TODO: improve the size approximation - most of the entry is occupied by the key here.
return 100;
}
};
class authorized_prepared_statements_cache_key {
public:
using cache_key_type = std::pair<auth::authenticated_user, typename cql3::prepared_cache_key_type::cache_key_type>;
private:
cache_key_type _key;
public:
authorized_prepared_statements_cache_key(auth::authenticated_user user, cql3::prepared_cache_key_type prepared_cache_key)
: _key(std::move(user), std::move(prepared_cache_key.key())) {}
cache_key_type& key() { return _key; }
const cache_key_type& key() const { return _key; }
bool operator==(const authorized_prepared_statements_cache_key& other) const {
return _key == other._key;
}
bool operator!=(const authorized_prepared_statements_cache_key& other) const {
return !(*this == other);
}
static size_t hash(const auth::authenticated_user& user, const cql3::prepared_cache_key_type::cache_key_type& prep_cache_key) {
return utils::hash_combine(std::hash<auth::authenticated_user>()(user), utils::tuple_hash()(prep_cache_key));
}
};
/// \class authorized_prepared_statements_cache
/// \brief A cache of previously authorized statements.
///
/// Entries are inserted every time a new statement is authorized.
/// Entries are evicted in any of the following cases:
/// - When the corresponding prepared statement is not valid anymore.
/// - Periodically, with the same period as the permission cache is refreshed.
/// - If the corresponding entry hasn't been used for \ref entry_expiry.
class authorized_prepared_statements_cache {
public:
struct stats {
uint64_t authorized_prepared_statements_cache_evictions = 0;
};
static stats& shard_stats() {
static thread_local stats _stats;
return _stats;
}
struct authorized_prepared_statements_cache_stats_updater {
static void inc_hits() noexcept {}
static void inc_misses() noexcept {}
static void inc_blocks() noexcept {}
static void inc_evictions() noexcept {
++shard_stats().authorized_prepared_statements_cache_evictions;
}
};
private:
using cache_key_type = authorized_prepared_statements_cache_key;
using checked_weak_ptr = typename statements::prepared_statement::checked_weak_ptr;
using cache_type = utils::loading_cache<cache_key_type,
checked_weak_ptr,
utils::loading_cache_reload_enabled::yes,
authorized_prepared_statements_cache_size,
std::hash<cache_key_type>,
std::equal_to<cache_key_type>,
authorized_prepared_statements_cache_stats_updater>;
public:
using key_type = cache_key_type;
using value_type = checked_weak_ptr;
using entry_is_too_big = typename cache_type::entry_is_too_big;
using iterator = typename cache_type::iterator;
private:
cache_type _cache;
logging::logger& _logger;
public:
// Choose the memory budget so that it allows ~4K entries when a shard gets 1GB of RAM
authorized_prepared_statements_cache(std::chrono::milliseconds entry_expiration, std::chrono::milliseconds entry_refresh, size_t cache_size, logging::logger& logger)
: _cache(cache_size, entry_expiration, entry_refresh, logger, [this] (const key_type& k) {
_cache.remove(k);
return make_ready_future<value_type>();
})
, _logger(logger)
{}
future<> insert(auth::authenticated_user user, cql3::prepared_cache_key_type prep_cache_key, value_type v) noexcept {
return _cache.get_ptr(key_type(std::move(user), std::move(prep_cache_key)), [v = std::move(v)] (const cache_key_type&) mutable {
return make_ready_future<value_type>(std::move(v));
}).discard_result();
}
iterator find(const auth::authenticated_user& user, const cql3::prepared_cache_key_type& prep_cache_key) {
struct key_view {
const auth::authenticated_user& user_ref;
const cql3::prepared_cache_key_type& prep_cache_key_ref;
};
struct hasher {
size_t operator()(const key_view& kv) {
return cql3::authorized_prepared_statements_cache_key::hash(kv.user_ref, kv.prep_cache_key_ref.key());
}
};
struct equal {
bool operator()(const key_type& k1, const key_view& k2) {
return k1.key().first == k2.user_ref && k1.key().second == k2.prep_cache_key_ref.key();
}
bool operator()(const key_view& k2, const key_type& k1) {
return operator()(k1, k2);
}
};
return _cache.find(key_view{user, prep_cache_key}, hasher(), equal());
}
iterator end() {
return _cache.end();
}
void remove(const auth::authenticated_user& user, const cql3::prepared_cache_key_type& prep_cache_key) {
iterator it = find(user, prep_cache_key);
_cache.remove(it);
}
size_t size() const {
return _cache.size();
}
size_t memory_footprint() const {
return _cache.memory_footprint();
}
future<> stop() {
return _cache.stop();
}
};
}
namespace std {
template <>
struct hash<cql3::authorized_prepared_statements_cache_key> final {
size_t operator()(const cql3::authorized_prepared_statements_cache_key& k) const {
return cql3::authorized_prepared_statements_cache_key::hash(k.key().first, k.key().second);
}
};
inline std::ostream& operator<<(std::ostream& out, const cql3::authorized_prepared_statements_cache_key& k) {
return out << "{ " << k.key().first << ", " << k.key().second << " }";
}
}
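
The cache key in the new header above is a pair of (authenticated user, prepared statement key), hashed by combining the two component hashes. A reduced sketch of that pattern with standard containers might look as follows. Every name here is hypothetical: `hash_combine_sketch` is a boost-style stand-in for `utils::hash_combine`, plain strings replace `auth::authenticated_user` and the prepared cache key, and `std::unordered_map` replaces `utils::loading_cache`. Unlike the `find()` in the header, which builds a `key_view` of references to avoid copying, this sketch constructs a temporary key for each lookup.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for utils::hash_combine (boost-style mixing).
inline std::size_t hash_combine_sketch(std::size_t seed, std::size_t value) {
    return seed ^ (value + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

// Hypothetical composite key: (user name, prepared statement id).
using auth_key = std::pair<std::string, std::string>;

struct auth_key_hash {
    std::size_t operator()(const auth_key& k) const {
        return hash_combine_sketch(std::hash<std::string>()(k.first),
                                   std::hash<std::string>()(k.second));
    }
};

// Maps (user, statement id) to a flag meaning "already authorized".
using authorized_cache_sketch = std::unordered_map<auth_key, bool, auth_key_hash>;

// Copies both strings into a temporary key; the real header avoids this copy.
bool is_authorized(const authorized_cache_sketch& cache,
                   const std::string& user, const std::string& stmt_id) {
    return cache.find(auth_key(user, stmt_id)) != cache.end();
}
```

Keying on both components means the same prepared statement is authorized separately for each user, which matches the intent described in the class comment.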


@@ -22,6 +22,7 @@
#include "cql3/column_identifier.hh"
#include "exceptions/exceptions.hh"
#include "cql3/selection/simple_selector.hh"
#include "cql3/util.hh"
#include <regex>
@@ -62,14 +63,11 @@ sstring column_identifier::to_string() const {
}
sstring column_identifier::to_cql_string() const {
static const std::regex unquoted_identifier_re("[a-z][a-z0-9_]*");
if (std::regex_match(_text.begin(), _text.end(), unquoted_identifier_re)) {
return _text;
}
static const std::regex double_quote_re("\"");
std::string result = _text;
std::regex_replace(result, double_quote_re, "\"\"");
return '"' + result + '"';
return util::maybe_quote(_text);
}
sstring column_identifier::raw::to_cql_string() const {
return util::maybe_quote(_text);
}
column_identifier::raw::raw(sstring raw_text, bool keep_case)


@@ -123,6 +123,7 @@ public:
bool operator!=(const raw& other) const;
virtual sstring to_string() const;
sstring to_cql_string() const;
friend std::hash<column_identifier::raw>;
friend std::ostream& operator<<(std::ostream& out, const column_identifier::raw& id);


@@ -85,8 +85,8 @@ public:
virtual ::shared_ptr<terminal> bind(const query_options& options) override { return {}; }
virtual sstring to_string() const override { return "null"; }
};
static thread_local const ::shared_ptr<terminal> NULL_VALUE;
public:
static thread_local const ::shared_ptr<terminal> NULL_VALUE;
virtual ::shared_ptr<term> prepare(database& db, const sstring& keyspace, ::shared_ptr<column_specification> receiver) override {
if (!is_assignable(test_assignment(db, keyspace, receiver))) {
throw exceptions::invalid_request_exception("Invalid null value for counter increment/decrement");
@@ -123,7 +123,7 @@ public:
// This is a workaround for antlr3 not distinguishing between
// calling in lexer setText() with an empty string and not calling
// setText() at all.
if (text.size() == 1 && text[0] == -1) {
if (text.size() == 1 && text[0] == '\xFF') {
text.reset();
}
return ::make_shared<literal>(type::STRING, text);
@@ -203,10 +203,14 @@ public:
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override {
auto value = _t->bind_and_get(params._options);
execute(m, prefix, params, column, std::move(value));
}
static void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params, const column_definition& column, cql3::raw_value_view value) {
if (value.is_null()) {
m.set_cell(prefix, column, std::move(make_dead_cell(params)));
} else if (value.is_value()) {
m.set_cell(prefix, column, std::move(make_cell(*value, params)));
m.set_cell(prefix, column, std::move(make_cell(*column.type, *value, params)));
}
}
};
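
One hunk above replaces `text[0] == -1` with `text[0] == '\xFF'` in the antlr3 empty-string workaround. The diff does not state the motivation; one plausible reading, offered here as an assumption, is portability across platforms where plain `char` is unsigned. The sketch below uses hypothetical function names to show the difference: a comparison against the character literal matches the 0xFF byte whatever the signedness of `char`, while a byte held in an unsigned type never compares equal to -1.

```cpp
#include <string>

// Returns true when the text is exactly the single-byte 0xFF marker.
// Both sides of the comparison are plain char, so signedness does not matter.
bool is_unset_marker(const std::string& text) {
    return text.size() == 1 && text[0] == '\xFF';
}

// What `c == -1` evaluates to when the byte is held in an unsigned type,
// which is how plain char behaves on platforms where char is unsigned.
bool equals_minus_one_as_unsigned(unsigned char c) {
    int promoted = c;       // integral promotion: 0xFF becomes 255
    return promoted == -1;  // false for every possible byte value
}
```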


@@ -395,18 +395,15 @@ operator<<(std::ostream& os, const cql3_type::raw& r) {
namespace util {
sstring maybe_quote(const sstring& s) {
static const std::regex unquoted("\\w*");
static const std::regex double_quote("\"");
if (std::regex_match(s.begin(), s.end(), unquoted)) {
return s;
sstring maybe_quote(const sstring& identifier) {
static const std::regex unquoted_identifier_re("[a-z][a-z0-9_]*");
if (std::regex_match(identifier.begin(), identifier.end(), unquoted_identifier_re)) {
return identifier;
}
std::ostringstream ss;
ss << "\"";
std::regex_replace(std::ostreambuf_iterator<char>(ss), s.begin(), s.end(), double_quote, "\"\"");
ss << "\"";
return ss.str();
static const std::regex double_quote_re("\"");
std::string result = identifier;
std::regex_replace(result, double_quote_re, "\"\"");
return '"' + result + '"';
}
}
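
A review note on the `maybe_quote` body shown above: the `std::regex_replace` overload that takes a string returns the rewritten string and leaves its argument unchanged, so as written the call's result appears to be discarded and embedded double quotes would not be doubled. A minimal sketch of the apparent intent, using `std::string` in place of `sstring` and the hypothetical name `maybe_quote_sketch`, could look like this:

```cpp
#include <regex>
#include <string>

// Returns the identifier unchanged when it is a plain lowercase identifier,
// otherwise wraps it in double quotes and doubles any embedded double quote.
std::string maybe_quote_sketch(const std::string& identifier) {
    static const std::regex unquoted_identifier_re("[a-z][a-z0-9_]*");
    if (std::regex_match(identifier, unquoted_identifier_re)) {
        return identifier;
    }
    static const std::regex double_quote_re("\"");
    // regex_replace returns the rewritten string; it does not modify its input.
    std::string escaped = std::regex_replace(identifier, double_quote_re, "\"\"");
    return '"' + escaped + '"';
}
```

With this version, `Abc` becomes `"Abc"` and `a"b` becomes `"a""b"`, while `abc_1` is returned as-is.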


@@ -45,6 +45,7 @@
#include "service/query_state.hh"
#include "service/storage_proxy.hh"
#include "cql3/query_options.hh"
#include "timeout_config.hh"
namespace cql_transport {
@@ -62,10 +63,15 @@ class metadata;
shared_ptr<const metadata> make_empty_metadata();
class cql_statement {
timeout_config_selector _timeout_config_selector;
public:
explicit cql_statement(timeout_config_selector timeout_selector) : _timeout_config_selector(timeout_selector) {}
virtual ~cql_statement()
{ }
timeout_config_selector get_timeout_config_selector() const { return _timeout_config_selector; }
virtual uint32_t get_bound_terms() = 0;
/**
@@ -81,7 +87,7 @@ public:
*
* @param state the current client state
*/
virtual void validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) = 0;
virtual void validate(service::storage_proxy& proxy, const service::client_state& state) = 0;
/**
* Execute the statement and return the resulting result or null if there is no result.
@@ -90,15 +96,7 @@ public:
* @param options options for this query (consistency, variables, pageSize, ...)
*/
virtual future<::shared_ptr<cql_transport::messages::result_message>>
execute(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options) = 0;
/**
* Variant of execute used for internal query against the system tables, and thus only query the local node = 0.
*
* @param state the current query state
*/
virtual future<::shared_ptr<cql_transport::messages::result_message>>
execute_internal(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options) = 0;
execute(service::storage_proxy& proxy, service::query_state& state, const query_options& options) = 0;
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const = 0;
@@ -111,6 +109,7 @@ public:
class cql_statement_no_metadata : public cql_statement {
public:
using cql_statement::cql_statement;
virtual shared_ptr<const metadata> get_result_metadata() const override {
return make_empty_metadata();
}
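
The hunks above make every `cql_statement` carry a `timeout_config_selector`, but the diff does not show how that type is defined. One plausible shape, assumed here purely for illustration, is a pointer to a data member of a timeout configuration struct, so that each statement class names which timeout applies to it and the value is looked up later. All names in the sketch are hypothetical.

```cpp
#include <chrono>

// Hypothetical configuration holding one timeout per kind of operation.
struct timeout_config_sketch {
    std::chrono::milliseconds read_timeout;
    std::chrono::milliseconds write_timeout;
};

// Assumed shape of the selector: a pointer to one member of the config.
using selector_sketch = std::chrono::milliseconds timeout_config_sketch::*;

class statement_sketch {
    selector_sketch _selector;
public:
    explicit statement_sketch(selector_sketch s) : _selector(s) {}
    selector_sketch get_timeout_config_selector() const { return _selector; }
    // Resolves the selector against a concrete configuration.
    std::chrono::milliseconds timeout(const timeout_config_sketch& cfg) const {
        return cfg.*_selector;
    }
};
```

Under this assumption, a read statement would be constructed with `&timeout_config_sketch::read_timeout`, and the effective timeout follows whatever configuration is passed in at execution time.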


@@ -67,6 +67,12 @@ class error_collector : public error_listener<RecognizerType, ExceptionBaseType>
*/
const sstring_view _query;
/**
* An empty bitset to be used as a workaround for AntLR null dereference
* bug.
*/
static typename ExceptionBaseType::BitsetListType _empty_bit_list;
public:
/**
@@ -144,6 +150,14 @@ private:
break;
}
default:
// AntLR Exception class has a bug of dereferencing a null
// pointer in the displayRecognitionError. The following
// if statement makes sure it will not be null before the
// call to that function (displayRecognitionError).
// bug reference: https://github.com/antlr/antlr3/issues/191
if (!ex->get_expectingSet()) {
ex->set_expectingSet(&_empty_bit_list);
}
ex->displayRecognitionError(token_names, msg);
}
return msg.str();
@@ -345,4 +359,8 @@ private:
#endif
};
template<typename RecognizerType, typename TokenType, typename ExceptionBaseType>
typename ExceptionBaseType::BitsetListType
error_collector<RecognizerType,TokenType,ExceptionBaseType>::_empty_bit_list = typename ExceptionBaseType::BitsetListType();
}
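
The workaround above has two parts: a static data member of a class template, which needs an out-of-class definition, and a null check that substitutes that member before the exception is displayed. The following reduced sketch shows both parts with hypothetical types: `fake_exception` and `std::vector<int>` stand in for the ANTLR exception class and its bitset list, and none of these names come from the patch.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an exception object whose expecting set may be null.
struct fake_exception {
    const std::vector<int>* expecting = nullptr;
};

template <typename Ex>
class collector_sketch {
    // One shared empty set per instantiation, used as the non-null fallback.
    static std::vector<int> _empty_set;
public:
    // Guards against a null expecting set, then reports how many tokens were expected.
    static std::size_t expected_count(Ex& ex) {
        if (!ex.expecting) {
            ex.expecting = &_empty_set;
        }
        return ex.expecting->size();
    }
};

// Out-of-class definition of the static data member of the class template.
template <typename Ex>
std::vector<int> collector_sketch<Ex>::_empty_set = std::vector<int>();
```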


@@ -42,6 +42,7 @@
#pragma once
#include "types.hh"
#include "cql3/cql3_type.hh"
#include <vector>
#include <iosfwd>
#include <boost/functional/hash.hpp>
@@ -90,6 +91,10 @@ public:
return false;
}
virtual sstring column_name(const std::vector<sstring>& column_names) override {
return sprint("%s(%s)", _name, join(", ", column_names));
}
virtual void print(std::ostream& os) const override;
};
@@ -101,9 +106,9 @@ abstract_function::print(std::ostream& os) const {
if (i > 0) {
os << ", ";
}
os << _arg_types[i]->name(); // FIXME: asCQL3Type()
os << _arg_types[i]->as_cql3_type()->to_string();
}
os << ") -> " << _return_type->name(); // FIXME: asCQL3Type()
os << ") -> " << _return_type->as_cql3_type()->to_string();
}
}
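
The `column_name` override added above formats a result column name as the function name followed by its argument column names, using `sprint("%s(%s)", _name, join(", ", column_names))`. A minimal sketch of the same formatting with only the standard library is shown below. The free function `function_column_name` is hypothetical, and it takes the function name as a plain string, whereas `_name` in the patch is a richer type whose printed form is not shown in this diff.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Builds "fn(arg1, arg2, ...)" from a function name and its argument column names.
std::string function_column_name(const std::string& fn,
                                 const std::vector<std::string>& args) {
    std::string out = fn + "(";
    for (std::size_t i = 0; i < args.size(); ++i) {
        if (i > 0) {
            out += ", ";
        }
        out += args[i];
    }
    return out + ")";
}
```

For example, the name `max` with arguments `a` and `b` yields `max(a, b)`. The `count_rows_function` in the next file overrides this default and always returns `count`.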


@@ -67,6 +67,19 @@ public:
}
};
static const sstring COUNT_ROWS_FUNCTION_NAME = "countRows";
class count_rows_function final : public native_aggregate_function {
public:
count_rows_function() : native_aggregate_function(COUNT_ROWS_FUNCTION_NAME, long_type, {}) {}
virtual std::unique_ptr<aggregate> new_aggregate() override {
return std::make_unique<impl_count_function>();
}
virtual sstring column_name(const std::vector<sstring>& column_names) override {
return "count";
}
};
/**
* The function used to count the number of rows of a result set. This function is called when COUNT(*) or COUNT(1)
* is specified.
@@ -74,7 +87,7 @@ public:
inline
shared_ptr<aggregate_function>
make_count_rows_function() {
return make_native_aggregate_function_using<impl_count_function>("countRows", long_type);
return make_shared<count_rows_function>();
}
template <typename Type>
@@ -214,9 +227,29 @@ make_avg_function() {
return make_shared<avg_function_for<Type>>();
}
template <typename T>
struct aggregate_type_for {
using type = T;
};
template<>
struct aggregate_type_for<simple_date_native_type> {
using type = simple_date_native_type::primary_type;
};
template<>
struct aggregate_type_for<timestamp_native_type> {
using type = timestamp_native_type::primary_type;
};
template<>
struct aggregate_type_for<timeuuid_native_type> {
using type = timeuuid_native_type::primary_type;
};
template <typename Type>
class impl_max_function_for final : public aggregate_function::aggregate {
std::experimental::optional<Type> _max{};
std::experimental::optional<typename aggregate_type_for<Type>::type> _max{};
public:
virtual void reset() override {
_max = {};
@@ -225,13 +258,13 @@ public:
if (!_max) {
return {};
}
return data_type_for<Type>()->decompose(*_max);
return data_type_for<Type>()->decompose(Type{*_max});
}
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) override {
if (!values[0]) {
return;
}
auto val = value_cast<Type>(data_type_for<Type>()->deserialize(*values[0]));
auto val = value_cast<typename aggregate_type_for<Type>::type>(data_type_for<Type>()->deserialize(*values[0]));
if (!_max) {
_max = val;
} else {
@@ -263,7 +296,7 @@ make_max_function() {
template <typename Type>
class impl_min_function_for final : public aggregate_function::aggregate {
std::experimental::optional<Type> _min{};
std::experimental::optional<typename aggregate_type_for<Type>::type> _min{};
public:
virtual void reset() override {
_min = {};
@@ -272,13 +305,13 @@ public:
if (!_min) {
return {};
}
return data_type_for<Type>()->decompose(*_min);
return data_type_for<Type>()->decompose(Type{*_min});
}
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) override {
if (!values[0]) {
return;
}
auto val = value_cast<Type>(data_type_for<Type>()->deserialize(*values[0]));
auto val = value_cast<typename aggregate_type_for<Type>::type>(data_type_for<Type>()->deserialize(*values[0]));
if (!_min) {
_min = val;
} else {
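
Reviewer note: the `aggregate_type_for` trait above lets the min/max aggregates keep their running value in a wrapper's underlying `primary_type` and convert back to the wrapper only when producing the result. A minimal, self-contained sketch of that pattern follows. `wrapped_millis` and `running_max` are hypothetical stand-ins that do not appear in the diff, and the sketch assumes C++17 `std::optional` rather than `std::experimental::optional`.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <type_traits>

// Hypothetical wrapper, loosely modelled on types such as timestamp_native_type:
// it exposes the type it wraps as primary_type.
struct wrapped_millis {
    using primary_type = int64_t;
    primary_type value;
};

// Default case: aggregate over the type itself.
template <typename T>
struct aggregate_type_for {
    using type = T;
};

// Wrapper types aggregate over their underlying primary type.
template <>
struct aggregate_type_for<wrapped_millis> {
    using type = wrapped_millis::primary_type;
};

// Running maximum stored in the aggregate type and converted back on output.
template <typename T>
class running_max {
    using agg_type = typename aggregate_type_for<T>::type;
    std::optional<agg_type> _max;

    static agg_type unwrap(const T& v) {
        if constexpr (std::is_same_v<T, agg_type>) {
            return v;        // plain types are compared directly
        } else {
            return v.value;  // wrappers are compared by their primary value
        }
    }
public:
    void add(const T& v) {
        agg_type raw = unwrap(v);
        if (!_max || *_max < raw) {
            _max = raw;
        }
    }
    // Empty when no input was seen, like the aggregates in the diff.
    std::optional<T> result() const {
        if (!_max) {
            return std::nullopt;
        }
        return T{*_max};
    }
};

// Small self-check showing both the wrapped and the plain case.
inline void running_max_example() {
    running_max<wrapped_millis> m;
    m.add(wrapped_millis{5});
    m.add(wrapped_millis{9});
    assert(m.result()->value == 9);
}
```

Keeping the comparison on the primary type means the wrapper itself does not need an ordering operator, which may be why the diff takes this route; that motivation is an inference, not something the commit text states.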

View File

@@ -81,6 +81,15 @@ public:
virtual void print(std::ostream& os) const = 0;
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) = 0;
virtual bool has_reference_to(function& f) = 0;
/**
* Returns the name of the function to use within a ResultSet.
*
* @param column_names the names of the columns used to call the function
* @return the name of the function to use within a ResultSet
*/
virtual sstring column_name(const std::vector<sstring>& column_names) = 0;
friend class function_call;
friend std::ostream& operator<<(std::ostream& os, const function& f);
};
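
Reviewer note: the new `column_name()` hook decides how a function call is labelled in a result set. The default shown earlier in this diff formats it as `name(arg1, arg2)`, while `count_rows_function` overrides it to return `count`. A rough standalone sketch of the default formatting is given below. It uses `std::string` in place of `sstring` and a hand-written join instead of `sprint`/`join`; the name `function_column_name` is illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Builds "fn(col1, col2, ...)" from a function name and the column names
// it was called with. Hypothetical helper, not part of the diff.
inline std::string function_column_name(const std::string& fn_name,
                                        const std::vector<std::string>& column_names) {
    std::string out = fn_name + "(";
    for (std::size_t i = 0; i < column_names.size(); ++i) {
        if (i > 0) {
            out += ", ";
        }
        out += column_names[i];
    }
    out += ")";
    return out;
}

// Small self-check of the two interesting cases.
inline void function_column_name_example() {
    assert(function_column_name("max", {"a", "b"}) == "max(a, b)");
    assert(function_column_name("now", {}) == "now()");
}
```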

View File

@@ -42,10 +42,16 @@
#pragma once
#include "core/sstring.hh"
#include "db/system_keyspace.hh"
#include "seastarx.hh"
#include <iosfwd>
#include <functional>
namespace db {
sstring system_keyspace_name();
}
namespace cql3 {
namespace functions {
@@ -56,7 +62,7 @@ public:
sstring name;
static function_name native_function(sstring name) {
return function_name(db::system_keyspace::NAME, name);
return function_name(db::system_keyspace_name(), name);
}
function_name() = default; // for ANTLR

View File

@@ -20,6 +20,7 @@
*/
#include "functions.hh"
#include "function_call.hh"
#include "token_fct.hh"
#include "cql3/maps.hh"
@@ -41,11 +42,22 @@ functions::init() {
declare(time_uuid_fcts::make_min_timeuuid_fct());
declare(time_uuid_fcts::make_max_timeuuid_fct());
declare(time_uuid_fcts::make_date_of_fct());
declare(time_uuid_fcts::make_unix_timestamp_of_fcf());
declare(time_uuid_fcts::make_unix_timestamp_of_fct());
declare(time_uuid_fcts::make_currenttimestamp_fct());
declare(time_uuid_fcts::make_currentdate_fct());
declare(time_uuid_fcts::make_currenttime_fct());
declare(time_uuid_fcts::make_currenttimeuuid_fct());
declare(time_uuid_fcts::make_timeuuidtodate_fct());
declare(time_uuid_fcts::make_timestamptodate_fct());
declare(time_uuid_fcts::make_timeuuidtotimestamp_fct());
declare(time_uuid_fcts::make_datetotimestamp_fct());
declare(time_uuid_fcts::make_timeuuidtounixtimestamp_fct());
declare(time_uuid_fcts::make_timestamptounixtimestamp_fct());
declare(time_uuid_fcts::make_datetounixtimestamp_fct());
declare(make_uuid_fct());
for (auto&& type : cql3_type::values()) {
// Note: because text and varchar ends up being synonimous, our automatic makeToBlobFunction doesn't work
// Note: because text and varchar ends up being synonymous, our automatic makeToBlobFunction doesn't work
// for varchar, so we special case it below. We also skip blob for obvious reasons.
if (type == cql3_type::varchar || type == cql3_type::blob) {
continue;
@@ -95,6 +107,22 @@ functions::init() {
declare(aggregate_fcts::make_max_function<sstring>());
declare(aggregate_fcts::make_min_function<sstring>());
declare(aggregate_fcts::make_count_function<simple_date_native_type>());
declare(aggregate_fcts::make_max_function<simple_date_native_type>());
declare(aggregate_fcts::make_min_function<simple_date_native_type>());
declare(aggregate_fcts::make_count_function<timestamp_native_type>());
declare(aggregate_fcts::make_max_function<timestamp_native_type>());
declare(aggregate_fcts::make_min_function<timestamp_native_type>());
declare(aggregate_fcts::make_count_function<timeuuid_native_type>());
declare(aggregate_fcts::make_max_function<timeuuid_native_type>());
declare(aggregate_fcts::make_min_function<timeuuid_native_type>());
declare(aggregate_fcts::make_count_function<utils::UUID>());
declare(aggregate_fcts::make_max_function<utils::UUID>());
declare(aggregate_fcts::make_min_function<utils::UUID>());
//FIXME:
//declare(aggregate_fcts::make_count_function<bytes>());
//declare(aggregate_fcts::make_max_function<bytes>());
@@ -144,23 +172,73 @@ functions::get_overload_count(const function_name& name) {
return _declared.count(name);
}
inline
shared_ptr<function>
make_to_json_function(data_type t) {
return make_native_scalar_function<true>("tojson", utf8_type, {t},
[t](cql_serialization_format sf, const std::vector<bytes_opt>& parameters) -> bytes_opt {
return utf8_type->decompose(t->to_json_string(parameters[0]));
});
}
inline
shared_ptr<function>
make_from_json_function(database& db, const sstring& keyspace, data_type t) {
return make_native_scalar_function<true>("fromjson", t, {utf8_type},
[&db, &keyspace, t](cql_serialization_format sf, const std::vector<bytes_opt>& parameters) -> bytes_opt {
Json::Value json_value = json::to_json_value(utf8_type->to_string(parameters[0].value()));
bytes_opt parsed_json_value;
if (!json_value.isNull()) {
parsed_json_value.emplace(t->from_json_object(json_value, sf));
}
return std::move(parsed_json_value);
});
}
shared_ptr<function>
functions::get(database& db,
const sstring& keyspace,
const function_name& name,
const std::vector<shared_ptr<assignment_testable>>& provided_args,
const sstring& receiver_ks,
const sstring& receiver_cf) {
const sstring& receiver_cf,
shared_ptr<column_specification> receiver) {
static const function_name TOKEN_FUNCTION_NAME = function_name::native_function("token");
static const function_name TO_JSON_FUNCTION_NAME = function_name::native_function("tojson");
static const function_name FROM_JSON_FUNCTION_NAME = function_name::native_function("fromjson");
if (name.has_keyspace()
? name == TOKEN_FUNCTION_NAME
: name.name == TOKEN_FUNCTION_NAME.name)
{
? name == TOKEN_FUNCTION_NAME
: name.name == TOKEN_FUNCTION_NAME.name) {
return ::make_shared<token_fct>(db.find_schema(receiver_ks, receiver_cf));
}
if (name.has_keyspace()
? name == TO_JSON_FUNCTION_NAME
: name.name == TO_JSON_FUNCTION_NAME.name) {
if (provided_args.size() != 1) {
throw exceptions::invalid_request_exception("toJson() accepts 1 argument only");
}
selection::selector *sp = dynamic_cast<selection::selector *>(provided_args[0].get());
if (!sp) {
throw exceptions::invalid_request_exception("toJson() is only valid in SELECT clause");
}
return make_to_json_function(sp->get_type());
}
if (name.has_keyspace()
? name == FROM_JSON_FUNCTION_NAME
: name.name == FROM_JSON_FUNCTION_NAME.name) {
if (provided_args.size() != 1) {
throw exceptions::invalid_request_exception("fromJson() accepts 1 argument only");
}
if (!receiver) {
throw exceptions::invalid_request_exception("fromJson() can only be called if receiver type is known");
}
return make_from_json_function(db, keyspace, receiver->type);
}
std::vector<shared_ptr<function>> candidates;
auto&& add_declared = [&] (function_name fn) {
auto&& fns = _declared.equal_range(fn);
@@ -405,7 +483,7 @@ function_call::raw::prepare(database& db, const sstring& keyspace, ::shared_ptr<
[] (auto&& x) -> shared_ptr<assignment_testable> {
return x;
});
auto&& fun = functions::functions::get(db, keyspace, _name, args, receiver->ks_name, receiver->cf_name);
auto&& fun = functions::functions::get(db, keyspace, _name, args, receiver->ks_name, receiver->cf_name, receiver);
if (!fun) {
throw exceptions::invalid_request_exception(sprint("Unknown function %s called", _name));
}
@@ -469,7 +547,7 @@ function_call::raw::test_assignment(database& db, const sstring& keyspace, share
// of another, existing, function. In that case, we return true here because we'll throw a proper exception
// later with a more helpful error message than if we were to return false here.
try {
auto&& fun = functions::get(db, keyspace, _name, _terms, receiver->ks_name, receiver->cf_name);
auto&& fun = functions::get(db, keyspace, _name, _terms, receiver->ks_name, receiver->cf_name, receiver);
if (fun && receiver->type->equals(fun->return_type())) {
return assignment_testable::test_result::EXACT_MATCH;
} else if (!fun || receiver->type->is_value_compatible_with(*fun->return_type())) {

View File

@@ -80,16 +80,18 @@ public:
const function_name& name,
const std::vector<shared_ptr<assignment_testable>>& provided_args,
const sstring& receiver_ks,
const sstring& receiver_cf);
const sstring& receiver_cf,
::shared_ptr<column_specification> receiver = nullptr);
template <typename AssignmentTestablePtrRange>
static shared_ptr<function> get(database& db,
const sstring& keyspace,
const function_name& name,
AssignmentTestablePtrRange&& provided_args,
const sstring& receiver_ks,
const sstring& receiver_cf) {
const sstring& receiver_cf,
::shared_ptr<column_specification> receiver = nullptr) {
const std::vector<shared_ptr<assignment_testable>> args(std::begin(provided_args), std::end(provided_args));
return get(db, keyspace, name, args, receiver_ks, receiver_cf);
return get(db, keyspace, name, args, receiver_ks, receiver_cf, receiver);
}
static std::vector<shared_ptr<function>> find(const function_name& name);
static shared_ptr<function> find(const function_name& name, const std::vector<data_type>& arg_types);

View File

@@ -64,23 +64,5 @@ public:
}
};
template <class Aggregate>
class native_aggregate_function_using : public native_aggregate_function {
public:
native_aggregate_function_using(sstring name, data_type type)
: native_aggregate_function(std::move(name), type, {}) {
}
virtual std::unique_ptr<aggregate> new_aggregate() override {
return std::make_unique<Aggregate>();
}
};
template <class Aggregate>
shared_ptr<native_aggregate_function>
make_native_aggregate_function_using(sstring name, data_type type) {
return ::make_shared<native_aggregate_function_using<Aggregate>>(name, type);
}
}
}

View File

@@ -117,7 +117,7 @@ make_date_of_fct() {
inline
shared_ptr<function>
make_unix_timestamp_of_fcf() {
make_unix_timestamp_of_fct() {
return make_native_scalar_function<true>("unixtimestampof", long_type, { timeuuid_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
@@ -129,6 +129,163 @@ make_unix_timestamp_of_fcf() {
});
}
inline shared_ptr<function>
make_currenttimestamp_fct() {
return make_native_scalar_function<true>("currenttimestamp", timestamp_type, {},
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
return {timestamp_type->decompose(timestamp_native_type{db_clock::now()})};
});
}
inline shared_ptr<function>
make_currenttime_fct() {
return make_native_scalar_function<true>("currenttime", time_type, {},
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
constexpr int64_t milliseconds_in_day = 3600 * 24 * 1000;
int64_t milliseconds_since_epoch = std::chrono::duration_cast<std::chrono::milliseconds>(db_clock::now().time_since_epoch()).count();
int64_t nanoseconds_today = (milliseconds_since_epoch % milliseconds_in_day) * 1000 * 1000;
return {time_type->decompose(time_native_type{nanoseconds_today})};
});
}
inline shared_ptr<function>
make_currentdate_fct() {
return make_native_scalar_function<true>("currentdate", simple_date_type, {},
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
auto to_simple_date = get_castas_fctn(simple_date_type, timestamp_type);
return {simple_date_type->decompose(to_simple_date(timestamp_native_type{db_clock::now()}))};
});
}
inline
shared_ptr<function>
make_currenttimeuuid_fct() {
return make_native_scalar_function<true>("currenttimeuuid", timeuuid_type, {},
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
return {timeuuid_type->decompose(timeuuid_native_type{utils::UUID_gen::get_time_UUID()})};
});
}
inline
shared_ptr<function>
make_timeuuidtodate_fct() {
return make_native_scalar_function<true>("todate", simple_date_type, { timeuuid_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
return {};
}
auto ts = db_clock::time_point(db_clock::duration(UUID_gen::unix_timestamp(UUID_gen::get_UUID(*bb))));
auto to_simple_date = get_castas_fctn(simple_date_type, timestamp_type);
return {simple_date_type->decompose(to_simple_date(ts))};
});
}
inline
shared_ptr<function>
make_timestamptodate_fct() {
return make_native_scalar_function<true>("todate", simple_date_type, { timestamp_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
return {};
}
auto ts_obj = timestamp_type->deserialize(*bb);
if (ts_obj.is_null()) {
return {};
}
auto to_simple_date = get_castas_fctn(simple_date_type, timestamp_type);
return {simple_date_type->decompose(to_simple_date(ts_obj))};
});
}
inline
shared_ptr<function>
make_timeuuidtotimestamp_fct() {
return make_native_scalar_function<true>("totimestamp", timestamp_type, { timeuuid_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
return {};
}
auto ts = db_clock::time_point(db_clock::duration(UUID_gen::unix_timestamp(UUID_gen::get_UUID(*bb))));
return {timestamp_type->decompose(ts)};
});
}
inline
shared_ptr<function>
make_datetotimestamp_fct() {
return make_native_scalar_function<true>("totimestamp", timestamp_type, { simple_date_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
return {};
}
auto simple_date_obj = simple_date_type->deserialize(*bb);
if (simple_date_obj.is_null()) {
return {};
}
auto from_simple_date = get_castas_fctn(timestamp_type, simple_date_type);
return {timestamp_type->decompose(from_simple_date(simple_date_obj))};
});
}
inline
shared_ptr<function>
make_timeuuidtounixtimestamp_fct() {
return make_native_scalar_function<true>("tounixtimestamp", long_type, { timeuuid_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
return {};
}
return {long_type->decompose(UUID_gen::unix_timestamp(UUID_gen::get_UUID(*bb)))};
});
}
inline
shared_ptr<function>
make_timestamptounixtimestamp_fct() {
return make_native_scalar_function<true>("tounixtimestamp", long_type, { timestamp_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
return {};
}
auto ts_obj = timestamp_type->deserialize(*bb);
if (ts_obj.is_null()) {
return {};
}
return {long_type->decompose(ts_obj)};
});
}
inline
shared_ptr<function>
make_datetounixtimestamp_fct() {
return make_native_scalar_function<true>("tounixtimestamp", long_type, { simple_date_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
return {};
}
auto simple_date_obj = simple_date_type->deserialize(*bb);
if (simple_date_obj.is_null()) {
return {};
}
auto from_simple_date = get_castas_fctn(timestamp_type, simple_date_type);
return {long_type->decompose(from_simple_date(simple_date_obj))};
});
}
}
}
}
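
Reviewer note: `make_currenttime_fct()` above derives the time of day by taking milliseconds since the epoch modulo the number of milliseconds in a day, then scaling to nanoseconds. The arithmetic can be checked in isolation with the sketch below. It assumes non-negative timestamps and substitutes `std::chrono::system_clock` for `db_clock`, which is not available outside the code base; both function names are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Nanoseconds elapsed since midnight (UTC) for a non-negative count of
// milliseconds since the Unix epoch.
inline int64_t nanoseconds_of_day(int64_t milliseconds_since_epoch) {
    constexpr int64_t milliseconds_in_day = 3600 * 24 * 1000;
    return (milliseconds_since_epoch % milliseconds_in_day) * 1000 * 1000;
}

// Same computation against the current wall-clock time.
// Assumption: system_clock counts from the Unix epoch, which is guaranteed
// from C++20 and holds on common platforms before that.
inline int64_t nanoseconds_of_day_now() {
    using namespace std::chrono;
    int64_t ms = duration_cast<milliseconds>(system_clock::now().time_since_epoch()).count();
    return nanoseconds_of_day(ms);
}

// Small self-check: 1.5 seconds into the second day.
inline void nanoseconds_of_day_example() {
    assert(nanoseconds_of_day(86400000LL + 1500) == 1500LL * 1000 * 1000);
}
```

Because the source value has millisecond resolution, the result is always a multiple of one million nanoseconds.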

View File

@@ -202,12 +202,6 @@ lists::delayed_value::bind(const query_options& options) {
if (bo.is_unset_value()) {
return constants::UNSET_VALUE;
}
// We don't support value > 64K because the serialization format encode the length as an unsigned short.
if (bo->size() > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(sprint("List value is too long. List values are limited to %d bytes but %d bytes value provided",
std::numeric_limits<uint16_t>::max(),
bo->size()));
}
buffers.push_back(std::move(to_bytes(*bo)));
}
@@ -243,7 +237,12 @@ lists::precision_time::get_next(db_clock::time_point millis) {
void
lists::setter::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
const auto& value = _t->bind(params._options);
auto value = _t->bind(params._options);
execute(m, prefix, params, column, std::move(value));
}
void
lists::setter::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params, const column_definition& column, ::shared_ptr<terminal> value) {
if (value == constants::UNSET_VALUE) {
return;
}
@@ -305,12 +304,7 @@ lists::setter_by_index::execute(mutation& m, const clustering_key_prefix& prefix
if (!value) {
mut.cells.emplace_back(eidx, params.make_dead_cell());
} else {
if (value->size() > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(
sprint("List value is too long. List values are limited to %d bytes but %d bytes value provided",
std::numeric_limits<uint16_t>::max(), value->size()));
}
mut.cells.emplace_back(eidx, params.make_cell(*value));
mut.cells.emplace_back(eidx, params.make_cell(*ltype->value_comparator(), *value, atomic_cell::collection_member::yes));
}
auto smut = ltype->serialize_mutation_form(mut);
m.set_cell(prefix, column, atomic_cell_or_collection::from_collection_mutation(std::move(smut)));
@@ -337,7 +331,7 @@ lists::setter_by_uuid::execute(mutation& m, const clustering_key_prefix& prefix,
list_type_impl::mutation mut;
mut.cells.reserve(1);
mut.cells.emplace_back(to_bytes(*index), params.make_cell(*value));
mut.cells.emplace_back(to_bytes(*index), params.make_cell(*ltype->value_comparator(), *value, atomic_cell::collection_member::yes));
auto smut = ltype->serialize_mutation_form(mut);
m.set_cell(prefix, column,
atomic_cell_or_collection::from_collection_mutation(
@@ -376,7 +370,7 @@ lists::do_append(shared_ptr<term> value,
auto uuid1 = utils::UUID_gen::get_time_UUID_bytes();
auto uuid = bytes(reinterpret_cast<const int8_t*>(uuid1.data()), uuid1.size());
// FIXME: can e be empty?
appended.cells.emplace_back(std::move(uuid), params.make_cell(*e));
appended.cells.emplace_back(std::move(uuid), params.make_cell(*ltype->value_comparator(), *e, atomic_cell::collection_member::yes));
}
m.set_cell(prefix, column, ltype->serialize_mutation_form(appended));
} else {
@@ -385,7 +379,7 @@ lists::do_append(shared_ptr<term> value,
m.set_cell(prefix, column, params.make_dead_cell());
} else {
auto newv = list_value->get_with_protocol_version(cql_serialization_format::internal());
m.set_cell(prefix, column, params.make_cell(std::move(newv)));
m.set_cell(prefix, column, params.make_cell(*column.type, std::move(newv)));
}
}
}
@@ -406,14 +400,14 @@ lists::prepender::execute(mutation& m, const clustering_key_prefix& prefix, cons
mut.cells.reserve(lvalue->get_elements().size());
// We reverse the order of insertion, so that the last element gets the latest time
// (lists are sorted by time)
auto&& ltype = static_cast<const list_type_impl*>(column.type.get());
for (auto&& v : lvalue->_elements | boost::adaptors::reversed) {
auto&& pt = precision_time::get_next(time);
auto uuid = utils::UUID_gen::get_time_UUID_bytes(pt.millis.time_since_epoch().count(), pt.nanos);
mut.cells.emplace_back(bytes(uuid.data(), uuid.size()), params.make_cell(*v));
mut.cells.emplace_back(bytes(uuid.data(), uuid.size()), params.make_cell(*ltype->value_comparator(), *v, atomic_cell::collection_member::yes));
}
// now reverse again, to get the original order back
std::reverse(mut.cells.begin(), mut.cells.end());
auto&& ltype = static_cast<const list_type_impl*>(column.type.get());
m.set_cell(prefix, column, atomic_cell_or_collection::from_collection_mutation(ltype->serialize_mutation_form(std::move(mut))));
}

View File

@@ -147,6 +147,7 @@ public:
: operation(column, std::move(t)) {
}
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override;
static void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params, const column_definition& column, ::shared_ptr<terminal> value);
};
class setter_by_index : public operation {

View File

@@ -245,11 +245,6 @@ maps::delayed_value::bind(const query_options& options) {
if (value_bytes.is_unset_value()) {
return constants::UNSET_VALUE;
}
if (value_bytes->size() > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(sprint("Map value is too long. Map values are limited to %d bytes but %d bytes value provided",
std::numeric_limits<uint16_t>::max(),
value_bytes->size()));
}
buffers.emplace(std::move(to_bytes(*key_bytes)), std::move(to_bytes(*value_bytes)));
}
return ::make_shared<value>(std::move(buffers));
@@ -271,6 +266,11 @@ maps::marker::bind(const query_options& options) {
void
maps::setter::execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) {
auto value = _t->bind(params._options);
execute(m, row_key, params, column, std::move(value));
}
void
maps::setter::execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params, const column_definition& column, ::shared_ptr<terminal> value) {
if (value == constants::UNSET_VALUE) {
return;
}
@@ -300,16 +300,11 @@ maps::setter_by_key::execute(mutation& m, const clustering_key_prefix& prefix, c
if (!key) {
throw invalid_request_exception("Invalid null map key");
}
if (value && value->size() >= std::numeric_limits<uint16_t>::max()) {
throw invalid_request_exception(
sprint("Map value is too long. Map values are limited to %d bytes but %d bytes value provided",
std::numeric_limits<uint16_t>::max(),
value->size()));
}
auto avalue = value ? params.make_cell(*value) : params.make_dead_cell();
map_type_impl::mutation update = { {}, { { std::move(to_bytes(*key)), std::move(avalue) } } };
// should have been verified as map earlier?
auto ctype = static_pointer_cast<const map_type_impl>(column.type);
auto avalue = value ? params.make_cell(*ctype->get_values_type(), *value, atomic_cell::collection_member::yes) : params.make_dead_cell();
map_type_impl::mutation update;
update.cells.emplace_back(std::move(to_bytes(*key)), std::move(avalue));
// should have been verified as map earlier?
auto col_mut = ctype->serialize_mutation_form(std::move(update));
m.set_cell(prefix, column, std::move(col_mut));
}
@@ -334,10 +329,10 @@ maps::do_put(mutation& m, const clustering_key_prefix& prefix, const update_para
return;
}
for (auto&& e : map_value->map) {
mut.cells.emplace_back(e.first, params.make_cell(e.second));
}
auto ctype = static_pointer_cast<const map_type_impl>(column.type);
for (auto&& e : map_value->map) {
mut.cells.emplace_back(e.first, params.make_cell(*ctype->get_values_type(), e.second, atomic_cell::collection_member::yes));
}
auto col_mut = ctype->serialize_mutation_form(std::move(mut));
m.set_cell(prefix, column, std::move(col_mut));
} else {
@@ -347,7 +342,7 @@ maps::do_put(mutation& m, const clustering_key_prefix& prefix, const update_para
} else {
auto v = map_type_impl::serialize_partially_deserialized_form({map_value->map.begin(), map_value->map.end()},
cql_serialization_format::internal());
m.set_cell(prefix, column, params.make_cell(std::move(v)));
m.set_cell(prefix, column, params.make_cell(*column.type, std::move(v)));
}
}
}

View File

@@ -117,6 +117,7 @@ public:
}
virtual void execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) override;
static void execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params, const column_definition& column, ::shared_ptr<terminal> value);
};
class setter_by_key : public operation {

View File

@@ -87,15 +87,15 @@ public:
virtual ~operation() {}
atomic_cell make_dead_cell(const update_parameters& params) const {
static atomic_cell make_dead_cell(const update_parameters& params) {
return params.make_dead_cell();
}
atomic_cell make_cell(bytes_view value, const update_parameters& params) const {
return params.make_cell(value);
static atomic_cell make_cell(const abstract_type& type, bytes_view value, const update_parameters& params) {
return params.make_cell(type, value);
}
atomic_cell make_counter_update_cell(int64_t delta, const update_parameters& params) const {
static atomic_cell make_counter_update_cell(int64_t delta, const update_parameters& params) {
return params.make_counter_update_cell(delta);
}

View File

@@ -68,6 +68,14 @@ public:
static thrift_prepared_id_type thrift_id(const prepared_cache_key_type& key) {
return key.key().second;
}
bool operator==(const prepared_cache_key_type& other) const {
return _key == other._key;
}
bool operator!=(const prepared_cache_key_type& other) const {
return !(*this == other);
}
};
class prepared_statements_cache {
@@ -102,9 +110,9 @@ private:
}
};
public:
static const std::chrono::minutes entry_expiry;
public:
using key_type = prepared_cache_key_type;
using value_type = checked_weak_ptr;
using statement_is_too_big = typename cache_type::entry_is_too_big;
@@ -116,8 +124,8 @@ private:
value_extractor_fn _value_extractor_fn;
public:
prepared_statements_cache(logging::logger& logger)
: _cache(memory::stats().total_memory() / 256, entry_expiry, logger)
prepared_statements_cache(logging::logger& logger, size_t size)
: _cache(size, entry_expiry, logger)
{}
template <typename LoadFunc>
@@ -155,6 +163,10 @@ public:
size_t memory_footprint() const {
return _cache.memory_footprint();
}
future<> stop() {
return _cache.stop();
}
};
}
@@ -168,4 +180,11 @@ inline std::ostream& operator<<(std::ostream& os, const cql3::prepared_cache_key
os << p.key();
return os;
}
template<>
struct hash<cql3::prepared_cache_key_type> final {
size_t operator()(const cql3::prepared_cache_key_type& k) const {
return utils::tuple_hash()(k.key());
}
};
}
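
Reviewer note: with `operator==` and the `hash` specialization added above, `prepared_cache_key_type` can presumably serve as a key in hash-based containers. The sketch below shows the same combination on a hypothetical `demo::cache_key`. The hash-combine formula is a common idiom standing in for `utils::tuple_hash`, whose implementation is not shown in this diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

namespace demo {
// Hypothetical key wrapping a pair, loosely modelled on prepared_cache_key_type.
class cache_key {
    std::pair<std::string, int32_t> _key;
public:
    cache_key(std::string id, int32_t thrift_id) : _key(std::move(id), thrift_id) {}
    const std::pair<std::string, int32_t>& key() const { return _key; }
    bool operator==(const cache_key& other) const { return _key == other._key; }
    bool operator!=(const cache_key& other) const { return !(*this == other); }
};
}

namespace std {
// Specializing std::hash for a user-defined type is permitted by the standard.
template <>
struct hash<demo::cache_key> final {
    size_t operator()(const demo::cache_key& k) const {
        // Combine the hashes of both pair members (assumed stand-in for tuple_hash).
        size_t h1 = std::hash<std::string>()(k.key().first);
        size_t h2 = std::hash<int32_t>()(k.key().second);
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
    }
};
}

// Small self-check: equal keys find the same entry.
inline void cache_key_example() {
    std::unordered_map<demo::cache_key, int> m;
    m.emplace(demo::cache_key("a", 1), 10);
    assert(m.at(demo::cache_key("a", 1)) == 10);
}
```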

View File

@@ -46,10 +46,11 @@ namespace cql3 {
thread_local const query_options::specific_options query_options::specific_options::DEFAULT{-1, {}, {}, api::missing_timestamp};
thread_local query_options query_options::DEFAULT{db::consistency_level::ONE, std::experimental::nullopt,
thread_local query_options query_options::DEFAULT{db::consistency_level::ONE, infinite_timeout_config, std::experimental::nullopt,
std::vector<cql3::raw_value_view>(), false, query_options::specific_options::DEFAULT, cql_serialization_format::latest()};
query_options::query_options(db::consistency_level consistency,
const ::timeout_config& timeout_config,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value> values,
std::vector<cql3::raw_value_view> value_views,
@@ -57,6 +58,7 @@ query_options::query_options(db::consistency_level consistency,
specific_options options,
cql_serialization_format sf)
: _consistency(consistency)
, _timeout_config(timeout_config)
, _names(std::move(names))
, _values(std::move(values))
, _value_views(value_views)
@@ -67,12 +69,14 @@ query_options::query_options(db::consistency_level consistency,
}
query_options::query_options(db::consistency_level consistency,
const ::timeout_config& timeout_config,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value> values,
bool skip_metadata,
specific_options options,
cql_serialization_format sf)
: _consistency(consistency)
, _timeout_config(timeout_config)
, _names(std::move(names))
, _values(std::move(values))
, _value_views()
@@ -84,12 +88,14 @@ query_options::query_options(db::consistency_level consistency,
}
query_options::query_options(db::consistency_level consistency,
const ::timeout_config& timeout_config,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value_view> value_views,
bool skip_metadata,
specific_options options,
cql_serialization_format sf)
: _consistency(consistency)
, _timeout_config(timeout_config)
, _names(std::move(names))
, _values()
, _value_views(std::move(value_views))
@@ -99,9 +105,10 @@ query_options::query_options(db::consistency_level consistency,
{
}
query_options::query_options(db::consistency_level cl, std::vector<cql3::raw_value> values, specific_options options)
query_options::query_options(db::consistency_level cl, const ::timeout_config& timeout_config, std::vector<cql3::raw_value> values, specific_options options)
: query_options(
cl,
timeout_config,
{},
std::move(values),
false,
@@ -113,6 +120,7 @@ query_options::query_options(db::consistency_level cl, std::vector<cql3::raw_val
query_options::query_options(std::unique_ptr<query_options> qo, ::shared_ptr<service::pager::paging_state> paging_state)
: query_options(qo->_consistency,
qo->get_timeout_config(),
std::move(qo->_names),
std::move(qo->_values),
std::move(qo->_value_views),
@@ -124,7 +132,7 @@ query_options::query_options(std::unique_ptr<query_options> qo, ::shared_ptr<ser
query_options::query_options(std::vector<cql3::raw_value> values)
: query_options(
db::consistency_level::ONE, std::move(values))
db::consistency_level::ONE, infinite_timeout_config, std::move(values))
{}
db::consistency_level query_options::get_consistency() const
@@ -209,19 +217,18 @@ void query_options::prepare(const std::vector<::shared_ptr<column_specification>
}
auto& names = *_names;
std::vector<cql3::raw_value> ordered_values;
std::vector<cql3::raw_value_view> ordered_values;
ordered_values.reserve(specs.size());
for (auto&& spec : specs) {
auto& spec_name = spec->name->text();
for (size_t j = 0; j < names.size(); j++) {
if (names[j] == spec_name) {
ordered_values.emplace_back(_values[j]);
ordered_values.emplace_back(_value_views[j]);
break;
}
}
}
_values = std::move(ordered_values);
fill_value_views();
_value_views = std::move(ordered_values);
}
void query_options::fill_value_views()
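
Reviewer note: the `query_options::prepare()` hunk above reorders named bind values so that they follow the order of the column specifications, and after this change it does so on the value views directly instead of rebuilding them from `_values`. The reordering loop can be sketched on its own as follows. `order_by_spec` is a hypothetical free function, plain strings replace the specification objects, and the extra bounds check on `values` is an addition of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// For each expected name, in order, pick the value bound under that name.
// Names without a match are skipped, mirroring the loop in the diff.
template <typename Value>
std::vector<Value> order_by_spec(const std::vector<std::string>& spec_names,
                                 const std::vector<std::string>& names,
                                 const std::vector<Value>& values) {
    std::vector<Value> ordered;
    ordered.reserve(spec_names.size());
    for (const auto& spec_name : spec_names) {
        for (std::size_t j = 0; j < names.size() && j < values.size(); j++) {
            if (names[j] == spec_name) {
                ordered.emplace_back(values[j]);
                break;  // first match wins
            }
        }
    }
    return ordered;
}

// Small self-check: values supplied as (c, a, b) come back as (a, b, c).
inline void order_by_spec_example() {
    auto r = order_by_spec<int>({"a", "b", "c"}, {"c", "a", "b"}, {3, 1, 2});
    assert((r == std::vector<int>{1, 2, 3}));
}
```

The nested loop is quadratic in the number of bind markers, which is usually small; a map from name to index would be an alternative for large statements.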

Some files were not shown because too many files have changed in this diff.