Compare commits

...

1244 Commits

Author SHA1 Message Date
Avi Kivity
b252bba4a2 Merge "mvcc: Fix incorrect schema version being used to copy the mutation when applying (#5099)" from Tomasz
"
Currently affects only counter tables.

Introduced in 27014a2.

mutation_partition(s, mp) is incorrect because it uses s to interpret
mp, while it should use mp_schema.

We may hit this if the current node has a newer schema than the
incoming mutation. This can happen during table schema altering when we receive the
mutation from a node which hasn't processed the schema change yet.

This is undefined behavior in general. If the alter was adding or
removing columns, this may result in corruption of the write where
values of one column are inserted into a different column.

Fixes #5095.
"

* 'fix-schema-alter-counter-tables' of https://github.com/tgrabiec/scylla:
  mvcc: Fix incorrect schema verison being used to copy the mutation when applying
  mutation_partition: Track and validate schema version in debug builds
  tests: Use the correct schema to access mutation_partition

(cherry picked from commit 83bc59a89f)
2019-09-28 19:48:49 +03:00
Botond Dénes
a0b9fcc041 cache_flat_mutation_reader: read_from_underlying(): propagate timeout
Propagate the timeout to `consume_mutation_fragments_until()` and hence
to the underlying reader, to ensure queued sstable reads that belong
to timed-out requests are dropped from the queue, instead of
pointlessly serving them.

consume_mutation_fragments_until() received a `timeout` parameter as it
didn't have one.

Fixes: #1068
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190906135629.67342-1-bdenes@scylladb.com>
2019-09-06 16:05:42 +02:00
Paweł Dziepak
35c9b675c1 mutation_partition: verify row::append_cell() precondition
row::append_cell() has a precondition that the new cell column id needs
to be larger than that of any other already existing cell. If this
precondition is violated the row will end up in an invalid state. This
patch adds assertion to make sure we fail early in such cases.

(cherry picked from commit 060e3f8ac2)
2019-08-23 15:06:35 +02:00
Jenkins
d71836fef7 release: prepare for 2.3.6 by hagitsegev 2019-08-17 13:12:40 +03:00
Tomasz Grabiec
f8e150e97c Merge "Fix the system.size_estimates table" from Kamil
Fixes a segfault when querying for an empty keyspace.

Also, fixes an infinite loop on smp > 1. Queries to
system.size_estimates table which are not single-partition queries
caused Scylla to go into an infinite loop inside
multishard_combining_reader::fill_buffer. This happened because
multishard_combinind_reader assumes that shards return rows belonging
to separate partitions, which was not the case for
size_estimates_mutation_reader.

Fixes #4689
2019-08-14 15:33:33 +02:00
Kamil Braun
10c300f894 Fix infinite looping when performing a range query on system.size_estimates.
Queries to system.size_estimates table which are not single parition queries
caused Scylla to go into an infinite loop inside multishard_combining_reader::fill_buffer.
This happened because multishard_combinind_reader assumes that shards return rows belonging
to separate partitions, which was not the case for size_estimates_mutation_reader.
This commit fixes the issue and closes #4689.
2019-08-14 13:11:56 +02:00
Kamil Braun
de1d3e5c6b Fix segmentation fault when querying system.size_estimates for an empty keyspace. 2019-08-14 13:11:56 +02:00
Kamil Braun
69810c13ca Refactor size_estimates_virtual_reader
Move the implementation of size_estimates_mutation_reader
to a separate compilation unit to speed up compilation times
and increase readability.

Refactor tests to use seastar::thread.
2019-08-14 13:11:54 +02:00
Raphael S. Carvalho
9b025a5742 table: do not rely on undefined behavior in cleanup_sstables
It shouldn't rely on argument evaluation order, which is ub.

Fixes #4718.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry-picked from commit 0e732ed1cf)
2019-08-07 22:01:11 +03:00
Piotr Jastrzębski
74eebc4cab sstables: ka/la: reader: Make sure push_ready_fragments() does not miss to emit partition_end (#4790)
Currently, if there is a fragment in _ready and _out_of_range was set
after row end was consumer, push_ready_fragments() would return
without emitting partition_end.

This is problematic once we make consume_row_start() emit
partiton_start directly, because we will want to assume that all
fragments for the previous partition are emitted by then. If they're
not, then we'd emit partition_start before partition_end for the
previous partition. The fix is to make sure that
push_ready_fragments() emits everything.

Fixes #4786

(cherry picked from commit 9b8ac5ecbc)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-08-05 10:17:42 +03:00
Piotr Sarna
9b2ca4ee44 main: stop view builder conditionally
The view builder is started only if it's enabled in config,
via the view_building=true variable. Unfortunately, stopping
the builder was unconditional, which may result in failed
assertions during shutdown. To remedy this, view building
is stopped only if it was previously started.

Fixes #4589

(cherry picked from commit efa7951ea5)
2019-06-26 11:06:02 +03:00
Benny Halevy
773bf45774 time_window_backlog_tracker: fix use after free
Fixes #4465

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190430094209.13958-1-bhalevy@scylladb.com>
(cherry picked from commit 3a2fa82d6e)
2019-05-06 09:38:49 +03:00
Asias He
c6705b4335 streaming: Get rid of the keep alive timer in streaming
There is no guarantee that rpc streaming makes progress in some time
period. Remove the keep alive timer in streaming to avoid killing the
session when the rpc streaming is just slow.

The keep alive timer is used to close the session in the following case:

n2 (the rpc streaming sender) streams to n1 (the rpc streaming receiver)
kill -9 n2

We need this because we do not kill the session when gossip think a node
is down, because we think the node down might only be temporary
and it is a waste to drop the previous work that has done especially
when the stream session takes long time.

Since in range_streamer, we do not stream all data in a single stream
session, we stream 10% of the data per time, and we have retry logic.
I think it is fine to kill a stream session when gossip thinks a node is
down. This patch changes to close all stream session with the node that
gossip think it is down.
Message-Id: <bdbb9486a533eee25fcaf4a23a946629ba946537.1551773823.git.asias@scylladb.com>

(cherry picked from commit b8158dd65d)
Message-Id: <4ebc544c85261873591fd5ac30043e693d74434a.1555466551.git.asias@scylladb.com>
2019-04-17 17:40:08 +03:00
Tomasz Grabiec
3997871b4d lsa: Cover more bad_alloc cases with abort
When --abort-on-lsa-bad-alloc is enabled we want to abort whenever
we think we can be out of memory.

We covered failures due to bad_alloc thrown from inside of the
allocation section, but did not cover failures from reservations done
at the beginning of with_reserve(). Fix by moving the trap into
reserve().

Message-Id: <1553258915-27929-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 3356a085d2)
2019-04-17 15:21:35 +02:00
Tomasz Grabiec
4ff1d731bd lsa: Fix spurios abort with --enable-abort-on-lsa-bad-alloc
allocate_segment() can fail even though we're not out of memory, when
it's invoked inside an allocating section with the cache region
locked. That section may later succeed after retried after memory
reclamation.

We should ignore bad_alloc thrown inside allocating section body and
fail only when the whole section fails.

Fixes #2924

Message-Id: <1550597493-22500-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit dafe22dd83)
2019-04-17 15:20:21 +02:00
Jenkins
0e0f9143c9 release: prepare for 2.3.5 by hagitsegev 2019-04-17 11:47:53 +03:00
Raphael S. Carvalho
9d809d6ea4 database: fix 2x increase in disk usage during cleanup compaction
Don't hold reference to sstables cleaned up, so that file descriptors
for their index and data files will be closed and consequently disk
space released.

Fixes #3735.

Backport note:
To reduce risk considerably, we'll not backport a mechanism to release
sstable introduced in incremental compaction work.
Instead, only one sstable is passed to table::cleanup_sstables() at a
time (it won't affect performance because the operation is serialized
anyway), to make it easy to release reference to cleaned sstable held
by compaction manager.

tests: release mode.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180914194047.26288-1-raphaelsc@scylladb.com>
(cherry picked from commit 5bc028f78b)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190416025801.15048-1-raphaelsc@scylladb.com>
2019-04-16 17:41:30 +03:00
Tomasz Grabiec
630d599c34 schema_tables: Serialize schema merges fairly
All schema changes made to the node locally are serialized on a
semaphore which lives on shard 0. For historical reasons, they don't
queue but rather try to take the lock without blocking and retry on
failure with a random delay from the range [0, 100 us]. Contenders
which do not originate on shard 0 will have an extra disadvantage as
each lock attempt will be longer by the across-shard round trip
latency. If there is constant contention on shard 0, contenders
originating from other shards may keep loosing to take the lock.

Schema merge executed on behalf of a DDL statement may originate on
any shard. Same for the schema merge which is coming from a push
notification. Schema merge executed as part of the background schema
pull will originate on shard 0 only, where the application state
change listeners run. So if there are constant schema pulls, DDL
statements may take a long time to get through.

The fix is to serialize merge requests fairly, by using the blocking
semaphore::wait(), which is fair.

We don't have to back-off any more, since submit_to() no longer has a
global concurrency limit.

Fixes #4436.

Message-Id: <1555349915-27703-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 3fd82021b1)
2019-04-16 10:21:10 +03:00
Takuya ASADA
0933c1a00a dist/docker/redhat: switch to python36
Since EPEL switched python3 default version to 3.6, we need to follow
the change.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190409094139.4797-2-syuu@scylladb.com>
(cherry picked from commit d527ef19f7)
2019-04-14 21:17:57 +03:00
Takuya ASADA
7a7099fcfb dist/ami: switch to python36
Since EPEL switched python3 default version to 3.6, we need to follow
the change.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190409094139.4797-1-syuu@scylladb.com>
2019-04-14 21:17:03 +03:00
Avi Kivity
50235aacb4 compaction: fix use-after-free when calculating backlog after schema change
The problem happens after a schema change because we fail to properly
remove ongoing compaction, which stopped being tracked, from list that
is used to calculate backlog, so it may happen that a compaction read
monitor (ceases to exist after compaction ends) is used after freed.

Fixes #4410.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190409024936.23775-1-raphaelsc@scylladb.com>
(cherry-picked from commit 8a117c338a)
2019-04-14 14:00:36 +03:00
Takuya ASADA
e888009f12 dist/redhat: switch to python36
Since EPEL switched python3 default version to 3.6, we need to follow
the change.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190408160934.32701-1-syuu@scylladb.com>
(cherry picked from commit 6e51a95668)
2019-04-09 09:08:28 +03:00
Piotr Sarna
a19615ee9b types: fix varint and decimal serialization
Varint and decimal types serialization did not update the output
iterator after generating a value, which may lead to corrupted
sstables - variable-length integers were properly serialized,
but if anything followed them directly in the buffer (e.g. in a tuple),
their value will be overwritten.

Fixes #4348

Tests: unit (dev)
dtest: json_test.FromJsonUpdateTests.complex_data_types_test
       json_test.FromJsonInsertTests.complex_data_types_test
       json_test.ToJsonSelectTests.complex_data_types_test

Note that dtests still do not succeed 100% due to formatting differences
in compared results (e.g. 1.0e+07 vs 1.0E7, but it's no longer a query
correctness issue.

(cherry picked from commit 287a02dc05)
2019-03-26 16:38:58 +02:00
Tomasz Grabiec
357ca67fda row_cache: Fix abort in cache populating read concurrent with memtable flush
When we're populating a partition range and the population range ends
with a partition key (not a token) which is present in sstables and
there was a concurrent memtable flush, we would abort on the following
assert in cache::autoupdating_underlying_reader:

     utils::phased_barrier::phase_type creation_phase() const {
         assert(_reader);
         return _reader_creation_phase;
     }

That's because autoupdating_underlying_reader::move_to_next_partition()
clears the _reader field when it tries to recreate a reader but it finds
the new range to be empty:

         if (!_reader || _reader_creation_phase != phase) {
            if (_last_key) {
                auto cmp = dht::ring_position_comparator(*_cache._schema);
                auto&& new_range = _range.split_after(*_last_key, cmp);
                if (!new_range) {
                    _reader = {};
                    return make_ready_future<mutation_fragment_opt>();
                }

Fix by not asserting on _reader. creation_phase() will now be
meaningful even after we clear the _reader. The meaning of
creation_phase() is now "the phase in which the reader was last
created or 0", which makes it valid in more cases than before.

If the reader was never created we will return 0, which is smaller
than any phase returned by cache::phase_of(), since cache starts from
phase 1. This shouldn't affect current behavior, since we'd abort() if
called for this case, it just makes the value more appropriate for the
new semantics.

Tests:

  - unit.row_cache_test (debug)

Fixes #4236
Message-Id: <1553107389-16214-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 69775c5721)
2019-03-22 09:34:15 -03:00
Jenkins
7818c63eb1 release: prepare for 2.3.4 by hagitsegev 2019-03-18 12:44:10 +02:00
Eliran Sinvani
da10eae18c cql3 : fix a crash upon preparing select with an IN restriction due to memory violation
When preparing a select query with a multicolumn in restriction, the
node crashed due to using a parameter after using a move on it.

Tests:
1. UnitTests (release)
2. Preparing a select statement that crashed the system before,
and verify it is not crashing.

Fixes #3204
Fixes #3692

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <7ebd210cd714a460ee5557ac612da970cee03270.1537947897.git.eliransin@scylladb.com>
(cherry picked from commit 22ad5434d1)
2019-03-07 10:10:03 +00:00
Tomasz Grabiec
d5292cd3ec sstable/compaction: Use correct schema in the writing consumer
Introduced in 2a437ab427.

regular_compaction::select_sstable_writer() creates the sstable writer
when the first partition is consumed from the combined mutation
fragment stream. It gets the schema directly from the table
object. That may be a different schema than the one used by the
readers if there was a concurrent schema alter duringthat small time
window. As a result, the writing consumer attached to readers will
interpret fragments using the wrong version of the schema.

One effect of this is storing values of some columns under a different
column.

This patch replaces all column_family::schema() accesses with accesses
to the _schema memeber which is obtained once per compaction and is
the same schema which readers use.

Fixes #4304.

Tests:

  - manual tests with hard-coded schema change injection to reproduce the bug
  - build/dev/scylla boot
  - tests/sstable_mutation_test

Message-Id: <1551698056-23386-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 58e7ad20eb)
2019-03-05 15:06:28 +02:00
Avi Kivity
9cb35361d9 Update seastar submodule
* seastar 10ac122...efda428 (1):
  > net: fix tcp load balancer accounting leak while moving socket to other shard

Fixes #4269.
2019-03-05 15:06:07 +02:00
Jenkins
3e285248be release: prepare for 2.3.3 by hagitsegev 2019-02-19 14:02:37 +02:00
Raphael S. Carvalho
6f10ccb441 database: Fix race condition in sstable snapshot
Race condition takes place when one of the sstables selected by snapshot
is deleted by compaction. Snapshot fails because it tries to link a
sstable that was previously unlinked by compaction's sstable deletion.

Refs #4051.

(master commit 1b7cad3531)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190110194048.26051-1-raphaelsc@scylladb.com>
2019-02-19 10:13:56 +02:00
Botond Dénes
df420499bc service/storage_service: fix pre-bootstrap wait for schema agreement
When bootstrapping, a node should to wait to have a schema agreement
with its peers, before it can join the ring. This is to ensure it can
immediately accept writes. Failing to reach schema agreement before
joining is not fatal, as the node can pull unknown schemas on writes
on-demand. However, if such a schema contains references to UDFs, the
node will reject writes using it, due to #3760.

To ensure that schema agreement is reached before joining the ring,
`storage_service::join_token_ring()` has to checks. First it checks that
at least one peer was connected previously. For this it compares
`database::get_version()` with `database::empty_version`. The (implied)
assumption is that this will become something other than
`database::empty_version` only after having connected (and pulled
schemas from) at least one peer. This assumption doesn't hold anymore,
as we now set the version earlier in the boot process.
The second check verifies that we have the same schema version as all
known, live peers. This check assumes (since 3e415e2) that we have
already "met" all (or at least some) of our peers and if there is just
one known node (us) it concludes that this is a single-node cluster,
which automatically has schema agreement.
It's easy to see how these two checks will fail. The first fails to
ensure that we have met our peers, and the second wrongfully concludes
that we are a one-node cluster, and hence have schema agreement.

To fix this, modify the first check. Instead of relying on the presence
of a non-empty database version, supposedly implying that we already
talked to our peers, explicitely make sure that we have really talked to
*at least* one other node, before proceeding to the second check, which
will now do the correct thing, actually checking the schema versions.

Fixes: #4196

Branches: 3.0, 2.3

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <40b95b18e09c787e31ba6c5519fb64d68b4ca32e.1550228389.git.bdenes@scylladb.com>
(cherry picked from commit 2125e99531)
2019-02-16 19:04:38 +02:00
Avi Kivity
d29527b4e1 auth: password_authenticator: protect against NULL salted_hash
In case salted_hash was NULL, we'd access uninitialized memory when dereferencing
the optional in get_as<>().

Protect against that by using get_opt() and failing authentication if we see a NULL.

Fixes #4168.

Tests: unit (release)
Branches: 3.0, 2.3
Message-Id: <20190211173820.8053-1-avi@scylladb.com>
(cherry picked from commit da9628c6dc)
2019-02-11 23:55:06 +02:00
Duarte Nunes
8a90e242e4 Merge 'Fix misdetection of remote counter shards' from Paweł
"
The code reading counter cells form sstables verifies that there are no
unsupported local or remote shards. The latter are detected by checking
if all shards are present in the counter cell header (only remote shards
do not have entries there). However, the logic responsible for doing
that was incorrectly computing the total number of counter shards in a
cell if the header was larger than a single counter shard. This resulted
in incorrect complaints that remote shards are present.

Fixes #4206

Tests: unit(release)
"

* tag 'counter-header-fix/v1' of https://github.com/pdziepak/scylla:
  tests/sstables: test counter cell header with large number of shards
  sstables/counters: fix remote counter shard detection

(cherry picked from commit d2d885fb93)
2019-02-11 14:18:54 +02:00
Calle Wilund
8a78c0aba9 commitlog_replayer: Bugfix: finding truncation positions uses local var ref
"uuid" was ref:ed in a continuation. Works 99.9% of the time because
the continuation is not actually delayed (and assuming we begin the
checks with non-truncated (system) cf:s it works).
But if we do delay continuation, the resulting cf map will be
borked.

Fixes #4187.

Message-Id: <20190204141831.3387-1-calle@scylladb.com>
(cherry picked from commit 9cadbaa96f)
2019-02-04 18:02:43 +02:00
Botond Dénes
8a2bbcf138 auth/service: unregister migration listener on stop()
Otherwise any event that triggers notification to this listener would
trigger a heap-use-after-free.

Refs: #4107

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b6bbd609371a2312aed7571b05119d59c7d103d7.1548067626.git.bdenes@scylladb.com>
(cherry picked from commit f229dff210)
2019-01-22 17:55:18 +02:00
Pekka Enberg
22c891e6df Update scylla-ami submodule
* dist/ami/files/scylla-ami a425887...fe156a5 (1):
  > scylla_install_ami: update NIC drivers

See scylladb/scylla-ami#44
2019-01-17 08:45:22 +02:00
Duarte Nunes
1841d0c2d9 tests/gossip_test: Use RAII for orderly destruction
Change the test so that services are correctly teared down, by the
correct order (e.g., storage_service access the messaging_service when
stopping).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814112111.8521-2-duarte@scylladb.com>
(cherry picked from commit 495a92c5b6)
2019-01-08 19:44:58 +02:00
Duarte Nunes
e10107fe5a tests/gossip_test: Don't bind address to avoid conflicts
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814112111.8521-1-duarte@scylladb.com>
(cherry picked from commit 3956a77235)
2019-01-08 19:44:52 +02:00
Jenkins
0b3a4679db release: prepare for 2.3.2 2019-01-08 14:40:33 +02:00
Avi Kivity
ba60d666a9 Update seastar submodule
* seastar db30251...10ac122 (1):
  > iotune: Initialize io_rates member variables

Fixes #4064.
2019-01-08 11:41:00 +02:00
Avi Kivity
6ea4d0b75c Update seastar submodule
* seastar b846dfe...db30251 (1):
  > reactor: disable nowait aio due to a kernel bug

Fixes #3996.
2018-12-17 15:56:47 +02:00
Vladimir Krivopalov
8c5911f312 database: Capture io_priority_class by reference to avoid dangling ref.
The original reference points to a thread-local storage object that
guaranteed to outlive the continuation, but copying it make the
subsequent calls point to a local object and introduces a use-after-free
bug.

Fixes #3948

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
(cherry picked from commit 68458148e7)
2018-12-02 13:32:45 +02:00
Tomasz Grabiec
de00d7f5a1 utils: phased_barrier: Make advance_and_await() have strong exception guarantees
Currently, when advance_and_await() fails to allocate the new gate
object, it will throw bad_alloc and leave the phased_barrier object in
an invalid state. Calling advance_and_await() again on it will result
in undefined behavior (typically SIGSEGV) beacuse _gate will be
disengaged.

One place affected by this is table::seal_active_memtable(), which
calls _flush_barrier.advance_and_await(). If this throws, subsequent
flush attempts will SIGSEGV.

This patch rearranges the code so that advance_and_await() has strong
exception guarantees.
Message-Id: <1542645562-20932-1-git-send-email-tgrabiec@scylladb.com>

Fixes #3931.

(cherry picked from commit 57e25fa0f8)
2018-11-21 12:18:08 +02:00
Glauber Costa
e5f9dae4bb remove monitor if sstable write failed
In (almost) all SSTable write paths, we need to inform the monitor that
the write has failed as well. The monitor will remove the SSTable from
controller's tracking at that point.

Except there is one place where we are not doing that: streaming of big
mutations. Streaming of big mutations is an interesting use case, in
which it is done in 2 parts: if the writing of the SSTable fails right
away, then we do the correct thing.

But the SSTables are not commited at that point and the monitors are
still kept around with the SSTables until a later time, when they are
finally committed. Between those two points in time, it is possible that
the streaming code will detect a failure and manually call
fail_streaming_mutations(), which marks the SSTable for deletions. At
that point we should propagate that information to the monitor as well,
but we don't.

Fixes #3732 (hopefully)
Tests: unit (release)

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181114213618.16789-1-glauber@scylladb.com>
(cherry picked from commit 9f403334c8)
2018-11-20 20:40:44 +02:00
Glauber Costa
e13e796290 sstables: correctly parse estimated histograms
In commit a33f0d6, we changed the way we handle arrays during the write
and parse code to avoid reactor stalls. Some potentially big loops were
transformed into futurized loops, and also some calls to vector resizes
were replaced by a reserve + push_back idiom.

The latter broke parsing of the estimated histogram. The reason being
that the vectors that are used here are already initialized internally
by the estimated_histogram object. Therefore, when we push_back, we
don't fill the array all the way from index 0, but end up with a zeroed
beginning and only push back some of the elements we need.

We could revert this array to a resize() call. After all, the reason we
are using reserve + push_back is to avoid calling the constructor member
for each element, but We don't really expect the integer specialization
to do any of that.

However, to avoid confusion with future developers that may feel tempted
to converted this as well for the sake of consistency, it is safer to
just make sure these arrays are zeroed.

Fixes #3918

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181116130853.10473-1-glauber@scylladb.com>
(cherry picked from commit c6811bd877)
2018-11-17 17:20:38 +02:00
Avi Kivity
336c771663 release: prepare for 2.3.1 2018-10-19 20:53:17 +03:00
Avi Kivity
82968afc25 locator: fix abstract_replication_strategy::get_ranges() and friends violating sort order
get_ranges() is supposed to return ranges in sorted order. However, a35136533d
broke this and returned the range that was supposed to be last in the second
position (e.g. [0, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9]). The broke cleanup, which
relied on the sort order to perform a binary search. Other users of the
get_ranges() family did not rely on the sort order.

Fixes #3872.
Message-Id: <20181019113613.1895-1-avi@scylladb.com>

(cherry picked from commit 1ce52d5432)
2018-10-19 20:52:31 +03:00
Duarte Nunes
383dcffb53 Merge 'Fix issues with endpoint state replication to other shards' from Tomasz
Fixes #3798
Fixes #3694

Tests:

  unit(release), dtest([new] cql_tests.py:TruncateTester.truncate_after_restart_test)

* tag 'fix-gossip-shard-replication-v1' of github.com:tgrabiec/scylla:
  gms/gossiper: Replicate enpoint states in add_saved_endpoint()
  gms/gossiper: Make reset_endpoint_state_map() have effect on all shards
  gms/gossiper: Replicate STATUS change from mark_as_shutdown() to other shards
  gms/gossiper: Always override states from older generations

(cherry picked from commit 48ebe6552c)
2018-10-17 10:09:07 +02:00
Glauber Costa
0c2abc007c api: use longs instead of ints for snapshot sizes
Int types in json will be serialized to int types in C++. They will then
only be able to handle 4GB, and we tend to store more data than that.

Without this patch, listsnapshots is broken in all versions.

Fixes: #3845

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181012155902.7573-1-glauber@scylladb.com>
(cherry picked from commit 98332de268)
2018-10-12 22:02:25 +03:00
Avi Kivity
1498c4f150 Update seastar submodule
* seastar ebf4812...b846dfe (1):
  > prometheus: Fix histogram text representation

Fixes #3827.
2018-10-09 16:38:04 +03:00
Eliran Sinvani
f388992a94 cql3 : add workaround to antlr3 null dereference bug
The Antlr3 exception class has a null dereference bug that crashes
the system when trying to extract the exception message using
ANTLR_Exception<...>::displayRecognitionError(...) function. When
a parsing error occurs the CqlParser throws an exception which in
turn processesed for some special cases in scylla to generate a custom
message. The default case however, creates the message using
displayRecognitionError, causing the system to crash.
The fix is a simple workaround, making sure the pointer is not null
before the call to the function. A "proper" fix can't be implemented
because the exception class itself is implemented outside scylla
in antlr headers that resides on the host machine os.

Tested manualy 2 testcases, a typo causing scylla to crash and
a cql comment without a newline at the end also caused scylla to crash.
Ran unit tests (release).

Fixes #3740
Fixes #3764

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <cfc7e0d758d7a855d113bb7c8191b0fd7d2e8921.1538566542.git.eliransin@scylladb.com>
(cherry picked from commit 20f49566a2)
2018-10-04 14:09:41 +03:00
Avi Kivity
310540c11f utils: crc32: mark power crc32 assembly as not requiring an executable stack
The linker uses an opt-in system for non-executable stack: if all object files
opt into a non-executable stack, the binary will have a non-executable stack,
which is very desirable for security. The compiler cooperates by opting into
a non-executable stack whenever possible (always for our code).

However, we also have an assembly file (for fast power crc32 computations).
Since it doesn't opt into a non-executable stack, we get a binary with
executable stack, which Gentoo's build system rightly complains about.

Fix by adding the correct incantation to the file.

Fixes #3799.

Reported-by: Alexys Jacob <ultrabug@gmail.com>
Message-Id: <20181002151251.26383-1-avi@scylladb.com>
(cherry picked from commit aaab8a3f46)
2018-10-02 23:23:23 +03:00
Calle Wilund
7d833023cc storage_proxy: Add missing re-throw in truncate_blocking
Iff truncation times out, we want to log it, but the exception should
not be swallowed, but re-thrown.

Fixes #3796.

Message-Id: <20181001112325.17809-1-calle@scylladb.com>
(cherry picked from commit 2996b8154f)
2018-10-01 21:48:57 +02:00
Avi Kivity
d94ac196e0 Update scylla-ami submodule
* dist/ami/files/scylla-ami e7aa504...a425887 (1):
  > scylla_install_ami: enable ssh_deletekeys

See scylladb/scylla-ami#31
2018-09-30 16:32:40 +03:00
Duarte Nunes
1d7430995e tests/aggregate_fcts_test: Add test case for wrapped types
Provide a test case which checks a type being wrapped in a
reverse_type plays no role in assignment.

Refs #3789

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180927223201.28152-2-duarte@scylladb.com>
(cherry picked from commit 17578c3579)
2018-09-28 14:34:19 +03:00
Duarte Nunes
b662a7f8a4 cql3/selection/selector: Unwrap types when validating assignment
When validating assignment between two types, it's possible one of
them is wrapped in a reverse_type, if it comes, for example, from the
type associated with a clustering column. When checking for weak
assignment the types are correctly unwrapped, but not when checking
for an exact match, which this patch fixes.

Technically, the receiver is never a reversed_type for the current
callers, but this is the morally correct implementation, as the type
being reversed or not plays no role in assignment.

Tests: unit(release)

Fixes #3789

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180927223201.28152-1-duarte@scylladb.com>
(cherry picked from commit 5e7bb20c8a)
2018-09-28 14:34:08 +03:00
Paweł Dziepak
447ad72882 transport: fix use-after-free in read_name_and_value_list()
(cherry picked from commit 1eeef4383c)
2018-09-27 14:05:45 +01:00
Duarte Nunes
b8485d3bce cql3/query_processor: Validate presence of statement values timeously
We need to validate before calling query_options::prepare() whether
the set of prepared statement values sent in the query matches the
amount of names we need to bind, otherwise we risk an out-of-bounds
access if the client also specified names together with the values.

Refs #3688

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814225607.14215-1-duarte@scylladb.com>
(cherry picked from commit 805ce6e019)
2018-09-27 14:05:37 +01:00
Takuya ASADA
034b0f50db dist/redhat: specify correct repo file path on scylla-housekeeping services
Currently, both scylla-housekeeping-daily/-restart services mistakenly
specify repo file path as "@@REPOFILES@@", witch is copied from .in
template, need to be replace with actual path.

Fixes #3776

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180921031605.9330-1-syuu@scylladb.com>
(cherry picked from commit 21a12aa458)
2018-09-25 12:30:03 +03:00
Avi Kivity
12ec0becf3 messaging: fix unbounded allocation in TLS RPC server
The non-TLS RPC server has an rpc::resource_limits configuration that limits
its memory consumption, but the TLS server does not. That means a many-node
TLS configuration can OOM if all nodes gang up on a single replica.

Fix by passing the limits to the TLS server too.

Fixes #3757.
Message-Id: <20180907192607.19802-1-avi@scylladb.com>

(cherry picked from commit 4553238653)
2018-09-17 20:25:49 +03:00
Piotr Sarna
666b19552d cql3, 2.3: refuse serving multi-restriction indexed queries
Secondary index queries do not work correctly when multiple
restrictions are present - the rest of the restrictions is simply
ignored, which results in too many rows returned to the client.
This 2.3 fix makes these unsafe queries return an error instead.

Refs #3754

Message-Id: <7e470052d8ffc5bd8dc12e0d7f2705f0754afdbb.1536243391.git.sarna@scylladb.com>
2018-09-17 20:16:01 +03:00
Takuya ASADA
178f870a03 dist/ami/files/.scylla_ami_login: fix python error message on unsupported instance type
We changed usage of colorprint() on f8cec2f891,
need to pass format parameters to the function.

Fixes #3680

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Tested-by: Amos Kong <amos@scylladb.com>
Message-Id: <20180913182450.13308-1-syuu@scylladb.com>
2018-09-17 14:53:17 +03:00
Pekka Enberg
1b18f16dc1 release: prepare for 2.3.0 2018-09-14 13:52:02 +03:00
Pekka Enberg
28934575e4 docker: Update RPM repository to 2.3 2018-09-12 15:54:17 +02:00
Gleb Natapov
182cbeefb0 mutation_query_test: add test for result size calculation
Check that digest only and digest+data query calculate result size to be
the same.

Message-Id: <20180906153800.GK2326@scylladb.com>
(cherry picked from commit 9e438933a2)
2018-09-12 15:54:17 +02:00
Gleb Natapov
b70fc41a90 mutation_partition: accurately account for result size in digest only queries
When measuring_output_stream is used to calculate result's element size
it incorrectly takes into account not only serialized element size, but
a placeholder that ser::qr_partition__rows/qr_partition__static_row__cells
constructors puts in the beginning. Fix it by taking starting point in a
stream before element serialization and subtracting it afterwords.

Fixes #3755

Message-Id: <20180906153609.GJ2326@scylladb.com>
(cherry picked from commit d7674288a9)
2018-09-12 15:54:12 +02:00
Tomasz Grabiec
debfc795b2 tests: flat_mutation_reader: Use fluent assertions for better error messages
Message-Id: <1531908313-29810-2-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit dc453d4f5d)
2018-09-12 15:54:09 +02:00
Tomasz Grabiec
0d094575ec tests: flat_mutation_reader_assertions: Introduce produces(mutation_fragment)
Message-Id: <1531908313-29810-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 604c8baed8)
2018-09-12 15:54:06 +02:00
Tomasz Grabiec
20baef69a9 mutation_fragment: Fix clustering_row::equal() using incorrect column kind
Incorrect column_kind was passed, which may cause wrong type to be
used for comparison if schema contains static columns. Affects only
tests.

Spotted during code review.
Message-Id: <1531144991-2658-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 1336744a05)
2018-09-07 13:34:26 +02:00
Pekka Enberg
1bac88601d release: prepare for 2.3.rc3 2018-09-07 07:41:46 +03:00
Vlad Zolotarov
e581fd1463 loading_cache: make size() return the size of lru_list instead of loading_shared_values
reloading flow may hold the items in the underlying loading_shared_values
after they have been removed (e.g. via remove(key) API) thereby loading_shared_values.size()
doesn't represent the correct value for the loading_cache. lru_list.size() on the other hand - does.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
(cherry picked from commit 1e56c7dd58)
2018-09-06 16:57:22 +03:00
Vlad Zolotarov
b366bff998 loading_cache: make iterator work on top of lru_list iterators instead of loading_shared_values'
Reloading may hold value in the underlying loading_shared_values while
the corresponding cache values have already been deleted.

This may create weird situations like this:

<populate cache with 10 entries>
cache.remove(key1);
for (auto& e : cache) {
    std::out << e << std::endl;
}

<all 10 entries are printed, including the one for "key1">

In order to avoid such situations we are going to make the loading_cache::iterator
to be a transform_iterator of lru_list::iterator instead of loading_shared_values::iterator
because lru_list contains entries only for cached items.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
(cherry picked from commit 945d26e4ee)
2018-09-06 16:57:12 +03:00
Gleb Natapov
38e6984ba5 mutation_partition: correctly measure static row size when doing digest calculation
The code uses incorrect output stream in case only digest is requested
and thus getting incorrect data size. Failing to correctly account
for static row size while calculating digest may cause digest mismatch
between digest and data query.

Fixes #3753.

Message-Id: <20180905131219.GD2326@scylladb.com>
(cherry picked from commit 98092353df)
2018-09-06 16:50:58 +03:00
Vlad Zolotarov
332f76579e tests: loading_cache_test: configure a validity timeout in test_loading_cache_loading_different_keys to a greater value
Change the validity timeout from 1s to 1h in order to avoid false alarms
on busy systems: for a short value there is a chance that
(loading_cache.size() == num_loaders) check is going to run after some elements
of the cache have already been evicted.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20180904193026.7304-1-vladz@scylladb.com>
(cherry picked from commit dae70e1166)
2018-09-06 16:05:42 +03:00
Jesse Haber-Kucharsky
315a03cf6c auth: Use finite time-out for all QUORUM reads
Commit e664f9b0c6 transitioned internal
CQL queries in the auth. sub-system to be executed with finite time-outs
instead of infinite ones.

It should have also modified the functions in `auth/roles-metadata.cc`
to have finite time-outs.

This change fixes some previously failing dtests, particularly around
repair. Without this change, the QUORUM query fails to terminate when
the necessary consistency level cannot be achieved.

Fixes #3736.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <e244dc3e731b4019f3be72c52a91f23ee4bb68d1.1536163859.git.jhaberku@scylladb.com>
(cherry picked from commit 682805b22c)
2018-09-05 22:54:32 +03:00
Asias He
1847dc7a6a storage_service: Wait for range setup before announcing join status
When a joining node announcing join status through gossip, other
existing nodes will send writes to the joining node. At this time, it
is possible the joining node hasn't learnt the tokens of other nodes
that causes the error like below:

   token_metadata - sorted_tokens is empty in first_token_index!
   storage_proxy - Failed to apply mutation from 127.0.4.1#0:
   std::runtime_error (sorted_tokens is empty in first_token_index!)

To fix, wait for the token range setup before announcing the join
status.

Fixes: #3382
Tests: 60 run of materialized_views_test.py:TestMaterializedViews.add_dc_during_mv_update_test

Message-Id: <01abb21ae3315ae275297e507c5956e5774557ef.1536128531.git.asias@scylladb.com>
(cherry picked from commit 89b769a073)
2018-09-05 15:32:29 +03:00
Eliran Sinvani
dd11b5987e cql3: backport test of multicolumn IN with repetitions.
The test failed after backport of the containing commit (d734d31), the
reason is that the query was missing ALLOW FILTERING which is required.
In newer versions the allow filtering enforcement "misses" some
cases that needs the filtering anotation due to cavaet in testing multi
column restriction for ALLOW FULTERRING requirement. This issue was
introduced as part of refactoring the multicolumn restrictions classes
and already has an open issue: #3574

Tests: Unitests(release)

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <ad88b7218fa55466be7bc4303dc50326a3d59733.1534322238.git.eliransin@scylladb.com>
(cherry picked from commit d734d316a6)
Message-Id: <928f1fbecffa43c4700541ee6603bb4607871510.1536146137.git.eliransin@scylladb.com>
2018-09-05 14:30:07 +03:00
Paweł Dziepak
a134e8699a test.py: do not disable human-readable format with --jenkins flag
When test.py is run with --jenkins flag Boost UTF is asked to generate
an XML file with the test results. This automatically disables the
human-readable output printed to stdout. There is no real reason to do
so and it is actually less confusing when the Boost UTF messages are in
the test output together with Scylla logger messages.

Message-Id: <20180704172913.23462-1-pdziepak@scylladb.com>
(cherry picked from commit 07a429e837)
2018-09-04 14:26:00 +02:00
Takuya ASADA
bd7dcbb8d2 dist/common/scripts/scylla_raid_setup: create scylla-server.service.d when it doesn't exist
When /etc/systemd/system/scylla-server.service.d/capabilities.conf is
not installed, we don't have /etc/systemd/system/scylla-server.service.d/,
need to create it.

Fixes #3738

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180904015841.18433-1-syuu@scylladb.com>
(cherry picked from commit bd8a5664b8)
2018-09-04 14:42:55 +03:00
Tomasz Grabiec
74e61528a6 managed_vector: Make external_memory_usage() ignore reserved space
This ensures that row::external_memory_usage() is invariant to
insertion order of cells.

It should be so, so that accounting of a clustering_row, merged from
multiple MVCC versions by the partition_snapshot_flat_reader on behalf
of a memtable flush, doesn't give a greater result than what is used
by the memtable region. Overaccounting leads to assertion failure in
~flush_memory_accounter.

Fixes #3625 (hopefully).

Message-Id: <1535982513-19922-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 4fb3f7e8eb)
2018-09-03 19:58:10 +03:00
Duarte Nunes
5eb4fde2d5 Merge 'utils::loading_cache: improve reload() robustness' from Vlad
"This series introduces a few improvements related to a reload flow.

From now on the callback may assume that the "key" parameter value
is kept alive till the end of its execution in the reloading flow.

It may also safely evict as many items from the cache as needed."

Fixes #3606

* 'loading_cache_improve_reload-v1' of https://github.com/vladzcloudius/scylla:
  utils::loading_cache: hold a shared_value_ptr to the value when we reload
  utils::loading_cache::on_timer(): remove not needed capture of "this"
  utils::loading_cache::on_timer(): use chunked_vector for storing elements we want to reload

(cherry picked from commit f6aadd8077)
2018-08-29 10:12:32 +01:00
Duarte Nunes
cc0703f8ca utils/loading_cache: Avoid using invalidated iterators
When periodically reloading the values in the loading_cache, we would
iterate over the list of entries and call the load() function for
those which need to be reloaded.

For some concrete caches, load() can remove the entry from the LRU set,
and can be executed inline from the parallel_for_each(). This means we
could potentially keep iterating using an invalidated iterator.

Fix this by using a temporary container to hold those entries to be
reloaded.

Spotted when reading the code.

Also use if constexpr and fix the comment in the function containing
the changes.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180712124143.13638-1-duarte@scylladb.com>
(cherry picked from commit 63b63b0461)
2018-08-29 10:12:11 +01:00
Botond Dénes
678283a5bb loading_cache::reload(): obtain key before calling _load()
The continuation attached to _load() needs the key of the loaded entry
to check whether it was disposed during the load. However if _load()
invalidates the entry the continuation's capture line will access
invalid memory while trying to obtain the key.
To avoid this save a copy of the key before calling _load() and pass it
to both _load() and the continuation.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b571b73076ca863690f907fbd3fb4ff54e597b28.1531393608.git.bdenes@scylladb.com>
(cherry picked from commit 2e7bf9c6f9)
2018-08-29 10:12:03 +01:00
Avi Kivity
552c0d7641 Update scylla-ami submodule
* dist/ami/files/scylla-ami b7db861...e7aa504 (1):
  > scylla_create_devices: fix mouting RAID volume after reboot

Fixes #3640.
2018-08-28 15:45:36 +03:00
Piotr Sarna
860c06660b tests: add multi-column pk test to INSERT JSON case
Refs #3687
Message-Id: <6ba1328549ed701691ca7cbdacc7d6fa72f2c3de.1534171422.git.sarna@scylladb.com>

(cherry picked from commit aa2bfc0a71)
2018-08-28 14:39:43 +03:00
Piotr Sarna
db733ba075 cql3: fix handling multi-column partition key in INSERT JSON
Multiple column partition keys were previously handled incorrectly,
now the implementation is based on from_exploded instead of
from_singular.

Fixes #3687
Message-Id: <09e0bdb0f1c18d49b9e67c21777d93ba1545a13c.1534171422.git.sarna@scylladb.com>

(cherry picked from commit fa72422baa)
2018-08-28 14:39:41 +03:00
Avi Kivity
88677d39c8 Update seastar submodule
* seastar ed62fbd...ebf4812 (4):
  > correctly configure I/O Scheduler for usage with the YAML file
  > iotune: adjust num-io-queues recommendation
  > reactor: switch indentation
  > properly configure I/O Scheduler when --max-io-requests is passed

Fixes #3722.
Fixes #3721.
Fixes #3718.
2018-08-28 14:37:31 +03:00
Avi Kivity
d767dee5ec migration_manager: downgrade frightening "Can't send migration request" ERROR
This error is transient, since as soon as the node is up we will be able
to send the migration request.  Downgrade it to a warning to reduce anxiety
among people who actually read the logs (like QA).

The message is also badly worded as no one can guess what a migration
request is, but that is left to another patch.

Fixes #3706.
Message-Id: <20180821070200.18691-1-avi@scylladb.com>

(cherry picked from commit 5792a59c96)
2018-08-28 09:11:11 +03:00
Tomasz Grabiec
702f6ee1b7 database: Run system table flushes in the main scheduling group
memtable flushes for system and regular region groups run under the
memtable_scheduling_group, but the controller adjusts shares based on
the occupancy of the regular region group.

It can happen that regular is not under pressure, but system is. In
this case the controller will incorrectly assign low shares to the
memtable flush of system. This may result in high latency and low
throughput for writes in the system group.

I observed writes to the sytem keyspace timing out (on scylla-2.3-rc2)
in the dtest: limits_test.py:TestLimits.max_cells_test, which went
away after this.

Fixes #3717.

Message-Id: <1535016026-28006-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 10f6b125c8)
2018-08-27 16:41:35 +02:00
Piotr Sarna
473b9aec65 cql3: throw proper request exception for INSERT JSON
JSON code is amended in order to return proper
"Missing mandatory PRIMARY KEY part" message instead of generic
"Attempt to access value of a disengaged optional object".

Fixes #3665
Message-Id: <69157d659d51ce5a2d408614ce3ba7bf8e3a5d88.1534161127.git.sarna@scylladb.com>

(cherry picked from commit 310d0a74b9)
2018-08-27 12:36:33 +03:00
Tomasz Grabiec
b548061257 database: Avoid OOM when soft pressure but nothing to flush
There could be soft pressure, but soft-pressure flusher may not be
able to make progress (Refs #3716). It will keep trying to flush empty
memtables, which block on earlier flushes to complete, and thus
allocate continuations in memory. Those continuations accumulate in
memory and can cause OOM.

flush will take longer to complete. Due to scheduling group isolation,
the soft-pressure flusher will keep getting the CPU.

This causes bad_alloc and crashes of dtest:
limits_test.py:TestLimits.max_cells_test

Fixes #3717

Message-Id: <1535102520-23039-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 2afce13967)
2018-08-26 18:05:35 +03:00
Tomasz Grabiec
01165a9ae7 database: Make soft-pressure memtable flusher not consider already flushed memtables
The flusher picks the memtable list which contains the largest region
according to region_impl::evictable_occupancy().total_space(), which
follows region::occupancy().total_space(). But only the latest
memtable in the list can start flushing. It can happen that the
memtable corresponding to the largest region was already flushed to an
sstable (flush permit released), but not yet fsynced or moved to
cache, so it's still in the memtable list.

The latest memtable in the winning list may be small, or empty, in
which case the soft pressure flusher will not be able to make much
progress. There could be other memtable lists with non-empty
(flushable) latest memtables. This can lead to writes unnecessarily
blocking on dirty.

I observed this for the system memtable group, where it's easy for the
memtables to overshoot small soft pressure limits. The flusher kept
trying to flush empty memtables, while the previous non-empty memtable
was still in the group.

The CPU scheduler makes this worse, because it runs memtable_to_cache
in a separate scheduling group, so it further defers in time the
removal of the flushed memtable from the memtable list.

This patch fixes the problem by making regions corresponding to
memtables which started flushing report evictable_occupancy() as 0, so
that they're picked by the flusher last.

Fixes #3716.
Message-Id: <1535040132-11153-2-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 1e50f85288)
2018-08-26 18:05:35 +03:00
Tomasz Grabiec
5cdb963768 logalloc: Make evictable_occupancy() indicate no free space
Doesn't fix any bug, but it's closer to the truth that all segments
are used rather than none is used.

Message-Id: <1535040132-11153-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 364418b5c5)
2018-08-26 18:05:35 +03:00
Eliran Sinvani
7c9b9a4e24 cql3: ensure repeated values in IN clauses don't return repeated rows
When the list of values in the IN list of a single column contains
duplicates, multiple executors are activated since the assumption
is that each value in the IN list corresponds to a different partition.
this results in the same row appearing in the result number times
corresponding to the duplication of the partition value.

Added queries for the in restriction unitest and fixed with a bad result check.

Fixes #2837
Tests: Queries as in the usecase from the GitHub issue in both forms ,
prepared and plain (using python driver),Unitest.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <ad88b7218fa55466be7bc4303dc50326a3d59733.1534322238.git.eliransin@scylladb.com>
(cherry picked from commit d734d316a6)
2018-08-26 18:05:33 +03:00
Avi Kivity
f475c65ae6 Update scylla-ami submodule
* dist/ami/files/scylla-ami c7e5a70...b7db861 (2):
  > scylla-ami-setup.service: run only on first startup
  > Use fstab to mount RAID volume on every reboot

(cherry picked from commit 54ac334f4b)
2018-08-26 12:40:58 +03:00
Takuya ASADA
687372bc48 dist/common/scripts/scylla_raid_setup: refuse start scylla-server.service when RAID volume is not mounted
Since the Linux system abort booting when it fails to mount fstab entries,
user may not able to see an error message when we use fstab to mount
/var/lib/scylla on AMI.

Instead of abort booting, we can just abort to start scylla-server.service
when RAID volume is not mounted, using RequiresMountsFor directive of systemd
unit file.

See #3640

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180824185511.17557-1-syuu@scylladb.com>
(cherry picked from commit ff55e3c247)
2018-08-26 12:40:50 +03:00
Piotr Sarna
65c140121c tests: add parsing varint from JSON string test
Refs #3666
Message-Id: <f4205e9484f5385796fade7986e3e38dcbc65bac.1534845398.git.sarna@scylladb.com>

(cherry picked from commit 4a274ee7e2)
2018-08-26 11:11:18 +03:00
Piotr Sarna
ed68ad220f types: enable deserializing varint from JSON string
Previously deserialization failed because the JSON string
representing a number was unnecessarily quoted.

Fixes #3666
Message-Id: <a0a100dbac7c151d627522174303657d1da05c27.1534845398.git.sarna@scylladb.com>

(cherry picked from commit 37a5c38471)
2018-08-26 11:11:18 +03:00
Piotr Sarna
35f4b8fbbe cql3: add proper setting of empty collections in INSERT JSON
Previously empty collections where incorrectly added as dead cells,
which resulted in serialization errors later.

Fixes #3664
Message-Id: <a9c90d66c6737641cafe40edb779df490ada0309.1534848313.git.sarna@scylladb.com>

(cherry picked from commit 465045368f)
2018-08-26 11:11:18 +03:00
Duarte Nunes
48012fe418 Merge seastar upstream
* seastar 22437af...ed62fbd (1):
  > core: fix __libc_free return type signature

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-22 13:49:39 +01:00
Tomasz Grabiec
c862ccda91 Merge 'Fix multi-cell static list updates in the presence of ckeys' from Duarte
Fixes a regression introduced in
9e88b60ef5, which broke the lookup for
prefetched values of lists when a clustering key is specified.

This is the code that was removed from some list operations:

 std::experimental::optional<clustering_key> row_key;
 if (!column.is_static()) {
   row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
 }
 ...
 auto&& existing_list = params.get_prefetched_list(m.key().view(), row_key, column);

Put it back, in the form of common code in the update_parameters class.

Fixes #3703

* https://github.com/duarten/scylla cql-list-fixes/v1:
  tests/cql_query_test: Test multi-cell static list updates with ckeys
  cql3/lists: Fix multi-cell static list updates in the presence of ckeys
  keys: Add factory for an empty clustering_key_prefix_view

(cherry picked from commit 6937cc2d1c)
2018-08-21 17:35:14 +01:00
Duarte Nunes
83b1057c4b cql3/query_options: Use _value_views in prepare()
_value_views is the authoritative data structure for the
client-specified values. Indeed, the ctor called
transport::request::read_options() leaves _values completely empty.

In query_options::prepare() we were, however, using _values to
associated values to the client-specified column names, and not
_value_views. Fix this by using _value_views instead.

As for the reasons we didn't see this bug earlier, I assume it's
because very few drivers set the 0x04 query options flag, which means
column names are omitted. This is the right thing to do since most
drivers have enough information to correctly position the values.

Fixes #3688

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814234605.14775-1-duarte@scylladb.com>
(cherry picked from commit a4355fe7e7)
2018-08-21 18:23:10 +03:00
Gleb Natapov
c1cb779dd2 storage_proxy: do not fail read without speculation on connection error
After ac27d1c93b if a read executor has just enough targets to
achieve request's CL and a connection to one of them will be dropped
during execution ReadFailed error will be returned immediately and
client will not have a chance to issue speculative read (retry). The
patch changes the code to not return ReadFailed error immediately, but
wait for timeout instead and give a client chance to issue speculative
read in case read executor does not have additional targets to send
speculative reads to by itself.

Fixes #3699.
Message-Id: <20180819131646.GK2326@scylladb.com>

(cherry picked from commit 7277ee2939)
2018-08-20 13:06:51 +03:00
Hagit Segev
b47d18f9fd support 2.3 RC2 2018-08-19 20:17:24 +03:00
Tomasz Grabiec
f8713b019e mutation_partition: Fix exception safety of row::apply_monotonically()
When emplace_back() fails, value is already moved-from into a
temporary, which breaks monotonicity expected from
apply_monotonically(). As a result, writes to that cell will be lost.

The fix is to avoid the temporary by in-place construction of
cell_and_hash. To do that, appropriate cell_and_hash constructor was
added.

Found by mutation_test.cc::test_apply_monotonically_is_monotonic with
some modifications to the random mutation generator.

Introduced in 99a3e3a.

Fixes #3678.

Message-Id: <1533816965-27328-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 024b3c9fd9)
2018-08-13 10:44:27 +02:00
Takuya ASADA
cd5e4eace5 dist/common/scripts/scylla_setup: don't proceed RAID setup until user type 'done'
Need to wait user confirmation before running RAID setup.

See #3659
Fixes #3681

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180810194507.1115-1-syuu@scylladb.com>
(cherry picked from commit 2ef1b094d7)
2018-08-12 15:08:47 +03:00
Takuya ASADA
4fb5403670 dist/common/scripts/scylla_setup: don't mention about interactive mode prompt when running on non-interactive mode
Skip showing message when it's non-interactive mode.

Fixes #3674

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180810191945.32693-1-syuu@scylladb.com>
(cherry picked from commit b7cf3d7472)
2018-08-12 15:08:37 +03:00
Takuya ASADA
e9df6c42ce dist/common/scripts/scylla_setup: check existance of housekeeping.cfg before asking to run version check
Skip asking to run version check when housekeeping.cfg is already
exists.
Fixes #3657

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180807232313.15525-1-syuu@scylladb.com>
(cherry picked from commit ef9475dd3c)
2018-08-12 15:08:20 +03:00
Takuya ASADA
5fdf492ccc dist/debian: fix install scylla-server.service
On previous commit we moved debian/scylla-server.service to
debian/scylla-server.scylla-server.service to explicitly specify
subpackage name, but it doesn't work for dh_installinit without '--name'
option.

Result of that current scylla-server .deb package missing
scylla-server.service, so we need to rename the service to original
file name.

Fixes #3675

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180810221944.24837-1-syuu@scylladb.com>
(cherry picked from commit f30b701872)
2018-08-12 15:07:28 +03:00
Duarte Nunes
fd2b02a12c Merge 'JSON support fixes' from Piotr
"
This series addresses SELECT/INSERT JSON support issues, namely
handling null values properly and parsing decimals from strings.
It also comes with updated cql tests.

Tests: unit (release)
"

Fixes #3666
Fixes #3664
Fixes #3667

* 'json_fixes_3' of https://github.com/psarna/scylla:
  cql3: remove superfluous null conversions in to_json_string
  tests: update JSON cql tests
  cql3: enable parsing decimal JSON values from string
  cql3: add missing return for dead cells
  cql3: simplify parsing optional JSON values
  cql3: add handling null value in to_json
  cql3: provide to_json_string for optional bytes argument

(cherry picked from commit 95677877c2)
2018-08-12 15:05:43 +03:00
Takuya ASADA
f8cec2f891 dist/common/scripts: pass format variables to colorprint()
When we use str.format() to pass variables on the message it will always
causes Exception like "KeyError: 'red'", since the message contains color
variables but it's not passed to str.format().
To avoid the error we need to pass all format variables to colorprint()
and run str.format() inside the function.

Fixes #3649

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180803015216.14328-1-syuu@scylladb.com>
(cherry picked from commit ad7bc313f7)
2018-08-09 10:46:19 +03:00
Avi Kivity
e4d6577ef2 Update seastar submodule
* seastar 814a055...22437af (1):
  > tls.cc: Make "close" timeout delay exception proof

Fixes #3461.
2018-08-08 13:35:10 +03:00
Takuya ASADA
346027248d dist/common/scripts/scylla_setup: print message when EC2 instance is optimized for Scylla
Currently scylla_ec2_check exits silently when EC2 instance is optimized
for Scylla, it's not clear a result of the check, need to output
message.

Note that this change effects AMI login prompt too.

Fixes #3655

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180808024256.9601-1-syuu@scylladb.com>
(cherry picked from commit 15825d8bf1)
2018-08-08 13:26:52 +03:00
Takuya ASADA
2cf6191353 dist/common/scripts/scylla_setup: fix typo on interactive setup
Scylls -> Scylla

Fixes #3656

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180808002443.1374-1-syuu@scylladb.com>
(cherry picked from commit 652eb5ae0e)
2018-08-08 13:26:52 +03:00
Avi Kivity
b52d647de2 docker: adjust for script conversion to Python
Since our scripts were converted to Python, we can no longer
source them from a shell. Execute them directly instead. Also,
we now need to import configuration variables ourselves, since
scylla_prepare, being an independent process, won't do it for
us.

Fixes #3647
Message-Id: <20180802153017.11112-1-avi@scylladb.com>

(cherry picked from commit c9caaa8e6e)
2018-08-07 18:58:44 +03:00
Takuya ASADA
f7c96a37f1 dist/common/scripts/scylla_setup: use specified NIC ifname correctly
Interactive NIC selection prompt always returns 'eth0' as selected NIC name
mistakenly, need to fix.

Fixes #3651

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180803020724.15155-1-syuu@scylladb.com>
(cherry picked from commit a300926495)
2018-08-07 09:51:25 +03:00
Jesse Haber-Kucharsky
ae71ffdcfd auth: Don't use unsupported hashing algorithms
In previous versions of Fedora, the `crypt_r` function returned
`nullptr` when a requested hashing algorithm was not supported.

This is consistent with the documentation of the function in its man
page.

As of Fedora 28, the function's behavior changes so that the encrypted
text is not `nullptr` on error, but instead the string "*0".

The info pages for `crypt_r` clarify somewhat (and contradict the man
pages):

    Some implementations return `NULL` on failure, and others return an
    _invalid_ hashed passphrase, which will begin with a `*` and will
    not be the same as SALT.

Because of this change of behavior, users running Scylla on a Fedora 28
machine which was upgraded from a previous release would not be able to
authenticate: an unsupported hashing algorithm would be selected,
producing encrypted text that did not match the entry in the table.

With this change, unsupported algorithms are correctly detected and
users should be able to continue to authenticate themselves.

Fixes #3637.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <bcd708f3ec195870fa2b0d147c8910fb63db7e0e.1533322594.git.jhaberku@scylladb.com>
(cherry picked from commit fce10f2c6e)
2018-08-05 10:30:16 +03:00
Avi Kivity
a235900388 Merge "Fix exception safety in imr::utils::object" from Paweł
"

There is an exception safety problem in imr::utils::object. If multiple
memory allocations are needed and one of them fails the main object is
going to be freed (as expected). However, at this stage it is not
constructed yet, so  when LSA asks its migrator for the size it may get
a meaningless value. The solution is to remember the size until object
is fully created and use sized deallocation in case of failures.

Fixes #3618.

Tests: unit(release, debug/imr_test)
"

(cherry picked from commit 3b42fcfeb2)
2018-08-03 11:54:53 +03:00
Takuya ASADA
be9f150341 dist/debian: install *.service on correct subpackage
We mistakenly installing scylla-housekeeping-*.service to scylla-conf
package, all *.service should explicitly specified subpackage name.

Fixes #3642

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180801233042.307-1-syuu@scylladb.com>
(cherry picked from commit 1bb463f7e5)
2018-08-03 11:53:22 +03:00
Gleb Natapov
2478fa1f6e storage_proxy: fix rpc connection failure handling by read operation
Currently rpc::closed_error is not counted towards replica failure
during read and thus read operation waits for timeout even if one
of the nodes dies. Fix this by counting rpc::closed_error towards
failed attempts.

Fixes #3590.

Message-Id: <20180708123522.GC28899@scylladb.com>
(cherry picked from commit ac27d1c93b)
2018-08-02 11:41:58 +03:00
Gleb Natapov
d95ac1826e cache_hitrate_calculator: fix race when new table is added during calculations
The calculation consists of several parts with preemption point between
them, so a table can be added while calculation is ongoing. Do not
assume that table exists in intermediate data structure.

Fixes #3636

Message-Id: <20180801093147.GD23569@scylladb.com>
(cherry picked from commit 44a6afad8c)
2018-08-01 14:30:45 +03:00
Avi Kivity
6fc17345e9 Merge "No infinite time-outs for internal distributed queries" from Jesse
"
This series replaces infinite time-outs in internal distributed
(non-local) CQL queries with finite ones.

The implementation of tracing, which also performs internal queries,
already has finite time-outs, so it is unchanged.

Fixes #3603.
"

* 'jhk/finite_time_outs/v2' of https://github.com/hakuch/scylla:
  Use finite time-outs for internal auth. queries
  Use finite query time-outs for `system_distributed`

(cherry picked from commit 620e950fc8)
2018-08-01 14:23:49 +03:00
Takuya ASADA
4bfa0ae247 dist/common/scripts/scylla_ntp_setup: fix typo
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1533070539-2147-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 2cd99d800b)
2018-08-01 10:58:36 +03:00
Avi Kivity
174b7870e6 Update ami submodule
* dist/ami/files/scylla-ami d53834f...c7e5a70 (1):
  > ds2_configure.py: uncomment 'cluster_name' when it's commented out
2018-07-31 09:33:46 +03:00
Takuya ASADA
e95b4ee825 dist/common/scripts/scylla_ntp_setup: fix typo
Comment on Python is "#" not "//".

Fixes #3629

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180730091022.4512-1-syuu@scylladb.com>
(cherry picked from commit 032b26deeb)
2018-07-30 13:53:14 +03:00
Takuya ASADA
464305de1c dist/common/scripts/scylla_ntp_setup: ignore ntpdate error
Even ntpdate fails to adjust clock ntpd may able to recover it later,
ignore ntpdate error keep running the script.

Fixes #3629

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180726080206.28891-1-syuu@scylladb.com>
(cherry picked from commit 8e4d1350c9)
2018-07-30 13:53:08 +03:00
Avi Kivity
3a1a9e1a11 dist: redhat: fix up bad file ownership of rpms/srpms
mock outputs files owned by root. This causes attempts
by scripts that want to junk the working directory (typically
continuous integration) to fail on permission errors.

Fixup those permissions after the fact.
Message-Id: <20180719163553.5186-1-avi@scylladb.com>

(cherry picked from commit b167647bf6)
2018-07-26 14:22:52 +03:00
Avi Kivity
90dac5d944 Merge "Fix JSON string quoting" from Piotr
"

This mini-series covers a regression caused by newest versions
of jsoncpp library, which changed the way of quoting UTF-8 strings.

Tests: unit (release)
"

* 'add_json_quoting_3' of https://github.com/psarna/scylla:
  tests: add JSON unit test
  types: use value_to_quoted_string in JSON quoting
  json: add value_to_quoted_string helper function

Ref #3622.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>

(cherry picked from commit d6ef74fe36)
2018-07-26 12:03:35 +03:00
Piotr Sarna
e5a83d105c cql3: fix INSERT JSON grammar
Previously CQL grammar wrongfully required INSERT JSON queries
to provide a list of columns, even though they are already
present in JSON itself.
Unfortunately, tests were written with this false assumption as well,
so they're are updated.
Message-Id: <33b496cba523f0f27b6cbf5539a90b6feb20269e.1532514111.git.sarna@scylladb.com>

Fixes #3631.

(cherry picked from commit f66aace685)
2018-07-25 14:53:35 +01:00
Takuya ASADA
9b4a0a2879 dist/debian: fix ImportError on pystache
Seems like pystache does not provides dependency, need to install it on
build_deb.sh.

Fixes #3627

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180724164852.16094-1-syuu@scylladb.com>
(cherry picked from commit 58f094e06d)
2018-07-25 11:39:22 +03:00
Pekka Enberg
adad12ddc3 release: prepare for 2.3.rc1 2018-07-24 09:21:38 +03:00
Avi Kivity
a77bb1fe34 Merge "row_cache: Fix violation of continuity on concurrent eviction and population" from Tomasz
"
The problem happens under the following circumstances:

  - we have a partially populated partition in cache, with a gap in the middle

  - a read with no clustering restrictions trying to populate that gap

  - eviction of the entry for the lower bound of the gap concurrent with population

The population may incorrectly mark the range before the gap as continuous.
This may result in temporary loss of writes in that clustering range. The
problem heals by clearing cache.

Caught by row_cache_test::test_concurrent_reads_and_eviction, which has been
failing sporadically.

The problem is in ensure_population_lower_bound(), which returns true if
current clustering range covers all rows, which means that the populator has a
right to set continuity flag to true on the row it inserts. This is correct
only if the current population range actually starts since before all
clustering rows. Otherwise, we're populating since _last_row and should
consult it.

Fixes #3608.
"

* 'tgrabiec/fix-violation-of-continuity-on-concurrent-read-and-eviction' of github.com:tgrabiec/scylla:
  row_cache: Fix violation of continuity on concurrent eviction and population
  position_in_partition: Introduce is_before_all_clustered_rows()

(cherry picked from commit 31151cadd4)
2018-07-18 12:05:51 +02:00
Tomasz Grabiec
3c7e6dfdb9 mutation_partition: Fix exception-safety of row copy constructor
In case population of the vector throws, the vector object would not
be destroyed. It's a managed object, so in addition to causing a leak,
it would corrupt memory if later moved by the LSA, because it would
try to fixup forward references to itself.

Caused sporadic failures and crashes of row_cache_test, especially
with allocation failure injector enabled.

Introduced in 27014a23d7.
Message-Id: <1531757764-7638-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 3f509ee3a2)
2018-07-17 18:25:12 +02:00
Amos Kong
fab136ae1d scylla_setup: nic setup dialog is only for interactive mode
Current code raises dialog even for non-interactive mode when we pass options
in executing scylla_setup. This blocked automatical artifact-test.

Fixes #3549

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <58f90e1e2837f31d9333d7e9fb68ce05208323da.1531824972.git.amos@scylladb.com>
(cherry picked from commit 0fcdab8538)
2018-07-17 18:24:23 +03:00
Botond Dénes
a4218f536b storage_proxy: use the original row limits for the final results merging
`query_partition_key_range()` does the final result merging and trimming
(if necessary) to make sure we don't send more rows to the client than
requested. This merging and trimming is done by a continuation attached
to the `query_partition_key_range_concurrent()` which does the actual
querying. The continuations captures via value the `row_limit` and
`partition_limit` fields of the `query::read_command` object of the
query. This has an unexpected consequence. The lambda object is
constructed after the call to `query_partition_key_range_concurrent()`
returns. If this call doesn't defer, any modifications done to the read
command object done by `query_partition_key_range_concurrent()` will be
visible to the lambda. This is undesirable because
`query_partition_key_range_concurrent()` updates the read command object
directly as the vnodes are traversed which in turn will result in the
lambda doing the final trimming according to a decremented `row_limits`,
which will cause the paging logic to declare the query as exhausted
prematurely because the page will not be full.
To avoid all this make a copy of the relevant limit fields before
`query_partition_key_range_concurrent()` is called and pass these copies
to the continuation, thus ensuring that the final trimming will be done
according to the original page limits.

Spotted while investigating a dtest failure on my 1865/range-scans/v2
branch. On that branch the way range scans are executed on replicas is
completely refactored. These changes appearantly reduce the number of
continuations in the read path to the point where an entire page can be
filled without deferring and thus causing the problem to surface.

Fixes #3605.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f11e80a6bf8089d49ba3c112b25a69edf1a92231.1531743940.git.bdenes@scylladb.com>
(cherry picked from commit cc4acb6e26)
2018-07-16 16:55:12 +03:00
Takuya ASADA
9f4431ef04 dist/common/scripts/scylla_prepare: fix error when /etc/scylla/ami_disabled exists
On this part shell command wasn't converted to python3, need to fix.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180715075015.13071-1-syuu@scylladb.com>
(cherry picked from commit 9479ff6b1e)
2018-07-16 09:56:57 +03:00
Takuya ASADA
66250bf8cc dist/redhat: drop scylla_lib.sh from .rpm
Since we dropped scylla_lib.sh at 58e6ad22b2,
we need remove it from RPM spec file too.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180712155129.17056-1-syuu@scylladb.com>
(cherry picked from commit 1511d92473)
2018-07-16 09:44:48 +03:00
Takuya ASADA
88fe3c2694 dist/common/scripts/scylla_ec2_check: support custom NIC ifname on EC2
Since some AMIs using consistent network device naming, primary NIC
ifname is not 'eth0'.
But we hardcoded NIC name as 'eth0' on scylla_ec2_check, we need to add
--nic option to specify custom NIC ifname.

Fixes #3584

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180712142446.15909-1-syuu@scylladb.com>
(cherry picked from commit ee61660b76)
2018-07-16 09:44:26 +03:00
Takuya ASADA
db4c3d3e52 dist/common/scripts/scylla_util.py: fix typo
Fix typo, and rename get_mode_cpu_set() to get_mode_cpuset(), since a
term 'cpuset' is not included '_' on other places.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180711141923.12675-1-syuu@scylladb.com>
(cherry picked from commit 8f80d23b07)
2018-07-16 09:43:47 +03:00
Takuya ASADA
ca22a1cd1a dist/common/scripts: drop scylla_lib.sh
Drop scylla_lib.sh since all bash scripts depends on the library is
already converted to python3, and all scylla_lib.sh features are
implemented on scylla_util.py.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180711114756.21823-1-syuu@scylladb.com>
(cherry picked from commit 58e6ad22b2)
2018-07-16 09:43:25 +03:00
Avi Kivity
f9b702764e Update scylla-ami submodule
* dist/ami/files/scylla-ami 5200f3f...d53834f (1):
  > Merge "AMI scripts python3 conversion" from Takuya

(cherry picked from commit 83d72f3755)
2018-07-16 09:43:15 +03:00
Avi Kivity
54701bd95c Merge "more conversion from bash to python3" from Takuya
"Converted more scripts to python3."

* 'script_python_conversion2_v2' of https://github.com/syuu1228/scylla:
  dist/common/scripts/scylla_util.py: make run()/out() functions shorter
  dist/ami: install python34 to run scylla_install_ami
  dist/common/scripts/scylla_ec2_check: move ec2 related code to class aws_instance
  dist/common/scripts: drop class concolor, use colorprint()
  dist/ami/files/.bash_profile: convert almost all lines to python3
  dist/common/scripts: convert node_exporter_install to python3
  dist/common/scripts: convert scylla_stop to python3
  dist/common/scripts: convert scylla_prepare to python3

(cherry picked from commit 693cf77022)
2018-07-16 09:41:50 +03:00
Asias He
30eca5f534 storage_service: Limit number of REPLICATION_FINISHED verb can retry
In the removenode operation, if the message servicing is stopped, e.g., due
to disk io error isolation, the node can keep retrying the
REPLICATION_FINISHED verb infinitely.

Scylla log full of such message was observed:

[shard 0] storage_service - Fail to send REPLICATION_FINISHED to $IP:0:
seastar::rpc::closed_error (connection is closed)

To fix, limit the number of retires.

Tests: update_cluster_layout_tests.py

Fixes #3542

Message-Id: <638d392d6b39cc2dd2b175d7f000e7fb1d474f87.1529927816.git.asias@scylladb.com>
(cherry picked from commit bb4d361cf6)
2018-07-16 09:33:56 +03:00
Piotr Sarna
cd057d3882 database: make drop_column_family wait on reads in progress
drop_column_family now waits for both writes and reads in progress.
It solves possible liveness issues with row cache, when column_family
could be dropped prematurely, before the read request was finished.

Phaser operation is passed inside database::query() call.
There are other places where reading logic is applied (e.g. view
replicas), but these are guarded with different synchronization
mechanisms, while _pending_reads_phaser applies to regular reads only.

Fixes #3357

Reported-by: Duarte Nunes <duarte@scylladb.com>
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
Message-Id: <d58a5ee10596d0d62c765ee2114ac171b6f087d2.1529928323.git.sarna@scylladb.com>
(cherry picked from commit 03753cc431)
2018-07-16 09:32:15 +03:00
Piotr Sarna
c5a5a2265e database: add phaser for reads
Currently drop_column_family waits on write_in_progress phaser,
but there's no such mechanism for reads. This commit adds
a corresponding reads phaser.

Refs #3357

Reported-by: Duarte Nunes <duarte@scylladb.com>
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
Message-Id: <70b5fdd44efbc24df61585baef024b809cabe527.1529928323.git.sarna@scylladb.com>
(cherry picked from commit e1a867cbe3)
2018-07-16 09:32:06 +03:00
Takuya ASADA
3e482c6c9d dist/common/scripts/scylla_util.py: use os.open(O_EXCL) to verify disk is unused
To simplify is_unused_disk(), just try to open the disk instead of
checking multiple block subsystems.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180709102729.30066-1-syuu@scylladb.com>
(cherry picked from commit 1a5a40e5f6)
2018-07-11 12:51:17 +03:00
Avi Kivity
5b6cadb890 Update scylla-ami submodule
* dist/ami/files/scylla-ami 67293ba...5200f3f (1):
  > Add custom script options to AMI user-data

(cherry picked from commit 7d0df2a06d)
2018-07-11 12:51:08 +03:00
Takuya ASADA
9cf8cd6c02 dist/common/scripts/scylla_util.py: strip double quote from sysconfig parameter
Current sysconfig_parser.get() returns parameter including double quote,
it will cause problem by append text using sysconfig_parser.set().

Fixes #3587

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180706172219.16859-1-syuu@scylladb.com>
(cherry picked from commit 929ba016ed)
2018-07-11 12:51:01 +03:00
Vlad Zolotarov
b34567b69b dist: scylla_lib.sh: get_mode_cpu_set: split the declaration and ssignment to the local variable
In bash local variable declaration is a separate operation with its own exit status
(always 0) therefore constructs like

local var=`cmd`

will always result in the 0 exit status ($? value) regardless of the actual
result of "cmd" invocation.

To overcome this we should split the declaration and the assignment to be like this:

local var
var=`cmd`

Fixes #3508

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1529702903-24909-3-git-send-email-vladz@scylladb.com>
(cherry picked from commit 7495c8e56d)
2018-07-11 12:50:51 +03:00
Vlad Zolotarov
02b763ed97 dist: scylla_lib.sh: get_mode_cpu_set: don't let the error messages out
References #3508

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1529702903-24909-2-git-send-email-vladz@scylladb.com>
(cherry picked from commit f3ca17b1a1)
2018-07-11 12:50:43 +03:00
Takuya ASADA
05500a52d7 dist/common/scripts/scylla_sysconfig_setup: fix typo
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180705133313.16934-1-syuu@scylladb.com>
(cherry picked from commit 4df982fe07)
2018-07-11 12:50:32 +03:00
Avi Kivity
4afa558e97 Update scylla-ami submodule
* dist/ami/files/scylla-ami 0fd9d23...67293ba (1):
  > scylla_install_ami: fix broken argument parser

Fixes #3578.

(cherry picked from commit dd083122f9)
2018-07-11 12:50:24 +03:00
Takuya ASADA
f3956421f7 dist/ami: hardcode target for scylla_current_repo since we don't have --target option anymore
We break build_ami.sh since we dropped Ubuntu support, scylla_current_repo
command does not finishes because of less argument ('--target' with no
distribution name, since $TARGET is always blank now).
It need to hardcoded as centos.

Fixes #3577

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180705035251.29160-1-syuu@scylladb.com>
(cherry picked from commit 3bcc123000)
2018-07-11 12:49:52 +03:00
Takuya ASADA
a17a6ce8f5 dist/debian/build_deb.sh: make build_deb.sh more simplified
Use is_debian()/is_ubuntu() to detect target distribution, also install
pystache by path since package name is different between Fedora and
CentOS.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180703193224.4773-1-syuu@scylladb.com>
(cherry picked from commit 3cb7ddaf68)
2018-07-11 12:49:40 +03:00
Takuya ASADA
58a362c1f2 dist/ami/files/.bash_profile: drop Ubuntu support
Drop Ubuntu support on login prompt, too.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180703192813.4589-1-syuu@scylladb.com>
(cherry picked from commit ed1d0b6839)
2018-07-11 12:49:30 +03:00
Alexys Jacob
361b2dd7a5 Support Gentoo Linux on node_health_check script.
Gentoo Linux was not supported by the node_health_check script
which resulted in the following error message displayed:

"This s a Non-Supported OS, Please Review the Support Matrix"

This patch adds support for Gentoo Linux while adding a TODO note
to add support for authenticated clusters which the script does
not support yet.

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180703124458.3788-1-ultrabug@gentoo.org>
(cherry picked from commit 8c03c1e2ce)
2018-07-11 12:49:22 +03:00
Duarte Nunes
f6a2bafae2 Merge 'Expose sharding information to connections' from Avi
"
In the same way that drivers can route requests to a coordinator that
is also a replica of the data used by the request, we can allow
drivers to route requests directly to the shard. This patchset
adds and documents a way for drivers to know which shard a connection
is connected to, and how to perform this routing.
"

* tag 'shard-info-alt/v1' of https://github.com/avikivity/scylla:
  doc: documented protocol extension for exposing sharding
  transport: expose more information about sharding via the OPTIONS/SUPPORTED messages
  dht: add i_partitioner::sharding_ignore_msb()

(cherry picked from commit 33d7de0805)
2018-07-09 17:06:30 +03:00
Avi Kivity
2ec25a55cd Update seastar submodule
* seastar d7f35d7...814a055 (1):
  > reactor: pollable_fd: limit fragment count to IOV_MAX
2018-07-09 17:05:26 +03:00
Avi Kivity
d3fb7c5515 .gitmodules: branch seastar
This allows us to backport individual patches to seastar for
branch-2.3.
2018-07-09 17:03:50 +03:00
Botond Dénes
b1ac6a36f2 tests/cql_query_tess: add unit test for querying empty ranges test
A bug was found recently (#3564) in the paging logic, where the code
assumed the queried ranges list is non-empty. This assumption is
incorrect as there can be valid (if rare) queries that can result in the
ranges list to be empty. Add a unit test that executes such a query with
paging enabled to detect any future bugs related to assumptions about
the ranges list being non-empty.

Refs: #3564
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f5ba308c4014c24bb392060a7e72e7521ff021fa.1530618836.git.bdenes@scylladb.com>
(cherry picked from commit c236a96d7d)
Message-Id: <af315aef64d381a7f486ba190c9a1b5bdd6f800b.1530698046.git.bdenes@scylladb.com>
2018-07-04 12:13:33 +02:00
Botond Dénes
8cba125bce query_pager: use query::is_single_partition() to check for singular range
Use query::is_single_partition() to check whether the queried ranges are
singular or not. The current method of using
`dht::partition_range::is_singular()` is incorrect, as it is possible to
build a singular range that doesn't represent a single partition.
`query::is_single_partition()` correctly checks for this so use it
instead.

Found during code-review.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f671f107e8069910a2f84b14c8d22638333d571c.1530675889.git.bdenes@scylladb.com>
(cherry picked from commit 8084ce3a8e)
2018-07-04 12:03:18 +02:00
Tomasz Grabiec
f46f9f7533 Merge "Fix atomic_cell_or_collection::external_memory_usage()" from Paweł
After the transition to the new in-memory representation in
aab6b0ee27 'Merge "Introduce new in-memory
representation for cells" from Paweł'
atomic_cell_or_collection::external_memory_usage() stopped accounting
for the externally stored data. Since, it wasn't covered by the unit
tests the bug remained unnotices until now.

This series fixes the memory usage calculation and adds proper unit
tests.

* https://github.com/pdziepak/scylla.git fix-external-memory-usage/v1:
  tests/mutation: properly mark atomic_cells that are collection members
  imr::utils::object: expose size overhead
  data::cell: expose size overhead of external chunks
  atomic_cell: add external chunks and overheads to
    external_memory_usage()
  tests/mutation: test external_memory_usage()

(cherry picked from commit 2ffb621271)
2018-07-04 11:45:06 +02:00
Botond Dénes
090d991f8e query_pager: be prepared to _ranges being empty
do_fetch_page() checks in the beginning whether there is a saved query
state already, meaning this is not the first page. If there is not it
checks whether the query is for a singulular partitions or a range scan
to decide whether to enable the stateful queries or not. This check
assumed that there is at least one range in _ranges which will not hold
under some circumstances. Add a check for _ranges being empty.

Fixes: #3564
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <cbe64473f8013967a93ef7b2104c7ca0507afac9.1530610709.git.bdenes@scylladb.com>
(cherry picked from commit 59a30f0684)
2018-07-03 18:33:25 +03:00
Avi Kivity
ae15a80d01 Merge "more scylla_setup fixes" from Takuya
"
Added NIC / Disk existance check, --force-raid mode on
scylla_raid_setup.
"

* 'scylla_setup_fix4' of https://github.com/syuu1228/scylla:
  dist/common/scripts/scylla_raid_setup: verify specified disks are unused
  dist/common/scripts/scylla_raid_setup: add --force-raid to construct raid even only one disk is specified
  dist/common/scripts/scylla_setup: don't accept disk path if it's not block device
  dist/common/scripts/scylla_raid_setup: verify specified disk paths are block device
  dist/common/scripts/scylla_sysconfig_setup: verify NIC existance

(cherry picked from commit a36b1f1967)
2018-07-03 18:33:04 +03:00
Takuya ASADA
6cf902343a scripts: merge scylla_install_pkg to scylla-ami
scylla_install_pkg is initially written for one-liner-installer, but now
it only used for creating AMI, and it just few lines of code, so it should be
merge into scylla_install_ami script.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180612150106.26573-2-syuu@scylladb.com>
(cherry picked from commit 084c824d12)
2018-07-03 18:32:58 +03:00
Takuya ASADA
d5e59f671c dist/ami: drop Ubuntu AMI support
Drop Ubuntu AMI since it's not maintained for a long time, and we have
no plan to officially provide it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180612150106.26573-1-syuu@scylladb.com>
(cherry picked from commit fafcacc31c)
2018-07-03 18:32:53 +03:00
Avi Kivity
38944655c5 Uodate scylla-ami submodule
* dist/ami/files/scylla-ami 36e8511...0fd9d23 (2):
  > scylla_install_ami: merge scylla_install_pkg
  > scylla_install_ami: drop Ubuntu AMI

(cherry picked from commit 677991f353)
2018-07-03 18:32:45 +03:00
Avi Kivity
06e274ff34 Merge "scylla_setup fixes" from Takuya
"
I found problems on previously submmited patchset 'scylla_setup fixes'
and 'more fixes for scylla_setup', so fixed them and merged into one
patchset.

Also added few more patches.
"

* 'scylla_setup_fix3' of https://github.com/syuu1228/scylla:
  dist/common/scripts/scylla_setup: allow input multiple disk paths on RAID disk prompt
  dist/common/scripts/scylla_raid_setup: skip constructing RAID0 when only one disk specified
  dist/common/scripts/scylla_raid_setup: fix module import
  dist/common/scripts/scylla_setup: check disk is used in MDRAID
  dist/common/scripts/scylla_setup: move unmasking scylla-fstrim.timer on scylla_fstrim_setup
  dist/common/scripts/scylla_setup: use print() instead of logging.error()
  dist/common/scripts/scylla_setup: implement do_verify_package() for Gentoo Linux
  dist/common/scripts/scylla_coredump_setup: run os.remove() when deleting directory is symlink
  dist/common/scripts/scylla_setup: don't include the disk on unused list when it contains partitions
  dist/common/scripts/scylla_setup: skip running rest of the check when the disk detected as used
  dist/common/scripts/scylla_setup: add a disk to selected list correctly
  dist/common/scripts/scylla_setup: fix wrong indent
  dist/common/scripts: sync instance type list for detect NIC type to latest one
  dist/common/scripts: verify systemd unit existance using 'systemctl cat'

(cherry picked from commit 0b148d0070)
2018-07-03 18:32:35 +03:00
Avi Kivity
c24d4a8acb Merge "Fix handling of stale write replies in storage_proxy" from Gleb
"
If a coordinator sends write requests with ID=X and restarts it may get a reply to
the request after it restarts and sends another request with the same ID (but to
different replicas). This condition will trigger an assert in a coordinator. Drop
the assertion in favor of a warning and initialize handler id in a way to make
this situation less likely.

Fixes: #3153
"

* 'gleb/write-handler-id' of github.com:scylladb/seastar-dev:
  storage_proxy: initialize write response id counter from wall clock value
  storage_proxy: drop virtual from signal(gms::inet_address)
  storage_proxy: do not assert on getting an unexpected write reply

(cherry picked from commit a45c3aa8c7)
2018-07-02 11:56:52 +03:00
Nadav Har'El
5f95b76c65 repair: fix combination of "-pr" and "-local" repair options
When nodetool repair is used with the combination of the "-pr" (primary
range) and "-local" (only repair with nodes in the same DC) options,
Scylla needs to define the "primary ranges" differently: Rather than
assign one node in the entire cluster to be the primary owner of every
token, we need one node in each data-center - so that a "-local"
repair will cover all the tokens.

Fixes #3557.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180701132445.21685-1-nyh@scylladb.com>
(cherry picked from commit 3194ce16b3)
2018-07-02 11:56:41 +03:00
Tomasz Grabiec
0bdb7e1e7c row_cache: Fix memtable reads concurrent with cache update missing writes
Introduced in 5b59df3761.

It is incorrect to erase entries from the memtable being moved to
cache if partition update can be preempted because a later memtable
read may create a snapshot in the memtable before memtable writes for
that partition are made visible through cache. As a result the read
may miss some of the writes which were in the memtable. The code was
checking for presence of snapshots when entering the partition, but
this condition may change if update is preempted. The fix is to not
allow erasing if update is preemptible.

This also caused SIGSEGVs because we were assuming that no such
snapshots will be created and hence were not invalidating iterators on
removal of the entries, which results in undefined behavior when such
snapshots are actually created.

Fixes SIGSEGV in dtest: limits_test.py:TestLimits.max_cells_test

Fixes #3532

Message-Id: <1530129009-13716-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit b464b66e90)
2018-07-01 15:36:21 +03:00
Avi Kivity
56ea4f3154 Merge "Disable sstable filtering based on min/max clustering key components" from Tomasz
"
With DateTiered and TimeWindow, there is a read optimization enabled
which excludes sstables based on overlap with recorded min/max values
of clustering key components. The problem is that it doesn't take into
account partition tombstones and static rows, which should still be
returned by the reader even if there is no overlap in the query's
clustering range. A read which returns no clustering rows can
mispopulate cache, which will appear as partition deletion or writes
to the static row being lost. Until node restart or eviction of the
partition entry.

There is also a bad interaction between cache population on read and
that optimization. When the clustering range of the query doesn't
overlap with any sstable, the reader will return no partition markers
for the read, which leads cache populator to assume there is no
partition in sstables and it will cache an empty partition. This will
cause later reads of that partition to miss prior writes to that
partition until it is evicted from cache or node is restarted.

Disable until a more elaborate fix is implemented.

Fixes #3552
Fixes #3553
"

* tag 'tgrabiec/disable-min-max-sstable-filtering-v1' of github.com:tgrabiec/scylla:
  tests: Add test for slicing a mutation source with date tiered compaction strategy
  tests: Check that database conforms to mutation source
  database: Disable sstable filtering based on min/max clustering key components

(cherry picked from commit e1efda8b0c)
2018-06-27 17:01:28 +03:00
Calle Wilund
d9c178063c sstables::compress: Ensure unqualified compressor name if possible
Fixes #3546

Both older origin and scylla writes "known" compressor names (i.e. those
in origin namespace) unqualified (i.e. LZ4Compressor).

This behaviour was not preserved in the virtualization change. But
probably should be.

Message-Id: <20180627110930.1619-1-calle@scylladb.com>
(cherry picked from commit 054514a47a)
2018-06-27 17:01:22 +03:00
Avi Kivity
b21b7f73b9 version: prepare for scylla 2.3-rc0 2018-06-27 14:14:19 +03:00
Avi Kivity
9a7ecdb3b9 Merge "Deglobalise cache_tracker" from Paweł
"
Cache tracker is a thread-local global object that indirectly depends on
the lifetimes of other objects. In particular, a member of
cache_tracker: mutation_cleaner may extend the lifetime of a
mutation_partition until the cleaner is destroyed. The
mutation_partition itself depends on LSA migrators which are
thread-local objects. Since, there is no direct dependency between
LSA-migrators and cache_tracker it is not guarantee that the former
won't be destroyed before the latter. The easiest (barring some unit
tests that repeat the same code several billion times) solution is to
stop using globals.

This series also improves the part of LSA sanitiser that deals with
migrators.

Fixes #3526.

Tests: unit(release)
"

* tag 'deglobalise-cache-tracker/v1-rebased' of https://github.com/pdziepak/scylla:
  mutation_cleaner: add disclaimer about mutation_partition lifetime
  lsa: enhance sanitizer for migrators
  lsa: formalise migrator id requirements
  row_cache: deglobalise row cache tracker
2018-06-26 16:38:12 +01:00
Asias He
c3b5a2ecd5 gossip: Fix tokens assignment in assassinate_endpoint
The tokens vector is defined a few lines above and is needed outsie the
if block.

Do not redefine it again in the if block, otherwise the tokens will be empty.

Found by code inspection.

Fixes #3551.

Message-Id: <c7a06375c65c950e94236571127f533e5a60cbfd.1530002177.git.asias@scylladb.com>
2018-06-26 16:38:12 +01:00
Tomasz Grabiec
6d6b93d1e7 flat_mutation_reader: Move field initialization to initializer list
This works around a problem of std::terminate() being called in debug
mode build if initialization of _current throws.

Backtrace:

Thread 2 "row_cache_test_" received signal SIGABRT, Aborted.
0x00007ffff17ce9fb in raise () from /lib64/libc.so.6
(gdb) bt
  #0  0x00007ffff17ce9fb in raise () from /lib64/libc.so.6
  #1  0x00007ffff17d077d in abort () from /lib64/libc.so.6
  #2  0x00007ffff5773025 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
  #3  0x00007ffff5770c16 in ?? () from /lib64/libstdc++.so.6
  #4  0x00007ffff576fb19 in ?? () from /lib64/libstdc++.so.6
  #5  0x00007ffff5770508 in __gxx_personality_v0 () from /lib64/libstdc++.so.6
  #6  0x00007ffff3ce4ee3 in ?? () from /lib64/libgcc_s.so.1
  #7  0x00007ffff3ce570e in _Unwind_Resume () from /lib64/libgcc_s.so.1
  #8  0x0000000003633602 in reader::reader (this=0x60e0001160c0, r=...) at flat_mutation_reader.cc:214
  #9  0x0000000003655864 in std::make_unique<make_forwardable(flat_mutation_reader)::reader, flat_mutation_reader>(flat_mutation_reader &&) (__args#0=...)
    at /usr/include/c++/7/bits/unique_ptr.h:825
  #10 0x0000000003649a63 in make_flat_mutation_reader<make_forwardable(flat_mutation_reader)::reader, flat_mutation_reader>(flat_mutation_reader &&) (args#0=...)
    at flat_mutation_reader.hh:440
  #11 0x000000000363565d in make_forwardable (m=...) at flat_mutation_reader.cc:270
  #12 0x000000000303f962 in memtable::make_flat_reader (this=0x61300001d540, s=..., range=..., slice=..., pc=..., trace_state_ptr=..., fwd=..., fwd_mr=...)
    at memtable.cc:592

Message-Id: <1528792447-13336-1-git-send-email-tgrabiec@scylladb.com>
2018-06-25 20:03:23 +03:00
Avi Kivity
31eeae0126 Merge "Avoid buffer linearisation in read path" from Paweł
"
The read path on coordinator involves a lot of passing around buffers
and some occasional processing. We start with query::result obtained
from the storage_proxy which is then transformed into a
cql3::result_set, which is then used to write a response. Buffers are
copied and linearised quite excessively.

This series attempts to remedy that by using view of fragmented buffers
as much as possible. The first part deals with reading from
query::result. ser::buffer_view is introduced which enables the IDL
infrastructure to read a buffer without copying or linearising it.
The second part is switching native protocol layer to use bytes_ostream
instead of std::vector<char> to hold the generated response to the
client. The last part introduces cql3::result_generator which is an
alternative to cql3::result_set that passes buffer views without copying
or linearising anything from query::result to the native protocl layer
(or Thrift). It is only used in simple cases, when no processing at the
CQL layer is required, except for paged queries which require some
simple interpretation of the results and are supported by the result
generator.

Tests: unit(release), dtests(paging_test.py paging_additional_test.py
  cql_additional_tests.py cql_tracing_test.py cql_prepared_test.py
  cql_cast_test.py cql_tests.py)
"

* tag 'buffer-views-query-result/v2' of https://github.com/pdziepak/scylla: (34 commits)
  cql3: select_statement: use fetch_page_generator() if possible
  pager: add fetch_page_generator()
  pager: make the visitor handle_result() accepts a template parameter
  pager: make query_result_visitor base class a template parameter
  pager: make myvistor a member class of query_pager
  pager: make shared pointers to selection constant
  pager: merge query_pager and query_pagers::impl
  cql3: select_statement: use result_generator if possible
  cql3: selection: add is_trivial()
  cql3: result: support result_generator
  cql3: add lazy result_generator
  cql3: add result class
  cql3::result_set: fix encapsulation
  thrift: use cql3::result_set visiting interface
  transport: use cql3::result_set visiting interface
  cql3::result_set: add visit()
  transport: response: add write_int_placeholder()
  transport: steal response buffers and make send zero-copy
  transport: use reusable_buffer for compression
  transport: response: use bytes_ostream
  ...
2018-06-25 17:37:50 +03:00
Paweł Dziepak
bdc299cc38 mutation_cleaner: add disclaimer about mutation_partition lifetime
mutation_cleaner has already caused problems by extending lifetime of
mutation_partition past the lifetime of LSA migrators that it uses (due
to the fact that both the cleaner and migrators where thread-local
globals). Since, the long term goal is to make mutation_partition
internal representation depend more and more on schema that lifetime
extension may again cause problems in the future, so let's add a
disclaimer that hopefuly, will help avoiding them.
2018-06-25 09:37:43 +01:00
Paweł Dziepak
55bf9d78a6 lsa: enhance sanitizer for migrators
Current LSA sanitizer performs only basic checks on the migrators use,
without doing any additonal reporting in case an error is detected. This
patch enhances it so that when a problem is detected relevant stack
traces get printed.
2018-06-25 09:37:43 +01:00
Paweł Dziepak
fcd9b1f821 lsa: formalise migrator id requirements
object_descriptor uses special encoding for migrator ids which assumes
that the valid ones are in a range smaller than uint32_t. Let's add some
static asserts that make this fact more visible.
2018-06-25 09:37:43 +01:00
Paweł Dziepak
96b0577343 row_cache: deglobalise row cache tracker
Row cache tracker has numerous implicit dependencies on ohter objects
(e.g. LSA migrators for data held by mutation_cleaner). The fact that
both cache tracker and some of those dependencies are thread local
objects makes it hard to guarantee correct destruction order.

Let's deglobalise cache tracker and put in in the database class.
2018-06-25 09:37:43 +01:00
Paweł Dziepak
2b1fcfe019 cql3: select_statement: use fetch_page_generator() if possible 2018-06-25 09:21:47 +01:00
Paweł Dziepak
1cf3cb285f pager: add fetch_page_generator()
fetch_page_generator() is an equivalent of fetch_page(), but instead of
building a cql3::result_set it returns a cql3::result_generator().
2018-06-25 09:21:47 +01:00
Paweł Dziepak
f6fe831d49 pager: make the visitor handle_result() accepts a template parameter 2018-06-25 09:21:47 +01:00
Paweł Dziepak
fc87ca5926 pager: make query_result_visitor base class a template parameter
So far query_result_visitor was tied to result_set_builder. The goal is
to enable result_generator to work with paged queries as well so we need
to decouple them.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
dc9a65ea76 pager: make myvistor a member class of query_pager
It is going to be come a class template.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
319b2cde7e pager: make shared pointers to selection constant
Shared pointers make code harder to reason about, it is not easy to get
rid of them in this piece of the code, but we can restore at least a bit
of sanity by adding consts.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
327d3de51e pager: merge query_pager and query_pagers::impl
There is just a single implementation of query_pager and there is no
reason to make anything virtual. Devirtualising this code will allow
higher layers to pass visitors via templates.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
fa5dea91e7 cql3: select_statement: use result_generator if possible 2018-06-25 09:21:47 +01:00
Paweł Dziepak
3f1184d16d cql3: selection: add is_trivial()
cql3::result_generator supports only trivial selections.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
adad31ba6b cql3: result: support result_generator
cql3::result can now hold either a result_set or a result_generator.
Some code that is not performance critical expects to get result_set so
a way of converting the result_generator to a result_set is added.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
02443d10db cql3: add lazy result_generator
result_generator is a restricted alternative of result_set. It supports
only the simples cases, but is much cheaper as it passes data almost
directly from query::result to its visitor bypassing much of the CQL
layer.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
dca68afce6 cql3: add result class
So far the only way of returing a result of a CQL query was to build a
result_set. An alternative lazy result generator is going to be
introduced for the simple cases when no transformations at CQL layer are
needed. To do that we need to hide the fact that there are going to be
multiple representations of a cql results from the users.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
29cc4a4c0b cql3::result_set: fix encapsulation 2018-06-25 09:21:47 +01:00
Paweł Dziepak
8f26d9c03f thrift: use cql3::result_set visiting interface 2018-06-25 09:21:47 +01:00
Paweł Dziepak
54d5dc414d transport: use cql3::result_set visiting interface 2018-06-25 09:21:47 +01:00
Paweł Dziepak
2e4234ab63 cql3::result_set: add visit()
This visiting interface for result_set satisfies most of its users (at
least all of those which are in the hot path). It will allow having an
alternative of result_set (i.e. lazy result generator) which would
provide exaclty the same interface.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
c0e7160625 transport: response: add write_int_placeholder()
This allows the response writer to defer writing integers until later
time. It will be used by lazy response generator which will know the
number of rows in the response only after they are all written.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
88aff8eda8 transport: steal response buffers and make send zero-copy
Each response is sent only once, so we can safely steal its buffers and
pass them to the output_stream using the zero-copy interface.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
821e6683e3 transport: use reusable_buffer for compression
Compression algorithms require us to linearise bytes_ostream. This may
cause an excessive number of large allocations. Using reusable_buffers
can avoid that.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
a7c4d407ce transport: response: use bytes_ostream
std::vector<char> is not a very good container for incrementally
building a response. It may cause excessive copies and allocations. If
the response is large it will put more pressure on the memory allocator
by requiring the buffer to be contiguous.

We already have bytes_ostream which avoids all of these problems, so
let's use it.
2018-06-25 09:22:43 +01:00
Paweł Dziepak
c04d38b76b transport: drop response::make_message() 2018-06-25 09:22:35 +01:00
Paweł Dziepak
444acf49af transport: use std::unique_ptr for the response
So far cql_server::response was passed around using shared pointers.
They have very big cost of making it hard to reason about the code. All
that is not necessary and we can easily switch to using much more
sensible std::unique_ptr.
2018-06-25 09:22:24 +01:00
Paweł Dziepak
12f89299b2 transport: move response to a separate header
There are some other translation units which right now are satisfied
with the response being an incomplete type. This means that
std::unique_ptr can't be used for it. Let's move the class declaration
to a header that can be included where needed.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
3b9ba30497 tests: add test for reusable buffers 2018-06-25 09:21:47 +01:00
Paweł Dziepak
b4c5e1a6d4 utils: add reusable_buffer
This commit adds a helper class reusable_buffer which can be used to
avoid excessive memory allocations of large buffers when bytes_ostream
needs to be linearised. The idea is that reusable_buffer in most cases
is going to be thread local so that multiple continuation chains can
reuse the same large buffer.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
8feab33cf4 query::result: use std::optional instead of experimental version 2018-06-25 09:21:47 +01:00
Paweł Dziepak
9d140488bd tests/perf: add performance test for IDL 2018-06-25 09:21:47 +01:00
Paweł Dziepak
4704c4efab query::result: avoid copying and linearising cell value
query::result_view already operates on views of a serialised
query::result. However, until now the value of a cell was always
linearised and copied. This patch makes use of ser::buffer_view to avoid
that.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
982f71a804 query::result_view: add concept 2018-06-25 09:21:47 +01:00
Paweł Dziepak
2914f64b2d serializer: user buffer_view in bytes deserialiser 2018-06-25 09:21:47 +01:00
Paweł Dziepak
19caf709e1 serializer: add view of a fragmented stream
ser::buffer_view is a view of a fragmented buffer in a stream od
IDL-serialised data. It can be used to deserialise IDL objects without
needless copying and linearisation of large blobs.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
fe8dc1fa5c bytes_ostream: add remove_suffix() 2018-06-25 09:21:47 +01:00
Paweł Dziepak
969219d5bc tests/random-utils: add missing include 2018-06-25 09:21:47 +01:00
Paweł Dziepak
a85197a7b5 bytes_ostream: make fragment_iterator default constructible 2018-06-25 09:21:47 +01:00
Piotr Sarna
828497ad19 hints: amend a comment in device limits
To make the comment less confusing, 'group of managers'
is used instead of 'device'.

Refs #3516

Reported-by: Vlad Zolotarov <vladz@scylladb.com>
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
Message-Id: <60c9ab6b47195570f7ce7dff9556e3739b7ae00f.1529862547.git.sarna@scylladb.com>
2018-06-24 19:14:59 +01:00
Avi Kivity
48dc875e49 Merge "convert setup scripts to python3" from Takuya
"
Converted all setup scripts from bash to python3.
"

* 'scripts_python_conversion_v1' of https://github.com/syuu1228/scylla:
  dist/common/scripts: convert scylla_kernel_check to python3
  dist/common/scripts: convert scylla_ec2_check to python3
  dist/common/scripts: convert scylla_sysconfig_setup to python3
  dist/common/scripts: convert scylla_setup to python3
  dist/common/scripts: convert scylla_selinux_setup to python3
  dist/common/scripts: convert scylla_raid_setup to python3
  dist/common/scripts: convert scylla_ntp_setup to python3
  dist/common/scripts: convert scylla_fstrim_setup to python3
  dist/common/scripts: convert scylla_dev_mode_setup to python3
  dist/common/scripts: convert scylla_cpuset_setup to python3
  dist/common/scripts: convert scylla_cpuscaling_setup to python3
  dist/common/scripts: convert scylla_coredump_setup to python3
  dist/common/scripts: convert scylla_bootparam_setup to python3
  dist/common/scripts: extend scylla_util.py to convert setup scripts to python3
  dist/common/scripts: convert scylla_io_setup and scylla_util.py to python3
2018-06-24 15:02:08 +03:00
Avi Kivity
40dbdae24e Update seastar submodule
> Merge "Allow creating views from simple streams" from Paweł
  > IOTune: allow duration to be configurable and change its defaults
2018-06-24 14:54:46 +03:00
Avi Kivity
cb549c767a database: rename column_family to table
The name "column_family" is both awkward and obsolete. Rename to
the modern and accurate "table".

An alias is kept to avoid huge code churn.

To prevent a One Definition Rule violation, a preexisting "table"
type is moved to a new namespace row_cache_stress_test.

Tests: unit (release)
Message-Id: <20180624065238.26481-1-avi@scylladb.com>
2018-06-24 14:54:46 +03:00
Tomasz Grabiec
2d4177355a Merge "Support for writing range tombstones to SSTables 3.x" from Vladimir
This patchset brings support for writing range tombstones to SSTables
3.x. ('mc' format).

In SSTables 3.x, range tombstones are represented by so-called range
tombstone markers (hereafter RT markers) that denote range tombstone
start and end bounds. So each range tombstone is represented in data
file by two ordered RT markers.
There are also markers that both close the previous range tombstone and
open the new one in case if two range tombstones are ajdacent. This is
done to consume less disk space on such occasions.
Range tombstones written as RT markers are naturally non-overlapping.

* github.com:argenet/scylla projects/sstables-30/write-range-tombstones/v6
range_tombstone_stream: Remove an unused boolean flag.
Revert "Add missing enum values to bound_kind."
sstables: Move to_deletion_time helper up and make it static.
sstables: Write end-of-partition byte before flushing the last index
block.
sstables: Add support for writing range tombstones in SSTables 3.x
format.
tests: Add unit test covering simple range tombstone.
tests: Add unit test covering adjacent range tombstones.
tests: Add test to cover non-adjacent RTs.
tests: Add test covering mixed rows and range tombstones.
tests: Add test covering SSTables 3.x with many RTs.
tests: Add unit test covering overlapping RTs and rows.
tests: Add tests writing a range tombstone and a row overlapping with
its start.
tests: Add tests writing a range tombstone and a row overlapping with
its end.
tests: Add function that writes from multiple memtable into SSTables.
tests: Add test where 2nd range tombstone covers the remainder of the
1st one.
tests: Add test writing two non-adjacent range tombstones with same
clustering key prefix at their bounds.
tests: Add test covering overlapped range tombstones.
2018-06-22 15:47:18 +02:00
Tomasz Grabiec
f09fff090a Merge 'Enhance space watchdog' from Piotr Sarna
"
This series addresses issue #3516 and enhances space watchdog to make it
device-aware. It's needed because since last MV-related changes, space
watchdog can be responsible for multiple hints manager, which means
multiple directories, which may mean multiple devices.
Hence, having a single static space size limit is not enough anymore
and watchdog should take it into account that different managers
may work on different disks, while yet another managers can share
the same device.

Tests: unit (release)
"

* 'enhance_space_watchdog_4' of https://github.com/psarna/scylla:
hints: reserve more space for dedicated storage
hints: add is_mountpoint function
hints: make space_watchdog device-aware
hints: add device_id to manager
hints: add get_device_id function
2018-06-22 15:45:47 +02:00
Piotr Sarna
8b43ac3a57 hints: reserve more space for dedicated storage
Reserving 10% of space for hints managers makes sense if the device
is shared with other components (like /data or /commitlog).
But, if hints directory is mounted on a dedicated storage, it makes
sense to reserve much more - 90% was chosen as a sane limit.
Whether storage is 'dedicated' or not is based on a simple check
if given hints directory is a mount point.

Fixes #3516

Signed-off-by: Piotr Sarna <sarna@scylladb.com>
2018-06-22 10:27:00 +02:00
Piotr Sarna
32f86ca61e hints: add is_mountpoint function
A helper function that checks whether a path is also a mount point
is added.

Signed-off-by: Piotr Sarna <sarna@scylladb.com>
2018-06-22 10:26:52 +02:00
Piotr Sarna
b6c1b8c5ef hints: make space_watchdog device-aware
Instead of having one static space limit for all directories,
space_watchdog now keeps a per-device limit, shared among
hints managers residing on the same disks.

References #3516

Signed-off-by: Piotr Sarna <sarna@scylladb.com>
2018-06-22 10:26:45 +02:00
Piotr Sarna
d22668de04 hints: add device_id to manager
In order to make space_watchdog device-aware, device_id field
is added to hints manager. It's an equivalent of stat.st_dev
and it identifies the disk that contains manager's root directory.

Signed-off-by: Piotr Sarna <sarna@scylladb.com>
2018-06-22 10:26:37 +02:00
Piotr Sarna
91b5e33c6a hints: add get_device_id function
In order to distinguish which directories reside on which devices,
get_device_id function is added to resource manager.

Signed-off-by: Piotr Sarna <sarna@scylladb.com>
2018-06-22 10:25:47 +02:00
Takuya ASADA
ca52407fd6 dist/common/scripts: convert scylla_kernel_check to python3
Convert bash script to python3.
2018-06-22 12:31:12 +09:00
Takuya ASADA
5efbb714ff dist/common/scripts: convert scylla_ec2_check to python3
Convert bash script to python3.
2018-06-22 12:30:59 +09:00
Takuya ASADA
d0b9464dc7 dist/common/scripts: convert scylla_sysconfig_setup to python3
Convert bash script to python3.
2018-06-22 12:30:49 +09:00
Takuya ASADA
d3a3d0f8de dist/common/scripts: convert scylla_setup to python3
Convert bash script to python3.
2018-06-22 12:30:37 +09:00
Takuya ASADA
8030e89725 dist/common/scripts: convert scylla_selinux_setup to python3
Convert bash script to python3.
2018-06-22 12:30:29 +09:00
Takuya ASADA
8cfc4f1c3d dist/common/scripts: convert scylla_raid_setup to python3
Convert bash script to python3.
2018-06-22 12:30:19 +09:00
Takuya ASADA
63a287b7d4 dist/common/scripts: convert scylla_ntp_setup to python3
Convert bash script to python3.
2018-06-22 12:30:10 +09:00
Takuya ASADA
01eea76a4e dist/common/scripts: convert scylla_fstrim_setup to python3
Convert bash script to python3.
2018-06-22 12:29:56 +09:00
Takuya ASADA
5e07567c60 dist/common/scripts: convert scylla_dev_mode_setup to python3
Convert bash script to python3.
2018-06-22 12:29:44 +09:00
Takuya ASADA
ccc6dbf6c7 dist/common/scripts: convert scylla_cpuset_setup to python3
Convert bash script to python3.
2018-06-22 12:29:25 +09:00
Takuya ASADA
7fd81510a4 dist/common/scripts: convert scylla_cpuscaling_setup to python3
Convert bash script to python3.
2018-06-22 12:29:04 +09:00
Takuya ASADA
e858674a79 dist/common/scripts: convert scylla_coredump_setup to python3
Convert bash script to python3.
2018-06-22 12:28:50 +09:00
Takuya ASADA
b3ee02dd1e dist/common/scripts: convert scylla_bootparam_setup to python3
Convert bash script to python3.
2018-06-22 12:27:56 +09:00
Takuya ASADA
2a4ba883c8 dist/common/scripts: extend scylla_util.py to convert setup scripts to python3
To porting setup scripts to python3, following utility functions/classes
introduced:
 - run(): execute command line, returns return code
 - out(): execute command line, returns stdout as string
 - is_debian_variant() / is_redhat_variant() / is_gentoo_variant()
 / is_ec2() / is_systemd(): detect specific environment
 - hex2list(): implement hex2list.py code as a function
 - makedirs(): same as os.makedirs() but do nothing when dir is exists
 - dist_name() / dist_ver(): alias of platform.dist()
 - class systemd_unit: an utility to control systemd unit using systemctl
 - class sysconfig_parser: reader/writer of /etc/sysconfig files
 - class concolor: ANSI color escape sequences list
2018-06-22 12:21:37 +09:00
Takuya ASADA
b7c980ac56 dist/common/scripts: convert scylla_io_setup and scylla_util.py to python3
To share scylla_util.py with python3 converted setup scripts, these
scripts need to be python3 too.
2018-06-22 12:11:27 +09:00
Glauber Costa
7f6b6fa129 github: direct users asking questions to our mailing list.
Very often people use the issue tracker to just ask questions. We have
been telling them to close the bug and move the discussion somewhere
else but it would be better if people were already directed to the right
place before they even get it wrong.

This would be easier to everybody.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180621135051.3254-1-glauber@scylladb.com>
2018-06-21 17:43:23 +03:00
Tomasz Grabiec
0f380f24c3 Update seastar submodule
* seastar 3c60b82...7aca670 (2):
  > Merge "Log stack trace during exception" from Gleb
  > shared_ptr: Introduce lw_shared_ptr::dispose() for convenience
2018-06-21 12:19:33 +02:00
Vladimir Krivopalov
ea09cf732d tests: Add test covering overlapped range tombstones.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
5df3cd1787 tests: Add test writing two non-adjacent range tombstones with same clustering key prefix at their bounds.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
35b90b2d1e tests: Add test where 2nd range tombstone covers the remainder of the 1st one.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
2f277c29e8 tests: Add function that writes from multiple memtable into SSTables.
This comes in handy when we want to test overlapping range tombstones
because memtable would otherwise de-overlap them internally.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
41d283fe83 tests: Add tests writing a range tombstone and a row overlapping with its end.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
ff53f601e4 tests: Add tests writing a range tombstone and a row overlapping with its start.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
f552f30d57 tests: Add unit test covering overlapping RTs and rows.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
27e053f933 tests: Add test covering SSTables 3.x with many RTs.
This test checks the validity of the promoted index generated for an
RT-only data file.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
aa4a011eb3 tests: Add test covering mixed rows and range tombstones.
Tests three cases:
 - a row lying inside a range tombstone
 - a row that has the same clustering key as range tombstone start
 - a row that has the same clustering key as range tombstone end

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
492a401855 tests: Add test to cover non-adjacent RTs.
These are two RTs where one's RT end clustering is the same as another
one's RT start bound but they are both exclusive.

In this case those bounds should not (and cannot) be merged into a
single RT boundary when writing RT markers.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
3a96226492 tests: Add unit test covering adjacent range tombstones.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
b3e7982fec tests: Add unit test covering simple range tombstone.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
5559fc2121 sstables: Add support for writing range tombstones in SSTables 3.x format.
For SSTables 3.x. ('mc' format), range tombstones are represented by
their bounds that are written to the data file as so-called RT markers.
For adjacent range tombstones, an RT marker can be of a 'boundary' type
which means it closes the previous range tombstone and opens the new
one.

Internally, sstable_writer_m relies on range_tombstone_stream to both
de-overlap incoming range tombstones and order them so that when they
are drained they can be easily thought of as just pairs of their bounds.
2018-06-20 18:08:36 -07:00
Noam Hasson
6572917fda docker: added support for authenticator & authorizer command arguments
By default Scylla docker runs without the security features.
This patch adds support for the user to supply different params values for the
authenticator and authorizer classes and allowing to setup a secure Scylla in
Docker.
For example if you want to run a secure Scylla with password and authorization:
docker run --name some-scylla -d scylladb/scylla --authenticator
PasswordAuthenticator --authorizer CassandraAuthorizer

Update the Docker documentation with the new command line options.

Signed-off-by: Noam Hasson <noam@scylladb.com>
Message-Id: <20180620122340.30394-1-noam@scylladb.com>
2018-06-20 20:33:59 +03:00
Gleb Natapov
f53ae2d07f storage_service: avoid "ignored future" message during schema check failure
Message-Id: <20180620134402.GQ1918@scylladb.com>
2018-06-20 18:53:47 +03:00
Takuya ASADA
6acb2add4a dist/ami: show unsupported instance type message even scylla_ami_setup is still running
On current .bash_profile it prints "Constructing RAID volume..." when
scylla_ami_setup is still running, even it running on unsupported
instance types.

To avoid that we need to run instance type check at first, then we can
run rest of the script.

Fixes #2739

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180613111539.30517-1-syuu@scylladb.com>
2018-06-20 16:49:15 +03:00
Takuya ASADA
4151120752 dist/debian: change owner of build/debs/ to current user
Currently build/debs/ is owned by root user since pbuilder requires to
run in root.
So chown them after finished building.

Fixes #3447

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180613093213.28827-1-syuu@scylladb.com>
2018-06-20 16:48:17 +03:00
Tomasz Grabiec
4523706312 gdb: Adjust for removal of the 'fsu' field from cpu_mem
Message-Id: <1529497459-15287-1-git-send-email-tgrabiec@scylladb.com>
2018-06-20 15:27:35 +03:00
Avi Kivity
b97e1aeff5 Merge "Consume row marker correctly" from Piotr
"
Make sure we properly handle row marker and row tombstone
when reading a row.

Tests: unit {release}
"

* 'haaawk/sstables3/read-liveness-info-v4' of ssh://github.com/scylladb/seastar-dev:
  sstable: consume row marker in data_consume_rows_context_m
  sstable: Add consumer_m::consume_row_marker_and_tombstone
  sstable: add is_set and to_row_marker to liveness_info
2018-06-20 14:44:03 +03:00
Piotr Jastrzebski
75edaff7b6 sstable: consume row marker in data_consume_rows_context_m
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-20 13:13:29 +02:00
Piotr Jastrzebski
cbfc741d70 sstable: Add consumer_m::consume_row_marker_and_tombstone
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-20 13:13:16 +02:00
Tomasz Grabiec
5548eb96f7 Merge "store prepared statements parameters values" from Vlad
* https://github.com/vladzcloudius/scylla.git tracing_prepared_parameters-v6:
  cql3::query_options: add get_names() method
  tracing::trace_state: hide the internals of params_values
  tracing: store queries statements for BATCH
  tracing: store the prepared statements parameters values
2018-06-19 19:12:26 +02:00
Avi Kivity
f912eefbe2 Merge "Fix numerous issues in AMI related scriptology" from Vlad
"
A few fixes in scripts that were found when debugging #3508.
This series fixed this issue.

"

Fixes #3508

* 'ami_scripts_fixes-v1' of https://github.com/vladzcloudius/scylla:
  scylla_io_setup: properly define the disk_properties YAML hierarchy
  scylla_io_setup: fix a typo: s/write_bandwdith/write_bandwidth/
  scylla_io_setup: hardcode the "mountpoint" YAML node to "/var/lib/scylla" for AMIs
  scylla_io_setup: print the io_properties.yaml file name and not its handle info
  scylla_lib.sh: tolerate perftune.py errors
2018-06-19 19:31:23 +03:00
Avi Kivity
e0eb66af6b Merge "Do not allow compaction controller shares to grow indefinitely" from Glauber
"
We are seeing some workloads with large datasets where the compaction
controller ends up with a lot of shares. Regardless of whether or not
we'll change the algorithm, this patchset handles a more basic issue,
which is the fact that the current controller doesn't set a maximum
explicitly, so if the input is larger than the maximum it will keep
growing without bounds.

It also pushes the maximum input point of the compaction controller from
10 to 30, allowing us to err on the side of caution for the 2.2 release.
"

* 'tame-controller' of github.com:glommer/scylla:
  controller: do not increase shares of controllers for inputs higher than the maximum
  controller: adjust constants for compaction controller
2018-06-19 18:49:02 +03:00
Avi Kivity
b6b5647836 Merge "Fix querier-cache related issues" from Botond
"
This mini series fixes some querier-cache related issues discovered
while working on stateful range-scans.
1) A problem in the memory based cache eviction test that is is yet
   unexposed (#3529).
2) Possible usage of invalidated iterators in querier_cache (#3424).
3) lookup() possibly returning a querier with the wrong read range
   (#3530).

Tests: unit(release)
"

* 'fix-querier-cache-invalid-iterators-master' of https://github.com/denesb/scylla:
  querier: find_querier(): return end() when no querier matches the range
  querier_cache: restructure entries storage
  tests/querier_cache: fix memory based eviction test
2018-06-19 16:29:03 +03:00
Paweł Dziepak
e55034a33e cql3: batch_statement: use external_memory_usage() to get mutation size
batch_statement::verify_batch_size() verifies that the total size of
mutations generated by the batch statement is smaller than certain
configurable thresholds. This is done by a custom mutation_partition
visitor, which violates atomic_cell_view::value() preconditions by
calling it even for dead cells.

The simples solution is to use
mutation_partition::external_memory_usage() instead.

Message-Id: <20180619131405.12601-1-pdziepak@scylladb.com>
2018-06-19 16:26:52 +03:00
Duarte Nunes
d3e24076b0 tests/cell_locker_test: Prevent timeout underflow
Timeout underflow causes the test to hang, due to a seastar bug
with negative time_points.


Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180619091635.34228-1-duarte@scylladb.com>
2018-06-19 16:26:52 +03:00
Duarte Nunes
ee4b3c4c2d database: Await pending writes before truncating CF on drop
When dropping a table, wait for the column family to quiesce so that
no pending writes compete with the truncate operation, possibly
allowing data to be left on disk.

Fixes #2562

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180618193134.31971-1-duarte@scylladb.com>
2018-06-19 16:26:52 +03:00
Botond Dénes
9490b8935c .gitignore: add resources directory
This directory is necessary when running dtests against a scylla
repository.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <8d9ad8dae2b9d2ec3cc6c9c4d6527fba8ce91272.1529387008.git.bdenes@scylladb.com>
2018-06-19 16:26:51 +03:00
Piotr Sarna
61e3ee6c3c cql3: fix supernumerary column on view update
Patch f39891a999 fixed 3443,
but also introduced a regression in dtest - new column
was unconditionally added to view during ALTER TABLE ADD,
while it should only be the case for "include all columns" views.
This patch fixes the regression (spotted by query_new_column_test).

References #3443
Message-Id: <7410d965255a514d78cf0ce941a3236b9d8ddbbd.1529399135.git.sarna@scylladb.com>
2018-06-19 16:26:51 +03:00
Botond Dénes
2609a17a23 querier: find_querier(): return end() when no querier matches the range
When none of the queriers found for the lookup key match the lookup
range `_entries.end()` should be returned as the search failed. Instead
the iterator returned from the failed `std::find_if()` is returned
which, if the find failed, will be the end iterator returned by the
previous call to `_entries.equal_range()`. This is incorrect because as
long as `equal_range()`'s end iterator is not also `_entries.end()` the
search will always return an iterator to a querier regardless of whether
any of them actually matches the read range.
Fix by returning `_entries.end()` when it is detected that no queriers
match the range.

Fixes: #3530
2018-06-19 13:20:43 +03:00
Botond Dénes
7ce7f3f0cc querier_cache: restructure entries storage
Currently querier_cache uses a `std::unordered_map<utils::UUID, querier>`
to store cache entries and an `std::list<meta_entry>` to store meta
information about the querier entries, like insertion order, expiry
time, etc.

All cache eviction algorithms use the meta-entry list to evict entries
in reverse insertion order (LRU order). To make this possible
meta-entries keep an iterator into the entry map so that given a
meta-entry one can easily erase the querier entry. This however poses a
problem as std::unordered_map can possibly invalidate all its iterators
when new items are inserted. This is use-after-free waiting to happen.

Another disadvantages of the current solution is that it requires the
meta-entry to use a weak pointer to the querier entry so that in case
that is removed (as a result of a successful lookup) it doesn't try to
access it. This has an impact on all cache eviction algorithms as they
have to be prepared to deal with stale meta-entries. Stale meta-entries
also unnecesarily consume memory.

To solve these problems redesign how querier_cache stores entries
completely. Instead of storing the entries in an `std::unordered_map`
and storing the meta-entries in an `std::list`, store the entries in an
`std::list` and an intrusive-map (index) for lookups. This new design
has severeal advantages over the old one:
* The entries will now be in insert order, so eviction strategies can
  work on the entry list itself, no need to involve additional data
  structures for this.
* All data related to an entry is stored in one place, no data
  duplication.
* Removing an entry automatically removes it from the index as intrusive
  containers support auto unlink. This means there is no need to store
  iterators for long terms, risking use-after-free when the container
  invalidates it's iterators.

Additional changes:
* Modify eviction strategies so that they work with the `entry`
  interface rather than the stored value directly.

Ref #3424
2018-06-19 13:20:40 +03:00
Botond Dénes
b9d51b4c08 tests/querier_cache: fix memory based eviction test
Do increment the key counter after inserting the first querier into the
cache. Otherwise two queriers with the same key will be inserted and
will fail the test. This problem is exposed by the changes the next
patches make to the querier-cache but will be fixed before to maintain
bisectability of the code.

Fixes: #3529
2018-06-19 13:20:13 +03:00
Vladimir Krivopalov
100eb03f29 sstables: Write end-of-partition byte before flushing the last index block.
This is to stay compliant with the Origin for SSTables 3.x.
It differs from SSTables 2.x (ka/la) as for those the last promoted
index block is pushed first and the end-of-partition byte is written
after.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-18 14:28:25 -07:00
Vladimir Krivopalov
ad0b911b03 sstables: Move to_deletion_time helper up and make it static.
It is used for writing end_open_marker for promoted index.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-18 14:25:13 -07:00
Vladimir Krivopalov
03cf20676c Revert "Add missing enum values to bound_kind."
This reverts commit 3ecc9e9ce4.

It also adds another enum to be used instead.
2018-06-18 14:22:12 -07:00
Vladimir Krivopalov
0cf42e7fd2 range_tombstone_stream: Remove an unused boolean flag.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-18 14:22:12 -07:00
Glauber Costa
e0b209b271 controller: do not increase shares of controllers for inputs higher than the maximum
Right now there is no limit to how much the shares of the controllers
can grow. That is not a big problem from the memtable flush controller,
since it has a natural maximum in the dirty limit.

But the compaction controller, the way it's written today, can grow
forever and end up with a very large value for shares. We'll cap that at
adjust() time by not allowing shares to grow indefinitely.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-18 15:16:39 -04:00
Glauber Costa
70c47eb045 controller: adjust constants for compaction controller
Right now the controller adjusts its shares based on how big the backlog
is in comparison to shard memory. We have seen in some tests that if the
dataset becomes too big, this may cause compactions to dominate.

While we may change the input altogether in future versions, I'd like to
propose a quick change for the time being: move the high point from 10x
memory size to 30x memory size. This will cause compactions to increase
in shares more slowly.

While this is as magic as the 10 before, they will allow us to err in
the side of caution, with compactions not becoming aggressive enough to
overly disrupt workloads.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-18 15:16:38 -04:00
Piotr Jastrzebski
4c261d2e51 sstable: add is_set and to_row_marker to liveness_info
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-18 20:26:39 +02:00
Paweł Dziepak
71471bb322 Merge "Make front-end processing scheduling aware" from Avi
"
This patchset runs the protocol servers under the "statement" scheduling
group, and makes all execution_stages in that path scheduling aware.

I used inheriting_concrete_execution_stage instead of passing the
scheduling group to concrete_execution_stage's constructor for two
reasons:

 1. For cql statements, there is no easily accessible object that
    can host the concrete_execution_stage and be reached from both
    main.cc and the statements,
 2. In the future, we will want to assign users to different
    scheduling_groups, thus providing performance isolation for
    service-level agreements (SLAs). Using an inheriting
    execution_stage allows us to make the scheduling_group decision
    in one place.

Depends on two unmerged patches in seastar, one fixing
inheriting_concrete_execution_stage compilation with reference parameters,
and one making smp::submit_to() scheduling aware.
"

* tag 'cql-sched/v1' of https://github.com/avikivity/scylla:
  cql: make modification_statement execution_stage scheduling aware
  cql: make batch_statement execution_stage scheduling aware
  cql: make select_statement execution_stage scheduling aware
  transport: make native protocol request processing execution_stage scheduling aware
  main: start client protocol servers under the statement scheduling group
2018-06-18 16:38:30 +01:00
Avi Kivity
0cf4cf5981 cql: make modification_statement execution_stage scheduling aware
Inherit scheduling from the caller, preventing a fall back into the main group.
2018-06-18 18:30:21 +03:00
Avi Kivity
9479d3f345 cql: make batch_statement execution_stage scheduling aware
Inherit scheduling from the caller, preventing a fall back into the main group.
2018-06-18 18:30:21 +03:00
Avi Kivity
fdfc347595 cql: make select_statement execution_stage scheduling aware
Inherit scheduling from the caller, preventing a fall back into the main group.
2018-06-18 18:30:21 +03:00
Avi Kivity
ec788d2a7a transport: make native protocol request processing execution_stage scheduling aware
Inherit scheduling from the caller, preventing a fall back into the main group.
2018-06-18 18:30:21 +03:00
Avi Kivity
ea39e3e9d4 main: start client protocol servers under the statement scheduling group
This will isolate client protocol and coordinator-side processing from
the rest of the system.
2018-06-18 18:30:21 +03:00
Paweł Dziepak
79fae49689 Merge seastar upstream
* seastar 6422ece...3c60b82 (5):
  > reactor: inherit scheduling_group in smp::submit_to()
  > execution_stage: fix inheriting_concrete_execution_stage with reference arguments
  > tests: shared_ptr: Add typename keyword to fix compilation
  > configure: Fix --static-stdc++ flag
  > scheduling: Move friends' definitions outside the class scope
2018-06-18 12:16:57 +01:00
Avi Kivity
782827cc1b Update seastar submodule
* seastar e7275e4...6422ece (7):
  > build: enable concepts whenever they are supported by compiler
  > shared_ptr: Enable releasing ownership of the object stored in lw_shared_ptr
  > reactor: change way of calculating task quota violations
  > Merge "Add metrics for steal time and task quota violations" from Glauber
  > bitops.hh/log2ceil(): add special case for n == 1
  > circular_buffer: add clear()
  > build: add core/execution_stage.{cc,hh} to core_files
2018-06-17 21:53:50 +03:00
Avi Kivity
f0fc888381 Merge "Try harder to move STCS towards zero-backlog" from Glauber
"
Tests: unit (release)

Before merging the LCS controller, we merged patches that would
guarantee that LCS would move towards zero backlog - otherwise the
backlog could get too high.

We didn't do the same for STCS, our first controlled strategy. So we may
end up with a situation where there are many SSTables inducing a large
backlog, but they are not yet meeting the minimum criteria for
compaction. The backlog, then, never goes down.

This patch changes the SSTable selection criteria so that if there is
nothing to do, we'll keep pushing towards reaching a state of zero
backlog. Very similar to what we did for LCS.
"

* 'stcs-min-threshold-v4' of github.com:glommer/scylla:
  STCS: bypass min_threshold unless configure to enforce strictly
  compaction_strategy: allow the user to tell us if min_threshold has to be strict
2018-06-17 18:07:23 +03:00
Glauber Costa
fd51ff3d9e STCS: bypass min_threshold unless configure to enforce strictly
If we fail to produce a SizeTiered compaction with the configured
min_threshold, we can try again to compact any two - unless there is a
global bypass telling us no to.

This will still privilege doing larger compactions in size buckets where
that is possible, but if we are idle will try to compact any two

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-15 14:27:22 -04:00
Glauber Costa
290d553c3a compaction_strategy: allow the user to tell us if min_threshold has to be strict
Now that we have the controller, we would like to take min_threshold as
a hint. If there is nothing to compact, we can ignore that and start
compacting less than min_threshold SSTables so that the backlog keeps
reducing.

But there are cases in which we don't want min_threshold to be a hint
and we want to enforce it strictly. For instance, if write amplification
is more of a concern than space amplification.

This patch adds a YAML option that allows the user to tell us that. We will
default to false, meaning min_threshold is not strictly enforced.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-15 13:42:43 -04:00
Avi Kivity
75b53c4170 Merge "sstables 3.x read counters v2 00/10] Support reading counters" from Piotr
"
Implement and test support for reading counters in SSTables 3.
"

* 'haaawk/sstables3/read-counters-v2' of ssh://github.com/scylladb/seastar-dev:
  sstable_3_x_test: add test for counters
  data_consume_rows_context_m: support reading counters
  Add consumer_m::consume_counter_column
  Extract make_counter_cell
  row.hh & mp_row_consumer.hh: Add required includes
  Use serialization_header::adjust in read_statistics
  sstables 3: add serialization_header::adjust
  data_consume_rows_context_m: add is_column_counter
  data_consume_rows_context_m: Remove unused CELL_PATH_SIZE state
  column_translation: add is_counter
2018-06-15 17:33:40 +03:00
Takuya ASADA
3f8719d67e dist/common/scripts/scylla_coredump_setup: fix typo
Correct function name is "is_debian_variant", not "is_debian_variants"

Fixed #3507

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180612155353.28229-1-syuu@scylladb.com>
2018-06-15 12:11:52 +01:00
Glauber Costa
e1246a3a3a sstable_test: write to temporary directory
Currently the SSTable test is failing (at least for me and Raphael),
complaining about the file it tries to write already existing. We have
helpers now to generate temporary directories, so we should use it.

The test passes after that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180614210036.16662-1-glauber@scylladb.com>
2018-06-15 11:00:08 +02:00
Piotr Sarna
d7eb6e6c7f tests: fix a typo in idl_test.cc
Fixes #3520
Message-Id: <831ead669a30d1b136d9ae50c4a1ac7057cf3340.1529047397.git.sarna@scylladb.com>
2018-06-15 09:56:45 +01:00
Piotr Jastrzebski
346e559c1b sstable_3_x_test: add test for counters
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:11:09 +02:00
Piotr Jastrzebski
2942f6eecc data_consume_rows_context_m: support reading counters
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:11:09 +02:00
Piotr Jastrzebski
785e14dfb9 Add consumer_m::consume_counter_column
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:11:09 +02:00
Piotr Jastrzebski
6f559445d0 Extract make_counter_cell
It will be used by both consumers.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:11:09 +02:00
Piotr Jastrzebski
88b66189b7 row.hh & mp_row_consumer.hh: Add required includes
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:11:09 +02:00
Piotr Jastrzebski
369e4a4987 Use serialization_header::adjust in read_statistics
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:11:09 +02:00
Piotr Jastrzebski
a3683d6e0f sstables 3: add serialization_header::adjust
In SSTables 3, min timestamp and min deletion time in serialization
header are not stored normally but instead the difference between
their value and the cassandra "epoch" is stored.
This is supposed to make SSTables smaller. As a consequence, we have
to add the "epoch" after reading the values to obtain the actual
values of min timestamp and min deletion time.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:10:48 +02:00
Tomasz Grabiec
78274276f5 row_cache: Use the memtable cleaner to create memtable snapshot during update
Memtable entries should be cleaned using memtable cleaner, which
unlike the cache' cleaner is not associated with the cache
tracker. It's an error to clean a snapshot using tracker which doesn't
own the entries. This will corrupt cache tracker's row counter.

Fixes failure of test_exception_safety_of_update_from_memtable from
row_cache.cc in debug mode and with allocation failure injection
enabled.

Introduce in "cache: Defer during partition merging"
(70c72773be).
Message-Id: <1528988256-20578-1-git-send-email-tgrabiec@scylladb.com>
2018-06-14 18:03:02 +03:00
Piotr Sarna
6b3a97e34a hints: fix max_shard_disk_space_size initialization
Previously max_shard_disk_space_size was unconditionally initialized
with the capacity of hints_directory. But, it's likely that
hints_directory doesn't exist at all if hinted handoff is not enabled,
which results in Scylla failing to boot.
So, max_shard_disk_space_size is now initialized with the capacity
of hints_for_views directory, which is always present.
This commit also moves max_shard_disk_space_size to the .cc file
where it belongs - resource_manager.cc.

Tests: unit (release)

Message-Id: <9f7b86b6452af328c05c5c6c55bfad3382e12445.1528977363.git.sarna@scylladb.com>
2018-06-14 14:24:01 +01:00
Duarte Nunes
5a8b8afe19 Merge "Add support for datetime functions" from Piotr
"
This series adds the following datetime functions to CQL:
 - currentTimestamp
 - currentDate
 - currentTime
 - currentTimeUUID
 - timeUUIDToDate
 - timestampToDate
 - timeUUIDToTimestamp
 - dateToTimestamp
 - timeUUIDToUnixTimestamp
 - timestampToUnixTimestamp
 - dateToUnixTimestamp

It also comes with datetime conversions test added to cql_query_test.

Note: issue #2949 also mentioned queries like:
 $ SELECT * FROM myTable WHERE date >= currentDate() - 2d;
but it's a broader topic of supporting arithmetic operations in general,
so it's moved to #3499.

Tests: unit (release)
"

* 'support_datetime_functions_3' of https://github.com/psarna/scylla:
  tests: add datetime conversions to cql_query_tests
  cql3: add time conversion functions
  cql3: add current* time functions
  types: add time_native_type
2018-06-14 12:31:39 +01:00
Piotr Sarna
5900e7f55f tests: add datetime conversions to cql_query_tests
Test case related to datetime converting functions
is added to cql_query_tests suite.
2018-06-14 11:49:11 +02:00
Piotr Sarna
695015a27e cql3: add time conversion functions
Following functions are added:
 - timeuuidtodate
 - timestamptodate
 - timeuuidtotimestamp
 - datetotimestamp
 - timeuuidtounixtimestamp
 - timestamptounixtimestamp
 - datetounixtimestamp

Fixes #2949
2018-06-14 11:49:11 +02:00
Piotr Sarna
087998b768 cql3: add current* time functions
Following date/time-related functions are added:
 - currentTimestamp
 - currentDate
 - currentTime
 - currentTimeUUID
2018-06-14 11:49:08 +02:00
Piotr Sarna
90d323a522 types: add time_native_type
CQL3's time_type didn't have any suitable native type,
so time_native_type is introduced to serve that purpose.
2018-06-14 11:11:41 +02:00
Takuya ASADA
9971576ecb dist: drop collectd support from package
Since scyllatop no longer needs collectd, now we are able to drop collectd.

resolves #3490

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1528961612-8528-1-git-send-email-syuu@scylladb.com>
2018-06-14 10:40:23 +03:00
Takuya ASADA
fcc1a9f6bb dist/redhat: Disables ambient capabilities when systemd/kernel doesn't support it
CentOS 7.4 does support to use ambient capabilities on systemd unit
file, but on some other RHEL7 compatible enviroment doesn't, it causes
Scylla startup failure.

To avoid the issue, move AmbientCapabilities line to
/etc/systemd/system/scylla.server.service.d/, install .conf only when
both systemd and kernel supported the feature.

Fixes #3486

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180613232327.7839-1-syuu@scylladb.com>
2018-06-14 10:32:56 +03:00
Avi Kivity
aeffbb6732 database: stop using incremental selectors
There is a bug in incremental_selector for partitioned_sstable_set, so
until it is found, stop using it.

This degrades scan performance of Leveled Compaction Strategy tables.

Fixes #3513. (as a workaround)
Introduced: 2.1
Message-Id: <20180613131547.19084-1-avi@scylladb.com>
2018-06-13 17:57:57 +02:00
Paweł Dziepak
d5982569bc Merge "Fix fragmented serialization" from Piotr
"
After issue 3501 it turned out that IDL generates incorrect
serialization code for fragmented buffers. This series addresses
the problem by:
 * providing serialization code for FragmentRange
 * changing IDL generation rules for fragmented buffers, so they
   expect a lower layer to iterate over fragments
 * adding a test to cql_query_test suite that covers #3501
 * adding a test to idl_tests suite that covers fragmented serialization
"

* 'fix_fragmented_serialization_3' of https://github.com/psarna/scylla:
  tests: add fragmented serialization test to idl_tests
  tests: add long text value test
  idl: remove for_each from fragmented serialization
  serializer: add FragmentRange serialization
2018-06-13 14:11:16 +01:00
Gleb Natapov
98b7f6148b fix regression in perf_row_cache_update test
logalloc should be initialized explicitly by every test that uses it
now.

Message-Id: <20180613093657.GY11809@scylladb.com>
2018-06-13 15:21:20 +03:00
Avi Kivity
29976600b4 Update scylla-ami submodule
* dist/ami/files/scylla-ami 1f5329f...36e8511 (1):
  > don't try to add busy devices to the RAID.
2018-06-13 15:19:57 +03:00
Piotr Sarna
551e8f5d8c tests: add fragmented serialization test to idl_tests
IDL tests now has an additional test that checks whether serializing
and deserializing of fragmented buffers is working properly.

References #3501
2018-06-13 13:54:12 +02:00
Piotr Sarna
cdd87af408 tests: add long text value test
Test adding a long (>8192) text/varchar value is added to cql suite.

References #3501
2018-06-13 13:54:12 +02:00
Piotr Sarna
450e014558 idl: remove for_each from fragmented serialization
Previously fragmented buffers of bytes were serialized
with a for_each loop. Since serializing bytes involves writing
size first and then data, only first fragment (and its size)
would be taken into account.
This commit changes fragmented code generation so it expects
that serialized range has a serialize(output, T) specification
and expects it to iterate over fragments on its own (just like
serializer for basic_value_view does).

Fixes #3501
2018-06-13 13:54:09 +02:00
Piotr Sarna
e525a0d51b serializer: add FragmentRange serialization
Serialization for FragmentRange classes is added to serialization
suite. It first serializes total length to a 32bit field and then
writes each fragment to output.

References #3501
2018-06-13 13:44:08 +02:00
Piotr Jastrzebski
42d2a162dd data_consume_rows_context_m: add is_column_counter
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-13 09:27:58 +02:00
Piotr Jastrzebski
d4d3e6f8eb data_consume_rows_context_m: Remove unused CELL_PATH_SIZE state
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-13 09:27:58 +02:00
Piotr Jastrzebski
ca7ede7eaf column_translation: add is_counter
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-13 09:27:58 +02:00
Vlad Zolotarov
0004c29aba scylla_io_setup: properly define the disk_properties YAML hierarchy
disk_properties map should be an entry in the 'disk' list hierarchy.
Currently this list is going to containe a single element.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 19:14:22 -04:00
Vlad Zolotarov
038b2f3be2 scylla_io_setup: fix a typo: s/write_bandwdith/write_bandwidth/
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 18:54:34 -04:00
Vlad Zolotarov
26277e5973 scylla_io_setup: hardcode the "mountpoint" YAML node to "/var/lib/scylla" for AMIs
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 15:51:10 -04:00
Vlad Zolotarov
77463ddc3b scylla_io_setup: print the io_properties.yaml file name and not its handle info
In order to get a file name from the given file() handle one should use
a file_handle.name property.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 15:25:55 -04:00
Vlad Zolotarov
aa3d9c38b5 scylla_lib.sh: tolerate perftune.py errors
When we check the currently configured tuning mode perftune.py is allowed
to return an error. get_tune_mode() has to be able to tolerate them.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 14:21:26 -04:00
Vlad Zolotarov
818b5b75ba tracing: store the prepared statements parameters values
Store the prepared statement positional parameters values in the
corresponding system_traces.sessions entry in the 'parameters' column
(which has a map<text,text> type).

Parameters are stored as a pair of "param[X]" : "value", where X is
the index of the parameter starting from 0 and the "value" is the first
64 characters of the parameter's value string representation.

If parameters were given with their names attached (see the description
on bit 0x40 of QUERY flags in the CQL binary protocol specification) then
parameters are going to be stored in the "param[X](<bound variable name>)" : "value"
form.

If the value's string representation is longer than 64 characters then the "value" will
contain only first 64 characters of it and will have the "..." at
the end.

For a BATCH of prepared statements the parameter "name" will have a form of
param[Y][X] where Y is the index of the corresponding prepared statement
in the BATCH and X is the index of the parameter. Both X and Y start from
0.

Note:
Had to switch to boost::range::find() in sstables::big_sstable_set in order to
address the "ambiguous overload" compilation error.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 10:57:05 -04:00
Vlad Zolotarov
a1da285f9e tracing: store queries statements for BATCH
Similarly to the regular QUERY of EXECUTE we want to see the actual
queries statement that were part of the BATCH.

If a traced query has only a single statement to execute then its statement will be stored in a form 'query':'<statement>'.

If there are two or more queries (BATCH) then statements of each query in the BATCH will be stored in a form 'query[X]':'<statement>', where X is the index of the query in the
BATCH starting from 0.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 10:57:05 -04:00
Vlad Zolotarov
c0e51c4521 tracing::trace_state: hide the internals of params_values
Hide it inside the trace_state.cc in order to avoid future circular
dependencies with other .hh files.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 10:57:05 -04:00
Vlad Zolotarov
a469567605 cql3::query_options: add get_names() method
This method returns names of named prepared statement parameters.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 10:57:05 -04:00
Avi Kivity
d91891b6f0 Restore scylla-ami submodule
Commit b38ced0fcd ("Configure logalloc
memory size during initialization") updated the scylla-ami submodule
inadvertently.
2018-06-12 10:56:59 +03:00
Avi Kivity
74a3ab36e3 Restore seastar submodule
Commit b38ced0fcd ("Configure logalloc
memory size during initialization") updated the seastar submodule
inadvertently.
2018-06-12 10:37:35 +03:00
Avi Kivity
24a9a3c679 Merge "Push memory limits configuration up to main" from Gleb
"
May components limit its internal memory pools/caches/queues depending
on amount of memory present in a system. Each of them uses seastar
memory interface to get the information about memory availability
which makes it harder to 1: test the components with various memory
configurations and 2: to see which components reserve memory and how
much each one reserves.

The patch changes all the components that rely on memory size to get this
information through configuration parameter during creation instead of
checking it directly with seastar, so only main interacts with seastar
allocator.
"

* 'gleb/memory-config-v2' of github.com:scylladb/seastar-dev:
  Provide available memory size to compaction_manager object during creation
  Configure authorized_prepared_statment_cache memory limit during object creation
  Configure logalloc memory size during initialization
  Provide cql max request limit to cql server object during creation
  Configure query result memory limiter size limit during object creation
  Configure querier_cache size limit during object creation
  Provide available memory size to messaging_service object during creation
  Provide available memory size to hinted handoff resource manager during creation
  Provide available memory size to storage_proxy object during creation
  Provide available memory size to commitlog during creation
  Provide available memory size to database object during creation
  Configure prepared_statements_cache memory limit from outside
2018-06-11 15:34:14 +03:00
Gleb Natapov
59da525e0d Provide available memory size to compaction_manager object during creation 2018-06-11 15:34:14 +03:00
Gleb Natapov
da20d86423 Configure authorized_prepared_statment_cache memory limit during object creation 2018-06-11 15:34:14 +03:00
Gleb Natapov
b38ced0fcd Configure logalloc memory size during initialization 2018-06-11 15:34:14 +03:00
Gleb Natapov
894673ac14 Provide cql max request limit to cql server object during creation 2018-06-11 15:34:14 +03:00
Gleb Natapov
7832266cd7 Configure query result memory limiter size limit during object creation 2018-06-11 15:34:13 +03:00
Gleb Natapov
04727acee9 Configure querier_cache size limit during object creation 2018-06-11 15:34:13 +03:00
Gleb Natapov
646e400918 Provide available memory size to messaging_service object during creation 2018-06-11 15:34:13 +03:00
Gleb Natapov
cdf1289b43 Provide available memory size to hinted handoff resource manager during creation 2018-06-11 15:34:13 +03:00
Gleb Natapov
ac88935baa Provide available memory size to storage_proxy object during creation 2018-06-11 15:34:13 +03:00
Gleb Natapov
cc47f6c69d Provide available memory size to commitlog during creation 2018-06-11 15:34:13 +03:00
Gleb Natapov
f41575a156 Provide available memory size to database object during creation 2018-06-11 15:34:13 +03:00
Gleb Natapov
461f20e7b1 Configure prepared_statements_cache memory limit from outside
Pass desirable memory limit during construction instead of querying
memory size explicitly.
2018-06-11 15:34:13 +03:00
Tomasz Grabiec
a91974af7a tests: row_cache: Reduce concurrency limit to avoid bad_alloc
The test uses random mutations. We saw it failing with bad_alloc from time to time.
Reduce concurrency to reduce memory footprint.

Message-Id: <20180611090304.16681-1-tgrabiec@scylladb.com>
2018-06-11 10:06:56 +01:00
Tomasz Grabiec
cd7c7ac40f mutation_partition: Make do_compact() respect range tombstone merging rules
It compares only timestamps, but it should use intrinsic ordering of
the tombstone, which takes deletio ntime into consideration as well.
If we have two range tombstones with the same timestamp but different
deletion time (odd case, but still), then the one with the higher
deletion time should win. That's what all other parts of the system
use to resolve merges, in particular range_tombstone_list and
compact_mutation_state (the fragment stream compactor).

Not respecting this ordering violates the following equality:

  do_compact(do_compact(m1) + m2) == do_compact(m1 + m2)

which may results in some clustered rows being missing in the
right-hand side, but not in the left-hand side, due to differences in
range tombstones.

This impacts only tests currently.
Message-Id: <1528705602-7218-1-git-send-email-tgrabiec@scylladb.com>
2018-06-11 10:05:52 +01:00
Nadav Har'El
41472e2618 legacy_schema_migrator: add comment
When I came across db/legacy_schema_migrator.cc, I had no idea what it
does and though I had obvious guesses (it somehow migrates old schemas,
right?) I didn't know what it really does. So after I figured this out,
I wrote this comment so the next person doesn't need to guess.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180605120225.25173-1-nyh@scylladb.com>
2018-06-10 19:39:06 +03:00
Avi Kivity
4b81feb344 Merge "switch to systemd-coredump on Debian 9" from Takuya
* 'systemd-coredump-debian9' of https://github.com/syuu1228/scylla:
  dist/debian: fix pystache package name on Debian / Ubuntu
  dist/debian: switch to systemd-coredump on Debian 9
  dist/debian: rename 99-scylla.conf to 99-scylla-coredump.conf
2018-06-10 19:38:25 +03:00
Asias He
059ec89ad1 gms: Add is_normal helper to endpoint_state
It is faster than gossiper::is_normal because it avoids to do search in
the std::map<application_state, versioned_value>. It is useful for the
code in the fast path which needs to query if a node is in NORMAL
status.

Fixes #3500

Message-Id: <42db91fa4108f9f4fcf94fed3ec403ccf35d15e9.1528354644.git.asias@scylladb.com>
2018-06-10 19:21:03 +03:00
Vladimir Krivopalov
9c9c85cde5 tests: Add test writing UDT data to SSTables 3.x.
Original data and index files are generated using Cassandra 3.11.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <d0ea8146d6f2a76a5f661271500b35390962a9d4.1528420647.git.vladimir@scylladb.com>
2018-06-10 19:20:42 +03:00
Avi Kivity
74469ecc09 Merge "Support reading collections" from Piotr
"
Implement and test support for reading collections in SSTables 3.

Tests: unit {release}
"

* 'haaawk/sstables3/read-collections-v1' of ssh://github.com/scylladb/seastar-dev:
  sstables 3: Add tests for reading collections
  flat_mutation_reader_assertions: add more flexible asserts
  data_consume_rows_context_m: add support for collections
  mp_row_consumer_m: Add support for collections
  data_consume_rows_context_m: introduce cell_path
  Use column_translation::*_is_collection in reading
  column_translation: add *_column_is_collection()
  column_flags_m: add HAS_COMPLEX_DELETION
  Use read_unsigned_vint_length_bytes for COLUMN_VALUE
  Use read_unsigned_vint_length_bytes for CK_BLOCKS
  Implement read_unsigned_vint_length_bytes
2018-06-10 17:10:52 +03:00
Avi Kivity
2582f53b44 Merge "database and API: Add column_family::get_sstables_by_key" from Amnon
"
This is series is for nodetool getsstables.

This patch is based on:
8daaf9833a

With some minor adjustments because of the code change in sstables.

The idea is to allow searching for all the sstables that contains a
given key.

After this patch if there is a table t1 in keyspace k1 and it has a key
called aa.

curl -X GET "http://localhost:10000/column_family/sstables/by_key/k1%3At1?key=aa"

Will return the list of sstables file names that contains that key.
"

* 'amnon/sstable_for_key_v4' of github.com:scylladb/seastar-dev:
  Add the API implementation to get_sstables_by_key
  api: column_family.json make the get_sstables_for_key doc clearer
  column_family: Add the get_sstables_by_partition_key method
  sstable test: add has_partition_key test
  sstable: Add has_partition_key method
  keys_test: add a test for nodetool_style string
  keys: Add from_nodetool_style_string factory method
2018-06-10 16:53:56 +03:00
Amnon Heiman
8fbc6a22fb Add the API implementation to get_sstables_by_key
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-06-10 16:13:01 +03:00
Amnon Heiman
cc5601d000 api: column_family.json make the get_sstables_for_key doc clearer
This patch makes it clearer that the key that get_sstables_for_key
refers to, is a partition key.
2018-06-10 16:13:01 +03:00
Amnon Heiman
acb0a738eb column_family: Add the get_sstables_by_partition_key method
The get_sstables_by_partition_key method used by the API to return a set of
sstables names that holds a given partition key.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-06-10 16:13:01 +03:00
Amnon Heiman
b8e5029991 sstable test: add has_partition_key test
This patch adds a test to the has_partition_key method, it creates an
sstable with a partition key and then used that key in the
has_partition_key method to verify that it is there.

It creates a different key and use that to verify that a non exist key
return false.
2018-06-10 16:12:12 +03:00
Avi Kivity
ba5d8717c8 tests: disable reactor stall notifier
In case it is interacting badly with ASAN and causing spurious test
failures.
2018-06-10 15:55:00 +03:00
Avi Kivity
95b00aae33 Revert scylla-ami update in "scylla_setup: fix conditional statement of silent mode"
This reverts part of commit 364c2551c8. I mistakenly
changed the scylla-ami submodule in addition to applying the patch. The revert
keeps the intended part of the patch and undoes the scylla-ami change.
2018-06-10 14:53:40 +03:00
Asias He
d23dafa7ac dht: Remove column_families parameter in add_rx_ranges and add_tx_ranges
In 4b1034b (storage_service: Remove the stream_hints), we removed the
only user of the api with the column_families parameter.

std::vector column_families = { db::system_keyspace::HINTS };
streamer->add_tx_ranges(keyspace, std::move(ranges_per_endpoint),
column_families);

We can simplify the code range_streamer a bit by removing it.

Fixes #3476

Tests: dtest update_cluster_layout_tests.py
Message-Id: <c81d79c5e6dbc8dd78c1242837de892e39d6abd2.1528356342.git.asias@scylladb.com>
2018-06-10 14:53:40 +03:00
Piotr Jastrzebski
7d3abb0668 sstables 3: Add tests for reading collections
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 23:40:10 +02:00
Piotr Jastrzebski
176305c2f2 flat_mutation_reader_assertions: add more flexible asserts
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 23:39:51 +02:00
Piotr Jastrzebski
f9c62b8188 data_consume_rows_context_m: add support for collections
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 23:39:07 +02:00
Piotr Jastrzebski
fd89f42b09 mp_row_consumer_m: Add support for collections
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 23:35:12 +02:00
Piotr Jastrzebski
ffb6b9ed24 data_consume_rows_context_m: introduce cell_path
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 23:30:40 +02:00
Piotr Jastrzebski
5e1dd89d4d Use column_translation::*_is_collection in reading
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 22:50:23 +02:00
Piotr Jastrzebski
7bb25a2dd9 column_translation: add *_column_is_collection()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 22:48:43 +02:00
Piotr Jastrzebski
2b8ff15f9f column_flags_m: add HAS_COMPLEX_DELETION
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 22:47:19 +02:00
Avi Kivity
f9d66f88bb transport: advertise the shard serving a connection
It is useful for the client driver to know which shard is serving a
particular connection, so it can only send requests through that connection
which will be served by the same shard, eliminating a hop.

Support that by advertising a "SCYLLA_SHARD" option, with a value
corresponding to the shard number.

Acked-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180606203437.1198-1-avi@scylladb.com>
2018-06-07 10:43:16 +03:00
Avi Kivity
4a90eeb326 Update seastar submodule
* seastar 12cffef...e7275e4 (9):
  > tests: execution_stage_test: capture sg by value
  > Merge "Add in-path parameter suport to the code generation" from Amnon
  > Merge "Add scheduling_group inheritance to execution_stage" from Avi
  > tutorial: explain how to find origin of exception
  > tls: Ensure handshake always drains output before return/throw
  > build: cmake: correct stdc++fs library name once more
  > perftune.py: make sure config file existing before write
  > Update travis-ci integration
  > build: fix compilation issues on cmake. missing stdc++-fs
2018-06-06 19:07:16 +03:00
Avi Kivity
6f23403137 Merge "Virtualize IndexInfo system table" from Duarte
"
The IndexInfo table tracks the secondary indexes that have already
been populated. Since our secondary index implementation is backed by
materialized views, we can virtualize that table so queries are
actually answered by built_views.

Fixes #3483
"

* 'built-indexes-virtual-reader/v2' of github.com:duarten/scylla:
  tests/virtual_reader_test: Add test for built indexes virtual reader
  db/system_keysace: Add virtual reader for IndexInfo table
  db/system_keyspace: Explain that table_name is the keyspace in IndexInfo
  index/secondary_index_manager: Expose index_table_name()
  db/legacy_schema_migrator: Don't migrate indexes
2018-06-06 17:35:51 +03:00
Piotr Jastrzebski
f7a1d5a437 Use read_unsigned_vint_length_bytes for COLUMN_VALUE
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-06 15:54:17 +02:00
Piotr Jastrzebski
3b8b165053 Use read_unsigned_vint_length_bytes for CK_BLOCKS
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-06 15:44:53 +02:00
Piotr Jastrzebski
21a0e95a06 Implement read_unsigned_vint_length_bytes
It's a common operation that's used in multiple
places so it's best to have it implemented once.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-06 15:44:06 +02:00
Piotr Sarna
0818eb42ae cql3: remove additional IN relation check
Commit 80fc1b1408 introduced additional checks to ensure
that IN relation in WHERE clause can only occur on last restricted
column. This check is not present in current Cassandra code,
the restriction isn't mentioned anywhere in 'IN relation' documentation
and removing it fixes issue 2865.
Running cql_tests dtest suite doesn't show any regression after removing
this check.

Also at: https://github.com/psarna/scylla/tree/remove_additional_in_relation_check

Tests: dtest (cql_tests), unit (release)

Fixes #2865

Message-Id: <aa8c0b33618dd58cd153e83589ac016bc63f4343.1528288388.git.sarna@scylladb.com>
2018-06-06 16:01:54 +03:00
Tomasz Grabiec
9975135110 row_cache: Make sure reader makes forward progress after each fill_buffer()
If reader's buffer is small enough, or preemption happens often
enough, fill_buffer() may not make enough progress to advance
_lower_bound. If also iteartors are constantly invalidated across
fill_buffer() calls, the reader will not be able to make progress.

See row_cache_test.cc::test_reading_progress_with_small_buffer_and_invalidation()
for an examplary scenario.

Also reproduced in debug-mode row_cache_test.cc::test_concurrent_reads_and_eviction

Message-Id: <1528283957-16696-1-git-send-email-tgrabiec@scylladb.com>
2018-06-06 16:01:52 +03:00
Vlad Zolotarov
12e3e4fb2a service::client_state::has_access(): make readable_system_resources an std::unordered_set
There is not reason to use an std::set for it since we don't care about
the ordering - only about the existance of a particular entry.
Hash table will be more efficient for this use case.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1528220892-5784-2-git-send-email-vladz@scylladb.com>
2018-06-06 15:29:29 +03:00
Duarte Nunes
833d34e88a Merge 'Make rows in a secondary index ordered by token' from Piotr
"
As in #3423, ensuring token order on secondary index queries can be done
by adding an additional column to views that back secondary indexes.
This column is a first clustering column and contains token value,
computed on updates.
This series also updates tests and comments refering to issue 3423.

Tests: unit (release, debug)
"

* 'order_by_token_in_si_5' of https://github.com/psarna/scylla:
  cql3: update token order comments
  index, tests: add token column to secondary index schema
  view: add handling of a token column for secondary indexes
  view: add is_index method
2018-06-06 10:07:43 +01:00
Vlad Zolotarov
2dde372ae6 locator::ec2_multi_region_snitch: don't call for ec2_snitch::gossiper_starting()
ec2_snitch::gossiper_starting() calls for the base class (default) method
that sets _gossip_started to TRUE and thereby prevents to following
reconnectable_snitch_helper registration.

Fixes #3454

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1528208520-28046-1-git-send-email-vladz@scylladb.com>
2018-06-06 12:00:17 +03:00
Asias He
6496cdf0fb db: Get rid of the streaming memtable delayed flush
In 455d5a5 (streaming memtables: coalesce incoming writes), we
introduced the delayed flush to coalesce incoming streaming mutations
from different stream_plan.

However, most of the time there will be one stream plan at a time, the
next stream plan won't start until the previous one is finished. So, the
current coalescing does not really work.

The delayed flush adds 2s of dealy for each stream session. If we have lots
of table to stream, we will waste a lot of time.

We stream a keyspace in around 10 stream plans, i.e., 10% of ranges a
time. If we have 5000 tables, even if the tables are almost empty, the
delay will waste 5000 * 10 * 2 = 27 hours.

To stream a keyspace with 4 tables, each table has 1000 rows.

Before:

 [shard 0] stream_session - [Stream #944373d0-5d9c-11e8-9cdb-000000000000] Executing streaming plan for Bootstrap-ks-index-0 with peers={127.0.0.1}, master
 [shard 0] stream_session - [Stream #944373d0-5d9c-11e8-9cdb-000000000000] Streaming plan for Bootstrap-ks-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1030 KiB, 125.21 KiB/s
 [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks succeeded, took 8.233 seconds

After:

 [shard 0] stream_session - [Stream #e00bf6a0-5d99-11e8-a7b8-000000000000] Executing streaming plan for Bootstrap-ks-index-0 with peers={127.0.0.1}, master
 [shard 0] stream_session - [Stream #e00bf6a0-5d99-11e8-a7b8-000000000000] Streaming plan for Bootstrap-ks-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1030 KiB, 4772.32 KiB/s
 [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks succeeded, took 0.216 seconds

Fixes #3436

Message-Id: <cb2dde263782d2a2915ddfe678c74f9637ffd65b.1526979175.git.asias@scylladb.com>
2018-06-06 10:16:02 +03:00
Piotr Sarna
70ba8c8317 cql3: update token order comments
Comments about token order were outdated with token column patches
and they are now up to date.

Fixes #3423
2018-06-06 09:02:37 +02:00
Piotr Sarna
4a9bf7ed5b index, tests: add token column to secondary index schema
Additional token column is now present in every view schema
that backs a secondary index. This column is always a first part
of the clustering key, so it forces token order on queries.
Column's name is ideally idx_token, but can be postfixed
with a number to ensure its uniqueness.

It also updates tests to make them acknowledge the new token order.

Fixes #3423
2018-06-06 09:02:33 +02:00
Takuya ASADA
899f7641b6 dist/debian: fix pystache package name on Debian / Ubuntu
It's python-pystache, not python2-pystache.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2018-06-06 15:55:52 +09:00
Takuya ASADA
db9074707a dist/debian: switch to systemd-coredump on Debian 9
Debian 9 has newer systemd that supports systemd-coredump, so enable it.
2018-06-06 15:04:31 +09:00
Takuya ASADA
30386ed215 dist/debian: rename 99-scylla.conf to 99-scylla-coredump.conf
Since 99-scylla.conf is only used for setting coredump handler, rename
it to 99-scylla-coredump.conf.
2018-06-06 14:59:32 +09:00
Piotr Sarna
d5e7b5507b view: add handling of a token column for secondary indexes
In order to ensure token order on secondary index queries,
first clustering column for each view that backs a secondary index
is going to store a token computed from base's partition keys.
After this commit, if there exists a column that is not present
in base schema, it will be filled with computed token.
2018-06-05 18:59:25 +02:00
Tomasz Grabiec
f775fc2e4c mvcc: Fix partition_entry::open_version()
After 70c72773be it's possible that
open_version() is called with a phase which is smaller than the phase
of the latest version, because latest version belongs to the
in-progress cache update. In such case we must return the existing
non-latest snapshot and not create a new version on top of the
in-progress update. Not doing this violates several invariants, and
may lead to inconsistencies, including violation of write atomicity or
temporary loss of writes.

partition_entry::read() was already adjusted by the aforementioned
commit. Do a similar adjustement for open_version().

Fixes sporadic failures of row_cache_test.cc::test_concurrent_reads_and_eviction
Message-Id: <1528211847-22825-1-git-send-email-tgrabiec@scylladb.com>
2018-06-05 18:22:38 +03:00
Takuya ASADA
60844ae67b dist/common/scripts/scylla_coredump_setup: don't run sysctl on Ubuntu 18.04
Since 99-scylla.conf is not included on Ubuntu 18.04, skip running it.

Fixes #3494

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180605093619.9197-1-syuu@scylladb.com>
2018-06-05 12:47:46 +03:00
Takuya ASADA
222b8588ee dist/common/systemd/scylla-server.service.in: add local-fs.target as dependency
We mistakenly only added network-online.target is doens't promises to
wait /var/lib/scylla mount.
To do this we need local-fs.target.

Fixes #3441

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180521083349.8970-1-syuu@scylladb.com>
2018-06-05 12:26:21 +03:00
Piotr Sarna
06eee0f525 view: add is_index method
is_index method returns true if view that owns it
is backing a secondary index.
2018-06-05 11:10:24 +02:00
Piotr Sarna
6130a00597 dist: add scylla/hints directory to scripts
/var/lib/scylla/hints directory was missing from dist-specific
scripts, which may cause package installations to fail.
Package building scripts and descriptions are updated/

Fixes #3495

Message-Id: <0f5596cb49500416820ece023b7f76a4e2427799.1528184949.git.sarna@scylladb.com>
2018-06-05 11:33:29 +03:00
Avi Kivity
4aaf7bbc1d Merge "Add test for compression" from Piotr
"
It turns out that compression just works for SSTables 3.x.
Thanks to the previous work done on the write path.
This series cleans up tests a bit and introduces test for compression
on the read path.
"

* 'haaawk/sstables3/read-compression-v1' of ssh://github.com/scylladb/seastar-dev:
  Add test for compression in sstables 3.x
  Extract test_partition_key_with_values_of_different_types_read
  sstable_3_x_test: use SEASTAR_THREAD_TEST_CASE
  Drop UNCOMPRESSD_ when code will be used for compressed too
2018-06-04 20:33:50 +03:00
Piotr Jastrzebski
25a7f03f7f Add test for compression in sstables 3.x
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-04 18:41:10 +02:00
Piotr Jastrzebski
be9c7391aa Extract test_partition_key_with_values_of_different_types_read
It will be used also for testing compression.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-04 18:41:10 +02:00
Piotr Jastrzebski
1f324b7fc8 sstable_3_x_test: use SEASTAR_THREAD_TEST_CASE
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-04 18:40:52 +02:00
Piotr Jastrzebski
3e3ccdb323 Drop UNCOMPRESSD_ when code will be used for compressed too
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-04 18:29:02 +02:00
Avi Kivity
6d6c355dc0 Merge "augment system.local with sharding information" from Glauber
"
This patch adds nr_shards, msb_ignore, and the actual sharding algorithm to the
system.local table. Drivers and other tools can then make use of this
information to talk to scylla in an optimal way
"

* 'system_tables-v3' of github.com:glommer/scylla:
  system_keyspace: add sharding information to local table
  partitioner: export the name of the algorithm used to do intra-node sharding
2018-06-04 18:50:28 +03:00
Glauber Costa
bdce561ada system_keyspace: add sharding information to local table
We would like the clients to be able to route work directly to the right
shards. To do that, they need to know the sharding algorithm and its
parameters.

The algorithm can be copied into the client, but the parameters need to
be exported somewhere. Let's use the local table for that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
---
v2: force msb to zero on non-murmur
2018-06-04 11:25:58 -04:00
Glauber Costa
250d9332dc partitioner: export the name of the algorithm used to do intra-node sharding
We will export this on system tables. To avoid hard-coding it in the system
table level, keep it at least in the dht layer where it belongs.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-04 11:25:58 -04:00
Takuya ASADA
ad4ca1e166 dist: simplified build script templates
Currently, build_deb.sh looks very complicated because each of distribution
requires different parameter, and we are applying them by sed command one-by-one.

This patch will replace them by Mustache, it's simple and easy syntax
template language.
Both .rpm distributions and .deb distributions have pystache (a Python
implimentation of Mustache), we will use it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180604104026.22765-1-syuu@scylladb.com>
2018-06-04 14:38:52 +03:00
Paweł Dziepak
24764712b6 sstable: fix capture by reference of stack variable in continuation
Message-Id: <20180604102542.21799-1-pdziepak@scylladb.com>
2018-06-04 14:35:49 +03:00
Duarte Nunes
dfa779ebe7 Merge 'Separate hinted handoff manager for materialized views' from Piotr
"
This series introduces a separate hinted handoff manager for materialized views.

Steps:
 * decouple resource limits from hinted handoff, so multiple instances can share space
   and throughput limits in order to avoid internal fragmentation for every instance's
   reservations
 * add a subdirectory to data/, responsible for storing materialized view hints
 * decouple registering global metrics from hinted handoff constructor, now that there
   can be more than one instance - otherwise 'registering metrics twice' errors are going to occur
 * add a hints_for_views_manager to storage proxy and route failed view updates to use it
   instead of the original hints_manager
 * restore previous semantics for enabling/disabling hinted handoff - regular hinted handoff
   can be disabled or enabled just for specific datacenters without influencing materialized
   views flow
"

* 'separate_hh_for_mv_4' of https://github.com/psarna/scylla:
  storage_proxy: restore optional hinted handoff
  storage_proxy: add hints manager for views
  hints: decouple hints manager metrics from constructor
  db, config: add view_pending_updates directory
  hints: move space_watchdog to resource manager
  hints: move send limiter to resource manager
  hints: move constants to resource_manager
2018-06-04 12:03:59 +01:00
Duarte Nunes
01676a2cda tests/virtual_reader_test: Add test for built indexes virtual reader
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-06-04 11:31:29 +01:00
Duarte Nunes
3e39985c7a db/system_keysace: Add virtual reader for IndexInfo table
The IndexInfo table tracks the secondary indexes that have already
been populated. Since our secondary index implementation is backed by
materialized views, we can virtualize that table so queries are
actually answered by built_views.

Fixes #3483

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-06-04 11:14:17 +01:00
Duarte Nunes
65c4205334 db/system_keyspace: Explain that table_name is the keyspace in IndexInfo
This patch adds the same comment that exists in Apache Cassandra,
explaining that the table_name column in the IndexInfo system table
actually refers to the keyspace name. Don't be fooled.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-06-04 11:14:17 +01:00
Duarte Nunes
bc4db67524 index/secondary_index_manager: Expose index_table_name()
Expose secondary_index::index_table_name() so knowledge on how to
built an index name can remain centralized.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-06-04 11:14:17 +01:00
Duarte Nunes
7187963bda db/legacy_schema_migrator: Don't migrate indexes
Previous versions contained no indexes, and Apache Cassandra indexes
cannot be migrated to Scylla.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-06-04 11:14:17 +01:00
Vlad Zolotarov
e759803f48 cql3::authorized_prepared_statements_cache: properly set the expiration timeout
Because authorized_prepared_statements_cache caches the information that comes from
the permissions cache and from the prepared statements cache it should has the entries
expiration period set to the minimum of expiration periods of these caches.

The same goes to the entry refresh period but since prepared statements cache does have a
refresh period authorized_prepared_statements_cache's entries refresh period
is simply equal to the one of the permissions cache.

Fixes #3473

Tests: dtest{release} auth_test.py

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1527789716-6206-1-git-send-email-vladz@scylladb.com>
2018-06-04 10:34:54 +02:00
Piotr Jastrzebski
0b72594c1f data_consume_rows_context_m: Use find_first and find_next
Those methods of boost::dynamic_bitset allow much more
efficient implementation of skip_absent_columns and
move_to_next_column.

Also fix some indentation and variable naming.

Test: unit {release}

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <8a4dea51060c5a02bb774eac43e9eb67d316049a.1528100153.git.piotr@scylladb.com>
2018-06-04 11:18:03 +03:00
Piotr Sarna
f12fdcffdb storage_proxy: restore optional hinted handoff
Since hinted handoff for materialized views is now a separate entity,
regular hinted handoff can go back to being optional.
2018-06-04 09:46:06 +02:00
Piotr Sarna
a6aae369da storage_proxy: add hints manager for views
This commit adds a separate hints manager that serves
only failed materialized view updates.
2018-06-04 09:46:06 +02:00
Piotr Sarna
204bc17bd7 hints: decouple hints manager metrics from constructor
Now that more than one instance of hints manager can be present
at the same time, registering metrics is moved out of the constructor
to prevent 'registering metrics twice' errors.
2018-06-04 09:46:06 +02:00
Piotr Sarna
a791dce0ae db, config: add view_pending_updates directory
Hints for materialized view updates need to be kept somewhere,
because their dedicated hints manager has to have a root directory.
view_pending_updates directory resides in /data and is used
for that purpose.
2018-06-04 09:46:06 +02:00
Piotr Sarna
f345efc79a hints: move space_watchdog to resource manager
Space watchdog is decoupled from hints manager and moved to resource
manager, so it can be shared among different hints manager instances.
2018-06-04 09:46:01 +02:00
Piotr Sarna
ef40f7e628 hints: move send limiter to resource manager
Send limiting semaphore is moved from hints manager to resource manager.
In consequence, hints manager now keeps a reference to its resource
manager.
2018-06-04 09:35:58 +02:00
Piotr Sarna
2315937854 hints: move constants to resource_manager
Constants related to managing resources are moved to newly created
resource_manager class. Later, this class will be used to manage
(potentially shared) resources of hints managers.
2018-06-04 09:35:58 +02:00
Avi Kivity
9b21fbc055 Merge "LCS: enable compaction controller" from Glauber
"

In preparation, we change LCS so that it tries harder to push data
to the last level, where the backlog is supposed to be zero.

The backlog is defined as:

backlog_of_stcs_in_l0 + Sum(L in level) sizeof(L) * (max_level - L) * fan_out

where:
 * the fan_out is the amount of SSTables we usually compact with the
   next level (usually 10).
 * max_levels is the number of levels currently populated
 * sizeof(L) is the total amount of data in a particular level.

Tests: unit (release)
"

* 'lcs-backlog-v2' of github.com:glommer/scylla:
  LCS: implement backlog tracker for compaction controller
  LCS: don't construct property in the body of constructor
  LCS: try harder to move SSTables to highest levels.
  leveled manifest: turn 10 into a constant
  backlog: add level to write progress monitor
2018-06-04 10:29:56 +03:00
Amos Kong
364c2551c8 scylla_setup: fix conditional statement of silent mode
Commit 300af65555 introdued a problem in
conditional statement, script will always abort in silent mode, it doesn't
care about the return value.

Fixes #3485

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <1c12ab04651352964a176368f8ee28f19ae43c68.1528077114.git.amos@scylladb.com>
2018-06-04 10:14:06 +03:00
Glauber Costa
6317bd45d7 LCS: implement backlog tracker for compaction controller
This is the last missing tracker among the major strategies. After
this, only DTCS is left.

To calculate the backlog, we will define the point of zero-backlog
as having all data in the last level. The backlog is then:

Sum(L in levels) sizeof(L) * (max_levels - L) * fan_out,

where:
 * the fan_out is the amount of SSTables we usually compact with the
   next level (usually 10).
 * max_levels is the number of levels currently populated
 * sizeof(L) is the total amount of data in a particular level.

Care is taken for the backlog not to jump when a new level has been just
recently created.

Aside from that, SSTables that accumulate in L0 can be subject to STCS.
We will then add a STCS backlog in those SSTables to represent that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-03 18:14:09 -04:00
Glauber Costa
04546df55c LCS: don't construct property in the body of constructor
Right now we are constructing the _max_sstable_size_in_mb property in
the body of the constructor, which it makes it hard for us to use from
other properties.

We are doing that because we'd like to test for bounds of that value. So
a cleaner way is to have a helper function for that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-03 18:14:09 -04:00
Glauber Costa
28382cb25c LCS: try harder to move SSTables to highest levels.
Our current implementation of LCS can end up with situations in which
just a bit of data is in the highest levels, with the majority in the
lowest levels. That happens because we will only promote things to
highest levels if the amount of data in the current level is higher than
the maximum.

This is a pre-existing problem in itself, but became even clearer when
we started trying to define what is the backlog for LCS.

We have discussed ways to fix this it by redefining the criteria on when
to move data to the next levels. That would require us to change the way
things are today considerably, allowing parallel compactions, etc. There
is significant risk that we'll increase write amplication and we would
need to carefully validate that.

For now I will propose a simpler change, that essentially solves the
"inverted pyramid" problem of current LCS without major disruption:
keep selecting compaction candidates with the same criteria that we do
today, we should help make sure we are not compacting high levels for no
reason; but if there is nothing to do, use the idle time to push data to
higher levels. As an added benefit, old data that is in the higher level
can also be compacted away faster.

With this patch we see that in an idle, post-load system all data is
eventually pushed to the last level. Systems under constant writes keep
behaving the same way they did before.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-03 18:12:19 -04:00
Glauber Costa
e64b471e3d leveled manifest: turn 10 into a constant
We increase levels in powers of 10 but that is a parameter
of the algorithm. At least make it into a constant so that we can
reuse it somewhere else.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-03 16:55:58 -04:00
Avi Kivity
6f2d3b7f9f Merge "Fix previous row size calculation for SSTables 3.x" from Vladimir
"
SSTables 3.x format ('m') stores the size of previous row or RT marker
inside each row/marker. That potentially allows to traverse rows/markers
in reverse order.

The previous code calculating those sizes appeared to produce invalid
values for all rows except the first one. The problem with detecting
this bug was that neither Cassandra itself nor the sstabledump tool use
those values, they are simply rejected on reading.
From UnfilteredSerializer.deserializeRowBody() method,
https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java#L562
:

            if (header.isForSSTable())
            {
                in.readUnsignedVInt(); // Skip row size
                in.readUnsignedVInt(); // previous unfiltered size
            }

So while the previous test files were technically correct in that they
contained valid data readable by Cassandra/sstabledump, they didn't
follow the format specification.

This patchset fixes the code to produce correct values and replaces
incorrect data files with correct ones. The newly generated data files
have been validated to be identical to files generated with Cassandra
using same data and timestamps as unit tests.

Tests: Unit {release}
"

* 'projects/sstables-30/fix-prev-row_size/v1' of https://github.com/argenet/scylla:
  tests: Fix test files to use correct previous row sizes.
  sstables: Fix calculation of previous row size for SSTables 3.x
  sstables: Factor out code building promoted index blocks into separate helpers.
2018-06-03 11:38:22 +03:00
Avi Kivity
a43b3e22fc Merge "Fix clustering blocks serialization for SSTables 3.x" from Vladimir
"
This patchset contains two fixes to the clustering key prefixes
serialization logic for SSTables 3.x.

First, it fixes a vexing typo: a bitwise-and (&) has been used instead
of a remainder operator (%) for truncating the shift value.
This did not show up in existing tests because they all had non-empty
clustering columns values.
Added tests to cover empty clustering columns values.

Second, it fixes the logic of serialization to write values up to the
prefix length, not the length of the clustering key as defined by
schema. This matches the way it is done by the Origin.

There is, however, a special case where the prefix size is smaller than
that of a clustering key but we still need to serialize up to the full
size. This is the case when a compact table is being used and some
rows in it are added using incomplete clustering keys (containing null
for trailing columns).
In Cassandra, these prefixes still have a full length and missing
columns are just set to 'null'. In our code those prefixes have their
real length, but since we need to serialize beyond it, we pass a flag to
indicate this.
"

* 'projects/sstables-30/fix-clustering-blocks/v1' of https://github.com/argenet/scylla:
  tests: Add test covering compact table with non-full clustering key.
  sstables: Improve clustering blocks writing, use logical clustering prefix size.
  tests: Add test covering large clustering keys (>32 columns) for SSTables 3.x
  tests: Add unit test covering empty values in clustering key.
  sstables: Fix typo in clustering blocks write helper.
2018-06-03 11:35:49 +03:00
Avi Kivity
1071e481ed Merge "Implement support for missing columns in SSTable 3.0" from Piotr
"
Add handling for missing columns and tests for it.

There are 3 cases:
1. Number of columns in a table is smaller than 64
2. Number of columns in a table is greater than 64
2a. and less than half of all possible columns are present in sstable
2b. and at least half of all possible columns are present in sstable

Case 1 is implemented using bit mask and column is present if mask & (1 << <column number>) == 0
Case 2 is implemented by storing list of column numbers for each present column
case 3 is implemented by storing list of column numbers for each absent column
"

* 'haaawk/sstables3/read-missing-columns-v3' of ssh://github.com/scylladb/seastar-dev:
  sstables 3: add test for reading big dense subset of columns
  sstables 3: support reading big dense subsets of columns
  sstables 3: add test for reading big sparse subset of columns
  sstables 3: support reading big sparse subsets of columns
  sstables 3: add test for reading small subset of columns
  sstables 3: support reading small subsets of columns
2018-06-03 10:42:00 +03:00
Avi Kivity
78182a704b partition_snapshot_row_cursor: initialize _dummy and _continuous
Debug mode view_schema_test sometimes complains that a bool member
doesn't contain in-range values, apparenty in the move constructor.

Initialize them for its benefit to avoid false-positive test
failures.
Message-Id: <20180602184934.31258-1-avi@scylladb.com>
2018-06-02 19:51:36 +01:00
Avi Kivity
187ebdbe46 auth: fix possible use of disengaged optional in has_salted_hash()
untyped_result_set_row's cell data type is bytes_opt, and the
get_block() accessor accesses the value assuming it's engaged
(relying on the caller to call has()).

has_unsalted_hash() calls get_blob() without calling has() beforehand,
potentially triggering undefined behavior.

Fix by using get_or() instead, which also simplifies the caller.

I observed failures in Jenkins in this area. It's hard to be sure
this is the root cause, since the failures triggered an internal
consistency assertion in asan rather than an asan report. However,
the error is hard to reproduce and the fix makes sense even if it
doesn't prevent the error.

See #3480 for the asan error.

Fixes #3480 (hopefully).
Message-Id: <20180602181919.29204-1-avi@scylladb.com>
2018-06-02 19:46:32 +01:00
Piotr Jastrzebski
2fd0566eb7 sstables 3: add test for reading big dense subset of columns
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-02 10:41:18 +02:00
Piotr Jastrzebski
829f0c5f80 sstables 3: support reading big dense subsets of columns
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-02 10:41:18 +02:00
Piotr Jastrzebski
4e4972ffea sstables 3: add test for reading big sparse subset of columns
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-02 10:40:56 +02:00
Piotr Jastrzebski
e5fb499736 sstables 3: support reading big sparse subsets of columns
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-01 21:35:28 +02:00
Piotr Jastrzebski
24e9ab4ab6 sstables 3: add test for reading small subset of columns
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-01 21:34:03 +02:00
Piotr Jastrzebski
63d45c4f24 sstables 3: support reading small subsets of columns
Small subset is contains no more than 63 elements.
Support for large subsets will come in the following
patches.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-01 21:33:50 +02:00
Glauber Costa
7e3093709a backlog: add level to write progress monitor
For SSTables being written, we don't know their level yet. Add that
information to the write monitor. New SSTables will always be at L0.
Compacted SSTables will have their level determined by the compaction
process.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-31 21:09:38 -04:00
Vladimir Krivopalov
b6511d1b07 tests: Add test covering compact table with non-full clustering key.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-31 17:30:36 -07:00
Vladimir Krivopalov
47a7e78bc8 sstables: Improve clustering blocks writing, use logical clustering prefix size.
In the Origin, the size of the clustering key prefix used during
serialization is the actual length of the prefix and not the full size
as defined in schema. So the code is fixed to align with that logic.
This, in particular, is needed to write clustering blocks for RT
markers.

There is, however, a special case where the prefix size is smaller than
that of a clustering key but we still need to serialize up to the full
size. This is the case when a compact table is being used and some
rows in it are added using incomplete clustering keys (containing null
for trailing columns).
In Cassandra, these prefixes still have a full length and missing
columns are just set to 'null'. In our code those prefixes have their
real length, but since we need to serialize beyond it, we pass a flag to
indicate this.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-31 17:30:36 -07:00
Vladimir Krivopalov
3f404f19dc tests: Add test covering large clustering keys (>32 columns) for SSTables 3.x
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-31 17:30:36 -07:00
Vladimir Krivopalov
487796de85 tests: Add unit test covering empty values in clustering key.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-31 17:30:36 -07:00
Vladimir Krivopalov
0dadd4fdf3 sstables: Fix typo in clustering blocks write helper.
What supposed to be an operation of taking remainder turned to be a
bitwise 'and'. This didn't show up in existing tests only because they
all had non-empty clustering values.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-31 15:12:40 -07:00
Avi Kivity
aab6b0ee27 Merge "Introduce new in-memory representation for cells" from Paweł
"
This is the first part of the first step of switching Scylla. It covers
converting cells to the new serialisation format. The actual structure
of the cells doesn't differ much from the original one with a notable
exception of the fact that large values are now fragmented and
linearisation needs to be explicit. Counters and collections still
partially rely on their old, custom serialisation code and their
handling is not optimial (although not significantly worse than it used
to be).

The new in-memory representation allows objects to be of varying size
and makes it possible to provide deserialisation context so that we
don't need to keep in each instance of an IMR type all the information
needed to interpret it. The structure of IMR types is described in C++
using some metaprogramming with the hopes of making it much easier to
modify the serialisation format that it would be in case of open-coded
serialisation functions.

Moreover, IMR types can own memory thanks to a limited support for
destructors and movers (the latter are not exactly the same thing as C++
move constructors hence a different name). This makes it (relatively)
to ensure that there is an upper bound on the size of all allocations.

For now the only thing that is converted to the IMR are atomic_cells
and collections which means that the reduction in the memory footprint
is not as big as it can be, but introducing the IMR is a big step on its
own and also paves the way towards complete elimination of unbounded
memory allocations.

The first part of this patchset contains miscellaneous preparatory
changes to various parts of the Scylla codebase. They are followed by
introduction of the IMR infrastructure. Then structure of cells is
defined and all helper functions are implemented. Next are several
treewide patches that mostly deal with propagating type information to
the cell-related operations. Finally, atomic_cell and collections are
switched to used the new IMR-based cell implementation.

The IMR is described in much more detail in imr/IMR.md added in "imr:
add IMR documentation".

Refs #2031.
Refs #2409.

perf_simple_query -c4, medians of 30 results:

        ./perf_base  ./perf_imr   diff
 read     308790.08   309775.35   0.3%
 write    402127.32   417729.18   3.9%

The same with 1 byte values:
        ./perf_base1  ./perf_imr1   diff
 read      314107.26    314648.96   0.2%
 write     463801.40    433255.96  -6.6%

The memory footprint is reduced, but that is partially due to removal of
small buffer optimisation (whether it will be restored depends on the
exact mesurements of the performance impact). Generally, this series was
not expected to make a huge difference as this would require converting
whole rows to the IMR.

Memory footprint:
Before:
mutation footprint:
 - in cache: 1264
 - in memtable: 986

After:
mutation footprint:
 - in cache: 1104
 - in memtable: 866

Tests: unit (release, debug)
"

* tag 'imr-cells/v3' of https://github.com/pdziepak/scylla: (37 commits)
  tests/mutation: add test for changing column type
  atomic_cell: switch to new IMR-based cell reperesentation
  atomic_cell: explicitly state when atomic_cell is a collection member
  treewide: require type for creating collection_mutation_view
  treewide: require type for comparing cells
  atomic_cell: introduce fragmented buffer value interface
  treewide: require type to compute cell memory usage
  treewide: require type to copy atomic_cell
  treewide: require type info for copying atomic_cell_or_collection
  treewide: require type for creating atomic_cell
  atomic_cell: require column_definition for creating atomic_cell views
  tests: test imr representation of cells
  types: provide information for IMR
  data: introduce cell
  data: introduce type_info
  imr/utils: add imr object holder
  imr: introduce concepts
  imr: add helper for allocating objects
  imr: allow creating lsa migrators for IMR objects
  imr: introduce placeholders
  ...
2018-05-31 19:21:15 +03:00
Amnon Heiman
bc7503feee Scyllatop to use prometheus by default
Scylla now expose the prometheus API by default. This patch chagnes
scyllatop to use the Prometheus API, the collect API is still available.

The main changes in the patch:
* Move collectd specific logic inside collectd.
* Add support for help information.
* Add command line to configure prometheus end point and to enable
collectd.

* Add a prometheus class that collect information from prometheus.

Fixes: #1541
Message-Id: <20180531124156.26336-1-amnon@scylladb.com>
2018-05-31 18:00:22 +03:00
Tomasz Grabiec
b5e42bc6a0 tests: row_cache: Do not hang when only one of the readers throws
Message-Id: <20180531122729.3314-1-tgrabiec@scylladb.com>
2018-05-31 18:00:22 +03:00
Piotr Sarna
360326fdc5 cql3: add compatibility with libjsoncpp < 1.6.0
Only libjsoncpp >= 1.6.0 offers a safe name() method for value
iterators. For older versions, deprecated memberName() is used
instead. Note that memberName() was deprecated because of its
inability to deal with embedded null characters.

Fixes #3471

Message-Id: <e64a62bfc24ef06daee238d79d557fe6ec8979d3.1527758708.git.sarna@scylladb.com>
2018-05-31 18:00:22 +03:00
Paweł Dziepak
131a47dea3 tests/mutation: add test for changing column type
With the introduction of the new in-memory representation changing
column type has become a more complex operation since it needs to handle
switch from fixed-size to variable-size types. This commit adds an
explicit test for such cases.
2018-05-31 15:51:11 +01:00
Paweł Dziepak
a040d37cd5 atomic_cell: switch to new IMR-based cell reperesentation
This patch changes the implementation of atomic_cell and
atomic_cell_or_collection to use the data::cell implementation which is
based on the new in-memory representation infrastructure.
2018-05-31 15:51:11 +01:00
Paweł Dziepak
0ea6d14cf5 atomic_cell: explicitly state when atomic_cell is a collection member
Collections are not going to be fully converted to the IMR just yet and
still use the old serialisation format. This means that they still don't
support fragmented values very well. This patch passes the information
when an atomic_cell is created as a member of a collection so that later
we can avoid fragmenting the value in such cases.
2018-05-31 15:51:11 +01:00
Paweł Dziepak
e34ff8b4bf treewide: require type for creating collection_mutation_view 2018-05-31 15:51:11 +01:00
Paweł Dziepak
9bb1f10bb6 treewide: require type for comparing cells 2018-05-31 15:51:11 +01:00
Paweł Dziepak
aa25f0844f atomic_cell: introduce fragmented buffer value interface
As a prepratation for the switch to the new cell representation this
patch changes the type returned by atomic_cell_view::value() to one that
requires explicit linearisation of the cell value. Even though the value
is still implicitly linearised (and only when managed by the LSA) the
new interface is the same as the target one so that no more changes to
its users will be needed.
2018-05-31 15:51:11 +01:00
Paweł Dziepak
ec9d166a4f treewide: require type to compute cell memory usage 2018-05-31 15:51:11 +01:00
Paweł Dziepak
418c159057 treewide: require type to copy atomic_cell 2018-05-31 15:51:11 +01:00
Paweł Dziepak
27014a23d7 treewide: require type info for copying atomic_cell_or_collection 2018-05-31 15:51:11 +01:00
Paweł Dziepak
e9d6fc48ac treewide: require type for creating atomic_cell 2018-05-31 15:51:11 +01:00
Paweł Dziepak
93130e80fb atomic_cell: require column_definition for creating atomic_cell views 2018-05-31 15:51:11 +01:00
Paweł Dziepak
b25cc61a13 tests: test imr representation of cells 2018-05-31 15:51:11 +01:00
Paweł Dziepak
43b216b43d types: provide information for IMR 2018-05-31 15:51:11 +01:00
Paweł Dziepak
eec33fda14 data: introduce cell
This commit introduces cell serializers and views based on the in-memory
representation infrastructure. The code doesn't assume anything about
how the cells are stored, they can be either a part of another IMR
object (once the rows are converted to the IMR) or a separate objects
(just like current atomic_cell).
2018-05-31 15:51:11 +01:00
Duarte Nunes
f8626c7c93 tests/view_schema_test: Test view correctness under base schema changes
Reproducer for #3443.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180530194536.51202-2-duarte@scylladb.com>
2018-05-31 12:10:50 +03:00
Duarte Nunes
c4f267bdfe database: Refresh view dependent fields when altering base
A view schema's view_info contains the id of the base regular column
that view includes in its primary key. Since the column id of a
particular column can potentially change with a new schema version, we
need to refresh the stored column id. We weren't doing that when
unselected base columns are added, and this patch fixes it by
triggering an update of the view schema when base columns are added
and the view contains a base regular column in its PK.

Fixes #3443

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180530194536.51202-1-duarte@scylladb.com>
2018-05-31 12:10:49 +03:00
Paweł Dziepak
544b3c9a34 data: introduce type_info
This patch introduces type_info class which contains all type
information needed by IMR deserialisation contexts.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
4929c1f39a imr/utils: add imr object holder
imr::object<> is an owning pointer to an IMR objects. It is LSA-aware.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
fd47858755 imr: introduce concepts
This commit adds type traits and concepts for sizers, serializers and
writers that help explicitly specify requirements of various interfaces.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
28ea36a686 imr: add helper for allocating objects
IMR objects may own memory. object_allocator takes care of allocating
memory for all owned objects during the serialisation of their owner.

In practice a writer of the parent object would accept a helper object
created by object_allocator. That helper object would be either
responsible for computing the size of buffers that have to be allocated
or perform the actual serialisation in the same two phase manner as it
is done for the parent IMR object.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
79941f2fc7 imr: allow creating lsa migrators for IMR objects
This patch introduces helpers for creating LSA migrators from IMR
deserialisation contexts and context factories.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
5ddb118c78 imr: introduce placeholders
In some cases the actual value of an IMR object is not know at the
serialisation time. If the type is fixed-size we can use a placeholder
to defer writing it to a more conveninent moment.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
8c38f09fbc tests/imr: add tests for destructor and mover methods 2018-05-31 10:09:01 +01:00
Paweł Dziepak
fa7b080443 imr: introduce destructor and mover methods
This patch introduces destructors and movers for IMR objects which
enables them to own memory. Custom destructors and methods can be
defined by specialising appropriate classes.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
c02bfb942d imr/compound: introduce tagged_type<Tag, T> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
a29a88c9d9 tests/imr/compound: add tests for structure<...> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
4f51901dfe imr/compound: introduce structure<...> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
466d91f652 tests/imr/compound: add tests for variant<Ts...> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
8e4c8ce2c4 imr/compound: introduce variant<Ts...> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
7c28c9eda8 tests/imr: add test for optional<T> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
6d7b205d1a imr: introduce optional<T> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
eb2479fa9a tests: add test for new in memory representation 2018-05-31 10:09:01 +01:00
Paweł Dziepak
a995fb337c imr: introduce fundamental types
This patch introduces fundamental IMR types: a set of flags, a POD type
and a buffer.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
5f960beca1 imr: add IMR documentation 2018-05-31 10:09:01 +01:00
Paweł Dziepak
0092076167 tests: add helpers for generating random data 2018-05-31 10:09:01 +01:00
Paweł Dziepak
cc76480174 tests: introduce tests for metaprogramming helpers 2018-05-31 10:09:01 +01:00
Paweł Dziepak
ba5e64383a utils: add metaprogramming helper functions 2018-05-31 10:09:01 +01:00
Paweł Dziepak
5845d52632 idl: allow fragmented bytes_view in serialisation
This patch adds new way of serialising bytes and sstring objects in the
IDL. Using write_fragmented_<field-name>() the caller can pass a range
of fragments that would be serialised without linearising the buffer.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
c41b9fc7ec utils: add fragment range
This patch introduces a FragmentRange concept which is the minimal interface all
classes representing a fragmented buffer should satisfy.
2018-05-31 10:09:01 +01:00
Vladimir Krivopalov
0886c189bf tests: Fix test files to use correct previous row sizes.
Since sstabledump and Cassandra do not use row size values, the new
files have been validated to be identical to files generated by
Cassandra with the same data inserted at same timestamps.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-30 18:18:35 -07:00
Vladimir Krivopalov
2d86fcc8ab sstables: Fix calculation of previous row size for SSTables 3.x
The previous code incorrectly calculated sizes of previous rows while
writing SSTables in 3.x ('m') format.
The problem with detecting this issue was that neither sstabledump nor
Cassandra 3.x itself use those values, as of today, they are simply
ignored when data is read from files.

Still, we want to be compatible and write correct values as they may be
of use in the future.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-30 18:14:12 -07:00
Vladimir Krivopalov
71f7f45d64 sstables: Factor out code building promoted index blocks into separate helpers.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-30 12:40:18 -07:00
Nadav Har'El
a1cbeeffcd tests/view_complex_test.cc: fix and enable buggy test
tests/view_complex_test.cc contained a #ifdef'ed-out test claiming to
be a reproducer for issue #3362. Unfortunately, it it is not - after
earlier commits the only reason this test still fails is a mistake in
the test, which expects 0 rows in a case where the real result is 1 row.
Issue #3362 does *not* have to be fixed to fix this test.

So this patch fixes the broken test, and enables it. It also adds comments
explaining what this test is supposed to do, and why it works the way it
does.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180530142214.29398-1-nyh@scylladb.com>
2018-05-30 15:39:25 +01:00
Avi Kivity
9999e0e6bc Merge "Implement support for static rows in SSTable 3.0" from Piotr
"
Add handling for static rows and tests for it.
"

* 'haaawk/sstables3/read-static-v1' of ssh://github.com/scylladb/seastar-dev:
  sstable_3_x_test: Add test_uncompressed_compound_static_row_read
  sstable_3_x_test: add test_uncompressed_static_row_read
  flat_mutation_reader_assertions: improve static row assertions
  data_consume_rows_context_m: Implement support for static rows
  mp_row_consumer_m: Implement support for static rows
  mp_row_consumer_m: Extract fill_cells
2018-05-30 17:17:17 +03:00
Paweł Dziepak
62d0639fe9 Merge "Avoid reactor stalls in cache with large partitions" from Tomasz
"
We currently suffer from reactor stalls caused by non-preemptible processing
of large partitions in the following places:

  (1) dropping partition entries from cache or memtables does not defer

  (2) dropping partition versions abandoned by detached snapshots does not defer

  (3) merging of partition versions when snapshots go away does not defer

  (4) cache update from memtable processes partition entries without deferring (#2578)

  (5) partition entries are upgraded to new schema atomically

This series fixes problems (1), (2) and (4), but not (3) and (5).

(1) and (2) are fixed by introducing mutation_cleaner objects which are
containers for garbage partition versions which are delaying actual freeing.
Freeing happens from memory reclaimers and is incremental.

(3) and (5) are not solved yet.

(4) is solved by having partition merging process partitions with row
granularity and defer in the middle of partition. In order to preserve update
atomicity on partition level as perceived by reads, when update starts we
create a snapshot to the current version of partition and process memtable
entry by inserting data into a separate partition version. This way if upgrade
defers in the middle of partition reads can still go to the old version and
not see partial writes. Snapshots are marked with phase numbers, and reads
will use the previous phase until whole partition is upgraded. When partition
is finally merged, the snapshots go away and the new version will eventually
be merged to the old version. Due to (3) however, this merging may still add
latency to the upgrade path.

Remaining work:

  - Solving problem (3). I think the approach to take here would be to
    move the task of merging versions to the background, maybe into mutation_cleaner.

  - Merging range tombstones incrementally.

Performance
===========

Performance improvements were evaluated using tests/perf_row_cache_update -c1 -m1G,
which measures time it takes to update cache from memtable for various workloads
and schemas.

For large partition with lots of small rows we see a significant reduction of
scheduling latency from ~550ms to ~23ms. The cause of remainig latency is
problem (3) stated above. The run time is reduced by 70%.

For small partition case without clustering columns we see no degradation.

For small partition case with clustering key, but only 3 small rows per partition,
we see a 30% degradation in run time.

For large partition with lots of range tombstones we see degradation of 15% in
run time and scheduling latency.

Below you can see full statistics for cache update run time:

=== Small partitions, no overwrites:

Before:

  avg = 433.965155
  stdev = 35.958024
  min = 340.093201
  max = 468.564514

After:

  avg = 436.929447 (+1%)
  stdev = 37.130237
  min = 349.410339
  max = 489.953400

=== Small partition with a few rows:

Before:

  avg = 315.379316
  stdev = 30.059120
  min = 240.340561
  max = 342.408295

After:

  avg = 407.232691 (+30%)
  stdev = 53.918717
  min = 269.514648
  max = 444.846649

=== Large partition, lots of small rows:

Before:

  avg = 412.870689
  stdev = 227.411317
  min = 286.990631
  max = 1263.417847

After:

  avg = 124.351705 (-70%)
  stdev = 4.705762
  min = 110.063255
  max = 129.643387

=== Large partition, lots of range tombstones:

Before:

  avg = 601.172644
  stdev = 121.376866
  min = 223.502136
  max = 874.111572

After:

  avg = 695.627588 (+15%)
  stdev = 135.057004
  min = 337.173950
  max = 784.838745
"

* tag 'tgrabiec/clear-gently-all-partitions-v3' of github.com:tgrabiec/scylla:
  mvcc: Use small_vector<> in partition_snapshot_row_cursor
  utils: Extract small_vector.hh
  mvcc: Erase rows gradually in apply_to_incomplete()
  mvcc: partition_snapshot_row_cursor: Avoid row copying in consume() when possible
  cache: real_dirty_memory_accounter: Move unpinning out of the hot path
  mvcc: partition_snapshot_row_cursor: Reduce lookups in ensure_entry_if_complete()
  mutation_partition: Reduce row lookups in apply_monotonically()
  cache: Release dirty memory with row granularity
  cache: Defer during partition merging
  mvcc: partition_snapshot_row_cursor: Introduce consume_row()
  mvcc: partition_snapshot_row_cursor: Introduce maybe_refresh_static()
  mvcc: Make apply_to_incomplete() work with attached versions
  cache: Propagate phase to apply_to_incomplete()
  cache: Prepare for incremental apply_to_incomplete()
  Introduce a coroutine wrapper
  tests: mvcc: Encapsulate memory management details
  tests: cache: Take into account that update() may defer
  cache: real_dirty_memory_accounter: Allow construction without memtable
  cache: Extract real_dirty_memory_accounter
  mvcc: Destroy memtable partition versions gently
  memtable: Destroy partitions incrementally from clear_gently()
  mvcc: Remove rows from tracker gently
  cache: Destroy partition versions incrementally
  Introduce mutation_cleaner
  mvcc: Introduce partition_version_list
  mvcc: Fix move constructor of partition_version_ref() not preserving _unique_owner
  database: Add API for incremental clearing of partition entries
  cache: Define trivial methods inline
  tests: Improve perf_row_cache_update
  mutation_reader: Make empty mutation source advertize no partitions
2018-05-30 14:12:29 +01:00
Tomasz Grabiec
4561e97efe mvcc: Use small_vector<> in partition_snapshot_row_cursor
I measured 8% improvement in cache update throughput for small
partitions.
2018-05-30 14:41:41 +02:00
Tomasz Grabiec
db36ff0643 utils: Extract small_vector.hh 2018-05-30 14:41:41 +02:00
Tomasz Grabiec
5b59df3761 mvcc: Erase rows gradually in apply_to_incomplete()
So that we avoid double-buffering partitions.
2018-05-30 14:41:41 +02:00
Tomasz Grabiec
b7fdf4309f mvcc: partition_snapshot_row_cursor: Avoid row copying in consume() when possible 2018-05-30 14:41:41 +02:00
Tomasz Grabiec
8d66f6da58 cache: real_dirty_memory_accounter: Move unpinning out of the hot path
Instead of calling into real dirty memory manager per row, call it per
deferring point.
2018-05-30 14:41:41 +02:00
Tomasz Grabiec
60000b98a4 mvcc: partition_snapshot_row_cursor: Reduce lookups in ensure_entry_if_complete()
Leverage the fact that it is called with monotonically increasing
positions, and avoid lookups in case the current target entry is the
successor of desired position. Reduces cache update latency by 40%
for large partition in a time-series workload.
2018-05-30 14:41:41 +02:00
Tomasz Grabiec
82e8217ba0 mutation_partition: Reduce row lookups in apply_monotonically()
This change speeds up merging of partition versions with many rows in
case the merged version has many rows which fall between existing rows
in the target version. This is often the case for time-series
workloads, which insert rows at the front. Lookup can be avoided for
all but the first row in the stride because we already have a
reference to the successor in the target tree, we only need to check
that the current entry in the target tree is still the successor.

This change greatly reduces amount of lookups per row during version
merging of large partitions in time-series workloads.
2018-05-30 14:41:41 +02:00
Tomasz Grabiec
5bc201df10 cache: Release dirty memory with row granularity 2018-05-30 14:41:41 +02:00
Tomasz Grabiec
70c72773be cache: Defer during partition merging 2018-05-30 14:41:41 +02:00
Tomasz Grabiec
051bb74583 mvcc: partition_snapshot_row_cursor: Introduce consume_row() 2018-05-30 14:41:41 +02:00
Tomasz Grabiec
518fd7083f mvcc: partition_snapshot_row_cursor: Introduce maybe_refresh_static()
A version of maybe_refresh() optimized for snapshots which are
no longer populated. Will be used to implement cache update from
memtable.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
c653137b2b mvcc: Make apply_to_incomplete() work with attached versions
Needed before making it preemptible. We cannot steal the entry since
we may need to resume merging later.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
1792be3697 cache: Propagate phase to apply_to_incomplete()
It will be needed to create snapshots with appropriate phase markers.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
494cb3f3da cache: Prepare for incremental apply_to_incomplete()
Incremental merging will be implemented by the means of resumable
functions, which return stop_iteration::no when not yet
finished. We're not using futures, so that the caller can do work
around preemption points as well.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
a19c5cbc16 Introduce a coroutine wrapper
Represents a deferring operation which defers cooperatively with the caller.

The operation is started and resumed by calling run(), which returns
with stop_iteration::no whenever the operation defers and is not
completed yet. When the operation is finally complete, run() returns
with stop_iteration::yes.

This allows the caller to:

 1) execute some post-defer and pre-resume actions atomically

 2) have control over when the operation is resumed and in which context,
    in particular the caller can cancel the operation at deferring points.

It will be used to implement deferring partition_version::apply_to_incomplete().
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
6bd1a04c10 tests: mvcc: Encapsulate memory management details
Curently tests have a single LSA region lock around construction of
managed objects, their manipulation, and access. This way we avoid the
complexity of dealing with allocating sections. That will not be
possible once apply_to_incomplete() is changed to enter an allocating
section itself becasue this requires region to be unlocked at
entry. The tests will have to take more fine-grained locks. That is
somewhat tricky add would add a lot of noise to tests. This patch will
make things easier by abstracting LSA management, among other things,
inside mvcc_conatiner and mvcc_partition classes.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
f6e21accc7 tests: cache: Take into account that update() may defer
The test incorrectly assumed that once update() is started the
cache will return only versions from last_generation. This will not
hold once we start to defer during partition merging.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
c10d9e1607 cache: real_dirty_memory_accounter: Allow construction without memtable 2018-05-30 14:41:40 +02:00
Tomasz Grabiec
6ecda1ccd7 cache: Extract real_dirty_memory_accounter 2018-05-30 14:41:40 +02:00
Tomasz Grabiec
3f19f76c67 mvcc: Destroy memtable partition versions gently
Now all snapshots will have a mutation_cleaner which they will use to
gently destroy freed partition_version objects.

Destruction of memtable entries during cache update is also using the
gentle cleaner now. We need to have a separate cleaner for memtable
objects even though they're owned by cache's region, because memtable
versions must be cleared without a cache_tracker.

Each memtable will have its own cleaner, which will be merged with the
cache's cleaner when memtable is merged into cache.

Fixes some sources of reactor stalls on cache update when there are
large partition entries in memtables.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
c2d702622e memtable: Destroy partitions incrementally from clear_gently()
Destroying large partitions may stall the reactor for a long
time. Avoid this by clearing incrementally.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
81d231f35b mvcc: Remove rows from tracker gently
Some parititons may have a lot of rows. Better to iterate over them
incrementally as part of clear_gently() to avoid stalls.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
f0c1edd672 cache: Destroy partition versions incrementally
Instead of destroying whole partition_versions at once, we will do that
gently using mutation_cleaner to avoid reactor stalls.

Large deletions could happen when large partition gets invalidated,
upgraded to a new schema, or when it's abandaned by a detached snapshot.

Refs #3289.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
e0803ff71e Introduce mutation_cleaner
Used for collecting unsued partition_version objects and freeing them
incrementally. Will be used for both cache and memtables.
2018-05-30 14:41:39 +02:00
Tomasz Grabiec
e5aa02efeb mvcc: Introduce partition_version_list 2018-05-30 12:18:56 +02:00
Tomasz Grabiec
ca1ee93577 mvcc: Fix move constructor of partition_version_ref() not preserving _unique_owner
We didn't rely on that yet, it seems, but will.

(cherry picked from commit 21a744337de01f699d5c5c340483ad23cabab2ee)
2018-05-30 12:18:56 +02:00
Tomasz Grabiec
40cc766cf2 database: Add API for incremental clearing of partition entries
Partitions can get very large. Destroying them all at once can stall
the reactor for significant amount of time. We want to avoid that by
doing destruction incrementally, deferring in between. A new API is
added for that at various levels:

  stop_iteration clear_gently() noexcept;

It returns stop_iteration::yes when the object is fully cleared and
can be now destroyed quickly. So a deferring destruction can look like
this:

  return repeat([this] { return clear_gently(); });

The reason why clear_gently() doesn't return a future<> itself is that some
contexts cannot defer, like memory reclamation.
2018-05-30 12:18:56 +02:00
Tomasz Grabiec
2f75212ca4 cache: Define trivial methods inline
They have users in a different compilation unit, in partition_version.cc
2018-05-30 12:18:56 +02:00
Tomasz Grabiec
25b3641d9e tests: Improve perf_row_cache_update
We now test more kinds of workloads:
 - small partitions with no clustering key
 - large partition with lots of small rows
 - large partition with lots of range tombstones

We also collect statistics about scheduling latency induced by cache
update.

Example output:

Small partitions, no overwrites:
update: 356.809113 [ms], stall: {ticks: 396, min: 0.006867 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.358102 [ms]}, cache: 257/257 [MB] LSA: 257/257 [MB] std free: 83 [MB]
update: 337.542999 [ms], stall: {ticks: 373, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.358102 [ms]}, cache: 514/514 [MB] LSA: 514/514 [MB] std free: 83 [MB]
update: 383.485291 [ms], stall: {ticks: 425, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 771/788 [MB] LSA: 771/788 [MB] std free: 83 [MB]
update: 574.968811 [ms], stall: {ticks: 634, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.629722 [ms], max: 1.955666 [ms]}, cache: 879/917 [MB] LSA: 879/917 [MB] std free: 83 [MB]
update: 411.541138 [ms], stall: {ticks: 455, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.358102 [ms]}, cache: 787/835 [MB] LSA: 787/835 [MB] std free: 83 [MB]
update: 368.491211 [ms], stall: {ticks: 408, min: 0.001332 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 750/790 [MB] LSA: 750/790 [MB] std free: 83 [MB]
update: 343.671967 [ms], stall: {ticks: 380, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 734/769 [MB] LSA: 734/769 [MB] std free: 83 [MB]
update: 320.277283 [ms], stall: {ticks: 357, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 724/753 [MB] LSA: 724/753 [MB] std free: 83 [MB]
update: 310.583282 [ms], stall: {ticks: 344, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 714/740 [MB] LSA: 714/740 [MB] std free: 83 [MB]
update: 303.627106 [ms], stall: {ticks: 338, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.955666 [ms]}, cache: 707/731 [MB] LSA: 707/731 [MB] std free: 83 [MB]
update: 296.742523 [ms], stall: {ticks: 330, min: 0.001332 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 701/724 [MB] LSA: 701/724 [MB] std free: 83 [MB]
update: 286.598541 [ms], stall: {ticks: 319, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 697/719 [MB] LSA: 697/719 [MB] std free: 83 [MB]
update: 288.649323 [ms], stall: {ticks: 321, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 694/715 [MB] LSA: 694/715 [MB] std free: 83 [MB]
update: 282.069916 [ms], stall: {ticks: 314, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 692/712 [MB] LSA: 692/712 [MB] std free: 83 [MB]
update: 292.462036 [ms], stall: {ticks: 325, min: 0.001917 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 689/708 [MB] LSA: 689/708 [MB] std free: 83 [MB]
update: 274.390442 [ms], stall: {ticks: 305, min: 0.001332 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 687/705 [MB] LSA: 687/705 [MB] std free: 83 [MB]
invalidation: 172.617508 [ms]
Large partition, lots of small rows:
update: 262.132721 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.005722 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 268.650944 [ms]}, cache: 187/188 [MB] LSA: 187/188 [MB] std free: 82 [MB]
update: 281.359467 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.017084 [ms], 99%: 0.017084 [ms], max: 322.381152 [ms]}, cache: 375/376 [MB] LSA: 375/376 [MB] std free: 82 [MB]
update: 287.229065 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.017084 [ms], 99%: 0.017084 [ms], max: 322.381152 [ms]}, cache: 563/564 [MB] LSA: 563/564 [MB] std free: 82 [MB]
update: 1294.816284 [ms], stall: {ticks: 4, min: 0.001917 [ms], 50%: 0.005722 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 1386.179840 [ms]}, cache: 586/625 [MB] LSA: 586/625 [MB] std free: 82 [MB]
update: 845.022461 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.005722 [ms], 90%: 0.017084 [ms], 99%: 0.017084 [ms], max: 962.624896 [ms]}, cache: 439/475 [MB] LSA: 439/475 [MB] std free: 82 [MB]
update: 380.335938 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 386.857376 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 477.234680 [ms], stall: {ticks: 4, min: 0.002760 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 525.955017 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 548.003784 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.006866 [ms], 90%: 0.017084 [ms], 99%: 0.017084 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 528.697937 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 609.292603 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.005722 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 575.762451 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.017084 [ms], 99%: 0.017084 [ms], max: 668.489536 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 530.801392 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 535.948364 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.017084 [ms], 99%: 0.017084 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 527.143555 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.020501 [ms], 99%: 0.020501 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 521.869202 [ms], stall: {ticks: 4, min: 0.002760 [ms], 50%: 0.004768 [ms], 90%: 0.017084 [ms], 99%: 0.017084 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
invalidation: 173.069733 [ms]
Large partition, lots of range tombstones:
update: 224.003220 [ms], stall: {ticks: 4, min: 0.001917 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 268.650944 [ms]}, cache: 52/52 [MB] LSA: 52/52 [MB] std free: 82 [MB]
update: 570.882874 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 105/105 [MB] LSA: 105/105 [MB] std free: 82 [MB]
update: 577.249878 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 158/158 [MB] LSA: 158/158 [MB] std free: 82 [MB]
update: 580.239624 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 211/211 [MB] LSA: 211/211 [MB] std free: 82 [MB]
update: 614.187134 [ms], stall: {ticks: 4, min: 0.001917 [ms], 50%: 0.004768 [ms], 90%: 0.011864 [ms], 99%: 0.011864 [ms], max: 668.489536 [ms]}, cache: 264/264 [MB] LSA: 264/264 [MB] std free: 82 [MB]
update: 618.709229 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.003973 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 317/317 [MB] LSA: 317/317 [MB] std free: 82 [MB]
update: 626.943359 [ms], stall: {ticks: 4, min: 0.001598 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 369/370 [MB] LSA: 369/370 [MB] std free: 82 [MB]
update: 602.873474 [ms], stall: {ticks: 4, min: 0.001917 [ms], 50%: 0.003973 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 422/423 [MB] LSA: 422/423 [MB] std free: 82 [MB]
update: 617.522583 [ms], stall: {ticks: 4, min: 0.001598 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 475/475 [MB] LSA: 475/475 [MB] std free: 82 [MB]
update: 627.291138 [ms], stall: {ticks: 4, min: 0.001598 [ms], 50%: 0.004768 [ms], 90%: 0.011864 [ms], 99%: 0.011864 [ms], max: 668.489536 [ms]}, cache: 528/528 [MB] LSA: 528/528 [MB] std free: 82 [MB]
update: 623.720886 [ms], stall: {ticks: 4, min: 0.001598 [ms], 50%: 0.003973 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 581/581 [MB] LSA: 581/581 [MB] std free: 82 [MB]
update: 630.735596 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 634/634 [MB] LSA: 634/634 [MB] std free: 82 [MB]
update: 2776.525635 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 2874.382592 [ms]}, cache: 687/687 [MB] LSA: 687/687 [MB] std free: 82 [MB]
2018-05-30 12:18:56 +02:00
Tomasz Grabiec
bb96518cc5 mutation_reader: Make empty mutation source advertize no partitions
So that perf_row_cache_update will always populate cache.
2018-05-30 12:18:56 +02:00
Avi Kivity
dd26cf1490 Merge "db/view: Clarifications to range movement scenarios" from Duarte
"
This series provides reasoning and clarification for the current
structure of mutate_MV(), and how we handle some scenarios related to
range movements.
"

* 'materialized-views/clarifications/v3' of github.com:duarten/scylla:
  db/view: Remove ifdef'd Java code
  db/view: Ignore scenario where base replica hasn't joined the ring
  db/view: Handle case when base has no paired view replica
2018-05-29 18:51:06 +03:00
Avi Kivity
928af7701c Merge "Implement reading clustering columns from SSTables 3.x" from Piotr
"
Add handling for clustering columns and tests for it.
"

* 'haaawk/sstables3/read-ck-v3' of ssh://github.com/scylladb/seastar-dev:
  Add test_uncompressed_compound_ck_read for SSTables 3.x
  Add test_uncompressed_simple_read for SSTables 3.x
  Implement reading clustering key from SSTables 3.x
  column_translation: cache fixed value lengths for ck
  data_consume_rows_context_m: use cached fixed column value lenghts
  column_translation: store fix lengths of column values
  consume_row_start: change type of clustering key
  Rename ROW_BODY state to CLUSTERING_ROW
2018-05-29 18:49:26 +03:00
Piotr Jastrzebski
d2300bc5a9 sstable_3_x_test: Add test_uncompressed_compound_static_row_read
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-29 14:55:36 +02:00
Piotr Jastrzebski
6639ef8769 sstable_3_x_test: add test_uncompressed_static_row_read
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-29 14:55:11 +02:00
Piotr Jastrzebski
18cced2edc flat_mutation_reader_assertions: improve static row assertions
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-29 14:52:55 +02:00
Piotr Jastrzebski
6ab660880d data_consume_rows_context_m: Implement support for static rows
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-29 14:52:14 +02:00
Piotr Jastrzebski
c9c2fc8e4b mp_row_consumer_m: Implement support for static rows
Add consumer_m::consume_static_row_start

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-29 14:50:15 +02:00
Piotr Jastrzebski
f018e5dfed mp_row_consumer_m: Extract fill_cells
This lambda will be used not only for regular columns
but also for static columns.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-29 14:46:02 +02:00
Laura Novich
e053da6f51 scylla_setup: adjust language
Edited the text for the scylla setup, improving readability for the prompts
with regards to grammar and usage.

Signed-off by: Laura Novich <laura@scylladb.com>
Message-Id: <CAGcEH3Xa6TFy=_rdz_=NP0b23vEDZmfRQzAdxV-f04C1p+AzTw@mail.gmail.com>
2018-05-29 09:56:41 +03:00
Piotr Sarna
ffe52681ea storage_proxy: add mv stats to write handler
Previous patch for issue 3416 did not cover passing write stats
to write response handler, which results in some write stats
being incorrectly counted as user write stats, while they belong
to materialized views.
This one fixes that by passing correct write stats reference
to write response handler constructor.

Also at: https://github.com/psarna/scylla/commits/fix_3416_again

Closes #3416
Message-Id: <53ef3cc96ccadfdad8992d92ed6a41473419eb0a.1527510473.git.sarna@scylladb.com>
2018-05-28 17:50:49 +01:00
Piotr Jastrzebski
a7a152b27f Add test_uncompressed_compound_ck_read for SSTables 3.x
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-28 18:28:14 +02:00
Piotr Jastrzebski
5c0f9f17ba Add test_uncompressed_simple_read for SSTables 3.x
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-28 18:28:14 +02:00
Piotr Jastrzebski
c89b485871 Implement reading clustering key from SSTables 3.x
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-28 18:28:14 +02:00
Piotr Jastrzebski
101e38f19b column_translation: cache fixed value lengths for ck
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-28 18:28:14 +02:00
Piotr Jastrzebski
b7149d349c data_consume_rows_context_m: use cached fixed column value lenghts
Take them from column_translation instead of parsing the type every
time.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-28 18:28:14 +02:00
Piotr Jastrzebski
9d41c2299d column_translation: store fix lengths of column values
We don't need to parse the type every time.
It's better to cache fix lengths of column values
for sstable.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-28 18:28:14 +02:00
Piotr Jastrzebski
351c9e5d65 consume_row_start: change type of clustering key
Clustering key in 3.x format is stored differently
so it's easier to create a vector of temporary buffers
instead of a single block of concatenated bytes.

Each temporary buffer stores a value of a single
clustering column.

This is because the way clustering key is stored on disk
in SSTables 3.x is not the same as the way we store it
internally.

This means that we have to first read a value of every
clustering column into temporary_buffer and only then
we can create clustering key using a vector of those
temporary buffers.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-28 18:27:56 +02:00
Amnon Heiman
1f28e97458 sstable: Add has_partition_key method
This patch adds a helper function to sstable to check if it has a given
partition key.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-05-28 18:12:17 +03:00
Amnon Heiman
cd1f4ccb89 keys_test: add a test for nodetool_style string
This patch adds a test for single and compund partition key that is
created from a nodetoold style string.
2018-05-28 18:12:12 +03:00
Amnon Heiman
c517ee8353 keys: Add from_nodetool_style_string factory method
Based on:
8daaf9833a

This patch adds a from_nodetool_style_string factory method to partition_key.
The string format is follows the nodetool format, that column in the
partition keys are split by ':'.
For example, if a partition key has two column col1 and col2, to get the
partition key that has col1 = val1 and col2 = val2:

val1:val2
2018-05-28 18:09:51 +03:00
Tomasz Grabiec
aefb5e0fbd Merge "Get rid of cql_statement::execute_internal" from Avi
execute_internal() duplicates several code paths, especially in
the select path, for no good reason.  It boils down to timeout and
consistency level selection which can be done based on
client_state::is_internal().

This patchset eliminated the duplication and execute_internal(),
simplifying the code.

* github.com:avikivity/scylla cql-no-execute_internal/v2:
  cql: schema_altering_statement: make execute() and execute_internal()
    equivalent
  cql: select_statement: make execute() and execute_internal()
    equivalent
  cql: query_processor: don't call cql_statement::execute_internal() any
    more
  cql: cql_statement: remove execute_internal()
2018-05-28 13:01:43 +02:00
Avi Kivity
8033785b36 Update scylla-ami submodule
* dist/ami/files/scylla-ami 025644d...1f5329f (1):
  > scylla_install_ami: Update CentOS to latest version
2018-05-28 13:59:57 +03:00
Avi Kivity
ff3e86888a tests: report tests as they are completed
As each test completes, report it. This prevents a long-running
test in the beginning of the list from stalling output.
Message-Id: <20180526173517.23078-1-avi@scylladb.com>
2018-05-28 13:58:01 +03:00
Avi Kivity
3a4d11d374 Merge "Introduce frozen_mutation_fragment" from Paweł
"
This series introduces frozen_mutation_fragment which can be used to
send mutation_fragments over the wire to a remote node. The main
intended user is going to be the new streaming implementation.

The first part of the series fixes some IDL issues related to empty
structures and variant being the first member of a structure. Both these
problems make the generated code fail to build and they do not, in any
way, affect the existing on-wire protocol.

Logic responsible for freezing and unfreezing of mutation_fragments is
heavily based on the existing code for freezing mutations and shares the
same drawbacks (for example, unnecessary copy during unfreezing). These
preexisting performance problems can be fixed incrementally.

Another performance problem (which affects frozen_mutations as well, but
to a lesser extent) is that since the batching is done at a different
layer each frozen mutation fragment is a separate bytes_ostream object
owning at least one  memory buffer. If the mutation fragments are small
this will cause an excessive number of allocations. This could be solved
either by freezing fragments in batches (though it goes against the RPC
layer doing its own batching) or using bytes_ostream or an equivalent
object with a buffer allocation policy more suitable for such use cases.
This also is something that probably could be an incremental fix.

Tests: unit (release)
"

* tag 'frozen_mutation_fragment/v1-rebased' of https://github.com/pdziepak/scylla:
  idl: add idl description of frozen_mutation_fragments
  tests: add test for frozen_mutation_fragments
  frozen_mutation: introduce frozen_mutation_fragment
  tests/idl: test variant being the first member of a structure
  idl: create variant state in root node
  tests/idl: test serialising and deserialising empty structures
  idl-compiler: avoid unused variable in empty struct deserialisers
  tests/mutation_reader: disambiguate freeze() overload
2018-05-28 13:54:01 +03:00
Takuya ASADA
55d6be9254 Revert "dist/ami: update CentOS base image to latest version"
This reverts commit 69d226625a.
Since ami-4bf3d731 is Market Place AMI, not possible to publish public AMI based on it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180523112414.27307-1-syuu@scylladb.com>
2018-05-28 13:52:34 +03:00
Duarte Nunes
99d678d079 db/view: Remove ifdef'd Java code
It provides no useful information, so just get rid of it.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-05-28 11:51:23 +01:00
Duarte Nunes
ad18d535e9 db/view: Ignore scenario where base replica hasn't joined the ring
Apache Cassandra handles a case where the node hasn't joined the ring
and may consequentially have an outdated view of it. Following the same
reasoning as with the previous patch, we ignore this scenario. It
happens when there are range movements, and this node is bootstrapping,
but there are already other mechanisms in the cluster, such as hinted
handoff and dual-writing to replicas during range movements, that
contribute to this update eventually making its way to the view.

This patch doesn't change any behavior, but it provides the reasoning
why we won't use the batchlog as Cassandra does, or the hinted handoff
log as we will, to later send the update when the node is joined (note
that Cassandra just sends the mutations "later", and doesn't check
again for any condition or change).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-05-28 11:51:23 +01:00
Duarte Nunes
be45e6a1b7 db/view: Handle case when base has no paired view replica
If no view replica is paired with the current base replica, it means
there's a range movement going on (decommission or move), such that
this base replica is gaining new token ranges. The current node is
thus a pending_endpoint from the POV of the coordinator that sent the
request.

Sending view updates to the view replica this base will eventually be
paired with only makes a difference when the base update didn't make
it to the node which is currently being decommissioned or moved-from.

The update will, however, make it to that node if HH is enabled at the
coordinator, before the range movement finishes, or later to this node
when it becomes a natural endpoint for the token.

We still ensure we send to any pending view endpoints though, at least
until we handle that case more optimally.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-05-28 11:51:18 +01:00
Avi Kivity
b70febe246 cql: cql_statement: remove execute_internal()
With no callers, it can be safely removed.
2018-05-27 12:40:27 +03:00
Avi Kivity
c8a66efb6a cql: query_processor: don't call cql_statement::execute_internal() any more
All cql_statement::execute_internal() overrides now either throw or
call execute().  Since we shouldn't be calling the throwing overrides
internally, we can safely call execute() instead.  This allows us to
get rid of execute_internal().
2018-05-27 12:37:37 +03:00
Avi Kivity
eb19798f99 cql: select_statement: make execute() and execute_internal() equivalent
execute_internal(), for some code paths, differs from execute by the
following:
 1. it uses CL_ONE unconditionally
 2. it has no query timeout
 3. it doesn't use execution stages

for other code paths, it just calls execute.

As preparation for getting rid of execute_internal(), unify the two
code paths.

Commit 4859b759b9 caused the consistency level and timeouts
to be provided by the caller, so using the caller provided parameters
instead of overriding them does not change behavior.
2018-05-27 12:36:02 +03:00
Avi Kivity
d998f06633 cql: schema_altering_statement: make execute() and execute_internal() equivalent
To get rid of execute_internal(), make the normal execute() equivalent and call
it instead of having two different paths.
2018-05-27 11:08:55 +03:00
Duarte Nunes
4859b759b9 Merge 'Make all timeouts explicit' from Avi
"
This patchset makes all users of query_processor specify their timeouts
explicitly, in preparation for the removal of
cql_statement::execute_internal() (whose main function was to override
timeouts).
"

* tag 'cql-explicit-timeouts/v1' of https://github.com/avikivity/scylla:
  query_processor: require clients to specify timeout configuration
  query_processor: un-default consistency level in make_internal_options
2018-05-26 16:10:58 +02:00
Avi Kivity
6e97609049 Merge "Improve support for data types handling in SSTables 3.x" from Vladimir
"
Firstly, this patchset removes the is_fixed_length() function of
abstract_type in favour of value_length_if_fixed().

Secondly, it fixed the byte_type to be compatible with Cassandra which
erroneously treats it as a variable-length data type.

Lastly, it adds a unit test covering all non-composite CQL data types
for writing.

Tests: unit {release}
"

* 'projects/sstables-30/different-data-types/v1' of https://github.com/argenet/scylla:
  tests: Add a unit test for writing different data types to SSTables 3.x format.
  types: Treat byte_type as a variable-length type for compatibility reasons.
  types: Remove is_value_fixed() and use value_length_if_fixed() instead.
2018-05-26 10:24:35 +03:00
Vladimir Krivopalov
0951153292 tests: Add a unit test for writing different data types to SSTables 3.x format.
This tests covers all non-composite CQL data types.
The resulting files are dumped using sstabledump as follows:

[
  {
    "partition" : {
      "key" : [ "key" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 174,
        "liveness_info" : { "tstamp" : "1525385507816568" },
        "cells" : [
          { "name" : "asciival", "value" : "hello" },
          { "name" : "bigintval", "value" : 9223372036854775807 },
          { "name" : "blobval", "value" : "0x6772656174" },
          { "name" : "boolval", "value" : true },
          { "name" : "dateval", "value" : "2017-05-05" },
          { "name" : "decimalval", "value" : 5.45 },
          { "name" : "doubleval", "value" : 36.6 },
          { "name" : "durationval", "value" : 1h4m48s20ms },
          { "name" : "floatval", "value" : 7.62 },
          { "name" : "inetval", "value" : "192.168.0.110" },
          { "name" : "intval", "value" : -2147483648 },
          { "name" : "smallintval", "value" : 32767 },
          { "name" : "timeuuidval", "value" : "50554d6e-29bb-11e5-b345-feff819cdc9f" },
          { "name" : "timeval", "value" : "19:45:05.090000000" },
          { "name" : "tinyintval", "value" : 127 },
          { "name" : "tsval", "value" : "2015-05-01 09:30:54.234Z" },
          { "name" : "uuidval", "value" : "01234567-0123-0123-0123-0123456789ab" },
          { "name" : "varcharval", "value" : "привет" },
          { "name" : "varintval", "value" : 123 }
        ]
      }
    ]
  }
]

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-25 21:41:23 -07:00
Vladimir Krivopalov
3981dd6dd6 types: Treat byte_type as a variable-length type for compatibility reasons.
Although values of the byte_type that corresponds to CQL TINYINT type
always occupy only a single byte, Cassandra treats this it as a
variable-length type for SSTables 3.0 reading and writing.

While it is clearly a mistake at Cassandra side, we have to stay
compatible.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-25 21:41:23 -07:00
Vladimir Krivopalov
24cb062834 types: Remove is_value_fixed() and use value_length_if_fixed() instead.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-25 21:41:23 -07:00
Paweł Dziepak
ed12555192 idl: add idl description of frozen_mutation_fragments 2018-05-25 10:15:10 +01:00
Paweł Dziepak
0bac487426 tests: add test for frozen_mutation_fragments 2018-05-25 10:15:10 +01:00
Paweł Dziepak
aa4e589ace frozen_mutation: introduce frozen_mutation_fragment
This patch introduces IDL definition as well as serialisers and
deserialisers for freezing mutation_fragment so that they can be
transferred between nodes in a cluster.
2018-05-25 10:15:10 +01:00
Paweł Dziepak
b2e9491728 tests/idl: test variant being the first member of a structure 2018-05-25 10:15:10 +01:00
Paweł Dziepak
a5731ded98 idl: create variant state in root node
Each non-final IDL object is preceeded by a frame containing its size.
In case of boost::variant there is a frame for the variant itself, an
integer determining the active alternative of the variant and a frame of
that active alternative.

However, if a variant was the first member of a writable stub object the
IDL would generate code that would not write the frame for the variant.
This is not a very severe issue since there are no such cases right now
as  C++ type system would no allow such generated code to compile.
2018-05-25 10:15:10 +01:00
Paweł Dziepak
d731cf427d tests/idl: test serialising and deserialising empty structures 2018-05-25 10:15:10 +01:00
Paweł Dziepak
f719516be8 idl-compiler: avoid unused variable in empty struct deserialisers
Deserialisers generated by IDL compiler first create a substream
covering the deserialised structure and then skip and read appropriate
members. If there are no members the substream will be unused and prompt
the compiler to emit a warning.
2018-05-25 10:15:10 +01:00
Paweł Dziepak
fde9e1d55f tests/mutation_reader: disambiguate freeze() overload
freeze() is about to get overloaded so make sure we don't get any
ambiguities.
2018-05-25 10:15:10 +01:00
Duarte Nunes
4db0b4af58 Merge 'secondary index: Fixes for tables with multiple clustering columns' from Nadav
"
This patch series fixes #3405: secondary-index search only provided
correct results in certain cases, where entire partitions or contiguous
partition slices matched the query. When this was not the case, and
individual clustering rows match or do not match the query, the wrong
results were returned.

To fix this bug, we need to fix the two stages of secondary-index search:

1. In the first stage, we read from the index MV a list of row keys
   (i.e., primary keys) matching the query. We can no longer remember
   just the partition keys, and need to keep the list of full primary keys.

2. In the second stage, we have a list of rows (not partitions) and need
   to read their selected contents to return to the user. Since CQL queries
   do not have a syntax to select an arbitrary list of rows, we have to
   add new code to do such a selection.

Because we provide an ad-hoc, inefficient, implementation for the row
selection described in stage 2, these patches leave two paths in the code:
The old path, efficiently selecting entire partitions, and the new path,
selecting individual rows. The old path is still used when it is applicable,
which is when a partition key column or the first clustering key column
is searched.
"

* 'si-fix-v4' of http://github.com/nyh/scylla:
  secondary index: test multiple clustering column
  secondary index: fix wrong results returned in certain cases
  secondary index: method for fetching list of rows from base table
  secondary index: method for fetching list of rows from index
  select_statement.cc: refactor find_index_partition_ranges()
  select_statement.cc: fix variable lifetime errors
2018-05-24 21:36:18 +01:00
Nadav Har'El
a6d9ea2fb5 secondary index: test multiple clustering column
This patch adds a test for secondary indexes on a table which has many
columns - two partition key column, two clustering key columns, and two
regular columns. We add a bunch of data in various rows and partitions,
index all columns and search on this data and verify the results.

This test exposed various bugs in secondary index search, including
issue #3405. After we fixed those bugs, the test now passes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-05-24 15:56:57 +03:00
Nadav Har'El
1b29dd44f7 secondary index: fix wrong results returned in certain cases
The current secondary-index search code, in
indexed_table_select_statement::do_execute(), begins by fetching a list
of partitions, and then the content of these partitions from the base
table. However, in some cases, when the table has clustering columns and
not searching on the first one of them, doing this work in partition
granularity is wrong, and yields wrong results as demonstrated in
issue #3405.

So in this patch, we recognize the cases where we need to work in
clustering row granularity, and in those cases use the new functions
introduced in the previous patches - find_index_clustering_rows() and
the execute() variant taking a list of primary-keys of rows.

Fixes #3405.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-05-24 15:56:03 +03:00
Nadav Har'El
adf6d742be secondary index: method for fetching list of rows from base table
We add a new variant of select_statement::execute() which allows selecting
an arbitrary list of clustering rows. The existing execute() variant can't
do that - it can only take a list of *partitions*, and read the same
clustering rows from all of them.

The new select variant is not needed for regular CQL queries (which do
not have a syntax allowing reading a list of rows with arbitrary primary
keys), but we will need it for secondary index search, for solving
issue #3405.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-05-24 15:54:36 +03:00
Nadav Har'El
a096a82adc secondary index: method for fetching list of rows from index
We already have a method find_index_partition_ranges(), to fetch a list
of partition keys from the secondary index. However, as we shall see in
the following patches (and see also issue #3405), getting a list of entire
partitions is not always enough - the secondary index actually holds a list
of primary keys, which includes clustering keys, and in some queries we
can't just ignore them.

So this patch provides a new method find_index_clustering_rows(), to
query the secondary index and get a list of matching clustering keys.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-05-24 15:53:29 +03:00
Nadav Har'El
083b2ae573 select_statement.cc: refactor find_index_partition_ranges()
The function find_index_partition_ranges() is used in secondary index
searches for fetching a list of matching partition. In a following patch,
we want to add a similar function for getting a list of *rows*. To avoid
duplicate code, in this patch we split parts of find_index_partition_ranges()
into two new functions:

1. get_index_schema() returns a pointer to the index view's schema.

2. read_posting_list() reads from this view the posting list (i.e., list
   of keys) for the current searched value.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-05-24 15:50:45 +03:00
Nadav Har'El
7dc9b77682 select_statement.cc: fix variable lifetime errors
do_with() provides code a *reference* to an object which will be kept
alive. It is a mistake to make a copy of this object or of parts of it,
because then the lifetime of this copy will have to be maintained as well.

In particular, it is a mistake to do do_with(..., [] (auto x) { ... }) -
note how "auto x" appears instead of the correct "auto& x". This causes
the object to be copied, and its lifetime not maintained.

This patch fixes several cases where this rule was broken in
select_statement.cc. I could not reproduce actual crashes caused by
these mistakes, but in theory they could have happened.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-05-24 15:46:12 +03:00
Piotr Jastrzebski
3b6e80a180 Rename ROW_BODY state to CLUSTERING_ROW
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-24 12:48:33 +02:00
Avi Kivity
0b8d06ebf9 Merge seastar upstream
* seastar a48fe69...12cffef (5):
  > variant_utils: don't pass variant by rref to boost::apply_visitor
  > Revert "build: fix compilation issues on cmake. missing stdc++-fs"
  > reactor: prevent expected overflow from triggering ubsan warning
  > cmake: Add cmake option to disable testing altogether
  > build: fix compilation issues on cmake. missing stdc++-fs
2018-05-24 12:17:56 +03:00
Avi Kivity
f893dc61f0 Merge "Implement reading columns from SSTable 3 format" from Piotr
"
This patchset implements reading row columns from SSTable 3 format data file.

Tests: units (release)
"

* 'haaawk/sstables3/read-columns-v4' of ssh://github.com/scylladb/seastar-dev: (21 commits)
  Add test for reading column values of different types.
  Support all fixed size column types from SSTable 3.x
  Add abstract_type::value_length_if_fixed
  Add test for simple table with value
  flat_reader_assertions: Add produces_row taking column values
  Implement reading rows and columns in data_consume_rows_context_m
  Introduce column_flags_m
  Add column_translation to data_consume_rows_context_m
  Pass schema to data_consume_context
  Add column_translation.hh
  consumer_m: Add consume methods for consuming rows and columns
  Extract make_atomic_cell from mp_row_consumer_k_l
  Rename NON_STATIC_ROW_* states to ROW_BODY_*
  Add liveness_info and use it in reading sstables
  Add helper methods for parsing simple types.
  Add unfiltered_flags_m::has_all_columns
  data_consume_context: use make_unique instead of new
  Pass serialization_header to data_consume_rows_context*
  Use disk_string_vint_size for bytes_array_vint_size
  Introduce disk_string_vint_size type
  ...
2018-05-24 10:11:25 +03:00
Takuya ASADA
e0d49aae37 dist/debian: fix missing --configfile parameter on pdebuild
We need to specify --configfile on pdebuild too, otherwise we will
always fail to build .deb on newly created build environment.
Only reason why we still able to build .deb is we already copied
.pbuilderrc to home directory on existing build environment.

Fixes #3456

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180523204112.24669-1-syuu@scylladb.com>
2018-05-24 10:10:27 +03:00
Piotr Jastrzebski
7869bd98b1 Add test for reading column values of different types.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
a572d126e4 Support all fixed size column types from SSTable 3.x
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
7a25819e5a Add abstract_type::value_length_if_fixed
This info is used by SSTable 3.x format to read column values
without reading their lengths.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
f58f10d708 Add test for simple table with value
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
0a5d06b2f3 flat_reader_assertions: Add produces_row taking column values
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
9348006092 Implement reading rows and columns in data_consume_rows_context_m
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
f6e1c38486 Introduce column_flags_m
This will be used for reading columns from data file.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
609854e21a Add column_translation to data_consume_rows_context_m
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
7fd222e639 Pass schema to data_consume_context
It will be needed to obtain column_translation that will
be added to data_consume_context in the next patch.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
d3f3cd36dd Add column_translation.hh
It contains a class that manages mapping between sstable
columns and schema column definitions.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
25b8cf9e4c consumer_m: Add consume methods for consuming rows and columns
Also implement them in mp_row_consumer_m.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:53:29 +02:00
Piotr Jastrzebski
94e3138dc5 Extract make_atomic_cell from mp_row_consumer_k_l
It will be used in both mp_row_consumer_k_l and
mp_row_consumer_m.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
c6d5ebc274 Rename NON_STATIC_ROW_* states to ROW_BODY_*
New name describes the states in a better way as those states
will be used both for static and non-static rows.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
10c669d2b5 Add liveness_info and use it in reading sstables
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
b2f9841dd4 Add helper methods for parsing simple types.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
d8cd8e04ed Add unfiltered_flags_m::has_all_columns
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
51d079e17c data_consume_context: use make_unique instead of new
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
54ef775501 Pass serialization_header to data_consume_rows_context*
This header is needed to parse data for SSTable 3.0 format

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
b849eefc8c Use disk_string_vint_size for bytes_array_vint_size
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
76f0f2693d Introduce disk_string_vint_size type
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:30:03 +02:00
Piotr Jastrzebski
5ca4bfd69a disk_array_vint_size: Remove unused Size template parameter
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:15:44 +02:00
Duarte Nunes
4eb47d136b Merge 'Introduce authorized_prepared_statements_cache' from Vlad
"
This series introduces a cache of already authenticated prepared statements which
is meant to optimize the prepared statement lookup when authentication is enabled.

This cache allows to perform a single cache lookup per EXECUTE operation as opposed
to at least 2 lookups: one in the prepared statements cache and one in the authentication
cache.

Tests:
   - cql_query_test {debug, release}.
   - cassandra-stress with authentication enabled and with short eviction timeout.
   - Manual (with printouts) checks:
      - Tested the eviction due to eviction in the prepared_statements_cache:
         - Artificially decreased the prepared_statements_cache size and ran c-s with different keyspaces.
         - Verified that the corresponding authorized_prepared_statements_cache entry is evicted and re-populated.
      - Tested the BATCH of prepared statements (with dtest infrastructure):
         - Verified that for each prepared statement authorized_prepared_statements_cache is updated only once:
            - The batch contained a few entries of the same prepared statement.
"

* 'authorized_prepared_statements_cache-v3' of https://github.com/vladzcloudius/scylla:
  cql3: use authorized_prepared_statements_cache in the BATCH processing
  cql3::statements::batch_statement: introduce a single_statement class
  cql3: introduce the authorized_prepared_statements_cache class
  loading_shared_values: introduce the templated find() overload
  tests: loading_cache_test: add a tests for a loading_cache::remove(key)/remove(iterator)
  utils::loading_cache: add remove(key)/remove(iterator) methods
  cql3::query_processor: properly stop() prepared_statements_cache object
2018-05-23 14:40:09 +01:00
Avi Kivity
3dd2f68712 dist: drop libunwind dependency
Since Seastar no longer (1f005fb434) requires libunwind, we can
drop it from our dependency list.  This helps the power build, for
which no libunwind is available.

Fixes #3453.
Message-Id: <20180523114750.10753-1-avi@scylladb.com>
2018-05-23 13:53:29 +02:00
Avi Kivity
1f005fb434 Merge seastar upstream
* seastar 5da5d4e...a48fe69 (1):
  > backtrace: drop libwind in favor of libc backtrace()
2018-05-23 14:42:14 +03:00
Duarte Nunes
eed09dfdf9 mutation_partition: Throw std::out_of_range with backtrace on cell_at
Makes it easier to investigate bugs.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180521133753.16375-1-duarte@scylladb.com>
2018-05-23 13:51:54 +03:00
Avi Kivity
701e6f2cff Merge "Implement backlog controller for TWCS" from Glauber
"
This series implements the backlog tracker for TWCS, allowing it to
be controlled. The backlog for a TWCS colum family is just the sum of
the SizeTiered backlogs for all the windows that we know about.

A possible optimization for this is to stop tracking windows after
they become old enough and revert to zero backlog. I reverted that
last minute, though, since this will probably cause the backlog to
completely misrepresent reality if we import SSTables into old buckets
with things like repairs or nodetool refresh.
"

* 'twcs-backlog-v4.1' of github.com:glommer/scylla:
  backlog: implement backlog tracker for the TWCS
  STCS_backlog: allow users to query for the total bytes managed
  backlog: keep track of maximum timestamp in write monitor
  memtable: also keep track of max timestamp
2018-05-23 13:37:49 +03:00
Glauber Costa
44a89d654b backlog: implement backlog tracker for the TWCS
The TWCS backlog is relatively simple: we just need to keep track of
which SSTable belong to which time window (and actually as usual,
just their sizes). That is an easy thing to do since we can statically
calculate the time bound from the timestamp.

Once we do that we can just sum the backlogs for each individual window.
Time windows that are well enough into the past can be at some point
discarded when their backlogs become zero.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-23 06:20:21 -04:00
Nadav Har'El
433fc6c36e keys.hh: simplify empty clustering-key check
The exploded_clustering_prefix type has a convenient is_empty() method
and an even more convenient "operator bool" shortcut. Unfortunately,
the other clustering prefix types (clustering_key_prefix,
clustering_key_prefix_view) have, for historic reasons, an is_empty
method which takes a schema parameter. That also means they can't
have an "operator bool" shortcut.

But checking if a prefix doesn't really need the schema - all we need to
check is whether the byte representation is empty. The result is simpler
and more efficient code, and easier to use. It is also more consistent -
all clustering-key-related types will have an "operator bool" instead of
just some of them.

To avoid massive code changes, we leave a is_empty(schema) variant, which
simply calls is_empty(). There's already precedent for that - various
methods which have a variant taking schema (and ignoring it) and one
taking nothing.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180521174220.13262-1-nyh@scylladb.com>
2018-05-23 11:46:23 +02:00
Takuya ASADA
300af65555 dist/common/scripts/scylla_setup: abort running script when one of setup failed in silent mode
Current script silently continues even one of setup fails, need to
abort.

Fixes #3433

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180522180355.1648-1-syuu@scylladb.com>
2018-05-23 11:05:33 +03:00
Vlad Zolotarov
82f7d1d006 cql3: use authorized_prepared_statements_cache in the BATCH processing
Like with the EXECUTE command avoid authorizing the same prepared
statement twice - this time in the context of processing the BATCH
command.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-22 20:15:03 -04:00
Vlad Zolotarov
9723988926 cql3::statements::batch_statement: introduce a single_statement class
This is a helper class needed to control the handling process of a single
statement in the current batch. In particular it has the boolean defining
if the authorization is needed for this statement.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-22 20:15:03 -04:00
Vlad Zolotarov
a138c59991 cql3: introduce the authorized_prepared_statements_cache class
Add a cache that would store the checked weak pointer to already authorized prepared statements
and which key is a tuple of an authenticated_user and key of the prepared_statements_cache.

The entries will be held as long as the corresponding prepared statement is valid (cached)
and will be discarded with the period equal to the refresh period of the permissions cache.

Entries are also going to be discarded after 60 minutes if not used.

The purpose of this new cache is to save the lookup in the permissions cache for already authenticated
resource (whatever is needed to be authenticated for the particular prepared statement).

This is meant to improve the cache coherency as well (since we are going to look in a single cache
instead of two).

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-22 20:15:03 -04:00
Vlad Zolotarov
3114cef42c loading_shared_values: introduce the templated find() overload
This overload alows searching the elements by an arbitrary key as long as it is "hashable"
to the same values as the default key and if there is a comparator for
this new key.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-22 20:15:00 -04:00
Vlad Zolotarov
ab251a1fc3 tests: loading_cache_test: add a tests for a loading_cache::remove(key)/remove(iterator)
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-22 20:05:01 -04:00
Vlad Zolotarov
34620deee4 utils::loading_cache: add remove(key)/remove(iterator) methods
remove(key): removes the entry with the given key if exists, otherwise does nothing.
remote(iterator): removes an entry by a given iterator (returned from loading_cache::find()).

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-22 20:05:00 -04:00
Piotr Sarna
b7ac2da238 main: initialize hints manager unconditionally
This commit makes sure that hints manager is always initialized,
including creating hints directories and starting it.
It needs to be fixed because hints manager is internally used
to store failed materialized view replicas.

Fixes #3451
Message-Id: <44532fd3704e20cabeb9c4985dace5650fd22d2c.1527018865.git.sarna@scylladb.com>
2018-05-22 22:21:50 +01:00
Duarte Nunes
ed2a1518f8 Merge 'Allow dropping tables with active secondary indexes' from Piotr
"
This series addresses issue #3202 about dropping a table with secondary
indexes present. Previously dropping such tables was impossible due to
materialized view restrictions (which is an implementation detail
of Scylla's secondary indexes).

Implemented:
 * fixing 'DROP KEYSPACE' with active materialized views
 * adapting schema_builder to make it easy to drop indexes
 * dropping all dependent SI before dropping a table
 * a test case for dropping a table with secondary indexes
"

* 'drop_si_before_drop_table_3' of https://github.com/psarna/scylla:
  tests: add test for dropping a table with secondary indexes
  migration_manager: allow dropping table with secondary indexes
  schema: add clearing indexes to schema builder
  database: do not truncate already removed views
2018-05-22 22:20:35 +01:00
Vlad Zolotarov
5bde36f29e cql3::query_processor: properly stop() prepared_statements_cache object
prepared_statements_cache has a timer that evicts old entries - it needs to be properly stopped.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-22 16:33:52 -04:00
Piotr Sarna
76848fb577 tests: add test for dropping a table with secondary indexes
This commit adds a test case for dropping a table with dependent
secondary indexes. Dependent materialized views prohibit the table
from being dropped, but dropping a table with dependent SI is legal.

References #3202
2018-05-22 21:10:51 +02:00
Piotr Sarna
7e4813a466 migration_manager: allow dropping table with secondary indexes
Previously dropping a table with secondary indexes failed, because
SI are internally backed by materialized views.
This commit triggers dropping dependent secondary indexes before
dropping a table.

Fixes #3202
2018-05-22 21:10:51 +02:00
Piotr Sarna
0513dc17a1 schema: add clearing indexes to schema builder
This commit adds 'without_indexes()' method to builder,
used to clear all previous index declarations from schema definition.
2018-05-22 21:10:51 +02:00
Piotr Sarna
f8237dd664 database: do not truncate already removed views
This commit clears table's views before truncating it
in drop_column_family function. The only case when
views are not empty during drop is when they're backing secondary
indexes of a base table and they are all atomically dropped
in the same go as the base table itself.
This change will prevent trying to truncate views that were
already dropped, which used to result in no_such_column_family error.

References #3202
2018-05-22 21:10:51 +02:00
Duarte Nunes
a3bbd52e2e Merge 'Add materialized view metrics' from Piotr
"
This series introduces materialized view statistics, as stated in issue #3385:
 - updates pushed
 - updates failed
 - row lock stats

It also addresses issue #3416 by decoupling user write stats from view
update stats.
"

* 'materialized_view_metrics_9' of https://github.com/psarna/scylla:
  view: adapt view_stats to act as write stats
  storage_proxy: decouple write_stats from stats
  db: add row locking metrics
  view: add view metrics
2018-05-22 18:41:51 +01:00
Glauber Costa
be39736293 STCS_backlog: allow users to query for the total bytes managed
We would like to know whether there is still backlog at rest in a
particular STCS object. This is useful, for instance, in the TWCS
backlog, that uses STCS so it can delete old windows that are no longer
used.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 13:40:15 -04:00
Glauber Costa
b573a2ff61 backlog: keep track of maximum timestamp in write monitor
For sealed SSTables we can get the maximum timestamp from the statistics
component.  But for partially written SSTables, the metadata is not yet
available.

One way to solve this would be to make the SSTable statistics available
earlier. But we would end up with a maximum timestamp that potentially
changes all the time as we write more cells.

A better approach is to take note of what's the maximum timestamp in a
memtable before we start to flush, and when time comes for us to flush
we will use the progress manager to inform the consumers about the
maximum timestamp.

For SSTables being compacted, we can't know for sure what is the maximum
timestamp as some entries could be TTLd already. But the maximum of all
SSTables present in the compaction is a good enough estimation for this
purposes.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 12:55:58 -04:00
Glauber Costa
68d1c64e7a memtable: also keep track of max timestamp
We are now keeping track of the minimum timestamp in a memtable. Also
keep track of the max timestamp so we can know what it is before we
finish flushing the entire memtable to an SSTable. Will be used by
partially written SSTables undergoing TWCS.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 12:55:58 -04:00
Avi Kivity
49892a06b9 Merge "exception safety and minimum work for compaction controller" from Glauber
"
This was sent before as two separate patchsets. It is now unified
because it has a lot of common infrastructure.

In this patchset I am aiming at two goals:

1) Provide a minimum amount of shares for user-initiated operations like
nodetool compact and nodetool cleanup

2) Be more robust with exceptions in the backlog tracker

For the first, the main difference is that I now made the compaction
controller a part of the compaction manager. It then becomes easy to
consult with the compaction controller for the correct amount of shares
those operations should have.

In compaction_strategy.cc, the major_compaction_strategy object was
actually already unused before. So instead of making use of it, which
would require some form of information flow downwards about the backlog
we need to export, I am creating a user-initiated backlog type inside
the compaction manager.

With the two changes described above everything is very well
self-contained within the compaction manager and the implementation
becomes trivial.

For the second, I am now handling exceptions in two places:

1) the backlog computation. Those are const functions so if we just have
a transient exception when compacting the backlog, all we need to do is
return some fixed amount of shares and try again in the next adjustment
window.

2) the process of adding / removing SSTables. Those are harder, since if
we fail to manipulate the list we'll be left in an inconsistent state.
The best approach is then to disable the backlog tracker and return a
fixed amount of shares globally.

Tests: unit (release)
"

* 'backlog-improvements-v3' of github.com:glommer/scylla:
  compaction_manager: disable backlog tracker if we see an exception
  backlog tracker: protect against exceptions in backlog calculation.
  STCS_backlog: protect against negative backlog
  STCS_backlog: remove unused attribute
  compaction strategy: move size tiered backlog to a header
  compaction_strategy: delete major_compaction_strategy class
  compaction: make sure that user-initiated compactions always have a minimum priority
  backlog_controller: add constants to represent a globally disabled controller
  backlog_controller: move compaction controller to the compaction manager
  backlog_controller: allow users to compute inverse function of shares
2018-05-22 18:35:42 +03:00
Piotr Sarna
3792bed3ed view: adapt view_stats to act as write stats
This commit adapts view_stats structure so it can be passed
to storage_proxy as write stats. Thanks to that, mv replica updates
will not interfere with user write metrics. As a side effect it also
provides more stats to replica view updates.

Closes #3385
Closes #3416
2018-05-22 16:52:58 +02:00
Piotr Sarna
1d590b3ca4 storage_proxy: decouple write_stats from stats
This commit extracts metrics related to writes from stats structure,
so it can be easily replaced later, e.g. for materialized view metrics.

References #3385
References #3416
2018-05-22 16:52:58 +02:00
Piotr Sarna
9246bb36bc db: add row locking metrics
This commit adds statistics to row_locker class. Metrics are
independendly counted for all lock types: row<->partition and
exclusive<->shared.

Metrics gathered:
 - total acquisitions
 - operations that wait on the lock
 - histogram of the time spent on waiting on this type of lock

References #3385
References #3416
2018-05-22 16:52:58 +02:00
Piotr Sarna
49bebcfa25 view: add view metrics
This commit introduces view statistics:
 - updates pushed to local/remote replicas
 - updates failed to be pushed to local/remote replicas

Metrics are kept on per-table basis, i.e. updates_pushed_remote
shows the number of total updates (mutations) pushed to all paired
mv replicas that this particular table has.
Every single update is taken into consideration, so if view update
requires removing a row from one view and adding a row to another,
it will be counted as 2 updates.

References #3385
References #3416
2018-05-22 16:52:58 +02:00
Tomasz Grabiec
e554a39fbb tests: memtable_snapshot_source: Fix compact()
Compactor collects all currently active memtables and later replaces
them with the merged result. The problem is that active memtable
belongs to the input set during compaction and as a result mutations
applied concurrently with compaction could be lost once compaction
replaces the memtables. The fix is to open a new active memtable when
compaction starts.

Caused sporadic failures of row_cache_test.cc:test_continuity_is_populated_when_read_overlaps_with_older_version()
Message-Id: <1526997724-13037-1-git-send-email-tgrabiec@scylladb.com>
2018-05-22 15:08:07 +01:00
Glauber Costa
d4e7783188 compaction_manager: disable backlog tracker if we see an exception
If we see an exception when adding or removing SSTables from the backlog
tracker, the backlog tracker can be inconsistent forever. It would be
best if we act before that happens and disable the backlog tracker. Once
the backlog tracker is disabled it will default to returning a fixed
number of shares.

We can either disable the backlog tracker or remove it. But if we remove
it we can end up with a backlog of zero if that's the only tracker with
a backlog. We then keep it registered but mark it as disabled. This also
leaves room for recovery in some situations: we can recover the backlog
by a doing a schema change in the column family that had the backlog
disabled, for instance.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:36:32 -04:00
Glauber Costa
fde26ec633 backlog tracker: protect against exceptions in backlog calculation.
Backlog calculations should be exception free, but there are at cases in
which I can see they happening. One example is if  some backlog tracker
that uses temporary objects fails an allocation.

Memory shortages can be specially pernicious: if we leave the
responsibility of catching those to the individual backlog tracker, we
will keep trying to make more allocations in the other backlog trackers
if we have many column families. By handling it here we can stop that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:36:22 -04:00
Glauber Costa
3e08bd17f0 STCS_backlog: protect against negative backlog
A negative backlog can be interpreted as a very large backlog.
Part of that is because we keep the total_size as an unsigned type,
which is what we expect. But in case there is an issue-- like an
exception that causes some SSTable not to be tracked then this size
can become negative. Returning a zero backlog is better than allowing
it to be interpreted as a giant number.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:36:22 -04:00
Glauber Costa
4b4e9f6c8c STCS_backlog: remove unused attribute
This attribute ended up being unused in the final version.
Spotted now while reading the code for other purposes.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:36:22 -04:00
Glauber Costa
10046593be compaction strategy: move size tiered backlog to a header
It's very common to other strategies to include a SizeTiered
step somehow inside their algorithms: LCS will do SizeTiered on
L0, TWCS will do SizeTiered within a window, etc.

To make it easier for those strategies to consume the SizeTiered
backlog tracker, we will move that to its own file.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:36:22 -04:00
Glauber Costa
36ccb1dd7c compaction_strategy: delete major_compaction_strategy class
It was already unused before this series. In an earlier version I have
used it to provide an ad-hoc backlog for major compactions. But now that
this is done by the compaction manager, this class really isn't being
used.

And it is likely it won't be: major compaction is not a compaction
strategy a user can choose, unlike the others that need to be built
through make_compaction_strategy.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:33:59 -04:00
Glauber Costa
9320d6f17f compaction: make sure that user-initiated compactions always have a minimum priority
We have observed the following behavior with user initiated compactions,
like major compactions:

- if there are no writes, the backlog doesn't increase.
- as compaction progresses the backlog decreases.
- at some point, the backlog is so low that compaction barely makes any
  progress.

Going forward, we should allow one to read from the generated partial
SSTables, in which case this doesn't matter that much. But for
user-iniated compactions we would like to guarantee a minimum baseline.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:33:25 -04:00
Glauber Costa
c55ab93178 backlog_controller: add constants to represent a globally disabled controller
There are situations in which we want the controllers to stop working
altogether. Usually that's when we have an unimplemented controller or
some exception.

We want to return fixed shares in this case, but this is a very
different situation from when we want fixed shares for *one* backlog
tracker: we want to return fixed shares, yes, but if we disable 200
backlog trackers (because they all failed, for instance), we don't want
that fixed number x 200 to be our backlog.

So the mechanism to globally disable the controller is still granted,
and infinity is a good way to represent that. It's a float that the
controller can easily test against. But actually using infinity in the
code is confusing. People reading it may interpret it as the other way
around from what it means, just meaning "a very large backlog".

Let's turn that into a constant instead. It will help us convey meaning.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:25:23 -04:00
Glauber Costa
d758a416f8 backlog_controller: move compaction controller to the compaction manager
There was recently an attempt to add minimum shares to major compactions
which ended up being harder than it should be due to all the plumbing
necessary to call the compaction controller from inside the compaction
manager-- since it is currently a database object. We had this problem
again when trying to return fixed shares in case of an exception.

Taking a step back, all of those problems stem from the fact that the
compaction controller really shouldn't be a part of the database: as it
deals with compactions and its consequences it is a lot more natural to
have it inside the compaction manager to begin with.

Once we do that, all the aforementioned problems go away. So let's move
there where it belongs.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:24:19 -04:00
Calle Wilund
62c3b4c429 commitlog: Ensure file objects are closed before object free
Fixes #3446

Previously, only shutdown-synced objects where actually closed,
which is wrong.

This introduces yet another queue, processed together with the
deletion objects, which ensures we explicitly close all objects
that have been discarded.

Message-Id: <20180521140456.32100-1-calle@scylladb.com>
2018-05-22 14:52:06 +03:00
Duarte Nunes
4b2fd8d6f2 Merge 'Use hinted handoff to replay missed updates from base to view' from Piotr
"This series leverages hinted handoff for failed view replica
updates."

* 'materialized_view_updates_with_hh_5' of https://github.com/psarna/scylla:
  storage_proxy: enable hinted handoff for materialized views
  storage_proxy: make view updates use consistency_level::ANY
2018-05-22 11:24:37 +01:00
Paweł Dziepak
05c94bc98d mutation_partition: do not dereference null in find_cell()
row::find_cell() may be called for cells that do not exist in that row.
In such case nullptr shall be returned, this patch makes sure that
it is not dereferenced.
Message-Id: <20180522091726.24396-1-pdziepak@scylladb.com>
2018-05-22 10:31:09 +01:00
Glauber Costa
d3f985ef46 backlog_controller: allow users to compute inverse function of shares
There are some situations in which we want to force a specific amount of
shares and don't have a backlog. We can provide a function to get that
from the controller.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-21 19:35:07 -04:00
Avi Kivity
51f5599c75 Merge seastar upstream
* seastar a6cb005...5da5d4e (6):
  > append_challenged_posix_file_impl: Ensure continuation uses non-stale object
  > utils: make make_visitor() public
  > tcp: Adjust receive window
  > tcp: Fix allowed sending size calculation in can_send
  > tcp: Fix assert in tcp::tcb::output_one
  > be more descriptive with failed syscalls for filesystem operations

Contains alternative fix for #3446 (will also be fixed directly).
2018-05-21 20:35:30 +03:00
Piotr Sarna
f5d6326ced storage_proxy: enable hinted handoff for materialized views
This commit initializes and enables hinted handoff for materialized
views, even if HH is not explicitly turned on in config.

User writes still use hinted handoff only if it is explicitly enabled,
while materialized views are allowed to use it unconditionally
in order to store failed replica updates somewhere.

Fixes #3383
2018-05-21 17:09:27 +02:00
Piotr Sarna
da0d458f5f storage_proxy: make view updates use consistency_level::ANY
This commit makes view replica updates internally use consistency
level ANY, so in case an update fails it will fall back to hinted
handoff.

References #3383
2018-05-21 17:09:27 +02:00
Piotr Sarna
ba9e8a4f2c tests: initialize hints directory for cql env
This commit initializes hints_directory config value for cql_test_env.
It's needed now because materialized views support force-enables
hinted handoff.

Message-Id: <2aadf35eee329c1f89977c4a55660f330bd9d591.1526914827.git.sarna@scylladb.com>
2018-05-21 18:06:01 +03:00
Botond Dénes
204f6fd478 test.py: print test args when listing failed tests
This can be very helpful when a test only fails when run with some
particular arguments.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <dac1f7e23afa904156e65c3bb3c8fd52b7e999ff.1526906955.git.bdenes@scylladb.com>
2018-05-21 17:28:18 +03:00
Avi Kivity
f9c2ff1f9c install: prepare /etc directory
install(1) creates missing directories on recent Fedora, but not
on CentOS 7. This causes the RPM build (which installs to a pristine
tree, without an existing /etc) to fail.

Fix by setting up /etc.

Tests: rpm (Fedora, CentOS)
Message-Id: <20180520124937.20466-1-avi@scylladb.com>
2018-05-21 09:51:46 +02:00
Asias He
db8c3a7059 streaming: Do not use dht::split_ranges_to_shards
There is no need to call dht::split_ranges_to_shards to split the token
range into <shard> : <a lot of small ranges> mapping and create a flat
mutation reader with a lot of small ranges.

Because:

1) The flat mutation reader on each shard only returns data belongs to
this local shard, there is no correctness issue if we do not split and
feed the sub ranges only belongs to this local shard.

2) With murmur3_partitioner_ignore_msb_bits = 12, it is almost certain
that given a token range, all the shards will have data for the range
anyway. Even if we ask all the shards to work on the token range and
some of the shards have no data for it, it is fine. We simply send no
data from this shard.

Tests: update_cluster_layout_tests.py

Message-Id: <ac00cd21d6156c47b74451dd415d627481e48212.1526864222.git.asias@scylladb.com>
2018-05-21 10:42:45 +03:00
Takuya ASADA
5407c34c73 dist/debian: depends to coreutils instead of realpath on Ubuntu 18.04
On Ubuntu 18.04 realpath package is dropped, it becomes part of coreutils.

Fixes #3445

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180521031954.30815-1-syuu@scylladb.com>
2018-05-21 10:42:05 +03:00
Asias He
0c54c6e16f storage_service: Add node has left the cluster log
Remove a node from the cluster is a major operation, it deserves a log
for it. Add a log when node is removed from the cluster by `nodetool
decommission` or `nodetool removenode`.

Message-Id: <b6adf34492c8138296911f2b37b39e9dd8ed10a2.1523347916.git.asias@scylladb.com>
2018-05-19 21:47:05 +03:00
Asias He
e20038eb84 streaming: Handle stream_mutation rpc handler on all shards
In streaming, the sender sends the mutations on all the local shards in
parallel, it is possible that the receiver handle more than one such
connection on the same shard. It is determined by where the tcp
connection goes. Current rpc ignores the dest shard id when sending the
rpc message.

For instance, say node1 has 2 shards, node2 has 2 shards. Currently, we
can end up with like this:

   Node 1 shard 0 -> Node 2 shard 1
   Node 1 shard 1 -> Node 2 shard 1

It is better if we do:

   Node 1 shard 0 -> Node 2 shard 0
   Node 1 shard 1 -> Node 2 shard 1

This patch solves this problem by let the handler always handle on
shard = src_cpu_id % smp::count.

If sender and receiver have the same shard config, it is completely
distributed the work evenly.

If sender and receiver do not have the same shard config, it is
unavoidable some of the shard will do more work than the others.

Tests: dtest update_cluster_layout_tests.py

Message-Id: <911827bcf67459a07ec92623a9ed4c4fbba195ca.1524622375.git.asias@scylladb.com>
2018-05-19 21:08:25 +03:00
Calle Wilund
f69a52c475 storage_service: Add more error info to "isolate_on_error" shutdown
Fixes #2793

Prints error handle class (commitlog or "other/disk") + exception
type and message. While not exhaustive, at least gives a correlation
point to (hopefully) other log printouts.

Message-Id: <20180509081040.7676-1-calle@scylladb.com>
2018-05-19 21:06:03 +03:00
Piotr Jastrzebski
1520ffe7f5 sstables: check buffer size when reading vints
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <6ecbedae818fbef1f67a4472aba4ce443b9df0ee.1525888830.git.piotr@scylladb.com>
2018-05-19 21:01:45 +03:00
Avi Kivity
46a0109608 Merge "Support compression when writing SSTables 3.x." from Vladimir
"
For compression, SSTables 3.x format uses CRC32 for checksumming
compressed chunks as well as for calculating the full file checksum.
Also, while for older formats "full checksum" of a compressed data file
means a combination of checksums of its compressed chunks, in SSTables
3.x this now reads literally and assumes the checkum of all bytes
written, including per-chunk digests.

Tests: unit {debug, release}
"

* 'projects/sstables-30/write-compression/v3' of https://github.com/argenet/scylla:
  tests: Add unit tests for writing compressed SSTables 3.x.
  tests: Validate Digest32.crc for SSTables 3.x write tests.
  tests: Fix invalid Digest file for write_counter_table test.
  sstables: Support writing compressed SSTables 3.0.
  sstables: Make compressed streams customizable on checksumming.
  sstables: Move checksum calculation logic to compressed_output_stream.
2018-05-19 20:52:08 +03:00
Vladimir Krivopalov
d588a7e743 tests: Add unit tests for writing compressed SSTables 3.x.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-19 20:52:08 +03:00
Vladimir Krivopalov
e5ab271863 tests: Validate Digest32.crc for SSTables 3.x write tests.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-19 20:52:08 +03:00
Vladimir Krivopalov
fcc7bad777 tests: Fix invalid Digest file for write_counter_table test.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-19 20:52:07 +03:00
Vladimir Krivopalov
dd00d90a05 sstables: Support writing compressed SSTables 3.0.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-19 20:52:07 +03:00
Vladimir Krivopalov
cc62ad3b69 sstables: Make compressed streams customizable on checksumming.
Use either Adler32 or CRC32 while writing to or reading from a
compressed stream.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-19 20:52:07 +03:00
Vladimir Krivopalov
5183294676 sstables: Move checksum calculation logic to compressed_output_stream.
Previously, compressed_output_stream used to calculate checksum of the
supplied chunk and pass it to the 'compression' object to combine with
the full checksum calculated on prior writes.
Now, all the checksum calculation happens inside
compressed_output_stream and 'compression' only stores the result.

This is done to loosen ties between two classes and simplify
compressed_output_stream customisation with various checksum algorithms.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-19 20:52:07 +03:00
Glauber Costa
596a525950 commitlog: don't move pointer to segment
We are currently moving the pointer we acquired to the segment inside
the lambda in which we'll handle the cycle.

The problem is, we also use that same pointer inside the exception
handler. If an exception happens we'll access it and we'll crash.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180518125820.10726-1-glauber@scylladb.com>
2018-05-18 17:25:18 +02:00
Avi Kivity
684bb2042d Merge "Fixes and improvements for gdb LSA commands" from Tomasz
* tag 'tgrabiec/fixes-and-improvements-for-gdb-scripts-v1' of github.com:tgrabiec/scylla:
  gdb: Print live object size from 'scylla lsa-segment'
  gdb: Extend 'scylla segment-descs' output with full occupancy info
  gdb: Print allocated object's type name instead of full LSA migrator
  gdb: Fix LSA migrator discovery
  gdb: Drop code related to LSA zones
  gdb: Fix uses of removed segment_desctriptor::_lsa_managed
  lsa: Add use for debug::static_migrators
2018-05-17 15:54:21 +03:00
Tomasz Grabiec
d4a2d22812 gdb: Print live object size from 'scylla lsa-segment' 2018-05-17 14:22:20 +02:00
Tomasz Grabiec
08026a64c5 gdb: Extend 'scylla segment-descs' output with full occupancy info
After:

 0x600007220000: lsa free=24800  used=106272  81.08% region=0x600000403210
 0x600007240000: lsa free=13     used=131059  99.99% region=0x600000403210
 0x600007260000: lsa free=23072  used=108000  82.40% region=0x600000403210
 0x600007280000: lsa free=16772  used=114300  87.20% region=0x600000403210
 0x6000072a0000: lsa free=23996  used=107076  81.69% region=0x600000401410
 0x6000072c0000: lsa free=15552  used=115520  88.13% region=0x600000403210
2018-05-17 14:22:20 +02:00
Tomasz Grabiec
abd667d924 gdb: Print allocated object's type name instead of full LSA migrator
Before:

  0x6000302604e0: live {_vptr.migrate_fn_type = 0x3797a00 <vtable for standard_migrator<cache_entry>+16>, _migrators = std::any containing seastar::lw_shared_ptr<(anonymous namespace)::migrators> = {[contained value] = {_p = 0x600000080a80}}, _align = 8, _index = 0} @ 0x6000302604e8

After:

  0x6000302604e0: live cache_entry @ 0x6000302604e8
2018-05-17 14:22:14 +02:00
Tomasz Grabiec
653fcc10bb gdb: Fix LSA migrator discovery
Fixes 'scylla lsa-segment' which broke after recent changes, probably
commit b3699f286d.
2018-05-17 14:22:14 +02:00
Tomasz Grabiec
bb8f82f43f gdb: Drop code related to LSA zones
LSA zones have been removed.
2018-05-17 14:22:14 +02:00
Tomasz Grabiec
84a7961c23 gdb: Fix uses of removed segment_desctriptor::_lsa_managed 2018-05-17 14:22:14 +02:00
Tomasz Grabiec
498a4132c5 lsa: Add use for debug::static_migrators
Otherwise GDB complains about it being optimized out, breaking our
debug scritps.
2018-05-17 14:22:14 +02:00
Avi Kivity
d9c80cac26 dist: move Red Hat installation from .spec %install to new install.sh
Move code to a traditional install.sh script (more traditional would be
a "make install", but this is close enough).

This allows testing installation independently of packaging. In addition,
non-Red Hat-packaging can share much of the code in install.sh.

Ref #3243.

Tests: build+install rpm
Message-Id: <20180517114147.30863-1-avi@scylladb.com>
2018-05-17 13:46:27 +02:00
Avi Kivity
98967da94f Merge seastar upstream
* seastar 0a1a327...a6cb005 (1):
  > Merge " misc fixes for iotune" from Glauber
2018-05-17 12:42:46 +03:00
Avi Kivity
3b8118d4e5 dist: redhat: get rid of raid0.devices_discard_performance
This parameter is not available on recent Red Hat kernels or on
non-Red Hat kernels (it was removed on 3.10.0-772.el7,
RHBZ 1455932). The presence of the parameter on kernels that don't
support it cause the module load to fail, with the result that the
storage is not available.

Fix by removing the parameter. For someone running an older Red Hat
kernel the effect will be that discard is disabled, but they can fix
that by updating the kernel. For someone running a newer kernel, the
effect will be that they can access their data.

Fixes #3437.
Message-Id: <20180516134913.6540-1-avi@scylladb.com>
2018-05-16 15:38:29 +01:00
Avi Kivity
20271b3890 Update scylla-ami submodule
* dist/ami/files/scylla-ami e0b35dc...025644d (1):
  > Merge "AMI build fix" from Takuya
2018-05-16 12:33:45 +03:00
Avi Kivity
05cec4a265 Merge "Reduce LSA memory reclamation overhead" from Tomasz
"
Main optimization is in the patch titled "lsa: Reduce amount of segment compactions".

I measured 50% reduction of cache update run time in a steady state for an
append-only workload with large partition, in perf_row_cache_update version from:

  c3f9e6ce1f/tests/perf_row_cache_update.cc

Other workloads, and other allocation sites probably also could see the
improvement.
"

* tag 'tgrabiec/reduce-lsa-segment-compactions-v1' of github.com:tgrabiec/scylla:
  lsa: Expose counters for allocation and compaction throughput
  lsa: Reduce amount of segment compactions
  lsa: Avoid the call to segment_pool::descriptor() in compact()
  lsa: Make reclamation on reserve refill more efficient
2018-05-16 10:24:20 +03:00
Tomasz Grabiec
534068a0f7 Update seastar submodule
Fixes #3339

* seastar 840002c...0a1a327 (7):
  > Merge "fix perftune.py issues with cpu-masks on big machines" from Vlad
  > Merge 'Handle Intel's NICs in a special way'  from Vlad
  > reactor: fix calculation of idle ticks
  > log: streamline logging internals a little
  > Merge "CMake imrovements and compatibility" from Jesse
  > iotune: fix typo in property name
  > cmake: do not find_package(Boost ...) if Boost is a target
2018-05-16 09:11:22 +02:00
Avi Kivity
832e8fb1e0 Merge "Support writing counters in SSTables 3.x format." from Vladimir
"
This patchset adds support for writing counter cells in SSTables 3.x
format ('m'). The logic of writing counters is almost identical to that
used for the old 2.x format ('k'/'l') with the only difference that the
data length preceding serialised shards is written as a vint.

Tests: unit {release}.

Generated SSTables are verified to be processed fine by sstabledump
(note that sstabledump only outputs the binary data for counters, not
their actual values, same as sstable2json).

Verified with Cassandra 3.11 to get the expected values from the
counters table:
cqlsh> SELECT * from sst3.counter_table;

 pk  | ck  | rc1 | rc2
-----+-----+-----+-----
 key | ck1 |  10 |   1

(1 rows)

Verified that the deleted counter can no longer be updated:
cqlsh> use sst3 ;
cqlsh:sst3> UPDATE counter_table SET rc1 = rc1 + 2 WHERE pk = 'key' AND ck = 'ck2';
cqlsh:sst3> SELECT * from sst3.counter_table;

 pk  | ck  | rc1 | rc2
-----+-----+-----+-----
 key | ck1 |  10 |   1

(1 rows)
"

* 'projects/sstables-30/write_counters/v1' of https://github.com/argenet/scylla:
  tests: Unit tests to cover writing counters in SSTables 3.x format.
  sstables: Support writing counters for SSTables 3.x.
  sstables: Move code writing counter value into a separate helper.
2018-05-16 08:46:15 +03:00
Raphael S. Carvalho
59c57861ae tests/sstable_test: switch to dynamic temporary dir creation
sstable test fails when running concurrently (for example, release and debug
mode) because it uses a static temporary dir in lots of tests.
Let's fix it by switching to dynamic temporary dir, which is created using
mkdtemp(). Also the sstable tests will now run in /tmp, and so it's made
much faster.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180516042044.15336-1-raphaelsc@scylladb.com>
2018-05-16 08:00:29 +03:00
Tomasz Grabiec
4fdd61f1b0 lsa: Expose counters for allocation and compaction throughput
Allow observing amplification induced by segment compaction.
2018-05-15 21:49:01 +02:00
Tomasz Grabiec
3775a9ecec lsa: Reduce amount of segment compactions
Reclaiming memory through segment compaction is expensive. For
occupancy of 85%, in order to reclaim one free segment, we need to
compact 7 segments, by migrating 6 segments worth of data. This results
in significant amplification. Compaction involves moving objects,
which in some cases is expensive in itself as well
(See https://github.com/scylladb/scylla/issues/3247).

This patch reduces amount of segment compactions in favor of doing
more eviction. It especially helps workloads in which LRU order
matches allocation order, in which case there will be no segment
compaction, and just eviction.

In perf_row_cache_update test case for large partition with lots of
rows, which simulates appending workload, I measured that for each new
object allocated, 2 need to be migrated, before the patch. After the
patch, only 0.003 objects are migrated. This reduces run time of
cache update part by 50%.
2018-05-15 21:49:01 +02:00
Vladimir Krivopalov
a16b8d5d77 tests: Unit tests to cover writing counters in SSTables 3.x format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-15 11:44:44 -07:00
Vladimir Krivopalov
ffd8886da9 sstables: Support writing counters for SSTables 3.x.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-15 11:44:44 -07:00
Vladimir Krivopalov
28c3c21c73 sstables: Move code writing counter value into a separate helper.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-15 11:44:44 -07:00
Avi Kivity
5f3a5c436e Merge "chunked vector memory estimation" from Glauber
"
The memory estimations we have when using the chunked vector
are usually slightly wrong. We can make them more accurate by
exporting the memory usage directly as a chunked_vector API.
"

* 'chunked_memory-v2' of github.com:glommer/scylla:
  large_bitset: be more accurate with memory usage
  chunked_vector: exports its current memory usage
2018-05-15 19:00:36 +03:00
Glauber Costa
2ba08178ca large_bitset: be more accurate with memory usage
We are slightly underestimating the amount of memory we use. Now that
the chunked vector can exports its internal memory usage we can use that
directly.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-15 11:22:21 -04:00
Glauber Costa
7190bb4f95 chunked_vector: exports its current memory usage
There are times in which we would like to estimate how much memory
a chunked_vector is using. We have two strategies to do it:

1) multiply the size by the size of the elements. That is wrong, because
the chunked_vector can allocate larger chunks in anticipation of more
elements to come.

2) multiply the number of chunks by 128kB. That is also wrong, because
the chunk_vector will not always allocate the entire chunk if there are
only a few elements in it.

The best way to deal with it is to allow the chunked_vector to exports
its current memory usage.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-15 11:22:21 -04:00
Raphael S. Carvalho
83e64192d3 tests/perf: fix compaction and write mode of perf_sstable
storage_service_for_tests must be instantiated only once at a global
scope.

Fixes #3369.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180510042200.2548-1-raphaelsc@scylladb.com>
2018-05-15 18:00:18 +03:00
Avi Kivity
e0ef39705f dist: redhat: properly package scylla_blocktune.py
Commit 9eb8ea8b11 installed
scylla_blocktune.py as part of preparing the rpm, but forgot
to add it to the installed file list, breaking the rpm build.

Fix by listing the file in the %files section.
Message-Id: <20180506202807.5719-1-avi@scylladb.com>
2018-05-15 18:00:05 +03:00
Piotr Sarna
40bf5d671b cql: add secondary index metrics
This commit adds basic secondary index metrics to cql_stats:
 * total number of indexes creates
 * total number of indexes dropped
 * total number of reads from a secondary index
 * total number of rows read from a secondary index

References #3384
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <d5eda7a343cee547c921dd4d289ecb1ac1c2bf24.1526374243.git.sarna@scylladb.com>
2018-05-15 17:59:53 +03:00
Avi Kivity
4f81e1f55a Merge "Use CRC32 to calculate checksums for SSTables 3.0." from Vladimir
"
SSTables 3.x (format 'm') use CRC32 instead of Adler32 for calculating
checksums. This patchset introduces support for CRC32 along with Adler32
in checksummed_file_writer to be used for SSTables written in 'mc'
format.

Structures and helpers introduced for CRC32 will be later used for
calculating checksums for compressed files as well (not a part of this
patchset).

Tests: unit {release}
"

* 'projects/sstables-30/write-digest-crc/v3' of https://github.com/argenet/scylla:
  tests: Add test covering checksumming SSTables 3.0 with CRC32.
  sstables: Support CRC32 checksum for SSTables 3.x.
  sstables: Move adler32 routines under the scope of a class.
  sstables: Move checksum utils into separate header.
  sstables: Remove unused 'checksum_file' flag from checksummed_file_writer.
2018-05-15 10:18:14 +03:00
Duarte Nunes
3a7d655d01 Merge 'transport: reduce unneeded continuations' from Avi
"
The native protocol server generates mant reactor tasks that
can be easily eliminated. I measured a read workload with 100%
cache hit rate, seeing the number of tasks per request drop
from ~31 to ~27, and an increase of 3% in throughput.
"

* tag 'transport-optimize-1/v1' of https://github.com/avikivity/scylla:
  transport: remove unused capture of flags variable
  transport: merge response write and error handling continuations
  transport: make write_repsonse() return void
  transport: de-template a lambda
  transport: merge memory-management and logging continuations
  transport: remove gate continuation
  transport: merge two response processing continuations
  transport: simplify response processing continuation
  transport: remove gratuitous continuation from process_request_one()
2018-05-14 10:12:07 +01:00
Avi Kivity
a99e820bb9 query_processor: require clients to specify timeout configuration
Remove implicit timeouts and replace with caller-specified timeouts.
This allows removing the ambiguity about what timeout a statement is
executed with, and allows removing cql_statement::execute_internal(),
which mostly overrode timeouts and consistency levels.

Timeout selection is now as follows:

  query_processor::*_internal: infinite timeout, CL=ONE
  query_processor::process(), execute(): user-specified consisistency level and timeout

All callers were adjusted to specify an infinite timeout. This can be
further adjusted later to use the "other" timeout for DCL and the
read or write timeout (as needed) for authentication in the normal
query path.

Note that infinite timeouts don't mean that the query will hang; as
soon as the failure detector decides that the node is down, RPC
responses will termiante with a failure and the query will fail.
2018-05-14 09:41:06 +03:00
Avi Kivity
4500baaaf4 transport: remove unused capture of flags variable 2018-05-14 09:41:06 +03:00
Avi Kivity
2a1f231f82 query_processor: un-default consistency level in make_internal_options
Make the consistency level explicit in the caller in order to clarify
what is going on.

An "internal" query used to mean that it was accessing local tables,
so infinite timeouts and a consistency level of ONE were indicated,
but authentication accesses non-local tables so explicit consistency
level and timeouts are needed.
2018-05-14 09:41:06 +03:00
Avi Kivity
88f8fe3168 transport: merge response write and error handling continuations
The response write continuation does not defer, so traditional try/catch
works well and saves a continuation.
2018-05-14 09:41:06 +03:00
Avi Kivity
3e8d1c8fd7 transport: make write_repsonse() return void
It just schedules the response, and returns immediately.

(I thought about calling it schedule_response(), but usually it will
write the response immediately, since waiting for network writes is
rare in a local network).
2018-05-14 09:41:06 +03:00
Avi Kivity
b26f36c2ec transport: de-template a lambda
Generic templates = annoying.
2018-05-14 09:41:06 +03:00
Avi Kivity
7a9b73f166 transport: merge memory-management and logging continuations
Merge a continuation that just keeps things alive with another that
just logs things.
2018-05-14 09:41:06 +03:00
Avi Kivity
f0887a55e4 transport: remove gate continuation
with_gate() generates a continuation if the protected function defers.
Avoid that by merging a gate::leave() call with another, preexisting,
continuation.
2018-05-14 09:41:06 +03:00
Avi Kivity
876837a5da transport: merge two response processing continuations
We have one coninuation transforming the result, and another shutting
down tracing. Since the first cannot defer, we can merge the two, reducing
the number of tasks processed by the reactor.
2018-05-14 09:41:06 +03:00
Avi Kivity
38619138be transport: simplify response processing continuation
A continuation in the response processing path is only doing
transformation on the output. Make that clear by returning a value,
not a future.
2018-05-14 09:41:06 +03:00
Avi Kivity
f0a1478b6c transport: remove gratuitous continuation from process_request_one()
No need to call then() just to convert exceptions to futures,
futurize_apply() does this with less ado.
2018-05-14 09:41:06 +03:00
Vladimir Krivopalov
1da6144f90 tests: Add test covering checksumming SSTables 3.0 with CRC32.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-13 12:38:25 -07:00
Vladimir Krivopalov
e6dfa008d8 sstables: Support CRC32 checksum for SSTables 3.x.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-13 12:38:25 -07:00
Vladimir Krivopalov
adb43959d1 sstables: Move adler32 routines under the scope of a class.
This is a step towards making digest algorithm customizable at compile
time.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-13 12:38:25 -07:00
Vladimir Krivopalov
4e4030676f sstables: Move checksum utils into separate header.
Checksummed writer doesn't need to include all compression stuff.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-13 12:38:25 -07:00
Nadav Har'El
f5536d607e secondary index: fix multiple appearance of rows
This patch fixes a bug where queries using a secondary index would, in
some cases, produce the same rows multiple times.

The problem was that the code begins by finding a list of primary keys
that match the search, and then work on the partitions containing them.
If multiple rows matched in the same partition, the partition was considered
multiple times, and the same rows were output multiple times.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180510203141.17157-1-nyh@scylladb.com>
2018-05-13 20:08:14 +02:00
Avi Kivity
7d29addb1f mutation_reader: optimize make_combined_reader for the single-reader case
If we're given a single reader (can be common in a low-write-rate table,
where most of the data will be in a single large sstable, or in leveled
tables) then we can avoid the overhead of the combining reader by returning
the single input.

Tests: unit (release)
Message-Id: <20180513130333.15424-1-avi@scylladb.com>
2018-05-13 20:07:10 +02:00
Duarte Nunes
a23bda3393 Merge 'Implement separate timeout for range queries' from Avi
"
This patchset implements separate timeouts for range queries, and lays
the foundations for separate timeouts for other query types.

While the feature in itself is worthy, the real motivation is to have
the timeouts decided by the caller, instead of storage_proxy. This in
turn is required to disentangle each layer behaving differently
depending on whether the query is internal or not; instead, the goal
is to have each caller declare its needs in terms of consistency level
and timeouts, and have the lower layers implement its requirements
instead of making their own decisions.

Fixes #3013.

Tests: unit (release)
"

* tag '3013/v1.1' of https://github.com/avikivity/scylla:
  storage_proxy: remove default_query_timeout()
  storage_proxy: don't use default timeouts
  query_options: augment with timeout_config
  thrift: configure thrift transport and handler with a timeout_config
  transport: configure native transport with a timeout_config
  cql3: define and populate timeout_config_selector
  timeout_config: introduce timeout configuration
2018-05-13 20:05:50 +02:00
Glauber Costa
3d2c4c1cf8 main: change I/O scheduler verification code
Before we accept running while not in developer mode, we verify that
the I/O Scheduler is properly configured. Up until now, that meant
verifying that --max-io-requests is properly set and that the number
of I/O Queues is enough to leave at least 4 requests per I/O Queue.

Systems that move to newer versions of Scylla may continue doing that,
so we need to be backwards compatible and keep testing for that.
However, newer systems will not set that option, but pass a YAML
property file (or string) instead. So we need to make sure that
either one of those is set.

If the property file is set, I am deciding here not to test for
number of I/O queues. scylla_io_setup will usually configure that
anyway, plus we plan on soon moving to all-shards-dispatch making
that less important.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180509163737.5907-1-glauber@scylladb.com>
2018-05-13 19:22:54 +03:00
Glauber Costa
2e0c673432 database: release flush permits earlier
There is an ongoing discussion in issue 2678 about the right time to
release permits. Right now we are releasing the permit after we flush
all data for the memtable plus the SSTables accompanying components -
plus flushing them, closing them, etc.

During all that time, we are increasing virtual dirty by adding more
data to the buffers but we are not able to decrease it-- until we
release the permit we can't start flushing the next memtable. This is
much more of a concern than I/O overlapping as described in the issue.

We have a hook in the SSTable write process that is (should be) called
as soon as data is written. We should move the permit release there.

We aren't, though, calling that as early as we could. The call to the
data written hook is writing after the Index is closed, summary is
sealed, etc.

This patch fixes that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180508182746.28310-2-glauber@scylladb.com>
2018-05-13 19:22:54 +03:00
Tomasz Grabiec
8faafdaae5 lsa: Avoid the call to segment_pool::descriptor() in compact() 2018-05-11 19:07:23 +02:00
Tomasz Grabiec
19edf3970e lsa: Make reclamation on reserve refill more efficient
Currently reserve refill allocates segments repeatedly until the
reserve threhsold is met. If single segment allocation needs to
reclaim memory, it will ask the reclaimer for one segment. The
reclaimer could make better decisions if it knew the total number of
segments we try to allocate. In particular, it would not attempt to
compact any segment until it evicts total amount of memory first,
which may reduce the total amount of segment compactions during
refill.

This patch changes refill to increase reclamation step used by
allocate_segment() so that it matches the total amount of memory we
refill.
2018-05-11 19:07:23 +02:00
Takuya ASADA
6fa3c4dcad dist/redhat: replace scylla-libgcc72/scylla-libstdc++72 with scylla-2.2 metapackage
We have conflict between scylla-libgcc72/scylla-libstdc++72 and
scylla-libgcc73/scylla-libstdc++73, need to replace *72 package with
scylla-2.2 metapackage to prevent it.

Fixes #3373

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180510081246.17928-1-syuu@scylladb.com>
2018-05-11 09:41:57 +03:00
Vladimir Krivopalov
f443e85476 sstables: Remove unused 'checksum_file' flag from checksummed_file_writer.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-10 11:11:06 -07:00
Paweł Dziepak
863a96db48 Merge "Fix partition tombstones for SSTables 3.x" from Vladimir
"Previously, partition tombstone was not written for partitions with no
rows causing corrupted data files.

This is now fixed and covered with tests.

In addition, we now track partition tombstones while collecting encoding
statistics."

* 'projects/sstables-30/fix-partition-tombstone/v3' of https://github.com/argenet/scylla:
  tests: Don't use deprecated schema constructor.
  tests: Add tests to cover partitions consisting only of partition keys.
  sstables: Make sure partition level tombstone is written for partitions with no rows.
  memtable: Collect statistics from partition-level tombstone.
2018-05-10 16:27:20 +01:00
Vladimir Krivopalov
d7177d9013 tests: Don't use deprecated schema constructor.
Rely entirely on schema_builder facilities while preparing schema for
unit tests.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-10 08:13:29 -07:00
Vladimir Krivopalov
64cdb30379 tests: Add tests to cover partitions consisting only of partition keys.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-10 08:12:58 -07:00
Vladimir Krivopalov
97079208db sstables: Make sure partition level tombstone is written for partitions with no rows.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-10 07:28:54 -07:00
Vladimir Krivopalov
ffc3a1ffeb memtable: Collect statistics from partition-level tombstone.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-10 07:28:50 -07:00
Duarte Nunes
21ccf173a1 Merge 'Preparatory cleanup for stateful range-scans' from Botond
"
This is preparatory cleanup series with fixes/cleanup of miscellaneous
issues that I discovered while working on the stateful range-scans.
Since the stateful range-scans series, even without these patches, is a
20+ patches strong series I'd like to fast-track this, to ease reviewing
the former.
Most of the changes here are related to code-hygenie and effectiveness
and there is a patch that is correctness-related ("querier: check only
the end bound of ranges when matching them") and one that is related to
ease-of-use ("range: clean the deduced transformed type").
Note that altough these changes were made in the context of working on
the stateful range-scans they make sense on their own as well.

Tests: unit(release, debug)
"

* '1865/pre-range-scans-cleanup/v1' of https://github.com/denesb/scylla:
  multishard_combining_reader: use optimized optional for the shard reader
  Use dht::token_range alias for last/preferred replicas
  storage_proxy::coordinator_query_result: merge constructors into one w/ default params
  querier: check only the end bound of ranges when matching them
  querier: take range and slice by value
  querier: remove const params from make_compaction_state()
  querier: make _range and _slice const
  flat_multi_range_mutation_reader: optimize for non-plural range vectors
  range: clean the deduced transformed type
2018-05-10 11:09:44 +01:00
Botond Dénes
7a3eab90c8 multishard_combining_reader: use optimized optional for the shard reader
Use flat_mutation_reader_opt instead of
std::optional<flat_mutation_reader>.
2018-05-10 13:06:47 +03:00
Duarte Nunes
d49348b0e1 Merge 'Include OPTIONS with LIST ROLES' from Jesse
"
Fixes #3420.

Tests: dtest (`auth_test.py`), unit (release)
"

* 'jhk/fix_3420/v2' of https://github.com/hakuch/scylla:
  cql3: Include custom options in LIST ROLES
  auth: Query custom options from the `authenticator`
  auth: Add type alias for custom auth. options
2018-05-10 11:03:29 +01:00
Vladimir Krivopalov
e5477c6c6c utils: Use dedicated enum for Bloom filter format instead of a boolean.
It better reflects the purpose of the parameter and provides better type-safety.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <10a4fc16dafa0fb3234969041f68f9e7bfc61312.1525899669.git.vladimir@scylladb.com>
2018-05-10 09:47:41 +03:00
Avi Kivity
76c64e1f26 Merge "Prepare for the new in-memory representation" from Paweł
"
These patches were extracted from much larger series that introduces new
in-memory representation of cells. They contain various enhanecments and
fixes that to a varying degree make sense on its own. Sending them
separately will hopefuly ease the review and merging proces of the whole
IMR effort.

Tests: unit(release).
"

* tag 'pre-imr/v1' of https://github.com/pdziepak/scylla:
  tests/perf: add microbenchmarks for basic row operations
  tests: simple_schema: add make_row_from_serialized_value()
  row: add clear_hash()
  types: move compare_unsigned() to bytes.hh
  lsa: provide migrator with the object size
  lsa: add free() that does not require object size
  db/view/build_progress: avoid copying mutation fragment
  mutation_partition: enable ADL for cell swap
  types: make some collection_type_impl functions non-static
  counters: drop revertability of apply()
  mutable_view: add default constructor and const_iterator
  tests/mutation_reader: do not apply mutations created on another shard
  sstables: do not call atomic_cell::value() for dead cells
  lsa: sanitize use of migrators
  lsa: reuse registered migrator ids
  lsa: make migrators table thread-local
2018-05-10 09:41:49 +03:00
Botond Dénes
ddd70dc113 Use dht::token_range alias for last/preferred replicas
Use the pre-existing type alias instead of fully spelling out the type
everywhere.
2018-05-10 06:22:39 +03:00
Botond Dénes
52affa2a61 storage_proxy::coordinator_query_result: merge constructors into one w/ default params 2018-05-10 06:22:39 +03:00
Botond Dénes
3b6f4e4901 querier: check only the end bound of ranges when matching them
The querier provides a `matches(const nonwrapping_range&)` member to
allow for checking whether a range matches that with which the querier
was originally created. The check for match is more lax than a strict
equality check as ranges are shrunk query progresses.
Because of this the above member only checked that one of the bounds of
the examined ranges matches. This is adequate as for this purpose
because, in the context of a single query, it is guaranteed that no
two read requests to the same replica will have overlapping range.
However Avi pointed out in a recent, related review, that this check can
be made a little more strict by requiring that the end-bounds of the
two ranges *always* matches, instead of allowing any of the bounds to
match.
2018-05-10 06:22:39 +03:00
Botond Dénes
eba90d0208 querier: take range and slice by value
It needs to copy these anyway so give callers the opportunity to move
these in.
2018-05-10 06:22:39 +03:00
Botond Dénes
546a0e292e querier: remove const params from make_compaction_state() 2018-05-10 06:22:39 +03:00
Botond Dénes
bc01833cad querier: make _range and _slice const
Since we are storing them on the heap we can make them const and still
be movable. We get the cake and can eat it too.
2018-05-10 06:22:39 +03:00
Botond Dénes
f5b012c952 flat_multi_range_mutation_reader: optimize for non-plural range vectors
Don't create a flat_multi_range_mutation_reader when the range vector
has 0 or 1 element. In the former case create an empty reader and in the
latter just create a reader with the mutation-source with the only range
in the vector.
2018-05-10 06:22:39 +03:00
Botond Dénes
16319c2036 range: clean the deduced transformed type
wrapping_range and nonwrapping_range offer a transform() member function
which allows creating a new range by applying a transformer function to
the bounds of the current range. The type of bounds of the new range is
deduced from the return type for this transformer function. However the
return type is used as-is, with any CV or reference attached to it.
Since it doesn't make sense to create a range of references or a type
with CV qualifiers strip these off the deduced type.
2018-05-10 06:22:39 +03:00
Jesse Haber-Kucharsky
4ffb4c6788 cql3: Include custom options in LIST ROLES
An implementation of `authenticator` can support custom options for a
each role.

If, to make up an example, the authenticator supported the `region` key,
then a role would be created as follows:

CREATE ROLE jsmith WITH OPTIONS = { 'region': 'north_america' }
                    AND PASSWORD = 'super_secure';

LIST ROLES will now print this custom option map as an additional column
with the heading "options".

However, none of the implementations of `authenticator` in Scylla
currently support OPTIONS, so LIST ROLES will in practice, for now,
print the empty set:

 role      | super | login | options
-----------+-------+-------+---------
 cassandra |  True |  True |        {}
2018-05-09 21:17:14 -04:00
Jesse Haber-Kucharsky
cd0553ca6a auth: Query custom options from the authenticator
None of the `authenticator` implementations we have support custom
options, but we should support this operation to support the relevant
CQL statements.
2018-05-09 21:12:50 -04:00
Jesse Haber-Kucharsky
e149e48609 auth: Add type alias for custom auth. options 2018-05-09 21:12:47 -04:00
Paweł Dziepak
0b8a85b15f tests/perf: add microbenchmarks for basic row operations 2018-05-09 16:52:26 +01:00
Paweł Dziepak
e949061126 tests: simple_schema: add make_row_from_serialized_value()
simple_schema::make_row() is not very well suited for performance tests
of row and cell creation since it serialises the value. This patch
introduces a new function that performs only minimal actions.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
33dffd5fb6 row: add clear_hash()
Needed to measure the performance of hashing a cell.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
f9940f620a types: move compare_unsigned() to bytes.hh
compare_unsigned() is a general utility function that compares two
bytes_view byte-by-byte. There is no need to include whole type.hh in
order to make it available.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
c6c5accd19 lsa: provide migrator with the object size
While the migration function should have enough information to obtain
the object size itself, the LSA logic needs to compute it as well.
IMR is going to make calculating object sizes more expensive, so by
providing the infromation to the migrator we can avoid some needless
operations.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
884888dc11 lsa: add free() that does not require object size
It is non-trivial to get the size of an IMR object. However, the
standard allocator doesn't really need it and LSA can compute it itself
by asking the migrator.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
75b8b521d9 db/view/build_progress: avoid copying mutation fragment 2018-05-09 16:52:26 +01:00
Paweł Dziepak
00509913fc mutation_partition: enable ADL for cell swap
Calling fully qualified std::swap() prohibits the cell objects from
using their own swap implementations. This patch invokes std::swap in
the usual ADL-friendly way.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
0b4c6b8938 types: make some collection_type_impl functions non-static
The switch to the new in-memory representation will require a larger
parts of the logic be aware of the type of the values they are dealing
with. In most cases it is not a significant burden for the users.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
a2b5779714 counters: drop revertability of apply()
Since 4cfcd8055e 'Merge "Drop reversible
apply() from mutation_partition" from Tomasz' it is no longer required
for apply() to be revertable.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
f7438a8b96 mutable_view: add default constructor and const_iterator
Makes the interface more consistent with bytes_view.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
7c5c77369a tests/mutation_reader: do not apply mutations created on another shard
Scylla uses shared-nothing architecture and communication between the
shards is supposed to be very restricted. Applying to a memtable
mutations created on another shard is way to complex operation to be
allowed. Using frozen mutations is a much safer option.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
55d1d7adfb sstables: do not call atomic_cell::value() for dead cells
The preconditions of value() require the cell to be live.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
b1bec336b3 lsa: sanitize use of migrators
Having migrators dynamically registered and deregistered opens a new
class of bugs. This patch adds some additional checks in the debug mode
with the hopes of catching any misuse early.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
cca9f8c944 lsa: reuse registered migrator ids
With the introduction of the new in-memory representation we will get
type- and schema-dependent migrators. Since there is no bound how many
times they can be created and destroyed it is better to be safe and
reuse registered migrator ids.
2018-05-09 16:52:20 +01:00
Paweł Dziepak
b3699f286d lsa: make migrators table thread-local
Migrators can be registered and deregistered at any time. If the table
is not thread-local we risk race conditions.
2018-05-09 16:10:46 +01:00
Avi Kivity
8d09820472 Merge "Load serialization header for SSTables in 3.0 format" from Piotr
"
SSTable 3.0 format introduces serialization header which is used in reading SSTables in that format.
This patchset implements loading of this new component of Statistics.db.

Tests: units (release)
"

* 'haaawk/sstables3/load_serialization_header_v2' of ssh://github.com/scylladb/seastar-dev:
  Load serialization_header from statistics
  Add parse for disk_array_vint_size
  Add helpers to read/parse vints
  Add signed_vint::serialized_size_from_first_byte
  Add sstable::get_serialization_header
  Move random_access_reader to separate header
2018-05-09 17:48:48 +03:00
Glauber Costa
94f686f946 memtable controller: reduce adjustment period to 50ms
250ms is too high of a period for memtable controller. Since memtable
flushes are relatively efficient, specially in comparison to
compactions, if the shares are high we can flush a lot of data down with
the high shares - so in the next adjustment period our shares will be
minuscule and we won't flush much at all.

This leads to oscillating behavior that is mitigated by adjusting
faster.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180508182746.28310-3-glauber@scylladb.com>
2018-05-09 17:40:46 +03:00
Paweł Dziepak
920131b2f7 Merge "mvcc: Fix partition_snapshot::merge_partition_versions() to not leave latest versions unmerged" from Tomasz
"Fixes a bug in partition_snapshot::merge_partition_versions(), which would not
attempt merging if the snapshot is attached to the latest version (in which
case _version is nullptr and _entry is != nullptr). This would cause
partition_version objects to accumulate if there was an older snapshot and it
went away before the latest snapshot. Versions will be removed when the whole
entry goes away (flush or eviction).

May cause performance problems.

Fixes #3402."

* 'tgrabiec/fix-merge_partition_versions' of github.com:tgrabiec/scylla:
  mvcc: Test version merging when snapshots go away
  anchorless_list: Make ranges conform to SinglePassRange
  anchorless_list: Drop deprecated use of std::iterator
  mvcc: Fix partition_snapshot::merge_partition_versions() to not leave latest versions unmerged
2018-05-09 15:10:56 +01:00
Piotr Jastrzebski
70a204cdd0 Load serialization_header from statistics
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-09 15:46:59 +02:00
Piotr Jastrzebski
3e4bc923a8 Add parse for disk_array_vint_size
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-09 15:46:59 +02:00
Piotr Jastrzebski
6b4df2d424 Add helpers to read/parse vints
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-09 15:46:46 +02:00
Glauber Costa
aadc709068 scylla_io_setup: run new iotune.
The newer version of iotune, recently merged to Seastar, accepts
a new parameter that tells us where should we store the properties
about the disk.

We are already generating that properties file for the AMI case.
Let's also pass that parameter when calling iotune.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180507175757.9144-1-glauber@scylladb.com>
2018-05-09 16:32:43 +03:00
Amnon Heiman
6bf759128b scylla-housekeeping: support new 2018.1 path variation
Starting from 2018.1 and 2.2 there was a change in the repository path.
It was made to support multiple product (like manager and place the
enterprise in a different path).

As a result, the regular expression that look for the repository fail.

This patch change the way the path is searched, both rpm and debian
varations are combined and both options of the repository path are
unified.

See scylladb/scylla-enterprise#527

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180429151926.20431-1-amnon@scylladb.com>
2018-05-09 15:22:30 +03:00
Botond Dénes
777f3c7dc2 mutation_reader_test: don't lock up with smp=1
test_foreign_reader_destroyed_with_pending_read_ahead lock up completely
when run with SMP=1. As a solution skip the test-case when SMP < 2.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <815585c40a65a66f3b03e6393b46fbd6849c8ef5.1525866777.git.bdenes@scylladb.com>
2018-05-09 15:10:18 +03:00
Piotr Jastrzebski
b602dea726 Add signed_vint::serialized_size_from_first_byte
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-09 11:41:00 +02:00
Piotr Jastrzebski
589463165c Add sstable::get_serialization_header
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-09 11:40:59 +02:00
Piotr Jastrzebski
aa126639c0 Move random_access_reader to separate header
It will be used not only in sstables.cc but also
in helpers for reading sstables in M format.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-09 11:40:59 +02:00
Avi Kivity
911c2e7953 Merge "Support Bloom filter format for SSTables 3.x." from Vladimir
"
In SSTables 3.0, the base and increment fields have been swapped in
Bloom filters to reduce collisions (see CASSANDRA-8413). This affects
the resulting values written to Filter.db.

This patchset adds support for reading/writing Filter.db in the format
corresponding to the version of SSTables.

Tests: unit {release}

Filter.db files have been generated using Cassandra 3.11 with same data
as in unit tests and are validated to match those generated by Scylla.
"

* 'projects/sstables-30/write-filter/v1-2' of https://github.com/argenet/scylla:
  Fix mistakes and typos in comments (minor clean-up)
  Check Filter.db in SSTables 3.x write tests.
  Support Bloom filter format used in SSTables 3.0.
  Remove unused overload of i_filter::get_filter().
2018-05-09 11:16:09 +03:00
Vladimir Krivopalov
51c8ea74d6 sstables: generate non-empty summaries for m format
Add summary entries as needed. Also removes the duplicate line that
assigned summary byte cost.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <0d387c68523bae0c121cb15ad1e651ee9a8e4b4a.1525732404.git.vladimir@scylladb.com>
2018-05-09 11:15:02 +03:00
Vladimir Krivopalov
b59549cd16 Fix mistakes and typos in comments (minor clean-up)
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-08 15:28:43 -07:00
Vladimir Krivopalov
e739bb3280 Check Filter.db in SSTables 3.x write tests.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-08 15:28:35 -07:00
Vladimir Krivopalov
0f37c0e684 Support Bloom filter format used in SSTables 3.0.
The two hash values, base and increment, used to produce indices for
setting bits in the filter, have been swapped in SSTables 3.0.
See CASSANDRA-8413 for details.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-08 15:28:27 -07:00
Vladimir Krivopalov
fe2358e8bd Remove unused overload of i_filter::get_filter().
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-08 15:28:18 -07:00
Calle Wilund
b2b1a1f7e1 database: Fix assert in truncate
Fixes crash in cql_tests.StorageProxyCQLTester.table_test
"avoid race condition when deleting sstable on behalf..." changed
discard_sstables behaviour to only return rp:s for sstables owned
and submitted for deletion (not all matching time stamp),
which can in some cases cause zero rp returned.
Message-Id: <20180508070003.1110-1-calle@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
48c96d09d6 db::hints::manager: drain hints when the node is decommissioned/removed
When node is decommissioned/removed it will drain all its hints and all
remote nodes that have hints to it will drain their hints to this node.

What "drain" means? - The node that "drains" hints to a specific
destination will ignore failures and will continue sending hints till the end
of the current segment, erase it and move to the next one till there are
no more segments left.

After all hints are drained the corresponding hints directory is removed.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
ec76f8a27d db::hints::manager: add a few more trace messages
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
6ede32156f db::hints::manager::end_point_hints_manager::sender: add set_stopping()/stopping() methods
It's nicer to have access methods instead of working directly with enum_set methods and values.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
94da744f37 db::hints::manager::end_point_hints_manager::stop(): log the last exception instead of forwarding it
Returning a future with an exception from end_point_manager::stop()
is practically useless because the best the caller can do is to log
it and continue as if it didn't happen because it has other things
to shut down.

Therefore in order to simplify the caller we will log the exception
if it happens and will always return a non-exceptional future.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
8aedbf9d18 db::hints: manager.hh: cleanup: fix the comments
Fix the comments that went out of sync with the current implementation.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
5463b58faa db::hints::manager: rework end_point_hints_manager::stop() to use seastar::async()
This simplifies the code reading and extending.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Botond Dénes
6f7d919470 database: when dropping a table evict all relevant queriers
Queriers shouldn't outlive the table they read from as that could lead
to use-after-free problems when they are destroyed.

Fixes: #3414

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <3d7172cef79bb52b7097596e1d4ebba3a6ff757e.1525716986.git.bdenes@scylladb.com>
2018-05-07 21:20:25 +03:00
Duarte Nunes
c053275a48 db/view/row_locking: Add timeout when waiting for the lock
This ensures we respect the write timeout set by the client when
applying base writes, in case a writes takes too long to acquire the
row lock for the read-before-write phase of a materialized view
update.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180507132755.8751-1-duarte@scylladb.com>
2018-05-07 18:22:39 +01:00
Duarte Nunes
113294074d Merge seastar upstream
* seastar ac02df7...840002c (20):
  > dpdk: protect against missing statistics
  > alien: make visible in documentation
  > Merge "rewrite iotune to conform to the new ioscheduler" from Glauber
  > app_template: Correct outdated comment
  > apps, tests: Catch polymorphic exceptions by reference
  > configure.py: Enhance detection for gcc -fvisibility=hidden bug
  > reactor: add rudimentary task histogram reporting
  > Revert "Merge "rewrite iotune to conform to the new ioscheduler" from Glauber"
  > Merge "rewrite iotune to conform to the new ioscheduler" from Glauber
  > build: Use the same warning name for Clang and GCC
  > core/rwlock: Add support for timeouts
  > fs qualification: protect against EINTR
  > Docker: Fix failing build due to missing GNU make
  > reactor: move optional to experimental so we compile with c++14
  > future: remove allocation from future::get() thread context switch
  > Merge "rpc streaming" from Gleb
  > reactor: put mountpoint_params in seastar namespace
  > Tutorial: in PDF version of tutorial, better backtick typesetting
  > tutorial: support, and start using, links to other sections
  > tutorial: improve second half of semaphores section

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-05-07 18:22:39 +01:00
Tomasz Grabiec
58fe331c7e mvcc: Test version merging when snapshots go away 2018-05-07 13:54:30 +02:00
Avi Kivity
368e15a8e2 Update scylla-ami submodule
* dist/ami/files/scylla-ami 8a6e4dd...e0b35dc (1):
  > change default roles for EBS / ephemeral
2018-05-07 12:34:04 +03:00
Duarte Nunes
4b3562c3f5 db/view: Limit number of pending view updates
This patch adds a simple and naive mechanism to ensure a base replica
doesn't overwhelm a potentially overloaded view replica by sending too
many concurrent view updates. We add a semaphore to limit to 100 the
number of outstanding view updates. We limit globally per shard, and
not per destination view replica. We also limit statically.

Refs #2538

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180426134457.21290-2-duarte@scylladb.com>
2018-05-07 11:25:27 +03:00
Duarte Nunes
2be75bdfc9 db/timeout_clock: Properly scope type names
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180426134457.21290-1-duarte@scylladb.com>
2018-05-07 11:24:41 +03:00
Nadav Har'El
c93b56034d tests: improve usability of cql_assertions.hh error messages
The functions in cql_assertions.hh are very convenient, but have one
frustrating drawback: When you have many of those assertions in one
test, it's very hard to know *which* of the similar assertions failed.

The problem is that an error often looks like this:

unknown location(0): fatal error: in "test_many_columns":
std::runtime_error: Expected 2 row(s) but got 0
tests/cql_assertions.cc(131): last checkpoint

Which of the many similar checks in "test_many_columns" failed? Note the
unhelpful "unknown location" and also the "last checkpoint" points to code
in cql_assertions.cc, not in the actual test, so it is useless.

The root cause of these problems is that the Boost macros use the C
preprocessor __FILE__ and __LINE__, which in actual C++ functions like
is_rows() remembers its location, instead of the caller. Fixing this will
not be simple. But this patch has a much simpler solution - fixing the
"last checkpoint". What ruins the last checkpoint is the use of BOOST_REQUIRE
inside the cql_assertions.cc is_rows() - when that succeeds, it records
the location inside cql_assertions.cc (!) as the last success.

If we just replace BOOST_REQUIRE by our own test (just like in the rest of
the cql_assertions.cc code), this code will not override the last checkpoint.
The user can see the last real successful BOOST_REQUIRE, or use
BOOST_TEST_PASSPOINT() to set his own checkpoints between different parts of
the same test.

After this patch, and with adding BOOST_TEST_PASSPOINT() calls between
different parts of my test, the failure above now looks like:

unknown location(0): fatal error: in "test_many_columns":
std::runtime_error: Expected 2 row(s) but got 0
tests/secondary_index_test.cc(299): last checkpoint

The "last checkpoint" now shows me exactly where my failing check was.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501152638.26238-1-nyh@scylladb.com>
2018-05-07 09:19:45 +01:00
Duarte Nunes
eabe471ce8 tests/secondary_index_test: Don't catch polymorphic exceptions by value
Don't slice exception by catching them by value. Instead of catching
by reference, use assert_that_failed().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180506153745.4512-1-duarte@scylladb.com>
2018-05-06 18:53:40 +03:00
Duarte Nunes
ab5a45b00c Merge 'Improve debuggability of result_message' from Avi
"This patchset adds ostream operators to result_message and uses them
in cql_assertions."

* tag 'result_message-print/v1.1' of https://github.com/avikivity/scylla:
  tests: cql_assersions: improve error message when a row is not found
  transport: add ostream support to result_message
  transport: const correctness for result_message::accept()
2018-05-06 14:52:56 +01:00
Avi Kivity
6d3fb69827 tests: cql_assersions: improve error message when a row is not found
Display the row and the result set.
2018-05-06 16:28:37 +03:00
Avi Kivity
07d69ebce2 transport: add ostream support to result_message
Allow printing result_message:s for debugging.
2018-05-06 16:28:35 +03:00
Avi Kivity
50d4d01cb7 tests: fix view_schema_test cql_assertion types
Use utf8_type where warranted.

Fixes view_schema_test failure where the rows did not match. I don't
understand exactly why the failure happened (using the wrong type
should not cause a failure here), but the change fixes the problem.

Tests: view_schema_test (release)
Message-Id: <20180506130015.7450-1-avi@scylladb.com>
2018-05-06 14:25:22 +01:00
Avi Kivity
31f2b3ce15 transport: const correctness for result_message::accept()
The visitor does not alter the result_message it is visiting (and
its signature indicates that) so accept() should be const-qualified
to indicate that and to allow visiting const result_message:s.
2018-05-06 15:51:48 +03:00
Avi Kivity
cc900c23a6 Merge "Write Statistics.db in SSTables 3.x format." from Vladimir
"
This patchset adds support for writing Statistics.db in the SSTables
'mc' (3.x) format. This file is essential for reading data stored in
Data.db as it contains base values used for delta encoding and types of
columns.

This patchset also fixes several bugs found in writing data and index
files as well as bugs in a statistics-related structure definition.

Tests: unit {debug, release}

All SSTables files for write unit tests are validated to be processed by
sstabledump and output is verified to show the expected data.
"

* 'projects/sstables-30/write-statistics/v1' of https://github.com/argenet/scylla:
  Add test covering the composite partition key case.
  Add Statistics.db files to write tests for SSTables 3.0.
  Do not check rows and cells for expiration when writing them to the data file.
  Fix promoted index serialization.
  Fix the order of items in stats_metadata.
  Fix timestamp_epoch value which was truncated on exceeding int32_t type limit.
  Write serialization header to Statistics.db for SSTables 3.x.
  Do not pass schema to metadata_collector::update(column_stats)
  Collect metadata statistics when writing SSTables 3.0.
  Call get_metadata_collector() instead of referencing sstable::_collector directly.
  Fix logic of writing TTLed cells in SSTable 3.0 format.
  Separate statistics for count of cells, columns and rows in column_stats.
  Deserialize collection in a way that doesn't incur shared_ptr counter increment and is generally shorter.
  Track both min & max values for timestamp, TTL and local deletion time in metadata_collector.
  Add class for tracking both extremum values (min and max) on updates.
2018-05-05 16:53:08 +03:00
Vladimir Krivopalov
4ecb3a5e2a Add test covering the composite partition key case.
Mainly to check that the composite type is properly serialized when
writing serialization header to Statistics.db.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:45:11 -07:00
Vladimir Krivopalov
1b3989adcd Add Statistics.db files to write tests for SSTables 3.0.
For these tests to work, all time-related values are now fixed as these
are stored in Statistics.db files.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:45:11 -07:00
Vladimir Krivopalov
293ee6ae3f Do not check rows and cells for expiration when writing them to the data file.
Although this logic may be seen as a useful optimization, it hinders
unit tests writing SSTables 3.0 as those need to have fixed time-related
values to produce Statistics.db files with the same content on each run.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:45:11 -07:00
Vladimir Krivopalov
44bc0f1493 Fix promoted index serialization.
There is a new field introduced in the SSTables 3.0 index file format
named 'partition_header_length' that can be used to skip over to the
first clustering row in a wide partition. This one has not been
previously written and caused malformed indices.

Updated the corresponding test to include a static row and write
multiple wide partitions.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:45:10 -07:00
Vladimir Krivopalov
56ac941a2e Fix the order of items in stats_metadata.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:45:10 -07:00
Vladimir Krivopalov
926cdc6d70 Fix timestamp_epoch value which was truncated on exceeding int32_t type limit.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:45:10 -07:00
Vladimir Krivopalov
5db6002720 Write serialization header to Statistics.db for SSTables 3.x.
Serialization header is a new components in Statistics.db introduced in
SSTables 3.0 ('ma') format. It is essential for reading data file as it
contains the base values used for delta-encoded values (timestamps,
TTLs, local deletion times) and description of column types.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:43:17 -07:00
Vladimir Krivopalov
6e4601d177 Do not pass schema to metadata_collector::update(column_stats)
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:22:32 -07:00
Vladimir Krivopalov
a10ad6b623 Collect metadata statistics when writing SSTables 3.0.
Track min/max timestamps, TTLs, local deletion times and count of cells,
columns and rows.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:22:30 -07:00
Raphael S. Carvalho
abcfc19fe9 db: make compaction slightly faster by not using filtering reader on unshared sstable
After reboot, all existing sstables are considered shared. That's a safe default.
Reader used by compaction decides to use filtering reader (filters out data that
doesn't belong to this shard) if sstable is considered shared even though it may
actually be unshared.
By avoiding filtering reader we're avoiding an extra check for each key, and that
may be meaningful for compaction of tons of small partitions and even range
reads of such. We do so by fixing sstable::_shared, which is now set properly for
existing sstables at start.

quick check using microbenchmark which extends perf_sstable with compaction mode:
before: 69407.61 +- 37.03 partitions / sec (30 runs, 1 concurrent ops)
after: 70161.09 +- 40.35 partitions / sec (30 runs, 1 concurrent ops)

Fixes #3042.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180504182158.21130-1-raphaelsc@scylladb.com>
2018-05-04 19:34:09 +01:00
Raphael S. Carvalho
b65bc511fe sstables/compaction_manager: log user initiated compaction
Sometimes it's hard to figure out from log whether user run major
compaction.

Fixes #1303.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180504181047.20277-1-raphaelsc@scylladb.com>
2018-05-04 19:15:58 +01:00
Duarte Nunes
7916368df8 Merge "Introduce system.large_partitions table" from Piotr
"
This series introduces a system.large_partitions table,
used to gather information on largest partitions in the cluster.

Schema below allows easy extraction of most offending keys and removal
by sstable name, which happens when a table is compacted away.

Schema: (
  keyspace_name text,
  table_name text,
  sstable_name text,
  partition_size bigint,
  key text,
  compaction_time timestamp,
  PRIMARY KEY((keyspace_name, table_name), sstable_name, partition_size, key)
) WITH CLUSTERING ORDER BY (partition_size DESC);
"

Closes #3292.

* 'large_partition_table_3' of https://github.com/psarna/scylla:
  database, sstables, tests: add large_partition_handler
  db: add large_partition_handler interface with implementations
  docs: init system_keyspace entry with system.large_partitions
  db: add system.large_partitions table
2018-05-04 18:18:50 +01:00
Piotr Sarna
bc019205b3 schema: fix typos in a comment
Message-Id: <2b2a169e8a511fa9e0e1556ac7559ce9bef896e1.1525431353.git.sarna@scylladb.com>
2018-05-04 15:26:51 +01:00
Piotr Sarna
fe02c3d0e2 database, sstables, tests: add large_partition_handler
This commit makes database, sstables and tests aware
of which large_partition_handler they use.
Proper large_partition_handler is retrievable from config information
and is based on existing compaction_large_partition_warning_threshold_mb
entry. Right now CQL TABLE variant of large_partition_handler is used
in the database.

Tests use a NOP version of large_partition_handler, which does not
depend on CQL queries at all.
2018-05-04 14:38:13 +02:00
Piotr Sarna
14b3c7e7e7 db: add large_partition_handler interface with implementations
This commit introduces large_partition_handler class, which can be used
to take additional action when large partitions are written.

It comes with two implementations:
 * NOP, used in tests, which does nothing on large partition
   update/delete
 * CQL TABLE, which inserts/deletes information on particular sstable
   to system.large_partitions table, in order to be retrievable from
   cqlsh later.

References #3292
2018-05-04 12:46:31 +02:00
Piotr Sarna
3c82a8a2ff docs: init system_keyspace entry with system.large_partitions
This commit is a first step towards documenting system.* tables.
It contains information about system.large_partitions table.

References #3292
2018-05-04 12:45:40 +02:00
Piotr Sarna
02822efbc8 db: add system.large_partitions table
This commit adds a system.large_partitions table, which can be used
to trace largest partitions of a cluster.
Schema: (
  keyspace_name text,
  table_name text,
  sstable_name text,
  partition_size bigint,
  key text,
  compaction_time timestamp,
  PRIMARY KEY((keyspace_name, table_name), sstable_name, partition_size, key)
) WITH CLUSTERING ORDER BY (partition_size DESC);

References #3292
2018-05-04 12:45:40 +02:00
Raphael S. Carvalho
ce689a0807 database: avoid race condition when deleting sstable on behalf of cf truncate
After removal of deletion manager, caller is now responsible for properly
submitting the deletion of a shared sstable. That's because deletion manager
was responsible for holding deletion until all owners agreed on it.
Resharding for example was changed to delete the shared sstables at the end,
but truncate wasn't changed and so race condition could happen when deleting
same sstable at more than one shard in parallel. Change the operation to only
submit a shared sstable for deletion in only one owner.

Fixes dtest migration_test.TestMigration.migrate_sstable_with_schema_change_test

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180503193427.24049-1-raphaelsc@scylladb.com>
2018-05-04 11:42:56 +01:00
Vladimir Krivopalov
8342073758 Call get_metadata_collector() instead of referencing sstable::_collector directly.
A step to untie classes sstable_writer_m and sstable so that eventually
we could stop them being friends.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-03 17:05:06 -07:00
Vladimir Krivopalov
f1816d77cc Fix logic of writing TTLed cells in SSTable 3.0 format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-03 17:05:06 -07:00
Vladimir Krivopalov
3e471116b4 Separate statistics for count of cells, columns and rows in column_stats.
SSTables 3.0 format makes a distinction between count of cells and count
of columns. In that sense, a column of a collection type counts as one
column but every atomic cell in it counts as a separate cell.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-03 17:05:06 -07:00
Vladimir Krivopalov
fdfe79e899 Deserialize collection in a way that doesn't incur shared_ptr counter
increment and is generally shorter.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-03 17:05:06 -07:00
Vladimir Krivopalov
7039dee12b Track both min & max values for timestamp, TTL and local deletion time
in metadata_collector.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-03 17:05:06 -07:00
Vladimir Krivopalov
8b8c9a5d10 Add class for tracking both extremum values (min and max) on updates.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-03 17:05:06 -07:00
Tomasz Grabiec
5e985192b2 db: Log table id and schema version on boot
Message-Id: <1524585689-12458-1-git-send-email-tgrabiec@scylladb.com>
2018-05-03 10:50:31 +03:00
Botond Dénes
5d5bc0e1ab mutation_reader_test: fix multishard-reader test with smp > 3
test_multishard_combining_reader_destroyed_with_pending_create_reader
was failing because it relied on smp == 3 and thus the shard on which
the reader creation is blocked being shard-2. Since the test requires to
be run with smp >= 3 we can hardcode this shard to be 2 because if the
test runs at all we are guaranteed to have at least smp >= 3.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <38883a1f4c18ca0cd065aa13826a4f1858353289.1525328233.git.bdenes@scylladb.com>
2018-05-03 10:30:21 +03:00
Botond Dénes
efa08f623a mutation_reader_test: add description to multishard-tests
These tests are quite complicated and require intimate knowledge of how
foreign_reader and multishard_combining_reader operates. Knowing these
two objects is still required to understand the tests but make it that
much easier by explaining how they were designed to test what they test.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <8de580131a8652924de920c2bc68a98e579398ee.1525328226.git.bdenes@scylladb.com>
2018-05-03 10:30:20 +03:00
Paweł Dziepak
bfc017daa8 tests/mutation_reader: do not capture on-stack variable by reference
'shard' is a short-lived on-stack variable that gets captured by
reference by continuation that gets executed on another shard.

Fixes a race condition that leads to an heap-use-after-free.

Message-Id: <20180502150507.2776-1-pdziepak@scylladb.com>
2018-05-02 18:07:37 +03:00
Botond Dénes
d80e586ccb mutation_reader_test: remove leftover comments
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <580dcf664fc4fc84f3a29137fba5c982f57d7601.1525269726.git.bdenes@scylladb.com>
2018-05-02 17:03:50 +03:00
Botond Dénes
e14b0ca13e mutation_reader_test: fix possible use-after-free
The test_foreign_reader_destroyed_with_pending_read_ahead test currently
doesn't ensure that the objects in it's scope are destroyed in the
correct order. This is necessary as there are severeal foreign pointers
to objects that live on remote shards and use each other. Since
foreign pointers destory their managed object in the background we
cannot rely on the to reliably destroy objects in order, nor can we be
sure when the object they manage is actually destroy.
So to work around that ensure that the puppet_reader is destroyed before
the remote_control it references even has a chance of being destroyed.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <232eaa899878b03fb2a765c2916e4f05841472a3.1525269726.git.bdenes@scylladb.com>
2018-05-02 17:03:49 +03:00
Nadav Har'El
68b5eafcc6 secondary index: test index naming
Test for Scylla's default choice of secondary index name (we found one
small problem, see issue #3403, and left it commented out). Also test
the ability to give indices non-default names.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501153439.26619-1-nyh@scylladb.com>
2018-05-02 08:12:14 +03:00
Nadav Har'El
311b25948c secondary index: test indexing of partition-key column
Add a test that adding a secondary-index for an only partition key column
is not allowed (it would be redundant), but indexing one of several partition
key columns *is* allowed. This reproduced issue #3404, and verifies that
it was fixed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501121544.22869-2-nyh@scylladb.com>
2018-05-02 08:11:04 +03:00
Nadav Har'El
79c6bb642f secondary index: fix indexing of partition-key column
Indexing an only partition key component is not allowed (because it would
be redundant), but it should be allowed to index one of several partition
key components. We had a bug in that case: the underlying materialized view
we created had the same column as both a partition key and a clustering
key, which resulted in an assertion failure. This patch fixes that.

Fixes #3404.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501121544.22869-1-nyh@scylladb.com>
2018-05-02 08:06:38 +03:00
Nadav Har'El
21d7507b74 secondary index: move stuff out of db/index directory
The db/index directory contains just a few lines of code that exists
there for historical reasons. It's confusing that we have both db/index
and index/ directory related to secondary-indexing.

This patch moves what little is still in db/index/ to index/. In the
future we should probably get rid of the "secondary_index" class we had
there, but for now, let's at least not have a whole new directory for it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501101246.21143-1-nyh@scylladb.com>
2018-05-01 13:21:24 +03:00
Tomasz Grabiec
0455a19ce0 anchorless_list: Make ranges conform to SinglePassRange
They were missing const version of iterators as well as iterator and
const_iterator member type aliases.
2018-04-30 18:45:32 +02:00
Tomasz Grabiec
9b7e49ef35 anchorless_list: Drop deprecated use of std::iterator 2018-04-30 18:45:32 +02:00
Tomasz Grabiec
aa1458377c mvcc: Fix partition_snapshot::merge_partition_versions() to not leave latest versions unmerged
Fixes a bug in partition_snapshot::merge_partition_versions(), which
would not attempt merging if the snapshot is attached to the latest
version (in which case _version is nullptr and _entry is !=
nullptr). This would cause partition_version objects to accumulate if
there was an older snapshot and it went away before the latest
snapshot. Versions will be removed when the whole entry goes away
(flush or eviction).

May have caused performance problems.

Fixes #3402.
2018-04-30 18:45:32 +02:00
Avi Kivity
25545590a4 Merge "Read-ahead related fixes for multishard readers" from Botond
"
Both multishard_combining_reader and foreign_reader use read-head in the
background to avoid blocking consumers. These read-aheads can be still
pending when the reader is destroyed and hence extra attention is needed
to avoid memory errors. Recent manual testing, done in the context of
testing code that is using the multishard reader, proved that these
cases were not handled correctly in the initial series introducing it
(2d126a79b).
This series introduces fixes and comprehensive tests for all problematic
scenarios:
1) multishard_combining_reader is destroyed with pending reader creation
on a remote shard.
2) foreign_reader is destroyed with pending read-ahead.
3) multishard_combining_reader is destroyed with pending read-ahead.
"

* 'multishard-reader-read-ahead-fixes/v2' of https://github.com/denesb/scylla:
  test.py: add custom seastar flags for mutation_reader_test
  test.py: move custom seastar flags for tests declarative
  mutation_reader_test: add read-ahead related multishard reader tests
  tests/mutation_reader_test: change recommented smp to 3
  mutation_reader_test: fix name of existing multishard reader tests
  simple_schema: add global_simple_schema
  simple_schema.hh: remove unused include
  multishard_combining_reader: prepare for read-ahead otliving the reader
  foreign_reader: prepare for read-ahead outliving the reader
  multishard_combining_reader: avoid creating the shard reader twice
  multishard_combining_reader: read_ahead: don't assume reader is created
  multishard_combining_reader: move read-ahead related methods
  multishard_combining_reader: avoid looking up the shard reader twice
  multishard_combining_reader: use optional for maybe created reader
2018-04-30 17:41:50 +03:00
Botond Dénes
f96084d38e test.py: add custom seastar flags for mutation_reader_test
Use -c3 if possible (if the machines has at least 3 cores).
2018-04-30 17:17:45 +03:00
Botond Dénes
52f0bb0481 test.py: move custom seastar flags for tests declarative 2018-04-30 17:17:45 +03:00
Botond Dénes
79684eff8e mutation_reader_test: add read-ahead related multishard reader tests
Add tests for foreign_reader and multishard_combining_reader that check
that readers destroyed while there is pending read-head will not result
in use-after-free.
Specifically check that:
* multishard_combining_reader destroyed with pending reader creation
* foreign_reader destroyed with pending read-ahead
* multishard_combining_reader destroyed with pending read-ahead
does not result in use-after-free or SEGFAULT.

These tests try to do their best to check for correct behaviour with
various BOOST_REQUIRE* checks but they still heavily rely on ASAN to
detect any use-after-free, SEGFAULT or similar errors.
2018-04-30 17:17:45 +03:00
Botond Dénes
cb25afa8bf tests/mutation_reader_test: change recommented smp to 3
Of the test_multishard_combining_reader_reading_empty_table test.
Running this test with smp=3 instead of smp=2 helps detecting additional
read-ahead related memory problems.
2018-04-30 17:17:45 +03:00
Botond Dénes
78266f11c4 mutation_reader_test: fix name of existing multishard reader tests
s/multishard_combined_reader/multishard_combining_reader/
2018-04-30 17:17:44 +03:00
Botond Dénes
783f0f09bf simple_schema: add global_simple_schema
Which allows a simple_schema instance to be transferred to another
shard. In fact a new simple_schema instance will be created on the
remote shard but it will use the same schema instance the the original
one.
2018-04-30 17:17:44 +03:00
Botond Dénes
ed7bde99bc simple_schema.hh: remove unused include 2018-04-30 17:17:44 +03:00
Botond Dénes
04643fb223 multishard_combining_reader: prepare for read-ahead otliving the reader
When the multishard reader is destroyed there might be severeal pending
read-aheads running in the background. These read-aheads need their
associated reader to stay alive until after the read-ahead completes.
To solve this move the flat_mutation_reader into a struct and manage
this struct's lifetime through a shared pointer. Fibers associated with
read-aheads that might outlive the multishard reader will hold on to a
copy of the shard pointer keeping the underlying reader alive until they
complete. To avoid doing any extra work a flag is added to this state
which is set when the multishard reader is destroyed. When this flag is
set, pending continuations will return early.  All this is encapsulated
in multishard_combining_reader::shard_reader the multishard reader code
itself need not be changed.
2018-04-30 17:16:21 +03:00
Botond Dénes
a05d398be7 foreign_reader: prepare for read-ahead outliving the reader
The foreign reader keeps track of ongoing read-aheads via a
foreign_ptr to the read-ahead's future on the remote shard. This pointer
is overwritten after each "remote call" to the remote reader with a
pointer to the future of the new read-ahead's future.
There are severeal problems with the current implementation:
1) There is a new read-ahead launched after each "remote call"
  unconditionally, even if the remote reader is at EOS. This will start
  unecessary read-ahead when the reader is already finished and may be
  soon destroyed (legally) by the client.
2) The pointer to the remote read-ahead future is not set to nullptr
  when a remote call is issued. Thus in the destructor, where we
  attach a continuation to the read-ahead's future to extend the
  reader's lifetime until after the read-ahead finishes, we migh attach
  a continuation to a future that already has one and run into a failed
  assert().

To fix this issues reset the read-ahead pointer to nullptr each time a
remote call is issued and don't start a new read-ahead if the remote
reader is at EOS. This way we can ensure that when the reader is
destroyed we either have a valid and non-stale read-aead future or none
at all and can reliably make a decision about whether we need to extend
the lifetime of the remote reader or not.
2018-04-30 14:34:43 +03:00
Botond Dénes
704d3d8421 multishard_combining_reader: avoid creating the shard reader twice
The multishard reader creates its shard readers on demand when they are
first attempted to be used. However at this time the reader migh already
be in the progress of being created, initiated by a previous read-ahead.
To avoid creating the shard reader twice, before creating the reader
check whether there are any read-aheads in progress. If there is, it
already created (is creating or will create) the reader and hence
synchronise with the read ahead. Synchronisation happens via a promise,
the read ahead creates a promise which will be fulfilled when the reader
is created. A concurrent create_reader() call will wait on this promise
instead of attempting to create a new reader.
2018-04-30 14:34:43 +03:00
Botond Dénes
f9464cfcd7 multishard_combining_reader: read_ahead: don't assume reader is created
Currently it is assumed that when read_ahead is called the reader is
already created. Under most circumstances this will not be true. It was
blind (bad) luck that we didn't hit this before (during testing).
2018-04-30 14:34:43 +03:00
Botond Dénes
d9fceb398a multishard_combining_reader: move read-ahead related methods
To the group of methods that do not assume the reader is already
created. A patch will follow that will update read_ahead() to not assume
that the reader is created.
2018-04-30 14:34:43 +03:00
Botond Dénes
5dcfaa68f6 multishard_combining_reader: avoid looking up the shard reader twice 2018-04-30 14:34:43 +03:00
Botond Dénes
79504a7d28 multishard_combining_reader: use optional for maybe created reader
After a little "research" [1] it turns out my initial fears were
completely without ground, std::optional::operator->() and
std::optional::opterator*() doesn't involve an unnecessary branch and
thus there is no need to hand-roll an optional with a separate bool.

[1] http://en.cppreference.com/w/cpp/utility/optional/operator*
2018-04-30 14:34:37 +03:00
Avi Kivity
c8a6fe3044 storage_proxy: remove default_query_timeout()
No longer used.
2018-04-30 13:19:53 +03:00
Avi Kivity
d8dd7e05a7 storage_proxy: don't use default timeouts
Require all callers to supply timeouts instead of relying on defaults.

Since all callers now have the timeouts set up, they can easily supply
them.
2018-04-30 13:19:53 +03:00
Avi Kivity
7b5db486a0 query_options: augment with timeout_config
Add a timeout_config member to query_options. This lets the query
processor know what timeouts the user of this query want to apply.
2018-04-30 13:19:53 +03:00
Avi Kivity
fcea3ed722 thrift: configure thrift transport and handler with a timeout_config
Let the thrift transport server and request handler know about the
per-request-type timeouts, in preparation for actually using them.
2018-04-30 13:19:53 +03:00
Avi Kivity
f9370ab7e6 transport: configure native transport with a timeout_config
Let the native transport server know about the per-request-type
timeouts, in preparation for actually using them.
2018-04-30 13:19:53 +03:00
Avi Kivity
49fdf01b5d cql3: define and populate timeout_config_selector
Determine which timeout we need to apply at prepare time. We
don't know the numerical value (since it depends on whoever is
executing the query, not just the statement type), but we know
which member of timeout_config we need, so determine and remember
that.
2018-04-30 13:19:49 +03:00
Tomasz Grabiec
423712f1fe storage_proxy: Request schema from the coordinator in the original DC
The mutation forwarding intermediary (src_addr) may not always know
about the schema which was used by the original coordinator. I think
this may be the cause of the "Schema version ... not found" error seen
in one of the clusters which entered some pathological state:

  storage_proxy - Failed to apply mutation from 1.1.1.1#5: std::_Nested_exception<schema_version_loading_failed> (Failed to load schema version 32893223-a911-3a01-ad70-df1eb2a15db1): std::runtime_error (Schema version 32893223-a911-3a01-ad70-df1eb2a15db1 not found)


Fixes #3393.

Message-Id: <1524639030-1696-1-git-send-email-tgrabiec@scylladb.com>
2018-04-30 12:51:09 +03:00
Nadav Har'El
1bbf7ba78c secondary index: add tests for IF NOT EXISTS, IF EXISTS
Confirm that issue #2991 is indeed fixed - creating a secondary index
with IF NOT EXISTS ignores an already existing index, and dropping with
IF EXISTS ignores a non-existant index.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180430071714.10154-1-nyh@scylladb.com>
2018-04-30 10:36:50 +02:00
Nadav Har'El
6e3a53fab0 secondary index: improve testing of case-sensitive column names
The existing test_secondary_index_case_sensitive only tested the
case-sensitive case of the column being indexed, and only in some
scenarios. Further testing exposed more bugs - issue #3388, issue #3391,
issue #3401. This patch adds tests which reproduced those bugs, and now
verifies their fix.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-9-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
a556b2b367 materialized views: fix test_case_sensitivity test
test_case_sensitivity from tests/view_schema_test.cc was well-intentioned,
aiming to test from different angles the issue of non-lowercase (quoted)
column names and their interaction with materialized views.

But unfortunately, it didn't test anything! This is because the quotation
marks were forgotten, so all the identifier in this test were folded to
lowercase, and the test didn't test non-lowercase identifiers like it
intended.

So this patch adds the missing quotes, to make this test great again.

After the patches for issues #3388 and #3391 which I sent earlier, the
test *passes* (before those patches, the fixed test did not pass -
the unfixed test trivially passed).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-8-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
46d4f6f352 secondary index: fix yet another case sensitivity bug
When the secondary index code builds a "%s IS NOT NULL" clause for a
CQL statement, it needs to quote the column name if it needs to be
(not only lowercase, digits and _).

Fixes #3401.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-7-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
8012f231ca materialized views: fix another case-sensitivity bug
We had another case-sensitivity bug in materialized views, where if
a case-sensitive (quoted) column name was listed explicitly on "SELECT"
(instead of implicitly, e.g., in "SELECT *") the column name was
incorrectly folded to lower-case and inserts would fail.

This patch fixes the code, where a "SELECT" statement was built using
the desired column names, but column names that needed quoting were
not being quoted. The bug was in a helper function build_select_statement()
which took column name strings and failed to quote them. We clean up this
function to take column definitions instead of strings - and take care
of the quoting itself. It also needs to quote the table's name in the
select statement being built.

Fixes #3391.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-6-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
e2b2506cb1 materialized views - fix case-sensitive IS NOT NULL
Before this patch, if a materialized view is defined with the restriction
IS NOT NULL on a case-sensitive (quoted) column name, inserts fail with
a "restriction 'foobar IS NOT null' unknown column foobar" error, where
foobar is the lowercased version of the case-sensitive column name.

The problem is that the code uses single_column_relation::to_string()
to convert the relation into a CQL where clause. And indeed, this method
generates a CQL expression; But it calls column_identifier::raw::to_string()
to print identifiers. This is the wrong function - it doesn't quote
identifiers that need quoting because they are not lowercase.

So this patch uses column_identifier::raw::to_cql_string() (a method we
added in the previous patch) to generate the properly quoted CQL relation.

Fixes #3388

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-5-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
b8ee50e6b9 Implement column_identifier::raw::to_cql_string()
Implement a method column_identifier::raw::to_cql_string(). Exactly like
the one without "raw", this method quotes the identifier name as needed
for CQL. We'll need this method in a later patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-4-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
993c4441e5 column_identifier::to_cql_string() using maybe_quote()
There is no reason for to_cql_string() and maybe_quote() to both
implement the same quoting algorithm. Use the latter to implement the
former.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-3-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
f4178f9582 Fix cql3::util::maybe_quote()
The utility function maybe_quote() is supposed to quote identifier names
(name of keyspace, table, or column) according to CQL rules, e.g., if the
name has any uppercase or non-alphanumeric characters, it needs to be
quoted. Unfortunatelty, it didn't quite do the right thing, so this patch
fixes that. This patch also adds a comment explaining what maybe_quote()
is supposed to do (until now, users could only guess).

Fixes #3400.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-2-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
ecc85297a4 secondary index: clean up dead unquoting code
In commit d674b6f672, I fixed a case-
sensitive column name bug by avoiding CQL quoting of a column name
in create_index_statement.cc when building a "targets" option string.
However, there is also matching code in target_parser.hh to unquote
that option string. So this unquoting code is no longer necessary, and
should be dropped.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-1-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Avi Kivity
b6d74b1c19 timeout_config: introduce timeout configuration
Different request types have different timeouts (for example,
read requests have shorter timeouts than truncate requests), and
also different request sources have different timeouts (for example,
an internal local query wants infinite timeout while a user query
has a user-defined timeout).

To allow for this, define two types: timeout_config represents the
timeout configuration for a source (e.g. user), while
timeout_config_selector represents the request type, and is used
to select a timeout within a timeout configuration. The latter is
implemented as a pointer-to-member.

Also introduce an infinite timeout configuration for internal
queries.
2018-04-29 19:52:40 +03:00
Nadav Har'El
a0bc0d2d11 secondary index: fix support for compound partition key
In the current code, if the base table has a compound partition key (i.e.,
multiple partition-key columns) searching its secondary indexes didn't work.
There is no real reason why this, it was a just a bug in preparing the
second query:

Every SI query is converted to two queries. The first queries the associated
materialized view, to find a list of primary keys. Those we need to use in a
second query, of the base table. The second query needs to list, as
restrictions, the keys found above. When a partition key is compound, its
components build one key and one restriction. But in the buggy code, we
incorrectly used each component as a separate (improperly formatted) key
and restriction, and obviously this didn't work.

This patch also adds a test that reproduces this problem and confirms its fix.

In the fixed code I also found another incorrect use of to_cql_string() (which
could break case-sensitive primary key column names) and changed it to
to_string().

Fixes #3210.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429124138.24406-1-nyh@scylladb.com>
2018-04-29 14:40:13 +01:00
Duarte Nunes
b1dd1876e5 gms/gossiper: Prevent duplicate processing of EchoMessage reply
We make multiple attempts to mark a node as alive. We do that be
sending an EchoMessage, and marking the node as alive upon receiving a
successful answer. In case there's a network partition and the nodes
can't reach each other, multiple messages may be delivered and
processed.

We can avoid processing duplicate EchoMessage replies by checking
whether we had already marked the node as alive.

Fixes #1184

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180428191942.31990-1-duarte@scylladb.com>
2018-04-29 14:20:01 +03:00
Avi Kivity
51b235aa7e compress: adjust HAVE_LZ4_COMPRESS_DEFAULT macro for new name
Seastar changed the name of this macro.
2018-04-29 12:57:27 +03:00
Avi Kivity
0530653da9 Merge "adapt scylla_io_setup to recent I/O Scheduler changes" from Glauber
"
Recently many changes have landed in seastar for the I/O Scheduler. We
can now describe the I/O storage of a machine by its visible properties
like throughput and bandwidth instead of relying in an indirect
calculation.

For the instances we support, we can just measure that and start using
them right away.

A version of iotune that computes those properties is not yet ready, but
in its making I have noticed that we aren't really setting the nomerges
and scheduler properties of the disks under testing. We definitely
should, since that can influence the results. So this patchset also
starts doing that.

The commandline for iotunev2 shouldn't change much. When it is ready we
will just adjust this script once more.
"

* 'scylla_io_setup' of github.com:glommer/scylla:
  scylla_io_setup: preconfigure i3 and i2 instances with new I/O scheduler properties
  scylla_lib: drop support for m3 and c3 AWS instance types
  io_setup: call blocktune before tuning I/O
  blocktune: allow it to be called as a library.
  scripts: move scylla-blocktune to scripts location
2018-04-29 11:44:06 +03:00
Avi Kivity
7161244130 Merge seastar upstream
* seastar 70aecca...ac02df7 (5):
  > Merge "Prefix preprocessor definitions" from Jesse
  > cmake: Do not enable warnings transitively
  > posix: prevent unused variable warning
  > build: Adjust DPDK options to fix compilation
  > io_scheduler: adjust property names

DEBUG, DEFAULT_ALLOCATOR, and HAVE_LZ4_COMPRESS_DEFAULT macro
references prefixed with SEASTAR_. Some may need to become
Scylla macros.
2018-04-29 11:03:21 +03:00
Raphael S. Carvalho
043fadb15b sstables/twcs: fix setting of timestamp resolution
iterator incorrectly dereferenced when timestamp resolution not
explicitly specified.

following dtests are fixed:
compaction_additional_test.CompactionAdditionalStrategyTests_with_TimeWindowCompactionStrategy.compaction_is_started_on_boot_test
compaction_additional_test.CompactionAdditionalTest.compact_data_by_time_window_test
compaction_additional_test.CompactionAdditionalTest.compaction_removes_ttld_data_by_time_windows_test
compaction_test.TestCompaction_with_DateTieredCompactionStrategy.compaction_strategy_switching_test

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180427192545.17440-1-raphaelsc@scylladb.com>
2018-04-29 10:44:44 +03:00
Glauber Costa
0c29289c22 scylla_io_setup: preconfigure i3 and i2 instances with new I/O scheduler properties
We can use iotunev2 (or any other I/O generator) to test for the limits
of the disks for the i2 and i3 instance classes. The values I got here
are the values I got from ~5 invocations of the (yet to be upstreamed)
iotune v2, with the IOPS numbers rounded for convenience of reading.

During the execution, I verified that the disks were saturated so we
can trust these numbers even if iotunev2 is merged in a different form.
The numbers are very consistent, unlike what we usually saw with the
first version of iotune.

Previously, we were just multiplying the concurrency number by the
number of disks. Now that we have better infrastructure, we will
manually test i3.large and i3.xlarge, since their disks are smaller
and slower.

For the other i3, and all instances in the i2 family storage scales up
by adding more disks. So we can keep multiplying the characteristics of
one known disk by the number of disks and assuming perfect scaling.

Example for i3, obtained with i3.2xlarge:

read_iops = 411k
read_bandwidth = 1.9GB/s

So for i3.16xlarge, we would have read_iops = 3.28M and 15GB/s - very
close to the numbers advertised by AWS.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-04-28 09:50:07 -04:00
Glauber Costa
c85fbd16cb scylla_lib: drop support for m3 and c3 AWS instance types
m3 has 80GB SSDs in its largest form and I doubt anybody has ever
used it with Scylla.

I am also not aware of any c3 deployments. Since it is past generation,
it doesn't even show up in the default instance selector anymore.

I propose we drop AMI support for it. In practice, what that means is
that we won't auto-tune its I/O properties and people that want to use
it will have to run scylla_io_setup - like they do today with the EBS
instances.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-04-28 09:50:07 -04:00
Glauber Costa
685a7c9ae6 io_setup: call blocktune before tuning I/O
We are not configuring the disks the way we want them with respect to
scheduler and nomerges. This is an oversigh that became clear now that
I started rewriting iotune-- since I will explicitly test for that. But
since this can affect the results, it should be here all along.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-04-28 09:50:07 -04:00
Glauber Costa
9eb8ea8b11 blocktune: allow it to be called as a library.
This patch makes the functions in scylla-blocktune available as a
library for other scripts - namely scylla_io_setup.

The filename, scylla-blocktune, is not the most convenient thing to call
from python so instead of just wrapping it in the usual test for
__main__ I am just splitting the file into two.

Another option would be to patch all callers to call
scylla_blocktune.py, but because we are usually not using extensions in
scripts that are meant to be called directly I decided for the split.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-04-28 09:50:07 -04:00
Glauber Costa
f837d5b1f1 scripts: move scylla-blocktune to scripts location
scylla-blocktune currently lives in the top level but this is mostly
historical. When time comes for us to install it, the packaging systems
will copy it to /usr/lib/scylla with the others.

So for consistency let's make sure that it also lives in the scripts
directory.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-04-28 09:50:07 -04:00
Vladimir Krivopalov
b3572acd6e A few improvements to encoding_stats structure.
- Use the same default epoch as Origin
  - Use default value for the encoding_stats parameter in sstable::write_components()

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <846c6d2cbb97d2dd25968cb00b8557c86ff5e35c.1524854727.git.vladimir@scylladb.com>
2018-04-27 22:03:38 +03:00
Avi Kivity
2fb1bcfd13 Update scylla-ami submodule
* dist/ami/files/scylla-ami 02b1853...8a6e4dd (1):
  > ds2_configure.py: always use Ec2Snitch for single region case

Fixes #1800.
2018-04-27 21:02:27 +03:00
Vladimir Krivopalov
36fe06fd3e Make abstract_type::is_fixed_length() non-virtual.
This method is called agressively through SSTable 3.0 read/write, we
want to reasonably optimise it to no incur extra indirect calls.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <2d00ddecd112af867a30d3d6930c10165dd5af34.1524851530.git.vladimir@scylladb.com>
2018-04-27 20:57:46 +03:00
Tomasz Grabiec
b1465291cf db: schema_tables: Treat drop of scylla_tables.version as an alter
After upgrade from 1.7 to 2.0, nodes will record a per-table schema
version which matches that on 1.7 to support the rolling upgrade. Any
later schema change (after the upgrade is done) will drop this record
from affected tables so that the per-table schema version is
recalculated. If nodes perform a schema pull (they detect schema
mismatch), then the merge will affect all tables and will wipe the
per-table schema version record from all tables, even if their schema
did not change. If then only some nodes get restarted, the restarted
nodes will load tables with the new (recalculated) per-table schema
version, while not restarted nodes will still use the 1.7 per-table
schema version. Until all nodes are restarted, writes or reads between
nodes from different groups will involve a needless exchange of schema
definition.

This will manifest in logs with repeated messages indicating schema
merge with no effect, triggered by writes:

  database - Schema version changed to 85ab46cd-771d-36c9-bc37-db6d61bfa31f
  database - Schema version changed to 85ab46cd-771d-36c9-bc37-db6d61bfa31f
  database - Schema version changed to 85ab46cd-771d-36c9-bc37-db6d61bfa31f

The sync will be performed if the receiving shard forgets the foreign
version, which happens if it doesn't process any request referencing
it for more than 1 second.

This may impact latency of writes and reads.

The fix is to treat schema changes which drop the 1.7 per-table schema
version marker as an alter, which will switch in-memory data
structures to use the new per-table schema version immediately,
without the need for a restart.

Fixes #3394

Tests:
    - dtest: schema_test.py, schema_management_test.py
    - reproduced and validated the fix with run_upgrade_tests.sh from git@github.com:tgrabiec/scylla-dtest.git
    - unit (release)

Message-Id: <1524764211-12868-1-git-send-email-tgrabiec@scylladb.com>
2018-04-27 17:12:33 +03:00
Avi Kivity
6154ea734d Merge "upport for writing SSTables 3.0 - rows only" from Vladimir
"
This patch series introduces initial support for writing SSTables in
'mc' format (aka SSTables 3.0).

Currently, the following components are written in 3.0 format:
  - Data.db
  - Index.db
  - Summary.db
(there were no changes to summary files format compared to ka/la)
Other SSTables components are written in the old format for now as they
still need to exist to satisfy post-flush processing.

For now, only rows are written to the data file and indexed. Range
tombstones are not supported.

Writing rows is supported in full with the only exception being counter
cells. All the other features (TTLed data, row/cell level tombstones,
collections, etc) are supported.

Unit tests rely on producing files and binary-comparing them with
'golden' copies that are produced using Cassandra 3.11. This is done to
not block until reading SSTables 3.0 format is implemented.

=======================================
Implementation notes
=======================================

Internally, sstable_writer has been refactored to support multiple
implementations that are instantiated in its constructor based on the
sstable version. Little to no code is shared among sstable_writer_v2 and
sstable_writer_v3 as we only intend to support sstable_writer_v2
alongside sstable_writer_v3 for a single release (to be able to do
rollback on rolling upgrade failure) and then plan to get rid of it
entirely and switch to always writing SSTables in the new format.

The design of sstable_writer_v3 mostly follows that of its precursors
sstable_writer(_v2) and components_writer. Some refactoring and further
code rearrangements are expected in the future but the main code is
there.
"

* 'projects/sstables-30/write-rows/v2' of https://github.com/argenet/scylla:
  Add tests for writing data and index files in SSTables 3.0 ('mc') format.
  Support for writing SSTables 3.0 ('mc') Data.db and Index.db files - rows only.
  Add missing enum values to bound_kind.
  Add building blocks for writing data in SSTables 3.0 format.
  Refactor sstable_writer to support various internal implementations.
  Add is_fixed_length() to data types.
  Add mutation_partition::apply_insert() overload that accepts TTL and expiry for row marker.
2018-04-27 17:10:31 +03:00
Piotr Jastrzebski
d839a945b4 Use goto instead of break in data_consume_rows_context_m::process_state
This way the code will be better predicted.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <271333caa723e8f3ed1db4fbe6b014ebde2b5d3a.1524818584.git.piotr@scylladb.com>
2018-04-27 11:56:13 +03:00
Vladimir Krivopalov
77fdfa3e7a Add tests for writing data and index files in SSTables 3.0 ('mc') format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 14:34:20 -07:00
Vladimir Krivopalov
15ef4ca73c Support for writing SSTables 3.0 ('mc') Data.db and Index.db files - rows only.
This fix adds functionality for writing data in 'mc' format to Data.db
file according to the SSTables 3.0 data format as described at https://github.com/scylladb/scylla/wiki/SSTables-3.0-Data-File-Format
and Index.db file according to the specification at https://github.com/scylladb/scylla/wiki/SSTables-3.0-Index-File-Format

The following cases are not supported yet:
  - writing counter cells
  - range tombstones

In Index.db, end open markers are not written since range tombstones are not
supported for data files yet.

For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 14:34:20 -07:00
Vladimir Krivopalov
3ecc9e9ce4 Add missing enum values to bound_kind.
bound_kind::clustering, bound_kind::excl_end_incl_start and
bound_kind::incl_end_excl_start are used during SSTables 3.0 writing.

bound_kind::static_clustering is not used yet but added for completeness
and parity with the Origin.

For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 14:34:20 -07:00
Vladimir Krivopalov
a95664be08 Add building blocks for writing data in SSTables 3.0 format.
For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 14:34:20 -07:00
Vladimir Krivopalov
bb2bea928a Refactor sstable_writer to support various internal implementations.
This is preparatory work for supporting writing SSTables in multiple
formats.

For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 14:34:20 -07:00
Vladimir Krivopalov
54bd74fda0 Add is_fixed_length() to data types.
For any given CQL data type, this member returns whether its values are
of fixed or variable length. This is used by SSTables 3.0 format to only
store the length value for variable-length cells.

For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 14:34:20 -07:00
Vladimir Krivopalov
ed62b9a667 Add mutation_partition::apply_insert() overload that accepts TTL and expiry for row marker.
For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 13:27:42 -07:00
Piotr Jastrzebski
a8154e2825 Fix use-after-free in summary parsing
Buffer received from read_exactly is referenced by
a pointer used in do_until loop but is not kept around
and is destroyed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <5edd6d08ec4466fe6abd0e83b4bfb24f1f5c80fa.1524747108.git.piotr@scylladb.com>
2018-04-26 15:54:41 +03:00
Avi Kivity
5119c1e9c1 Merge "Implement reading simple table from sstable 3.x" from Piotr
"
This patchset prepares everything for support of both 2.x and 3.x formats and implements reading from sstable 3.x
very simple table with just partition keys.

Tests: units (release)
"

* 'haaawk/sstables3/read_only_partitions_v4' of ssh://github.com/scylladb/seastar-dev: (22 commits)
  Test for reading sstable in MC format with no columns
  Use new mp_row_consumer_m and data_consume_rows_context_m
  Introduce mp_row_consumer_m
  Rename mp_row_consumer to mp_row_consumer_k_l
  Introduce consumer_m and data_consume_rows_context_m
  Use read_short_length_bytes in RANGE_TOMBSTONE
  Use read_short_length_bytes in ATOM_START
  Use read_short_length_bytes in ROW_START
  Add continuous_data_consumer::read_short_length_bytes
  Reduce duplication with continuous_data_consumer::read_partial_int
  Add test for a simple table with just partition key
  Add test for reading index
  Extract mp_row_consumer to separate header
  Make sstable_mutation_reader independent from mp_row_consumer
  Make sstable_mutation_reader a template
  Make data_consume_context a template
  Move data_consume_rows_context from row.cc to row.hh
  Decouple sstable.hh and row.hh
  Reduce visibility of sstable::data_consume_*
  Move data_consume_context to separate header
  ...
2018-04-26 14:35:42 +03:00
Botond Dénes
b2d71ed872 install_dependencies.sh: centos: add systemd-devel
This optional dependency is needed to properly integrate with systemd.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <bacd07958531e6541d5b1a4ea885f01491002a7b.1524740540.git.bdenes@scylladb.com>
2018-04-26 14:32:36 +03:00
Piotr Jastrzebski
5c223c13d6 Test for reading sstable in MC format with no columns
Just a simple table with only partition key.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:38 +02:00
Piotr Jastrzebski
6dd7ce2582 Use new mp_row_consumer_m and data_consume_rows_context_m
When SSTable is in MC format then use those new classes
to be able to read the sstable.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:38 +02:00
Piotr Jastrzebski
9ba64f65e1 Introduce mp_row_consumer_m
This is a version of mp_row_consumer that can
handle SSTables in MC format.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:38 +02:00
Piotr Jastrzebski
4aec023927 Rename mp_row_consumer to mp_row_consumer_k_l
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:38 +02:00
Piotr Jastrzebski
2ee3d8b87b Introduce consumer_m and data_consume_rows_context_m
Those classes can handle SSTables in MC format.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:38 +02:00
Piotr Jastrzebski
b343212073 Use read_short_length_bytes in RANGE_TOMBSTONE
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
90bb7802cc Use read_short_length_bytes in ATOM_START
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
6a81a755ee Use read_short_length_bytes in ROW_START
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
06ceea9c3e Add continuous_data_consumer::read_short_length_bytes
This is a common operation so it's better to have it
implemented in a single place.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
e664360730 Reduce duplication with continuous_data_consumer::read_partial_int
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
9a3f93a42b Add test for a simple table with just partition key
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
c6d4f49abb Add test for reading index
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
63f0b57365 Extract mp_row_consumer to separate header
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
e5145b87b0 Make sstable_mutation_reader independent from mp_row_consumer
Take consumer as template parameter instead.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
9c93f9f5f4 Make sstable_mutation_reader a template
Take DataConsumeRowsContext type as parameter.
This will allow us to implement different context
for reading 3.x files.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
9fad5831df Make data_consume_context a template
Parametrize it with the type of data consume rows context.

There will be different implementations used for different
sstable file formats.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
e2b393df13 Move data_consume_rows_context from row.cc to row.hh
It will be used as a template parameter for sstable_mutation_reader
once it's turned into a template. This means the definition has
to be accessible.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
0e405719e8 Decouple sstable.hh and row.hh
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
bcf5717753 Reduce visibility of sstable::data_consume_*
They are used just in partition.cc, row.cc and sstables_test.cc
so it is usefull to cut their scope by moving them
to data_consume_context.hh.

This will make it much easier to turn data_consume_context into
a template.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
578aa6826f Move data_consume_context to separate header
It's used only in row.cc, partition.cc and sstables_test.cc
so it's better to reduce the dependency just to those files.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
a55cec544e mp_row_consumer: stop depending on sstable_mutation_reader
Introduce mp_row_consumer_reader to cut
a cyclic dependency between them.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
0efcc6b33f Fix use-after-free in estimated_histogram parsing
A pointer to buf was used in do_until but buf wasn't
kept around and was destroyed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:48:02 +02:00
Takuya ASADA
782ebcece4 dist/debian: add --jobs <njobs> option just like build_rpm.sh
On some build environment we may want to limit number of parallel jobs since
ninja-build runs ncpus jobs by default, it may too many since g++ eats very
huge memory.
So support --jobs <njobs> just like on rpm build script.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180425205439.30053-1-syuu@scylladb.com>
2018-04-26 12:44:06 +03:00
Duarte Nunes
6f9bc28edf Merge 'Collect statistics on updates to memtables' from Vladimir
"
This patchset brings in a statistics collector that tracks minimal
values for timestamps, TTLs and local deletion times for all the updates
made to a given memtable.

This statistics is later used when flushing memtables into SSTables
using 3.x ('mc') format to delta-encode corresponding values using
collected minimums as bases (that is why it is called encoding
statistics).

This patchset is sent out apart from other changes that introduce
writing SSTables 3.x to facilitate read path implementation that also
needs the encoding_stats structure.

The tests for write path implicitly cover this functionality as any rows
written to a SSTable 3.0 file make use of delta-encoding.
"

* 'projects/sstables-30/collect-encoding-statistics-v4' of https://github.com/argenet/scylla:
  Collect encoding statistics for memtable updates.
  Factor out min_tracker and max_tracker as common helpers.
  Always pass mutation_partitions to partition_entry::apply()
2018-04-26 00:39:15 +01:00
Vladimir Krivopalov
948c4d79d3 Collect encoding statistics for memtable updates.
We keep track of all updates and store the minimal values of timestamps,
TTLs and local deletion times across all the inserted data.
These values are written as a part of serialization_header for
Statistics.db and used for delta-encoding values when writing Data.db
file in SSTables 3.0 (mc) format.

For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-25 15:39:14 -07:00
Vladimir Krivopalov
f6f99919da Factor out min_tracker and max_tracker as common helpers.
They will be re-used for collecting encoding statistics which is needed
to write SSTables 3.0.

Part of #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-25 14:58:47 -07:00
Vladimir Krivopalov
e1ee833861 Always pass mutation_partitions to partition_entry::apply()
Previously it was also possible to pass a frozen_mutation to it.
Now we de-serialize frozen mutations at the calling side.

This is a pre-requisite for collecting memtable statistics needed for
writing into the SSTables 3.0 format.

For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-25 14:58:47 -07:00
Moreno Garcia
8dde91d03c docker: Create data_dir if it does not exist
When provisioning a Scylla docker image with --developer-mode 0 (disabled)
scylla_raid_setup is not invoked. As a consequence the "data" directory is not
created and scylla_io_setup fails (steps to reproduce and error message provided
at the end).

This patch adds the same verifications present in scylla_io_setup to docker's
scyllasetup.py and creates the data directory in the case it is not present.

--

Steps to reproduce on AWS i3.2xlarge with Ubuntu 16.04:

sudo -s
apt update && apt upgrade -y && apt-get install docker.io -y

mdadm --create --verbose --force --run /dev/md0 --level=0 -c1024 --raid-devices=1 /dev/nvme0n1
mkfs.xfs /dev/md0 -f -K
mkdir /var/lib/scylla
mount -t xfs /dev/md0 /var/lib/scylla

docker run --name some-scylla \
  --volume /var/lib/scylla:/var/lib/scylla \
  -p 9042:9042 -p 7000:7000 -p 7001:7001 -p 7199:7199 \
  -p 9160:9160 -p 9180:9180 -p 10000:10000 \
  -d scylladb/scylla --overprovisioned 1 --developer-mode 0

docker logs some-scylla
  running: (['/usr/lib/scylla/scylla_dev_mode_setup', '--developer-mode', '0'],)
  running: (['/usr/lib/scylla/scylla_io_setup'],)
  terminate called after throwing an instance of 'std::system_error'
    what():  open: No such file or directory
  ERROR:root:/var/lib/scylla/data did not pass validation tests, it may not be on XFS and/or has limited disk space.
  This is a non-supported setup, and performance is expected to be very bad.
  For better performance, placing your data on XFS-formatted directories is required.
  To override this error, enable developer mode as follow:
  sudo /usr/lib/scylla/scylla_dev_mode_setup --developer-mode 1
  failed!
  Traceback (most recent call last):
    File "/docker-entrypoint.py", line 15, in <module>
      setup.io()
    File "/scyllasetup.py", line 34, in io
      self._run(['/usr/lib/scylla/scylla_io_setup'])
    File "/scyllasetup.py", line 23, in _run
      subprocess.check_call(*args, **kwargs)
    File "/usr/lib64/python3.4/subprocess.py", line 558, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['/usr/lib/scylla/scylla_io_setup']' returned non-zero exit status 1

ls -latr /var/lib/scylla
  total 4
  drwxr-xr-x 44 root root 4096 Abr 24 13:02 ..
  drwxr-xr-x  2 root root    6 Abr 24 13:10 .

Signed-off-by: Moreno Garcia <moreno@scylladb.com>
Message-Id: <20180424173729.22151-1-moreno@scylladb.com>
2018-04-25 17:48:34 +03:00
Calle Wilund
b1edf75c8b types: Make seastar::inet_address the "native" type for CQL inet.
Fixes #3187

Requires seastar "inet_address: Add constructor and conversion function
from/to IPv4"

Implements support IPv6 for CQL inet data. The actual data stored will
now vary between 4 and 16 bytes. gms::inet_address has been augumented
to interop with seastar::inet_address, though of course actually trying
to use an Ipv6 address there or in any of its tables with throw badly.

Tests assuming ipv4 changed. Storing a ipv4_address should be
transparent, as it now "widens". However, since all ipv4 is
inet_address, but not vice versa, there is no implicit overloading on
the read paths. I.e. tests and system_keyspace (where we read ip
addresses from tables explicitly) are modified to use the proper type.
Message-Id: <20180424161817.26316-1-calle@scylladb.com>
2018-04-24 23:12:07 +01:00
Duarte Nunes
9111c6e49a Merge seastar upstream
* seastar 1bb44ac...70aecca (12):
  > Experimental CMake-based build system
  > inet_address: Add constructor and conversion function from/to IPv4
  > tls: Add missing includes and forward declarations to header
  > install_dependencies.sh: fix remaining centos issues
  > rpc: Add missing return when closing client socket
  > install-dependencies.sh: install g++7.3 for centos, instead of g++7.2
  > reactor: fix race beween alien queue construction and start
  > Merge "enhance the I/O Scheduler with bandwidth and throughput limits" from Glauber
  > reactor: gracefully exit if exception happens during initialization
  > build: really add alien_test
  > Merge "reactor: add alien::submit_to()" from Kefu
  > queue: do not consume from aborted queue

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-24 23:07:13 +01:00
Duarte Nunes
f5eeafe1bf tests/secondary_index_test: Add test for dropping index-backing MV
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180424140745.7144-2-duarte@scylladb.com>
2018-04-24 17:02:59 +01:00
Duarte Nunes
9146de3118 service/migration_manager: Don't drop index-backing MV
Unless dropped by the index itself, forbid dropping an index-backing
MV using `drop materialized view`.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180424140745.7144-1-duarte@scylladb.com>
2018-04-24 17:01:59 +01:00
Nadav Har'El
d674b6f672 secondary index: fix bug in indexing case-sensitive column names
CQL normally folds identifiers such as column names to lowercase. However,
if the column name is quoted, case-sensitive column names and other strange
characters can be used. We had a bug where such columns could be indexed,
but then, when trying to use the index in a SELECT statement, it was not
found.

The existing code remembered the index's column after converting it to CQL
format (adding quotes). But such conversion was unnecessary, and wrong,
because the rest of the code works with bare strings and does not involve
actual CQL statements. So the fix avoids this mistaken conversion.

This patch also includes a test to reproduce this problem.

Fixes #3154.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180424154920.15924-1-nyh@scylladb.com>
2018-04-24 16:57:17 +01:00
Piotr Sarna
d323b5cddc tests: add missing case-sensitive JSON tests
This commit complements cql_query_test with case-sensitivity cases
for both SELECT JSON and INSERT JSON statements.
Message-Id: <20bc7df2ec644618727183e09f2352ca5546a9b9.1524576066.git.sarna@scylladb.com>
2018-04-24 16:30:56 +03:00
Piotr Sarna
000ce24306 cql3: solve JSON case-sensitivity issues
This commit fixes two closely related issues with handling
case-sensitive column names in JSON:
 * according to doc, case-sensitive names should be wrapped with
   additional pair of double quotes during JSON SELECT
 * logic error in parse_json() prevented INSERT JSON from working
   properly on case-sensitive column names

This commit is followed by updated cql_query_test, which checks
case-sensitive cases as well.
Message-Id: <82d9d5e193a656e99bc86b297c00662a6fb808a0.1524576066.git.sarna@scylladb.com>
2018-04-24 16:30:55 +03:00
Avi Kivity
13ea1a89b5 Merge "Implement loading sstables in 3.x format" from Piotr
"
Pass sstable version to parse, write and describe_type methods to make it possible to handle different versions.
For now serialization header from 3.x format is ignored.

Tests: units (release)
"

* 'haaawk/sstables3/loading_v4' of ssh://github.com/scylladb/seastar-dev:
  Add test for loading the whole sstable
  Add test for loading statistics
  Add support for 3_x stats metadata
  Pass sstable version to describe_type
  Pass sstable version to write methods
  metadata_type: add Serialization type
  Pass sstable_version_types to parse methods
  Add test for reading filter
  Add test for read_summary
  sstables 3.x: Add test for reading TOC
  sstable: Make component_map version dependent
  sstable::component_type: add operator<<
  Extract sstable::component_type to separete header
  Remove unused sstable::get_shared_components
  sstable_version_types: add mc version
2018-04-24 12:49:41 +03:00
Piotr Jastrzebski
6310fc5f1c Add test for loading the whole sstable
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
9e78b6d4c6 Add test for loading statistics
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
df457166b0 Add support for 3_x stats metadata
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
e1e23ec555 Pass sstable version to describe_type
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
1cc1f9af5f Pass sstable version to write methods
This will allow writing different versions differently

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
08da518dae metadata_type: add Serialization type
Ignore it while reading sstable 3_x and throw
if it's present when reading 2_x.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
cb84ca8abb Pass sstable_version_types to parse methods
Parsing will depend on the sstable version when
we have support for both 2_x and 3_x formats.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
444b468d46 Add test for reading filter
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
ff06d2153c Add test for read_summary
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
10f9b06145 sstables 3.x: Add test for reading TOC
Make sure DigestCRC32 is handled correctly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
561ca34ec2 sstable: Make component_map version dependent
Introduce sstable_version_constants that will be a proxy
serving correct constants depending on the format version.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
7aef74c55f sstable::component_type: add operator<<
Make it possible to print out component_type.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
d492e92b15 Extract sstable::component_type to separete header
It will be used in other places which won't depend on
sstable.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:29:57 +02:00
Nadav Har'El
4af2604e76 secondary index: update test.py
I forgot that I also need to update test.py for the new test.

It's unfortunate that this script doesn't pick up the list of
tests automatically (perhaps with a black-list of tests we don't
want to run). I wonder if there are additional tests we are
forgetting to run.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180424085911.29732-1-nyh@scylladb.com>
2018-04-24 12:11:38 +03:00
Nadav Har'El
9605059a2b secondary index: move tests to separate source file
Move the two tests we have for the secondary indexing feature from the
huge tests/cql_query_test.cc to a new file, secondary_index_test.cc.

Having these tests in a separate file will make it easier and faster to
write more tests for this feature, and to run these tests together.

This patch doesn't change anything in the tests' code - it's just a code
move.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180424084700.28816-1-nyh@scylladb.com>
2018-04-24 11:49:57 +03:00
Takuya ASADA
4a8ed4cc6f dist/common/scripts/scylla_raid_setup: prevent 'device or resource busy' on creating mdraid device
According to this web site, there is possibility we have race condition with
mdraid creation vs udev:
http://dev.bizo.com/2012/07/mdadm-device-or-resource-busy.html
And looks like it can happen on our AMI, too (see #2784).

To initialize RAID safely, we should wait udev events are finished before and
after mdadm executed.

Fixes #2784

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1505898196-28389-1-git-send-email-syuu@scylladb.com>
2018-04-24 11:48:40 +03:00
Vladimir Krivopalov
fc644a8778 Fix Scylla to compile with older versions of JsonCpp (<= 1.7.0).
Old versions of JsonCpp declare the following typedefs for internally
used aliases:
    typedef long long int Int64;
    typedef unsigned long long int UInt64;

In newer versions (1.8.x), those are declared as:
    typedef int64_t Int64;
    typedef uint64_t UInt64;

Those base types are not identical so in cases when a type has
constructors overloaded only for specific integral types (such as
Json::Value in JsonCpp or data_value in Scylla), an attempt to
pack/unpack an integer from/to a JSON object causes ambiguous calls.

Fixes #3208

Tests: unit {release}.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <e9fff9f41e0f34b15afc90b5439be03e4295623e.1524556258.git.vladimir@scylladb.com>
2018-04-24 10:58:38 +03:00
Piotr Jastrzebski
279b426ee8 Remove unused sstable::get_shared_components
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 09:45:55 +02:00
Piotr Jastrzebski
7248752698 sstable_version_types: add mc version
This is the latest version of 3.x SSTable format.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 09:45:55 +02:00
Raphael S. Carvalho
11940ca39e sstables: Fix bloom filter size after resharding by properly estimating partition count
We were feeding the total estimation partition count of an input shared
sstable to the output unshared ones.

So sstable writer thinks, *from estimation*, that each sstable created
by resharding will have the same data amount as the shared sstable they
are being created from. That's a problem because estimation is feeded to
bloom filter creation which directly influences its size.
So if we're resharding all sstables that belong to all shards, the
disk usage taken by filter components will be multiplied by the number
of shards. That becomes more of a problem with #3302.

Partition count estimation for a shard S will now be done as follow:
    //
    // TE, the total estimated partition count for a shard S, is defined as
    // TE = Sum(i = 0...N) { Ei / Si }.
    //
    // where i is an input sstable that belongs to shard S,
    //       Ei is the estimated partition count for sstable i,
    //       Si is the total number of shards that own sstable i.

Fixes #2672.
Refs #3302.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180423151001.9995-1-raphaelsc@scylladb.com>
2018-04-23 18:11:20 +03:00
Avi Kivity
8a8f688dbf Merge "Materialized views: Fixes to update generation" from Duarte
"
Fixes to several issues around view update generation, pertaining to
timestamp and TTL management.

Fixes #3361
Fixes #3360
Fixes #3140
Refs #3362

Tests: unit(release, debug), dtest(materialized_views.py)
"

Reviewed-by: Nadav Har'El <nyh@scylladb.com>

* 'materialized-views/fixes-galore/v2' of http://github.com/duarten/scylla:
  mutation_partition: Clarify comment about emptiness
  tests: Add view_complex_test
  tests/view_schema_test: Complete test
  db/view: Move cells instead of copying in add_cells_to_view()
  db/view: Handle unselected base columns and corner cases
  mutation_partition: Regular base column in view determines row liveness
  db/view: Don't avoid read-before-write when view PK matches base
  db/view: Process base updates to column unselected by its views
  db/view: Consider partition tombstone when generating updates
  tests/view_schema_test: Remove unneeded test
  mutation_fragment: Allow querying if row is live
  view_info: Add view_column() overload
  view_info: Explicitly initialize base-dependent fields
  cql3/alter_table_statement: Forbid dropping columns of MV base tables
2018-04-23 16:49:29 +03:00
Nadav Har'El
1ec5688b0b Materialized Views: fix incorrect limitations on row filtering
This patch fixes several cases where it was disallowed to create
a materialized view with a filter ("where ..."), for no good reason.
After this patch, these cases will be allowed. Fixes #2367.

In ordinary SELECT queries, certain types of filtering which is known to
be deceptively inefficient is now allowed. For example, trying to query
a range of partition keys cannot be done without reading the entire
database (because the murmur3 tokenizer randomizes the order of partitions).
Restricting two partition key components also cannot be done without
reading excessive amount of the entire partition. So Scylla, following
Cassandra, chooses to disallow such SELECT queries, and give an error
message.

However, the same SELECT statements *should* be allowed when defining a
materialized view. In this case, the filter is just used to check an
individual row - not to search for one - so there is no performance
concern.

Unfortunately the existing code did these validations while building the
SELECT statement's "restrictions", in code shared by both uses of SELECT
(query and MV definition). It was easy to move one of the validations
to later code which runs after the restriction has already been built (and
knows if it is working for query or MV), but because of the way the
"restrictions" objects (translated from Cassandra 2's code) hide what they
contain, many of the checks are harder to perform after having built the
restrictions object. So instead, we add in strategic places in the
restriction-handling code a new "allow_filtering" flag. If restrictions
are built with allow_filtering=true, the extra performance-oriented tests
on the filtering restrictions is not done. Materialized views sets
allow_filtering=true.

The allow_filtering flag will also be useful later when we want to support
the "ALLOW FILTERING" query option which is currently not supported properly
(we have several open issues on that). However note that this patch doesn't
complete that support: I left a FIXME in the spot where we set
allow_filtering in the Materialized Views case, but in the futre also need
to set it if the user specified "ALLOWED FILTERING" in the query.

This patch also enables several unit tests written by Duarte which used to
fail because of this bug, and now pass. These tests verify that the
restrictions are now allowed and filter the view as desired; But I also
added test code to verify that the same restrictions are still forbidden,
as before, when used in ordinary SELECT queries.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Message-Id: <20180423124343.17591-1-nyh@scylladb.com>
2018-04-23 14:08:04 +01:00
Avi Kivity
ff055a291a Merge "Improve "out-of-the-box" build experience on centos" from Botond
"
Make sure install_dependencies.sh installs all the right dependencies
and that the example `configure.py` invokation can just be copy-pasted
into the terminal and will "just work".

Ref: #3208
"

* 'fix_centos_compile/v2' of https://github.com/denesb/scylla:
  install_dependencies.sh: update centos package list and example
  configure.py: add --with-ragel option
  configure.py: add --with-antlr3
  configure.py: check compiler version first
2018-04-23 15:49:27 +03:00
Botond Dénes
bfe741c03d install_dependencies.sh: update centos package list and example
Add missing packages to `yum install` list:
* scylla-boost163-static
* scylla-python34-pyparsing20

Update the configure.py example so that it just works:
* Change g++ to 7.3
* Add --with-antlr3 pointing to antlr3 installed from scylla 3rdparty
2018-04-23 15:46:43 +03:00
Botond Dénes
1efcf215b6 configure.py: add --with-ragel option
To allow the user to select the exact ragel executable they whish to
use.
2018-04-23 15:46:43 +03:00
Botond Dénes
784be9cc43 configure.py: add --with-antlr3
To allow the user to select the exact antlr3 executable they whish to
use.
2018-04-23 15:46:43 +03:00
Botond Dénes
ea8d8f9fbf configure.py: check compiler version first
Before checking anything else (presence of boost, its version, etc.)
check that the compiler is present and can compile and link a simple c++
program.
Before if the compiler was not set up correctly configure.py would fail
at one of the other try_compile checks, whichever came first (usually
the one checking for boost). This lead the user into chasing some
false-positive error when in fact the compiler wasn't working.
2018-04-23 15:46:43 +03:00
Takuya ASADA
7b92c3fd3f dist: Drop AmbientCapabilities from scylla-server.service for Debian 8
Debian 8 causes "Invalid argument" when we used AmbientCapabilities on systemd
unit file, so drop the line when we build .deb package for Debian 8.
For other distributions, keep using the feature.

Fixes #3344

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180423102041.2138-1-syuu@scylladb.com>
2018-04-23 13:27:14 +03:00
Avi Kivity
269207fdf6 Merge "Introducing INSERT JSON and fromJson to CQL3" from Piotr
"
This series complements JSON support with INSERT JSON and fromJson
cql function.

INSERT JSON implementation tries hard to interfere as little as possible
with regular INSERT path. So, after being parsed, insertJsonStatement
exists as a separate statement and is handled in a special way.
Overridden add_update_for_key extracts values from JSON map and applies
them to columns.

Converting from insert_json_statement to insert_statement uses auxiliary
from_json_object methods to convert JSON-encoded types to bytes.
Then, terms are matched to appropriate column names and cells are
updated.

fromJson CQL function uses the same from_json_object helper methods,
but applies them to single arguments, not whole rows.

Existing json handling functions from json.hh and libjsoncpp were used
where possible.

Things implemented:
 * expanding CQL grammar to accept INSERT JSON
 * converting JSON representation of cql values to cql terms
 * serving 'INSERT INTO xxx JSON yyy' clause
 * tests for INSERT JSON and fromJson()
"

* 'json_ops_2' of https://github.com/psarna/scylla:
  tests: add cql unit tests for INSERT JSON
  cql3: add fromJson() function
  cql3: add INSERT JSON parsing to CQL grammar
  cql3: add support for INSERT JSON clause
  cql3: decouple execute from term binding in setters
  cql3: change operation::make_* functions to static
  cql3: add from_json_object function to types
  cql3: Make literals::NULL_VALUE public
2018-04-23 13:19:54 +03:00
Piotr Sarna
97e89f2efb tests: add cql unit tests for INSERT JSON
This commit adds tests for INSERT JSON clause, which is expected
to accept JSON strings and insert appropriate values to columns
defined there.
The tests also cover fromJson function calls and inserting prepared
batch statements with INSERT JSON inside.

References #2058
2018-04-23 12:00:57 +02:00
Piotr Sarna
cd76a01747 cql3: add fromJson() function
This function extends JSON support with fromJson() function,
which can be used in UPDATE clause to transform JSON value
into a value with proper CQL type.

fromJson() accepts strings and may return any type, so its instances,
like toJson(), are generated during calls.

This commit also extends functions::get() with additional
'receiver' parameter. This parameter is used to extract receiver type
information neeeded to generate proper fromJson instance.
Receiver is known only during insert/update, so functions::get() also
accepts a nullptr if receiver is not known (e.g. during selection).

References #2058
2018-04-23 12:00:57 +02:00
Piotr Sarna
9dd34bf34d cql3: add INSERT JSON parsing to CQL grammar
This commit makes it possible to parse INSERT JSON statement
in CQL grammar, so it's available via cqlsh.

References #2058
2018-04-23 12:00:57 +02:00
Piotr Sarna
cdcbf654a8 cql3: add support for INSERT JSON clause
This commit adds the implementation of INSERT JSON clause
which accepts JSON object as parameter and inserts appropriate
values into appropriate columns, as defined in given JSON.

Example:
INSERT INTO testme JSON '{
  "id" : 77,
  "name" : "Jones",
  "ranking" : 8.5
}'

References #2058
2018-04-23 12:00:57 +02:00
Piotr Sarna
bfe3c20035 cql3: decouple execute from term binding in setters
This commit makes it possible to pass values to setters,
instead of having to pass cql3::term instances.
Thanks to that previously prepared terminals can be directly
used in a setter execution.

References #2058
2018-04-23 12:00:56 +02:00
Piotr Sarna
2b729a10bc cql3: change operation::make_* functions to static
This commit makes operation::make* functions static, because they
don't access any instance-specific data anyway. It is later needed
to decouple setter execution from binding a cql3::term.
2018-04-23 12:00:56 +02:00
Piotr Sarna
1d40d2186e cql3: add from_json_object function to types
This commit adds a 'from_json_object' method which will be used
for converting JSON representation of a value to raw bytes representing
the same value. This functionality will be needed by 'INSERT JSON'
clause implementation, which can turn these raw bytes into cql3::term.

References #2058
2018-04-23 12:00:56 +02:00
Piotr Sarna
e3dfa2193b cql3: Make literals::NULL_VALUE public
This commit makes NULL_VALUE public for future use in JSON translation.

References #2058
2018-04-23 12:00:56 +02:00
Botond Dénes
c34b69f4b2 Add PULL_REQUEST_TEMPLATE.md
Hopefully it will guide people wanting to contribute to the mailing
list.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <73c5d9c9884d8595b466412486494d6aa45d1d55.1524476490.git.bdenes@scylladb.com>
2018-04-23 10:45:25 +01:00
Avi Kivity
1a6b891ce2 Update scylla-ami submodule
* dist/ami/files/scylla-ami 9b4be70...02b1853 (1):
  > scylla_install_ami: remove the host id file after scylla_setup
2018-04-23 12:43:56 +03:00
Avi Kivity
b7b3d2bfec tests: continuous_data_consumer_test: increase coverage
Cover also values in the ranges 0 to 1 and 2^63 to 2^64 - 1.
Message-Id: <20180422150938.29143-2-avi@scylladb.com>
2018-04-23 11:39:06 +03:00
Avi Kivity
732177d2b0 tests: continuous_data_consumer_test: reduce runtime
continuous_data_consumer_test takes an unreasonable amount of
time to run, especially in debug mode.  Reduce the run time by
reducing the number of loops.
Message-Id: <20180422150938.29143-1-avi@scylladb.com>
2018-04-23 11:39:06 +03:00
Duarte Nunes
c8baba4e3a mutation_partition: Clarify comment about emptiness
empty() doesn't distinguish between live and dead data, so clarify
that in its comment.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:03 +01:00
Duarte Nunes
cc6c96bc92 tests: Add view_complex_test
This patch introduces view_complex_test and adds more test coverage
for materialized views.

A new file was introduced to avoid making view_schema_test slower.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:03 +01:00
Duarte Nunes
7ba1291731 tests/view_schema_test: Complete test
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:03 +01:00
Duarte Nunes
844e0b41d1 db/view: Move cells instead of copying in add_cells_to_view()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:03 +01:00
Duarte Nunes
4b4d1dbd1f db/view: Handle unselected base columns and corner cases
When a view's PK only contains the columns that form the base's PK,
then the liveness of a particular view row is determined not only by
the base row's marker, but also by the selected and, more importantly,
unselected columns.

This patch ensures that unselected columns are considered as much as
possible, even though some limitations will still exist. In
particular, we need to represent multiple timestamps (from all the
unselected columns), but have only mechanisms to record a single
timestamp.

We also have some issues when dealing with selected column, and the
way we currently delete them. Consider the following:

create table cf (p int, c int, a int, b int, primary key (p, c))
create materialized view vcf as select a, b
from cf where p is not null and c is not null
primary key (p, c)

1) update cf using timestamp 10 set a = 1 where p = 1 and c = 1
2) delete a from cf using timestamp 11 where p = 1 and c = 1
3) update cf using timestamp 1 set a = 2 where p = 1 and c = 1

After 1), the MV should include a row with row marker @ ts10,
p = 1, c = 1, a = 1. After 2), this row should be removed.

At 3), we should add a row with row marker @ ts1, p = 1, c = 1, a = 1,
with a lower timestamp. This means that the delete should not
insert a row tombstone with timestamp @ 11, as we do now but it should
just delete the view's row marker (which exists) with ts1.

Refs #3362
Fixes #3140
Fixes #3361

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
67dac67c46 mutation_partition: Regular base column in view determines row liveness
When views contain a primary key column that is not part of the base
table primary key, that column determines whether the row is live or
not. We need to ensure that when that cell is dead, and thus the
derived row marker, either by normal deletion of by TTL, so is the
rest of the row.

This patch introduces the idea of shawdowing row marker. We map the
status of the regular base column in the view's PK to the view row's
marker. If this marker is dead, so is that cell in the base table, and
so should the view row become. To enforce that, a view row's dead
marker shadows the whole row if that view includes a base regular
column in its PK.

Fixes #3360

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
4dfce4d369 db/view: Don't avoid read-before-write when view PK matches base
When a view's PK only contains the columns that form the base's PK,
then the liveness of a particular view row is determined not only by
the base row's marker, but also by the selected and, more importantly,
unselected columns. When calculating the view's row marker we need
to access those unselected columns, so we can't avoid the
read-before-write as we were doing.

Refs #3362

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
bd3cedd240 db/view: Process base updates to column unselected by its views
When a view's PK only contains the columns that form the base's PK,
then the liveness of a particular view row is determined not only by
the base row's marker, but also by the selected and, more importantly,
unselected columns. So, process base updates to columns unselected by
any of its views.

Refs #3362

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
ac9b93eb89 db/view: Consider partition tombstone when generating updates
Not adding the partition tombstone to the current list of tombstones
may cause updates to be incorrectly generated.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
e6467f46b7 tests/view_schema_test: Remove unneeded test
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
b0cb5480d5 mutation_fragment: Allow querying if row is live
For clustering_row and static_row, allow querying whether they are
live or not.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
164f043768 view_info: Add view_column() overload
For when we already have the base's column_definition.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
31370fd7b1 view_info: Explicitly initialize base-dependent fields
Instead of lazily-initializing the regular base column in the view's
PK field, explicitly initialize it. This will be used by future
patches that don't have access to the schema when wanting to obtain
that column.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
b77b71436d cql3/alter_table_statement: Forbid dropping columns of MV base tables
When a view's PK only contains the columns that form the base's PK,
then the liveness of a particular view row is determined not only by
the base row's marker, but also by the selected and, more importantly,
unselected columns.

The fact that unselected columns can keep a view row alive also
requires that users cannot drop columns of base tables with
materialized views, which this patch implements.

Refs #3362

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Avi Kivity
28be4ff5da Revert "Merge "Implement loading sstables in 3.x format" from Piotr"
This reverts commit 513479f624, reversing
changes made to 01c36556bf. It breaks
booting.

Fixes #3376.
2018-04-23 06:47:00 +03:00
Avi Kivity
513479f624 Merge "Implement loading sstables in 3.x format" from Piotr
"
Pass sstable version to parse, write and describe_type methods to make it possible to handle different versions.
For now serialization header from 3.x format is ignored.

Tests: units (release)
"

* 'haaawk/sstables3/loading_v3' of ssh://github.com/scylladb/seastar-dev:
  Add test for loading the whole sstable
  Add test for loading statistics
  Add support for 3_x stats metadata
  Pass sstable version to describe_type
  Pass sstable version to write methods
  metadata_type: add Serialization type
  Pass sstable_version_types to parse methods
  Add test for reading filter
  Add test for read_summary
  sstables 3.x: Add test for reading TOC
  sstable: Make component_map version dependent
  sstable::component_type: add operator<<
  Extract sstable::component_type to separete header
  Remove unused sstable::get_shared_components
  sstable_version_types: add mc version
2018-04-22 16:18:39 +03:00
Piotr Jastrzebski
0288121c0a Add test for loading the whole sstable
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 15:07:03 +02:00
Piotr Jastrzebski
fbe9ee72d6 Add test for loading statistics
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 15:07:03 +02:00
Piotr Jastrzebski
b683870644 Add support for 3_x stats metadata
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 15:06:51 +02:00
Takuya ASADA
01c36556bf dist/debian: use --configfile to specify pbuilderrc
Use --configfile to specify pbuilderrc, instead of copying it to home directory.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180420024624.9661-1-syuu@scylladb.com>
2018-04-22 16:06:42 +03:00
Piotr Jastrzebski
26ab3056ae Pass sstable version to describe_type
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 14:41:11 +02:00
Piotr Jastrzebski
0022c309ee Pass sstable version to write methods
This will allow writing different versions differently

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 14:41:10 +02:00
Piotr Jastrzebski
65fe564cd2 metadata_type: add Serialization type
Ignore it while reading sstable 3_x and throw
if it's present when reading 2_x.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 14:40:04 +02:00
Piotr Jastrzebski
d68f3b328f Pass sstable_version_types to parse methods
Parsing will depend on the sstable version when
we have support for both 2_x and 3_x formats.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 13:46:12 +02:00
Piotr Jastrzebski
9b448b9082 Add test for reading filter
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 13:46:12 +02:00
Piotr Jastrzebski
6bb5468ba0 Add test for read_summary
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 13:46:12 +02:00
Piotr Jastrzebski
6c2cf40ce8 sstables 3.x: Add test for reading TOC
Make sure DigestCRC32 is handled correctly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 13:46:12 +02:00
Piotr Jastrzebski
00756582ca sstable: Make component_map version dependent
Introduce sstable_version_constants that will be a proxy
serving correct constants depending on the format version.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 13:46:12 +02:00
Piotr Jastrzebski
94fbec788e sstable::component_type: add operator<<
Make it possible to print out component_type.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 13:46:12 +02:00
Piotr Jastrzebski
82d483a1d3 Extract sstable::component_type to separete header
It will be used in other places which won't depend on
sstable.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 13:45:29 +02:00
Avi Kivity
70220d8f85 tests: sstable_datafile_test: peel off redundant parentheses around compression_parameters initializer
The compression_parameter constructor is called with an extra level of
parentheses. Presumably this caused a temporary object to be constructed
and then moved into the argument being initialized, but gcc 8 complains
about ambiguity.

Make it happy by stripping off the redundant parentheses.
Message-Id: <20180421121854.12314-1-avi@scylladb.com>
2018-04-21 13:53:29 +01:00
Avi Kivity
7a141c0240 tests: network_topology_strategy_test: peel off redundant parentheses around token initializer
The token constructor is called with an extra level of parentheses. Presumably
this caused a temporary object to be constructed and then moved into the
variable being initialized, but gcc 8 complains about ambiguity.

Make it happy by stripping off the redundant parentheses.
Message-Id: <20180421121736.12136-1-avi@scylladb.com>
2018-04-21 13:53:29 +01:00
Avi Kivity
7c54e8559c mutation_fragment: fix concept for mutation_fragment::consume()
The parameters to the MutationFragmentConsumer concept must be concrete
types, not decltype(auto).

Reported by gcc 8.
Message-Id: <20180421110738.7574-1-avi@scylladb.com>
2018-04-21 13:53:29 +01:00
Duarte Nunes
6eeb6514f1 Merge 'Introduce "scylla active-sstables" command' from Tomasz
"
Prints info about sstables used by readers

Example:

  (gdb) scylla active-sstables
  sstable "keyspace1"."standard1"#5, readers=3 data_file_size=39393952
  sstable "keyspace1"."standard1"#6, readers=3 data_file_size=127513304
  sstable_count=2, total_index_lists_size=0
"

* 'tgrabiec/gdb-scylla-active-sstables' of github.com:tgrabiec/scylla:
  gdb: Introduce "scylla active-sstables" command
  gdb: Make list_unordered_map() more general
  gdb: Improve compatibility with python2.7
2018-04-19 19:04:59 +01:00
Tomasz Grabiec
fb126abdc5 gdb: Introduce "scylla active-sstables" command
Prints info about sstables used by readers

Example:

  (gdb) scylla active-sstables
  sstable "keyspace1"."standard1"#5, readers=3 data_file_size=39393952
  sstable "keyspace1"."standard1"#6, readers=3 data_file_size=127513304
  sstable_count=2, total_index_lists_size=0
2018-04-19 19:45:52 +02:00
Tomasz Grabiec
68dd61a0e7 gdb: Make list_unordered_map() more general
1) vt.name returns None for some types, use str() instead
 2) some unorderd_maps use 'false' as the second Hash_node template parameter
 3) some consumers will prefer a reference to the value instead of its address
2018-04-19 19:06:00 +02:00
Tomasz Grabiec
309257ddda gdb: Improve compatibility with python2.7
Which is still used in some builds of GDB
2018-04-19 19:04:26 +02:00
Piotr Jastrzebski
0c96573807 Remove unused sstable::get_shared_components
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-18 10:24:57 +02:00
Piotr Jastrzebski
4f1528192f sstable_version_types: add mc version
This is the latest version of 3.x SSTable format.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-18 10:24:57 +02:00
Duarte Nunes
1db6d7d6e2 cql3/functions: Add some missing functions
Fixes #3368

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180417170638.12625-1-duarte@scylladb.com>
2018-04-17 21:15:14 +03:00
Duarte Nunes
17917e12ce db/view: Wait for schema agreement in background upon view building
Waiting for schema agreement in the foreground may cause the node to
not boot in useful time.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180417125915.11262-1-duarte@scylladb.com>
2018-04-17 18:03:43 +03:00
Duarte Nunes
b5e7d5fa2c column_family: Make reader without going through mutation source
When doing the read before write for a materialized view update, call
make_reader directly.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180417091918.10043-1-duarte@scylladb.com>
2018-04-17 12:22:36 +03:00
Takuya ASADA
e99f43ef43 dist/debian: call lsb_release after command existance check
Fixes #3364

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1523907917-13188-1-git-send-email-syuu@scylladb.com>
2018-04-17 10:54:39 +03:00
Avi Kivity
2c2175ab34 Merge "Add support for reading variant integers from SSTables" from Piotr
"
Enhance continuous_data_consumer to use existing vint serialization for reading
variant integers from SSTables.

Also available at:
https://github.com/scylladb/seastar-dev/commits/haaawk/sstables3/unsigned-vint-v6

Tests: units (release)
"

* 'haaawk/sstables3/unsigned-vint-v6' of ssh://github.com/scylladb/seastar-dev:
  sstables: add test for continuous_data_consumer::read_unsigned_vint
  buffer_input_stream: make it possible to specify chunk size
  Add tests for make_limiting_data_source
  Introduce make_limiting_data_source
  sstables: add continuous_data_consumer::read_unsigned_vint
  Cover serialized_size_from_first_byte in tests
  core: add unsigned_vint::serialized_size_from_first_byte
  sstables: add all dependant headers to consumer.hh
  sstables: add all dependant headers to exceptions.hh
  core: add #pragma once to vint-serialization.hh
2018-04-17 10:09:38 +03:00
Takuya ASADA
ace44784e8 dist/debian: use ~root as HOME to place .pbuilderrc
When 'always_set_home' is specified on /etc/sudoers pbuilder won't read
.pbuilderrc from current user home directory, and we don't have a way to change
the behavor from sudo command parameter.

So let's use ~root/.pbuilderrc and switch to HOME=/root when sudo executed,
this can work both environment which does specified always_set_home and doesn't
specified.

Fixes #3366

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1523926024-3937-1-git-send-email-syuu@scylladb.com>
2018-04-17 09:37:16 +03:00
Takuya ASADA
5a71d4f814 dist/debian: use apt-get instead of apt
To suppress following warning, use apt-get instead of apt:
"WARNING: apt does not have a stable CLI interface. Use with caution in scripts."

Fixes #3365

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1523909727-13343-1-git-send-email-syuu@scylladb.com>
2018-04-17 09:29:16 +03:00
Piotr Jastrzebski
c5dda1c0c9 sstables: add test for continuous_data_consumer::read_unsigned_vint
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 21:14:34 +02:00
Piotr Jastrzebski
fdad8eba97 buffer_input_stream: make it possible to specify chunk size
This will allow to force input stream to return its data
in chunks of a specified size.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 21:11:13 +02:00
Piotr Jastrzebski
4406d11095 Add tests for make_limiting_data_source
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 21:00:35 +02:00
Piotr Jastrzebski
cc6e619aa9 Introduce make_limiting_data_source
This method takes a data_source and returns another data_source
that returns data from the input source but in chunks of limited
size.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 20:56:30 +02:00
Piotr Jastrzebski
b68d1fa5bd sstables: add continuous_data_consumer::read_unsigned_vint
This allows reading unsigned variant integers from
SSTable format 3.x.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 20:30:10 +02:00
Piotr Jastrzebski
4431c1bbe7 Cover serialized_size_from_first_byte in tests
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 20:26:44 +02:00
Piotr Jastrzebski
e423529077 core: add unsigned_vint::serialized_size_from_first_byte
This method takes first byte and determins how many bytes
are used to represent an unsigned variant integer.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 20:12:03 +02:00
Botond Dénes
07fb2e9c4d make_foreign_reader: don't wrap local readers
If the to-be-wrapped reader is local (lives on the same shard where
make_foreign_reader() is called) there is no need to wrap it with
foreign_reader. Just return it as is.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <886ed883b707f163603a40b56b8823f2bb6c47c6.1523873224.git.bdenes@scylladb.com>
2018-04-16 15:11:20 +03:00
Piotr Jastrzebski
20705c4536 sstables: add all dependant headers to consumer.hh
Before it was depending on byteorder.hh that just happend
to be included in all compilation units that were using consumer.hh
This change makes the header compile when used in new compilation units.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 11:02:49 +02:00
Piotr Jastrzebski
9288074d02 sstables: add all dependant headers to exceptions.hh
Before it was depending on print.hh that just happend
to be included in all compilation units that were using
exceptions.hh. This change makes the header compile
 when used in new compilation units.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 11:02:33 +02:00
Avi Kivity
7c01e66d53 cql3: query_processor: store and use just local shard reference of storage_proxy
Since storage_proxy provides access to the entire cluster, a local shard
reference is sufficient.  Adjust query_processor to store a reference to
just the local shard, rather than a seastar::sharded<storage_proxy> and
adjust callers.

This simplifies the code a little.
Message-Id: <20180415142656.25370-3-avi@scylladb.com>
2018-04-16 10:20:50 +02:00
Avi Kivity
f7b102238a cql3: change cql_statement methods to accept a local storage_proxy
The storage_proxy represents the entire cluster, so there's never a need
to access it on a remote shard; the local shard instance will contact
remote shard or remote nodes as needed.

Simplify the API by passing storage_proxy references instead of
seastar::sharded<storage_proxy> references. query_processor and
other callers are adjusted to call seastar::sharded::local() first.
Message-Id: <20180415142656.25370-2-avi@scylladb.com>
2018-04-16 10:18:28 +02:00
Avi Kivity
52882d1bd9 dist: debian: try harder to set the target distribution
build_deb.sh relies on pbuilder picking up a ~/.pbuilderrc which we
copy from the script. According to the pbuilder manual, "~" will refer
to the root directory (since pbuilder is run via sudo). In practice
we've observed this working with "~" referring to the current user's
home directory, but also sometimes failing, while complaining
about /root/.pbuilderrc failing. When it fails, it fails to set
the correct distribution.

To be extra sure, also copy .pbuilderrc to root's home directory. This
way, whatever behavior pbuilder chooses to follow, it will have a
configuration file to read.
Message-Id: <20180410134508.9415-1-avi@scylladb.com>
2018-04-16 10:10:47 +02:00
Avi Kivity
e0545cd2ad Merge seastar upstream
* seastar 2da7d46...1bb44ac (7):
  > doc: exclude non-API paths and symbols
  > docs: move detailed descriptions to top of page
  > doc: add default layout file
  > Merge "Misc fixes for io_tester" from Glauber
  > Merge RPC template cleanup from Gleb
  > Revert "Merge rpc template cleanup from Gleb"
  > Merge rpc template cleanup from Gleb
2018-04-15 15:48:50 +03:00
Daniel Fiala
a3533a62ba Allow /upload to be at the end of a path for sstable file
The patch fixes a bug introduce by commit
089b54f2d2.

When sstable files are stored in .../upload directory
and refresh is initialised with `nodetool` then it fails
because Scyla doesn't expect .../upload to be a part of the path.

Fixes #3334.

Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20180413132019.17779-1-daniel@scylladb.com>
2018-04-14 15:25:55 +03:00
Piotr Sarna
5a6fcebed6 cql3: add toJson function
This commit extends JSON support with toJson() function,
which can be used in SELECT clause to transform a single argument
to JSON form.

toJson() accepts any type including nested collection types,
so instead of being declared with concrete types,
proper toJson() instances are generated during calls.

This commit also supplements JSON CQL query tests with toJson calls.

Finally, it refactors JSON tests so they use do_with_cql_env_thread.

References #2058

Message-Id: <a7833650428e9ef590765a14e91c4d42532588f4.1523528698.git.sarna@scylladb.com>
2018-04-14 15:23:47 +03:00
Gleb Natapov
1a9aaece3e cql_server: fix a race between closing of a connection and notifier registration
There is a race between cql connection closure and notifier
registration. If a connection is closed before notification registration
is complete stale pointer to the connection will remain in notification
list since attempt to unregister the connection will happen to early.
The fix is to move notifier unregisteration after connection's gate
is closed which will ensure that there is no outstanding registration
request. But this means that now a connection with closed gate can be in
notifier list, so with_gate() may throw and abort a notifier loop. Fix
that by replacing with_gate() by call to is_closed();

Fixes: #3355
Tests: unit(release)

Message-Id: <20180412134744.GB22593@scylladb.com>
2018-04-12 16:56:50 +03:00
Raphael S. Carvalho
0c72781939 sstables/twcs: add support to millisecond timestamp resolution
That's blocking KairosDB users because it uses TWCS with millisecond
timestamp resolution.

Also older drivers use millisecond instead of the default microsecond.

Fixes #3152.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180411171244.19958-1-raphaelsc@scylladb.com>
2018-04-12 12:46:52 +03:00
Glauber Costa
98d784aba7 sstables: correctly calculate number of bits in filter
In my well intentioned attempt to use fewer magic numbers in the loading
code I replaced "64" with something calculated automatically from the
type being used.

Except I did it wrong, because sizeof(uint64_t) is 8, not 64.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180411155903.27665-1-glauber@scylladb.com>
2018-04-11 19:03:30 +03:00
Avi Kivity
dc0c458c12 Merge "First series on JSON support in CQL" from Piotr
"
This series introduces 'SELECT JSON' clause support for CQL.
Things implemented:
 * expanding CQL grammar with JSON keyword
 * converting values to JSON format
 * serving 'SELECT JSON *' clauses
 * tests for 'SELECT JSON'
"

* 'json_ops' of https://github.com/psarna/scylla:
  tests: add cql unit tests for SELECT JSON
  cql3: Add JSON token to CQL grammar
  cql3: add support for SELECT JSON clause
  cql3: add to_json_string function to types
2018-04-11 18:26:53 +03:00
Piotr Sarna
fa66e64c24 tests: add cql unit tests for SELECT JSON
This commit adds tests for SELECT JSON clause,
which is expected to return rows in JSON format.

References #2058
2018-04-11 17:12:21 +02:00
Piotr Sarna
1b6e3ccd2b cql3: Add JSON token to CQL grammar
This commit adds JSON keyword to CQL grammar and allows parsing
'SELECT JSON' command in CQL. Additionally, it will be useful
in implementing 'INSERT JSON(...)'.

References #2058
2018-04-11 17:12:21 +02:00
Piotr Sarna
15545da572 cql3: add support for SELECT JSON clause
This commit adds the implementation of SELECT JSON clause
which returns rows in JSON format. Each returned row has a single
'[json]' column.

References #2058
2018-04-11 17:12:02 +02:00
Avi Kivity
2d126a79b5 Merge "Multishard combined reader" from Botond
"
The multishard combined reader provides a convenient
flat_mutation_reader implementation that takes care of efficiently
reading a range from all shards that own data belonging to the range.
All this happens transparently, the user of the reader need only pass a
factory function to the multishard reader which it uses to create
remote readers when needed. These remote readers will then be managed
through foreign reader which abstracts away the fact that the reader is
located on a remote shard.
Sub readers are created for the entire read range, meaning they are free
to cross shard-range limits to fill their buffer. The output of these
sub readers is merged in a round-robin manner, the same way data is
distributed among shards. The multishard reader will move to the next
shard's reader whenever it encounters a partition whose token is after
the delimiter token.
To improve throughput and latency two levels of read-ahead is employed.
One in foreign_reader, which will try to fill the remote shard reader's
buffer in the background, in parallel to processing the results on the
local shard. And one in the multishard reader itself which will
exponentially increase concurrency whenever a sub-reader's buffer
becomes empty. But only if this happened after crossing a shard
boundary. This is important because there is no point in increasing
concurrency if a single sub reader can fill the multishard readers'
buffer.
"

* 'multishard-reader/v3' of https://github.com/denesb/scylla:
  Add unit tests for multishard_combined_reader
  Add multishard_combined_reader
  flat_mutation_reader: add peek_buffer()
  Add unit tests for foreign_reader
  forwardable reader: implement fast_forward_to(position_in_partition)
  Add foreign_reader
  flat_mutation_reader: add detach_buffer()
2018-04-11 18:03:35 +03:00
Glauber Costa
c93bc6b853 sstables: don't rely on parameter evaluation order
Asias reported in issue #3351 that a floating point exception was seen
while loading SSTables. Looking at the trace, that seems to be because
we tried to issue a modulo operation with something that was likely 0.

That field comes from the nr_bits attribute in the large bitset, and our
current code should set it to whatever we read from the Filter file -
something that has been working for ages.

The difference is that after the patch that Asias identified as culprit,
we are moving the array from which we compute the size in the same
parameter list where we are computing the size.

This works for me and passed all my tests - likely because my compiler
was doing left-to-right evaluation as I would expect it to do. But the
standard doesn't guarantee that at all, and it reads:

"Order of evaluation of the operands of almost all C++ operators
(including the order of evaluation of function arguments in a
function-call expression and the order of evaluation of the
subexpressions within any expression) is unspecified. The compiler can
evaluate operands in any order, and may choose another order when the
same expression is evaluated again."

This likely fixes the bug, but even if it doesn't we should patch it,
since we currently have something that is technically an UB.

Fixes #3351.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180411144036.24748-1-glauber@scylladb.com>
2018-04-11 18:01:06 +03:00
Daniel Fiala
202bff0b18 database: Remember versions and formats of all temporary TOC files.
The patch fixes a bug introduce by commit 089b54f2d2.
This bug exhibited when master was deployed in an attempt to populate
materialised views. The nodes restarted in the middle and they were not able
to come back.

The fix is to remember formats and versions of sstables for every generation.

Fixes: #3324.

Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20180410083114.17315-1-daniel@scylladb.com>
2018-04-11 16:47:33 +03:00
Piotr Sarna
399ab1d455 cql3: add to_json_string function to types
This commit adds a 'to_json_string' method which will be used
for converting values to JSON strings. In several cases it's not
sufficient to use 'to_string', e.g. actual strings need to be
surrounded with double quotes.

References #2058
2018-04-11 13:27:56 +02:00
Avi Kivity
4c588de70f tests: apply overprovisioned flag to all tests
Some tests escaped the --overprovisioned flag, causing them to
compete over cpu 0. Add the flag to all tests.
Message-Id: <20180410181606.8341-1-avi@scylladb.com>
2018-04-11 10:48:52 +02:00
Botond Dénes
f931b45dfa test_resources_based_cache_eviction: s/assert/BOOST_REQUIRE_*/
After moving this test into a SEASTAR_THREAD_TEST_CASE we can use the
BOOST_REQUIRE_* macros which have much better diagnostics than simple
assert()s.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <d2faa5db2bc352e6a2dcf09287faed42284c3248.1523432699.git.bdenes@scylladb.com>
2018-04-11 10:55:21 +03:00
Botond Dénes
49128d12cf Move querier_cache_resource_based_eviction test into querier_cache.cc
Turns out do_with_cql_env can be used from within SEASTAR test cases so
no reason to have a separate file for a single test case.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <028a28b7d90a3bc5ed4719ce273da05880133c0e.1523432699.git.bdenes@scylladb.com>
2018-04-11 10:55:19 +03:00
Botond Dénes
ff3982a817 Add unit tests for multishard_combined_reader 2018-04-11 10:03:50 +03:00
Botond Dénes
3a6f397fd0 Add multishard_combined_reader
Takes care of reading a range from all shards that own a subrange in the
range. The read happens sequentially, reading from one shard at a time.
Under the scenes it uses combined_mutation_reader and foreign_reader,
the former providing the merging logic and the latter taking care of
transferring the output of the remote readers to the local shard.
Readers are created on-demand by a reader-selector implementation that
creates readers for yet unvisited shards as the read progresses.
The read starts with a concurrency of one, that is the reader reads from
a single shard at a time. The concurrency is exponentially increased (to
a maximum of the number of shards) when a reader's buffer is empty after
moving the next shard. This condition is important as we only wan't to
increase concurrency for sparse tables that have little data and the
reader has to move between shards often. When concurrency is > 1, the
reader issues background read-aheads to the next shards so that by the
time it needs to move to them they have the data ready.
For dense tables (where we rarely cross shards) we rely on the
foreign_reader to issue sufficient read-aheads on its own to avoid
blocking.
2018-04-11 10:03:47 +03:00
Botond Dénes
94140258d0 flat_mutation_reader: add peek_buffer()
Allows peeking at the next mutation fragment in the buffer. As opposed
to the existing `peek()` it assumes there's at least one fragment in the
buffer. Useful for code that already ensured that the buffer is not
empty and doesn't want to introduce a continuation (via `peek()`).
2018-04-11 09:22:49 +03:00
Botond Dénes
de4a3c8bdb Add unit tests for foreign_reader 2018-04-11 09:22:49 +03:00
Botond Dénes
50b67232e5 forwardable reader: implement fast_forward_to(position_in_partition)
Instead of throwing std::bad_function_call. Needed by the foreign_reader
unit test. Not sure how other tests didn't hit this before as the test
is using `run_mutation_source_tests()`.
2018-04-11 09:22:49 +03:00
Botond Dénes
2c0f8d0586 Add foreign_reader
Local representant of a reader located on a remote shard. Manages the
lifecycle and takes care of seamlessly transferring fragments produced
by the remote reader. Fragments are *copied* between the shards in
batches, a bufferful at a time.
To maximize throughput read-ahead is used. After each fill_buffer() or
fast_forward_to() a read-ahead (a fill_buffer() on the remote reader) is
issued. This read-ahead runs in the background and is brough back to
foreground on the next fill_buffer() or fast_forward_to() call.
2018-04-11 09:22:45 +03:00
Botond Dénes
334efb4d70 flat_mutation_reader: add detach_buffer()
Allows for detaching the internal buffer of the reader. Enables
convenient transferring of buffered fragmends in a single batch but
will force the reader to reallocate it's buffer on the next
fill_buffer() call.
Introduced for foreign_reader which favours quick transferring of the
fragments between shards in a single batch, over minimizing allocations,
which can be amortized by background read-aheads.
2018-04-11 09:08:51 +03:00
Piotr Jastrzebski
190cdd27f0 core: add #pragma once to vint-serialization.hh
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-10 20:09:40 +02:00
Raphael S. Carvalho
638a647b7d sstables/compaction_manager: do not break lcs invariant by not allowing parallel compaction for it
After change to serialize compaction on compaction weight (eff62bc61e),
LCS invariant may break because parallel compaction can start, and it's
not currently supported for LCS.

The condition is that weight is deregistered right before last sstable
for a leveled compaction is sealed, so it may happen that a new compaction
starts for the same column family meanwhile that will promote a sstable to
an overlapping token range.

That leads to strategy restoring invariant when it finds the overlapping,
and that means wasted resources.
The fix is about removing a fast path check which is incorrect now because
we release weight early and also fixing a check for ongoing compaction
which prevented compaction from starting for LCS whenever weight tracker
was not empty.

Fixes #3279.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180410034538.30486-1-raphaelsc@scylladb.com>
2018-04-10 20:02:08 +03:00
Avi Kivity
fc488adc72 logalloc: remove segment_descriptor::_lsa_managed
_lsa_managed is always 1:1 with _region, so we can remove it, saving
some space in the segment descriptor vector.

Tests: unit (release), logalloc_test (debug)
Message-Id: <20180410122606.10671-1-avi@scylladb.com>
2018-04-10 13:54:38 +01:00
Asias He
d71a94a08b gossip: Add tokens and host_id in add_saved_endpoint
Problem:

   Start node 1 2 3
   Shutdown node2
   Shutdown node1 node3
   Start node1 node3
   Try to repalce_address for node 2
   The replace operation fails with the error:
     seastar - Exiting on unhandled exception: std::runtime_error
     (Cannot replace_address node2 because it doesn't exist in gossip)

This is because after all nodes shutdown, the other nodes do not have the
tokens and host_id info of node2 until node2 boots up and talks to the cluster.

If node2 can not boots up for whatever reason, currently the only way to
recover node2 is to `nodetool removenode` and bootstrap node2 again. This will
change tokens in the cluster and cause more data movement than just replacing
node2.

To fix, we add the tokens and host_id gossip application state in add_saved_endpoint
during boot up.

This is pretty safe because the generation for application state added by
add_saved_endpoint is zero, if node2 actually boots, other nodes will update
with node2's version.

Before:
$ curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool

    {
        "addrs": "127.0.0.2",
        "generation": 0,
        "is_alive": false,
        "update_time": 1523344828953,
        "version": 0
    }

Node 2 can not be replaced.

After:
$ curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool

    {
        "addrs": "127.0.0.2",
        "application_state": [
            {
                "application_state": 12,
                "value": "31284090-2557-4036-9367-7bb4ef49c35a",
                "version": 2
            },
            {
                "application_state": 13,
                "value": "... a lot of tokens ...",
                "version": 1
            }
        ],
        "generation": 0,
        "is_alive": false,
        "update_time": 1523344828953,
        "version": 0
    }

Node 2 can be replaced.

Tests: dtest/replace_address_test.py
Fixes: #3347
Message-Id: <117fd6649939e0505847335791be8d7a96e7d273.1523346805.git.asias@scylladb.com>
2018-04-10 13:14:31 +02:00
Piotr Jastrzebski
5cd48407ad test: logalloc_test: Fix build for boost 1.63
Due to https://svn.boost.org/trac10/ticket/12778?replyto=3
BOOST_REQUIRE_NE does not work with nullptr.

Tests: units (release)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <e7158e8a235356fad99560f6fcbecb57615cefe6.1523298193.git.piotr@scylladb.com>
2018-04-10 12:50:22 +03:00
Piotr Jastrzebski
3565820526 sstables: Remove unused mp_row_consumer::skip_partition
The method is never called so we can remove it and
mp_row_consumer::_skip_partition which is set only
by mp_row_consumer::skip_partition

Tests: units (release)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <cae8b09032c58361b7cfb9d02a792cb31b186f5c.1523298605.git.piotr@scylladb.com>
2018-04-10 12:50:05 +03:00
Glauber Costa
b2f9958071 large_bitset: use a chunked_vector internally and simplify API
save and load functions for the large_bitset were introduced by Avi with
d590e327c0.

In that commit, Avi says:

"... providing iterator-based load() and save() methods.  The methods
support partial load/save so that access to very large bitmaps can be
split over multiple tasks."

The only user of this interface is SSTables. And turns out we don't really
split the access like that. What we do instead is to create a chunked vector
and then pass its begin() method with position = 0 and let it write everything.

The problem here is that this require the chunked vector to be fully
initialized, not just reserved. If the bitmap is large enough that in itself
can take a long time without yielding (up to 16ms seen in my setup).

We can simplify things considerably by moving the large_bitset to use a
chunked vector internally: it already uses a poor man's version of it
by allocating chunks internally (it predates the chunked_vector).

By doing that, we can turn save() into a simple copy operation, and do
away with load altogether by adding a new constructor that will just
copy an existing chunked_vector.

Fixes #3341
Tests: unit (release)

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180409234726.28219-1-glauber@scylladb.com>
2018-04-10 10:25:06 +03:00
Paweł Dziepak
252c5dfa52 Merge "logalloc: replace zones with segment-at-a-time alloc/free" from Avi
"This patchset removes zones and replaces them with a simpler system. LSA tries
to allocate segments at higher addresses, so that we'll end up with the standard
allocator using lower addresses and LSA using higher addresses, allowing for easier
allocation from std."

* tag 'lsa-no-zones/v6' of https://github.com/avikivity/scylla:
  tests: add logalloc_test for large contiguous allocations in a challenging environemnt
  logalloc: limit std segment allocations in debug mode
  logalloc: introduce prime_segment_pool()
  logalloc: limit non-contiguous reclaims
  logalloc: pre-allocate all memory as lsa on startup
  tests: add random test for dynamic_bitset
  dynamic_bitset: optimize for large sets
  dynamic_bitset: get rid of resize()
  dynamic_bitset: remove find_*_clear() variants
  logalloc: reduce segment size to 128k
  logalloc: get rid of the emergency reserve stack
  logalloc: replace zones with segment-at-a-time alloc/free
2018-04-09 10:30:11 +02:00
Avi Kivity
80651e6dcc database: reduce idle memtable flush cpu shares to 1%
Commit 1671d9c433 (not on any release branch)
accidentally bumped the idle memtable flush cpu shares to 100 (representing
10%), causing flushes to be too when they don't comsume too much cpu.

Fixes #3243.
Message-Id: <20180408104601.9607-1-avi@scylladb.com>
2018-04-08 17:12:14 +01:00
Avi Kivity
53d97b1da3 Merge seastar upstream
* seastar 33d8f74...2da7d46 (4):
  > http routes: Add parameters to path when adding alias
  > future: compile-time optimize futurize<void>::apply()
  > memory: remove unneeded union 'pla'
  > queue: not_empty()/not_full() should throw when called after abort
2018-04-08 16:36:45 +03:00
Piotr Jastrzebski
9ad00b8207 data_consume_rows_context: Mark RANGE_TOMBSTONE_5 as nonconsuming
This state does not read any data and is used only to perform
action when finishing to read a primitive type.

According to comment on continuous_data_consumer::non_consuming
such states should be marked as non_consuming.

Tests: units (release)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <55a5c9b76268b50312ecd044291f28dcd8179a22.1523005293.git.piotr@scylladb.com>
2018-04-08 15:16:13 +03:00
Alexys Jacob
d3d736cd87 dist: gentoo: rename prometheus node exporter package
net-analyzer/prometheus-node_exporter got moved to app-metrics/node_exporter
and the service name changed on Gentoo Linux

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180405135605.26146-1-ultrabug@gentoo.org>
2018-04-08 14:11:38 +03:00
Avi Kivity
c3a2471c9e tests: add logalloc_test for large contiguous allocations in a challenging environemnt
Test large std allocations in an evironement that has seen many persistent
std allocations interspersed with lsa allocations, causing memory fragmentation.
2018-04-07 21:04:10 +03:00
Avi Kivity
2c670f6161 logalloc: limit std segment allocations in debug mode
Address Sanitizer has a global limit on the number of allocations
(note: not number of allocations less number of frees, but cumulative
number of allocations). Running some tests in debug mode on a machine
with sufficient memory can break that limit.

Work around that limit by restricting the amount of memory the
debug mode segment_pool can allocate. It's also nicer for running
the test on a workstation.
2018-04-07 21:04:10 +03:00
Avi Kivity
2baa16b371 logalloc: introduce prime_segment_pool()
To segregate std and lsa allocations, we prime the segment pool
during initialization so that lsa will release lower-addressed
memory to std, rather than lsa and std competing for memory at
random addresses.

However, tests often evict all of lsa memory for their own
purposes, which defeats this priming.

Extract the functionality into a new prime_segment_pool()
function for use in tests that rely on allocation segregation.
2018-04-07 14:52:58 +03:00
Avi Kivity
ff6325ee7e logalloc: limit non-contiguous reclaims
We may fail to reclaim because a region has reclaim disabled (usually because
it is in an allocating_section. Failed reclaims can cause high CPU usage
if all of the lower addresses happen to be in a reclaim-disabled region (this
is somewhat mitigated by the fact that checking for reclaim disabled is very
cheap), but worse, failing a segment reclaim can lead to reclaimed memory
being fragmented.  This results in the original allocation continuing to fail.

To combat that, we limit the number of failed reclaims. If we reach the limit,
we fail the reclaim.  The surrounding allocating_section will release the
reclaim_lock, and increase reserves, which will result in reclaim being
retried with all regions being reclaimable, and succeed in allocating
contiguous memory.
2018-04-07 14:52:58 +03:00
Avi Kivity
c6c659ce7a logalloc: pre-allocate all memory as lsa on startup
Since lsa tries to keep some non-lsa memory as reserve, we end up
with three blocks of memory: at low addresses, non-lsa memory that was
allocated during startup or subsequently freed by lsa; at middle addresses,
lsa; and at the top addresses, memory that lsa left alone during initial
cache population due to the reserve.

After time passes, both std and lsa will allocate from the top section,
causing a mix of lsa and non-lsa memory. Since lsa tries to free from
lower addresses, this mix will stay there forever, increasing fragmentation.

Fix that by disabling the reserve during startup and allocating all of memory
for lsa. Any further allocation will then have to be satisfied by lsa first
freeing memory from the low addresses, so we will now have just two sections
of memory: low addresses for std, and top addresses for lsa.

Note that this startup allocation does not page in lsa segments, since the
segment constructor does not touch memory.
2018-04-07 14:52:58 +03:00
Avi Kivity
413bf34fbd tests: add random test for dynamic_bitset
Compare against vector<bool> as a reference.
2018-04-07 14:52:58 +03:00
Avi Kivity
ff52767ec9 dynamic_bitset: optimize for large sets
Add 1:64 summary bitmaps so that searching for set bits is O(log n)
instead of O(n).
2018-04-07 14:52:58 +03:00
Avi Kivity
14510ae986 dynamic_bitset: get rid of resize()
Makes it easier to modify later on. Maybe "dynamic" is not so justified now.
2018-04-07 14:52:58 +03:00
Avi Kivity
f219ae1275 dynamic_bitset: remove find_*_clear() variants
They are no longer used, and cannot be efficiently implemenented
for large bitsets using a summary vector approach without slowing
down the find_*_set() variants, which are used.

Also remove find_previous_set() for the same reason.
2018-04-07 14:52:58 +03:00
Avi Kivity
54db0f3d30 logalloc: reduce segment size to 128k
Reducing the segment size reduces the time needed to compact segments,
and increases the number of segments that can be compacted (and so
the probability of finding low-occupancy segments).

128k is the size of I/O buffers and of thread stacks, so we can't
go lower than that without more significant changes.
2018-04-07 14:52:58 +03:00
Avi Kivity
3f17dbfcbc logalloc: get rid of the emergency reserve stack
Instead of keeping specific segments in the emergency reserve,
just keep the number of segments in the reserve. This simplifies the
code considerably.
2018-04-07 14:52:55 +03:00
Avi Kivity
fa73d844e9 logalloc: replace zones with segment-at-a-time alloc/free
This patch replaces the zones mechanism with something simpler: a
single segment is moved from the standard allocator to lsa and vice
versa, at a time. Fragmentation resistance is (hopefully) achieved
by having lsa prefer high addresses for lsa data, and return segments
at low address to the standard allocator. Over time, the two will move
apart.

Moving just once segment at a time reduces the latency costs of
transferring memory between free and std.
2018-04-07 13:48:40 +03:00
Piotr Sarna
a5b6047ffa cql3: add row-wise read statistics
Database read metrics is now extended by total number of rows read,
exported through cql_rows_read field.

Closes #3146
Message-Id: <02f0816c509f3d7fea06da22869eea61548284e2.1522919708.git.sarna@scylladb.com>
2018-04-05 13:39:08 +03:00
Paweł Dziepak
67aaaefde7 Merge "api: type-erase more of the column_family API" from Avi
"Together with the already merged patch, we reduce the object file
from 114MB to 81MB."

* tag 'api-diet-1/v1' of https://github.com/avikivity/scylla:
  api: type-erase all-column_family map_reduce variant
  api: simplify 6-argument map_reduce_cf() variant
2018-04-05 11:07:17 +02:00
Botond Dénes
3c078d2554 forwardable reader: pass down timeout in fast_forward_to()
The `const dht::partition_range&` overload to be more precise. The
timeout wasn't passed to the underlying reader. Spotted during test
debugging.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <39c02a55196d923bd0af8e6be6f0baa578cba070.1522915463.git.bdenes@scylladb.com>
2018-04-05 11:43:21 +03:00
Avi Kivity
1fa8682412 Merge seastar upstream
* seastar 7328d17...33d8f74 (3):
  > memory: switch to buddy allocation
  > tls: Ensure we always pass through semaphores on shutdown
  > memory: replace placement-new in unions with member construction

See scylladb/seastar#426.
2018-04-05 11:12:30 +03:00
Raphael S. Carvalho
30b6c9b4cd database: make sure sstable is also forwarded to shard responsible for its generation
After f59f423f3c, sstable is loaded only at shards
that own it so as to reduce the sstable load overhead.

The problem is that a sstable may no longer be forwarded to a shard that needs to
be aware of its existence which would result in that sstable generation being
reallocated for a write request.
That would result in a failure as follow:
"SSTable write failed due to existence of TOC file for generation..."

This can be fixed by forwarding any sstable at load to all its owner shards
*and* the shard responsible for its generation, which is determined as follow:
s = generation % smp::count

Fixes #3273.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180405035245.30194-1-raphaelsc@scylladb.com>
2018-04-05 10:58:05 +03:00
Tzach Livyatan
58e47fa0b3 docs/docker: Fix and add links to Scylla docs
- Fix link for reporting a Scylla problem
- Add a link to Best Practices for Running Scylla on Docker

Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20180404065129.16776-1-tzach@scylladb.com>
2018-04-04 10:52:04 +03:00
Piotr Sarna
ae3265f905 cql_server: use handle_exception for failed accepts
Follows up "cql_server: replace recursion in do_accepts with repeat".
Failed accepts are now handled with handle_exception routine
instead of generic then_wrapped.
Message-Id: <db820a674100ae57f3acc7b49ebae57d0c2bdbb8.1522785444.git.sarna@scylladb.com>
2018-04-03 21:34:46 +01:00
Piotr Sarna
b298bb2f7a cql_server: replace recursion in do_accepts with repeat
Recursion in do_accepts function is now replaced with
repeat utility.

Fixes #2467

Message-Id: <07d6da60726fc3ecc06139309b9716180e8accf7.1522777060.git.sarna@scylladb.com>
2018-04-03 21:23:11 +03:00
Avi Kivity
9cef37e643 Merge "db/view: View building fixes" from Duarte
"
Fixes to the view building process, discovered from field experience.

Tests: dtest(materialized_view_tests.py, smp=2)
"

* 'views/view-build-fixes/v1' of https://github.com/duarten/scylla:
  db/view: Start view building after schema agreement
  db/system_keyspace: scylla_views_builds_in_progress writes are user mem
  db/view: Require configuration option to enable view building
2018-04-03 17:42:21 +03:00
Duarte Nunes
b84bbfc51d tests/view_schema_test: Test empty partition key entries are rejected
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180403122244.10626-2-duarte@scylladb.com>
2018-04-03 15:25:53 +03:00
Duarte Nunes
ec8960df45 db/view: Reject view entries with non-composite, empty partition key
Empty partition keys are not supported on normal tables - they cannot
be inserted or queried (surprisingly, the rules for composite
partition keys are different: all components are then allowed to be
empty). However, the (non-composite) partition key of a view could end
up being empty if that column is: a base table regular column, a
base table clustering key column, or a base table partition key column,
part of a composite key.

Fixes #3262
Refs CASSANDRA-14345

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180403122244.10626-1-duarte@scylladb.com>
2018-04-03 15:25:52 +03:00
Duarte Nunes
d4db043f03 db/view: Start view building after schema agreement
If a base table or view has been dropped in one node, but another
one hasn't yet learned about it, it starts the view build process
immediately on boot, possibly calculating unneeded view updates and
causing errors at the view replica, if that replica has already
processed the schema changes. We should thus wait for schema
agreement, even if the node is a seed.

Fixes #3328

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-03 13:16:28 +01:00
Duarte Nunes
75bb66a50d db/system_keyspace: scylla_views_builds_in_progress writes are user mem
Treat writes to scylla_views_builds_in_progress as user memory, as the
number of writes is dependent on the amount of user data on views
(times the number of views, divided by the view building batch size).

Fixes #3325

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-03 13:16:28 +01:00
Duarte Nunes
bf5045c7eb db/view: Require configuration option to enable view building
View building, enabled by default, can contain or expose issues that
prevent the node from starting. In those cases, it is necessary to
disable view building such that the node can be submitted to
maintenance operations.

Fixes #3329

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-03 13:16:28 +01:00
Avi Kivity
6c35db2c44 api: type-erase all-column_family map_reduce variant
Encapsulate the map_reduce parameters in type-erased
std::function, as well as the iterator-on-all-column-families
logic. Reduces binary size by 18%.
2018-04-03 13:08:22 +03:00
Avi Kivity
0ade558999 api: simplify 6-argument map_reduce_cf() variant
The 6-argument map_reduce_cf function is identical to the 5-argument
version, except that it applies performs an extra cast (by calling
the 6th argument's operator=()).

Simplify the code by calling the 5-argument version from the 6-argument
version.

Reduces binary size by ~10%.
2018-04-03 12:22:14 +03:00
Duarte Nunes
11ece46f14 db/view: Remove leftover debug statement
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180402175238.5528-1-duarte@scylladb.com>
2018-04-03 09:41:33 +01:00
Avi Kivity
cadd983856 api: type-erase map_reduce_cf()
map_reduce_cf() is called with varying template parameters which each
have to be compiled separately. Unifying the internals to use types based
on std::any reduced the object size by 15% (115MB->99MB) with presumably
a commensurate decrease in compile time.

A version that used "I" instead of "std::any" (and thus merged the
internals only for callers that used the same result type) delivered
a 10% decrease in object size.  While std::any is less safe, in this
case it is completely encapsulated.
Message-Id: <20180402213732.432-1-avi@scylladb.com>
2018-04-03 09:31:04 +01:00
Avi Kivity
ffcdcd6d16 tests: logalloc_test: relax test_large_allocation
test_large_allocation attempts to allocate almost half of memory.
With a buddy allocator, even if more than half of memory is free,
and even if it is contiguous, it is unlikely to be available as a
single allocation because the allocator inserts boundaries at powers-
of-two addresses.

Relax the test by allocating smaller chunks (but still the same amount,
and still with challenging sizes); allocating half of memory contiguously
is not a goal.

Also use a vector instead of a deque, and reserve it, so we don't get
intervening non-lsa allocations. I'm not sure there's a problem there
but let's not depend on the allocation patterns.
Message-Id: <20180401150828.13921-1-avi@scylladb.com>
2018-04-02 19:23:06 +01:00
Avi Kivity
7ab52947dc conf: define named_value<log_level> externally
While building with -O1, I saw that the linker could not find
the vtable for named_value<log_level>. Rather than fixing up the
includes (and likely lengthening build time), fix by defining
the class as an extern template, preventing it from being
instantiated at the call site.
Message-Id: <20180401150235.13451-1-avi@scylladb.com>
2018-04-02 19:23:06 +01:00
Avi Kivity
3964fd0be2 client_state: initialize _remote_addr for internal queries
-O1 complains that client_state::_remote_addr is not initialized
(and it is right). The call site is tracing, which likely won't be
invoked for internal queries, but still.
Message-Id: <20180401150410.13651-1-avi@scylladb.com>
2018-04-02 19:23:06 +01:00
Avi Kivity
2edf36f863 bytes: don't allocate NUL terminator
Since bytes is used to encapsulate blobs, not strings, there's no
need for a NUL terminator. It will never be passed to a function
that expects a C string.
Message-Id: <20180401151009.14108-1-avi@scylladb.com>
2018-04-02 19:23:06 +01:00
Duarte Nunes
abe8bbe7b5 Merge seastar upstream
* seastar a66cc34...7328d17 (5):
  > sstring: add support for non-nul-terminated sstrings
  > core/sharded: Make async_sharded_service dtor virtual
  > reactor: pass naked pointer to submit_io
  > Merge http: "Add alias support to the API" from Amnon
  > systemwide_memory_barrier: use madvise(MADV_DONTNEED) instead of mprotect()

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-02 19:23:06 +01:00
Glauber Costa
ef84780c27 docker: default docker to overprovisioned mode.
By default, overprovisioned is not enabled on docker unless it is
explicitly set. I have come to believe that this is a mistake.

If the user is running alone in the machine, and there are no other
processes pinned anywhere - including interrupts - not running
overprovisioned is the best choice.

But everywhere else, it is not: even if a user runs 2 docker containers
in the same machine and statically partitions CPUs with --smp (but
without cpuset) the docker containers will pin themselves to the same
sets of CPU, as they are totally unaware of each other.

It is also very common, specially in some virtualized environments, for
interrupts not to be properly distributed - being particularly keen on
being delivered on CPU0, a CPU which Scylla will pin by default.

Lastly, environments like Kubernetes simply don't support pinning at the
moment.

This patch enables the overprovisioned flag if it is explicitly set -
like we did before - but also by default unless --cpuset is set.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180331142131.842-1-glauber@scylladb.com>
2018-04-01 09:17:20 +03:00
Takuya ASADA
95129c4b12 dist/ami: point wiki page when variables.json
Since there's no document for build_ami.sh on this repo, point to wiki page.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521710239-9687-1-git-send-email-syuu@scylladb.com>
2018-03-29 18:54:42 +03:00
Glauber Costa
a9ef72537f parse and ignore background writer controller
Unused options are not exposed as command line options and will prevent
Scylla from booting when present, although they can still be pased over
YAML, for Cassandra compatibility.

That has never been a problem, but we have been adding options to i3
(and others) that are now deprecated, but were previously marked as
Used. Systems with those options may have issues upgrading.

While this problem is common to all Unused options, the likelihood for
any other unused option to appear in the command line is near zero,
except for those two - since we put them there ourselves.

There are two ways to handle this issue:

1) Mark them as Used, and just ignore them.
2) Add them explicitly to boost program options, and then ignore them.

The second option is preferred here, because we can add them as hidden
options in program_options, meaning they won't show up in the help. We
can then just print a discrete message saying that those options are,
for now on ignored.

v2: mark set as const (Botond)
v3: rebase on top of master, identation suggested by Duarte.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180329145517.8462-1-glauber@scylladb.com>
2018-03-29 17:57:30 +03:00
Avi Kivity
c9aa9f0d86 Revert "logalloc: capture current scheduling group for deferring function"
This reverts commit 3b53f922a3. It's broken
in two ways:

 1. concrete_allocating_function::allocate()'ss caller,
    region_group::start_releaser() loop, will delete the object
    as soon as it returns; however we scheduled some work depending
    on `this` in a separate continuation (via with_scheduling_group())
 2. the calling loop's termination condition depends on the work being
    done immediately, not later.
2018-03-29 16:08:12 +03:00
Vladimir Krivopalov
3a9cb54c76 Merge the pair of index_readers into just one tracking a range.
Historically, we had two index_readers per a sstable_mutation_reader,
one for the lower bound and one for the upper bound. Most of public
members of the index_reader class were only called on either of those.
With the changes introduced in #2981, two readers are even more tied
together as they now have a shared-per-pair list of index pages that
needs proper cleanup and was protruding woefully into the caller code.

This fix re-structures index_reader so that it now keeps track of both
lower and upper bounds. The shared_index_lists structure is encapsulated
within index_reader and becomes an internal detail rather than a
liability.

Fixes #3220.

Tests: unit (debug, release)
+
Tested using cassandra-stress commands from #3189.

perf_fast_forward results indicate there is no performance degradation
caused by thix fix.

=========================== Baseline ===================================
running: large-partition-skips
Testing scanning large partition with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1       0         0.494458   1000000    2022418   1018     126960      27       0        0        0        0        0        0        0  97.6%
1       1         1.754717    500000     284946    997     127064       6       0        0        3        3        0        0        0  99.9%
1       8         0.551664    111112     201413    997     127064       6       0        0        3        3        0        0        0  99.7%
1       16        0.383888     58824     153232   1001     127080      10       0        0        5        5        0        0        0  99.5%
1       32        0.289073     30304     104832    997     127064      28       0        0        3        3        0        0        0  99.3%
1       64        0.236963     15385      64926    997     127064     122       0        0        3        3        0        0        0  99.2%
1       256       0.172901      3892      22510    997     127064     217       0        0        3        3        0        0        0  95.5%
1       1024      0.117570       976       8301    997     127064     235       0        0        3        3        0        0        0  49.0%
1       4096      0.085811       245       2855    664      27172     375     274        0        3        3        0        0        0  21.4%
64      1         0.512781    984616    1920149   1142     127064     139       0        0        3        3        0        0        0  98.7%
64      8         0.479232    888896    1854833   1001     127080      10       0        0        5        5        0        0        0  99.6%
64      16        0.451193    800000    1773078    997     127064       6       0        0        3        3        0        0        0  99.6%
64      32        0.408684    666688    1631305    997     127064       6       0        0        3        3        0        0        0  99.5%
64      64        0.351906    500032    1420924    997     127064      14       0        0        3        3        0        0        0  99.5%
64      256       0.227008    200000     881026    997     127064     211       0        0        3        3        0        0        0  99.1%
64      1024      0.125803     58880     468032    997     127064     290       0        0        3        3        0        0        0  65.1%
64      4096      0.098155     15424     157139    703      27856     401     267        0        3        3        0        0        0  25.8%

running: large-partition-slicing
Testing slicing of large partition:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000701         1       1427      9        296       6       4        0        3        3        0        0        0  12.4%
0       32        0.000698        32      45827      9        296       6       3        0        3        3        0        0        0  13.9%
0       256       0.000808       256     316920     10        328       6       3        0        3        3        0        0        0  24.9%
0       4096      0.004368      4096     937697     25        808      14       3        0        3        3        0        0        0  45.9%
500000  1         0.001196         1        836     13        412       9       4        0        3        3        0        0        0  22.7%
500000  32        0.001200        32      26664     13        412       9       4        0        3        3        0        0        0  22.2%
500000  256       0.001503       256     170338     14        444      10       4        0        3        3        0        0        0  25.3%
500000  4096      0.004351      4096     941465     30        956      20       4        0        3        3        0        0        0  50.7%

running: large-partition-slicing-clustering-keys
Testing slicing of large partition using clustering keys:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000625         1       1601      7        176       6       0        0        3        3        0        0        0  23.2%
0       32        0.000604        32      53016      7        176       6       0        0        3        3        0        0        0  24.7%
0       256       0.000695       256     368498      8        180       6       0        0        3        3        0        0        0  36.4%
0       4096      0.004083      4096    1003106     20        692      12       1        0        3        3        0        0        0  47.0%
500000  1         0.001198         1        835     12        516       9       3        0        3        3        0        0        0  22.8%
500000  32        0.000981        32      32631     12        388       9       3        0        3        3        0        0        0  29.2%
500000  256       0.001320       256     194011     13        384      10       3        0        3        3        0        0        0  29.0%
500000  4096      0.003944      4096    1038567     25        840      17       2        0        3        3        0        0        0  52.2%

running: large-partition-slicing-single-key-reader
Testing slicing of large partition, single-partition reader:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000849         1       1178      9        488       6       0        0        3        3        0        0        0  16.5%
0       32        0.000661        32      48415      9        296       6       0        0        3        3        0        0        0  22.2%
0       256       0.000756       256     338648     10        328       6       0        0        3        3        0        0        0  33.3%
0       4096      0.004147      4096     987610     22        840      12       1        0        3        3        0        0        0  47.9%
500000  1         0.001041         1        960     13        476       9       3        0        3        3        0        0        0  25.9%
500000  32        0.001020        32      31375     13        412       9       3        0        3        3        0        0        0  29.1%
500000  256       0.001265       256     202373     14        444      10       3        0        3        3        0        0        0  32.0%
500000  4096      0.004121      4096     994014     30        988      18       3        0        3        3        0        0        0  52.7%

running: large-partition-select-few-rows
Testing selecting few rows from a large partition:
stride  rows      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1000000 1         0.000668         1       1498      9        296       6       4        0        3        3        0        0        0  19.8%
500000  2         0.000976         2       2048     13        412       9       4        0        3        3        0        0        0  29.0%
250000  4         0.001408         4       2842     18        572      12       6        0        3        3        0        0        0  28.8%
125000  8         0.002004         8       3993     29        912      19      10        0        3        3        0        0        0  34.0%
62500   16        0.002883        16       5551     50       1584      32      18        0        3        3        0        0        0  41.9%
2       500000    1.053215    500000     474737   1138     127080     120       0        0        5        5        0        0        0  99.7%

running: large-partition-forwarding
Testing forwarding with clustering restriction in a large partition:
pk-scan   time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
yes       0.002717         2        736     24       2684       8      16        0        3        3        0        0        0  19.7%
no        0.001004         2       1992     13        412       8       2        0        3        3        0        0        0  30.2%

running: small-partition-skips
Testing scanning small partitions with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
   read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
-> 1       0         1.466523   1000000     681885   1369     139732      33       1        0        0        0        0        0        0  99.7%
-> 1       1        12.792183    500000      39086   6235     177736    5155       0        0     5123     7663        0        0        0  96.4%
-> 1       8         3.451431    111112      32193   6235     177736    5155       0        0     5123     9673        0        0        0  84.8%
-> 1       16        2.223815     58824      26452   6234     177704    5154       0        0     5122     9965        0        0        0  75.0%
-> 1       32        1.512511     30304      20036   6233     177680    5155       1        0     5123    10090        0        0        0  61.8%
-> 1       64        1.129465     15385      13621   6227     177464    5154       0        0     5122    10159        0        0        0  49.5%
-> 1       256       0.733282      3892       5308   6211     175464    5178      24        0     5122    10220        0        0        0  33.8%
-> 1       1024      0.397302       976       2457   5946     142152    5369     217        0     5120    10235        0        0        0  32.1%
-> 1       4096      0.187746       245       1305   5499      81992    5296     142        0     5122    10240        0        0        0  46.8%
-> 64      1         2.428488    984616     405444   7332     177736    5155      25        0     5123     5208        0        0        0  79.9%
-> 64      8         2.262876    888896     392817   6235     177736    5155       0        0     5123     5654        0        0        0  78.1%
-> 64      16        2.137544    800000     374261   6234     177732    5154       0        0     5122     6110        0        0        0  77.1%
-> 64      32        1.862466    666688     357960   6235     177736    5155       0        0     5123     6844        0        0        0  73.7%
-> 64      64        1.547757    500032     323069   6234     177728    5155       0        0     5123     7651        0        0        0  68.7%
-> 64      256       0.914612    200000     218672   6233     177704    5154       0        0     5122     9202        0        0        0  55.5%
-> 64      1024      0.475472     58880     123835   6229     177492    5154       5        0     5122     9930        0        0        0  45.4%
-> 64      4096      0.271239     15424      56865   6158     169480    5257     114        0     5115    10142        0        0        0  44.1%

running: small-partition-slicing
Testing slicing small partitions:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.003209         1        312      3        260       2       7        0        1        1        0        0        0  15.5%
0       32        0.004205        32       7610     16       1428      10       0        0        5        5        0        0        0  15.7%
0       256       0.009830       256      26042     97       8572      62       0        0       31       31        0        0        0  18.7%
0       4096      0.015471      4096     264748    100       8704      64       0        0       32       32        0        0        0  48.4%
500000  1         0.003654         1        274     34        492      33       0        0       32       64        0        0        0  28.7%
500000  32        0.004287        32       7464     40       1260      36       0        0       32       64        0        0        0  26.0%
500000  256       0.009598       256      26673    100       8748      64       4        0       32       64        0        0        0  20.6%
500000  4096      0.014151      4096     289449    119       7892      85       0        0       53       64        0        0        0  54.1%

========================  With the patch ================================
running: large-partition-skips
Testing scanning large partition with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1       0         0.468887   1000000    2132711   1018     126960      29       0        0        0        0        0        0        0  98.4%
1       1         1.735113    500000     288166   1001     127080      10       0        0        5        5        0        0        0  99.9%
1       8         0.535616    111112     207447    997     127064       6       0        0        3        3        0        0        0  99.6%
1       16        0.365487     58824     160947   1001     127080      15       0        0        5        5        0        0        0  99.5%
1       32        0.272208     30304     111326    997     127064      21       0        0        3        3        0        0        0  99.3%
1       64        0.224049     15385      68668    997     127064     208       0        0        3        3        0        0        0  99.1%
1       256       0.159247      3892      24440    997     127064     250       0        0        3        3        0        0        0  94.7%
1       1024      0.102107       976       9559    997     127064     292       0        0        3        3        0        0        0  53.6%
1       4096      0.084310       245       2906    664      27172     371     273        0        3        3        0        0        0  20.2%
64      1         0.508340    984616    1936923   1142     127064     129       0        0        3        3        0        0        0  98.1%
64      8         0.470369    888896    1889786    997     127064       6       0        0        3        3        0        0        0  99.6%
64      16        0.439917    800000    1818526   1001     127080      10       0        0        5        5        0        0        0  99.6%
64      32        0.397938    666688    1675358    997     127064       6       0        0        3        3        0        0        0  99.5%
64      64        0.344144    500032    1452972    997     127064      18       0        0        3        3        0        0        0  99.4%
64      256       0.219996    200000     909107    997     127064     251       0        0        3        3        0        0        0  99.1%
64      1024      0.124294     58880     473715    997     127064     284       1        0        3        3        0        0        0  62.2%
64      4096      0.097580     15424     158065    703      27856     400     267        0        3        3        0        0        0  25.3%

running: large-partition-slicing
Testing slicing of large partition:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000733         1       1365      9        296       6       4        0        3        3        0        0        0  19.3%
0       32        0.000705        32      45417      9        296       6       3        0        3        3        0        0        0  15.3%
0       256       0.000830       256     308364     10        328       6       3        0        3        3        0        0        0  26.7%
0       4096      0.004631      4096     884529     25        808      14       3        0        3        3        0        0        0  48.1%
500000  1         0.001184         1        845     13        412       9       4        0        3        3        0        0        0  23.7%
500000  32        0.001199        32      26690     13        412       9       4        0        3        3        0        0        0  21.9%
500000  256       0.001530       256     167296     14        444      10       4        0        3        3        0        0        0  26.8%
500000  4096      0.004379      4096     935474     30        956      19       4        0        3        3        0        0        0  51.5%

running: large-partition-slicing-clustering-keys
Testing slicing of large partition using clustering keys:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000620         1       1614      7        176       6       0        0        3        3        0        0        0  27.4%
0       32        0.000625        32      51218      7        176       6       0        0        3        3        0        0        0  27.0%
0       256       0.000701       256     365148      8        180       6       0        0        3        3        0        0        0  35.2%
0       4096      0.004063      4096    1008130     20        692      12       1        0        3        3        0        0        0  47.6%
500000  1         0.001208         1        827     12        516       9       3        0        3        3        0        0        0  24.3%
500000  32        0.000973        32      32876     12        388       9       3        0        3        3        0        0        0  28.7%
500000  256       0.001315       256     194612     13        384      10       3        0        3        3        0        0        0  29.0%
500000  4096      0.003950      4096    1037068     25        840      17       2        0        3        3        0        0        0  52.7%

running: large-partition-slicing-single-key-reader
Testing slicing of large partition, single-partition reader:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000844         1       1185      9        488       6       0        0        3        3        0        0        0  16.5%
0       32        0.000656        32      48753      9        296       6       0        0        3        3        0        0        0  23.1%
0       256       0.000751       256     341011     10        328       6       0        0        3        3        0        0        0  34.0%
0       4096      0.004173      4096     981632     22        840      12       1        0        3        3        0        0        0  47.0%
500000  1         0.001036         1        966     13        476       9       3        0        3        3        0        0        0  25.4%
500000  32        0.001014        32      31573     13        412       9       3        0        3        3        0        0        0  27.4%
500000  256       0.001280       256     200044     14        444      10       3        0        3        3        0        0        0  31.8%
500000  4096      0.004081      4096    1003746     30        988      18       3        0        3        3        0        0        0  51.6%

running: large-partition-select-few-rows
Testing selecting few rows from a large partition:
stride  rows      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1000000 1         0.000668         1       1498      9        296       6       3        0        3        3        0        0        0  21.7%
500000  2         0.000958         2       2088     13        412       9       4        0        3        3        0        0        0  27.7%
250000  4         0.001495         4       2676     18        572      12       6        0        3        3        0        0        0  25.8%
125000  8         0.002069         8       3866     29        912      19      10        0        3        3        0        0        0  30.8%
62500   16        0.002856        16       5603     50       1584      32      18        0        3        3        0        0        0  41.7%
2       500000    1.063129    500000     470310   1138     127080     120       0        0        5        5        0        0        0  99.7%

running: large-partition-forwarding
Testing forwarding with clustering restriction in a large partition:
pk-scan   time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
yes       0.002567         2        779     24       2684       8      16        0        3        3        0        0        0  21.5%
no        0.001013         2       1975     13        412       8       2        0        3        3        0        0        0  28.9%

running: small-partition-skips
Testing scanning small partitions with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
   read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
-> 1       0         1.349959   1000000     740763   1369     139732      33       1        0        0        0        0        0        0  99.7%
-> 1       1        12.640751    500000      39555   8144     191168    7064       0        0     7032    11481        0        0        0  96.2%
-> 1       8         3.404269    111112      32639   6651     180660    5571       0        0     5539    10505        0        0        0  84.5%
-> 1       16        2.175424     58824      27040   6434     179116    5354       0        0     5322    10365        0        0        0  74.3%
-> 1       32        1.493365     30304      20292   6335     178404    5257       0        0     5225    10294        0        0        0  61.1%
-> 1       64        1.112168     15385      13833   6256     177672    5183       0        0     5151    10217        0        0        0  48.7%
-> 1       256       0.719282      3892       5411   6211     175464    5178      24        0     5122    10220        0        0        0  33.3%
-> 1       1024      0.393236       976       2482   5946     142152    5369     217        0     5120    10235        0        0        0  30.7%
-> 1       4096      0.185284       245       1322   5499      81992    5296     142        0     5122    10240        0        0        0  44.7%
-> 64      1         2.356711    984616     417792   7361     177944    5184      21        0     5152     5266        0        0        0  79.1%
-> 64      8         2.192331    888896     405457   6253     177868    5173       0        0     5141     5690        0        0        0  77.2%
-> 64      16        2.029835    800000     394121   6245     177812    5165       0        0     5133     6132        0        0        0  75.7%
-> 64      32        1.806448    666688     369060   6245     177808    5165       0        0     5133     6864        0        0        0  72.6%
-> 64      64        1.508492    500032     331478   6242     177788    5163       0        0     5131     7667        0        0        0  67.7%
-> 64      256       0.892881    200000     223994   6233     177704    5154       0        0     5122     9202        0        0        0  54.2%
-> 64      1024      0.465715     58880     126429   6229     177492    5154       0        0     5122     9930        0        0        0  44.0%
-> 64      4096      0.266582     15424      57858   6158     169480    5257     114        0     5115    10142        0        0        0  42.3%

running: small-partition-slicing
Testing slicing small partitions:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.003113         1        321      3        260       2       0        0        1        1        0        0        0  13.4%
0       32        0.004166        32       7682     16       1428      10       0        0        5        5        0        0        0  14.9%
0       256       0.009813       256      26088     97       8572      62       0        0       31       31        0        0        0  18.4%
0       4096      0.014798      4096     276794    100       8704      64       0        0       32       32        0        0        0  46.3%
500000  1         0.003700         1        270     34        492      33       0        0       32       64        0        0        0  28.4%
500000  32        0.004030        32       7940     40       1260      36       0        0       32       64        0        0        0  27.8%
500000  256       0.009514       256      26908    100       8748      64       0        0       32       64        0        0        0  20.2%
500000  4096      0.013368      4096     306413    119       7892      85       0        0       53       64        0        0        0  53.6%

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <a72818f79ca4081a606424545b0053fa581d49e7.1522173144.git.vladimir@scylladb.com>
2018-03-29 15:23:31 +03:00
Asias He
f539e993d3 gossip: Relax generation max difference check
start node 1 2 3
shutdown node2
shutdown node1 and node3
start node1 and node3
nodetool removenode node2
clean up all scylla data on node2
bootstrap node2 as a new node

I saw node2 could not bootstrap stuck at waiting for schema information to compelte for ever:

On node1, node3

    [shard 0] gossip - received an invalid gossip generation for peer 127.0.0.2; local generation = 2, received generation = 1521779704

On node2

    [shard 0] storage_service - JOINING: waiting for schema information to complete

This is becasue in nodetool removenode operation, the generation of node1 was increased from 0 to 2.

   gossiper::advertise_removing () calls eps.get_heart_beat_state().force_newer_generation_unsafe();
   gossiper::advertise_token_removed() calls eps.get_heart_beat_state().force_newer_generation_unsafe();

Each force_newer_generation_unsafe increases the generation by 1.

Here is an example,

Before nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
   {
   "addrs": "127.0.0.2",
   "generation": 0,
   "is_alive": false,
   "update_time": 1521778757334,
   "version": 0
   },
```

After nodetool revmoenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
 {
     "addrs": "127.0.0.2",
     "application_state": [
         {
             "application_state": 0,
             "value": "removed,146b52d5-dc94-4e35-b7d4-4f64be0d2672,1522038476246",
             "version": 214
         },
         {
             "application_state": 6,
             "value": "REMOVER,14ecc9b0-4b88-4ff3-9c96-38505fb4968a",
             "version": 153
            }
     ],
     "generation": 2,
     "is_alive": false,
     "update_time": 1521779276246,
     "version": 0
 },
```

In gossiper::apply_state_locally, we have this check:

```
if (local_generation != 0 && remote_generation > local_generation + MAX_GENERATION_DIFFERENCE) {
    // assume some peer has corrupted memory and is broadcasting an unbelievable generation about another peer (or itself)
  logger.warn("received an invalid gossip generation for peer {}; local generation = {}, received generation = {}",ep, local_generation, remote_generation);

}
```
to skip the gossip update.

To fix, we relax generation max difference check to allow the generation
of a removed node.

After this patch, the removed node bootstraps successfully.

Tests: dtest:update_cluster_layout_tests.py
Fixes #3331

Message-Id: <678fb60f6b370d3ca050c768f705a8f2fd4b1287.1522289822.git.asias@scylladb.com>
2018-03-29 12:09:49 +03:00
Glauber Costa
b092234f2b sstables: print informative message earlier
Just saw this today during a crash when creating Materialized Views.
It is still unclear why this happened. But the message says:

Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: scylla: sstables/sstables.cc:2973: sstables::sstable::remove_sstable_with_temp_toc(seastar::sstring, seastar::sstring, seastar::sstring, int64_t, sstables::sstable::version_types, sstables::sstable::format_types)::<lambda()>: Assertion `tmptoc == true' failed.
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: Aborting on shard 0.
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: Backtrace:
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000005b4b4c
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000005b4df5
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000005b4ea3
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libpthread.so.0+0x000000000000f0ff
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x00000000000355f6
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x0000000000036ce7
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x000000000002e565
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x000000000002e611
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000015969d0
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x0000000001596f7a
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x000000000051ca8d

I can't even guess which table caused the problem, let alone which SSTable.
That's because those asserts are the very first thing we do. We can discuss
whether or not assert is the right behaviour (usually we can't guarantee the
state is sane if that is missing, so I don't see a problem)

But it would be nice to see which SSTable we are processing before we assert.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180328160856.10717-1-glauber@scylladb.com>
2018-03-28 19:55:04 +03:00
Avi Kivity
4419e60207 Merge "Add a confiugration API" from Amnon
"
The configuration API is part of scylla v2 configuration.
It uses the new definition capabilities of the API to dynamically create
the swagger definition for the configuration.
This mean that the swagger will contain an entry with description and
type for each of the config value.

To get the v2 of the swager file:
http://localhost:10000/v2

If using with swagger ui, change http://localhost:10000/api-doc to http://localhost:10000/v2
It takes longer to load because the file is much bigger now.
"

* 'amnon/config_api_v5' of github.com:scylladb/seastar-dev:
  Explanation about the API V2
  API: add the config API as part of the v2 API.
  Defining the config api
2018-03-28 12:45:17 +03:00
Amnon Heiman
71a04b5d26 Explanation about the API V2
Currently it holds a general explanation about the V2 and specific entry
about the config.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-03-28 12:42:04 +03:00
Amnon Heiman
94c2d82942 API: add the config API as part of the v2 API.
After this patch, the API v2 will contain a config section with all the
configuration parametes.

get http://localhost:10000/v2

Will contain the config section.

An example for getting a configuration parameter:
curl http://localhost:10000/v2/config/listen_address

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-03-28 12:42:04 +03:00
Amnon Heiman
6d907e43e0 Defining the config api
The config API is created dynamically from the config. This mean that
the swagger definition file will contain the description and types based on the
configuration.

The config.json file is used by the code generator to define a path that is
used to register the handler function.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-03-28 12:41:55 +03:00
Vladimir Krivopalov
b268ea951a tests: perf_fast_forward: Sanitize JSON files names.
Substitute various brackets and parentheses with alnum strings, remove
whitespaces, strip single-range values off curly braces.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <206adea8d05a1e64ce2627df1e4da3a845454906.1522171869.git.vladimir@scylladb.com>
2018-03-28 12:29:07 +03:00
Tomasz Grabiec
52c61df930 Relax includes
To avoid unnecessary recompilations.
Message-Id: <1522168295-994-1-git-send-email-tgrabiec@scylladb.com>
2018-03-28 10:49:07 +03:00
Avi Kivity
4c3e82bd67 Merge "db/view: Populate views with existing base table data" from Duarte
"
This series introduces the view_builder class, a sharded service
responsible for building all defined materialized views. This process
entails walking over the existing data in a given base table, and using
it to calculate and insert the respective entries for one or more views.

The view_builder uses the migration_manager to subscribe to schema
change events, and update its bookkeeping accordingly. We prefer this
to having the database call into the view_builder, as that would
create a cyclic dependency.

We serialize changes to the views of a particular base table, such
that schema changes do not interfere with the view building process.

We employ a flat_mutation_reader for each base table for which we're
building views.

We consume from the reader associated with each base table until all
its views are built. If the reader reaches the end and there are
incomplete views, then a view was added while others were being built.
In such cases, we restart the reader to the beginning of the current
token, but not to the beginning of the token range, when the view is
added. Then, when we exhaust the reader, we simply create a new one
for the whole token range, and resume building the pending views.

We aim to be resource-conscious. On a given shard, at any given moment,
we consume at most from one reader. We also strive for fairness, in that
each build step inserts entries for the views of a different base. Each
build step reads and generates updates for batch_size rows. We lack a
controller, which could potentially allow us to go faster (to execute
multiple steps at the same time, or consume more rows per batch), and
also which would apply backpressure, so we could, for example, delay
executing a build step.

Interaction with the system tables:
  - When we start building a view, we add an entry to the
    scylla_views_builds_in_progress system table. If the node restarts
    at this point, we'll consider these newly inserted views as having
    made no progress, and we'll treat them as new views;
  - When we finish a build step, we update the progress of the views
    that we built during this step by writing the next token to the
    scylla_views_builds_in_progress table. If the node restarts here,
    we'll start building the views at the token in the next_token
    column.
  - When we finish building a view, we mark it as completed in the
    built views system table, and remove it from the in-progress system
    table. Under failure, the following can happen:
        * When we fail to mark the view as built, we'll redo the last
          step upon node reboot;
        * When we fail to delete the in-progress record, upon reboot
          we'll remove this record.
    A view is marked as completed only when all shards have finished
    their share of the work, that is, if a view is not built, then all
    shards will still have an entry in the in-progress system table;
  - A view that a shard finished building, but not all other shards,
    remains in the in-progress system table, with first_token ==
    next_token.

Interaction with the distributed system tables:
  - When we start building a view, we mark the view build as being
    in-progress;
  - When we finish building a view, we mark the view as being built.
    Upon failure, we ensure that if the view is in the in-progress
    system table, then it may not have been written to this table. We
    don't load the built views from this table when starting. When
    starting, the following happens:
         * If the view is in the system.built_views table and not the
           in-progress system table, then it will be in this one;
         * If the view is in the system.built_views table and not in
           this one, it will still be in the in-progress system table -
           we detect this and mark it as built in this table too,
           keeping the invariant;
         * If the view is in this table but not in system.built_views,
           then it will also be in the in-progress system table - we
           don't detect this and will redo the missing step, for
           simplicity.

View building is necessarily a sharded process. That means that on
restart, if the number of shards has changed, we need to calculate
the most conservative token range that has been built, and build
the remainder.

When building view updates, we consider that everything is new and
nothing pre-existing is there (which means no tombstones will be sent
out to the paired view replicas).

Tests:
  unit (debug)
  dtest (materialized_view_test.py(smp=1, smp=2))
"

* 'view-building/v4' of https://github.com/duarten/scylla: (22 commits)
  tests/view_build_test: Add tests for view building
  tests/cql_test_env: Move eventually() to this file
  tests/cql_assertions: Assert result set is not empty
  tests/cql_test_env: Start the view_builder
  db/view/view_builder: Allow synchronizing with the end of a build
  db/view/view_builder: Actually build views
  flat_mutation_reader: Make reader from mutation fragments
  db/view/view_builder: React to schema changes
  service/migration_listener: Add class for view notifications
  db/view: Introduce view_builder
  column_family: Add function to populate views
  column_family: Allow synchronizing with in-progress writes
  database: Compare view id instead of name in find_views()
  database: Add get_views() function
  db/view: Return a future when sending view updates
  service/storage_service: Allow querying the view build status
  db: Introduce system_distributed_keyspace
  tests: Add unit test for build_progress_virtual_reader
  db/system_keyspace: Add API for MV-related system tables
  db/system_keyspace: Add virtual reader for MV in-progress build status
  ...
2018-03-27 15:41:28 +03:00
Daniel Fiala
051ed12ad2 cql3/functions: Print function declaration with cql3 types, not with internal types.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20180327084953.20313-3-daniel@scylladb.com>
2018-03-27 13:33:29 +03:00
Duarte Nunes
9f5cfa76f7 tests/view_build_test: Add tests for view building
This is a separate file from view_schema_test because that one is
already becoming too long to run; also, having multiple test files
means they can be executed in parallel.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
e5031f70ef tests/cql_test_env: Move eventually() to this file
Move eventually() from view_schema_test to cql_test_env.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
8528584056 tests/cql_assertions: Assert result set is not empty
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
a2c94e7925 tests/cql_test_env: Start the view_builder
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
a45fa8eaa2 db/view/view_builder: Allow synchronizing with the end of a build
Intended for use by unit tests, this patch allows synchronizing with
the end of a build for a particular view.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
5f822e3928 db/view/view_builder: Actually build views
This patch adds the missing view building code to the eponymous class.

We consume from the reader associated with each base table until all
its views are built. If the reader reaches the end and there are
incomplete views, then a view was added while others were being built.
In such cases, we restart the reader to the beginning of the current
token, but not to the beginning of the token range, when the view is
added. Then, when we exhaust the reader, we simply create a new one
for the whole token range, and resume building the pending views.

We aim to be resource-conscious. On a given shard, at any given moment,
we consume at most from one reader. We also strive for fairness, in that
each build step inserts entries for the views of a different base. Each
build step reads and generates updates for batch_size rows. We lack a
controller, which could potentially allow us to go faster (to execute
multiple steps at the same time, or consume more rows per batch), and
also which would apply backpressure, so we could, for example, delay
executing a build step.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
1f3e3d3813 flat_mutation_reader: Make reader from mutation fragments
Builds a reader from a set of ordered mutations fragments. This is
useful for building a reader out of a subset of segments returned by a
different reader. It is equivalent to building a mutation out of the
set of mutation fragments, and calling
make_flat_mutation_reader_from_mutations, except that it doest not yet
support fast-forwarding.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
a21efeffa0 db/view/view_builder: React to schema changes
The view_builder now uses the migration_manager to subscribe to schema
change events, and update its bookkeeping accordingly. We prefer this
to having the database call into the view_builder, as that would
create a cyclic dependency.

We serialize changes to the views of a particular base table, such
that schema changes do not interfere with the upcoming view building
code.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
3ffa3b6b54 service/migration_listener: Add class for view notifications
Add a convenience base class for view notifications, which provides
a default implementation for all other types of notifications.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
901faabaa2 db/view: Introduce view_builder
This patch introduces the view_builder class, a sharded service
responsible for building all defined materialized views. This process
entails walking over the existing data in a given base table, and using
it to calculate and insert the respective entries for one or more views.

This patch introduces only the bootstrap functionality, which is
responsible for loading the data stored in the system tables and
filling the in-memory data structures with the relevant information,
to be used in subsequent patches for the actual view building. The
interaction with the system tables is as follows.

Interaction with the tables in system_keyspace:
  - When we start building a view, we add an entry to the
    scylla_views_builds_in_progress system table. If the node restarts
    at this point, we'll consider these newly inserted views as having
    made no progress, and we'll treat them as new views;
  - When we finish a build step, we update the progress of the views
    that we built during this step by writing the next token to the
    scylla_views_builds_in_progress table. If the node restarts here,
    we'll start building the views at the token in the next_token
    column.
  - When we finish building a view, we mark it as completed in the
    built views system table, and remove it from the in-progress system
    table. Under failure, the following can happen:
        * When we fail to mark the view as built, we'll redo the last
          step upon node reboot;
        * When we fail to delete the in-progress record, upon reboot
          we'll remove this record.
    A view is marked as completed only when all shards have finished
    their share of the work, that is, if a view is not built, then all
    shards will still have an entry in the in-progress system table;
  - A view that a shard finished building, but not all other shards,
    remains in the in-progress system table, with first_token ==
    next_token.

Interaction with the distributed system table (view_build_status):
  - When we start building a view, we mark the view build as being
    in-progress;
  - When we finish building a view, we mark the view as being built.
    Upon failure, we ensure that if the view is in the in-progress
    system table, then it may not have been written to this table. We
    don't load the built views from this table when starting. When
    starting, the following happens:
         * If the view is in the system.built_views table and not the
           in-progress system table, then it will be in view_build_status;
         * If the view is in the system.built_views table and not in
           this one, it will still be in the in-progress system table -
           we detect this and mark it as built in this table too,
           keeping the invariant;
         * If the view is in this table but not in system.built_views,
           then it will also be in the in-progress system table - we
           don't detect this and will redo the missing step, for
           simplicity.

View building is necessarily a sharded process. That means that on
restart, if the number of shards has changed, we need to calculate
the most conservative token range that has been built, and build
the remainder.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
f298f57137 column_family: Add function to populate views
The populate_views() function takes a set of views to update, a
tokento select base table partitions, and the set of sstables to
query. This lays the foundation for a view building mechanism to exist,
which walks over a given base table, reads data token-by-token,
calculates view updates (in a simplified way, compared to the existing
functions that push view updates), and sends them to the paired view
replicas.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
67dd3e6e5d column_family: Allow synchronizing with in-progress writes
This patch adds a mechanism to class column_family through which we
can synchronize with in-progress writes. This is useful for code that,
after some modification, needs to ensure that new writes will see it
before it can proceed.

In particular, this will be used by the view building code, which needs
to wait until the in-progress writes, which may have missed that there
is now a view, is observable to the view building code.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
9640205f11 database: Compare view id instead of name in find_views()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
9b9ba525f7 database: Add get_views() function
Returns all the schemas that are views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
dc44a08370 db/view: Return a future when sending view updates
While we now send view mutations asynchronously in the normal view
write path, other processes interested in sending view updates, such
as streaming or view building, may wish to do it synchronously.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
ff15068a41 service/storage_service: Allow querying the view build status
This patch adds support for the nodetool viewbuildstatus command,
which shows the progress of a materialized view build across the
cluster.

A view can be absent from the result, successfully built, or
currently being built.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
78b232d98f db: Introduce system_distributed_keyspace
This patch introduces a distributed system keyspace, used to hold
system tables that need to be replicated across a set of replicas
(that is, can't use the LocalStrategy).

In following patches, we will use this keyspace to hold a table
containing view building status updates for each node, used to support
range movements and a new nodetool command.

Fixes #3237

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
412f081db9 tests: Add unit test for build_progress_virtual_reader
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
4227641a3d db/system_keyspace: Add API for MV-related system tables
This patch implements an API to access the MV-related system tables,
which pertain to the view building process.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
b2cae7ea09 db/system_keyspace: Add virtual reader for MV in-progress build status
Provide a virtual reader so users can query the in-progress view table
in a way compatible with Apache Cassandra.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
7811474697 db/system_keyspace: Add Scylla-specific MV system table
When building a materialized view, we divide our work by shard, so we
need to register which shard did what work in the in-progress system
table. We also add the token we started at, which will enable some
optimizations in the view building code.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
38831888d2 db/system_keyspace: Include MV system tables in all_tables()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Avi Kivity
16a7650873 Merge "More extensions: commitlog + system tables" from Calle
"
Additional extension points.

* Allows wrapping commitlog file io (including hinted handoff).
* Allows system schema modification on boot, allowing extensions
  to inject extensions into hardcoded schemas.

Note: to make commitlog file extensions work, we need to both
enforce we can be notified on segment delete, and thus need to
fix the old issue of hard ::unlink call in segment destructor.
Segment delete is therefore moved to a batch routine, run at
intervals/flush. Replay segments and hints are also deleted via
the commitlog object, ensuring an extension is notified (metadata).

Configurable listeneres are now allowed to inject configuration
object into the main config. I.e. a local object can, either
by becoming a "configurable" or manually, add references to
self-describing values that will be parsed from the scylla.yaml
file, effectively extending it.

All these wonderful abstractions courtesy of encryption of course.
But super generalized!
"

* 'calle/commitlog_ext' of github.com:scylladb/seastar-dev:
  db::extensions: Allow extensions to modify (system) schemas
  db::commitlog: Add commitlog/hints file io extension
  db::commitlog: Do segment delete async + force replay delete go via CL
  main/init: Change configurable callbacks and calls to allow adding opts
  util::config_file: Add "add" config item overload
2018-03-26 16:18:22 +03:00
Calle Wilund
ff41f47a08 db::extensions: Allow extensions to modify (system) schemas
Allows extensions/config listeners to potentially augument
(system) schemas at boot time. This is only useful for schemas
who do not pass through system_schema tables.
2018-03-26 11:58:28 +00:00
Calle Wilund
bb1a2c6c2e db::commitlog: Add commitlog/hints file io extension
To allow on-disk data to be augumented.
2018-03-26 11:58:27 +00:00
Calle Wilund
2bc98aebaf db::commitlog: Do segment delete async + force replay delete go via CL
Refs #2858

Push segement files to be deleted to a pending list, and process at
intervals or flush-requests (or shutdown). Note that we do _not_
indescrimenately do deletes in non-anchored tasks, because we need
to guarantee that finshed segments are fully deleted and gone on CL
shutdown, not to be mistaken for replayables.

Also make sure we delete segments replayed via commitlog call,
so IFF we add metadata processing for CL, we can clear it out.
2018-03-26 11:58:27 +00:00
Duarte Nunes
a985ea0fcb column_family: Don't retry flushing memtable if shutdown is requested
Since we just keep retrying, this can cause Scylla to not shutdown for
a while.

The data will be safe in the commit log.

Note that this patch doesn't fix the issue when shutdown goes through
storage_service::drain_on_shutdown - more work is required to handle
that case.

Ref #3318.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180324140822.3743-3-duarte@scylladb.com>
2018-03-26 14:36:40 +03:00
Duarte Nunes
50ad37d39b column_family: Increase scope of exception handling when flushing a memtable
In column_family::try_flush_memtable_to_sstable, the handle_exception()
block is on the inside of the continuations to
write_memtable_to_sstable(), which, if it fails, will leave the
sstable in the compaction_backlog_tracker::_ongoing_writes map, which
will waste disk space, and that sstable will map to a dangling pointer
to a destroyed database_sstable_write_monitor, which causes a seg
fault when accessed (for example, through the backlog_controller,
which accounts the _ongoing_writes when calculating the backlog).

Fix this by increasing the scope of handle_exception().

Fixes #3315

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180324140822.3743-2-duarte@scylladb.com>
2018-03-26 14:36:16 +03:00
Duarte Nunes
b7bd9b8058 backlog_controller: Stop update timer
On database shutdown, this timer can cause use-after-free errors if
not stopped.

Refs #3315

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180324140822.3743-1-duarte@scylladb.com>
2018-03-26 14:36:16 +03:00
Botond Dénes
0e6aa91269 Fix test.py output and error handling
* Don't dump output of failed tests immediately, print the output
for failed tests in the end instead.
* Fix exception printing in run_test(): don't assume passed in error
object is a `bytes` (or bytes-like) object, call the object's str
operator instead and let callers encode bytes objects instead.
* Don't assume Exception object has an `out` member, use operator str
instead to convert it to string.
* Don't print progress in run_test() directly because it results in
incomprehensible output as the executors race to print to stdout. Leave
progress report to the caller who can serialize progress prints.
* Automatically detect non-tty stdout and don't try to edit already
printed text.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <7bb7e0003ded9b28710250bff851ea849bb99f7d.1522062795.git.bdenes@scylladb.com>
2018-03-26 14:26:45 +03:00
Avi Kivity
999df41a49 Merge "Bug fixes for access-control, and finalizing roles" from Jesse
"
This series does not add or change any features of access-control and
roles, but addresses some bugs and finalizes the switch to roles.

"auth: Wait for schema agreement" and the patch prior help avoid false
negatives for integration tests and error messages in logs.

"auth: Remove ordering dependence" fixes an important bug in `auth` that
could leave the default superuser in a corrupted state when it is first
created.

Since roles are feature-complete (to the best of the author's knowledge
as of this writing), the final patch in the series removes any warnings
about them being unimplemented.

Tests: unit (release), dtest (PENDING)
"

* 'jhk/auth_fixes/v1' of https://github.com/hakuch/scylla:
  Roles are implemented
  auth: Increase delay before background tasks start
  auth: Remove ordering dependence
  auth: Don't warn on rescheduled task
  auth: Wait for schema agreement
  Single-node clusters can agree on schema
2018-03-26 09:29:41 +03:00
Jesse Haber-Kucharsky
849cf49b8d Roles are implemented
Fixes #1941.
2018-03-26 00:52:59 -04:00
Jesse Haber-Kucharsky
af24637565 auth: Increase delay before background tasks start
I've observed failures due to "missing" the peer nodes by about 1
second. Adding 5 second to the existing delay should cover most false
negative test results.

Fixes #3320.
2018-03-26 00:52:55 -04:00
Jesse Haber-Kucharsky
00f7bc676d auth: Remove ordering dependence
If `auth::password_authenticator` also creates `system_auth.roles` and
we fix the existence check for the default superuser in
`auth::standard_role_manager` to only search for the columns that it
owns (instead of the column itself), then both modules' initialization
are independent of one another.

Fixes #3319.
2018-03-25 22:38:11 -04:00
Jesse Haber-Kucharsky
968c61c296 auth: Don't warn on rescheduled task
Apache Cassandra also prints at the `info` level. This change prevents
tasks which we expect to be rescheduled from failing tests and scaring
users.

A good example of this importance of this change is when queries with a
quorum consistency level (for the default superuser) fail because a
quorum is not available. We will try again in this case, and this should
not cause integration tests to fail.
2018-03-25 22:38:11 -04:00
Jesse Haber-Kucharsky
881656cea4 auth: Wait for schema agreement
Some modules of `auth` create a default superuser if it does not already
exist.

The existence check is through a SELECT query with quorum consistency
level. If the schema for the applicable tables has not yet propagated to
a peer node at the time that it processes this query, then the
`storage_proxy` will print an error message to the log and the query
will be retried.

Eventually, the schema will propagate and the default superuser will be
created. However, the error message in the log causes integration tests
to fail (and is somewhat annoying).

Now, prior to querying for existing data, we wait for all gossip peers
to have the same schema version as we do.

Fixes #2852.
2018-03-25 22:38:08 -04:00
Jesse Haber-Kucharsky
3e415e28bc Single-node clusters can agree on schema
At some points while bootstrapping [1], new non-seed Scylla nodes wait
for schema agreement among all known endpoints in the cluster.

The check for schema agreement was in
`service::migration_manager::is_ready_for_bootstrap`. This function
would return `true` if, at the time of its invocation, the node was
aware of at least one `UP` peer (not itself) and that all `UP` peers had
the same schema version as the node.

We wish to re-use this check in the `auth` sub-system to ensure that
the schema for internal system tables used for access-control have
propagated to the entire cluster.

Unlike in `service/storage_service.cc`, where `is_ready_for_bootstrap`
was only invoked for seed nodes, we wish to wait for schema agreement
for all nodes regardless of whether or not they are seeds.

For a single-node cluster with itself as a seed,
`is_ready_for_bootstrap` would always return `false`.

We therefore change the conditions for schema agreement. Schema
agreement is now reached when there are no known peers (so the endpoint
map of the gossiper consists only of ourselves), or when there is at
least one `UP` peer and all `UP` peers have the same schema version as
us.

This change should not impact any bootstrap behavior in
`storage_service` because seed nodes do not invoke the function and
non-seed nodes wait for peer visibility before checking for schema
agreement.

Since this function is no longer checking for schema agreement only in
the context of bootstrapping non-seed nodes, we rename it to reflect its
generality.

[1] http://thelastpickle.com/blog/2017/05/23/auto-bootstrapping-part1.html
2018-03-25 22:08:42 -04:00
Duarte Nunes
aed28c667c db/view: Pass pending endpoints to storage_proxy::send_to_endpoint
This minimizes the number of mutation copies by just doing a single
call to send_to_endpoint().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180325121412.76844-2-duarte@scylladb.com>
2018-03-25 15:45:22 +03:00
Duarte Nunes
fb54c09e0b service/storage_proxy: Pass pending endpoints to send_to_endpoint()
This will allow us to minimize the number of mutation copies in
mutate_MV().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180325121412.76844-1-duarte@scylladb.com>
2018-03-25 15:45:21 +03:00
Avi Kivity
389fb54a42 tests: sstable_test: fix for_each_sstable_version concept (again)
I see the following error:

seastar/core/future-util.hh:597:10: note:   constraints not satisfied
seastar/core/future-util.hh:597:10: note:     with ‘sstables::sstable_version_types* c’
seastar/core/future-util.hh:597:10: note:     with ‘sub_partitions_read::run_test_case()::<lambda(sstables::sstable::version_types)> aa’
seastar/core/future-util.hh:597:10: note: the required expression ‘seastar::futurize_apply(aa, (* c.begin()))’ would be ill-formed
seastar/core/future-util.hh:597:10: note: ‘seastar::futurize_apply(aa, (* c.begin()))’ is not implicitly convertible to ‘seastar::future<>’

The C array all_sstable_versions decayed to a pointer (see second gcc note)
and of course doesn't support std::begin().

Fix by replacing the C array with an std::array<>, which supports std::begin().

Not clear what made this break again, or why it worked before.
Message-Id: <20180325095239.12407-1-avi@scylladb.com>
2018-03-25 13:02:57 +01:00
Duarte Nunes
44996fa6ae Merge 'Reduce link dependencies in tests' from Avi
"
This patchset removes unneeded object files from the test link,
reducing unnecessary links and reducing link time and executable
size.

Tests: build (release)
"

* tag 'test-link/v1' of https://github.com/avikivity/scylla:
  build: link release.o into scylla and perf_fast_forward binaries only
  build: don't link api/ into tests
2018-03-24 20:54:49 +00:00
Avi Kivity
09453ca0db build: link release.o into scylla and perf_fast_forward binaries only
release.o depends on the release date and git hash, and therefore changes
every time ./configure.py is executed.  In turn, this causes all tests to
relink.

Improve the situation by only linking release.o into binaries that require
it.

This helps continuous integration scripts, which call configure.py
unconditionally. Developers usually won't, so they will not see significant
savings.

Tests: build (release)
2018-03-24 22:55:03 +03:00
Avi Kivity
e78cea4121 build: don't link api/ into tests
They don't need it.
2018-03-24 22:55:02 +03:00
Duarte Nunes
f298e3e6f8 database: Log exception which caused flush to fail
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180322204419.12961-1-duarte@scylladb.com>
2018-03-23 10:57:35 +00:00
Takuya ASADA
81fbcbf6bc dist/redhat: don't redefine __debug_install_post on Fedora27 or later
Redefining _debug_install_post does not work on Fedora27 or later,
it seems because of debuginfo generation process had been changed:
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/ITJHJTUO2WFEAYIHANSM6AMAB5SIFASI/

To prevent the build error, move scylla-gdb.py to scylla-server package on
Fedora 27 or later.

Fixes #3313

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521735371-29408-1-git-send-email-syuu@scylladb.com>
2018-03-22 19:39:14 +02:00
Takuya ASADA
879c9f1bf8 dist/redhat: don't use yaml-cpp-static on Fedora
Since Fedora still does not have separated yaml-cpp-static package, don't
depends on it.

Fixes #3183

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521735663-29516-1-git-send-email-syuu@scylladb.com>
2018-03-22 18:24:28 +02:00
Avi Kivity
054854839a Merge "Fix abort during counter table read-on-delete" from Tomasz
"
This fixes an abort in an sstable reader when querying a partition with no
clustering ranges (happens on counter table mutation with no live rows) which
also doesn't have any static columns. In such case, the
sstable_mutation_reader will setup the data_consume_context such that it only
covers the static row of the partition, knowing that there is no need to read
any clustered rows. See partition.cc::advance_to_upper_bound(). Later when
the reader is done with the range for the static row, it will try to skip to
the first clustering range (missing in this case). If clustering_ranges_walker
tells us to skip to after_all_clustering_rows(), we will hit an assert inside
continuous_data_consumer::fast_forward_to() due to attempt to skip past the
original data file range. If clustering_ranges_walker returns
before_all_clustering_rows() instead, all is fine because we're still at the
same data file position.

Fixes #3304.
"

* 'tgrabiec/fix-counter-read-no-static-columns' of github.com:scylladb/seastar-dev:
  tests: mutation_source_test: Test reads with no clustering ranges and no static columns
  tests: simple_schema: Allow creating schema with no static column
  clustering_ranges_walker: Stop after static row in case no clustering ranges
2018-03-22 17:36:20 +02:00
Tomasz Grabiec
604166143c tests: mutation_source_test: Test reads with no clustering ranges and no static columns
Reproduces issue #3304.
2018-03-22 15:00:48 +01:00
Tomasz Grabiec
3a974d1776 tests: simple_schema: Allow creating schema with no static column 2018-03-22 14:44:54 +01:00
Tomasz Grabiec
d1cb6bbf95 clustering_ranges_walker: Stop after static row in case no clustering ranges
When there are no clustering ranges, stop at position which is right
after the static row instead of position which is after all clustered
rows.

This fixes an abort in sstable reader when querying a partition with
no clustering ranges (happens with counter tables) which also doesn't
have any static columns. In such case, the sstable_mutation_reader
will setup the data_consume_context such that it only covers the
static row of the partition, knowing that there is no need to ready
any clustering row. See partition.cc::advance_to_upper_bound().  Later
when we're done with reading the static row (which is absent), we will
try to skip to the first clustering range, which in this case is
missing.  If clustering_ranges_walker tells us to skip to
after_all_clustering_rows(), we will hit an asser inside
continuous_data_consumer::fast_forward_to() due to attempt to skip
past the original data file range. If clustering_ranges_walker returns
before_all_clustering_rows() instead, all is fine, becuase we end up
at the same data file position.

Fixes #3304.
2018-03-22 14:44:48 +01:00
Botond Dénes
a65b063ab2 incremental_reader_selector: remote unused members
Since 3d725d6823 the incremental_reader_selector creates readers via
a factory function so these members, used previously for creating the
readers, are not needed anymore.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <64b5cef93c1f9a2e544ccfd89e293627e99dd4cd.1521724155.git.bdenes@scylladb.com>
2018-03-22 13:14:03 +00:00
Takuya ASADA
bef08087e1 scripts/scylla_install_pkg: follow redirection of specified repo URL
We should follow redirection on curl, just like normal web browser does.
Fixes #3312

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521712056-301-1-git-send-email-syuu@scylladb.com>
2018-03-22 12:55:43 +02:00
Vladimir Krivopalov
3010b637c9 perf_fast_forward: fix error in date formatting
Instead of 'month', 'minutes' has been used.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <1e005ecaa992d8205ca44ea4eebbca4621ad9886.1521659341.git.vladimir@scylladb.com>
2018-03-22 09:57:15 +00:00
Avi Kivity
a7d86410b5 Merge "Split more tasks out of the generic scheduling group" from Glauber
"
There are a lot of things that we should be grouping in scheduling
groups that we aren't yet. The write path is not tagged at all,
mutation_query isn't either. Some, like streaming, are used - but not in
all places where they are needed.

Tests: unit (release)
"

* 'split-scheduling-groups-v2' of github.com:glommer/scylla:
  database: group statements in their own scheduling group
  database: apply streaming mutations with streaming priority
  logalloc: capture current scheduling group for deferring function
2018-03-21 15:02:50 +02:00
Nadav Har'El
e5de66d0c4 Materialized Views: unit test for missing view key columns
Add a unit test for reproducing issue #2720 (and verifying its fix)
If a user tries to create a view whose primary key is missing any of the
base table's primary key columns, the creation should fail.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180320161121.13392-3-nyh@scylladb.com>
2018-03-21 09:47:41 +00:00
Nadav Har'El
c809dd2e66 Materialized Views: change order of view creation verification
Changed the order to check a couple of error conditions *after* checking
for too many or missing primary key columns. This order (showing the
too many or missing key columns first) is more useful, and is the order
in Cassadra's code.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180320161121.13392-2-nyh@scylladb.com>
2018-03-21 09:47:41 +00:00
Nadav Har'El
871cecfd3b Materialized Views: fix checking that view key includes base key
A view's primary key must include all the columns of the base's primary
key. If we don't check this and fail the table's creation, we can discover
problems later on when using the table, as demonstrated in issue #2720.

We had such checking code (translated from the same code in Java) but it
had an extra "else" which caused nothing to be put in "missing_pk_columns"
so the error was never recognized.

Also, when the error does happen, we should print the column's name_as_text(),
not name() which is (surprisingly) just a number.

Fixes #2720.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180320161121.13392-1-nyh@scylladb.com>
2018-03-21 09:47:41 +00:00
Nadav Har'El
06aaace5a4 Materialized View: fix one of the unit tests
One of the tests created a base table with 5 primary key columns, but
put only 4 of them in the view. This is not allowed, but prior to fixing
issue #2720 this error was silently ignored. Let's fix the error instead
of relying on this silence.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180321094352.22329-1-nyh@scylladb.com>
2018-03-21 09:46:55 +00:00
Duarte Nunes
0d74442252 tests/sstable_test: Fix concept for for_each_sstable_version
Un-break the build.

Fixes #3307

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180320182011.11068-1-duarte@scylladb.com>
2018-03-20 22:26:06 +00:00
Glauber Costa
9188059427 database: group statements in their own scheduling group
When we introduced the CPU scheduler, we have also introduced a group
for commitlog - but never used it. There is also doubtful value in
separating reads from writes, since they are often part of the same
workload.

To accomodate for that, let's rename the query group to "statement"
(query is not incorrect, just confusing), and move the write path,
currently ungrouped, inside it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-20 16:58:36 -04:00
Glauber Costa
c8e169f6d8 database: apply streaming mutations with streaming priority
We are flushing the streaming memtables with streaming priority, but
applying the mutations themselves is still done with normal priorities.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-20 16:58:35 -04:00
Glauber Costa
3b53f922a3 logalloc: capture current scheduling group for deferring function
When we call run_when_memory_available, it is entirely possible that
the caller is doing that inside a scheduling_group. If we don't defer
we will execute correctly. But if we do defer, the current code will
execute - in the future - with the default scheduling group.

This patch fixes that by capturing the caller scheduling group and
making sure the function is executed later using it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-20 16:58:35 -04:00
Duarte Nunes
237184324e Merge 'Make the read repair decision per-query instead of per-page' from Botond
"
Since f8613a8415 we have reader-caching
on replicas for single-partition queries. This caching works best when
all pages of a query are sent to the same replicas consistently and thus
they can reuse the cached readers there.
The propability-based nature of read-repair works against this as on any
given page a read-repair will be attempted or not based on probability.
This will cause hight drop-rates on the replicas used for read-repair as
the cached reader will not be reusable if the replica was skipped for
one or more pages.
To fix this make the repair-decision once, on the first page of the
query and store the decision in the paging-state. On all remaining
pages of the query use this stored decision.

Tests: unit-tests(release, debug), dtest(paging_advanced_tests.py)

Refs: #1865
"

* 'per_query_repair_decision/v2' of https://github.com/denesb/scylla:
  Make the read-repair decision only once
  storage_proxy: add coordinator_query_options and coordinator_query_result
  Add query_read_repair_decision to paging-state
2018-03-20 11:59:41 +00:00
Takuya ASADA
2045891cc2 dist/debian: use rebuilt libyaml-cpp on Debian9
On Debian9, distribution provided libyaml-cpp does not able to link against
scylla, use rebuilt one from our 3rdparty repo.

fixes #3221

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521496288-12856-1-git-send-email-syuu@scylladb.com>
2018-03-20 12:30:47 +02:00
Nadav Har'El
07f88aef51 Materialized Views: test verification of only one new key column
For several reasons that I cannot fit in the margin, when a view is
created, at most ONE regular column from the base table may be added
to the view's key.
This small new test verifies that if we try to add two columns, the
view creation fails.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180319235453.1613-1-nyh@scylladb.com>
2018-03-20 00:30:18 +00:00
Nadav Har'El
1d4ceaa237 Materialized Views: Fix IS NOT NULL unit test
We had a unit test, test_primary_key_is_not_null, for testing that
we correctly complain - or don't complain - on missing "IS NOT NULL"
restrictions, as expected.

However, this test missed the actual bug we had regarding IS NOT NULL
checking - see issue #2628 - because it thought a silly syntax error
which caused an exception, was the exception we expected to see :-)

So in this patch, I rewrote this test. It fixes the test's bug and
demonstrates issue #2628 (and verifies its fix), and also tests a few
more corner cases.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180319235000.1399-1-nyh@scylladb.com>
2018-03-20 00:30:18 +00:00
Nadav Har'El
da110d612e Materialized Views: Fix "IS NOT NULL" checking
When creating a materialized view, the user must provide a "IS NOT NULL"
restriction for each of the created view's primary columns. If such a
restriction is missing, the view creation should fail. In #2628 we noticed
that sometimes it wasn't failing, but later updates to such table would fail,
which is a bug.

There is actually one special case where "IS NOT NULL" is optional:
It is optional on the base's partition key column (when there is just
one of these) because it is already assumed that the partition key in
its entirety can never be.

Our "IS NOT NULL" test, validate_primary_key(), had two logic errors
which caused it to miss some cases of missing "IS NOT NULL":

1. Instead of checking whether a certain column is a the base's only
   partition-key column, and avoid testing IS NOT NULL just for that
   specific column, the code tested whether the schema *has* such a
   column, and if it did, the test was skipped for all columns.

2. When the code found the one new column in the view's primary key, it
   was so happy to find it that it immediately returned, and forgot to
   test the IS NOT NULL on that column :-)

Both errors are fixed by this patch.
See the next patch for a unit test.

Fixes #2628.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180319233657.522-1-nyh@scylladb.com>
2018-03-20 00:30:18 +00:00
Glauber Costa
f80d4a28d7 flat_mutation_reader: explicitly yield at every partition
Right now we have yield points between partition processing guaranteed
by the fact that there are .get()s around the code, and those include
an yield point.

We have been discussing removing the implicit yield point from get and
pushing that to the caller. In that spirit, let's yield explicitly here
if needed.

It should be the responsibility of the loop that it doesn't hurt
latency, either by the fact that it is bounded by a small number of
iterations or yields. In other words, that loop should have a yield
point on every iteration (like the non-thread variant does).

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180319173051.8918-1-glauber@scylladb.com>
2018-03-19 19:39:01 +02:00
Avi Kivity
03c22ad524 Merge "Support for Cassandra 2.2 (LA) SSTable formats" from Daniel
"
These patches add support for C* 2.2 file(name) format.

Namely:
  * It forces Scylla to write files in la format.
  * Adds storage-service feature for them.
  * cf and ks are determined from directory, not from file-name (for 2.2 format).
  * Adds some other fixes to make dtest happy.
  * Unit tests work with la format or with both formats.
"

* 'danfiala/filename-format-2.2-v4' of https://github.com/hagrid-the-developer/scylla:
  tests/sstables: Tests use la format or iterate over both formats.
  tests/sstables: Helper functions support 2.2 format directory structure.
  stables: Use 2.2 (la) format as a default format to store sstables if it is enabled by feature-bits.
  storage_service: Support la sstable storage format as a feature.
  sstables: make_descriptor accepts sstable-directory, because it is necessary to determine cf and ks in 2.2 format.
  sstables: Throw more detail exception for unknown item in reverse_map.
  sstables/compaction: Suppress NaN in a report of a throughput.
2018-03-19 17:49:44 +02:00
Botond Dénes
eee9bda85b Make the read-repair decision only once
Make the read-repair decision on the first page of a paged-query and use
it for all the remaining pages. This helps querier-cache hit-rates as
reads to nodes will be sent consistently throught the query.
2018-03-19 16:29:43 +02:00
Avi Kivity
fe4049f074 Merge "gossip: Fixes to shadow round" from Duarte
"
Fixes to gossip pertaining to the shadow round.

In particular, an issue preventing a node from being marked as alive is
fixed: After the shadow round and the feature checking, we remove any
endpoints from the state - namely, those that contacted us -, before
re-adding them again. This is because those nodes that replied would
have been marked as alive in the endpoint state map (but not fully,
they'd be absent from the live endpoints list), and re-adding them marks
them as dead.

If the shadow round failed, after doing the feature checking against the
system tables, we were not clearing the state map and re-adding the
endpoints. This leaves the alive marker set, and prevents
real_mark_alive() from eventually being called.

Fixes #3301
"

* 'gossip/shadow-round-fixes/v3' of https://github.com/duarten/scylla:
  gms/gossiper: Remove superfluous check
  service/storage_service: Always re-add loaded endpoints
  gms/gossiper: Check for shadow round completion before throwing
2018-03-19 15:22:35 +02:00
Botond Dénes
2e2abf6edb storage_proxy: add coordinator_query_options and coordinator_query_result
As yet more parameters and return-values are about to be added to all
storage_proxy::query_* methods we need a way that scales better than
changing the signatures every time. To this end we aggregate all
non-mandatory query parameters into `coordinator_query_options` and all
return values into `coordinator_query_result`.
This way new fields can be simply added to the respective structs while
the signatures of the methods themselves and their client code can
remain unchanged.
2018-03-19 15:17:35 +02:00
Botond Dénes
b55dcc2ce5 Add query_read_repair_decision to paging-state
This new field will store the repair-decision made on the first page of
the query. This decision will be sticky to all pages of the query.
In mixed clusters the decision might not happen on the first page and it
might even change during the query as old coordinators will not store
nor respect the decision.
2018-03-19 15:17:31 +02:00
Daniel Fiala
4d703f9c6a tests/sstables: Tests use la format or iterate over both formats.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-19 14:12:10 +01:00
Daniel Fiala
386cae4ad2 tests/sstables: Helper functions support 2.2 format directory structure.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-19 14:12:09 +01:00
Daniel Fiala
089b54f2d2 stables: Use 2.2 (la) format as a default format to store sstables if it is enabled by feature-bits.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-19 14:12:01 +01:00
Daniel Fiala
802be72ca6 storage_service: Support la sstable storage format as a feature.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-19 14:10:31 +01:00
Duarte Nunes
9cadfb27f1 gms/gossiper: Remove superfluous check
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-19 13:08:53 +00:00
Duarte Nunes
2c7b77b6d2 service/storage_service: Always re-add loaded endpoints
After the shadow round and the feature checking, we remove any
endpoints from the state - namely, those that contacted us -, before
re-adding them again. This is because those nodes that replied would
have been marked as alive in the endpoint state map (but not fully,
they'd be absent from the live endpoints list), and re-adding them
marks them as dead.

If the shadow round failed, after doing the feature checking against
the system tables, we were not clearing the state map and re-adding
the endpoints. This left the alive marker set, and prevented
real_mark_alive() from eventually being called.

Fixes #3301

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-19 13:08:53 +00:00
Duarte Nunes
69b28a4f2b gms/gossiper: Check for shadow round completion before throwing
For values of `shadow_round_ms` lower than 1 second, this was assuming
failure without checking.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-19 13:08:53 +00:00
Avi Kivity
601d8f7cff test: switch boost.test from --log_sink to --logger
Upstream fix works only for --logger according to

  https://github.com/boostorg/test/pull/124
Message-Id: <20180319121520.11110-1-avi@scylladb.com>
2018-03-19 13:26:28 +01:00
Calle Wilund
eb10d32ff9 main/init: Change configurable callbacks and calls to allow adding opts
Refs #2526

Allows sub-configs to dynamically add yaml/command line options to
the main config object, i.e. extend the scylla.yaml
2018-03-19 12:24:04 +00:00
Calle Wilund
fc97e39782 util::config_file: Add "add" config item overload 2018-03-19 12:24:04 +00:00
Duarte Nunes
71fddad376 Merge 'Reduce unit test runtime' from Avi
This patchset reduces the time required to run the tests, mostly by
running them in parallel.

I measured a reduction of 3.5X on a 1s4c4t desktop (release mode).

Tests: unit (release)

* tag 'faster-tests/v2' of https://github.com/avikivity/scylla:
  tests: run tests in parallel
  tests: simplify timeout handling
  tests: don't require crash integrity
  tests: allow sharing the machine with other tests
  tests: extract seastar options to a separate variable
  tests: reduce memory for tests
  tests: add "--" unconditionally for boost tests
  tests: start cql_test_env without binding to messaging port
  storage_service: allow starting gossiper without binding to messaging port
  gms: allow gossiper to start_gossiping() without binding to the port
  tests: close file correctly in loading_file_test
2018-03-19 10:24:55 +00:00
Avi Kivity
31b86a46a0 tests: run tests in parallel
Launch tests in a concurrent executor with worker count determined
by available memory.
2018-03-19 12:17:10 +02:00
Avi Kivity
638611a350 tests: simplify timeout handling
The subprocess module can handle timeouts itself, so use this
to simplify the module code.
2018-03-19 12:16:58 +02:00
Avi Kivity
95abed020b tests: don't require crash integrity
We don't resume tests after crashes, so no need to spend time waiting
for the disk to fsync.
2018-03-19 12:16:58 +02:00
Avi Kivity
b3d8dadf0c tests: allow sharing the machine with other tests
By using the overprovisioned flag, we reduce polling and pinning, so
less CPU time is wasted and the scheduler has more options to schedule
reactor threads.
2018-03-19 12:16:58 +02:00
Avi Kivity
3d84c8945d tests: extract seastar options to a separate variable 2018-03-19 12:16:58 +02:00
Avi Kivity
8b1cff90ce tests: reduce memory for tests
If we reduce memory for an individual test, we can run more
in parallel.
2018-03-19 12:16:58 +02:00
Avi Kivity
c3750176d8 tests: add "--" unconditionally for boost tests
Now that we have a minimum boost version, we don't need to check whether
boost requires "--" before test-specific command line arguments. Removing
the check speeds up the test a little.
2018-03-19 12:16:58 +02:00
Avi Kivity
9a04def202 tests: start cql_test_env without binding to messaging port
Allows running tests in parallel.
2018-03-19 12:16:52 +02:00
Avi Kivity
ee68bfa49d storage_service: allow starting gossiper without binding to messaging port 2018-03-19 12:16:11 +02:00
Avi Kivity
02ce0c4cde gms: allow gossiper to start_gossiping() without binding to the port
This is useful in tests, which don't communicate. Binding to a port can
fail if the system is running something else.

It would be better to prevent even more of the gossiper from starting up,
but that is more difficult.
2018-03-19 12:16:11 +02:00
Avi Kivity
f2dd31ee76 tests: close file correctly in loading_file_test
Otherwise, we crash with --overprovisioned on a use-after-free.
2018-03-19 12:16:11 +02:00
Duarte Nunes
810db425a5 gms/gossiper: Synchronize endpoint state destruction
In gossiper::handle_major_state_change() we set the endpoint_state for
a particular endpoint and replicate the changes to other cores.

This is totally unsynchronized with the execution of
gossiper::evict_from_membership(), which can happen concurrently, and
can remove the very same endpoint from the map  (in all cores).

Replicating the changes to other cores in handle_major_state_change()
can interleave with replicating the changes to other cores in
evict_from_membership(), and result in an undefined final state.

Another issue happened in debug mode dtests, where a fiber executes
handle_major_state_change(), calls into the subscribers, of which
storage_service is one, and ultimately lands on
storage_service::update_peer_info(), which iterates over the
endpoint's application state with deferring points in between (to
update a system table). gossiper::evict_from_membership() was executed
concurrently by another fiber, which freed the state the first one is
iterating over.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180318123211.3366-1-duarte@scylladb.com>
2018-03-18 14:38:04 +02:00
Avi Kivity
38e1eb5e42 Update scylla-ami submodule
* dist/ami/files/scylla-ami 5170011...9b4be70 (1):
  > do not special case i3 for controller code
2018-03-18 11:37:00 +02:00
Takuya ASADA
378bf7cec0 dist/debian: switch Debian9 to boost-1.65
We switched Debian8/Ubuntu14/Ubuntu16 to boost-1.65 to fix #3090, but Debian9
stil uses distribution provided boost-1.62, it causes same build error.
So switch it to our boost-1.65, too.

See c636f552e0

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521250590-16510-1-git-send-email-syuu@scylladb.com>
2018-03-18 10:23:43 +02:00
Daniel Fiala
10db711259 sstables: make_descriptor accepts sstable-directory, because it is necessary to determine cf and ks in 2.2 format.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-18 06:09:47 +01:00
Daniel Fiala
abdf22f5cd sstables: Throw more detail exception for unknown item in reverse_map.
* This can help with debugging.

Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-18 05:54:15 +01:00
Daniel Fiala
c5eca593fc sstables/compaction: Suppress NaN in a report of a throughput.
* It causes failures in dtest.

Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-18 05:46:32 +01:00
Glauber Costa
f5c32423b8 summary: don't go through all entries when computing memory size.
Summary has a function, memory_size(), that estimates the amount of
memory the summary takes. It is my understanding that this is called
to serve information to tooling.

First, this function is innacurate because it doesn't take into account
the tokens per each entry, just the keys. But more importantly, it has
to iterate over all keys which can be pretty expensive if the entries
list is long. We are now keeping that in a memory area, with just
pointers in the entry. So instead of iterating through the entries, we
can iterate through the memory areas, which is much cheaper.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180316120915.16809-1-glauber@scylladb.com>
2018-03-16 12:57:19 +00:00
Duarte Nunes
fef9d4fa72 service/storage_service: Avoid superfluous seastar::thread
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180315212202.12176-1-duarte@scylladb.com>
2018-03-16 12:52:15 +00:00
Nadav Har'El
e9702aa126 Materialized Views: don't lose updates while cluster is changing
When the cluster is changed (nodes added or removed), ranges of tokens
are moved between nodes. Scylla initiates a streaming process between an
old and a new owner of the range, which can take a long time. During
that streaming time, the new owner of the range is known as a "pending node"
for this range, and all updates must go to both the old owner (in case the
movement fails!) and the pending node (in case the movement succeeds).

For materialized views, because they are ordinary tables, streaming moves
all the view's data that existed before the streaming started. But we did
not send updates done to the view *during* the streaming. A dtest
demonstrates that the new node will miss some of the view update, and will
require a repair of the view tables immediately after the cluster change
ends, which is not good. To fix that, we need to send every new update
that happens during the streaming also to the "pending node". We already
did this properly for base-table updates, but not to the view updates:
Each base table replica wrote to only one paired view table replica,
and nobody wrote to the new pending node (in case where there is one,
for the particular view token involved).

In this patch, we make sure that all view updates go also to the "pending
nodes" when there are any. We do the same thing that Cassandra does, which
is - *all* base replicas write the update to the pending node(s).
Arguably, it is inefficient that all replicas send the update to the same
node. In most cases it is enough to send it from just one base replica -
the one who is slated to be the new node's pair.  I opened
https://issues.apache.org/jira/browse/CASSANDRA-14262 about this idea.
But that is an optimization. The patch as-is already fixes the bug.

Fixes #3211

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180313171853.17283-1-nyh@scylladb.com>
2018-03-16 12:00:29 +00:00
Duarte Nunes
934d805b4b Merge 'Grant default permissions' from Jesse
The functional change in this series is in the last patch
("auth: Grant all permissions to object creator").

The first patch addresses `const` correctness in `auth`. This change
allowed the new code added in the last patch to be written with the
correct `const` specifiers, and also some code to be removed.

The second-to-last patch addresses error-handling in the authorizer for
unsupported operations and is a prerequisite for the last patch (since
we now always grant permissions for new database objects).

Tests: unit (release)

* 'jhk/default_permissions/v3' of https://github.com/hakuch/scylla:
  auth: Grant all permissions to object creator
  auth: Unify handling for unsupported errors
  auth: Fix life-time issue with parameter
  auth: Fix `const` correctness
2018-03-16 09:43:36 +01:00
Avi Kivity
9eb7c0c65b Merge "Remove (some) reactor stalls in the SSTable code" from Glauber
"
This is an improvement on my latest series. Instead of just
dealing with the problem of destroying the Summary that I have
identified in a previous test, I have tried to find other sources
of stalls.

Some of them are on readers and would affect early processes and
operations like nodetool refresh.

Others are on writers, which can affect any SSTable being written.

Two of those stalls (on large filter, on summary read), I saw in a
synthetic benchmark where I used very small values + nodetool compact
to generate one SSTable with many keys. They were 80ms and 20ms
respectively, and now they are totally gone.

For others, I just tried to be safe (for instance, if we know
reading/writing large vectors can be costly, just always insert
preemption points in them).

With all of these patches applied, I no longer see stalls coming from
the SSTable code in those tests (although given enough time, I am sure I
can find more).

Tests: unit (release)
Fixes: #3282, Fixes #3281, Fixes #3269
"

* 'sstables-stalls-v3-updated' of github.com:glommer/scylla:
  large_bitset/bloom filter: add preemption points in loops
  sstables: read filter in a thread
  abstract summary entry version of the token with a token view
  add a token_view
  sstables: rework summary entries reading
  sstables: avoid calls to resize for vectors
  sstables: replace potentially large for loop with do_until
  summary_entry: do not store key bytes in each summary entry
  tests: change tests to make summary non-copyable
  chunked_vector: do not iterate to destruct trivially destructible types
2018-03-16 09:43:36 +01:00
Glauber Costa
7fd31088f2 large_bitset/bloom filter: add preemption points in loops
SSTables that contain many keys - a common case with small partitions in
long lived nodes - can generate filters that are quite large.

I have seen stalls over 80ms when reading a filter that was the result
of a 6h write load of very small keys after nodetool compact (filter was
in the 100s of MB)

Similar care should be taken when creating the filter, as if the
estimated number of partitions is big, the resulting large_bitset can be
quite big as well.

If we treat the i_filter.hh and large_bitset.hh interfaces as truly
generic, then maybe we should have an in_thread version along with a
common version. But the bloom filter is the only user for both and even
if that changes in the future, it is still a good idea to run something
with a massive loop in a thread.

So for simplicity, I am just asserting that we are on a thread to avoid
surprises, and inserting preemption points in the loops.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-15 12:24:15 -04:00
Glauber Costa
c424ba01df sstables: read filter in a thread
Constructing filter objects can be quite expensive. We will insert some
yield points around, and that is made a lot easier if we are calling
things from a thread.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-15 12:24:15 -04:00
Glauber Costa
e680c7c8cc abstract summary entry version of the token with a token view
dht::token doesn't have a trivial destructor, so destroying an array
full of those can be quite expensive. If we use the same trick as we
used for the summary - storing the token data in a stable memory
location - we can leave the entries with a trivial destructor and destroy
the chunks themselves. Those being larger, they will be more efficient
to delete.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-15 12:24:15 -04:00
Glauber Costa
dddc7e1676 add a token_view
Ideally we would like tokens to be trivially destructible, so that we
can easily dispose of giant vectors holding them. While that is hard to
do with our current infrastructure, we can introduce a token_view, which
holds a bytes_view elements instead of the real data - making it
trivially destructible.

The comparators are then changed to take a token_view, and an implicit
conversion function is provided from tokens so they get compared.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-15 12:24:09 -04:00
Duarte Nunes
9da2b66cff cql3/untyped_result_set: Conform to boost::range concept
Enable some of that boost::copy_range goodness.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180315121801.2808-1-duarte@scylladb.com>
2018-03-15 13:34:44 +01:00
Takuya ASADA
69d226625a dist/ami: update CentOS base image to latest version
Since we requires updated version of systemd, we need to update CentOS base
image.

Fixes #3184

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1518118694-23770-1-git-send-email-syuu@scylladb.com>
2018-03-15 10:47:37 +02:00
Takuya ASADA
945e6ec4f6 dist/debian: use 3rdparty ppa on Ubuntu 18.04
Currently Ubuntu 18.04 uses distribution provided g++ and boost, but it's easier
to maintain Scylla package to build with same version toolchain/libraries, so
switch to them.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521075576-12064-1-git-send-email-syuu@scylladb.com>
2018-03-15 10:41:05 +02:00
Takuya ASADA
1bb3531b90 dist/redhat: build only scylla, iotune
Since we don't package tests, we don't need to build them.
It reduces package building time.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521066363-4859-1-git-send-email-syuu@scylladb.com>
2018-03-15 10:40:35 +02:00
Takuya ASADA
856dc0a636 dist/redhat: switch to gcc-7.3
We have hit following bug on debug-mode binary:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82560
Since it's fixed on gcc-7.3, we need to upgrade our gcc package.

See: https://groups.google.com/d/topic/scylladb-dev/RIdIpqMeTog/discussion
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521064473-17906-1-git-send-email-syuu@scylladb.com>
2018-03-15 10:39:25 +02:00
Vladimir Krivopalov
5c3b32a9bf Remove to_boost_visitor heler.
The minimal Boost version required for Scylla now is 1.58 and this
helper is no longer needed.
Replaced it with more generic visitation utils from Seastar.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <e589ace7ac411d3d55dead475a8a2271f51642f1.1520976010.git.vladimir@scylladb.com>
2018-03-14 23:49:07 +00:00
Avi Kivity
bb4b1f0e91 Merge "Ubuntu/Debian build error fixes" from Takuya
* 'debian-ubuntu-build-fixes-v2' of https://github.com/syuu1228/scylla:
  dist/debian: build only scylla, iotune
  dist/debian: switch to boost-1.65
  dist/debian: switch to gcc-7.3
2018-03-14 22:50:40 +02:00
Takuya ASADA
7f891e7a48 dist/debian: build only scylla, iotune
Since we don't package tests, we don't need to build them.
It reduces package building time.
2018-03-15 04:33:11 +09:00
Glauber Costa
89b28a4bea sstables: rework summary entries reading
Like we did for generic arrays, let's move away from resize() in trying
to read summary entries and move to a reserve/push pattern.

I have tested this patch reading a summary file that a couple of MB big.
Stalls up to 20ms were seen. After applying this patch, no such stalls
are present.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-14 13:35:15 -04:00
Glauber Costa
a33f0d6f92 sstables: avoid calls to resize for vectors
resize is considered harmful, since it will attempt to allocate memory
and initialize each element of the vector. This can cause reactor stalls
that correlates to latency peaks.

A better idiom is reserve first - so we know we will have enough memory
and won't have to move contents - and push_back/emplace_back each
individual member.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-14 13:32:36 -04:00
Glauber Costa
0d9488eae6 sstables: replace potentially large for loop with do_until
We are pushing ints here, so it shouldn't be that bad in practice.
But a potentially gigantic for loop is just asking for a stall since we won't
need_preempt() it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-14 11:58:03 -04:00
Glauber Costa
091b0f9d41 summary_entry: do not store key bytes in each summary entry
If we store a bytes_view instead of bytes, that has a trivial destructor
and then we don't need to destroy each element individually. To do that,
we allocate the data in a couple of large arrays which can be disposed of
easily and point to it.

We still can't destroy trivially because of the token.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-14 10:46:20 -04:00
Glauber Costa
d15bfbe548 tests: change tests to make summary non-copyable
Right now the summary can be copied, but in real life there is no reason
for this to be a requirement. Tests want it, so we can destroy a summary,
load another, and compare the two. We can achieve this by allowing the first
summary to be moved, and then we can still have a reference to the second.

I am about to make a change that will make the summary not copyable as a
requirement, so we need to do this first.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-14 10:46:20 -04:00
Glauber Costa
00d04b49a0 chunked_vector: do not iterate to destruct trivially destructible types
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-14 09:16:54 -04:00
Takuya ASADA
c636f552e0 dist/debian: switch to boost-1.65
We get following compile error on Debian/Ubuntu with boost-1.63:

/opt/scylladb/include/boost/intrusive/pointer_plus_bits.hpp:76:48: error: '*((void*)& __tmp +136)' is used uninitialized in this function [-Werror=uninitialized]
       n = pointer(uintptr_t(p) | (uintptr_t(n) & Mask));
                                         ~~~~~~~~~~~~~~^~~~~~~

This is known issue (https://github.com/boostorg/intrusive/issues/29), fixed
on boost-1.65.

Switch to boost-1.65 to fix the issue.

Fixes #3090
2018-03-14 22:13:24 +09:00
Avi Kivity
a0bc126ae2 Merge seastar upstream
* seastar bcfbe0c...a66cc34 (3):
  > reactor: fix sleep mode
  > cpu scheduler: don't penalize first group to run
  > Simple shellscript to find out which logical CPU's shards are mapped to
2018-03-14 14:14:21 +02:00
Jesse Haber-Kucharsky
6a360c2d17 auth: Grant all permissions to object creator
When a table, keyspace, or role is created, the creator now is
automatically granted all applicable permissions on the object.

This behavior is consistent with Apache Cassandra.

Fixes #3216.
2018-03-14 01:54:31 -04:00
Jesse Haber-Kucharsky
c502fe24ce auth: Unify handling for unsupported errors
Instead of some functions in `allow_all_authorizer` throwing exceptions
and others being silently pass-through, we consistently return exception
futures with `auth::unsupported_authorization_operation`. These errors
are converted to `invalid_request_exception` in the CQL error and
ignored where appropriate in the auth subsystem.
2018-03-14 01:54:28 -04:00
Jesse Haber-Kucharsky
97235445d3 auth: Fix life-time issue with parameter 2018-03-14 01:32:53 -04:00
Jesse Haber-Kucharsky
9117a689cf auth: Fix const correctness
This patch came about because of an important (and obvious, in
hindsight) realization: instances of the authorizer, role manager, and
authenticator are clients for access-control state and not the state
itself. This is reflected directly in Scylla: `auth::service` is
sharded across cores and this is possible because each instance queries
and modifies the same global state.

To give more examples, the value of an instance of `std::vector<int>` is
the structure of the container and its contents. The value of `int
file_descriptor` is an identifier for state maintained elsewhere.

Having watched an excellent talk by Herb Sutter [1] and having read an
informative blog post [2], it's clear that a member function marked
`const` communicates that the observable state of the instance is not
modified.

Thus, the member functions of the role-manager, authenticator, and
authorizer clients should not be marked `const` only if the state of the
client itself is observably changed. By this principle, member functions
which do not change the state of the client, but which mutate the global
state the client is associated with (for example, by creating a role)
are marked `const`.

The `start` (and `stop`) functions of the client have the dual role of
initializing (finalizing) both the local client state and the
external state; they are not marked `const`.

[1] https://herbsutter.com/2013/01/01/video-you-dont-know-const-and-mutable/

[2] http://talesofcpp.fusionfenix.com/post-2/episode-one-to-be-or-not-to-be-const
2018-03-14 01:32:43 -04:00
Takuya ASADA
c3b2e2580a dist/debian: switch to gcc-7.3
We have hit following bug on debug-mode binary:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82560
Since it's fixed on gcc-7.3, we need to upgrade our gcc package.

See: https://groups.google.com/d/topic/scylladb-dev/RIdIpqMeTog/discussion
2018-03-08 00:06:32 +09:00
1033 changed files with 40193 additions and 13878 deletions

View File

@@ -1,3 +1,9 @@
This is Scylla's bug tracker, to be used for reporting bugs only.
If you have a question about Scylla, and not a bug, please ask it in
our mailing-list at scylladb-dev@googlegroups.com or in our slack channel.
- [] I have read the disclaimer above, and I am reporting a suspected malfunction in Scylla.
*Installation details*
Scylla version (or git commit hash):
Cluster size:

4
.github/PULL_REQUEST_TEMPLATE.md vendored Normal file
View File

@@ -0,0 +1,4 @@
Scylla doesn't use pull-requests, please send a patch to the [mailing list](mailto:scylladb-dev@googlegroups.com) instead.
See our [contributing guidelines](../CONTRIBUTING.md) and our [Scylla development guidelines](../HACKING.md) for more information.
If you have any questions please don't hesitate to send a mail to the [dev list](mailto:scylladb-dev@googlegroups.com).

1
.gitignore vendored
View File

@@ -18,3 +18,4 @@ CMakeLists.txt.user
*.egg-info
__pycache__CMakeLists.txt.user
.gdbinit
resources

2
.gitmodules vendored
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui

View File

@@ -1,6 +1,6 @@
#!/bin/sh
VERSION=666.development
VERSION=2.3.6
if test -f version
then

View File

@@ -455,7 +455,7 @@
"operations":[
{
"method":"GET",
"summary":"Returns a list of filenames that contain the given key on this node",
"summary":"Returns a list of sstable filenames that contain the given partition key on this node",
"type":"array",
"items":{
"type":"string"
@@ -475,7 +475,7 @@
},
{
"name":"key",
"description":"The key",
"description":"The partition key. In a composite-key scenario, use ':' to separate the columns in the key.",
"required":true,
"allowMultiple":false,
"type":"string",

30
api/api-doc/config.json Normal file
View File

@@ -0,0 +1,30 @@
"/v2/config/{id}": {
"get": {
"description": "Return a config value",
"operationId": "find_config_id",
"produces": [
"application/json"
],
"tags": ["config"],
"parameters": [
{
"name": "id",
"in": "path",
"description": "ID of config to return",
"required": true,
"type": "string"
}
],
"responses": {
"200": {
"description": "Config value"
},
"default": {
"description": "unexpected error",
"schema": {
"$ref": "#/definitions/ErrorModel"
}
}
}
}
}

View File

@@ -2129,6 +2129,41 @@
]
}
]
},
{
"path":"/storage_service/view_build_statuses/{keyspace}/{view}",
"operations":[
{
"method":"GET",
"summary":"Gets the progress of a materialized view build",
"type":"array",
"items":{
"type":"mapper"
},
"nickname":"view_build_statuses",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"The keyspace",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"view",
"description":"View name",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
}
],
"models":{
@@ -2193,11 +2228,11 @@
"description":"The column family"
},
"total":{
"type":"int",
"type":"long",
"description":"The total snapshot size"
},
"live":{
"type":"int",
"type":"long",
"description":"The live snapshot size"
}
}

View File

@@ -39,6 +39,7 @@
#include "http/exception.hh"
#include "stream_manager.hh"
#include "system.hh"
#include "api/config.hh"
namespace api {
@@ -65,6 +66,7 @@ future<> set_server_init(http_context& ctx) {
rb->set_api_doc(r);
rb02->set_api_doc(r);
rb02->register_api_file(r, "swagger20_header");
set_config(rb02, ctx, r);
rb->register_function(r, "system",
"The system related API");
set_system(ctx, r);

View File

@@ -429,7 +429,7 @@ void set_column_family(http_context& ctx, routes& r) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
utils::estimated_histogram res(0);
for (auto i: *cf.get_sstables() ) {
res.merge(i->get_stats_metadata().estimated_column_count);
res.merge(i->get_stats_metadata().estimated_cells_count);
}
return res;
},
@@ -905,5 +905,20 @@ void set_column_family(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(res);
});
});
cf::get_sstables_for_key.set(r, [&ctx](std::unique_ptr<request> req) {
auto key = req->get_query_param("key");
auto uuid = get_uuid(req->param["name"], ctx.db.local());
return ctx.db.map_reduce0([key, uuid] (database& db) {
return db.find_column_family(uuid).get_sstables_by_partition_key(key);
}, std::unordered_set<sstring>(),
[](std::unordered_set<sstring> a, std::unordered_set<sstring>&& b) mutable {
a.insert(b.begin(),b.end());
return a;
}).then([](const std::unordered_set<sstring>& res) {
return make_ready_future<json::json_return_type>(container_to_vec(res));
});
});
}
}

View File

@@ -24,6 +24,7 @@
#include "api.hh"
#include "api/api-doc/column_family.json.hh"
#include "database.hh"
#include <any>
namespace api {
@@ -37,9 +38,15 @@ template<class Mapper, class I, class Reducer>
future<I> map_reduce_cf_raw(http_context& ctx, const sstring& name, I init,
Mapper mapper, Reducer reducer) {
auto uuid = get_uuid(name, ctx.db.local());
return ctx.db.map_reduce0([mapper, uuid](database& db) {
return mapper(db.find_column_family(uuid));
}, init, reducer);
using mapper_type = std::function<std::any (database&)>;
using reducer_type = std::function<std::any (std::any, std::any)>;
return ctx.db.map_reduce0(mapper_type([mapper, uuid](database& db) {
return I(mapper(db.find_column_family(uuid)));
}), std::any(std::move(init)), reducer_type([reducer = std::move(reducer)] (std::any a, std::any b) mutable {
return I(reducer(std::any_cast<I>(std::move(a)), std::any_cast<I>(std::move(b))));
})).then([] (std::any r) {
return std::any_cast<I>(std::move(r));
});
}
@@ -51,35 +58,42 @@ future<json::json_return_type> map_reduce_cf(http_context& ctx, const sstring& n
});
}
template<class Mapper, class I, class Reducer, class Result>
future<I> map_reduce_cf_raw(http_context& ctx, const sstring& name, I init,
Mapper mapper, Reducer reducer, Result result) {
auto uuid = get_uuid(name, ctx.db.local());
return ctx.db.map_reduce0([mapper, uuid](database& db) {
return mapper(db.find_column_family(uuid));
}, init, reducer);
}
template<class Mapper, class I, class Reducer, class Result>
future<json::json_return_type> map_reduce_cf(http_context& ctx, const sstring& name, I init,
Mapper mapper, Reducer reducer, Result result) {
return map_reduce_cf_raw(ctx, name, init, mapper, reducer, result).then([result](const I& res) mutable {
return map_reduce_cf_raw(ctx, name, init, mapper, reducer).then([result](const I& res) mutable {
result = res;
return make_ready_future<json::json_return_type>(result);
});
}
template<class Mapper, class I, class Reducer>
future<I> map_reduce_cf_raw(http_context& ctx, I init,
Mapper mapper, Reducer reducer) {
return ctx.db.map_reduce0([mapper, init, reducer](database& db) {
struct map_reduce_column_families_locally {
std::any init;
std::function<std::any (column_family&)> mapper;
std::function<std::any (std::any, std::any)> reducer;
std::any operator()(database& db) const {
auto res = init;
for (auto i : db.get_column_families()) {
res = reducer(res, mapper(*i.second.get()));
}
return res;
}, init, reducer);
}
};
template<class Mapper, class I, class Reducer>
future<I> map_reduce_cf_raw(http_context& ctx, I init,
Mapper mapper, Reducer reducer) {
using mapper_type = std::function<std::any (column_family&)>;
using reducer_type = std::function<std::any (std::any, std::any)>;
auto wrapped_mapper = mapper_type([mapper = std::move(mapper)] (column_family& cf) mutable {
return I(mapper(cf));
});
auto wrapped_reducer = reducer_type([reducer = std::move(reducer)] (std::any a, std::any b) mutable {
return I(reducer(std::any_cast<I>(std::move(a)), std::any_cast<I>(std::move(b))));
});
return ctx.db.map_reduce0(map_reduce_column_families_locally{init, std::move(wrapped_mapper), wrapped_reducer}, std::any(init), wrapped_reducer).then([] (std::any res) {
return std::any_cast<I>(std::move(res));
});
}

112
api/config.cc Normal file
View File

@@ -0,0 +1,112 @@
/*
* Copyright 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "api/config.hh"
#include "api/api-doc/config.json.hh"
#include "db/config.hh"
#include <sstream>
#include <boost/algorithm/string/replace.hpp>
namespace api {
template<class T>
json::json_return_type get_json_return_type(const T& val) {
return json::json_return_type(val);
}
/*
* As commented on db::seed_provider_type is not used
* and probably never will.
*
* Just in case, we will return its name
*/
template<>
json::json_return_type get_json_return_type(const db::seed_provider_type& val) {
return json::json_return_type(val.class_name);
}
std::string format_type(const std::string& type) {
if (type == "int") {
return "integer";
}
return type;
}
future<> get_config_swagger_entry(const std::string& name, const std::string& description, const std::string& type, bool& first, output_stream<char>& os) {
std::stringstream ss;
if (first) {
first=false;
} else {
ss <<',';
};
ss << "\"/config/" << name <<"\": {"
"\"get\": {"
"\"description\": \"" << boost::replace_all_copy(boost::replace_all_copy(boost::replace_all_copy(description,"\n","\\n"),"\"", "''"), "\t", " ") <<"\","
"\"operationId\": \"find_config_"<< name <<"\","
"\"produces\": ["
"\"application/json\""
"],"
"\"tags\": [\"config\"],"
"\"parameters\": ["
"],"
"\"responses\": {"
"\"200\": {"
"\"description\": \"Config value\","
"\"schema\": {"
"\"type\": \"" << format_type(type) << "\""
"}"
"},"
"\"default\": {"
"\"description\": \"unexpected error\","
"\"schema\": {"
"\"$ref\": \"#/definitions/ErrorModel\""
"}"
"}"
"}"
"}"
"}";
return os.write(ss.str());
}
namespace cs = httpd::config_json;
#define _get_config_value(name, type, deflt, status, desc, ...) if (id == #name) {return get_json_return_type(ctx.db.local().get_config().name());}
#define _get_config_description(name, type, deflt, status, desc, ...) f = f.then([&os, &first] {return get_config_swagger_entry(#name, desc, #type, first, os);});
void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx, routes& r) {
rb->register_function(r, [] (output_stream<char>& os) {
return do_with(true, [&os] (bool& first) {
auto f = make_ready_future();
_make_config_values(_get_config_description)
return f;
});
});
cs::find_config_id.set(r, [&ctx] (const_req r) {
auto id = r.param["id"];
_make_config_values(_get_config_value)
throw bad_param_exception(sstring("No such config entry: ") + id);
});
}
}

30
api/config.hh Normal file
View File

@@ -0,0 +1,30 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "api.hh"
#include <seastar/http/api_docs.hh>
namespace api {
void set_config(std::shared_ptr<api_registry_builder20> rb, http_context& ctx, routes& r);
}

View File

@@ -852,6 +852,15 @@ void set_storage_service(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(map_to_key_value(ownership, res));
});
});
ss::view_build_statuses.set(r, [&ctx] (std::unique_ptr<request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto view = req->param["view"];
return service::get_local_storage_service().view_build_statuses(std::move(keyspace), std::move(view)).then([] (std::unordered_map<sstring, sstring> status) {
std::vector<storage_service_json::mapper> res;
return make_ready_future<json::json_return_type>(map_to_key_value(std::move(status), res));
});
});
}
}

222
atomic_cell.cc Normal file
View File

@@ -0,0 +1,222 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "atomic_cell.hh"
#include "atomic_cell_or_collection.hh"
#include "types.hh"
/// LSA mirator for cells with irrelevant type
///
///
const data::type_imr_descriptor& no_type_imr_descriptor() {
static thread_local data::type_imr_descriptor state(data::type_info::make_variable_size());
return state;
}
atomic_cell atomic_cell::make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
auto& imr_data = no_type_imr_descriptor();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_dead(timestamp, deletion_time), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value, atomic_cell::collection_member cm) {
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
auto& imr_data = no_type_imr_descriptor();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live_counter_update(timestamp, value), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live_uninitialized(const abstract_type& type, api::timestamp_type timestamp, size_t size) {
auto& imr_data = no_type_imr_descriptor();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live_uninitialized(imr_data.type_info(), timestamp, size), &imr_data.lsa_migrator())
);
}
static imr::utils::object<data::cell::structure> copy_cell(const data::type_imr_descriptor& imr_data, const uint8_t* ptr)
{
using imr_object_type = imr::utils::object<data::cell::structure>;
// If the cell doesn't own any memory it is trivial and can be copied with
// memcpy.
auto f = data::cell::structure::get_member<data::cell::tags::flags>(ptr);
if (!f.template get<data::cell::tags::external_data>()) {
data::cell::context ctx(f, imr_data.type_info());
// XXX: We may be better off storing the total cell size in memory. Measure!
auto size = data::cell::structure::serialized_object_size(ptr, ctx);
return imr_object_type::make_raw(size, [&] (uint8_t* dst) noexcept {
std::copy_n(ptr, size, dst);
}, &imr_data.lsa_migrator());
}
return imr_object_type::make(data::cell::copy_fn(imr_data.type_info(), ptr), &imr_data.lsa_migrator());
}
atomic_cell::atomic_cell(const abstract_type& type, atomic_cell_view other)
: atomic_cell(type.imr_state().type_info(),
copy_cell(type.imr_state(), other._view.raw_pointer()))
{ }
atomic_cell_or_collection atomic_cell_or_collection::copy(const abstract_type& type) const {
if (!_data.get()) {
return atomic_cell_or_collection();
}
auto& imr_data = type.imr_state();
return atomic_cell_or_collection(
copy_cell(imr_data, _data.get())
);
}
atomic_cell_or_collection::atomic_cell_or_collection(const abstract_type& type, atomic_cell_view acv)
: _data(copy_cell(type.imr_state(), acv._view.raw_pointer()))
{
}
static collection_mutation_view get_collection_mutation_view(const uint8_t* ptr)
{
auto f = data::cell::structure::get_member<data::cell::tags::flags>(ptr);
auto ti = data::type_info::make_collection();
data::cell::context ctx(f, ti);
auto view = data::cell::structure::get_member<data::cell::tags::cell>(ptr).as<data::cell::tags::collection>(ctx);
auto dv = data::cell::variable_value::make_view(view, f.get<data::cell::tags::external_data>());
return collection_mutation_view { dv };
}
collection_mutation_view atomic_cell_or_collection::as_collection_mutation() const {
return get_collection_mutation_view(_data.get());
}
collection_mutation::collection_mutation(const collection_type_impl& type, collection_mutation_view v)
: _data(imr_object_type::make(data::cell::make_collection(v.data), &type.imr_state().lsa_migrator()))
{
}
collection_mutation::collection_mutation(const collection_type_impl& type, bytes_view v)
: _data(imr_object_type::make(data::cell::make_collection(v), &type.imr_state().lsa_migrator()))
{
}
collection_mutation::operator collection_mutation_view() const
{
return get_collection_mutation_view(_data.get());
}
bool atomic_cell_or_collection::equals(const abstract_type& type, const atomic_cell_or_collection& other) const
{
auto ptr_a = _data.get();
auto ptr_b = other._data.get();
if (!ptr_a || !ptr_b) {
return !ptr_a && !ptr_b;
}
if (type.is_atomic()) {
auto a = atomic_cell_view::from_bytes(type.imr_state().type_info(), _data);
auto b = atomic_cell_view::from_bytes(type.imr_state().type_info(), other._data);
if (a.timestamp() != b.timestamp()) {
return false;
}
if (a.is_live()) {
if (!b.is_live()) {
return false;
}
if (a.is_counter_update()) {
if (!b.is_counter_update()) {
return false;
}
return a.counter_update_value() == b.counter_update_value();
}
if (a.is_live_and_has_ttl()) {
if (!b.is_live_and_has_ttl()) {
return false;
}
if (a.ttl() != b.ttl() || a.expiry() != b.expiry()) {
return false;
}
}
return a.value() == b.value();
}
return a.deletion_time() == b.deletion_time();
} else {
return as_collection_mutation().data == other.as_collection_mutation().data;
}
}
size_t atomic_cell_or_collection::external_memory_usage(const abstract_type& t) const
{
if (!_data.get()) {
return 0;
}
auto ctx = data::cell::context(_data.get(), t.imr_state().type_info());
auto view = data::cell::structure::make_view(_data.get(), ctx);
auto flags = view.get<data::cell::tags::flags>();
size_t external_value_size = 0;
if (flags.get<data::cell::tags::external_data>()) {
if (flags.get<data::cell::tags::collection>()) {
external_value_size = get_collection_mutation_view(_data.get()).data.size_bytes();
} else {
auto cell_view = data::cell::atomic_cell_view(t.imr_state().type_info(), view);
external_value_size = cell_view.value_size();
}
// Add overhead of chunk headers. The last one is a special case.
external_value_size += (external_value_size - 1) / data::cell::maximum_external_chunk_length * data::cell::external_chunk_overhead;
external_value_size += data::cell::external_last_chunk_overhead;
}
return data::cell::structure::serialized_object_size(_data.get(), ctx)
+ imr_object_type::size_overhead + external_value_size;
}
std::ostream& operator<<(std::ostream& os, const atomic_cell_or_collection& c) {
if (!c._data.get()) {
return os << "{ null atomic_cell_or_collection }";
}
using dc = data::cell;
os << "{ ";
if (dc::structure::get_member<dc::tags::flags>(c._data.get()).get<dc::tags::collection>()) {
os << "collection";
} else {
os << "atomic cell";
}
return os << " @" << static_cast<const void*>(c._data.get()) << " }";
}

View File

@@ -30,189 +30,48 @@
#include <cstdint>
#include <iosfwd>
#include <seastar/util/gcc6-concepts.hh>
#include "data/cell.hh"
#include "data/schema_info.hh"
#include "imr/utils.hh"
template<typename T, typename Input>
static inline
void set_field(Input& v, unsigned offset, T val) {
reinterpret_cast<net::packed<T>*>(v.begin() + offset)->raw = net::hton(val);
}
class abstract_type;
class collection_type_impl;
template<typename T>
static inline
T get_field(const bytes_view& v, unsigned offset) {
return net::ntoh(*reinterpret_cast<const net::packed<T>*>(v.begin() + offset));
}
using atomic_cell_value_view = data::value_view;
using atomic_cell_value_mutable_view = data::value_mutable_view;
class atomic_cell_or_collection;
/*
* Represents atomic cell layout. Works on serialized form.
*
* Layout:
*
* <live> := <int8_t:flags><int64_t:timestamp>(<int32_t:expiry><int32_t:ttl>)?<value>
* <dead> := <int8_t: 0><int64_t:timestamp><int32_t:deletion_time>
*/
class atomic_cell_type final {
private:
static constexpr int8_t LIVE_FLAG = 0x01;
static constexpr int8_t EXPIRY_FLAG = 0x02; // When present, expiry field is present. Set only for live cells
static constexpr int8_t COUNTER_UPDATE_FLAG = 0x08; // Cell is a counter update.
static constexpr int8_t COUNTER_IN_PLACE_REVERT = 0x10;
static constexpr unsigned flags_size = 1;
static constexpr unsigned timestamp_offset = flags_size;
static constexpr unsigned timestamp_size = 8;
static constexpr unsigned expiry_offset = timestamp_offset + timestamp_size;
static constexpr unsigned expiry_size = 4;
static constexpr unsigned deletion_time_offset = timestamp_offset + timestamp_size;
static constexpr unsigned deletion_time_size = 4;
static constexpr unsigned ttl_offset = expiry_offset + expiry_size;
static constexpr unsigned ttl_size = 4;
friend class counter_cell_builder;
private:
static bool is_counter_update(bytes_view cell) {
return cell[0] & COUNTER_UPDATE_FLAG;
}
static bool is_counter_in_place_revert_set(bytes_view cell) {
return cell[0] & COUNTER_IN_PLACE_REVERT;
}
template<typename BytesContainer>
static void set_counter_in_place_revert(BytesContainer& cell, bool flag) {
cell[0] = (cell[0] & ~COUNTER_IN_PLACE_REVERT) | (flag * COUNTER_IN_PLACE_REVERT);
}
static bool is_live(const bytes_view& cell) {
return cell[0] & LIVE_FLAG;
}
static bool is_live_and_has_ttl(const bytes_view& cell) {
return cell[0] & EXPIRY_FLAG;
}
static bool is_dead(const bytes_view& cell) {
return !is_live(cell);
}
// Can be called on live and dead cells
static api::timestamp_type timestamp(const bytes_view& cell) {
return get_field<api::timestamp_type>(cell, timestamp_offset);
}
template<typename BytesContainer>
static void set_timestamp(BytesContainer& cell, api::timestamp_type ts) {
set_field(cell, timestamp_offset, ts);
}
// Can be called on live cells only
private:
template<typename BytesView>
static BytesView do_get_value(BytesView cell) {
auto expiry_field_size = bool(cell[0] & EXPIRY_FLAG) * (expiry_size + ttl_size);
auto value_offset = flags_size + timestamp_size + expiry_field_size;
cell.remove_prefix(value_offset);
return cell;
}
public:
static bytes_view value(bytes_view cell) {
return do_get_value(cell);
}
static bytes_mutable_view value(bytes_mutable_view cell) {
return do_get_value(cell);
}
// Can be called on live counter update cells only
static int64_t counter_update_value(bytes_view cell) {
return get_field<int64_t>(cell, flags_size + timestamp_size);
}
// Can be called only when is_dead() is true.
static gc_clock::time_point deletion_time(const bytes_view& cell) {
assert(is_dead(cell));
return gc_clock::time_point(gc_clock::duration(
get_field<int32_t>(cell, deletion_time_offset)));
}
// Can be called only when is_live_and_has_ttl() is true.
static gc_clock::time_point expiry(const bytes_view& cell) {
assert(is_live_and_has_ttl(cell));
auto expiry = get_field<int32_t>(cell, expiry_offset);
return gc_clock::time_point(gc_clock::duration(expiry));
}
// Can be called only when is_live_and_has_ttl() is true.
static gc_clock::duration ttl(const bytes_view& cell) {
assert(is_live_and_has_ttl(cell));
return gc_clock::duration(get_field<int32_t>(cell, ttl_offset));
}
static managed_bytes make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
managed_bytes b(managed_bytes::initialized_later(), flags_size + timestamp_size + deletion_time_size);
b[0] = 0;
set_field(b, timestamp_offset, timestamp);
set_field(b, deletion_time_offset, deletion_time.time_since_epoch().count());
return b;
}
static managed_bytes make_live(api::timestamp_type timestamp, bytes_view value) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size());
b[0] = LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
std::copy_n(value.begin(), value.size(), b.begin() + value_offset);
return b;
}
static managed_bytes make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + sizeof(value));
b[0] = LIVE_FLAG | COUNTER_UPDATE_FLAG;
set_field(b, timestamp_offset, timestamp);
set_field(b, value_offset, value);
return b;
}
static managed_bytes make_live(api::timestamp_type timestamp, bytes_view value, gc_clock::time_point expiry, gc_clock::duration ttl) {
auto value_offset = flags_size + timestamp_size + expiry_size + ttl_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size());
b[0] = EXPIRY_FLAG | LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
set_field(b, expiry_offset, expiry.time_since_epoch().count());
set_field(b, ttl_offset, ttl.count());
std::copy_n(value.begin(), value.size(), b.begin() + value_offset);
return b;
}
// make_live_from_serializer() is intended for users that need to serialise
// some object or objects to the format used in atomic_cell::value().
// With just make_live() the patter would look like follows:
// 1. allocate a buffer and write to it serialised objects
// 2. pass that buffer to make_live()
// 3. make_live() needs to prepend some metadata to the cell value so it
// allocates a new buffer and copies the content of the original one
//
// The allocation and copy of a buffer can be avoided.
// make_live_from_serializer() allows the user code to specify the timestamp
// and size of the cell value as well as provide the serialiser function
// object, which would write the serialised value of the cell to the buffer
// given to it by make_live_from_serializer().
template<typename Serializer>
GCC6_CONCEPT(requires requires(Serializer serializer, bytes::iterator it) {
serializer(it);
})
static managed_bytes make_live_from_serializer(api::timestamp_type timestamp, size_t size, Serializer&& serializer) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + size);
b[0] = LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
serializer(b.begin() + value_offset);
return b;
}
template<typename ByteContainer>
friend class atomic_cell_base;
/// View of an atomic cell
template<mutable_view is_mutable>
class basic_atomic_cell_view {
protected:
data::cell::basic_atomic_cell_view<is_mutable> _view;
friend class atomic_cell;
};
public:
using pointer_type = std::conditional_t<is_mutable == mutable_view::no, const uint8_t*, uint8_t*>;
protected:
explicit basic_atomic_cell_view(data::cell::basic_atomic_cell_view<is_mutable> v)
: _view(std::move(v)) { }
basic_atomic_cell_view(const data::type_info& ti, pointer_type ptr)
: _view(data::cell::make_atomic_cell_view(ti, ptr))
{ }
template<typename ByteContainer>
class atomic_cell_base {
protected:
ByteContainer _data;
protected:
atomic_cell_base(ByteContainer&& data) : _data(std::forward<ByteContainer>(data)) { }
friend class atomic_cell_or_collection;
public:
bool is_counter_update() const {
return atomic_cell_type::is_counter_update(_data);
operator basic_atomic_cell_view<mutable_view::no>() const noexcept {
return basic_atomic_cell_view<mutable_view::no>(_view);
}
bool is_counter_in_place_revert_set() const {
return atomic_cell_type::is_counter_in_place_revert_set(_data);
void swap(basic_atomic_cell_view& other) noexcept {
using std::swap;
swap(_view, other._view);
}
bool is_counter_update() const {
return _view.is_counter_update();
}
bool is_live() const {
return atomic_cell_type::is_live(_data);
return _view.is_live();
}
bool is_live(tombstone t, bool is_counter) const {
return is_live() && !is_covered_by(t, is_counter);
@@ -221,122 +80,132 @@ public:
return is_live() && !is_covered_by(t, is_counter) && !has_expired(now);
}
bool is_live_and_has_ttl() const {
return atomic_cell_type::is_live_and_has_ttl(_data);
return _view.is_expiring();
}
bool is_dead(gc_clock::time_point now) const {
return atomic_cell_type::is_dead(_data) || has_expired(now);
return !is_live() || has_expired(now);
}
bool is_covered_by(tombstone t, bool is_counter) const {
return timestamp() <= t.timestamp || (is_counter && t.timestamp != api::missing_timestamp);
}
// Can be called on live and dead cells
api::timestamp_type timestamp() const {
return atomic_cell_type::timestamp(_data);
return _view.timestamp();
}
void set_timestamp(api::timestamp_type ts) {
atomic_cell_type::set_timestamp(_data, ts);
_view.set_timestamp(ts);
}
// Can be called on live cells only
auto value() const {
return atomic_cell_type::value(_data);
data::basic_value_view<is_mutable> value() const {
return _view.value();
}
// Can be called on live cells only
size_t value_size() const {
return _view.value_size();
}
bool is_value_fragmented() const {
return _view.is_value_fragmented();
}
// Can be called on live counter update cells only
int64_t counter_update_value() const {
return atomic_cell_type::counter_update_value(_data);
return _view.counter_update_value();
}
// Can be called only when is_dead(gc_clock::time_point)
gc_clock::time_point deletion_time() const {
return !is_live() ? atomic_cell_type::deletion_time(_data) : expiry() - ttl();
return !is_live() ? _view.deletion_time() : expiry() - ttl();
}
// Can be called only when is_live_and_has_ttl()
gc_clock::time_point expiry() const {
return atomic_cell_type::expiry(_data);
return _view.expiry();
}
// Can be called only when is_live_and_has_ttl()
gc_clock::duration ttl() const {
return atomic_cell_type::ttl(_data);
return _view.ttl();
}
// Can be called on live and dead cells
bool has_expired(gc_clock::time_point now) const {
return is_live_and_has_ttl() && expiry() <= now;
}
bytes_view serialize() const {
return _data;
}
void set_counter_in_place_revert(bool flag) {
atomic_cell_type::set_counter_in_place_revert(_data, flag);
return _view.serialize();
}
};
class atomic_cell_view final : public atomic_cell_base<bytes_view> {
atomic_cell_view(bytes_view data) : atomic_cell_base(std::move(data)) {}
public:
static atomic_cell_view from_bytes(bytes_view data) { return atomic_cell_view(data); }
class atomic_cell_view final : public basic_atomic_cell_view<mutable_view::no> {
atomic_cell_view(const data::type_info& ti, const uint8_t* data)
: basic_atomic_cell_view<mutable_view::no>(ti, data) {}
template<mutable_view is_mutable>
atomic_cell_view(data::cell::basic_atomic_cell_view<is_mutable> view)
: basic_atomic_cell_view<mutable_view::no>(view) { }
friend class atomic_cell;
public:
static atomic_cell_view from_bytes(const data::type_info& ti, const imr::utils::object<data::cell::structure>& data) {
return atomic_cell_view(ti, data.get());
}
static atomic_cell_view from_bytes(const data::type_info& ti, bytes_view bv) {
return atomic_cell_view(ti, reinterpret_cast<const uint8_t*>(bv.begin()));
}
friend std::ostream& operator<<(std::ostream& os, const atomic_cell_view& acv);
};
class atomic_cell_mutable_view final : public atomic_cell_base<bytes_mutable_view> {
atomic_cell_mutable_view(bytes_mutable_view data) : atomic_cell_base(std::move(data)) {}
class atomic_cell_mutable_view final : public basic_atomic_cell_view<mutable_view::yes> {
atomic_cell_mutable_view(const data::type_info& ti, uint8_t* data)
: basic_atomic_cell_view<mutable_view::yes>(ti, data) {}
public:
static atomic_cell_mutable_view from_bytes(bytes_mutable_view data) { return atomic_cell_mutable_view(data); }
static atomic_cell_mutable_view from_bytes(const data::type_info& ti, imr::utils::object<data::cell::structure>& data) {
return atomic_cell_mutable_view(ti, data.get());
}
friend class atomic_cell;
};
class atomic_cell_ref final : public atomic_cell_base<managed_bytes&> {
public:
atomic_cell_ref(managed_bytes& buf) : atomic_cell_base(buf) {}
};
using atomic_cell_ref = atomic_cell_mutable_view;
class atomic_cell final : public atomic_cell_base<managed_bytes> {
atomic_cell(managed_bytes b) : atomic_cell_base(std::move(b)) {}
class atomic_cell final : public basic_atomic_cell_view<mutable_view::yes> {
using imr_object_type = imr::utils::object<data::cell::structure>;
imr_object_type _data;
atomic_cell(const data::type_info& ti, imr::utils::object<data::cell::structure>&& data)
: basic_atomic_cell_view<mutable_view::yes>(ti, data.get()), _data(std::move(data)) {}
public:
atomic_cell(const atomic_cell&) = default;
class collection_member_tag;
using collection_member = bool_class<collection_member_tag>;
atomic_cell(atomic_cell&&) = default;
atomic_cell& operator=(const atomic_cell&) = default;
atomic_cell& operator=(const atomic_cell&) = delete;
atomic_cell& operator=(atomic_cell&&) = default;
static atomic_cell from_bytes(managed_bytes b) {
return atomic_cell(std::move(b));
void swap(atomic_cell& other) noexcept {
basic_atomic_cell_view<mutable_view::yes>::swap(other);
_data.swap(other._data);
}
atomic_cell(atomic_cell_view other) : atomic_cell_base(managed_bytes{other._data}) {}
operator atomic_cell_view() const {
return atomic_cell_view(_data);
operator atomic_cell_view() const { return atomic_cell_view(_view); }
atomic_cell(const abstract_type& t, atomic_cell_view other);
static atomic_cell make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value,
collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, const bytes& value,
collection_member cm = collection_member::no) {
return make_live(type, timestamp, bytes_view(value), cm);
}
static atomic_cell make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
return atomic_cell_type::make_dead(timestamp, deletion_time);
}
static atomic_cell make_live(api::timestamp_type timestamp, bytes_view value) {
return atomic_cell_type::make_live(timestamp, value);
}
static atomic_cell make_live(api::timestamp_type timestamp, const bytes& value) {
return make_live(timestamp, bytes_view(value));
}
static atomic_cell make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
return atomic_cell_type::make_live_counter_update(timestamp, value);
}
static atomic_cell make_live(api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl)
static atomic_cell make_live_counter_update(api::timestamp_type timestamp, int64_t value);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, const bytes& value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member cm = collection_member::no)
{
return atomic_cell_type::make_live(timestamp, value, expiry, ttl);
return make_live(type, timestamp, bytes_view(value), expiry, ttl, cm);
}
static atomic_cell make_live(api::timestamp_type timestamp, const bytes& value,
gc_clock::time_point expiry, gc_clock::duration ttl)
{
return make_live(timestamp, bytes_view(value), expiry, ttl);
}
static atomic_cell make_live(api::timestamp_type timestamp, bytes_view value, ttl_opt ttl) {
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value, ttl_opt ttl, collection_member cm = collection_member::no) {
if (!ttl) {
return atomic_cell_type::make_live(timestamp, value);
return make_live(type, timestamp, value, cm);
} else {
return atomic_cell_type::make_live(timestamp, value, gc_clock::now() + *ttl, *ttl);
return make_live(type, timestamp, value, gc_clock::now() + *ttl, *ttl, cm);
}
}
template<typename Serializer>
static atomic_cell make_live_from_serializer(api::timestamp_type timestamp, size_t size, Serializer&& serializer) {
return atomic_cell_type::make_live_from_serializer(timestamp, size, std::forward<Serializer>(serializer));
}
static atomic_cell make_live_uninitialized(const abstract_type& type, api::timestamp_type timestamp, size_t size);
friend class atomic_cell_or_collection;
friend std::ostream& operator<<(std::ostream& os, const atomic_cell& ac);
};
@@ -350,33 +219,24 @@ class collection_mutation_view;
// list: tbd, probably ugly
class collection_mutation {
public:
managed_bytes data;
using imr_object_type = imr::utils::object<data::cell::structure>;
imr_object_type _data;
collection_mutation() {}
collection_mutation(managed_bytes b) : data(std::move(b)) {}
collection_mutation(collection_mutation_view v);
collection_mutation(const collection_type_impl&, collection_mutation_view v);
collection_mutation(const collection_type_impl&, bytes_view bv);
operator collection_mutation_view() const;
};
class collection_mutation_view {
public:
bytes_view data;
bytes_view serialize() const { return data; }
static collection_mutation_view from_bytes(bytes_view v) { return { v }; }
atomic_cell_value_view data;
};
inline
collection_mutation::collection_mutation(collection_mutation_view v)
: data(v.data) {
}
inline
collection_mutation::operator collection_mutation_view() const {
return { data };
}
class column_definition;
int compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right);
void merge_column(const column_definition& def,
void merge_column(const abstract_type& def,
atomic_cell_or_collection& old,
const atomic_cell_or_collection& neww);

View File

@@ -33,12 +33,15 @@ template<>
struct appending_hash<collection_mutation_view> {
template<typename Hasher>
void operator()(Hasher& h, collection_mutation_view cell, const column_definition& cdef) const {
auto m_view = collection_type_impl::deserialize_mutation_form(cell);
cell.data.with_linearized([&] (bytes_view cell_bv) {
auto ctype = static_pointer_cast<const collection_type_impl>(cdef.type);
auto m_view = ctype->deserialize_mutation_form(cell_bv);
::feed_hash(h, m_view.tomb);
for (auto&& key_and_value : m_view.cells) {
::feed_hash(h, key_and_value.first);
::feed_hash(h, key_and_value.second, cdef);
}
});
}
};
@@ -50,7 +53,9 @@ struct appending_hash<atomic_cell_view> {
feed_hash(h, cell.timestamp());
if (cell.is_live()) {
if (cdef.is_counter()) {
::feed_hash(h, counter_cell_view(cell));
counter_cell_view::with_linearized(cell, [&] (counter_cell_view ccv) {
::feed_hash(h, ccv);
});
return;
}
if (cell.is_live_and_has_ttl()) {
@@ -85,9 +90,9 @@ struct appending_hash<atomic_cell_or_collection> {
template<typename Hasher>
void operator()(Hasher& h, const atomic_cell_or_collection& c, const column_definition& cdef) const {
if (cdef.is_atomic()) {
feed_hash(h, c.as_atomic_cell(), cdef);
feed_hash(h, c.as_atomic_cell(cdef), cdef);
} else {
feed_hash(h, c.as_collection_mutation(), cdef);
}
}
};
};

View File

@@ -25,42 +25,56 @@
#include "schema.hh"
#include "hashing.hh"
#include "imr/utils.hh"
// A variant type that can hold either an atomic_cell, or a serialized collection.
// Which type is stored is determined by the schema.
// Has an "empty" state.
// Objects moved-from are left in an empty state.
class atomic_cell_or_collection final {
managed_bytes _data;
// FIXME: This has made us lose small-buffer optimisation. Unfortunately,
// due to the changed cell format it would be less effective now, anyway.
// Measure the actual impact because any attempts to fix this will become
// irrelevant once rows are converted to the IMR as well, so maybe we can
// live with this like that.
using imr_object_type = imr::utils::object<data::cell::structure>;
imr_object_type _data;
private:
atomic_cell_or_collection(managed_bytes&& data) : _data(std::move(data)) {}
atomic_cell_or_collection(imr::utils::object<data::cell::structure>&& data) : _data(std::move(data)) {}
public:
atomic_cell_or_collection() = default;
atomic_cell_or_collection(atomic_cell_or_collection&&) = default;
atomic_cell_or_collection(const atomic_cell_or_collection&) = delete;
atomic_cell_or_collection& operator=(atomic_cell_or_collection&&) = default;
atomic_cell_or_collection& operator=(const atomic_cell_or_collection&) = delete;
atomic_cell_or_collection(atomic_cell ac) : _data(std::move(ac._data)) {}
atomic_cell_or_collection(const abstract_type& at, atomic_cell_view acv);
static atomic_cell_or_collection from_atomic_cell(atomic_cell data) { return { std::move(data._data) }; }
atomic_cell_view as_atomic_cell() const { return atomic_cell_view::from_bytes(_data); }
atomic_cell_ref as_atomic_cell_ref() { return { _data }; }
atomic_cell_mutable_view as_mutable_atomic_cell() { return atomic_cell_mutable_view::from_bytes(_data); }
atomic_cell_or_collection(collection_mutation cm) : _data(std::move(cm.data)) {}
atomic_cell_view as_atomic_cell(const column_definition& cdef) const { return atomic_cell_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
atomic_cell_ref as_atomic_cell_ref(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
atomic_cell_mutable_view as_mutable_atomic_cell(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
atomic_cell_or_collection(collection_mutation cm) : _data(std::move(cm._data)) { }
atomic_cell_or_collection copy(const abstract_type&) const;
explicit operator bool() const {
return !_data.empty();
return bool(_data);
}
bool can_use_mutable_view() const {
return !_data.is_fragmented();
static constexpr bool can_use_mutable_view() {
return true;
}
static atomic_cell_or_collection from_collection_mutation(collection_mutation data) {
return std::move(data.data);
}
collection_mutation_view as_collection_mutation() const {
return collection_mutation_view{_data};
}
bytes_view serialize() const {
return _data;
}
bool operator==(const atomic_cell_or_collection& other) const {
return _data == other._data;
}
size_t external_memory_usage() const {
return _data.external_memory_usage();
void swap(atomic_cell_or_collection& other) noexcept {
_data.swap(other._data);
}
static atomic_cell_or_collection from_collection_mutation(collection_mutation data) { return std::move(data._data); }
collection_mutation_view as_collection_mutation() const;
bytes_view serialize() const;
bool equals(const abstract_type& type, const atomic_cell_or_collection& other) const;
size_t external_memory_usage(const abstract_type&) const;
friend std::ostream& operator<<(std::ostream&, const atomic_cell_or_collection&);
};
namespace std {
inline void swap(atomic_cell_or_collection& a, atomic_cell_or_collection& b) noexcept
{
a.swap(b);
}
}

View File

@@ -72,18 +72,22 @@ public:
return make_ready_future<authenticated_user>(anonymous_user());
}
virtual future<> create(stdx::string_view, const authentication_options& options) override {
virtual future<> create(stdx::string_view, const authentication_options& options) const override {
return make_ready_future();
}
virtual future<> alter(stdx::string_view, const authentication_options& options) override {
virtual future<> alter(stdx::string_view, const authentication_options& options) const override {
return make_ready_future();
}
virtual future<> drop(stdx::string_view) override {
virtual future<> drop(stdx::string_view) const override {
return make_ready_future();
}
virtual future<custom_options> query_custom_options(stdx::string_view role_name) const override {
return make_ready_future<custom_options>();
}
virtual const resource_set& protected_resources() const override {
static const resource_set resources;
return resources;

View File

@@ -58,24 +58,30 @@ public:
return make_ready_future<permission_set>(permissions::ALL);
}
virtual future<> grant(stdx::string_view, permission_set, const resource&) override {
throw exceptions::invalid_request_exception("GRANT operation is not supported by AllowAllAuthorizer");
virtual future<> grant(stdx::string_view, permission_set, const resource&) const override {
return make_exception_future<>(
unsupported_authorization_operation("GRANT operation is not supported by AllowAllAuthorizer"));
}
virtual future<> revoke(stdx::string_view, permission_set, const resource&) override {
throw exceptions::invalid_request_exception("REVOKE operation is not supported by AllowAllAuthorizer");
virtual future<> revoke(stdx::string_view, permission_set, const resource&) const override {
return make_exception_future<>(
unsupported_authorization_operation("REVOKE operation is not supported by AllowAllAuthorizer"));
}
virtual future<std::vector<permission_details>> list_all() const override {
throw exceptions::invalid_request_exception("LIST PERMISSIONS operation is not supported by AllowAllAuthorizer");
return make_exception_future<std::vector<permission_details>>(
unsupported_authorization_operation(
"LIST PERMISSIONS operation is not supported by AllowAllAuthorizer"));
}
virtual future<> revoke_all(stdx::string_view) override {
return make_ready_future();
virtual future<> revoke_all(stdx::string_view) const override {
return make_exception_future(
unsupported_authorization_operation("REVOKE operation is not supported by AllowAllAuthorizer"));
}
virtual future<> revoke_all(const resource&) override {
return make_ready_future();
virtual future<> revoke_all(const resource&) const override {
return make_exception_future(
unsupported_authorization_operation("REVOKE operation is not supported by AllowAllAuthorizer"));
}
virtual const resource_set& protected_resources() const override {

View File

@@ -43,9 +43,11 @@ std::ostream& operator<<(std::ostream&, authentication_option);
using authentication_option_set = std::unordered_set<authentication_option>;
using custom_options = std::unordered_map<sstring, sstring>;
struct authentication_options final {
std::optional<sstring> password;
std::optional<std::unordered_map<sstring, sstring>> options;
std::optional<custom_options> options;
};
inline bool any_authentication_options(const authentication_options& aos) noexcept {

View File

@@ -69,7 +69,9 @@ namespace auth {
class authenticated_user;
///
/// Abstract interface for authenticating users.
/// Abstract client for authenticating role identity.
///
/// All state necessary to authorize a role is stored externally to the client instance.
///
class authenticator {
public:
@@ -120,7 +122,7 @@ public:
///
/// The options provided must be a subset of `supported_options()`.
///
virtual future<> create(stdx::string_view role_name, const authentication_options& options) = 0;
virtual future<> create(stdx::string_view role_name, const authentication_options& options) const = 0;
///
/// Alter the authentication record of an existing user.
@@ -129,12 +131,19 @@ public:
///
/// Callers must ensure that the specification of `alterable_options()` is adhered to.
///
virtual future<> alter(stdx::string_view role_name, const authentication_options& options) = 0;
virtual future<> alter(stdx::string_view role_name, const authentication_options& options) const = 0;
///
/// Delete the authentication record for a user. This will disallow the user from logging in.
///
virtual future<> drop(stdx::string_view role_name) = 0;
virtual future<> drop(stdx::string_view role_name) const = 0;
///
/// Query for custom options (those corresponding to \ref authentication_options::options).
///
/// If no options are set the result is an empty container.
///
virtual future<custom_options> query_custom_options(stdx::string_view role_name) const = 0;
///
/// System resources used internally as part of the implementation. These are made inaccessible to users.

View File

@@ -44,6 +44,7 @@
#include <experimental/string_view>
#include <functional>
#include <optional>
#include <stdexcept>
#include <tuple>
#include <vector>
@@ -79,8 +80,15 @@ inline bool operator<(const permission_details& pd1, const permission_details& p
< std::forward_as_tuple(pd2.role_name, pd2.resource, pd2.permissions);
}
class unsupported_authorization_operation : public std::invalid_argument {
public:
using std::invalid_argument::invalid_argument;
};
///
/// Abstract interface for authorizing users to access resources.
/// Abstract client for authorizing roles to access resources.
///
/// All state necessary to authorize a role is stored externally to the client instance.
///
class authorizer {
public:
@@ -107,27 +115,37 @@ public:
///
/// Grant a set of permissions to a role for a particular \ref resource.
///
virtual future<> grant(stdx::string_view role_name, permission_set, const resource&) = 0;
/// \throws \ref unsupported_authorization_operation if granting permissions is not supported.
///
virtual future<> grant(stdx::string_view role_name, permission_set, const resource&) const = 0;
///
/// Revoke a set of permissions from a role for a particular \ref resource.
///
virtual future<> revoke(stdx::string_view role_name, permission_set, const resource&) = 0;
/// \throws \ref unsupported_authorization_operation if revoking permissions is not supported.
///
virtual future<> revoke(stdx::string_view role_name, permission_set, const resource&) const = 0;
///
/// Query for all directly granted permissions.
///
/// \throws \ref unsupported_authorization_operation if listing permissions is not supported.
///
virtual future<std::vector<permission_details>> list_all() const = 0;
///
/// Revoke all permissions granted directly to a particular role.
///
virtual future<> revoke_all(stdx::string_view role_name) = 0;
/// \throws \ref unsupported_authorization_operation if revoking permissions is not supported.
///
virtual future<> revoke_all(stdx::string_view role_name) const = 0;
///
/// Revoke all permissions granted to any role for a particular resource.
///
virtual future<> revoke_all(const resource&) = 0;
/// \throws \ref unsupported_authorization_operation if revoking permissions is not supported.
///
virtual future<> revoke_all(const resource&) const = 0;
///
/// System resources used internally as part of the implementation. These are made inaccessible to users.

View File

@@ -25,8 +25,10 @@
#include "cql3/query_processor.hh"
#include "cql3/statements/create_table_statement.hh"
#include "database.hh"
#include "schema_builder.hh"
#include "service/migration_manager.hh"
#include "timeout_config.hh"
namespace auth {
@@ -48,7 +50,7 @@ future<> do_after_system_ready(seastar::abort_source& as, seastar::noncopyable_f
return exponential_backoff_retry::do_until_value(1s, 1min, as, [func = std::move(func)] {
return func().then_wrapped([] (auto&& f) -> stdx::optional<empty_state> {
if (f.failed()) {
auth_log.warn("Auth task failed with error, rescheduling: {}", f.get_exception());
auth_log.info("Auth task failed with error, rescheduling: {}", f.get_exception());
return { };
}
return { empty_state() };
@@ -58,13 +60,13 @@ future<> do_after_system_ready(seastar::abort_source& as, seastar::noncopyable_f
}
future<> create_metadata_table_if_missing(
const sstring& table_name,
stdx::string_view table_name,
cql3::query_processor& qp,
const sstring& cql,
stdx::string_view cql,
::service::migration_manager& mm) {
auto& db = qp.db().local();
if (db.has_schema(meta::AUTH_KS, table_name)) {
if (db.has_schema(meta::AUTH_KS, sstring(table_name))) {
return make_ready_future<>();
}
@@ -85,4 +87,18 @@ future<> create_metadata_table_if_missing(
return mm.announce_new_column_family(b.build(), false);
}
future<> wait_for_schema_agreement(::service::migration_manager& mm, const database& db) {
static const auto pause = [] { return sleep(std::chrono::milliseconds(500)); };
return do_until([&db] { return db.get_version() != database::empty_version; }, pause).then([&mm] {
return do_until([&mm] { return mm.have_schema_agreement(); }, pause);
});
}
const timeout_config& internal_distributed_timeout_config() noexcept {
static const auto t = 5s;
static const timeout_config tc{t, t, t, t, t, t, t};
return tc;
}
}

View File

@@ -22,6 +22,7 @@
#pragma once
#include <chrono>
#include <experimental/string_view>
#include <seastar/core/future.hh>
#include <seastar/core/abort_source.hh>
@@ -36,6 +37,9 @@
using namespace std::chrono_literals;
class database;
class timeout_config;
namespace service {
class migration_manager;
}
@@ -65,16 +69,23 @@ future<> once_among_shards(Task&& f) {
}
inline future<> delay_until_system_ready(seastar::abort_source& as) {
return sleep_abortable(10s, as);
return sleep_abortable(15s, as);
}
// Func must support being invoked more than once.
future<> do_after_system_ready(seastar::abort_source& as, seastar::noncopyable_function<future<>()> func);
future<> create_metadata_table_if_missing(
const sstring& table_name,
stdx::string_view table_name,
cql3::query_processor&,
const sstring& cql,
stdx::string_view cql,
::service::migration_manager&);
future<> wait_for_schema_agreement(::service::migration_manager&, const database&);
///
/// Time-outs for internal, non-local CQL queries.
///
const timeout_config& internal_distributed_timeout_config() noexcept;
}

View File

@@ -103,19 +103,21 @@ future<bool> default_authorizer::any_granted() const {
return _qp.process(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{},
true).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return !results->empty();
});
}
future<> default_authorizer::migrate_legacy_metadata() {
future<> default_authorizer::migrate_legacy_metadata() const {
alogger.info("Starting migration of legacy permissions metadata.");
static const sstring query = sprint("SELECT * FROM %s.%s", meta::AUTH_KS, legacy_table_name);
return _qp.process(
query,
db::consistency_level::LOCAL_ONE).then([this](::shared_ptr<cql3::untyped_result_set> results) {
db::consistency_level::LOCAL_ONE,
infinite_timeout_config).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
return do_with(
row.get_as<sstring>("username"),
@@ -157,18 +159,18 @@ future<> default_authorizer::start() {
create_table,
_migration_manager).then([this] {
_finished = do_after_system_ready(_as, [this] {
if (legacy_metadata_exists()) {
return any_granted().then([this](bool any) {
if (!any) {
return migrate_legacy_metadata();
}
return async([this] {
wait_for_schema_agreement(_migration_manager, _qp.db().local()).get0();
alogger.warn("Ignoring legacy permissions metadata since role permissions exist.");
return make_ready_future<>();
});
}
if (legacy_metadata_exists()) {
if (!any_granted().get0()) {
migrate_legacy_metadata().get0();
return;
}
return make_ready_future<>();
alogger.warn("Ignoring legacy permissions metadata since role permissions exist.");
}
});
});
});
});
@@ -196,6 +198,7 @@ default_authorizer::authorize(const role_or_anonymous& maybe_role, const resourc
return _qp.process(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{*maybe_role.name, r.name()}).then([](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return permissions::NONE;
@@ -210,7 +213,7 @@ default_authorizer::modify(
stdx::string_view role_name,
permission_set set,
const resource& resource,
stdx::string_view op) {
stdx::string_view op) const {
return do_with(
sprint(
"UPDATE %s.%s SET %s = %s %s ? WHERE %s = ? AND %s = ?",
@@ -225,16 +228,17 @@ default_authorizer::modify(
return _qp.process(
query,
db::consistency_level::ONE,
internal_distributed_timeout_config(),
{permissions::to_strings(set), sstring(role_name), resource.name()}).discard_result();
});
}
future<> default_authorizer::grant(stdx::string_view role_name, permission_set set, const resource& resource) {
future<> default_authorizer::grant(stdx::string_view role_name, permission_set set, const resource& resource) const {
return modify(role_name, std::move(set), resource, "+");
}
future<> default_authorizer::revoke(stdx::string_view role_name, permission_set set, const resource& resource) {
future<> default_authorizer::revoke(stdx::string_view role_name, permission_set set, const resource& resource) const {
return modify(role_name, std::move(set), resource, "-");
}
@@ -250,6 +254,7 @@ future<std::vector<permission_details>> default_authorizer::list_all() const {
return _qp.process(
query,
db::consistency_level::ONE,
internal_distributed_timeout_config(),
{},
true).then([](::shared_ptr<cql3::untyped_result_set> results) {
std::vector<permission_details> all_details;
@@ -267,7 +272,7 @@ future<std::vector<permission_details>> default_authorizer::list_all() const {
});
}
future<> default_authorizer::revoke_all(stdx::string_view role_name) {
future<> default_authorizer::revoke_all(stdx::string_view role_name) const {
static const sstring query = sprint(
"DELETE FROM %s.%s WHERE %s = ?",
meta::AUTH_KS,
@@ -277,6 +282,7 @@ future<> default_authorizer::revoke_all(stdx::string_view role_name) {
return _qp.process(
query,
db::consistency_level::ONE,
internal_distributed_timeout_config(),
{sstring(role_name)}).discard_result().handle_exception([role_name](auto ep) {
try {
std::rethrow_exception(ep);
@@ -286,7 +292,7 @@ future<> default_authorizer::revoke_all(stdx::string_view role_name) {
});
}
future<> default_authorizer::revoke_all(const resource& resource) {
future<> default_authorizer::revoke_all(const resource& resource) const {
static const sstring query = sprint(
"SELECT %s FROM %s.%s WHERE %s = ? ALLOW FILTERING",
ROLE_NAME,
@@ -297,6 +303,7 @@ future<> default_authorizer::revoke_all(const resource& resource) {
return _qp.process(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{resource.name()}).then_wrapped([this, resource](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
@@ -314,6 +321,7 @@ future<> default_authorizer::revoke_all(const resource& resource) {
return _qp.process(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{r.get_as<sstring>(ROLE_NAME), resource.name()}).discard_result().handle_exception(
[resource](auto ep) {
try {

View File

@@ -77,15 +77,15 @@ public:
virtual future<permission_set> authorize(const role_or_anonymous&, const resource&) const override;
virtual future<> grant(stdx::string_view, permission_set, const resource&) override;
virtual future<> grant(stdx::string_view, permission_set, const resource&) const override;
virtual future<> revoke( stdx::string_view, permission_set, const resource&) override;
virtual future<> revoke( stdx::string_view, permission_set, const resource&) const override;
virtual future<std::vector<permission_details>> list_all() const override;
virtual future<> revoke_all(stdx::string_view) override;
virtual future<> revoke_all(stdx::string_view) const override;
virtual future<> revoke_all(const resource&) override;
virtual future<> revoke_all(const resource&) const override;
virtual const resource_set& protected_resources() const override;
@@ -94,9 +94,9 @@ private:
future<bool> any_granted() const;
future<> migrate_legacy_metadata();
future<> migrate_legacy_metadata() const;
future<> modify(stdx::string_view, permission_set, const resource&, stdx::string_view);
future<> modify(stdx::string_view, permission_set, const resource&, stdx::string_view) const;
};
} /* namespace auth */

View File

@@ -149,7 +149,9 @@ static sstring gensalt() {
// blowfish 2011 fix, blowfish, sha512, sha256, md5
for (sstring pfx : { "$2y$", "$2a$", "$6$", "$5$", "$1$" }) {
salt = pfx + input;
if (crypt_r("fisk", salt.c_str(), &tlcrypt)) {
const char* e = crypt_r("fisk", salt.c_str(), &tlcrypt);
if (e && (e[0] != '*')) {
prefix = pfx;
return salt;
}
@@ -162,7 +164,7 @@ static sstring hashpw(const sstring& pass) {
}
static bool has_salted_hash(const cql3::untyped_result_set_row& row) {
return utf8_type->deserialize(row.get_blob(SALTED_HASH)) != data_value::make_null(utf8_type);
return !row.get_or<sstring>(SALTED_HASH, "").empty();
}
static const sstring update_row_query = sprint(
@@ -177,13 +179,14 @@ bool password_authenticator::legacy_metadata_exists() const {
return _qp.db().local().has_schema(meta::AUTH_KS, legacy_table_name);
}
future<> password_authenticator::migrate_legacy_metadata() {
future<> password_authenticator::migrate_legacy_metadata() const {
plogger.info("Starting migration of legacy authentication metadata.");
static const sstring query = sprint("SELECT * FROM %s.%s", meta::AUTH_KS, legacy_table_name);
return _qp.process(
query,
db::consistency_level::QUORUM).then([this](::shared_ptr<cql3::untyped_result_set> results) {
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
auto username = row.get_as<sstring>("username");
auto salted_hash = row.get_as<sstring>(SALTED_HASH);
@@ -191,6 +194,7 @@ future<> password_authenticator::migrate_legacy_metadata() {
return _qp.process(
update_row_query,
consistency_for_user(username),
internal_distributed_timeout_config(),
{std::move(salted_hash), username}).discard_result();
}).finally([results] {});
}).then([] {
@@ -201,12 +205,13 @@ future<> password_authenticator::migrate_legacy_metadata() {
});
}
future<> password_authenticator::create_default_if_missing() {
future<> password_authenticator::create_default_if_missing() const {
return default_role_row_satisfies(_qp, &has_salted_hash).then([this](bool exists) {
if (!exists) {
return _qp.process(
update_row_query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config(),
{hashpw(DEFAULT_USER_PASSWORD), DEFAULT_USER_NAME}).then([](auto&&) {
plogger.info("Created default superuser authentication record.");
});
@@ -220,8 +225,16 @@ future<> password_authenticator::start() {
return once_among_shards([this] {
gensalt(); // do this once to determine usable hashing
auto f = create_metadata_table_if_missing(
meta::roles_table::name,
_qp,
meta::roles_table::creation_query(),
_migration_manager);
_stopped = do_after_system_ready(_as, [this] {
return async([this] {
wait_for_schema_agreement(_migration_manager, _qp.db().local()).get0();
if (any_nondefault_role_row_satisfies(_qp, &has_salted_hash).get0()) {
if (legacy_metadata_exists()) {
plogger.warn("Ignoring legacy authentication metadata since nondefault data already exist.");
@@ -239,7 +252,7 @@ future<> password_authenticator::start() {
});
});
return make_ready_future<>();
return f;
});
}
@@ -298,12 +311,17 @@ future<authenticated_user> password_authenticator::authenticate(
return _qp.process(
query,
consistency_for_user(username),
internal_distributed_timeout_config(),
{username},
true);
}).then_wrapped([=](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
if (res->empty() || !checkpw(password, res->one().get_as<sstring>(SALTED_HASH))) {
auto salted_hash = std::experimental::optional<sstring>();
if (!res->empty()) {
salted_hash = res->one().get_opt<sstring>(SALTED_HASH);
}
if (!salted_hash || !checkpw(password, *salted_hash)) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
return make_ready_future<authenticated_user>(username);
@@ -317,7 +335,7 @@ future<authenticated_user> password_authenticator::authenticate(
});
}
future<> password_authenticator::create(stdx::string_view role_name, const authentication_options& options) {
future<> password_authenticator::create(stdx::string_view role_name, const authentication_options& options) const {
if (!options.password) {
return make_ready_future<>();
}
@@ -325,10 +343,11 @@ future<> password_authenticator::create(stdx::string_view role_name, const authe
return _qp.process(
update_row_query,
consistency_for_user(role_name),
internal_distributed_timeout_config(),
{hashpw(*options.password), sstring(role_name)}).discard_result();
}
future<> password_authenticator::alter(stdx::string_view role_name, const authentication_options& options) {
future<> password_authenticator::alter(stdx::string_view role_name, const authentication_options& options) const {
if (!options.password) {
return make_ready_future<>();
}
@@ -342,17 +361,25 @@ future<> password_authenticator::alter(stdx::string_view role_name, const authen
return _qp.process(
query,
consistency_for_user(role_name),
internal_distributed_timeout_config(),
{hashpw(*options.password), sstring(role_name)}).discard_result();
}
future<> password_authenticator::drop(stdx::string_view name) {
future<> password_authenticator::drop(stdx::string_view name) const {
static const sstring query = sprint(
"DELETE %s FROM %s WHERE %s = ?",
SALTED_HASH,
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return _qp.process(query, consistency_for_user(name), {sstring(name)}).discard_result();
return _qp.process(
query, consistency_for_user(name),
internal_distributed_timeout_config(),
{sstring(name)}).discard_result();
}
future<custom_options> password_authenticator::query_custom_options(stdx::string_view role_name) const {
return make_ready_future<custom_options>();
}
const resource_set& password_authenticator::protected_resources() const {

View File

@@ -81,11 +81,13 @@ public:
virtual future<authenticated_user> authenticate(const credentials_map& credentials) const override;
virtual future<> create(stdx::string_view role_name, const authentication_options& options) override;
virtual future<> create(stdx::string_view role_name, const authentication_options& options) const override;
virtual future<> alter(stdx::string_view role_name, const authentication_options& options) override;
virtual future<> alter(stdx::string_view role_name, const authentication_options& options) const override;
virtual future<> drop(stdx::string_view role_name) override;
virtual future<> drop(stdx::string_view role_name) const override;
virtual future<custom_options> query_custom_options(stdx::string_view role_name) const override;
virtual const resource_set& protected_resources() const override;
@@ -94,9 +96,9 @@ public:
private:
bool legacy_metadata_exists() const;
future<> migrate_legacy_metadata();
future<> migrate_legacy_metadata() const;
future<> create_default_if_missing();
future<> create_default_if_missing() const;
};
}

View File

@@ -93,10 +93,12 @@ using role_set = std::unordered_set<sstring>;
enum class recursive_role_query { yes, no };
///
/// Abstract role manager.
/// Abstract client for managing roles.
///
/// All implementations should throw role-related exceptions as documented, but authorization-related checking is
/// handled by the CQL layer, and not here.
/// All state necessary for managing roles is stored externally to the client instance.
///
/// All implementations should throw role-related exceptions as documented. Authorization is not addressed here, and
/// access-control should never be enforced in implementations.
///
class role_manager {
public:
@@ -113,17 +115,17 @@ public:
///
/// \returns an exceptional future with \ref role_already_exists for a role that has previously been created.
///
virtual future<> create(stdx::string_view role_name, const role_config&) = 0;
virtual future<> create(stdx::string_view role_name, const role_config&) const = 0;
///
/// \returns an exceptional future with \ref nonexistant_role if the role does not exist.
///
virtual future<> drop(stdx::string_view role_name) = 0;
virtual future<> drop(stdx::string_view role_name) const = 0;
///
/// \returns an exceptional future with \ref nonexistant_role if the role does not exist.
///
virtual future<> alter(stdx::string_view role_name, const role_config_update&) = 0;
virtual future<> alter(stdx::string_view role_name, const role_config_update&) const = 0;
///
/// Grant `role_name` to `grantee_name`.
@@ -133,7 +135,7 @@ public:
/// \returns an exceptional future with \ref role_already_included if granting the role would be redundant, or
/// create a cycle.
///
virtual future<> grant(stdx::string_view grantee_name, stdx::string_view role_name) = 0;
virtual future<> grant(stdx::string_view grantee_name, stdx::string_view role_name) const = 0;
///
/// Revoke `role_name` from `revokee_name`.
@@ -142,7 +144,7 @@ public:
///
/// \returns an exceptional future with \ref revoke_ungranted_role if the role was not granted.
///
virtual future<> revoke(stdx::string_view revokee_name, stdx::string_view role_name) = 0;
virtual future<> revoke(stdx::string_view revokee_name, stdx::string_view role_name) const = 0;
///
/// \returns an exceptional future with \ref nonexistant_role if the role does not exist.

View File

@@ -36,6 +36,21 @@ namespace meta {
namespace roles_table {
stdx::string_view creation_query() {
static const sstring instance = sprint(
"CREATE TABLE %s ("
" %s text PRIMARY KEY,"
" can_login boolean,"
" is_superuser boolean,"
" member_of set<text>,"
" salted_hash text"
")",
qualified_name(),
role_col_name);
return instance;
}
stdx::string_view qualified_name() noexcept {
static const sstring instance = AUTH_KS + "." + sstring(name);
return instance;
@@ -57,12 +72,14 @@ future<bool> default_role_row_satisfies(
return qp.process(
query,
db::consistency_level::ONE,
infinite_timeout_config,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([&qp, &p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return qp.process(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config(),
{meta::DEFAULT_SUPERUSER_NAME},
true).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
@@ -86,7 +103,8 @@ future<bool> any_nondefault_role_row_satisfies(
return do_with(std::move(p), [&qp](const auto& p) {
return qp.process(
query,
db::consistency_level::QUORUM).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return false;
}

View File

@@ -40,6 +40,8 @@ namespace meta {
namespace roles_table {
stdx::string_view creation_query();
constexpr stdx::string_view name{"roles", 5};
stdx::string_view qualified_name() noexcept;

View File

@@ -37,7 +37,7 @@
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
#include "db/config.hh"
#include "db/consistency_level.hh"
#include "db/consistency_level_type.hh"
#include "exceptions/exceptions.hh"
#include "log.hh"
#include "service/migration_listener.hh"
@@ -77,11 +77,18 @@ private:
void on_update_view(const sstring& ks_name, const sstring& view_name, bool columns_changed) override {}
void on_drop_keyspace(const sstring& ks_name) override {
_authorizer.revoke_all(auth::make_data_resource(ks_name));
_authorizer.revoke_all(
auth::make_data_resource(ks_name)).handle_exception_type([](const unsupported_authorization_operation&) {
// Nothing.
});
}
void on_drop_column_family(const sstring& ks_name, const sstring& cf_name) override {
_authorizer.revoke_all(auth::make_data_resource(ks_name, cf_name));
_authorizer.revoke_all(
auth::make_data_resource(
ks_name, cf_name)).handle_exception_type([](const unsupported_authorization_operation&) {
// Nothing.
});
}
void on_drop_user_type(const sstring& ks_name, const sstring& type_name) override {}
@@ -177,9 +184,7 @@ future<> service::start() {
return once_among_shards([this] {
return create_keyspace_if_missing();
}).then([this] {
return _role_manager->start();
}).then([this] {
return when_all_succeed(_authorizer->start(), _authenticator->start());
return when_all_succeed(_role_manager->start(), _authorizer->start(), _authenticator->start());
}).then([this] {
_permissions_cache = std::make_unique<permissions_cache>(_permissions_cache_config, *this, log);
}).then([this] {
@@ -191,6 +196,10 @@ future<> service::start() {
}
future<> service::stop() {
// Only one of the shards has the listener registered, but let's try to
// unregister on each one just to make sure.
_migration_manager.unregister_listener(_migration_listener.get());
return _permissions_cache->stop().then([this] {
return when_all_succeed(_role_manager->stop(), _authorizer->stop(), _authenticator->stop());
});
@@ -218,6 +227,7 @@ future<bool> service::has_existing_legacy_users() const {
return _qp.process(
default_user_query,
db::consistency_level::ONE,
infinite_timeout_config,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([this](auto results) {
if (!results->empty()) {
@@ -227,6 +237,7 @@ future<bool> service::has_existing_legacy_users() const {
return _qp.process(
default_user_query,
db::consistency_level::QUORUM,
infinite_timeout_config,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([this](auto results) {
if (!results->empty()) {
@@ -235,7 +246,8 @@ future<bool> service::has_existing_legacy_users() const {
return _qp.process(
all_users_query,
db::consistency_level::QUORUM).then([](auto results) {
db::consistency_level::QUORUM,
infinite_timeout_config).then([](auto results) {
return make_ready_future<bool>(!results->empty());
});
});
@@ -402,7 +414,7 @@ static void validate_authentication_options_are_supported(
future<> create_role(
service& ser,
const service& ser,
stdx::string_view name,
const role_config& config,
const authentication_options& options) {
@@ -415,7 +427,7 @@ future<> create_role(
&validate_authentication_options_are_supported,
options,
ser.underlying_authenticator().supported_options()).then([&ser, name, &options] {
return ser.underlying_authenticator().create(sstring(name), options);
return ser.underlying_authenticator().create(name, options);
}).handle_exception([&ser, &name](std::exception_ptr ep) {
// Roll-back.
return ser.underlying_role_manager().drop(name).then([ep = std::move(ep)] {
@@ -426,7 +438,7 @@ future<> create_role(
}
future<> alter_role(
service& ser,
const service& ser,
stdx::string_view name,
const role_config_update& config_update,
const authentication_options& options) {
@@ -444,10 +456,15 @@ future<> alter_role(
});
}
future<> drop_role(service& ser, stdx::string_view name) {
future<> drop_role(const service& ser, stdx::string_view name) {
return do_with(make_role_resource(name), [&ser, name](const resource& r) {
auto& a = ser.underlying_authorizer();
return when_all_succeed(a.revoke_all(name), a.revoke_all(r));
return when_all_succeed(
a.revoke_all(name),
a.revoke_all(r)).handle_exception_type([](const unsupported_authorization_operation&) {
// Nothing.
});
}).then([&ser, name] {
return ser.underlying_authenticator().drop(name);
}).then([&ser, name] {
@@ -471,7 +488,7 @@ future<bool> has_role(const service& ser, const authenticated_user& u, stdx::str
}
future<> grant_permissions(
service& ser,
const service& ser,
stdx::string_view role_name,
permission_set perms,
const resource& r) {
@@ -480,8 +497,19 @@ future<> grant_permissions(
});
}
future<> grant_applicable_permissions(const service& ser, stdx::string_view role_name, const resource& r) {
return grant_permissions(ser, role_name, r.applicable_permissions(), r);
}
future<> grant_applicable_permissions(const service& ser, const authenticated_user& u, const resource& r) {
if (is_anonymous(u)) {
return make_ready_future<>();
}
return grant_applicable_permissions(ser, *u.name, r);
}
future<> revoke_permissions(
service& ser,
const service& ser,
stdx::string_view role_name,
permission_set perms,
const resource& r) {

View File

@@ -75,12 +75,14 @@ public:
};
///
/// Central interface into access-control for the system.
/// Client for access-control in the system.
///
/// Access control encompasses user/role management, authentication, and authorization. This class provides access to
/// Access control encompasses user/role management, authentication, and authorization. This client provides access to
/// the dynamically-loaded implementations of these modules (through the `underlying_*` member functions), but also
/// builds on their functionality with caching and abstractions for common operations.
///
/// All state associated with access-control is stored externally to any particular instance of this class.
///
class service final {
permissions_cache_config _permissions_cache_config;
std::unique_ptr<permissions_cache> _permissions_cache;
@@ -149,26 +151,14 @@ public:
future<bool> exists(const resource&) const;
authenticator& underlying_authenticator() {
return *_authenticator;
}
const authenticator& underlying_authenticator() const {
return *_authenticator;
}
authorizer& underlying_authorizer() {
return *_authorizer;
}
const authorizer& underlying_authorizer() const {
return *_authorizer;
}
role_manager& underlying_role_manager() {
return *_role_manager;
}
const role_manager& underlying_role_manager() const {
return *_role_manager;
}
@@ -206,7 +196,7 @@ bool is_protected(const service&, const resource&) noexcept;
/// \returns an exceptional future with \ref unsupported_authentication_option if an unsupported option is included.
///
future<> create_role(
service&,
const service&,
stdx::string_view name,
const role_config&,
const authentication_options&);
@@ -219,7 +209,7 @@ future<> create_role(
/// \returns an exceptional future with \ref unsupported_authentication_option if an unsupported option is included.
///
future<> alter_role(
service&,
const service&,
stdx::string_view name,
const role_config_update&,
const authentication_options&);
@@ -229,7 +219,7 @@ future<> alter_role(
///
/// \returns an exceptional future with \ref nonexistant_role if the named role does not exist.
///
future<> drop_role(service&, stdx::string_view name);
future<> drop_role(const service&, stdx::string_view name);
///
/// Check if `grantee` has been granted the named role.
@@ -247,17 +237,34 @@ future<bool> has_role(const service&, const authenticated_user&, stdx::string_vi
///
/// \returns an exceptional future with \ref nonexistent_role if the named role does not exist.
///
/// \returns an exceptional future with \ref unsupported_authorization_operation if granting permissions is not
/// supported.
///
future<> grant_permissions(
service&,
const service&,
stdx::string_view role_name,
permission_set,
const resource&);
///
/// Like \ref grant_permissions, but grants all applicable permissions on the resource.
///
/// \returns an exceptional future with \ref nonexistent_role if the named role does not exist.
///
/// \returns an exceptional future with \ref unsupported_authorization_operation if granting permissions is not
/// supported.
///
future<> grant_applicable_permissions(const service&, stdx::string_view role_name, const resource&);
future<> grant_applicable_permissions(const service&, const authenticated_user&, const resource&);
///
/// \returns an exceptional future with \ref nonexistent_role if the named role does not exist.
///
/// \returns an exceptional future with \ref unsupported_authorization_operation if revoking permissions is not
/// supported.
///
future<> revoke_permissions(
service&,
const service&,
stdx::string_view role_name,
permission_set,
const resource&);
@@ -277,6 +284,9 @@ using recursive_permissions = bool_class<struct recursive_permissions_tag>;
/// \returns an exceptional future with \ref nonexistent_role if a role name is included which refers to a role that
/// does not exist.
///
/// \returns an exceptional future with \ref unsupported_authorization_operation if listing permissions is not
/// supported.
///
future<std::vector<permission_details>> list_filtered_permissions(
const service&,
permission_set,

View File

@@ -89,6 +89,7 @@ static future<stdx::optional<record>> find_record(cql3::query_processor& qp, std
return qp.process(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
{sstring(role_name)},
true).then([](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
@@ -118,6 +119,10 @@ static future<record> require_record(cql3::query_processor& qp, stdx::string_vie
});
}
static bool has_can_login(const cql3::untyped_result_set_row& row) {
return row.has("can_login") && !(boolean_type->deserialize(row.get_blob("can_login")).is_null());
}
stdx::string_view standard_role_manager_name() noexcept {
static const sstring instance = meta::AUTH_PACKAGE_NAME + "CassandraRoleManager";
return instance;
@@ -135,18 +140,7 @@ const resource_set& standard_role_manager::protected_resources() const {
return resources;
}
future<> standard_role_manager::create_metadata_tables_if_missing() {
static const sstring create_roles_query = sprint(
"CREATE TABLE %s ("
" %s text PRIMARY KEY,"
" can_login boolean,"
" is_superuser boolean,"
" member_of set<text>,"
" salted_hash text"
")",
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
future<> standard_role_manager::create_metadata_tables_if_missing() const {
static const sstring create_role_members_query = sprint(
"CREATE TABLE %s ("
" role text,"
@@ -158,19 +152,19 @@ future<> standard_role_manager::create_metadata_tables_if_missing() {
return when_all_succeed(
create_metadata_table_if_missing(
sstring(meta::roles_table::name),
meta::roles_table::name,
_qp,
create_roles_query,
meta::roles_table::creation_query(),
_migration_manager),
create_metadata_table_if_missing(
sstring(meta::role_members_table::name),
meta::role_members_table::name,
_qp,
create_role_members_query,
_migration_manager));
}
future<> standard_role_manager::create_default_role_if_missing() {
return default_role_row_satisfies(_qp, [](auto&&) { return true; }).then([this](bool exists) {
future<> standard_role_manager::create_default_role_if_missing() const {
return default_role_row_satisfies(_qp, &has_can_login).then([this](bool exists) {
if (!exists) {
static const sstring query = sprint(
"INSERT INTO %s (%s, is_superuser, can_login) VALUES (?, true, true)",
@@ -180,6 +174,7 @@ future<> standard_role_manager::create_default_role_if_missing() {
return _qp.process(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config(),
{meta::DEFAULT_SUPERUSER_NAME}).then([](auto&&) {
log.info("Created default superuser role '{}'.", meta::DEFAULT_SUPERUSER_NAME);
return make_ready_future<>();
@@ -199,13 +194,14 @@ bool standard_role_manager::legacy_metadata_exists() const {
return _qp.db().local().has_schema(meta::AUTH_KS, legacy_table_name);
}
future<> standard_role_manager::migrate_legacy_metadata() {
future<> standard_role_manager::migrate_legacy_metadata() const {
log.info("Starting migration of legacy user metadata.");
static const sstring query = sprint("SELECT * FROM %s.%s", meta::AUTH_KS, legacy_table_name);
return _qp.process(
query,
db::consistency_level::QUORUM).then([this](::shared_ptr<cql3::untyped_result_set> results) {
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
role_config config;
config.is_superuser = row.get_as<bool>("super");
@@ -231,7 +227,9 @@ future<> standard_role_manager::start() {
return this->create_metadata_tables_if_missing().then([this] {
_stopped = auth::do_after_system_ready(_as, [this] {
return seastar::async([this] {
if (any_nondefault_role_row_satisfies(_qp, [](auto&&) { return true; }).get0()) {
wait_for_schema_agreement(_migration_manager, _qp.db().local()).get0();
if (any_nondefault_role_row_satisfies(_qp, &has_can_login).get0()) {
if (this->legacy_metadata_exists()) {
log.warn("Ignoring legacy user metadata since nondefault roles already exist.");
}
@@ -256,7 +254,7 @@ future<> standard_role_manager::stop() {
return _stopped.handle_exception_type([] (const sleep_aborted&) { });
}
future<> standard_role_manager::create_or_replace(stdx::string_view role_name, const role_config& c) {
future<> standard_role_manager::create_or_replace(stdx::string_view role_name, const role_config& c) const {
static const sstring query = sprint(
"INSERT INTO %s (%s, is_superuser, can_login) VALUES (?, ?, ?)",
meta::roles_table::qualified_name(),
@@ -265,12 +263,13 @@ future<> standard_role_manager::create_or_replace(stdx::string_view role_name, c
return _qp.process(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
{sstring(role_name), c.is_superuser, c.can_login},
true).discard_result();
}
future<>
standard_role_manager::create(stdx::string_view role_name, const role_config& c) {
standard_role_manager::create(stdx::string_view role_name, const role_config& c) const {
return this->exists(role_name).then([this, role_name, &c](bool role_exists) {
if (role_exists) {
throw role_already_exists(role_name);
@@ -281,7 +280,7 @@ standard_role_manager::create(stdx::string_view role_name, const role_config& c)
}
future<>
standard_role_manager::alter(stdx::string_view role_name, const role_config_update& u) {
standard_role_manager::alter(stdx::string_view role_name, const role_config_update& u) const {
static const auto build_column_assignments = [](const role_config_update& u) -> sstring {
std::vector<sstring> assignments;
@@ -308,11 +307,12 @@ standard_role_manager::alter(stdx::string_view role_name, const role_config_upda
build_column_assignments(u),
meta::roles_table::role_col_name),
consistency_for_role(role_name),
internal_distributed_timeout_config(),
{sstring(role_name)}).discard_result();
});
}
future<> standard_role_manager::drop(stdx::string_view role_name) {
future<> standard_role_manager::drop(stdx::string_view role_name) const {
return this->exists(role_name).then([this, role_name](bool role_exists) {
if (!role_exists) {
throw nonexistant_role(role_name);
@@ -327,6 +327,7 @@ future<> standard_role_manager::drop(stdx::string_view role_name) {
return _qp.process(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
{sstring(role_name)}).then([this, role_name](::shared_ptr<cql3::untyped_result_set> members) {
return parallel_for_each(
members->begin(),
@@ -366,6 +367,7 @@ future<> standard_role_manager::drop(stdx::string_view role_name) {
return _qp.process(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
{sstring(role_name)}).discard_result();
};
@@ -379,7 +381,7 @@ future<>
standard_role_manager::modify_membership(
stdx::string_view grantee_name,
stdx::string_view role_name,
membership_change ch) {
membership_change ch) const {
const auto modify_roles = [this, role_name, grantee_name, ch] {
@@ -392,6 +394,7 @@ standard_role_manager::modify_membership(
return _qp.process(
query,
consistency_for_role(grantee_name),
internal_distributed_timeout_config(),
{role_set{sstring(role_name)}, sstring(grantee_name)}).discard_result();
};
@@ -403,6 +406,7 @@ standard_role_manager::modify_membership(
"INSERT INTO %s (role, member) VALUES (?, ?)",
meta::role_members_table::qualified_name()),
consistency_for_role(role_name),
internal_distributed_timeout_config(),
{sstring(role_name), sstring(grantee_name)}).discard_result();
case membership_change::remove:
@@ -411,6 +415,7 @@ standard_role_manager::modify_membership(
"DELETE FROM %s WHERE role = ? AND member = ?",
meta::role_members_table::qualified_name()),
consistency_for_role(role_name),
internal_distributed_timeout_config(),
{sstring(role_name), sstring(grantee_name)}).discard_result();
}
@@ -421,7 +426,7 @@ standard_role_manager::modify_membership(
}
future<>
standard_role_manager::grant(stdx::string_view grantee_name, stdx::string_view role_name) {
standard_role_manager::grant(stdx::string_view grantee_name, stdx::string_view role_name) const {
const auto check_redundant = [this, role_name, grantee_name] {
return this->query_granted(
grantee_name,
@@ -452,7 +457,7 @@ standard_role_manager::grant(stdx::string_view grantee_name, stdx::string_view r
}
future<>
standard_role_manager::revoke(stdx::string_view revokee_name, stdx::string_view role_name) {
standard_role_manager::revoke(stdx::string_view revokee_name, stdx::string_view role_name) const {
return this->exists(role_name).then([this, revokee_name, role_name](bool role_exists) {
if (!role_exists) {
throw nonexistant_role(sstring(role_name));
@@ -511,7 +516,10 @@ future<role_set> standard_role_manager::query_all() const {
// To avoid many copies of a view.
static const auto role_col_name_string = sstring(meta::roles_table::role_col_name);
return _qp.process(query, db::consistency_level::QUORUM).then([](::shared_ptr<cql3::untyped_result_set> results) {
return _qp.process(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([](::shared_ptr<cql3::untyped_result_set> results) {
role_set roles;
std::transform(

View File

@@ -66,15 +66,15 @@ public:
virtual future<> stop() override;
virtual future<> create(stdx::string_view role_name, const role_config&) override;
virtual future<> create(stdx::string_view role_name, const role_config&) const override;
virtual future<> drop(stdx::string_view role_name) override;
virtual future<> drop(stdx::string_view role_name) const override;
virtual future<> alter(stdx::string_view role_name, const role_config_update&) override;
virtual future<> alter(stdx::string_view role_name, const role_config_update&) const override;
virtual future<> grant(stdx::string_view grantee_name, stdx::string_view role_name) override;
virtual future<> grant(stdx::string_view grantee_name, stdx::string_view role_name) const override;
virtual future<> revoke(stdx::string_view revokee_name, stdx::string_view role_name) override;
virtual future<> revoke(stdx::string_view revokee_name, stdx::string_view role_name) const override;
virtual future<role_set> query_granted(stdx::string_view grantee_name, recursive_role_query) const override;
@@ -89,17 +89,17 @@ public:
private:
enum class membership_change { add, remove };
future<> create_metadata_tables_if_missing();
future<> create_metadata_tables_if_missing() const;
bool legacy_metadata_exists() const;
future<> migrate_legacy_metadata();
future<> migrate_legacy_metadata() const;
future<> create_default_role_if_missing();
future<> create_default_role_if_missing() const;
future<> create_or_replace(stdx::string_view role_name, const role_config&);
future<> create_or_replace(stdx::string_view role_name, const role_config&) const;
future<> modify_membership(stdx::string_view role_name, stdx::string_view grantee_name, membership_change);
future<> modify_membership(stdx::string_view role_name, stdx::string_view grantee_name, membership_change) const;
};
}

View File

@@ -118,18 +118,22 @@ public:
});
}
virtual future<> create(stdx::string_view role_name, const authentication_options& options) override {
virtual future<> create(stdx::string_view role_name, const authentication_options& options) const override {
return _authenticator->create(role_name, options);
}
virtual future<> alter(stdx::string_view role_name, const authentication_options& options) override {
virtual future<> alter(stdx::string_view role_name, const authentication_options& options) const override {
return _authenticator->alter(role_name, options);
}
virtual future<> drop(stdx::string_view role_name) override {
virtual future<> drop(stdx::string_view role_name) const override {
return _authenticator->drop(role_name);
}
virtual future<custom_options> query_custom_options(stdx::string_view role_name) const override {
return _authenticator->query_custom_options(role_name);
}
virtual const resource_set& protected_resources() const override {
return _authenticator->protected_resources();
}
@@ -214,11 +218,11 @@ public:
return make_ready_future<permission_set>(transitional_permissions);
}
virtual future<> grant(stdx::string_view s, permission_set ps, const resource& r) override {
virtual future<> grant(stdx::string_view s, permission_set ps, const resource& r) const override {
return _authorizer->grant(s, std::move(ps), r);
}
virtual future<> revoke(stdx::string_view s, permission_set ps, const resource& r) override {
virtual future<> revoke(stdx::string_view s, permission_set ps, const resource& r) const override {
return _authorizer->revoke(s, std::move(ps), r);
}
@@ -226,11 +230,11 @@ public:
return _authorizer->list_all();
}
virtual future<> revoke_all(stdx::string_view s) override {
virtual future<> revoke_all(stdx::string_view s) const override {
return _authorizer->revoke_all(s);
}
virtual future<> revoke_all(const resource& r) override {
virtual future<> revoke_all(const resource& r) const override {
return _authorizer->revoke_all(r);
}

View File

@@ -47,6 +47,7 @@
class backlog_controller {
public:
future<> shutdown() {
_update_timer.cancel();
return std::move(_inflight_update);
}
protected:
@@ -95,6 +96,12 @@ protected:
}
virtual ~backlog_controller() {}
public:
backlog_controller(backlog_controller&&) = default;
float backlog_of_shares(float shares) const;
seastar::scheduling_group sg() {
return _scheduling_group;
}
};
// memtable flush CPU controller.
@@ -118,7 +125,7 @@ public:
flush_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, float static_shares) : backlog_controller(sg, iop, static_shares) {}
flush_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, std::chrono::milliseconds interval, float soft_limit, std::function<float()> current_dirty)
: backlog_controller(sg, iop, std::move(interval),
std::vector<backlog_controller::control_point>({{soft_limit, 100}, {soft_limit + (hard_dirty_limit - soft_limit) / 2, 200} , {hard_dirty_limit, 1000}}),
std::vector<backlog_controller::control_point>({{soft_limit, 10}, {soft_limit + (hard_dirty_limit - soft_limit) / 2, 200} , {hard_dirty_limit, 1000}}),
std::move(current_dirty)
)
{}
@@ -126,7 +133,9 @@ public:
class compaction_controller : public backlog_controller {
public:
static constexpr unsigned normalization_factor = 10;
static constexpr unsigned normalization_factor = 30;
static constexpr float disable_backlog = std::numeric_limits<double>::infinity();
static constexpr float backlog_disabled(float backlog) { return std::isinf(backlog); }
compaction_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, float static_shares) : backlog_controller(sg, iop, static_shares) {}
compaction_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, std::chrono::milliseconds interval, std::function<float()> current_backlog)
: backlog_controller(sg, iop, std::move(interval),

View File

@@ -29,7 +29,7 @@
#include <functional>
#include "utils/mutable_view.hh"
using bytes = basic_sstring<int8_t, uint32_t, 31>;
using bytes = basic_sstring<int8_t, uint32_t, 31, false>;
using bytes_view = std::experimental::basic_string_view<int8_t>;
using bytes_mutable_view = basic_mutable_view<bytes_view::value_type>;
using bytes_opt = std::experimental::optional<bytes>;
@@ -78,3 +78,11 @@ struct appending_hash<bytes_view> {
h.update(reinterpret_cast<const char*>(v.begin()), v.size() * sizeof(bytes_view::value_type));
}
};
inline int32_t compare_unsigned(bytes_view v1, bytes_view v2) {
auto n = memcmp(v1.begin(), v2.begin(), std::min(v1.size(), v2.size()));
if (n) {
return n;
}
return (int32_t) (v1.size() - v2.size());
}

View File

@@ -65,8 +65,9 @@ private:
size_type _size;
public:
class fragment_iterator : public std::iterator<std::input_iterator_tag, bytes_view> {
chunk* _current;
chunk* _current = nullptr;
public:
fragment_iterator() = default;
fragment_iterator(chunk* current) : _current(current) {}
fragment_iterator(const fragment_iterator&) = default;
fragment_iterator& operator=(const fragment_iterator&) = default;
@@ -289,6 +290,24 @@ public:
}
}
// Removes n bytes from the end of the bytes_ostream.
// Beware of O(n) algorithm.
void remove_suffix(size_t n) {
_size -= n;
auto left = _size;
auto current = _begin.get();
while (current) {
if (current->offset >= left) {
current->offset = left;
_current = current;
current->next.reset();
return;
}
left -= current->offset;
current = current->next.get();
}
}
// begin() and end() form an input range to bytes_view representing fragments.
// Any modification of this instance invalidates iterators.
fragment_iterator begin() const { return { _begin.get() }; }

View File

@@ -60,6 +60,7 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
// - _next_row_in_range = _next.position() < _upper_bound
// - _last_row points at a direct predecessor of the next row which is going to be read.
// Used for populating continuity.
// - _population_range_starts_before_all_rows is set accordingly
reading_from_underlying,
end_of_stream
@@ -86,6 +87,18 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
partition_snapshot_row_cursor _next_row;
bool _next_row_in_range = false;
// True iff current population interval, since the previous clustering row, starts before all clustered rows.
// We cannot just look at _lower_bound, because emission of range tombstones changes _lower_bound and
// because we mark clustering intervals as continuous when consuming a clustering_row, it would prevent
// us from marking the interval as continuous.
// Valid when _state == reading_from_underlying.
bool _population_range_starts_before_all_rows;
// Whether _lower_bound was changed within current fill_buffer().
// If it did not then we cannot break out of it (e.g. on preemption) because
// forward progress is not guaranteed in case iterators are getting constantly invalidated.
bool _lower_bound_changed = false;
future<> do_fill_buffer(db::timeout_clock::time_point);
void copy_from_cache_to_buffer();
future<> process_static_row(db::timeout_clock::time_point);
@@ -226,6 +239,7 @@ inline
future<> cache_flat_mutation_reader::do_fill_buffer(db::timeout_clock::time_point timeout) {
if (_state == state::move_to_underlying) {
_state = state::reading_from_underlying;
_population_range_starts_before_all_rows = _lower_bound.is_before_all_clustered_rows(*_schema);
auto end = _next_row_in_range ? position_in_partition(_next_row.position())
: position_in_partition(_upper_bound);
return _read_context->fast_forward_to(position_range{_lower_bound, std::move(end)}, timeout).then([this, timeout] {
@@ -253,9 +267,13 @@ future<> cache_flat_mutation_reader::do_fill_buffer(db::timeout_clock::time_poin
}
_next_row.maybe_refresh();
clogger.trace("csm {}: next={}, cont={}", this, _next_row.position(), _next_row.continuous());
while (!is_buffer_full() && _state == state::reading_from_cache) {
_lower_bound_changed = false;
while (_state == state::reading_from_cache) {
copy_from_cache_to_buffer();
if (need_preempt()) {
// We need to check _lower_bound_changed even if is_buffer_full() because
// we may have emitted only a range tombstone which overlapped with _lower_bound
// and thus didn't cause _lower_bound to change.
if ((need_preempt() || is_buffer_full()) && _lower_bound_changed) {
break;
}
}
@@ -346,12 +364,12 @@ future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::tim
}
});
return make_ready_future<>();
});
}, timeout);
}
inline
bool cache_flat_mutation_reader::ensure_population_lower_bound() {
if (!_ck_ranges_curr->start()) {
if (_population_range_starts_before_all_rows) {
return true;
}
if (!_last_row.refresh(*_snp)) {
@@ -365,7 +383,7 @@ bool cache_flat_mutation_reader::ensure_population_lower_bound() {
rows_entry::compare less(*_schema);
// FIXME: Avoid the copy by inserting an incomplete clustering row
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(*_last_row));
current_allocator().construct<rows_entry>(*_schema, *_last_row));
e->set_continuous(false);
auto insert_result = rows.insert_check(rows.end(), *e, less);
auto inserted = insert_result.second;
@@ -406,6 +424,7 @@ inline
void cache_flat_mutation_reader::maybe_add_to_cache(const clustering_row& cr) {
if (!can_populate()) {
_last_row = nullptr;
_population_range_starts_before_all_rows = false;
_read_context->cache().on_mispopulate();
return;
}
@@ -418,7 +437,7 @@ void cache_flat_mutation_reader::maybe_add_to_cache(const clustering_row& cr) {
cr.cells().prepare_hash(*_schema, column_kind::regular_column);
}
auto new_entry = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(cr.key(), cr.tomb(), cr.marker(), cr.cells()));
current_allocator().construct<rows_entry>(*_schema, cr.key(), cr.tomb(), cr.marker(), cr.cells()));
new_entry->set_continuous(false);
auto it = _next_row.iterators_valid() ? _next_row.get_iterator_in_latest_version()
: mp.clustered_rows().lower_bound(cr.key(), less);
@@ -439,6 +458,7 @@ void cache_flat_mutation_reader::maybe_add_to_cache(const clustering_row& cr) {
with_allocator(standard_allocator(), [&] {
_last_row = partition_snapshot_row_weakref(*_snp, it, true);
});
_population_range_starts_before_all_rows = false;
});
}
@@ -460,15 +480,19 @@ void cache_flat_mutation_reader::copy_from_cache_to_buffer() {
_next_row.touch();
position_in_partition_view next_lower_bound = _next_row.dummy() ? _next_row.position() : position_in_partition_view::after_key(_next_row.key());
for (auto &&rts : _snp->range_tombstones(_lower_bound, _next_row_in_range ? next_lower_bound : _upper_bound)) {
position_in_partition::less_compare less(*_schema);
// This guarantees that rts starts after any emitted clustering_row
// and not before any emitted range tombstone.
if (rts.trim_front(*_schema, _lower_bound)) {
if (!less(_lower_bound, rts.position())) {
rts.set_start(*_schema, _lower_bound);
} else {
_lower_bound = position_in_partition(rts.position());
_lower_bound_changed = true;
if (is_buffer_full()) {
return;
}
push_mutation_fragment(std::move(rts));
}
push_mutation_fragment(std::move(rts));
}
// We add the row to the buffer even when it's full.
// This simplifies the code. For more info see #3139.
@@ -505,6 +529,7 @@ void cache_flat_mutation_reader::move_to_range(query::clustering_row_ranges::con
_last_row = nullptr;
_lower_bound = std::move(lb);
_upper_bound = std::move(ub);
_lower_bound_changed = true;
_ck_ranges_curr = next_it;
auto adjacent = _next_row.advance_to(_lower_bound);
_next_row_in_range = !after_current_range(_next_row.position());
@@ -582,6 +607,7 @@ void cache_flat_mutation_reader::add_clustering_row_to_buffer(mutation_fragment&
auto new_lower_bound = position_in_partition::after_key(row.key());
push_mutation_fragment(std::move(mf));
_lower_bound = std::move(new_lower_bound);
_lower_bound_changed = true;
}
inline
@@ -589,10 +615,16 @@ void cache_flat_mutation_reader::add_to_buffer(range_tombstone&& rt) {
clogger.trace("csm {}: add_to_buffer({})", this, rt);
// This guarantees that rt starts after any emitted clustering_row
// and not before any emitted range tombstone.
if (!rt.trim_front(*_schema, _lower_bound)) {
position_in_partition::less_compare less(*_schema);
if (!less(_lower_bound, rt.end_position())) {
return;
}
_lower_bound = position_in_partition(rt.position());
if (!less(_lower_bound, rt.position())) {
rt.set_start(*_schema, _lower_bound);
} else {
_lower_bound = position_in_partition(rt.position());
_lower_bound_changed = true;
}
push_mutation_fragment(std::move(rt));
}

View File

@@ -23,21 +23,7 @@
#include <boost/intrusive/unordered_set.hpp>
#if __has_include(<boost/container/small_vector.hpp>)
#include <boost/container/small_vector.hpp>
template <typename T, size_t N>
using small_vector = boost::container::small_vector<T, N>;
#else
#include <vector>
template <typename T, size_t N>
using small_vector = std::vector<T>;
#endif
#include "utils/small_vector.hh"
#include "fnv1a_hasher.hh"
#include "mutation_fragment.hh"
#include "mutation_partition.hh"
@@ -45,7 +31,7 @@ using small_vector = std::vector<T>;
#include "db/timeout_clock.hh"
class cells_range {
using ids_vector_type = small_vector<column_id, 5>;
using ids_vector_type = utils::small_vector<column_id, 5>;
position_in_partition_view _position;
ids_vector_type _ids;

View File

@@ -70,7 +70,7 @@ public:
{
if (!with_static_row) {
if (_current == _end) {
_current_start = _current_end = position_in_partition_view::after_all_clustered_rows();
_current_start = position_in_partition_view::before_all_clustered_rows();
} else {
_current_start = position_in_partition_view::for_range_start(*_current);
_current_end = position_in_partition_view::for_range_end(*_current);

View File

@@ -25,7 +25,8 @@
#include "exceptions/exceptions.hh"
#include "sstables/compaction_backlog_manager.hh"
class column_family;
class table;
using column_family = table;
class schema;
using schema_ptr = lw_shared_ptr<const schema>;

View File

@@ -25,6 +25,7 @@
#include <boost/range/adaptor/transformed.hpp>
#include "compound.hh"
#include "schema.hh"
#include "sstables/version.hh"
//
// This header provides adaptors between the representation used by our compound_type<>
@@ -302,7 +303,7 @@ private:
}
public:
template <typename Describer>
auto describe_type(Describer f) const {
auto describe_type(sstables::sstable_version_types v, Describer f) const {
return f(const_cast<bytes&>(_bytes));
}

View File

@@ -241,7 +241,7 @@ size_t lz4_processor::compress(const char* input, size_t input_len,
output[1] = (input_len >> 8) & 0xFF;
output[2] = (input_len >> 16) & 0xFF;
output[3] = (input_len >> 24) & 0xFF;
#ifdef HAVE_LZ4_COMPRESS_DEFAULT
#ifdef SEASTAR_HAVE_LZ4_COMPRESS_DEFAULT
auto ret = LZ4_compress_default(input, output + 4, input_len, LZ4_compressBound(input_len));
#else
auto ret = LZ4_compress(input, output + 4, input_len);

View File

@@ -228,6 +228,7 @@ scylla_tests = [
'tests/memory_footprint',
'tests/perf/perf_sstable',
'tests/cql_query_test',
'tests/secondary_index_test',
'tests/storage_proxy_test',
'tests/schema_change_test',
'tests/mutation_reader_test',
@@ -235,6 +236,7 @@ scylla_tests = [
'tests/row_cache_test',
'tests/test-serialization',
'tests/sstable_test',
'tests/sstable_3_x_test',
'tests/sstable_mutation_test',
'tests/sstable_resharding_test',
'tests/memtable_test',
@@ -273,12 +275,15 @@ scylla_tests = [
'tests/input_stream_test',
'tests/virtual_reader_test',
'tests/view_schema_test',
'tests/view_build_test',
'tests/view_complex_test',
'tests/counter_test',
'tests/cell_locker_test',
'tests/row_locker_test',
'tests/streaming_histogram_test',
'tests/duration_test',
'tests/vint_serialization_test',
'tests/continuous_data_consumer_test',
'tests/compress_test',
'tests/chunked_vector_test',
'tests/loading_cache_test',
@@ -293,11 +298,18 @@ scylla_tests = [
'tests/extensions_test',
'tests/cql_auth_syntax_test',
'tests/querier_cache',
'tests/querier_cache_resource_based_eviction',
'tests/limiting_data_source_test',
'tests/meta_test',
'tests/imr_test',
'tests/partition_data_test',
'tests/reusable_buffer_test',
'tests/json_test'
]
perf_tests = [
'tests/perf/perf_mutation_readers'
'tests/perf/perf_mutation_readers',
'tests/perf/perf_mutation_fragment',
'tests/perf/perf_idl',
]
apps = [
@@ -358,6 +370,10 @@ arg_parser.add_argument('--enable-gcc6-concepts', dest='gcc6_concepts', action='
help='enable experimental support for C++ Concepts as implemented in GCC 6')
arg_parser.add_argument('--enable-alloc-failure-injector', dest='alloc_failure_injector', action='store_true', default=False,
help='enable allocation failure injection')
arg_parser.add_argument('--with-antlr3', dest='antlr3_exec', action='store', default=None,
help='path to antlr3 executable')
arg_parser.add_argument('--with-ragel', dest='ragel_exec', action='store', default=None,
help='path to ragel executable')
args = arg_parser.parse_args()
defines = []
@@ -367,6 +383,7 @@ extra_cxxflags = {}
cassandra_interface = Thrift(source = 'interface/cassandra.thrift', service = 'Cassandra')
scylla_core = (['database.cc',
'atomic_cell.cc',
'schema.cc',
'frozen_schema.cc',
'schema_registry.cc',
@@ -379,21 +396,24 @@ scylla_core = (['database.cc',
'frozen_mutation.cc',
'memtable.cc',
'schema_mutations.cc',
'release.cc',
'supervisor.cc',
'utils/logalloc.cc',
'utils/large_bitset.cc',
'utils/buffer_input_stream.cc',
'utils/limiting_data_source.cc',
'mutation_partition.cc',
'mutation_partition_view.cc',
'mutation_partition_serializer.cc',
'mutation_reader.cc',
'flat_mutation_reader.cc',
'mutation_query.cc',
'json.cc',
'keys.cc',
'counters.cc',
'compress.cc',
'sstables/mp_row_consumer.cc',
'sstables/sstables.cc',
'sstables/sstable_version.cc',
'sstables/compress.cc',
'sstables/row.cc',
'sstables/partition.cc',
@@ -402,9 +422,12 @@ scylla_core = (['database.cc',
'sstables/compaction_manager.cc',
'sstables/integrity_checked_file_impl.cc',
'sstables/prepended_input_stream.cc',
'sstables/m_format_write_helpers.cc',
'sstables/m_format_read_helpers.cc',
'transport/event.cc',
'transport/event_notifier.cc',
'transport/server.cc',
'transport/messages/result_message.cc',
'cql3/abstract_marker.cc',
'cql3/attributes.cc',
'cql3/cf_name.cc',
@@ -492,6 +515,8 @@ scylla_core = (['database.cc',
'cql3/variable_specifications.cc',
'db/consistency_level.cc',
'db/system_keyspace.cc',
'db/system_distributed_keyspace.cc',
'db/size_estimates_virtual_reader.cc',
'db/schema_tables.cc',
'db/cql_type_parser.cc',
'db/legacy_schema_migrator.cc',
@@ -499,15 +524,17 @@ scylla_core = (['database.cc',
'db/commitlog/commitlog_replayer.cc',
'db/commitlog/commitlog_entry.cc',
'db/hints/manager.cc',
'db/hints/resource_manager.cc',
'db/config.cc',
'db/extensions.cc',
'db/heat_load_balance.cc',
'db/index/secondary_index.cc',
'db/large_partition_handler.cc',
'db/marshal/type_parser.cc',
'db/batchlog_manager.cc',
'db/view/view.cc',
'db/view/row_locking.cc',
'index/secondary_index_manager.cc',
'index/secondary_index.cc',
'utils/UUID_gen.cc',
'utils/i_filter.cc',
'utils/bloom_filter.cc',
@@ -604,6 +631,7 @@ scylla_core = (['database.cc',
'vint-serialization.cc',
'utils/arch/powerpc/crc32-vpmsum/crc32_wrapper.cc',
'querier.cc',
'data/cell.cc',
]
+ [Antlr3Grammar('cql3/Cql.g')]
+ [Thrift('interface/cassandra.thrift', 'Cassandra')]
@@ -640,7 +668,9 @@ api = ['api/api.cc',
'api/api-doc/stream_manager.json',
'api/stream_manager.cc',
'api/api-doc/system.json',
'api/system.cc'
'api/system.cc',
'api/config.cc',
'api/api-doc/config.json',
]
idls = ['idl/gossip_digest.idl.hh',
@@ -668,7 +698,7 @@ idls = ['idl/gossip_digest.idl.hh',
'idl/cache_temperature.idl.hh',
]
scylla_tests_dependencies = scylla_core + api + idls + [
scylla_tests_dependencies = scylla_core + idls + [
'tests/cql_test_env.cc',
'tests/cql_assertions.cc',
'tests/result_set_assertions.cc',
@@ -681,7 +711,7 @@ scylla_tests_seastar_deps = [
]
deps = {
'scylla': idls + ['main.cc'] + scylla_core + api,
'scylla': idls + ['main.cc', 'release.cc'] + scylla_core + api,
}
pure_boost_tests = set([
@@ -709,6 +739,11 @@ pure_boost_tests = set([
'tests/auth_resource_test',
'tests/enum_set_test',
'tests/cql_auth_syntax_test',
'tests/meta_test',
'tests/imr_test',
'tests/partition_data_test',
'tests/reusable_buffer_test',
'tests/json_test',
])
tests_not_using_seastar_test_framework = set([
@@ -727,7 +762,6 @@ tests_not_using_seastar_test_framework = set([
'tests/memory_footprint',
'tests/gossip',
'tests/perf/perf_sstable',
'tests/querier_cache_resource_based_eviction',
]) | pure_boost_tests
for t in tests_not_using_seastar_test_framework:
@@ -740,7 +774,7 @@ for t in scylla_tests:
deps[t] += scylla_tests_dependencies
deps[t] += scylla_tests_seastar_deps
else:
deps[t] += scylla_core + api + idls + ['tests/cql_test_env.cc']
deps[t] += scylla_core + idls + ['tests/cql_test_env.cc']
perf_tests_seastar_deps = [
'seastar/tests/perf/perf_tests.cc'
@@ -759,6 +793,10 @@ deps['tests/murmur_hash_test'] = ['bytes.cc', 'utils/murmur_hash.cc', 'tests/mur
deps['tests/allocation_strategy_test'] = ['tests/allocation_strategy_test.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['tests/log_heap_test'] = ['tests/log_heap_test.cc']
deps['tests/anchorless_list_test'] = ['tests/anchorless_list_test.cc']
deps['tests/perf/perf_fast_forward'] += ['release.cc']
deps['tests/meta_test'] = ['tests/meta_test.cc']
deps['tests/imr_test'] = ['tests/imr_test.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['tests/reusable_buffer_test'] = ['tests/reusable_buffer_test.cc']
warnings = [
'-Wno-mismatched-tags', # clang-only
@@ -832,6 +870,22 @@ for pkglist in optional_packages:
alternatives = ':'.join(pkglist[1:])
print('Missing optional package {pkglist[0]} (or alteratives {alternatives})'.format(**locals()))
compiler_test_src = '''
#if __GNUC__ < 7
#error "MAJOR"
#elif __GNUC__ == 7
#if __GNUC_MINOR__ < 3
#error "MINOR"
#endif
#endif
int main() { return 0; }
'''
if not try_compile_and_link(compiler=args.cxx, source=compiler_test_src):
print('Wrong GCC version. Scylla needs GCC >= 7.3 to compile.')
sys.exit(1)
if not try_compile(compiler=args.cxx, source='#include <boost/version.hpp>'):
print('Boost not installed. Please install {}.'.format(pkgname("boost-devel")))
sys.exit(1)
@@ -957,6 +1011,16 @@ do_sanitize = True
if args.static:
do_sanitize = False
if args.antlr3_exec:
antlr3_exec = args.antlr3_exec
else:
antlr3_exec = "antlr3"
if args.ragel_exec:
ragel_exec = args.ragel_exec
else:
ragel_exec = "ragel"
with open(buildfile, 'w') as f:
f.write(textwrap.dedent('''\
configure_args = {configure_args}
@@ -970,7 +1034,7 @@ with open(buildfile, 'w') as f:
pool seastar_pool
depth = 1
rule ragel
command = ragel -G2 -o $out $in
command = {ragel_exec} -G2 -o $out $in
description = RAGEL $out
rule gen
command = echo -e $text > $out
@@ -1016,7 +1080,7 @@ with open(buildfile, 'w') as f:
# Because we add such a variable to every function, and because `ExceptionBaseType` is not a global
# name, we also add a global typedef to avoid compilation errors.
command = sed -e '/^#if 0/,/^#endif/d' $in > $builddir/{mode}/gen/$in $
&& antlr3 $builddir/{mode}/gen/$in $
&& {antlr3_exec} $builddir/{mode}/gen/$in $
&& sed -i -e 's/^\\( *\)\\(ImplTraits::CommonTokenType\\* [a-zA-Z0-9_]* = NULL;\\)$$/\\1const \\2/' $
-e '1i using ExceptionBaseType = int;' $
-e 's/^{{/{{ ExceptionBaseType\* ex = nullptr;/; $
@@ -1024,7 +1088,7 @@ with open(buildfile, 'w') as f:
s/exceptions::syntax_exception e/exceptions::syntax_exception\& e/' $
build/{mode}/gen/${{stem}}Parser.cpp
description = ANTLR3 $in
''').format(mode = mode, **modeval))
''').format(mode = mode, antlr3_exec = antlr3_exec, **modeval))
f.write('build {mode}: phony {artifacts}\n'.format(mode = mode,
artifacts = str.join(' ', ('$builddir/' + mode + '/' + x for x in build_artifacts))))
compiles = {}

View File

@@ -39,16 +39,32 @@ private:
return ::is_compatible(new_def.kind, kind) && new_def.type->is_value_compatible_with(*old_type);
}
static void accept_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, atomic_cell_view cell) {
if (is_compatible(new_def, old_type, kind) && cell.timestamp() > new_def.dropped_at()) {
dst.apply(new_def, atomic_cell_or_collection(cell));
if (!is_compatible(new_def, old_type, kind) || cell.timestamp() <= new_def.dropped_at()) {
return;
}
auto new_cell = [&] {
if (cell.is_live() && !old_type->is_counter()) {
if (cell.is_live_and_has_ttl()) {
return atomic_cell_or_collection(
atomic_cell::make_live(*new_def.type, cell.timestamp(), cell.value().linearize(), cell.expiry(), cell.ttl())
);
}
return atomic_cell_or_collection(
atomic_cell::make_live(*new_def.type, cell.timestamp(), cell.value().linearize())
);
} else {
return atomic_cell_or_collection(*new_def.type, cell);
}
}();
dst.apply(new_def, std::move(new_cell));
}
static void accept_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, collection_mutation_view cell) {
if (!is_compatible(new_def, old_type, kind)) {
return;
}
cell.data.with_linearized([&] (bytes_view cell_bv) {
auto&& ctype = static_pointer_cast<const collection_type_impl>(old_type);
auto old_view = ctype->deserialize_mutation_form(cell);
auto old_view = ctype->deserialize_mutation_form(cell_bv);
collection_type_impl::mutation_view new_view;
if (old_view.tomb.timestamp > new_def.dropped_at()) {
@@ -60,6 +76,7 @@ private:
}
}
dst.apply(new_def, ctype->serialize_mutation_form(std::move(new_view)));
});
}
public:
converting_mutation_partition_applier(
@@ -120,11 +137,11 @@ public:
// Appends the cell to dst upgrading it to the new schema.
// Cells must have monotonic names.
static void append_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, const atomic_cell_or_collection& cell) {
static void append_cell(row& dst, column_kind kind, const column_definition& new_def, const column_definition& old_def, const atomic_cell_or_collection& cell) {
if (new_def.is_atomic()) {
accept_cell(dst, kind, new_def, old_type, cell.as_atomic_cell());
accept_cell(dst, kind, new_def, old_def.type, cell.as_atomic_cell(old_def));
} else {
accept_cell(dst, kind, new_def, old_type, cell.as_collection_mutation());
accept_cell(dst, kind, new_def, old_def.type, cell.as_collection_mutation());
}
}
};

View File

@@ -78,10 +78,10 @@ std::vector<counter_shard> counter_cell_view::shards_compatible_with_1_7_4() con
return sorted_shards;
}
static bool apply_in_place(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
static bool apply_in_place(const column_definition& cdef, atomic_cell_mutable_view dst, atomic_cell_mutable_view src)
{
auto dst_ccmv = counter_cell_mutable_view(dst.as_mutable_atomic_cell());
auto src_ccmv = counter_cell_mutable_view(src.as_mutable_atomic_cell());
auto dst_ccmv = counter_cell_mutable_view(dst);
auto src_ccmv = counter_cell_mutable_view(src);
auto dst_shards = dst_ccmv.shards();
auto src_shards = src_ccmv.shards();
@@ -118,48 +118,19 @@ static bool apply_in_place(atomic_cell_or_collection& dst, atomic_cell_or_collec
auto src_ts = src_ccmv.timestamp();
dst_ccmv.set_timestamp(std::max(dst_ts, src_ts));
src_ccmv.set_timestamp(dst_ts);
src.as_mutable_atomic_cell().set_counter_in_place_revert(true);
return true;
}
static void revert_in_place_apply(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
void counter_cell_view::apply(const column_definition& cdef, atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
{
assert(dst.can_use_mutable_view() && src.can_use_mutable_view());
auto dst_ccmv = counter_cell_mutable_view(dst.as_mutable_atomic_cell());
auto src_ccmv = counter_cell_mutable_view(src.as_mutable_atomic_cell());
auto dst_shards = dst_ccmv.shards();
auto src_shards = src_ccmv.shards();
auto dst_it = dst_shards.begin();
auto src_it = src_shards.begin();
while (src_it != src_shards.end()) {
while (dst_it != dst_shards.end() && dst_it->id() < src_it->id()) {
++dst_it;
}
assert(dst_it != dst_shards.end() && dst_it->id() == src_it->id());
dst_it->swap_value_and_clock(*src_it);
++src_it;
}
auto dst_ts = dst_ccmv.timestamp();
auto src_ts = src_ccmv.timestamp();
dst_ccmv.set_timestamp(src_ts);
src_ccmv.set_timestamp(dst_ts);
src.as_mutable_atomic_cell().set_counter_in_place_revert(false);
}
bool counter_cell_view::apply_reversibly(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
{
auto dst_ac = dst.as_atomic_cell();
auto src_ac = src.as_atomic_cell();
auto dst_ac = dst.as_atomic_cell(cdef);
auto src_ac = src.as_atomic_cell(cdef);
if (!dst_ac.is_live() || !src_ac.is_live()) {
if (dst_ac.is_live() || (!src_ac.is_live() && compare_atomic_cell_for_merge(dst_ac, src_ac) < 0)) {
std::swap(dst, src);
return true;
}
return false;
return;
}
if (dst_ac.is_counter_update() && src_ac.is_counter_update()) {
@@ -167,22 +138,26 @@ bool counter_cell_view::apply_reversibly(atomic_cell_or_collection& dst, atomic_
auto dst_v = dst_ac.counter_update_value();
dst = atomic_cell::make_live_counter_update(std::max(dst_ac.timestamp(), src_ac.timestamp()),
src_v + dst_v);
return true;
return;
}
assert(!dst_ac.is_counter_update());
assert(!src_ac.is_counter_update());
with_linearized(dst_ac, [&] (counter_cell_view dst_ccv) {
with_linearized(src_ac, [&] (counter_cell_view src_ccv) {
if (counter_cell_view(dst_ac).shard_count() >= counter_cell_view(src_ac).shard_count()
&& dst.can_use_mutable_view() && src.can_use_mutable_view()) {
if (apply_in_place(dst, src)) {
return true;
if (dst_ccv.shard_count() >= src_ccv.shard_count()) {
auto dst_amc = dst.as_mutable_atomic_cell(cdef);
auto src_amc = src.as_mutable_atomic_cell(cdef);
if (!dst_amc.is_value_fragmented() && !src_amc.is_value_fragmented()) {
if (apply_in_place(cdef, dst_amc, src_amc)) {
return;
}
}
}
src.as_mutable_atomic_cell().set_counter_in_place_revert(false);
auto dst_shards = counter_cell_view(dst_ac).shards();
auto src_shards = counter_cell_view(src_ac).shards();
auto dst_shards = dst_ccv.shards();
auto src_shards = src_ccv.shards();
counter_cell_builder result;
combine(dst_shards.begin(), dst_shards.end(), src_shards.begin(), src_shards.end(),
@@ -191,22 +166,9 @@ bool counter_cell_view::apply_reversibly(atomic_cell_or_collection& dst, atomic_
});
auto cell = result.build(std::max(dst_ac.timestamp(), src_ac.timestamp()));
src = std::exchange(dst, atomic_cell_or_collection(cell));
return true;
}
void counter_cell_view::revert_apply(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
{
if (dst.as_atomic_cell().is_counter_update()) {
auto src_v = src.as_atomic_cell().counter_update_value();
auto dst_v = dst.as_atomic_cell().counter_update_value();
dst = atomic_cell::make_live(dst.as_atomic_cell().timestamp(),
long_type->decompose(dst_v - src_v));
} else if (src.as_atomic_cell().is_counter_in_place_revert_set()) {
revert_in_place_apply(dst, src);
} else {
std::swap(dst, src);
}
src = std::exchange(dst, atomic_cell_or_collection(std::move(cell)));
});
});
}
stdx::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, atomic_cell_view b)
@@ -216,13 +178,15 @@ stdx::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, at
if (!b.is_live() || !a.is_live()) {
if (b.is_live() || (!a.is_live() && compare_atomic_cell_for_merge(b, a) < 0)) {
return atomic_cell(a);
return atomic_cell(*counter_type, a);
}
return { };
}
auto a_shards = counter_cell_view(a).shards();
auto b_shards = counter_cell_view(b).shards();
return with_linearized(a, [&] (counter_cell_view a_ccv) {
return with_linearized(b, [&] (counter_cell_view b_ccv) {
auto a_shards = a_ccv.shards();
auto b_shards = b_ccv.shards();
auto a_it = a_shards.begin();
auto a_end = a_shards.end();
@@ -244,18 +208,21 @@ stdx::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, at
if (!result.empty()) {
diff = result.build(std::max(a.timestamp(), b.timestamp()));
} else if (a.timestamp() > b.timestamp()) {
diff = atomic_cell::make_live(a.timestamp(), bytes_view());
diff = atomic_cell::make_live(*counter_type, a.timestamp(), bytes_view());
}
return diff;
});
});
}
void transform_counter_updates_to_shards(mutation& m, const mutation* current_state, uint64_t clock_offset) {
// FIXME: allow current_state to be frozen_mutation
auto transform_new_row_to_shards = [clock_offset] (auto& cells) {
cells.for_each_cell([clock_offset] (auto, atomic_cell_or_collection& ac_o_c) {
auto acv = ac_o_c.as_atomic_cell();
auto transform_new_row_to_shards = [&s = *m.schema(), clock_offset] (column_kind kind, auto& cells) {
cells.for_each_cell([&] (column_id id, atomic_cell_or_collection& ac_o_c) {
auto& cdef = s.column_at(kind, id);
auto acv = ac_o_c.as_atomic_cell(cdef);
if (!acv.is_live()) {
return; // continue -- we are in lambda
}
@@ -266,32 +233,35 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st
};
if (!current_state) {
transform_new_row_to_shards(m.partition().static_row());
transform_new_row_to_shards(column_kind::static_column, m.partition().static_row());
for (auto& cr : m.partition().clustered_rows()) {
transform_new_row_to_shards(cr.row().cells());
transform_new_row_to_shards(column_kind::regular_column, cr.row().cells());
}
return;
}
clustering_key::less_compare cmp(*m.schema());
auto transform_row_to_shards = [clock_offset] (auto& transformee, auto& state) {
auto transform_row_to_shards = [&s = *m.schema(), clock_offset] (column_kind kind, auto& transformee, auto& state) {
std::deque<std::pair<column_id, counter_shard>> shards;
state.for_each_cell([&] (column_id id, const atomic_cell_or_collection& ac_o_c) {
auto acv = ac_o_c.as_atomic_cell();
auto& cdef = s.column_at(kind, id);
auto acv = ac_o_c.as_atomic_cell(cdef);
if (!acv.is_live()) {
return; // continue -- we are in lambda
}
counter_cell_view ccv(acv);
counter_cell_view::with_linearized(acv, [&] (counter_cell_view ccv) {
auto cs = ccv.local_shard();
if (!cs) {
return; // continue
}
shards.emplace_back(std::make_pair(id, counter_shard(*cs)));
});
});
transformee.for_each_cell([&] (column_id id, atomic_cell_or_collection& ac_o_c) {
auto acv = ac_o_c.as_atomic_cell();
auto& cdef = s.column_at(kind, id);
auto acv = ac_o_c.as_atomic_cell(cdef);
if (!acv.is_live()) {
return; // continue -- we are in lambda
}
@@ -313,7 +283,7 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st
});
};
transform_row_to_shards(m.partition().static_row(), current_state->partition().static_row());
transform_row_to_shards(column_kind::static_column, m.partition().static_row(), current_state->partition().static_row());
auto& cstate = current_state->partition();
auto it = cstate.clustered_rows().begin();
@@ -323,10 +293,10 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st
++it;
}
if (it == end || cmp(cr.key(), it->key())) {
transform_new_row_to_shards(cr.row().cells());
transform_new_row_to_shards(column_kind::regular_column, cr.row().cells());
continue;
}
transform_row_to_shards(cr.row().cells(), it->row().cells());
transform_row_to_shards(column_kind::regular_column, cr.row().cells(), it->row().cells());
}
}

View File

@@ -79,7 +79,7 @@ static_assert(std::is_pod<counter_id>::value, "counter_id should be a POD type")
std::ostream& operator<<(std::ostream& os, const counter_id& id);
template<typename View>
template<mutable_view is_mutable>
class basic_counter_shard_view {
enum class offset : unsigned {
id = 0u,
@@ -88,7 +88,8 @@ class basic_counter_shard_view {
total_size = unsigned(logical_clock) + sizeof(int64_t),
};
private:
typename View::pointer _base;
using pointer_type = std::conditional_t<is_mutable == mutable_view::no, const signed char*, signed char*>;
pointer_type _base;
private:
template<typename T>
T read(offset off) const {
@@ -100,7 +101,7 @@ public:
static constexpr auto size = size_t(offset::total_size);
public:
basic_counter_shard_view() = default;
explicit basic_counter_shard_view(typename View::pointer ptr) noexcept
explicit basic_counter_shard_view(pointer_type ptr) noexcept
: _base(ptr) { }
counter_id id() const { return read<counter_id>(offset::id); }
@@ -111,7 +112,7 @@ public:
static constexpr size_t off = size_t(offset::value);
static constexpr size_t size = size_t(offset::total_size) - off;
typename View::value_type tmp[size];
signed char tmp[size];
std::copy_n(_base + off, size, tmp);
std::copy_n(other._base + off, size, _base + off);
std::copy_n(tmp, size, other._base + off);
@@ -138,7 +139,7 @@ public:
};
};
using counter_shard_view = basic_counter_shard_view<bytes_view>;
using counter_shard_view = basic_counter_shard_view<mutable_view::no>;
std::ostream& operator<<(std::ostream& os, counter_shard_view csv);
@@ -198,7 +199,7 @@ public:
return do_apply(other);
}
static size_t serialized_size() {
static constexpr size_t serialized_size() {
return counter_shard_view::size;
}
void serialize(bytes::iterator& out) const {
@@ -252,15 +253,33 @@ public:
}
atomic_cell build(api::timestamp_type timestamp) const {
return atomic_cell::make_live_from_serializer(timestamp, serialized_size(), [this] (bytes::iterator out) {
serialize(out);
});
// If we can assume that the counter shards never cross fragment boundaries
// the serialisation code gets much simpler.
static_assert(data::cell::maximum_external_chunk_length % counter_shard::serialized_size() == 0);
auto ac = atomic_cell::make_live_uninitialized(*counter_type, timestamp, serialized_size());
auto dst_it = ac.value().begin();
auto dst_current = *dst_it++;
for (auto&& cs : _shards) {
if (dst_current.empty()) {
dst_current = *dst_it++;
}
assert(!dst_current.empty());
auto value_dst = dst_current.data();
cs.serialize(value_dst);
dst_current.remove_prefix(counter_shard::serialized_size());
}
return ac;
}
static atomic_cell from_single_shard(api::timestamp_type timestamp, const counter_shard& cs) {
return atomic_cell::make_live_from_serializer(timestamp, counter_shard::serialized_size(), [&cs] (bytes::iterator out) {
cs.serialize(out);
});
// We don't really need to bother with fragmentation here.
static_assert(data::cell::maximum_external_chunk_length >= counter_shard::serialized_size());
auto ac = atomic_cell::make_live_uninitialized(*counter_type, timestamp, counter_shard::serialized_size());
auto dst = ac.value().first_fragment().begin();
cs.serialize(dst);
return ac;
}
class inserter_iterator : public std::iterator<std::output_iterator_tag, counter_shard> {
@@ -287,28 +306,32 @@ public:
// <counter_id> := <int64_t><int64_t>
// <shard> := <counter_id><int64_t:value><int64_t:logical_clock>
// <counter_cell> := <shard>*
template<typename View>
template<mutable_view is_mutable>
class basic_counter_cell_view {
protected:
atomic_cell_base<View> _cell;
using linearized_value_view = std::conditional_t<is_mutable == mutable_view::no,
bytes_view, bytes_mutable_view>;
using pointer_type = typename linearized_value_view::pointer;
basic_atomic_cell_view<is_mutable> _cell;
linearized_value_view _value;
private:
class shard_iterator : public std::iterator<std::input_iterator_tag, basic_counter_shard_view<View>> {
typename View::pointer _current;
basic_counter_shard_view<View> _current_view;
class shard_iterator : public std::iterator<std::input_iterator_tag, basic_counter_shard_view<is_mutable>> {
pointer_type _current;
basic_counter_shard_view<is_mutable> _current_view;
public:
shard_iterator() = default;
shard_iterator(typename View::pointer ptr) noexcept
shard_iterator(pointer_type ptr) noexcept
: _current(ptr), _current_view(ptr) { }
basic_counter_shard_view<View>& operator*() noexcept {
basic_counter_shard_view<is_mutable>& operator*() noexcept {
return _current_view;
}
basic_counter_shard_view<View>* operator->() noexcept {
basic_counter_shard_view<is_mutable>* operator->() noexcept {
return &_current_view;
}
shard_iterator& operator++() noexcept {
_current += counter_shard_view::size;
_current_view = basic_counter_shard_view<View>(_current);
_current_view = basic_counter_shard_view<is_mutable>(_current);
return *this;
}
shard_iterator operator++(int) noexcept {
@@ -318,7 +341,7 @@ private:
}
shard_iterator& operator--() noexcept {
_current -= counter_shard_view::size;
_current_view = basic_counter_shard_view<View>(_current);
_current_view = basic_counter_shard_view<is_mutable>(_current);
return *this;
}
shard_iterator operator--(int) noexcept {
@@ -335,22 +358,23 @@ private:
};
public:
boost::iterator_range<shard_iterator> shards() const {
auto bv = _cell.value();
auto begin = shard_iterator(bv.data());
auto end = shard_iterator(bv.data() + bv.size());
auto begin = shard_iterator(_value.data());
auto end = shard_iterator(_value.data() + _value.size());
return boost::make_iterator_range(begin, end);
}
size_t shard_count() const {
return _cell.value().size() / counter_shard_view::size;
return _cell.value().size_bytes() / counter_shard_view::size;
}
public:
protected:
// ac must be a live counter cell
explicit basic_counter_cell_view(atomic_cell_base<View> ac) noexcept : _cell(ac) {
explicit basic_counter_cell_view(basic_atomic_cell_view<is_mutable> ac, linearized_value_view vv) noexcept
: _cell(ac), _value(vv)
{
assert(_cell.is_live());
assert(!_cell.is_counter_update());
}
public:
api::timestamp_type timestamp() const { return _cell.timestamp(); }
static data_type total_value_type() { return long_type; }
@@ -381,18 +405,22 @@ public:
}
};
struct counter_cell_view : basic_counter_cell_view<bytes_view> {
struct counter_cell_view : basic_counter_cell_view<mutable_view::no> {
using basic_counter_cell_view::basic_counter_cell_view;
template<typename Function>
static decltype(auto) with_linearized(basic_atomic_cell_view<mutable_view::no> ac, Function&& fn) {
return ac.value().with_linearized([&] (bytes_view value_view) {
counter_cell_view ccv(ac, value_view);
return fn(ccv);
});
}
// Returns counter shards in an order that is compatible with Scylla 1.7.4.
std::vector<counter_shard> shards_compatible_with_1_7_4() const;
// Reversibly applies two counter cells, at least one of them must be live.
// Returns true iff dst was modified.
static bool apply_reversibly(atomic_cell_or_collection& dst, atomic_cell_or_collection& src);
// Reverts apply performed by apply_reversible().
static void revert_apply(atomic_cell_or_collection& dst, atomic_cell_or_collection& src);
static void apply(const column_definition& cdef, atomic_cell_or_collection& dst, atomic_cell_or_collection& src);
// Computes a counter cell containing minimal amount of data which, when
// applied to 'b' returns the same cell as 'a' and 'b' applied together.
@@ -401,9 +429,15 @@ struct counter_cell_view : basic_counter_cell_view<bytes_view> {
friend std::ostream& operator<<(std::ostream& os, counter_cell_view ccv);
};
struct counter_cell_mutable_view : basic_counter_cell_view<bytes_mutable_view> {
struct counter_cell_mutable_view : basic_counter_cell_view<mutable_view::yes> {
using basic_counter_cell_view::basic_counter_cell_view;
explicit counter_cell_mutable_view(atomic_cell_mutable_view ac) noexcept
: basic_counter_cell_view<mutable_view::yes>(ac, ac.value().first_fragment())
{
assert(!ac.value().is_fragmented());
}
void set_timestamp(api::timestamp_type ts) { _cell.set_timestamp(ts); }
};

View File

@@ -373,7 +373,7 @@ useStatement returns [::shared_ptr<raw::use_statement> stmt]
;
/**
* SELECT <expression>
* SELECT [JSON] <expression>
* FROM <CF>
* WHERE KEY = "key1" AND COL > 1 AND COL < 100
* LIMIT <NUMBER>;
@@ -384,9 +384,12 @@ selectStatement returns [shared_ptr<raw::select_statement> expr]
::shared_ptr<cql3::term::raw> limit;
raw::select_statement::parameters::orderings_type orderings;
bool allow_filtering = false;
bool is_json = false;
}
: K_SELECT ( ( K_DISTINCT { is_distinct = true; } )?
sclause=selectClause
: K_SELECT (
( K_JSON { is_json = true; } )?
( K_DISTINCT { is_distinct = true; } )?
sclause=selectClause
)
K_FROM cf=columnFamilyName
( K_WHERE wclause=whereClause )?
@@ -394,7 +397,7 @@ selectStatement returns [shared_ptr<raw::select_statement> expr]
( K_LIMIT rows=intValue { limit = rows; } )?
( K_ALLOW K_FILTERING { allow_filtering = true; } )?
{
auto params = ::make_shared<raw::select_statement::parameters>(std::move(orderings), is_distinct, allow_filtering);
auto params = ::make_shared<raw::select_statement::parameters>(std::move(orderings), is_distinct, allow_filtering, is_json);
$expr = ::make_shared<raw::select_statement>(std::move(cf), std::move(params),
std::move(sclause), std::move(wclause), std::move(limit));
}
@@ -448,33 +451,51 @@ orderByClause[raw::select_statement::parameters::orderings_type& orderings]
: c=cident (K_ASC | K_DESC { reversed = true; })? { orderings.emplace_back(c, reversed); }
;
jsonValue returns [::shared_ptr<cql3::term::raw> value]
:
| s=STRING_LITERAL { $value = cql3::constants::literal::string(sstring{$s.text}); }
| ':' id=ident { $value = new_bind_variables(id); }
| QMARK { $value = new_bind_variables(shared_ptr<cql3::column_identifier>{}); }
;
/**
* INSERT INTO <CF> (<column>, <column>, <column>, ...)
* VALUES (<value>, <value>, <value>, ...)
* USING TIMESTAMP <long>;
*
*/
insertStatement returns [::shared_ptr<raw::insert_statement> expr]
insertStatement returns [::shared_ptr<raw::modification_statement> expr]
@init {
auto attrs = ::make_shared<cql3::attributes::raw>();
std::vector<::shared_ptr<cql3::column_identifier::raw>> column_names;
std::vector<::shared_ptr<cql3::term::raw>> values;
bool if_not_exists = false;
::shared_ptr<cql3::term::raw> json_value;
}
: K_INSERT K_INTO cf=columnFamilyName
'(' c1=cident { column_names.push_back(c1); } ( ',' cn=cident { column_names.push_back(cn); } )* ')'
K_VALUES
'(' v1=term { values.push_back(v1); } ( ',' vn=term { values.push_back(vn); } )* ')'
( K_IF K_NOT K_EXISTS { if_not_exists = true; } )?
( usingClause[attrs] )?
{
$expr = ::make_shared<raw::insert_statement>(std::move(cf),
std::move(attrs),
std::move(column_names),
std::move(values),
if_not_exists);
}
('(' c1=cident { column_names.push_back(c1); } ( ',' cn=cident { column_names.push_back(cn); } )* ')'
K_VALUES
'(' v1=term { values.push_back(v1); } ( ',' vn=term { values.push_back(vn); } )* ')'
( K_IF K_NOT K_EXISTS { if_not_exists = true; } )?
( usingClause[attrs] )?
{
$expr = ::make_shared<raw::insert_statement>(std::move(cf),
std::move(attrs),
std::move(column_names),
std::move(values),
if_not_exists);
}
| K_JSON
json_token=jsonValue { json_value = $json_token.value; }
( K_IF K_NOT K_EXISTS { if_not_exists = true; } )?
( usingClause[attrs] )?
{
$expr = ::make_shared<raw::insert_json_statement>(std::move(cf),
std::move(attrs),
std::move(json_value),
if_not_exists);
}
)
;
usingClause[::shared_ptr<cql3::attributes::raw> attrs]
@@ -1650,6 +1671,7 @@ basic_unreserved_keyword returns [sstring str]
| K_LANGUAGE
| K_NON
| K_DETERMINISTIC
| K_JSON
) { $str = $k.text; }
;
@@ -1786,6 +1808,7 @@ K_NON: N O N;
K_OR: O R;
K_REPLACE: R E P L A C E;
K_DETERMINISTIC: D E T E R M I N I S T I C;
K_JSON: J S O N;
K_SCYLLA_TIMEUUID_LIST_INDEX: S C Y L L A '_' T I M E U U I D '_' L I S T '_' I N D E X;
K_SCYLLA_COUNTER_SHARD_LIST: S C Y L L A '_' C O U N T E R '_' S H A R D '_' L I S T;

View File

@@ -0,0 +1,187 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "cql3/prepared_statements_cache.hh"
namespace cql3 {
struct authorized_prepared_statements_cache_size {
size_t operator()(const statements::prepared_statement::checked_weak_ptr& val) {
// TODO: improve the size approximation - most of the entry is occupied by the key here.
return 100;
}
};
class authorized_prepared_statements_cache_key {
public:
using cache_key_type = std::pair<auth::authenticated_user, typename cql3::prepared_cache_key_type::cache_key_type>;
private:
cache_key_type _key;
public:
authorized_prepared_statements_cache_key(auth::authenticated_user user, cql3::prepared_cache_key_type prepared_cache_key)
: _key(std::move(user), std::move(prepared_cache_key.key())) {}
cache_key_type& key() { return _key; }
const cache_key_type& key() const { return _key; }
bool operator==(const authorized_prepared_statements_cache_key& other) const {
return _key == other._key;
}
bool operator!=(const authorized_prepared_statements_cache_key& other) const {
return !(*this == other);
}
static size_t hash(const auth::authenticated_user& user, const cql3::prepared_cache_key_type::cache_key_type& prep_cache_key) {
return utils::hash_combine(std::hash<auth::authenticated_user>()(user), utils::tuple_hash()(prep_cache_key));
}
};
/// \class authorized_prepared_statements_cache
/// \brief A cache of previously authorized statements.
///
/// Entries are inserted every time a new statement is authorized.
/// Entries are evicted in any of the following cases:
/// - When the corresponding prepared statement is not valid anymore.
/// - Periodically, with the same period as the permission cache is refreshed.
/// - If the corresponding entry hasn't been used for \ref entry_expiry.
class authorized_prepared_statements_cache {
public:
struct stats {
uint64_t authorized_prepared_statements_cache_evictions = 0;
};
static stats& shard_stats() {
static thread_local stats _stats;
return _stats;
}
struct authorized_prepared_statements_cache_stats_updater {
static void inc_hits() noexcept {}
static void inc_misses() noexcept {}
static void inc_blocks() noexcept {}
static void inc_evictions() noexcept {
++shard_stats().authorized_prepared_statements_cache_evictions;
}
};
private:
using cache_key_type = authorized_prepared_statements_cache_key;
using checked_weak_ptr = typename statements::prepared_statement::checked_weak_ptr;
using cache_type = utils::loading_cache<cache_key_type,
checked_weak_ptr,
utils::loading_cache_reload_enabled::yes,
authorized_prepared_statements_cache_size,
std::hash<cache_key_type>,
std::equal_to<cache_key_type>,
authorized_prepared_statements_cache_stats_updater>;
public:
using key_type = cache_key_type;
using value_type = checked_weak_ptr;
using entry_is_too_big = typename cache_type::entry_is_too_big;
using iterator = typename cache_type::iterator;
private:
cache_type _cache;
logging::logger& _logger;
public:
// Choose the memory budget such that would allow us ~4K entries when a shard gets 1GB of RAM
authorized_prepared_statements_cache(std::chrono::milliseconds entry_expiration, std::chrono::milliseconds entry_refresh, size_t cache_size, logging::logger& logger)
: _cache(cache_size, entry_expiration, entry_refresh, logger, [this] (const key_type& k) {
_cache.remove(k);
return make_ready_future<value_type>();
})
, _logger(logger)
{}
future<> insert(auth::authenticated_user user, cql3::prepared_cache_key_type prep_cache_key, value_type v) noexcept {
return _cache.get_ptr(key_type(std::move(user), std::move(prep_cache_key)), [v = std::move(v)] (const cache_key_type&) mutable {
return make_ready_future<value_type>(std::move(v));
}).discard_result();
}
iterator find(const auth::authenticated_user& user, const cql3::prepared_cache_key_type& prep_cache_key) {
struct key_view {
const auth::authenticated_user& user_ref;
const cql3::prepared_cache_key_type& prep_cache_key_ref;
};
struct hasher {
size_t operator()(const key_view& kv) {
return cql3::authorized_prepared_statements_cache_key::hash(kv.user_ref, kv.prep_cache_key_ref.key());
}
};
struct equal {
bool operator()(const key_type& k1, const key_view& k2) {
return k1.key().first == k2.user_ref && k1.key().second == k2.prep_cache_key_ref.key();
}
bool operator()(const key_view& k2, const key_type& k1) {
return operator()(k1, k2);
}
};
return _cache.find(key_view{user, prep_cache_key}, hasher(), equal());
}
iterator end() {
return _cache.end();
}
void remove(const auth::authenticated_user& user, const cql3::prepared_cache_key_type& prep_cache_key) {
iterator it = find(user, prep_cache_key);
_cache.remove(it);
}
size_t size() const {
return _cache.size();
}
size_t memory_footprint() const {
return _cache.memory_footprint();
}
future<> stop() {
return _cache.stop();
}
};
}
namespace std {
template <>
struct hash<cql3::authorized_prepared_statements_cache_key> final {
size_t operator()(const cql3::authorized_prepared_statements_cache_key& k) const {
return cql3::authorized_prepared_statements_cache_key::hash(k.key().first, k.key().second);
}
};
inline std::ostream& operator<<(std::ostream& out, const cql3::authorized_prepared_statements_cache_key& k) {
return out << "{ " << k.key().first << ", " << k.key().second << " }";
}
}

View File

@@ -22,6 +22,7 @@
#include "cql3/column_identifier.hh"
#include "exceptions/exceptions.hh"
#include "cql3/selection/simple_selector.hh"
#include "cql3/util.hh"
#include <regex>
@@ -62,14 +63,11 @@ sstring column_identifier::to_string() const {
}
sstring column_identifier::to_cql_string() const {
static const std::regex unquoted_identifier_re("[a-z][a-z0-9_]*");
if (std::regex_match(_text.begin(), _text.end(), unquoted_identifier_re)) {
return _text;
}
static const std::regex double_quote_re("\"");
std::string result = _text;
std::regex_replace(result, double_quote_re, "\"\"");
return '"' + result + '"';
return util::maybe_quote(_text);
}
sstring column_identifier::raw::to_cql_string() const {
return util::maybe_quote(_text);
}
column_identifier::raw::raw(sstring raw_text, bool keep_case)

View File

@@ -123,6 +123,7 @@ public:
bool operator!=(const raw& other) const;
virtual sstring to_string() const;
sstring to_cql_string() const;
friend std::hash<column_identifier::raw>;
friend std::ostream& operator<<(std::ostream& out, const column_identifier::raw& id);

View File

@@ -85,8 +85,8 @@ public:
virtual ::shared_ptr<terminal> bind(const query_options& options) override { return {}; }
virtual sstring to_string() const override { return "null"; }
};
static thread_local const ::shared_ptr<terminal> NULL_VALUE;
public:
static thread_local const ::shared_ptr<terminal> NULL_VALUE;
virtual ::shared_ptr<term> prepare(database& db, const sstring& keyspace, ::shared_ptr<column_specification> receiver) override {
if (!is_assignable(test_assignment(db, keyspace, receiver))) {
throw exceptions::invalid_request_exception("Invalid null value for counter increment/decrement");
@@ -203,10 +203,14 @@ public:
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override {
auto value = _t->bind_and_get(params._options);
execute(m, prefix, params, column, std::move(value));
}
static void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params, const column_definition& column, cql3::raw_value_view value) {
if (value.is_null()) {
m.set_cell(prefix, column, std::move(make_dead_cell(params)));
} else if (value.is_value()) {
m.set_cell(prefix, column, std::move(make_cell(*value, params)));
m.set_cell(prefix, column, std::move(make_cell(*column.type, *value, params)));
}
}
};

View File

@@ -395,18 +395,15 @@ operator<<(std::ostream& os, const cql3_type::raw& r) {
namespace util {
sstring maybe_quote(const sstring& s) {
static const std::regex unquoted("\\w*");
static const std::regex double_quote("\"");
if (std::regex_match(s.begin(), s.end(), unquoted)) {
return s;
sstring maybe_quote(const sstring& identifier) {
static const std::regex unquoted_identifier_re("[a-z][a-z0-9_]*");
if (std::regex_match(identifier.begin(), identifier.end(), unquoted_identifier_re)) {
return identifier;
}
std::ostringstream ss;
ss << "\"";
std::regex_replace(std::ostreambuf_iterator<char>(ss), s.begin(), s.end(), double_quote, "\"\"");
ss << "\"";
return ss.str();
static const std::regex double_quote_re("\"");
std::string result = identifier;
std::regex_replace(result, double_quote_re, "\"\"");
return '"' + result + '"';
}
}

View File

@@ -45,6 +45,7 @@
#include "service/query_state.hh"
#include "service/storage_proxy.hh"
#include "cql3/query_options.hh"
#include "timeout_config.hh"
namespace cql_transport {
@@ -62,10 +63,15 @@ class metadata;
shared_ptr<const metadata> make_empty_metadata();
class cql_statement {
timeout_config_selector _timeout_config_selector;
public:
explicit cql_statement(timeout_config_selector timeout_selector) : _timeout_config_selector(timeout_selector) {}
virtual ~cql_statement()
{ }
timeout_config_selector get_timeout_config_selector() const { return _timeout_config_selector; }
virtual uint32_t get_bound_terms() = 0;
/**
@@ -81,7 +87,7 @@ public:
*
* @param state the current client state
*/
virtual void validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) = 0;
virtual void validate(service::storage_proxy& proxy, const service::client_state& state) = 0;
/**
* Execute the statement and return the resulting result or null if there is no result.
@@ -90,15 +96,7 @@ public:
* @param options options for this query (consistency, variables, pageSize, ...)
*/
virtual future<::shared_ptr<cql_transport::messages::result_message>>
execute(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options) = 0;
/**
* Variant of execute used for internal query against the system tables, and thus only query the local node = 0.
*
* @param state the current query state
*/
virtual future<::shared_ptr<cql_transport::messages::result_message>>
execute_internal(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options) = 0;
execute(service::storage_proxy& proxy, service::query_state& state, const query_options& options) = 0;
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const = 0;
@@ -111,6 +109,7 @@ public:
class cql_statement_no_metadata : public cql_statement {
public:
using cql_statement::cql_statement;
virtual shared_ptr<const metadata> get_result_metadata() const override {
return make_empty_metadata();
}

View File

@@ -67,6 +67,12 @@ class error_collector : public error_listener<RecognizerType, ExceptionBaseType>
*/
const sstring_view _query;
/**
* An empty bitset to be used as a workaround for AntLR null dereference
* bug.
*/
static typename ExceptionBaseType::BitsetListType _empty_bit_list;
public:
/**
@@ -144,6 +150,14 @@ private:
break;
}
default:
// AntLR Exception class has a bug of dereferencing a null
// pointer in the displayRecognitionError. The following
// if statement makes sure it will not be null before the
// call to that function (displayRecognitionError).
// bug reference: https://github.com/antlr/antlr3/issues/191
if (!ex->get_expectingSet()) {
ex->set_expectingSet(&_empty_bit_list);
}
ex->displayRecognitionError(token_names, msg);
}
return msg.str();
@@ -345,4 +359,8 @@ private:
#endif
};
template<typename RecognizerType, typename TokenType, typename ExceptionBaseType>
typename ExceptionBaseType::BitsetListType
error_collector<RecognizerType,TokenType,ExceptionBaseType>::_empty_bit_list = typename ExceptionBaseType::BitsetListType();
}

View File

@@ -42,6 +42,7 @@
#pragma once
#include "types.hh"
#include "cql3/cql3_type.hh"
#include <vector>
#include <iosfwd>
#include <boost/functional/hash.hpp>
@@ -105,9 +106,9 @@ abstract_function::print(std::ostream& os) const {
if (i > 0) {
os << ", ";
}
os << _arg_types[i]->name(); // FIXME: asCQL3Type()
os << _arg_types[i]->as_cql3_type()->to_string();
}
os << ") -> " << _return_type->name(); // FIXME: asCQL3Type()
os << ") -> " << _return_type->as_cql3_type()->to_string();
}
}

View File

@@ -20,6 +20,7 @@
*/
#include "functions.hh"
#include "function_call.hh"
#include "token_fct.hh"
#include "cql3/maps.hh"
@@ -41,11 +42,22 @@ functions::init() {
declare(time_uuid_fcts::make_min_timeuuid_fct());
declare(time_uuid_fcts::make_max_timeuuid_fct());
declare(time_uuid_fcts::make_date_of_fct());
declare(time_uuid_fcts::make_unix_timestamp_of_fcf());
declare(time_uuid_fcts::make_unix_timestamp_of_fct());
declare(time_uuid_fcts::make_currenttimestamp_fct());
declare(time_uuid_fcts::make_currentdate_fct());
declare(time_uuid_fcts::make_currenttime_fct());
declare(time_uuid_fcts::make_currenttimeuuid_fct());
declare(time_uuid_fcts::make_timeuuidtodate_fct());
declare(time_uuid_fcts::make_timestamptodate_fct());
declare(time_uuid_fcts::make_timeuuidtotimestamp_fct());
declare(time_uuid_fcts::make_datetotimestamp_fct());
declare(time_uuid_fcts::make_timeuuidtounixtimestamp_fct());
declare(time_uuid_fcts::make_timestamptounixtimestamp_fct());
declare(time_uuid_fcts::make_datetounixtimestamp_fct());
declare(make_uuid_fct());
for (auto&& type : cql3_type::values()) {
// Note: because text and varchar ends up being synonimous, our automatic makeToBlobFunction doesn't work
// Note: because text and varchar ends up being synonymous, our automatic makeToBlobFunction doesn't work
// for varchar, so we special case it below. We also skip blob for obvious reasons.
if (type == cql3_type::varchar || type == cql3_type::blob) {
continue;
@@ -95,15 +107,22 @@ functions::init() {
declare(aggregate_fcts::make_max_function<sstring>());
declare(aggregate_fcts::make_min_function<sstring>());
declare(aggregate_fcts::make_count_function<simple_date_native_type>());
declare(aggregate_fcts::make_max_function<simple_date_native_type>());
declare(aggregate_fcts::make_min_function<simple_date_native_type>());
declare(aggregate_fcts::make_count_function<timestamp_native_type>());
declare(aggregate_fcts::make_max_function<timestamp_native_type>());
declare(aggregate_fcts::make_min_function<timestamp_native_type>());
declare(aggregate_fcts::make_count_function<timeuuid_native_type>());
declare(aggregate_fcts::make_max_function<timeuuid_native_type>());
declare(aggregate_fcts::make_min_function<timeuuid_native_type>());
declare(aggregate_fcts::make_count_function<utils::UUID>());
declare(aggregate_fcts::make_max_function<utils::UUID>());
declare(aggregate_fcts::make_min_function<utils::UUID>());
//FIXME:
//declare(aggregate_fcts::make_count_function<bytes>());
//declare(aggregate_fcts::make_max_function<bytes>());
@@ -153,23 +172,73 @@ functions::get_overload_count(const function_name& name) {
return _declared.count(name);
}
inline
shared_ptr<function>
make_to_json_function(data_type t) {
return make_native_scalar_function<true>("tojson", utf8_type, {t},
[t](cql_serialization_format sf, const std::vector<bytes_opt>& parameters) -> bytes_opt {
return utf8_type->decompose(t->to_json_string(parameters[0]));
});
}
inline
shared_ptr<function>
make_from_json_function(database& db, const sstring& keyspace, data_type t) {
return make_native_scalar_function<true>("fromjson", t, {utf8_type},
[&db, &keyspace, t](cql_serialization_format sf, const std::vector<bytes_opt>& parameters) -> bytes_opt {
Json::Value json_value = json::to_json_value(utf8_type->to_string(parameters[0].value()));
bytes_opt parsed_json_value;
if (!json_value.isNull()) {
parsed_json_value.emplace(t->from_json_object(json_value, sf));
}
return std::move(parsed_json_value);
});
}
shared_ptr<function>
functions::get(database& db,
const sstring& keyspace,
const function_name& name,
const std::vector<shared_ptr<assignment_testable>>& provided_args,
const sstring& receiver_ks,
const sstring& receiver_cf) {
const sstring& receiver_cf,
shared_ptr<column_specification> receiver) {
static const function_name TOKEN_FUNCTION_NAME = function_name::native_function("token");
static const function_name TO_JSON_FUNCTION_NAME = function_name::native_function("tojson");
static const function_name FROM_JSON_FUNCTION_NAME = function_name::native_function("fromjson");
if (name.has_keyspace()
? name == TOKEN_FUNCTION_NAME
: name.name == TOKEN_FUNCTION_NAME.name)
{
? name == TOKEN_FUNCTION_NAME
: name.name == TOKEN_FUNCTION_NAME.name) {
return ::make_shared<token_fct>(db.find_schema(receiver_ks, receiver_cf));
}
if (name.has_keyspace()
? name == TO_JSON_FUNCTION_NAME
: name.name == TO_JSON_FUNCTION_NAME.name) {
if (provided_args.size() != 1) {
throw exceptions::invalid_request_exception("toJson() accepts 1 argument only");
}
selection::selector *sp = dynamic_cast<selection::selector *>(provided_args[0].get());
if (!sp) {
throw exceptions::invalid_request_exception("toJson() is only valid in SELECT clause");
}
return make_to_json_function(sp->get_type());
}
if (name.has_keyspace()
? name == FROM_JSON_FUNCTION_NAME
: name.name == FROM_JSON_FUNCTION_NAME.name) {
if (provided_args.size() != 1) {
throw exceptions::invalid_request_exception("fromJson() accepts 1 argument only");
}
if (!receiver) {
throw exceptions::invalid_request_exception("fromJson() can only be called if receiver type is known");
}
return make_from_json_function(db, keyspace, receiver->type);
}
std::vector<shared_ptr<function>> candidates;
auto&& add_declared = [&] (function_name fn) {
auto&& fns = _declared.equal_range(fn);
@@ -414,7 +483,7 @@ function_call::raw::prepare(database& db, const sstring& keyspace, ::shared_ptr<
[] (auto&& x) -> shared_ptr<assignment_testable> {
return x;
});
auto&& fun = functions::functions::get(db, keyspace, _name, args, receiver->ks_name, receiver->cf_name);
auto&& fun = functions::functions::get(db, keyspace, _name, args, receiver->ks_name, receiver->cf_name, receiver);
if (!fun) {
throw exceptions::invalid_request_exception(sprint("Unknown function %s called", _name));
}
@@ -478,7 +547,7 @@ function_call::raw::test_assignment(database& db, const sstring& keyspace, share
// of another, existing, function. In that case, we return true here because we'll throw a proper exception
// later with a more helpful error message that if we were to return false here.
try {
auto&& fun = functions::get(db, keyspace, _name, _terms, receiver->ks_name, receiver->cf_name);
auto&& fun = functions::get(db, keyspace, _name, _terms, receiver->ks_name, receiver->cf_name, receiver);
if (fun && receiver->type->equals(fun->return_type())) {
return assignment_testable::test_result::EXACT_MATCH;
} else if (!fun || receiver->type->is_value_compatible_with(*fun->return_type())) {

View File

@@ -80,16 +80,18 @@ public:
const function_name& name,
const std::vector<shared_ptr<assignment_testable>>& provided_args,
const sstring& receiver_ks,
const sstring& receiver_cf);
const sstring& receiver_cf,
::shared_ptr<column_specification> receiver = nullptr);
template <typename AssignmentTestablePtrRange>
static shared_ptr<function> get(database& db,
const sstring& keyspace,
const function_name& name,
AssignmentTestablePtrRange&& provided_args,
const sstring& receiver_ks,
const sstring& receiver_cf) {
const sstring& receiver_cf,
::shared_ptr<column_specification> receiver = nullptr) {
const std::vector<shared_ptr<assignment_testable>> args(std::begin(provided_args), std::end(provided_args));
return get(db, keyspace, name, args, receiver_ks, receiver_cf);
return get(db, keyspace, name, args, receiver_ks, receiver_cf, receiver);
}
static std::vector<shared_ptr<function>> find(const function_name& name);
static shared_ptr<function> find(const function_name& name, const std::vector<data_type>& arg_types);

View File

@@ -117,7 +117,7 @@ make_date_of_fct() {
inline
shared_ptr<function>
make_unix_timestamp_of_fcf() {
make_unix_timestamp_of_fct() {
return make_native_scalar_function<true>("unixtimestampof", long_type, { timeuuid_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
@@ -129,6 +129,163 @@ make_unix_timestamp_of_fcf() {
});
}
inline shared_ptr<function>
make_currenttimestamp_fct() {
return make_native_scalar_function<true>("currenttimestamp", timestamp_type, {},
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
return {timestamp_type->decompose(timestamp_native_type{db_clock::now()})};
});
}
inline shared_ptr<function>
make_currenttime_fct() {
return make_native_scalar_function<true>("currenttime", time_type, {},
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
constexpr int64_t milliseconds_in_day = 3600 * 24 * 1000;
int64_t milliseconds_since_epoch = std::chrono::duration_cast<std::chrono::milliseconds>(db_clock::now().time_since_epoch()).count();
int64_t nanoseconds_today = (milliseconds_since_epoch % milliseconds_in_day) * 1000 * 1000;
return {time_type->decompose(time_native_type{nanoseconds_today})};
});
}
inline shared_ptr<function>
make_currentdate_fct() {
return make_native_scalar_function<true>("currentdate", simple_date_type, {},
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
auto to_simple_date = get_castas_fctn(simple_date_type, timestamp_type);
return {simple_date_type->decompose(to_simple_date(timestamp_native_type{db_clock::now()}))};
});
}
inline
shared_ptr<function>
make_currenttimeuuid_fct() {
return make_native_scalar_function<true>("currenttimeuuid", timeuuid_type, {},
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
return {timeuuid_type->decompose(timeuuid_native_type{utils::UUID_gen::get_time_UUID()})};
});
}
inline
shared_ptr<function>
make_timeuuidtodate_fct() {
return make_native_scalar_function<true>("todate", simple_date_type, { timeuuid_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
return {};
}
auto ts = db_clock::time_point(db_clock::duration(UUID_gen::unix_timestamp(UUID_gen::get_UUID(*bb))));
auto to_simple_date = get_castas_fctn(simple_date_type, timestamp_type);
return {simple_date_type->decompose(to_simple_date(ts))};
});
}
inline
shared_ptr<function>
make_timestamptodate_fct() {
return make_native_scalar_function<true>("todate", simple_date_type, { timestamp_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
return {};
}
auto ts_obj = timestamp_type->deserialize(*bb);
if (ts_obj.is_null()) {
return {};
}
auto to_simple_date = get_castas_fctn(simple_date_type, timestamp_type);
return {simple_date_type->decompose(to_simple_date(ts_obj))};
});
}
inline
shared_ptr<function>
make_timeuuidtotimestamp_fct() {
return make_native_scalar_function<true>("totimestamp", timestamp_type, { timeuuid_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
return {};
}
auto ts = db_clock::time_point(db_clock::duration(UUID_gen::unix_timestamp(UUID_gen::get_UUID(*bb))));
return {timestamp_type->decompose(ts)};
});
}
inline
shared_ptr<function>
make_datetotimestamp_fct() {
return make_native_scalar_function<true>("totimestamp", timestamp_type, { simple_date_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
return {};
}
auto simple_date_obj = simple_date_type->deserialize(*bb);
if (simple_date_obj.is_null()) {
return {};
}
auto from_simple_date = get_castas_fctn(timestamp_type, simple_date_type);
return {timestamp_type->decompose(from_simple_date(simple_date_obj))};
});
}
inline
shared_ptr<function>
make_timeuuidtounixtimestamp_fct() {
return make_native_scalar_function<true>("tounixtimestamp", long_type, { timeuuid_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
return {};
}
return {long_type->decompose(UUID_gen::unix_timestamp(UUID_gen::get_UUID(*bb)))};
});
}
inline
shared_ptr<function>
make_timestamptounixtimestamp_fct() {
return make_native_scalar_function<true>("tounixtimestamp", long_type, { timestamp_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
return {};
}
auto ts_obj = timestamp_type->deserialize(*bb);
if (ts_obj.is_null()) {
return {};
}
return {long_type->decompose(ts_obj)};
});
}
inline
shared_ptr<function>
make_datetounixtimestamp_fct() {
return make_native_scalar_function<true>("tounixtimestamp", long_type, { simple_date_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
return {};
}
auto simple_date_obj = simple_date_type->deserialize(*bb);
if (simple_date_obj.is_null()) {
return {};
}
auto from_simple_date = get_castas_fctn(timestamp_type, simple_date_type);
return {long_type->decompose(from_simple_date(simple_date_obj))};
});
}
}
}
}

View File

@@ -237,7 +237,12 @@ lists::precision_time::get_next(db_clock::time_point millis) {
void
lists::setter::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
const auto& value = _t->bind(params._options);
auto value = _t->bind(params._options);
execute(m, prefix, params, column, std::move(value));
}
void
lists::setter::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params, const column_definition& column, ::shared_ptr<terminal> value) {
if (value == constants::UNSET_VALUE) {
return;
}
@@ -299,7 +304,7 @@ lists::setter_by_index::execute(mutation& m, const clustering_key_prefix& prefix
if (!value) {
mut.cells.emplace_back(eidx, params.make_dead_cell());
} else {
mut.cells.emplace_back(eidx, params.make_cell(*value));
mut.cells.emplace_back(eidx, params.make_cell(*ltype->value_comparator(), *value, atomic_cell::collection_member::yes));
}
auto smut = ltype->serialize_mutation_form(mut);
m.set_cell(prefix, column, atomic_cell_or_collection::from_collection_mutation(std::move(smut)));
@@ -326,7 +331,7 @@ lists::setter_by_uuid::execute(mutation& m, const clustering_key_prefix& prefix,
list_type_impl::mutation mut;
mut.cells.reserve(1);
mut.cells.emplace_back(to_bytes(*index), params.make_cell(*value));
mut.cells.emplace_back(to_bytes(*index), params.make_cell(*ltype->value_comparator(), *value, atomic_cell::collection_member::yes));
auto smut = ltype->serialize_mutation_form(mut);
m.set_cell(prefix, column,
atomic_cell_or_collection::from_collection_mutation(
@@ -365,7 +370,7 @@ lists::do_append(shared_ptr<term> value,
auto uuid1 = utils::UUID_gen::get_time_UUID_bytes();
auto uuid = bytes(reinterpret_cast<const int8_t*>(uuid1.data()), uuid1.size());
// FIXME: can e be empty?
appended.cells.emplace_back(std::move(uuid), params.make_cell(*e));
appended.cells.emplace_back(std::move(uuid), params.make_cell(*ltype->value_comparator(), *e, atomic_cell::collection_member::yes));
}
m.set_cell(prefix, column, ltype->serialize_mutation_form(appended));
} else {
@@ -374,7 +379,7 @@ lists::do_append(shared_ptr<term> value,
m.set_cell(prefix, column, params.make_dead_cell());
} else {
auto newv = list_value->get_with_protocol_version(cql_serialization_format::internal());
m.set_cell(prefix, column, params.make_cell(std::move(newv)));
m.set_cell(prefix, column, params.make_cell(*column.type, std::move(newv)));
}
}
}
@@ -395,14 +400,14 @@ lists::prepender::execute(mutation& m, const clustering_key_prefix& prefix, cons
mut.cells.reserve(lvalue->get_elements().size());
// We reverse the order of insertion, so that the last element gets the lastest time
// (lists are sorted by time)
auto&& ltype = static_cast<const list_type_impl*>(column.type.get());
for (auto&& v : lvalue->_elements | boost::adaptors::reversed) {
auto&& pt = precision_time::get_next(time);
auto uuid = utils::UUID_gen::get_time_UUID_bytes(pt.millis.time_since_epoch().count(), pt.nanos);
mut.cells.emplace_back(bytes(uuid.data(), uuid.size()), params.make_cell(*v));
mut.cells.emplace_back(bytes(uuid.data(), uuid.size()), params.make_cell(*ltype->value_comparator(), *v, atomic_cell::collection_member::yes));
}
// now reverse again, to get the original order back
std::reverse(mut.cells.begin(), mut.cells.end());
auto&& ltype = static_cast<const list_type_impl*>(column.type.get());
m.set_cell(prefix, column, atomic_cell_or_collection::from_collection_mutation(ltype->serialize_mutation_form(std::move(mut))));
}

View File

@@ -147,6 +147,7 @@ public:
: operation(column, std::move(t)) {
}
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override;
static void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params, const column_definition& column, ::shared_ptr<terminal> value);
};
class setter_by_index : public operation {

View File

@@ -266,6 +266,11 @@ maps::marker::bind(const query_options& options) {
void
maps::setter::execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) {
auto value = _t->bind(params._options);
execute(m, row_key, params, column, std::move(value));
}
void
maps::setter::execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params, const column_definition& column, ::shared_ptr<terminal> value) {
if (value == constants::UNSET_VALUE) {
return;
}
@@ -295,10 +300,11 @@ maps::setter_by_key::execute(mutation& m, const clustering_key_prefix& prefix, c
if (!key) {
throw invalid_request_exception("Invalid null map key");
}
auto avalue = value ? params.make_cell(*value) : params.make_dead_cell();
map_type_impl::mutation update = { {}, { { std::move(to_bytes(*key)), std::move(avalue) } } };
// should have been verified as map earlier?
auto ctype = static_pointer_cast<const map_type_impl>(column.type);
auto avalue = value ? params.make_cell(*ctype->get_values_type(), *value, atomic_cell::collection_member::yes) : params.make_dead_cell();
map_type_impl::mutation update;
update.cells.emplace_back(std::move(to_bytes(*key)), std::move(avalue));
// should have been verified as map earlier?
auto col_mut = ctype->serialize_mutation_form(std::move(update));
m.set_cell(prefix, column, std::move(col_mut));
}
@@ -323,10 +329,10 @@ maps::do_put(mutation& m, const clustering_key_prefix& prefix, const update_para
return;
}
for (auto&& e : map_value->map) {
mut.cells.emplace_back(e.first, params.make_cell(e.second));
}
auto ctype = static_pointer_cast<const map_type_impl>(column.type);
for (auto&& e : map_value->map) {
mut.cells.emplace_back(e.first, params.make_cell(*ctype->get_values_type(), e.second, atomic_cell::collection_member::yes));
}
auto col_mut = ctype->serialize_mutation_form(std::move(mut));
m.set_cell(prefix, column, std::move(col_mut));
} else {
@@ -336,7 +342,7 @@ maps::do_put(mutation& m, const clustering_key_prefix& prefix, const update_para
} else {
auto v = map_type_impl::serialize_partially_deserialized_form({map_value->map.begin(), map_value->map.end()},
cql_serialization_format::internal());
m.set_cell(prefix, column, params.make_cell(std::move(v)));
m.set_cell(prefix, column, params.make_cell(*column.type, std::move(v)));
}
}
}

View File

@@ -117,6 +117,7 @@ public:
}
virtual void execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) override;
static void execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params, const column_definition& column, ::shared_ptr<terminal> value);
};
class setter_by_key : public operation {

View File

@@ -87,15 +87,15 @@ public:
virtual ~operation() {}
atomic_cell make_dead_cell(const update_parameters& params) const {
static atomic_cell make_dead_cell(const update_parameters& params) {
return params.make_dead_cell();
}
atomic_cell make_cell(bytes_view value, const update_parameters& params) const {
return params.make_cell(value);
static atomic_cell make_cell(const abstract_type& type, bytes_view value, const update_parameters& params) {
return params.make_cell(type, value);
}
atomic_cell make_counter_update_cell(int64_t delta, const update_parameters& params) const {
static atomic_cell make_counter_update_cell(int64_t delta, const update_parameters& params) {
return params.make_counter_update_cell(delta);
}

View File

@@ -68,6 +68,14 @@ public:
static thrift_prepared_id_type thrift_id(const prepared_cache_key_type& key) {
return key.key().second;
}
bool operator==(const prepared_cache_key_type& other) const {
return _key == other._key;
}
bool operator!=(const prepared_cache_key_type& other) const {
return !(*this == other);
}
};
class prepared_statements_cache {
@@ -102,9 +110,9 @@ private:
}
};
public:
static const std::chrono::minutes entry_expiry;
public:
using key_type = prepared_cache_key_type;
using value_type = checked_weak_ptr;
using statement_is_too_big = typename cache_type::entry_is_too_big;
@@ -116,8 +124,8 @@ private:
value_extractor_fn _value_extractor_fn;
public:
prepared_statements_cache(logging::logger& logger)
: _cache(memory::stats().total_memory() / 256, entry_expiry, logger)
prepared_statements_cache(logging::logger& logger, size_t size)
: _cache(size, entry_expiry, logger)
{}
template <typename LoadFunc>
@@ -155,6 +163,10 @@ public:
size_t memory_footprint() const {
return _cache.memory_footprint();
}
future<> stop() {
return _cache.stop();
}
};
}
@@ -168,4 +180,11 @@ inline std::ostream& operator<<(std::ostream& os, const cql3::prepared_cache_key
os << p.key();
return os;
}
template<>
struct hash<cql3::prepared_cache_key_type> final {
size_t operator()(const cql3::prepared_cache_key_type& k) const {
return utils::tuple_hash()(k.key());
}
};
}

View File

@@ -46,10 +46,11 @@ namespace cql3 {
thread_local const query_options::specific_options query_options::specific_options::DEFAULT{-1, {}, {}, api::missing_timestamp};
thread_local query_options query_options::DEFAULT{db::consistency_level::ONE, std::experimental::nullopt,
thread_local query_options query_options::DEFAULT{db::consistency_level::ONE, infinite_timeout_config, std::experimental::nullopt,
std::vector<cql3::raw_value_view>(), false, query_options::specific_options::DEFAULT, cql_serialization_format::latest()};
query_options::query_options(db::consistency_level consistency,
const ::timeout_config& timeout_config,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value> values,
std::vector<cql3::raw_value_view> value_views,
@@ -57,6 +58,7 @@ query_options::query_options(db::consistency_level consistency,
specific_options options,
cql_serialization_format sf)
: _consistency(consistency)
, _timeout_config(timeout_config)
, _names(std::move(names))
, _values(std::move(values))
, _value_views(value_views)
@@ -67,12 +69,14 @@ query_options::query_options(db::consistency_level consistency,
}
query_options::query_options(db::consistency_level consistency,
const ::timeout_config& timeout_config,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value> values,
bool skip_metadata,
specific_options options,
cql_serialization_format sf)
: _consistency(consistency)
, _timeout_config(timeout_config)
, _names(std::move(names))
, _values(std::move(values))
, _value_views()
@@ -84,12 +88,14 @@ query_options::query_options(db::consistency_level consistency,
}
query_options::query_options(db::consistency_level consistency,
const ::timeout_config& timeout_config,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value_view> value_views,
bool skip_metadata,
specific_options options,
cql_serialization_format sf)
: _consistency(consistency)
, _timeout_config(timeout_config)
, _names(std::move(names))
, _values()
, _value_views(std::move(value_views))
@@ -99,9 +105,10 @@ query_options::query_options(db::consistency_level consistency,
{
}
query_options::query_options(db::consistency_level cl, std::vector<cql3::raw_value> values, specific_options options)
query_options::query_options(db::consistency_level cl, const ::timeout_config& timeout_config, std::vector<cql3::raw_value> values, specific_options options)
: query_options(
cl,
timeout_config,
{},
std::move(values),
false,
@@ -113,6 +120,7 @@ query_options::query_options(db::consistency_level cl, std::vector<cql3::raw_val
query_options::query_options(std::unique_ptr<query_options> qo, ::shared_ptr<service::pager::paging_state> paging_state)
: query_options(qo->_consistency,
qo->get_timeout_config(),
std::move(qo->_names),
std::move(qo->_values),
std::move(qo->_value_views),
@@ -124,7 +132,7 @@ query_options::query_options(std::unique_ptr<query_options> qo, ::shared_ptr<ser
query_options::query_options(std::vector<cql3::raw_value> values)
: query_options(
db::consistency_level::ONE, std::move(values))
db::consistency_level::ONE, infinite_timeout_config, std::move(values))
{}
db::consistency_level query_options::get_consistency() const
@@ -209,19 +217,18 @@ void query_options::prepare(const std::vector<::shared_ptr<column_specification>
}
auto& names = *_names;
std::vector<cql3::raw_value> ordered_values;
std::vector<cql3::raw_value_view> ordered_values;
ordered_values.reserve(specs.size());
for (auto&& spec : specs) {
auto& spec_name = spec->name->text();
for (size_t j = 0; j < names.size(); j++) {
if (names[j] == spec_name) {
ordered_values.emplace_back(_values[j]);
ordered_values.emplace_back(_value_views[j]);
break;
}
}
}
_values = std::move(ordered_values);
fill_value_views();
_value_views = std::move(ordered_values);
}
void query_options::fill_value_views()

View File

@@ -44,13 +44,14 @@
#include <seastar/util/gcc6-concepts.hh>
#include "timestamp.hh"
#include "bytes.hh"
#include "db/consistency_level.hh"
#include "db/consistency_level_type.hh"
#include "service/query_state.hh"
#include "service/pager/paging_state.hh"
#include "cql3/column_specification.hh"
#include "cql3/column_identifier.hh"
#include "cql3/values.hh"
#include "cql_serialization_format.hh"
#include "timeout_config.hh"
namespace cql3 {
@@ -70,6 +71,7 @@ public:
};
private:
const db::consistency_level _consistency;
const timeout_config& _timeout_config;
const std::experimental::optional<std::vector<sstring_view>> _names;
std::vector<cql3::raw_value> _values;
std::vector<cql3::raw_value_view> _value_views;
@@ -103,12 +105,14 @@ public:
query_options(const query_options&) = delete;
explicit query_options(db::consistency_level consistency,
const timeout_config& timeouts,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value> values,
bool skip_metadata,
specific_options options,
cql_serialization_format sf);
explicit query_options(db::consistency_level consistency,
const timeout_config& timeouts,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value> values,
std::vector<cql3::raw_value_view> value_views,
@@ -116,6 +120,7 @@ public:
specific_options options,
cql_serialization_format sf);
explicit query_options(db::consistency_level consistency,
const timeout_config& timeouts,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value_view> value_views,
bool skip_metadata,
@@ -147,10 +152,12 @@ public:
// forInternalUse
explicit query_options(std::vector<cql3::raw_value> values);
explicit query_options(db::consistency_level, std::vector<cql3::raw_value> values, specific_options options = specific_options::DEFAULT);
explicit query_options(db::consistency_level, const timeout_config& timeouts,
std::vector<cql3::raw_value> values, specific_options options = specific_options::DEFAULT);
explicit query_options(std::unique_ptr<query_options>, ::shared_ptr<service::pager::paging_state> paging_state);
db::consistency_level get_consistency() const;
const timeout_config& get_timeout_config() const { return _timeout_config; }
cql3::raw_value_view get_value_at(size_t idx) const;
cql3::raw_value_view make_temporary(cql3::raw_value value) const;
size_t get_values_count() const;
@@ -161,6 +168,11 @@ public:
::shared_ptr<service::pager::paging_state> get_paging_state() const;
/** Serial consistency for conditional updates. */
std::experimental::optional<db::consistency_level> get_serial_consistency() const;
const std::experimental::optional<std::vector<sstring_view>>& get_names() const noexcept {
return _names;
}
api::timestamp_type get_timestamp(service::query_state& state) const;
/**
* The protocol version for the query. Will be 3 if the object don't come from
@@ -188,7 +200,7 @@ query_options::query_options(query_options&& o, std::vector<OneMutationDataRange
std::vector<query_options> tmp;
tmp.reserve(values_ranges.size());
std::transform(values_ranges.begin(), values_ranges.end(), std::back_inserter(tmp), [this](auto& values_range) {
return query_options(_consistency, {}, std::move(values_range), _skip_metadata, _options, _cql_serialization_format);
return query_options(_consistency, _timeout_config, {}, std::move(values_range), _skip_metadata, _options, _cql_serialization_format);
});
_batch_options = std::move(tmp);
}

View File

@@ -58,6 +58,7 @@ using namespace cql_transport::messages;
logging::logger log("query_processor");
logging::logger prep_cache_log("prepared_statements_cache");
logging::logger authorized_prepared_statements_cache_log("authorized_prepared_statements_cache");
distributed<query_processor> _the_query_processor;
@@ -91,12 +92,16 @@ api::timestamp_type query_processor::next_timestamp() {
return _internal_state->next_timestamp();
}
query_processor::query_processor(distributed<service::storage_proxy>& proxy, distributed<database>& db)
query_processor::query_processor(service::storage_proxy& proxy, distributed<database>& db, query_processor::memory_config mcfg)
: _migration_subscriber{std::make_unique<migration_subscriber>(this)}
, _proxy(proxy)
, _db(db)
, _internal_state(new internal_state())
, _prepared_cache(prep_cache_log) {
, _prepared_cache(prep_cache_log, mcfg.prepared_statment_cache_size)
, _authorized_prepared_cache(std::min(std::chrono::milliseconds(_db.local().get_config().permissions_validity_in_ms()),
std::chrono::duration_cast<std::chrono::milliseconds>(prepared_statements_cache::entry_expiry)),
std::chrono::milliseconds(_db.local().get_config().permissions_update_interval_in_ms()),
mcfg.authorized_prepared_cache_size, authorized_prepared_statements_cache_log) {
namespace sm = seastar::metrics;
_metrics.add_group(
@@ -159,6 +164,11 @@ query_processor::query_processor(distributed<service::storage_proxy>& proxy, dis
sm::description("Counts a total number of LOGGED batches that were executed as UNLOGGED "
"batches.")),
sm::make_derive(
"rows_read",
_cql_stats.rows_read,
sm::description("Counts a total number of rows read during CQL requests.")),
sm::make_derive(
"prepared_cache_evictions",
[] { return prepared_statements_cache::shard_stats().prepared_cache_evictions; },
@@ -172,7 +182,46 @@ query_processor::query_processor(distributed<service::storage_proxy>& proxy, dis
sm::make_gauge(
"prepared_cache_memory_footprint",
[this] { return _prepared_cache.memory_footprint(); },
sm::description("Size (in bytes) of the prepared statements cache."))});
sm::description("Size (in bytes) of the prepared statements cache.")),
sm::make_derive(
"secondary_index_creates",
_cql_stats.secondary_index_creates,
sm::description("Counts a total number of CQL CREATE INDEX requests.")),
sm::make_derive(
"secondary_index_drops",
_cql_stats.secondary_index_drops,
sm::description("Counts a total number of CQL DROP INDEX requests.")),
// secondary_index_reads total count is also included in all cql reads
sm::make_derive(
"secondary_index_reads",
_cql_stats.secondary_index_reads,
sm::description("Counts a total number of CQL read requests performed using secondary indexes.")),
// secondary_index_rows_read total count is also included in all cql rows read
sm::make_derive(
"secondary_index_rows_read",
_cql_stats.secondary_index_rows_read,
sm::description("Counts a total number of rows read during CQL requests performed using secondary indexes.")),
sm::make_derive(
"authorized_prepared_statements_cache_evictions",
[] { return authorized_prepared_statements_cache::shard_stats().authorized_prepared_statements_cache_evictions; },
sm::description("Counts a number of authenticated prepared statements cache entries evictions.")),
sm::make_gauge(
"authorized_prepared_statements_cache_size",
[this] { return _authorized_prepared_cache.size(); },
sm::description("A number of entries in the authenticated prepared statements cache.")),
sm::make_gauge(
"user_prepared_auth_cache_footprint",
[this] { return _authorized_prepared_cache.memory_footprint(); },
sm::description("Size (in bytes) of the authenticated prepared statements cache."))
});
service::get_local_migration_manager().register_listener(_migration_subscriber.get());
}
@@ -182,7 +231,7 @@ query_processor::~query_processor() {
future<> query_processor::stop() {
service::get_local_migration_manager().unregister_listener(_migration_subscriber.get());
return make_ready_future<>();
return _authorized_prepared_cache.stop().finally([this] { return _prepared_cache.stop(); });
}
future<::shared_ptr<result_message>>
@@ -190,11 +239,11 @@ query_processor::process(const sstring_view& query_string, service::query_state&
log.trace("process: \"{}\"", query_string);
tracing::trace(query_state.get_trace_state(), "Parsing a statement");
auto p = get_statement(query_string, query_state.get_client_state());
options.prepare(p->bound_names);
auto cql_statement = p->statement;
if (cql_statement->get_bound_terms() != options.get_values_count()) {
throw exceptions::invalid_request_exception("Invalid amount of bind variables");
}
options.prepare(p->bound_names);
warn(unimplemented::cause::METRICS);
#if 0
@@ -202,33 +251,55 @@ query_processor::process(const sstring_view& query_string, service::query_state&
metrics.regularStatementsExecuted.inc();
#endif
tracing::trace(query_state.get_trace_state(), "Processing a statement");
return process_statement(std::move(cql_statement), query_state, options);
return process_statement_unprepared(std::move(cql_statement), query_state, options);
}
future<::shared_ptr<result_message>>
query_processor::process_statement(
query_processor::process_statement_unprepared(
::shared_ptr<cql_statement> statement,
service::query_state& query_state,
const query_options& options) {
return statement->check_access(query_state.get_client_state()).then([this, statement, &query_state, &options]() {
auto& client_state = query_state.get_client_state();
return statement->check_access(query_state.get_client_state()).then([this, statement, &query_state, &options] () mutable {
return process_authorized_statement(std::move(statement), query_state, options);
});
}
statement->validate(_proxy, client_state);
future<::shared_ptr<result_message>>
query_processor::process_statement_prepared(
statements::prepared_statement::checked_weak_ptr prepared,
cql3::prepared_cache_key_type cache_key,
service::query_state& query_state,
const query_options& options,
bool needs_authorization) {
auto fut = make_ready_future<::shared_ptr<cql_transport::messages::result_message>>();
if (client_state.is_internal()) {
fut = statement->execute_internal(_proxy, query_state, options);
} else {
fut = statement->execute(_proxy, query_state, options);
}
return fut.then([statement] (auto msg) {
if (msg) {
return make_ready_future<::shared_ptr<result_message>>(std::move(msg));
}
return make_ready_future<::shared_ptr<result_message>>(
::make_shared<result_message::void_message>());
::shared_ptr<cql_statement> statement = prepared->statement;
future<> fut = make_ready_future<>();
if (needs_authorization) {
fut = statement->check_access(query_state.get_client_state()).then([this, &query_state, prepared = std::move(prepared), cache_key = std::move(cache_key)] () mutable {
return _authorized_prepared_cache.insert(*query_state.get_client_state().user(), std::move(cache_key), std::move(prepared)).handle_exception([this] (auto eptr) {
log.error("failed to cache the entry", eptr);
});
});
}
return fut.then([this, statement = std::move(statement), &query_state, &options] () mutable {
return process_authorized_statement(std::move(statement), query_state, options);
});
}
future<::shared_ptr<result_message>>
query_processor::process_authorized_statement(const ::shared_ptr<cql_statement> statement, service::query_state& query_state, const query_options& options) {
auto& client_state = query_state.get_client_state();
statement->validate(_proxy, client_state);
auto fut = statement->execute(_proxy, query_state, options);
return fut.then([statement] (auto msg) {
if (msg) {
return make_ready_future<::shared_ptr<result_message>>(std::move(msg));
}
return make_ready_future<::shared_ptr<result_message>>(::make_shared<result_message::void_message>());
});
}
@@ -340,6 +411,7 @@ query_options query_processor::make_internal_options(
const statements::prepared_statement::checked_weak_ptr& p,
const std::initializer_list<data_value>& values,
db::consistency_level cl,
const timeout_config& timeout_config,
int32_t page_size) {
if (p->bound_names.size() != values.size()) {
throw std::invalid_argument(
@@ -363,10 +435,11 @@ query_options query_processor::make_internal_options(
api::timestamp_type ts = api::missing_timestamp;
return query_options(
cl,
timeout_config,
bound_values,
cql3::query_options::specific_options{page_size, std::move(paging_state), serial_consistency, ts});
}
return query_options(cl, bound_values);
return query_options(cl, timeout_config, bound_values);
}
statements::prepared_statement::checked_weak_ptr query_processor::prepare_internal(const sstring& query_string) {
@@ -397,7 +470,7 @@ struct internal_query_state {
::shared_ptr<internal_query_state> query_processor::create_paged_state(const sstring& query_string,
const std::initializer_list<data_value>& values, int32_t page_size) {
auto p = prepare_internal(query_string);
auto opts = make_internal_options(p, values, db::consistency_level::ONE, page_size);
auto opts = make_internal_options(p, values, db::consistency_level::ONE, infinite_timeout_config, page_size);
::shared_ptr<internal_query_state> res = ::make_shared<internal_query_state>(
internal_query_state{
query_string,
@@ -446,7 +519,7 @@ future<> query_processor::for_each_cql_result(
future<::shared_ptr<untyped_result_set>>
query_processor::execute_paged_internal(::shared_ptr<internal_query_state> state) {
return state->p->statement->execute_internal(_proxy, *_internal_state, *state->opts).then(
return state->p->statement->execute(_proxy, *_internal_state, *state->opts).then(
[state, this](::shared_ptr<cql_transport::messages::result_message> msg) mutable {
class visitor : public result_message::visitor_base {
::shared_ptr<internal_query_state> _state;
@@ -485,9 +558,9 @@ future<::shared_ptr<untyped_result_set>>
query_processor::execute_internal(
statements::prepared_statement::checked_weak_ptr p,
const std::initializer_list<data_value>& values) {
query_options opts = make_internal_options(p, values);
query_options opts = make_internal_options(p, values, db::consistency_level::ONE, infinite_timeout_config);
return do_with(std::move(opts), [this, p = std::move(p)](auto& opts) {
return p->statement->execute_internal(
return p->statement->execute(
_proxy,
*_internal_state,
opts).then([&opts, stmt = p->statement](auto msg) {
@@ -500,15 +573,16 @@ future<::shared_ptr<untyped_result_set>>
query_processor::process(
const sstring& query_string,
db::consistency_level cl,
const timeout_config& timeout_config,
const std::initializer_list<data_value>& values,
bool cache) {
if (cache) {
return process(prepare_internal(query_string), cl, values);
return process(prepare_internal(query_string), cl, timeout_config, values);
} else {
auto p = parse_statement(query_string)->prepare(_db.local(), _cql_stats);
p->statement->validate(_proxy, *_internal_state);
auto checked_weak_ptr = p->checked_weak_from_this();
return process(std::move(checked_weak_ptr), cl, values).finally([p = std::move(p)] {});
return process(std::move(checked_weak_ptr), cl, timeout_config, values).finally([p = std::move(p)] {});
}
}
@@ -516,8 +590,9 @@ future<::shared_ptr<untyped_result_set>>
query_processor::process(
statements::prepared_statement::checked_weak_ptr p,
db::consistency_level cl,
const timeout_config& timeout_config,
const std::initializer_list<data_value>& values) {
auto opts = make_internal_options(p, values, cl);
auto opts = make_internal_options(p, values, cl, timeout_config);
return do_with(std::move(opts), [this, p = std::move(p)](auto & opts) {
return p->statement->execute(_proxy, *_internal_state, opts).then([](auto msg) {
return make_ready_future<::shared_ptr<untyped_result_set>>(::make_shared<untyped_result_set>(msg));
@@ -529,11 +604,18 @@ future<::shared_ptr<cql_transport::messages::result_message>>
query_processor::process_batch(
::shared_ptr<statements::batch_statement> batch,
service::query_state& query_state,
query_options& options) {
return batch->check_access(query_state.get_client_state()).then([this, &query_state, &options, batch] {
batch->validate();
batch->validate(_proxy, query_state.get_client_state());
return batch->execute(_proxy, query_state, options);
query_options& options,
std::unordered_map<prepared_cache_key_type, authorized_prepared_statements_cache::value_type> pending_authorization_entries) {
return batch->check_access(query_state.get_client_state()).then([this, &query_state, &options, batch, pending_authorization_entries = std::move(pending_authorization_entries)] () mutable {
return parallel_for_each(pending_authorization_entries, [this, &query_state] (auto& e) {
return _authorized_prepared_cache.insert(*query_state.get_client_state().user(), e.first, std::move(e.second)).handle_exception([this] (auto eptr) {
log.error("failed to cache the entry", eptr);
});
}).then([this, &query_state, &options, batch] {
batch->validate();
batch->validate(_proxy, query_state.get_client_state());
return batch->execute(_proxy, query_state, options);
});
});
}

View File

@@ -49,6 +49,7 @@
#include <seastar/core/shared_ptr.hh>
#include "cql3/prepared_statements_cache.hh"
#include "cql3/authorized_prepared_statements_cache.hh"
#include "cql3/query_options.hh"
#include "cql3/statements/prepared_statement.hh"
#include "cql3/statements/raw/parsed_statement.hh"
@@ -99,10 +100,14 @@ public:
class query_processor {
public:
class migration_subscriber;
struct memory_config {
size_t prepared_statment_cache_size = 0;
size_t authorized_prepared_cache_size = 0;
};
private:
std::unique_ptr<migration_subscriber> _migration_subscriber;
distributed<service::storage_proxy>& _proxy;
service::storage_proxy& _proxy;
distributed<database>& _db;
struct stats {
@@ -117,6 +122,7 @@ private:
std::unique_ptr<internal_state> _internal_state;
prepared_statements_cache _prepared_cache;
authorized_prepared_statements_cache _authorized_prepared_cache;
// A map for prepared statements used internally (which we don't want to mix with user statement, in particular we
// don't bother with expiration on those.
@@ -135,7 +141,7 @@ public:
static ::shared_ptr<statements::raw::parsed_statement> parse_statement(const std::experimental::string_view& query);
query_processor(distributed<service::storage_proxy>& proxy, distributed<database>& db);
query_processor(service::storage_proxy& proxy, distributed<database>& db, memory_config mcfg);
~query_processor();
@@ -143,7 +149,7 @@ public:
return _db;
}
distributed<service::storage_proxy>& proxy() {
service::storage_proxy& proxy() {
return _proxy;
}
@@ -151,6 +157,21 @@ public:
return _cql_stats;
}
statements::prepared_statement::checked_weak_ptr get_prepared(const auth::authenticated_user* user_ptr, const prepared_cache_key_type& key) {
if (user_ptr) {
auto it = _authorized_prepared_cache.find(*user_ptr, key);
if (it != _authorized_prepared_cache.end()) {
try {
return it->get()->checked_weak_from_this();
} catch (seastar::checked_ptr_is_null_exception&) {
// If the prepared statement got invalidated - remove the corresponding authorized_prepared_statements_cache entry as well.
_authorized_prepared_cache.remove(*user_ptr, key);
}
}
}
return statements::prepared_statement::checked_weak_ptr();
}
statements::prepared_statement::checked_weak_ptr get_prepared(const prepared_cache_key_type& key) {
auto it = _prepared_cache.find(key);
if (it == _prepared_cache.end()) {
@@ -160,11 +181,19 @@ public:
}
future<::shared_ptr<cql_transport::messages::result_message>>
process_statement(
process_statement_unprepared(
::shared_ptr<cql_statement> statement,
service::query_state& query_state,
const query_options& options);
future<::shared_ptr<cql_transport::messages::result_message>>
process_statement_prepared(
statements::prepared_statement::checked_weak_ptr statement,
cql3::prepared_cache_key_type cache_key,
service::query_state& query_state,
const query_options& options,
bool needs_authorization);
future<::shared_ptr<cql_transport::messages::result_message>>
process(
const std::experimental::string_view& query_string,
@@ -215,12 +244,14 @@ public:
future<::shared_ptr<untyped_result_set>> process(
const sstring& query_string,
db::consistency_level,
const timeout_config& timeout_config,
const std::initializer_list<data_value>& = { },
bool cache = false);
future<::shared_ptr<untyped_result_set>> process(
statements::prepared_statement::checked_weak_ptr p,
db::consistency_level,
const timeout_config& timeout_config,
const std::initializer_list<data_value>& = { });
/*
@@ -242,7 +273,11 @@ public:
future<> stop();
future<::shared_ptr<cql_transport::messages::result_message>>
process_batch(::shared_ptr<statements::batch_statement>, service::query_state& query_state, query_options& options);
process_batch(
::shared_ptr<statements::batch_statement>,
service::query_state& query_state,
query_options& options,
std::unordered_map<prepared_cache_key_type, authorized_prepared_statements_cache::value_type> pending_authorization_entries);
std::unique_ptr<statements::prepared_statement> get_statement(
const std::experimental::string_view& query,
@@ -254,9 +289,13 @@ private:
query_options make_internal_options(
const statements::prepared_statement::checked_weak_ptr& p,
const std::initializer_list<data_value>&,
db::consistency_level = db::consistency_level::ONE,
db::consistency_level,
const timeout_config& timeout_config,
int32_t page_size = -1);
future<::shared_ptr<cql_transport::messages::result_message>>
process_authorized_statement(const ::shared_ptr<cql_statement> statement, service::query_state& query_state, const query_options& options);
/*!
* \brief created a state object for paging
*

View File

@@ -64,13 +64,15 @@ class single_column_primary_key_restrictions : public primary_key_restrictions<V
using bounds_range_type = typename primary_key_restrictions<ValueType>::bounds_range_type;
private:
schema_ptr _schema;
bool _allow_filtering;
::shared_ptr<single_column_restrictions> _restrictions;
bool _slice;
bool _contains;
bool _in;
public:
single_column_primary_key_restrictions(schema_ptr schema)
single_column_primary_key_restrictions(schema_ptr schema, bool allow_filtering)
: _schema(schema)
, _allow_filtering(allow_filtering)
, _restrictions(::make_shared<single_column_restrictions>(schema))
, _slice(false)
, _contains(false)
@@ -110,7 +112,7 @@ public:
}
void do_merge_with(::shared_ptr<single_column_restriction> restriction) {
if (!_restrictions->empty()) {
if (!_restrictions->empty() && !_allow_filtering) {
auto last_column = *_restrictions->last_column();
auto new_column = restriction->get_column_def();
@@ -127,11 +129,6 @@ public:
last_column.name_as_text(), new_column.name_as_text()));
}
}
if (_in && _schema->position(new_column) > _schema->position(last_column)) {
throw exceptions::invalid_request_exception(sprint("Clustering column \"%s\" cannot be restricted by an IN relation",
new_column.name_as_text()));
}
}
_slice |= restriction->is_slice();

View File

@@ -113,7 +113,7 @@ public:
class contains;
protected:
bytes_view_opt get_value(const schema& schema,
std::optional<atomic_cell_value_view> get_value(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
@@ -202,6 +202,14 @@ public:
const query_options& options,
gc_clock::time_point now) const override;
virtual std::vector<bytes_opt> values_raw(const query_options& options) const = 0;
virtual std::vector<bytes_opt> values(const query_options& options) const override {
std::vector<bytes_opt> ret = values_raw(options);
std::sort(ret.begin(),ret.end());
ret.erase(std::unique(ret.begin(),ret.end()),ret.end());
return ret;
}
#if 0
@Override
protected final boolean isSupportedBy(SecondaryIndex index)
@@ -224,7 +232,7 @@ public:
return abstract_restriction::term_uses_function(_values, ks_name, function_name);
}
virtual std::vector<bytes_opt> values(const query_options& options) const override {
virtual std::vector<bytes_opt> values_raw(const query_options& options) const override {
std::vector<bytes_opt> ret;
for (auto&& v : _values) {
ret.emplace_back(to_bytes_opt(v->bind_and_get(options)));
@@ -249,7 +257,7 @@ public:
return false;
}
virtual std::vector<bytes_opt> values(const query_options& options) const override {
virtual std::vector<bytes_opt> values_raw(const query_options& options) const override {
auto&& lval = dynamic_pointer_cast<multi_item_terminal>(_marker->bind(options));
if (!lval) {
throw exceptions::invalid_request_exception("Invalid null value for IN restriction");

View File

@@ -41,14 +41,17 @@ using boost::adaptors::transformed;
template<typename T>
class statement_restrictions::initial_key_restrictions : public primary_key_restrictions<T> {
bool _allow_filtering;
public:
initial_key_restrictions(bool allow_filtering)
: _allow_filtering(allow_filtering) {}
using bounds_range_type = typename primary_key_restrictions<T>::bounds_range_type;
::shared_ptr<primary_key_restrictions<T>> do_merge_to(schema_ptr schema, ::shared_ptr<restriction> restriction) const {
if (restriction->is_multi_column()) {
throw std::runtime_error(sprint("%s not implemented", __PRETTY_FUNCTION__));
}
return ::make_shared<single_column_primary_key_restrictions<T>>(schema)->merge_to(schema, restriction);
return ::make_shared<single_column_primary_key_restrictions<T>>(schema, _allow_filtering)->merge_to(schema, restriction);
}
::shared_ptr<primary_key_restrictions<T>> merge_to(schema_ptr schema, ::shared_ptr<restriction> restriction) override {
if (restriction->is_multi_column()) {
@@ -57,7 +60,7 @@ public:
if (restriction->is_on_token()) {
return static_pointer_cast<token_restriction>(restriction);
}
return ::make_shared<single_column_primary_key_restrictions<T>>(schema)->merge_to(restriction);
return ::make_shared<single_column_primary_key_restrictions<T>>(schema, _allow_filtering)->merge_to(restriction);
}
void merge_with(::shared_ptr<restriction> restriction) override {
throw exceptions::unsupported_operation_exception();
@@ -122,9 +125,10 @@ statement_restrictions::initial_key_restrictions<clustering_key_prefix>::merge_t
}
template<typename T>
::shared_ptr<primary_key_restrictions<T>> statement_restrictions::get_initial_key_restrictions() {
static thread_local ::shared_ptr<primary_key_restrictions<T>> initial_kr = ::make_shared<initial_key_restrictions<T>>();
return initial_kr;
::shared_ptr<primary_key_restrictions<T>> statement_restrictions::get_initial_key_restrictions(bool allow_filtering) {
static thread_local ::shared_ptr<primary_key_restrictions<T>> initial_kr_true = ::make_shared<initial_key_restrictions<T>>(true);
static thread_local ::shared_ptr<primary_key_restrictions<T>> initial_kr_false = ::make_shared<initial_key_restrictions<T>>(false);
return allow_filtering ? initial_kr_true : initial_kr_false;
}
std::vector<::shared_ptr<column_identifier>>
@@ -141,10 +145,10 @@ statement_restrictions::get_partition_key_unrestricted_components() const {
return r;
}
statement_restrictions::statement_restrictions(schema_ptr schema)
statement_restrictions::statement_restrictions(schema_ptr schema, bool allow_filtering)
: _schema(schema)
, _partition_key_restrictions(get_initial_key_restrictions<partition_key>())
, _clustering_columns_restrictions(get_initial_key_restrictions<clustering_key_prefix>())
, _partition_key_restrictions(get_initial_key_restrictions<partition_key>(allow_filtering))
, _clustering_columns_restrictions(get_initial_key_restrictions<clustering_key_prefix>(allow_filtering))
, _nonprimary_key_restrictions(::make_shared<single_column_restrictions>(schema))
{ }
#if 0
@@ -162,8 +166,9 @@ statement_restrictions::statement_restrictions(database& db,
::shared_ptr<variable_specifications> bound_names,
bool selects_only_static_columns,
bool select_a_collection,
bool for_view)
: statement_restrictions(schema)
bool for_view,
bool allow_filtering)
: statement_restrictions(schema, allow_filtering)
{
/*
* WHERE clause. For a given entity, rules are: - EQ relation conflicts with anything else (including a 2nd EQ)
@@ -327,6 +332,17 @@ void statement_restrictions::process_partition_key_restrictions(bool has_queriab
_is_key_range = true;
_uses_secondary_indexing = has_queriable_index;
}
if (_partition_key_restrictions->is_slice() && !_partition_key_restrictions->is_on_token() && !for_view) {
// A SELECT query may not request a slice (range) of partition keys
// without using token(). This is because there is no way to do this
// query efficiently: mumur3 turns a contiguous range of partition
// keys into tokens all over the token space.
// However, in a SELECT statement used to define a materialized view,
// such a slice is fine - it is used to check whether individual
// partitions, match, and does not present a performance problem.
throw exceptions::invalid_request_exception(
"Only EQ and IN relation are supported on the partition key (unless you use the token() function)");
}
}
bool statement_restrictions::has_partition_key_unrestricted_components() const {
@@ -414,7 +430,7 @@ void statement_restrictions::validate_secondary_index_selections(bool selects_on
}
}
static bytes_view_opt do_get_value(const schema& schema,
static std::optional<atomic_cell_value_view> do_get_value(const schema& schema,
const column_definition& cdef,
const partition_key& key,
const clustering_key_prefix& ckey,
@@ -422,21 +438,21 @@ static bytes_view_opt do_get_value(const schema& schema,
gc_clock::time_point now) {
switch(cdef.kind) {
case column_kind::partition_key:
return key.get_component(schema, cdef.component_index());
return atomic_cell_value_view(key.get_component(schema, cdef.component_index()));
case column_kind::clustering_key:
return ckey.get_component(schema, cdef.component_index());
return atomic_cell_value_view(ckey.get_component(schema, cdef.component_index()));
default:
auto cell = cells.find_cell(cdef.id);
if (!cell) {
return stdx::nullopt;
return std::nullopt;
}
assert(cdef.is_atomic());
auto c = cell->as_atomic_cell();
return c.is_dead(now) ? stdx::nullopt : bytes_view_opt(c.value());
auto c = cell->as_atomic_cell(cdef);
return c.is_dead(now) ? std::nullopt : std::optional<atomic_cell_value_view>(c.value());
}
}
bytes_view_opt single_column_restriction::get_value(const schema& schema,
std::optional<atomic_cell_value_view> single_column_restriction::get_value(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
@@ -456,7 +472,12 @@ bool single_column_restriction::EQ::is_satisfied_by(const schema& schema,
auto operand = value(options);
if (operand) {
auto cell_value = get_value(schema, key, ckey, cells, now);
return cell_value && _column_def.type->compare(*operand, *cell_value) == 0;
if (!cell_value) {
return false;
}
return cell_value->with_linearized([&] (bytes_view cell_value_bv) {
return _column_def.type->compare(*operand, cell_value_bv) == 0;
});
}
return false;
}
@@ -475,9 +496,11 @@ bool single_column_restriction::IN::is_satisfied_by(const schema& schema,
return false;
}
auto operands = values(options);
return cell_value->with_linearized([&] (bytes_view cell_value_bv) {
return std::any_of(operands.begin(), operands.end(), [&] (auto&& operand) {
return operand && _column_def.type->compare(*operand, *cell_value) == 0;
return operand && _column_def.type->compare(*operand, cell_value_bv) == 0;
});
});
}
static query::range<bytes_view> to_range(const term_slice& slice, const query_options& options) {
@@ -510,7 +533,9 @@ bool single_column_restriction::slice::is_satisfied_by(const schema& schema,
if (!cell_value) {
return false;
}
return to_range(_slice, options).contains(*cell_value, _column_def.type->as_tri_comparator());
return cell_value->with_linearized([&] (bytes_view cell_value_bv) {
return to_range(_slice, options).contains(cell_value_bv, _column_def.type->as_tri_comparator());
});
}
bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
@@ -536,7 +561,8 @@ bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
auto&& element_type = col_type->is_set() ? col_type->name_comparator() : col_type->value_comparator();
if (_column_def.type->is_multi_cell()) {
auto cell = cells.find_cell(_column_def.id);
auto&& elements = col_type->deserialize_mutation_form(cell->as_collection_mutation()).cells;
return cell->as_collection_mutation().data.with_linearized([&] (bytes_view collection_bv) {
auto&& elements = col_type->deserialize_mutation_form(collection_bv).cells;
auto end = std::remove_if(elements.begin(), elements.end(), [now] (auto&& element) {
return element.second.is_dead(now);
});
@@ -546,7 +572,9 @@ bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
continue;
}
auto found = std::find_if(elements.begin(), end, [&] (auto&& element) {
return element_type->compare(element.second.value(), *val) == 0;
return element.second.value().with_linearized([&] (bytes_view value_bv) {
return element_type->compare(value_bv, *val) == 0;
});
});
if (found == end) {
return false;
@@ -573,16 +601,26 @@ bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
auto found = std::find_if(elements.begin(), end, [&] (auto&& element) {
return map_key_type->compare(element.first, *map_key) == 0;
});
if (found == end || element_type->compare(found->second.value(), *map_value) != 0) {
if (found == end) {
return false;
}
auto cmp = found->second.value().with_linearized([&] (bytes_view value_bv) {
return element_type->compare(value_bv, *map_value);
});
if (cmp != 0) {
return false;
}
}
return true;
});
} else {
auto cell_value = get_value(schema, key, ckey, cells, now);
if (!cell_value) {
return false;
}
auto deserialized = _column_def.type->deserialize(*cell_value);
auto deserialized = cell_value->with_linearized([&] (bytes_view cell_value_bv) {
return _column_def.type->deserialize(cell_value_bv);
});
for (auto&& value : _values) {
auto val = value->bind_and_get(options);
if (!val) {
@@ -653,7 +691,9 @@ bool token_restriction::EQ::is_satisfied_by(const schema& schema,
for (auto&& operand : values(options)) {
if (operand) {
auto cell_value = do_get_value(schema, **cdef, key, ckey, cells, now);
satisfied = cell_value && (*cdef)->type->compare(*operand, *cell_value) == 0;
satisfied = cell_value && cell_value->with_linearized([&] (bytes_view cell_value_bv) {
return (*cdef)->type->compare(*operand, cell_value_bv) == 0;
});
}
if (!satisfied) {
break;
@@ -675,7 +715,9 @@ bool token_restriction::slice::is_satisfied_by(const schema& schema,
if (!cell_value) {
return false;
}
satisfied = range.contains(*cell_value, cdef->type->as_tri_comparator());
satisfied = cell_value->with_linearized([&] (bytes_view cell_value_bv) {
return range.contains(cell_value_bv, cdef->type->as_tri_comparator());
});
if (!satisfied) {
break;
}

View File

@@ -67,7 +67,7 @@ private:
class initial_key_restrictions;
template<typename T>
static ::shared_ptr<primary_key_restrictions<T>> get_initial_key_restrictions();
static ::shared_ptr<primary_key_restrictions<T>> get_initial_key_restrictions(bool allow_filtering);
/**
* Restrictions on partitioning columns
@@ -108,7 +108,7 @@ public:
* @param cfm the column family meta data
* @return a new empty <code>StatementRestrictions</code>.
*/
statement_restrictions(schema_ptr schema);
statement_restrictions(schema_ptr schema, bool allow_filtering);
statement_restrictions(database& db,
schema_ptr schema,
@@ -117,7 +117,8 @@ public:
::shared_ptr<variable_specifications> bound_names,
bool selects_only_static_columns,
bool select_a_collection,
bool for_view = false);
bool for_view = false,
bool allow_filtering = false);
private:
void add_restriction(::shared_ptr<restriction> restriction);
void add_single_column_restriction(::shared_ptr<single_column_restriction> restriction);

139
cql3/result_generator.hh Normal file
View File

@@ -0,0 +1,139 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "selection/selection.hh"
#include "stats.hh"
namespace cql3 {
class result_generator {
schema_ptr _schema;
foreign_ptr<lw_shared_ptr<query::result>> _result;
lw_shared_ptr<const query::read_command> _command;
shared_ptr<const selection::selection> _selection;
cql_stats* _stats;
private:
template<typename Visitor>
class query_result_visitor {
const schema& _schema;
std::vector<bytes> _partition_key;
std::vector<bytes> _clustering_key;
uint32_t _partition_row_count = 0;
uint32_t _total_row_count = 0;
Visitor& _visitor;
const selection::selection& _selection;
private:
void accept_cell_value(const column_definition& def, query::result_row_view::iterator_type& i) {
if (def.is_multi_cell()) {
_visitor.accept_value(i.next_collection_cell());
} else {
auto cell = i.next_atomic_cell();
_visitor.accept_value(cell ? std::optional<query::result_bytes_view>(cell->value()) : std::optional<query::result_bytes_view>());
}
}
public:
query_result_visitor(const schema& s, Visitor& visitor, const selection::selection& select)
: _schema(s), _visitor(visitor), _selection(select) { }
void accept_new_partition(const partition_key& key, uint32_t row_count) {
_partition_key = key.explode(_schema);
accept_new_partition(row_count);
}
void accept_new_partition(uint32_t row_count) {
_partition_row_count = row_count;
_total_row_count += row_count;
}
void accept_new_row(const clustering_key& key, query::result_row_view static_row,
query::result_row_view row) {
_clustering_key = key.explode(_schema);
accept_new_row(static_row, row);
}
void accept_new_row(query::result_row_view static_row, query::result_row_view row) {
auto static_row_iterator = static_row.iterator();
auto row_iterator = row.iterator();
_visitor.start_row();
for (auto&& def : _selection.get_columns()) {
switch (def->kind) {
case column_kind::partition_key:
_visitor.accept_value(query::result_bytes_view(bytes_view(_partition_key[def->component_index()])));
break;
case column_kind::clustering_key:
if (_clustering_key.size() > def->component_index()) {
_visitor.accept_value(query::result_bytes_view(bytes_view(_clustering_key[def->component_index()])));
} else {
_visitor.accept_value({});
}
break;
case column_kind::regular_column:
accept_cell_value(*def, row_iterator);
break;
case column_kind::static_column:
accept_cell_value(*def, static_row_iterator);
break;
}
}
_visitor.end_row();
}
void accept_partition_end(const query::result_row_view& static_row) {
if (_partition_row_count == 0) {
_total_row_count++;
_visitor.start_row();
auto static_row_iterator = static_row.iterator();
for (auto&& def : _selection.get_columns()) {
if (def->is_partition_key()) {
_visitor.accept_value(query::result_bytes_view(bytes_view(_partition_key[def->component_index()])));
} else if (def->is_static()) {
accept_cell_value(*def, static_row_iterator);
} else {
_visitor.accept_value({});
}
}
_visitor.end_row();
}
}
uint32_t rows_read() const { return _total_row_count; }
};
public:
result_generator() = default;
result_generator(schema_ptr s, foreign_ptr<lw_shared_ptr<query::result>> result, lw_shared_ptr<const query::read_command> cmd,
::shared_ptr<const selection::selection> select, cql_stats& stats)
: _schema(std::move(s))
, _result(std::move(result))
, _command(std::move(cmd))
, _selection(std::move(select))
, _stats(&stats)
{ }
template<typename Visitor>
void visit(Visitor&& visitor) const {
query_result_visitor<Visitor> v(*_schema, visitor, *_selection);
query::result_view::consume(*_result, _command->slice, v);
_stats->rows_read += v.rows_read();
}
};
}

View File

@@ -47,6 +47,12 @@
#include "service/pager/paging_state.hh"
#include "schema.hh"
#include "query-result-reader.hh"
#include "result_generator.hh"
#include <seastar/util/gcc6-concepts.hh>
namespace cql3 {
class metadata {
@@ -131,10 +137,22 @@ public:
const std::vector<uint16_t>& partition_key_bind_indices() const;
};
GCC6_CONCEPT(
template<typename Visitor>
concept bool ResultVisitor = requires(Visitor& visitor) {
visitor.start_row();
visitor.accept_value(std::optional<query::result_bytes_view>());
visitor.end_row();
};
)
class result_set {
public:
::shared_ptr<metadata> _metadata;
std::deque<std::vector<bytes_opt>> _rows;
friend class result;
public:
result_set(std::vector<::shared_ptr<column_specification>> metadata_);
@@ -163,6 +181,80 @@ public:
// Returns a range of rows. A row is a range of bytes_opt.
const std::deque<std::vector<bytes_opt>>& rows() const;
template<typename Visitor>
GCC6_CONCEPT(requires ResultVisitor<Visitor>)
void visit(Visitor&& visitor) const {
auto column_count = get_metadata().column_count();
for (auto& row : _rows) {
visitor.start_row();
for (auto i = 0u; i < column_count; i++) {
auto& cell = row[i];
visitor.accept_value(cell ? std::optional<query::result_bytes_view>(*cell) : std::optional<query::result_bytes_view>());
}
visitor.end_row();
}
}
class builder;
};
class result_set::builder {
result_set _result;
std::vector<bytes_opt> _current_row;
public:
explicit builder(shared_ptr<metadata> mtd)
: _result(std::move(mtd)) { }
void start_row() { }
void accept_value(std::optional<query::result_bytes_view> value) {
if (!value) {
_current_row.emplace_back();
return;
}
_current_row.emplace_back(value->linearize());
}
void end_row() {
_result.add_row(std::exchange(_current_row, { }));
}
result_set get_result_set() && { return std::move(_result); }
};
class result {
std::unique_ptr<cql3::result_set> _result_set;
result_generator _result_generator;
shared_ptr<cql3::metadata> _metadata;
public:
explicit result(std::unique_ptr<cql3::result_set> rs)
: _result_set(std::move(rs))
, _metadata(_result_set->_metadata)
{ }
explicit result(result_generator generator, shared_ptr<metadata> m)
: _result_generator(std::move(generator))
, _metadata(std::move(m))
{ }
const cql3::metadata& get_metadata() const { return *_metadata; }
cql3::result_set result_set() const {
if (_result_set) {
return *_result_set;
} else {
auto builder = result_set::builder(_metadata);
_result_generator.visit(builder);
return std::move(builder).get_result_set();
}
}
template<typename Visitor>
GCC6_CONCEPT(requires ResultVisitor<Visitor>)
void visit(Visitor&& visitor) const {
if (_result_set) {
_result_set->visit(std::forward<Visitor>(visitor));
} else {
_result_generator.visit(std::forward<Visitor>(visitor));
}
}
};
}

View File

@@ -112,6 +112,32 @@ selectable::with_function::raw::make_count_rows_function() {
std::vector<shared_ptr<cql3::selection::selectable::raw>>());
}
shared_ptr<selector::factory>
selectable::with_anonymous_function::new_selector_factory(database& db, schema_ptr s, std::vector<const column_definition*>& defs) {
auto&& factories = selector_factories::create_factories_and_collect_column_definitions(_args, db, s, defs);
return abstract_function_selector::new_factory(_function, std::move(factories));
}
sstring
selectable::with_anonymous_function::to_string() const {
return sprint("%s(%s)", _function->name().name, join(", ", _args));
}
shared_ptr<selectable>
selectable::with_anonymous_function::raw::prepare(schema_ptr s) {
std::vector<shared_ptr<selectable>> prepared_args;
prepared_args.reserve(_args.size());
for (auto&& arg : _args) {
prepared_args.push_back(arg->prepare(s));
}
return ::make_shared<with_anonymous_function>(_function, std::move(prepared_args));
}
bool
selectable::with_anonymous_function::raw::processes_selection() const {
return true;
}
shared_ptr<selector::factory>
selectable::with_field_selection::new_selector_factory(database& db, schema_ptr s, std::vector<const column_definition*>& defs) {
auto&& factory = _selected->new_selector_factory(db, s, defs);

View File

@@ -46,6 +46,7 @@
#include "core/shared_ptr.hh"
#include "cql3/selection/selector.hh"
#include "cql3/cql3_type.hh"
#include "cql3/functions/function.hh"
#include "cql3/functions/function_name.hh"
namespace cql3 {
@@ -82,6 +83,7 @@ public:
class writetime_or_ttl;
class with_function;
class with_anonymous_function;
class with_field_selection;
@@ -114,6 +116,28 @@ public:
};
};
class selectable::with_anonymous_function : public selectable {
shared_ptr<functions::function> _function;
std::vector<shared_ptr<selectable>> _args;
public:
with_anonymous_function(::shared_ptr<functions::function> f, std::vector<shared_ptr<selectable>> args)
: _function(f), _args(std::move(args)) {
}
virtual sstring to_string() const override;
virtual shared_ptr<selector::factory> new_selector_factory(database& db, schema_ptr s, std::vector<const column_definition*>& defs) override;
class raw : public selectable::raw {
shared_ptr<functions::function> _function;
std::vector<shared_ptr<selectable::raw>> _args;
public:
raw(shared_ptr<functions::function> f, std::vector<shared_ptr<selectable::raw>> args)
: _function(f), _args(std::move(args)) {
}
virtual shared_ptr<selectable> prepare(schema_ptr s) override;
virtual bool processes_selection() const override;
};
};
class selectable::with_cast : public selectable {
::shared_ptr<selectable> _arg;

View File

@@ -53,13 +53,15 @@ selection::selection(schema_ptr schema,
std::vector<const column_definition*> columns,
std::vector<::shared_ptr<column_specification>> metadata_,
bool collect_timestamps,
bool collect_TTLs)
bool collect_TTLs,
trivial is_trivial)
: _schema(std::move(schema))
, _columns(std::move(columns))
, _metadata(::make_shared<metadata>(std::move(metadata_)))
, _collect_timestamps(collect_timestamps)
, _collect_TTLs(collect_TTLs)
, _contains_static_columns(std::any_of(_columns.begin(), _columns.end(), std::mem_fn(&column_definition::is_static)))
, _is_trivial(is_trivial)
{ }
query::partition_slice::option_set selection::get_query_options() {
@@ -100,7 +102,7 @@ public:
*/
simple_selection(schema_ptr schema, std::vector<const column_definition*> columns,
std::vector<::shared_ptr<column_specification>> metadata, bool is_wildcard)
: selection(schema, std::move(columns), std::move(metadata), false, false)
: selection(schema, std::move(columns), std::move(metadata), false, false, trivial::yes)
, _is_wildcard(is_wildcard)
{ }
@@ -342,7 +344,7 @@ void result_set_builder::visitor::add_value(const column_definition& def,
_builder.add_empty();
return;
}
_builder.add_collection(def, *cell);
_builder.add_collection(def, cell->linearize());
} else {
auto cell = i.next_atomic_cell();
if (!cell) {
@@ -426,7 +428,7 @@ int32_t result_set_builder::ttl_of(size_t idx) {
}
bytes_opt result_set_builder::get_value(data_type t, query::result_atomic_cell_view c) {
return {to_bytes(c.value())};
return {c.value().linearize()};
}
}

View File

@@ -84,12 +84,15 @@ private:
const bool _collect_timestamps;
const bool _collect_TTLs;
const bool _contains_static_columns;
bool _is_trivial;
protected:
using trivial = bool_class<class trivial_tag>;
selection(schema_ptr schema,
std::vector<const column_definition*> columns,
std::vector<::shared_ptr<column_specification>> metadata_,
bool collect_timestamps,
bool collect_TTLs);
bool collect_TTLs, trivial is_trivial = trivial::no);
virtual ~selection() {}
public:
@@ -223,6 +226,12 @@ public:
}
}
/**
* Returns true if the selection is trivial, i.e. there are no function
* selectors (including casts or aggregates).
*/
bool is_trivial() const { return _is_trivial; }
friend class result_set_builder;
};

View File

@@ -105,9 +105,11 @@ public:
virtual void reset() = 0;
virtual assignment_testable::test_result test_assignment(database& db, const sstring& keyspace, ::shared_ptr<column_specification> receiver) override {
if (receiver->type == get_type()) {
auto t1 = receiver->type->underlying_type();
auto t2 = get_type()->underlying_type();
if (t1 == t2) {
return assignment_testable::test_result::EXACT_MATCH;
} else if (receiver->type->is_value_compatible_with(*get_type())) {
} else if (t1->is_value_compatible_with(*t2)) {
return assignment_testable::test_result::WEAKLY_ASSIGNABLE;
} else {
return assignment_testable::test_result::NOT_ASSIGNABLE;

View File

@@ -225,7 +225,12 @@ sets::marker::bind(const query_options& options) {
void
sets::setter::execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) {
const auto& value = _t->bind(params._options);
auto value = _t->bind(params._options);
execute(m, row_key, params, column, std::move(value));
}
void
sets::setter::execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params, const column_definition& column, ::shared_ptr<terminal> value) {
if (value == constants::UNSET_VALUE) {
return;
}
@@ -264,7 +269,7 @@ sets::adder::do_add(mutation& m, const clustering_key_prefix& row_key, const upd
}
for (auto&& e : set_value->_elements) {
mut.cells.emplace_back(e, params.make_cell({}));
mut.cells.emplace_back(e, params.make_cell(*set_type->value_comparator(), {}, atomic_cell::collection_member::yes));
}
auto smut = set_type->serialize_mutation_form(mut);
@@ -274,7 +279,7 @@ sets::adder::do_add(mutation& m, const clustering_key_prefix& row_key, const upd
auto v = set_type->serialize_partially_deserialized_form(
{set_value->_elements.begin(), set_value->_elements.end()},
cql_serialization_format::internal());
m.set_cell(row_key, column, params.make_cell(std::move(v)));
m.set_cell(row_key, column, params.make_cell(*column.type, std::move(v)));
} else {
m.set_cell(row_key, column, params.make_dead_cell());
}

View File

@@ -113,6 +113,7 @@ public:
: operation(column, std::move(t)) {
}
virtual void execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) override;
static void execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params, const column_definition& column, ::shared_ptr<terminal> value);
};
class adder : public operation {

View File

@@ -116,18 +116,6 @@ single_column_relation::to_receivers(schema_ptr schema, const column_definition&
throw exceptions::invalid_request_exception(sprint(
"IN predicates on non-primary-key columns (%s) is not yet supported", column_def.name_as_text()));
}
} else if (is_slice()) {
// Non EQ relation is not supported without token(), even if we have a 2ndary index (since even those
// are ordered by partitioner).
// Note: In theory we could allow it for 2ndary index queries with ALLOW FILTERING, but that would
// probably require some special casing
// Note bis: This is also why we don't bother handling the 'tuple' notation of #4851 for keys. If we
// lift the limitation for 2ndary
// index with filtering, we'll need to handle it though.
if (column_def.is_partition_key()) {
throw exceptions::invalid_request_exception(
"Only EQ and IN relation are supported on the partition key (unless you use the token() function)");
}
}
if (is_contains() && !receiver->type->is_collection()) {

View File

@@ -134,7 +134,7 @@ protected:
#endif
virtual sstring to_string() const override {
auto entity_as_string = _entity->to_string();
auto entity_as_string = _entity->to_cql_string();
if (_map_key) {
entity_as_string = sprint("%s[%s]", std::move(entity_as_string), _map_key->to_string());
}

View File

@@ -42,7 +42,7 @@
#include "alter_keyspace_statement.hh"
#include "prepared_statement.hh"
#include "service/migration_manager.hh"
#include "database.hh"
#include "db/system_keyspace.hh"
bool is_system_keyspace(const sstring& keyspace);
@@ -59,7 +59,7 @@ future<> cql3::statements::alter_keyspace_statement::check_access(const service:
return state.has_keyspace_access(_name, auth::permission::ALTER);
}
void cql3::statements::alter_keyspace_statement::validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) {
void cql3::statements::alter_keyspace_statement::validate(service::storage_proxy& proxy, const service::client_state& state) {
try {
service::get_local_storage_proxy().get_db().local().find_keyspace(_name); // throws on failure
auto tmp = _name;
@@ -90,7 +90,7 @@ void cql3::statements::alter_keyspace_statement::validate(distributed<service::s
}
}
future<shared_ptr<cql_transport::event::schema_change>> cql3::statements::alter_keyspace_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) {
future<shared_ptr<cql_transport::event::schema_change>> cql3::statements::alter_keyspace_statement::announce_migration(service::storage_proxy& proxy, bool is_local_only) {
auto old_ksm = service::get_local_storage_proxy().get_db().local().find_keyspace(_name).metadata();
return service::get_local_migration_manager().announce_keyspace_update(_attrs->as_ks_metadata_update(old_ksm), is_local_only).then([this] {
using namespace cql_transport;

View File

@@ -60,8 +60,8 @@ public:
const sstring& keyspace() const override;
future<> check_access(const service::client_state& state) override;
void validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) override;
future<shared_ptr<cql_transport::event::schema_change>> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
void validate(service::storage_proxy& proxy, const service::client_state& state) override;
future<shared_ptr<cql_transport::event::schema_change>> announce_migration(service::storage_proxy& proxy, bool is_local_only) override;
virtual std::unique_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};

View File

@@ -62,12 +62,12 @@ public:
, _options(std::move(options)) {
}
void validate(distributed<service::storage_proxy>&, const service::client_state&) override;
void validate(service::storage_proxy&, const service::client_state&) override;
virtual future<> check_access(const service::client_state&) override;
virtual future<::shared_ptr<cql_transport::messages::result_message>>
execute(distributed<service::storage_proxy>&, service::query_state&, const query_options&) override;
execute(service::storage_proxy&, service::query_state&, const query_options&) override;
};
}

View File

@@ -75,7 +75,7 @@ future<> alter_table_statement::check_access(const service::client_state& state)
return state.has_column_family_access(keyspace(), column_family(), auth::permission::ALTER);
}
void alter_table_statement::validate(distributed<service::storage_proxy>& proxy, const service::client_state& state)
void alter_table_statement::validate(service::storage_proxy& proxy, const service::client_state& state)
{
// validated in announce_migration()
}
@@ -165,9 +165,9 @@ static void validate_column_rename(database& db, const schema& schema, const col
}
}
future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only)
future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::announce_migration(service::storage_proxy& proxy, bool is_local_only)
{
auto& db = proxy.local().get_db().local();
auto& db = proxy.get_db().local();
auto schema = validation::validate_column_family(db, keyspace(), column_family());
if (schema->is_view()) {
throw exceptions::invalid_request_exception("Cannot use ALTER TABLE on Materialized View");
@@ -247,12 +247,15 @@ future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::a
cfm.with_column(column_name->name(), type, _is_static ? column_kind::static_column : column_kind::regular_column);
// Adding a column to a table which has an include all view requires the column to be added to the view
// as well
// as well. If the view has a regular base column in its PK, then the column ID needs to be updated in
// view_info; for that, rebuild the schema.
if (!_is_static) {
for (auto&& view : cf.views()) {
if (view->view_info()->include_all_columns()) {
if (view->view_info()->include_all_columns() || view->view_info()->base_non_pk_column_in_view_pk()) {
schema_builder builder(view);
builder.with_column(column_name->name(), type);
if (view->view_info()->include_all_columns()) {
builder.with_column(column_name->name(), type);
}
view_updates.push_back(view_ptr(builder.build()));
}
}
@@ -305,14 +308,10 @@ future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::a
}
}
// If a column is dropped which is included in a view, we don't allow the drop to take place.
auto view_names = ::join(", ", cf.views()
| boost::adaptors::filtered([&] (auto&& v) { return bool(v->get_column_definition(column_name->name())); })
| boost::adaptors::transformed([] (auto&& v) { return v->cf_name(); }));
if (!view_names.empty()) {
if (!cf.views().empty()) {
throw exceptions::invalid_request_exception(sprint(
"Cannot drop column %s, depended on by materialized views (%s.{%s})",
column_name, keyspace(), view_names));
"Cannot drop column %s on base table %s.%s with materialized views",
column_name, keyspace(), column_family()));
}
break;
}

View File

@@ -77,8 +77,8 @@ public:
bool is_static);
virtual future<> check_access(const service::client_state& state) override;
virtual void validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) override;
virtual future<shared_ptr<cql_transport::event::schema_change>> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual void validate(service::storage_proxy& proxy, const service::client_state& state) override;
virtual future<shared_ptr<cql_transport::event::schema_change>> announce_migration(service::storage_proxy& proxy, bool is_local_only) override;
virtual std::unique_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};

View File

@@ -66,7 +66,7 @@ future<> alter_type_statement::check_access(const service::client_state& state)
return state.has_keyspace_access(keyspace(), auth::permission::ALTER);
}
void alter_type_statement::validate(distributed<service::storage_proxy>& proxy, const service::client_state& state)
void alter_type_statement::validate(service::storage_proxy& proxy, const service::client_state& state)
{
// Validation is left to announceMigration as it's easier to do it while constructing the updated type.
// It doesn't really change anything anyway.
@@ -135,10 +135,10 @@ void alter_type_statement::do_announce_migration(database& db, ::keyspace& ks, b
}
}
future<shared_ptr<cql_transport::event::schema_change>> alter_type_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only)
future<shared_ptr<cql_transport::event::schema_change>> alter_type_statement::announce_migration(service::storage_proxy& proxy, bool is_local_only)
{
return seastar::async([this, &proxy, is_local_only] {
auto&& db = proxy.local().get_db().local();
auto&& db = proxy.get_db().local();
try {
auto&& ks = db.find_keyspace(keyspace());
do_announce_migration(db, ks, is_local_only);

View File

@@ -59,11 +59,11 @@ public:
virtual future<> check_access(const service::client_state& state) override;
virtual void validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) override;
virtual void validate(service::storage_proxy& proxy, const service::client_state& state) override;
virtual const sstring& keyspace() const override;
virtual future<shared_ptr<cql_transport::event::schema_change>> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual future<shared_ptr<cql_transport::event::schema_change>> announce_migration(service::storage_proxy& proxy, bool is_local_only) override;
class add_or_alter;
class renames;

View File

@@ -69,14 +69,14 @@ future<> alter_view_statement::check_access(const service::client_state& state)
return make_ready_future<>();
}
void alter_view_statement::validate(distributed<service::storage_proxy>&, const service::client_state& state)
void alter_view_statement::validate(service::storage_proxy&, const service::client_state& state)
{
// validated in announce_migration()
}
future<shared_ptr<cql_transport::event::schema_change>> alter_view_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only)
future<shared_ptr<cql_transport::event::schema_change>> alter_view_statement::announce_migration(service::storage_proxy& proxy, bool is_local_only)
{
auto&& db = proxy.local().get_db().local();
auto&& db = proxy.get_db().local();
schema_ptr schema = validation::validate_column_family(db, keyspace(), column_family());
if (!schema->is_view()) {
throw exceptions::invalid_request_exception("Cannot use ALTER MATERIALIZED VIEW on Table");
@@ -86,10 +86,10 @@ future<shared_ptr<cql_transport::event::schema_change>> alter_view_statement::an
throw exceptions::invalid_request_exception("ALTER MATERIALIZED VIEW WITH invoked, but no parameters found");
}
_properties->validate(proxy.local().get_db().local().get_config().extensions());
_properties->validate(proxy.get_db().local().get_config().extensions());
auto builder = schema_builder(schema);
_properties->apply_to_builder(builder, proxy.local().get_db().local().get_config().extensions());
_properties->apply_to_builder(builder, proxy.get_db().local().get_config().extensions());
if (builder.get_gc_grace_seconds() == 0) {
throw exceptions::invalid_request_exception(

View File

@@ -43,7 +43,7 @@
#include <seastar/core/shared_ptr.hh>
#include "database.hh"
#include "database_fwd.hh"
#include "cql3/statements/cf_prop_defs.hh"
#include "cql3/statements/schema_altering_statement.hh"
#include "cql3/cf_name.hh"
@@ -61,9 +61,9 @@ public:
virtual future<> check_access(const service::client_state& state) override;
virtual void validate(distributed<service::storage_proxy>&, const service::client_state& state) override;
virtual void validate(service::storage_proxy&, const service::client_state& state) override;
virtual future<shared_ptr<cql_transport::event::schema_change>> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual future<shared_ptr<cql_transport::event::schema_change>> announce_migration(service::storage_proxy& proxy, bool is_local_only) override;
virtual std::unique_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};

View File

@@ -67,17 +67,10 @@ bool cql3::statements::authentication_statement::depends_on_column_family(
}
void cql3::statements::authentication_statement::validate(
distributed<service::storage_proxy>&,
service::storage_proxy&,
const service::client_state& state) {
}
future<> cql3::statements::authentication_statement::check_access(const service::client_state& state) {
return make_ready_future<>();
}
future<::shared_ptr<cql_transport::messages::result_message>> cql3::statements::authentication_statement::execute_internal(
distributed<service::storage_proxy>& proxy,
service::query_state& state, const query_options& options) {
// Internal queries are exclusively on the system keyspace and makes no sense here
throw std::runtime_error("unsupported operation");
}

View File

@@ -52,6 +52,8 @@ namespace statements {
class authentication_statement : public raw::parsed_statement, public cql_statement_no_metadata, public ::enable_shared_from_this<authentication_statement> {
public:
authentication_statement() : cql_statement_no_metadata(&timeout_config::other_timeout) {}
uint32_t get_bound_terms() override;
std::unique_ptr<prepared> prepare(database& db, cql_stats& stats) override;
@@ -64,10 +66,7 @@ public:
future<> check_access(const service::client_state& state) override;
void validate(distributed<service::storage_proxy>&, const service::client_state& state) override;
future<::shared_ptr<cql_transport::messages::result_message>>
execute_internal(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options) override;
void validate(service::storage_proxy&, const service::client_state& state) override;
};
}

View File

@@ -67,7 +67,7 @@ bool cql3::statements::authorization_statement::depends_on_column_family(
}
void cql3::statements::authorization_statement::validate(
distributed<service::storage_proxy>&,
service::storage_proxy&,
const service::client_state& state) {
}
@@ -75,13 +75,6 @@ future<> cql3::statements::authorization_statement::check_access(const service::
return make_ready_future<>();
}
future<::shared_ptr<cql_transport::messages::result_message>> cql3::statements::authorization_statement::execute_internal(
distributed<service::storage_proxy>& proxy,
service::query_state& state, const query_options& options) {
// Internal queries are exclusively on the system keyspace and makes no sense here
throw std::runtime_error("unsupported operation");
}
void cql3::statements::authorization_statement::maybe_correct_resource(auth::resource& resource, const service::client_state& state) {
if (resource.kind() == auth::resource_kind::data) {
const auto data_view = auth::data_resource_view(resource);

View File

@@ -56,6 +56,8 @@ namespace statements {
class authorization_statement : public raw::parsed_statement, public cql_statement_no_metadata, public ::enable_shared_from_this<authorization_statement> {
public:
authorization_statement() : cql_statement_no_metadata(&timeout_config::other_timeout) {}
uint32_t get_bound_terms() override;
std::unique_ptr<prepared> prepare(database& db, cql_stats& stats) override;
@@ -68,10 +70,7 @@ public:
future<> check_access(const service::client_state& state) override;
void validate(distributed<service::storage_proxy>&, const service::client_state& state) override;
future<::shared_ptr<cql_transport::messages::result_message>>
execute_internal(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options) override;
void validate(service::storage_proxy&, const service::client_state& state) override;
protected:
static void maybe_correct_resource(auth::resource&, const service::client_state&);

Some files were not shown because too many files have changed in this diff Show More